summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
8 hoursMerge tag 'block-7.0-20260216' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull more block updates from Jens Axboe: - Fix partial IOVA mapping cleanup in error handling - Minor prep series ignoring discard return value, as the inline value is always known - Ensure BLK_FEAT_STABLE_WRITES is set for drbd - Fix leak of folio in bio_iov_iter_bounce_read() - Allow IOC_PR_READ_* for read-only open - Another debugfs deadlock fix - A few doc updates * tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: blk-mq: use NOIO context to prevent deadlock during debugfs creation blk-stat: convert struct blk_stat_callback to kernel-doc block: fix enum descriptions kernel-doc block: update docs for bio and bvec_iter block: change return type to void nvmet: ignore discard return value md: ignore discard return value block: fix partial IOVA mapping cleanup in blk_rq_dma_map_iova block: fix folio leak in bio_iov_iter_bounce_read() block: allow IOC_PR_READ_* ioctls with BLK_OPEN_READ drbd: always set BLK_FEAT_STABLE_WRITES
5 daysMerge tag 'mm-nonmm-stable-2026-02-12-10-48' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves disk space by teaching ocfs2 to reclaim suballocator block group space (Heming Zhao) - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the ARRAY_END() macro and uses it in various places (Alejandro Colomar) - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the page size (Pnina Feder) - "kallsyms: Prevent invalid access when showing module buildid" cleans up kallsyms code related to module buildid and fixes an invalid access crash when printing backtraces (Petr Mladek) - "Address page fault in ima_restore_measurement_list()" fixes a kexec-related crash that can occur when booting the second-stage kernel on x86 (Harshit Mogalapalli) - "kho: ABI headers and Documentation updates" updates the kexec handover ABI documentation (Mike Rapoport) - "Align atomic storage" adds the __aligned attribute to atomic_t and atomic64_t definitions to get natural alignment of both types on csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain) - "kho: clean up page initialization logic" simplifies the page initialization logic in kho_restore_page() (Pratyush Yadav) - "Unload linux/kernel.h" moves several things out of kernel.h and into more appropriate places (Yury Norov) - "don't abuse task_struct.group_leader" removes the usage of ->group_leader when it is "obviously unnecessary" (Oleg Nesterov) - "list private v2 & luo flb" adds some infrastructure improvements to the live update orchestrator (Pasha Tatashin) * tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits) watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency procfs: fix missing RCU protection when reading real_parent in do_task_stat() watchdog/softlockup: fix sample ring index wrap in need_counting_irqs() kcsan, compiler_types: avoid duplicate type issues in BPF Type Format kho: fix doc for kho_restore_pages() tests/liveupdate: add in-kernel liveupdate test liveupdate: luo_flb: introduce File-Lifecycle-Bound global state liveupdate: luo_file: Use private list list: add kunit test for private list primitives list: add primitives for private list manipulations delayacct: fix uapi timespec64 definition panic: add panic_force_cpu= parameter to redirect panic to a specific CPU netclassid: use thread_group_leader(p) in update_classid_task() RDMA/umem: don't abuse current->group_leader drm/pan*: don't abuse current->group_leader drm/amd: kill the outdated "Only the pthreads threading model is supported" checks drm/amdgpu: don't abuse current->group_leader android/binder: use same_thread_group(proc->tsk, current) in binder_mmap() android/binder: don't abuse current->group_leader kho: skip memoryless NUMA nodes when reserving scratch areas ...
6 daysmd: ignore discard return valueChaitanya Kulkarni
__blkdev_issue_discard() always returns 0, making all error checking at call sites dead code. Simplify md to only check !discard_bio by ignoring the __blkdev_issue_discard() value. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 daysMerge tag 'for-7.0/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mikulas Patocka: - dm-verity: - various optimizations and fixes related to forward error correction - add a .dm-verity keyring - dm-integrity: fix bugs with growing a device in bitmap mode - dm-mpath: - fix leaking fake timeout requests - fix UAF bug caused by stale rq->bio - fix minor bugs in device creation - dm-core: - fix a bug related to blkg association - avoid unnecessary blk-crypto work on invalid keys - dm-bufio: - dm-bufio cleanup and optimization (reducing hash table lookups) - various other minor fixes and cleanups * tag 'for-7.0/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (35 commits) dm mpath: make pg_init_delay_msecs settable Revert "dm: fix a race condition in retrieve_deps" dm mpath: Add missing dm_put_device when failing to get scsi dh name dm vdo encodings: clean up header and version functions dm: use bio_clone_blkg_association dm: fix excessive blk-crypto operations for invalid keys dm-verity: fix section mismatch error dm-unstripe: fix mapping bug when there are multiple targets in a table dm-integrity: fix recalculation in bitmap mode dm-bufio: avoid redundant buffer_tree lookups dm-bufio: merge cache_put() into cache_put_and_wake() selftests: add dm-verity keyring selftests dm-verity: add dm-verity keyring dm: clear cloned request bio pointer when last clone bio completes dm-verity: fix up various workqueue-related comments dm-verity: switch to bio_advance_iter_single() dm-verity: consolidate the BH and normal work structs dm: add WQ_PERCPU to alloc_workqueue users dm-integrity: fix a typo in the code for write/discard race dm: use READ_ONCE in dm_blk_report_zones ...
8 daysMerge tag 'for-7.0/block-20260206' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - Support for batch request processing for ublk, improving the efficiency of the kernel/ublk server communication. This can yield nice 7-12% performance improvements - Support for integrity data for ublk - Various other ublk improvements and additions, including a ton of selftests additions and updated - Move the handling of blk-crypto software fallback from below the block layer to above it. This reduces the complexity of dealing with bio splitting - Series fixing a number of potential deadlocks in blk-mq related to the queue usage counter and writeback throttling and rq-qos debugfs handling - Add an async_depth queue attribute, to resolve a performance regression that's been around for a qhilw related to the scheduler depth handling - Only use task_work for IOPOLL completions on NVMe, if it is necessary to do so. An earlier fix for an issue resulted in all these completions being punted to task_work, to guarantee that completions were only run for a given io_uring ring when it was local to that ring. With the new changes, we can detect if it's necessary to use task_work or not, and avoid it if possible. - rnbd fixes: - Fix refcount underflow in device unmap path - Handle PREFLUSH and NOUNMAP flags properly in protocol - Fix server-side bi_size for special IOs - Zero response buffer before use - Fix trace format for flags - Add .release to rnbd_dev_ktype - MD pull requests via Yu Kuai - Fix raid5_run() to return error when log_init() fails - Fix IO hang with degraded array with llbitmap - Fix percpu_ref not resurrected on suspend timeout in llbitmap - Fix GPF in write_page caused by resize race - Fix NULL pointer dereference in process_metadata_update - Fix hang when stopping arrays with metadata through dm-raid - Fix any_working flag handling in raid10_sync_request - Refactor sync/recovery code path, improve error handling for badblocks, and remove unused recovery_disabled field - Consolidate mddev boolean fields into mddev_flags - Use mempool to allocate stripe_request_ctx and make sure max_sectors is not less than io_opt in raid5 - Fix return value of mddev_trylock - Fix memory leak in raid1_run() - Add Li Nan as mdraid reviewer - Move phys_vec definitions to the kernel types, mostly in preparation for some VFIO and RDMA changes - Improve the speed for secure erase for some devices - Various little rust updates - Various other minor fixes, improvements, and cleanups * tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits) blk-mq: ABI/sysfs-block: fix docs build warnings selftests: ublk: organize test directories by test ID block: decouple secure erase size limit from discard size limit block: remove redundant kill_bdev() call in set_blocksize() blk-mq: add documentation for new queue attribute async_dpeth block, bfq: convert to use request_queue->async_depth mq-deadline: covert to use request_queue->async_depth kyber: covert to use request_queue->async_depth blk-mq: add a new queue sysfs attribute async_depth blk-mq: factor out a helper blk_mq_limit_depth() blk-mq-sched: unify elevators checking for async requests block: convert nr_requests to unsigned int block: don't use strcpy to copy blockdev name blk-mq-debugfs: warn about possible deadlock blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs() blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos() blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static blk-rq-qos: fix possible debugfs_mutex deadlock blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter ...
2026-02-02md: fix return value of mddev_trylockXiao Ni
A return value of 0 is treaded as successful lock acquisition. In fact, a return value of 1 means getting the lock successfully. Link: https://lore.kernel.org/linux-raid/20260127073951.17248-1-xni@redhat.com Fixes: 9e59d609763f ("md: call del_gendisk in control path") Reported-by: Bart Van Assche <bvanassche@acm.org> Closes: https://lore.kernel.org/linux-raid/20250611073108.25463-1-xni@redhat.com/T/#mfa369ef5faa4aa58e13e6d9fdb88aecd862b8f2f Signed-off-by: Xiao Ni <xni@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Li Nan <linan122@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-02-02md/raid1: fix memory leak in raid1_run()Zilin Guan
raid1_run() calls setup_conf() which registers a thread via md_register_thread(). If raid1_set_limits() fails, the previously registered thread is not unregistered, resulting in a memory leak of the md_thread structure and the thread resource itself. Add md_unregister_thread() to the error path to properly cleanup the thread, which aligns with the error handling logic of other paths in this function. Compile tested only. Issue found using a prototype static analysis tool and code review. Link: https://lore.kernel.org/linux-raid/20260126071533.606263-1-zilin@seu.edu.cn Fixes: 97894f7d3c29 ("md/raid1: use the atomic queue limit update APIs") Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Reviewed-by: Li Nan <linan122@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-30Merge tag 'block-6.19-20260130' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - Fix for an accounting leak in bcache that's been there forever, and a related dead code removal - Revert of a fix for rnbd that went into this series, but depends on other changes that are staged for 7.0 - NVMe pull request via Keith: - TCP target completion race condition fix (Ming) - DMA descriptor cleanup fix (Roger) * tag 'block-6.19-20260130' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: bcache: fix I/O accounting leak in detached_dev_do_request bcache: remove dead code in detached_dev_do_request nvme-pci: DMA unmap the correct regions in nvme_free_sgls Revert "rnbd-clt: fix refcount underflow in device unmap path" nvmet: fix race in nvmet_bio_done() leading to NULL pointer dereference
2026-01-28bcache: fix I/O accounting leak in detached_dev_do_requestShida Zhang
When a bcache device is detached, discard requests are completed immediately. However, the I/O accounting started in cached_dev_make_request() is not ended, leading to 100% disk utilization reports in iostat. Add the missing bio_end_io_acct() call. Fixes: cafe56359144 ("bcache: A block layer cache") Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Acked-by: Coly Li <colyli@fnnas.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-28bcache: remove dead code in detached_dev_do_requestShida Zhang
bio_alloc_clone() with GFP_NOIO and a mempool will not return NULL. Remove the unnecessary NULL check. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-28dm mpath: make pg_init_delay_msecs settableBenjamin Marzinski
"pg_init_delay_msecs X" can be passed as a feature in the multipath table and is used to set m->pg_init_delay_msecs in parse_features(). However, alloc_multipath_stage2(), which is called after parse_features(), resets m->pg_init_delay_msecs to its default value. Instead, set m->pg_init_delay_msecs in alloc_multipath(), which is called before parse_features(), to avoid overwriting a value passed in by the table. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Cc: stable@vger.kernel.org
2026-01-28Revert "dm: fix a race condition in retrieve_deps"Benjamin Marzinski
This reverts commit f6007dce0cd35d634d9be91ef3515a6385dcee16. Commit f6007dce0cd3 ("dm: fix a race condition in retrieve_deps") was added to fix a race between retrieving the list of dm table devices and multipath_message() modifying the list of table devices. But Commit a48f6b82c5c4 ("dm mpath: don't call dm_get_device in multipath_message") removed the call to dm_get_device() from multipath_message(). After that commit, the only calls to dm_get_device() and dm_put_device() are in target constructors and destructors, so the race with retrieve_deps() is no longer possible. Suggested-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-01-28dm mpath: Add missing dm_put_device when failing to get scsi dh nameBenjamin Marzinski
When commit fd81bc5cca8f ("scsi: device_handler: Return error pointer in scsi_dh_attached_handler_name()") added code to fail parsing the path if scsi_dh_attached_handler_name() failed with -ENOMEM, it didn't clean up the reference to the path device that had just been taken. Fix this, and steamline the error paths of parse_path() a little. Fixes: fd81bc5cca8f ("scsi: device_handler: Return error pointer in scsi_dh_attached_handler_name()") Cc: stable@vger.kernel.org Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-01-27dm vdo encodings: clean up header and version functionsMatthew Sakai
Make several header functions static. Also remove vdo_is_upgradable_version, which is unused. Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-01-26dm: use bio_clone_blkg_associationMikulas Patocka
The origin bio carries blk-cgroup information which could be set from foreground(task_css(css) - wbc->wb->blkcg_css), so the blkcg won't control buffer io since commit ca522482e3eaf ("dm: pass NULL bdev to bio_alloc_clone"). The synchronous io is still under control by blkcg, because 'bio->bi_blkg' is set by io submitting task which has been added into 'cgroup.procs'. Fix it by using bio_clone_blkg_association when submitting a cloned bio. Link: https://bugzilla.kernel.org/show_bug.cgi?id=220985 Fixes: ca522482e3eaf ("dm: pass NULL bdev to bio_alloc_clone") Reported-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Zhihao Cheng <chengzhihao1@huawei.com>
2026-01-26md raid: fix hang when stopping arrays with metadata through dm-raidHeinz Mauelshagen
When using device-mapper's dm-raid target, stopping a RAID array can cause the system to hang under specific conditions. This occurs when: - A dm-raid managed device tree is suspended from top to bottom (the top-level RAID device is suspended first, followed by its underlying metadata and data devices) - The top-level RAID device is then removed Removing the top-level device triggers a hang in the following sequence: the dm-raid destructor calls md_stop(), which tries to flush the write-intent bitmap by writing to the metadata sub-devices. However, these devices are already suspended, making them unable to complete the write-intent operations and causing an indefinite block. Fix: - Prevent bitmap flushing when md_stop() is called from dm-raid destructor context and avoid a quiescing/unquescing cycle which could also cause I/O - Still allow write-intent bitmap flushing when called from dm-raid suspend context This ensures that RAID array teardown can complete successfully even when the underlying devices are in a suspended state. This second patch uses md_is_rdwr() to distinguish between suspend and destructor paths as elaborated on above. Link: https://lore.kernel.org/linux-raid/CAM23VxqYrwkhKEBeQrZeZwQudbiNey2_8B_SEOLqug=pXxaFrA@mail.gmail.com Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md-cluster: fix NULL pointer dereference in process_metadata_updateJiasheng Jiang
The function process_metadata_update() blindly dereferences the 'thread' pointer (acquired via rcu_dereference_protected) within the wait_event() macro. While the code comment states "daemon thread must exist", there is a valid race condition window during the MD array startup sequence (md_run): 1. bitmap_load() is called, which invokes md_cluster_ops->join(). 2. join() starts the "cluster_recv" thread (recv_daemon). 3. At this point, recv_daemon is active and processing messages. 4. However, mddev->thread (the main MD thread) is not initialized until later in md_run(). If a METADATA_UPDATED message is received from a remote node during this specific window, process_metadata_update() will be called while mddev->thread is still NULL, leading to a kernel panic. To fix this, we must validate the 'thread' pointer. If it is NULL, we release the held lock (no_new_dev_lockres) and return early, safely ignoring the update request as the array is not yet fully ready to process it. Link: https://lore.kernel.org/linux-raid/20260117145903.28921-1-jiashengjiangcool@gmail.com Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md/bitmap: fix GPF in write_page caused by resize raceJack Wang
A General Protection Fault occurs in write_page() during array resize: RIP: 0010:write_page+0x22b/0x3c0 [md_mod] This is a use-after-free race between bitmap_daemon_work() and __bitmap_resize(). The daemon iterates over `bitmap->storage.filemap` without locking, while the resize path frees that storage via md_bitmap_file_unmap(). `quiesce()` does not stop the md thread, allowing concurrent access to freed pages. Fix by holding `mddev->bitmap_info.mutex` during the bitmap update. Link: https://lore.kernel.org/linux-raid/20260120102456.25169-1-jinpu.wang@ionos.com Closes: https://lore.kernel.org/linux-raid/CAMGffE=Mbfp=7xD_hYxXk1PAaCZNSEAVeQGKGy7YF9f2S4=NEA@mail.gmail.com/T/#u Cc: stable@vger.kernel.org Fixes: d60b479d177a ("md/bitmap: add bitmap_resize function to allow bitmap resizing.") Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md/md-llbitmap: fix percpu_ref not resurrected on suspend timeoutYu Kuai
When llbitmap_suspend_timeout() times out waiting for percpu_ref to become zero, it returns -ETIMEDOUT without resurrecting the percpu_ref. The caller (md_llbitmap_daemon_fn) then continues to the next page without calling llbitmap_resume(), leaving the percpu_ref in a killed state permanently. Fix this by resurrecting the percpu_ref before returning the error, ensuring the page control structure remains usable for subsequent operations. Link: https://lore.kernel.org/linux-raid/20260123182623.3718551-3-yukuai@fnnas.com Fixes: 5ab829f1971d ("md/md-llbitmap: introduce new lockless bitmap") Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com>
2026-01-26md/raid5: fix IO hang with degraded array with llbitmapYu Kuai
When llbitmap bit state is still unwritten, any new write should force rcw, as bitmap_ops->blocks_synced() is checked in handle_stripe_dirtying(). However, later the same check is missing in need_this_block(), causing stripe to deadloop during handling because handle_stripe() will decide to go to handle_stripe_fill(), meanwhile need_this_block() always return 0 and nothing is handled. Link: https://lore.kernel.org/linux-raid/20260123182623.3718551-2-yukuai@fnnas.com Fixes: 5ab829f1971d ("md/md-llbitmap: introduce new lockless bitmap") Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com>
2026-01-26md: remove recovery_disabledLi Nan
'recovery_disabled' logic is complex and confusing, originally intended to preserve raid in extreme scenarios. It was used in following cases: - When sync fails and setting badblocks also fails, kick out non-In_sync rdev and block spare rdev from joining to preserve raid [1] - When last backup is unavailable, prevent repeated add-remove of spares triggering recovery [2] The original issues are now resolved: - Error handlers in all raid types prevent last rdev from being kicked out - Disks with failed recovery are marked Faulty and can't re-join Therefore, remove 'recovery_disabled' as it's no longer needed. [1] 5389042ffa36 ("md: change managed of recovery_disabled.") [2] 4044ba58dd15 ("md: don't retry recovery of raid1 that fails due to error on source drive.") Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-13-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md/raid10: cleanup skip handling in raid10_sync_requestLi Nan
Skip a sector in raid10_sync_request() when it needs no syncing or no readable device exists. Current skip handling is unnecessary: - Use 'skip' label to reissue the next sector instead of return directly - Complete sync and return 'max_sectors' when multiple sectors are skipped due to badblocks The first is error-prone. For example, commit bc49694a9e8f ("md: pass in max_sectors for pers->sync_request()") removed redundant max_sector assignments. Since skip modifies max_sectors, `goto skip` leaves max_sectors equal to sector_nr after the jump, which is incorrect. The second causes sync to complete erroneously when no actual sync occurs. For recovery, recording badblocks and continue syncing subsequent sectors is more suitable. For resync, just skip bad sectors and syncing subsequent sectors. Clean up complex and unnecessary skip code. Return immediately when a sector should be skipped. Reduce code paths and lower regression risk. Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-12-linan666@huaweicloud.com Fixes: bc49694a9e8f ("md: pass in max_sectors for pers->sync_request()") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md/raid10: fix any_working flag handling in raid10_sync_requestLi Nan
In raid10_sync_request(), 'any_working' indicates if any IO will be submitted. When there's only one In_sync disk with badblocks, 'any_working' might be set to 1 but no IO is submitted. Fix it by setting 'any_working' after badblock checks. Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-11-linan666@huaweicloud.com Fixes: e875ecea266a ("md/raid10 record bad blocks as needed during recovery.") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md: move finish_reshape to md_finish_sync()Li Nan
finish_reshape implementations of raid10 and raid5 only update mddev and rdev configurations. Move these operations to md_finish_sync() as it is more appropriate. No functional changes. Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-10-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md: factor out sync completion update into helperLi Nan
Repeatedly reading 'mddev->recovery' flags in md_do_sync() may introduce potential risk if this flag is modified during sync, leading to incorrect offset updates. Therefore, replace direct 'mddev->recovery' checks with 'action'. Move sync completion update logic into helper md_finish_sync(), which improves readability and maintainability. The reshape completion update remains safe as it only updated after successful reshape when MD_RECOVERY_INTR is not set and 'curr_resync' equals 'max_sectors'. Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-9-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md: remove MD_RECOVERY_ERROR handling and simplify resync_offset updateLi Nan
Following previous patch "md: update curr_resync_completed even when MD_RECOVERY_INTR is set", 'curr_resync_completed' always equals 'curr_resync' for resync, so MD_RECOVERY_ERROR can be removed. Also, simplify resync_offset update logic. Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-8-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md: update curr_resync_completed even when MD_RECOVERY_INTR is setLi Nan
An error sync IO may be done and sub 'recovery_active' while its error handling work is pending. This work sets 'recovery_disabled' and MD_RECOVERY_INTR, then later removes the bad disk without Faulty flag. If 'curr_resync_completed' is updated before the disk is removed, it could lead to reading from sync-failed regions. With the previous patch, error IO will set badblocks or mark rdev as Faulty, sync-failed regions are no longer readable. After waiting for 'recovery_active' to reach 0 (in the previous line), all sync IO has *completed*, regardless of whether MD_RECOVERY_INTR is set. Thus, the MD_RECOVERY_INTR check can be removed. Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-7-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md: mark rdev Faulty when badblocks setting failsLi Nan
Currently when sync read fails and badblocks set fails (exceeding 512 limit), rdev isn't immediately marked Faulty. Instead 'recovery_disabled' is set and non-In_sync rdevs are removed later. This preserves array availability if bad regions aren't read, but bad sectors might be read by users before rdev removal. This occurs due to incorrect resync/recovery_offset updates that include these bad sectors. When badblocks exceed 512, keeping the disk provides little benefit while adding complexity. Prompt disk replacement is more important. Therefore when badblocks set fails, directly call md_error to mark rdev Faulty immediately, preventing potential data access issues. After this change, cleanup of offset update logic and 'recovery_disabled' handling will follow. Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-6-linan666@huaweicloud.com Fixes: 5e5702898e93 ("md/raid10: Handle read errors during recovery better.") Fixes: 3a9f28a5117e ("md/raid1: improve handling of read failure during recovery.") Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md: break remaining operations on badblocks set failure in narrow_write_errorLi Nan
Mark device faulty and exit at once when setting badblocks fails in narrow_write_error(). No need to continue processing remaining sections. With this change, narrow_write_error() no longer needs to return a value, so adjust its return type to void. Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-5-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md/raid1,raid10: support narrow_write_error when badblocks is disabledLi Nan
When badblocks.shift < 0 (badblocks disabled), narrow_write_error() return false, preventing write error handling. Since narrow_write_error() only splits IO into smaller sizes and re-submits, it can work with badblocks disabled. Adjust to use the logical block size for block_sectors when badblocks is disabled, allowing narrow_write_error() to function in this case. Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-4-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md: factor error handling out of md_done_sync into helperLi Nan
The 'ok' parameter in md_done_sync() is redundant for most callers that always pass 'true'. Factor error handling logic into a separate helper function md_sync_error() to eliminate unnecessary parameter passing and improve code clarity. No functional changes introduced. Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-3-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md/raid1: simplify uptodate handling in end_sync_writeLi Nan
In end_sync_write, r1bio state is always set to either R1BIO_WriteError or R1BIO_MadeGood. Consequently, put_sync_write_buf() never takes the 'else' branch that calls md_done_sync(), making the uptodate parameter have no practical effect. Pass 1 to put_sync_write_buf(). A more complete cleanup will be done in a follow-up patch. Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-2-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md/raid5: make sure max_sectors is not less than io_optYu Kuai
Otherwise, even if user issue IO by io_opt, such IO will be split by max_sectors before they are submitted to raid5. For consequence, full stripe IO is impossible. BTW, dm-raid5 is not affected and still have such problem. Link: https://lore.kernel.org/linux-raid/20260114171241.3043364-7-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md/raid5: use mempool to allocate stripe_request_ctxYu Kuai
On the one hand, stripe_request_ctx is 72 bytes, and it's a bit huge for a stack variable. On the other hand, the bitmap sectors_to_do is a fixed size, result in max_hw_sector_kb of raid5 array is at most 256 * 4k = 1Mb, and this will make full stripe IO impossible for the array that chunk_size * data_disks is bigger. Allocate ctx during runtime will make it possible to get rid of this limit. Link: https://lore.kernel.org/linux-raid/20260114171241.3043364-6-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com>
2026-01-26md: merge mddev serialize_policy into mddev_flagsYu Kuai
There is not need to use a separate field in struct mddev, there are no functional changes. Link: https://lore.kernel.org/linux-raid/20260114171241.3043364-5-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com>
2026-01-26md: merge mddev faillast_dev into mddev_flagsYu Kuai
There is not need to use a separate field in struct mddev, there are no functional changes. Link: https://lore.kernel.org/linux-raid/20260114171241.3043364-4-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com>
2026-01-26md: merge mddev has_superblock into mddev_flagsYu Kuai
There is not need to use a separate field in struct mddev, there are no functional changes. Link: https://lore.kernel.org/linux-raid/20260114171241.3043364-3-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com>
2026-01-26md/raid5: fix raid5_run() to return error when log_init() failsYu Kuai
Since commit f63f17350e53 ("md/raid5: use the atomic queue limit update APIs"), the abort path in raid5_run() returns 'ret' instead of -EIO. However, if log_init() fails, 'ret' is still 0 from the previous successful call, causing raid5_run() to return success despite the failure. Fix this by capturing the return value from log_init(). Link: https://lore.kernel.org/linux-raid/20260114171241.3043364-2-yukuai@fnnas.com Fixes: f63f17350e53 ("md/raid5: use the atomic queue limit update APIs") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202601130531.LGfcZsa4-lkp@intel.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Li Nan <linan122@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-23Merge tag 'block-6.19-20260122' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - A set of selftest fixes for ublk - Fix for a pid mismatch in ublk, comparing PIDs in different namespaces if run inside a namespace - Fix for a regression added in this release with polling, where the nvme tcp connect code would spin forever - Zoned device error path fix - Tweak the blkzoned uapi additions from this kernel release, making them more easily discoverable - Fix for a regression in bcache with bio endio handling added in this release * tag 'block-6.19-20260122' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: bcache: use bio cloning for detached device requests blk-mq: use BLK_POLL_ONESHOT for synchronous poll completion selftests/ublk: fix garbage output in foreground mode selftests/ublk: fix error handling for starting device selftests/ublk: fix IO thread idle check block: make the new blkzoned UAPI constants discoverable ublk: fix ublksrv pid handling for pid namespaces block: Fix an error path in disk_update_zone_resources()
2026-01-22bcache: use bio cloning for detached device requestsShida Zhang
Previously, bcache hijacked the bi_end_io and bi_private fields of the incoming bio when the backing device was in a detached state. This is fragile and breaks if the bio is needed to be processed by other layers. This patch transitions to using a cloned bio embedded within a private structure. This ensures the original bio's metadata remains untouched. Fixes: 53280e398471 ("bcache: fix improper use of bi_end_io") Co-developed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Acked-by: Coly Li <colyli@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-21dm: fix excessive blk-crypto operations for invalid keysEric Biggers
dm_exec_wrappedkey_op() passes through the derive_sw_secret, import_key, generate_key, and prepare_key blk-crypto operations to an underlying device. Currently, it calls the operation on every underlying device until one returns success. This logic is flawed when the operation is expected to fail, such as an invalid key being passed to derive_sw_secret. That can happen if userspace passes an invalid key to the FS_IOC_ADD_ENCRYPTION_KEY ioctl. When that happens on a device-mapper device that consists of many dm-linear targets, a lot of unnecessary key unwrapping requests get sent to the underlying key wrapping hardware. Fix this by considering the first device only. As already documented in the comment, it was already checked that all underlying devices support wrapped keys, so this should be fine. Fixes: e93912786e50 ("dm: pass through operations on wrapped inline crypto keys") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-01-21dm-verity: fix section mismatch errorMikulas Patocka
The function "__init dm_verity_init" was calling "__exit dm_verity_verify_sig_exit" and this triggered section mismatch error. Fix this by dropping the "__exit" tag on dm_verity_verify_sig_exit. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 033724b1c627A ("dm-verity: add dm-verity keyring") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202601210645.11u5Myme-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202601211041.pcTzwcdp-lkp@intel.com/
2026-01-20kernel.h: drop hex.h and update all hex.h usersRandy Dunlap
Remove <linux/hex.h> from <linux/kernel.h> and update all users/callers of hex.h interfaces to directly #include <linux/hex.h> as part of the process of putting kernel.h on a diet. Removing hex.h from kernel.h means that 36K C source files don't have to pay the price of parsing hex.h for the roughly 120 C source files that need it. This change has been build-tested with allmodconfig on most ARCHes. Also, all users/callers of <linux/hex.h> in the entire source tree have been updated if needed (if not already #included). Link: https://lkml.kernel.org/r/20251215005206.2362276-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20block: pass io_comp_batch to rq_end_io_fn callbackMing Lei
Add a third parameter 'const struct io_comp_batch *' to the rq_end_io_fn callback signature. This allows end_io handlers to access the completion batch context when requests are completed via blk_mq_end_request_batch(). The io_comp_batch is passed from blk_mq_end_request_batch(), while NULL is passed from __blk_mq_end_request() and blk_mq_put_rq_ref() which don't have batch context. This infrastructure change enables drivers to detect whether they're being called from a batched completion path (like iopoll) and access additional context stored in the io_comp_batch. Update all rq_end_io_fn implementations: - block/blk-mq.c: blk_end_sync_rq - block/blk-flush.c: flush_end_io, mq_flush_data_end_io - drivers/nvme/host/ioctl.c: nvme_uring_cmd_end_io - drivers/nvme/host/core.c: nvme_keep_alive_end_io - drivers/nvme/host/pci.c: abort_endio, nvme_del_queue_end, nvme_del_cq_end - drivers/nvme/target/passthru.c: nvmet_passthru_req_done - drivers/scsi/scsi_error.c: eh_lock_door_done - drivers/scsi/sg.c: sg_rq_end_io - drivers/scsi/st.c: st_scsi_execute_end - drivers/target/target_core_pscsi.c: pscsi_req_done - drivers/md/dm-rq.c: end_clone_request Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-19dm-unstripe: fix mapping bug when there are multiple targets in a tableMatt Whitlock
The "unstriped" device-mapper target incorrectly calculates the sector offset on the mapped device when the target's origin is not zero. Take for example this hypothetical concatenation of the members of a two-disk RAID0: linearized: 0 2097152 unstriped 2 128 0 /dev/md/raid0 0 linearized: 2097152 2097152 unstriped 2 128 1 /dev/md/raid0 0 The intent in this example is to create a single device named /dev/mapper/linearized that comprises all of the chunks of the first disk of the RAID0 set, followed by all of the chunks of the second disk of the RAID0 set. This fails because dm-unstripe.c's map_to_core function does its computations based on the sector number within the mapper device rather than the sector number within the target. The bug turns invisible when the target's origin is at sector zero of the mapper device, as is the common case. In the example above, however, what happens is that the first half of the mapper device gets mapped correctly to the first disk of the RAID0, but the second half of the mapper device gets mapped past the end of the RAID0 device, and accesses to any of those sectors return errors. Signed-off-by: Matt Whitlock <kernel@mattwhitlock.name> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Fixes: 18a5bf270532 ("dm: add unstriped target")
2026-01-19dm-integrity: fix recalculation in bitmap modeMikulas Patocka
There's a logic quirk in the handling of suspend in the bitmap mode: This is the sequence of calls if we are reloading a dm-integrity table: * dm_integrity_ctr reads a superblock with the flag SB_FLAG_DIRTY_BITMAP set. * dm_integrity_postsuspend initializes a journal and clears the flag SB_FLAG_DIRTY_BITMAP. * dm_integrity_resume sees the superblock with SB_FLAG_DIRTY_BITMAP set - thus it interprets the journal as if it were a bitmap. This quirk causes recalculation problem if the user increases the size of the device in the bitmap mode. Fix this by reading a fresh copy on the superblock in dm_integrity_resume. This commit also fixes another logic quirk - the branch that sets bitmap bits if the device was extended should only be executed if the flag SB_FLAG_DIRTY_BITMAP is set. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Ondrej Kozina <okozina@redhat.com> Fixes: 468dfca38b1a ("dm integrity: add a bitmap mode") Cc: stable@vger.kernel.org
2026-01-19dm-bufio: avoid redundant buffer_tree lookupsEric Biggers
dm-bufio's map from block number to buffer is organized as a hash table of red-black trees. It does far more lookups in this hash table than necessary: typically one lookup to lock the tree, one lookup to search the tree, and one lookup to unlock the tree. Only one of those lookups is needed. Optimize it to do only the minimum number of lookups. This improves performance. It also reduces the object code size, considering that the redundant hash table lookups were being inlined. For example, the size of the text section of dm-bufio.o decreases from 15599 to 15070 bytes with gcc 15 and x86_64, or from 20652 to 20244 bytes with clang 21 and arm64. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-01-19dm-bufio: merge cache_put() into cache_put_and_wake()Eric Biggers
Merge cache_put() into its only caller, cache_put_and_wake(). Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-01-19dm-verity: add dm-verity keyringChristian Brauner
Add a dedicated ".dm-verity" keyring for root hash signature verification, similar to the ".fs-verity" keyring used by fs-verity. By default the keyring is unused retaining the exact same old behavior. For systems that provision additional keys only intended for dm-verity images during boot, the dm_verity.keyring_unsealed=1 kernel parameter leaves the keyring open. We want to use this in systemd as a way add keys during boot that are only used for creating dm-verity devices for later mounting and nothing else. The discoverable disk image (DDI) spec at [1] heavily relies on dm-verity and we would like to expand this even more. This will allow us to do that in a fully backward compatible way. Once provisioning is complete, userspace restricts and activates it for dm-verity verification. If userspace fully seals the keyring then it gains the guarantee that no new keys can be added. Link: https://uapi-group.org/specifications/specs/discoverable_partitions_specification [1] Co-developed-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-01-14dm: clear cloned request bio pointer when last clone bio completesMichael Liang
Stale rq->bio values have been observed to cause double-initialization of cloned bios in request-based device-mapper targets, leading to use-after-free and double-free scenarios. One such case occurs when using dm-multipath on top of a PCIe NVMe namespace, where cloned request bios are freed during blk_complete_request(), but rq->bio is left intact. Subsequent clone teardown then attempts to free the same bios again via blk_rq_unprep_clone(). The resulting double-free path looks like: nvme_pci_complete_batch() nvme_complete_batch() blk_mq_end_request_batch() blk_complete_request() // called on a DM clone request bio_endio() // first free of all clone bios ... rq->end_io() // end_clone_request() dm_complete_request(tio->orig) dm_softirq_done() dm_done() dm_end_request() blk_rq_unprep_clone() // second free of clone bios Fix this by clearing the clone request's bio pointer when the last cloned bio completes, ensuring that later teardown paths do not attempt to free already-released bios. Signed-off-by: Michael Liang <mliang@purestorage.com> Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org