path: root/drivers/block
Age | Commit message | Author
2026-01-22 | ublk: handle UBLK_U_IO_COMMIT_IO_CMDS | Ming Lei
Handle UBLK_U_IO_COMMIT_IO_CMDS by walking the uring_cmd fixed buffer:
- read each element into one temp buffer in batch style
- parse and apply each element for committing io result
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 | ublk: handle UBLK_U_IO_PREP_IO_CMDS | Ming Lei
This commit implements the handling of the UBLK_U_IO_PREP_IO_CMDS command, which allows userspace to prepare a batch of I/O requests. The core of this change is the `ublk_walk_cmd_buf` function, which iterates over the elements in the uring_cmd fixed buffer. For each element, it parses the I/O details, finds the corresponding `ublk_io` structure, and prepares it for future dispatch. Add per-io lock for protecting concurrent delivery and committing. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
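The batch walk described above can be pictured with a small userspace sketch. It is an illustration only, not the driver's code: `struct toy_elem`, `TOY_BATCH`, `toy_walk_buf()` and the callback are hypothetical names, and the element layout is an assumption made for the example. Only the general shape comes from the commit message: copy several elements into a temporary buffer, then parse and apply them one by one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical element layout; the real ublk element format differs. */
struct toy_elem {
    uint16_t tag;
    uint16_t pad;
    int32_t result;
};

#define TOY_BATCH 4 /* elements copied per pass (assumed value) */

typedef int (*toy_elem_fn)(const struct toy_elem *e, void *data);

/* Walk a byte buffer of elements, copying up to TOY_BATCH at a time. */
static int toy_walk_buf(const unsigned char *buf, size_t len,
                        toy_elem_fn fn, void *data)
{
    struct toy_elem tmp[TOY_BATCH];
    size_t off = 0;

    if (len % sizeof(struct toy_elem))
        return -1; /* partial element: reject the whole buffer */

    while (off < len) {
        size_t chunk = len - off;

        if (chunk > sizeof(tmp))
            chunk = sizeof(tmp);
        memcpy(tmp, buf + off, chunk); /* one batched copy */

        for (size_t i = 0; i < chunk / sizeof(tmp[0]); i++) {
            int ret = fn(&tmp[i], data); /* parse and apply */

            if (ret)
                return ret;
        }
        off += chunk;
    }
    return 0;
}

/* Example callback: accumulate the result field of each element. */
static int toy_sum_results(const struct toy_elem *e, void *data)
{
    *(long *)data += e->result;
    return 0;
}
```

Passing a different callback to the same walker is one plausible way for a prepare path and a commit path to share the iteration logic.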
2026-01-22 | ublk: add new batch command UBLK_U_IO_PREP_IO_CMDS & UBLK_U_IO_COMMIT_IO_CMDS | Ming Lei
Add new command UBLK_U_IO_PREP_IO_CMDS, which is the batch version of UBLK_IO_FETCH_REQ. Add new command UBLK_U_IO_COMMIT_IO_CMDS, which is for committing io command result only, still the batch version. The new command header type is `struct ublk_batch_io`. This patch doesn't actually implement these commands yet, just validates the SQE fields. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 | ublk: prepare for not tracking task context for command batch | Ming Lei
batch io is designed to be independent of task context, and we will not track task context for batch io feature. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 | ublk: define ublk_ch_batch_io_fops for the coming feature F_BATCH_IO | Ming Lei
Introduces the basic structure for a batched I/O feature in the ublk driver. It adds placeholder functions and a new file operations structure, ublk_ch_batch_io_fops, which will be used for fetching and committing I/O commands in batches. Currently, the feature is disabled. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-21 | ublk: fix ublksrv pid handling for pid namespaces | Seamus Connor
When ublksrv runs inside a pid namespace, START/END_RECOVERY compared the stored init-ns tgid against the userspace pid (getpid vnr), so the check failed and control ops could not proceed. Compare against the caller’s init-ns tgid and store that value, then translate it back to the caller’s pid namespace when reporting GET_DEV_INFO so ublk list shows a sensible pid. Testing: start/recover in a pid namespace; `ublk list` shows reasonable pid values in init, child, and sibling namespaces. Fixes: c2c8089f325e ("ublk: validate ublk server pid") Signed-off-by: Seamus Connor <sconnor@purestorage.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-20 | array_size.h: add ARRAY_END() | Alejandro Colomar
Patch series "Add ARRAY_END(), and use it to fix off-by-one bugs", v6.

ARRAY_END() is a macro to calculate a pointer to one past the last element of an array argument. This is a very common pointer, which is used to iterate over all elements of an array:

    for (T *p = a; p < ARRAY_END(a); p++)
        ...

A pointer one past the last element of an array should never be dereferenced; it's perfectly fine to hold such a pointer (and a good thing to do), but the only thing it should be used for is comparing it with other pointers derived from the same array.

Due to how special these pointers are, it would be good to use consistent naming. It's common to name such a pointer 'end' (in fact, we have many such cases in the kernel). C++ even standardized this name with std::end(). Let's try naming such pointers 'end', and also try to avoid using 'end' for pointers that are not the result of ARRAY_END().

It has been incorrectly suggested that these pointers are dangerous, and that they should never be used, suggesting to use something like

    #define ARRAY_LAST(a) ((a) + ARRAY_SIZE(a) - 1)

    for (T *p = a; p <= ARRAY_LAST(a); p++)
        ...

This is bogus, as it doesn't scale down to arrays of 0 elements. In the case of an array of 0 elements, ARRAY_LAST() would underflow the pointer, which not only can't be dereferenced, it can't even be held (it produces Undefined Behavior). That would be a footgun. Such arrays don't exist per the ISO C standard; however, GCC supports them as an extension (with partial support, though; GCC has a few bugs which need to be fixed).

This patch set fixes a few places where it was intended to use the array end (that is, one past the last element), but accidentally a pointer to the last element was used instead, thus wasting one byte. It also replaces other places where the array end was correctly calculated with ARRAY_SIZE(), by using the simpler ARRAY_END().

Also, there was one drivers/ file that already defined this macro. We remove that definition, to not conflict with this one.

This patch (of 4):

ARRAY_END() returns a pointer one past the end of the last element in the array argument. This pointer is useful for iterating over the elements of an array:

    for (T *p = a; p < ARRAY_END(a); p++)
        ...

Link: https://lkml.kernel.org/r/cover.1765449750.git.alx@kernel.org Link: https://lkml.kernel.org/r/5973cfb674192bc8e533485dbfb54e3062896be1.1765449750.git.alx@kernel.org Signed-off-by: Alejandro Colomar <alx@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: Christopher Bazley <chris.bazley.wg14@gmail.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Marco Elver <elver@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Jann Horn <jannh@google.com> Cc: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
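For readers outside the kernel tree, the iteration idiom can be tried in userspace with the sketch below. The two macros are simplified stand-ins written for this example; the kernel's definitions may differ, for instance by adding compile-time checks that the argument really is an array. `toy_sum_table()` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-ins, not the kernel's definitions. */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))
#define ARRAY_END(a)  ((a) + ARRAY_SIZE(a))

/* Sum a fixed table using the one-past-the-end pointer as the bound. */
static int toy_sum_table(void)
{
    static const int table[] = { 1, 2, 3, 4 };
    int sum = 0;

    /* 'p' is only compared against the end pointer, never read there. */
    for (const int *p = table; p < ARRAY_END(table); p++)
        sum += *p;

    return sum;
}
```

The loop condition uses `<` against the end pointer, so the one-past-the-end address is formed and compared but never dereferenced.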
2026-01-20 | zram: remove KMSG_COMPONENT macro | Heiko Carstens
The KMSG_COMPONENT macro is a leftover of the s390 specific "kernel message catalog" from 2008 [1] which never made it upstream. The macro was added to s390 code to allow for an out-of-tree patch which used this to generate unique message ids. That out-of-tree patch no longer exists either. The pattern of how KMSG_COMPONENT is used was also partially adopted by non-s390 code, for whatever reason. Remove the macro in order to get rid of a pointless indirection. Link: https://lkml.kernel.org/r/20251126143602.2207435-1-hca@linux.ibm.com Link: https://lwn.net/Articles/292650/ [1] Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 | zram: drop pp_in_progress | Sergey Senozhatsky
pp_in_progress makes sure that only one post-processing (writeback or recompression) is active at any given time. Functionality-wise it basically shadows zram init_lock, when init_lock is acquired in writer mode. Switch recompress_store() and writeback_store() to take zram init_lock in writer mode, like all store() sysfs handlers should do, so that we can drop pp_in_progress. Recompression and writeback can be somewhat slow, so holding init_lock in writer mode can block zram attrs reads, but in reality the only zram attrs reads that take place are mm_stat reads, and usually it's the same process that reads mm_stat and does recompression or writeback. Link: https://lkml.kernel.org/r/20251216071342.687993-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 | zram: fixup read_block_state() | Sergey Senozhatsky
ac_time is now in seconds, do not use ktime_to_timespec64() [akpm@linux-foundation.org: remove now-unused local `ts'] [akpm@linux-foundation.org: fix build] Link: https://lkml.kernel.org/r/20260115033031.3818977-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reported-by: Chris Mason <clm@meta.com> Closes: https://lkml.kernel.org/r/20260114124522.1326519-1-clm@meta.com Cc: Brian Geffon <bgeffon@google.com> Cc: David Stevens <stevensd@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Richard Chang <richardycc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 | zram: trivial fix of recompress_slot() coding styles | Sergey Senozhatsky
A minor fixup of 80-cols breakage in recompress_slot() comment and zs_malloc() call. Link: https://lkml.kernel.org/r/ff3254847dbdc6fbd2e3fed53c572a261d60b7b6.1765775954.git.senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Brian Geffon <bgeffon@google.com> Cc: David Stevens <stevensd@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Richard Chang <richardycc@google.com> Cc: Chris Mason <clm@meta.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 | zram: rename internal slot API | Sergey Senozhatsky
We have a somewhat confusing internal API naming. E.g. the following code:

    zram_slot_lock()
    if (zram_allocated())
        zram_set_flag()
    zram_slot_unlock()

may look like it does something on zram device level, but in fact it tests and sets slot entry flags, not the device ones. Rename API to explicitly distinguish functions that operate on the slot level from functions that operate on the zram device level. While at it, fixup some coding styles. [senozhatsky@chromium.org: fix up mark_slot_accessed()] Link: https://lkml.kernel.org/r/20260115031922.3813659-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/775a0b1a0ace5caf1f05965d8bc637c1192820fa.1765775954.git.senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Brian Geffon <bgeffon@google.com> Cc: David Stevens <stevensd@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Richard Chang <richardycc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 | zram: use u32 for entry ac_time tracking | Sergey Senozhatsky
We can reduce sizeof(zram_table_entry) on 64-bit systems by converting flags and ac_time to u32. Entry flags fit into u32, and for ac_time u32 gives us over a century of entry lifespan (approx 136 years) which is plenty (zram uses system boot time (seconds)). In struct zram_table_entry we use bytes aliasing, because bit-wait API (for slot lock) requires a whole unsigned long word. Link: https://lkml.kernel.org/r/d7c0b48450c70eeb5fd8acd6ecd23593f30dbf1f.1765775954.git.senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: David Stevens <stevensd@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Richard Chang <richardycc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
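The "approx 136 years" figure can be checked with a few lines of arithmetic. The sketch below is not zram code; the helper name and the choice of a 365.25-day year are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds in an average (Julian) year: 365.25 days * 86400 s. */
#define TOY_SECS_PER_YEAR 31557600ULL

/* Whole years that a u32 counter of seconds can span before wrapping. */
static unsigned int toy_u32_seconds_in_years(void)
{
    return (unsigned int)((uint64_t)UINT32_MAX / TOY_SECS_PER_YEAR);
}
```

Since the counter holds seconds since boot, the wrap point is measured from boot time rather than from a fixed epoch.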
2026-01-20 | zram: consolidate device-attr declarations | Sergey Senozhatsky
Do not spread device attributes declarations across the file, move io_stat, mm_stat, debug_stat to a common device-attr section. Link: https://lkml.kernel.org/r/20251201094754.4149975-8-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Brian Geffon <bgeffon@google.com> Cc: David Stevens <stevensd@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Richard Chang <richardycc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 | zram: switch to guard() for init_lock | Sergey Senozhatsky
Use init_lock guard() in sysfs store/show handlers, in order to simplify and, more importantly, to modernize the code. While at it, fix up more coding styles. Link: https://lkml.kernel.org/r/20251201094754.4149975-7-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Brian Geffon <bgeffon@google.com> Cc: David Stevens <stevensd@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Richard Chang <richardycc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
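The idea behind scope-based guards can be shown in userspace with the GCC/Clang `cleanup` variable attribute. This is a toy model, not the kernel's guard() implementation: `struct toy_lock`, `TOY_GUARD()` and `toy_store()` are hypothetical names, and the "lock" is just a counter so the effect is observable.

```c
#include <assert.h>

/* Toy lock: 'held' counts how many times it is currently taken. */
struct toy_lock {
    int held;
};

static struct toy_lock *toy_guard_acquire(struct toy_lock *l)
{
    l->held++;
    return l;
}

/* Called automatically when the guard variable goes out of scope. */
static void toy_guard_release(struct toy_lock **l)
{
    (*l)->held--;
}

/* One guard per scope in this sketch, since the variable name is fixed. */
#define TOY_GUARD(lockp) \
    struct toy_lock *toy_guard_ptr \
        __attribute__((cleanup(toy_guard_release), unused)) = \
        toy_guard_acquire(lockp)

/* A store-style handler with an early error return and no unlock calls. */
static int toy_store(struct toy_lock *l, int value)
{
    TOY_GUARD(l);

    if (value < 0)
        return -1;  /* lock is dropped on this path too */

    return l->held; /* evaluated while the guard is still in scope */
}
```

The benefit in handlers with several early returns is that no exit path can forget the unlock.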
2026-01-20 | zram: rename zram_free_page() | Sergey Senozhatsky
We don't free page in zram_free_page(), not all slots even have any memory associated with them (e.g. ZRAM_SAME). We free the slot (or reset it), rename the function accordingly. Link: https://lkml.kernel.org/r/20251201094754.4149975-6-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Brian Geffon <bgeffon@google.com> Cc: David Stevens <stevensd@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Richard Chang <richardycc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 | zram: move bd_stat to writeback section | Sergey Senozhatsky
Move bd_stat function and attribute declaration to existing CONFIG_WRITEBACK ifdef-sections. Link: https://lkml.kernel.org/r/20251201094754.4149975-5-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Brian Geffon <bgeffon@google.com> Cc: David Stevens <stevensd@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Richard Chang <richardycc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 | zram: introduce writeback_compressed device attribute | Richard Chang
Introduce writeback_compressed device attribute to toggle compressed writeback (decompression on demand) feature. [senozhatsky@chromium.org: rewrote original patch, added documentation] Link: https://lkml.kernel.org/r/20251201094754.4149975-3-senozhatsky@chromium.org Signed-off-by: Richard Chang <richardycc@google.com> Co-developed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Brian Geffon <bgeffon@google.com> Cc: David Stevens <stevensd@google.com> Cc: Minchan Kim <minchan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 | zram: introduce compressed data writeback | Richard Chang
Patch series "zram: introduce compressed data writeback", v2. As writeback becomes more common there is another shortcoming that needs to be addressed - compressed data writeback. Currently zram does uncompressed data writeback which is not optimal due to potential CPU and battery wastage. This series changes suboptimal uncompressed writeback to a more optimal compressed data writeback. This patch (of 7): zram stores all written back slots raw, which implies that during writeback zram first has to decompress slots (except for ZRAM_HUGE slots, which are raw already). The problem with this approach is that not every written back page gets read back (either via read() or via page-fault), which means that zram basically wastes CPU cycles and battery decompressing such slots. This changes with introduction of decompression on demand, in other words decompression on read()/page-fault. One caveat of decompression on demand is that async read is completed in IRQ context, while zram decompression is sleepable. To workaround this, read-back decompression is offloaded to a preemptible context - system high-prio work-queue. At this point compressed writeback is still disabled, a follow up patch will introduce a new device attribute which will make it possible to toggle compressed writeback per-device. [senozhatsky@chromium.org: rewrote original implementation] Link: https://lkml.kernel.org/r/20251201094754.4149975-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20251201094754.4149975-2-senozhatsky@chromium.org Signed-off-by: Richard Chang <richardycc@google.com> Co-developed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Minchan Kim <minchan@google.com> Suggested-by: Brian Geffon <bgeffon@google.com> Cc: David Stevens <stevensd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-16 | Merge tag 'block-6.19-20260116' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux | Linus Torvalds
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
  - Device quirk to disable faulty temperature (Ilikara)
  - TCP target null pointer fix from bad host protocol usage (Shivam)
  - Add apple,t8103-nvme-ans2 as a compatible apple controller (Janne)
  - FC tagset leak fix (Chaitanya)
  - TCP socket deadlock fix (Hannes)
  - Target name buffer overrun fix (Shin'ichiro)
- Fix for an underflow for rnbd during device unmap
- Zero the non-PI part of the auto integrity buffer
- Fix for a configfs memory leak in the null block driver

* tag 'block-6.19-20260116' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  rnbd-clt: fix refcount underflow in device unmap path
  nvme: fix PCIe subsystem reset controller state transition
  nvmet: do not copy beyond sybsysnqn string length
  nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready()
  null_blk: fix kmemleak by releasing references to fault configfs items
  block: zero non-PI portion of auto integrity buffer
  nvme-fc: release admin tagset if init fails
  nvme-apple: add "apple,t8103-nvme-ans2" as compatible
  nvme-tcp: fix NULL pointer dereferences in nvmet_tcp_build_pdu_iovec
  nvme-pci: disable secondary temp for Wodposit WPBSNM8
2026-01-15 | rnbd-clt: fix refcount underflow in device unmap path | Chaitanya Kulkarni
During device unmapping (triggered by module unload or explicit unmap), a refcount underflow occurs causing a use-after-free warning:

[14747.574913] ------------[ cut here ]------------
[14747.574916] refcount_t: underflow; use-after-free.
[14747.574917] WARNING: lib/refcount.c:28 at refcount_warn_saturate+0x55/0x90, CPU#9: kworker/9:1/378
[14747.574924] Modules linked in: rnbd_client(-) rtrs_client rnbd_server rtrs_server rtrs_core ...
[14747.574998] CPU: 9 UID: 0 PID: 378 Comm: kworker/9:1 Tainted: G O N 6.19.0-rc3lblk-fnext+ #42 PREEMPT(voluntary)
[14747.575005] Workqueue: rnbd_clt_wq unmap_device_work [rnbd_client]
[14747.575010] RIP: 0010:refcount_warn_saturate+0x55/0x90
[14747.575037] Call Trace:
[14747.575038] <TASK>
[14747.575038] rnbd_clt_unmap_device+0x170/0x1d0 [rnbd_client]
[14747.575044] process_one_work+0x211/0x600
[14747.575052] worker_thread+0x184/0x330
[14747.575055] ? __pfx_worker_thread+0x10/0x10
[14747.575058] kthread+0x10d/0x250
[14747.575062] ? __pfx_kthread+0x10/0x10
[14747.575066] ret_from_fork+0x319/0x390
[14747.575069] ? __pfx_kthread+0x10/0x10
[14747.575072] ret_from_fork_asm+0x1a/0x30
[14747.575083] </TASK>
[14747.575096] ---[ end trace 0000000000000000 ]---

Before this patch:

The bug is a double kobject_put() on dev->kobj during device cleanup.

Kobject Lifecycle:
    kobject_init_and_add() sets kobj.kref = 1 (initialization)
    kobject_put() sets kobj.kref = 0 (should be called once)

* Before this patch:

    rnbd_clt_unmap_device()
      rnbd_destroy_sysfs()
        kobject_del(&dev->kobj)          [remove from sysfs]
        kobject_put(&dev->kobj)          PUT #1 (WRONG!) kref: 1 to 0
          rnbd_dev_release()
            kfree(dev)                   [DEVICE FREED!]
      rnbd_destroy_gen_disk()            [use-after-free!]
      rnbd_clt_put_dev()
        refcount_dec_and_test(&dev->refcount)
        kobject_put(&dev->kobj)          PUT #2 (UNDERFLOW!) kref: 0 to -1 [WARNING!]

The first kobject_put() in rnbd_destroy_sysfs() prematurely frees the device via rnbd_dev_release(), then the second kobject_put() in rnbd_clt_put_dev() causes refcount underflow.

* After this patch:

Remove kobject_put() from rnbd_destroy_sysfs(). This function should only remove sysfs visibility (kobject_del), not manage object lifetime.

Call Graph (FIXED):

    rnbd_clt_unmap_device()
      rnbd_destroy_sysfs()
        kobject_del(&dev->kobj)          [remove from sysfs only] [kref unchanged: 1]
      rnbd_destroy_gen_disk()            [device still valid]
      rnbd_clt_put_dev()
        refcount_dec_and_test(&dev->refcount)
        kobject_put(&dev->kobj)          ONLY PUT (CORRECT!) kref: 1 to 0 [BALANCED]
          rnbd_dev_release()
            kfree(dev)                   [CLEAN DESTRUCTION]

This follows the kernel pattern where sysfs removal (kobject_del) is separate from object destruction (kobject_put). Fixes: 581cf833cac4 ("block: rnbd: add .release to rnbd_dev_ktype") Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com> Acked-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
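The difference between the two call graphs can be replayed with a toy reference counter. Nothing below is kernel API: `struct toy_dev` and the `toy_*` helpers are hypothetical and only model "remove visibility" versus "drop a reference", with a counter standing in for the release callback.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_dev {
    int kref;          /* models kobj.kref */
    bool visible;      /* models sysfs visibility */
    int release_calls; /* how often the release callback would have run */
};

static void toy_init(struct toy_dev *d)
{
    d->kref = 1;
    d->visible = true;
    d->release_calls = 0;
}

/* Models kobject_del(): hides the object, leaves the refcount alone. */
static void toy_del(struct toy_dev *d)
{
    d->visible = false;
}

/* Models kobject_put(): drops one reference, releases at zero. */
static void toy_put(struct toy_dev *d)
{
    if (--d->kref == 0)
        d->release_calls++;
}

/* Buggy ordering: the sysfs teardown also drops a reference. */
static void toy_unmap_buggy(struct toy_dev *d)
{
    toy_del(d);
    toy_put(d); /* premature: the device would be freed here */
    toy_put(d); /* second put underflows the counter */
}

/* Fixed ordering: visibility removal and lifetime are separate. */
static void toy_unmap_fixed(struct toy_dev *d)
{
    toy_del(d);
    /* disk teardown would run here while the device is still valid */
    toy_put(d); /* the only put */
}
```

In the buggy path the counter ends at -1, which is the underflow that refcount_t reports in the trace above. In the fixed path it ends at 0 with exactly one release.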
2026-01-13 | null_blk: fix kmemleak by releasing references to fault configfs items | Nilay Shroff
When CONFIG_BLK_DEV_NULL_BLK_FAULT_INJECTION is enabled, the null-blk driver sets up fault injection support by creating the timeout_inject, requeue_inject, and init_hctx_fault_inject configfs items as children of the top-level nullbX configfs group. However, when the nullbX device is removed, the references taken to these fault-config configfs items are not released. As a result, kmemleak reports a memory leak, for example:

unreferenced object 0xc00000021ff25c40 (size 32):
  comm "mkdir", pid 10665, jiffies 4322121578
  hex dump (first 32 bytes):
    69 6e 69 74 5f 68 63 74 78 5f 66 61 75 6c 74 5f  init_hctx_fault_
    69 6e 6a 65 63 74 00 88 00 00 00 00 00 00 00 00  inject..........
  backtrace (crc 1a018c86):
    __kmalloc_node_track_caller_noprof+0x494/0xbd8
    kvasprintf+0x74/0xf4
    config_item_set_name+0xf0/0x104
    config_group_init_type_name+0x48/0xfc
    fault_config_init+0x48/0xf0
    0xc0080000180559e4
    configfs_mkdir+0x304/0x814
    vfs_mkdir+0x49c/0x604
    do_mkdirat+0x314/0x3d0
    sys_mkdir+0xa0/0xd8
    system_call_exception+0x1b0/0x4f0
    system_call_vectored_common+0x15c/0x2ec

Fix this by explicitly releasing the references to the fault-config configfs items when dropping the reference to the top-level nullbX configfs group. Cc: stable@vger.kernel.org Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Fixes: bb4c19e030f4 ("block: null_blk: make fault-injection dynamically configurable per device") Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 | ublk: add UBLK_CMD_TRY_STOP_DEV command | Yoav Cohen
Add a best-effort stop command, UBLK_CMD_TRY_STOP_DEV, which only stops a ublk device when it has no active openers. Unlike UBLK_CMD_STOP_DEV, this command does not disrupt existing users. New opens are blocked only after disk_openers has reached zero; if the device is busy, the command returns -EBUSY and leaves it running. The ub->block_open flag is used only to close a race with an in-progress open and does not otherwise change open behavior. Advertise support via the UBLK_F_SAFE_STOP_DEV feature flag. Signed-off-by: Yoav Cohen <yoav@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
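The best-effort semantics described above can be modeled in a few lines. This is a single-threaded toy, not the driver: `struct toy_ublk` and the `toy_*` helpers are hypothetical, the error code returned to a blocked opener is a placeholder chosen for the example, and the race that ub->block_open closes in the real driver is not reproduced here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_ublk {
    unsigned int openers; /* models disk_openers */
    bool block_open;      /* set once a stop has been committed */
    bool running;
};

/* Open succeeds unless a stop is already in progress. */
static int toy_open(struct toy_ublk *ub)
{
    if (ub->block_open)
        return -ENXIO; /* placeholder error, an assumption of this sketch */
    ub->openers++;
    return 0;
}

static void toy_close(struct toy_ublk *ub)
{
    ub->openers--;
}

/* Best-effort stop: refuse while anyone still has the device open. */
static int toy_try_stop(struct toy_ublk *ub)
{
    if (ub->openers)
        return -EBUSY; /* device is left running */

    ub->block_open = true; /* only now are new opens refused */
    ub->running = false;
    return 0;
}
```

A caller that receives -EBUSY can simply retry later, since the failed attempt changes nothing for existing users.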
2026-01-12 | ublk: make ublk_ctrl_stop_dev return void | Yoav Cohen
This function always returns 0, so there is no need to return a value. Signed-off-by: Yoav Cohen <yoav@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 | ublk: optimize ublk_user_copy() on daemon task | Caleb Sander Mateos
ublk user copy syscalls may be issued from any task, so they take a reference count on the struct ublk_io to check whether it is owned by the ublk server and prevent a concurrent UBLK_IO_COMMIT_AND_FETCH_REQ from completing the request. However, if the user copy syscall is issued on the io's daemon task, a concurrent UBLK_IO_COMMIT_AND_FETCH_REQ isn't possible, so the atomic reference count dance is unnecessary. Check for UBLK_IO_FLAG_OWNED_BY_SRV to ensure the request is dispatched to the server and obtain the request from ublk_io's req field instead of looking it up on the tagset. Skip the reference count increment and decrement. Commit 8a8fe42d765b ("ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon task") made an analogous optimization for ublk zero copy buffer registration. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 | ublk: support UBLK_F_INTEGRITY | Stanley Zhang
Now that all the components of the ublk integrity feature have been implemented, add UBLK_F_INTEGRITY to UBLK_F_ALL, conditional on block layer integrity support (CONFIG_BLK_DEV_INTEGRITY). This allows ublk servers to create ublk devices with UBLK_F_INTEGRITY set and UBLK_U_CMD_GET_FEATURES to report the feature as supported. Signed-off-by: Stanley Zhang <stazhang@purestorage.com> [csander: make feature conditional on CONFIG_BLK_DEV_INTEGRITY] Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 | ublk: implement integrity user copy | Stanley Zhang
Add a function ublk_copy_user_integrity() to copy integrity information between a request and a user iov_iter. This mirrors the existing ublk_copy_user_pages() but operates on request integrity data instead of regular data. Check UBLKSRV_IO_INTEGRITY_FLAG in iocb->ki_pos in ublk_user_copy() to choose between copying data or integrity data. [csander: change offset units from data bytes to integrity data bytes, fix CONFIG_BLK_DEV_INTEGRITY=n build, rebase on user copy refactor] Signed-off-by: Stanley Zhang <stazhang@purestorage.com> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 | ublk: move offset check out of __ublk_check_and_get_req() | Caleb Sander Mateos
__ublk_check_and_get_req() checks that the passed in offset is within the data length of the specified ublk request. However, only user copy (ublk_check_and_get_req()) supports accessing ublk request data at a nonzero offset. Zero-copy buffer registration (ublk_register_io_buf()) always passes 0 for the offset, so the check is unnecessary. Move the check from __ublk_check_and_get_req() to ublk_check_and_get_req(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 | ublk: inline ublk_check_and_get_req() into ublk_user_copy() | Caleb Sander Mateos
ublk_check_and_get_req() has a single callsite in ublk_user_copy(). It takes a ton of arguments in order to pass local variables from ublk_user_copy() to ublk_check_and_get_req() and vice versa. And more are about to be added. Combine the functions to reduce the argument passing noise. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 | ublk: split out ublk_user_copy() helper | Caleb Sander Mateos
ublk_ch_read_iter() and ublk_ch_write_iter() are nearly identical except for the iter direction. Split out a helper function ublk_user_copy() to reduce the code duplication as these functions are about to get larger. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 | ublk: split out ublk_copy_user_bvec() helper | Caleb Sander Mateos
Factor a helper function ublk_copy_user_bvec() out of ublk_copy_user_pages(). It will be used for copying integrity data too. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 | ublk: set UBLK_IO_F_INTEGRITY in ublksrv_io_desc | Caleb Sander Mateos
Indicate to the ublk server when an incoming request has integrity data by setting UBLK_IO_F_INTEGRITY in the ublksrv_io_desc's op_flags field. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 | ublk: support UBLK_PARAM_TYPE_INTEGRITY in device creation | Stanley Zhang
Add a feature flag UBLK_F_INTEGRITY for a ublk server to request integrity/metadata support when creating a ublk device. The ublk server can also check for the feature flag on the created device or the result of UBLK_U_CMD_GET_FEATURES to tell if the ublk driver supports it. UBLK_F_INTEGRITY requires UBLK_F_USER_COPY, as user copy is the only data copy mode initially supported for integrity data. Add UBLK_PARAM_TYPE_INTEGRITY and struct ublk_param_integrity to struct ublk_params to specify the integrity params of a ublk device. UBLK_PARAM_TYPE_INTEGRITY requires UBLK_F_INTEGRITY and a nonzero metadata_size. The LBMD_PI_CAP_* and LBMD_PI_CSUM_* values from the linux/fs.h UAPI header are used for the flags and csum_type fields. If the UBLK_PARAM_TYPE_INTEGRITY flag is set, validate the integrity parameters and apply them to the blk_integrity limits. The struct ublk_param_integrity validations are based on the checks in blk_validate_integrity_limits(). Any invalid parameters should be rejected before being applied to struct blk_integrity. [csander: drop redundant pi_tuple_size field, use block metadata UAPI constants, add param validation] Signed-off-by: Stanley Zhang <stazhang@purestorage.com> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 | ublk: move ublk flag check functions earlier | Caleb Sander Mateos
ublk_dev_support_user_copy() will be used in ublk_validate_params(). Move these functions next to ublk_{dev,queue}_is_zoned() to avoid needing to forward-declare them. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-11 | Merge branch 'block-6.19' into for-7.0/block | Jens Axboe
Merge in fixes that went to 6.19 after for-7.0/block was branched. Pending ublk changes depend on particularly the async scan work.

* block-6.19:
  block: zero non-PI portion of auto integrity buffer
  ublk: fix use-after-free in ublk_partition_scan_work
  blk-mq: avoid stall during boot due to synchronize_rcu_expedited
  loop: add missing bd_abort_claiming in loop_set_status
  block: don't merge bios with different app_tags
  blk-rq-qos: Remove unlikely() hints from QoS checks
  loop: don't change loop device under exclusive opener in loop_set_status
  block, bfq: update outdated comment
  blk-mq: skip CPU offline notify on unmapped hctx
  selftests/ublk: fix Makefile to rebuild on header changes
  selftests/ublk: add test for async partition scan
  ublk: scan partition in async way
  block,bfq: fix aux stat accumulation destination
  md: Fix forward incompatibility from configurable logical block size
  md: Fix logical_block_size configuration being overwritten
  md: suspend array while updating raid_disks via sysfs
  md/raid5: fix possible null-pointer dereferences in raid5_store_group_thread_cnt()
  md: Fix static checker warning in analyze_sbs
2026-01-09 | Merge tag 'block-6.19-20260109' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux | Linus Torvalds
Pull block fixes from Jens Axboe:
- Kill unlikely checks for blk-rq-qos. These checks are really all-or-nothing, either the branch is taken all the time, or it's not. Depending on the configuration, either one of those cases may be true. Just remove the annotation
- Fix for merging bios with different app tags set
- Fix for a recently introduced slowdown due to RCU synchronization
- Fix for a status change on loop while it's in use, and then a later fix for that fix
- Fix for the async partition scanning in ublk

* tag 'block-6.19-20260109' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  ublk: fix use-after-free in ublk_partition_scan_work
  blk-mq: avoid stall during boot due to synchronize_rcu_expedited
  loop: add missing bd_abort_claiming in loop_set_status
  block: don't merge bios with different app_tags
  blk-rq-qos: Remove unlikely() hints from QoS checks
  loop: don't change loop device under exclusive opener in loop_set_status
2026-01-09ublk: fix use-after-free in ublk_partition_scan_workMing Lei
A race condition exists between the async partition scan work and device teardown that can lead to a use-after-free of ub->ub_disk: 1. ublk_ctrl_start_dev() schedules partition_scan_work after add_disk() 2. ublk_stop_dev() calls ublk_stop_dev_unlocked() which does: - del_gendisk(ub->ub_disk) - ublk_detach_disk() sets ub->ub_disk = NULL - put_disk() which may free the disk 3. The worker ublk_partition_scan_work() then dereferences ub->ub_disk leading to UAF Fix this by using ublk_get_disk()/ublk_put_disk() in the worker to hold a reference to the disk during the partition scan. The spinlock in ublk_get_disk() synchronizes with ublk_detach_disk() ensuring the worker either gets a valid reference or sees NULL and exits early. Also change flush_work() to cancel_work_sync() to avoid running the partition scan work unnecessarily when the disk is already detached. Fixes: 7fc4da6a304b ("ublk: scan partition in async way") Reported-by: Ruikai Peng <ruikai@pwno.io> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
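The get/put pattern described above can be modelled in userspace. The following is a minimal sketch, not the driver code: `struct disk`, `struct device`, `dev_get_disk()`, `dev_put_disk()`, `dev_detach_disk()` and `scan_work()` are hypothetical stand-ins, a pthread mutex plays the role of the spinlock, and a `freed` flag replaces the actual free so the outcome can be observed.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects; not the real ublk types. */
struct disk {
    int refs;     /* reference count, protected by device.lock */
    bool freed;   /* set instead of calling free(), so the model is observable */
};

struct device {
    pthread_mutex_t lock;
    struct disk *disk;   /* NULL once the disk has been detached */
};

/* Take a reference if the disk is still attached; return NULL otherwise. */
static struct disk *dev_get_disk(struct device *dev)
{
    struct disk *d;

    pthread_mutex_lock(&dev->lock);
    d = dev->disk;
    if (d)
        d->refs++;
    pthread_mutex_unlock(&dev->lock);
    return d;
}

/* Drop a reference; the last put "frees" the disk. */
static void dev_put_disk(struct device *dev, struct disk *d)
{
    pthread_mutex_lock(&dev->lock);
    if (--d->refs == 0)
        d->freed = true;
    pthread_mutex_unlock(&dev->lock);
}

/* Teardown: clear the pointer under the lock, then drop the owner's reference. */
static void dev_detach_disk(struct device *dev)
{
    struct disk *d;

    pthread_mutex_lock(&dev->lock);
    d = dev->disk;
    dev->disk = NULL;
    pthread_mutex_unlock(&dev->lock);
    if (d)
        dev_put_disk(dev, d);
}

/* Worker: returns true if it scanned, false if it exited early. */
static bool scan_work(struct device *dev)
{
    struct disk *d = dev_get_disk(dev);

    if (!d)
        return false;   /* disk already detached: nothing to dereference */
    /* ... the scan would use d here; it cannot be freed while we hold a ref ... */
    dev_put_disk(dev, d);
    return true;
}
```

Because the pointer is read and the count is raised under the same lock that teardown uses to clear the pointer, the worker in this model either holds a valid reference for the whole scan or sees NULL and returns.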
2026-01-07loop: add missing bd_abort_claiming in loop_set_statusTetsuo Handa
Commit 08e136ebd193 ("loop: don't change loop device under exclusive opener in loop_set_status") forgot to call bd_abort_claiming() when mutex_lock_killable() failed. Fixes: 08e136ebd193 ("loop: don't change loop device under exclusive opener in loop_set_status") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06loop: don't change loop device under exclusive opener in loop_set_statusRaphael Pinsonneault-Thibeault
loop_set_status() is allowed to change the loop device while there are other openers of the device, even exclusive ones. In this case, it causes a KASAN: slab-out-of-bounds Read in ext4_search_dir(), since when looking for an entry in an inlined directory, e_value_offs is changed underneath the filesystem by loop_set_status(). Fix the problem by forbidding loop_set_status() from modifying the loop device while there are exclusive openers of the device. This is similar to the fix in loop_configure() by commit 33ec3e53e7b1 ("loop: Don't change loop device under exclusive opener") alongside commit ecbe6bc0003b ("block: use bd_prepare_to_claim directly in the loop driver"). Reported-by: syzbot+3ee481e21fd75e14c397@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=3ee481e21fd75e14c397 Tested-by: syzbot+3ee481e21fd75e14c397@syzkaller.appspotmail.com Tested-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Signed-off-by: Raphael Pinsonneault-Thibeault <rpthibeault@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06rnbd-srv: Zero the rsp buffer before using itMd Haris Iqbal
Before using the data buffer to send back the response message, zero it completely. This prevents stray bytes from being picked up by the client side when the message is exchanged between different protocol versions. Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
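Why zeroing matters across protocol versions can be shown with a small userspace sketch. The layouts `struct rsp_v1` and `struct rsp_v2` and the helper `fill_rsp_v1()` are assumptions made up for illustration; they are not the RNBD message definitions.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical response layouts: v2 appends a field that a v1 sender never writes. */
struct rsp_v1 { uint32_t status; };
struct rsp_v2 { uint32_t status; uint32_t extra; };

/*
 * Fill a reusable buffer with a v1 response.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int fill_rsp_v1(void *buf, size_t buf_len, uint32_t status)
{
    struct rsp_v1 rsp = { .status = status };

    if (buf_len < sizeof(rsp))
        return -1;
    memset(buf, 0, buf_len);          /* clear whatever the buffer held before */
    memcpy(buf, &rsp, sizeof(rsp));   /* then write only the fields v1 knows about */
    return 0;
}
```

A peer that interprets the same bytes as `struct rsp_v2` then reads `extra` as 0 rather than as leftover data from an earlier use of the buffer.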
2026-01-06rnbd-srv: Fix server side setting of bi_size for special IOsFlorian-Ewald Mueller
On rnbd-srv, the bi_size of the bio is set during the bio_add_page function, to which datalen is passed. But for special IOs like DISCARD and WRITE_ZEROES, datalen is 0, since there is no data to write. For these special IOs, use the bi_size of the rnbd_msg_io. Fixes: f6f84be089c9 ("block/rnbd-srv: Add sanity check and remove redundant assignment") Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com> Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com> Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06rnbd-srv: fix the trace format for flagsJack Wang
The __print_flags helper is meant for bitmasks, while rnbd_rw_flags is a mix of bitmask and enum values. To avoid confusion, just print the data as it is. Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com> Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06block/rnbd-proto: Check and retain the NOUNMAP flag for requestsMd Haris Iqbal
The NOUNMAP flag is used in combination with the WRITE_ZEROES flag to indicate that the upper layers want the sectors zeroed but do not want them freed. This instruction is especially important for storage stacks that involve a layer capable of thin provisioning. This commit makes the RNBD block device transfer and retain the NOUNMAP flag for requests, so it can be passed on to the backend device on the server side. Since it is a change in the wire protocol, bump the minor version of the protocol. Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06block: rnbd: add .release to rnbd_dev_ktypeZhu Yanjun
Every ktype must provide a .release function that will be called after the last kobject_put. Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com> Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06block/rnbd-proto: Handle PREFLUSH flag properly for IOsMd Haris Iqbal
In the RNBD client, for a WRITE request of size 0 with only the REQ_PREFLUSH bit set, while converting from bio_opf to rnbd_opf, we convert REQ_OP_WRITE to RNBD_OP_WRITE and then check whether the rq is a flush through the function op_is_flush. That function checks both the REQ_PREFLUSH and REQ_FUA flags, and if either of them is set, RNBD_F_FUA is set. On the RNBD server side, while converting the RNBD flags to req flags, if the RNBD_F_FUA flag is set, we just set the REQ_FUA flag. This means we have lost the PREFLUSH flag and added the REQ_FUA flag in its place. This commit adds a new RNBD_F_PREFLUSH flag, and also adds separate handling for the REQ_PREFLUSH flag. On the server side, if RNBD_F_PREFLUSH is present, REQ_PREFLUSH is added to the bio. Since it is a change in the wire protocol, bump the minor version of the protocol. The change is backwards compatible, and does not change the functionality if either the client or the server is running an older or newer version. If the client side is running the older version, both REQ_PREFLUSH and REQ_FUA are converted to RNBD_F_FUA. A server running the newer version would still add only the REQ_FUA flag, which is what happens when both client and server are running the older version. If the client side is running the newer version, RNBD_F_FUA is added just like before, but now RNBD_F_PREFLUSH is also added to the rnbd_opf. If the server is running the older version, RNBD_F_PREFLUSH is ignored and only RNBD_F_FUA is processed. Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com> Reviewed-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com> Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
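The compatibility cases described above can be checked with a small model. This is a sketch of the described behaviour only: the `SK_*` bit values and the functions `client_to_wire()` and `server_from_wire()` are hypothetical and do not reproduce the definitions in the RNBD protocol header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, for illustration only. */
#define SK_REQ_PREFLUSH    (1u << 0)
#define SK_REQ_FUA         (1u << 1)
#define SK_RNBD_F_FUA      (1u << 4)
#define SK_RNBD_F_PREFLUSH (1u << 5)

/* Client side: convert request flags to wire flags. */
static uint32_t client_to_wire(uint32_t req_flags, bool new_proto)
{
    uint32_t wire = 0;

    /* Old and new clients alike: any flush-type flag sets RNBD_F_FUA. */
    if (req_flags & (SK_REQ_PREFLUSH | SK_REQ_FUA))
        wire |= SK_RNBD_F_FUA;
    /* Only a new client additionally carries PREFLUSH on the wire. */
    if (new_proto && (req_flags & SK_REQ_PREFLUSH))
        wire |= SK_RNBD_F_PREFLUSH;
    return wire;
}

/* Server side: convert wire flags back to request flags. */
static uint32_t server_from_wire(uint32_t wire, bool new_proto)
{
    uint32_t req = 0;

    if (wire & SK_RNBD_F_FUA)
        req |= SK_REQ_FUA;
    /* An old server does not know the PREFLUSH bit and ignores it. */
    if (new_proto && (wire & SK_RNBD_F_PREFLUSH))
        req |= SK_REQ_PREFLUSH;
    return req;
}
```

In this model a PREFLUSH-only request ends up as REQ_FUA alone whenever either side is old, which matches the pre-change behaviour, and regains REQ_PREFLUSH only when both sides are new.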
2026-01-02Merge tag 'block-6.19-20260102' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - Scan partition tables asynchronously for ublk, similarly to how nvme does it. This avoids potential deadlocks, which is why nvme does it that way too. Includes a set of selftests as well. - MD pull request via Yu: - Fix null-pointer dereference in raid5 sysfs group_thread_cnt store (Tuo Li) - Fix possible mempool corruption during raid1 raid_disks update via sysfs (FengWei Shih) - Fix logical_block_size configuration being overwritten during super_1_validate() (Li Nan) - Fix forward incompatibility with configurable logical block size: arrays assembled on new kernels could not be assembled on older kernels (v6.18 and before) due to non-zero reserved pad rejection (Li Nan) - Fix static checker warning about iterator not incremented (Li Nan) - Skip CPU offlining notifications on unmapped hardware queues - bfq-iosched block stats fix - Fix outdated comment in bfq-iosched * tag 'block-6.19-20260102' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: block, bfq: update outdated comment blk-mq: skip CPU offline notify on unmapped hctx selftests/ublk: fix Makefile to rebuild on header changes selftests/ublk: add test for async partition scan ublk: scan partition in async way block,bfq: fix aux stat accumulation destination md: Fix forward incompatibility from configurable logical block size md: Fix logical_block_size configuration being overwritten md: suspend array while updating raid_disks via sysfs md/raid5: fix possible null-pointer dereferences in raid5_store_group_thread_cnt() md: Fix static checker warning in analyze_sbs
2025-12-29null_blk: Constify struct configfs_item_operations and configfs_group_operationsChristophe JAILLET
'struct configfs_item_operations' and 'configfs_group_operations' are not modified in this driver. Constifying these structures moves some data to a read-only section, which increases overall security, especially when the structure holds function pointers. On x86_64, with allmodconfig: Before: ====== text data bss dec hex filename 100263 37808 2752 140823 22617 drivers/block/null_blk/main.o After: ===== text data bss dec hex filename 100423 37648 2752 140823 22617 drivers/block/null_blk/main.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-28brd: replace simple_strtol with kstrtoul in ramdisk_sizeThorsten Blum
Replace simple_strtol() with the recommended kstrtoul() for parsing the 'ramdisk_size=' boot parameter. Unlike simple_strtol(), which returns a long, kstrtoul() converts the string directly to an unsigned long and avoids implicit casting. Check the return value of kstrtoul() and reject invalid values. This adds error handling while preserving behavior for existing values, and removes use of the deprecated simple_strtol() helper. The current code silently sets 'rd_size = 0' if parsing fails, instead of leaving the default value (CONFIG_BLK_DEV_RAM_SIZE) unchanged. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Jens Axboe <axboe@kernel.dk>
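The same idea, parse strictly and leave the default untouched on failure, can be sketched in userspace. kstrtoul() is a kernel helper, so the hypothetical `parse_size()` below uses the standard strtoul() instead and is deliberately a little stricter than kstrtoul() (for example, it rejects a leading '+' and a trailing newline).

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse an unsigned size from a string.
 * Returns 0 and stores the value on success; returns -1 and leaves
 * *out untouched on failure, so the caller keeps its default.
 */
static int parse_size(const char *s, unsigned long *out)
{
    char *end;
    unsigned long v;

    /* strtoul() skips whitespace and accepts a sign; require a leading digit. */
    if (!isdigit((unsigned char)*s))
        return -1;

    errno = 0;
    v = strtoul(s, &end, 0);        /* base 0: accepts decimal, 0x hex, 0 octal */
    if (errno != 0 || *end != '\0') /* overflow or trailing garbage */
        return -1;

    *out = v;
    return 0;
}
```

A caller would initialise its variable to the default first and overwrite it only when `parse_size()` returns 0.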
2025-12-28rnull: replace `kernel::c_str!` with C-StringsTamir Duberstein
C-String literals were added in Rust 1.77. Replace instances of `kernel::c_str!` with C-String literals where possible. Signed-off-by: Tamir Duberstein <tamird@gmail.com> Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-28ublk: scan partition in async wayMing Lei
Implement async partition scan to avoid IO hang when reading partition tables. Similar to nvme_partition_scan_work(), partition scanning is deferred to a work queue to prevent deadlocks. When partition scan happens synchronously during add_disk(), IO errors can cause the partition scan to wait while holding ub->mutex, which can deadlock with other operations that need the mutex. Changes: - Add partition_scan_work to ublk_device structure - Implement ublk_partition_scan_work() to perform async scan - Always suppress sync partition scan during add_disk() - Schedule async work after add_disk() for trusted daemons - Add flush_work() in ublk_stop_dev() before grabbing ub->mutex Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reported-by: Yoav Cohen <yoav@nvidia.com> Closes: https://lore.kernel.org/linux-block/DM4PR12MB63280C5637917C071C2F0D65A9A8A@DM4PR12MB6328.namprd12.prod.outlook.com/ Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>