| Age | Commit message | Author |
|
Handle UBLK_U_IO_COMMIT_IO_CMDS by walking the uring_cmd fixed buffer:
- read each element into one temp buffer in batch style
- parse and apply each element for committing io result
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
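The two steps listed in this commit (read elements into a temp buffer in batches, then parse and apply each one) can be sketched in user-space C. This is only an illustration under stated assumptions: `struct toy_elem`, `TOY_TMP_ELEMS`, `toy_walk_cmd_buf()` and `toy_commit_one()` are hypothetical names, the element layout is invented, and `memcpy()` stands in for however the driver reads the uring_cmd fixed buffer.

```c
#include <stddef.h>
#include <string.h>

#define TOY_TMP_ELEMS 4	/* hypothetical temp-buffer capacity, in elements */

/* Hypothetical fixed-size element; the real layout comes from the ublk UAPI. */
struct toy_elem {
	unsigned short tag;
	int result;
};

typedef int (*toy_elem_fn)(const struct toy_elem *elem, void *data);

/*
 * Walk nr_elem elements stored back to back in buf. Elements are copied
 * into a small temp buffer one batch at a time, then handed to cb one by
 * one. Returns 0 on success or the first nonzero value returned by cb.
 */
static int toy_walk_cmd_buf(const void *buf, size_t nr_elem,
			    toy_elem_fn cb, void *data)
{
	struct toy_elem tmp[TOY_TMP_ELEMS];
	const unsigned char *src = buf;
	size_t done = 0;

	while (done < nr_elem) {
		size_t n = nr_elem - done;

		if (n > TOY_TMP_ELEMS)
			n = TOY_TMP_ELEMS;
		/* batch read: one copy per batch, not one per element */
		memcpy(tmp, src + done * sizeof(tmp[0]), n * sizeof(tmp[0]));

		for (size_t i = 0; i < n; i++) {
			int ret = cb(&tmp[i], data);

			if (ret)
				return ret;
		}
		done += n;
	}
	return 0;
}

/* Example callback: accumulate the committed results. */
static int toy_commit_one(const struct toy_elem *elem, void *data)
{
	*(long *)data += elem->result;
	return 0;
}
```

Calling `toy_walk_cmd_buf(elems, 10, toy_commit_one, &total)` on ten elements visits them in three batches (4, 4, 2). The temp buffer stays small and fixed regardless of how many elements are submitted.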
|
|
This commit implements the handling of the UBLK_U_IO_PREP_IO_CMDS command,
which allows userspace to prepare a batch of I/O requests.
The core of this change is the `ublk_walk_cmd_buf` function, which iterates
over the elements in the uring_cmd fixed buffer. For each element, it parses
the I/O details, finds the corresponding `ublk_io` structure, and prepares it
for future dispatch.
Add per-io lock for protecting concurrent delivery and committing.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add new command UBLK_U_IO_PREP_IO_CMDS, which is the batch version of
UBLK_IO_FETCH_REQ.
Add new command UBLK_U_IO_COMMIT_IO_CMDS, which only commits io command
results, again as a batch.
The new command header type is `struct ublk_batch_io`.
This patch doesn't actually implement these commands yet, just validates the
SQE fields.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
batch io is designed to be independent of task context, and we will not
track task context for the batch io feature.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Introduces the basic structure for a batched I/O feature in the ublk driver.
It adds placeholder functions and a new file operations structure,
ublk_ch_batch_io_fops, which will be used for fetching and committing I/O
commands in batches. Currently, the feature is disabled.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When ublksrv runs inside a pid namespace, START/END_RECOVERY compared
the stored init-ns tgid against the userspace pid (getpid vnr), so the
check failed and control ops could not proceed. Compare against the
caller’s init-ns tgid and store that value, then translate it back to
the caller’s pid namespace when reporting GET_DEV_INFO so ublk list
shows a sensible pid.
Testing: start/recover in a pid namespace; `ublk list` shows
reasonable pid values in init, child, and sibling namespaces.
Fixes: c2c8089f325e ("ublk: validate ublk server pid")
Signed-off-by: Seamus Connor <sconnor@purestorage.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Patch series "Add ARRAY_END(), and use it to fix off-by-one bugs", v6.
Add ARRAY_END(), and use it to fix off-by-one bugs
ARRAY_END() is a macro to calculate a pointer to one past the last element
of an array argument. This is a very common pointer, which is used to
iterate over all elements of an array:
for (T *p = a; p < ARRAY_END(a); p++)
...
Of course, a pointer one past the last element of an array should never
be dereferenced. It's perfectly fine to hold such a pointer (and a good
thing to do), but the only thing it should be used for is comparing it
with other pointers derived from the same array.
Due to how special these pointers are, it would be good to use consistent
naming. It's common to name such a pointer 'end' --in fact, we have many
such cases in the kernel--. C++ even standardized this name with
std::end(). Let's try naming such pointers 'end', and also try to avoid
using 'end' for pointers that are not the result of ARRAY_END().
It has been incorrectly suggested that these pointers are dangerous, and
that they should never be used, suggesting to use something like
#define ARRAY_LAST(a) ((a) + ARRAY_SIZE(a) - 1)
for (T *p = a; p <= ARRAY_LAST(a); p++)
...
This is bogus, as it doesn't scale down to arrays of 0 elements. In the
case of an array of 0 elements, ARRAY_LAST() would underflow the pointer,
which not only can't be dereferenced but can't even be held (it
produces Undefined Behavior). That would be a footgun. Such arrays don't
exist per the ISO C standard; however, GCC supports them as an extension
(with partial support, though; GCC has a few bugs which need to be fixed).
This patch set fixes a few places where it was intended to use the array
end (that is, one past the last element), but accidentally a pointer to
the last element was used instead, thus wasting one byte.
It also replaces other places where the array end was correctly calculated
with ARRAY_SIZE(), by using the simpler ARRAY_END().
Also, there was one drivers/ file that already defined this macro. We
remove that definition, to not conflict with this one.
This patch (of 4):
ARRAY_END() returns a pointer one past the end of the last element in the
array argument. This pointer is useful for iterating over the elements of
an array:
for (T *p = a; p < ARRAY_END(a); p++)
...
Link: https://lkml.kernel.org/r/cover.1765449750.git.alx@kernel.org
Link: https://lkml.kernel.org/r/5973cfb674192bc8e533485dbfb54e3062896be1.1765449750.git.alx@kernel.org
Signed-off-by: Alejandro Colomar <alx@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Christopher Bazley <chris.bazley.wg14@gmail.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Marco Elver <elver@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
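A minimal user-space sketch of the ARRAY_END() idiom described in this series. The two macros below are simplified stand-ins written for this example (the kernel definitions add compile-time type checks), and `toy_table`, `toy_sum()` and `toy_capacity()` are hypothetical names.

```c
#include <stddef.h>

/* Simplified stand-ins; not the kernel's exact definitions. */
#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))
#define ARRAY_END(a)	((a) + ARRAY_SIZE(a))

static const int toy_table[] = { 1, 2, 3, 4 };

/* Iterate with a half-open range [toy_table, end): 'end' is never dereferenced. */
static int toy_sum(void)
{
	int sum = 0;

	for (const int *p = toy_table; p < ARRAY_END(toy_table); p++)
		sum += *p;
	return sum;
}

/* Number of bytes between the start of buf and its one-past-the-end pointer. */
static size_t toy_capacity(void)
{
	char buf[16];
	char *end = ARRAY_END(buf);

	return (size_t)(end - buf);
}
```

Here `toy_capacity()` yields the full 16 bytes. Computing the bound from a pointer to the last element instead would give 15, which is the "wasting one byte" case the series fixes.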
|
|
The KMSG_COMPONENT macro is a leftover of the s390 specific "kernel
message catalog" from 2008 [1] which never made it upstream.
The macro was added to s390 code to allow for an out-of-tree patch which
used this to generate unique message ids. That out-of-tree patch no
longer exists either.
The same KMSG_COMPONENT usage pattern was also partially adopted in
non-s390-specific code, for whatever reason.
Remove the macro in order to get rid of a pointless indirection.
Link: https://lkml.kernel.org/r/20251126143602.2207435-1-hca@linux.ibm.com
Link: https://lwn.net/Articles/292650/ [1]
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
pp_in_progress makes sure that only one post-processing (writeback or
recompression) is active at any given time. Functionality-wise, it
basically shadows zram init_lock when init_lock is acquired in writer
mode.
Switch recompress_store() and writeback_store() to take zram init_lock in
writer mode, like all store() sysfs handlers should do, so that we can
drop pp_in_progress. Recompression and writeback can be somewhat slow, so
holding init_lock in writer mode can block zram attrs reads, but in
reality the only zram attrs reads that take place are mm_stat reads, and
usually it's the same process that reads mm_stat and does recompression or
writeback.
Link: https://lkml.kernel.org/r/20251216071342.687993-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
ac_time is now in seconds, do not use ktime_to_timespec64()
[akpm@linux-foundation.org: remove now-unused local `ts']
[akpm@linux-foundation.org: fix build]
Link: https://lkml.kernel.org/r/20260115033031.3818977-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reported-by: Chris Mason <clm@meta.com>
Closes: https://lkml.kernel.org/r/20260114124522.1326519-1-clm@meta.com
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Stevens <stevensd@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
A minor fixup of 80-cols breakage in recompress_slot() comment and
zs_malloc() call.
Link: https://lkml.kernel.org/r/ff3254847dbdc6fbd2e3fed53c572a261d60b7b6.1765775954.git.senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Stevens <stevensd@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Cc: Chris Mason <clm@meta.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We have a somewhat confusing internal API naming. E.g. the following
code:
zram_slot_lock()
if (zram_allocated())
zram_set_flag()
zram_slot_unlock()
may look like it does something on zram device level, but in fact it tests
and sets slot entry flags, not the device ones.
Rename API to explicitly distinguish functions that operate on the slot
level from functions that operate on the zram device level.
While at it, fixup some coding styles.
[senozhatsky@chromium.org: fix up mark_slot_accessed()]
Link: https://lkml.kernel.org/r/20260115031922.3813659-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/775a0b1a0ace5caf1f05965d8bc637c1192820fa.1765775954.git.senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Stevens <stevensd@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We can reduce sizeof(zram_table_entry) on 64-bit systems by converting
flags and ac_time to u32. Entry flags fit into u32, and for ac_time u32
gives us over a century of entry lifespan (approx 136 years) which is
plenty (zram uses system boot time (seconds)).
In struct zram_table_entry we use byte aliasing, because the bit-wait API
(for slot lock) requires a whole unsigned long word.
Link: https://lkml.kernel.org/r/d7c0b48450c70eeb5fd8acd6ecd23593f30dbf1f.1765775954.git.senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Suggested-by: David Stevens <stevensd@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
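The "approx 136 years" figure quoted in this commit can be checked with a few lines of C. This is just the arithmetic, assuming 365-day years; `TOY_SECONDS_PER_YEAR` and `toy_u32_seconds_in_years()` are names made up for the example.

```c
#include <stdint.h>

/* Seconds in a 365-day year (leap days ignored for this estimate). */
#define TOY_SECONDS_PER_YEAR	(365ULL * 24 * 60 * 60)

/* Whole years representable by a u32 counter that ticks once per second. */
static unsigned int toy_u32_seconds_in_years(void)
{
	return (unsigned int)(UINT32_MAX / TOY_SECONDS_PER_YEAR);
}
```

4294967295 / 31536000 is roughly 136.19, so the function returns 136 whole years.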
|
|
Do not spread device attributes declarations across the file, move
io_stat, mm_stat, debug_stat to a common device-attr section.
Link: https://lkml.kernel.org/r/20251201094754.4149975-8-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Stevens <stevensd@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use init_lock guard() in sysfs store/show handlers, in order to simplify
and, more importantly, to modernize the code.
While at it, fix up more coding styles.
Link: https://lkml.kernel.org/r/20251201094754.4149975-7-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Stevens <stevensd@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
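For readers unfamiliar with scope-based guards, the idea behind guard() can be sketched in user space with the GCC/Clang `cleanup` variable attribute. This is not the kernel's guard() implementation and does not use a real lock; `struct toy_lock`, `toy_guard()` and `toy_store()` are hypothetical names used only to show how early returns stop needing explicit unlock calls.

```c
#include <errno.h>

/* Toy "lock": just counts how many holders there are. */
struct toy_lock {
	int held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	l->held++;
}

/* Cleanup handler: receives a pointer to the guarded variable. */
static void toy_lock_release(struct toy_lock **lp)
{
	(*lp)->held--;
}

/* Acquire now, release automatically when the enclosing scope exits. */
#define toy_guard(l)							\
	struct toy_lock *toy_guard_ptr					\
		__attribute__((cleanup(toy_lock_release), unused)) =	\
		(toy_lock_acquire(l), (l))

/* A store()-style handler: every return path drops the lock. */
static int toy_store(struct toy_lock *l, int *dst, int val)
{
	toy_guard(l);

	if (val < 0)
		return -EINVAL;	/* no explicit unlock needed */

	*dst = val;
	return 0;
}
```

Both the success path and the `-EINVAL` path leave `held` at zero, which is the simplification the commit is after: error paths can return directly.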
|
|
We don't free a page in zram_free_page(); not all slots even have memory
associated with them (e.g. ZRAM_SAME). We free the slot (or reset it), so
rename the function accordingly.
Link: https://lkml.kernel.org/r/20251201094754.4149975-6-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Stevens <stevensd@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Move bd_stat function and attribute declaration to
existing CONFIG_WRITEBACK ifdef-sections.
Link: https://lkml.kernel.org/r/20251201094754.4149975-5-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Stevens <stevensd@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Introduce writeback_compressed device attribute to toggle compressed
writeback (decompression on demand) feature.
[senozhatsky@chromium.org: rewrote original patch, added documentation]
Link: https://lkml.kernel.org/r/20251201094754.4149975-3-senozhatsky@chromium.org
Signed-off-by: Richard Chang <richardycc@google.com>
Co-developed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Stevens <stevensd@google.com>
Cc: Minchan Kim <minchan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "zram: introduce compressed data writeback", v2.
As writeback becomes more common there is another shortcoming that needs
to be addressed - compressed data writeback. Currently zram does
uncompressed data writeback which is not optimal due to potential CPU and
battery wastage. This series changes suboptimal uncompressed writeback to
a more optimal compressed data writeback.
This patch (of 7):
zram stores all written back slots raw, which implies that during
writeback zram first has to decompress slots (except for ZRAM_HUGE slots,
which are raw already). The problem with this approach is that not every
written back page gets read back (either via read() or via page-fault),
which means that zram basically wastes CPU cycles and battery
decompressing such slots. This changes with introduction of decompression
on demand, in other words decompression on read()/page-fault.
One caveat of decompression on demand is that async read is completed in
IRQ context, while zram decompression is sleepable. To work around this,
read-back decompression is offloaded to a preemptible context - system
high-prio work-queue.
At this point compressed writeback is still disabled, a follow up patch
will introduce a new device attribute which will make it possible to
toggle compressed writeback per-device.
[senozhatsky@chromium.org: rewrote original implementation]
Link: https://lkml.kernel.org/r/20251201094754.4149975-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20251201094754.4149975-2-senozhatsky@chromium.org
Signed-off-by: Richard Chang <richardycc@google.com>
Co-developed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Suggested-by: Minchan Kim <minchan@google.com>
Suggested-by: Brian Geffon <bgeffon@google.com>
Cc: David Stevens <stevensd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- Device quirk to disable faulty temperature (Ilikara)
- TCP target null pointer fix from bad host protocol usage (Shivam)
- Add apple,t8103-nvme-ans2 as a compatible apple controller
(Janne)
- FC tagset leak fix (Chaitanya)
- TCP socket deadlock fix (Hannes)
- Target name buffer overrun fix (Shin'ichiro)
- Fix for an underflow for rnbd during device unmap
- Zero the non-PI part of the auto integrity buffer
- Fix for a configfs memory leak in the null block driver
* tag 'block-6.19-20260116' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
rnbd-clt: fix refcount underflow in device unmap path
nvme: fix PCIe subsystem reset controller state transition
nvmet: do not copy beyond sybsysnqn string length
nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready()
null_blk: fix kmemleak by releasing references to fault configfs items
block: zero non-PI portion of auto integrity buffer
nvme-fc: release admin tagset if init fails
nvme-apple: add "apple,t8103-nvme-ans2" as compatible
nvme-tcp: fix NULL pointer dereferences in nvmet_tcp_build_pdu_iovec
nvme-pci: disable secondary temp for Wodposit WPBSNM8
|
|
During device unmapping (triggered by module unload or explicit unmap),
a refcount underflow occurs causing a use-after-free warning:
[14747.574913] ------------[ cut here ]------------
[14747.574916] refcount_t: underflow; use-after-free.
[14747.574917] WARNING: lib/refcount.c:28 at refcount_warn_saturate+0x55/0x90, CPU#9: kworker/9:1/378
[14747.574924] Modules linked in: rnbd_client(-) rtrs_client rnbd_server rtrs_server rtrs_core ...
[14747.574998] CPU: 9 UID: 0 PID: 378 Comm: kworker/9:1 Tainted: G O N 6.19.0-rc3lblk-fnext+ #42 PREEMPT(voluntary)
[14747.575005] Workqueue: rnbd_clt_wq unmap_device_work [rnbd_client]
[14747.575010] RIP: 0010:refcount_warn_saturate+0x55/0x90
[14747.575037] Call Trace:
[14747.575038] <TASK>
[14747.575038] rnbd_clt_unmap_device+0x170/0x1d0 [rnbd_client]
[14747.575044] process_one_work+0x211/0x600
[14747.575052] worker_thread+0x184/0x330
[14747.575055] ? __pfx_worker_thread+0x10/0x10
[14747.575058] kthread+0x10d/0x250
[14747.575062] ? __pfx_kthread+0x10/0x10
[14747.575066] ret_from_fork+0x319/0x390
[14747.575069] ? __pfx_kthread+0x10/0x10
[14747.575072] ret_from_fork_asm+0x1a/0x30
[14747.575083] </TASK>
[14747.575096] ---[ end trace 0000000000000000 ]---
Before this patch:
The bug is a double kobject_put() on dev->kobj during device cleanup.
Kobject Lifecycle:
kobject_init_and_add() sets kobj.kref = 1 (initialization)
kobject_put() sets kobj.kref = 0 (should be called once)
* Before this patch:
rnbd_clt_unmap_device()
rnbd_destroy_sysfs()
kobject_del(&dev->kobj) [remove from sysfs]
kobject_put(&dev->kobj) PUT #1 (WRONG!)
kref: 1 to 0
rnbd_dev_release()
kfree(dev) [DEVICE FREED!]
rnbd_destroy_gen_disk() [use-after-free!]
rnbd_clt_put_dev()
refcount_dec_and_test(&dev->refcount)
kobject_put(&dev->kobj) PUT #2 (UNDERFLOW!)
kref: 0 to -1 [WARNING!]
The first kobject_put() in rnbd_destroy_sysfs() prematurely frees the
device via rnbd_dev_release(), then the second kobject_put() in
rnbd_clt_put_dev() causes refcount underflow.
* After this patch:
Remove kobject_put() from rnbd_destroy_sysfs(). This function should
only remove sysfs visibility (kobject_del), not manage object lifetime.
Call Graph (FIXED):
rnbd_clt_unmap_device()
rnbd_destroy_sysfs()
kobject_del(&dev->kobj) [remove from sysfs only]
[kref unchanged: 1]
rnbd_destroy_gen_disk() [device still valid]
rnbd_clt_put_dev()
refcount_dec_and_test(&dev->refcount)
kobject_put(&dev->kobj) ONLY PUT (CORRECT!)
kref: 1 to 0 [BALANCED]
rnbd_dev_release()
kfree(dev) [CLEAN DESTRUCTION]
This follows the kernel pattern where sysfs removal (kobject_del) is
separate from object destruction (kobject_put).
Fixes: 581cf833cac4 ("block: rnbd: add .release to rnbd_dev_ktype")
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
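The two call graphs in this commit can be modelled with a toy reference counter. The sketch below is a user-space illustration only: `struct toy_obj` and the `toy_*()` functions are hypothetical, the object is never actually freed (a flag records where release would run), and the underflow flag stands in for the refcount_t warning.

```c
#include <stdbool.h>

struct toy_obj {
	int kref;	/* models kobj.kref */
	bool released;	/* set where the real release() would kfree() */
	bool underflow;	/* set where refcount_t would warn */
};

static void toy_obj_init(struct toy_obj *o)
{
	o->kref = 1;
	o->released = false;
	o->underflow = false;
}

static void toy_put(struct toy_obj *o)
{
	if (o->kref <= 0) {
		o->underflow = true;	/* put on an already-dead object */
		return;
	}
	if (--o->kref == 0)
		o->released = true;
}

/* Before the fix: sysfs teardown also drops the reference. */
static void toy_unmap_buggy(struct toy_obj *o)
{
	toy_put(o);	/* PUT #1, in the sysfs teardown step */
	toy_put(o);	/* PUT #2, in the final put step */
}

/* After the fix: sysfs teardown only hides the object. */
static void toy_unmap_fixed(struct toy_obj *o)
{
	/* sysfs removal would happen here, with no put */
	toy_put(o);	/* the only put */
}
```

With one initial reference, the buggy sequence releases the object on the first put and trips the underflow check on the second. The fixed sequence releases exactly once.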
|
|
When CONFIG_BLK_DEV_NULL_BLK_FAULT_INJECTION is enabled, the null-blk
driver sets up fault injection support by creating the timeout_inject,
requeue_inject, and init_hctx_fault_inject configfs items as children
of the top-level nullbX configfs group.
However, when the nullbX device is removed, the references taken to
these fault-config configfs items are not released. As a result,
kmemleak reports a memory leak, for example:
unreferenced object 0xc00000021ff25c40 (size 32):
comm "mkdir", pid 10665, jiffies 4322121578
hex dump (first 32 bytes):
69 6e 69 74 5f 68 63 74 78 5f 66 61 75 6c 74 5f init_hctx_fault_
69 6e 6a 65 63 74 00 88 00 00 00 00 00 00 00 00 inject..........
backtrace (crc 1a018c86):
__kmalloc_node_track_caller_noprof+0x494/0xbd8
kvasprintf+0x74/0xf4
config_item_set_name+0xf0/0x104
config_group_init_type_name+0x48/0xfc
fault_config_init+0x48/0xf0
0xc0080000180559e4
configfs_mkdir+0x304/0x814
vfs_mkdir+0x49c/0x604
do_mkdirat+0x314/0x3d0
sys_mkdir+0xa0/0xd8
system_call_exception+0x1b0/0x4f0
system_call_vectored_common+0x15c/0x2ec
Fix this by explicitly releasing the references to the fault-config
configfs items when dropping the reference to the top-level nullbX
configfs group.
Cc: stable@vger.kernel.org
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Fixes: bb4c19e030f4 ("block: null_blk: make fault-injection dynamically configurable per device")
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a best-effort stop command, UBLK_CMD_TRY_STOP_DEV, which only stops a
ublk device when it has no active openers.
Unlike UBLK_CMD_STOP_DEV, this command does not disrupt existing users.
New opens are blocked only after disk_openers has reached zero; if the
device is busy, the command returns -EBUSY and leaves it running.
The ub->block_open flag is used only to close a race with an in-progress
open and does not otherwise change open behavior.
Advertise support via the UBLK_F_SAFE_STOP_DEV feature flag.
Signed-off-by: Yoav Cohen <yoav@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
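The best-effort semantics described in this commit can be sketched as a small state model. Everything here is an assumption made for illustration: `struct toy_ublk_dev`, `toy_try_stop()` and `toy_open()` are hypothetical, the `-ENXIO` returned for a blocked open is a guess rather than the driver's documented behaviour, and the locking that closes the race with an in-progress open is left out.

```c
#include <errno.h>
#include <stdbool.h>

struct toy_ublk_dev {
	unsigned int openers;	/* models disk_openers */
	bool block_open;	/* set once a stop has been committed to */
	bool running;
};

/* Stop only if nobody has the device open; otherwise leave it untouched. */
static int toy_try_stop(struct toy_ublk_dev *ub)
{
	if (ub->openers > 0)
		return -EBUSY;

	ub->block_open = true;	/* new opens are refused from here on */
	ub->running = false;
	return 0;
}

static int toy_open(struct toy_ublk_dev *ub)
{
	if (ub->block_open)
		return -ENXIO;	/* assumed error code, see lead-in */

	ub->openers++;
	return 0;
}
```

A caller that gets `-EBUSY` can retry later. The device keeps running, and existing openers are not disturbed.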
|
|
This function always returns 0, so there is no need to return a value.
Signed-off-by: Yoav Cohen <yoav@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
ublk user copy syscalls may be issued from any task, so they take a
reference count on the struct ublk_io to check whether it is owned by
the ublk server and prevent a concurrent UBLK_IO_COMMIT_AND_FETCH_REQ
from completing the request. However, if the user copy syscall is issued
on the io's daemon task, a concurrent UBLK_IO_COMMIT_AND_FETCH_REQ isn't
possible, so the atomic reference count dance is unnecessary. Check for
UBLK_IO_FLAG_OWNED_BY_SRV to ensure the request is dispatched to the
server and obtain the request from ublk_io's req field instead of looking
it up on the tagset. Skip the reference count increment and decrement.
Commit 8a8fe42d765b ("ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon
task") made an analogous optimization for ublk zero copy buffer
registration.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Now that all the components of the ublk integrity feature have been
implemented, add UBLK_F_INTEGRITY to UBLK_F_ALL, conditional on block
layer integrity support (CONFIG_BLK_DEV_INTEGRITY). This allows ublk
servers to create ublk devices with UBLK_F_INTEGRITY set and
UBLK_U_CMD_GET_FEATURES to report the feature as supported.
Signed-off-by: Stanley Zhang <stazhang@purestorage.com>
[csander: make feature conditional on CONFIG_BLK_DEV_INTEGRITY]
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a function ublk_copy_user_integrity() to copy integrity information
between a request and a user iov_iter. This mirrors the existing
ublk_copy_user_pages() but operates on request integrity data instead of
regular data. Check UBLKSRV_IO_INTEGRITY_FLAG in iocb->ki_pos in
ublk_user_copy() to choose between copying data or integrity data.
[csander: change offset units from data bytes to integrity data bytes,
fix CONFIG_BLK_DEV_INTEGRITY=n build, rebase on user copy refactor]
Signed-off-by: Stanley Zhang <stazhang@purestorage.com>
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
__ublk_check_and_get_req() checks that the passed in offset is within
the data length of the specified ublk request. However, only user copy
(ublk_check_and_get_req()) supports accessing ublk request data at a
nonzero offset. Zero-copy buffer registration (ublk_register_io_buf())
always passes 0 for the offset, so the check is unnecessary. Move the
check from __ublk_check_and_get_req() to ublk_check_and_get_req().
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
ublk_check_and_get_req() has a single callsite in ublk_user_copy(). It
takes a ton of arguments in order to pass local variables from
ublk_user_copy() to ublk_check_and_get_req() and vice versa. And more
are about to be added. Combine the functions to reduce the argument
passing noise.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
ublk_ch_read_iter() and ublk_ch_write_iter() are nearly identical except
for the iter direction. Split out a helper function ublk_user_copy() to
reduce the code duplication as these functions are about to get larger.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Factor a helper function ublk_copy_user_bvec() out of
ublk_copy_user_pages(). It will be used for copying integrity data too.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Indicate to the ublk server when an incoming request has integrity data
by setting UBLK_IO_F_INTEGRITY in the ublksrv_io_desc's op_flags field.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a feature flag UBLK_F_INTEGRITY for a ublk server to request
integrity/metadata support when creating a ublk device. The ublk server
can also check for the feature flag on the created device or the result
of UBLK_U_CMD_GET_FEATURES to tell if the ublk driver supports it.
UBLK_F_INTEGRITY requires UBLK_F_USER_COPY, as user copy is the only
data copy mode initially supported for integrity data.
Add UBLK_PARAM_TYPE_INTEGRITY and struct ublk_param_integrity to struct
ublk_params to specify the integrity params of a ublk device.
UBLK_PARAM_TYPE_INTEGRITY requires UBLK_F_INTEGRITY and a nonzero
metadata_size. The LBMD_PI_CAP_* and LBMD_PI_CSUM_* values from the
linux/fs.h UAPI header are used for the flags and csum_type fields.
If the UBLK_PARAM_TYPE_INTEGRITY flag is set, validate the integrity
parameters and apply them to the blk_integrity limits.
The struct ublk_param_integrity validations are based on the checks in
blk_validate_integrity_limits(). Any invalid parameters should be
rejected before being applied to struct blk_integrity.
[csander: drop redundant pi_tuple_size field, use block metadata UAPI
constants, add param validation]
Signed-off-by: Stanley Zhang <stazhang@purestorage.com>
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
ublk_dev_support_user_copy() will be used in ublk_validate_params().
Move these functions next to ublk_{dev,queue}_is_zoned() to avoid
needing to forward-declare them.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Merge in fixes that went to 6.19 after for-7.0/block was branched.
Pending ublk changes depend particularly on the async scan work.
* block-6.19:
block: zero non-PI portion of auto integrity buffer
ublk: fix use-after-free in ublk_partition_scan_work
blk-mq: avoid stall during boot due to synchronize_rcu_expedited
loop: add missing bd_abort_claiming in loop_set_status
block: don't merge bios with different app_tags
blk-rq-qos: Remove unlikely() hints from QoS checks
loop: don't change loop device under exclusive opener in loop_set_status
block, bfq: update outdated comment
blk-mq: skip CPU offline notify on unmapped hctx
selftests/ublk: fix Makefile to rebuild on header changes
selftests/ublk: add test for async partition scan
ublk: scan partition in async way
block,bfq: fix aux stat accumulation destination
md: Fix forward incompatibility from configurable logical block size
md: Fix logical_block_size configuration being overwritten
md: suspend array while updating raid_disks via sysfs
md/raid5: fix possible null-pointer dereferences in raid5_store_group_thread_cnt()
md: Fix static checker warning in analyze_sbs
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:
- Kill unlikely checks for blk-rq-qos. These checks are really
all-or-nothing, either the branch is taken all the time, or it's not.
Depending on the configuration, either one of those cases may be
true. Just remove the annotation
- Fix for merging bios with different app tags set
- Fix for a recently introduced slowdown due to RCU synchronization
- Fix for a status change on loop while it's in use, and then a later
fix for that fix
- Fix for the async partition scanning in ublk
* tag 'block-6.19-20260109' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
ublk: fix use-after-free in ublk_partition_scan_work
blk-mq: avoid stall during boot due to synchronize_rcu_expedited
loop: add missing bd_abort_claiming in loop_set_status
block: don't merge bios with different app_tags
blk-rq-qos: Remove unlikely() hints from QoS checks
loop: don't change loop device under exclusive opener in loop_set_status
|
|
A race condition exists between the async partition scan work and device
teardown that can lead to a use-after-free of ub->ub_disk:
1. ublk_ctrl_start_dev() schedules partition_scan_work after add_disk()
2. ublk_stop_dev() calls ublk_stop_dev_unlocked() which does:
- del_gendisk(ub->ub_disk)
- ublk_detach_disk() sets ub->ub_disk = NULL
- put_disk() which may free the disk
3. The worker ublk_partition_scan_work() then dereferences ub->ub_disk
leading to UAF
Fix this by using ublk_get_disk()/ublk_put_disk() in the worker to hold
a reference to the disk during the partition scan. The spinlock in
ublk_get_disk() synchronizes with ublk_detach_disk() ensuring the worker
either gets a valid reference or sees NULL and exits early.
Also change flush_work() to cancel_work_sync() to avoid running the
partition scan work unnecessarily when the disk is already detached.
Fixes: 7fc4da6a304b ("ublk: scan partition in async way")
Reported-by: Ruikai Peng <ruikai@pwno.io>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
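The "get a valid reference or see NULL and exit early" pattern from this fix can be sketched as below. This is a single-threaded toy model, not the driver code: the `toy_*` names are hypothetical, and the spinlock that makes the read-and-increment atomic with respect to detach in the real helper is only indicated by a comment.

```c
#include <stddef.h>

struct toy_disk {
	int refs;
	int scans;
};

struct toy_ub {
	struct toy_disk *disk;	/* cleared on detach */
};

/* Take a reference if the disk is still attached, else return NULL. */
static struct toy_disk *toy_get_disk(struct toy_ub *ub)
{
	/* the real helper holds a spinlock around these two steps */
	struct toy_disk *disk = ub->disk;

	if (disk)
		disk->refs++;
	return disk;
}

static void toy_put_disk(struct toy_disk *disk)
{
	if (disk)
		disk->refs--;
}

static void toy_detach_disk(struct toy_ub *ub)
{
	ub->disk = NULL;
}

/* Worker body: returns 1 if a scan ran, 0 if the disk was already gone. */
static int toy_partition_scan_work(struct toy_ub *ub)
{
	struct toy_disk *disk = toy_get_disk(ub);

	if (!disk)
		return 0;	/* detached: nothing to dereference */

	disk->scans++;		/* stands in for the partition scan */
	toy_put_disk(disk);
	return 1;
}
```

The worker never touches `ub->disk` directly after the initial get. It uses the local pointer it holds a reference on, or it exits without doing anything.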
|
|
Commit 08e136ebd193 ("loop: don't change loop device under exclusive
opener in loop_set_status") forgot to call bd_abort_claiming() when
mutex_lock_killable() failed.
Fixes: 08e136ebd193 ("loop: don't change loop device under exclusive opener in loop_set_status")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
loop_set_status() is allowed to change the loop device while there
are other openers of the device, even exclusive ones.
In this case, it causes a KASAN: slab-out-of-bounds Read in
ext4_search_dir(), since when looking for an entry in an inlined
directory, e_value_offs is changed underneath the filesystem by
loop_set_status().
Fix the problem by forbidding loop_set_status() from modifying the loop
device while there are exclusive openers of the device. This is similar
to the fix in loop_configure() by commit 33ec3e53e7b1 ("loop: Don't
change loop device under exclusive opener") alongside commit ecbe6bc0003b
("block: use bd_prepare_to_claim directly in the loop driver").
Reported-by: syzbot+3ee481e21fd75e14c397@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=3ee481e21fd75e14c397
Tested-by: syzbot+3ee481e21fd75e14c397@syzkaller.appspotmail.com
Tested-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Signed-off-by: Raphael Pinsonneault-Thibeault <rpthibeault@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Before using the data buffer to send back the response message, zero it
completely. This prevents any stray bytes from being picked up by the
client side when the message is exchanged between different protocol
versions.
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
On rnbd-srv, the bi_size of the bio is set during the bio_add_page
function, to which datalen is passed. But for special IOs like DISCARD
and WRITE_ZEROES, datalen is 0, since there is no data to write. For
these special IOs, use the bi_size of the rnbd_msg_io.
Fixes: f6f84be089c9 ("block/rnbd-srv: Add sanity check and remove redundant assignment")
Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The __print_flags helper is meant for bitmasks, while rnbd_rw_flags
mixes a bitmask with an enum. To avoid confusion, just print the data
as it is.
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The NOUNMAP flag is used in combination with the WRITE_ZEROES flag to
indicate that the upper layers want the sectors zeroed, but do not want
them to be freed. This instruction is especially important for storage
stacks which involve a layer capable of thin provisioning.
This commit makes the RNBD block device transfer and retain this NOUNMAP
flag for requests, so it can be passed on to the backend device on the
server side.
Since it is a change in the wire protocol, bump the minor version of
the protocol.
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Every ktype must provide a .release function that will be called after
the last kobject_put().
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In the RNBD client, for a WRITE request of size 0 with only the
REQ_PREFLUSH bit set, while converting from bio_opf to rnbd_opf, we map
REQ_OP_WRITE to RNBD_OP_WRITE, and then check whether the rq is a flush
via op_is_flush(). That function checks both the REQ_PREFLUSH and REQ_FUA
flags, and if either of them is set, RNBD_F_FUA is set.
On the RNBD server side, while converting the RNBD flags to req flags, if
the RNBD_F_FUA flag is set, we just set the REQ_FUA flag. This means we
have lost the PREFLUSH flag, and added the REQ_FUA flag in its place.
This commit adds a new RNBD_F_PREFLUSH flag, and also adds separate
handling for the REQ_PREFLUSH flag. On the server side, if RNBD_F_PREFLUSH
is present, REQ_PREFLUSH is added to the bio.
Since it is a change in the wire protocol, bump the minor version of
the protocol.
The change is backwards compatible, and does not change the functionality
if either the client or the server is running older/newer versions.
If the client side is running the older version, both REQ_PREFLUSH and
REQ_FUA are converted to RNBD_F_FUA. A server running the newer version
would still add only the REQ_FUA flag, which is what happens when both
client and server are running the older version.
If the client side is running the newer version, RNBD_F_FUA is added just
like before, but now RNBD_F_PREFLUSH is also added to the rnbd_opf. If the
server is running the older version, RNBD_F_PREFLUSH is ignored, and only
RNBD_F_FUA is processed.
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Reviewed-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
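The compatibility cases described in this message can be checked with a small userspace model. This is a sketch only: the `DEMO_*` bit values and the `demo_client_encode()` / `demo_server_decode()` helpers are hypothetical and do not reproduce the real rnbd protocol or block layer constants. The `new_proto` argument stands for "this side runs the version with RNBD_F_PREFLUSH".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical block-layer request flags (not the kernel's values). */
enum { DEMO_REQ_FUA = 1 << 0, DEMO_REQ_PREFLUSH = 1 << 1 };

/* Hypothetical wire flags (not the real rnbd protocol values). */
enum { DEMO_RNBD_F_FUA = 1 << 0, DEMO_RNBD_F_PREFLUSH = 1 << 1 };

static_assert(DEMO_RNBD_F_FUA != DEMO_RNBD_F_PREFLUSH,
              "wire flags must use distinct bits");

/* Client side: translate request flags into wire flags. */
static uint32_t demo_client_encode(uint32_t req, bool new_proto)
{
    uint32_t wire = 0;

    /* As before: either flush-type flag results in the FUA wire flag. */
    if (req & (DEMO_REQ_FUA | DEMO_REQ_PREFLUSH))
        wire |= DEMO_RNBD_F_FUA;
    /* New clients additionally carry PREFLUSH in its own bit. */
    if (new_proto && (req & DEMO_REQ_PREFLUSH))
        wire |= DEMO_RNBD_F_PREFLUSH;
    return wire;
}

/* Server side: translate wire flags back into request flags. */
static uint32_t demo_server_decode(uint32_t wire, bool new_proto)
{
    uint32_t req = 0;

    if (wire & DEMO_RNBD_F_FUA)
        req |= DEMO_REQ_FUA;
    /* Old servers do not know this bit and ignore it. */
    if (new_proto && (wire & DEMO_RNBD_F_PREFLUSH))
        req |= DEMO_REQ_PREFLUSH;
    return req;
}
```

In this model a PREFLUSH-only request arrives as FUA alone in every mixed or old/old pairing, and as FUA plus PREFLUSH only when both sides are new, which matches the cases listed in the message.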
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:
- Scan partition tables asynchronously for ublk, similarly to how nvme
does it. This avoids potential deadlocks, which is why nvme does it
that way too. Includes a set of selftests as well.
- MD pull request via Yu:
- Fix null-pointer dereference in raid5 sysfs group_thread_cnt
store (Tuo Li)
- Fix possible mempool corruption during raid1 raid_disks update
via sysfs (FengWei Shih)
- Fix logical_block_size configuration being overwritten during
super_1_validate() (Li Nan)
- Fix forward incompatibility with configurable logical block size:
arrays assembled on new kernels could not be assembled on older
kernels (v6.18 and before) due to non-zero reserved pad rejection
(Li Nan)
- Fix static checker warning about iterator not incremented (Li Nan)
- Skip CPU offlining notifications on unmapped hardware queues
- bfq-iosched block stats fix
- Fix outdated comment in bfq-iosched
* tag 'block-6.19-20260102' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
block, bfq: update outdated comment
blk-mq: skip CPU offline notify on unmapped hctx
selftests/ublk: fix Makefile to rebuild on header changes
selftests/ublk: add test for async partition scan
ublk: scan partition in async way
block,bfq: fix aux stat accumulation destination
md: Fix forward incompatibility from configurable logical block size
md: Fix logical_block_size configuration being overwritten
md: suspend array while updating raid_disks via sysfs
md/raid5: fix possible null-pointer dereferences in raid5_store_group_thread_cnt()
md: Fix static checker warning in analyze_sbs
|
|
'struct configfs_item_operations' and 'configfs_group_operations' are not
modified in this driver.
Constifying these structures moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.
On a x86_64, with allmodconfig:
Before:
======
   text    data     bss     dec     hex filename
 100263   37808    2752  140823   22617 drivers/block/null_blk/main.o
After:
=====
   text    data     bss     dec     hex filename
 100423   37648    2752  140823   22617 drivers/block/null_blk/main.o
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Replace simple_strtol() with the recommended kstrtoul() for parsing the
'ramdisk_size=' boot parameter. Unlike simple_strtol(), which returns a
long, kstrtoul() converts the string directly to an unsigned long and
avoids implicit casting.
Check the return value of kstrtoul() and reject invalid values. This
adds error handling while preserving behavior for existing values, and
removes use of the deprecated simple_strtol() helper. The current code
silently sets 'rd_size = 0' if parsing fails, instead of leaving the
default value (CONFIG_BLK_DEV_RAM_SIZE) unchanged.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
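The behaviour described in this message (reject invalid input and leave the default untouched) can be approximated in userspace with strtoul(). This is a sketch under assumptions: `demo_parse_ulong()` is a hypothetical helper, base 0 is assumed, and the checks only loosely mirror kstrtoul(), which is a kernel-only function.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Strictly parse an unsigned long. On failure return false and leave
 * *out unchanged, so the caller keeps its default value.
 */
static bool demo_parse_ulong(const char *s, unsigned long *out)
{
    assert(s != NULL && out != NULL);

    /* strtoul() would accept leading whitespace and '-'; reject both. */
    if (!isdigit((unsigned char)s[0]) && s[0] != '+')
        return false;

    char *end;
    errno = 0;
    unsigned long v = strtoul(s, &end, 0);

    /* Fail on overflow, on no digits, or on trailing garbage. */
    if (errno != 0 || end == s || *end != '\0')
        return false;

    *out = v;
    return true;
}
```

A caller would initialise its variable to the default first and only overwrite it when the helper returns true, so a malformed parameter no longer results in a size of 0.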
|
|
C-String literals were added in Rust 1.77. Replace instances of
`kernel::c_str!` with C-String literals where possible.
Signed-off-by: Tamir Duberstein <tamird@gmail.com>
Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Implement async partition scan to avoid IO hang when reading partition
tables. Similar to nvme_partition_scan_work(), partition scanning is
deferred to a work queue to prevent deadlocks.
When partition scan happens synchronously during add_disk(), IO errors
can cause the partition scan to wait while holding ub->mutex, which
can deadlock with other operations that need the mutex.
Changes:
- Add partition_scan_work to ublk_device structure
- Implement ublk_partition_scan_work() to perform async scan
- Always suppress sync partition scan during add_disk()
- Schedule async work after add_disk() for trusted daemons
- Add flush_work() in ublk_stop_dev() before grabbing ub->mutex
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reported-by: Yoav Cohen <yoav@nvidia.com>
Closes: https://lore.kernel.org/linux-block/DM4PR12MB63280C5637917C071C2F0D65A9A8A@DM4PR12MB6328.namprd12.prod.outlook.com/
Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|