user/sven/linux.git/include/linux/blkdev.h, branch v6.12.7

block: Prevent potential deadlocks in zone write plug error recovery

2024-12-19T17:13:00Z

commit fe0418eb9bd69a19a948b297c8de815e05f3cde1 upstream. Zone write plugging for handling writes to zones of a zoned block device always execute a zone report whenever a write BIO to a zone fails. The intent of this is to ensure that the tracking of a zone write pointer is always correct to ensure that the alignment to a zone write pointer of write BIOs can be checked on submission and that we can always correctly emulate zone append operations using regular write BIOs. However, this error recovery scheme introduces a potential deadlock if a device queue freeze is initiated while BIOs are still plugged in a zone write plug and one of these write operation fails. In such case, the disk zone write plug error recovery work is scheduled and executes a report zone. This in turn can result in a request allocation in the underlying driver to issue the report zones command to the device. But with the device queue freeze already started, this allocation will block, preventing the report zone execution and the continuation of the processing of the plugged BIOs. As plugged BIOs hold a queue usage reference, the queue freeze itself will never complete, resulting in a deadlock. Avoid this problem by completely removing from the zone write plugging code the use of report zones operations after a failed write operation, instead relying on the device user to either execute a report zones, reset the zone, finish the zone, or give up writing to the device (which is a fairly common pattern for file systems which degrade to read-only after write failures). This is not an unreasonnable requirement as all well-behaved applications, FSes and device mapper already use report zones to recover from write errors whenever possible by comparing the current position of a zone write pointer with what their assumption about the position is. The changes to remove the automatic error recovery are as follows: - Completely remove the error recovery work and its associated resources (zone write plug list head, disk error list, and disk zone_wplugs_work work struct). This also removes the functions disk_zone_wplug_set_error() and disk_zone_wplug_clear_error(). - Change the BLK_ZONE_WPLUG_ERROR zone write plug flag into BLK_ZONE_WPLUG_NEED_WP_UPDATE. This new flag is set for a zone write plug whenever a write opration targetting the zone of the zone write plug fails. This flag indicates that the zone write pointer offset is not reliable and that it must be updated when the next report zone, reset zone, finish zone or disk revalidation is executed. - Modify blk_zone_write_plug_bio_endio() to set the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag for the target zone of a failed write BIO. - Modify the function disk_zone_wplug_set_wp_offset() to clear this new flag, thus implementing recovery of a correct write pointer offset with the reset (all) zone and finish zone operations. - Modify blkdev_report_zones() to always use the disk_report_zones_cb() callback so that disk_zone_wplug_sync_wp_offset() can be called for any zone marked with the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag. This implements recovery of a correct write pointer offset for zone write plugs marked with BLK_ZONE_WPLUG_NEED_WP_UPDATE and within the range of the report zones operation executed by the user. - Modify blk_revalidate_seq_zone() to call disk_zone_wplug_sync_wp_offset() for all sequential write required zones when a zoned block device is revalidated, thus always resolving any inconsistency between the write pointer offset of zone write plugs and the actual write pointer position of sequential zones. Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig Reviewed-by: Martin K. Petersen Link: https://lore.kernel.org/r/20241209122357.47838-5-dlemoal@kernel.org Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

dm: Fix dm-zoned-reclaim zone write pointer alignment

2024-12-19T17:13:00Z

commit b76b840fd93374240b59825f1ab8e2f5c9907acb upstream. The zone reclaim processing of the dm-zoned device mapper uses blkdev_issue_zeroout() to align the write pointer of a zone being used for reclaiming another zone, to write the valid data blocks from the zone being reclaimed at the same position relative to the zone start in the reclaim target zone. The first call to blkdev_issue_zeroout() will try to use hardware offload using a REQ_OP_WRITE_ZEROES operation if the device reports a non-zero max_write_zeroes_sectors queue limit. If this operation fails because of the lack of hardware support, blkdev_issue_zeroout() falls back to using a regular write operation with the zero-page as buffer. Currently, such REQ_OP_WRITE_ZEROES failure is automatically handled by the block layer zone write plugging code which will execute a report zones operation to ensure that the write pointer of the target zone of the failed operation has not changed and to "rewind" the zone write pointer offset of the target zone as it was advanced when the write zero operation was submitted. So the REQ_OP_WRITE_ZEROES failure does not cause any issue and blkdev_issue_zeroout() works as expected. However, since the automatic recovery of zone write pointers by the zone write plugging code can potentially cause deadlocks with queue freeze operations, a different recovery must be implemented in preparation for the removal of zone write plugging report zones based recovery. Do this by introducing the new function blk_zone_issue_zeroout(). This function first calls blkdev_issue_zeroout() with the flag BLKDEV_ZERO_NOFALLBACK to intercept failures on the first execution which attempt to use the device hardware offload with the REQ_OP_WRITE_ZEROES operation. If this attempt fails, a report zone operation is issued to restore the zone write pointer offset of the target zone to the correct position and blkdev_issue_zeroout() is called again without the BLKDEV_ZERO_NOFALLBACK flag. The report zones operation performing this recovery is implemented using the helper function disk_zone_sync_wp_offset() which calls the gendisk report_zones file operation with the callback disk_report_zones_cb(). This callback updates the target write pointer offset of the target zone using the new function disk_zone_wplug_sync_wp_offset(). dmz_reclaim_align_wp() is modified to change its call to blkdev_issue_zeroout() to a call to blk_zone_issue_zeroout() without any other change needed as the two functions are functionnally equivalent. Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig Acked-by: Mike Snitzer Reviewed-by: Martin K. Petersen Link: https://lore.kernel.org/r/20241209122357.47838-4-dlemoal@kernel.org Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

block: RCU protect disk->conv_zones_bitmap

2024-12-14T19:03:35Z

[ Upstream commit d7cb6d7414ea1b33536fa6d11805cb8dceec1f97 ] Ensure that a disk revalidation changing the conventional zones bitmap of a disk does not cause invalid memory references when using the disk_zone_is_conv() helper by RCU protecting the disk->conv_zones_bitmap pointer. disk_zone_is_conv() is modified to operate under the RCU read lock and the function disk_set_conv_zones_bitmap() is added to update a disk conv_zones_bitmap pointer using rcu_replace_pointer() with the disk zone_wplugs_lock spinlock held. disk_free_zone_resources() is modified to call disk_update_zone_resources() with a NULL bitmap pointer to free the disk conv_zones_bitmap. disk_set_conv_zones_bitmap() is also used in disk_update_zone_resources() to set the new (revalidated) bitmap and free the old one. Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig Reviewed-by: Johannes Thumshirn Link: https://lore.kernel.org/r/20241107064300.227731-2-dlemoal@kernel.org Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

block: always verify unfreeze lock on the owner task

2024-12-05T13:03:10Z

commit 6a78699838a0ddeed3620ddf50c1521f1fe1e811 upstream. commit f1be1788a32e ("block: model freeze & enter queue as lock for supporting lockdep") tries to apply lockdep for verifying freeze & unfreeze. However, the verification is only done the outmost freeze and unfreeze. This way is actually not correct because q->mq_freeze_depth still may drop to zero on other task instead of the freeze owner task. Fix this issue by always verifying the last unfreeze lock on the owner task context, and make sure both the outmost freeze & unfreeze are verified in the current task. Fixes: f1be1788a32e ("block: model freeze & enter queue as lock for supporting lockdep") Signed-off-by: Ming Lei Link: https://lore.kernel.org/r/20241031133723.303835-4-ming.lei@redhat.com Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

block: return unsigned int from bdev_io_min

2024-12-05T13:03:06Z

[ Upstream commit 46fd48ab3ea3eb3bb215684bd66ea3d260b091a9 ] The underlying limit is defined as an unsigned int, so return that from bdev_io_min as well. Fixes: ac481c20ef8f ("block: Topology ioctls") Signed-off-by: Christoph Hellwig Reviewed-by: Martin K. Petersen Reviewed-by: John Garry Link: https://lore.kernel.org/r/20241119072602.1059488-1-hch@lst.de Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

block: model freeze & enter queue as lock for supporting lockdep

2024-12-05T13:03:05Z

[ Upstream commit f1be1788a32e8fa63416ad4518bbd1a85a825c9d ] Recently we got several deadlock report[1][2][3] caused by blk_mq_freeze_queue and blk_enter_queue(). Turns out the two are just like acquiring read/write lock, so model them as read/write lock for supporting lockdep: 1) model q->q_usage_counter as two locks(io and queue lock) - queue lock covers sync with blk_enter_queue() - io lock covers sync with bio_enter_queue() 2) make the lockdep class/key as per-queue: - different subsystem has very different lock use pattern, shared lock class causes false positive easily - freeze_queue degrades to no lock in case that disk state becomes DEAD because bio_enter_queue() won't be blocked any more - freeze_queue degrades to no lock in case that request queue becomes dying because blk_enter_queue() won't be blocked any more 3) model blk_mq_freeze_queue() as acquire_exclusive & try_lock - it is exclusive lock, so dependency with blk_enter_queue() is covered - it is trylock because blk_mq_freeze_queue() are allowed to run concurrently 4) model blk_enter_queue() & bio_enter_queue() as acquire_read() - nested blk_enter_queue() are allowed - dependency with blk_mq_freeze_queue() is covered - blk_queue_exit() is often called from other contexts(such as irq), and it can't be annotated as lock_release(), so simply do it in blk_enter_queue(), this way still covered cases as many as possible With lockdep support, such kind of reports may be reported asap and needn't wait until the real deadlock is triggered. For example, lockdep report can be triggered in the report[3] with this patch applied. [1] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler' https://bugzilla.kernel.org/show_bug.cgi?id=219166 [2] del_gendisk() vs blk_queue_enter() race condition https://lore.kernel.org/linux-block/20241003085610.GK11458@google.com/ [3] queue_freeze & queue_enter deadlock in scsi https://lore.kernel.org/linux-block/ZxG38G9BuFdBpBHZ@fedora/T/#u Reviewed-by: Christoph Hellwig Signed-off-by: Ming Lei Link: https://lore.kernel.org/r/20241025003722.3630252-4-ming.lei@redhat.com Signed-off-by: Jens Axboe Stable-dep-of: 3802f73bd807 ("block: fix uaf for flush rq while iterating tags") Signed-off-by: Sasha Levin

block: Remove unused blk_limits_io_{min,opt}

2024-09-20T06:19:48Z

blk_limits_io_min and blk_limits_io_opt are unused since the recent commit 0a94a469a4f0 ("dm: stop using blk_limits_io_{min,opt}") Remove them. Signed-off-by: Dr. David Alan Gilbert Link: https://lore.kernel.org/r/20240920004817.676216-1-linux@treblig.org Signed-off-by: Jens Axboe

Merge tag 'v6.11' into for-6.12/block

2024-09-17T14:32:53Z

Merge in 6.11 final to get the fix for preventing deadlocks on an elevator switch, as there's a fixup for that patch. * tag 'v6.11': (1788 commits) Linux 6.11 Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop" pinctrl: pinctrl-cy8c95x0: Fix regcache cifs: Fix signature miscalculation mm: avoid leaving partial pfn mappings around in error case drm/xe/client: add missing bo locking in show_meminfo() drm/xe/client: fix deadlock in show_meminfo() drm/xe/oa: Enable Xe2+ PES disaggregation drm/xe/display: fix compat IS_DISPLAY_STEP() range end drm/xe: Fix access_ok check in user_fence_create drm/xe: Fix possible UAF in guc_exec_queue_process_msg drm/xe: Remove fence check from send_tlb_invalidation drm/xe/gt: Remove double include net: netfilter: move nf flowtable bpf initialization in nf_flow_table_module_init() PCI: Fix potential deadlock in pcim_intx() workqueue: Clear worker->pool in the worker thread context net: tighten bad gso csum offset check in virtio_net_hdr netlink: specs: mptcp: fix port endianness net: dpaa: Pad packets to ETH_ZLEN mptcp: pm: Fix uaf in __timer_delete_sync ...

block: constify the lim argument to queue_limits_max_zone_append_sectors

2024-08-29T10:32:32Z

queue_limits_max_zone_append_sectors doesn't change the lim argument, so mark it as const. Signed-off-by: Christoph Hellwig Reviewed-by: Damien Le Moal Tested-by: Hans Holmberg Reviewed-by: Hans Holmberg Link: https://lore.kernel.org/r/20240826173820.1690925-3-hch@lst.de Signed-off-by: Jens Axboe

block: Drop NULL check in bdev_write_zeroes_sectors()

2024-08-19T15:48:59Z

Function bdev_get_queue() must not return NULL, so drop the check in bdev_write_zeroes_sectors(). Reviewed-by: Christoph Hellwig Signed-off-by: John Garry Reviewed-by: Martin K. Petersen Reviewed-by: Nitesh Shetty Link: https://lore.kernel.org/r/20240815163228.216051-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe