user/sven/linux.git/include, branch v6.12.4

drm/panic: Fix uninitialized spinlock acquisition with CONFIG_DRM_PANIC=n

2024-12-09T09:41:12Z

commit 319e53f155907cf2c6dabc16ec9dce0179bc04d1 upstream. It turns out that if you happen to have a kernel config where CONFIG_DRM_PANIC is disabled and spinlock debugging is enabled, along with KMS being enabled - we'll end up trying to acquire an uninitialized spin_lock with drm_panic_lock() when we try to do a commit: rvkms rvkms.0: [drm:drm_atomic_commit] committing 0000000068d2ade1 INFO: trying to register non-static key. The code is fine but needs lockdep annotation, or maybe you didn't initialize this object before use? turning off the locking correctness validator. CPU: 4 PID: 1347 Comm: modprobe Not tainted 6.10.0-rc1Lyude-Test+ #272 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20240524-3.fc40 05/24/2024 Call Trace: dump_stack_lvl+0x77/0xa0 assign_lock_key+0x114/0x120 register_lock_class+0xa8/0x2c0 __lock_acquire+0x7d/0x2bd0 ? __vmap_pages_range_noflush+0x3a8/0x550 ? drm_atomic_helper_swap_state+0x2ad/0x3a0 lock_acquire+0xec/0x290 ? drm_atomic_helper_swap_state+0x2ad/0x3a0 ? lock_release+0xee/0x310 _raw_spin_lock_irqsave+0x4e/0x70 ? drm_atomic_helper_swap_state+0x2ad/0x3a0 drm_atomic_helper_swap_state+0x2ad/0x3a0 drm_atomic_helper_commit+0xb1/0x270 drm_atomic_commit+0xaf/0xe0 ? __pfx___drm_printfn_info+0x10/0x10 drm_client_modeset_commit_atomic+0x1a1/0x250 drm_client_modeset_commit_locked+0x4b/0x180 drm_client_modeset_commit+0x27/0x50 __drm_fb_helper_restore_fbdev_mode_unlocked+0x76/0x90 drm_fb_helper_set_par+0x38/0x40 fbcon_init+0x3c4/0x690 visual_init+0xc0/0x120 do_bind_con_driver+0x409/0x4c0 do_take_over_console+0x233/0x280 do_fb_registered+0x11f/0x210 fbcon_fb_registered+0x2c/0x60 register_framebuffer+0x248/0x2a0 __drm_fb_helper_initial_config_and_unlock+0x58a/0x720 drm_fbdev_generic_client_hotplug+0x6e/0xb0 drm_client_register+0x76/0xc0 _RNvXs_CsHeezP08sTT_5rvkmsNtB4_5RvkmsNtNtCs1cdwasc6FUb_6kernel8platform6Driver5probe+0xed2/0x1060 [rvkms] ? _RNvMs_NtCs1cdwasc6FUb_6kernel8platformINtB4_7AdapterNtCsHeezP08sTT_5rvkms5RvkmsE14probe_callbackBQ_+0x2b/0x70 [rvkms] ? acpi_dev_pm_attach+0x25/0x110 ? platform_probe+0x6a/0xa0 ? really_probe+0x10b/0x400 ? __driver_probe_device+0x7c/0x140 ? driver_probe_device+0x22/0x1b0 ? __device_attach_driver+0x13a/0x1c0 ? __pfx___device_attach_driver+0x10/0x10 ? bus_for_each_drv+0x114/0x170 ? __device_attach+0xd6/0x1b0 ? bus_probe_device+0x9e/0x120 ? device_add+0x288/0x4b0 ? platform_device_add+0x75/0x230 ? platform_device_register_full+0x141/0x180 ? rust_helper_platform_device_register_simple+0x85/0xb0 ? _RNvMs2_NtCs1cdwasc6FUb_6kernel8platformNtB5_6Device13create_simple+0x1d/0x60 ? _RNvXs0_CsHeezP08sTT_5rvkmsNtB5_5RvkmsNtCs1cdwasc6FUb_6kernel6Module4init+0x11e/0x160 [rvkms] ? 0xffffffffc083f000 ? init_module+0x20/0x1000 [rvkms] ? kernfs_xattr_get+0x3e/0x80 ? do_one_initcall+0x148/0x3f0 ? __lock_acquire+0x5ef/0x2bd0 ? __lock_acquire+0x5ef/0x2bd0 ? __lock_acquire+0x5ef/0x2bd0 ? put_cpu_partial+0x51/0x1d0 ? lock_acquire+0xec/0x290 ? put_cpu_partial+0x51/0x1d0 ? lock_release+0xee/0x310 ? put_cpu_partial+0x51/0x1d0 ? fs_reclaim_acquire+0x69/0xf0 ? lock_acquire+0xec/0x290 ? fs_reclaim_acquire+0x69/0xf0 ? kfree+0x22f/0x340 ? lock_release+0xee/0x310 ? kmalloc_trace_noprof+0x48/0x340 ? do_init_module+0x22/0x240 ? kmalloc_trace_noprof+0x155/0x340 ? do_init_module+0x60/0x240 ? __se_sys_finit_module+0x2e0/0x3f0 ? do_syscall_64+0xa4/0x180 ? syscall_exit_to_user_mode+0x108/0x140 ? do_syscall_64+0xb0/0x180 ? vma_end_read+0xd0/0xe0 ? do_user_addr_fault+0x309/0x640 ? clear_bhb_loop+0x45/0xa0 ? clear_bhb_loop+0x45/0xa0 ? clear_bhb_loop+0x45/0xa0 ? entry_SYSCALL_64_after_hwframe+0x76/0x7e Fix this by stubbing these macros out when this config option isn't enabled, along with fixing the unused variable warning that introduces. Signed-off-by: Lyude Paul Reviewed-by: Daniel Vetter Fixes: e2a1cda3e0c7 ("drm/panic: Add drm panic locking") Cc: # v6.10+ Link: https://patchwork.freedesktop.org/patch/msgid/20240916230103.611490-1-lyude@redhat.com Signed-off-by: Greg Kroah-Hartman

mm/vmalloc: combine all TLB flush operations of KASAN shadow virtual address into one operation

2024-12-09T09:41:08Z

commit 9e9e085effe9b7e342138fde3cf8577d22509932 upstream. When compiling kernel source 'make -j $(nproc)' with the up-and-running KASAN-enabled kernel on a 256-core machine, the following soft lockup is shown: watchdog: BUG: soft lockup - CPU#28 stuck for 22s! [kworker/28:1:1760] CPU: 28 PID: 1760 Comm: kworker/28:1 Kdump: loaded Not tainted 6.10.0-rc5 #95 Workqueue: events drain_vmap_area_work RIP: 0010:smp_call_function_many_cond+0x1d8/0xbb0 Code: 38 c8 7c 08 84 c9 0f 85 49 08 00 00 8b 45 08 a8 01 74 2e 48 89 f1 49 89 f7 48 c1 e9 03 41 83 e7 07 4c 01 e9 41 83 c7 03 f3 90 <0f> b6 01 41 38 c7 7c 08 84 c0 0f 85 d4 06 00 00 8b 45 08 a8 01 75 RSP: 0018:ffffc9000cb3fb60 EFLAGS: 00000202 RAX: 0000000000000011 RBX: ffff8883bc4469c0 RCX: ffffed10776e9949 RDX: 0000000000000002 RSI: ffff8883bb74ca48 RDI: ffffffff8434dc50 RBP: ffff8883bb74ca40 R08: ffff888103585dc0 R09: ffff8884533a1800 R10: 0000000000000004 R11: ffffffffffffffff R12: ffffed1077888d39 R13: dffffc0000000000 R14: ffffed1077888d38 R15: 0000000000000003 FS: 0000000000000000(0000) GS:ffff8883bc400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005577b5c8d158 CR3: 0000000004850000 CR4: 0000000000350ef0 Call Trace: ? watchdog_timer_fn+0x2cd/0x390 ? __pfx_watchdog_timer_fn+0x10/0x10 ? __hrtimer_run_queues+0x300/0x6d0 ? sched_clock_cpu+0x69/0x4e0 ? __pfx___hrtimer_run_queues+0x10/0x10 ? srso_return_thunk+0x5/0x5f ? ktime_get_update_offsets_now+0x7f/0x2a0 ? srso_return_thunk+0x5/0x5f ? srso_return_thunk+0x5/0x5f ? hrtimer_interrupt+0x2ca/0x760 ? __sysvec_apic_timer_interrupt+0x8c/0x2b0 ? sysvec_apic_timer_interrupt+0x6a/0x90 ? asm_sysvec_apic_timer_interrupt+0x16/0x20 ? smp_call_function_many_cond+0x1d8/0xbb0 ? __pfx_do_kernel_range_flush+0x10/0x10 on_each_cpu_cond_mask+0x20/0x40 flush_tlb_kernel_range+0x19b/0x250 ? srso_return_thunk+0x5/0x5f ? kasan_release_vmalloc+0xa7/0xc0 purge_vmap_node+0x357/0x820 ? __pfx_purge_vmap_node+0x10/0x10 __purge_vmap_area_lazy+0x5b8/0xa10 drain_vmap_area_work+0x21/0x30 process_one_work+0x661/0x10b0 worker_thread+0x844/0x10e0 ? srso_return_thunk+0x5/0x5f ? __kthread_parkme+0x82/0x140 ? __pfx_worker_thread+0x10/0x10 kthread+0x2a5/0x370 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 Debugging Analysis: 1. The following ftrace log shows that the lockup CPU spends too much time iterating vmap_nodes and flushing TLB when purging vm_area structures. (Some info is trimmed). kworker: funcgraph_entry: | drain_vmap_area_work() { kworker: funcgraph_entry: | mutex_lock() { kworker: funcgraph_entry: 1.092 us | __cond_resched(); kworker: funcgraph_exit: 3.306 us | } ... ... kworker: funcgraph_entry: | flush_tlb_kernel_range() { ... ... kworker: funcgraph_exit: # 7533.649 us | } ... ... kworker: funcgraph_entry: 2.344 us | mutex_unlock(); kworker: funcgraph_exit: $ 23871554 us | } The drain_vmap_area_work() spends over 23 seconds. There are 2805 flush_tlb_kernel_range() calls in the ftrace log. * One is called in __purge_vmap_area_lazy(). * Others are called by purge_vmap_node->kasan_release_vmalloc. purge_vmap_node() iteratively releases kasan vmalloc allocations and flushes TLB for each vmap_area. - [Rough calculation] Each flush_tlb_kernel_range() runs about 7.5ms. -- 2804 * 7.5ms = 21.03 seconds. -- That's why a soft lock is triggered. 2. Extending the soft lockup time can work around the issue (For example, # echo 60 > /proc/sys/kernel/watchdog_thresh). This confirms the above-mentioned speculation: drain_vmap_area_work() spends too much time. If we combine all TLB flush operations of the KASAN shadow virtual address into one operation in the call path 'purge_vmap_node()->kasan_release_vmalloc()', the running time of drain_vmap_area_work() can be saved greatly. The idea is from the flush_tlb_kernel_range() call in __purge_vmap_area_lazy(). And, the soft lockup won't be triggered. Here is the test result based on 6.10: [6.10 wo/ the patch] 1. ftrace latency profiling (record a trace if the latency > 20s). echo 20000000 > /sys/kernel/debug/tracing/tracing_thresh echo drain_vmap_area_work > /sys/kernel/debug/tracing/set_graph_function echo function_graph > /sys/kernel/debug/tracing/current_tracer echo 1 > /sys/kernel/debug/tracing/tracing_on 2. Run `make -j $(nproc)` to compile the kernel source 3. Once the soft lockup is reproduced, check the ftrace log: cat /sys/kernel/debug/tracing/trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 76) $ 50412985 us | } /* __purge_vmap_area_lazy */ 76) $ 50412997 us | } /* drain_vmap_area_work */ 76) $ 29165911 us | } /* __purge_vmap_area_lazy */ 76) $ 29165926 us | } /* drain_vmap_area_work */ 91) $ 53629423 us | } /* __purge_vmap_area_lazy */ 91) $ 53629434 us | } /* drain_vmap_area_work */ 91) $ 28121014 us | } /* __purge_vmap_area_lazy */ 91) $ 28121026 us | } /* drain_vmap_area_work */ [6.10 w/ the patch] 1. Repeat step 1-2 in "[6.10 wo/ the patch]" 2. The soft lockup is not triggered and ftrace log is empty. cat /sys/kernel/debug/tracing/trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 3. Setting 'tracing_thresh' to 10/5 seconds does not get any ftrace log. 4. Setting 'tracing_thresh' to 1 second gets ftrace log. cat /sys/kernel/debug/tracing/trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 23) $ 1074942 us | } /* __purge_vmap_area_lazy */ 23) $ 1074950 us | } /* drain_vmap_area_work */ The worst execution time of drain_vmap_area_work() is about 1 second. Link: https://lore.kernel.org/lkml/ZqFlawuVnOMY2k3E@pc638.lan/ Link: https://lkml.kernel.org/r/20240726165246.31326-1-ahuang12@lenovo.com Fixes: 282631cb2447 ("mm: vmalloc: remove global purge_vmap_area_root rb-tree") Signed-off-by: Adrian Huang Co-developed-by: Uladzislau Rezki (Sony) Signed-off-by: Uladzislau Rezki (Sony) Tested-by: Jiwei Sun Reviewed-by: Baoquan He Cc: Alexander Potapenko Cc: Andrey Konovalov Cc: Andrey Ryabinin Cc: Christoph Hellwig Cc: Dmitry Vyukov Cc: Vincenzo Frascino Cc: Signed-off-by: Andrew Morton Signed-off-by: Greg Kroah-Hartman

util_macros.h: fix/rework find_closest() macros

2024-12-09T09:41:06Z

commit bc73b4186736341ab5cd2c199da82db6e1134e13 upstream. A bug was found in the find_closest() (find_closest_descending() is also affected after some testing), where for certain values with small progressions, the rounding (done by averaging 2 values) causes an incorrect index to be returned. The rounding issues occur for progressions of 1, 2 and 3. It goes away when the progression/interval between two values is 4 or larger. It's particularly bad for progressions of 1. For example if there's an array of 'a = { 1, 2, 3 }', using 'find_closest(2, a ...)' would return 0 (the index of '1'), rather than returning 1 (the index of '2'). This means that for exact values (with a progression of 1), find_closest() will misbehave and return the index of the value smaller than the one we're searching for. For progressions of 2 and 3, the exact values are obtained correctly; but values aren't approximated correctly (as one would expect). Starting with progressions of 4, all seems to be good (one gets what one would expect). While one could argue that 'find_closest()' should not be used for arrays with progressions of 1 (i.e. '{1, 2, 3, ...}', the macro should still behave correctly. The bug was found while testing the 'drivers/iio/adc/ad7606.c', specifically the oversampling feature. For reference, the oversampling values are listed as: static const unsigned int ad7606_oversampling_avail[7] = { 1, 2, 4, 8, 16, 32, 64, }; When doing: 1. $ echo 1 > /sys/bus/iio/devices/iio\:device0/oversampling_ratio $ cat /sys/bus/iio/devices/iio\:device0/oversampling_ratio 1 # this is fine 2. $ echo 2 > /sys/bus/iio/devices/iio\:device0/oversampling_ratio $ cat /sys/bus/iio/devices/iio\:device0/oversampling_ratio 1 # this is wrong; 2 should be returned here 3. $ echo 3 > /sys/bus/iio/devices/iio\:device0/oversampling_ratio $ cat /sys/bus/iio/devices/iio\:device0/oversampling_ratio 2 # this is fine 4. $ echo 4 > /sys/bus/iio/devices/iio\:device0/oversampling_ratio $ cat /sys/bus/iio/devices/iio\:device0/oversampling_ratio 4 # this is fine And from here-on, the values are as correct (one gets what one would expect.) While writing a kunit test for this bug, a peculiar issue was found for the array in the 'drivers/hwmon/ina2xx.c' & 'drivers/iio/adc/ina2xx-adc.c' drivers. While running the kunit test (for 'ina226_avg_tab' from these drivers): * idx = find_closest([-1 to 2], ina226_avg_tab, ARRAY_SIZE(ina226_avg_tab)); This returns idx == 0, so value. * idx = find_closest(3, ina226_avg_tab, ARRAY_SIZE(ina226_avg_tab)); This returns idx == 0, value 1; and now one could argue whether 3 is closer to 4 or to 1. This quirk only appears for value '3' in this array, but it seems to be a another rounding issue. * And from 4 onwards the 'find_closest'() works fine (one gets what one would expect). This change reworks the find_closest() macros to also check the difference between the left and right elements when 'x'. If the distance to the right is smaller (than the distance to the left), the index is incremented by 1. This also makes redundant the need for using the DIV_ROUND_CLOSEST() macro. In order to accommodate for any mix of negative + positive values, the internal variables '__fc_x', '__fc_mid_x', '__fc_left' & '__fc_right' are forced to 'long' type. This also addresses any potential bugs/issues with 'x' being of an unsigned type. In those situations any comparison between signed & unsigned would be promoted to a comparison between 2 unsigned numbers; this is especially annoying when '__fc_left' & '__fc_right' underflow. The find_closest_descending() macro was also reworked and duplicated from the find_closest(), and it is being iterated in reverse. The main reason for this is to get the same indices as 'find_closest()' (but in reverse). The comparison for '__fc_right < __fc_left' favors going the array in ascending order. For example for array '{ 1024, 512, 256, 128, 64, 16, 4, 1 }' and x = 3, we get: __fc_mid_x = 2 __fc_left = -1 __fc_right = -2 Then '__fc_right < __fc_left' evaluates to true and '__fc_i++' becomes 7 which is not quite incorrect, but 3 is closer to 4 than to 1. This change has been validated with the kunit from the next patch. Link: https://lkml.kernel.org/r/20241105145406.554365-1-aardelean@baylibre.com Fixes: 95d119528b0b ("util_macros.h: add find_closest() macro") Signed-off-by: Alexandru Ardelean Cc: Bartosz Golaszewski Cc: Greg Kroah-Hartman Cc: Signed-off-by: Andrew Morton Signed-off-by: Greg Kroah-Hartman

netkit: Add option for scrubbing skb meta data

2024-12-09T09:40:56Z

commit 83134ef4609388f6b9ca31a384f531155196c2a7 upstream. Jordan reported that when running Cilium with netkit in per-endpoint-routes mode, network policy misclassifies traffic. In this direct routing mode of Cilium which is used in case of GKE/EKS/AKS, the Pod's BPF program to enforce policy sits on the netkit primary device's egress side. The issue here is that in case of netkit's netkit_prep_forward(), it will clear meta data such as skb->mark and skb->priority before executing the BPF program. Thus, identity data stored in there from earlier BPF programs (e.g. from tcx ingress on the physical device) gets cleared instead of being made available for the primary's program to process. While for traffic egressing the Pod via the peer device this might be desired, this is different for the primary one where compared to tcx egress on the host veth this information would be available. To address this, add a new parameter for the device orchestration to allow control of skb->mark and skb->priority scrubbing, to make the two accessible from BPF (and eventually leave it up to the program to scrub). By default, the current behavior is retained. For netkit peer this also enables the use case where applications could cooperate/signal intent to the BPF program. Note that struct netkit has a 4 byte hole between policy and bundle which is used here, in other words, struct netkit's first cacheline content used in fast-path does not get moved around. Fixes: 35dfaad7188c ("netkit, bpf: Add bpf programmable net device") Reported-by: Jordan Rife Signed-off-by: Daniel Borkmann Cc: Nikolay Aleksandrov Link: https://github.com/cilium/cilium/issues/34042 Acked-by: Jakub Kicinski Acked-by: Nikolay Aleksandrov Link: https://lore.kernel.org/r/20241004101335.117711-1-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau Signed-off-by: Greg Kroah-Hartman

block: always verify unfreeze lock on the owner task

2024-12-05T13:03:10Z

commit 6a78699838a0ddeed3620ddf50c1521f1fe1e811 upstream. commit f1be1788a32e ("block: model freeze & enter queue as lock for supporting lockdep") tries to apply lockdep for verifying freeze & unfreeze. However, the verification is only done the outmost freeze and unfreeze. This way is actually not correct because q->mq_freeze_depth still may drop to zero on other task instead of the freeze owner task. Fix this issue by always verifying the last unfreeze lock on the owner task context, and make sure both the outmost freeze & unfreeze are verified in the current task. Fixes: f1be1788a32e ("block: model freeze & enter queue as lock for supporting lockdep") Signed-off-by: Ming Lei Link: https://lore.kernel.org/r/20241031133723.303835-4-ming.lei@redhat.com Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

Rename .data.once to .data..once to fix resetting WARN*_ONCE

2024-12-05T13:03:08Z

[ Upstream commit dbefa1f31a91670c9e7dac9b559625336206466f ] Commit b1fca27d384e ("kernel debug: support resetting WARN*_ONCE") added support for clearing the state of once warnings. However, it is not functional when CONFIG_LD_DEAD_CODE_DATA_ELIMINATION or CONFIG_LTO_CLANG is enabled, because .data.once matches the .data.[0-9a-zA-Z_]* pattern in the DATA_MAIN macro. Commit cb87481ee89d ("kbuild: linker script do not match C names unless LD_DEAD_CODE_DATA_ELIMINATION is configured") was introduced to suppress the issue for the default CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=n case, providing a minimal fix for stable backporting. We were aware this did not address the issue for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y. The plan was to apply correct fixes and then revert cb87481ee89d. [1] Seven years have passed since then, yet the #ifdef workaround remains in place. Meanwhile, commit b1fca27d384e introduced the .data.once section, and commit dc5723b02e52 ("kbuild: add support for Clang LTO") extended the #ifdef. Using a ".." separator in the section name fixes the issue for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION and CONFIG_LTO_CLANG. [1]: https://lore.kernel.org/linux-kbuild/CAK7LNASck6BfdLnESxXUeECYL26yUDm0cwRZuM4gmaWUkxjL5g@mail.gmail.com/ Fixes: b1fca27d384e ("kernel debug: support resetting WARN*_ONCE") Fixes: dc5723b02e52 ("kbuild: add support for Clang LTO") Signed-off-by: Masahiro Yamada Signed-off-by: Sasha Levin

Rename .data.unlikely to .data..unlikely

2024-12-05T13:03:07Z

[ Upstream commit bb43a59944f45e89aa158740b8a16ba8f0b0fa2b ] Commit 7ccaba5314ca ("consolidate WARN_...ONCE() static variables") was intended to collect all .data.unlikely sections into one chunk. However, this has not worked when CONFIG_LD_DEAD_CODE_DATA_ELIMINATION or CONFIG_LTO_CLANG is enabled, because .data.unlikely matches the .data.[0-9a-zA-Z_]* pattern in the DATA_MAIN macro. Commit cb87481ee89d ("kbuild: linker script do not match C names unless LD_DEAD_CODE_DATA_ELIMINATION is configured") was introduced to suppress the issue for the default CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=n case, providing a minimal fix for stable backporting. We were aware this did not address the issue for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y. The plan was to apply correct fixes and then revert cb87481ee89d. [1] Seven years have passed since then, yet the #ifdef workaround remains in place. Using a ".." separator in the section name fixes the issue for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION and CONFIG_LTO_CLANG. [1]: https://lore.kernel.org/linux-kbuild/CAK7LNASck6BfdLnESxXUeECYL26yUDm0cwRZuM4gmaWUkxjL5g@mail.gmail.com/ Fixes: cb87481ee89d ("kbuild: linker script do not match C names unless LD_DEAD_CODE_DATA_ELIMINATION is configured") Signed-off-by: Masahiro Yamada Signed-off-by: Sasha Levin

block: return unsigned int from bdev_io_min

2024-12-05T13:03:06Z

[ Upstream commit 46fd48ab3ea3eb3bb215684bd66ea3d260b091a9 ] The underlying limit is defined as an unsigned int, so return that from bdev_io_min as well. Fixes: ac481c20ef8f ("block: Topology ioctls") Signed-off-by: Christoph Hellwig Reviewed-by: Martin K. Petersen Reviewed-by: John Garry Link: https://lore.kernel.org/r/20241119072602.1059488-1-hch@lst.de Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

block: model freeze & enter queue as lock for supporting lockdep

2024-12-05T13:03:05Z

[ Upstream commit f1be1788a32e8fa63416ad4518bbd1a85a825c9d ] Recently we got several deadlock report[1][2][3] caused by blk_mq_freeze_queue and blk_enter_queue(). Turns out the two are just like acquiring read/write lock, so model them as read/write lock for supporting lockdep: 1) model q->q_usage_counter as two locks(io and queue lock) - queue lock covers sync with blk_enter_queue() - io lock covers sync with bio_enter_queue() 2) make the lockdep class/key as per-queue: - different subsystem has very different lock use pattern, shared lock class causes false positive easily - freeze_queue degrades to no lock in case that disk state becomes DEAD because bio_enter_queue() won't be blocked any more - freeze_queue degrades to no lock in case that request queue becomes dying because blk_enter_queue() won't be blocked any more 3) model blk_mq_freeze_queue() as acquire_exclusive & try_lock - it is exclusive lock, so dependency with blk_enter_queue() is covered - it is trylock because blk_mq_freeze_queue() are allowed to run concurrently 4) model blk_enter_queue() & bio_enter_queue() as acquire_read() - nested blk_enter_queue() are allowed - dependency with blk_mq_freeze_queue() is covered - blk_queue_exit() is often called from other contexts(such as irq), and it can't be annotated as lock_release(), so simply do it in blk_enter_queue(), this way still covered cases as many as possible With lockdep support, such kind of reports may be reported asap and needn't wait until the real deadlock is triggered. For example, lockdep report can be triggered in the report[3] with this patch applied. [1] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler' https://bugzilla.kernel.org/show_bug.cgi?id=219166 [2] del_gendisk() vs blk_queue_enter() race condition https://lore.kernel.org/linux-block/20241003085610.GK11458@google.com/ [3] queue_freeze & queue_enter deadlock in scsi https://lore.kernel.org/linux-block/ZxG38G9BuFdBpBHZ@fedora/T/#u Reviewed-by: Christoph Hellwig Signed-off-by: Ming Lei Link: https://lore.kernel.org/r/20241025003722.3630252-4-ming.lei@redhat.com Signed-off-by: Jens Axboe Stable-dep-of: 3802f73bd807 ("block: fix uaf for flush rq while iterating tags") Signed-off-by: Sasha Levin

blk-mq: add non_owner variant of start_freeze/unfreeze queue APIs

2024-12-05T13:03:05Z

[ Upstream commit 8acdd0e7bfadda6b5103f2960d293581954454ed ] Add non_owner variant of start_freeze/unfreeze queue APIs, so that the caller knows that what they are doing, and we can skip lockdep support for non_owner variant in per-call level. Prepare for supporting lockdep for freezing/unfreezing queue. Reviewed-by: Christoph Hellwig Suggested-by: Christoph Hellwig Signed-off-by: Ming Lei Link: https://lore.kernel.org/r/20241025003722.3630252-2-ming.lei@redhat.com Signed-off-by: Jens Axboe Stable-dep-of: 3802f73bd807 ("block: fix uaf for flush rq while iterating tags") Signed-off-by: Sasha Levin