user/sven/linux.git/include/linux/genhd.h, branch v5.4.43

block: fix use-after-free on gendisk

2019-04-22T15:48:12Z

Commit 2da78092dda ("block: Fix dev_t minor allocation lifetime") specifically moved the blk_free_devt(dev->devt) call to part_release() to avoid reallocating the device number before the device is fully shut down.

However, it can cause a use-after-free on gendisk in get_gendisk(). We use an md device as an example to show the race:

Process1                                Worker                      Process2
md_free
                                                                    blkdev_open
del_gendisk
  add delete_partition_work_fn() to wq
                                                                    __blkdev_get
                                                                    get_gendisk
put_disk
  disk_release
    kfree(disk)
                                                                    find part from ext_devt_idr
                                                                    get_disk_and_module(disk)
                                                                    cause use after free

                                        delete_partition_work_fn
                                        put_device(part)
                                        part_release
                                        remove part from ext_devt_idr

Before the partition's entry is removed from ext_devt_idr by delete_partition_work_fn(), we can find the devt and then access the gendisk through the hd_struct pointer. But if we access the gendisk after it has been freed, the result is a use-after-free on gendisk in get_gendisk().

We fix this by adding a new helper, blk_invalidate_devt(), which is called in delete_partition() and del_gendisk(). It replaces the hd_struct pointer in the idr with the value 'NULL'; the entry itself is still deleted from the idr in part_release(), as we do now.

Thanks to Jan Kara for providing the solution and clearer comments for the code.

Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime")
Cc: Al Viro
Reviewed-by: Bart Van Assche
Reviewed-by: Keith Busch
Reviewed-by: Jan Kara
Suggested-by: Jan Kara
Signed-off-by: Yufen Yu
Signed-off-by: Jens Axboe
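The two-step teardown described in this message (hide the pointer first, give the number back later) can be illustrated with a small user-space sketch. Everything here is hypothetical: `struct devt_table`, `table_invalidate()` and the other helpers stand in for ext_devt_idr, blk_invalidate_devt() and blk_free_devt(), and the sketch ignores locking entirely.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SLOTS 8

/* Hypothetical stand-in for hd_struct. */
struct part {
    int partno;
};

/* Hypothetical stand-in for ext_devt_idr: the slot index plays the minor. */
struct devt_table {
    struct part *slot[TABLE_SLOTS];
    int in_use[TABLE_SLOTS];    /* 1 while the number is reserved */
};

/* Reserve a number and publish the pointer; returns the number or -1. */
static int table_alloc(struct devt_table *t, struct part *p)
{
    for (int i = 0; i < TABLE_SLOTS; i++) {
        if (!t->in_use[i]) {
            t->in_use[i] = 1;
            t->slot[i] = p;
            return i;
        }
    }
    return -1;
}

/* Step 1, at the start of teardown: hide the pointer, keep the number. */
static void table_invalidate(struct devt_table *t, int id)
{
    assert(id >= 0 && id < TABLE_SLOTS && t->in_use[id]);
    t->slot[id] = NULL;
}

/* Step 2, at final release: make the number available for reuse. */
static void table_free(struct devt_table *t, int id)
{
    assert(id >= 0 && id < TABLE_SLOTS && t->in_use[id]);
    t->slot[id] = NULL;
    t->in_use[id] = 0;
}

/* Lookup as an opener would do it: NULL means absent or being torn down. */
static struct part *table_lookup(const struct devt_table *t, int id)
{
    if (id < 0 || id >= TABLE_SLOTS || !t->in_use[id])
        return NULL;
    return t->slot[id];
}
```

Between `table_invalidate()` and `table_free()` a lookup returns NULL, so a late opener can no longer reach the object, yet the number stays reserved and cannot be handed to a new device before the old one is fully released.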

block: disk_events: introduce event flags

2019-04-12T19:35:24Z

Currently, an empty disk->events field tells the block layer not to forward media change events to user space. This was done in commit 7c88a168da80 ("block: don't propagate unlisted DISK_EVENTs to userland") in order to avoid events from "fringe" drivers being forwarded to user space. By doing so, the block layer lost the information about which events were supported by a particular block device and, most importantly, whether or not a given device supports media change events at all.

Prepare for not interpreting the "events" field this way in the future any more. This is done by adding an additional field "event_flags" to struct gendisk, and two flag bits that can be set to have the device treated like one that had the "events" field set to a non-zero value before. This applies only to the sd and sr drivers, which are changed to set the new flags.

The new flags are DISK_EVENT_FLAG_POLL, to enforce polling of the device for synchronous events, and DISK_EVENT_FLAG_UEVENT, to tell the block layer to generate udev events from kernel events.

In order to add the event_flags field to struct gendisk, the events field is converted to an "unsigned short"; it doesn't need to hold values bigger than 2 anyway.

This patch doesn't change behavior.

Reviewed-by: Christoph Hellwig
Signed-off-by: Martin Wilck
Signed-off-by: Jens Axboe
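As a rough illustration of how the two fields could work together, here is a minimal user-space sketch. The struct `disk_sketch`, the helper names, the two event names and all bit values are assumptions made for the example; the header holds the real definitions, and the real decision logic lives in the block layer.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed event bits, for illustration only. */
#define DISK_EVENT_MEDIA_CHANGE   (1 << 0)
#define DISK_EVENT_EJECT_REQUEST  (1 << 1)

/* Assumed flag bits, for illustration only. */
#define DISK_EVENT_FLAG_POLL      (1 << 0)
#define DISK_EVENT_FLAG_UEVENT    (1 << 1)

/* Hypothetical cut-down view of the two gendisk fields. */
struct disk_sketch {
    unsigned short events;       /* events the device supports */
    unsigned short event_flags;  /* how the block layer treats them */
};

/* Poll only devices that support events and asked for polling. */
static bool disk_should_poll(const struct disk_sketch *d)
{
    return d->events != 0 && (d->event_flags & DISK_EVENT_FLAG_POLL);
}

/* Of the pending events, return those that may be sent as uevents. */
static unsigned int disk_uevent_mask(const struct disk_sketch *d,
                                     unsigned int pending)
{
    if (!(d->event_flags & DISK_EVENT_FLAG_UEVENT))
        return 0;
    return pending & d->events;
}
```

With this split, a driver can list the events it supports in `events` without that alone causing polling or uevents; the behaviour is opted into separately through `event_flags`.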

block: genhd: remove async_events field

2019-04-12T19:35:22Z

The async_events field, intended to be used for drivers that support asynchronous notifications about disk events (aka media change events), isn't currently used by any driver, and apparently that has been that way for a long time (if not forever). Remove it.

Reviewed-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Signed-off-by: Martin Wilck
Signed-off-by: Jens Axboe

block: remove CONFIG_LBDAF

2019-04-06T16:48:35Z

Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit architectures. These types are required to support block device and/or file sizes larger than 2 TiB, and have generally defaulted to on for a long time. Enabling the option only increases the i386 tinyconfig size by 145 bytes, and many data structures already always use 64-bit values for their in-core and on-disk data structures anyway, so there should not be a large change in dynamic memory usage either.

Dropping this option removes a somewhat weird non-default config that has caused various bugs or compiler warnings when actually used.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
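The 2 TiB figure follows from counting 512-byte sectors in a 32-bit value: 2^32 sectors times 2^9 bytes is 2^41 bytes. A small sketch of that arithmetic, where the function name is hypothetical and the 512-byte sector size is an assumption of the example:

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9  /* assumes 512-byte sectors */

/* Bytes covered by all sector numbers representable in 'bits' bits. */
static uint64_t max_bytes_for_sector_bits(unsigned int bits)
{
    /* Keep the result inside 64 bits. */
    assert(bits > 0 && bits <= 64 - SECTOR_SHIFT);
    return ((uint64_t)1 << bits) << SECTOR_SHIFT;
}
```

Calling `max_bytes_for_sector_bits(32)` gives 2199023255552 bytes, which is 2 TiB; anything beyond that needs a wider sector type.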

block: return just one value from part_in_flight

2018-12-10T15:30:38Z

The previous patches deleted all the code that needed the second value returned from part_in_flight - now the kernel only uses the first value. Consequently, part_in_flight (and blk_mq_in_flight) may be changed so that it only returns one value.

This patch just refactors the code; there's no functional change.

Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe

block: switch to per-cpu in-flight counters

2018-12-10T15:30:37Z

Now that part_round_stats is gone, we can switch to per-cpu in-flight counters.

We use the local-atomic type local_t, so that if part_inc_in_flight or part_dec_in_flight is reentrantly called from an interrupt, the value will be correct.

The other counters could be corrupted by a reentrant interrupt, but the corruption only results in slight counter skew - the in_flight counter must be exact, so it needs local_t.

Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe
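The idea of per-cpu in-flight counting can be sketched in user space as one counter per CPU, summed on read. This is only an illustration: `struct part_sketch`, the helper names and the fixed CPU count are hypothetical, and a plain `long` does not give the interrupt-safety that local_t provides in the kernel.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 4  /* hypothetical fixed CPU count */

struct part_sketch {
    long in_flight[NR_CPUS_SKETCH];  /* a single entry may go negative */
};

/* An I/O is started on 'cpu'. */
static void part_inc(struct part_sketch *p, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
    p->in_flight[cpu]++;
}

/* An I/O completes on 'cpu', which may differ from where it started. */
static void part_dec(struct part_sketch *p, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
    p->in_flight[cpu]--;
}

/* Only the sum over all CPUs is meaningful. */
static long part_in_flight_sum(const struct part_sketch *p)
{
    long sum = 0;

    for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        sum += p->in_flight[cpu];
    return sum;
}
```

Because an I/O can start on one CPU and complete on another, individual entries can be negative; each increment and decrement must still be exact for the total to be right, which is why the message insists on an exact counter type.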

block: delete part_round_stats and switch to less precise counting

2018-12-10T15:30:37Z

We want to convert to per-cpu in_flight counters.

The function part_round_stats needs the in_flight counter every jiffy; it would be too costly to sum all the percpu variables every jiffy, so it must be deleted. part_round_stats is used to calculate two counters - time_in_queue and io_ticks.

time_in_queue can be calculated without part_round_stats, by adding the duration of the I/O when the I/O ends (the value is almost as exact as the previously calculated value, except that time for in-progress I/Os is not counted).

io_ticks can be approximated by increasing the value when I/O is started or ended and the jiffies value has changed. If the I/Os take less than a jiffy, the value is as exact as the previously calculated value. If the I/Os take more than a jiffy, io_ticks can drift behind the previously calculated value.

Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe
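The two approximations described above can be sketched in user space as follows. The struct and function names are hypothetical, time is passed in as a plain jiffies-like number, and nothing here is atomic, so this is only a model of the counting rule, not of the kernel code.

```c
#include <assert.h>

struct stats_sketch {
    unsigned long stamp;          /* jiffies value at the last update */
    unsigned long io_ticks;       /* approximate busy time */
    unsigned long time_in_queue;  /* summed durations of finished I/Os */
};

/* Bump io_ticks once if the jiffies value moved since the last update. */
static void update_io_ticks(struct stats_sketch *s, unsigned long now)
{
    if (s->stamp != now) {
        s->stamp = now;
        s->io_ticks++;
    }
}

/* Called when an I/O is started. */
static void io_start(struct stats_sketch *s, unsigned long now)
{
    update_io_ticks(s, now);
}

/* Called when an I/O ends; 'start' is the time it was started. */
static void io_end(struct stats_sketch *s, unsigned long start,
                   unsigned long now)
{
    update_io_ticks(s, now);
    s->time_in_queue += now - start;
}
```

An I/O that starts and ends within the same jiffy adds one tick. An I/O that runs for ten jiffies adds only two ticks (one at start, one at end), which is the drift the message mentions, while time_in_queue still receives the full ten.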

block: stop passing 'cpu' to all percpu stats methods

2018-12-10T15:30:37Z

All of part_stat_* and related methods are used with preempt disabled, so there is no need to pass cpu around to all of them. Just call smp_processor_id() as needed.

Suggested-by: Jens Axboe
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe

block: use rcu_work instead of call_rcu to avoid sleep in softirq

2018-11-28T16:08:27Z

We recently got a stack trace from syzkaller like this:

BUG: sleeping function called from invalid context at mm/slab.h:361
in_atomic(): 1, irqs_disabled(): 0, pid: 6644, name: blkid
INFO: lockdep is turned off.
CPU: 1 PID: 6644 Comm: blkid Not tainted 4.4.163-514.55.6.9.x86_64+ #76
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
 0000000000000000 5ba6a6b879e50c00 ffff8801f6b07b10 ffffffff81cb2194
 0000000041b58ab3 ffffffff833c7745 ffffffff81cb2080 5ba6a6b879e50c00
 0000000000000000 0000000000000001 0000000000000004 0000000000000000
Call Trace:
 [] __dump_stack lib/dump_stack.c:15 [inline]
 [] dump_stack+0x114/0x1a0 lib/dump_stack.c:51
 [] ___might_sleep+0x291/0x490 kernel/sched/core.c:7675
 [] __might_sleep+0xb3/0x270 kernel/sched/core.c:7637
 [] slab_pre_alloc_hook mm/slab.h:361 [inline]
 [] slab_alloc_node mm/slub.c:2610 [inline]
 [] slab_alloc mm/slub.c:2692 [inline]
 [] kmem_cache_alloc_trace+0x2c3/0x5c0 mm/slub.c:2709
 [] kmalloc include/linux/slab.h:479 [inline]
 [] kzalloc include/linux/slab.h:623 [inline]
 [] kobject_uevent_env+0x2c7/0x1150 lib/kobject_uevent.c:227
 [] kobject_uevent+0x1f/0x30 lib/kobject_uevent.c:374
 [] kobject_cleanup lib/kobject.c:633 [inline]
 [] kobject_release+0x229/0x440 lib/kobject.c:675
 [] kref_sub include/linux/kref.h:73 [inline]
 [] kref_put include/linux/kref.h:98 [inline]
 [] kobject_put+0x72/0xd0 lib/kobject.c:692
 [] put_device+0x25/0x30 drivers/base/core.c:1237
 [] delete_partition_rcu_cb+0x1d4/0x2f0 block/partition-generic.c:232
 [] __rcu_reclaim kernel/rcu/rcu.h:118 [inline]
 [] rcu_do_batch kernel/rcu/tree.c:2705 [inline]
 [] invoke_rcu_callbacks kernel/rcu/tree.c:2973 [inline]
 [] __rcu_process_callbacks kernel/rcu/tree.c:2940 [inline]
 [] rcu_process_callbacks+0x59c/0x1c70 kernel/rcu/tree.c:2957
 [] __do_softirq+0x299/0xe20 kernel/softirq.c:273
 [] invoke_softirq kernel/softirq.c:350 [inline]
 [] irq_exit+0x216/0x2c0 kernel/softirq.c:391
 [] exiting_irq arch/x86/include/asm/apic.h:652 [inline]
 [] smp_apic_timer_interrupt+0x8b/0xc0 arch/x86/kernel/apic/apic.c:926
 [] apic_timer_interrupt+0xa5/0xb0 arch/x86/entry/entry_64.S:746
 [] ? audit_kill_trees+0x180/0x180
 [] fd_install+0x57/0x80 fs/file.c:626
 [] do_sys_open+0x45e/0x550 fs/open.c:1043
 [] SYSC_open fs/open.c:1055 [inline]
 [] SyS_open+0x32/0x40 fs/open.c:1050
 [] entry_SYSCALL_64_fastpath+0x1e/0x9a

In softirq context, we call the RCU callback function delete_partition_rcu_cb(), which may allocate memory by kzalloc with the GFP_KERNEL flag. If the allocation cannot be satisfied immediately, it may sleep. However, that is not allowed in softirq context.

Although we found this problem on Linux 4.4, the latest kernel version seems to have this problem as well. And it is very similar to the previous one:
	https://lkml.org/lkml/2018/7/9/391

Fix it by using an RCU workqueue, which allows sleeping.

Reviewed-by: Paul E. McKenney
Signed-off-by: Yufen Yu
Signed-off-by: Jens Axboe

Merge tag 'v4.19-rc6' into for-4.20/block

2018-10-01T14:58:57Z

Merge -rc6 in, for two reasons:

1) Resolve a trivial conflict in the blk-mq-tag.c documentation
2) A few important regression fixes went into upstream directly, so they aren't in the 4.20 branch.

Signed-off-by: Jens Axboe

* tag 'v4.19-rc6': (780 commits)
  Linux 4.19-rc6
  MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
  cpufreq: qcom-kryo: Fix section annotations
  perf/core: Add sanity check to deal with pinned event failure
  xen/blkfront: correct purging of persistent grants
  Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
  selftests/powerpc: Fix Makefiles for headers_install change
  blk-mq: I/O and timer unplugs are inverted in blktrace
  dax: Fix deadlock in dax_lock_mapping_entry()
  x86/boot: Fix kexec booting failure in the SEV bit detection code
  bcache: add separate workqueue for journal_write to avoid deadlock
  drm/amd/display: Fix Edid emulation for linux
  drm/amd/display: Fix Vega10 lightup on S3 resume
  drm/amdgpu: Fix vce work queue was not cancelled when suspend
  Revert "drm/panel: Add device_link from panel device to DRM device"
  xen/blkfront: When purging persistent grants, keep them in the buffer
  clocksource/drivers/timer-atmel-pit: Properly handle error cases
  block: fix deadline elevator drain for zoned block devices
  ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
  drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
  ...

Signed-off-by: Jens Axboe