user/sven/linux.git/block, branch v5.1.14

blk-mq: Fix memory leak in error handling

2019-06-22T06:09:14Z

[ Upstream commit 41de54c64811bf087c8464fdeb43c6ad8be2686b ] If blk_mq_init_allocated_queue() fails, make sure to free the poll stat callback struct allocated. Signed-off-by: Jes Sorensen Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

block, bfq: increase idling for weight-raised queues

2019-06-15T09:53:04Z

[ Upstream commit 778c02a236a8728bb992de10ed1f12c0be5b7b0e ] If a sync bfq_queue has a higher weight than some other queue, and remains temporarily empty while in service, then, to preserve the bandwidth share of the queue, it is necessary to plug I/O dispatching until a new request arrives for the queue. In addition, a timeout needs to be set, to avoid waiting for ever if the process associated with the queue has actually finished its I/O. Even with the above timeout, the device is however not fed with new I/O for a while, if the process has finished its I/O. If this happens often, then throughput drops and latencies grow. For this reason, the timeout is kept rather low: 8 ms is the current default. Unfortunately, such a low value may cause, on the opposite end, a violation of bandwidth guarantees for a process that happens to issue new I/O too late. The higher the system load, the higher the probability that this happens to some process. This is a problem in scenarios where service guarantees matter more than throughput. One important case are weight-raised queues, which need to be granted a very high fraction of the bandwidth. To address this issue, this commit lower-bounds the plugging timeout for weight-raised queues to 20 ms. This simple change provides relevant benefits. For example, on a PLEXTOR PX-256M5S, with which gnome-terminal starts in 0.6 seconds if there is no other I/O in progress, the same applications starts in - 0.8 seconds, instead of 1.2 seconds, if ten files are being read sequentially in parallel - 1 second, instead of 2 seconds, if, in parallel, five files are being read sequentially, and five more files are being written sequentially Tested-by: Holger Hoffstätte Tested-by: Oleksandr Natalenko Signed-off-by: Paolo Valente Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

blk-mq: move cancel of requeue_work into blk_mq_release

2019-06-15T09:52:58Z

[ Upstream commit fbc2a15e3433058582e5635aabe48a3011a644a8 ] With holding queue's kobject refcount, it is safe for driver to schedule requeue. However, blk_mq_kick_requeue_list() may be called after blk_sync_queue() is done because of concurrent requeue activities, then requeue work may not be completed when freeing queue, and kernel oops is triggered. So moving the cancel of requeue_work into blk_mq_release() for avoiding race between requeue and freeing queue. Cc: Dongli Zhang Cc: James Smart Cc: Bart Van Assche Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen , Cc: Christoph Hellwig , Cc: James E . J . Bottomley , Reviewed-by: Bart Van Assche Reviewed-by: Johannes Thumshirn Reviewed-by: Hannes Reinecke Reviewed-by: Christoph Hellwig Tested-by: James Smart Signed-off-by: Ming Lei Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

block: pass page to xen_biovec_phys_mergeable

2019-05-31T13:43:44Z

[ Upstream commit 0383ad4374f7ad7edd925a2ee4753035c3f5508a ] xen_biovec_phys_mergeable() only needs .bv_page of the 2nd bio bvec for checking if the two bvecs can be merged, so pass page to xen_biovec_phys_mergeable() directly. No function change. Cc: ris Ostrovsky Cc: Juergen Gross Cc: xen-devel@lists.xenproject.org Cc: Omar Sandoval Cc: Christoph Hellwig Reviewed-by: Christoph Hellwig Reviewed-by: Boris Ostrovsky Signed-off-by: Ming Lei Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

block: avoid to break XEN by multi-page bvec

2019-05-31T13:43:43Z

[ Upstream commit db5ebd6edd2627d7e81a031643cf43587f63e66c ] XEN has special page merge requirement, see xen_biovec_phys_mergeable(). We can't merge pages into one bvec simply for XEN. So move XEN's specific check on page merge into __bio_try_merge_page(), then abvoid to break XEN by multi-page bvec. Cc: ris Ostrovsky Cc: xen-devel@lists.xenproject.org Cc: Omar Sandoval Cc: Christoph Hellwig Reviewed-by: Juergen Gross Signed-off-by: Ming Lei Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

block: sed-opal: fix IOC_OPAL_ENABLE_DISABLE_MBR

2019-05-31T13:43:36Z

[ Upstream commit 78bf47353b0041865564deeed257a54f047c2fdc ] The implementation of IOC_OPAL_ENABLE_DISABLE_MBR handled the value opal_mbr_data.enable_disable incorrectly: enable_disable is expected to be one of OPAL_MBR_ENABLE(0) or OPAL_MBR_DISABLE(1). enable_disable was passed directly to set_mbr_done and set_mbr_enable_disable where is was interpreted as either OPAL_TRUE(1) or OPAL_FALSE(0). The end result was that calling IOC_OPAL_ENABLE_DISABLE_MBR with OPAL_MBR_ENABLE actually disabled the shadow MBR and vice versa. This patch adds correct conversion from OPAL_MBR_DISABLE/ENABLE to OPAL_FALSE/TRUE. The change affects existing programs using IOC_OPAL_ENABLE_DISABLE_MBR but this is typically used only once when setting up an Opal drive. Acked-by: Jon Derrick Reviewed-by: Christoph Hellwig Reviewed-by: Scott Bauer Signed-off-by: David Kozub Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

block: fix use-after-free on gendisk

2019-05-31T13:43:28Z

[ Upstream commit 2c88e3c7ec32d7a40cc7c9b4a487cf90e4671bdd ] commit 2da78092dda "block: Fix dev_t minor allocation lifetime" specifically moved blk_free_devt(dev->devt) call to part_release() to avoid reallocating device number before the device is fully shutdown. However, it can cause use-after-free on gendisk in get_gendisk(). We use md device as example to show the race scenes: Process1 Worker Process2 md_free blkdev_open del_gendisk add delete_partition_work_fn() to wq __blkdev_get get_gendisk put_disk disk_release kfree(disk) find part from ext_devt_idr get_disk_and_module(disk) cause use after free delete_partition_work_fn put_device(part) part_release remove part from ext_devt_idr Before is removed from ext_devt_idr by delete_partition_work_fn(), we can find the devt and then access gendisk by hd_struct pointer. But, if we access the gendisk after it have been freed, it can cause in use-after-freeon gendisk in get_gendisk(). We fix this by adding a new helper blk_invalidate_devt() in delete_partition() and del_gendisk(). It replaces hd_struct pointer in idr with value 'NULL', and deletes the entry from idr in part_release() as we do now. Thanks to Jan Kara for providing the solution and more clear comments for the code. Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime") Cc: Al Viro Reviewed-by: Bart Van Assche Reviewed-by: Keith Busch Reviewed-by: Jan Kara Suggested-by: Jan Kara Signed-off-by: Yufen Yu Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

blk-mq: grab .q_usage_counter when queuing request from plug code path

2019-05-31T13:43:15Z

[ Upstream commit e87eb301bee183d82bb3d04bd71b6660889a2588 ] Just like aio/io_uring, we need to grab 2 refcount for queuing one request, one is for submission, another is for completion. If the request isn't queued from plug code path, the refcount grabbed in generic_make_request() serves for submission. In theroy, this refcount should have been released after the sumission(async run queue) is done. blk_freeze_queue() works with blk_sync_queue() together for avoiding race between cleanup queue and IO submission, given async run queue activities are canceled because hctx->run_work is scheduled with the refcount held, so it is fine to not hold the refcount when running the run queue work function for dispatch IO. However, if request is staggered into plug list, and finally queued from plug code path, the refcount in submission side is actually missed. And we may start to run queue after queue is removed because the queue's kobject refcount isn't guaranteed to be grabbed in flushing plug list context, then kernel oops is triggered, see the following race: blk_mq_flush_plug_list(): blk_mq_sched_insert_requests() insert requests to sw queue or scheduler queue blk_mq_run_hw_queue Because of concurrent run queue, all requests inserted above may be completed before calling the above blk_mq_run_hw_queue. Then queue can be freed during the above blk_mq_run_hw_queue(). Fixes the issue by grab .q_usage_counter before calling blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This way is safe because the queue is absolutely alive before inserting request. Cc: Dongli Zhang Cc: James Smart Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen , Cc: Christoph Hellwig , Cc: James E . J . Bottomley , Reviewed-by: Bart Van Assche Tested-by: James Smart Signed-off-by: Ming Lei Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

blk-mq: split blk_mq_alloc_and_init_hctx into two parts

2019-05-31T13:43:15Z

[ Upstream commit 7c6c5b7c9186e3fb5b10afb8e5f710ae661144c6 ] Split blk_mq_alloc_and_init_hctx into two parts, and one is blk_mq_alloc_hctx() for allocating all hctx resources, another is blk_mq_init_hctx() for initializing hctx, which serves as counter-part of blk_mq_exit_hctx(). Cc: Dongli Zhang Cc: James Smart Cc: Bart Van Assche Cc: linux-scsi@vger.kernel.org Cc: Martin K . Petersen Cc: Christoph Hellwig Cc: James E . J . Bottomley Reviewed-by: Hannes Reinecke Reviewed-by: Christoph Hellwig Tested-by: James Smart Signed-off-by: Ming Lei Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

blk-mq: free hw queue's resource in hctx's release handler

2019-05-25T16:16:18Z

commit c7e2d94b3d1634988a95ac4d77a72dc7487ece06 upstream. Once blk_cleanup_queue() returns, tags shouldn't be used any more, because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2 ("blk-mq: Fix a use-after-free") fixes this issue exactly. However, that commit introduces another issue. Before 45a9c9d909b2, we are allowed to run queue during cleaning up queue if the queue's kobj refcount is held. After that commit, queue can't be run during queue cleaning up, otherwise oops can be triggered easily because some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue(). We have invented ways for addressing this kind of issue before, such as: 8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done") c2856ae2f315 ("blk-mq: quiesce queue before freeing queue") But still can't cover all cases, recently James reports another such kind of issue: https://marc.info/?l=linux-scsi&m=155389088124782&w=2 This issue can be quite hard to address by previous way, given scsi_run_queue() may run requeues for other LUNs. Fixes the above issue by freeing hctx's resources in its release handler, and this way is safe becasue tags isn't needed for freeing such hctx resource. This approach follows typical design pattern wrt. kobject's release handler. Cc: Dongli Zhang Cc: James Smart Cc: Bart Van Assche Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen , Cc: Christoph Hellwig , Cc: James E . J . Bottomley , Reported-by: James Smart Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free") Cc: stable@vger.kernel.org Reviewed-by: Hannes Reinecke Reviewed-by: Christoph Hellwig Tested-by: James Smart Signed-off-by: Ming Lei Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman