user/sven/linux.git/block, branch v3.17

blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe

2014-09-24T14:29:36Z

blk-mq uses percpu_ref for its usage counter, which tracks the number of in-flight commands and is used to synchronously drain the queue on freeze. percpu_ref shutdown takes measurable wallclock time as it involves a sched RCU grace period. This means that draining a blk-mq takes measurable wallclock time. One would think that this shouldn't matter as queue shutdown should be a rare event which takes place asynchronously w.r.t. userland. Unfortunately, SCSI probing involves synchronously setting up and then tearing down a lot of request_queues back-to-back for non-existent LUNs. This means that SCSI probing may take more than ten seconds when scsi-mq is used. This will be properly fixed by implementing a mechanism to keep q->mq_usage_counter in atomic mode till genhd registration; however, that involves rather big updates to percpu_ref which are difficult to apply late in the devel cycle (v3.17-rc6 at the moment). As a stop-gap measure till the proper fix can be implemented in the next cycle, this patch introduces __percpu_ref_kill_expedited() and makes blk_mq_freeze_queue() use it. This is heavy-handed but should work for testing the experimental SCSI blk-mq implementation. Signed-off-by: Tejun Heo Reported-by: Christoph Hellwig Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de Fixes: add703fda981 ("blk-mq: use percpu_ref for mq usage count") Cc: Kent Overstreet Cc: Jens Axboe Tested-by: Christoph Hellwig Signed-off-by: Jens Axboe

genhd: fix leftover might_sleep() in blk_free_devt()

2014-09-22T20:45:45Z

Commit 2da78092 changed the locking from a mutex to a spinlock, so we no longer sleep in this context. But there was a leftover might_sleep() in there, which now triggers since we do the final free from an RCU callback. Get rid of it. Reported-by: Pontus Fuchs Signed-off-by: Jens Axboe

blk-mq: use blk_mq_start_hw_queues() when running requeue work

2014-09-22T17:55:56Z

When requests are retried due to hw or sw resource shortages, we often stop the associated hardware queue. So ensure that we restart the queues when running the requeue work, otherwise the queue run will be a no-op. Signed-off-by: Jens Axboe
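
The effect described above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: `struct sketch_hctx` and every `sketch_*` function are hypothetical stand-ins, not the kernel's hardware-queue context or the real blk-mq helpers. It shows why running a stopped queue does nothing and why the requeue work has to start the queues instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one hardware queue; not the kernel structure. */
struct sketch_hctx {
    bool stopped;            /* set when a resource shortage stopped the queue */
    unsigned int queued;     /* requests waiting to be dispatched */
    unsigned int dispatched; /* requests handed to the "hardware" */
};

/* Running a stopped queue is a no-op, which is the problem being fixed. */
static void sketch_run_hw_queue(struct sketch_hctx *h)
{
    if (h->stopped)
        return;
    h->dispatched += h->queued;
    h->queued = 0;
}

/* Starting a queue clears the stopped state and then runs it. */
static void sketch_start_hw_queue(struct sketch_hctx *h)
{
    h->stopped = false;
    sketch_run_hw_queue(h);
}

/* Requeue work: put the retried requests back, then start every queue. */
static void sketch_requeue_work(struct sketch_hctx *hctxs, size_t nr,
                                unsigned int requeued_per_queue)
{
    assert(hctxs != NULL);
    for (size_t i = 0; i < nr; i++) {
        hctxs[i].queued += requeued_per_queue;
        sketch_start_hw_queue(&hctxs[i]);
    }
}
```

Calling `sketch_run_hw_queue()` on a stopped queue leaves its requests in place, while `sketch_requeue_work()` clears the stopped state first, so the requeued requests are dispatched.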

blk-mq: fix potential oops on out-of-memory in __blk_mq_alloc_rq_maps()

2014-09-22T17:55:23Z

__blk_mq_alloc_rq_maps() can be invoked multiple times when we scale back the queue depth because we are low on memory. So don't clear set->tags when we fail; this is handled directly in the parent function, blk_mq_alloc_tag_set(). Reported-by: Robert Elliott Signed-off-by: Jens Axboe

blk-mq: avoid infinite recursion with the FUA flag

2014-09-22T17:55:19Z

We should not insert requests into the flush state machine from blk_mq_insert_request. All incoming flush requests come through blk_{m,s}q_make_request and are handled there, while blk_execute_rq_nowait should only be called for BLOCK_PC requests. All other callers deal with requests that already went through the flush state machine and shouldn't be reinserted into it. Reported-by: Robert Elliott Debugged-by: Ming Lei Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe

blk-mq: Avoid race condition with uninitialized requests

2014-09-22T17:55:14Z

This patch should fix the bug reported in https://lkml.org/lkml/2014/9/11/249. We have to initialize at least the atomic_flags and the cmd_flags when allocating storage for the requests. Otherwise blk_mq_timeout_check() might dereference uninitialized pointers when racing with the creation of a request. Also move the reset of cmd_flags from the initializing code to the point where a request is freed. So we will never end up with pending flush request indicators that might trigger dereferences of invalid pointers in blk_mq_timeout_check(). Cc: stable@vger.kernel.org Signed-off-by: David Hildenbrand Reported-by: Paulo De Rezende Pinatti Tested-by: Paulo De Rezende Pinatti Acked-by: Christian Borntraeger Signed-off-by: Jens Axboe

blk-mq: request deadline must be visible before marking rq as started

2014-09-22T17:54:04Z

When we start the request, we set the deadline and flip the bits marking the request as started and non-complete. However, it's important that the deadline store is ordered before flipping the bits, otherwise we could have a small window where the request is marked started but with an invalid deadline. This can confuse the timeout handling. Suggested-by: Ming Lei Signed-off-by: Jens Axboe
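
The ordering requirement can be illustrated with C11 atomics in userspace. This is a sketch under assumptions, not the kernel code: `struct sketch_rq` and the `sketch_*` functions are hypothetical, a release store and acquire load stand in for whatever barrier the kernel patch uses, and the plain `>=` comparison ignores the counter wraparound that real jiffies arithmetic has to handle.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a request; not the kernel's struct request. */
struct sketch_rq {
    unsigned long deadline; /* plain data, published by the flag below */
    atomic_bool started;    /* models the "started" bit */
};

/*
 * Writer: store the deadline first, then set "started" with release
 * semantics so the deadline store cannot be reordered after the flag.
 */
static void sketch_start_request(struct sketch_rq *rq, unsigned long now,
                                 unsigned long timeout)
{
    assert(rq != NULL);
    rq->deadline = now + timeout;
    atomic_store_explicit(&rq->started, true, memory_order_release);
}

/*
 * Reader (timeout scan): the acquire load pairs with the release store,
 * so a reader that sees "started" also sees the deadline written before it.
 */
static bool sketch_rq_timed_out(struct sketch_rq *rq, unsigned long now)
{
    assert(rq != NULL);
    if (!atomic_load_explicit(&rq->started, memory_order_acquire))
        return false;
    return now >= rq->deadline;
}
```

With the two stores in the opposite order, a concurrent scan could observe `started` set while `deadline` still holds a stale value, which is the window the commit closes.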

blk-mq: scale depth and rq map appropriately if low on memory

2014-09-10T15:02:03Z

If we are running in a kdump environment, resources are scarce. For some SCSI setups with a huge set of shared tags, we run out of memory allocating what the driver is asking for. So implement a scale back logic to reduce the tag depth for those cases, allowing the driver to successfully load. We should extend this to detect low memory situations, and implement a sane fallback for those (1 queue, 64 tags, or something like that). Tested-by: Robert Elliott Signed-off-by: Jens Axboe
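
A scale-back loop of this kind might look roughly like the following userspace sketch. All names here (`struct sketch_tag_set`, `sketch_try_alloc`, `sketch_alloc_scaled`, `sketch_limited_alloc`) are hypothetical, and halving the depth on each failure is an assumption made for illustration; the commit message only says the depth is reduced. The inner attempt deliberately leaves cleanup to its caller, since it can be invoked several times.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator hook so callers can simulate memory pressure. */
typedef void *(*sketch_alloc_fn)(size_t bytes);

/* Hypothetical model of a tag set; not the kernel's blk_mq_tag_set. */
struct sketch_tag_set {
    unsigned int depth; /* requested depth, lowered in place on fallback */
    void *rq_map;       /* NULL until an allocation succeeds */
};

/* One attempt at the current depth. On failure the set is left untouched. */
static int sketch_try_alloc(struct sketch_tag_set *set, sketch_alloc_fn alloc,
                            size_t rq_size)
{
    void *map = alloc((size_t)set->depth * rq_size);

    if (!map)
        return -1;
    set->rq_map = map;
    return 0;
}

/* Halve the depth until an attempt succeeds or the floor is crossed. */
static int sketch_alloc_scaled(struct sketch_tag_set *set,
                               sketch_alloc_fn alloc, size_t rq_size,
                               unsigned int min_depth)
{
    assert(min_depth > 0);
    while (set->depth >= min_depth) {
        if (sketch_try_alloc(set, alloc, rq_size) == 0)
            return 0;
        set->depth /= 2;
    }
    return -1;
}

/* Example allocator that refuses more than 4096 bytes, mimicking kdump. */
static void *sketch_limited_alloc(size_t bytes)
{
    return bytes > 4096 ? NULL : malloc(bytes);
}
```

With `sketch_limited_alloc` and 64-byte requests, a set that asks for a depth of 1024 ends up with a depth of 64, the first depth whose map fits in 4096 bytes.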

Block: fix unbalanced bypass-disable in blk_register_queue

2014-09-09T16:44:24Z

When a queue is registered, the block layer turns off the bypass setting (because bypass is enabled when the queue is created). This doesn't work well for queues that are unregistered and then registered again; we get a WARNING because of the unbalanced calls to blk_queue_bypass_end(). This patch fixes the problem by making blk_register_queue() call blk_queue_bypass_end() only the first time the queue is registered. Signed-off-by: Alan Stern Acked-by: Tejun Heo CC: James Bottomley CC: Jens Axboe Signed-off-by: Jens Axboe
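
The "only the first time" guard can be sketched as a once-flag in userspace C. Everything below is a hypothetical model: `struct sketch_queue`, the `init_done` field, and the `sketch_*` functions are illustrative names, and the assertion in `sketch_bypass_end()` merely stands in for the kernel WARNING on an unbalanced call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a request queue; not the kernel structure. */
struct sketch_queue {
    int bypass_depth; /* 1 at creation: bypass starts enabled */
    bool init_done;   /* hypothetical flag: bypass already ended once */
    bool registered;
};

static void sketch_queue_init(struct sketch_queue *q)
{
    q->bypass_depth = 1;
    q->init_done = false;
    q->registered = false;
}

/* An unbalanced call would trip this check, modelling the WARNING. */
static void sketch_bypass_end(struct sketch_queue *q)
{
    assert(q->bypass_depth > 0);
    q->bypass_depth--;
}

/* End bypass only on the first registration; later ones skip it. */
static void sketch_register_queue(struct sketch_queue *q)
{
    assert(q != NULL);
    if (!q->init_done) {
        q->init_done = true;
        sketch_bypass_end(q);
    }
    q->registered = true;
}

static void sketch_unregister_queue(struct sketch_queue *q)
{
    q->registered = false;
}
```

Registering, unregistering, and registering again leaves `bypass_depth` at 0 instead of driving it negative.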

block: Fix dev_t minor allocation lifetime

2014-09-03T21:01:02Z

Releases the dev_t minor when all references are closed to prevent another device from acquiring the same major/minor. Since the partition's release may be invoked from call_rcu's soft-irq context, the ext_dev_idr's mutex had to be replaced with a spinlock so as not to sleep. Signed-off-by: Keith Busch Cc: stable@kernel.org Signed-off-by: Jens Axboe
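
The lifetime rule can be modelled in userspace C as a reference count whose final put returns the minor to the pool. This is a sketch under assumptions: the fixed-size table replaces the kernel's IDR, the `atomic_flag` busy-wait lock only imitates a spinlock, and all `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MINORS 8

/* Busy-wait lock standing in for a spinlock: taking it never sleeps. */
static atomic_flag sketch_lock = ATOMIC_FLAG_INIT;
static bool sketch_minor_used[SKETCH_MINORS];

static void sketch_spin_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&sketch_lock,
                                             memory_order_acquire))
        ; /* spin */
}

static void sketch_spin_unlock(void)
{
    atomic_flag_clear_explicit(&sketch_lock, memory_order_release);
}

/* Hypothetical model of a partition holding a minor number. */
struct sketch_part {
    int minor;
    atomic_int refs;
};

/* Returns the lowest free minor, or -1 if none is left. */
static int sketch_alloc_minor(void)
{
    int minor = -1;

    sketch_spin_lock();
    for (int i = 0; i < SKETCH_MINORS; i++) {
        if (!sketch_minor_used[i]) {
            sketch_minor_used[i] = true;
            minor = i;
            break;
        }
    }
    sketch_spin_unlock();
    return minor;
}

static void sketch_free_minor(int minor)
{
    assert(minor >= 0 && minor < SKETCH_MINORS);
    sketch_spin_lock();
    sketch_minor_used[minor] = false;
    sketch_spin_unlock();
}

/* Creates the partition with one reference and a freshly allocated minor. */
static void sketch_part_init(struct sketch_part *p)
{
    p->minor = sketch_alloc_minor();
    atomic_init(&p->refs, 1);
}

static void sketch_part_get(struct sketch_part *p)
{
    atomic_fetch_add(&p->refs, 1);
}

/* The minor is released only when the last reference goes away. */
static void sketch_part_put(struct sketch_part *p)
{
    if (atomic_fetch_sub(&p->refs, 1) == 1) {
        sketch_free_minor(p->minor);
        p->minor = -1;
    }
}
```

While an opener still holds a reference, a new partition receives a different minor; the old number becomes reusable only after the final `sketch_part_put()`.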