|
The problem is that the queue lock currently lives in the SCSI device
structure, so when that structure is freed on device release, we go boom
if the queue tries to access the lock again. The fix is to move the lock
from the scsi_device to the queue.
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
|
|
This makes it hard(er) to mix argument orders by mistake for things like
kmalloc() and friends, since silent integer promotion is now caught by
sparse.
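A minimal sketch of the class of mistake this catches (the function and
values are made up for illustration):

#include <linux/slab.h>

static void *example_alloc(void)
{
        void *ok  = kmalloc(64, GFP_KERNEL);    /* size first, then gfp mask */
        void *bad = kmalloc(GFP_KERNEL, 64);    /* swapped args: sparse now
                                                   warns instead of silently
                                                   promoting the integers */
        kfree(bad);
        return ok;
}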
|
|
As promised to Andrew, here are the latest bits that fix up the block io
barrier handling.
- Add an io scheduler ->deactivate hook to tell the io scheduler that a
request has been suspended from the block layer. cfq and as need this hook.
- Locking updates
- Make sure a driver doesn't reuse the flush rq before a previous one
has completed
- Typo in the scsi_io_completion() function, the bit shift was wrong
- sd needs proper timeout on the flush
- remove silly debug leftover in ide-disk wrt "hdc"
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This reworks the core barrier support to be a lot nicer, so that all the
nasty code resides outside of drivers/ide. Supporting it in a driver
requires minimal changes; I've added SCSI support as an example. The
ide code is adapted to the new code.
With this patch, we support full barriers on sata now. Bart has acked the
addition to -mm, I would like for this to be submitted as soon as 2.6.12
opens.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This is needed for several things, including one in-tree user which I will
introduce after this patch.
This adds a ->end_io callback to struct request, so it can be used with
async io of any sort. Right now users have to wait for completion in a
blocking manner. In the next iteration, ->waiting can be folded into
->end_io_data since it is just a special case of that use.
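For instance, a caller that does not want to block could do something
like this (a sketch; the exact callback signature is assumed here):

#include <linux/blkdev.h>

/* hypothetical completion callback for a fire-and-forget request */
static void example_end_io(struct request *rq)
{
        /* clean up using rq->end_io_data, then drop the request */
        blk_put_request(rq);
}

static void example_submit_async(struct request *rq)
{
        rq->end_io = example_end_io;    /* called at completion time... */
        rq->end_io_data = NULL;         /* ...instead of waking a waiter */
        /* insert rq on the queue as usual and return immediately */
}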
From: Peter Osterlund <petero2@telia.com>
The problem is that the add-struct-request-end_io-callback patch forgot to
update pktcdvd.c. This patch fixes it.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Peter Osterlund <petero2@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I stumbled across this the other day. The block layer only uses a single
memory pool for request allocation, so it's very possible for eg writes
to have allocated them all at any point in time. If that is the case and
the machine is low on memory, a reader attempting to allocate a request
and failing in blk_alloc_request() can get stuck for a long time since
no one is there to wake it up.
The solution is either to add the extra mempool so both reads and writes
have one, or attempt to handle the situation. I chose the latter, to
save the extra memory required for the additional mempool with
BLKDEV_MIN_RQ statically allocated requests per-queue.
If a read allocation fails and we have no readers in flight for this
queue, mark us rq-starved so that the next write being freed will wake
up the sleeping reader(s). Same situation would happen for writes as
well of course, it's just a lot more unlikely.
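In pseudo-C, the idea is roughly this (the rq_starved field name is
assumed for illustration, not the exact implementation):

#include <linux/blkdev.h>
#include <linux/wait.h>

/* on a failed read allocation with no reads in flight for this queue */
static void example_mark_starved(request_queue_t *q, int rw)
{
        if (rw == READ && !q->rq.count[READ])
                q->rq_starved = 1;      /* a reader may now be sleeping */
}

/* later, when any request (e.g. a completed write) is freed */
static void example_check_starved(request_queue_t *q)
{
        if (q->rq_starved) {
                q->rq_starved = 0;
                wake_up(&q->rq.wait[READ]);     /* kick the starved reader */
        }
}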
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
As the unplug timer can potentially fire at any time, and it accesses
data that is released by the md ->stop function, we need to del_timer_sync
before releasing that data.
(After much discussion, we created blk_sync_queue() for this)
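The resulting teardown order, sketched (assuming blk_sync_queue() is what
performs the del_timer_sync() and related synchronization):

#include <linux/blkdev.h>
#include <linux/slab.h>

static void example_md_stop(request_queue_t *q, void *private)
{
        /* guarantee a late-firing unplug timer is finished... */
        blk_sync_queue(q);
        /* ...before freeing data the unplug handler dereferences */
        kfree(private);
}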
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Contributions from Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
There are two issues with online io scheduler switching that this patch
addresses. The first is pretty simple - it concerns racing with scheduler
removal on switch. elevator_find() does not grab a reference to the io
scheduler, so before elevator_attach() is run it could go away. Add
elevator_get() to solve that.
Second issue is the flushing out of requests that is needed before
switching can be problematic with requests that aren't allocated in the
block layer (such as requests on the stack of a process). The problem is
that we don't know when they will have finished, and most io schedulers
need to access the elevator structures on io completion. This can be fixed
by adding an intermediate step that switches to noop, since it doesn't need
to touch anything but the request_queue. The queue drain can then safely
be split into two operations - one that waits for file system requests, and
one that waits for the queue to completely empty. Requests arriving after
the first drain will get stuck in a separate queue list.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
All users of this code were fixed to use scatterlists.
Acked by Jens.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
|
|
Here is the next incarnation of the CFQ io scheduler, so far known as
CFQ v2 locally. It attempts to address some of the limitations of the
original CFQ io scheduler (henceforth known as CFQ v1). Some of the
problems with CFQ v1 are:
- It does accounting for the lifetime of the cfq_queue, which is set up
and torn down for the time when a process has io in flight. For a
fork-heavy workload (such as a kernel compile, for instance), new
processes can effectively starve io of running processes. This is in
part due to the fact that CFQ v1 gives preference to a new process to
get better latency numbers. Removing that heuristic is not an option
exactly because of that.
- It makes no attempts to address inter-cfq_queue fairness.
- It makes no attempt to limit upper latency bound of a single request.
- It only provides per-tgid grouping. You need to change the source to
group on a different criteria.
- It uses a mempool for the cfq_queues. Theoretically this could
deadlock if io bound processes never exit.
- The may_queue() logic can be unfair since it fluctuates quickly, thus
leaving processes sleeping while new processes are allowed to allocate
a request.
CFQ v2 attempts to fix these issues. It uses the process io_context
logic to maintain a cfq_queue lifetime of the duration of the process
(and its io). This means we can now be a lot more clever in deciding
which process is allowed to queue or dispatch io to the device. The
cfq_io_context is per-process per-queue; this is an extension to what AS
currently does, in that we truly do have a unique per-process identifier
for io grouping. Busy queues are sorted by service time used, sub-sorted
by in_flight requests. Queues that have no io in flight are also
preferred at dispatch time.
Accounting is done on completion time of a request, or with a fixed cost
for tagged command queueing. Requests are fifo'ed like with deadline, to
make sure that a single request doesn't stay in the io scheduler for
ages.
Process grouping is selectable at runtime. I provide 4 grouping
criteria: process group, thread group id, user id, and group id.
As usual, settings are sysfs tweakable in /sys/block/<dev>/queue/iosched
axboe@apu:[.]s/block/hda/queue/iosched $ ls
back_seek_max fifo_batch_expire find_best_crq queued
back_seek_penalty fifo_expire_async key_type show_status
clear_elapsed fifo_expire_sync quantum tagged
In order, each of these settings controls:
back_seek_max
back_seek_penalty:
Useful logic stolen from AS that allows small backwards seeks in
the io stream if we deem them useful. CFQ uses a strict
ascending elevator otherwise. _max controls the maximum allowed
backwards seek, defaulting to 16MiB. _penalty denotes how
expensive we account a backwards seek compared to a forward
seek. Default is 2, meaning it's twice as expensive.
clear_elapsed:
Really a debug switch, will go away in the future. It clears the
maximum values for completion and dispatch time, shown in
show_status.
fifo_batch_expire
fifo_expire_async
fifo_expire_sync:
The settings for the expiry fifo. batch_expire is how often we
allow the fifo expire to control which request to select.
Default is 125ms. _async is the deadline for async requests
(typically writes), _sync is the deadline for sync requests
(reads and sync writes). Defaults are, respectively, 5 seconds
and 0.5 seconds.
key_type:
The grouping key. Can be set to pgid, tgid, uid, or gid. The
current value is shown bracketed:
axboe@apu:[.]s/block/hda/queue/iosched $ cat key_type
[pgid] tgid uid gid
Default is tgid. To set, simply echo any of the 4 words into the
file.
quantum:
The number of requests we select for dispatch when the driver
asks for work to do and the current pending list is empty.
Default is 4.
queued:
The minimum number of requests a group is allowed to queue.
Default is 8.
show_status:
Debug output showing the current state of the queues.
tagged:
Set this to 1 if the device is using tagged command queueing.
This cannot be reliably detected by CFQ yet, since most drivers
don't use the block layer (it could be inferred by looking at the
number of requests between dispatch and completion, but not
completely reliably). Default is 0.
The patch is a little big, but works reliably here on my laptop. There
are a number of other changes and fixes in there (like converting to
hlist for hashes). The code is commented a lot better; CFQ v1 has
basically no comments (reflecting that it was written in one go, not
touched or tuned much since then). This is of course only done to
increase the AAF, the akpm acceptance factor. Since I'm on the road, I
cannot provide any really good numbers for CFQ v1 compared to v2; maybe
someone will help me out there.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch makes the io schedulers completely modular and additionally
enables online switching of io schedulers. See also
http://lwn.net/Articles/102593/ .
There's a scheduler file in the sysfs directory for the block device
queue:
axboe@router:/sys/block/hda/queue> ls
iosched max_sectors_kb read_ahead_kb
max_hw_sectors_kb nr_requests scheduler
If you list the contents of the file, it will show available schedulers
and the active one:
axboe@router:/sys/block/hda/queue> cat scheduler
[cfq]
Let's load a few more:
router:/sys/block/hda/queue # modprobe deadline-iosched
router:/sys/block/hda/queue # modprobe as-iosched
router:/sys/block/hda/queue # cat scheduler
[cfq] deadline anticipatory
Changing is done with
router:/sys/block/hda/queue # echo deadline > scheduler
router:/sys/block/hda/queue # cat scheduler
cfq [deadline] anticipatory
deadline is now the new active io scheduler for hda.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Introduces two new /sys/block values:
/sys/block/*/queue/max_hw_sectors_kb
/sys/block/*/queue/max_sectors_kb
max_hw_sectors_kb is the maximum that the driver can handle and is
read-only. max_sectors_kb is the current max_sectors value and can be
tuned by root. PAGE_SIZE granularity is enforced.
It's all locking-safe and all affected layered drivers have been updated as
well. The patch has been in testing for a couple of weeks already as part
of the voluntary-preempt patches and it works just fine - people use it to
reduce IDE IRQ handling latencies.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
IDE disk barrier core.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
They'll need it for permission checking.
|
|
blk_rq_map_user() is a bit of a hack currently, since it drops back to
kmalloc() if bio_map_user() fails. This is unfortunate since it means we
do no real segment or size checking (and the request segment counts contain
crap, already found one bug in a scsi lld). It's also pretty nasty for >
PAGE_SIZE requests, as we attempt to do higher order page allocations.
Even worse still, ide-cd will drop back to PIO for non-sg/bio requests.
All in all, very suboptimal.
This patch adds bio_copy_user() which simply sets up a bio with kernel
pages and copies data as needed for reads and writes. It also changes
bio_map_user() to return an error pointer like bio_copy_user(), so we can
return something sane to the user instead of always -ENOMEM.
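A caller can then propagate the real error instead of assuming -ENOMEM;
roughly like this (the signature is assumed from the description above):

#include <linux/blkdev.h>
#include <linux/bio.h>
#include <linux/err.h>

static int example_setup_bio(request_queue_t *q, unsigned long uaddr,
                             unsigned int len, int write)
{
        struct bio *bio = bio_copy_user(q, uaddr, len, write);

        if (IS_ERR(bio))
                return PTR_ERR(bio);    /* e.g. -EINVAL, not always -ENOMEM */
        /* attach bio to the request and submit as usual */
        return 0;
}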
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
build
Well, one of these (fs/block_dev.c) is a little non-trivial, but I felt
throwing that away would be a shame (and I did add comments ;-).
Also almost all of these have been submitted earlier through other
channels, but have not been picked up (the only controversial one is again
the fs/block_dev.c patch, where Linus felt a better job would be done with
__ffs(), but I could not convince myself that it does the same thing as the
original code).
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The 'unplug on queued exceeding unplug threshold' logic only works for file
system requests currently, since it's in __make_request(). Move it where
it belongs, in elv_add_request(). This way it works for queued block sg
requests as well.
Signed-Off-By: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
scsi_cmd_ioctl() switched to __user *, block/scsi_ioctl.c annotated.
|
|
blk_run_page() is incorrectly using page->mapping, which makes it racy against
removal from swapcache.
Make block_sync_page() use page_mapping(), and remove blk_run_page(), which
only had one caller.
|
|
From: Andrea Arcangeli <andrea@suse.de>
From: Jens Axboe
Add blk_run_page() API. This is so that we can pass the target page all the
way down to (for example) the swap unplug function. So swap can work out
which blockdevs back this particular page.
|
|
From: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
It's kind of redundant that queue_congestion_on/off_threshold gets
calculated on every I/O and they produce the same number over and over
again unless q->nr_requests gets changed (which is probably a very rare
event). We can cache those values in the request_queue structure.
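A sketch of the caching (the field names match the description; the exact
arithmetic below is illustrative):

#include <linux/blkdev.h>

/* recomputed only when q->nr_requests changes, not on every I/O */
static void example_set_congestion_thresholds(request_queue_t *q)
{
        q->nr_congestion_on  = q->nr_requests - (q->nr_requests / 8) + 1;
        q->nr_congestion_off = q->nr_requests - (q->nr_requests / 8) - 1;
}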
|
|
We cannot always rely on ->biotail remaining untouched. Currently we
leak all the pinned user pages when doing cdda ripping at least, so I
see no way around keeping the bio pointer separate and passing it back
in for unmap. Alternatively, we could invent a struct blk_map_data and
put it on the stack for passing to both map and unmap.
|
|
From: "Randy.Dunlap" <rddunlap@osdl.org>
These are EXPORTed SYMBOLs; 'inline' was removed from them in ll_rw_blk.c
on 2002-11-25.
|
|
From: Jens Axboe <axboe@suse.de>
There's a small discrepancy in when we decide to unplug a queue based on
q->unplug_thresh. Basically it doesn't work for tagged queues, since
q->rq.count[READ] + q->rq.count[WRITE] is just the number of allocated
requests, not the number of requests stuck in the io scheduler. We could
just change the nr_queued == to a nr_queued >=, however that is still
suboptimal.
This patch adds accounting for requests that have been dequeued from the io
scheduler but not freed yet; these are counted in q->in_flight. Then
allocated_requests - q->in_flight == requests_in_scheduler, so the condition
correctly becomes
if (requests_in_scheduler == q->unplug_thresh)
instead. I did a quick round of testing, and for dbench on a SCSI disk the
number of timer induced unplugs was reduced from 13 to 5 :-). Not a huge
number, but there might be cases where it's more significant. Either way,
it gets ->unplug_thresh always right, which the old logic didn't.
|
|
From: Jens Axboe <axboe@suse.de>,
Chris Mason,
me, others.
The global unplug list causes horrid spinlock contention on many-disk
many-CPU setups - throughput is worse than halved.
The other problem with the global unplugging is of course that it will cause
the unplugging of queues which are unrelated to the I/O upon which the caller
is about to wait.
So what we do to solve these problems is to remove the global unplug and set
up the infrastructure under which the VFS can tell the block layer to unplug
only those queues which are relevant to the page or buffer_head which is
about to be waited upon.
We do this via the very appropriate address_space->backing_dev_info structure.
Most of the complexity is in devicemapper, MD and swapper_space, because for
these backing devices, multiple queues may need to be unplugged to complete a
page/buffer I/O. In each case we ensure that data structures are in place to
permit us to identify all the lower-level queues which contribute to the
higher-level backing_dev_info. Each contributing queue is told to unplug in
response to a higher-level unplug.
To simplify things in various places we also introduce the concept of a
"synchronous BIO": it is tagged with BIO_RW_SYNC. The block layer will
perform an immediate unplug when it sees one of these go past.
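For example, a submitter that is about to wait on the bio can tag it
synchronous (a minimal sketch):

#include <linux/fs.h>
#include <linux/bio.h>

static void example_submit_sync(int rw, struct bio *bio)
{
        /* BIO_RW_SYNC tells the block layer to unplug immediately */
        submit_bio(rw | (1 << BIO_RW_SYNC), bio);
}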
|
|
Teach blk_congestion_wait() to return the number of jiffies remaining. This
is for debug, but it is also nicely consistent.
|
|
This patch allows you to map a request with user data for io, similar
to what bio_map_user() already lets you do with a bio. However, this
goes one step further and populates the request so the user only has to
fill in the cdb (almost) and put it on the queue for execution. Patch
converts sg_io() to use it, next patch I'll send adapts cdrom layer to
use it for zero copy cdda dma extraction.
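Usage then looks roughly like this (the signature is assumed from the
description above):

#include <linux/blkdev.h>
#include <linux/string.h>

static int example_setup_user_rq(request_queue_t *q, struct request *rq,
                                 void __user *ubuf, unsigned int len,
                                 unsigned char *cdb, unsigned int cdb_len)
{
        int err = blk_rq_map_user(q, rq, ubuf, len);

        if (err)
                return err;
        memcpy(rq->cmd, cdb, cdb_len);  /* the "(almost)": caller fills the cdb */
        return 0;
}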
|
|
blk_start_queue() fix.
|
|
The carmel driver will want to use this rather
than muck around in queue internals directly.
|
|
This patch against a recent bk 2.6 changes scsi_cmd_ioctl to take a
gendisk as an argument instead of a request_queue_t. This allows scsi char
devices to use the scsi_cmd_ioctl interface.
In turn, change bio_map_user to also pass a request_queue_t, and add a
__bio_add_page helper that takes a request_queue_t.
Tested ide cd burning with no problems.
If the scsi upper level scsi_cmd_ioctl usage were consolidated in
scsi_prep_fn, we could pass a request_queue_t instead of a gendisk to
scsi_cmd_ioctl.
|
|
From: Rusty Russell <rusty@rustcorp.com.au>
Looks like an obvious typo. Works fine if "bio" is the name of the
iterator.
|
|
From: Xavier Bestel <xavier.bestel@free.fr>
Within the body of this macro we are accessing rq->bio, but `bio' is an arg
to the macro. If someone uses this macro with some variable which is not
named `bio' it won't compile.
So use a more-likely-to-be-unique identifier for the macro.
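Illustrated (the macro body is a sketch, not the exact kernel definition):

/* broken: "bio" is a macro parameter, so the "bio" token in "rq->bio"
   gets substituted as well -- rq_for_each_bio(b, rq) expands rq->bio
   into rq->b, which won't compile */
#define rq_for_each_bio(bio, rq)        \
        for (bio = (rq)->bio; bio; bio = bio->bi_next)

#undef rq_for_each_bio

/* fixed: a more-likely-to-be-unique parameter name leaves rq->bio alone */
#define rq_for_each_bio(_bio, rq)       \
        for (_bio = (rq)->bio; _bio; _bio = _bio->bi_next)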
|
|
From: Mike Christie <michaelc@cs.wisc.edu>,
Jens Axboe <axboe@suse.de>
It's cleaner and more correct to look at req->rl to determine whether this
request came from the block layer request lists, instead of using req->q.
It's handy to always have req->q available, to lookup the queue from the
request.
|
|
Use rq->cmd[0] instead of rq->flags for storing special request flags.
Per Jens' suggestion. Tested by Stef van der Made <svdmade@planet.nl>.
|
|
The previous scsi_ioctl.c patch didn't clean up the buffer/bio in the
error case.
Fix it by copying the command data earlier.
|
|
Noticed by Stuart_Hayes@Dell.com:
I've noticed that, in the 2.6 (test 9) kernel, the "cmd" field (of type int)
in struct request has been removed, and it looks like all of the code in
ide-tape has just had a find & replace run on it to replace any instance of
rq.cmd or rq->cmd with rq.flags or rq->flags.
The values being put into "cmd" in 2.4 (now "flags", in 2.6) by ide-tape are
8-bit numbers, like 90, 91, etc... and the actual flags that are being used
in "flags" cover the low 23 bits. So, not only do the flags get wiped out
when, say, ide-tape assigns, say, 90 to "flags", but also the 90 gets wiped
out when one of the flags is modified.
I noticed this, because ide-tape checks this value, and spews error codes
when it isn't correct--continuously--as soon as you load the module, because
ide-tape is calling ide_do_drive_cmd with an action of ide_preempt, which
causes ide_do_drive_cmd to set the REQ_PREEMPT flag, so "flags" isn't the
same when it gets back to idetape_do_request.
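The collision in miniature (a sketch; the opcode value is illustrative):

#include <linux/blkdev.h>

static void example_clobber(struct request *rq)
{
        rq->flags = 90;                 /* ide-tape: 8-bit opcode in "flags" */
        rq->flags |= REQ_PREEMPT;       /* block layer sets a flag bit...    */
        /* ...and the stored 90 is no longer 90; equally, the plain
           assignment above wiped out any flag bits already set */
}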
|
|
This implements the possibility for sharing a tag map between queues.
Some (most?) scsi host adapters need this, and SATA tcq will need it
for some cases, too.
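A sketch of the sharing (the third argument to blk_queue_init_tags() is
assumed here for illustration):

#include <linux/blkdev.h>

static int example_share_tags(request_queue_t *q1, request_queue_t *q2,
                              int depth)
{
        int err = blk_queue_init_tags(q1, depth, NULL); /* fresh tag map */

        if (err)
                return err;
        /* second queue reuses q1's map instead of allocating its own */
        return blk_queue_init_tags(q2, depth, q1->queue_tags);
}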
|
|
The "insert_here" list pointer logic was broken, and unnecessary.
Kill it and its associated logic off completely - just tell the IO
scheduler what kind of insert it is.
This also makes the *_insert_request strategies much easier to follow,
imo.
|
|
From: Rusty Russell <rusty@rustcorp.com.au>
Previously, default aliases were hardwired into modutils. Now they should
be inside the modules, using MODULE_ALIAS() (they will be overridden by any
user alias).
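For a block driver, for example, the alias now lives in the module source
itself (the alias string below is illustrative):

#include <linux/module.h>

/* modprobe resolves this name to the module; a user-supplied alias in
   the configuration still takes precedence */
MODULE_ALIAS("block-major-8");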
|
|
This adds support for software-controlled hard drive LED activity.
This is really nice on such machines as Apple Powerbooks, where there is
no such LED in the first place and the sleep/suspend LED isn't used for
anything when the machine is running.
|
|
From: Neil Brown <neilb@cse.unsw.edu.au>
Fix "bio too big" problem with md
Whenever a device is attached to an md device, we make sure the sector
limits of the md device do not exceed those of the added device.
|
|
To be able to properly keep references to block queues, we make
blk_init_queue() return the queue that it initialized, and let it be
independently allocated and then cleaned up on the last reference.
I have grepped high and low, and there really shouldn't be any broken
uses of blk_init_queue() in the kernel drivers left. The added bonus is
that blk_init_queue() error checking is explicit now; most of the drivers
were broken in this regard (even IDE/SCSI).
No drivers have embedded request queue structures. Drivers that don't
use blk_init_queue() but blk_queue_make_request() should allocate the
queue with blk_alloc_queue(gfp_mask). I've converted all of them to do
that, too. They can call blk_cleanup_queue() now too, though using the
blk_put_queue() define is probably cleaner.
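A converted driver then looks roughly like this sketch (my_request_fn is
a hypothetical stand-in for the driver's request function):

#include <linux/blkdev.h>
#include <linux/errno.h>

static struct request_queue *q;         /* a pointer, never embedded */
static spinlock_t example_lock = SPIN_LOCK_UNLOCKED;

static void my_request_fn(request_queue_t *rq_q)
{
        /* process requests from rq_q here */
}

static int example_driver_init(void)
{
        q = blk_init_queue(my_request_fn, &example_lock);
        if (!q)                         /* error checking is explicit now */
                return -ENOMEM;
        return 0;
}

static void example_driver_exit(void)
{
        blk_cleanup_queue(q);           /* blk_put_queue() would also do */
}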
|
|
From: Lou Langholtz <ldl@aros.net>
The queue_wait field of struct request_queue is not used anymore, and
this gets rid of it.
|
|
This allows SCSI to survive the I/O queueing stress harness with AS.
Jens has signed off on it, and Mark Havercamp confirms it eliminates
the test-induced hangs for him as well.
|
|
Here's the patch to enable the failfast flag in the bio submission code, and
use it for multipath and readahead.
|
|
|
|
into jet.(none):/home1/jejb/BK/scsi-for-linus-2.5
|
|
For CONFIG_LBD=n case it was passing a u32 into do_div().
|
|
This patch removes the scsi mid layer dependency on __elv_add_request
and introduces a new blk_requeue_request() function so the block
layer specifically knows a requeue is in progress.
It also adds an elevator hook for elevators like AS which need to
hook into the requeue for correct adjustment of internal counters.
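In a driver or the scsi midlayer, the change amounts to this sketch:

#include <linux/blkdev.h>

static void example_requeue(request_queue_t *q, struct request *rq)
{
        /* previously: __elv_add_request() behind the block layer's back;
           now the requeue is explicit, so an elevator's requeue hook can
           fix up its internal counters */
        blk_requeue_request(q, rq);
}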
|
|
- pass gfp_flags to get_io_context(): not all callers are forced to use
GFP_ATOMIC.
- fix locking in get_io_context(): bump the refcount while in the exclusive
region.
- don't oops in get_io_context() if the kmalloc failed.
- in as_get_io_context(): fail the whole thing if we were unable to
allocate the AS-specific part.
- as_remove_queued_request() cleanup
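After the change, a sleeping caller looks roughly like this (per the gfp
and failure-handling notes above):

#include <linux/blkdev.h>
#include <linux/gfp.h>
#include <linux/errno.h>

static int example_get_ioc(void)
{
        struct io_context *ioc = get_io_context(GFP_KERNEL);

        if (!ioc)               /* the kmalloc can fail; don't oops */
                return -ENOMEM;
        put_io_context(ioc);
        return 0;
}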
|