| author | Andrew Morton <akpm@osdl.org> | 2003-07-04 19:36:30 -0700 |
|---|---|---|
| committer | Linus Torvalds <torvalds@home.osdl.org> | 2003-07-04 19:36:30 -0700 |
| commit | 97ff29c22ec3df25621561194692e7e945fcf489 (patch) | |
| tree | 1d61e0e3664eaf48cd8581ec5d20805c097b4eb3 /include/linux/elevator.h | |
| parent | 104e6fdc6f35ea08e1c6ed03158b336b2e9983ed (diff) | |
[PATCH] anticipatory I/O scheduler
From: Nick Piggin <piggin@cyberone.com.au>
This is the core anticipatory IO scheduler. There are nearly 100 changesets
in this, and five months' work. I really cannot describe it fully here.
Major points:
- It works by recognising that reads are dependent: we don't know where the
next read will occur, but it's probably close to the previous one. So once
a read has completed we leave the disk idle, anticipating that a request
for a nearby read will come in.
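The idle-or-dispatch decision above can be sketched in userspace C. The
function name and the threshold value are illustrative, not the kernel's:

```c
#include <stdbool.h>

/* Illustrative threshold: how far apart two reads can land while we
 * still consider them "nearby" (in sectors; the value is made up). */
#define ANTIC_SEEK_THRESHOLD 1024

/* Keep the disk head idle after a completed read only if this
 * process's reads have been landing close together, so a nearby
 * follow-up request is likely to arrive. */
static bool worth_anticipating(long mean_seek_distance)
{
    return mean_seek_distance < ANTIC_SEEK_THRESHOLD;
}
```

If the process's reads are scattered, anticipation is pointless and the
scheduler dispatches the best pending request instead.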
- There is read batching and write batching logic.
- When we're servicing a batch of writes we will refuse to seek away
for a read for some tens of milliseconds; then the write stream is
preempted.
- When we're servicing a batch of reads (via anticipation) we'll do
that for some tens of milliseconds, then preempt.
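The batching rule above amounts to a time budget per batch. A minimal
sketch, with made-up type names and an illustrative 40ms budget:

```c
#include <stdbool.h>

/* Sketch of the batch-preemption rule: a batch of reads (or writes)
 * gets a fixed time budget; once it is used up, the other direction
 * preempts. These names and values are illustrative. */
enum batch_dir { BATCH_READ, BATCH_WRITE };

struct batch_state {
    enum batch_dir dir;
    long started_ms;   /* when this batch began */
    long expire_ms;    /* budget: "some tens of milliseconds" */
};

static bool batch_expired(const struct batch_state *b, long now_ms)
{
    return now_ms - b->started_ms >= b->expire_ms;
}
```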
- There are request deadlines, for latency and fairness.
The oldest outstanding request is examined at regular intervals. If
this request is older than a specific deadline, it will be the next
one dispatched. This gives a good fairness heuristic while being simple
because processes tend to have localised IO.
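The deadline check reduces to a simple age comparison on the oldest
outstanding request; a sketch with illustrative names:

```c
#include <stdbool.h>

/* Sketch of the deadline heuristic: at regular intervals the oldest
 * outstanding request is examined, and once it is older than its
 * deadline it is dispatched next regardless of head position. */
static bool must_dispatch_oldest(long submitted_ms, long now_ms,
                                 long deadline_ms)
{
    return now_ms - submitted_ms > deadline_ms;
}
```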
Just about all of the rest of the complexity involves an array of fixups
which prevent most of the obvious failure modes with anticipation: trying
not to leave the disk head pointlessly idle. Some of these algorithms are:
- Process tracking. If the process whose read we are anticipating submits
a write, abandon anticipation.
- Process exit tracking. If the process whose read we are anticipating
exits, abandon anticipation.
- Process IO history. We accumulate statistical info on the process's
recent IO patterns to aid in making decisions about how long to anticipate
new reads.
Currently thinktime and seek distance are tracked. Thinktime is the
time between when a process's last request has completed and when it
submits another one. Seek distance is simply the number of sectors
between each read request. If either statistic becomes too high, then
it isn't anticipated that the process will submit another read.
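The thinktime and seek-distance tracking can be sketched as running
means updated on each request. The struct, field names, and the 7/8
exponential decay are illustrative, not necessarily what AS uses:

```c
#include <stdlib.h>

/* Sketch of per-process IO history: running means of thinktime and
 * seek distance, updated on every read. */
struct io_stats {
    long mean_thinktime_ms;
    long mean_seek_sectors;
    long last_end_sector;
    long last_completion_ms;
};

static void io_stats_update(struct io_stats *s, long submit_ms, long sector)
{
    long think = submit_ms - s->last_completion_ms;   /* thinktime */
    long seek  = labs(sector - s->last_end_sector);   /* seek distance */

    s->mean_thinktime_ms = (7 * s->mean_thinktime_ms + think) / 8;
    s->mean_seek_sectors = (7 * s->mean_seek_sectors + seek) / 8;
    s->last_end_sector   = sector;
}
```

A decayed mean keeps the statistics responsive to a process changing its
IO pattern while smoothing over one-off outliers.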
The above all means that we need a per-process "io context". This is a fully
refcounted structure. In this patch it is AS-only. Later we generalise it a
little so other IO schedulers can use the same framework.
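A refcounted per-process io context can be sketched as below. The kernel
version uses atomic counters; a plain int and these made-up field names
suffice for illustration:

```c
#include <stdlib.h>

/* Sketch of a refcounted per-process io context. */
struct io_context {
    int refcount;
    long mean_thinktime_ms;  /* per-process IO history lives here */
};

static struct io_context *ioc_alloc(void)
{
    struct io_context *ioc = calloc(1, sizeof(*ioc));
    if (ioc)
        ioc->refcount = 1;
    return ioc;
}

static struct io_context *ioc_get(struct io_context *ioc)
{
    ioc->refcount++;
    return ioc;
}

static void ioc_put(struct io_context *ioc)
{
    if (--ioc->refcount == 0)
        free(ioc);
}
```

Refcounting lets the context outlive the process: the scheduler can hold
a reference while requests are in flight even after the submitter exits.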
- Requests are grouped as synchronous and asynchronous whereas deadline
scheduler groups requests as reads and writes. This can provide better
sync write performance, and may give better responsiveness with journalling
filesystems (although we haven't done that yet).
We currently detect synchronous writes by nastily setting PF_SYNCWRITE in
current->flags. The plan is to remove this later, and to propagate the
sync hint from writeback_control.sync_mode into bio->bi_flags thence into
request->flags. Once that is done, direct-io needs to set the BIO sync
hint as well.
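The interim classification amounts to: reads are always synchronous,
writes only when the submitter has flagged itself. A sketch, where the
flag value and the struct are illustrative, not the kernel's
definitions:

```c
#include <stdbool.h>

/* Illustrative stand-ins for the process flag and task struct. */
#define PF_SYNCWRITE 0x00200000

struct task { unsigned long flags; };

/* Reads are always treated as synchronous; writes only when the
 * submitting process has marked itself with PF_SYNCWRITE. */
static bool request_is_sync(const struct task *t, bool is_read)
{
    return is_read || (t->flags & PF_SYNCWRITE);
}
```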
- There is also quite a bit of complexity that has gone into bashing TCQ into
submission. Timing for a read batch is not started until the first read
request actually completes. A read batch also does not start until all
outstanding writes have completed.
AS is the default IO scheduler. deadline may be chosen by booting with
"elevator=deadline".
There are a few reasons for retaining deadline:
- AS is often slower than deadline in random IO loads with large TCQ
windows. The usual real world task here is OLTP database loads.
- deadline is presumably more stable.
- deadline is much simpler.
The tunable per-queue entries under /sys/block/*/iosched/ are all in
milliseconds:
* read_expire
Controls how long until a request becomes "expired".
It also controls the interval at which expired requests are checked, so
with it set to 50, a request might take anywhere up to 100ms to be serviced
_if_ it is the next one on the expired list.
Obviously it can't make the disk go faster. The result is basically the
timeslice a reader gets in the presence of other IO. 100 / ((seek time /
read_expire) + 1) is very roughly the % streaming read efficiency your disk
should get in the presence of multiple readers.
* read_batch_expire
Controls how much time a batch of reads is given before pending writes
are served. Higher value is more efficient. Shouldn't really be below
read_expire.
* write_ versions of the above
* antic_expire
Controls the maximum amount of time we can anticipate a good read before
giving up. Many other factors may cause anticipation to be stopped early,
or some processes will not be "anticipated" at all. Should be a bit higher
for devices with big seek times, though not a linear correspondence - most
processes have only a few ms thinktime.
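The tunables are plain sysfs files, so they can be inspected and set
from the shell. The device name below is an example; writes require
root:

```shell
# Read the current value, then tighten read latency and shorten
# anticipation for this disk (values in milliseconds).
cat /sys/block/hda/iosched/read_expire
echo 50 > /sys/block/hda/iosched/read_expire
echo 40 > /sys/block/hda/iosched/antic_expire
```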
Diffstat (limited to 'include/linux/elevator.h')
| -rw-r--r-- | include/linux/elevator.h | 5 |
1 file changed, 5 insertions, 0 deletions
```diff
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 07de69c1ef8a..d793bb97dd54 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -89,6 +89,11 @@ extern elevator_t elevator_noop;
  */
 extern elevator_t iosched_deadline;
 
+/*
+ * anticipatory I/O scheduler
+ */
+extern elevator_t iosched_as;
+
 extern int elevator_init(request_queue_t *, elevator_t *);
 extern void elevator_exit(request_queue_t *);
 extern inline int elv_rq_merge_ok(struct request *, struct bio *);
```
