|
elv_register() always returns 0, and there isn't anything it does where
it should return an error (the only error condition is so grave that
it's handled with a BUG_ON).
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
Some of the code has been gradually transitioned to using the proper
struct request_queue, but there's lots left. So do a full sweep of
the kernel, get rid of the request_queue_t typedef, and replace its
uses with the proper type.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
Currently we allow any merge, even if the io originates from different
processes. This can cause really bad starvation and unfairness, if those
ios happen to be synchronous (reads or direct writes).
So add an allow_merge hook to the io scheduler ops, so an io scheduler can
help decide whether a bio/process combination may be merged with an
existing request.
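For illustration only - the type and field names below are stand-ins rather
than the real elevator_ops API - the idea is a predicate the elevator core
consults before attempting a merge:

/* Illustrative stand-in types; the real kernel structures differ. */
struct ex_ioc     { int pid; };
struct ex_request { struct ex_ioc *ioc; long end_sector; };
struct ex_bio     { struct ex_ioc *ioc; long start_sector; };

/* A hypothetical allow_merge-style hook: refuse to merge io that comes
 * from a different process than the one owning the request. */
static int ex_allow_merge(struct ex_request *rq, struct ex_bio *bio)
{
        return rq->ioc == bio->ioc;     /* 1 = merge allowed, 0 = refuse */
}

/* The elevator core asks the scheduler before committing to a merge: */
static int ex_may_back_merge(struct ex_request *rq, struct ex_bio *bio)
{
        if (rq->end_sector != bio->start_sector)
                return 0;               /* not contiguous, no back merge anyway */
        return ex_allow_merge(rq, bio);
}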
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
- ->init_queue() does not need the elevator passed in
- ->put_request() is a hot path and need not have the queue passed in
- cfq_update_io_seektime() does not need cfqd passed in
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
The elevator_type field in the elevator_type structure is useless:
it isn't used anywhere in the kernel sources.
Signed-off-by: Vasily Tarasov <vtaras@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
Make it possible to disable the block layer. Not all embedded devices
require it; some can make do with just JFFS2, NFS, ramfs, etc., none of
which require the block layer to be present.
This patch does the following:
(*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
support.
(*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
an item that uses the block layer. This includes:
(*) Block I/O tracing.
(*) Disk partition code.
(*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.
(*) The SCSI layer. As far as I can tell, even SCSI chardevs use the
block layer to do scheduling. Some drivers that use SCSI facilities -
such as USB storage - end up disabled indirectly from this.
(*) Various block-based device drivers, such as IDE and the old CDROM
drivers.
(*) MTD blockdev handling and FTL.
(*) JFFS - which uses set_bdev_super(), something it could avoid doing by
taking a leaf out of JFFS2's book.
(*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is,
however, still used in places, and so is still available.
(*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
parts of linux/fs.h.
(*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.
(*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.
(*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
is not enabled.
(*) fs/no-block.c is created to hold out-of-line stubs and things that are
required when CONFIG_BLOCK is not set:
(*) Default blockdev file operations (to give error ENODEV on opening).
(*) Makes some /proc changes:
(*) /proc/devices does not list any blockdevs.
(*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.
(*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.
(*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
given a command other than Q_SYNC or if a special device is specified.
(*) In init/do_mounts.c, no reference is made to the blockdev routines if
CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.
(*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
error ENOSYS by way of cond_syscall if so).
(*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
CONFIG_BLOCK is not set, since they can't then happen.
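To make the header changes above concrete, this is roughly the kind of guard
they boil down to (a simplified sketch; the declarations shown are
placeholders, not the real header contents):

#ifdef CONFIG_BLOCK

/* Block-layer declarations are only visible when the block layer is in. */
struct request_queue;
extern void example_blk_only_helper(struct request_queue *q);

#else /* CONFIG_BLOCK */

/* Non-block builds get stubs or nothing at all for these interfaces. */

#endif /* CONFIG_BLOCK */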
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
None of the in-kernel primitives for handling "atomic" counting seem
to be a good fit. We need something that is essentially free for
incrementing/decrementing, while the read side may be more expensive
as we only ever need to do that when a device is removed from the
kernel.
Use a per-cpu variable for maintaining a per-cpu ioc count and define
a reading mechanism that just sums up the values.
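A minimal sketch of that pattern, with illustrative names (the real counter
lives in the io context code): each CPU increments its own copy without
touching shared state, and the rare reader sums over all CPUs:

#include <linux/percpu.h>
#include <linux/cpumask.h>

/* Illustrative per-cpu counter, following the pattern described above. */
static DEFINE_PER_CPU(unsigned long, example_ioc_count);

static inline void example_ioc_get(void)
{
        get_cpu_var(example_ioc_count)++;       /* cheap: no shared cacheline */
        put_cpu_var(example_ioc_count);
}

static inline void example_ioc_put(void)
{
        get_cpu_var(example_ioc_count)--;
        put_cpu_var(example_ioc_count);
}

/* Expensive but rare: walk every cpu and sum the per-cpu values. */
static unsigned long example_ioc_count_read(void)
{
        unsigned long total = 0;
        int cpu;

        for_each_possible_cpu(cpu)
                total += per_cpu(example_ioc_count, cpu);
        return total;
}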
Signed-off-by: Jens Axboe <axboe@suse.de>
|
|
It's not needed for anything, so kill the bio passing.
Signed-off-by: Jens Axboe <axboe@suse.de>
|
|
The io schedulers can use this instead of having to allocate space for
it themselves.
Signed-off-by: Jens Axboe <axboe@suse.de>
|
|
The rbtree sort/lookup/reposition logic is mostly duplicated in
cfq/deadline/as, so move it to the elevator core. The io schedulers
still provide the actual rb root, as we don't want to impose any sort
of specific handling on the schedulers.
Introduce the helpers and rb_node in struct request to help migrate the
IO schedulers.
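Roughly, the shared helpers amount to a sector-sorted rbtree insert and
lookup along these lines (simplified, with stand-in names; the kernel's own
helpers differ in detail):

#include <linux/rbtree.h>

/* Simplified request with just what the sketch needs. */
struct ex_request {
        struct rb_node  rb_node;
        unsigned long   sector;
};

/* Insert a request into a sector-sorted rbtree. */
static void ex_rb_add(struct rb_root *root, struct ex_request *rq)
{
        struct rb_node **p = &root->rb_node, *parent = NULL;

        while (*p) {
                struct ex_request *cur = rb_entry(*p, struct ex_request, rb_node);

                parent = *p;
                if (rq->sector < cur->sector)
                        p = &(*p)->rb_left;
                else
                        p = &(*p)->rb_right;
        }
        rb_link_node(&rq->rb_node, parent, p);
        rb_insert_color(&rq->rb_node, root);
}

/* Look up the request starting at a given sector, if any. */
static struct ex_request *ex_rb_find(struct rb_root *root, unsigned long sector)
{
        struct rb_node *n = root->rb_node;

        while (n) {
                struct ex_request *rq = rb_entry(n, struct ex_request, rb_node);

                if (sector < rq->sector)
                        n = n->rb_left;
                else if (sector > rq->sector)
                        n = n->rb_right;
                else
                        return rq;
        }
        return NULL;
}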
Signed-off-by: Jens Axboe <axboe@suse.de>
|
|
Right now, every IO scheduler implements its own backmerging (except for
noop, which does no merging). That results in duplicated code for
essentially the same operation, which is never a good thing. This patch
moves the backmerging out of the io schedulers and into the elevator
core. We save 1.6kb of text and as a bonus get backmerging for noop as
well. Win-win!
Signed-off-by: Jens Axboe <axboe@suse.de>
|
|
There's a race between shutting down one io scheduler and firing up the
next, in which a new io could enter and cause the io scheduler to be
invoked with bad or NULL data.
To fix this, we need to maintain the queue lock for a bit longer.
Unfortunately we cannot do that, since the elevator init has to run
without the lock held. This isn't easily fixable without also changing
the mempool API. So split the initialization into two parts: an
alloc-init operation and an attach operation. Then we can
preallocate the io scheduler and related structures, and run the attach
inside the lock after we detach the old one.
This patch has survived 30 minutes of 1 second io scheduler switching
with a very busy io load.
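A hedged sketch of the resulting switch sequence - all names below are
stand-ins for the real alloc/attach helpers, and the locking is simplified:

#include <linux/spinlock.h>
#include <linux/errno.h>

/* All names below are illustrative stand-ins, not the real kernel API. */
struct ex_elevator;
struct ex_elevator_type;
struct ex_queue {
        spinlock_t              lock;
        struct ex_elevator      *elevator;
};

extern struct ex_elevator *ex_elevator_alloc_init(struct ex_queue *q,
                                                  struct ex_elevator_type *t);
extern void ex_elevator_detach(struct ex_queue *q);
extern void ex_elevator_attach(struct ex_queue *q, struct ex_elevator *e);

static int ex_switch_elevator(struct ex_queue *q, struct ex_elevator_type *t)
{
        struct ex_elevator *new_e;

        /* Phase 1: allocate and set up outside the queue lock, where
         * sleeping allocations (mempools etc.) are fine. */
        new_e = ex_elevator_alloc_init(q, t);
        if (!new_e)
                return -ENOMEM;

        /* Phase 2: do the actual swap under the lock, so no new io can
         * reach a half-torn-down scheduler. */
        spin_lock_irq(&q->lock);
        ex_elevator_detach(q);
        ex_elevator_attach(q, new_e);
        spin_unlock_irq(&q->lock);

        return 0;
}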
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
|
|
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
q->ordcolor must only be flipped on initial queueing of a hardbarrier
request.
Constructing an ordered sequence and requeueing used to pass through
__elv_add_request(), which flips q->ordcolor when it sees a barrier
request.
This patch separates out elv_insert() from __elv_add_request() and uses
elv_insert() when constructing an ordered sequence and when requeueing.
elv_insert() inserts the given request at the specified position and
does nothing else.
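Schematically, the split looks like this (illustrative names and fields
only): the wrapper keeps the colour flip, the bare insert does placement
and nothing else:

struct ex_request { int is_barrier; /* ... */ };
struct ex_queue   { int ordcolor;   /* ... */ };

/* Bare insert: place rq at the requested position (front, back, sorted)
 * and do nothing else - safe for requeue and ordered-sequence setup. */
static void ex_elv_insert(struct ex_queue *q, struct ex_request *rq, int where)
{
        /* ... list manipulation only ... */
}

/* The add_request wrapper keeps the barrier colour flip, so it only
 * happens on initial queueing of a barrier request. */
static void ex_elv_add_request(struct ex_queue *q, struct ex_request *rq, int where)
{
        if (rq->is_barrier)
                q->ordcolor ^= 1;
        ex_elv_insert(q, rq, where);
}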
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
elv_try_last_merge().
Signed-off-by: Coywolf Qi Hunt <qiyong@fc-cn.com>
Signed-off-by: Jens Axboe <axboe@suse.de>
|
|
Reimplement handling of barrier requests.
* Flexible handling to deal with various capabilities of
target devices.
* Retry support for falling back.
* Tagged queues which don't support ordered tags can still do ordered sequences.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jens Axboe <axboe@suse.de>
|
|
git://brick.kernel.dk/data/git/linux-2.6-block
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
- Split elv_dispatch_insert() into two functions
- Rename rq_last_sector() to rq_end_sector()
Signed-off-by: Jens Axboe <axboe@suse.de>
|
|
Implements a generic dispatch queue which can replace all
dispatch queues implemented by each iosched. This reduces
code duplication, eases enforcing semantics over the dispatch
queue, and simplifies the specific ioscheds.
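The core of such a shared dispatch queue is just sorted insertion into the
queue's request list; a simplified sketch (illustrative names, not the exact
elv_dispatch_* code):

#include <linux/list.h>

/* Illustrative types and helper; the real dispatch code differs. */
struct ex_request {
        struct list_head queuelist;
        unsigned long    sector;
};
struct ex_queue {
        struct list_head queue_head;    /* the shared dispatch list */
};

/* Insert rq into the dispatch list, keeping it roughly sector-sorted,
 * so each iosched no longer needs its own dispatch queue. */
static void ex_dispatch_sort(struct ex_queue *q, struct ex_request *rq)
{
        struct list_head *entry;

        list_for_each_prev(entry, &q->queue_head) {
                struct ex_request *pos =
                        list_entry(entry, struct ex_request, queuelist);

                if (pos->sector <= rq->sector)
                        break;          /* insert after the first smaller one */
        }
        list_add(&rq->queuelist, entry);
}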
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jens Axboe <axboe@suse.de>
|
|
This updates the CFQ io scheduler to the new time sliced design (cfq
v3). It provides full process fairness, while giving excellent
aggregate system throughput even for many competing processes. It
supports io priorities, either inherited from the cpu nice value or set
directly with the ioprio_get/set syscalls. The latter closely mimic
set/getpriority.
This import is based on my latest from -mm.
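From userspace, the syscalls are used much like set/getpriority; a small
example, assuming the syscall number is exposed through <sys/syscall.h> and
using the ioprio constants (double-check them against your kernel headers):

/* Set the calling process to best-effort io priority 4 via the raw
 * syscall (there is no glibc wrapper for ioprio_set). */
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define IOPRIO_CLASS_SHIFT      13
#define IOPRIO_CLASS_BE         2
#define IOPRIO_WHO_PROCESS      1
#define IOPRIO_PRIO_VALUE(cls, data)  (((cls) << IOPRIO_CLASS_SHIFT) | (data))

int main(void)
{
        int prio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 4);

        if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, prio) < 0) {
                perror("ioprio_set");
                return 1;
        }
        printf("io priority set to best-effort, level 4\n");
        return 0;
}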
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
As promised to Andrew, here are the latest bits that fixup the block io
barrier handling.
- Add io scheduler ->deactivate hook to tell the io scheduler when a
request is suspended from the block layer. cfq and as need this hook.
- Locking updates
- Make sure a driver doesn't reuse the flush rq before a previous one
has completed
- Typo in the scsi_io_completion() function, the bit shift was wrong
- sd needs proper timeout on the flush
- remove silly debug leftover in ide-disk wrt "hdc"
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Here is the next incarnation of the CFQ io scheduler, so far known as
CFQ v2 locally. It attempts to address some of the limitations of the
original CFQ io scheduler (hence forth known as CFQ v1). Some of the
problems with CFQ v1 are:
- It does accounting for the lifetime of the cfq_queue, which is set up
and torn down for the time when a process has io in flight. For a
fork-heavy workload (such as a kernel compile, for instance), new
processes can effectively starve io of running processes. This is in
part due to the fact that CFQ v1 gives preference to new processes
to get better latency numbers. Removing that heuristic is not an
option exactly because of that.
- It makes no attempt to address inter-cfq_queue fairness.
- It makes no attempt to limit the upper latency bound of a single request.
- It only provides per-tgid grouping. You need to change the source to
group on different criteria.
- It uses a mempool for the cfq_queues. Theoretically this could
deadlock if io bound processes never exit.
- The may_queue() logic can be unfair since it fluctuates quickly, thus
leaving processes sleeping while new processes are allowed to allocate
a request.
CFQ v2 attempts to fix these issues. It uses the process io_context
logic to tie a cfq_queue's lifetime to the duration of the process
(and its io). This means we can now be a lot more clever in deciding
which process is allowed to queue or dispatch io to the device. The
cfq_io_context is per-process per-queue; this is an extension to what AS
currently does in that we truly do have a unique per-process identifier
for io grouping. Busy queues are sorted by service time used, sub-sorted
by in_flight requests. Queues that have no io in flight are also
preferred at dispatch time.
Accounting is done on completion time of a request, or with a fixed cost
for tagged command queueing. Requests are fifo'ed like with deadline, to
make sure that a single request doesn't stay in the io scheduler for
ages.
Process grouping is selectable at runtime. I provide 4 grouping
criteria: process group, thread group id, user id, and group id.
As usual, settings are sysfs tweakable in /sys/block/<dev>/queue/iosched
axboe@apu:[.]s/block/hda/queue/iosched $ ls
back_seek_max fifo_batch_expire find_best_crq queued
back_seek_penalty fifo_expire_async key_type show_status
clear_elapsed fifo_expire_sync quantum tagged
In order, each of these settings controls:
back_seek_max
back_seek_penalty:
Useful logic stolen from AS that allows small backwards seeks in
the io stream if we deem them useful. CFQ uses a strict
ascending elevator otherwise. _max controls the maximum allowed
backwards seek, defaulting to 16MiB. _penalty denotes how
expensive we account a backwards seek compared to a forward
seek. Default is 2, meaning it's twice as expensive.
clear_elapsed:
Really a debug switch, will go away in the future. It clears the
maximum values for completion and dispatch time, shown in
show_status.
fifo_batch_expire
fifo_expire_async
fifo_expire_sync:
The settings for the expiry fifo. batch_expire is how often we
allow the fifo expire to control which request to select.
Default is 125ms. _async is the deadline for async requests
(typically writes), _sync is the deadline for sync requests
(reads and sync writes). Defaults are, respectively, 5 seconds
and 0.5 seconds.
key_type:
The grouping key. Can be set to pgid, tgid, uid, or gid. The
current value is shown bracketed:
axboe@apu:[.]s/block/hda/queue/iosched $ cat key_type
[pgid] tgid uid gid
Default is tgid. To set, simply echo any of the 4 words into the
file.
quantum:
The number of requests we select for dispatch when the driver
asks for work to do and the current pending list is empty.
Default is 4.
queued:
The minimum number of requests a group is allowed to queue.
Default is 8.
show_status:
Debug output showing the current state of the queues.
tagged:
Set this to 1 if the device is using tagged command queueing.
This cannot be reliably detected by CFQ yet, since most drivers
don't use the block layer tagging support (well, it could, by looking
at the number of requests between dispatch and completion, but not
completely reliably). Default is 0.
The patch is a little big, but works reliably here on my laptop. There
are a number of other changes and fixes in there (like converting to
hlist for hashes). The code is commented a lot better; CFQ v1 has
basically no comments (reflecting that it was written in one go, not
touched or tuned much since then). This is of course only done to
increase the AAF, the akpm acceptance factor. Since I'm on the road, I
cannot provide any really good numbers for CFQ v1 compared to v2; maybe
someone will help me out there.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch makes the io schedulers completely modular, allowing them to
be built and loaded as modules. Additionally it enables online switching
of io schedulers. See
also http://lwn.net/Articles/102593/ .
There's a scheduler file in the sysfs directory for the block device
queue:
axboe@router:/sys/block/hda/queue> ls
iosched max_sectors_kb read_ahead_kb
max_hw_sectors_kb nr_requests scheduler
If you list the contents of the file, it will show available schedulers
and the active one:
axboe@router:/sys/block/hda/queue> cat scheduler
[cfq]
Let's load a few more.
router:/sys/block/hda/queue # modprobe deadline-iosched
router:/sys/block/hda/queue # modprobe as-iosched
router:/sys/block/hda/queue # cat scheduler
[cfq] deadline anticipatory
Changing is done with
router:/sys/block/hda/queue # echo deadline > scheduler
router:/sys/block/hda/queue # cat scheduler
cfq [deadline] anticipatory
deadline is now the new active io scheduler for hda.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Jens Axboe <axboe@suse.de>
CFQ I/O scheduler
|
|
include/linux/elevator.h:106: sorry, unimplemented: inlining failed in call to 'elv_try_last_merge': function body not available
|
|
- Remove dead declaration from elevator.h (Nick Piggin)
- Fix the scheduler selection boot-time message. "Using anticipatory
scheduling io scheduler" is not grammatical.
- Remove last use of __SMP__ (Randy Dunlap)
|
|
The "insert_here" list pointer logic was broken, and unnecessary.
Kill it and its associated logic off completely - just tell the IO
scheduler what kind of insert it is.
This also makes the *_insert_request strategies much easier to follow,
imo.
|
|
Add kconfig options to allow excluding either or both of the I/O
schedulers. This can be useful for embedded systems (saves roughly 13KB).
All schedulers are enabled by default for non-embedded configurations.
|
|
into jet.(none):/home1/jejb/BK/scsi-for-linus-2.5
|
|
This patch removes the scsi mid layer dependency on __elv_add_request
and introduces a new blk_requeue_request() function so the block
layer specifically knows a requeue is in progress.
It also adds an elevator hook for elevators like AS which need to
hook into the requeue for correct adjustment of internal counters.
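In a driver, the requeue then becomes an explicit call rather than poking
the elevator directly; a sketch against the later 2.6-style API (locking
simplified, the helper name here is made up):

#include <linux/blkdev.h>
#include <linux/spinlock.h>

/* Hand a request back to the block layer; blk_requeue_request() lets the
 * elevator's requeue hook adjust its internal counters. */
static void ex_requeue(struct request_queue *q, struct request *rq)
{
        unsigned long flags;

        spin_lock_irqsave(q->queue_lock, flags);
        blk_requeue_request(q, rq);
        spin_unlock_irqrestore(q->queue_lock, flags);
}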
|
|
From: Nick Piggin <piggin@cyberone.com.au>
This gets rid of the global queue_nr_requests and usage of BLKDEV_MAX_RQ
(the latter is now only used to set the queues' defaults).
The queue depth becomes per-queue, controlled by a sysfs entry.
|
|
From: Nick Piggin <piggin@cyberone.com.au>
This is the core anticipatory IO scheduler. There are nearly 100 changesets
in this and five months work. I really cannot describe it fully here.
Major points:
- It works by recognising that reads are dependent: we don't know where the
next read will occur, but it's probably close-by the previous one. So once
a read has completed we leave the disk idle, anticipating that a request
for a nearby read will come in.
- There is read batching and write batching logic.
- when we're servicing a batch of writes we will refuse to seek away
for a read for some tens of milliseconds. Then the write stream is
preempted.
- when we're servicing a batch of reads (via anticipation) we'll do
that for some tens of milliseconds, then preempt.
- There are request deadlines, for latency and fairness.
The oldest outstanding request is examined at regular intervals. If
this request is older than a specific deadline, it will be the next
one dispatched. This gives a good fairness heuristic while being simple
because processes tend to have localised IO.
Just about all of the rest of the complexity involves an array of fixups
which prevent most of the obvious failure modes with anticipation: trying
not to leave the disk head pointlessly idle. Some of these algorithms are:
- Process tracking. If the process whose read we are anticipating submits
a write, abandon anticipation.
- Process exit tracking. If the process whose read we are anticipating
exits, abandon anticipation.
- Process IO history. We accumulate statistical info on the process's
recent IO patterns to aid in making decisions about how long to anticipate
new reads (a rough sketch of this sort of tracking follows at the end of
this entry).
Currently thinktime and seek distance are tracked. Thinktime is the
time between when a process's last request has completed and when it
submits another one. Seek distance is simply the number of sectors
between each read request. If either statistic becomes too high, then
it isn't anticipated that the process will submit another read.
The above all means that we need a per-process "io context". This is a fully
refcounted structure. In this patch it is AS-only. Later we generalise it a
little so other IO schedulers can use the same framework.
- Requests are grouped as synchronous and asynchronous whereas deadline
scheduler groups requests as reads and writes. This can provide better
sync write performance, and may give better responsiveness with journalling
filesystems (although we haven't done that yet).
We currently detect synchronous writes by nastily setting PF_SYNCWRITE in
current->flags. The plan is to remove this later, and to propagate the
sync hint from writeback_control.sync_mode into bio->bi_flags and thence into
request->flags. Once that is done, direct-io needs to set the BIO sync
hint as well.
- There is also quite a bit of complexity gone into bashing TCQ into
submission. Timing for a read batch is not started until the first read
request actually completes. A read batch also does not start until all
outstanding writes have completed.
AS is the default IO scheduler. deadline may be chosen by booting with
"elevator=deadline".
There are a few reasons for retaining deadline:
- AS is often slower than deadline in random IO loads with large TCQ
windows. The usual real world task here is OLTP database loads.
- deadline is presumably more stable.
- deadline is much simpler.
The tunable per-queue entries under /sys/block/*/iosched/ are all in
milliseconds:
* read_expire
Controls how long until a request becomes "expired".
It also controls the interval at which expired requests are served,
so if set to 50, a request might take anywhere up to 100ms to be serviced
_if_ it is the next on the expired list.
Obviously it can't make the disk go faster. The result is basically the
timeslice a reader gets in the presence of other IO: a reader streams for
roughly read_expire and then loses a seek to the next reader, so
100 / ((seek time / read_expire) + 1) is very roughly the % streaming read
efficiency your disk should get in the presence of multiple readers.
* read_batch_expire
Controls how much time a batch of reads is given before pending writes
are served. Higher value is more efficient. Shouldn't really be below
read_expire.
* write_ versions of the above
* antic_expire
Controls the maximum amount of time we can anticipate a good read before
giving up. Many other factors may cause anticipation to be stopped early,
or some processes will not be "anticipated" at all. Should be a bit higher
for big seek time devices, though not a linear correspondence - most
processes have only a few ms thinktime.
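Going back to the process IO history point above, the tracking is
conceptually just a decayed running mean per statistic; a rough standalone
sketch (the thresholds and the 7/8-1/8 weighting are made up for
illustration, the real AS code uses its own fixed-point scheme):

/* Illustrative per-process io statistics, decayed so recent behaviour
 * dominates (a 7/8 old + 1/8 new running mean). */
struct ex_io_hist {
        unsigned long mean_thinktime_us;   /* completion -> next submit */
        unsigned long mean_seek_sectors;   /* distance between reads    */
};

static void ex_update_thinktime(struct ex_io_hist *h, unsigned long sample_us)
{
        h->mean_thinktime_us = (h->mean_thinktime_us * 7 + sample_us) / 8;
}

static void ex_update_seek(struct ex_io_hist *h, unsigned long distance)
{
        h->mean_seek_sectors = (h->mean_seek_sectors * 7 + distance) / 8;
}

/* Anticipation is only worth it for processes that come back quickly
 * and stay close by; the thresholds here are invented for illustration. */
static int ex_worth_anticipating(const struct ex_io_hist *h)
{
        return h->mean_thinktime_us < 6000 && h->mean_seek_sectors < 8192;
}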
|
|
From: Nick Piggin <piggin@cyberone.com.au>
Introduces an elevator_completed_req() callback with which the generic
queueing layer may tell an IO scheduler that a particular request has
finished.
|
|
Introduces the elv_may_queue() predicate with which the IO scheduler may tell
the generic request layer that we may add another request to this queue.
It is used by the CFQ elevator.
|
|
The noop io scheduler has a data corrupting bug, because q->last_merge
doesn't get cleared properly. So do that in the io scheduler core, and
remove the same code from deadline.
Also kill bio_rq_in_between(), it's not used by anyone anymore. rbtrees
are the hot thing these days.
And finally, remove a direct test for REQ_CMD in rq flags, use
blk_fs_request() instead.
|
|
This patch adds dynamic allocation of request structures. Right now we
are reserving 256 requests per initialized queue, which adds up to quite
a lot of memory for even a modest number of queues. For the quoted 4000
disk systems, it's a disaster.
Instead, we mempool 4 requests per queue and put an upper limit on the
number of requests that we will put in-flight as well. I've kept the 128
read/write max in-flight limit for now. It is trivial to experiment
with larger queue sizes now, but I want to change one thing at a time
(the truncate scenario doesn't look all that good with a huge number of
requests, for instance).
Patch has been in -mm for a while, and I'm running it here against stock
2.5 as well. Additionally, it actually kills quite a bit of code.
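In today's terms, the reserve would be built with a small slab-backed
mempool along these lines (illustrative sketch using the current mempool
helpers rather than the 2.5-era API; names and the 4-request reserve mirror
the description above):

#include <linux/init.h>
#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/mempool.h>

/* Illustrative request struct and pool sizing; the real block layer uses
 * its own struct request cache and limits. */
struct ex_request { int dummy; };

#define EX_MIN_RESERVED_REQUESTS 4      /* per-queue reserve instead of 256 */

static struct kmem_cache *ex_request_cachep;
static mempool_t *ex_request_pool;

static int __init ex_request_pool_init(void)
{
        ex_request_cachep = kmem_cache_create("ex_request",
                                              sizeof(struct ex_request),
                                              0, 0, NULL);
        if (!ex_request_cachep)
                return -ENOMEM;

        /* Guarantee only a small reserve; everything above that is
         * allocated dynamically and bounded by a per-queue limit. */
        ex_request_pool = mempool_create_slab_pool(EX_MIN_RESERVED_REQUESTS,
                                                   ex_request_cachep);
        if (!ex_request_pool) {
                kmem_cache_destroy(ex_request_cachep);
                return -ENOMEM;
        }
        return 0;
}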
|
|
This file was _the_ header for block-device related stuff in earlier
Linux versions, but nowadays there are just a few prototypes left that
really belong in blkdev.h or genhd.h (and in one case elevator.h).
This patch moves them over and removes everything from blk.h except the
include of blkdev.h. Note that blkdev.h gets all the headers that
were included in blk.h implicitly too. Now we can start removing
all references to it and maybe kill it off before 2.6. *sniff*
|
|
This patch has a bunch of io scheduler goodies that are, by now, well
tested in -mm and by myself and Nick Piggin. In order of interest:
- Use rbtree data structure for sorting of requests. Even with the
default queue lengths that are fairly short, this cuts a lot of run
time for io scheduler intensive work loads. If we go to longer queue
lengths, it very quickly becomes a necessity.
- Add a sysfs interface for the tunables. At the same time, finally kill
the BLKELVGET/BLKELVSET ioctls completely. I made these return -ENOTTY in
2.5.1, but there are left-overs around the kernel. This old interface
was never any good; it was centered around just one io scheduler.
The io scheduler core itself has received countless hours of tuning by
myself and Nick, and should be in pretty good shape. Please apply.
Andrew, I made some sysfs changes to the version from 2.5.56-mm1. It
didn't even compile without warnings (or work, for that matter), as the
sysfs store/show procedures needed updating. Hmm?
|
|
Request insertion in the current tree is a mess. We have all sorts of
variants of *elv_add_request*, and it's not at all clear who does what
and with what locks (or not). This patch cleans it up to be:
o __elv_add_request(queue, request, at_end, plug)
Core function, requires queue lock to be held
o elv_add_request(queue, request, at_end, plug)
Like __elv_add_request(), but grabs queue lock
o __elv_add_request_pos(queue, request, position)
Insert request at a given location, lock must be held
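A hedged sketch of how callers pick between the variants listed above
(ex_-prefixed stand-ins for the real functions; flag values assume at_end=1
means "append" and plug=1 means "plug the queue"):

struct ex_queue;
struct ex_request;

/* Stand-ins for the variants listed above. */
extern void ex_elv_add_request(struct ex_queue *q, struct ex_request *rq,
                               int at_end, int plug);
extern void __ex_elv_add_request(struct ex_queue *q, struct ex_request *rq,
                                 int at_end, int plug);

/* No queue lock held: use the wrapper, which grabs it for us. */
static void ex_submit_unlocked(struct ex_queue *q, struct ex_request *rq)
{
        ex_elv_add_request(q, rq, 1, 1);
}

/* Already under the queue lock (e.g. a requeue path): use the __ variant. */
static void ex_submit_locked(struct ex_queue *q, struct ex_request *rq)
{
        __ex_elv_add_request(q, rq, 1, 0);
}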
|
|
Ingo spotted this one too; it's a leftover from when the elevator type
wasn't a variable. Also don't pass in &q->elevator, since it can always be
deduced from the queue itself, of course.
|
|
This fixes a problem with the deadline io scheduler, if the correct
insertion point is at the front of the list. This is something that we
never have gotten right in 2.4 either.
The problem is that the elevator merge function has to return a pointer
to a struct request, and for front insert we really have to return the
head of the list which cannot be expressed as a request of course.
The real issue is that the elevator_merge function actually performs two
functions - it scans for a merge, and if it can't find any, it selects
and insertion point. It's done this way for efficiency reasons, even if
the design isn't all that clean.
So we change the io scheduler merge functions to get passed a pointer to
a list_head pointer instead. This works for both inserts and merges.
In addition, deadline checks if it really should insert at the very
front.
Also don't pass in request to elv_try_last_merge(), the very name of the
function suggests that it's q->last_merge that we are interested in.
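Schematically, the reworked hook described above has this shape (names and
return values are illustrative, not the exact 2.5 signatures):

#include <linux/list.h>

enum { EX_NO_MERGE = 0, EX_FRONT_MERGE, EX_BACK_MERGE };

struct ex_queue { struct list_head queue_head; };
struct ex_bio;

/* Scan for a merge; if none is found, fill in where the new request
 * should be inserted. The head of the list is now a valid answer. */
static int ex_elevator_merge(struct ex_queue *q, struct list_head **insert,
                             struct ex_bio *bio)
{
        /* ... look for a back or front merge candidate for bio ... */

        /* No merge found: point *insert at the chosen position. */
        *insert = &q->queue_head;
        return EX_NO_MERGE;
}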
|
|
Some various small cleanups, optimizations, and fixes.
o Make fifo_batch=32 the default; from testing this appears to be a good
default value. We still get good throughput, and latency is good.
o Reintroduce the merge_cleanup logic. We need it for deadline for
rehashing requests when they have been merged.
o Clean up the last_merge logic. Move it to the new elv_merged_request(),
which is where it really belongs. Doing it inside the io scheduler core
can cause false positives, when the queue merge functions reject an
otherwise good merge.
o Have deadline_move_requests() account from last entry on the dispatch
queue, if it is non-empty. It doesn't really matter what the last
extracted sector was, if we are not right behind it.
o Clean/optimize deadline_move_requests()
o Account the size of a request just a little bit. Streaming transfer
isn't free; it's just a lot cheaper than a seek.
o Make deadline_check_fifo() more readable.
|
|
Patch killing off elevator_linus for good. Sniffle.
|
|
This introduces the deadline-ioscheduler, making it the default. 2nd
patch coming that deletes elevator_linus in a minute.
This one has read_expire at 500ms, and writes_starved at 2.
|
|
elevator_linus is seriously broken wrt accounting. Marcelo recently took
the patch to fix it in 2.4.20-pre, here's the 2.5 equiv.
Right now, we account merges as costly and seeks as not. The only thing that
prevents seek starvation is the aging scan. That is broken, very much
so. This patch fixes that to account merges and inserts differently. A
seek is ELV_LINUS_SEEK_COST more costly than a merge, currently that
define is at '16'. Doing the math on a disk, this sort of makes sense.
Defaults are read latency of 1024, which means 1024 merges or 64 seeks.
Writes are double that.
|
|
I've got a new i/o scheduler in testing, and some changes were needed in
the block layer to accommodate it, basically because right now
assumptions are made about q->queue_head being the sort list. The
changes in detail:
o elevator_merge_requests_fn takes queue argument as well
o __make_request() inits insert_here to NULL instead of
q->queue_head.prev, which means that the i/o schedulers must
explicitly check for this condition now.
o incorporate elv_queue_empty(), it was just a placeholder before
o add elv_get_sort_head(). it returns the sort head of the elevator for
a given request. attempt_{back,front}_merge uses it to determine
whether a request is valid or not. Maybe attempt_{back,front}_merge
should just be killed, I doubt they have much relevance with the wake
up batching.
o call the merge_cleanup functions of the elevator _after_ the merge has
been done, not before. This way the elevator functions get the new
state of the request, which is the most interesting.
o Kill extra nr_sectors check in ll_merge_requests_fn()
o bi->bi_bdev is always set in __make_request(), so kill check.
|