path: root/include/linux/elevator.h
Age | Commit message | Author
2005-06-27 | [PATCH] Update cfq io scheduler to time sliced design | Jens Axboe
This updates the CFQ io scheduler to the new time sliced design (cfq v3). It provides full process fairness, while giving excellent aggregate system throughput even for many competing processes. It supports io priorities, either inherited from the cpu nice value or set directly with the ioprio_get/set syscalls. The latter closely mimic set/getpriority. This import is based on my latest from -mm.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
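For context on the ioprio_get/set interface mentioned above: it is a plain syscall pair, so a minimal userspace sketch looks roughly like the following (assuming the libc exposes SYS_ioprio_set/SYS_ioprio_get; the IOPRIO_* constants are copied in by hand here and should be checked against <linux/ioprio.h>).

    /*
     * Minimal sketch of the ioprio_get/ioprio_set syscalls.  The priority
     * encoding mirrors the kernel's IOPRIO_PRIO_VALUE() macro; treat the
     * constants as illustrative and verify them against <linux/ioprio.h>.
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #define IOPRIO_CLASS_SHIFT  13
    #define IOPRIO_CLASS_BE     2                 /* best-effort class */
    #define IOPRIO_WHO_PROCESS  1
    #define IOPRIO_PRIO_VALUE(cls, data) (((cls) << IOPRIO_CLASS_SHIFT) | (data))

    int main(void)
    {
        /* set the calling process to best-effort, priority level 4 */
        if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                    IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 4)) < 0) {
            perror("ioprio_set");
            return 1;
        }

        /* read it back, much like getpriority() */
        long prio = syscall(SYS_ioprio_get, IOPRIO_WHO_PROCESS, 0);
        printf("ioprio: class %ld, level %ld\n",
               prio >> IOPRIO_CLASS_SHIFT,
               prio & ((1L << IOPRIO_CLASS_SHIFT) - 1));
        return 0;
    }

As with set/getpriority, the "who" argument can also name a process group or user; 0 with IOPRIO_WHO_PROCESS means the calling process.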
2005-03-08 | [PATCH] barrier rework updates | Jens Axboe
As promised to Andrew, here are the latest bits that fix up the block io barrier handling.
- Add an io scheduler ->deactivate hook to tell the io scheduler that a request is suspended from the block layer. cfq and as need this hook.
- Locking updates
- Make sure a driver doesn't reuse the flush rq before a previous one has completed
- Typo in the scsi_io_completion() function, the bit shift was wrong
- sd needs a proper timeout on the flush
- remove silly debug leftover in ide-disk wrt "hdc"
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-18 | [PATCH] cfq-v2 I/O scheduler update | Jens Axboe
Here is the next incarnation of the CFQ io scheduler, so far known as CFQ v2 locally. It attempts to address some of the limitations of the original CFQ io scheduler (henceforth known as CFQ v1). Some of the problems with CFQ v1 are:
- It does accounting for the lifetime of the cfq_queue, which is set up and torn down for the time when a process has io in flight. For a fork-heavy workload (such as a kernel compile, for instance), new processes can effectively starve io of running processes. This is in part due to the fact that CFQ v1 gives preference to new processes to get better latency numbers. Removing that heuristic is not an option exactly because of that.
- It makes no attempt to address inter-cfq_queue fairness.
- It makes no attempt to limit the upper latency bound of a single request.
- It only provides per-tgid grouping. You need to change the source to group on a different criterion.
- It uses a mempool for the cfq_queues. Theoretically this could deadlock if io bound processes never exit.
- The may_queue() logic can be unfair since it fluctuates quickly, thus leaving processes sleeping while new processes are allowed to allocate a request.
CFQ v2 attempts to fix these issues. It uses the process io_context logic to maintain a cfq_queue lifetime of the duration of the process (and its io). This means we can now be a lot more clever in deciding which process is allowed to queue or dispatch io to the device. The cfq_io_context is per-process per-queue; this is an extension to what AS currently does in that we truly do have a unique per-process identifier for io grouping.
Busy queues are sorted by service time used, sub-sorted by in_flight requests. Queues that have no io in flight are also preferred at dispatch time. Accounting is done on completion time of a request, or with a fixed cost for tagged command queueing. Requests are fifo'ed like with deadline, to make sure that a single request doesn't stay in the io scheduler for ages.
Process grouping is selectable at runtime. I provide 4 grouping criteria: process group, thread group id, user id, and group id. As usual, settings are sysfs tweakable in /sys/block/<dev>/queue/iosched
    axboe@apu:[.]s/block/hda/queue/iosched $ ls
    back_seek_max      fifo_batch_expire  find_best_crq  queued
    back_seek_penalty  fifo_expire_async  key_type       show_status
    clear_elapsed      fifo_expire_sync   quantum        tagged
In order, each of these settings controls:
back_seek_max / back_seek_penalty: Useful logic stolen from AS that allows small backwards seeks in the io stream if we deem them useful. CFQ uses a strict ascending elevator otherwise. _max controls the maximum allowed backwards seek, defaulting to 16MiB. _penalty denotes how expensive we account a backwards seek compared to a forward seek. Default is 2, meaning it's twice as expensive.
clear_elapsed: Really a debug switch, will go away in the future. It clears the maximum values for completion and dispatch time, shown in show_status.
fifo_batch_expire / fifo_expire_async / fifo_expire_sync: The settings for the expiry fifo. batch_expire is how often we allow the fifo expiry to control which request to select. Default is 125ms. _async is the deadline for async requests (typically writes), _sync is the deadline for sync requests (reads and sync writes). Defaults are, respectively, 5 seconds and 0.5 seconds.
key_type: The grouping key. Can be set to pgid, tgid, uid, or gid. The current value is shown bracketed:
    axboe@apu:[.]s/block/hda/queue/iosched $ cat key_type
    [pgid] tgid uid gid
Default is tgid. To set, simply echo any of the 4 words into the file.
quantum: The number of requests we select for dispatch when the driver asks for work to do and the current pending list is empty. Default is 4.
queued: The minimum number of requests a group is allowed to queue. Default is 8.
show_status: Debug output showing the current state of the queues.
tagged: Set this to 1 if the device is using tagged command queueing. This cannot be reliably detected by CFQ yet, since most drivers don't use the block layer (well, it could, by looking at the number of requests between dispatch and completion, but not completely reliably). Default is 0.
The patch is a little big, but works reliably here on my laptop. There are a number of other changes and fixes in there (like converting to hlist for hashes). The code is commented a lot better; CFQ v1 has basically no comments (reflecting that it was written in one go, not touched or tuned much since then). This is of course only done to increase the AAF, the akpm acceptance factor.
Since I'm on the road, I cannot provide any really good numbers of CFQ v1 compared to v2, maybe someone will help me out there.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
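As a rough illustration of the back_seek_max/back_seek_penalty accounting described above (a sketch of the idea, not CFQ's actual code; names and types are invented):

    /*
     * Illustrative sketch of CFQ v2's backward-seek weighting: a backward
     * seek within back_seek_max is allowed but charged back_seek_penalty
     * times the equivalent forward distance.
     */
    #define BACK_SEEK_MAX_SECTORS   (16ULL * 1024 * 1024 / 512)  /* 16MiB default */
    #define BACK_SEEK_PENALTY       2ULL                         /* default 2 */

    static unsigned long long seek_cost(unsigned long long last,
                                        unsigned long long next)
    {
        if (next >= last)
            return next - last;                   /* forward seek: plain distance */
        if (last - next > BACK_SEEK_MAX_SECTORS)
            return ~0ULL;                         /* too far back: effectively never chosen */
        return (last - next) * BACK_SEEK_PENALTY; /* backward seek: twice as expensive */
    }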
2004-10-18 | [PATCH] switchable and modular io schedulers | Jens Axboe
This patch makes the io schedulers completely modular, and additionally enables online switching of io schedulers. See also http://lwn.net/Articles/102593/ .
There's a scheduler file in the sysfs directory for the block device queue:
    axboe@router:/sys/block/hda/queue> ls
    iosched  max_sectors_kb  read_ahead_kb  max_hw_sectors_kb  nr_requests  scheduler
If you list the contents of the file, it will show the available schedulers and the active one:
    axboe@router:/sys/block/hda/queue> cat scheduler
    [cfq]
Let's load a few more:
    router:/sys/block/hda/queue # modprobe deadline-iosched
    router:/sys/block/hda/queue # modprobe as-iosched
    router:/sys/block/hda/queue # cat scheduler
    [cfq] deadline anticipatory
Changing is done with:
    router:/sys/block/hda/queue # echo deadline > scheduler
    router:/sys/block/hda/queue # cat scheduler
    cfq [deadline] anticipatory
deadline is now the new active io scheduler for hda.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
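The same runtime switch can of course be driven from a program instead of the shell; a minimal C sketch (device path hardcoded for illustration, run as root) might be:

    /* Minimal sketch: switch hda's io scheduler by writing the sysfs file.
     * The device path is hardcoded purely for illustration. */
    #include <stdio.h>

    int main(void)
    {
        const char *path = "/sys/block/hda/queue/scheduler";
        char buf[256];
        FILE *f;

        f = fopen(path, "r");
        if (f && fgets(buf, sizeof(buf), f))
            printf("before: %s", buf);       /* active scheduler is shown in [] */
        if (f)
            fclose(f);

        f = fopen(path, "w");
        if (!f) {
            perror(path);
            return 1;
        }
        fputs("deadline\n", f);              /* same effect as: echo deadline > scheduler */
        fclose(f);
        return 0;
    }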
2004-04-12 | [PATCH] CFQ io scheduler | Andrew Morton
From: Jens Axboe <axboe@suse.de>
CFQ I/O scheduler
2004-02-03 | [PATCH] gcc-3.5: elevator.h fixes | Andrew Morton
include/linux/elevator.h:106: sorry, unimplemented: inlining failed in call to 'elv_try_last_merge': function body not available
2003-09-21 | [PATCH] misc fixes | Andrew Morton
- Remove dead declaration from elevator.h (Nick Piggin)
- Fix the scheduler selection boot-time message. "Using anticipatory scheduling io scheduler" is not grammatical.
- Remove last use of __SMP__ (Randy Dunlap)
2003-09-04 | [PATCH] fix IO hangs | Jens Axboe
The "insert_here" list pointer logic was broken, and unnecessary. Kill it and its associated logic off completely - just tell the IO scheduler what kind of insert it is. This also makes the *_insert_request strategies much easier to follow, imo.
2003-07-26 | [PATCH] Make I/O schedulers optional | Bernardo Innocenti
Add kconfig options to allow excluding either or both of the I/O schedulers. This can be useful for embedded systems (saves about 13KB). All schedulers are enabled by default for non-embedded builds.
2003-07-17 | Merge jet.(none):/home1/jejb/BK/scsi-misc-2.5 into jet.(none):/home1/jejb/BK/scsi-for-linus-2.5 | James Bottomley
2003-07-17 | [PATCH] Consolidate SCSI requeueing and add blk elevator hook | Jens Axboe
This patch removes the scsi mid layer dependency on __elv_add_request and introduces a new blk_requeue_request() function so the block layer specifically knows a requeue is in progress. It also adds an elevator hook for elevators like AS which need to hook into the requeue for correct adjustment of internal counters.
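A rough sketch of how a low-level driver would use the new blk_requeue_request() entry point; the request_fn skeleton and the example_hw_* helpers are invented for illustration:

    /*
     * Illustrative driver request function (2.5/2.6-era API): requests that
     * the hardware cannot accept yet are handed back via blk_requeue_request(),
     * which gives the io scheduler's requeue hook a chance to fix up its
     * internal counters.  example_hw_* are made-up helpers.
     */
    static void example_request_fn(request_queue_t *q)
    {
        struct request *rq;

        while ((rq = elv_next_request(q)) != NULL) {
            blkdev_dequeue_request(rq);
            if (!example_hw_can_accept(rq)) {
                /* put it back; elevators like AS see the requeue */
                blk_requeue_request(q, rq);
                break;
            }
            example_hw_submit(rq);
        }
    }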
2003-07-04 | [PATCH] per queue nr_requests | Andrew Morton
From: Nick Piggin <piggin@cyberone.com.au> This gets rid of the global queue_nr_requests and usage of BLKDEV_MAX_RQ (the latter is now only used to set the queues' defaults). The queue depth becomes per-queue, controlled by a sysfs entry.
2003-07-04 | [PATCH] anticipatory I/O scheduler | Andrew Morton
From: Nick Piggin <piggin@cyberone.com.au>
This is the core anticipatory IO scheduler. There are nearly 100 changesets in this and five months' work. I really cannot describe it fully here. Major points:
- It works by recognising that reads are dependent: we don't know where the next read will occur, but it's probably close-by the previous one. So once a read has completed we leave the disk idle, anticipating that a request for a nearby read will come in.
- There is read batching and write batching logic.
- When we're servicing a batch of writes we will refuse to seek away for a read for some tens of milliseconds. Then the write stream is preempted.
- When we're servicing a batch of reads (via anticipation) we'll do that for some tens of milliseconds, then preempt.
- There are request deadlines, for latency and fairness. The oldest outstanding request is examined at regular intervals. If this request is older than a specific deadline, it will be the next one dispatched. This gives a good fairness heuristic while being simple, because processes tend to have localised IO.
Just about all of the rest of the complexity involves an array of fixups which prevent most of the obvious failure modes with anticipation: trying to not leave the disk head pointlessly idle. Some of these algorithms are:
- Process tracking. If the process whose read we are anticipating submits a write, abandon anticipation.
- Process exit tracking. If the process whose read we are anticipating exits, abandon anticipation.
- Process IO history. We accumulate statistical info on the process's recent IO patterns to aid in making decisions about how long to anticipate new reads. Currently thinktime and seek distance are tracked. Thinktime is the time between when a process's last request has completed and when it submits another one. Seek distance is simply the number of sectors between each read request. If either statistic becomes too high, then it isn't anticipated that the process will submit another read.
The above all means that we need a per-process "io context". This is a fully refcounted structure. In this patch it is AS-only; later we generalise it a little so other IO schedulers could use the same framework.
- Requests are grouped as synchronous and asynchronous, whereas the deadline scheduler groups requests as reads and writes. This can provide better sync write performance, and may give better responsiveness with journalling filesystems (although we haven't done that yet). We currently detect synchronous writes by nastily setting PF_SYNCWRITE in current->flags. The plan is to remove this later, and to propagate the sync hint from writeback_control.sync_mode into bio->bi_flags and thence into request->flags. Once that is done, direct-io needs to set the BIO sync hint as well.
- There is also quite a bit of complexity gone into bashing TCQ into submission. Timing for a read batch is not started until the first read request actually completes. A read batch also does not start until all outstanding writes have completed.
AS is the default IO scheduler. deadline may be chosen by booting with "elevator=deadline". There are a few reasons for retaining deadline:
- AS is often slower than deadline in random IO loads with large TCQ windows. The usual real world task here is OLTP database loads.
- deadline is presumably more stable.
- deadline is much simpler.
The tunable per-queue entries under /sys/block/*/iosched/ are all in milliseconds:
* read_expire
  Controls how long until a request becomes "expired".
  It also controls the interval at which expired requests are served; so set to 50, a request might take anywhere < 100ms to be serviced _if_ it is the next on the expired list. Obviously it can't make the disk go faster. The result is basically the timeslice a reader gets in the presence of other IO. 100 / ((seek time / read_expire) + 1) is very roughly the % streaming read efficiency your disk should get in the presence of multiple readers.
* read_batch_expire
  Controls how much time a batch of reads is given before pending writes are served. A higher value is more efficient. Shouldn't really be below read_expire.
* write_ versions of the above
* antic_expire
  Controls the maximum amount of time we can anticipate a good read before giving up. Many other factors may cause anticipation to be stopped early, or some processes will not be "anticipated" at all. Should be a bit higher for big seek time devices, though not a linear correspondence - most processes have only a few ms thinktime.
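As a worked example of the read_expire rule of thumb above (numbers are illustrative, not measurements):

    /* Rough streaming-efficiency estimate from the rule of thumb above:
     * efficiency ~= 100 / ((seek_time / read_expire) + 1).
     * The inputs are illustrative, not measured values. */
    #include <stdio.h>

    int main(void)
    {
        double seek_time_ms = 10.0;   /* assumed average seek + settle time */
        double read_expire_ms = 50.0; /* the "set to 50" example in the text */
        double eff = 100.0 / ((seek_time_ms / read_expire_ms) + 1.0);

        /* prints roughly 83% for these numbers */
        printf("~%.0f%% of streaming throughput with competing readers\n", eff);
        return 0;
    }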
2003-07-04 | [PATCH] elevator completion API | Andrew Morton
From: Nick Piggin <piggin@cyberone.com.au>
Introduces an elevator_completed_req() callback with which the generic queueing layer may tell an IO scheduler that a particular request has finished.
2003-07-04 | [PATCH] elv_may_queue() API function | Andrew Morton
Introduces the elv_may_queue() predicate with which the IO scheduler may tell the generic request layer whether another request may be added to this queue. It is used by the CFQ elevator.
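A sketch of how an io scheduler might wire up the completion and may_queue hooks introduced in the two entries above. The field names follow the 2.6-era struct elevator_ops; everything prefixed noddy_ is invented for the example:

    /*
     * Illustrative hook table for a toy io scheduler.  Only the two hooks
     * discussed above are shown; the data structure and policy are made up.
     */
    struct noddy_data {
        int nr_in_flight;
        int soft_limit;
    };

    static struct noddy_data noddy;        /* one queue only, for brevity */

    static int noddy_may_queue(request_queue_t *q, int rw)
    {
        /* tell the request layer whether allocating another request is ok */
        return noddy.nr_in_flight < noddy.soft_limit;
    }

    static void noddy_completed_req(request_queue_t *q, struct request *rq)
    {
        noddy.nr_in_flight--;              /* bookkeeping on completion */
    }

    static struct elevator_ops noddy_ops = {
        .elevator_may_queue_fn     = noddy_may_queue,
        .elevator_completed_req_fn = noddy_completed_req,
    };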
2003-05-23 | [PATCH] elevator core update | Jens Axboe
The noop io scheduler has a data corrupting bug, because q->last_merge doesn't get cleared properly. So do that in io scheduler core, and remove the same code from deadline. Also kill bio_rq_in_between(), it's not used by anyone anymore. rbtrees are the hot thing these days. And finally, remove a direct test for REQ_CMD in rq flags, use blk_fs_request() instead.
2003-05-10 | [PATCH] dynamic request allocation | Jens Axboe
This patch adds dynamic allocation of request structures. Right now we are reserving 256 requests per initialized queue, which adds up to quite a lot of memory for even a modest number of queues. For the quoted 4000-disk systems, it's a disaster. Instead, we mempool 4 requests per queue and put an upper limit on the number of requests that we will put in flight as well. I've kept the 128 read/write max in-flight limit for now. It is trivial to experiment with larger queue sizes now, but I want to change one thing at a time (the truncate scenario doesn't look all that good with a huge number of requests, for instance).
The patch has been in -mm for a while, and I'm running it here against stock 2.5 as well. Additionally, it actually kills quite a bit of code.
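The per-queue reserve amounts to a small mempool; a sketch using the mempool API of that era (request_cachep and the example_* wrappers are assumptions for illustration):

    /*
     * Sketch of the 4-request per-queue reserve described above, using the
     * 2.5-era mempool API.  The surrounding functions are invented; only
     * mempool_create/mempool_alloc and the slab helpers are real calls.
     */
    static mempool_t *rq_pool;

    static int example_init_queue_pool(kmem_cache_t *request_cachep)
    {
        /* keep a reserve of 4 requests so allocation never fails completely */
        rq_pool = mempool_create(4, mempool_alloc_slab, mempool_free_slab,
                                 request_cachep);
        return rq_pool ? 0 : -ENOMEM;
    }

    static struct request *example_get_request(int gfp_mask)
    {
        /* may dip into the reserve; the upper in-flight limit (128 reads +
         * 128 writes in this patch) is enforced elsewhere */
        return mempool_alloc(rq_pool, gfp_mask);
    }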
2003-05-03 | [PATCH] make <linux/blk.h> obsolete | Christoph Hellwig
This file was _the_ header for block-device related stuff in earlier Linux versions, but nowadays there are just a few prototypes left that really belong in blkdev.h or genhd.h (and in one case elevator.h). This patch moves them over and removes everything but the inclusion of blkdev.h from blk.h. Note that blkdev.h gets all the headers that were included in blk.h implicitly too. Now we can start removing all references to it and maybe kill it off before 2.6. *sniff*
2003-01-12 | [PATCH] rbtree core for io scheduler | Jens Axboe
This patch has a bunch of io scheduler goodies that are, by now, well tested in -mm and by myself and Nick Piggin. In order of interest:
- Use an rbtree data structure for sorting of requests. Even with the default queue lengths, which are fairly short, this cuts a lot of run time for io scheduler intensive workloads. If we go to longer queue lengths, it very quickly becomes a necessity.
- Add a sysfs interface for the tunables. At the same time, finally kill BLKELVGET/BLKELVSET completely. I made these return -ENOTTY in 2.5.1, but there are left-overs around the kernel. This old interface was never any good; it was centered around just one io scheduler.
The io scheduler core itself has received countless hours of tuning by myself and Nick, and should be in pretty good shape. Please apply.
Andrew, I made some sysfs changes to the version from 2.5.56-mm1. It didn't even compile without warnings (or work, for that matter), as the sysfs store/show procedures needed updating. Hmm?
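A sketch of the core idea, sorting requests by start sector in a kernel rbtree; the struct layout and helper are illustrative rather than the io scheduler's actual code:

    /*
     * Illustrative rbtree insertion keyed by request start sector, using the
     * kernel rbtree API the patch switches to.  The wrapper struct is made up.
     */
    #include <linux/rbtree.h>

    struct sorted_rq {
        struct rb_node node;
        sector_t sector;              /* sort key: request start sector */
    };

    static void rq_rb_insert(struct rb_root *root, struct sorted_rq *srq)
    {
        struct rb_node **p = &root->rb_node;
        struct rb_node *parent = NULL;

        while (*p) {
            struct sorted_rq *cur = rb_entry(*p, struct sorted_rq, node);

            parent = *p;
            if (srq->sector < cur->sector)
                p = &(*p)->rb_left;
            else
                p = &(*p)->rb_right;
        }
        rb_link_node(&srq->node, parent, p);
        rb_insert_color(&srq->node, root);
    }

Dispatch then just walks the tree in sector order, which is what makes longer queue lengths cheap compared to a sorted list.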
2002-10-27 | [PATCH] elv_add_request cleanups | Jens Axboe
Request insertion in the current tree is a mess. We have all sorts of variants of *elv_add_request*, and it's not at all clear who does what and with what locks (or not). This patch cleans it up to be:
o __elv_add_request(queue, request, at_end, plug)
  Core function, requires queue lock to be held
o elv_add_request(queue, request, at_end, plug)
  Like __elv_add_request(), but grabs queue lock
o __elv_add_request_pos(queue, request, position)
  Insert request at a given location, lock must be held
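The __/non-__ split follows the usual kernel locking convention; simplified, the locking wrapper amounts to something like this (a sketch, not the actual block layer code):

    /*
     * Simplified sketch of the locking convention behind the __elv/elv split:
     * the plain variant grabs the queue lock, the __ variant assumes the
     * caller already holds it.
     */
    void elv_add_request(request_queue_t *q, struct request *rq,
                         int at_end, int plug)
    {
        unsigned long flags;

        spin_lock_irqsave(q->queue_lock, flags);
        __elv_add_request(q, rq, at_end, plug);   /* caller-locked core */
        spin_unlock_irqrestore(q->queue_lock, flags);
    }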
2002-10-03 | [PATCH] pass elevator type by reference, not value | Jens Axboe
Ingo spotted this one too; it's a leftover from when the elevator type wasn't a variable. Also don't pass in &q->elevator, it can always be deduced from the queue itself of course.
2002-09-26 | [PATCH] io scheduler update | Jens Axboe
This fixes a problem with the deadline io scheduler if the correct insertion point is at the front of the list. This is something that we have never gotten right in 2.4 either. The problem is that the elevator merge function has to return a pointer to a struct request, and for a front insert we really have to return the head of the list, which cannot be expressed as a request of course.
The real issue is that the elevator_merge function actually performs two functions - it scans for a merge, and if it can't find any, it selects an insertion point. It's done this way for efficiency reasons, even if the design isn't all that clean. So we change the io scheduler merge functions to get passed a pointer to a list_head pointer instead. This works for both inserts and merges. In addition, deadline checks if it really should insert at the very front.
Also don't pass in the request to elv_try_last_merge(); the very name of the function suggests that it's q->last_merge that we are interested in.
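The scan-for-merge-or-record-insertion-point pattern the message describes looks roughly like this; the signature and the example_* names are invented, and the real elevator_merge_fn differs in detail:

    /*
     * Illustrative merge scan: either find a request to merge the bio into,
     * or remember (via the list_head pointer) where a new request should be
     * inserted, which also covers the front-of-list case.
     */
    static int example_scan_for_merge(struct list_head *sort_list,
                                      struct bio *bio,
                                      struct request **req,
                                      struct list_head **insert)
    {
        struct list_head *entry;

        *insert = sort_list;                   /* front insert == the list head itself */
        list_for_each(entry, sort_list) {
            struct request *rq = list_entry(entry, struct request, queuelist);

            if (example_back_merge_ok(rq, bio)) {
                *req = rq;
                return ELEVATOR_BACK_MERGE;
            }
            if (rq->sector > bio->bi_sector)
                break;                         /* passed the slot, stop scanning */
            *insert = entry;                   /* best insertion point so far */
        }
        return ELEVATOR_NO_MERGE;
    }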
2002-09-25 | [PATCH] deadline ioscheduler cleanups | Jens Axboe
Various small cleanups, optimizations, and fixes.
o Make fifo_batch=32 the default; from testing this appears to be a good default value. We still get good throughput, and latency is good.
o Reintroduce the merge_cleanup logic. We need it in deadline for rehashing requests when they have been merged.
o Clean up the last_merge logic. Move it to the new elv_merged_request(); this is where it really belongs. Doing it inside the io scheduler core can cause false positives, when the queue merge functions reject an otherwise good merge.
o Have deadline_move_requests() account from the last entry on the dispatch queue, if it is non-empty. It doesn't really matter what the last extracted sector was, if we are not right behind it.
o Clean/optimize deadline_move_requests()
o Account the size of a request just a little bit. A streaming transfer isn't free, it's just a lot cheaper than a seek.
o Make deadline_check_fifo() more readable.
2002-09-24 | [PATCH] remove elevator_linus | Jens Axboe
Patch killing off elevator_linus for good. Sniffle.
2002-09-24 | [PATCH] deadline scheduler | Jens Axboe
This introduces the deadline-ioscheduler, making it the default. 2nd patch coming that deletes elevator_linus in a minute. This one has read_expire at 500ms, and writes_starved at 2.
2002-09-15 | [PATCH] fix elevator_linus accounting | Jens Axboe
elevator_linus is seriously broken wrt accounting. Marcelo recently took the patch to fix it in 2.4.20-pre; here's the 2.5 equivalent. Right now, we account merges as costly and seeks as not. The only thing that prevents seek starvation is the aging scan, and that is broken, very much so. This patch fixes that to account merges and inserts differently. A seek is ELV_LINUS_SEEK_COST times more costly than a merge; currently that define is at '16'. Doing the math on a disk, this sort of makes sense. Defaults are a read latency of 1024, which means 1024 merges or 64 seeks. Writes are double that.
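Worked numbers for the accounting above (purely illustrative arithmetic):

    /* With ELV_LINUS_SEEK_COST == 16 and a read latency budget of 1024, the
     * budget covers 1024 merges or 64 seeks; the write budget is double. */
    #include <stdio.h>

    #define ELV_LINUS_SEEK_COST 16

    int main(void)
    {
        int read_latency = 1024;

        printf("merges covered by the read budget: %d\n", read_latency);
        printf("seeks covered by the read budget:  %d\n",
               read_latency / ELV_LINUS_SEEK_COST);
        printf("write latency budget:              %d\n", read_latency * 2);
        return 0;
    }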
2002-07-31 | [PATCH] misc elevator/block updates | Jens Axboe
I've got a new i/o scheduler in testing, and some changes were needed in the block layer to accommodate it, basically because right now assumptions are made about q->queue_head being the sort list. The changes in detail:
o elevator_merge_requests_fn takes a queue argument as well
o __make_request() inits insert_here to NULL instead of q->queue_head.prev, which means that the i/o schedulers must explicitly check for this condition now.
o incorporate elv_queue_empty(), it was just a placeholder before
o add elv_get_sort_head(). It returns the sort head of the elevator for a given request. attempt_{back,front}_merge uses it to determine whether a request is valid or not. Maybe attempt_{back,front}_merge should just be killed, I doubt they have much relevance with the wake up batching.
o call the merge_cleanup functions of the elevator _after_ the merge has been done, not before. This way the elevator functions get the new state of the request, which is the most interesting.
o Kill extra nr_sectors check in ll_merge_requests_fn()
o bi->bi_bdev is always set in __make_request(), so kill the check.
2002-06-20 | [PATCH] uninline elv_next_request() | Jens Axboe
Uninline elv_next_request() and move it to elevator.c, where it belongs. Because of CURRENT declaration, this actually saves lots of space. From Andrew.
2002-02-04 | v2.5.1.6 -> v2.5.1.7 | Linus Torvalds
- Jeff Garzik: fix up loop and md for struct kdev_t typechecking
- Jeff Garzik: improved old-tulip network driver
- Arnaldo: more scsi driver bio updates
- Kai Germaschewski: ISDN updates
- various: kdev_t updates
2002-02-04 | v2.5.1 -> v2.5.1.1 | Linus Torvalds
- me: revert the "kill(-1..)" change. POSIX isn't that clear on the issue anyway, and the new behaviour breaks things.
- Jens Axboe: more bio updates
- Al Viro: rd_load cleanups. hpfs mount fix, mount cleanups
- Ingo Molnar: more raid updates
- Jakub Jelinek: fix Linux/x86 confusion about arg passing of "save_v86_state" and "do_signal"
- Trond Myklebust: fix NFS client race conditions
2002-02-04 | v2.5.0.2 -> v2.5.0.3 | Linus Torvalds
- Al Viro: more superblock cleanups
- Jens Axboe: more patches for new block IO layer
- Christoph Hellwig: get rid of the old, long-deprecated SCSI error handling
2002-02-04 | v2.5.0.1 -> v2.5.0.2 | Linus Torvalds
- Greg KH: USB update
- Richard Gooch: refcounting for devfs
- Jens Axboe: start of new block IO layer
2002-02-04 | v2.4.1.3 -> v2.4.1.4 | Linus Torvalds
- big S/390x 64-bit merge
- typos and license name fixes. doc updates.
- more include file cleanups (phase out "malloc.h")
- even more elevator corner cases: when not merging, find the best insertion point.
- pmac ide update
- network fixes (netif_wake_queue on tx timeout)
- USB printer select() fix
- NFS client missed initialization, daemon fixed client address check
2002-02-04 | v2.4.1.2 -> v2.4.1.3 | Linus Torvalds
- Jens: better ordering of requests when unable to merge
- Neil Brown: make md work as a module again (we cannot autodetect in modules, not enough background information)
- Neil Brown: raid5 SMP locking cleanups
- Neil Brown: nfsd: handle Irix NFS clients named pipe behavior and dentry leak fix
- maestro3 shutdown fix
- fix dcache hash calculation that could cause bad hashes under certain circumstances (Dean Gaudet)
- David Miller: networking and sparc updates
- Jeff Garzik: include file cleanups
- Andy Grover: ACPI update
- Coda-fs error return fixes
- rth: alpha Jensen update
2002-02-04 | v2.4.0.4 -> v2.4.0.5 | Linus Torvalds
- ppp UP deadlock attack fix
2002-02-04 | Import changeset | Linus Torvalds