path: root/include/linux/backing-dev.h
Age  Commit message  Author
2007-10-17  mm: per device dirty threshold  (Peter Zijlstra)
Scale writeback cache per backing device, proportional to its writeout speed.

By decoupling the BDI dirty thresholds a number of problems we currently have will go away, namely:

- mutual interference starvation (for any number of BDIs);
- deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).

It might be that all dirty pages are for a single BDI while other BDIs are idling. By giving each BDI a 'fair' share of the dirty limit, each one can have dirty pages outstanding and make progress.

A global threshold also creates a deadlock for stacked BDIs; when A writes to B, and A generates enough dirty pages to get throttled, B will never start writeback until the dirty pages go away. Again, by giving each BDI its own 'independent' dirty limit, this problem is avoided.

So the problem is to determine how to distribute the total dirty limit across the BDIs fairly and efficiently. A BDI that has a large dirty limit but does not have any dirty pages outstanding is a waste.

What is done is to keep a floating proportion between the BDIs based on writeback completions. This way faster/more active devices get a larger share than slower/idle devices.

[akpm@linux-foundation.org: fix warnings]
[hugh@veritas.com: Fix occasional hang when a task couldn't get out of balance_dirty_pages]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
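A minimal sketch of the floating-proportion idea in kernel-style C: each BDI's slice of the global dirty limit is its fraction of recent writeback completions. The helper name bdi_dirty_share() and the numerator/denominator plumbing are illustrative, not the exact interface the patch adds.

        #include <linux/kernel.h>
        #include <asm/div64.h>

        /*
         * Hedged sketch: give each BDI a share of the global dirty limit
         * proportional to its recent writeback completions.  'numerator'
         * stands in for this BDI's completion count, 'denominator' for
         * the system-wide total, as tracked by the floating proportion.
         */
        static unsigned long bdi_dirty_share(unsigned long global_limit,
                                             long numerator, long denominator)
        {
                u64 share;

                if (denominator <= 0)
                        return 0;       /* no completions recorded yet */

                share = (u64)global_limit * numerator;
                do_div(share, denominator);
                return min_t(unsigned long, share, global_limit);
        }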
2007-10-17  mm: count writeback pages per BDI  (Peter Zijlstra)
Count per BDI writeback pages.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17  mm: count reclaimable pages per BDI  (Peter Zijlstra)
Count per BDI reclaimable pages; nr_reclaimable = nr_dirty + nr_unstable.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17  mm: scalable bdi statistics counters  (Peter Zijlstra)
Provide scalable per backing_dev_info statistics counters.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
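The shape these per-BDI counters take is an enum-indexed array of percpu counters embedded in the backing_dev_info. A sketch, with names following the entries above but treated as approximate and other members elided:

        #include <linux/percpu_counter.h>

        enum bdi_stat_item {
                BDI_RECLAIMABLE,        /* dirty + unstable, per the entry above */
                BDI_WRITEBACK,          /* pages currently under writeback */
                NR_BDI_STAT_ITEMS
        };

        struct backing_dev_info {
                struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
                /* other members elided */
        };

        static inline s64 bdi_stat(struct backing_dev_info *bdi,
                                   enum bdi_stat_item item)
        {
                /* cheap, approximately-accurate read; never goes negative */
                return percpu_counter_read_positive(&bdi->bdi_stat[item]);
        }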
2007-10-17  mm: bdi init hooks  (Peter Zijlstra)
Provide BDI constructor/destructor hooks.

[akpm@linux-foundation.org: compile fix]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
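A sketch of the constructor/destructor pair and a typical embedding; the mydev structure and its setup/teardown names are hypothetical:

        int bdi_init(struct backing_dev_info *bdi);
        void bdi_destroy(struct backing_dev_info *bdi);

        /* hypothetical driver embedding a backing_dev_info */
        struct mydev {
                struct backing_dev_info bdi;
        };

        static int mydev_setup(struct mydev *dev)
        {
                return bdi_init(&dev->bdi);     /* may fail: allocates percpu state */
        }

        static void mydev_teardown(struct mydev *dev)
        {
                bdi_destroy(&dev->bdi);         /* must pair with a successful init */
        }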
2007-10-17  nfs: remove congestion_end()  (Peter Zijlstra)
These patches aim to improve balance_dirty_pages() and directly address three issues:

1) inter device starvation
2) stacked device deadlocks
3) inter process starvation

1 and 2 are a direct result of removing the global dirty limit and using per device dirty limits. By giving each device its own dirty limit it will no longer starve another device, and the cyclic dependency on the dirty limit is broken.

In order to efficiently distribute the dirty limit across the independent devices a floating proportion is used; this will allocate a share of the total limit proportional to the device's recent activity.

3 is done by also scaling the dirty limit proportional to the current task's recent dirty rate.

This patch: nfs: remove congestion_end(). It's redundant, clear_bdi_congested() already wakes the waiters.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
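Why congestion_end() was redundant: the clear path already performs the wakeup. An era-appropriate sketch of that clear path (the per-direction waitqueue array and bit names are assumed from the surrounding code):

        static wait_queue_head_t congestion_wqh[2];     /* [READ], [WRITE] */

        void clear_bdi_congested(struct backing_dev_info *bdi, int rw)
        {
                enum bdi_state bit = (rw == WRITE) ? BDI_write_congested
                                                   : BDI_read_congested;

                clear_bit(bit, &bdi->state);
                smp_mb__after_clear_bit();
                /* the wakeup congestion_end() used to do happens right here */
                if (waitqueue_active(&congestion_wqh[rw]))
                        wake_up(&congestion_wqh[rw]);
        }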
2007-07-16  remove mm/backing-dev.c:congestion_wait_interruptible()  (Adrian Bunk)
congestion_wait_interruptible() is no longer used.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16  [PATCH] nfs: fix congestion control  (Peter Zijlstra)
The current NFS client congestion logic is severely broken: it marks the backing device congested during each nfs_writepages() call but doesn't mirror this in nfs_writepage(), which makes for deadlocks. Also it implements its own waitqueue.

Replace this by a more regular congestion implementation that puts a cap on the number of active writeback pages and uses the bdi congestion waitqueue.

Also always use an interruptible wait since it makes sense to be able to SIGKILL the process even for mounts without 'intr'.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
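A sketch of the capped-writeback scheme described above: once in-flight NFS writeback pages exceed a threshold, mark the BDI congested and sleep on the shared congestion waitqueue. The counter and threshold names here are illustrative, not the patch's own:

        /* illustrative names: the real patch keeps its own counter/threshold */
        static atomic_long_t nfs_writeback_pages;
        static long nfs_congestion_pages;

        static void nfs_writeback_throttle(struct backing_dev_info *bdi)
        {
                if (atomic_long_read(&nfs_writeback_pages) > nfs_congestion_pages) {
                        set_bdi_congested(bdi, WRITE);
                        /* sleep on the shared bdi congestion waitqueue */
                        congestion_wait(WRITE, HZ/10);
                }
        }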
2006-10-20  [PATCH] separate bdi congestion functions from queue congestion functions  (Andrew Morton)
Separate out the concept of "queue congestion" from "backing-dev congestion". Congestion is a backing-dev concept, not a queue concept.

The blk_* congestion functions are retained, as wrappers around the core backing-dev congestion functions. This proper layering is needed so that NFS can cleanly use the congestion functions, and so that CONFIG_BLOCK=n actually links.

Cc: "Thomas Maier" <balagi@justmail.de>
Cc: "Jens Axboe" <jens.axboe@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: David Howells <dhowells@redhat.com>
Cc: Peter Osterlund <petero2@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
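The layering lands roughly like this: the core moves to mm/backing-dev.c and the retained blk_* names become thin wrappers. A sketch, with signatures treated as approximate for this era:

        /* core, now in mm/backing-dev.c */
        long congestion_wait(int rw, long timeout);

        /* retained block-layer name, kept as a compatibility wrapper */
        static inline long blk_congestion_wait(int rw, long timeout)
        {
                return congestion_wait(rw, timeout);
        }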
2005-03-30  [PATCH] BDI: Provide backing device capability information [try #3]  (David Howells)
The attached patch replaces backing_dev_info::memory_backed with a capabilities bitmap. The capabilities available include:

(*) BDI_CAP_NO_ACCT_DIRTY

    Set if the pages associated with this backing device should not be tracked by the dirty page accounting.

(*) BDI_CAP_NO_WRITEBACK

    Set if dirty pages associated with this backing device should not have writepage() or writepages() invoked upon them to clean them.

(*) Capability markers that indicate what a backing device is capable of with regard to memory mapping facilities. These flags indicate whether a device can be mapped directly, whether it can be copied for a mapping, and whether direct mappings can be read, written and/or executed. This information is primarily aimed at improving no-MMU private mapping support.

The patch also provides convenience functions for determining the dirty-page capabilities available on backing devices directly or on the backing devices associated with a mapping. These are provided to keep line length down when checking for the capabilities.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
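A sketch of the dirty-page capability flags and the convenience helpers the message describes; the helper names follow the patch, but the flag values are assumptions:

        /* flag values are illustrative */
        #define BDI_CAP_NO_ACCT_DIRTY   0x00000001      /* skip dirty-page accounting */
        #define BDI_CAP_NO_WRITEBACK    0x00000002      /* never writepage() these pages */

        static inline int bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
        {
                return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK);
        }

        static inline int bdi_cap_account_dirty(struct backing_dev_info *bdi)
        {
                return !(bdi->capabilities & BDI_CAP_NO_ACCT_DIRTY);
        }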
2004-05-14  [PATCH] Add blk_run_page()  (Andrew Morton)
From: Andrea Arcangeli <andrea@suse.de>
From: Jens Axboe

Add blk_run_page() API. This is so that we can pass the target page all the way down to (for example) the swap unplug function. So swap can work out which blockdevs back this particular page.
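One plausible shape for the new entry point, assuming the per-BDI unplug hook described in the next entry; the body here is a guess, not the verified patch:

        /* sketch: hand the target page to its backing device's unplug hook */
        void blk_run_page(struct page *page)
        {
                blk_run_backing_dev(page->mapping->backing_dev_info, page);
        }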
2004-04-12  [PATCH] per-backing dev unplugging  (Andrew Morton)
From: Jens Axboe <axboe@suse.de>, Chris Mason, me, others.

The global unplug list causes horrid spinlock contention on many-disk many-CPU setups - throughput is worse than halved.

The other problem with the global unplugging is of course that it will cause the unplugging of queues which are unrelated to the I/O upon which the caller is about to wait.

So what we do to solve these problems is to remove the global unplug and set up the infrastructure under which the VFS can tell the block layer to unplug only those queues which are relevant to the page or buffer_head which is about to be waited upon. We do this via the very appropriate address_space->backing_dev_info structure.

Most of the complexity is in devicemapper, MD and swapper_space, because for these backing devices, multiple queues may need to be unplugged to complete a page/buffer I/O. In each case we ensure that data structures are in place to permit us to identify all the lower-level queues which contribute to the higher-level backing_dev_info. Each contributing queue is told to unplug in response to a higher-level unplug.

To simplify things in various places we also introduce the concept of a "synchronous BIO": it is tagged with BIO_RW_SYNC. The block layer will perform an immediate unplug when it sees one of these go past.
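A sketch of the per-device unplug callout this infrastructure hangs off the backing_dev_info; other members elided, and the exact member names treated as approximate:

        struct backing_dev_info {
                unsigned long ra_pages;
                unsigned long state;
                void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
                void *unplug_io_data;
                /* other members elided */
        };

        static inline void blk_run_backing_dev(struct backing_dev_info *bdi,
                                               struct page *page)
        {
                /* DM/MD install a callout that unplugs every contributing queue */
                if (bdi && bdi->unplug_io_fn)
                        bdi->unplug_io_fn(bdi, page);
        }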
2004-04-12  [PATCH] Add queue congestion callout  (Andrew Morton)
From: Miquel van Smoorenburg <miquels@cistron.nl>

The VM and VFS use the address_space->backing_dev_info to track the realtime status of the device which backs the mapping. The read_congested and write_congested fields are used to determine whether a read or write against that device may block.

We use this infrastructure to:

a) allow pdflush to service many queues in parallel (by not getting stuck on any particular one);
b) avoid undesirable and uncontrolled latencies in places such as page reclaim;
c) avoid blocking in readahead operations.

The current code only supports simple disk queues (and I have a patch here for NFS). Stacked queues (MD and DM) don't get this information right and problems were expected. Efficiency problems have now been noted and it's time to fix it.

This patch lays down the infrastructure which permits the queue implementation to get control when someone at a higher level is querying the queue's congestion state. So DM (for example) can run around and examine all the queues which contribute to the higher-level queue.

It also adds bdi_rw_congested() for code in xfs and ext2 that calls both bdi_read_congested() and bdi_write_congested() in a row, and it was "free" anyway.
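The stacking callout takes the shape of a function pointer plus an opaque cookie in the backing_dev_info; higher-level congestion queries are forwarded downwards through it. A sketch, with the convenience helper from the message included:

        typedef int (congested_fn)(void *, int);

        struct backing_dev_info {
                unsigned long state;
                congested_fn *congested_fn;     /* set by DM/MD */
                void *congested_data;           /* opaque cookie for the callout */
                /* other members elided */
        };

        static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
        {
                if (bdi->congested_fn)
                        return bdi->congested_fn(bdi->congested_data, bdi_bits);
                return bdi->state & bdi_bits;
        }

        /* convenience for the xfs/ext2 callers mentioned above */
        static inline int bdi_rw_congested(struct backing_dev_info *bdi)
        {
                return bdi_congested(bdi, (1 << BDI_read_congested) |
                                          (1 << BDI_write_congested));
        }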
2003-03-17  [PATCH] remove unused block congestion code  (Andrew Morton)
Patch from: Hugh Dickins <hugh@veritas.com>

Removes a ton of dead code from ll_rw_blk.c. I don't expect we'll be using this now.
2002-11-21  [PATCH] Fix busy-wait with writeback to large queues  (Andrew Morton)
blk_congestion_wait() is a utility function which various callers use to throttle themselves to the rate at which the IO system can retire writes.

The current implementation refuses to wait if no queues are "congested" (>75% of requests are in flight). That doesn't work if the queue is so huge that it can hold more than 40% (dirty_ratio) of memory. The queue simply cannot enter congestion because the VM refuses to allow more than 40% of memory to be dirtied. (This spin could happen with a lot of normal-sized queues too.)

So this patch simply changes blk_congestion_wait() to throttle even if there are no congested queues. It will cause the caller to sleep until someone puts back a write request against any queue. (Nobody uses blk_congestion_wait for read congestion.)

The patch adds new state to backing_dev_info->state: a couple of flags which indicate whether there are _any_ reads or writes in flight against that queue. This was added to prevent blk_congestion_wait() from taking a nap when there are no writes at all in flight. But the "are there any reads" info could be used to defer background writeout from pdflush, to reduce read-vs-write competition. We'll see.

This matters because the large request queues have made a fundamental change: blocking in get_request_wait() has been the main form of VM throttling for years, but with large queues it doesn't work any more - all throttling happens in blk_congestion_wait().

Also, change io_schedule_timeout() to propagate the schedule_timeout() return value. I was using that in some debug code, but it should have been like that from day one.
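An era-appropriate sketch of the new behaviour: sleep until some write is retired against any queue (or the timeout lapses), even when nothing is marked congested. The waitqueue array is assumed per read/write direction:

        static wait_queue_head_t congestion_wqh[2];     /* [READ], [WRITE] */

        long blk_congestion_wait(int rw, long timeout)
        {
                DECLARE_WAITQUEUE(wait, current);
                wait_queue_head_t *wqh = &congestion_wqh[rw];

                set_current_state(TASK_UNINTERRUPTIBLE);
                add_wait_queue(wqh, &wait);
                timeout = io_schedule_timeout(timeout); /* propagates time left */
                set_current_state(TASK_RUNNING);
                remove_wait_queue(wqh, &wait);
                return timeout;
        }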
2002-09-22  [PATCH] infrastructure for monitoring queue congestion state  (Andrew Morton)
The patch provides a means for the VM to be able to determine whether a request queue is in a "congested" state. If it is congested, then a write to (or read from) the queue may cause blockage in get_request_wait(). So the VM can do:

        if (!bdi_write_congested(page->mapping->backing_dev_info))
                writepage(page);

This is not exact. The code assumes that if the request queue still has 1/4 of its capacity (queue_nr_requests) available then a request will be non-blocking. There is a small chance that another CPU could zoom in and consume those requests. But on the rare occasions where that may happen the result will merely be some unexpected latency - it's not worth doing anything elaborate to prevent this.

The patch decreases the size of `batch_requests'. batch_requests is positively harmful - when a "heavy" writer and a "light" writer are both writing to the same queue, batch_requests provides a means for the heavy writer to massively stall the light writer. Instead of waiting for one or two requests to come free, the light writer has to wait for 32 requests to complete. Plus batch_requests generally makes things harder to tune, understand and predict. I wanted to kill it altogether, but Jens says that it is important for some hardware - it allows decent size requests to be submitted.

The VM changes which go along with this code cause batch_requests to be not so painful anyway - the only processes which sleep in get_request_wait() are the ones which we elect, by design, to wait in there - typically heavy writers.

The patch changes the meaning of `queue_nr_requests'. It used to mean "total number of requests per queue". Half of these are for reads, and half are for writes. This always confused the heck out of me, and the code needs to divide queue_nr_requests by two all over the place. So queue_nr_requests now means "the number of write requests per queue" and "the number of read requests per queue". ie: I halved it. Also, queue_nr_requests was converted to static scope. Nothing else uses it.

The accuracy of bdi_read_congested() and bdi_write_congested() depends upon the accuracy of mapping->backing_dev_info. With complex block stacking arrangements it is possible that ->backing_dev_info is pointing at the wrong queue. I don't know. But the cost of getting this wrong is merely latency, and if it is a problem we can fix it up in the block layer, by getting stacking devices to communicate their congestion state upwards in some manner.
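The query the VM performs is a cheap bit test against the backing_dev_info state word; the threshold policy itself lives in the block layer. A sketch, with the bit and helper names treated as approximate:

        enum bdi_state {
                BDI_read_congested,     /* more reads may block in get_request_wait() */
                BDI_write_congested,    /* more writes may block */
        };

        static inline int bdi_read_congested(struct backing_dev_info *bdi)
        {
                return test_bit(BDI_read_congested, &bdi->state);
        }

        static inline int bdi_write_congested(struct backing_dev_info *bdi)
        {
                return test_bit(BDI_write_congested, &bdi->state);
        }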
2002-09-09  [PATCH] exact dirty state accounting  (Andrew Morton)
Some adjustments to global dirty page accounting.

Previously, dirty page accounting counted all dirty pages - even dirty anonymous pages. This has potential to upset the throttling logic in balance_dirty_pages(). Particularly as I suspect we should decrease the dirty memory writeback thresholds by a lot.

So this patch changes it so that we only account for dirty pagecache pages which have backing store. Not anonymous pages, not swapcache, not in-memory filesystem pages.

To support this, the `memory_backed' boolean has been added to struct backing_dev_info. When an address space's backing device is marked as memory-backed, the core kernel knows to not include that mapping's pages in the dirty memory accounting.

For memory-backed mappings, dirtiness is a way of pinning the page, and there's nothing the kernel can do to clean the page to make it freeable. driverfs, tmpfs, and ramfs have been converted to mark their mappings as memory-backed.

The ramdisk driver hasn't been converted. I have a separate patch for ramdisk, which fails to fix the longstanding problems in there :(

With this patch, /bin/sync now sends /proc/meminfo:Dirty to zero, which is rather comforting.
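A sketch of how a memory-backed filesystem opts its mappings out of dirty accounting, using ramfs (one of the converted filesystems) as the example; the initializer shape is assumed:

        /* dirtiness only pins these pages; don't count them as Dirty */
        static struct backing_dev_info ramfs_backing_dev_info = {
                .ra_pages       = 0,    /* readahead is pointless: pages live in RAM */
                .memory_backed  = 1,    /* skip dirty memory accounting */
        };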
2002-05-19  [PATCH] pdflush exclusion infrastructure  (Andrew Morton)
Collision avoidance for pdflush threads.

Turns the request_queue-based `unsigned long ra_pages' into a structure which contains ra_pages as well as a longword. That longword is used to record the fact that a pdflush thread is currently writing something back against this request_queue. Avoids the situation where several pdflush threads are sleeping on the same request_queue.

This patch provides only the infrastructure for the pdflush exclusion. This infrastructure gets used in pdflush-single.patch.
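A sketch of the exclusion longword: one bit per backing device, claimed with test_and_set_bit() so only one pdflush thread writes back a given device at a time. Member and helper names are treated as approximate:

        struct backing_dev_info {
                unsigned long ra_pages; /* was the whole structure before this patch */
                unsigned long state;    /* the new longword; BDI_pdflush lives here */
        };

        enum bdi_state {
                BDI_pdflush,            /* a pdflush thread is working this device */
        };

        static inline int writeback_in_progress(struct backing_dev_info *bdi)
        {
                return test_bit(BDI_pdflush, &bdi->state);
        }

        /* a pdflush thread claims the device before writing it back */
        static inline int writeback_acquire(struct backing_dev_info *bdi)
        {
                return !test_and_set_bit(BDI_pdflush, &bdi->state);
        }

        static inline void writeback_release(struct backing_dev_info *bdi)
        {
                clear_bit(BDI_pdflush, &bdi->state);
        }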