path: root/include/linux/writeback.h
2007-05-21  Detach sched.h from mm.h  (Alexey Dobriyan)
The first thing mm.h does is include sched.h, solely for the can_do_mlock() inline function, which dereferences "current". By dealing with can_do_mlock(), mm.h can be detached from sched.h, which is good. See below for why. This patch
a) removes the unconditional inclusion of sched.h from mm.h
b) makes can_do_mlock() a normal function in mm/mlock.c
c) exports can_do_mlock() so compilation doesn't break
d) adds sched.h inclusions back to files that were getting it indirectly
e) adds less bloated headers (asm/signal.h, jiffies.h) to some files that were getting them indirectly
Net result:
a) mm.h users get less code to open, read, preprocess, parse, ... if they don't need sched.h
b) sched.h stops being a dependency for a significant number of files: on x86_64 allmodconfig, touching sched.h results in a recompile of 4083 files; after the patch it's only 3744 (-8.3%).
Cross-compile tested on all arm defconfigs, all mips defconfigs, all powerpc defconfigs, alpha alpha-up arm i386 i386-up i386-defconfig i386-allnoconfig ia64 ia64-up m68k mips parisc parisc-up powerpc powerpc-up s390 s390-up sparc sparc-up sparc64 sparc64-up um-x86_64 x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig, as well as my two usual configs.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
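A minimal sketch of the out-of-line form described in (b) and (c); the body follows the usual in-tree logic of this era and is illustrative, not a verbatim copy of the commit:

    /* mm/mlock.c -- sketch: can_do_mlock() as a normal, exported function.
     * The "current" dereference below is what forced mm.h to pull in
     * sched.h while this was a header inline. */
    int can_do_mlock(void)
    {
            if (capable(CAP_IPC_LOCK))
                    return 1;
            if (current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur != 0)
                    return 1;
            return 0;
    }
    EXPORT_SYMBOL(can_do_mlock);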
2007-05-11  consolidate generic_writepages and mpage_writepages  (Miklos Szeredi)
Clean up the massive code duplication between mpage_writepages() and generic_writepages(). The new generic function, write_cache_pages(), takes a function pointer argument which will be called for each page to be written. Maybe cifs_writepages() can use this infrastructure too, but I'm not touching that with a ten-foot pole. The upcoming page writeback support in fuse will also want this.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
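The consolidated helper takes roughly this shape (a sketch based on the description above; __writepage is the internal adapter generic_writepages() uses):

    /* include/linux/writeback.h -- sketch of the new shared iterator */
    typedef int (*writepage_t)(struct page *page,
                               struct writeback_control *wbc, void *data);

    int write_cache_pages(struct address_space *mapping,
                          struct writeback_control *wbc,
                          writepage_t writepage, void *data);

    /* generic_writepages() then reduces to a thin wrapper, roughly: */
    int generic_writepages(struct address_space *mapping,
                           struct writeback_control *wbc)
    {
            /* __writepage() adapts mapping->a_ops->writepage to writepage_t */
            return write_cache_pages(mapping, wbc, __writepage, mapping);
    }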
2007-04-30  NFS: Fix a race when doing NFS write coalescing  (Trond Myklebust)
Currently we do write coalescing in a very inefficient manner: one pass in generic_writepages() in order to lock the pages for writing, then one pass in nfs_flush_mapping() and/or nfs_sync_mapping_wait() in order to gather the locked pages for coalescing into RPC requests of size "wsize".
In fact, it turns out there is actually a deadlock possible here, since we only start I/O on the second pass. If the user signals the process while we're in nfs_sync_mapping_wait(), for instance, then we may exit before starting I/O on all the requests that have been queued up.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2007-03-01  [PATCH] throttle_vm_writeout(): don't loop on GFP_NOFS and GFP_NOIO allocations  (Andrew Morton)
throttle_vm_writeout() is designed to wait for the dirty levels to subside. But if the caller holds IO or FS locks, we might be holding up that writeout. So change it to take a single nap to give other devices a chance to clean some memory, then return. Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Pete Zaitcev <zaitcev@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2006-10-20  [PATCH] separate bdi congestion functions from queue congestion functions  (Andrew Morton)
Separate out the concept of "queue congestion" from "backing-dev congestion". Congestion is a backing-dev concept, not a queue concept. The blk_* congestion functions are retained, as wrappers around the core backing-dev congestion functions. This proper layering is needed so that NFS can cleanly use the congestion functions, and so that CONFIG_BLOCK=n actually links. Cc: "Thomas Maier" <balagi@justmail.de> Cc: "Jens Axboe" <jens.axboe@oracle.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: David Howells <dhowells@redhat.com> Cc: Peter Osterlund <petero2@telia.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
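In sketch form, the core predicates now take a backing_dev_info, and the retained blk_* entry points only need the queue's embedded bdi (the bdi_* names follow the in-tree form; the wrapper below is a hypothetical illustration of the layering, not a verbatim API):

    /* backing-dev congestion predicates (sketch): */
    static inline int bdi_read_congested(struct backing_dev_info *bdi)
    {
            return bdi_congested(bdi, 1 << BDI_read_congested);
    }

    static inline int bdi_write_congested(struct backing_dev_info *bdi)
    {
            return bdi_congested(bdi, 1 << BDI_write_congested);
    }

    /* A block-layer wrapper then just applies these to the queue's bdi
     * (illustrative name): */
    static inline int my_queue_congested(struct request_queue *q, int rw)
    {
            return rw == WRITE ? bdi_write_congested(&q->backing_dev_info)
                               : bdi_read_congested(&q->backing_dev_info);
    }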
2006-10-03  fix file specification in comments  (Uwe Zeisberger)
Many files include the filename at the beginning; several used a wrong one.
Signed-off-by: Uwe Zeisberger <Uwe_Zeisberger@digi.com> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-09-30  [PATCH] BLOCK: Dissociate generic_writepages() from mpage stuff [try #6]  (David Howells)
Dissociate the generic_writepages() function from the mpage stuff, moving its declaration to linux/mm.h and actually emitting a full implementation into mm/page-writeback.c. The implementation is a partial duplicate of mpage_writepages() with all BIO references removed. It is used by NFS to do writeback. Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-29  [PATCH] call mm/page-writeback.c:set_ratelimit() when new pages are hot-added  (Chandra Seetharaman)
ratelimit_pages in page-writeback.c is recalculated (in set_ratelimit()) every time a CPU is hot-added/removed. But this value is not recalculated when new pages are hot-added. This patch fixes that problem by calling set_ratelimit() when new pages are hot-added. [akpm@osdl.org: cleanups] Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26  [PATCH] mm: balance dirty pages  (Peter Zijlstra)
Now that we can detect writers of shared mappings, throttle them. Avoids OOM by surprise. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-22  Add a real API for dealing with blk_congestion_wait()  (Trond Myklebust)
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2006-06-23  [PATCH] writeback: fix range handling  (OGAWA Hirofumi)
When a writeback_control's `start' and `end' fields are used to indicate a one-byte range starting at file offset zero, the required values of .start=0, .end=0 give the ->writepages() implementation no way of telling that it is being asked to perform a range request, because (start == 0 && end == 0) is currently overloaded to mean "this is not a write-a-range request".
To make all this sane, the patch changes the range handling of writeback_control, so that a caller invoking ->writepages() always sets the range (range_start/range_end, or range_cyclic). If range_cyclic is true, ->writepages() treats the range as cyclic; otherwise it just uses range_start and range_end.
This patch does:
- Add LLONG_MAX, LLONG_MIN, ULLONG_MAX to include/linux/kernel.h. -1 is usually OK for range_end (its type is long long), but if someone did
      range_end += val;             /* range_end becomes "val - 1"     */
      u64val = range_end >> bits;   /* u64val becomes ~0ULL or similar */
  they would be wrong. So this adds LLONG_MAX to avoid such nasty things, and uses LLONG_MAX for range_end.
- Make all callers of ->writepages() set range_start/end or range_cyclic.
- Fix the updates of ->writeback_index, which already looked a bit strange: if writeback starts at 0 and is ended by the nr_to_write check, the final index may reduce the chance of scanning the end of the file. So update ->writeback_index only if range_cyclic is true or the whole file was scanned.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Nathan Scott <nathans@sgi.com> Cc: Anton Altaparmakov <aia21@cantab.net> Cc: Steven French <sfrench@us.ibm.com> Cc: "Vladimir V. Saveliev" <vs@namesys.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
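After this change, the relevant writeback_control fields and a whole-file call look roughly like this (sketch; other fields omitted):

    struct writeback_control {
            /* ... other fields omitted ... */
            loff_t range_start;        /* first byte to write */
            loff_t range_end;          /* last byte; LLONG_MAX = to end of file */
            unsigned range_cyclic:1;   /* treat the range as cyclic, resuming
                                        * from mapping->writeback_index */
    };

    /* A caller asking for whole-file, data-integrity writeback: */
    struct writeback_control wbc = {
            .sync_mode   = WB_SYNC_ALL,
            .range_start = 0,
            .range_end   = LLONG_MAX,  /* rather than the ambiguous 0 */
    };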
2006-03-24  [PATCH] balance_dirty_pages_ratelimited: take nr_pages arg  (Andrew Morton)
Modify balance_dirty_pages_ratelimited() so that it can take a number-of-pages-which-I-just-dirtied argument. For msync(). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
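Sketch of the resulting pair, with the old name kept as a one-page wrapper (the _nr name follows the in-tree form of this era):

    void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
                                            unsigned long nr_pages_dirtied);

    static inline void
    balance_dirty_pages_ratelimited(struct address_space *mapping)
    {
            balance_dirty_pages_ratelimited_nr(mapping, 1);
    }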
2006-03-24  [PATCH] Represent dirty_*_centisecs as jiffies internally  (Bart Samwel)
Store the internal values for
    /proc/sys/vm/dirty_writeback_centisecs
    /proc/sys/vm/dirty_expire_centisecs
as jiffies instead of centiseconds. Let the sysctl interface do the conversions with full precision using clock_t_to_jiffies, instead of doing overflow-sensitive on-the-fly conversions every time the values are used.
Cons: apparent precision loss if HZ is not a multiple of 100, because of the conversion back and forth. This is a common problem for all sysctl values that use proc_dointvec_userhz_jiffies. (There is only one other in-tree use, in net/core/neighbour.c.)
Signed-off-by: Bart Samwel <bart@samwel.tk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
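In effect, the conversion now happens once at the sysctl boundary (sketch; since USER_HZ is 100, the clock_t units here are centiseconds; centisecs_from_user/centisecs_to_user are illustrative names):

    /* Stored internally in jiffies: */
    int dirty_writeback_interval;

    /* On write through /proc, proc_dointvec_userhz_jiffies() in effect does: */
    dirty_writeback_interval = clock_t_to_jiffies(centisecs_from_user);
    /* ...and on read: */
    centisecs_to_user = jiffies_to_clock_t(dirty_writeback_interval);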
2006-01-08  [PATCH] export/change sync_page_range/_nolock()  (OGAWA Hirofumi)
This exports and changes sync_page_range/_nolock(): fatfs needs sync_page_range/_nolock() for expanding truncate, and the "size_t count" argument is changed to "loff_t count".
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06  identify multipage ->writepages() calls  (Andrew Morton)
NFS needs to be able to distinguish between single-page ->writepage() calls and multipage ->writepages() calls. For the single-page writepage calls NFS can kick off the I/O within the context of ->writepage(). For multipage ->writepages calls, nfs_writepage() will leave the I/O pending and nfs_writepages() will kick off the I/O when it all has been queued up within NFS. Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
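A sketch of how the distinction can be plumbed through do_writepages() via a writeback_control flag (the for_writepages name follows the in-tree form of this era; details may differ from the exact commit):

    int do_writepages(struct address_space *mapping,
                      struct writeback_control *wbc)
    {
            int ret;

            wbc->for_writepages = 1;   /* multipage call: NFS defers the I/O kick */
            if (mapping->a_ops->writepages)
                    ret = mapping->a_ops->writepages(mapping, wbc);
            else
                    ret = generic_writepages(mapping, wbc);
            wbc->for_writepages = 0;
            return ret;
    }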
2006-01-03  [PATCH] add AOP_TRUNCATED_PAGE, prepend AOP_ to WRITEPAGE_ACTIVATE  (Zach Brown)
readpage(), prepare_write(), and commit_write() callers are updated to understand the special return code AOP_TRUNCATED_PAGE, in the style of writepage() and WRITEPAGE_ACTIVATE. AOP_TRUNCATED_PAGE tells the caller that the callee has unlocked the page and that the operation should be tried again with a new page. OCFS2 uses this to detect and work around a lock inversion in its aop methods.
There should be no change in behaviour for methods that don't return AOP_TRUNCATED_PAGE. WRITEPAGE_ACTIVATE is also prepended with AOP_ for consistency, and both are made enums so that kerneldoc can be used to document their semantics.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
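The constants become an enum (values per the in-tree definition of this era), and callers retry with a fresh page, roughly:

    enum positive_aop_returns {
            AOP_WRITEPAGE_ACTIVATE = 0x80000,
            AOP_TRUNCATED_PAGE     = 0x80001,
    };

    /* Caller-side retry pattern (sketch; error handling elided): */
    retry:
            status = a_ops->prepare_write(file, page, offset, offset + bytes);
            if (status == AOP_TRUNCATED_PAGE) {
                    /* callee unlocked the page; drop it and try a new one */
                    page_cache_release(page);
                    page = grab_cache_page(mapping, index);
                    goto retry;
            }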
2005-09-10  [PATCH] mm/filemap.c: make two functions static  (Adrian Bunk)
With Nick Piggin <npiggin@suse.de>: give some things static scope.
Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-28  [PATCH] rename wakeup_bdflush to wakeup_pdflush  (Pekka J Enberg)
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-27  [PATCH] Update cfq io scheduler to time sliced design  (Jens Axboe)
This updates the CFQ io scheduler to the new time sliced design (cfq v3). It provides full process fairness, while giving excellent aggregate system throughput even for many competing processes. It supports io priorities, either inherited from the cpu nice value or set directly with the ioprio_get/set syscalls. The latter closely mimic set/getpriority. This import is based on my latest from -mm. Signed-off-by: Jens Axboe <axboe@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
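Userspace usage mirrors set/getpriority; a sketch using the ioprio ABI constants (values as later documented; glibc shipped no wrapper at the time, hence raw syscall(2)):

    #include <sys/syscall.h>
    #include <unistd.h>

    #define IOPRIO_CLASS_SHIFT  13
    #define IOPRIO_CLASS_BE     2     /* best-effort class */
    #define IOPRIO_WHO_PROCESS  1

    int set_be_ioprio(int level)      /* level: 0 (highest) .. 7 (lowest) */
    {
            int ioprio = (IOPRIO_CLASS_BE << IOPRIO_CLASS_SHIFT) | level;
            /* pid 0 = calling process, as with setpriority() */
            return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, ioprio);
    }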
2005-03-07  [PATCH] vm: pageout throttling  (Marcelo Tosatti)
With silly pageout testcases it is possible to place huge amounts of memory under I/O. With a large request queue (CFQ uses 8192 requests) it is possible to place _all_ memory under I/O at the same time. This means that all memory is pinned and unreclaimable, and the VM gets upset and goes oom.
The patch limits the amount of memory which is under pageout writeout to be a little more than the amount of memory at which balance_dirty_pages() callers will synchronously throttle. This means that heavy pageout activity can starve heavy writeback activity completely, but heavy writeback activity will not cause starvation of pageout, because we don't want a simple `dd' to be causing excessive latencies in page reclaim.
Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-10  [PATCH] Fix O_SYNC speedup for generic_file_write_nolock  (Suparna Bhattacharya)
The O_SYNC speedup patches missed the generic_file_xxx_nolock cases, which means that pages weren't actually getting sync'ed in those cases. This patch fixes that. Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-18  [PATCH] eliminate inode waitqueue hashtable  (William Lee Irwin III)
Eliminate the inode waitqueue hashtable, using bit_waitqueue() via wait_on_bit() and wake_up_bit() to locate the waitqueue head associated with a bit.
Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
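Sketch of the replacement: waiters locate a shared waitqueue head from the (word, bit) pair instead of a per-inode queue. Names follow the later in-tree form and may differ slightly from this commit; the action callback just schedules:

    static int inode_wait(void *word)
    {
            schedule();
            return 0;
    }

    /* Waiting for I_LOCK to clear, with no inode waitqueue hashtable: */
    wait_on_bit(&inode->i_state, __I_LOCK, inode_wait, TASK_UNINTERRUPTIBLE);

    /* And the wake side, after clearing the bit: */
    wake_up_bit(&inode->i_state, __I_LOCK);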
2004-08-26  [PATCH] Add a few might_sleep() checks  (Ingo Molnar)
Add a whole bunch more might_sleep() checks. We also enable might_sleep() checking in copy_*_user(). This was non-trivial because the "copy_*_user() in atomic regions" trick would generate false positives. Fix that up by adding a new __copy_*_user_inatomic(), which avoids the might_sleep() check. Only i386 is supported in this patch.
With: Arjan van de Ven <arjanv@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-23  Don't use signed one-bit bitfields.  (Linus Torvalds)
We assign 0 and 1 to it, but since it's signed, that's actually already overflowing the poor thing. So make it unsigned, which is what it really was supposed to be in the first place.
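A standalone illustration of the overflow (standard C, not from the commit):

    #include <stdio.h>

    struct flags {
            int bad:1;        /* signed one-bit field: representable values
                               * are 0 and -1, so assigning 1 overflows it */
            unsigned good:1;  /* unsigned: holds 0 and 1 as intended */
    };

    int main(void)
    {
            struct flags f = { .bad = 1, .good = 1 };
            /* typically prints "bad=-1 good=1" */
            printf("bad=%d good=%u\n", f.bad, f.good);
            return 0;
    }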
2004-08-22  [PATCH] Concurrent O_SYNC write support  (Andrew Morton)
In databases it is common to have multiple threads or processes performing O_SYNC writes against different parts of the same file. Our performance at this is poor, because each writer blocks access to the file by waiting on I/O completion while holding i_sem: everything is serialised.
The patch improves things by moving the writing and waiting outside i_sem, so other threads can get in and submit their I/O and permit the disk scheduler to optimise the I/O patterns better. Also, the O_SYNC writer now writes and waits only on the pages which he wrote, rather than writing and waiting on all dirty pages in the file.
The reason we haven't been able to do this before is that the required walk of the address_space page lists is easily livelockable without the i_sem serialisation. But in this patch we perform the waiting via a radix-tree walk of the affected pages. This cannot be livelocked.
The sync of the inode's metadata is still performed inside i_sem, because it is list-based and hence still livelockable. However, it is usually the case that databases are overwriting existing file blocks and there will be no dirty buffers attached to the address_space anyway.
The code is careful to ensure that the I/O for the pages and the I/O for the metadata are nonblockingly scheduled at the same time. This is an improvement over the current code, which will issue two separate write-and-wait cycles: one for metadata, one for pages.
Note from Suparna: reworked to use the tagged radix-tree based writeback infrastructure.
Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22  [PATCH] Writeback page range hint  (Andrew Morton)
Modify mpage_writepages to optionally write back only the dirty pages within a specified range in a file (as in the case of O_SYNC). Cheat a little to avoid changes to the prototypes of aops: just put the <start, end> hint into the writeback_control struct instead. If <start, end> are not set, default to writing back all of the mapping's dirty pages.
Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-07  Make sysctl pass the pos pointer around properly.  (Linus Torvalds)
Nobody ever fixed the big FIXME in sysctl - but we really need to pass around the proper "loff_t *" to all the sysctl functions if we want them to be well-behaved wrt the file pointer position. This is all preparation for making direct f_pos accesses go away.
2004-04-11  [PATCH] laptop mode  (Andrew Morton)
From: Bart Samwel <bart@samwel.tk>
Adds /proc/sys/vm/laptop-mode: a special knob which says "this is a laptop". In this mode the kernel will attempt to avoid spinning disks up.
Algorithm: the idea is to hold dirty data in memory for a long time, but to flush everything which has been accumulated if the disk happens to spin up for other reasons.
- Whenever a disk request completes (read or write), schedule a timer a few seconds hence. If the timer was already pending, reset it to a few seconds hence.
- When the timer expires, write back the whole world. We use sync_filesystems() for this because it will force ext3 journal commits as well.
- In balance_dirty_pages(), kick off background writeback when we hit the high threshold (dirty_ratio), not when we hit the low threshold. This has the effect of causing "lumpy" writeback, which is something I spent a year fixing, but in laptop mode it is desirable.
- In try_to_free_pages(), only kick pdflush if the VM is getting into distress: we want to keep scanning for clean pages, deferring writeback.
- In page reclaim, avoid writing back the odd random dirty page off the LRU: only start I/O if the scanning is working harder.
The effect is to perform a sync() a few seconds after all I/O has ceased. The value which was written into /proc/sys/vm/laptop-mode determines, in seconds, the delay between the final I/O and the flush.
Additionally, the patch adds tools which help answer the question "why the heck does my disk spin up all the time?". The user may set /proc/sys/vm/block_dump to a non-zero value, and the kernel will print out information which will identify the process which is performing disk reads or which is dirtying pagecache. The user should probably disable syslogd before setting block_dump.
2004-04-11  [PATCH] don't allow background writes to hide dirty buffers  (Andrew Morton)
If pdflush hits a locked-and-clean buffer in __block_write_full_page(), it will just pass over the buffer. Typically the buffer is an ext3 data=ordered buffer which is being written by kjournald, but a similar thing can happen with blockdev buffers and ll_rw_block().
This is bad because the buffer is still under I/O and a subsequent fsync's fdatawait() needs to know about it. It is not practical to tag the page for writeback: only the submitter of the I/O can do that, because the submitter has control of the end_io handler. So instead, redirty the page so that a subsequent fsync's fdatawrite() will wait on the underway I/O.
There is a risk that pdflush::background_writeout() will lock up, repeatedly trying and failing to write the same page. This is prevented by ensuring that background_writeout() always throttles when it makes no progress.
2003-09-21  [PATCH] real-time enhanced page allocator and throttling  (Andrew Morton)
From: Robert Love <rml@tech9.net>
- Let real-time tasks dip further into the reserves than usual in __alloc_pages(). There are a lot of ways to special case this. This patch just cuts z->pages_low in half, before doing the incremental min thing, for real-time tasks. I do not do anything in the low memory slow path. We can be a _lot_ more aggressive if we want. Right now, we just give real-time tasks a little help.
- Never ever call balance_dirty_pages() on a real-time task. Where and how exactly we handle this is up for debate. We could, for example, special case real-time tasks inside balance_dirty_pages(). This would allow us to perform some of the work (say, waking up pdflush) but not other work (say, the active throttling). As it stands now, we do the per-processor accounting in balance_dirty_pages_ratelimited() but we never call balance_dirty_pages(). Lots of approaches work. What we want to do is never engage the real-time task in forced writeback.
2003-09-08  [PATCH] sparse fix sysctl  (Andries E. Brouwer)
2003-06-02  [PATCH] dirty_writeback_centisecs fixes  (Andrew Morton)
This /proc tunable sets the kupdate interval. It has a couple of problems:
- There is no way to turn it off completely (userspace dirty-memory management solutions require this).
- If it has been set to one hour and the user then resets it to five seconds, the new setting will not take effect for up to an hour.
Fix that up by providing a sysctl handler. Setting the tunable to zero now disables the kupdate function.
2003-03-16  [PATCH] Early writeback initialisation  (Andrew Morton)
Patch from Anders Gustafsson <andersg@0x63.nu>
We're getting a division-by-zero in the writeback code during early rootfs population, because writeback has not yet been initialised. Fix that by performing an explicit initialisation rather than relying on initcall ordering.
2002-12-14  [PATCH] remove PF_SYNC  (Andrew Morton)
current->flags:PF_SYNC was a hack I added because I didn't want to change all ->writepage implementations. It's foul. And it means that if someone happens to run direct page reclaim within the context of (say) sys_sync, the writepage invocations from the VM will be treated as "data integrity" operations, not "memory cleansing" operations, which would cause latency.
So the patch removes PF_SYNC and adds an extra arg to a_ops->writepage: the `writeback_control' structure, which contains the full context information about why writepage was called. The initial version of this patch just passed in a bare `int sync', but the XFS team need more info so they can perform writearound from within page reclaim.
The patch also adds writeback_control.for_reclaim, so writepage implementations can inspect that to work out the call context rather than peeking at current->flags:PF_MEMALLOC.
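The a_op gains the context argument described above; a hypothetical filesystem (foo_writepage is illustrative) can then distinguish the call contexts directly. A sketch:

    /* New method signature: */
    int (*writepage)(struct page *page, struct writeback_control *wbc);

    static int foo_writepage(struct page *page, struct writeback_control *wbc)
    {
            if (wbc->for_reclaim) {
                    /* called from page reclaim: memory cleansing;
                     * e.g. XFS can choose to do writearound here */
            }
            if (wbc->sync_mode == WB_SYNC_ALL) {
                    /* data-integrity writeback: must not skip the page */
            }
            /* ... actual writeout ... */
            return 0;
    }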
2002-12-14  [PATCH] fs-writeback rework.  (Andrew Morton)
I've revisited all the superblock->inode->page writeback paths. There were several silly things in there, and things were not as clear as they could be.
Scenario 1: create and dirty a MAP_SHARED segment over a sparse file, then exit. All the memory turns into dirty pagecache, but the kupdate function only writes it out at a trickle: 4 megabytes every thirty seconds. We should sync it all within 30 seconds.
What's happening is that when writeback tries to write those pages, the filesystem needs to instantiate new blocks for them (they're over holes), and the filesystem runs mark_inode_dirty() within the writeback function. This redirtying of the inode while we're writing it out triggers some livelock-avoidance code in __sync_single_inode(). That function says "ah, someone redirtied the file while I was writing it. Let's move the file to the new end of the superblock dirty list and write it out later." Problem is, writeback dirtied the inode itself. (It is rather silly that mark_inode_dirty() sets I_DIRTY_PAGES when clearly no pages have been dirtied. Fixing that up would be a largish work, so work around it here.)
So this patch just removes the livelock avoidance from __sync_single_inode(). It is no longer needed anyway: writeback livelock is now avoided (in all writeback paths) by writing a finite number of pages.
Scenario 2: an application is continuously dirtying a 200 megabyte file, and your disk has a bandwidth of less than 40 megabytes/sec. What happens is that once 30 seconds passes, pdflush starts writing out the file. And because that writeout will take more than five seconds (a `kupdate' interval), pdflush just keeps writing it out forever: continuous I/O. What we _want_ to happen is that the 200 megabytes gets written, and then I/O stops for thirty seconds (minus the writeout period). So the file is fully synced every thirty seconds.
The patch solves this by using mapping->io_pages more intelligently. When the time comes to write the file out, move all the dirty pages onto io_pages. That is the "batch of pages for this kupdate round". When io_pages is empty, we know we're done.
The address_space_operations.writepages() API is changed! It now only needs to write the pages which the caller placed on mapping->io_pages. This conceptually cleans things up a bit, by more clearly defining the role of ->io_pages and the motion between the various mapping lists. The treatment of sb->s_dirty and sb->s_io is now conceptually identical to mapping->dirty_pages and mapping->io_pages: move the items to be written onto ->s_io/io_pages, and walk that list. As inodes (or pages) are written, move them over to the clean/locked/dirty lists.
Oh, scenario 3: start an app which continuously overwrites a 5 meg file. Wait five seconds, start another, wait 5 seconds, start another. What we _should_ see is three 5-meg writes, five seconds apart, every thirty seconds. That did all sorts of odd things. It now does the right thing.
2002-12-14  [PATCH] Remove fail_writepage, redux  (Andrew Morton)
fail_writepage() does not work. Its activate_page() call cannot activate the page because it is not on the LRU. So perform that function (more efficiently) in the VM. Remove fail_writepage() and, if the filesystem does not implement ->writepage(), activate the page from shrink_list().
A special case is tmpfs, which does have a writepage, but which sometimes wants to activate the pages anyway. The most important case is when there is no swap online and we don't want to keep all those pages on the inactive list. So, just as a tmpfs special case, allow writepage() to return WRITEPAGE_ACTIVATE, and handle that in the VM.
Also, the whole idea of allowing ->writepage() to return -EAGAIN, with the caller handling it, has been reverted. If a writepage() implementation wants to back out and not write the page, it must redirty the page, unlock it and return zero. (This is Hugh's preferred way.)
And remove the now-unneeded shmem_writepages(): shmem inodes are marked as `memory backed', so it will not be called. And remove the test for a non-null ->writepage() in generic_file_mmap(). Memory-backed files _are_ mmappable, and they do not have a writepage(). It just isn't called.
So the locking rules for writepage() are unchanged. They are:
- Called with the page locked
- Returns with the page unlocked
- Must redirty the page itself if it wasn't all written
But there is a new, special, hidden, undocumented, secret hack for tmpfs: writepage may return WRITEPAGE_ACTIVATE to tell the VM to move the page to the active list. The page must be kept locked in this one case.
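The back-out protocol and the tmpfs special case, in sketch form (foo_writepage and cannot_write_now are illustrative names):

    static int foo_writepage(struct page *page, struct writeback_control *wbc)
    {
            if (cannot_write_now(page)) {      /* hypothetical predicate */
                    set_page_dirty(page);      /* redirty the page...    */
                    unlock_page(page);         /* ...unlock it...        */
                    return 0;                  /* ...and return zero     */
            }
            /* tmpfs-only hack: keep the page LOCKED and tell the VM to
             * move it to the active list:
             *         return WRITEPAGE_ACTIVATE;
             */
            /* ... normal writeout path ... */
            return 0;
    }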
2002-10-12  [PATCH] rename /proc/sys/vm/dirty_async_ratio to dirty_ratio  (Andrew Morton)
Since /proc/sys/vm/dirty_sync_ratio went away, the name "dirty_async_ratio" makes no sense. So rename it to just /proc/sys/vm/dirty_ratio.
2002-09-22  [PATCH] low-latency page reclaim  (Andrew Morton)
Convert the VM to not wait on other people's dirty data.
- If we find a dirty page and its queue is not congested, do some writeback.
- If we find a dirty page and its queue _is_ congested, then just refile the page.
- If we find a PageWriteback page, then just refile the page.
- There is additional throttling for write(2) callers. Within generic_file_write(), record their backing queue in ->current. Within page reclaim, if this task encounters a page which is dirty or under writeback on this queue, block on it. This gives some more writer throttling and reduces the page-refiling frequency.
It's somewhat CPU expensive: under really heavy load we only get a 50% reclaim rate in pages coming off the tail of the LRU. This can be fixed by splitting the inactive list into reclaimable and non-reclaimable lists. But the CPU load isn't too bad, and latency is much, much more important in these situations.
Example: with `mem=512m', running 4 instances of `dbench 100', 2.5.34 took 35 minutes to compile a kernel. With this patch, it took three minutes, 45 seconds.
I haven't done swapcache or MAP_SHARED pages yet. If there's tons of dirty swapcache or mmap data around we still stall heavily in page reclaim. That's less important.
This patch also has a tweak for swapless machines: don't even bother bringing anon pages onto the inactive list if there is no swap online.
2002-09-22  [PATCH] use the congestion APIs in pdflush  (Andrew Morton)
The key concept here is that pdflush does not block on request queues any more. Instead, it circulates across the queues, keeping any non-congested queues full of write data. When all queues are full, pdflush takes a nap, to be woken when *any* queue exits write congestion. This code can keep sixty spindles saturated: we've never been able to do that before.
- Add the `nonblocking' flag to struct writeback_control, and teach the writeback paths to honour it.
- Add the `encountered_congestion' flag to struct writeback_control, and teach the writeback paths to set it. So as soon as a mapping's backing_dev_info indicates that it is getting congested, bale out of writeback. And don't even start writeback against filesystems whose queues are congested.
- Convert pdflush's background_writeback() function to use nonblocking writeback. This way, a single pdflush thread will circulate around all the dirty queues, keeping them filled.
- Convert the pdflush `kupdate' function to do the same thing.
This solves the problem of pdflush thread pool exhaustion. It solves the problem of pdflush startup latency. It solves the (minor) problem wherein `kupdate' writeback only writes back a single disk at a time (it was getting blocked on each queue in turn). It probably means that we only ever need a single pdflush thread.
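Condensed sketch of the resulting nonblocking background loop (shape per the description above; MAX_WRITEBACK_PAGES is the per-chunk batch size):

    struct writeback_control wbc = {
            .sync_mode   = WB_SYNC_NONE,
            .nonblocking = 1,               /* never block on a request queue */
    };

    for (;;) {
            wbc.encountered_congestion = 0;
            wbc.nr_to_write = MAX_WRITEBACK_PAGES;
            writeback_inodes(&wbc);
            if (wbc.nr_to_write > 0) {      /* wrote less than asked for */
                    if (!wbc.encountered_congestion)
                            break;          /* out of dirty data: done */
                    /* nap until *any* queue exits write congestion */
                    blk_congestion_wait(WRITE, HZ/10);
            }
    }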
2002-09-19  [PATCH] remove /proc/sys/vm/dirty_sync_thresh  (Andrew Morton)
This was designed to be a really stern throttling threshold: if dirty memory reaches this level then perform writeback and actually wait on it.
It doesn't work, because memory dirtiers are required to perform writeback if the amount of dirty AND writeback memory exceeds dirty_async_ratio. So kill it, and rely just on the request queues being appropriately scaled to the machine size (they are). This is basically what 2.4 does.
2002-09-19  [PATCH] clean up argument passing in writeback paths  (Andrew Morton)
The writeback code paths which walk the superblocks and inodes are getting an increasing number of arguments passed to them. The patch wraps those args into the new `struct writeback_control' and uses that instead. There is no functional change.
The new writeback_control structure is passed down through the writeback paths in the place where the old `nr_to_write' pointer used to be. writeback_control will be used to pass new information up and down the writeback paths, such as whether the writeback should be non-blocking, and whether queue congestion was encountered.
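Sketch of the structure as introduced here, with fields per the description; the entries above in this log (this page is newest-first) add nonblocking/encountered_congestion and, much later, the range fields:

    struct writeback_control {
            struct backing_dev_info *bdi;   /* if !NULL, only write back
                                             * this queue */
            enum writeback_sync_modes sync_mode;
            unsigned long *older_than_this; /* if !NULL, only write back
                                             * inodes older than this */
            long nr_to_write;               /* write this many pages;
                                             * decremented per page written */
    };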
2002-08-30  [PATCH] writeback correctness and efficiency changes  (Andrew Morton)
This is a performance and correctness fix against the writeback paths. The writeback code has competing requirements. Sometimes it is used for "memory cleansing": kupdate, bdflush, writer throttling, page allocator writeback, etc. And sometimes this same code is used for data-integrity purposes: fsync, msync, fdatasync, sync, umount, and various other kernel-internal uses.
The problem is: how to handle a dirty buffer or page which is currently under writeback. For memory cleansing, we just want to skip that buffer/page and go on to the next one. But for sync, we must wait on the old writeback and then start new writeback. mpage_writepages() is currently correct for cleansing but incorrect for sync. block_write_full_page() is currently correct for sync but inefficient for cleansing.
The fix is fairly simple.
- In mpage_writepages(), don't skip the page if it's a sync operation.
- In block_write_full_page(), skip the buffer if it is a sync operation, and return -EAGAIN to tell the caller that the writeout didn't work out. The caller must then set the page dirty again and move it onto mapping->dirty_pages. This is an extension of the writepage API: writepage can now return -EAGAIN. There are only three callers, and they have been updated. fail_writepage() and ext3_writepage() were actually doing this by hand. They have been changed to return -EAGAIN. NTFS will want to be able to return -EAGAIN from its writepage as well.
- A sticky question is: how to tell the writeout code which mode it is operating in, cleansing or sync? It's such a tiny code change that I didn't have the heart to go and propagate a `mode' argument down every instance of writepages() and writepage() in the kernel. So I passed it in via current->flags.
Incidentally, the occurrence of a locked-and-dirty buffer in block_write_full_page() is fairly rare: normally the collision avoidance happens at the address_space level, via PageWriteback. But some mappings (blockdevs, ext3 files, etc.) have their dirty buffers written out via submit_bh(). It is these buffers which can stall block_write_full_page(). This wart will be pretty intrusive to fix. ext3 needs to become fully page-based (ugh. It's a block-based journalling filesystem, and pages are unnatural). Blockdev mappings are still written out by buffers because that's how filesystems use them. Putting _all_ metadata (indirects, inodes, superblocks, etc.) into standalone address_spaces would fix that up.
- filemap_fdatawrite() sets PF_SYNC. So filemap_fdatawrite() is the kernel function which will start writeback against a mapping for "data integrity" purposes, whereas the unexported, internal-only do_writepages() is the writeback function used for memory cleansing. This difference is the reason why I didn't consolidate those functions ages ago...
- Lots of code paths had a bogus extra call to filemap_fdatawait(), which I previously added in a moment of weak-headedness. They have all been removed.
2002-07-23  [PATCH] page-writeback.c compile warning fix  (Andrew Morton)
2002-07-04  [PATCH] pdflush cleanup  (Andrew Morton)
Writeback/pdflush cleanup patch from Steven Augart:
* Exposes nr_pdflush_threads as /proc/sys/vm/nr_pdflush_threads, read-only. (I like this - I expect that management of the pdflush thread pool will be important for many-spindle machines, and this is a neat way of getting at the info.)
* Adds minimum and maximum checking to the five writable pdflush and fs-writeback parameters.
* Minor indentation fix in sysctl.c
* mm/pdflush.c now includes linux/writeback.h, which prototypes pdflush_operation. This is so that the compiler can automatically check that the prototype matches the definition.
* Adds a few comments to existing code.
2002-07-04  [PATCH] misc cleanups and fixes  (Andrew Morton)
- Comment and documentation fixlets
- Remove some unneeded fields from swapper_inode (these are a leftover from when I had swap using the filesystem I/O functions).
- Fix a printk bug in pci/pool.c: when dma_addr_t is 64 bit it generates a compile warning, and will print out garbage. Cast it to unsigned long long.
- Convert some writeback #defines into enums (Steven Augart)
2002-06-17  [PATCH] writeback tunables  (Andrew Morton)
Adds five sysctls for tuning the writeback behaviour:
    dirty_async_ratio
    dirty_background_ratio
    dirty_sync_ratio
    dirty_expire_centisecs
    dirty_writeback_centisecs
These are described in Documentation/filesystems/proc.txt. They are basically the traditional knobs which we've always had...
We are accreting a ton of obsolete sysctl numbers under /proc/sys/vm/. I didn't recycle these - just marked them unused and removed the obsolete documentation.
2002-06-02  [PATCH] remove inode.i_wait  (Andrew Morton)
Remove i_wait from struct inode and hash it instead. This is a pure space-saving exercise: 12 bytes from struct inode on x86.
NFS was using i_wait for its own purposes. Add a wait_queue_head_t to the fs-private inode for that. This change has been acked by Trond.
2002-05-27  [PATCH] rename writeback_mapping to writepages  (Andrew Morton)
Spot the difference:
    aops.readpage
    aops.readpages
    aops.writepage
    aops.writeback_mapping
The patch renames `writeback_mapping' to `writepages'.
2002-05-19  [PATCH] improved I/O scheduling for indirect blocks  (Andrew Morton)
Fixes a performance problem with many-small-file writeout. At present, files are written out via their mapping and their indirect blocks are written out via the blockdev mapping. As we know that indirects are disk-adjacent to the data, it is better to start I/O against the indirects at the same time as the data.
The delalloc paths have code in ext2_writepage() which recognises when the target page->index is at an indirect boundary, and does an explicit hunt-and-write against the neighbouring indirect block. Which is ideal (unless the file was dirtied seekily and the page which is next to the indirect was not dirtied).
This patch does it the other way: when we start writeback against a mapping, also start writeback against any dirty buffers which are attached to mapping->private_list. Let the elevator take care of the rest.
The patch makes a number of tuning changes to the writeback path in fs-writeback.c. This is very fiddly code: getting the throughput tuned, getting the data-integrity "sync" operations right, avoiding most of the livelock opportunities, getting the `kupdate' function working efficiently, keeping it all at least somewhat comprehensible. An important intent here is to ensure that metadata blocks for inodes are marked dirty before writeback starts working the blockdev mapping, so that all the inode blocks are efficiently written back.
The patch removes try_to_writeback_unused_inodes(), which became unreferenced in vm-writeback.patch.
The patch has a tweak in ext2_put_inode() to prevent ext2 from incorrectly dropping its preallocation window in response to a random iput().
Generally, many-small-file writeout is a lot faster than 2.5.7 (which is linux-before-I-futzed-with-it). The workload which was optimised was
    tar xfz /nfs/mountpoint/linux-2.4.18.tar.gz ; sync
on mem=128M and mem=2048M. With these patches, 2.5.15 is completing in about 2/3 of the time of 2.5.7. But it is only a shade faster than 2.4.19-pre7. Why is 2.5.7 so much slower than 2.4.19? Not sure yet.
Heavy dbench loads (dbench 32 on mem=128M) are slightly faster than 2.5.7 and significantly slower than 2.4.19. It appears that the cause is poor read throughput at the later stages of the run, because there are background writeback threads operating at the same time. The 2.4.19-pre8 write scheduling manages to stop writeback during the latter stages of the dbench run in a way which I haven't been able to sanely emulate yet. It may not be desirable to do this anyway: it's optimising for the case where the files are about to be deleted. But it would be good to find a way of "pausing" the writeback for a few seconds to allow readers to get an interval of decent bandwidth.
tiobench throughput is basically the same across all recent kernels. CPU load on writes is down maybe 30% in 2.5.15.
2002-05-19  [PATCH] writeback tuning  (Andrew Morton)
Tune up the VM-based writeback a bit.
- Always use the multipage clustered-writeback function from within shrink_cache(), even if the page's mapping has a NULL ->vm_writeback(). So clustered writeback is turned on for all address_spaces, not just ext2. Subtle effect of this change: it is now the case that *all* writeback proceeds along the mapping->dirty_pages list. The orderedness of the page LRUs no longer has an impact on disk scheduling. So we only have one list to keep well-sorted rather than two, and churning pages around on the LRU will no longer damage write bandwidth - it's all up to the filesystem.
- Decrease the clustered writeback from 1024 pages(!) to 32 pages. (1024 was a leftover from when this code was always dispatching writeback to a pdflush thread.)
- Fix wakeup_bdflush() so that it actually does write something (duh). do_wp_page() needs to call balance_dirty_pages_ratelimited(), so we throttle mmap page-dirtiers in the same way as write(2) page-dirtiers. This may make wakeup_bdflush() obsolete, but it doesn't hurt.
- Convert generic_vm_writeback() to directly call ->writeback_mapping(), rather than going through writeback_single_inode(). This prevents memory allocators from blocking on the inode's I_LOCK. But it does mean that two processes can be writing pages from the same mapping at the same time. If filesystems care about this (for layout reasons) then they should serialise in their ->writeback_mapping a_op. This means that memory allocators will write back only pages, not pages and inodes. There are no locks in that writeback path (except for request queue exhaustion). Reduces memory allocation latency.
- Implement a new background_writeback function, which when kicked off will perform writeback until dirty memory falls below the background threshold.
- Put written-back pages onto the remote end of the page LRU. It does this in the slow-and-stupid way at present. pagemap_lru_lock stress-relief is planned...
- Remove the funny writeback_unused_inodes() stuff from prune_icache(). Writeback from wakeup_bdflush() and the `kupdate' function now just naturally cleanses the oldest inodes, so we don't need to do anything there.
- Dirty memory balancing is still using magic numbers: "after you dirtied your 1,000th page, go write 1,500". Obviously, this needs more work.