|
With silly pageout testcases it is possible to place huge amounts of memory
under I/O. With a large request queue (CFQ uses 8192 requests) it is
possible to place _all_ memory under I/O at the same time.
This means that all memory is pinned and unreclaimable and the VM gets
upset and goes oom.
The patch limits the amount of memory which is under pageout writeout to be
a little more than the amount of memory at which balance_dirty_pages()
callers will synchronously throttle.
This means that heavy pageout activity can starve heavy writeback activity
completely, but heavy writeback activity will not cause starvation of
pageout, because we don't want a simple `dd' to cause excessive
latencies in page reclaim.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The O_SYNC speedup patches missed the generic_file_xxx_nolock cases, which
means that pages weren't actually getting sync'ed in those cases. This
patch fixes that.
Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Eliminate the inode waitqueue hashtable using bit_waitqueue() via
wait_on_bit() and wake_up_bit() to locate the waitqueue head associated
with a bit.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add a whole bunch more might_sleep() checks. We also enable might_sleep()
checking in copy_*_user(). This was non-trivial because the "copy_*_user()
in atomic regions" trick would generate false positives. Fix that up by
adding a new __copy_*_user_inatomic(), which avoids the might_sleep() check.
Only i386 is supported in this patch.
With: Arjan van de Ven <arjanv@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
We assign 0 and 1 to it, but since it's signed, that's
actually already overflowing the poor thing. So make
it unsigned, which is what it really was supposed to be
in the first place.
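For illustration only - assuming the field in question is a one-bit bitfield,
which is the usual way an assignment of 1 overflows a signed type - the bug
pattern looks like:
    struct example {
        signed int flag:1;      /* can only hold 0 and -1 */
    };

    struct example e;

    e.flag = 1;                 /* overflows; e.flag reads back as -1 */
    /* so (e.flag == 1) is never true */

    /* the fix: make it unsigned */
    struct example_fixed {
        unsigned int flag:1;    /* holds 0 and 1 as intended */
    };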
|
|
In databases it is common to have multiple threads or processes performing
O_SYNC writes against different parts of the same file.
Our performance at this is poor, because each writer blocks access to the
file by waiting on I/O completion while holding i_sem: everything is
serialised.
The patch improves things by moving the writing and waiting outside i_sem.
So other threads can get in and submit their I/O and permit the disk
scheduler to optimise the IO patterns better.
Also, the O_SYNC writer only writes and waits on the pages which he wrote,
rather than writing and waiting on all dirty pages in the file.
The reason we haven't been able to do this before is that the required walk
of the address_space page lists is easily livelockable without the i_sem
serialisation. But in this patch we perform the waiting via a radix-tree
walk of the affected pages. This cannot be livelocked.
The sync of the inode's metadata is still performed inside i_sem. This is
because it is list-based and is hence still livelockable. However it is
usually the case that databases are overwriting existing file blocks and
there will be no dirty buffers attached to the address_space anyway.
The code is careful to ensure that the IO for the pages and the IO for the
metadata are nonblockingly scheduled at the same time. This is an improvement
over the current code, which will issue two separate write-and-wait cycles:
one for metadata, one for pages.
Note from Suparna:
Reworked to use the tagged radix-tree based writeback infrastructure.
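Schematically, the new O_SYNC write path looks something like this (a sketch
only; the helper names below are illustrative rather than the exact ones in
the patch):
    down(&inode->i_sem);
    written = do_the_buffered_write(file, buf, count, pos);  /* dirty pages */
    write_inode_metadata_nonblocking(inode);  /* schedule metadata I/O too */
    up(&inode->i_sem);

    /* outside i_sem: write and wait on just the pages this caller
     * dirtied, found by a tagged radix-tree walk over [pos, pos+written),
     * so concurrent O_SYNC writers can keep the disk queue busy */
    err = write_and_wait_page_range(mapping, pos, written);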
Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Modify mpage_writepages to optionally only write back dirty pages within a
specified range in a file (as in the case of O_SYNC). Cheat a little to avoid
changes to prototypes of aops - just put the <start, end> hint into the
writeback_control struct instead. If <start, end> are not set, then default
to writing back all the mapping's dirty pages.
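Usage then looks roughly like this (a sketch; the patch carries the range in
writeback_control rather than changing the aop prototypes, and the field
names follow the description above):
    struct writeback_control wbc = {
        .sync_mode   = WB_SYNC_ALL,
        .nr_to_write = LONG_MAX,
        .start       = pos,                 /* byte range to write back */
        .end         = pos + count - 1,     /* 0/0 means "whole file" */
    };

    ret = mpage_writepages(mapping, &wbc, get_block);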
Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Nobody ever fixed the big FIXME in sysctl - but we really need
to pass around the proper "loff_t *" to all the sysctl functions
if we want them to be well-behaved wrt the file pointer position.
This is all preparation for making direct f_pos accesses go
away.
|
|
From: Bart Samwel <bart@samwel.tk>
Adds /proc/sys/vm/laptop-mode: a special knob which says "this is a laptop".
In this mode the kernel will attempt to avoid spinning disks up.
Algorithm: the idea is to hold dirty data in memory for a long time, but to
flush everything which has been accumulated if the disk happens to spin up
for other reasons.
- Whenever a disk request completes (read or write), schedule a timer a few
seconds hence. If the timer was already pending, reset it to a few seconds
hence.
- When the timer expires, write back the whole world. We use
sync_filesystems() for this because it will force ext3 journal commits as
well.
- In balance_dirty_pages(), kick off background writeback when we hit the
high threshold (dirty_ratio), not when we hit the low threshold. This has
the effect of causing "lumpy" writeback which is something I spent a year
fixing, but in laptop mode, it is desirable.
- In try_to_free_pages(), only kick pdflush if the VM is getting into
distress: we want to keep scanning for clean pages, deferring writeback.
- In page reclaim, avoid writing back the odd random dirty page off the
LRU: only start I/O if the scanning is working harder.
The effect is to perform a sync() a few seconds after all I/O has ceased.
The value which was written into /proc/sys/vm/laptop-mode determines, in
seconds, the delay between the final I/O and the flush.
Additionally, the patch adds tools which help answer the question "why the
heck does my disk spin up all the time?". The user may set
/proc/sys/vm/block_dump to a non-zero value and the kernel will print out
information which will identify the process which is performing disk reads or
which is dirtying pagecache.
The user should probably disable syslogd before setting block_dump.
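The heart of it is a single timer which is re-armed on every I/O completion,
something like the sketch below (assuming a laptop_mode variable holding the
delay in seconds; the helper names are approximate):
    /* called from the block layer whenever a read or write completes */
    void laptop_io_completion(void)
    {
        /* push the flush out to laptop_mode seconds after the most
         * recent I/O; re-arming a pending timer simply resets it */
        mod_timer(&laptop_mode_wb_timer, jiffies + laptop_mode * HZ);
    }

    /* timer expiry: write back the whole world */
    static void laptop_flush(unsigned long unused)
    {
        sys_sync();     /* sync_filesystems() underneath, so ext3
                         * journal commits are forced as well */
    }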
|
|
If pdflush hits a locked-and-clean buffer in __block_write_full_page() it
will just pass over the buffer. Typically the buffer is an ext3 data=ordered
buffer which is being written by kjournald, but a similar thing can happen
with blockdev buffers and ll_rw_block().
This is bad because the buffer is still under I/O and a subsequent fsync's
fdatawait() needs to know about it.
It is not practical to tag the page for writeback - only the submitter of the
I/O can do that, because the submitter has control of the end_io handler.
So instead, redirty the page so a subsequent fsync's fdatawrite() will wait on
the underway I/O.
There is a risk that pdflush::background_writeout() will lock up, repeatedly
trying and failing to write the same page. This is prevented by ensuring
that background_writeout() always throttles when it made no progress.
|
|
From: Robert Love <rml@tech9.net>
- Let real-time tasks dip further into the reserves than usual in
__alloc_pages(). There are a lot of ways to special case this. This
patch just cuts z->pages_low in half, before doing the incremental min
thing, for real-time tasks. I do not do anything in the low memory slow
path. We can be a _lot_ more aggressive if we want. Right now, we just
give real-time tasks a little help.
- Never ever call balance_dirty_pages() on a real-time task. Where and
how exactly we handle this is up for debate. We could, for example,
special case real-time tasks inside balance_dirty_pages(). This would
allow us to perform some of the work (say, waking up pdflush) but not
other work (say, the active throttling). As it stands now, we do the
per-processor accounting in balance_dirty_pages_ratelimited() but we
never call balance_dirty_pages(). Lots of approaches work. What we want
to do is never engage the real-time task in forced writeback.
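Sketched in code, the two changes above amount to roughly the following
(rt_task() being the usual test for a real-time scheduling policy; details
simplified, and alloc_from_zone() is an illustrative helper):
    /* __alloc_pages(): let RT tasks dip further into the reserves */
    min = z->pages_low;
    if (rt_task(current))
        min /= 2;               /* halve the watermark before the
                                 * incremental min calculation */
    if (z->free_pages >= min)
        page = alloc_from_zone(z, order);

    /* balance_dirty_pages_ratelimited(): keep the per-cpu accounting,
     * but never pull a real-time task into forced writeback */
    if (rt_task(current))
        return;
    balance_dirty_pages(mapping);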
|
|
|
|
This /proc tunable sets the kupdate interval. It has a couple of problems:
- No way to turn it off completely (userspace dirty memory management
solutions require this).
- If it has been set to one hour and then the user resets it to five
seconds, that resetting will not take effect for up to an hour.
Fix that up by providing a sysctl handler. Setting the tunable to zero now
disables the kupdate function.
|
|
Patch from Anders Gustafsson <andersg@0x63.nu>
We're getting a division-by-zero in the writeback code during early rootfs
population, because writeback has not yet been initialised.
Fix that by performing an explicit initialisation rather than relying on
initcall ordering.
|
|
current->flags:PF_SYNC was a hack I added because I didn't want to
change all ->writepage implementations.
It's foul. And it means that if someone happens to run direct page
reclaim within the context of (say) sys_sync, the writepage invocations
from the VM will be treated as "data integrity" operations, not "memory
cleansing" operations, which would cause latency.
So the patch removes PF_SYNC and adds an extra arg to a_ops->writepage.
It is the `writeback_control' structure which contains the full context
information about why writepage was called.
The initial version of this patch just passed in a bare `int sync', but
the XFS team need more info so they can perform writearound from within
page reclaim.
The patch also adds writeback_control.for_reclaim, so writepage
implementations can inspect that to work out the call context rather
than peeking at current->flags:PF_MEMALLOC.
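Concretely, a writepage implementation now sees the full calling context,
along these lines (a sketch; the sync_mode test is illustrative of a
data-integrity caller):
    static int example_writepage(struct page *page,
                                 struct writeback_control *wbc)
    {
        if (wbc->for_reclaim) {
            /* called from page reclaim ("memory cleansing"):
             * free to do writearound, or to back out entirely */
        } else if (wbc->sync_mode == WB_SYNC_ALL) {
            /* data integrity caller: must not skip anything */
        }
        /* ... perform the actual writeout ... */
        return 0;
    }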
|
|
I've revisited all the superblock->inode->page writeback paths. There
were several silly things in there, and things were not as clear as they
could be.
scenario 1: create and dirty a MAP_SHARED segment over a sparse file,
then exit.
All the memory turns into dirty pagecache, but the kupdate function
only writes it out at a trickle - 4 megabytes every thirty seconds.
We should sync it all within 30 seconds.
What's happening is that when writeback tries to write those pages,
the filesystem needs to instantiate new blocks for them (they're over
holes). The filesystem runs mark_inode_dirty() within the writeback
function.
This redirtying of the inode while we're writing it out triggers
some livelock avoidance code in __sync_single_inode(). That function
says "ah, someone redirtied the file while I was writing it. Let's
move the file to the new end of the superblock dirty list and write
it out later." Problem is, writeback dirtied the inode itself.
(It is rather silly that mark_inode_dirty() sets I_DIRTY_PAGES when
clearly no pages have been dirtied. Fixing that up would be a
largish work, so work around it here).
So this patch just removes the livelock avoidance from
__sync_single_inode(). It is no longer needed anyway - writeback
livelock is now avoided (in all writeback paths) by writing a finite
number of pages.
scenario 2: an application is continuously dirtying a 200 megabyte
file, and your disk has a bandwidth of less than 40 megabytes/sec.
What happens is that once 30 seconds passes, pdflush starts writing
out the file. And because that writeout will take more than five
seconds (a `kupdate' interval), pdflush just keeps writing it out
forever - continuous I/O.
What we _want_ to happen is that the 200 megabytes gets written,
and then IO stops for thirty seconds (minus the writeout period). So
the file is fully synced every thirty seconds.
The patch solves this by using mapping->io_pages more intelligently.
When the time comes to write the file out, move all the dirty pages
onto io_pages. That is a "batch of pages for this kupdate round".
When io_pages is empty, we know we're done.
The address_space_operations.writepages() API is changed! It now only
needs to write the pages which the caller placed on mapping->io_pages.
This conceptually cleans things up a bit, by more clearly defining the
role of ->io_pages, and the motion between the various mapping lists.
The treatment of sb->s_dirty and sb->s_io is now conceptually identical
to mapping->dirty_pages and mapping->io_pages: move the items-to-be-written
onto ->s_io/io_pages, and walk that list. As inodes (or pages)
are written, move them over to the clean/locked/dirty lists.
Oh, scenario 3: start an app which continuously overwrites a 5 meg
file. Wait five seconds, start another, wait 5 seconds, start another.
What we _should_ see is three 5-meg writes, five seconds apart, every
thirty seconds. That did all sorts of odd things. It now does the
right thing.
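The io_pages batching from scenario 2 works roughly like this (illustrative
pseudocode of the idea, not the literal patch):
    /* when a kupdate/background pass decides to write this mapping: */
    spin_lock(&mapping->page_lock);
    list_splice_init(&mapping->dirty_pages, &mapping->io_pages);
    spin_unlock(&mapping->page_lock);

    /* ->writepages() now writes only what is on io_pages.  Pages
     * dirtied after the splice land on dirty_pages and wait for the
     * next round, so we cannot livelock, and an empty io_pages list
     * means this batch is complete. */
    mapping->a_ops->writepages(mapping, &wbc);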
|
|
fail_writepage() does not work. Its activate_page() call cannot
activate the page because it is not on the LRU.
So perform that function (more efficiently) in the VM. Remove
fail_writepage() and, if the filesystem does not implement
->writepage() then activate the page from shrink_list().
A special case is tmpfs, which does have a writepage, but which
sometimes wants to activate the pages anyway. The most important case
is when there is no swap online and we don't want to keep all those
pages on the inactive list. So just as a tmpfs special-case, allow
writepage() to return WRITEPAGE_ACTIVATE, and handle that in the VM.
Also, the whole idea of allowing ->writepage() to return -EAGAIN, and
handling that in the caller has been reverted. If a writepage()
implementation wants to back out and not write the page, it must
redirty the page, unlock it and return zero. (This is Hugh's preferred
way).
And remove the now-unneeded shmem_writepages() - shmem inodes are
marked as `memory backed' so it will not be called.
And remove the test for non-null ->writepage() in generic_file_mmap().
Memory-backed files _are_ mmappable, and they do not have a
writepage(). It just isn't called.
So the locking rules for writepage() are unchanged. They are:
- Called with the page locked
- Returns with the page unlocked
- Must redirty the page itself if it wasn't all written.
But there is a new, special, hidden, undocumented, secret hack for
tmpfs: writepage may return WRITEPAGE_ACTIVATE to tell the VM to move
the page to the active list. The page must be kept locked in this one
case.
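Put together, a writepage that wants to back out now follows this shape (a
sketch; the helper predicates are made up for illustration):
    static int example_writepage(struct page *page)
    {
        if (cannot_write_this_page_now()) {
            /* the standard back-out: redirty, unlock, report success */
            set_page_dirty(page);
            unlock_page(page);
            return 0;
        }

        if (tmpfs_wants_page_activated()) {
            /* tmpfs-only hack: tell the VM to move the page to the
             * active list.  The page stays locked in this one case. */
            return WRITEPAGE_ACTIVATE;
        }

        /* ... normal writeout; return with the page unlocked ... */
        return 0;
    }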
|
|
Since /proc/sys/vm/dirty_sync_ratio went away, the name
"dirty_async_ratio" makes no sense.
So rename it to just /proc/sys/vm/dirty_ratio.
|
|
Convert the VM to not wait on other people's dirty data.
- If we find a dirty page and its queue is not congested, do some writeback.
- If we find a dirty page and its queue _is_ congested then just
refile the page.
- If we find a PageWriteback page then just refile the page.
- There is additional throttling for write(2) callers. Within
generic_file_write(), record their backing queue in ->current.
Within page reclaim, if this task encounters a page which is dirty
or under writeback on this queue, block on it. This gives some more
writer throttling and reduces the page refiling frequency.
It's somewhat CPU expensive - under really heavy load we only get a 50%
reclaim rate in pages coming off the tail of the LRU. This can be
fixed by splitting the inactive list into reclaimable and
non-reclaimable lists. But the CPU load isn't too bad, and latency is
much, much more important in these situations.
Example: with `mem=512m', running 4 instances of `dbench 100', 2.5.34
took 35 minutes to compile a kernel. With this patch, it took three
minutes, 45 seconds.
I haven't done swapcache or MAP_SHARED pages yet. If there's tons of
dirty swapcache or mmap data around we still stall heavily in page
reclaim. That's less important.
This patch also has a tweak for swapless machines: don't even bother
bringing anon pages onto the inactive list if there is no swap online.
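In shrink_list() terms, the per-page policy becomes roughly the following (a
simplified sketch; start_page_writeback() is an illustrative helper):
    if (PageWriteback(page))
        goto keep;                      /* someone else's I/O: refile,
                                         * don't wait on it */

    if (PageDirty(page)) {
        if (bdi_write_congested(mapping->backing_dev_info))
            goto keep;                  /* queue congested: refile */

        /* queue has room: start writeback on this page and let a
         * later pass reclaim it once it comes clean */
        start_page_writeback(page);
        goto keep;
    }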
|
|
The key concept here is that pdflush does not block on request queues
any more. Instead, it circulates across the queues, keeping any
non-congested queues full of write data. When all queues are full,
pdflush takes a nap, to be woken when *any* queue exits write
congestion.
This code can keep sixty spindles saturated - we've never been able to
do that before.
- Add the `nonblocking' flag to struct writeback_control, and teach
the writeback paths to honour it.
- Add the `encountered_congestion' flag to struct writeback_control
and teach the writeback paths to set it.
So as soon as a mapping's backing_dev_info indicates that it is getting
congested, bale out of writeback. And don't even start writeback
against filesystems whose queues are congested.
- Convert pdflush's background_writeback() function to use
nonblocking writeback.
This way, a single pdflush thread will circulate around all the
dirty queues, keeping them filled.
- Convert the pdflush `kupdate' function to do the same thing.
This solves the problem of pdflush thread pool exhaustion.
It solves the problem of pdflush startup latency.
It solves the (minor) problem wherein `kupdate' writeback only writes
back a single disk at a time (it was getting blocked on each queue in
turn).
It probably means that we only ever need a single pdflush thread.
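The background writeback function then becomes a loop of nonblocking passes,
something like this (a sketch following the description above; the exit
condition against the dirty threshold is omitted):
    static void background_writeout(unsigned long _min_pages)
    {
        long min_pages = _min_pages;
        struct writeback_control wbc = {
            .sync_mode   = WB_SYNC_NONE,
            .nonblocking = 1,           /* never block on a request queue */
        };

        while (min_pages > 0) {
            wbc.encountered_congestion = 0;
            wbc.nr_to_write = MAX_WRITEBACK_PAGES;

            writeback_inodes(&wbc);     /* skips congested queues */
            min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;

            if (wbc.encountered_congestion) {
                /* every non-congested queue is now full: nap until
                 * any queue exits write congestion, then go around */
                blk_congestion_wait(WRITE, HZ/10);
            }
        }
    }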
|
|
This was designed to be a really stern throttling threshold: if dirty
memory reaches this level then perform writeback and actually wait on
it.
It doesn't work, because memory dirtiers are required to perform
writeback if the amount of dirty AND writeback memory exceeds
dirty_async_ratio.
So kill it, and rely just on the request queues being appropriately
scaled to the machine size (they are).
This is basically what 2.4 does.
|
|
The writeback code paths which walk the superblocks and inodes are
getting an increasing number of arguments passed to them.
The patch wraps those args into the new `struct writeback_control',
and uses that instead. There is no functional change.
The new writeback_control structure is passed down through the
writeback paths in the place where the old `nr_to_write' pointer used
to be.
writeback_control will be used to pass new information up and down the
writeback paths. Such as whether the writeback should be non-blocking,
and whether queue congestion was encountered.
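In outline, the structure looks like this (a sketch; the exact membership
grows as later patches in this series add fields):
    struct writeback_control {
        enum writeback_sync_modes sync_mode;
        unsigned long *older_than_this; /* kupdate: only expire old data */
        long nr_to_write;               /* how many pages to write;
                                         * decremented as pages go out */
        int nonblocking;                /* don't get stuck on queues */
        int encountered_congestion;     /* output: a queue was congested */
    };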
|
|
This is a performance and correctness fix against the writeback paths.
The writeback code has competing requirements. Sometimes it is used
for "memory cleansing": kupdate, bdflush, writer throttling, page
allocator writeback, etc. And sometimes this same code is used for
data integrity purposes: fsync, msync, fdatasync, sync, umount, various
other kernel-internal uses.
The problem is: how to handle a dirty buffer or page which is currently
under writeback.
For memory cleansing, we just want to skip that buffer/page and go onto
the next one. But for sync, we must wait on the old writeback and then
start new writeback.
mpage_writepages() is currently correct for cleansing, but incorrect for
sync. block_write_full_page() is currently correct for sync, but
inefficient for cleansing.
The fix is fairly simple.
- In mpage_writepages(), don't skip the page if it's a sync
operation.
- In block_write_full_page(), skip the buffer if it is a memory-cleansing
(non-sync) operation, and return -EAGAIN to tell the caller that the writeout
didn't work out. The caller must then set the page dirty again and
move it onto mapping->dirty_pages.
This is an extension of the writepage API: writepage can now return
EAGAIN. There are only three callers, and they have been updated.
fail_writepage() and ext3_writepage() were actually doing this by
hand. They have been changed to return -EAGAIN. NTFS will want to
be able to return -EAGAIN from its writepage as well.
- A sticky question is: how to tell the writeout code which mode it
is operating in? Cleansing or sync?
It's such a tiny code change that I didn't have the heart to go and
propagate a `mode' argument down every instance of writepages() and
writepage() in the kernel. So I passed it in via current->flags.
Incidentally, the occurrence of a locked-and-dirty buffer in
block_write_full_page() is fairly rare: normally the collision avoidance
happens at the address_space level, via PageWriteback. But some
mappings (blockdevs, ext3 files, etc) have their dirty buffers written
out via submit_bh(). It is these buffers which can stall
block_write_full_page().
This wart will be pretty intrusive to fix. ext3 needs to become fully
page-based (ugh. It's a block-based journalling filesystem, and pages
are unnatural). blockdev mappings are still written out by buffers
because that's how filesystems use them. Putting _all_ metadata
(indirects, inodes, superblocks, etc) into standalone address_spaces
would fix that up.
- filemap_fdatawrite() sets PF_SYNC. So filemap_fdatawrite() is the
kernel function which will start writeback against a mapping for
"data integrity" purposes, whereas the unexported, internal-only
do_writepages() is the writeback function which is used for memory
cleansing. This difference is the reason why I didn't consolidate
those functions ages ago...
- Lots of code paths had a bogus extra call to filemap_fdatawait(),
which I previously added in a moment of weak-headedness. They have
all been removed.
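The mode is carried in the task flags, roughly like so (a sketch; declarations
and argument lists are abbreviated, and the "remove PF_SYNC" patch above later
replaces this with an explicit writeback_control argument):
    /* filemap_fdatawrite(): the "data integrity" entry point */
    current->flags |= PF_SYNC;
    ret = do_writepages(mapping, &nr_to_write);   /* or the equivalent */
    current->flags &= ~PF_SYNC;

    /* in block_write_full_page(), on meeting a locked buffer: */
    if (current->flags & PF_SYNC) {
        wait_on_buffer(bh);     /* sync: wait, then write it again */
    } else {
        return -EAGAIN;         /* cleansing: skip; the caller redirties
                                 * the page and moves it back onto
                                 * mapping->dirty_pages */
    }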
|
|
|
|
Writeback/pdflush cleanup patch from Steven Augart
* Exposes nr_pdflush_threads as /proc/sys/vm/nr_pdflush_threads, read-only.
(I like this - I expect that management of the pdflush thread pool
will be important for many-spindle machines, and this is a neat way
of getting at the info).
* Adds minimum and maximum checking to the five writable pdflush
and fs-writeback parameters.
* Minor indentation fix in sysctl.c
* mm/pdflush.c now includes linux/writeback.h, which prototypes
pdflush_operation. This is so that the compiler can
automatically check that the prototype matches the definition.
* Adds a few comments to existing code.
|
|
- Comment and documentation fixlets
- Remove some unneeded fields from swapper_inode (these are a
leftover from when I had swap using the filesystem IO functions).
- fix a printk bug in pci/pool.c: when dma_addr_t is 64 bit it
generates a compile warning, and will print out garbage. Cast it to
unsigned long long.
- Convert some writeback #defines into enums (Steven Augart)
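The pci/pool.c fix is the standard idiom for printing a dma_addr_t whose
width varies between architectures (shown as a sketch):
    dma_addr_t dma;

    /* %llx plus an explicit cast is correct whether dma_addr_t is
     * 32 or 64 bits wide, and silences the compile warning */
    printk(KERN_ERR "pci_pool: buffer at dma 0x%llx\n",
           (unsigned long long)dma);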
|
|
Adds five sysctls for tuning the writeback behaviour:
dirty_async_ratio
dirty_background_ratio
dirty_sync_ratio
dirty_expire_centisecs
dirty_writeback_centisecs
these are described in Documentation/filesystems/proc.txt. They are
basically the traditional knobs which we've always had...
We are accreting a ton of obsolete sysctl numbers under /proc/sys/vm/.
I didn't recycle these - just mark them unused and remove the obsolete
documentation.
|
|
Remove i_wait from struct inode and hash it instead.
This is a pure space-saving exercise - 12 bytes from struct
inode on x86.
NFS was using i_wait for its own purposes. Add a wait_queue_head_t to
the fs-private inode for that. This change has been acked by Trond.
|
|
Spot the difference:
aops.readpage
aops.readpages
aops.writepage
aops.writeback_mapping
The patch renames `writeback_mapping' to `writepages'
|
|
Fixes a performance problem with many-small-file writeout.
At present, files are written out via their mapping and their indirect
blocks are written out via the blockdev mapping. As we know that
indirects are disk-adjacent to the data it is better to start I/O
against the indirects at the same time as the data.
The delalloc paths have code in ext2_writepage() which recognises when
the target page->index was at an indirect boundary and does an explicit
hunt-and-write against the neighbouring indirect block. Which is
ideal. (Unless the file was dirtied seekily and the page which is next
to the indirect was not dirtied).
This patch does it the other way: when we start writeback against a
mapping, also start writeback against any dirty buffers which are
attached to mapping->private_list. Let the elevator take care of the
rest.
The patch makes a number of tuning changes to the writeback path in
fs-writeback.c. This is very fiddly code: getting the throughput
tuned, getting the data-integrity "sync" operations right, avoiding
most of the livelock opportunities, getting the `kupdate' function
working efficiently, keeping it all at least somewhat comprehensible.
An important intent here is to ensure that metadata blocks for inodes
are marked dirty before writeback starts working the blockdev mapping,
so all the inode blocks are efficiently written back.
The patch removes try_to_writeback_unused_inodes(), which became
unreferenced in vm-writeback.patch.
The patch has a tweak in ext2_put_inode() to prevent ext2 from
incorrectly dropping its preallocation window in response to a random
iput().
Generally, many-small-file writeout is a lot faster than 2.5.7 (which
is linux-before-I-futzed-with-it). The workload which was optimised was
tar xfz /nfs/mountpoint/linux-2.4.18.tar.gz ; sync
on mem=128M and mem=2048M.
With these patches, 2.5.15 is completing in about 2/3 of the time of
2.5.7. But it is only a shade faster than 2.4.19-pre7. Why is 2.5.7
so much slower than 2.4.19? Not sure yet.
Heavy dbench loads (dbench 32 on mem=128M) are slightly faster than
2.5.7 and significantly slower than 2.4.19. It appears that the cause
is poor read throughput at the later stages of the run. Because there
are background writeback threads operating at the same time.
The 2.4.19-pre8 write scheduling manages to stop writeback during the
latter stages of the dbench run in a way which I haven't been able to
sanely emulate yet. It may not be desirable to do this anyway - it's
optimising for the case where the files are about to be deleted. But
it would be good to find a way of "pausing" the writeback for a few
seconds to allow readers to get an interval of decent bandwidth.
tiobench throughput is basically the same across all recent kernels.
CPU load on writes is down maybe 30% in 2.5.15.
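The core idea reduces to something like this (an illustrative sketch; the
helper name is made up):
    /* when writeback is started against a file's pages, also push its
     * dirty metadata buffers (indirects) at the elevator, so that the
     * disk-adjacent data and indirect blocks are merged into one
     * stream of I/O */
    if (!list_empty(&mapping->private_list))
        write_dirty_buffers_nonblocking(&mapping->private_list);

    /* ... then write the mapping's own dirty pages as usual ... */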
|
|
Tune up the VM-based writeback a bit.
- Always use the multipage clustered-writeback function from within
shrink_cache(), even if the page's mapping has a NULL ->vm_writeback(). So
clustered writeback is turned on for all address_spaces, not just ext2.
Subtle effect of this change: it is now the case that *all* writeback
proceeds along the mapping->dirty_pages list. The orderedness of the page
LRUs no longer has an impact on disk scheduling. So we only have one list
to keep well-sorted rather than two, and churning pages around on the LRU
will no longer damage write bandwidth - it's all up to the filesystem.
- Decrease the clustered writeback from 1024 pages(!) to 32 pages.
(1024 was a leftover from when this code was always dispatching writeback
to a pdflush thread).
- Fix wakeup_bdflush() so that it actually does write something (duh).
do_wp_page() needs to call balance_dirty_pages_ratelimited(), so we
throttle mmap page-dirtiers in the same way as write(2) page-dirtiers.
This may make wakeup_bdflush() obsolete, but it doesn't hurt.
- Converts generic_vm_writeback() to directly call ->writeback_mapping(),
rather than going through writeback_single_inode(). This prevents memory
allocators from blocking on the inode's I_LOCK. But it does mean that two
processes can be writing pages from the same mapping at the same time. If
filesystems care about this (for layout reasons) then they should serialise
in their ->writeback_mapping a_op.
This means that memory-allocators will writeback only pages, not pages
and inodes. There are no locks in that writeback path (except for request
queue exhaustion). Reduces memory allocation latency.
- Implement new background_writeback function, which when kicked off
will perform writeback until dirty memory falls below the background
threshold.
- Put written-back pages onto the remote end of the page LRU. It
does this in the slow-and-stupid way at present. pagemap_lru_lock
stress-relief is planned...
- Remove the funny writeback_unused_inodes() stuff from prune_icache().
Writeback from wakeup_bdflush() and the `kupdate' function now just
naturally cleanses the oldest inodes so we don't need to do anything
there.
- Dirty memory balancing is still using magic numbers: "after you
dirtied your 1,000th page, go write 1,500". Obviously, this needs
more work.
|
|
Use the pdflush exclusion infrastructure to ensure that only one
pdflush thread is ever performing writeback against a particular
request_queue.
This works rather well. It requires a lot of activity against a lot of
disks to cause more pdflush threads to start up. Possibly the
thread-creation logic is a little weak: it starts more threads when a
pdflush thread goes back to sleep. It may be better to start new
threads within pdflush_operation().
All non-request_queue-backed address_spaces share the global
default_backing_dev_info structure. So at present only a single
pdflush instance will be available for background writeback of *all*
NFS filesystems (for example).
If there is benefit in concurrent background writeback for multiple NFS
mounts then NFS would need to create per-mount backing_dev_info
structures and install those into new inodes' address_spaces in some
manner.
|
|
[ I reversed the order in which writeback walks the superblock's
dirty inodes. It sped up dbench's unlink phase greatly. I'm
such a sleaze ]
The core writeback patch. Switches file writeback from the dirty
buffer LRU over to address_space.dirty_pages.
- The buffer LRU is removed
- The buffer hash is removed (uses blockdev pagecache lookups)
- The bdflush and kupdate functions are implemented against
address_spaces, via pdflush.
- The relationship between pages and buffers is changed.
- If a page has dirty buffers, it is marked dirty
- If a page is marked dirty, it *may* have dirty buffers.
- A dirty page may be "partially dirty". block_write_full_page
discovers this.
- A bunch of consistency checks of the form
if (!something_which_should_be_true())
buffer_error();
have been introduced. These fog the code up but are important for
ensuring that the new buffer/page code is working correctly.
- New locking (inode.i_bufferlist_lock) is introduced for exclusion
from try_to_free_buffers(). This is needed because set_page_dirty
is called under spinlock, so it cannot lock the page. But it
needs access to page->buffers to set them all dirty.
i_bufferlist_lock is also used to protect inode.i_dirty_buffers.
- fs/inode.c has been split: all the code related to file data writeback
has been moved into fs/fs-writeback.c
- Code related to file data writeback at the address_space level is in
the new mm/page-writeback.c
- try_to_free_buffers() is now non-blocking
- Switches vmscan.c over to understand that all pages with dirty data
are now marked dirty.
- Introduces a new a_op for VM writeback:
->vm_writeback(struct page *page, int *nr_to_write)
this is a bit half-baked at present. The intent is that the address_space
is given the opportunity to perform clustered writeback. To allow it to
opportunistically write out disk-contiguous dirty data which may be in
other zones.
To allow delayed-allocate filesystems to get good disk layout.
- Added address_space.io_pages. Pages which are being prepared for
writeback. This is here for two reasons:
1: It will be needed later, when BIOs are assembled direct
against pagecache, bypassing the buffer layer. It avoids a
deadlock which would occur if someone moved the page back onto the
dirty_pages list after it was added to the BIO, but before it was
submitted. (hmm. This may not be a problem with PG_writeback logic).
2: Avoids a livelock which would occur if some other thread is continually
redirtying pages.
- There are two known performance problems in this code:
1: Pages which are locked for writeback cause undesirable
blocking when they are being overwritten. A patch which leaves
pages unlocked during writeback comes later in the series.
2: While inodes are under writeback, they are locked. This
causes namespace lookups against the file to get unnecessarily
blocked in wait_on_inode(). This is a fairly minor problem.
I don't have a fix for this at present - I'll fix this when I
attach dirty address_spaces direct to super_blocks.
- The patch vastly increases the amount of dirty data which the
kernel permits highmem machines to maintain. This is because the
balancing decisions are made against the amount of memory in the
machine, not against the amount of buffercache-allocatable memory.
This may be very wrong, although it works fine for me (2.5 gigs).
We can trivially go back to the old-style throttling with
s/nr_free_pagecache_pages/nr_free_buffer_pages/ in
balance_dirty_pages(). But better would be to allow blockdev
mappings to use highmem (I'm thinking about this one, slowly). And
to move writer-throttling and writeback decisions into the VM (modulo
the file-overwriting problem).
- Drops 24 bytes from struct buffer_head. More to come.
- There's some gunk like super_block.flags:MS_FLUSHING which needs to
be killed. Need a better way of providing collision avoidance
between pdflush threads, to prevent more than one pdflush thread
working a disk at the same time.
The correct way to do that is to put a flag in the request queue to
say "there's a pdlfush thread working this disk". This is easy to
do: just generalise the "ra_pages" pointer to point at a struct which
includes ra_pages and the new collision-avoidance flag.
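For reference, the new hook sits alongside the existing aops like this (a
sketch; the clustering policy itself is left to each filesystem):
    struct address_space_operations {
        int (*writepage)(struct page *page);
        int (*readpage)(struct file *file, struct page *page);
        /* ... */

        /* Give the address_space a chance to do clustered writeback
         * around this page: write up to *nr_to_write dirty, preferably
         * disk-contiguous pages, decrementing *nr_to_write as they are
         * submitted. */
        int (*vm_writeback)(struct page *page, int *nr_to_write);
    };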
|