[PATCH] writeback from address spaces

[ I reversed the order in which writeback walks the superblock's
  dirty inodes.  It sped up dbench's unlink phase greatly.  I'm
  such a sleaze ]

The core writeback patch.  Switches file writeback from the dirty
buffer LRU over to address_space.dirty_pages.

- The buffer LRU is removed

- The buffer hash is removed (uses blockdev pagecache lookups)

- The bdflush and kupdate functions are implemented against
  address_spaces, via pdflush.

- The relationship between pages and buffers is changed.

  - If a page has dirty buffers, it is marked dirty
  - If a page is marked dirty, it *may* have dirty buffers.
  - A dirty page may be "partially dirty".  block_write_full_page
    discovers this.

- A bunch of consistency checks of the form

	if (!something_which_should_be_true())
		buffer_error();

  have been introduced.  These fog the code up but are important for
  ensuring that the new buffer/page code is working correctly.

- New locking (inode.i_bufferlist_lock) is introduced for exclusion
  from try_to_free_buffers().  This is needed because set_page_dirty
  is called under spinlock, so it cannot lock the page.  But it
  needs access to page->buffers to set them all dirty.

  i_bufferlist_lock is also used to protect inode.i_dirty_buffers.

- fs/inode.c has been split: all the code related to file data
  writeback has been moved into fs/fs-writeback.c

- Code related to file data writeback at the address_space level is
  in the new mm/page-writeback.c

- try_to_free_buffers() is now non-blocking

- Switches vmscan.c over to understand that all pages with dirty data
  are now marked dirty.

- Introduces a new a_op for VM writeback:

	->vm_writeback(struct page *page, int *nr_to_write)

  this is a bit half-baked at present.  The intent is that the
  address_space is given the opportunity to perform clustered
  writeback.  To allow it to opportunistically write out
  disk-contiguous dirty data which may be in other zones.  To allow
  delayed-allocate filesystems to get good disk layout.

- Added address_space.io_pages.  Pages which are being prepared for
  writeback.  This is here for two reasons:

  1: It will be needed later, when BIOs are assembled direct
     against pagecache, bypassing the buffer layer.  It avoids a
     deadlock which would occur if someone moved the page back onto
     the dirty_pages list after it was added to the BIO, but before
     it was submitted.  (hmm.  This may not be a problem with
     PG_writeback logic).

  2: Avoids a livelock which would occur if some other thread is
     continually redirtying pages.

- There are two known performance problems in this code:

  1: Pages which are locked for writeback cause undesirable
     blocking when they are being overwritten.  A patch which leaves
     pages unlocked during writeback comes later in the series.

  2: While inodes are under writeback, they are locked.  This
     causes namespace lookups against the file to get unnecessarily
     blocked in wait_on_inode().  This is a fairly minor problem.

     I don't have a fix for this at present - I'll fix this when I
     attach dirty address_spaces direct to super_blocks.

- The patch vastly increases the amount of dirty data which the
  kernel permits highmem machines to maintain.  This is because the
  balancing decisions are made against the amount of memory in the
  machine, not against the amount of buffercache-allocatable memory.

  This may be very wrong, although it works fine for me (2.5 gigs).

  We can trivially go back to the old-style throttling with
  s/nr_free_pagecache_pages/nr_free_buffer_pages/ in
  balance_dirty_pages().  But better would be to allow blockdev
  mappings to use highmem (I'm thinking about this one, slowly).  And
  to move writer-throttling and writeback decisions into the VM (modulo
  the file-overwriting problem).

- Drops 24 bytes from struct buffer_head.  More to come.

- There's some gunk like super_block.flags:MS_FLUSHING which needs to
  be killed.  Need a better way of providing collision avoidance
  between pdflush threads, to prevent more than one pdflush thread
  working a disk at the same time.

  The correct way to do that is to put a flag in the request queue to
  say "there's a pdflush thread working this disk".  This is easy to
  do: just generalise the "ra_pages" pointer to point at a struct
  which includes ra_pages and the new collision-avoidance flag.
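
For illustration only, the io_pages livelock argument above can be
sketched as a user-space toy.  This is not code from the patch: the
toy_* names and the singly-linked lists are hypothetical, and locking
is left out.  The only point it tries to show is that a writeback pass
which first moves everything from dirty_pages to io_pages has a bounded
amount of work, even if a writer redirties each page immediately.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model; these types are illustrative, not the kernel's. */
struct toy_page {
	struct toy_page *next;
	int dirty;
	int writes;	/* how many times this page was written out */
};

struct toy_mapping {
	struct toy_page *dirty_pages;	/* pages waiting for writeback */
	struct toy_page *io_pages;	/* pages claimed by the current pass */
};

/* Put a page on dirty_pages unless it is already dirty. */
void toy_set_page_dirty(struct toy_mapping *m, struct toy_page *p)
{
	if (p->dirty)
		return;
	p->dirty = 1;
	p->next = m->dirty_pages;
	m->dirty_pages = p;
}

/*
 * One writeback pass.  Every page which is dirty at entry is moved to
 * io_pages first.  redirty() stands in for another thread dirtying
 * pages while we write; whatever it dirties lands on dirty_pages and
 * is left for the next pass.  Returns the number of pages written.
 */
int toy_writeback_pass(struct toy_mapping *m,
		       void (*redirty)(struct toy_mapping *, struct toy_page *))
{
	int written = 0;

	m->io_pages = m->dirty_pages;
	m->dirty_pages = NULL;

	while (m->io_pages) {
		struct toy_page *p = m->io_pages;

		m->io_pages = p->next;
		p->next = NULL;
		p->dirty = 0;
		p->writes++;		/* pretend the I/O happened */
		written++;
		if (redirty)
			redirty(m, p);
	}
	return written;
}

/* A writer which redirties every page as soon as it has been written. */
void toy_redirty_always(struct toy_mapping *m, struct toy_page *p)
{
	toy_set_page_dirty(m, p);
}
```

If the pass instead kept pulling pages straight off dirty_pages, a
writer like toy_redirty_always() would keep that list non-empty and the
loop would have no reason to terminate.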
author: Andrew Morton <akpm@zip.com.au> 2002-04-29 23:52:10 -0700
committer: Linus Torvalds <torvalds@home.transmeta.com> 2002-04-29 23:52:10 -0700
commit: 090da37209e13c26f3723e847860e9f7ab23e113 (patch)
tree: 2acec3966e6c590447508917411c0248fecb5015 /kernel/sysctl.c
parent: 00d6555e3c1568842beef2085045baaae59d347c (diff)
1 files changed, 0 insertions, 4 deletions
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 66ccb010e1e5..7869159de04a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -43,7 +43,6 @@
 /* External variables not in a header file. */
 extern int panic_timeout;
 extern int C_A_D;
-extern int bdf_prm[], bdflush_min[], bdflush_max[];
 extern int sysctl_overcommit_memory;
 extern int max_threads;
 extern atomic_t nr_queued_signals;
@@ -259,9 +258,6 @@ static ctl_table kern_table[] = {
 };
 
 static ctl_table vm_table[] = {
-	{VM_BDFLUSH, "bdflush", &bdf_prm, 9*sizeof(int), 0644, NULL,
-	 &proc_dointvec_minmax, &sysctl_intvec, NULL,
-	 &bdflush_min, &bdflush_max},
 	{VM_OVERCOMMIT_MEMORY, "overcommit_memory", &sysctl_overcommit_memory,
 	 sizeof(sysctl_overcommit_memory), 0644, NULL, &proc_dointvec},
 	{VM_PAGERDAEMON, "kswapd",