| author | Andrew Morton <akpm@zip.com.au> | 2002-04-29 23:52:10 -0700 |
|---|---|---|
| committer | Linus Torvalds <torvalds@home.transmeta.com> | 2002-04-29 23:52:10 -0700 |
| commit | 090da37209e13c26f3723e847860e9f7ab23e113 | |
| tree | 2acec3966e6c590447508917411c0248fecb5015 /include | |
| parent | 00d6555e3c1568842beef2085045baaae59d347c | |
[PATCH] writeback from address spaces
[ I reversed the order in which writeback walks the superblock's
dirty inodes. It sped up dbench's unlink phase greatly. I'm
such a sleaze ]
The core writeback patch. Switches file writeback from the dirty
buffer LRU over to address_space.dirty_pages.
- The buffer LRU is removed
- The buffer hash is removed (uses blockdev pagecache lookups)
- The bdflush and kupdate functions are implemented against
address_spaces, via pdflush.
- The relationship between pages and buffers is changed.
- If a page has dirty buffers, it is marked dirty
- If a page is marked dirty, it *may* have dirty buffers.
- A dirty page may be "partially dirty". block_write_full_page
discovers this.
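  As an illustration of the "partially dirty" case, here is a minimal
  sketch (not the patch's actual block_write_full_page(), and assuming
  the 2.5-era buffer API: page->buffers, lock_buffer(), submit_bh())
  of walking a dirty page's buffer ring and queueing only the buffers
  which are themselves dirty:

```c
/*
 * Hypothetical sketch only: on a dirty page, some buffers may be
 * clean.  Walk the circular b_this_page list and write just the
 * dirty ones; clean buffers are skipped.
 */
static void write_dirty_buffers_of_page(struct page *page)
{
        struct buffer_head *bh, *head;

        bh = head = page->buffers;
        do {
                lock_buffer(bh);
                if (test_and_clear_bit(BH_Dirty, &bh->b_state))
                        submit_bh(WRITE, bh);   /* queue for disk I/O */
                else
                        unlock_buffer(bh);      /* clean: nothing to do */
                bh = bh->b_this_page;
        } while (bh != head);
}
```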
- A bunch of consistency checks of the form
if (!something_which_should_be_true())
buffer_error();
have been introduced. These fog the code up but are important for
ensuring that the new buffer/page code is working correctly.
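  The hook itself is declared in the fs.h hunk below. A plausible
  implementation is sketched here; the real body lives in fs/buffer.c,
  outside this include/ diff, so the rate-limited printk() is an
  assumption:

```c
/* Declared by this patch in include/linux/fs.h: */
void __buffer_error(char *file, int line);
#define buffer_error() __buffer_error(__FILE__, __LINE__)

/*
 * Assumed implementation sketch: complain and keep going.  These
 * checks are diagnostics, not BUG()s, and are rate-limited so a sick
 * filesystem cannot flood the logs.
 */
void __buffer_error(char *file, int line)
{
        static int complained;

        if (complained++ < 10)
                printk("buffer layer error at %s:%d\n", file, line);
}
```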
- New locking (inode.i_bufferlist_lock) is introduced for exclusion
from try_to_free_buffers(). This is needed because set_page_dirty
is called under spinlock, so it cannot lock the page. But it
needs access to page->buffers to set them all dirty.
i_bufferlist_lock is also used to protect inode.i_dirty_buffers.
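  A sketch of that path, using the lock and list names this patch adds
  (the real __set_page_dirty_buffers() is in fs/buffer.c, outside this
  include/ diff; pagecache_lock is the era's pagecache spinlock):

```c
int __set_page_dirty_buffers(struct page *page)
{
        struct inode *inode = page->mapping->host;
        struct buffer_head *bh, *head;

        /*
         * Callers may hold spinlocks, so take a spinlock, not the
         * sleeping page lock, while walking page->buffers.
         */
        spin_lock(&inode->i_bufferlist_lock);
        bh = head = page->buffers;
        do {
                set_bit(BH_Dirty, &bh->b_state);
                bh = bh->b_this_page;
        } while (bh != head);
        spin_unlock(&inode->i_bufferlist_lock);

        if (!test_and_set_bit(PG_dirty, &page->flags)) {
                /* first dirtying: move the page to ->dirty_pages */
                spin_lock(&pagecache_lock);
                list_del(&page->list);
                list_add(&page->list, &page->mapping->dirty_pages);
                spin_unlock(&pagecache_lock);
                __mark_inode_dirty(inode, I_DIRTY_PAGES);
        }
        return 0;
}
```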
- fs/inode.c has been split: all the code related to file data writeback
has been moved into fs/fs-writeback.c
- Code related to file data writeback at the address_space level is in
the new mm/page-writeback.c
- try_to_free_buffers() is now non-blocking
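  The non-blocking rule amounts to "fail fast if any buffer is busy"
  (the real function is in fs/buffer.c; buffer_busy() is written out
  here for clarity):

```c
/* A buffer is busy if it is referenced, dirty, or locked for I/O. */
static inline int buffer_busy(struct buffer_head *bh)
{
        return atomic_read(&bh->b_count) |
                (bh->b_state & ((1 << BH_Dirty) | (1 << BH_Lock)));
}

/* Sketch: succeed only if every buffer on the page is idle. */
static int can_free_buffers(struct page *page)
{
        struct buffer_head *bh, *head;

        bh = head = page->buffers;
        do {
                if (buffer_busy(bh))
                        return 0;       /* fail fast, never wait */
                bh = bh->b_this_page;
        } while (bh != head);
        return 1;                       /* caller may strip the page */
}
```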
- Switches vmscan.c over to understand that all pages with dirty data
are now marked dirty.
- Introduces a new a_op for VM writeback:
->vm_writeback(struct page *page, int *nr_to_write)
  This is a bit half-baked at present. The intent is that the
  address_space is given the opportunity to perform clustered
  writeback: to opportunistically write out disk-contiguous dirty
  data which may be in other zones, and to allow delayed-allocate
  filesystems to get good disk layout. (The generic fallback is
  sketched below.)
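  generic_vm_writeback() is declared in the fs.h hunk at the end of
  this page; its body lives in mm/page-writeback.c and is not part of
  this include/ diff, so the following is only a plausible shape,
  reusing writeback_single_inode() from the new writeback.h:

```c
/*
 * Sketch under assumptions: the page arrives locked; we trade the
 * single-page request for a bounded write of the whole mapping,
 * which is where clustering and delayed-allocate layout decisions
 * can be made.
 */
int generic_vm_writeback(struct page *page, int *nr_to_write)
{
        struct address_space *mapping = page->mapping;

        unlock_page(page);      /* write at the mapping level instead */
        if (mapping && mapping->host)
                writeback_single_inode(mapping->host, 0, nr_to_write);
        return 0;
}
```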
- Added address_space.io_pages: a list of pages which are being prepared
  for writeback. This is here for two reasons (a sketch of the resulting
  walk follows the list):
  1: It will be needed later, when BIOs are assembled directly
  against the pagecache, bypassing the buffer layer. It avoids a
  deadlock which would occur if someone moved the page back onto the
  dirty_pages list after it was added to the BIO, but before it was
  submitted. (Hmm. This may not be a problem once the PG_writeback
  logic is in place.)
2: Avoids a livelock which would occur if some other thread is continually
redirtying pages.
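  A sketch of the anti-livelock walk this enables, using the list names
  from the fs.h hunk below (pagecache_lock and the writepage() calling
  convention are assumptions; the real walker is in mm/page-writeback.c):

```c
static void writeback_mapping_pages(struct address_space *mapping,
                                    int *nr_to_write)
{
        spin_lock(&pagecache_lock);

        /*
         * Snapshot: every page that is dirty right now moves to
         * io_pages.  A racing thread that redirties a page puts it
         * back on dirty_pages, so this loop terminates even under
         * constant redirtying.
         */
        list_splice(&mapping->dirty_pages, &mapping->io_pages);
        INIT_LIST_HEAD(&mapping->dirty_pages);

        while (!list_empty(&mapping->io_pages) && *nr_to_write > 0) {
                struct page *page = list_entry(mapping->io_pages.prev,
                                                struct page, list);
                list_del(&page->list);
                list_add(&page->list, &mapping->locked_pages);
                page_cache_get(page);
                spin_unlock(&pagecache_lock);

                lock_page(page);
                if (test_and_clear_bit(PG_dirty, &page->flags)) {
                        mapping->a_ops->writepage(page); /* unlocks it */
                        (*nr_to_write)--;
                } else
                        unlock_page(page);
                page_cache_release(page);

                spin_lock(&pagecache_lock);
        }
        spin_unlock(&pagecache_lock);
}
```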
- There are two known performance problems in this code:
1: Pages which are locked for writeback cause undesirable
blocking when they are being overwritten. A patch which leaves
pages unlocked during writeback comes later in the series.
2: While inodes are under writeback, they are locked. This
causes namespace lookups against the file to get unnecessarily
blocked in wait_on_inode(). This is a fairly minor problem.
  I don't have a fix for this at present - I'll fix this when I
  attach dirty address_spaces directly to super_blocks.
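  The blocking point is visible in the new writeback.h below: writeback
  holds I_LOCK for the duration of the writeout, and any lookup touching
  the inode stalls here until it clears.

```c
/* From the new include/linux/writeback.h in this patch: */
static inline void wait_on_inode(struct inode *inode)
{
        if (inode->i_state & I_LOCK)
                __wait_on_inode(inode);
}
```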
- The patch vastly increases the amount of dirty data which the
kernel permits highmem machines to maintain. This is because the
balancing decisions are made against the amount of memory in the
machine, not against the amount of buffercache-allocatable memory.
This may be very wrong, although it works fine for me (2.5 gigs).
We can trivially go back to the old-style throttling with
s/nr_free_pagecache_pages/nr_free_buffer_pages/ in
balance_dirty_pages(). But better would be to allow blockdev
mappings to use highmem (I'm thinking about this one, slowly). And
to move writer-throttling and writeback decisions into the VM (modulo
the file-overwriting problem).
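  A sketch of the sizing decision (the real balance_dirty_pages() is in
  mm/page-writeback.c; the 40% ratio and the count_dirty_pages() helper
  are invented stand-ins for the patch's actual accounting):

```c
void balance_dirty_pages(struct address_space *mapping)
{
        /*
         * The threshold scales with all allocatable pagecache pages,
         * highmem included.  Substituting nr_free_buffer_pages()
         * here restores the old lowmem-only throttling.
         */
        unsigned int thresh = nr_free_pagecache_pages() * 40 / 100;
        int nr_to_write = WRITEOUT_PAGES;

        /* count_dirty_pages() is hypothetical: it stands in for the
         * real dirty-memory accounting. */
        if (count_dirty_pages() > thresh)
                generic_writeback_mapping(mapping, &nr_to_write);
}
```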
- Drops 24 bytes from struct buffer_head. More to come.
- There's some gunk like super_block.flags:MS_FLUSHING which needs to
be killed. Need a better way of providing collision avoidance
between pdflush threads, to prevent more than one pdflush thread
working a disk at the same time.
The correct way to do that is to put a flag in the request queue to
say "there's a pdlfush thread working this disk". This is easy to
do: just generalise the "ra_pages" pointer to point at a struct which
includes ra_pages and the new collision-avoidance flag.
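  A sketch of that generalisation (nothing like this is in the present
  patch; the struct and field names are invented for illustration):

```c
/*
 * Hypothetical: what ra_pages could point into instead of a bare
 * unsigned long, giving pdflush a per-queue claim bit.
 */
struct queue_wb_info {
        unsigned long ra_pages;         /* device readahead, as today */
        unsigned long flags;
};

#define QWB_pdflush     0       /* a pdflush thread owns this queue */

static int pdflush_claim_queue(struct queue_wb_info *qwb)
{
        /* Claim atomically; a second pdflush thread skips this disk. */
        return !test_and_set_bit(QWB_pdflush, &qwb->flags);
}
```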
Diffstat (limited to 'include')
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | include/linux/fs.h | 103 |
| -rw-r--r-- | include/linux/mm.h | 28 |
| -rw-r--r-- | include/linux/sched.h | 3 |
| -rw-r--r-- | include/linux/swap.h | 1 |
| -rw-r--r-- | include/linux/sysctl.h | 2 |
| -rw-r--r-- | include/linux/writeback.h | 53 |
6 files changed, 130 insertions, 60 deletions
```diff
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f0d997aeecb4..4b38c11f9723 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -112,6 +112,7 @@ extern int leases_enable, dir_notify_enable, lease_break_time;
 #define MS_MOVE         8192
 #define MS_REC          16384
 #define MS_VERBOSE      32768
+#define MS_FLUSHING     (1<<16) /* inodes are currently under writeout */
 #define MS_ACTIVE       (1<<30)
 #define MS_NOUSER       (1<<31)
@@ -155,6 +156,7 @@ extern int leases_enable, dir_notify_enable, lease_break_time;
 #define IS_RDONLY(inode)        ((inode)->i_sb->s_flags & MS_RDONLY)
 #define IS_SYNC(inode)          (__IS_FLG(inode, MS_SYNCHRONOUS) || ((inode)->i_flags & S_SYNC))
 #define IS_MANDLOCK(inode)      __IS_FLG(inode, MS_MANDLOCK)
+#define IS_FLUSHING(inode)      __IS_FLG(inode, MS_FLUSHING)
 #define IS_QUOTAINIT(inode)     ((inode)->i_flags & S_QUOTA)
 #define IS_NOQUOTA(inode)       ((inode)->i_flags & S_NOQUOTA)
@@ -215,11 +217,10 @@ enum bh_state_bits {
 	BH_Dirty,	/* 1 if the buffer is dirty */
 	BH_Lock,	/* 1 if the buffer is locked */
 	BH_Req,		/* 0 if the buffer has been invalidated */
+	BH_Mapped,	/* 1 if the buffer has a disk mapping */
 	BH_New,		/* 1 if the buffer is new and not yet written out */
 	BH_Async,	/* 1 if the buffer is under end_buffer_io_async I/O */
-	BH_Wait_IO,	/* 1 if we should write out this buffer */
-	BH_launder,	/* 1 if we should throttle on this buffer */
 	BH_JBD,		/* 1 if it has an attached journal_head */

 	BH_PrivateStart,/* not a state bit, but the first bit available
@@ -240,22 +241,16 @@ enum bh_state_bits {
  */
 struct buffer_head {
 	/* First cache line: */
-	struct buffer_head *b_next;	/* Hash queue list */
 	sector_t b_blocknr;		/* block number */
 	unsigned short b_size;		/* block size */
-	unsigned short b_list;		/* List that this buffer appears */
 	struct block_device *b_bdev;
 	atomic_t b_count;		/* users using this block */
 	unsigned long b_state;		/* buffer state bitmap (see above) */
-	unsigned long b_flushtime;	/* Time when (dirty) buffer should be written */
-
-	struct buffer_head *b_next_free;/* lru/free list linkage */
-	struct buffer_head *b_prev_free;/* doubly linked list of buffers */
 	struct buffer_head *b_this_page;/* circular list of buffers in one page */
-	struct buffer_head **b_pprev;	/* doubly linked list of hash-queue */
-	char * b_data;			/* pointer to data block */
 	struct page *b_page;		/* the page this bh is mapped to */
+
+	char * b_data;			/* pointer to data block */
 	void (*b_end_io)(struct buffer_head *bh, int uptodate); /* I/O completion */
 	void *b_private;		/* reserved for b_end_io */
@@ -371,6 +366,16 @@ struct address_space_operations {
 	int (*writepage)(struct page *);
 	int (*readpage)(struct file *, struct page *);
 	int (*sync_page)(struct page *);
+
+	/* Write back some dirty pages from this mapping. */
+	int (*writeback_mapping)(struct address_space *, int *nr_to_write);
+
+	/* Perform a writeback as a memory-freeing operation. */
+	int (*vm_writeback)(struct page *, int *nr_to_write);
+
+	/* Set a page dirty */
+	int (*set_page_dirty)(struct page *page);
+
 	/*
 	 * ext3 requires that a successful prepare_write() call be followed
 	 * by a commit_write() call - they must be balanced
@@ -391,12 +396,14 @@ struct address_space {
 	struct list_head	clean_pages;	/* list of clean pages */
 	struct list_head	dirty_pages;	/* list of dirty pages */
 	struct list_head	locked_pages;	/* list of locked pages */
+	struct list_head	io_pages;	/* being prepared for I/O */
 	unsigned long		nrpages;	/* number of total pages */
 	struct address_space_operations *a_ops;	/* methods */
 	struct inode		*host;		/* owner: inode, block_device */
 	list_t			i_mmap;		/* list of private mappings */
 	list_t			i_mmap_shared;	/* list of private mappings */
 	spinlock_t		i_shared_lock;	/* and spinlock protecting it */
+	unsigned long		dirtied_when;	/* jiffies of first page dirtying */
 	int			gfp_mask;	/* how to allocate the pages */
 	unsigned long		*ra_pages;	/* device readahead */
 };
@@ -427,9 +434,10 @@ struct inode {
 	struct list_head	i_hash;
 	struct list_head	i_list;
 	struct list_head	i_dentry;
-
-	struct list_head	i_dirty_buffers;
+
+	struct list_head	i_dirty_buffers;	/* uses i_bufferlist_lock */
 	struct list_head	i_dirty_data_buffers;
+	spinlock_t		i_bufferlist_lock;
 	unsigned long		i_ino;
 	atomic_t		i_count;
@@ -697,8 +705,9 @@ struct super_block {
 	struct list_head	s_list;		/* Keep this first */
 	kdev_t			s_dev;
 	unsigned long		s_blocksize;
-	unsigned char		s_blocksize_bits;
 	unsigned long		s_old_blocksize;
+	unsigned short		s_writeback_gen;/* To avoid writeback livelock */
+	unsigned char		s_blocksize_bits;
 	unsigned char		s_dirt;
 	unsigned long long	s_maxbytes;	/* Max file size */
 	struct file_system_type	*s_type;
@@ -903,7 +912,7 @@ struct super_operations {
 	int (*show_options)(struct seq_file *, struct vfsmount *);
 };

-/* Inode state bits.. */
+/* Inode state bits.  Protected by inode_lock. */
 #define I_DIRTY_SYNC		1 /* Not dirty enough for O_DATASYNC */
 #define I_DIRTY_DATASYNC	2 /* Data-related inode changes pending */
 #define I_DIRTY_PAGES		4 /* Data-related inode changes pending */
@@ -924,11 +933,6 @@ static inline void mark_inode_dirty_sync(struct inode *inode)
 	__mark_inode_dirty(inode, I_DIRTY_SYNC);
 }

-static inline void mark_inode_dirty_pages(struct inode *inode)
-{
-	__mark_inode_dirty(inode, I_DIRTY_PAGES);
-}
-
 struct dquot_operations {
 	void (*initialize) (struct inode *, short);
 	void (*drop) (struct inode *);
@@ -1215,19 +1219,14 @@ extern struct file_operations rdwr_pipe_fops;

 extern int fs_may_remount_ro(struct super_block *);

-extern int try_to_free_buffers(struct page *, unsigned int);
-extern void refile_buffer(struct buffer_head * buf);
-extern void create_empty_buffers(struct page *, unsigned long);
+extern int try_to_free_buffers(struct page *);
+extern void create_empty_buffers(struct page *, unsigned long,
+				unsigned long b_state);
 extern void end_buffer_io_sync(struct buffer_head *bh, int uptodate);

 /* reiserfs_writepage needs this */
 extern void set_buffer_async_io(struct buffer_head *bh) ;

-#define BUF_CLEAN	0
-#define BUF_LOCKED	1	/* Buffers scheduled for write */
-#define BUF_DIRTY	2	/* Dirty buffers, not yet scheduled for write */
-#define NR_LIST		3
-
 static inline void get_bh(struct buffer_head * bh)
 {
 	atomic_inc(&(bh)->b_count);
@@ -1252,29 +1251,27 @@ static inline void mark_buffer_uptodate(struct buffer_head * bh, int on)

 #define atomic_set_buffer_clean(bh) test_and_clear_bit(BH_Dirty, &(bh)->b_state)

-static inline void __mark_buffer_clean(struct buffer_head *bh)
-{
-	refile_buffer(bh);
-}
-
 static inline void mark_buffer_clean(struct buffer_head * bh)
 {
-	if (atomic_set_buffer_clean(bh))
-		__mark_buffer_clean(bh);
+	clear_bit(BH_Dirty, &(bh)->b_state);
 }

-extern void FASTCALL(__mark_dirty(struct buffer_head *bh));
-extern void FASTCALL(__mark_buffer_dirty(struct buffer_head *bh));
 extern void FASTCALL(mark_buffer_dirty(struct buffer_head *bh));
-extern void FASTCALL(buffer_insert_list(struct buffer_head *, struct list_head *));
+extern void buffer_insert_list(spinlock_t *lock,
+			struct buffer_head *, struct list_head *);

-static inline void buffer_insert_inode_queue(struct buffer_head *bh, struct inode *inode)
+static inline void
+buffer_insert_inode_queue(struct buffer_head *bh, struct inode *inode)
 {
-	buffer_insert_list(bh, &inode->i_dirty_buffers);
+	buffer_insert_list(&inode->i_bufferlist_lock,
+				bh, &inode->i_dirty_buffers);
 }
-static inline void buffer_insert_inode_data_queue(struct buffer_head *bh, struct inode *inode)
+
+static inline void
+buffer_insert_inode_data_queue(struct buffer_head *bh, struct inode *inode)
 {
-	buffer_insert_list(bh, &inode->i_dirty_data_buffers);
+	buffer_insert_list(&inode->i_bufferlist_lock,
+				bh, &inode->i_dirty_data_buffers);
 }

 #define atomic_set_buffer_dirty(bh) test_and_set_bit(BH_Dirty, &(bh)->b_state)
@@ -1322,8 +1319,6 @@ static inline void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode
 	buffer_insert_inode_queue(bh, inode);
 }

-extern void set_buffer_flushtime(struct buffer_head *);
-extern void balance_dirty(void);
 extern int check_disk_change(kdev_t);
 extern int invalidate_inodes(struct super_block *);
 extern int invalidate_device(kdev_t, int);
@@ -1334,8 +1329,6 @@ extern void invalidate_inode_buffers(struct inode *);
 #define destroy_buffers(dev)	__invalidate_buffers((dev), 1)
 extern void invalidate_bdev(struct block_device *, int);
 extern void __invalidate_buffers(kdev_t dev, int);
-extern void sync_inodes(void);
-extern void sync_unlocked_inodes(void);
 extern void write_inode_now(struct inode *, int);
 extern int sync_buffers(struct block_device *, int);
 extern int fsync_dev(kdev_t);
@@ -1343,15 +1336,16 @@ extern int fsync_bdev(struct block_device *);
 extern int fsync_super(struct super_block *);
 extern int fsync_no_super(struct block_device *);
 extern void sync_inodes_sb(struct super_block *);
-extern int osync_buffers_list(struct list_head *);
-extern int fsync_buffers_list(struct list_head *);
+extern int fsync_buffers_list(spinlock_t *lock, struct list_head *);
 static inline int fsync_inode_buffers(struct inode *inode)
 {
-	return fsync_buffers_list(&inode->i_dirty_buffers);
+	return fsync_buffers_list(&inode->i_bufferlist_lock,
+					&inode->i_dirty_buffers);
 }
 static inline int fsync_inode_data_buffers(struct inode *inode)
 {
-	return fsync_buffers_list(&inode->i_dirty_data_buffers);
+	return fsync_buffers_list(&inode->i_bufferlist_lock,
+					&inode->i_dirty_data_buffers);
 }
 extern int inode_has_buffers(struct inode *);
 extern int filemap_fdatasync(struct address_space *);
@@ -1452,6 +1446,7 @@ static inline struct inode *iget(struct super_block *sb, unsigned long ino)
 	return iget4(sb, ino, NULL, NULL);
 }

+extern void __iget(struct inode * inode);
 extern void clear_inode(struct inode *);
 extern struct inode *new_inode(struct super_block *);
 extern void remove_suid(struct dentry *);
@@ -1539,6 +1534,7 @@ static inline void map_bh(struct buffer_head *bh, struct super_block *sb, int bl
 	bh->b_bdev = sb->s_bdev;
 	bh->b_blocknr = block;
 }
+
 extern void wakeup_bdflush(void);
 extern void put_unused_buffer_head(struct buffer_head * bh);
 extern struct buffer_head * get_unused_buffer_head(int async);
@@ -1549,9 +1545,7 @@ typedef int (get_block_t)(struct inode*,sector_t,struct buffer_head*,int);

 /* Generic buffer handling for block filesystems.. */
 extern int try_to_release_page(struct page * page, int gfp_mask);
-extern int discard_bh_page(struct page *, unsigned long, int);
-#define block_flushpage(page, offset) discard_bh_page(page, offset, 1)
-#define block_invalidate_page(page) discard_bh_page(page, 0, 0)
+extern int block_flushpage(struct page *page, unsigned long offset);
 extern int block_symlink(struct inode *, const char *, int);
 extern int block_write_full_page(struct page*, get_block_t*);
 extern int block_read_full_page(struct page*, get_block_t*);
@@ -1579,6 +1573,8 @@ extern loff_t generic_file_llseek(struct file *file, loff_t offset, int origin);
 extern loff_t remote_llseek(struct file *file, loff_t offset, int origin);
 extern int generic_file_open(struct inode * inode, struct file * filp);

+extern int generic_vm_writeback(struct page *page, int *nr_to_write);
+
 extern struct file_operations generic_ro_fops;

 extern int vfs_readlink(struct dentry *, char *, int, const char *);
@@ -1636,6 +1632,9 @@ static inline ino_t parent_ino(struct dentry *dentry)
 	return res;
 }

+void __buffer_error(char *file, int line);
+#define buffer_error() __buffer_error(__FILE__, __LINE__)
+
 #endif /* __KERNEL__ */

 #endif /* _LINUX_FS_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5f1c731ddde1..b548d2cd8504 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -361,8 +361,6 @@ static inline void set_page_zone(struct page *page, unsigned long zone_num)

 #endif /* CONFIG_HIGHMEM || WANT_PAGE_VIRTUAL */

-extern void FASTCALL(set_page_dirty(struct page *));
-
 /*
  * Error return values for the *_nopage functions
  */
@@ -405,6 +403,26 @@ extern int ptrace_check_attach(struct task_struct *task, int kill);
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long start, int len, int write, int force,
 		struct page **pages, struct vm_area_struct **vmas);
+int __set_page_dirty_buffers(struct page *page);
+int __set_page_dirty_nobuffers(struct page *page);
+
+/*
+ * If the mapping doesn't provide a set_page_dirty a_op, then
+ * just fall through and assume that it wants buffer_heads.
+ * FIXME: make the method unconditional.
+ */
+static inline int set_page_dirty(struct page *page)
+{
+	if (page->mapping) {
+		int (*spd)(struct page *);
+
+		spd = page->mapping->a_ops->set_page_dirty;
+		if (spd)
+			return (*spd)(page);
+	}
+	return __set_page_dirty_buffers(page);
+}
+
 /*
  * On a two-level page table, this ends up being trivial. Thus the
  * inlining and the symmetry break with pte_alloc_map() that does all
@@ -496,6 +514,9 @@ extern void truncate_inode_pages(struct address_space *, loff_t);
 extern int filemap_sync(struct vm_area_struct *, unsigned long, size_t, unsigned int);
 extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int);

+/* mm/page-writeback.c */
+int generic_writeback_mapping(struct address_space *mapping, int *nr_to_write);
+
 /* readahead.c */
 #define VM_MAX_READAHEAD	128	/* kbytes */
 #define VM_MIN_READAHEAD	16	/* kbytes (includes current page) */
@@ -550,9 +571,6 @@ static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * m

 extern struct vm_area_struct *find_extend_vma(struct mm_struct *mm, unsigned long addr);

-extern int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0);
-extern int pdflush_flush(unsigned long nr_pages);
-
 extern struct page * vmalloc_to_page(void *addr);

 extern unsigned long get_page_cache_size(void);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8a4826427f7f..09056c01bc8c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -368,8 +368,7 @@ do { if (atomic_dec_and_test(&(tsk)->usage)) __put_task_struct(tsk); } while(0)
 #define PF_MEMALLOC	0x00000800	/* Allocating memory */
 #define PF_MEMDIE	0x00001000	/* Killed for out-of-memory */
 #define PF_FREE_PAGES	0x00002000	/* per process page freeing */
-#define PF_NOIO		0x00004000	/* avoid generating further I/O */
-#define PF_FLUSHER	0x00008000	/* responsible for disk writeback */
+#define PF_FLUSHER	0x00004000	/* responsible for disk writeback */

 /*
  * Ptrace flags
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 287faa9dc620..86eb09dfca0d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -97,6 +97,7 @@ extern int nr_swap_pages;
 extern unsigned int nr_free_pages(void);
 extern unsigned int nr_free_buffer_pages(void);
+extern unsigned int nr_free_pagecache_pages(void);
 extern int nr_active_pages;
 extern int nr_inactive_pages;
 extern atomic_t nr_async_pages;
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 30caa40c26be..2f25df04d925 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -133,7 +133,7 @@ enum
 	VM_SWAPCTL=1,		/* struct: Set vm swapping control */
 	VM_SWAPOUT=2,		/* int: Linear or sqrt() swapout for hogs */
 	VM_FREEPG=3,		/* struct: Set free page thresholds */
-	VM_BDFLUSH=4,		/* struct: Control buffer cache flushing */
+	VM_BDFLUSH_UNUSED=4,	/* Spare */
 	VM_OVERCOMMIT_MEMORY=5,	/* Turn off the virtual memory safety limit */
 	VM_BUFFERMEM=6,		/* struct: Set buffer memory thresholds */
 	VM_PAGECACHE=7,		/* struct: Set cache memory thresholds */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
new file mode 100644
index 000000000000..1978e06d1131
--- /dev/null
+++ b/include/linux/writeback.h
@@ -0,0 +1,53 @@
+/*
+ * include/linux/writeback.h.
+ *
+ * These declarations are private to fs/ and mm/.
+ * Declarations which are exported to filesystems do not
+ * get placed here.
+ */
+#ifndef WRITEBACK_H
+#define WRITEBACK_H
+
+extern spinlock_t inode_lock;
+extern struct list_head inode_in_use;
+extern struct list_head inode_unused;
+
+/*
+ * fs/fs-writeback.c
+ */
+#define WB_SYNC_NONE	0	/* Don't wait on anything */
+#define WB_SYNC_LAST	1	/* Wait on the last-written mapping */
+#define WB_SYNC_ALL	2	/* Wait on every mapping */
+
+void try_to_writeback_unused_inodes(unsigned long pexclusive);
+void writeback_single_inode(struct inode *inode,
+				int sync, int *nr_to_write);
+void writeback_unlocked_inodes(int *nr_to_write, int sync_mode,
+				unsigned long *older_than_this);
+void writeback_inodes_sb(struct super_block *);
+void __wait_on_inode(struct inode * inode);
+void sync_inodes(void);
+
+static inline void wait_on_inode(struct inode *inode)
+{
+	if (inode->i_state & I_LOCK)
+		__wait_on_inode(inode);
+}
+
+/*
+ * mm/page-writeback.c
+ */
+/*
+ * How much data to write out at a time in various places.  This isn't
+ * really very important - it's just here to prevent any thread from
+ * locking an inode for too long and blocking other threads which wish
+ * to write the same file for allocation throttling purposes.
+ */
+#define WRITEOUT_PAGES	((4096 * 1024) / PAGE_CACHE_SIZE)
+
+void balance_dirty_pages(struct address_space *mapping);
+void balance_dirty_pages_ratelimited(struct address_space *mapping);
+int pdflush_flush(unsigned long nr_pages);
+int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0);
+
+#endif /* WRITEBACK_H */
```
