|
With silly pageout testcases it is possible to place huge amounts of memory
under I/O. With a large request queue (CFQ uses 8192 requests) it is
possible to place _all_ memory under I/O at the same time.
This means that all memory is pinned and unreclaimable and the VM gets
upset and goes oom.
The patch limits the amount of memory which is under pageout writeout to be
a little more than the amount of memory at which balance_dirty_pages()
callers will synchronously throttle.
This means that heavy pageout activity can starve heavy writeback activity
completely, but heavy writeback activity will not cause starvation of
pageout, because we don't want a simple `dd' to cause excessive
latencies in page reclaim.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The O_SYNC speedup patches missed the generic_file_xxx_nolock cases, which
means that pages weren't actually getting sync'ed in those cases. This
patch fixes that.
Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Eliminate the inode waitqueue hashtable using bit_waitqueue() via
wait_on_bit() and wake_up_bit() to locate the waitqueue head associated
with a bit.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add a whole bunch more might_sleep() checks. We also enable might_sleep()
checking in copy_*_user(). This was non-trivial because the "copy_*_user()
in atomic regions" trick would generate false positives. Fix that up by
adding a new __copy_*_user_inatomic(), which avoids the might_sleep() check.
Only i386 is supported in this patch.
With: Arjan van de Ven <arjanv@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
We assign 0 and 1 to it, but since it's signed, that's
actually already overflowing the poor thing. So make
it unsigned, which is what it really was supposed to be
in the first place.
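For illustration only - assuming the field in question is a one-bit bitfield,
which is the usual way an assignment of 1 overflows a signed type - the bug
pattern looks like:
    struct example {
        signed int flag:1;      /* can only hold 0 and -1 */
    };

    struct example e;

    e.flag = 1;                 /* overflows; e.flag reads back as -1 */
    /* so (e.flag == 1) is never true */

    /* the fix: make it unsigned */
    struct example_fixed {
        unsigned int flag:1;    /* holds 0 and 1 as intended */
    };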
|
|
In databases it is common to have multiple threads or processes performing
O_SYNC writes against different parts of the same file.
Our performance at this is poor, because each writer blocks access to the
file by waiting on I/O completion while holding i_sem: everything is
serialised.
The patch improves things by moving the writing and waiting outside i_sem.
So other threads can get in and submit their I/O and permit the disk
scheduler to optimise the IO patterns better.
Also, the O_SYNC writer only writes and waits on the pages which he wrote,
rather than writing and waiting on all dirty pages in the file.
The reason we haven't been able to do this before is that the required walk
of the address_space page lists is easily livelockable without the i_sem
serialisation. But in this patch we perform the waiting via a radix-tree
walk of the affected pages. This cannot be livelocked.
The sync of the inode's metadata is still performed inside i_sem. This is
because it is list-based and is hence still livelockable. However it is
usually the case that databases are overwriting existing file blocks and
there will be no dirty buffers attached to the address_space anyway.
The code is careful to ensure that the IO for the pages and the IO for the
metadata are nonblockingly scheduled at the same time. This is an improvement
over the current code, which will issue two separate write-and-wait cycles:
one for metadata, one for pages.
Note from Suparna:
Reworked to use the tagged radix-tree based writeback infrastructure.
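Schematically, the new O_SYNC write path looks something like this (a sketch
only; the helper names below are illustrative rather than the exact ones in
the patch):
    down(&inode->i_sem);
    written = do_the_buffered_write(file, buf, count, pos);  /* dirty pages */
    write_inode_metadata_nonblocking(inode);  /* schedule metadata I/O too */
    up(&inode->i_sem);

    /* outside i_sem: write and wait on just the pages this caller
     * dirtied, found by a tagged radix-tree walk over [pos, pos+written),
     * so concurrent O_SYNC writers can keep the disk queue busy */
    err = write_and_wait_page_range(mapping, pos, written);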
Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Modify mpage_writepages to optionally only write back dirty pages within a
specified range in a file (as in the case of O_SYNC). Cheat a little to avoid
changes to prototypes of aops - just put the <start, end> hint into the
writeback_control struct instead. If <start, end> are not set, then default
to writing back all the mapping's dirty pages.
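Usage then looks roughly like this (a sketch; the patch carries the range in
writeback_control rather than changing the aop prototypes, and the field
names follow the description above):
    struct writeback_control wbc = {
        .sync_mode   = WB_SYNC_ALL,
        .nr_to_write = LONG_MAX,
        .start       = pos,                 /* byte range to write back */
        .end         = pos + count - 1,     /* 0/0 means "whole file" */
    };

    ret = mpage_writepages(mapping, &wbc, get_block);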
Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Nobody ever fixed the big FIXME in sysctl - but we really need
to pass around the proper "loff_t *" to all the sysctl functions
if we want them to be well-behaved wrt the file pointer position.
This is all preparation for making direct f_pos accesses go
away.
|
|
From: Bart Samwel <bart@samwel.tk>
Adds /proc/sys/vm/laptop-mode: a special knob which says "this is a laptop".
In this mode the kernel will attempt to avoid spinning disks up.
Algorithm: the idea is to hold dirty data in memory for a long time, but to
flush everything which has been accumulated if the disk happens to spin up
for other reasons.
- Whenever a disk request completes (read or write), schedule a timer a few
seconds hence. If the timer was already pending, reset it to a few seconds
hence.
- When the timer expires, write back the whole world. We use
sync_filesystems() for this because it will force ext3 journal commits as
well.
- In balance_dirty_pages(), kick off background writeback when we hit the
high threshold (dirty_ratio), not when we hit the low threshold. This has
the effect of causing "lumpy" writeback which is something I spent a year
fixing, but in laptop mode, it is desirable.
- In try_to_free_pages(), only kick pdflush if the VM is getting into
distress: we want to keep scanning for clean pages, deferring writeback.
- In page reclaim, avoid writing back the odd random dirty page off the
LRU: only start I/O if the scanning is working harder.
The effect is to perform a sync() a few seconds after all I/O has ceased.
The value which was written into /proc/sys/vm/laptop-mode determines, in
seconds, the delay between the final I/O and the flush.
Additionally, the patch adds tools which help answer the question "why the
heck does my disk spin up all the time?". The user may set
/proc/sys/vm/block_dump to a non-zero value and the kernel will print out
information which will identify the process which is performing disk reads or
which is dirtying pagecache.
The user should probably disable syslogd before setting block_dump.
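The heart of it is a single timer which is re-armed on every I/O completion,
something like the sketch below (assuming a laptop_mode variable holding the
delay in seconds; the helper names are approximate):
    /* called from the block layer whenever a read or write completes */
    void laptop_io_completion(void)
    {
        /* push the flush out to laptop_mode seconds after the most
         * recent I/O; re-arming a pending timer simply resets it */
        mod_timer(&laptop_mode_wb_timer, jiffies + laptop_mode * HZ);
    }

    /* timer expiry: write back the whole world */
    static void laptop_flush(unsigned long unused)
    {
        sys_sync();     /* sync_filesystems() underneath, so ext3
                         * journal commits are forced as well */
    }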
|
|
If pdflush hits a locked-and-clean buffer in __block_write_full_page() it
will just pass over the buffer. Typically the buffer is an ext3 data=ordered
buffer which is being written by kjournald, but a similar thing can happen
with blockdev buffers and ll_rw_block().
This is bad because the buffer is still under I/O and a subsequent fsync's
fdatawait() needs to know about it.
It is not practical to tag the page for writeback - only the submitter of the
I/O can do that, because the submitter has control of the end_io handler.
So instead, redirty the page so a subsequent fsync's fdatawrite() will wait on
the underway I/O.
There is a risk that pdflush::background_writeout() will lock up, repeatedly
trying and failing to write the same page. This is prevented by ensuring
that background_writeout() always throttles when it made no progress.
|
|
From: Robert Love <rml@tech9.net>
- Let real-time tasks dip further into the reserves than usual in
__alloc_pages(). There are a lot of ways to special case this. This
patch just cuts z->pages_low in half, before doing the incremental min
thing, for real-time tasks. I do not do anything in the low memory slow
path. We can be a _lot_ more aggressive if we want. Right now, we just
give real-time tasks a little help.
- Never ever call balance_dirty_pages() on a real-time task. Where and
how exactly we handle this is up for debate. We could, for example,
special case real-time tasks inside balance_dirty_pages(). This would
allow us to perform some of the work (say, waking up pdflush) but not
other work (say, the active throttling). As it stands now, we do the
per-processor accounting in balance_dirty_pages_ratelimited() but we
never call balance_dirty_pages(). Lots of approaches work. What we want
to do is never engage the real-time task in forced writeback.
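Sketched in code, the two changes above amount to roughly the following
(rt_task() being the usual test for a real-time scheduling policy; details
simplified, and alloc_from_zone() is an illustrative helper):
    /* __alloc_pages(): let RT tasks dip further into the reserves */
    min = z->pages_low;
    if (rt_task(current))
        min /= 2;               /* halve the watermark before the
                                 * incremental min calculation */
    if (z->free_pages >= min)
        page = alloc_from_zone(z, order);

    /* balance_dirty_pages_ratelimited(): keep the per-cpu accounting,
     * but never pull a real-time task into forced writeback */
    if (rt_task(current))
        return;
    balance_dirty_pages(mapping);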
|
|
|
|
This /proc tunable sets the kupdate interval. It has a couple of problems:
- No way to turn it off completely (userspace dirty memory management
solutions require this).
- If it has been set to one hour and then the user resets it to five
seconds, that resetting will not take effect for up to an hour.
Fix that up by providing a sysctl handler. Setting the tunable to zero now
disables the kupdate function.
|
|
Patch from Anders Gustafsson <andersg@0x63.nu>
We're getting a division-by-zero in the writeback code during early rootfs
population, because writeback has not yet been initialised.
Fix that by performing an explicit initialisation rather than relying on
initcall ordering.
|
|
current->flags:PF_SYNC was a hack I added because I didn't want to
change all ->writepage implementations.
It's foul. And it means that if someone happens to run direct page
reclaim within the context of (say) sys_sync, the writepage invocations
from the VM will be treated as "data integrity" operations, not "memory
cleansing" operations, which would cause latency.
So the patch removes PF_SYNC and adds an extra arg to a_ops->writepage.
It is the `writeback_control' structure which contains the full context
information about why writepage was called.
The initial version of this patch just passed in a bare `int sync', but
the XFS team need more info so they can perform writearound from within
page reclaim.
The patch also adds writeback_control.for_reclaim, so writepage
implementations can inspect that to work out the call context rather
than peeking at current->flags:PF_MEMALLOC.
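Concretely, a writepage implementation now sees the full calling context,
along these lines (a sketch; the sync_mode test is illustrative of a
data-integrity caller):
    static int example_writepage(struct page *page,
                                 struct writeback_control *wbc)
    {
        if (wbc->for_reclaim) {
            /* called from page reclaim ("memory cleansing"):
             * free to do writearound, or to back out entirely */
        } else if (wbc->sync_mode == WB_SYNC_ALL) {
            /* data integrity caller: must not skip anything */
        }
        /* ... perform the actual writeout ... */
        return 0;
    }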
|
|
I've revisited all the superblock->inode->page writeback paths. There
were several silly things in there, and things were not as clear as they
could be.
scenario 1: create and dirty a MAP_SHARED segment over a sparse file,
then exit.
All the memory turns into dirty pagecache, but the kupdate function
only writes it out at a trickle - 4 megabytes every thirty seconds.
We should sync it all within 30 seconds.
What's happening is that when writeback tries to write those pages,
the filesystem needs to instantiate new blocks for them (they're over
holes). The filesystem runs mark_inode_dirty() within the writeback
function.
This redirtying of the inode while we're writing it out triggers
some livelock avoidance code in __sync_single_inode(). That function
says "ah, someone redirtied the file while I was writing it. Let's
move the file to the new end of the superblock dirty list and write
it out later." Problem is, writeback dirtied the inode itself.
(It is rather silly that mark_inode_dirty() sets I_DIRTY_PAGES when
clearly no pages have been dirtied. Fixing that up would be a
largish work, so work around it here).
So this patch just removes the livelock avoidance from
__sync_single_inode(). It is no longer needed anyway - writeback
livelock is now avoided (in all writeback paths) by writing a finite
number of pages.
scenario 2: an application is continuously dirtying a 200 megabyte
file, and your disk has a bandwidth of less than 40 megabytes/sec.
What happens is that once 30 seconds passes, pdflush starts writing
out the file. And because that writeout will take more than five
seconds (a `kupdate' interval), pdflush just keeps writing it out
forever - continuous I/O.
What we _want_ to happen is that the 200 megabytes gets written,
and then IO stops for thirty seconds (minus the writeout period). So
the file is fully synced every thirty seconds.
The patch solves this by using mapping->io_pages more intelligently.
When the time comes to write the file out, move all the dirty pages
onto io_pages. That is a "batch of pages for this kupdate round".
When io_pages is empty, we know we're done.
The address_space_operations.writepages() API is changed! It now only
needs to write the pages which the caller placed on mapping->io_pages.
This conceptually cleans things up a bit, by more clearly defining the
role of ->io_pages, and the motion between the various mapping lists.
The treatment of sb->s_dirty and sb->s_io is now conceptually identical
to mapping->dirty_pages and mapping->io_pages: move the items-to-be-written
onto ->s_io/io_pages, and walk that list. As inodes (or pages)
are written, move them over to the clean/locked/dirty lists.
Oh, scenario 3: start an app which continuously overwrites a 5 meg
file. Wait five seconds, start another, wait 5 seconds, start another.
What we _should_ see is three 5-meg writes, five seconds apart, every
thirty seconds. That did all sorts of odd things. It now does the
right thing.
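The io_pages batching from scenario 2 works roughly like this (illustrative
pseudocode of the idea, not the literal patch):
    /* when a kupdate/background pass decides to write this mapping: */
    spin_lock(&mapping->page_lock);
    list_splice_init(&mapping->dirty_pages, &mapping->io_pages);
    spin_unlock(&mapping->page_lock);

    /* ->writepages() now writes only what is on io_pages.  Pages
     * dirtied after the splice land on dirty_pages and wait for the
     * next round, so we cannot livelock, and an empty io_pages list
     * means this batch is complete. */
    mapping->a_ops->writepages(mapping, &wbc);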
|
|
fail_writepage() does not work. Its activate_page() call cannot
activate the page because it is not on the LRU.
So perform that function (more efficiently) in the VM. Remove
fail_writepage() and, if the filesystem does not implement
->writepage() then activate the page from shrink_list().
A special case is tmpfs, which does have a writepage, but which
sometimes wants to activate the pages anyway. The most important case
is when there is no swap online and we don't want to keep all those
pages on the inactive list. So just as a tmpfs special-case, allow
writepage() to return WRITEPAGE_ACTIVATE, and handle that in the VM.
Also, the whole idea of allowing ->writepage() to return -EAGAIN, and
handling that in the caller has been reverted. If a writepage()
implementation wants to back out and not write the page, it must
redirty the page, unlock it and return zero. (This is Hugh's preferred
way).
And remove the now-unneeded shmem_writepages() - shmem inodes are
marked as `memory backed' so it will not be called.
And remove the test for non-null ->writepage() in generic_file_mmap().
Memory-backed files _are_ mmappable, and they do not have a
writepage(). It just isn't called.
So the locking rules for writepage() are unchanged. They are:
- Called with the page locked
- Returns with the page unlocked
- Must redirty the page itself if it wasn't all written.
But there is a new, special, hidden, undocumented, secret hack for
tmpfs: writepage may return WRITEPAGE_ACTIVATE to tell the VM to move
the page to the active list. The page must be kept locked in this one
case.
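Put together, a writepage that wants to back out now follows this shape (a
sketch; the helper predicates are made up for illustration):
    static int example_writepage(struct page *page)
    {
        if (cannot_write_this_page_now()) {
            /* the standard back-out: redirty, unlock, report success */
            set_page_dirty(page);
            unlock_page(page);
            return 0;
        }

        if (tmpfs_wants_page_activated()) {
            /* tmpfs-only hack: tell the VM to move the page to the
             * active list.  The page stays locked in this one case. */
            return WRITEPAGE_ACTIVATE;
        }

        /* ... normal writeout; return with the page unlocked ... */
        return 0;
    }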
|
|
Since /proc/sys/vm/dirty_sync_ratio went away, the name
"dirty_async_ratio" makes no sense.
So rename it to just /proc/sys/vm/dirty_ratio.
|
|
Convert the VM to not wait on other people's dirty data.
- If we find a dirty page and its queue is not congested, do some writeback.
- If we find a dirty page and its queue _is_ congested then just
refile the page.
- If we find a PageWriteback page then just refile the page.
- There is additional throttling for write(2) callers. Within
generic_file_write(), record their backing queue in ->current.
Within page reclaim, if this task encounters a page which is dirty
or under writeback on this queue, block on it. This gives some more
writer throttling and reduces the page refiling frequency.
It's somewhat CPU expensive - under really heavy load we only get a 50%
reclaim rate in pages coming off the tail of the LRU. This can be
fixed by splitting the inactive list into reclaimable and
non-reclaimable lists. But the CPU load isn't too bad, and latency is
much, much more important in these situations.
Example: with `mem=512m', running 4 instances of `dbench 100', 2.5.34
took 35 minutes to compile a kernel. With this patch, it took three
minutes, 45 seconds.
I haven't done swapcache or MAP_SHARED pages yet. If there's tons of
dirty swapcache or mmap data around we still stall heavily in page
reclaim. That's less important.
This patch also has a tweak for swapless machines: don't even bother
bringing anon pages onto the inactive list if there is no swap online.
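In shrink_list() terms, the per-page policy becomes roughly the following (a
simplified sketch; start_page_writeback() is an illustrative helper):
    if (PageWriteback(page))
        goto keep;                      /* someone else's I/O: refile,
                                         * don't wait on it */

    if (PageDirty(page)) {
        if (bdi_write_congested(mapping->backing_dev_info))
            goto keep;                  /* queue congested: refile */

        /* queue has room: start writeback on this page and let a
         * later pass reclaim it once it comes clean */
        start_page_writeback(page);
        goto keep;
    }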
|
|
The key concept here is that pdflush does not block on request queues
any more. Instead, it circulates across the queues, keeping any
non-congested queues full of write data. When all queues are full,
pdflush takes a nap, to be woken when *any* queue exits write
congestion.
This code can keep sixty spindles saturated - we've never been able to
do that before.
- Add the `nonblocking' flag to struct writeback_control, and teach
the writeback paths to honour it.
- Add the `encountered_congestion' flag to struct writeback_control
and teach the writeback paths to set it.
So as soon as a mapping's backing_dev_info indicates that it is getting
congested, bale out of writeback. And don't even start writeback
against filesystems whose queues are congested.
- Convert pdflush's background_writeback() function to use
nonblocking writeback.
This way, a single pdflush thread will circulate around all the
dirty queues, keeping them filled.
- Convert the pdflush `kupdate' function to do the same thing.
This solves the problem of pdflush thread pool exhaustion.
It solves the problem of pdflush startup latency.
It solves the (minor) problem wherein `kupdate' writeback only writes
back a single disk at a time (it was getting blocked on each queue in
turn).
It probably means that we only ever need a single pdflush thread.
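The background writeback function then becomes a loop of nonblocking passes,
something like this (a sketch following the description above; the exit
condition against the dirty threshold is omitted):
    static void background_writeout(unsigned long _min_pages)
    {
        long min_pages = _min_pages;
        struct writeback_control wbc = {
            .sync_mode   = WB_SYNC_NONE,
            .nonblocking = 1,           /* never block on a request queue */
        };

        while (min_pages > 0) {
            wbc.encountered_congestion = 0;
            wbc.nr_to_write = MAX_WRITEBACK_PAGES;

            writeback_inodes(&wbc);     /* skips congested queues */
            min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;

            if (wbc.encountered_congestion) {
                /* every non-congested queue is now full: nap until
                 * any queue exits write congestion, then go around */
                blk_congestion_wait(WRITE, HZ/10);
            }
        }
    }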
|
|
This was designed to be a really stern throttling threshold: if dirty
memory reaches this level then perform writeback and actually wait on
it.
It doesn't work, because memory dirtiers are required to perform
writeback if the amount of dirty AND writeback memory exceeds
dirty_async_ratio.
So kill it, and rely just on the request queues being appropriately
scaled to the machine size (they are).
This is basically what 2.4 does.
|
|
The writeback code paths which walk the superblocks and inodes are
getting an increasing number of arguments passed to them.
The patch wraps those args into the new `struct writeback_control',
and uses that instead. There is no functional change.
The new writeback_control structure is passed down through the
writeback paths in the place where the old `nr_to_write' pointer used
to be.
writeback_control will be used to pass new information up and down the
writeback paths. Such as whether the writeback should be non-blocking,
and whether queue congestion was encountered.
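In outline, the structure looks like this (a sketch; the exact membership
grows as later patches in this series add fields):
    struct writeback_control {
        enum writeback_sync_modes sync_mode;
        unsigned long *older_than_this; /* kupdate: only expire old data */
        long nr_to_write;               /* how many pages to write;
                                         * decremented as pages go out */
        int nonblocking;                /* don't get stuck on queues */
        int encountered_congestion;     /* output: a queue was congested */
    };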
|
|
This is a performance and correctness fix against the writeback paths.
The writeback code has competing requirements. Sometimes it is used
for "memory cleansing": kupdate, bdflush, writer throttling, page
allocator writeback, etc. And sometimes this same code is used for
data integrity purposes: fsync, msync, fdatasync, sync, umount, various
other kernel-internal uses.
The problem is: how to handle a dirty buffer or page which is currently
under writeback.
For memory cleansing, we just want to skip that buffer/page and go onto
the next one. But for sync, we must wait on the old writeback and then
start new writeback.
mpage_writepages() is currently correct for cleansing, but incorrect for
sync. block_write_full_page() is currently correct for sync, but
inefficient for cleansing.
The fix is fairly simple.
- In mpage_writepages(), don't skip the page if it's a sync
operation.
- In block_write_full_page(), skip the buffer if it is a memory-cleansing
(non-sync) operation, and return -EAGAIN to tell the caller that the writeout
didn't work out. The caller must then set the page dirty again and
move it onto mapping->dirty_pages.
This is an extension of the writepage API: writepage can now return
EAGAIN. There are only three callers, and they have been updated.
fail_writepage() and ext3_writepage() were actually doing this by
hand. They have been changed to return -EAGAIN. NTFS will want to
be able to return -EAGAIN from its writepage as well.
- A sticky question is: how to tell the writeout code which mode it
is operating in? Cleansing or sync?
It's such a tiny code change that I didn't have the heart to go and
propagate a `mode' argument down every instance of writepages() and
writepage() in the kernel. So I passed it in via current->flags.
Incidentally, the occurrence of a locked-and-dirty buffer in
block_write_full_page() is fairly rare: normally the collision avoidance
happens at the address_space level, via PageWriteback. But some
mappings (blockdevs, ext3 files, etc) have their dirty buffers written
out via submit_bh(). It is these buffers which can stall
block_write_full_page().
This wart will be pretty intrusive to fix. ext3 needs to become fully
page-based (ugh. It's a block-based journalling filesystem, and pages
are unnatural). blockdev mappings are still written out by buffers
because that's how filesystems use them. Putting _all_ metadata
(indirects, inodes, superblocks, etc) into standalone address_spaces
would fix that up.
- filemap_fdatawrite() sets PF_SYNC. So filemap_fdatawrite() is the
kernel function which will start writeback against a mapping for
"data integrity" purposes, whereas the unexported, internal-only
do_writepages() is the writeback function which is used for memory
cleansing. This difference is the reason why I didn't consolidate
those functions ages ago...
- Lots of code paths had a bogus extra call to filemap_fdatawait(),
which I previously added in a moment of weak-headedness. They have
all been removed.
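The mode is carried in the task flags, roughly like so (a sketch; declarations
and argument lists are abbreviated, and the "remove PF_SYNC" patch above later
replaces this with an explicit writeback_control argument):
    /* filemap_fdatawrite(): the "data integrity" entry point */
    current->flags |= PF_SYNC;
    ret = do_writepages(mapping, &nr_to_write);   /* or the equivalent */
    current->flags &= ~PF_SYNC;

    /* in block_write_full_page(), on meeting a locked buffer: */
    if (current->flags & PF_SYNC) {
        wait_on_buffer(bh);     /* sync: wait, then write it again */
    } else {
        return -EAGAIN;         /* cleansing: skip; the caller redirties
                                 * the page and moves it back onto
                                 * mapping->dirty_pages */
    }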
|
|
|
|
Writeback/pdflush cleanup patch from Steven Augart
* Exposes nr_pdflush_threads as /proc/sys/vm/nr_pdflush_threads, read-only.
(I like this - I expect that management of the pdflush thread pool
will be important for many-spindle machines, and this is a neat way
of getting at the info).
* Adds minimum and maximum checking to the five writable pdflush
and fs-writeback parameters.
* Minor indentation fix in sysctl.c
* mm/pdflush.c now includes linux/writeback.h, which prototypes
pdflush_operation. This is so that the compiler can
automatically check that the prototype matches the definition.
* Adds a few comments to existing code.
|
|
- Comment and documentation fixlets
- Remove some unneeded fields from swapper_inode (these are a
leftover from when I had swap using the filesystem IO functions).
- fix a printk bug in pci/pool.c: when dma_addr_t is 64 bit it
generates a compile warning, and will print out garbage. Cast it to
unsigned long long.
- Convert some writeback #defines into enums (Steven Augart)
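The pci/pool.c fix is the standard idiom for printing a dma_addr_t whose
width varies between architectures (shown as a sketch):
    dma_addr_t dma;

    /* %llx plus an explicit cast is correct whether dma_addr_t is
     * 32 or 64 bits wide, and silences the compile warning */
    printk(KERN_ERR "pci_pool: buffer at dma 0x%llx\n",
           (unsigned long long)dma);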
|
|
Adds five sysctls for tuning the writeback behaviour:
dirty_async_ratio
dirty_background_ratio
dirty_sync_ratio
dirty_expire_centisecs
dirty_writeback_centisecs
these are described in Documentation/filesystems/proc.txt. They are
basically the traditional knobs which we've always had...
We are accreting a ton of obsolete sysctl numbers under /proc/sys/vm/.
I didn't recycle these - just mark them unused and remove the obsolete
documentation.
|
|
Remove i_wait from struct inode and hash it instead.
This is a pure space-saving exercise - 12 bytes from struct
inode on x86.
NFS was using i_wait for its own purposes. Add a wait_queue_head_t to
the fs-private inode for that. This change has been acked by Trond.
|
|
Spot the difference:
aops.readpage
aops.readpages
aops.writepage
aops.writeback_mapping
The patch renames `writeback_mapping' to `writepages'
|
|
Fixes a performance problem with many-small-file writeout.
At present, files are written out via their mapping and their indirect
blocks are written out via the blockdev mapping. As we know that
indirects are disk-adjacent to the data it is better to start I/O
against the indirects at the same time as the data.
The delalloc paths have code in ext2_writepage() which recognises when
the target page->index was at an indirect boundary and does an explicit
hunt-and-write against the neighbouring indirect block. Which is
ideal. (Unless the file was dirtied seekily and the page which is next
to the indirect was not dirtied).
This patch does it the other way: when we start writeback against a
mapping, also start writeback against any dirty buffers which are
attached to mapping->private_list. Let the elevator take care of the
rest.
The patch makes a number of tuning changes to the writeback path in
fs-writeback.c. This is very fiddly code: getting the throughput
tuned, getting the data-integrity "sync" operations right, avoiding
most of the livelock opportunities, getting the `kupdate' function
working efficiently, keeping it all at least somewhat comprehensible.
An important intent here is to ensure that metadata blocks for inodes
are marked dirty before writeback starts working the blockdev mapping,
so all the inode blocks are efficiently written back.
The patch removes try_to_writeback_unused_inodes(), which became
unreferenced in vm-writeback.patch.
The patch has a tweak in ext2_put_inode() to prevent ext2 from
incorrectly dropping its preallocation window in response to a random
iput().
Generally, many-small-file writeout is a lot faster than 2.5.7 (which
is linux-before-I-futzed-with-it). The workload which was optimised was
tar xfz /nfs/mountpoint/linux-2.4.18.tar.gz ; sync
on mem=128M and mem=2048M.
With these patches, 2.5.15 is completing in about 2/3 of the time of
2.5.7. But it is only a shade faster than 2.4.19-pre7. Why is 2.5.7
so much slower than 2.4.19? Not sure yet.
Heavy dbench loads (dbench 32 on mem=128M) are slightly faster than
2.5.7 and significantly slower than 2.4.19. It appears that the cause
is poor read throughput at the later stages of the run. Because there
are background writeback threads operating at the same time.
The 2.4.19-pre8 write scheduling manages to stop writeback during the
latter stages of the dbench run in a way which I haven't been able to
sanely emulate yet. It may not be desirable to do this anyway - it's
optimising for the case where the files are about to be deleted. But
it would be good to find a way of "pausing" the writeback for a few
seconds to allow readers to get an interval of decent bandwidth.
tiobench throughput is basically the same across all recent kernels.
CPU load on writes is down maybe 30% in 2.5.15.
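The core idea reduces to something like this (an illustrative sketch; the
helper name is made up):
    /* when writeback is started against a file's pages, also push its
     * dirty metadata buffers (indirects) at the elevator, so that the
     * disk-adjacent data and indirect blocks are merged into one
     * stream of I/O */
    if (!list_empty(&mapping->private_list))
        write_dirty_buffers_nonblocking(&mapping->private_list);

    /* ... then write the mapping's own dirty pages as usual ... */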
|
|
Tune up the VM-based writeback a bit.
- Always use the multipage clustered-writeback function from within
shrink_cache(), even if the page's mapping has a NULL ->vm_writeback(). So
clustered writeback is turned on for all address_spaces, not just ext2.
Subtle effect of this change: it is now the case that *all* writeback
proceeds along the mapping->dirty_pages list. The orderedness of the page
LRUs no longer has an impact on disk scheduling. So we only have one list
to keep well-sorted rather than two, and churning pages around on the LRU
will no longer damage write bandwidth - it's all up to the filesystem.
- Decrease the clustered writeback from 1024 pages(!) to 32 pages.
(1024 was a leftover from when this code was always dispatching writeback
to a pdflush thread).
- Fix wakeup_bdflush() so that it actually does write something (duh).
do_wp_page() needs to call balance_dirty_pages_ratelimited(), so we
throttle mmap page-dirtiers in the same way as write(2) page-dirtiers.
This may make wakeup_bdflush() obsolete, but it doesn't hurt.
- Converts generic_vm_writeback() to directly call ->writeback_mapping(),
rather than going through writeback_single_inode(). This prevents memory
allocators from blocking on the inode's I_LOCK. But it does mean that two
processes can be writing pages from the same mapping at the same time. If
filesystems care about this (for layout reasons) then they should serialise
in their ->writeback_mapping a_op.
This means that memory-allocators will writeback only pages, not pages
and inodes. There are no locks in that writeback path (except for request
queue exhaustion). Reduces memory allocation latency.
- Implement new background_writeback function, which when kicked off
will perform writeback until dirty memory falls below the background
threshold.
- Put written-back pages onto the remote end of the page LRU. It
does this in the slow-and-stupid way at present. pagemap_lru_lock
stress-relief is planned...
- Remove the funny writeback_unused_inodes() stuff from prune_icache().
Writeback from wakeup_bdflush() and the `kupdate' function now just
naturally cleanses the oldest inodes so we don't need to do anything
there.
- Dirty memory balancing is still using magic numbers: "after you
dirtied your 1,000th page, go write 1,500". Obviously, this needs
more work.
|
|
Use the pdflush exclusion infrastructure to ensure that only one
pdflush thread is ever performing writeback against a particular
request_queue.
This works rather well. It requires a lot of activity against a lot of
disks to cause more pdflush threads to start up. Possibly the
thread-creation logic is a little weak: it starts more threads when a
pdflush thread goes back to sleep. It may be better to start new
threads within pdflush_operation().
All non-request_queue-backed address_spaces share the global
default_backing_dev_info structure. So at present only a single
pdflush instance will be available for background writeback of *all*
NFS filesystems (for example).
If there is benefit in concurrent background writeback for multiple NFS
mounts then NFS would need to create per-mount backing_dev_info
structures and install those into new inodes' address_spaces in some
manner.
|
|
[ I reversed the order in which writeback walks the superblock's
dirty inodes. It sped up dbench's unlink phase greatly. I'm
such a sleaze ]
The core writeback patch. Switches file writeback from the dirty
buffer LRU over to address_space.dirty_pages.
- The buffer LRU is removed
- The buffer hash is removed (uses blockdev pagecache lookups)
- The bdflush and kupdate functions are implemented against
address_spaces, via pdflush.
- The relationship between pages and buffers is changed.
- If a page has dirty buffers, it is marked dirty
- If a page is marked dirty, it *may* have dirty buffers.
- A dirty page may be "partially dirty". block_write_full_page
discovers this.
- A bunch of consistency checks of the form
if (!something_which_should_be_true())
buffer_error();
have been introduced. These fog the code up but are important for
ensuring that the new buffer/page code is working correctly.
- New locking (inode.i_bufferlist_lock) is introduced for exclusion
from try_to_free_buffers(). This is needed because set_page_dirty
is called under spinlock, so it cannot lock the page. But it
needs access to page->buffers to set them all dirty.
i_bufferlist_lock is also used to protect inode.i_dirty_buffers.
- fs/inode.c has been split: all the code related to file data writeback
has been moved into fs/fs-writeback.c
- Code related to file data writeback at the address_space level is in
the new mm/page-writeback.c
- try_to_free_buffers() is now non-blocking
- Switches vmscan.c over to understand that all pages with dirty data
are now marked dirty.
- Introduces a new a_op for VM writeback:
->vm_writeback(struct page *page, int *nr_to_write)
this is a bit half-baked at present. The intent is that the address_space
is given the opportunity to perform clustered writeback. To allow it to
opportunistically write out disk-contiguous dirty data which may be in
other zones.
To allow delayed-allocate filesystems to get good disk layout.
- Added address_space.io_pages. Pages which are being prepared for
writeback. This is here for two reasons:
1: It will be needed later, when BIOs are assembled direct
against pagecache, bypassing the buffer layer. It avoids a
deadlock which would occur if someone moved the page back onto the
dirty_pages list after it was added to the BIO, but before it was
submitted. (hmm. This may not be a problem with PG_writeback logic).
2: Avoids a livelock which would occur if some other thread is continually
redirtying pages.
- There are two known performance problems in this code:
1: Pages which are locked for writeback cause undesirable
blocking when they are being overwritten. A patch which leaves
pages unlocked during writeback comes later in the series.
2: While inodes are under writeback, they are locked. This
causes namespace lookups against the file to get unnecessarily
blocked in wait_on_inode(). This is a fairly minor problem.
I don't have a fix for this at present - I'll fix this when I
attach dirty address_spaces direct to super_blocks.
- The patch vastly increases the amount of dirty data which the
kernel permits highmem machines to maintain. This is because the
balancing decisions are made against the amount of memory in the
machine, not against the amount of buffercache-allocatable memory.
This may be very wrong, although it works fine for me (2.5 gigs).
We can trivially go back to the old-style throttling with
s/nr_free_pagecache_pages/nr_free_buffer_pages/ in
balance_dirty_pages(). But better would be to allow blockdev
mappings to use highmem (I'm thinking about this one, slowly). And
to move writer-throttling and writeback decisions into the VM (modulo
the file-overwriting problem).
- Drops 24 bytes from struct buffer_head. More to come.
- There's some gunk like super_block.flags:MS_FLUSHING which needs to
be killed. Need a better way of providing collision avoidance
between pdflush threads, to prevent more than one pdflush thread
working a disk at the same time.
The correct way to do that is to put a flag in the request queue to
say "there's a pdlfush thread working this disk". This is easy to
do: just generalise the "ra_pages" pointer to point at a struct which
includes ra_pages and the new collision-avoidance flag.
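For reference, the new hook sits alongside the existing aops like this (a
sketch; the clustering policy itself is left to each filesystem):
    struct address_space_operations {
        int (*writepage)(struct page *page);
        int (*readpage)(struct file *file, struct page *page);
        /* ... */

        /* Give the address_space a chance to do clustered writeback
         * around this page: write up to *nr_to_write dirty, preferably
         * disk-contiguous pages, decrementing *nr_to_write as they are
         * submitted. */
        int (*vm_writeback)(struct page *page, int *nr_to_write);
    };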
|