After dirtying a 100M file, the normal behavior is to start writeback
of all the data after a 30s delay. But sometimes the following happens
instead:
- after 30s: ~4M
- after 5s: ~4M
- after 5s: all remaining 92M
Some analysis shows that the internal io dispatch queues go like this:
	   s_io		s_more_io
	-------------------------
	1) 100M,1K	0
	2) 1K		96M
	3) 0		96M
1) initial state with a 100M file and a 1K file
2) 4M written, nr_to_write <= 0, so write more
3) 1K written, nr_to_write > 0, no more writes (BUG)
nr_to_write > 0 in (3) fools the upper layer into thinking that all data
has been written out. The big dirty file is actually still sitting in
s_more_io. We cannot simply splice s_more_io back to s_io as soon as s_io
becomes empty, and let the loop in generic_sync_sb_inodes() continue: this
may starve newly expired inodes in s_dirty. It is also not an option to
draw inodes from both s_more_io and s_dirty, and let the loop go on: this
might lead to livelocks, and might also starve other superblocks in sync
time (well, kupdate may still starve some superblocks, but that's another bug).
We have to return when a full scan of s_io completes. So nr_to_write > 0
does not necessarily mean that "all data are written". This patch
introduces a flag, writeback_control.more_io, to indicate that more io should
be done. With it the big dirty file no longer has to wait for the next
kupdate invocation 5s later.
In sync_sb_inodes() we only set more_io on super_blocks we actually
visited. This avoids interaction between two pdflush daemons.
Also, in __sync_single_inode() we no longer blindly keep requeuing the io
when the filesystem cannot make progress; doing so could lead to 100% iowait.
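A minimal sketch of how the flag is meant to be consumed by a kupdate-style
writeback loop (illustrative only; names taken from the description above):

	struct writeback_control {
		...
		unsigned more_io:1;	/* more io to be dispatched */
	};

	/* kupdate-style loop */
	for (;;) {
		wbc.more_io = 0;
		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
		writeback_inodes(&wbc);
		if (wbc.nr_to_write > 0 && !wbc.more_io)
			break;	/* truly nothing left: don't wait for the next kupdate */
	}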
Tested-by: Mike Snitzer <snitzer@gmail.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Michael Rubin <mrubin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add vm.highmem_is_dirtyable toggle
A 32-bit machine with HIGHMEM64 enabled running DCC has an mmapped file of
approximately 2GB in size which contains a hash format that is written
randomly by the dbclean process. On 2.6.16 this process took a few
minutes. With lowmem-only accounting of dirty ratios, it takes about 12
hours of 100% disk IO, all random writes.
Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
add highmem back to the total available memory count.
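A rough sketch of the intended effect on the dirtyable-memory calculation
(illustrative; the sysctl variable name is assumed from the /proc path above):

	static unsigned long determine_dirtyable_memory(void)
	{
		unsigned long x = global_page_state(NR_FREE_PAGES)
				+ global_page_state(NR_INACTIVE)
				+ global_page_state(NR_ACTIVE);

		if (!vm_highmem_is_dirtyable)
			x -= highmem_dirtyable_memory(x);	/* lowmem-only accounting */
		return x + 1;	/* never return 0 */
	}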
[akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
Signed-off-by: Bron Gondwana <brong@fastmail.fm>
Cc: Ethan Solomita <solo@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This reverts commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b, as
requested by Fengguang Wu. It's not quite fully baked yet, and while
there are patches around to fix the problems it caused, they should get
more testing. Says Fengguang: "I'll resend them both for -mm later on,
in a more complete patchset".
See
http://bugzilla.kernel.org/show_bug.cgi?id=9738
for some of this discussion.
Requested-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
I_LOCK was used for several unrelated purposes, which caused deadlock
situations in certain filesystems as a side effect. One of the purposes
now uses the new I_SYNC bit.
Also document the various bits and change their order from historical to
logical.
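For example, waiting for inode writeback to complete can now key off I_SYNC
rather than I_LOCK; a sketch of the idea (not the exact patch):

	static inline void inode_sync_wait(struct inode *inode)
	{
		wait_on_bit(&inode->i_state, __I_SYNC,
			    inode_wait, TASK_UNINTERRUPTIBLE);
	}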
[bunk@stusta.de: make fs/inode.c:wake_up_inode() static]
Signed-off-by: Joern Engel <joern@wohnheim.fh-wedel.de>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: David Chinner <dgc@sgi.com>
Cc: Anton Altaparmakov <aia21@cam.ac.uk>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
After dirtying a 100M file, the normal behavior is to start writeback of all
the data after a 30s delay. But sometimes the following happens instead:
- after 30s: ~4M
- after 5s: ~4M
- after 5s: all remaining 92M
Some analysis shows that the internal io dispatch queues go like this:
	   s_io		s_more_io
	-------------------------
	1) 100M,1K	0
	2) 1K		96M
	3) 0		96M
1) initial state with a 100M file and a 1K file
2) 4M written, nr_to_write <= 0, so write more
3) 1K written, nr_to_write > 0, no more writes (BUG)
nr_to_write > 0 in (3) fools the upper layer into thinking that all data has
been written out. The big dirty file is actually still sitting in s_more_io.
We cannot simply splice s_more_io back to s_io as soon as s_io becomes empty,
and let the loop in generic_sync_sb_inodes() continue: this may starve newly
expired inodes in s_dirty. It is also not an option to draw inodes from both
s_more_io and s_dirty, and let the loop go on: this might lead to livelocks,
and might also starve other superblocks in sync time (well, kupdate may still
starve some superblocks, but that's another bug).
We have to return when a full scan of s_io completes. So nr_to_write > 0 does
not necessarily mean that "all data are written". This patch introduces a
flag writeback_control.more_io to indicate this situation. With it the big
dirty file no longer has to wait for the next kupdate invocation 5s later.
Cc: David Chinner <dgc@sgi.com>
Cc: Ken Chen <kenchen@google.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Scale writeback cache per backing device, proportional to its writeout speed.
By decoupling the BDI dirty thresholds a number of problems we currently have
will go away, namely:
- mutual interference starvation (for any number of BDIs);
- deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).
It might be that all dirty pages are for a single BDI while other BDIs are
idling. By giving each BDI a 'fair' share of the dirty limit, each one can have
dirty pages outstanding and make progress.
A global threshold also creates a deadlock for stacked BDIs; when A writes to
B, and A generates enough dirty pages to get throttled, B will never start
writeback until the dirty pages go away. Again, by giving each BDI its own
'independent' dirty limit, this problem is avoided.
So the problem is to determine how to distribute the total dirty limit across
the BDIs fairly and efficiently. A BDI that has a large dirty limit but does
not have any dirty pages outstanding is a waste.
What is done is to keep a floating proportion between the BDIs based on
writeback completions. This way faster/more active devices get a larger share
than slower/idle devices.
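A sketch of the resulting per-BDI threshold calculation (illustrative; helper
names assumed):

	/* dirty = global dirty threshold, in pages */
	long numerator, denominator;

	bdi_writeout_fraction(bdi, &numerator, &denominator);
	bdi_dirty = dirty * numerator / denominator;
	/* a BDI that completes writeback faster gets a proportionally
	 * larger slice of the global limit */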
[akpm@linux-foundation.org: fix warnings]
[hugh@veritas.com: Fix occasional hang when a task couldn't get out of balance_dirty_pages]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
* git://git.linux-nfs.org/pub/linux/nfs-2.6: (131 commits)
NFSv4: Fix a typo in nfs_inode_reclaim_delegation
NFS: Add a boot parameter to disable 64 bit inode numbers
NFS: nfs_refresh_inode should clear cache_validity flags on success
NFS: Fix a connectathon regression in NFSv3 and NFSv4
NFS: Use nfs_refresh_inode() in ops that aren't expected to change the inode
SUNRPC: Don't call xprt_release in call refresh
SUNRPC: Don't call xprt_release() if call_allocate fails
SUNRPC: Fix buggy UDP transmission
[23/37] Clean up duplicate includes in
[2.6 patch] net/sunrpc/rpcb_clnt.c: make struct rpcb_program static
SUNRPC: Use correct type in buffer length calculations
SUNRPC: Fix default hostname created in rpc_create()
nfs: add server port to rpc_pipe info file
NFS: Get rid of some obsolete macros
NFS: Simplify filehandle revalidation
NFS: Ensure that nfs_link() returns a hashed dentry
NFS: Be strict about dentry revalidation when doing exclusive create
NFS: Don't zap the readdir caches upon error
NFS: Remove the redundant nfs_reval_fsid()
NFSv3: Always use directory post-op attributes in nfs3_proc_lookup
...
Fix up trivial conflict due to sock_owned_by_user() cleanup manually in
net/sunrpc/xprtsock.c
|
|
Hide everything in blkdev.h when CONFIG_BLOCK isn't set, and fix up
the (few) files that fail to build because they were relying on blkdev.h
pulling in extra includes for them.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
The only user of this field was NFS.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
All the current page_mkwrite() implementations also set the page dirty, which
results in set_page_dirty_balance() _not_ calling balance_dirty_pages(),
because the page is already found to be dirty.
This allows us to dirty a _lot_ of pages without ever hitting
balance_dirty_pages(). Not good (tm).
Force a balance call if ->page_mkwrite() was successful.
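A minimal sketch of the fix (illustrative):

	/* tell the balancing code whether ->page_mkwrite() succeeded,
	 * so it throttles even though the page was already dirty */
	static void set_page_dirty_balance(struct page *page, int page_mkwrite)
	{
		if (set_page_dirty(page) || page_mkwrite) {
			struct address_space *mapping = page_mapping(page);

			if (mapping)
				balance_dirty_pages_ratelimited(mapping);
		}
	}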
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The first thing mm.h does is include sched.h, solely for the can_do_mlock()
inline function, which dereferences "current". By dealing with can_do_mlock(),
mm.h can be detached from sched.h, which is good. See below for why.
This patch
a) removes unconditional inclusion of sched.h from mm.h
b) makes can_do_mlock() a normal function in mm/mlock.c
c) exports can_do_mlock() to not break compilation
d) adds sched.h inclusions back to files that were getting it indirectly.
e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
getting them indirectly
Net result is:
a) mm.h users would get less code to open, read, preprocess, parse, ... if
they don't need sched.h
b) sched.h stops being dependency for significant number of files:
on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
after patch it's only 3744 (-8.3%).
Cross-compile tested on
all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
alpha alpha-up
arm
i386 i386-up i386-defconfig i386-allnoconfig
ia64 ia64-up
m68k
mips
parisc parisc-up
powerpc powerpc-up
s390 s390-up
sparc sparc-up
sparc64 sparc64-up
um-x86_64
x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig
as well as my two usual configs.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Clean up massive code duplication between mpage_writepages() and
generic_writepages().
The new generic function, write_cache_pages() takes a function pointer
argument, which will be called for each page to be written.
Maybe cifs_writepages() too can use this infrastructure, but I'm not
touching that with a ten-foot pole.
The upcoming page writeback support in fuse will also want this.
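The resulting interface looks roughly like this (sketch):

	typedef int (*writepage_t)(struct page *page,
				   struct writeback_control *wbc, void *data);

	int write_cache_pages(struct address_space *mapping,
			      struct writeback_control *wbc,
			      writepage_t writepage, void *data);

	/* generic_writepages() becomes a thin wrapper that passes a callback
	 * which simply calls ->writepage(), while mpage_writepages() passes
	 * an mpage-aware callback. */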
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently we do write coalescing in a very inefficient manner: one pass in
generic_writepages() in order to lock the pages for writing, then one pass
in nfs_flush_mapping() and/or nfs_sync_mapping_wait() in order to gather
the locked pages for coalescing into RPC requests of size "wsize".
In fact, it turns out there is actually a deadlock possible here since we
only start I/O on the second pass. If the user signals the process while
we're in nfs_sync_mapping_wait(), for instance, then we may exit before
starting I/O on all the requests that have been queued up.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
throttle_vm_writeout() is designed to wait for the dirty levels to subside.
But if the caller holds IO or FS locks, we might be holding up that writeout.
So change it to take a single nap to give other devices a chance to clean some
memory, then return.
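A sketch of the change (illustrative):

	void throttle_vm_writeout(gfp_t gfp_mask)
	{
		if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
			/* caller holds IO or FS locks: don't loop waiting for
			 * the dirty levels to subside, just take one nap */
			congestion_wait(WRITE, HZ/10);
			return;
		}
		/* ... normal throttling path ... */
	}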
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Separate out the concept of "queue congestion" from "backing-dev congestion".
Congestion is a backing-dev concept, not a queue concept.
The blk_* congestion functions are retained, as wrappers around the core
backing-dev congestion functions.
This proper layering is needed so that NFS can cleanly use the congestion
functions, and so that CONFIG_BLOCK=n actually links.
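The resulting layering, roughly (sketch; the blk_* names are the pre-existing
wrappers):

	/* backing-dev level (new core) */
	void set_bdi_congested(struct backing_dev_info *bdi, int rw);
	void clear_bdi_congested(struct backing_dev_info *bdi, int rw);
	long congestion_wait(int rw, long timeout);

	/* the block layer keeps thin wrappers */
	static inline void blk_set_queue_congested(struct request_queue *q, int rw)
	{
		set_bdi_congested(&q->backing_dev_info, rw);
	}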
Cc: "Thomas Maier" <balagi@justmail.de>
Cc: "Jens Axboe" <jens.axboe@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: David Howells <dhowells@redhat.com>
Cc: Peter Osterlund <petero2@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Many files include the filename at the beginning; several used a wrong one.
Signed-off-by: Uwe Zeisberger <Uwe_Zeisberger@digi.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
Dissociate the generic_writepages() function from the mpage stuff, moving its
declaration to linux/mm.h and actually emitting a full implementation into
mm/page-writeback.c.
The implementation is a partial duplicate of mpage_writepages() with all BIO
references removed.
It is used by NFS to do writeback.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
ratelimit_pages in page-writeback.c is recalculated (in set_ratelimit())
every time a CPU is hot-added/removed. But this value is not recalculated
when new pages are hot-added.
This patch fixes that problem by calling set_ratelimit() when new pages
are hot-added.
[akpm@osdl.org: cleanups]
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Now that we can detect writers of shared mappings, throttle them. Avoids OOM
by surprise.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
When a writeback_control's `start' and `end' fields are used to
indicate a one-byte-range starting at file offset zero, the required
values of .start=0,.end=0 mean that the ->writepages() implementation
has no way of telling that it is being asked to perform a range
request. Because we're currently overloading (start == 0 && end == 0)
to mean "this is not a write-a-range request".
To make all this sane, the patch changes range of writeback_control.
So caller does: If it is calling ->writepages() to write pages, it
sets range (range_start/end or range_cyclic) always.
And if range_cyclic is true, ->writepages() thinks the range is
cyclic, otherwise it just uses range_start and range_end.
This patch does the following:
- Adds LLONG_MAX, LLONG_MIN, ULLONG_MAX to include/linux/kernel.h.
  -1 is usually ok for range_end (the type is long long), but if someone did
	range_end += val;		/* range_end becomes "val - 1" */
	u64val = range_end >> bits;	/* u64val becomes ~(0ULL) */
  or something similar, it would be wrong. So this adds LLONG_MAX to avoid
  such nasty surprises, and uses LLONG_MAX for range_end.
- All callers of ->writepages() set range_start/range_end or range_cyclic.
- Fixes updates of ->writeback_index, which were already a bit strange:
  if the scan starts at 0 and is ended by the nr_to_write check, the last
  index may reduce the chance of scanning the end of the file. So
  ->writeback_index is now updated only if range_cyclic is true or the
  whole file was scanned.
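A sketch of the new fields and a typical whole-file caller (illustrative):

	struct writeback_control {
		...
		loff_t range_start;
		loff_t range_end;
		...
		unsigned range_cyclic:1;	/* range is cyclic */
	};

	struct writeback_control wbc = {
		.sync_mode	= WB_SYNC_ALL,
		.nr_to_write	= LONG_MAX,
		.range_start	= 0,
		.range_end	= LLONG_MAX,	/* i.e. write the whole file */
	};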
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Steven French <sfrench@us.ibm.com>
Cc: "Vladimir V. Saveliev" <vs@namesys.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Modify balance_dirty_pages_ratelimited() so that it can take a
number-of-pages-which-I-just-dirtied argument. For msync().
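The interface becomes (sketch):

	void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
						unsigned long nr_pages_dirtied);

	static inline void
	balance_dirty_pages_ratelimited(struct address_space *mapping)
	{
		balance_dirty_pages_ratelimited_nr(mapping, 1);
	}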
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Make the internal values for:
/proc/sys/vm/dirty_writeback_centisecs
/proc/sys/vm/dirty_expire_centisecs
are stored as jiffies instead of centiseconds. Let the sysctl interface do
the conversions with full precision using clock_t_to_jiffies, instead of
doing overflow-sensitive on-the-fly conversions every time the values are
used.
Cons: apparent precision loss if HZ is not a multiple of 100, because of
conversion back and forth. This is a common problem for all sysctl values
that use proc_dointvec_userhz_jiffies. (There is only one other in-tree
use, in net/core/neighbour.c.)
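So the sysctl table entries end up looking something like this (abridged
sketch):

	{
		.procname	= "dirty_writeback_centisecs",
		.data		= &dirty_writeback_interval,	/* now in jiffies */
		.maxlen		= sizeof(dirty_writeback_interval),
		.mode		= 0644,
		.proc_handler	= &proc_dointvec_userhz_jiffies,
	},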
Signed-off-by: Bart Samwel <bart@samwel.tk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This exports and changes sync_page_range()/sync_page_range_nolock(). fatfs
needs them for expanding truncate, and the "count" argument changes from
"size_t count" to "loff_t count".
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
NFS needs to be able to distinguish between single-page ->writepage() calls and
multipage ->writepages() calls.
For the single-page writepage calls NFS can kick off the I/O within the
context of ->writepage().
For multipage ->writepages calls, nfs_writepage() will leave the I/O pending
and nfs_writepages() will kick off the I/O when it all has been queued up
within NFS.
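One way to convey that context is a flag in writeback_control, e.g.
(illustrative; the flag name is an assumption):

	struct writeback_control {
		...
		unsigned for_writepages:1;	/* this is a ->writepages() call */
	};

	/* nfs_writepage() can then defer I/O when wbc->for_writepages is set,
	 * letting nfs_writepages() kick it off once everything is queued. */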
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
readpage(), prepare_write(), and commit_write() callers are updated to
understand the special return code AOP_TRUNCATED_PAGE in the style of
writepage() and WRITEPAGE_ACTIVATE. AOP_TRUNCATED_PAGE tells the caller that
the callee has unlocked the page and that the operation should be tried again
with a new page. OCFS2 uses this to detect and work around a lock inversion in
its aop methods. There should be no change in behaviour for methods that don't
return AOP_TRUNCATED_PAGE.
WRITEPAGE_ACTIVATE is also prepended with AOP_ for consistency and they are
made enums so that kerneldoc can be used to document their semantics.
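Callers handle the new return code roughly like this (sketch):

	status = a_ops->prepare_write(file, page, offset, offset + bytes);
	if (status == AOP_TRUNCATED_PAGE) {
		page_cache_release(page);
		/* the callee already unlocked the page; retry with a new one */
		continue;
	}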
Signed-off-by: Zach Brown <zach.brown@oracle.com>
|
|
With Nick Piggin <npiggin@suse.de>
Give some things static scope.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This updates the CFQ io scheduler to the new time sliced design (cfq
v3). It provides full process fairness, while giving excellent
aggregate system throughput even for many competing processes. It
supports io priorities, either inherited from the cpu nice value or set
directly with the ioprio_get/set syscalls. The latter closely mimic
set/getpriority.
This import is based on my latest from -mm.
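The ioprio_set/ioprio_get syscalls take a class and a level packed into one
value; a userspace sketch (constants written out for illustration):

	#include <sys/syscall.h>
	#include <unistd.h>

	#define IOPRIO_CLASS_SHIFT	13
	#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | (data))

	/* best-effort class, priority level 4, for the current process */
	syscall(SYS_ioprio_set, 1 /* IOPRIO_WHO_PROCESS */, 0,
		IOPRIO_PRIO_VALUE(2 /* IOPRIO_CLASS_BE */, 4));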
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
With silly pageout testcases it is possible to place huge amounts of memory
under I/O. With a large request queue (CFQ uses 8192 requests) it is
possible to place _all_ memory under I/O at the same time.
This means that all memory is pinned and unreclaimable and the VM gets
upset and goes oom.
The patch limits the amount of memory which is under pageout writeout to be
a little more than the amount of memory at which balance_dirty_pages()
callers will synchronously throttle.
This means that heavy pageout activity can starve heavy writeback activity
completely, but heavy writeback activity will not cause starvation of
pageout, because we don't want a simple `dd' to be causing excessive
latencies in page reclaim.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The O_SYNC speedup patches missed the generic_file_xxx_nolock cases, which
means that pages weren't actually getting sync'ed in those cases. This
patch fixes that.
Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Eliminate the inode waitqueue hashtable using bit_waitqueue() via
wait_on_bit() and wake_up_bit() to locate the waitqueue head associated
with a bit.
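The per-inode waiting then looks roughly like this (sketch; the bit-number
constants are assumptions):

	/* sleep until the I_LOCK bit in inode->i_state is cleared */
	wait_on_bit(&inode->i_state, __I_LOCK, inode_wait, TASK_UNINTERRUPTIBLE);

	/* waker side */
	smp_mb();
	wake_up_bit(&inode->i_state, __I_LOCK);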
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add a whole bunch more might_sleep() checks. We also enable might_sleep()
checking in copy_*_user(). This was non-trivial because the "copy_*_user()
in atomic regions" trick would generate false positives. Fix that up by
adding a new __copy_*_user_inatomic(), which avoids the might_sleep() check.
Only i386 is supported in this patch.
With: Arjan van de Ven <arjanv@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
We assign 0 and 1 to it, but since it's signed, that's
actually already overflowing the poor thing. So make
it unsigned, which is what it really was supposed to be
in the first place.
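The underlying C gotcha, for reference:

	struct foo {
		int flag:1;		/* 1-bit signed field: holds 0 and -1 */
		unsigned int flag2:1;	/* 1-bit unsigned field: holds 0 and 1 */
	};

With typical compilers, assigning 1 to the signed field stores -1, so a later
comparison like (flag == 1) never matches.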
|
|
In databases it is common to have multiple threads or processes performing
O_SYNC writes against different parts of the same file.
Our performance at this is poor, because each writer blocks access to the
file by waiting on I/O completion while holding i_sem: everything is
serialised.
The patch improves things by moving the writing and waiting outside i_sem.
So other threads can get in and submit their I/O and permit the disk
scheduler to optimise the IO patterns better.
Also, the O_SYNC writer only writes and waits on the pages which he wrote,
rather than writing and waiting on all dirty pages in the file.
The reason we haven't been able to do this before is that the required walk
of the address_space page lists is easily livelockable without the i_sem
serialisation. But in this patch we perform the waiting via a radix-tree
walk of the affected pages. This cannot be livelocked.
The sync of the inode's metadata is still performed inside i_sem. This is
because it is list-based and is hence still livelockable. However it is
usually the case that databases are overwriting existing file blocks and
there will be no dirty buffers attached to the address_space anyway.
The code is careful to ensure that the IO for the pages and the IO for the
metadata are nonblockingly scheduled at the same time. This is an improvement
over the current code, which will issue two separate write-and-wait cycles:
one for metadata, one for pages.
Note from Suparna:
Reworked to use the tagged radix-tree based writeback infrastructure.
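The resulting helper looks roughly like this (sketch):

	int sync_page_range(struct inode *inode, struct address_space *mapping,
			    loff_t pos, loff_t count)
	{
		pgoff_t start = pos >> PAGE_CACHE_SHIFT;
		pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
		int ret;

		/* start I/O on just the pages we wrote */
		ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
		if (ret == 0) {
			down(&inode->i_sem);	/* metadata still under i_sem */
			ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
			up(&inode->i_sem);
		}
		if (ret == 0)			/* radix-tree based, livelock-free wait */
			ret = wait_on_page_writeback_range(mapping, start, end);
		return ret;
	}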
Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Modify mpage_writepages to optionally only write back dirty pages within a
specified range in a file (as in the case of O_SYNC). Cheat a little to avoid
changes to prototypes of aops - just put the <start, end> hint into the
writeback_control struct instead. If <start, end> are not set, then default
to writing back all the mapping's dirty pages.
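A caller that wants a ranged writeback then does something like this (sketch):

	struct writeback_control wbc = {
		.sync_mode	= WB_SYNC_ALL,
		.nr_to_write	= nr_pages,
		.start		= pos,			/* range hint */
		.end		= pos + count - 1,
	};
	ret = mpage_writepages(mapping, &wbc, get_block);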
Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Nobody ever fixed the big FIXME in sysctl - but we really need
to pass around the proper "loff_t *" to all the sysctl functions
if we want them to be well-behaved wrt the file pointer position.
This is all preparation for making direct f_pos accesses go
away.
|
|
From: Bart Samwel <bart@samwel.tk>
Adds /proc/sys/vm/laptop-mode: a special knob which says "this is a laptop".
In this mode the kernel will attempt to avoid spinning disks up.
Algorithm: the idea is to hold dirty data in memory for a long time, but to
flush everything which has been accumulated if the disk happens to spin up
for other reasons.
- Whenever a disk request completes (read or write), schedule a timer a few
seconds hence. If the timer was already pending, reset it to a few seconds
hence.
- When the timer expires, write back the whole world. We use
sync_filesystems() for this because it will force ext3 journal commits as
well.
- In balance_dirty_pages(), kick off background writeback when we hit the
high threshold (dirty_ratio), not when we hit the low threshold. This has
the effect of causing "lumpy" writeback which is something I spent a year
fixing, but in laptop mode, it is desirable.
- In try_to_free_pages(), only kick pdflush if the VM is getting into
distress: we want to keep scanning for clean pages, deferring writeback.
- In page reclaim, avoid writing back the odd random dirty page off the
LRU: only start I/O if the scanning is working harder.
The effect is to perform a sync() a few seconds after all I/O has ceased.
The value which was written into /proc/sys/vm/laptop-mode determines, in
seconds, the delay between the final I/O and the flush.
Additionally, the patch adds tools which help answer the question "why the
heck does my disk spin up all the time?". The user may set
/proc/sys/vm/block_dump to a non-zero value and the kernel will print out
information which will identify the process which is performing disk reads or
which is dirtying pagecache.
The user should probably disable syslogd before setting block-dump.
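The disk-spun-up hook is essentially a deferred flush timer; a sketch of the
shape of it (names assumed from the description above):

	/* called from the block layer whenever a request completes */
	void laptop_io_completion(void)
	{
		/* (re)arm the flush for laptop_mode seconds from now */
		mod_timer(&laptop_mode_wb_timer, jiffies + laptop_mode * HZ);
	}

	static void laptop_flush(unsigned long unused)
	{
		sys_sync();	/* forces ext3 journal commits as well */
	}

	static void laptop_timer_fn(unsigned long unused)
	{
		pdflush_operation(laptop_flush, 0);
	}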
|
|
If pdflush hits a locked-and-clean buffer in __block_write_full_page() it
will just pass over the buffer. Typically the buffer is an ext3 data=ordered
buffer which is being written by kjournald, but a similar thing can happen
with blockdev buffers and ll_rw_block().
This is bad because the buffer is still under I/O and a subsequent fsync's
fdatawait() needs to know about it.
It is not practical to tag the page for writeback - only the submitter of the
I/O can do that, because the submitter has control of the end_io handler.
So instead, redirty the page so a subsequent fsync's fdatawrite() will wait on
the underway I/O.
There is a risk that pdflush::background_writeout() will lock up, repeatedly
trying and failing to write the same page. This is prevented by ensuring
that background_writeout() always throttles when it made no progress.
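The core of the idea in __block_write_full_page(), roughly (sketch):

	if (wbc->sync_mode != WB_SYNC_NONE) {
		lock_buffer(bh);		/* data integrity: wait for it */
	} else if (test_set_buffer_locked(bh)) {
		/* locked and clean: someone else (e.g. kjournald) owns the I/O.
		 * Redirty the page so a later fdatawrite() will wait on it. */
		redirty_page_for_writepage(wbc, page);
		continue;
	}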
|
|
From: Robert Love <rml@tech9.net>
- Let real-time tasks dip further into the reserves than usual in
__alloc_pages(). There are a lot of ways to special case this. This
patch just cuts z->pages_low in half, before doing the incremental min
thing, for real-time tasks. I do not do anything in the low memory slow
path. We can be a _lot_ more aggressive if we want. Right now, we just
give real-time tasks a little help.
- Never ever call balance_dirty_pages() on a real-time task. Where and
how exactly we handle this is up for debate. We could, for example,
special case real-time tasks inside balance_dirty_pages(). This would
allow us to perform some of the work (say, waking up pdflush) but not
other work (say, the active throttling). As it stands now, we do the
per-processor accounting in balance_dirty_pages_ratelimited() but we
never call balance_dirty_pages(). Lots of approaches work. What we want
to do is never engage the real-time task in forced writeback.
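A sketch of both halves of this (illustrative only):

	/* in __alloc_pages(), before the incremental min logic */
	if (rt_task(p) && !in_interrupt())
		min -= z->pages_low >> 1;	/* let RT tasks dig deeper */

	/* in balance_dirty_pages_ratelimited(), after the per-cpu accounting */
	if (rt_task(current))
		return;		/* never engage RT tasks in forced writeback */
	balance_dirty_pages(mapping);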
|
|
|
|
This /proc tunable sets the kupdate interval. It has a couple of problems:
- No way to turn it off completely (userspace dirty memory management
solutions require this).
- If it has been set to one hour and then the user resets it to five
seconds, that resetting will not take effect for up to an hour.
Fix that up by providing a sysctl handler. Setting the tunable to zero now
disables the kupdate function.
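The handler is roughly (sketch):

	int dirty_writeback_centisecs_handler(ctl_table *table, int write,
			struct file *file, void __user *buffer,
			size_t *length, loff_t *ppos)
	{
		proc_dointvec_userhz_jiffies(table, write, file, buffer, length, ppos);
		if (dirty_writeback_interval) {
			/* take effect immediately, even if the old interval
			 * was set to an hour */
			mod_timer(&wb_timer, jiffies + dirty_writeback_interval);
		} else {
			/* 0 disables periodic kupdate writeback */
			del_timer(&wb_timer);
		}
		return 0;
	}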
|
|
Patch from Anders Gustafsson <andersg@0x63.nu>
We're getting a division-by-zero in the writeback code during early rootfs
population, because writeback has not yet been initialised.
Fix that by performing an explicit initialisation rather than relying on
initcall ordering.
|
|
current->flags:PF_SYNC was a hack I added because I didn't want to
change all ->writepage implementations.
It's foul. And it means that if someone happens to run direct page
reclaim within the context of (say) sys_sync, the writepage invocations
from the VM will be treated as "data integrity" operations, not "memory
cleansing" operations, which would cause latency.
So the patch removes PF_SYNC and adds an extra arg to a_ops->writepage.
It is the `writeback_control' structure which contains the full context
information about why writepage was called.
The initial version of this patch just passed in a bare `int sync', but
the XFS team need more info so they can perform writearound from within
page reclaim.
The patch also adds writeback_control.for_reclaim, so writepage
implementations can inspect that to work out the call context rather
than peeking at current->flags:PF_MEMALLOC.
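So the address_space operation becomes (sketch):

	struct writeback_control {
		enum writeback_sync_modes sync_mode;	/* data integrity vs. memory cleansing */
		long nr_to_write;
		unsigned for_reclaim:1;			/* called from page reclaim */
		...
	};

	int (*writepage)(struct page *page, struct writeback_control *wbc);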
|
|
I've revisited all the superblock->inode->page writeback paths. There
were several silly things in there, and things were not as clear as they
could be.
scenario 1: create and dirty a MAP_SHARED segment over a sparse file,
then exit.
All the memory turns into dirty pagecache, but the kupdate function
only writes it out at a trickle - 4 megabytes every thirty seconds.
We should sync it all within 30 seconds.
What's happening is that when writeback tries to write those pages,
the filesystem needs to instantiate new blocks for them (they're over
holes). The filesystem runs mark_inode_dirty() within the writeback
function.
This redirtying of the inode while we're writing it out triggers
some livelock avoidance code in __sync_single_inode(). That function
says "ah, someone redirtied the file while I was writing it. Let's
move the file to the new end of the superblock dirty list and write
it out later." Problem is, writeback dirtied the inode itself.
(It is rather silly that mark_inode_dirty() sets I_DIRTY_PAGES when
clearly no pages have been dirtied. Fixing that up would be a
largish work, so work around it here).
So this patch just removes the livelock avoidance from
__sync_single_inode(). It is no longer needed anyway - writeback
livelock is now avoided (in all writeback paths) by writing a finite
number of pages.
scenario 2: an application is continuously dirtying a 200 megabyte
file, and your disk has a bandwidth of less than 40 megabytes/sec.
What happens is that once 30 seconds passes, pdflush starts writing
out the file. And because that writeout will take more than five
seconds (a `kupdate' interval), pdflush just keeps writing it out
forever - continuous I/O.
What we _want_ to happen is that the 200 megabytes gets written,
and then IO stops for thirty seconds (minus the writeout period). So
the file is fully synced every thirty seconds.
The patch solves this by using mapping->io_pages more intelligently.
When the time comes to write the file out, move all the dirty pages
onto io_pages. That is a "batch of pages for this kupdate round".
When io_pages is empty, we know we're done.
The address_space_operations.writepages() API is changed! It now only
needs to write the pages which the caller placed on mapping->io_pages.
This conceptually cleans things up a bit, by more clearly defining the
role of ->io_pages, and the motion between the various mapping lists.
The treatment of sb->s_dirty and sb->s_io is now conceptually identical
to mapping->dirty_pages and mapping->io_pages: move the items to be
written onto ->s_io/io_pages, and walk that list. As inodes (or pages)
are written, move them over to the clean/locked/dirty lists.
Oh, scenario 3: start an app which continuously overwrites a 5 meg
file. Wait five seconds, start another, wait 5 seconds, start another.
What we _should_ see is three 5-meg writes, five seconds apart, every
thirty seconds. That did all sorts of odd things. It now does the
right thing.
|
|
fail_writepage() does not work. Its activate_page() call cannot
activate the page because it is not on the LRU.
So perform that function (more efficiently) in the VM. Remove
fail_writepage() and, if the filesystem does not implement
->writepage() then activate the page from shrink_list().
A special case is tmpfs, which does have a writepage, but which
sometimes wants to activate the pages anyway. The most important case
is when there is no swap online and we don't want to keep all those
pages on the inactive list. So just as a tmpfs special-case, allow
writepage() to return WRITEPAGE_ACTIVATE, and handle that in the VM.
Also, the whole idea of allowing ->writepage() to return -EAGAIN, and
handling that in the caller has been reverted. If a writepage()
implementation wants to back out and not write the page, it must
redirty the page, unlock it and return zero. (This is Hugh's preferred
way).
And remove the now-unneeded shmem_writepages() - shmem inodes are
marked as `memory backed' so it will not be called.
And remove the test for non-null ->writepage() in generic_file_mmap().
Memory-backed files _are_ mmappable, and they do not have a
writepage(). It just isn't called.
So the locking rules for writepage() are unchanged. They are:
- Called with the page locked
- Returns with the page unlocked
- Must redirty the page itself if it wasn't all written.
But there is a new, special, hidden, undocumented, secret hack for
tmpfs: writepage may return WRITEPAGE_ACTIVATE to tell the VM to move
the page to the active list. The page must be kept locked in this one
case.
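A very rough sketch of the VM side of the special return value (illustrative,
not the exact reclaim code of the time):

	res = mapping->a_ops->writepage(page, &wbc);
	if (res == WRITEPAGE_ACTIVATE) {
		/* writepage declined; the page is still locked per the rule above */
		SetPageActive(page);
		unlock_page(page);
		continue;	/* keep the page, on the active list */
	}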
|
|
Since /proc/sys/vm/dirty_sync_ratio went away, the name
"dirty_async_ratio" makes no sense.
So rename it to just /proc/sys/vm/dirty_ratio.
|
|
Convert the VM to not wait on other people's dirty data.
- If we find a dirty page and its queue is not congested, do some writeback.
- If we find a dirty page and its queue _is_ congested then just
refile the page.
- If we find a PageWriteback page then just refile the page.
- There is additional throttling for write(2) callers. Within
generic_file_write(), record their backing queue in ->current.
Within page reclaim, if this task encounters a page which is dirty
or under writeback on this queue, block on it. This gives some more
writer throttling and reduces the page refiling frequency.
It's somewhat CPU expensive - under really heavy load we only get a 50%
reclaim rate in pages coming off the tail of the LRU. This can be
fixed by splitting the inactive list into reclaimable and
non-reclaimable lists. But the CPU load isn't too bad, and latency is
much, much more important in these situations.
Example: with `mem=512m', running 4 instances of `dbench 100', 2.5.34
took 35 minutes to compile a kernel. With this patch, it took three
minutes, 45 seconds.
I haven't done swapcache or MAP_SHARED pages yet. If there's tons of
dirty swapcache or mmap data around we still stall heavily in page
reclaim. That's less important.
This patch also has a tweak for swapless machines: don't even bother
bringing anon pages onto the inactive list if there is no swap online.
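The dirty-page handling in page reclaim then looks roughly like this (sketch):

	if (PageDirty(page)) {
		if (bdi_write_congested(mapping->backing_dev_info))
			goto keep_locked;	/* congested: just refile the page */
		/* not congested: start some writeback ourselves */
		res = mapping->a_ops->writepage(page, &wbc);
		...
	}
	if (PageWriteback(page))
		goto keep_locked;		/* never wait on it, just refile */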
|
|
The key concept here is that pdflush does not block on request queues
any more. Instead, it circulates across the queues, keeping any
non-congested queues full of write data. When all queues are full,
pdflush takes a nap, to be woken when *any* queue exits write
congestion.
This code can keep sixty spindles saturated - we've never been able to
do that before.
- Add the `nonblocking' flag to struct writeback_control, and teach
the writeback paths to honour it.
- Add the `encountered_congestion' flag to struct writeback_control
and teach the writeback paths to set it.
So as soon as a mapping's backing_dev_info indicates that it is getting
congested, bale out of writeback. And don't even start writeback
against filesystems whose queues are congested.
- Convert pdflush's background_writeback() function to use
nonblocking writeback.
This way, a single pdflush thread will circulate around all the
dirty queues, keeping them filled.
- Convert the pdflush `kupdate' function to do the same thing.
This solves the problem of pdflush thread pool exhaustion.
It solves the problem of pdflush startup latency.
It solves the (minor) problem wherein `kupdate' writeback only writes
back a single disk at a time (it was getting blocked on each queue in
turn).
It probably means that we only ever need a single pdflush thread.
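The core of the nonblocking circulation, roughly (sketch):

	struct writeback_control wbc = {
		.sync_mode	= WB_SYNC_NONE,
		.nonblocking	= 1,
		.older_than_this = NULL,
	};

	for (;;) {
		wbc.encountered_congestion = 0;
		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
		writeback_inodes(&wbc);
		if (wbc.nr_to_write > 0) {
			/* wrote less than a full batch */
			if (wbc.encountered_congestion)
				blk_congestion_wait(WRITE, HZ/10);	/* nap, woken when any queue uncongests */
			else
				break;		/* nothing left to write */
		}
	}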
|
|
This was designed to be a really stern throttling threshold: if dirty
memory reaches this level then perform writeback and actually wait on
it.
It doesn't work. Because memory dirtiers are required to perform
writeback if the amount of dirty AND writeback memory exceeds
dirty_async_ratio.
So kill it, and rely just on the request queues being appropriately
scaled to the machine size (they are).
This is basically what 2.4 does.
|