| Age | Commit message (Collapse) | Author |
|
This fourth patch of eight in this cleancache series provides the
core hooks in VFS for: initializing cleancache per filesystem;
capturing clean pages reclaimed by page cache; attempting to get
pages from cleancache before filesystem read; and ensuring coherency
between pagecache, disk, and cleancache. Note that the placement
of these hooks was stable from 2.6.18 to 2.6.38; a minor semantic
change was required due to a patchset in 2.6.39.
All hooks become no-ops if CONFIG_CLEANCACHE is unset, or become
a check of a boolean global if CONFIG_CLEANCACHE is set but no
cleancache "backend" has claimed cleancache_ops.
Details and a FAQ can be found in Documentation/vm/cleancache.txt
[v8: minchan.kim@gmail.com: adapt to new remove_from_page_cache function]
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
|
|
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
Merge mpage_end_io_read() and mpage_end_io_write() into mpage_end_io() to
eliminate code duplication.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Hai Shan <shan.hai@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
|
|
Some comments misspell "invocation"; this fixes them. No code
changes.
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
|
These struct buffer_heads are allocated on the stack (and hence are
initialized with stack garbage). They are only used to call a
get_blocks() function, so that's mostly OK, but b_state must be
initialized to be 0 so we don't have any unexpected BH_* flags set by
accident, such as BH_Unwritten or BH_Delay.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Commit 29a814d2ee0e43c2980f33f91c1311ec06c0aa35 (vfs: add hooks for
ext4's delayed allocation support) exported the following functions
mpage_bio_submit()
__mpage_writepage()
for the benefit of ext4's delayed allocation support. Since commit
a1d6cc563bfdf1bf2829d3e6ce4d8b774251796b (ext4: Rework the
ext4_da_writepages() function), these functions are not used by the
ext4 driver anymore. However, the now unnecessary exports still
remain, and this patch removes those. Moreover, these two functions
can become static again.
The issue was spotted by namespacecheck.
Signed-off-by: Dmitri Vorobiev <dmitri.vorobiev@movial.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
It is known that buffer_mapped() is false in this code path.
Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
While tracing I/O patterns with blktrace (a great tool) a few weeks ago I
identified a minor issue in fs/mpage.c
As the comment above mpage_readpages() says, a fs's get_block function
will set BH_Boundary when it maps a block just before a block for which
extra I/O is required.
Since get_block() can map a range of pages, for all these pages the
BH_Boundary flag will be set. But we only need to push what I/O we have
accumulated at the last block of this range.
This makes do_mpage_readpage() send out the largest possible bio instead
of a bunch of page-sized ones in the BH_Boundary case.
Signed-off-by: Miquel van Smoorenburg <mikevs@xs4all.net>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
People can use the real name an an index into MAINTAINERS to find the
current email address.
Signed-off-by: Francois Cami <francois.cami@free.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Export mpage_bio_submit() and __mpage_writepage() for the benefit of
ext4's delayed allocation support. Also change __block_write_full_page
so that if buffers that have the BH_Delay flag set it will call
get_block() to get the physical block allocated, just as in the
!BH_Mapped case.
Signed-off-by: Alex Tomas <alex@clusterfs.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Fix docbook problems in filesystems.tmpl.
These cause the generated docbook to be incorrect.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Simplify page cache zeroing of segments of pages through 3 functions
zero_user_segments(page, start1, end1, start2, end2)
Zeros two segments of the page. It takes the position where to
start and end the zeroing which avoids length calculations and
makes code clearer.
zero_user_segment(page, start, end)
Same for a single segment.
zero_user(page, start, length)
Length variant for the case where we know the length.
We remove the zero_user_page macro. Issues:
1. Its a macro. Inline functions are preferable.
2. The KM_USER0 macro is only defined for HIGHMEM.
Having to treat this special case everywhere makes the
code needlessly complex. The parameter for zeroing is always
KM_USER0 except in one single case that we open code.
Avoiding KM_USER0 makes a lot of code not having to be dealing
with the special casing for HIGHMEM anymore. Dealing with
kmap is only necessary for HIGHMEM configurations. In those
configurations we use KM_USER0 like we do for a series of other
functions defined in highmem.h.
Since KM_USER0 is depends on HIGHMEM the existing zero_user_page
function could not be a macro. zero_user_* functions introduced
here can be be inline because that constant is not used when these
functions are called.
Also extract the flushing of the caches to be outside of the kmap.
[akpm@linux-foundation.org: fix nfs and ntfs build]
[akpm@linux-foundation.org: fix ntfs build some more]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Quite a bit of code is used in maintaining these "cached pages" that are
probably pretty unlikely to get used. It would require a narrow race where
the page is inserted concurrently while this process is allocating a page
in order to create the spare page. Then a multi-page write into an uncached
part of the file, to make use of it.
Next, the buffered write path (and others) uses its own LRU pagevec when it
should be just using the per-CPU LRU pagevec (which will cut down on both data
and code size cacheline footprint). Also, these private LRU pagevecs are
emptied after just a very short time, in contrast with the per-CPU pagevecs
that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required
to add the pages to pagecache for a bulk write (in 4K chunks).
[this gets rid of some cond_resched() calls in readahead.c and mpage.c due
to clashes in -mm. What put them there, and why? ]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
As bi_end_io is only called once when the reqeust is complete,
the 'size' argument is now redundant. Remove it.
Now there is no need for bio_endio to subtract the size completed
from bi_size. So don't do that either.
While we are at it, change bi_end_io to return void.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
Clean up massive code duplication between mpage_writepages() and
generic_writepages().
The new generic function, write_cache_pages() takes a function pointer
argument, which will be called for each page to be written.
Maybe cifs_writepages() too can use this infrastructure, but I'm not
touching that with a ten-foot pole.
The upcoming page writeback support in fuse will also want this.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
It's very common for file systems to need to zero part or all of a page,
the simplist way is just to use kmap_atomic() and memset(). There's
actually a library function in include/linux/highmem.h that does exactly
that, but it's confusingly named memclear_highpage_flush(), which is
descriptive of *how* it does the work rather than what the *purpose* is.
So this patchset renames the function to zero_user_page(), and calls it
from the various places that currently open code it.
This first patch introduces the new function call, and converts all the
core kernel callsites, both the open-coded ones and the old
memclear_highpage_flush() ones. Following this patch is a series of
conversions for each file system individually, per AKPM, and finally a
patch deprecating the old call. The diffstat below shows the entire
patchset.
[akpm@linux-foundation.org: fix a few things]
Signed-off-by: Nate Diller <nate.diller@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Cleanup: setting an outstanding error on a mapping was open coded too many
times. Factor it out in mapping_set_error().
Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Dissociate the generic_writepages() function from the mpage stuff, moving its
declaration to linux/mm.h and actually emitting a full implementation into
mm/page-writeback.c.
The implementation is a partial duplicate of mpage_writepages() with all BIO
references removed.
It is used by NFS to do writeback.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When a writeback_control's `start' and `end' fields are used to
indicate a one-byte-range starting at file offset zero, the required
values of .start=0,.end=0 mean that the ->writepages() implementation
has no way of telling that it is being asked to perform a range
request. Because we're currently overloading (start == 0 && end == 0)
to mean "this is not a write-a-range request".
To make all this sane, the patch changes range of writeback_control.
So caller does: If it is calling ->writepages() to write pages, it
sets range (range_start/end or range_cyclic) always.
And if range_cyclic is true, ->writepages() thinks the range is
cyclic, otherwise it just uses range_start and range_end.
This patch does,
- Add LLONG_MAX, LLONG_MIN, ULLONG_MAX to include/linux/kernel.h
-1 is usually ok for range_end (type is long long). But, if someone did,
range_end += val; range_end is "val - 1"
u64val = range_end >> bits; u64val is "~(0ULL)"
or something, they are wrong. So, this adds LLONG_MAX to avoid nasty
things, and uses LLONG_MAX for range_end.
- All callers of ->writepages() sets range_start/end or range_cyclic.
- Fix updates of ->writeback_index. It seems already bit strange.
If it starts at 0 and ended by check of nr_to_write, this last
index may reduce chance to scan end of file. So, this updates
->writeback_index only if range_cyclic is true or whole-file is
scanned.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Steven French <sfrench@us.ibm.com>
Cc: "Vladimir V. Saveliev" <vs@namesys.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch changes mpage_readpages() and get_block() to get the disk mapping
information for multiple blocks at the same time.
b_size represents the amount of disk mapping that needs to mapped. On the
successful get_block() b_size indicates the amount of disk mapping thats
actually mapped. Only the filesystems who care to use this information and
provide multiple disk blocks at a time can choose to do so.
No changes are needed for the filesystems who wants to ignore this.
[akpm@osdl.org: cleanups]
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Pass amount of disk needs to be mapped to get_block(). This way one can
modify the fs ->get_block() functions to map multiple blocks at the same time.
[akpm@osdl.org: performance tweak]
[akpm@osdl.org: remove unneeded assignments]
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
We've had two instances recently of overflows when doing
64_bit_value = (32_bit_value << PAGE_CACHE_SHIFT)
I did a tree-wide grep of `<<.*PAGE_CACHE_SHIFT' and this is the result.
- afs_rxfs_fetch_descriptor.offset is of type off_t, which seems broken.
- jfs and jffs are limited to 4GB anyway.
- reiserfs map_block_for_writepage() takes an unsigned long for the block -
it should take sector_t. (It'll fail for huge filesystems with
blocksize<PAGE_CACHE_SIZE)
- cramfs_read() needs to use sector_t (I think cramsfs is busted on large
filesystems anyway)
- affs is limited in file size anyway.
- I generally didn't fix 32-bit overflows in directory operations.
- arm's __flush_dcache_page() is peculiar. What if the page lies beyond 4G?
- gss_wrap_req_priv() needs checking (snd_buf->page_base)
Cc: Oleg Drokin <green@linuxhacker.ru>
Cc: David Howells <dhowells@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: <reiserfs-dev@namesys.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
readpage(), prepare_write(), and commit_write() callers are updated to
understand the special return code AOP_TRUNCATED_PAGE in the style of
writepage() and WRITEPAGE_ACTIVATE. AOP_TRUNCATED_PAGE tells the caller that
the callee has unlocked the page and that the operation should be tried again
with a new page. OCFS2 uses this to detect and work around a lock inversion in
its aop methods. There should be no change in behaviour for methods that don't
return AOP_TRUNCATED_PAGE.
WRITEPAGE_ACTIVATE is also prepended with AOP_ for consistency and they are
made enums so that kerneldoc can be used to document their semantics.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
|
|
- added typedef unsigned int __nocast gfp_t;
- replaced __nocast uses for gfp flags with gfp_t - it gives exactly
the same warnings as far as sparse is concerned, doesn't change
generated code (from gcc point of view we replaced unsigned int with
typedef) and documents what's going on far better.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
When fsync() runs wait_on_page_writeback_range() it only inspects pages which
are actually under I/O (PAGECACHE_TAG_WRITEBACK). If a page completed I/O
prior to wait_on_page_writeback_range() looking at it, it is supposed to have
recorded its I/O error state in the address_space.
But mpage_mpage_end_io_write() forgot to set the address_space error flag in
this case.
Signed-off-by: Qu Fuping <fs@ercist.iscas.ac.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch makes some needlessly global identifiers static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Arjan van de Ven <arjanv@infradead.org>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This had a fatal lock ranking bug: we do journal_start outside
mpage_writepages()'s lock_page().
Revert the whole thing, think again.
Credit-to: Jan Kara <jack@suse.cz>
For identifying the bug.
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Some KernelDoc descriptions are updated to match the current code.
No code changes.
Signed-off-by: Martin Waitz <tali@admingilde.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
When ->writepage() returns WRITEPAGE_ACTIVATE, the page is still locked.
Explicitly unlock the page in mpage_writepages().
Signed-off-by: Nikita Danilov <nikita@clusterfs.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
|
|
Add writepages support for ext3 writeback mode.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This makes it hard(er) to mix argument orders by mistake for things like
kmalloc() and friends, since silent integer promotion is now caught by
sparse.
|
|
Add nobh_wripage() support for the filesystems which uses
nobh_prepare_write/nobh_commit_write().
Idea here is to reduce unnecessary bufferhead creation/attachment to the
page through pageout()->block_write_full_page(). nobh_wripage() tries to
operate by directly creating bios, but it falls back to
__block_write_full_page() if it can't make progress.
Note that this is not really generic routine and can't be used for
filesystems which uses page->Private for anything other than buffer heads.
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The problem is, if we increase our readhead size arbitrarily (say 2M), we
call mpage_readpages() with 2M and when it tries to allocated a bio enough to
fit 2M it fails, then we kick it back to "confused" code - which does 4K at
a time.
The fix is to ask for the maxium the driver can handle.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Safeguard to make sure we break out of pagevec_lookup_tag loop if we are
beyond the specified range.
Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Modify mpage_writepages to optionally only write back dirty pages within a
specified range in a file (as in the case of O_SYNC). Cheat a little to avoid
changes to prototypes of aops - just put the <start, end> hint into the
writeback_control struct instead. If <start, end> are not set, then default
to writing back all the mapping's dirty pages.
Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I think I understood why some ext2 fs corruption still happens even after
the last i_size fix.
what happened I believe is that the writepages layer got a not a fully
uptodate page (in turn with bh mapped on top of it), and then right before
unlocking the page and entering the writeback mode, it freed all the bh.
Without bh a not uptodate page will trigger a full readpage from disk, that
overwrites the pagecache before the multi-bio gets submitted, generating fs
corruption.
I believe the below patch should fix it (untested) against kernel CVS.
The testcases developed by Kurt showed the pagecache being overwritten with
on-disk data at block offsets, and Chris as well was wondering about races
between wait_on_page_writeback and readpage. the below fix just explains
everything we've seen since not-fully-uptodate pages must have always bh on
them and the below patch enforces just that invariant, and it should fix
our pagecache-overwritten-by-disk-data problem.
Signed-off-by: Andrea Arcangeli <andrea@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I believe reading the i_size from memory multiple times can generate fs
corruption. The "offset" and the "end_index" were not coherent. this is
writepages and it runs w/o the i_sem, so the i_size can change from under
us anytime. If a parallel write happens while writepages run, the i_size
could advance from 4095 to 4100. With the current 2.6 code that could
translate in end_index = 0 and offset = 4. That's broken because end_index
and offset could be not coherent. Either end_index=1 and offset =4, or
end_index = 0 and offset = 4095. When they lose coherency the memset can
zeroout actual data. The below patch fixes that (it's at least a
theoretical bug).
I don't really expect this tiny race to fix the bug in practice after the
more serious bugs we covered yesterday didn't fix it (more likely the
compiler will get involved into the equation soon ;).
This is also an optimization for 32bit archs that needs special locking to
read 64bit i_size coherenty.
This patch also arranges for mpage_writepages() to always zero out the file's
final page between i_size and the end of the file's final block. This is a
best-effort correctness thing to deal with errant applications which write
into the mmapped page beyond the underlying file's EOF.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Fix a data loss bug in mpage_writepages(), triggerable under extreme memory
pressure on ext2, JFS, hfs and hfsplus:
The bug is the marking of the bh clean despite we could still run into the
"confused" path. After that the confused path really becomes confused and it
writes nothing and fs corruption triggers silenty (the reugular writepage only
writes bh that are marked dirty, it never attempts to submit_bh anything
marked clean). The mpage-writepage code must never mark the bh clean as far
as it wants to still fallback in the regular writepage which depends on the bh
to be dirty (i.e. the "goto confused" path). This could only triggers with
memory pressure (it also needs buffer_heads_over_limit == 0, and that is
frequent under mm pressure).
Thanks a lot to Chris for his fine debugging that localized the problem in the
writepage code.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Jeff Garzik <jgarzik@pobox.com>
It was debug code, no longer required.
|
|
Rework the code layout a bit. No logic change.
|
|
From: Jens Axboe <axboe@suse.de>
Takashi did some nice latency testing of the current kernel (with -mm
writeback changes), and the biggest offender in general core is
mpage_writepages().
|
|
The radix-tree walk for writeback has a couple of problems:
a) It always scans a file from its first dirty page, so if someone
is repeatedly dirtying the front part of a file, pages near the end
may be starved of writeout. (Well, not completely: the `kupdate'
function will write an entire file once the file's dirty timestamp
has expired).
b) When the disk queues are huge (10000 requests), there can be a
very large number of locked pages. Scanning past these in writeback
consumes quite some CPU time.
So in each address_space we record the index at which the last batch of
writeout terminated and start the next batch of writeback from that
point.
|
|
fdatasync can fail to wait on some pages due to a race.
If some task (eg pdflush) is flushing the same mapping it can remove a page's
dirty tag but not then mark that page as being under writeback, because
pdflush hit a locked buffer in __block_write_full_page(). This will happen
because kjournald is writing the buffer. In this situation
__block_write_full_page() will redirty the page so that fsync notices it, but
there is a window where the page eludes the radix tree dirty page walk.
Consequently a concurrent fsync will fail to notice the page when walking the
radix tree's dirty pages.
The approach taken by this patch is to leave the page marked as dirty in the
radix tree while ->writepage is working out what to do with it. This ensures
that a concurrent write-for-sync will successfully locate the page and will
then block in lock_page() until the non-write-for-sync code has finished
altering the page state.
|
|
The address_space.readapges() function currently takes a list of pages,
strung together via page->list. Switch it to using page->lru.
This changes the API into filesystems.
|
|
Now remove address_space.io_pages.
|
|
Move everything over to walking the radix tree via the PAGECACHE_TAG_DIRTY
tag. Remove address_space.dirty_pages.
|
|
Arrange for under-writeback pages to be marked thus in their pagecache radix
tree.
|
|
Intro to these patches:
- Major surgery against the pagecache, radix-tree and writeback code. This
work is to address the O_DIRECT-vs-buffered data exposure horrors which
we've been struggling with for months.
As a side-effect, 32 bytes are saved from struct inode and eight bytes
are removed from struct page. At a cost of approximately 2.5 bits per page
in the radix tree nodes on 4k pagesize, assuming the pagecache is densely
populated. Not all pages are pagecache; other pages gain the full 8 byte
saving.
This change will break any arch code which is using page->list and will
also break any arch code which is using page->lru of memory which was
obtained from slab.
The basic problem which we (mainly Daniel McNeil) have been struggling
with is in getting a really reliable fsync() across the page lists while
other processes are performing writeback against the same file. It's like
juggling four bars of wet soap with your eyes shut while someone is
whacking you with a baseball bat. Daniel pretty much has the problem
plugged but I suspect that's just because we don't have testcases to
trigger the remaining problems. The complexity and additional locking
which those patches add is worrisome.
So the approach taken here is to remove the page lists altogether and
replace the list-based writeback and wait operations with in-order
radix-tree walks.
The radix-tree code has been enhanced to support "tagging" of pages, for
later searches for pages which have a particular tag set. This means that
we can ask the radix tree code "find me the next 16 dirty pages starting at
pagecache index N" and it will do that in O(log64(N)) time.
This affects I/O scheduling potentially quite significantly. It is no
longer the case that the kernel will submit pages for I/O in the order in
which the application dirtied them. We instead submit them in file-offset
order all the time.
This is likely to be advantageous when applications are seeking all over
a large file randomly writing small amounts of data. I haven't performed
much benchmarking, but tiobench random write throughput seems to be
increased by 30%. Other tests appear to be unaltered. dbench may have got
10-20% quicker, but it's variable.
There is one large file which everyone seeks all over randomly writing
small amounts of data: the blockdev mapping which caches filesystem
metadata. The kernel's IO submission patterns for this are now ideal.
Because writeback and wait-for-writeback use a tree walk instead of a
list walk they are no longer livelockable. This probably means that we no
longer need to hold i_sem across O_SYNC writes and perhaps fsync() and
fdatasync(). This may be beneficial for databases: multiple processes
writing and syncing different parts of the same file at the same time can
now all submit and wait upon writes to just their own little bit of the
file, so we can get a lot more data into the queues.
It is trivial to implement a part-file-fdatasync() as well, so
applications can say "sync the file from byte N to byte M", and multiple
applications can do this concurrently. This is easy for ext2 filesystems,
but probably needs lots of work for data-journalled filesystems and XFS and
it probably doesn't offer much benefit over an i_semless O_SYNC write.
These patches can end up making ext3 (even) slower:
for i in 1 2 3 4
do
dd if=/dev/zero of=$i bs=1M count=2000 &
done
runs awfully slow on SMP. This is, yet again, because all the file
blocks are jumbled up and the per-file linear writeout causes tons of
seeking. The above test runs sweetly on UP because the on UP we don't
allocate blocks to different files in parallel.
Mingming and Badari are working on getting block reservation working for
ext3 (preallocation on steroids). That should fix ext3 up.
This patch:
- Later, we'll need to access the radix trees from inside disk I/O
completion handlers. So make mapping->page_lock irq-safe. And rename it
to tree_lock to reliably break any missed conversions.
|