|
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
Fsync currently has a fdatawrite/fdatawait pair around the method call,
and a mutex_lock/unlock of the inode mutex. All callers of fsync have
to duplicate this, but we have a few and most of them don't quite get
it right. This patch adds a new vfs_fsync that takes care of this.
It's a little more complicated than usual, as ->fsync might get a NULL file
pointer and just a dentry from nfsd; but otherwise it gets a file, and we
want to take the mapping and file operations from it when it is there
(a sketch follows the notes below).
Notes on the fsync callers:
- ecryptfs wasn't calling filemap_fdatawrite / filemap_fdatawait on the
lower file
- coda wasn't calling filemap_fdatawrite / filemap_fdatawait on the host
file, and returning 0 when ->fsync was missing
- shm wasn't calling filemap_fdatawrite / filemap_fdatawait, nor
taking i_mutex. Now, given that shared memory doesn't have disk
backing, doing nothing in fsync seems fine, and I left it out of
the vfs_fsync conversion for now; but in that case we might just
not pass it through to the lower file at all and instead call the
no-op simple_sync_file directly.
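A rough sketch of the shape of the new helper (hedged: names follow the
description above, not necessarily the final patch; error handling abridged):

    int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
    {
        const struct file_operations *fop;
        struct address_space *mapping;
        int err, ret;

        if (file) {
            fop = file->f_op;
            mapping = file->f_mapping;
        } else {
            /* nfsd case: no file, just a dentry */
            fop = dentry->d_inode->i_fop;
            mapping = dentry->d_inode->i_mapping;
        }

        if (!fop || !fop->fsync)
            return -EINVAL;

        ret = filemap_fdatawrite(mapping);

        mutex_lock(&mapping->host->i_mutex);
        err = fop->fsync(file, dentry, datasync);
        if (err && !ret)
            ret = err;
        mutex_unlock(&mapping->host->i_mutex);

        err = filemap_fdatawait(mapping);
        if (err && !ret)
            ret = err;
        return ret;
    }
    EXPORT_SYMBOL(vfs_fsync);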
[and now actually export vfs_fsync]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
The first thing mm.h does is include sched.h, solely for the can_do_mlock()
inline function, which has a "current" dereference inside. By dealing with
can_do_mlock(), mm.h can be detached from sched.h, which is good. See below
for why.
This patch
a) removes the unconditional inclusion of sched.h from mm.h
b) makes can_do_mlock() a normal function in mm/mlock.c (see the sketch below)
c) exports can_do_mlock() to not break compilation
d) adds sched.h inclusions back to files that were getting it indirectly
e) adds less bloated headers (asm/signal.h, jiffies.h) to some files that
were getting them indirectly
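The uninlined function would look roughly like this (a sketch following the
pre-patch inline; the exact capability/rlimit checks may differ):

    /* mm/mlock.c - out of line, so mm.h no longer needs sched.h
     * for the "current" dereference */
    int can_do_mlock(void)
    {
        if (capable(CAP_IPC_LOCK))
            return 1;
        if (current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur != 0)
            return 1;
        return 0;
    }
    EXPORT_SYMBOL(can_do_mlock);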
Net result is:
a) mm.h users get less code to open, read, preprocess, parse, ... if they
don't need sched.h
b) sched.h stops being a dependency for a significant number of files:
on x86_64 allmodconfig, touching sched.h results in a recompile of 4083
files; after the patch it's only 3744 (-8.3%).
Cross-compile tested on
all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
alpha alpha-up
arm
i386 i386-up i386-defconfig i386-allnoconfig
ia64 ia64-up
m68k
mips
parisc parisc-up
powerpc powerpc-up
s390 s390-up
sparc sparc-up
sparc64 sparc64-up
um-x86_64
x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig
as well as my two usual configs.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
With the tracking of dirty pages properly done now, msync doesn't need to scan
the PTEs anymore to determine the dirty status.
From: Hugh Dickins <hugh@veritas.com>
In looking to do that, I made some other tidyups: several #includes can be
removed, and sys_msync loop termination was not quite right.
Most of those points are criticisms of the existing sys_msync, not of your
patch. In particular, the loop termination errors were introduced in 2.6.17:
I did notice this shortly before it came out, but decided I was more likely to
get it wrong myself, and make matters worse if I tried to rush a last-minute
fix in. And it's not terribly likely to go wrong, nor disastrous if it does
go wrong (may miss reporting an unmapped area; may also fsync file of a
following vma).
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
A process flag to indicate whether we are doing sync io is incredibly
ugly. It also causes performance problems when one does a lot of async
io and then proceeds to sync it. Part of the io will go out as async,
and the other part as sync. This causes a disconnect between the
previously submitted io and the synced io. For io schedulers such as CFQ,
this causes lost merges and suboptimal scheduling behaviour.
Remove PF_SYNCWRITE completely from the fsync/msync paths, and let
the O_DIRECT path just directly indicate that the writes are sync
by using WRITE_SYNC instead.
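Roughly the shape of the change (a sketch, not the verbatim diff):

    /* before: sync paths flagged the whole task */
    current->flags |= PF_SYNCWRITE;
    ret = file->f_op->fsync(file, file->f_dentry, datasync);
    current->flags &= ~PF_SYNCWRITE;

    /* after: no process flag; the O_DIRECT submission path tags
     * the request itself as synchronous */
    submit_bio(WRITE_SYNC, bio);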
Signed-off-by: Jens Axboe <axboe@suse.de>
|
|
because of a typo. This patch just changes "my" to "by", which I
believe was the original intent.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
No need to duplicate all that code.
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
msync() does a strange thing. Essentially:
    vma = find_vma();
    for ( ; ; ) {
        if (!vma)
            return -ENOMEM;
        ...
        vma = vma->vm_next;
    }
so an msync() request which starts within or before a valid VMA and which ends
within or beyond the final VMA will incorrectly return -ENOMEM.
Fix.
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
It seems bad to hold mmap_sem while performing synchronous disk I/O. Alter
the msync(MS_SYNC) code so that the lock is released while we sync the file.
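A sketch of the resulting pattern (helper names approximate):

    /* pin the file, drop mmap_sem for the disk I/O, then retake it */
    get_file(file);
    up_read(&current->mm->mmap_sem);
    error = do_fsync(file, 0);  /* fdatawrite + ->fsync + fdatawait */
    fput(file);
    down_read(&current->mm->mmap_sem);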
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
It seems sensible to perform dirty page throttling in msync: as the application
dirties pages we can kick off pdflush early, or even force the msync() caller
to perform writeout, or even throttle the msync() caller.
The main effect of this is to start disk writeback earlier if we've just
discovered that a large amount of pagecache has been dirtied. (Otherwise it
wouldn't happen for up to five seconds, next time pdflush wakes up).
It will also cause the page-dirtying process to get penalised for dirtying
those pages, rather than whacking someone else with the problem.
We should do this for munmap() and possibly even exit(), too.
We drop the mmap_sem while performing the dirty page balancing. It doesn't
seem right to hold mmap_sem for that long.
Note that this patch only affects MS_ASYNC. MS_SYNC will be syncing all the
dirty pages anyway.
We note that msync(MS_SYNC) does a full-file-sync inside mmap_sem, and always
has. We can fix that up...
The patch also tightens up the mmap_sem coverage in sys_msync(): no point in
taking it while we perform the incoming arg checking.
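A sketch of the MS_ASYNC path after this change (helper names approximate):

    /* after propagating pte dirty bits, throttle the caller like any
     * other heavy dirtier - without holding mmap_sem */
    error = msync_page_range(vma, start, end);
    up_read(&current->mm->mmap_sem);
    balance_dirty_pages_ratelimited(file->f_mapping);
    down_read(&current->mm->mmap_sem);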
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch converts the inode semaphore to a mutex. I have tested it on
XFS and compiled as much as one can consider on an ia64. Anyway your
luck with it might be different.
Modified-by: Ingo Molnar <mingo@elte.hu>
(finished the conversion)
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
This replaces the (in my opinion horrible) VM_UNPAGED logic with very
explicit support for a "remapped page range" aka VM_PFNMAP. It allows a
VM area to contain an arbitrary range of page table entries that the VM
never touches, and never considers to be normal pages.
Any user of "remap_pfn_range()" automatically gets this new
functionality, and doesn't even have to mark the pages reserved or
indeed mark them any other way. It just works. As a side effect, doing
mmap() on /dev/mem works for arbitrary ranges.
Sparc update from David in the next commit.
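For illustration, a hypothetical driver mmap using it (mydev_mmap is made
up; remap_pfn_range itself is the real interface):

    static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
    {
        /* remap_pfn_range marks the vma VM_PFNMAP, so the VM never
         * treats these ptes as normal pages */
        return remap_pfn_range(vma, vma->vm_start,
                       vma->vm_pgoff,  /* pfn, as for /dev/mem */
                       vma->vm_end - vma->vm_start,
                       vma->vm_page_prot);
    }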
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Although we tend to associate VM_RESERVED with remap_pfn_range, quite a few
drivers set VM_RESERVED on areas which are then populated by nopage. The
PageReserved removal in 2.6.15-rc1 changed VM_RESERVED not to free pages in
zap_pte_range, without changing those drivers not to set it: so their pages
just leak away.
Let's not change miscellaneous drivers now: introduce VM_UNPAGED at the core,
to flag the special areas where the ptes may have no struct page, or if they
have then it's not to be touched. Replace most instances of VM_RESERVED in
core mm by VM_UNPAGED. Force it on in remap_pfn_range, and the sparc and
sparc64 io_remap_pfn_range.
Revert addition of VM_RESERVED to powerpc vdso, it's not needed there. Is it
needed anywhere? It still governs the mm->reserved_vm statistic, and special
vmas not to be merged, and areas not to be core dumped; but could probably be
eliminated later (the drivers are probably specifying it because in 2.4 it
kept swapout off the vma, but in 2.6 we work from the LRU, which these pages
don't get on).
Use the VM_SHM slot for VM_UNPAGED, and define VM_SHM to 0: it serves no
purpose whatsoever, and should be removed from drivers when we clean up.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: William Irwin <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Convert those common loops using page_table_lock on the outside and
pte_offset_map within to use just pte_offset_map_lock within instead.
These all hold mmap_sem (some exclusively, some not), so at no level can a
page table be whipped away from beneath them. But whereas pte_alloc loops
tested with the "atomic" pmd_present, these loops are testing with pmd_none,
which on i386 PAE tests both lower and upper halves.
That's now unsafe, so add a cast into pmd_none to test only the vital lower
half: we lose a little sensitivity to a corrupt middle directory, but not
enough to worry about. It appears that i386 and UML were the only
architectures vulnerable in this way, and pgd and pud are no problem.
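The converted loop shape, roughly (a sketch):

    pte_t *pte;
    spinlock_t *ptl;

    /* per-page-table lock instead of mm->page_table_lock around
     * the whole walk */
    pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
    do {
        /* examine/modify *pte under ptl */
    } while (pte++, addr += PAGE_SIZE, addr != end);
    pte_unmap_unlock(pte - 1, ptl);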
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Remove PageReserved() calls from core code by tightening VM_RESERVED
handling in mm/ to cover PageReserved functionality.
PageReserved special casing is removed from get_page and put_page.
All setting and clearing of PageReserved is retained, and it is now flagged
in the page_alloc checks to help ensure we don't introduce any refcount
based freeing of Reserved pages.
MAP_PRIVATE, PROT_WRITE mapping of VM_RESERVED regions is tentatively being
deprecated. We never completely handled it correctly anyway, and it can be
reintroduced in future if required (Hugh has a proof of concept).
Once PageReserved() calls are removed from kernel/power/swsusp.c, and all
arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can
be trivially removed.
Last real user of PageReserved is swsusp, which uses PageReserved to
determine whether a struct page points to valid memory or not. This still
needs to be addressed (a generic page_is_ram() should work).
A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and
thus mapcounted and counted towards shared rss). These writes to the struct
page could cause excessive cacheline bouncing on big systems. There are a
number of ways this could be addressed if it is an issue.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Refcount bug fix for filemap_xip.c
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Use latency breaking in msync_pte_range like that in copy_pte_range, instead
of the ugly CONFIG_PREEMPT filemap_msync alternatives.
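Roughly the copy_pte_range-style pattern being adopted (a sketch; the
constants are illustrative):

    int progress = 0;
    again:
        spin_lock(&mm->page_table_lock);
        pte = pte_offset_map(pmd, addr);
        do {
            if (progress >= 32) {
                progress = 0;
                if (need_resched() ||
                    need_lockbreak(&mm->page_table_lock))
                    break;
            }
            progress += 8;
            /* ... transfer pte dirty bit to the page ... */
        } while (pte++, addr += PAGE_SIZE, addr != end);
        pte_unmap(pte - 1);
        spin_unlock(&mm->page_table_lock);
        cond_resched();
        if (addr != end)
            goto again;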
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This is not actually a problem, but sync_page_range() is used as an
exported function by filesystems. The msync_xxx naming is more readable,
at least to me.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
It's common practice to msync a large address range regularly, in which
often only a few ptes have actually been dirtied since the previous pass.
sync_pte_range then goes much faster if it tests whether pte is dirty
before locating and accessing each struct page cacheline; and it is hardly
slowed by ptep_clear_flush_dirty repeating that test in the opposite case,
when every pte actually is dirty.
But beware, s390's pte_dirty always says false, since its dirty bit is kept
in the storage key, located via the struct page address. So skip this
optimization in its case: use a pte_maybe_dirty macro which just says true
if page_test_and_clear_dirty is implemented.
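A sketch of the macro and its use (following the description; the real
definitions live in asm-generic/pgtable.h):

    #ifdef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
    /* s390: dirty bit lives in the storage key, the pte can't tell us */
    #define pte_maybe_dirty(pte)    (1)
    #else
    #define pte_maybe_dirty(pte)    pte_dirty(pte)
    #endif

        if (!pte_maybe_dirty(pte))
            continue;   /* skip without touching the struct page */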
Signed-off-by: Abhijit Karmarkar <abhijitk@veritas.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
As a general rule, ask the compiler to inline action_on_pmd_range and
action_on_pud_range: they're not very interesting, and it has a better
chance of eliding them that way. But conversely, it helps debug traces
if action_on_pte_range and the top-level action_on_page_range remain
uninlined.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
To handle large sparse areas a little more efficiently, follow Nick and
move the p?d_none_or_clear_bad tests up from the start of each function
to its callsite.
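Sketched at one level, the test moves from the callee's prologue to its
caller:

    /* caller now skips a hole without the function-call overhead */
    next = pud_addr_end(addr, end);
    if (pud_none_or_clear_bad(pud))
        continue;
    sync_pmd_range(vma, pud, addr, next);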
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Convert filemap_sync pagetable walkers to loops using p?d_addr_end; use
similar loop to split filemap_sync into chunks. Merge filemap_sync_pte
into sync_pte_range, cut filemap_ off the longer names, vma arg first.
There is no error from filemap_sync, nor is any use made of the flags:
if it should do something else for MS_INVALIDATE, reinstate it when that
is implemented. Remove the redundant flush_tlb_range from afterwards:
as its comment noted, each dirty pte has already been flushed.
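One level of the converted walker, roughly (a sketch; sync_pte_range as
named above):

    static inline void sync_pmd_range(struct vm_area_struct *vma, pud_t *pud,
                      unsigned long addr, unsigned long end)
    {
        pmd_t *pmd = pmd_offset(pud, addr);
        unsigned long next;

        do {
            next = pmd_addr_end(addr, end); /* clamp to this pmd's span */
            if (pmd_none_or_clear_bad(pmd))
                continue;
            sync_pte_range(vma, pmd, addr, next);
        } while (pmd++, addr = next, addr != end);
    }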
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Replace the repetitive p?d_none, p?d_bad, p?d_ERROR, p?d_clear clauses
by pgd_none_or_clear_bad, pud_none_or_clear_bad, pmd_none_or_clear_bad
inlines throughout common and i386 - avoids a sprinkling of "unlikely"s.
Tests inline, but unlikely error handling in mm/memory.c - so the ERROR
file and line won't tell much; but it comes too late anyway, and hardly
ever seen outside development.
Let mremap use them in get_one_pte_map, as it already did in _nested;
but leave follow_page and untouched_anonymous_page just skipping _bad
as before - they don't have quite the same ownership of the mm.
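The helpers are, in sketch (the error path lives out of line in
mm/memory.c):

    static inline int pmd_none_or_clear_bad(pmd_t *pmd)
    {
        if (pmd_none(*pmd))
            return 1;
        if (unlikely(pmd_bad(*pmd))) {
            pmd_clear_bad(pmd); /* prints pmd_ERROR, clears the entry */
            return 1;
        }
        return 0;
    }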
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The attached patch, written by Andrew Morton, fixes long scheduling
latencies in filemap_sync().
Has been tested as part of the -VP patchset.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Pass the "we are doing synchronous writes" hint down from msync().
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
We hold i_sem during the various sync() operations to prevent livelocks:
if another thread is dirtying the file, a sync() may never return.
Or at least, that used to be true when we were using the per-address_space
page lists. Since writeback has used radix tree traversal it is not possible
to livelock the sync() operations, because they only visit each page a single
time.
sync_page_range() (used by O_SYNC writes) has not been holding i_sem for quite
some time, for the above reasons.
The patch converts fsync(), fdatasync() and msync() to also not hold i_sem
during the radix-tree-based writeback.
Now, we _do_ still need to hold i_sem across the file->f_op->fsync() call,
because that is still based on a list_head walk, and is still livelockable.
But in the case of msync() I deliberately left i_sem untaken. This is because
we're currently deadlockable in msync, because mmap_sem is already held, and
mmap_sem nests inside i_sem, due to direct-io.c.
And yes, the ranking of down_read() versus down() does matter:
    Task A              Task B              Task C
    down_read(rwsem)
                        down(sem)
                                            down_write(rwsem)
    down(sem)
                        down_read(rwsem)
C's down_write() will cause B's down_read to block. B holds `sem', so A will
never release `rwsem'.
So the patch fixes a hard-to-hit triple-task deadlock, but adds a possible
livelock in msync(). It is possible to fix sys_msync() so that it takes i_sem
outside mmap_sem. Later.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Extend the Linux MM to 4level page tables.
This is the core patch for mm/*, fs/*, include/linux/*
It breaks all architectures, which will be fixed in separate patches.
The conversion is quite straightforward. All the functions walking the page
table hierarchy have been changed to deal with another level at the top. The
additional level is called pml4.
mm/memory.c has changed a lot because it did most of the heavy lifting here.
Most of the changes here are extensions of the previous code.
Signed-off-by: Andi Kleen <ak@suse.de>
Converted by Nick Piggin to use the pud_t 'page upper' level between pgd
and pmd instead of Andi's pml4 level above pgd.
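So a full descent now reads, schematically:

    pgd = pgd_offset(mm, addr);
    pud = pud_offset(pgd, addr);    /* the new 'page upper' level */
    pmd = pmd_offset(pud, addr);    /* previously took the pgd */
    pte = pte_offset_map(pmd, addr);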
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I found that the prototypes for sys_waitid and sys_fcntl in
<linux/syscalls.h> don't match the implementation. In order to keep all
prototypes in sync in the future, now include the header from each file
implementing any syscall.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch cleans up needless includes of asm/pgalloc.h from the fs/
kernel/ and mm/ subtrees. Compile tested on multiple ARM platforms, and
x86, this patch appears safe.
This patch is part of a larger patch aiming towards getting the include of
asm/pgtable.h out of linux/mm.h, so that asm/pgtable.h can sanely get at
things like mm_struct and friends.
I suggest testing in -mm for a while to ensure there aren't any hidden arch
issues.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: David Gibson <david@gibson.dropbear.id.au>
Currently, calling msync() on a hugepage area will cause the kernel to blow
up with a bad_page() (at least on ppc64, but I think the problem will exist
on other archs too). The msync path attempts to walk pagetables which may
not be there, or may have an unusual layout for hugepages.
Luckily we shouldn't need to do anything for an msync on hugetlbfs beyond
flushing the cache, so this patch should be sufficient to fix the problem.
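The fix amounts to something like this (a sketch):

    /* hugepage areas have no normal pagetables to walk */
    if (is_vm_hugetlb_page(vma))
        return 0;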
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
While searching for an s390 tlb flush problem I noticed some superfluous tlb
flushes. One in zeromap_page_range, one in remap_page_range, and another one
in filemap_sync. The patch just adds comments but I think these three
flush_tlb_range calls can be removed.
|
|
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
this is another s/390 related mm patch. It introduces the concept of
physical dirty and referenced bits into the common mm code. I always
had the nagging feeling that the pte functions for setting/clearing
the dirty and referenced bits are not appropriate for s/390. It works,
but it is a bit of a hack.
After the wake of rmap it is now possible to put a much better solution
into place. The idea is simple: since there are no dirty/referenced
bits in the pte, make these functions nops on s/390 and add operations
on the physical page at the appropriate places. For the referenced bit
this is the page_referenced() function. For the dirty bit there are
two relevant spots: in page_remove_rmap after the last user of the
page removed its reverse mapping, and in try_to_unmap after the last
user was unmapped. There are two new functions to accomplish this:
* page_test_and_clear_dirty: Test and clear the dirty bit of a
physical page. This function is analogous to ptep_test_and_clear_dirty
but gets a struct page as argument instead of a pte_t pointer.
* page_test_and_clear_young: Test and clear the referenced bit
of a physical page. This function is analogous to ptep_test_and_clear_young
but gets a struct page as argument instead of a pte_t pointer.
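The dirty-bit hook placement would then be roughly (a sketch):

    /* in page_remove_rmap()/try_to_unmap(), once the last mapping
     * of the page is gone */
    if (page_test_and_clear_dirty(page))
        set_page_dirty(page);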
It's pretty straightforward, and with it the s/390 mm makes much more
sense. You'll need the tlb flush optimization patch for this patch.
Comments?
|
|
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
On the s/390 architecture we still have the issue with tlb flushing and the
ipte instruction. We can optimize the tlb flushing a lot with some minor
interface changes between the arch backend and the memory management core.
In the end the whole thing is about the Invalidate Page Table Entry (ipte)
instruction. The instruction sets the invalid bit in the pte and removes the
tlb for the page on all cpus for the virtual to physical mapping of the page
in a particular address space. The nice thing is that only the tlb for this
page gets removed, all the other tlbs stay valid. The reason we can't use
ipte to implement flush_tlb_page() is one of the requirements of the
instruction: the pte that should get flushed needs to be *valid*.
I'd like to add the following four functions to the mm interface:
* ptep_establish: Establish a new mapping. This sets a pte entry to a
page table and flushes the tlb of the old entry on all cpus if it
exists. This is more or less what establish_pte in mm/memory.c does
right now but without the update_mmu_cache call.
* ptep_test_and_clear_and_flush_young. Do what ptep_test_and_clear_young
does and flush the tlb.
* ptep_test_and_clear_and_flush_dirty. Do what ptep_test_and_clear_dirty
does and flush the tlb.
* ptep_get_and_clear_and_flush: Do what ptep_get_and_clear does and
flush the tlb.
The s/390 specific functions in include/pgtable.h define their own optimized
version of these four functions by use of the ipte.
To avoid defining these functions for every architecture, I added them
to include/asm-generic/pgtable.h. Since i386/x86 and others don't include
this header yet and define their own versions of the functions found there,
I #ifdef'd all functions in include/asm-generic/pgtable.h to be able to pick
the ones that are needed for each architecture (see patch for details).
With the new functions in place it is easy to do the optimization, e.g. the
sequence

    ptep_get_and_clear(ptep);
    flush_tlb_page(vma, address);

gets replaced by

    ptep_get_and_clear_and_flush(vma, address, ptep);

The old sequence still works but it is suboptimal on s/390.
|
|
From: viro@parcelfarce.linux.theplanet.co.uk <viro@parcelfarce.linux.theplanet.co.uk>
In a bunch of places we used file->f_dentry->d_inode->i_sem to protect
fdatasync et al. Replaced with the correct file->f_mapping->host->i_sem -
the object we are protecting is the address_space, so we want an exclusion
that would work for a redirected ->i_mapping. For normal files (not coda,
not bdev) it's all the same, of course - there we have

    file->f_mapping->host == file->f_dentry->d_inode

and the change above is an equivalent transformation.
|
|
MS_ASYNC will currently wait on previously-submitted I/O, then start new I/O
and not wait on it. This can cause undesirable blocking if msync is called
rapidly against the same memory.
So instead, change msync(MS_ASYNC) to not start any IO at all. Just flush
the pte dirty bits into the pageframe and leave it at that.
The IO _will_ happen within a kupdate period. And the application can use
fsync() or fadvise(FADV_DONTNEED) if it actually wants to schedule the IO
immediately.
(This has triggered an ext3 bug - the page's buffers get dirtied so fast
that kjournald keeps writing the buffers over and over for 10-20 seconds
before deciding to give up for some reason)
|
|
Tuned for gcc-2.95.3:
filemap.c: 10815 -> 10046
highmem.c: 3392 -> 3104
mmap.c: 5998 -> 5854
mremap.c: 3058 -> 2802
msync.c: 1521 -> 1489
page_alloc.c: 8487 -> 8167
|
|
From Christoph Hellwig.
Make filemap_sync() static, and not exported to modules
|
|
From Anton Blanchard. This fixes a couple of Linux Test Project
failures.
- Returns EBUSY if the caller is trying to invalidate memory which is
covered by a locked vma.
The Open Group says:
[EBUSY]
Some or all of the addresses in the range starting
at addr and continuing for len bytes are locked,
and MS_INVALIDATE is specified.
- Returns EINVAL if the caller specified both MS_SYNC and MS_ASYNC
[EINVAL]
The value of flags is invalid.
and:
"Either MS_ASYNC or MS_SYNC is specified, but not both."
|
|
This is a performance and correctness fix against the writeback paths.
The writeback code has competing requirements. Sometimes it is used
for "memory cleansing": kupdate, bdflush, writer throttling, page
allocator writeback, etc. And sometimes this same code is used for
data integrity purposes: fsync, msync, fdatasync, sync, umount, various
other kernel-internal uses.
The problem is: how to handle a dirty buffer or page which is currently
under writeback.
For memory cleansing, we just want to skip that buffer/page and go onto
the next one. But for sync, we must wait on the old writeback and then
start new writeback.
mpage_writepages() is currently correct for cleansing, but incorrect for
sync. block_write_full_page() is currently correct for sync, but
inefficient for cleansing.
The fix is fairly simple.
- In mpage_writepages(), don't skip the page if it's a sync
operation.
- In block_write_full_page(), skip the buffer if it is a memory-cleansing
operation. And return -EAGAIN to tell the caller that the writeout
didn't work out. The caller must then set the page dirty again and
move it onto mapping->dirty_pages.
This is an extension of the writepage API: writepage can now return
EAGAIN. There are only three callers, and they have been updated.
fail_writepage() and ext3_writepage() were actually doing this by
hand. They have been changed to return -EAGAIN. NTFS will want to
be able to return -EAGAIN from its writepage as well.
- A sticky question is: how to tell the writeout code which mode it
is operating in? Cleansing or sync?
It's such a tiny code change that I didn't have the heart to go and
propagate a `mode' argument down every instance of writepages() and
writepage() in the kernel. So I passed it in via current->flags (see
the sketch after this list).
Incidentally, the occurrence of a locked-and-dirty buffer in
block_write_full_page() is fairly rare: normally the collision avoidance
happens at the address_space level, via PageWriteback. But some
mappings (blockdevs, ext3 files, etc) have their dirty buffers written
out via submit_bh(). It is these buffers which can stall
block_write_full_page().
This wart will be pretty intrusive to fix. ext3 needs to become fully
page-based (ugh. It's a block-based journalling filesystem, and pages
are unnatural). blockdev mappings are still written out by buffers
because that's how filesystems use them. Putting _all_ metadata
(indirects, inodes, superblocks, etc) into standalone address_spaces
would fix that up.
- filemap_fdatawrite() sets PF_SYNC. So filemap_fdatawrite() is the
kernel function which will start writeback against a mapping for
"data integrity" purposes, whereas the unexported, internal-only
do_writepages() is the writeback function which is used for memory
cleansing. This difference is the reason why I didn't consolidate
those functions ages ago...
- Lots of code paths had a bogus extra call to filemap_fdatawait(),
which I previously added in a moment of weak-headedness. They have
all been removed.
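A sketch of the mode check described above, as it might appear in
block_write_full_page() (hedged; not the verbatim patch):

    /* a locked, dirty buffer stalls only data-integrity writeback */
    if (buffer_locked(bh)) {
        if (current->flags & PF_SYNC)
            wait_on_buffer(bh); /* sync: wait, then write it */
        else {
            redirty = 1;        /* cleansing: skip it */
            continue;
        }
    }
    ...
    return redirty ? -EAGAIN : 0;   /* caller redirties the page */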
|
|
SuSv3 says: "The msync() function shall fail if:
[EBUSY]
Some or all of the addresses in the range starting at addr and
continuing for len bytes are locked, and MS_INVALIDATE is
specified.
[EINVAL]
The value of flags is invalid.
[EINVAL]
The value of addr is not a multiple of the page size {PAGESIZE}.
[ENOMEM]
The addresses in the range starting at addr and continuing for len
bytes are outside the range allowed for the address space of a process
or specify one or more pages that are not mapped."
This fixes the error code of msync() in the EINVAL case.
|
|
Heaven knows why, but that's what the Open Group says, and returning
-EFAULT causes 2.5 to fail one of the Linux Test Project tests.
[ENOMEM]
The addresses in the range starting at addr and continuing
for len bytes are outside the range allowed for the address
space of a process or specify one or more pages that are not
mapped.
2.4 has it right, but 2.5 doesn't.
|
|
This patch removes VALID_PAGE(), as the test was always too late for
discontiguous memory configurations. It is replaced with pfn_valid()/
virt_addr_valid(), which are used to test the original input value.
Other helper functions:
pte_pfn() - extract the page number from a pte
pfn_to_page()/page_to_pfn() - convert a page number to/from a page struct
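Typical conversion, sketched:

    unsigned long pfn = pte_pfn(pte);

    if (!pfn_valid(pfn))    /* replaces the too-late VALID_PAGE(page) */
        return NULL;
    page = pfn_to_page(pfn);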
|
|
- Fixes a performance problem - callers of
prepare_write/commit_write, etc are locking pages, which synchronises
them behind writeback, which also locks these pages. Significant
slowdowns for some workloads.
- So pages are no longer locked while under writeout. Introduce a
new PG_writeback and associated infrastructure to support this design
change.
- Pages which are under read I/O still use PageLocked. Pages which
are under write I/O have PageWriteback() true.
I considered creating Page_IO instead of PageWriteback, and marking
both readin and writeout pages as PageIO(). So pages are unlocked
during both read and write. There just doesn't seem a need to do
this - nobody ever needs unblocking access to a page which is under
read I/O.
- Pages under swapout (brw_page) are PageLocked, not PageWriteback.
So their treatment is unchanged.
It's not obvious that pages which are under swapout actually need
the more asynchronous behaviour of PageWriteback.
I was setting the swapout pages PageWriteback and unlocking them
prior to submitting the buffers in brw_page(). This led to deadlocks
on the exit_mmap->zap_page_range->free_swap_and_cache path. These
functions call block_flushpage under spinlock. If the page is
unlocked but has locked buffers, block_flushpage->discard_buffer()
sleeps. Under spinlock. So that will need fixing if for some reason
we want swapout to use PageWriteback.
Kernel has called block_flushpage() under spinlock for a long time.
It is assuming that a locked page will never have locked buffers.
This appears to be true, but it's ugly.
- Adds new function wait_on_page_writeback(). Renames wait_on_page()
to wait_on_page_locked() to remind people that they need to call the
appropriate one.
- Renames filemap_fdatasync() to filemap_fdatawrite(). It's more
accurate - "sync" implies, if anything, writeout and wait (fsync,
msync). Or maybe just writeout; it's not clear.
- Subtly changes the filemap_fdatawrite() internals - this function
used to do a lock_page() - it waited for any other user of the page
to let go before submitting new I/O against a page. It has been
changed to simply skip over any pages which are currently under
writeback.
This is the right thing to do for memory-cleansing reasons.
But it's the wrong thing to do for data consistency operations (eg,
fsync()). For those operations we must ensure that all data which
was dirty *at the time of the system call* is on disk before
the call returns.
So all places which care about this have been converted to do:
    filemap_fdatawait(mapping);  /* Wait for current writeback */
    filemap_fdatawrite(mapping); /* Write all dirty pages */
    filemap_fdatawait(mapping);  /* Wait for I/O to complete */
- Fixes a truncate_inode_pages problem - truncate currently will
block when it hits a locked page, so it ends up getting into lockstep
behind writeback and all of the file is pointlessly written back.
One fix for this is for truncate to simply walk the page list in the
opposite direction from writeback.
I chose to use a separate cleansing pass. It is more
CPU-intensive, but it is surer and clearer. This is because there is
no reason why the per-address_space ->vm_writeback and
->writeback_mapping functions *have* to perform writeout in
->dirty_pages order. They may choose to do something totally
different.
(set_page_dirty() is an a_op now, so address_spaces could almost
privatise the whole dirty-page handling thing. Except
truncate_inode_pages and invalidate_inode_pages assume that the pages
are on the address_space lists. hmm. So making truncate_inode_pages
and invalidate_inode_pages a_ops would make some sense).
|
|
Moved cache flushing routines from asm/pgtable.h and/or asm/pgalloc.h to
asm/cacheflush.h, and tlb flushing routines to asm/tlbflush.h.
|
|
|
|
changes.
|