| Age | Commit message | Author |
|
Add remove_from_swap
remove_from_swap() restores, for anonymous pages, the pte entries that existed
before page migration, by walking the reverse maps. This reduces swap use and
re-establishes regular ptes without the need for page faults.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
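
A minimal userspace sketch of the idea (not the kernel code; every type and
helper below is a simplified stand-in): an anonymous page's anon_vma lists the
vmas that may map it, so walking that list and computing the page's one
possible address in each vma lets matching swap entries be turned back into
regular ptes without waiting for faults.

/* Toy model of restoring ptes via a reverse-map walk (not kernel code). */
#include <stdio.h>

#define PAGE_SHIFT 12

enum pte_kind { PTE_NONE, PTE_SWAP, PTE_PRESENT };

struct toy_pte { enum pte_kind kind; unsigned long val; };

struct toy_vma {
    unsigned long vm_start, vm_end;  /* virtual range covered */
    unsigned long vm_pgoff;          /* page offset of vm_start */
    struct toy_pte *ptes;            /* one entry per page in range */
    struct toy_vma *next;            /* next vma sharing the anon_vma */
};

struct toy_anon_vma { struct toy_vma *head; };

struct toy_page {
    unsigned long index;             /* page offset key within the vmas */
    unsigned long swap_entry;        /* swap slot holding the old contents */
    unsigned long pfn;               /* frame to map back in */
    struct toy_anon_vma *anon_vma;
};

/* Address at which 'page' would live in 'vma', if it lies in range. */
static int page_address_in_vma(struct toy_page *page, struct toy_vma *vma,
                               unsigned long *addr)
{
    unsigned long a = vma->vm_start +
                      ((page->index - vma->vm_pgoff) << PAGE_SHIFT);
    if (a < vma->vm_start || a >= vma->vm_end)
        return -1;
    *addr = a;
    return 0;
}

/* Walk the reverse map: for every vma that may map the page, turn a
 * matching swap entry back into a present pte, avoiding later faults. */
static void remove_from_swap(struct toy_page *page)
{
    for (struct toy_vma *vma = page->anon_vma->head; vma; vma = vma->next) {
        unsigned long addr;
        if (page_address_in_vma(page, vma, &addr))
            continue;
        struct toy_pte *pte = &vma->ptes[(addr - vma->vm_start) >> PAGE_SHIFT];
        if (pte->kind == PTE_SWAP && pte->val == page->swap_entry) {
            pte->kind = PTE_PRESENT;
            pte->val = page->pfn;    /* regular pte again: no fault needed */
        }
    }
}

int main(void)
{
    struct toy_pte ptes[4] = { [2] = { PTE_SWAP, 77 } };
    struct toy_vma vma = { 0x10000, 0x14000, 0x10, ptes, NULL };
    struct toy_anon_vma av = { &vma };
    struct toy_page page = { .index = 0x12, .swap_entry = 77, .pfn = 1234,
                             .anon_vma = &av };
    remove_from_swap(&page);
    printf("pte kind=%d pfn=%lu\n", ptes[2].kind, ptes[2].val);
    return 0;
}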
|
|
Add direct migration support with fall back to swap.
This adds direct migration support on top of the swap-based page migration
facility. It allows the direct migration of anonymous pages, and the migration
of file-backed pages by dropping the associated buffers (which requires
writeout). If that is not possible, fall back to swapping the page out.
The patch is based on lots of patches from the hotplug project but the code
was restructured, documented and simplified as much as possible.
Note that an additional patch that defines the migrate_page() method for
filesystems is necessary in order to avoid writeback for anonymous and
file-backed pages.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
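
A hedged control-flow sketch of the fallback, with invented helper names:
attempt to move the page contents directly to a new page, and if that cannot
be done (for example, buffers that cannot be dropped without writeout), fall
back to the existing swap-out path.

/* Sketch of "direct migration, else swap out" (illustrative names only). */
#include <stdbool.h>
#include <stdio.h>

struct page { bool anonymous; bool buffers_droppable; };

/* Pretend helper standing in for the real migration machinery. */
static bool migrate_page_direct(struct page *p)
{
    /* Anonymous pages and file pages whose buffers can be dropped
     * (possibly after writeout) can be copied straight to a new page. */
    return p->anonymous || p->buffers_droppable;
}

static void swap_out(struct page *p) { (void)p; puts("fell back to swap"); }

static void migrate_one(struct page *p)
{
    if (migrate_page_direct(p))
        puts("migrated directly");
    else
        swap_out(p);          /* fallback keeps migration always possible */
}

int main(void)
{
    struct page a = { true, false }, f = { false, false };
    migrate_one(&a);          /* anonymous: direct */
    migrate_one(&f);          /* stubborn file page: swap fallback */
    return 0;
}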
|
|
Optimise rmap functions by minimising atomic operations when we know there
will be no concurrent modifications.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
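
One common shape of this optimisation, sketched with toy types (the helpers
only loosely model the kind of split the patch makes): a page that is brand
new and not yet visible to any other CPU can have its map count set with a
plain store, while the atomic read-modify-write is kept for pages that may
already be shared.

/* Sketch: plain init for a page nobody else can see yet, atomic inc later. */
#include <stdatomic.h>
#include <stdio.h>

struct toy_page { atomic_int mapcount; };    /* -1 means "not mapped" */

/* New, not yet visible to other CPUs: no atomic RMW needed. */
static void page_add_new_anon_rmap(struct toy_page *p)
{
    atomic_store_explicit(&p->mapcount, 0, memory_order_relaxed);
}

/* Already visible (e.g. second mapping from fork): must be atomic. */
static void page_add_anon_rmap(struct toy_page *p)
{
    atomic_fetch_add(&p->mapcount, 1);
}

int main(void)
{
    struct toy_page p = { ATOMIC_VAR_INIT(-1) };
    page_add_new_anon_rmap(&p);   /* cheap path for a freshly allocated page */
    page_add_anon_rmap(&p);       /* later, concurrent-safe path */
    printf("mapcount=%d\n", atomic_load(&p.mapcount));
    return 0;
}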
|
|
Some users (hi Zwane) have seen a problem when running a workload that
eats nearly all of physical memory: the system does an OOM kill, even
when there is still a lot of swap free.
The problem appears to be a very big task that is holding the swap
token, and the VM has a very hard time finding any other page in the
system that is swappable.
Instead of ignoring the swap token when sc->priority reaches 0, we could
simply take the swap token away from the memory hog and make sure we
don't give it back to the memory hog for a few seconds.
This patch resolves the problem Zwane ran into.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
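
A rough sketch of the take-the-token-away idea, with invented names and a
userspace clock standing in for the kernel's: when reclaim becomes desperate,
the token is taken from the current holder and that mm is refused the token
again for a short hold-off interval.

/* Sketch of stealing the swap token from a memory hog (illustrative only). */
#include <stdio.h>
#include <time.h>

struct mm { const char *name; time_t token_banned_until; };

static struct mm *token_holder;           /* mm currently owning the token */
#define TOKEN_HOLDOFF_SECS 2

static void grab_swap_token(struct mm *mm)
{
    time_t now = time(NULL);
    if (now < mm->token_banned_until)
        return;                           /* recently punished hog: no token */
    token_holder = mm;
}

/* Called when reclaim has reached its lowest priority and is desperate. */
static void punish_token_holder(void)
{
    if (!token_holder)
        return;
    token_holder->token_banned_until = time(NULL) + TOKEN_HOLDOFF_SECS;
    token_holder = NULL;                  /* hog's pages become reclaimable */
}

int main(void)
{
    struct mm hog = { "hog", 0 };
    grab_swap_token(&hog);
    punish_token_holder();
    grab_swap_token(&hog);                /* denied during the hold-off */
    printf("holder=%s\n", token_holder ? token_holder->name : "none");
    return 0;
}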
|
|
rmap's page_check_address now descends without page_table_lock. First it just
does pte_offset_map, in case there's no pte present worth locking for; then it
takes page_table_lock for the full check, and passes ptl back to the caller in
the same style as pte_offset_map_lock. __xip_unmap, page_referenced_one and
try_to_unmap_one use pte_unmap_unlock; so does try_to_unmap_cluster.
page_check_address is reformatted to avoid progressive indentation. No use is
made of its one error code, so it now simply returns NULL when it fails.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
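
A condensed sketch of the lock-late pattern, using toy types and a pthread
mutex in place of the page_table_lock: peek without the lock, bail out if
there is nothing worth locking for, then lock, re-check, and return to the
caller with the lock still held, in the spirit of pte_offset_map_lock.

/* Sketch: check cheaply without the lock, then lock and re-check,
 * returning with the lock held only on success (toy types throughout). */
#include <pthread.h>
#include <stdio.h>

struct toy_pte { unsigned long pfn; int present; };

struct toy_mm {
    pthread_mutex_t page_table_lock;
    struct toy_pte pte;                     /* pretend one-entry page table */
};

/* On success: returns the pte with *ptlp pointing to the held lock.
 * On failure: returns NULL with the lock not held. */
static struct toy_pte *page_check_address(struct toy_mm *mm, unsigned long pfn,
                                          pthread_mutex_t **ptlp)
{
    struct toy_pte *pte = &mm->pte;
    if (!pte->present || pte->pfn != pfn)   /* unlocked peek: nothing there */
        return NULL;

    pthread_mutex_lock(&mm->page_table_lock);
    if (pte->present && pte->pfn == pfn) {  /* still true under the lock */
        *ptlp = &mm->page_table_lock;       /* caller unlocks, as with
                                               pte_unmap_unlock */
        return pte;
    }
    pthread_mutex_unlock(&mm->page_table_lock);
    return NULL;
}

int main(void)
{
    struct toy_mm mm = { PTHREAD_MUTEX_INITIALIZER, { 42, 1 } };
    pthread_mutex_t *ptl;
    struct toy_pte *pte = page_check_address(&mm, 42, &ptl);
    if (pte) {
        printf("found pfn %lu with lock held\n", pte->pfn);
        pthread_mutex_unlock(ptl);
    }
    return 0;
}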
|
|
- generic_file* file operations no longer have an xip/non-xip split
- filemap_xip.c implements a new set of fops that require the get_xip_page
aop to work properly. All new fops are exported GPL-only (we don't want
non-GPL modules using them)
- __xip_unmap now uses page_check_address, which is no longer static
in rmap.c and is declared in linux/rmap.h
- mm/filemap.h is now much cleaner, containing just Linus' inline funcs
moved there from filemap.c
- fix includes in filemap_xip to make it build cleanly on i386
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The token-based thrashing control patches introduced a problem: when a task
which doesn't hold the token tries to run direct-reclaim, that task is told
that pages which belong to the token-holding mm are referenced, even though
they are not. This means that it is possible for a huge number of a
non-token-holding mm's pages to be scanned to no effect. Eventually, we give
up and go and oom-kill something.
So the patch arranges for the thrashing control logic to be defeated if the
caller has reached the highest level of scanning priority.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
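
A tiny sketch of the escape hatch (invented names): the bias that treats the
token holder's pages as recently referenced is simply skipped once the caller
has scanned all the way down to the last priority level.

/* Sketch: ignore the swap-token bias once reclaim is at its last priority. */
#include <stdbool.h>
#include <stdio.h>

struct mm { int id; };

static struct mm *swap_token_mm;      /* current token holder, if any */

/* Would this mm's pages be treated as recently referenced? */
static bool protected_by_token(struct mm *mm, int priority)
{
    if (priority == 0)
        return false;                 /* desperate: the token no longer shields */
    return mm == swap_token_mm;
}

int main(void)
{
    struct mm hog = { 1 };
    swap_token_mm = &hog;
    printf("priority 6: %d, priority 0: %d\n",
           protected_by_token(&hog, 6), protected_by_token(&hog, 0));
    return 0;
}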
|
|
With this we're back to the times when changing skbuff.h only triggers a
rebuild of _net_-related stuff 8)
This uncovered a bug in rmap.h, which was not including mm.h to get the
definition of struct vm_area_struct and was only working by luck.
Signed-off-by: Arnaldo Carvalho de Melo <acme@conectiva.com.br>
Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Swapoff can make good use of a page's anon_vma and index, while it's still
left in swapcache, or once it's brought back in and the first pte mapped back:
unuse_vma can go directly to just one address in each of only those vmas with
the same anon_vma. And unuse_process can skip any vmas without an anon_vma
(extending the hugetlb check: hugetlb vmas have no anon_vma).
This just hacks in on top of the existing procedure, still going through all
the vmas of all the mms in mmlist. A more elegant procedure might replace
mmlist by a list of anon_vmas: but that would be more work to implement, with
apparently more overhead in the common paths.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
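
A small sketch of the narrowing, with toy types: a vma with no anon_vma, or a
different anon_vma, cannot hold the page and is skipped outright, and a
matching vma has only one address worth probing, derived from the page's
index.

/* Sketch: swapoff narrowing its search using the page's anon_vma/index. */
#include <stdio.h>

#define PAGE_SHIFT 12

struct toy_anon_vma { int id; };

struct toy_vma {
    unsigned long vm_start, vm_end, vm_pgoff;
    struct toy_anon_vma *anon_vma;      /* NULL for hugetlb/file-only vmas */
};

struct toy_page { unsigned long index; struct toy_anon_vma *anon_vma; };

/* Return the single candidate address, or 0 if this vma can be skipped. */
static unsigned long unuse_candidate(struct toy_vma *vma, struct toy_page *page)
{
    if (!vma->anon_vma || vma->anon_vma != page->anon_vma)
        return 0;                       /* cannot hold this anon page: skip */
    unsigned long addr = vma->vm_start +
                         ((page->index - vma->vm_pgoff) << PAGE_SHIFT);
    if (addr < vma->vm_start || addr >= vma->vm_end)
        return 0;
    return addr;                        /* the only address worth probing */
}

int main(void)
{
    struct toy_anon_vma av = { 1 }, other = { 2 };
    struct toy_page page = { .index = 3, .anon_vma = &av };
    struct toy_vma match = { 0x20000, 0x30000, 0, &av };
    struct toy_vma skip  = { 0x40000, 0x50000, 0, &other };
    printf("match: %#lx  skip: %#lx\n",
           unuse_candidate(&match, &page), unuse_candidate(&skip, &page));
    return 0;
}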
|
|
The pte_chains rmap used pte_chain_lock (bit_spin_lock on PG_chainlock) to
lock its pte_chains. We kept this (as page_map_lock: bit_spin_lock on
PG_maplock) when we moved to objrmap. But the file objrmap locks its vma tree
with mapping->i_mmap_lock, and the anon objrmap locks its vma list with
anon_vma->lock: so isn't the page_map_lock superfluous?
Pretty much, yes. The mapcount was protected by it, and needs to become
atomic: starting at -1 like the page _count, so that nr_mapped can be tracked
precisely up and down. The last page_remove_rmap can't clear the anon page
mapping any more, because of races with page_add_rmap; some BUG_ONs must go
for the same reason, but they've served their purpose.
vmscan decisions are naturally racy, so there is little change there beyond
removing page_map_lock/unlock. But to stabilize the file-backed page->mapping
against truncation while acquiring i_mmap_lock, page_referenced_file now needs
the page lock to be held even for refill_inactive_zone. There's a similar
issue in acquiring anon_vma->lock, where the page lock doesn't help: this
patch pretends to handle it, but it actually needs the next one.
Roughly 10% is cut off lmbench fork numbers on my 2*HT*P4. I must confess my
testing failed to show the races even while they were knowingly exposed: this
would benefit from testing on racier equipment.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
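
A sketch of the counting convention with toy types (names are illustrative):
mapcount becomes an atomic starting at -1, so the add that raises it to 0 and
the remove that drops it back below 0 are the exact points at which nr_mapped
is adjusted, with no page_map_lock involved.

/* Sketch: atomic mapcount starting at -1, tracking nr_mapped precisely. */
#include <stdatomic.h>
#include <stdio.h>

static atomic_long nr_mapped;                    /* pages with >=1 mapping */

struct toy_page { atomic_int mapcount; };        /* -1 == not mapped */

static void page_add_rmap(struct toy_page *p)
{
    if (atomic_fetch_add(&p->mapcount, 1) == -1) /* first mapping appears */
        atomic_fetch_add(&nr_mapped, 1);
}

static void page_remove_rmap(struct toy_page *p)
{
    if (atomic_fetch_sub(&p->mapcount, 1) == 0)  /* last mapping went away */
        atomic_fetch_sub(&nr_mapped, 1);
}

static int page_mapped(struct toy_page *p)
{
    return atomic_load(&p->mapcount) >= 0;
}

int main(void)
{
    struct toy_page p = { ATOMIC_VAR_INIT(-1) };
    page_add_rmap(&p);
    page_add_rmap(&p);
    page_remove_rmap(&p);
    printf("mapped=%d nr_mapped=%ld\n", page_mapped(&p),
           atomic_load(&nr_mapped));
    return 0;
}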
|
|
From: Hugh Dickins <hugh@veritas.com>
Andrea Arcangeli's anon_vma object-based reverse mapping scheme for anonymous
pages. Instead of tracking anonymous pages by pte_chains or by mm, this
tracks them by vma. But because vmas are frequently split and merged
(particularly by mprotect), a page cannot point directly to its vma(s), but
instead to an anon_vma list of those vmas likely to contain the page - a list
on which vmas can easily be linked and unlinked as they come and go. The vmas
on one list are all related, either by forking or by splitting.
This has three particular advantages over anonmm: it can cope effortlessly
with mremap moves; it no longer needs page_table_lock to protect an mm's vma
tree, since try_to_unmap finds vmas via page -> anon_vma -> vma instead of
using find_vma; and it should use less cpu for swapout, since it can locate
its anonymous vmas more quickly.
It does have disadvantages too: a lot more change in mmap.c to deal with
anon_vmas, though these are small, straightforward additions now that the vma
merging has been refactored there; more lowmem needed for each anon_vma and
vma structure; and an additional restriction on the merging of vmas (they
cannot be merged if already assigned different anon_vmas, since their pages
would then be pointing to different heads).
(There would be no need to enlarge the vma structure if anonymous pages
belonged only to anonymous vmas; but private file mappings accumulate
anonymous pages by copy-on-write, so need to be listed in both anon_vma and
prio_tree at the same time. A different implementation could avoid that by
using anon_vmas only for purely anonymous vmas, and use the existing prio_tree
to locate cow pages - but that would involve a long search for each single
private copy, probably not a good idea.)
Where before the vm_pgoff of a purely anonymous (not file-backed) vma was
meaningless, now it represents the virtual start address at which that vma is
mapped - which the standard file pgoff manipulations treat linearly as vmas
are split and merged. But if mremap moves the vma, then it generally carries
its original vm_pgoff to the new location, so pages shared with the old
location can still be found. Magic.
Hugh has massaged it somewhat: building on the earlier rmap patches, this
patch is a fifth of the size of Andrea's original anon_vma patch. Please note
that this posting will be his first sight of this patch, which he may or may
not approve.
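
A small sketch of the vm_pgoff trick for anonymous vmas (toy types; the
arithmetic is the point): vm_pgoff records the virtual start in page units, a
split advances the right-hand piece's vm_pgoff linearly, and a page's index
still resolves to the same address in whichever piece now covers it.

/* Sketch: anon vma vm_pgoff behaving like a file offset under split. */
#include <stdio.h>

#define PAGE_SHIFT 12

struct toy_vma { unsigned long vm_start, vm_end, vm_pgoff; };

/* Standard file-style pgoff -> address mapping, reused for anon vmas. */
static unsigned long vma_address(struct toy_vma *vma, unsigned long index)
{
    return vma->vm_start + ((index - vma->vm_pgoff) << PAGE_SHIFT);
}

/* Split at 'addr', adjusting the second half's pgoff linearly. */
static void split_vma(struct toy_vma *vma, unsigned long addr,
                      struct toy_vma *hi)
{
    hi->vm_start = addr;
    hi->vm_end = vma->vm_end;
    hi->vm_pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
    vma->vm_end = addr;
}

int main(void)
{
    /* Anonymous vma starting at 0x40000: vm_pgoff = 0x40000 >> 12 = 0x40. */
    struct toy_vma vma = { 0x40000, 0x48000, 0x40 }, hi;
    unsigned long index = 0x45;              /* page originally at 0x45000 */
    unsigned long before = vma_address(&vma, index);

    split_vma(&vma, 0x44000, &hi);           /* page now lives in 'hi' */
    unsigned long after = vma_address(&hi, index);

    printf("before=%#lx after=%#lx\n", before, after);
    return 0;
}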
|
|
From: Hugh Dickins <hugh@veritas.com>
Before moving on to anon_vma rmap, now remove what's peculiar to anonmm rmap:
the anonmm handling and the mremap move cows. Temporarily reduce
page_referenced_anon and try_to_unmap_anon to stubs, so a kernel built with
this patch will not swap anonymous pages at all.
|
|
From: Hugh Dickins <hugh@veritas.com>
Silly final patch for anonmm rmap: change page_add_anon_rmap's mm arg to vma
arg like anon_vma rmap, to smooth the transition between them.
|
|
From: Hugh Dickins <hugh@veritas.com>
I like CONFIG_REGPARM, even when it's forced on: because it's easy to force
off for debugging - easier than editing out scattered fastcalls. Plus I've
never understood why we make function foo a fastcall, but function bar not.
Remove fastcall directives from rmap. And fix comment about mremap_moved
race: it only applies to anon pages.
|
|
Having a semaphore in there causes modest performance regressions on heavily
mmap-intensive workloads on some hardware. Specifically, up to 30% in SDET on
NUMAQ and big PPC64.
So switch it back to being a spinlock. This does mean that unmap_vmas() needs
to be told whether or not it is allowed to schedule away; that's simple to do
via the zap_details structure.
This change means that there will be high scheduling latencies when someone
truncates a large file which is currently mmapped, but nobody does that
anyway. The scheduling points in unmap_vmas() are mainly for munmap() and
exit(), and they still will work OK for that.
From: Hugh Dickins <hugh@veritas.com>
Sorry, my premature optimizations (trying to pass down NULL zap_details
except when needed) have caught you out doubly: unmap_mapping_range_list was
NULLing the details even though atomic was set; and if it hadn't, then
zap_pte_range would have missed free_swap_and_cache and pte_clear when pte
not present. Moved the optimization into zap_pte_range itself. Plus
massive documentation update.
From: Hugh Dickins <hugh@veritas.com>
Here's a second patch to add to the first: mremap's cows can't come home
without releasing the i_mmap_lock, so better to move the whole "Subtle point"
locking from move_vma into move_page_tables. And it's possible for the file
that was behind an anonymous page to be truncated while we drop that lock;
we don't want to abort mremap because of VM_FAULT_SIGBUS.
(Eek, should we be checking do_swap_page of a vm_file area against the
truncate_count sequence? Technically yes, but I doubt we need bother.)
- We cannot hold i_mmap_lock across move_one_page() because
move_one_page() needs to perform __GFP_WAIT allocations of pagetable pages.
- Move the cond_resched() out so we test it once per page rather than only
when move_one_page() returns -EAGAIN.
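
A hedged sketch of telling the unmap loop whether it may schedule, with
made-up names: the caller passes an 'atomic' flag in a details structure, and
the inner loop only reschedules when that flag is clear.

/* Sketch: a details struct telling the unmap path whether it may sleep. */
#include <stdbool.h>
#include <stdio.h>

struct zap_details_sketch {
    bool atomic;                     /* caller holds a spinlock: do not sleep */
};

static void cond_resched_sketch(void) { puts("  ...rescheduled"); }

static void unmap_range(unsigned long start, unsigned long end,
                        struct zap_details_sketch *details)
{
    for (unsigned long addr = start; addr < end; addr += 0x1000) {
        /* zap one page here ... */
        if (!details || !details->atomic)
            cond_resched_sketch();   /* safe to give up the CPU */
    }
}

int main(void)
{
    struct zap_details_sketch atomic_details = { .atomic = true };

    puts("munmap/exit path (may sleep):");
    unmap_range(0, 0x2000, NULL);

    puts("truncate path under i_mmap_lock (must not sleep):");
    unmap_range(0, 0x2000, &atomic_details);
    return 0;
}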
|
|
From: Hugh Dickins <hugh@veritas.com>
A weakness of the anonmm scheme is its difficulty in tracking pages shared
between two or more mms (one being an ancestor of the other), when mremap has
been used to move a range of pages in one of those mms. mremap move is not
very common anyway, and it's more often used on a page range exclusive to the
mm; but uncommon though it may be, we must not allow unlocked pages to become
unswappable.
This patch follows Linus' suggestion, simply to take a private copy of the
page in such a case: early C-O-W. My previous implementation was daft with
respect to pages currently on swap: it insisted on swapping them in to copy
them. No need for that: just take the copy when a page is brought in from
swap, and its intended address is found to clash with what rmap has already
noted.
If do_swap_page has to make this copy in the mremap moved case (simply a call
to do_wp_page), we might as well do so also when it's a write access but the
page is not exclusive; it has always seemed a little odd that swapin needed a
second fault for that. A bug, even: get_user_pages force imagines that a
single call to handle_mm_fault must break C-O-W. Another bugfix: swapoff's
unuse_process didn't check is_vm_hugetlb_page.
Andrea's anon_vma has no such problem with mremap moved pages, handling them
with elegant use of vm_pgoff - though at some cost to vma merging. How
important is it to handle them efficiently? For now there's a message:
printk(KERN_WARNING "%s: mremap moved %d cows\n", current->comm, cows);
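
A rough sketch of the early-copy decision at swapin (toy flags and helpers,
not the real fault path): if rmap already knows this page at a different
address because mremap moved the range, or the fault is a write and the page
is not exclusive, take the private copy immediately instead of waiting for a
second fault.

/* Sketch: decide at swapin whether to break COW right away. */
#include <stdbool.h>
#include <stdio.h>

struct swapin {
    bool write_access;        /* fault was a write */
    bool page_exclusive;      /* only this mapping uses the swap page */
    bool address_clashes;     /* rmap noted it at a different address
                                 (mremap moved the range) */
};

static bool should_copy_at_swapin(const struct swapin *s)
{
    if (s->address_clashes)
        return true;          /* moved case: must take a private copy */
    if (s->write_access && !s->page_exclusive)
        return true;          /* would fault again anyway: copy now */
    return false;
}

int main(void)
{
    struct swapin moved = { false, false, true };
    struct swapin shared_write = { true, false, false };
    struct swapin plain_read = { false, true, false };
    printf("%d %d %d\n", should_copy_at_swapin(&moved),
           should_copy_at_swapin(&shared_write),
           should_copy_at_swapin(&plain_read));
    return 0;
}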
|
|
From: Hugh Dickins <hugh@veritas.com>
Hugh's anonmm object-based reverse mapping scheme for anonymous pages. We
have not yet decided whether to adopt this scheme, or Andrea's more advanced
anon_vma scheme. anonmm is easier for me to merge quickly, to replace the
pte_chain rmap taken out in the previous patch; a patch to install Andrea's
anon_vma will follow in due course.
Why build up and tear down chains of pte pointers for anonymous pages, when a
page can only appear at one particular address, in a restricted group of mms
that might share it? (Except: see next patch on mremap.)
Introduce struct anonmm per mm to track anonymous pages, all forks from one
exec sharing the same bundle of linked anonmms. Anonymous pages originate in
one mm, but may be forked into another mm of the bundle later on. Callouts
from fork.c to allocate, dup and exit the anonmm structure private to rmap.c.
From: Hugh Dickins <hugh@veritas.com>
Consider two concurrent exits (of the last two mms sharing the anonhd). The
first exit_rmap brings anonhd->count down to 2, then gets preempted (at the
spin_unlock) by the second, which brings anonhd->count down to 1, sees that
it's 1, and frees the anonhd (without making any change to anonhd->count
itself). That cpu goes on to do something new which reallocates the old
anonhd as a new struct anonmm (probably not a head, in which case count will
start at 1). The first exit then resumes after its spin_unlock, sees
anonhd->count at 1, and frees "anonhd" again; it's reused for something else,
and a later exit_rmap list_del finds the list corrupt.
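
The race above is a check-then-free on a shared count. A minimal sketch of one
standard way to close that kind of race (toy types; not necessarily how the
fix was actually done): base the decision to free on the value returned by the
decrement itself, so two racing exits cannot both see a stale count.

/* Sketch: freeing a shared head safely, deciding from the decrement itself
 * rather than re-reading the count afterwards (toy types). */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct toy_anonhd { atomic_int count; };

static struct toy_anonhd *anonhd_alloc(int users)
{
    struct toy_anonhd *hd = malloc(sizeof(*hd));
    atomic_init(&hd->count, users);
    return hd;
}

/* Each exiting mm drops one reference; only the caller whose decrement
 * takes the count to zero may free, so two racing exits cannot both
 * observe a stale value and double-free the head. */
static void exit_rmap(struct toy_anonhd *hd)
{
    if (atomic_fetch_sub(&hd->count, 1) == 1)
        free(hd);
}

int main(void)
{
    struct toy_anonhd *hd = anonhd_alloc(2);
    exit_rmap(hd);            /* 2 -> 1: not freed */
    exit_rmap(hd);            /* 1 -> 0: freed exactly once */
    puts("freed once");
    return 0;
}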
|
|
From: Hugh Dickins <hugh@veritas.com>
Lots of deletions: the next patch will put in the new anon rmap, which
should look clearer if first we remove all of the old pte-pointer-based
rmap from the core in this patch - which therefore leaves anonymous rmap
totally disabled, with anon pages locked in memory until the process frees them.
Leave arch files (and page table rmap) untouched for now, clean them up in
a later batch. A few constructive changes amidst all the deletions:
Choose names (e.g. page_add_anon_rmap) and args (e.g. no more pteps) now
so we need not revisit so many files in the next patch. Inline function
page_dup_rmap for fork's copy_page_range, simply bumps mapcount under lock.
cond_resched_lock in copy_page_range. Struct page rearranged: no pte
union, just mapcount moved next to atomic count, so two ints can occupy one
long on 64-bit; i386 struct page now 32 bytes even with PAE. Never pass
PageReserved to page_remove_rmap, only do_wp_page did so.
From: Hugh Dickins <hugh@veritas.com>
Move page_add_anon_rmap's BUG_ON(page_mapping(page)) inside the rmap_lock
(well, might as well just check mapping if !mapcount then): if this page is
being mapped or unmapped on another cpu at the same time, page_mapping's
PageAnon(page) and page->mapping are volatile.
But page_mapping(page) is used more widely: I've a nasty feeling that
clear_page_anon, page_add_anon_rmap and/or page_mapping need barriers added
(also in 2.6.6 itself).
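
A small sketch of the page_dup_rmap idea mentioned above (toy types): fork's
copy of a page range does not need the full add-rmap path, it only has to bump
the existing map count for each page it duplicates a pte for.

/* Sketch: fork duplicating ptes only needs to bump each page's mapcount. */
#include <stdatomic.h>
#include <stdio.h>

struct toy_page { atomic_int mapcount; };        /* -1 == not mapped */

/* The parent already mapped the page; the child just adds one more mapping. */
static inline void page_dup_rmap(struct toy_page *page)
{
    atomic_fetch_add(&page->mapcount, 1);
}

static void copy_page_range(struct toy_page **pages, int n)
{
    for (int i = 0; i < n; i++)
        if (pages[i])                            /* pte present in parent */
            page_dup_rmap(pages[i]);
}

int main(void)
{
    struct toy_page p = { ATOMIC_VAR_INIT(0) };  /* mapped once by parent */
    struct toy_page *range[2] = { &p, NULL };
    copy_page_range(range, 2);
    printf("mapcount=%d\n", atomic_load(&p.mapcount));  /* now 1: two ptes */
    return 0;
}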
|
|
From: Hugh Dickins <hugh@veritas.com>
Dave McCracken's object-based reverse mapping scheme for file pages: why
build up and tear down chains of pte pointers for file pages, when
page->mapping has i_mmap and i_mmap_shared lists of all the vmas which
might contain that page, and it appears at one deterministic position
within the vma (unless vma is nonlinear - see next patch)?
It has some drawbacks: more work to locate the ptes from page_referenced and
try_to_unmap, especially if the i_mmap lists contain a lot of vmas covering
different ranges; and it has to down_trylock the i_shared_sem and hope that
doesn't fail too often. But it is attractive in that it uses less lowmem, and
shifts the rmap burden away from the hot paths, to swapout.
Hybrid scheme for the moment: carry on with pte_chains for anonymous pages,
that's unchanged; but file pages keep mapcount in the pte union of struct
page, where anonymous pages keep chain pointer or direct pte address: so
page_mapped(page) works on both.
Hugh massaged it a little: distinct page_add_file_rmap entry point; list
searches check rss so as not to waste time on mms fully swapped out; check
mapcount to terminate once all ptes have been found; and a WARN_ON if
page_referenced should have but couldn't find all the ptes.
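
A compact sketch of the file-page walk (toy types and list, invented helper
name): visit each vma on the mapping's i_mmap list, compute the page's single
possible address in it from page->index and vm_pgoff, and stop early once as
many ptes have been found as the mapcount says exist.

/* Sketch: object-based rmap walk for a file page over its mapping's vmas. */
#include <stdio.h>

#define PAGE_SHIFT 12

struct toy_vma {
    unsigned long vm_start, vm_end, vm_pgoff;
    int maps_page;                        /* pretend pte-present check */
    struct toy_vma *i_mmap_next;          /* next vma mapping this file */
};

struct toy_page { unsigned long index; int mapcount; };

static int page_referenced_file(struct toy_page *page, struct toy_vma *i_mmap)
{
    int found = 0;
    for (struct toy_vma *vma = i_mmap; vma; vma = vma->i_mmap_next) {
        unsigned long addr = vma->vm_start +
                             ((page->index - vma->vm_pgoff) << PAGE_SHIFT);
        if (addr < vma->vm_start || addr >= vma->vm_end)
            continue;                     /* vma covers a different range */
        if (vma->maps_page)
            found++;                      /* would test the pte at 'addr' */
        if (found >= page->mapcount)
            break;                        /* all ptes accounted for: stop */
    }
    return found;
}

int main(void)
{
    struct toy_vma b = { 0x80000, 0x90000, 0, 1, NULL };
    struct toy_vma a = { 0x10000, 0x20000, 0, 1, &b };
    struct toy_page page = { .index = 4, .mapcount = 2 };
    printf("referenced by %d ptes\n", page_referenced_file(&page, &a));
    return 0;
}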
|
|
Sync this up with Andrea's patches.
|
|
From: Hugh Dickins <hugh@veritas.com>
First of a batch of three rmap patches, paving the way for a move to some form
of object-based rmap (probably Andrea's, but drawing from mine too), and
making almost no functional change by itself. A few days will intervene before
the next batch, to give the struct page changes in the second patch some
exposure before proceeding.
rmap 1 create include/linux/rmap.h
Start small: linux/rmap-locking.h has already gathered some declarations
unrelated to locking, and the rest of the rmap declarations were over in
linux/swap.h: gather them all together in linux/rmap.h, and rename the
pte_chain_lock to rmap_lock.
|