commit 30dad30922ccc733cfdbfe232090cf674dc374dc upstream.
When we get a page fault at an address backed by a hugepage under
migration, the kernel can't wait correctly: it busy-loops on the
hugepage fault until the migration finishes. As a result, users who try
to kick off hugepage migration (via soft offlining, for example)
occasionally experience a long delay or a soft lockup.
This is because pte_offset_map_lock() can't get a correct migration
entry or a correct page table lock for a hugepage. This patch introduces
migration_entry_wait_huge() to solve this.
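The shape of the fix, as a sketch rather than the verbatim patch: the
common waiting logic is factored out so that the hugetlb fault path can
supply the pte and page-table lock it already knows to be correct,
instead of going through pte_offset_map_lock():

	void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
				  unsigned long address)
	{
		spinlock_t *ptl = pte_lockptr(mm, pmd);
		pte_t *ptep = pte_offset_map(pmd, address);

		__migration_entry_wait(mm, ptep, ptl);
	}

	void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
	{
		/* hugepages use mm->page_table_lock, not a split pte lock */
		spinlock_t *ptl = &mm->page_table_lock;

		__migration_entry_wait(mm, pte, ptl);
	}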
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 9b15b817f3d62409290fd56fe3cbb076a931bb0a upstream.
Minchan Kim reports that when a system has many swap areas, and tmpfs
swaps out to the ninth or more, shmem_getpage_gfp()'s attempts to read
back the page cannot locate it, and the read fails with -ENOMEM.
Whoops. Yes, I blindly followed read_swap_header()'s pte_to_swp_entry(
swp_entry_to_pte()) technique for determining maximum usable swap
offset, without stopping to realize that that actually depends upon the
pte swap encoding shifting swap offset to the higher bits and truncating
it there. Whereas our radix_tree swap encoding leaves offset in the
lower bits: it's swap "type" (that is, index of swap area) that was
truncated.
Fix it by reducing the SWP_TYPE_SHIFT() in swapops.h, and removing the
broken radix_to_swp_entry(swp_to_radix_entry()) from read_swap_header().
This does not reduce the usable size of a swap area any further; it
leaves it as claimed when making the original commit: no change from 3.0
on x86_64, nor on i386 without PAE; but 3.0's 512GB is reduced to 128GB
per swapfile on i386 with PAE. It's not a change I would have risked
five years ago, but with x86_64 supported for ten years, I believe it's
appropriate now.
Hmm, and what if some architecture implements its swap pte with offset
encoded below type? That would equally break the maximum usable swap
offset check. Happily, they all follow the same tradition of encoding
offset above type, but I'll prepare a check on that for next.
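As a sketch, the swapops.h side of the fix amounts to carving the
radix_tree exceptional-entry bits out of the offset field rather than
the type field:

	/* sketch: shrink SWP_TYPE_SHIFT() so a swap offset survives the
	 * swp_to_radix_entry()/radix_to_swp_entry() round trip */
	#define SWP_TYPE_SHIFT(e)	((sizeof(e.val) * 8) - \
			(MAX_SWAPFILES_SHIFT + RADIX_TREE_EXCEPTIONAL_SHIFT))
	#define SWP_OFFSET_MASK(e)	((1UL << SWP_TYPE_SHIFT(e)) - 1)

With RADIX_TREE_EXCEPTIONAL_SHIFT being 2, an i386 PAE swp_entry_t keeps
32 - 5 - 2 = 25 offset bits: 2^25 pages of 4kB is the 128GB per swapfile
quoted above.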
Reported-and-Reviewed-and-Tested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
If swap entries are to be stored along with struct page pointers in a
radix tree, they need to be distinguished as exceptional entries.
Most of the handling of swap entries in the radix tree will be contained in
shmem.c, but a few functions in filemap.c's common code need to check
for their appearance: find_get_page(), find_lock_page(),
find_get_pages() and find_get_pages_contig().
So as not to slow their fast paths, tuck those checks inside the
existing checks for unlikely radix_tree_deref_slot(); except for
find_lock_page(), where it is an added test. And make it a BUG in
find_get_pages_tag(), which is not applied to tmpfs files.
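As a sketch (helper names per the radix_tree exceptional-entry API, not
verbatim from the patch), the check tucked into find_get_page()'s
existing unlikely path looks something like:

	page = radix_tree_deref_slot(pagep);
	if (unlikely(!page))
		goto out;
	if (radix_tree_exception(page)) {
		if (radix_tree_deref_retry(page))
			goto repeat;
		/* shmem/tmpfs stored a swap entry here as an exceptional
		 * entry: return it without trying to raise a page count */
		goto out;
	}
	if (!page_cache_get_speculative(page))
		goto repeat;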
Part of the reason for eliminating shmem_readpage() earlier was to
minimize the places where common code would need to allow for swap
entries.
The swp_entry_t known to swapfile.c must be massaged into a slightly
different form when stored in the radix tree, just as it gets massaged
into a pte_t when stored in page tables.
In an i386 kernel this limits its information (type and page offset) to
30 bits: given 32 "types" of swapfile and 4kB pagesize, that's a maximum
swapfile size of 128GB. Which is less than the 512GB we previously
allowed with X86_PAE (where the swap entry can occupy the entire upper
32 bits of a pte_t), but not a new limitation on 32-bit without PAE; and
there's not a new limitation on 64-bit (where swap filesize is already
limited to 16TB by a 32-bit page offset). Thirty areas of 128GB is
probably still enough swap for a 64GB 32-bit machine.
Provide swp_to_radix_entry() and radix_to_swp_entry() conversions, and
enforce filesize limit in read_swap_header(), just as for ptes.
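The conversions amount to shifting the swap entry up to make room for
the exceptional-entry tag bits; roughly:

	static inline void *swp_to_radix_entry(swp_entry_t entry)
	{
		unsigned long value = entry.val << RADIX_TREE_EXCEPTIONAL_SHIFT;

		return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
	}

	static inline swp_entry_t radix_to_swp_entry(void *arg)
	{
		swp_entry_t entry;

		entry.val = (unsigned long)arg >> RADIX_TREE_EXCEPTIONAL_SHIFT;
		return entry;
	}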
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Memory migration uses special swap entry types to trigger special actions on
page faults. Extend this mechanism to also support poisoned swap entries, to
trigger poison handling on page faults. This allows follow-on patches to
prevent processes from faulting in poisoned pages again.
v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu)
v3: Better overflow fix (Hidehiro Kawai)
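A sketch of the new entry type; the exact placement of SWP_HWPOISON
relative to MAX_SWAPFILES and the migration types is the overflow
arithmetic that the v2/v3 notes above are about:

	#ifdef CONFIG_MEMORY_FAILURE
	/* sketch: one more special swap type is carved out of the
	 * 2^MAX_SWAPFILES_SHIFT space, alongside the migration types */
	#define SWP_HWPOISON_NUM	1
	#define SWP_HWPOISON		MAX_SWAPFILES

	static inline swp_entry_t make_hwpoison_entry(struct page *page)
	{
		BUG_ON(!PageLocked(page));
		return swp_entry(SWP_HWPOISON, page_to_pfn(page));
	}

	static inline int is_hwpoison_entry(swp_entry_t entry)
	{
		return swp_type(entry) == SWP_HWPOISON;
	}
	#endif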
Signed-off-by: Andi Kleen <ak@linux.intel.com>
|
|
CC mm/vmscan.o
In file included from
/home/bunk/linux/kernel-2.6/git/linux-2.6/mm/vmscan.c:44:
/home/bunk/linux/kernel-2.6/git/linux-2.6/include/linux/swapops.h: In function 'is_swap_pte':
/home/bunk/linux/kernel-2.6/git/linux-2.6/include/linux/swapops.h:48: error: implicit declaration of function 'pte_none'
/home/bunk/linux/kernel-2.6/git/linux-2.6/include/linux/swapops.h:48: error: implicit declaration of function 'pte_present'
Does it ever make sense to ask "is this pte a swap entry?" on a machine
with no MMU? Presumably it has no ptes either, right? In
which case, it's better to comment the whole function out. Then when
someone tries to ask the above meaningless question, they get a compile
error rather than a meaningless answer.
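So the fix is simply to fence the helper off; as a sketch:

	#ifdef CONFIG_MMU
	/* a swap pte is neither empty, nor present, nor a file pte */
	static inline int is_swap_pte(pte_t pte)
	{
		return !pte_none(pte) && !pte_present(pte) && !pte_file(pte);
	}
	#endif /* on !MMU, callers now get a compile error, as intended */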
Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Reported-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Move the is_swap_pte() helper function to swapops.h for use by the
pagemap code.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
allnoconfig:
mm/mincore.c: In function 'do_mincore':
mm/mincore.c:122: warning: unused variable 'entry'
Yet another entry in the why-macros-are-wrong encyclopedia.
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Implement read/write migration ptes
We take the upper two swap types for the two kinds of migration ptes and
define a series of macros in swapops.h.
The VM is modified to handle the migration entries. Migration entries can
only be encountered when the page they are pointing to is locked. This
limits the number of places one has to fix. We also check in
copy_pte_range() and in mprotect_pte_range() for migration ptes.
We check for migration ptes in do_swap_page() and call a function that
will then wait on the page lock. This allows us to effectively stop all
accesses to the page.
Migration entries are created by try_to_unmap() if called for migration
and removed by local functions in migrate.c.
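A sketch of the swapops.h side, with the values the notes below refer to
(MAX_SWAPFILES is correspondingly reduced to (1 << MAX_SWAPFILES_SHIFT) - 2,
i.e. 30, so types 30 and 31 become the migration types):

	#ifdef CONFIG_MIGRATION
	#define SWP_MIGRATION_READ	MAX_SWAPFILES		/* 30 */
	#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + 1)	/* 31 */

	static inline swp_entry_t make_migration_entry(struct page *page,
						       int write)
	{
		BUG_ON(!PageLocked(page));
		return swp_entry(write ? SWP_MIGRATION_WRITE :
					 SWP_MIGRATION_READ,
				 page_to_pfn(page));
	}

	static inline int is_migration_entry(swp_entry_t entry)
	{
		return swp_type(entry) == SWP_MIGRATION_READ ||
		       swp_type(entry) == SWP_MIGRATION_WRITE;
	}

	static inline int is_write_migration_entry(swp_entry_t entry)
	{
		return swp_type(entry) == SWP_MIGRATION_WRITE;
	}

	static inline void make_migration_entry_read(swp_entry_t *entry)
	{
		*entry = swp_entry(SWP_MIGRATION_READ, swp_offset(*entry));
	}
	#endif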
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration (I've no NUMA, just
hacking it up to migrate recklessly while running load), I've hit the
BUG_ON(!PageLocked(p)) in migration_entry_to_page.
This comes from an orphaned migration entry, unrelated to the current
correctly locked migration, but hit by remove_anon_migration_ptes as it
checks an address in each vma of the anon_vma list.
Such an orphan may be left behind if an earlier migration raced with fork:
copy_one_pte can duplicate a migration entry from parent to child, after
remove_anon_migration_ptes has checked the child vma, but before it has
removed it from the parent vma. (If the process were later to fault on this
orphaned entry, it would hit the same BUG from migration_entry_wait.)
This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
not. There's no such problem with file pages, because vma_prio_tree_add
adds child vma after parent vma, and the page table locking at each end is
enough to serialize. Follow that example with anon_vma: add new vmas to the
tail instead of the head.
(There's no corresponding problem when inserting migration entries,
because a missed pte will leave the page count and mapcount high, which is
allowed for. And there's no corresponding problem when migrating via swap,
because a leftover swap entry will be correctly faulted. But the swapless
method has no refcounting of its entries.)
From: Ingo Molnar <mingo@elte.hu>
pte_unmap_unlock() takes the pte pointer as an argument.
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration, gcc has tried to exec
a pointer instead of a string: smells like COW mappings are not being
properly write-protected on fork.
The protection in copy_one_pte looks very convincing, until at last you
realize that the second arg to make_migration_entry is a boolean "write",
and SWP_MIGRATION_READ is 30.
Anyway, it's better done like in change_pte_range, using
is_write_migration_entry and make_migration_entry_read.
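In copy_one_pte that comes out roughly as (a sketch; is_cow_mapping() is
the usual COW test on the vma flags):

	if (is_write_migration_entry(entry) && is_cow_mapping(vm_flags)) {
		/*
		 * COW mappings require pages in both parent and child
		 * to be set to read: downgrade the copied entry rather
		 * than rebuilding it with a raw "write" argument.
		 */
		make_migration_entry_read(&entry);
		set_pte_at(src_mm, addr, src_pte, swp_entry_to_pte(entry));
	}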
From: Hugh Dickins <hugh@veritas.com>
Remove unnecessary obfuscation from sys_swapon's range check on swap type,
which blew up causing memory corruption once swapless migration made
MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
smp_entry_t -> swp_entry_t
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
There is a useless AND in the swp_type() function.
We have just shifted the value from the swp_entry_t right by
SWP_TYPE_SHIFT() bits, and then we AND it with "(1 << 5) - 1" (a mask
corresponding to the number of bits used by "type"). Since the type
field occupies the topmost bits of the entry, the shift alone already
isolates it.
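So the helper reduces to a bare shift; roughly:

	static inline unsigned swp_type(swp_entry_t entry)
	{
		/* type occupies the top bits: no mask needed */
		return entry.val >> SWP_TYPE_SHIFT(entry);
	}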
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
This fixes a problem in sys_swapon that can cause the creation of invalid
swap ptes. The cause lies in the difference between the arch-independent
swap entries and the pte-encoded swap entries. The swp_entry_t uses 27
bits for the offset and 5 bits for the type. In sys_swapon this
definition is used to find how many swap devices and how many pages on
each device there can be. But the swap entries encoded in a pte can be
subject to additional restrictions imposed by the hardware, beyond the
27/5 division of the bits in the swp_entry_t type. This is solved by
adding pte_to_swp_entry and swp_entry_to_pte calls to the calculations
for maximum type and offset.
In addition, the s390 swap pte division for offset/type is changed from
19/6 bits to 20/5 bits.
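The added clamping amounts to a round trip through the arch pte
encoding, so that any hardware truncation shows up in the computed
limits; a sketch:

	/* the widest offset that survives the arch swap-pte encoding */
	maxpages = swp_offset(pte_to_swp_entry(
			swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;

(This is the same round-trip technique that the radix-tree fix earlier
in this log had to stop imitating for its own encoding.)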
|
|
This patch requires arch support. I have patches for ia32, ppc64 and x86_64.
Other architectures will break. It is a five-minute fix. See
http://mail.nl.linux.org/linux-mm/2003-03/msg00174.html
for implementation details.
Patch from: Ingo Molnar <mingo@elte.hu>
The attached patch, against BK-curr, is a preparation to make
remap_file_pages() usable on swappable vmas as well. When 'swapping out'
shared named mappings, the page offset is written into the pte.
It takes one bit from the swap-type bits; otherwise it does not change
the pte layout, so it should be easy to adapt any other architecture to
this change as well. (This patch does not introduce the
protection-bits-in-pte approach used in my previous patch.)
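As a purely illustrative sketch of the idea (not any real
architecture's bit layout), a 32-bit non-present pte might be carved up
like this:

	/*
	 * illustrative only: one bit taken from the swap-type field
	 * marks a nonlinear file pte, the rest holds the file offset
	 *
	 *   bit 0:      _PAGE_PRESENT (clear for swap and file ptes)
	 *   bit 1:      _PAGE_FILE    (set: file pte; clear: swap pte)
	 *   bits 2..31: page offset into the file (with hw-reserved
	 *               bits this shrinks to the 29 usable bits and
	 *               the 2 TB limit discussed below)
	 */
	#define PTE_FILE_SHIFT	2
	#define pte_to_pgoff(pte)	(pte_val(pte) >> PTE_FILE_SHIFT)
	#define pgoff_to_pte(off) \
		__pte(((off) << PTE_FILE_SHIFT) | _PAGE_FILE)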
On 32-bit pte sizes with an effective usable pte range of 29 bits, this
limits mmap()-able file size to 4096 * 2^29 == 2 TBs. If the usable range is
smaller, then the maximum mmap() size is reduced as well. The worst case
I found (PPC) was 2 hw-reserved bits in the swap case, which limits us
to a 1 TB filesize. Is there any other hw that has an even worse ratio
of sw-usable pte bits?
This mmap() limit can be eliminated by simply not converting the
swapped-out pte to a file pte, but clearing it and falling back to the
linear mapping upon swapin. This puts the limit into remap_file_pages()
alone, but I really hope no one wants to use remap_file_pages() on a
32-bit platform on a larger than 1-2 TB file.
sys_remap_file_pages() is now enforcing the 'prot' parameter to be zero.
This restriction might be lifted in the future - I really hope we can
have more flexible remapping once 64-bit platforms are commonplace -
e.g. things like memory debuggers could just use the permission bits
directly, instead of creating many small vmas.
I've tested swappable nonlinear ptes and they are swapped out/in
correctly.
Some other changes in -A0 relative to 2.5.63-BK:
- slightly smarter TLB flushing in install_page(). This is still only a
  stupid helper function - a more efficient 'walk the pagecache tree
  and pagetable at once and use TLB-gather' implementation is preferred.
- cleanup: pass on pgprot_t instead of unsigned long prot.
- some sanity checks to make sure file_pte() rules are followed.
- do not reduce the vma's default protection to PROT_NONE when using
  remap_file_pages() on it. With swappable ptes this is now safe.
|
|
This fixes two shift warnings on 64 bit archs.
|
|
First some terminology: this patch introduces a kernel-wide `pgoff_t'
type. It is the index of a page into the pagecache. The thing at
page->index. For most mappings it is also the offset of the page into
that mapping. This type has a very distinct function in the kernel and
it needs a name. I don't have any particular plans to go and migrate
everything so we can support 64-bit pagecache indices on x86, but this
would be the way to do it.
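The type itself is nothing fancy; its definition amounts to:

	/* index of a page into the pagecache: the thing at page->index */
	typedef unsigned long pgoff_t;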
This patch improves the packing density of swapcache pages in the radix
tree.
A swapcache page is identified by the `swap type' (indexes the swap
device) and the `offset' (into that swap device). These two numbers
are encoded into a `swp_entry_t' machine word in arch-specific code
because the resulting number is placed into pagetables in a form which
will generate a fault.
The kernel also needs to generate a pgoff_t for that page to index it
into the swapper_space radix tree. That pgoff_t is usually
bitwise-identical to the swp_entry_t. That worked OK when the
pagecache was using a hash. But with a radix tree, it produces
catastrophically bad results.
x86 (and many other architectures) place the `type' field into the
low-order bits of the swp_entry_t. So *all* swapcache pages are
basically identical in the eight low-order bits. This produces a very
sparse radix tree for swapcache. I'm observing packing densities of 1%
to 2%: so the typical 128-slot radix tree node has only one or two
pages in it.
The end result is that the kernel needs to allocate approximately one
new radix-tree node for each page which is added to the swapcache. So
no wonder we're having radix-tree node exhaustion during swapout!
(It's actually quite encouraging that the kernel works as well as it
does).
The patch changes the encoding of the swp_entry_t so that its
most-significant bits contain the `type' field and the
least-significant bits contain the `offset' field, right-aligned.
That is: the encoding in swp_entry_t is now arch-independent. The new
file <linux/swapops.h> has conversion functions which convert the
swp_entry_t to and from its machine pte representation.
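In outline, the new header puts type in the most-significant bits and
converts through the arch-specific __swp_* macros; a sketch:

	/* type in the high bits, offset right-aligned beneath it */
	#define SWP_TYPE_SHIFT(e)	(sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
	#define SWP_OFFSET_MASK(e)	((1UL << SWP_TYPE_SHIFT(e)) - 1)

	static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
	{
		swp_entry_t ret;

		ret.val = (type << SWP_TYPE_SHIFT(ret)) |
			  (offset & SWP_OFFSET_MASK(ret));
		return ret;
	}

	/* to and from the machine pte representation */
	static inline swp_entry_t pte_to_swp_entry(pte_t pte)
	{
		swp_entry_t arch_entry = __pte_to_swp_entry(pte);

		return swp_entry(__swp_type(arch_entry),
				 __swp_offset(arch_entry));
	}

	static inline pte_t swp_entry_to_pte(swp_entry_t entry)
	{
		swp_entry_t arch_entry = __swp_entry(swp_type(entry),
						     swp_offset(entry));

		return __swp_entry_to_pte(arch_entry);
	}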
Packing density in the swapper_space mapping goes up to around 90%
(observed) and the kernel is tons happier under swap load.
An alternative approach would be to create new conversion functions
which convert an arch-specific swp_entry_t to and from a pgoff_t. I
tried that. It worked, but I liked it less.
|