summaryrefslogtreecommitdiff
path: root/include/linux/rmap.h
AgeCommit message (Collapse)Author
2009-06-16vmscan: report vm_flags in page_referenced()Wu Fengguang
Collect vma->vm_flags of the VMAs that actually referenced the page. This is preparing for more informed reclaim heuristics, eg. to protect executable file pages more aggressively. For now only the VM_EXEC bit will be used by the caller. Thanks to Johannes, Peter and Minchan for all the good tips. Acked-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16mm: remove CONFIG_UNEVICTABLE_LRU config optionKOSAKI Motohiro
Currently, nobody wants to turn UNEVICTABLE_LRU off. Thus this configurability is unnecessary. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andi Kleen <andi@firstfloor.org> Acked-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06badpage: remove vma from page_remove_rmapHugh Dickins
Remove page_remove_rmap()'s vma arg, which was only for the Eeek message. And remove the BUG_ON(page_mapcount(page) == 0) from CONFIG_DEBUG_VM's page_dup_rmap(): we're trying to be more resilient about that than BUGs. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: make page_lock_anon_vma() staticHugh Dickins
page_lock_anon_vma() and page_unlock_anon_vma() were made available to show_page_path() in vmscan.c; but now that has been removed, make them static in rmap.c again, they're better kept private if possible. Signed-off-by: Hugh Dickins <hugh@veritas.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20make mm/rmap.c:anon_vma_cachep staticAdrian Bunk
This patch makes the needlessly global anon_vma_cachep static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20vmscan: unevictable LRU scan sysctlLee Schermerhorn
This patch adds a function to scan individual or all zones' unevictable lists and move any pages that have become evictable onto the respective zone's inactive list, where shrink_inactive_list() will deal with them. Adds sysctl to scan all nodes, and per node attributes to individual nodes' zones. Kosaki: If evictable page found in unevictable lru when write /proc/sys/vm/scan_unevictable_pages, print filename and file offset of these pages. [akpm@linux-foundation.org: fix one CONFIG_MMU=n build error] [kosaki.motohiro@jp.fujitsu.com: adapt vmscan-unevictable-lru-scan-sysctl.patch to new sysfs API] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20mlock: mlocked pages are unevictableNick Piggin
Make sure that mlocked pages also live on the unevictable LRU, so kswapd will not scan them over and over again. This is achieved through various strategies: 1) add yet another page flag--PG_mlocked--to indicate that the page is locked for efficient testing in vmscan and, optionally, fault path. This allows early culling of unevictable pages, preventing them from getting to page_referenced()/try_to_unmap(). Also allows separate accounting of mlock'd pages, as Nick's original patch did. Note: Nick's original mlock patch used a PG_mlocked flag. I had removed this in favor of the PG_unevictable flag + an mlock_count [new page struct member]. I restored the PG_mlocked flag to eliminate the new count field. 2) add the mlock/unevictable infrastructure to mm/mlock.c, with internal APIs in mm/internal.h. This is a rework of Nick's original patch to these files, taking into account that mlocked pages are now kept on unevictable LRU list. 3) update vmscan.c:page_evictable() to check PageMlocked() and, if vma passed in, the vm_flags. Note that the vma will only be passed in for new pages in the fault path; and then only if the "cull unevictable pages in fault path" patch is included. 4) add try_to_unlock() to rmap.c to walk a page's rmap and ClearPageMlocked() if no other vmas have it mlocked. Reuses as much of try_to_unmap() as possible. This effectively replaces the use of one of the lru list links as an mlock count. If this mechanism let's pages in mlocked vmas leak through w/o PG_mlocked set [I don't know that it does], we should catch them later in try_to_unmap(). One hopes this will be rare, as it will be relatively expensive. Original mm/internal.h, mm/rmap.c and mm/mlock.c changes: Signed-off-by: Nick Piggin <npiggin@suse.de> splitlru: introduce __get_user_pages(): New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS. because current get_user_pages() can't grab PROT_NONE pages theresore it cause PROT_NONE pages can't munlock. [akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch] [akpm@linux-foundation.org: untangle patch interdependencies] [akpm@linux-foundation.org: fix things after out-of-order merging] [hugh@veritas.com: fix page-flags mess] [lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm'] [kosaki.motohiro@jp.fujitsu.com: build fix] [kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments] [kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-20mm: dirty page tracking race fixNick Piggin
There is a race with dirty page accounting where a page may not properly be accounted for. clear_page_dirty_for_io() calls page_mkclean; then TestClearPageDirty. page_mkclean walks the rmaps for that page, and for each one it cleans and write protects the pte if it was dirty. It uses page_check_address to find the pte. That function has a shortcut to avoid the ptl if the pte is not present. Unfortunately, the pte can be switched to not-present then back to present by other code while holding the page table lock -- this should not be a signal for page_mkclean to ignore that pte, because it may be dirty. For example, powerpc64's set_pte_at will clear a previously present pte before setting it to the desired value. There may also be other code in core mm or in arch which do similar things. The consequence of the bug is loss of data integrity due to msync, and loss of dirty page accounting accuracy. XIP's __xip_unmap could easily also be unreliable (depending on the exact XIP locking scheme), which can lead to data corruption. Fix this by having an option to always take ptl to check the pte in page_check_address. It's possible to retain this optimization for page_referenced and try_to_unmap. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jared Hulbert <jaredeh@gmail.com> Cc: Carsten Otte <cotte@freenet.de> Cc: Hugh Dickins <hugh@veritas.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-28mmu-notifiers: add mm_take_all_locks() operationAndrea Arcangeli
mm_take_all_locks holds off reclaim from an entire mm_struct. This allows mmu notifiers to register into the mm at any time with the guarantee that no mmu operation is in progress on the mm. This operation locks against the VM for all pte/vma/mm related operations that could ever happen on a certain mm. This includes vmtruncate, try_to_unmap, and all page faults. The caller must take the mmap_sem in write mode before calling mm_take_all_locks(). The caller isn't allowed to release the mmap_sem until mm_drop_all_locks() returns. mmap_sem in write mode is required in order to block all operations that could modify pagetables and free pages without need of altering the vma layout (for example populate_range() with nonlinear vmas). It's also needed in write mode to avoid new anon_vmas to be associated with existing vmas. A single task can't take more than one mm_take_all_locks() in a row or it would deadlock. mm_take_all_locks() and mm_drop_all_locks are expensive operations that may have to take thousand of locks. mm_take_all_locks() can fail if it's interrupted by signals. When mmu_notifier_register returns, we must be sure that the driver is notified if some task is in the middle of a vmtruncate for the 'mm' where the mmu notifier was registered (mmu_notifier_invalidate_range_start/end is run around the vmtruncation but mmu_notifier_register can run after mmu_notifier_invalidate_range_start and before mmu_notifier_invalidate_range_end). Same problem for rmap paths. And we've to remove page pinning to avoid replicating the tlb_gather logic inside KVM (and GRU doesn't work well with page pinning regardless of needing tlb_gather), so without mm_take_all_locks when vmtruncate frees the page, kvm would have no way to notice that it mapped into sptes a page that is going into the freelist without a chance of any further mmu_notifier notification. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Jack Steiner <steiner@sgi.com> Cc: Robin Holt <holt@sgi.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Kanoj Sarcar <kanojsarcar@yahoo.com> Cc: Roland Dreier <rdreier@cisco.com> Cc: Steve Wise <swise@opengridcomputing.com> Cc: Avi Kivity <avi@qumranet.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Anthony Liguori <aliguori@us.ibm.com> Cc: Chris Wright <chrisw@redhat.com> Cc: Marcelo Tosatti <marcelo@kvack.org> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Izik Eidus <izike@qumranet.com> Cc: Anthony Liguori <aliguori@us.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07Memory controller: make page_referenced() cgroup awareBalbir Singh
Make page_referenced() cgroup aware. Without this patch, page_referenced() can cause a page to be skipped while reclaiming pages. This patch ensures that other cgroups do not hold pages in a particular cgroup hostage. It is required to ensure that shared pages are freed from a cgroup when they are not actively referenced from the cgroup that brought them in Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17mm: more rmap checkingNick Piggin
Re-introduce rmap verification patches that Hugh removed when he removed PG_map_lock. PG_map_lock actually isn't needed to synchronise access to anonymous pages, because PG_locked and PTL together already do. These checks were important in discovering and fixing a rare rmap corruption in SLES9. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2006-12-22[PATCH] mm: more rmap debuggingNick Piggin
Add more debugging in the rmap code in an attempt to locate to source of the occasional "mapcount went negative" assertions. Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07[PATCH] slab: remove kmem_cache_tChristoph Lameter
Replace all uses of kmem_cache_t with struct kmem_cache. The patch was generated using the following script: #!/bin/sh # # Replace one string by another in all the kernel sources. # set -e for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do quilt add $file sed -e "1,\$s/$1/$2/g" $file >/tmp/$$ mv /tmp/$$ $file quilt refresh done The script was run like this sh replace kmem_cache_t "struct kmem_cache" Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07[PATCH] slab: remove SLAB_KERNELChristoph Lameter
SLAB_KERNEL is an alias of GFP_KERNEL. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26[PATCH] mm: tracking shared dirty pagesPeter Zijlstra
Tracking of dirty pages in shared writeable mmap()s. The idea is simple: write protect clean shared writeable pages, catch the write-fault, make writeable and set dirty. On page write-back clean all the PTE dirty bits and write protect them once again. The implementation is a tad harder, mainly because the default backing_dev_info capabilities were too loosely maintained. Hence it is not enough to test the backing_dev_info for cap_account_dirty. The current heuristic is as follows, a VMA is eligible when: - its shared writeable (vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED) - it is not a 'special' mapping (vm_flags & (VM_PFNMAP|VM_INSERTPAGE)) == 0 - the backing_dev_info is cap_account_dirty mapping_cap_account_dirty(vma->vm_file->f_mapping) - f_op->mmap() didn't change the default page protection Page from remap_pfn_range() are explicitly excluded because their COW semantics are already horrid enough (see vm_normal_page() in do_wp_page()) and because they don't have a backing store anyway. mprotect() is taught about the new behaviour as well. However it overrides the last condition. Cleaning the pages on write-back is done with page_mkclean() a new rmap call. It can be called on any page, but is currently only implemented for mapped pages, if the page is found the be of a VMA that accounts dirty pages it will also wrprotect the PTE. Finally, in fs/buffers.c:try_to_free_buffers(); remove clear_page_dirty() from under ->private_lock. This seems to be safe, since ->private_lock is used to serialize access to the buffers, not the page itself. This is needed because clear_page_dirty() will call into page_mkclean() and would thereby violate locking order. [dhowells@redhat.com: Provide a page_mkclean() implementation for NOMMU] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] Swapless page migration: rip out swap based logicChristoph Lameter
Rip the page migration logic out. Remove all code that has to do with swapping during page migration. This also guts the ability to migrate pages to swap. No one used that so lets let it go for good. Page migration should be a bit broken after this patch. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-26Don't include linux/config.h from anywhere else in include/David Woodhouse
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-02-01[PATCH] Direct Migration V9: remove_from_swap() to remove swap ptesChristoph Lameter
Add remove_from_swap remove_from_swap() allows the restoration of the pte entries that existed before page migration occurred for anonymous pages by walking the reverse maps. This reduces swap use and establishes regular pte's without the need for page faults. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] Direct Migration V9: migrate_pages() extensionChristoph Lameter
Add direct migration support with fall back to swap. Direct migration support on top of the swap based page migration facility. This allows the direct migration of anonymous pages and the migration of file backed pages by dropping the associated buffers (requires writeout). Fall back to swap out if necessary. The patch is based on lots of patches from the hotplug project but the code was restructured, documented and simplified as much as possible. Note that an additional patch that defines the migrate_page() method for filesystems is necessary in order to avoid writeback for anonymous and file backed pages. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Mike Kravetz <kravetz@us.ibm.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: rmap optimisationNick Piggin
Optimise rmap functions by minimising atomic operations when we know there will be no concurrent modifications. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-28[PATCH] temporarily disable swap token on memory pressureRik van Riel
Some users (hi Zwane) have seen a problem when running a workload that eats nearly all of physical memory - th system does an OOM kill, even when there is still a lot of swap free. The problem appears to be a very big task that is holding the swap token, and the VM has a very hard time finding any other page in the system that is swappable. Instead of ignoring the swap token when sc->priority reaches 0, we could simply take the swap token away from the memory hog and make sure we don't give it back to the memory hog for a few seconds. This patch resolves the problem Zwane ran into. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: rmap with inner ptlockHugh Dickins
rmap's page_check_address descend without page_table_lock. First just pte_offset_map in case there's no pte present worth locking for, then take page_table_lock for the full check, and pass ptl back to caller in the same style as pte_offset_map_lock. __xip_unmap, page_referenced_one and try_to_unmap_one use pte_unmap_unlock. try_to_unmap_cluster also. page_check_address reformatted to avoid progressive indentation. No use is made of its one error code, return NULL when it fails. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-24[PATCH] xip: fs/mm: execute in placeCarsten Otte
- generic_file* file operations do no longer have a xip/non-xip split - filemap_xip.c implements a new set of fops that require get_xip_page aop to work proper. all new fops are exported GPL-only (don't like to see whatever code use those except GPL modules) - __xip_unmap now uses page_check_address, which is no longer static in rmap.c, and defined in linux/rmap.h - mm/filemap.h is now much more clean, plainly having just Linus' inline funcs moved here from filemap.c - fix includes in filemap_xip to make it build cleanly on i386 Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-18[PATCH] vmscan: ignore swap token when in troubleAndrew Morton
The token-based thrashing control patches introduced a problem: when a task which doesn't hold the token tries to run direct-reclaim, that task is told that pages which belong to the token-holding mm are referenced, even though they are not. This means that it is possible for a huge number of a non-token-holding mm's pages to be scanned to no effect. Eventually, we give up and go and oom-kill something. So the patch arranges for the thrashing control logic to be defeated if the caller has reached the highest level of scanning priority. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-25[PATCH] lsm: remove net related includes from security.hChris Wright
With this we're back to the times when changing skbuff.h only triggers rebuild of _net_ related stuff 8) This uncovered a bug in rmap.h, that was not including mm.h to get the definition of struct vm_area_struct, working by luck. Signed-off-by: Arnaldo Carvalho de Melo <acme@conectiva.com.br> Signed-off-by: Chris Wright <chrisw@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-23[PATCH] rmaplock: swapoff use anon_vmaHugh Dickins
Swapoff can make good use of a page's anon_vma and index, while it's still left in swapcache, or once it's brought back in and the first pte mapped back: unuse_vma go directly to just one page of only those vmas with the same anon_vma. And unuse_process can skip any vmas without an anon_vma (extending the hugetlb check: hugetlb vmas have no anon_vma). This just hacks in on top of the existing procedure, still going through all the vmas of all the mms in mmlist. A more elegant procedure might replace mmlist by a list of anon_vmas: but that would be more work to implement, with apparently more overhead in the common paths. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-23[PATCH] rmaplock: kill page_map_lockHugh Dickins
The pte_chains rmap used pte_chain_lock (bit_spin_lock on PG_chainlock) to lock its pte_chains. We kept this (as page_map_lock: bit_spin_lock on PG_maplock) when we moved to objrmap. But the file objrmap locks its vma tree with mapping->i_mmap_lock, and the anon objrmap locks its vma list with anon_vma->lock: so isn't the page_map_lock superfluous? Pretty much, yes. The mapcount was protected by it, and needs to become an atomic: starting at -1 like page _count, so nr_mapped can be tracked precisely up and down. The last page_remove_rmap can't clear anon page mapping any more, because of races with page_add_rmap; from which some BUG_ONs must go for the same reason, but they've served their purpose. vmscan decisions are naturally racy, little change there beyond removing page_map_lock/unlock. But to stabilize the file-backed page->mapping against truncation while acquiring i_mmap_lock, page_referenced_file now needs page lock to be held even for refill_inactive_zone. There's a similar issue in acquiring anon_vma->lock, where page lock doesn't help: which this patch pretends to handle, but actually it needs the next. Roughly 10% cut off lmbench fork numbers on my 2*HT*P4. Must confess my testing failed to show the races even while they were knowingly exposed: would benefit from testing on racier equipment. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-05-22[PATCH] rmap 39 add anon_vma rmapAndrew Morton
From: Hugh Dickins <hugh@veritas.com> Andrea Arcangeli's anon_vma object-based reverse mapping scheme for anonymous pages. Instead of tracking anonymous pages by pte_chains or by mm, this tracks them by vma. But because vmas are frequently split and merged (particularly by mprotect), a page cannot point directly to its vma(s), but instead to an anon_vma list of those vmas likely to contain the page - a list on which vmas can easily be linked and unlinked as they come and go. The vmas on one list are all related, either by forking or by splitting. This has three particular advantages over anonmm: that it can cope effortlessly with mremap moves; and no longer needs page_table_lock to protect an mm's vma tree, since try_to_unmap finds vmas via page -> anon_vma -> vma instead of using find_vma; and should use less cpu for swapout since it can locate its anonymous vmas more quickly. It does have disadvantages too: a lot more change in mmap.c to deal with anon_vmas, though small straightforward additions now that the vma merging has been refactored there; more lowmem needed for each anon_vma and vma structure; an additional restriction on the merging of vmas (cannot be merged if already assigned different anon_vmas, since then their pages will be pointing to different heads). (There would be no need to enlarge the vma structure if anonymous pages belonged only to anonymous vmas; but private file mappings accumulate anonymous pages by copy-on-write, so need to be listed in both anon_vma and prio_tree at the same time. A different implementation could avoid that by using anon_vmas only for purely anonymous vmas, and use the existing prio_tree to locate cow pages - but that would involve a long search for each single private copy, probably not a good idea.) Where before the vm_pgoff of a purely anonymous (not file-backed) vma was meaningless, now it represents the virtual start address at which that vma is mapped - which the standard file pgoff manipulations treat linearly as vmas are split and merged. But if mremap moves the vma, then it generally carries its original vm_pgoff to the new location, so pages shared with the old location can still be found. Magic. Hugh has massaged it somewhat: building on the earlier rmap patches, this patch is a fifth of the size of Andrea's original anon_vma patch. Please note that this posting will be his first sight of this patch, which he may or may not approve.
2004-05-22[PATCH] rmap 38 remove anonmm rmapAndrew Morton
From: Hugh Dickins <hugh@veritas.com> Before moving on to anon_vma rmap, remove now what's peculiar to anonmm rmap: the anonmm handling and the mremap move cows. Temporarily reduce page_referenced_anon and try_to_unmap_anon to stubs, so a kernel built with this patch will not swap anonymous at all.
2004-05-22[PATCH] rmap 37 page_add_anon_rmap vmaAndrew Morton
From: Hugh Dickins <hugh@veritas.com> Silly final patch for anonmm rmap: change page_add_anon_rmap's mm arg to vma arg like anon_vma rmap, to smooth the transition between them.
2004-05-22[PATCH] rmap 24 no rmap fastcallsAndrew Morton
From: Hugh Dickins <hugh@veritas.com> I like CONFIG_REGPARM, even when it's forced on: because it's easy to force off for debugging - easier than editing out scattered fastcalls. Plus I've never understood why we make function foo a fastcall, but function bar not. Remove fastcall directives from rmap. And fix comment about mremap_moved race: it only applies to anon pages.
2004-05-22[PATCH] Convert i_shared_sem back to a spinlockAndrew Morton
Having a semaphore in there causes modest performance regressions on heavily mmap-intensive workloads on some hardware. Specifically, up to 30% in SDET on NUMAQ and big PPC64. So switch it back to being a spinlock. This does mean that unmap_vmas() needs to be told whether or not it is allowed to schedule away; that's simple to do via the zap_details structure. This change means that there will be high scheuling latencies when someone truncates a large file which is currently mmapped, but nobody does that anyway. The scheduling points in unmap_vmas() are mainly for munmap() and exit(), and they still will work OK for that. From: Hugh Dickins <hugh@veritas.com> Sorry, my premature optimizations (trying to pass down NULL zap_details except when needed) have caught you out doubly: unmap_mapping_range_list was NULLing the details even though atomic was set; and if it hadn't, then zap_pte_range would have missed free_swap_and_cache and pte_clear when pte not present. Moved the optimization into zap_pte_range itself. Plus massive documentation update. From: Hugh Dickins <hugh@veritas.com> Here's a second patch to add to the first: mremap's cows can't come home without releasing the i_mmap_lock, better move the whole "Subtle point" locking from move_vma into move_page_tables. And it's possible for the file that was behind an anonymous page to be truncated while we drop that lock, don't want to abort mremap because of VM_FAULT_SIGBUS. (Eek, should we be checking do_swap_page of a vm_file area against the truncate_count sequence? Technically yes, but I doubt we need bother.) - We cannot hold i_mmap_lock across move_one_page() because move_one_page() needs to perform __GFP_WAIT allocations of pagetable pages. - Move the cond_resched() out so we test it once per page rather than only when move_one_page() returns -EAGAIN.
2004-05-22[PATCH] rmap 11 mremap movesAndrew Morton
From: Hugh Dickins <hugh@veritas.com> A weakness of the anonmm scheme is its difficulty in tracking pages shared between two or more mms (one being an ancestor of the other), when mremap has been used to move a range of pages in one of those mms. mremap move is not very common anyway, and it's more often used on a page range exclusive to the mm; but uncommon though it may be, we must not allow unlocked pages to become unswappable. This patch follows Linus' suggestion, simply to take a private copy of the page in such a case: early C-O-W. My previous implementation was daft with respect to pages currently on swap: it insisted on swapping them in to copy them. No need for that: just take the copy when a page is brought in from swap, and its intended address is found to clash with what rmap has already noted. If do_swap_page has to make this copy in the mremap moved case (simply a call to do_wp_page), might as well do so also in the case when it's a write access but the page not exclusive, it's always seemed a little odd that swapin needed a second fault for that. A bug even: get_user_pages force imagines that a single call to handle_mm_fault must break C-O-W. Another bugfix: swapoff's unuse_process didn't check is_vm_hugetlb_page. Andrea's anon_vma has no such problem with mremap moved pages, handling them with elegant use of vm_pgoff - though at some cost to vma merging. How important is it to handle them efficiently? For now there's a msg printk(KERN_WARNING "%s: mremap moved %d cows\n", current->comm, cows);
2004-05-22[PATCH] rmap 10 add anonmm rmapAndrew Morton
From: Hugh Dickins <hugh@veritas.com> Hugh's anonmm object-based reverse mapping scheme for anonymous pages. We have not yet decided whether to adopt this scheme, or Andrea's more advanced anon_vma scheme. anonmm is easier for me to merge quickly, to replace the pte_chain rmap taken out in the previous patch; a patch to install Andrea's anon_vma will follow in due course. Why build up and tear down chains of pte pointers for anonymous pages, when a page can only appear at one particular address, in a restricted group of mms that might share it? (Except: see next patch on mremap.) Introduce struct anonmm per mm to track anonymous pages, all forks from one exec sharing the same bundle of linked anonmms. Anonymous pages originate in one mm, but may be forked into another mm of the bundle later on. Callouts from fork.c to allocate, dup and exit the anonmm structure private to rmap.c. From: Hugh Dickins <hugh@veritas.com> Two concurrent exits (of the last two mms sharing the anonhd). First exit_rmap brings anonhd->count down to 2, gets preempted (at the spin_unlock) by second, which brings anonhd->count down to 1, sees it's 1 and frees the anonhd (without making any change to anonhd->count itself), cpu goes on to do something new which reallocates the old anonhd as a new struct anonmm (probably not a head, in which case count will start at 1), first resumes after the spin_unlock and sees anonhd->count 1, frees "anonhd" again, it's used for something else, a later exit_rmap list_del finds list corrupt.
2004-05-22[PATCH] rmap 9 remove pte_chainsAndrew Morton
From: Hugh Dickins <hugh@veritas.com> Lots of deletions: the next patch will put in the new anon rmap, which should look clearer if first we remove all of the old pte-pointer-based rmap from the core in this patch - which therefore leaves anonymous rmap totally disabled, anon pages locked in memory until process frees them. Leave arch files (and page table rmap) untouched for now, clean them up in a later batch. A few constructive changes amidst all the deletions: Choose names (e.g. page_add_anon_rmap) and args (e.g. no more pteps) now so we need not revisit so many files in the next patch. Inline function page_dup_rmap for fork's copy_page_range, simply bumps mapcount under lock. cond_resched_lock in copy_page_range. Struct page rearranged: no pte union, just mapcount moved next to atomic count, so two ints can occupy one long on 64-bit; i386 struct page now 32 bytes even with PAE. Never pass PageReserved to page_remove_rmap, only do_wp_page did so. From: Hugh Dickins <hugh@veritas.com> Move page_add_anon_rmap's BUG_ON(page_mapping(page)) inside the rmap_lock (well, might as well just check mapping if !mapcount then): if this page is being mapped or unmapped on another cpu at the same time, page_mapping's PageAnon(page) and page->mapping are volatile. But page_mapping(page) is used more widely: I've a nasty feeling that clear_page_anon, page_add_anon_rmap and/or page_mapping need barriers added (also in 2.6.6 itself),
2004-05-22[PATCH] rmap 7 object-based rmapAndrew Morton
From: Hugh Dickins <hugh@veritas.com> Dave McCracken's object-based reverse mapping scheme for file pages: why build up and tear down chains of pte pointers for file pages, when page->mapping has i_mmap and i_mmap_shared lists of all the vmas which might contain that page, and it appears at one deterministic position within the vma (unless vma is nonlinear - see next patch)? Has some drawbacks: more work to locate the ptes from page_referenced and try_to_unmap, especially if the i_mmap lists contain a lot of vmas covering different ranges; has to down_trylock the i_shared_sem, and hope that doesn't fail too often. But attractive in that it uses less lowmem, and shifts the rmap burden away from the hot paths, to swapout. Hybrid scheme for the moment: carry on with pte_chains for anonymous pages, that's unchanged; but file pages keep mapcount in the pte union of struct page, where anonymous pages keep chain pointer or direct pte address: so page_mapped(page) works on both. Hugh massaged it a little: distinct page_add_file_rmap entry point; list searches check rss so as not to waste time on mms fully swapped out; check mapcount to terminate once all ptes have been found; and a WARN_ON if page_referenced should have but couldn't find all the ptes.
2004-05-14[PATCH] rename rmap_lock to page_map_lockAndrew Morton
Sync this up with Andrea's patches.
2004-04-12[PATCH] rmap 1 linux/rmap.hAndrew Morton
From: Hugh Dickins <hugh@veritas.com> First of a batch of three rmap patches: this initial batch of three paving the way for a move to some form of object-based rmap (probably Andrea's, but drawing from mine too), and making almost no functional change by itself. A few days will intervene before the next batch, to give the struct page changes in the second patch some exposure before proceeding. rmap 1 create include/linux/rmap.h Start small: linux/rmap-locking.h has already gathered some declarations unrelated to locking, and the rest of the rmap declarations were over in linux/swap.h: gather them all together in linux/rmap.h, and rename the pte_chain_lock to rmap_lock.