user/sven/linux.git/mm, branch v2.6.37.2

memory hotplug: one more lock on memory hotplug

2011-02-17T23:14:50Z

commit 925268a06dc2b1ff7bfcc37419a6827a0e739639 upstream. Now, memory_hotplug_(un)lock() is used for add/remove/offline pages for avoiding races with hibernation. But this should be held in online_pages(), too. It seems asymmetric. There are cases where one has to avoid a race with memory hotplug notifier and his own local code, and hotplug v.s. hotplug. This will add a generic solution for avoiding races. In other view, having lock here has no big impacts. online pages is tend to be done by udev script at el against each memory section one by one. Then, it's better to have lock here, too. Reviewed-by: Christoph Lameter Acked-by: David Rientjes Signed-off-by: KAMEZAWA Hiroyuki Signed-off-by: Pekka Enberg Signed-off-by: Greg Kroah-Hartman

memsw: handle swapaccount kernel parameter correctly

2011-02-17T23:14:46Z

commit fceda1bf498677501befc7da72fd2e4de7f18466 upstream. __setup based kernel command line parameters handlers which are handled in obsolete_checksetup are provided with the parameter value including = (more precisely everything right after the parameter name). This means that the current implementation of swapaccount[=1|0] doesn't work at all because if there is a value for the parameter then we are testing for "0" resp. "1" but we are getting "=0" resp. "=1" and if there is no parameter value we are getting an empty string rather than NULL. The original noswapccount parameter, which doesn't care about the value, works correctly. Signed-off-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki Cc: Daisuke Nishimura Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman

mm: page allocator: adjust the per-cpu counter threshold when memory is low

2011-02-17T23:14:38Z

commit 88f5acf88ae6a9778f6d25d0d5d7ec2d57764a97 upstream. Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory is low") noted that watermarks were based on the vmstat NR_FREE_PAGES. To avoid synchronization overhead, these counters are maintained on a per-cpu basis and drained both periodically and when a threshold is above a threshold. On large CPU systems, the difference between the estimate and real value of NR_FREE_PAGES can be very high. The system can get into a case where pages are allocated far below the min watermark potentially causing livelock issues. The commit solved the problem by taking a better reading of NR_FREE_PAGES when memory was low. Unfortately, as reported by Shaohua Li this accurate reading can consume a large amount of CPU time on systems with many sockets due to cache line bouncing. This patch takes a different approach. For large machines where counter drift might be unsafe and while kswapd is awake, the per-cpu thresholds for the target pgdat are reduced to limit the level of drift to what should be a safe level. This incurs a performance penalty in heavy memory pressure by a factor that depends on the workload and the machine but the machine should function correctly without accidentally exhausting all memory on a node. There is an additional cost when kswapd wakes and sleeps but the event is not expected to be frequent - in Shaohua's test case, there was one recorded sleep and wake event at least. To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is introduced that takes a more accurate reading of NR_FREE_PAGES when called from wakeup_kswapd, when deciding whether it is really safe to go back to sleep in sleeping_prematurely() and when deciding if a zone is really balanced or not in balance_pgdat(). We are still using an expensive function but limiting how often it is called. When the test case is reproduced, the time spent in the watermark functions is reduced. The following report is on the percentage of time spent cumulatively spent in the functions zone_nr_free_pages(), zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(), zone_page_state_snapshot(), zone_page_state(). vanilla 11.6615% disable-threshold 0.2584% David said: : We had to pull aa454840 "mm: page allocator: calculate a better estimate : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36 : internally because tests showed that it would cause the machine to stall : as the result of heavy kswapd activity. I merged it back with this fix as : it is pending in the -mm tree and it solves the issue we were seeing, so I : definitely think this should be pushed to -stable (and I would seriously : consider it for 2.6.37 inclusion even at this late date). Signed-off-by: Mel Gorman Reported-by: Shaohua Li Reviewed-by: Christoph Lameter Tested-by: Nicolas Bareil Cc: David Rientjes Cc: Kyle McMartin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman

slub: Avoid use of slub_lock in show_slab_objects()

2011-02-17T23:14:37Z

commit 04d94879c8a4973b5499dc26b9d38acee8928791 upstream. The purpose of the locking is to prevent removal and additions of nodes when statistics are gathered for a slab cache. So we need to avoid racing with memory hotplug functionality. It is enough to take the memory hotplug locks there instead of the slub_lock. online_pages() currently does not acquire the memory_hotplug lock. Another patch will be submitted by the memory hotplug authors to take the memory hotplug lock and describe the uses of the memory hotplug lock to protect against adding and removal of nodes from non hotplug data structures. Reported-and-tested-by: Bart Van Assche Acked-by: David Rientjes Signed-off-by: Christoph Lameter Signed-off-by: Pekka Enberg Signed-off-by: Greg Kroah-Hartman

memcg: fix account leak at failure of memsw acconting

2011-02-17T23:14:36Z

commit 01c88e2d6b7330c0cc5867fe2297e7d826e1337d upstream. Commit 4b53433468 ("memcg: clean up try_charge main loop") removes a cancel of charge at case: memory charge-> success. mem+swap charge-> failure. This leaks usage of memory. Fix it. Signed-off-by: KAMEZAWA Hiroyuki Reviewed-by: Johannes Weiner Acked-by: Daisuke Nishimura Cc: Balbir Singh Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman

mm: fix hugepage migration

2011-02-17T23:14:18Z

commit fd4a4663db293bfd5dc20fb4113977f62895e550 upstream. 2.6.37 added an unmap_and_move_huge_page() for memory failure recovery, but its anon_vma handling was still based around the 2.6.35 conventions. Update it to use page_lock_anon_vma, get_anon_vma, page_unlock_anon_vma, drop_anon_vma in the same way as we're now changing unmap_and_move(). I don't particularly like to propose this for stable when I've not seen its problems in practice nor tested the solution: but it's clearly out of synch at present. Signed-off-by: Hugh Dickins Cc: Mel Gorman Cc: Rik van Riel Cc: Naoya Horiguchi Cc: "Jun'ichi Nomura" Cc: Andi Kleen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman

mm: fix migration hangs on anon_vma lock

2011-02-17T23:14:18Z

commit 1ce82b69e96c838d007f316b8347b911fdfa9842 upstream. Increased usage of page migration in mmotm reveals that the anon_vma locking in unmap_and_move() has been deficient since 2.6.36 (or even earlier). Review at the time of f18194275c39835cb84563500995e0d503a32d9a ("mm: fix hang on anon_vma->root->lock") missed the issue here: the anon_vma to which we get a reference may already have been freed back to its slab (it is in use when we check page_mapped, but that can change), and so its anon_vma->root may be switched at any moment by reuse in anon_vma_prepare. Perhaps we could fix that with a get_anon_vma_unless_zero(), but let's not: just rely on page_lock_anon_vma() to do all the hard thinking for us, then we don't need any rcu read locking over here. In removing the rcu_unlock label: since PageAnon is a bit in page->mapping, it's impossible for a !page->mapping page to be anon; but insert VM_BUG_ON in case the implementation ever changes. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Hugh Dickins Reviewed-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Naoya Horiguchi Cc: "Jun'ichi Nomura" Cc: Andi Kleen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman

mm: migration: use rcu_dereference_protected when dereferencing the radix tree slot during file page migration

2011-02-17T23:14:18Z

commit 29c1f677d424e8c5683a837fc4f03fc9f19201d7 upstream. migrate_pages() -> unmap_and_move() only calls rcu_read_lock() for anonymous pages, as introduced by git commit 989f89c57e6361e7d16fbd9572b5da7d313b073d ("fix rcu_read_lock() in page migraton"). The point of the RCU protection there is part of getting a stable reference to anon_vma and is only held for anon pages as file pages are locked which is sufficient protection against freeing. However, while a file page's mapping is being migrated, the radix tree is double checked to ensure it is the expected page. This uses radix_tree_deref_slot() -> rcu_dereference() without the RCU lock held triggering the following warning. [ 173.674290] =================================================== [ 173.676016] [ INFO: suspicious rcu_dereference_check() usage. ] [ 173.676016] --------------------------------------------------- [ 173.676016] include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection! [ 173.676016] [ 173.676016] other info that might help us debug this: [ 173.676016] [ 173.676016] [ 173.676016] rcu_scheduler_active = 1, debug_locks = 0 [ 173.676016] 1 lock held by hugeadm/2899: [ 173.676016] #0: (&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [] migrate_page_move_mapping+0x40/0x1ab [ 173.676016] [ 173.676016] stack backtrace: [ 173.676016] Pid: 2899, comm: hugeadm Not tainted 2.6.37-rc5-autobuild [ 173.676016] Call Trace: [ 173.676016] [] ? printk+0x14/0x1b [ 173.676016] [] lockdep_rcu_dereference+0x7d/0x86 [ 173.676016] [] migrate_page_move_mapping+0xca/0x1ab [ 173.676016] [] migrate_page+0x23/0x39 [ 173.676016] [] buffer_migrate_page+0x22/0x107 [ 173.676016] [] ? buffer_migrate_page+0x0/0x107 [ 173.676016] [] move_to_new_page+0x9a/0x1ae [ 173.676016] [] migrate_pages+0x1e7/0x2fa This patch introduces radix_tree_deref_slot_protected() which calls rcu_dereference_protected(). Users of it must pass in the mapping->tree_lock that is protecting this dereference. Holding the tree lock protects against parallel updaters of the radix tree meaning that rcu_dereference_protected is allowable. [akpm@linux-foundation.org: remove unneeded casts] Signed-off-by: Mel Gorman Cc: Minchan Kim Cc: KAMEZAWA Hiroyuki Cc: Milton Miller Cc: Nick Piggin Cc: Wu Fengguang Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman

memcg: fix wrong VM_BUG_ON() in try_charge()'s mm->owner check

2010-12-30T18:07:06Z

At __mem_cgroup_try_charge(), VM_BUG_ON(!mm->owner) is checked. But as commented in mem_cgroup_from_task(), mm->owner can be NULL in some racy case. This check of VM_BUG_ON() is bad. A possible story to hit this is at swapoff()->try_to_unuse(). It passes mm_struct to mem_cgroup_try_charge_swapin() while mm->owner is NULL. If we can't get proper mem_cgroup from swap_cgroup information, mm->owner is used as charge target and we see NULL. Cc: Daisuke Nishimura Cc: KOSAKI Motohiro Reported-by: Hugh Dickins Reported-by: Thomas Meyer Signed-off-by: KAMEZAWA Hiroyuki Reviewed-by: Balbir Singh Signed-off-by: Hugh Dickins Cc: stable@kernel.org Signed-off-by: Linus Torvalds

Merge branch 'nommu-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/nommu-2.6

2010-12-27T18:36:27Z

* 'nommu-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/nommu-2.6: nommu: Provide stubbed alloc/free_vm_area() implementation. nommu: Fix up vmalloc_node() symbol export regression.