<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/mm/compaction.c, branch v3.4.93</title>
<subtitle>Linux Kernel</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v3.4.93</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v3.4.93'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2013-01-17T16:50:43Z</updated>
<entry>
<title>mm: compaction: fix echo 1 &gt; compact_memory return error issue</title>
<updated>2013-01-17T16:50:43Z</updated>
<author>
<name>Jason Liu</name>
<email>r64343@freescale.com</email>
</author>
<published>2013-01-11T22:31:47Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c0b96525363543a1ba6a277546ebc26ad9a53aa1'/>
<id>urn:sha1:c0b96525363543a1ba6a277546ebc26ad9a53aa1</id>
<content type='text'>
commit 7964c06d66c76507d8b6b662bffea770c29ef0ce upstream.

When running the following command under a shell, it returns an error:

  sh/$ echo 1 &gt; /proc/sys/vm/compact_memory
  sh/$ sh: write error: Bad address

After strace, I found the following log:

  ...
  write(1, "1\n", 2)               = 3
  write(1, "", 4294967295)         = -1 EFAULT (Bad address)
  write(2, "echo: write error: Bad address\n", 31echo: write error: Bad address
  ) = 31

This shows that the kernel returned 3 (COMPACT_COMPLETE) after the data was
written to compact_memory.

The fix is to make the kernel return 0 instead of 3 (COMPACT_COMPLETE) from
sysctl_compaction_handler() after compact_nodes() has finished.

Signed-off-by: Jason Liu &lt;r64343@freescale.com&gt;
Suggested-by: David Rientjes &lt;rientjes@google.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm, thp: abort compaction if migration page cannot be charged to memcg</title>
<updated>2012-07-16T16:04:44Z</updated>
<author>
<name>David Rientjes</name>
<email>rientjes@google.com</email>
</author>
<published>2012-07-11T21:02:13Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=7a08b440fa93e036968102597c8a2ab809a9bdc4'/>
<id>urn:sha1:7a08b440fa93e036968102597c8a2ab809a9bdc4</id>
<content type='text'>
commit 4bf2bba3750f10aa9e62e6949bc7e8329990f01b upstream.

If page migration cannot charge the temporary page to the memcg,
migrate_pages() will return -ENOMEM.  Memory compaction does not consider
this, however, and its loop continues to iterate over all pageblocks,
trying to isolate and migrate pages.  If a small number of very large
memcgs happen to be oom, these attempts will mostly be futile, leading to
an enormous amount of CPU consumption due to the page migration failures.

This patch will short circuit and fail memory compaction if
migrate_pages() returns -ENOMEM.  COMPACT_PARTIAL is returned in case
some migrations were successful so that the page allocator will retry.

Signed-off-by: David Rientjes &lt;rientjes@google.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Kamezawa Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm: compaction: make compact_control order signed</title>
<updated>2012-03-22T00:54:56Z</updated>
<author>
<name>Dan Carpenter</name>
<email>dan.carpenter@oracle.com</email>
</author>
<published>2012-03-21T23:33:54Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=aad6ec3777bf4930d4f7293745cc4c17a2d87947'/>
<id>urn:sha1:aad6ec3777bf4930d4f7293745cc4c17a2d87947</id>
<content type='text'>
"order" is -1 when compacting via /proc/sys/vm/compact_memory.  Making
it unsigned causes a bug in __compact_pgdat() when we test:

	if (cc-&gt;order &lt; 0 || !compaction_deferred(zone, cc-&gt;order))
		compact_zone(zone, cc);

[akpm@linux-foundation.org: make __compact_pgdat()'s comparison match other code sites]
Signed-off-by: Dan Carpenter &lt;dan.carpenter@oracle.com&gt;
Cc: Mel Gorman &lt;mel@csn.ul.ie&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>compact_pgdat: workaround lockdep warning in kswapd</title>
<updated>2012-03-22T00:54:56Z</updated>
<author>
<name>Hugh Dickins</name>
<email>hughd@google.com</email>
</author>
<published>2012-03-21T23:33:53Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=8575ec29f61da83a2bf382c8c490499dc022101e'/>
<id>urn:sha1:8575ec29f61da83a2bf382c8c490499dc022101e</id>
<content type='text'>
I get this lockdep warning from swapping load on linux-next, due to
"vmscan: kswapd carefully call compaction".

=================================
[ INFO: inconsistent lock state ]
3.3.0-rc2-next-20120201 #5 Not tainted
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -&gt; {IN-RECLAIM_FS-W} usage.
kswapd0/28 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (pcpu_alloc_mutex){+.+.?.}, at: [&lt;ffffffff810d6684&gt;] pcpu_alloc+0x67/0x325
{RECLAIM_FS-ON-W} state was registered at:
  [&lt;ffffffff81099b75&gt;] mark_held_locks+0xd7/0x103
  [&lt;ffffffff8109a13c&gt;] lockdep_trace_alloc+0x85/0x9e
  [&lt;ffffffff810f6bdc&gt;] __kmalloc+0x6c/0x14b
  [&lt;ffffffff810d57fd&gt;] pcpu_mem_zalloc+0x59/0x62
  [&lt;ffffffff810d5d16&gt;] pcpu_extend_area_map+0x26/0xb1
  [&lt;ffffffff810d679f&gt;] pcpu_alloc+0x182/0x325
  [&lt;ffffffff810d694d&gt;] __alloc_percpu+0xb/0xd
  [&lt;ffffffff8142ebfd&gt;] snmp_mib_init+0x1e/0x2e
  [&lt;ffffffff8185cd8d&gt;] ipv4_mib_init_net+0x7a/0x184
  [&lt;ffffffff813dc963&gt;] ops_init.clone.0+0x6b/0x73
  [&lt;ffffffff813dc9cc&gt;] register_pernet_operations+0x61/0xa0
  [&lt;ffffffff813dca8e&gt;] register_pernet_subsys+0x29/0x42
  [&lt;ffffffff8185d044&gt;] inet_init+0x1ad/0x252
  [&lt;ffffffff810002e3&gt;] do_one_initcall+0x7a/0x12f
  [&lt;ffffffff81832bc5&gt;] kernel_init+0x9d/0x11e
  [&lt;ffffffff814e51e4&gt;] kernel_thread_helper+0x4/0x10
irq event stamp: 656613
hardirqs last  enabled at (656613): [&lt;ffffffff814e0ddc&gt;] __mutex_unlock_slowpath+0x104/0x128
hardirqs last disabled at (656612): [&lt;ffffffff814e0d34&gt;] __mutex_unlock_slowpath+0x5c/0x128
softirqs last  enabled at (655568): [&lt;ffffffff8105b4a5&gt;] __do_softirq+0x120/0x136
softirqs last disabled at (654757): [&lt;ffffffff814e52dc&gt;] call_softirq+0x1c/0x30

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(pcpu_alloc_mutex);
  &lt;Interrupt&gt;
    lock(pcpu_alloc_mutex);

 *** DEADLOCK ***

no locks held by kswapd0/28.

stack backtrace:
Pid: 28, comm: kswapd0 Not tainted 3.3.0-rc2-next-20120201 #5
Call Trace:
 [&lt;ffffffff810981f4&gt;] print_usage_bug+0x1bf/0x1d0
 [&lt;ffffffff81096c3e&gt;] ? print_irq_inversion_bug+0x1d9/0x1d9
 [&lt;ffffffff810982c0&gt;] mark_lock_irq+0xbb/0x22e
 [&lt;ffffffff810c5399&gt;] ? free_hot_cold_page+0x13d/0x14f
 [&lt;ffffffff81098684&gt;] mark_lock+0x251/0x331
 [&lt;ffffffff81098893&gt;] mark_irqflags+0x12f/0x141
 [&lt;ffffffff81098e32&gt;] __lock_acquire+0x58d/0x753
 [&lt;ffffffff810d6684&gt;] ? pcpu_alloc+0x67/0x325
 [&lt;ffffffff81099433&gt;] lock_acquire+0x54/0x6a
 [&lt;ffffffff810d6684&gt;] ? pcpu_alloc+0x67/0x325
 [&lt;ffffffff8107a5b8&gt;] ? add_preempt_count+0xa9/0xae
 [&lt;ffffffff814e0a21&gt;] mutex_lock_nested+0x5e/0x315
 [&lt;ffffffff810d6684&gt;] ? pcpu_alloc+0x67/0x325
 [&lt;ffffffff81098f81&gt;] ? __lock_acquire+0x6dc/0x753
 [&lt;ffffffff810c9fb0&gt;] ? __pagevec_release+0x2c/0x2c
 [&lt;ffffffff810d6684&gt;] pcpu_alloc+0x67/0x325
 [&lt;ffffffff810c9fb0&gt;] ? __pagevec_release+0x2c/0x2c
 [&lt;ffffffff810d694d&gt;] __alloc_percpu+0xb/0xd
 [&lt;ffffffff8106c35e&gt;] schedule_on_each_cpu+0x23/0x110
 [&lt;ffffffff810c9fcb&gt;] lru_add_drain_all+0x10/0x12
 [&lt;ffffffff810f126f&gt;] __compact_pgdat+0x20/0x182
 [&lt;ffffffff810f15c2&gt;] compact_pgdat+0x27/0x29
 [&lt;ffffffff810c306b&gt;] ? zone_watermark_ok+0x1a/0x1c
 [&lt;ffffffff810cdf6f&gt;] balance_pgdat+0x732/0x751
 [&lt;ffffffff810ce0ed&gt;] kswapd+0x15f/0x178
 [&lt;ffffffff810cdf8e&gt;] ? balance_pgdat+0x751/0x751
 [&lt;ffffffff8106fd11&gt;] kthread+0x84/0x8c
 [&lt;ffffffff814e51e4&gt;] kernel_thread_helper+0x4/0x10
 [&lt;ffffffff810787ed&gt;] ? finish_task_switch+0x85/0xea
 [&lt;ffffffff814e3861&gt;] ? retint_restore_args+0xe/0xe
 [&lt;ffffffff8106fc8d&gt;] ? __init_kthread_worker+0x56/0x56
 [&lt;ffffffff814e51e0&gt;] ? gs_change+0xb/0xb

The RECLAIM_FS notations indicate that it's doing the GFP_FS checking that
Nick hacked into lockdep a while back: I think we're intended to read that
"&lt;Interrupt&gt;" in the DEADLOCK scenario as "&lt;Direct reclaim&gt;".

I'm hazy; I have not reached any conclusion as to whether it's right to
complain or not, but I believe lockdep is uneasy about kswapd now doing the
mutex_lock(&amp;pcpu_alloc_mutex) which lru_add_drain_all() entails.  Nor have
I reached any conclusion as to whether it's important for kswapd to do
that draining or not.

But so as not to get blocked on this, with lockdep disabled from giving
further reports, here's a patch which removes the lru_add_drain_all() from
kswapd's callpath (and calls it only once from compact_nodes(), instead of
once per node).

Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Acked-by: Mel Gorman &lt;mel@csn.ul.ie&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>vmscan: only defer compaction for failed order and higher</title>
<updated>2012-03-22T00:54:56Z</updated>
<author>
<name>Rik van Riel</name>
<email>riel@redhat.com</email>
</author>
<published>2012-03-21T23:33:52Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=aff622495c9a0b56148192e53bdec539f5e147f2'/>
<id>urn:sha1:aff622495c9a0b56148192e53bdec539f5e147f2</id>
<content type='text'>
Currently a failed order-9 (transparent hugepage) compaction can lead to
memory compaction being temporarily disabled for a memory zone, even if we
only need compaction for an order-2 allocation, e.g. for jumbo frame
networking.

The fix is relatively straightforward: keep track of the highest order at
which compaction is succeeding, and only defer compaction for orders at
which compaction is failing.

Signed-off-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Acked-by: Mel Gorman &lt;mel@csn.ul.ie&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Hillf Danton &lt;dhillf@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>vmscan: kswapd carefully call compaction</title>
<updated>2012-03-22T00:54:56Z</updated>
<author>
<name>Rik van Riel</name>
<email>riel@redhat.com</email>
</author>
<published>2012-03-21T23:33:52Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=7be62de99adcab4449d416977b4274985c5fe023'/>
<id>urn:sha1:7be62de99adcab4449d416977b4274985c5fe023</id>
<content type='text'>
With CONFIG_COMPACTION enabled, kswapd does not try to free contiguous
free pages, even when it is woken for a higher order request.

This can be bad for, e.g., jumbo frame network allocations, which are done
from interrupt context and cannot compact memory themselves.  Allocation
failure rates in the network receive path have been observed to be higher
than before in kernels with compaction enabled.

Teach kswapd to defragment the memory zones in a node, but only if
required and compaction is not deferred in a zone.

[akpm@linux-foundation.org: reduce scope of zones_need_compaction]
Signed-off-by: Rik van Riel &lt;riel@redhat.com&gt;
Acked-by: Mel Gorman &lt;mel@csn.ul.ie&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Hillf Danton &lt;dhillf@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: compaction: check for overlapping nodes during isolation for migration</title>
<updated>2012-02-09T03:03:51Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-02-09T01:13:38Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=dc9086004b3d5db75997a645b3fe08d9138b7ad0'/>
<id>urn:sha1:dc9086004b3d5db75997a645b3fe08d9138b7ad0</id>
<content type='text'>
When isolating pages for migration, migration starts at the start of a
zone while the free scanner starts at the end of the zone.  Migration
avoids entering a new zone by never going beyond the free scanner.

Unfortunately, in very rare cases nodes can overlap.  When this happens,
migration isolates pages without the LRU lock held, corrupting lists
which will trigger errors in reclaim or during page free such as in the
following oops

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
  IP: [&lt;ffffffff810f795c&gt;] free_pcppages_bulk+0xcc/0x450
  PGD 1dda554067 PUD 1e1cb58067 PMD 0
  Oops: 0000 [#1] SMP
  CPU 37
  Pid: 17088, comm: memcg_process_s Tainted: G            X
  RIP: free_pcppages_bulk+0xcc/0x450
  Process memcg_process_s (pid: 17088, threadinfo ffff881c2926e000, task ffff881c2926c0c0)
  Call Trace:
    free_hot_cold_page+0x17e/0x1f0
    __pagevec_free+0x90/0xb0
    release_pages+0x22a/0x260
    pagevec_lru_move_fn+0xf3/0x110
    putback_lru_page+0x66/0xe0
    unmap_and_move+0x156/0x180
    migrate_pages+0x9e/0x1b0
    compact_zone+0x1f3/0x2f0
    compact_zone_order+0xa2/0xe0
    try_to_compact_pages+0xdf/0x110
    __alloc_pages_direct_compact+0xee/0x1c0
    __alloc_pages_slowpath+0x370/0x830
    __alloc_pages_nodemask+0x1b1/0x1c0
    alloc_pages_vma+0x9b/0x160
    do_huge_pmd_anonymous_page+0x160/0x270
    do_page_fault+0x207/0x4c0
    page_fault+0x25/0x30

The "X" in the taint flag means that external modules were loaded, but
that is unrelated to the bug triggering.  The real problem is that the
PFN layout looks like this:

  Zone PFN ranges:
    DMA      0x00000010 -&gt; 0x00001000
    DMA32    0x00001000 -&gt; 0x00100000
    Normal   0x00100000 -&gt; 0x01e80000
  Movable zone start PFN for each node
  early_node_map[14] active PFN ranges
      0: 0x00000010 -&gt; 0x0000009b
      0: 0x00000100 -&gt; 0x0007a1ec
      0: 0x0007a354 -&gt; 0x0007a379
      0: 0x0007f7ff -&gt; 0x0007f800
      0: 0x00100000 -&gt; 0x00680000
      1: 0x00680000 -&gt; 0x00e80000
      0: 0x00e80000 -&gt; 0x01080000
      1: 0x01080000 -&gt; 0x01280000
      0: 0x01280000 -&gt; 0x01480000
      1: 0x01480000 -&gt; 0x01680000
      0: 0x01680000 -&gt; 0x01880000
      1: 0x01880000 -&gt; 0x01a80000
      0: 0x01a80000 -&gt; 0x01c80000
      1: 0x01c80000 -&gt; 0x01e80000

The fix is straightforward.  isolate_migratepages() has to make a check
similar to isolate_freepages() to ensure that it never isolates pages
from a zone it does not hold the LRU lock for.

This was discovered in a 3.0-based kernel but it affects 3.1.x, 3.2.x
and current mainline.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Acked-by: Michal Nazarewicz &lt;mina86@mina86.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: compaction: check pfn_valid when entering a new MAX_ORDER_NR_PAGES block during isolation for migration</title>
<updated>2012-02-04T00:16:41Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-02-03T23:37:18Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=0bf380bc70ecba68cb4d74dc656cc2fa8c4d801a'/>
<id>urn:sha1:0bf380bc70ecba68cb4d74dc656cc2fa8c4d801a</id>
<content type='text'>
When isolating for migration, migration starts at the start of a zone
which is not necessarily pageblock aligned.  Further, it stops isolating
when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally
not aligned.  This allows isolate_migratepages() to call pfn_to_page() on
an invalid PFN which can result in a crash.  This was originally reported
against a 3.0-based kernel with the following trace in a crash dump.

PID: 9902   TASK: d47aecd0  CPU: 0   COMMAND: "memcg_process_s"
 #0 [d72d3ad0] crash_kexec at c028cfdb
 #1 [d72d3b24] oops_end at c05c5322
 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60
 #3 [d72d3bec] bad_area at c0227fb6
 #4 [d72d3c00] do_page_fault at c05c72ec
 #5 [d72d3c80] error_code (via page_fault) at c05c47a4
    EAX: 00000000  EBX: 000c0000  ECX: 00000001  EDX: 00000807  EBP: 000c0000
    DS:  007b      ESI: 00000001  ES:  007b      EDI: f3000a80  GS:  6f50
    CS:  0060      EIP: c030b15a  ERR: ffffffff  EFLAGS: 00010002
 #6 [d72d3cb4] isolate_migratepages at c030b15a
 #7 [d72d3d14] zone_watermark_ok at c02d26cb
 #8 [d72d3d2c] compact_zone at c030b8de
 #9 [d72d3d68] compact_zone_order at c030bba1
#10 [d72d3db4] try_to_compact_pages at c030bc84
#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7
#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7
#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97
#14 [d72d3eb8] alloc_pages_vma at c030a845
#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb
#16 [d72d3f00] handle_mm_fault at c02f36c6
#17 [d72d3f30] do_page_fault at c05c70ed
#18 [d72d3fb0] error_code (via page_fault) at c05c47a4
    EAX: b71ff000  EBX: 00000001  ECX: 00001600  EDX: 00000431
    DS:  007b      ESI: 08048950  ES:  007b      EDI: bfaa3788
    SS:  007b      ESP: bfaa36e0  EBP: bfaa3828  GS:  6f50
    CS:  0073      EIP: 080487c8  ERR: ffffffff  EFLAGS: 00010202

It was also reported by Herbert van den Bergh against 3.1-based kernel
with the following snippet from the console log.

BUG: unable to handle kernel paging request at 01c00008
IP: [&lt;c0522399&gt;] isolate_migratepages+0x119/0x390
*pdpt = 000000002f7ce001 *pde = 0000000000000000

It is expected that it also affects 3.2.x and current mainline.

The problem is that pfn_valid is only called on the first PFN being
checked, and that PFN is not necessarily aligned.  Let's say we have a
case like this:

H = MAX_ORDER_NR_PAGES boundary
| = pageblock boundary
m = cc-&gt;migrate_pfn
f = cc-&gt;free_pfn
o = memory hole

H------|------H------|----m-Hoooooo|ooooooH-f----|------H

The migrate_pfn is just below a memory hole and the free scanner is beyond
the hole.  When isolate_migratepages() starts, it scans from migrate_pfn to
migrate_pfn+pageblock_nr_pages, which is now in a memory hole.  It checks
pfn_valid() on the first PFN but then scans into the hole, where there are
not necessarily valid struct pages.

This patch ensures that isolate_migratepages calls pfn_valid when
necessary.

Reported-by: Herbert van den Bergh &lt;herbert.van.den.bergh@oracle.com&gt;
Tested-by: Herbert van den Bergh &lt;herbert.van.den.bergh@oracle.com&gt;
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Acked-by: Michal Nazarewicz &lt;mina86@mina86.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: compaction: introduce sync-light migration for use by compaction</title>
<updated>2012-01-13T04:13:09Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-01-13T01:19:43Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a6bc32b899223a877f595ef9ddc1e89ead5072b8'/>
<id>urn:sha1:a6bc32b899223a877f595ef9ddc1e89ead5072b8</id>
<content type='text'>
This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
mode that avoids writing back pages to backing storage.  Async compaction
maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
used.

This avoids sync compaction stalling for an excessive length of time,
particularly when copying files to a USB stick where there might be a
large number of dirty pages backed by a filesystem that does not support
-&gt;writepages.

[aarcange@redhat.com: This patch is heavily based on Andrea's work]
[akpm@linux-foundation.org: fix fs/nfs/write.c build]
[akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Cc: Dave Jones &lt;davej@redhat.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Andy Isaacson &lt;adi@hexapodia.org&gt;
Cc: Nai Xia &lt;nai.xia@gmail.com&gt;
Cc: Johannes Weiner &lt;jweiner@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: compaction: make isolate_lru_page() filter-aware again</title>
<updated>2012-01-13T04:13:09Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-01-13T01:19:38Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c82449352854ff09e43062246af86bdeb628f0c3'/>
<id>urn:sha1:c82449352854ff09e43062246af86bdeb628f0c3</id>
<content type='text'>
Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
noted that compaction does not migrate dirty or writeback pages and that
it was meaningless to pick such a page and re-add it to the LRU list.
This had to be partially reverted because some dirty pages can be
migrated by compaction without blocking.

This patch updates "mm: compaction: make isolate_lru_page()" by skipping
over pages that migration has no possibility of migrating, to minimise
LRU disruption.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Reviewed-by: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Dave Jones &lt;davej@redhat.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Andy Isaacson &lt;adi@hexapodia.org&gt;
Cc: Nai Xia &lt;nai.xia@gmail.com&gt;
Cc: Johannes Weiner &lt;jweiner@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
