From f0774d884bad7007b54cfffb5c93c23420c75aa6 Mon Sep 17 00:00:00 2001 From: Sasha Levin Date: Mon, 23 Feb 2015 05:38:00 -0500 Subject: mm: shmem: check for mapping owner before dereferencing mapping->host can be NULL and shouldn't be dereferenced before being checked. [ 1295.741844] GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] SMP KASAN [ 1295.746387] Dumping ftrace buffer: [ 1295.748217] (ftrace buffer empty) [ 1295.749527] Modules linked in: [ 1295.750268] CPU: 62 PID: 23410 Comm: trinity-c70 Not tainted 3.19.0-next-20150219-sasha-00045-g9130270f #1939 [ 1295.750268] task: ffff8803a49db000 ti: ffff8803a4dc8000 task.ti: ffff8803a4dc8000 [ 1295.750268] RIP: shmem_mapping (mm/shmem.c:1458) [ 1295.750268] RSP: 0000:ffff8803a4dcfbf8 EFLAGS: 00010206 [ 1295.750268] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 00000000000f2804 [ 1295.750268] RDX: 0000000000000005 RSI: 0400000000000794 RDI: 0000000000000028 [ 1295.750268] RBP: ffff8803a4dcfc08 R08: 0000000000000000 R09: 00000000031de000 [ 1295.750268] R10: dffffc0000000000 R11: 00000000031c1000 R12: 0400000000000794 [ 1295.750268] R13: 00000000031c2000 R14: 00000000031de000 R15: ffff880e3bdc1000 [ 1295.750268] FS: 00007f8703c7e700(0000) GS:ffff881164800000(0000) knlGS:0000000000000000 [ 1295.750268] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1295.750268] CR2: 0000000004e58000 CR3: 00000003a9f3c000 CR4: 00000000000007a0 [ 1295.750268] DR0: ffffffff81000000 DR1: 0000009494949494 DR2: 0000000000000000 [ 1295.750268] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000d0602 [ 1295.750268] Stack: [ 1295.750268] ffff8803a4dcfec8 ffffffffbb1dc770 ffff8803a4dcfc38 ffffffffad6f230b [ 1295.750268] ffffffffad6f2b0d 0000014100000000 ffff88001e17c08b ffff880d9453fe08 [ 1295.750268] ffff8803a4dcfd18 ffffffffad6f2ce2 ffff8803a49dbcd8 ffff8803a49dbce0 [ 1295.750268] Call Trace: [ 1295.750268] mincore_page (mm/mincore.c:61) [ 1295.750268] ? mincore_pte_range (include/linux/spinlock.h:312 mm/mincore.c:131) [ 1295.750268] mincore_pte_range (mm/mincore.c:151) [ 1295.750268] ? mincore_unmapped_range (mm/mincore.c:113) [ 1295.750268] __walk_page_range (mm/pagewalk.c:51 mm/pagewalk.c:90 mm/pagewalk.c:116 mm/pagewalk.c:204) [ 1295.750268] walk_page_range (mm/pagewalk.c:275) [ 1295.750268] SyS_mincore (mm/mincore.c:191 mm/mincore.c:253 mm/mincore.c:220) [ 1295.750268] ? mincore_pte_range (mm/mincore.c:220) [ 1295.750268] ? mincore_unmapped_range (mm/mincore.c:113) [ 1295.750268] ? __mincore_unmapped_range (mm/mincore.c:105) [ 1295.750268] ? ptlock_free (mm/mincore.c:24) [ 1295.750268] ? syscall_trace_enter (arch/x86/kernel/ptrace.c:1610) [ 1295.750268] ia32_do_call (arch/x86/ia32/ia32entry.S:446) [ 1295.750268] Code: e5 48 c1 ea 03 53 48 89 fb 48 83 ec 08 80 3c 02 00 75 4f 48 b8 00 00 00 00 00 fc ff df 48 8b 1b 48 8d 7b 28 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 3f 48 b8 00 00 00 00 00 fc ff df 48 8b 5b 28 48 All code ======== 0: e5 48 in $0x48,%eax 2: c1 ea 03 shr $0x3,%edx 5: 53 push %rbx 6: 48 89 fb mov %rdi,%rbx 9: 48 83 ec 08 sub $0x8,%rsp d: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1) 11: 75 4f jne 0x62 13: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax 1a: fc ff df 1d: 48 8b 1b mov (%rbx),%rbx 20: 48 8d 7b 28 lea 0x28(%rbx),%rdi 24: 48 89 fa mov %rdi,%rdx 27: 48 c1 ea 03 shr $0x3,%rdx 2b:* 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1) <-- trapping instruction 2f: 75 3f jne 0x70 31: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax 38: fc ff df 3b: 48 8b 5b 28 mov 0x28(%rbx),%rbx 3f: 48 rex.W ... Code starting with the faulting instruction =========================================== 0: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1) 4: 75 3f jne 0x45 6: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax d: fc ff df 10: 48 8b 5b 28 mov 0x28(%rbx),%rbx 14: 48 rex.W ... [ 1295.750268] RIP shmem_mapping (mm/shmem.c:1458) [ 1295.750268] RSP Fixes: 97b713ba3e ("fs: kill BDI_CAP_SWAP_BACKED") Signed-off-by: Sasha Levin Reviewed-by: Christoph Hellwig Signed-off-by: Jens Axboe --- mm/shmem.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'mm') diff --git a/mm/shmem.c b/mm/shmem.c index 2f17cb5f00a4..cf2d0ca010bc 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1455,6 +1455,9 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode bool shmem_mapping(struct address_space *mapping) { + if (!mapping->host) + return false; + return mapping->host->i_sb->s_op == &shmem_ops; } -- cgit v1.2.3 From da616534ed7f6e8ffaab699258b55c8d78d0b4ea Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Fri, 27 Feb 2015 15:51:43 -0800 Subject: mm/nommu: fix memory leak Maxime reported the following memory leak regression due to commit dbc8358c7237 ("mm/nommu: use alloc_pages_exact() rather than its own implementation"). On v3.19, I am facing a memory leak. Each time I run a command one page is lost. Here an example with busybox's free command: / # free total used free shared buffers cached Mem: 7928 1972 5956 0 0 492 -/+ buffers/cache: 1480 6448 / # free total used free shared buffers cached Mem: 7928 1976 5952 0 0 492 -/+ buffers/cache: 1484 6444 / # free total used free shared buffers cached Mem: 7928 1980 5948 0 0 492 -/+ buffers/cache: 1488 6440 / # free total used free shared buffers cached Mem: 7928 1984 5944 0 0 492 -/+ buffers/cache: 1492 6436 / # free total used free shared buffers cached Mem: 7928 1988 5940 0 0 492 -/+ buffers/cache: 1496 6432 At some point, the system fails to sastisfy 256KB allocations: free: page allocation failure: order:6, mode:0xd0 CPU: 0 PID: 67 Comm: free Not tainted 3.19.0-05389-gacf2cf1-dirty #64 Hardware name: STM32 (Device Tree Support) show_stack+0xb/0xc warn_alloc_failed+0x97/0xbc __alloc_pages_nodemask+0x295/0x35c __get_free_pages+0xb/0x24 alloc_pages_exact+0x19/0x24 do_mmap_pgoff+0x423/0x658 vm_mmap_pgoff+0x3f/0x4e load_flat_file+0x20d/0x4f8 load_flat_binary+0x3f/0x26c search_binary_handler+0x51/0xe4 do_execveat_common+0x271/0x35c do_execve+0x19/0x1c ret_fast_syscall+0x1/0x4a Mem-info: Normal per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 active_anon:0 inactive_anon:0 isolated_anon:0 active_file:0 inactive_file:0 isolated_file:0 unevictable:123 dirty:0 writeback:0 unstable:0 free:1515 slab_reclaimable:17 slab_unreclaimable:139 mapped:0 shmem:0 pagetables:0 bounce:0 free_cma:0 Normal free:6060kB min:352kB low:440kB high:528kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:492kB isolated(anon):0ks lowmem_reserve[]: 0 0 Normal: 23*4kB (U) 22*8kB (U) 24*16kB (U) 23*32kB (U) 23*64kB (U) 23*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB 123 total pagecache pages 2048 pages of RAM 1538 free pages 66 reserved pages 109 slab pages -46 pages shared 0 pages swap cached nommu: Allocation of length 221184 from process 67 (free) failed Normal per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 active_anon:0 inactive_anon:0 isolated_anon:0 active_file:0 inactive_file:0 isolated_file:0 unevictable:123 dirty:0 writeback:0 unstable:0 free:1515 slab_reclaimable:17 slab_unreclaimable:139 mapped:0 shmem:0 pagetables:0 bounce:0 free_cma:0 Normal free:6060kB min:352kB low:440kB high:528kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:492kB isolated(anon):0ks lowmem_reserve[]: 0 0 Normal: 23*4kB (U) 22*8kB (U) 24*16kB (U) 23*32kB (U) 23*64kB (U) 23*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB 123 total pagecache pages Unable to allocate RAM for process text/data, errno 12 SEGV This problem happens because we allocate ordered page through __get_free_pages() in do_mmap_private() in some cases and we try to free individual pages rather than ordered page in free_page_series(). In this case, freeing pages whose refcount is not 0 won't be freed to the page allocator so memory leak happens. To fix the problem, this patch changes __get_free_pages() to alloc_pages_exact() since alloc_pages_exact() returns physically-contiguous pages but each pages are refcounted. Fixes: dbc8358c7237 ("mm/nommu: use alloc_pages_exact() rather than its own implementation"). Reported-by: Maxime Coquelin Tested-by: Maxime Coquelin Signed-off-by: Joonsoo Kim Cc: [3.19] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/nommu.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/nommu.c b/mm/nommu.c index 7296360fc057..3e67e7538ecf 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1213,11 +1213,9 @@ static int do_mmap_private(struct vm_area_struct *vma, if (sysctl_nr_trim_pages && total - point >= sysctl_nr_trim_pages) { total = point; kdebug("try to alloc exact %lu pages", total); - base = alloc_pages_exact(len, GFP_KERNEL); - } else { - base = (void *)__get_free_pages(GFP_KERNEL, order); } + base = alloc_pages_exact(total << PAGE_SHIFT, GFP_KERNEL); if (!base) goto enomem; -- cgit v1.2.3 From 4e54dede38b45052a941bcf709f7d29f2e18174d Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 27 Feb 2015 15:51:46 -0800 Subject: memcg: fix low limit calculation A memcg is considered low limited even when the current usage is equal to the low limit. This leads to interesting side effects e.g. groups/hierarchies with no memory accounted are considered protected and so the reclaim will emit MEMCG_LOW event when encountering them. Another and much bigger issue was reported by Joonsoo Kim. He has hit a NULL ptr dereference with the legacy cgroup API which even doesn't have low limit exposed. The limit is 0 by default but the initial check fails for memcg with 0 consumption and parent_mem_cgroup() would return NULL if use_hierarchy is 0 and so page_counter_read would try to dereference NULL. I suppose that the current implementation is just an overlook because the documentation in Documentation/cgroups/unified-hierarchy.txt says: "The memory.low boundary on the other hand is a top-down allocated reserve. A cgroup enjoys reclaim protection when it and all its ancestors are below their low boundaries" Fix the usage and the low limit comparision in mem_cgroup_low accordingly. Fixes: 241994ed8649 (mm: memcontrol: default hierarchy interface for memory) Reported-by: Joonsoo Kim Signed-off-by: Michal Hocko Acked-by: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d18d3a6e7337..76c5a1b1dd21 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5426,7 +5426,7 @@ bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg) if (memcg == root_mem_cgroup) return false; - if (page_counter_read(&memcg->memory) > memcg->low) + if (page_counter_read(&memcg->memory) >= memcg->low) return false; while (memcg != root) { @@ -5435,7 +5435,7 @@ bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg) if (memcg == root_mem_cgroup) break; - if (page_counter_read(&memcg->memory) > memcg->low) + if (page_counter_read(&memcg->memory) >= memcg->low) return false; } return true; -- cgit v1.2.3 From d2973697b377e8b061e179c6057e94151759a73d Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Fri, 27 Feb 2015 15:52:04 -0800 Subject: mm: memcontrol: use "max" instead of "infinity" in control knobs The memcg control knobs indicate the highest possible value using the symbolic name "infinity", which is long and awkward to type. Switch to the string "max", which is just as descriptive but shorter and sweeter. This changes a user interface, so do it before the release and before the development flag is dropped from the default hierarchy. Signed-off-by: Johannes Weiner Cc: Michal Hocko Cc: Tejun Heo Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/unified-hierarchy.txt | 4 ++-- mm/memcontrol.c | 12 ++++++------ 2 files changed, 8 insertions(+), 8 deletions(-) (limited to 'mm') diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt index 71daa35ec2d9..eb102fb72213 100644 --- a/Documentation/cgroups/unified-hierarchy.txt +++ b/Documentation/cgroups/unified-hierarchy.txt @@ -404,8 +404,8 @@ supported and the interface files "release_agent" and be understood as an underflow into the highest possible value, -2 or -10M etc. do not work, so it's not consistent. - memory.low, memory.high, and memory.max will use the string - "infinity" to indicate and set the highest possible value. + memory.low, memory.high, and memory.max will use the string "max" to + indicate and set the highest possible value. 5. Planned Changes diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 76c5a1b1dd21..9fe07692eaad 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5247,7 +5247,7 @@ static int memory_low_show(struct seq_file *m, void *v) unsigned long low = ACCESS_ONCE(memcg->low); if (low == PAGE_COUNTER_MAX) - seq_puts(m, "infinity\n"); + seq_puts(m, "max\n"); else seq_printf(m, "%llu\n", (u64)low * PAGE_SIZE); @@ -5262,7 +5262,7 @@ static ssize_t memory_low_write(struct kernfs_open_file *of, int err; buf = strstrip(buf); - err = page_counter_memparse(buf, "infinity", &low); + err = page_counter_memparse(buf, "max", &low); if (err) return err; @@ -5277,7 +5277,7 @@ static int memory_high_show(struct seq_file *m, void *v) unsigned long high = ACCESS_ONCE(memcg->high); if (high == PAGE_COUNTER_MAX) - seq_puts(m, "infinity\n"); + seq_puts(m, "max\n"); else seq_printf(m, "%llu\n", (u64)high * PAGE_SIZE); @@ -5292,7 +5292,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, int err; buf = strstrip(buf); - err = page_counter_memparse(buf, "infinity", &high); + err = page_counter_memparse(buf, "max", &high); if (err) return err; @@ -5307,7 +5307,7 @@ static int memory_max_show(struct seq_file *m, void *v) unsigned long max = ACCESS_ONCE(memcg->memory.limit); if (max == PAGE_COUNTER_MAX) - seq_puts(m, "infinity\n"); + seq_puts(m, "max\n"); else seq_printf(m, "%llu\n", (u64)max * PAGE_SIZE); @@ -5322,7 +5322,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, int err; buf = strstrip(buf); - err = page_counter_memparse(buf, "infinity", &max); + err = page_counter_memparse(buf, "max", &max); if (err) return err; -- cgit v1.2.3 From cc87317726f851531ae8422e0c2d3d6e2d7b1955 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Fri, 27 Feb 2015 15:52:09 -0800 Subject: mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change Historically, !__GFP_FS allocations were not allowed to invoke the OOM killer once reclaim had failed, but nevertheless kept looping in the allocator. Commit 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath"), which should have been a simple cleanup patch, accidentally changed the behavior to aborting the allocation at that point. This creates problems with filesystem callers (?) that currently rely on the allocator waiting for other tasks to intervene. Revert the behavior as it shouldn't have been changed as part of a cleanup patch. Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath") Signed-off-by: Johannes Weiner Acked-by: Michal Hocko Reported-by: Tetsuo Handa Cc: Theodore Ts'o Cc: Dave Chinner Acked-by: David Rientjes Cc: Oleg Nesterov Cc: Mel Gorman Cc: [3.19.x] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a47f0b229a1a..7abfa70cdc1a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2353,8 +2353,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, if (ac->high_zoneidx < ZONE_NORMAL) goto out; /* The OOM killer does not compensate for light reclaim */ - if (!(gfp_mask & __GFP_FS)) + if (!(gfp_mask & __GFP_FS)) { + /* + * XXX: Page reclaim didn't yield anything, + * and the OOM killer can't be invoked, but + * keep looping as per should_alloc_retry(). + */ + *did_some_progress = 1; goto out; + } /* * GFP_THISNODE contains __GFP_NORETRY and we never hit this. * Sanity check for bare calls of __GFP_THISNODE, not real OOM. -- cgit v1.2.3