<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/mm, branch v3.10.49</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v3.10.49</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v3.10.49'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2014-07-17T22:58:00Z</updated>
<entry>
<title>cpuset,mempolicy: fix sleeping function called from invalid context</title>
<updated>2014-07-17T22:58:00Z</updated>
<author>
<name>Gu Zheng</name>
<email>guz.fnst@cn.fujitsu.com</email>
</author>
<published>2014-06-25T01:57:18Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3c33a9bdbccb44e5bbe89243c639f38493740492'/>
<id>urn:sha1:3c33a9bdbccb44e5bbe89243c639f38493740492</id>
<content type='text'>
commit 391acf970d21219a2a5446282d3b20eace0c0d7a upstream.

When runing with the kernel(3.15-rc7+), the follow bug occurs:
[ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
[ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
[ 9969.441175] INFO: lockdep is turned off.
[ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G       A      3.15.0-rc7+ #85
[ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
[ 9969.706052]  ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
[ 9969.795323]  ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
[ 9969.884710]  ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
[ 9969.974071] Call Trace:
[ 9970.003403]  [&lt;ffffffff8162f523&gt;] dump_stack+0x4d/0x66
[ 9970.065074]  [&lt;ffffffff8109995a&gt;] __might_sleep+0xfa/0x130
[ 9970.130743]  [&lt;ffffffff81633e6c&gt;] mutex_lock_nested+0x3c/0x4f0
[ 9970.200638]  [&lt;ffffffff811ba5dc&gt;] ? kmem_cache_alloc+0x1bc/0x210
[ 9970.272610]  [&lt;ffffffff81105807&gt;] cpuset_mems_allowed+0x27/0x140
[ 9970.344584]  [&lt;ffffffff811b1303&gt;] ? __mpol_dup+0x63/0x150
[ 9970.409282]  [&lt;ffffffff811b1385&gt;] __mpol_dup+0xe5/0x150
[ 9970.471897]  [&lt;ffffffff811b1303&gt;] ? __mpol_dup+0x63/0x150
[ 9970.536585]  [&lt;ffffffff81068c86&gt;] ? copy_process.part.23+0x606/0x1d40
[ 9970.613763]  [&lt;ffffffff810bf28d&gt;] ? trace_hardirqs_on+0xd/0x10
[ 9970.683660]  [&lt;ffffffff810ddddf&gt;] ? monotonic_to_bootbased+0x2f/0x50
[ 9970.759795]  [&lt;ffffffff81068cf0&gt;] copy_process.part.23+0x670/0x1d40
[ 9970.834885]  [&lt;ffffffff8106a598&gt;] do_fork+0xd8/0x380
[ 9970.894375]  [&lt;ffffffff81110e4c&gt;] ? __audit_syscall_entry+0x9c/0xf0
[ 9970.969470]  [&lt;ffffffff8106a8c6&gt;] SyS_clone+0x16/0x20
[ 9971.030011]  [&lt;ffffffff81642009&gt;] stub_clone+0x69/0x90
[ 9971.091573]  [&lt;ffffffff81641c29&gt;] ? system_call_fastpath+0x16/0x1b

The cause is that cpuset_mems_allowed() try to take
mutex_lock(&amp;callback_mutex) under the rcu_read_lock(which was hold in
__mpol_dup()). And in cpuset_mems_allowed(), the access to cpuset is
under rcu_read_lock, so in __mpol_dup, we can reduce the rcu_read_lock
protection region to protect the access to cpuset only in
current_cpuset_is_being_rebound(). So that we can avoid this bug.

This patch is a temporary solution that just addresses the bug
mentioned above, can not fix the long-standing issue about cpuset.mems
rebinding on fork():

"When the forker's task_struct is duplicated (which includes
 -&gt;mems_allowed) and it races with an update to cpuset_being_rebound
 in update_tasks_nodemask() then the task's mems_allowed doesn't get
 updated. And the child task's mems_allowed can be wrong if the
 cpuset's nodemask changes before the child has been added to the
 cgroup's tasklist."

Signed-off-by: Gu Zheng &lt;guz.fnst@cn.fujitsu.com&gt;
Acked-by: Li Zefan &lt;lizefan@huawei.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm: fix crashes from mbind() merging vmas</title>
<updated>2014-07-09T18:14:03Z</updated>
<author>
<name>Hugh Dickins</name>
<email>hughd@google.com</email>
</author>
<published>2014-06-23T20:22:07Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f76d0efeb668c5409dcce9dc3afafaee599bd757'/>
<id>urn:sha1:f76d0efeb668c5409dcce9dc3afafaee599bd757</id>
<content type='text'>
commit d05f0cdcbe6388723f1900c549b4850360545201 upstream.

In v2.6.34 commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
introduced vma merging to mbind(), but it should have also changed the
convention of passing start vma from queue_pages_range() (formerly
check_range()) to new_vma_page(): vma merging may have already freed
that structure, resulting in BUG at mm/mempolicy.c:1738 and probably
worse crashes.

Fixes: 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
Reported-by: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Tested-by: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Acked-by: Christoph Lameter &lt;cl@linux.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;


</content>
</entry>
<entry>
<title>hugetlb: fix copy_hugetlb_page_range() to handle migration/hwpoisoned entry</title>
<updated>2014-07-09T18:14:03Z</updated>
<author>
<name>Naoya Horiguchi</name>
<email>n-horiguchi@ah.jp.nec.com</email>
</author>
<published>2014-06-23T20:22:03Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=9b576da0f77415c2735a5bbdb581309fe22d5999'/>
<id>urn:sha1:9b576da0f77415c2735a5bbdb581309fe22d5999</id>
<content type='text'>
commit 4a705fef986231a3e7a6b1a6d3c37025f021f49f upstream.

There's a race between fork() and hugepage migration, as a result we try
to "dereference" a swap entry as a normal pte, causing kernel panic.
The cause of the problem is that copy_hugetlb_page_range() can't handle
"swap entry" family (migration entry and hwpoisoned entry) so let's fix
it.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Acked-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;


</content>
</entry>
<entry>
<title>mm: vmscan: clear kswapd's special reclaim powers before exiting</title>
<updated>2014-07-01T03:09:42Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2014-06-06T21:35:35Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f26bff271a16480ed1e1f3d0420cb1d36acda093'/>
<id>urn:sha1:f26bff271a16480ed1e1f3d0420cb1d36acda093</id>
<content type='text'>
commit 71abdc15adf8c702a1dd535f8e30df50758848d2 upstream.

When kswapd exits, it can end up taking locks that were previously held
by allocating tasks while they waited for reclaim.  Lockdep currently
warns about this:

On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
&gt;  inconsistent {RECLAIM_FS-ON-W} -&gt; {IN-RECLAIM_FS-R} usage.
&gt;  kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
&gt;   (&amp;sig-&gt;group_rwsem){+++++?}, at: exit_signals+0x24/0x130
&gt;  {RECLAIM_FS-ON-W} state was registered at:
&gt;     mark_held_locks+0xb9/0x140
&gt;     lockdep_trace_alloc+0x7a/0xe0
&gt;     kmem_cache_alloc_trace+0x37/0x240
&gt;     flex_array_alloc+0x99/0x1a0
&gt;     cgroup_attach_task+0x63/0x430
&gt;     attach_task_by_pid+0x210/0x280
&gt;     cgroup_procs_write+0x16/0x20
&gt;     cgroup_file_write+0x120/0x2c0
&gt;     vfs_write+0xc0/0x1f0
&gt;     SyS_write+0x4c/0xa0
&gt;     tracesys+0xdd/0xe2
&gt;  irq event stamp: 49
&gt;  hardirqs last  enabled at (49):  _raw_spin_unlock_irqrestore+0x36/0x70
&gt;  hardirqs last disabled at (48):  _raw_spin_lock_irqsave+0x2b/0xa0
&gt;  softirqs last  enabled at (0):  copy_process.part.24+0x627/0x15f0
&gt;  softirqs last disabled at (0):            (null)
&gt;
&gt;  other info that might help us debug this:
&gt;   Possible unsafe locking scenario:
&gt;
&gt;         CPU0
&gt;         ----
&gt;    lock(&amp;sig-&gt;group_rwsem);
&gt;    &lt;Interrupt&gt;
&gt;      lock(&amp;sig-&gt;group_rwsem);
&gt;
&gt;   *** DEADLOCK ***
&gt;
&gt;  no locks held by kswapd2/1151.
&gt;
&gt;  stack backtrace:
&gt;  CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
&gt;  Call Trace:
&gt;    dump_stack+0x19/0x1b
&gt;    print_usage_bug+0x1f7/0x208
&gt;    mark_lock+0x21d/0x2a0
&gt;    __lock_acquire+0x52a/0xb60
&gt;    lock_acquire+0xa2/0x140
&gt;    down_read+0x51/0xa0
&gt;    exit_signals+0x24/0x130
&gt;    do_exit+0xb5/0xa50
&gt;    kthread+0xdb/0x100
&gt;    ret_from_fork+0x7c/0xb0

This is because the kswapd thread is still marked as a reclaimer at the
time of exit.  But because it is exiting, nobody is actually waiting on
it to make reclaim progress anymore, and it's nothing but a regular
thread at this point.  Be tidy and strip it of all its powers
(PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state)
before returning from the thread function.

Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reported-by: Gu Zheng &lt;guz.fnst@cn.fujitsu.com&gt;
Cc: Yasuaki Ishimatsu &lt;isimatu.yasuaki@jp.fujitsu.com&gt;
Cc: Tang Chen &lt;tangchen@cn.fujitsu.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm: fix sleeping function warning from __put_anon_vma</title>
<updated>2014-07-01T03:09:42Z</updated>
<author>
<name>Hugh Dickins</name>
<email>hughd@google.com</email>
</author>
<published>2014-06-04T23:05:33Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ef1516486217fc24b0bd3555c593ae40d48b74eb'/>
<id>urn:sha1:ef1516486217fc24b0bd3555c593ae40d48b74eb</id>
<content type='text'>
commit 7f39dda9d86fb4f4f17af0de170decf125726f8c upstream.

Trinity reports BUG:

  sleeping function called from invalid context at kernel/locking/rwsem.c:47
  in_atomic(): 0, irqs_disabled(): 0, pid: 5787, name: trinity-c27

__might_sleep &lt; down_write &lt; __put_anon_vma &lt; page_get_anon_vma &lt;
migrate_pages &lt; compact_zone &lt; compact_zone_order &lt; try_to_compact_pages ..

Right, since conversion to mutex then rwsem, we should not put_anon_vma()
from inside an rcu_read_lock()ed section: fix the two places that did so.
And add might_sleep() to anon_vma_free(), as suggested by Peter Zijlstra.

Fixes: 88c22088bf23 ("mm: optimize page_lock_anon_vma() fast-path")
Reported-by: Dave Jones &lt;davej@redhat.com&gt;
Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm/memory-failure.c: don't let collect_procs() skip over processes for MF_ACTION_REQUIRED</title>
<updated>2014-07-01T03:09:42Z</updated>
<author>
<name>Tony Luck</name>
<email>tony.luck@intel.com</email>
</author>
<published>2014-06-04T23:11:01Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=15e09f82416374a28bcea21315f600df344f9a6c'/>
<id>urn:sha1:15e09f82416374a28bcea21315f600df344f9a6c</id>
<content type='text'>
commit 74614de17db6fb472370c426d4f934d8d616edf2 upstream.

When Linux sees an "action optional" machine check (where h/w has reported
an error that is not in the current execution path) we generally do not
want to signal a process, since most processes do not have a SIGBUS
handler - we'd just prematurely terminate the process for a problem that
they might never actually see.

task_early_kill() decides whether to consider a process - and it checks
whether this specific process has been marked for early signals with
"prctl", or if the system administrator has requested early signals for
all processes using /proc/sys/vm/memory_failure_early_kill.

But for MF_ACTION_REQUIRED case we must not defer.  The error is in the
execution path of the current thread so we must send the SIGBUS
immediatley.

Fix by passing a flag argument through collect_procs*() to
task_early_kill() so it knows whether we can defer or must take action.

Signed-off-by: Tony Luck &lt;tony.luck@intel.com&gt;
Signed-off-by: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Cc: Andi Kleen &lt;andi@firstfloor.org&gt;
Cc: Borislav Petkov &lt;bp@suse.de&gt;
Cc: Chen Gong &lt;gong.chen@linux.jf.intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm/memory-failure.c-failure: send right signal code to correct thread</title>
<updated>2014-07-01T03:09:42Z</updated>
<author>
<name>Tony Luck</name>
<email>tony.luck@intel.com</email>
</author>
<published>2014-06-04T23:10:59Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4451dd21497254e6f9d28a528bad70cf31a1db91'/>
<id>urn:sha1:4451dd21497254e6f9d28a528bad70cf31a1db91</id>
<content type='text'>
commit a70ffcac741d31a406c1d2b832ae43d658e7e1cf upstream.

When a thread in a multi-threaded application hits a machine check because
of an uncorrectable error in memory - we want to send the SIGBUS with
si.si_code = BUS_MCEERR_AR to that thread.  Currently we fail to do that
if the active thread is not the primary thread in the process.
collect_procs() just finds primary threads and this test:

	if ((flags &amp; MF_ACTION_REQUIRED) &amp;&amp; t == current) {

will see that the thread we found isn't the current thread and so send a
si.si_code = BUS_MCEERR_AO to the primary (and nothing to the active
thread at this time).

We can fix this by checking whether "current" shares the same mm with the
process that collect_procs() said owned the page.  If so, we send the
SIGBUS to current (with code BUS_MCEERR_AR).

Signed-off-by: Tony Luck &lt;tony.luck@intel.com&gt;
Signed-off-by: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Reported-by: Otto Bruggeman &lt;otto.g.bruggeman@intel.com&gt;
Cc: Andi Kleen &lt;andi@firstfloor.org&gt;
Cc: Borislav Petkov &lt;bp@suse.de&gt;
Cc: Chen Gong &lt;gong.chen@linux.jf.intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm: vmscan: do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL</title>
<updated>2014-07-01T03:09:41Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2014-06-04T23:07:35Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=16838de6ab0c2d951248b3a1c1591aa740af96b7'/>
<id>urn:sha1:16838de6ab0c2d951248b3a1c1591aa740af96b7</id>
<content type='text'>
commit 675becce15f320337499bc1a9356260409a5ba29 upstream.

throttle_direct_reclaim() is meant to trigger during swap-over-network
during which the min watermark is treated as a pfmemalloc reserve.  It
throttes on the first node in the zonelist but this is flawed.

The user-visible impact is that a process running on CPU whose local
memory node has no ZONE_NORMAL will stall for prolonged periods of time,
possibly indefintely.  This is due to throttle_direct_reclaim thinking the
pfmemalloc reserves are depleted when in fact they don't exist on that
node.

On a NUMA machine running a 32-bit kernel (I know) allocation requests
from CPUs on node 1 would detect no pfmemalloc reserves and the process
gets throttled.  This patch adjusts throttling of direct reclaim to
throttle based on the first node in the zonelist that has a usable
ZONE_NORMAL or lower zone.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm/compaction: make isolate_freepages start at pageblock boundary</title>
<updated>2014-06-16T20:42:53Z</updated>
<author>
<name>Vlastimil Babka</name>
<email>vbabka@suse.cz</email>
</author>
<published>2014-05-06T19:50:03Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ee7f7615f9119c1e952a0b364ec5862f37a3ada7'/>
<id>urn:sha1:ee7f7615f9119c1e952a0b364ec5862f37a3ada7</id>
<content type='text'>
commit 49e068f0b73dd042c186ffa9b420a9943e90389a upstream.

The compaction freepage scanner implementation in isolate_freepages()
starts by taking the current cc-&gt;free_pfn value as the first pfn.  In a
for loop, it scans from this first pfn to the end of the pageblock, and
then subtracts pageblock_nr_pages from the first pfn to obtain the first
pfn for the next for loop iteration.

This means that when cc-&gt;free_pfn starts at offset X rather than being
aligned on pageblock boundary, the scanner will start at offset X in all
scanned pageblock, ignoring potentially many free pages.  Currently this
can happen when

 a) zone's end pfn is not pageblock aligned, or

 b) through zone-&gt;compact_cached_free_pfn with CONFIG_HOLES_IN_ZONE
    enabled and a hole spanning the beginning of a pageblock

This patch fixes the problem by aligning the initial pfn in
isolate_freepages() to pageblock boundary.  This also permits replacing
the end-of-pageblock alignment within the for loop with a simple
pageblock_nr_pages increment.

Signed-off-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Reported-by: Heesub Shin &lt;heesub.shin@samsung.com&gt;
Acked-by: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Acked-by: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Bartlomiej Zolnierkiewicz &lt;b.zolnierkie@samsung.com&gt;
Cc: Michal Nazarewicz &lt;mina86@mina86.com&gt;
Cc: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Dongjun Shin &lt;d.j.shin@samsung.com&gt;
Cc: Sunghwan Yun &lt;sunghwan.yun@samsung.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm: compaction: detect when scanners meet in isolate_freepages</title>
<updated>2014-06-16T20:42:53Z</updated>
<author>
<name>Vlastimil Babka</name>
<email>vbabka@suse.cz</email>
</author>
<published>2014-01-21T23:51:09Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=87c4475c903e68e494facf6ddf4c2c952b46b936'/>
<id>urn:sha1:87c4475c903e68e494facf6ddf4c2c952b46b936</id>
<content type='text'>
commit 7ed695e069c3cbea5e1fd08f84a04536da91f584 upstream.

Compaction of a zone is finished when the migrate scanner (which begins
at the zone's lowest pfn) meets the free page scanner (which begins at
the zone's highest pfn).  This is detected in compact_zone() and in the
case of direct compaction, the compact_blockskip_flush flag is set so
that kswapd later resets the cached scanner pfn's, and a new compaction
may again start at the zone's borders.

The meeting of the scanners can happen during either scanner's activity.
However, it may currently fail to be detected when it occurs in the free
page scanner, due to two problems.  First, isolate_freepages() keeps
free_pfn at the highest block where it isolated pages from, for the
purposes of not missing the pages that are returned back to allocator
when migration fails.  Second, failing to isolate enough free pages due
to scanners meeting results in -ENOMEM being returned by
migrate_pages(), which makes compact_zone() bail out immediately without
calling compact_finished() that would detect scanners meeting.

This failure to detect scanners meeting might result in repeated
attempts at compaction of a zone that keep starting from the cached
pfn's close to the meeting point, and quickly failing through the
-ENOMEM path, without the cached pfns being reset, over and over.  This
has been observed (through additional tracepoints) in the third phase of
the mmtests stress-highalloc benchmark, where the allocator runs on an
otherwise idle system.  The problem was observed in the DMA32 zone,
which was used as a fallback to the preferred Normal zone, but on the
4GB system it was actually the largest zone.  The problem is even
amplified for such fallback zone - the deferred compaction logic, which
could (after being fixed by a previous patch) reset the cached scanner
pfn's, is only applied to the preferred zone and not for the fallbacks.

The problem in the third phase of the benchmark was further amplified by
commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") which
resulted in a non-deterministic regression of the allocation success
rate from ~85% to ~65%.  This occurs in about half of benchmark runs,
making bisection problematic.  It is unlikely that the commit itself is
buggy, but it should put more pressure on the DMA32 zone during phases 1
and 2, which may leave it more fragmented in phase 3 and expose the bugs
that this patch fixes.

The fix is to make scanners meeting in isolate_freepage() stay that way,
and to check in compact_zone() for scanners meeting when migrate_pages()
returns -ENOMEM.  The result is that compact_finished() also detects
scanners meeting and sets the compact_blockskip_flush flag to make
kswapd reset the scanner pfn's.

The results in stress-highalloc benchmark show that the "regression" by
commit 81c0a2bb515f in phase 3 no longer occurs, and phase 1 and 2
allocation success rates are also significantly improved.

Signed-off-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
</feed>
