<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/mm, branch v3.16.66</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v3.16.66</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v3.16.66'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2019-05-02T20:42:04Z</updated>
<entry>
<title>coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping</title>
<updated>2019-05-02T20:42:04Z</updated>
<author>
<name>Andrea Arcangeli</name>
<email>aarcange@redhat.com</email>
</author>
<published>2019-04-19T00:50:52Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a301e6a651037c11d2d9932a35fb56a04eedba8c'/>
<id>urn:sha1:a301e6a651037c11d2d9932a35fb56a04eedba8c</id>
<content type='text'>
commit 04f5866e41fb70690e28397487d8bd8eea7d712a upstream.

The core dumping code has always run without holding the mmap_sem for
writing, even though that is the only way to ensure that the entire vma
layout will not change from under it.  Only using some signal
serialization on the processes belonging to the mm is not nearly enough.
This was pointed out earlier.  For example in Hugh's post from Jul 2017:

  https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils

  "Not strictly relevant here, but a related note: I was very surprised
   to discover, only quite recently, how handle_mm_fault() may be called
   without down_read(mmap_sem) - when core dumping. That seems a
   misguided optimization to me, which would also be nice to correct"

In particular, because growsdown and growsup can move vm_start/vm_end,
the various loops the core dump does around the vma will not be
consistent if page faults can happen concurrently.

Pretty much all users calling mmget_not_zero()/get_task_mm() and then
taking the mmap_sem had the potential to introduce unexpected side
effects in the core dumping code.

Adding mmap_sem for writing around the -&gt;core_dump invocation is a
viable long term fix, but it requires removing all copy user and page
faults and replacing them with get_dump_page() for all binary formats,
which is not suitable as a short term fix.

For the time being this solution manually covers the places that can
confuse the core dump either by altering the vma layout or the vma flags
while it runs.  Once -&gt;core_dump runs under mmap_sem for writing the
function mmget_still_valid() can be dropped.

Allowing mmap_sem protected sections to run in parallel with the
coredump provides some minor parallelism advantage to the swapoff code
(which seems to be safe enough by never mangling any vma field and can
keep doing swapins in parallel to the core dumping) and to some other
corner case.

In order to facilitate the backporting I added "Fixes: 86039bd3b4e6";
however, the side effect of this same race condition in /proc/pid/mem
should be reproducible since before 2.6.12-rc2, so I couldn't add any
other "Fixes:" because there's no hash beyond the git genesis commit.

Because find_extend_vma() is the only location outside of the process
context that could modify the "mm" structures under mmap_sem for
reading, by adding the mmget_still_valid() check to it, all other cases
that take the mmap_sem for reading don't need the new check after
mmget_not_zero()/get_task_mm().  The expand_stack() in page fault
context also doesn't need the new check, because all tasks under core
dumping are frozen.

Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Reported-by: Jann Horn &lt;jannh@google.com&gt;
Suggested-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Acked-by: Peter Xu &lt;peterx@redhat.com&gt;
Reviewed-by: Mike Rapoport &lt;rppt@linux.ibm.com&gt;
Reviewed-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Reviewed-by: Jann Horn &lt;jannh@google.com&gt;
Acked-by: Jason Gunthorpe &lt;jgg@mellanox.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
[bwh: Backported to 3.16:
 - Drop changes in Infiniband and userfaultfd
 - In clear_refs_write(), use up_read() as we never upgrade to a write lock
 - Adjust filename, context]
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
</entry>
<entry>
<title>mm: enforce min addr even if capable() in expand_downwards()</title>
<updated>2019-05-02T20:42:01Z</updated>
<author>
<name>Jann Horn</name>
<email>jannh@google.com</email>
</author>
<published>2019-02-27T20:29:52Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c90030281dc8b6a25ac8850e98e15877f80b8d66'/>
<id>urn:sha1:c90030281dc8b6a25ac8850e98e15877f80b8d66</id>
<content type='text'>
commit 0a1d52994d440e21def1c2174932410b4f2a98a1 upstream.

security_mmap_addr() does a capability check with current_cred(), but
we can reach this code from contexts like a VFS write handler where
current_cred() must not be used.

This can be abused on systems without SMAP to make NULL pointer
dereferences exploitable again.

Fixes: 8869477a49c3 ("security: protect from stack expansion into low vm addresses")
Signed-off-by: Jann Horn &lt;jannh@google.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
</entry>
<entry>
<title>mm/mmap.c: expand_downwards: don't require the gap if !vm_prev</title>
<updated>2019-05-02T20:42:01Z</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2017-07-10T22:49:54Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=120d66394f05ec50a018168850a8db6518ea2d9b'/>
<id>urn:sha1:120d66394f05ec50a018168850a8db6518ea2d9b</id>
<content type='text'>
commit 32e4e6d5cbb0c0e427391635991fe65e17797af8 upstream.

expand_stack(vma) fails if address &lt; stack_guard_gap even if there is no
vma-&gt;vm_prev.  I don't think this makes sense, and we didn't do this
before the recent commit 1be7107fbe18 ("mm: larger stack guard gap,
between vmas").

We do not need a gap in this case, any address is fine as long as
security_mmap_addr() doesn't object.

This also simplifies the code: we know that address &gt;= prev-&gt;vm_end and
thus underflow is not possible.

Link: http://lkml.kernel.org/r/20170628175258.GA24881@redhat.com
Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Larry Woodman &lt;lwoodman@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
</entry>
<entry>
<title>tmpfs: fix uninitialized return value in shmem_link</title>
<updated>2019-05-02T20:42:00Z</updated>
<author>
<name>Darrick J. Wong</name>
<email>darrick.wong@oracle.com</email>
</author>
<published>2019-02-23T06:35:32Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6a0614c5b8951b9e41ffe2d32533a41a472bddb9'/>
<id>urn:sha1:6a0614c5b8951b9e41ffe2d32533a41a472bddb9</id>
<content type='text'>
commit 29b00e609960ae0fcff382f4c7079dd0874a5311 upstream.

When we made the shmem_reserve_inode call in shmem_link conditional, we
forgot to update the declaration for ret so that it always has a known
value.  Dan Carpenter pointed out this deficiency in the original patch.

Fixes: 1062af920c07 ("tmpfs: fix link accounting when a tmpfile is linked in")
Reported-by: Dan Carpenter &lt;dan.carpenter@oracle.com&gt;
Signed-off-by: Darrick J. Wong &lt;darrick.wong@oracle.com&gt;
Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Matej Kupljen &lt;matej.kupljen@gmail.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
</entry>
<entry>
<title>tmpfs: fix link accounting when a tmpfile is linked in</title>
<updated>2019-05-02T20:41:57Z</updated>
<author>
<name>Darrick J. Wong</name>
<email>darrick.wong@oracle.com</email>
</author>
<published>2019-02-21T16:48:09Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=17b676ec70edf5cafdd2794c88f9d197a49fb30b'/>
<id>urn:sha1:17b676ec70edf5cafdd2794c88f9d197a49fb30b</id>
<content type='text'>
commit 1062af920c07f5b54cf5060fde3339da6df0cf6b upstream.

tmpfs has a peculiarity of accounting hard links as if they were
separate inodes: so that when the number of inodes is limited, as it is
by default, a user cannot soak up an unlimited amount of unreclaimable
dcache memory just by repeatedly linking a file.

But when v3.11 added O_TMPFILE, and the ability to use linkat() on the
fd, we missed accommodating this new case in tmpfs: "df -i" shows that
an extra "inode" remains accounted after the file is unlinked and the fd
closed and the actual inode evicted.  If a user repeatedly links
tmpfiles into a tmpfs, the limit will be hit (ENOSPC) even after they
are deleted.

Just skip the extra reservation from shmem_link() in this case: there's
a sense in which this first link of a tmpfile is then cheaper than a
hard link of another file, but the accounting works out, and there's
still good limiting, so no need to do anything more complicated.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1902182134370.7035@eggly.anvils
Fixes: f4e0c30c191 ("allow the temp files created by open() to be linked to")
Signed-off-by: Darrick J. Wong &lt;darrick.wong@oracle.com&gt;
Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Reported-by: Matej Kupljen &lt;matej.kupljen@gmail.com&gt;
Acked-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
</entry>
<entry>
<title>mm: migrate: don't rely on __PageMovable() of newpage after unlocking it</title>
<updated>2019-05-02T20:41:34Z</updated>
<author>
<name>David Hildenbrand</name>
<email>david@redhat.com</email>
</author>
<published>2019-02-01T22:21:19Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=103d1a1c453c51d3c8cb0e7cba9ed40a5410b765'/>
<id>urn:sha1:103d1a1c453c51d3c8cb0e7cba9ed40a5410b765</id>
<content type='text'>
commit e0a352fabce61f730341d119fbedf71ffdb8663f upstream.

We had a race in the old balloon compaction code before b1123ea6d3b3
("mm: balloon: use general non-lru movable page feature") refactored it
that became visible after backporting 195a8c43e93d ("virtio-balloon:
deflate via a page list") without the refactoring.

The bug existed from commit d6d86c0a7f8d ("mm/balloon_compaction:
redesign ballooned pages management") till b1123ea6d3b3 ("mm: balloon:
use general non-lru movable page feature").  d6d86c0a7f8d
("mm/balloon_compaction: redesign ballooned pages management") was
backported to 3.12, so the broken kernels are stable kernels [3.12 -
4.7].

There was a subtle race between dropping the page lock of the newpage in
__unmap_and_move() and checking for __is_movable_balloon_page(newpage).

Just after dropping this page lock, virtio-balloon could go ahead and
deflate the newpage, effectively dequeueing it and clearing PageBalloon,
in turn making __is_movable_balloon_page(newpage) fail.

This resulted in dropping the reference of the newpage via
putback_lru_page(newpage) instead of put_page(newpage), leading to
page-&gt;lru getting modified and a !LRU page ending up in the LRU lists.
With 195a8c43e93d ("virtio-balloon: deflate via a page list")
backported, one would suddenly get corrupted lists in
release_pages_balloon():

- WARNING: CPU: 13 PID: 6586 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
- list_del corruption. prev-&gt;next should be ffffe253961090a0, but was dead000000000100

Nowadays this race is no longer possible, but it is hidden behind very
ugly handling of __ClearPageMovable() and __PageMovable().

__ClearPageMovable() will not make __PageMovable() fail, only
PageMovable().  So the new check (__PageMovable(newpage)) will still
hold even after newpage was dequeued by virtio-balloon.

If anybody ever changed that special handling, the BUG would be
introduced again.  So instead, make it explicit and use the information
of the original isolated page before migration.

This patch can be backported fairly easily to stable kernels (in contrast
to the refactoring).

Link: http://lkml.kernel.org/r/20190129233217.10747-1-david@redhat.com
Fixes: d6d86c0a7f8d ("mm/balloon_compaction: redesign ballooned pages management")
Signed-off-by: David Hildenbrand &lt;david@redhat.com&gt;
Reported-by: Vratislav Bendel &lt;vbendel@redhat.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Acked-by: Rafael Aquini &lt;aquini@redhat.com&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: "Kirill A. Shutemov" &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Dominik Brodowski &lt;linux@dominikbrodowski.net&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Vratislav Bendel &lt;vbendel@redhat.com&gt;
Cc: Rafael Aquini &lt;aquini@redhat.com&gt;
Cc: Konstantin Khlebnikov &lt;k.khlebnikov@samsung.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
[bwh: Backported to 3.16:
 - Add the is_lru flag variable to unmap_and_move()
 - Keep using __is_movable_balloon_page() instead of __PageMovable()
 - Adjust context]
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
</entry>
<entry>
<title>mm: hwpoison: use do_send_sig_info() instead of force_sig()</title>
<updated>2019-05-02T20:41:34Z</updated>
<author>
<name>Naoya Horiguchi</name>
<email>n-horiguchi@ah.jp.nec.com</email>
</author>
<published>2019-02-01T22:21:08Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=df3256a2972eb75429255d1ce675abde1f843904'/>
<id>urn:sha1:df3256a2972eb75429255d1ce675abde1f843904</id>
<content type='text'>
commit 6376360ecbe525a9c17b3d081dfd88ba3e4ed65b upstream.

Currently memory_failure() races with process exit, which can crash
the kernel with a NULL pointer dereference.

The root cause is that memory_failure() uses force_sig() to forcibly
kill asynchronous (meaning not in the current context) processes.  As
discussed in thread https://lkml.org/lkml/2010/6/8/236 years ago for OOM
fixes, this is not the right thing to do.  OOM solves this issue by using
do_send_sig_info() as done in commit d2d393099de2 ("signal:
oom_kill_task: use SEND_SIG_FORCED instead of force_sig()"), so this
patch is suggesting to do the same for hwpoison.  do_send_sig_info()
properly takes siglock with lock_task_sighand(), so it is free from
the reported race.

I confirmed that the reported bug reproduces when some delay is inserted
in kill_procs(), and it never reproduces with this patch.

Note that memory_failure() can send another type of signal using
force_sig_mceerr(), and the reported race shouldn't happen on it because
force_sig_mceerr() is called only for synchronous processes (i.e.
BUS_MCEERR_AR happens only when some process accesses the corrupted
memory.)

Link: http://lkml.kernel.org/r/20190116093046.GA29835@hori1.linux.bs1.fc.nec.co.jp
Signed-off-by: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Reported-by: Jane Chu &lt;jane.chu@oracle.com&gt;
Reviewed-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Reviewed-by: William Kucharski &lt;william.kucharski@oracle.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
</entry>
<entry>
<title>mm, oom: fix use-after-free in oom_kill_process</title>
<updated>2019-05-02T20:41:34Z</updated>
<author>
<name>Shakeel Butt</name>
<email>shakeelb@google.com</email>
</author>
<published>2019-02-01T22:20:54Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4d6795e7c32b1701f6669bdfafd0d15534533bcc'/>
<id>urn:sha1:4d6795e7c32b1701f6669bdfafd0d15534533bcc</id>
<content type='text'>
commit cefc7ef3c87d02fc9307835868ff721ea12cc597 upstream.

Syzbot instance running on upstream kernel found a use-after-free bug in
oom_kill_process.  On further inspection it seems like the process
selected to be oom-killed has exited even before reaching
read_lock(&amp;tasklist_lock) in oom_kill_process().  More specifically the
tsk-&gt;usage is 1 which is due to get_task_struct() in oom_evaluate_task()
and the put_task_struct within for_each_thread() frees the tsk and
for_each_thread() tries to access the tsk.  The easiest fix is to do
get/put across the for_each_thread() on the selected task.

Now the next question is should we continue with the oom-kill as the
previously selected task has exited? However before adding more
complexity and heuristics, let's answer why we even look at the children
of oom-kill selected task? The select_bad_process() has already selected
the worst process in the system/memcg.  Due to race, the selected
process might not be the worst at the kill time but does that matter?
The userspace can use the oom_score_adj interface to prefer children to
be killed before the parent.  I looked at the history but it seems like
this has been there since before git history.

Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com
Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com
Fixes: 6b0c81b3be11 ("mm, oom: reduce dependency on tasklist_lock")
Signed-off-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Reviewed-by: Roman Gushchin &lt;guro@fb.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Tetsuo Handa &lt;penguin-kernel@i-love.sakura.ne.jp&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
</entry>
<entry>
<title>hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined</title>
<updated>2019-04-04T15:14:09Z</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.com</email>
</author>
<published>2018-12-28T08:38:01Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=85ef35ab972b7484f41c3bb2bbc79de212e19129'/>
<id>urn:sha1:85ef35ab972b7484f41c3bb2bbc79de212e19129</id>
<content type='text'>
commit b15c87263a69272423771118c653e9a1d0672caa upstream.

We have received a bug report that an injected MCE about faulty memory
prevents memory offline from succeeding on a 4.4-based kernel.  The
underlying reason was that the HWPoison page has an elevated reference
count and the migration keeps failing.  There are two problems with that.
First of all, it is dubious to migrate the poisoned page because we know
that accessing that memory may fail.  Secondly, it doesn't make any sense
to migrate potentially broken content and preserve the memory corruption
over to a new location.

Oscar has found out that 4.4 and the current upstream kernels behave
slightly differently with his simple testcase:

===

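#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;unistd.h&gt;

/* Round addr up to the next page boundary (assumes 4096-byte pages). */
#define PAGE_ALIGN(addr) (((addr) + 4095UL) &amp; ~4095UL)

int main(void)
{
        int ret;
        int i;
        int fd;
        char *array = malloc(4096);
        char *array_locked = malloc(4096);

        fd = open("/tmp/data", O_RDONLY);
        read(fd, array, 4095);

        for (i = 0; i &lt; 4096; i++)
                array_locked[i] = 'd';

        /* Lock one page; sizeof(array_locked) would only be the pointer size. */
        ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), 4096);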
        if (ret)
                perror("mlock");

        sleep (20);

        ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON);
        if (ret)
                perror("madvise");

        for (i = 0; i &lt; 4096; i++)
                array_locked[i] = 'd';

        return 0;
}
===

+ offline this memory.

In 4.4 kernels he saw the hwpoisoned page being returned to the LRU
list:
kernel:  [&lt;ffffffff81019ac9&gt;] dump_trace+0x59/0x340
kernel:  [&lt;ffffffff81019e9a&gt;] show_stack_log_lvl+0xea/0x170
kernel:  [&lt;ffffffff8101ac71&gt;] show_stack+0x21/0x40
kernel:  [&lt;ffffffff8132bb90&gt;] dump_stack+0x5c/0x7c
kernel:  [&lt;ffffffff810815a1&gt;] warn_slowpath_common+0x81/0xb0
kernel:  [&lt;ffffffff811a275c&gt;] __pagevec_lru_add_fn+0x14c/0x160
kernel:  [&lt;ffffffff811a2eed&gt;] pagevec_lru_move_fn+0xad/0x100
kernel:  [&lt;ffffffff811a334c&gt;] __lru_cache_add+0x6c/0xb0
kernel:  [&lt;ffffffff81195236&gt;] add_to_page_cache_lru+0x46/0x70
kernel:  [&lt;ffffffffa02b4373&gt;] extent_readpages+0xc3/0x1a0 [btrfs]
kernel:  [&lt;ffffffff811a16d7&gt;] __do_page_cache_readahead+0x177/0x200
kernel:  [&lt;ffffffff811a18c8&gt;] ondemand_readahead+0x168/0x2a0
kernel:  [&lt;ffffffff8119673f&gt;] generic_file_read_iter+0x41f/0x660
kernel:  [&lt;ffffffff8120e50d&gt;] __vfs_read+0xcd/0x140
kernel:  [&lt;ffffffff8120e9ea&gt;] vfs_read+0x7a/0x120
kernel:  [&lt;ffffffff8121404b&gt;] kernel_read+0x3b/0x50
kernel:  [&lt;ffffffff81215c80&gt;] do_execveat_common.isra.29+0x490/0x6f0
kernel:  [&lt;ffffffff81215f08&gt;] do_execve+0x28/0x30
kernel:  [&lt;ffffffff81095ddb&gt;] call_usermodehelper_exec_async+0xfb/0x130
kernel:  [&lt;ffffffff8161c045&gt;] ret_from_fork+0x55/0x80

And that later confuses the hotremove path because an LRU page is
attempted to be migrated and that fails due to an elevated reference
count.  It is quite possible that the reuse of the HWPoisoned page is some
kind of fixed race condition but I am not really sure about that.

With the upstream kernel the failure is slightly different.  The page
doesn't seem to have LRU bit set but isolate_movable_page simply fails and
do_migrate_range simply puts all the isolated pages back to LRU and
therefore no progress is made and scan_movable_pages finds same set of
pages over and over again.

Fix both cases by explicitly checking for HWPoisoned pages before we even
try to get a reference on the page, and try to unmap the page if it is
still mapped.  As explained by Naoya:

: Hwpoison code never unmapped those for no big reason because
: Ksm pages never dominate memory, so we simply didn't have strong
: motivation to save the pages.

Also put WARN_ON(PageLRU) in case there is a race and we can hit LRU
HWPoison pages which shouldn't happen but I couldn't convince myself about
that.  Naoya has noted the following:

: Theoretically no such guarantee, because try_to_unmap() doesn't have a
: guarantee of success and then memory_failure() returns immediately
: when hwpoison_user_mappings fails.
: Or the following code (comes after hwpoison_user_mappings block) also implies
: that the target page can still have PageLRU flag.
:
:         /*
:          * Torn down by someone else?
:          */
:         if (PageLRU(p) &amp;&amp; !PageSwapCache(p) &amp;&amp; p-&gt;mapping == NULL) {
:                 action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
:                 res = -EBUSY;
:                 goto out;
:         }
:
: So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
: current version of your patch.

Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org
Signed-off-by: Michal Hocko &lt;mhocko@suse.com&gt;
Reviewed-by: Oscar Salvador &lt;osalvador@suse.com&gt;
Debugged-by: Oscar Salvador &lt;osalvador@suse.com&gt;
Tested-by: Oscar Salvador &lt;osalvador@suse.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Acked-by: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
</entry>
<entry>
<title>mm, memory_hotplug: do not clear numa_node association after hot_remove</title>
<updated>2019-04-04T15:14:08Z</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.com</email>
</author>
<published>2018-12-28T08:34:13Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=21de2791382684e0b2292a5f55e796d0641db1b9'/>
<id>urn:sha1:21de2791382684e0b2292a5f55e796d0641db1b9</id>
<content type='text'>
commit 46a3679b8190101e4ebdfe252ef79e6150a4f2ac upstream.

Per-cpu numa_node provides a default node for each possible cpu.  The
association gets initialized during the boot when the architecture
specific code explores cpu-&gt;NUMA affinity.  When the whole NUMA node is
removed, though, we are clearing this association:

try_offline_node
  check_and_unmap_cpu_on_node
    unmap_cpu_on_node
      numa_clear_node
        numa_set_node(cpu, NUMA_NO_NODE)

This means that whoever calls cpu_to_node for a cpu associated with such a
node will get NUMA_NO_NODE.  This is problematic for two reasons.  First
it is fragile because __alloc_pages_node would simply blow up on an
out-of-bound access.  We have encountered this when loading kvm module

  BUG: unable to handle kernel paging request at 00000000000021c0
  IP: __alloc_pages_nodemask+0x93/0xb70
  PGD 800000ffe853e067 PUD 7336bbc067 PMD 0
  Oops: 0000 [#1] SMP
  [...]
  CPU: 88 PID: 1223749 Comm: modprobe Tainted: G        W          4.4.156-94.64-default #1
  RIP: __alloc_pages_nodemask+0x93/0xb70
  RSP: 0018:ffff887354493b40  EFLAGS: 00010202
  RAX: 00000000000021c0 RBX: 0000000000000000 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 0000000000000002 RDI: 00000000014000c0
  RBP: 00000000014000c0 R08: ffffffffffffffff R09: 0000000000000000
  R10: ffff88fffc89e790 R11: 0000000000014000 R12: 0000000000000101
  R13: ffffffffa0772cd4 R14: ffffffffa0769ac0 R15: 0000000000000000
  FS:  00007fdf2f2f1700(0000) GS:ffff88fffc880000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000000021c0 CR3: 00000077205ee000 CR4: 0000000000360670
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
    alloc_vmcs_cpu+0x3d/0x90 [kvm_intel]
    hardware_setup+0x781/0x849 [kvm_intel]
    kvm_arch_hardware_setup+0x28/0x190 [kvm]
    kvm_init+0x7c/0x2d0 [kvm]
    vmx_init+0x1e/0x32c [kvm_intel]
    do_one_initcall+0xca/0x1f0
    do_init_module+0x5a/0x1d7
    load_module+0x1393/0x1c90
    SYSC_finit_module+0x70/0xa0
    entry_SYSCALL_64_fastpath+0x1e/0xb7
  DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb7

on an older kernel but the code is basically the same in the current Linus
tree as well.  alloc_vmcs_cpu could use alloc_pages_nodemask which would
recognize NUMA_NO_NODE and use alloc_pages_node which would translate it
to numa_mem_id but that is wrong as well because it would use a cpu
affinity of the local CPU which might be quite far from the original node.
It is also reasonable to expect that cpu_to_node will provide a sane
value and there might be many more callers like that.

The second problem is that __register_one_node relies on cpu_to_node to
properly associate cpus back to the node when it is onlined.  We do not
want to lose that link as there is no arch independent way to get it from
the early boot time AFAICS.

Drop the whole check_and_unmap_cpu_on_node machinery and keep the
association to fix both issues.  The NODE_DATA(nid) is not deallocated so
it will stay in place and if anybody wants to allocate from that node then
a fallback node will be used.

Thanks to Vlastimil Babka for his live system debugging skills that helped
debug the issue.

Link: http://lkml.kernel.org/r/20181108100413.966-1-mhocko@kernel.org
Fixes: e13fe8695c57 ("cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node")
Signed-off-by: Michal Hocko &lt;mhocko@suse.com&gt;
Debugged-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Reported-by: Miroslav Benes &lt;mbenes@suse.cz&gt;
Acked-by: Anshuman Khandual &lt;anshuman.khandual@arm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
</entry>
</feed>
