<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/include/linux/sched.h, branch v4.9.74</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.9.74</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.9.74'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2017-08-25T00:12:21Z</updated>
<entry>
<title>pids: make task_tgid_nr_ns() safe</title>
<updated>2017-08-25T00:12:21Z</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2017-08-21T15:35:02Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=322cd32623653c0f860d2fc9789fa5b7ac7b09ae'/>
<id>urn:sha1:322cd32623653c0f860d2fc9789fa5b7ac7b09ae</id>
<content type='text'>
commit dd1c1f2f2028a7b851f701fc6a8ebe39dcb95e7c upstream.

This was reported many times, and this was even mentioned in commit
52ee2dfdd4f5 ("pids: refactor vnr/nr_ns helpers to make them safe") but
somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is
not safe because task-&gt;group_leader points to nowhere after the exiting
task passes exit_notify(), rcu_read_lock() can not help.

We really need to change __unhash_process() to nullify group_leader,
parent, and real_parent, but this needs some cleanups.  Until then we
can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and
fix the problem.

Reported-by: Troy Kensinger &lt;tkensinger@google.com&gt;
Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>signal: protect SIGNAL_UNKILLABLE from unintentional clearing.</title>
<updated>2017-08-11T15:49:36Z</updated>
<author>
<name>Jamie Iles</name>
<email>jamie.iles@oracle.com</email>
</author>
<published>2017-01-11T00:57:54Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=916a05b90d8322f38ea133feb3a194617e507c3c'/>
<id>urn:sha1:916a05b90d8322f38ea133feb3a194617e507c3c</id>
<content type='text'>
[ Upstream commit 2d39b3cd34e6d323720d4c61bd714f5ae202c022 ]

Since commit 00cd5c37afd5 ("ptrace: permit ptracing of /sbin/init") we
can now trace init processes.  init is initially protected with
SIGNAL_UNKILLABLE which will prevent fatal signals such as SIGSTOP, but
there are a number of paths during tracing where SIGNAL_UNKILLABLE can
be implicitly cleared.

This can result in init becoming stoppable/killable after tracing.  For
example, running:

  while true; do kill -STOP 1; done &amp;
  strace -p 1

and then stopping strace and the kill loop will result in init being
left in state TASK_STOPPED.  Sending SIGCONT to init will resume it, but
init will now respond to future SIGSTOP signals rather than ignoring
them.

Make sure that when setting SIGNAL_STOP_CONTINUED/SIGNAL_STOP_STOPPED
that we don't clear SIGNAL_UNKILLABLE.

Link: http://lkml.kernel.org/r/20170104122017.25047-1-jamie.iles@oracle.com
Signed-off-by: Jamie Iles &lt;jamie.iles@oracle.com&gt;
Acked-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@verizon.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups</title>
<updated>2017-04-21T07:31:18Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2017-03-16T20:54:24Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f44236a1b05b583862eb8b06dd8412660fdc7fad'/>
<id>urn:sha1:f44236a1b05b583862eb8b06dd8412660fdc7fad</id>
<content type='text'>
commit 77f88796cee819b9c4562b0b6b44691b3b7755b1 upstream.

Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator.  Once the new kthread starts
running, it initializes itself and wakes up the creator.  The creator
then can further configure the kthread and then let it start doing its
job by waking it up.

In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland.  When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity.  This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.

Unfortunately, the cgroup side of protection is racy.  While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.

This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one.  Per-cpu
workqueue workers got caught while being created and ended up with
incorrected CPU affinity breaking concurrency management and sometimes
stalling workqueue execution.

This patch adds task-&gt;no_cgroup_migration which disallows the task to
be migrated by userland.  kthreadd starts with the flag set making
every child kthread start in the root cgroup with migration
disallowed.  The flag is cleared after the kthread finishes
initialization by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.

It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from waiting side.  Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.

v2: Switch to a simpler implementation using a new task_struct bit
    field suggested by Oleg.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Suggested-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reported-and-debugged-by: Chris Mason &lt;clm@fb.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>ptrace: Capture the ptracer's creds not PT_PTRACE_CAP</title>
<updated>2017-01-06T09:40:13Z</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2016-11-15T00:48:07Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=e747b4ae3b6bca205d82e86366e140cdcbfb7731'/>
<id>urn:sha1:e747b4ae3b6bca205d82e86366e140cdcbfb7731</id>
<content type='text'>
commit 64b875f7ac8a5d60a4e191479299e931ee949b67 upstream.

When the flag PT_PTRACE_CAP was added the PTRACE_TRACEME path was
overlooked.  This can result in incorrect behavior when an application
like strace traces an exec of a setuid executable.

Further PT_PTRACE_CAP does not have enough information for making good
security decisions as it does not report which user namespace the
capability is in.  This has already allowed one mistake through
insufficient granulariy.

I found this issue when I was testing another corner case of exec and
discovered that I could not get strace to set PT_PTRACE_CAP even when
running strace as root with a full set of caps.

This change fixes the above issue with strace allowing stracing as
root a setuid executable without disabling setuid.  More fundamentaly
this change allows what is allowable at all times, by using the correct
information in it's decision.

Fixes: 4214e42f96d4 ("v2.4.9.11 -&gt; v2.4.9.12")
Signed-off-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>sched/autogroup: Do not use autogroup-&gt;tg in zombie threads</title>
<updated>2016-11-22T11:33:43Z</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2016-11-14T18:46:12Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=8e5bfa8c1f8471aa4a2d30be631ef2b50e10abaf'/>
<id>urn:sha1:8e5bfa8c1f8471aa4a2d30be631ef2b50e10abaf</id>
<content type='text'>
Exactly because for_each_thread() in autogroup_move_group() can't see it
and update its -&gt;sched_task_group before _put() and possibly free().

So the exiting task needs another sched_move_task() before exit_notify()
and we need to re-introduce the PF_EXITING (or similar) check removed by
the previous change for another reason.

Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: hartsjc@redhat.com
Cc: vbendel@redhat.com
Cc: vlovejoy@redhat.com
Link: http://lkml.kernel.org/r/20161114184612.GA15968@redhat.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>thp: reduce usage of huge zero page's atomic counter</title>
<updated>2016-10-08T01:46:28Z</updated>
<author>
<name>Aaron Lu</name>
<email>aaron.lu@intel.com</email>
</author>
<published>2016-10-08T00:00:08Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6fcb52a56ff60d240f06296b12827e7f20d45f63'/>
<id>urn:sha1:6fcb52a56ff60d240f06296b12827e7f20d45f63</id>
<content type='text'>
The global zero page is used to satisfy an anonymous read fault.  If
THP(Transparent HugePage) is enabled then the global huge zero page is
used.  The global huge zero page uses an atomic counter for reference
counting and is allocated/freed dynamically according to its counter
value.

CPU time spent on that counter will greatly increase if there are a lot
of processes doing anonymous read faults.  This patch proposes a way to
reduce the access to the global counter so that the CPU load can be
reduced accordingly.

To do this, a new flag of the mm_struct is introduced:
MMF_USED_HUGE_ZERO_PAGE.  With this flag, the process only need to touch
the global counter in two cases:

 1 The first time it uses the global huge zero page;
 2 The time when mm_user of its mm_struct reaches zero.

Note that right now, the huge zero page is eligible to be freed as soon
as its last use goes away.  With this patch, the page will not be
eligible to be freed until the exit of the last process from which it
was ever used.

And with the use of mm_user, the kthread is not eligible to use huge
zero page either.  Since no kthread is using huge zero page today, there
is no difference after applying this patch.  But if that is not desired,
I can change it to when mm_count reaches zero.

Case used for test on Haswell EP:

  usemem -n 72 --readonly -j 0x200000 100G

Which spawns 72 processes and each will mmap 100G anonymous space and
then do read only access to that space sequentially with a step of 2MB.

  CPU cycles from perf report for base commit:
      54.03%  usemem   [kernel.kallsyms]   [k] get_huge_zero_page
  CPU cycles from perf report for this commit:
       0.11%  usemem   [kernel.kallsyms]   [k] mm_get_huge_zero_page

Performance(throughput) of the workload for base commit: 1784430792
Performance(throughput) of the workload for this commit: 4726928591
164% increase.

Runtime of the workload for base commit: 707592 us
Runtime of the workload for this commit: 303970 us
50% drop.

Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
Signed-off-by: Aaron Lu &lt;aaron.lu@intel.com&gt;
Cc: Sergey Senozhatsky &lt;sergey.senozhatsky@gmail.com&gt;
Cc: "Kirill A. Shutemov" &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Dave Hansen &lt;dave.hansen@intel.com&gt;
Cc: Tim Chen &lt;tim.c.chen@linux.intel.com&gt;
Cc: Huang Ying &lt;ying.huang@intel.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Jerome Marchand &lt;jmarchan@redhat.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Ebru Akagunduz &lt;ebru.akagunduz@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: make sure that kthreads will not refault oom reaped memory</title>
<updated>2016-10-08T01:46:28Z</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.com</email>
</author>
<published>2016-10-07T23:59:06Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3f70dc38cec2ad6e5355f80c4c7a15a3f7e97a19'/>
<id>urn:sha1:3f70dc38cec2ad6e5355f80c4c7a15a3f7e97a19</id>
<content type='text'>
There are only few use_mm() users in the kernel right now.  Most of them
write to the target memory but vhost driver relies on
copy_from_user/get_user from a kernel thread context.  This makes it
impossible to reap the memory of an oom victim which shares the mm with
the vhost kernel thread because it could see a zero page unexpectedly
and theoretically make an incorrect decision visible outside of the
killed task context.

To quote Michael S. Tsirkin:
: Getting an error from __get_user and friends is handled gracefully.
: Getting zero instead of a real value will cause userspace
: memory corruption.

The vhost kernel thread is bound to an open fd of the vhost device which
is not tight to the mm owner life cycle in general.  The device fd can
be inherited or passed over to another process which means that we
really have to be careful about unexpected memory corruption because
unlike for normal oom victims the result will be visible outside of the
oom victim context.

Make sure that no kthread context (users of use_mm) can ever see
corrupted data because of the oom reaper and hook into the page fault
path by checking MMF_UNSTABLE mm flag.  __oom_reap_task_mm will set the
flag before it starts unmapping the address space while the flag is
checked after the page fault has been handled.  If the flag is set then
SIGBUS is triggered so any g-u-p user will get a error code.

Regular tasks do not need this protection because all which share the mm
are killed when the mm is reaped and so the corruption will not outlive
them.

This patch shouldn't have any visible effect at this moment because the
OOM killer doesn't invoke oom reaper for tasks with mm shared with
kthreads yet.

Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko &lt;mhocko@suse.com&gt;
Acked-by: "Michael S. Tsirkin" &lt;mst@redhat.com&gt;
Cc: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Vladimir Davydov &lt;vdavydov@parallels.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm, oom: get rid of signal_struct::oom_victims</title>
<updated>2016-10-08T01:46:27Z</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.com</email>
</author>
<published>2016-10-07T23:58:57Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=862e3073b3eed13f17bd6be6ca6052db15c0b728'/>
<id>urn:sha1:862e3073b3eed13f17bd6be6ca6052db15c0b728</id>
<content type='text'>
After "oom: keep mm of the killed task available" we can safely detect
an oom victim by checking task-&gt;signal-&gt;oom_mm so we do not need the
signal_struct counter anymore so let's get rid of it.

This alone wouldn't be sufficient for nommu archs because
exit_oom_victim doesn't hide the process from the oom killer anymore.
We can, however, mark the mm with a MMF flag in __mmput.  We can reuse
MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.

Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Vladimir Davydov &lt;vdavydov@parallels.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>kernel, oom: fix potential pgd_lock deadlock from __mmdrop</title>
<updated>2016-10-08T01:46:27Z</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.com</email>
</author>
<published>2016-10-07T23:58:54Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=7283094ec3db318e87ec9e31cf75f136ac2a4dd3'/>
<id>urn:sha1:7283094ec3db318e87ec9e31cf75f136ac2a4dd3</id>
<content type='text'>
Lockdep complains that __mmdrop is not safe from the softirq context:

  =================================
  [ INFO: inconsistent lock state ]
  4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949 Tainted: G        W
  ---------------------------------
  inconsistent {SOFTIRQ-ON-W} -&gt; {IN-SOFTIRQ-W} usage.
  swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
   (pgd_lock){+.?...}, at: pgd_free+0x19/0x6b
  {SOFTIRQ-ON-W} state was registered at:
     __lock_acquire+0xa06/0x196e
     lock_acquire+0x139/0x1e1
     _raw_spin_lock+0x32/0x41
     __change_page_attr_set_clr+0x2a5/0xacd
     change_page_attr_set_clr+0x16f/0x32c
     set_memory_nx+0x37/0x3a
     free_init_pages+0x9e/0xc7
     alternative_instructions+0xa2/0xb3
     check_bugs+0xe/0x2d
     start_kernel+0x3ce/0x3ea
     x86_64_start_reservations+0x2a/0x2c
     x86_64_start_kernel+0x17a/0x18d
  irq event stamp: 105916
  hardirqs last  enabled at (105916): free_hot_cold_page+0x37e/0x390
  hardirqs last disabled at (105915): free_hot_cold_page+0x2c1/0x390
  softirqs last  enabled at (105878): _local_bh_enable+0x42/0x44
  softirqs last disabled at (105879): irq_exit+0x6f/0xd1

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(pgd_lock);
    &lt;Interrupt&gt;
      lock(pgd_lock);

   *** DEADLOCK ***

  1 lock held by swapper/1/0:
   #0:  (rcu_callback){......}, at: rcu_process_callbacks+0x390/0x800

  stack backtrace:
  CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
  Call Trace:
   &lt;IRQ&gt;
    print_usage_bug.part.25+0x259/0x268
    mark_lock+0x381/0x567
    __lock_acquire+0x993/0x196e
    lock_acquire+0x139/0x1e1
    _raw_spin_lock+0x32/0x41
    pgd_free+0x19/0x6b
    __mmdrop+0x25/0xb9
    __put_task_struct+0x103/0x11e
    delayed_put_task_struct+0x157/0x15e
    rcu_process_callbacks+0x660/0x800
    __do_softirq+0x1ec/0x4d5
    irq_exit+0x6f/0xd1
    smp_apic_timer_interrupt+0x42/0x4d
    apic_timer_interrupt+0x8e/0xa0
   &lt;EOI&gt;
    arch_cpu_idle+0xf/0x11
    default_idle_call+0x32/0x34
    cpu_startup_entry+0x20c/0x399
    start_secondary+0xfe/0x101

More over commit a79e53d85683 ("x86/mm: Fix pgd_lock deadlock") was
explicit about pgd_lock not to be called from the irq context.  This
means that __mmdrop called from free_signal_struct has to be postponed
to a user context.  We already have a similar mechanism for mmput_async
so we can use it here as well.  This is safe because mm_count is pinned
by mm_users.

This fixes bug introduced by "oom: keep mm of the killed task available"

Link: http://lkml.kernel.org/r/1472119394-11342-5-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Vladimir Davydov &lt;vdavydov@parallels.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>oom: keep mm of the killed task available</title>
<updated>2016-10-08T01:46:27Z</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.com</email>
</author>
<published>2016-10-07T23:58:51Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=26db62f179d112d345031e14926a4cda9cd40d6e'/>
<id>urn:sha1:26db62f179d112d345031e14926a4cda9cd40d6e</id>
<content type='text'>
oom_reap_task has to call exit_oom_victim in order to make sure that the
oom vicim will not block the oom killer for ever.  This is, however,
opening new problems (e.g oom_killer_disable exclusion - see commit
74070542099c ("oom, suspend: fix oom_reaper vs.  oom_killer_disable
race")).  exit_oom_victim should be only called from the victim's
context ideally.

One way to achieve this would be to rely on per mm_struct flags.  We
already have MMF_OOM_REAPED to hide a task from the oom killer since
"mm, oom: hide mm which is shared with kthread or global init". The
problem is that the exit path:

  do_exit
    exit_mm
      tsk-&gt;mm = NULL;
      mmput
        __mmput
      exit_oom_victim

doesn't guarantee that exit_oom_victim will get called in a bounded
amount of time.  At least exit_aio depends on IO which might get blocked
due to lack of memory and who knows what else is lurking there.

This patch takes a different approach.  We remember tsk-&gt;mm into the
signal_struct and bind it to the signal struct life time for all oom
victims.  __oom_reap_task_mm as well as oom_scan_process_thread do not
have to rely on find_lock_task_mm anymore and they will have a reliable
reference to the mm struct.  As a result all the oom specific
communication inside the OOM killer can be done via tsk-&gt;signal-&gt;oom_mm.

Increasing the signal_struct for something as unlikely as the oom killer
is far from ideal but this approach will make the code much more
reasonable and long term we even might want to move task-&gt;mm into the
signal_struct anyway.  In the next step we might want to make the oom
killer exclusion and access to memory reserves completely independent
which would be also nice.

Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Vladimir Davydov &lt;vdavydov@parallels.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
