<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/kernel/sched, branch v3.18.74</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v3.18.74</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v3.18.74'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2017-07-21T06:12:23Z</updated>
<entry>
<title>sched/topology: Fix overlapping sched_group_mask</title>
<updated>2017-07-21T06:12:23Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2017-04-25T12:00:49Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=8f70f49650b428378320c12dee41122d9f9e258e'/>
<id>urn:sha1:8f70f49650b428378320c12dee41122d9f9e258e</id>
<content type='text'>
commit 73bb059f9b8a00c5e1bf2f7ca83138c05d05e600 upstream.

The point of sched_group_mask is to select those CPUs from
sched_group_cpus that can actually arrive at this balance domain.

The current code gets it wrong, as can be readily demonstrated with a
topology like:

  node   0   1   2   3
    0:  10  20  30  20
    1:  20  10  20  30
    2:  30  20  10  20
    3:  20  30  20  10

Where (for example) domain 1 on CPU1 ends up with a mask that includes
CPU0:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 (mask: 1), 2, 0
  []   domain 1: span 0-3 level NUMA
  []    groups: 0-2 (mask: 0-2) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)

This causes sched_balance_cpu() to compute the wrong CPU and
consequently should_we_balance() will terminate early resulting in
missed load-balance opportunities.

The fixed topology looks like:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 (mask: 1), 2, 0
  []   domain 1: span 0-3 level NUMA
  []    groups: 0-2 (mask: 1) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)

(note: this relies on OVERLAP domains to always have children, this is
 true because the regular topology domains are still here -- this is
 before degenerate trimming)

Debugged-by: Lauro Ramos Venancio &lt;lvenanci@redhat.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: linux-kernel@vger.kernel.org
Fixes: e3589f6c81e4 ("sched: Allow for overlapping sched_domain spans")
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>sched/topology: Optimize build_group_mask()</title>
<updated>2017-07-21T06:12:23Z</updated>
<author>
<name>Lauro Ramos Venancio</name>
<email>lvenanci@redhat.com</email>
</author>
<published>2017-04-20T19:51:40Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=49c159f59d5328cc86900fbb175356e1cb0be7f2'/>
<id>urn:sha1:49c159f59d5328cc86900fbb175356e1cb0be7f2</id>
<content type='text'>
commit f32d782e31bf079f600dcec126ed117b0577e85c upstream.

The group mask is always used in intersection with the group CPUs. So,
when building the group mask, we don't have to care about CPUs that are
not part of the group.

Signed-off-by: Lauro Ramos Venancio &lt;lvenanci@redhat.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: lwang@redhat.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1492717903-5195-2-git-send-email-lvenanci@redhat.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>sched: panic on corrupted stack end</title>
<updated>2017-05-20T12:18:44Z</updated>
<author>
<name>Jann Horn</name>
<email>jannh@google.com</email>
</author>
<published>2016-06-01T09:55:07Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f5efbbc2bd0a4f30c4f312f9d867ff7ef18df2f9'/>
<id>urn:sha1:f5efbbc2bd0a4f30c4f312f9d867ff7ef18df2f9</id>
<content type='text'>
commit 29d6455178a09e1dc340380c582b13356227e8df upstream.

Until now, hitting this BUG_ON caused a recursive oops (because oops
handling involves do_exit(), which calls into the scheduler, which in
turn raises an oops), which caused stuff below the stack to be
overwritten until a panic happened (e.g.  via an oops in interrupt
context, caused by the overwritten CPU index in the thread_info).

Just panic directly.

Signed-off-by: Jann Horn &lt;jannh@google.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
[AmitP: Minor refactoring of upstream changes for linux-3.18.y]
Signed-off-by: Amit Pundir &lt;amit.pundir@linaro.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>perf: Avoid horrible stack usage</title>
<updated>2017-04-30T03:49:15Z</updated>
<author>
<name>Peter Zijlstra (Intel)</name>
<email>peterz@infradead.org</email>
</author>
<published>2014-12-16T11:47:34Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=803e3757c40366d673b7792f6ac07793825fafeb'/>
<id>urn:sha1:803e3757c40366d673b7792f6ac07793825fafeb</id>
<content type='text'>
commit 86038c5ea81b519a8a1fcfcd5e4599aab0cdd119 upstream.

Both Linus (most recent) and Steve (a while ago) reported that perf
related callbacks have massive stack bloat.

The problem is that software events need a pt_regs in order to
properly report the event location and unwind stack. And because we
could not assume one was present we allocated one on stack and filled
it with minimal bits required for operation.

Now, pt_regs is quite large, so this is undesirable. Furthermore it
turns out that most sites actually have a pt_regs pointer available,
making this even more onerous, as the stack space is pointless waste.

This patch addresses the problem by observing that software events
have well defined nesting semantics, therefore we can use static
per-cpu storage instead of on-stack.

Linus made the further observation that all but the scheduler callers
of perf_sw_event() have a pt_regs available, so we change the regular
perf_sw_event() to require a valid pt_regs (where it used to be
optional) and add perf_sw_event_sched() for the scheduler.

We have a scheduler specific call instead of a more generic _noregs()
like construct because we can assume non-recursion from the scheduler
and thereby simplify the code further (_noregs would have to put the
recursion context call inline in order to assertain which __perf_regs
element to use).

One last note on the implementation of perf_trace_buf_prepare(); we
allow .regs = NULL for those cases where we already have a pt_regs
pointer available and do not need another.

Reported-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Reported-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@kernel.org&gt;
Cc: Javi Merino &lt;javi.merino@arm.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Paul Mackerras &lt;paulus@samba.org&gt;
Cc: Petr Mladek &lt;pmladek@suse.cz&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Tom Zanussi &lt;tom.zanussi@linux.intel.com&gt;
Cc: Vaibhav Nagarnaik &lt;vnagarnaik@google.com&gt;
Link: http://lkml.kernel.org/r/20141216115041.GW3337@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>sched/core: Fix a race between try_to_wake_up() and a woken up task</title>
<updated>2016-10-06T02:40:20Z</updated>
<author>
<name>Balbir Singh</name>
<email>bsingharora@gmail.com</email>
</author>
<published>2016-09-05T03:16:40Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=fb4064af0fd3887fafec8dfeba206966d0ff12b3'/>
<id>urn:sha1:fb4064af0fd3887fafec8dfeba206966d0ff12b3</id>
<content type='text'>
[ Upstream commit 135e8c9250dd5c8c9aae5984fde6f230d0cbfeaf ]

The origin of the issue I've seen is related to
a missing memory barrier between check for task-&gt;state and
the check for task-&gt;on_rq.

The task being woken up is already awake from a schedule()
and is doing the following:

	do {
		schedule()
		set_current_state(TASK_(UN)INTERRUPTIBLE);
	} while (!cond);

The waker, actually gets stuck doing the following in
try_to_wake_up():

	while (p-&gt;on_cpu)
		cpu_relax();

Analysis:

The instance I've seen involves the following race:

 CPU1					CPU2

 while () {
   if (cond)
     break;
   do {
     schedule();
     set_current_state(TASK_UN..)
   } while (!cond);
					wakeup_routine()
					  spin_lock_irqsave(wait_lock)
   raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
 }					  try_to_wake_up()
 set_current_state(TASK_RUNNING);	  ..
 list_del(&amp;waiter.list);

CPU2 wakes up CPU1, but before it can get the wait_lock and set
current state to TASK_RUNNING the following occurs:

 CPU3
 wakeup_routine()
 raw_spin_lock_irqsave(wait_lock)
 if (!list_empty)
   wake_up_process()
   try_to_wake_up()
   raw_spin_lock_irqsave(p-&gt;pi_lock)
   ..
   if (p-&gt;on_rq &amp;&amp; ttwu_wakeup())
   ..
   while (p-&gt;on_cpu)
     cpu_relax()
   ..

CPU3 tries to wake up the task on CPU1 again since it finds
it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
after CPU2, CPU3 got it.

CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
the task is spinning on the wait_lock. Interestingly since p-&gt;on_rq
is checked under pi_lock, I've noticed that try_to_wake_up() finds
p-&gt;on_rq to be 0. This was the most confusing bit of the analysis,
but p-&gt;on_rq is changed under runqueue lock, rq_lock, the p-&gt;on_rq
check is not reliable without this fix IMHO. The race is visible
(based on the analysis) only when ttwu_queue() does a remote wakeup
via ttwu_queue_remote. In which case the p-&gt;on_rq change is not
done uder the pi_lock.

The result is that after a while the entire system locks up on
the raw_spin_irqlock_save(wait_lock) and the holder spins infintely

Reproduction of the issue:

The issue can be reproduced after a long run on my system with 80
threads and having to tweak available memory to very low and running
memory stress-ng mmapfork test. It usually takes a long time to
reproduce. I am trying to work on a test case that can reproduce
the issue faster, but thats work in progress. I am still testing the
changes on my still in a loop and the tests seem OK thus far.

Big thanks to Benjamin and Nick for helping debug this as well.
Ben helped catch the missing barrier, Nick caught every missing
bit in my theory.

Signed-off-by: Balbir Singh &lt;bsingharora@gmail.com&gt;
[ Updated comment to clarify matching barriers. Many
  architectures do not have a full barrier in switch_to()
  so that cannot be relied upon. ]
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Alexey Kardashevskiy &lt;aik@ozlabs.ru&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Nicholas Piggin &lt;nicholas.piggin@gmail.com&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;

Signed-off-by: Sasha Levin &lt;alexander.levin@verizon.com&gt;
</content>
</entry>
<entry>
<title>kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w</title>
<updated>2016-07-12T12:46:50Z</updated>
<author>
<name>Andrey Ryabinin</name>
<email>aryabinin@virtuozzo.com</email>
</author>
<published>2016-06-09T12:20:05Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=069e0fa4c5ff85836f1e4e39391883f73b7dd085'/>
<id>urn:sha1:069e0fa4c5ff85836f1e4e39391883f73b7dd085</id>
<content type='text'>
[ Upstream commit 57675cb976eff977aefb428e68e4e0236d48a9ff ]

Lengthy output of sysrq-w may take a lot of time on slow serial console.

Currently we reset NMI-watchdog on the current CPU to avoid spurious
lockup messages. Sometimes this doesn't work since softlockup watchdog
might trigger on another CPU which is waiting for an IPI to proceed.
We reset softlockup watchdogs on all CPUs, but we do this only after
listing all tasks, and this may be too late on a busy system.

So, reset watchdogs CPUs earlier, in for_each_process_thread() loop.

Signed-off-by: Andrey Ryabinin &lt;aryabinin@virtuozzo.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Link: http://lkml.kernel.org/r/1465474805-14641-1-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
</entry>
<entry>
<title>sched/loadavg: Fix loadavg artifacts on fully idle and on fully loaded systems</title>
<updated>2016-06-06T23:11:03Z</updated>
<author>
<name>Sasha Levin</name>
<email>sasha.levin@oracle.com</email>
</author>
<published>2016-06-01T01:11:00Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=8d2ba3ff3ab81345eba35eac4c06e905da62f1c6'/>
<id>urn:sha1:8d2ba3ff3ab81345eba35eac4c06e905da62f1c6</id>
<content type='text'>
[ Upstream commit 20878232c52329f92423d27a60e48b6a6389e0dd ]

Systems show a minimal load average of 0.00, 0.01, 0.05 even when they
have no load at all.

Uptime and /proc/loadavg on all systems with kernels released during the
last five years up until kernel version 4.6-rc5, show a 5- and 15-minute
minimum loadavg of 0.01 and 0.05 respectively. This should be 0.00 on
idle systems, but the way the kernel calculates this value prevents it
from getting lower than the mentioned values.

Likewise but not as obviously noticeable, a fully loaded system with no
processes waiting, shows a maximum 1/5/15 loadavg of 1.00, 0.99, 0.95
(multiplied by number of cores).

Once the (old) load becomes 93 or higher, it mathematically can never
get lower than 93, even when the active (load) remains 0 forever.
This results in the strange 0.00, 0.01, 0.05 uptime values on idle
systems.  Note: 93/2048 = 0.0454..., which rounds up to 0.05.

It is not correct to add a 0.5 rounding (=1024/2048) here, since the
result from this function is fed back into the next iteration again,
so the result of that +0.5 rounding value then gets multiplied by
(2048-2037), and then rounded again, so there is a virtual "ghost"
load created, next to the old and active load terms.

By changing the way the internally kept value is rounded, that internal
value equivalent now can reach 0.00 on idle, and 1.00 on full load. Upon
increasing load, the internally kept load value is rounded up, when the
load is decreasing, the load value is rounded down.

The modified code was tested on nohz=off and nohz kernels. It was tested
on vanilla kernel 4.6-rc5 and on centos 7.1 kernel 3.10.0-327. It was
tested on single, dual, and octal cores system. It was tested on virtual
hosts and bare hardware. No unwanted effects have been observed, and the
problems that the patch intended to fix were indeed gone.

Tested-by: Damien Wyart &lt;damien.wyart@free.fr&gt;
Signed-off-by: Vik Heyndrickx &lt;vik.heyndrickx@veribox.net&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Cc: Doug Smythies &lt;dsmythies@telus.net&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Fixes: 0f004f5a696a ("sched: Cure more NO_HZ load average woes")
Link: http://lkml.kernel.org/r/e8d32bff-d544-7748-72b5-3c86cc71f09f@veribox.net
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
</entry>
<entry>
<title>sched/cputime: Fix steal_account_process_tick() to always return jiffies</title>
<updated>2016-04-18T12:49:25Z</updated>
<author>
<name>Chris Friesen</name>
<email>cbf123@mail.usask.ca</email>
</author>
<published>2016-03-06T05:18:48Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=e62daff6d23932a38b45f005ba4736bdf230ff84'/>
<id>urn:sha1:e62daff6d23932a38b45f005ba4736bdf230ff84</id>
<content type='text'>
[ Upstream commit f9c904b7613b8b4c85b10cd6b33ad41b2843fa9d ]

The callers of steal_account_process_tick() expect it to return
whether a jiffy should be considered stolen or not.

Currently the return value of steal_account_process_tick() is in
units of cputime, which vary between either jiffies or nsecs
depending on CONFIG_VIRT_CPU_ACCOUNTING_GEN.

If cputime has nsecs granularity and there is a tiny amount of
stolen time (a few nsecs, say) then we will consider the entire
tick stolen and will not account the tick on user/system/idle,
causing /proc/stats to show invalid data.

The fix is to change steal_account_process_tick() to accumulate
the stolen time and only account it once it's worth a jiffy.

(Thanks to Frederic Weisbecker for suggestions to fix a bug in my
first version of the patch.)

Signed-off-by: Chris Friesen &lt;chris.friesen@windriver.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Link: http://lkml.kernel.org/r/56DBBDB8.40305@mail.usask.ca
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
</entry>
<entry>
<title>sched: Fix crash in sched_init_numa()</title>
<updated>2016-04-12T16:10:53Z</updated>
<author>
<name>Raghavendra K T</name>
<email>raghavendra.kt@linux.vnet.ibm.com</email>
</author>
<published>2016-01-15T19:01:23Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c8d69b6b02be32cb2eca20b704c5c54fd1420c8e'/>
<id>urn:sha1:c8d69b6b02be32cb2eca20b704c5c54fd1420c8e</id>
<content type='text'>
[ Upstream commit 9c03ee147193645be4c186d3688232fa438c57c7 ]

The following PowerPC commit:

  c118baf80256 ("arch/powerpc/mm/numa.c: do not allocate bootmem memory for non existing nodes")

avoids allocating bootmem memory for non existent nodes.

But when DEBUG_PER_CPU_MAPS=y is enabled, my powerNV system failed to boot
because in sched_init_numa(), cpumask_or() operation was done on
unallocated nodes.

Fix that by making cpumask_or() operation only on existing nodes.

[ Tested with and w/o DEBUG_PER_CPU_MAPS=y on x86 and PowerPC. ]

Reported-by: Jan Stancek &lt;jstancek@redhat.com&gt;
Tested-by: Jan Stancek &lt;jstancek@redhat.com&gt;
Signed-off-by: Raghavendra K T &lt;raghavendra.kt@linux.vnet.ibm.com&gt;
Cc: &lt;gkurz@linux.vnet.ibm.com&gt;
Cc: &lt;grant.likely@linaro.org&gt;
Cc: &lt;nikunj@linux.vnet.ibm.com&gt;
Cc: &lt;vdavydov@parallels.com&gt;
Cc: &lt;linuxppc-dev@lists.ozlabs.org&gt;
Cc: &lt;linux-mm@kvack.org&gt;
Cc: &lt;peterz@infradead.org&gt;
Cc: &lt;benh@kernel.crashing.org&gt;
Cc: &lt;paulus@samba.org&gt;
Cc: &lt;mpe@ellerman.id.au&gt;
Cc: &lt;anton@samba.org&gt;
Link: http://lkml.kernel.org/r/1452884483-11676-1-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
</entry>
<entry>
<title>sched: Fix cpu_active_mask/cpu_online_mask race</title>
<updated>2015-12-03T03:57:18Z</updated>
<author>
<name>Jan H. Schönherr</name>
<email>jschoenh@amazon.de</email>
</author>
<published>2015-08-12T19:35:56Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=71dc6a353848a515e48ebe47a92e80a0be3411e5'/>
<id>urn:sha1:71dc6a353848a515e48ebe47a92e80a0be3411e5</id>
<content type='text'>
[ Upstream commit dd9d3843755da95f63dd3a376f62b3e45c011210 ]

There is a race condition in SMP bootup code, which may result
in

    WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
    workqueue_cpu_up_callback()
or
    kernel BUG at kernel/smpboot.c:135!

It can be triggered with a bit of luck in Linux guests running
on busy hosts.

	CPU0                        CPUn
	====                        ====

	_cpu_up()
	  __cpu_up()
				    start_secondary()
				      set_cpu_online()
					cpumask_set_cpu(cpu,
						   to_cpumask(cpu_online_bits));
	  cpu_notify(CPU_ONLINE)
	    &lt;do stuff, see below&gt;
					cpumask_set_cpu(cpu,
						   to_cpumask(cpu_active_bits));

During the various CPU_ONLINE callbacks CPUn is online but not
active. Several things can go wrong at that point, depending on
the scheduling of tasks on CPU0.

Variant 1:

  cpu_notify(CPU_ONLINE)
    workqueue_cpu_up_callback()
      rebind_workers()
        set_cpus_allowed_ptr()

  This call fails because it requires an active CPU; rebind_workers()
  ends with a warning:

    WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
    workqueue_cpu_up_callback()

Variant 2:

  cpu_notify(CPU_ONLINE)
    smpboot_thread_call()
      smpboot_unpark_threads()
       ..
        __kthread_unpark()
          __kthread_bind()
          wake_up_state()
           ..
            select_task_rq()
              select_fallback_rq()

  The -&gt;wake_cpu of the unparked thread is not allowed, making a call
  to select_fallback_rq() necessary. Then, select_fallback_rq() cannot
  find an allowed, active CPU and promptly resets the allowed CPUs, so
  that the task in question ends up on CPU0.

  When those unparked tasks are eventually executed, they run
  immediately into a BUG:

    kernel BUG at kernel/smpboot.c:135!

Just changing the order in which the online/active bits are set
(and adding some memory barriers), would solve the two issues
above. However, it would change the order of operations back to
the one before commit 6acbfb96976f ("sched: Fix hotplug vs.
set_cpus_allowed_ptr()"), thus, reintroducing that particular
problem.

Going further back into history, we have at least the following
commits touching this topic:
- commit 2baab4e90495 ("sched: Fix select_fallback_rq() vs cpu_active/cpu_online")
- commit 5fbd036b552f ("sched: Cleanup cpu_active madness")

Together, these give us the following non-working solutions:

  - secondary CPU sets active before online, because active is assumed to
    be a subset of online;

  - secondary CPU sets online before active, because the primary CPU
    assumes that an online CPU is also active;

  - secondary CPU sets online and waits for primary CPU to set active,
    because it might deadlock.

Commit 875ebe940d77 ("powerpc/smp: Wait until secondaries are
active &amp; online") introduces an arch-specific solution to this
arch-independent problem.

Now, go for a more general solution without explicit waiting and
simply set active twice: once on the secondary CPU after online
was set and once on the primary CPU after online was seen.

set_cpus_allowed_ptr()")

Signed-off-by: Jan H. Schönherr &lt;jschoenh@amazon.de&gt;
Acked-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Cc: Anton Blanchard &lt;anton@samba.org&gt;
Cc: Borislav Petkov &lt;bp@alien8.de&gt;
Cc: Joerg Roedel &lt;jroedel@suse.de&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Matt Wilson &lt;msw@amazon.com&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Fixes: 6acbfb96976f ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")
Link: http://lkml.kernel.org/r/1439408156-18840-1-git-send-email-jschoenh@amazon.de
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
</entry>
</feed>
