<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/kernel/sched, branch v4.4.27</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.4.27</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.4.27'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2016-09-24T08:07:42Z</updated>
<entry>
<title>sched/core: Fix a race between try_to_wake_up() and a woken up task</title>
<updated>2016-09-24T08:07:42Z</updated>
<author>
<name>Balbir Singh</name>
<email>bsingharora@gmail.com</email>
</author>
<published>2016-09-05T03:16:40Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a37a538e82b037b51ad8418c5c33d1274cfe27d4'/>
<id>urn:sha1:a37a538e82b037b51ad8418c5c33d1274cfe27d4</id>
<content type='text'>
commit 135e8c9250dd5c8c9aae5984fde6f230d0cbfeaf upstream.

The origin of the issue I've seen is related to
a missing memory barrier between check for task-&gt;state and
the check for task-&gt;on_rq.

The task being woken up is already awake from a schedule()
and is doing the following:

	do {
		schedule()
		set_current_state(TASK_(UN)INTERRUPTIBLE);
	} while (!cond);

The waker, actually gets stuck doing the following in
try_to_wake_up():

	while (p-&gt;on_cpu)
		cpu_relax();

Analysis:

The instance I've seen involves the following race:

 CPU1					CPU2

 while () {
   if (cond)
     break;
   do {
     schedule();
     set_current_state(TASK_UN..)
   } while (!cond);
					wakeup_routine()
					  spin_lock_irqsave(wait_lock)
   raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
 }					  try_to_wake_up()
 set_current_state(TASK_RUNNING);	  ..
 list_del(&amp;waiter.list);

CPU2 wakes up CPU1, but before it can get the wait_lock and set
current state to TASK_RUNNING the following occurs:

 CPU3
 wakeup_routine()
 raw_spin_lock_irqsave(wait_lock)
 if (!list_empty)
   wake_up_process()
   try_to_wake_up()
   raw_spin_lock_irqsave(p-&gt;pi_lock)
   ..
   if (p-&gt;on_rq &amp;&amp; ttwu_wakeup())
   ..
   while (p-&gt;on_cpu)
     cpu_relax()
   ..

CPU3 tries to wake up the task on CPU1 again since it finds
it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
after CPU2, CPU3 got it.

CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
the task is spinning on the wait_lock. Interestingly since p-&gt;on_rq
is checked under pi_lock, I've noticed that try_to_wake_up() finds
p-&gt;on_rq to be 0. This was the most confusing bit of the analysis,
but p-&gt;on_rq is changed under runqueue lock, rq_lock, the p-&gt;on_rq
check is not reliable without this fix IMHO. The race is visible
(based on the analysis) only when ttwu_queue() does a remote wakeup
via ttwu_queue_remote. In which case the p-&gt;on_rq change is not
done uder the pi_lock.

The result is that after a while the entire system locks up on
the raw_spin_irqlock_save(wait_lock) and the holder spins infintely

Reproduction of the issue:

The issue can be reproduced after a long run on my system with 80
threads and having to tweak available memory to very low and running
memory stress-ng mmapfork test. It usually takes a long time to
reproduce. I am trying to work on a test case that can reproduce
the issue faster, but thats work in progress. I am still testing the
changes on my still in a loop and the tests seem OK thus far.

Big thanks to Benjamin and Nick for helping debug this as well.
Ben helped catch the missing barrier, Nick caught every missing
bit in my theory.

Signed-off-by: Balbir Singh &lt;bsingharora@gmail.com&gt;
[ Updated comment to clarify matching barriers. Many
  architectures do not have a full barrier in switch_to()
  so that cannot be relied upon. ]
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Alexey Kardashevskiy &lt;aik@ozlabs.ru&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Nicholas Piggin &lt;nicholas.piggin@gmail.com&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>sched/numa: Fix use-after-free bug in the task_numa_compare</title>
<updated>2016-09-15T06:27:45Z</updated>
<author>
<name>Gavin Guo</name>
<email>gavin.guo@canonical.com</email>
</author>
<published>2016-01-20T04:36:58Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f2b8424f35f5d457c2d35b3823ee0fb272e77728'/>
<id>urn:sha1:f2b8424f35f5d457c2d35b3823ee0fb272e77728</id>
<content type='text'>
[ Upstream commit 1dff76b92f69051e579bdc131e01500da9fa2a91 ]

The following message can be observed on the Ubuntu v3.13.0-65 with KASan
backported:

  ==================================================================
  BUG: KASan: use after free in task_numa_find_cpu+0x64c/0x890 at addr ffff880dd393ecd8
  Read of size 8 by task qemu-system-x86/3998900
  =============================================================================
  BUG kmalloc-128 (Tainted: G    B        ): kasan: bad access detected
  -----------------------------------------------------------------------------

  INFO: Allocated in task_numa_fault+0xc1b/0xed0 age=41980 cpu=18 pid=3998890
	__slab_alloc+0x4f8/0x560
	__kmalloc+0x1eb/0x280
	task_numa_fault+0xc1b/0xed0
	do_numa_page+0x192/0x200
	handle_mm_fault+0x808/0x1160
	__do_page_fault+0x218/0x750
	do_page_fault+0x1a/0x70
	page_fault+0x28/0x30
	SyS_poll+0x66/0x1a0
	system_call_fastpath+0x1a/0x1f
  INFO: Freed in task_numa_free+0x1d2/0x200 age=62 cpu=18 pid=0
	__slab_free+0x2ab/0x3f0
	kfree+0x161/0x170
	task_numa_free+0x1d2/0x200
	finish_task_switch+0x1d2/0x210
	__schedule+0x5d4/0xc60
	schedule_preempt_disabled+0x40/0xc0
	cpu_startup_entry+0x2da/0x340
	start_secondary+0x28f/0x360
  Call Trace:
   [&lt;ffffffff81a6ce35&gt;] dump_stack+0x45/0x56
   [&lt;ffffffff81244aed&gt;] print_trailer+0xfd/0x170
   [&lt;ffffffff8124ac36&gt;] object_err+0x36/0x40
   [&lt;ffffffff8124cbf9&gt;] kasan_report_error+0x1e9/0x3a0
   [&lt;ffffffff8124d260&gt;] kasan_report+0x40/0x50
   [&lt;ffffffff810dda7c&gt;] ? task_numa_find_cpu+0x64c/0x890
   [&lt;ffffffff8124bee9&gt;] __asan_load8+0x69/0xa0
   [&lt;ffffffff814f5c38&gt;] ? find_next_bit+0xd8/0x120
   [&lt;ffffffff810dda7c&gt;] task_numa_find_cpu+0x64c/0x890
   [&lt;ffffffff810de16c&gt;] task_numa_migrate+0x4ac/0x7b0
   [&lt;ffffffff810de523&gt;] numa_migrate_preferred+0xb3/0xc0
   [&lt;ffffffff810e0b88&gt;] task_numa_fault+0xb88/0xed0
   [&lt;ffffffff8120ef02&gt;] do_numa_page+0x192/0x200
   [&lt;ffffffff81211038&gt;] handle_mm_fault+0x808/0x1160
   [&lt;ffffffff810d7dbd&gt;] ? sched_clock_cpu+0x10d/0x160
   [&lt;ffffffff81068c52&gt;] ? native_load_tls+0x82/0xa0
   [&lt;ffffffff81a7bd68&gt;] __do_page_fault+0x218/0x750
   [&lt;ffffffff810c2186&gt;] ? hrtimer_try_to_cancel+0x76/0x160
   [&lt;ffffffff81a6f5e7&gt;] ? schedule_hrtimeout_range_clock.part.24+0xf7/0x1c0
   [&lt;ffffffff81a7c2ba&gt;] do_page_fault+0x1a/0x70
   [&lt;ffffffff81a772e8&gt;] page_fault+0x28/0x30
   [&lt;ffffffff8128cbd4&gt;] ? do_sys_poll+0x1c4/0x6d0
   [&lt;ffffffff810e64f6&gt;] ? enqueue_task_fair+0x4b6/0xaa0
   [&lt;ffffffff810233c9&gt;] ? sched_clock+0x9/0x10
   [&lt;ffffffff810cf70a&gt;] ? resched_task+0x7a/0xc0
   [&lt;ffffffff810d0663&gt;] ? check_preempt_curr+0xb3/0x130
   [&lt;ffffffff8128b5c0&gt;] ? poll_select_copy_remaining+0x170/0x170
   [&lt;ffffffff810d3bc0&gt;] ? wake_up_state+0x10/0x20
   [&lt;ffffffff8112a28f&gt;] ? drop_futex_key_refs.isra.14+0x1f/0x90
   [&lt;ffffffff8112d40e&gt;] ? futex_requeue+0x3de/0xba0
   [&lt;ffffffff8112e49e&gt;] ? do_futex+0xbe/0x8f0
   [&lt;ffffffff81022c89&gt;] ? read_tsc+0x9/0x20
   [&lt;ffffffff8111bd9d&gt;] ? ktime_get_ts+0x12d/0x170
   [&lt;ffffffff8108f699&gt;] ? timespec_add_safe+0x59/0xe0
   [&lt;ffffffff8128d1f6&gt;] SyS_poll+0x66/0x1a0
   [&lt;ffffffff81a830dd&gt;] system_call_fastpath+0x1a/0x1f

As commit 1effd9f19324 ("sched/numa: Fix unsafe get_task_struct() in
task_numa_assign()") points out, the rcu_read_lock() cannot protect the
task_struct from being freed in the finish_task_switch(). And the bug
happens in the process of calculation of imp which requires the access of
p-&gt;numa_faults being freed in the following path:

do_exit()
        current-&gt;flags |= PF_EXITING;
    release_task()
        ~~delayed_put_task_struct()~~
    schedule()
    ...
    ...
rq-&gt;curr = next;
    context_switch()
        finish_task_switch()
            put_task_struct()
                __put_task_struct()
		    task_numa_free()

The fix here to get_task_struct() early before end of dst_rq-&gt;lock to
protect the calculation process and also put_task_struct() in the
corresponding point if finally the dst_rq-&gt;curr somehow cannot be
assigned.

Additional credit to Liang Chen who helped fix the error logic and add the
put_task_struct() to the place it missed.

Signed-off-by: Gavin Guo &lt;gavin.guo@canonical.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: jay.vosburgh@canonical.com
Cc: liang.chen@canonical.com
Link: http://lkml.kernel.org/r/1453264618-17645-1-git-send-email-gavin.guo@canonical.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@verizon.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>sched/nohz: Fix affine unpinned timers mess</title>
<updated>2016-09-07T06:32:41Z</updated>
<author>
<name>Wanpeng Li</name>
<email>wanpeng.li@hotmail.com</email>
</author>
<published>2016-05-04T06:45:34Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=15abaa07a2f0dabb66dfa637162fdaa66b839141'/>
<id>urn:sha1:15abaa07a2f0dabb66dfa637162fdaa66b839141</id>
<content type='text'>
commit 444969223c81c7d0a95136b7b4cfdcfbc96ac5bd upstream.

The following commit:

  9642d18eee2c ("nohz: Affine unpinned timers to housekeepers")'

intended to affine unpinned timers to housekeepers:

  unpinned timers(full dynaticks, idle)   =&gt;   nearest busy housekeepers(otherwise, fallback to any housekeepers)
  unpinned timers(full dynaticks, busy)   =&gt;   nearest busy housekeepers(otherwise, fallback to any housekeepers)
  unpinned timers(houserkeepers, idle)    =&gt;   nearest busy housekeepers(otherwise, fallback to itself)

However, the !idle_cpu(i) &amp;&amp; is_housekeeping_cpu(cpu) check modified the
intention to:

  unpinned timers(full dynaticks, idle)   =&gt;   any housekeepers(no mattter cpu topology)
  unpinned timers(full dynaticks, busy)   =&gt;   any housekeepers(no mattter cpu topology)
  unpinned timers(housekeepers, idle)     =&gt;   any busy cpus(otherwise, fallback to any housekeepers)

This patch fixes it by checking if there are busy housekeepers nearby,
otherwise falls to any housekeepers/itself. After the patch:

  unpinned timers(full dynaticks, idle)   =&gt;   nearest busy housekeepers(otherwise, fallback to any housekeepers)
  unpinned timers(full dynaticks, busy)   =&gt;   nearest busy housekeepers(otherwise, fallback to any housekeepers)
  unpinned timers(housekeepers, idle)     =&gt;   nearest busy housekeepers(otherwise, fallback to itself)

Signed-off-by: Wanpeng Li &lt;wanpeng.li@hotmail.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
[ Fixed the changelog. ]
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: linux-kernel@vger.kernel.org
Fixes: 'commit 9642d18eee2c ("nohz: Affine unpinned timers to housekeepers")'
Link: http://lkml.kernel.org/r/1462344334-8303-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>sched/cputime: Fix NO_HZ_FULL getrusage() monotonicity regression</title>
<updated>2016-09-07T06:32:41Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2016-08-15T16:38:42Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c3cf68ec5595e30c28d44b0080f236af94e0e8da'/>
<id>urn:sha1:c3cf68ec5595e30c28d44b0080f236af94e0e8da</id>
<content type='text'>
commit 173be9a14f7b2e901cf77c18b1aafd4d672e9d9e upstream.

Mike reports:

 Roughly 10% of the time, ltp testcase getrusage04 fails:
 getrusage04    0  TINFO  :  Expected timers granularity is 4000 us
 getrusage04    0  TINFO  :  Using 1 as multiply factor for max [us]time increment (1000+4000us)!
 getrusage04    0  TINFO  :  utime:           0us; stime:         179us
 getrusage04    0  TINFO  :  utime:        3751us; stime:           0us
 getrusage04    1  TFAIL  :  getrusage04.c:133: stime increased &gt; 5000us:

And tracked it down to the case where the task simply doesn't get
_any_ [us]time ticks.

Update the code to assume all rtime is utime when we lack information,
thus ensuring a task that elides the tick gets time accounted.

Reported-by: Mike Galbraith &lt;umgwanakikbuti@gmail.com&gt;
Tested-by: Mike Galbraith &lt;umgwanakikbuti@gmail.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Fredrik Markstrom &lt;fredrik.markstrom@gmail.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Radim &lt;rkrcmar@redhat.com&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Stephane Eranian &lt;eranian@google.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Vince Weaver &lt;vincent.weaver@maine.edu&gt;
Cc: Wanpeng Li &lt;wanpeng.li@hotmail.com&gt;
Fixes: 9d7fb0427648 ("sched/cputime: Guarantee stime + utime == rtime")
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>sched/fair: Fix effective_load() to consistently use smoothed load</title>
<updated>2016-08-10T09:49:28Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2016-06-24T13:53:54Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=34bf12312bd4222a1b945be3f58173edc8aa3f22'/>
<id>urn:sha1:34bf12312bd4222a1b945be3f58173edc8aa3f22</id>
<content type='text'>
commit 7dd4912594daf769a46744848b05bd5bc6d62469 upstream.

Starting with the following commit:

  fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")

calc_tg_weight() doesn't compute the right value as expected by effective_load().

The difference is in the 'correction' term. In order to ensure \Sum
rw_j &gt;= rw_i we cannot use tg-&gt;load_avg directly, since that might be
lagging a correction on the current cfs_rq-&gt;avg.load_avg value.
Therefore we use tg-&gt;load_avg - cfs_rq-&gt;tg_load_avg_contrib +
cfs_rq-&gt;avg.load_avg.

Now, per the referenced commit, calc_tg_weight() doesn't use
cfs_rq-&gt;avg.load_avg, as is later used in @w, but uses
cfs_rq-&gt;load.weight instead.

So stop using calc_tg_weight() and do it explicitly.

The effects of this bug are wake_affine() making randomly
poor choices in cgroup-intense workloads.

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Fixes: fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w</title>
<updated>2016-08-10T09:49:25Z</updated>
<author>
<name>Andrey Ryabinin</name>
<email>aryabinin@virtuozzo.com</email>
</author>
<published>2016-06-09T12:20:05Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=dc20f3244ae920430d9d9f19939a13a0279380ca'/>
<id>urn:sha1:dc20f3244ae920430d9d9f19939a13a0279380ca</id>
<content type='text'>
commit 57675cb976eff977aefb428e68e4e0236d48a9ff upstream.

Lengthy output of sysrq-w may take a lot of time on slow serial console.

Currently we reset NMI-watchdog on the current CPU to avoid spurious
lockup messages. Sometimes this doesn't work since softlockup watchdog
might trigger on another CPU which is waiting for an IPI to proceed.
We reset softlockup watchdogs on all CPUs, but we do this only after
listing all tasks, and this may be too late on a busy system.

So, reset watchdogs CPUs earlier, in for_each_process_thread() loop.

Signed-off-by: Andrey Ryabinin &lt;aryabinin@virtuozzo.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Link: http://lkml.kernel.org/r/1465474805-14641-1-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>sched/fair: Fix cfs_rq avg tracking underflow</title>
<updated>2016-07-27T16:47:31Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2016-06-16T08:50:40Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=43b1bfec0e2d57497718fd01f7151b2c78de99fc'/>
<id>urn:sha1:43b1bfec0e2d57497718fd01f7151b2c78de99fc</id>
<content type='text'>
commit 8974189222159154c55f24ddad33e3613960521a upstream.

As per commit:

  b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")

&gt; the code generated from update_cfs_rq_load_avg():
&gt;
&gt; 	if (atomic_long_read(&amp;cfs_rq-&gt;removed_load_avg)) {
&gt; 		s64 r = atomic_long_xchg(&amp;cfs_rq-&gt;removed_load_avg, 0);
&gt; 		sa-&gt;load_avg = max_t(long, sa-&gt;load_avg - r, 0);
&gt; 		sa-&gt;load_sum = max_t(s64, sa-&gt;load_sum - r * LOAD_AVG_MAX, 0);
&gt; 		removed_load = 1;
&gt; 	}
&gt;
&gt; turns into:
&gt;
&gt; ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
&gt; ffffffff8108706b:       48 85 c0                test   %rax,%rax
&gt; ffffffff8108706e:       74 40                   je     ffffffff810870b0 &lt;update_blocked_averages+0xc0&gt;
&gt; ffffffff81087070:       4c 89 f8                mov    %r15,%rax
&gt; ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
&gt; ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
&gt; ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
&gt; ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
&gt; ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
&gt; ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx
&gt;
&gt; Which you'll note ends up with sa-&gt;load_avg -= r in memory at
&gt; ffffffff8108707a.

So I _should_ have looked at other unserialized users of -&gt;load_avg,
but alas. Luckily nikbor reported a similar /0 from task_h_load() which
instantly triggered recollection of this here problem.

Aside from the intermediate value hitting memory and causing problems,
there's another problem: the underflow detection relies on the signed
bit. This reduces the effective width of the variables, IOW its
effectively the same as having these variables be of signed type.

This patch changes to a different means of unsigned underflow
detection to not rely on the signed bit. This allows the variables to
use the 'full' unsigned range. And it does so with explicit LOAD -
STORE to ensure any intermediate value will never be visible in
memory, allowing these unserialized loads.

Note: GCC generates crap code for this, might warrant a look later.

Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
       maybe we should do clamping on add too.

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Andrey Ryabinin &lt;aryabinin@virtuozzo.com&gt;
Cc: Chris Wilson &lt;chris@chris-wilson.co.uk&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Yuyang Du &lt;yuyang.du@intel.com&gt;
Cc: bsegall@google.com
Cc: kernel@kyup.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;


</content>
</entry>
<entry>
<title>sched: panic on corrupted stack end</title>
<updated>2016-06-24T17:18:20Z</updated>
<author>
<name>Jann Horn</name>
<email>jannh@google.com</email>
</author>
<published>2016-06-01T09:55:07Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c08b1a593a042ae01e788ec5504bee2cfc83e1f2'/>
<id>urn:sha1:c08b1a593a042ae01e788ec5504bee2cfc83e1f2</id>
<content type='text'>
commit 29d6455178a09e1dc340380c582b13356227e8df upstream.

Until now, hitting this BUG_ON caused a recursive oops (because oops
handling involves do_exit(), which calls into the scheduler, which in
turn raises an oops), which caused stuff below the stack to be
overwritten until a panic happened (e.g.  via an oops in interrupt
context, caused by the overwritten CPU index in the thread_info).

Just panic directly.

Signed-off-by: Jann Horn &lt;jannh@google.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>sched/loadavg: Fix loadavg artifacts on fully idle and on fully loaded systems</title>
<updated>2016-06-01T19:15:49Z</updated>
<author>
<name>Vik Heyndrickx</name>
<email>vik.heyndrickx@veribox.net</email>
</author>
<published>2016-04-28T18:46:28Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=1df73f1884599d40c76e111e99e9511d943fa4c4'/>
<id>urn:sha1:1df73f1884599d40c76e111e99e9511d943fa4c4</id>
<content type='text'>
commit 20878232c52329f92423d27a60e48b6a6389e0dd upstream.

Systems show a minimal load average of 0.00, 0.01, 0.05 even when they
have no load at all.

Uptime and /proc/loadavg on all systems with kernels released during the
last five years up until kernel version 4.6-rc5, show a 5- and 15-minute
minimum loadavg of 0.01 and 0.05 respectively. This should be 0.00 on
idle systems, but the way the kernel calculates this value prevents it
from getting lower than the mentioned values.

Likewise but not as obviously noticeable, a fully loaded system with no
processes waiting, shows a maximum 1/5/15 loadavg of 1.00, 0.99, 0.95
(multiplied by number of cores).

Once the (old) load becomes 93 or higher, it mathematically can never
get lower than 93, even when the active (load) remains 0 forever.
This results in the strange 0.00, 0.01, 0.05 uptime values on idle
systems.  Note: 93/2048 = 0.0454..., which rounds up to 0.05.

It is not correct to add a 0.5 rounding (=1024/2048) here, since the
result from this function is fed back into the next iteration again,
so the result of that +0.5 rounding value then gets multiplied by
(2048-2037), and then rounded again, so there is a virtual "ghost"
load created, next to the old and active load terms.

By changing the way the internally kept value is rounded, that internal
value equivalent now can reach 0.00 on idle, and 1.00 on full load. Upon
increasing load, the internally kept load value is rounded up, when the
load is decreasing, the load value is rounded down.

The modified code was tested on nohz=off and nohz kernels. It was tested
on vanilla kernel 4.6-rc5 and on centos 7.1 kernel 3.10.0-327. It was
tested on single, dual, and octal cores system. It was tested on virtual
hosts and bare hardware. No unwanted effects have been observed, and the
problems that the patch intended to fix were indeed gone.

Tested-by: Damien Wyart &lt;damien.wyart@free.fr&gt;
Signed-off-by: Vik Heyndrickx &lt;vik.heyndrickx@veribox.net&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Doug Smythies &lt;dsmythies@telus.net&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Fixes: 0f004f5a696a ("sched: Cure more NO_HZ load average woes")
Link: http://lkml.kernel.org/r/e8d32bff-d544-7748-72b5-3c86cc71f09f@veribox.net
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>sched/cgroup: Fix/cleanup cgroup teardown/init</title>
<updated>2016-05-04T21:48:42Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2016-03-16T15:22:45Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c0944355a74bc9c2b5b3cc5b627efe0c73e30bd9'/>
<id>urn:sha1:c0944355a74bc9c2b5b3cc5b627efe0c73e30bd9</id>
<content type='text'>
commit 2f5177f0fd7e531b26d54633be62d1d4cb94621c upstream.

The CPU controller hasn't kept up with the various changes in the whole
cgroup initialization / destruction sequence, and commit:

  2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")

caused it to explode.

The reason for this is that zombies do not inhibit css_offline() from
being called, but do stall css_released(). Now we tear down the cfs_rq
structures on css_offline() but zombies can run after that, leading to
use-after-free issues.

The solution is to move the tear-down to css_released(), which
guarantees nobody (including no zombies) is still using our cgroup.

Furthermore, a few simple cleanups are possible too. There doesn't
appear to be any point to us using css_online() (anymore?) so fold that
in css_alloc().

And since cgroup code guarantees an RCU grace period between
css_released() and css_free() we can forgo using call_rcu() and free the
stuff immediately.

Suggested-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Kazuki Yamaguchi &lt;k@rhe.jp&gt;
Reported-by: Niklas Cassel &lt;niklas.cassel@axis.com&gt;
Tested-by: Niklas Cassel &lt;niklas.cassel@axis.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
Link: http://lkml.kernel.org/r/20160316152245.GY6344@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
</feed>
