<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/kernel/workqueue.c, branch v3.16.42</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v3.16.42</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v3.16.42'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2016-06-15T20:29:27Z</updated>
<entry>
<title>workqueue: fix ghost PENDING flag while doing MQ IO</title>
<updated>2016-06-15T20:29:27Z</updated>
<author>
<name>Roman Pen</name>
<email>roman.penyaev@profitbricks.com</email>
</author>
<published>2016-04-26T11:15:35Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6971777782ba593105cee306b598e6075bbddb1d'/>
<id>urn:sha1:6971777782ba593105cee306b598e6075bbddb1d</id>
<content type='text'>
commit 346c09f80459a3ad97df1816d6d606169a51001a upstream.

The bug in a workqueue leads to a stalled IO request in MQ ctx-&gt;rq_list
with the following backtrace:

[  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
[  601.347651] "echo 0 &gt; /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
[  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[  601.350965] Call Trace:
[  601.351203]  [&lt;ffffffff815b0920&gt;] ? bit_wait+0x60/0x60
[  601.351444]  [&lt;ffffffff815b01d5&gt;] schedule+0x35/0x80
[  601.351709]  [&lt;ffffffff815b2dd2&gt;] schedule_timeout+0x192/0x230
[  601.351958]  [&lt;ffffffff812d43f7&gt;] ? blk_flush_plug_list+0xc7/0x220
[  601.352208]  [&lt;ffffffff810bd737&gt;] ? ktime_get+0x37/0xa0
[  601.352446]  [&lt;ffffffff815b0920&gt;] ? bit_wait+0x60/0x60
[  601.352688]  [&lt;ffffffff815af784&gt;] io_schedule_timeout+0xa4/0x110
[  601.352951]  [&lt;ffffffff815b3a4e&gt;] ? _raw_spin_unlock_irqrestore+0xe/0x10
[  601.353196]  [&lt;ffffffff815b093b&gt;] bit_wait_io+0x1b/0x70
[  601.353440]  [&lt;ffffffff815b056d&gt;] __wait_on_bit+0x5d/0x90
[  601.353689]  [&lt;ffffffff81127bd0&gt;] wait_on_page_bit+0xc0/0xd0
[  601.353958]  [&lt;ffffffff81096db0&gt;] ? autoremove_wake_function+0x40/0x40
[  601.354200]  [&lt;ffffffff81127cc4&gt;] __filemap_fdatawait_range+0xe4/0x140
[  601.354441]  [&lt;ffffffff81127d34&gt;] filemap_fdatawait_range+0x14/0x30
[  601.354688]  [&lt;ffffffff81129a9f&gt;] filemap_write_and_wait_range+0x3f/0x70
[  601.354932]  [&lt;ffffffff811ced3b&gt;] blkdev_fsync+0x1b/0x50
[  601.355193]  [&lt;ffffffff811c82d9&gt;] vfs_fsync_range+0x49/0xa0
[  601.355432]  [&lt;ffffffff811cf45a&gt;] blkdev_write_iter+0xca/0x100
[  601.355679]  [&lt;ffffffff81197b1a&gt;] __vfs_write+0xaa/0xe0
[  601.355925]  [&lt;ffffffff81198379&gt;] vfs_write+0xa9/0x1a0
[  601.356164]  [&lt;ffffffff811c59d8&gt;] kernel_write+0x38/0x50

The underlying device is a null_blk, with default parameters:

  queue_mode    = MQ
  submit_queues = 1

Verification that nullb0 has something inflight:

root@pserver8:~# cat /sys/block/nullb0/inflight
       0        1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
        ffff8838038e2400
...

During debugging it became clear that the stalled request is always
inserted in the rq_list from the following path:

   save_stack_trace_tsk + 34
   blk_mq_insert_requests + 231
   blk_mq_flush_plug_list + 281
   blk_flush_plug_list + 199
   wait_on_page_bit + 192
   __filemap_fdatawait_range + 228
   filemap_fdatawait_range + 20
   filemap_write_and_wait_range + 63
   blkdev_fsync + 27
   vfs_fsync_range + 73
   blkdev_write_iter + 202
   __vfs_write + 170
   vfs_write + 169
   kernel_write + 56

So blk_flush_plug_list() was called with from_schedule == true.

If from_schedule is true, that means that finally blk_mq_insert_requests()
offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
i.e. it calls kblockd_schedule_delayed_work_on().

That means that we race with another CPU, which is about to execute
__blk_mq_run_hw_queue() work.

Further debugging shows the following traces from different CPUs:

  CPU#0                                  CPU#1
  ----------------------------------     -------------------------------
  request A inserted
  STORE hctx-&gt;ctx_map[0] bit marked
  kblockd_schedule...() returns 1
  &lt;schedule to kblockd workqueue&gt;
                                         request B inserted
                                         STORE hctx-&gt;ctx_map[1] bit marked
                                         kblockd_schedule...() returns 0
  *** WORK PENDING bit is cleared ***
  flush_busy_ctxs() is executed, but
  bit 1, set by CPU#1, is not observed

As a result, request B stays pending forever.

This behaviour can be explained by a speculative LOAD of hctx-&gt;ctx_map on
CPU#0, which is reordered with the clearing of the PENDING bit and executed
_before_ the actual STORE of bit 1 on CPU#1.

The proper fix is an explicit full barrier &lt;mfence&gt;, which guarantees
that the clearing of the PENDING bit is executed before any speculative
LOADs or STOREs inside the actual work function.
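
As an illustration only (this is not the kernel patch), the ordering
requirement can be sketched in user space with C11 atomics.  The names
pending, ctx_map, queue_request() and start_work() are hypothetical stand-ins
for the PENDING bit, hctx-&gt;ctx_map and the two code paths in the trace above,
and atomic_thread_fence(memory_order_seq_cst) plays the role of the full
barrier:

<![CDATA[
```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the PENDING bit and hctx->ctx_map. */
static atomic_bool pending;
static atomic_uint ctx_map;

/*
 * Submitter side (CPU#1 in the trace): publish the request bit first, then
 * try to take PENDING.  Both operations use the default
 * memory_order_seq_cst, so the bit is published before PENDING is sampled.
 * Returns true if the caller has to schedule the work (PENDING was clear).
 */
static bool queue_request(unsigned int ctx)
{
	assert(ctx < 32u);	/* assumption: at most 32 software contexts */
	atomic_fetch_or(&ctx_map, 1u << ctx);
	return !atomic_exchange(&pending, true);
}

/*
 * Worker side (CPU#0 in the trace): clear PENDING, then issue a full fence
 * before reading ctx_map.  Without the fence the read could be satisfied
 * before the clear becomes visible, and a bit set by a submitter that still
 * saw PENDING set would be missed.
 * Returns the set of contexts to flush.
 */
static unsigned int start_work(void)
{
	atomic_store_explicit(&pending, false, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_exchange_explicit(&ctx_map, 0u, memory_order_relaxed);
}
```
]]>

A submitter that gets false back from queue_request() relies on the already
scheduled work to pick up its bit, which is what the fence in start_work() is
meant to ensure.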

Signed-off-by: Roman Pen &lt;roman.penyaev@profitbricks.com&gt;
Cc: Gioh Kim &lt;gi-oh.kim@profitbricks.com&gt;
Cc: Michael Wang &lt;yun.wang@profitbricks.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
</entry>
<entry>
<title>workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup</title>
<updated>2016-02-25T10:34:56Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2016-02-03T18:54:25Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6bfeca86dab7770b926bb3d2a86fc0c15ab2499b'/>
<id>urn:sha1:6bfeca86dab7770b926bb3d2a86fc0c15ab2499b</id>
<content type='text'>
commit d6e022f1d207a161cd88e08ef0371554680ffc46 upstream.

When looking up the pool_workqueue to use for an unbound workqueue,
workqueue assumes that the target CPU is always bound to a valid NUMA
node.  However, currently, when a CPU goes offline, the mapping is
destroyed and cpu_to_node() returns NUMA_NO_NODE.

This has always been broken but hasn't triggered often enough before
874bbfe600a6 ("workqueue: make sure delayed work run in local cpu").
After the commit, workqueue forcefully assigns the local CPU for
delayed work items without explicit target CPU to fix a different
issue.  This widens the window where a CPU can go offline while a
delayed work item is pending, causing delayed work items to be
dispatched with the target CPU set to an already offlined CPU.  The
resulting NUMA_NO_NODE mapping makes workqueue try to queue the work
item on a NULL pool_workqueue and thus crash.

While 874bbfe600a6 has been reverted for a different reason making the
bug less visible again, it can still happen.  Fix it by mapping
NUMA_NO_NODE to the default pool_workqueue from unbound_pwq_by_node().
This is a temporary workaround.  The long term solution is keeping CPU
-&gt; NODE mapping stable across CPU off/online cycles which is being
worked on.
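
A minimal sketch of that fallback, written as a user-space model rather than
the kernel code: struct wq_model, struct pwq_model, their fields and
pwq_by_node() are hypothetical names, and only the NUMA_NO_NODE check mirrors
what the text above describes.

<![CDATA[
```c
#include <assert.h>
#include <stddef.h>

#define NUMA_NO_NODE (-1)	/* same value the kernel uses */
#define MAX_NODES    4		/* assumption: small fixed node count */

struct pwq_model { int id; };

struct wq_model {
	struct pwq_model *dfl_pwq;		/* default pool_workqueue */
	struct pwq_model *node_pwq[MAX_NODES];	/* per-node pool_workqueues */
};

/*
 * Return the pool_workqueue to use for @node.  A CPU that went offline has
 * no node mapping, so NUMA_NO_NODE falls back to the default entry instead
 * of indexing the per-node table with -1.
 */
static struct pwq_model *pwq_by_node(struct wq_model *wq, int node)
{
	if (node == NUMA_NO_NODE)
		return wq->dfl_pwq;

	assert(node >= 0 && node < MAX_NODES);
	return wq->node_pwq[node];
}
```
]]>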

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Mike Galbraith &lt;umgwanakikbuti@gmail.com&gt;
Cc: Tang Chen &lt;tangchen@cn.fujitsu.com&gt;
Cc: Rafael J. Wysocki &lt;rafael@kernel.org&gt;
Cc: Len Brown &lt;len.brown@intel.com&gt;
Link: http://lkml.kernel.org/g/1454424264.11183.46.camel@gmail.com
Link: http://lkml.kernel.org/g/1453702100-2597-1-git-send-email-tangchen@cn.fujitsu.com
[ luis: backported to 3.16: adjusted context ]
Signed-off-by: Luis Henriques &lt;luis.henriques@canonical.com&gt;
</content>
</entry>
<entry>
<title>Revert "workqueue: make sure delayed work run in local cpu"</title>
<updated>2016-02-24T10:27:06Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2016-02-22T18:08:53Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=fb829af0ec50c1103d58a6aca65b09622a5f5e4c'/>
<id>urn:sha1:fb829af0ec50c1103d58a6aca65b09622a5f5e4c</id>
<content type='text'>
commit 041bd12e272c53a35c54c13875839bcb98c999ce upstream.

This reverts commit 874bbfe600a660cba9c776b3957b1ce393151b76.

Workqueue used to implicitly guarantee that work items queued without
explicit CPU specified are put on the local CPU.  Recent changes in
timer broke the guarantee and led to vmstat breakage which was fixed
by 176bed1de5bf ("vmstat: explicitly schedule per-cpu work on the CPU
we need it to run on").

vmstat is the most likely to expose the issue and it's quite possible
that there are other similar problems which are a lot more difficult
to trigger.  As a preventive measure, 874bbfe600a6 ("workqueue: make
sure delayed work run in local cpu") was applied to restore the local
CPU guarantee.  Unfortunately, the change exposed a bug in timer code
which got fixed by 22b886dd1018 ("timers: Use proper base migration in
add_timer_on()").  Due to code restructuring, the commit couldn't be
backported beyond a certain point and stable kernels which only had
874bbfe600a6 started crashing.

The local CPU guarantee was accidental more than anything else and we
want to get rid of it anyway.  As, with the vmstat case fixed,
874bbfe600a6 is causing more problems than it's fixing, it has been
decided to take the chance and officially break the guarantee by
reverting the commit.  A debug feature will be added to force foreign
CPU assignment to expose cases relying on the guarantee and fixes for
the individual cases will be backported to stable as necessary.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Fixes: 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu")
Link: http://lkml.kernel.org/g/20160120211926.GJ10810@quack.suse.cz
Cc: Mike Galbraith &lt;umgwanakikbuti@gmail.com&gt;
Cc: Henrique de Moraes Holschuh &lt;hmh@hmh.eng.br&gt;
Cc: Daniel Bilik &lt;daniel.bilik@neosystem.cz&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Shaohua Li &lt;shli@fb.com&gt;
Cc: Sasha Levin &lt;sasha.levin@oracle.com&gt;
Cc: Ben Hutchings &lt;ben@decadent.org.uk&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Daniel Bilik &lt;daniel.bilik@neosystem.cz&gt;
Cc: Jiri Slaby &lt;jslaby@suse.cz&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Signed-off-by: Kamal Mostafa &lt;kamal@canonical.com&gt;
Signed-off-by: Luis Henriques &lt;luis.henriques@canonical.com&gt;
</content>
</entry>
<entry>
<title>workqueue: make sure delayed work run in local cpu</title>
<updated>2015-10-30T13:59:30Z</updated>
<author>
<name>Shaohua Li</name>
<email>shli@fb.com</email>
</author>
<published>2015-09-30T16:05:30Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c7ca6b49f98c88f7c6d7313edbaf3cf0792b8485'/>
<id>urn:sha1:c7ca6b49f98c88f7c6d7313edbaf3cf0792b8485</id>
<content type='text'>
commit 874bbfe600a660cba9c776b3957b1ce393151b76 upstream.

My system keeps crashing with the message below. vmstat_update() schedules a
delayed work item on the current cpu and expects the work to run on that cpu.
schedule_delayed_work() is expected to make delayed work run on the local cpu.
The problem is that the timer can be migrated with NO_HZ. __queue_work() queues
the work in the timer handler, which could run on a different cpu than the one
where the delayed work was scheduled. The end result is that the delayed work
runs on a different cpu. The patch makes __queue_delayed_work() record the
local cpu earlier. With the change, where the timer runs no longer affects
where the work runs.

[   28.010131] ------------[ cut here ]------------
[   28.010609] kernel BUG at ../mm/vmstat.c:1392!
[   28.011099] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
[   28.011860] Modules linked in:
[   28.012245] CPU: 0 PID: 289 Comm: kworker/0:3 Tainted: G        W       4.3.0-rc3+ #634
[   28.013065] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
[   28.014160] Workqueue: events vmstat_update
[   28.014571] task: ffff880117682580 ti: ffff8800ba428000 task.ti: ffff8800ba428000
[   28.015445] RIP: 0010:[&lt;ffffffff8115f921&gt;]  [&lt;ffffffff8115f921&gt;] vmstat_update+0x31/0x80
[   28.016282] RSP: 0018:ffff8800ba42fd80  EFLAGS: 00010297
[   28.016812] RAX: 0000000000000000 RBX: ffff88011a858dc0 RCX: 0000000000000000
[   28.017585] RDX: ffff880117682580 RSI: ffffffff81f14d8c RDI: ffffffff81f4df8d
[   28.018366] RBP: ffff8800ba42fd90 R08: 0000000000000001 R09: 0000000000000000
[   28.019169] R10: 0000000000000000 R11: 0000000000000121 R12: ffff8800baa9f640
[   28.019947] R13: ffff88011a81e340 R14: ffff88011a823700 R15: 0000000000000000
[   28.020071] FS:  0000000000000000(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
[   28.020071] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   28.020071] CR2: 00007ff6144b01d0 CR3: 00000000b8e93000 CR4: 00000000000006f0
[   28.020071] Stack:
[   28.020071]  ffff88011a858dc0 ffff8800baa9f640 ffff8800ba42fe00 ffffffff8106bd88
[   28.020071]  ffffffff8106bd0b 0000000000000096 0000000000000000 ffffffff82f9b1e8
[   28.020071]  ffffffff829f0b10 0000000000000000 ffffffff81f18460 ffff88011a81e340
[   28.020071] Call Trace:
[   28.020071]  [&lt;ffffffff8106bd88&gt;] process_one_work+0x1c8/0x540
[   28.020071]  [&lt;ffffffff8106bd0b&gt;] ? process_one_work+0x14b/0x540
[   28.020071]  [&lt;ffffffff8106c214&gt;] worker_thread+0x114/0x460
[   28.020071]  [&lt;ffffffff8106c100&gt;] ? process_one_work+0x540/0x540
[   28.020071]  [&lt;ffffffff81071bf8&gt;] kthread+0xf8/0x110
[   28.020071]  [&lt;ffffffff81071b00&gt;] ? kthread_create_on_node+0x200/0x200
[   28.020071]  [&lt;ffffffff81a6522f&gt;] ret_from_fork+0x3f/0x70
[   28.020071]  [&lt;ffffffff81071b00&gt;] ? kthread_create_on_node+0x200/0x200

Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Luis Henriques &lt;luis.henriques@canonical.com&gt;
</content>
</entry>
<entry>
<title>workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE</title>
<updated>2015-03-23T15:16:21Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2015-03-05T13:04:13Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=1264f1e3728bb2e684395148ec61468c5edbf66c'/>
<id>urn:sha1:1264f1e3728bb2e684395148ec61468c5edbf66c</id>
<content type='text'>
commit 8603e1b30027f943cc9c1eef2b291d42c3347af1 upstream.

cancel[_delayed]_work_sync() are implemented using
__cancel_work_timer() which grabs the PENDING bit using
try_to_grab_pending() and then flushes the work item with PENDING set
to prevent the on-going execution of the work item from requeueing
itself.

try_to_grab_pending() can always grab PENDING bit without blocking
except when someone else is doing the above flushing during
cancelation.  In that case, try_to_grab_pending() returns -ENOENT.  In
this case, __cancel_work_timer() currently invokes flush_work().  The
assumption is that the completion of the work item is what the other
canceling task would be waiting for too and thus waiting for the same
condition and retrying should allow forward progress without excessive
busy looping.

Unfortunately, this doesn't work if preemption is disabled or the
latter task has real time priority.  Let's say task A just got woken
up from flush_work() by the completion of the target work item.  If,
before task A starts executing, task B gets scheduled and invokes
__cancel_work_timer() on the same work item, its try_to_grab_pending()
will return -ENOENT as the work item is still being canceled by task A
and flush_work() will also immediately return false as the work item
is no longer executing.  This puts task B in a busy loop possibly
preventing task A from executing and clearing the canceling state on
the work item leading to a hang.

task A			task B			worker

						executing work
__cancel_work_timer()
  try_to_grab_pending()
  set work CANCELING
  flush_work()
    block for work completion
						completion, wakes up A
			__cancel_work_timer()
			while (forever) {
			  try_to_grab_pending()
			    -ENOENT as work is being canceled
			  flush_work()
			    false as work is no longer executing
			}

This patch removes the possible hang by updating __cancel_work_timer()
to explicitly wait for clearing of CANCELING rather than invoking
flush_work() after try_to_grab_pending() fails with -ENOENT.

Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com

v3: bit_waitqueue() can't be used for work items defined in vmalloc
    area.  Switched to custom wake function which matches the target
    work item and exclusive wait and wakeup.

v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if
    the target bit waitqueue has wait_bit_queue's on it.  Use
    DEFINE_WAIT_BIT() and __wake_up_bit() instead.  Reported by Tomeu
    Vizoso.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Rabin Vincent &lt;rabin.vincent@axis.com&gt;
Cc: Tomeu Vizoso &lt;tomeu.vizoso@gmail.com&gt;
Tested-by: Jesper Nilsson &lt;jesper.nilsson@axis.com&gt;
Tested-by: Rabin Vincent &lt;rabin.vincent@axis.com&gt;
Signed-off-by: Luis Henriques &lt;luis.henriques@canonical.com&gt;
</content>
</entry>
<entry>
<title>workqueue: fix subtle pool management issue which can stall whole worker_pool</title>
<updated>2015-02-04T10:57:30Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2015-01-16T19:21:16Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=affdf013434b9a49a8db157d8885c100c4cc0019'/>
<id>urn:sha1:affdf013434b9a49a8db157d8885c100c4cc0019</id>
<content type='text'>
commit 29187a9eeaf362d8422e62e17a22a6e115277a49 upstream.

A worker_pool's forward progress is guaranteed by the fact that the
last idle worker assumes the manager role to create more workers and
summon the rescuers if creating workers doesn't succeed in a timely
manner before proceeding to execute work items.

This manager role is implemented in manage_workers(), which indicates
whether the worker may proceed to work item execution with its return
value.  This is necessary because multiple workers may contend for the
manager role, and, if there already is a manager, others should
proceed to work item execution.

Unfortunately, the function also indicates that the worker may proceed
to work item execution if need_to_create_worker() is false at the head
of the function.  need_to_create_worker() tests the following
conditions.

	pending work items &amp;&amp; !nr_running &amp;&amp; !nr_idle

The first and third conditions are protected by pool-&gt;lock and thus
won't change while holding pool-&gt;lock; however, nr_running can change
asynchronously as other workers block and resume.  While it's likely
to be zero, as someone woke this worker up in the first place, some
other workers could have become runnable in between, making it non-zero.
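
The condition can be written out as a small predicate.  This is a sketch over
a hypothetical struct pool_model, not the kernel's data structures; it only
shows that a change in nr_running alone flips the result.

<![CDATA[
```c
#include <stdbool.h>

/* Hypothetical snapshot of the three values the condition looks at. */
struct pool_model {
	int nr_pending;	/* queued work items; stable under pool->lock */
	int nr_idle;	/* idle workers; stable under pool->lock */
	int nr_running;	/* running workers; may change asynchronously */
};

/* pending work items && !nr_running && !nr_idle */
static bool need_to_create_worker(const struct pool_model *pool)
{
	return pool->nr_pending > 0 && !pool->nr_running && !pool->nr_idle;
}
```
]]>

With nr_pending = 1 and nr_idle = 0, the predicate is true while nr_running
is zero and becomes false as soon as another worker becomes runnable.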

If this happens, manage_workers() could return false even with zero
nr_idle, making the worker, the last idle one, proceed to execute work
items.  If then all workers of the pool end up blocking on a resource
which can only be released by a work item which is pending on that
pool, the whole pool can deadlock as there's no one to create more
workers or summon the rescuers.

This patch fixes the problem by removing the early exit condition from
maybe_create_worker() and making manage_workers() return false iff
there's already another manager, which ensures that the last worker
doesn't start executing work items.

We could leave the early exit condition alone and just ignore the return
value, but the only reason it was put there is that manage_workers()
used to perform both creation and destruction of workers, and thus the
function could be invoked while the pool was trying to reduce the number
of workers.  Now that manage_workers() is called only when more workers
are needed, the only case in which this early exit condition triggers is
a rare race condition, rendering it pointless.

Tested with a simulated workload and modified workqueue code which
triggers the pool deadlock reliably without this patch.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Eric Sandeen &lt;sandeen@sandeen.net&gt;
Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
[ luis: backported to 3.16:
  - maybe_create_worker() is now void, 'return' instead of 'return true' ]
Signed-off-by: Luis Henriques &lt;luis.henriques@canonical.com&gt;
</content>
</entry>
<entry>
<title>workqueue: zero cpumask of wq_numa_possible_cpumask on init</title>
<updated>2014-07-07T13:56:48Z</updated>
<author>
<name>Yasuaki Ishimatsu</name>
<email>isimatu.yasuaki@jp.fujitsu.com</email>
</author>
<published>2014-07-07T13:56:48Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=5a6024f1604eef119cf3a6fa413fe0261a81a8f3'/>
<id>urn:sha1:5a6024f1604eef119cf3a6fa413fe0261a81a8f3</id>
<content type='text'>
When hot-adding and onlining a CPU, a kernel panic occurs, showing the
following call trace.

  BUG: unable to handle kernel paging request at 0000000000001d08
  IP: [&lt;ffffffff8114acfd&gt;] __alloc_pages_nodemask+0x9d/0xb10
  PGD 0
  Oops: 0000 [#1] SMP
  ...
  Call Trace:
   [&lt;ffffffff812b8745&gt;] ? cpumask_next_and+0x35/0x50
   [&lt;ffffffff810a3283&gt;] ? find_busiest_group+0x113/0x8f0
   [&lt;ffffffff81193bc9&gt;] ? deactivate_slab+0x349/0x3c0
   [&lt;ffffffff811926f1&gt;] new_slab+0x91/0x300
   [&lt;ffffffff815de95a&gt;] __slab_alloc+0x2bb/0x482
   [&lt;ffffffff8105bc1c&gt;] ? copy_process.part.25+0xfc/0x14c0
   [&lt;ffffffff810a3c78&gt;] ? load_balance+0x218/0x890
   [&lt;ffffffff8101a679&gt;] ? sched_clock+0x9/0x10
   [&lt;ffffffff81105ba9&gt;] ? trace_clock_local+0x9/0x10
   [&lt;ffffffff81193d1c&gt;] kmem_cache_alloc_node+0x8c/0x200
   [&lt;ffffffff8105bc1c&gt;] copy_process.part.25+0xfc/0x14c0
   [&lt;ffffffff81114d0d&gt;] ? trace_buffer_unlock_commit+0x4d/0x60
   [&lt;ffffffff81085a80&gt;] ? kthread_create_on_node+0x140/0x140
   [&lt;ffffffff8105d0ec&gt;] do_fork+0xbc/0x360
   [&lt;ffffffff8105d3b6&gt;] kernel_thread+0x26/0x30
   [&lt;ffffffff81086652&gt;] kthreadd+0x2c2/0x300
   [&lt;ffffffff81086390&gt;] ? kthread_create_on_cpu+0x60/0x60
   [&lt;ffffffff815f20ec&gt;] ret_from_fork+0x7c/0xb0
   [&lt;ffffffff81086390&gt;] ? kthread_create_on_cpu+0x60/0x60

In my investigation, I found that the root cause is wq_numa_possible_cpumask.
All entries of wq_numa_possible_cpumask are allocated by
alloc_cpumask_var_node() and are then used without being initialized,
so they contain garbage values.

When hot-adding and onlining a CPU, wq_update_unbound_numa() is called.
wq_update_unbound_numa() calls alloc_unbound_pwq(), and alloc_unbound_pwq()
calls get_unbound_pool(). In get_unbound_pool(), worker_pool-&gt;node is set
as follows:

        /* if cpumask is contained inside a NUMA node, we belong to that node */
        if (wq_numa_enabled) {
                for_each_node(node) {
                        if (cpumask_subset(pool-&gt;attrs-&gt;cpumask,
                                           wq_numa_possible_cpumask[node])) {
                                pool-&gt;node = node;
                                break;
                        }
                }
        }

But wq_numa_possible_cpumask[node] does not contain a valid cpumask, so the
wrong node is selected. As a result, the kernel panics.

With this patch, all entries of wq_numa_possible_cpumask are allocated by
zalloc_cpumask_var_node() so that they are initialized, and the panic
disappears.
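
A user-space analogue of the difference, assuming a hypothetical fixed-size
struct mask in place of cpumask_var_t: calloc() plays the role of
zalloc_cpumask_var_node(), so every bit starts cleared, and mask_subset()
mimics the cpumask_subset() test from the snippet above.  All names in the
sketch are made up for illustration.

<![CDATA[
```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define MASK_WORDS    4	/* assumption: room for 4 words of CPU bits */
#define BITS_PER_WORD (CHAR_BIT * sizeof(unsigned long))

struct mask { unsigned long bits[MASK_WORDS]; };

/* Zero-initialised allocation: no CPU is set until the caller sets one. */
static struct mask *mask_zalloc(void)
{
	return calloc(1, sizeof(struct mask));
}

static void mask_set_cpu(struct mask *m, unsigned int cpu)
{
	assert(cpu < MASK_WORDS * BITS_PER_WORD);
	m->bits[cpu / BITS_PER_WORD] |= 1UL << (cpu % BITS_PER_WORD);
}

/* True if every CPU set in @a is also set in @b. */
static bool mask_subset(const struct mask *a, const struct mask *b)
{
	for (int i = 0; i < MASK_WORDS; i++)
		if (a->bits[i] & ~b->bits[i])
			return false;
	return true;
}
```
]]>

With a zeroed per-node mask, a pool mask that has any CPU set is never
reported as a subset until that CPU is explicitly added to the node mask,
whereas an uninitialized mask could match by accident.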

Signed-off-by: Yasuaki Ishimatsu &lt;isimatu.yasuaki@jp.fujitsu.com&gt;
Reviewed-by: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: stable@vger.kernel.org
Fixes: bce903809ab3 ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
</content>
</entry>
<entry>
<title>workqueue: fix dev_set_uevent_suppress() imbalance</title>
<updated>2014-06-23T18:40:49Z</updated>
<author>
<name>Maxime Bizon</name>
<email>mbizon@freebox.fr</email>
</author>
<published>2014-06-23T14:35:35Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=bddbceb688c6d0decaabc7884fede319d02f96c8'/>
<id>urn:sha1:bddbceb688c6d0decaabc7884fede319d02f96c8</id>
<content type='text'>
Uevents are suppressed during attributes registration, but never
restored, so kobject_uevent() does nothing.

Signed-off-by: Maxime Bizon &lt;mbizon@freebox.fr&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: stable@vger.kernel.org
Fixes: 226223ab3c4118ddd10688cc2c131135848371ab
</content>
</entry>
<entry>
<title>Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq</title>
<updated>2014-06-09T21:56:49Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2014-06-09T21:56:49Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=da85d191f58a44e149a7c07dbae78b3042909798'/>
<id>urn:sha1:da85d191f58a44e149a7c07dbae78b3042909798</id>
<content type='text'>
Pull workqueue updates from Tejun Heo:
 "Lai simplified worker destruction path and internal workqueue locking
  and there are some other minor changes.

  Except for the removal of some long-deprecated interfaces which
  haven't had any in-kernel user for quite a while, there shouldn't be
  any difference to workqueue users"

* 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  kernel/workqueue.c: pr_warning/pr_warn &amp; printk/pr_info
  workqueue: remove the confusing POOL_FREEZING
  workqueue: rename first_worker() to first_idle_worker()
  workqueue: remove unused work_clear_pending()
  workqueue: remove unused WORK_CPU_END
  workqueue: declare system_highpri_wq
  workqueue: use generic attach/detach routine for rescuers
  workqueue: separate pool-attaching code out from create_worker()
  workqueue: rename manager_mutex to attach_mutex
  workqueue: narrow the protection range of manager_mutex
  workqueue: convert worker_idr to worker_ida
  workqueue: separate iteration role from worker_idr
  workqueue: destroy worker directly in the idle timeout handler
  workqueue: async worker destruction
  workqueue: destroy_worker() should destroy idle workers only
  workqueue: use manager lock only to protect worker_idr
  workqueue: Remove deprecated system_nrt[_freezable]_wq
  workqueue: Remove deprecated flush[_delayed]_work_sync()
  kernel/workqueue.c: pr_warning/pr_warn &amp; printk/pr_info
  workqueue: simplify wq_update_unbound_numa() by jumping to use_dfl_pwq if the target cpumask equals wq's
</content>
</entry>
<entry>
<title>kernel/workqueue.c: pr_warning/pr_warn &amp; printk/pr_info</title>
<updated>2014-05-28T14:22:34Z</updated>
<author>
<name>Valdis Kletnieks</name>
<email>Valdis.Kletnieks@vt.edu</email>
</author>
<published>2014-05-27T18:28:59Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=015af06e103fa47af29ada0f564301c81d4973b2'/>
<id>urn:sha1:015af06e103fa47af29ada0f564301c81d4973b2</id>
<content type='text'>
This commit did an incorrect printk-&gt;pr_info conversion. If we were
converting to pr_info() we should lose the log_level parameter. The problem is
that this is called (indirectly) by show_regs_print_info(), which is called
with various log_levels (from _INFO clear to _EMERG). So we leave it as
a printk() call so the desired log_level is applied.

Not a full revert, as the other half of the patch is correct.

Signed-off-by: Valdis Kletnieks &lt;valdis.kletnieks@vt.edu&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
</feed>
