<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/kernel/workqueue.c, branch v3.12.48</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v3.12.48</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v3.12.48'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2015-04-09T11:14:05Z</updated>
<entry>
<title>workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE</title>
<updated>2015-04-09T11:14:05Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2015-03-05T13:04:13Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=97b57f41a0054767b0834ac9b34dfb22a8ee38f4'/>
<id>urn:sha1:97b57f41a0054767b0834ac9b34dfb22a8ee38f4</id>
<content type='text'>
commit 8603e1b30027f943cc9c1eef2b291d42c3347af1 upstream.

cancel[_delayed]_work_sync() are implemented using
__cancel_work_timer() which grabs the PENDING bit using
try_to_grab_pending() and then flushes the work item with PENDING set
to prevent the on-going execution of the work item from requeueing
itself.

try_to_grab_pending() can always grab the PENDING bit without blocking
except when someone else is doing the above flushing during
cancelation.  In that case, try_to_grab_pending() returns -ENOENT and
__cancel_work_timer() currently invokes flush_work().  The assumption
is that the completion of the work item is what the other canceling
task would be waiting for too, so waiting for the same condition and
retrying should allow forward progress without excessive busy looping.

Unfortunately, this doesn't work if preemption is disabled or the
latter task has real-time priority.  Let's say task A just got woken
up from flush_work() by the completion of the target work item.  If,
before task A starts executing, task B gets scheduled and invokes
__cancel_work_timer() on the same work item, its try_to_grab_pending()
will return -ENOENT as the work item is still being canceled by task A,
and flush_work() will also immediately return false as the work item
is no longer executing.  This puts task B in a busy loop, possibly
preventing task A from executing and clearing the canceling state on
the work item, leading to a hang.

task A			task B			worker

						executing work
__cancel_work_timer()
  try_to_grab_pending()
  set work CANCELING
  flush_work()
    block for work completion
						completion, wakes up A
			__cancel_work_timer()
			while (forever) {
			  try_to_grab_pending()
			    -ENOENT as work is being canceled
			  flush_work()
			    false as work is no longer executing
			}

This patch removes the possible hang by updating __cancel_work_timer()
to explicitly wait for clearing of CANCELING rather than invoking
flush_work() after try_to_grab_pending() fails with -ENOENT.
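
The retry loop after the change has roughly the following shape (a
simplified sketch, not the verbatim diff; cancel_waitq and the plain
wait entry stand in for the dedicated wait queue and custom wake
function described in the v3 note below):

	do {
		ret = try_to_grab_pending(work, is_dwork, &amp;flags);
		if (unlikely(ret == -ENOENT)) {
			/* someone else is canceling; sleep until CANCELING is cleared */
			prepare_to_wait(&amp;cancel_waitq, &amp;wait, TASK_UNINTERRUPTIBLE);
			if (work_is_canceling(work))
				schedule();
			finish_wait(&amp;cancel_waitq, &amp;wait);
		}
	} while (unlikely(ret &lt; 0));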

Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com

v3: bit_waitqueue() can't be used for work items defined in the vmalloc
    area.  Switched to a custom wake function which matches the target
    work item, and to exclusive wait and wakeup.

v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if
    the target bit waitqueue has wait_bit_queue's on it.  Use
    DEFINE_WAIT_BIT() and __wake_up_bit() instead.  Reported by Tomeu
    Vizoso.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Rabin Vincent &lt;rabin.vincent@axis.com&gt;
Cc: Tomeu Vizoso &lt;tomeu.vizoso@gmail.com&gt;
Tested-by: Jesper Nilsson &lt;jesper.nilsson@axis.com&gt;
Tested-by: Rabin Vincent &lt;rabin.vincent@axis.com&gt;
Signed-off-by: Jiri Slaby &lt;jslaby@suse.cz&gt;
</content>
</entry>
<entry>
<title>workqueue: fix subtle pool management issue which can stall whole worker_pool</title>
<updated>2015-02-08T19:02:08Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2015-01-16T19:21:16Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c6662b96d3b81695d2eafa518e0dce7fa35ba30f'/>
<id>urn:sha1:c6662b96d3b81695d2eafa518e0dce7fa35ba30f</id>
<content type='text'>
commit 29187a9eeaf362d8422e62e17a22a6e115277a49 upstream.

A worker_pool's forward progress is guaranteed by the fact that the
last idle worker assumes the manager role to create more workers and
summon the rescuers if creating workers doesn't succeed in a timely
manner, before proceeding to execute work items.

This manager role is implemented in manage_workers(), whose return
value indicates whether the worker may proceed to work item execution.
This is necessary because multiple workers may contend for the
manager role, and, if there already is a manager, the others should
proceed to work item execution.

Unfortunately, the function also indicates that the worker may proceed
to work item execution if need_to_create_worker() is false at the head
of the function.  need_to_create_worker() tests the following
conditions.

	pending work items &amp;&amp; !nr_running &amp;&amp; !nr_idle

The first and third conditions are protected by pool-&gt;lock and thus
won't change while it is held; however, nr_running can change
asynchronously as other workers block and resume.  While it is likely
to be zero, as someone woke this worker up in the first place, some
other workers could have become runnable in between, making it
non-zero.

If this happens, manage_workers() could return false even with zero
nr_idle, making the worker, the last idle one, proceed to execute work
items.  If all workers of the pool then end up blocking on a resource
which can only be released by a work item which is pending on that
pool, the whole pool can deadlock as there's no one left to create more
workers or summon the rescuers.

This patch fixes the problem by removing the early exit condition from
maybe_create_worker() and making manage_workers() return false iff
there's already another manager, which ensures that the last worker
doesn't start executing work items.
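
For reference, the arbitration after the fix reduces to roughly the
following (a simplified sketch, not the backported diff;
pool-&gt;manager_arb names the arbitration mutex and the destroy-side
helper is omitted):

	static bool manage_workers(struct worker *worker)
	{
		struct worker_pool *pool = worker-&gt;pool;

		/* only the arbitration winner becomes the manager */
		if (!mutex_trylock(&amp;pool-&gt;manager_arb))
			return false;

		maybe_create_worker(pool);	/* may sleep while creating workers */

		mutex_unlock(&amp;pool-&gt;manager_arb);
		return true;
	}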

We could leave the early exit condition alone and just ignore the
return value, but the only reason it was put there is that
manage_workers() used to perform both creation and destruction of
workers, so the function could be invoked while the pool was trying
to reduce the number of workers.  Now that manage_workers() is called
only when more workers are needed, the early exit condition is only
triggered by rare races, rendering it pointless.

Tested with a simulated workload and modified workqueue code which
trigger the pool deadlock reliably without this patch.

tj: Updated to v3.14 where manage_workers() is responsible not only
    for creating more workers but also for destroying surplus ones.
    maybe_create_worker() needs to keep its early exit condition to
    avoid creating a new worker when manage_workers() is called to
    destroy surplus ones.  Other than that, the adaptation is
    straightforward.  Both maybe_{create|destroy}_worker() functions
    are converted to return void and manage_workers() returns %false
    iff it lost manager arbitration.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Eric Sandeen &lt;sandeen@sandeen.net&gt;
Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
Signed-off-by: Jiri Slaby &lt;jslaby@suse.cz&gt;
</content>
</entry>
<entry>
<title>workqueue: zero cpumask of wq_numa_possible_cpumask on init</title>
<updated>2014-07-18T13:51:15Z</updated>
<author>
<name>Yasuaki Ishimatsu</name>
<email>isimatu.yasuaki@jp.fujitsu.com</email>
</author>
<published>2014-07-07T13:56:48Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=9b8fa806491226ca3540acb9364b5181f1ab4142'/>
<id>urn:sha1:9b8fa806491226ca3540acb9364b5181f1ab4142</id>
<content type='text'>
commit 5a6024f1604eef119cf3a6fa413fe0261a81a8f3 upstream.

When hot-adding and onlining a CPU, a kernel panic occurs with the
following call trace.

  BUG: unable to handle kernel paging request at 0000000000001d08
  IP: [&lt;ffffffff8114acfd&gt;] __alloc_pages_nodemask+0x9d/0xb10
  PGD 0
  Oops: 0000 [#1] SMP
  ...
  Call Trace:
   [&lt;ffffffff812b8745&gt;] ? cpumask_next_and+0x35/0x50
   [&lt;ffffffff810a3283&gt;] ? find_busiest_group+0x113/0x8f0
   [&lt;ffffffff81193bc9&gt;] ? deactivate_slab+0x349/0x3c0
   [&lt;ffffffff811926f1&gt;] new_slab+0x91/0x300
   [&lt;ffffffff815de95a&gt;] __slab_alloc+0x2bb/0x482
   [&lt;ffffffff8105bc1c&gt;] ? copy_process.part.25+0xfc/0x14c0
   [&lt;ffffffff810a3c78&gt;] ? load_balance+0x218/0x890
   [&lt;ffffffff8101a679&gt;] ? sched_clock+0x9/0x10
   [&lt;ffffffff81105ba9&gt;] ? trace_clock_local+0x9/0x10
   [&lt;ffffffff81193d1c&gt;] kmem_cache_alloc_node+0x8c/0x200
   [&lt;ffffffff8105bc1c&gt;] copy_process.part.25+0xfc/0x14c0
   [&lt;ffffffff81114d0d&gt;] ? trace_buffer_unlock_commit+0x4d/0x60
   [&lt;ffffffff81085a80&gt;] ? kthread_create_on_node+0x140/0x140
   [&lt;ffffffff8105d0ec&gt;] do_fork+0xbc/0x360
   [&lt;ffffffff8105d3b6&gt;] kernel_thread+0x26/0x30
   [&lt;ffffffff81086652&gt;] kthreadd+0x2c2/0x300
   [&lt;ffffffff81086390&gt;] ? kthread_create_on_cpu+0x60/0x60
   [&lt;ffffffff815f20ec&gt;] ret_from_fork+0x7c/0xb0
   [&lt;ffffffff81086390&gt;] ? kthread_create_on_cpu+0x60/0x60

In my investigation, I found that the root cause is
wq_numa_possible_cpumask.  All entries of wq_numa_possible_cpumask are
allocated by alloc_cpumask_var_node() and are then used without being
initialized, so they contain garbage values.

When hot-adding and onlining a CPU, wq_update_unbound_numa() is called.
It calls alloc_unbound_pwq(), which in turn calls get_unbound_pool().
In get_unbound_pool(), worker_pool-&gt;node is set as follows:

	/* if cpumask is contained inside a NUMA node, we belong to that node */
	if (wq_numa_enabled) {
		for_each_node(node) {
			if (cpumask_subset(pool-&gt;attrs-&gt;cpumask,
					   wq_numa_possible_cpumask[node])) {
				pool-&gt;node = node;
				break;
			}
		}
	}

But wq_numa_possible_cpumask[node] does not hold a correct cpumask, so
the wrong node is selected.  As a result, a kernel panic occurs.

With this patch, all entries of wq_numa_possible_cpumask are allocated
with zalloc_cpumask_var_node() so that they start out zeroed, and the
panic disappears.
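
The change itself is essentially one allocator call in the
wq_numa_init() table setup, sketched roughly below (tbl stands for the
local table that is later published as wq_numa_possible_cpumask):

	for_each_node(node)
		BUG_ON(!zalloc_cpumask_var_node(&amp;tbl[node], GFP_KERNEL,
						node_online(node) ? node : NUMA_NO_NODE));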

Signed-off-by: Yasuaki Ishimatsu &lt;isimatu.yasuaki@jp.fujitsu.com&gt;
Reviewed-by: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Fixes: bce903809ab3 ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
Signed-off-by: Jiri Slaby &lt;jslaby@suse.cz&gt;
</content>
</entry>
<entry>
<title>workqueue: fix dev_set_uevent_suppress() imbalance</title>
<updated>2014-07-18T13:51:14Z</updated>
<author>
<name>Maxime Bizon</name>
<email>mbizon@freebox.fr</email>
</author>
<published>2014-06-23T14:35:35Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=982170a363d53cf2c65547e79d0a30055268a841'/>
<id>urn:sha1:982170a363d53cf2c65547e79d0a30055268a841</id>
<content type='text'>
commit bddbceb688c6d0decaabc7884fede319d02f96c8 upstream.

Uevents are suppressed during attribute registration, but never
restored, so kobject_uevent() does nothing.
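
Sketched roughly (illustrative only, assuming the
workqueue_sysfs_register() path, not the verbatim diff), the fix lifts
the suppression again once the attributes are in place and only then
emits the uevent:

	dev_set_uevent_suppress(&amp;wq_dev-&gt;dev, true);	/* existing: quiet during setup */

	/* ... device_register() and attribute creation ... */

	dev_set_uevent_suppress(&amp;wq_dev-&gt;dev, false);	/* the missing re-enable */
	kobject_uevent(&amp;wq_dev-&gt;dev.kobj, KOBJ_ADD);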

Signed-off-by: Maxime Bizon &lt;mbizon@freebox.fr&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Fixes: 226223ab3c4118ddd10688cc2c131135848371ab
Signed-off-by: Jiri Slaby &lt;jslaby@suse.cz&gt;
</content>
</entry>
<entry>
<title>workqueue: make rescuer_thread() empty wq-&gt;maydays list before exiting</title>
<updated>2014-06-09T13:53:36Z</updated>
<author>
<name>Lai Jiangshan</name>
<email>laijs@cn.fujitsu.com</email>
</author>
<published>2014-04-18T15:04:16Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=aa439c1760a31f68afad736633bc487eb0a4eb15'/>
<id>urn:sha1:aa439c1760a31f68afad736633bc487eb0a4eb15</id>
<content type='text'>
commit 4d595b866d2c653dc90a492b9973a834eabfa354 upstream.

After a @pwq is scheduled for emergency execution, other workers may
consume the affected work items before the rescuer gets to them.  This
means that a workqueue may have pwqs queued on the @wq-&gt;maydays list
while not having any work item pending or in-flight.  If
destroy_workqueue() executes under such a condition, the rescuer may
exit without emptying @wq-&gt;maydays.

This currently doesn't cause any actual harm.  destroy_workqueue() can
safely destroy all the involved data structures whether @wq-&gt;maydays
is populated or not, as nobody accesses the list once the rescuer exits.

However, this is nasty and makes future development difficult.  Let's
update rescuer_thread() so that it empties @wq-&gt;maydays after seeing
should_stop to guarantee that the list is empty on rescuer exit.
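
The resulting control flow in rescuer_thread() is roughly the
following (a simplified sketch, not the verbatim diff): sample the
stop request before draining @wq-&gt;maydays so the list is guaranteed
to be empty by the time the rescuer returns.

repeat:
	set_current_state(TASK_INTERRUPTIBLE);

	should_stop = kthread_should_stop();

	/* ... drain wq-&gt;maydays and process each mayday'd pwq ... */

	if (should_stop) {
		__set_current_state(TASK_RUNNING);
		return 0;
	}

	schedule();
	goto repeat;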

tj: Updated comment and patch description.

Signed-off-by: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jiri Slaby &lt;jslaby@suse.cz&gt;
</content>
</entry>
<entry>
<title>workqueue: fix a possible race condition between rescuer and pwq-release</title>
<updated>2014-06-09T13:53:36Z</updated>
<author>
<name>Lai Jiangshan</name>
<email>laijs@cn.fujitsu.com</email>
</author>
<published>2014-04-18T15:04:16Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=d445003e2c54b48d280f255a40cbad97f0c26019'/>
<id>urn:sha1:d445003e2c54b48d280f255a40cbad97f0c26019</id>
<content type='text'>
commit 77668c8b559e4fe2acf2a0749c7c83cde49a5025 upstream.

There is a race condition between rescuer_thread() and
pwq_unbound_release_workfn().

Even after a pwq is scheduled for rescue, the associated work items
may be consumed by any worker.  If all of them are consumed before the
rescuer gets to them and the pwq's base ref was put due to an attribute
change, the pwq may be released while still linked on the
@wq-&gt;maydays list, making the rescuer dereference an already-freed
pwq later.

Make send_mayday() pin the target pwq until the rescuer is done with
it.
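
Sketched roughly (illustrative only; locking details omitted), the
pinning pairs a reference grab in send_mayday() with a release once
the rescuer has finished with the pwq:

	/* send_mayday(): queue the pwq and keep it alive for the rescuer */
	get_pwq(pwq);
	list_add_tail(&amp;pwq-&gt;mayday_node, &amp;wq-&gt;maydays);
	wake_up_process(wq-&gt;rescuer-&gt;task);

	/* rescuer_thread(): drop the reference after processing the pwq */
	put_pwq(pwq);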

tj: Updated comment and patch description.

Signed-off-by: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jiri Slaby &lt;jslaby@suse.cz&gt;
</content>
</entry>
<entry>
<title>workqueue: fix bugs in wq_update_unbound_numa() failure path</title>
<updated>2014-06-09T13:53:35Z</updated>
<author>
<name>Daeseok Youn</name>
<email>daeseok.youn@gmail.com</email>
</author>
<published>2014-04-16T05:32:29Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3b3b224978f8426000d0c26ecdde037356f3d873'/>
<id>urn:sha1:3b3b224978f8426000d0c26ecdde037356f3d873</id>
<content type='text'>
commit 77f300b198f93328c26191b52655ce1b62e202cf upstream.

The wq_update_unbound_numa() failure path has the following two bugs.

- alloc_unbound_pwq() is called without holding wq-&gt;mutex; however, if
  the allocation fails, it jumps to out_unlock which tries to unlock
  wq-&gt;mutex.

- The function should switch to dfl_pwq on failure but didn't do so
  after alloc_unbound_pwq() failure.

Fix it by regrabbing wq-&gt;mutex and jumping to use_dfl_pwq on
alloc_unbound_pwq() failure.
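
The corrected failure path then looks roughly like this (a sketch of
the description above, not the verbatim diff; target_attrs stands for
the attrs computed earlier in the function):

	pwq = alloc_unbound_pwq(wq, target_attrs);
	if (!pwq) {
		pr_warn("workqueue: allocation failed while updating NUMA affinity of \"%s\"\n",
			wq-&gt;name);
		mutex_lock(&amp;wq-&gt;mutex);	/* use_dfl_pwq/out_unlock expect it held */
		goto use_dfl_pwq;
	}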

Signed-off-by: Daeseok Youn &lt;daeseok.youn@gmail.com&gt;
Acked-by: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Fixes: 4c16bd327c74 ("workqueue: implement NUMA affinity for unbound workqueues")
Signed-off-by: Jiri Slaby &lt;jslaby@suse.cz&gt;
</content>
</entry>
<entry>
<title>workqueue: ensure @task is valid across kthread_stop()</title>
<updated>2014-03-05T16:13:50Z</updated>
<author>
<name>Lai Jiangshan</name>
<email>laijs@cn.fujitsu.com</email>
</author>
<published>2014-02-15T14:02:28Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=33328880843bac26ae1b8f62569dad13dd703bf0'/>
<id>urn:sha1:33328880843bac26ae1b8f62569dad13dd703bf0</id>
<content type='text'>
commit 5bdfff96c69a4d5ab9c49e60abf9e070ecd2acbb upstream.

When a kworker should die, the kworker is notified through the
WORKER_DIE flag instead of kthread_should_stop().  This, IIRC, is
primarily to keep the test synchronized under the worker_pool lock.
WORKER_DIE is first set while holding pool-&gt;lock, the lock is
dropped, and kthread_stop() is called.

Unfortunately, this means that there's a slight chance that the target
kworker may see WORKER_DIE before kthread_stop() finishes and exit,
with the target task being freed before or during kthread_stop().

Fix it by pinning the target task before setting WORKER_DIE and
putting it after kthread_stop() is done.
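
The pin-then-stop pattern described above looks roughly like this (an
illustrative sketch, not the verbatim diff; the explicit get/put of
the task struct is the assumption here):

	task = worker-&gt;task;
	get_task_struct(task);		/* pin: the kworker may exit any time after DIE is set */

	spin_lock_irq(&amp;pool-&gt;lock);
	worker-&gt;flags |= WORKER_DIE;
	spin_unlock_irq(&amp;pool-&gt;lock);

	kthread_stop(task);		/* safe even if the kworker has already exited */
	put_task_struct(task);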

tj: Improved patch description and comment.  Moved the pinning above
    WORKER_DIE to better signify what it's protecting.

Signed-off-by: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jiri Slaby &lt;jslaby@suse.cz&gt;

</content>
</entry>
<entry>
<title>workqueue: fix ordered workqueues in NUMA setups</title>
<updated>2013-12-04T19:05:55Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2013-09-05T16:30:04Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a432a4023f22f8c0aa9fa7262b124354351e6390'/>
<id>urn:sha1:a432a4023f22f8c0aa9fa7262b124354351e6390</id>
<content type='text'>
commit 8a2b75384444488fc4f2cbb9f0921b6a0794838f upstream.

An ordered workqueue implements execution ordering by using a single
pool_workqueue with max_active == 1.  On a given pool_workqueue, work
items are processed in FIFO order, and limiting max_active to 1 forces
the queued work items to be processed one by one.

Unfortunately, 4c16bd327c ("workqueue: implement NUMA affinity for
unbound workqueues") accidentally broke this guarantee by applying
NUMA affinity to ordered workqueues too.  On NUMA setups, an ordered
workqueue would end up with separate pool_workqueues for different
nodes.  Each pool_workqueue still limits max_active to 1 but multiple
work items may be executed concurrently and out of order depending on
which node they are queued to.

Fix it by using dedicated ordered_wq_attrs[] when creating ordered
workqueues.  The new attrs match the unbound ones except that no_numa
is always set thus forcing all NUMA nodes to share the default
pool_workqueue.
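
At init time the idea reduces to roughly the following (a simplified
sketch; error handling omitted and loop variable names approximate):

	for (i = 0; i &lt; NR_STD_WORKER_POOLS; i++) {
		struct workqueue_attrs *attrs = alloc_workqueue_attrs(GFP_KERNEL);

		attrs-&gt;nice = std_nice[i];
		attrs-&gt;no_numa = true;		/* every node shares the default pwq */
		ordered_wq_attrs[i] = attrs;
	}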

While at it, add a sanity check to the workqueue creation path which
verifies that an ordered workqueue has only the default
pool_workqueue.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Libin &lt;huawei.libin@huawei.com&gt;
Cc: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial</title>
<updated>2013-09-06T16:36:28Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2013-09-06T16:36:28Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=2e515bf096c245ba87f20ab4b4ea20f911afaeda'/>
<id>urn:sha1:2e515bf096c245ba87f20ab4b4ea20f911afaeda</id>
<content type='text'>
Pull trivial tree from Jiri Kosina:
 "The usual trivial updates all over the tree -- mostly typo fixes and
  documentation updates"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits)
  doc: Documentation/cputopology.txt fix typo
  treewide: Convert retrun typos to return
  Fix comment typo for init_cma_reserved_pageblock
  Documentation/trace: Correcting and extending tracepoint documentation
  mm/hotplug: fix a typo in Documentation/memory-hotplug.txt
  power: Documentation: Update s2ram link
  doc: fix a typo in Documentation/00-INDEX
  Documentation/printk-formats.txt: No casts needed for u64/s64
  doc: Fix typo "is is" in Documentations
  treewide: Fix printks with 0x%#
  zram: doc fixes
  Documentation/kmemcheck: update kmemcheck documentation
  doc: documentation/hwspinlock.txt fix typo
  PM / Hibernate: add section for resume options
  doc: filesystems : Fix typo in Documentations/filesystems
  scsi/megaraid fixed several typos in comments
  ppc: init_32: Fix error typo "CONFIG_START_KERNEL"
  treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
  page_isolation: Fix a comment typo in test_pages_isolated()
  doc: fix a typo about irq affinity
  ...
</content>
</entry>
</feed>
