user/sven/linux.git/kernel/kthread.c, branch v4.2.4

kthread: export kthread functions

2015-08-07T01:39:41Z

The s-Par visornic driver, currently in staging, processes a queue being serviced by the an s-Par service partition. We can get a message that something has happened with the Service Partition, when that happens, we must not access the channel until we get a message that the service partition is back again. The visornic driver has a thread for processing the channel, when we get the message, we need to be able to park the thread and then resume it when the problem clears. We can do this with kthread_park and unpark but they are not exported from the kernel, this patch exports the needed functions. Signed-off-by: David Kershner Acked-by: Ingo Molnar Acked-by: Neil Horman Acked-by: Thomas Gleixner Cc: Richard Weinberger Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

kernel/kthread.c: partial revert of 81c98869faa5 ("kthread: ensure locality of task_struct allocations")

2014-10-10T02:25:51Z

After discussions with Tejun, we don't want to spread the use of cpu_to_mem() (and thus knowledge of allocators/NUMA topology details) into callers, but would rather ensure the callees correctly handle memoryless nodes. With the previous patches ("topology: add support for node_to_mem_node() to determine the fallback node" and "slub: fallback to node_to_mem_node() node if allocating on memoryless node") adding and using node_to_mem_node(), we can safely undo part of the change to the kthread logic from 81c98869faa5. Signed-off-by: Nishanth Aravamudan Cc: Joonsoo Kim Cc: David Rientjes Cc: Han Pingtian Cc: Pekka Enberg Cc: Paul Mackerras Cc: Benjamin Herrenschmidt Cc: Michael Ellerman Cc: Anton Blanchard Cc: Christoph Lameter Cc: Wanpeng Li Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

kthread_work: wake up worker only when the worker is idle

2014-07-28T18:07:52Z

If the worker is already executing a work item when another is queued, we can safely skip wakeup without worrying about stalling queue thus avoiding waking up the busy worker spuriously. Spurious wakeups should be fine but still isn't nice and avoiding it is trivial here. tj: Updated description. Signed-off-by: Lai Jiangshan Signed-off-by: Tejun Heo

kthread: fix return value of kthread_create() upon SIGKILL.

2014-06-04T23:53:51Z

Commit 786235eeba0e ("kthread: make kthread_create() killable") meant for allowing kthread_create() to abort as soon as killed by the OOM-killer. But returning -ENOMEM is wrong if killed by SIGKILL from userspace. Change kthread_create() to return -EINTR upon SIGKILL. Signed-off-by: Tetsuo Handa Cc: Oleg Nesterov Acked-by: David Rientjes Cc: [3.13+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

kthread: ensure locality of task_struct allocations

2014-04-03T23:20:49Z

In the presence of memoryless nodes, numa_node_id() will return the current CPU's NUMA node, but that may not be where we expect to allocate from memory from. Instead, we should rely on the fallback code in the memory allocator itself, by using NUMA_NO_NODE. Also, when calling kthread_create_on_node(), use the nearest node with memory to the cpu in question, rather than the node it is running on. Signed-off-by: Nishanth Aravamudan Reviewed-by: Christoph Lameter Acked-by: David Rientjes Cc: Anton Blanchard Cc: Tejun Heo Cc: Oleg Nesterov Cc: Jan Kara Cc: Thomas Gleixner Cc: Tetsuo Handa Cc: Wanpeng Li Cc: Joonsoo Kim Cc: Ben Herrenschmidt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

kthread: make kthread_create() killable

2013-11-13T03:08:59Z

Any user process callers of wait_for_completion() except global init process might be chosen by the OOM killer while waiting for completion() call by some other process which does memory allocation. See CVE-2012-4398 "kernel: request_module() OOM local DoS" can happen. When such users are chosen by the OOM killer when they are waiting for completion() in TASK_UNINTERRUPTIBLE, the system will be kept stressed due to memory starvation because the OOM killer cannot kill such users. kthread_create() is one of such users and this patch fixes the problem for kthreadd by making kthread_create() killable - the same approach used for fixing CVE-2012-4398. Signed-off-by: Tetsuo Handa Cc: Oleg Nesterov Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

kthread: implement probe_kthread_data()

2013-05-01T00:04:02Z

One of the problems that arise when converting dedicated custom threadpool to workqueue is that the shared worker pool used by workqueue anonimizes each worker making it more difficult to identify what the worker was doing on which target from the output of sysrq-t or debug dump from oops, BUG() and friends. For example, after writeback is converted to use workqueue instead of priviate thread pool, there's no easy to tell which backing device a writeback work item was working on at the time of task dump, which, according to our writeback brethren, is important in tracking down issues with a lot of mounted file systems on a lot of different devices. This patchset implements a way for a work function to mark its execution instance so that task dump of the worker task includes information to indicate what the work item was doing. An example WARN dump would look like the following. WARNING: at fs/fs-writeback.c:1015 bdi_writeback_workfn+0x2b4/0x3c0() Modules linked in: CPU: 0 Pid: 28 Comm: kworker/u18:0 Not tainted 3.9.0-rc1-work+ #24 Hardware name: empty empty/S3992, BIOS 080011 10/26/2007 Workqueue: writeback bdi_writeback_workfn (flush-8:16) ffffffff820a3a98 ffff88015b927cb8 ffffffff81c61855 ffff88015b927cf8 ffffffff8108f500 0000000000000000 ffff88007a171948 ffff88007a1716b0 ffff88015b49df00 ffff88015b8d3940 0000000000000000 ffff88015b927d08 Call Trace: [] dump_stack+0x19/0x1b [] warn_slowpath_common+0x70/0xa0 ... This patch: Implement probe_kthread_data() which returns kthread_data if accessible. The function is equivalent to kthread_data() except that the specified @task may not be a kthread or its vfork_done is already cleared rendering struct kthread inaccessible. In the former case, probe_kthread_data() may return any value. In the latter, NULL. This will be used to safely print debug information without affecting synchronization in the normal paths. Workqueue debug info printing on dump_stack() and friends will make use of it. Signed-off-by: Tejun Heo Cc: Oleg Nesterov Acked-by: Jan Kara Cc: Dave Chinner Cc: Ingo Molnar Cc: Jens Axboe Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

Merge branch 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

2013-04-30T02:07:40Z

Pull workqueue updates from Tejun Heo: "A lot of activities on workqueue side this time. The changes achieve the followings. - WQ_UNBOUND workqueues - the workqueues which are per-cpu - are updated to be able to interface with multiple backend worker pools. This involved a lot of churning but the end result seems actually neater as unbound workqueues are now a lot closer to per-cpu ones. - The ability to interface with multiple backend worker pools are used to implement unbound workqueues with custom attributes. Currently the supported attributes are the nice level and CPU affinity. It may be expanded to include cgroup association in future. The attributes can be specified either by calling apply_workqueue_attrs() or through /sys/bus/workqueue/WQ_NAME/* if the workqueue in question is exported through sysfs. The backend worker pools are keyed by the actual attributes and shared by any workqueues which share the same attributes. When attributes of a workqueue are changed, the workqueue binds to the worker pool with the specified attributes while leaving the work items which are already executing in its previous worker pools alone. This allows converting custom worker pool implementations which want worker attribute tuning to use workqueues. The writeback pool is already converted in block tree and there are a couple others are likely to follow including btrfs io workers. - WQ_UNBOUND's ability to bind to multiple worker pools is also used to make it NUMA-aware. Because there's no association between work item issuer and the specific worker assigned to execute it, before this change, using unbound workqueue led to unnecessary cross-node bouncing and it couldn't be helped by autonuma as it requires tasks to have implicit node affinity and workers are assigned randomly. After these changes, an unbound workqueue now binds to multiple NUMA-affine worker pools so that queued work items are executed in the same node. This is turned on by default but can be disabled system-wide or for individual workqueues. Crypto was requesting NUMA affinity as encrypting data across different nodes can contribute noticeable overhead and doing it per-cpu was too limiting for certain cases and IO throughput could be bottlenecked by one CPU being fully occupied while others have idle cycles. While the new features required a lot of changes including restructuring locking, it didn't complicate the execution paths much. The unbound workqueue handling is now closer to per-cpu ones and the new features are implemented by simply associating a workqueue with different sets of backend worker pools without changing queue, execution or flush paths. As such, even though the amount of change is very high, I feel relatively safe in that it isn't likely to cause subtle issues with basic correctness of work item execution and handling. If something is wrong, it's likely to show up as being associated with worker pools with the wrong attributes or OOPS while workqueue attributes are being changed or during CPU hotplug. While this creates more backend worker pools, it doesn't add too many more workers unless, of course, there are many workqueues with unique combinations of attributes. Assuming everything else is the same, NUMA awareness costs an extra worker pool per NUMA node with online CPUs. There are also a couple things which are being routed outside the workqueue tree. - block tree pulled in workqueue for-3.10 so that writeback worker pool can be converted to unbound workqueue with sysfs control exposed. This simplifies the code, makes writeback workers NUMA-aware and allows tuning nice level and CPU affinity via sysfs. - The conversion to workqueue means that there's no 1:1 association between a specific worker, which makes writeback folks unhappy as they want to be able to tell which filesystem caused a problem from backtrace on systems with many filesystems mounted. This is resolved by allowing work items to set debug info string which is printed when the task is dumped. As this change involves unifying implementations of dump_stack() and friends in arch codes, it's being routed through Andrew's -mm tree." * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (84 commits) workqueue: use kmem_cache_free() instead of kfree() workqueue: avoid false negative WARN_ON() in destroy_workqueue() workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity workqueue: implement NUMA affinity for unbound workqueues workqueue: introduce put_pwq_unlocked() workqueue: introduce numa_pwq_tbl_install() workqueue: use NUMA-aware allocation for pool_workqueues workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq() workqueue: map an unbound workqueues to multiple per-node pool_workqueues workqueue: move hot fields of workqueue_struct to the end workqueue: make workqueue->name[] fixed len workqueue: add workqueue->unbound_attrs workqueue: determine NUMA node of workers accourding to the allowed cpumask workqueue: drop 'H' from kworker names of unbound worker pools workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[] workqueue: move pwq_pool_locking outside of get/put_unbound_pool() workqueue: fix memory leak in apply_workqueue_attrs() workqueue: fix unbound workqueue attrs hashing / comparison workqueue: fix race condition in unbound workqueue free path workqueue: remove pwq_lock which is no longer used ...

kthread: kill task_get_live_kthread()

2013-04-29T22:54:25Z

task_get_live_kthread() looks confusing and unneeded. It does get_task_struct() but only kthread_stop() needs this, it can be called even if the calller doesn't have a reference when we know that this kthread can't exit until we do kthread_stop(). kthread_park() and kthread_unpark() do not need get_task_struct(), the callers already have the reference. And it can not help if we can race with the exiting kthread anyway, kthread_park() can hang forever in this case. Change kthread_park() and kthread_unpark() to use to_live_kthread(), change kthread_stop() to do get_task_struct() by hand and remove task_get_live_kthread(). Signed-off-by: Oleg Nesterov Cc: Thomas Gleixner Cc: Namhyung Kim Cc: "Paul E. McKenney" Cc: Peter Zijlstra Cc: Rusty Russell Cc: "Srivatsa S. Bhat" Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

kthread: introduce to_live_kthread()

2013-04-29T22:54:25Z

"k->vfork_done != NULL" with a barrier() after to_kthread(k) in task_get_live_kthread(k) looks unclear, and sub-optimal because we load ->vfork_done twice. All we need is to ensure that we do not return to_kthread(NULL). Add a new trivial helper which loads/checks ->vfork_done once, this also looks more understandable. Signed-off-by: Oleg Nesterov Cc: Thomas Gleixner Cc: Namhyung Kim Cc: "Paul E. McKenney" Cc: Peter Zijlstra Cc: Rusty Russell Cc: "Srivatsa S. Bhat" Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds