<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/kernel/sched/wait.c, branch v4.14.45</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.14.45</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.14.45'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2018-02-16T19:22:43Z</updated>
<entry>
<title>sched/wait: Fix add_wait_queue() behavioral change</title>
<updated>2018-02-16T19:22:43Z</updated>
<author>
<name>Omar Sandoval</name>
<email>osandov@fb.com</email>
</author>
<published>2017-12-06T07:15:31Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=e94a7de2a3d2da42f0d124c7501195f1cdffa232'/>
<id>urn:sha1:e94a7de2a3d2da42f0d124c7501195f1cdffa232</id>
<content type='text'>
commit c6b9d9a33029014446bd9ed84c1688f6d3d4eab9 upstream.

The following cleanup commit:

  50816c48997a ("sched/wait: Standardize internal naming of wait-queue entries")

... unintentionally changed the behavior of add_wait_queue() from
inserting the wait entry at the head of the wait queue to the tail
of the wait queue.

Beyond a negative performance impact this change in behavior
theoretically also breaks wait queues which mix exclusive and
non-exclusive waiters, as non-exclusive waiters will not be
woken up if they are queued behind enough exclusive waiters.

Signed-off-by: Omar Sandoval &lt;osandov@fb.com&gt;
Reviewed-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Acked-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: kernel-team@fb.com
Fixes: ("sched/wait: Standardize internal naming of wait-queue entries")
Link: http://lkml.kernel.org/r/a16c8ccffd39bd08fdaa45a5192294c784b803a7.1512544324.git.osandov@fb.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>sched/wait: Introduce wakeup boomark in wake_up_page_bit</title>
<updated>2017-09-14T16:56:18Z</updated>
<author>
<name>Tim Chen</name>
<email>tim.c.chen@linux.intel.com</email>
</author>
<published>2017-08-25T16:13:55Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=11a19c7b099f96d00a8dec52bfbb8475e89b6745'/>
<id>urn:sha1:11a19c7b099f96d00a8dec52bfbb8475e89b6745</id>
<content type='text'>
Now that we have added breaks in the wait queue scan and allow bookmark
on scan position, we put this logic in the wake_up_page_bit function.

We can have very long page wait list in large system where multiple
pages share the same wait list. We break the wake up walk here to allow
other cpus a chance to access the list, and not to disable the interrupts
when traversing the list for too long.  This reduces the interrupt and
rescheduling latency, and excessive page wait queue lock hold time.

[ v2: Remove bookmark_wake_function ]

Signed-off-by: Tim Chen &lt;tim.c.chen@linux.intel.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>sched/wait: Break up long wake list walk</title>
<updated>2017-09-14T16:56:17Z</updated>
<author>
<name>Tim Chen</name>
<email>tim.c.chen@linux.intel.com</email>
</author>
<published>2017-08-25T16:13:54Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=2554db916586b228ce93e6f74a12fd7fe430a004'/>
<id>urn:sha1:2554db916586b228ce93e6f74a12fd7fe430a004</id>
<content type='text'>
We encountered workloads that have very long wake up list on large
systems. A waker takes a long time to traverse the entire wake list and
execute all the wake functions.

We saw page wait list that are up to 3700+ entries long in tests of
large 4 and 8 socket systems. It took 0.8 sec to traverse such list
during wake up. Any other CPU that contends for the list spin lock will
spin for a long time. It is a result of the numa balancing migration of
hot pages that are shared by many threads.

Multiple CPUs waking are queued up behind the lock, and the last one
queued has to wait until all CPUs did all the wakeups.

The page wait list is traversed with interrupt disabled, which caused
various problems. This was the original cause that triggered the NMI
watch dog timer in: https://patchwork.kernel.org/patch/9800303/ . Only
extending the NMI watch dog timer there helped.

This patch bookmarks the waker's scan position in wake list and break
the wake up walk, to allow access to the list before the waker resume
its walk down the rest of the wait list. It lowers the interrupt and
rescheduling latency.

This patch also provides a performance boost when combined with the next
patch to break up page wakeup list walk. We saw 22% improvement in the
will-it-scale file pread2 test on a Xeon Phi system running 256 threads.

[ v2: Merged in Linus' changes to remove the bookmark_wake_function, and
  simply access to flags. ]

Reported-by: Kan Liang &lt;kan.liang@intel.com&gt;
Tested-by: Kan Liang &lt;kan.liang@intel.com&gt;
Signed-off-by: Tim Chen &lt;tim.c.chen@linux.intel.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Minor page waitqueue cleanups</title>
<updated>2017-08-27T20:55:12Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2017-08-27T20:55:12Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3510ca20ece0150af6b10c77a74ff1b5c198e3e2'/>
<id>urn:sha1:3510ca20ece0150af6b10c77a74ff1b5c198e3e2</id>
<content type='text'>
Tim Chen and Kan Liang have been battling a customer load that shows
extremely long page wakeup lists.  The cause seems to be constant NUMA
migration of a hot page that is shared across a lot of threads, but the
actual root cause for the exact behavior has not been found.

Tim has a patch that batches the wait list traversal at wakeup time, so
that we at least don't get long uninterruptible cases where we traverse
and wake up thousands of processes and get nasty latency spikes.  That
is likely 4.14 material, but we're still discussing the page waitqueue
specific parts of it.

In the meantime, I've tried to look at making the page wait queues less
expensive, and failing miserably.  If you have thousands of threads
waiting for the same page, it will be painful.  We'll need to try to
figure out the NUMA balancing issue some day, in addition to avoiding
the excessive spinlock hold times.

That said, having tried to rewrite the page wait queues, I can at least
fix up some of the braindamage in the current situation. In particular:

 (a) we don't want to continue walking the page wait list if the bit
     we're waiting for already got set again (which seems to be one of
     the patterns of the bad load).  That makes no progress and just
     causes pointless cache pollution chasing the pointers.

 (b) we don't want to put the non-locking waiters always on the front of
     the queue, and the locking waiters always on the back.  Not only is
     that unfair, it means that we wake up thousands of reading threads
     that will just end up being blocked by the writer later anyway.

Also add a comment about the layout of 'struct wait_page_key' - there is
an external user of it in the cachefiles code that means that it has to
match the layout of 'struct wait_bit_key' in the two first members.  It
so happens to match, because 'struct page *' and 'unsigned long *' end
up having the same values simply because the page flags are the first
member in struct page.

Cc: Tim Chen &lt;tim.c.chen@linux.intel.com&gt;
Cc: Kan Liang &lt;kan.liang@intel.com&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Christopher Lameter &lt;cl@linux.com&gt;
Cc: Andi Kleen &lt;ak@linux.intel.com&gt;
Cc: Davidlohr Bueso &lt;dave@stgolabs.net&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>sched/wait: Disambiguate wq_entry-&gt;task_list and wq_head-&gt;task_list naming</title>
<updated>2017-06-20T10:19:14Z</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@kernel.org</email>
</author>
<published>2017-06-20T10:06:46Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=2055da97389a605c8a00d163d40903afbe413921'/>
<id>urn:sha1:2055da97389a605c8a00d163d40903afbe413921</id>
<content type='text'>
So I've noticed a number of instances where it was not obvious from the
code whether -&gt;task_list was for a wait-queue head or a wait-queue entry.

Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.

To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:

	struct wait_queue_head::task_list	=&gt; ::head
	struct wait_queue_entry::task_list	=&gt; ::entry

For example, this code:

	rqw-&gt;wait.task_list.next != &amp;wait-&gt;task_list

... is was pretty unclear (to me) what it's doing, while now it's written this way:

	rqw-&gt;wait.head.next != &amp;wait-&gt;entry

... which makes it pretty clear that we are iterating a list until we see the head.

Other examples are:

	list_for_each_entry_safe(pos, next, &amp;x-&gt;task_list, task_list) {
	list_for_each_entry(wq, &amp;fence-&gt;wait.task_list, task_list) {

... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:

	list_for_each_entry_safe(pos, next, &amp;x-&gt;head, entry) {
	list_for_each_entry(wq, &amp;fence-&gt;wait.head, entry) {

Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/wait: Split out the wait_bit*() APIs from &lt;linux/wait.h&gt; into &lt;linux/wait_bit.h&gt;</title>
<updated>2017-06-20T10:19:09Z</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@kernel.org</email>
</author>
<published>2017-06-20T10:19:09Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=5dd43ce2f69d42a71dcacdb13d17d8c0ac1fe8f7'/>
<id>urn:sha1:5dd43ce2f69d42a71dcacdb13d17d8c0ac1fe8f7</id>
<content type='text'>
The wait_bit*() types and APIs are mixed into wait.h, but they
are a pretty orthogonal extension of wait-queues.

Furthermore, only about 50 kernel files use these APIs, while
over 1000 use the regular wait-queue functionality.

So clean up the main wait.h by moving the wait-bit functionality
out of it, into a separate .h and .c file:

  include/linux/wait_bit.h  for types and APIs
  kernel/sched/wait_bit.c   for the implementation

Update all header dependencies.

This reduces the size of wait.h rather significantly, by about 30%.

Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/wait: Standardize wait_bit_queue naming</title>
<updated>2017-06-20T10:18:29Z</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@kernel.org</email>
</author>
<published>2017-03-05T10:35:27Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=76c85ddc4695bb7b8209bfeff11f5156088f9197'/>
<id>urn:sha1:76c85ddc4695bb7b8209bfeff11f5156088f9197</id>
<content type='text'>
So wait-bit-queue head variables are often named:

	struct wait_bit_queue *q

... which is a bit ambiguous and super confusing, because
they clearly suggest wait-queue head semantics and behavior
(they rhyme with the old wait_queue_t *q naming), while they
are extended wait-queue _entries_, not heads!

They are misnomers in two ways:

 - the 'wait_bit_queue' leaves open the question of whether
   it's an entry or a head

 - the 'q' parameter and local variable naming falsely implies
   that it's a 'queue' - while it's an entry.

This resulted in sometimes confusing cases such as:

	finish_wait(wq, &amp;q-&gt;wait);

where the 'q' is not a wait-queue head, but a wait-bit-queue entry.

So improve this all by standardizing wait-bit-queue nomenclature
similar to wait-queue head naming:

	struct wait_bit_queue   =&gt; struct wait_bit_queue_entry
	q			=&gt; wbq_entry

Which makes it all a much clearer:

	struct wait_bit_queue_entry *wbq_entry

... and turns the former confusing piece of code into:

	finish_wait(wq_head, &amp;wbq_entry-&gt;wq_entry;

which IMHO makes it apparently clear what we are doing,
without having to analyze the context of the code: we are
adding a wait-queue entry to a regular wait-queue head,
which entry is embedded in a wait-bit-queue entry.

I'm not a big fan of acronyms, but repeating wait_bit_queue_entry
in field and local variable names is too long, so Hopefully it's
clear enough that 'wq_' prefixes stand for wait-queues, while
'wbq_' prefixes stand for wait-bit-queues.

Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/wait: Standardize 'struct wait_bit_queue' wait-queue entry field name</title>
<updated>2017-06-20T10:18:28Z</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@kernel.org</email>
</author>
<published>2017-03-05T10:25:39Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=2141713616c652aeabf2dd5c1e89bc601c4fed6a'/>
<id>urn:sha1:2141713616c652aeabf2dd5c1e89bc601c4fed6a</id>
<content type='text'>
Rename 'struct wait_bit_queue::wait' to ::wq_entry, to more clearly
name it as a wait-queue entry.

Propagate it to a couple of usage sites where the wait-bit-queue internals
are exposed.

Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/wait: Standardize internal naming of wait-queue heads</title>
<updated>2017-06-20T10:18:28Z</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@kernel.org</email>
</author>
<published>2017-03-05T10:10:18Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=9d9d676f595b5081326be7a17dc681fcb38fb3b2'/>
<id>urn:sha1:9d9d676f595b5081326be7a17dc681fcb38fb3b2</id>
<content type='text'>
The wait-queue head parameters and variables are named in a
couple of ways, we have the following variants currently:

	wait_queue_head_t *q
	wait_queue_head_t *wq
	wait_queue_head_t *head

In particular the 'wq' naming is ambiguous in the sense whether it's
a wait-queue head or entry name - as entries were often named 'wait'.

( Not to mention the confusion of any readers coming over from
  workqueue-land. )

Standardize all this around a single, unambiguous parameter and
variable name:

	struct wait_queue_head *wq_head

which is easy to grep for and also rhymes nicely with the wait-queue
entry naming:

	struct wait_queue_entry *wq_entry

Also rename:

	struct __wait_queue_head =&gt; struct wait_queue_head

... and use this struct type to migrate from typedefs usage to 'struct'
usage, which is more in line with existing kernel practices.

Don't touch any external users and preserve the main wait_queue_head_t
typedef.

Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/wait: Standardize internal naming of wait-queue entries</title>
<updated>2017-06-20T10:18:27Z</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@kernel.org</email>
</author>
<published>2017-03-05T09:33:16Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=50816c48997af857d4bab3dca1aba90339705e96'/>
<id>urn:sha1:50816c48997af857d4bab3dca1aba90339705e96</id>
<content type='text'>
So the various wait-queue entry variables in include/linux/wait.h
and kernel/sched/wait.c are named in a colorfully inconsistent
way:

	wait_queue_entry_t *wait
	wait_queue_entry_t *__wait	(even in plain C code!)
	wait_queue_entry_t *q		(!)
	wait_queue_entry_t *new		(making anyone who knows C++ cringe)
	wait_queue_entry_t *old

I think part of the reason for the inconsistency is the constant
apparent confusion about what a wait queue 'head' versus 'entry' is.

( Some of the documentation talks about a 'wait descriptor', which is
  the wait-queue entry itself - further adding to the confusion. )

The most common name is 'wait', but that in itself is somewhat
ambiguous as well, as it does not really make it clear whether
it's a wait-queue entry or head.

To improve all this name the wait-queue entry structure parameters
and variables consistently and push through this naming into all
the wait.h and wait.c code:

	struct wait_queue_entry *wq_entry

The 'wq_' prefix makes it easy to grep for, and we also use the
opportunity to move away from the typedef to a plain 'struct' naming:
in the kernel we typically reserve typedefs for cases where a
C structure is really small and somewhat opaque - such as pte_t.

wait-queue entries are neither small nor opaque, so use the more
standard 'struct xxx_entry' list management code nomenclature instead.

( We don't touch external users, and we preserve the typedef as well
  for actual wait-queue users, to reduce unnecessary churn. )

Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
</feed>
