<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/kernel/sched, branch v4.19.261</title>
<subtitle>Linux Kernel</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.19.261</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.19.261'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2022-09-05T08:26:28Z</updated>
<entry>
<title>sched/deadline: Fix priority inheritance with multiple scheduling classes</title>
<updated>2022-09-05T08:26:28Z</updated>
<author>
<name>Juri Lelli</name>
<email>juri.lelli@redhat.com</email>
</author>
<published>2022-08-22T07:43:47Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3cc3d77dc541181f97f1dc96d2977ef8359fd760'/>
<id>urn:sha1:3cc3d77dc541181f97f1dc96d2977ef8359fd760</id>
<content type='text'>
commit 2279f540ea7d05f22d2f0c4224319330228586bc upstream.

Glenn reported that "an application [he developed produces] a BUG in
deadline.c when a SCHED_DEADLINE task contends with CFS tasks on nested
PTHREAD_PRIO_INHERIT mutexes.  I believe the bug is triggered when a CFS
task that was boosted by a SCHED_DEADLINE task boosts another CFS task
(nested priority inheritance).

 ------------[ cut here ]------------
 kernel BUG at kernel/sched/deadline.c:1462!
 invalid opcode: 0000 [#1] PREEMPT SMP
 CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: ...
 Hardware name: ...
 RIP: 0010:enqueue_task_dl+0x335/0x910
 Code: ...
 RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002
 RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500
 RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600
 RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078
 R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600
 R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600
 FS:  00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  ? intel_pstate_update_util_hwp+0x13/0x170
  rt_mutex_setprio+0x1cc/0x4b0
  task_blocks_on_rt_mutex+0x225/0x260
  rt_spin_lock_slowlock_locked+0xab/0x2d0
  rt_spin_lock_slowlock+0x50/0x80
  hrtimer_grab_expiry_lock+0x20/0x30
  hrtimer_cancel+0x13/0x30
  do_nanosleep+0xa0/0x150
  hrtimer_nanosleep+0xe1/0x230
  ? __hrtimer_init_sleeper+0x60/0x60
  __x64_sys_nanosleep+0x8d/0xa0
  do_syscall_64+0x4a/0x100
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 RIP: 0033:0x7fa58b52330d
 ...
 ---[ end trace 0000000000000002 ]---

He also provided a simple reproducer creating the situation below:

 So the execution order of locking steps is the following
 (N1 and N2 are non-deadline tasks, D1 is a deadline task, and M1 and
 M2 are mutexes with priority inheritance enabled.)

 Time moves forward as this timeline goes down:

 N1              N2               D1
 |               |                |
 |               |                |
 Lock(M1)        |                |
 |               |                |
 |             Lock(M2)           |
 |               |                |
 |               |              Lock(M2)
 |               |                |
 |             Lock(M1)           |
 |             (!!bug triggered!) |
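
A compilable sketch of a reproducer along these lines (hypothetical and
simplified, not Glenn's actual program: the helper names are made up,
SCHED_DEADLINE needs root, glibc has no sched_setattr() wrapper, and
the sleeps stand in for real synchronization):

 #define _GNU_SOURCE
 #include &lt;pthread.h&gt;
 #include &lt;stdint.h&gt;
 #include &lt;unistd.h&gt;
 #include &lt;sys/syscall.h&gt;

 #define SCHED_DEADLINE	6	/* value from the uapi headers */

 struct sched_attr {		/* manual definition, no glibc wrapper */
 	uint32_t size, sched_policy;
 	uint64_t sched_flags;
 	int32_t  sched_nice;
 	uint32_t sched_priority;
 	uint64_t sched_runtime, sched_deadline, sched_period;
 };

 static pthread_mutex_t m1, m2;		/* M1, M2: PI mutexes */

 static void pi_init(pthread_mutex_t *m)
 {
 	pthread_mutexattr_t a;

 	pthread_mutexattr_init(&amp;a);
 	pthread_mutexattr_setprotocol(&amp;a, PTHREAD_PRIO_INHERIT);
 	pthread_mutex_init(m, &amp;a);
 }

 static void *n2_fn(void *arg)		/* N2 (CFS): Lock(M2), Lock(M1) */
 {
 	pthread_mutex_lock(&amp;m2);
 	sleep(1);			/* let D1 block on M2 and boost us */
 	pthread_mutex_lock(&amp;m1);	/* nested PI: boosting N1 triggers the bug */
 	return NULL;
 }

 static void *d1_fn(void *arg)		/* D1 (DEADLINE): Lock(M2) */
 {
 	struct sched_attr attr = {
 		.size = sizeof(attr), .sched_policy = SCHED_DEADLINE,
 		.sched_runtime  = 10000000,	/* 10 ms */
 		.sched_deadline = 30000000,	/* 30 ms */
 		.sched_period   = 30000000,
 	};

 	syscall(SYS_sched_setattr, 0, &amp;attr, 0);
 	pthread_mutex_lock(&amp;m2);	/* blocks: N2 owns M2, N2 gets boosted */
 	return NULL;
 }

 int main(void)				/* main() plays N1: Lock(M1) */
 {
 	pthread_t n2, d1;

 	pi_init(&amp;m1);
 	pi_init(&amp;m2);
 	pthread_mutex_lock(&amp;m1);
 	pthread_create(&amp;n2, NULL, n2_fn, NULL);
 	usleep(100000);			/* let N2 own M2 before D1 runs */
 	pthread_create(&amp;d1, NULL, d1_fn, NULL);
 	sleep(5);
 	return 0;
 }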

Daniel reported a similar situation as well, triggered by simply letting
ksoftirqd run with DEADLINE (and eventually block on a mutex).

The problem is that boosted entities (Priority Inheritance) use the
static DEADLINE parameters of the top priority waiter. However, the top
waiter may be a non-DEADLINE entity that is currently boosted by a
DEADLINE entity from a different lock chain (i.e., nested priority
chains involving entities of non-DEADLINE classes). In this case, the
top waiter's static DEADLINE parameters can be null (initialized to 0
at fork()) and replenish_dl_entity() then hits a BUG().

Fix this by keeping track of the original donor and using its parameters
when a task is boosted.
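
A sketch of the resulting scheme (paraphrased from the upstream patch,
not the verbatim v4.19 backport): each sched_dl_entity gains a pi_se
pointer naming the entity whose parameters it currently runs with, and
a nested boost propagates the original donor rather than the
intermediate waiter:

 static inline bool is_dl_boosted(struct sched_dl_entity *dl_se)
 {
 	/* Boosted iff running with some other entity's parameters. */
 	return dl_se-&gt;pi_se != dl_se;
 }

 /* In rt_mutex_setprio(): inherit the donor's donor, so a nested
  * chain always ends at a real DEADLINE entity with valid parameters. */
 	p-&gt;dl.pi_se = pi_task-&gt;dl.pi_se;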

Reported-by: Glenn Elliott &lt;glenn@aurora.tech&gt;
Reported-by: Daniel Bristot de Oliveira &lt;bristot@redhat.com&gt;
Signed-off-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Tested-by: Daniel Bristot de Oliveira &lt;bristot@redhat.com&gt;
Link: https://lkml.kernel.org/r/20201117061432.517340-1-juri.lelli@redhat.com
[Ankit: Regenerated the patch for v4.19.y]
Signed-off-by: Ankit Jain &lt;ankitja@vmware.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>sched/deadline: Fix stale throttling on de-/boosted tasks</title>
<updated>2022-09-05T08:26:28Z</updated>
<author>
<name>Lucas Stach</name>
<email>l.stach@pengutronix.de</email>
</author>
<published>2022-08-22T07:43:46Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3f5116143862b892aacbf104048533344fb6c0ad'/>
<id>urn:sha1:3f5116143862b892aacbf104048533344fb6c0ad</id>
<content type='text'>
commit 46fcc4b00c3cca8adb9b7c9afdd499f64e427135 upstream.

When a boosted task gets throttled, what normally happens is that it's
immediately enqueued again with ENQUEUE_REPLENISH, which replenishes the
runtime and clears the dl_throttled flag. There is a special case however:
if the throttling happened on sched-out and the task has been deboosted in
the meantime, the replenish is skipped as the task will return to its
normal scheduling class. This leaves the task with the dl_throttled flag
set.

Now if the task gets boosted up to the deadline scheduling class again
while it is sleeping, it's still in the throttled state. The normal wakeup
however will enqueue the task with ENQUEUE_REPLENISH not set, so we don't
actually place it on the rq. Thus we end up with a task that is
runnable but not actually on the rq: neither an immediate replenishment
happens, nor is the replenishment timer set up, so the task is stuck in
forever-throttled limbo.

Clear the dl_throttled flag before dropping back to the normal scheduling
class to fix this issue.
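
The shape of the fix in enqueue_task_dl(), as a sketch (paraphrased,
not the verbatim hunk; in v4.19 the boost state is the dl_boosted
flag):

 	} else if (!dl_prio(p-&gt;normal_prio)) {
 		/*
 		 * The task is going to return to its normal scheduling
 		 * class after the deboost, so replenishing is pointless;
 		 * but the throttle flag must not be left behind, or a
 		 * later boost wakes the task up still throttled with no
 		 * way to replenish.
 		 */
 		p-&gt;dl.dl_throttled = 0;
 		BUG_ON(!p-&gt;dl.dl_boosted || flags != ENQUEUE_REPLENISH);
 		return;
 	}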

Signed-off-by: Lucas Stach &lt;l.stach@pengutronix.de&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Link: https://lkml.kernel.org/r/20200831110719.2126930-1-l.stach@pengutronix.de
[Ankit: Regenerated the patch for v4.19.y]
Signed-off-by: Ankit Jain &lt;ankitja@vmware.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>sched/deadline: Unthrottle PI boosted threads while enqueuing</title>
<updated>2022-09-05T08:26:28Z</updated>
<author>
<name>Daniel Bristot de Oliveira</name>
<email>bristot@redhat.com</email>
</author>
<published>2022-08-22T07:43:45Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f0e1c1d8ff908a39dd42e723d08f104505dfa601'/>
<id>urn:sha1:f0e1c1d8ff908a39dd42e723d08f104505dfa601</id>
<content type='text'>
commit feff2e65efd8d84cf831668e182b2ce73c604bbb upstream.

stress-ng has a test (stress-ng --cyclic) that creates a set of threads
under SCHED_DEADLINE with the following parameters:

    dl_runtime   =  10000 (10 us)
    dl_deadline  = 100000 (100 us)
    dl_period    = 100000 (100 us)

These parameters are very aggressive. On a system without HRTICK
enabled, these threads can easily execute for longer than dl_runtime
because throttling happens at 1/HZ resolution.

During the main part of the test, the system works just fine because
the workload does not try to run over the 10 us. The problem happens at
the end of the test, on the exit() path. During exit(), the threads need
to do some cleanups that require real-time mutex locks, mainly those
related to memory management, resulting in this scenario:

Note: locks are rt_mutexes...
 ------------------------------------------------------------------------
    TASK A:		TASK B:				TASK C:
    activation
							activation
			activation

    lock(a): OK!	lock(b): OK!
    			&lt;overrun runtime&gt;
    			lock(a)
    			-&gt; block (task A owns it)
			  -&gt; self notice/set throttled
 +--&lt;			  -&gt; arm replenished timer
 |    			switch-out
 |    							lock(b)
 |    							-&gt; &lt;C prio &gt; B prio&gt;
 |    							-&gt; boost TASK B
 |  unlock(a)						switch-out
 |  -&gt; hand lock a to B
 |    -&gt; wakeup(B)
 |      -&gt; B is throttled:
 |        -&gt; do not enqueue
 |     switch-out
 |
 |
 +---------------------&gt; replenishment timer
			-&gt; TASK B is boosted:
			  -&gt; do not enqueue
 ------------------------------------------------------------------------

BOOM: TASK B is runnable but !enqueued, still holding the lock TASK C
is waiting for: the system locks up, with TASK C reported as a hung
task.

This problem is avoided by removing the throttle state from the boosted
thread while boosting it (by TASK A in the example above), allowing it to
be queued and run boosted.

The next replenishment will take care of the runtime overrun, pushing
the deadline further away. See the "while (dl_se-&gt;runtime &lt;= 0)" on
replenish_dl_entity() for more information.
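
The idea, as a sketch of the enqueue path (paraphrased from the
upstream patch; v4.19 uses the dl_boosted flag rather than the later
pi_se scheme):

 	if (p-&gt;dl.dl_boosted &amp;&amp; p-&gt;dl.dl_throttled) {
 		/*
 		 * The boost overrides the throttle: cancel the
 		 * replenishment timer (harmless if it fires concurrently,
 		 * boosted entities are ignored there) and clear the flag
 		 * so the task can be enqueued and run boosted.
 		 */
 		hrtimer_try_to_cancel(&amp;p-&gt;dl.dl_timer);
 		p-&gt;dl.dl_throttled = 0;
 	}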

Reported-by: Mark Simmons &lt;msimmons@redhat.com&gt;
Signed-off-by: Daniel Bristot de Oliveira &lt;bristot@redhat.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Tested-by: Mark Simmons &lt;msimmons@redhat.com&gt;
Link: https://lkml.kernel.org/r/5076e003450835ec74e6fa5917d02c4fa41687e6.1600170294.git.bristot@redhat.com
[Ankit: Regenerated the patch for v4.19.y]
Signed-off-by: Ankit Jain &lt;ankitja@vmware.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt()</title>
<updated>2022-08-25T09:15:02Z</updated>
<author>
<name>Nicolas Saenz Julienne</name>
<email>nsaenzju@redhat.com</email>
</author>
<published>2022-06-28T09:22:59Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=952146183df68a7fb29c3c29d56ffdc941fbdc39'/>
<id>urn:sha1:952146183df68a7fb29c3c29d56ffdc941fbdc39</id>
<content type='text'>
[ Upstream commit 5c66d1b9b30f737fcef85a0b75bfe0590e16b62a ]

dequeue_task_rt() only decrements 'rt_rq-&gt;rt_nr_running' after having
called sched_update_tick_dependency(), preventing the latter from
re-enabling the tick on systems that no longer have pending SCHED_RT
tasks but still have multiple runnable SCHED_OTHER tasks:

  dequeue_task_rt()
    dequeue_rt_entity()
      dequeue_rt_stack()
        dequeue_top_rt_rq()
	  sub_nr_running()	// decrements rq-&gt;nr_running
	    sched_update_tick_dependency()
	      sched_can_stop_tick()	// checks rq-&gt;rt.rt_nr_running,
	      ...
        __dequeue_rt_entity()
          dec_rt_tasks()	// decrements rq-&gt;rt.rt_nr_running
	  ...

Every other scheduler class performs the operation in the opposite
order, and sched_update_tick_dependency() expects the values to be
updated as such. So avoid the misbehaviour by inverting the order in
which the above operations are performed in the RT scheduler.
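
A sketch of the reordering in dequeue_rt_stack() (paraphrased from the
upstream patch; helper signatures may differ slightly in v4.19):

 	/* Snapshot the count, dequeue the entities first (this is what
 	 * drops rt_rq-&gt;rt_nr_running), and only then let the core
 	 * re-evaluate the tick dependency: */
 	unsigned int rt_nr_running = rt_rq_of_se(back)-&gt;rt_nr_running;

 	for (rt_se = back; rt_se; rt_se = rt_se-&gt;back) {
 		if (on_rt_rq(rt_se))
 			__dequeue_rt_entity(rt_se, flags);
 	}

 	dequeue_top_rt_rq(rt_rq_of_se(back), rt_nr_running);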

Fixes: 76d92ac305f2 ("sched: Migrate sched to use new tick dependency mask model")
Signed-off-by: Nicolas Saenz Julienne &lt;nsaenzju@redhat.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
Reviewed-by: Phil Auld &lt;pauld@redhat.com&gt;
Link: https://lore.kernel.org/r/20220628092259.330171-1-nsaenzju@redhat.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/debug: Remove mpol_get/put and task_lock/unlock from sched_show_numa</title>
<updated>2022-04-15T12:14:44Z</updated>
<author>
<name>Bharata B Rao</name>
<email>bharata@amd.com</email>
</author>
<published>2022-01-18T05:05:15Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6157d4ede6e7fb34c6b243a6c4cb014ce383bbc9'/>
<id>urn:sha1:6157d4ede6e7fb34c6b243a6c4cb014ce383bbc9</id>
<content type='text'>
[ Upstream commit 28c988c3ec29db74a1dda631b18785958d57df4f ]

The older format of /proc/pid/sched printed home node info, which
required the mempolicy and the task lock around mpol_get(). However,
the format has changed since then, so sched_show_numa() no longer needs
the mempolicy argument, the associated mpol_get/put, or the
task_lock/unlock. Remove them.
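
Roughly what remains afterwards, as a sketch (not the verbatim result):

 static void sched_show_numa(struct task_struct *p, struct seq_file *m)
 {
 #ifdef CONFIG_NUMA_BALANCING
 	if (p-&gt;mm)
 		P(mm-&gt;numa_scan_seq);

 	P(numa_pages_migrated);
 	P(numa_preferred_nid);
 	P(total_numa_faults);
 	SEQ_printf(m, "current_node=%d, numa_group_id=%d\n",
 			task_node(p), task_numa_group_id(p));
 #endif
 }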

Fixes: 397f2378f1361 ("sched/numa: Fix numa balancing stats in /proc/pid/sched")
Signed-off-by: Bharata B Rao &lt;bharata@amd.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Srikar Dronamraju &lt;srikar@linux.vnet.ibm.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Link: https://lore.kernel.org/r/20220118050515.2973-1-bharata@amd.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/topology: Fix sched_domain_topology_level alloc in sched_init_numa()</title>
<updated>2022-03-23T08:10:41Z</updated>
<author>
<name>Dietmar Eggemann</name>
<email>dietmar.eggemann@arm.com</email>
</author>
<published>2021-02-01T09:53:53Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6d93ffef8ccdc1f3da616c71ef1fe580c7a4a3af'/>
<id>urn:sha1:6d93ffef8ccdc1f3da616c71ef1fe580c7a4a3af</id>
<content type='text'>
commit 71e5f6644fb2f3304fcb310145ded234a37e7cc1 upstream.

Commit "sched/topology: Make sched_init_numa() use a set for the
deduplicating sort" allocates 'i + nr_levels (level)' instead of
'i + nr_levels + 1' sched_domain_topology_level.

This led to an Oops (on Arm64 juno with CONFIG_SCHED_DEBUG):

sched_init_domains
  build_sched_domains()
    __free_domain_allocs()
      __sdt_free() {
	...
        for_each_sd_topology(tl)
	  ...
          sd = *per_cpu_ptr(sdd-&gt;sd, j); &lt;--
	  ...
      }
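
The fix restores the extra slot for the terminating (zeroed) entry
that for_each_sd_topology() stops at, as a sketch (paraphrased from
the upstream patch):

 	tl = kzalloc((i + nr_levels + 1) *
 			sizeof(struct sched_domain_topology_level), GFP_KERNEL);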

Signed-off-by: Dietmar Eggemann &lt;dietmar.eggemann@arm.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Tested-by: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Tested-by: Barry Song &lt;song.bao.hua@hisilicon.com&gt;
Link: https://lkml.kernel.org/r/6000e39e-7d28-c360-9cd6-8798fd22a9bf@arm.com
Signed-off-by: dann frazier &lt;dann.frazier@canonical.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>sched/topology: Make sched_init_numa() use a set for the deduplicating sort</title>
<updated>2022-03-23T08:10:41Z</updated>
<author>
<name>Valentin Schneider</name>
<email>valentin.schneider@arm.com</email>
</author>
<published>2021-01-22T12:39:43Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=e9386b547950742ae41536161711531d7eacd1b8'/>
<id>urn:sha1:e9386b547950742ae41536161711531d7eacd1b8</id>
<content type='text'>
commit 620a6dc40754dc218f5b6389b5d335e9a107fd29 upstream.

The deduplicating sort in sched_init_numa() assumes that the first line
in the distance table contains all unique values in the entire table.
I've been trying to pin down what this exactly means for the topology,
but it's not straightforward. For instance, topology.c uses this
example:

  node   0   1   2   3
    0:  10  20  20  30
    1:  20  10  20  20
    2:  20  20  10  20
    3:  30  20  20  10

  0 ----- 1
  |     / |
  |   /   |
  | /     |
  2 ----- 3

Which works out just fine. However, if we swap nodes 0 and 1:

  1 ----- 0
  |     / |
  |   /   |
  | /     |
  2 ----- 3

we get this distance table:

  node   0  1  2  3
    0:  10 20 20 20
    1:  20 10 20 30
    2:  20 20 10 20
    3:  20 30 20 10

Which breaks the deduplicating sort (non-representative first line). In
this case this would just be a renumbering exercise, but it so happens that
we can have a deduplicating sort that goes through the whole table in O(n²)
at the extra cost of a temporary memory allocation (i.e. any form of set).

The ACPI spec (SLIT) specifies that distances are encoded on 8 bits.
Following this, implement the set as a 256-bit bitmap. Should this not
be satisfactory (i.e. we want to support 32-bit values), we'll have to
go for some other sparse set implementation.

This has the added benefit of letting us allocate just the right amount of
memory for sched_domains_numa_distance[], rather than an arbitrary
(nr_node_ids + 1).

Note: the equivalent DT binding (distance-map) decodes distances as
32-bit values.
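
A sketch of the resulting scan (paraphrased from the upstream patch;
NR_DISTANCE_VALUES being 1 &lt;&lt; 8 per the SLIT encoding):

 	DECLARE_BITMAP(distance_map, NR_DISTANCE_VALUES);

 	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
 	for (i = 0; i &lt; nr_node_ids; i++) {
 		for (j = 0; j &lt; nr_node_ids; j++) {
 			int distance = node_distance(i, j);

 			if (distance &lt; LOCAL_DISTANCE ||
 			    distance &gt;= NR_DISTANCE_VALUES)
 				return;		/* bogus distance table */
 			bitmap_set(distance_map, distance, 1);
 		}
 	}
 	nr_levels = bitmap_weight(distance_map, NR_DISTANCE_VALUES);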

Signed-off-by: Valentin Schneider &lt;valentin.schneider@arm.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20210122123943.1217-2-valentin.schneider@arm.com
Signed-off-by: dann frazier &lt;dann.frazier@canonical.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>cputime, cpuacct: Include guest time in user time in cpuacct.stat</title>
<updated>2022-01-27T08:04:30Z</updated>
<author>
<name>Andrey Ryabinin</name>
<email>arbn@yandex-team.com</email>
</author>
<published>2021-11-15T16:46:04Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=952514c8565cf72a966993b473fae1708c3684f3'/>
<id>urn:sha1:952514c8565cf72a966993b473fae1708c3684f3</id>
<content type='text'>
commit 9731698ecb9c851f353ce2496292ff9fcea39dff upstream.

cpuacct.stat in non-root cgroups shows user time without guest time
included in it. This doesn't match the user time shown in the root
cpuacct.stat and /proc/&lt;pid&gt;/stat. This also affects cgroup2's cpu.stat
in the same way.

Make account_guest_time() add user time to the cgroup's cpustat to
fix this.
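
A sketch of the change in account_guest_time() (paraphrased;
task_group_account_field() is the helper the other user-time paths
already use to feed both kcpustat and the cgroup accounting):

 	/* Add guest time to cpustat. */
 	if (task_nice(p) &gt; 0) {
 		task_group_account_field(p, CPUTIME_NICE, cputime);
 		cpustat[CPUTIME_GUEST_NICE] += cputime;
 	} else {
 		task_group_account_field(p, CPUTIME_USER, cputime);
 		cpustat[CPUTIME_GUEST] += cputime;
 	}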

Fixes: ef12fefabf94 ("cpuacct: add per-cgroup utime/stime statistics")
Signed-off-by: Andrey Ryabinin &lt;arbn@yandex-team.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Daniel Jordan &lt;daniel.m.jordan@oracle.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Link: https://lore.kernel.org/r/20211115164607.23784-1-arbn@yandex-team.com
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>sched/rt: Try to restart rt period timer when rt runtime exceeded</title>
<updated>2022-01-27T08:04:19Z</updated>
<author>
<name>Li Hua</name>
<email>hucool.lihua@huawei.com</email>
</author>
<published>2021-12-03T03:36:18Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=e6bc7279b16517fab9ed3cdbd58ad7b08060c246'/>
<id>urn:sha1:e6bc7279b16517fab9ed3cdbd58ad7b08060c246</id>
<content type='text'>
[ Upstream commit 9b58e976b3b391c0cf02e038d53dd0478ed3013c ]

When rt_runtime is modified from -1 to a valid control value, it may
cause the task to be throttled all the time. A sequence of operations
like the following triggers the bug:

  1. echo -1 &gt; /proc/sys/kernel/sched_rt_runtime_us
  2. Run a FIFO task named A that executes while(1)
  3. echo 950000 &gt; /proc/sys/kernel/sched_rt_runtime_us

When rt_runtime is -1, the rt period timer is not activated when task A
is enqueued. The task is then throttled after rt_runtime is set to
950,000, and it stays throttled forever because the rt period timer is
never activated.
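
The idea of the fix, sketched in sched_rt_runtime_exceeded()
(paraphrased; the upstream patch factors the timer arming into a
do_start_rt_bandwidth() helper so it can also be called from here):

 	if (rt_rq-&gt;rt_time &gt; runtime) {
 		struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);

 		if (likely(rt_b-&gt;rt_runtime)) {
 			rt_rq-&gt;rt_throttled = 1;
 			printk_deferred_once("sched: RT throttling activated\n");

 			/* The period timer may never have been started
 			 * (runtime was RUNTIME_INF at enqueue time), so
 			 * try to (re)arm it now: */
 			do_start_rt_bandwidth(rt_b);
 		}
 		...
 	}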

Fixes: d0b27fa77854 ("sched: rt-group: synchonised bandwidth period")
Reported-by: Hulk Robot &lt;hulkci@huawei.com&gt;
Signed-off-by: Li Hua &lt;hucool.lihua@huawei.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20211203033618.11895-1-hucool.lihua@huawei.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>wait: add wake_up_pollfree()</title>
<updated>2021-12-14T09:18:06Z</updated>
<author>
<name>Eric Biggers</name>
<email>ebiggers@google.com</email>
</author>
<published>2021-12-10T23:53:08Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=8dd7c46a59756bdc29cb9783338b899cd3fb4b83'/>
<id>urn:sha1:8dd7c46a59756bdc29cb9783338b899cd3fb4b83</id>
<content type='text'>
commit 42288cb44c4b5fff7653bc392b583a2b8bd6a8c0 upstream.

Several -&gt;poll() implementations are special in that they use a
waitqueue whose lifetime is the current task, rather than the struct
file as is normally the case.  This is okay for blocking polls, since a
blocking poll occurs within one task; however, non-blocking polls
require another solution.  This solution is for the queue to be cleared
before it is freed, using 'wake_up_poll(wq, EPOLLHUP | POLLFREE);'.

However, that has a bug: wake_up_poll() calls __wake_up() with
nr_exclusive=1.  Therefore, if there are multiple "exclusive" waiters,
and the wakeup function for the first one returns a positive value, only
that one will be called.  That's *not* what's needed for POLLFREE;
POLLFREE is special in that it really needs to wake up everyone.

Considering the three non-blocking poll systems:

- io_uring poll doesn't handle POLLFREE at all, so it is broken anyway.

- aio poll is unaffected, since it doesn't support exclusive waits.
  However, that's fragile, as someone could add this feature later.

- epoll doesn't appear to be broken by this, since its wakeup function
  returns 0 when it sees POLLFREE.  But this is fragile.

Although there is a workaround (see epoll), it's better to define a
function which always sends POLLFREE to all waiters.  Add such a
function.  Also make it verify that the queue really becomes empty after
all waiters have been woken up.
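
The core of such a function, sketched (paraphrased from the upstream
patch): passing nr_exclusive=0 wakes every waiter, exclusive or not:

 void __wake_up_pollfree(struct wait_queue_head *wq_head)
 {
 	__wake_up(wq_head, TASK_NORMAL, 0, poll_to_key(EPOLLHUP | POLLFREE));
 	/* POLLFREE must only be sent when the queue is about to go
 	 * away, so no waiter may remain on it at this point. */
 	WARN_ON_ONCE(waitqueue_active(wq_head));
 }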

Reported-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211209010455.42744-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers &lt;ebiggers@google.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
</feed>
