<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/kernel/sched, branch v5.19.8</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.19.8</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.19.8'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2022-08-17T13:16:02Z</updated>
<entry>
<title>sched/core: Do not requeue task on CPU excluded from cpus_mask</title>
<updated>2022-08-17T13:16:02Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@techsingularity.net</email>
</author>
<published>2022-08-04T09:21:19Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=302f7b0fc337746f41c69eb08522907f6a90c643'/>
<id>urn:sha1:302f7b0fc337746f41c69eb08522907f6a90c643</id>
<content type='text'>
[ Upstream commit 751d4cbc43879229dbc124afefe240b70fd29a85 ]

The following warning was triggered on a large machine early in boot on
a distribution kernel but the same problem should also affect mainline.

   WARNING: CPU: 439 PID: 10 at ../kernel/workqueue.c:2231 process_one_work+0x4d/0x440
   Call Trace:
    &lt;TASK&gt;
    rescuer_thread+0x1f6/0x360
    kthread+0x156/0x180
    ret_from_fork+0x22/0x30
    &lt;/TASK&gt;

Commit c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p-&gt;on_cpu")
optimises ttwu by queueing a task that is descheduling on the wakelist,
but does not check if the task descheduling is still allowed to run on that CPU.

In this warning, the problematic task is a workqueue rescue thread which
checks if the rescue is for a per-cpu workqueue and running on the wrong CPU.
While this is early in boot and it should be possible to create workers,
the rescue thread may still be used if MAYDAY_INITIAL_TIMEOUT is reached
or MAYDAY_INTERVAL expires; on a sufficiently large machine, the rescue
thread is used frequently.

Tracing confirmed that the task should have migrated properly using the
stopper thread to handle the migration. However, a parallel wakeup from udev
running on another CPU that does not share CPU cache observes p-&gt;on_cpu and
uses task_cpu(p), queues the task on the old CPU and triggers the warning.

Check that the wakee task that is descheduling is still allowed to run
on its current CPU and if not, wait for the descheduling to complete
and select an allowed CPU.
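
A minimal sketch of the shape of the fix, assuming the ttwu_queue_cond()
helper in kernel/sched/core.c (simplified, not the verbatim upstream diff):

    static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
    {
            /* Do not use the async wakelist while the CPU is in hotplug state. */
            if (!cpu_active(cpu))
                    return false;

            /* New: ensure the task is still allowed to run on the CPU. */
            if (!cpumask_test_cpu(cpu, p-&gt;cpus_ptr))
                    return false;

            /* ... existing shared-cache and nr_running checks ... */
    }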

Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p-&gt;on_cpu")
Signed-off-by: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Link: https://lore.kernel.org/r/20220804092119.20137-1-mgorman@techsingularity.net
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle</title>
<updated>2022-08-17T13:16:02Z</updated>
<author>
<name>Tianchen Ding</name>
<email>dtcccc@linux.alibaba.com</email>
</author>
<published>2022-06-08T23:34:12Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=72f5b286d2b756a7e86212ad572a922568a4d62c'/>
<id>urn:sha1:72f5b286d2b756a7e86212ad572a922568a4d62c</id>
<content type='text'>
[ Upstream commit f3dd3f674555bd9455c5ae7fafce0696bd9931b3 ]

The wakelist can help avoid cache bouncing and offload overhead from the
waker cpu. So far, using the wakelist within the same llc only happens on
WF_ON_CPU, and this limitation can be removed to further improve wakeup
performance.

The commit 518cd6234178 ("sched: Only queue remote wakeups when
crossing cache boundaries") disabled queuing tasks on the wakelist when
the cpus share llc. This was because, at that time, the scheduler had to
send IPIs to do ttwu_queue_wakelist. Nowadays, ttwu_queue_wakelist also
supports TIF_POLLING, so this is no longer a problem when the wakee cpu
is idle polling.
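
A rough sketch of the resulting queueing condition, assuming the
ttwu_queue_cond() helper in kernel/sched/core.c (simplified, not the
verbatim upstream code):

    static inline bool ttwu_queue_cond(int cpu)
    {
            /* CPUs that do not share cache always use the wakelist. */
            if (!cpus_share_cache(smp_processor_id(), cpu))
                    return true;

            /* The local cpu handles its own wakeups directly. */
            if (cpu == smp_processor_id())
                    return false;

            /* New: an idle wakee cpu may also take the wakelist path. */
            if (!cpu_rq(cpu)-&gt;nr_running)
                    return true;

            return false;
    }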

Benefits:
  Queuing the task on an idle cpu can help improve performance on the
  waker cpu and utilization on the wakee cpu, and further improves
  locality because the wakee cpu can handle its own rq. This patch helps
  improve response time on our real Java workloads where wakeups happen
  frequently.

  Consider the normal condition (CPU0 and CPU1 share same llc)
  Before this patch:

         CPU0                                       CPU1

    select_task_rq()                                idle
    rq_lock(CPU1-&gt;rq)
    enqueue_task(CPU1-&gt;rq)
    notify CPU1 (by sending IPI or CPU1 polling)

                                                    resched()

  After this patch:

         CPU0                                       CPU1

    select_task_rq()                                idle
    add to wakelist of CPU1
    notify CPU1 (by sending IPI or CPU1 polling)

                                                    rq_lock(CPU1-&gt;rq)
                                                    enqueue_task(CPU1-&gt;rq)
                                                    resched()

  We see that CPU0 can finish its work earlier: it only needs to put the
  task on the wakelist and return. CPU1 is idle anyway, so it can handle
  its own runqueue data itself.

This patch makes no difference to the number of IPIs sent.
  It only takes effect when the wakee cpu is:
  1) idle polling
  2) idle not polling

  For 1), there will be no IPI with or without this patch.

  For 2), there will always be an IPI, both before and after this patch.
  Before this patch: the waker cpu enqueues the task and checks preemption.
  Since "idle" is certain to be preempted, the waker cpu must send a
  resched IPI. After this patch: the waker cpu puts the task on the
  wakelist of the wakee cpu, and sends an IPI.

Benchmark:
We've tested schbench, unixbench, and hackbench on both x86 and arm64.

On x86 (Intel Xeon Platinum 8269CY):
  schbench -m 2 -t 8

    Latency percentiles (usec)              before        after
        50.0000th:                             8            6
        75.0000th:                            10            7
        90.0000th:                            11            8
        95.0000th:                            12            8
        *99.0000th:                           13           10
        99.5000th:                            15           11
        99.9000th:                            18           14

  Unixbench with full threads (104)
                                            before        after
    Dhrystone 2 using register variables  3011862938    3009935994  -0.06%
    Double-Precision Whetstone              617119.3      617298.5   0.03%
    Execl Throughput                         27667.3       27627.3  -0.14%
    File Copy 1024 bufsize 2000 maxblocks   785871.4      784906.2  -0.12%
    File Copy 256 bufsize 500 maxblocks     210113.6      212635.4   1.20%
    File Copy 4096 bufsize 8000 maxblocks  2328862.2     2320529.1  -0.36%
    Pipe Throughput                      145535622.8   145323033.2  -0.15%
    Pipe-based Context Switching           3221686.4     3583975.4  11.25%
    Process Creation                        101347.1      103345.4   1.97%
    Shell Scripts (1 concurrent)            120193.5      123977.8   3.15%
    Shell Scripts (8 concurrent)             17233.4       17138.4  -0.55%
    System Call Overhead                   5300604.8     5312213.6   0.22%

  hackbench -g 1 -l 100000
                                            before        after
    Time                                     3.246        2.251

On arm64 (Ampere Altra):
  schbench -m 2 -t 8

    Latency percentiles (usec)              before        after
        50.0000th:                            14           10
        75.0000th:                            19           14
        90.0000th:                            22           16
        95.0000th:                            23           16
        *99.0000th:                           24           17
        99.5000th:                            24           17
        99.9000th:                            28           25

  Unixbench with full threads (80)
                                            before        after
    Dhrystone 2 using register variables  3536194249    3537019613   0.02%
    Double-Precision Whetstone              629383.6      629431.6   0.01%
    Execl Throughput                         65920.5       65846.2  -0.11%
    File Copy 1024 bufsize 2000 maxblocks  1063722.8     1064026.8   0.03%
    File Copy 256 bufsize 500 maxblocks     322684.5      318724.5  -1.23%
    File Copy 4096 bufsize 8000 maxblocks  2348285.3     2328804.8  -0.83%
    Pipe Throughput                      133542875.3   131619389.8  -1.44%
    Pipe-based Context Switching           3215356.1     3576945.1  11.25%
    Process Creation                        108520.5      120184.6  10.75%
    Shell Scripts (1 concurrent)            122636.3        121888  -0.61%
    Shell Scripts (8 concurrent)             17462.1       17381.4  -0.46%
    System Call Overhead                   4429998.9     4435006.7   0.11%

  hackbench -g 1 -l 100000
                                            before        after
    Time                                     4.217        2.916

Our patch improves schbench, hackbench and the Pipe-based Context
Switching test of unixbench when idle cpus exist, with no obvious
regression on the other unixbench tests. This can help improve response
time in scenarios where wakeups happen frequently.

Signed-off-by: Tianchen Ding &lt;dtcccc@linux.alibaba.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
Link: https://lore.kernel.org/r/20220608233412.327341-3-dtcccc@linux.alibaba.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched: Fix the check of nr_running at queue wakelist</title>
<updated>2022-08-17T13:16:02Z</updated>
<author>
<name>Tianchen Ding</name>
<email>dtcccc@linux.alibaba.com</email>
</author>
<published>2022-06-08T23:34:11Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=31ef9288a5938ebbf577b10b40087d5745be544f'/>
<id>urn:sha1:31ef9288a5938ebbf577b10b40087d5745be544f</id>
<content type='text'>
[ Upstream commit 28156108fecb1f808b21d216e8ea8f0d205a530c ]

The commit 2ebb17717550 ("sched/core: Offload wakee task activation if it
the wakee is descheduling") checked rq-&gt;nr_running &lt;= 1 to avoid task
stacking when WF_ON_CPU.

Per the ordering of writes to p-&gt;on_rq and p-&gt;on_cpu, observing p-&gt;on_cpu
(WF_ON_CPU) in ttwu_queue_cond() implies !p-&gt;on_rq, IOW p has gone through
the deactivate_task() in __schedule(), thus p has been accounted out of
rq-&gt;nr_running. As such, the task being the only runnable task on the rq
implies reading rq-&gt;nr_running == 0 at that point.
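
A sketch of the tightened check, assuming the wakelist condition in
ttwu_queue_cond() (simplified, not the verbatim diff):

    /* Before: a bound that can never be anything but 0 under WF_ON_CPU. */
    if (sched_feat(TTWU_QUEUE_ON_CPU) &amp;&amp; (wake_flags &amp; WF_ON_CPU) &amp;&amp;
        cpu_rq(cpu)-&gt;nr_running &lt;= 1)
            return true;

    /* After: WF_ON_CPU implies the wakee is already accounted out. */
    if (sched_feat(TTWU_QUEUE_ON_CPU) &amp;&amp; (wake_flags &amp; WF_ON_CPU) &amp;&amp;
        !cpu_rq(cpu)-&gt;nr_running)
            return true;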

The benchmark result is in [1].

[1] https://lore.kernel.org/all/e34de686-4e85-bde1-9f3c-9bbc86b38627@linux.alibaba.com/

Suggested-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
Signed-off-by: Tianchen Ding &lt;dtcccc@linux.alibaba.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
Link: https://lore.kernel.org/r/20220608233412.327341-2-dtcccc@linux.alibaba.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched, cpuset: Fix dl_cpu_busy() panic due to empty cs-&gt;cpus_allowed</title>
<updated>2022-08-17T13:16:01Z</updated>
<author>
<name>Waiman Long</name>
<email>longman@redhat.com</email>
</author>
<published>2022-08-03T01:54:51Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f56607b44c9896e51678a7e8cdd3a5479f4b4548'/>
<id>urn:sha1:f56607b44c9896e51678a7e8cdd3a5479f4b4548</id>
<content type='text'>
[ Upstream commit b6e8d40d43ae4dec00c8fea2593eeea3114b8f44 ]

With cgroup v2, the cpuset's cpus_allowed mask can be empty, indicating
that the cpuset will just use the effective CPUs of its parent. So
cpuset_can_attach() can call task_can_attach() with an empty mask.
This can lead to cpumask_any_and() returning nr_cpu_ids, causing the call
to dl_bw_of() to crash due to a percpu value access with an out-of-bounds
CPU value. For example:

	[80468.182258] BUG: unable to handle page fault for address: ffffffff8b6648b0
	  :
	[80468.191019] RIP: 0010:dl_cpu_busy+0x30/0x2b0
	  :
	[80468.207946] Call Trace:
	[80468.208947]  cpuset_can_attach+0xa0/0x140
	[80468.209953]  cgroup_migrate_execute+0x8c/0x490
	[80468.210931]  cgroup_update_dfl_csses+0x254/0x270
	[80468.211898]  cgroup_subtree_control_write+0x322/0x400
	[80468.212854]  kernfs_fop_write_iter+0x11c/0x1b0
	[80468.213777]  new_sync_write+0x11f/0x1b0
	[80468.214689]  vfs_write+0x1eb/0x280
	[80468.215592]  ksys_write+0x5f/0xe0
	[80468.216463]  do_syscall_64+0x5c/0x80
	[80468.224287]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Fix that by using effective_cpus instead. For cgroup v1, effective_cpus
is the same as cpus_allowed. For v2, effective_cpus is the real cpumask
to be used by tasks within the cpuset anyway.

Also update task_can_attach()'s 2nd argument name to cs_effective_cpus to
reflect the change. In addition, a check is added to task_can_attach()
to guard against the possibility that cpumask_any_and() may return a
value &gt;= nr_cpu_ids.
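
A sketch of the guarded lookup, assuming the deadline-bandwidth check in
task_can_attach() (simplified, not the verbatim upstream diff):

    if (dl_task(p)) {
            int cpu = cpumask_any_and(cpu_active_mask, cs_effective_cpus);

            /* Guard against an empty intersection. */
            if (unlikely(cpu &gt;= nr_cpu_ids))
                    return -EINVAL;
            ret = dl_cpu_busy(cpu, p);
    }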

Fixes: 7f51412a415d ("sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets")
Signed-off-by: Waiman Long &lt;longman@redhat.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Acked-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Link: https://lore.kernel.org/r/20220803015451.2219567-1-longman@redhat.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>io_uring: move to separate directory</title>
<updated>2022-08-17T13:14:20Z</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2022-05-23T23:05:03Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=95c74abe404de914464b5685bc00b33e40382b9a'/>
<id>urn:sha1:95c74abe404de914464b5685bc00b33e40382b9a</id>
<content type='text'>
[ Upstream commit ed29b0b4fd835b058ddd151c49d021e28d631ee6 ]

In preparation for splitting io_uring up a bit, move it into its own
top-level directory. It didn't really belong in fs/ anyway, as it's
not a filesystem-only API.

This adds io_uring/ and moves the core files in there, and updates the
MAINTAINERS file for the new location.

Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/numa: Initialise numa_migrate_retry</title>
<updated>2022-08-17T13:14:20Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@techsingularity.net</email>
</author>
<published>2022-05-20T10:35:16Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=87666276569599739f506352bd43d8f4fb773e71'/>
<id>urn:sha1:87666276569599739f506352bd43d8f4fb773e71</id>
<content type='text'>
[ Upstream commit 70ce3ea9aa4ed901c8a90de667df5ef307766e71 ]

On clone, numa_migrate_retry is inherited from the parent, which means
that the first NUMA placement of a task is non-deterministic. This
affects when load balancing recognises NUMA tasks and whether to
migrate "regular", "remote" or "all" tasks between NUMA scheduler
domains.
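
A sketch of the shape of the fix, assuming the fork-time NUMA setup in
kernel/sched/fair.c (simplified, not the verbatim diff):

    static void init_numa_balancing(unsigned long clone_flags,
                                    struct task_struct *p)
    {
            ...
            /* Do not inherit the parent's retry deadline on clone. */
            p-&gt;numa_migrate_retry = 0;
            ...
    }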

Signed-off-by: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Tested-by: K Prateek Nayak &lt;kprateek.nayak@amd.com&gt;
Link: https://lore.kernel.org/r/20220520103519.1863-2-mgorman@techsingularity.net
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched: only perform capability check on privileged operation</title>
<updated>2022-08-17T13:14:20Z</updated>
<author>
<name>Christian Göttsche</name>
<email>cgzones@googlemail.com</email>
</author>
<published>2022-06-15T15:25:04Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=dcdb6272b7cdff39c870a61a46740baa80cef2a3'/>
<id>urn:sha1:dcdb6272b7cdff39c870a61a46740baa80cef2a3</id>
<content type='text'>
[ Upstream commit 700a78335fc28a59c307f420857fd2d4521549f8 ]

sched_setattr(2) unconditionally issues a CAP_SYS_NICE audit event via
kernel/sched/core.c:__sched_setscheduler(), even when the requested
operation does not require that capability, i.e. is unprivileged, such
as reducing niceness.
This is relevant in connection with SELinux, where a capability check
results in a policy decision and, by default, a denial message is
issued on insufficient permission.
It can lead to three undesired cases:
  1. A denial message is generated, even though the operation was
     unprivileged and thus the syscall succeeded, creating noise.
  2. To avoid the noise from 1., the policy writer adds a rule to ignore
     those denial messages, hiding future syscalls where the task
     performs an actual privileged operation, leading to hidden, limited
     functionality of that task.
  3. To avoid the noise from 1., the policy writer adds a rule to allow
     the task the capability CAP_SYS_NICE, even though it does not need
     it, violating the principle of least privilege.

Conduct the privileged/unprivileged categorization first and perform the
capable() test (and at most once) only if needed.
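
A sketch of the pattern, with is_unprivileged_request() as a hypothetical
stand-in for the categorization inside __sched_setscheduler() (not the
upstream helper name):

    /*
     * Categorize first; call capable() (which generates the audit
     * event) at most once, and only when privilege is actually needed.
     */
    if (!is_unprivileged_request(p, attr) &amp;&amp; !capable(CAP_SYS_NICE))
            return -EPERM;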

Signed-off-by: Christian Göttsche &lt;cgzones@googlemail.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20220615152505.310488-1-cgzones@googlemail.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt()</title>
<updated>2022-08-17T13:14:19Z</updated>
<author>
<name>Nicolas Saenz Julienne</name>
<email>nsaenzju@redhat.com</email>
</author>
<published>2022-06-28T09:22:59Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=32f346b7460ecd298c3f5cb566091205486cdc5a'/>
<id>urn:sha1:32f346b7460ecd298c3f5cb566091205486cdc5a</id>
<content type='text'>
[ Upstream commit 5c66d1b9b30f737fcef85a0b75bfe0590e16b62a ]

dequeue_task_rt() only decrements 'rt_rq-&gt;rt_nr_running' after having
called sched_update_tick_dependency(), preventing it from re-enabling the
tick on systems that no longer have pending SCHED_RT tasks but have
multiple runnable SCHED_OTHER tasks:

  dequeue_task_rt()
    dequeue_rt_entity()
      dequeue_rt_stack()
        dequeue_top_rt_rq()
	  sub_nr_running()	// decrements rq-&gt;nr_running
	    sched_update_tick_dependency()
	      sched_can_stop_tick()	// checks rq-&gt;rt.rt_nr_running,
	      ...
        __dequeue_rt_entity()
          dec_rt_tasks()	// decrements rq-&gt;rt.rt_nr_running
	  ...

Every other scheduler class performs the operation in the opposite
order, and sched_update_tick_dependency() expects the values to be
updated as such. So avoid the misbehaviour by inverting the order in
which the above operations are performed in the RT scheduler.
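
After the fix, the dequeue path performs the operations in the opposite
order; sketched in the style of the call tree above (not a verbatim trace):

  dequeue_task_rt()
    dequeue_rt_entity()
      dequeue_rt_stack()
        __dequeue_rt_entity()
          dec_rt_tasks()	// decrements rq-&gt;rt.rt_nr_running first
        dequeue_top_rt_rq()
          sub_nr_running()	// decrements rq-&gt;nr_running
            sched_update_tick_dependency()
              sched_can_stop_tick()	// sees the updated rt_nr_running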

Fixes: 76d92ac305f2 ("sched: Migrate sched to use new tick dependency mask model")
Signed-off-by: Nicolas Saenz Julienne &lt;nsaenzju@redhat.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
Reviewed-by: Phil Auld &lt;pauld@redhat.com&gt;
Link: https://lore.kernel.org/r/20220628092259.330171-1-nsaenzju@redhat.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/core: Always flush pending blk_plug</title>
<updated>2022-08-17T13:14:01Z</updated>
<author>
<name>John Keeping</name>
<email>john@metanate.com</email>
</author>
<published>2022-07-08T16:27:02Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=2c14f578b5a0807cd1524e024c5cecfa680d43c9'/>
<id>urn:sha1:2c14f578b5a0807cd1524e024c5cecfa680d43c9</id>
<content type='text'>
[ Upstream commit 401e4963bf45c800e3e9ea0d3a0289d738005fd4 ]

With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
normal priority tasks (SCHED_OTHER, nice level zero):

	INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
	      Not tainted 5.15.49-rt46 #1
	"echo 0 &gt; /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	task:kworker/u8:0    state:D stack:    0 pid:    8 ppid:     2 flags:0x00000000
	Workqueue: writeback wb_workfn (flush-7:0)
	[&lt;c08a3a10&gt;] (__schedule) from [&lt;c08a3d84&gt;] (schedule+0xdc/0x134)
	[&lt;c08a3d84&gt;] (schedule) from [&lt;c08a65a0&gt;] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174)
	[&lt;c08a65a0&gt;] (rt_mutex_slowlock_block.constprop.0) from [&lt;c08a6708&gt;]
	+(rt_mutex_slowlock.constprop.0+0xac/0x174)
	[&lt;c08a6708&gt;] (rt_mutex_slowlock.constprop.0) from [&lt;c0374d60&gt;] (fat_write_inode+0x34/0x54)
	[&lt;c0374d60&gt;] (fat_write_inode) from [&lt;c0297304&gt;] (__writeback_single_inode+0x354/0x3ec)
	[&lt;c0297304&gt;] (__writeback_single_inode) from [&lt;c0297998&gt;] (writeback_sb_inodes+0x250/0x45c)
	[&lt;c0297998&gt;] (writeback_sb_inodes) from [&lt;c0297c20&gt;] (__writeback_inodes_wb+0x7c/0xb8)
	[&lt;c0297c20&gt;] (__writeback_inodes_wb) from [&lt;c0297f24&gt;] (wb_writeback+0x2c8/0x2e4)
	[&lt;c0297f24&gt;] (wb_writeback) from [&lt;c0298c40&gt;] (wb_workfn+0x1a4/0x3e4)
	[&lt;c0298c40&gt;] (wb_workfn) from [&lt;c0138ab8&gt;] (process_one_work+0x1fc/0x32c)
	[&lt;c0138ab8&gt;] (process_one_work) from [&lt;c0139120&gt;] (worker_thread+0x22c/0x2d8)
	[&lt;c0139120&gt;] (worker_thread) from [&lt;c013e6e0&gt;] (kthread+0x16c/0x178)
	[&lt;c013e6e0&gt;] (kthread) from [&lt;c01000fc&gt;] (ret_from_fork+0x14/0x38)
	Exception stack(0xc10e3fb0 to 0xc10e3ff8)
	3fa0:                                     00000000 00000000 00000000 00000000
	3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
	3fe0: 00000000 00000000 00000000 00000000 00000013 00000000

	INFO: task tar:2083 blocked for more than 491 seconds.
	      Not tainted 5.15.49-rt46 #1
	"echo 0 &gt; /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	task:tar             state:D stack:    0 pid: 2083 ppid:  2082 flags:0x00000000
	[&lt;c08a3a10&gt;] (__schedule) from [&lt;c08a3d84&gt;] (schedule+0xdc/0x134)
	[&lt;c08a3d84&gt;] (schedule) from [&lt;c08a41b0&gt;] (io_schedule+0x14/0x24)
	[&lt;c08a41b0&gt;] (io_schedule) from [&lt;c08a455c&gt;] (bit_wait_io+0xc/0x30)
	[&lt;c08a455c&gt;] (bit_wait_io) from [&lt;c08a441c&gt;] (__wait_on_bit_lock+0x54/0xa8)
	[&lt;c08a441c&gt;] (__wait_on_bit_lock) from [&lt;c08a44f4&gt;] (out_of_line_wait_on_bit_lock+0x84/0xb0)
	[&lt;c08a44f4&gt;] (out_of_line_wait_on_bit_lock) from [&lt;c0371fb0&gt;] (fat_mirror_bhs+0xa0/0x144)
	[&lt;c0371fb0&gt;] (fat_mirror_bhs) from [&lt;c0372a68&gt;] (fat_alloc_clusters+0x138/0x2a4)
	[&lt;c0372a68&gt;] (fat_alloc_clusters) from [&lt;c0370b14&gt;] (fat_alloc_new_dir+0x34/0x250)
	[&lt;c0370b14&gt;] (fat_alloc_new_dir) from [&lt;c03787c0&gt;] (vfat_mkdir+0x58/0x148)
	[&lt;c03787c0&gt;] (vfat_mkdir) from [&lt;c0277b60&gt;] (vfs_mkdir+0x68/0x98)
	[&lt;c0277b60&gt;] (vfs_mkdir) from [&lt;c027b484&gt;] (do_mkdirat+0xb0/0xec)
	[&lt;c027b484&gt;] (do_mkdirat) from [&lt;c0100060&gt;] (ret_fast_syscall+0x0/0x1c)
	Exception stack(0xc2e1bfa8 to 0xc2e1bff0)
	bfa0:                   01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000
	bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190
	bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc

Here the kworker is waiting on msdos_sb_info::s_lock, which is held by
tar, which is in turn waiting for a buffer that is locked waiting to be
flushed; but this flush operation is plugged in the kworker.

The lock is a normal struct mutex, so tsk_is_pi_blocked() will always
return false on !RT, and thus the behaviour changes for RT.

It seems that the intent here is to skip blk_flush_plug() in the case
where a non-preemptible lock (such as a spinlock) has been converted to
a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT
schedule flag.  But sched_submit_work() is only called from schedule()
which is never called in this scenario, so the check can simply be
deleted.

Looking at the history of the -rt patchset, in fact this change was
present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part
of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b
("sched/core: Rework the __schedule() preempt argument").

As described in [1]:

   The schedule process must distinguish between blocking on a regular
   sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock
   and rwlock):
   - rwsem and mutex must flush block requests (blk_schedule_flush_plug())
     even if blocked on a lock. This can not deadlock because this also
     happens for non-RT.
     There should be a warning if the scheduling point is within a RCU read
     section.

   - spinlock and rwlock must not flush block requests. This will deadlock
     if the callback attempts to acquire a lock which is already acquired.
     Similarly to being preempted, there should be no warning if the
     scheduling point is within a RCU read section.

and with the tsk_is_pi_blocked() in the scheduler path, we hit the first
issue.
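
A sketch of the resulting submit path, assuming sched_submit_work() in
kernel/sched/core.c (simplified; the patch essentially deletes the
tsk_is_pi_blocked() early return):

    static inline void sched_submit_work(struct task_struct *tsk)
    {
            ...
            /* The tsk_is_pi_blocked() early return is gone. */

            /*
             * If we are going to sleep and we have plugged IO queued,
             * make sure to submit it to avoid deadlocks.
             */
            blk_flush_plug(tsk-&gt;plug, true);
    }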

[1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches

Signed-off-by: John Keeping &lt;john@metanate.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Steven Rostedt (Google) &lt;rostedt@goodmis.org&gt;
Link: https://lkml.kernel.org/r/20220708162702.1758865-1-john@metanate.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/fair: fix case with reduced capacity CPU</title>
<updated>2022-08-17T13:14:01Z</updated>
<author>
<name>Vincent Guittot</name>
<email>vincent.guittot@linaro.org</email>
</author>
<published>2022-07-08T15:44:01Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c1491d69334ad8c6de6c0350e26dd277ea634244'/>
<id>urn:sha1:c1491d69334ad8c6de6c0350e26dd277ea634244</id>
<content type='text'>
[ Upstream commit c82a69629c53eda5233f13fc11c3c01585ef48a2 ]

The capacity of a CPU available for CFS tasks can be reduced because of
other activities running on it. In such a case, it's worth trying to
move CFS tasks to a CPU with more available capacity.

The rework of the load balancer filtered out the case where the CPU is
classified as fully busy but its capacity is reduced.

Check whether the CPU's capacity is reduced while gathering load-balance
statistics and classify it group_misfit_task instead of group_fully_busy,
so we can try to move the load to another CPU.
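
A sketch of the classification change, assuming a helper along the lines
of the upstream sched_reduced_capacity() in kernel/sched/fair.c
(simplified, not the verbatim diff):

    /*
     * One running task on a CPU whose capacity is pressured by other
     * activity (irq, RT, ...): treat it as misfit, not fully busy.
     */
    static inline bool
    sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
    {
            return rq-&gt;cfs.h_nr_running == 1 &amp;&amp;
                   check_cpu_capacity(rq, sd);
    }

    /* In update_sg_lb_stats(): */
    if (env-&gt;idle != CPU_NOT_IDLE &amp;&amp; sched_reduced_capacity(rq, env-&gt;sd)) {
            /* Check for a task running on a CPU with reduced capacity. */
            if (sgs-&gt;group_misfit_task_load &lt; load)
                    sgs-&gt;group_misfit_task_load = load;
    }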

Reported-by: David Chen &lt;david.chen@nutanix.com&gt;
Reported-by: Zhang Qiao &lt;zhangqiao22@huawei.com&gt;
Signed-off-by: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Tested-by: David Chen &lt;david.chen@nutanix.com&gt;
Tested-by: Zhang Qiao &lt;zhangqiao22@huawei.com&gt;
Link: https://lkml.kernel.org/r/20220708154401.21411-1-vincent.guittot@linaro.org
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
</feed>
