| Age | Commit message (Collapse) | Author |
|
Reimplement NEXT_BUDDY preemption to take into account the deadline and
eligibility of the wakee with respect to the waker. In the event
multiple buddies could be considered, the one with the earliest deadline
is selected.
Sync wakeups are treated differently to every other type of wakeup. The
WF_SYNC assumption is that the waker promises to sleep in the very near
future. This is violated in enough cases that WF_SYNC should be treated
as a suggestion instead of a contract. If a waker does go to sleep almost
immediately then the delay in wakeup is negligible. In other cases, it's
throttled based on the accumulated runtime of the waker so there is a
chance that some batched wakeups have been issued before preemption.
For all other wakeups, preemption happens if the wakee has a earlier
deadline than the waker and eligible to run.
While many workloads were tested, the two main targets were a modified
dbench4 benchmark and hackbench because the are on opposite ends of the
spectrum -- one prefers throughput by avoiding preemption and the other
relies on preemption.
First is the dbench throughput data even though it is a poor metric but
it is the default metric. The test machine is a 2-socket machine and the
backing filesystem is XFS as a lot of the IO work is dispatched to kernel
threads. It's important to note that these results are not representative
across all machines, especially Zen machines, as different bottlenecks
are exposed on different machines and filesystems.
dbench4 Throughput (misleading but traditional)
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v5
Hmean 1 1268.80 ( 0.00%) 1269.74 ( 0.07%)
Hmean 4 3971.74 ( 0.00%) 3950.59 ( -0.53%)
Hmean 7 5548.23 ( 0.00%) 5420.08 ( -2.31%)
Hmean 12 7310.86 ( 0.00%) 7165.57 ( -1.99%)
Hmean 21 8874.53 ( 0.00%) 9149.04 ( 3.09%)
Hmean 30 9361.93 ( 0.00%) 10530.04 ( 12.48%)
Hmean 48 9540.14 ( 0.00%) 11820.40 ( 23.90%)
Hmean 79 9208.74 ( 0.00%) 12193.79 ( 32.42%)
Hmean 110 8573.12 ( 0.00%) 11933.72 ( 39.20%)
Hmean 141 7791.33 ( 0.00%) 11273.90 ( 44.70%)
Hmean 160 7666.60 ( 0.00%) 10768.72 ( 40.46%)
As throughput is misleading, the benchmark is modified to use a short
loadfile report the completion time duration in milliseconds.
dbench4 Loadfile Execution Time
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v5
Amean 1 14.62 ( 0.00%) 14.69 ( -0.46%)
Amean 4 18.76 ( 0.00%) 18.85 ( -0.45%)
Amean 7 23.71 ( 0.00%) 24.38 ( -2.82%)
Amean 12 31.25 ( 0.00%) 31.87 ( -1.97%)
Amean 21 45.12 ( 0.00%) 43.69 ( 3.16%)
Amean 30 61.07 ( 0.00%) 54.33 ( 11.03%)
Amean 48 95.91 ( 0.00%) 77.22 ( 19.49%)
Amean 79 163.38 ( 0.00%) 123.08 ( 24.66%)
Amean 110 243.91 ( 0.00%) 175.11 ( 28.21%)
Amean 141 343.47 ( 0.00%) 239.10 ( 30.39%)
Amean 160 401.15 ( 0.00%) 283.73 ( 29.27%)
Stddev 1 0.52 ( 0.00%) 0.51 ( 2.45%)
Stddev 4 1.36 ( 0.00%) 1.30 ( 4.04%)
Stddev 7 1.88 ( 0.00%) 1.87 ( 0.72%)
Stddev 12 3.06 ( 0.00%) 2.45 ( 19.83%)
Stddev 21 5.78 ( 0.00%) 3.87 ( 33.06%)
Stddev 30 9.85 ( 0.00%) 5.25 ( 46.76%)
Stddev 48 22.31 ( 0.00%) 8.64 ( 61.27%)
Stddev 79 35.96 ( 0.00%) 18.07 ( 49.76%)
Stddev 110 59.04 ( 0.00%) 30.93 ( 47.61%)
Stddev 141 85.38 ( 0.00%) 40.93 ( 52.06%)
Stddev 160 96.38 ( 0.00%) 39.72 ( 58.79%)
That is still looking good and the variance is reduced quite a bit.
Finally, fairness is a concern so the next report tracks how many
milliseconds does it take for all clients to complete a workfile. This
one is tricky because dbench makes to effort to synchronise clients so
the durations at benchmark start time differ substantially from typical
runtimes. This problem could be mitigated by warming up the benchmark
for a number of minutes but it's a matter of opinion whether that
counts as an evasion of inconvenient results.
dbench4 All Clients Loadfile Execution Time
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v5
Amean 1 15.06 ( 0.00%) 15.07 ( -0.03%)
Amean 4 603.81 ( 0.00%) 524.29 ( 13.17%)
Amean 7 855.32 ( 0.00%) 1331.07 ( -55.62%)
Amean 12 1890.02 ( 0.00%) 2323.97 ( -22.96%)
Amean 21 3195.23 ( 0.00%) 2009.29 ( 37.12%)
Amean 30 13919.53 ( 0.00%) 4579.44 ( 67.10%)
Amean 48 25246.07 ( 0.00%) 5705.46 ( 77.40%)
Amean 79 29701.84 ( 0.00%) 15509.26 ( 47.78%)
Amean 110 22803.03 ( 0.00%) 23782.08 ( -4.29%)
Amean 141 36356.07 ( 0.00%) 25074.20 ( 31.03%)
Amean 160 17046.71 ( 0.00%) 13247.62 ( 22.29%)
Stddev 1 0.47 ( 0.00%) 0.49 ( -3.74%)
Stddev 4 395.24 ( 0.00%) 254.18 ( 35.69%)
Stddev 7 467.24 ( 0.00%) 764.42 ( -63.60%)
Stddev 12 1071.43 ( 0.00%) 1395.90 ( -30.28%)
Stddev 21 1694.50 ( 0.00%) 1204.89 ( 28.89%)
Stddev 30 7945.63 ( 0.00%) 2552.59 ( 67.87%)
Stddev 48 14339.51 ( 0.00%) 3227.55 ( 77.49%)
Stddev 79 16620.91 ( 0.00%) 8422.15 ( 49.33%)
Stddev 110 12912.15 ( 0.00%) 13560.95 ( -5.02%)
Stddev 141 20700.13 ( 0.00%) 14544.51 ( 29.74%)
Stddev 160 9079.16 ( 0.00%) 7400.69 ( 18.49%)
This is more of a mixed bag but it at least shows that fairness
is not crippled.
The hackbench results are more neutral but this is still important.
It's possible to boost the dbench figures by a large amount but only by
crippling the performance of a workload like hackbench. The WF_SYNC
behaviour is important for these workloads and is why the WF_SYNC
changes are not a separate patch.
hackbench-process-pipes
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v5
Amean 1 0.2657 ( 0.00%) 0.2150 ( 19.07%)
Amean 4 0.6107 ( 0.00%) 0.6060 ( 0.76%)
Amean 7 0.7923 ( 0.00%) 0.7440 ( 6.10%)
Amean 12 1.1500 ( 0.00%) 1.1263 ( 2.06%)
Amean 21 1.7950 ( 0.00%) 1.7987 ( -0.20%)
Amean 30 2.3207 ( 0.00%) 2.5053 ( -7.96%)
Amean 48 3.5023 ( 0.00%) 3.9197 ( -11.92%)
Amean 79 4.8093 ( 0.00%) 5.2247 ( -8.64%)
Amean 110 6.1160 ( 0.00%) 6.6650 ( -8.98%)
Amean 141 7.4763 ( 0.00%) 7.8973 ( -5.63%)
Amean 172 8.9560 ( 0.00%) 9.3593 ( -4.50%)
Amean 203 10.4783 ( 0.00%) 10.8347 ( -3.40%)
Amean 234 12.4977 ( 0.00%) 13.0177 ( -4.16%)
Amean 265 14.7003 ( 0.00%) 15.5630 ( -5.87%)
Amean 296 16.1007 ( 0.00%) 17.4023 ( -8.08%)
Processes using pipes are impacted but the variance (not presented) indicates
it's close to noise and the results are not always reproducible. If executed
across multiple reboots, it may show neutral or small gains so the worst
measured results are presented.
Hackbench using sockets is more reliably neutral as the wakeup
mechanisms are different between sockets and pipes.
hackbench-process-sockets
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v2
Amean 1 0.3073 ( 0.00%) 0.3263 ( -6.18%)
Amean 4 0.7863 ( 0.00%) 0.7930 ( -0.85%)
Amean 7 1.3670 ( 0.00%) 1.3537 ( 0.98%)
Amean 12 2.1337 ( 0.00%) 2.1903 ( -2.66%)
Amean 21 3.4683 ( 0.00%) 3.4940 ( -0.74%)
Amean 30 4.7247 ( 0.00%) 4.8853 ( -3.40%)
Amean 48 7.6097 ( 0.00%) 7.8197 ( -2.76%)
Amean 79 14.7957 ( 0.00%) 16.1000 ( -8.82%)
Amean 110 21.3413 ( 0.00%) 21.9997 ( -3.08%)
Amean 141 29.0503 ( 0.00%) 29.0353 ( 0.05%)
Amean 172 36.4660 ( 0.00%) 36.1433 ( 0.88%)
Amean 203 39.7177 ( 0.00%) 40.5910 ( -2.20%)
Amean 234 42.1120 ( 0.00%) 43.5527 ( -3.42%)
Amean 265 45.7830 ( 0.00%) 50.0560 ( -9.33%)
Amean 296 50.7043 ( 0.00%) 54.3657 ( -7.22%)
As schbench has been mentioned in numerous bugs recently, the results
are interesting. A test case that represents the default schbench
behaviour is
schbench Wakeup Latency (usec)
6.18.0-rc1 6.18.0-rc1
vanilla sched-preemptnext-v5
Amean Wakeup-50th-80 7.17 ( 0.00%) 6.00 ( 16.28%)
Amean Wakeup-90th-80 46.56 ( 0.00%) 19.78 ( 57.52%)
Amean Wakeup-99th-80 119.61 ( 0.00%) 89.94 ( 24.80%)
Amean Wakeup-99.9th-80 3193.78 ( 0.00%) 328.22 ( 89.72%)
schbench Requests Per Second (ops/sec)
6.18.0-rc1 6.18.0-rc1
vanilla sched-preemptnext-v5
Hmean RPS-20th-80 8900.91 ( 0.00%) 9176.78 ( 3.10%)
Hmean RPS-50th-80 8987.41 ( 0.00%) 9217.89 ( 2.56%)
Hmean RPS-90th-80 9123.73 ( 0.00%) 9273.25 ( 1.64%)
Hmean RPS-max-80 9193.50 ( 0.00%) 9301.47 ( 1.17%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251112122521.1331238-3-mgorman@techsingularity.net
|
|
The NEXT_BUDDY feature reinforces wakeup preemption to encourage the last
wakee to be scheduled sooner on the assumption that the waker/wakee share
cache-hot data. In CFS, it was paired with LAST_BUDDY to switch back on
the assumption that the pair of tasks still share data but also relied
on START_DEBIT and the exact WAKEUP_PREEMPTION implementation to get
good results.
NEXT_BUDDY has been disabled since commit 0ec9fab3d186 ("sched: Improve
latencies and throughput") and LAST_BUDDY was removed in commit 5e963f2bd465
("sched/fair: Commit to EEVDF"). The reasoning is not clear but as vruntime
spread is mentioned so the expectation is that NEXT_BUDDY had an impact
on overall fairness. It was not noted why LAST_BUDDY was removed but it
is assumed that it's very difficult to reason what LAST_BUDDY's correct
and effective behaviour should be while still respecting EEVDFs goals.
Peter Zijlstra noted during review;
I think I was just struggling to make sense of things and figured
less is more and axed it.
I have vague memories trying to work through the dynamics of
a wakeup-stack and the EEVDF latency requirements and getting
a head-ache.
NEXT_BUDDY is easier to reason about given that it's a point-in-time
decision on the wakees deadline and eligibilty relative to the waker. Enable
NEXT_BUDDY as a preparation path to document that the decision to ignore
the current implementation is deliberate. While not presented, the results
were at best neutral and often much more variable.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251112122521.1331238-2-mgorman@techsingularity.net
|
|
Increase the sched_tick_remote WARN_ON timeout to remove false
positives due to temporarily busy HK cpus. The suggestion
was 30 seconds to catch really stuck remote tick processing
but not trigger it too easily.
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://patch.msgid.link/20250911161300.437944-1-pauld@redhat.com
|
|
Also serialize the possiblty much more frequent newidle balancing for
the 'expensive' domains that have SD_BALANCE set.
Initial benchmarking by K Prateek and Tim showed no negative effect.
Split out from the larger patch moving sched_balance_running around
for ease of bisect and such.
Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Seconded-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/df068896-82f9-458d-8fff-5a2f654e8ffd@amd.com
Link: https://patch.msgid.link/6fed119b723c71552943bfe5798c93851b30a361.1762800251.git.tim.c.chen@linux.intel.com
# Conflicts:
# kernel/sched/fair.c
|
|
The NUMA sched domain sets the SD_SERIALIZE flag by default, allowing
only one NUMA load balancing operation to run system-wide at a time.
Currently, each sched group leader directly under NUMA domain attempts
to acquire the global sched_balance_running flag via cmpxchg() before
checking whether load balancing is due or whether it is the designated
load balancer for that NUMA domain. On systems with a large number
of cores, this causes significant cache contention on the shared
sched_balance_running flag.
This patch reduces unnecessary cmpxchg() operations by first checking
that the balancer is the designated leader for a NUMA domain from
should_we_balance(), and the balance interval has expired before
trying to acquire sched_balance_running to load balance a NUMA
domain.
On a 2-socket Granite Rapids system with sub-NUMA clustering enabled,
running an OLTP workload, 7.8% of total CPU cycles were previously spent
in sched_balance_domain() contending on sched_balance_running before
this change.
: 104 static __always_inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new)
: 105 {
: 106 return arch_cmpxchg(&v->counter, old, new);
0.00 : ffffffff81326e6c: xor %eax,%eax
0.00 : ffffffff81326e6e: mov $0x1,%ecx
0.00 : ffffffff81326e73: lock cmpxchg %ecx,0x2394195(%rip) # ffffffff836bb010 <sched_balance_running>
: 110 sched_balance_domains():
: 12234 if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
99.39 : ffffffff81326e7b: test %eax,%eax
0.00 : ffffffff81326e7d: jne ffffffff81326e99 <sched_balance_domains+0x209>
: 12238 if (time_after_eq(jiffies, sd->last_balance + interval)) {
0.00 : ffffffff81326e7f: mov 0x14e2b3a(%rip),%rax # ffffffff828099c0 <jiffies_64>
0.00 : ffffffff81326e86: sub 0x48(%r14),%rax
0.00 : ffffffff81326e8a: cmp %rdx,%rax
After applying this fix, sched_balance_domain() is gone from the profile
and there is a 5% throughput improvement.
[peterz: made it so that redo retains the 'lock' and split out the
CPU_NEWLY_IDLE change to a separate patch]
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.ibm.com>
Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/6fed119b723c71552943bfe5798c93851b30a361.1762800251.git.tim.c.chen@linux.intel.com
|
|
The free_kick_syncs_rcu() rcu-callback only invoke kvfree() to
release per-cpu ksyncs object, this can use kvfree_rcu() replace
call_rcu() to release per-cpu ksyncs object in the free_kick_syncs().
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
For PREEMPT_RT kernels, the kick_cpus_irq_workfn() be invoked in
the per-cpu irq_work/* task context and there is no rcu-read critical
section to protect. this commit therefore use IRQ_WORK_INIT_HARD() to
initialize the per-cpu rq->scx.kick_cpus_irq_work in the
init_sched_ext_class().
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
With the buddy lockup detector, smp_processor_id() returns the detecting CPU,
not the locked CPU, making scx_hardlockup()'s printouts confusing. Pass the
locked CPU number from watchdog_hardlockup_check() as a parameter instead.
Also add kerneldoc comments to handle_lockup(), scx_hardlockup(), and
scx_rcu_cpu_stall() documenting their return value semantics.
Suggested-by: Doug Anderson <dianders@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
For PREEMPT_RT=y kernels, the deferred_irq_workfn() is executed in
the per-cpu irq_work/* task context and not disable-irq, if the rq
returned by container_of() is current CPU's rq, the following scenarios
may occur:
lock(&rq->__lock);
<Interrupt>
lock(&rq->__lock);
This commit use IRQ_WORK_INIT_HARD() to replace init_irq_work() to
initialize rq->scx.deferred_irq_work, make the deferred_irq_workfn()
is always invoked in hard-irq context.
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Commit 5ebec443fb96a ("sched_ext: Exit dispatch and move operations
immediately when aborting") replaced the breather mechanism with the
scx_aborting flag.
Update comments removing references to the breather mechanism to avoid
confusion.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Update scx_task_locks so that it's safe to lock/unlock in a
non-sleepable context in PREEMPT_RT kernels. scx_task_locks is
(non-raw) spinlock used to protect the list of tasks under SCX.
This list is updated during from finish_task_switch(), which
cannot sleep. Regular spinlocks can be locked in such a context
in non-RT kernels, but are sleepable under when CONFIG_PREEMPT_RT=y.
Convert scx_task_locks into a raw spinlock, which is not sleepable
even on RT kernels.
Sample backtrace:
<TASK>
dump_stack_lvl+0x83/0xa0
__might_resched+0x14a/0x200
rt_spin_lock+0x61/0x1c0
? sched_ext_dead+0x2d/0xf0
? lock_release+0xc6/0x280
sched_ext_dead+0x2d/0xf0
? srso_alias_return_thunk+0x5/0xfbef5
finish_task_switch.isra.0+0x254/0x360
__schedule+0x584/0x11d0
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? tick_nohz_idle_exit+0x7e/0x120
schedule_idle+0x23/0x40
cpu_startup_entry+0x29/0x30
start_secondary+0xf8/0x100
common_startup_64+0x13e/0x148
</TASK>
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
In bypass mode, tasks are queued on per-CPU bypass DSQs. While this works well
in most cases, there is a failure mode where a BPF scheduler can skew task
placement severely before triggering bypass in highly over-saturated systems.
If most tasks end up concentrated on a few CPUs, those CPUs can accumulate
queues that are too long to drain in a reasonable time, leading to RCU stalls
and hung tasks.
Implement a simple timer-based load balancer that redistributes tasks across
CPUs within each NUMA node. The balancer runs periodically (default 500ms,
tunable via bypass_lb_intv_us module parameter) and moves tasks from overloaded
CPUs to underloaded ones.
When moving tasks between bypass DSQs, the load balancer holds nested DSQ locks
to avoid dropping and reacquiring the donor DSQ lock on each iteration, as
donor DSQs can be very long and highly contended. Add the SCX_ENQ_NESTED flag
and use raw_spin_lock_nested() in dispatch_enqueue() to support this. The load
balancer timer function reads scx_bypass_depth locklessly to check whether
bypass mode is active. Use WRITE_ONCE() when updating scx_bypass_depth to pair
with the READ_ONCE() in the timer function.
This has been tested on a 192 CPU dual socket AMD EPYC machine with ~20k
runnable tasks running scx_cpu0. As scx_cpu0 queues all tasks to CPU0, almost
all tasks end up on CPU0 creating severe imbalance. Without the load balancer,
disabling the scheduler can lead to RCU stalls and hung tasks, taking a very
long time to complete. With the load balancer, disable completes in about a
second.
The load balancing operation can be monitored using the sched_ext_bypass_lb
tracepoint and disabled by setting bypass_lb_intv_us to 0.
v2: Lock both rq and DSQ in bypass_lb_cpu() and use dispatch_dequeue_locked()
to prevent races with dispatch_dequeue() (Andrea Righi).
Cc: Andrea Righi <arighi@nvidia.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Reviewed_by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
dispatch_dequeue_locked()
move_task_between_dsqs() contains open-coded abbreviated dequeue logic when
moving tasks between non-local DSQs. Factor this out into
dispatch_dequeue_locked() which can be used when both the task's rq and dsq
locks are already held. Add lockdep assertions to both dispatch_dequeue() and
the new helper to verify locking requirements.
This prepares for the load balancer which will need the same abbreviated
dequeue pattern.
Cc: Andrea Righi <arighi@nvidia.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
INIT_DSQ_LIST_CURSOR
Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR
macro in preparation for additional users.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
A poorly behaving BPF scheduler can trigger hard lockup. For example, on a
large system with many tasks pinned to different subsets of CPUs, if the BPF
scheduler puts all tasks in a single DSQ and lets all CPUs at it, the DSQ lock
can be contended to the point where hardlockup triggers. Unfortunately,
hardlockup can be the first signal out of such situations, thus requiring
hardlockup handling.
Hook scx_hardlockup() into the hardlockup detector to try kicking out the
current scheduler in an attempt to recover the system to a good state. The
handling strategy can delay watchdog taking its own action by one polling
period; however, given that the only remediation for hardlockup is crash, this
is likely an acceptable trade-off.
v2: Add missing dummy scx_hardlockup() definition for
!CONFIG_SCHED_CLASS_EXT (kernel test bot).
Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
handle_lockup() currently calls scx_verror() but ignores its return value,
always returning true when the scheduler is enabled. Make it capture and return
the result from scx_verror(). This prepares for hardlockup handling.
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
scx_rcu_cpu_stall() and scx_softlockup() share the same pattern: check if the
scheduler is enabled under RCU read lock and trigger an error if so. Extract
the common pattern into handle_lockup() helper. Add scx_verror() macro and use
guard(rcu)().
This simplifies both handlers, reduces code duplication, and prepares for
hardlockup handling.
Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Make scx_exit() and scx_vexit() return bool indicating whether the calling
thread successfully claimed the exit. This will be used by the abort mechanism
added in a later patch.
Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
62dcbab8b0ef ("sched_ext: Avoid live-locking bypass mode switching") introduced
the breather mechanism to inject delays during bypass mode switching. It
maintains operation semantics unchanged while reducing lock contention to avoid
live-locks on large NUMA systems.
However, the breather only activates when exiting the scheduler, so there's no
need to maintain operation semantics. Simplify by exiting dispatch and move
operations immediately when scx_aborting is set. In consume_dispatch_q(), break
out of the task iteration loop. In scx_dsq_move(), return early before
acquiring locks.
This also fixes cases the breather mechanism cannot handle. When a large system
has many runnable threads affinitized to different CPU subsets and the BPF
scheduler places them all into a single DSQ, many CPUs can scan the DSQ
concurrently for tasks they can run. This can cause DSQ and RQ locks to be held
for extended periods, leading to various failure modes. The breather cannot
solve this because once in the consume loop, there's no exit. The new mechanism
fixes this by exiting the loop immediately.
The bypass DSQ is exempted to ensure the bypass mechanism itself can make
progress.
v2: Use READ_ONCE() when reading scx_aborting (Andrea Righi).
Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Andrea Righi <arighi@nvidia.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The breather mechanism was introduced in 62dcbab8b0ef ("sched_ext: Avoid
live-locking bypass mode switching") and e32c260195e6 ("sched_ext: Enable the
ops breather and eject BPF scheduler on softlockup") to prevent live-locks by
injecting delays when CPUs are trapped in dispatch paths.
Currently, it uses scx_breather_depth (atomic_t) and scx_in_softlockup
(unsigned long) with separate increment/decrement and cleanup operations. The
breather is only activated when aborting, so tie it directly to the exit
mechanism. Replace both variables with scx_aborting flag set when exit is
claimed and cleared after bypass is enabled. Introduce scx_claim_exit() to
consolidate exit_kind claiming and breather enablement. This eliminates
scx_clear_softlockup() and simplifies scx_softlockup() and scx_bypass().
The breather mechanism will be replaced by a different abort mechanism in a
future patch. This simplification prepares for that change.
Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Bypass mode routes tasks through fallback dispatch queues. Originally a single
global DSQ, b7b3b2dbae73 ("sched_ext: Split the global DSQ per NUMA node")
changed this to per-node DSQs to resolve NUMA-related livelocks.
Dan Schatzberg found per-node DSQs can still livelock when many threads are
pinned to different small CPU subsets: each CPU must scan many incompatible
tasks to find runnable ones, causing severe contention with high CPU counts.
Switch to per-CPU bypass DSQs. Each task queues on its current CPU. Default
idle CPU selection and direct dispatch handle most cases well.
This introduces a failure mode when tasks concentrate on one CPU in
over-saturated systems. If the BPF scheduler severely skews placement before
triggering bypass, that CPU's queue may be too long to drain, causing RCU
stalls. A load balancer in a future patch will address this. The bypass DSQ is
separate from local DSQ to enable load balancing: local DSQs use rq locks,
preventing efficient scanning and transfer across CPUs, especially problematic
when systems are already contended.
v2: Clarified why bypass DSQ is separate from local DSQ (Andrea Righi).
Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The local and global DSQ enqueue paths in do_enqueue_task() share the same
slice refill logic. Factor out the common code into a shared enqueue label.
This makes adding new enqueue cases easier. No functional changes.
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
There have been reported cases of bypass mode not making forward progress fast
enough. The 20ms default slice is unnecessarily long for bypass mode where the
primary goal is ensuring all tasks can make forward progress.
Introduce SCX_SLICE_BYPASS set to 5ms and make the scheduler automatically
switch to it when entering bypass mode. Also make the bypass slice value
tunable through the slice_bypass_us module parameter (adjustable between 100us
and 100ms) to make it easier to test whether slice durations are a factor in
problem cases.
v3: Use READ_ONCE/WRITE_ONCE for scx_slice_dfl access (Dan).
v2: Removed slice_dfl_us module parameter. Fixed typos (Andrea).
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
For built with CONFIG_PREEMPT_RT=y kernels, the dump_lock will be converted
sleepable spinlock and not disable-irq, so the following scenarios occur:
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
irq_work/0/27 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&rq->__lock){?...}-{2:2}, at: raw_spin_rq_lock_nested+0x2b/0x40
{IN-HARDIRQ-W} state was registered at:
lock_acquire+0x1e1/0x510
_raw_spin_lock_nested+0x42/0x80
raw_spin_rq_lock_nested+0x2b/0x40
sched_tick+0xae/0x7b0
update_process_times+0x14c/0x1b0
tick_periodic+0x62/0x1f0
tick_handle_periodic+0x48/0xf0
timer_interrupt+0x55/0x80
__handle_irq_event_percpu+0x20a/0x5c0
handle_irq_event_percpu+0x18/0xc0
handle_irq_event+0xb5/0x150
handle_level_irq+0x220/0x460
__common_interrupt+0xa2/0x1e0
common_interrupt+0xb0/0xd0
asm_common_interrupt+0x2b/0x40
_raw_spin_unlock_irqrestore+0x45/0x80
__setup_irq+0xc34/0x1a30
request_threaded_irq+0x214/0x2f0
hpet_time_init+0x3e/0x60
x86_late_time_init+0x5b/0xb0
start_kernel+0x308/0x410
x86_64_start_reservations+0x1c/0x30
x86_64_start_kernel+0x96/0xa0
common_startup_64+0x13e/0x148
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&rq->__lock);
<Interrupt>
lock(&rq->__lock);
*** DEADLOCK ***
stack backtrace:
CPU: 0 UID: 0 PID: 27 Comm: irq_work/0
Call Trace:
<TASK>
dump_stack_lvl+0x8c/0xd0
dump_stack+0x14/0x20
print_usage_bug+0x42e/0x690
mark_lock.part.44+0x867/0xa70
? __pfx_mark_lock.part.44+0x10/0x10
? string_nocheck+0x19c/0x310
? number+0x739/0x9f0
? __pfx_string_nocheck+0x10/0x10
? __pfx_check_pointer+0x10/0x10
? kvm_sched_clock_read+0x15/0x30
? sched_clock_noinstr+0xd/0x20
? local_clock_noinstr+0x1c/0xe0
__lock_acquire+0xc4b/0x62b0
? __pfx_format_decode+0x10/0x10
? __pfx_string+0x10/0x10
? __pfx___lock_acquire+0x10/0x10
? __pfx_vsnprintf+0x10/0x10
lock_acquire+0x1e1/0x510
? raw_spin_rq_lock_nested+0x2b/0x40
? __pfx_lock_acquire+0x10/0x10
? dump_line+0x12e/0x270
? raw_spin_rq_lock_nested+0x20/0x40
_raw_spin_lock_nested+0x42/0x80
? raw_spin_rq_lock_nested+0x2b/0x40
raw_spin_rq_lock_nested+0x2b/0x40
scx_dump_state+0x3b3/0x1270
? finish_task_switch+0x27e/0x840
scx_ops_error_irq_workfn+0x67/0x80
irq_work_single+0x113/0x260
irq_work_run_list.part.3+0x44/0x70
run_irq_workd+0x6b/0x90
? __pfx_run_irq_workd+0x10/0x10
smpboot_thread_fn+0x529/0x870
? __pfx_smpboot_thread_fn+0x10/0x10
kthread+0x305/0x3f0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x40/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
This commit therefore use rq_lock_irqsave/irqrestore() to replace
rq_lock/unlock() in the scx_dump_state().
Fixes: 07814a9439a3 ("sched_ext: Print debug dump after an error exit")
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
In select_task_rq_dl, there is only one goto statement, there is no
need for it.
No functional changes.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20251014100342.978936-2-sshegde@linux.ibm.com
|
|
cpumask_subset(a,b) -> cpumask_weight(a) should be same as cpumask_weight_and(a,b)
for_each_cpu_and(a,b) to count cpus could be replaced by cpumask_weight_and(a,b)
No Functional Change. It could save a few cycles since cpumask_weight_and
would be more efficient. Plus one less stack variable.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20251014100342.978936-3-sshegde@linux.ibm.com
|
|
Place the notes that resulted from going through the dl_server code in a
comment.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Gabriel reported that the dl_server doesn't stop as expected.
The problem was found to be the fact that idle time and fair runtime are
treated equally. Both will count towards dl_server runtime and push the
activation forwards when it is in the zero-laxity wait state.
Notably:
dl_server_update_idle()
update_curr_dl_se()
if (dl_defer && dl_throttled && dl_runtime_exceeded())
hrtimer_try_to_cancel(); // stop timer
replenish_dl_new_period()
deadline = now + dl_deadline; // fwd period
runtime = dl_runtime;
start_dl_timer(); // restart timer
And while we do want idle time accounted towards the *current* activation of
the dl_server -- after all, a fair task could've ran if we had any -- we don't
necessarily want idle time to cause or push forward an activation.
Introduce dl_defer_idle to make this distinction. It will be set once idle time
pushed the activation forward, once set idle time will only be allowed to
consume any runtime but not push the activation. This will then cause
dl_server_timer() to fire, which will stop the dl_server.
Any non-idle time accounting during this phase will clear dl_defer_idle, so
only a full period of idle will cause the dl_server to stop.
Reported-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251101000057.GA2184199@noisy.programming.kicks-ass.net
|
|
The dl_server time accounting code is a little odd. The normal scheduler
pattern is to update curr before doing something, such that the old state is
fully accounted before changing state.
Notably, the dl_server_timer() needs to propagate the current time accounting
since the current task could be ran by dl_server and thus this can affect
dl_se->runtime. Similarly for dl_server_start().
And since the (deferred) dl_server wants idle time accounted, rework
sched_idle_class time accounting to be more like all the others.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251020141130.GJ3245006@noisy.programming.kicks-ass.net
|
|
Since commit d4c64207b88a ("sched: Cleanup the sched_change NOCLOCK usage"),
update_rq_clock() is called in do_set_cpus_allowed() -> sched_change_begin()
to update the rq clock. This results in a duplicate call update_rq_clock()
in __set_cpus_allowed_ptr_locked().
While holding the rq lock and before calling do_set_cpus_allowed(),
there is nothing that depends on an updated rq_clock.
Therefore, remove the redundant update_rq_clock() in
__set_cpus_allowed_ptr_locked() to avoid the warning about double
rq clock updates.
Fixes: d4c64207b88a ("sched: Cleanup the sched_change NOCLOCK usage")
Signed-off-by: Hao Jia <jiahao1@lixiang.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20251029093655.31252-1-jiahao.kernel@gmail.com
|
|
Basically, from the constraint that the sum of lag is zero, you can
infer that the 0-lag point is the weighted average of the individual
vruntime, which is what we're trying to compute:
\Sum w_i * v_i
avg = --------------
\Sum w_i
Now, since vruntime takes the whole u64 (worse, it wraps), this
multiplication term in the numerator is not something we can compute;
instead we do the min_vruntime (v0 henceforth) thing like:
v_i = (v_i - v0) + v0
This does two things:
- it keeps the key: (v_i - v0) 'small';
- it creates a relative 0-point in the modular space.
If you do that subtitution and work it all out, you end up with:
\Sum w_i * (v_i - v0)
avg = --------------------- + v0
\Sum w_i
Since you cannot very well track a ratio like that (and not suffer
terrible numerical problems) we simpy track the numerator and
denominator individually and only perform the division when strictly
needed.
Notably, the numerator lives in cfs_rq->avg_vruntime and the denominator
lives in cfs_rq->avg_load.
The one extra 'funny' is that these numbers track the entities in the
tree, and current is typically outside of the tree, so avg_vruntime()
adds current when needed before doing the division.
(vruntime_eligible() elides the division by cross-wise multiplication)
Anyway, as mentioned above, we currently use the CFS era min_vruntime
for this purpose. However, this thing can only move forward, while the
above avg can in fact move backward (when a non-eligible task leaves,
the average becomes smaller), this can cause trouble when through
happenstance (or construction) these values drift far enough apart to
wreck the game.
Replace cfs_rq::min_vruntime with cfs_rq::zero_vruntime which is kept
near/at avg_vruntime, following its motion.
The down-side is that this requires computing the avg more often.
Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Reported-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251106111741.GC4068168@noisy.programming.kicks-ass.net
Cc: stable@vger.kernel.org
|
|
I always end up having to re-read these emails every time I look at
this code. And a future patch is going to change this story a little.
This means it is past time to stick them in a comment so it can be
modified and stay current.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200506143506.GH5298@hirez.programming.kicks-ass.net
Link: https://lkml.kernel.org/r/20200515103844.GG2978@hirez.programming.kicks-ass.net
Link: https://patch.msgid.link/20251106111603.GB4068168@noisy.programming.kicks-ass.net
|
|
Early return true if the core cookie matches. This avoids the SMT mask
loop to check for an idle core, which might be more expensive on wide
platforms.
Signed-off-by: Fernand Sieber <sieberf@amazon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Link: https://patch.msgid.link/20251105152538.470586-1-sieberf@amazon.com
|
|
When executing a task in proxy context, handle yields as if they were
requested by the donor task. This matches the traditional PI semantics
of yield() as well.
This avoids scenario like proxy task yielding, pick next task selecting the
same previous blocked donor, running the proxy task again, etc.
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202510211205.1e0f5223-lkp@intel.com
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fernand Sieber <sieberf@amazon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251106104022.195157-1-sieberf@amazon.com
|
|
When a cfs_rq is to be throttled, its limbo list should be empty and
that's why there is a warn in tg_throttle_down() for non empty
cfs_rq->throttled_limbo_list.
When running a test with the following hierarchy:
root
/ \
A* ...
/ | \ ...
B
/ \
C*
where both A and C have quota settings, that warn on non empty limbo list
is triggered for a cfs_rq of C, let's call it cfs_rq_c(and ignore the cpu
part of the cfs_rq for the sake of simpler representation).
Debug showed it happened like this:
Task group C is created and quota is set, so in tg_set_cfs_bandwidth(),
cfs_rq_c is initialized with runtime_enabled set, runtime_remaining
equals to 0 and *unthrottled*. Before any tasks are enqueued to cfs_rq_c,
*multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task
group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is
called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled),
these throttled tasks are directly placed into cfs_rq_c's limbo list by
enqueue_throttled_task().
Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these
tasks. The first enqueue triggers check_enqueue_throttle(), and with zero
runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it
can't get more runtime and enters tg_throttle_down(), where the warning
is hit due to remaining tasks in the limbo list.
I think it's a chaos to trigger throttle on unthrottle path, the status
of a being unthrottled cfs_rq can be in a mixed state in the end, so fix
this by granting 1ns to cfs_rq in tg_set_cfs_bandwidth(). This ensures
cfs_rq_c has a positive runtime_remaining when initialized as unthrottled
and cannot enter tg_unthrottle_up() with zero runtime_remaining.
Also, update outdated comments in tg_throttle_down() since
unthrottle_cfs_rq() is no longer called with zero runtime_remaining.
While at it, remove a redundant assignment to se in tg_throttle_down().
Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
Reviewed-By: Benjamin Segall <bsegall@google.com>
Suggested-by: Benjamin Segall <bsegall@google.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Hao Jia <jiahao1@lixiang.com>
Link: https://patch.msgid.link/20251030032755.560-1-ziqianlu@bytedance.com
|
|
races
The warned bitfields in struct scx_sched are updated racily from concurrent
CPUs causing RMW races, which is fine for these boolean warning flags. Add a
comment marking this area to prevent future fields that can't tolerate racy
updates from being added here.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
- Use memset() in scx_task_iter_start() instead of zeroing fields individually.
- In scx_task_iter_next(), move __scx_task_iter_maybe_relock() after the batch
check which is simpler.
- Update comment to reflect that tasks are removed from scx_tasks when dead
(commit 7900aa699c34 ("sched_ext: Fix cgroup exit ordering by moving
sched_ext_free() to finish_task_switch()")).
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The BUILD_BUG_ON() which checks that __SCX_DSQ_ITER_ALL_FLAGS doesn't
overlap with the private lnode bits was in scx_task_iter_start() which has
nothing to do with DSQ iteration. Move it to bpf_iter_scx_dsq_new() where it
belongs.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
After removing the various condition bits earlier it turns out that one
extra information is needed to avoid setting event::sched_switch and
TIF_NOTIFY_RESUME unconditionally on every context switch.
The update of the RSEQ user space memory is only required, when either
the task was interrupted in user space and schedules
or
the CPU or MM CID changes in schedule() independent of the entry mode
Right now only the interrupt from user information is available.
Add an event flag, which is set when the CPU or MM CID or both change.
Evaluate this event in the scheduler to decide whether the sched_switch
event and the TIF bit need to be set.
It's an extra conditional in context_switch(), but the downside of
unconditionally handling RSEQ after a context switch to user is way more
significant. The utilized boolean logic minimizes this to a single
conditional branch.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.578058898@linutronix.de
|
|
Since commit 0190e4198e47 ("rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_*
flags") the bits in task::rseq_event_mask are meaningless and just extra
work in terms of setting them individually.
Aside of that the only relevant point where an event has to be raised is
context switch. Neither the CPU nor MM CID can change without going through
a context switch.
Collapse them all into a single boolean which simplifies the code a lot and
remove the pointless invocations which have been sprinkled all over the
place for no value.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.336978188@linutronix.de
|
|
finish_task_switch()
sched_ext_free() was called from __put_task_struct() when the last reference
to the task is dropped, which could be long after the task has finished
running. This causes cgroup-related problems:
- ops.init_task() can be called on a cgroup which didn't get ops.cgroup_init()'d
during scheduler load, because the cgroup might be destroyed/unlinked
while the zombie or dead task is still lingering on the scx_tasks list.
- ops.cgroup_exit() could be called before ops.exit_task() is called on all
member tasks, leading to incorrect exit ordering.
Fix by moving it to finish_task_switch() to be called right after the final
context switch away from the dying task, matching when sched_class->task_dead()
is called. Rename it to sched_ext_dead() to match the new calling context.
By calling sched_ext_dead() before cgroup_task_dead(), we ensure that:
- Tasks visible on scx_tasks list have valid cgroups during scheduler load,
as cgroup_mutex prevents cgroup destruction while the task is still linked.
- All member tasks have ops.exit_task() called and are removed from scx_tasks
before the cgroup can be destroyed and trigger ops.cgroup_exit().
This fix is made possible by the cgroup_task_dead() split in the previous patch.
This also makes more sense resource-wise as there's no point in keeping
scheduler side resources around for dead tasks.
Reported-by: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-6.19
Pull cgroup/for-6.19 to receive:
16dad7801aad ("cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()")
260fbcb92bbe ("cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()")
d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out")
These are needed for the sched_ext cgroup exit ordering fix.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
When a task exits, css_set_move_task(tsk, cset, NULL, false) unlinks the task
from its cgroup. From the cgroup's perspective, the task is now gone. If this
makes the cgroup empty, it can be removed, triggering ->css_offline() callbacks
that notify controllers the cgroup is going offline resource-wise.
However, the exiting task can still run, perform memory operations, and schedule
until the final context switch in finish_task_switch(). This creates a confusing
situation where controllers are told a cgroup is offline while resource
activities are still happening in it. While this hasn't broken existing
controllers, it has caused direct confusion for sched_ext schedulers.
Split cgroup_task_exit() into two functions. cgroup_task_exit() now only calls
the subsystem exit callbacks and continues to be called from do_exit(). The
css_set cleanup is moved to the new cgroup_task_dead() which is called from
finish_task_switch() after the final context switch, so that the cgroup only
appears empty after the task is truly done running.
This also reorders operations so that subsys->exit() is now called before
unlinking from the cgroup, which shouldn't break anything.
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The current names cgroup_exit(), cgroup_release(), and cgroup_free() are
confusing because they look like they're operating on cgroups themselves when
they're actually task lifecycle hooks. For example, cgroup_init() initializes
the cgroup subsystem while cgroup_exit() is a task exit notification to
cgroup. Rename them to cgroup_task_exit(), cgroup_task_release(), and
cgroup_task_free() to make it clear that these operate on tasks.
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Crystal reports that the PCIe Advanced Error Reporting driver gets stuck
in an infinite loop on PREEMPT_RT:
Both the primary interrupt handler aer_irq() as well as the secondary
handler aer_isr() are forced into threads with identical priority.
Crystal writes that on the ARM system in question, the primary handler
has to clear an error in the Root Error Status register...
"before the next error happens, or else the hardware will set the
Multiple ERR_COR Received bit. If that bit is set, then aer_isr()
can't rely on the Error Source Identification register, so it scans
through all devices looking for errors -- and for some reason, on
this system, accessing the AER registers (or any Config Space above
0x400, even though there are capabilities located there) generates
an Unsupported Request Error (but returns valid data). Since this
happens more than once, without aer_irq() preempting, it causes
another multi error and we get stuck in a loop."
The issue does not show on non-PREEMPT_RT because the primary handler
runs in hardirq context and thus can preempt the threaded secondary
handler, clear the Root Error Status register and prevent the secondary
handler from getting stuck.
Emulate the same behavior on PREEMPT_RT by assigning a lower default
priority to the secondary handler if the primary handler is forced into
a thread.
Reported-by: Crystal Wood <crwood@redhat.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Crystal Wood <crwood@redhat.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/f6dcdb41be2694886b8dbf4fe7b3ab89e9d5114c.1761569303.git.lukas@wunner.de
Closes: https://lore.kernel.org/r/20250902224441.368483-1-crwood@redhat.com/
|
|
The ops.cpu_acquire/release() callbacks miss events under multiple conditions.
There are two distinct task dispatch gaps that can cause cpu_released flag
desynchronization:
1. balance-to-pick_task gap: This is what was originally reported. balance_scx()
can enqueue a task, but during consume_remote_task() when the rq lock is
released, a higher priority task can be enqueued and ultimately picked while
cpu_released remains false. This gap is closeable via RETRY_TASK handling.
2. ttwu-to-pick_task gap: ttwu() can directly dispatch a task to a CPU's local
DSQ. By the time the sched path runs on the target CPU, higher class tasks may
already be queued. In such cases, nothing on sched_ext side will be invoked,
and the only solution would be a hook invoked regardless of sched class, which
isn't desirable.
Rather than adding invasive core hooks, BPF schedulers can use generic BPF
mechanisms like tracepoints. From SCX scheduler's perspective, this is congruent
with other mechanisms it already uses and doesn't add further friction.
The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when
a CPU gets preempted by a higher priority scheduling class. However, the old
scx_bpf_reenqueue_local() could only be called from cpu_release() context.
Add a new version of scx_bpf_reenqueue_local() that can be called from any
context by deferring the actual re-enqueue operation. This eliminates the need
for cpu_acquire/release() ops entirely. Schedulers can now use standard BPF
mechanisms like the sched_switch tracepoint to detect and handle CPU preemption.
Update scx_qmap to demonstrate the new approach using sched_switch instead of
cpu_release, with compat support for older kernels. Mark cpu_acquire/release()
as deprecated. The old scx_bpf_reenqueue_local() variant will be removed in
v6.23.
Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
Link: https://lore.kernel.org/all/8d64c74118c6440f81bcf5a4ac6b9f00@honor.com/
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Factor out the core re-enqueue logic from scx_bpf_reenqueue_local() into a
new reenq_local() helper function. scx_bpf_reenqueue_local() now handles the
BPF kfunc checks and calls reenq_local() to perform the actual work.
This is a prep patch to allow reenq_local() to be called from other contexts.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
schedule_deferred() currently requires the rq lock to be held so that it can
use scheduler hooks for efficiency when available. However, there are cases
where deferred actions need to be scheduled from contexts that don't hold the
rq lock.
Split into schedule_deferred() which can be called from any context and just
queues irq_work, and schedule_deferred_locked() which requires the rq lock and
can optimize by using scheduler hooks when available. Update the existing call
site to use the _locked variant.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Pull to receive f4fa7c25f632 ("sched_ext: Fix use of uninitialized variable
in scx_bpf_cpuperf_set()") which conflicts with changes for planned
sub-sched changes.
|
|
scx_bpf_cpuperf_set() has a typo where it dereferences the local
variable @sch, instead of the global @scx_root pointer. Fix by
dereferencing the correct variable.
Fixes: 956f2b11a8a4f ("sched_ext: Drop kf_cpu_valid()")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|