| author | Tejun Heo <tj@kernel.org> | 2025-11-11 09:18:16 -1000 |
|---|---|---|
| committer | Tejun Heo <tj@kernel.org> | 2025-11-12 06:43:44 -1000 |
| commit | 95d1df610cdc7497510cc710435a5c8c4e3db606 | |
| tree | 2ab007466798638c6bf168eece5b44b267d4b75a | |
| parent | d18b96ce12becf3f3aa3556ba722c2de61aca94e | |
sched_ext: Implement load balancer for bypass mode
In bypass mode, tasks are queued on per-CPU bypass DSQs. While this works well
in most cases, there is a failure mode on highly over-saturated systems: a BPF
scheduler can skew task placement severely before bypass is triggered. If most
tasks end up concentrated on a few CPUs, those CPUs can accumulate queues too
long to drain in a reasonable amount of time, leading to RCU stalls and hung
tasks.
Implement a simple timer-based load balancer that redistributes tasks across
CPUs within each NUMA node. The balancer runs periodically (every 500ms by
default, tunable via the bypass_lb_intv_us module parameter) and moves tasks
from overloaded CPUs to underloaded ones.
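As a rough illustration, the timer callback could be structured as below. This
is a hedged sketch, not the actual kernel code: apart from scx_bypass_depth
and bypass_lb_intv_us, every name here (scx_bypass_lb_timer_fn,
bypass_lb_node()) is hypothetical, and the real implementation may use a
different timer facility.

```c
#include <linux/timer.h>
#include <linux/jiffies.h>
#include <linux/nodemask.h>

static void scx_bypass_lb_timer_fn(struct timer_list *timer)
{
	int node;

	/* Lockless read; pairs with WRITE_ONCE() where bypass is toggled. */
	if (!READ_ONCE(scx_bypass_depth))
		goto rearm;

	/*
	 * Balance each NUMA node independently: within a node, move tasks
	 * from bypass DSQs that are longer than the node average to ones
	 * that are shorter.
	 */
	for_each_node_state(node, N_CPU)
		bypass_lb_node(node);		/* hypothetical helper */
rearm:
	if (bypass_lb_intv_us)
		mod_timer(timer, jiffies +
			  usecs_to_jiffies(bypass_lb_intv_us));
}
```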
When moving tasks between bypass DSQs, the load balancer holds nested DSQ locks
to avoid dropping and reacquiring the donor DSQ lock on each iteration, as
donor DSQs can be very long and highly contended. Add the SCX_ENQ_NESTED flag
and use raw_spin_lock_nested() in dispatch_enqueue() to support this. The load
balancer timer function reads scx_bypass_depth locklessly to check whether
bypass mode is active. Use WRITE_ONCE() when updating scx_bypass_depth to pair
with the READ_ONCE() in the timer function.
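A minimal sketch of the lock-acquisition path this describes, with a
simplified signature; only SCX_ENQ_NESTED and raw_spin_lock_nested() are named
in the commit, the rest is illustrative:

```c
static void dispatch_enqueue(struct scx_dispatch_q *dsq,
			     struct task_struct *p, u64 enq_flags)
{
	if (enq_flags & SCX_ENQ_NESTED)
		/*
		 * The load balancer already holds the donor DSQ lock, so
		 * acquire the target DSQ lock as a nested acquisition to
		 * satisfy lockdep.
		 */
		raw_spin_lock_nested(&dsq->lock, SINGLE_DEPTH_NESTING);
	else
		raw_spin_lock(&dsq->lock);

	/* ... link @p onto @dsq and release dsq->lock ... */
}
```

Holding the donor lock across the whole move avoids a drop/reacquire cycle per
task, which matters when the donor DSQ holds thousands of entries.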
This has been tested on a 192-CPU dual-socket AMD EPYC machine with ~20k
runnable tasks running scx_cpu0. As scx_cpu0 queues all tasks to CPU0, almost
all tasks end up on CPU0, creating a severe imbalance. Without the load
balancer, disabling the scheduler can take a very long time and lead to RCU
stalls and hung tasks. With the load balancer, disabling completes in about a
second.
The load balancing operation can be monitored using the sched_ext_bypass_lb
tracepoint and disabled by setting bypass_lb_intv_us to 0.
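For reference, a knob like this is usually declared with module_param(); the
exact default, type, and permissions below are assumptions:

```c
#include <linux/moduleparam.h>
#include <linux/time64.h>

/* Assumed declaration; actual type and permissions may differ. */
static unsigned long long bypass_lb_intv_us = 500 * USEC_PER_MSEC;
module_param(bypass_lb_intv_us, ullong, 0644);
MODULE_PARM_DESC(bypass_lb_intv_us,
		 "bypass load balancing interval in usecs (0 to disable)");
```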
v2: Lock both rq and DSQ in bypass_lb_cpu() and use dispatch_dequeue_locked()
to prevent races with dispatch_dequeue() (Andrea Righi).
Cc: Andrea Righi <arighi@nvidia.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
