<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/kernel/sched, branch v6.17.8</title>
<subtitle>Linux Kernel</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.17.8</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.17.8'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2025-11-13T20:36:36Z</updated>
<entry>
<title>sched_ext: Mark scx_bpf_dsq_move_set_[slice|vtime]() with KF_RCU</title>
<updated>2025-11-13T20:36:36Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2025-10-07T01:35:36Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ef215ad4081e8c74dd80895447037d3deac7edfb'/>
<id>urn:sha1:ef215ad4081e8c74dd80895447037d3deac7edfb</id>
<content type='text'>
commit 54e96258a6930909b690fd7e8889749231ba8085 upstream.

scx_bpf_dsq_move_set_slice() and scx_bpf_dsq_move_set_vtime() take a DSQ
iterator argument which has to be valid. Mark them with KF_RCU.
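
A minimal sketch of the change, assuming the two kfuncs are registered in
the usual BTF_ID_FLAGS() tables in kernel/sched/ext.c:

  BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_slice, KF_RCU)
  BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_vtime, KF_RCU)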

Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Keep bypass on between enable failure and scx_disable_workfn()</title>
<updated>2025-11-02T13:18:03Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2025-09-03T21:33:28Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=09a75f371298a7142ad2eb0af04a571d4cf57e50'/>
<id>urn:sha1:09a75f371298a7142ad2eb0af04a571d4cf57e50</id>
<content type='text'>
[ Upstream commit 4a1d9d73aabc8f97f48c4f84f936de3b265ffd6f ]

scx_enable() turns on the bypass mode while enable is in progress. If
enabling fails, it turns off the bypass mode and then triggers scx_error().
scx_error() will trigger scx_disable_workfn() which will turn on the bypass
mode again and unload the failed scheduler.

This moves the system out of bypass mode between the enable error path and
the disable path, which is unnecessary and can be brittle - e.g. the thread
running scx_enable() may already be on the failed scheduler and can be
switched out before it triggers scx_error() leading to a stall. The watchdog
would eventually kick in, so the situation isn't critical but is still
suboptimal.

There is nothing to be gained by turning off the bypass mode between
scx_enable() failure and scx_disable_workfn(). Keep bypass on.
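
A rough before/after sketch of scx_enable()'s error path; the label and the
surrounding code are assumptions based on the description above:

  /* before */
  err_disable_unlock_all:
          scx_bypass(false);
          /* ... scx_error() then schedules scx_disable_workfn() ... */

  /* after: the scx_bypass(false) call is dropped, so the system stays in
   * bypass until scx_disable_workfn() itself turns bypass off once the
   * failed scheduler has been unloaded. */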

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Sync error_irq_work before freeing scx_sched</title>
<updated>2025-11-02T13:18:02Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2025-10-09T23:56:23Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=60407ac17238a802c79978ddae9e9f6a84b53026'/>
<id>urn:sha1:60407ac17238a802c79978ddae9e9f6a84b53026</id>
<content type='text'>
[ Upstream commit efeeaac9ae9763f9c953e69633c86bc3031e39b5 ]

By the time scx_sched_free_rcu_work() runs, the scx_sched is no longer
reachable. However, a previously queued error_irq_work may still be pending or
running. Ensure it completes before proceeding with teardown.
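
A minimal sketch of the fix, using the existing irq_work_sync() API and the
field name from the subject (the local variable name is illustrative):

  /* in scx_sched_free_rcu_work(), before freeing the scx_sched */
  irq_work_sync(&amp;sch-&gt;error_irq_work);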

Fixes: bff3b5aec1b7 ("sched_ext: Move disable machinery into scx_sched")
Acked-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Put event_stats_cpu in struct scx_sched_pcpu</title>
<updated>2025-11-02T13:18:02Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2025-09-03T21:33:28Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=733d0e11840efae81a26adfa4b3ec8e95fb7f718'/>
<id>urn:sha1:733d0e11840efae81a26adfa4b3ec8e95fb7f718</id>
<content type='text'>
[ Upstream commit bcb7c2305682c77a8bfdbfe37106b314ac10110f ]

scx_sched.event_stats_cpu holds the percpu counters that are used to track
stats. Introduce struct scx_sched_pcpu and move the counters inside. This
will ease adding more per-cpu fields. No functional changes.
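
A sketch of the intended layout (the field names are assumptions):

  struct scx_sched_pcpu {
          struct scx_event_stats  event_stats;
  };

  struct scx_sched {
          ...
          struct scx_sched_pcpu __percpu  *pcpu;
          ...
  };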

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Stable-dep-of: efeeaac9ae97 ("sched_ext: Sync error_irq_work before freeing scx_sched")
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Move internal type and accessor definitions to ext_internal.h</title>
<updated>2025-11-02T13:18:02Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2025-09-03T21:33:28Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=2efd07e29054cdf86b15f57765cf42ba6af1be6f'/>
<id>urn:sha1:2efd07e29054cdf86b15f57765cf42ba6af1be6f</id>
<content type='text'>
[ Upstream commit 0c2b8356e430229efef42b03bd765a2a7ecf73fd ]

There currently isn't a place for SCX-internal types and accessors that
need to be shared between ext.c and ext_idle.c. Create
kernel/sched/ext_internal.h and move the internal type and accessor
definitions there. This trims ext.c a bit and makes future additions
easier. Pure code reorganization. No functional changes.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Stable-dep-of: efeeaac9ae97 ("sched_ext: Sync error_irq_work before freeing scx_sched")
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched: Remove never used code in mm_cid_get()</title>
<updated>2025-10-29T13:10:29Z</updated>
<author>
<name>Andy Shevchenko</name>
<email>andriy.shevchenko@linux.intel.com</email>
</author>
<published>2025-10-15T09:19:34Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=5975ce4a02f4701d07acda4c67e07472ad601310'/>
<id>urn:sha1:5975ce4a02f4701d07acda4c67e07472ad601310</id>
<content type='text'>
[ Upstream commit 53abe3e1c154628cc74e33a1bfcd865656e433a5 ]

Clang is not happy with a set-but-unused variable (visible with a
`make W=1` build):

  kernel/sched/sched.h:3744:18: error: variable 'cpumask' set but not used [-Werror,-Wunused-but-set-variable]

It seems the variable was never used, and the assignment has no side
effects as far as I can see. Remove both altogether.
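
For reference, the offending lines in mm_cid_get() look roughly like this
(simplified); both the declaration and the assignment are removed:

  struct cpumask *cpumask;
  ...
  cpumask = mm_cidmask(mm);   /* set but never read, no side effects */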

Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
Signed-off-by: Andy Shevchenko &lt;andriy.shevchenko@linux.intel.com&gt;
Tested-by: Eric Biggers &lt;ebiggers@kernel.org&gt;
Reviewed-by: Breno Leitao &lt;leitao@debian.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/fair: Block delayed tasks on throttled hierarchy during dequeue</title>
<updated>2025-10-29T13:10:15Z</updated>
<author>
<name>K Prateek Nayak</name>
<email>kprateek.nayak@amd.com</email>
</author>
<published>2025-10-23T04:03:59Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=514784d00832d23a6a7953ed3694306e5af87b03'/>
<id>urn:sha1:514784d00832d23a6a7953ed3694306e5af87b03</id>
<content type='text'>
Dequeuing a fair task on a throttled hierarchy returns early upon
encountering a throttled cfs_rq, since the throttle path has already
dequeued the hierarchy above and adjusted the h_nr_* accounting up to the
root cfs_rq.

dequeue_entities() crucially misses calling __block_task() for delayed
tasks being dequeued on throttled hierarchies, but this was mostly
harmless until commit b7ca5743a260 ("sched/core: Tweak
wait_task_inactive() to force dequeue sched_delayed tasks"), since all
existing cases would re-enqueue the task if task_on_rq_queued() returned
true, and the task would eventually be blocked at pick after the
hierarchy was unthrottled.

wait_task_inactive() is special, as it expects the delayed task on a
throttled hierarchy to reach the blocked state on dequeue, but since
__block_task() is never called, task_on_rq_queued() continues to return
true. Furthermore, since the task is now off the hierarchy, the pick
never reaches it to fully block the task even after unthrottle, leading
to wait_task_inactive() looping endlessly.

Remedy this by calling __block_task() if a delayed task is being
dequeued on a throttled hierarchy.
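
A simplified sketch of the idea in dequeue_entities(); the exact condition
and surrounding structure are assumptions:

  if (cfs_rq_throttled(cfs_rq)) {
          /*
           * The throttle path already dequeued the hierarchy above; make
           * sure a delayed task still reaches the blocked state so that
           * task_on_rq_queued() stops returning true.
           */
          if (p &amp;&amp; p-&gt;se.sched_delayed)
                  __block_task(rq, p);
          return 0;       /* keep the existing early return */
  }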

This fix is only required for stable kernels implementing delay dequeue
(&gt;= v6.12) before v6.18, since upstream commit e1fad12dcb66 ("sched/fair:
Switch to task based throttle model") indirectly fixes this by removing
the early return conditions in dequeue_entities() as part of the per-task
throttle feature.

Cc: stable@vger.kernel.org
Reported-by: Matt Fleming &lt;matt@readmodwrite.com&gt;
Closes: https://lore.kernel.org/all/20250925133310.1843863-1-matt@readmodwrite.com/
Fixes: b7ca5743a260 ("sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks")
Tested-by: Matt Fleming &lt;mfleming@cloudflare.com&gt;
Signed-off-by: K Prateek Nayak &lt;kprateek.nayak@amd.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>sched/fair: Fix pelt lost idle time detection</title>
<updated>2025-10-23T14:24:35Z</updated>
<author>
<name>Vincent Guittot</name>
<email>vincent.guittot@linaro.org</email>
</author>
<published>2025-10-08T13:12:14Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=13aeb56dae45b1ed7773f49a2e376a7b5a2966cc'/>
<id>urn:sha1:13aeb56dae45b1ed7773f49a2e376a7b5a2966cc</id>
<content type='text'>
[ Upstream commit 17e3e88ed0b6318fde0d1c14df1a804711cab1b5 ]

The check for some lost idle pelt time should always be done when
pick_next_task_fair() fails to pick a task, and not only when it is
called from the fair fast path.

The case happens when the last running task on the rq is an RT or DL
task. When the latter goes to sleep and the \Sum of util_sum of the rq is
at the max value, we don't account the lost idle time whereas we should.

Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
Signed-off-by: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/deadline: Stop dl_server before CPU goes offline</title>
<updated>2025-10-23T14:24:35Z</updated>
<author>
<name>Peter Zijlstra (Intel)</name>
<email>peterz@infradead.org</email>
</author>
<published>2025-10-09T18:47:27Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ab6c0f158508bb16d483add70b73a73f95651c33'/>
<id>urn:sha1:ab6c0f158508bb16d483add70b73a73f95651c33</id>
<content type='text'>
[ Upstream commit ee6e44dfe6e50b4a5df853d933a96bdff5309e6e ]

The IBM CI tool reported a kernel warning[1] when running a CPU removal
operation through drmgr[2], i.e. "drmgr -c cpu -r -q 1":

WARNING: CPU: 0 PID: 0 at kernel/sched/cpudeadline.c:219 cpudl_set+0x58/0x170
NIP [c0000000002b6ed8] cpudl_set+0x58/0x170
LR [c0000000002b7cb8] dl_server_timer+0x168/0x2a0
Call Trace:
[c000000002c2f8c0] init_stack+0x78c0/0x8000 (unreliable)
[c0000000002b7cb8] dl_server_timer+0x168/0x2a0
[c00000000034df84] __hrtimer_run_queues+0x1a4/0x390
[c00000000034f624] hrtimer_interrupt+0x124/0x300
[c00000000002a230] timer_interrupt+0x140/0x320

Git bisects to: commit 4ae8d9aa9f9d ("sched/deadline: Fix dl_server getting stuck")

This happens since:
- dl_server hrtimer gets enqueued close to cpu offline, when
  kthread_park enqueues a fair task.
- CPU goes offline and drmgr removes it from cpu_present_mask.
- hrtimer fires and the warning is hit.

Fix it by stopping the dl_server before CPU is marked dead.
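
A minimal sketch of the idea; the exact hotplug hook is an assumption,
while dl_server_stop() and rq-&gt;fair_server are the existing dl_server API:

  /* in the CPU offline path, before the CPU is marked dead */
  dl_server_stop(&amp;rq-&gt;fair_server);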

[1]: https://lore.kernel.org/all/8218e149-7718-4432-9312-f97297c352b9@linux.ibm.com/
[2]: https://github.com/ibm-power-utilities/powerpc-utils/tree/next/src/drmgr

[sshegde: wrote the changelog and tested it]
Fixes: 4ae8d9aa9f9d ("sched/deadline: Fix dl_server getting stuck")
Closes: https://lore.kernel.org/all/8218e149-7718-4432-9312-f97297c352b9@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reported-by: Venkat Rao Bagalkote &lt;venkat88@linux.ibm.com&gt;
Signed-off-by: Shrikanth Hegde &lt;sshegde@linux.ibm.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Tested-by: Marek Szyprowski &lt;m.szyprowski@samsung.com&gt;
Tested-by: Shrikanth Hegde &lt;sshegde@linux.ibm.com&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/deadline: Fix race in push_dl_task()</title>
<updated>2025-10-19T14:37:30Z</updated>
<author>
<name>Harshit Agarwal</name>
<email>harshit@nutanix.com</email>
</author>
<published>2025-04-08T04:50:21Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c8e5a86acfd3007594febffb1876baf173c127ad'/>
<id>urn:sha1:c8e5a86acfd3007594febffb1876baf173c127ad</id>
<content type='text'>
commit 8fd5485fb4f3d9da3977fd783fcb8e5452463420 upstream.

When a CPU calls push_dl_task() and picks a task to push to another
CPU's runqueue, it calls find_lock_later_rq(), which takes a double lock
on both CPUs' runqueues. If one of the locks isn't readily available,
this may lead to dropping the current runqueue lock and reacquiring both
locks at once. During this window it is possible that the task has
already been migrated and is running on some other CPU. These cases are
already handled. However, if the task has been migrated, has already
finished running, and another CPU is now trying to wake it up (ttwu) so
that it is queued again on the runqueue (on_rq is 1), and the task was
last run by the same CPU, then the current checks will pass even though
the task was migrated out and is no longer in the pushable tasks list.
Please go through the original rt change for more details on the issue.

To fix this, after the lock is obtained inside find_lock_later_rq(),
ensure that the task is still at the head of the pushable tasks list.
Some checks that are no longer needed with the addition of this new check
are also removed. However, the new check against the pushable tasks list
only applies when find_lock_later_rq() is called from push_dl_task(). For
the other caller, i.e. dl_task_offline_migration(), the existing checks
are used.
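
A simplified sketch of the new check; the surrounding code is an
assumption, while pick_next_pushable_dl_task() and double_unlock_balance()
are existing helpers:

  /* in find_lock_later_rq(), after the double lock has been (re)taken */
  if (task != pick_next_pushable_dl_task(rq)) {
          /* the task migrated away while the locks were dropped */
          double_unlock_balance(rq, later_rq);
          later_rq = NULL;
  }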

Signed-off-by: Harshit Agarwal &lt;harshit@nutanix.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250408045021.3283624-1-harshit@nutanix.com
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
</feed>
