path: root/kernel/sched
7 days  sched_ext: Use resched_cpu() to replace resched_curr() in bypass_lb_node()  (Zqiang)
For PREEMPT_RT kernels, scx_bypass_lb_timerfn() runs in the preemptible per-CPU ktimer kthread context, which means the following scenario can occur (on the x86 platform):

    ktimer kthread:
      ->scx_bypass_lb_timerfn
        ->bypass_lb_node
          ->for_each_cpu(cpu, resched_mask)
    ... preempted; migration/1 and migration/2 run multi_cpu_stop():
        ->take_cpu_down()
          ->__cpu_disable()
            ->set cpu1 offline
    ... ktimer kthread resumes the loop:
      ->rq1 = cpu_rq(cpu1)
      ->resched_curr(rq1)
        ->smp_send_reschedule(cpu1)
          ->native_smp_send_reschedule(cpu1)
            ->if (unlikely(cpu_is_offline(cpu))) {
                  WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu);
                  return;
              }

Therefore use resched_cpu() instead of resched_curr() in bypass_lb_node() to avoid sending an IPI to an offline CPU.

Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
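For reference, resched_cpu() takes the target runqueue's lock and only kicks CPUs that are still online, which is what avoids the WARN above; roughly (a simplified sketch of the kernel/sched/core.c helper, details may differ):

    void resched_cpu(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);
            unsigned long flags;

            raw_spin_rq_lock_irqsave(rq, flags);
            /* Don't IPI a CPU that has already gone offline. */
            if (cpu_online(cpu) || cpu == smp_processor_id())
                    resched_curr(rq);
            raw_spin_rq_unlock_irqrestore(rq, flags);
    }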
11 days  sched_ext: Fix some comments in ext.c  (Zqiang)
Update the comments that still refer to balance_scx() to refer to balance_one() instead.

Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
14 days  sched_ext: fix uninitialized ret on alloc_percpu() failure  (Liang Jie)
Smatch reported: kernel/sched/ext.c:5332 scx_alloc_and_add_sched() warn: passing zero to 'ERR_PTR' In scx_alloc_and_add_sched(), the alloc_percpu() failure path jumps to err_free_gdsqs without initializing @ret. That can lead to returning ERR_PTR(0), which violates the ERR_PTR() convention and confuses callers. Set @ret to -ENOMEM before jumping to the error path when alloc_percpu() fails. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/r/202512141601.yAXDAeA9-lkp@intel.com/ Reported-by: Dan Carpenter <error27@gmail.com> Fixes: c201ea1578d3 ("sched_ext: Move event_stats_cpu into scx_sched") Signed-off-by: Liang Jie <liangjie@lixiang.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
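The fix follows the usual pattern for error paths that funnel into a shared label; a minimal sketch of the shape described above (struct and field names abbreviated, not the exact ext.c code):

    sch->event_stats_cpu = alloc_percpu(struct scx_event_stats);
    if (!sch->event_stats_cpu) {
            ret = -ENOMEM;          /* was missing: ERR_PTR(0) could escape */
            goto err_free_gdsqs;
    }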
2025-12-15  sched_ext: Remove unused code in the do_pick_task_scx()  (Zqiang)
The kick_idle variable is no longer used; remove it along with the associated code in do_pick_task_scx().

Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-12  sched_ext: Fix missing post-enqueue handling in move_local_task_to_local_dsq()  (Tejun Heo)
move_local_task_to_local_dsq() is used when moving a task from a non-local DSQ to a local DSQ on the same CPU. It directly manipulates the local DSQ without going through dispatch_enqueue() and was missing the post-enqueue handling that triggers preemption when SCX_ENQ_PREEMPT is set or the idle task is running. The function is used by move_task_between_dsqs() which backs scx_bpf_dsq_move() and may be called while the CPU is busy. Add local_dsq_post_enq() call to move_local_task_to_local_dsq(). As the dispatch path doesn't need post-enqueue handling, add SCX_RQ_IN_BALANCE early exit to keep consume_dispatch_q() behavior unchanged and avoid triggering unnecessary resched when scx_bpf_dsq_move() is used from the dispatch path. Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()") Cc: stable@vger.kernel.org # v6.12+ Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
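A sketch of the factored-out helper with the early exit described above (names taken from the commit text; the exact ext.c code may differ), which move_local_task_to_local_dsq() now calls after linking the task to the local DSQ:

    static void local_dsq_post_enq(struct rq *rq, struct task_struct *p,
                                   u64 enq_flags)
    {
            /*
             * The dispatch path does not need post-enqueue handling; bail
             * early so consume_dispatch_q() behavior stays unchanged.
             */
            if (rq->scx.flags & SCX_RQ_IN_BALANCE)
                    return;

            /* Kick the CPU if preemption was requested or it is idling. */
            if ((enq_flags & SCX_ENQ_PREEMPT) || rq->curr == rq->idle)
                    resched_curr(rq);
    }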
2025-12-12  sched_ext: Factor out local_dsq_post_enq() from dispatch_enqueue()  (Tejun Heo)
Factor out local_dsq_post_enq() which performs post-enqueue handling for local DSQs - triggering resched_curr() if SCX_ENQ_PREEMPT is specified or if the current CPU is idle. No functional change. This will be used by the next patch to fix move_local_task_to_local_dsq(). Cc: stable@vger.kernel.org # v6.12+ Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-11  sched_ext: Fix bypass depth leak on scx_enable() failure  (Tejun Heo)
scx_enable() calls scx_bypass(true) to initialize in bypass mode and then scx_bypass(false) on success to exit. If scx_enable() fails during task initialization - e.g. scx_cgroup_init() or scx_init_task() returns an error - it jumps to err_disable while bypass is still active. scx_disable_workfn() then calls scx_bypass(true/false) for its own bypass, leaving the bypass depth at 1 instead of 0. This causes the system to remain permanently in bypass mode after a failed scx_enable(). Failures after task initialization is complete - e.g. scx_tryset_enable_state() at the end - already call scx_bypass(false) before reaching the error path and are not affected. This only affects a subset of failure modes. Fix it by tracking whether scx_enable() called scx_bypass(true) in a bool and having scx_disable_workfn() call an extra scx_bypass(false) to clear it. This is a temporary measure as the bypass depth will be moved into the sched instance, which will make this tracking unnecessary. Fixes: 8c2090c504e9 ("sched_ext: Initialize in bypass mode") Cc: stable@vger.kernel.org # v6.12+ Reported-by: Chris Mason <clm@meta.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/stable/286e6f7787a81239e1ce2989b52391ce%40kernel.org Signed-off-by: Tejun Heo <tj@kernel.org>
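A minimal sketch of the interim fix, with a hypothetical flag name (the real patch threads this state between scx_enable() and scx_disable_workfn()):

    static bool scx_enable_holds_bypass;            /* hypothetical name */

    /* scx_enable(): */
    scx_bypass(true);
    scx_enable_holds_bypass = true;
    /* ... on the success path: */
    scx_bypass(false);
    scx_enable_holds_bypass = false;

    /* scx_disable_workfn(): drop the level leaked by a failed enable */
    if (scx_enable_holds_bypass) {
            scx_enable_holds_bypass = false;
            scx_bypass(false);
    }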
2025-12-08  sched/ext: Avoid null ptr traversal when ->put_prev_task() is called with NULL next  (John Stultz)
Early on, when trying to get sched_ext and proxy-exec working together, I kept tripping over a NULL ptr dereference in put_prev_task_scx() on the line:

    if (sched_class_above(&ext_sched_class, next->sched_class)) {

which was due to put_prev_task() passing a NULL next:

    prev->sched_class->put_prev_task(rq, prev, NULL);

put_prev_task_scx() already guards against a NULL next in the switch_class case, but doesn't seem to have a guard for the sched_class_above() check. I can't say I understand why this doesn't usually trip without proxy-exec. And in newer kernels there are far fewer put_prev_task() calls, and I can't easily reproduce the issue now even with proxy-exec. But we still have one put_prev_task() call left in core.c that seems like it could trip this, so I wanted to send this out for consideration.

tj: put_prev_task() can be called with NULL @next; however, when @p is queued, that doesn't happen, so this condition shouldn't currently be triggerable. The connection isn't straightforward or necessarily reliable, so add the NULL check even if it can't currently be triggered.

Link: http://lkml.kernel.org/r/20251206022218.1541878-1-jstultz@google.com
Signed-off-by: John Stultz <jstultz@google.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
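The guard itself is a one-line check; sketch:

    /* in put_prev_task_scx(); @next may legitimately be NULL */
    if (next && sched_class_above(&ext_sched_class, next->sched_class)) {
            /* ... existing switch-to-a-higher-class handling ... */
    }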
2025-12-08  sched_ext: Fix the memleak for sch->helper objects  (Zqiang)
Use kthread_destroy_worker() to release the sch->helper object and fix the following kmemleak:

    unreferenced object 0xffff888121ec7b00 (size 128):
      comm "scx_simple", pid 1197, jiffies 4295884415
      hex dump (first 32 bytes):
        00 00 00 00 00 00 00 00 00 00 00 00 ad 4e ad de  .............N..
        ff ff ff ff 00 00 00 00 ff ff ff ff ff ff ff ff  ................
      backtrace (crc 587b3352):
        kmemleak_alloc+0x62/0xa0
        __kmalloc_cache_noprof+0x28d/0x3e0
        kthread_create_worker_on_node+0xd5/0x1f0
        scx_enable.isra.210+0x6c2/0x25b0
        bpf_scx_reg+0x12/0x20
        bpf_struct_ops_link_create+0x2c3/0x3b0
        __sys_bpf+0x3102/0x4b00
        __x64_sys_bpf+0x79/0xc0
        x64_sys_call+0x15d9/0x1dd0
        do_syscall_64+0xf0/0x470
        entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: bff3b5aec1b7 ("sched_ext: Move disable machinery into scx_sched")
Cc: stable@vger.kernel.org # v6.16+
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
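The release side then pairs the worker creation with kthread_destroy_worker(), which flushes outstanding work, stops the kthread and frees the worker object that kmemleak was reporting; roughly:

    /* in the scx_sched teardown path (sketch) */
    if (sch->helper)
            kthread_destroy_worker(sch->helper);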
2025-12-06  Merge tag 'sched-urgent-2025-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull scheduler fixes from Ingo Molnar:
 "Miscellaneous scheduler fixes/cleanups:

  - Fix psi_dequeue() for Proxy Execution

  - Fix hrtick() vs. scheduling context bug

  - Fix unfairness caused by stalled tg_load_avg_contrib when the last
    task migrates out

  - Fix whitespace noise in headers

  - Remove a preempt-disable section in rt_mutex_setprio()"

* tag 'sched-urgent-2025-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Fix psi_dequeue() for Proxy Execution
  sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out
  sched/rt: Remove a preempt-disable section in rt_mutex_setprio()
  sched/hrtick: Fix hrtick() vs. scheduling context
  sched/headers: Remove whitespace noise from kernel/sched/sched.h
2025-12-06  sched/core: Fix psi_dequeue() for Proxy Execution  (John Stultz)
Currently, if the sleep flag is set, psi_dequeue() doesn't change any of the psi_flags. This is because psi_task_switch() will clear TSK_ONCPU as well as other potential flags (TSK_RUNNING), and the assumption is that a voluntary sleep always consists of a task being dequeued followed shortly thereafter by a psi_sched_switch() call.

Proxy Execution changes this expectation, as mutex-blocked tasks that would normally sleep stay on the runqueue. But in the case where the mutex-owning task goes to sleep, or the owner is on a remote cpu, we will then deactivate the blocked task shortly after. In that situation, the mutex-blocked task will have had its TSK_ONCPU cleared when it was switched off the cpu, but it will stay TSK_RUNNING. Then if we later dequeue it (as currently done if we hit a case find_proxy_task() can't yet handle, such as the case of the owner being on another rq or a sleeping owner) psi_dequeue() won't change any state (leaving it TSK_RUNNING), as it incorrectly expects a psi_task_switch() call to immediately follow.

Later on, when the task gets woken/re-enqueued and psi_flags are set for TSK_RUNNING, we hit an error as the task is already TSK_RUNNING:

    psi: inconsistent task state! task=188:kworker/28:0 cpu=28 psi_flags=4 clear=0 set=4

To resolve this, extend the logic in psi_dequeue() so that if the sleep flag is set, we also check whether psi_flags has TSK_ONCPU set (meaning the psi_task_switch() is imminent) before we do the shortcut return. If TSK_ONCPU is not set, that means we've already switched away, and this psi_dequeue() call needs to clear the flags.

Fixes: be41bde4c3a8 ("sched: Add an initial sketch of the find_proxy_task() function")
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Haiyue Wang <haiyuewa@163.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://patch.msgid.link/20251205012721.756394-1-jstultz@google.com
Closes: https://lore.kernel.org/lkml/20251117185550.365156-1-kprateek.nayak@amd.com/
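A simplified sketch of the extended check (the real psi_dequeue() in kernel/sched/stats.h also handles memstall bookkeeping, and its exact signature may differ):

    static inline void psi_dequeue(struct task_struct *p, bool sleep)
    {
            if (static_branch_likely(&psi_disabled))
                    return;

            /*
             * Voluntary sleep: normally psi_task_switch() follows right away
             * and clears TSK_ONCPU/TSK_RUNNING for us.  That only holds if
             * we are still the task running on the CPU, though.
             */
            if (sleep && (p->psi_flags & TSK_ONCPU))
                    return;

            /* Already switched away (proxy exec): clear the flags here. */
            psi_task_change(p, p->psi_flags, 0);
    }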
2025-12-06  sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out  (xupengbo)
When a task is migrated out, there is a probability that the tg->load_avg value will become abnormal. The reason is as follows:

 1. Due to the 1ms update period limitation in update_tg_load_avg(), there is a possibility that the reduced load_avg is not updated to tg->load_avg when a task migrates out.

 2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key function cfs_rq_is_decayed() does not check whether cfs_rq->tg_load_avg_contrib is zero. Consequently, in some cases, __update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been updated to tg->load_avg.

Add a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed(), which fixes case (2.) above.

Fixes: 1528c661c24b ("sched/fair: Ratelimit update to tg->load_avg")
Signed-off-by: xupengbo <xupengbo@oppo.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Link: https://patch.msgid.link/20250827022208.14487-1-xupengbo@oppo.com
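The added check is a single line; sketch of the relevant part of cfs_rq_is_decayed() (the surrounding checks are elided):

    /* in cfs_rq_is_decayed(), under CONFIG_FAIR_GROUP_SCHED */
    if (cfs_rq->tg_load_avg_contrib)
            return false;   /* contribution not yet folded back into tg->load_avg */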
2025-12-06  sched/rt: Remove a preempt-disable section in rt_mutex_setprio()  (Sebastian Andrzej Siewior)
rt_mutex_setprio() has only one caller: rt_mutex_adjust_prio(). It expects that task_struct::pi_lock and rt_mutex_base::wait_lock are held. Both locks are raw_spinlock_t and are acquired with disabled interrupts. Nevertheless rt_mutex_setprio() disables preemption while invoking __balance_callbacks() and raw_spin_rq_unlock(). Even if one of the balance callbacks unlocks the rq then it must not enable interrupts because rt_mutex_base::wait_lock is still locked. Therefore interrupts should remain disabled and disabling preemption is not needed. Commit 4c9a4bc89a9cc ("sched: Allow balance callbacks for check_class_changed()") adds a preempt-disable section to rt_mutex_setprio() and __sched_setscheduler(). In __sched_setscheduler() the preemption is disabled before rq is unlocked and interrupts enabled but I don't see why it makes a difference in rt_mutex_setprio(). Remove the preempt_disable() section from rt_mutex_setprio(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251127155529.t_sTatE4@linutronix.de
2025-12-06  sched/hrtick: Fix hrtick() vs. scheduling context  (Peter Zijlstra)
The sched_class::task_tick() method is called on the donor sched_class, and sched_tick() hands it rq->donor as argument, which is consistent. However, while hrtick() uses the donor sched_class, it then passes rq->curr, which is inconsistent. Fix it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20250918080205.442967033@infradead.org
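The one-line change in hrtick(), sketched:

    /* before: donor's sched_class, but rq->curr as the task argument */
    rq->donor->sched_class->task_tick(rq, rq->curr, 1);

    /* after: consistent with sched_tick(), which hands in rq->donor */
    rq->donor->sched_class->task_tick(rq, rq->donor, 1);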
2025-12-06  sched/headers: Remove whitespace noise from kernel/sched/sched.h  (Ingo Molnar)
A single case of space-Tab noise snuck in recently. Fixes: 36569780b0d6 ("sched: Change nr_uninterruptible type to unsigned long") Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/176478595428.498.13816176784792752599.tip-bot2@tip-bot2
2025-12-03  Merge tag 'sched_ext-for-6.19'  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Improve recovery from misbehaving BPF schedulers. When a scheduler puts many tasks with varying affinity restrictions on a shared DSQ, CPUs scanning through tasks they cannot run can overwhelm the system, causing lockups. Bypass mode now uses per-CPU DSQs with a load balancer to avoid this, and hooks into the hardlockup detector to attempt recovery. Add scx_cpu0 example scheduler to demonstrate this scenario. - Add lockless peek operation for DSQs to reduce lock contention for schedulers that need to query queue state during load balancing. - Allow scx_bpf_reenqueue_local() to be called from anywhere in preparation for deprecating cpu_acquire/release() callbacks in favor of generic BPF hooks. - Prepare for hierarchical scheduler support: add scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() kfuncs, make scx_bpf_dsq_insert*() return bool, and wrap kfunc args in structs for future aux__prog parameter. - Implement cgroup_set_idle() callback to notify BPF schedulers when a cgroup's idle state changes. - Fix migration tasks being incorrectly downgraded from stop_sched_class to rt_sched_class across sched_ext enable/disable. Applied late as the fix is low risk and the bug subtle but needs stable backporting. - Various fixes and cleanups including cgroup exit ordering, SCX_KICK_WAIT reliability, and backward compatibility improvements. * tag 'sched_ext-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (44 commits) sched_ext: Fix incorrect sched_class settings for per-cpu migration tasks sched_ext: tools: Removing duplicate targets during non-cross compilation sched_ext: Use kvfree_rcu() to release per-cpu ksyncs object sched_ext: Pass locked CPU parameter to scx_hardlockup() and add docs sched_ext: Update comments replacing breather with aborting mechanism sched_ext: Implement load balancer for bypass mode sched_ext: Factor out abbreviated dispatch dequeue into dispatch_dequeue_locked() sched_ext: Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR sched_ext: Add scx_cpu0 example scheduler sched_ext: Hook up hardlockup detector sched_ext: Make handle_lockup() propagate scx_verror() result sched_ext: Refactor lockup handlers into handle_lockup() sched_ext: Make scx_exit() and scx_vexit() return bool sched_ext: Exit dispatch and move operations immediately when aborting sched_ext: Simplify breather mechanism with scx_aborting flag sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode sched_ext: Refactor do_enqueue_task() local and global DSQ paths sched_ext: Use shorter slice in bypass mode sched_ext: Mark racy bitfields to prevent adding fields that can't tolerate races sched_ext: Minor cleanups to scx_task_iter ...
2025-12-03  Merge tag 'cgroup-for-6.19'  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Defer task cgroup unlink until after the dying task's final context switch so that controllers see the cgroup properly populated until the task is truly gone - cpuset cleanups and simplifications. Enforce that domain isolated CPUs stay in root or isolated partitions and fail if isolated+nohz_full would leave no housekeeping CPU. Fix sched/deadline root domain handling during CPU hot-unplug and race for tasks in attaching cpusets - Misc fixes including memory reclaim protection documentation and selftest KTAP conformance * tag 'cgroup-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits) cpuset: Treat cpusets in attaching as populated sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug cgroup/cpuset: Introduce cpuset_cpus_allowed_locked() docs: cgroup: No special handling of unpopulated memcgs docs: cgroup: Note about sibling relative reclaim protection docs: cgroup: Explain reclaim protection target selftests/cgroup: conform test to KTAP format output cpuset: remove need_rebuild_sched_domains cpuset: remove global remote_children list cpuset: simplify node setting on error cgroup: include missing header for struct irq_work cgroup: Fix sleeping from invalid context warning on PREEMPT_RT cgroup/cpuset: Globally track isolated_cpus update cgroup/cpuset: Ensure domain isolated CPUs stay in root or isolated partition cgroup/cpuset: Move up prstate_housekeeping_conflict() helper cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to update_isolation_cpumasks() cgroup: Defer task cgroup unlink until after the task is done switching out cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free() cgroup: Rename cgroup lifecycle hooks to cgroup_task_*() ...
2025-12-02  Merge tag 'pm-6.19-rc1'  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "There are quite a few interesting things here, including new hardware support, new features, some bug fixes and documentation updates. In addition, there are a usual bunch of minor fixes and cleanups all over. In the new hardware support category, there are intel_pstate and intel_rapl driver updates to support new processors, Panther Lake, Wildcat Lake, Noval Lake, and Diamond Rapids in the OOB mode, OPP and bandwidth allocation support in the tegra186 cpufreq driver, and JH7110S SOC support in dt-platdev cpufreq. The new features are the PM QoS CPU latency limit for suspend-to-idle, the netlink support for the energy model management, support for terminating system suspend via a wakeup event during the sync of file systems, configurable number of hibernation compression threads, the runtime PM auto-cleanup macros, and the "poweroff" PM event that is expected to be used during system shutdown. Bugs are mostly fixed in cpuidle governors, but there are also fixes elsewhere, like in the amd-pstate cpufreq driver. Documentation updates include, but are not limited to, a new doc on debugging shutdown hangs, cross-referencing fixes and cleanups in the intel_pstate documentation, and updates of comments in the core hibernation code. Specifics: - Introduce and document a QoS limit on CPU exit latency during wakeup from suspend-to-idle (Ulf Hansson) - Add support for building libcpupower statically (Zuo An) - Add support for sending netlink notifications to user space on energy model updates (Changwoo Mini, Peng Fan) - Minor improvements to the Rust OPP interface (Tamir Duberstein) - Fixes to scope-based pointers in the OPP library (Viresh Kumar) - Use residency threshold in polling state override decisions in the menu cpuidle governor (Aboorva Devarajan) - Add sanity check for exit latency and target residency in the cpufreq core (Rafael Wysocki) - Use this_cpu_ptr() where possible in the teo governor (Christian Loehle) - Rework the handling of tick wakeups in the teo cpuidle governor to increase the likelihood of stopping the scheduler tick in the cases when tick wakeups can be counted as non-timer ones (Rafael Wysocki) - Fix a reverse condition in the teo cpuidle governor and drop a misguided target residency check from it (Rafael Wysocki) - Clean up multiple minor defects in the teo cpuidle governor (Rafael Wysocki) - Update header inclusion to make it follow the Include What You Use principle (Andy Shevchenko) - Enable MSR-based RAPL PMU support in the intel_rapl power capping driver and arrange for using it on the Panther Lake and Wildcat Lake processors (Kuppuswamy Sathyanarayanan) - Add support for Nova Lake and Wildcat Lake processors to the intel_rapl power capping driver (Kaushlendra Kumar, Srinivas Pandruvada) - Add OPP and bandwidth support for Tegra186 (Aaron Kling) - Optimizations for parameter array handling in the amd-pstate cpufreq driver (Mario Limonciello) - Fix for mode changes with offline CPUs in the amd-pstate cpufreq driver (Gautham Shenoy) - Preserve freq_table_sorted across suspend/hibernate in the cpufreq core (Zihuan Zhang) - Adjust energy model rules for Intel hybrid platforms in the intel_pstate cpufreq driver and improve printing of debug messages in it (Rafael Wysocki) - Replace deprecated strcpy() in cpufreq_unregister_governor() (Thorsten Blum) - Fix duplicate hyperlink target errors in the intel_pstate cpufreq driver documentation and use :ref: 
directive for internal linking in it (Swaraj Gaikwad, Bagas Sanjaya) - Add Diamond Rapids OOB mode support to the intel_pstate cpufreq driver (Kuppuswamy Sathyanarayanan) - Use mutex guard for driver locking in the intel_pstate driver and eliminate some code duplication from it (Rafael Wysocki) - Replace udelay() with usleep_range() in ACPI cpufreq (Kaushlendra Kumar) - Minor improvements to various cpufreq drivers (Christian Marangi, Hal Feng, Jie Zhan, Marco Crivellari, Miaoqian Lin, and Shuhao Fu) - Replace snprintf() with scnprintf() in show_trace_dev_match() (Kaushlendra Kumar) - Fix memory allocation error handling in pm_vt_switch_required() (Malaya Kumar Rout) - Introduce CALL_PM_OP() macro and use it to simplify code in generic PM operations (Kaushlendra Kumar) - Add module param to backtrace all CPUs in the device power management watchdog (Sergey Senozhatsky) - Rework message printing in swsusp_save() (Rafael Wysocki) - Make it possible to change the number of hibernation compression threads (Xueqin Luo) - Clarify that only cgroup1 freezer uses PM freezer (Tejun Heo) - Add document on debugging shutdown hangs to PM documentation and correct a mistaken configuration option in it (Mario Limonciello) - Shut down wakeup source timer before removing the wakeup source from the list (Kaushlendra Kumar, Rafael Wysocki) - Introduce new PMSG_POWEROFF event for system shutdown handling with the help of PM device callbacks (Mario Limonciello) - Make pm_test delay interruptible by wakeup events (Riwen Lu) - Clean up kernel-doc comment style usage in the core hibernation code and remove unuseful comments from it (Sunday Adelodun, Rafael Wysocki) - Add support for handling wakeup events and aborting the suspend process while it is syncing file systems (Samuel Wu, Rafael Wysocki) - Add WQ_UNBOUND to pm_wq workqueue (Marco Crivellari) - Add runtime PM wrapper macros for ACQUIRE()/ACQUIRE_ERR() and use them in the PCI core and the ACPI TAD driver (Rafael Wysocki) - Improve runtime PM in the ACPI TAD driver (Rafael Wysocki) - Update pm_runtime_allow/forbid() documentation (Rafael Wysocki) - Fix typos in runtime.c comments (Malaya Kumar Rout) - Move governor.h from devfreq under include/linux/ and rename to devfreq-governor.h to allow devfreq governor definitions in out of drivers/devfreq/ (Dmitry Baryshkov) - Use min() to improve readability in tegra30-devfreq.c (Thorsten Blum) - Fix potential use-after-free issue of OPP handling in hisi_uncore_freq.c (Pengjie Zhang) - Fix typo in DFSO_DOWNDIFFERENTIAL macro name in governor_simpleondemand.c in devfreq (Riwen Lu)" * tag 'pm-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (96 commits) PM / devfreq: Fix typo in DFSO_DOWNDIFFERENTIAL macro name cpuidle: Warn instead of bailing out if target residency check fails cpuidle: Update header inclusion Documentation: power/cpuidle: Document the CPU system wakeup latency QoS cpuidle: Respect the CPU system wakeup QoS limit for cpuidle sched: idle: Respect the CPU system wakeup QoS limit for s2idle pmdomain: Respect the CPU system wakeup QoS limit for cpuidle pmdomain: Respect the CPU system wakeup QoS limit for s2idle PM: QoS: Introduce a CPU system wakeup QoS limit cpuidle: governors: teo: Add missing space to the description PM: hibernate: Extra cleanup of comments in swap handling code PM / devfreq: tegra30: use min to simplify actmon_cpu_to_emc_rate PM / devfreq: hisi: Fix potential UAF in OPP handling PM / devfreq: Move governor.h to a public header location powercap: 
intel_rapl: Enable MSR-based RAPL PMU support powercap: intel_rapl: Prepare read_raw() interface for atomic-context callers cpufreq: qcom-nvmem: fix compilation warning for qcom_cpufreq_ipq806x_match_list PM: sleep: Call pm_sleep_fs_sync() instead of ksys_sync_helper() PM: sleep: Add support for wakeup during filesystem sync cpufreq: ACPI: Replace udelay() with usleep_range() ...
2025-12-02  Merge tag 'timers-core-2025-11-30'  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer core updates from Thomas Gleixner: - Prevent a thundering herd problem when the timekeeper CPU is delayed and a large number of CPUs compete to acquire jiffies_lock to do the update. Limit it to one CPU with a separate "uncontended" atomic variable. - A set of improvements for the timer migration mechanism: - Support imbalanced NUMA trees correctly - Support dynamic exclusion of CPUs from the migrator duty to allow the cpuset/isolation mechanism to exclude them from handling timers of remote idle CPUs - The usual small updates, cleanups and enhancements * tag 'timers-core-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers/migration: Exclude isolated cpus from hierarchy cpumask: Add initialiser to use cleanup helpers sched/isolation: Force housekeeping if isolcpus and nohz_full don't leave any cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to update_isolation_cpumasks() timers/migration: Use scoped_guard on available flag set/clear timers/migration: Add mask for CPUs available in the hierarchy timers/migration: Rename 'online' bit to 'available' selftests/timers/nanosleep: Add tests for return of remaining time selftests/timers: Clean up kernel version check in posix_timers time: Fix a few typos in time[r] related code comments time: tick-oneshot: Add missing Return and parameter descriptions to kernel-doc hrtimer: Store time as ktime_t in restart block timers/migration: Remove dead code handling idle CPU checking for remote timers timers/migration: Remove unused "cpu" parameter from tmigr_get_group() timers/migration: Assert that hotplug preparing CPU is part of stable active hierarchy timers/migration: Fix imbalanced NUMA trees timers/migration: Remove locking on group connection timers/migration: Convert "while" loops to use "for" tick/sched: Limit non-timekeeper CPUs calling jiffies update
2025-12-02  Merge tag 'irq-core-2025-11-30'  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq core updates from Thomas Gleixner: "Updates for the interrupt core and treewide cleanups: - Rework of the Per Processor Interrupt (PPI) management on ARM[64] PPI support was built under the assumption that the systems are homogenous so that the same CPU local device types are connected to them. That's unfortunately wishful thinking and created horrible workarounds. This rework provides affinity management for PPIs so that they can be individually configured in the firmware tables and mops up the related drivers all over the place. - Prevent CPUSET/isolation changes to arbitrarily affine interrupt threads to random CPUs, which ignores user or driver settings. - Plug a harmless race in the interrupt affinity proc interface, which allows to see a half updated mask - Adjust the priority of secondary interrupt threads on RT, so that the combination of primary and secondary thread emulates the hardware interrupt plus thread scenario. Having them at the same priority can cause starvation issues in some drivers" * tag 'irq-core-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) genirq: Remove cpumask availability check on kthread affinity setting genirq: Fix interrupt threads affinity vs. cpuset isolated partitions genirq: Prevent early spurious wake-ups of interrupt threads genirq: Use raw_spinlock_irq() in irq_set_affinity_notifier() genirq/manage: Reduce priority of forced secondary interrupt handler genirq/proc: Fix race in show_irq_affinity() genirq: Fix percpu_devid irq affinity documentation perf: arm_pmu: Kill last use of per-CPU cpu_armpmu pointer irqdomain: Kill of_node_to_fwnode() helper genirq: Kill irq_{g,s}et_percpu_devid_partition() irqchip: Kill irq-partition-percpu irqchip/apple-aic: Drop support for custom PMU irq partitions irqchip/gic-v3: Drop support for custom PPI partitions coresight: trbe: Request specific affinities for per CPU interrupts perf: arm_spe_pmu: Request specific affinities for per CPU interrupts perf: arm_pmu: Request specific affinities for per CPU NMIs/interrupts genirq: Add request_percpu_irq_affinity() helper genirq: Allow per-cpu interrupt sharing for non-overlapping affinities genirq: Update request_percpu_nmi() to take an affinity genirq: Add affinity to percpu_devid interrupt requests ...
2025-12-02  Merge tag 'core-rseq-2025-11-30'  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull rseq updates from Thomas Gleixner: "A large overhaul of the restartable sequences and CID management: The recent enablement of RSEQ in glibc resulted in regressions which are caused by the related overhead. It turned out that the decision to invoke the exit to user work was not really a decision. More or less each context switch caused that. There is a long list of small issues which sums up nicely and results in a 3-4% regression in I/O benchmarks. The other detail which caused issues due to extra work in context switch and task migration is the CID (memory context ID) management. It also requires to use a task work to consolidate the CID space, which is executed in the context of an arbitrary task and results in sporadic uncontrolled exit latencies. The rewrite addresses this by: - Removing deprecated and long unsupported functionality - Moving the related data into dedicated data structures which are optimized for fast path processing. - Caching values so actual decisions can be made - Replacing the current implementation with a optimized inlined variant. - Separating fast and slow path for architectures which use the generic entry code, so that only fault and error handling goes into the TIF_NOTIFY_RESUME handler. - Rewriting the CID management so that it becomes mostly invisible in the context switch path. That moves the work of switching modes into the fork/exit path, which is a reasonable tradeoff. That work is only required when a process creates more threads than the cpuset it is allowed to run on or when enough threads exit after that. An artificial thread pool benchmarks which triggers this did not degrade, it actually improved significantly. The main effect in migration heavy scenarios is that runqueue lock held time and therefore contention goes down significantly" * tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) sched/mmcid: Switch over to the new mechanism sched/mmcid: Implement deferred mode change irqwork: Move data struct to a types header sched/mmcid: Provide CID ownership mode fixup functions sched/mmcid: Provide new scheduler CID mechanism sched/mmcid: Introduce per task/CPU ownership infrastructure sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutex sched/mmcid: Provide precomputed maximal value sched/mmcid: Move initialization out of line signal: Move MMCID exit out of sighand lock sched/mmcid: Convert mm CID mask to a bitmap cpumask: Cache num_possible_cpus() sched/mmcid: Use cpumask_weighted_or() cpumask: Introduce cpumask_weighted_or() sched/mmcid: Prevent pointless work in mm_update_cpus_allowed() sched/mmcid: Move scheduler code out of global header sched: Fixup whitespace damage sched/mmcid: Cacheline align MM CID storage sched/mmcid: Use proper data structures sched/mmcid: Revert the complex CID management ...
2025-12-01  Merge tag 'sched-core-2025-12-01'  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Scalability and load-balancing improvements: - Enable scheduler feature NEXT_BUDDY (Mel Gorman) - Reimplement NEXT_BUDDY to align with EEVDF goals (Mel Gorman) - Skip sched_balance_running cmpxchg when balance is not due (Tim Chen) - Implement generic code for architecture specific sched domain NUMA distances (Tim Chen) - Optimize the NUMA distances of the sched-domains builds of Intel Granite Rapids (GNR) and Clearwater Forest (CWF) platforms (Tim Chen) - Implement proportional newidle balance: a randomized algorithm that runs newidle balancing proportional to its success rate. (Peter Zijlstra) Scheduler infrastructure changes: - Implement the 'sched_change' scoped_guard() pattern for the entire scheduler (Peter Zijlstra) - More broadly utilize the sched_change guard (Peter Zijlstra) - Add support to pick functions to take runqueue-flags (Joel Fernandes) - Provide and use set_need_resched_current() (Peter Zijlstra) Fair scheduling enhancements: - Forfeit vruntime on yield (Fernand Sieber) - Only update stats for allowed CPUs when looking for dst group (Adam Li) CPU-core scheduling enhancements: - Optimize core cookie matching check (Fernand Sieber) Deadline scheduler fixes: - Only set free_cpus for online runqueues (Doug Berger) - Fix dl_server time accounting (Peter Zijlstra) - Fix dl_server stop condition (Peter Zijlstra) Proxy scheduling fixes: - Yield the donor task (Fernand Sieber) Fixes and cleanups: - Fix do_set_cpus_allowed() locking (Peter Zijlstra) - Fix migrate_disable_switch() locking (Peter Zijlstra) - Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked() (Hao Jia) - Increase sched_tick_remote timeout (Phil Auld) - sched/deadline: Use cpumask_weight_and() in dl_bw_cpus() (Shrikanth Hegde) - sched/deadline: Clean up select_task_rq_dl() (Shrikanth Hegde)" * tag 'sched-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits) sched: Provide and use set_need_resched_current() sched/fair: Proportional newidle balance sched/fair: Small cleanup to update_newidle_cost() sched/fair: Small cleanup to sched_balance_newidle() sched/fair: Revert max_newidle_lb_cost bump sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals sched/fair: Enable scheduler feature NEXT_BUDDY sched: Increase sched_tick_remote timeout sched/fair: Have SD_SERIALIZE affect newidle balancing sched/fair: Skip sched_balance_running cmpxchg when balance is not due sched/deadline: Minor cleanup in select_task_rq_dl() sched/deadline: Use cpumask_weight_and() in dl_bw_cpus sched/deadline: Document dl_server sched/deadline: Fix dl_server stop condition sched/deadline: Fix dl_server time accounting sched/core: Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked() sched/eevdf: Fix min_vruntime vs avg_vruntime sched/core: Add comment explaining force-idle vruntime snapshots sched/core: Optimize core cookie matching check sched/proxy: Yield the donor task ...
2025-12-01  Merge tag 'locking-core-2025-12-01'  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "Mutexes: - Redo __mutex_init() to reduce generated code size (Sebastian Andrzej Siewior) Seqlocks: - Introduce scoped_seqlock_read() (Peter Zijlstra) - Change thread_group_cputime() to use scoped_seqlock_read() (Oleg Nesterov) - Change do_task_stat() to use scoped_seqlock_read() (Oleg Nesterov) - Change do_io_accounting() to use scoped_seqlock_read() (Oleg Nesterov) - Fix the incorrect documentation of read_seqbegin_or_lock() / need_seqretry() (Oleg Nesterov) - Allow KASAN to fail optimizing (Peter Zijlstra) Local lock updates: - Fix all kernel-doc warnings (Randy Dunlap) - Add the <linux/local_lock*.h> headers to MAINTAINERS (Sebastian Andrzej Siewior) - Reduce the risk of shadowing via s/l/__l/ and s/tl/__tl/ (Vincent Mailhol) Lock debugging: - spinlock/debug: Fix data-race in do_raw_write_lock (Alexander Sverdlin) Atomic primitives infrastructure: - atomic: Skip alignment check for try_cmpxchg() old arg (Arnd Bergmann) Rust runtime integration: - sync: atomic: Enable generated Atomic<T> usage (Boqun Feng) - sync: atomic: Implement Debug for Atomic<Debug> (Boqun Feng) - debugfs: Remove Rust native atomics and replace them with Linux versions (Boqun Feng) - debugfs: Implement Reader for Mutex<T> only when T is Unpin (Boqun Feng) - lock: guard: Add T: Unpin bound to DerefMut (Daniel Almeida) - lock: Pin the inner data (Daniel Almeida) - lock: Add a Pin<&mut T> accessor (Daniel Almeida)" * tag 'locking-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/local_lock: Fix all kernel-doc warnings locking/local_lock: s/l/__l/ and s/tl/__tl/ to reduce the risk of shadowing locking/local_lock: Add the <linux/local_lock*.h> headers to MAINTAINERS locking/mutex: Redo __mutex_init() to reduce generated code size rust: debugfs: Replace the usage of Rust native atomics rust: sync: atomic: Implement Debug for Atomic<Debug> rust: sync: atomic: Make Atomic*Ops pub(crate) seqlock: Allow KASAN to fail optimizing rust: debugfs: Implement Reader for Mutex<T> only when T is Unpin seqlock: Change do_io_accounting() to use scoped_seqlock_read() seqlock: Change do_task_stat() to use scoped_seqlock_read() seqlock: Change thread_group_cputime() to use scoped_seqlock_read() seqlock: Introduce scoped_seqlock_read() documentation: seqlock: fix the wrong documentation of read_seqbegin_or_lock/need_seqretry atomic: Skip alignment check for try_cmpxchg() old arg rust: lock: Add a Pin<&mut T> accessor rust: lock: Pin the inner data rust: lock: guard: Add T: Unpin bound to DerefMut locking/spinlock/debug: Fix data-race in do_raw_write_lock
2025-12-01  sched_ext: Fix incorrect sched_class settings for per-cpu migration tasks  (Zqiang)
When loading a BPF scheduler, the tasks on the scx_tasks list are traversed and __setscheduler_class() is invoked to pick each task's new sched_class. However, this also incorrectly sets the per-CPU migration tasks' ->sched_class to rt_sched_class, and even after the scheduler is unloaded it remains rt_sched_class. The log for this issue is as follows:

    ./scx_rustland --stats 1
    [  199.245639][  T630] sched_ext: "rustland" does not implement cgroup cpu.weight
    [  199.269213][  T630] sched_ext: BPF scheduler "rustland" enabled
    04:25:09 [INFO] RustLand scheduler attached

    bpftrace -e 'iter:task /strcontains(ctx->task->comm, "migration")/
        { printf("%s:%d->%pS\n", ctx->task->comm, ctx->task->pid, ctx->task->sched_class); }'
    Attaching 1 probe...
    migration/0:24->rt_sched_class+0x0/0xe0
    migration/1:27->rt_sched_class+0x0/0xe0
    migration/2:33->rt_sched_class+0x0/0xe0
    migration/3:39->rt_sched_class+0x0/0xe0
    migration/4:45->rt_sched_class+0x0/0xe0
    migration/5:52->rt_sched_class+0x0/0xe0
    migration/6:58->rt_sched_class+0x0/0xe0
    migration/7:64->rt_sched_class+0x0/0xe0

    sched_ext: BPF scheduler "rustland" disabled (unregistered from user space)
    EXIT: unregistered from user space
    04:25:21 [INFO] Unregister RustLand scheduler

    bpftrace -e 'iter:task /strcontains(ctx->task->comm, "migration")/
        { printf("%s:%d->%pS\n", ctx->task->comm, ctx->task->pid, ctx->task->sched_class); }'
    Attaching 1 probe...
    migration/0:24->rt_sched_class+0x0/0xe0
    migration/1:27->rt_sched_class+0x0/0xe0
    migration/2:33->rt_sched_class+0x0/0xe0
    migration/3:39->rt_sched_class+0x0/0xe0
    migration/4:45->rt_sched_class+0x0/0xe0
    migration/5:52->rt_sched_class+0x0/0xe0
    migration/6:58->rt_sched_class+0x0/0xe0
    migration/7:64->rt_sched_class+0x0/0xe0

Therefore add a new scx_setscheduler_class() helper which checks for stop_sched_class, and use it in place of __setscheduler_class().

Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
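A sketch of the new helper (the __setscheduler_class() signature is assumed from recent kernels; the actual patch may differ in detail):

    static const struct sched_class *
    scx_setscheduler_class(struct task_struct *p)
    {
            /*
             * Per-CPU migration kthreads must stay on stop_sched_class;
             * never downgrade them to RT when a BPF scheduler is loaded
             * or unloaded.
             */
            if (p->sched_class == &stop_sched_class)
                    return &stop_sched_class;

            return __setscheduler_class(p->policy, p->prio);
    }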
2025-11-25  sched/mmcid: Switch over to the new mechanism  (Thomas Gleixner)
Now that all pieces are in place, change the implementations of sched_mm_cid_fork() and sched_mm_cid_exit() to adhere to the new strict ownership scheme and switch context_switch() over to use the new mm_cid_schedin() functionality.

The common case is that no mode change is required, which makes fork() and exit() just update the user count and the constraints.

If a new user would exceed the CID space limit, the fork() context handles the transition to per CPU mode with mm::mm_cid::mutex held. exit() handles the transition back to per task mode when the user count drops below the switch-back threshold. fork() might also be forced to handle a deferred switch back to per task mode when an affinity change has increased the number of allowed CPUs enough.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172550.280380631@linutronix.de
2025-11-25  sched/mmcid: Implement deferred mode change  (Thomas Gleixner)
When affinity changes cause an increase in the number of CPUs allowed for tasks related to a MM, that might result in a situation where the ownership mode can go back from per CPU mode to per task mode.

As affinity changes happen with the runqueue lock held, there is no way to do the actual mode change and the required fixup right there. Add the infrastructure to defer it to a workqueue.

The scheduled work can race with a fork() or exit(). Whatever happens first takes care of it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172550.216484739@linutronix.de
2025-11-25  sched/mmcid: Provide CID ownership mode fixup functions  (Thomas Gleixner)
CIDs are either owned by tasks or by CPUs. The ownership mode depends on the number of tasks related to a MM and the number of CPUs on which these tasks are theoretically allowed to run. Theoretically, because that number is the superset of the CPU affinities of all tasks, which only grows and never shrinks.

Switching to per CPU mode happens when the user count becomes greater than the maximum number of CIDs, which is calculated by:

    opt_cids = min(mm_cid::nr_cpus_allowed, mm_cid::users);
    max_cids = min(1.25 * opt_cids, nr_cpu_ids);

The +25% allowance is useful for tight CPU masks in scenarios where only a few threads are created and destroyed, to avoid frequent mode switches. Though this allowance shrinks the closer opt_cids becomes to nr_cpu_ids, which is the (unfortunate) hard ABI limit.

At the point of switching to per CPU mode the new user is not yet visible in the system, so the task which initiated the fork() runs the fixup function: mm_cid_fixup_tasks_to_cpu() walks the thread list and either transfers each task's owned CID to the CPU the task runs on, or drops it into the CID pool if a task is not on a CPU at that point in time. Tasks which schedule in before the task walk reaches them do the handover in mm_cid_schedin(). When mm_cid_fixup_tasks_to_cpus() completes it's guaranteed that no task related to that MM owns a CID anymore.

Switching back to task mode happens when the user count goes below the threshold which was recorded on the per CPU mode switch:

    pcpu_thrs = min(opt_cids - (opt_cids / 4), nr_cpu_ids / 2);

This threshold is updated when an affinity change increases the number of allowed CPUs for the MM, which might cause a switch back to per task mode.

If the switch back was initiated by an exiting task, then that task runs the fixup function. If it was initiated by an affinity change, then it's run either in the deferred update function in the context of a workqueue, or by a task which forks a new one, or by a task which exits. Whatever happens first.

mm_cid_fixup_cpus_to_task() walks through the possible CPUs and either transfers the CPU owned CIDs to a related task which runs on the CPU or drops them into the pool. Tasks which schedule in on a CPU which the walk did not cover yet do the handover themselves.

This transition from CPU to per task ownership happens in two phases:

 1) mm:mm_cid.transit contains MM_CID_TRANSIT. This is OR'ed on the task CID and denotes that the CID is only temporarily owned by the task. When it schedules out, the task drops the CID back into the pool if this bit is set.

 2) The initiating context walks the per CPU space and after completion clears mm:mm_cid.transit. After that point the CIDs are strictly task owned again.

This two phase transition is required to prevent CID space exhaustion during the transition, as a direct transfer of ownership would fail if two tasks are scheduled in on the same CPU before the fixup freed per CPU CIDs.

When mm_cid_fixup_cpus_to_tasks() completes it's guaranteed that no CID related to that MM is owned by a CPU anymore.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172550.088189028@linutronix.de
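For clarity, the thresholds above written out as plain C (field names follow the commit text; this is an illustration, not the exact kernel code):

    unsigned int opt_cids  = min(mm->mm_cid.nr_cpus_allowed, mm->mm_cid.users);
    unsigned int max_cids  = min(opt_cids + opt_cids / 4, nr_cpu_ids);     /* +25% allowance    */
    unsigned int pcpu_thrs = min(opt_cids - opt_cids / 4, nr_cpu_ids / 2); /* switch-back point */

    /* users > max_cids  -> switch to per-CPU CID ownership        */
    /* users < pcpu_thrs -> switch back to per-task CID ownership  */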
2025-11-25  sched/mmcid: Provide new scheduler CID mechanism  (Thomas Gleixner)
The MM CID management has two fundamental requirements:

 1) It has to guarantee that at no given point in time the same CID is used by concurrent tasks in userspace.

 2) The CID space must not exceed the number of possible CPUs in a system. While most allocators (glibc, tcmalloc, jemalloc) do not care about that, there seems to be at least some LTTng library depending on it.

The CID space compaction itself is not a functional correctness requirement, it is only a useful optimization mechanism to reduce the memory foot print in unused user space pools.

The optimal CID space is:

    min(nr_tasks, nr_cpus_allowed);

where @nr_tasks is the number of actual user space threads associated to the mm and @nr_cpus_allowed is the superset of all task affinities. It is growth only, as it would be insane to take a racy snapshot of all task affinities when the affinity of one task changes just to redo it 2 milliseconds later when the next task changes its affinity.

That means that as long as the number of tasks is lower than or equal to the number of CPUs allowed, each task owns a CID. If the number of tasks exceeds the number of CPUs allowed, it switches to per CPU mode, where the CPUs own the CIDs and the tasks borrow them as long as they are scheduled in. For transition periods CIDs can go beyond the optimal space as long as they don't go beyond the number of possible CPUs.

The current upstream implementation adds overhead to task migration to keep the CID with the task. It also has to do the CID space consolidation work from a task work in the exit to user space path. As that work is assigned to a random task related to a MM, this can inflict unwanted exit latencies.

Implement the context switch parts of a strict ownership mechanism to address this. This removes most of the work from the task which schedules out. Only during the transition from per CPU to per task ownership is it required to drop the CID when leaving the CPU, to prevent CID space exhaustion. Other than that, scheduling out is just a single check and branch.

The task which schedules in has to check whether:

 1) The ownership mode changed

 2) The CID is within the optimal CID space

In stable situations this results in zero work. The only short disruption is when the ownership mode changes or when the associated CID is not in the optimal CID space. The latter only happens when tasks exit and therefore the optimal CID space shrinks.

That mechanism is strictly optimized for the common case where no change happens. The only case where it actually causes a temporary one time spike is on mode changes, when and only when a lot of tasks related to a MM schedule exactly at the same time and eventually have to compete on allocating a CID from the bitmap.

In the sysbench test case which triggered the spinlock contention in the initial CID code, __schedule() drops significantly in perf top on a 128 core (256 threads) machine when running sysbench with 255 threads, which fits into the task mode limit of 256 together with the parent thread:

    Upstream    rseq/perf branch    +CID rework
      0.42%          0.37%             0.32%       [k] __schedule

Increasing the number of threads to 256, which puts the test process into per CPU mode, looks about the same.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172550.023984859@linutronix.de
2025-11-25  sched/mmcid: Introduce per task/CPU ownership infrastructure  (Thomas Gleixner)
The MM CID management has two fundamental requirements:

 1) It has to guarantee that at no given point in time the same CID is used by concurrent tasks in userspace.

 2) The CID space must not exceed the number of possible CPUs in a system. While most allocators (glibc, tcmalloc, jemalloc) do not care about that, there seems to be at least librseq depending on it.

The CID space compaction itself is not a functional correctness requirement, it is only a useful optimization mechanism to reduce the memory foot print in unused user space pools.

The optimal CID space is:

    min(nr_tasks, nr_cpus_allowed);

where @nr_tasks is the number of actual user space threads associated to the mm and @nr_cpus_allowed is the superset of all task affinities. It is growth only, as it would be insane to take a racy snapshot of all task affinities when the affinity of one task changes just to redo it 2 milliseconds later when the next task changes its affinity.

That means that as long as the number of tasks is lower than or equal to the number of CPUs allowed, each task owns a CID. If the number of tasks exceeds the number of CPUs allowed, it switches to per CPU mode, where the CPUs own the CIDs and the tasks borrow them as long as they are scheduled in. For transition periods CIDs can go beyond the optimal space as long as they don't go beyond the number of possible CPUs.

The current upstream implementation adds overhead to task migration to keep the CID with the task. It also has to do the CID space consolidation work from a task work in the exit to user space path. As that work is assigned to a random task related to a MM, this can inflict unwanted exit latencies.

This can be done differently by implementing a strict CID ownership mechanism. Either the CIDs are owned by the tasks or by the CPUs. The latter provides less locality when tasks are heavily migrating, but there is no justification for optimizing for overcommit scenarios and thereby penalizing everyone else.

Provide the basic infrastructure to implement this:

 - Change the UNSET marker to BIT(31) from ~0U

 - Add the ONCPU marker as BIT(30)

 - Add the TRANSIT marker as BIT(29)

That allows checking for ownership trivially and provides a simple check for UNSET as well. The TRANSIT marker is required to prevent CID space exhaustion when switching from per CPU to per task mode.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251119172549.960252358@linutronix.de
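The markers from the list above as a sketch; MM_CID_TRANSIT is named in the related commits, the other two macro names are assumed here:

    #define MM_CID_UNSET    BIT(31) /* no CID assigned (previously ~0U)        */
    #define MM_CID_ONCPU    BIT(30) /* CID currently owned by a CPU            */
    #define MM_CID_TRANSIT  BIT(29) /* temporarily task-held while switching
                                       from per-CPU back to per-task ownership */

A plain task-owned CID is then just the bare index, so the ownership and UNSET checks reduce to simple mask tests.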
2025-11-25  sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutex  (Thomas Gleixner)
Prepare for the new CID management scheme which puts the CID ownership transition into the fork() and exit() slow path by serializing sched_mm_cid_fork()/exit() with it, so task list and cpu mask walks can be done in interruptible and preemptible code. The contention on it is not worse than on other concurrency controls in the fork()/exit() machinery. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.895826703@linutronix.de
2025-11-25  sched/mmcid: Provide precomputed maximal value  (Thomas Gleixner)
Reading mm::mm_users and mm::mm_cid::nr_cpus_allowed every time to compute the maximal CID value is just wasteful, as that value only changes on fork(), exit() and eventually when the affinity changes. So it can be easily precomputed at those points and provided in mm::mm_cid for consumption in the hot path.

But there is an issue with using mm::mm_users for accounting because it does not necessarily reflect the number of user space tasks, as other kernel code can take temporary references on the MM which skew the picture.

Solve that by adding a users counter to struct mm_mm_cid, which is modified by fork() and exit() and used for precomputing under mm_mm_cid::lock.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172549.832764634@linutronix.de
2025-11-25  sched/mmcid: Move initialization out of line  (Thomas Gleixner)
It's getting bigger soon, so just move it out of line to the rest of the code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.769636491@linutronix.de
2025-11-25  signal: Move MMCID exit out of sighand lock  (Thomas Gleixner)
There is no need anymore to keep this under sighand lock as the current code and the upcoming replacement are not depending on the exit state of a task anymore. That allows to use a mutex in the exit path. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.706439391@linutronix.de
2025-11-25  sched/mmcid: Convert mm CID mask to a bitmap  (Thomas Gleixner)
This is truly a bitmap and just conveniently uses a cpumask because the maximum size of the bitmap is nr_cpu_ids. But that prevents searching for a zero bit in a limited range, which is helpful for providing an efficient mechanism to consolidate the CID space when the number of users decreases. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Link: https://patch.msgid.link/20251119172549.642866767@linutronix.de
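A minimal sketch of the kind of range-limited search a plain bitmap enables; the helper name and the caller-provided limit are illustrative assumptions:

    /* Search for a free CID only within the currently needed range. */
    static int cid_alloc_in_range(unsigned long *cid_bitmap, unsigned int max_cids)
    {
            unsigned int cid = find_next_zero_bit(cid_bitmap, max_cids, 0);

            if (cid >= max_cids)
                    return -ENOSPC;
            set_bit(cid, cid_bitmap);
            return cid;
    }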
2025-11-25sched: idle: Respect the CPU system wakeup QoS limit for s2idleUlf Hansson
A CPU system wakeup QoS limit may have been requested by user space. To avoid breaking this constraint when entering a low power state during s2idle, let's start taking the QoS limit into account. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dhruva Gole <d-gole@ti.com> Reviewed-by: Kevin Hilman (TI) <khilman@baylibre.com> Tested-by: Kevin Hilman (TI) <khilman@baylibre.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://patch.msgid.link/20251125112650.329269-5-ulf.hansson@linaro.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-11-20sched/isolation: Force housekeeping if isolcpus and nohz_full don't leave anyGabriele Monaco
Currently the user can set up isolcpus and nohz_full in such a way that leaves no housekeeping CPU (i.e. no CPU that is neither domain isolated nor nohz full). This can be a problem for other subsystems (e.g. the timer wheel migration). Prevent this configuration by invalidating the last setting in case the union of isolcpus (domain) and nohz_full covers all CPUs. Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://patch.msgid.link/20251120145653.296659-6-gmonaco@redhat.com
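A sketch of the validation idea described above; the mask names and the helper are illustrative assumptions, not the functions the patch adds:

    /* Return true if at least one possible CPU remains for housekeeping. */
    static bool hk_setup_leaves_housekeeping(const struct cpumask *isolcpus_domain,
                                             const struct cpumask *nohz_full)
    {
            cpumask_var_t covered;
            bool ok;

            if (!alloc_cpumask_var(&covered, GFP_KERNEL))
                    return false;
            cpumask_or(covered, isolcpus_domain, nohz_full);
            /* If the union covers all possible CPUs, reject the last setting. */
            ok = !cpumask_equal(covered, cpu_possible_mask);
            free_cpumask_var(covered);
            return ok;
    }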
2025-11-20Merge tag 'sched_ext-for-6.18-rc6-fixes-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fix from Tejun Heo: "One low risk and obvious fix: scx_enable() was dereferencing an error pointer on helper kthread creation failure. Fixed" * tag 'sched_ext-for-6.18-rc6-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix scx_enable() crash on helper kthread creation failure
2025-11-20sched_ext: Fix scx_enable() crash on helper kthread creation failureSaket Kumar Bhaskar
A crash was observed when the sched_ext selftests runner was terminated with Ctrl+\ while test 15 was running: NIP [c00000000028fa58] scx_enable.constprop.0+0x358/0x12b0 LR [c00000000028fa2c] scx_enable.constprop.0+0x32c/0x12b0 Call Trace: scx_enable.constprop.0+0x32c/0x12b0 (unreliable) bpf_struct_ops_link_create+0x18c/0x22c __sys_bpf+0x23f8/0x3044 sys_bpf+0x2c/0x6c system_call_exception+0x124/0x320 system_call_vectored_common+0x15c/0x2ec kthread_run_worker() returns an ERR_PTR() on failure rather than NULL, but the current code in scx_alloc_and_add_sched() only checks for a NULL helper. In case of failure on SIGQUIT, the error is not handled in scx_alloc_and_add_sched() and scx_enable() ends up dereferencing an error pointer. Fix the error handling in scx_alloc_and_add_sched() to propagate PTR_ERR() into ret, so that scx_enable() jumps to the existing error path, avoiding a random dereference on failure. Fixes: bff3b5aec1b7 ("sched_ext: Move disable machinery into scx_sched") Cc: stable@vger.kernel.org # v6.16+ Reported-and-tested-by: Samir Mulani <samir@linux.ibm.com> Signed-off-by: Saket Kumar Bhaskar <skb99@linux.ibm.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
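A fragment sketch of the fix pattern described above; the worker arguments and the error label are illustrative assumptions, only the ERR_PTR() handling is the point:

    struct kthread_worker *helper;
    int ret;

    helper = kthread_run_worker(0, "sched_ext_helper");
    if (IS_ERR(helper)) {
            ret = PTR_ERR(helper);  /* propagate the error ... */
            goto err_free;          /* ... instead of storing an ERR_PTR() */
    }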
2025-11-20sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplugPingfan Liu
*** Bug description *** When testing kexec-reboot on a 144-CPU machine with isolcpus=managed_irq,domain,1-71,73-143 in the kernel command line, I encountered the following bug: [ 97.114759] psci: CPU142 killed (polled 0 ms) [ 97.333236] Failed to offline CPU143 - error=-16 [ 97.333246] ------------[ cut here ]------------ [ 97.342682] kernel BUG at kernel/cpu.c:1569! [ 97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP [...] In essence, the issue originates from the CPU hot-removal process, not limited to kexec. It can be reproduced by writing a SCHED_DEADLINE program that waits indefinitely on a semaphore, spawning multiple instances to ensure some run on CPU 72, and then offlining CPUs 1–143 one by one. When attempting this, CPU 143 failed to go offline. bash -c 'taskset -cp 0 $$ && for i in {1..143}; do echo 0 > /sys/devices/system/cpu/cpu$i/online 2>/dev/null; done' Tracking down this issue, I found that dl_bw_deactivate() returned -EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU. But that is not actually the case; it is caused by the following factors: When a CPU is inactive, cpu_rq()->rd is set to def_root_domain. For a blocked-state deadline task (in this case, "cppc_fie"), it was not migrated to CPU0, and its task_rq() information is stale. So its rq->rd points to def_root_domain instead of the one shared with CPU0. As a result, its bandwidth is wrongly accounted to the wrong root domain during the domain rebuild. *** Issue *** The key point is that root_domain is only tracked through active rq->rd. To avoid using a global data structure to track all root_domains in the system, there should be a method to locate an active CPU within the corresponding root_domain. *** Solution *** To locate the active CPU, the following rules of the deadline sub-system are useful: -1. Any CPU belongs to a unique root domain at a given time. -2. The DL bandwidth checker ensures that the root domain has active CPUs. Now, let's examine the blocked-state task P. If P is attached to a cpuset that is a partition root, it is straightforward to find an active CPU. If P is attached to a cpuset that has changed from 'root' to 'member', the active CPUs are grouped into the parent root domain. Naturally, the CPUs' capacity and reserved DL bandwidth are taken into account in the ancestor root domain. (In practice, it may be unsafe to attach P to an arbitrary root domain, since that domain may lack sufficient DL bandwidth for P.) Again, it is straightforward to find an active CPU in the ancestor root domain. This patch groups CPUs into isolated and housekeeping sets. For the housekeeping group, it walks up the cpuset hierarchy to find active CPUs in P's root domain and retrieves the valid rd from cpu_rq(cpu)->rd. Signed-off-by: Pingfan Liu <piliu@redhat.com> Cc: Waiman Long <longman@redhat.com> Cc: Chen Ridong <chenridong@huaweicloud.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Pierre Gondois <pierre.gondois@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> To: linux-kernel@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-20sched/mmcid: Use cpumask_weighted_or()Thomas Gleixner
Use cpumask_weighted_or() instead of cpumask_or() and cpumask_weight() on the result, which walks the same bitmap twice. This results in 10-20% fewer cycles, which reduces the runqueue lock hold time. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Link: https://patch.msgid.link/20251119172549.511736272@linutronix.de
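A sketch of the before/after pattern; the mask and variable names are illustrative, and it assumes cpumask_weighted_or() returns the weight of the resulting mask:

    /* Before: two walks over the same bitmap. */
    cpumask_or(mm_allowed, mm_allowed, p->cpus_ptr);
    weight = cpumask_weight(mm_allowed);

    /* After: a single combined walk. */
    weight = cpumask_weighted_or(mm_allowed, mm_allowed, p->cpus_ptr);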
2025-11-20sched/mmcid: Prevent pointless work in mm_update_cpus_allowed()Thomas Gleixner
mm_update_cpus_allowed() is not required to be invoked for affinity changes due to migrate_disable() and migrate_enable(). migrate_disable() restricts the task temporarily to a CPU on which the task was already allowed to run, so nothing changes. migrate_enable() restores the actual task affinity mask. If that mask changed between migrate_disable() and migrate_enable() then that change was already accounted for. Move the invocation to the proper place to avoid that. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.385208276@linutronix.de
2025-11-20sched/mmcid: Move scheduler code out of global headerThomas Gleixner
This is only used in the scheduler core code, so there is no point in having it in a global header. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Link: https://patch.msgid.link/20251119172549.321259077@linutronix.de
2025-11-20sched: Fixup whitespace damageThomas Gleixner
With whitespace checks enabled in the editor this makes eyes bleed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.258651925@linutronix.de
2025-11-20sched/mmcid: Use proper data structuresThomas Gleixner
Having a lot of CID-specific members in struct task_struct and struct mm_struct does not really make the code easier to read. Encapsulate the CID specific parts in data structures and keep them separate from the stuff they are embedded in. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.131573768@linutronix.de
2025-11-20sched/mmcid: Revert the complex CID managementThomas Gleixner
The CID management is a complex beast, which affects both scheduling and task migration. The compaction mechanism forces random tasks of a process into task work on exit to user space, causing latency spikes. Revert to the initial simple bitmap allocation mechanics, which are known to have scalability issues, as that allows gradually building up the replacement functionality in a reviewable way. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.068197830@linutronix.de
2025-11-17Merge tag 'sched_ext-for-6.18-rc6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: "Five fixes addressing PREEMPT_RT compatibility and locking issues. Three commits fix potential deadlocks and sleeps in atomic contexts on RT kernels by converting locks to raw spinlocks and ensuring IRQ work runs in hard-irq context. The remaining two fix unsafe locking in the debug dump path and a variable dereference typo" * tag 'sched_ext-for-6.18-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Use IRQ_WORK_INIT_HARD() to initialize rq->scx.kick_cpus_irq_work sched_ext: Fix possible deadlock in the deferred_irq_workfn() sched/ext: convert scx_tasks_lock to raw spinlock sched_ext: Fix unsafe locking in the scx_dump_state() sched_ext: Fix use of uninitialized variable in scx_bpf_cpuperf_set()
2025-11-17sched/fair: Proportional newidle balancePeter Zijlstra
Add a randomized algorithm that runs newidle balancing proportional to its success rate. This improves schbench significantly: 6.18-rc4: 2.22 Mrps/s 6.18-rc4+revert: 2.04 Mrps/s 6.18-rc4+revert+random: 2.18 Mrps/s Conversely, per Adam Li this affects SpecJBB slightly, reducing it by 1%: 6.17: -6% 6.17+revert: 0% 6.17+revert+random: -1% Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Chris Mason <clm@meta.com> Link: https://lkml.kernel.org/r/6825c50d-7fa7-45d8-9b81-c6e7e25738e2@meta.com Link: https://patch.msgid.link/20251107161739.770122091@infradead.org
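A hypothetical sketch of the idea only; the counters and helper below are not the fields or functions the patch adds, they just illustrate running newidle balancing with a probability that tracks its past success rate:

    /* Hypothetical: gate newidle balancing on its historical success rate. */
    static bool newidle_balance_worth_trying(u32 successes, u32 attempts)
    {
            if (!attempts)
                    return true;    /* no history yet, give it a try */
            /* Run with probability successes/attempts. */
            return get_random_u32_below(attempts) < successes;
    }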
2025-11-17sched/fair: Small cleanup to update_newidle_cost()Peter Zijlstra
Simplify code by adding a few variables. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Chris Mason <clm@meta.com> Link: https://patch.msgid.link/20251107161739.655208666@infradead.org
2025-11-17sched/fair: Small cleanup to sched_balance_newidle()Peter Zijlstra
Pull out the !sd check to simplify code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Chris Mason <clm@meta.com> Link: https://patch.msgid.link/20251107161739.525916173@infradead.org
2025-11-17sched/fair: Revert max_newidle_lb_cost bumpPeter Zijlstra
Many people reported regressions on their database workloads due to: 155213a2aed4 ("sched/fair: Bump sd->max_newidle_lb_cost when newidle balance fails") For instance Adam Li reported a 6% regression on SpecJBB. Conversely this will regress schbench again; on my machine from 2.22 Mrps/s down to 2.04 Mrps/s. Reported-by: Joseph Salisbury <joseph.salisbury@oracle.com> Reported-by: Adam Li <adamli@os.amperecomputing.com> Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Chris Mason <clm@meta.com> Link: https://lkml.kernel.org/r/20250626144017.1510594-2-clm@fb.com Link: https://lkml.kernel.org/r/006c9df2-b691-47f1-82e6-e233c3f91faf@oracle.com Link: https://patch.msgid.link/20251107161739.406147760@infradead.org