summaryrefslogtreecommitdiff
path: root/kernel/sched
AgeCommit message (Collapse)Author
32 hoursMerge tag 'mm-nonmm-stable-2026-02-12-10-48' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves disk space by teaching ocfs2 to reclaim suballocator block group space (Heming Zhao) - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the ARRAY_END() macro and uses it in various places (Alejandro Colomar) - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the page size (Pnina Feder) - "kallsyms: Prevent invalid access when showing module buildid" cleans up kallsyms code related to module buildid and fixes an invalid access crash when printing backtraces (Petr Mladek) - "Address page fault in ima_restore_measurement_list()" fixes a kexec-related crash that can occur when booting the second-stage kernel on x86 (Harshit Mogalapalli) - "kho: ABI headers and Documentation updates" updates the kexec handover ABI documentation (Mike Rapoport) - "Align atomic storage" adds the __aligned attribute to atomic_t and atomic64_t definitions to get natural alignment of both types on csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain) - "kho: clean up page initialization logic" simplifies the page initialization logic in kho_restore_page() (Pratyush Yadav) - "Unload linux/kernel.h" moves several things out of kernel.h and into more appropriate places (Yury Norov) - "don't abuse task_struct.group_leader" removes the usage of ->group_leader when it is "obviously unnecessary" (Oleg Nesterov) - "list private v2 & luo flb" adds some infrastructure improvements to the live update orchestrator (Pasha Tatashin) * tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits) watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency procfs: fix missing RCU protection when reading real_parent in do_task_stat() watchdog/softlockup: fix sample ring index wrap in need_counting_irqs() kcsan, compiler_types: avoid duplicate type issues in BPF Type Format kho: fix doc for kho_restore_pages() tests/liveupdate: add in-kernel liveupdate test liveupdate: luo_flb: introduce File-Lifecycle-Bound global state liveupdate: luo_file: Use private list list: add kunit test for private list primitives list: add primitives for private list manipulations delayacct: fix uapi timespec64 definition panic: add panic_force_cpu= parameter to redirect panic to a specific CPU netclassid: use thread_group_leader(p) in update_classid_task() RDMA/umem: don't abuse current->group_leader drm/pan*: don't abuse current->group_leader drm/amd: kill the outdated "Only the pthreads threading model is supported" checks drm/amdgpu: don't abuse current->group_leader android/binder: use same_thread_group(proc->tsk, current) in binder_mmap() android/binder: don't abuse current->group_leader kho: skip memoryless NUMA nodes when reserving scratch areas ...
2 daysMerge tag 'sched_ext-for-6.20' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Move C example schedulers back from the external scx repo to tools/sched_ext as the authoritative source. scx_userland and scx_pair are returning while scx_sdt (BPF arena-based task data management) is new. These schedulers will be dropped from the external repo. - Improve error reporting by adding scx_bpf_error() calls when DSQ creation fails across all in-tree schedulers - Avoid redundant irq_work_queue() calls in destroy_dsq() by only queueing when llist_add() indicates an empty list - Fix flaky init_enable_count selftest by properly synchronizing pre-forked children using a pipe instead of sleep() * tag 'sched_ext-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: selftests/sched_ext: Fix init_enable_count flakiness tools/sched_ext: Fix data header access during free in scx_sdt tools/sched_ext: Add error logging for dsq creation failures in remaining schedulers tools/sched_ext: add arena based scheduler tools/sched_ext: add scx_pair scheduler tools/sched_ext: add scx_userland scheduler sched_ext: Add error logging for dsq creation failures sched_ext: Avoid multiple irq_work_queue() calls in destroy_dsq()
2 dayssched/mmcid: Don't assume CID is CPU owned on mode switchThomas Gleixner
Shinichiro reported a KASAN UAF, which is actually an out of bounds access in the MMCID management code. CPU0 CPU1 T1 runs in userspace T0: fork(T4) -> Switch to per CPU CID mode fixup() set MM_CID_TRANSIT on T1/CPU1 T4 exit() T3 exit() T2 exit() T1 exit() switch to per task mode ---> Out of bounds access. As T1 has not scheduled after T0 set the TRANSIT bit, it exits with the TRANSIT bit set. sched_mm_cid_remove_user() clears the TRANSIT bit in the task and drops the CID, but it does not touch the per CPU storage. That's functionally correct because a CID is only owned by the CPU when the ONCPU bit is set, which is mutually exclusive with the TRANSIT flag. Now sched_mm_cid_exit() assumes that the CID is CPU owned because the prior mode was per CPU. It invokes mm_drop_cid_on_cpu() which clears the not set ONCPU bit and then invokes clear_bit() with an insanely large bit number because TRANSIT is set (bit 29). Prevent that by actually validating that the CID is CPU owned in mm_drop_cid_on_cpu(). Fixes: 007d84287c74 ("sched/mmcid: Drop per CPU CID immediately when switching to per task mode") Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Cc: stable@vger.kernel.org Closes: https://lore.kernel.org/aYsZrixn9b6s_2zL@shinmob Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 daysMerge tag 'x86_paravirt_for_v7.0_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 paravirt updates from Borislav Petkov: - A nice cleanup to the paravirt code containing a unification of the paravirt clock interface, taming the include hell by splitting the pv_ops structure and removing of a bunch of obsolete code (Juergen Gross) * tag 'x86_paravirt_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) x86/paravirt: Use XOR r32,r32 to clear register in pv_vcpu_is_preempted() x86/paravirt: Remove trailing semicolons from alternative asm templates x86/pvlocks: Move paravirt spinlock functions into own header x86/paravirt: Specify pv_ops array in paravirt macros x86/paravirt: Allow pv-calls outside paravirt.h objtool: Allow multiple pv_ops arrays x86/xen: Drop xen_mmu_ops x86/xen: Drop xen_cpu_ops x86/xen: Drop xen_irq_ops x86/paravirt: Move pv_native_*() prototypes to paravirt.c x86/paravirt: Introduce new paravirt-base.h header x86/paravirt: Move paravirt_sched_clock() related code into tsc.c x86/paravirt: Use common code for paravirt_steal_clock() riscv/paravirt: Use common code for paravirt_steal_clock() loongarch/paravirt: Use common code for paravirt_steal_clock() arm64/paravirt: Use common code for paravirt_steal_clock() arm/paravirt: Use common code for paravirt_steal_clock() sched: Move clock related paravirt code to kernel/sched paravirt: Remove asm/paravirt_api_clock.h x86/paravirt: Move thunk macros to paravirt_types.h ...
3 daysMerge tag 'sched-core-2026-02-09' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Scheduler Kconfig space updates: - Further consolidate configurable preemption modes (Peter Zijlstra) Reduce the number of architectures that are allowed to offer PREEMPT_NONE and PREEMPT_VOLUNTARY, reducing the number of preemption models from four to just two: 'full' and 'lazy' on up-to-date architectures (arm64, loongarch, powerpc, riscv, s390, x86). None and voluntary are only available as legacy features on platforms that don't implement lazy preemption yet, or which don't even support preemption. The goal is to eventually remove cond_resched() and voluntary preemption altogether. RSEQ based 'scheduler time slice extension' support (Thomas Gleixner and Peter Zijlstra): This allows a thread to request a time slice extension when it enters a critical section to avoid contention on a resource when the thread is scheduled out inside of the critical section. - Add fields and constants for time slice extension - Provide static branch for time slice extensions - Add statistics for time slice extensions - Add prctl() to enable time slice extensions - Implement sys_rseq_slice_yield() - Implement syscall entry work for time slice extensions - Implement time slice extension enforcement timer - Reset slice extension when scheduled - Implement rseq_grant_slice_extension() - entry: Hook up rseq time slice extension - selftests: Implement time slice extension test - Allow registering RSEQ with slice extension - Move slice_ext_nsec to debugfs - Lower default slice extension - selftests/rseq: Add rseq slice histogram script Scheduler performance/scalability improvements: - Update rq->avg_idle when a task is moved to an idle CPU, which improves the scalability of various workloads (Shubhang Kaushik) - Reorder fields in 'struct rq' for better caching (Blake Jones) - Fair scheduler SMP NOHZ balancing code speedups (Shrikanth Hegde): - Move checking for nohz cpus after time check - Change likelyhood of nohz.nr_cpus - Remove nohz.nr_cpus and use weight of cpumask instead - Avoid false sharing for sched_clock_irqtime (Wangyang Guo) - Cleanups (Yury Norov): - Drop useless cpumask_empty() in find_energy_efficient_cpu() - Simplify task_numa_find_cpu() - Use cpumask_weight_and() in sched_balance_find_dst_group() DL scheduler updates: - Add a deadline server for sched_ext tasks (by Andrea Righi and Joel Fernandes, with fixes by Peter Zijlstra) RT scheduler updates: - Skip currently executing CPU in rto_next_cpu() (Chen Jinghuang) Entry code updates and performance improvements (Jinjie Ruan) This is part of the scheduler tree in this cycle due to inter- dependencies with the RSEQ based time slice extension work: - Remove unused syscall argument from syscall_trace_enter() - Rework syscall_exit_to_user_mode_work() for architecture reuse - Add arch_ptrace_report_syscall_entry/exit() - Inline syscall_exit_work() and syscall_trace_enter() Scheduler core updates (Peter Zijlstra): - Rework sched_class::wakeup_preempt() and rq_modified_*() - Avoid rq->lock bouncing in sched_balance_newidle() - Rename rcu_dereference_check_sched_domain() => rcu_dereference_sched_domain() - <linux/compiler_types.h>: Add the __signed_scalar_typeof() helper Fair scheduler updates/refactoring (Peter Zijlstra and Ingo Molnar): - Fold the sched_avg update - Change rcu_dereference_check_sched_domain() to rcu-sched - Switch to rcu_dereference_all() - Remove superfluous rcu_read_lock() - Limit hrtick work - Join two #ifdef CONFIG_FAIR_GROUP_SCHED blocks - Clean up comments in 'struct cfs_rq' - Separate se->vlag from se->vprot - Rename cfs_rq::avg_load to cfs_rq::sum_weight - Rename cfs_rq::avg_vruntime to ::sum_w_vruntime & helper functions - Introduce and use the vruntime_cmp() and vruntime_op() wrappers for wrapped-signed aritmetics - Sort out 'blocked_load*' namespace noise Scheduler debugging code updates: - Export hidden tracepoints to modules (Gabriele Monaco) - Convert copy_from_user() + kstrtouint() to kstrtouint_from_user() (Fushuai Wang) - Add assertions to QUEUE_CLASS (Peter Zijlstra) - hrtimer: Fix tracing oddity (Thomas Gleixner) Misc fixes and cleanups: - Re-evaluate scheduling when migrating queued tasks out of throttled cgroups (Zicheng Qu) - Remove task_struct->faults_disabled_mapping (Christoph Hellwig) - Fix math notation errors in avg_vruntime comment (Zhan Xusheng) - sched/cpufreq: Use %pe format for PTR_ERR() printing (zenghongling)" * tag 'sched-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits) sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups sched/cpufreq: Use %pe format for PTR_ERR() printing sched/rt: Skip currently executing CPU in rto_next_cpu() sched/clock: Avoid false sharing for sched_clock_irqtime selftests/sched_ext: Add test for DL server total_bw consistency selftests/sched_ext: Add test for sched_ext dl_server sched/debug: Fix dl_server (re)start conditions sched/debug: Add support to change sched_ext server params sched_ext: Add a DL server for sched_ext tasks sched/debug: Stop and start server based on if it was active sched/debug: Fix updating of ppos on server write ops sched/deadline: Clear the defer params entry: Inline syscall_exit_work() and syscall_trace_enter() entry: Add arch_ptrace_report_syscall_entry/exit() entry: Rework syscall_exit_to_user_mode_work() for architecture reuse entry: Remove unused syscall argument from syscall_trace_enter() sched: remove task_struct->faults_disabled_mapping sched: Update rq->avg_idle when a task is moved to an idle CPU selftests/rseq: Add rseq slice histogram script hrtimer: Fix trace oddity ...
3 daysMerge tag 'locking-core-2026-02-08' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "Lock debugging: - Implement compiler-driven static analysis locking context checking, using the upcoming Clang 22 compiler's context analysis features (Marco Elver) We removed Sparse context analysis support, because prior to removal even a defconfig kernel produced 1,700+ context tracking Sparse warnings, the overwhelming majority of which are false positives. On an allmodconfig kernel the number of false positive context tracking Sparse warnings grows to over 5,200... On the plus side of the balance actual locking bugs found by Sparse context analysis is also rather ... sparse: I found only 3 such commits in the last 3 years. So the rate of false positives and the maintenance overhead is rather high and there appears to be no active policy in place to achieve a zero-warnings baseline to move the annotations & fixers to developers who introduce new code. Clang context analysis is more complete and more aggressive in trying to find bugs, at least in principle. Plus it has a different model to enabling it: it's enabled subsystem by subsystem, which results in zero warnings on all relevant kernel builds (as far as our testing managed to cover it). Which allowed us to enable it by default, similar to other compiler warnings, with the expectation that there are no warnings going forward. This enforces a zero-warnings baseline on clang-22+ builds (Which are still limited in distribution, admittedly) Hopefully the Clang approach can lead to a more maintainable zero-warnings status quo and policy, with more and more subsystems and drivers enabling the feature. Context tracking can be enabled for all kernel code via WARN_CONTEXT_ANALYSIS_ALL=y (default disabled), but this will generate a lot of false positives. ( Having said that, Sparse support could still be added back, if anyone is interested - the removal patch is still relatively straightforward to revert at this stage. ) Rust integration updates: (Alice Ryhl, Fujita Tomonori, Boqun Feng) - Add support for Atomic<i8/i16/bool> and replace most Rust native AtomicBool usages with Atomic<bool> - Clean up LockClassKey and improve its documentation - Add missing Send and Sync trait implementation for SetOnce - Make ARef Unpin as it is supposed to be - Add __rust_helper to a few Rust helpers as a preparation for helper LTO - Inline various lock related functions to avoid additional function calls WW mutexes: - Extend ww_mutex tests and other test-ww_mutex updates (John Stultz) Misc fixes and cleanups: - rcu: Mark lockdep_assert_rcu_helper() __always_inline (Arnd Bergmann) - locking/local_lock: Include more missing headers (Peter Zijlstra) - seqlock: fix scoped_seqlock_read kernel-doc (Randy Dunlap) - rust: sync: Replace `kernel::c_str!` with C-Strings (Tamir Duberstein)" * tag 'locking-core-2026-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (90 commits) locking/rwlock: Fix write_trylock_irqsave() with CONFIG_INLINE_WRITE_TRYLOCK rcu: Mark lockdep_assert_rcu_helper() __always_inline compiler-context-analysis: Remove __assume_ctx_lock from initializers tomoyo: Use scoped init guard crypto: Use scoped init guard kcov: Use scoped init guard compiler-context-analysis: Introduce scoped init guards cleanup: Make __DEFINE_LOCK_GUARD handle commas in initializers seqlock: fix scoped_seqlock_read kernel-doc tools: Update context analysis macros in compiler_types.h rust: sync: Replace `kernel::c_str!` with C-Strings rust: sync: Inline various lock related methods rust: helpers: Move #define __rust_helper out of atomic.c rust: wait: Add __rust_helper to helpers rust: time: Add __rust_helper to helpers rust: task: Add __rust_helper to helpers rust: sync: Add __rust_helper to helpers rust: refcount: Add __rust_helper to helpers rust: rcu: Add __rust_helper to helpers rust: processor: Add __rust_helper to helpers ...
3 daysMerge tag 'bpf-next-7.0' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Support associating BPF program with struct_ops (Amery Hung) - Switch BPF local storage to rqspinlock and remove recursion detection counters which were causing false positives (Amery Hung) - Fix live registers marking for indirect jumps (Anton Protopopov) - Introduce execution context detection BPF helpers (Changwoo Min) - Improve verifier precision for 32bit sign extension pattern (Cupertino Miranda) - Optimize BTF type lookup by sorting vmlinux BTF and doing binary search (Donglin Peng) - Allow states pruning for misc/invalid slots in iterator loops (Eduard Zingerman) - In preparation for ASAN support in BPF arenas teach libbpf to move global BPF variables to the end of the region and enable arena kfuncs while holding locks (Emil Tsalapatis) - Introduce support for implicit arguments in kfuncs and migrate a number of them to new API. This is a prerequisite for cgroup sub-schedulers in sched-ext (Ihor Solodrai) - Fix incorrect copied_seq calculation in sockmap (Jiayuan Chen) - Fix ORC stack unwind from kprobe_multi (Jiri Olsa) - Speed up fentry attach by using single ftrace direct ops in BPF trampolines (Jiri Olsa) - Require frozen map for calculating map hash (KP Singh) - Fix lock entry creation in TAS fallback in rqspinlock (Kumar Kartikeya Dwivedi) - Allow user space to select cpu in lookup/update operations on per-cpu array and hash maps (Leon Hwang) - Make kfuncs return trusted pointers by default (Matt Bobrowski) - Introduce "fsession" support where single BPF program is executed upon entry and exit from traced kernel function (Menglong Dong) - Allow bpf_timer and bpf_wq use in all programs types (Mykyta Yatsenko, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Alexei Starovoitov) - Make KF_TRUSTED_ARGS the default for all kfuncs and clean up their definition across the tree (Puranjay Mohan) - Allow BPF arena calls from non-sleepable context (Puranjay Mohan) - Improve register id comparison logic in the verifier and extend linked registers with negative offsets (Puranjay Mohan) - In preparation for BPF-OOM introduce kfuncs to access memcg events (Roman Gushchin) - Use CFI compatible destructor kfunc type (Sami Tolvanen) - Add bitwise tracking for BPF_END in the verifier (Tianci Cao) - Add range tracking for BPF_DIV and BPF_MOD in the verifier (Yazhou Tang) - Make BPF selftests work with 64k page size (Yonghong Song) * tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (268 commits) selftests/bpf: Fix outdated test on storage->smap selftests/bpf: Choose another percpu variable in bpf for btf_dump test selftests/bpf: Remove test_task_storage_map_stress_lookup selftests/bpf: Update task_local_storage/task_storage_nodeadlock test selftests/bpf: Update task_local_storage/recursion test selftests/bpf: Update sk_storage_omem_uncharge test bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy} bpf: Support lockless unlink when freeing map or local storage bpf: Prepare for bpf_selem_unlink_nofail() bpf: Remove unused percpu counter from bpf_local_storage_map_free bpf: Remove cgroup local storage percpu counter bpf: Remove task local storage percpu counter bpf: Change local_storage->lock and b->lock to rqspinlock bpf: Convert bpf_selem_unlink to failable bpf: Convert bpf_selem_link_map to failable bpf: Convert bpf_selem_unlink_map to failable bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage selftests/xsk: fix number of Tx frags in invalid packet selftests/xsk: properly handle batch ending in the middle of a packet bpf: Prevent reentrance into call_rcu_tasks_trace() ...
4 daysMerge tag 'kthread-for-7.0' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks Pull kthread updates from Frederic Weisbecker: "The kthread code provides an infrastructure which manages the preferred affinity of unbound kthreads (node or custom cpumask) against housekeeping (CPU isolation) constraints and CPU hotplug events. One crucial missing piece is the handling of cpuset: when an isolated partition is created, deleted, or its CPUs updated, all the unbound kthreads in the top cpuset become indifferently affine to _all_ the non-isolated CPUs, possibly breaking their preferred affinity along the way. Solve this with performing the kthreads affinity update from cpuset to the kthreads consolidated relevant code instead so that preferred affinities are honoured and applied against the updated cpuset isolated partitions. The dispatch of the new isolated cpumasks to timers, workqueues and kthreads is performed by housekeeping, as per the nice Tejun's suggestion. As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set from boot defined domain isolation (through isolcpus=) and cpuset isolated partitions. Housekeeping cpumasks are now modifiable with a specific RCU based synchronization. A big step toward making nohz_full= also mutable through cpuset in the future" * tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (33 commits) doc: Add housekeeping documentation kthread: Document kthread_affine_preferred() kthread: Comment on the purpose and placement of kthread_affine_node() call kthread: Honour kthreads preferred affinity after cpuset changes sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management kthread: Include kthreadd to the managed affinity list kthread: Include unbound kthreads in the managed affinity list kthread: Refine naming of affinity related fields PCI: Remove superfluous HK_TYPE_WQ check sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated() cpuset: Remove cpuset_cpu_is_isolated() timers/migration: Remove superfluous cpuset isolation test cpuset: Propagate cpuset isolation update to timers through housekeeping cpuset: Propagate cpuset isolation update to workqueue through housekeeping PCI: Flush PCI probe workqueue on cpuset isolated partition change sched/isolation: Flush vmstat workqueues on cpuset isolated partition change sched/isolation: Flush memcg workqueues on cpuset isolated partition change cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset ...
6 daysMerge tag 'sched-urgent-2026-02-07' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Miscellaneous MMCID fixes to address bugs and performance regressions in the recent rewrite of the SCHED_MM_CID management code: - Fix livelock triggered by BPF CI testing - Fix hard lockup on weakly ordered systems - Simplify the dropping of CIDs in the exit path by removing an unintended transition phase - Fix performance/scalability regression on a thread-pool benchmark by optimizing transitional CIDs when scheduling out" * tag 'sched-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/mmcid: Optimize transitional CIDs when scheduling out sched/mmcid: Drop per CPU CID immediately when switching to per task mode sched/mmcid: Protect transition on weakly ordered systems sched/mmcid: Prevent live lock on task to CPU mode transition
9 daysMerge tag 'sched_ext-for-6.19-rc8-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fix from Tejun Heo: - Fix race where sched_class operations (sched_setscheduler() and friends) could be invoked on dead tasks after sched_ext_dead() already ran, causing invalid SCX task state transitions and NULL pointer dereferences. This was a regression from the cgroup exit ordering fix which moved sched_ext_free() to finish_task_switch(). * tag 'sched_ext-for-6.19-rc8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Short-circuit sched_class operations on dead tasks
9 dayssched_ext: Short-circuit sched_class operations on dead tasksTejun Heo
7900aa699c34 ("sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()") moved sched_ext_free() to finish_task_switch() and renamed it to sched_ext_dead() to fix cgroup exit ordering issues. However, this created a race window where certain sched_class ops may be invoked on dead tasks leading to failures - e.g. sched_setscheduler() may try to switch a task which finished sched_ext_dead() back into SCX triggering invalid SCX task state transitions. Add task_dead_and_done() which tests whether a task is TASK_DEAD and has completed its final context switch, and use it to short-circuit sched_class operations which may be called on dead tasks. Fixes: 7900aa699c34 ("sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()") Reported-by: Andrea Righi <arighi@nvidia.com> Link: http://lkml.kernel.org/r/20260202151341.796959-1-arighi@nvidia.com Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
10 dayssched/mmcid: Optimize transitional CIDs when scheduling outThomas Gleixner
During the investigation of the various transition mode issues instrumentation revealed that the amount of bitmap operations can be significantly reduced when a task with a transitional CID schedules out after the fixup function completed and disabled the transition mode. At that point the mode is stable and therefore it is not required to drop the transitional CID back into the pool. As the fixup is complete the potential exhaustion of the CID pool is not longer possible, so the CID can be transferred to the scheduling out task or to the CPU depending on the current ownership mode. The racy snapshot of mm_cid::mode which contains both the ownership state and the transition bit is valid because runqueue lock is held and the fixup function of a concurrent mode switch is serialized. Assigning the ownership right there not only spares the bitmap access for dropping the CID it also avoids it when the task is scheduled back in as it directly hits the fast path in both modes when the CID is within the optimal range. If it's outside the range the next schedule in will need to converge so dropping it right away is sensible. In the good case this also allows to go into the fast path on the next schedule in operation. With a thread pool benchmark which is configured to cross the mode switch boundaries frequently this reduces the number of bitmap operations by about 30% and increases the fastpath utilization in the low single digit percentage range. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260201192835.100194627@kernel.org
10 dayssched/mmcid: Drop per CPU CID immediately when switching to per task modeThomas Gleixner
When a exiting task initiates the switch from per CPU back to per task mode, it has already dropped its CID and marked itself inactive. But a leftover from an earlier iteration of the rework then reassigns the per CPU CID to the exiting task with the transition bit set. That's wrong as the task is already marked CID inactive, which means it is inconsistent state. It's harmless because the CID is marked in transit and therefore dropped back into the pool when the exiting task schedules out either through preemption or the final schedule(). Simply drop the per CPU CID when the exiting task triggered the transition. Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260201192835.032221009@kernel.org
10 dayssched/mmcid: Protect transition on weakly ordered systemsThomas Gleixner
Shrikanth reported a hard lockup which he observed once. The stack trace shows the following CID related participants: watchdog: CPU 23 self-detected hard LOCKUP @ mm_get_cid+0xe8/0x188 NIP: mm_get_cid+0xe8/0x188 LR: mm_get_cid+0x108/0x188 mm_cid_switch_to+0x3c4/0x52c __schedule+0x47c/0x700 schedule_idle+0x3c/0x64 do_idle+0x160/0x1b0 cpu_startup_entry+0x48/0x50 start_secondary+0x284/0x288 start_secondary_prolog+0x10/0x14 watchdog: CPU 11 self-detected hard LOCKUP @ plpar_hcall_norets_notrace+0x18/0x2c NIP: plpar_hcall_norets_notrace+0x18/0x2c LR: queued_spin_lock_slowpath+0xd88/0x15d0 _raw_spin_lock+0x80/0xa0 raw_spin_rq_lock_nested+0x3c/0xf8 mm_cid_fixup_cpus_to_tasks+0xc8/0x28c sched_mm_cid_exit+0x108/0x22c do_exit+0xf4/0x5d0 make_task_dead+0x0/0x178 system_call_exception+0x128/0x390 system_call_vectored_common+0x15c/0x2ec The task on CPU11 is running the CID ownership mode change fixup function and is stuck on a runqueue lock. The task on CPU23 is trying to get a CID from the pool with the same runqueue lock held, but the pool is empty. After decoding a similar issue in the opposite direction switching from per task to per CPU mode the tool which models the possible scenarios failed to come up with a similar loop hole. This showed up only once, was not reproducible and according to tooling not related to a overlooked scheduling scenario permutation. But the fact that it was observed on a PowerPC system gave the right hint: PowerPC is a weakly ordered architecture. The transition mechanism does: WRITE_ONCE(mm->mm_cid.transit, MM_CID_TRANSIT); WRITE_ONCE(mm->mm_cid.percpu, new_mode); fixup() WRITE_ONCE(mm->mm_cid.transit, 0); mm_cid_schedin() does: if (!READ_ONCE(mm->mm_cid.percpu)) ... cid |= READ_ONCE(mm->mm_cid.transit); so weakly ordered systems can observe percpu == false and transit == 0 even if the fixup function has not yet completed. As a consequence the task will not drop the CID when scheduling out before the fixup is completed, which means the CID space can be exhausted and the next task scheduling in will loop in mm_get_cid() and the fixup thread can livelock on the held runqueue lock as above. This could obviously be solved by using: smp_store_release(&mm->mm_cid.percpu, true); and smp_load_acquire(&mm->mm_cid.percpu); but that brings a memory barrier back into the scheduler hotpath, which was just designed out by the CID rewrite. That can be completely avoided by combining the per CPU mode and the transit storage into a single mm_cid::mode member and ordering the stores against the fixup functions to prevent the CPU from reordering them. That makes the update of both states atomic and a concurrent read observes always consistent state. The price is an additional AND operation in mm_cid_schedin() to evaluate the per CPU or the per task path, but that's in the noise even on strongly ordered architectures as the actual load can be significantly more expensive and the conditional branch evaluation is there anyway. Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions") Closes: https://lore.kernel.org/bdfea828-4585-40e8-8835-247c6a8a76b0@linux.ibm.com Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260201192834.965217106@kernel.org
10 dayssched/mmcid: Prevent live lock on task to CPU mode transitionThomas Gleixner
Ihor reported a BPF CI failure which turned out to be a live lock in the MM_CID management. The scenario is: A test program creates the 5th thread, which means the MM_CID users become more than the number of CPUs (four in this example), so it switches to per CPU ownership mode. At this point each live task of the program has a CID associated. Assume thread creation order assignment for simplicity. T0 CID0 runs fork() and creates T4 T1 CID1 T2 CID2 T3 CID3 T4 --- not visible yet T0 sets mm_cid::percpu = true and transfers its own CID to CPU0 where it runs on and then starts the fixup which walks through the threads to transfer the per task CIDs either to the CPU the task is running on or drop it back into the pool if the task is not on a CPU. During that T1 - T3 are free to schedule in and out before the fixup caught up with them. Going through all possible permutations with a python script revealed a few problematic cases. The most trivial one is: T1 schedules in on CPU1 and observes percpu == true, so it transfers its CID to CPU1 T1 is migrated to CPU2 and schedule in observes percpu == true, but CPU2 does not have a CID associated and T1 transferred its own to CPU1 So it has to allocate one with CPU2 runqueue lock held, but the pool is empty, so it keeps looping in mm_get_cid(). Now T0 reaches T1 in the thread walk and tries to lock the corresponding runqueue lock, which is held causing a full live lock. There is a similar scenario in the reverse direction of switching from per CPU to task mode which is way more obvious and got therefore addressed by an intermediate mode. In this mode the CIDs are marked with MM_CID_TRANSIT, which means that they are neither owned by the CPU nor by the task. When a task schedules out with a transit CID it drops the CID back into the pool making it available for others to use temporarily. Once the task which initiated the mode switch finished the fixup it clears the transit mode and the process goes back into per task ownership mode. Unfortunately this insight was not mapped back to the task to CPU mode switch as the above described scenario was not considered in the analysis. Apply the same transit mechanism to the task to CPU mode switch to handle these problematic cases correctly. As with the CPU to task transition this results in a potential temporary contention on the CID bitmap, but that's only for the time it takes to complete the transition. After that it stays in steady mode which does not touch the bitmap at all. Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions") Closes: https://lore.kernel.org/2b7463d7-0f58-4e34-9775-6e2115cfb971@linux.dev Reported-by: Ihor Solodrai <ihor.solodrai@linux.dev> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260201192834.897115238@kernel.org
11 dayskthread: Honour kthreads preferred affinity after cpuset changesFrederic Weisbecker
When cpuset isolated partitions get updated, unbound kthreads get indifferently affine to all non isolated CPUs, regardless of their individual affinity preferences. For example kswapd is a per-node kthread that prefers to be affine to the node it refers to. Whenever an isolated partition is created, updated or deleted, kswapd's node affinity is going to be broken if any CPU in the related node is not isolated because kswapd will be affine globally. Fix this with letting the consolidated kthread managed affinity code do the affinity update on behalf of cpuset. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: cgroups@vger.kernel.org
11 dayscpuset: Propagate cpuset isolation update to timers through housekeepingFrederic Weisbecker
Until now, cpuset would propagate isolated partition changes to timer migration so that unbound timers don't get migrated to isolated CPUs. Since housekeeping now centralizes, synchronize and propagates isolation cpumask changes, perform the work from that subsystem for consolidation and consistency purposes. Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
11 dayscpuset: Propagate cpuset isolation update to workqueue through housekeepingFrederic Weisbecker
Until now, cpuset would propagate isolated partition changes to workqueues so that unbound workers get properly reaffined. Since housekeeping now centralizes, synchronize and propagates isolation cpumask changes, perform the work from that subsystem for consolidation and consistency purposes. For simplification purpose, the target function is adapted to take the new housekeeping mask instead of the isolated mask. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: "Michal Koutný" <mkoutny@suse.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: cgroups@vger.kernel.org
11 daysPCI: Flush PCI probe workqueue on cpuset isolated partition changeFrederic Weisbecker
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In order to synchronize against PCI probe works and make sure that no asynchronous probing is still pending or executing on a newly isolated CPU, the housekeeping subsystem must flush the PCI probe works. However the PCI probe works can't be flushed easily since they are queued to the main per-CPU workqueue pool. Solve this with creating a PCI probe-specific pool and provide and use the appropriate flushing API. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: linux-pci@vger.kernel.org
11 dayssched/isolation: Flush vmstat workqueues on cpuset isolated partition changeFrederic Weisbecker
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In order to synchronize against vmstat workqueue to make sure that no asynchronous vmstat work is still pending or executing on a newly made isolated CPU, the housekeeping susbsystem must flush the vmstat workqueues. This involves flushing the whole mm_percpu_wq workqueue, shared with LRU drain, introducing here a welcome side effect. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: linux-mm@kvack.org
11 dayssched/isolation: Flush memcg workqueues on cpuset isolated partition changeFrederic Weisbecker
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In order to synchronize against memcg workqueue to make sure that no asynchronous draining is still pending or executing on a newly made isolated CPU, the housekeeping susbsystem must flush the memcg workqueues. However the memcg workqueues can't be flushed easily since they are queued to the main per-CPU workqueue pool. Solve this with creating a memcg specific pool and provide and use the appropriate flushing API. Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: cgroups@vger.kernel.org Cc: linux-mm@kvack.org
11 dayscpuset: Update HK_TYPE_DOMAIN cpumask from cpusetFrederic Weisbecker
Until now, HK_TYPE_DOMAIN used to only include boot defined isolated CPUs passed through isolcpus= boot option. Users interested in also knowing the runtime defined isolated CPUs through cpuset must use different APIs: cpuset_cpu_is_isolated(), cpu_is_isolated(), etc... There are many drawbacks to that approach: 1) Most interested subsystems want to know about all isolated CPUs, not just those defined on boot time. 2) cpuset_cpu_is_isolated() / cpu_is_isolated() are not synchronized with concurrent cpuset changes. 3) Further cpuset modifications are not propagated to subsystems Solve 1) and 2) and centralize all isolated CPUs within the HK_TYPE_DOMAIN housekeeping cpumask. Subsystems can rely on RCU to synchronize against concurrent changes. The propagation mentioned in 3) will be handled in further patches. [Chen Ridong: Fix cpu_hotplug_lock deadlock and use correct static branch API] Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Cc: "Michal Koutný" <mkoutny@suse.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: cgroups@vger.kernel.org
11 dayssched/isolation: Convert housekeeping cpumasks to rcu pointersFrederic Weisbecker
HK_TYPE_DOMAIN's cpumask will soon be made modifiable by cpuset. A synchronization mechanism is then needed to synchronize the updates with the housekeeping cpumask readers. Turn the housekeeping cpumasks into RCU pointers. Once a housekeeping cpumask will be modified, the update side will wait for an RCU grace period and propagate the change to interested subsystem when deemed necessary. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com>
11 dayssched/isolation: Save boot defined domain flagsFrederic Weisbecker
HK_TYPE_DOMAIN will soon integrate not only boot defined isolcpus= CPUs but also cpuset isolated partitions. Housekeeping still needs a way to record what was initially passed to isolcpus= in order to keep these CPUs isolated after a cpuset isolated partition is modified or destroyed while containing some of them. Create a new HK_TYPE_DOMAIN_BOOT to keep track of those. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Waiman Long <longman@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com>
11 dayssched: Re-evaluate scheduling when migrating queued tasks out of throttled ↵Zicheng Qu
cgroups Consider the following sequence on a CPU configured with nohz_full: 1) A task P runs in cgroup A, and cgroup A becomes throttled due to CFS bandwidth control. The gse (cgroup A) where the task P attached is dequeued and the CPU switches to idle. 2) Before cgroup A is unthrottled, task P is migrated from cgroup A to another cgroup B (not throttled). During sched_move_task(), the task P is observed as queued but not running, and therefore no resched_curr() is triggered. 3) Since the CPU is nohz_full, it remains in do_idle() waiting for an explicit scheduling event, i.e., resched_curr(). 4) For kernel <= 5.10: Later, cgroup A is unthrottled. However, the task P has already been migrated out of cgroup A, so unthrottle_cfs_rq() may observe load_weight == 0 and return early without resched_curr() called. For kernel >= 6.6: The unthrottling path normally triggers `resched_curr()` almost cases even when no runnable tasks remain in the unthrottled cgroup, preventing the idle stall described above. However, if cgroup A is removed before it gets unthrottled, the unthrottling path for cgroup A is never executed. In a result, no `resched_curr()` can be called. 5) At this point, the task P is runnable in cgroup B (not throttled), but the CPU remains in do_idle() with no pending reschedule point. The system stays in this state until an unrelated event (e.g. a new task wakeup or any cases) that can trigger a resched_curr() breaks the nohz_full idle state, and then the task P finally gets scheduled. The root cause is that sched_move_task() may classify the task as only queued, not running, and therefore fails to trigger a resched_curr(), while the later unthrottling path no longer has visibility of the migrated task. Preserve the existing behavior for running tasks by issuing resched_curr(), and explicitly invoke check_preempt_curr() for tasks that were queued at the time of migration. This ensures that runnable tasks are reconsidered for scheduling even when nohz_full suppresses periodic ticks. Fixes: 29f59db3a74b ("sched: group-scheduler core") Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Aaron Lu <ziqianlu@bytedance.com> Tested-by: Aaron Lu <ziqianlu@bytedance.com> Link: https://patch.msgid.link/20260130083438.1122457-1-quzicheng@huawei.com
11 dayssched/cpufreq: Use %pe format for PTR_ERR() printingzenghongling
Use %pe format specifier for printing PTR_ERR() error values to make error messages more readable. Found by Coccinelle: ./cpufreq_schedutil.c:685:49-56: WARNING: Consider using %pe to print PTR_ERR() Signed-off-by: zenghongling <zenghongling@kylinos.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260120083333.148385-1-zenghongling@kylinos.cn
11 dayssched/rt: Skip currently executing CPU in rto_next_cpu()Chen Jinghuang
CPU0 becomes overloaded when hosting a CPU-bound RT task, a non-CPU-bound RT task, and a CFS task stuck in kernel space. When other CPUs switch from RT to non-RT tasks, RT load balancing (LB) is triggered; with HAVE_RT_PUSH_IPI enabled, they send IPIs to CPU0 to drive the execution of rto_push_irq_work_func. During push_rt_task on CPU0, if next_task->prio < rq->donor->prio, resched_curr() sets NEED_RESCHED and after the push operation completes, CPU0 calls rto_next_cpu(). Since only CPU0 is overloaded in this scenario, rto_next_cpu() should ideally return -1 (no further IPI needed). However, multiple CPUs invoking tell_cpu_to_push() during LB increments rd->rto_loop_next. Even when rd->rto_cpu is set to -1, the mismatch between rd->rto_loop and rd->rto_loop_next forces rto_next_cpu() to restart its search from -1. With CPU0 remaining overloaded (satisfying rt_nr_migratory && rt_nr_total > 1), it gets reselected, causing CPU0 to queue irq_work to itself and send self-IPIs repeatedly. As long as CPU0 stays overloaded and other CPUs run pull_rt_tasks(), it falls into an infinite self-IPI loop, which triggers a CPU hardlockup due to continuous self-interrupts. The trigging scenario is as follows: cpu0 cpu1 cpu2 pull_rt_task tell_cpu_to_push <------------irq_work_queue_on rto_push_irq_work_func push_rt_task resched_curr(rq) pull_rt_task rto_next_cpu tell_cpu_to_push <-------------------------- atomic_inc(rto_loop_next) rd->rto_loop != next rto_next_cpu irq_work_queue_on rto_push_irq_work_func Fix redundant self-IPI by filtering the initiating CPU in rto_next_cpu(). This solution has been verified to effectively eliminate spurious self-IPIs and prevent CPU hardlockup scenarios. Fixes: 4bdced5c9a29 ("sched/rt: Simplify the IPI based RT balancing logic") Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://patch.msgid.link/20260122012533.673768-1-chenjinghuang2@huawei.com
11 dayssched/clock: Avoid false sharing for sched_clock_irqtimeWangyang Guo
Read-mostly sched_clock_irqtime may share the same cacheline with frequently updated nohz struct. Make it as static_key to avoid false sharing issue. The only user of disable_sched_clock_irqtime() is tsc_.*mark_unstable() which may be invoked under atomic context and require a workqueue to disable static_key. But both of them calls clear_sched_clock_stable() just before doing disable_sched_clock_irqtime(). We can reuse "sched_clock_work" to also disable sched_clock_irqtime(). One additional case need to handle is if the tsc is marked unstable before late_initcall() phase, sched_clock_work will not be invoked and sched_clock_irqtime will stay enabled although clock is unstable: tsc_init() enable_sched_clock_irqtime() # irqtime accounting is enabled here ... if (unsynchronized_tsc()) # true mark_tsc_unstable() clear_sched_clock_stable() __sched_clock_stable_early = 0; ... if (static_key_count(&sched_clock_running.key) == 2) # Only happens at sched_clock_init_late() __clear_sched_clock_stable(); # Never executed ... # late_initcall() phase sched_clock_init_late() if (__sched_clock_stable_early) # Already false __set_sched_clock_stable(); # sched_clock is never marked stable # TSC unstable, but sched_clock_work won't run to disable irqtime So we need to disable_sched_clock_irqtime() in sched_clock_init_late() if clock is unstable. Reported-by: Benjamin Lei <benjamin.lei@intel.com> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Wangyang Guo <wangyang.guo@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Tianyou Li <tianyou.li@intel.com> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260127072509.2627346-1-wangyang.guo@intel.com
11 dayssched/debug: Fix dl_server (re)start conditionsPeter Zijlstra
There are two problems with sched_server_write_common() that can cause the dl_server to malfunction upon attempting to change the parameters: 1) when, after having disabled the dl_server by setting runtime=0, it is enabled again while tasks are already enqueued. In this case is_active would still be 0 and dl_server_start() would not be called. 2) when dl_server_apply_params() would fail, runtime is not applied and does not reflect the new state. Instead have dl_server_start() check its actual dl_runtime, and have sched_server_write_common() unconditionally (re)start the dl_server. It will automatically stop if there isn't anything to do, so spurious activation is harmless -- while failing to start it is a problem. While there, move the printk out of the locked region and make it symmetric, also printing on enable. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260203103407.GK1282955@noisy.programming.kicks-ass.net
11 dayssched/debug: Add support to change sched_ext server paramsJoel Fernandes
When a sched_ext server is loaded, tasks in the fair class are automatically moved to the sched_ext class. Add support to modify the ext server parameters similar to how the fair server parameters are modified. Re-use common code between ext and fair servers as needed. Co-developed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/20260126100050.3854740-6-arighi@nvidia.com
11 dayssched_ext: Add a DL server for sched_ext tasksAndrea Righi
sched_ext currently suffers starvation due to RT. The same workload when converted to EXT can get zero runtime if RT is 100% running, causing EXT processes to stall. Fix it by adding a DL server for EXT. A kselftest is also included later to confirm that both DL servers are functioning correctly: # ./runner -t rt_stall ===== START ===== TEST: rt_stall DESCRIPTION: Verify that RT tasks cannot stall SCHED_EXT tasks OUTPUT: TAP version 13 1..1 # Runtime of FAIR task (PID 1511) is 0.250000 seconds # Runtime of RT task (PID 1512) is 4.750000 seconds # FAIR task got 5.00% of total runtime ok 1 PASS: FAIR task got more than 4.00% of runtime TAP version 13 1..1 # Runtime of EXT task (PID 1514) is 0.250000 seconds # Runtime of RT task (PID 1515) is 4.750000 seconds # EXT task got 5.00% of total runtime ok 2 PASS: EXT task got more than 4.00% of runtime TAP version 13 1..1 # Runtime of FAIR task (PID 1517) is 0.250000 seconds # Runtime of RT task (PID 1518) is 4.750000 seconds # FAIR task got 5.00% of total runtime ok 3 PASS: FAIR task got more than 4.00% of runtime TAP version 13 1..1 # Runtime of EXT task (PID 1521) is 0.250000 seconds # Runtime of RT task (PID 1522) is 4.750000 seconds # EXT task got 5.00% of total runtime ok 4 PASS: EXT task got more than 4.00% of runtime ok 1 rt_stall # ===== END ===== Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/20260126100050.3854740-5-arighi@nvidia.com
11 dayssched/debug: Stop and start server based on if it was activeJoel Fernandes
Currently the DL server interface for applying parameters checks CFS-internals to identify if the server is active. This is error-prone and makes it difficult when adding new servers in the future. Fix it, by using dl_server_active() which is also used by the DL server code to determine if the DL server was started. Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Tejun Heo <tj@kernel.org> Tested-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/20260126100050.3854740-4-arighi@nvidia.com
11 dayssched/debug: Fix updating of ppos on server write opsJoel Fernandes
Updating "ppos" on error conditions does not make much sense. The pattern is to return the error code directly without modifying the position, or modify the position on success and return the number of bytes written. Since on success, the return value of apply is 0, there is no point in modifying ppos either. Fix it by removing all this and just returning error code or number of bytes written on success. Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Tejun Heo <tj@kernel.org> Tested-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/20260126100050.3854740-3-arighi@nvidia.com
11 dayssched/deadline: Clear the defer paramsJoel Fernandes
The defer params were not cleared in __dl_clear_params. Clear them. Without this is some of my test cases are flaking and the DL timer is not starting correctly AFAICS. Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server") Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/20260126100050.3854740-2-arighi@nvidia.com
11 daysMerge branch 'v6.19-rc8'Peter Zijlstra
Update to avoid conflicts with /urgent patches. Signed-off-by: Peter Zijlstra <peterz@infradead.org>
13 daysdelayacct: add timestamp of delay maxWang Yaxin
Problem ======= Commit 658eb5ab916d ("delayacct: add delay max to record delay peak") introduced the delay max for getdelays, which records abnormal latency peaks and helps us understand the magnitude of such delays. However, the peak latency value alone is insufficient for effective root cause analysis. Without the precise timestamp of when the peak occurred, we still lack the critical context needed to correlate it with other system events. Solution ======== To address this, we need to additionally record a precise timestamp when the maximum latency occurs. By correlating this timestamp with system logs and monitoring metrics, we can identify processes with abnormal resource usage at the same moment, which can help us to pinpoint root causes. Use Case ======== bash-4.4# ./getdelays -d -t 227 print delayacct stats ON TGID 227 CPU count real total virtual total delay total delay average delay max delay min delay max timestamp 46 188000000 192348334 4098012 0.089ms 0.429260ms 0.051205ms 2026-01-15T15:06:58 IO count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A SWAP count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A RECLAIM count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A THRAS HING count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A COMPACT count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A WPCOPY count delay total delay average delay max delay min delay max timestamp 182 19413338 0.107ms 0.547353ms 0.022462ms 2026-01-15T15:05:24 IRQ count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A Link: https://lkml.kernel.org/r/20260119100241520gWubW8-5QfhSf9gjqcc_E@zte.com.cn Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Cc: Fan Yu <fan.yu9@zte.com.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-30sched/deadline: Fix 'stuck' dl_serverPeter Zijlstra
Andrea reported the dl_server getting stuck for him. He tracked it down to a state where dl_server_start() saw dl_defer_running==1, but the dl_server's job is no longer valid at the time of dl_server_start(). In the state diagram this corresponds to [4] D->A (or dl_server_stop() due to no more runnable tasks) followed by [1], which in case of a lapsed deadline must then be A->B. Now our A has dl_defer_running==1, while B demands dl_defer_running==0, therefore it must get cleared when the CBS wakeup rules demand a replenish. Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server") Reported-by: Andrea Righi arighi@nvidia.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Andrea Righi arighi@nvidia.com Link: https://lkml.kernel.org/r/20260123161645.2181752-1-arighi@nvidia.com Link: https://patch.msgid.link/20260130124100.GC1079264@noisy.programming.kicks-ass.net
2026-01-23sched/fair: Revert force wakeup preemptionVincent Guittot
This agressively bypasses run_to_parity and slice protection with the assumpiton that this is what waker wants but there is no garantee that the wakee will be the next to run. It is a better choice to use yield_to_task or WF_SYNC in such case. This increases the number of resched and preemption because a task becomes quickly "ineligible" when it runs; We update the task vruntime periodically and before the task exhausted its slice or at least quantum. Example: 2 tasks A and B wake up simultaneously with lag = 0. Both are eligible. Task A runs 1st and wakes up task C. Scheduler updates task A's vruntime which becomes greater than average runtime as all others have a lag == 0 and didn't run yet. Now task A is ineligible because it received more runtime than the other task but it has not yet exhausted its slice nor a min quantum. We force preemption, disable protection but Task B will run 1st not task C. Sidenote, DELAY_ZERO increases this effect by clearing positive lag at wake up. Fixes: e837456fdca8 ("sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260123102858.52428-1-vincent.guittot@linaro.org
2026-01-23sched/fair: Disable scheduler feature NEXT_BUDDYMel Gorman
NEXT_BUDDY was disabled with the introduction of EEVDF and enabled again after NEXT_BUDDY was rewritten for EEVDF by commit e837456fdca8 ("sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals"). It was not expected that this would be a universal win without a crystal ball instruction but the reported regressions are a concern [1][2] even if gains were also reported. Specifically; o mysql with client/server running on different servers regresses o specjbb reports lower peak metrics o daytrader regresses The mysql is realistic and a concern. It needs to be confirmed if specjbb is simply shifting the point where peak performance is measured but still a concern. daytrader is considered to be representative of a real workload. Access to test machines is currently problematic for verifying any fix to this problem. Disable NEXT_BUDDY for now by default until the root causes are addressed. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Link: https://lore.kernel.org/lkml/4b96909a-f1ac-49eb-b814-97b8adda6229@arm.com [1] Link: https://lore.kernel.org/lkml/ec3ea66f-3a0d-4b5a-ab36-ce778f159b5b@linux.ibm.com [2] Link: https://patch.msgid.link/fyqsk63pkoxpeaclyqsm5nwtz3dyejplr7rg6p74xwemfzdzuu@7m7xhs5aqpqw
2026-01-22sched: Update rq->avg_idle when a task is moved to an idle CPUShubhang Kaushik
Currently, rq->idle_stamp is only used to calculate avg_idle during wakeups. This means other paths that move a task to an idle CPU such as fork/clone, execve, or migrations, do not end the CPU's idle status in the scheduler's eyes, leading to an inaccurate avg_idle. This patch introduces update_rq_avg_idle() to provide a more accurate measurement of CPU idle duration. By invoking this helper in put_prev_task_idle(), we ensure avg_idle is updated whenever a CPU stops being idle, regardless of how the new task arrived. Testing on an 80-core Ampere Altra (ARMv8) with 6.19-rc5 baseline: - Hackbench : +7.2% performance gain at 16 threads. - Schbench: Reduced p99.9 tail latencies at high concurrency. Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com> Link: https://patch.msgid.link/20260121-v8-patch-series-v8-1-b7f1cbee5055@os.amperecomputing.com
2026-01-22sched/debug: Convert copy_from_user() + kstrtouint() to kstrtouint_from_user()Fushuai Wang
Using kstrtouint_from_user() instead of copy_from_user() + kstrtouint() makes the code simpler and less error-prone. Suggested-by: Yury Norov <ynorov@nvidia.com> Signed-off-by: Fushuai Wang <wangfushuai@baidu.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yury Norov <ynorov@nvidia.com> Link: https://patch.msgid.link/20260117145615.53455-2-fushuai.wang@linux.dev
2026-01-21sched/fair: Fix pelt clock sync when entering idleVincent Guittot
Samuel and Alex reported regressions of the util_avg of RT rq with commit 17e3e88ed0b6 ("sched/fair: Fix pelt lost idle time detection"). It happens that fair is updating and syncing the pelt clock with task one when pick_next_task_fair() fails to pick a task but before the prev scheduling class got a chance to update its pelt signals. Move update_idle_rq_clock_pelt() in set_next_task_idle() which is called after prev class has been called. Fixes: 17e3e88ed0b6 ("sched/fair: Fix pelt lost idle time detection") Closes: https://lore.kernel.org/all/CAG2KctpO6VKS6GN4QWDji0t92_gNBJ7HjjXrE+6H+RwRXt=iLg@mail.gmail.com/ Closes: https://lore.kernel.org/all/8cf19bf0e0054dcfed70e9935029201694f1bb5a.camel@mediatek.com/ Reported-by: Samuel Wu <wusamuel@google.com> Reported-by: Alex Hoh <Alex.Hoh@mediatek.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Samuel Wu <wusamuel@google.com> Tested-by: Alex Hoh <Alex.Hoh@mediatek.com> Link: https://patch.msgid.link/20260121163317.505635-1-vincent.guittot@linaro.org
2026-01-15sched/fair: Remove nohz.nr_cpus and use weight of cpumask insteadShrikanth Hegde
nohz.nr_cpus was observed as contended cacheline when running enterprise workload on large systems. Fundamental scalability challenge with nohz.idle_cpus_mask and nohz.nr_cpus is the following: (1) nohz_balancer_kick() observes (reads) nohz.nr_cpus (or nohz.idle_cpu_mask) and nohz.has_blocked to see whether there's any nohz balancing work to do, in every scheduler tick. (2) nohz_balance_enter_idle() and nohz_balance_exit_idle() (through nohz_balancer_kick() via sched_tick()) modify (write) nohz.nr_cpus (and/or nohz.idle_cpu_mask) and nohz.has_blocked. The characteristic frequencies are the following: (1) nohz_balancer_kick() happens at scheduler (busy)tick frequency on CPU(which has not gone idle). This is a relatively constant frequency in the ~1 kHz range or lower. (2) happens at idle enter/exit frequency on every CPU that goes to idle. This is workload dependent, but can easily be hundreds of kHz for IO-bound loads and high CPU counts. Ie. can be orders of magnitude higher than (1), in which case a cachemiss at every invocation of (1) is almost inevitable. idle exit will trigger (1) on the CPU which is coming out of idle. There's two types of costs from these functions: (A) scheduler tick cost via (1): this happens on busy CPUs too, and is thus a primary scalability cost. But the rate here is constant and typically much lower than (B), hence the absolute benefit to workload scalability will be lower as well. (B) idle cost via (2): going-to-idle and coming-from-idle costs are secondary concerns, because they impact power efficiency more than they impact scalability. But in terms of absolute cost this scales up with nr_cpus as well, and a much faster rate, and thus may also approach and negatively impact system limits like memory bus/fabric bandwidth. Note that nohz.idle_cpus_mask and nohz.nr_cpus may appear to reside in the same cacheline, however under CONFIG_CPUMASK_OFFSTACK=y the backing storage for nohz.idle_cpus_mask will be elsewhere. With CPUMASK_OFFSTACK=n, the nohz.idle_cpus_mask and rest of nohz fields are in different cachelines under typical NR_CPUS=512/2048. This implies two separate cachelines being dirtied upon idle entry / exit. nohz.nr_cpus can be derived from the mask itself. Its usage doesn't warrant a functionally correct value. This means one less cacheline being dirtied in idle entry/exit path which helps to save some bus bandwidth w.r.t to those nohz functions(approx 50%). This in turn helps to improve enterprise workload throughput. On system with 480 CPUs, running "hackbench 40 process 10000 loops" (Avg of 3 runs) baseline: 0.81% hackbench [k] nohz_balance_exit_idle 0.21% hackbench [k] nohz_balancer_kick 0.09% swapper [k] nohz_run_idle_balance With patch: 0.35% hackbench [k] nohz_balance_exit_idle 0.09% hackbench [k] nohz_balancer_kick 0.07% swapper [k] nohz_run_idle_balance [Ingo Molnar: scalability analysis changlog] Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260115073524.376643-4-sshegde@linux.ibm.com
2026-01-15sched/fair: Change likelyhood of nohz.nr_cpusShrikanth Hegde
These days most of the system have multi cores. The likelyhood of at least one or more CPUs in nohz (idle state) is higher. Give accurate hint to the branch predictor. Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260115073524.376643-3-sshegde@linux.ibm.com
2026-01-15sched/fair: Move checking for nohz cpus after time checkShrikanth Hegde
Current code does. - Read nohz.nr_cpus - Check if the time has passed to do NOHZ idle balance Instead do this. - Check if the time has passed to do NOHZ idle balance - Read nohz.nr_cpus This will skip the read most of the time in normal system usage. i.e when there are nohz.nr_cpus (system is not 100% busy). Note that when there are no idle CPUs(100% busy), even if the flag gets set to NOHZ_STATS_KICK | NOHZ_NEXT_KICK, find_new_ilb will fail and there will be no NOHZ idle balance. In such cases there will be a very narrow window where, kick_ilb will be called un-necessarily. However current functionality is still retained. Note: This patch doesn't solve any cacheline overheads. No improvement in performance apart from saving a few cycles of reading nohz.nr_cpus Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260115073524.376643-2-sshegde@linux.ibm.com
2026-01-15sched/fair: Fix math notation errors in avg_vruntime commentZhan Xusheng
The avg_vruntime comment contains a couple of mathematical notation issues: - The summation over w_i * (V - v_i) is written in an ambiguous form - The delta term refers to v instead of v0, which is inconsistent with the code and preceding explanation Fix these to make the comment mathematically correct and consistent with the implementation. Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260114090035.19033-1-zhanxusheng@xiaomi.com
2026-01-15sched: Fix build for modules using set_tsk_need_resched()Gabriele Monaco
Commit adcc3bfa8806 ("sched: Adapt sched tracepoints for RV task model") added a tracepoint to the need_resched action that can be triggered also by set_tsk_need_resched. This function was previously accessible from out-of-tree modules but it's no longer available because the __trace_set_need_resched() symbol is not exported (together with the tracepoint itself, which was exported in a separate patch) and building such modules fails. Export __trace_set_need_resched to modules to fix those build issues. Fixes: adcc3bfa8806 ("sched: Adapt sched tracepoints for RV task model") Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://patch.msgid.link/20260112140413.362202-1-gmonaco@redhat.com
2026-01-15sched/deadline: Use ENQUEUE_MOVE to allow priority changePeter Zijlstra
Pierre reported hitting balance callback warnings for deadline tasks after commit 6455ad5346c9 ("sched: Move sched_class::prio_changed() into the change pattern"). It turns out that DEQUEUE_SAVE+ENQUEUE_RESTORE does not preserve DL priority and subsequently trips a balance pass -- where one was not expected. From discussion with Juri and Luca, the purpose of this clause was to deal with tasks new to DL and all those sites will have MOVE set (as well as CLASS, but MOVE is move conservative at this point). Per the previous patches MOVE is audited to always run the balance callbacks, so switch enqueue_dl_entity() to use MOVE for this case. Fixes: 6455ad5346c9 ("sched: Move sched_class::prio_changed() into the change pattern") Reported-by: Pierre Gondois <pierre.gondois@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2026-01-15sched: Deadline has dynamic priorityPeter Zijlstra
While FIFO/RR have static priority, DEADLINE is a dynamic priority scheme. Notably it has static priority -1. Do not assume the priority doesn't change for deadline tasks just because the static priority doesn't change. This ensures DL always sees {DE,EN}QUEUE_MOVE where appropriate. Fixes: ff77e4685359 ("sched/rt: Fix PI handling vs. sched_setscheduler()") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2026-01-15sched: Audit MOVE vs balance_callbacksPeter Zijlstra
The {DE,EN}QUEUE_MOVE flag indicates a task is allowed to change priority, which means there could be balance callbacks queued. Therefore audit all MOVE users and make sure they do run balance callbacks before dropping rq-lock. Fixes: 6455ad5346c9 ("sched: Move sched_class::prio_changed() into the change pattern") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net