user/sven/linux.git/kernel/sched, branch v6.14.8

sched_ext: bpf_iter_scx_dsq_new() should always initialize iterator

2025-05-22T12:31:47Z

commit 428dc9fc0873989d73918d4a9cc22745b7bbc799 upstream. BPF programs may call next() and destroy() on BPF iterators even after new() returns an error value (e.g. bpf_for_each() macro ignores error returns from new()). bpf_iter_scx_dsq_new() could leave the iterator in an uninitialized state after an error return causing bpf_iter_scx_dsq_next() to dereference garbage data. Make bpf_iter_scx_dsq_new() always clear $kit->dsq so that next() and destroy() become noops. Signed-off-by: Tejun Heo Fixes: 650ba21b131e ("sched_ext: Implement DSQ iterator") Cc: stable@vger.kernel.org # v6.12+ Acked-by: Andrea Righi Signed-off-by: Greg Kroah-Hartman

sched_ext: Use kvzalloc for large exit_dump allocation

2025-05-02T06:01:46Z

commit 47068309b5777313b6ac84a77d8d10dc7312260a upstream. Replace kzalloc with kvzalloc for the exit_dump buffer allocation, which can require large contiguous memory depending on the implementation. This change prevents allocation failures by allowing the system to fall back to vmalloc when contiguous memory allocation fails. Since this buffer is only used for debugging purposes, physical memory contiguity is not required, making vmalloc a suitable alternative. Cc: stable@vger.kernel.org Fixes: 07814a9439a3b0 ("sched_ext: Print debug dump after an error exit") Suggested-by: Rik van Riel Signed-off-by: Breno Leitao Acked-by: Andrea Righi Signed-off-by: Tejun Heo Signed-off-by: Greg Kroah-Hartman

sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash

2025-05-02T06:01:44Z

[ Upstream commit bbce3de72be56e4b5f68924b7da9630cc89aa1a8 ] There is a code path in dequeue_entities() that can set the slice of a sched_entity to U64_MAX, which sometimes results in a crash. The offending case is when dequeue_entities() is called to dequeue a delayed group entity, and then the entity's parent's dequeue is delayed. In that case: 1. In the if (entity_is_task(se)) else block at the beginning of dequeue_entities(), slice is set to cfs_rq_min_slice(group_cfs_rq(se)). If the entity was delayed, then it has no queued tasks, so cfs_rq_min_slice() returns U64_MAX. 2. The first for_each_sched_entity() loop dequeues the entity. 3. If the entity was its parent's only child, then the next iteration tries to dequeue the parent. 4. If the parent's dequeue needs to be delayed, then it breaks from the first for_each_sched_entity() loop _without updating slice_. 5. The second for_each_sched_entity() loop sets the parent's ->slice to the saved slice, which is still U64_MAX. This throws off subsequent calculations with potentially catastrophic results. A manifestation we saw in production was: 6. In update_entity_lag(), se->slice is used to calculate limit, which ends up as a huge negative number. 7. limit is used in se->vlag = clamp(vlag, -limit, limit). Because limit is negative, vlag > limit, so se->vlag is set to the same huge negative number. 8. In place_entity(), se->vlag is scaled, which overflows and results in another huge (positive or negative) number. 9. The adjusted lag is subtracted from se->vruntime, which increases or decreases se->vruntime by a huge number. 10. pick_eevdf() calls entity_eligible()/vruntime_eligible(), which incorrectly returns false because the vruntime is so far from the other vruntimes on the queue, causing the (vruntime - cfs_rq->min_vruntime) * load calulation to overflow. 11. Nothing appears to be eligible, so pick_eevdf() returns NULL. 12. pick_next_entity() tries to dereference the return value of pick_eevdf() and crashes. Dumping the cfs_rq states from the core dumps with drgn showed tell-tale huge vruntime ranges and bogus vlag values, and I also traced se->slice being set to U64_MAX on live systems (which was usually "benign" since the rest of the runqueue needed to be in a particular state to crash). Fix it in dequeue_entities() by always setting slice from the first non-empty cfs_rq. Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy") Signed-off-by: Omar Sandoval Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/f0c2d1072be229e1bdddc73c0703919a8b00c652.1745570998.git.osandov@fb.com Signed-off-by: Sasha Levin

cpufreq/sched: Explicitly synchronize limits_changed flag handling

2025-04-25T08:51:04Z

commit 79443a7e9da3c9f68290a8653837e23aba0fa89f upstream. The handling of the limits_changed flag in struct sugov_policy needs to be explicitly synchronized to ensure that cpufreq policy limits updates will not be missed in some cases. Without that synchronization it is theoretically possible that the limits_changed update in sugov_should_update_freq() will be reordered with respect to the reads of the policy limits in cpufreq_driver_resolve_freq() and in that case, if the limits_changed update in sugov_limits() clobbers the one in sugov_should_update_freq(), the new policy limits may not take effect for a long time. Likewise, the limits_changed update in sugov_limits() may theoretically get reordered with respect to the updates of the policy limits in cpufreq_set_policy() and if sugov_should_update_freq() runs between them, the policy limits change may be missed. To ensure that the above situations will not take place, add memory barriers preventing the reordering in question from taking place and add READ_ONCE() and WRITE_ONCE() annotations around all of the limits_changed flag updates to prevent the compiler from messing up with that code. Fixes: 600f5badb78c ("cpufreq: schedutil: Don't skip freq update when limits change") Cc: 5.3+ # 5.3+ Signed-off-by: Rafael J. Wysocki Reviewed-by: Christian Loehle Link: https://patch.msgid.link/3376719.44csPzL39Z@rjwysocki.net Signed-off-by: Greg Kroah-Hartman

cpufreq/sched: Fix the usage of CPUFREQ_NEED_UPDATE_LIMITS

2025-04-25T08:50:59Z

[ Upstream commit cfde542df7dd51d26cf667f4af497878ddffd85a ] Commit 8e461a1cb43d ("cpufreq: schedutil: Fix superfluous updates caused by need_freq_update") modified sugov_should_update_freq() to set the need_freq_update flag only for drivers with CPUFREQ_NEED_UPDATE_LIMITS set, but that flag generally needs to be set when the policy limits change because the driver callback may need to be invoked for the new limits to take effect. However, if the return value of cpufreq_driver_resolve_freq() after applying the new limits is still equal to the previously selected frequency, the driver callback needs to be invoked only in the case when CPUFREQ_NEED_UPDATE_LIMITS is set (which means that the driver specifically wants its callback to be invoked every time the policy limits change). Update the code accordingly to avoid missing policy limits changes for drivers without CPUFREQ_NEED_UPDATE_LIMITS. Fixes: 8e461a1cb43d ("cpufreq: schedutil: Fix superfluous updates caused by need_freq_update") Closes: https://lore.kernel.org/lkml/Z_Tlc6Qs-tYpxWYb@linaro.org/ Reported-by: Stephan Gerhold Signed-off-by: Rafael J. Wysocki Reviewed-by: Christian Loehle Link: https://patch.msgid.link/3010358.e9J7NaK4W3@rjwysocki.net Signed-off-by: Sasha Levin

sched_ext: create_dsq: Return -EEXIST on duplicate request

2025-04-20T08:23:19Z

commit a8897ed8523d4c9d782e282b18005a3779c92714 upstream. create_dsq and therefore the scx_bpf_create_dsq kfunc currently silently ignore duplicate entries. As a sched_ext scheduler is creating each DSQ for a different purpose this is surprising behaviour. Replace rhashtable_insert_fast which ignores duplicates with rhashtable_lookup_insert_fast that reports duplicates (though doesn't return their value). The rest of the code is structured correctly and this now returns -EEXIST. Tested by adding an extra scx_bpf_create_dsq to scx_simple. Previously this was ignored, now init fails with a -17 code. Also ran scx_lavd which continued to work well. Signed-off-by: Jake Hillion Acked-by: Andrea Righi Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Tejun Heo Signed-off-by: Greg Kroah-Hartman

tracing: Disable branch profiling in noinstr code

2025-04-20T08:22:25Z

[ Upstream commit 2cbb20b008dba39893f0e296dc8ca312f40a9a0e ] CONFIG_TRACE_BRANCH_PROFILING inserts a call to ftrace_likely_update() for each use of likely() or unlikely(). That breaks noinstr rules if the affected function is annotated as noinstr. Disable branch profiling for files with noinstr functions. In addition to some individual files, this also includes the entire arch/x86 subtree, as well as the kernel/entry, drivers/cpuidle, and drivers/idle directories, all of which are noinstr-heavy. Due to the nature of how sched binaries are built by combining multiple .c files into one, branch profiling is disabled more broadly across the sched code than would otherwise be needed. This fixes many warnings like the following: vmlinux.o: warning: objtool: do_syscall_64+0x40: call to ftrace_likely_update() leaves .noinstr.text section vmlinux.o: warning: objtool: __rdgsbase_inactive+0x33: call to ftrace_likely_update() leaves .noinstr.text section vmlinux.o: warning: objtool: handle_bug.isra.0+0x198: call to ftrace_likely_update() leaves .noinstr.text section ... Reported-by: Ingo Molnar Suggested-by: Steven Rostedt Signed-off-by: Josh Poimboeuf Signed-off-by: Ingo Molnar Acked-by: Thomas Gleixner Cc: Linus Torvalds Link: https://lore.kernel.org/r/fb94fc9303d48a5ed370498f54500cc4c338eb6d.1742586676.git.jpoimboe@kernel.org Signed-off-by: Sasha Levin

sched/deadline: Rebuild root domain accounting after every update

2025-04-10T12:43:59Z

[ Upstream commit 2ff899e3516437354204423ef0a94994717b8e6a ] Rebuilding of root domains accounting information (total_bw) is currently broken on some cases, e.g. suspend/resume on aarch64. Problem is that the way we keep track of domain changes and try to add bandwidth back is convoluted and fragile. Fix it by simplify things by making sure bandwidth accounting is cleared and completely restored after root domains changes (after root domains are again stable). To be sure we always call dl_rebuild_rd_accounting while holding cpuset_mutex we also add cpuset_reset_sched_domains() wrapper. Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter Co-developed-by: Waiman Long Signed-off-by: Waiman Long Signed-off-by: Juri Lelli Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Dietmar Eggemann Tested-by: Dietmar Eggemann Link: https://lore.kernel.org/r/Z9MRfeJKJUOyUSto@jlelli-thinkpadt14gen4.remote.csb Signed-off-by: Sasha Levin

sched/deadline: Generalize unique visiting of root domains

2025-04-10T12:43:58Z

[ Upstream commit 45007c6fb5860cf63556a9cadc87c8984927e23d ] Bandwidth checks and updates that work on root domains currently employ a cookie mechanism for efficiency. This mechanism is very much tied to when root domains are first created and initialized. Generalize the cookie mechanism so that it can be used also later at runtime while updating root domains. Also, additionally guard it with sched_domains_mutex, since domains need to be stable while updating them (and it will be required for further dynamic changes). Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter Signed-off-by: Juri Lelli Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Valentin Schneider Reviewed-by: Dietmar Eggemann Tested-by: Waiman Long Tested-by: Jon Hunter Tested-by: Dietmar Eggemann Link: https://lore.kernel.org/r/Z9MQaiXPvEeW_v7x@jlelli-thinkpadt14gen4.remote.csb Signed-off-by: Sasha Levin

sched/topology: Wrappers for sched_domains_mutex

2025-04-10T12:43:58Z

[ Upstream commit 56209334dda1832c0a919e1d74768c6d0f3b2ca9 ] Create wrappers for sched_domains_mutex so that it can transparently be used on both CONFIG_SMP and !CONFIG_SMP, as some function will need to do. Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter Signed-off-by: Juri Lelli Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Valentin Schneider Reviewed-by: Dietmar Eggemann Tested-by: Waiman Long Tested-by: Jon Hunter Tested-by: Dietmar Eggemann Link: https://lore.kernel.org/r/Z9MP5Oq9RB8jBs3y@jlelli-thinkpadt14gen4.remote.csb Signed-off-by: Sasha Levin