user/sven/linux.git/kernel/sched, branch v6.1.85

sched/fair: Take the scheduling domain into account in select_idle_core()

2024-03-26T22:20:30Z

[ Upstream commit 23d04d8c6b8ec339057264659b7834027f3e6a63 ] When picking a CPU on task wakeup, select_idle_core() has to take into account the scheduling domain where the function looks for the CPU. This is because the "isolcpus" kernel command line option can remove CPUs from the domain to isolate them from other SMT siblings. This change replaces the set of CPUs allowed to run the task from p->cpus_ptr by the intersection of p->cpus_ptr and sched_domain_span(sd) which is stored in the 'cpus' argument provided by select_idle_cpu(). Fixes: 9fe1f127b913 ("sched/fair: Merge select_idle_core/cpu()") Signed-off-by: Keisuke Nishimura Signed-off-by: Julia Lawall Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20240110131707.437301-2-keisuke.nishimura@inria.fr Signed-off-by: Sasha Levin

sched/fair: Take the scheduling domain into account in select_idle_smt()

2024-03-26T22:20:29Z

[ Upstream commit 8aeaffef8c6eceab0e1498486fdd4f3dc3b7066c ] When picking a CPU on task wakeup, select_idle_smt() has to take into account the scheduling domain of @target. This is because the "isolcpus" kernel command line option can remove CPUs from the domain to isolate them from other SMT siblings. This fix checks if the candidate CPU is in the target scheduling domain. Commit: df3cb4ea1fb6 ("sched/fair: Fix wrong cpu selecting from isolated domain") ... originally introduced this fix by adding the check of the scheduling domain in the loop. However, commit: 3e6efe87cd5cc ("sched/fair: Remove redundant check in select_idle_smt()") ... accidentally removed the check. Bring it back. Fixes: 3e6efe87cd5c ("sched/fair: Remove redundant check in select_idle_smt()") Signed-off-by: Keisuke Nishimura Signed-off-by: Julia Lawall Signed-off-by: Ingo Molnar Reviewed-by: Vincent Guittot Link: https://lore.kernel.org/r/20240110131707.437301-1-keisuke.nishimura@inria.fr Signed-off-by: Sasha Levin

sched/rt: sysctl_sched_rr_timeslice show default timeslice after reset

2024-03-01T12:26:24Z

commit c1fc6484e1fb7cc2481d169bfef129a1b0676abe upstream. The sched_rr_timeslice can be reset to default by writing value that is <= 0. However after reading from this file we always got the last value written, which is not useful at all. $ echo -1 > /proc/sys/kernel/sched_rr_timeslice_ms $ cat /proc/sys/kernel/sched_rr_timeslice_ms -1 Fix this by setting the variable that holds the sysctl file value to the jiffies_to_msecs(RR_TIMESLICE) in case that <= 0 value was written. Signed-off-by: Cyril Hrubis Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Petr Vorel Acked-by: Mel Gorman Tested-by: Petr Vorel Cc: Mahmoud Adam Link: https://lore.kernel.org/r/20230802151906.25258-3-chrubis@suse.cz Signed-off-by: Greg Kroah-Hartman

sched/rt: Disallow writing invalid values to sched_rt_period_us

2024-03-01T12:26:24Z

commit 079be8fc630943d9fc70a97807feb73d169ee3fc upstream. The validation of the value written to sched_rt_period_us was broken because: - the sysclt_sched_rt_period is declared as unsigned int - parsed by proc_do_intvec() - the range is asserted after the value parsed by proc_do_intvec() Because of this negative values written to the file were written into a unsigned integer that were later on interpreted as large positive integers which did passed the check: if (sysclt_sched_rt_period <= 0) return EINVAL; This commit fixes the parsing by setting explicit range for both perid_us and runtime_us into the sched_rt_sysctls table and processes the values with proc_dointvec_minmax() instead. Alternatively if we wanted to use full range of unsigned int for the period value we would have to split the proc_handler and use proc_douintvec() for it however even the Documentation/scheduller/sched-rt-group.rst describes the range as 1 to INT_MAX. As far as I can tell the only problem this causes is that the sysctl file allows writing negative values which when read back may confuse userspace. There is also a LTP test being submitted for these sysctl files at: http://patchwork.ozlabs.org/project/ltp/patch/20230901144433.2526-1-chrubis@suse.cz/ Signed-off-by: Cyril Hrubis Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20231002115553.3007-2-chrubis@suse.cz Cc: Mahmoud Adam Signed-off-by: Greg Kroah-Hartman

sched/membarrier: reduce the ability to hammer on sys_membarrier

2024-02-23T08:12:52Z

commit 944d5fe50f3f03daacfea16300e656a1691c4a23 upstream. On some systems, sys_membarrier can be very expensive, causing overall slowdowns for everything. So put a lock on the path in order to serialize the accesses to prevent the ability for this to be called at too high of a frequency and saturate the machine. Reviewed-and-tested-by: Mathieu Desnoyers Acked-by: Borislav Petkov Fixes: 22e4ebb97582 ("membarrier: Provide expedited private command") Fixes: c5f58bd58f43 ("membarrier: Provide GLOBAL_EXPEDITED command") Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman

sched: Fix stop_one_cpu_nowait() vs hotplug

2023-11-20T10:51:50Z

[ Upstream commit f0498d2a54e7966ce23cd7c7ff42c64fa0059b07 ] Kuyo reported sporadic failures on a sched_setaffinity() vs CPU hotplug stress-test -- notably affine_move_task() remains stuck in wait_for_completion(), leading to a hung-task detector warning. Specifically, it was reported that stop_one_cpu_nowait(.fn = migration_cpu_stop) returns false -- this stopper is responsible for the matching complete(). The race scenario is: CPU0 CPU1 // doing _cpu_down() __set_cpus_allowed_ptr() task_rq_lock(); takedown_cpu() stop_machine_cpuslocked(take_cpu_down..) ack_state() MULTI_STOP_RUN take_cpu_down() __cpu_disable(); stop_machine_park(); stopper->enabled = false; /> /> stop_one_cpu_nowait(.fn = migration_cpu_stop); if (stopper->enabled) // false!!! That is, by doing stop_one_cpu_nowait() after dropping rq-lock, the stopper thread gets a chance to preempt and allows the cpu-down for the target CPU to complete. OTOH, since stop_one_cpu_nowait() / cpu_stop_queue_work() needs to issue a wakeup, it must not be ran under the scheduler locks. Solve this apparent contradiction by keeping preemption disabled over the unlock + queue_stopper combination: preempt_disable(); task_rq_unlock(...); if (!stop_pending) stop_one_cpu_nowait(...) preempt_enable(); This respects the lock ordering contraints while still avoiding the above race. That is, if we find the CPU is online under rq-lock, the targeted stop_one_cpu_nowait() must succeed. Apply this pattern to all similar stop_one_cpu_nowait() invocations. Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()") Reported-by: "Kuyo Chang (張建文)" Signed-off-by: Peter Zijlstra (Intel) Tested-by: "Kuyo Chang (張建文)" Link: https://lkml.kernel.org/r/20231010200442.GA16515@noisy.programming.kicks-ass.net Signed-off-by: Sasha Levin

sched/uclamp: Ignore (util == 0) optimization in feec() when p_util_max = 0

2023-11-20T10:51:50Z

[ Upstream commit 23c9519def98ee0fa97ea5871535e9b136f522fc ] find_energy_efficient_cpu() bails out early if effective util of the task is 0 as the delta at this point will be zero and there's nothing for EAS to do. When uclamp is being used, this could lead to wrong decisions when uclamp_max is set to 0. In this case the task is capped to performance point 0, but it is actually running and consuming energy and we can benefit from EAS energy calculations. Rework the condition so that it bails out when both util and uclamp_min are 0. We can do that without needing to use uclamp_task_util(); remove it. Fixes: d81304bc6193 ("sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition") Signed-off-by: Qais Yousef (Google) Signed-off-by: Ingo Molnar Reviewed-by: Vincent Guittot Reviewed-by: Dietmar Eggemann Acked-by: Peter Zijlstra (Intel) Link: https://lore.kernel.org/r/20230916232955.2099394-3-qyousef@layalina.io Signed-off-by: Sasha Levin

sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0

2023-11-20T10:51:49Z

[ Upstream commit 6b00a40147653c8ea748e8f4396510f252763364 ] When uclamp_max is being used, the util of the task could be higher than the spare capacity of the CPU, but due to uclamp_max value we force-fit it there. The way the condition for checking for max_spare_cap in find_energy_efficient_cpu() was constructed; it ignored any CPU that has its spare_cap less than or _equal_ to max_spare_cap. Since we initialize max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and hence ending up never performing compute_energy() for this cluster and missing an opportunity for a better energy efficient placement to honour uclamp_max setting. max_spare_cap = 0; cpu_cap = capacity_of(cpu) - cpu_util(p); // 0 if cpu_util(p) is high ... util_fits_cpu(...); // will return true if uclamp_max forces it to fit ... // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0 if (cpu_cap > max_spare_cap) { max_spare_cap = cpu_cap; max_spare_cap_cpu = cpu; } prev_spare_cap suffers from a similar problem. Fix the logic by converting the variables into long and treating -1 value as 'not populated' instead of 0 which is a viable and correct spare capacity value. We need to be careful signed comparison is used when comparing with cpu_cap in one of the conditions. Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions") Signed-off-by: Qais Yousef (Google) Signed-off-by: Ingo Molnar Reviewed-by: Vincent Guittot Reviewed-by: Dietmar Eggemann Acked-by: Peter Zijlstra (Intel) Link: https://lore.kernel.org/r/20230916232955.2099394-2-qyousef@layalina.io Signed-off-by: Sasha Levin

sched/fair: Fix cfs_rq_is_decayed() on !SMP

2023-11-20T10:51:49Z

[ Upstream commit c0490bc9bb62d9376f3dd4ec28e03ca0fef97152 ] We don't need to maintain per-queue leaf_cfs_rq_list on !SMP, since it's used for cfs_rq load tracking & balancing on SMP. But sched debug interface uses it to print per-cfs_rq stats. This patch fixes the !SMP version of cfs_rq_is_decayed(), so the per-queue leaf_cfs_rq_list is also maintained correctly on !SMP, to fix the warning in assert_list_leaf_cfs_rq(). Fixes: 0a00a354644e ("sched/fair: Delete useless condition in tg_unthrottle_up()") Reported-by: Leo Yu-Chi Liang Signed-off-by: Chengming Zhou Signed-off-by: Ingo Molnar Tested-by: Leo Yu-Chi Liang Reviewed-by: Vincent Guittot Closes: https://lore.kernel.org/all/ZN87UsqkWcFLDxea@swlinux02/ Link: https://lore.kernel.org/r/20230913132031.2242151-1-chengming.zhou@linux.dev Signed-off-by: Sasha Levin

cpufreq: schedutil: Update next_freq when cpufreq_limits change

2023-10-25T10:03:11Z

[ Upstream commit 9e0bc36ab07c550d791bf17feeb479f1dfc42d89 ] When cpufreq's policy is 'single', there is a scenario that will cause sg_policy's next_freq to be unable to update. When the CPU's util is always max, the cpufreq will be max, and then if we change the policy's scaling_max_freq to be a lower freq, indeed, the sg_policy's next_freq need change to be the lower freq, however, because the cpu_is_busy, the next_freq would keep the max_freq. For example: The cpu7 is a single CPU: unisoc:/sys/devices/system/cpu/cpufreq/policy7 # while true;do done& [1] 4737 unisoc:/sys/devices/system/cpu/cpufreq/policy7 # taskset -p 80 4737 pid 4737's current affinity mask: ff pid 4737's new affinity mask: 80 unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq 2301000 unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_cur_freq 2301000 unisoc:/sys/devices/system/cpu/cpufreq/policy7 # echo 2171000 > scaling_max_freq unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq 2171000 At this time, the sg_policy's next_freq would stay at 2301000, which is wrong. To fix this, add a check for the ->need_freq_update flag. [ mingo: Clarified the changelog. ] Co-developed-by: Guohua Yan Signed-off-by: Xuewen Yan Signed-off-by: Guohua Yan Signed-off-by: Ingo Molnar Acked-by: "Rafael J. Wysocki" Link: https://lore.kernel.org/r/20230719130527.8074-1-xuewen.yan@unisoc.com Signed-off-by: Sasha Levin