user/sven/linux.git/kernel/sched, branch v5.15.34

sched: Teach the forced-newidle balancer about CPU affinity limitation.

2022-04-13T18:59:27Z

commit 386ef214c3c6ab111d05e1790e79475363abaa05 upstream. try_steal_cookie() looks at task_struct::cpus_mask to decide if the task could be moved to `this' CPU. It ignores that the task might be in a migration disabled section while not on the CPU. In this case the task must not be moved otherwise per-CPU assumption are broken. Use is_cpu_allowed(), as suggested by Peter Zijlstra, to decide if the a task can be moved. Fixes: d2dfa17bc7de6 ("sched: Trivial forced-newidle balancer") Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/YjNK9El+3fzGmswf@linutronix.de Signed-off-by: Greg Kroah-Hartman

sched/rt: Plug rt_mutex_setprio() vs push_rt_task() race

2022-04-08T12:23:11Z

[ Upstream commit 49bef33e4b87b743495627a529029156c6e09530 ] John reported that push_rt_task() can end up invoking find_lowest_rq(rq->curr) when curr is not an RT task (in this case a CFS one), which causes mayhem down convert_prio(). This can happen when current gets demoted to e.g. CFS when releasing an rt_mutex, and the local CPU gets hit with an rto_push_work irqwork before getting the chance to reschedule. Exactly who triggers this work isn't entirely clear to me - switched_from_rt() only invokes rt_queue_pull_task() if there are no RT tasks on the local RQ, which means the local CPU can't be in the rto_mask. My current suspected sequence is something along the lines of the below, with the demoted task being current. mark_wakeup_next_waiter() rt_mutex_adjust_prio() rt_mutex_setprio() // deboost originally-CFS task check_class_changed() switched_from_rt() // Only rt_queue_pull_task() if !rq->rt.rt_nr_running switched_to_fair() // Sets need_resched __balance_callbacks() // if pull_rt_task(), tell_cpu_to_push() can't select local CPU per the above raw_spin_rq_unlock(rq) // need_resched is set, so task_woken_rt() can't // invoke push_rt_tasks(). Best I can come up with is // local CPU has rt_nr_migratory >= 2 after the demotion, so stays // in the rto_mask, and then: push_rt_task() // breakage follows here as rq->curr is CFS Move an existing check to check rq->curr vs the next pushable task's priority before getting anywhere near find_lowest_rq(). While at it, add an explicit sched_class of rq->curr check prior to invoking find_lowest_rq(rq->curr). Align the DL logic to also reschedule regardless of next_task's migratability. Fixes: a7c81556ec4d ("sched: Fix migrate_disable() vs rt/dl balancing") Reported-by: John Keeping Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Dietmar Eggemann Tested-by: John Keeping Link: https://lore.kernel.org/r/20220127154059.974729-1-valentin.schneider@arm.com Signed-off-by: Sasha Levin

sched/cpuacct: Fix charge percpu cpuusage

2022-04-08T12:23:11Z

[ Upstream commit 248cc9993d1cc12b8e9ed716cc3fc09f6c3517dd ] The cpuacct_account_field() is always called by the current task itself, so it's ok to use __this_cpu_add() to charge the tick time. But cpuacct_charge() maybe called by update_curr() in load_balance() on a random CPU, different from the CPU on which the task is running. So __this_cpu_add() will charge that cputime to a random incorrect CPU. Fixes: 73e6aafd9ea8 ("sched/cpuacct: Simplify the cpuacct code") Reported-by: Minye Zhu Signed-off-by: Chengming Zhou Signed-off-by: Peter Zijlstra (Intel) Acked-by: Tejun Heo Link: https://lore.kernel.org/r/20220220051426.5274-1-zhouchengming@bytedance.com Signed-off-by: Sasha Levin

sched/fair: Improve consistency of allowed NUMA balance calculations

2022-04-08T12:23:11Z

[ Upstream commit 2cfb7a1b031b0e816af7a6ee0c6ab83b0acdf05a ] There are inconsistencies when determining if a NUMA imbalance is allowed that should be corrected. o allow_numa_imbalance changes types and is not always examining the destination group so both the type should be corrected as well as the naming. o find_idlest_group uses the sched_domain's weight instead of the group weight which is different to find_busiest_group o find_busiest_group uses the source group instead of the destination which is different to task_numa_find_cpu o Both find_idlest_group and find_busiest_group should account for the number of running tasks if a move was allowed to be consistent with task_numa_find_cpu Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes") Signed-off-by: Mel Gorman Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Gautham R. Shenoy Link: https://lore.kernel.org/r/20220208094334.16379-2-mgorman@techsingularity.net Signed-off-by: Sasha Levin

sched/uclamp: Fix iowait boost escaping uclamp restriction

2022-04-08T12:23:10Z

[ Upstream commit d37aee9018e68b0d356195caefbb651910e0bbfa ] iowait_boost signal is applied independently of util and doesn't take into account uclamp settings of the rq. An io heavy task that is capped by uclamp_max could still request higher frequency because sugov_iowait_apply() doesn't clamp the boost via uclamp_rq_util_with() like effective_cpu_util() does. Make sure that iowait_boost honours uclamp requests by calling uclamp_rq_util_with() when applying the boost. Fixes: 982d9cdc22c9 ("sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks") Signed-off-by: Qais Yousef Signed-off-by: Peter Zijlstra (Intel) Acked-by: Rafael J. Wysocki Link: https://lore.kernel.org/r/20211216225320.2957053-3-qais.yousef@arm.com Signed-off-by: Sasha Levin

sched/core: Export pelt_thermal_tp

2022-04-08T12:23:10Z

[ Upstream commit 77cf151b7bbdfa3577b3c3f3a5e267a6c60a263b ] We can't use this tracepoint in modules without having the symbol exported first, fix that. Fixes: 765047932f15 ("sched/pelt: Add support to track thermal pressure") Signed-off-by: Qais Yousef Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20211028115005.873539-1-qais.yousef@arm.com Signed-off-by: Sasha Levin

sched/debug: Remove mpol_get/put and task_lock/unlock from sched_show_numa

2022-04-08T12:23:10Z

[ Upstream commit 28c988c3ec29db74a1dda631b18785958d57df4f ] The older format of /proc/pid/sched printed home node info which required the mempolicy and task lock around mpol_get(). However the format has changed since then and there is no need for sched_show_numa() any more to have mempolicy argument, asssociated mpol_get/put and task_lock/unlock. Remove them. Fixes: 397f2378f1361 ("sched/numa: Fix numa balancing stats in /proc/pid/sched") Signed-off-by: Bharata B Rao Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Srikar Dronamraju Acked-by: Mel Gorman Link: https://lore.kernel.org/r/20220118050515.2973-1-bharata@amd.com Signed-off-by: Sasha Levin

sched: Fix yet more sched_fork() races

2022-03-08T18:12:49Z

commit b1e8206582f9d680cff7d04828708c8b6ab32957 upstream. Where commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") fixed a fork race vs cgroup, it opened up a race vs syscalls by not placing the task on the runqueue before it gets exposed through the pidhash. Commit 13765de8148f ("sched/fair: Fix fault in reweight_entity") is trying to fix a single instance of this, instead fix the whole class of issues, effectively reverting this commit. Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") Reported-by: Linus Torvalds Signed-off-by: Peter Zijlstra (Intel) Tested-by: Tadeusz Struk Tested-by: Zhang Qiao Tested-by: Dietmar Eggemann Link: https://lkml.kernel.org/r/YgoeCbwj5mbCR0qA@hirez.programming.kicks-ass.net Signed-off-by: Greg Kroah-Hartman

sched/fair: Fix fault in reweight_entity

2022-03-08T18:12:32Z

[ Upstream commit 13765de8148f71fa795e0a6607de37c49ea5915a ] Syzbot found a GPF in reweight_entity. This has been bisected to commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") There is a race between sched_post_fork() and setpriority(PRIO_PGRP) within a thread group that causes a null-ptr-deref in reweight_entity() in CFS. The scenario is that the main process spawns number of new threads, which then call setpriority(PRIO_PGRP, 0, -20), wait, and exit. For each of the new threads the copy_process() gets invoked, which adds the new task_struct and calls sched_post_fork() for it. In the above scenario there is a possibility that setpriority(PRIO_PGRP) and set_one_prio() will be called for a thread in the group that is just being created by copy_process(), and for which the sched_post_fork() has not been executed yet. This will trigger a null pointer dereference in reweight_entity(), as it will try to access the run queue pointer, which hasn't been set. Before the mentioned change the cfs_rq pointer for the task has been set in sched_fork(), which is called much earlier in copy_process(), before the new task is added to the thread_group. Now it is done in the sched_post_fork(), which is called after that. To fix the issue the remove the update_load param from the update_load param() function and call reweight_task() only if the task flag doesn't have the TASK_NEW flag set. Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") Reported-by: syzbot+af7a719bc92395ee41b3@syzkaller.appspotmail.com Signed-off-by: Tadeusz Struk Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Dietmar Eggemann Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20220203161846.1160750-1-tadeusz.struk@linaro.org Signed-off-by: Sasha Levin

sched: Avoid double preemption in __cond_resched_lock()

2022-02-16T11:56:11Z

[ Upstream commit 7e406d1ff39b8ee574036418a5043c86723170cf ] For PREEMPT/DYNAMIC_PREEMPT the *_unlock() will already trigger a preemption, no point in then calling preempt_schedule_common() *again*. Use _cond_resched() instead, since this is a NOP for the preemptible configs while it provide a preemption point for the others. Reported-by: xuhaifeng Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/YcGnvDEYBwOiV0cR@hirez.programming.kicks-ass.net Signed-off-by: Sasha Levin

user/sven/linux.git/kernel/sched, branch v5.15.34

sched: Teach the forced-newidle balancer about CPU affinity limitation.

sched/rt: Plug rt_mutex_setprio() vs push_rt_task() race

sched/cpuacct: Fix charge percpu cpuusage

sched/fair: Improve consistency of allowed NUMA balance calculations

sched/uclamp: Fix iowait boost escaping uclamp restriction

sched/core: Export pelt_thermal_tp

sched/debug: Remove mpol_get/put and task_lock/unlock from sched_show_numa

sched: Fix yet more sched_fork() races

sched/fair: Fix fault in reweight_entity

sched: Avoid double preemption in __cond_resched_*lock*()

sched: Avoid double preemption in __cond_resched_lock()