<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/kernel/sched, branch v5.4.113</title>
<subtitle>Linux Kernel</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.4.113</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.4.113'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2021-03-17T16:03:57Z</updated>
<entry>
<title>sched/membarrier: fix missing local execution of ipi_sync_rq_state()</title>
<updated>2021-03-17T16:03:57Z</updated>
<author>
<name>Mathieu Desnoyers</name>
<email>mathieu.desnoyers@efficios.com</email>
</author>
<published>2021-02-17T16:56:51Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=907f7f2cf0ff7739350c02f4ccdd8233828e4beb'/>
<id>urn:sha1:907f7f2cf0ff7739350c02f4ccdd8233828e4beb</id>
<content type='text'>
commit ce29ddc47b91f97e7f69a0fb7cbb5845f52a9825 upstream.

The function sync_runqueues_membarrier_state() should copy the
membarrier state from the @mm received as a parameter to each runqueue
currently running a task using that mm.

However, the use of smp_call_function_many() skips the current runqueue,
which is unintended. Replace it with a call to on_each_cpu_mask().
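
A minimal sketch of the substitution, assuming a cpumask of CPUs
running tasks of @mm has already been computed (names follow the
upstream patch; the actual diff may differ in detail):

	/* Before: the calling CPU is skipped, so its rq state stays stale. */
	smp_call_function_many(tmpmask, ipi_sync_rq_state, mm, 1);

	/* After: ipi_sync_rq_state() also runs locally if this CPU is set. */
	on_each_cpu_mask(tmpmask, ipi_sync_rq_state, mm, true);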

Fixes: 227a4aadc75b ("sched/membarrier: Fix p-&gt;mm-&gt;membarrier_state racy load")
Reported-by: Nadav Amit &lt;nadav.amit@gmail.com&gt;
Signed-off-by: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: stable@vger.kernel.org # 5.4.x+
Link: https://lore.kernel.org/r/74F1E842-4A84-47BF-B6C2-5407DFDD4A4A@gmail.com
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>sched/features: Fix hrtick reprogramming</title>
<updated>2021-03-07T11:20:47Z</updated>
<author>
<name>Juri Lelli</name>
<email>juri.lelli@redhat.com</email>
</author>
<published>2021-02-08T07:35:53Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=99d2926531ac2cfa95cb1b0ed9302c5e2c62be46'/>
<id>urn:sha1:99d2926531ac2cfa95cb1b0ed9302c5e2c62be46</id>
<content type='text'>
[ Upstream commit 156ec6f42b8d300dbbf382738ff35c8bad8f4c3a ]

Hung tasks and RCU stall cases were reported on systems which were not
100% busy. Investigation of such unexpected cases (no sign of potential
starvation caused by tasks hogging the system) pointed out that the
periodic sched tick timer wasn't serviced anymore after a certain point
and that caused all machinery that depends on it (timers, RCU, etc.) to
stop working as well. This issue, however, was only reproducible if
HRTICK was enabled.

Looking at core dumps, it was found that the rbtree of the hrtimer base
also used for the hrtick was corrupted (i.e. the next timer as seen from
the base root and the actual leftmost node obtained by traversing the
tree differ). The same base is also used for the periodic tick hrtimer,
which might get "lost" if the rbtree gets corrupted.

Much like what is described in commit 1f71addd34f4c ("tick/sched: Do not
mess with an enqueued hrtimer"), there is a race window between
hrtimer_set_expires() in hrtick_start() and hrtimer_start_expires() in
__hrtick_restart() in which the former might operate on an already
queued hrtick hrtimer, which might lead to corruption of the base.

Use hrtick_start() (which removes the timer before enqueuing it back) to
ensure hrtick hrtimer reprogramming is entirely guarded by the base
lock, so that no race conditions can occur.
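
A sketch of the resulting reprogramming path, assuming the helper and
field names from the upstream patch (rq-&gt;hrtick_time is introduced
there):

	static void __hrtick_restart(struct rq *rq)
	{
		struct hrtimer *timer = &amp;rq-&gt;hrtick_timer;
		ktime_t time = rq-&gt;hrtick_time;

		/*
		 * hrtimer_start() removes an already queued timer before
		 * enqueuing it back, entirely under the base lock, which
		 * closes the race with a concurrent hrtick_start().
		 */
		hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
	}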

Signed-off-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Signed-off-by: Luis Claudio R. Goncalves &lt;lgoncalv@redhat.com&gt;
Signed-off-by: Daniel Bristot de Oliveira &lt;bristot@redhat.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Link: https://lkml.kernel.org/r/20210208073554.14629-2-juri.lelli@redhat.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>rcu/nocb: Perform deferred wake up before last idle's need_resched() check</title>
<updated>2021-03-04T09:26:47Z</updated>
<author>
<name>Frederic Weisbecker</name>
<email>frederic@kernel.org</email>
</author>
<published>2021-01-31T23:05:45Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=337bba09d85020a94a7a1ac9e776381caaa54337'/>
<id>urn:sha1:337bba09d85020a94a7a1ac9e776381caaa54337</id>
<content type='text'>
commit 43789ef3f7d61aa7bed0cb2764e588fc990c30ef upstream.

Entering RCU idle mode may cause a deferred wake up of an RCU NOCB_GP
kthread (rcuog) to be serviced.

Usually a local wake up happening while running the idle task is handled
in one of the need_resched() checks carefully placed within the idle
loop that can break to the scheduler.

Unfortunately the call to rcu_idle_enter() is already beyond the last
generic need_resched() check and we may halt the CPU with a resched
request unhandled, leaving the task hanging.

Fix this by splitting the rcuog wakeup handling out of rcu_idle_enter()
and placing it before the last generic need_resched() check in the idle
loop. It is then assumed that no call to call_rcu() will be performed
after that point in the idle loop until the CPU is put in low power mode.
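
A sketch of the intended ordering in the idle loop, with
rcu_nocb_flush_deferred_wakeup() being the helper split out of
rcu_idle_enter() by this patch (surrounding code elided):

	while (!need_resched()) {
		...
		/*
		 * Flush deferred rcuog wakeups before the idle entry code
		 * runs its final need_resched() check; a wakeup raised here
		 * can still break out to the scheduler.
		 */
		rcu_nocb_flush_deferred_wakeup();
		...idle entry path, which eventually calls rcu_idle_enter()...
	}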

Fixes: 96d3fd0d315a ("rcu: Break call_rcu() deadlock involving scheduler and perf")
Reported-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Signed-off-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-3-frederic@kernel.org
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>sched/eas: Don't update misfit status if the task is pinned</title>
<updated>2021-03-04T09:26:23Z</updated>
<author>
<name>Qais Yousef</name>
<email>qais.yousef@arm.com</email>
</author>
<published>2021-01-19T12:07:55Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=aae14aed37d5ee106f0a7cba54ba10a9e9b43f34'/>
<id>urn:sha1:aae14aed37d5ee106f0a7cba54ba10a9e9b43f34</id>
<content type='text'>
[ Upstream commit 0ae78eec8aa64e645866e75005162603a77a0f49 ]

If the task is pinned to a cpu, setting the misfit status means that
we'll unnecessarily continuously attempt to migrate the task but fail.

This continuous failure will cause the balance_interval to increase to
a high value, and eventually cause significant, unnecessary delays in
balancing the system when a real imbalance happens.

Caught while testing uclamp, where the rt-app calibration loop was
pinned to cpu 0; shortly after that we spawned another task with a high
util_clamp value. The task was failing to migrate after over 40ms of
runtime because balance_interval had been unnecessarily expanded to a
very high value by the calibration loop.

Not done here, but it could be useful to extend the check for pinning to
verify that the affinity of the task has a cpu that fits. We could end
up in a similar situation otherwise.
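
A minimal sketch of the added guard in update_misfit_status(), assuming
the upstream form of the fix:

	static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
	{
		if (!static_branch_unlikely(&amp;sched_asym_cpucapacity))
			return;

		/* A pinned task cannot migrate, so never mark it misfit. */
		if (!p || p-&gt;nr_cpus_allowed == 1) {
			rq-&gt;misfit_task_load = 0;
			return;
		}
		...
	}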

Fixes: 3b1baa6496e6 ("sched/fair: Add 'group_misfit_task' load-balance type")
Signed-off-by: Qais Yousef &lt;qais.yousef@arm.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Quentin Perret &lt;qperret@google.com&gt;
Acked-by: Valentin Schneider &lt;valentin.schneider@arm.com&gt;
Link: https://lkml.kernel.org/r/20210119120755.2425264-1-qais.yousef@arm.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched: Reenable interrupts in do_sched_yield()</title>
<updated>2020-12-30T10:51:03Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2020-10-20T14:46:55Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=52f525f2bdc78acecc4ae0997d2032e50934c1af'/>
<id>urn:sha1:52f525f2bdc78acecc4ae0997d2032e50934c1af</id>
<content type='text'>
[ Upstream commit 345a957fcc95630bf5535d7668a59ed983eb49a7 ]

do_sched_yield() invokes schedule() with interrupts disabled, which is
not allowed. This goes back to the pre-git era, to commit a6efb709806c
("[PATCH] irqlock patch 2.5.27-H6") in the history tree.

Reenable interrupts and remove the misleading comment which "explains" it.
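
A sketch of the resulting unlock sequence in do_sched_yield(), assuming
the upstream shape of the fix:

	preempt_disable();
	rq_unlock_irq(rq, &amp;rf);	/* was rq_unlock(): now reenables interrupts */
	sched_preempt_enable_no_resched();

	schedule();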

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/87r1pt7y5c.fsf@nanos.tec.linutronix.de
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/deadline: Fix sched_dl_global_validate()</title>
<updated>2020-12-30T10:51:03Z</updated>
<author>
<name>Peng Liu</name>
<email>iwtbavbm@gmail.com</email>
</author>
<published>2020-10-08T15:49:42Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=35975f2e83a5a2ebdfe5ee81ab59beeefb2114e2'/>
<id>urn:sha1:35975f2e83a5a2ebdfe5ee81ab59beeefb2114e2</id>
<content type='text'>
[ Upstream commit a57415f5d1e43c3a5c5d412cd85e2792d7ed9b11 ]

When changing sched_rt_{runtime, period}_us, we validate that the new
settings at least accommodate the currently allocated -dl bandwidth:

  sched_rt_handler()
    --&gt;	sched_dl_bandwidth_validate()
	{
		new_bw = global_rt_runtime()/global_rt_period();

		for_each_possible_cpu(cpu) {
			dl_b = dl_bw_of(cpu);
			if (new_bw &lt; dl_b-&gt;total_bw)    &lt;-------
				ret = -EBUSY;
		}
	}

But under CONFIG_SMP, dl_bw is per root domain, not per CPU:
dl_b-&gt;total_bw is the allocated bandwidth of the whole root domain.
Instead, we should compare dl_b-&gt;total_bw against "cpus*new_bw",
where 'cpus' is the number of CPUs in the root domain.
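
A minimal sketch of the corrected comparison, assuming the upstream
helper dl_bw_cpus() that counts the CPUs of a root domain:

	for_each_possible_cpu(cpu) {
		...
		cpus = dl_bw_cpus(cpu);
		if (new_bw * cpus &lt; dl_b-&gt;total_bw)
			ret = -EBUSY;
		...
	}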

Also, the implementation implied by the annotation below (in
kernel/sched/sched.h) only appeared in SCHED_DEADLINE v2 [1]; the
deadline scheduler then kept evolving until it was merged (v9), while
the annotation remained unchanged, meaningless and misleading. Update it.

* With respect to SMP, the bandwidth is given on a per-CPU basis,
* meaning that:
*  - dl_bw (&lt; 100%) is the bandwidth of the system (group) on each CPU;
*  - dl_total_bw array contains, in the i-eth element, the currently
*    allocated bandwidth on the i-eth CPU.

[1]: https://lore.kernel.org/lkml/1267385230.13676.101.camel@Palantir/

Fixes: 332ac17ef5bf ("sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks")
Signed-off-by: Peng Liu &lt;iwtbavbm@gmail.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Daniel Bristot de Oliveira &lt;bristot@redhat.com&gt;
Acked-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Link: https://lkml.kernel.org/r/db6bbda316048cda7a1bbc9571defde193a8d67e.1602171061.git.iwtbavbm@gmail.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>membarrier: Explicitly sync remote cores when SYNC_CORE is requested</title>
<updated>2020-12-21T12:27:07Z</updated>
<author>
<name>Andy Lutomirski</name>
<email>luto@kernel.org</email>
</author>
<published>2020-12-04T05:07:05Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c4f90940701589acb67498fbf1626218dd6220b5'/>
<id>urn:sha1:c4f90940701589acb67498fbf1626218dd6220b5</id>
<content type='text'>
commit 758c9373d84168dc7d039cf85a0e920046b17b41 upstream.

membarrier() does not explicitly sync_core() remote CPUs; instead, it
relies on the assumption that an IPI will result in a core sync.  On x86,
this may be true in practice, but it's not architecturally reliable.  In
particular, the SDM and APM do not appear to guarantee that interrupt
delivery is serializing.  While IRET does serialize, IPI return can
schedule, thereby switching to another task in the same mm that was
sleeping in a syscall.  The new task could then SYSRET back to usermode
without ever executing IRET.

Make this more robust by explicitly calling sync_core_before_usermode()
on remote cores.  (This also helps people who search the kernel tree for
instances of sync_core() and sync_core_before_usermode() -- one might be
surprised that the core membarrier code doesn't currently show up in
such a search.)
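
A sketch of the remote IPI handler, assuming the upstream name
ipi_sync_core() (sync_core_before_usermode() is a pre-existing helper):

	static void ipi_sync_core(void *info)
	{
		/*
		 * Keep the barrier the IPI was assumed to provide, then
		 * force a core sync before the interrupted task can get
		 * back to usermode.
		 */
		smp_mb();
		sync_core_before_usermode();
	}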

Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski &lt;luto@kernel.org&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/776b448d5f7bd6b12690707f5ed67bcda7f1d427.1607058304.git.luto@kernel.org
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list</title>
<updated>2020-12-08T09:40:24Z</updated>
<author>
<name>Vincent Guittot</name>
<email>vincent.guittot@linaro.org</email>
</author>
<published>2020-12-03T00:44:23Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=294de8933adbdda78acaa3935971d26bb6de745e'/>
<id>urn:sha1:294de8933adbdda78acaa3935971d26bb6de745e</id>
<content type='text'>
[ Upstream commit 39f23ce07b9355d05a64ae303ce20d1c4b92b957 ]

Although not exactly identical, unthrottle_cfs_rq() and enqueue_task_fair()
are quite close and follow the same sequence for enqueuing an entity in the
cfs hierarchy. Modify unthrottle_cfs_rq() to use the same pattern as
enqueue_task_fair(). This fixes a problem already faced with the latter
and adds an optimization in the last for_each_sched_entity() loop.
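
A sketch of the shared enqueue pattern both functions now follow, with
names as in kernel/sched/fair.c (details elided):

	for_each_sched_entity(se) {
		if (se-&gt;on_rq)
			break;			/* ancestors already queued */
		enqueue_entity(cfs_rq_of(se), se, ENQUEUE_WAKEUP);
		...
	}
	for_each_sched_entity(se) {
		/* remaining ancestors: refresh load tracking only */
		update_load_avg(cfs_rq_of(se), se, UPDATE_TG);
		...
	}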

Fixes: fe61468b2cb ("sched/fair: Fix enqueue_task_fair warning")
Reported-by: Tao Zhou &lt;zohooouoto@zoho.com.cn&gt;
Signed-off-by: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Phil Auld &lt;pauld@redhat.com&gt;
Reviewed-by: Ben Segall &lt;bsegall@google.com&gt;
Link: https://lkml.kernel.org/r/20200513135528.4742-1-vincent.guittot@linaro.org
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/fair: Fix overutilized update in enqueue_task_fair()</title>
<updated>2020-11-24T12:29:24Z</updated>
<author>
<name>Quentin Perret</name>
<email>qperret@google.com</email>
</author>
<published>2020-11-12T11:12:01Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a88f08e78249f5979101684a3afd4aff5dec59e7'/>
<id>urn:sha1:a88f08e78249f5979101684a3afd4aff5dec59e7</id>
<content type='text'>
commit 8e1ac4299a6e8726de42310d9c1379f188140c71 upstream.

enqueue_task_fair() attempts to skip the overutilized update for new
tasks, as their util_avg is not accurate yet. However, the flag we check
to do so is overwritten earlier in the function, which makes the
condition pretty much a nop.

Fix this by saving the flag early on.
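
A minimal sketch of the fix, assuming the upstream variable name:

	/* At entry, before 'flags' is overwritten by the enqueue loops: */
	int task_new = !(flags &amp; ENQUEUE_WAKEUP);
	...
	/* Later, where 'flags' no longer holds the caller's value: */
	if (!task_new)
		update_overutilized_status(rq);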

Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
Reported-by: Rick Yiu &lt;rickyiu@google.com&gt;
Signed-off-by: Quentin Perret &lt;qperret@google.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Reviewed-by: Valentin Schneider &lt;valentin.schneider@arm.com&gt;
Link: https://lkml.kernel.org/r/20201112111201.2081902-1-qperret@google.com
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>sched/features: Fix !CONFIG_JUMP_LABEL case</title>
<updated>2020-10-29T08:58:00Z</updated>
<author>
<name>Juri Lelli</name>
<email>juri.lelli@redhat.com</email>
</author>
<published>2020-10-13T05:31:14Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=78e27678db4ea62377425df497c7421359888b1a'/>
<id>urn:sha1:78e27678db4ea62377425df497c7421359888b1a</id>
<content type='text'>
[ Upstream commit a73f863af4ce9730795eab7097fb2102e6854365 ]

Commit:

  765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds")

made sched features static for !CONFIG_SCHED_DEBUG configurations, but
overlooked the CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL cases.

For the latter, echoing changes to /sys/kernel/debug/sched_features has
the nasty effect of changing what sched_features reports without
actually changing the scheduler's behaviour (since different translation
units get different copies of sysctl_sched_features).

Fix CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL configurations by properly
restructuring ifdefs.
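
A simplified sketch of the resulting ifdef layering in
kernel/sched/sched.h (condensed from the upstream patch):

	#ifdef CONFIG_SCHED_DEBUG
	# ifdef CONFIG_JUMP_LABEL
	/* Patchable at runtime via static keys. */
	#  define sched_feat(x) (static_branch_##x(&amp;sched_feat_keys[__SCHED_FEAT_##x]))
	# else
	/* One shared, writable bitmask; no per-TU static copies. */
	#  define sched_feat(x) !!(sysctl_sched_features &amp; (1UL &lt;&lt; __SCHED_FEAT_##x))
	# endif
	#else /* !CONFIG_SCHED_DEBUG */
	/* Per-TU const copy, folded by the compiler at build time. */
	#  define sched_feat(x) !!(sysctl_sched_features &amp; (1UL &lt;&lt; __SCHED_FEAT_##x))
	#endif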

Fixes: 765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds")
Co-developed-by: Daniel Bristot de Oliveira &lt;bristot@redhat.com&gt;
Signed-off-by: Daniel Bristot de Oliveira &lt;bristot@redhat.com&gt;
Signed-off-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Acked-by: Patrick Bellasi &lt;patrick.bellasi@matbug.net&gt;
Reviewed-by: Valentin Schneider &lt;valentin.schneider@arm.com&gt;
Link: https://lore.kernel.org/r/20201013053114.160628-1-juri.lelli@redhat.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
</feed>
