<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/kernel/sched/psi.c, branch v6.12.71</title>
<subtitle>Linux Kernel</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.12.71</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.12.71'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2025-08-15T10:14:03Z</updated>
<entry>
<title>sched/psi: Fix psi_seq initialization</title>
<updated>2025-08-15T10:14:03Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2025-07-15T19:11:14Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3df959fd51d6b58805f15f21b78ba57bf6b07dfc'/>
<id>urn:sha1:3df959fd51d6b58805f15f21b78ba57bf6b07dfc</id>
<content type='text'>
[ Upstream commit 99b773d720aeea1ef2170dce5fcfa80649e26b78 ]

With the seqcount moved out of the group into a global psi_seq,
re-initializing the seqcount on group creation is causing seqcount
corruption.
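The failure mode can be modelled in a few lines; this toy sketch (illustrative names, not the kernel's seqcount API) shows how re-zeroing a seqcount while a writer holds it lets a reader accept a torn update without retrying:

```python
class Seqcount:
    def __init__(self):
        self.count = 0                   # even: no writer; odd: write in flight

def read_begin(seq):
    return seq.count

def read_retry(seq, start):
    # a reader must retry if it began during a write (odd count)
    # or if a write completed since read_begin (count changed)
    return start % 2 == 1 or seq.count != start

psi_seq = Seqcount()
psi_seq.count += 1                       # write_seqcount_begin(): count odd, busy

# The bug: creating a new psi group re-ran the init on the now-global
# psi_seq, resetting it in the middle of a write
psi_seq.__init__()                       # count back to 0, looks idle

start = read_begin(psi_seq)              # reader starts, sees an even count
# ... reader copies state here, racing with the still-active writer ...
print(read_retry(psi_seq, start))        # prints False: torn read accepted
```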

Fixes: 570c8efd5eb7 ("sched/psi: Optimize psi_group_change() cpu_clock() usage")
Reported-by: Chris Mason &lt;clm@meta.com&gt;
Suggested-by: Beata Michalska &lt;beata.michalska@arm.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/psi: Optimize psi_group_change() cpu_clock() usage</title>
<updated>2025-08-15T10:13:42Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2025-05-23T15:28:00Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c1cbee3aae2ab1906f0aab5f193b2f8c59c4580e'/>
<id>urn:sha1:c1cbee3aae2ab1906f0aab5f193b2f8c59c4580e</id>
<content type='text'>
[ Upstream commit 570c8efd5eb79c3725ba439ce105ed1bedc5acd9 ]

Dietmar reported that commit 3840cbe24cf0 ("sched: psi: fix bogus
pressure spikes from aggregation race") caused a regression for him on
a high context switch rate benchmark (schbench) due to the now
repeating cpu_clock() calls.

In particular the problem is that get_recent_times() will extrapolate
the current state to 'now'. But if an update uses a timestamp from
before the start of the update, it is possible to get two reads
with inconsistent results. It is effectively back-dating an update.

(note that this all hard-relies on the clock being synchronized across
CPUs -- if this is not the case, all bets are off).

Combined with the fact that there are per-group-per-cpu seqcounts, the
commit in question pushed the clock read into the group iteration,
causing one cpu_clock() call per level of tree depth. On architectures
where cpu_clock() has appreciable overhead, this hurts.

Instead move to a per-cpu seqcount, which allows us to have a single
clock read for all group updates, increasing internal consistency and
lowering update overhead. This comes at the cost of a longer update
side (proportional to the tree depth) which can cause the read side to
retry more often.
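As a rough userspace model (hypothetical names, not kernel code), the before/after difference in clock reads over an ancestor walk looks like this:

```python
calls = {'cpu_clock': 0}                 # count clock reads

def cpu_clock():
    calls['cpu_clock'] += 1
    return calls['cpu_clock']            # stand-in monotonic timestamp

def old_group_change(groups):
    # before: clock read pushed into the group iteration (tree-depth reads)
    for g in groups:
        g['t'] = cpu_clock()

def new_group_change(groups, seq):
    # after: one per-cpu seqcount, one clock read for every group on the walk
    seq['count'] += 1                    # write_seqcount_begin()
    now = cpu_clock()
    for g in groups:
        g['t'] = now                     # every level sees a consistent 'now'
    seq['count'] += 1                    # write_seqcount_end()

path = [{} for _ in range(3)]            # leaf -> parent -> root
old_group_change(path)
old_reads = calls['cpu_clock']           # 3 reads, one per level

calls['cpu_clock'] = 0
new_group_change(path, {'count': 0})
new_reads = calls['cpu_clock']           # 1 read for the whole walk

print(old_reads, new_reads)              # prints: 3 1
```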

Fixes: 3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race")
Reported-by: Dietmar Eggemann &lt;dietmar.eggemann@arm.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Tested-by: Dietmar Eggemann &lt;dietmar.eggemann@arm.com&gt;
Link: https://lkml.kernel.org/r/20250522084844.GC31726@noisy.programming.kicks-ass.net
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched: psi: fix bogus pressure spikes from aggregation race</title>
<updated>2024-10-03T23:03:16Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2024-10-03T11:29:05Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3840cbe24cf060ea05a585ca497814609f5d47d1'/>
<id>urn:sha1:3840cbe24cf060ea05a585ca497814609f5d47d1</id>
<content type='text'>
Brandon reports sporadic, non-sensical spikes in cumulative pressure
time (total=) when reading cpu.pressure at a high rate. This is due to
a race condition between reader aggregation and tasks changing states.

While it affects all states and all resources captured by PSI, in
practice it most likely triggers with CPU pressure, since scheduling
events are so frequent compared to other resource events.

The race context is the live snooping of ongoing stalls during a
pressure read. The read aggregates per-cpu records for stalls that
have concluded, but will also incorporate ad-hoc the duration of any
active state that hasn't been recorded yet. This is important to get
timely measurements of ongoing stalls. Those ad-hoc samples are
calculated on-the-fly up to the current time on that CPU; since the
stall hasn't concluded, it's expected that this is the minimum amount
of stall time that will enter the per-cpu records once it does.

The problem is that the path that concludes the state uses a CPU clock
read that is not synchronized against aggregators; the clock is read
outside of the seqlock protection. This allows aggregators to race and
snoop a stall with a longer duration than will actually be recorded.

With the recorded stall time being less than the last snapshot
remembered by the aggregator, a subsequent sample will underflow and
observe a bogus delta value, resulting in an erratic jump in pressure.

Fix this by moving the clock read of the state change into the seqlock
protection. This ensures no aggregation can snoop live stalls past the
time that's recorded when the state concludes.
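The underflow can be reproduced with plain numbers; the values below are hypothetical and the helper is only a stand-in for the aggregator's delta computation:

```python
U64 = 2**64                              # stall totals are u64 in the kernel

def sample_delta(recorded_total, last_snapshot):
    # the aggregator's increment since its last snapshot; on a u64 a
    # "negative" difference wraps instead of going below zero
    return (recorded_total - last_snapshot) % U64

# Racy ordering modelled with made-up numbers: the aggregator snoops a
# live stall up to now = 110ns, but the concluding path stamped its
# clock BEFORE the aggregation, so only 100ns ever enters the record.
snooped_snapshot = 110                   # concluded (90) + ad-hoc live (20)
recorded_total   = 100                   # what the state change writes back

bogus = sample_delta(recorded_total, snooped_snapshot)
print(bogus)                             # prints 18446744073709551606
```

The wrapped delta is the erratic jump in pressure the report describes.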

Reported-by: Brandon Duffany &lt;brandon@buildbuddy.io&gt;
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219194
Link: https://lore.kernel.org/lkml/20240827121851.GB438928@cmpxchg.org/
Fixes: df77430639c9 ("psi: Reduce calls to sched_clock() in psi")
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reviewed-by: Chengming Zhou &lt;chengming.zhou@linux.dev&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge branch 'sched/urgent' into sched/core, to pick up fixes and refresh the branch</title>
<updated>2024-07-11T08:42:33Z</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@kernel.org</email>
</author>
<published>2024-07-11T08:42:33Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=011b1134b82c2750d83a299a1369c678845de45a'/>
<id>urn:sha1:011b1134b82c2750d83a299a1369c678845de45a</id>
<content type='text'>
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/psi: Optimise psi_group_change a bit</title>
<updated>2024-07-04T13:59:52Z</updated>
<author>
<name>Tvrtko Ursulin</name>
<email>tursulin@ursulin.net</email>
</author>
<published>2024-06-25T13:50:00Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=0ec208ce9834929769ace0e2a506106192d08069'/>
<id>urn:sha1:0ec208ce9834929769ace0e2a506106192d08069</id>
<content type='text'>
The current code loops over the psi_states only to call a helper which
then resolves back to the action needed for each state using a switch
statement. That effectively creates a double indirection which, given
that all the states need to be explicitly listed and handled anyway, we
can simply remove: both the for loop and the switch statement.

The benefit is both in the code size and CPU time spent in this function.
YMMV but on my Steam Deck, while in a game, the patch makes the CPU usage
go from ~2.4% down to ~1.2%. Text size at the same time went from 0x323 to
0x2c1.

Signed-off-by: Tvrtko Ursulin &lt;tursulin@ursulin.net&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Chengming Zhou &lt;chengming.zhou@linux.dev&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Link: https://lkml.kernel.org/r/20240625135000.38652-1-tursulin@igalia.com
</content>
</entry>
<entry>
<title>sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath</title>
<updated>2024-07-01T11:01:44Z</updated>
<author>
<name>John Stultz</name>
<email>jstultz@google.com</email>
</author>
<published>2024-06-18T21:58:55Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ddae0ca2a8fe12d0e24ab10ba759c3fbd755ada8'/>
<id>urn:sha1:ddae0ca2a8fe12d0e24ab10ba759c3fbd755ada8</id>
<content type='text'>
It was reported that in moving to 6.1, a larger than 10%
regression was seen in the performance of
clock_gettime(CLOCK_THREAD_CPUTIME_ID,...).

Using a simple reproducer, I found:
5.10:
100000000 calls in 24345994193 ns =&gt; 243.460 ns per call
100000000 calls in 24288172050 ns =&gt; 242.882 ns per call
100000000 calls in 24289135225 ns =&gt; 242.891 ns per call

6.1:
100000000 calls in 28248646742 ns =&gt; 282.486 ns per call
100000000 calls in 28227055067 ns =&gt; 282.271 ns per call
100000000 calls in 28177471287 ns =&gt; 281.775 ns per call

The cause of this was finally narrowed down to the addition of
psi_account_irqtime() in update_rq_clock_task(), in commit
52b1364ba0b1 ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ
pressure").

In my initial attempt to resolve this, I leaned towards moving
all accounting work out of the clock_gettime() call path, but it
wasn't very pretty, so it will have to wait for a later deeper
rework. Instead, Peter shared this approach:

Rework psi_account_irqtime() to use its own psi_irq_time base
for accounting, and move it out of the hotpath, calling it
instead from sched_tick() and __schedule().

In testing this, we found the importance of ensuring
psi_account_irqtime() is run under the rq_lock, which Johannes
Weiner helpfully explained, so also add some lockdep annotations
to make that requirement clear.

With this change the performance is back in-line with 5.10:
6.1+fix:
100000000 calls in 24297324597 ns =&gt; 242.973 ns per call
100000000 calls in 24318869234 ns =&gt; 243.189 ns per call
100000000 calls in 24291564588 ns =&gt; 242.916 ns per call
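The reproducer's shape is easy to rebuild; this sketch is not the original program, only a stand-in that times repeated thread-CPU-time clock reads and reports the ns-per-call figure quoted above (clock ids are Linux-specific):

```python
import time

def bench(calls=100_000):
    # Time repeated CLOCK_THREAD_CPUTIME_ID reads and report the
    # per-call cost in ns. The original reproducer is presumably a C
    # loop around clock_gettime(); this only mirrors its shape.
    start = time.perf_counter_ns()
    for _ in range(calls):
        time.clock_gettime_ns(time.CLOCK_THREAD_CPUTIME_ID)
    elapsed = time.perf_counter_ns() - start
    print(f"{calls} calls in {elapsed} ns => {elapsed / calls:.3f} ns per call")
    return elapsed / calls

bench()
```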

Reported-by: Jimmy Shiu &lt;jimmyshiu@google.com&gt;
Originally-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Signed-off-by: John Stultz &lt;jstultz@google.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Chengming Zhou &lt;chengming.zhou@linux.dev&gt;
Reviewed-by: Qais Yousef &lt;qyousef@layalina.io&gt;
Link: https://lore.kernel.org/r/20240618215909.4099720-1-jstultz@google.com
</content>
</entry>
<entry>
<title>sched: Fix spelling in comments</title>
<updated>2024-05-27T15:00:21Z</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@kernel.org</email>
</author>
<published>2024-05-27T14:54:52Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=402de7fc880fef055bc984957454b532987e9ad0'/>
<id>urn:sha1:402de7fc880fef055bc984957454b532987e9ad0</id>
<content type='text'>
Do a spell-checking pass.

Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/psi: Update poll =&gt; rtpoll in relevant comments</title>
<updated>2023-10-16T11:42:49Z</updated>
<author>
<name>Fan Yu</name>
<email>fan.yu9@zte.com.cn</email>
</author>
<published>2023-10-16T11:20:39Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=7b3d8df549390e797f883efa16224fa0dfe35e55'/>
<id>urn:sha1:7b3d8df549390e797f883efa16224fa0dfe35e55</id>
<content type='text'>
The PSI trigger code is now making a distinction between privileged and
unprivileged triggers, after the following commit:

 65457b74aa94 ("sched/psi: Rename existing poll members in preparation")

But some comments have not been modified along with the code, so they
need to be updated.

This will help readers better understand the code.

Signed-off-by: Fan Yu &lt;fan.yu9@zte.com.cn&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/r/202310161920399921184@zte.com.cn
</content>
</entry>
<entry>
<title>sched/psi: Bail out early from irq time accounting</title>
<updated>2023-10-13T07:56:29Z</updated>
<author>
<name>Haifeng Xu</name>
<email>haifeng.xu@shopee.com</email>
</author>
<published>2023-09-26T11:57:22Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=0c2924079f5a83ed715630680e338b3685a0bf7d'/>
<id>urn:sha1:0c2924079f5a83ed715630680e338b3685a0bf7d</id>
<content type='text'>
Bail out early when psi is disabled.

Signed-off-by: Haifeng Xu &lt;haifeng.xu@shopee.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Chengming Zhou &lt;zhouchengming@bytedance.com&gt;
Link: https://lore.kernel.org/r/20230926115722.467833-1-haifeng.xu@shopee.com
</content>
</entry>
<entry>
<title>sched/psi: Delete the 'update_total' function parameter from update_triggers()</title>
<updated>2023-10-11T21:08:09Z</updated>
<author>
<name>Yang Yang</name>
<email>yang.yang29@zte.com.cn</email>
</author>
<published>2023-10-10T08:45:43Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3657680f38cd7df413d665f2b2f38e9a78130d8b'/>
<id>urn:sha1:3657680f38cd7df413d665f2b2f38e9a78130d8b</id>
<content type='text'>
The 'update_total' parameter of update_triggers() is always true after the
previous commit:

  80cc1d1d5ee3 ("sched/psi: Avoid updating PSI triggers and -&gt;rtpoll_total when there are no state changes")

If the 'changed_states &amp; group-&gt;rtpoll_states' condition is true,
'new_stall' in update_triggers() will be true, and then 'update_total'
should also be true.

So update_total is redundant - remove it.

[ mingo: Changelog updates ]

Signed-off-by: Yang Yang &lt;yang.yang29@zte.com.cn&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/r/202310101645437859599@zte.com.cn
</content>
</entry>
</feed>
