<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/kernel/rcu/tree_plugin.h, branch v6.1.162</title>
<subtitle>Linux Kernel</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.1.162</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.1.162'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2025-08-28T14:25:55Z</updated>
<entry>
<title>rcu: Protect -&gt;defer_qs_iw_pending from data race</title>
<updated>2025-08-28T14:25:55Z</updated>
<author>
<name>Paul E. McKenney</name>
<email>paulmck@kernel.org</email>
</author>
<published>2025-04-24T23:49:53Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b5de8d80b5d049f051b95d9b1ee50ae4ab656124'/>
<id>urn:sha1:b5de8d80b5d049f051b95d9b1ee50ae4ab656124</id>
<content type='text'>
[ Upstream commit 90c09d57caeca94e6f3f87c49e96a91edd40cbfd ]

On kernels built with CONFIG_IRQ_WORK=y, when rcu_read_unlock() is
invoked within an interrupts-disabled region of code [1], it will invoke
rcu_read_unlock_special(), which uses an irq-work handler to force the
system to notice when the RCU read-side critical section actually ends.
That end won't happen until interrupts are enabled at the soonest.

In some kernels, such as those booted with rcutree.use_softirq=0, the
irq-work handler is used unconditionally.

The per-CPU rcu_data structure's -&gt;defer_qs_iw_pending field is
updated by the irq-work handler and is both read and updated by
rcu_read_unlock_special().  This resulted in the following KCSAN splat:

------------------------------------------------------------------------

BUG: KCSAN: data-race in rcu_preempt_deferred_qs_handler / rcu_read_unlock_special

read to 0xffff96b95f42d8d8 of 1 bytes by task 90 on cpu 8:
 rcu_read_unlock_special+0x175/0x260
 __rcu_read_unlock+0x92/0xa0
 rt_spin_unlock+0x9b/0xc0
 __local_bh_enable+0x10d/0x170
 __local_bh_enable_ip+0xfb/0x150
 rcu_do_batch+0x595/0xc40
 rcu_cpu_kthread+0x4e9/0x830
 smpboot_thread_fn+0x24d/0x3b0
 kthread+0x3bd/0x410
 ret_from_fork+0x35/0x40
 ret_from_fork_asm+0x1a/0x30

write to 0xffff96b95f42d8d8 of 1 bytes by task 88 on cpu 8:
 rcu_preempt_deferred_qs_handler+0x1e/0x30
 irq_work_single+0xaf/0x160
 run_irq_workd+0x91/0xc0
 smpboot_thread_fn+0x24d/0x3b0
 kthread+0x3bd/0x410
 ret_from_fork+0x35/0x40
 ret_from_fork_asm+0x1a/0x30

no locks held by irq_work/8/88.
irq event stamp: 200272
hardirqs last  enabled at (200272): [&lt;ffffffffb0f56121&gt;] finish_task_switch+0x131/0x320
hardirqs last disabled at (200271): [&lt;ffffffffb25c7859&gt;] __schedule+0x129/0xd70
softirqs last  enabled at (0): [&lt;ffffffffb0ee093f&gt;] copy_process+0x4df/0x1cc0
softirqs last disabled at (0): [&lt;0000000000000000&gt;] 0x0

------------------------------------------------------------------------

The problem is that irq-work handlers run with interrupts enabled, which
means that rcu_preempt_deferred_qs_handler() could be interrupted,
and that interrupt handler might contain an RCU read-side critical
section, which might invoke rcu_read_unlock_special().  In the strict
KCSAN mode of operation used by RCU, this constitutes a data race on
the -&gt;defer_qs_iw_pending field.

This commit therefore disables interrupts across the portion of the
rcu_preempt_deferred_qs_handler() that updates the -&gt;defer_qs_iw_pending
field.  This suffices because this handler is not a fast path.
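
For reference, a minimal sketch of the shape of such a fix (simplified,
not the literal backported diff):

	static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
	{
		unsigned long flags;
		struct rcu_data *rdp = container_of(iwp, struct rcu_data,
						    defer_qs_iw);

		/* Keep interrupt handlers from racing with this update. */
		local_irq_save(flags);
		rdp-&gt;defer_qs_iw_pending = false;
		local_irq_restore(flags);
	}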

Signed-off-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Reviewed-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Signed-off-by: Neeraj Upadhyay (AMD) &lt;neeraj.upadhyay@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>rcu: handle unstable rdp in rcu_read_unlock_strict()</title>
<updated>2025-06-04T12:40:16Z</updated>
<author>
<name>Ankur Arora</name>
<email>ankur.a.arora@oracle.com</email>
</author>
<published>2024-12-13T04:06:55Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=e2df1936c126eb87033cbe4a60bc2786d9c06aa8'/>
<id>urn:sha1:e2df1936c126eb87033cbe4a60bc2786d9c06aa8</id>
<content type='text'>
[ Upstream commit fcf0e25ad4c8d14d2faab4d9a17040f31efce205 ]

rcu_read_unlock_strict() can be called with preemption enabled,
which can make for an unstable rdp and a racy norm value.

Fix this by dropping the preempt-count in __rcu_read_unlock()
after the call to rcu_read_unlock_strict(), adjusting the
preempt-count check appropriately.
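
A simplified sketch of the resulting ordering (illustrative, not the
literal diff; the adjusted preempt-count check itself is not shown):

	static inline void __rcu_read_unlock(void)
	{
		/*
		 * Report the strict-GP quiescent state while the read-side
		 * preempt count is still held, so that the rdp and the norm
		 * decision are computed on a stable CPU.
		 */
		if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
			rcu_read_unlock_strict();
		preempt_enable();
	}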

Suggested-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Signed-off-by: Ankur Arora &lt;ankur.a.arora@oracle.com&gt;
Reviewed-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Signed-off-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Signed-off-by: Boqun Feng &lt;boqun.feng@gmail.com&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y</title>
<updated>2025-06-04T12:40:16Z</updated>
<author>
<name>Ankur Arora</name>
<email>ankur.a.arora@oracle.com</email>
</author>
<published>2024-12-13T04:06:56Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6090e604285f214ba707cbbd4149567d6206e720'/>
<id>urn:sha1:6090e604285f214ba707cbbd4149567d6206e720</id>
<content type='text'>
[ Upstream commit 83b28cfe796464ebbde1cf7916c126da6d572685 ]

With PREEMPT_RCU=n, cond_resched() provides urgently needed quiescent
states for read-side critical sections via rcu_all_qs().
One reason why this was needed: lacking preempt-count, the tick
handler has no way of knowing whether it is executing in a
read-side critical section or not.

With (PREEMPT_LAZY=y, PREEMPT_DYNAMIC=n), we get (PREEMPT_COUNT=y,
PREEMPT_RCU=n). In this configuration cond_resched() is a stub and
does not provide quiescent states via rcu_all_qs().
(PREEMPT_RCU=y provides this information via rcu_read_unlock() and
its nesting counter.)

So, use the availability of preempt_count() to report quiescent states
in rcu_flavor_sched_clock_irq().
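
A rough sketch of the idea (illustrative only; the real function does
more than this):

	static void rcu_flavor_sched_clock_irq(int user)
	{
		/*
		 * With PREEMPT_COUNT=y, the tick can tell whether it
		 * interrupted a preempt- or bh-disabled region.  If it did
		 * not, then for PREEMPT_RCU=n this is a quiescent state.
		 */
		if (user || rcu_is_cpu_rrupt_from_idle() ||
		    (IS_ENABLED(CONFIG_PREEMPT_COUNT) &amp;&amp;
		     !(preempt_count() &amp; (PREEMPT_MASK | SOFTIRQ_MASK))))
			rcu_qs();
	}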

Suggested-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Reviewed-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Signed-off-by: Ankur Arora &lt;ankur.a.arora@oracle.com&gt;
Reviewed-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Signed-off-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Signed-off-by: Boqun Feng &lt;boqun.feng@gmail.com&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>rcu: Mark additional concurrent load from -&gt;cpu_no_qs.b.exp</title>
<updated>2023-07-27T06:50:33Z</updated>
<author>
<name>Paul E. McKenney</name>
<email>paulmck@kernel.org</email>
</author>
<published>2023-04-07T23:05:38Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=896f4d6046b3d5faa84afd0c85f266251e5b3cbf'/>
<id>urn:sha1:896f4d6046b3d5faa84afd0c85f266251e5b3cbf</id>
<content type='text'>
[ Upstream commit 9146eb25495ea8bfb5010192e61e3ed5805ce9ef ]

The per-CPU rcu_data structure's -&gt;cpu_no_qs.b.exp field is updated
only on the instance corresponding to the current CPU, but can be read
more widely.  Unmarked accesses are OK from the corresponding CPU, but
only if interrupts are disabled, given that interrupt handlers can and
do modify this field.

Unfortunately, although the load from rcu_preempt_deferred_qs() is always
carried out from the corresponding CPU, interrupts are not necessarily
disabled.  This commit therefore upgrades this load to READ_ONCE.

Similarly, the diagnostic access from synchronize_rcu_expedited_wait()
might run with interrupts disabled and from some other CPU.  This commit
therefore marks this load with data_race().

Finally, the C-language access in rcu_preempt_ctxt_queue() is OK as
is because interrupts are disabled and this load is always from the
corresponding CPU.  This commit adds a comment giving the rationale for
this access being safe.
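
The two marking idioms described above look roughly like this
(illustrative fragments, not the literal diff):

	/* rcu_preempt_deferred_qs(): same-CPU load, but interrupts may be
	 * enabled, so mark the load for KCSAN. */
	bool exp = READ_ONCE(rdp-&gt;cpu_no_qs.b.exp);

	/* synchronize_rcu_expedited_wait(): diagnostic-only load that may
	 * come from another CPU, where any racy value is acceptable. */
	bool exp_diag = data_race(rdp-&gt;cpu_no_qs.b.exp);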

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge branches 'doc.2022.08.31b', 'fixes.2022.08.31b', 'kvfree.2022.08.31b', 'nocb.2022.09.01a', 'poll.2022.08.31b', 'poll-srcu.2022.08.31b' and 'tasks.2022.08.31b' into HEAD</title>
<updated>2022-09-01T17:55:57Z</updated>
<author>
<name>Paul E. McKenney</name>
<email>paulmck@kernel.org</email>
</author>
<published>2022-09-01T17:55:57Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=5c0ec4900497f7c9cc12f393c329a52e67bc6b8b'/>
<id>urn:sha1:5c0ec4900497f7c9cc12f393c329a52e67bc6b8b</id>
<content type='text'>
doc.2022.08.31b: Documentation updates
fixes.2022.08.31b: Miscellaneous fixes
kvfree.2022.08.31b: kvfree_rcu() updates
nocb.2022.09.01a: NOCB CPU updates
poll.2022.08.31b: Full-oldstate RCU polling grace-period API
poll-srcu.2022.08.31b: Polled SRCU grace-period updates
tasks.2022.08.31b: Tasks RCU updates
</content>
</entry>
<entry>
<title>rcu-tasks: Make RCU Tasks Trace check for userspace execution</title>
<updated>2022-08-31T12:10:55Z</updated>
<author>
<name>Zqiang</name>
<email>qiang1.zhang@intel.com</email>
</author>
<published>2022-07-19T04:39:00Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=528262f50274079740b53e29bcaaabf219aa7417'/>
<id>urn:sha1:528262f50274079740b53e29bcaaabf219aa7417</id>
<content type='text'>
Userspace execution is a valid quiescent state for RCU Tasks Trace,
but the scheduling-clock interrupt does not currently report such
quiescent states.

Of course, the scheduling-clock interrupt is not strictly speaking
userspace execution.  However, the only way that this code is not
in a quiescent state is if something invoked rcu_read_lock_trace(),
and that would be reflected in the -&gt;trc_reader_nesting field in
the task_struct structure.  Furthermore, this field is checked by
rcu_tasks_trace_qs(), which is invoked by rcu_tasks_qs(), which is in
turn invoked by rcu_note_voluntary_context_switch() in kernels building
at least one of the RCU Tasks flavors.  It is therefore safe to invoke
rcu_tasks_trace_qs() from rcu_sched_clock_irq().

But rcu_tasks_qs() also invokes rcu_tasks_classic_qs() for RCU
Tasks, which lacks the read-side markers provided by RCU Tasks Trace.
This raises the possibility that an RCU Tasks grace period could start
after the interrupt from userspace execution, but before the call to
rcu_sched_clock_irq().  However, it turns out that this is safe because
the RCU Tasks grace period waits for an RCU grace period, which will
wait for the entire scheduling-clock interrupt handler, including any
RCU Tasks read-side critical section that this handler might contain.

This commit therefore updates the rcu_sched_clock_irq() function's
check for usermode execution and its call to rcu_tasks_classic_qs()
to instead check for both usermode execution and interrupt from idle,
and to instead call rcu_note_voluntary_context_switch().  This
consolidates code and provides faster RCU Tasks Trace reporting of
quiescent states in kernels that take scheduling-clock interrupts
during userspace execution.
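
Reduced to the check in question, the consolidated logic is roughly as
follows (illustrative; rcu_sched_clock_irq() of course does much more):

	void rcu_sched_clock_irq(int user)
	{
		/*
		 * Both usermode execution and an interrupt taken from idle
		 * are quiescent states for the RCU Tasks flavors, so report
		 * them through the common hook.
		 */
		if (user || rcu_is_cpu_rrupt_from_idle())
			rcu_note_voluntary_context_switch(current);
	}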

[ paulmck: Consolidate checks into rcu_sched_clock_irq(). ]

Signed-off-by: Zqiang &lt;qiang1.zhang@intel.com&gt;
Signed-off-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
</content>
</entry>
<entry>
<title>rcu: Exclude outgoing CPU when it is the last to leave</title>
<updated>2022-08-31T12:06:03Z</updated>
<author>
<name>Paul E. McKenney</name>
<email>paulmck@kernel.org</email>
</author>
<published>2022-08-24T21:46:56Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=7634b1eaa0cd135d5eedadb04ad3c91b1ecf28a9'/>
<id>urn:sha1:7634b1eaa0cd135d5eedadb04ad3c91b1ecf28a9</id>
<content type='text'>
The rcu_boost_kthread_setaffinity() function removes the outgoing CPU
from the set_cpus_allowed() mask for the corresponding leaf rcu_node
structure's rcub priority-boosting kthread.  Except that if the outgoing
CPU will leave that structure without any online CPUs, the mask is set
to the housekeeping CPU mask from housekeeping_cpumask().  Which is fine
unless the outgoing CPU happens to be a housekeeping CPU.

This commit therefore removes the outgoing CPU from the housekeeping mask.
This would of course be problematic if the outgoing CPU was the last
online housekeeping CPU, but in that case you are in a world of hurt
anyway.  If someone comes up with a valid use case for a system needing
all the housekeeping CPUs to be offline, further adjustments can be made.
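
A sketch of the fallback path with this fix applied (simplified; the
HK_TYPE_RCU housekeeping type is assumed here):

	/* Fall back to the housekeeping CPUs, but never include the CPU
	 * that is on its way out. */
	if (cpumask_empty(cm)) {
		cpumask_copy(cm, housekeeping_cpumask(HK_TYPE_RCU));
		if (outgoingcpu &gt;= 0)
			cpumask_clear_cpu(outgoingcpu, cm);
	}
	set_cpus_allowed_ptr(t, cm);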

Signed-off-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
</content>
</entry>
<entry>
<title>rcu: Avoid triggering strict-GP irq-work when RCU is idle</title>
<updated>2022-08-31T12:06:02Z</updated>
<author>
<name>Zqiang</name>
<email>qiang1.zhang@intel.com</email>
</author>
<published>2022-08-08T02:26:26Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=621189a1fe93cb2b34d62c5cdb9e258bca044813'/>
<id>urn:sha1:621189a1fe93cb2b34d62c5cdb9e258bca044813</id>
<content type='text'>
Kernels built with PREEMPT_RCU=y and RCU_STRICT_GRACE_PERIOD=y trigger
irq-work from rcu_read_unlock(), and the resulting irq-work handler
invokes rcu_preempt_deferred_qs_handler().  The point of this triggering
is to force grace periods to end quickly in order to give tools like KASAN
a better chance of detecting RCU usage bugs such as leaking RCU-protected
pointers out of an RCU read-side critical section.

However, this irq-work triggering is unconditional.  This works, but
there is no point in doing this irq-work unless the current grace period
is waiting on the running CPU or task, which is not the common case.
After all, in the common case there are many rcu_read_unlock() calls
per CPU per grace period.

This commit therefore triggers the irq-work only when the current grace
period is waiting on the running CPU or task.
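
The gating condition ends up roughly of this shape (approximate, not
the literal diff):

	/* Count the strict-GP case toward expediting only if the current
	 * grace period is still waiting on this CPU (its bit remains set
	 * in the leaf rcu_node structure's -&gt;qsmask) or on this task
	 * (it is queued on a -&gt;blkd_tasks list). */
	expboost = (t-&gt;rcu_blocked_node &amp;&amp;
		    READ_ONCE(t-&gt;rcu_blocked_node-&gt;exp_tasks)) ||
		   (rdp-&gt;grpmask &amp; READ_ONCE(rnp-&gt;expmask)) ||
		   (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) &amp;&amp;
		    ((rdp-&gt;grpmask &amp; READ_ONCE(rnp-&gt;qsmask)) ||
		     t-&gt;rcu_blocked_node)) ||
		   (IS_ENABLED(CONFIG_RCU_BOOST) &amp;&amp; irqs_were_disabled &amp;&amp;
		    t-&gt;rcu_blocked_node);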

This change was tested as follows on a four-CPU system:

	echo rcu_preempt_deferred_qs_handler &gt; /sys/kernel/debug/tracing/set_ftrace_filter
	echo 1 &gt; /sys/kernel/debug/tracing/function_profile_enabled
	insmod rcutorture.ko
	sleep 20
	rmmod rcutorture.ko
	echo 0 &gt; /sys/kernel/debug/tracing/function_profile_enabled
	echo &gt; /sys/kernel/debug/tracing/set_ftrace_filter

This procedure produces results in this per-CPU set of files:

	/sys/kernel/debug/tracing/trace_stat/function*

Sample output from one of these files is as follows:

  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---
  rcu_preempt_deferred_qs_handle      838746    182650.3 us     0.217 us        0.004 us

The baseline sum of the "Hit" values (the number of calls to this
function) was 3,319,015.  With this commit, that sum was 1,140,359,
for a 2.9x reduction.  The worst-case variance across the CPUs was less
than 25%, so this large effect size is statistically significant.

The raw data is available in the Link: URL.

Link: https://lore.kernel.org/all/20220808022626.12825-1-qiang1.zhang@intel.com/
Signed-off-by: Zqiang &lt;qiang1.zhang@intel.com&gt;
Signed-off-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
</content>
</entry>
<entry>
<title>rcu: Document reason for rcu_all_qs() call to preempt_disable()</title>
<updated>2022-08-31T12:03:14Z</updated>
<author>
<name>Paul E. McKenney</name>
<email>paulmck@kernel.org</email>
</author>
<published>2022-08-03T15:48:12Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=089254fd386eb6800dd7d7863f12a04ada0c35fa'/>
<id>urn:sha1:089254fd386eb6800dd7d7863f12a04ada0c35fa</id>
<content type='text'>
Given that rcu_all_qs() is in non-preemptible kernels, why on earth should
it invoke preempt_disable()?  This commit adds the reason, which is to
work nicely with debugging enabled in CONFIG_PREEMPT_COUNT=y kernels.
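
The added rationale boils down to something like the following (wording
illustrative, with the function body reduced to a placeholder):

	void rcu_all_qs(void)
	{
		/*
		 * rcu_all_qs() exists only in non-preemptible kernels, so
		 * preempt_disable() is not needed for correctness here; it
		 * keeps CONFIG_PREEMPT_COUNT=y debug checks happy and pairs
		 * with the preempt_enable() below.
		 */
		preempt_disable();
		/* ... report the quiescent state ... */
		preempt_enable();
	}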

Reported-by: Neeraj Upadhyay &lt;quic_neeraju@quicinc.com&gt;
Reported-by: Boqun Feng &lt;boqun.feng@gmail.com&gt;
Reported-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Signed-off-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
</content>
</entry>
<entry>
<title>rcu: Update rcu_preempt_deferred_qs() comments for !PREEMPT kernels</title>
<updated>2022-08-31T12:03:14Z</updated>
<author>
<name>Zqiang</name>
<email>qiang1.zhang@intel.com</email>
</author>
<published>2022-06-20T06:42:24Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=bca4fa8cb0f4c096b515952f64e560fd784a0514'/>
<id>urn:sha1:bca4fa8cb0f4c096b515952f64e560fd784a0514</id>
<content type='text'>
In non-preemptible kernels, tasks never do context switches within
RCU read-side critical sections.  Therefore, in such kernels, each
leaf rcu_node structure's -&gt;blkd_tasks list will always be empty.
The comment on the non-preemptible version of rcu_preempt_deferred_qs()
confuses this point, so this commit fixes it.

Signed-off-by: Zqiang &lt;qiang1.zhang@intel.com&gt;
Signed-off-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
</content>
</entry>
</feed>
