diff options
| author | Linus Torvalds <torvalds@linux-foundation.org> | 2025-07-30 11:01:41 -0700 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2025-07-30 11:01:41 -0700 |
| commit | 2db4df0c09eeb209726261f43fc556360b38ec99 (patch) | |
| tree | ec2369ba858fb0ea6f11c90b81800f258587e733 /Documentation | |
| parent | 7dff275c663178e9a12a0c0038e4b3be2f3edcba (diff) | |
| parent | cc1d1365f0f414f6522378867baa997642a7e6b2 (diff) | |
Merge tag 'rcu.release.v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU updates from Neeraj Upadhyay:
"Expedited grace period updates:
- Protect against early RCU exp quiescent state reporting during exp
grace period initialization
- Remove superfluous barrier in task unblock path
- Remove the CPU online quiescent state report optimization, which is
error prone for certain scenarios
- Add warning for unexpected pending requested expedited quiescent
state on dying CPU
Core:
- Robustify rcu_is_cpu_rrupt_from_idle() by using more accurate
indicators of the actual context tracking state of a CPU
- Handle ->defer_qs_iw_pending field data race
- Enable rcu_normal_wake_from_gp by default on systems with <= 16
CPUs
- Fix lockup in rcu_read_unlock() due to recursive irq_exit() calls
- Refactor expedited handling condition in rcu_read_unlock_special()
- Documentation updates for hotplug and GP init scan ordering,
separation of rcu_state and rnp's gp_seq states, quiescent state
reporting for offline CPUs
torture-scripts:
- Cleanup and improve scripts : remove superfluous warnings for
disabled tests; better handling of kvm.sh --kconfig arg; suppress
some confusing diagnostics; tolerate bad kvm.sh args; add new
diagnostic for build output; fail allmodconfig testing on warnings
- Include RCU_TORTURE_TEST_CHK_RDR_STATE config for KCSAN kernels
- Disable default RCU-tasks and clocksource-wdog testing on arm64
- Add EXPERT Kconfig option for arm64 KCSAN runs
- Remove SRCU-lite testing
rcutorture:
- Start torture writer threads creation after reader threads to
handle race in SRCU-P scenario
- Add SRCU down_read()/up_read() test
- Add diagnostics for delayed SRCU up_read(), unmatched up_read(),
print number of up/down readers and the number of such readers
which migrated to other CPU
- Ignore certain unsupported configurations for trivial RCU test
- Fix splats in RT kernels due to inaccurate checks for BH-disabled
context
- Enable checks and logs to capture intentionally exercised
unexpected scenarios (too short readers) for BUSTED test
- Remove SRCU-lite testing
srcu:
- Expedite SRCU-fast grace periods
- Remove SRCU-lite implementation
- Add guards for SRCU-fast readers
rcu nocb:
- Dump NOCB group leader state on stall detection
- Robustify nocb_cb_kthread pointer accesses
- Fix delayed execution of hurry callbacks when LAZY_RCU is enabled
refscale:
- Fix multiplication overflow in "loops" and "nreaders" calculations"
* tag 'rcu.release.v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (49 commits)
rcu: Document concurrent quiescent state reporting for offline CPUs
rcu: Document separation of rcu_state and rnp's gp_seq
rcu: Document GP init vs hotplug-scan ordering requirements
srcu: Add guards for SRCU-fast readers
rcu: Fix delayed execution of hurry callbacks
rcu: Refactor expedited handling check in rcu_read_unlock_special()
checkpatch: Remove SRCU-lite deprecation
srcu: Remove SRCU-lite implementation
srcu: Expedite SRCU-fast grace periods
rcutorture: Remove support for SRCU-lite
rcutorture: Remove SRCU-lite scenarios
torture: Remove support for SRCU-lite
torture: Make torture.sh --allmodconfig testing fail on warnings
torture: Add "ERROR" diagnostic for testing kernel-build output
torture: Make torture.sh tolerate runs having bad kvm.sh arguments
torture: Add textid.txt file to --do-allmodconfig and --do-rcu-rust runs
torture: Extract testid.txt generation to separate script
torture: Suppress "find" diagnostics from torture.sh --do-none run
torture: Provide EXPERT Kconfig option for arm64 KCSAN torture.sh runs
rcu: Fix rcu_read_unlock() deadloop due to IRQ work
...
Diffstat (limited to 'Documentation')
| -rw-r--r-- | Documentation/RCU/Design/Data-Structures/Data-Structures.rst | 33 | ||||
| -rw-r--r-- | Documentation/RCU/Design/Requirements/Requirements.rst | 128 | ||||
| -rw-r--r-- | Documentation/admin-guide/kernel-parameters.txt | 3 |
3 files changed, 163 insertions, 1 deletions
diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.rst b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst index 04e16775c752..1b0aad184dd7 100644 --- a/Documentation/RCU/Design/Data-Structures/Data-Structures.rst +++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst @@ -286,6 +286,39 @@ in order to detect the beginnings and ends of grace periods in a distributed fashion. The values flow from ``rcu_state`` to ``rcu_node`` (down the tree from the root to the leaves) to ``rcu_data``. ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Given that the root rcu_node structure has a gp_seq field, | +| why does RCU maintain a separate gp_seq in the rcu_state structure? | +| Why not just use the root rcu_node's gp_seq as the official record | +| and update it directly when starting a new grace period? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| On single-node RCU trees (where the root node is also a leaf), | +| updating the root node's gp_seq immediately would create unnecessary | +| lock contention. Here's why: | +| | +| If we did rcu_seq_start() directly on the root node's gp_seq: | +| | +| 1. All CPUs would immediately see their node's gp_seq from their rdp's| +| gp_seq, in rcu_pending(). They would all then invoke the RCU-core. | +| 2. Which calls note_gp_changes() and try to acquire the node lock. | +| 3. But rnp->qsmask isn't initialized yet (happens later in | +| rcu_gp_init()) | +| 4. So each CPU would acquire the lock, find it can't determine if it | +| needs to report quiescent state (no qsmask), update rdp->gp_seq, | +| and release the lock. | +| 5. Result: Lots of lock acquisitions with no grace period progress | +| | +| By having a separate rcu_state.gp_seq, we can increment the official | +| grace period counter without immediately affecting what CPUs see in | +| their nodes. The hierarchical propagation in rcu_gp_init() then | +| updates the root node's gp_seq and qsmask together under the same lock| +| acquisition, avoiding this useless contention. | ++-----------------------------------------------------------------------+ + Miscellaneous ''''''''''''' diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst index 6125e7068d2c..b0395540296b 100644 --- a/Documentation/RCU/Design/Requirements/Requirements.rst +++ b/Documentation/RCU/Design/Requirements/Requirements.rst @@ -1970,6 +1970,134 @@ corresponding CPU's leaf node lock is held. This avoids race conditions between RCU's hotplug notifier hooks, the grace period initialization code, and the FQS loop, all of which refer to or modify this bookkeeping. +Note that grace period initialization (rcu_gp_init()) must carefully sequence +CPU hotplug scanning with grace period state changes. For example, the +following race could occur in rcu_gp_init() if rcu_seq_start() were to happen +after the CPU hotplug scanning. + +.. code-block:: none + + CPU0 (rcu_gp_init) CPU1 CPU2 + --------------------- ---- ---- + // Hotplug scan first (WRONG ORDER) + rcu_for_each_leaf_node(rnp) { + rnp->qsmaskinit = rnp->qsmaskinitnext; + } + rcutree_report_cpu_starting() + rnp->qsmaskinitnext |= mask; + rcu_read_lock() + r0 = *X; + r1 = *X; + X = NULL; + cookie = get_state_synchronize_rcu(); + // cookie = 8 (future GP) + rcu_seq_start(&rcu_state.gp_seq); + // gp_seq = 5 + + // CPU1 now invisible to this GP! + rcu_for_each_node_breadth_first() { + rnp->qsmask = rnp->qsmaskinit; + // CPU1 not included! + } + + // GP completes without CPU1 + rcu_seq_end(&rcu_state.gp_seq); + // gp_seq = 8 + poll_state_synchronize_rcu(cookie); + // Returns true! + kfree(r1); + r2 = *r0; // USE-AFTER-FREE! + +By incrementing gp_seq first, CPU1's RCU read-side critical section +is guaranteed to not be missed by CPU2. + +**Concurrent Quiescent State Reporting for Offline CPUs** + +RCU must ensure that CPUs going offline report quiescent states to avoid +blocking grace periods. This requires careful synchronization to handle +race conditions + +**Race condition causing Offline CPU to hang GP** + +A race between CPU offlining and new GP initialization (gp_init) may occur +because `rcu_report_qs_rnp()` in `rcutree_report_cpu_dead()` must temporarily +release the `rcu_node` lock to wake the RCU grace-period kthread: + +.. code-block:: none + + CPU1 (going offline) CPU0 (GP kthread) + -------------------- ----------------- + rcutree_report_cpu_dead() + rcu_report_qs_rnp() + // Must release rnp->lock to wake GP kthread + raw_spin_unlock_irqrestore_rcu_node() + // Wakes up and starts new GP + rcu_gp_init() + // First loop: + copies qsmaskinitnext->qsmaskinit + // CPU1 still in qsmaskinitnext! + + // Second loop: + rnp->qsmask = rnp->qsmaskinit + mask = rnp->qsmask & ~rnp->qsmaskinitnext + // mask is 0! CPU1 still in both masks + // Reacquire lock (but too late) + rnp->qsmaskinitnext &= ~mask // Finally clears bit + +Without `ofl_lock`, the new grace period includes the offline CPU and waits +forever for its quiescent state causing a GP hang. + +**A solution with ofl_lock** + +The `ofl_lock` (offline lock) prevents `rcu_gp_init()` from running during +the vulnerable window when `rcu_report_qs_rnp()` has released `rnp->lock`: + +.. code-block:: none + + CPU0 (rcu_gp_init) CPU1 (rcutree_report_cpu_dead) + ------------------ ------------------------------ + rcu_for_each_leaf_node(rnp) { + arch_spin_lock(&ofl_lock) -----> arch_spin_lock(&ofl_lock) [BLOCKED] + + // Safe: CPU1 can't interfere + rnp->qsmaskinit = rnp->qsmaskinitnext + + arch_spin_unlock(&ofl_lock) ---> // Now CPU1 can proceed + } // But snapshot already taken + +**Another race causing GP hangs in rcu_gpu_init(): Reporting QS for Now-offline CPUs** + +After the first loop takes an atomic snapshot of online CPUs, as shown above, +the second loop in `rcu_gp_init()` detects CPUs that went offline between +releasing `ofl_lock` and acquiring the per-node `rnp->lock`. This detection is +crucial because: + +1. The CPU might have gone offline after the snapshot but before the second loop +2. The offline CPU cannot report its own QS if it's already dead +3. Without this detection, the grace period would wait forever for CPUs that + are now offline. + +The second loop performs this detection safely: + +.. code-block:: none + + rcu_for_each_node_breadth_first(rnp) { + raw_spin_lock_irqsave_rcu_node(rnp, flags); + rnp->qsmask = rnp->qsmaskinit; // Apply the snapshot + + // Detect CPUs offline after snapshot + mask = rnp->qsmask & ~rnp->qsmaskinitnext; + + if (mask && rcu_is_leaf_node(rnp)) + rcu_report_qs_rnp(mask, ...) // Report QS for offline CPUs + } + +This approach ensures atomicity: quiescent state reporting for offline CPUs +happens either in `rcu_gp_init()` (second loop) or in `rcutree_report_cpu_dead()`, +never both and never neither. The `rnp->lock` held throughout the sequence +prevents races - `rcutree_report_cpu_dead()` also acquires this lock when +clearing `qsmaskinitnext`, ensuring mutual exclusion. + Scheduler and RCU ~~~~~~~~~~~~~~~~~ diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 0d2ea9a60145..d3f5a1c69dab 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5508,7 +5508,8 @@ echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1" - Default is 0. + Default is 1 if num_possible_cpus() <= 16 and it is not explicitly + disabled by the boot parameter passing 0. rcuscale.gp_async= [KNL] Measure performance of asynchronous |
