Merge tag 'rcu.release.v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux

Pull RCU updates from Neeraj Upadhyay:
 "Expedited grace period updates:
   - Protect against early RCU exp quiescent state reporting during exp
     grace period initialization
   - Remove superfluous barrier in task unblock path
   - Remove the CPU online quiescent state report optimization, which is
     error prone for certain scenarios
   - Add warning for unexpected pending requested expedited quiescent
     state on dying CPU

  Core:
   - Robustify rcu_is_cpu_rrupt_from_idle() by using more accurate
     indicators of the actual context tracking state of a CPU
   - Handle ->defer_qs_iw_pending field data race
   - Enable rcu_normal_wake_from_gp by default on systems with <= 16 CPUs
   - Fix lockup in rcu_read_unlock() due to recursive irq_exit() calls
   - Refactor expedited handling condition in rcu_read_unlock_special()
   - Documentation updates for hotplug and GP init scan ordering,
     separation of rcu_state and rnp's gp_seq states, quiescent state
     reporting for offline CPUs

  torture-scripts:
   - Cleanup and improve scripts: remove superfluous warnings for
     disabled tests; better handling of kvm.sh --kconfig arg; suppress
     some confusing diagnostics; tolerate bad kvm.sh args; add new
     diagnostic for build output; fail allmodconfig testing on warnings
   - Include RCU_TORTURE_TEST_CHK_RDR_STATE config for KCSAN kernels
   - Disable default RCU-tasks and clocksource-wdog testing on arm64
   - Add EXPERT Kconfig option for arm64 KCSAN runs
   - Remove SRCU-lite testing

  rcutorture:
   - Start torture writer threads creation after reader threads to
     handle race in SRCU-P scenario
   - Add SRCU down_read()/up_read() test
   - Add diagnostics for delayed SRCU up_read(), unmatched up_read(),
     print number of up/down readers and the number of such readers
     which migrated to other CPU
   - Ignore certain unsupported configurations for trivial RCU test
   - Fix splats in RT kernels due to inaccurate checks for BH-disabled
     context
   - Enable checks and logs to capture intentionally exercised
     unexpected scenarios (too short readers) for BUSTED test
   - Remove SRCU-lite testing

  srcu:
   - Expedite SRCU-fast grace periods
   - Remove SRCU-lite implementation
   - Add guards for SRCU-fast readers

  rcu nocb:
   - Dump NOCB group leader state on stall detection
   - Robustify nocb_cb_kthread pointer accesses
   - Fix delayed execution of hurry callbacks when LAZY_RCU is enabled

  refscale:
   - Fix multiplication overflow in "loops" and "nreaders" calculations"

* tag 'rcu.release.v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (49 commits)
  rcu: Document concurrent quiescent state reporting for offline CPUs
  rcu: Document separation of rcu_state and rnp's gp_seq
  rcu: Document GP init vs hotplug-scan ordering requirements
  srcu: Add guards for SRCU-fast readers
  rcu: Fix delayed execution of hurry callbacks
  rcu: Refactor expedited handling check in rcu_read_unlock_special()
  checkpatch: Remove SRCU-lite deprecation
  srcu: Remove SRCU-lite implementation
  srcu: Expedite SRCU-fast grace periods
  rcutorture: Remove support for SRCU-lite
  rcutorture: Remove SRCU-lite scenarios
  torture: Remove support for SRCU-lite
  torture: Make torture.sh --allmodconfig testing fail on warnings
  torture: Add "ERROR" diagnostic for testing kernel-build output
  torture: Make torture.sh tolerate runs having bad kvm.sh arguments
  torture: Add textid.txt file to --do-allmodconfig and --do-rcu-rust runs
  torture: Extract testid.txt generation to separate script
  torture: Suppress "find" diagnostics from torture.sh --do-none run
  torture: Provide EXPERT Kconfig option for arm64 KCSAN torture.sh runs
  rcu: Fix rcu_read_unlock() deadloop due to IRQ work
  ...
author: Linus Torvalds <torvalds@linux-foundation.org> 2025-07-30 11:01:41 -0700
committer: Linus Torvalds <torvalds@linux-foundation.org> 2025-07-30 11:01:41 -0700
commit: 2db4df0c09eeb209726261f43fc556360b38ec99 (patch)
tree: ec2369ba858fb0ea6f11c90b81800f258587e733 /Documentation
parent: 7dff275c663178e9a12a0c0038e4b3be2f3edcba (diff)
parent: cc1d1365f0f414f6522378867baa997642a7e6b2 (diff)
3 files changed, 163 insertions, 1 deletions
diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.rst b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst
index 04e16775c752..1b0aad184dd7 100644
--- a/Documentation/RCU/Design/Data-Structures/Data-Structures.rst
+++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst
@@ -286,6 +286,39 @@ in order to detect the beginnings and ends of grace periods in a
 distributed fashion. The values flow from ``rcu_state`` to ``rcu_node``
 (down the tree from the root to the leaves) to ``rcu_data``.
 
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| Given that the root rcu_node structure has a gp_seq field,            |
+| why does RCU maintain a separate gp_seq in the rcu_state structure?   |
+| Why not just use the root rcu_node's gp_seq as the official record    |
+| and update it directly when starting a new grace period?              |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| On single-node RCU trees (where the root node is also a leaf),        |
+| updating the root node's gp_seq immediately would create unnecessary  |
+| lock contention. Here's why:                                          |
+|                                                                       |
+| If we did rcu_seq_start() directly on the root node's gp_seq:         |
+|                                                                       |
+| 1. All CPUs would immediately see their node's gp_seq differ from     |
+|    their rdp's gp_seq in rcu_pending(), and all would invoke RCU core.|
+| 2. Each would call note_gp_changes() and try to acquire the node lock.|
+| 3. But rnp->qsmask isn't initialized yet (happens later in            |
+|    rcu_gp_init())                                                     |
+| 4. So each CPU would acquire the lock, find it can't determine if it  |
+|    needs to report quiescent state (no qsmask), update rdp->gp_seq,   |
+|    and release the lock.                                              |
+| 5. Result: Lots of lock acquisitions with no grace period progress    |
+|                                                                       |
+| By having a separate rcu_state.gp_seq, we can increment the official  |
+| grace period counter without immediately affecting what CPUs see in   |
+| their nodes. The hierarchical propagation in rcu_gp_init() then       |
+| updates the root node's gp_seq and qsmask together under the same lock|
+| acquisition, avoiding this useless contention.                        |
++-----------------------------------------------------------------------+
+
 Miscellaneous
 '''''''''''''
 
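Reviewer note on the quiz above, which refers to rcu_seq_start() operating on gp_seq. The following is a minimal user-space sketch of that counter arithmetic, not the kernel code itself. It assumes the encoding in which the low two bits of the counter hold the grace-period state and the upper bits count grace periods; the seq_*() names are hypothetical stand-ins for the kernel's rcu_seq_*() helpers, and all memory-ordering and READ_ONCE()/WRITE_ONCE() annotations are omitted.

```c
#include <assert.h>
#include <limits.h>

/* Assumed encoding: low two bits = GP state, upper bits = GP count. */
#define SEQ_CTR_SHIFT   2
#define SEQ_STATE_MASK  ((1UL << SEQ_CTR_SHIFT) - 1)

/* Wraparound-safe "a >= b" comparison on unsigned counters. */
static inline int seq_cmp_ge(unsigned long a, unsigned long b)
{
    return ULONG_MAX / 2 >= a - b;
}

/* Mark a grace period as started: state bits go from 0 to 1. */
static inline void seq_start(unsigned long *sp)
{
    *sp += 1;
}

/* Mark the grace period as ended: round up to the next idle value. */
static inline void seq_end(unsigned long *sp)
{
    *sp = (*sp | SEQ_STATE_MASK) + 1;
}

/* Cookie: the counter value at which a full grace period has elapsed. */
static inline unsigned long seq_snap(const unsigned long *sp)
{
    return (*sp + 2 * SEQ_STATE_MASK + 1) & ~SEQ_STATE_MASK;
}

/* Has the grace period identified by cookie s completed? */
static inline int seq_done(const unsigned long *sp, unsigned long s)
{
    return seq_cmp_ge(*sp, s);
}
```

Under these assumptions, starting from an idle counter value of 4, a snapshot yields the cookie 8, seq_start() moves the counter to 5, and seq_end() moves it to 8, at which point the cookie is considered satisfied. These are the same values used in the race diagram in the Requirements.rst hunk of this commit.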
diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst
index 6125e7068d2c..b0395540296b 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.rst
+++ b/Documentation/RCU/Design/Requirements/Requirements.rst
@@ -1970,6 +1970,134 @@ corresponding CPU's leaf node lock is held. This avoids race conditions
 between RCU's hotplug notifier hooks, the grace period initialization
 code, and the FQS loop, all of which refer to or modify this bookkeeping.
 
+Note that grace period initialization (rcu_gp_init()) must carefully sequence
+CPU hotplug scanning with grace period state changes. For example, the
+following race could occur in rcu_gp_init() if rcu_seq_start() were to happen
+after the CPU hotplug scanning.
+
+.. code-block:: none
+
+   CPU0 (rcu_gp_init)                   CPU1                          CPU2
+   ---------------------                ----                          ----
+   // Hotplug scan first (WRONG ORDER)
+   rcu_for_each_leaf_node(rnp) {
+       rnp->qsmaskinit = rnp->qsmaskinitnext;
+   }
+                                        rcutree_report_cpu_starting()
+                                            rnp->qsmaskinitnext |= mask;
+                                        rcu_read_lock()
+                                        r0 = *X;
+                                                                      r1 = *X;
+                                                                      X = NULL;
+                                                                      cookie = get_state_synchronize_rcu();
+                                                                      // cookie = 8 (future GP)
+   rcu_seq_start(&rcu_state.gp_seq);
+   // gp_seq = 5
+
+   // CPU1 now invisible to this GP!
+   rcu_for_each_node_breadth_first() {
+       rnp->qsmask = rnp->qsmaskinit;
+       // CPU1 not included!
+   }
+
+   // GP completes without CPU1
+   rcu_seq_end(&rcu_state.gp_seq);
+   // gp_seq = 8
+                                                                      poll_state_synchronize_rcu(cookie);
+                                                                      // Returns true!
+                                                                      kfree(r1);
+                                        r2 = *r0; // USE-AFTER-FREE!
+
+By incrementing gp_seq first, CPU1's RCU read-side critical section
+is guaranteed to not be missed by CPU2.
+
+**Concurrent Quiescent State Reporting for Offline CPUs**
+
+RCU must ensure that CPUs going offline report quiescent states to avoid
+blocking grace periods. This requires careful synchronization to handle
+race conditions.
+
+**Race condition causing Offline CPU to hang GP**
+
+A race between CPU offlining and new GP initialization (gp_init) may occur
+because `rcu_report_qs_rnp()` in `rcutree_report_cpu_dead()` must temporarily
+release the `rcu_node` lock to wake the RCU grace-period kthread:
+
+.. code-block:: none
+
+   CPU1 (going offline)                 CPU0 (GP kthread)
+   --------------------                 -----------------
+   rcutree_report_cpu_dead()
+     rcu_report_qs_rnp()
+       // Must release rnp->lock to wake GP kthread
+       raw_spin_unlock_irqrestore_rcu_node()
+                                        // Wakes up and starts new GP
+                                        rcu_gp_init()
+                                          // First loop:
+                                          copies qsmaskinitnext->qsmaskinit
+                                          // CPU1 still in qsmaskinitnext!
+
+                                          // Second loop:
+                                          rnp->qsmask = rnp->qsmaskinit
+                                          mask = rnp->qsmask & ~rnp->qsmaskinitnext
+                                          // mask is 0! CPU1 still in both masks
+       // Reacquire lock (but too late)
+     rnp->qsmaskinitnext &= ~mask       // Finally clears bit
+
+Without `ofl_lock`, the new grace period includes the offline CPU and waits
+forever for its quiescent state, causing a GP hang.
+
+**A solution with ofl_lock**
+
+The `ofl_lock` (offline lock) prevents `rcu_gp_init()` from running during
+the vulnerable window when `rcu_report_qs_rnp()` has released `rnp->lock`:
+
+.. code-block:: none
+
+   CPU0 (rcu_gp_init)                   CPU1 (rcutree_report_cpu_dead)
+   ------------------                   ------------------------------
+   rcu_for_each_leaf_node(rnp) {
+       arch_spin_lock(&ofl_lock) -----> arch_spin_lock(&ofl_lock) [BLOCKED]
+
+       // Safe: CPU1 can't interfere
+       rnp->qsmaskinit = rnp->qsmaskinitnext
+
+       arch_spin_unlock(&ofl_lock) ---> // Now CPU1 can proceed
+   }                                    // But snapshot already taken
+
+**Another race causing GP hangs in rcu_gp_init(): Reporting QS for Now-offline CPUs**
+
+After the first loop takes an atomic snapshot of online CPUs, as shown above,
+the second loop in `rcu_gp_init()` detects CPUs that went offline between
+releasing `ofl_lock` and acquiring the per-node `rnp->lock`. This detection is
+crucial because:
+
+1. The CPU might have gone offline after the snapshot but before the second loop
+2. The offline CPU cannot report its own QS if it's already dead
+3. Without this detection, the grace period would wait forever for CPUs that
+   are now offline.
+
+The second loop performs this detection safely:
+
+.. code-block:: none
+
+   rcu_for_each_node_breadth_first(rnp) {
+       raw_spin_lock_irqsave_rcu_node(rnp, flags);
+       rnp->qsmask = rnp->qsmaskinit;  // Apply the snapshot
+
+       // Detect CPUs offline after snapshot
+       mask = rnp->qsmask & ~rnp->qsmaskinitnext;
+
+       if (mask && rcu_is_leaf_node(rnp))
+           rcu_report_qs_rnp(mask, ...)  // Report QS for offline CPUs
+   }
+
+This approach ensures atomicity: quiescent state reporting for offline CPUs
+happens either in `rcu_gp_init()` (second loop) or in `rcutree_report_cpu_dead()`,
+never both and never neither. The `rnp->lock` held throughout the sequence
+prevents races - `rcutree_report_cpu_dead()` also acquires this lock when
+clearing `qsmaskinitnext`, ensuring mutual exclusion.
+
 Scheduler and RCU
 ~~~~~~~~~~~~~~~~~
 
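Reviewer note on the two-loop scheme documented in the hunk above. The following is a single-threaded user-space sketch of the mask bookkeeping only. The struct and function names are hypothetical stand-ins for the rcu_node fields and kernel helpers, and the locking that the documentation describes (ofl_lock and rnp->lock) is deliberately left out, so this illustrates the bit arithmetic rather than the synchronization.

```c
#include <assert.h>

/* Hypothetical stand-in for the relevant rcu_node fields. */
struct node_masks {
    unsigned long qsmaskinitnext; /* CPUs currently online */
    unsigned long qsmaskinit;     /* snapshot taken at GP init */
    unsigned long qsmask;         /* CPUs this GP still waits on */
};

/* First loop: snapshot the set of online CPUs. */
static inline void snapshot_online(struct node_masks *n)
{
    n->qsmaskinit = n->qsmaskinitnext;
}

/*
 * Second loop: apply the snapshot and return the CPUs that went
 * offline after the snapshot was taken.
 */
static inline unsigned long apply_snapshot(struct node_masks *n)
{
    n->qsmask = n->qsmaskinit;
    return n->qsmask & ~n->qsmaskinitnext;
}

/* Report a quiescent state on behalf of the CPUs in mask. */
static inline void report_qs(struct node_masks *n, unsigned long mask)
{
    n->qsmask &= ~mask;
}
```

For example, with CPUs 0-2 online at snapshot time (mask 0x7) and CPU 1 going offline before the second loop, apply_snapshot() returns 0x2, and reporting that mask leaves the grace period waiting only on CPUs 0 and 2 (0x5).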
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0d2ea9a60145..d3f5a1c69dab 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5508,7 +5508,8 @@
 			echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
 			or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1"
 
-			Default is 0.
+			Default is 1 if num_possible_cpus() <= 16 and it is not explicitly
+			disabled by the boot parameter passing 0.
 
 	rcuscale.gp_async= [KNL]
 			Measure performance of asynchronous
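Reviewer note on the new default described in the hunk above. The following is a small sketch of how the effective value could be resolved. It assumes that a negative value means "not set on the command line"; the function name, the macro name, and that sentinel are assumptions made for illustration and are not taken from this diff.

```c
#include <assert.h>

/* Threshold taken from the documentation text ("<= 16"); name is hypothetical. */
#define WAKE_FROM_GP_CPU_THRESHOLD 16

/*
 * param: value of the boot parameter, or negative if not given (assumed).
 * possible_cpus: what num_possible_cpus() would return.
 * Returns 1 if wake-from-GP would be enabled, 0 otherwise.
 */
static inline int effective_wake_from_gp(int param, unsigned int possible_cpus)
{
    if (param >= 0)
        return param != 0;  /* an explicit setting always wins */
    return possible_cpus <= WAKE_FROM_GP_CPU_THRESHOLD;
}
```

With this sketch, an unset parameter enables the feature on a 16-CPU system but not on a 17-CPU one, and an explicit 0 disables it regardless of CPU count.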