diff options
Diffstat (limited to 'Documentation/driver-api/hw-recoverable-errors.rst')
| -rw-r--r-- | Documentation/driver-api/hw-recoverable-errors.rst | 60 |
1 files changed, 60 insertions, 0 deletions
diff --git a/Documentation/driver-api/hw-recoverable-errors.rst b/Documentation/driver-api/hw-recoverable-errors.rst new file mode 100644 index 000000000000..fc526c3454bd --- /dev/null +++ b/Documentation/driver-api/hw-recoverable-errors.rst @@ -0,0 +1,60 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================================= +Recoverable Hardware Error Tracking in vmcoreinfo +================================================= + +Overview +-------- + +This feature provides a generic infrastructure within the Linux kernel to track +and log recoverable hardware errors. These are hardware recoverable errors +visible that might not cause immediate panics but may influence health, mainly +because new code path will be executed in the kernel. + +By recording counts and timestamps of recoverable errors into the vmcoreinfo +crash dump notes, this infrastructure aids post-mortem crash analysis tools in +correlating hardware events with kernel failures. This enables faster triage +and better understanding of root causes, especially in large-scale cloud +environments where hardware issues are common. + +Benefits +-------- + +- Facilitates correlation of hardware recoverable errors with kernel panics or + unusual code paths that lead to system crashes. +- Provides operators and cloud providers quick insights, improving reliability + and reducing troubleshooting time. +- Complements existing full hardware diagnostics without replacing them. + +Data Exposure and Consumption +----------------------------- + +- The tracked error data consists of per-error-type counts and timestamps of + last occurrence. +- This data is stored in the `hwerror_data` array, categorized by error source + types like CPU, memory, PCI, CXL, and others. +- It is exposed via vmcoreinfo crash dump notes and can be read using tools + like `crash`, `drgn`, or other kernel crash analysis utilities. +- There is no other way to read these data other than from crash dumps. +- These errors are divided by area, which includes CPU, Memory, PCI, CXL and + others. + +Typical usage example (in drgn REPL): + +.. code-block:: python + + >>> prog['hwerror_data'] + (struct hwerror_info[HWERR_RECOV_MAX]){ + { + .count = (int)844, + .timestamp = (time64_t)1752852018, + }, + ... + } + +Enabling +-------- + +- This feature is enabled when CONFIG_VMCORE_INFO is set. + |
