diff options
Diffstat (limited to 'Documentation/admin-guide')
| -rw-r--r-- | Documentation/admin-guide/LSM/Smack.rst | 16 | ||||
| -rw-r--r-- | Documentation/admin-guide/LSM/ipe.rst | 17 | ||||
| -rw-r--r-- | Documentation/admin-guide/RAS/main.rst | 142 | ||||
| -rw-r--r-- | Documentation/admin-guide/kernel-parameters.txt | 16 | ||||
| -rw-r--r-- | Documentation/admin-guide/pm/cpuidle.rst | 9 | ||||
| -rw-r--r-- | Documentation/admin-guide/pm/intel_pstate.rst | 133 | ||||
| -rw-r--r-- | Documentation/admin-guide/thermal/index.rst | 1 | ||||
| -rw-r--r-- | Documentation/admin-guide/thermal/intel_thermal_throttle.rst | 91 |
8 files changed, 220 insertions, 205 deletions
diff --git a/Documentation/admin-guide/LSM/Smack.rst b/Documentation/admin-guide/LSM/Smack.rst index 6d44f4fdbf59..c5ed775f2d10 100644 --- a/Documentation/admin-guide/LSM/Smack.rst +++ b/Documentation/admin-guide/LSM/Smack.rst @@ -601,10 +601,15 @@ specification. Task Attribute ~~~~~~~~~~~~~~ -The Smack label of a process can be read from /proc/<pid>/attr/current. A -process can read its own Smack label from /proc/self/attr/current. A +The Smack label of a process can be read from ``/proc/<pid>/attr/current``. A +process can read its own Smack label from ``/proc/self/attr/current``. A privileged process can change its own Smack label by writing to -/proc/self/attr/current but not the label of another process. +``/proc/self/attr/current`` but not the label of another process. + +Format of writing is : only the label or the label followed by one of the +3 trailers: ``\n`` (by common agreement for ``/proc/...`` interfaces), +``\0`` (because some applications incorrectly include it), +``\n\0`` (because we think some applications may incorrectly include it). File Attribute ~~~~~~~~~~~~~~ @@ -696,6 +701,11 @@ sockets. A privileged program may set this to match the label of another task with which it hopes to communicate. +UNIX domain socket (UDS) with a BSD address functions both as a file in a +filesystem and as a socket. As a file, it carries the SMACK64 attribute. This +attribute is not involved in Smack security enforcement and is immutably +assigned the label "*". + Smack Netlabel Exceptions ~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/Documentation/admin-guide/LSM/ipe.rst b/Documentation/admin-guide/LSM/ipe.rst index dc7088451f9d..a756d8158531 100644 --- a/Documentation/admin-guide/LSM/ipe.rst +++ b/Documentation/admin-guide/LSM/ipe.rst @@ -95,7 +95,20 @@ languages when these scripts are invoked by passing these program files to the interpreter. This is because the way interpreters execute these files; the scripts themselves are not evaluated as executable code through one of IPE's hooks, but they are merely text files that are read -(as opposed to compiled executables) [#interpreters]_. +(as opposed to compiled executables). However, with the introduction of the +``AT_EXECVE_CHECK`` flag (:doc:`AT_EXECVE_CHECK </userspace-api/check_exec>`), +interpreters can use it to signal the kernel that a script file will be executed, +and request the kernel to perform LSM security checks on it. + +IPE's EXECUTE operation enforcement differs between compiled executables and +interpreted scripts: For compiled executables, enforcement is triggered +automatically by the kernel during ``execve()``, ``execveat()``, ``mmap()`` +and ``mprotect()`` syscalls when loading executable content. For interpreted +scripts, enforcement requires explicit interpreter integration using +``execveat()`` with ``AT_EXECVE_CHECK`` flag. Unlike exec syscalls that IPE +intercepts during the execution process, this mechanism needs the interpreter +to take the initiative, and existing interpreters won't be automatically +supported unless the signal call is added. Threat Model ------------ @@ -806,8 +819,6 @@ A: .. [#digest_cache_lsm] https://lore.kernel.org/lkml/20240415142436.2545003-1-roberto.sassu@huaweicloud.com/ -.. [#interpreters] There is `some interest in solving this issue <https://lore.kernel.org/lkml/20220321161557.495388-1-mic@digikod.net/>`_. - .. [#devdoc] Please see :doc:`the design docs </security/ipe>` for more on this topic. diff --git a/Documentation/admin-guide/RAS/main.rst b/Documentation/admin-guide/RAS/main.rst index 447bfde509fb..5a45db32c49b 100644 --- a/Documentation/admin-guide/RAS/main.rst +++ b/Documentation/admin-guide/RAS/main.rst @@ -406,24 +406,8 @@ index of the MC:: |->mc2 .... -Under each ``mcX`` directory each ``csrowX`` is again represented by a -``csrowX``, where ``X`` is the csrow index:: - - .../mc/mc0/ - | - |->csrow0 - |->csrow2 - |->csrow3 - .... - -Notice that there is no csrow1, which indicates that csrow0 is composed -of a single ranked DIMMs. This should also apply in both Channels, in -order to have dual-channel mode be operational. Since both csrow2 and -csrow3 are populated, this indicates a dual ranked set of DIMMs for -channels 0 and 1. - -Within each of the ``mcX`` and ``csrowX`` directories are several EDAC -control and attribute files. +Within each of the ``mcX`` directory are several EDAC control and +attribute files. ``mcX`` directories ------------------- @@ -569,7 +553,7 @@ this ``X`` memory module: - Unbuffered-DDR .. [#f5] On some systems, the memory controller doesn't have any logic - to identify the memory module. On such systems, the directory is called ``rankX`` and works on a similar way as the ``csrowX`` directories. + to identify the memory module. On such systems, the directory is called ``rankX``. On modern Intel memory controllers, the memory controller identifies the memory modules directly. On such systems, the directory is called ``dimmX``. @@ -577,126 +561,6 @@ this ``X`` memory module: symlinks inside the sysfs mapping that are automatically created by the sysfs subsystem. Currently, they serve no purpose. -``csrowX`` directories ----------------------- - -When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX`` -directories. As this API doesn't work properly for Rambus, FB-DIMMs and -modern Intel Memory Controllers, this is being deprecated in favor of -``dimmX`` directories. - -In the ``csrowX`` directories are EDAC control and attribute files for -this ``X`` instance of csrow: - - -- ``ue_count`` - Total Uncorrectable Errors count attribute file - - This attribute file displays the total count of uncorrectable - errors that have occurred on this csrow. If panic_on_ue is set - this counter will not have a chance to increment, since EDAC - will panic the system. - - -- ``ce_count`` - Total Correctable Errors count attribute file - - This attribute file displays the total count of correctable - errors that have occurred on this csrow. This count is very - important to examine. CEs provide early indications that a - DIMM is beginning to fail. This count field should be - monitored for non-zero values and report such information - to the system administrator. - - -- ``size_mb`` - Total memory managed by this csrow attribute file - - This attribute file displays, in count of megabytes, the memory - that this csrow contains. - - -- ``mem_type`` - Memory Type attribute file - - This attribute file will display what type of memory is currently - on this csrow. Normally, either buffered or unbuffered memory. - Examples: - - - Registered-DDR - - Unbuffered-DDR - - -- ``edac_mode`` - EDAC Mode of operation attribute file - - This attribute file will display what type of Error detection - and correction is being utilized. - - -- ``dev_type`` - Device type attribute file - - This attribute file will display what type of DRAM device is - being utilized on this DIMM. - Examples: - - - x1 - - x2 - - x4 - - x8 - - -- ``ch0_ce_count`` - Channel 0 CE Count attribute file - - This attribute file will display the count of CEs on this - DIMM located in channel 0. - - -- ``ch0_ue_count`` - Channel 0 UE Count attribute file - - This attribute file will display the count of UEs on this - DIMM located in channel 0. - - -- ``ch0_dimm_label`` - Channel 0 DIMM Label control file - - - This control file allows this DIMM to have a label assigned - to it. With this label in the module, when errors occur - the output can provide the DIMM label in the system log. - This becomes vital for panic events to isolate the - cause of the UE event. - - DIMM Labels must be assigned after booting, with information - that correctly identifies the physical slot with its - silk screen label. This information is currently very - motherboard specific and determination of this information - must occur in userland at this time. - - -- ``ch1_ce_count`` - Channel 1 CE Count attribute file - - - This attribute file will display the count of CEs on this - DIMM located in channel 1. - - -- ``ch1_ue_count`` - Channel 1 UE Count attribute file - - - This attribute file will display the count of UEs on this - DIMM located in channel 0. - - -- ``ch1_dimm_label`` - Channel 1 DIMM Label control file - - This control file allows this DIMM to have a label assigned - to it. With this label in the module, when errors occur - the output can provide the DIMM label in the system log. - This becomes vital for panic events to isolate the - cause of the UE event. - - DIMM Labels must be assigned after booting, with information - that correctly identifies the physical slot with its - silk screen label. This information is currently very - motherboard specific and determination of this information - must occur in userland at this time. - System Logging -------------- diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 6c42061ca20e..2b465eab41a1 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1907,6 +1907,16 @@ /sys/power/pm_test). Only available when CONFIG_PM_DEBUG is set. Default value is 5. + hibernate_compression_threads= + [HIBERNATION] + Set the number of threads used for compressing or decompressing + hibernation images. + + Format: <integer> + Default: 3 + Minimum: 1 + Example: hibernate_compression_threads=4 + highmem=nn[KMG] [KNL,BOOT,EARLY] forces the highmem zone to have an exact size of <nn>. This works even on boxes that have no highmem otherwise. This also works to reduce highmem @@ -6207,7 +6217,7 @@ rdt= [HW,X86,RDT] Turn on/off individual RDT features. List is: cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp, - mba, smba, bmec, abmc. + mba, smba, bmec, abmc, sdciae. E.g. to turn on cmt and turn off mba use: rdt=cmt,!mba @@ -6500,6 +6510,10 @@ Memory area to be used by remote processor image, managed by CMA. + rseq_debug= [KNL] Enable or disable restartable sequence + debug mode. Defaults to CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE. + Format: <bool> + rt_group_sched= [KNL] Enable or disable SCHED_RR/FIFO group scheduling when CONFIG_RT_GROUP_SCHED=y. Defaults to !CONFIG_RT_GROUP_SCHED_DEFAULT_DISABLED. diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst index 0c090b076224..be4c1120e3f0 100644 --- a/Documentation/admin-guide/pm/cpuidle.rst +++ b/Documentation/admin-guide/pm/cpuidle.rst @@ -580,6 +580,15 @@ the given CPU as the upper limit for the exit latency of the idle states that they are allowed to select for that CPU. They should never select any idle states with exit latency beyond that limit. +While the above CPU QoS constraints apply to CPU idle time management, user +space may also request a CPU system wakeup latency QoS limit, via the +`cpu_wakeup_latency` file. This QoS constraint is respected when selecting a +suitable idle state for the CPUs, while entering the system-wide suspend-to-idle +sleep state, but also to the regular CPU idle time management. + +Note that, the management of the `cpu_wakeup_latency` file works according to +the 'cpu_dma_latency' file from user space point of view. Moreover, the unit +is also microseconds. Idle States Control Via Kernel Command Line =========================================== diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst index 26e702c7016e..fde967b0c2e0 100644 --- a/Documentation/admin-guide/pm/intel_pstate.rst +++ b/Documentation/admin-guide/pm/intel_pstate.rst @@ -48,8 +48,9 @@ only way to pass early-configuration-time parameters to it is via the kernel command line. However, its configuration can be adjusted via ``sysfs`` to a great extent. In some configurations it even is possible to unregister it via ``sysfs`` which allows another ``CPUFreq`` scaling driver to be loaded and -registered (see `below <status_attr_>`_). +registered (see :ref:`below <status_attr>`). +.. _operation_modes: Operation Modes =============== @@ -62,6 +63,8 @@ a certain performance scaling algorithm. Which of them will be in effect depends on what kernel command line options are used and on the capabilities of the processor. +.. _active_mode: + Active Mode ----------- @@ -94,6 +97,8 @@ Which of the P-state selection algorithms is used by default depends on the Namely, if that option is set, the ``performance`` algorithm will be used by default, and the other one will be used by default if it is not set. +.. _active_mode_hwp: + Active Mode With HWP ~~~~~~~~~~~~~~~~~~~~ @@ -123,7 +128,7 @@ Energy-Performance Bias (EPB) knob (otherwise), which means that the processor's internal P-state selection logic is expected to focus entirely on performance. This will override the EPP/EPB setting coming from the ``sysfs`` interface -(see `Energy vs Performance Hints`_ below). Moreover, any attempts to change +(see :ref:`energy_performance_hints` below). Moreover, any attempts to change the EPP/EPB to a value different from 0 ("performance") via ``sysfs`` in this configuration will be rejected. @@ -192,6 +197,8 @@ This is the default P-state selection algorithm if the :c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option is not set. +.. _passive_mode: + Passive Mode ------------ @@ -289,12 +296,12 @@ Unlike ``_PSS`` objects in the ACPI tables, ``intel_pstate`` always exposes the entire range of available P-states, including the whole turbo range, to the ``CPUFreq`` core and (in the passive mode) to generic scaling governors. This generally causes turbo P-states to be set more often when ``intel_pstate`` is -used relative to ACPI-based CPU performance scaling (see `below <acpi-cpufreq_>`_ -for more information). +used relative to ACPI-based CPU performance scaling (see +:ref:`below <acpi-cpufreq>` for more information). Moreover, since ``intel_pstate`` always knows what the real turbo threshold is (even if the Configurable TDP feature is enabled in the processor), its -``no_turbo`` attribute in ``sysfs`` (described `below <no_turbo_attr_>`_) should +``no_turbo`` attribute in ``sysfs`` (described :ref:`below <no_turbo_attr>`) should work as expected in all cases (that is, if set to disable turbo P-states, it always should prevent ``intel_pstate`` from using them). @@ -307,12 +314,12 @@ pieces of information on it to be known, including: * The minimum supported P-state. - * The maximum supported `non-turbo P-state <turbo_>`_. + * The maximum supported :ref:`non-turbo P-state <turbo>`. * Whether or not turbo P-states are supported at all. - * The maximum supported `one-core turbo P-state <turbo_>`_ (if turbo P-states - are supported). + * The maximum supported :ref:`one-core turbo P-state <turbo>` (if turbo + P-states are supported). * The scaling formula to translate the driver's internal representation of P-states into frequencies and the other way around. @@ -400,10 +407,10 @@ Energy-Aware Scheduling Support If ``CONFIG_ENERGY_MODEL`` has been set during kernel configuration and ``intel_pstate`` runs on a hybrid processor without SMT, in addition to enabling -`CAS <CAS_>`_ it registers an Energy Model for the processor. This allows the +:ref:`CAS` it registers an Energy Model for the processor. This allows the Energy-Aware Scheduling (EAS) support to be enabled in the CPU scheduler if ``schedutil`` is used as the ``CPUFreq`` governor which requires ``intel_pstate`` -to operate in the `passive mode <Passive Mode_>`_. +to operate in the :ref:`passive mode <passive_mode>`. The Energy Model registered by ``intel_pstate`` is artificial (that is, it is based on abstract cost values and it does not include any real power numbers) @@ -432,6 +439,8 @@ the ``energy_model`` directory in ``debugfs`` (typlically mounted on User Space Interface in ``sysfs`` ================================= +.. _global_attributes: + Global Attributes ----------------- @@ -444,8 +453,8 @@ argument is passed to the kernel in the command line. ``max_perf_pct`` Maximum P-state the driver is allowed to set in percent of the - maximum supported performance level (the highest supported `turbo - P-state <turbo_>`_). + maximum supported performance level (the highest supported :ref:`turbo + P-state <turbo>`). This attribute will not be exposed if the ``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel @@ -453,8 +462,8 @@ argument is passed to the kernel in the command line. ``min_perf_pct`` Minimum P-state the driver is allowed to set in percent of the - maximum supported performance level (the highest supported `turbo - P-state <turbo_>`_). + maximum supported performance level (the highest supported :ref:`turbo + P-state <turbo>`). This attribute will not be exposed if the ``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel @@ -463,18 +472,18 @@ argument is passed to the kernel in the command line. ``num_pstates`` Number of P-states supported by the processor (between 0 and 255 inclusive) including both turbo and non-turbo P-states (see - `Turbo P-states Support`_). + :ref:`turbo`). This attribute is present only if the value exposed by it is the same for all of the CPUs in the system. The value of this attribute is not affected by the ``no_turbo`` - setting described `below <no_turbo_attr_>`_. + setting described :ref:`below <no_turbo_attr>`. This attribute is read-only. ``turbo_pct`` - Ratio of the `turbo range <turbo_>`_ size to the size of the entire + Ratio of the :ref:`turbo range <turbo>` size to the size of the entire range of supported P-states, in percent. This attribute is present only if the value exposed by it is the same @@ -486,7 +495,7 @@ argument is passed to the kernel in the command line. ``no_turbo`` If set (equal to 1), the driver is not allowed to set any turbo P-states - (see `Turbo P-states Support`_). If unset (equal to 0, which is the + (see :ref:`turbo`). If unset (equal to 0, which is the default), turbo P-states can be set by the driver. [Note that ``intel_pstate`` does not support the general ``boost`` attribute (supported by some other scaling drivers) which is replaced @@ -495,11 +504,11 @@ argument is passed to the kernel in the command line. This attribute does not affect the maximum supported frequency value supplied to the ``CPUFreq`` core and exposed via the policy interface, but it affects the maximum possible value of per-policy P-state limits - (see `Interpretation of Policy Attributes`_ below for details). + (see :ref:`policy_attributes_interpretation` below for details). ``hwp_dynamic_boost`` This attribute is only present if ``intel_pstate`` works in the - `active mode with the HWP feature enabled <Active Mode With HWP_>`_ in + :ref:`active mode with the HWP feature enabled <active_mode_hwp>` in the processor. If set (equal to 1), it causes the minimum P-state limit to be increased dynamically for a short time whenever a task previously waiting on I/O is selected to run on a given logical CPU (the purpose @@ -514,12 +523,12 @@ argument is passed to the kernel in the command line. Operation mode of the driver: "active", "passive" or "off". "active" - The driver is functional and in the `active mode - <Active Mode_>`_. + The driver is functional and in the :ref:`active mode + <active_mode>`. "passive" - The driver is functional and in the `passive mode - <Passive Mode_>`_. + The driver is functional and in the :ref:`passive mode + <passive_mode>`. "off" The driver is not functional (it is not registered as a scaling @@ -547,13 +556,15 @@ argument is passed to the kernel in the command line. attribute to "1" enables the energy-efficiency optimizations and setting to "0" disables them. +.. _policy_attributes_interpretation: + Interpretation of Policy Attributes ----------------------------------- The interpretation of some ``CPUFreq`` policy attributes described in Documentation/admin-guide/pm/cpufreq.rst is special with ``intel_pstate`` as the current scaling driver and it generally depends on the driver's -`operation mode <Operation Modes_>`_. +:ref:`operation mode <operation_modes>`. First of all, the values of the ``cpuinfo_max_freq``, ``cpuinfo_min_freq`` and ``scaling_cur_freq`` attributes are produced by applying a processor-specific @@ -562,9 +573,10 @@ Also, the values of the ``scaling_max_freq`` and ``scaling_min_freq`` attributes are capped by the frequency corresponding to the maximum P-state that the driver is allowed to set. -If the ``no_turbo`` `global attribute <no_turbo_attr_>`_ is set, the driver is -not allowed to use turbo P-states, so the maximum value of ``scaling_max_freq`` -and ``scaling_min_freq`` is limited to the maximum non-turbo P-state frequency. +If the ``no_turbo`` :ref:`global attribute <no_turbo_attr>` is set, the driver +is not allowed to use turbo P-states, so the maximum value of +``scaling_max_freq`` and ``scaling_min_freq`` is limited to the maximum +non-turbo P-state frequency. Accordingly, setting ``no_turbo`` causes ``scaling_max_freq`` and ``scaling_min_freq`` to go down to that value if they were above it before. However, the old values of ``scaling_max_freq`` and ``scaling_min_freq`` will be @@ -576,7 +588,7 @@ and ``scaling_min_freq`` corresponds to the maximum supported turbo P-state, which also is the value of ``cpuinfo_max_freq`` in either case. Next, the following policy attributes have special meaning if -``intel_pstate`` works in the `active mode <Active Mode_>`_: +``intel_pstate`` works in the :ref:`active mode <active_mode>`: ``scaling_available_governors`` List of P-state selection algorithms provided by ``intel_pstate``. @@ -597,20 +609,22 @@ processor: Shows the base frequency of the CPU. Any frequency above this will be in the turbo frequency range. -The meaning of these attributes in the `passive mode <Passive Mode_>`_ is the +The meaning of these attributes in the :ref:`passive mode <passive_mode>` is the same as for other scaling drivers. Additionally, the value of the ``scaling_driver`` attribute for ``intel_pstate`` depends on the operation mode of the driver. Namely, it is either -"intel_pstate" (in the `active mode <Active Mode_>`_) or "intel_cpufreq" (in the -`passive mode <Passive Mode_>`_). +"intel_pstate" (in the :ref:`active mode <active_mode>`) or "intel_cpufreq" +(in the :ref:`passive mode <passive_mode>`). + +.. _pstate_limits_coordination: Coordination of P-State Limits ------------------------------ ``intel_pstate`` allows P-state limits to be set in two ways: with the help of -the ``max_perf_pct`` and ``min_perf_pct`` `global attributes -<Global Attributes_>`_ or via the ``scaling_max_freq`` and ``scaling_min_freq`` +the ``max_perf_pct`` and ``min_perf_pct`` :ref:`global attributes +<global_attributes>` or via the ``scaling_max_freq`` and ``scaling_min_freq`` ``CPUFreq`` policy attributes. The coordination between those limits is based on the following rules, regardless of the current operation mode of the driver: @@ -632,17 +646,18 @@ on the following rules, regardless of the current operation mode of the driver: 3. The global and per-policy limits can be set independently. -In the `active mode with the HWP feature enabled <Active Mode With HWP_>`_, the +In the :ref:`active mode with the HWP feature enabled <active_mode_hwp>`, the resulting effective values are written into hardware registers whenever the limits change in order to request its internal P-state selection logic to always set P-states within these limits. Otherwise, the limits are taken into account -by scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver -every time before setting a new P-state for a CPU. +by scaling governors (in the :ref:`passive mode <passive_mode>`) and by the +driver every time before setting a new P-state for a CPU. Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument is passed to the kernel, ``max_perf_pct`` and ``min_perf_pct`` are not exposed at all and the only way to set the limits is by using the policy attributes. +.. _energy_performance_hints: Energy vs Performance Hints --------------------------- @@ -702,9 +717,9 @@ output. On those systems each ``_PSS`` object returns a list of P-states supported by the corresponding CPU which basically is a subset of the P-states range that can be used by ``intel_pstate`` on the same system, with one exception: the whole -`turbo range <turbo_>`_ is represented by one item in it (the topmost one). By -convention, the frequency returned by ``_PSS`` for that item is greater by 1 MHz -than the frequency of the highest non-turbo P-state listed by it, but the +:ref:`turbo range <turbo>` is represented by one item in it (the topmost one). +By convention, the frequency returned by ``_PSS`` for that item is greater by +1 MHz than the frequency of the highest non-turbo P-state listed by it, but the corresponding P-state representation (following the hardware specification) returned for it matches the maximum supported turbo P-state (or is the special value 255 meaning essentially "go as high as you can get"). @@ -730,18 +745,18 @@ benefit from running at turbo frequencies will be given non-turbo P-states instead. One more issue related to that may appear on systems supporting the -`Configurable TDP feature <turbo_>`_ allowing the platform firmware to set the -turbo threshold. Namely, if that is not coordinated with the lists of P-states -returned by ``_PSS`` properly, there may be more than one item corresponding to -a turbo P-state in those lists and there may be a problem with avoiding the -turbo range (if desirable or necessary). Usually, to avoid using turbo -P-states overall, ``acpi-cpufreq`` simply avoids using the topmost state listed -by ``_PSS``, but that is not sufficient when there are other turbo P-states in -the list returned by it. +:ref:`Configurable TDP feature <turbo>` allowing the platform firmware to set +the turbo threshold. Namely, if that is not coordinated with the lists of +P-states returned by ``_PSS`` properly, there may be more than one item +corresponding to a turbo P-state in those lists and there may be a problem with +avoiding the turbo range (if desirable or necessary). Usually, to avoid using +turbo P-states overall, ``acpi-cpufreq`` simply avoids using the topmost state +listed by ``_PSS``, but that is not sufficient when there are other turbo +P-states in the list returned by it. Apart from the above, ``acpi-cpufreq`` works like ``intel_pstate`` in the -`passive mode <Passive Mode_>`_, except that the number of P-states it can set -is limited to the ones listed by the ACPI ``_PSS`` objects. +:ref:`passive mode <passive_mode>`, except that the number of P-states it can +set is limited to the ones listed by the ACPI ``_PSS`` objects. Kernel Command Line Options for ``intel_pstate`` @@ -756,11 +771,11 @@ of them have to be prepended with the ``intel_pstate=`` prefix. processor is supported by it. ``active`` - Register ``intel_pstate`` in the `active mode <Active Mode_>`_ to start - with. + Register ``intel_pstate`` in the :ref:`active mode <active_mode>` to + start with. ``passive`` - Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to + Register ``intel_pstate`` in the :ref:`passive mode <passive_mode>` to start with. ``force`` @@ -793,12 +808,12 @@ of them have to be prepended with the ``intel_pstate=`` prefix. and this option has no effect. ``per_cpu_perf_limits`` - Use per-logical-CPU P-State limits (see `Coordination of P-state - Limits`_ for details). + Use per-logical-CPU P-State limits (see + :ref:`pstate_limits_coordination` for details). ``no_cas`` - Do not enable `capacity-aware scheduling <CAS_>`_ which is enabled by - default on hybrid systems without SMT. + Do not enable :ref:`capacity-aware scheduling <CAS>` which is enabled + by default on hybrid systems without SMT. Diagnostics and Tuning ====================== @@ -810,7 +825,7 @@ There are two static trace events that can be used for ``intel_pstate`` diagnostics. One of them is the ``cpu_frequency`` trace event generally used by ``CPUFreq``, and the other one is the ``pstate_sample`` trace event specific to ``intel_pstate``. Both of them are triggered by ``intel_pstate`` only if -it works in the `active mode <Active Mode_>`_. +it works in the :ref:`active mode <active_mode>`. The following sequence of shell commands can be used to enable them and see their output (if the kernel is generally configured to support event tracing):: @@ -822,7 +837,7 @@ their output (if the kernel is generally configured to support event tracing):: gnome-terminal--4510 [001] ..s. 1177.680733: pstate_sample: core_busy=107 scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618 freq=2474476 cat-5235 [002] ..s. 1177.681723: cpu_frequency: state=2900000 cpu_id=2 -If ``intel_pstate`` works in the `passive mode <Passive Mode_>`_, the +If ``intel_pstate`` works in the :ref:`passive mode <passive_mode>`, the ``cpu_frequency`` trace event will be triggered either by the ``schedutil`` scaling governor (for the policies it is attached to), or by the ``CPUFreq`` core (for the policies with other scaling governors). diff --git a/Documentation/admin-guide/thermal/index.rst b/Documentation/admin-guide/thermal/index.rst index 193b7b01a87d..e48bc0a1951b 100644 --- a/Documentation/admin-guide/thermal/index.rst +++ b/Documentation/admin-guide/thermal/index.rst @@ -6,3 +6,4 @@ Thermal Subsystem :maxdepth: 1 intel_powerclamp + intel_thermal_throttle diff --git a/Documentation/admin-guide/thermal/intel_thermal_throttle.rst b/Documentation/admin-guide/thermal/intel_thermal_throttle.rst new file mode 100644 index 000000000000..f4fbf9d5a4ec --- /dev/null +++ b/Documentation/admin-guide/thermal/intel_thermal_throttle.rst @@ -0,0 +1,91 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + +======================================= +Intel thermal throttle events reporting +======================================= + +:Author: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> + +Introduction +------------ + +Intel processors have built in automatic and adaptive thermal monitoring +mechanisms that force the processor to reduce its power consumption in order +to operate within predetermined temperature limits. + +Refer to section "THERMAL MONITORING AND PROTECTION" in the "Intel® 64 and +IA-32 Architectures Software Developer’s Manual Volume 3 (3A, 3B, 3C, & 3D): +System Programming Guide" for more details. + +In general, there are two mechanisms to control the core temperature of the +processor. They are called "Thermal Monitor 1 (TM1) and Thermal Monitor 2 (TM2)". + +The status of the temperature sensor that triggers the thermal monitor (TM1/TM2) +is indicated through the "thermal status flag" and "thermal status log flag" in +MSR_IA32_THERM_STATUS for core level and MSR_IA32_PACKAGE_THERM_STATUS for +package level. + +Thermal Status flag, bit 0 — When set, indicates that the processor core +temperature is currently at the trip temperature of the thermal monitor and that +the processor power consumption is being reduced via either TM1 or TM2, depending +on which is enabled. When clear, the flag indicates that the core temperature is +below the thermal monitor trip temperature. This flag is read only. + +Thermal Status Log flag, bit 1 — When set, indicates that the thermal sensor has +tripped since the last power-up or reset or since the last time that software +cleared this flag. This flag is a sticky bit; once set it remains set until +cleared by software or until a power-up or reset of the processor. The default +state is clear. + +It is possible that when user reads MSR_IA32_THERM_STATUS or +MSR_IA32_PACKAGE_THERM_STATUS, TM1/TM2 is not active. In this case, +"Thermal Status flag" will read "0" and the "Thermal Status Log flag" will be set +to show any previous "TM1/TM2" activation. But since it needs to be cleared by +the software, it can't show the number of occurrences of "TM1/TM2" activations. + +Hence, Linux provides counters of how many times the "Thermal Status flag" was +set. Also presents how long the "Thermal Status flag" was active in milliseconds. +Using these counters, users can check if the performance was limited because of +thermal events. It is recommended to read from sysfs instead of directly reading +MSRs as the "Thermal Status Log flag" is reset by the driver to implement rate +control. + +Sysfs Interface +--------------- + +Thermal throttling events are presented for each CPU under +"/sys/devices/system/cpu/cpuX/thermal_throttle/", where "X" is the CPU number. + +All these counters are read-only. They can't be reset to 0. So, they can potentially +overflow after reaching the maximum 64 bit unsigned integer. + +``core_throttle_count`` + Shows the number of times "Thermal Status flag" changed from 0 to 1 for this + CPU since OS boot and thermal vector is initialized. This is a 64 bit counter. + +``package_throttle_count`` + Shows the number of times "Thermal Status flag" changed from 0 to 1 for the + package containing this CPU since OS boot and thermal vector is initialized. + Package status is broadcast to all CPUs; all CPUs in the package increment + this count. This is a 64-bit counter. + +``core_throttle_max_time_ms`` + Shows the maximum amount of time for which "Thermal Status flag" has been + set to 1 for this CPU at the core level since OS boot and thermal vector + is initialized. + +``package_throttle_max_time_ms`` + Shows the maximum amount of time for which "Thermal Status flag" has been + set to 1 for the package containing this CPU since OS boot and thermal + vector is initialized. + +``core_throttle_total_time_ms`` + Shows the cumulative time for which "Thermal Status flag" has been + set to 1 for this CPU for core level since OS boot and thermal vector + is initialized. + +``package_throttle_total_time_ms`` + Shows the cumulative time for which "Thermal Status flag" has been set + to 1 for the package containing this CPU since OS boot and thermal vector + is initialized. |
