diff options
Diffstat (limited to 'tools/perf/Documentation')
| -rw-r--r-- | tools/perf/Documentation/perf-arm-spe.txt | 104 | ||||
| -rw-r--r-- | tools/perf/Documentation/perf-c2c.txt | 7 | ||||
| -rw-r--r-- | tools/perf/Documentation/perf-check.txt | 1 | ||||
| -rw-r--r-- | tools/perf/Documentation/perf-config.txt | 3 | ||||
| -rw-r--r-- | tools/perf/Documentation/perf-record.txt | 4 | ||||
| -rw-r--r-- | tools/perf/Documentation/perf-script.txt | 5 | ||||
| -rw-r--r-- | tools/perf/Documentation/perf-timechart.txt | 3 |
7 files changed, 117 insertions, 10 deletions
diff --git a/tools/perf/Documentation/perf-arm-spe.txt b/tools/perf/Documentation/perf-arm-spe.txt index cda8dd47fc4d..8b02e5b983fa 100644 --- a/tools/perf/Documentation/perf-arm-spe.txt +++ b/tools/perf/Documentation/perf-arm-spe.txt @@ -141,27 +141,65 @@ Config parameters These are placed between the // in the event and comma separated. For example '-e arm_spe/load_filter=1,min_latency=10/' - branch_filter=1 - collect branches only (PMSFCR.B) - event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below + event_filter=<mask> - logical AND filter on specific events (PMSEVFR) - see bitfield description below + inv_event_filter=<mask> - logical OR to filter out specific events (PMSNEVFR, FEAT_SPEv1p2) - see bitfield description below jitter=1 - use jitter to avoid resonance when sampling (PMSIRR.RND) - load_filter=1 - collect loads only (PMSFCR.LD) min_latency=<n> - collect only samples with this latency or higher* (PMSLATFR) pa_enable=1 - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege pct_enable=1 - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege - store_filter=1 - collect stores only (PMSFCR.ST) ts_enable=1 - enable timestamping with value of generic timer (PMSCR.TS) discard=1 - enable SPE PMU events but don't collect sample data - see 'Discard mode' (PMBLIMITR.FM = DISCARD) + inv_data_src_filter=<mask> - mask to filter from 0-63 possible data sources (PMSDSFR, FEAT_SPE_FDS) - See 'Data source filtering' +++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather than only the execution latency. -Only some events can be filtered on; these include: - - bit 1 - instruction retired (i.e. omit speculative instructions) +Only some events can be filtered on using 'event_filter' bits. The overall +filter is the logical AND of these bits, for example if bits 3 and 5 are set +only samples that have both 'L1D cache refill' AND 'TLB walk' are recorded. When +FEAT_SPEv1p2 is implemented 'inv_event_filter' can also be used to exclude +events that have any (OR) of the filter's bits set. For example setting bits 3 +and 5 in 'inv_event_filter' will exclude any events that are either L1D cache +refill OR TLB walk. If the same bit is set in both filters it's UNPREDICTABLE +whether the sample is included or excluded. Filter bits for both event_filter +and inv_event_filter are: + + bit 1 - Instruction retired (i.e. omit speculative instructions) + bit 2 - L1D access (FEAT_SPEv1p4) bit 3 - L1D refill + bit 4 - TLB access (FEAT_SPEv1p4) bit 5 - TLB refill - bit 7 - mispredict - bit 11 - misaligned access + bit 6 - Not taken event (FEAT_SPEv1p2) + bit 7 - Mispredict + bit 8 - Last level cache access (FEAT_SPEv1p4) + bit 9 - Last level cache miss (FEAT_SPEv1p4) + bit 10 - Remote access (FEAT_SPEv1p4) + bit 11 - Misaligned access (FEAT_SPEv1p1) + bit 12-15 - IMPLEMENTATION DEFINED events (when implemented) + bit 16 - Transaction (FEAT_TME) + bit 17 - Partial or empty SME or SVE predicate (FEAT_SPEv1p1) + bit 18 - Empty SME or SVE predicate (FEAT_SPEv1p1) + bit 19 - L2D access (FEAT_SPEv1p4) + bit 20 - L2D miss (FEAT_SPEv1p4) + bit 21 - Cache data modified (FEAT_SPEv1p4) + bit 22 - Recently fetched (FEAT_SPEv1p4) + bit 23 - Data snooped (FEAT_SPEv1p4) + bit 24 - Streaming SVE mode event (when FEAT_SPE_SME is implemented), or + IMPLEMENTATION DEFINED event 24 (when implemented, only versions + less than FEAT_SPEv1p4) + bit 25 - SMCU or external coprocessor operation event when FEAT_SPE_SME is + implemented, or IMPLEMENTATION DEFINED event 25 (when implemented, + only versions less than FEAT_SPEv1p4) + bit 26-31 - IMPLEMENTATION DEFINED events (only versions less than FEAT_SPEv1p4) + bit 48-63 - IMPLEMENTATION DEFINED events (when implemented) + +For IMPLEMENTATION DEFINED bits, refer to the CPU TRM if these bits are +implemented. + +The driver will reject events if requested filter bits require unimplemented SPE +versions, but will not reject filter bits for unimplemented IMPDEF bits or when +their related feature is not present (e.g. SME). For example, if FEAT_SPEv1p2 is +not implemented, filtering on "Not taken event" (bit 6) will be rejected. So to sample just retired instructions: @@ -171,6 +209,31 @@ or just mispredicted branches: perf record -e arm_spe/event_filter=0x80/ -- ./mybench +When set, the following filters can be used to select samples that match any of +the operation types (OR filtering). If only one is set then only samples of that +type are collected: + + branch_filter=1 - Collect branches (PMSFCR.B) + load_filter=1 - Collect loads (PMSFCR.LD) + store_filter=1 - Collect stores (PMSFCR.ST) + +When extended filtering is supported (FEAT_SPE_EFT), SIMD and float +pointer operations can also be selected: + + simd_filter=1 - Collect SIMD loads, stores and operations (PMSFCR.SIMD) + float_filter=1 - Collect floating point loads, stores and operations (PMSFCR.FP) + +When extended filtering is supported (FEAT_SPE_EFT), operation type filters can +be changed to AND using _mask fields. For example samples could be selected if +they are store AND SIMD by setting 'store_filter=1,simd_filter=1, +store_filter_mask=1,simd_filter_mask=1'. The new masks are as follows: + + branch_filter_mask=1 - Change branch filter behavior from OR to AND (PMSFCR.Bm) + load_filter_mask=1 - Change load filter behavior from OR to AND (PMSFCR.LDm) + store_filter_mask=1 - Change store filter behavior from OR to AND (PMSFCR.STm) + simd_filter_mask=1 - Change SIMD filter behavior from OR to AND (PMSFCR.SIMDm) + float_filter_mask=1 - Change floating point filter behavior from OR to AND (PMSFCR.FPm) + Viewing the data ~~~~~~~~~~~~~~~~~ @@ -210,6 +273,10 @@ Memory access details are also stored on the samples and this can be viewed with perf report --mem-mode +The latency value from the SPE sample is stored in the 'weight' field of the +Perf samples and can be displayed in Perf script and report outputs by enabling +its display from the command line. + Common errors ~~~~~~~~~~~~~ @@ -253,6 +320,25 @@ to minimize output. Then run perf stat: perf record -e arm_spe/discard/ -a -N -B --no-bpf-event -o - > /dev/null & perf stat -e SAMPLE_FEED_LD +Data source filtering +~~~~~~~~~~~~~~~~~~~~~ + +When FEAT_SPE_FDS is present, 'inv_data_src_filter' can be used as a mask to +filter on a subset (0 - 63) of possible data source IDs. The full range of data +sources is 0 - 65535 although these are unlikely to be used in practice. Data +sources are IMPDEF so refer to the TRM for the mappings. Each bit N of the +filter maps to data source N. The filter is an OR of all the bits, and the value +provided inv_data_src_filter is inverted before writing to PMSDSFR_EL1 so that +set bits exclude that data source and cleared bits include that data source. +Therefore the default value of 0 is equivalent to no filtering (all data sources +included). + +For example, to include only data sources 0 and 3, clear bits 0 and 3 +(0xFFFFFFFFFFFFFFF6) + +When 'inv_data_src_filter' is set to 0xFFFFFFFFFFFFFFFF, any samples with any +data source set are excluded. + SEE ALSO -------- diff --git a/tools/perf/Documentation/perf-c2c.txt b/tools/perf/Documentation/perf-c2c.txt index f4af2dd6ab31..40b0f71a2c44 100644 --- a/tools/perf/Documentation/perf-c2c.txt +++ b/tools/perf/Documentation/perf-c2c.txt @@ -143,6 +143,13 @@ REPORT OPTIONS feature, which causes cacheline sharing to behave like the cacheline size is doubled. +-M:: +--disassembler-style=:: + Set disassembler style for objdump. + +--objdump=<path>:: + Path to objdump binary. + C2C RECORD ---------- The perf c2c record command setup options related to HITM cacheline analysis diff --git a/tools/perf/Documentation/perf-check.txt b/tools/perf/Documentation/perf-check.txt index 4c9ccda6ce91..09e1d35677f5 100644 --- a/tools/perf/Documentation/perf-check.txt +++ b/tools/perf/Documentation/perf-check.txt @@ -50,7 +50,6 @@ feature:: dwarf / HAVE_LIBDW_SUPPORT dwarf_getlocations / HAVE_LIBDW_SUPPORT dwarf-unwind / HAVE_DWARF_UNWIND_SUPPORT - auxtrace / HAVE_AUXTRACE_SUPPORT libbfd / HAVE_LIBBFD_SUPPORT libbpf-strings / HAVE_LIBBPF_STRINGS_SUPPORT libcapstone / HAVE_LIBCAPSTONE_SUPPORT diff --git a/tools/perf/Documentation/perf-config.txt b/tools/perf/Documentation/perf-config.txt index c6f335659667..642d1c490d9e 100644 --- a/tools/perf/Documentation/perf-config.txt +++ b/tools/perf/Documentation/perf-config.txt @@ -452,6 +452,9 @@ call-graph.*:: kernel space is controlled not by this option but by the kernel config (CONFIG_UNWINDER_*). + The 'defer' mode can be used with 'fp' mode to enable deferred + user callchains (like 'fp,defer'). + call-graph.dump-size:: The size of stack to dump in order to do post-unwinding. Default is 8192 (byte). When using dwarf into record-mode, the default size will be used if omitted. diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt index 067891bd7da6..e8b9aadbbfa5 100644 --- a/tools/perf/Documentation/perf-record.txt +++ b/tools/perf/Documentation/perf-record.txt @@ -325,6 +325,10 @@ OPTIONS by default. User can change the number by passing it after comma like "--call-graph fp,32". + Also "defer" can be used with "fp" (like "--call-graph fp,defer") to + enable deferred user callchain which will collect user-space callchains + when the thread returns to the user space. + -q:: --quiet:: Don't print any warnings or messages, useful for scripting. diff --git a/tools/perf/Documentation/perf-script.txt b/tools/perf/Documentation/perf-script.txt index 28bec7e78bc8..03d112960632 100644 --- a/tools/perf/Documentation/perf-script.txt +++ b/tools/perf/Documentation/perf-script.txt @@ -527,6 +527,11 @@ include::itrace.txt[] The known limitations include exception handing such as setjmp/longjmp will have calls/returns not match. +--merge-callchains:: + Enable merging deferred user callchains if available. This is the + default behavior. If you want to see separate CALLCHAIN_DEFERRED + records for some reason, use --no-merge-callchains explicitly. + :GMEXAMPLECMD: script :GMEXAMPLESUBCMD: include::guest-files.txt[] diff --git a/tools/perf/Documentation/perf-timechart.txt b/tools/perf/Documentation/perf-timechart.txt index ef0c7565bd5c..ef2281c56743 100644 --- a/tools/perf/Documentation/perf-timechart.txt +++ b/tools/perf/Documentation/perf-timechart.txt @@ -94,6 +94,9 @@ RECORD OPTIONS -g:: --callchain:: Do call-graph (stack chain/backtrace) recording +-o:: +--output=:: + Select the output file (default: perf.data) EXAMPLES -------- |
