perf-mem(1)
===========

NAME
----
perf-mem - Profile memory accesses

SYNOPSIS
--------
[verse]
'perf mem' [<options>] (record [<command>] | report)

DESCRIPTION
-----------
"perf mem record" runs a command and gathers memory operation data
from it, into perf.data. Perf record options are accepted and are passed through.

"perf mem report" displays the result. It invokes perf report with the
right set of options to display a memory access profile. By default, loads
and stores are sampled. Use the -t option to limit to loads or stores.

Note that on Intel systems the memory latency reported is the use-latency,
not the pure load (or store) latency. Use-latency includes any pipeline
queuing delays in addition to the memory subsystem latency.

On Arm64 this uses SPE to sample load and store operations, therefore hardware
and kernel support is required. See linkperf:perf-arm-spe[1] for a setup guide.
Due to the statistical nature of SPE sampling, not every memory operation will
be sampled.

On AMD this uses the IBS Op PMU to sample load and store operations.

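As a rough illustration of how the pieces in the synopsis fit together, the following Python sketch only assembles argument lists and does not run anything. The helper names are hypothetical, and placing -t before the subcommand is an assumption taken from the synopsis ('perf mem' [<options>] ...), not a statement about how every perf version parses its arguments.

```python
from typing import List, Optional, Sequence


def perf_mem_record_argv(command: Sequence[str],
                         op_type: Optional[str] = None) -> List[str]:
    """Build an argv list for 'perf mem record' (hypothetical helper).

    Nothing is executed here; the list can be handed to subprocess.run()
    on a machine where perf is installed.
    """
    if op_type not in (None, "load", "store"):
        raise ValueError("op_type must be 'load', 'store' or None")
    argv = ["perf", "mem"]
    if op_type is not None:
        # -t limits sampling to loads or stores; by default both are sampled.
        argv += ["-t", op_type]
    argv.append("record")
    # The command to profile follows the subcommand, as in the synopsis.
    argv += list(command)
    return argv


def perf_mem_report_argv() -> List[str]:
    """Build an argv list for 'perf mem report' (hypothetical helper)."""
    return ["perf", "mem", "report"]


record_argv = perf_mem_record_argv(["./app"], op_type="load")
report_argv = perf_mem_report_argv()
```

Here `./app` is a placeholder for the workload to profile.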
COMMON OPTIONS
--------------
-f::
--force::
	Don't do ownership validation

-t::
--type=<type>::
	Select the memory operation type: load or store (default: load,store)

-v::
--verbose::
	Be more verbose (show counter open errors, etc)

-p::
--phys-data::
	Record/Report sample physical addresses

--data-page-size::
	Record/Report sample data address page size

RECORD OPTIONS
--------------
<command>...::
	Any command you can specify in a shell.

-e::
--event <event>::
	Event selector. Use 'perf mem record -e list' to list available events.

-K::
--all-kernel::
	Configure all used events to run in kernel space.

-U::
--all-user::
	Configure all used events to run in user space.

--ldlat <n>::
	Specify the desired latency for load events. Supported on Intel, Arm64 and
	some AMD processors. Ignored on other archs.

	On supported AMD processors:
	- /sys/bus/event_source/devices/ibs_op/caps/ldlat file contains '1'.
	- Supported latency values are 128 to 2048 (both inclusive).
	- A latency value that is a multiple of 128 incurs slightly less profiling
	  overhead than other values.
	- Load latency filtering is disabled by default.

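The AMD constraints listed for --ldlat can be checked before starting a recording. The Python sketch below encodes only the rules stated above (range 128 to 2048 inclusive, multiples of 128 being slightly cheaper, and the caps file containing '1'). The function names and return strings are hypothetical and are not part of perf.

```python
AMD_LDLAT_MIN = 128    # lowest supported value (inclusive)
AMD_LDLAT_MAX = 2048   # highest supported value (inclusive)
AMD_LDLAT_STEP = 128   # multiples of this incur slightly less overhead


def amd_ldlat_capable(caps_text: str) -> bool:
    """Interpret the text read from the ibs_op/caps/ldlat sysfs file.

    The caller is expected to read
    /sys/bus/event_source/devices/ibs_op/caps/ldlat and pass its contents.
    """
    return caps_text.strip() == "1"


def check_amd_ldlat(value: int) -> str:
    """Validate an --ldlat value against the documented AMD limits.

    Returns "aligned" for multiples of 128 and "unaligned" for other
    in-range values; raises ValueError for out-of-range values.
    """
    if not AMD_LDLAT_MIN <= value <= AMD_LDLAT_MAX:
        raise ValueError(
            f"ldlat {value} is outside {AMD_LDLAT_MIN}..{AMD_LDLAT_MAX}")
    if value % AMD_LDLAT_STEP == 0:
        return "aligned"
    return "unaligned"
```

For example, `check_amd_ldlat(256)` reports an aligned value, while `check_amd_ldlat(200)` is accepted but flagged as unaligned.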
REPORT OPTIONS
--------------
-i::
--input=<file>::
	Input file name.

-C::
--cpu=<cpu>::
	Monitor only on the list of CPUs provided. Multiple CPUs can be provided as a
	comma-separated list with no space: 0,1. Ranges of CPUs are specified with -
	like 0-2. Default is to monitor all CPUs.

-D::
--dump-raw-samples::
	Dump the raw decoded samples on the screen in a format that is easy to parse,
	with one sample per line.

-s::
--sort=<key>::
	Group results by the given key(s). Multiple keys can be specified
	in CSV format. The keys specific to memory samples are:
	symbol_daddr, symbol_iaddr, dso_daddr, locked, tlb, mem, snoop,
	dcacheline, phys_daddr, data_page_size, blocked.

	- symbol_daddr: name of the data symbol being accessed at the time of the sample
	- symbol_iaddr: name of the code symbol being executed at the time of the sample
	- dso_daddr: name of the library or module containing the data being accessed
	             at the time of the sample
	- locked: whether the bus was locked at the time of the sample
	- tlb: type of tlb access for the data at the time of the sample
	- mem: type of memory access for the data at the time of the sample
	- snoop: type of snoop (if any) for the data at the time of the sample
	- dcacheline: the cacheline the data address is on at the time of the sample
	- phys_daddr: physical address of the data being accessed at the time of the sample
	- data_page_size: the page size of the data being accessed at the time of the sample
	- blocked: reason for the blocked load access for the data at the time of the sample

	The default sort keys are changed to local_weight, mem, sym, dso,
	symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, local_ins_lat.

-F::
--fields=::
	Specify output field - multiple keys can be specified in CSV format.
	Please see linkperf:perf-report[1] for details.

	In addition to the default fields, 'perf mem report' will provide the
	following fields to break down sample periods.

	- op: operation in the sample instruction (load, store, prefetch, ...)
	- cache: location in CPU cache (L1, L2, ...) where the sample hit
	- mem: location in memory or other places the sample hit
	- dtlb: location in Data TLB (L1, L2) where the sample hit
	- snoop: snoop result for the sampled data access

	Please take a look at the OUTPUT FIELD SELECTION section for caveats.

-T::
--type-profile::
	Show data-type profile result instead of code symbols. This requires
	the debug information and it will change the default sort keys to:
	mem, snoop, tlb, type.

-U::
--hide-unresolved::
	Only display entries resolved to a symbol.

-x::
--field-separator=<separator>::
	Specify the field separator used when dumping raw samples (-D option). By default,
	the separator is the space character.

In addition, all perf report options are valid for report, and all perf record
options are valid for record.

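The CPU list syntax described for -C/--cpu (comma-separated values such as 0,1 and ranges such as 0-2) can be illustrated with a small Python sketch. This is not perf's own parser: the function name is hypothetical, and accepting a mix of single CPUs and ranges in one list is an assumption of this example.

```python
from typing import List


def parse_cpu_list(spec: str) -> List[int]:
    """Expand a CPU list such as "0,1" or "0-2" into sorted CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            # A range "first-last" covers both endpoints.
            first, last = part.split("-", 1)
            lo, hi = int(first), int(last)
            if lo > hi:
                raise ValueError(f"invalid CPU range: {part!r}")
            cpus.update(range(lo, hi + 1))
        else:
            # int() raises ValueError on empty or non-numeric entries.
            cpus.add(int(part))
    return sorted(cpus)
```

With this sketch, "0,1" expands to CPUs 0 and 1, and "0-2" expands to CPUs 0, 1 and 2.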
OVERHEAD CALCULATION
--------------------
Unlike linkperf:perf-report[1], which calculates overhead from the actual
sample period, perf-mem overhead is calculated using sample weight. For
example, suppose there are two samples in the perf.data file, both with the
same sample period, but one with weight 180 and the other with weight 20:

  $ perf script -F period,data_src,weight,ip,sym
  100000    629080842 |OP LOAD|LVL L3 hit|...     20       7e69b93ca524 strcmp
  100000   1a29081042 |OP LOAD|LVL RAM hit|...   180   ffffffff82429168 memcpy

  $ perf report -F overhead,symbol
  50%   [.] strcmp
  50%   [k] memcpy

  $ perf mem report -F overhead,symbol
  90%   [k] memcpy
  10%   [.] strcmp

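The arithmetic behind the two reports can be reproduced with a few lines of Python. This is only a sketch of the calculation, using the two samples from the example; the dictionary layout and the function name are illustrative and do not reflect perf's internal data structures.

```python
# The two samples from the example: same period, different weights.
samples = [
    {"sym": "strcmp", "period": 100000, "weight": 20},
    {"sym": "memcpy", "period": 100000, "weight": 180},
]


def overhead_percent(samples, key):
    """Return per-symbol overhead in percent, computed from the given field."""
    totals = {}
    for sample in samples:
        totals[sample["sym"]] = totals.get(sample["sym"], 0) + sample[key]
    grand_total = sum(totals.values())
    return {sym: 100.0 * value / grand_total for sym, value in totals.items()}


# Period-based overhead, as in "perf report": both symbols get 50%.
by_period = overhead_percent(samples, "period")

# Weight-based overhead, as in "perf mem report": memcpy 90%, strcmp 10%.
by_weight = overhead_percent(samples, "weight")
```

The weight-based result is 180 / (180 + 20) = 90% for memcpy and 20 / (180 + 20) = 10% for strcmp, even though both samples cover the same period.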
OUTPUT FIELD SELECTION
----------------------
"perf mem report" adds a number of new output fields specific to data source
information in the sample. Some of them have the same name as existing
sort keys ("mem" and "snoop"). So unlike other fields and sort keys, they
behave differently depending on whether they are used with -F/--fields or
-s/--sort.

Using those two as output fields will aggregate all samples together and show
a breakdown.

  $ perf mem report -F mem,snoop
  ...
  #  ------ Memory -------  --- Snoop ----
  #     RAM  Uncach  Other    HitM   Other
  #  .....................  ..............
  #
       3.5%    0.0%  96.5%   25.1%   74.9%

But using the same names as sort keys will aggregate samples for each type
separately.

  $ perf mem report -s mem,snoop
  # Overhead       Samples  Memory access                            Snoop
  # ........  ............  .......................................  ............
  #
      47.99%          1509  L2 hit                                   N/A
      25.08%           338  core, same node Any cache hit            HitM
      10.24%         54374  N/A                                      N/A
       6.77%         35938  L1 hit                                   N/A
       6.39%           101  core, same node Any cache hit            N/A
       3.50%            69  RAM hit                                  N/A
       0.03%           158  LFB/MAB hit                              N/A
       0.00%             2  Uncached hit                             N/A

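The difference between the two modes can be sketched in Python. The samples below are made up for illustration, and the functions are a simplified model of the behavior described above, not perf's implementation: the first computes an independent breakdown per column over all samples (as with -F), the second produces one row per distinct key combination (as with -s).

```python
from collections import defaultdict

# Hypothetical samples: (memory access, snoop result, weight).
samples = [
    ("L2 hit", "N/A", 50),
    ("RAM hit", "N/A", 10),
    ("RAM hit", "HitM", 15),
    ("L1 hit", "HitM", 25),
]
total_weight = sum(weight for _, _, weight in samples)


def field_breakdown(column):
    """-F style: all samples in one entry, broken down by one column."""
    weights = defaultdict(int)
    for sample in samples:
        weights[sample[column]] += sample[2]
    return {name: 100.0 * w / total_weight for name, w in weights.items()}


def sort_key_rows():
    """-s style: one row per distinct (mem, snoop) pair, largest first."""
    weights = defaultdict(int)
    for mem, snoop, weight in samples:
        weights[(mem, snoop)] += weight
    rows = [(mem, snoop, 100.0 * w / total_weight)
            for (mem, snoop), w in weights.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


mem_breakdown = field_breakdown(0)     # each breakdown sums to 100%
snoop_breakdown = field_breakdown(1)
rows = sort_key_rows()                 # rows together sum to 100%
```

In the breakdown, the "RAM hit" share combines both RAM samples regardless of their snoop result, while the sort-key rows keep "RAM hit / N/A" and "RAM hit / HitM" apart.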
SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-report[1], linkperf:perf-arm-spe[1]