linux/tools/perf/util/bpf_skel/lock_contention.bpf.c

567 lines
13 KiB
C
Raw Permalink Normal View History

// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
// Copyright (c) 2022 Google
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
perf lock contention: Do not try to update if hash map is full It doesn't delete data in the task_data and lock_stat maps. The data is kept there until it's consumed by userspace at the end. But it calls bpf_map_update_elem() again and again, and the data will be discarded if the map is full. This is not good. Worse, in the bpf_map_update_elem(), it keeps trying to get a new node even if the map was full. I guess it makes sense if it deletes some node like in the tstamp map (that's why I didn't make the change there). In a pre-allocated hash map, that means it'd iterate all CPU to check the freelist. And it has a bad performance impact on large machines. I've checked it on my 64 CPU machine with this. $ perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 2.825 [sec] And I used the task mode, so that it can guarantee the map is full. The default map entry size is 16K and this workload has 40K tasks. Before: $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 11.299 [sec] contended total wait max wait avg wait pid comm 19284 3.51 s 3.70 ms 181.91 us 1305863 sched-messaging 243 84.09 ms 466.67 us 346.04 us 1336608 sched-messaging 177 66.35 ms 12.08 ms 374.88 us 1220416 node For some reason, it didn't report the data failures. But you can see the total time in the workload is increased a lot (2.8 -> 11.3). If it fails early when the map is full, it goes back to normal. After: $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 3.044 [sec] contended total wait max wait avg wait pid comm 18743 591.92 ms 442.96 us 31.58 us 1431454 sched-messaging 51 210.64 ms 207.45 ms 4.13 ms 1468724 sched-messaging 81 68.61 ms 65.79 ms 847.07 us 1463183 sched-messaging === output for debug === bad: 1164137, total: 2253341 bad rate: 51.66 % histogram of failure reasons task: 0 stack: 0 time: 0 data: 1164137 Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Hao Luo <haoluo@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230406210611.1622492-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-04-06 14:06:11 -07:00
#include <asm-generic/errno-base.h>
#include "lock_data.h"
/* for collect_lock_syms(). 4096 was rejected by the verifier */
#define MAX_CPUS 1024
perf lock contention: Add -o/--lock-owner option When there're many lock contentions in the system, people sometimes want to know who caused the contention, IOW who's the owner of the locks. The -o/--lock-owner option tries to follow the lock owners for the contended mutexes and rwsems from BPF, and then attributes the contention time to the owner instead of the waiter. It's a best effort approach to get the owner info at the time of the contention and doesn't guarantee to have the precise tracking of owners if it's changing over time. Currently it only handles mutex and rwsem that have owner field in their struct and it basically points to a task_struct that owns the lock at the moment. Technically its type is atomic_long_t and it comes with some LSB bits used for other meanings. So it needs to clear them when casting it to a pointer to task_struct. Also the atomic_long_t is a typedef of the atomic 32 or 64 bit types depending on arch which is a wrapper struct for the counter value. I'm not aware of proper ways to access those kernel atomic types from BPF so I just read the internal counter value directly. Please let me know if there's a better way. When -o/--lock-owner option is used, it goes to the task aggregation mode like -t/--threads option does. However it cannot get the owner for other lock types like spinlock and sometimes even for mutex. $ sudo ./perf lock con -abo -- ./perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 4.766 [sec] 4.766540 usecs/op 209795 ops/sec contended total wait max wait avg wait pid owner 403 565.32 us 26.81 us 1.40 us -1 Unknown 4 27.99 us 8.57 us 7.00 us 1583145 sched-pipe 1 8.25 us 8.25 us 8.25 us 1583144 sched-pipe 1 2.03 us 2.03 us 2.03 us 5068 chrome As you can see, the owner is unknown for the most cases. But if we filter only for the mutex locks, it'd more likely get the onwers. $ sudo ./perf lock con -abo -Y mutex -- ./perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 4.910 [sec] 4.910435 usecs/op 203647 ops/sec contended total wait max wait avg wait pid owner 2 15.50 us 8.29 us 7.75 us 1582852 sched-pipe 7 7.20 us 2.47 us 1.03 us -1 Unknown 1 6.74 us 6.74 us 6.74 us 1582851 sched-pipe Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hao Luo <haoluo@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230207002403.63590-3-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-02-06 16:24:02 -08:00
/* lock contention flags from include/trace/events/lock.h */
#define LCB_F_SPIN (1U << 0)
#define LCB_F_READ (1U << 1)
#define LCB_F_WRITE (1U << 2)
#define LCB_F_RT (1U << 3)
#define LCB_F_PERCPU (1U << 4)
#define LCB_F_MUTEX (1U << 5)
/* callstack storage */
struct {
__uint(type, BPF_MAP_TYPE_STACK_TRACE);
__uint(key_size, sizeof(__u32));
__uint(value_size, sizeof(__u64));
__uint(max_entries, MAX_ENTRIES);
} stacks SEC(".maps");
/* maintain timestamp at the beginning of contention */
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__type(key, int);
__type(value, struct tstamp_data);
__uint(max_entries, MAX_ENTRIES);
} tstamp SEC(".maps");
/* maintain per-CPU timestamp at the beginning of contention */
struct {
__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
__uint(key_size, sizeof(__u32));
__uint(value_size, sizeof(struct tstamp_data));
__uint(max_entries, 1);
} tstamp_cpu SEC(".maps");
/* actual lock contention statistics */
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(key_size, sizeof(struct contention_key));
__uint(value_size, sizeof(struct contention_data));
__uint(max_entries, MAX_ENTRIES);
} lock_stat SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(key_size, sizeof(__u32));
__uint(value_size, sizeof(struct contention_task_data));
__uint(max_entries, MAX_ENTRIES);
} task_data SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(key_size, sizeof(__u64));
__uint(value_size, sizeof(__u32));
__uint(max_entries, MAX_ENTRIES);
} lock_syms SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(key_size, sizeof(__u32));
__uint(value_size, sizeof(__u8));
__uint(max_entries, 1);
} cpu_filter SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(key_size, sizeof(__u32));
__uint(value_size, sizeof(__u8));
__uint(max_entries, 1);
} task_filter SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(key_size, sizeof(__u32));
__uint(value_size, sizeof(__u8));
__uint(max_entries, 1);
} type_filter SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(key_size, sizeof(__u64));
__uint(value_size, sizeof(__u8));
__uint(max_entries, 1);
} addr_filter SEC(".maps");
perf lock contention: Add -G/--cgroup-filter option The -G/--cgroup-filter is to limit lock contention collection on the tasks in the specific cgroups only. $ sudo ./perf lock con -abt -G /user.slice/.../vte-spawn-52221fb8-b33f-4a52-b5c3-e35d1e6fc0e0.scope \ ./perf bench sched messaging # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 10 groups == 400 processes run Total time: 0.174 [sec] contended total wait max wait avg wait pid comm 4 114.45 us 60.06 us 28.61 us 214847 sched-messaging 2 111.40 us 60.84 us 55.70 us 214848 sched-messaging 2 106.09 us 59.42 us 53.04 us 214837 sched-messaging 1 81.70 us 81.70 us 81.70 us 214709 sched-messaging 68 78.44 us 6.83 us 1.15 us 214633 sched-messaging 69 73.71 us 2.69 us 1.07 us 214632 sched-messaging 4 72.62 us 60.83 us 18.15 us 214850 sched-messaging 2 71.75 us 67.60 us 35.88 us 214840 sched-messaging 2 69.29 us 67.53 us 34.65 us 214804 sched-messaging 2 69.00 us 68.23 us 34.50 us 214826 sched-messaging ... Export cgroup__new() function as it's needed from outside. Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Hao Luo <haoluo@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230906174903.346486-5-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-09-06 10:49:02 -07:00
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(key_size, sizeof(__u64));
__uint(value_size, sizeof(__u8));
__uint(max_entries, 1);
} cgroup_filter SEC(".maps");
struct rw_semaphore___old {
struct task_struct *owner;
} __attribute__((preserve_access_index));
struct rw_semaphore___new {
atomic_long_t owner;
} __attribute__((preserve_access_index));
perf lock contention: Track and show mmap_lock with address Sometimes there are severe contentions on the mmap_lock and we want see it in the -l/--lock-addr output. However it cannot symbolize the mmap_lock because it's allocated dynamically without symbols. Stephane and Hao gave me an idea separately to display mmap_lock by following the current->mm pointer. I added a flag to mark mmap_lock after comparing the lock address so that it can show them differently. With this change it can show mmap_lock like below: $ sudo ./perf lock con -abl -- sleep 10 contended total wait max wait avg wait address symbol ... 16344 312.30 ms 2.22 ms 19.11 us ffff8cc702595640 17686 310.08 ms 1.49 ms 17.53 us ffff8cc7025952c0 3 84.14 ms 45.79 ms 28.05 ms ffff8cc78114c478 mmap_lock 3557 76.80 ms 68.75 us 21.59 us ffff8cc77ca3af58 1 68.27 ms 68.27 ms 68.27 ms ffff8cda745dfd70 9 54.53 ms 7.96 ms 6.06 ms ffff8cc7642a48b8 mmap_lock 14629 44.01 ms 60.00 us 3.01 us ffff8cc7625f9ca0 3481 42.63 ms 140.71 us 12.24 us ffffffff937906ac vmap_area_lock 16194 38.73 ms 42.15 us 2.39 us ffff8cd397cbc560 11 38.44 ms 10.39 ms 3.49 ms ffff8ccd6d12fbb8 mmap_lock 1 5.43 ms 5.43 ms 5.43 ms ffff8cd70018f0d8 1674 5.38 ms 422.93 us 3.21 us ffffffff92e06080 tasklist_lock 581 4.51 ms 130.68 us 7.75 us ffff8cc9b1259058 5 3.52 ms 1.27 ms 703.23 us ffff8cc754510070 112 3.47 ms 56.47 us 31.02 us ffff8ccee38b3120 381 3.31 ms 73.44 us 8.69 us ffffffff93790690 purge_vmap_area_lock 255 3.19 ms 36.35 us 12.49 us ffff8d053ce30c80 Note that mmap_lock was renamed some time ago and it needs to support old kernels with a different name 'mmap_sem'. Suggested-by: Hao Luo <haoluo@google.com> Suggested-by: Stephane Eranian <eranian@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230313204825.2665483-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-03-13 13:48:22 -07:00
struct mm_struct___old {
struct rw_semaphore mmap_sem;
} __attribute__((preserve_access_index));
struct mm_struct___new {
struct rw_semaphore mmap_lock;
} __attribute__((preserve_access_index));
/* control flags */
perf lock contention: Constify control data for BPF The control knobs set before loading BPF programs should be declared as 'const volatile' so that it can be optimized by the BPF core. Committer testing: root@x1:~# perf lock contention --use-bpf contended total wait max wait avg wait type caller 5 31.57 us 14.93 us 6.31 us mutex btrfs_delayed_update_inode+0x43 1 16.91 us 16.91 us 16.91 us rwsem:R btrfs_tree_read_lock_nested+0x1b 1 15.13 us 15.13 us 15.13 us spinlock btrfs_getattr+0xd1 1 6.65 us 6.65 us 6.65 us rwsem:R btrfs_tree_read_lock_nested+0x1b 1 4.34 us 4.34 us 4.34 us spinlock process_one_work+0x1a9 root@x1:~# root@x1:~# perf trace -e bpf --max-events 10 perf lock contention --use-bpf 0.000 ( 0.013 ms): :2948281/2948281 bpf(cmd: 36, uattr: 0x7ffd5f12d730, size: 8) = -1 EOPNOTSUPP (Operation not supported) 0.024 ( 0.120 ms): :2948281/2948281 bpf(cmd: PROG_LOAD, uattr: 0x7ffd5f12d460, size: 148) = 16 0.158 ( 0.034 ms): :2948281/2948281 bpf(cmd: PROG_LOAD, uattr: 0x7ffd5f12d520, size: 148) = 16 26.653 ( 0.154 ms): perf/2948281 bpf(cmd: PROG_LOAD, uattr: 0x7ffd5f12d3d0, size: 148) = 16 26.825 ( 0.014 ms): perf/2948281 bpf(uattr: 0x7ffd5f12d580, size: 80) = 16 87.924 ( 0.038 ms): perf/2948281 bpf(cmd: BTF_LOAD, uattr: 0x7ffd5f12d400, size: 40) = 16 87.988 ( 0.006 ms): perf/2948281 bpf(cmd: BTF_LOAD, uattr: 0x7ffd5f12d470, size: 40) = 16 88.019 ( 0.006 ms): perf/2948281 bpf(cmd: BTF_LOAD, uattr: 0x7ffd5f12d250, size: 40) = 16 88.029 ( 0.172 ms): perf/2948281 bpf(cmd: PROG_LOAD, uattr: 0x7ffd5f12d320, size: 148) = 17 88.217 ( 0.005 ms): perf/2948281 bpf(cmd: BTF_LOAD, uattr: 0x7ffd5f12d4d0, size: 40) = 16 root@x1:~# Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240902200515.2103769-5-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-09-02 13:05:14 -07:00
const volatile int has_cpu;
const volatile int has_task;
const volatile int has_type;
const volatile int has_addr;
const volatile int has_cgroup;
const volatile int needs_callstack;
const volatile int stack_skip;
const volatile int lock_owner;
const volatile int use_cgroup_v2;
/* determine the key of lock stat */
perf lock contention: Constify control data for BPF The control knobs set before loading BPF programs should be declared as 'const volatile' so that it can be optimized by the BPF core. Committer testing: root@x1:~# perf lock contention --use-bpf contended total wait max wait avg wait type caller 5 31.57 us 14.93 us 6.31 us mutex btrfs_delayed_update_inode+0x43 1 16.91 us 16.91 us 16.91 us rwsem:R btrfs_tree_read_lock_nested+0x1b 1 15.13 us 15.13 us 15.13 us spinlock btrfs_getattr+0xd1 1 6.65 us 6.65 us 6.65 us rwsem:R btrfs_tree_read_lock_nested+0x1b 1 4.34 us 4.34 us 4.34 us spinlock process_one_work+0x1a9 root@x1:~# root@x1:~# perf trace -e bpf --max-events 10 perf lock contention --use-bpf 0.000 ( 0.013 ms): :2948281/2948281 bpf(cmd: 36, uattr: 0x7ffd5f12d730, size: 8) = -1 EOPNOTSUPP (Operation not supported) 0.024 ( 0.120 ms): :2948281/2948281 bpf(cmd: PROG_LOAD, uattr: 0x7ffd5f12d460, size: 148) = 16 0.158 ( 0.034 ms): :2948281/2948281 bpf(cmd: PROG_LOAD, uattr: 0x7ffd5f12d520, size: 148) = 16 26.653 ( 0.154 ms): perf/2948281 bpf(cmd: PROG_LOAD, uattr: 0x7ffd5f12d3d0, size: 148) = 16 26.825 ( 0.014 ms): perf/2948281 bpf(uattr: 0x7ffd5f12d580, size: 80) = 16 87.924 ( 0.038 ms): perf/2948281 bpf(cmd: BTF_LOAD, uattr: 0x7ffd5f12d400, size: 40) = 16 87.988 ( 0.006 ms): perf/2948281 bpf(cmd: BTF_LOAD, uattr: 0x7ffd5f12d470, size: 40) = 16 88.019 ( 0.006 ms): perf/2948281 bpf(cmd: BTF_LOAD, uattr: 0x7ffd5f12d250, size: 40) = 16 88.029 ( 0.172 ms): perf/2948281 bpf(cmd: PROG_LOAD, uattr: 0x7ffd5f12d320, size: 148) = 17 88.217 ( 0.005 ms): perf/2948281 bpf(cmd: BTF_LOAD, uattr: 0x7ffd5f12d4d0, size: 40) = 16 root@x1:~# Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240902200515.2103769-5-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-09-02 13:05:14 -07:00
const volatile int aggr_mode;
int enabled;
int perf_subsys_id = -1;
__u64 end_ts;
/* error stat */
int task_fail;
int stack_fail;
int time_fail;
int data_fail;
perf lock contention: Do not try to update if hash map is full It doesn't delete data in the task_data and lock_stat maps. The data is kept there until it's consumed by userspace at the end. But it calls bpf_map_update_elem() again and again, and the data will be discarded if the map is full. This is not good. Worse, in the bpf_map_update_elem(), it keeps trying to get a new node even if the map was full. I guess it makes sense if it deletes some node like in the tstamp map (that's why I didn't make the change there). In a pre-allocated hash map, that means it'd iterate all CPU to check the freelist. And it has a bad performance impact on large machines. I've checked it on my 64 CPU machine with this. $ perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 2.825 [sec] And I used the task mode, so that it can guarantee the map is full. The default map entry size is 16K and this workload has 40K tasks. Before: $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 11.299 [sec] contended total wait max wait avg wait pid comm 19284 3.51 s 3.70 ms 181.91 us 1305863 sched-messaging 243 84.09 ms 466.67 us 346.04 us 1336608 sched-messaging 177 66.35 ms 12.08 ms 374.88 us 1220416 node For some reason, it didn't report the data failures. But you can see the total time in the workload is increased a lot (2.8 -> 11.3). If it fails early when the map is full, it goes back to normal. After: $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 3.044 [sec] contended total wait max wait avg wait pid comm 18743 591.92 ms 442.96 us 31.58 us 1431454 sched-messaging 51 210.64 ms 207.45 ms 4.13 ms 1468724 sched-messaging 81 68.61 ms 65.79 ms 847.07 us 1463183 sched-messaging === output for debug === bad: 1164137, total: 2253341 bad rate: 51.66 % histogram of failure reasons task: 0 stack: 0 time: 0 data: 1164137 Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Hao Luo <haoluo@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230406210611.1622492-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-04-06 14:06:11 -07:00
int task_map_full;
int data_map_full;
static inline __u64 get_current_cgroup_id(void)
{
struct task_struct *task;
struct cgroup *cgrp;
if (use_cgroup_v2)
return bpf_get_current_cgroup_id();
task = bpf_get_current_task_btf();
if (perf_subsys_id == -1) {
#if __has_builtin(__builtin_preserve_enum_value)
perf_subsys_id = bpf_core_enum_value(enum cgroup_subsys_id,
perf_event_cgrp_id);
#else
perf_subsys_id = perf_event_cgrp_id;
#endif
}
cgrp = BPF_CORE_READ(task, cgroups, subsys[perf_subsys_id], cgroup);
return BPF_CORE_READ(cgrp, kn, id);
}
static inline int can_record(u64 *ctx)
{
if (has_cpu) {
__u32 cpu = bpf_get_smp_processor_id();
__u8 *ok;
ok = bpf_map_lookup_elem(&cpu_filter, &cpu);
if (!ok)
return 0;
}
if (has_task) {
__u8 *ok;
__u32 pid = bpf_get_current_pid_tgid();
ok = bpf_map_lookup_elem(&task_filter, &pid);
if (!ok)
return 0;
}
if (has_type) {
__u8 *ok;
__u32 flags = (__u32)ctx[1];
ok = bpf_map_lookup_elem(&type_filter, &flags);
if (!ok)
return 0;
}
if (has_addr) {
__u8 *ok;
__u64 addr = ctx[0];
ok = bpf_map_lookup_elem(&addr_filter, &addr);
if (!ok)
return 0;
}
perf lock contention: Add -G/--cgroup-filter option The -G/--cgroup-filter is to limit lock contention collection on the tasks in the specific cgroups only. $ sudo ./perf lock con -abt -G /user.slice/.../vte-spawn-52221fb8-b33f-4a52-b5c3-e35d1e6fc0e0.scope \ ./perf bench sched messaging # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 10 groups == 400 processes run Total time: 0.174 [sec] contended total wait max wait avg wait pid comm 4 114.45 us 60.06 us 28.61 us 214847 sched-messaging 2 111.40 us 60.84 us 55.70 us 214848 sched-messaging 2 106.09 us 59.42 us 53.04 us 214837 sched-messaging 1 81.70 us 81.70 us 81.70 us 214709 sched-messaging 68 78.44 us 6.83 us 1.15 us 214633 sched-messaging 69 73.71 us 2.69 us 1.07 us 214632 sched-messaging 4 72.62 us 60.83 us 18.15 us 214850 sched-messaging 2 71.75 us 67.60 us 35.88 us 214840 sched-messaging 2 69.29 us 67.53 us 34.65 us 214804 sched-messaging 2 69.00 us 68.23 us 34.50 us 214826 sched-messaging ... Export cgroup__new() function as it's needed from outside. Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Hao Luo <haoluo@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230906174903.346486-5-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-09-06 10:49:02 -07:00
if (has_cgroup) {
__u8 *ok;
__u64 cgrp = get_current_cgroup_id();
ok = bpf_map_lookup_elem(&cgroup_filter, &cgrp);
if (!ok)
return 0;
}
return 1;
}
perf lock contention: Add -o/--lock-owner option When there're many lock contentions in the system, people sometimes want to know who caused the contention, IOW who's the owner of the locks. The -o/--lock-owner option tries to follow the lock owners for the contended mutexes and rwsems from BPF, and then attributes the contention time to the owner instead of the waiter. It's a best effort approach to get the owner info at the time of the contention and doesn't guarantee to have the precise tracking of owners if it's changing over time. Currently it only handles mutex and rwsem that have owner field in their struct and it basically points to a task_struct that owns the lock at the moment. Technically its type is atomic_long_t and it comes with some LSB bits used for other meanings. So it needs to clear them when casting it to a pointer to task_struct. Also the atomic_long_t is a typedef of the atomic 32 or 64 bit types depending on arch which is a wrapper struct for the counter value. I'm not aware of proper ways to access those kernel atomic types from BPF so I just read the internal counter value directly. Please let me know if there's a better way. When -o/--lock-owner option is used, it goes to the task aggregation mode like -t/--threads option does. However it cannot get the owner for other lock types like spinlock and sometimes even for mutex. $ sudo ./perf lock con -abo -- ./perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 4.766 [sec] 4.766540 usecs/op 209795 ops/sec contended total wait max wait avg wait pid owner 403 565.32 us 26.81 us 1.40 us -1 Unknown 4 27.99 us 8.57 us 7.00 us 1583145 sched-pipe 1 8.25 us 8.25 us 8.25 us 1583144 sched-pipe 1 2.03 us 2.03 us 2.03 us 5068 chrome As you can see, the owner is unknown for the most cases. But if we filter only for the mutex locks, it'd more likely get the onwers. $ sudo ./perf lock con -abo -Y mutex -- ./perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 4.910 [sec] 4.910435 usecs/op 203647 ops/sec contended total wait max wait avg wait pid owner 2 15.50 us 8.29 us 7.75 us 1582852 sched-pipe 7 7.20 us 2.47 us 1.03 us -1 Unknown 1 6.74 us 6.74 us 6.74 us 1582851 sched-pipe Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hao Luo <haoluo@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230207002403.63590-3-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-02-06 16:24:02 -08:00
static inline int update_task_data(struct task_struct *task)
{
struct contention_task_data *p;
perf lock contention: Add -o/--lock-owner option When there're many lock contentions in the system, people sometimes want to know who caused the contention, IOW who's the owner of the locks. The -o/--lock-owner option tries to follow the lock owners for the contended mutexes and rwsems from BPF, and then attributes the contention time to the owner instead of the waiter. It's a best effort approach to get the owner info at the time of the contention and doesn't guarantee to have the precise tracking of owners if it's changing over time. Currently it only handles mutex and rwsem that have owner field in their struct and it basically points to a task_struct that owns the lock at the moment. Technically its type is atomic_long_t and it comes with some LSB bits used for other meanings. So it needs to clear them when casting it to a pointer to task_struct. Also the atomic_long_t is a typedef of the atomic 32 or 64 bit types depending on arch which is a wrapper struct for the counter value. I'm not aware of proper ways to access those kernel atomic types from BPF so I just read the internal counter value directly. Please let me know if there's a better way. When -o/--lock-owner option is used, it goes to the task aggregation mode like -t/--threads option does. However it cannot get the owner for other lock types like spinlock and sometimes even for mutex. $ sudo ./perf lock con -abo -- ./perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 4.766 [sec] 4.766540 usecs/op 209795 ops/sec contended total wait max wait avg wait pid owner 403 565.32 us 26.81 us 1.40 us -1 Unknown 4 27.99 us 8.57 us 7.00 us 1583145 sched-pipe 1 8.25 us 8.25 us 8.25 us 1583144 sched-pipe 1 2.03 us 2.03 us 2.03 us 5068 chrome As you can see, the owner is unknown for the most cases. But if we filter only for the mutex locks, it'd more likely get the onwers. $ sudo ./perf lock con -abo -Y mutex -- ./perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 4.910 [sec] 4.910435 usecs/op 203647 ops/sec contended total wait max wait avg wait pid owner 2 15.50 us 8.29 us 7.75 us 1582852 sched-pipe 7 7.20 us 2.47 us 1.03 us -1 Unknown 1 6.74 us 6.74 us 6.74 us 1582851 sched-pipe Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hao Luo <haoluo@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230207002403.63590-3-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-02-06 16:24:02 -08:00
int pid, err;
err = bpf_core_read(&pid, sizeof(pid), &task->pid);
if (err)
return -1;
p = bpf_map_lookup_elem(&task_data, &pid);
perf lock contention: Do not try to update if hash map is full It doesn't delete data in the task_data and lock_stat maps. The data is kept there until it's consumed by userspace at the end. But it calls bpf_map_update_elem() again and again, and the data will be discarded if the map is full. This is not good. Worse, in the bpf_map_update_elem(), it keeps trying to get a new node even if the map was full. I guess it makes sense if it deletes some node like in the tstamp map (that's why I didn't make the change there). In a pre-allocated hash map, that means it'd iterate all CPU to check the freelist. And it has a bad performance impact on large machines. I've checked it on my 64 CPU machine with this. $ perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 2.825 [sec] And I used the task mode, so that it can guarantee the map is full. The default map entry size is 16K and this workload has 40K tasks. Before: $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 11.299 [sec] contended total wait max wait avg wait pid comm 19284 3.51 s 3.70 ms 181.91 us 1305863 sched-messaging 243 84.09 ms 466.67 us 346.04 us 1336608 sched-messaging 177 66.35 ms 12.08 ms 374.88 us 1220416 node For some reason, it didn't report the data failures. But you can see the total time in the workload is increased a lot (2.8 -> 11.3). If it fails early when the map is full, it goes back to normal. After: $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 3.044 [sec] contended total wait max wait avg wait pid comm 18743 591.92 ms 442.96 us 31.58 us 1431454 sched-messaging 51 210.64 ms 207.45 ms 4.13 ms 1468724 sched-messaging 81 68.61 ms 65.79 ms 847.07 us 1463183 sched-messaging === output for debug === bad: 1164137, total: 2253341 bad rate: 51.66 % histogram of failure reasons task: 0 stack: 0 time: 0 data: 1164137 Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Hao Luo <haoluo@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230406210611.1622492-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-04-06 14:06:11 -07:00
if (p == NULL && !task_map_full) {
perf lock contention: Add -o/--lock-owner option When there're many lock contentions in the system, people sometimes want to know who caused the contention, IOW who's the owner of the locks. The -o/--lock-owner option tries to follow the lock owners for the contended mutexes and rwsems from BPF, and then attributes the contention time to the owner instead of the waiter. It's a best effort approach to get the owner info at the time of the contention and doesn't guarantee to have the precise tracking of owners if it's changing over time. Currently it only handles mutex and rwsem that have owner field in their struct and it basically points to a task_struct that owns the lock at the moment. Technically its type is atomic_long_t and it comes with some LSB bits used for other meanings. So it needs to clear them when casting it to a pointer to task_struct. Also the atomic_long_t is a typedef of the atomic 32 or 64 bit types depending on arch which is a wrapper struct for the counter value. I'm not aware of proper ways to access those kernel atomic types from BPF so I just read the internal counter value directly. Please let me know if there's a better way. When -o/--lock-owner option is used, it goes to the task aggregation mode like -t/--threads option does. However it cannot get the owner for other lock types like spinlock and sometimes even for mutex. $ sudo ./perf lock con -abo -- ./perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 4.766 [sec] 4.766540 usecs/op 209795 ops/sec contended total wait max wait avg wait pid owner 403 565.32 us 26.81 us 1.40 us -1 Unknown 4 27.99 us 8.57 us 7.00 us 1583145 sched-pipe 1 8.25 us 8.25 us 8.25 us 1583144 sched-pipe 1 2.03 us 2.03 us 2.03 us 5068 chrome As you can see, the owner is unknown for the most cases. But if we filter only for the mutex locks, it'd more likely get the onwers. $ sudo ./perf lock con -abo -Y mutex -- ./perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 4.910 [sec] 4.910435 usecs/op 203647 ops/sec contended total wait max wait avg wait pid owner 2 15.50 us 8.29 us 7.75 us 1582852 sched-pipe 7 7.20 us 2.47 us 1.03 us -1 Unknown 1 6.74 us 6.74 us 6.74 us 1582851 sched-pipe Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hao Luo <haoluo@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230207002403.63590-3-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-02-06 16:24:02 -08:00
struct contention_task_data data = {};
perf lock contention: Add -o/--lock-owner option When there're many lock contentions in the system, people sometimes want to know who caused the contention, IOW who's the owner of the locks. The -o/--lock-owner option tries to follow the lock owners for the contended mutexes and rwsems from BPF, and then attributes the contention time to the owner instead of the waiter. It's a best effort approach to get the owner info at the time of the contention and doesn't guarantee to have the precise tracking of owners if it's changing over time. Currently it only handles mutex and rwsem that have owner field in their struct and it basically points to a task_struct that owns the lock at the moment. Technically its type is atomic_long_t and it comes with some LSB bits used for other meanings. So it needs to clear them when casting it to a pointer to task_struct. Also the atomic_long_t is a typedef of the atomic 32 or 64 bit types depending on arch which is a wrapper struct for the counter value. I'm not aware of proper ways to access those kernel atomic types from BPF so I just read the internal counter value directly. Please let me know if there's a better way. When -o/--lock-owner option is used, it goes to the task aggregation mode like -t/--threads option does. However it cannot get the owner for other lock types like spinlock and sometimes even for mutex. $ sudo ./perf lock con -abo -- ./perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 4.766 [sec] 4.766540 usecs/op 209795 ops/sec contended total wait max wait avg wait pid owner 403 565.32 us 26.81 us 1.40 us -1 Unknown 4 27.99 us 8.57 us 7.00 us 1583145 sched-pipe 1 8.25 us 8.25 us 8.25 us 1583144 sched-pipe 1 2.03 us 2.03 us 2.03 us 5068 chrome As you can see, the owner is unknown for the most cases. But if we filter only for the mutex locks, it'd more likely get the onwers. $ sudo ./perf lock con -abo -Y mutex -- ./perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 4.910 [sec] 4.910435 usecs/op 203647 ops/sec contended total wait max wait avg wait pid owner 2 15.50 us 8.29 us 7.75 us 1582852 sched-pipe 7 7.20 us 2.47 us 1.03 us -1 Unknown 1 6.74 us 6.74 us 6.74 us 1582851 sched-pipe Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hao Luo <haoluo@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230207002403.63590-3-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-02-06 16:24:02 -08:00
BPF_CORE_READ_STR_INTO(&data.comm, task, comm);
perf lock contention: Do not try to update if hash map is full It doesn't delete data in the task_data and lock_stat maps. The data is kept there until it's consumed by userspace at the end. But it calls bpf_map_update_elem() again and again, and the data will be discarded if the map is full. This is not good. Worse, in the bpf_map_update_elem(), it keeps trying to get a new node even if the map was full. I guess it makes sense if it deletes some node like in the tstamp map (that's why I didn't make the change there). In a pre-allocated hash map, that means it'd iterate all CPU to check the freelist. And it has a bad performance impact on large machines. I've checked it on my 64 CPU machine with this. $ perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 2.825 [sec] And I used the task mode, so that it can guarantee the map is full. The default map entry size is 16K and this workload has 40K tasks. Before: $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 11.299 [sec] contended total wait max wait avg wait pid comm 19284 3.51 s 3.70 ms 181.91 us 1305863 sched-messaging 243 84.09 ms 466.67 us 346.04 us 1336608 sched-messaging 177 66.35 ms 12.08 ms 374.88 us 1220416 node For some reason, it didn't report the data failures. But you can see the total time in the workload is increased a lot (2.8 -> 11.3). If it fails early when the map is full, it goes back to normal. After: $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 3.044 [sec] contended total wait max wait avg wait pid comm 18743 591.92 ms 442.96 us 31.58 us 1431454 sched-messaging 51 210.64 ms 207.45 ms 4.13 ms 1468724 sched-messaging 81 68.61 ms 65.79 ms 847.07 us 1463183 sched-messaging === output for debug === bad: 1164137, total: 2253341 bad rate: 51.66 % histogram of failure reasons task: 0 stack: 0 time: 0 data: 1164137 Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Hao Luo <haoluo@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230406210611.1622492-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-04-06 14:06:11 -07:00
if (bpf_map_update_elem(&task_data, &pid, &data, BPF_NOEXIST) == -E2BIG)
task_map_full = 1;
}
perf lock contention: Add -o/--lock-owner option When there're many lock contentions in the system, people sometimes want to know who caused the contention, IOW who's the owner of the locks. The -o/--lock-owner option tries to follow the lock owners for the contended mutexes and rwsems from BPF, and then attributes the contention time to the owner instead of the waiter. It's a best effort approach to get the owner info at the time of the contention and doesn't guarantee to have the precise tracking of owners if it's changing over time. Currently it only handles mutex and rwsem that have owner field in their struct and it basically points to a task_struct that owns the lock at the moment. Technically its type is atomic_long_t and it comes with some LSB bits used for other meanings. So it needs to clear them when casting it to a pointer to task_struct. Also the atomic_long_t is a typedef of the atomic 32 or 64 bit types depending on arch which is a wrapper struct for the counter value. I'm not aware of proper ways to access those kernel atomic types from BPF so I just read the internal counter value directly. Please let me know if there's a better way. When -o/--lock-owner option is used, it goes to the task aggregation mode like -t/--threads option does. However it cannot get the owner for other lock types like spinlock and sometimes even for mutex. $ sudo ./perf lock con -abo -- ./perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 4.766 [sec] 4.766540 usecs/op 209795 ops/sec contended total wait max wait avg wait pid owner 403 565.32 us 26.81 us 1.40 us -1 Unknown 4 27.99 us 8.57 us 7.00 us 1583145 sched-pipe 1 8.25 us 8.25 us 8.25 us 1583144 sched-pipe 1 2.03 us 2.03 us 2.03 us 5068 chrome As you can see, the owner is unknown for the most cases. But if we filter only for the mutex locks, it'd more likely get the onwers. $ sudo ./perf lock con -abo -Y mutex -- ./perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 4.910 [sec] 4.910435 usecs/op 203647 ops/sec contended total wait max wait avg wait pid owner 2 15.50 us 8.29 us 7.75 us 1582852 sched-pipe 7 7.20 us 2.47 us 1.03 us -1 Unknown 1 6.74 us 6.74 us 6.74 us 1582851 sched-pipe Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hao Luo <haoluo@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230207002403.63590-3-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-02-06 16:24:02 -08:00
return 0;
}
#ifndef __has_builtin
# define __has_builtin(x) 0
#endif
static inline struct task_struct *get_lock_owner(__u64 lock, __u32 flags)
{
struct task_struct *task;
__u64 owner = 0;
if (flags & LCB_F_MUTEX) {
struct mutex *mutex = (void *)lock;
owner = BPF_CORE_READ(mutex, owner.counter);
} else if (flags == LCB_F_READ || flags == LCB_F_WRITE) {
/*
* Support for the BPF_TYPE_MATCHES argument to the
* __builtin_preserve_type_info builtin was added at some point during
* development of clang 15 and it's what is needed for
* bpf_core_type_matches.
*/
#if __has_builtin(__builtin_preserve_type_info) && __clang_major__ >= 15
if (bpf_core_type_matches(struct rw_semaphore___old)) {
struct rw_semaphore___old *rwsem = (void *)lock;
owner = (unsigned long)BPF_CORE_READ(rwsem, owner);
} else if (bpf_core_type_matches(struct rw_semaphore___new)) {
struct rw_semaphore___new *rwsem = (void *)lock;
owner = BPF_CORE_READ(rwsem, owner.counter);
}
#else
/* assume new struct */
struct rw_semaphore *rwsem = (void *)lock;
owner = BPF_CORE_READ(rwsem, owner.counter);
#endif
}
if (!owner)
return NULL;
task = (void *)(owner & ~7UL);
return task;
}
perf lock contention: Track and show mmap_lock with address Sometimes there are severe contentions on the mmap_lock and we want see it in the -l/--lock-addr output. However it cannot symbolize the mmap_lock because it's allocated dynamically without symbols. Stephane and Hao gave me an idea separately to display mmap_lock by following the current->mm pointer. I added a flag to mark mmap_lock after comparing the lock address so that it can show them differently. With this change it can show mmap_lock like below: $ sudo ./perf lock con -abl -- sleep 10 contended total wait max wait avg wait address symbol ... 16344 312.30 ms 2.22 ms 19.11 us ffff8cc702595640 17686 310.08 ms 1.49 ms 17.53 us ffff8cc7025952c0 3 84.14 ms 45.79 ms 28.05 ms ffff8cc78114c478 mmap_lock 3557 76.80 ms 68.75 us 21.59 us ffff8cc77ca3af58 1 68.27 ms 68.27 ms 68.27 ms ffff8cda745dfd70 9 54.53 ms 7.96 ms 6.06 ms ffff8cc7642a48b8 mmap_lock 14629 44.01 ms 60.00 us 3.01 us ffff8cc7625f9ca0 3481 42.63 ms 140.71 us 12.24 us ffffffff937906ac vmap_area_lock 16194 38.73 ms 42.15 us 2.39 us ffff8cd397cbc560 11 38.44 ms 10.39 ms 3.49 ms ffff8ccd6d12fbb8 mmap_lock 1 5.43 ms 5.43 ms 5.43 ms ffff8cd70018f0d8 1674 5.38 ms 422.93 us 3.21 us ffffffff92e06080 tasklist_lock 581 4.51 ms 130.68 us 7.75 us ffff8cc9b1259058 5 3.52 ms 1.27 ms 703.23 us ffff8cc754510070 112 3.47 ms 56.47 us 31.02 us ffff8ccee38b3120 381 3.31 ms 73.44 us 8.69 us ffffffff93790690 purge_vmap_area_lock 255 3.19 ms 36.35 us 12.49 us ffff8d053ce30c80 Note that mmap_lock was renamed some time ago and it needs to support old kernels with a different name 'mmap_sem'. Suggested-by: Hao Luo <haoluo@google.com> Suggested-by: Stephane Eranian <eranian@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230313204825.2665483-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-03-13 13:48:22 -07:00
static inline __u32 check_lock_type(__u64 lock, __u32 flags)
{
struct task_struct *curr;
struct mm_struct___old *mm_old;
struct mm_struct___new *mm_new;
struct sighand_struct *sighand;
perf lock contention: Track and show mmap_lock with address Sometimes there are severe contentions on the mmap_lock and we want see it in the -l/--lock-addr output. However it cannot symbolize the mmap_lock because it's allocated dynamically without symbols. Stephane and Hao gave me an idea separately to display mmap_lock by following the current->mm pointer. I added a flag to mark mmap_lock after comparing the lock address so that it can show them differently. With this change it can show mmap_lock like below: $ sudo ./perf lock con -abl -- sleep 10 contended total wait max wait avg wait address symbol ... 16344 312.30 ms 2.22 ms 19.11 us ffff8cc702595640 17686 310.08 ms 1.49 ms 17.53 us ffff8cc7025952c0 3 84.14 ms 45.79 ms 28.05 ms ffff8cc78114c478 mmap_lock 3557 76.80 ms 68.75 us 21.59 us ffff8cc77ca3af58 1 68.27 ms 68.27 ms 68.27 ms ffff8cda745dfd70 9 54.53 ms 7.96 ms 6.06 ms ffff8cc7642a48b8 mmap_lock 14629 44.01 ms 60.00 us 3.01 us ffff8cc7625f9ca0 3481 42.63 ms 140.71 us 12.24 us ffffffff937906ac vmap_area_lock 16194 38.73 ms 42.15 us 2.39 us ffff8cd397cbc560 11 38.44 ms 10.39 ms 3.49 ms ffff8ccd6d12fbb8 mmap_lock 1 5.43 ms 5.43 ms 5.43 ms ffff8cd70018f0d8 1674 5.38 ms 422.93 us 3.21 us ffffffff92e06080 tasklist_lock 581 4.51 ms 130.68 us 7.75 us ffff8cc9b1259058 5 3.52 ms 1.27 ms 703.23 us ffff8cc754510070 112 3.47 ms 56.47 us 31.02 us ffff8ccee38b3120 381 3.31 ms 73.44 us 8.69 us ffffffff93790690 purge_vmap_area_lock 255 3.19 ms 36.35 us 12.49 us ffff8d053ce30c80 Note that mmap_lock was renamed some time ago and it needs to support old kernels with a different name 'mmap_sem'. Suggested-by: Hao Luo <haoluo@google.com> Suggested-by: Stephane Eranian <eranian@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230313204825.2665483-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-03-13 13:48:22 -07:00
switch (flags) {
case LCB_F_READ: /* rwsem */
case LCB_F_WRITE:
curr = bpf_get_current_task_btf();
if (curr->mm == NULL)
break;
mm_new = (void *)curr->mm;
if (bpf_core_field_exists(mm_new->mmap_lock)) {
if (&mm_new->mmap_lock == (void *)lock)
return LCD_F_MMAP_LOCK;
break;
}
mm_old = (void *)curr->mm;
if (bpf_core_field_exists(mm_old->mmap_sem)) {
if (&mm_old->mmap_sem == (void *)lock)
return LCD_F_MMAP_LOCK;
}
break;
case LCB_F_SPIN: /* spinlock */
curr = bpf_get_current_task_btf();
sighand = curr->sighand;
if (sighand && &sighand->siglock == (void *)lock)
return LCD_F_SIGHAND_LOCK;
break;
perf lock contention: Track and show mmap_lock with address Sometimes there are severe contentions on the mmap_lock and we want see it in the -l/--lock-addr output. However it cannot symbolize the mmap_lock because it's allocated dynamically without symbols. Stephane and Hao gave me an idea separately to display mmap_lock by following the current->mm pointer. I added a flag to mark mmap_lock after comparing the lock address so that it can show them differently. With this change it can show mmap_lock like below: $ sudo ./perf lock con -abl -- sleep 10 contended total wait max wait avg wait address symbol ... 16344 312.30 ms 2.22 ms 19.11 us ffff8cc702595640 17686 310.08 ms 1.49 ms 17.53 us ffff8cc7025952c0 3 84.14 ms 45.79 ms 28.05 ms ffff8cc78114c478 mmap_lock 3557 76.80 ms 68.75 us 21.59 us ffff8cc77ca3af58 1 68.27 ms 68.27 ms 68.27 ms ffff8cda745dfd70 9 54.53 ms 7.96 ms 6.06 ms ffff8cc7642a48b8 mmap_lock 14629 44.01 ms 60.00 us 3.01 us ffff8cc7625f9ca0 3481 42.63 ms 140.71 us 12.24 us ffffffff937906ac vmap_area_lock 16194 38.73 ms 42.15 us 2.39 us ffff8cd397cbc560 11 38.44 ms 10.39 ms 3.49 ms ffff8ccd6d12fbb8 mmap_lock 1 5.43 ms 5.43 ms 5.43 ms ffff8cd70018f0d8 1674 5.38 ms 422.93 us 3.21 us ffffffff92e06080 tasklist_lock 581 4.51 ms 130.68 us 7.75 us ffff8cc9b1259058 5 3.52 ms 1.27 ms 703.23 us ffff8cc754510070 112 3.47 ms 56.47 us 31.02 us ffff8ccee38b3120 381 3.31 ms 73.44 us 8.69 us ffffffff93790690 purge_vmap_area_lock 255 3.19 ms 36.35 us 12.49 us ffff8d053ce30c80 Note that mmap_lock was renamed some time ago and it needs to support old kernels with a different name 'mmap_sem'. Suggested-by: Hao Luo <haoluo@google.com> Suggested-by: Stephane Eranian <eranian@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230313204825.2665483-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-03-13 13:48:22 -07:00
default:
break;
}
return 0;
}
static inline struct tstamp_data *get_tstamp_elem(__u32 flags)
{
__u32 pid;
struct tstamp_data *pelem;
/* Use per-cpu array map for spinlock and rwlock */
if ((flags & (LCB_F_SPIN | LCB_F_MUTEX)) == LCB_F_SPIN) {
__u32 idx = 0;
pelem = bpf_map_lookup_elem(&tstamp_cpu, &idx);
/* Do not update the element for nested locks */
if (pelem && pelem->lock)
pelem = NULL;
return pelem;
}
pid = bpf_get_current_pid_tgid();
pelem = bpf_map_lookup_elem(&tstamp, &pid);
/* Do not update the element for nested locks */
if (pelem && pelem->lock)
return NULL;
if (pelem == NULL) {
struct tstamp_data zero = {};
if (bpf_map_update_elem(&tstamp, &pid, &zero, BPF_NOEXIST) < 0) {
__sync_fetch_and_add(&task_fail, 1);
return NULL;
}
pelem = bpf_map_lookup_elem(&tstamp, &pid);
if (pelem == NULL) {
__sync_fetch_and_add(&task_fail, 1);
return NULL;
}
}
return pelem;
}
SEC("tp_btf/contention_begin")
int contention_begin(u64 *ctx)
{
struct tstamp_data *pelem;
if (!enabled || !can_record(ctx))
return 0;
pelem = get_tstamp_elem(ctx[1]);
if (pelem == NULL)
return 0;
pelem->timestamp = bpf_ktime_get_ns();
pelem->lock = (__u64)ctx[0];
pelem->flags = (__u32)ctx[1];
perf lock contention: Support filters for different aggregation It'd be useful to filter other than the current aggregation mode. For example, users may want to see callstacks for specific locks only. Or they may want tasks from a certain callstack. The tracepoints already collected the information but it needs to check the condition again when processing the event. And it needs to change BPF to allow the key combinations. The lock contentions on 'rcu_state' spinlock can be monitored: $ sudo perf lock con -abv -L rcu_state sleep 1 ... contended total wait max wait avg wait type caller 4 151.39 us 62.57 us 37.85 us spinlock rcu_core+0xcb 0xffffffff81fd1666 _raw_spin_lock_irqsave+0x46 0xffffffff8172d76b rcu_core+0xcb 0xffffffff822000eb __softirqentry_text_start+0xeb 0xffffffff816a0ba9 __irq_exit_rcu+0xc9 0xffffffff81fc0112 sysvec_apic_timer_interrupt+0xa2 0xffffffff82000e46 asm_sysvec_apic_timer_interrupt+0x16 0xffffffff81d49f78 cpuidle_enter_state+0xd8 0xffffffff81d4a259 cpuidle_enter+0x29 1 30.21 us 30.21 us 30.21 us spinlock rcu_core+0xcb 0xffffffff81fd1666 _raw_spin_lock_irqsave+0x46 0xffffffff8172d76b rcu_core+0xcb 0xffffffff822000eb __softirqentry_text_start+0xeb 0xffffffff816a0ba9 __irq_exit_rcu+0xc9 0xffffffff81fc00c4 sysvec_apic_timer_interrupt+0x54 0xffffffff82000e46 asm_sysvec_apic_timer_interrupt+0x16 1 28.84 us 28.84 us 28.84 us spinlock rcu_accelerate_cbs_unlocked+0x40 0xffffffff81fd1c60 _raw_spin_lock+0x30 0xffffffff81728cf0 rcu_accelerate_cbs_unlocked+0x40 0xffffffff8172da82 rcu_core+0x3e2 0xffffffff822000eb __softirqentry_text_start+0xeb 0xffffffff816a0ba9 __irq_exit_rcu+0xc9 0xffffffff81fc0112 sysvec_apic_timer_interrupt+0xa2 0xffffffff82000e46 asm_sysvec_apic_timer_interrupt+0x16 0xffffffff81d49f78 cpuidle_enter_state+0xd8 ... To see tasks calling 'rcu_core' function: $ sudo perf lock con -abt -S rcu_core sleep 1 contended total wait max wait avg wait pid comm 19 23.46 us 2.21 us 1.23 us 0 swapper 2 18.37 us 17.01 us 9.19 us 2061859 ThreadPoolForeg 3 5.76 us 1.97 us 1.92 us 3909 pipewire-pulse 1 2.26 us 2.26 us 2.26 us 1809271 MediaSu~isor #2 1 1.97 us 1.97 us 1.97 us 1514882 Chrome_ChildIOT 1 987 ns 987 ns 987 ns 3740 pipewire-pulse Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Hao Luo <haoluo@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230203021324.143540-4-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-02-02 18:13:24 -08:00
if (needs_callstack) {
pelem->stack_id = bpf_get_stackid(ctx, &stacks,
BPF_F_FAST_STACK_CMP | stack_skip);
if (pelem->stack_id < 0)
__sync_fetch_and_add(&stack_fail, 1);
perf lock contention: Add -o/--lock-owner option When there're many lock contentions in the system, people sometimes want to know who caused the contention, IOW who's the owner of the locks. The -o/--lock-owner option tries to follow the lock owners for the contended mutexes and rwsems from BPF, and then attributes the contention time to the owner instead of the waiter. It's a best effort approach to get the owner info at the time of the contention and doesn't guarantee to have the precise tracking of owners if it's changing over time. Currently it only handles mutex and rwsem that have owner field in their struct and it basically points to a task_struct that owns the lock at the moment. Technically its type is atomic_long_t and it comes with some LSB bits used for other meanings. So it needs to clear them when casting it to a pointer to task_struct. Also the atomic_long_t is a typedef of the atomic 32 or 64 bit types depending on arch which is a wrapper struct for the counter value. I'm not aware of proper ways to access those kernel atomic types from BPF so I just read the internal counter value directly. Please let me know if there's a better way. When -o/--lock-owner option is used, it goes to the task aggregation mode like -t/--threads option does. However it cannot get the owner for other lock types like spinlock and sometimes even for mutex. $ sudo ./perf lock con -abo -- ./perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 4.766 [sec] 4.766540 usecs/op 209795 ops/sec contended total wait max wait avg wait pid owner 403 565.32 us 26.81 us 1.40 us -1 Unknown 4 27.99 us 8.57 us 7.00 us 1583145 sched-pipe 1 8.25 us 8.25 us 8.25 us 1583144 sched-pipe 1 2.03 us 2.03 us 2.03 us 5068 chrome As you can see, the owner is unknown for the most cases. But if we filter only for the mutex locks, it'd more likely get the onwers. $ sudo ./perf lock con -abo -Y mutex -- ./perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 4.910 [sec] 4.910435 usecs/op 203647 ops/sec contended total wait max wait avg wait pid owner 2 15.50 us 8.29 us 7.75 us 1582852 sched-pipe 7 7.20 us 2.47 us 1.03 us -1 Unknown 1 6.74 us 6.74 us 6.74 us 1582851 sched-pipe Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hao Luo <haoluo@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230207002403.63590-3-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-02-06 16:24:02 -08:00
} else if (aggr_mode == LOCK_AGGR_TASK) {
struct task_struct *task;
if (lock_owner) {
task = get_lock_owner(pelem->lock, pelem->flags);
perf lock contention: Add -o/--lock-owner option When there're many lock contentions in the system, people sometimes want to know who caused the contention, IOW who's the owner of the locks. The -o/--lock-owner option tries to follow the lock owners for the contended mutexes and rwsems from BPF, and then attributes the contention time to the owner instead of the waiter. It's a best effort approach to get the owner info at the time of the contention and doesn't guarantee to have the precise tracking of owners if it's changing over time. Currently it only handles mutex and rwsem that have owner field in their struct and it basically points to a task_struct that owns the lock at the moment. Technically its type is atomic_long_t and it comes with some LSB bits used for other meanings. So it needs to clear them when casting it to a pointer to task_struct. Also the atomic_long_t is a typedef of the atomic 32 or 64 bit types depending on arch which is a wrapper struct for the counter value. I'm not aware of proper ways to access those kernel atomic types from BPF so I just read the internal counter value directly. Please let me know if there's a better way. When -o/--lock-owner option is used, it goes to the task aggregation mode like -t/--threads option does. However it cannot get the owner for other lock types like spinlock and sometimes even for mutex. $ sudo ./perf lock con -abo -- ./perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 4.766 [sec] 4.766540 usecs/op 209795 ops/sec contended total wait max wait avg wait pid owner 403 565.32 us 26.81 us 1.40 us -1 Unknown 4 27.99 us 8.57 us 7.00 us 1583145 sched-pipe 1 8.25 us 8.25 us 8.25 us 1583144 sched-pipe 1 2.03 us 2.03 us 2.03 us 5068 chrome As you can see, the owner is unknown for the most cases. But if we filter only for the mutex locks, it'd more likely get the onwers. $ sudo ./perf lock con -abo -Y mutex -- ./perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 4.910 [sec] 4.910435 usecs/op 203647 ops/sec contended total wait max wait avg wait pid owner 2 15.50 us 8.29 us 7.75 us 1582852 sched-pipe 7 7.20 us 2.47 us 1.03 us -1 Unknown 1 6.74 us 6.74 us 6.74 us 1582851 sched-pipe Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hao Luo <haoluo@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230207002403.63590-3-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-02-06 16:24:02 -08:00
/* The flags is not used anymore. Pass the owner pid. */
if (task)
pelem->flags = BPF_CORE_READ(task, pid);
else
pelem->flags = -1U;
} else {
task = bpf_get_current_task_btf();
}
if (task) {
if (update_task_data(task) < 0 && lock_owner)
pelem->flags = -1U;
}
}
return 0;
}
SEC("tp_btf/contention_end")
int contention_end(u64 *ctx)
{
__u32 pid = 0, idx = 0;
struct tstamp_data *pelem;
perf lock contention: Support filters for different aggregation It'd be useful to filter other than the current aggregation mode. For example, users may want to see callstacks for specific locks only. Or they may want tasks from a certain callstack. The tracepoints already collected the information but it needs to check the condition again when processing the event. And it needs to change BPF to allow the key combinations. The lock contentions on 'rcu_state' spinlock can be monitored: $ sudo perf lock con -abv -L rcu_state sleep 1 ... contended total wait max wait avg wait type caller 4 151.39 us 62.57 us 37.85 us spinlock rcu_core+0xcb 0xffffffff81fd1666 _raw_spin_lock_irqsave+0x46 0xffffffff8172d76b rcu_core+0xcb 0xffffffff822000eb __softirqentry_text_start+0xeb 0xffffffff816a0ba9 __irq_exit_rcu+0xc9 0xffffffff81fc0112 sysvec_apic_timer_interrupt+0xa2 0xffffffff82000e46 asm_sysvec_apic_timer_interrupt+0x16 0xffffffff81d49f78 cpuidle_enter_state+0xd8 0xffffffff81d4a259 cpuidle_enter+0x29 1 30.21 us 30.21 us 30.21 us spinlock rcu_core+0xcb 0xffffffff81fd1666 _raw_spin_lock_irqsave+0x46 0xffffffff8172d76b rcu_core+0xcb 0xffffffff822000eb __softirqentry_text_start+0xeb 0xffffffff816a0ba9 __irq_exit_rcu+0xc9 0xffffffff81fc00c4 sysvec_apic_timer_interrupt+0x54 0xffffffff82000e46 asm_sysvec_apic_timer_interrupt+0x16 1 28.84 us 28.84 us 28.84 us spinlock rcu_accelerate_cbs_unlocked+0x40 0xffffffff81fd1c60 _raw_spin_lock+0x30 0xffffffff81728cf0 rcu_accelerate_cbs_unlocked+0x40 0xffffffff8172da82 rcu_core+0x3e2 0xffffffff822000eb __softirqentry_text_start+0xeb 0xffffffff816a0ba9 __irq_exit_rcu+0xc9 0xffffffff81fc0112 sysvec_apic_timer_interrupt+0xa2 0xffffffff82000e46 asm_sysvec_apic_timer_interrupt+0x16 0xffffffff81d49f78 cpuidle_enter_state+0xd8 ... To see tasks calling 'rcu_core' function: $ sudo perf lock con -abt -S rcu_core sleep 1 contended total wait max wait avg wait pid comm 19 23.46 us 2.21 us 1.23 us 0 swapper 2 18.37 us 17.01 us 9.19 us 2061859 ThreadPoolForeg 3 5.76 us 1.97 us 1.92 us 3909 pipewire-pulse 1 2.26 us 2.26 us 2.26 us 1809271 MediaSu~isor #2 1 1.97 us 1.97 us 1.97 us 1514882 Chrome_ChildIOT 1 987 ns 987 ns 987 ns 3740 pipewire-pulse Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Hao Luo <haoluo@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230203021324.143540-4-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-02-02 18:13:24 -08:00
struct contention_key key = {};
struct contention_data *data;
__u64 duration;
bool need_delete = false;
if (!enabled)
return 0;
/*
* For spinlock and rwlock, it needs to get the timestamp for the
* per-cpu map. However, contention_end does not have the flags
* so it cannot know whether it reads percpu or hash map.
*
* Try per-cpu map first and check if there's active contention.
* If it is, do not read hash map because it cannot go to sleeping
* locks before releasing the spinning locks.
*/
pelem = bpf_map_lookup_elem(&tstamp_cpu, &idx);
if (pelem && pelem->lock) {
if (pelem->lock != ctx[0])
return 0;
} else {
pid = bpf_get_current_pid_tgid();
pelem = bpf_map_lookup_elem(&tstamp, &pid);
if (!pelem || pelem->lock != ctx[0])
return 0;
need_delete = true;
}
duration = bpf_ktime_get_ns() - pelem->timestamp;
if ((__s64)duration < 0) {
__sync_fetch_and_add(&time_fail, 1);
goto out;
}
switch (aggr_mode) {
case LOCK_AGGR_CALLER:
perf lock contention: Support filters for different aggregation It'd be useful to filter other than the current aggregation mode. For example, users may want to see callstacks for specific locks only. Or they may want tasks from a certain callstack. The tracepoints already collected the information but it needs to check the condition again when processing the event. And it needs to change BPF to allow the key combinations. The lock contentions on 'rcu_state' spinlock can be monitored: $ sudo perf lock con -abv -L rcu_state sleep 1 ... contended total wait max wait avg wait type caller 4 151.39 us 62.57 us 37.85 us spinlock rcu_core+0xcb 0xffffffff81fd1666 _raw_spin_lock_irqsave+0x46 0xffffffff8172d76b rcu_core+0xcb 0xffffffff822000eb __softirqentry_text_start+0xeb 0xffffffff816a0ba9 __irq_exit_rcu+0xc9 0xffffffff81fc0112 sysvec_apic_timer_interrupt+0xa2 0xffffffff82000e46 asm_sysvec_apic_timer_interrupt+0x16 0xffffffff81d49f78 cpuidle_enter_state+0xd8 0xffffffff81d4a259 cpuidle_enter+0x29 1 30.21 us 30.21 us 30.21 us spinlock rcu_core+0xcb 0xffffffff81fd1666 _raw_spin_lock_irqsave+0x46 0xffffffff8172d76b rcu_core+0xcb 0xffffffff822000eb __softirqentry_text_start+0xeb 0xffffffff816a0ba9 __irq_exit_rcu+0xc9 0xffffffff81fc00c4 sysvec_apic_timer_interrupt+0x54 0xffffffff82000e46 asm_sysvec_apic_timer_interrupt+0x16 1 28.84 us 28.84 us 28.84 us spinlock rcu_accelerate_cbs_unlocked+0x40 0xffffffff81fd1c60 _raw_spin_lock+0x30 0xffffffff81728cf0 rcu_accelerate_cbs_unlocked+0x40 0xffffffff8172da82 rcu_core+0x3e2 0xffffffff822000eb __softirqentry_text_start+0xeb 0xffffffff816a0ba9 __irq_exit_rcu+0xc9 0xffffffff81fc0112 sysvec_apic_timer_interrupt+0xa2 0xffffffff82000e46 asm_sysvec_apic_timer_interrupt+0x16 0xffffffff81d49f78 cpuidle_enter_state+0xd8 ... To see tasks calling 'rcu_core' function: $ sudo perf lock con -abt -S rcu_core sleep 1 contended total wait max wait avg wait pid comm 19 23.46 us 2.21 us 1.23 us 0 swapper 2 18.37 us 17.01 us 9.19 us 2061859 ThreadPoolForeg 3 5.76 us 1.97 us 1.92 us 3909 pipewire-pulse 1 2.26 us 2.26 us 2.26 us 1809271 MediaSu~isor #2 1 1.97 us 1.97 us 1.97 us 1514882 Chrome_ChildIOT 1 987 ns 987 ns 987 ns 3740 pipewire-pulse Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Hao Luo <haoluo@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230203021324.143540-4-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-02-02 18:13:24 -08:00
key.stack_id = pelem->stack_id;
break;
case LOCK_AGGR_TASK:
perf lock contention: Add -o/--lock-owner option When there're many lock contentions in the system, people sometimes want to know who caused the contention, IOW who's the owner of the locks. The -o/--lock-owner option tries to follow the lock owners for the contended mutexes and rwsems from BPF, and then attributes the contention time to the owner instead of the waiter. It's a best effort approach to get the owner info at the time of the contention and doesn't guarantee to have the precise tracking of owners if it's changing over time. Currently it only handles mutex and rwsem that have owner field in their struct and it basically points to a task_struct that owns the lock at the moment. Technically its type is atomic_long_t and it comes with some LSB bits used for other meanings. So it needs to clear them when casting it to a pointer to task_struct. Also the atomic_long_t is a typedef of the atomic 32 or 64 bit types depending on arch which is a wrapper struct for the counter value. I'm not aware of proper ways to access those kernel atomic types from BPF so I just read the internal counter value directly. Please let me know if there's a better way. When -o/--lock-owner option is used, it goes to the task aggregation mode like -t/--threads option does. However it cannot get the owner for other lock types like spinlock and sometimes even for mutex. $ sudo ./perf lock con -abo -- ./perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 4.766 [sec] 4.766540 usecs/op 209795 ops/sec contended total wait max wait avg wait pid owner 403 565.32 us 26.81 us 1.40 us -1 Unknown 4 27.99 us 8.57 us 7.00 us 1583145 sched-pipe 1 8.25 us 8.25 us 8.25 us 1583144 sched-pipe 1 2.03 us 2.03 us 2.03 us 5068 chrome As you can see, the owner is unknown for the most cases. But if we filter only for the mutex locks, it'd more likely get the onwers. $ sudo ./perf lock con -abo -Y mutex -- ./perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 4.910 [sec] 4.910435 usecs/op 203647 ops/sec contended total wait max wait avg wait pid owner 2 15.50 us 8.29 us 7.75 us 1582852 sched-pipe 7 7.20 us 2.47 us 1.03 us -1 Unknown 1 6.74 us 6.74 us 6.74 us 1582851 sched-pipe Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hao Luo <haoluo@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230207002403.63590-3-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-02-06 16:24:02 -08:00
if (lock_owner)
key.pid = pelem->flags;
else {
if (!need_delete)
pid = bpf_get_current_pid_tgid();
perf lock contention: Add -o/--lock-owner option When there're many lock contentions in the system, people sometimes want to know who caused the contention, IOW who's the owner of the locks. The -o/--lock-owner option tries to follow the lock owners for the contended mutexes and rwsems from BPF, and then attributes the contention time to the owner instead of the waiter. It's a best effort approach to get the owner info at the time of the contention and doesn't guarantee to have the precise tracking of owners if it's changing over time. Currently it only handles mutex and rwsem that have owner field in their struct and it basically points to a task_struct that owns the lock at the moment. Technically its type is atomic_long_t and it comes with some LSB bits used for other meanings. So it needs to clear them when casting it to a pointer to task_struct. Also the atomic_long_t is a typedef of the atomic 32 or 64 bit types depending on arch which is a wrapper struct for the counter value. I'm not aware of proper ways to access those kernel atomic types from BPF so I just read the internal counter value directly. Please let me know if there's a better way. When -o/--lock-owner option is used, it goes to the task aggregation mode like -t/--threads option does. However it cannot get the owner for other lock types like spinlock and sometimes even for mutex. $ sudo ./perf lock con -abo -- ./perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 4.766 [sec] 4.766540 usecs/op 209795 ops/sec contended total wait max wait avg wait pid owner 403 565.32 us 26.81 us 1.40 us -1 Unknown 4 27.99 us 8.57 us 7.00 us 1583145 sched-pipe 1 8.25 us 8.25 us 8.25 us 1583144 sched-pipe 1 2.03 us 2.03 us 2.03 us 5068 chrome As you can see, the owner is unknown for the most cases. But if we filter only for the mutex locks, it'd more likely get the onwers. $ sudo ./perf lock con -abo -Y mutex -- ./perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 4.910 [sec] 4.910435 usecs/op 203647 ops/sec contended total wait max wait avg wait pid owner 2 15.50 us 8.29 us 7.75 us 1582852 sched-pipe 7 7.20 us 2.47 us 1.03 us -1 Unknown 1 6.74 us 6.74 us 6.74 us 1582851 sched-pipe Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hao Luo <haoluo@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230207002403.63590-3-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-02-06 16:24:02 -08:00
key.pid = pid;
}
perf lock contention: Support filters for different aggregation It'd be useful to filter other than the current aggregation mode. For example, users may want to see callstacks for specific locks only. Or they may want tasks from a certain callstack. The tracepoints already collected the information but it needs to check the condition again when processing the event. And it needs to change BPF to allow the key combinations. The lock contentions on 'rcu_state' spinlock can be monitored: $ sudo perf lock con -abv -L rcu_state sleep 1 ... contended total wait max wait avg wait type caller 4 151.39 us 62.57 us 37.85 us spinlock rcu_core+0xcb 0xffffffff81fd1666 _raw_spin_lock_irqsave+0x46 0xffffffff8172d76b rcu_core+0xcb 0xffffffff822000eb __softirqentry_text_start+0xeb 0xffffffff816a0ba9 __irq_exit_rcu+0xc9 0xffffffff81fc0112 sysvec_apic_timer_interrupt+0xa2 0xffffffff82000e46 asm_sysvec_apic_timer_interrupt+0x16 0xffffffff81d49f78 cpuidle_enter_state+0xd8 0xffffffff81d4a259 cpuidle_enter+0x29 1 30.21 us 30.21 us 30.21 us spinlock rcu_core+0xcb 0xffffffff81fd1666 _raw_spin_lock_irqsave+0x46 0xffffffff8172d76b rcu_core+0xcb 0xffffffff822000eb __softirqentry_text_start+0xeb 0xffffffff816a0ba9 __irq_exit_rcu+0xc9 0xffffffff81fc00c4 sysvec_apic_timer_interrupt+0x54 0xffffffff82000e46 asm_sysvec_apic_timer_interrupt+0x16 1 28.84 us 28.84 us 28.84 us spinlock rcu_accelerate_cbs_unlocked+0x40 0xffffffff81fd1c60 _raw_spin_lock+0x30 0xffffffff81728cf0 rcu_accelerate_cbs_unlocked+0x40 0xffffffff8172da82 rcu_core+0x3e2 0xffffffff822000eb __softirqentry_text_start+0xeb 0xffffffff816a0ba9 __irq_exit_rcu+0xc9 0xffffffff81fc0112 sysvec_apic_timer_interrupt+0xa2 0xffffffff82000e46 asm_sysvec_apic_timer_interrupt+0x16 0xffffffff81d49f78 cpuidle_enter_state+0xd8 ... To see tasks calling 'rcu_core' function: $ sudo perf lock con -abt -S rcu_core sleep 1 contended total wait max wait avg wait pid comm 19 23.46 us 2.21 us 1.23 us 0 swapper 2 18.37 us 17.01 us 9.19 us 2061859 ThreadPoolForeg 3 5.76 us 1.97 us 1.92 us 3909 pipewire-pulse 1 2.26 us 2.26 us 2.26 us 1809271 MediaSu~isor #2 1 1.97 us 1.97 us 1.97 us 1514882 Chrome_ChildIOT 1 987 ns 987 ns 987 ns 3740 pipewire-pulse Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Hao Luo <haoluo@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230203021324.143540-4-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-02-02 18:13:24 -08:00
if (needs_callstack)
key.stack_id = pelem->stack_id;
break;
case LOCK_AGGR_ADDR:
key.lock_addr_or_cgroup = pelem->lock;
perf lock contention: Support filters for different aggregation It'd be useful to filter other than the current aggregation mode. For example, users may want to see callstacks for specific locks only. Or they may want tasks from a certain callstack. The tracepoints already collected the information but it needs to check the condition again when processing the event. And it needs to change BPF to allow the key combinations. The lock contentions on 'rcu_state' spinlock can be monitored: $ sudo perf lock con -abv -L rcu_state sleep 1 ... contended total wait max wait avg wait type caller 4 151.39 us 62.57 us 37.85 us spinlock rcu_core+0xcb 0xffffffff81fd1666 _raw_spin_lock_irqsave+0x46 0xffffffff8172d76b rcu_core+0xcb 0xffffffff822000eb __softirqentry_text_start+0xeb 0xffffffff816a0ba9 __irq_exit_rcu+0xc9 0xffffffff81fc0112 sysvec_apic_timer_interrupt+0xa2 0xffffffff82000e46 asm_sysvec_apic_timer_interrupt+0x16 0xffffffff81d49f78 cpuidle_enter_state+0xd8 0xffffffff81d4a259 cpuidle_enter+0x29 1 30.21 us 30.21 us 30.21 us spinlock rcu_core+0xcb 0xffffffff81fd1666 _raw_spin_lock_irqsave+0x46 0xffffffff8172d76b rcu_core+0xcb 0xffffffff822000eb __softirqentry_text_start+0xeb 0xffffffff816a0ba9 __irq_exit_rcu+0xc9 0xffffffff81fc00c4 sysvec_apic_timer_interrupt+0x54 0xffffffff82000e46 asm_sysvec_apic_timer_interrupt+0x16 1 28.84 us 28.84 us 28.84 us spinlock rcu_accelerate_cbs_unlocked+0x40 0xffffffff81fd1c60 _raw_spin_lock+0x30 0xffffffff81728cf0 rcu_accelerate_cbs_unlocked+0x40 0xffffffff8172da82 rcu_core+0x3e2 0xffffffff822000eb __softirqentry_text_start+0xeb 0xffffffff816a0ba9 __irq_exit_rcu+0xc9 0xffffffff81fc0112 sysvec_apic_timer_interrupt+0xa2 0xffffffff82000e46 asm_sysvec_apic_timer_interrupt+0x16 0xffffffff81d49f78 cpuidle_enter_state+0xd8 ... To see tasks calling 'rcu_core' function: $ sudo perf lock con -abt -S rcu_core sleep 1 contended total wait max wait avg wait pid comm 19 23.46 us 2.21 us 1.23 us 0 swapper 2 18.37 us 17.01 us 9.19 us 2061859 ThreadPoolForeg 3 5.76 us 1.97 us 1.92 us 3909 pipewire-pulse 1 2.26 us 2.26 us 2.26 us 1809271 MediaSu~isor #2 1 1.97 us 1.97 us 1.97 us 1514882 Chrome_ChildIOT 1 987 ns 987 ns 987 ns 3740 pipewire-pulse Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Hao Luo <haoluo@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230203021324.143540-4-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-02-02 18:13:24 -08:00
if (needs_callstack)
key.stack_id = pelem->stack_id;
break;
case LOCK_AGGR_CGROUP:
key.lock_addr_or_cgroup = get_current_cgroup_id();
break;
default:
/* should not happen */
return 0;
}
data = bpf_map_lookup_elem(&lock_stat, &key);
if (!data) {
perf lock contention: Do not try to update if hash map is full It doesn't delete data in the task_data and lock_stat maps. The data is kept there until it's consumed by userspace at the end. But it calls bpf_map_update_elem() again and again, and the data will be discarded if the map is full. This is not good. Worse, in the bpf_map_update_elem(), it keeps trying to get a new node even if the map was full. I guess it makes sense if it deletes some node like in the tstamp map (that's why I didn't make the change there). In a pre-allocated hash map, that means it'd iterate all CPU to check the freelist. And it has a bad performance impact on large machines. I've checked it on my 64 CPU machine with this. $ perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 2.825 [sec] And I used the task mode, so that it can guarantee the map is full. The default map entry size is 16K and this workload has 40K tasks. Before: $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 11.299 [sec] contended total wait max wait avg wait pid comm 19284 3.51 s 3.70 ms 181.91 us 1305863 sched-messaging 243 84.09 ms 466.67 us 346.04 us 1336608 sched-messaging 177 66.35 ms 12.08 ms 374.88 us 1220416 node For some reason, it didn't report the data failures. But you can see the total time in the workload is increased a lot (2.8 -> 11.3). If it fails early when the map is full, it goes back to normal. After: $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 3.044 [sec] contended total wait max wait avg wait pid comm 18743 591.92 ms 442.96 us 31.58 us 1431454 sched-messaging 51 210.64 ms 207.45 ms 4.13 ms 1468724 sched-messaging 81 68.61 ms 65.79 ms 847.07 us 1463183 sched-messaging === output for debug === bad: 1164137, total: 2253341 bad rate: 51.66 % histogram of failure reasons task: 0 stack: 0 time: 0 data: 1164137 Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Hao Luo <haoluo@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230406210611.1622492-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-04-06 14:06:11 -07:00
if (data_map_full) {
__sync_fetch_and_add(&data_fail, 1);
goto out;
perf lock contention: Do not try to update if hash map is full It doesn't delete data in the task_data and lock_stat maps. The data is kept there until it's consumed by userspace at the end. But it calls bpf_map_update_elem() again and again, and the data will be discarded if the map is full. This is not good. Worse, in the bpf_map_update_elem(), it keeps trying to get a new node even if the map was full. I guess it makes sense if it deletes some node like in the tstamp map (that's why I didn't make the change there). In a pre-allocated hash map, that means it'd iterate all CPU to check the freelist. And it has a bad performance impact on large machines. I've checked it on my 64 CPU machine with this. $ perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 2.825 [sec] And I used the task mode, so that it can guarantee the map is full. The default map entry size is 16K and this workload has 40K tasks. Before: $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 11.299 [sec] contended total wait max wait avg wait pid comm 19284 3.51 s 3.70 ms 181.91 us 1305863 sched-messaging 243 84.09 ms 466.67 us 346.04 us 1336608 sched-messaging 177 66.35 ms 12.08 ms 374.88 us 1220416 node For some reason, it didn't report the data failures. But you can see the total time in the workload is increased a lot (2.8 -> 11.3). If it fails early when the map is full, it goes back to normal. After: $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 3.044 [sec] contended total wait max wait avg wait pid comm 18743 591.92 ms 442.96 us 31.58 us 1431454 sched-messaging 51 210.64 ms 207.45 ms 4.13 ms 1468724 sched-messaging 81 68.61 ms 65.79 ms 847.07 us 1463183 sched-messaging === output for debug === bad: 1164137, total: 2253341 bad rate: 51.66 % histogram of failure reasons task: 0 stack: 0 time: 0 data: 1164137 Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Hao Luo <haoluo@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230406210611.1622492-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-04-06 14:06:11 -07:00
}
struct contention_data first = {
.total_time = duration,
.max_time = duration,
.min_time = duration,
.count = 1,
.flags = pelem->flags,
};
perf lock contention: Do not try to update if hash map is full It doesn't delete data in the task_data and lock_stat maps. The data is kept there until it's consumed by userspace at the end. But it calls bpf_map_update_elem() again and again, and the data will be discarded if the map is full. This is not good. Worse, in the bpf_map_update_elem(), it keeps trying to get a new node even if the map was full. I guess it makes sense if it deletes some node like in the tstamp map (that's why I didn't make the change there). In a pre-allocated hash map, that means it'd iterate all CPU to check the freelist. And it has a bad performance impact on large machines. I've checked it on my 64 CPU machine with this. $ perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 2.825 [sec] And I used the task mode, so that it can guarantee the map is full. The default map entry size is 16K and this workload has 40K tasks. Before: $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 11.299 [sec] contended total wait max wait avg wait pid comm 19284 3.51 s 3.70 ms 181.91 us 1305863 sched-messaging 243 84.09 ms 466.67 us 346.04 us 1336608 sched-messaging 177 66.35 ms 12.08 ms 374.88 us 1220416 node For some reason, it didn't report the data failures. But you can see the total time in the workload is increased a lot (2.8 -> 11.3). If it fails early when the map is full, it goes back to normal. After: $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 3.044 [sec] contended total wait max wait avg wait pid comm 18743 591.92 ms 442.96 us 31.58 us 1431454 sched-messaging 51 210.64 ms 207.45 ms 4.13 ms 1468724 sched-messaging 81 68.61 ms 65.79 ms 847.07 us 1463183 sched-messaging === output for debug === bad: 1164137, total: 2253341 bad rate: 51.66 % histogram of failure reasons task: 0 stack: 0 time: 0 data: 1164137 Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Hao Luo <haoluo@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230406210611.1622492-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-04-06 14:06:11 -07:00
int err;
perf lock contention: Track and show mmap_lock with address Sometimes there are severe contentions on the mmap_lock and we want see it in the -l/--lock-addr output. However it cannot symbolize the mmap_lock because it's allocated dynamically without symbols. Stephane and Hao gave me an idea separately to display mmap_lock by following the current->mm pointer. I added a flag to mark mmap_lock after comparing the lock address so that it can show them differently. With this change it can show mmap_lock like below: $ sudo ./perf lock con -abl -- sleep 10 contended total wait max wait avg wait address symbol ... 16344 312.30 ms 2.22 ms 19.11 us ffff8cc702595640 17686 310.08 ms 1.49 ms 17.53 us ffff8cc7025952c0 3 84.14 ms 45.79 ms 28.05 ms ffff8cc78114c478 mmap_lock 3557 76.80 ms 68.75 us 21.59 us ffff8cc77ca3af58 1 68.27 ms 68.27 ms 68.27 ms ffff8cda745dfd70 9 54.53 ms 7.96 ms 6.06 ms ffff8cc7642a48b8 mmap_lock 14629 44.01 ms 60.00 us 3.01 us ffff8cc7625f9ca0 3481 42.63 ms 140.71 us 12.24 us ffffffff937906ac vmap_area_lock 16194 38.73 ms 42.15 us 2.39 us ffff8cd397cbc560 11 38.44 ms 10.39 ms 3.49 ms ffff8ccd6d12fbb8 mmap_lock 1 5.43 ms 5.43 ms 5.43 ms ffff8cd70018f0d8 1674 5.38 ms 422.93 us 3.21 us ffffffff92e06080 tasklist_lock 581 4.51 ms 130.68 us 7.75 us ffff8cc9b1259058 5 3.52 ms 1.27 ms 703.23 us ffff8cc754510070 112 3.47 ms 56.47 us 31.02 us ffff8ccee38b3120 381 3.31 ms 73.44 us 8.69 us ffffffff93790690 purge_vmap_area_lock 255 3.19 ms 36.35 us 12.49 us ffff8d053ce30c80 Note that mmap_lock was renamed some time ago and it needs to support old kernels with a different name 'mmap_sem'. Suggested-by: Hao Luo <haoluo@google.com> Suggested-by: Stephane Eranian <eranian@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230313204825.2665483-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-03-13 13:48:22 -07:00
if (aggr_mode == LOCK_AGGR_ADDR)
first.flags |= check_lock_type(pelem->lock, pelem->flags);
perf lock contention: Do not try to update if hash map is full It doesn't delete data in the task_data and lock_stat maps. The data is kept there until it's consumed by userspace at the end. But it calls bpf_map_update_elem() again and again, and the data will be discarded if the map is full. This is not good. Worse, in the bpf_map_update_elem(), it keeps trying to get a new node even if the map was full. I guess it makes sense if it deletes some node like in the tstamp map (that's why I didn't make the change there). In a pre-allocated hash map, that means it'd iterate all CPU to check the freelist. And it has a bad performance impact on large machines. I've checked it on my 64 CPU machine with this. $ perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 2.825 [sec] And I used the task mode, so that it can guarantee the map is full. The default map entry size is 16K and this workload has 40K tasks. Before: $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 11.299 [sec] contended total wait max wait avg wait pid comm 19284 3.51 s 3.70 ms 181.91 us 1305863 sched-messaging 243 84.09 ms 466.67 us 346.04 us 1336608 sched-messaging 177 66.35 ms 12.08 ms 374.88 us 1220416 node For some reason, it didn't report the data failures. But you can see the total time in the workload is increased a lot (2.8 -> 11.3). If it fails early when the map is full, it goes back to normal. After: $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 3.044 [sec] contended total wait max wait avg wait pid comm 18743 591.92 ms 442.96 us 31.58 us 1431454 sched-messaging 51 210.64 ms 207.45 ms 4.13 ms 1468724 sched-messaging 81 68.61 ms 65.79 ms 847.07 us 1463183 sched-messaging === output for debug === bad: 1164137, total: 2253341 bad rate: 51.66 % histogram of failure reasons task: 0 stack: 0 time: 0 data: 1164137 Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Hao Luo <haoluo@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230406210611.1622492-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-04-06 14:06:11 -07:00
err = bpf_map_update_elem(&lock_stat, &key, &first, BPF_NOEXIST);
if (err < 0) {
if (err == -EEXIST) {
/* it lost the race, try to get it again */
data = bpf_map_lookup_elem(&lock_stat, &key);
if (data != NULL)
goto found;
}
perf lock contention: Do not try to update if hash map is full It doesn't delete data in the task_data and lock_stat maps. The data is kept there until it's consumed by userspace at the end. But it calls bpf_map_update_elem() again and again, and the data will be discarded if the map is full. This is not good. Worse, in the bpf_map_update_elem(), it keeps trying to get a new node even if the map was full. I guess it makes sense if it deletes some node like in the tstamp map (that's why I didn't make the change there). In a pre-allocated hash map, that means it'd iterate all CPU to check the freelist. And it has a bad performance impact on large machines. I've checked it on my 64 CPU machine with this. $ perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 2.825 [sec] And I used the task mode, so that it can guarantee the map is full. The default map entry size is 16K and this workload has 40K tasks. Before: $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 11.299 [sec] contended total wait max wait avg wait pid comm 19284 3.51 s 3.70 ms 181.91 us 1305863 sched-messaging 243 84.09 ms 466.67 us 346.04 us 1336608 sched-messaging 177 66.35 ms 12.08 ms 374.88 us 1220416 node For some reason, it didn't report the data failures. But you can see the total time in the workload is increased a lot (2.8 -> 11.3). If it fails early when the map is full, it goes back to normal. After: $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 3.044 [sec] contended total wait max wait avg wait pid comm 18743 591.92 ms 442.96 us 31.58 us 1431454 sched-messaging 51 210.64 ms 207.45 ms 4.13 ms 1468724 sched-messaging 81 68.61 ms 65.79 ms 847.07 us 1463183 sched-messaging === output for debug === bad: 1164137, total: 2253341 bad rate: 51.66 % histogram of failure reasons task: 0 stack: 0 time: 0 data: 1164137 Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Hao Luo <haoluo@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230406210611.1622492-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-04-06 14:06:11 -07:00
if (err == -E2BIG)
data_map_full = 1;
__sync_fetch_and_add(&data_fail, 1);
perf lock contention: Do not try to update if hash map is full It doesn't delete data in the task_data and lock_stat maps. The data is kept there until it's consumed by userspace at the end. But it calls bpf_map_update_elem() again and again, and the data will be discarded if the map is full. This is not good. Worse, in the bpf_map_update_elem(), it keeps trying to get a new node even if the map was full. I guess it makes sense if it deletes some node like in the tstamp map (that's why I didn't make the change there). In a pre-allocated hash map, that means it'd iterate all CPU to check the freelist. And it has a bad performance impact on large machines. I've checked it on my 64 CPU machine with this. $ perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 2.825 [sec] And I used the task mode, so that it can guarantee the map is full. The default map entry size is 16K and this workload has 40K tasks. Before: $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 11.299 [sec] contended total wait max wait avg wait pid comm 19284 3.51 s 3.70 ms 181.91 us 1305863 sched-messaging 243 84.09 ms 466.67 us 346.04 us 1336608 sched-messaging 177 66.35 ms 12.08 ms 374.88 us 1220416 node For some reason, it didn't report the data failures. But you can see the total time in the workload is increased a lot (2.8 -> 11.3). If it fails early when the map is full, it goes back to normal. After: $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000 # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 1000 groups == 40000 processes run Total time: 3.044 [sec] contended total wait max wait avg wait pid comm 18743 591.92 ms 442.96 us 31.58 us 1431454 sched-messaging 51 210.64 ms 207.45 ms 4.13 ms 1468724 sched-messaging 81 68.61 ms 65.79 ms 847.07 us 1463183 sched-messaging === output for debug === bad: 1164137, total: 2253341 bad rate: 51.66 % histogram of failure reasons task: 0 stack: 0 time: 0 data: 1164137 Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Hao Luo <haoluo@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20230406210611.1622492-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-04-06 14:06:11 -07:00
}
goto out;
}
found:
__sync_fetch_and_add(&data->total_time, duration);
__sync_fetch_and_add(&data->count, 1);
/* FIXME: need atomic operations */
if (data->max_time < duration)
data->max_time = duration;
if (data->min_time > duration)
data->min_time = duration;
out:
pelem->lock = 0;
if (need_delete)
bpf_map_delete_elem(&tstamp, &pid);
return 0;
}
extern struct rq runqueues __ksym;
struct rq___old {
raw_spinlock_t lock;
} __attribute__((preserve_access_index));
struct rq___new {
raw_spinlock_t __lock;
} __attribute__((preserve_access_index));
SEC("raw_tp/bpf_test_finish")
int BPF_PROG(collect_lock_syms)
{
__u64 lock_addr, lock_off;
__u32 lock_flag;
if (bpf_core_field_exists(struct rq___new, __lock))
lock_off = offsetof(struct rq___new, __lock);
else
lock_off = offsetof(struct rq___old, lock);
for (int i = 0; i < MAX_CPUS; i++) {
struct rq *rq = bpf_per_cpu_ptr(&runqueues, i);
if (rq == NULL)
break;
lock_addr = (__u64)(void *)rq + lock_off;
lock_flag = LOCK_CLASS_RQLOCK;
bpf_map_update_elem(&lock_syms, &lock_addr, &lock_flag, BPF_ANY);
}
return 0;
}
SEC("raw_tp/bpf_test_finish")
int BPF_PROG(end_timestamp)
{
end_ts = bpf_ktime_get_ns();
return 0;
}
char LICENSE[] SEC("license") = "Dual BSD/GPL";