2019-06-03 07:44:50 +02:00
|
|
|
// SPDX-License-Identifier: GPL-2.0-only
|
2014-08-26 21:15:30 -07:00
|
|
|
/*
|
|
|
|
* BPF JIT compiler for ARM64
|
|
|
|
*
|
2016-01-13 23:33:22 -08:00
|
|
|
* Copyright (C) 2014-2016 Zi Shen Lim <zlim.lnx@gmail.com>
|
2014-08-26 21:15:30 -07:00
|
|
|
*/
|
|
|
|
|
|
|
|
#define pr_fmt(fmt) "bpf_jit: " fmt
|
|
|
|
|
2020-07-28 17:21:26 +02:00
|
|
|
#include <linux/bitfield.h>
|
arm64: bpf: implement bpf_tail_call() helper
Add support for JMP_CALL_X (tail call) introduced by commit 04fd61ab36ec
("bpf: allow bpf programs to tail-call other bpf programs").
bpf_tail_call() arguments:
ctx - context pointer passed to next program
array - pointer to a map of type BPF_MAP_TYPE_PROG_ARRAY
index - index inside array that selects specific program to run
In this implementation, the arm64 JIT jumps into the callee program after its
prologue, so the callee reuses the same stack. For tail_call_cnt, we use the
callee-saved register R26 (which was already saved/restored but previously
unused by the JIT).
With this patch a tail call generates the following code on arm64:
if (index >= array->map.max_entries)
goto out;
34: mov x10, #0x10 // #16
38: ldr w10, [x1,x10]
3c: cmp w2, w10
40: b.ge 0x0000000000000074
if (tail_call_cnt > MAX_TAIL_CALL_CNT)
goto out;
tail_call_cnt++;
44: mov x10, #0x20 // #32
48: cmp x26, x10
4c: b.gt 0x0000000000000074
50: add x26, x26, #0x1
prog = array->ptrs[index];
if (prog == NULL)
goto out;
54: mov x10, #0x68 // #104
58: ldr x10, [x1,x10]
5c: ldr x11, [x10,x2]
60: cbz x11, 0x0000000000000074
goto *(prog->bpf_func + prologue_size);
64: mov x10, #0x20 // #32
68: ldr x10, [x11,x10]
6c: add x10, x10, #0x20
70: br x10
74:
Signed-off-by: Zi Shen Lim <zlim.lnx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08 21:18:48 -07:00
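Taken together, the emitted sequence above corresponds roughly to the following C logic. This is a hedged, consolidated sketch of the pseudo-code interleaved with the disassembly above, not literal JIT output; the 0x20 prologue skip is taken from the listing.
/* Consolidated sketch of the emitted tail-call checks (illustrative only). */
static void *select_tail_call_target(struct bpf_array *array, u32 index,
				     u32 *tail_call_cnt)
{
	struct bpf_prog *prog;

	if (index >= array->map.max_entries)
		return NULL;				/* out */
	if (*tail_call_cnt > MAX_TAIL_CALL_CNT)
		return NULL;				/* out */
	(*tail_call_cnt)++;
	prog = array->ptrs[index];
	if (prog == NULL)
		return NULL;				/* out */
	/* jump past the callee's prologue (0x20 bytes in the listing above) */
	return (void *)((unsigned long)prog->bpf_func + 0x20);
}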
|
|
|
#include <linux/bpf.h>
|
2014-08-26 21:15:30 -07:00
|
|
|
#include <linux/filter.h>
|
bpf, arm64: Implement bpf_arch_text_poke() for arm64
Implement bpf_arch_text_poke() for arm64, so bpf prog or bpf trampoline
can be patched with it.
When the target address is NULL, the original instruction is patched to
a NOP.
When the target address and the source address are within the branch
range, the original instruction is patched to a bl instruction to the
target address directly.
To support attaching a bpf trampoline to both regular kernel functions and
bpf progs, we follow the ftrace patchsite approach for bpf progs. That is, two
instructions are inserted at the beginning of the bpf prog: the first one
saves the return address to x9, and the second is a nop which will be
patched to a bl instruction when a bpf trampoline is attached.
However, when a bpf trampoline is attached to a bpf prog, the distance
between the target address and the source address may exceed 128MB, the
maximum branch range, because bpf trampolines and bpf progs are allocated
separately with vmalloc. So long jumps must be handled.
When a bpf prog is constructed, a plt pointing to empty trampoline
dummy_tramp is placed at the end:
bpf_prog:
mov x9, lr
nop // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad dummy_tramp // plt target
This is also the state when no trampoline is attached.
When a short-jump bpf trampoline is attached, the patchsite is patched to
a bl instruction to the trampoline directly:
bpf_prog:
mov x9, lr
bl <short-jump bpf trampoline address> // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad dummy_tramp // plt target
When a long-jump bpf trampoline is attached, the plt target is filled with
the trampoline address and the patchsite is patched to a bl instruction to
the plt:
bpf_prog:
mov x9, lr
bl plt // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad <long-jump bpf trampoline address>
dummy_tramp is used to prevent another CPU from jumping to an unknown
location during the patching process, making the patching process easier.
The patching process is as follows:
1. when neither the old address nor the new address is a long jump, the
patchsite is replaced with a bl to the new address, or nop if the new
address is NULL;
2. when the old address is not a long jump but the new one is, the
branch target address is written to plt first, then the patchsite
is replaced with a bl instruction to the plt;
3. when the old address is a long jump but the new one is not, the address
of dummy_tramp is written to plt first, then the patchsite is replaced
with a bl to the new address, or a nop if the new address is NULL;
4. when both the old address and the new address are long jump, the
new address is written to plt and the patchsite is not changed.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20220711150823.2128542-4-xukuohai@huawei.com
2022-07-11 11:08:22 -04:00
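The four patching cases above can be summarized by the following hedged sketch. The helper names are illustrative assumptions, not kernel APIs; only dummy_tramp is a real symbol from this file.
/* Hedged sketch of the patching decision; the helpers below are assumed. */
extern void dummy_tramp(void);
extern bool branch_in_range(void *site, void *target);		/* assumed helper */
extern int  patch_bl(void *site, void *target);			/* assumed helper */
extern int  patch_bl_or_nop(void *site, void *target);		/* bl, or nop if target is NULL */
extern void write_plt_target(void *plt_slot, void *target);	/* assumed helper */

static int poke_decision_sketch(void *patchsite, void *plt_slot,
				void *old_addr, void *new_addr)
{
	bool old_long = old_addr && !branch_in_range(patchsite, old_addr);
	bool new_long = new_addr && !branch_in_range(patchsite, new_addr);

	if (!old_long && !new_long)			/* case 1 */
		return patch_bl_or_nop(patchsite, new_addr);
	if (!old_long && new_long) {			/* case 2 */
		write_plt_target(plt_slot, new_addr);
		return patch_bl(patchsite, plt_slot);
	}
	if (old_long && !new_long) {			/* case 3 */
		write_plt_target(plt_slot, (void *)dummy_tramp);
		return patch_bl_or_nop(patchsite, new_addr);
	}
	write_plt_target(plt_slot, new_addr);		/* case 4: patchsite unchanged */
	return 0;
}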
|
|
|
#include <linux/memory.h>
|
2014-08-26 21:15:30 -07:00
|
|
|
#include <linux/printk.h>
|
|
|
|
#include <linux/slab.h>
|
2014-09-16 08:48:50 +01:00
|
|
|
|
arm64: extable: add `type` and `data` fields
Subsequent patches will add specialized handlers for fixups, in addition
to the simple PC fixup and BPF handlers we have today. In preparation,
this patch adds a new `type` field to struct exception_table_entry, and
uses this to distinguish the fixup and BPF cases. A `data` field is also
added so that subsequent patches can associate data specific to each
exception site (e.g. register numbers).
Handlers are named ex_handler_*() for consistency, following the example
of x86. At the same time, get_ex_fixup() is split out into a helper so
that it can be used by other ex_handler_*() functions in subsequent
patches.
This patch will increase the size of the exception tables, which will be
remedied by subsequent patches removing redundant fixup code. There
should be no functional change as a result of this patch.
Since each entry is now 12 bytes in size, we must reduce the alignment
of each entry from `.align 3` (i.e. 8 bytes) to `.align 2` (i.e. 4
bytes), which is the natural alignment of the `insn` and `fixup` fields.
The current 8-byte alignment is a holdover from when the `insn` and
`fixup` fields were 8 bytes, and while not harmful has not been necessary
since commit:
6c94f27ac847ff8e ("arm64: switch to relative exception tables")
Similarly, RO_EXCEPTION_TABLE_ALIGN is dropped to 4 bytes.
Concurrently with this patch, x86's exception table entry format is
being updated (similarly to a 12-byte format, with 32-bytes of absolute
data). Once both have been merged it should be possible to unify the
sorttable logic for the two.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: James Morse <james.morse@arm.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20211019160219.5202-11-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2021-10-19 17:02:16 +01:00
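For orientation, the 12-byte entry described above has roughly the following shape. This is a sketch for readers of this file; the authoritative definition lives in asm/extable.h and may differ in detail.
/* Sketch of the 12-byte exception table entry described above. */
struct exception_table_entry_sketch {
	int	insn;	/* relative offset to the faulting instruction */
	int	fixup;	/* relative offset to the fixup code */
	short	type;	/* selects the ex_handler_*() routine (fixup, BPF, ...) */
	short	data;	/* handler-specific data, e.g. register numbers */
};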
|
|
|
#include <asm/asm-extable.h>
|
2014-08-26 21:15:30 -07:00
|
|
|
#include <asm/byteorder.h>
|
|
|
|
#include <asm/cacheflush.h>
|
2014-09-16 08:48:50 +01:00
|
|
|
#include <asm/debug-monitors.h>
|
2021-06-09 11:23:01 +01:00
|
|
|
#include <asm/insn.h>
|
2024-10-23 19:27:06 +03:00
|
|
|
#include <asm/text-patching.h>
|
2017-05-08 15:58:05 -07:00
|
|
|
#include <asm/set_memory.h>
|
2014-08-26 21:15:30 -07:00
|
|
|
|
|
|
|
#include "bpf_jit.h"
|
|
|
|
|
2016-05-13 19:08:34 +02:00
|
|
|
#define TMP_REG_1 (MAX_BPF_JIT_REG + 0)
|
|
|
|
#define TMP_REG_2 (MAX_BPF_JIT_REG + 1)
|
bpf, arm64: Fix tailcall hierarchy
This patch fixes a tailcall issue caused by abusing the tailcall-in-bpf2bpf
feature on arm64, in the same way as "bpf, x64: Fix tailcall hierarchy".
On arm64, when a tail call happens, it uses tail_call_cnt_ptr to
increment tail_call_cnt, too.
In the prologue of the main prog, it has to initialize tail_call_cnt and
prepare tail_call_cnt_ptr.
In the prologue of a subprog, it pushes the x26 register twice and does not
initialize tail_call_cnt.
In the epilogue, it pops x26 twice, whether it is the main prog or a
subprog.
Fixes: d4609a5d8c70 ("bpf, arm64: Keep tail call count across bpf2bpf calls")
Acked-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
Link: https://lore.kernel.org/r/20240714123902.32305-3-hffilwlqm@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-07-14 20:39:01 +08:00
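Conceptually, the fix makes the main prog and all of its subprogs bump one shared counter through tail_call_cnt_ptr, roughly as in the hedged sketch below (illustrative C, not the emitted code; the exact comparison emitted by the JIT may differ).
/* Hedged sketch: one counter shared via a pointer across bpf2bpf calls. */
static bool tail_call_permitted(u64 *tail_call_cnt_ptr)
{
	if (*tail_call_cnt_ptr >= MAX_TAIL_CALL_CNT)
		return false;		/* limit applies across subprog boundaries */
	(*tail_call_cnt_ptr)++;
	return true;
}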
|
|
|
#define TCCNT_PTR (MAX_BPF_JIT_REG + 2)
|
bpf, arm64: use separate register for state in stxr
Will reported that for BPF_XADD we must use a different register for the
status flag in the stxr instruction, since reusing the same register is
CONSTRAINED UNPREDICTABLE per the architecture. The reference manual says [1]:
If s == t, then one of the following behaviors must occur:
* The instruction is UNDEFINED.
* The instruction executes as a NOP.
* The instruction performs the store to the specified address, but
the value stored is UNKNOWN.
Thus, use a different temporary register for the status flag to fix it.
Disassembly extract from test 226/STX_XADD_DW from test_bpf.ko:
[...]
0000003c: c85f7d4b ldxr x11, [x10]
00000040: 8b07016b add x11, x11, x7
00000044: c80c7d4b stxr w12, x11, [x10]
00000048: 35ffffac cbnz w12, 0x0000003c
[...]
[1] https://static.docs.arm.com/ddi0487/b/DDI0487B_a_armv8_arm.pdf, p.6132
Fixes: 85f68fe89832 ("bpf, arm64: implement jiting of BPF_XADD")
Reported-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-07 13:45:37 +02:00
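The resulting emitter keeps a third temporary purely for the store-exclusive status. A hedged sketch of the LL/SC add sequence, assuming isdw, tmp (the address), tmp2, tmp3, src and ctx are in scope inside the instruction builder:
	/* tmp2 = *tmp; tmp2 += src; try the store; the status lands in tmp3 */
	emit(A64_LDXR(isdw, tmp2, tmp), ctx);
	emit(A64_ADD(isdw, tmp2, tmp2, src), ctx);
	emit(A64_STXR(isdw, tmp2, tmp, tmp3), ctx);
	/* loop back 3 instructions to the ldxr if the exclusive store failed */
	emit(A64_CBNZ(0, tmp3, -3), ctx);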
|
|
|
#define TMP_REG_3 (MAX_BPF_JIT_REG + 3)
|
2024-03-25 15:07:15 +00:00
|
|
|
#define ARENA_VM_START (MAX_BPF_JIT_REG + 5)
|
2014-08-26 21:15:30 -07:00
|
|
|
|
bpf, arm64: Support more atomic operations
The "Atomics for eBPF" patch series adds support for atomic[64]_fetch_add,
atomic[64]_[fetch_]{and,or,xor} and atomic[64]_{xchg|cmpxchg}, but it
only adds support for x86-64, so support these atomic operations for
arm64 as well.
The implementation is essentially a mechanical translation of the code
snippets in atomic_ll_sc.h, atomic_lse.h and cmpxchg.h located
under arch/arm64/include/asm.
When LSE atomics are unavailable, an extra temporary register is needed for
(BPF_ADD | BPF_FETCH) to save the value of the src register; instead of
adding TMP_REG_4, BPF_REG_AX is reused. Also make emit_lse_atomic() an
empty inline function when CONFIG_ARM64_LSE_ATOMICS is disabled.
For both the cpus_have_cap(ARM64_HAS_LSE_ATOMICS) and no-LSE-atomics cases,
the following three tests were exercised and passed: "./test_verifier",
"./test_progs -t atomic" and "insmod ./test_bpf.ko".
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220217072232.1186625-4-houtao1@huawei.com
2022-02-17 15:22:31 +08:00
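The CONFIG_ARM64_LSE_ATOMICS split mentioned above follows the usual compile-out pattern, roughly as below. This is a hedged sketch; the fallback body in mainline may differ (for example, it may return an error code such as -EINVAL rather than being literally empty).
#ifdef CONFIG_ARM64_LSE_ATOMICS
/* real emitter: translates BPF_ATOMIC instructions into LSE instructions */
static int emit_lse_atomic(const struct bpf_insn *insn, struct jit_ctx *ctx);
#else
static inline int emit_lse_atomic(const struct bpf_insn *insn, struct jit_ctx *ctx)
{
	return -EINVAL;		/* LSE atomics compiled out; not expected to run */
}
#endif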
|
|
|
#define check_imm(bits, imm) do { \
|
|
|
|
if ((((imm) > 0) && ((imm) >> (bits))) || \
|
|
|
|
(((imm) < 0) && (~(imm) >> (bits)))) { \
|
|
|
|
pr_info("[%2d] imm=%d(0x%x) out of range\n", \
|
|
|
|
i, imm, imm); \
|
|
|
|
return -EINVAL; \
|
|
|
|
} \
|
|
|
|
} while (0)
|
|
|
|
#define check_imm19(imm) check_imm(19, imm)
|
|
|
|
#define check_imm26(imm) check_imm(26, imm)
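A hedged usage sketch: because the macro expands to a `return -EINVAL`, it must sit inside the instruction-building function, with the instruction index `i`, the branch offset variables and a jit_ctx *ctx in scope. The bpf2a64_offset() helper is defined later in this file; its exact signature is assumed here.
	/* verify a conditional-branch offset fits in 19 bits before emitting */
	jmp_offset = bpf2a64_offset(i, off, ctx);	/* signature assumed */
	check_imm19(jmp_offset);			/* returns -EINVAL on overflow */
	emit(A64_B_(A64_COND_EQ, jmp_offset), ctx);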
|
|
|
|
|
2014-08-26 21:15:30 -07:00
|
|
|
/* Map BPF registers to A64 registers */
|
|
|
|
static const int bpf2a64[] = {
|
|
|
|
/* return value from in-kernel function, and exit value from eBPF */
|
|
|
|
[BPF_REG_0] = A64_R(7),
|
|
|
|
/* arguments from eBPF program to in-kernel function */
|
|
|
|
[BPF_REG_1] = A64_R(0),
|
|
|
|
[BPF_REG_2] = A64_R(1),
|
|
|
|
[BPF_REG_3] = A64_R(2),
|
|
|
|
[BPF_REG_4] = A64_R(3),
|
|
|
|
[BPF_REG_5] = A64_R(4),
|
|
|
|
/* callee saved registers that in-kernel function will preserve */
|
|
|
|
[BPF_REG_6] = A64_R(19),
|
|
|
|
[BPF_REG_7] = A64_R(20),
|
|
|
|
[BPF_REG_8] = A64_R(21),
|
|
|
|
[BPF_REG_9] = A64_R(22),
|
|
|
|
/* read-only frame pointer to access stack */
|
2015-11-16 14:35:35 -08:00
|
|
|
[BPF_REG_FP] = A64_R(25),
|
2021-11-19 17:32:13 +01:00
|
|
|
/* temporary registers for BPF JIT */
|
2016-05-16 16:36:26 -07:00
|
|
|
[TMP_REG_1] = A64_R(10),
|
|
|
|
[TMP_REG_2] = A64_R(11),
|
bpf, arm64: use separate register for state in stxr
2017-06-07 13:45:37 +02:00
|
|
|
[TMP_REG_3] = A64_R(12),
|
bpf, arm64: Fix tailcall hierarchy
2024-07-14 20:39:01 +08:00
|
|
|
/* tail_call_cnt_ptr */
|
|
|
|
[TCCNT_PTR] = A64_R(26),
|
2016-05-13 19:08:34 +02:00
|
|
|
/* temporary register for blinding constants */
|
|
|
|
[BPF_REG_AX] = A64_R(9),
|
2024-03-25 15:07:15 +00:00
|
|
|
/* callee saved register for kern_vm_start address */
|
|
|
|
[ARENA_VM_START] = A64_R(28),
|
2014-08-26 21:15:30 -07:00
|
|
|
};
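A hedged usage sketch of the mapping, assuming insn, is64 and ctx are in scope in the instruction builder: eBPF register numbers are resolved through this table before any A64 instruction is emitted.
	/* resolve eBPF registers to their A64 counterparts, then emit */
	const u8 dst = bpf2a64[insn->dst_reg];
	const u8 src = bpf2a64[insn->src_reg];
	const u8 tmp = bpf2a64[TMP_REG_1];

	emit(A64_ADD(is64, dst, dst, src), ctx);	/* e.g. BPF_ALU64 | BPF_ADD | BPF_X */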
|
|
|
|
|
|
|
|
struct jit_ctx {
|
|
|
|
const struct bpf_prog *prog;
|
|
|
|
int idx;
|
2014-12-03 08:38:01 +00:00
|
|
|
int epilogue_offset;
|
2014-08-26 21:15:30 -07:00
|
|
|
int *offset;
|
2020-07-28 17:21:26 +02:00
|
|
|
int exentry_idx;
|
bpf, arm64: Avoid blindly saving/restoring all callee-saved registers
The arm64 JIT blindly saves/restores all callee-saved registers, making
the jited result look more complicated than necessary. For example, for an
empty prog, the jited result is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp x19, x20, [sp, #-16]!
1c: stp x21, x22, [sp, #-16]!
20: stp x26, x25, [sp, #-16]!
24: mov x26, #0
28: stp x26, x25, [sp, #-16]!
2c: mov x26, sp
30: stp x27, x28, [sp, #-16]!
34: mov x25, sp
38: bti j // tailcall target
3c: sub sp, sp, #0
40: mov x7, #0
44: add sp, sp, #0
48: ldp x27, x28, [sp], #16
4c: ldp x26, x25, [sp], #16
50: ldp x26, x25, [sp], #16
54: ldp x21, x22, [sp], #16
58: ldp x19, x20, [sp], #16
5c: ldp fp, lr, [sp], #16
60: mov x0, x7
64: autiasp
68: ret
Clearly, there is no need to save/restore unused callee-saved registers.
This patch makes the jited image save/restore only the callee-saved
registers it actually uses.
Now the jited result for an empty prog is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp xzr, x26, [sp, #-16]!
1c: mov x26, sp
20: bti j // tailcall target
24: mov x7, #0
28: ldp xzr, x26, [sp], #16
2c: ldp fp, lr, [sp], #16
30: mov x0, x7
34: autiasp
38: ret
Since a bpf prog now saves/restores its own callee-saved registers as needed,
to make tail calls work correctly, the caller needs to restore its saved
registers before the tail call, and the callee needs to save its callee-saved
registers after the tail call. These extra restore/save instructions
increase performance overhead.
[1] provides 2 benchmarks for tailcall scenarios. Below are the perf
numbers measured in an arm64 KVM guest. The results indicate that the
performance difference before and after the patch in typical tailcall
scenarios is negligible.
- Before:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4313.43 msec task-clock # 0.874 CPUs utilized ( +- 0.16% )
574 context-switches # 133.073 /sec ( +- 1.14% )
0 cpu-migrations # 0.000 /sec
538 page-faults # 124.727 /sec ( +- 0.57% )
10697772784 cycles # 2.480 GHz ( +- 0.22% ) (61.19%)
25511241955 instructions # 2.38 insn per cycle ( +- 0.08% ) (66.70%)
5108910557 branches # 1.184 G/sec ( +- 0.08% ) (72.38%)
2800459 branch-misses # 0.05% of all branches ( +- 0.51% ) (72.36%)
TopDownL1 # 0.60 retiring ( +- 0.09% ) (66.84%)
# 0.21 frontend_bound ( +- 0.15% ) (61.31%)
# 0.12 bad_speculation ( +- 0.08% ) (50.11%)
# 0.07 backend_bound ( +- 0.16% ) (33.30%)
8274201819 L1-dcache-loads # 1.918 G/sec ( +- 0.18% ) (33.15%)
468268 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 4.69% ) (33.16%)
385383 LLC-loads # 89.345 K/sec ( +- 5.22% ) (33.16%)
38296 LLC-load-misses # 9.94% of all LL-cache accesses ( +- 42.52% ) (38.69%)
6886576501 L1-icache-loads # 1.597 G/sec ( +- 0.35% ) (38.69%)
1848585 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 4.52% ) (44.23%)
9043645883 dTLB-loads # 2.097 G/sec ( +- 0.10% ) (44.33%)
416672 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 5.15% ) (49.89%)
6925626111 iTLB-loads # 1.606 G/sec ( +- 0.35% ) (55.46%)
66220 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.88% ) (55.50%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9372 +- 0.0526 seconds time elapsed ( +- 1.07% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10924.50 msec task-clock # 0.945 CPUs utilized ( +- 0.08% )
603 context-switches # 55.197 /sec ( +- 1.13% )
0 cpu-migrations # 0.000 /sec
566 page-faults # 51.810 /sec ( +- 0.42% )
27381270695 cycles # 2.506 GHz ( +- 0.18% ) (60.46%)
56996583922 instructions # 2.08 insn per cycle ( +- 0.21% ) (66.11%)
10321647567 branches # 944.816 M/sec ( +- 0.17% ) (71.79%)
3347735 branch-misses # 0.03% of all branches ( +- 3.72% ) (72.15%)
TopDownL1 # 0.52 retiring ( +- 0.13% ) (66.74%)
# 0.27 frontend_bound ( +- 0.14% ) (61.27%)
# 0.14 bad_speculation ( +- 0.19% ) (50.36%)
# 0.07 backend_bound ( +- 0.42% ) (33.89%)
18740797617 L1-dcache-loads # 1.715 G/sec ( +- 0.43% ) (33.71%)
13715669 L1-dcache-load-misses # 0.07% of all L1-dcache accesses ( +- 32.85% ) (33.34%)
4087551 LLC-loads # 374.164 K/sec ( +- 29.53% ) (33.26%)
267906 LLC-load-misses # 6.55% of all LL-cache accesses ( +- 23.90% ) (38.76%)
15811864229 L1-icache-loads # 1.447 G/sec ( +- 0.12% ) (38.73%)
2976833 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 9.73% ) (44.22%)
20138907471 dTLB-loads # 1.843 G/sec ( +- 0.18% ) (44.15%)
732850 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 11.18% ) (49.64%)
15895726702 iTLB-loads # 1.455 G/sec ( +- 0.15% ) (55.13%)
152075 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 4.71% ) (54.98%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.5613 +- 0.0317 seconds time elapsed ( +- 0.27% )
- After:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4278.78 msec task-clock # 0.871 CPUs utilized ( +- 0.15% )
569 context-switches # 132.982 /sec ( +- 0.58% )
0 cpu-migrations # 0.000 /sec
539 page-faults # 125.970 /sec ( +- 0.43% )
10588986432 cycles # 2.475 GHz ( +- 0.20% ) (60.91%)
25303825043 instructions # 2.39 insn per cycle ( +- 0.08% ) (66.48%)
5110756256 branches # 1.194 G/sec ( +- 0.07% ) (72.03%)
2719569 branch-misses # 0.05% of all branches ( +- 2.42% ) (72.03%)
TopDownL1 # 0.60 retiring ( +- 0.22% ) (66.31%)
# 0.22 frontend_bound ( +- 0.21% ) (60.83%)
# 0.12 bad_speculation ( +- 0.26% ) (50.25%)
# 0.06 backend_bound ( +- 0.17% ) (33.52%)
8163648527 L1-dcache-loads # 1.908 G/sec ( +- 0.33% ) (33.52%)
694979 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 30.53% ) (33.52%)
1902347 LLC-loads # 444.600 K/sec ( +- 48.84% ) (33.69%)
96677 LLC-load-misses # 5.08% of all LL-cache accesses ( +- 43.48% ) (39.30%)
6863517589 L1-icache-loads # 1.604 G/sec ( +- 0.37% ) (39.17%)
1871519 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 6.78% ) (44.56%)
8927782813 dTLB-loads # 2.087 G/sec ( +- 0.14% ) (44.37%)
438237 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 6.00% ) (49.75%)
6886906831 iTLB-loads # 1.610 G/sec ( +- 0.36% ) (55.08%)
67568 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 3.27% ) (54.86%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9114 +- 0.0309 seconds time elapsed ( +- 0.63% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10948.40 msec task-clock # 0.942 CPUs utilized ( +- 0.05% )
615 context-switches # 56.173 /sec ( +- 1.65% )
1 cpu-migrations # 0.091 /sec ( +- 31.62% )
567 page-faults # 51.788 /sec ( +- 0.44% )
27334194328 cycles # 2.497 GHz ( +- 0.08% ) (61.05%)
56656528828 instructions # 2.07 insn per cycle ( +- 0.08% ) (66.67%)
10270389422 branches # 938.072 M/sec ( +- 0.10% ) (72.21%)
3453837 branch-misses # 0.03% of all branches ( +- 3.75% ) (72.27%)
TopDownL1 # 0.52 retiring ( +- 0.16% ) (66.55%)
# 0.27 frontend_bound ( +- 0.09% ) (60.91%)
# 0.14 bad_speculation ( +- 0.08% ) (49.85%)
# 0.07 backend_bound ( +- 0.16% ) (33.33%)
18982866028 L1-dcache-loads # 1.734 G/sec ( +- 0.24% ) (33.34%)
8802454 L1-dcache-load-misses # 0.05% of all L1-dcache accesses ( +- 52.30% ) (33.31%)
2612962 LLC-loads # 238.661 K/sec ( +- 29.78% ) (33.45%)
264107 LLC-load-misses # 10.11% of all LL-cache accesses ( +- 18.34% ) (39.07%)
15793205997 L1-icache-loads # 1.443 G/sec ( +- 0.15% ) (39.09%)
3930802 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 3.72% ) (44.66%)
20097828496 dTLB-loads # 1.836 G/sec ( +- 0.09% ) (44.68%)
961757 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 3.32% ) (50.15%)
15838728506 iTLB-loads # 1.447 G/sec ( +- 0.09% ) (55.62%)
167652 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.28% ) (55.52%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.6173 +- 0.0268 seconds time elapsed ( +- 0.23% )
[1] https://lore.kernel.org/bpf/20200724123644.5096-1-maciej.fijalkowski@intel.com/
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20240826071624.350108-3-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-08-26 15:16:24 +08:00
|
|
|
int nr_used_callee_reg;
|
|
|
|
u8 used_callee_reg[8]; /* r6~r9, fp, arena_vm_start */
|
2017-06-28 16:58:03 +02:00
|
|
|
__le32 *image;
|
2024-02-28 14:18:24 +00:00
|
|
|
__le32 *ro_image;
|
2017-06-11 03:55:27 +02:00
|
|
|
u32 stack_size;
|
2024-03-25 15:07:16 +00:00
|
|
|
u64 user_vm_start;
|
bpf, arm64: Avoid blindly saving/restoring all callee-saved registers
2024-08-26 15:16:24 +08:00
|
|
|
u64 arena_vm_start;
|
|
|
|
bool fp_used;
|
bpf, arm64: Jit BPF_CALL to direct call when possible
Currently, BPF_CALL is always jited to an indirect call. When the target is
within direct-call range, BPF_CALL can be jited to a direct call.
For example, the following BPF_CALL
call __htab_map_lookup_elem
is always jited to indirect call:
mov x10, #0xffffffffffff18f4
movk x10, #0x821, lsl #16
movk x10, #0x8000, lsl #32
blr x10
When the address of target __htab_map_lookup_elem is within the range of
direct call, the BPF_CALL can be jited to:
bl 0xfffffffffd33bc98
This patch does such jit optimization by emitting arm64 direct calls for
BPF_CALL when possible, indirect calls otherwise.
Without this patch, the jit works as follows.
1. First pass
A. Determine jited position and size for each bpf instruction.
B. Compute the jited image size.
2. Allocate jited image with size computed in step 1.
3. Second pass
A. Adjust jump offset for jump instructions
B. Write the final image.
This works because, for a given bpf prog, regardless of where the jited
image is allocated, the jited result for each instruction is fixed. The
second pass differs from the first only in adjusting the jump offsets,
like changing "jmp imm1" to "jmp imm2", while the position and size of
the "jmp" instruction remain unchanged.
Now consider whether to jit BPF_CALL to an arm64 direct or indirect call
instruction. The choice depends solely on the jump offset: direct call
if the jump offset is within 128MB, indirect call otherwise.
For a given BPF_CALL, the target address is known, so the jump offset is
decided by the jited address of the BPF_CALL instruction. In other words,
for a given bpf prog, the jited result for each BPF_CALL is determined
by its jited address.
The jited address for a BPF_CALL is the jited image address plus the
total jited size of all preceding instructions. For a given bpf prog,
there are clearly no BPF_CALL instructions before the first BPF_CALL
instruction. Since the jited result for all other instructions other
than BPF_CALL are fixed, the total jited size preceding the first
BPF_CALL is also fixed. Therefore, once the jited image is allocated,
the jited address for the first BPF_CALL is fixed.
Now that the jited result for the first BPF_CALL is fixed, the jited
results for all instructions preceding the second BPF_CALL are fixed.
So the jited address and result for the second BPF_CALL are also fixed.
Similarly, we can conclude that the jited addresses and results for all
subsequent BPF_CALL instructions are fixed.
This means that, for a given bpf prog, once the jited image is allocated,
the jited address and result for all instructions, including all BPF_CALL
instructions, are fixed.
Based on this observation, with this patch the jit works as follows.
1. First pass
Estimate the maximum jited image size. In this pass, all BPF_CALLs
are jited to arm64 indirect calls since the jump offsets are unknown
because the jited image is not allocated.
2. Allocate jited image with size estimated in step 1.
3. Second pass
A. Determine the jited result for each BPF_CALL.
B. Determine jited address and size for each bpf instruction.
4. Third pass
A. Adjust jump offset for jump instructions.
B. Write the final image.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20240903094407.601107-1-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-03 17:44:07 +08:00
|
|
|
bool write;
|
2014-08-26 21:15:30 -07:00
|
|
|
};
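As a hedged example of how these bookkeeping fields are consumed: the branch distance to the shared epilogue is just the difference between the recorded epilogue position and the current emit index (the real helper appears later in this file and may differ slightly).
/* Sketch: distance, in A64 instructions, from the current position to the epilogue. */
static inline int epilogue_offset_sketch(const struct jit_ctx *ctx)
{
	return ctx->epilogue_offset - ctx->idx;
}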
|
|
|
|
|
bpf, arm64: Implement bpf_arch_text_poke() for arm64
2022-07-11 11:08:22 -04:00
|
|
|
struct bpf_plt {
|
|
|
|
u32 insn_ldr; /* load target */
|
|
|
|
u32 insn_br; /* branch to target */
|
|
|
|
u64 target; /* target value */
|
|
|
|
};
|
|
|
|
|
|
|
|
#define PLT_TARGET_SIZE sizeof_field(struct bpf_plt, target)
|
|
|
|
#define PLT_TARGET_OFFSET offsetof(struct bpf_plt, target)
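A hedged sketch of how the plt stub is laid down at the end of the image: a literal load of the target slot, a register branch, and the 64-bit slot itself, initialized to dummy_tramp (declared elsewhere in this file). The A64_LDR64LIT macro name and the omitted alignment handling are assumptions; the real build_plt() may differ.
static void build_plt_sketch(struct jit_ctx *ctx)
{
	const u8 tmp = bpf2a64[TMP_REG_1];
	struct bpf_plt *plt = (struct bpf_plt *)(ctx->image + ctx->idx);

	emit(A64_LDR64LIT(tmp, 2 * AARCH64_INSN_SIZE), ctx);	/* ldr x10, target */
	emit(A64_BR(tmp), ctx);					/* br  x10         */
	if (ctx->image)
		plt->target = (u64)&dummy_tramp;	/* points at dummy_tramp until poked */
}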
|
|
|
|
|
2014-08-26 21:15:30 -07:00
|
|
|
static inline void emit(const u32 insn, struct jit_ctx *ctx)
|
|
|
|
{
|
bpf, arm64: Jit BPF_CALL to direct call when possible
2024-09-03 17:44:07 +08:00
|
|
|
if (ctx->image != NULL && ctx->write)
|
2014-08-26 21:15:30 -07:00
|
|
|
ctx->image[ctx->idx] = cpu_to_le32(insn);
|
|
|
|
|
|
|
|
ctx->idx++;
|
|
|
|
}
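Because emit() only writes when an image is present (and ctx->write is set), the same emission code doubles as a sizing pass. A hedged illustration:
/* Hedged illustration: with ctx->image == NULL, emit() only advances ctx->idx. */
static int count_two_insns(struct jit_ctx *ctx)
{
	int start = ctx->idx;

	emit(A64_NOP, ctx);		/* nothing is written, the index still moves */
	emit(A64_RET(A64_LR), ctx);
	return ctx->idx - start;	/* == 2 */
}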
|
|
|
|
|
bpf, arm64: optimize 32/64 immediate emission
Improve the JIT's emission of 64 and 32 bit immediates: the current
algorithm is not optimal, and we often emit more instructions
than actually needed. arm64 has movz, movn and movk variants, but
for 64 bit immediates we currently only use movz followed by a
series of movk when needed.
For example loading ffffffffffffabab emits the following 4
instructions in the JIT today:
* movz: abab, shift: 0, result: 000000000000abab
* movk: ffff, shift: 16, result: 00000000ffffabab
* movk: ffff, shift: 32, result: 0000ffffffffabab
* movk: ffff, shift: 48, result: ffffffffffffabab
Whereas after the patch the same load only needs a single
instruction:
* movn: 5454, shift: 0, result: ffffffffffffabab
Another example where two extra instructions can be saved:
* movz: abab, shift: 0, result: 000000000000abab
* movk: 1f2f, shift: 16, result: 000000001f2fabab
* movk: ffff, shift: 32, result: 0000ffff1f2fabab
* movk: ffff, shift: 48, result: ffffffff1f2fabab
After the patch:
* movn: e0d0, shift: 16, result: ffffffff1f2fffff
* movk: abab, shift: 0, result: ffffffff1f2fabab
Another example with movz, before:
* movz: 0000, shift: 0, result: 0000000000000000
* movk: fea0, shift: 32, result: 0000fea000000000
After:
* movz: fea0, shift: 32, result: 0000fea000000000
Moreover, reuse emit_a64_mov_i() for 32 bit immediates that
are loaded via emit_a64_mov_i64(), which is a similar optimization
to the one done in 6fe8b9c1f41d ("bpf, x64: save several bytes by using
mov over movabsq when possible"). On arm64, the latter allows using
a single instruction with movn due to zero extension where
otherwise two would be needed. And last but not least, add a
missing optimization in emit_a64_mov_i() where movn is used but
the subsequent movk is not needed. With some of the Cilium programs
in use, this shrinks the needed instructions by about three
percent. Tested on Cavium ThunderX CN8890.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-14 23:22:32 +02:00
|
|
|
static inline void emit_a64_mov_i(const int is64, const int reg,
|
|
|
|
const s32 val, struct jit_ctx *ctx)
|
|
|
|
{
|
|
|
|
u16 hi = val >> 16;
|
|
|
|
u16 lo = val & 0xffff;
|
|
|
|
|
|
|
|
if (hi & 0x8000) {
|
|
|
|
if (hi == 0xffff) {
|
|
|
|
emit(A64_MOVN(is64, reg, (u16)~lo, 0), ctx);
|
|
|
|
} else {
|
|
|
|
emit(A64_MOVN(is64, reg, (u16)~hi, 16), ctx);
|
|
|
|
if (lo != 0xffff)
|
|
|
|
emit(A64_MOVK(is64, reg, lo, 0), ctx);
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
emit(A64_MOVZ(is64, reg, lo, 0), ctx);
|
|
|
|
if (hi)
|
|
|
|
emit(A64_MOVK(is64, reg, hi, 16), ctx);
|
|
|
|
}
|
|
|
|
}
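A worked example, assuming a jit_ctx *ctx is in scope: materializing -5 takes the movn path and needs a single instruction, because hi == 0xffff.
	/* -5 = 0xfffffffb: hi = 0xffff, lo = 0xfffb, so one movn is enough:
	 *   movn x10, #0x4      ->   x10 = ~0x4 = 0xfffffffffffffffb = -5
	 */
	emit_a64_mov_i(1, A64_R(10), -5, ctx);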
|
|
|
|
|
|
|
|
static int i64_i16_blocks(const u64 val, bool inverse)
|
|
|
|
{
|
|
|
|
return (((val >> 0) & 0xffff) != (inverse ? 0xffff : 0x0000)) +
|
|
|
|
(((val >> 16) & 0xffff) != (inverse ? 0xffff : 0x0000)) +
|
|
|
|
(((val >> 32) & 0xffff) != (inverse ? 0xffff : 0x0000)) +
|
|
|
|
(((val >> 48) & 0xffff) != (inverse ? 0xffff : 0x0000));
|
|
|
|
}
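A worked example of the chunk count, using the value from the commit message above:
	/* For val = 0xffffffffffffabab:
	 *   i64_i16_blocks(val, true)  == 1   (only the 0xabab chunk differs from 0xffff)
	 *   i64_i16_blocks(val, false) == 4   (every chunk differs from 0x0000)
	 * so emit_a64_mov_i64() below prefers the inverse (movn-based) encoding.
	 */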
|
|
|
|
|
2014-08-26 21:15:30 -07:00
|
|
|
static inline void emit_a64_mov_i64(const int reg, const u64 val,
|
|
|
|
struct jit_ctx *ctx)
|
|
|
|
{
|
bpf, arm64: optimize 32/64 immediate emission
2018-05-14 23:22:32 +02:00
|
|
|
u64 nrm_tmp = val, rev_tmp = ~val;
|
|
|
|
bool inverse;
|
|
|
|
int shift;
|
|
|
|
|
|
|
|
if (!(nrm_tmp >> 32))
|
|
|
|
return emit_a64_mov_i(0, reg, (u32)val, ctx);
|
|
|
|
|
|
|
|
inverse = i64_i16_blocks(nrm_tmp, true) < i64_i16_blocks(nrm_tmp, false);
|
|
|
|
shift = max(round_down((inverse ? (fls64(rev_tmp) - 1) :
|
|
|
|
(fls64(nrm_tmp) - 1)), 16), 0);
|
|
|
|
if (inverse)
|
|
|
|
emit(A64_MOVN(1, reg, (rev_tmp >> shift) & 0xffff, shift), ctx);
|
|
|
|
else
|
|
|
|
emit(A64_MOVZ(1, reg, (nrm_tmp >> shift) & 0xffff, shift), ctx);
|
|
|
|
shift -= 16;
|
|
|
|
while (shift >= 0) {
|
|
|
|
if (((nrm_tmp >> shift) & 0xffff) != (inverse ? 0xffff : 0x0000))
|
|
|
|
emit(A64_MOVK(1, reg, (nrm_tmp >> shift) & 0xffff, shift), ctx);
|
|
|
|
shift -= 16;
|
2014-08-26 21:15:30 -07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
bpf, arm64: Implement bpf_arch_text_poke() for arm64
2022-07-11 11:08:22 -04:00
|
|
|
static inline void emit_bti(u32 insn, struct jit_ctx *ctx)
|
|
|
|
{
|
|
|
|
if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
|
|
|
|
emit(insn, ctx);
|
|
|
|
}
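A hedged usage sketch matching the listings quoted in the commit messages above ("bti jc" at the prologue, "bti j" at the tail-call entry); the A64_BTI_JC and A64_BTI_J macro names are assumed from bpf_jit.h, and a jit_ctx *ctx is in scope.
	emit_bti(A64_BTI_JC, ctx);	/* prologue: landing pad for both calls and jumps */
	/* ... */
	emit_bti(A64_BTI_J, ctx);	/* tail-call entry: jump landing pad */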
|
|
|
|
|
bpf, arm64: optimize 32/64 immediate emission
2018-05-14 23:22:32 +02:00
|
|
|
/*
|
2018-11-23 18:29:02 +01:00
|
|
|
* Kernel addresses in the vmalloc space use at most 48 bits, and the
|
|
|
|
* remaining bits are guaranteed to be 0x1. So we can compose the address
|
|
|
|
* with a fixed length movn/movk/movk sequence.
|
bpf, arm64: optimize 32/64 immediate emission
2018-05-14 23:22:32 +02:00
|
|
|
*/
|
2017-12-14 17:55:16 -08:00
|
|
|
static inline void emit_addr_mov_i64(const int reg, const u64 val,
|
|
|
|
struct jit_ctx *ctx)
|
|
|
|
{
|
|
|
|
u64 tmp = val;
|
|
|
|
int shift = 0;
|
|
|
|
|
2018-11-23 18:29:02 +01:00
|
|
|
emit(A64_MOVN(1, reg, ~tmp & 0xffff, shift), ctx);
|
|
|
|
while (shift < 32) {
|
2017-12-14 17:55:16 -08:00
|
|
|
tmp >>= 16;
|
|
|
|
shift += 16;
|
|
|
|
emit(A64_MOVK(1, reg, tmp & 0xffff, shift), ctx);
|
|
|
|
}
|
|
|
|
}
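A worked example with an illustrative vmalloc-range value, assuming a jit_ctx *ctx is in scope; the helper always emits exactly three instructions.
	/* val = 0xffff800012345678 (bits 63:48 all ones, as the comment above assumes):
	 *   movn x10, #0xa987            // x10 = 0xffffffffffff5678
	 *   movk x10, #0x1234, lsl #16   // x10 = 0xffffffff12345678
	 *   movk x10, #0x8000, lsl #32   // x10 = 0xffff800012345678
	 */
	emit_addr_mov_i64(A64_R(10), 0xffff800012345678UL, ctx);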
|
|
|
|
|
bpf, arm64: Jit BPF_CALL to direct call when possible
Currently, BPF_CALL is always jited to indirect call. When target is
within the range of direct call, BPF_CALL can be jited to direct call.
For example, the following BPF_CALL
call __htab_map_lookup_elem
is always jited to indirect call:
mov x10, #0xffffffffffff18f4
movk x10, #0x821, lsl #16
movk x10, #0x8000, lsl #32
blr x10
When the address of target __htab_map_lookup_elem is within the range of
direct call, the BPF_CALL can be jited to:
bl 0xfffffffffd33bc98
This patch does such jit optimization by emitting arm64 direct calls for
BPF_CALL when possible, indirect calls otherwise.
Without this patch, the jit works as follows.
1. First pass
A. Determine jited position and size for each bpf instruction.
B. Compute the jited image size.
2. Allocate jited image with size computed in step 1.
3. Second pass
A. Adjust jump offset for jump instructions
B. Write the final image.
This works because, for a given bpf prog, regardless of where the jited
image is allocated, the jited result for each instruction is fixed. The
second pass differs from the first only in adjusting the jump offsets,
like changing "jmp imm1" to "jmp imm2", while the position and size of
the "jmp" instruction remain unchanged.
Now consider whether to jit BPF_CALL to an arm64 direct or indirect call
instruction. The choice depends solely on the jump offset: direct call
if the jump offset is within 128MB, indirect call otherwise.
For a given BPF_CALL, the target address is known, so the jump offset is
decided by the jited address of the BPF_CALL instruction. In other words,
for a given bpf prog, the jited result for each BPF_CALL is determined
by its jited address.
The jited address for a BPF_CALL is the jited image address plus the
total jited size of all preceding instructions. For a given bpf prog,
there are clearly no BPF_CALL instructions before the first BPF_CALL
instruction. Since the jited results for all instructions other
than BPF_CALL are fixed, the total jited size preceding the first
BPF_CALL is also fixed. Therefore, once the jited image is allocated,
the jited address for the first BPF_CALL is fixed.
Now that the jited result for the first BPF_CALL is fixed, the jited
results for all instructions preceding the second BPF_CALL are fixed.
So the jited address and result for the second BPF_CALL are also fixed.
Similarly, we can conclude that the jited addresses and results for all
subsequent BPF_CALL instructions are fixed.
This means that, for a given bpf prog, once the jited image is allocated,
the jited address and result for all instructions, including all BPF_CALL
instructions, are fixed.
Based on the observation, with this patch, the jit works as follows.
1. First pass
Estimate the maximum jited image size. In this pass, all BPF_CALLs
are jited to arm64 indirect calls since the jump offsets are unknown
because the jited image is not allocated.
2. Allocate jited image with size estimated in step 1.
3. Second pass
A. Determine the jited result for each BPF_CALL.
B. Determine jited address and size for each bpf instruction.
4. Third pass
A. Adjust jump offset for jump instructions.
B. Write the final image.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20240903094407.601107-1-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-03 17:44:07 +08:00
|
|
|
static bool should_emit_indirect_call(long target, const struct jit_ctx *ctx)
|
2022-07-11 11:08:23 -04:00
|
|
|
{
|
|
|
|
long offset;
|
2022-07-11 11:08:23 -04:00
|
|
|
|
|
|
|
/* when ctx->ro_image is not allocated or the target is unknown,
|
|
|
|
* emit indirect call
|
|
|
|
*/
|
|
|
|
if (!ctx->ro_image || !target)
|
|
|
|
return true;
|
|
|
|
|
|
|
|
offset = target - (long)&ctx->ro_image[ctx->idx];
|
|
|
|
return offset < -SZ_128M || offset >= SZ_128M;
|
|
|
|
}
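The range test above is the standard arm64 bl reach check. As a standalone sketch (illustrative only, not kernel code; addresses and values are hypothetical), it can be exercised like this:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SZ_128M (128 * 1024 * 1024LL)

/* Same +/-128MB window test that should_emit_indirect_call() applies. */
static bool out_of_bl_range(int64_t pc, int64_t target)
{
	int64_t offset = target - pc;

	return offset < -SZ_128M || offset >= SZ_128M;
}

int main(void)
{
	int64_t pc = 0x0000ffff80000000LL;	/* hypothetical patch site */

	/* prints 0: target is close enough for a direct bl */
	printf("%d\n", out_of_bl_range(pc, pc + 0x123456));
	/* prints 1: an offset of exactly 128MB is out of range, needs blr */
	printf("%d\n", out_of_bl_range(pc, pc + SZ_128M));
	return 0;
}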
|
|
|
|
|
|
|
|
static void emit_direct_call(u64 target, struct jit_ctx *ctx)
|
|
|
|
{
|
|
|
|
u32 insn;
|
|
|
|
unsigned long pc;
|
|
|
|
|
|
|
|
pc = (unsigned long)&ctx->ro_image[ctx->idx];
|
|
|
|
insn = aarch64_insn_gen_branch_imm(pc, target, AARCH64_INSN_BRANCH_LINK);
|
|
|
|
emit(insn, ctx);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void emit_indirect_call(u64 target, struct jit_ctx *ctx)
|
|
|
|
{
|
|
|
|
u8 tmp;
|
|
|
|
|
|
|
|
tmp = bpf2a64[TMP_REG_1];
|
2022-07-11 11:08:23 -04:00
|
|
|
emit_addr_mov_i64(tmp, target, ctx);
|
|
|
|
emit(A64_BLR(tmp), ctx);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void emit_call(u64 target, struct jit_ctx *ctx)
|
|
|
|
{
|
|
|
|
if (should_emit_indirect_call((long)target, ctx))
|
|
|
|
emit_indirect_call(target, ctx);
|
|
|
|
else
|
|
|
|
emit_direct_call(target, ctx);
|
|
|
|
}
|
|
|
|
|
arm64: bpf: Fix branch offset in JIT
Running the eBPF test_verifier leads to random errors looking like this:
[ 6525.735488] Unexpected kernel BRK exception at EL1
[ 6525.735502] Internal error: ptrace BRK handler: f2000100 [#1] SMP
[ 6525.741609] Modules linked in: nls_utf8 cifs libdes libarc4 dns_resolver fscache binfmt_misc nls_ascii nls_cp437 vfat fat aes_ce_blk crypto_simd cryptd aes_ce_cipher ghash_ce gf128mul efi_pstore sha2_ce sha256_arm64 sha1_ce evdev efivars efivarfs ip_tables x_tables autofs4 btrfs blake2b_generic xor xor_neon zstd_compress raid6_pq libcrc32c crc32c_generic ahci xhci_pci libahci xhci_hcd igb libata i2c_algo_bit nvme realtek usbcore nvme_core scsi_mod t10_pi netsec mdio_devres of_mdio gpio_keys fixed_phy libphy gpio_mb86s7x
[ 6525.787760] CPU: 3 PID: 7881 Comm: test_verifier Tainted: G W 5.9.0-rc1+ #47
[ 6525.796111] Hardware name: Socionext SynQuacer E-series DeveloperBox, BIOS build #1 Jun 6 2020
[ 6525.804812] pstate: 20000005 (nzCv daif -PAN -UAO BTYPE=--)
[ 6525.810390] pc : bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
[ 6525.815613] lr : bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
[ 6525.820832] sp : ffff8000130cbb80
[ 6525.824141] x29: ffff8000130cbbb0 x28: 0000000000000000
[ 6525.829451] x27: 000005ef6fcbf39b x26: 0000000000000000
[ 6525.834759] x25: ffff8000130cbb80 x24: ffff800011dc7038
[ 6525.840067] x23: ffff8000130cbd00 x22: ffff0008f624d080
[ 6525.845375] x21: 0000000000000001 x20: ffff800011dc7000
[ 6525.850682] x19: 0000000000000000 x18: 0000000000000000
[ 6525.855990] x17: 0000000000000000 x16: 0000000000000000
[ 6525.861298] x15: 0000000000000000 x14: 0000000000000000
[ 6525.866606] x13: 0000000000000000 x12: 0000000000000000
[ 6525.871913] x11: 0000000000000001 x10: ffff8000000a660c
[ 6525.877220] x9 : ffff800010951810 x8 : ffff8000130cbc38
[ 6525.882528] x7 : 0000000000000000 x6 : 0000009864cfa881
[ 6525.887836] x5 : 00ffffffffffffff x4 : 002880ba1a0b3e9f
[ 6525.893144] x3 : 0000000000000018 x2 : ffff8000000a4374
[ 6525.898452] x1 : 000000000000000a x0 : 0000000000000009
[ 6525.903760] Call trace:
[ 6525.906202] bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
[ 6525.911076] bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
[ 6525.915957] bpf_dispatcher_xdp_func+0x14/0x20
[ 6525.920398] bpf_test_run+0x70/0x1b0
[ 6525.923969] bpf_prog_test_run_xdp+0xec/0x190
[ 6525.928326] __do_sys_bpf+0xc88/0x1b28
[ 6525.932072] __arm64_sys_bpf+0x24/0x30
[ 6525.935820] el0_svc_common.constprop.0+0x70/0x168
[ 6525.940607] do_el0_svc+0x28/0x88
[ 6525.943920] el0_sync_handler+0x88/0x190
[ 6525.947838] el0_sync+0x140/0x180
[ 6525.951154] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
[ 6525.957249] ---[ end trace cecc3f93b14927e2 ]---
The reason is the offset[] creation and later usage, while building
the eBPF body. The code currently omits the first instruction, since
build_insn() will increase our ctx->idx before saving it.
That was fine up until bounded eBPF loops were introduced. After that
introduction, offset[0] must be the offset of the end of the prologue, which
is the start of the 1st insn, while offset[n] holds the
offset of the end of the n-th insn.
When "taken loop with back jump to 1st insn" test runs, it will
eventually call bpf2a64_offset(-1, 2, ctx). Since negative indexing is
permitted, the current outcome depends on the value stored in
ctx->offset[-1], which has nothing to do with our array.
If the value happens to be 0 the tests will work. If not this error
triggers.
commit 7c2e988f400e ("bpf: fix x64 JIT code generation for jmp to 1st insn")
fixed an identical bug on x86 when eBPF bounded loops were introduced.
So let's fix it by creating the ctx->offset[] differently. Track the
beginning of instruction and account for the extra instruction while
calculating the arm instruction offsets.
Fixes: 2589726d12a1 ("bpf: introduce bounded loops")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Jiri Olsa <jolsa@kernel.org>
Co-developed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Co-developed-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20200917084925.177348-1-ilias.apalodimas@linaro.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2020-09-17 11:49:25 +03:00
|
|
|
static inline int bpf2a64_offset(int bpf_insn, int off,
|
2014-08-26 21:15:30 -07:00
|
|
|
const struct jit_ctx *ctx)
|
|
|
|
{
|
|
|
|
/* BPF JMP offset is relative to the next instruction */
|
|
|
|
bpf_insn++;
|
|
|
|
/*
|
|
|
|
* Whereas arm64 branch instructions encode the offset
|
|
|
|
* from the branch itself, so we must subtract 1 from the
|
|
|
|
* instruction offset.
|
|
|
|
*/
|
|
|
|
return ctx->offset[bpf_insn + off] - (ctx->offset[bpf_insn] - 1);
|
2014-08-26 21:15:30 -07:00
|
|
|
}
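A worked example with hypothetical offsets helps here:
/*
 * Assume ctx->offset[] = {2, 3, 5, 6}: the prologue ends at arm64 index 2,
 * bpf insn 0 ends at 3, insn 1 at 5 and insn 2 at 6 (so insn 2 is the
 * single arm64 instruction at index 5). A branch at bpf insn 2 with
 * off = -2 targets bpf insn 1:
 *
 *   bpf2a64_offset(2, -2, ctx) = ctx->offset[3 - 2] - (ctx->offset[3] - 1)
 *                              = 3 - (6 - 1)
 *                              = -2
 *
 * i.e. the arm64 branch at index 5 jumps back two instructions to index 3,
 * the first arm64 instruction emitted for bpf insn 1.
 */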
|
|
|
|
|
2014-09-16 08:48:50 +01:00
|
|
|
static void jit_fill_hole(void *area, unsigned int size)
|
|
|
|
{
|
2017-06-28 16:58:03 +02:00
|
|
|
__le32 *ptr;
|
2014-09-16 08:48:50 +01:00
|
|
|
/* We are guaranteed to have aligned memory. */
|
|
|
|
for (ptr = area; size >= sizeof(u32); size -= sizeof(u32))
|
|
|
|
*ptr++ = cpu_to_le32(AARCH64_BREAK_FAULT);
|
|
|
|
}
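For illustration, and consistent with the oops quoted earlier:
/*
 * Filling a 16-byte hole produces four identical words:
 *
 *   d4202000 d4202000 d4202000 d4202000   // AARCH64_BREAK_FAULT (A64 BRK)
 *
 * the same pattern shown in the "Code:" line of the oops above, so a stray
 * branch into JIT padding traps instead of executing stale bytes.
 */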
|
|
|
|
|
2024-02-28 14:18:24 +00:00
|
|
|
int bpf_arch_text_invalidate(void *dst, size_t len)
|
|
|
|
{
|
|
|
|
if (!aarch64_insn_set(dst, AARCH64_BREAK_FAULT, len))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2014-08-26 21:15:30 -07:00
|
|
|
static inline int epilogue_offset(const struct jit_ctx *ctx)
|
|
|
|
{
|
2014-12-03 08:38:01 +00:00
|
|
|
int to = ctx->epilogue_offset;
|
|
|
|
int from = ctx->idx;
|
2014-08-26 21:15:30 -07:00
|
|
|
|
|
|
|
return to - from;
|
|
|
|
}
|
|
|
|
|
bpf, arm64: Optimize ADD,SUB,JMP BPF_K using arm64 add/sub immediates
The current code for BPF_{ADD,SUB} BPF_K loads the BPF immediate to a
temporary register before performing the addition/subtraction. Similarly,
BPF_JMP BPF_K cases load the immediate to a temporary register before
comparison.
This patch introduces optimizations that use arm64 immediate add, sub,
cmn, or cmp instructions when the BPF immediate fits. If the immediate
does not fit, it falls back to using a temporary register.
Example of generated code for BPF_ALU64_IMM(BPF_ADD, R0, 2):
without optimization:
24: mov x10, #0x2
28: add x7, x7, x10
with optimization:
24: add x7, x7, #0x2
The code could use A64_{ADD,SUB}_I directly and check if it returns
AARCH64_BREAK_FAULT, similar to how logical immediates are handled.
However, aarch64_insn_gen_add_sub_imm from insn.c prints error messages
when the immediate does not fit, and it's simpler to check if the
immediate fits ahead of time.
Co-developed-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20200508181547.24783-4-luke.r.nels@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
2020-05-08 11:15:46 -07:00
|
|
|
static bool is_addsub_imm(u32 imm)
|
|
|
|
{
|
|
|
|
/* Either imm12 or shifted imm12. */
|
|
|
|
return !(imm & ~0xfff) || !(imm & ~0xfff000);
|
|
|
|
}
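A few illustrative values, derived directly from the mask test above:
/*
 * is_addsub_imm(0x2)      -> true   (plain imm12)
 * is_addsub_imm(0xfff)    -> true   (largest plain imm12)
 * is_addsub_imm(0x123000) -> true   (imm12 shifted left by 12)
 * is_addsub_imm(0x1234)   -> false  (needs bits both below and above bit 12,
 *                                    so the JIT falls back to a tmp register)
 */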
|
|
|
|
|
bpf, arm64: Optimize BPF store/load using arm64 str/ldr(immediate offset)
The current BPF store/load instruction is translated by the JIT into two
instructions. The first instruction moves the immediate offset into a
temporary register. The second instruction uses this temporary register
to do the real store/load.
In fact, arm64 supports addressing with immediate offsets. So this patch
introduces an optimization that uses the arm64 str/ldr instruction with an
immediate offset when the offset fits.
Example of generated instruction for r2 = *(u64 *)(r1 + 0):
without optimization:
mov x10, 0
ldr x1, [x0, x10]
with optimization:
ldr x1, [x0, 0]
If the offset is negative, not aligned correctly, or exceeds the max
value, fall back to using a temporary register.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220321152852.2334294-3-xukuohai@huawei.com
2022-03-21 11:28:49 -04:00
|
|
|
/*
|
|
|
|
* There are 3 types of AArch64 LDR/STR (immediate) instruction:
|
|
|
|
* Post-index, Pre-index, Unsigned offset.
|
|
|
|
*
|
|
|
|
* For BPF ldr/str, the "unsigned offset" type is sufficient.
|
|
|
|
*
|
|
|
|
* "Unsigned offset" type LDR(immediate) format:
|
|
|
|
*
|
|
|
|
* 3 2 1 0
|
|
|
|
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
|
|
|
|
* +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
|
|
* |x x|1 1 1 0 0 1 0 1| imm12 | Rn | Rt |
|
|
|
|
* +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
|
|
* scale
|
|
|
|
*
|
|
|
|
* "Unsigned offset" type STR(immediate) format:
|
|
|
|
* 3 2 1 0
|
|
|
|
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
|
|
|
|
* +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
|
|
* |x x|1 1 1 0 0 1 0 0| imm12 | Rn | Rt |
|
|
|
|
* +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
|
|
* scale
|
|
|
|
*
|
|
|
|
* The offset is calculated from imm12 and scale in the following way:
|
|
|
|
*
|
|
|
|
* offset = (u64)imm12 << scale
|
|
|
|
*/
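A quick numeric example of the formula above, assuming the usual case where scale is log2 of the access size:
/*
 * For an 8-byte (BPF_DW) access, scale = 3:
 *   imm12 = 3     encodes offset = 3 << 3     = 24
 *   imm12 = 0xfff encodes offset = 0xfff << 3 = 32760,
 * the largest unsigned offset reachable for that access size.
 */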
|
bpf, arm64: Adjust the offset of str/ldr(immediate) to positive number
The BPF STX/LDX instruction uses offset relative to the FP to address
stack space. Since the BPF_FP locates at the top of the frame, the offset
is usually a negative number. However, arm64 str/ldr immediate instruction
requires that offset be a positive number. Therefore, this patch tries to
convert the offsets.
The method is to first find the negative offset furthest from the FP.
Then add it to the FP, calculate a bottom position, called FPB, and then
adjust the offsets in other STR/LDX instructions relative to FPB.
FPB is saved using the callee-saved register x27 of arm64 which is not
used yet.
Before adjusting the offset, the patch checks every instruction to ensure
that the FP does not change at run time. If the FP may change, no offset
is adjusted.
For example, for the following bpftrace command:
bpftrace -e 'kprobe:do_sys_open { printf("opening: %s\n", str(arg1)); }'
Without this patch, jited code(fragment):
0: bti c
4: stp x29, x30, [sp, #-16]!
8: mov x29, sp
c: stp x19, x20, [sp, #-16]!
10: stp x21, x22, [sp, #-16]!
14: stp x25, x26, [sp, #-16]!
18: mov x25, sp
1c: mov x26, #0x0 // #0
20: bti j
24: sub sp, sp, #0x90
28: add x19, x0, #0x0
2c: mov x0, #0x0 // #0
30: mov x10, #0xffffffffffffff78 // #-136
34: str x0, [x25, x10]
38: mov x10, #0xffffffffffffff80 // #-128
3c: str x0, [x25, x10]
40: mov x10, #0xffffffffffffff88 // #-120
44: str x0, [x25, x10]
48: mov x10, #0xffffffffffffff90 // #-112
4c: str x0, [x25, x10]
50: mov x10, #0xffffffffffffff98 // #-104
54: str x0, [x25, x10]
58: mov x10, #0xffffffffffffffa0 // #-96
5c: str x0, [x25, x10]
60: mov x10, #0xffffffffffffffa8 // #-88
64: str x0, [x25, x10]
68: mov x10, #0xffffffffffffffb0 // #-80
6c: str x0, [x25, x10]
70: mov x10, #0xffffffffffffffb8 // #-72
74: str x0, [x25, x10]
78: mov x10, #0xffffffffffffffc0 // #-64
7c: str x0, [x25, x10]
80: mov x10, #0xffffffffffffffc8 // #-56
84: str x0, [x25, x10]
88: mov x10, #0xffffffffffffffd0 // #-48
8c: str x0, [x25, x10]
90: mov x10, #0xffffffffffffffd8 // #-40
94: str x0, [x25, x10]
98: mov x10, #0xffffffffffffffe0 // #-32
9c: str x0, [x25, x10]
a0: mov x10, #0xffffffffffffffe8 // #-24
a4: str x0, [x25, x10]
a8: mov x10, #0xfffffffffffffff0 // #-16
ac: str x0, [x25, x10]
b0: mov x10, #0xfffffffffffffff8 // #-8
b4: str x0, [x25, x10]
b8: mov x10, #0x8 // #8
bc: ldr x2, [x19, x10]
[...]
With this patch, jited code(fragment):
0: bti c
4: stp x29, x30, [sp, #-16]!
8: mov x29, sp
c: stp x19, x20, [sp, #-16]!
10: stp x21, x22, [sp, #-16]!
14: stp x25, x26, [sp, #-16]!
18: stp x27, x28, [sp, #-16]!
1c: mov x25, sp
20: sub x27, x25, #0x88
24: mov x26, #0x0 // #0
28: bti j
2c: sub sp, sp, #0x90
30: add x19, x0, #0x0
34: mov x0, #0x0 // #0
38: str x0, [x27]
3c: str x0, [x27, #8]
40: str x0, [x27, #16]
44: str x0, [x27, #24]
48: str x0, [x27, #32]
4c: str x0, [x27, #40]
50: str x0, [x27, #48]
54: str x0, [x27, #56]
58: str x0, [x27, #64]
5c: str x0, [x27, #72]
60: str x0, [x27, #80]
64: str x0, [x27, #88]
68: str x0, [x27, #96]
6c: str x0, [x27, #104]
70: str x0, [x27, #112]
74: str x0, [x27, #120]
78: str x0, [x27, #128]
7c: ldr x2, [x19, #8]
[...]
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220321152852.2334294-4-xukuohai@huawei.com
2022-03-21 11:28:50 -04:00
|
|
|
static bool is_lsi_offset(int offset, int scale)
|
|
|
|
{
|
|
|
|
if (offset < 0)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if (offset > (0xFFF << scale))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if (offset & ((1 << scale) - 1))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
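Sample inputs for an 8-byte access (scale = 3), matching the three checks above:
/*
 * is_lsi_offset(8, 3)     -> true   (imm12 = 1)
 * is_lsi_offset(32760, 3) -> true   (imm12 = 0xfff, the maximum)
 * is_lsi_offset(32768, 3) -> false  (exceeds 0xfff << 3)
 * is_lsi_offset(12, 3)    -> false  (not 8-byte aligned)
 * is_lsi_offset(-8, 3)    -> false  (negative; handled by the FPB
 *                                    adjustment described above)
 *
 * When the offset does not qualify, the JIT falls back to the
 * mov-immediate + register-offset form shown in the commit message.
 */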
|
|
|
|
|
bpf, arm64: Avoid blindly saving/restoring all callee-saved registers
The arm64 jit blindly saves/restores all callee-saved registers, making
the jited result look a bit too complicated. For example, for an empty
prog, the jited result is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp x19, x20, [sp, #-16]!
1c: stp x21, x22, [sp, #-16]!
20: stp x26, x25, [sp, #-16]!
24: mov x26, #0
28: stp x26, x25, [sp, #-16]!
2c: mov x26, sp
30: stp x27, x28, [sp, #-16]!
34: mov x25, sp
38: bti j // tailcall target
3c: sub sp, sp, #0
40: mov x7, #0
44: add sp, sp, #0
48: ldp x27, x28, [sp], #16
4c: ldp x26, x25, [sp], #16
50: ldp x26, x25, [sp], #16
54: ldp x21, x22, [sp], #16
58: ldp x19, x20, [sp], #16
5c: ldp fp, lr, [sp], #16
60: mov x0, x7
64: autiasp
68: ret
Clearly, there is no need to save/restore unused callee-saved registers.
This patch changes that, making the jited image save/restore only
the callee-saved registers it uses.
Now the jited result of empty prog is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp xzr, x26, [sp, #-16]!
1c: mov x26, sp
20: bti j // tailcall target
24: mov x7, #0
28: ldp xzr, x26, [sp], #16
2c: ldp fp, lr, [sp], #16
30: mov x0, x7
34: autiasp
38: ret
Since bpf prog saves/restores its own callee-saved registers as needed,
to make tailcall work correctly, the caller needs to restore its saved
registers before tailcall, and the callee needs to save its callee-saved
registers after tailcall. These extra restore/save instructions
increase performance overhead.
[1] provides 2 benchmarks for tailcall scenarios. Below is the perf
number measured in an arm64 KVM guest. The result indicates that the
performance difference before and after the patch in typical tailcall
scenarios is negligible.
- Before:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4313.43 msec task-clock # 0.874 CPUs utilized ( +- 0.16% )
574 context-switches # 133.073 /sec ( +- 1.14% )
0 cpu-migrations # 0.000 /sec
538 page-faults # 124.727 /sec ( +- 0.57% )
10697772784 cycles # 2.480 GHz ( +- 0.22% ) (61.19%)
25511241955 instructions # 2.38 insn per cycle ( +- 0.08% ) (66.70%)
5108910557 branches # 1.184 G/sec ( +- 0.08% ) (72.38%)
2800459 branch-misses # 0.05% of all branches ( +- 0.51% ) (72.36%)
TopDownL1 # 0.60 retiring ( +- 0.09% ) (66.84%)
# 0.21 frontend_bound ( +- 0.15% ) (61.31%)
# 0.12 bad_speculation ( +- 0.08% ) (50.11%)
# 0.07 backend_bound ( +- 0.16% ) (33.30%)
8274201819 L1-dcache-loads # 1.918 G/sec ( +- 0.18% ) (33.15%)
468268 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 4.69% ) (33.16%)
385383 LLC-loads # 89.345 K/sec ( +- 5.22% ) (33.16%)
38296 LLC-load-misses # 9.94% of all LL-cache accesses ( +- 42.52% ) (38.69%)
6886576501 L1-icache-loads # 1.597 G/sec ( +- 0.35% ) (38.69%)
1848585 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 4.52% ) (44.23%)
9043645883 dTLB-loads # 2.097 G/sec ( +- 0.10% ) (44.33%)
416672 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 5.15% ) (49.89%)
6925626111 iTLB-loads # 1.606 G/sec ( +- 0.35% ) (55.46%)
66220 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.88% ) (55.50%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9372 +- 0.0526 seconds time elapsed ( +- 1.07% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10924.50 msec task-clock # 0.945 CPUs utilized ( +- 0.08% )
603 context-switches # 55.197 /sec ( +- 1.13% )
0 cpu-migrations # 0.000 /sec
566 page-faults # 51.810 /sec ( +- 0.42% )
27381270695 cycles # 2.506 GHz ( +- 0.18% ) (60.46%)
56996583922 instructions # 2.08 insn per cycle ( +- 0.21% ) (66.11%)
10321647567 branches # 944.816 M/sec ( +- 0.17% ) (71.79%)
3347735 branch-misses # 0.03% of all branches ( +- 3.72% ) (72.15%)
TopDownL1 # 0.52 retiring ( +- 0.13% ) (66.74%)
# 0.27 frontend_bound ( +- 0.14% ) (61.27%)
# 0.14 bad_speculation ( +- 0.19% ) (50.36%)
# 0.07 backend_bound ( +- 0.42% ) (33.89%)
18740797617 L1-dcache-loads # 1.715 G/sec ( +- 0.43% ) (33.71%)
13715669 L1-dcache-load-misses # 0.07% of all L1-dcache accesses ( +- 32.85% ) (33.34%)
4087551 LLC-loads # 374.164 K/sec ( +- 29.53% ) (33.26%)
267906 LLC-load-misses # 6.55% of all LL-cache accesses ( +- 23.90% ) (38.76%)
15811864229 L1-icache-loads # 1.447 G/sec ( +- 0.12% ) (38.73%)
2976833 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 9.73% ) (44.22%)
20138907471 dTLB-loads # 1.843 G/sec ( +- 0.18% ) (44.15%)
732850 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 11.18% ) (49.64%)
15895726702 iTLB-loads # 1.455 G/sec ( +- 0.15% ) (55.13%)
152075 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 4.71% ) (54.98%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.5613 +- 0.0317 seconds time elapsed ( +- 0.27% )
- After:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4278.78 msec task-clock # 0.871 CPUs utilized ( +- 0.15% )
569 context-switches # 132.982 /sec ( +- 0.58% )
0 cpu-migrations # 0.000 /sec
539 page-faults # 125.970 /sec ( +- 0.43% )
10588986432 cycles # 2.475 GHz ( +- 0.20% ) (60.91%)
25303825043 instructions # 2.39 insn per cycle ( +- 0.08% ) (66.48%)
5110756256 branches # 1.194 G/sec ( +- 0.07% ) (72.03%)
2719569 branch-misses # 0.05% of all branches ( +- 2.42% ) (72.03%)
TopDownL1 # 0.60 retiring ( +- 0.22% ) (66.31%)
# 0.22 frontend_bound ( +- 0.21% ) (60.83%)
# 0.12 bad_speculation ( +- 0.26% ) (50.25%)
# 0.06 backend_bound ( +- 0.17% ) (33.52%)
8163648527 L1-dcache-loads # 1.908 G/sec ( +- 0.33% ) (33.52%)
694979 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 30.53% ) (33.52%)
1902347 LLC-loads # 444.600 K/sec ( +- 48.84% ) (33.69%)
96677 LLC-load-misses # 5.08% of all LL-cache accesses ( +- 43.48% ) (39.30%)
6863517589 L1-icache-loads # 1.604 G/sec ( +- 0.37% ) (39.17%)
1871519 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 6.78% ) (44.56%)
8927782813 dTLB-loads # 2.087 G/sec ( +- 0.14% ) (44.37%)
438237 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 6.00% ) (49.75%)
6886906831 iTLB-loads # 1.610 G/sec ( +- 0.36% ) (55.08%)
67568 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 3.27% ) (54.86%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9114 +- 0.0309 seconds time elapsed ( +- 0.63% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10948.40 msec task-clock # 0.942 CPUs utilized ( +- 0.05% )
615 context-switches # 56.173 /sec ( +- 1.65% )
1 cpu-migrations # 0.091 /sec ( +- 31.62% )
567 page-faults # 51.788 /sec ( +- 0.44% )
27334194328 cycles # 2.497 GHz ( +- 0.08% ) (61.05%)
56656528828 instructions # 2.07 insn per cycle ( +- 0.08% ) (66.67%)
10270389422 branches # 938.072 M/sec ( +- 0.10% ) (72.21%)
3453837 branch-misses # 0.03% of all branches ( +- 3.75% ) (72.27%)
TopDownL1 # 0.52 retiring ( +- 0.16% ) (66.55%)
# 0.27 frontend_bound ( +- 0.09% ) (60.91%)
# 0.14 bad_speculation ( +- 0.08% ) (49.85%)
# 0.07 backend_bound ( +- 0.16% ) (33.33%)
18982866028 L1-dcache-loads # 1.734 G/sec ( +- 0.24% ) (33.34%)
8802454 L1-dcache-load-misses # 0.05% of all L1-dcache accesses ( +- 52.30% ) (33.31%)
2612962 LLC-loads # 238.661 K/sec ( +- 29.78% ) (33.45%)
264107 LLC-load-misses # 10.11% of all LL-cache accesses ( +- 18.34% ) (39.07%)
15793205997 L1-icache-loads # 1.443 G/sec ( +- 0.15% ) (39.09%)
3930802 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 3.72% ) (44.66%)
20097828496 dTLB-loads # 1.836 G/sec ( +- 0.09% ) (44.68%)
961757 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 3.32% ) (50.15%)
15838728506 iTLB-loads # 1.447 G/sec ( +- 0.09% ) (55.62%)
167652 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.28% ) (55.52%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.6173 +- 0.0268 seconds time elapsed ( +- 0.23% )
[1] https://lore.kernel.org/bpf/20200724123644.5096-1-maciej.fijalkowski@intel.com/
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20240826071624.350108-3-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-08-26 15:16:24 +08:00
|
|
|
/* generated main prog prologue:
|
bpf, arm64: Implement bpf_arch_text_poke() for arm64
Implement bpf_arch_text_poke() for arm64, so bpf prog or bpf trampoline
can be patched with it.
When the target address is NULL, the original instruction is patched to
a NOP.
When the target address and the source address are within the branch
range, the original instruction is patched to a bl instruction to the
target address directly.
To support attaching bpf trampoline to both regular kernel function and
bpf prog, we follow the ftrace patchsite way for bpf prog. That is, two
instructions are inserted at the beginning of bpf prog, the first one
saves the return address to x9, and the second is a nop which will be
patched to a bl instruction when a bpf trampoline is attached.
However, when a bpf trampoline is attached to bpf prog, the distance
between target address and source address may exceed 128MB, the maximum
branch range, because bpf trampoline and bpf prog are allocated
separately with vmalloc. So long jumps must be handled.
When a bpf prog is constructed, a plt pointing to empty trampoline
dummy_tramp is placed at the end:
bpf_prog:
mov x9, lr
nop // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad dummy_tramp // plt target
This is also the state when no trampoline is attached.
When a short-jump bpf trampoline is attached, the patchsite is patched to
a bl instruction to the trampoline directly:
bpf_prog:
mov x9, lr
bl <short-jump bpf trampoline address> // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad dummy_tramp // plt target
When a long-jump bpf trampoline is attached, the plt target is filled with
the trampoline address and the patchsite is patched to a bl instruction to
the plt:
bpf_prog:
mov x9, lr
bl plt // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad <long-jump bpf trampoline address>
dummy_tramp is used to prevent another CPU from jumping to an unknown
location during the patching process, making the patching process easier.
The patching process is as follows:
1. when neither the old address nor the new address is a long jump, the
patchsite is replaced with a bl to the new address, or nop if the new
address is NULL;
2. when the old address is not a long jump but the new one is, the
branch target address is written to plt first, then the patchsite
is replaced with a bl instruction to the plt;
3. when the old address is a long jump but the new one is not, the address
of dummy_tramp is written to plt first, then the patchsite is replaced
with a bl to the new address, or a nop if the new address is NULL;
4. when both the old address and the new address are long jump, the
new address is written to plt and the patchsite is not changed.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20220711150823.2128542-4-xukuohai@huawei.com
2022-07-11 11:08:22 -04:00
|
|
|
* bti c // if CONFIG_ARM64_BTI_KERNEL
|
|
|
|
* mov x9, lr
|
|
|
|
* nop // POKE_OFFSET
|
|
|
|
* paciasp // if CONFIG_ARM64_PTR_AUTH_KERNEL
|
|
|
|
* stp x29, lr, [sp, #-16]!
|
|
|
|
* mov x29, sp
|
|
|
|
* stp xzr, x26, [sp, #-16]!
|
|
|
|
* mov x26, sp
|
|
|
|
* // PROLOGUE_OFFSET
|
bpf, arm64: Avoid blindly saving/restoring all callee-saved registers
The arm64 jit blindly saves/restores all callee-saved registers, making
the jited result look a bit too complicated. For example, for an empty
prog, the jited result is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp x19, x20, [sp, #-16]!
1c: stp x21, x22, [sp, #-16]!
20: stp x26, x25, [sp, #-16]!
24: mov x26, #0
28: stp x26, x25, [sp, #-16]!
2c: mov x26, sp
30: stp x27, x28, [sp, #-16]!
34: mov x25, sp
38: bti j // tailcall target
3c: sub sp, sp, #0
40: mov x7, #0
44: add sp, sp, #0
48: ldp x27, x28, [sp], #16
4c: ldp x26, x25, [sp], #16
50: ldp x26, x25, [sp], #16
54: ldp x21, x22, [sp], #16
58: ldp x19, x20, [sp], #16
5c: ldp fp, lr, [sp], #16
60: mov x0, x7
64: autiasp
68: ret
Clearly, there is no need to save/restore unused callee-saved registers.
This patch changes that, making the jited image save/restore only the
callee-saved registers it actually uses.
Now the jited result of an empty prog is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp xzr, x26, [sp, #-16]!
1c: mov x26, sp
20: bti j // tailcall target
24: mov x7, #0
28: ldp xzr, x26, [sp], #16
2c: ldp fp, lr, [sp], #16
30: mov x0, x7
34: autiasp
38: ret
Since a bpf prog saves/restores its own callee-saved registers as needed,
to make tailcall work correctly, the caller needs to restore its saved
registers before the tailcall, and the callee needs to save its callee-saved
registers after the tailcall. These extra restore/save instructions
increase performance overhead.
[1] provides 2 benchmarks for tailcall scenarios. Below are the perf
numbers measured in an arm64 KVM guest. The results indicate that the
performance difference before and after the patch in typical tailcall
scenarios is negligible.
- Before:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4313.43 msec task-clock # 0.874 CPUs utilized ( +- 0.16% )
574 context-switches # 133.073 /sec ( +- 1.14% )
0 cpu-migrations # 0.000 /sec
538 page-faults # 124.727 /sec ( +- 0.57% )
10697772784 cycles # 2.480 GHz ( +- 0.22% ) (61.19%)
25511241955 instructions # 2.38 insn per cycle ( +- 0.08% ) (66.70%)
5108910557 branches # 1.184 G/sec ( +- 0.08% ) (72.38%)
2800459 branch-misses # 0.05% of all branches ( +- 0.51% ) (72.36%)
TopDownL1 # 0.60 retiring ( +- 0.09% ) (66.84%)
# 0.21 frontend_bound ( +- 0.15% ) (61.31%)
# 0.12 bad_speculation ( +- 0.08% ) (50.11%)
# 0.07 backend_bound ( +- 0.16% ) (33.30%)
8274201819 L1-dcache-loads # 1.918 G/sec ( +- 0.18% ) (33.15%)
468268 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 4.69% ) (33.16%)
385383 LLC-loads # 89.345 K/sec ( +- 5.22% ) (33.16%)
38296 LLC-load-misses # 9.94% of all LL-cache accesses ( +- 42.52% ) (38.69%)
6886576501 L1-icache-loads # 1.597 G/sec ( +- 0.35% ) (38.69%)
1848585 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 4.52% ) (44.23%)
9043645883 dTLB-loads # 2.097 G/sec ( +- 0.10% ) (44.33%)
416672 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 5.15% ) (49.89%)
6925626111 iTLB-loads # 1.606 G/sec ( +- 0.35% ) (55.46%)
66220 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.88% ) (55.50%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9372 +- 0.0526 seconds time elapsed ( +- 1.07% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10924.50 msec task-clock # 0.945 CPUs utilized ( +- 0.08% )
603 context-switches # 55.197 /sec ( +- 1.13% )
0 cpu-migrations # 0.000 /sec
566 page-faults # 51.810 /sec ( +- 0.42% )
27381270695 cycles # 2.506 GHz ( +- 0.18% ) (60.46%)
56996583922 instructions # 2.08 insn per cycle ( +- 0.21% ) (66.11%)
10321647567 branches # 944.816 M/sec ( +- 0.17% ) (71.79%)
3347735 branch-misses # 0.03% of all branches ( +- 3.72% ) (72.15%)
TopDownL1 # 0.52 retiring ( +- 0.13% ) (66.74%)
# 0.27 frontend_bound ( +- 0.14% ) (61.27%)
# 0.14 bad_speculation ( +- 0.19% ) (50.36%)
# 0.07 backend_bound ( +- 0.42% ) (33.89%)
18740797617 L1-dcache-loads # 1.715 G/sec ( +- 0.43% ) (33.71%)
13715669 L1-dcache-load-misses # 0.07% of all L1-dcache accesses ( +- 32.85% ) (33.34%)
4087551 LLC-loads # 374.164 K/sec ( +- 29.53% ) (33.26%)
267906 LLC-load-misses # 6.55% of all LL-cache accesses ( +- 23.90% ) (38.76%)
15811864229 L1-icache-loads # 1.447 G/sec ( +- 0.12% ) (38.73%)
2976833 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 9.73% ) (44.22%)
20138907471 dTLB-loads # 1.843 G/sec ( +- 0.18% ) (44.15%)
732850 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 11.18% ) (49.64%)
15895726702 iTLB-loads # 1.455 G/sec ( +- 0.15% ) (55.13%)
152075 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 4.71% ) (54.98%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.5613 +- 0.0317 seconds time elapsed ( +- 0.27% )
- After:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4278.78 msec task-clock # 0.871 CPUs utilized ( +- 0.15% )
569 context-switches # 132.982 /sec ( +- 0.58% )
0 cpu-migrations # 0.000 /sec
539 page-faults # 125.970 /sec ( +- 0.43% )
10588986432 cycles # 2.475 GHz ( +- 0.20% ) (60.91%)
25303825043 instructions # 2.39 insn per cycle ( +- 0.08% ) (66.48%)
5110756256 branches # 1.194 G/sec ( +- 0.07% ) (72.03%)
2719569 branch-misses # 0.05% of all branches ( +- 2.42% ) (72.03%)
TopDownL1 # 0.60 retiring ( +- 0.22% ) (66.31%)
# 0.22 frontend_bound ( +- 0.21% ) (60.83%)
# 0.12 bad_speculation ( +- 0.26% ) (50.25%)
# 0.06 backend_bound ( +- 0.17% ) (33.52%)
8163648527 L1-dcache-loads # 1.908 G/sec ( +- 0.33% ) (33.52%)
694979 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 30.53% ) (33.52%)
1902347 LLC-loads # 444.600 K/sec ( +- 48.84% ) (33.69%)
96677 LLC-load-misses # 5.08% of all LL-cache accesses ( +- 43.48% ) (39.30%)
6863517589 L1-icache-loads # 1.604 G/sec ( +- 0.37% ) (39.17%)
1871519 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 6.78% ) (44.56%)
8927782813 dTLB-loads # 2.087 G/sec ( +- 0.14% ) (44.37%)
438237 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 6.00% ) (49.75%)
6886906831 iTLB-loads # 1.610 G/sec ( +- 0.36% ) (55.08%)
67568 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 3.27% ) (54.86%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9114 +- 0.0309 seconds time elapsed ( +- 0.63% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10948.40 msec task-clock # 0.942 CPUs utilized ( +- 0.05% )
615 context-switches # 56.173 /sec ( +- 1.65% )
1 cpu-migrations # 0.091 /sec ( +- 31.62% )
567 page-faults # 51.788 /sec ( +- 0.44% )
27334194328 cycles # 2.497 GHz ( +- 0.08% ) (61.05%)
56656528828 instructions # 2.07 insn per cycle ( +- 0.08% ) (66.67%)
10270389422 branches # 938.072 M/sec ( +- 0.10% ) (72.21%)
3453837 branch-misses # 0.03% of all branches ( +- 3.75% ) (72.27%)
TopDownL1 # 0.52 retiring ( +- 0.16% ) (66.55%)
# 0.27 frontend_bound ( +- 0.09% ) (60.91%)
# 0.14 bad_speculation ( +- 0.08% ) (49.85%)
# 0.07 backend_bound ( +- 0.16% ) (33.33%)
18982866028 L1-dcache-loads # 1.734 G/sec ( +- 0.24% ) (33.34%)
8802454 L1-dcache-load-misses # 0.05% of all L1-dcache accesses ( +- 52.30% ) (33.31%)
2612962 LLC-loads # 238.661 K/sec ( +- 29.78% ) (33.45%)
264107 LLC-load-misses # 10.11% of all LL-cache accesses ( +- 18.34% ) (39.07%)
15793205997 L1-icache-loads # 1.443 G/sec ( +- 0.15% ) (39.09%)
3930802 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 3.72% ) (44.66%)
20097828496 dTLB-loads # 1.836 G/sec ( +- 0.09% ) (44.68%)
961757 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 3.32% ) (50.15%)
15838728506 iTLB-loads # 1.447 G/sec ( +- 0.09% ) (55.62%)
167652 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.28% ) (55.52%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.6173 +- 0.0268 seconds time elapsed ( +- 0.23% )
[1] https://lore.kernel.org/bpf/20200724123644.5096-1-maciej.fijalkowski@intel.com/
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20240826071624.350108-3-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
 * // save callee-saved registers
 */
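
As a rough illustration of the change described in the message above
(a hypothetical prog; the x19 name only assumes the usual bpf2a64 mapping
of BPF r6), a prog that touches nothing but r6 now gets a single stp/ldp
pair, padded with xzr to keep SP 16-byte aligned, instead of the full set
of pairs shown in the "before" listing:

	stp	x19, xzr, [sp, #-16]!	// prologue: save the only used callee-saved reg
	...
	ldp	x19, xzr, [sp], #16	// epilogue: restore it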

bpf, arm64: Fix tailcall hierarchy
This patch fixes a tailcall issue caused by abusing the tailcall-in-bpf2bpf
feature on arm64, in the same way as "bpf, x64: Fix tailcall hierarchy".
On arm64, when a tail call happens, it uses tail_call_cnt_ptr to
increment tail_call_cnt, too.
At the prologue of main prog, it has to initialize tail_call_cnt and
prepare tail_call_cnt_ptr.
At the prologue of a subprog, it pushes the x26 register twice and does not
initialize tail_call_cnt.
At the epilogue, it pops x26 twice, no matter whether it is the main prog or
a subprog.
Fixes: d4609a5d8c70 ("bpf, arm64: Keep tail call count across bpf2bpf calls")
Acked-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
Link: https://lore.kernel.org/r/20240714123902.32305-3-hffilwlqm@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
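
Roughly, the scheme this fix settles on can be pictured as follows (an
illustration of the description above, not the jited code; x26 holds
tail_call_cnt_ptr):

	/* main prog prologue: own the counter, publish its address */
	u64 tail_call_cnt = 0;				// stp xzr, x26, [sp, #-16]!
	u64 *tail_call_cnt_ptr = &tail_call_cnt;	// mov x26, sp

	/* subprog prologue: reuse the caller's tail_call_cnt_ptr, so every
	 * prog in the same hierarchy bumps the main prog's counter; x26 is
	 * simply pushed again */			// stp x26, x26, [sp, #-16]!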

static void prepare_bpf_tail_call_cnt(struct jit_ctx *ctx)
{
	const bool is_main_prog = !bpf_is_subprog(ctx->prog);
	const u8 ptr = bpf2a64[TCCNT_PTR];

	if (is_main_prog) {
		/* Initialize tail_call_cnt. */
		emit(A64_PUSH(A64_ZR, ptr, A64_SP), ctx);
		emit(A64_MOV(1, ptr, A64_SP), ctx);
	} else
		emit(A64_PUSH(ptr, ptr, A64_SP), ctx);
}
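
For reference, what this helper emits at the instruction level (taking x26
as bpf2a64[TCCNT_PTR], as in the listings above; an illustrative sketch,
not a dump of the jited image):

	// main prog: allocate tail_call_cnt on the stack, point x26 at it
	stp	xzr, x26, [sp, #-16]!
	mov	x26, sp

	// subprog: push the inherited pointer twice so the epilogue can
	// unconditionally pop one 16-byte pair
	stp	x26, x26, [sp, #-16]!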

static void find_used_callee_regs(struct jit_ctx *ctx)
{
	int i;
	const struct bpf_prog *prog = ctx->prog;
	const struct bpf_insn *insn = &prog->insnsi[0];
	int reg_used = 0;

	for (i = 0; i < prog->len; i++, insn++) {
		if (insn->dst_reg == BPF_REG_6 || insn->src_reg == BPF_REG_6)
			reg_used |= 1;

		if (insn->dst_reg == BPF_REG_7 || insn->src_reg == BPF_REG_7)
			reg_used |= 2;

		if (insn->dst_reg == BPF_REG_8 || insn->src_reg == BPF_REG_8)
			reg_used |= 4;

		if (insn->dst_reg == BPF_REG_9 || insn->src_reg == BPF_REG_9)
			reg_used |= 8;

		if (insn->dst_reg == BPF_REG_FP || insn->src_reg == BPF_REG_FP) {
			ctx->fp_used = true;
			reg_used |= 16;
		}
	}

	i = 0;
	if (reg_used & 1)
		ctx->used_callee_reg[i++] = bpf2a64[BPF_REG_6];

	if (reg_used & 2)
		ctx->used_callee_reg[i++] = bpf2a64[BPF_REG_7];

	if (reg_used & 4)
		ctx->used_callee_reg[i++] = bpf2a64[BPF_REG_8];

	if (reg_used & 8)
		ctx->used_callee_reg[i++] = bpf2a64[BPF_REG_9];

	if (reg_used & 16)
		ctx->used_callee_reg[i++] = bpf2a64[BPF_REG_FP];

	if (ctx->arena_vm_start)
		ctx->used_callee_reg[i++] = bpf2a64[ARENA_VM_START];

	ctx->nr_used_callee_reg = i;
}
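
An example of the bookkeeping this produces (a hypothetical prog that only
touches r6 and the stack via the frame pointer, with no arena in use):

	ctx->fp_used            = true;
	ctx->used_callee_reg[0] = bpf2a64[BPF_REG_6];
	ctx->used_callee_reg[1] = bpf2a64[BPF_REG_FP];
	ctx->nr_used_callee_reg = 2;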

/* Save callee-saved registers */
static void push_callee_regs(struct jit_ctx *ctx)
{
	int reg1, reg2, i;

	/*
	 * Program acting as exception boundary should save all ARM64
	 * Callee-saved registers as the exception callback needs to recover
	 * all ARM64 Callee-saved registers in its epilogue.
	 */
	if (ctx->prog->aux->exception_boundary) {
		emit(A64_PUSH(A64_R(19), A64_R(20), A64_SP), ctx);
		emit(A64_PUSH(A64_R(21), A64_R(22), A64_SP), ctx);
		emit(A64_PUSH(A64_R(23), A64_R(24), A64_SP), ctx);
		emit(A64_PUSH(A64_R(25), A64_R(26), A64_SP), ctx);
		emit(A64_PUSH(A64_R(27), A64_R(28), A64_SP), ctx);
	} else {
		find_used_callee_regs(ctx);
		for (i = 0; i + 1 < ctx->nr_used_callee_reg; i += 2) {
			reg1 = ctx->used_callee_reg[i];
			reg2 = ctx->used_callee_reg[i + 1];
			emit(A64_PUSH(reg1, reg2, A64_SP), ctx);
		}
		if (i < ctx->nr_used_callee_reg) {
			reg1 = ctx->used_callee_reg[i];
			/* keep SP 16-byte aligned */
			emit(A64_PUSH(reg1, A64_ZR, A64_SP), ctx);
		}
	}
}
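
With three entries in used_callee_reg[], say x19, x20 and x25 (a made-up
example; the names only assume the usual r6/r7/fp mapping), the
non-exception path above emits:

	stp	x19, x20, [sp, #-16]!	// full pair
	stp	x25, xzr, [sp, #-16]!	// odd leftover, padded with xzr
					// to keep SP 16-byte aligned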

/* Restore callee-saved registers */
static void pop_callee_regs(struct jit_ctx *ctx)
{
	struct bpf_prog_aux *aux = ctx->prog->aux;
	int reg1, reg2, i;

	/*
	 * Program acting as exception boundary pushes R23 and R24 in addition
	 * to BPF callee-saved registers. Exception callback uses the boundary
	 * program's stack frame, so recover these extra registers in the above
	 * two cases.
	 */
	if (aux->exception_boundary || aux->exception_cb) {
		emit(A64_POP(A64_R(27), A64_R(28), A64_SP), ctx);
		emit(A64_POP(A64_R(25), A64_R(26), A64_SP), ctx);
		emit(A64_POP(A64_R(23), A64_R(24), A64_SP), ctx);
		emit(A64_POP(A64_R(21), A64_R(22), A64_SP), ctx);
		emit(A64_POP(A64_R(19), A64_R(20), A64_SP), ctx);
	} else {
		i = ctx->nr_used_callee_reg - 1;
		if (ctx->nr_used_callee_reg % 2 != 0) {
			reg1 = ctx->used_callee_reg[i];
			emit(A64_POP(reg1, A64_ZR, A64_SP), ctx);
			i--;
		}
		while (i > 0) {
			reg1 = ctx->used_callee_reg[i - 1];
			reg2 = ctx->used_callee_reg[i];
			emit(A64_POP(reg1, reg2, A64_SP), ctx);
			i -= 2;
		}
	}
}
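
Continuing the made-up three-register example from push_callee_regs(), the
pops mirror the pushes in reverse order, so the padded leftover comes off
the stack first:

	ldp	x25, xzr, [sp], #16
	ldp	x19, x20, [sp], #16
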
#define BTI_INSNS (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) ? 1 : 0)
#define PAC_INSNS (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL) ? 1 : 0)
/* Offset of nop instruction in bpf prog entry to be poked */
#define POKE_OFFSET (BTI_INSNS + 1)
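
Concretely, with CONFIG_ARM64_BTI_KERNEL enabled the prog entry in the
listings above starts with

	bti	jc	// instruction 0 (BTI_INSNS = 1)
	mov	x9, lr	// instruction 1
	nop		// instruction 2 == POKE_OFFSET (BTI_INSNS + 1)

so the nop that gets patched sits two instructions in; without BTI it sits
at instruction 1.
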
/* Tail call offset to jump into */
|
bpf, arm64: Avoid blindly saving/restoring all callee-saved registers
The arm64 jit blindly saves/restores all callee-saved registers, making
the jited result looks a bit too compliated. For example, for an empty
prog, the jited result is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp x19, x20, [sp, #-16]!
1c: stp x21, x22, [sp, #-16]!
20: stp x26, x25, [sp, #-16]!
24: mov x26, #0
28: stp x26, x25, [sp, #-16]!
2c: mov x26, sp
30: stp x27, x28, [sp, #-16]!
34: mov x25, sp
38: bti j // tailcall target
3c: sub sp, sp, #0
40: mov x7, #0
44: add sp, sp, #0
48: ldp x27, x28, [sp], #16
4c: ldp x26, x25, [sp], #16
50: ldp x26, x25, [sp], #16
54: ldp x21, x22, [sp], #16
58: ldp x19, x20, [sp], #16
5c: ldp fp, lr, [sp], #16
60: mov x0, x7
64: autiasp
68: ret
Clearly, there is no need to save/restore unused callee-saved registers.
This patch changes that, making the jited image save/restore only
the callee-saved registers it uses.
Now the jited result of empty prog is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp xzr, x26, [sp, #-16]!
1c: mov x26, sp
20: bti j // tailcall target
24: mov x7, #0
28: ldp xzr, x26, [sp], #16
2c: ldp fp, lr, [sp], #16
30: mov x0, x7
34: autiasp
38: ret
Since bpf prog saves/restores its own callee-saved registers as needed,
to make tailcall work correctly, the caller needs to restore its saved
registers before tailcall, and the callee needs to save its callee-saved
registers after tailcall. These extra restore/save instructions
increase performance overhead.
[1] provides 2 benchmarks for tailcall scenarios. Below is the perf
number measured in an arm64 KVM guest. The result indicates that the
performance difference before and after the patch in typical tailcall
scenarios is negligible.
- Before:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4313.43 msec task-clock # 0.874 CPUs utilized ( +- 0.16% )
574 context-switches # 133.073 /sec ( +- 1.14% )
0 cpu-migrations # 0.000 /sec
538 page-faults # 124.727 /sec ( +- 0.57% )
10697772784 cycles # 2.480 GHz ( +- 0.22% ) (61.19%)
25511241955 instructions # 2.38 insn per cycle ( +- 0.08% ) (66.70%)
5108910557 branches # 1.184 G/sec ( +- 0.08% ) (72.38%)
2800459 branch-misses # 0.05% of all branches ( +- 0.51% ) (72.36%)
TopDownL1 # 0.60 retiring ( +- 0.09% ) (66.84%)
# 0.21 frontend_bound ( +- 0.15% ) (61.31%)
# 0.12 bad_speculation ( +- 0.08% ) (50.11%)
# 0.07 backend_bound ( +- 0.16% ) (33.30%)
8274201819 L1-dcache-loads # 1.918 G/sec ( +- 0.18% ) (33.15%)
468268 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 4.69% ) (33.16%)
385383 LLC-loads # 89.345 K/sec ( +- 5.22% ) (33.16%)
38296 LLC-load-misses # 9.94% of all LL-cache accesses ( +- 42.52% ) (38.69%)
6886576501 L1-icache-loads # 1.597 G/sec ( +- 0.35% ) (38.69%)
1848585 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 4.52% ) (44.23%)
9043645883 dTLB-loads # 2.097 G/sec ( +- 0.10% ) (44.33%)
416672 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 5.15% ) (49.89%)
6925626111 iTLB-loads # 1.606 G/sec ( +- 0.35% ) (55.46%)
66220 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.88% ) (55.50%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9372 +- 0.0526 seconds time elapsed ( +- 1.07% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10924.50 msec task-clock # 0.945 CPUs utilized ( +- 0.08% )
603 context-switches # 55.197 /sec ( +- 1.13% )
0 cpu-migrations # 0.000 /sec
566 page-faults # 51.810 /sec ( +- 0.42% )
27381270695 cycles # 2.506 GHz ( +- 0.18% ) (60.46%)
56996583922 instructions # 2.08 insn per cycle ( +- 0.21% ) (66.11%)
10321647567 branches # 944.816 M/sec ( +- 0.17% ) (71.79%)
3347735 branch-misses # 0.03% of all branches ( +- 3.72% ) (72.15%)
TopDownL1 # 0.52 retiring ( +- 0.13% ) (66.74%)
# 0.27 frontend_bound ( +- 0.14% ) (61.27%)
# 0.14 bad_speculation ( +- 0.19% ) (50.36%)
# 0.07 backend_bound ( +- 0.42% ) (33.89%)
18740797617 L1-dcache-loads # 1.715 G/sec ( +- 0.43% ) (33.71%)
13715669 L1-dcache-load-misses # 0.07% of all L1-dcache accesses ( +- 32.85% ) (33.34%)
4087551 LLC-loads # 374.164 K/sec ( +- 29.53% ) (33.26%)
267906 LLC-load-misses # 6.55% of all LL-cache accesses ( +- 23.90% ) (38.76%)
15811864229 L1-icache-loads # 1.447 G/sec ( +- 0.12% ) (38.73%)
2976833 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 9.73% ) (44.22%)
20138907471 dTLB-loads # 1.843 G/sec ( +- 0.18% ) (44.15%)
732850 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 11.18% ) (49.64%)
15895726702 iTLB-loads # 1.455 G/sec ( +- 0.15% ) (55.13%)
152075 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 4.71% ) (54.98%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.5613 +- 0.0317 seconds time elapsed ( +- 0.27% )
- After:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4278.78 msec task-clock # 0.871 CPUs utilized ( +- 0.15% )
569 context-switches # 132.982 /sec ( +- 0.58% )
0 cpu-migrations # 0.000 /sec
539 page-faults # 125.970 /sec ( +- 0.43% )
10588986432 cycles # 2.475 GHz ( +- 0.20% ) (60.91%)
25303825043 instructions # 2.39 insn per cycle ( +- 0.08% ) (66.48%)
5110756256 branches # 1.194 G/sec ( +- 0.07% ) (72.03%)
2719569 branch-misses # 0.05% of all branches ( +- 2.42% ) (72.03%)
TopDownL1 # 0.60 retiring ( +- 0.22% ) (66.31%)
# 0.22 frontend_bound ( +- 0.21% ) (60.83%)
# 0.12 bad_speculation ( +- 0.26% ) (50.25%)
# 0.06 backend_bound ( +- 0.17% ) (33.52%)
8163648527 L1-dcache-loads # 1.908 G/sec ( +- 0.33% ) (33.52%)
694979 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 30.53% ) (33.52%)
1902347 LLC-loads # 444.600 K/sec ( +- 48.84% ) (33.69%)
96677 LLC-load-misses # 5.08% of all LL-cache accesses ( +- 43.48% ) (39.30%)
6863517589 L1-icache-loads # 1.604 G/sec ( +- 0.37% ) (39.17%)
1871519 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 6.78% ) (44.56%)
8927782813 dTLB-loads # 2.087 G/sec ( +- 0.14% ) (44.37%)
438237 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 6.00% ) (49.75%)
6886906831 iTLB-loads # 1.610 G/sec ( +- 0.36% ) (55.08%)
67568 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 3.27% ) (54.86%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9114 +- 0.0309 seconds time elapsed ( +- 0.63% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10948.40 msec task-clock # 0.942 CPUs utilized ( +- 0.05% )
615 context-switches # 56.173 /sec ( +- 1.65% )
1 cpu-migrations # 0.091 /sec ( +- 31.62% )
567 page-faults # 51.788 /sec ( +- 0.44% )
27334194328 cycles # 2.497 GHz ( +- 0.08% ) (61.05%)
56656528828 instructions # 2.07 insn per cycle ( +- 0.08% ) (66.67%)
10270389422 branches # 938.072 M/sec ( +- 0.10% ) (72.21%)
3453837 branch-misses # 0.03% of all branches ( +- 3.75% ) (72.27%)
TopDownL1 # 0.52 retiring ( +- 0.16% ) (66.55%)
# 0.27 frontend_bound ( +- 0.09% ) (60.91%)
# 0.14 bad_speculation ( +- 0.08% ) (49.85%)
# 0.07 backend_bound ( +- 0.16% ) (33.33%)
18982866028 L1-dcache-loads # 1.734 G/sec ( +- 0.24% ) (33.34%)
8802454 L1-dcache-load-misses # 0.05% of all L1-dcache accesses ( +- 52.30% ) (33.31%)
2612962 LLC-loads # 238.661 K/sec ( +- 29.78% ) (33.45%)
264107 LLC-load-misses # 10.11% of all LL-cache accesses ( +- 18.34% ) (39.07%)
15793205997 L1-icache-loads # 1.443 G/sec ( +- 0.15% ) (39.09%)
3930802 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 3.72% ) (44.66%)
20097828496 dTLB-loads # 1.836 G/sec ( +- 0.09% ) (44.68%)
961757 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 3.32% ) (50.15%)
15838728506 iTLB-loads # 1.447 G/sec ( +- 0.09% ) (55.62%)
167652 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.28% ) (55.52%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.6173 +- 0.0268 seconds time elapsed ( +- 0.23% )
[1] https://lore.kernel.org/bpf/20200724123644.5096-1-maciej.fijalkowski@intel.com/
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20240826071624.350108-3-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-08-26 15:16:24 +08:00
|
|
|
#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 4)
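With both CONFIG_ARM64_BTI_KERNEL and CONFIG_ARM64_PTR_AUTH_KERNEL enabled
(BTI_INSNS = PAC_INSNS = 1), the constant lines up with the "after" listing
quoted above, instruction by instruction:

/*
 * PROLOGUE_OFFSET, counted in 4-byte instructions:
 *   BTI_INSNS (1)   -> 0x00 bti jc
 *   + 2             -> 0x04 mov x9, lr ; 0x08 nop      (patchsite pair)
 *   + PAC_INSNS (1) -> 0x0c paciasp
 *   + 4             -> 0x10 stp fp, lr ; 0x14 mov fp, sp ;
 *                      0x18 stp xzr, x26 ; 0x1c mov x26, sp
 *   = 8 instructions, so the tail call target ("bti j") lands at 0x20.
 */

build_prologue() below double-checks that the instructions it actually emitted
add up to exactly this count before placing the tail call landing pad.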
|
arm64: bpf: implement bpf_tail_call() helper
Add support for JMP_CALL_X (tail call) introduced by commit 04fd61ab36ec
("bpf: allow bpf programs to tail-call other bpf programs").
bpf_tail_call() arguments:
ctx - context pointer passed to next program
array - pointer to a map of type BPF_MAP_TYPE_PROG_ARRAY
index - index inside array that selects specific program to run
In this implementation arm64 JIT jumps into callee program after prologue,
so callee program reuses the same stack. For tail_call_cnt, we use the
callee-saved R26 (which was already saved/restored but previously unused
by JIT).
With this patch a tail call generates the following code on arm64:
if (index >= array->map.max_entries)
goto out;
34: mov x10, #0x10 // #16
38: ldr w10, [x1,x10]
3c: cmp w2, w10
40: b.ge 0x0000000000000074
if (tail_call_cnt > MAX_TAIL_CALL_CNT)
goto out;
tail_call_cnt++;
44: mov x10, #0x20 // #32
48: cmp x26, x10
4c: b.gt 0x0000000000000074
50: add x26, x26, #0x1
prog = array->ptrs[index];
if (prog == NULL)
goto out;
54: mov x10, #0x68 // #104
58: ldr x10, [x1,x10]
5c: ldr x11, [x10,x2]
60: cbz x11, 0x0000000000000074
goto *(prog->bpf_func + prologue_size);
64: mov x10, #0x20 // #32
68: ldr x10, [x11,x10]
6c: add x10, x10, #0x20
70: br x10
74:
Signed-off-by: Zi Shen Lim <zlim.lnx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08 21:18:48 -07:00
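For reference, this is roughly what the caller side looks like from BPF C: a
minimal, illustrative pair of programs using a BPF_MAP_TYPE_PROG_ARRAY and
bpf_tail_call(), written against common libbpf conventions. The section names,
map name and index here are an example, not something taken from this file.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 4);
	__type(key, __u32);
	__type(value, __u32);
} jmp_table SEC(".maps");

SEC("xdp")
int next_prog(struct xdp_md *ctx)
{
	return XDP_PASS;
}

SEC("xdp")
int entry_prog(struct xdp_md *ctx)
{
	/* slot 0 must be populated with next_prog's fd by the loader */
	bpf_tail_call(ctx, &jmp_table, 0);

	/* reached only if the tail call fails (empty slot, limit hit, ...) */
	return XDP_DROP;
}

char _license[] SEC("license") = "GPL";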
|
|
|
|
2024-08-26 15:16:24 +08:00
|
|
|
static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
|
2014-08-26 21:15:30 -07:00
|
|
|
{
|
2017-06-11 03:55:27 +02:00
|
|
|
const struct bpf_prog *prog = ctx->prog;
|
2023-09-13 01:31:58 +02:00
|
|
|
const bool is_main_prog = !bpf_is_subprog(prog);
|
2014-08-26 21:15:30 -07:00
|
|
|
const u8 fp = bpf2a64[BPF_REG_FP];
|
2024-03-25 15:07:15 +00:00
|
|
|
const u8 arena_vm_base = bpf2a64[ARENA_VM_START];
|
2016-06-08 21:18:48 -07:00
|
|
|
const int idx0 = ctx->idx;
|
|
|
|
int cur_offset;
|
2014-08-26 21:15:30 -07:00
|
|
|
|
2015-11-16 14:35:35 -08:00
|
|
|
/*
|
|
|
|
* BPF prog stack layout
|
|
|
|
*
|
|
|
|
* high
|
|
|
|
* original A64_SP => 0:+-----+ BPF prologue
|
|
|
|
* |FP/LR|
|
|
|
|
* current A64_FP => -16:+-----+
|
|
|
|
* | ... | callee saved registers
|
2016-05-16 16:36:26 -07:00
|
|
|
* BPF fp register => -64:+-----+ <= (BPF_FP)
|
2015-11-16 14:35:35 -08:00
|
|
|
* | |
|
|
|
|
* | ... | BPF prog stack
|
|
|
|
* | |
|
2017-06-11 03:55:27 +02:00
|
|
|
* +-----+ <= (BPF_FP - prog->aux->stack_depth)
|
2018-05-14 23:22:31 +02:00
|
|
|
* |RSVD | padding
|
2017-06-11 03:55:27 +02:00
|
|
|
* current A64_SP => +-----+ <= (BPF_FP - ctx->stack_size)
|
2015-11-16 14:35:35 -08:00
|
|
|
* | |
|
|
|
|
* | ... | Function call stack
|
|
|
|
* | |
|
|
|
|
* +-----+
|
|
|
|
* low
|
|
|
|
*
|
|
|
|
*/
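A small worked example for the bottom part of this layout, assuming the JIT
rounds the stack size up to a 16-byte multiple (which is what the RSVD padding
above accounts for):

/*
 * Example: prog->aux->stack_depth = 20
 *   ctx->stack_size = 32                 (20 rounded up to 16 bytes)
 *   BPF_FP - 20     = lowest byte the program's own stack slots reach
 *   BPF_FP - 32     = current A64_SP, leaving 12 bytes of RSVD padding
 */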
|
|
|
|
|
2023-07-13 09:49:31 -07:00
|
|
|
/* bpf function may be invoked by 3 instruction types:
|
|
|
|
* 1. bl, attached via freplace to bpf prog via short jump
|
|
|
|
* 2. br, attached via freplace to bpf prog via long jump
|
|
|
|
* 3. blr, working as a function pointer, used by emit_call.
|
|
|
|
* So BTI_JC should be used here to support both br and blr.
|
|
|
|
*/
|
|
|
|
emit_bti(A64_BTI_JC, ctx);
|
2022-07-11 11:08:22 -04:00
|
|
|
|
|
|
|
emit(A64_MOV(1, A64_R(9), A64_LR), ctx);
|
|
|
|
emit(A64_NOP, ctx);
|
|
|
|
|
2024-08-26 15:16:24 +08:00
|
|
|
if (!prog->aux->exception_cb) {
|
2024-02-01 12:52:25 +00:00
|
|
|
/* Sign lr */
|
|
|
|
if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL))
|
|
|
|
emit(A64_PACIASP, ctx);
|
2024-08-26 15:16:24 +08:00
|
|
|
|
2024-02-01 12:52:25 +00:00
|
|
|
/* Save FP and LR registers to stay aligned with the ARM64 AAPCS */
|
|
|
|
emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
|
|
|
|
emit(A64_MOV(1, A64_FP, A64_SP), ctx);
|
|
|
|
|
bpf, arm64: Fix tailcall hierarchy
This patch fixes a tailcall issue caused by abusing the tailcall in
bpf2bpf feature on arm64, following the approach of "bpf, x64: Fix
tailcall hierarchy".
On arm64, when a tail call happens, it uses tail_call_cnt_ptr to
increment tail_call_cnt, too.
At the prologue of main prog, it has to initialize tail_call_cnt and
prepare tail_call_cnt_ptr.
At the prologue of subprog, it pushes x26 register twice, and does not
initialize tail_call_cnt.
At the epilogue, it pops x26 twice, no matter whether it is main prog or
subprog.
Fixes: d4609a5d8c70 ("bpf, arm64: Keep tail call count across bpf2bpf calls")
Acked-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
Link: https://lore.kernel.org/r/20240714123902.32305-3-hffilwlqm@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-07-14 20:39:01 +08:00
|
|
|
prepare_bpf_tail_call_cnt(ctx);
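Based on the listings and the "Fix tailcall hierarchy" description quoted
above, this call is expected to emit roughly the following; it is inferred
from the quoted disassembly, not copied from the helper's source:

/*
 * main prog:
 *   stp xzr, x26, [sp, #-16]!   // push tail_call_cnt = 0 and old tcc_ptr
 *   mov x26, sp                 // tcc_ptr now points at the zeroed count
 * subprog:
 *   the caller's tcc_ptr in x26 is pushed and kept as-is, so the count
 *   keeps accumulating across bpf2bpf calls instead of being reset.
 */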
|
2024-08-26 15:16:24 +08:00
|
|
|
|
|
|
|
if (!ebpf_from_cbpf && is_main_prog) {
|
|
|
|
cur_offset = ctx->idx - idx0;
|
|
|
|
if (cur_offset != PROLOGUE_OFFSET) {
|
|
|
|
pr_err_once("PROLOGUE_OFFSET = %d, expected %d!\n",
|
|
|
|
cur_offset, PROLOGUE_OFFSET);
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
/* BTI landing pad for the tail call, done with a BR */
|
|
|
|
emit_bti(A64_BTI_J, ctx);
|
|
|
|
}
|
|
|
|
push_callee_regs(ctx);
|
2024-02-01 12:52:25 +00:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Exception callback receives FP of Main Program as third
|
|
|
|
* parameter
|
|
|
|
*/
|
|
|
|
emit(A64_MOV(1, A64_FP, A64_R(2)), ctx);
|
|
|
|
/*
|
|
|
|
* Main Program already pushed the frame record and the
|
|
|
|
* callee-saved registers. The exception callback will not push
|
|
|
|
* anything and will reuse the main program's stack.
|
|
|
|
*
|
2024-07-14 20:39:01 +08:00
|
|
|
* 12 registers are on the stack
|
2024-02-01 12:52:25 +00:00
|
|
|
*/
|
2024-07-14 20:39:01 +08:00
|
|
|
emit(A64_SUB_I(1, A64_SP, A64_FP, 96), ctx);
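The constant ties back to the "12 registers are on the stack" comment above:

/*
 * 12 saved registers x 8 bytes = 96 bytes, so A64_SP = A64_FP - 96
 * recreates the main program's stack pointer just below everything the
 * main prologue pushed, letting the exception callback reuse that stack.
 */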
|
2024-02-01 12:52:25 +00:00
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
|
bpf, arm64: Avoid blindly saving/restoring all callee-saved registers
The arm64 jit blindly saves/restores all callee-saved registers, making
the jited result look a bit too complicated. For example, for an empty
prog, the jited result is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp x19, x20, [sp, #-16]!
1c: stp x21, x22, [sp, #-16]!
20: stp x26, x25, [sp, #-16]!
24: mov x26, #0
28: stp x26, x25, [sp, #-16]!
2c: mov x26, sp
30: stp x27, x28, [sp, #-16]!
34: mov x25, sp
38: bti j // tailcall target
3c: sub sp, sp, #0
40: mov x7, #0
44: add sp, sp, #0
48: ldp x27, x28, [sp], #16
4c: ldp x26, x25, [sp], #16
50: ldp x26, x25, [sp], #16
54: ldp x21, x22, [sp], #16
58: ldp x19, x20, [sp], #16
5c: ldp fp, lr, [sp], #16
60: mov x0, x7
64: autiasp
68: ret
Clearly, there is no need to save/restore unused callee-saved registers.
This patch changes that, making the jited image save/restore only
the callee-saved registers it uses.
Now the jited result of empty prog is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp xzr, x26, [sp, #-16]!
1c: mov x26, sp
20: bti j // tailcall target
24: mov x7, #0
28: ldp xzr, x26, [sp], #16
2c: ldp fp, lr, [sp], #16
30: mov x0, x7
34: autiasp
38: ret
Since bpf prog saves/restores its own callee-saved registers as needed,
to make tailcall work correctly, the caller needs to restore its saved
registers before tailcall, and the callee needs to save its callee-saved
registers after tailcall. These extra restore/save instructions
increase performance overhead.
[1] provides 2 benchmarks for tailcall scenarios. Below is the perf
number measured in an arm64 KVM guest. The result indicates that the
performance difference before and after the patch in typical tailcall
scenarios is negligible.
- Before:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4313.43 msec task-clock # 0.874 CPUs utilized ( +- 0.16% )
574 context-switches # 133.073 /sec ( +- 1.14% )
0 cpu-migrations # 0.000 /sec
538 page-faults # 124.727 /sec ( +- 0.57% )
10697772784 cycles # 2.480 GHz ( +- 0.22% ) (61.19%)
25511241955 instructions # 2.38 insn per cycle ( +- 0.08% ) (66.70%)
5108910557 branches # 1.184 G/sec ( +- 0.08% ) (72.38%)
2800459 branch-misses # 0.05% of all branches ( +- 0.51% ) (72.36%)
TopDownL1 # 0.60 retiring ( +- 0.09% ) (66.84%)
# 0.21 frontend_bound ( +- 0.15% ) (61.31%)
# 0.12 bad_speculation ( +- 0.08% ) (50.11%)
# 0.07 backend_bound ( +- 0.16% ) (33.30%)
8274201819 L1-dcache-loads # 1.918 G/sec ( +- 0.18% ) (33.15%)
468268 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 4.69% ) (33.16%)
385383 LLC-loads # 89.345 K/sec ( +- 5.22% ) (33.16%)
38296 LLC-load-misses # 9.94% of all LL-cache accesses ( +- 42.52% ) (38.69%)
6886576501 L1-icache-loads # 1.597 G/sec ( +- 0.35% ) (38.69%)
1848585 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 4.52% ) (44.23%)
9043645883 dTLB-loads # 2.097 G/sec ( +- 0.10% ) (44.33%)
416672 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 5.15% ) (49.89%)
6925626111 iTLB-loads # 1.606 G/sec ( +- 0.35% ) (55.46%)
66220 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.88% ) (55.50%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9372 +- 0.0526 seconds time elapsed ( +- 1.07% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10924.50 msec task-clock # 0.945 CPUs utilized ( +- 0.08% )
603 context-switches # 55.197 /sec ( +- 1.13% )
0 cpu-migrations # 0.000 /sec
566 page-faults # 51.810 /sec ( +- 0.42% )
27381270695 cycles # 2.506 GHz ( +- 0.18% ) (60.46%)
56996583922 instructions # 2.08 insn per cycle ( +- 0.21% ) (66.11%)
10321647567 branches # 944.816 M/sec ( +- 0.17% ) (71.79%)
3347735 branch-misses # 0.03% of all branches ( +- 3.72% ) (72.15%)
TopDownL1 # 0.52 retiring ( +- 0.13% ) (66.74%)
# 0.27 frontend_bound ( +- 0.14% ) (61.27%)
# 0.14 bad_speculation ( +- 0.19% ) (50.36%)
# 0.07 backend_bound ( +- 0.42% ) (33.89%)
18740797617 L1-dcache-loads # 1.715 G/sec ( +- 0.43% ) (33.71%)
13715669 L1-dcache-load-misses # 0.07% of all L1-dcache accesses ( +- 32.85% ) (33.34%)
4087551 LLC-loads # 374.164 K/sec ( +- 29.53% ) (33.26%)
267906 LLC-load-misses # 6.55% of all LL-cache accesses ( +- 23.90% ) (38.76%)
15811864229 L1-icache-loads # 1.447 G/sec ( +- 0.12% ) (38.73%)
2976833 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 9.73% ) (44.22%)
20138907471 dTLB-loads # 1.843 G/sec ( +- 0.18% ) (44.15%)
732850 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 11.18% ) (49.64%)
15895726702 iTLB-loads # 1.455 G/sec ( +- 0.15% ) (55.13%)
152075 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 4.71% ) (54.98%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.5613 +- 0.0317 seconds time elapsed ( +- 0.27% )
- After:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4278.78 msec task-clock # 0.871 CPUs utilized ( +- 0.15% )
569 context-switches # 132.982 /sec ( +- 0.58% )
0 cpu-migrations # 0.000 /sec
539 page-faults # 125.970 /sec ( +- 0.43% )
10588986432 cycles # 2.475 GHz ( +- 0.20% ) (60.91%)
25303825043 instructions # 2.39 insn per cycle ( +- 0.08% ) (66.48%)
5110756256 branches # 1.194 G/sec ( +- 0.07% ) (72.03%)
2719569 branch-misses # 0.05% of all branches ( +- 2.42% ) (72.03%)
TopDownL1 # 0.60 retiring ( +- 0.22% ) (66.31%)
# 0.22 frontend_bound ( +- 0.21% ) (60.83%)
# 0.12 bad_speculation ( +- 0.26% ) (50.25%)
# 0.06 backend_bound ( +- 0.17% ) (33.52%)
8163648527 L1-dcache-loads # 1.908 G/sec ( +- 0.33% ) (33.52%)
694979 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 30.53% ) (33.52%)
1902347 LLC-loads # 444.600 K/sec ( +- 48.84% ) (33.69%)
96677 LLC-load-misses # 5.08% of all LL-cache accesses ( +- 43.48% ) (39.30%)
6863517589 L1-icache-loads # 1.604 G/sec ( +- 0.37% ) (39.17%)
1871519 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 6.78% ) (44.56%)
8927782813 dTLB-loads # 2.087 G/sec ( +- 0.14% ) (44.37%)
438237 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 6.00% ) (49.75%)
6886906831 iTLB-loads # 1.610 G/sec ( +- 0.36% ) (55.08%)
67568 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 3.27% ) (54.86%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9114 +- 0.0309 seconds time elapsed ( +- 0.63% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10948.40 msec task-clock # 0.942 CPUs utilized ( +- 0.05% )
615 context-switches # 56.173 /sec ( +- 1.65% )
1 cpu-migrations # 0.091 /sec ( +- 31.62% )
567 page-faults # 51.788 /sec ( +- 0.44% )
27334194328 cycles # 2.497 GHz ( +- 0.08% ) (61.05%)
56656528828 instructions # 2.07 insn per cycle ( +- 0.08% ) (66.67%)
10270389422 branches # 938.072 M/sec ( +- 0.10% ) (72.21%)
3453837 branch-misses # 0.03% of all branches ( +- 3.75% ) (72.27%)
TopDownL1 # 0.52 retiring ( +- 0.16% ) (66.55%)
# 0.27 frontend_bound ( +- 0.09% ) (60.91%)
# 0.14 bad_speculation ( +- 0.08% ) (49.85%)
# 0.07 backend_bound ( +- 0.16% ) (33.33%)
18982866028 L1-dcache-loads # 1.734 G/sec ( +- 0.24% ) (33.34%)
8802454 L1-dcache-load-misses # 0.05% of all L1-dcache accesses ( +- 52.30% ) (33.31%)
2612962 LLC-loads # 238.661 K/sec ( +- 29.78% ) (33.45%)
264107 LLC-load-misses # 10.11% of all LL-cache accesses ( +- 18.34% ) (39.07%)
15793205997 L1-icache-loads # 1.443 G/sec ( +- 0.15% ) (39.09%)
3930802 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 3.72% ) (44.66%)
20097828496 dTLB-loads # 1.836 G/sec ( +- 0.09% ) (44.68%)
961757 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 3.32% ) (50.15%)
15838728506 iTLB-loads # 1.447 G/sec ( +- 0.09% ) (55.62%)
167652 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.28% ) (55.52%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.6173 +- 0.0268 seconds time elapsed ( +- 0.23% )
[1] https://lore.kernel.org/bpf/20200724123644.5096-1-maciej.fijalkowski@intel.com/
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20240826071624.350108-3-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-08-26 15:16:24 +08:00
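To make the idea concrete, the bookkeeping behind "only save/restore the
callee-saved registers the prog uses" can be modeled in a few lines of
standalone user-space C. This is only an illustrative sketch; the names
(toy_ctx, mark_used, push_used) are hypothetical and are not the kernel
JIT's actual helpers.

#include <stdbool.h>
#include <stdio.h>

#define NR_CALLEE_SAVED 8            /* x19..x26 in this toy model */

struct toy_ctx {
        bool used[NR_CALLEE_SAVED];  /* marked while scanning the prog */
        int  saved;                  /* how many registers get pushed  */
};

/* Scan phase: mark a callee-saved register the program actually uses. */
static void mark_used(struct toy_ctx *ctx, int reg)
{
        if (reg >= 0 && reg < NR_CALLEE_SAVED)
                ctx->used[reg] = true;
}

/* Prologue phase: "save" (here: just count) only the marked registers. */
static void push_used(struct toy_ctx *ctx)
{
        for (int i = 0; i < NR_CALLEE_SAVED; i++)
                if (ctx->used[i])
                        ctx->saved++;
}

int main(void)
{
        struct toy_ctx ctx = { 0 };

        mark_used(&ctx, 0);          /* e.g. the prog uses x19 */
        mark_used(&ctx, 5);          /* ... and x24            */
        push_used(&ctx);
        printf("saved %d of %d callee-saved registers\n",
               ctx.saved, NR_CALLEE_SAVED);
        return 0;
}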
|
|
|
if (ctx->fp_used)
|
|
|
|
/* Set up BPF prog stack base register */
|
|
|
|
emit(A64_MOV(1, fp, A64_SP), ctx);
|
2024-02-01 12:52:25 +00:00
|
|
|
|
2021-05-10 20:51:59 +08:00
|
|
|
/* Stack size must be a multiple of 16B */
|
|
|
|
ctx->stack_size = round_up(prog->aux->stack_depth, 16);
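AAPCS64 keeps SP 16-byte aligned, hence the round_up() above. A standalone
arithmetic sketch (round_up_pow2 is a local re-implementation for
power-of-two alignments, not the kernel macro):

#include <stdint.h>
#include <stdio.h>

/* Round x up to the next multiple of align (align must be a power of two). */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
        return (x + align - 1) & ~(align - 1);
}

int main(void)
{
        /* e.g. a prog with stack_depth 20 gets a 32-byte stack region */
        printf("%llu\n", (unsigned long long)round_up_pow2(20, 16)); /* 32 */
        printf("%llu\n", (unsigned long long)round_up_pow2(64, 16)); /* 64 */
        return 0;
}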
|
2018-01-16 03:46:08 +01:00
|
|
|
|
|
|
|
/* Set up function call stack */
|
|
|
|
if (ctx->stack_size)
|
|
|
|
emit(A64_SUB_I(1, A64_SP, A64_SP, ctx->stack_size), ctx);
|
2024-03-25 15:07:15 +00:00
|
|
|
|
|
|
|
if (ctx->arena_vm_start)
|
|
|
|
emit_a64_mov_i64(arena_vm_base, ctx->arena_vm_start, ctx);
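emit_a64_mov_i64() must materialize a full 64-bit constant (here the arena
base address) in a register; on arm64 this is typically a MOVZ followed by
up to three MOVKs, each carrying one 16-bit chunk. The standalone sketch
below only prints such a sequence and skips the optimizations the real
helper performs (e.g. omitting all-zero chunks):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t imm = 0xffff800012345678ull;   /* hypothetical 64-bit base */

        for (int shift = 0; shift < 64; shift += 16) {
                unsigned int chunk = (imm >> shift) & 0xffff;
                /* first chunk sets the register, later chunks keep the rest */
                printf("%s x10, #0x%x, lsl #%d\n",
                       shift == 0 ? "movz" : "movk", chunk, shift);
        }
        return 0;
}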
|
2024-03-25 15:07:15 +00:00
|
|
|
|
arm64: bpf: implement bpf_tail_call() helper
Add support for JMP_CALL_X (tail call) introduced by commit 04fd61ab36ec
("bpf: allow bpf programs to tail-call other bpf programs").
bpf_tail_call() arguments:
ctx - context pointer passed to next program
array - pointer to a map of type BPF_MAP_TYPE_PROG_ARRAY
index - index inside the array that selects the specific program to run
In this implementation the arm64 JIT jumps into the callee program after
the prologue, so the callee program reuses the same stack. For
tail_call_cnt, we use the callee-saved R26 (which was already
saved/restored but previously unused by the JIT).
With this patch a tail call generates the following code on arm64:
if (index >= array->map.max_entries)
goto out;
34: mov x10, #0x10 // #16
38: ldr w10, [x1,x10]
3c: cmp w2, w10
40: b.ge 0x0000000000000074
if (tail_call_cnt > MAX_TAIL_CALL_CNT)
goto out;
tail_call_cnt++;
44: mov x10, #0x20 // #32
48: cmp x26, x10
4c: b.gt 0x0000000000000074
50: add x26, x26, #0x1
prog = array->ptrs[index];
if (prog == NULL)
goto out;
54: mov x10, #0x68 // #104
58: ldr x10, [x1,x10]
5c: ldr x11, [x10,x2]
60: cbz x11, 0x0000000000000074
goto *(prog->bpf_func + prologue_size);
64: mov x10, #0x20 // #32
68: ldr x10, [x11,x10]
6c: add x10, x10, #0x20
70: br x10
74:
Signed-off-by: Zi Shen Lim <zlim.lnx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08 21:18:48 -07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int emit_bpf_tail_call(struct jit_ctx *ctx)
|
|
|
|
{
|
|
|
|
/* bpf_tail_call(void *prog_ctx, struct bpf_array *array, u64 index) */
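For context, a minimal (hypothetical) BPF-C caller of this helper, using
libbpf's BTF-style map definitions, could look like the sketch below; the
prog array name and the section are made up for illustration.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
        __uint(max_entries, 4);
        __type(key, __u32);
        __type(value, __u32);
} jmp_table SEC(".maps");

SEC("xdp")
int entry(struct xdp_md *ctx)
{
        /* Jump to the program in slot 0; on success this never returns. */
        bpf_tail_call(ctx, &jmp_table, 0);
        /* Reached only if the slot is empty or the call limit is hit. */
        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";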
|
|
|
|
const u8 r2 = bpf2a64[BPF_REG_2];
|
|
|
|
const u8 r3 = bpf2a64[BPF_REG_3];
|
|
|
|
|
|
|
|
const u8 tmp = bpf2a64[TMP_REG_1];
|
|
|
|
const u8 prg = bpf2a64[TMP_REG_2];
|
bpf, arm64: Fix tailcall hierarchy
This patch fixes a tailcall issue caused by abusing the tailcall-in-bpf2bpf
feature on arm64, in the same way as "bpf, x64: Fix tailcall hierarchy".
On arm64, when a tail call happens, it uses tail_call_cnt_ptr to increment
tail_call_cnt, too.
At the prologue of the main prog, it has to initialize tail_call_cnt and
prepare tail_call_cnt_ptr.
At the prologue of a subprog, it pushes the x26 register twice and does not
initialize tail_call_cnt.
At the epilogue, it pops x26 twice, whether it is the main prog or a
subprog.
Fixes: d4609a5d8c70 ("bpf, arm64: Keep tail call count across bpf2bpf calls")
Acked-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
Link: https://lore.kernel.org/r/20240714123902.32305-3-hffilwlqm@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-07-14 20:39:01 +08:00
|
|
|
const u8 tcc = bpf2a64[TMP_REG_3];
|
|
|
|
const u8 ptr = bpf2a64[TCCNT_PTR];
|
|
|
|
size_t off;
|
|
|
|
__le32 *branch1 = NULL;
|
|
|
|
__le32 *branch2 = NULL;
|
|
|
|
__le32 *branch3 = NULL;
|
|
|
|
|
|
|
|
/* if (index >= array->map.max_entries)
|
|
|
|
* goto out;
|
|
|
|
*/
|
|
|
|
off = offsetof(struct bpf_array, map.max_entries);
|
|
|
|
emit_a64_mov_i64(tmp, off, ctx);
|
|
|
|
emit(A64_LDR32(tmp, r2, tmp), ctx);
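At run time the mov+ldr pair above simply reads array->map.max_entries
through base register r2 plus a byte offset. A standalone C model of that
access, using toy types rather than the kernel's struct bpf_array:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct toy_map   { uint32_t max_entries; };
struct toy_array { struct toy_map map; };

int main(void)
{
        struct toy_array arr = { .map = { .max_entries = 4 } };
        size_t off = offsetof(struct toy_array, map.max_entries);

        /* base + offset load, as the jited mov+ldr pair does with r2 + tmp */
        uint32_t max = *(uint32_t *)((char *)&arr + off);

        printf("max_entries = %u (offset %zu)\n", max, off);
        return 0;
}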
|
bpf, arm64: fix out of bounds access in tail call
I recently noticed a crash on arm64 when feeding a bogus index
into the BPF tail call helper. The crash would not occur when the
interpreter is used, only with the JIT. The output looks as
follows:
[ 347.007486] Unable to handle kernel paging request at virtual address fffb850e96492510
[...]
[ 347.043065] [fffb850e96492510] address between user and kernel address ranges
[ 347.050205] Internal error: Oops: 96000004 [#1] SMP
[...]
[ 347.190829] x13: 0000000000000000 x12: 0000000000000000
[ 347.196128] x11: fffc047ebe782800 x10: ffff808fd7d0fd10
[ 347.201427] x9 : 0000000000000000 x8 : 0000000000000000
[ 347.206726] x7 : 0000000000000000 x6 : 001c991738000000
[ 347.212025] x5 : 0000000000000018 x4 : 000000000000ba5a
[ 347.217325] x3 : 00000000000329c4 x2 : ffff808fd7cf0500
[ 347.222625] x1 : ffff808fd7d0fc00 x0 : ffff808fd7cf0500
[ 347.227926] Process test_verifier (pid: 4548, stack limit = 0x000000007467fa61)
[ 347.235221] Call trace:
[ 347.237656] 0xffff000002f3a4fc
[ 347.240784] bpf_test_run+0x78/0xf8
[ 347.244260] bpf_prog_test_run_skb+0x148/0x230
[ 347.248694] SyS_bpf+0x77c/0x1110
[ 347.251999] el0_svc_naked+0x30/0x34
[ 347.255564] Code: 9100075a d280220a 8b0a002a d37df04b (f86b694b)
[...]
In this case the index used in BPF r3 is the same as in r1
at the time of the call, meaning we fed a pointer as index;
here, it had the value 0xffff808fd7cf0500 which sits in x2.
While I found tail calls to be working in general (also for
hitting the error cases), I noticed the following in the code
emission:
# bpftool p d j i 988
[...]
38: ldr w10, [x1,x10]
3c: cmp w2, w10
40: b.ge 0x000000000000007c <-- signed cmp
44: mov x10, #0x20 // #32
48: cmp x26, x10
4c: b.gt 0x000000000000007c
50: add x26, x26, #0x1
54: mov x10, #0x110 // #272
58: add x10, x1, x10
5c: lsl x11, x2, #3
60: ldr x11, [x10,x11] <-- faulting insn (f86b694b)
64: cbz x11, 0x000000000000007c
[...]
Meaning, the tests passed because commit ddb55992b04d ("arm64:
bpf: implement bpf_tail_call() helper") was using signed compares
instead of unsigned ones, which wrongly let the tests pass.
Change both this comparison and the tail call count test to
unsigned, and cap the index to u32. The latter was also done in
90caccdd8cc0 ("bpf: fix bpf_tail_call() x64 JIT") and is needed
here as well. Tested on HiSilicon Hi1616.
Result after patch:
# bpftool p d j i 268
[...]
38: ldr w10, [x1,x10]
3c: add w2, w2, #0x0
40: cmp w2, w10
44: b.cs 0x0000000000000080
48: mov x10, #0x20 // #32
4c: cmp x26, x10
50: b.hi 0x0000000000000080
54: add x26, x26, #0x1
58: mov x10, #0x110 // #272
5c: add x10, x1, x10
60: lsl x11, x2, #3
64: ldr x11, [x10,x11]
68: cbz x11, 0x0000000000000080
[...]
Fixes: ddb55992b04d ("arm64: bpf: implement bpf_tail_call() helper")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-02-23 01:03:43 +01:00
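The crux of the fix can be reproduced in a few lines of standalone C: a
bogus 32-bit index with its top bit set slips past a signed bounds check
but is caught by the unsigned one (capping the index to u32 keeps the
compare within 32-bit range). The constant below is made up for
illustration.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint32_t max_entries = 4;
        uint32_t index = 0xd7cf0500u;      /* low 32 bits of a bogus pointer */
        int32_t  sindex = (int32_t)index;  /* two's-complement: negative     */

        /* signed compare: negative index looks "in range", check is bypassed */
        printf("signed   check rejects index: %d\n",
               sindex >= (int32_t)max_entries);
        /* unsigned compare: the huge index is correctly rejected */
        printf("unsigned check rejects index: %d\n", index >= max_entries);
        return 0;
}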
|
|
|
emit(A64_MOV(0, r3, r3), ctx);
|
|
|
|
emit(A64_CMP(0, r3, tmp), ctx);
|
|
|
|
branch1 = ctx->image + ctx->idx;
|
|
|
|
emit(A64_NOP, ctx);
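branch1 records the current emit position and a NOP is emitted as a
placeholder; once the final jump target is known the JIT later overwrites
that slot with the real conditional branch. A standalone model of this
record-then-backpatch step (real branch encodings are omitted):

#include <stdint.h>
#include <stdio.h>

#define A64_NOP_ENC 0xd503201fu      /* AArch64 NOP encoding */

int main(void)
{
        uint32_t image[8];
        int idx = 0;

        /* Remember the slot, emit a placeholder for the not-yet-known branch. */
        int branch_slot = idx;
        image[idx++] = A64_NOP_ENC;

        /* ... further instructions are emitted, advancing idx ... */
        image[idx++] = A64_NOP_ENC;
        image[idx++] = A64_NOP_ENC;

        /* Target now known: the real JIT overwrites the slot with a
         * conditional branch; here we only show the offset computation. */
        int offset = idx - branch_slot;
        printf("slot %d holds 0x%08x until patched; branch skips %d insns\n",
               branch_slot, image[branch_slot], offset);
        return 0;
}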
|
|
|
|
|
bpf: Change value of MAX_TAIL_CALL_CNT from 32 to 33
In the current code, the actual max tail call count is 33 which is greater
than MAX_TAIL_CALL_CNT (defined as 32). The actual limit is not consistent
with the meaning of MAX_TAIL_CALL_CNT and thus confusing at first glance.
We can see the historical evolution from commit 04fd61ab36ec ("bpf: allow
bpf programs to tail-call other bpf programs") and commit f9dabe016b63
("bpf: Undo off-by-one in interpreter tail call count limit"). In order
to avoid changing existing behavior, the actual limit is 33 now, this is
reasonable.
After commit 874be05f525e ("bpf, tests: Add tail call test suite"), we can
see there exists failed testcase.
On all archs when CONFIG_BPF_JIT_ALWAYS_ON is not set:
# echo 0 > /proc/sys/net/core/bpf_jit_enable
# modprobe test_bpf
# dmesg | grep -w FAIL
Tail call error path, max count reached jited:0 ret 34 != 33 FAIL
On some archs:
# echo 1 > /proc/sys/net/core/bpf_jit_enable
# modprobe test_bpf
# dmesg | grep -w FAIL
Tail call error path, max count reached jited:1 ret 34 != 33 FAIL
Although the above failed testcase has been fixed in commit 18935a72eb25
("bpf/tests: Fix error in tail call limit tests"), it would still be good
to change the value of MAX_TAIL_CALL_CNT from 32 to 33 to make the code
more readable.
The 32-bit x86 JIT was using a limit of 32; fix the wrong comments and
limit to 33 tail calls now that the constant MAX_TAIL_CALL_CNT has been
updated. For the mips64 JIT, use "ori" instead of "addiu" as suggested by
Johan Almbladh. For the riscv JIT, use RV_REG_TCC directly to save one
register move as suggested by Björn Töpel. For the other implementations
there are no functional changes: the current limit of 33 is unchanged, the
new value of MAX_TAIL_CALL_CNT reflects the actual max tail call count,
and the related tail call testcases in the test_bpf module and selftests
work for both the interpreter and the JIT.
Here are the test results on x86_64:
# uname -m
x86_64
# echo 0 > /proc/sys/net/core/bpf_jit_enable
# modprobe test_bpf test_suite=test_tail_calls
# dmesg | tail -1
test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [0/8 JIT'ed]
# rmmod test_bpf
# echo 1 > /proc/sys/net/core/bpf_jit_enable
# modprobe test_bpf test_suite=test_tail_calls
# dmesg | tail -1
test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [8/8 JIT'ed]
# rmmod test_bpf
# ./test_progs -t tailcalls
#142 tailcalls:OK
Summary: 1/11 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Acked-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/bpf/1636075800-3264-1-git-send-email-yangtiezhu@loongson.cn
2021-11-05 09:30:00 +08:00
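A standalone model (not kernel code) of the counter check described above
and emitted below: with MAX_TAIL_CALL_CNT equal to 33 and a ">=" comparison,
exactly 33 chained tail calls are permitted before the path falls through
to out.

#include <stdio.h>

#define MAX_TAIL_CALL_CNT 33

int main(void)
{
        int tail_call_cnt = 0;
        int performed = 0;

        for (;;) {
                if (tail_call_cnt >= MAX_TAIL_CALL_CNT)
                        break;              /* the "goto out" path      */
                tail_call_cnt++;            /* counter bumped per call  */
                performed++;                /* the tail call is taken   */
        }
        printf("tail calls performed: %d\n", performed);   /* 33 */
        return 0;
}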
|
|
|
/*
|
|
|
|
* if ((*tail_call_cnt_ptr) >= MAX_TAIL_CALL_CNT)
|
|
|
|
* goto out;
|
|
|
|
*/
|
|
|
|
emit_a64_mov_i64(tmp, MAX_TAIL_CALL_CNT, ctx);
|
|
|
|
emit(A64_LDR64I(tcc, ptr, 0), ctx);
|
|
|
|
emit(A64_CMP(1, tcc, tmp), ctx);
|
bpf, arm64: Avoid blindly saving/restoring all callee-saved registers
The arm64 JIT blindly saves/restores all callee-saved registers, making
the jited result look a bit too complicated. For example, for an empty
prog, the jited result is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp x19, x20, [sp, #-16]!
1c: stp x21, x22, [sp, #-16]!
20: stp x26, x25, [sp, #-16]!
24: mov x26, #0
28: stp x26, x25, [sp, #-16]!
2c: mov x26, sp
30: stp x27, x28, [sp, #-16]!
34: mov x25, sp
38: bti j // tailcall target
3c: sub sp, sp, #0
40: mov x7, #0
44: add sp, sp, #0
48: ldp x27, x28, [sp], #16
4c: ldp x26, x25, [sp], #16
50: ldp x26, x25, [sp], #16
54: ldp x21, x22, [sp], #16
58: ldp x19, x20, [sp], #16
5c: ldp fp, lr, [sp], #16
60: mov x0, x7
64: autiasp
68: ret
Clearly, there is no need to save/restore unused callee-saved registers.
This patch changes that, making the jited image save/restore only the
callee-saved registers it actually uses (a sketch of the required
bookkeeping follows this commit message).
Now the jited result of an empty prog is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp xzr, x26, [sp, #-16]!
1c: mov x26, sp
20: bti j // tailcall target
24: mov x7, #0
28: ldp xzr, x26, [sp], #16
2c: ldp fp, lr, [sp], #16
30: mov x0, x7
34: autiasp
38: ret
Since a bpf prog now saves/restores only its own callee-saved registers as
needed, making tailcalls work correctly requires the caller to restore its
saved registers before the tailcall and the callee to save its callee-saved
registers after the tailcall. These extra restore/save instructions add
some performance overhead.
[1] provides 2 benchmarks for tailcall scenarios. Below are the perf
numbers measured in an arm64 KVM guest. The results indicate that the
performance difference before and after the patch in typical tailcall
scenarios is negligible.
- Before:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4313.43 msec task-clock # 0.874 CPUs utilized ( +- 0.16% )
574 context-switches # 133.073 /sec ( +- 1.14% )
0 cpu-migrations # 0.000 /sec
538 page-faults # 124.727 /sec ( +- 0.57% )
10697772784 cycles # 2.480 GHz ( +- 0.22% ) (61.19%)
25511241955 instructions # 2.38 insn per cycle ( +- 0.08% ) (66.70%)
5108910557 branches # 1.184 G/sec ( +- 0.08% ) (72.38%)
2800459 branch-misses # 0.05% of all branches ( +- 0.51% ) (72.36%)
TopDownL1 # 0.60 retiring ( +- 0.09% ) (66.84%)
# 0.21 frontend_bound ( +- 0.15% ) (61.31%)
# 0.12 bad_speculation ( +- 0.08% ) (50.11%)
# 0.07 backend_bound ( +- 0.16% ) (33.30%)
8274201819 L1-dcache-loads # 1.918 G/sec ( +- 0.18% ) (33.15%)
468268 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 4.69% ) (33.16%)
385383 LLC-loads # 89.345 K/sec ( +- 5.22% ) (33.16%)
38296 LLC-load-misses # 9.94% of all LL-cache accesses ( +- 42.52% ) (38.69%)
6886576501 L1-icache-loads # 1.597 G/sec ( +- 0.35% ) (38.69%)
1848585 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 4.52% ) (44.23%)
9043645883 dTLB-loads # 2.097 G/sec ( +- 0.10% ) (44.33%)
416672 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 5.15% ) (49.89%)
6925626111 iTLB-loads # 1.606 G/sec ( +- 0.35% ) (55.46%)
66220 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.88% ) (55.50%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9372 +- 0.0526 seconds time elapsed ( +- 1.07% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10924.50 msec task-clock # 0.945 CPUs utilized ( +- 0.08% )
603 context-switches # 55.197 /sec ( +- 1.13% )
0 cpu-migrations # 0.000 /sec
566 page-faults # 51.810 /sec ( +- 0.42% )
27381270695 cycles # 2.506 GHz ( +- 0.18% ) (60.46%)
56996583922 instructions # 2.08 insn per cycle ( +- 0.21% ) (66.11%)
10321647567 branches # 944.816 M/sec ( +- 0.17% ) (71.79%)
3347735 branch-misses # 0.03% of all branches ( +- 3.72% ) (72.15%)
TopDownL1 # 0.52 retiring ( +- 0.13% ) (66.74%)
# 0.27 frontend_bound ( +- 0.14% ) (61.27%)
# 0.14 bad_speculation ( +- 0.19% ) (50.36%)
# 0.07 backend_bound ( +- 0.42% ) (33.89%)
18740797617 L1-dcache-loads # 1.715 G/sec ( +- 0.43% ) (33.71%)
13715669 L1-dcache-load-misses # 0.07% of all L1-dcache accesses ( +- 32.85% ) (33.34%)
4087551 LLC-loads # 374.164 K/sec ( +- 29.53% ) (33.26%)
267906 LLC-load-misses # 6.55% of all LL-cache accesses ( +- 23.90% ) (38.76%)
15811864229 L1-icache-loads # 1.447 G/sec ( +- 0.12% ) (38.73%)
2976833 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 9.73% ) (44.22%)
20138907471 dTLB-loads # 1.843 G/sec ( +- 0.18% ) (44.15%)
732850 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 11.18% ) (49.64%)
15895726702 iTLB-loads # 1.455 G/sec ( +- 0.15% ) (55.13%)
152075 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 4.71% ) (54.98%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.5613 +- 0.0317 seconds time elapsed ( +- 0.27% )
- After:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4278.78 msec task-clock # 0.871 CPUs utilized ( +- 0.15% )
569 context-switches # 132.982 /sec ( +- 0.58% )
0 cpu-migrations # 0.000 /sec
539 page-faults # 125.970 /sec ( +- 0.43% )
10588986432 cycles # 2.475 GHz ( +- 0.20% ) (60.91%)
25303825043 instructions # 2.39 insn per cycle ( +- 0.08% ) (66.48%)
5110756256 branches # 1.194 G/sec ( +- 0.07% ) (72.03%)
2719569 branch-misses # 0.05% of all branches ( +- 2.42% ) (72.03%)
TopDownL1 # 0.60 retiring ( +- 0.22% ) (66.31%)
# 0.22 frontend_bound ( +- 0.21% ) (60.83%)
# 0.12 bad_speculation ( +- 0.26% ) (50.25%)
# 0.06 backend_bound ( +- 0.17% ) (33.52%)
8163648527 L1-dcache-loads # 1.908 G/sec ( +- 0.33% ) (33.52%)
694979 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 30.53% ) (33.52%)
1902347 LLC-loads # 444.600 K/sec ( +- 48.84% ) (33.69%)
96677 LLC-load-misses # 5.08% of all LL-cache accesses ( +- 43.48% ) (39.30%)
6863517589 L1-icache-loads # 1.604 G/sec ( +- 0.37% ) (39.17%)
1871519 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 6.78% ) (44.56%)
8927782813 dTLB-loads # 2.087 G/sec ( +- 0.14% ) (44.37%)
438237 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 6.00% ) (49.75%)
6886906831 iTLB-loads # 1.610 G/sec ( +- 0.36% ) (55.08%)
67568 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 3.27% ) (54.86%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9114 +- 0.0309 seconds time elapsed ( +- 0.63% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10948.40 msec task-clock # 0.942 CPUs utilized ( +- 0.05% )
615 context-switches # 56.173 /sec ( +- 1.65% )
1 cpu-migrations # 0.091 /sec ( +- 31.62% )
567 page-faults # 51.788 /sec ( +- 0.44% )
27334194328 cycles # 2.497 GHz ( +- 0.08% ) (61.05%)
56656528828 instructions # 2.07 insn per cycle ( +- 0.08% ) (66.67%)
10270389422 branches # 938.072 M/sec ( +- 0.10% ) (72.21%)
3453837 branch-misses # 0.03% of all branches ( +- 3.75% ) (72.27%)
TopDownL1 # 0.52 retiring ( +- 0.16% ) (66.55%)
# 0.27 frontend_bound ( +- 0.09% ) (60.91%)
# 0.14 bad_speculation ( +- 0.08% ) (49.85%)
# 0.07 backend_bound ( +- 0.16% ) (33.33%)
18982866028 L1-dcache-loads # 1.734 G/sec ( +- 0.24% ) (33.34%)
8802454 L1-dcache-load-misses # 0.05% of all L1-dcache accesses ( +- 52.30% ) (33.31%)
2612962 LLC-loads # 238.661 K/sec ( +- 29.78% ) (33.45%)
264107 LLC-load-misses # 10.11% of all LL-cache accesses ( +- 18.34% ) (39.07%)
15793205997 L1-icache-loads # 1.443 G/sec ( +- 0.15% ) (39.09%)
3930802 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 3.72% ) (44.66%)
20097828496 dTLB-loads # 1.836 G/sec ( +- 0.09% ) (44.68%)
961757 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 3.32% ) (50.15%)
15838728506 iTLB-loads # 1.447 G/sec ( +- 0.09% ) (55.62%)
167652 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.28% ) (55.52%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.6173 +- 0.0268 seconds time elapsed ( +- 0.23% )
[1] https://lore.kernel.org/bpf/20200724123644.5096-1-maciej.fijalkowski@intel.com/
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20240826071624.350108-3-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-08-26 15:16:24 +08:00
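A minimal sketch of the bookkeeping this change relies on (the ctx field names
and the helper are assumptions for illustration; the real JIT scans the program
with its own helper before emitting the prologue, and later restores the same
set via pop_callee_regs() as seen in the tail-call path below):

/* Sketch: record a callee-saved register (x19..x28) the prog actually uses. */
static void mark_used_callee_reg_sketch(struct jit_ctx *ctx, u8 a64_reg)
{
	int i;

	if (a64_reg < 19 || a64_reg > 28)
		return;				/* not callee-saved, nothing to do */

	for (i = 0; i < ctx->nr_used_callee_reg; i++)
		if (ctx->used_callee_reg[i] == a64_reg)
			return;			/* already recorded */

	ctx->used_callee_reg[ctx->nr_used_callee_reg++] = a64_reg;
}

The prologue then pushes only the recorded registers (in pairs) and the epilogue
pops them in reverse order, which is what shrinks the empty-prog image shown above.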
|
|
|
branch2 = ctx->image + ctx->idx;
|
|
|
|
emit(A64_NOP, ctx);
|
|
|
|
|
|
|
|
/* (*tail_call_cnt_ptr)++; */
|
arm64: bpf: implement bpf_tail_call() helper
(full commit message quoted earlier in this section)
2016-06-08 21:18:48 -07:00
|
|
|
emit(A64_ADD_I(1, tcc, tcc, 1), ctx);
|
|
|
|
|
|
|
|
/* prog = array->ptrs[index];
|
|
|
|
* if (prog == NULL)
|
|
|
|
* goto out;
|
|
|
|
*/
|
|
|
|
off = offsetof(struct bpf_array, ptrs);
|
|
|
|
emit_a64_mov_i64(tmp, off, ctx);
|
bpf, arm64: fix faulty emission of map access in tail calls
Shubham was recently asking on netdev why in arm64 JIT we don't multiply
the index for accessing the tail call map by 8. That led me into testing
out arm64 JIT wrt tail calls and it turned out I got a NULL pointer
dereference on the tail call.
The buggy access is at:
prog = array->ptrs[index];
if (prog == NULL)
goto out;
[...]
00000060: d2800e0a mov x10, #0x70 // #112
00000064: f86a682a ldr x10, [x1,x10]
00000068: f862694b ldr x11, [x10,x2]
0000006c: b40000ab cbz x11, 0x00000080
[...]
The code triggering the crash is f862694b. x1 at the time contains the
address of the bpf array, x10 offsetof(struct bpf_array, ptrs). Meaning,
above we load the pointer to the program at map slot 0 into x10. x10
can then be NULL if the slot is not occupied, which we later try to
access with a user-given offset in x2, the map index.
Fix this by emitting the following instead:
[...]
00000060: d2800e0a mov x10, #0x70 // #112
00000064: 8b0a002a add x10, x1, x10
00000068: d37df04b lsl x11, x2, #3
0000006c: f86b694b ldr x11, [x10,x11]
00000070: b40000ab cbz x11, 0x00000084
[...]
This adds the offset of ptrs to the base address of the bpf array we got,
and we then access the map with an index * 8 offset relative to that. The
tail call map itself is basically one large area with metadata at the head
followed by the array of prog pointers.
This makes tail calls work again, tested on Cavium ThunderX ARMv8 (a C
sketch of the address computation follows this commit message).
Fixes: ddb55992b04d ("arm64: bpf: implement bpf_tail_call() helper")
Reported-by: Shubham Bansal <illusionist.neo@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-11 01:53:15 +02:00
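In C terms, the fixed sequence computes the slot address like this (a sketch
matching the disassembly above, not the exact emitted macros):

/*
 * base = (void *)array + offsetof(struct bpf_array, ptrs);
 * slot = base + index * sizeof(struct bpf_prog *);	// lsl #3, then ldr
 * prog = *(struct bpf_prog **)slot;
 *
 * The buggy version instead loaded ptrs[0] first and then used the raw,
 * unscaled index as a byte offset from that prog pointer, which crashes
 * when slot 0 is empty (NULL).
 */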
|
|
|
emit(A64_ADD(1, tmp, r2, tmp), ctx);
|
|
|
|
emit(A64_LSL(1, prg, r3, 3), ctx);
|
|
|
|
emit(A64_LDR64(prg, tmp, prg), ctx);
|
bpf, arm64: Avoid blindly saving/restoring all callee-saved registers
(full commit message quoted earlier in this section)
2024-08-26 15:16:24 +08:00
|
|
|
branch3 = ctx->image + ctx->idx;
|
|
|
|
emit(A64_NOP, ctx);
|
arm64: bpf: implement bpf_tail_call() helper
(full commit message quoted earlier in this section)
2016-06-08 21:18:48 -07:00
|
|
|
|
bpf, arm64: Fix tailcall hierarchy
(full commit message quoted earlier in this section)
2024-07-14 20:39:01 +08:00
|
|
|
/* Update tail_call_cnt if the slot is populated. */
|
|
|
|
emit(A64_STR64I(tcc, ptr, 0), ctx);
|
|
|
|
|
bpf, arm64: Avoid blindly saving/restoring all callee-saved registers
(full commit message quoted earlier in this section)
2024-08-26 15:16:24 +08:00
|
|
|
/* restore SP */
|
|
|
|
if (ctx->stack_size)
|
|
|
|
emit(A64_ADD_I(1, A64_SP, A64_SP, ctx->stack_size), ctx);
|
|
|
|
|
|
|
|
pop_callee_regs(ctx);
|
|
|
|
|
2018-01-16 03:46:08 +01:00
|
|
|
/* goto *(prog->bpf_func + prologue_offset); */
|
arm64: bpf: implement bpf_tail_call() helper
(full commit message quoted earlier in this section)
2016-06-08 21:18:48 -07:00
|
|
|
off = offsetof(struct bpf_prog, bpf_func);
|
|
|
|
emit_a64_mov_i64(tmp, off, ctx);
|
|
|
|
emit(A64_LDR64(tmp, prg, tmp), ctx);
|
|
|
|
emit(A64_ADD_I(1, tmp, tmp, sizeof(u32) * PROLOGUE_OFFSET), ctx);
|
|
|
|
emit(A64_BR(tmp), ctx);
|
|
|
|
|
bpf, arm64: Avoid blindly saving/restoring all callee-saved registers
(full commit message quoted earlier in this section)
2024-08-26 15:16:24 +08:00
|
|
|
if (ctx->image) {
|
|
|
|
off = &ctx->image[ctx->idx] - branch1;
|
|
|
|
*branch1 = cpu_to_le32(A64_B_(A64_COND_CS, off));
|
|
|
|
|
|
|
|
off = &ctx->image[ctx->idx] - branch2;
|
|
|
|
*branch2 = cpu_to_le32(A64_B_(A64_COND_CS, off));
|
|
|
|
|
|
|
|
off = &ctx->image[ctx->idx] - branch3;
|
|
|
|
*branch3 = cpu_to_le32(A64_CBZ(1, prg, off));
|
arm64: bpf: implement bpf_tail_call() helper
(full commit message quoted earlier in this section)
2016-06-08 21:18:48 -07:00
|
|
|
}
|
bpf, arm64: Avoid blindly saving/restoring all callee-saved registers
(full commit message quoted earlier in this section)
2024-08-26 15:16:24 +08:00
|
|
|
|
arm64: bpf: implement bpf_tail_call() helper
(full commit message quoted earlier in this section)
2016-06-08 21:18:48 -07:00
|
|
|
return 0;
|
2014-08-26 21:15:30 -07:00
|
|
|
}
|
|
|
|
|
bpf, arm64: Support more atomic operations
The "Atomics for eBPF" patch series adds support for atomic[64]_fetch_add,
atomic[64]_[fetch_]{and,or,xor} and atomic[64]_{xchg|cmpxchg}, but only for
x86-64, so support these atomic operations for arm64 as well.
The implementation is basically a mechanical translation of the code
snippets in atomic_ll_sc.h, atomic_lse.h and cmpxchg.h located under
arch/arm64/include/asm.
When LSE atomics are unavailable, an extra temporary register is needed for
(BPF_ADD | BPF_FETCH) to save the value of the src register; instead of
adding a TMP_REG_4, just use BPF_REG_AX. Also make emit_lse_atomic() an
empty inline function when CONFIG_ARM64_LSE_ATOMICS is disabled.
For both the cpus_have_cap(ARM64_HAS_LSE_ATOMICS) case and the
no-LSE-ATOMICS case, the following three tests were exercised and passed:
"./test_verifier", "./test_progs -t atomic" and "insmod ./test_bpf.ko".
A summary of the opcode-to-instruction mapping follows this commit message.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220217072232.1186625-4-houtao1@huawei.com
2022-02-17 15:22:31 +08:00
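For reference, the BPF atomic opcodes handled by emit_lse_atomic() below map
onto LSE instructions roughly as follows (a summary of the switch statement
that follows; AND is implemented by clearing the complement of src):

/*
 *  BPF_ADD                -> stadd
 *  BPF_AND                -> mvn + stclr
 *  BPF_OR                 -> stset
 *  BPF_XOR                -> steor
 *  BPF_ADD | BPF_FETCH    -> ldaddal
 *  BPF_AND | BPF_FETCH    -> mvn + ldclral
 *  BPF_OR  | BPF_FETCH    -> ldsetal
 *  BPF_XOR | BPF_FETCH    -> ldeoral
 */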
|
|
|
#ifdef CONFIG_ARM64_LSE_ATOMICS
|
|
|
|
static int emit_lse_atomic(const struct bpf_insn *insn, struct jit_ctx *ctx)
|
|
|
|
{
|
|
|
|
const u8 code = insn->code;
|
2024-04-26 16:11:16 +00:00
|
|
|
const u8 arena_vm_base = bpf2a64[ARENA_VM_START];
|
bpf, arm64: Support more atomic operations
(full commit message quoted earlier in this section)
2022-02-17 15:22:31 +08:00
|
|
|
const u8 dst = bpf2a64[insn->dst_reg];
|
|
|
|
const u8 src = bpf2a64[insn->src_reg];
|
|
|
|
const u8 tmp = bpf2a64[TMP_REG_1];
|
|
|
|
const u8 tmp2 = bpf2a64[TMP_REG_2];
|
|
|
|
const bool isdw = BPF_SIZE(code) == BPF_DW;
|
2024-04-26 16:11:16 +00:00
|
|
|
const bool arena = BPF_MODE(code) == BPF_PROBE_ATOMIC;
|
bpf, arm64: Support more atomic operations
(full commit message quoted earlier in this section)
2022-02-17 15:22:31 +08:00
|
|
|
const s16 off = insn->off;
|
2024-04-26 16:11:16 +00:00
|
|
|
u8 reg = dst;
|
2022-02-17 15:22:31 +08:00
|
|
|
|
2024-04-26 16:11:16 +00:00
|
|
|
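/*
 * Leave the effective address in 'reg': dst + off when an offset is
 * present, additionally rebased by the arena mapping start held in
 * 'arena_vm_base' for BPF_PROBE_ATOMIC (arena) accesses.
 */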
if (off || arena) {
|
|
|
|
if (off) {
|
|
|
|
emit_a64_mov_i(1, tmp, off, ctx);
|
|
|
|
emit(A64_ADD(1, tmp, tmp, dst), ctx);
|
|
|
|
reg = tmp;
|
|
|
|
}
|
|
|
|
if (arena) {
|
|
|
|
emit(A64_ADD(1, tmp, reg, arena_vm_base), ctx);
|
|
|
|
reg = tmp;
|
|
|
|
}
|
2022-02-17 15:22:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
switch (insn->imm) {
|
|
|
|
/* lock *(u32/u64 *)(dst_reg + off) <op>= src_reg */
|
|
|
|
case BPF_ADD:
|
|
|
|
emit(A64_STADD(isdw, reg, src), ctx);
|
|
|
|
break;
|
|
|
|
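/*
 * LSE has no atomic AND: STCLR clears the bits that are set in its
 * operand, so "*addr &= src" is emitted as "clear ~src", hence the MVN.
 */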
case BPF_AND:
|
|
|
|
emit(A64_MVN(isdw, tmp2, src), ctx);
|
|
|
|
emit(A64_STCLR(isdw, reg, tmp2), ctx);
|
|
|
|
break;
|
|
|
|
case BPF_OR:
|
|
|
|
emit(A64_STSET(isdw, reg, src), ctx);
|
|
|
|
break;
|
|
|
|
case BPF_XOR:
|
|
|
|
emit(A64_STEOR(isdw, reg, src), ctx);
|
|
|
|
break;
|
|
|
|
/* src_reg = atomic_fetch_<op>(dst_reg + off, src_reg) */
|
|
|
|
case BPF_ADD | BPF_FETCH:
|
|
|
|
emit(A64_LDADDAL(isdw, src, reg, src), ctx);
|
|
|
|
break;
|
|
|
|
case BPF_AND | BPF_FETCH:
|
|
|
|
emit(A64_MVN(isdw, tmp2, src), ctx);
|
|
|
|
emit(A64_LDCLRAL(isdw, src, reg, tmp2), ctx);
|
|
|
|
break;
|
|
|
|
case BPF_OR | BPF_FETCH:
|
|
|
|
emit(A64_LDSETAL(isdw, src, reg, src), ctx);
|
|
|
|
break;
|
|
|
|
case BPF_XOR | BPF_FETCH:
|
|
|
|
emit(A64_LDEORAL(isdw, src, reg, src), ctx);
|
|
|
|
break;
|
|
|
|
/* src_reg = atomic_xchg(dst_reg + off, src_reg); */
|
|
|
|
case BPF_XCHG:
|
|
|
|
emit(A64_SWPAL(isdw, src, reg, src), ctx);
|
|
|
|
break;
|
|
|
|
/* r0 = atomic_cmpxchg(dst_reg + off, r0, src_reg); */
|
|
|
|
case BPF_CMPXCHG:
|
|
|
|
emit(A64_CASAL(isdw, src, reg, bpf2a64[BPF_REG_0]), ctx);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
pr_err_once("unknown atomic op code %02x\n", insn->imm);
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static inline int emit_lse_atomic(const struct bpf_insn *insn, struct jit_ctx *ctx)
|
|
|
|
{
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
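/*
 * Fallback for cores without LSE atomics: a classic load-exclusive/
 * store-exclusive retry loop. The CBNZ on the status register loops
 * back to the LDXR while the exclusive store keeps failing, and the
 * fetch/xchg/cmpxchg variants end with DMB ISH for full ordering.
 */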
static int emit_ll_sc_atomic(const struct bpf_insn *insn, struct jit_ctx *ctx)
|
|
|
|
{
|
|
|
|
const u8 code = insn->code;
|
|
|
|
const u8 dst = bpf2a64[insn->dst_reg];
|
|
|
|
const u8 src = bpf2a64[insn->src_reg];
|
|
|
|
const u8 tmp = bpf2a64[TMP_REG_1];
|
|
|
|
const u8 tmp2 = bpf2a64[TMP_REG_2];
|
|
|
|
const u8 tmp3 = bpf2a64[TMP_REG_3];
|
|
|
|
const int i = insn - ctx->prog->insnsi;
|
|
|
|
const s32 imm = insn->imm;
|
|
|
|
const s16 off = insn->off;
|
|
|
|
const bool isdw = BPF_SIZE(code) == BPF_DW;
|
|
|
|
u8 reg;
|
|
|
|
s32 jmp_offset;
|
|
|
|
|
2024-04-26 16:11:16 +00:00
|
|
|
if (BPF_MODE(code) == BPF_PROBE_ATOMIC) {
|
|
|
|
/* ll_sc based atomics don't support unsafe pointers yet. */
|
|
|
|
pr_err_once("unknown atomic opcode %02x\n", code);
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2022-02-17 15:22:31 +08:00
|
|
|
if (!off) {
|
|
|
|
reg = dst;
|
|
|
|
} else {
|
|
|
|
emit_a64_mov_i(1, tmp, off, ctx);
|
|
|
|
emit(A64_ADD(1, tmp, tmp, dst), ctx);
|
|
|
|
reg = tmp;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (imm == BPF_ADD || imm == BPF_AND ||
|
|
|
|
imm == BPF_OR || imm == BPF_XOR) {
|
|
|
|
/* lock *(u32/u64 *)(dst_reg + off) <op>= src_reg */
|
|
|
|
emit(A64_LDXR(isdw, tmp2, reg), ctx);
|
|
|
|
if (imm == BPF_ADD)
|
|
|
|
emit(A64_ADD(isdw, tmp2, tmp2, src), ctx);
|
|
|
|
else if (imm == BPF_AND)
|
|
|
|
emit(A64_AND(isdw, tmp2, tmp2, src), ctx);
|
|
|
|
else if (imm == BPF_OR)
|
|
|
|
emit(A64_ORR(isdw, tmp2, tmp2, src), ctx);
|
|
|
|
else
|
|
|
|
emit(A64_EOR(isdw, tmp2, tmp2, src), ctx);
|
|
|
|
emit(A64_STXR(isdw, tmp2, reg, tmp3), ctx);
|
|
|
|
jmp_offset = -3;
|
|
|
|
check_imm19(jmp_offset);
|
|
|
|
emit(A64_CBNZ(0, tmp3, jmp_offset), ctx);
|
|
|
|
} else if (imm == (BPF_ADD | BPF_FETCH) ||
|
|
|
|
imm == (BPF_AND | BPF_FETCH) ||
|
|
|
|
imm == (BPF_OR | BPF_FETCH) ||
|
|
|
|
imm == (BPF_XOR | BPF_FETCH)) {
|
|
|
|
/* src_reg = atomic_fetch_<op>(dst_reg + off, src_reg) */
|
|
|
|
const u8 ax = bpf2a64[BPF_REG_AX];
|
|
|
|
|
|
|
|
emit(A64_MOV(isdw, ax, src), ctx);
|
|
|
|
emit(A64_LDXR(isdw, src, reg), ctx);
|
|
|
|
if (imm == (BPF_ADD | BPF_FETCH))
|
|
|
|
emit(A64_ADD(isdw, tmp2, src, ax), ctx);
|
|
|
|
else if (imm == (BPF_AND | BPF_FETCH))
|
|
|
|
emit(A64_AND(isdw, tmp2, src, ax), ctx);
|
|
|
|
else if (imm == (BPF_OR | BPF_FETCH))
|
|
|
|
emit(A64_ORR(isdw, tmp2, src, ax), ctx);
|
|
|
|
else
|
|
|
|
emit(A64_EOR(isdw, tmp2, src, ax), ctx);
|
|
|
|
emit(A64_STLXR(isdw, tmp2, reg, tmp3), ctx);
|
|
|
|
jmp_offset = -3;
|
|
|
|
check_imm19(jmp_offset);
|
|
|
|
emit(A64_CBNZ(0, tmp3, jmp_offset), ctx);
|
|
|
|
emit(A64_DMB_ISH, ctx);
|
|
|
|
} else if (imm == BPF_XCHG) {
|
|
|
|
/* src_reg = atomic_xchg(dst_reg + off, src_reg); */
|
|
|
|
emit(A64_MOV(isdw, tmp2, src), ctx);
|
|
|
|
emit(A64_LDXR(isdw, src, reg), ctx);
|
|
|
|
emit(A64_STLXR(isdw, tmp2, reg, tmp3), ctx);
|
|
|
|
jmp_offset = -2;
|
|
|
|
check_imm19(jmp_offset);
|
|
|
|
emit(A64_CBNZ(0, tmp3, jmp_offset), ctx);
|
|
|
|
emit(A64_DMB_ISH, ctx);
|
|
|
|
} else if (imm == BPF_CMPXCHG) {
|
|
|
|
/* r0 = atomic_cmpxchg(dst_reg + off, r0, src_reg); */
|
|
|
|
const u8 r0 = bpf2a64[BPF_REG_0];
|
|
|
|
|
|
|
|
emit(A64_MOV(isdw, tmp2, r0), ctx);
|
|
|
|
emit(A64_LDXR(isdw, r0, reg), ctx);
|
|
|
|
emit(A64_EOR(isdw, tmp3, r0, tmp2), ctx);
|
|
|
|
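/*
 * tmp3 == 0 iff the loaded value equals the expected value saved in
 * tmp2; on mismatch, branch forward past the store-exclusive, its
 * retry branch and the barrier.
 */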
jmp_offset = 4;
|
|
|
|
check_imm19(jmp_offset);
|
|
|
|
emit(A64_CBNZ(isdw, tmp3, jmp_offset), ctx);
|
|
|
|
emit(A64_STLXR(isdw, src, reg, tmp3), ctx);
|
|
|
|
jmp_offset = -4;
|
|
|
|
check_imm19(jmp_offset);
|
|
|
|
emit(A64_CBNZ(0, tmp3, jmp_offset), ctx);
|
|
|
|
emit(A64_DMB_ISH, ctx);
|
|
|
|
} else {
|
|
|
|
pr_err_once("unknown atomic op code %02x\n", imm);
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
bpf, arm64: Implement bpf_arch_text_poke() for arm64
Implement bpf_arch_text_poke() for arm64, so bpf prog or bpf trampoline
can be patched with it.
When the target address is NULL, the original instruction is patched to
a NOP.
When the target address and the source address are within the branch
range, the original instruction is patched to a bl instruction to the
target address directly.
To support attaching bpf trampoline to both regular kernel function and
bpf prog, we follow the ftrace patchsite approach for bpf prog. That is, two
instructions are inserted at the beginning of bpf prog, the first one
saves the return address to x9, and the second is a nop which will be
patched to a bl instruction when a bpf trampoline is attached.
However, when a bpf trampoline is attached to bpf prog, the distance
between target address and source address may exceed 128MB, the maximum
branch range, because bpf trampoline and bpf prog are allocated
separately with vmalloc. So long jumps need to be handled.
When a bpf prog is constructed, a plt pointing to empty trampoline
dummy_tramp is placed at the end:
bpf_prog:
mov x9, lr
nop // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad dummy_tramp // plt target
This is also the state when no trampoline is attached.
When a short-jump bpf trampoline is attached, the patchsite is patched to
a bl instruction to the trampoline directly:
bpf_prog:
mov x9, lr
bl <short-jump bpf trampoline address> // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad dummy_tramp // plt target
When a long-jump bpf trampoline is attached, the plt target is filled with
the trampoline address and the patchsite is patched to a bl instruction to
the plt:
bpf_prog:
mov x9, lr
bl plt // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad <long-jump bpf trampoline address>
dummy_tramp is used to prevent another CPU from jumping to an unknown
location during the patching process, making the patching process easier.
The patching process is as follows:
1. when neither the old address nor the new address is a long jump, the
patchsite is replaced with a bl to the new address, or nop if the new
address is NULL;
2. when the old address is not long jump but the new one is, the
branch target address is written to plt first, then the patchsite
is replaced with a bl instruction to the plt;
3. when the old address is long jump but the new one is not, the address
of dummy_tramp is written to plt first, then the patchsite is replaced
with a bl to the new address, or a nop if the new address is NULL;
4. when both the old address and the new address are long jump, the
new address is written to plt and the patchsite is not changed.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20220711150823.2128542-4-xukuohai@huawei.com
2022-07-11 11:08:22 -04:00
|
|
|
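/*
 * Summary of the patching cases described above ("old"/"new" are the
 * currently installed and the requested branch targets):
 *
 *   old short, new short: patchsite = bl new (or nop if new is NULL)
 *   old short, new long : plt->target = new, then patchsite = bl plt
 *   old long,  new short: plt->target = dummy_tramp, then
 *                         patchsite = bl new (or nop)
 *   old long,  new long : plt->target = new, patchsite left unchanged
 */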
void dummy_tramp(void);
|
|
|
|
|
|
|
|
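/*
 * dummy_tramp is entered via the plt's "br x10". It restores lr from x9
 * (stashed there by the first patchsite instruction of the bpf prog) and
 * returns to the instruction after the bl, i.e. it behaves as a
 * do-nothing trampoline while patching is in progress.
 */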
asm (
|
|
|
|
" .pushsection .text, \"ax\", @progbits\n"
|
2022-07-13 10:35:03 -07:00
|
|
|
" .global dummy_tramp\n"
|
2022-07-11 11:08:22 -04:00
|
|
|
" .type dummy_tramp, %function\n"
|
|
|
|
"dummy_tramp:"
|
|
|
|
#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)
|
|
|
|
" bti j\n" /* dummy_tramp is called via "br x10" */
|
|
|
|
#endif
|
2022-07-21 08:13:19 -04:00
|
|
|
" mov x10, x30\n"
|
|
|
|
" mov x30, x9\n"
|
2022-07-11 11:08:22 -04:00
|
|
|
" ret x10\n"
|
|
|
|
" .size dummy_tramp, .-dummy_tramp\n"
|
|
|
|
" .popsection\n"
|
|
|
|
);
|
|
|
|
|
|
|
|
/* build a plt initialized like this:
|
|
|
|
*
|
|
|
|
* plt:
|
|
|
|
* ldr tmp, target
|
|
|
|
* br tmp
|
|
|
|
* target:
|
|
|
|
* .quad dummy_tramp
|
|
|
|
*
|
|
|
|
* when a long jump trampoline is attached, target is filled with the
|
|
|
|
* trampoline address, and when the trampoline is removed, target is
|
|
|
|
* restored to dummy_tramp address.
|
|
|
|
*/
|
|
|
|
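/*
 * Presumed layout of struct bpf_plt (defined earlier in this file): the
 * two instruction slots for the ldr/br pair followed by a 64-bit target
 * word at PLT_TARGET_OFFSET, which is why the A64_LDR64LIT below loads
 * its literal from 2 * AARCH64_INSN_SIZE ahead.
 */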
static void build_plt(struct jit_ctx *ctx)
|
|
|
|
{
|
|
|
|
const u8 tmp = bpf2a64[TMP_REG_1];
|
|
|
|
struct bpf_plt *plt = NULL;
|
|
|
|
|
|
|
|
/* make sure target is 64-bit aligned */
|
|
|
|
if ((ctx->idx + PLT_TARGET_OFFSET / AARCH64_INSN_SIZE) % 2)
|
|
|
|
emit(A64_NOP, ctx);
|
|
|
|
|
|
|
|
plt = (struct bpf_plt *)(ctx->image + ctx->idx);
|
|
|
|
/* plt is called via bl, no BTI needed here */
|
|
|
|
emit(A64_LDR64LIT(tmp, 2 * AARCH64_INSN_SIZE), ctx);
|
|
|
|
emit(A64_BR(tmp), ctx);
|
|
|
|
|
|
|
|
if (ctx->image)
|
|
|
|
plt->target = (u64)&dummy_tramp;
|
|
|
|
}
|
|
|
|
|
bpf, arm64: Avoid blindly saving/restoring all callee-saved registers
The arm64 jit blindly saves/restores all callee-saved registers, making
the jited result look a bit too complicated. For example, for an empty
prog, the jited result is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp x19, x20, [sp, #-16]!
1c: stp x21, x22, [sp, #-16]!
20: stp x26, x25, [sp, #-16]!
24: mov x26, #0
28: stp x26, x25, [sp, #-16]!
2c: mov x26, sp
30: stp x27, x28, [sp, #-16]!
34: mov x25, sp
38: bti j // tailcall target
3c: sub sp, sp, #0
40: mov x7, #0
44: add sp, sp, #0
48: ldp x27, x28, [sp], #16
4c: ldp x26, x25, [sp], #16
50: ldp x26, x25, [sp], #16
54: ldp x21, x22, [sp], #16
58: ldp x19, x20, [sp], #16
5c: ldp fp, lr, [sp], #16
60: mov x0, x7
64: autiasp
68: ret
Clearly, there is no need to save/restore unused callee-saved registers.
This patch does this change, making the jited image only save/restore
the callee-saved registers it uses.
Now the jited result of empty prog is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp xzr, x26, [sp, #-16]!
1c: mov x26, sp
20: bti j // tailcall target
24: mov x7, #0
28: ldp xzr, x26, [sp], #16
2c: ldp fp, lr, [sp], #16
30: mov x0, x7
34: autiasp
38: ret
Since bpf prog saves/restores its own callee-saved registers as needed,
to make tailcall work correctly, the caller needs to restore its saved
registers before tailcall, and the callee needs to save its callee-saved
registers after tailcall. These extra restore/save instructions
increase performance overhead.
[1] provides 2 benchmarks for tailcall scenarios. Below is the perf
number measured in an arm64 KVM guest. The result indicates that the
performance difference before and after the patch in typical tailcall
scenarios is negligible.
- Before:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4313.43 msec task-clock # 0.874 CPUs utilized ( +- 0.16% )
574 context-switches # 133.073 /sec ( +- 1.14% )
0 cpu-migrations # 0.000 /sec
538 page-faults # 124.727 /sec ( +- 0.57% )
10697772784 cycles # 2.480 GHz ( +- 0.22% ) (61.19%)
25511241955 instructions # 2.38 insn per cycle ( +- 0.08% ) (66.70%)
5108910557 branches # 1.184 G/sec ( +- 0.08% ) (72.38%)
2800459 branch-misses # 0.05% of all branches ( +- 0.51% ) (72.36%)
TopDownL1 # 0.60 retiring ( +- 0.09% ) (66.84%)
# 0.21 frontend_bound ( +- 0.15% ) (61.31%)
# 0.12 bad_speculation ( +- 0.08% ) (50.11%)
# 0.07 backend_bound ( +- 0.16% ) (33.30%)
8274201819 L1-dcache-loads # 1.918 G/sec ( +- 0.18% ) (33.15%)
468268 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 4.69% ) (33.16%)
385383 LLC-loads # 89.345 K/sec ( +- 5.22% ) (33.16%)
38296 LLC-load-misses # 9.94% of all LL-cache accesses ( +- 42.52% ) (38.69%)
6886576501 L1-icache-loads # 1.597 G/sec ( +- 0.35% ) (38.69%)
1848585 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 4.52% ) (44.23%)
9043645883 dTLB-loads # 2.097 G/sec ( +- 0.10% ) (44.33%)
416672 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 5.15% ) (49.89%)
6925626111 iTLB-loads # 1.606 G/sec ( +- 0.35% ) (55.46%)
66220 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.88% ) (55.50%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9372 +- 0.0526 seconds time elapsed ( +- 1.07% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10924.50 msec task-clock # 0.945 CPUs utilized ( +- 0.08% )
603 context-switches # 55.197 /sec ( +- 1.13% )
0 cpu-migrations # 0.000 /sec
566 page-faults # 51.810 /sec ( +- 0.42% )
27381270695 cycles # 2.506 GHz ( +- 0.18% ) (60.46%)
56996583922 instructions # 2.08 insn per cycle ( +- 0.21% ) (66.11%)
10321647567 branches # 944.816 M/sec ( +- 0.17% ) (71.79%)
3347735 branch-misses # 0.03% of all branches ( +- 3.72% ) (72.15%)
TopDownL1 # 0.52 retiring ( +- 0.13% ) (66.74%)
# 0.27 frontend_bound ( +- 0.14% ) (61.27%)
# 0.14 bad_speculation ( +- 0.19% ) (50.36%)
# 0.07 backend_bound ( +- 0.42% ) (33.89%)
18740797617 L1-dcache-loads # 1.715 G/sec ( +- 0.43% ) (33.71%)
13715669 L1-dcache-load-misses # 0.07% of all L1-dcache accesses ( +- 32.85% ) (33.34%)
4087551 LLC-loads # 374.164 K/sec ( +- 29.53% ) (33.26%)
267906 LLC-load-misses # 6.55% of all LL-cache accesses ( +- 23.90% ) (38.76%)
15811864229 L1-icache-loads # 1.447 G/sec ( +- 0.12% ) (38.73%)
2976833 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 9.73% ) (44.22%)
20138907471 dTLB-loads # 1.843 G/sec ( +- 0.18% ) (44.15%)
732850 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 11.18% ) (49.64%)
15895726702 iTLB-loads # 1.455 G/sec ( +- 0.15% ) (55.13%)
152075 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 4.71% ) (54.98%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.5613 +- 0.0317 seconds time elapsed ( +- 0.27% )
- After:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4278.78 msec task-clock # 0.871 CPUs utilized ( +- 0.15% )
569 context-switches # 132.982 /sec ( +- 0.58% )
0 cpu-migrations # 0.000 /sec
539 page-faults # 125.970 /sec ( +- 0.43% )
10588986432 cycles # 2.475 GHz ( +- 0.20% ) (60.91%)
25303825043 instructions # 2.39 insn per cycle ( +- 0.08% ) (66.48%)
5110756256 branches # 1.194 G/sec ( +- 0.07% ) (72.03%)
2719569 branch-misses # 0.05% of all branches ( +- 2.42% ) (72.03%)
TopDownL1 # 0.60 retiring ( +- 0.22% ) (66.31%)
# 0.22 frontend_bound ( +- 0.21% ) (60.83%)
# 0.12 bad_speculation ( +- 0.26% ) (50.25%)
# 0.06 backend_bound ( +- 0.17% ) (33.52%)
8163648527 L1-dcache-loads # 1.908 G/sec ( +- 0.33% ) (33.52%)
694979 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 30.53% ) (33.52%)
1902347 LLC-loads # 444.600 K/sec ( +- 48.84% ) (33.69%)
96677 LLC-load-misses # 5.08% of all LL-cache accesses ( +- 43.48% ) (39.30%)
6863517589 L1-icache-loads # 1.604 G/sec ( +- 0.37% ) (39.17%)
1871519 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 6.78% ) (44.56%)
8927782813 dTLB-loads # 2.087 G/sec ( +- 0.14% ) (44.37%)
438237 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 6.00% ) (49.75%)
6886906831 iTLB-loads # 1.610 G/sec ( +- 0.36% ) (55.08%)
67568 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 3.27% ) (54.86%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9114 +- 0.0309 seconds time elapsed ( +- 0.63% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10948.40 msec task-clock # 0.942 CPUs utilized ( +- 0.05% )
615 context-switches # 56.173 /sec ( +- 1.65% )
1 cpu-migrations # 0.091 /sec ( +- 31.62% )
567 page-faults # 51.788 /sec ( +- 0.44% )
27334194328 cycles # 2.497 GHz ( +- 0.08% ) (61.05%)
56656528828 instructions # 2.07 insn per cycle ( +- 0.08% ) (66.67%)
10270389422 branches # 938.072 M/sec ( +- 0.10% ) (72.21%)
3453837 branch-misses # 0.03% of all branches ( +- 3.75% ) (72.27%)
TopDownL1 # 0.52 retiring ( +- 0.16% ) (66.55%)
# 0.27 frontend_bound ( +- 0.09% ) (60.91%)
# 0.14 bad_speculation ( +- 0.08% ) (49.85%)
# 0.07 backend_bound ( +- 0.16% ) (33.33%)
18982866028 L1-dcache-loads # 1.734 G/sec ( +- 0.24% ) (33.34%)
8802454 L1-dcache-load-misses # 0.05% of all L1-dcache accesses ( +- 52.30% ) (33.31%)
2612962 LLC-loads # 238.661 K/sec ( +- 29.78% ) (33.45%)
264107 LLC-load-misses # 10.11% of all LL-cache accesses ( +- 18.34% ) (39.07%)
15793205997 L1-icache-loads # 1.443 G/sec ( +- 0.15% ) (39.09%)
3930802 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 3.72% ) (44.66%)
20097828496 dTLB-loads # 1.836 G/sec ( +- 0.09% ) (44.68%)
961757 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 3.32% ) (50.15%)
15838728506 iTLB-loads # 1.447 G/sec ( +- 0.09% ) (55.62%)
167652 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.28% ) (55.52%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.6173 +- 0.0268 seconds time elapsed ( +- 0.23% )
[1] https://lore.kernel.org/bpf/20200724123644.5096-1-maciej.fijalkowski@intel.com/
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20240826071624.350108-3-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-08-26 15:16:24 +08:00
|
|
|
static void build_epilogue(struct jit_ctx *ctx)
|
2014-08-26 21:15:30 -07:00
|
|
|
{
|
|
|
|
const u8 r0 = bpf2a64[BPF_REG_0];
|
bpf, arm64: Fix tailcall hierarchy
This patch fixes a tailcall issue caused by abusing the tailcall in
bpf2bpf feature on arm64 like the way of "bpf, x64: Fix tailcall
hierarchy".
On arm64, when a tail call happens, it uses tail_call_cnt_ptr to
increment tail_call_cnt, too.
At the prologue of main prog, it has to initialize tail_call_cnt and
prepare tail_call_cnt_ptr.
At the prologue of subprog, it pushes the x26 register twice, and does not
initialize tail_call_cnt.
At the epilogue, it pops x26 twice, no matter whether it is main prog or
subprog.
Fixes: d4609a5d8c70 ("bpf, arm64: Keep tail call count across bpf2bpf calls")
Acked-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
Link: https://lore.kernel.org/r/20240714123902.32305-3-hffilwlqm@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-07-14 20:39:01 +08:00
|
|
|
const u8 ptr = bpf2a64[TCCNT_PTR];
|
2014-08-26 21:15:30 -07:00
|
|
|
|
|
|
|
/* We're done with BPF stack */
|
2024-08-26 15:16:24 +08:00
|
|
|
if (ctx->stack_size)
|
|
|
|
emit(A64_ADD_I(1, A64_SP, A64_SP, ctx->stack_size), ctx);
|
2014-08-26 21:15:30 -07:00
|
|
|
|
2024-08-26 15:16:24 +08:00
|
|
|
pop_callee_regs(ctx);
|
2024-02-01 12:52:25 +00:00
|
|
|
|
2024-08-26 15:16:24 +08:00
|
|
|
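/*
 * Pop the slot pair pushed by the prologue for the tail-call counter:
 * the count itself is discarded into xzr and the saved tail_call_cnt_ptr
 * register is restored.
 */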
emit(A64_POP(A64_ZR, ptr, A64_SP), ctx);
|
2014-08-26 21:15:30 -07:00
|
|
|
|
2015-11-16 14:35:35 -08:00
|
|
|
/* Restore FP/LR registers */
|
|
|
|
emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
|
2014-08-26 21:15:30 -07:00
|
|
|
|
|
|
|
/* Set return value */
|
|
|
|
emit(A64_MOV(1, A64_R(0), r0), ctx);
|
|
|
|
|
2022-04-02 03:39:42 -04:00
|
|
|
/* Authenticate lr */
|
|
|
|
if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL))
|
|
|
|
emit(A64_AUTIASP, ctx);
|
|
|
|
|
2014-08-26 21:15:30 -07:00
|
|
|
emit(A64_RET(A64_LR), ctx);
|
|
|
|
}
|
|
|
|
|
2020-07-28 17:21:26 +02:00
|
|
|
#define BPF_FIXUP_OFFSET_MASK GENMASK(26, 0)
|
|
|
|
#define BPF_FIXUP_REG_MASK GENMASK(31, 27)
|
2024-03-25 15:07:15 +00:00
|
|
|
#define DONT_CLEAR 5 /* Unused ARM64 register from BPF's POV */
|
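/*
 * ex->fixup packs two fields: bits [26:0] hold a positive offset such
 * that "&ex->fixup - offset" in the handler is the instruction after the
 * faulting access, and bits [31:27] hold the arm64 register to zero on a
 * fault (DONT_CLEAR when there is no destination register to clear).
 */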
2020-07-28 17:21:26 +02:00
|
|
|
|
arm64: extable: add `type` and `data` fields
Subsequent patches will add specialized handlers for fixups, in addition
to the simple PC fixup and BPF handlers we have today. In preparation,
this patch adds a new `type` field to struct exception_table_entry, and
uses this to distinguish the fixup and BPF cases. A `data` field is also
added so that subsequent patches can associate data specific to each
exception site (e.g. register numbers).
Handlers are named ex_handler_*() for consistency, following the example
of x86. At the same time, get_ex_fixup() is split out into a helper so
that it can be used by other ex_handler_*() functions in subsequent
patches.
This patch will increase the size of the exception tables, which will be
remedied by subsequent patches removing redundant fixup code. There
should be no functional change as a result of this patch.
Since each entry is now 12 bytes in size, we must reduce the alignment
of each entry from `.align 3` (i.e. 8 bytes) to `.align 2` (i.e. 4
bytes), which is the natural alignment of the `insn` and `fixup` fields.
The current 8-byte alignment is a holdover from when the `insn` and
`fixup` fields were 8 bytes, and while not harmful has not been necessary
since commit:
6c94f27ac847ff8e ("arm64: switch to relative exception tables")
Similarly, RO_EXCEPTION_TABLE_ALIGN is dropped to 4 bytes.
Concurrently with this patch, x86's exception table entry format is
being updated (similarly to a 12-byte format, with 32-bytes of absolute
data). Once both have been merged it should be possible to unify the
sorttable logic for the two.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: James Morse <james.morse@arm.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20211019160219.5202-11-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2021-10-19 17:02:16 +01:00
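For orientation, a rough sketch of the 12-byte relative entry layout this commit describes; this is a hedged illustration only, the authoritative definition lives in the arm64 extable header:
struct exception_table_entry {
	int insn, fixup;	/* PC-relative offsets to the faulting insn and its fixup */
	short type, data;	/* handler type, plus per-site data (e.g. a register number) */
};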
|
|
|
bool ex_handler_bpf(const struct exception_table_entry *ex,
|
|
|
|
struct pt_regs *regs)
|
2020-07-28 17:21:26 +02:00
|
|
|
{
|
|
|
|
off_t offset = FIELD_GET(BPF_FIXUP_OFFSET_MASK, ex->fixup);
|
|
|
|
int dst_reg = FIELD_GET(BPF_FIXUP_REG_MASK, ex->fixup);
|
|
|
|
|
2024-03-25 15:07:15 +00:00
|
|
|
if (dst_reg != DONT_CLEAR)
|
|
|
|
regs->regs[dst_reg] = 0;
|
2020-07-28 17:21:26 +02:00
|
|
|
regs->pc = (unsigned long)&ex->fixup - offset;
|
2021-10-19 17:02:14 +01:00
|
|
|
return true;
|
2020-07-28 17:21:26 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/* For accesses to BTF pointers, add an entry to the exception table */
|
|
|
|
static int add_exception_handler(const struct bpf_insn *insn,
|
|
|
|
struct jit_ctx *ctx,
|
|
|
|
int dst_reg)
|
|
|
|
{
|
2024-02-28 14:18:24 +00:00
|
|
|
off_t ins_offset;
|
|
|
|
off_t fixup_offset;
|
2020-07-28 17:21:26 +02:00
|
|
|
unsigned long pc;
|
|
|
|
struct exception_table_entry *ex;
|
|
|
|
|
|
|
|
if (!ctx->image)
|
|
|
|
/* First pass */
|
|
|
|
return 0;
|
|
|
|
|
2023-08-15 11:41:53 -04:00
|
|
|
if (BPF_MODE(insn->code) != BPF_PROBE_MEM &&
|
2024-03-25 15:07:15 +00:00
|
|
|
BPF_MODE(insn->code) != BPF_PROBE_MEMSX &&
|
2024-04-26 16:11:16 +00:00
|
|
|
BPF_MODE(insn->code) != BPF_PROBE_MEM32 &&
|
|
|
|
BPF_MODE(insn->code) != BPF_PROBE_ATOMIC)
|
2020-07-28 17:21:26 +02:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (!ctx->prog->aux->extable ||
|
|
|
|
WARN_ON_ONCE(ctx->exentry_idx >= ctx->prog->aux->num_exentries))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
ex = &ctx->prog->aux->extable[ctx->exentry_idx];
|
2024-02-28 14:18:24 +00:00
|
|
|
pc = (unsigned long)&ctx->ro_image[ctx->idx - 1];
|
2020-07-28 17:21:26 +02:00
|
|
|
|
2024-02-28 14:18:24 +00:00
|
|
|
/*
|
|
|
|
* This is the relative offset of the instruction that may fault from
|
|
|
|
* the exception table itself. This will be written to the exception
|
|
|
|
* table and if this instruction faults, the destination register will
|
|
|
|
* be set to '0' and the execution will jump to the next instruction.
|
|
|
|
*/
|
|
|
|
ins_offset = pc - (long)&ex->insn;
|
|
|
|
if (WARN_ON_ONCE(ins_offset >= 0 || ins_offset < INT_MIN))
|
2020-07-28 17:21:26 +02:00
|
|
|
return -ERANGE;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Since the extable follows the program, the fixup offset is always
|
|
|
|
* negative and limited to BPF_JIT_REGION_SIZE. Store a positive value
|
|
|
|
* to keep things simple, and put the destination register in the upper
|
|
|
|
* bits. We don't need to worry about buildtime or runtime sort
|
|
|
|
* modifying the upper bits because the table is already sorted, and
|
|
|
|
* isn't part of the main exception table.
|
2024-02-28 14:18:24 +00:00
|
|
|
*
|
|
|
|
* The fixup_offset is set to the next instruction from the instruction
|
|
|
|
* that may fault. The execution will jump to this after handling the
|
|
|
|
* fault.
|
2020-07-28 17:21:26 +02:00
|
|
|
*/
|
2024-02-28 14:18:24 +00:00
|
|
|
fixup_offset = (long)&ex->fixup - (pc + AARCH64_INSN_SIZE);
|
|
|
|
if (!FIELD_FIT(BPF_FIXUP_OFFSET_MASK, fixup_offset))
|
2020-07-28 17:21:26 +02:00
|
|
|
return -ERANGE;
|
|
|
|
|
2024-02-28 14:18:24 +00:00
|
|
|
/*
|
|
|
|
* The offsets above have been calculated using the RO buffer but we
|
|
|
|
* need to use the R/W buffer for writes.
|
|
|
|
* Switch ex to the R/W buffer before writing.
|
|
|
|
*/
|
|
|
|
ex = (void *)ctx->image + ((void *)ex - (void *)ctx->ro_image);
|
|
|
|
|
|
|
|
ex->insn = ins_offset;
|
|
|
|
|
2024-03-25 15:07:15 +00:00
|
|
|
if (BPF_CLASS(insn->code) != BPF_LDX)
|
|
|
|
dst_reg = DONT_CLEAR;
|
|
|
|
|
2024-02-28 14:18:24 +00:00
|
|
|
ex->fixup = FIELD_PREP(BPF_FIXUP_OFFSET_MASK, fixup_offset) |
|
2020-07-28 17:21:26 +02:00
|
|
|
FIELD_PREP(BPF_FIXUP_REG_MASK, dst_reg);
|
|
|
|
|
arm64: extable: add `type` and `data` fields
2021-10-19 17:02:16 +01:00
|
|
|
ex->type = EX_TYPE_BPF;
|
|
|
|
|
2020-07-28 17:21:26 +02:00
|
|
|
ctx->exentry_idx++;
|
|
|
|
return 0;
|
|
|
|
}
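As an aside, a minimal, self-contained sketch of how the packed fixup word above round-trips through the <linux/bitfield.h> helpers; the function name and values are made up purely for illustration:
static void fixup_word_roundtrip_demo(void)
{
	u32 word = FIELD_PREP(BPF_FIXUP_OFFSET_MASK, 0x100) |
		   FIELD_PREP(BPF_FIXUP_REG_MASK, 7);

	/* FIELD_GET() recovers exactly what FIELD_PREP() packed */
	WARN_ON(FIELD_GET(BPF_FIXUP_OFFSET_MASK, word) != 0x100);
	WARN_ON(FIELD_GET(BPF_FIXUP_REG_MASK, word) != 7);
}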
|
|
|
|
|
2014-09-16 21:29:23 +01:00
|
|
|
/* JITs an eBPF instruction.
|
|
|
|
* Returns:
|
|
|
|
* 0 - successfully JITed an 8-byte eBPF instruction.
|
|
|
|
* >0 - successfully JITed a 16-byte eBPF instruction.
|
|
|
|
* <0 - failed to JIT.
|
|
|
|
*/
|
2018-11-26 14:05:39 +01:00
|
|
|
static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
|
|
|
|
bool extra_pass)
|
2014-08-26 21:15:30 -07:00
|
|
|
{
|
|
|
|
const u8 code = insn->code;
|
2024-03-25 15:07:15 +00:00
|
|
|
u8 dst = bpf2a64[insn->dst_reg];
|
|
|
|
u8 src = bpf2a64[insn->src_reg];
|
2014-08-26 21:15:30 -07:00
|
|
|
const u8 tmp = bpf2a64[TMP_REG_1];
|
|
|
|
const u8 tmp2 = bpf2a64[TMP_REG_2];
|
bpf, arm64: Adjust the offset of str/ldr(immediate) to positive number
The BPF STX/LDX instruction uses offset relative to the FP to address
stack space. Since the BPF_FP locates at the top of the frame, the offset
is usually a negative number. However, arm64 str/ldr immediate instruction
requires that offset be a positive number. Therefore, this patch tries to
convert the offsets.
The method is to find the negative offset furthest from the FP firstly.
Then add it to the FP, calculate a bottom position, called FPB, and then
adjust the offsets in other STR/LDX instructions relative to FPB.
FPB is saved using the callee-saved register x27 of arm64 which is not
used yet.
Before adjusting the offset, the patch checks every instruction to ensure
that the FP does not change in run-time. If the FP may change, no offset
is adjusted.
For example, for the following bpftrace command:
bpftrace -e 'kprobe:do_sys_open { printf("opening: %s\n", str(arg1)); }'
Without this patch, jited code(fragment):
0: bti c
4: stp x29, x30, [sp, #-16]!
8: mov x29, sp
c: stp x19, x20, [sp, #-16]!
10: stp x21, x22, [sp, #-16]!
14: stp x25, x26, [sp, #-16]!
18: mov x25, sp
1c: mov x26, #0x0 // #0
20: bti j
24: sub sp, sp, #0x90
28: add x19, x0, #0x0
2c: mov x0, #0x0 // #0
30: mov x10, #0xffffffffffffff78 // #-136
34: str x0, [x25, x10]
38: mov x10, #0xffffffffffffff80 // #-128
3c: str x0, [x25, x10]
40: mov x10, #0xffffffffffffff88 // #-120
44: str x0, [x25, x10]
48: mov x10, #0xffffffffffffff90 // #-112
4c: str x0, [x25, x10]
50: mov x10, #0xffffffffffffff98 // #-104
54: str x0, [x25, x10]
58: mov x10, #0xffffffffffffffa0 // #-96
5c: str x0, [x25, x10]
60: mov x10, #0xffffffffffffffa8 // #-88
64: str x0, [x25, x10]
68: mov x10, #0xffffffffffffffb0 // #-80
6c: str x0, [x25, x10]
70: mov x10, #0xffffffffffffffb8 // #-72
74: str x0, [x25, x10]
78: mov x10, #0xffffffffffffffc0 // #-64
7c: str x0, [x25, x10]
80: mov x10, #0xffffffffffffffc8 // #-56
84: str x0, [x25, x10]
88: mov x10, #0xffffffffffffffd0 // #-48
8c: str x0, [x25, x10]
90: mov x10, #0xffffffffffffffd8 // #-40
94: str x0, [x25, x10]
98: mov x10, #0xffffffffffffffe0 // #-32
9c: str x0, [x25, x10]
a0: mov x10, #0xffffffffffffffe8 // #-24
a4: str x0, [x25, x10]
a8: mov x10, #0xfffffffffffffff0 // #-16
ac: str x0, [x25, x10]
b0: mov x10, #0xfffffffffffffff8 // #-8
b4: str x0, [x25, x10]
b8: mov x10, #0x8 // #8
bc: ldr x2, [x19, x10]
[...]
With this patch, jited code(fragment):
0: bti c
4: stp x29, x30, [sp, #-16]!
8: mov x29, sp
c: stp x19, x20, [sp, #-16]!
10: stp x21, x22, [sp, #-16]!
14: stp x25, x26, [sp, #-16]!
18: stp x27, x28, [sp, #-16]!
1c: mov x25, sp
20: sub x27, x25, #0x88
24: mov x26, #0x0 // #0
28: bti j
2c: sub sp, sp, #0x90
30: add x19, x0, #0x0
34: mov x0, #0x0 // #0
38: str x0, [x27]
3c: str x0, [x27, #8]
40: str x0, [x27, #16]
44: str x0, [x27, #24]
48: str x0, [x27, #32]
4c: str x0, [x27, #40]
50: str x0, [x27, #48]
54: str x0, [x27, #56]
58: str x0, [x27, #64]
5c: str x0, [x27, #72]
60: str x0, [x27, #80]
64: str x0, [x27, #88]
68: str x0, [x27, #96]
6c: str x0, [x27, #104]
70: str x0, [x27, #112]
74: str x0, [x27, #120]
78: str x0, [x27, #128]
7c: ldr x2, [x19, #8]
[...]
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220321152852.2334294-4-xukuohai@huawei.com
2022-03-21 11:28:50 -04:00
|
|
|
const u8 fp = bpf2a64[BPF_REG_FP];
|
2024-03-25 15:07:15 +00:00
|
|
|
const u8 arena_vm_base = bpf2a64[ARENA_VM_START];
|
2014-08-26 21:15:30 -07:00
|
|
|
const s16 off = insn->off;
|
|
|
|
const s32 imm = insn->imm;
|
|
|
|
const int i = insn - ctx->prog->insnsi;
|
2019-01-26 12:26:08 -05:00
|
|
|
const bool is64 = BPF_CLASS(code) == BPF_ALU64 ||
|
|
|
|
BPF_CLASS(code) == BPF_JMP;
|
bpf, arm64: Support more atomic operations
The "Atomics for eBPF" patch series adds support for atomic[64]_fetch_add,
atomic[64]_[fetch_]{and,or,xor} and atomic[64]_{xchg|cmpxchg}, but it
only adds support for x86-64, so support these atomic operations for
arm64 as well.
The implementation is essentially a mechanical translation of the code
snippets in atomic_ll_sc.h, atomic_lse.h and cmpxchg.h located under
arch/arm64/include/asm.
When LSE atomics are unavailable, an extra temporary register is needed for
(BPF_ADD | BPF_FETCH) to save the value of the src register; instead of
adding TMP_REG_4, BPF_REG_AX is reused. Also make emit_lse_atomic() an
empty inline function when CONFIG_ARM64_LSE_ATOMICS is disabled.
For both the cpus_have_cap(ARM64_HAS_LSE_ATOMICS) and no-LSE-atomics cases, the
following three tests: "./test_verifier", "./test_progs -t atomic" and
"insmod ./test_bpf.ko" are exercised and passed.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220217072232.1186625-4-houtao1@huawei.com
2022-02-17 15:22:31 +08:00
|
|
|
u8 jmp_cond;
|
2014-08-26 21:15:30 -07:00
|
|
|
s32 jmp_offset;
|
bpf, arm64: Optimize AND,OR,XOR,JSET BPF_K using arm64 logical immediates
The current code for BPF_{AND,OR,XOR,JSET} BPF_K loads the immediate to
a temporary register before use.
This patch changes the code to avoid using a temporary register
when the BPF immediate is encodable using an arm64 logical immediate
instruction. If the encoding fails (due to the immediate not being
encodable), it falls back to using a temporary register.
Example of generated code for BPF_ALU32_IMM(BPF_AND, R0, 0x80000001):
without optimization:
24: mov w10, #0x8000ffff
28: movk w10, #0x1
2c: and w7, w7, w10
with optimization:
24: and w7, w7, #0x80000001
Since the encoding process is quite complex, the JIT reuses existing
functionality in arch/arm64/kernel/insn.c for encoding logical immediates
rather than duplicating it in the JIT.
Co-developed-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20200508181547.24783-3-luke.r.nels@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
2020-05-08 11:15:45 -07:00
|
|
|
u32 a64_insn;
|
bpf, arm64: Adjust the offset of str/ldr(immediate) to positive number
2022-03-21 11:28:50 -04:00
|
|
|
u8 src_adj;
|
|
|
|
u8 dst_adj;
|
|
|
|
int off_adj;
|
2020-07-28 17:21:26 +02:00
|
|
|
int ret;
|
2023-08-15 11:41:53 -04:00
|
|
|
bool sign_extend;
|
2014-08-26 21:15:30 -07:00
|
|
|
|
|
|
|
switch (code) {
|
|
|
|
/* dst = src */
|
|
|
|
case BPF_ALU | BPF_MOV | BPF_X:
|
|
|
|
case BPF_ALU64 | BPF_MOV | BPF_X:
|
2024-03-25 15:07:16 +00:00
|
|
|
if (insn_is_cast_user(insn)) {
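/* Arena pointer "cast to user": the sequence below keeps the low 32 bits
 * of src, splices in the upper 32 bits of ctx->user_vm_start, and lets the
 * CBZ skip the ORR so a NULL arena pointer stays NULL.
 */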
|
|
|
|
emit(A64_MOV(0, tmp, src), ctx); // 32-bit mov clears the upper 32 bits
|
|
|
|
emit_a64_mov_i(0, dst, ctx->user_vm_start >> 32, ctx);
|
|
|
|
emit(A64_LSL(1, dst, dst, 32), ctx);
|
|
|
|
emit(A64_CBZ(1, tmp, 2), ctx);
|
|
|
|
emit(A64_ORR(1, tmp, dst, tmp), ctx);
|
|
|
|
emit(A64_MOV(1, dst, tmp), ctx);
|
|
|
|
break;
|
2024-05-02 15:18:53 +00:00
|
|
|
} else if (insn_is_mov_percpu_addr(insn)) {
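/* dst = per-CPU address: add this CPU's per-CPU offset, which lives in
 * TPIDR_EL2 when the kernel runs at EL2 (VHE) and in TPIDR_EL1 otherwise.
 */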
|
|
|
|
if (dst != src)
|
|
|
|
emit(A64_MOV(1, dst, src), ctx);
|
|
|
|
if (cpus_have_cap(ARM64_HAS_VIRT_HOST_EXTN))
|
|
|
|
emit(A64_MRS_TPIDR_EL2(tmp), ctx);
|
|
|
|
else
|
|
|
|
emit(A64_MRS_TPIDR_EL1(tmp), ctx);
|
|
|
|
emit(A64_ADD(1, dst, dst, tmp), ctx);
|
|
|
|
break;
|
2024-03-25 15:07:16 +00:00
|
|
|
}
|
2023-08-15 11:41:54 -04:00
|
|
|
switch (insn->off) {
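/* insn->off selects the ISA v4 sign-extending move: 0 is a plain move,
 * 8/16/32 sign-extend from that many low bits of src.
 */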
|
|
|
|
case 0:
|
|
|
|
emit(A64_MOV(is64, dst, src), ctx);
|
|
|
|
break;
|
|
|
|
case 8:
|
|
|
|
emit(A64_SXTB(is64, dst, src), ctx);
|
|
|
|
break;
|
|
|
|
case 16:
|
|
|
|
emit(A64_SXTH(is64, dst, src), ctx);
|
|
|
|
break;
|
|
|
|
case 32:
|
|
|
|
emit(A64_SXTW(is64, dst, src), ctx);
|
|
|
|
break;
|
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
break;
|
|
|
|
/* dst = dst OP src */
|
|
|
|
case BPF_ALU | BPF_ADD | BPF_X:
|
|
|
|
case BPF_ALU64 | BPF_ADD | BPF_X:
|
|
|
|
emit(A64_ADD(is64, dst, dst, src), ctx);
|
|
|
|
break;
|
|
|
|
case BPF_ALU | BPF_SUB | BPF_X:
|
|
|
|
case BPF_ALU64 | BPF_SUB | BPF_X:
|
|
|
|
emit(A64_SUB(is64, dst, dst, src), ctx);
|
|
|
|
break;
|
|
|
|
case BPF_ALU | BPF_AND | BPF_X:
|
|
|
|
case BPF_ALU64 | BPF_AND | BPF_X:
|
|
|
|
emit(A64_AND(is64, dst, dst, src), ctx);
|
|
|
|
break;
|
|
|
|
case BPF_ALU | BPF_OR | BPF_X:
|
|
|
|
case BPF_ALU64 | BPF_OR | BPF_X:
|
|
|
|
emit(A64_ORR(is64, dst, dst, src), ctx);
|
|
|
|
break;
|
|
|
|
case BPF_ALU | BPF_XOR | BPF_X:
|
|
|
|
case BPF_ALU64 | BPF_XOR | BPF_X:
|
|
|
|
emit(A64_EOR(is64, dst, dst, src), ctx);
|
|
|
|
break;
|
|
|
|
case BPF_ALU | BPF_MUL | BPF_X:
|
|
|
|
case BPF_ALU64 | BPF_MUL | BPF_X:
|
|
|
|
emit(A64_MUL(is64, dst, dst, src), ctx);
|
|
|
|
break;
|
|
|
|
case BPF_ALU | BPF_DIV | BPF_X:
|
|
|
|
case BPF_ALU64 | BPF_DIV | BPF_X:
|
2023-08-15 11:41:57 -04:00
|
|
|
if (!off)
|
|
|
|
emit(A64_UDIV(is64, dst, dst, src), ctx);
|
|
|
|
else
|
|
|
|
emit(A64_SDIV(is64, dst, dst, src), ctx);
|
2021-05-18 16:56:10 +08:00
|
|
|
break;
|
2014-08-26 21:15:30 -07:00
|
|
|
case BPF_ALU | BPF_MOD | BPF_X:
|
|
|
|
case BPF_ALU64 | BPF_MOD | BPF_X:
|
2023-08-15 11:41:57 -04:00
|
|
|
if (!off)
|
|
|
|
emit(A64_UDIV(is64, tmp, dst, src), ctx);
|
|
|
|
else
|
|
|
|
emit(A64_SDIV(is64, tmp, dst, src), ctx);
|
2021-05-18 16:56:10 +08:00
|
|
|
emit(A64_MSUB(is64, dst, dst, tmp, src), ctx);
|
2014-08-26 21:15:30 -07:00
|
|
|
break;
|
2014-09-16 19:37:35 +01:00
|
|
|
case BPF_ALU | BPF_LSH | BPF_X:
|
|
|
|
case BPF_ALU64 | BPF_LSH | BPF_X:
|
|
|
|
emit(A64_LSLV(is64, dst, dst, src), ctx);
|
|
|
|
break;
|
|
|
|
case BPF_ALU | BPF_RSH | BPF_X:
|
|
|
|
case BPF_ALU64 | BPF_RSH | BPF_X:
|
|
|
|
emit(A64_LSRV(is64, dst, dst, src), ctx);
|
|
|
|
break;
|
|
|
|
case BPF_ALU | BPF_ARSH | BPF_X:
|
|
|
|
case BPF_ALU64 | BPF_ARSH | BPF_X:
|
|
|
|
emit(A64_ASRV(is64, dst, dst, src), ctx);
|
|
|
|
break;
|
2014-08-26 21:15:30 -07:00
|
|
|
/* dst = -dst */
|
|
|
|
case BPF_ALU | BPF_NEG:
|
|
|
|
case BPF_ALU64 | BPF_NEG:
|
|
|
|
emit(A64_NEG(is64, dst, dst), ctx);
|
|
|
|
break;
|
|
|
|
/* dst = BSWAP##imm(dst) */
|
|
|
|
case BPF_ALU | BPF_END | BPF_FROM_LE:
|
|
|
|
case BPF_ALU | BPF_END | BPF_FROM_BE:
|
2023-08-15 11:41:55 -04:00
|
|
|
case BPF_ALU64 | BPF_END | BPF_FROM_LE:
|
2014-08-26 21:15:30 -07:00
|
|
|
#ifdef CONFIG_CPU_BIG_ENDIAN
|
2023-08-15 11:41:55 -04:00
|
|
|
if (BPF_CLASS(code) == BPF_ALU && BPF_SRC(code) == BPF_FROM_BE)
|
2015-06-25 18:39:15 -07:00
|
|
|
goto emit_bswap_uxt;
|
2014-08-26 21:15:30 -07:00
|
|
|
#else /* !CONFIG_CPU_BIG_ENDIAN */
|
2023-08-15 11:41:55 -04:00
|
|
|
if (BPF_CLASS(code) == BPF_ALU && BPF_SRC(code) == BPF_FROM_LE)
|
2015-06-25 18:39:15 -07:00
|
|
|
goto emit_bswap_uxt;
|
2014-08-26 21:15:30 -07:00
|
|
|
#endif
|
|
|
|
switch (imm) {
|
|
|
|
case 16:
|
|
|
|
emit(A64_REV16(is64, dst, dst), ctx);
|
2015-06-25 18:39:15 -07:00
|
|
|
/* zero-extend 16 bits into 64 bits */
|
|
|
|
emit(A64_UXTH(is64, dst, dst), ctx);
|
2014-08-26 21:15:30 -07:00
|
|
|
break;
|
|
|
|
case 32:
|
2024-03-21 09:18:09 +01:00
|
|
|
emit(A64_REV32(0, dst, dst), ctx);
|
2015-06-25 18:39:15 -07:00
|
|
|
/* upper 32 bits already cleared */
|
2014-08-26 21:15:30 -07:00
|
|
|
break;
|
|
|
|
case 64:
|
|
|
|
emit(A64_REV64(dst, dst), ctx);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
break;
|
2015-06-25 18:39:15 -07:00
|
|
|
emit_bswap_uxt:
|
|
|
|
switch (imm) {
|
|
|
|
case 16:
|
|
|
|
/* zero-extend 16 bits into 64 bits */
|
|
|
|
emit(A64_UXTH(is64, dst, dst), ctx);
|
|
|
|
break;
|
|
|
|
case 32:
|
|
|
|
/* zero-extend 32 bits into 64 bits */
|
|
|
|
emit(A64_UXTW(is64, dst, dst), ctx);
|
|
|
|
break;
|
|
|
|
case 64:
|
|
|
|
/* nop */
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
break;
|
2014-08-26 21:15:30 -07:00
|
|
|
/* dst = imm */
|
|
|
|
case BPF_ALU | BPF_MOV | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_MOV | BPF_K:
|
|
|
|
emit_a64_mov_i(is64, dst, imm, ctx);
|
|
|
|
break;
|
|
|
|
/* dst = dst OP imm */
|
|
|
|
case BPF_ALU | BPF_ADD | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_ADD | BPF_K:
|
bpf, arm64: Optimize ADD,SUB,JMP BPF_K using arm64 add/sub immediates
The current code for BPF_{ADD,SUB} BPF_K loads the BPF immediate to a
temporary register before performing the addition/subtraction. Similarly,
BPF_JMP BPF_K cases load the immediate to a temporary register before
comparison.
This patch introduces optimizations that use arm64 immediate add, sub,
cmn, or cmp instructions when the BPF immediate fits. If the immediate
does not fit, it falls back to using a temporary register.
Example of generated code for BPF_ALU64_IMM(BPF_ADD, R0, 2):
without optimization:
24: mov x10, #0x2
28: add x7, x7, x10
with optimization:
24: add x7, x7, #0x2
The code could use A64_{ADD,SUB}_I directly and check if it returns
AARCH64_BREAK_FAULT, similar to how logical immediates are handled.
However, aarch64_insn_gen_add_sub_imm from insn.c prints error messages
when the immediate does not fit, and it's simpler to check if the
immediate fits ahead of time.
Co-developed-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20200508181547.24783-4-luke.r.nels@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
2020-05-08 11:15:46 -07:00
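/* is_addsub_imm() accepts what an arm64 ADD/SUB immediate can encode:
 * a 12-bit value, optionally shifted left by 12, i.e. roughly
 * !(imm & ~0xfff) || !(imm & ~0xfff000).
 */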
|
|
|
if (is_addsub_imm(imm)) {
|
|
|
|
emit(A64_ADD_I(is64, dst, dst, imm), ctx);
|
|
|
|
} else if (is_addsub_imm(-imm)) {
|
|
|
|
emit(A64_SUB_I(is64, dst, dst, -imm), ctx);
|
|
|
|
} else {
|
|
|
|
emit_a64_mov_i(is64, tmp, imm, ctx);
|
|
|
|
emit(A64_ADD(is64, dst, dst, tmp), ctx);
|
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
break;
|
|
|
|
case BPF_ALU | BPF_SUB | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_SUB | BPF_K:
|
bpf, arm64: Optimize ADD,SUB,JMP BPF_K using arm64 add/sub immediates
2020-05-08 11:15:46 -07:00
|
|
|
if (is_addsub_imm(imm)) {
|
|
|
|
emit(A64_SUB_I(is64, dst, dst, imm), ctx);
|
|
|
|
} else if (is_addsub_imm(-imm)) {
|
|
|
|
emit(A64_ADD_I(is64, dst, dst, -imm), ctx);
|
|
|
|
} else {
|
|
|
|
emit_a64_mov_i(is64, tmp, imm, ctx);
|
|
|
|
emit(A64_SUB(is64, dst, dst, tmp), ctx);
|
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
break;
|
|
|
|
case BPF_ALU | BPF_AND | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_AND | BPF_K:
|
bpf, arm64: Optimize AND,OR,XOR,JSET BPF_K using arm64 logical immediates
2020-05-08 11:15:45 -07:00
|
|
|
a64_insn = A64_AND_I(is64, dst, dst, imm);
|
|
|
|
if (a64_insn != AARCH64_BREAK_FAULT) {
|
|
|
|
emit(a64_insn, ctx);
|
|
|
|
} else {
|
|
|
|
emit_a64_mov_i(is64, tmp, imm, ctx);
|
|
|
|
emit(A64_AND(is64, dst, dst, tmp), ctx);
|
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
break;
|
|
|
|
case BPF_ALU | BPF_OR | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_OR | BPF_K:
|
bpf, arm64: Optimize AND,OR,XOR,JSET BPF_K using arm64 logical immediates
2020-05-08 11:15:45 -07:00
|
|
|
a64_insn = A64_ORR_I(is64, dst, dst, imm);
|
|
|
|
if (a64_insn != AARCH64_BREAK_FAULT) {
|
|
|
|
emit(a64_insn, ctx);
|
|
|
|
} else {
|
|
|
|
emit_a64_mov_i(is64, tmp, imm, ctx);
|
|
|
|
emit(A64_ORR(is64, dst, dst, tmp), ctx);
|
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
break;
|
|
|
|
case BPF_ALU | BPF_XOR | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_XOR | BPF_K:
|
bpf, arm64: Optimize AND,OR,XOR,JSET BPF_K using arm64 logical immediates
2020-05-08 11:15:45 -07:00
|
|
|
a64_insn = A64_EOR_I(is64, dst, dst, imm);
|
|
|
|
if (a64_insn != AARCH64_BREAK_FAULT) {
|
|
|
|
emit(a64_insn, ctx);
|
|
|
|
} else {
|
|
|
|
emit_a64_mov_i(is64, tmp, imm, ctx);
|
|
|
|
emit(A64_EOR(is64, dst, dst, tmp), ctx);
|
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
break;
|
|
|
|
case BPF_ALU | BPF_MUL | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_MUL | BPF_K:
|
|
|
|
emit_a64_mov_i(is64, tmp, imm, ctx);
|
|
|
|
emit(A64_MUL(is64, dst, dst, tmp), ctx);
|
|
|
|
break;
|
|
|
|
case BPF_ALU | BPF_DIV | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_DIV | BPF_K:
|
|
|
|
emit_a64_mov_i(is64, tmp, imm, ctx);
|
2023-08-15 11:41:57 -04:00
|
|
|
if (!off)
|
|
|
|
emit(A64_UDIV(is64, dst, dst, tmp), ctx);
|
|
|
|
else
|
|
|
|
emit(A64_SDIV(is64, dst, dst, tmp), ctx);
|
2014-08-26 21:15:30 -07:00
|
|
|
break;
|
|
|
|
case BPF_ALU | BPF_MOD | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_MOD | BPF_K:
|
|
|
|
emit_a64_mov_i(is64, tmp2, imm, ctx);
|
2023-08-15 11:41:57 -04:00
|
|
|
if (!off)
|
|
|
|
emit(A64_UDIV(is64, tmp, dst, tmp2), ctx);
|
|
|
|
else
|
|
|
|
emit(A64_SDIV(is64, tmp, dst, tmp2), ctx);
|
2019-09-02 11:44:48 +05:30
|
|
|
emit(A64_MSUB(is64, dst, dst, tmp, tmp2), ctx);
|
2014-08-26 21:15:30 -07:00
|
|
|
break;
|
|
|
|
case BPF_ALU | BPF_LSH | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_LSH | BPF_K:
|
|
|
|
emit(A64_LSL(is64, dst, dst, imm), ctx);
|
|
|
|
break;
|
|
|
|
case BPF_ALU | BPF_RSH | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_RSH | BPF_K:
|
|
|
|
emit(A64_LSR(is64, dst, dst, imm), ctx);
|
|
|
|
break;
|
|
|
|
case BPF_ALU | BPF_ARSH | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_ARSH | BPF_K:
|
|
|
|
emit(A64_ASR(is64, dst, dst, imm), ctx);
|
|
|
|
break;
|
|
|
|
|
|
|
|
/* JUMP off */
|
|
|
|
case BPF_JMP | BPF_JA:
|
2023-08-15 11:41:56 -04:00
|
|
|
case BPF_JMP32 | BPF_JA:
|
|
|
|
if (BPF_CLASS(code) == BPF_JMP)
|
|
|
|
jmp_offset = bpf2a64_offset(i, off, ctx);
|
|
|
|
else
|
|
|
|
jmp_offset = bpf2a64_offset(i, imm, ctx);
|
2014-08-26 21:15:30 -07:00
|
|
|
check_imm26(jmp_offset);
|
|
|
|
emit(A64_B(jmp_offset), ctx);
|
|
|
|
break;
|
|
|
|
/* IF (dst COND src) JUMP off */
|
|
|
|
case BPF_JMP | BPF_JEQ | BPF_X:
|
|
|
|
case BPF_JMP | BPF_JGT | BPF_X:
|
2017-08-10 01:39:57 +02:00
|
|
|
case BPF_JMP | BPF_JLT | BPF_X:
|
2014-08-26 21:15:30 -07:00
|
|
|
case BPF_JMP | BPF_JGE | BPF_X:
|
2017-08-10 01:39:57 +02:00
|
|
|
case BPF_JMP | BPF_JLE | BPF_X:
|
2014-08-26 21:15:30 -07:00
|
|
|
case BPF_JMP | BPF_JNE | BPF_X:
|
|
|
|
case BPF_JMP | BPF_JSGT | BPF_X:
|
2017-08-10 01:39:57 +02:00
|
|
|
case BPF_JMP | BPF_JSLT | BPF_X:
|
2014-08-26 21:15:30 -07:00
|
|
|
case BPF_JMP | BPF_JSGE | BPF_X:
|
2017-08-10 01:39:57 +02:00
|
|
|
case BPF_JMP | BPF_JSLE | BPF_X:
|
2019-01-26 12:26:08 -05:00
|
|
|
case BPF_JMP32 | BPF_JEQ | BPF_X:
|
|
|
|
case BPF_JMP32 | BPF_JGT | BPF_X:
|
|
|
|
case BPF_JMP32 | BPF_JLT | BPF_X:
|
|
|
|
case BPF_JMP32 | BPF_JGE | BPF_X:
|
|
|
|
case BPF_JMP32 | BPF_JLE | BPF_X:
|
|
|
|
case BPF_JMP32 | BPF_JNE | BPF_X:
|
|
|
|
case BPF_JMP32 | BPF_JSGT | BPF_X:
|
|
|
|
case BPF_JMP32 | BPF_JSLT | BPF_X:
|
|
|
|
case BPF_JMP32 | BPF_JSGE | BPF_X:
|
|
|
|
case BPF_JMP32 | BPF_JSLE | BPF_X:
|
|
|
|
emit(A64_CMP(is64, dst, src), ctx);
|
2014-08-26 21:15:30 -07:00
|
|
|
emit_cond_jmp:
|
arm64: bpf: Fix branch offset in JIT
Running the eBPF test_verifier leads to random errors looking like this:
[ 6525.735488] Unexpected kernel BRK exception at EL1
[ 6525.735502] Internal error: ptrace BRK handler: f2000100 [#1] SMP
[ 6525.741609] Modules linked in: nls_utf8 cifs libdes libarc4 dns_resolver fscache binfmt_misc nls_ascii nls_cp437 vfat fat aes_ce_blk crypto_simd cryptd aes_ce_cipher ghash_ce gf128mul efi_pstore sha2_ce sha256_arm64 sha1_ce evdev efivars efivarfs ip_tables x_tables autofs4 btrfs blake2b_generic xor xor_neon zstd_compress raid6_pq libcrc32c crc32c_generic ahci xhci_pci libahci xhci_hcd igb libata i2c_algo_bit nvme realtek usbcore nvme_core scsi_mod t10_pi netsec mdio_devres of_mdio gpio_keys fixed_phy libphy gpio_mb86s7x
[ 6525.787760] CPU: 3 PID: 7881 Comm: test_verifier Tainted: G W 5.9.0-rc1+ #47
[ 6525.796111] Hardware name: Socionext SynQuacer E-series DeveloperBox, BIOS build #1 Jun 6 2020
[ 6525.804812] pstate: 20000005 (nzCv daif -PAN -UAO BTYPE=--)
[ 6525.810390] pc : bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
[ 6525.815613] lr : bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
[ 6525.820832] sp : ffff8000130cbb80
[ 6525.824141] x29: ffff8000130cbbb0 x28: 0000000000000000
[ 6525.829451] x27: 000005ef6fcbf39b x26: 0000000000000000
[ 6525.834759] x25: ffff8000130cbb80 x24: ffff800011dc7038
[ 6525.840067] x23: ffff8000130cbd00 x22: ffff0008f624d080
[ 6525.845375] x21: 0000000000000001 x20: ffff800011dc7000
[ 6525.850682] x19: 0000000000000000 x18: 0000000000000000
[ 6525.855990] x17: 0000000000000000 x16: 0000000000000000
[ 6525.861298] x15: 0000000000000000 x14: 0000000000000000
[ 6525.866606] x13: 0000000000000000 x12: 0000000000000000
[ 6525.871913] x11: 0000000000000001 x10: ffff8000000a660c
[ 6525.877220] x9 : ffff800010951810 x8 : ffff8000130cbc38
[ 6525.882528] x7 : 0000000000000000 x6 : 0000009864cfa881
[ 6525.887836] x5 : 00ffffffffffffff x4 : 002880ba1a0b3e9f
[ 6525.893144] x3 : 0000000000000018 x2 : ffff8000000a4374
[ 6525.898452] x1 : 000000000000000a x0 : 0000000000000009
[ 6525.903760] Call trace:
[ 6525.906202] bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
[ 6525.911076] bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
[ 6525.915957] bpf_dispatcher_xdp_func+0x14/0x20
[ 6525.920398] bpf_test_run+0x70/0x1b0
[ 6525.923969] bpf_prog_test_run_xdp+0xec/0x190
[ 6525.928326] __do_sys_bpf+0xc88/0x1b28
[ 6525.932072] __arm64_sys_bpf+0x24/0x30
[ 6525.935820] el0_svc_common.constprop.0+0x70/0x168
[ 6525.940607] do_el0_svc+0x28/0x88
[ 6525.943920] el0_sync_handler+0x88/0x190
[ 6525.947838] el0_sync+0x140/0x180
[ 6525.951154] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
[ 6525.957249] ---[ end trace cecc3f93b14927e2 ]---
The reason is the offset[] creation and later usage, while building
the eBPF body. The code currently omits the first instruction, since
build_insn() will increase our ctx->idx before saving it.
That was fine up until bounded eBPF loops were introduced. After that
introduction, offset[0] must be the offset of the end of the prologue, which
is the start of the 1st insn, while offset[n] holds the
offset of the end of the n-th insn.
When "taken loop with back jump to 1st insn" test runs, it will
eventually call bpf2a64_offset(-1, 2, ctx). Since negative indexing is
permitted, the current outcome depends on the value stored in
ctx->offset[-1], which has nothing to do with our array.
If the value happens to be 0 the tests will work. If not this error
triggers.
commit 7c2e988f400e ("bpf: fix x64 JIT code generation for jmp to 1st insn")
fixed an identical bug on x86 when eBPF bounded loops were introduced.
So let's fix it by creating the ctx->offset[] differently. Track the
beginning of each instruction and account for the extra instruction while
calculating the arm instruction offsets.
Fixes: 2589726d12a1 ("bpf: introduce bounded loops")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Jiri Olsa <jolsa@kernel.org>
Co-developed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Co-developed-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20200917084925.177348-1-ilias.apalodimas@linaro.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2020-09-17 11:49:25 +03:00
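/* bpf2a64_offset() converts a BPF branch target (relative to the next BPF
 * insn) into an arm64 branch offset (relative to the branch itself); roughly
 * ctx->offset[i + 1 + off] - (ctx->offset[i + 1] - 1), in arm64 instructions.
 */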
|
|
|
jmp_offset = bpf2a64_offset(i, off, ctx);
|
2014-08-26 21:15:30 -07:00
|
|
|
check_imm19(jmp_offset);
|
|
|
|
switch (BPF_OP(code)) {
|
|
|
|
case BPF_JEQ:
|
|
|
|
jmp_cond = A64_COND_EQ;
|
|
|
|
break;
|
|
|
|
case BPF_JGT:
|
|
|
|
jmp_cond = A64_COND_HI;
|
|
|
|
break;
|
2017-08-10 01:39:57 +02:00
|
|
|
case BPF_JLT:
|
|
|
|
jmp_cond = A64_COND_CC;
|
|
|
|
break;
|
2014-08-26 21:15:30 -07:00
|
|
|
case BPF_JGE:
|
|
|
|
jmp_cond = A64_COND_CS;
|
|
|
|
break;
|
2017-08-10 01:39:57 +02:00
|
|
|
case BPF_JLE:
|
|
|
|
jmp_cond = A64_COND_LS;
|
|
|
|
break;
|
2016-05-12 23:37:58 -07:00
|
|
|
case BPF_JSET:
|
2014-08-26 21:15:30 -07:00
|
|
|
case BPF_JNE:
|
|
|
|
jmp_cond = A64_COND_NE;
|
|
|
|
break;
|
|
|
|
case BPF_JSGT:
|
|
|
|
jmp_cond = A64_COND_GT;
|
|
|
|
break;
|
2017-08-10 01:39:57 +02:00
|
|
|
case BPF_JSLT:
|
|
|
|
jmp_cond = A64_COND_LT;
|
|
|
|
break;
|
2014-08-26 21:15:30 -07:00
|
|
|
case BPF_JSGE:
|
|
|
|
jmp_cond = A64_COND_GE;
|
|
|
|
break;
|
2017-08-10 01:39:57 +02:00
|
|
|
case BPF_JSLE:
|
|
|
|
jmp_cond = A64_COND_LE;
|
|
|
|
break;
|
2014-08-26 21:15:30 -07:00
|
|
|
default:
|
|
|
|
return -EFAULT;
|
|
|
|
}
|
|
|
|
emit(A64_B_(jmp_cond, jmp_offset), ctx);
|
|
|
|
break;
|
|
|
|
case BPF_JMP | BPF_JSET | BPF_X:
|
2019-01-26 12:26:08 -05:00
|
|
|
case BPF_JMP32 | BPF_JSET | BPF_X:
|
|
|
|
emit(A64_TST(is64, dst, src), ctx);
|
2014-08-26 21:15:30 -07:00
|
|
|
goto emit_cond_jmp;
|
|
|
|
/* IF (dst COND imm) JUMP off */
|
|
|
|
case BPF_JMP | BPF_JEQ | BPF_K:
|
|
|
|
case BPF_JMP | BPF_JGT | BPF_K:
|
2017-08-10 01:39:57 +02:00
|
|
|
case BPF_JMP | BPF_JLT | BPF_K:
|
2014-08-26 21:15:30 -07:00
|
|
|
case BPF_JMP | BPF_JGE | BPF_K:
|
2017-08-10 01:39:57 +02:00
|
|
|
case BPF_JMP | BPF_JLE | BPF_K:
|
2014-08-26 21:15:30 -07:00
|
|
|
case BPF_JMP | BPF_JNE | BPF_K:
|
|
|
|
case BPF_JMP | BPF_JSGT | BPF_K:
|
2017-08-10 01:39:57 +02:00
|
|
|
case BPF_JMP | BPF_JSLT | BPF_K:
|
2014-08-26 21:15:30 -07:00
|
|
|
case BPF_JMP | BPF_JSGE | BPF_K:
|
2017-08-10 01:39:57 +02:00
|
|
|
case BPF_JMP | BPF_JSLE | BPF_K:
|
2019-01-26 12:26:08 -05:00
|
|
|
case BPF_JMP32 | BPF_JEQ | BPF_K:
|
|
|
|
case BPF_JMP32 | BPF_JGT | BPF_K:
|
|
|
|
case BPF_JMP32 | BPF_JLT | BPF_K:
|
|
|
|
case BPF_JMP32 | BPF_JGE | BPF_K:
|
|
|
|
case BPF_JMP32 | BPF_JLE | BPF_K:
|
|
|
|
case BPF_JMP32 | BPF_JNE | BPF_K:
|
|
|
|
case BPF_JMP32 | BPF_JSGT | BPF_K:
|
|
|
|
case BPF_JMP32 | BPF_JSLT | BPF_K:
|
|
|
|
case BPF_JMP32 | BPF_JSGE | BPF_K:
|
|
|
|
case BPF_JMP32 | BPF_JSLE | BPF_K:
|
bpf, arm64: Optimize ADD,SUB,JMP BPF_K using arm64 add/sub immediates
2020-05-08 11:15:46 -07:00
|
|
|
if (is_addsub_imm(imm)) {
|
|
|
|
emit(A64_CMP_I(is64, dst, imm), ctx);
|
|
|
|
} else if (is_addsub_imm(-imm)) {
|
|
|
|
emit(A64_CMN_I(is64, dst, -imm), ctx);
|
|
|
|
} else {
|
|
|
|
emit_a64_mov_i(is64, tmp, imm, ctx);
|
|
|
|
emit(A64_CMP(is64, dst, tmp), ctx);
|
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
goto emit_cond_jmp;
|
|
|
|
case BPF_JMP | BPF_JSET | BPF_K:
|
2019-01-26 12:26:08 -05:00
|
|
|
case BPF_JMP32 | BPF_JSET | BPF_K:
|
bpf, arm64: Optimize AND,OR,XOR,JSET BPF_K using arm64 logical immediates
2020-05-08 11:15:45 -07:00
|
|
|
a64_insn = A64_TST_I(is64, dst, imm);
|
|
|
|
if (a64_insn != AARCH64_BREAK_FAULT) {
|
|
|
|
emit(a64_insn, ctx);
|
|
|
|
} else {
|
|
|
|
emit_a64_mov_i(is64, tmp, imm, ctx);
|
|
|
|
emit(A64_TST(is64, dst, tmp), ctx);
|
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
goto emit_cond_jmp;
|
|
|
|
/* function call */
|
|
|
|
case BPF_JMP | BPF_CALL:
|
|
|
|
{
|
|
|
|
const u8 r0 = bpf2a64[BPF_REG_0];
|
2018-11-26 14:05:39 +01:00
|
|
|
bool func_addr_fixed;
|
|
|
|
u64 func_addr;
|
bpf, arm64: inline bpf_get_smp_processor_id() helper
Inline calls to the bpf_get_smp_processor_id() helper in the JIT by emitting
a read from struct thread_info. The SP_EL0 system register holds the
pointer to the task_struct, and thread_info is the first member of that
struct, so the cpu number can be read directly from thread_info.
Here is how the ARM64 JITed assembly changes after this commit:
ARM64 JIT
===========
BEFORE AFTER
-------- -------
int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id();
mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0
movk x10, #0x802b, lsl #16 ldr w7, [x10, #24]
movk x10, #0x8000, lsl #32
blr x10
add x7, x0, #0x0
Performance improvement using benchmark[1]
./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc
+---------------+-------------------+-------------------+--------------+
| Name | Before | After | % change |
|---------------+-------------------+-------------------+--------------|
| glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% |
| arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% |
| hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% |
+---------------+-------------------+-------------------+--------------+
[1] https://github.com/anakryiko/linux/commit/8dec900975ef
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240502151854.9810-5-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-02 15:18:54 +00:00
|
|
|
u32 cpu_offset;
|
|
|
|
|
|
|
|
/* Implement helper call to bpf_get_smp_processor_id() inline */
|
|
|
|
if (insn->src_reg == 0 && insn->imm == BPF_FUNC_get_smp_processor_id) {
|
|
|
|
cpu_offset = offsetof(struct thread_info, cpu);
|
|
|
|
|
|
|
|
emit(A64_MRS_SP_EL0(tmp), ctx);
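/* SP_EL0 holds the current task_struct, and thread_info is its first
 * member, so the CPU number is one load away at
 * offsetof(struct thread_info, cpu).
 */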
|
|
|
|
if (is_lsi_offset(cpu_offset, 2)) {
|
|
|
|
emit(A64_LDR32I(r0, tmp, cpu_offset), ctx);
|
|
|
|
} else {
|
|
|
|
emit_a64_mov_i(1, tmp2, cpu_offset, ctx);
|
|
|
|
emit(A64_LDR32(r0, tmp, tmp2), ctx);
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
|
bpf, arm64: Inline bpf_get_current_task/_btf() helpers
On ARM64, the pointer to task_struct is always available in the sp_el0
register and therefore the calls to bpf_get_current_task() and
bpf_get_current_task_btf() can be inlined into a single MRS instruction.
Here is the difference before and after this change:
Before:
; struct task_struct *task = bpf_get_current_task_btf();
54: mov x10, #0xffffffffffff7978 // #-34440
58: movk x10, #0x802b, lsl #16
5c: movk x10, #0x8000, lsl #32
60: blr x10 --------------> 0xffff8000802b7978 <+0>: mrs x0, sp_el0
64: add x7, x0, #0x0 <-------------- 0xffff8000802b797c <+4>: ret
After:
; struct task_struct *task = bpf_get_current_task_btf();
54: mrs x7, sp_el0
This shows around a 1% performance improvement in an artificial microbenchmark.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Xu Kuohai <xukuohai@huawei.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240619131334.4297-1-puranjay@kernel.org
2024-06-19 13:13:34 +00:00
|
|
|
/* Implement helper call to bpf_get_current_task/_btf() inline */
|
|
|
|
if (insn->src_reg == 0 && (insn->imm == BPF_FUNC_get_current_task ||
|
|
|
|
insn->imm == BPF_FUNC_get_current_task_btf)) {
|
|
|
|
emit(A64_MRS_SP_EL0(r0), ctx);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2018-11-26 14:05:39 +01:00
|
|
|
ret = bpf_jit_get_func_addr(ctx->prog, insn, extra_pass,
|
|
|
|
&func_addr, &func_addr_fixed);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
2022-07-11 11:08:23 -04:00
|
|
|
emit_call(func_addr, ctx);
|
2014-08-26 21:15:30 -07:00
|
|
|
emit(A64_MOV(1, r0, A64_R(0)), ctx);
|
|
|
|
break;
|
|
|
|
}
|
arm64: bpf: implement bpf_tail_call() helper
2016-06-08 21:18:48 -07:00
|
|
|
/* tail call */
|
2017-05-30 13:31:27 -07:00
|
|
|
case BPF_JMP | BPF_TAIL_CALL:
|
arm64: bpf: implement bpf_tail_call() helper
2016-06-08 21:18:48 -07:00
|
|
|
if (emit_bpf_tail_call(ctx))
|
|
|
|
return -EFAULT;
|
|
|
|
break;
|
2014-08-26 21:15:30 -07:00
|
|
|
/* function return */
|
|
|
|
case BPF_JMP | BPF_EXIT:
|
2014-12-03 08:38:01 +00:00
|
|
|
/* Optimization: when the last instruction is EXIT,
|
|
|
|
simply fall through to the epilogue. */
|
2014-08-26 21:15:30 -07:00
|
|
|
if (i == ctx->prog->len - 1)
|
|
|
|
break;
|
|
|
|
jmp_offset = epilogue_offset(ctx);
|
|
|
|
check_imm26(jmp_offset);
|
|
|
|
emit(A64_B(jmp_offset), ctx);
|
|
|
|
break;
|
|
|
|
|
2014-09-16 21:29:23 +01:00
|
|
|
/* dst = imm64 */
|
|
|
|
case BPF_LD | BPF_IMM | BPF_DW:
|
|
|
|
{
|
|
|
|
const struct bpf_insn insn1 = insn[1];
|
|
|
|
u64 imm64;
|
|
|
|
|
2015-05-08 06:39:51 +01:00
|
|
|
imm64 = (u64)insn1.imm << 32 | (u32)imm;
|
bpf, arm64: Use emit_addr_mov_i64() for BPF_PSEUDO_FUNC
The following error is reported when running "./test_progs -t for_each"
under arm64:
bpf_jit: multi-func JIT bug 58 != 56
[...]
JIT doesn't support bpf-to-bpf calls
The root cause is that the size of the BPF_PSEUDO_FUNC instruction increases
from 2 to 3 instructions after the address of the called bpf function is
settled, and there are two bpf-to-bpf calls in test_pkt_access. The generated
instructions are shown below:
0x48: 21 00 C0 D2 movz x1, #0x1, lsl #32
0x4c: 21 00 80 F2 movk x1, #0x1
0x48: E1 3F C0 92 movn x1, #0x1ff, lsl #32
0x4c: 41 FE A2 F2 movk x1, #0x17f2, lsl #16
0x50: 81 70 9F F2 movk x1, #0xfb84
Fix it by using emit_addr_mov_i64() for BPF_PSEUDO_FUNC, so that
the size of the jited image does not change.
Fixes: 69c087ba6225 ("bpf: Add bpf_for_each_map_elem() helper")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211231151018.3781550-1-houtao1@huawei.com
2021-12-31 23:10:18 +08:00
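/* emit_addr_mov_i64() always emits a fixed-length movn/movk sequence,
 * unlike emit_a64_mov_i64() which emits as few instructions as possible,
 * so the jited size of a BPF_PSEUDO_FUNC load is stable across passes.
 */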
|
|
|
if (bpf_pseudo_func(insn))
|
|
|
|
emit_addr_mov_i64(dst, imm64, ctx);
|
|
|
|
else
|
|
|
|
emit_a64_mov_i64(dst, imm64, ctx);
|
2014-09-16 21:29:23 +01:00
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2023-08-15 11:41:53 -04:00
|
|
|
/* LDX: dst = (u64)*(unsigned size *)(src + off) */
|
2014-08-26 21:15:30 -07:00
|
|
|
case BPF_LDX | BPF_MEM | BPF_W:
|
|
|
|
case BPF_LDX | BPF_MEM | BPF_H:
|
|
|
|
case BPF_LDX | BPF_MEM | BPF_B:
|
|
|
|
case BPF_LDX | BPF_MEM | BPF_DW:
|
2020-07-28 17:21:26 +02:00
|
|
|
case BPF_LDX | BPF_PROBE_MEM | BPF_DW:
|
|
|
|
case BPF_LDX | BPF_PROBE_MEM | BPF_W:
|
|
|
|
case BPF_LDX | BPF_PROBE_MEM | BPF_H:
|
|
|
|
case BPF_LDX | BPF_PROBE_MEM | BPF_B:
|
2023-08-15 11:41:53 -04:00
|
|
|
/* LDXS: dst_reg = (s64)*(signed size *)(src_reg + off) */
|
|
|
|
case BPF_LDX | BPF_MEMSX | BPF_B:
|
|
|
|
case BPF_LDX | BPF_MEMSX | BPF_H:
|
|
|
|
case BPF_LDX | BPF_MEMSX | BPF_W:
|
|
|
|
case BPF_LDX | BPF_PROBE_MEMSX | BPF_B:
|
|
|
|
case BPF_LDX | BPF_PROBE_MEMSX | BPF_H:
|
|
|
|
case BPF_LDX | BPF_PROBE_MEMSX | BPF_W:
|
2024-03-25 15:07:15 +00:00
|
|
|
case BPF_LDX | BPF_PROBE_MEM32 | BPF_B:
|
|
|
|
case BPF_LDX | BPF_PROBE_MEM32 | BPF_H:
|
|
|
|
case BPF_LDX | BPF_PROBE_MEM32 | BPF_W:
|
|
|
|
case BPF_LDX | BPF_PROBE_MEM32 | BPF_DW:
|
|
|
|
if (BPF_MODE(insn->code) == BPF_PROBE_MEM32) {
|
|
|
|
emit(A64_ADD(1, tmp2, src, arena_vm_base), ctx);
|
|
|
|
src = tmp2;
|
|
|
|
}
|
2024-08-26 15:16:23 +08:00
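/* BPF offsets from FP are negative, but the arm64 LDR/STR immediate forms
 * want non-negative offsets. The prologue leaves BPF FP == SP + stack_size,
 * so FP-relative accesses are rebased below as SP + (off + ctx->stack_size).
 */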
|
|
|
if (src == fp) {
|
|
|
|
src_adj = A64_SP;
|
|
|
|
off_adj = off + ctx->stack_size;
|
bpf, arm64: Adjust the offset of str/ldr(immediate) to positive number
2022-03-21 11:28:50 -04:00
|
|
|
} else {
|
|
|
|
src_adj = src;
|
|
|
|
off_adj = off;
|
|
|
|
}
|
2023-08-15 11:41:53 -04:00
|
|
|
sign_extend = (BPF_MODE(insn->code) == BPF_MEMSX ||
|
|
|
|
BPF_MODE(insn->code) == BPF_PROBE_MEMSX);
|
2014-08-26 21:15:30 -07:00
|
|
|
switch (BPF_SIZE(code)) {
|
|
|
|
case BPF_W:
|
2022-03-21 11:28:50 -04:00
|
|
|
if (is_lsi_offset(off_adj, 2)) {
|
2023-08-15 11:41:53 -04:00
|
|
|
if (sign_extend)
|
|
|
|
emit(A64_LDRSWI(dst, src_adj, off_adj), ctx);
|
|
|
|
else
|
|
|
|
emit(A64_LDR32I(dst, src_adj, off_adj), ctx);
|
bpf, arm64: Optimize BPF store/load using arm64 str/ldr(immediate offset)
Currently, a BPF store/load instruction is translated by the JIT into two
instructions: the first moves the immediate offset into a temporary
register, and the second uses this temporary register to do the real
store/load.
In fact, arm64 supports addressing with immediate offsets, so this patch
introduces an optimization that uses the arm64 str/ldr instruction with an
immediate offset when the offset fits.
Example of the generated instructions for r2 = *(u64 *)(r1 + 0):
without optimization:
mov x10, 0
ldr x1, [x0, x10]
with optimization:
ldr x1, [x0, 0]
If the offset is negative, not aligned correctly, or exceeds the maximum
value, fall back to using a temporary register.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220321152852.2334294-3-xukuohai@huawei.com
2022-03-21 11:28:49 -04:00
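A hedged, standalone sketch of the "offset fits" test this optimization
depends on, assuming the unsigned, scaled 12-bit immediate form of arm64
str/ldr (it mirrors the intent of is_lsi_offset() but is not the kernel
code verbatim):

#include <stdbool.h>

/* scale: 0 = byte, 1 = halfword, 2 = word, 3 = doubleword access */
static bool offset_fits_lsi(int offset, int scale)
{
	if (offset < 0)
		return false;		/* immediate form is unsigned */
	if (offset & ((1 << scale) - 1))
		return false;		/* must be aligned to the access size */
	if (offset > (0xfff << scale))
		return false;		/* 12-bit field, scaled by the size */
	return true;
}

With such a check, an 8-byte access at offset 16 (scale 3) can use the
immediate form, while offset -8 or offset 12 falls back to the
temporary-register path.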
|
|
|
} else {
|
|
|
|
emit_a64_mov_i(1, tmp, off, ctx);
|
2023-08-15 11:41:53 -04:00
|
|
|
if (sign_extend)
|
bpf, arm64: fix bug in BPF_LDX_MEMSX
A64_LDRSW() takes three registers, Xt, Xn and Xm, as arguments: it loads
and sign-extends the value at address Xn + Xm into register Xt.
Currently, the offset is used directly in place of the tmp register,
which already holds the offset loaded by the last emitted instruction.
This causes JIT failures. The easiest way to reproduce this is to run
the following code through the test_bpf module:
{
"BPF_LDX_MEMSX | BPF_W",
.u.insns_int = {
BPF_LD_IMM64(R1, 0x00000000deadbeefULL),
BPF_LD_IMM64(R2, 0xffffffffdeadbeefULL),
BPF_STX_MEM(BPF_DW, R10, R1, -7),
BPF_LDX_MEMSX(BPF_W, R0, R10, -7),
BPF_JMP_REG(BPF_JNE, R0, R2, 1),
BPF_ALU64_IMM(BPF_MOV, R0, 0),
BPF_EXIT_INSN(),
},
INTERNAL,
{ },
{ { 0, 0 } },
.stack_depth = 7,
},
The offset must be -7 to trigger this code path; there could be other
valid ways to trigger this from proper BPF programs as well.
This code is rejected by the JIT because -7 is passed to A64_LDRSW(),
which expects a valid register number (0 - 31).
root@pjy:~# modprobe test_bpf test_name="BPF_LDX_MEMSX | BPF_W"
[11300.490371] test_bpf: test_bpf: set 'test_bpf' as the default test_suite.
[11300.491750] test_bpf: #345 BPF_LDX_MEMSX | BPF_W
[11300.493179] aarch64_insn_encode_register: unknown register encoding -7
[11300.494133] aarch64_insn_encode_register: unknown register encoding -7
[11300.495292] FAIL to select_runtime err=-524
[11300.496804] test_bpf: Summary: 0 PASSED, 1 FAILED, [0/0 JIT'ed]
modprobe: ERROR: could not insert 'test_bpf': Invalid argument
Applying this patch fixes the issue.
root@pjy:~# modprobe test_bpf test_name="BPF_LDX_MEMSX | BPF_W"
[ 292.837436] test_bpf: test_bpf: set 'test_bpf' as the default test_suite.
[ 292.839416] test_bpf: #345 BPF_LDX_MEMSX | BPF_W jited:1 156 PASS
[ 292.844794] test_bpf: Summary: 1 PASSED, 0 FAILED, [1/1 JIT'ed]
Fixes: cc88f540da52 ("bpf, arm64: Support sign-extension load instructions")
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Message-ID: <20240312235917.103626-1-puranjay12@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-12 23:59:17 +00:00
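In code form, the essence of the fix is which operand is passed as the index
register; this is a context-elided sketch rather than a compilable unit:

emit_a64_mov_i(1, tmp, off, ctx);	/* tmp now holds the offset */
/* before (buggy): emit(A64_LDRSW(dst, src, off), ctx); - off is not a register */
emit(A64_LDRSW(dst, src, tmp), ctx);	/* fixed: index with the tmp register */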
|
|
|
emit(A64_LDRSW(dst, src, tmp), ctx);
|
2023-08-15 11:41:53 -04:00
|
|
|
else
|
|
|
|
emit(A64_LDR32(dst, src, tmp), ctx);
|
2022-03-21 11:28:49 -04:00
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
break;
|
|
|
|
case BPF_H:
|
2022-03-21 11:28:50 -04:00
|
|
|
if (is_lsi_offset(off_adj, 1)) {
|
2023-08-15 11:41:53 -04:00
|
|
|
if (sign_extend)
|
|
|
|
emit(A64_LDRSHI(dst, src_adj, off_adj), ctx);
|
|
|
|
else
|
|
|
|
emit(A64_LDRHI(dst, src_adj, off_adj), ctx);
|
2022-03-21 11:28:49 -04:00
|
|
|
} else {
|
|
|
|
emit_a64_mov_i(1, tmp, off, ctx);
|
2023-08-15 11:41:53 -04:00
|
|
|
if (sign_extend)
|
|
|
|
emit(A64_LDRSH(dst, src, tmp), ctx);
|
|
|
|
else
|
|
|
|
emit(A64_LDRH(dst, src, tmp), ctx);
|
2022-03-21 11:28:49 -04:00
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
break;
|
|
|
|
case BPF_B:
|
2022-03-21 11:28:50 -04:00
|
|
|
if (is_lsi_offset(off_adj, 0)) {
|
2023-08-15 11:41:53 -04:00
|
|
|
if (sign_extend)
|
|
|
|
emit(A64_LDRSBI(dst, src_adj, off_adj), ctx);
|
|
|
|
else
|
|
|
|
emit(A64_LDRBI(dst, src_adj, off_adj), ctx);
|
2022-03-21 11:28:49 -04:00
|
|
|
} else {
|
|
|
|
emit_a64_mov_i(1, tmp, off, ctx);
|
2023-08-15 11:41:53 -04:00
|
|
|
if (sign_extend)
|
|
|
|
emit(A64_LDRSB(dst, src, tmp), ctx);
|
|
|
|
else
|
|
|
|
emit(A64_LDRB(dst, src, tmp), ctx);
|
2022-03-21 11:28:49 -04:00
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
break;
|
|
|
|
case BPF_DW:
|
2022-03-21 11:28:50 -04:00
|
|
|
if (is_lsi_offset(off_adj, 3)) {
|
|
|
|
emit(A64_LDR64I(dst, src_adj, off_adj), ctx);
|
2022-03-21 11:28:49 -04:00
|
|
|
} else {
|
|
|
|
emit_a64_mov_i(1, tmp, off, ctx);
|
|
|
|
emit(A64_LDR64(dst, src, tmp), ctx);
|
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
break;
|
|
|
|
}
|
2020-07-28 17:21:26 +02:00
|
|
|
|
|
|
|
ret = add_exception_handler(insn, ctx, dst);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2014-08-26 21:15:30 -07:00
|
|
|
break;
|
|
|
|
|
2021-07-13 08:18:31 +00:00
|
|
|
/* speculation barrier */
|
|
|
|
case BPF_ST | BPF_NOSPEC:
|
|
|
|
/*
|
|
|
|
* Nothing required here.
|
|
|
|
*
|
|
|
|
* In case of arm64, we rely on the firmware mitigation of
|
|
|
|
* Speculative Store Bypass as controlled via the ssbd kernel
|
|
|
|
* parameter. Whenever the mitigation is enabled, it works
|
|
|
|
* for all of the kernel code with no need to provide any
|
|
|
|
* additional instructions.
|
|
|
|
*/
|
|
|
|
break;
|
|
|
|
|
2014-08-26 21:15:30 -07:00
|
|
|
/* ST: *(size *)(dst + off) = imm */
|
|
|
|
case BPF_ST | BPF_MEM | BPF_W:
|
|
|
|
case BPF_ST | BPF_MEM | BPF_H:
|
|
|
|
case BPF_ST | BPF_MEM | BPF_B:
|
|
|
|
case BPF_ST | BPF_MEM | BPF_DW:
|
2024-03-25 15:07:15 +00:00
|
|
|
case BPF_ST | BPF_PROBE_MEM32 | BPF_B:
|
|
|
|
case BPF_ST | BPF_PROBE_MEM32 | BPF_H:
|
|
|
|
case BPF_ST | BPF_PROBE_MEM32 | BPF_W:
|
|
|
|
case BPF_ST | BPF_PROBE_MEM32 | BPF_DW:
|
|
|
|
if (BPF_MODE(insn->code) == BPF_PROBE_MEM32) {
|
|
|
|
emit(A64_ADD(1, tmp2, dst, arena_vm_base), ctx);
|
|
|
|
dst = tmp2;
|
|
|
|
}
|
2024-08-26 15:16:23 +08:00
|
|
|
if (dst == fp) {
|
|
|
|
dst_adj = A64_SP;
|
|
|
|
off_adj = off + ctx->stack_size;
|
2022-03-21 11:28:50 -04:00
|
|
|
} else {
|
|
|
|
dst_adj = dst;
|
|
|
|
off_adj = off;
|
|
|
|
}
|
2015-11-30 14:24:07 -08:00
|
|
|
/* Load imm to a register then store it */
|
|
|
|
emit_a64_mov_i(1, tmp, imm, ctx);
|
|
|
|
switch (BPF_SIZE(code)) {
|
|
|
|
case BPF_W:
|
2022-03-21 11:28:50 -04:00
|
|
|
if (is_lsi_offset(off_adj, 2)) {
|
|
|
|
emit(A64_STR32I(tmp, dst_adj, off_adj), ctx);
|
2022-03-21 11:28:49 -04:00
|
|
|
} else {
|
|
|
|
emit_a64_mov_i(1, tmp2, off, ctx);
|
|
|
|
emit(A64_STR32(tmp, dst, tmp2), ctx);
|
|
|
|
}
|
2015-11-30 14:24:07 -08:00
|
|
|
break;
|
|
|
|
case BPF_H:
|
2022-03-21 11:28:50 -04:00
|
|
|
if (is_lsi_offset(off_adj, 1)) {
|
|
|
|
emit(A64_STRHI(tmp, dst_adj, off_adj), ctx);
|
2022-03-21 11:28:49 -04:00
|
|
|
} else {
|
|
|
|
emit_a64_mov_i(1, tmp2, off, ctx);
|
|
|
|
emit(A64_STRH(tmp, dst, tmp2), ctx);
|
|
|
|
}
|
2015-11-30 14:24:07 -08:00
|
|
|
break;
|
|
|
|
case BPF_B:
|
2022-03-21 11:28:50 -04:00
|
|
|
if (is_lsi_offset(off_adj, 0)) {
|
|
|
|
emit(A64_STRBI(tmp, dst_adj, off_adj), ctx);
|
2022-03-21 11:28:49 -04:00
|
|
|
} else {
|
|
|
|
emit_a64_mov_i(1, tmp2, off, ctx);
|
|
|
|
emit(A64_STRB(tmp, dst, tmp2), ctx);
|
|
|
|
}
|
2015-11-30 14:24:07 -08:00
|
|
|
break;
|
|
|
|
case BPF_DW:
|
2022-03-21 11:28:50 -04:00
|
|
|
if (is_lsi_offset(off_adj, 3)) {
|
|
|
|
emit(A64_STR64I(tmp, dst_adj, off_adj), ctx);
|
2022-03-21 11:28:49 -04:00
|
|
|
} else {
|
|
|
|
emit_a64_mov_i(1, tmp2, off, ctx);
|
|
|
|
emit(A64_STR64(tmp, dst, tmp2), ctx);
|
|
|
|
}
|
2015-11-30 14:24:07 -08:00
|
|
|
break;
|
|
|
|
}
|
2024-03-25 15:07:15 +00:00
|
|
|
|
|
|
|
ret = add_exception_handler(insn, ctx, dst);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2015-11-30 14:24:07 -08:00
|
|
|
break;
|
2014-08-26 21:15:30 -07:00
|
|
|
|
|
|
|
/* STX: *(size *)(dst + off) = src */
|
|
|
|
case BPF_STX | BPF_MEM | BPF_W:
|
|
|
|
case BPF_STX | BPF_MEM | BPF_H:
|
|
|
|
case BPF_STX | BPF_MEM | BPF_B:
|
|
|
|
case BPF_STX | BPF_MEM | BPF_DW:
|
2024-03-25 15:07:15 +00:00
|
|
|
case BPF_STX | BPF_PROBE_MEM32 | BPF_B:
|
|
|
|
case BPF_STX | BPF_PROBE_MEM32 | BPF_H:
|
|
|
|
case BPF_STX | BPF_PROBE_MEM32 | BPF_W:
|
|
|
|
case BPF_STX | BPF_PROBE_MEM32 | BPF_DW:
|
|
|
|
if (BPF_MODE(insn->code) == BPF_PROBE_MEM32) {
|
|
|
|
emit(A64_ADD(1, tmp2, dst, arena_vm_base), ctx);
|
|
|
|
dst = tmp2;
|
|
|
|
}
|
2024-08-26 15:16:23 +08:00
|
|
|
if (dst == fp) {
|
|
|
|
dst_adj = A64_SP;
|
|
|
|
off_adj = off + ctx->stack_size;
|
2022-03-21 11:28:50 -04:00
|
|
|
} else {
|
|
|
|
dst_adj = dst;
|
|
|
|
off_adj = off;
|
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
switch (BPF_SIZE(code)) {
|
|
|
|
case BPF_W:
|
2022-03-21 11:28:50 -04:00
|
|
|
if (is_lsi_offset(off_adj, 2)) {
|
|
|
|
emit(A64_STR32I(src, dst_adj, off_adj), ctx);
|
2022-03-21 11:28:49 -04:00
|
|
|
} else {
|
|
|
|
emit_a64_mov_i(1, tmp, off, ctx);
|
|
|
|
emit(A64_STR32(src, dst, tmp), ctx);
|
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
break;
|
|
|
|
case BPF_H:
|
bpf, arm64: Adjust the offset of str/ldr(immediate) to positive number
The BPF STX/LDX instructions use an offset relative to the FP to address
stack space. Since BPF_FP sits at the top of the frame, the offset is
usually a negative number. However, the arm64 str/ldr immediate form
requires the offset to be a non-negative number. Therefore, this patch
tries to convert the offsets.
The method is to first find the most negative offset from the FP, then
add it to the FP to calculate a bottom position, called FPB, and adjust
the offsets in the other STR/LDX instructions to be relative to FPB.
FPB is kept in the arm64 callee-saved register x27, which was previously
unused.
Before adjusting the offsets, the patch checks every instruction to
ensure that the FP does not change at run time. If the FP may change, no
offsets are adjusted.
For example, for the following bpftrace command:
bpftrace -e 'kprobe:do_sys_open { printf("opening: %s\n", str(arg1)); }'
Without this patch, jited code (fragment):
0: bti c
4: stp x29, x30, [sp, #-16]!
8: mov x29, sp
c: stp x19, x20, [sp, #-16]!
10: stp x21, x22, [sp, #-16]!
14: stp x25, x26, [sp, #-16]!
18: mov x25, sp
1c: mov x26, #0x0 // #0
20: bti j
24: sub sp, sp, #0x90
28: add x19, x0, #0x0
2c: mov x0, #0x0 // #0
30: mov x10, #0xffffffffffffff78 // #-136
34: str x0, [x25, x10]
38: mov x10, #0xffffffffffffff80 // #-128
3c: str x0, [x25, x10]
40: mov x10, #0xffffffffffffff88 // #-120
44: str x0, [x25, x10]
48: mov x10, #0xffffffffffffff90 // #-112
4c: str x0, [x25, x10]
50: mov x10, #0xffffffffffffff98 // #-104
54: str x0, [x25, x10]
58: mov x10, #0xffffffffffffffa0 // #-96
5c: str x0, [x25, x10]
60: mov x10, #0xffffffffffffffa8 // #-88
64: str x0, [x25, x10]
68: mov x10, #0xffffffffffffffb0 // #-80
6c: str x0, [x25, x10]
70: mov x10, #0xffffffffffffffb8 // #-72
74: str x0, [x25, x10]
78: mov x10, #0xffffffffffffffc0 // #-64
7c: str x0, [x25, x10]
80: mov x10, #0xffffffffffffffc8 // #-56
84: str x0, [x25, x10]
88: mov x10, #0xffffffffffffffd0 // #-48
8c: str x0, [x25, x10]
90: mov x10, #0xffffffffffffffd8 // #-40
94: str x0, [x25, x10]
98: mov x10, #0xffffffffffffffe0 // #-32
9c: str x0, [x25, x10]
a0: mov x10, #0xffffffffffffffe8 // #-24
a4: str x0, [x25, x10]
a8: mov x10, #0xfffffffffffffff0 // #-16
ac: str x0, [x25, x10]
b0: mov x10, #0xfffffffffffffff8 // #-8
b4: str x0, [x25, x10]
b8: mov x10, #0x8 // #8
bc: ldr x2, [x19, x10]
[...]
With this patch, jited code (fragment):
0: bti c
4: stp x29, x30, [sp, #-16]!
8: mov x29, sp
c: stp x19, x20, [sp, #-16]!
10: stp x21, x22, [sp, #-16]!
14: stp x25, x26, [sp, #-16]!
18: stp x27, x28, [sp, #-16]!
1c: mov x25, sp
20: sub x27, x25, #0x88
24: mov x26, #0x0 // #0
28: bti j
2c: sub sp, sp, #0x90
30: add x19, x0, #0x0
34: mov x0, #0x0 // #0
38: str x0, [x27]
3c: str x0, [x27, #8]
40: str x0, [x27, #16]
44: str x0, [x27, #24]
48: str x0, [x27, #32]
4c: str x0, [x27, #40]
50: str x0, [x27, #48]
54: str x0, [x27, #56]
58: str x0, [x27, #64]
5c: str x0, [x27, #72]
60: str x0, [x27, #80]
64: str x0, [x27, #88]
68: str x0, [x27, #96]
6c: str x0, [x27, #104]
70: str x0, [x27, #112]
74: str x0, [x27, #120]
78: str x0, [x27, #128]
7c: ldr x2, [x19, #8]
[...]
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220321152852.2334294-4-xukuohai@huawei.com
2022-03-21 11:28:50 -04:00
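A minimal sketch of the scan implied by the description above (illustrative only; the kernel's helper additionally refuses to adjust anything when the FP may change at run time): find the lowest FP-relative offset used by any stack load/store so that FPB can be planted there.

	/* Sketch: lowest FP-relative offset used by any ST/STX/LDX, so the
	 * prologue can set FPB = FP + min_off and each FP-relative access
	 * can use the non-negative immediate (off - min_off) against FPB.
	 */
	static int find_min_fp_offset(const struct bpf_prog *prog)
	{
		int i, min_off = 0;

		for (i = 0; i < prog->len; i++) {
			const struct bpf_insn *insn = &prog->insnsi[i];
			u8 class = BPF_CLASS(insn->code);
			bool fp_store = (class == BPF_ST || class == BPF_STX) &&
					insn->dst_reg == BPF_REG_FP;
			bool fp_load = class == BPF_LDX &&
				       insn->src_reg == BPF_REG_FP;

			if ((fp_store || fp_load) && insn->off < min_off)
				min_off = insn->off;
		}
		return min_off;
	}

In the "with this patch" listing above, min_off is -136 (-0x88), which is why the prologue emits "sub x27, x25, #0x88" and every former [x25, #negative] store becomes an [x27, #non-negative] store.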
|
|
|
if (is_lsi_offset(off_adj, 1)) {
|
|
|
|
emit(A64_STRHI(src, dst_adj, off_adj), ctx);
|
2022-03-21 11:28:49 -04:00
|
|
|
} else {
|
|
|
|
emit_a64_mov_i(1, tmp, off, ctx);
|
|
|
|
emit(A64_STRH(src, dst, tmp), ctx);
|
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
break;
|
|
|
|
case BPF_B:
|
2022-03-21 11:28:50 -04:00
|
|
|
if (is_lsi_offset(off_adj, 0)) {
|
|
|
|
emit(A64_STRBI(src, dst_adj, off_adj), ctx);
|
2022-03-21 11:28:49 -04:00
|
|
|
} else {
|
|
|
|
emit_a64_mov_i(1, tmp, off, ctx);
|
|
|
|
emit(A64_STRB(src, dst, tmp), ctx);
|
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
break;
|
|
|
|
case BPF_DW:
|
2022-03-21 11:28:50 -04:00
|
|
|
if (is_lsi_offset(off_adj, 3)) {
|
|
|
|
emit(A64_STR64I(src, dst_adj, off_adj), ctx);
|
2022-03-21 11:28:49 -04:00
|
|
|
} else {
|
|
|
|
emit_a64_mov_i(1, tmp, off, ctx);
|
|
|
|
emit(A64_STR64(src, dst, tmp), ctx);
|
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
break;
|
|
|
|
}
|
2024-03-25 15:07:15 +00:00
|
|
|
|
|
|
|
ret = add_exception_handler(insn, ctx, dst);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2014-08-26 21:15:30 -07:00
|
|
|
break;
|
2019-04-26 21:48:22 +02:00
|
|
|
|
2021-01-14 18:17:44 +00:00
|
|
|
case BPF_STX | BPF_ATOMIC | BPF_W:
|
|
|
|
case BPF_STX | BPF_ATOMIC | BPF_DW:
|
2024-04-26 16:11:16 +00:00
|
|
|
case BPF_STX | BPF_PROBE_ATOMIC | BPF_W:
|
|
|
|
case BPF_STX | BPF_PROBE_ATOMIC | BPF_DW:
|
bpf, arm64: Support more atomic operations
The "Atomics for eBPF" patch series adds support for atomic[64]_fetch_add,
atomic[64]_[fetch_]{and,or,xor} and atomic[64]_{xchg|cmpxchg}, but it
only adds support for x86-64, so add support for these atomic operations
on arm64 as well.
Basically the implementation is an almost mechanical translation of the
code snippets in atomic_ll_sc.h, atomic_lse.h and cmpxchg.h located
under arch/arm64/include/asm.
When LSE atomics are unavailable, an extra temporary register is needed
for (BPF_ADD | BPF_FETCH) to save the value of the src register; instead
of adding a TMP_REG_4, BPF_REG_AX is used. Also make emit_lse_atomic()
an empty inline function when CONFIG_ARM64_LSE_ATOMICS is disabled.
For both the cpus_have_cap(ARM64_HAS_LSE_ATOMICS) case and the
no-LSE-atomics case, the following three tests: "./test_verifier",
"./test_progs -t atomic" and "insmod ./test_bpf.ko" were exercised and
passed.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220217072232.1186625-4-houtao1@huawei.com
2022-02-17 15:22:31 +08:00
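For a concrete feel of the two code shapes chosen between below, an atomic 32-bit add (BPF_ATOMIC | BPF_ADD, no fetch) roughly becomes the following in each case; the register numbers are illustrative, not necessarily what the JIT picks:

	/*
	 * LSE path (single far atomic):
	 *     stadd   w7, [x10]           // *(u32 *)x10 += w7
	 *
	 * LL/SC path (exclusive-monitor loop):
	 *     prfm    pstl1strm, [x10]
	 * 1:  ldxr    w11, [x10]
	 *     add     w11, w11, w7
	 *     stxr    w12, w11, [x10]
	 *     cbnz    w12, 1b
	 */

The fetch, and/or/xor, xchg and cmpxchg variants follow the same pattern, with different LSE instructions and different bodies inside the exclusive-monitor loop.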
|
|
|
if (cpus_have_cap(ARM64_HAS_LSE_ATOMICS))
|
|
|
|
ret = emit_lse_atomic(insn, ctx);
|
|
|
|
else
|
|
|
|
ret = emit_ll_sc_atomic(insn, ctx);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2024-04-26 16:11:16 +00:00
|
|
|
|
|
|
|
ret = add_exception_handler(insn, ctx, dst);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
bpf, arm64: implement jiting of BPF_XADD
This work adds BPF_XADD for BPF_W/BPF_DW to the arm64 JIT and therefore
completes JITing of all BPF instructions: we can thus remove the 'notyet'
label and no longer need to fall back to the interpreter when BPF_XADD is
used in a program!
This also brings the arm64 JIT in line with x86_64, s390x, ppc64 and
sparc64, where all current eBPF features are supported.
BPF_W example from test_bpf:
.u.insns_int = {
BPF_ALU32_IMM(BPF_MOV, R0, 0x12),
BPF_ST_MEM(BPF_W, R10, -40, 0x10),
BPF_STX_XADD(BPF_W, R10, R0, -40),
BPF_LDX_MEM(BPF_W, R0, R10, -40),
BPF_EXIT_INSN(),
},
[...]
00000020: 52800247 mov w7, #0x12 // #18
00000024: 928004eb mov x11, #0xffffffffffffffd8 // #-40
00000028: d280020a mov x10, #0x10 // #16
0000002c: b82b6b2a str w10, [x25,x11]
// start of xadd mapping:
00000030: 928004ea mov x10, #0xffffffffffffffd8 // #-40
00000034: 8b19014a add x10, x10, x25
00000038: f9800151 prfm pstl1strm, [x10]
0000003c: 885f7d4b ldxr w11, [x10]
00000040: 0b07016b add w11, w11, w7
00000044: 880b7d4b stxr w11, w11, [x10]
00000048: 35ffffab cbnz w11, 0x0000003c
// end of xadd mapping:
[...]
BPF_DW example from test_bpf:
.u.insns_int = {
BPF_ALU32_IMM(BPF_MOV, R0, 0x12),
BPF_ST_MEM(BPF_DW, R10, -40, 0x10),
BPF_STX_XADD(BPF_DW, R10, R0, -40),
BPF_LDX_MEM(BPF_DW, R0, R10, -40),
BPF_EXIT_INSN(),
},
[...]
00000020: 52800247 mov w7, #0x12 // #18
00000024: 928004eb mov x11, #0xffffffffffffffd8 // #-40
00000028: d280020a mov x10, #0x10 // #16
0000002c: f82b6b2a str x10, [x25,x11]
// start of xadd mapping:
00000030: 928004ea mov x10, #0xffffffffffffffd8 // #-40
00000034: 8b19014a add x10, x10, x25
00000038: f9800151 prfm pstl1strm, [x10]
0000003c: c85f7d4b ldxr x11, [x10]
00000040: 8b07016b add x11, x11, x7
00000044: c80b7d4b stxr w11, x11, [x10]
00000048: 35ffffab cbnz w11, 0x0000003c
// end of xadd mapping:
[...]
Tested on Cavium ThunderX ARMv8, test suite results after the patch:
No JIT: [ 3751.855362] test_bpf: Summary: 311 PASSED, 0 FAILED, [0/303 JIT'ed]
With JIT: [ 3573.759527] test_bpf: Summary: 311 PASSED, 0 FAILED, [303/303 JIT'ed]
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 02:57:20 +02:00
|
|
|
break;
|
2014-08-26 21:15:30 -07:00
|
|
|
|
|
|
|
default:
|
|
|
|
pr_err_once("unknown opcode %02x\n", code);
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2018-11-26 14:05:39 +01:00
|
|
|
static int build_body(struct jit_ctx *ctx, bool extra_pass)
|
2014-08-26 21:15:30 -07:00
|
|
|
{
|
|
|
|
const struct bpf_prog *prog = ctx->prog;
|
|
|
|
int i;
|
|
|
|
|
arm64: bpf: Fix branch offset in JIT
Running the eBPF test_verifier leads to random errors looking like this:
[ 6525.735488] Unexpected kernel BRK exception at EL1
[ 6525.735502] Internal error: ptrace BRK handler: f2000100 [#1] SMP
[ 6525.741609] Modules linked in: nls_utf8 cifs libdes libarc4 dns_resolver fscache binfmt_misc nls_ascii nls_cp437 vfat fat aes_ce_blk crypto_simd cryptd aes_ce_cipher ghash_ce gf128mul efi_pstore sha2_ce sha256_arm64 sha1_ce evdev efivars efivarfs ip_tables x_tables autofs4 btrfs blake2b_generic xor xor_neon zstd_compress raid6_pq libcrc32c crc32c_generic ahci xhci_pci libahci xhci_hcd igb libata i2c_algo_bit nvme realtek usbcore nvme_core scsi_mod t10_pi netsec mdio_devres of_mdio gpio_keys fixed_phy libphy gpio_mb86s7x
[ 6525.787760] CPU: 3 PID: 7881 Comm: test_verifier Tainted: G W 5.9.0-rc1+ #47
[ 6525.796111] Hardware name: Socionext SynQuacer E-series DeveloperBox, BIOS build #1 Jun 6 2020
[ 6525.804812] pstate: 20000005 (nzCv daif -PAN -UAO BTYPE=--)
[ 6525.810390] pc : bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
[ 6525.815613] lr : bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
[ 6525.820832] sp : ffff8000130cbb80
[ 6525.824141] x29: ffff8000130cbbb0 x28: 0000000000000000
[ 6525.829451] x27: 000005ef6fcbf39b x26: 0000000000000000
[ 6525.834759] x25: ffff8000130cbb80 x24: ffff800011dc7038
[ 6525.840067] x23: ffff8000130cbd00 x22: ffff0008f624d080
[ 6525.845375] x21: 0000000000000001 x20: ffff800011dc7000
[ 6525.850682] x19: 0000000000000000 x18: 0000000000000000
[ 6525.855990] x17: 0000000000000000 x16: 0000000000000000
[ 6525.861298] x15: 0000000000000000 x14: 0000000000000000
[ 6525.866606] x13: 0000000000000000 x12: 0000000000000000
[ 6525.871913] x11: 0000000000000001 x10: ffff8000000a660c
[ 6525.877220] x9 : ffff800010951810 x8 : ffff8000130cbc38
[ 6525.882528] x7 : 0000000000000000 x6 : 0000009864cfa881
[ 6525.887836] x5 : 00ffffffffffffff x4 : 002880ba1a0b3e9f
[ 6525.893144] x3 : 0000000000000018 x2 : ffff8000000a4374
[ 6525.898452] x1 : 000000000000000a x0 : 0000000000000009
[ 6525.903760] Call trace:
[ 6525.906202] bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
[ 6525.911076] bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
[ 6525.915957] bpf_dispatcher_xdp_func+0x14/0x20
[ 6525.920398] bpf_test_run+0x70/0x1b0
[ 6525.923969] bpf_prog_test_run_xdp+0xec/0x190
[ 6525.928326] __do_sys_bpf+0xc88/0x1b28
[ 6525.932072] __arm64_sys_bpf+0x24/0x30
[ 6525.935820] el0_svc_common.constprop.0+0x70/0x168
[ 6525.940607] do_el0_svc+0x28/0x88
[ 6525.943920] el0_sync_handler+0x88/0x190
[ 6525.947838] el0_sync+0x140/0x180
[ 6525.951154] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
[ 6525.957249] ---[ end trace cecc3f93b14927e2 ]---
The reason is the creation and later usage of offset[] while building
the eBPF body. The code currently omits the first instruction, since
build_insn() will increase our ctx->idx before saving it.
That was fine up until bounded eBPF loops were introduced. After that
introduction, offset[0] must be the offset of the end of the prologue,
which is the start of the 1st insn, while offset[n] holds the offset of
the end of the n-th insn.
When "taken loop with back jump to 1st insn" test runs, it will
eventually call bpf2a64_offset(-1, 2, ctx). Since negative indexing is
permitted, the current outcome depends on the value stored in
ctx->offset[-1], which has nothing to do with our array.
If the value happens to be 0 the tests will work. If not this error
triggers.
commit 7c2e988f400e ("bpf: fix x64 JIT code generation for jmp to 1st insn")
fixed an identical bug on x86 when eBPF bounded loops were introduced.
So let's fix it by creating the ctx->offset[] differently. Track the
beginning of each instruction and account for the extra instruction while
calculating the arm instruction offsets.
Fixes: 2589726d12a1 ("bpf: introduce bounded loops")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Jiri Olsa <jolsa@kernel.org>
Co-developed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Co-developed-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20200917084925.177348-1-ilias.apalodimas@linaro.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2020-09-17 11:49:25 +03:00
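With offset[] laid out this way (offset[i] marks the end of insn i, i.e. the start of insn i+1), converting a BPF jump into an arm64 branch displacement becomes a simple lookup. A sketch of the helper this enables (assumed to match the fixed bpf2a64_offset() in shape, if not in every detail):

	/* Sketch: arm64 branch displacement, in instructions, for a BPF jump
	 * located at BPF insn 'bpf_insn' with BPF offset 'off'.
	 */
	static int bpf_to_a64_offset(int bpf_insn, int off, const struct jit_ctx *ctx)
	{
		/* BPF JMP offsets are relative to the *next* BPF instruction... */
		bpf_insn++;
		/* ...while arm64 branches are relative to the branch itself,
		 * which sits one slot before ctx->offset[bpf_insn].
		 */
		return ctx->offset[bpf_insn + off] - (ctx->offset[bpf_insn] - 1);
	}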
|
|
|
/*
|
|
|
|
* - offset[0] - offset of the end of prologue,
|
|
|
|
* start of the 1st instruction.
|
|
|
|
* - offset[1] - offset of the end of 1st instruction,
|
|
|
|
* start of the 2nd instruction
|
|
|
|
* [....]
|
|
|
|
* - offset[3] - offset of the end of 3rd instruction,
|
|
|
|
* start of 4th instruction
|
|
|
|
*/
|
2014-08-26 21:15:30 -07:00
|
|
|
for (i = 0; i < prog->len; i++) {
|
|
|
|
const struct bpf_insn *insn = &prog->insnsi[i];
|
|
|
|
int ret;
|
|
|
|
|
bpf, arm64: Jit BPF_CALL to direct call when possible
Currently, BPF_CALL is always jited to an indirect call. When the target
is within direct-call range, BPF_CALL can be jited to a direct call.
For example, the following BPF_CALL
call __htab_map_lookup_elem
is always jited to indirect call:
mov x10, #0xffffffffffff18f4
movk x10, #0x821, lsl #16
movk x10, #0x8000, lsl #32
blr x10
When the address of target __htab_map_lookup_elem is within the range of
direct call, the BPF_CALL can be jited to:
bl 0xfffffffffd33bc98
This patch implements that jit optimization by emitting arm64 direct calls
for BPF_CALL when possible, and indirect calls otherwise.
Without this patch, the jit works as follows.
1. First pass
A. Determine jited position and size for each bpf instruction.
B. Compute the jited image size.
2. Allocate jited image with size computed in step 1.
3. Second pass
A. Adjust jump offset for jump instructions
B. Write the final image.
This works because, for a given bpf prog, regardless of where the jited
image is allocated, the jited result for each instruction is fixed. The
second pass differs from the first only in adjusting the jump offsets,
like changing "jmp imm1" to "jmp imm2", while the position and size of
the "jmp" instruction remain unchanged.
Now consider whether to jit BPF_CALL to an arm64 direct or indirect call
instruction. The choice depends solely on the jump offset: direct call
if the jump offset is within 128MB, indirect call otherwise.
For a given BPF_CALL, the target address is known, so the jump offset is
decided by the jited address of the BPF_CALL instruction. In other words,
for a given bpf prog, the jited result for each BPF_CALL is determined
by its jited address.
The jited address for a BPF_CALL is the jited image address plus the
total jited size of all preceding instructions. For a given bpf prog,
there are clearly no BPF_CALL instructions before the first BPF_CALL
instruction. Since the jited results for all instructions other
than BPF_CALL are fixed, the total jited size preceding the first
BPF_CALL is also fixed. Therefore, once the jited image is allocated,
the jited address for the first BPF_CALL is fixed.
Now that the jited result for the first BPF_CALL is fixed, the jited
results for all instructions preceding the second BPF_CALL are fixed.
So the jited address and result for the second BPF_CALL are also fixed.
Similarly, we can conclude that the jited addresses and results for all
subsequent BPF_CALL instructions are fixed.
This means that, for a given bpf prog, once the jited image is allocated,
the jited address and result for all instructions, including all BPF_CALL
instructions, are fixed.
Based on this observation, with this patch the jit works as follows.
1. First pass
Estimate the maximum jited image size. In this pass, all BPF_CALLs
are jited to arm64 indirect calls since the jump offsets are unknown
because the jited image is not allocated.
2. Allocate jited image with size estimated in step 1.
3. Second pass
A. Determine the jited result for each BPF_CALL.
B. Determine jited address and size for each bpf instruction.
4. Third pass
A. Adjust jump offset for jump instructions.
B. Write the final image.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20240903094407.601107-1-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-03 17:44:07 +08:00
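A sketch of the resulting call-emission decision (the emit_* helpers below are placeholders for illustration, not the kernel's encoders): once the image address, and therefore the call site's pc, is known, the displacement to the target picks between the two forms shown above.

	/* Sketch: direct bl when the displacement fits the signed 26-bit
	 * (+/-128MB) immediate of the arm64 bl instruction, otherwise build
	 * the 64-bit address with mov/movk and use blr.
	 */
	long off = (long)target - (long)pc;

	if (off >= -SZ_128M && off < SZ_128M) {
		emit_direct_bl(off, ctx);		/* placeholder: "bl <target>"       */
	} else {
		emit_mov_imm64(tmp, target, ctx);	/* placeholder: "mov/movk x10, ..." */
		emit_indirect_blr(tmp, ctx);		/* placeholder: "blr x10"           */
	}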
|
|
|
ctx->offset[i] = ctx->idx;
|
2018-11-26 14:05:39 +01:00
|
|
|
ret = build_insn(insn, ctx, extra_pass);
|
2014-09-16 21:29:23 +01:00
|
|
|
if (ret > 0) {
|
|
|
|
i++;
|
2024-09-03 17:44:07 +08:00
|
|
|
ctx->offset[i] = ctx->idx;
|
2014-09-16 21:29:23 +01:00
|
|
|
continue;
|
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
2020-09-17 11:49:25 +03:00
|
|
|
/*
|
|
|
|
* offset is allocated with prog->len + 1 so fill in
|
|
|
|
* the last element with the offset after the last
|
|
|
|
* instruction (end of program)
|
|
|
|
*/
|
2024-09-03 17:44:07 +08:00
|
|
|
ctx->offset[i] = ctx->idx;
|
2014-08-26 21:15:30 -07:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2016-01-13 23:33:22 -08:00
|
|
|
static int validate_code(struct jit_ctx *ctx)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < ctx->idx; i++) {
|
|
|
|
u32 a64_insn = le32_to_cpu(ctx->image[i]);
|
|
|
|
|
|
|
|
if (a64_insn == AARCH64_BREAK_FAULT)
|
|
|
|
return -1;
|
|
|
|
}
|
2022-07-11 11:08:23 -04:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int validate_ctx(struct jit_ctx *ctx)
|
|
|
|
{
|
|
|
|
if (validate_code(ctx))
|
|
|
|
return -1;
|
2016-01-13 23:33:22 -08:00
|
|
|
|
2020-07-28 17:21:26 +02:00
|
|
|
if (WARN_ON_ONCE(ctx->exentry_idx != ctx->prog->aux->num_exentries))
|
|
|
|
return -1;
|
|
|
|
|
2016-01-13 23:33:22 -08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2014-08-26 21:15:30 -07:00
|
|
|
static inline void bpf_flush_icache(void *start, void *end)
|
|
|
|
{
|
|
|
|
flush_icache_range((unsigned long)start, (unsigned long)end);
|
|
|
|
}
|
|
|
|
|
2017-12-14 17:55:16 -08:00
|
|
|
struct arm64_jit_data {
|
|
|
|
struct bpf_binary_header *header;
|
2024-02-28 14:18:24 +00:00
|
|
|
u8 *ro_image;
|
|
|
|
struct bpf_binary_header *ro_header;
|
2017-12-14 17:55:16 -08:00
|
|
|
struct jit_ctx ctx;
|
|
|
|
};
|
|
|
|
|
2016-05-13 19:08:31 +02:00
|
|
|
struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
|
2014-08-26 21:15:30 -07:00
|
|
|
{
|
bpf, arm64: Implement bpf_arch_text_poke() for arm64
Implement bpf_arch_text_poke() for arm64, so bpf prog or bpf trampoline
can be patched with it.
When the target address is NULL, the original instruction is patched to
a NOP.
When the target address and the source address are within the branch
range, the original instruction is patched to a bl instruction to the
target address directly.
To support attaching bpf trampoline to both regular kernel function and
bpf prog, we follow the ftrace patchsite way for bpf prog. That is, two
instructions are inserted at the beginning of bpf prog, the first one
saves the return address to x9, and the second is a nop which will be
patched to a bl instruction when a bpf trampoline is attached.
However, when a bpf trampoline is attached to bpf prog, the distance
between target address and source address may exceed 128MB, the maximum
branch range, because bpf trampoline and bpf prog are allocated
separately with vmalloc. So long jumps must be handled.
When a bpf prog is constructed, a plt pointing to empty trampoline
dummy_tramp is placed at the end:
bpf_prog:
mov x9, lr
nop // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad dummy_tramp // plt target
This is also the state when no trampoline is attached.
When a short-jump bpf trampoline is attached, the patchsite is patched to
a bl instruction to the trampoline directly:
bpf_prog:
mov x9, lr
bl <short-jump bpf trampoline address> // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad dummy_tramp // plt target
When a long-jump bpf trampoline is attached, the plt target is filled with
the trampoline address and the patchsite is patched to a bl instruction to
the plt:
bpf_prog:
mov x9, lr
bl plt // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad <long-jump bpf trampoline address>
dummy_tramp is used to prevent another CPU from jumping to an unknown
location during the patching process, making the patching process easier.
The patching process is as follows:
1. when neither the old address nor the new address is a long jump, the
patchsite is replaced with a bl to the new address, or nop if the new
address is NULL;
2. when the old address is not long jump but the new one is, the
branch target address is written to plt first, then the patchsite
is replaced with a bl instruction to the plt;
3. when the old address is long jump but the new one is not, the address
of dummy_tramp is written to plt first, then the patchsite is replaced
with a bl to the new address, or a nop if the new address is NULL;
4. when both the old address and the new address are long jump, the
new address is written to plt and the patchsite is not changed.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20220711150823.2128542-4-xukuohai@huawei.com
2022-07-11 11:08:22 -04:00
|
|
|
int image_size, prog_size, extable_size, extable_align, extable_offset;
|
2016-05-13 19:08:34 +02:00
|
|
|
struct bpf_prog *tmp, *orig_prog = prog;
|
2014-09-16 08:48:50 +01:00
|
|
|
struct bpf_binary_header *header;
|
2024-02-28 14:18:24 +00:00
|
|
|
struct bpf_binary_header *ro_header;
|
2017-12-14 17:55:16 -08:00
|
|
|
struct arm64_jit_data *jit_data;
|
2018-05-14 23:22:33 +02:00
|
|
|
bool was_classic = bpf_prog_was_classic(prog);
|
2016-05-13 19:08:34 +02:00
|
|
|
bool tmp_blinded = false;
|
2017-12-14 17:55:16 -08:00
|
|
|
bool extra_pass = false;
|
2014-08-26 21:15:30 -07:00
|
|
|
struct jit_ctx ctx;
|
2014-09-16 08:48:50 +01:00
|
|
|
u8 *image_ptr;
|
2024-02-28 14:18:24 +00:00
|
|
|
u8 *ro_image_ptr;
|
2024-09-03 17:44:07 +08:00
|
|
|
int body_idx;
|
|
|
|
int exentry_idx;
|
2014-08-26 21:15:30 -07:00
|
|
|
|
2017-12-14 17:55:14 -08:00
|
|
|
if (!prog->jit_requested)
|
2016-05-13 19:08:34 +02:00
|
|
|
return orig_prog;
|
|
|
|
|
|
|
|
tmp = bpf_jit_blind_constants(prog);
|
|
|
|
/* If blinding was requested and we failed during blinding,
|
|
|
|
* we must fall back to the interpreter.
|
|
|
|
*/
|
|
|
|
if (IS_ERR(tmp))
|
|
|
|
return orig_prog;
|
|
|
|
if (tmp != prog) {
|
|
|
|
tmp_blinded = true;
|
|
|
|
prog = tmp;
|
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
|
2017-12-14 17:55:16 -08:00
|
|
|
jit_data = prog->aux->jit_data;
|
|
|
|
if (!jit_data) {
|
|
|
|
jit_data = kzalloc(sizeof(*jit_data), GFP_KERNEL);
|
|
|
|
if (!jit_data) {
|
|
|
|
prog = orig_prog;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
prog->aux->jit_data = jit_data;
|
|
|
|
}
|
|
|
|
if (jit_data->ctx.offset) {
|
|
|
|
ctx = jit_data->ctx;
|
2024-02-28 14:18:24 +00:00
|
|
|
ro_image_ptr = jit_data->ro_image;
|
|
|
|
ro_header = jit_data->ro_header;
|
2017-12-14 17:55:16 -08:00
|
|
|
header = jit_data->header;
|
2024-02-28 14:18:24 +00:00
|
|
|
image_ptr = (void *)header + ((void *)ro_image_ptr
|
|
|
|
- (void *)ro_header);
|
2017-12-14 17:55:16 -08:00
|
|
|
extra_pass = true;
|
2020-07-28 17:21:26 +02:00
|
|
|
prog_size = sizeof(u32) * ctx.idx;
|
2017-12-14 17:55:16 -08:00
|
|
|
goto skip_init_ctx;
|
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
memset(&ctx, 0, sizeof(ctx));
|
|
|
|
ctx.prog = prog;
|
|
|
|
|
2022-08-04 10:54:42 +08:00
|
|
|
ctx.offset = kvcalloc(prog->len + 1, sizeof(int), GFP_KERNEL);
|
2016-05-13 19:08:34 +02:00
|
|
|
if (ctx.offset == NULL) {
|
|
|
|
prog = orig_prog;
|
2017-12-14 17:55:16 -08:00
|
|
|
goto out_off;
|
2016-05-13 19:08:34 +02:00
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
|
2024-03-25 15:07:16 +00:00
|
|
|
ctx.user_vm_start = bpf_arena_get_user_vm_start(prog->aux->arena);
|
bpf, arm64: Avoid blindly saving/restoring all callee-saved registers
The arm64 jit blindly saves/restores all callee-saved registers, making
the jited result look a bit too complicated. For example, for an empty
prog, the jited result is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp x19, x20, [sp, #-16]!
1c: stp x21, x22, [sp, #-16]!
20: stp x26, x25, [sp, #-16]!
24: mov x26, #0
28: stp x26, x25, [sp, #-16]!
2c: mov x26, sp
30: stp x27, x28, [sp, #-16]!
34: mov x25, sp
38: bti j // tailcall target
3c: sub sp, sp, #0
40: mov x7, #0
44: add sp, sp, #0
48: ldp x27, x28, [sp], #16
4c: ldp x26, x25, [sp], #16
50: ldp x26, x25, [sp], #16
54: ldp x21, x22, [sp], #16
58: ldp x19, x20, [sp], #16
5c: ldp fp, lr, [sp], #16
60: mov x0, x7
64: autiasp
68: ret
Clearly, there is no need to save/restore unused callee-saved registers.
This patch changes that, making the jited image save/restore only the
callee-saved registers it actually uses.
Now the jited result of empty prog is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp xzr, x26, [sp, #-16]!
1c: mov x26, sp
20: bti j // tailcall target
24: mov x7, #0
28: ldp xzr, x26, [sp], #16
2c: ldp fp, lr, [sp], #16
30: mov x0, x7
34: autiasp
38: ret
Since a bpf prog saves/restores its own callee-saved registers as needed,
to make tailcalls work correctly the caller needs to restore its saved
registers before the tailcall, and the callee needs to save its
callee-saved registers after the tailcall. These extra restore/save
instructions increase the performance overhead.
[1] provides 2 benchmarks for tailcall scenarios. Below are the perf
numbers measured in an arm64 KVM guest. The results indicate that the
performance difference before and after the patch in typical tailcall
scenarios is negligible.
- Before:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4313.43 msec task-clock # 0.874 CPUs utilized ( +- 0.16% )
574 context-switches # 133.073 /sec ( +- 1.14% )
0 cpu-migrations # 0.000 /sec
538 page-faults # 124.727 /sec ( +- 0.57% )
10697772784 cycles # 2.480 GHz ( +- 0.22% ) (61.19%)
25511241955 instructions # 2.38 insn per cycle ( +- 0.08% ) (66.70%)
5108910557 branches # 1.184 G/sec ( +- 0.08% ) (72.38%)
2800459 branch-misses # 0.05% of all branches ( +- 0.51% ) (72.36%)
TopDownL1 # 0.60 retiring ( +- 0.09% ) (66.84%)
# 0.21 frontend_bound ( +- 0.15% ) (61.31%)
# 0.12 bad_speculation ( +- 0.08% ) (50.11%)
# 0.07 backend_bound ( +- 0.16% ) (33.30%)
8274201819 L1-dcache-loads # 1.918 G/sec ( +- 0.18% ) (33.15%)
468268 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 4.69% ) (33.16%)
385383 LLC-loads # 89.345 K/sec ( +- 5.22% ) (33.16%)
38296 LLC-load-misses # 9.94% of all LL-cache accesses ( +- 42.52% ) (38.69%)
6886576501 L1-icache-loads # 1.597 G/sec ( +- 0.35% ) (38.69%)
1848585 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 4.52% ) (44.23%)
9043645883 dTLB-loads # 2.097 G/sec ( +- 0.10% ) (44.33%)
416672 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 5.15% ) (49.89%)
6925626111 iTLB-loads # 1.606 G/sec ( +- 0.35% ) (55.46%)
66220 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.88% ) (55.50%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9372 +- 0.0526 seconds time elapsed ( +- 1.07% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10924.50 msec task-clock # 0.945 CPUs utilized ( +- 0.08% )
603 context-switches # 55.197 /sec ( +- 1.13% )
0 cpu-migrations # 0.000 /sec
566 page-faults # 51.810 /sec ( +- 0.42% )
27381270695 cycles # 2.506 GHz ( +- 0.18% ) (60.46%)
56996583922 instructions # 2.08 insn per cycle ( +- 0.21% ) (66.11%)
10321647567 branches # 944.816 M/sec ( +- 0.17% ) (71.79%)
3347735 branch-misses # 0.03% of all branches ( +- 3.72% ) (72.15%)
TopDownL1 # 0.52 retiring ( +- 0.13% ) (66.74%)
# 0.27 frontend_bound ( +- 0.14% ) (61.27%)
# 0.14 bad_speculation ( +- 0.19% ) (50.36%)
# 0.07 backend_bound ( +- 0.42% ) (33.89%)
18740797617 L1-dcache-loads # 1.715 G/sec ( +- 0.43% ) (33.71%)
13715669 L1-dcache-load-misses # 0.07% of all L1-dcache accesses ( +- 32.85% ) (33.34%)
4087551 LLC-loads # 374.164 K/sec ( +- 29.53% ) (33.26%)
267906 LLC-load-misses # 6.55% of all LL-cache accesses ( +- 23.90% ) (38.76%)
15811864229 L1-icache-loads # 1.447 G/sec ( +- 0.12% ) (38.73%)
2976833 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 9.73% ) (44.22%)
20138907471 dTLB-loads # 1.843 G/sec ( +- 0.18% ) (44.15%)
732850 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 11.18% ) (49.64%)
15895726702 iTLB-loads # 1.455 G/sec ( +- 0.15% ) (55.13%)
152075 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 4.71% ) (54.98%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.5613 +- 0.0317 seconds time elapsed ( +- 0.27% )
- After:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4278.78 msec task-clock # 0.871 CPUs utilized ( +- 0.15% )
569 context-switches # 132.982 /sec ( +- 0.58% )
0 cpu-migrations # 0.000 /sec
539 page-faults # 125.970 /sec ( +- 0.43% )
10588986432 cycles # 2.475 GHz ( +- 0.20% ) (60.91%)
25303825043 instructions # 2.39 insn per cycle ( +- 0.08% ) (66.48%)
5110756256 branches # 1.194 G/sec ( +- 0.07% ) (72.03%)
2719569 branch-misses # 0.05% of all branches ( +- 2.42% ) (72.03%)
TopDownL1 # 0.60 retiring ( +- 0.22% ) (66.31%)
# 0.22 frontend_bound ( +- 0.21% ) (60.83%)
# 0.12 bad_speculation ( +- 0.26% ) (50.25%)
# 0.06 backend_bound ( +- 0.17% ) (33.52%)
8163648527 L1-dcache-loads # 1.908 G/sec ( +- 0.33% ) (33.52%)
694979 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 30.53% ) (33.52%)
1902347 LLC-loads # 444.600 K/sec ( +- 48.84% ) (33.69%)
96677 LLC-load-misses # 5.08% of all LL-cache accesses ( +- 43.48% ) (39.30%)
6863517589 L1-icache-loads # 1.604 G/sec ( +- 0.37% ) (39.17%)
1871519 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 6.78% ) (44.56%)
8927782813 dTLB-loads # 2.087 G/sec ( +- 0.14% ) (44.37%)
438237 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 6.00% ) (49.75%)
6886906831 iTLB-loads # 1.610 G/sec ( +- 0.36% ) (55.08%)
67568 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 3.27% ) (54.86%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9114 +- 0.0309 seconds time elapsed ( +- 0.63% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10948.40 msec task-clock # 0.942 CPUs utilized ( +- 0.05% )
615 context-switches # 56.173 /sec ( +- 1.65% )
1 cpu-migrations # 0.091 /sec ( +- 31.62% )
567 page-faults # 51.788 /sec ( +- 0.44% )
27334194328 cycles # 2.497 GHz ( +- 0.08% ) (61.05%)
56656528828 instructions # 2.07 insn per cycle ( +- 0.08% ) (66.67%)
10270389422 branches # 938.072 M/sec ( +- 0.10% ) (72.21%)
3453837 branch-misses # 0.03% of all branches ( +- 3.75% ) (72.27%)
TopDownL1 # 0.52 retiring ( +- 0.16% ) (66.55%)
# 0.27 frontend_bound ( +- 0.09% ) (60.91%)
# 0.14 bad_speculation ( +- 0.08% ) (49.85%)
# 0.07 backend_bound ( +- 0.16% ) (33.33%)
18982866028 L1-dcache-loads # 1.734 G/sec ( +- 0.24% ) (33.34%)
8802454 L1-dcache-load-misses # 0.05% of all L1-dcache accesses ( +- 52.30% ) (33.31%)
2612962 LLC-loads # 238.661 K/sec ( +- 29.78% ) (33.45%)
264107 LLC-load-misses # 10.11% of all LL-cache accesses ( +- 18.34% ) (39.07%)
15793205997 L1-icache-loads # 1.443 G/sec ( +- 0.15% ) (39.09%)
3930802 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 3.72% ) (44.66%)
20097828496 dTLB-loads # 1.836 G/sec ( +- 0.09% ) (44.68%)
961757 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 3.32% ) (50.15%)
15838728506 iTLB-loads # 1.447 G/sec ( +- 0.09% ) (55.62%)
167652 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.28% ) (55.52%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.6173 +- 0.0268 seconds time elapsed ( +- 0.23% )
[1] https://lore.kernel.org/bpf/20200724123644.5096-1-maciej.fijalkowski@intel.com/
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20240826071624.350108-3-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-08-26 15:16:24 +08:00
|
|
|
ctx.arena_vm_start = bpf_arena_get_kern_vm_start(prog->aux->arena);
|
bpf, arm64: Adjust the offset of str/ldr(immediate) to positive number
The BPF STX/LDX instructions use an offset relative to the FP to address
stack space. Since BPF_FP sits at the top of the frame, the offset is
usually a negative number. However, the arm64 str/ldr immediate form
requires the offset to be a positive number. Therefore, this patch tries
to convert the offsets.
The method is to first find the negative offset furthest from the FP,
add it to the FP to compute a bottom position, called FPB, and then
rewrite the offsets of the other STX/LDX instructions relative to FPB.
FPB is kept in the arm64 callee-saved register x27, which is otherwise
unused.
Before adjusting the offsets, the patch checks every instruction to ensure
that the FP does not change at run time. If the FP may change, no offsets
are adjusted.
For example, for the following bpftrace command:
bpftrace -e 'kprobe:do_sys_open { printf("opening: %s\n", str(arg1)); }'
Without this patch, jited code (fragment):
0: bti c
4: stp x29, x30, [sp, #-16]!
8: mov x29, sp
c: stp x19, x20, [sp, #-16]!
10: stp x21, x22, [sp, #-16]!
14: stp x25, x26, [sp, #-16]!
18: mov x25, sp
1c: mov x26, #0x0 // #0
20: bti j
24: sub sp, sp, #0x90
28: add x19, x0, #0x0
2c: mov x0, #0x0 // #0
30: mov x10, #0xffffffffffffff78 // #-136
34: str x0, [x25, x10]
38: mov x10, #0xffffffffffffff80 // #-128
3c: str x0, [x25, x10]
40: mov x10, #0xffffffffffffff88 // #-120
44: str x0, [x25, x10]
48: mov x10, #0xffffffffffffff90 // #-112
4c: str x0, [x25, x10]
50: mov x10, #0xffffffffffffff98 // #-104
54: str x0, [x25, x10]
58: mov x10, #0xffffffffffffffa0 // #-96
5c: str x0, [x25, x10]
60: mov x10, #0xffffffffffffffa8 // #-88
64: str x0, [x25, x10]
68: mov x10, #0xffffffffffffffb0 // #-80
6c: str x0, [x25, x10]
70: mov x10, #0xffffffffffffffb8 // #-72
74: str x0, [x25, x10]
78: mov x10, #0xffffffffffffffc0 // #-64
7c: str x0, [x25, x10]
80: mov x10, #0xffffffffffffffc8 // #-56
84: str x0, [x25, x10]
88: mov x10, #0xffffffffffffffd0 // #-48
8c: str x0, [x25, x10]
90: mov x10, #0xffffffffffffffd8 // #-40
94: str x0, [x25, x10]
98: mov x10, #0xffffffffffffffe0 // #-32
9c: str x0, [x25, x10]
a0: mov x10, #0xffffffffffffffe8 // #-24
a4: str x0, [x25, x10]
a8: mov x10, #0xfffffffffffffff0 // #-16
ac: str x0, [x25, x10]
b0: mov x10, #0xfffffffffffffff8 // #-8
b4: str x0, [x25, x10]
b8: mov x10, #0x8 // #8
bc: ldr x2, [x19, x10]
[...]
With this patch, jited code (fragment):
0: bti c
4: stp x29, x30, [sp, #-16]!
8: mov x29, sp
c: stp x19, x20, [sp, #-16]!
10: stp x21, x22, [sp, #-16]!
14: stp x25, x26, [sp, #-16]!
18: stp x27, x28, [sp, #-16]!
1c: mov x25, sp
20: sub x27, x25, #0x88
24: mov x26, #0x0 // #0
28: bti j
2c: sub sp, sp, #0x90
30: add x19, x0, #0x0
34: mov x0, #0x0 // #0
38: str x0, [x27]
3c: str x0, [x27, #8]
40: str x0, [x27, #16]
44: str x0, [x27, #24]
48: str x0, [x27, #32]
4c: str x0, [x27, #40]
50: str x0, [x27, #48]
54: str x0, [x27, #56]
58: str x0, [x27, #64]
5c: str x0, [x27, #72]
60: str x0, [x27, #80]
64: str x0, [x27, #88]
68: str x0, [x27, #96]
6c: str x0, [x27, #104]
70: str x0, [x27, #112]
74: str x0, [x27, #120]
78: str x0, [x27, #128]
7c: ldr x2, [x19, #8]
[...]
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220321152852.2334294-4-xukuohai@huawei.com
2022-03-21 11:28:50 -04:00
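As a rough sketch of the idea above (simplified stand-ins, not the in-kernel
implementation: the real JIT works on struct bpf_insn and emits A64 encodings),
the rewrite boils down to finding the most negative FP-relative offset, anchoring
FPB there, and rebasing every other FP-relative access to a non-negative immediate:

/* Minimal model of the FPB offset rewrite; types and fields are illustrative. */
#include <stdio.h>

struct fake_insn {
        int uses_fp;   /* 1 if the insn addresses the stack via FP */
        int off;       /* FP-relative offset, usually negative */
};

/* Most negative FP-relative offset in the program; this becomes FPB - FP. */
static int find_fpb_offset(const struct fake_insn *insns, int len)
{
        int min_off = 0;

        for (int i = 0; i < len; i++)
                if (insns[i].uses_fp && insns[i].off < min_off)
                        min_off = insns[i].off;
        return min_off;
}

int main(void)
{
        struct fake_insn insns[] = { { 1, -136 }, { 1, -8 }, { 0, 8 } };
        int fpb_off = find_fpb_offset(insns, 3);   /* FPB = FP + fpb_off */

        /* Each FP-relative access is rebased to FPB with a non-negative imm. */
        for (int i = 0; i < 3; i++)
                if (insns[i].uses_fp)
                        printf("str/ldr [FPB, #%d]\n", insns[i].off - fpb_off);
        return 0;
}

With the example offsets, -136 becomes [FPB, #0] and -8 becomes [FPB, #128],
matching the positive immediates in the fragment above.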
|
|
|
|
bpf, arm64: Jit BPF_CALL to direct call when possible
Currently, BPF_CALL is always jited to an indirect call. When the target is
within the range of a direct call, BPF_CALL can be jited to a direct call.
For example, the following BPF_CALL
call __htab_map_lookup_elem
is always jited to indirect call:
mov x10, #0xffffffffffff18f4
movk x10, #0x821, lsl #16
movk x10, #0x8000, lsl #32
blr x10
When the address of target __htab_map_lookup_elem is within the range of
direct call, the BPF_CALL can be jited to:
bl 0xfffffffffd33bc98
This patch does such jit optimization by emitting arm64 direct calls for
BPF_CALL when possible, indirect calls otherwise.
Without this patch, the jit works as follows.
1. First pass
A. Determine jited position and size for each bpf instruction.
B. Compute the jited image size.
2. Allocate jited image with size computed in step 1.
3. Second pass
A. Adjust jump offset for jump instructions
B. Write the final image.
This works because, for a given bpf prog, regardless of where the jited
image is allocated, the jited result for each instruction is fixed. The
second pass differs from the first only in adjusting the jump offsets,
like changing "jmp imm1" to "jmp imm2", while the position and size of
the "jmp" instruction remain unchanged.
Now consider whether to jit a BPF_CALL to an arm64 direct or indirect call
instruction. The choice depends solely on the jump offset: a direct call
if the jump offset is within 128MB, an indirect call otherwise.
For a given BPF_CALL, the target address is known, so the jump offset is
decided by the jited address of the BPF_CALL instruction. In other words,
for a given bpf prog, the jited result for each BPF_CALL is determined
by its jited address.
The jited address for a BPF_CALL is the jited image address plus the
total jited size of all preceding instructions. For a given bpf prog,
there are clearly no BPF_CALL instructions before the first BPF_CALL
instruction. Since the jited results for all instructions other
than BPF_CALL are fixed, the total jited size preceding the first
BPF_CALL is also fixed. Therefore, once the jited image is allocated,
the jited address for the first BPF_CALL is fixed.
Now that the jited result for the first BPF_CALL is fixed, the jited
results for all instructions preceding the second BPF_CALL are fixed.
So the jited address and result for the second BPF_CALL are also fixed.
Similarly, we can conclude that the jited addresses and results for all
subsequent BPF_CALL instructions are fixed.
This means that, for a given bpf prog, once the jited image is allocated,
the jited address and result for all instructions, including all BPF_CALL
instructions, are fixed.
Based on this observation, with this patch, the jit works as follows.
1. First pass
Estimate the maximum jited image size. In this pass, all BPF_CALLs
are jited to arm64 indirect calls since the jump offsets are unknown
because the jited image is not allocated.
2. Allocate jited image with size estimated in step 1.
3. Second pass
A. Determine the jited result for each BPF_CALL.
B. Determine jited address and size for each bpf instruction.
4. Third pass
A. Adjust jump offset for jump instructions.
B. Write the final image.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20240903094407.601107-1-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-03 17:44:07 +08:00
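To make the 128MB rule concrete, here is a small standalone sketch of the range
check (the constant and helper name are illustrative and assume pc and target
are the final jited addresses; the real JIT folds this into its call emission):

/* Sketch: choose bl (direct) vs. mov/blr (indirect) from the jump offset. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SZ_128M (128 * 1024 * 1024LL)

static bool in_bl_range(uint64_t pc, uint64_t target)
{
        int64_t off = (int64_t)(target - pc);

        /* bl encodes a signed 26-bit word offset: +/-128MB, 4-byte aligned */
        return off >= -SZ_128M && off < SZ_128M && (off & 3) == 0;
}

int main(void)
{
        uint64_t pc = 0xffff800010000000ULL;

        printf("+1MB   -> %s\n", in_bl_range(pc, pc + (1 << 20)) ? "bl" : "blr");
        printf("+256MB -> %s\n", in_bl_range(pc, pc + (256LL << 20)) ? "bl" : "blr");
        return 0;
}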
|
|
|
/* Pass 1: Estimate the maximum image size.
|
2022-02-26 20:19:05 +08:00
|
|
|
*
|
|
|
|
* BPF line info needs ctx->offset[i] to be the offset of
|
|
|
|
* instruction[i] in jited image, so build prologue first.
|
|
|
|
*/
|
bpf, arm64: Avoid blindly saving/restoring all callee-saved registers
The arm64 jit blindly saves/restores all callee-saved registers, making
the jited result look a bit too complicated. For example, for an empty
prog, the jited result is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp x19, x20, [sp, #-16]!
1c: stp x21, x22, [sp, #-16]!
20: stp x26, x25, [sp, #-16]!
24: mov x26, #0
28: stp x26, x25, [sp, #-16]!
2c: mov x26, sp
30: stp x27, x28, [sp, #-16]!
34: mov x25, sp
38: bti j // tailcall target
3c: sub sp, sp, #0
40: mov x7, #0
44: add sp, sp, #0
48: ldp x27, x28, [sp], #16
4c: ldp x26, x25, [sp], #16
50: ldp x26, x25, [sp], #16
54: ldp x21, x22, [sp], #16
58: ldp x19, x20, [sp], #16
5c: ldp fp, lr, [sp], #16
60: mov x0, x7
64: autiasp
68: ret
Clearly, there is no need to save/restore unused callee-saved registers.
This patch makes that change, so that the jited image only saves/restores
the callee-saved registers it uses.
Now the jited result of empty prog is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp xzr, x26, [sp, #-16]!
1c: mov x26, sp
20: bti j // tailcall target
24: mov x7, #0
28: ldp xzr, x26, [sp], #16
2c: ldp fp, lr, [sp], #16
30: mov x0, x7
34: autiasp
38: ret
Since a bpf prog saves/restores its own callee-saved registers as needed,
to make tailcalls work correctly, the caller needs to restore its saved
registers before the tailcall, and the callee needs to save its callee-saved
registers after the tailcall. These extra restore/save instructions
increase performance overhead.
[1] provides 2 benchmarks for tailcall scenarios. Below are the perf
numbers measured in an arm64 KVM guest. The results indicate that the
performance difference before and after the patch in typical tailcall
scenarios is negligible.
- Before:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4313.43 msec task-clock # 0.874 CPUs utilized ( +- 0.16% )
574 context-switches # 133.073 /sec ( +- 1.14% )
0 cpu-migrations # 0.000 /sec
538 page-faults # 124.727 /sec ( +- 0.57% )
10697772784 cycles # 2.480 GHz ( +- 0.22% ) (61.19%)
25511241955 instructions # 2.38 insn per cycle ( +- 0.08% ) (66.70%)
5108910557 branches # 1.184 G/sec ( +- 0.08% ) (72.38%)
2800459 branch-misses # 0.05% of all branches ( +- 0.51% ) (72.36%)
TopDownL1 # 0.60 retiring ( +- 0.09% ) (66.84%)
# 0.21 frontend_bound ( +- 0.15% ) (61.31%)
# 0.12 bad_speculation ( +- 0.08% ) (50.11%)
# 0.07 backend_bound ( +- 0.16% ) (33.30%)
8274201819 L1-dcache-loads # 1.918 G/sec ( +- 0.18% ) (33.15%)
468268 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 4.69% ) (33.16%)
385383 LLC-loads # 89.345 K/sec ( +- 5.22% ) (33.16%)
38296 LLC-load-misses # 9.94% of all LL-cache accesses ( +- 42.52% ) (38.69%)
6886576501 L1-icache-loads # 1.597 G/sec ( +- 0.35% ) (38.69%)
1848585 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 4.52% ) (44.23%)
9043645883 dTLB-loads # 2.097 G/sec ( +- 0.10% ) (44.33%)
416672 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 5.15% ) (49.89%)
6925626111 iTLB-loads # 1.606 G/sec ( +- 0.35% ) (55.46%)
66220 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.88% ) (55.50%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9372 +- 0.0526 seconds time elapsed ( +- 1.07% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10924.50 msec task-clock # 0.945 CPUs utilized ( +- 0.08% )
603 context-switches # 55.197 /sec ( +- 1.13% )
0 cpu-migrations # 0.000 /sec
566 page-faults # 51.810 /sec ( +- 0.42% )
27381270695 cycles # 2.506 GHz ( +- 0.18% ) (60.46%)
56996583922 instructions # 2.08 insn per cycle ( +- 0.21% ) (66.11%)
10321647567 branches # 944.816 M/sec ( +- 0.17% ) (71.79%)
3347735 branch-misses # 0.03% of all branches ( +- 3.72% ) (72.15%)
TopDownL1 # 0.52 retiring ( +- 0.13% ) (66.74%)
# 0.27 frontend_bound ( +- 0.14% ) (61.27%)
# 0.14 bad_speculation ( +- 0.19% ) (50.36%)
# 0.07 backend_bound ( +- 0.42% ) (33.89%)
18740797617 L1-dcache-loads # 1.715 G/sec ( +- 0.43% ) (33.71%)
13715669 L1-dcache-load-misses # 0.07% of all L1-dcache accesses ( +- 32.85% ) (33.34%)
4087551 LLC-loads # 374.164 K/sec ( +- 29.53% ) (33.26%)
267906 LLC-load-misses # 6.55% of all LL-cache accesses ( +- 23.90% ) (38.76%)
15811864229 L1-icache-loads # 1.447 G/sec ( +- 0.12% ) (38.73%)
2976833 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 9.73% ) (44.22%)
20138907471 dTLB-loads # 1.843 G/sec ( +- 0.18% ) (44.15%)
732850 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 11.18% ) (49.64%)
15895726702 iTLB-loads # 1.455 G/sec ( +- 0.15% ) (55.13%)
152075 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 4.71% ) (54.98%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.5613 +- 0.0317 seconds time elapsed ( +- 0.27% )
- After:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4278.78 msec task-clock # 0.871 CPUs utilized ( +- 0.15% )
569 context-switches # 132.982 /sec ( +- 0.58% )
0 cpu-migrations # 0.000 /sec
539 page-faults # 125.970 /sec ( +- 0.43% )
10588986432 cycles # 2.475 GHz ( +- 0.20% ) (60.91%)
25303825043 instructions # 2.39 insn per cycle ( +- 0.08% ) (66.48%)
5110756256 branches # 1.194 G/sec ( +- 0.07% ) (72.03%)
2719569 branch-misses # 0.05% of all branches ( +- 2.42% ) (72.03%)
TopDownL1 # 0.60 retiring ( +- 0.22% ) (66.31%)
# 0.22 frontend_bound ( +- 0.21% ) (60.83%)
# 0.12 bad_speculation ( +- 0.26% ) (50.25%)
# 0.06 backend_bound ( +- 0.17% ) (33.52%)
8163648527 L1-dcache-loads # 1.908 G/sec ( +- 0.33% ) (33.52%)
694979 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 30.53% ) (33.52%)
1902347 LLC-loads # 444.600 K/sec ( +- 48.84% ) (33.69%)
96677 LLC-load-misses # 5.08% of all LL-cache accesses ( +- 43.48% ) (39.30%)
6863517589 L1-icache-loads # 1.604 G/sec ( +- 0.37% ) (39.17%)
1871519 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 6.78% ) (44.56%)
8927782813 dTLB-loads # 2.087 G/sec ( +- 0.14% ) (44.37%)
438237 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 6.00% ) (49.75%)
6886906831 iTLB-loads # 1.610 G/sec ( +- 0.36% ) (55.08%)
67568 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 3.27% ) (54.86%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9114 +- 0.0309 seconds time elapsed ( +- 0.63% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10948.40 msec task-clock # 0.942 CPUs utilized ( +- 0.05% )
615 context-switches # 56.173 /sec ( +- 1.65% )
1 cpu-migrations # 0.091 /sec ( +- 31.62% )
567 page-faults # 51.788 /sec ( +- 0.44% )
27334194328 cycles # 2.497 GHz ( +- 0.08% ) (61.05%)
56656528828 instructions # 2.07 insn per cycle ( +- 0.08% ) (66.67%)
10270389422 branches # 938.072 M/sec ( +- 0.10% ) (72.21%)
3453837 branch-misses # 0.03% of all branches ( +- 3.75% ) (72.27%)
TopDownL1 # 0.52 retiring ( +- 0.16% ) (66.55%)
# 0.27 frontend_bound ( +- 0.09% ) (60.91%)
# 0.14 bad_speculation ( +- 0.08% ) (49.85%)
# 0.07 backend_bound ( +- 0.16% ) (33.33%)
18982866028 L1-dcache-loads # 1.734 G/sec ( +- 0.24% ) (33.34%)
8802454 L1-dcache-load-misses # 0.05% of all L1-dcache accesses ( +- 52.30% ) (33.31%)
2612962 LLC-loads # 238.661 K/sec ( +- 29.78% ) (33.45%)
264107 LLC-load-misses # 10.11% of all LL-cache accesses ( +- 18.34% ) (39.07%)
15793205997 L1-icache-loads # 1.443 G/sec ( +- 0.15% ) (39.09%)
3930802 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 3.72% ) (44.66%)
20097828496 dTLB-loads # 1.836 G/sec ( +- 0.09% ) (44.68%)
961757 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 3.32% ) (50.15%)
15838728506 iTLB-loads # 1.447 G/sec ( +- 0.09% ) (55.62%)
167652 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.28% ) (55.52%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.6173 +- 0.0268 seconds time elapsed ( +- 0.23% )
[1] https://lore.kernel.org/bpf/20200724123644.5096-1-maciej.fijalkowski@intel.com/
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20240826071624.350108-3-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-08-26 15:16:24 +08:00
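The scan that decides which registers the prologue must save can be pictured as
below. This is only an illustration of the idea: the instruction fields are
simplified, and the BPF R6-R9/FP to x19-x22/x25 names follow the conventional
mapping seen in the listings above rather than quoting the kernel's tables:

/* Sketch: mark used BPF callee-saved registers, save only those. */
#include <stdbool.h>
#include <stdio.h>

#define NREGS 5   /* R6, R7, R8, R9, FP */

struct fake_insn { int dst_reg; int src_reg; };

static void mark_used(bool *used, int reg)
{
        if (reg >= 6 && reg <= 10)   /* BPF R6..R9 and FP (R10) */
                used[reg - 6] = true;
}

int main(void)
{
        static const char *name[NREGS] = { "x19", "x20", "x21", "x22", "x25" };
        struct fake_insn insns[] = { { 6, 1 }, { 0, 6 } };  /* only R6 used */
        bool used[NREGS] = { false };

        for (int i = 0; i < 2; i++) {
                mark_used(used, insns[i].dst_reg);
                mark_used(used, insns[i].src_reg);
        }
        for (int i = 0; i < NREGS; i++)
                if (used[i])
                        printf("prologue saves %s\n", name[i]);
        return 0;
}

For the empty prog in the commit message no such register is used, which matches
the much shorter prologue in the "after" listing above.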
|
|
|
if (build_prologue(&ctx, was_classic)) {
|
2016-05-13 19:08:34 +02:00
|
|
|
prog = orig_prog;
|
|
|
|
goto out_off;
|
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
|
2022-02-26 20:19:05 +08:00
|
|
|
if (build_body(&ctx, extra_pass)) {
|
arm64: bpf: implement bpf_tail_call() helper
Add support for JMP_CALL_X (tail call) introduced by commit 04fd61ab36ec
("bpf: allow bpf programs to tail-call other bpf programs").
bpf_tail_call() arguments:
ctx - context pointer passed to next program
array - pointer to map which type is BPF_MAP_TYPE_PROG_ARRAY
index - index inside array that selects specific program to run
In this implementation arm64 JIT jumps into callee program after prologue,
so callee program reuses the same stack. For tail_call_cnt, we use the
callee-saved R26 (which was already saved/restored but previously unused
by JIT).
With this patch a tail call generates the following code on arm64:
if (index >= array->map.max_entries)
goto out;
34: mov x10, #0x10 // #16
38: ldr w10, [x1,x10]
3c: cmp w2, w10
40: b.ge 0x0000000000000074
if (tail_call_cnt > MAX_TAIL_CALL_CNT)
goto out;
tail_call_cnt++;
44: mov x10, #0x20 // #32
48: cmp x26, x10
4c: b.gt 0x0000000000000074
50: add x26, x26, #0x1
prog = array->ptrs[index];
if (prog == NULL)
goto out;
54: mov x10, #0x68 // #104
58: ldr x10, [x1,x10]
5c: ldr x11, [x10,x2]
60: cbz x11, 0x0000000000000074
goto *(prog->bpf_func + prologue_size);
64: mov x10, #0x20 // #32
68: ldr x10, [x11,x10]
6c: add x10, x10, #0x20
70: br x10
74:
Signed-off-by: Zi Shen Lim <zlim.lnx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08 21:18:48 -07:00
|
|
|
prog = orig_prog;
|
|
|
|
goto out_off;
|
|
|
|
}
|
2014-12-03 08:38:01 +00:00
|
|
|
|
|
|
|
ctx.epilogue_offset = ctx.idx;
|
bpf, arm64: Avoid blindly saving/restoring all callee-saved registers
The arm64 jit blindly saves/restores all callee-saved registers, making
the jited result look a bit too complicated. For example, for an empty
prog, the jited result is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp x19, x20, [sp, #-16]!
1c: stp x21, x22, [sp, #-16]!
20: stp x26, x25, [sp, #-16]!
24: mov x26, #0
28: stp x26, x25, [sp, #-16]!
2c: mov x26, sp
30: stp x27, x28, [sp, #-16]!
34: mov x25, sp
38: bti j // tailcall target
3c: sub sp, sp, #0
40: mov x7, #0
44: add sp, sp, #0
48: ldp x27, x28, [sp], #16
4c: ldp x26, x25, [sp], #16
50: ldp x26, x25, [sp], #16
54: ldp x21, x22, [sp], #16
58: ldp x19, x20, [sp], #16
5c: ldp fp, lr, [sp], #16
60: mov x0, x7
64: autiasp
68: ret
Clearly, there is no need to save/restore unused callee-saved registers.
This patch makes that change, so that the jited image only saves/restores
the callee-saved registers it uses.
Now the jited result of empty prog is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp xzr, x26, [sp, #-16]!
1c: mov x26, sp
20: bti j // tailcall target
24: mov x7, #0
28: ldp xzr, x26, [sp], #16
2c: ldp fp, lr, [sp], #16
30: mov x0, x7
34: autiasp
38: ret
Since a bpf prog saves/restores its own callee-saved registers as needed,
to make tailcalls work correctly, the caller needs to restore its saved
registers before the tailcall, and the callee needs to save its callee-saved
registers after the tailcall. These extra restore/save instructions
increase performance overhead.
[1] provides 2 benchmarks for tailcall scenarios. Below are the perf
numbers measured in an arm64 KVM guest. The results indicate that the
performance difference before and after the patch in typical tailcall
scenarios is negligible.
- Before:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4313.43 msec task-clock # 0.874 CPUs utilized ( +- 0.16% )
574 context-switches # 133.073 /sec ( +- 1.14% )
0 cpu-migrations # 0.000 /sec
538 page-faults # 124.727 /sec ( +- 0.57% )
10697772784 cycles # 2.480 GHz ( +- 0.22% ) (61.19%)
25511241955 instructions # 2.38 insn per cycle ( +- 0.08% ) (66.70%)
5108910557 branches # 1.184 G/sec ( +- 0.08% ) (72.38%)
2800459 branch-misses # 0.05% of all branches ( +- 0.51% ) (72.36%)
TopDownL1 # 0.60 retiring ( +- 0.09% ) (66.84%)
# 0.21 frontend_bound ( +- 0.15% ) (61.31%)
# 0.12 bad_speculation ( +- 0.08% ) (50.11%)
# 0.07 backend_bound ( +- 0.16% ) (33.30%)
8274201819 L1-dcache-loads # 1.918 G/sec ( +- 0.18% ) (33.15%)
468268 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 4.69% ) (33.16%)
385383 LLC-loads # 89.345 K/sec ( +- 5.22% ) (33.16%)
38296 LLC-load-misses # 9.94% of all LL-cache accesses ( +- 42.52% ) (38.69%)
6886576501 L1-icache-loads # 1.597 G/sec ( +- 0.35% ) (38.69%)
1848585 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 4.52% ) (44.23%)
9043645883 dTLB-loads # 2.097 G/sec ( +- 0.10% ) (44.33%)
416672 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 5.15% ) (49.89%)
6925626111 iTLB-loads # 1.606 G/sec ( +- 0.35% ) (55.46%)
66220 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.88% ) (55.50%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9372 +- 0.0526 seconds time elapsed ( +- 1.07% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10924.50 msec task-clock # 0.945 CPUs utilized ( +- 0.08% )
603 context-switches # 55.197 /sec ( +- 1.13% )
0 cpu-migrations # 0.000 /sec
566 page-faults # 51.810 /sec ( +- 0.42% )
27381270695 cycles # 2.506 GHz ( +- 0.18% ) (60.46%)
56996583922 instructions # 2.08 insn per cycle ( +- 0.21% ) (66.11%)
10321647567 branches # 944.816 M/sec ( +- 0.17% ) (71.79%)
3347735 branch-misses # 0.03% of all branches ( +- 3.72% ) (72.15%)
TopDownL1 # 0.52 retiring ( +- 0.13% ) (66.74%)
# 0.27 frontend_bound ( +- 0.14% ) (61.27%)
# 0.14 bad_speculation ( +- 0.19% ) (50.36%)
# 0.07 backend_bound ( +- 0.42% ) (33.89%)
18740797617 L1-dcache-loads # 1.715 G/sec ( +- 0.43% ) (33.71%)
13715669 L1-dcache-load-misses # 0.07% of all L1-dcache accesses ( +- 32.85% ) (33.34%)
4087551 LLC-loads # 374.164 K/sec ( +- 29.53% ) (33.26%)
267906 LLC-load-misses # 6.55% of all LL-cache accesses ( +- 23.90% ) (38.76%)
15811864229 L1-icache-loads # 1.447 G/sec ( +- 0.12% ) (38.73%)
2976833 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 9.73% ) (44.22%)
20138907471 dTLB-loads # 1.843 G/sec ( +- 0.18% ) (44.15%)
732850 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 11.18% ) (49.64%)
15895726702 iTLB-loads # 1.455 G/sec ( +- 0.15% ) (55.13%)
152075 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 4.71% ) (54.98%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.5613 +- 0.0317 seconds time elapsed ( +- 0.27% )
- After:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4278.78 msec task-clock # 0.871 CPUs utilized ( +- 0.15% )
569 context-switches # 132.982 /sec ( +- 0.58% )
0 cpu-migrations # 0.000 /sec
539 page-faults # 125.970 /sec ( +- 0.43% )
10588986432 cycles # 2.475 GHz ( +- 0.20% ) (60.91%)
25303825043 instructions # 2.39 insn per cycle ( +- 0.08% ) (66.48%)
5110756256 branches # 1.194 G/sec ( +- 0.07% ) (72.03%)
2719569 branch-misses # 0.05% of all branches ( +- 2.42% ) (72.03%)
TopDownL1 # 0.60 retiring ( +- 0.22% ) (66.31%)
# 0.22 frontend_bound ( +- 0.21% ) (60.83%)
# 0.12 bad_speculation ( +- 0.26% ) (50.25%)
# 0.06 backend_bound ( +- 0.17% ) (33.52%)
8163648527 L1-dcache-loads # 1.908 G/sec ( +- 0.33% ) (33.52%)
694979 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 30.53% ) (33.52%)
1902347 LLC-loads # 444.600 K/sec ( +- 48.84% ) (33.69%)
96677 LLC-load-misses # 5.08% of all LL-cache accesses ( +- 43.48% ) (39.30%)
6863517589 L1-icache-loads # 1.604 G/sec ( +- 0.37% ) (39.17%)
1871519 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 6.78% ) (44.56%)
8927782813 dTLB-loads # 2.087 G/sec ( +- 0.14% ) (44.37%)
438237 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 6.00% ) (49.75%)
6886906831 iTLB-loads # 1.610 G/sec ( +- 0.36% ) (55.08%)
67568 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 3.27% ) (54.86%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9114 +- 0.0309 seconds time elapsed ( +- 0.63% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10948.40 msec task-clock # 0.942 CPUs utilized ( +- 0.05% )
615 context-switches # 56.173 /sec ( +- 1.65% )
1 cpu-migrations # 0.091 /sec ( +- 31.62% )
567 page-faults # 51.788 /sec ( +- 0.44% )
27334194328 cycles # 2.497 GHz ( +- 0.08% ) (61.05%)
56656528828 instructions # 2.07 insn per cycle ( +- 0.08% ) (66.67%)
10270389422 branches # 938.072 M/sec ( +- 0.10% ) (72.21%)
3453837 branch-misses # 0.03% of all branches ( +- 3.75% ) (72.27%)
TopDownL1 # 0.52 retiring ( +- 0.16% ) (66.55%)
# 0.27 frontend_bound ( +- 0.09% ) (60.91%)
# 0.14 bad_speculation ( +- 0.08% ) (49.85%)
# 0.07 backend_bound ( +- 0.16% ) (33.33%)
18982866028 L1-dcache-loads # 1.734 G/sec ( +- 0.24% ) (33.34%)
8802454 L1-dcache-load-misses # 0.05% of all L1-dcache accesses ( +- 52.30% ) (33.31%)
2612962 LLC-loads # 238.661 K/sec ( +- 29.78% ) (33.45%)
264107 LLC-load-misses # 10.11% of all LL-cache accesses ( +- 18.34% ) (39.07%)
15793205997 L1-icache-loads # 1.443 G/sec ( +- 0.15% ) (39.09%)
3930802 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 3.72% ) (44.66%)
20097828496 dTLB-loads # 1.836 G/sec ( +- 0.09% ) (44.68%)
961757 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 3.32% ) (50.15%)
15838728506 iTLB-loads # 1.447 G/sec ( +- 0.09% ) (55.62%)
167652 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.28% ) (55.52%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.6173 +- 0.0268 seconds time elapsed ( +- 0.23% )
[1] https://lore.kernel.org/bpf/20200724123644.5096-1-maciej.fijalkowski@intel.com/
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20240826071624.350108-3-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-08-26 15:16:24 +08:00
|
|
|
build_epilogue(&ctx);
|
bpf, arm64: Implement bpf_arch_text_poke() for arm64
Implement bpf_arch_text_poke() for arm64, so bpf prog or bpf trampoline
can be patched with it.
When the target address is NULL, the original instruction is patched to
a NOP.
When the target address and the source address are within the branch
range, the original instruction is patched to a bl instruction to the
target address directly.
To support attaching bpf trampoline to both regular kernel function and
bpf prog, we follow the ftrace patchsite way for bpf prog. That is, two
instructions are inserted at the beginning of bpf prog, the first one
saves the return address to x9, and the second is a nop which will be
patched to a bl instruction when a bpf trampoline is attached.
However, when a bpf trampoline is attached to bpf prog, the distance
between target address and source address may exceed 128MB, the maximum
branch range, because bpf trampoline and bpf prog are allocated
separately with vmalloc. So long jump should be handled.
When a bpf prog is constructed, a plt pointing to empty trampoline
dummy_tramp is placed at the end:
bpf_prog:
mov x9, lr
nop // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad dummy_tramp // plt target
This is also the state when no trampoline is attached.
When a short-jump bpf trampoline is attached, the patchsite is patched to
a bl instruction to the trampoline directly:
bpf_prog:
mov x9, lr
bl <short-jump bpf trampoline address> // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad dummy_tramp // plt target
When a long-jump bpf trampoline is attached, the plt target is filled with
the trampoline address and the patchsite is patched to a bl instruction to
the plt:
bpf_prog:
mov x9, lr
bl plt // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad <long-jump bpf trampoline address>
dummy_tramp is used to prevent another CPU from jumping to an unknown
location during the patching process, making the patching process easier.
The patching process is as follows:
1. when neither the old address nor the new address is a long jump, the
patchsite is replaced with a bl to the new address, or a nop if the new
address is NULL;
2. when the old address is not a long jump but the new one is, the
branch target address is written to the plt first, then the patchsite
is replaced with a bl instruction to the plt;
3. when the old address is a long jump but the new one is not, the address
of dummy_tramp is written to the plt first, then the patchsite is replaced
with a bl to the new address, or a nop if the new address is NULL;
4. when both the old address and the new address are long jumps, the
new address is written to the plt and the patchsite is not changed.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20220711150823.2128542-4-xukuohai@huawei.com
2022-07-11 11:08:22 -04:00
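The four patching cases can be summarized in a few lines of C. This is a
behavioural sketch only: the real code patches instructions in place and
synchronizes the plt and patchsite writes, which this model ignores, and the
names here are illustrative:

/* Behavioural model of the 4 patching cases; not the kernel implementation. */
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct patch_site {
        uint64_t plt_target;   /* the .quad slot behind the plt */
        uint64_t bl_target;    /* where the patchsite bl points (0 = nop) */
};

static void poke(struct patch_site *s, bool old_is_long, bool new_is_long,
                 uint64_t new_addr, uint64_t plt_addr, uint64_t dummy_tramp)
{
        if (!old_is_long && !new_is_long) {        /* case 1 */
                s->bl_target = new_addr;           /* bl new, or nop if NULL */
        } else if (!old_is_long && new_is_long) {  /* case 2 */
                s->plt_target = new_addr;          /* fill plt first */
                s->bl_target = plt_addr;           /* then bl to the plt */
        } else if (old_is_long && !new_is_long) {  /* case 3 */
                s->plt_target = dummy_tramp;       /* park the plt first */
                s->bl_target = new_addr;           /* bl new, or nop if NULL */
        } else {                                   /* case 4 */
                s->plt_target = new_addr;          /* patchsite keeps bl plt */
        }
}

int main(void)
{
        struct patch_site s = { .plt_target = 0, .bl_target = 0 };

        poke(&s, false, true, 0x1234, 0xf000, 0xdead);   /* case 2 */
        printf("plt=%" PRIx64 " bl=%" PRIx64 "\n", s.plt_target, s.bl_target);
        return 0;
}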
|
|
|
build_plt(&ctx);
|
2014-08-26 21:15:30 -07:00
|
|
|
|
bpf, arm64: Implement bpf_arch_text_poke() for arm64
Implement bpf_arch_text_poke() for arm64, so bpf prog or bpf trampoline
can be patched with it.
When the target address is NULL, the original instruction is patched to
a NOP.
When the target address and the source address are within the branch
range, the original instruction is patched to a bl instruction to the
target address directly.
To support attaching bpf trampoline to both regular kernel function and
bpf prog, we follow the ftrace patchsite way for bpf prog. That is, two
instructions are inserted at the beginning of bpf prog, the first one
saves the return address to x9, and the second is a nop which will be
patched to a bl instruction when a bpf trampoline is attached.
However, when a bpf trampoline is attached to bpf prog, the distance
between target address and source address may exceed 128MB, the maximum
branch range, because bpf trampoline and bpf prog are allocated
separately with vmalloc. So long jump should be handled.
When a bpf prog is constructed, a plt pointing to empty trampoline
dummy_tramp is placed at the end:
bpf_prog:
mov x9, lr
nop // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad dummy_tramp // plt target
This is also the state when no trampoline is attached.
When a short-jump bpf trampoline is attached, the patchsite is patched to
a bl instruction to the trampoline directly:
bpf_prog:
mov x9, lr
bl <short-jump bpf trampoline address> // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad dummy_tramp // plt target
When a long-jump bpf trampoline is attached, the plt target is filled with
the trampoline address and the patchsite is patched to a bl instruction to
the plt:
bpf_prog:
mov x9, lr
bl plt // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad <long-jump bpf trampoline address>
dummy_tramp is used to prevent another CPU from jumping to an unknown
location during the patching process, making the patching process easier.
The patching process is as follows:
1. when neither the old address nor the new address is a long jump, the
patchsite is replaced with a bl to the new address, or a nop if the new
address is NULL;
2. when the old address is not a long jump but the new one is, the
branch target address is written to the plt first, then the patchsite
is replaced with a bl instruction to the plt;
3. when the old address is a long jump but the new one is not, the address
of dummy_tramp is written to the plt first, then the patchsite is replaced
with a bl to the new address, or a nop if the new address is NULL;
4. when both the old address and the new address are long jumps, the
new address is written to the plt and the patchsite is not changed.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20220711150823.2128542-4-xukuohai@huawei.com
2022-07-11 11:08:22 -04:00
|
|
|
extable_align = __alignof__(struct exception_table_entry);
|
2020-07-28 17:21:26 +02:00
|
|
|
extable_size = prog->aux->num_exentries *
|
|
|
|
sizeof(struct exception_table_entry);
|
|
|
|
|
bpf, arm64: Jit BPF_CALL to direct call when possible
Currently, BPF_CALL is always jited to an indirect call. When the target is
within the range of a direct call, BPF_CALL can be jited to a direct call.
For example, the following BPF_CALL
call __htab_map_lookup_elem
is always jited to indirect call:
mov x10, #0xffffffffffff18f4
movk x10, #0x821, lsl #16
movk x10, #0x8000, lsl #32
blr x10
When the address of target __htab_map_lookup_elem is within the range of
direct call, the BPF_CALL can be jited to:
bl 0xfffffffffd33bc98
This patch does such jit optimization by emitting arm64 direct calls for
BPF_CALL when possible, indirect calls otherwise.
Without this patch, the jit works as follows.
1. First pass
A. Determine jited position and size for each bpf instruction.
B. Compute the jited image size.
2. Allocate jited image with size computed in step 1.
3. Second pass
A. Adjust jump offset for jump instructions
B. Write the final image.
This works because, for a given bpf prog, regardless of where the jited
image is allocated, the jited result for each instruction is fixed. The
second pass differs from the first only in adjusting the jump offsets,
like changing "jmp imm1" to "jmp imm2", while the position and size of
the "jmp" instruction remain unchanged.
Now consider whether to jit a BPF_CALL to an arm64 direct or indirect call
instruction. The choice depends solely on the jump offset: a direct call
if the jump offset is within 128MB, an indirect call otherwise.
For a given BPF_CALL, the target address is known, so the jump offset is
decided by the jited address of the BPF_CALL instruction. In other words,
for a given bpf prog, the jited result for each BPF_CALL is determined
by its jited address.
The jited address for a BPF_CALL is the jited image address plus the
total jited size of all preceding instructions. For a given bpf prog,
there are clearly no BPF_CALL instructions before the first BPF_CALL
instruction. Since the jited results for all instructions other
than BPF_CALL are fixed, the total jited size preceding the first
BPF_CALL is also fixed. Therefore, once the jited image is allocated,
the jited address for the first BPF_CALL is fixed.
Now that the jited result for the first BPF_CALL is fixed, the jited
results for all instructions preceding the second BPF_CALL are fixed.
So the jited address and result for the second BPF_CALL are also fixed.
Similarly, we can conclude that the jited addresses and results for all
subsequent BPF_CALL instructions are fixed.
This means that, for a given bpf prog, once the jited image is allocated,
the jited address and result for all instructions, including all BPF_CALL
instructions, are fixed.
Based on this observation, with this patch, the jit works as follows.
1. First pass
Estimate the maximum jited image size. In this pass, all BPF_CALLs
are jited to arm64 indirect calls since the jump offsets are unknown
because the jited image is not allocated.
2. Allocate jited image with size estimated in step 1.
3. Second pass
A. Determine the jited result for each BPF_CALL.
B. Determine jited address and size for each bpf instruction.
4. Third pass
A. Adjust jump offset for jump instructions.
B. Write the final image.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20240903094407.601107-1-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-03 17:44:07 +08:00
|
|
|
/* Now we know the maximum image size. */
|
2020-07-28 17:21:26 +02:00
|
|
|
prog_size = sizeof(u32) * ctx.idx;
|
bpf, arm64: Implement bpf_arch_text_poke() for arm64
Implement bpf_arch_text_poke() for arm64, so bpf prog or bpf trampoline
can be patched with it.
When the target address is NULL, the original instruction is patched to
a NOP.
When the target address and the source address are within the branch
range, the original instruction is patched to a bl instruction to the
target address directly.
To support attaching bpf trampoline to both regular kernel function and
bpf prog, we follow the ftrace patchsite way for bpf prog. That is, two
instructions are inserted at the beginning of bpf prog, the first one
saves the return address to x9, and the second is a nop which will be
patched to a bl instruction when a bpf trampoline is attached.
However, when a bpf trampoline is attached to bpf prog, the distance
between target address and source address may exceed 128MB, the maximum
branch range, because bpf trampoline and bpf prog are allocated
separately with vmalloc. So long jump should be handled.
When a bpf prog is constructed, a plt pointing to empty trampoline
dummy_tramp is placed at the end:
bpf_prog:
mov x9, lr
nop // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad dummy_tramp // plt target
This is also the state when no trampoline is attached.
When a short-jump bpf trampoline is attached, the patchsite is patched to
a bl instruction to the trampoline directly:
bpf_prog:
mov x9, lr
bl <short-jump bpf trampoline address> // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad dummy_tramp // plt target
When a long-jump bpf trampoline is attached, the plt target is filled with
the trampoline address and the patchsite is patched to a bl instruction to
the plt:
bpf_prog:
mov x9, lr
bl plt // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad <long-jump bpf trampoline address>
dummy_tramp is used to prevent another CPU from jumping to an unknown
location during the patching process, making the patching process easier.
The patching process is as follows:
1. when neither the old address nor the new address is a long jump, the
patchsite is replaced with a bl to the new address, or a nop if the new
address is NULL;
2. when the old address is not a long jump but the new one is, the
branch target address is written to the plt first, then the patchsite
is replaced with a bl instruction to the plt;
3. when the old address is a long jump but the new one is not, the address
of dummy_tramp is written to the plt first, then the patchsite is replaced
with a bl to the new address, or a nop if the new address is NULL;
4. when both the old address and the new address are long jumps, the
new address is written to the plt and the patchsite is not changed.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20220711150823.2128542-4-xukuohai@huawei.com
2022-07-11 11:08:22 -04:00
|
|
|
/* also allocate space for plt target */
|
|
|
|
extable_offset = round_up(prog_size + PLT_TARGET_SIZE, extable_align);
|
|
|
|
image_size = extable_offset + extable_size;
|
2024-02-28 14:18:24 +00:00
|
|
|
ro_header = bpf_jit_binary_pack_alloc(image_size, &ro_image_ptr,
|
|
|
|
sizeof(u32), &header, &image_ptr,
|
|
|
|
jit_fill_hole);
|
|
|
|
if (!ro_header) {
|
2016-05-13 19:08:34 +02:00
|
|
|
prog = orig_prog;
|
|
|
|
goto out_off;
|
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
|
bpf, arm64: Jit BPF_CALL to direct call when possible
Currently, BPF_CALL is always jited to an indirect call. When the target is
within the range of a direct call, BPF_CALL can be jited to a direct call.
For example, the following BPF_CALL
call __htab_map_lookup_elem
is always jited to indirect call:
mov x10, #0xffffffffffff18f4
movk x10, #0x821, lsl #16
movk x10, #0x8000, lsl #32
blr x10
When the address of target __htab_map_lookup_elem is within the range of
direct call, the BPF_CALL can be jited to:
bl 0xfffffffffd33bc98
This patch does such jit optimization by emitting arm64 direct calls for
BPF_CALL when possible, indirect calls otherwise.
Without this patch, the jit works as follows.
1. First pass
A. Determine jited position and size for each bpf instruction.
B. Compute the jited image size.
2. Allocate jited image with size computed in step 1.
3. Second pass
A. Adjust jump offset for jump instructions
B. Write the final image.
This works because, for a given bpf prog, regardless of where the jited
image is allocated, the jited result for each instruction is fixed. The
second pass differs from the first only in adjusting the jump offsets,
like changing "jmp imm1" to "jmp imm2", while the position and size of
the "jmp" instruction remain unchanged.
Now consider whether to jit a BPF_CALL to an arm64 direct or indirect call
instruction. The choice depends solely on the jump offset: a direct call
if the jump offset is within 128MB, an indirect call otherwise.
For a given BPF_CALL, the target address is known, so the jump offset is
decided by the jited address of the BPF_CALL instruction. In other words,
for a given bpf prog, the jited result for each BPF_CALL is determined
by its jited address.
The jited address for a BPF_CALL is the jited image address plus the
total jited size of all preceding instructions. For a given bpf prog,
there are clearly no BPF_CALL instructions before the first BPF_CALL
instruction. Since the jited results for all instructions other
than BPF_CALL are fixed, the total jited size preceding the first
BPF_CALL is also fixed. Therefore, once the jited image is allocated,
the jited address for the first BPF_CALL is fixed.
Now that the jited result for the first BPF_CALL is fixed, the jited
results for all instructions preceding the second BPF_CALL are fixed.
So the jited address and result for the second BPF_CALL are also fixed.
Similarly, we can conclude that the jited addresses and results for all
subsequent BPF_CALL instructions are fixed.
This means that, for a given bpf prog, once the jited image is allocated,
the jited address and result for all instructions, including all BPF_CALL
instructions, are fixed.
Based on this observation, with this patch, the jit works as follows.
1. First pass
Estimate the maximum jited image size. In this pass, all BPF_CALLs
are jited to arm64 indirect calls since the jump offsets are unknown
because the jited image is not allocated.
2. Allocate jited image with size estimated in step 1.
3. Second pass
A. Determine the jited result for each BPF_CALL.
B. Determine jited address and size for each bpf instruction.
4. Third pass
A. Adjust jump offset for jump instructions.
B. Write the final image.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20240903094407.601107-1-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-03 17:44:07 +08:00
|
|
|
/* Pass 2: Determine jited position and result for each instruction */
|
2014-08-26 21:15:30 -07:00
|
|
|
|
2024-02-28 14:18:24 +00:00
|
|
|
/*
|
|
|
|
* Use the image(RW) for writing the JITed instructions. But also save
|
|
|
|
* the ro_image(RX) for calculating the offsets in the image. The RW
|
|
|
|
* image will be later copied to the RX image from where the program
|
|
|
|
* will run. The bpf_jit_binary_pack_finalize() will do this copy in the
|
|
|
|
* final step.
|
|
|
|
*/
|
2017-06-28 16:58:03 +02:00
|
|
|
ctx.image = (__le32 *)image_ptr;
|
2024-02-28 14:18:24 +00:00
|
|
|
ctx.ro_image = (__le32 *)ro_image_ptr;
|
2020-07-28 17:21:26 +02:00
|
|
|
if (extable_size)
|
2024-02-28 14:18:24 +00:00
|
|
|
prog->aux->extable = (void *)ro_image_ptr + extable_offset;
|
2017-12-14 17:55:16 -08:00
|
|
|
skip_init_ctx:
|
2014-08-26 21:15:30 -07:00
|
|
|
ctx.idx = 0;
|
2020-07-28 17:21:26 +02:00
|
|
|
ctx.exentry_idx = 0;
|
bpf, arm64: Jit BPF_CALL to direct call when possible
Currently, BPF_CALL is always jited to an indirect call. When the target is
within the range of a direct call, BPF_CALL can be jited to a direct call.
For example, the following BPF_CALL
call __htab_map_lookup_elem
is always jited to indirect call:
mov x10, #0xffffffffffff18f4
movk x10, #0x821, lsl #16
movk x10, #0x8000, lsl #32
blr x10
When the address of target __htab_map_lookup_elem is within the range of
direct call, the BPF_CALL can be jited to:
bl 0xfffffffffd33bc98
This patch does such jit optimization by emitting arm64 direct calls for
BPF_CALL when possible, indirect calls otherwise.
Without this patch, the jit works as follows.
1. First pass
A. Determine jited position and size for each bpf instruction.
B. Compute the jited image size.
2. Allocate jited image with size computed in step 1.
3. Second pass
A. Adjust jump offset for jump instructions
B. Write the final image.
This works because, for a given bpf prog, regardless of where the jited
image is allocated, the jited result for each instruction is fixed. The
second pass differs from the first only in adjusting the jump offsets,
like changing "jmp imm1" to "jmp imm2", while the position and size of
the "jmp" instruction remain unchanged.
Now consider whether to jit a BPF_CALL to an arm64 direct or indirect call
instruction. The choice depends solely on the jump offset: a direct call
if the jump offset is within 128MB, an indirect call otherwise.
For a given BPF_CALL, the target address is known, so the jump offset is
decided by the jited address of the BPF_CALL instruction. In other words,
for a given bpf prog, the jited result for each BPF_CALL is determined
by its jited address.
The jited address for a BPF_CALL is the jited image address plus the
total jited size of all preceding instructions. For a given bpf prog,
there are clearly no BPF_CALL instructions before the first BPF_CALL
instruction. Since the jited results for all instructions other
than BPF_CALL are fixed, the total jited size preceding the first
BPF_CALL is also fixed. Therefore, once the jited image is allocated,
the jited address for the first BPF_CALL is fixed.
Now that the jited result for the first BPF_CALL is fixed, the jited
results for all instructions preceding the second BPF_CALL are fixed.
So the jited address and result for the second BPF_CALL are also fixed.
Similarly, we can conclude that the jited addresses and results for all
subsequent BPF_CALL instructions are fixed.
This means that, for a given bpf prog, once the jited image is allocated,
the jited address and result for all instructions, including all BPF_CALL
instructions, are fixed.
Based on this observation, with this patch, the jit works as follows.
1. First pass
Estimate the maximum jited image size. In this pass, all BPF_CALLs
are jited to arm64 indirect calls since the jump offsets are unknown
because the jited image is not allocated.
2. Allocate jited image with size estimated in step 1.
3. Second pass
A. Determine the jited result for each BPF_CALL.
B. Determine jited address and size for each bpf instruction.
4. Third pass
A. Adjust jump offset for jump instructions.
B. Write the final image.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20240903094407.601107-1-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-03 17:44:07 +08:00
|
|
|
ctx.write = true;
|
2014-09-16 08:48:50 +01:00
|
|
|
|
bpf, arm64: Avoid blindly saving/restoring all callee-saved registers
The arm64 jit blindly saves/restores all callee-saved registers, making
the jited result look a bit too complicated. For example, for an empty
prog, the jited result is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp x19, x20, [sp, #-16]!
1c: stp x21, x22, [sp, #-16]!
20: stp x26, x25, [sp, #-16]!
24: mov x26, #0
28: stp x26, x25, [sp, #-16]!
2c: mov x26, sp
30: stp x27, x28, [sp, #-16]!
34: mov x25, sp
38: bti j // tailcall target
3c: sub sp, sp, #0
40: mov x7, #0
44: add sp, sp, #0
48: ldp x27, x28, [sp], #16
4c: ldp x26, x25, [sp], #16
50: ldp x26, x25, [sp], #16
54: ldp x21, x22, [sp], #16
58: ldp x19, x20, [sp], #16
5c: ldp fp, lr, [sp], #16
60: mov x0, x7
64: autiasp
68: ret
Clearly, there is no need to save/restore unused callee-saved registers.
This patch makes that change, so that the jited image only saves/restores
the callee-saved registers it uses.
Now the jited result of empty prog is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp xzr, x26, [sp, #-16]!
1c: mov x26, sp
20: bti j // tailcall target
24: mov x7, #0
28: ldp xzr, x26, [sp], #16
2c: ldp fp, lr, [sp], #16
30: mov x0, x7
34: autiasp
38: ret
Since a bpf prog saves/restores its own callee-saved registers as needed,
to make tailcalls work correctly, the caller needs to restore its saved
registers before the tailcall, and the callee needs to save its callee-saved
registers after the tailcall. These extra restore/save instructions
increase performance overhead.
[1] provides 2 benchmarks for tailcall scenarios. Below are the perf
numbers measured in an arm64 KVM guest. The results indicate that the
performance difference before and after the patch in typical tailcall
scenarios is negligible.
- Before:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4313.43 msec task-clock # 0.874 CPUs utilized ( +- 0.16% )
574 context-switches # 133.073 /sec ( +- 1.14% )
0 cpu-migrations # 0.000 /sec
538 page-faults # 124.727 /sec ( +- 0.57% )
10697772784 cycles # 2.480 GHz ( +- 0.22% ) (61.19%)
25511241955 instructions # 2.38 insn per cycle ( +- 0.08% ) (66.70%)
5108910557 branches # 1.184 G/sec ( +- 0.08% ) (72.38%)
2800459 branch-misses # 0.05% of all branches ( +- 0.51% ) (72.36%)
TopDownL1 # 0.60 retiring ( +- 0.09% ) (66.84%)
# 0.21 frontend_bound ( +- 0.15% ) (61.31%)
# 0.12 bad_speculation ( +- 0.08% ) (50.11%)
# 0.07 backend_bound ( +- 0.16% ) (33.30%)
8274201819 L1-dcache-loads # 1.918 G/sec ( +- 0.18% ) (33.15%)
468268 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 4.69% ) (33.16%)
385383 LLC-loads # 89.345 K/sec ( +- 5.22% ) (33.16%)
38296 LLC-load-misses # 9.94% of all LL-cache accesses ( +- 42.52% ) (38.69%)
6886576501 L1-icache-loads # 1.597 G/sec ( +- 0.35% ) (38.69%)
1848585 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 4.52% ) (44.23%)
9043645883 dTLB-loads # 2.097 G/sec ( +- 0.10% ) (44.33%)
416672 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 5.15% ) (49.89%)
6925626111 iTLB-loads # 1.606 G/sec ( +- 0.35% ) (55.46%)
66220 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.88% ) (55.50%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9372 +- 0.0526 seconds time elapsed ( +- 1.07% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10924.50 msec task-clock # 0.945 CPUs utilized ( +- 0.08% )
603 context-switches # 55.197 /sec ( +- 1.13% )
0 cpu-migrations # 0.000 /sec
566 page-faults # 51.810 /sec ( +- 0.42% )
27381270695 cycles # 2.506 GHz ( +- 0.18% ) (60.46%)
56996583922 instructions # 2.08 insn per cycle ( +- 0.21% ) (66.11%)
10321647567 branches # 944.816 M/sec ( +- 0.17% ) (71.79%)
3347735 branch-misses # 0.03% of all branches ( +- 3.72% ) (72.15%)
TopDownL1 # 0.52 retiring ( +- 0.13% ) (66.74%)
# 0.27 frontend_bound ( +- 0.14% ) (61.27%)
# 0.14 bad_speculation ( +- 0.19% ) (50.36%)
# 0.07 backend_bound ( +- 0.42% ) (33.89%)
18740797617 L1-dcache-loads # 1.715 G/sec ( +- 0.43% ) (33.71%)
13715669 L1-dcache-load-misses # 0.07% of all L1-dcache accesses ( +- 32.85% ) (33.34%)
4087551 LLC-loads # 374.164 K/sec ( +- 29.53% ) (33.26%)
267906 LLC-load-misses # 6.55% of all LL-cache accesses ( +- 23.90% ) (38.76%)
15811864229 L1-icache-loads # 1.447 G/sec ( +- 0.12% ) (38.73%)
2976833 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 9.73% ) (44.22%)
20138907471 dTLB-loads # 1.843 G/sec ( +- 0.18% ) (44.15%)
732850 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 11.18% ) (49.64%)
15895726702 iTLB-loads # 1.455 G/sec ( +- 0.15% ) (55.13%)
152075 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 4.71% ) (54.98%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.5613 +- 0.0317 seconds time elapsed ( +- 0.27% )
- After:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4278.78 msec task-clock # 0.871 CPUs utilized ( +- 0.15% )
569 context-switches # 132.982 /sec ( +- 0.58% )
0 cpu-migrations # 0.000 /sec
539 page-faults # 125.970 /sec ( +- 0.43% )
10588986432 cycles # 2.475 GHz ( +- 0.20% ) (60.91%)
25303825043 instructions # 2.39 insn per cycle ( +- 0.08% ) (66.48%)
5110756256 branches # 1.194 G/sec ( +- 0.07% ) (72.03%)
2719569 branch-misses # 0.05% of all branches ( +- 2.42% ) (72.03%)
TopDownL1 # 0.60 retiring ( +- 0.22% ) (66.31%)
# 0.22 frontend_bound ( +- 0.21% ) (60.83%)
# 0.12 bad_speculation ( +- 0.26% ) (50.25%)
# 0.06 backend_bound ( +- 0.17% ) (33.52%)
8163648527 L1-dcache-loads # 1.908 G/sec ( +- 0.33% ) (33.52%)
694979 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 30.53% ) (33.52%)
1902347 LLC-loads # 444.600 K/sec ( +- 48.84% ) (33.69%)
96677 LLC-load-misses # 5.08% of all LL-cache accesses ( +- 43.48% ) (39.30%)
6863517589 L1-icache-loads # 1.604 G/sec ( +- 0.37% ) (39.17%)
1871519 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 6.78% ) (44.56%)
8927782813 dTLB-loads # 2.087 G/sec ( +- 0.14% ) (44.37%)
438237 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 6.00% ) (49.75%)
6886906831 iTLB-loads # 1.610 G/sec ( +- 0.36% ) (55.08%)
67568 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 3.27% ) (54.86%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9114 +- 0.0309 seconds time elapsed ( +- 0.63% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10948.40 msec task-clock # 0.942 CPUs utilized ( +- 0.05% )
615 context-switches # 56.173 /sec ( +- 1.65% )
1 cpu-migrations # 0.091 /sec ( +- 31.62% )
567 page-faults # 51.788 /sec ( +- 0.44% )
27334194328 cycles # 2.497 GHz ( +- 0.08% ) (61.05%)
56656528828 instructions # 2.07 insn per cycle ( +- 0.08% ) (66.67%)
10270389422 branches # 938.072 M/sec ( +- 0.10% ) (72.21%)
3453837 branch-misses # 0.03% of all branches ( +- 3.75% ) (72.27%)
TopDownL1 # 0.52 retiring ( +- 0.16% ) (66.55%)
# 0.27 frontend_bound ( +- 0.09% ) (60.91%)
# 0.14 bad_speculation ( +- 0.08% ) (49.85%)
# 0.07 backend_bound ( +- 0.16% ) (33.33%)
18982866028 L1-dcache-loads # 1.734 G/sec ( +- 0.24% ) (33.34%)
8802454 L1-dcache-load-misses # 0.05% of all L1-dcache accesses ( +- 52.30% ) (33.31%)
2612962 LLC-loads # 238.661 K/sec ( +- 29.78% ) (33.45%)
264107 LLC-load-misses # 10.11% of all LL-cache accesses ( +- 18.34% ) (39.07%)
15793205997 L1-icache-loads # 1.443 G/sec ( +- 0.15% ) (39.09%)
3930802 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 3.72% ) (44.66%)
20097828496 dTLB-loads # 1.836 G/sec ( +- 0.09% ) (44.68%)
961757 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 3.32% ) (50.15%)
15838728506 iTLB-loads # 1.447 G/sec ( +- 0.09% ) (55.62%)
167652 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.28% ) (55.52%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.6173 +- 0.0268 seconds time elapsed ( +- 0.23% )
[1] https://lore.kernel.org/bpf/20200724123644.5096-1-maciej.fijalkowski@intel.com/
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20240826071624.350108-3-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-08-26 15:16:24 +08:00
|
|
|
build_prologue(&ctx, was_classic);
|
2014-08-26 21:15:30 -07:00
|
|
|
|
bpf, arm64: Jit BPF_CALL to direct call when possible
Currently, BPF_CALL is always jited to an indirect call. When the target is
within the range of a direct call, BPF_CALL can be jited to a direct call.
For example, the following BPF_CALL
call __htab_map_lookup_elem
is always jited to indirect call:
mov x10, #0xffffffffffff18f4
movk x10, #0x821, lsl #16
movk x10, #0x8000, lsl #32
blr x10
When the address of target __htab_map_lookup_elem is within the range of
direct call, the BPF_CALL can be jited to:
bl 0xfffffffffd33bc98
This patch does such jit optimization by emitting arm64 direct calls for
BPF_CALL when possible, indirect calls otherwise.
Without this patch, the jit works as follows.
1. First pass
A. Determine jited position and size for each bpf instruction.
B. Compute the jited image size.
2. Allocate jited image with size computed in step 1.
3. Second pass
A. Adjust jump offset for jump instructions
B. Write the final image.
This works because, for a given bpf prog, regardless of where the jited
image is allocated, the jited result for each instruction is fixed. The
second pass differs from the first only in adjusting the jump offsets,
like changing "jmp imm1" to "jmp imm2", while the position and size of
the "jmp" instruction remain unchanged.
Now consider whether to jit BPF_CALL to an arm64 direct or indirect call
instruction. The choice depends solely on the jump offset: direct call
if the jump offset is within 128MB, indirect call otherwise.
For a given BPF_CALL, the target address is known, so the jump offset is
decided by the jited address of the BPF_CALL instruction. In other words,
for a given bpf prog, the jited result for each BPF_CALL is determined
by its jited address.
The jited address for a BPF_CALL is the jited image address plus the
total jited size of all preceding instructions. For a given bpf prog,
there are clearly no BPF_CALL instructions before the first BPF_CALL
instruction. Since the jited results for all instructions other
than BPF_CALL are fixed, the total jited size preceding the first
BPF_CALL is also fixed. Therefore, once the jited image is allocated,
the jited address for the first BPF_CALL is fixed.
Now that the jited result for the first BPF_CALL is fixed, the jited
results for all instructions preceding the second BPF_CALL are fixed.
So the jited address and result for the second BPF_CALL are also fixed.
Similarly, we can conclude that the jited addresses and results for all
subsequent BPF_CALL instructions are fixed.
This means that, for a given bpf prog, once the jited image is allocated,
the jited address and result for all instructions, including all BPF_CALL
instructions, are fixed.
Based on this observation, with this patch, the jit works as follows.
1. First pass
Estimate the maximum jited image size. In this pass, all BPF_CALLs
are jited to arm64 indirect calls since the jump offsets are unknown
because the jited image is not allocated.
2. Allocate jited image with size estimated in step 1.
3. Second pass
A. Determine the jited result for each BPF_CALL.
B. Determine jited address and size for each bpf instruction.
4. Third pass
A. Adjust jump offset for jump instructions.
B. Write the final image.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20240903094407.601107-1-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-03 17:44:07 +08:00
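As a rough illustration of the range check described above, here is a minimal standalone C sketch (not the JIT's actual emit_call() logic): an arm64 BL instruction encodes a signed 26-bit word offset, i.e. a byte offset within +/-128MB, so a direct bl is only possible when the distance from the call site to the target fits. The addresses used in main() are made up for the example.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* arm64 BL encodes a signed 26-bit word offset: byte offset in [-128MB, +128MB) */
static bool in_bl_range(uint64_t pc, uint64_t target)
{
	int64_t off = (int64_t)(target - pc);

	/* must be 4-byte aligned and fit in the 26-bit word offset */
	return (off & 3) == 0 && off >= -(1LL << 27) && off < (1LL << 27);
}

int main(void)
{
	uint64_t pc = 0xffffffc009015000ULL;	/* hypothetical call site */

	/* near target: a single bl; far target: movz/movk sequence plus blr */
	printf("near: %s\n", in_bl_range(pc, pc + 0x10000) ? "bl" : "blr");
	printf("far : %s\n", in_bl_range(pc, pc + (1ULL << 30)) ? "bl" : "blr");
	return 0;
}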
|
|
|
/* Record exentry_idx and body_idx before first build_body */
|
|
|
|
exentry_idx = ctx.exentry_idx;
|
|
|
|
body_idx = ctx.idx;
|
|
|
|
/* Don't write body instructions to memory for now */
|
|
|
|
ctx.write = false;
|
|
|
|
|
2018-11-26 14:05:39 +01:00
|
|
|
if (build_body(&ctx, extra_pass)) {
|
2016-05-13 19:08:34 +02:00
|
|
|
prog = orig_prog;
|
2024-02-28 14:18:24 +00:00
|
|
|
goto out_free_hdr;
|
2014-09-11 10:36:48 +01:00
|
|
|
}
|
2014-08-26 21:15:30 -07:00
|
|
|
|
bpf, arm64: Jit BPF_CALL to direct call when possible
2024-09-03 17:44:07 +08:00
|
|
|
ctx.epilogue_offset = ctx.idx;
|
|
|
|
ctx.exentry_idx = exentry_idx;
|
|
|
|
ctx.idx = body_idx;
|
|
|
|
ctx.write = true;
|
|
|
|
|
|
|
|
/* Pass 3: Adjust jump offset and write final image */
|
|
|
|
if (build_body(&ctx, extra_pass) ||
|
|
|
|
WARN_ON_ONCE(ctx.idx != ctx.epilogue_offset)) {
|
|
|
|
prog = orig_prog;
|
|
|
|
goto out_free_hdr;
|
|
|
|
}
|
|
|
|
|
bpf, arm64: Avoid blindly saving/restoring all callee-saved registers
The arm64 jit blindly saves/restores all callee-saved registers, making
the jited result look a bit too complicated. For example, for an empty
prog, the jited result is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp x19, x20, [sp, #-16]!
1c: stp x21, x22, [sp, #-16]!
20: stp x26, x25, [sp, #-16]!
24: mov x26, #0
28: stp x26, x25, [sp, #-16]!
2c: mov x26, sp
30: stp x27, x28, [sp, #-16]!
34: mov x25, sp
38: bti j // tailcall target
3c: sub sp, sp, #0
40: mov x7, #0
44: add sp, sp, #0
48: ldp x27, x28, [sp], #16
4c: ldp x26, x25, [sp], #16
50: ldp x26, x25, [sp], #16
54: ldp x21, x22, [sp], #16
58: ldp x19, x20, [sp], #16
5c: ldp fp, lr, [sp], #16
60: mov x0, x7
64: autiasp
68: ret
Clearly, there is no need to save/restore unused callee-saved registers.
This patch changes that, making the jited image save/restore only
the callee-saved registers it actually uses.
Now the jited result of empty prog is:
0: bti jc
4: mov x9, lr
8: nop
c: paciasp
10: stp fp, lr, [sp, #-16]!
14: mov fp, sp
18: stp xzr, x26, [sp, #-16]!
1c: mov x26, sp
20: bti j // tailcall target
24: mov x7, #0
28: ldp xzr, x26, [sp], #16
2c: ldp fp, lr, [sp], #16
30: mov x0, x7
34: autiasp
38: ret
Since bpf prog saves/restores its own callee-saved registers as needed,
to make tailcall work correctly, the caller needs to restore its saved
registers before tailcall, and the callee needs to save its callee-saved
registers after the tail call. These extra restore/save instructions
increase performance overhead.
[1] provides 2 benchmarks for tail call scenarios. Below are the perf
numbers measured in an arm64 KVM guest. The results indicate that the
performance difference before and after the patch in typical tailcall
scenarios is negligible.
- Before:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4313.43 msec task-clock # 0.874 CPUs utilized ( +- 0.16% )
574 context-switches # 133.073 /sec ( +- 1.14% )
0 cpu-migrations # 0.000 /sec
538 page-faults # 124.727 /sec ( +- 0.57% )
10697772784 cycles # 2.480 GHz ( +- 0.22% ) (61.19%)
25511241955 instructions # 2.38 insn per cycle ( +- 0.08% ) (66.70%)
5108910557 branches # 1.184 G/sec ( +- 0.08% ) (72.38%)
2800459 branch-misses # 0.05% of all branches ( +- 0.51% ) (72.36%)
TopDownL1 # 0.60 retiring ( +- 0.09% ) (66.84%)
# 0.21 frontend_bound ( +- 0.15% ) (61.31%)
# 0.12 bad_speculation ( +- 0.08% ) (50.11%)
# 0.07 backend_bound ( +- 0.16% ) (33.30%)
8274201819 L1-dcache-loads # 1.918 G/sec ( +- 0.18% ) (33.15%)
468268 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 4.69% ) (33.16%)
385383 LLC-loads # 89.345 K/sec ( +- 5.22% ) (33.16%)
38296 LLC-load-misses # 9.94% of all LL-cache accesses ( +- 42.52% ) (38.69%)
6886576501 L1-icache-loads # 1.597 G/sec ( +- 0.35% ) (38.69%)
1848585 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 4.52% ) (44.23%)
9043645883 dTLB-loads # 2.097 G/sec ( +- 0.10% ) (44.33%)
416672 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 5.15% ) (49.89%)
6925626111 iTLB-loads # 1.606 G/sec ( +- 0.35% ) (55.46%)
66220 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.88% ) (55.50%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9372 +- 0.0526 seconds time elapsed ( +- 1.07% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10924.50 msec task-clock # 0.945 CPUs utilized ( +- 0.08% )
603 context-switches # 55.197 /sec ( +- 1.13% )
0 cpu-migrations # 0.000 /sec
566 page-faults # 51.810 /sec ( +- 0.42% )
27381270695 cycles # 2.506 GHz ( +- 0.18% ) (60.46%)
56996583922 instructions # 2.08 insn per cycle ( +- 0.21% ) (66.11%)
10321647567 branches # 944.816 M/sec ( +- 0.17% ) (71.79%)
3347735 branch-misses # 0.03% of all branches ( +- 3.72% ) (72.15%)
TopDownL1 # 0.52 retiring ( +- 0.13% ) (66.74%)
# 0.27 frontend_bound ( +- 0.14% ) (61.27%)
# 0.14 bad_speculation ( +- 0.19% ) (50.36%)
# 0.07 backend_bound ( +- 0.42% ) (33.89%)
18740797617 L1-dcache-loads # 1.715 G/sec ( +- 0.43% ) (33.71%)
13715669 L1-dcache-load-misses # 0.07% of all L1-dcache accesses ( +- 32.85% ) (33.34%)
4087551 LLC-loads # 374.164 K/sec ( +- 29.53% ) (33.26%)
267906 LLC-load-misses # 6.55% of all LL-cache accesses ( +- 23.90% ) (38.76%)
15811864229 L1-icache-loads # 1.447 G/sec ( +- 0.12% ) (38.73%)
2976833 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 9.73% ) (44.22%)
20138907471 dTLB-loads # 1.843 G/sec ( +- 0.18% ) (44.15%)
732850 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 11.18% ) (49.64%)
15895726702 iTLB-loads # 1.455 G/sec ( +- 0.15% ) (55.13%)
152075 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 4.71% ) (54.98%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.5613 +- 0.0317 seconds time elapsed ( +- 0.27% )
- After:
Performance counter stats for './test_progs -t tailcalls' (5 runs):
4278.78 msec task-clock # 0.871 CPUs utilized ( +- 0.15% )
569 context-switches # 132.982 /sec ( +- 0.58% )
0 cpu-migrations # 0.000 /sec
539 page-faults # 125.970 /sec ( +- 0.43% )
10588986432 cycles # 2.475 GHz ( +- 0.20% ) (60.91%)
25303825043 instructions # 2.39 insn per cycle ( +- 0.08% ) (66.48%)
5110756256 branches # 1.194 G/sec ( +- 0.07% ) (72.03%)
2719569 branch-misses # 0.05% of all branches ( +- 2.42% ) (72.03%)
TopDownL1 # 0.60 retiring ( +- 0.22% ) (66.31%)
# 0.22 frontend_bound ( +- 0.21% ) (60.83%)
# 0.12 bad_speculation ( +- 0.26% ) (50.25%)
# 0.06 backend_bound ( +- 0.17% ) (33.52%)
8163648527 L1-dcache-loads # 1.908 G/sec ( +- 0.33% ) (33.52%)
694979 L1-dcache-load-misses # 0.01% of all L1-dcache accesses ( +- 30.53% ) (33.52%)
1902347 LLC-loads # 444.600 K/sec ( +- 48.84% ) (33.69%)
96677 LLC-load-misses # 5.08% of all LL-cache accesses ( +- 43.48% ) (39.30%)
6863517589 L1-icache-loads # 1.604 G/sec ( +- 0.37% ) (39.17%)
1871519 L1-icache-load-misses # 0.03% of all L1-icache accesses ( +- 6.78% ) (44.56%)
8927782813 dTLB-loads # 2.087 G/sec ( +- 0.14% ) (44.37%)
438237 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 6.00% ) (49.75%)
6886906831 iTLB-loads # 1.610 G/sec ( +- 0.36% ) (55.08%)
67568 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 3.27% ) (54.86%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
4.9114 +- 0.0309 seconds time elapsed ( +- 0.63% )
Performance counter stats for './test_progs -t flow_dissector' (5 runs):
10948.40 msec task-clock # 0.942 CPUs utilized ( +- 0.05% )
615 context-switches # 56.173 /sec ( +- 1.65% )
1 cpu-migrations # 0.091 /sec ( +- 31.62% )
567 page-faults # 51.788 /sec ( +- 0.44% )
27334194328 cycles # 2.497 GHz ( +- 0.08% ) (61.05%)
56656528828 instructions # 2.07 insn per cycle ( +- 0.08% ) (66.67%)
10270389422 branches # 938.072 M/sec ( +- 0.10% ) (72.21%)
3453837 branch-misses # 0.03% of all branches ( +- 3.75% ) (72.27%)
TopDownL1 # 0.52 retiring ( +- 0.16% ) (66.55%)
# 0.27 frontend_bound ( +- 0.09% ) (60.91%)
# 0.14 bad_speculation ( +- 0.08% ) (49.85%)
# 0.07 backend_bound ( +- 0.16% ) (33.33%)
18982866028 L1-dcache-loads # 1.734 G/sec ( +- 0.24% ) (33.34%)
8802454 L1-dcache-load-misses # 0.05% of all L1-dcache accesses ( +- 52.30% ) (33.31%)
2612962 LLC-loads # 238.661 K/sec ( +- 29.78% ) (33.45%)
264107 LLC-load-misses # 10.11% of all LL-cache accesses ( +- 18.34% ) (39.07%)
15793205997 L1-icache-loads # 1.443 G/sec ( +- 0.15% ) (39.09%)
3930802 L1-icache-load-misses # 0.02% of all L1-icache accesses ( +- 3.72% ) (44.66%)
20097828496 dTLB-loads # 1.836 G/sec ( +- 0.09% ) (44.68%)
961757 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 3.32% ) (50.15%)
15838728506 iTLB-loads # 1.447 G/sec ( +- 0.09% ) (55.62%)
167652 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 1.28% ) (55.52%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.6173 +- 0.0268 seconds time elapsed ( +- 0.23% )
[1] https://lore.kernel.org/bpf/20200724123644.5096-1-maciej.fijalkowski@intel.com/
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20240826071624.350108-3-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-08-26 15:16:24 +08:00
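The following is a small standalone C model of the idea, under the assumption that the JIT records which callee-saved registers a program touches and then emits save/restore pairs only for those; the helper names (mark_used, emit_prologue_saves) are invented for illustration and are not the kernel's functions.

#include <stdint.h>
#include <stdio.h>

#define FIRST_CSR 19	/* callee-saved range modeled here: x19..x28 */
#define LAST_CSR  28

static uint32_t used;	/* bit i set => x(FIRST_CSR + i) is used by the prog */

static void mark_used(int reg)
{
	if (reg >= FIRST_CSR && reg <= LAST_CSR)
		used |= 1u << (reg - FIRST_CSR);
}

static void emit_prologue_saves(void)
{
	int regs[LAST_CSR - FIRST_CSR + 1], n = 0, i;

	for (i = FIRST_CSR; i <= LAST_CSR; i++)
		if (used & (1u << (i - FIRST_CSR)))
			regs[n++] = i;

	/* store in 16-byte pairs; pad an odd count with xzr, matching the
	 * "stp xzr, x26" pairing shown in the jited output above
	 */
	for (i = 0; i < n; i += 2) {
		if (i + 1 < n)
			printf("stp x%d, x%d, [sp, #-16]!\n", regs[i], regs[i + 1]);
		else
			printf("stp xzr, x%d, [sp, #-16]!\n", regs[i]);
	}
}

int main(void)
{
	mark_used(26);		/* e.g. only the tail call count register is used */
	emit_prologue_saves();	/* prints: stp xzr, x26, [sp, #-16]! */
	return 0;
}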
|
|
|
build_epilogue(&ctx);
|
bpf, arm64: Implement bpf_arch_text_poke() for arm64
Implement bpf_arch_text_poke() for arm64, so bpf prog or bpf trampoline
can be patched with it.
When the target address is NULL, the original instruction is patched to
a NOP.
When the target address and the source address are within the branch
range, the original instruction is patched to a bl instruction to the
target address directly.
To support attaching bpf trampoline to both regular kernel function and
bpf prog, we follow the ftrace patchsite way for bpf prog. That is, two
instructions are inserted at the beginning of bpf prog, the first one
saves the return address to x9, and the second is a nop which will be
patched to a bl instruction when a bpf trampoline is attached.
However, when a bpf trampoline is attached to bpf prog, the distance
between target address and source address may exceed 128MB, the maximum
branch range, because bpf trampoline and bpf prog are allocated
separately with vmalloc. So long jumps must be handled.
When a bpf prog is constructed, a plt pointing to empty trampoline
dummy_tramp is placed at the end:
bpf_prog:
mov x9, lr
nop // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad dummy_tramp // plt target
This is also the state when no trampoline is attached.
When a short-jump bpf trampoline is attached, the patchsite is patched to
a bl instruction to the trampoline directly:
bpf_prog:
mov x9, lr
bl <short-jump bpf trampoline address> // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad dummy_tramp // plt target
When a long-jump bpf trampoline is attached, the plt target is filled with
the trampoline address and the patchsite is patched to a bl instruction to
the plt:
bpf_prog:
mov x9, lr
bl plt // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad <long-jump bpf trampoline address>
dummy_tramp is used to prevent another CPU from jumping to an unknown
location during the patching process, making the patching process easier.
The patching process is as follows:
1. when neither the old address nor the new address is a long jump, the
patchsite is replaced with a bl to the new address, or nop if the new
address is NULL;
2. when the old address is not long jump but the new one is, the
branch target address is written to plt first, then the patchsite
is replaced with a bl instruction to the plt;
3. when the old address is long jump but the new one is not, the address
of dummy_tramp is written to plt first, then the patchsite is replaced
with a bl to the new address, or a nop if the new address is NULL;
4. when both the old address and the new address are long jump, the
new address is written to plt and the patchsite is not changed.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20220711150823.2128542-4-xukuohai@huawei.com
2022-07-11 11:08:22 -04:00
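A compact standalone C model of the four patching cases listed above (not the kernel's bpf_arch_text_poke()); it only captures which instruction the patchsite becomes and what, if anything, must be written to the PLT target slot first. The enum and helper names are invented for the sketch.

#include <stdbool.h>
#include <stdio.h>

enum patch { INSN_NOP, INSN_BL_TARGET, INSN_BL_PLT, INSN_UNCHANGED };

/* Returns what the patchsite becomes; *plt_write names what, if anything,
 * has to be written to the PLT target slot before touching the patchsite.
 * "long" means the address is out of BL range and must go through the PLT.
 */
static enum patch poke(bool old_long, bool new_long, bool new_is_null,
		       const char **plt_write)
{
	*plt_write = NULL;

	if (!old_long && !new_long)			/* case 1 */
		return new_is_null ? INSN_NOP : INSN_BL_TARGET;
	if (!old_long && new_long) {			/* case 2 */
		*plt_write = "new trampoline address";
		return INSN_BL_PLT;
	}
	if (old_long && !new_long) {			/* case 3 */
		*plt_write = "dummy_tramp";
		return new_is_null ? INSN_NOP : INSN_BL_TARGET;
	}
	*plt_write = "new trampoline address";		/* case 4 */
	return INSN_UNCHANGED;				/* patchsite stays bl plt */
}

int main(void)
{
	const char *plt;
	enum patch p = poke(false, true, false, &plt);

	printf("case 2: patchsite -> %s, plt <- %s\n",
	       p == INSN_BL_PLT ? "bl plt" : "?", plt);
	return 0;
}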
|
|
|
build_plt(&ctx);
|
2014-08-26 21:15:30 -07:00
|
|
|
|
bpf, arm64: Jit BPF_CALL to direct call when possible
2024-09-03 17:44:07 +08:00
|
|
|
/* Extra pass to validate JITed code. */
|
2022-07-11 11:08:23 -04:00
|
|
|
if (validate_ctx(&ctx)) {
|
2016-05-13 19:08:34 +02:00
|
|
|
prog = orig_prog;
|
2024-02-28 14:18:24 +00:00
|
|
|
goto out_free_hdr;
|
2016-01-13 23:33:22 -08:00
|
|
|
}
|
|
|
|
|
bpf, arm64: Jit BPF_CALL to direct call when possible
2024-09-03 17:44:07 +08:00
|
|
|
/* update the real prog size */
|
|
|
|
prog_size = sizeof(u32) * ctx.idx;
|
|
|
|
|
2014-08-26 21:15:30 -07:00
|
|
|
/* And we're done. */
|
|
|
|
if (bpf_jit_enable > 1)
|
2020-07-28 17:21:26 +02:00
|
|
|
bpf_jit_dump(prog->len, prog_size, 2, ctx.image);
|
2014-08-26 21:15:30 -07:00
|
|
|
|
2017-12-14 17:55:16 -08:00
|
|
|
if (!prog->is_func || extra_pass) {
|
bpf, arm64: Jit BPF_CALL to direct call when possible
2024-09-03 17:44:07 +08:00
|
|
|
/* The jited image may shrink since the jited result for
|
|
|
|
* BPF_CALL to subprog may be changed from indirect call
|
|
|
|
* to direct call.
|
|
|
|
*/
|
|
|
|
if (extra_pass && ctx.idx > jit_data->ctx.idx) {
|
|
|
|
pr_err_once("multi-func JIT bug %d > %d\n",
|
2017-12-14 17:55:16 -08:00
|
|
|
ctx.idx, jit_data->ctx.idx);
|
|
|
|
prog->bpf_func = NULL;
|
|
|
|
prog->jited = 0;
|
2022-05-31 14:51:13 -07:00
|
|
|
prog->jited_len = 0;
|
2024-02-28 14:18:24 +00:00
|
|
|
goto out_free_hdr;
|
|
|
|
}
|
2024-06-14 23:24:08 -03:00
|
|
|
if (WARN_ON(bpf_jit_binary_pack_finalize(ro_header, header))) {
|
2024-02-28 14:18:24 +00:00
|
|
|
/* ro_header has been freed */
|
|
|
|
ro_header = NULL;
|
|
|
|
prog = orig_prog;
|
2017-12-14 17:55:16 -08:00
|
|
|
goto out_off;
|
|
|
|
}
|
2024-02-28 14:18:24 +00:00
|
|
|
/*
|
|
|
|
* The instructions have now been copied to the ROX region from
|
|
|
|
* where they will execute. Now the data cache has to be cleaned to
|
|
|
|
* the PoU and the I-cache has to be invalidated for the VAs.
|
|
|
|
*/
|
|
|
|
bpf_flush_icache(ro_header, ctx.ro_image + ctx.idx);
|
2017-12-14 17:55:16 -08:00
|
|
|
} else {
|
|
|
|
jit_data->ctx = ctx;
|
2024-02-28 14:18:24 +00:00
|
|
|
jit_data->ro_image = ro_image_ptr;
|
2017-12-14 17:55:16 -08:00
|
|
|
jit_data->header = header;
|
2024-02-28 14:18:24 +00:00
|
|
|
jit_data->ro_header = ro_header;
|
2017-12-14 17:55:16 -08:00
|
|
|
}
|
2024-02-28 14:18:24 +00:00
|
|
|
|
|
|
|
prog->bpf_func = (void *)ctx.ro_image;
|
2015-09-30 01:41:50 +02:00
|
|
|
prog->jited = 1;
|
2020-07-28 17:21:26 +02:00
|
|
|
prog->jited_len = prog_size;
|
2016-05-13 19:08:34 +02:00
|
|
|
|
2017-12-14 17:55:16 -08:00
|
|
|
if (!prog->is_func || extra_pass) {
|
2022-02-26 20:19:06 +08:00
|
|
|
int i;
|
|
|
|
|
|
|
|
/* offset[prog->len] is the size of program */
|
|
|
|
for (i = 0; i <= prog->len; i++)
|
|
|
|
ctx.offset[i] *= AARCH64_INSN_SIZE;
|
arm64: bpf: Fix branch offset in JIT
Running the eBPF test_verifier leads to random errors looking like this:
[ 6525.735488] Unexpected kernel BRK exception at EL1
[ 6525.735502] Internal error: ptrace BRK handler: f2000100 [#1] SMP
[ 6525.741609] Modules linked in: nls_utf8 cifs libdes libarc4 dns_resolver fscache binfmt_misc nls_ascii nls_cp437 vfat fat aes_ce_blk crypto_simd cryptd aes_ce_cipher ghash_ce gf128mul efi_pstore sha2_ce sha256_arm64 sha1_ce evdev efivars efivarfs ip_tables x_tables autofs4 btrfs blake2b_generic xor xor_neon zstd_compress raid6_pq libcrc32c crc32c_generic ahci xhci_pci libahci xhci_hcd igb libata i2c_algo_bit nvme realtek usbcore nvme_core scsi_mod t10_pi netsec mdio_devres of_mdio gpio_keys fixed_phy libphy gpio_mb86s7x
[ 6525.787760] CPU: 3 PID: 7881 Comm: test_verifier Tainted: G W 5.9.0-rc1+ #47
[ 6525.796111] Hardware name: Socionext SynQuacer E-series DeveloperBox, BIOS build #1 Jun 6 2020
[ 6525.804812] pstate: 20000005 (nzCv daif -PAN -UAO BTYPE=--)
[ 6525.810390] pc : bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
[ 6525.815613] lr : bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
[ 6525.820832] sp : ffff8000130cbb80
[ 6525.824141] x29: ffff8000130cbbb0 x28: 0000000000000000
[ 6525.829451] x27: 000005ef6fcbf39b x26: 0000000000000000
[ 6525.834759] x25: ffff8000130cbb80 x24: ffff800011dc7038
[ 6525.840067] x23: ffff8000130cbd00 x22: ffff0008f624d080
[ 6525.845375] x21: 0000000000000001 x20: ffff800011dc7000
[ 6525.850682] x19: 0000000000000000 x18: 0000000000000000
[ 6525.855990] x17: 0000000000000000 x16: 0000000000000000
[ 6525.861298] x15: 0000000000000000 x14: 0000000000000000
[ 6525.866606] x13: 0000000000000000 x12: 0000000000000000
[ 6525.871913] x11: 0000000000000001 x10: ffff8000000a660c
[ 6525.877220] x9 : ffff800010951810 x8 : ffff8000130cbc38
[ 6525.882528] x7 : 0000000000000000 x6 : 0000009864cfa881
[ 6525.887836] x5 : 00ffffffffffffff x4 : 002880ba1a0b3e9f
[ 6525.893144] x3 : 0000000000000018 x2 : ffff8000000a4374
[ 6525.898452] x1 : 000000000000000a x0 : 0000000000000009
[ 6525.903760] Call trace:
[ 6525.906202] bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
[ 6525.911076] bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
[ 6525.915957] bpf_dispatcher_xdp_func+0x14/0x20
[ 6525.920398] bpf_test_run+0x70/0x1b0
[ 6525.923969] bpf_prog_test_run_xdp+0xec/0x190
[ 6525.928326] __do_sys_bpf+0xc88/0x1b28
[ 6525.932072] __arm64_sys_bpf+0x24/0x30
[ 6525.935820] el0_svc_common.constprop.0+0x70/0x168
[ 6525.940607] do_el0_svc+0x28/0x88
[ 6525.943920] el0_sync_handler+0x88/0x190
[ 6525.947838] el0_sync+0x140/0x180
[ 6525.951154] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
[ 6525.957249] ---[ end trace cecc3f93b14927e2 ]---
The reason is the offset[] creation and later usage, while building
the eBPF body. The code currently omits the first instruction, since
build_insn() will increase our ctx->idx before saving it.
That was fine up until bounded eBPF loops were introduced. After that
introduction, offset[0] must be the offset of the end of the prologue,
which is the start of the 1st insn, while offset[n] holds the
offset of the end of the n-th insn.
When "taken loop with back jump to 1st insn" test runs, it will
eventually call bpf2a64_offset(-1, 2, ctx). Since negative indexing is
permitted, the current outcome depends on the value stored in
ctx->offset[-1], which has nothing to do with our array.
If the value happens to be 0 the tests will work. If not this error
triggers.
commit 7c2e988f400e ("bpf: fix x64 JIT code generation for jmp to 1st insn")
fixed an identical bug on x86 when eBPF bounded loops were introduced.
So let's fix it by creating the ctx->offset[] differently. Track the
beginning of each instruction and account for the extra instruction while
calculating the arm instruction offsets.
Fixes: 2589726d12a1 ("bpf: introduce bounded loops")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Jiri Olsa <jolsa@kernel.org>
Co-developed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Co-developed-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20200917084925.177348-1-ilias.apalodimas@linaro.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2020-09-17 11:49:25 +03:00
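To make the offset[] convention concrete, here is a small standalone C model; the array contents are invented, and the helper mirrors the relation described above rather than quoting the kernel's bpf2a64_offset() verbatim.

#include <stdio.h>

/* offset[0] is the end of the prologue (= start of the 1st insn) and
 * offset[i + 1] is the end of insn i, counted in arm64 instructions.
 * Hypothetical layout: 4-insn prologue, then insns of 2, 3 and 1 insns.
 */
static const int offset[] = { 4, 6, 9, 10 };

static int bpf2a64_off(int i, int off)
{
	/* A BPF jump at insn i with offset off targets insn i + 1 + off.
	 * The arm64 branch is the last instruction emitted for insn i,
	 * i.e. it sits at offset[i + 1] - 1, and arm64 encodes branches
	 * relative to the branch itself, hence the subtraction.
	 */
	return offset[i + 1 + off] - (offset[i + 1] - 1);
}

int main(void)
{
	/* back jump from insn 2 to insn 0: off = -3, target is offset[0] = 4,
	 * the branch sits at offset[3] - 1 = 9, so the arm64 offset is -5
	 */
	printf("arm64 branch offset = %d instructions\n", bpf2a64_off(2, -3));
	return 0;
}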
|
|
|
bpf_prog_fill_jited_linfo(prog, ctx.offset + 1);
|
2016-05-13 19:08:34 +02:00
|
|
|
out_off:
|
2022-08-04 10:54:42 +08:00
|
|
|
kvfree(ctx.offset);
|
2017-12-14 17:55:16 -08:00
|
|
|
kfree(jit_data);
|
|
|
|
prog->aux->jit_data = NULL;
|
|
|
|
}
|
2016-05-13 19:08:34 +02:00
|
|
|
out:
|
|
|
|
if (tmp_blinded)
|
|
|
|
bpf_jit_prog_release_other(prog, prog == orig_prog ?
|
|
|
|
tmp : orig_prog);
|
2016-05-13 19:08:31 +02:00
|
|
|
return prog;
|
2024-02-28 14:18:24 +00:00
|
|
|
|
|
|
|
out_free_hdr:
|
|
|
|
if (header) {
|
|
|
|
bpf_arch_text_copy(&ro_header->size, &header->size,
|
|
|
|
sizeof(header->size));
|
|
|
|
bpf_jit_binary_pack_free(ro_header, header);
|
|
|
|
}
|
|
|
|
goto out_off;
|
2014-08-26 21:15:30 -07:00
|
|
|
}
|
2018-11-23 23:18:04 +01:00
|
|
|
|
2022-01-30 17:29:15 +08:00
|
|
|
bool bpf_jit_supports_kfunc_call(void)
|
|
|
|
{
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2024-02-28 14:18:24 +00:00
|
|
|
void *bpf_arch_text_copy(void *dst, void *src, size_t len)
|
|
|
|
{
|
|
|
|
if (!aarch64_insn_copy(dst, src, len))
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
return dst;
|
|
|
|
}
|
|
|
|
|
2021-10-14 15:25:52 +01:00
|
|
|
u64 bpf_jit_alloc_exec_limit(void)
|
|
|
|
{
|
2021-11-05 16:50:45 +00:00
|
|
|
return VMALLOC_END - VMALLOC_START;
|
2021-10-14 15:25:52 +01:00
|
|
|
}
|
|
|
|
|
bpf, arm64: Keep tail call count across bpf2bpf calls
Today, doing a BPF tail call after a BPF-to-BPF call, that is, from a
subprogram, is allowed only by the x86-64 BPF JIT. Mixing these features
requires support from the JIT. The tail call count has to be tracked through BPF-to-
BPF calls, as well as through BPF tail calls to prevent unbounded chains of
tail calls.
arm64 BPF JIT stores the tail call count (TCC) in a dedicated
register (X26). This makes it easier to support bpf2bpf calls mixed with
tail calls than on the x86 platform.
In order to keep the tail call count intact throughout bpf2bpf calls, all
we need to do is tweak the program prologue generator. When emitting
prologue for a subprogram, we skip the block that initializes the tail call
count and emits a jump pad for the tail call.
With this change, a sample execution flow where a bpf2bpf call is followed
by a tail call would look like so:
int entry(struct __sk_buff *skb):
0xffffffc0090151d4: paciasp
0xffffffc0090151d8: stp x29, x30, [sp, #-16]!
0xffffffc0090151dc: mov x29, sp
0xffffffc0090151e0: stp x19, x20, [sp, #-16]!
0xffffffc0090151e4: stp x21, x22, [sp, #-16]!
0xffffffc0090151e8: stp x25, x26, [sp, #-16]!
0xffffffc0090151ec: stp x27, x28, [sp, #-16]!
0xffffffc0090151f0: mov x25, sp
0xffffffc0090151f4: mov x26, #0x0 // <- init TCC only
0xffffffc0090151f8: bti j // in main prog
0xffffffc0090151fc: sub x27, x25, #0x0
0xffffffc009015200: sub sp, sp, #0x10
0xffffffc009015204: mov w1, #0x0
0xffffffc009015208: mov x10, #0xffffffffffffffff
0xffffffc00901520c: strb w1, [x25, x10]
0xffffffc009015210: mov x10, #0xffffffffffffd25c
0xffffffc009015214: movk x10, #0x902, lsl #16
0xffffffc009015218: movk x10, #0xffc0, lsl #32
0xffffffc00901521c: blr x10 -------------------. // bpf2bpf call
0xffffffc009015220: add x7, x0, #0x0 <-------------.
0xffffffc009015224: add sp, sp, #0x10 | |
0xffffffc009015228: ldp x27, x28, [sp], #16 | |
0xffffffc00901522c: ldp x25, x26, [sp], #16 | |
0xffffffc009015230: ldp x21, x22, [sp], #16 | |
0xffffffc009015234: ldp x19, x20, [sp], #16 | |
0xffffffc009015238: ldp x29, x30, [sp], #16 | |
0xffffffc00901523c: add x0, x7, #0x0 | |
0xffffffc009015240: autiasp | |
0xffffffc009015244: ret | |
| |
int subprog_tail(struct __sk_buff *skb): | |
0xffffffc00902d25c: paciasp <----------------------' |
0xffffffc00902d260: stp x29, x30, [sp, #-16]! |
0xffffffc00902d264: mov x29, sp |
0xffffffc00902d268: stp x19, x20, [sp, #-16]! |
0xffffffc00902d26c: stp x21, x22, [sp, #-16]! |
0xffffffc00902d270: stp x25, x26, [sp, #-16]! |
0xffffffc00902d274: stp x27, x28, [sp, #-16]! |
0xffffffc00902d278: mov x25, sp |
0xffffffc00902d27c: sub x27, x25, #0x0 |
0xffffffc00902d280: sub sp, sp, #0x10 | // <- end of prologue, notice:
0xffffffc00902d284: add x19, x0, #0x0 | // 1) TCC not touched, and
0xffffffc00902d288: mov w0, #0x1 | // 2) no tail call jump pad
0xffffffc00902d28c: mov x10, #0xfffffffffffffffc |
0xffffffc00902d290: str w0, [x25, x10] |
0xffffffc00902d294: mov x20, #0xffffff80ffffffff |
0xffffffc00902d298: movk x20, #0xc033, lsl #16 |
0xffffffc00902d29c: movk x20, #0x4e00 |
0xffffffc00902d2a0: add x0, x19, #0x0 |
0xffffffc00902d2a4: add x1, x20, #0x0 |
0xffffffc00902d2a8: mov x2, #0x0 |
0xffffffc00902d2ac: mov w10, #0x24 |
0xffffffc00902d2b0: ldr w10, [x1, x10] |
0xffffffc00902d2b4: add w2, w2, #0x0 |
0xffffffc00902d2b8: cmp w2, w10 |
0xffffffc00902d2bc: b.cs 0xffffffc00902d2f8 |
0xffffffc00902d2c0: mov w10, #0x21 |
0xffffffc00902d2c4: cmp x26, x10 | // TCC >= MAX_TAIL_CALL_CNT?
0xffffffc00902d2c8: b.cs 0xffffffc00902d2f8 |
0xffffffc00902d2cc: add x26, x26, #0x1 | // TCC++
0xffffffc00902d2d0: mov w10, #0x110 |
0xffffffc00902d2d4: add x10, x1, x10 |
0xffffffc00902d2d8: lsl x11, x2, #3 |
0xffffffc00902d2dc: ldr x11, [x10, x11] |
0xffffffc00902d2e0: cbz x11, 0xffffffc00902d2f8 |
0xffffffc00902d2e4: mov w10, #0x30 |
0xffffffc00902d2e8: ldr x10, [x11, x10] |
0xffffffc00902d2ec: add x10, x10, #0x24 |
0xffffffc00902d2f0: add sp, sp, #0x10 | // <- destroy just current
0xffffffc00902d2f4: br x10 ---------------------. | // BPF stack frame
0xffffffc00902d2f8: mov x10, #0xfffffffffffffffc | | // before the tail call
0xffffffc00902d2fc: ldr w7, [x25, x10] | |
0xffffffc00902d300: add sp, sp, #0x10 | |
0xffffffc00902d304: ldp x27, x28, [sp], #16 | |
0xffffffc00902d308: ldp x25, x26, [sp], #16 | |
0xffffffc00902d30c: ldp x21, x22, [sp], #16 | |
0xffffffc00902d310: ldp x19, x20, [sp], #16 | |
0xffffffc00902d314: ldp x29, x30, [sp], #16 | |
0xffffffc00902d318: add x0, x7, #0x0 | |
0xffffffc00902d31c: autiasp | |
0xffffffc00902d320: ret | |
| |
int classifier_0(struct __sk_buff *skb): | |
0xffffffc008ff5874: paciasp | |
0xffffffc008ff5878: stp x29, x30, [sp, #-16]! | |
0xffffffc008ff587c: mov x29, sp | |
0xffffffc008ff5880: stp x19, x20, [sp, #-16]! | |
0xffffffc008ff5884: stp x21, x22, [sp, #-16]! | |
0xffffffc008ff5888: stp x25, x26, [sp, #-16]! | |
0xffffffc008ff588c: stp x27, x28, [sp, #-16]! | |
0xffffffc008ff5890: mov x25, sp | |
0xffffffc008ff5894: mov x26, #0x0 | |
0xffffffc008ff5898: bti j <----------------------' |
0xffffffc008ff589c: sub x27, x25, #0x0 |
0xffffffc008ff58a0: sub sp, sp, #0x0 |
0xffffffc008ff58a4: mov x0, #0xffffffc0ffffffff |
0xffffffc008ff58a8: movk x0, #0x8fc, lsl #16 |
0xffffffc008ff58ac: movk x0, #0x6000 |
0xffffffc008ff58b0: mov w1, #0x1 |
0xffffffc008ff58b4: str w1, [x0] |
0xffffffc008ff58b8: mov w7, #0x0 |
0xffffffc008ff58bc: mov sp, sp |
0xffffffc008ff58c0: ldp x27, x28, [sp], #16 |
0xffffffc008ff58c4: ldp x25, x26, [sp], #16 |
0xffffffc008ff58c8: ldp x21, x22, [sp], #16 |
0xffffffc008ff58cc: ldp x19, x20, [sp], #16 |
0xffffffc008ff58d0: ldp x29, x30, [sp], #16 |
0xffffffc008ff58d4: add x0, x7, #0x0 |
0xffffffc008ff58d8: autiasp |
0xffffffc008ff58dc: ret -------------------------------'
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220617105735.733938-3-jakub@cloudflare.com
2022-06-17 12:57:35 +02:00
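A simplified standalone C model of the prologue tweak described above (not the JIT's build_prologue()); it only shows that the TCC initialization and the tail-call jump pad are emitted for the entry program and skipped for subprograms, so the count in x26 survives bpf2bpf calls. The real prologue saves more register pairs than shown here.

#include <stdbool.h>
#include <stdio.h>

static void model_prologue(bool is_main_prog)
{
	puts("stp x29, x30, [sp, #-16]!");
	puts("mov x29, sp");
	puts("stp x25, x26, [sp, #-16]!");	/* x26 holds the tail call count */

	if (is_main_prog) {
		puts("mov x26, #0x0");	/* init TCC once, in the entry prog only */
		puts("bti j");		/* tail-call jump pad target */
	}
}

int main(void)
{
	puts("main prog prologue:");
	model_prologue(true);
	puts("\nsubprog prologue:");
	model_prologue(false);
	return 0;
}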
|
|
|
/* Indicate the JIT backend supports mixing bpf2bpf and tailcalls. */
|
|
|
|
bool bpf_jit_supports_subprog_tailcalls(void)
|
|
|
|
{
|
|
|
|
return true;
|
|
|
|
}
|
bpf, arm64: Implement bpf_arch_text_poke() for arm64
Implement bpf_arch_text_poke() for arm64, so bpf prog or bpf trampoline
can be patched with it.
When the target address is NULL, the original instruction is patched to
a NOP.
When the target address and the source address are within the branch
range, the original instruction is patched to a bl instruction to the
target address directly.
To support attaching bpf trampoline to both regular kernel function and
bpf prog, we follow the ftrace patchsite way for bpf prog. That is, two
instructions are inserted at the beginning of bpf prog, the first one
saves the return address to x9, and the second is a nop which will be
patched to a bl instruction when a bpf trampoline is attached.
However, when a bpf trampoline is attached to bpf prog, the distance
between target address and source address may exceed 128MB, the maximum
branch range, because bpf trampoline and bpf prog are allocated
separately with vmalloc. So long jump should be handled.
When a bpf prog is constructed, a plt pointing to empty trampoline
dummy_tramp is placed at the end:
bpf_prog:
mov x9, lr
nop // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad dummy_tramp // plt target
This is also the state when no trampoline is attached.
When a short-jump bpf trampoline is attached, the patchsite is patched to
a bl instruction to the trampoline directly:
bpf_prog:
mov x9, lr
bl <short-jump bpf trampoline address> // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad dummy_tramp // plt target
When a long-jump bpf trampoline is attached, the plt target is filled with
the trampoline address and the patchsite is patched to a bl instruction to
the plt:
bpf_prog:
mov x9, lr
bl plt // patchsite
...
ret
plt:
ldr x10, target
br x10
target:
.quad <long-jump bpf trampoline address>
dummy_tramp is used to prevent another CPU from jumping to an unknown
location during the patching process, making the patching process easier.
The patching process is as follows:
1. when neither the old address or the new address is a long jump, the
patchsite is replaced with a bl to the new address, or nop if the new
address is NULL;
2. when the old address is not long jump but the new one is, the
branch target address is written to plt first, then the patchsite
is replaced with a bl instruction to the plt;
3. when the old address is long jump but the new one is not, the address
of dummy_tramp is written to plt first, then the patchsite is replaced
with a bl to the new address, or a nop if the new address is NULL;
4. when both the old address and the new address are long jump, the
new address is written to plt and the patchsite is not changed.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20220711150823.2128542-4-xukuohai@huawei.com
2022-07-11 11:08:22 -04:00
|
|
|
|
2022-07-11 11:08:23 -04:00
|
|
|
static void invoke_bpf_prog(struct jit_ctx *ctx, struct bpf_tramp_link *l,
|
|
|
|
int args_off, int retval_off, int run_ctx_off,
|
|
|
|
bool save_ret)
|
|
|
|
{
|
2022-08-08 00:07:35 -04:00
|
|
|
__le32 *branch;
|
2022-07-11 11:08:23 -04:00
|
|
|
u64 enter_prog;
|
|
|
|
u64 exit_prog;
|
|
|
|
struct bpf_prog *p = l->link.prog;
|
|
|
|
int cookie_off = offsetof(struct bpf_tramp_run_ctx, bpf_cookie);
|
|
|
|
|
2022-10-25 11:45:16 -07:00
|
|
|
enter_prog = (u64)bpf_trampoline_enter(p);
|
|
|
|
exit_prog = (u64)bpf_trampoline_exit(p);
|
2022-07-11 11:08:23 -04:00
|
|
|
|
|
|
|
if (l->cookie == 0) {
|
|
|
|
/* if cookie is zero, one instruction is enough to store it */
|
|
|
|
emit(A64_STR64I(A64_ZR, A64_SP, run_ctx_off + cookie_off), ctx);
|
|
|
|
} else {
|
|
|
|
emit_a64_mov_i64(A64_R(10), l->cookie, ctx);
|
|
|
|
emit(A64_STR64I(A64_R(10), A64_SP, run_ctx_off + cookie_off),
|
|
|
|
ctx);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* save p to callee saved register x19 to avoid loading p with mov_i64
|
|
|
|
* each time.
|
|
|
|
*/
|
|
|
|
emit_addr_mov_i64(A64_R(19), (const u64)p, ctx);
|
|
|
|
|
|
|
|
/* arg1: prog */
|
|
|
|
emit(A64_MOV(1, A64_R(0), A64_R(19)), ctx);
|
|
|
|
/* arg2: &run_ctx */
|
|
|
|
emit(A64_ADD_I(1, A64_R(1), A64_SP, run_ctx_off), ctx);
|
|
|
|
|
|
|
|
emit_call(enter_prog, ctx);
|
|
|
|
|
2024-04-16 14:42:07 +08:00
|
|
|
/* save return value to callee saved register x20 */
|
|
|
|
emit(A64_MOV(1, A64_R(20), A64_R(0)), ctx);
|
|
|
|
|
2022-07-11 11:08:23 -04:00
|
|
|
/* if (__bpf_prog_enter(prog) == 0)
|
|
|
|
* goto skip_exec_of_prog;
|
|
|
|
*/
|
|
|
|
branch = ctx->image + ctx->idx;
|
|
|
|
emit(A64_NOP, ctx);
|
|
|
|
|
|
|
|
emit(A64_ADD_I(1, A64_R(0), A64_SP, args_off), ctx);
|
|
|
|
if (!p->jited)
|
|
|
|
emit_addr_mov_i64(A64_R(1), (const u64)p->insnsi, ctx);
|
|
|
|
|
|
|
|
emit_call((const u64)p->bpf_func, ctx);
|
|
|
|
|
|
|
|
if (save_ret)
|
|
|
|
emit(A64_STR64I(A64_R(0), A64_SP, retval_off), ctx);
|
|
|
|
|
|
|
|
if (ctx->image) {
|
|
|
|
int offset = &ctx->image[ctx->idx] - branch;
|
2022-08-08 00:07:35 -04:00
|
|
|
*branch = cpu_to_le32(A64_CBZ(1, A64_R(0), offset));
|
2022-07-11 11:08:23 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
/* arg1: prog */
|
|
|
|
emit(A64_MOV(1, A64_R(0), A64_R(19)), ctx);
|
|
|
|
/* arg2: start time */
|
|
|
|
emit(A64_MOV(1, A64_R(1), A64_R(20)), ctx);
|
|
|
|
/* arg3: &run_ctx */
|
|
|
|
emit(A64_ADD_I(1, A64_R(2), A64_SP, run_ctx_off), ctx);
|
|
|
|
|
|
|
|
emit_call(exit_prog, ctx);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void invoke_bpf_mod_ret(struct jit_ctx *ctx, struct bpf_tramp_links *tl,
|
|
|
|
int args_off, int retval_off, int run_ctx_off,
|
2022-08-08 00:07:35 -04:00
|
|
|
__le32 **branches)
|
2022-07-11 11:08:23 -04:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
/* The first fmod_ret program will receive a garbage return value.
|
|
|
|
* Set this to 0 to avoid confusing the program.
|
|
|
|
*/
|
|
|
|
emit(A64_STR64I(A64_ZR, A64_SP, retval_off), ctx);
|
|
|
|
for (i = 0; i < tl->nr_links; i++) {
|
|
|
|
invoke_bpf_prog(ctx, tl->links[i], args_off, retval_off,
|
|
|
|
run_ctx_off, true);
|
|
|
|
/* if (*(u64 *)(sp + retval_off) != 0)
|
|
|
|
* goto do_fexit;
|
|
|
|
*/
|
|
|
|
emit(A64_LDR64I(A64_R(10), A64_SP, retval_off), ctx);
|
|
|
|
/* Save the location of branch, and generate a nop.
|
|
|
|
* This nop will be replaced with a cbnz later.
|
|
|
|
*/
|
|
|
|
branches[i] = ctx->image + ctx->idx;
|
|
|
|
emit(A64_NOP, ctx);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-05-11 16:05:07 +02:00
|
|
|
static void save_args(struct jit_ctx *ctx, int args_off, int nregs)
|
2022-07-11 11:08:23 -04:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
2023-05-11 16:05:07 +02:00
|
|
|
for (i = 0; i < nregs; i++) {
|
2022-07-11 11:08:23 -04:00
|
|
|
emit(A64_STR64I(i, A64_SP, args_off), ctx);
|
|
|
|
args_off += 8;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-05-11 16:05:07 +02:00
|
|
|
static void restore_args(struct jit_ctx *ctx, int args_off, int nregs)
|
2022-07-11 11:08:23 -04:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
2023-05-11 16:05:07 +02:00
|
|
|
for (i = 0; i < nregs; i++) {
|
2022-07-11 11:08:23 -04:00
|
|
|
emit(A64_LDR64I(i, A64_SP, args_off), ctx);
|
|
|
|
args_off += 8;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
bpf, arm64: Remove garbage frame for struct_ops trampoline
The callsite layout for arm64 fentry is:
mov x9, lr
nop
When a bpf prog is attached, the nop instruction is patched to a call
to bpf trampoline:
mov x9, lr
bl <bpf trampoline>
So two return addresses are passed to bpf trampoline: the return address
for the traced function/prog, stored in x9, and the return address for
the bpf trampoline itself, stored in lr. To obtain a full and accurate
call stack, the bpf trampoline constructs two fake function frames using
x9 and lr.
However, struct_ops progs are invoked directly as function callbacks,
meaning that x9 is not set as it is in the fentry callsite. In this case,
the frame constructed using x9 is garbage. The following stack trace for
struct_ops, captured by perf sampling, illustrates this issue, where
tcp_ack+0x404 is a garbage frame:
ffffffc0801a04b4 bpf_prog_50992e55a0f655a9_bpf_cubic_cong_avoid+0x98 (bpf_prog_50992e55a0f655a9_bpf_cubic_cong_avoid)
ffffffc0801a228c [unknown] ([kernel.kallsyms]) // bpf trampoline
ffffffd08d362590 tcp_ack+0x798 ([kernel.kallsyms]) // caller for bpf trampoline
ffffffd08d3621fc tcp_ack+0x404 ([kernel.kallsyms]) // garbage frame
ffffffd08d36452c tcp_rcv_established+0x4ac ([kernel.kallsyms])
ffffffd08d375c58 tcp_v4_do_rcv+0x1f0 ([kernel.kallsyms])
ffffffd08d378630 tcp_v4_rcv+0xeb8 ([kernel.kallsyms])
To fix it, construct only one frame using lr for struct_ops.
The above stack trace also indicates that there is no kernel symbol for
struct_ops bpf trampoline. This will be addressed in a follow-up patch.
Fixes: efc9909fdce0 ("bpf, arm64: Add bpf trampoline for arm64")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Acked-by: Puranjay Mohan <puranjay@kernel.org>
Tested-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20241025085220.533949-1-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-25 16:52:20 +08:00
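A simplified standalone C model of the frame construction before and after this fix (not the kernel's prepare_trampoline()); it only shows that the parent frame built from x9 is skipped for struct_ops, mirroring the is_struct_ops check in the trampoline code further down.

#include <stdbool.h>
#include <stdio.h>

static void model_tramp_frames(bool is_struct_ops)
{
	if (!is_struct_ops) {
		/* frame for the traced function's parent, built from x9;
		 * only valid when entered through the fentry patchsite
		 */
		puts("stp x29, x9, [sp, #-16]!");
		puts("mov x29, sp");
	}
	/* frame for the trampoline itself, built from lr */
	puts("stp x29, x30, [sp, #-16]!");
	puts("mov x29, sp");
}

int main(void)
{
	puts("fentry trampoline:");
	model_tramp_frames(false);
	puts("\nstruct_ops trampoline:");
	model_tramp_frames(true);
	return 0;
}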
|
|
|
static bool is_struct_ops_tramp(const struct bpf_tramp_links *fentry_links)
|
|
|
|
{
|
|
|
|
return fentry_links->nr_links == 1 &&
|
|
|
|
fentry_links->links[0]->link.type == BPF_LINK_TYPE_STRUCT_OPS;
|
|
|
|
}
|
|
|
|
|
2022-07-11 11:08:23 -04:00
|
|
|
/* Based on the x86's implementation of arch_prepare_bpf_trampoline().
|
|
|
|
*
|
|
|
|
* bpf prog and function entry before bpf trampoline hooked:
|
|
|
|
* mov x9, lr
|
|
|
|
* nop
|
|
|
|
*
|
|
|
|
* bpf prog and function entry after bpf trampoline hooked:
|
|
|
|
* mov x9, lr
|
|
|
|
* bl <bpf_trampoline or plt>
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
|
2023-12-06 14:40:49 -08:00
|
|
|
struct bpf_tramp_links *tlinks, void *func_addr,
|
2023-05-11 16:05:07 +02:00
|
|
|
int nregs, u32 flags)
|
2022-07-11 11:08:23 -04:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
int stack_size;
|
|
|
|
int retaddr_off;
|
|
|
|
int regs_off;
|
|
|
|
int retval_off;
|
|
|
|
int args_off;
|
2023-05-11 16:05:07 +02:00
|
|
|
int nregs_off;
|
2022-07-11 11:08:23 -04:00
|
|
|
int ip_off;
|
|
|
|
int run_ctx_off;
|
|
|
|
struct bpf_tramp_links *fentry = &tlinks[BPF_TRAMP_FENTRY];
|
|
|
|
struct bpf_tramp_links *fexit = &tlinks[BPF_TRAMP_FEXIT];
|
|
|
|
struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
|
|
|
|
bool save_ret;
|
2022-08-08 00:07:35 -04:00
|
|
|
__le32 **branches = NULL;
|
bpf, arm64: Remove garbage frame for struct_ops trampoline
2024-10-25 16:52:20 +08:00
|
|
|
bool is_struct_ops = is_struct_ops_tramp(fentry);
|
2022-07-11 11:08:23 -04:00
|
|
|
|
|
|
|
/* trampoline stack layout:
|
|
|
|
* [ parent ip ]
|
|
|
|
* [ FP ]
|
|
|
|
* SP + retaddr_off [ self ip ]
|
|
|
|
* [ FP ]
|
|
|
|
*
|
|
|
|
* [ padding ] align SP to multiples of 16
|
|
|
|
*
|
|
|
|
* [ x20 ] callee saved reg x20
|
|
|
|
* SP + regs_off [ x19 ] callee saved reg x19
|
|
|
|
*
|
|
|
|
* SP + retval_off [ return value ] BPF_TRAMP_F_CALL_ORIG or
|
|
|
|
* BPF_TRAMP_F_RET_FENTRY_RET
|
|
|
|
*
|
2023-05-11 16:05:07 +02:00
|
|
|
* [ arg reg N ]
|
2022-07-11 11:08:23 -04:00
|
|
|
* [ ... ]
|
2023-05-11 16:05:07 +02:00
|
|
|
* SP + args_off [ arg reg 1 ]
|
2022-07-11 11:08:23 -04:00
|
|
|
*
|
2023-05-11 16:05:07 +02:00
|
|
|
* SP + nregs_off [ arg regs count ]
|
2022-07-11 11:08:23 -04:00
|
|
|
*
|
|
|
|
* SP + ip_off [ traced function ] BPF_TRAMP_F_IP_ARG flag
|
|
|
|
*
|
|
|
|
* SP + run_ctx_off [ bpf_tramp_run_ctx ]
|
|
|
|
*/
|
|
|
|
|
|
|
|
	stack_size = 0;
	run_ctx_off = stack_size;
	/* room for bpf_tramp_run_ctx */
	stack_size += round_up(sizeof(struct bpf_tramp_run_ctx), 8);

	ip_off = stack_size;
	/* room for IP address argument */
	if (flags & BPF_TRAMP_F_IP_ARG)
		stack_size += 8;

	nregs_off = stack_size;
	/* room for args count */
	stack_size += 8;

	args_off = stack_size;
	/* room for args */
	stack_size += nregs * 8;

	/* room for return value */
	retval_off = stack_size;
	save_ret = flags & (BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_RET_FENTRY_RET);
	if (save_ret)
		stack_size += 8;

	/* room for callee saved registers, currently x19 and x20 are used */
	regs_off = stack_size;
	stack_size += 16;

	/* round up to multiples of 16 to avoid SPAlignmentFault */
	stack_size = round_up(stack_size, 16);

	/* the return address is located above FP */
	retaddr_off = stack_size + 8;
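
	/* Worked example (an editor's illustration, not from the original
	 * source): for a traced function with nregs = 2, save_ret set and no
	 * BPF_TRAMP_F_IP_ARG, and with R = round_up(sizeof(struct
	 * bpf_tramp_run_ctx), 8), the code above yields
	 *
	 *   run_ctx_off = 0
	 *   ip_off      = R              (no slot added, IP arg not requested)
	 *   nregs_off   = R
	 *   args_off    = R + 8
	 *   retval_off  = R + 24         (8 bytes per arg reg, 2 arg regs)
	 *   regs_off    = R + 32
	 *   stack_size  = round_up(R + 48, 16)
	 *   retaddr_off = stack_size + 8
	 */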

	/* bpf trampoline may be invoked by 3 instruction types:
	 * 1. bl, attached to bpf prog or kernel function via short jump
	 * 2. br, attached to bpf prog or kernel function via long jump
	 * 3. blr, working as a function pointer, used by struct_ops.
	 * So BTI_JC should be used here to support both br and blr.
	 */
	emit_bti(A64_BTI_JC, ctx);

	/* x9 is not set for struct_ops */
	if (!is_struct_ops) {
		/* frame for parent function */
		emit(A64_PUSH(A64_FP, A64_R(9), A64_SP), ctx);
		emit(A64_MOV(1, A64_FP, A64_SP), ctx);
	}

	/* frame for patched function for tracing, or caller for struct_ops */
	emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
	emit(A64_MOV(1, A64_FP, A64_SP), ctx);
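
	/* Illustrative frame chains after the pushes above (an editor's
	 * sketch, not from the original source):
	 *
	 *   tracing attach (x9 valid):      struct_ops call (x9 unset):
	 *     FP  -> [ FP1          ]         FP -> [ caller FP ]
	 *            [ lr/self ip   ]               [ lr/ret ip ]
	 *     FP1 -> [ old FP       ]
	 *            [ x9/parent ip ]
	 *
	 * so a struct_ops trampoline keeps a single genuine frame instead of
	 * an extra one built from a garbage x9.
	 */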

	/* allocate stack space */
	emit(A64_SUB_I(1, A64_SP, A64_SP, stack_size), ctx);

	if (flags & BPF_TRAMP_F_IP_ARG) {
		/* save ip address of the traced function */
		emit_addr_mov_i64(A64_R(10), (const u64)func_addr, ctx);
		emit(A64_STR64I(A64_R(10), A64_SP, ip_off), ctx);
	}

	/* save arg regs count */
	emit(A64_MOVZ(1, A64_R(10), nregs, 0), ctx);
	emit(A64_STR64I(A64_R(10), A64_SP, nregs_off), ctx);

	/* save arg regs */
	save_args(ctx, args_off, nregs);

	/* save callee saved registers */
	emit(A64_STR64I(A64_R(19), A64_SP, regs_off), ctx);
	emit(A64_STR64I(A64_R(20), A64_SP, regs_off + 8), ctx);

	if (flags & BPF_TRAMP_F_CALL_ORIG) {
		/* for the first pass, assume the worst case */
		if (!ctx->image)
			ctx->idx += 4;
		else
			emit_a64_mov_i64(A64_R(0), (const u64)im, ctx);
		emit_call((const u64)__bpf_tramp_enter, ctx);
	}

	for (i = 0; i < fentry->nr_links; i++)
		invoke_bpf_prog(ctx, fentry->links[i], args_off,
				retval_off, run_ctx_off,
				flags & BPF_TRAMP_F_RET_FENTRY_RET);

	if (fmod_ret->nr_links) {
		branches = kcalloc(fmod_ret->nr_links, sizeof(__le32 *),
				   GFP_KERNEL);
		if (!branches)
			return -ENOMEM;

		invoke_bpf_mod_ret(ctx, fmod_ret, args_off, retval_off,
				   run_ctx_off, branches);
	}

	if (flags & BPF_TRAMP_F_CALL_ORIG) {
		restore_args(ctx, args_off, nregs);
		/* call original func */
		emit(A64_LDR64I(A64_R(10), A64_SP, retaddr_off), ctx);
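		/* Illustrative sketch of the sequence emitted below (an
		 * editor's note, not literal JIT output): x10 holds the
		 * "self ip" slot loaded above, i.e. the address right after
		 * the patchsite in the traced function, so roughly
		 *
		 *   adr lr, #8      // resume at the str after returning
		 *   ret x10         // jump back into the patched function
		 *   str x0, [sp, #retval_off]
		 *
		 * RET is used instead of BLR because the instruction after
		 * the patchsite is not necessarily a BTI/PACIASP landing pad,
		 * and RET is not subject to BTI target checks.
		 */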
		emit(A64_ADR(A64_LR, AARCH64_INSN_SIZE * 2), ctx);
		emit(A64_RET(A64_R(10)), ctx);
		/* store return value */
		emit(A64_STR64I(A64_R(0), A64_SP, retval_off), ctx);
		/* reserve a nop for bpf_tramp_image_put */
		im->ip_after_call = ctx->ro_image + ctx->idx;
		emit(A64_NOP, ctx);
	}

	/* update the branches saved in invoke_bpf_mod_ret with cbnz */
	for (i = 0; i < fmod_ret->nr_links && ctx->image != NULL; i++) {
		int offset = &ctx->image[ctx->idx] - branches[i];
		*branches[i] = cpu_to_le32(A64_CBNZ(1, A64_R(10), offset));
	}

	for (i = 0; i < fexit->nr_links; i++)
		invoke_bpf_prog(ctx, fexit->links[i], args_off, retval_off,
				run_ctx_off, false);

	if (flags & BPF_TRAMP_F_CALL_ORIG) {
		im->ip_epilogue = ctx->ro_image + ctx->idx;
		/* for the first pass, assume the worst case */
		if (!ctx->image)
			ctx->idx += 4;
		else
			emit_a64_mov_i64(A64_R(0), (const u64)im, ctx);
		emit_call((const u64)__bpf_tramp_exit, ctx);
	}

	if (flags & BPF_TRAMP_F_RESTORE_REGS)
		restore_args(ctx, args_off, nregs);

	/* restore callee saved registers x19 and x20 */
	emit(A64_LDR64I(A64_R(19), A64_SP, regs_off), ctx);
	emit(A64_LDR64I(A64_R(20), A64_SP, regs_off + 8), ctx);

	if (save_ret)
		emit(A64_LDR64I(A64_R(0), A64_SP, retval_off), ctx);

	/* reset SP */
	emit(A64_MOV(1, A64_SP, A64_FP), ctx);

	if (is_struct_ops) {
		emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
		emit(A64_RET(A64_LR), ctx);
	} else {
		/* pop frames */
		emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
		emit(A64_POP(A64_FP, A64_R(9), A64_SP), ctx);

		if (flags & BPF_TRAMP_F_SKIP_FRAME) {
			/* skip patched function, return to parent */
			emit(A64_MOV(1, A64_LR, A64_R(9)), ctx);
			emit(A64_RET(A64_R(9)), ctx);
		} else {
			/* return to patched function */
			emit(A64_MOV(1, A64_R(10), A64_LR), ctx);
			emit(A64_MOV(1, A64_LR, A64_R(9)), ctx);
			emit(A64_RET(A64_R(10)), ctx);
		}
	}

	kfree(branches);

	return ctx->idx;
}

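/* Rough shape of a generated trampoline with BPF_TRAMP_F_CALL_ORIG set
 * (an editor's illustrative summary of the code above, not literal
 * disassembly):
 *
 *   bti jc
 *   stp x29, x9,  [sp, #-16]!   // parent frame, skipped for struct_ops
 *   stp x29, x30, [sp, #-16]!   // trampoline frame
 *   sub sp, sp, #stack_size
 *   ... save arg count, arg regs, x19/x20 ...
 *   bl  __bpf_tramp_enter
 *   ... run fentry and fmod_ret progs ...
 *   ... call the original function via ldr/adr/ret ...
 *   ... run fexit progs ...
 *   bl  __bpf_tramp_exit
 *   ... restore regs, reset sp, pop frames, return ...
 */
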
static int btf_func_model_nregs(const struct btf_func_model *m)
{
	int nregs = m->nr_args;
	int i;

	/* extra registers needed for struct argument */
	for (i = 0; i < MAX_BPF_FUNC_ARGS; i++) {
		/* The arg_size is at most 16 bytes, enforced by the verifier. */
		if (m->arg_flags[i] & BTF_FMODEL_STRUCT_ARG)
			nregs += (m->arg_size[i] + 7) / 8 - 1;
	}

	return nregs;
}
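
/* Worked example (an editor's illustration, not from the original source):
 * for a model with nr_args = 3 where one argument is a 16-byte struct passed
 * by value, (16 + 7) / 8 - 1 = 1 extra register is needed, so nregs = 4 and
 * the struct occupies two consecutive argument registers.
 */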

int arch_bpf_trampoline_size(const struct btf_func_model *m, u32 flags,
			     struct bpf_tramp_links *tlinks, void *func_addr)
{
	struct jit_ctx ctx = {
		.image = NULL,
		.idx = 0,
	};
	struct bpf_tramp_image im;
	int nregs, ret;

	nregs = btf_func_model_nregs(m);
	/* the first 8 registers are used for arguments */
	if (nregs > 8)
		return -ENOTSUPP;

	ret = prepare_trampoline(&ctx, &im, tlinks, func_addr, nregs, flags);
	if (ret < 0)
		return ret;

	/* ret is an instruction count; convert it to a size in bytes */
	return ret * AARCH64_INSN_SIZE;
}

void *arch_alloc_bpf_trampoline(unsigned int size)
{
	return bpf_prog_pack_alloc(size, jit_fill_hole);
}

void arch_free_bpf_trampoline(void *image, unsigned int size)
{
	bpf_prog_pack_free(image, size);
}

int arch_protect_bpf_trampoline(void *image, unsigned int size)
{
	return 0;
}

int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *ro_image,
				void *ro_image_end, const struct btf_func_model *m,
				u32 flags, struct bpf_tramp_links *tlinks,
				void *func_addr)
{
	int ret, nregs;
	void *image, *tmp;
	u32 size = ro_image_end - ro_image;

	/* image doesn't need to be in module memory range, so we can
	 * use kvmalloc.
	 */
	image = kvmalloc(size, GFP_KERNEL);
	if (!image)
		return -ENOMEM;

	struct jit_ctx ctx = {
		.image = image,
		.ro_image = ro_image,
		.idx = 0,
		.write = true,
	};

	nregs = btf_func_model_nregs(m);
	/* the first 8 registers are used for arguments */
	if (nregs > 8) {
		ret = -ENOTSUPP;
		goto out;
	}

	jit_fill_hole(image, (unsigned int)(ro_image_end - ro_image));
	ret = prepare_trampoline(&ctx, im, tlinks, func_addr, nregs, flags);

	if (ret > 0 && validate_code(&ctx) < 0) {
		ret = -EINVAL;
		goto out;
	}

	if (ret > 0)
		ret *= AARCH64_INSN_SIZE;

	tmp = bpf_arch_text_copy(ro_image, image, size);
	if (IS_ERR(tmp)) {
		ret = PTR_ERR(tmp);
		goto out;
	}

	bpf_flush_icache(ro_image, ro_image + size);
out:
	kvfree(image);
	return ret;
}

static bool is_long_jump(void *ip, void *target)
{
	long offset;

	/* NULL target means this is a NOP */
	if (!target)
		return false;

	offset = (long)target - (long)ip;
	return offset < -SZ_128M || offset >= SZ_128M;
}

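/* Note on the bound used above (an editor's derivation): B/BL encode a signed
 * 26-bit immediate in units of 4 bytes, so a direct branch reaches
 * 2^25 * 4 bytes = 128MB in either direction, hence SZ_128M.
 */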

static int gen_branch_or_nop(enum aarch64_insn_branch_type type, void *ip,
			     void *addr, void *plt, u32 *insn)
{
	void *target;

	if (!addr) {
		*insn = aarch64_insn_gen_nop();
		return 0;
	}

	if (is_long_jump(ip, addr))
		target = plt;
	else
		target = addr;

	*insn = aarch64_insn_gen_branch_imm((unsigned long)ip,
					    (unsigned long)target,
					    type);

	return *insn != AARCH64_BREAK_FAULT ? 0 : -EFAULT;
}

/* Replace the branch instruction from @ip to @old_addr in a bpf prog or a bpf
 * trampoline with the branch instruction from @ip to @new_addr. If @old_addr
 * or @new_addr is NULL, the old or new instruction is NOP.
 *
 * When @ip is the bpf prog entry, a bpf trampoline is being attached or
 * detached. Since bpf trampoline and bpf prog are allocated separately with
 * vmalloc, the address distance may exceed 128MB, the maximum branch range.
 * So long jump should be handled.
 *
 * When a bpf prog is constructed, a plt pointing to empty trampoline
 * dummy_tramp is placed at the end:
 *
 *      bpf_prog:
 *              mov x9, lr
 *              nop // patchsite
 *              ...
 *              ret
 *
 *      plt:
 *              ldr x10, target
 *              br x10
 *      target:
 *              .quad dummy_tramp // plt target
 *
 * This is also the state when no trampoline is attached.
 *
 * When a short-jump bpf trampoline is attached, the patchsite is patched
 * to a bl instruction to the trampoline directly:
 *
 *      bpf_prog:
 *              mov x9, lr
 *              bl <short-jump bpf trampoline address> // patchsite
 *              ...
 *              ret
 *
 *      plt:
 *              ldr x10, target
 *              br x10
 *      target:
 *              .quad dummy_tramp // plt target
 *
 * When a long-jump bpf trampoline is attached, the plt target is filled with
 * the trampoline address and the patchsite is patched to a bl instruction to
 * the plt:
 *
 *      bpf_prog:
 *              mov x9, lr
 *              bl plt // patchsite
 *              ...
 *              ret
 *
 *      plt:
 *              ldr x10, target
 *              br x10
 *      target:
 *              .quad <long-jump bpf trampoline address> // plt target
 *
 * The dummy_tramp is used to prevent another CPU from jumping to unknown
 * locations during the patching process, making the patching process easier.
 */
int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
		       void *old_addr, void *new_addr)
{
	int ret;
	u32 old_insn;
	u32 new_insn;
	u32 replaced;
	struct bpf_plt *plt = NULL;
	unsigned long size = 0UL;
	unsigned long offset = ~0UL;
	enum aarch64_insn_branch_type branch_type;
	char namebuf[KSYM_NAME_LEN];
	void *image = NULL;
	u64 plt_target = 0ULL;
	bool poking_bpf_entry;

	if (!__bpf_address_lookup((unsigned long)ip, &size, &offset, namebuf))
		/* Only poking bpf text is supported. Since kernel function
		 * entry is set up by ftrace, we rely on ftrace to poke kernel
		 * functions.
		 */
		return -ENOTSUPP;

	image = ip - offset;
	/* zero offset means we're poking bpf prog entry */
	poking_bpf_entry = (offset == 0UL);

	/* bpf prog entry, find plt and the real patchsite */
	if (poking_bpf_entry) {
		/* plt is located at the end of the bpf prog */
		plt = image + size - PLT_TARGET_OFFSET;

		/* skip to the nop instruction in bpf prog entry:
		 * bti c // if BTI enabled
		 * mov x9, x30
		 * nop
		 */
		ip = image + POKE_OFFSET * AARCH64_INSN_SIZE;
	}

	/* long jump is only possible at bpf prog entry */
	if (WARN_ON((is_long_jump(ip, new_addr) || is_long_jump(ip, old_addr)) &&
		    !poking_bpf_entry))
		return -EINVAL;

	if (poke_type == BPF_MOD_CALL)
		branch_type = AARCH64_INSN_BRANCH_LINK;
	else
		branch_type = AARCH64_INSN_BRANCH_NOLINK;

	if (gen_branch_or_nop(branch_type, ip, old_addr, plt, &old_insn) < 0)
		return -EFAULT;

	if (gen_branch_or_nop(branch_type, ip, new_addr, plt, &new_insn) < 0)
		return -EFAULT;

	if (is_long_jump(ip, new_addr))
		plt_target = (u64)new_addr;
	else if (is_long_jump(ip, old_addr))
		/* if the old target is a long jump and the new target is not,
		 * restore the plt target to dummy_tramp, so there is always a
		 * legal and harmless address stored in plt target, and we'll
		 * never jump from plt to an unknown place.
		 */
		plt_target = (u64)&dummy_tramp;

	if (plt_target) {
		/* non-zero plt_target indicates we're patching a bpf prog,
		 * which is read only.
		 */
		if (set_memory_rw(PAGE_MASK & ((uintptr_t)&plt->target), 1))
			return -EFAULT;
		WRITE_ONCE(plt->target, plt_target);
		set_memory_ro(PAGE_MASK & ((uintptr_t)&plt->target), 1);
		/* since plt target points to either the new trampoline
		 * or dummy_tramp, even if another CPU reads the old plt
		 * target value before fetching the bl instruction to plt,
		 * it will be brought back by dummy_tramp, so no barrier is
		 * required here.
		 */
	}

	/* if the old target and the new target are both long jumps, no
	 * patching is required
	 */
	if (old_insn == new_insn)
		return 0;

	mutex_lock(&text_mutex);
	if (aarch64_insn_read(ip, &replaced)) {
		ret = -EFAULT;
		goto out;
	}

	if (replaced != old_insn) {
		ret = -EFAULT;
		goto out;
	}

	/* We call aarch64_insn_patch_text_nosync() to replace the instruction
	 * atomically, so no other CPU will fetch a half-new and half-old
	 * instruction. But there is a chance that another CPU executes the
	 * old instruction after the patching operation finishes (e.g.,
	 * pipeline not flushed, or icache not synchronized yet).
	 *
	 * 1. when a new trampoline is attached, it is not a problem for
	 *    different CPUs to jump to different trampolines temporarily.
	 *
	 * 2. when an old trampoline is freed, we should wait for all other
	 *    CPUs to exit the trampoline and make sure the trampoline is no
	 *    longer reachable. Since bpf_tramp_image_put() already uses
	 *    percpu_ref and task-based rcu to do that sync, there is no need
	 *    to call the sync variant here; see bpf_tramp_image_put() for
	 *    details.
	 */
	ret = aarch64_insn_patch_text_nosync(ip, new_insn);
out:
	mutex_unlock(&text_mutex);

	return ret;
}
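
/* Hypothetical usage sketch (an editor's illustration, not part of the
 * original file; prog_ip and tramp_addr are placeholder names): attaching a
 * trampoline to a bpf prog turns the prog's nop patchsite into a call, and
 * detaching turns it back into a nop:
 *
 *	// attach: nop -> bl <trampoline>
 *	ret = bpf_arch_text_poke(prog_ip, BPF_MOD_CALL, NULL, tramp_addr);
 *	// detach: bl <trampoline> -> nop
 *	ret = bpf_arch_text_poke(prog_ip, BPF_MOD_CALL, tramp_addr, NULL);
 */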

bool bpf_jit_supports_ptr_xchg(void)
{
	return true;
}

bool bpf_jit_supports_exceptions(void)
{
	/* We unwind through both kernel frames, starting from within the
	 * bpf_throw call, and BPF frames. Therefore we require the FP unwinder
	 * to be enabled to walk kernel frames and reach BPF frames in the
	 * stack trace. The ARM64 kernel is always compiled with
	 * CONFIG_FRAME_POINTER=y.
	 */
	return true;
}

bool bpf_jit_supports_arena(void)
{
	return true;
}

bool bpf_jit_supports_insn(struct bpf_insn *insn, bool in_arena)
{
	if (!in_arena)
		return true;
	switch (insn->code) {
	case BPF_STX | BPF_ATOMIC | BPF_W:
	case BPF_STX | BPF_ATOMIC | BPF_DW:
		if (!cpus_have_cap(ARM64_HAS_LSE_ATOMICS))
			return false;
	}
	return true;
}

bool bpf_jit_supports_percpu_insn(void)
{
	return true;
}

bool bpf_jit_inlines_helper_call(s32 imm)
{
	switch (imm) {
	case BPF_FUNC_get_smp_processor_id:
	case BPF_FUNC_get_current_task:
	case BPF_FUNC_get_current_task_btf:
		return true;
	default:
		return false;
	}
}
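
/* Illustrative effect of the inlining advertised above (an editor's note,
 * offsets indicative): the JIT collapses a call to
 * bpf_get_current_task_btf() from a mov/movk/movk/blr sequence into a single
 * "mrs x7, sp_el0", and bpf_get_smp_processor_id() into an sp_el0 read plus
 * a load of the cpu field from thread_info.
 */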

void bpf_jit_free(struct bpf_prog *prog)
{
	if (prog->jited) {
		struct arm64_jit_data *jit_data = prog->aux->jit_data;
		struct bpf_binary_header *hdr;

		/*
		 * If we fail the final pass of JIT (from jit_subprogs),
		 * the program may not be finalized yet. Call finalize here
		 * before freeing it.
		 */
		if (jit_data) {
			bpf_arch_text_copy(&jit_data->ro_header->size, &jit_data->header->size,
					   sizeof(jit_data->header->size));
			kfree(jit_data);
		}
		hdr = bpf_jit_binary_pack_hdr(prog);
		bpf_jit_binary_pack_free(hdr, NULL);
		WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(prog));
	}

	bpf_prog_unlock_free(prog);
}