mirror of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-05-19 15:48:24 +00:00
docs: networking: convert filter.txt to ReST
- add SPDX header; - adjust title markup; - mark code blocks and literals as such; - use footnote markup; - mark tables as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
This commit is contained in:
parent
aee113427c
commit
cb3f0d56e1
7 changed files with 484 additions and 377 deletions
|
@ -7,7 +7,7 @@ Filter) facility, with a focus on the extended BPF version (eBPF).
|
||||||
|
|
||||||
This kernel side documentation is still work in progress. The main
|
This kernel side documentation is still work in progress. The main
|
||||||
textual documentation is (for historical reasons) described in
|
textual documentation is (for historical reasons) described in
|
||||||
`Documentation/networking/filter.txt`_, which describe both classical
|
`Documentation/networking/filter.rst`_, which describe both classical
|
||||||
and extended BPF instruction-set.
|
and extended BPF instruction-set.
|
||||||
The Cilium project also maintains a `BPF and XDP Reference Guide`_
|
The Cilium project also maintains a `BPF and XDP Reference Guide`_
|
||||||
that goes into great technical depth about the BPF Architecture.
|
that goes into great technical depth about the BPF Architecture.
|
||||||
|
@ -59,7 +59,7 @@ Testing and debugging BPF
|
||||||
|
|
||||||
|
|
||||||
.. Links:
|
.. Links:
|
||||||
.. _Documentation/networking/filter.txt: ../networking/filter.txt
|
.. _Documentation/networking/filter.rst: ../networking/filter.txt
|
||||||
.. _man-pages: https://www.kernel.org/doc/man-pages/
|
.. _man-pages: https://www.kernel.org/doc/man-pages/
|
||||||
.. _bpf(2): http://man7.org/linux/man-pages/man2/bpf.2.html
|
.. _bpf(2): http://man7.org/linux/man-pages/man2/bpf.2.html
|
||||||
.. _BPF and XDP Reference Guide: http://cilium.readthedocs.io/en/latest/bpf/
|
.. _BPF and XDP Reference Guide: http://cilium.readthedocs.io/en/latest/bpf/
|
||||||
|
|
|
@ -1,3 +1,6 @@
|
||||||
|
.. SPDX-License-Identifier: GPL-2.0
|
||||||
|
|
||||||
|
=======================================================
|
||||||
Linux Socket Filtering aka Berkeley Packet Filter (BPF)
|
Linux Socket Filtering aka Berkeley Packet Filter (BPF)
|
||||||
=======================================================
|
=======================================================
|
||||||
|
|
||||||
|
@ -42,10 +45,10 @@ displays what is being placed into this structure.
|
||||||
|
|
||||||
Although we were only speaking about sockets here, BPF in Linux is used
|
Although we were only speaking about sockets here, BPF in Linux is used
|
||||||
in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel
|
in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel
|
||||||
qdisc layer, SECCOMP-BPF (SECure COMPuting [1]), and lots of other places
|
qdisc layer, SECCOMP-BPF (SECure COMPuting [1]_), and lots of other places
|
||||||
such as team driver, PTP code, etc where BPF is being used.
|
such as team driver, PTP code, etc where BPF is being used.
|
||||||
|
|
||||||
[1] Documentation/userspace-api/seccomp_filter.rst
|
.. [1] Documentation/userspace-api/seccomp_filter.rst
|
||||||
|
|
||||||
Original BPF paper:
|
Original BPF paper:
|
||||||
|
|
||||||
|
@ -59,7 +62,7 @@ Structure
|
||||||
---------
|
---------
|
||||||
|
|
||||||
User space applications include <linux/filter.h> which contains the
|
User space applications include <linux/filter.h> which contains the
|
||||||
following relevant structures:
|
following relevant structures::
|
||||||
|
|
||||||
struct sock_filter { /* Filter block */
|
struct sock_filter { /* Filter block */
|
||||||
__u16 code; /* Actual filter code */
|
__u16 code; /* Actual filter code */
|
||||||
|
@ -70,7 +73,7 @@ struct sock_filter { /* Filter block */
|
||||||
|
|
||||||
Such a structure is assembled as an array of 4-tuples, that contains
|
Such a structure is assembled as an array of 4-tuples, that contains
|
||||||
a code, jt, jf and k value. jt and jf are jump offsets and k a generic
|
a code, jt, jf and k value. jt and jf are jump offsets and k a generic
|
||||||
value to be used for a provided code.
|
value to be used for a provided code::
|
||||||
|
|
||||||
struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
|
struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
|
||||||
unsigned short len; /* Number of filter blocks */
|
unsigned short len; /* Number of filter blocks */
|
||||||
|
@ -83,6 +86,8 @@ follow-up example) is being passed to the kernel through setsockopt(2).
|
||||||
Example
|
Example
|
||||||
-------
|
-------
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
#include <sys/socket.h>
|
#include <sys/socket.h>
|
||||||
#include <sys/types.h>
|
#include <sys/types.h>
|
||||||
#include <arpa/inet.h>
|
#include <arpa/inet.h>
|
||||||
|
@ -178,15 +183,17 @@ closely modelled after Steven McCanne's and Van Jacobson's BPF paper.
|
||||||
|
|
||||||
The BPF architecture consists of the following basic elements:
|
The BPF architecture consists of the following basic elements:
|
||||||
|
|
||||||
|
======= ====================================================
|
||||||
Element Description
|
Element Description
|
||||||
|
======= ====================================================
|
||||||
A 32 bit wide accumulator
|
A 32 bit wide accumulator
|
||||||
X 32 bit wide X register
|
X 32 bit wide X register
|
||||||
M[] 16 x 32 bit wide misc registers aka "scratch memory
|
M[] 16 x 32 bit wide misc registers aka "scratch memory
|
||||||
store", addressable from 0 to 15
|
store", addressable from 0 to 15
|
||||||
|
======= ====================================================
|
||||||
|
|
||||||
A program, that is translated by bpf_asm into "opcodes" is an array that
|
A program, that is translated by bpf_asm into "opcodes" is an array that
|
||||||
consists of the following elements (as already mentioned):
|
consists of the following elements (as already mentioned)::
|
||||||
|
|
||||||
op:16, jt:8, jf:8, k:32
|
op:16, jt:8, jf:8, k:32
|
||||||
|
|
||||||
|
@ -201,8 +208,9 @@ and return instructions that are also represented in bpf_asm syntax. This
|
||||||
table lists all bpf_asm instructions available resp. what their underlying
|
table lists all bpf_asm instructions available resp. what their underlying
|
||||||
opcodes as defined in linux/filter.h stand for:
|
opcodes as defined in linux/filter.h stand for:
|
||||||
|
|
||||||
|
=========== =================== =====================
|
||||||
Instruction Addressing mode Description
|
Instruction Addressing mode Description
|
||||||
|
=========== =================== =====================
|
||||||
ld 1, 2, 3, 4, 12 Load word into A
|
ld 1, 2, 3, 4, 12 Load word into A
|
||||||
ldi 4 Load word into A
|
ldi 4 Load word into A
|
||||||
ldh 1, 2 Load half-word into A
|
ldh 1, 2 Load half-word into A
|
||||||
|
@ -241,11 +249,13 @@ opcodes as defined in linux/filter.h stand for:
|
||||||
txa Copy X into A
|
txa Copy X into A
|
||||||
|
|
||||||
ret 4, 11 Return
|
ret 4, 11 Return
|
||||||
|
=========== =================== =====================
|
||||||
|
|
||||||
The next table shows addressing formats from the 2nd column:
|
The next table shows addressing formats from the 2nd column:
|
||||||
|
|
||||||
|
=============== =================== ===============================================
|
||||||
Addressing mode Syntax Description
|
Addressing mode Syntax Description
|
||||||
|
=============== =================== ===============================================
|
||||||
0 x/%x Register X
|
0 x/%x Register X
|
||||||
1 [k] BHW at byte offset k in the packet
|
1 [k] BHW at byte offset k in the packet
|
||||||
2 [x + k] BHW at the offset X + k in the packet
|
2 [x + k] BHW at the offset X + k in the packet
|
||||||
|
@ -259,6 +269,7 @@ The next table shows addressing formats from the 2nd column:
|
||||||
10 x/%x,Lt Jump to Lt if predicate is true
|
10 x/%x,Lt Jump to Lt if predicate is true
|
||||||
11 a/%a Accumulator A
|
11 a/%a Accumulator A
|
||||||
12 extension BPF extension
|
12 extension BPF extension
|
||||||
|
=============== =================== ===============================================
|
||||||
|
|
||||||
The Linux kernel also has a couple of BPF extensions that are used along
|
The Linux kernel also has a couple of BPF extensions that are used along
|
||||||
with the class of load instructions by "overloading" the k argument with
|
with the class of load instructions by "overloading" the k argument with
|
||||||
|
@ -267,8 +278,9 @@ extensions are loaded into A.
|
||||||
|
|
||||||
Possible BPF extensions are shown in the following table:
|
Possible BPF extensions are shown in the following table:
|
||||||
|
|
||||||
|
=================================== =================================================
|
||||||
Extension Description
|
Extension Description
|
||||||
|
=================================== =================================================
|
||||||
len skb->len
|
len skb->len
|
||||||
proto skb->protocol
|
proto skb->protocol
|
||||||
type skb->pkt_type
|
type skb->pkt_type
|
||||||
|
@ -285,18 +297,19 @@ Possible BPF extensions are shown in the following table:
|
||||||
vlan_avail skb_vlan_tag_present(skb)
|
vlan_avail skb_vlan_tag_present(skb)
|
||||||
vlan_tpid skb->vlan_proto
|
vlan_tpid skb->vlan_proto
|
||||||
rand prandom_u32()
|
rand prandom_u32()
|
||||||
|
=================================== =================================================
|
||||||
|
|
||||||
These extensions can also be prefixed with '#'.
|
These extensions can also be prefixed with '#'.
|
||||||
Examples for low-level BPF:
|
Examples for low-level BPF:
|
||||||
|
|
||||||
** ARP packets:
|
**ARP packets**::
|
||||||
|
|
||||||
ldh [12]
|
ldh [12]
|
||||||
jne #0x806, drop
|
jne #0x806, drop
|
||||||
ret #-1
|
ret #-1
|
||||||
drop: ret #0
|
drop: ret #0
|
||||||
|
|
||||||
** IPv4 TCP packets:
|
**IPv4 TCP packets**::
|
||||||
|
|
||||||
ldh [12]
|
ldh [12]
|
||||||
jne #0x800, drop
|
jne #0x800, drop
|
||||||
|
@ -305,14 +318,15 @@ Examples for low-level BPF:
|
||||||
ret #-1
|
ret #-1
|
||||||
drop: ret #0
|
drop: ret #0
|
||||||
|
|
||||||
** (Accelerated) VLAN w/ id 10:
|
**(Accelerated) VLAN w/ id 10**::
|
||||||
|
|
||||||
ld vlan_tci
|
ld vlan_tci
|
||||||
jneq #10, drop
|
jneq #10, drop
|
||||||
ret #-1
|
ret #-1
|
||||||
drop: ret #0
|
drop: ret #0
|
||||||
|
|
||||||
** icmp random packet sampling, 1 in 4
|
**icmp random packet sampling, 1 in 4**:
|
||||||
|
|
||||||
ldh [12]
|
ldh [12]
|
||||||
jne #0x800, drop
|
jne #0x800, drop
|
||||||
ldb [23]
|
ldb [23]
|
||||||
|
@ -324,7 +338,7 @@ Examples for low-level BPF:
|
||||||
ret #-1
|
ret #-1
|
||||||
drop: ret #0
|
drop: ret #0
|
||||||
|
|
||||||
** SECCOMP filter example:
|
**SECCOMP filter example**::
|
||||||
|
|
||||||
ld [4] /* offsetof(struct seccomp_data, arch) */
|
ld [4] /* offsetof(struct seccomp_data, arch) */
|
||||||
jne #0xc000003e, bad /* AUDIT_ARCH_X86_64 */
|
jne #0xc000003e, bad /* AUDIT_ARCH_X86_64 */
|
||||||
|
@ -345,12 +359,12 @@ Examples for low-level BPF:
|
||||||
The above example code can be placed into a file (here called "foo"), and
|
The above example code can be placed into a file (here called "foo"), and
|
||||||
then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf
|
then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf
|
||||||
and cls_bpf understands and can directly be loaded with. Example with above
|
and cls_bpf understands and can directly be loaded with. Example with above
|
||||||
ARP code:
|
ARP code::
|
||||||
|
|
||||||
$ ./bpf_asm foo
|
$ ./bpf_asm foo
|
||||||
4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0,
|
4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0,
|
||||||
|
|
||||||
In copy and paste C-like output:
|
In copy and paste C-like output::
|
||||||
|
|
||||||
$ ./bpf_asm -c foo
|
$ ./bpf_asm -c foo
|
||||||
{ 0x28, 0, 0, 0x0000000c },
|
{ 0x28, 0, 0, 0x0000000c },
|
||||||
|
@ -365,7 +379,7 @@ bpf_dbg under tools/bpf/ in the kernel source directory. This debugger allows
|
||||||
for testing BPF filters against given pcap files, single stepping through the
|
for testing BPF filters against given pcap files, single stepping through the
|
||||||
BPF code on the pcap's packets and to do BPF machine register dumps.
|
BPF code on the pcap's packets and to do BPF machine register dumps.
|
||||||
|
|
||||||
Starting bpf_dbg is trivial and just requires issuing:
|
Starting bpf_dbg is trivial and just requires issuing::
|
||||||
|
|
||||||
# ./bpf_dbg
|
# ./bpf_dbg
|
||||||
|
|
||||||
|
@ -381,31 +395,36 @@ Interaction in bpf_dbg happens through a shell that also has auto-completion
|
||||||
support (follow-up example commands starting with '>' denote bpf_dbg shell).
|
support (follow-up example commands starting with '>' denote bpf_dbg shell).
|
||||||
The usual workflow would be to ...
|
The usual workflow would be to ...
|
||||||
|
|
||||||
> load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0
|
* load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0
|
||||||
Loads a BPF filter from standard output of bpf_asm, or transformed via
|
Loads a BPF filter from standard output of bpf_asm, or transformed via
|
||||||
e.g. `tcpdump -iem1 -ddd port 22 | tr '\n' ','`. Note that for JIT
|
e.g. ``tcpdump -iem1 -ddd port 22 | tr '\n' ','``. Note that for JIT
|
||||||
debugging (next section), this command creates a temporary socket and
|
debugging (next section), this command creates a temporary socket and
|
||||||
loads the BPF code into the kernel. Thus, this will also be useful for
|
loads the BPF code into the kernel. Thus, this will also be useful for
|
||||||
JIT developers.
|
JIT developers.
|
||||||
|
|
||||||
> load pcap foo.pcap
|
* load pcap foo.pcap
|
||||||
|
|
||||||
Loads standard tcpdump pcap file.
|
Loads standard tcpdump pcap file.
|
||||||
|
|
||||||
> run [<n>]
|
* run [<n>]
|
||||||
|
|
||||||
bpf passes:1 fails:9
|
bpf passes:1 fails:9
|
||||||
Runs through all packets from a pcap to account how many passes and fails
|
Runs through all packets from a pcap to account how many passes and fails
|
||||||
the filter will generate. A limit of packets to traverse can be given.
|
the filter will generate. A limit of packets to traverse can be given.
|
||||||
|
|
||||||
> disassemble
|
* disassemble::
|
||||||
|
|
||||||
l0: ldh [12]
|
l0: ldh [12]
|
||||||
l1: jeq #0x800, l2, l5
|
l1: jeq #0x800, l2, l5
|
||||||
l2: ldb [23]
|
l2: ldb [23]
|
||||||
l3: jeq #0x1, l4, l5
|
l3: jeq #0x1, l4, l5
|
||||||
l4: ret #0xffff
|
l4: ret #0xffff
|
||||||
l5: ret #0
|
l5: ret #0
|
||||||
|
|
||||||
Prints out BPF code disassembly.
|
Prints out BPF code disassembly.
|
||||||
|
|
||||||
> dump
|
* dump::
|
||||||
|
|
||||||
/* { op, jt, jf, k }, */
|
/* { op, jt, jf, k }, */
|
||||||
{ 0x28, 0, 0, 0x0000000c },
|
{ 0x28, 0, 0, 0x0000000c },
|
||||||
{ 0x15, 0, 3, 0x00000800 },
|
{ 0x15, 0, 3, 0x00000800 },
|
||||||
|
@ -413,19 +432,26 @@ l5: ret #0
|
||||||
{ 0x15, 0, 1, 0x00000001 },
|
{ 0x15, 0, 1, 0x00000001 },
|
||||||
{ 0x06, 0, 0, 0x0000ffff },
|
{ 0x06, 0, 0, 0x0000ffff },
|
||||||
{ 0x06, 0, 0, 0000000000 },
|
{ 0x06, 0, 0, 0000000000 },
|
||||||
|
|
||||||
Prints out C-style BPF code dump.
|
Prints out C-style BPF code dump.
|
||||||
|
|
||||||
> breakpoint 0
|
* breakpoint 0::
|
||||||
|
|
||||||
breakpoint at: l0: ldh [12]
|
breakpoint at: l0: ldh [12]
|
||||||
> breakpoint 1
|
|
||||||
|
* breakpoint 1::
|
||||||
|
|
||||||
breakpoint at: l1: jeq #0x800, l2, l5
|
breakpoint at: l1: jeq #0x800, l2, l5
|
||||||
|
|
||||||
...
|
...
|
||||||
|
|
||||||
Sets breakpoints at particular BPF instructions. Issuing a `run` command
|
Sets breakpoints at particular BPF instructions. Issuing a `run` command
|
||||||
will walk through the pcap file continuing from the current packet and
|
will walk through the pcap file continuing from the current packet and
|
||||||
break when a breakpoint is being hit (another `run` will continue from
|
break when a breakpoint is being hit (another `run` will continue from
|
||||||
the currently active breakpoint executing next instructions):
|
the currently active breakpoint executing next instructions):
|
||||||
|
|
||||||
> run
|
* run::
|
||||||
|
|
||||||
-- register dump --
|
-- register dump --
|
||||||
pc: [0] <-- program counter
|
pc: [0] <-- program counter
|
||||||
code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction
|
code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction
|
||||||
|
@ -441,24 +467,28 @@ breakpoint at: l1: jeq #0x800, l2, l5
|
||||||
(breakpoint)
|
(breakpoint)
|
||||||
>
|
>
|
||||||
|
|
||||||
> breakpoint
|
* breakpoint::
|
||||||
|
|
||||||
breakpoints: 0 1
|
breakpoints: 0 1
|
||||||
|
|
||||||
Prints currently set breakpoints.
|
Prints currently set breakpoints.
|
||||||
|
|
||||||
> step [-<n>, +<n>]
|
* step [-<n>, +<n>]
|
||||||
|
|
||||||
Performs single stepping through the BPF program from the current pc
|
Performs single stepping through the BPF program from the current pc
|
||||||
offset. Thus, on each step invocation, above register dump is issued.
|
offset. Thus, on each step invocation, above register dump is issued.
|
||||||
This can go forwards and backwards in time, a plain `step` will break
|
This can go forwards and backwards in time, a plain `step` will break
|
||||||
on the next BPF instruction, thus +1. (No `run` needs to be issued here.)
|
on the next BPF instruction, thus +1. (No `run` needs to be issued here.)
|
||||||
|
|
||||||
> select <n>
|
* select <n>
|
||||||
|
|
||||||
Selects a given packet from the pcap file to continue from. Thus, on
|
Selects a given packet from the pcap file to continue from. Thus, on
|
||||||
the next `run` or `step`, the BPF program is being evaluated against
|
the next `run` or `step`, the BPF program is being evaluated against
|
||||||
the user pre-selected packet. Numbering starts just as in Wireshark
|
the user pre-selected packet. Numbering starts just as in Wireshark
|
||||||
with index 1.
|
with index 1.
|
||||||
|
|
||||||
> quit
|
* quit
|
||||||
#
|
|
||||||
Exits bpf_dbg.
|
Exits bpf_dbg.
|
||||||
|
|
||||||
JIT compiler
|
JIT compiler
|
||||||
|
@ -468,16 +498,16 @@ The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC,
|
||||||
PowerPC, ARM, ARM64, MIPS, RISC-V and s390 and can be enabled through
|
PowerPC, ARM, ARM64, MIPS, RISC-V and s390 and can be enabled through
|
||||||
CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each
|
CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each
|
||||||
attached filter from user space or for internal kernel users if it has
|
attached filter from user space or for internal kernel users if it has
|
||||||
been previously enabled by root:
|
been previously enabled by root::
|
||||||
|
|
||||||
echo 1 > /proc/sys/net/core/bpf_jit_enable
|
echo 1 > /proc/sys/net/core/bpf_jit_enable
|
||||||
|
|
||||||
For JIT developers, doing audits etc, each compile run can output the generated
|
For JIT developers, doing audits etc, each compile run can output the generated
|
||||||
opcode image into the kernel log via:
|
opcode image into the kernel log via::
|
||||||
|
|
||||||
echo 2 > /proc/sys/net/core/bpf_jit_enable
|
echo 2 > /proc/sys/net/core/bpf_jit_enable
|
||||||
|
|
||||||
Example output from dmesg:
|
Example output from dmesg::
|
||||||
|
|
||||||
[ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f
|
[ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f
|
||||||
[ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68
|
[ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68
|
||||||
|
@ -493,7 +523,7 @@ is discouraged and introspection through bpftool (under tools/bpf/bpftool/) is t
|
||||||
generally recommended approach instead.
|
generally recommended approach instead.
|
||||||
|
|
||||||
In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for
|
In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for
|
||||||
generating disassembly out of the kernel log's hexdump:
|
generating disassembly out of the kernel log's hexdump::
|
||||||
|
|
||||||
# ./bpf_jit_disasm
|
# ./bpf_jit_disasm
|
||||||
70 bytes emitted from JIT compiler (pass:3, flen:6)
|
70 bytes emitted from JIT compiler (pass:3, flen:6)
|
||||||
|
@ -663,9 +693,9 @@ Some core changes of the new internal format:
|
||||||
|
|
||||||
- Conditional jt/jf targets replaced with jt/fall-through:
|
- Conditional jt/jf targets replaced with jt/fall-through:
|
||||||
|
|
||||||
While the original design has constructs such as "if (cond) jump_true;
|
While the original design has constructs such as ``if (cond) jump_true;
|
||||||
else jump_false;", they are being replaced into alternative constructs like
|
else jump_false;``, they are being replaced into alternative constructs like
|
||||||
"if (cond) jump_true; /* else fall-through */".
|
``if (cond) jump_true; /* else fall-through */``.
|
||||||
|
|
||||||
- Introduces bpf_call insn and register passing convention for zero overhead
|
- Introduces bpf_call insn and register passing convention for zero overhead
|
||||||
calls from/to other kernel functions:
|
calls from/to other kernel functions:
|
||||||
|
@ -684,13 +714,13 @@ Some core changes of the new internal format:
|
||||||
a return value of the function. Since R6 - R9 are callee saved, their state
|
a return value of the function. Since R6 - R9 are callee saved, their state
|
||||||
is preserved across the call.
|
is preserved across the call.
|
||||||
|
|
||||||
For example, consider three C functions:
|
For example, consider three C functions::
|
||||||
|
|
||||||
u64 f1() { return (*_f2)(1); }
|
u64 f1() { return (*_f2)(1); }
|
||||||
u64 f2(u64 a) { return f3(a + 1, a); }
|
u64 f2(u64 a) { return f3(a + 1, a); }
|
||||||
u64 f3(u64 a, u64 b) { return a - b; }
|
u64 f3(u64 a, u64 b) { return a - b; }
|
||||||
|
|
||||||
GCC can compile f1, f3 into x86_64:
|
GCC can compile f1, f3 into x86_64::
|
||||||
|
|
||||||
f1:
|
f1:
|
||||||
movl $1, %edi
|
movl $1, %edi
|
||||||
|
@ -701,7 +731,7 @@ Some core changes of the new internal format:
|
||||||
subq %rsi, %rax
|
subq %rsi, %rax
|
||||||
ret
|
ret
|
||||||
|
|
||||||
Function f2 in eBPF may look like:
|
Function f2 in eBPF may look like::
|
||||||
|
|
||||||
f2:
|
f2:
|
||||||
bpf_mov R2, R1
|
bpf_mov R2, R1
|
||||||
|
@ -709,7 +739,7 @@ Some core changes of the new internal format:
|
||||||
bpf_call f3
|
bpf_call f3
|
||||||
bpf_exit
|
bpf_exit
|
||||||
|
|
||||||
If f2 is JITed and the pointer stored to '_f2'. The calls f1 -> f2 -> f3 and
|
If f2 is JITed and the pointer stored to ``_f2``. The calls f1 -> f2 -> f3 and
|
||||||
returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to
|
returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to
|
||||||
be used to call into f2.
|
be used to call into f2.
|
||||||
|
|
||||||
|
@ -722,6 +752,8 @@ Some core changes of the new internal format:
|
||||||
On 64-bit architectures all register map to HW registers one to one. For
|
On 64-bit architectures all register map to HW registers one to one. For
|
||||||
example, x86_64 JIT compiler can map them as ...
|
example, x86_64 JIT compiler can map them as ...
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
R0 - rax
|
R0 - rax
|
||||||
R1 - rdi
|
R1 - rdi
|
||||||
R2 - rsi
|
R2 - rsi
|
||||||
|
@ -737,7 +769,7 @@ Some core changes of the new internal format:
|
||||||
... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing
|
... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing
|
||||||
and rbx, r12 - r15 are callee saved.
|
and rbx, r12 - r15 are callee saved.
|
||||||
|
|
||||||
Then the following internal BPF pseudo-program:
|
Then the following internal BPF pseudo-program::
|
||||||
|
|
||||||
bpf_mov R6, R1 /* save ctx */
|
bpf_mov R6, R1 /* save ctx */
|
||||||
bpf_mov R2, 2
|
bpf_mov R2, 2
|
||||||
|
@ -755,7 +787,7 @@ Some core changes of the new internal format:
|
||||||
bpf_add R0, R7
|
bpf_add R0, R7
|
||||||
bpf_exit
|
bpf_exit
|
||||||
|
|
||||||
After JIT to x86_64 may look like:
|
After JIT to x86_64 may look like::
|
||||||
|
|
||||||
push %rbp
|
push %rbp
|
||||||
mov %rsp,%rbp
|
mov %rsp,%rbp
|
||||||
|
@ -781,7 +813,7 @@ Some core changes of the new internal format:
|
||||||
leaveq
|
leaveq
|
||||||
retq
|
retq
|
||||||
|
|
||||||
Which is in this example equivalent in C to:
|
Which is in this example equivalent in C to::
|
||||||
|
|
||||||
u64 bpf_filter(u64 ctx)
|
u64 bpf_filter(u64 ctx)
|
||||||
{
|
{
|
||||||
|
@ -790,12 +822,12 @@ Some core changes of the new internal format:
|
||||||
|
|
||||||
In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
|
In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
|
||||||
arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper
|
arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper
|
||||||
registers and place their return value into '%rax' which is R0 in eBPF.
|
registers and place their return value into ``%rax`` which is R0 in eBPF.
|
||||||
Prologue and epilogue are emitted by JIT and are implicit in the
|
Prologue and epilogue are emitted by JIT and are implicit in the
|
||||||
interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve
|
interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve
|
||||||
them across the calls as defined by calling convention.
|
them across the calls as defined by calling convention.
|
||||||
|
|
||||||
For example the following program is invalid:
|
For example the following program is invalid::
|
||||||
|
|
||||||
bpf_mov R1, 1
|
bpf_mov R1, 1
|
||||||
bpf_call foo
|
bpf_call foo
|
||||||
|
@ -814,7 +846,7 @@ The input context pointer for invoking the interpreter function is generic,
|
||||||
its content is defined by a specific use case. For seccomp register R1 points
|
its content is defined by a specific use case. For seccomp register R1 points
|
||||||
to seccomp_data, for converted BPF filters R1 points to a skb.
|
to seccomp_data, for converted BPF filters R1 points to a skb.
|
||||||
|
|
||||||
A program, that is translated internally consists of the following elements:
|
A program, that is translated internally consists of the following elements::
|
||||||
|
|
||||||
op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32
|
op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32
|
||||||
|
|
||||||
|
@ -824,7 +856,7 @@ instructions must be multiple of 8 bytes to preserve backward compatibility.
|
||||||
|
|
||||||
Internal BPF is a general purpose RISC instruction set. Not every register and
|
Internal BPF is a general purpose RISC instruction set. Not every register and
|
||||||
every instruction are used during translation from original BPF to new format.
|
every instruction are used during translation from original BPF to new format.
|
||||||
For example, socket filters are not using 'exclusive add' instruction, but
|
For example, socket filters are not using ``exclusive add`` instruction, but
|
||||||
tracing filters may do to maintain counters of events, for example. Register R9
|
tracing filters may do to maintain counters of events, for example. Register R9
|
||||||
is not used by socket filters either, but more complex filters may be running
|
is not used by socket filters either, but more complex filters may be running
|
||||||
out of registers and would have to resort to spill/fill to stack.
|
out of registers and would have to resort to spill/fill to stack.
|
||||||
|
@ -849,7 +881,7 @@ eBPF opcode encoding
|
||||||
|
|
||||||
eBPF is reusing most of the opcode encoding from classic to simplify conversion
|
eBPF is reusing most of the opcode encoding from classic to simplify conversion
|
||||||
of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code'
|
of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code'
|
||||||
field is divided into three parts:
|
field is divided into three parts::
|
||||||
|
|
||||||
+----------------+--------+--------------------+
|
+----------------+--------+--------------------+
|
||||||
| 4 bits | 1 bit | 3 bits |
|
| 4 bits | 1 bit | 3 bits |
|
||||||
|
@ -859,8 +891,9 @@ field is divided into three parts:
|
||||||
|
|
||||||
Three LSB bits store instruction class which is one of:
|
Three LSB bits store instruction class which is one of:
|
||||||
|
|
||||||
Classic BPF classes: eBPF classes:
|
=================== ===============
|
||||||
|
Classic BPF classes eBPF classes
|
||||||
|
=================== ===============
|
||||||
BPF_LD 0x00 BPF_LD 0x00
|
BPF_LD 0x00 BPF_LD 0x00
|
||||||
BPF_LDX 0x01 BPF_LDX 0x01
|
BPF_LDX 0x01 BPF_LDX 0x01
|
||||||
BPF_ST 0x02 BPF_ST 0x02
|
BPF_ST 0x02 BPF_ST 0x02
|
||||||
|
@ -869,25 +902,28 @@ Three LSB bits store instruction class which is one of:
|
||||||
BPF_JMP 0x05 BPF_JMP 0x05
|
BPF_JMP 0x05 BPF_JMP 0x05
|
||||||
BPF_RET 0x06 BPF_JMP32 0x06
|
BPF_RET 0x06 BPF_JMP32 0x06
|
||||||
BPF_MISC 0x07 BPF_ALU64 0x07
|
BPF_MISC 0x07 BPF_ALU64 0x07
|
||||||
|
=================== ===============
|
||||||
|
|
||||||
When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ...
|
When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ...
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
BPF_K 0x00
|
BPF_K 0x00
|
||||||
BPF_X 0x08
|
BPF_X 0x08
|
||||||
|
|
||||||
* in classic BPF, this means:
|
* in classic BPF, this means::
|
||||||
|
|
||||||
BPF_SRC(code) == BPF_X - use register X as source operand
|
BPF_SRC(code) == BPF_X - use register X as source operand
|
||||||
BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
|
BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
|
||||||
|
|
||||||
* in eBPF, this means:
|
* in eBPF, this means::
|
||||||
|
|
||||||
BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
|
BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
|
||||||
BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
|
BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
|
||||||
|
|
||||||
... and four MSB bits store operation code.
|
... and four MSB bits store operation code.
|
||||||
|
|
||||||
If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of:
|
If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of::
|
||||||
|
|
||||||
BPF_ADD 0x00
|
BPF_ADD 0x00
|
||||||
BPF_SUB 0x10
|
BPF_SUB 0x10
|
||||||
|
@ -904,7 +940,7 @@ If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of:
|
||||||
BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */
|
BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */
|
||||||
BPF_END 0xd0 /* eBPF only: endianness conversion */
|
BPF_END 0xd0 /* eBPF only: endianness conversion */
|
||||||
|
|
||||||
If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of:
|
If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of::
|
||||||
|
|
||||||
BPF_JA 0x00 /* BPF_JMP only */
|
BPF_JA 0x00 /* BPF_JMP only */
|
||||||
BPF_JEQ 0x10
|
BPF_JEQ 0x10
|
||||||
|
@ -934,7 +970,7 @@ exactly the same operations as BPF_ALU, but with 64-bit wide operands
|
||||||
instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.:
|
instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.:
|
||||||
dst_reg = dst_reg + src_reg
|
dst_reg = dst_reg + src_reg
|
||||||
|
|
||||||
Classic BPF wastes the whole BPF_RET class to represent a single 'ret'
|
Classic BPF wastes the whole BPF_RET class to represent a single ``ret``
|
||||||
operation. Classic BPF_RET | BPF_K means copy imm32 into return register
|
operation. Classic BPF_RET | BPF_K means copy imm32 into return register
|
||||||
and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT
|
and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT
|
||||||
in eBPF means function exit only. The eBPF program needs to store return
|
in eBPF means function exit only. The eBPF program needs to store return
|
||||||
|
@ -942,7 +978,7 @@ value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as
|
||||||
BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide
|
BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide
|
||||||
operands for the comparisons instead.
|
operands for the comparisons instead.
|
||||||
|
|
||||||
For load and store instructions the 8-bit 'code' field is divided as:
|
For load and store instructions the 8-bit 'code' field is divided as::
|
||||||
|
|
||||||
+--------+--------+-------------------+
|
+--------+--------+-------------------+
|
||||||
| 3 bits | 2 bits | 3 bits |
|
| 3 bits | 2 bits | 3 bits |
|
||||||
|
@ -952,19 +988,21 @@ For load and store instructions the 8-bit 'code' field is divided as:
|
||||||
|
|
||||||
Size modifier is one of ...
|
Size modifier is one of ...
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
BPF_W 0x00 /* word */
|
BPF_W 0x00 /* word */
|
||||||
BPF_H 0x08 /* half word */
|
BPF_H 0x08 /* half word */
|
||||||
BPF_B 0x10 /* byte */
|
BPF_B 0x10 /* byte */
|
||||||
BPF_DW 0x18 /* eBPF only, double word */
|
BPF_DW 0x18 /* eBPF only, double word */
|
||||||
|
|
||||||
... which encodes size of load/store operation:
|
... which encodes size of load/store operation::
|
||||||
|
|
||||||
B - 1 byte
|
B - 1 byte
|
||||||
H - 2 byte
|
H - 2 byte
|
||||||
W - 4 byte
|
W - 4 byte
|
||||||
DW - 8 byte (eBPF only)
|
DW - 8 byte (eBPF only)
|
||||||
|
|
||||||
Mode modifier is one of:
|
Mode modifier is one of::
|
||||||
|
|
||||||
BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
|
BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
|
||||||
BPF_ABS 0x20
|
BPF_ABS 0x20
|
||||||
|
@ -979,7 +1017,7 @@ eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and
|
||||||
|
|
||||||
They had to be carried over from classic to have strong performance of
|
They had to be carried over from classic to have strong performance of
|
||||||
socket filters running in eBPF interpreter. These instructions can only
|
socket filters running in eBPF interpreter. These instructions can only
|
||||||
be used when interpreter context is a pointer to 'struct sk_buff' and
|
be used when interpreter context is a pointer to ``struct sk_buff`` and
|
||||||
have seven implicit operands. Register R6 is an implicit input that must
|
have seven implicit operands. Register R6 is an implicit input that must
|
||||||
contain pointer to sk_buff. Register R0 is an implicit output which contains
|
contain pointer to sk_buff. Register R0 is an implicit output which contains
|
||||||
the data fetched from the packet. Registers R1-R5 are scratch registers
|
the data fetched from the packet. Registers R1-R5 are scratch registers
|
||||||
|
@ -992,14 +1030,14 @@ the interpreter will abort the execution of the program. JIT compilers
|
||||||
therefore must preserve this property. src_reg and imm32 fields are
|
therefore must preserve this property. src_reg and imm32 fields are
|
||||||
explicit inputs to these instructions.
|
explicit inputs to these instructions.
|
||||||
|
|
||||||
For example:
|
For example::
|
||||||
|
|
||||||
BPF_IND | BPF_W | BPF_LD means:
|
BPF_IND | BPF_W | BPF_LD means:
|
||||||
|
|
||||||
R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32))
|
R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32))
|
||||||
and R1 - R5 were scratched.
|
and R1 - R5 were scratched.
|
||||||
|
|
||||||
Unlike classic BPF instruction set, eBPF has generic load/store operations:
|
Unlike classic BPF instruction set, eBPF has generic load/store operations::
|
||||||
|
|
||||||
BPF_MEM | <size> | BPF_STX: *(size *) (dst_reg + off) = src_reg
|
BPF_MEM | <size> | BPF_STX: *(size *) (dst_reg + off) = src_reg
|
||||||
BPF_MEM | <size> | BPF_ST: *(size *) (dst_reg + off) = imm32
|
BPF_MEM | <size> | BPF_ST: *(size *) (dst_reg + off) = imm32
|
||||||
|
@ -1011,7 +1049,7 @@ Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
|
||||||
2 byte atomic increments are not supported.
|
2 byte atomic increments are not supported.
|
||||||
|
|
||||||
eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists
|
eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists
|
||||||
of two consecutive 'struct bpf_insn' 8-byte blocks and interpreted as single
|
of two consecutive ``struct bpf_insn`` 8-byte blocks and interpreted as single
|
||||||
instruction that loads 64-bit immediate value into a dst_reg.
|
instruction that loads 64-bit immediate value into a dst_reg.
|
||||||
Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads
|
Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads
|
||||||
32-bit immediate value into a register.
|
32-bit immediate value into a register.
|
||||||
|
@ -1037,38 +1075,48 @@ since addition of two valid pointers makes invalid pointer.
|
||||||
(In 'secure' mode verifier will reject any type of pointer arithmetic to make
|
(In 'secure' mode verifier will reject any type of pointer arithmetic to make
|
||||||
sure that kernel addresses don't leak to unprivileged users)
|
sure that kernel addresses don't leak to unprivileged users)
|
||||||
|
|
||||||
If register was never written to, it's not readable:
|
If register was never written to, it's not readable::
|
||||||
|
|
||||||
bpf_mov R0 = R2
|
bpf_mov R0 = R2
|
||||||
bpf_exit
|
bpf_exit
|
||||||
|
|
||||||
will be rejected, since R2 is unreadable at the start of the program.
|
will be rejected, since R2 is unreadable at the start of the program.
|
||||||
|
|
||||||
After kernel function call, R1-R5 are reset to unreadable and
|
After kernel function call, R1-R5 are reset to unreadable and
|
||||||
R0 has a return type of the function.
|
R0 has a return type of the function.
|
||||||
|
|
||||||
Since R6-R9 are callee saved, their state is preserved across the call.
|
Since R6-R9 are callee saved, their state is preserved across the call.
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
bpf_mov R6 = 1
|
bpf_mov R6 = 1
|
||||||
bpf_call foo
|
bpf_call foo
|
||||||
bpf_mov R0 = R6
|
bpf_mov R0 = R6
|
||||||
bpf_exit
|
bpf_exit
|
||||||
|
|
||||||
is a correct program. If there was R1 instead of R6, it would have
|
is a correct program. If there was R1 instead of R6, it would have
|
||||||
been rejected.
|
been rejected.
|
||||||
|
|
||||||
load/store instructions are allowed only with registers of valid types, which
|
load/store instructions are allowed only with registers of valid types, which
|
||||||
are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked.
|
are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked.
|
||||||
For example:
|
For example::
|
||||||
|
|
||||||
bpf_mov R1 = 1
|
bpf_mov R1 = 1
|
||||||
bpf_mov R2 = 2
|
bpf_mov R2 = 2
|
||||||
bpf_xadd *(u32 *)(R1 + 3) += R2
|
bpf_xadd *(u32 *)(R1 + 3) += R2
|
||||||
bpf_exit
|
bpf_exit
|
||||||
|
|
||||||
will be rejected, since R1 doesn't have a valid pointer type at the time of
|
will be rejected, since R1 doesn't have a valid pointer type at the time of
|
||||||
execution of instruction bpf_xadd.
|
execution of instruction bpf_xadd.
|
||||||
|
|
||||||
At the start R1 type is PTR_TO_CTX (a pointer to generic 'struct bpf_context')
|
At the start R1 type is PTR_TO_CTX (a pointer to generic ``struct bpf_context``)
|
||||||
A callback is used to customize verifier to restrict eBPF program access to only
|
A callback is used to customize verifier to restrict eBPF program access to only
|
||||||
certain fields within ctx structure with specified size and alignment.
|
certain fields within ctx structure with specified size and alignment.
|
||||||
|
|
||||||
For example, the following insn:
|
For example, the following insn::
|
||||||
|
|
||||||
bpf_ld R0 = *(u32 *)(R6 + 8)
|
bpf_ld R0 = *(u32 *)(R6 + 8)
|
||||||
|
|
||||||
intends to load a word from address R6 + 8 and store it into R0
|
intends to load a word from address R6 + 8 and store it into R0
|
||||||
If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know
|
If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know
|
||||||
that offset 8 of size 4 bytes can be accessed for reading, otherwise
|
that offset 8 of size 4 bytes can be accessed for reading, otherwise
|
||||||
|
@ -1079,10 +1127,13 @@ so it will fail verification, since it's out of bounds.
|
||||||
|
|
||||||
The verifier will allow eBPF program to read data from stack only after
|
The verifier will allow eBPF program to read data from stack only after
|
||||||
it wrote into it.
|
it wrote into it.
|
||||||
|
|
||||||
Classic BPF verifier does similar check with M[0-15] memory slots.
|
Classic BPF verifier does similar check with M[0-15] memory slots.
|
||||||
For example:
|
For example::
|
||||||
|
|
||||||
bpf_ld R0 = *(u32 *)(R10 - 4)
|
bpf_ld R0 = *(u32 *)(R10 - 4)
|
||||||
bpf_exit
|
bpf_exit
|
||||||
|
|
||||||
is invalid program.
|
is invalid program.
|
||||||
Though R10 is correct read-only register and has type PTR_TO_STACK
|
Though R10 is correct read-only register and has type PTR_TO_STACK
|
||||||
and R10 - 4 is within stack bounds, there were no stores into that location.
|
and R10 - 4 is within stack bounds, there were no stores into that location.
|
||||||
|
@ -1113,24 +1164,33 @@ Register value tracking
|
||||||
-----------------------
|
-----------------------
|
||||||
In order to determine the safety of an eBPF program, the verifier must track
|
In order to determine the safety of an eBPF program, the verifier must track
|
||||||
the range of possible values in each register and also in each stack slot.
|
the range of possible values in each register and also in each stack slot.
|
||||||
This is done with 'struct bpf_reg_state', defined in include/linux/
|
This is done with ``struct bpf_reg_state``, defined in include/linux/
|
||||||
bpf_verifier.h, which unifies tracking of scalar and pointer values. Each
|
bpf_verifier.h, which unifies tracking of scalar and pointer values. Each
|
||||||
register state has a type, which is either NOT_INIT (the register has not been
|
register state has a type, which is either NOT_INIT (the register has not been
|
||||||
written to), SCALAR_VALUE (some value which is not usable as a pointer), or a
|
written to), SCALAR_VALUE (some value which is not usable as a pointer), or a
|
||||||
pointer type. The types of pointers describe their base, as follows:
|
pointer type. The types of pointers describe their base, as follows:
|
||||||
PTR_TO_CTX Pointer to bpf_context.
|
|
||||||
CONST_PTR_TO_MAP Pointer to struct bpf_map. "Const" because arithmetic
|
|
||||||
|
PTR_TO_CTX
|
||||||
|
Pointer to bpf_context.
|
||||||
|
CONST_PTR_TO_MAP
|
||||||
|
Pointer to struct bpf_map. "Const" because arithmetic
|
||||||
on these pointers is forbidden.
|
on these pointers is forbidden.
|
||||||
PTR_TO_MAP_VALUE Pointer to the value stored in a map element.
|
PTR_TO_MAP_VALUE
|
||||||
|
Pointer to the value stored in a map element.
|
||||||
PTR_TO_MAP_VALUE_OR_NULL
|
PTR_TO_MAP_VALUE_OR_NULL
|
||||||
Either a pointer to a map value, or NULL; map accesses
|
Either a pointer to a map value, or NULL; map accesses
|
||||||
(see section 'eBPF maps', below) return this type,
|
(see section 'eBPF maps', below) return this type,
|
||||||
which becomes a PTR_TO_MAP_VALUE when checked != NULL.
|
which becomes a PTR_TO_MAP_VALUE when checked != NULL.
|
||||||
Arithmetic on these pointers is forbidden.
|
Arithmetic on these pointers is forbidden.
|
||||||
PTR_TO_STACK Frame pointer.
|
PTR_TO_STACK
|
||||||
PTR_TO_PACKET skb->data.
|
Frame pointer.
|
||||||
PTR_TO_PACKET_END skb->data + headlen; arithmetic forbidden.
|
PTR_TO_PACKET
|
||||||
PTR_TO_SOCKET Pointer to struct bpf_sock_ops, implicitly refcounted.
|
skb->data.
|
||||||
|
PTR_TO_PACKET_END
|
||||||
|
skb->data + headlen; arithmetic forbidden.
|
||||||
|
PTR_TO_SOCKET
|
||||||
|
Pointer to struct bpf_sock_ops, implicitly refcounted.
|
||||||
PTR_TO_SOCKET_OR_NULL
|
PTR_TO_SOCKET_OR_NULL
|
||||||
Either a pointer to a socket, or NULL; socket lookup
|
Either a pointer to a socket, or NULL; socket lookup
|
||||||
returns this type, which becomes a PTR_TO_SOCKET when
|
returns this type, which becomes a PTR_TO_SOCKET when
|
||||||
|
@ -1138,15 +1198,19 @@ pointer type. The types of pointers describe their base, as follows:
|
||||||
so programs must release the reference through the
|
so programs must release the reference through the
|
||||||
socket release function before the end of the program.
|
socket release function before the end of the program.
|
||||||
Arithmetic on these pointers is forbidden.
|
Arithmetic on these pointers is forbidden.
|
||||||
|
|
||||||
However, a pointer may be offset from this base (as a result of pointer
|
However, a pointer may be offset from this base (as a result of pointer
|
||||||
arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable
|
arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable
|
||||||
offset'. The former is used when an exactly-known value (e.g. an immediate
|
offset'. The former is used when an exactly-known value (e.g. an immediate
|
||||||
operand) is added to a pointer, while the latter is used for values which are
|
operand) is added to a pointer, while the latter is used for values which are
|
||||||
not exactly known. The variable offset is also used in SCALAR_VALUEs, to track
|
not exactly known. The variable offset is also used in SCALAR_VALUEs, to track
|
||||||
the range of possible values in the register.
|
the range of possible values in the register.
|
||||||
|
|
||||||
The verifier's knowledge about the variable offset consists of:
|
The verifier's knowledge about the variable offset consists of:
|
||||||
|
|
||||||
* minimum and maximum values as unsigned
|
* minimum and maximum values as unsigned
|
||||||
* minimum and maximum values as signed
|
* minimum and maximum values as signed
|
||||||
|
|
||||||
* knowledge of the values of individual bits, in the form of a 'tnum': a u64
|
* knowledge of the values of individual bits, in the form of a 'tnum': a u64
|
||||||
'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown;
|
'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown;
|
||||||
1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both
|
1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both
|
||||||
|
@ -1188,7 +1252,7 @@ The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common
|
||||||
to all copies of the pointer returned from a socket lookup. This has similar
|
to all copies of the pointer returned from a socket lookup. This has similar
|
||||||
behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but
|
behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but
|
||||||
it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly
|
it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly
|
||||||
represents a reference to the corresponding 'struct sock'. To ensure that the
|
represents a reference to the corresponding ``struct sock``. To ensure that the
|
||||||
reference is not leaked, it is imperative to NULL-check the reference and in
|
reference is not leaked, it is imperative to NULL-check the reference and in
|
||||||
the non-NULL case, and pass the valid reference to the socket release function.
|
the non-NULL case, and pass the valid reference to the socket release function.
|
||||||
|
|
||||||
|
@ -1196,7 +1260,8 @@ Direct packet access
|
||||||
--------------------
|
--------------------
|
||||||
In cls_bpf and act_bpf programs the verifier allows direct access to the packet
|
In cls_bpf and act_bpf programs the verifier allows direct access to the packet
|
||||||
data via skb->data and skb->data_end pointers.
|
data via skb->data and skb->data_end pointers.
|
||||||
Ex:
|
Ex::
|
||||||
|
|
||||||
1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */
|
1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */
|
||||||
2: r3 = *(u32 *)(r1 +76) /* load skb->data */
|
2: r3 = *(u32 *)(r1 +76) /* load skb->data */
|
||||||
3: r5 = r3
|
3: r5 = r3
|
||||||
|
@ -1206,7 +1271,7 @@ R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
|
||||||
6: r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */
|
6: r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */
|
||||||
|
|
||||||
this 2byte load from the packet is safe to do, since the program author
|
this 2byte load from the packet is safe to do, since the program author
|
||||||
did check 'if (skb->data + 14 > skb->data_end) goto err' at insn #5 which
|
did check ``if (skb->data + 14 > skb->data_end) goto err`` at insn #5 which
|
||||||
means that in the fall-through case the register R3 (which points to skb->data)
|
means that in the fall-through case the register R3 (which points to skb->data)
|
||||||
has at least 14 directly accessible bytes. The verifier marks it
|
has at least 14 directly accessible bytes. The verifier marks it
|
||||||
as R3=pkt(id=0,off=0,r=14).
|
as R3=pkt(id=0,off=0,r=14).
|
||||||
|
@ -1215,10 +1280,12 @@ off=0 means that no additional constants were added.
|
||||||
r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok.
|
r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok.
|
||||||
Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points
|
Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points
|
||||||
to the packet data, but constant 14 was added to the register, so
|
to the packet data, but constant 14 was added to the register, so
|
||||||
it now points to 'skb->data + 14' and accessible range is [R5, R5 + 14 - 14)
|
it now points to ``skb->data + 14`` and accessible range is [R5, R5 + 14 - 14)
|
||||||
which is zero bytes.
|
which is zero bytes.
|
||||||
|
|
||||||
More complex packet access may look like:
|
More complex packet access may look like::
|
||||||
|
|
||||||
|
|
||||||
R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
|
R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
|
||||||
6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */
|
6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */
|
||||||
7: r4 = *(u8 *)(r3 +12)
|
7: r4 = *(u8 *)(r3 +12)
|
||||||
|
@ -1235,32 +1302,36 @@ More complex packet access may look like:
|
||||||
18: if r2 > r1 goto pc+2
|
18: if r2 > r1 goto pc+2
|
||||||
R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp
|
R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp
|
||||||
19: r1 = *(u8 *)(r3 +4)
|
19: r1 = *(u8 *)(r3 +4)
|
||||||
|
|
||||||
The state of the register R3 is R3=pkt(id=2,off=0,r=8)
|
The state of the register R3 is R3=pkt(id=2,off=0,r=8)
|
||||||
id=2 means that two 'r3 += rX' instructions were seen, so r3 points to some
|
id=2 means that two ``r3 += rX`` instructions were seen, so r3 points to some
|
||||||
offset within a packet and since the program author did
|
offset within a packet and since the program author did
|
||||||
'if (r3 + 8 > r1) goto err' at insn #18, the safe range is [R3, R3 + 8).
|
``if (r3 + 8 > r1) goto err`` at insn #18, the safe range is [R3, R3 + 8).
|
||||||
The verifier only allows 'add'/'sub' operations on packet registers. Any other
|
The verifier only allows 'add'/'sub' operations on packet registers. Any other
|
||||||
operation will set the register state to 'SCALAR_VALUE' and it won't be
|
operation will set the register state to 'SCALAR_VALUE' and it won't be
|
||||||
available for direct packet access.
|
available for direct packet access.
|
||||||
Operation 'r3 += rX' may overflow and become less than original skb->data,
|
|
||||||
therefore the verifier has to prevent that. So when it sees 'r3 += rX'
|
Operation ``r3 += rX`` may overflow and become less than original skb->data,
|
||||||
|
therefore the verifier has to prevent that. So when it sees ``r3 += rX``
|
||||||
instruction and rX is more than 16-bit value, any subsequent bounds-check of r3
|
instruction and rX is more than 16-bit value, any subsequent bounds-check of r3
|
||||||
against skb->data_end will not give us 'range' information, so attempts to read
|
against skb->data_end will not give us 'range' information, so attempts to read
|
||||||
through the pointer will give "invalid access to packet" error.
|
through the pointer will give "invalid access to packet" error.
|
||||||
Ex. after insn 'r4 = *(u8 *)(r3 +12)' (insn #7 above) the state of r4 is
|
|
||||||
|
Ex. after insn ``r4 = *(u8 *)(r3 +12)`` (insn #7 above) the state of r4 is
|
||||||
R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits
|
R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits
|
||||||
of the register are guaranteed to be zero, and nothing is known about the lower
|
of the register are guaranteed to be zero, and nothing is known about the lower
|
||||||
8 bits. After insn 'r4 *= 14' the state becomes
|
8 bits. After insn ``r4 *= 14`` the state becomes
|
||||||
R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit
|
R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit
|
||||||
value by constant 14 will keep upper 52 bits as zero, also the least significant
|
value by constant 14 will keep upper 52 bits as zero, also the least significant
|
||||||
bit will be zero as 14 is even. Similarly 'r2 >>= 48' will make
|
bit will be zero as 14 is even. Similarly ``r2 >>= 48`` will make
|
||||||
R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign
|
R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign
|
||||||
extending. This logic is implemented in adjust_reg_min_max_vals() function,
|
extending. This logic is implemented in adjust_reg_min_max_vals() function,
|
||||||
which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice
|
which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice
|
||||||
versa) and adjust_scalar_min_max_vals() for operations on two scalars.
|
versa) and adjust_scalar_min_max_vals() for operations on two scalars.
|
||||||
|
|
||||||
The end result is that bpf program author can access packet directly
|
The end result is that bpf program author can access packet directly
|
||||||
using normal C code as:
|
using normal C code as::
|
||||||
|
|
||||||
void *data = (void *)(long)skb->data;
|
void *data = (void *)(long)skb->data;
|
||||||
void *data_end = (void *)(long)skb->data_end;
|
void *data_end = (void *)(long)skb->data_end;
|
||||||
struct eth_hdr *eth = data;
|
struct eth_hdr *eth = data;
|
||||||
|
@ -1275,6 +1346,7 @@ using normal C code as:
|
||||||
return 0;
|
return 0;
|
||||||
if (udp->dest == 53 || udp->source == 9)
|
if (udp->dest == 53 || udp->source == 9)
|
||||||
...;
|
...;
|
||||||
|
|
||||||
which makes such programs easier to write comparing to LD_ABS insn
|
which makes such programs easier to write comparing to LD_ABS insn
|
||||||
and significantly faster.
|
and significantly faster.
|
||||||
|
|
||||||
|
@ -1284,23 +1356,24 @@ eBPF maps
|
||||||
and userspace.
|
and userspace.
|
||||||
|
|
||||||
The maps are accessed from user space via BPF syscall, which has commands:
|
The maps are accessed from user space via BPF syscall, which has commands:
|
||||||
|
|
||||||
- create a map with given type and attributes
|
- create a map with given type and attributes
|
||||||
map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)
|
``map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)``
|
||||||
using attr->map_type, attr->key_size, attr->value_size, attr->max_entries
|
using attr->map_type, attr->key_size, attr->value_size, attr->max_entries
|
||||||
returns process-local file descriptor or negative error
|
returns process-local file descriptor or negative error
|
||||||
|
|
||||||
- lookup key in a given map
|
- lookup key in a given map
|
||||||
err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)
|
``err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)``
|
||||||
using attr->map_fd, attr->key, attr->value
|
using attr->map_fd, attr->key, attr->value
|
||||||
returns zero and stores found elem into value or negative error
|
returns zero and stores found elem into value or negative error
|
||||||
|
|
||||||
- create or update key/value pair in a given map
|
- create or update key/value pair in a given map
|
||||||
err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
|
``err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)``
|
||||||
using attr->map_fd, attr->key, attr->value
|
using attr->map_fd, attr->key, attr->value
|
||||||
returns zero or negative error
|
returns zero or negative error
|
||||||
|
|
||||||
- find and delete element by key in a given map
|
- find and delete element by key in a given map
|
||||||
err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)
|
``err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)``
|
||||||
using attr->map_fd, attr->key
|
using attr->map_fd, attr->key
|
||||||
|
|
||||||
- to delete map: close(fd)
|
- to delete map: close(fd)
|
||||||
|
@ -1312,10 +1385,11 @@ are concurrently updating.
|
||||||
maps can have different types: hash, array, bloom filter, radix-tree, etc.
|
maps can have different types: hash, array, bloom filter, radix-tree, etc.
|
||||||
|
|
||||||
The map is defined by:
|
The map is defined by:
|
||||||
. type
|
|
||||||
. max number of elements
|
- type
|
||||||
. key size in bytes
|
- max number of elements
|
||||||
. value size in bytes
|
- key size in bytes
|
||||||
|
- value size in bytes
|
||||||
|
|
||||||
Pruning
|
Pruning
|
||||||
-------
|
-------
|
||||||
|
@ -1339,57 +1413,75 @@ Understanding eBPF verifier messages
|
||||||
The following are few examples of invalid eBPF programs and verifier error
|
The following are few examples of invalid eBPF programs and verifier error
|
||||||
messages as seen in the log:
|
messages as seen in the log:
|
||||||
|
|
||||||
Program with unreachable instructions:
|
Program with unreachable instructions::
|
||||||
|
|
||||||
static struct bpf_insn prog[] = {
|
static struct bpf_insn prog[] = {
|
||||||
BPF_EXIT_INSN(),
|
BPF_EXIT_INSN(),
|
||||||
BPF_EXIT_INSN(),
|
BPF_EXIT_INSN(),
|
||||||
};
|
};
|
||||||
|
|
||||||
Error:
|
Error:
|
||||||
|
|
||||||
unreachable insn 1
|
unreachable insn 1
|
||||||
|
|
||||||
Program that reads uninitialized register:
|
Program that reads uninitialized register::
|
||||||
|
|
||||||
BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
|
BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
|
||||||
BPF_EXIT_INSN(),
|
BPF_EXIT_INSN(),
|
||||||
Error:
|
|
||||||
|
Error::
|
||||||
|
|
||||||
0: (bf) r0 = r2
|
0: (bf) r0 = r2
|
||||||
R2 !read_ok
|
R2 !read_ok
|
||||||
|
|
||||||
Program that doesn't initialize R0 before exiting:
|
Program that doesn't initialize R0 before exiting::
|
||||||
|
|
||||||
BPF_MOV64_REG(BPF_REG_2, BPF_REG_1),
|
BPF_MOV64_REG(BPF_REG_2, BPF_REG_1),
|
||||||
BPF_EXIT_INSN(),
|
BPF_EXIT_INSN(),
|
||||||
Error:
|
|
||||||
|
Error::
|
||||||
|
|
||||||
0: (bf) r2 = r1
|
0: (bf) r2 = r1
|
||||||
1: (95) exit
|
1: (95) exit
|
||||||
R0 !read_ok
|
R0 !read_ok
|
||||||
|
|
||||||
Program that accesses stack out of bounds:
|
Program that accesses stack out of bounds::
|
||||||
|
|
||||||
BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0),
|
BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0),
|
||||||
BPF_EXIT_INSN(),
|
BPF_EXIT_INSN(),
|
||||||
Error:
|
|
||||||
|
Error::
|
||||||
|
|
||||||
0: (7a) *(u64 *)(r10 +8) = 0
|
0: (7a) *(u64 *)(r10 +8) = 0
|
||||||
invalid stack off=8 size=8
|
invalid stack off=8 size=8
|
||||||
|
|
||||||
Program that doesn't initialize stack before passing its address into function:
|
Program that doesn't initialize stack before passing its address into function::
|
||||||
|
|
||||||
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
|
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
|
||||||
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
|
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
|
||||||
BPF_LD_MAP_FD(BPF_REG_1, 0),
|
BPF_LD_MAP_FD(BPF_REG_1, 0),
|
||||||
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
|
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
|
||||||
BPF_EXIT_INSN(),
|
BPF_EXIT_INSN(),
|
||||||
Error:
|
|
||||||
|
Error::
|
||||||
|
|
||||||
0: (bf) r2 = r10
|
0: (bf) r2 = r10
|
||||||
1: (07) r2 += -8
|
1: (07) r2 += -8
|
||||||
2: (b7) r1 = 0x0
|
2: (b7) r1 = 0x0
|
||||||
3: (85) call 1
|
3: (85) call 1
|
||||||
invalid indirect read from stack off -8+0 size 8
|
invalid indirect read from stack off -8+0 size 8
|
||||||
|
|
||||||
Program that uses invalid map_fd=0 while calling to map_lookup_elem() function:
|
Program that uses invalid map_fd=0 while calling to map_lookup_elem() function::
|
||||||
|
|
||||||
BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
|
BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
|
||||||
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
|
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
|
||||||
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
|
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
|
||||||
BPF_LD_MAP_FD(BPF_REG_1, 0),
|
BPF_LD_MAP_FD(BPF_REG_1, 0),
|
||||||
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
|
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
|
||||||
BPF_EXIT_INSN(),
|
BPF_EXIT_INSN(),
|
||||||
Error:
|
|
||||||
|
Error::
|
||||||
|
|
||||||
0: (7a) *(u64 *)(r10 -8) = 0
|
0: (7a) *(u64 *)(r10 -8) = 0
|
||||||
1: (bf) r2 = r10
|
1: (bf) r2 = r10
|
||||||
2: (07) r2 += -8
|
2: (07) r2 += -8
|
||||||
|
@ -1398,7 +1490,8 @@ Error:
|
||||||
fd 0 is not pointing to valid bpf_map
|
fd 0 is not pointing to valid bpf_map
|
||||||
|
|
||||||
Program that doesn't check return value of map_lookup_elem() before accessing
|
Program that doesn't check return value of map_lookup_elem() before accessing
|
||||||
map element:
|
map element::
|
||||||
|
|
||||||
BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
|
BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
|
||||||
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
|
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
|
||||||
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
|
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
|
||||||
|
@ -1406,7 +1499,9 @@ map element:
|
||||||
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
|
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
|
||||||
BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
|
BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
|
||||||
BPF_EXIT_INSN(),
|
BPF_EXIT_INSN(),
|
||||||
Error:
|
|
||||||
|
Error::
|
||||||
|
|
||||||
0: (7a) *(u64 *)(r10 -8) = 0
|
0: (7a) *(u64 *)(r10 -8) = 0
|
||||||
1: (bf) r2 = r10
|
1: (bf) r2 = r10
|
||||||
2: (07) r2 += -8
|
2: (07) r2 += -8
|
||||||
|
@ -1416,7 +1511,8 @@ Error:
|
||||||
R0 invalid mem access 'map_value_or_null'
|
R0 invalid mem access 'map_value_or_null'
|
||||||
|
|
||||||
Program that correctly checks map_lookup_elem() returned value for NULL, but
|
Program that correctly checks map_lookup_elem() returned value for NULL, but
|
||||||
accesses the memory with incorrect alignment:
|
accesses the memory with incorrect alignment::
|
||||||
|
|
||||||
BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
|
BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
|
||||||
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
|
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
|
||||||
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
|
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
|
||||||
|
@ -1425,7 +1521,9 @@ accesses the memory with incorrect alignment:
|
||||||
BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
|
BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
|
||||||
BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
|
BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
|
||||||
BPF_EXIT_INSN(),
|
BPF_EXIT_INSN(),
|
||||||
Error:
|
|
||||||
|
Error::
|
||||||
|
|
||||||
0: (7a) *(u64 *)(r10 -8) = 0
|
0: (7a) *(u64 *)(r10 -8) = 0
|
||||||
1: (bf) r2 = r10
|
1: (bf) r2 = r10
|
||||||
2: (07) r2 += -8
|
2: (07) r2 += -8
|
||||||
|
@ -1438,7 +1536,8 @@ Error:
|
||||||
|
|
||||||
Program that correctly checks map_lookup_elem() returned value for NULL and
|
Program that correctly checks map_lookup_elem() returned value for NULL and
|
||||||
accesses memory with correct alignment in one side of 'if' branch, but fails
|
accesses memory with correct alignment in one side of 'if' branch, but fails
|
||||||
to do so in the other side of 'if' branch:
|
to do so in the other side of 'if' branch::
|
||||||
|
|
||||||
BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
|
BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
|
||||||
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
|
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
|
||||||
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
|
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
|
||||||
|
@ -1449,7 +1548,9 @@ to do so in the other side of 'if' branch:
|
||||||
BPF_EXIT_INSN(),
|
BPF_EXIT_INSN(),
|
||||||
BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
|
BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
|
||||||
BPF_EXIT_INSN(),
|
BPF_EXIT_INSN(),
|
||||||
Error:
|
|
||||||
|
Error::
|
||||||
|
|
||||||
0: (7a) *(u64 *)(r10 -8) = 0
|
0: (7a) *(u64 *)(r10 -8) = 0
|
||||||
1: (bf) r2 = r10
|
1: (bf) r2 = r10
|
||||||
2: (07) r2 += -8
|
2: (07) r2 += -8
|
||||||
|
@ -1465,8 +1566,8 @@ Error:
|
||||||
R0 invalid mem access 'imm'
|
R0 invalid mem access 'imm'
|
||||||
|
|
||||||
Program that performs a socket lookup then sets the pointer to NULL without
|
Program that performs a socket lookup then sets the pointer to NULL without
|
||||||
checking it:
|
checking it::
|
||||||
value:
|
|
||||||
BPF_MOV64_IMM(BPF_REG_2, 0),
|
BPF_MOV64_IMM(BPF_REG_2, 0),
|
||||||
BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
|
BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
|
||||||
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
|
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
|
||||||
|
@ -1477,7 +1578,9 @@ value:
|
||||||
BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
|
BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
|
||||||
BPF_MOV64_IMM(BPF_REG_0, 0),
|
BPF_MOV64_IMM(BPF_REG_0, 0),
|
||||||
BPF_EXIT_INSN(),
|
BPF_EXIT_INSN(),
|
||||||
Error:
|
|
||||||
|
Error::
|
||||||
|
|
||||||
0: (b7) r2 = 0
|
0: (b7) r2 = 0
|
||||||
1: (63) *(u32 *)(r10 -8) = r2
|
1: (63) *(u32 *)(r10 -8) = r2
|
||||||
2: (bf) r2 = r10
|
2: (bf) r2 = r10
|
||||||
|
@ -1491,7 +1594,8 @@ Error:
|
||||||
Unreleased reference id=1, alloc_insn=7
|
Unreleased reference id=1, alloc_insn=7
|
||||||
|
|
||||||
Program that performs a socket lookup but does not NULL-check the returned
|
Program that performs a socket lookup but does not NULL-check the returned
|
||||||
value:
|
value::
|
||||||
|
|
||||||
BPF_MOV64_IMM(BPF_REG_2, 0),
|
BPF_MOV64_IMM(BPF_REG_2, 0),
|
||||||
BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
|
BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
|
||||||
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
|
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
|
||||||
|
@ -1501,7 +1605,9 @@ value:
|
||||||
BPF_MOV64_IMM(BPF_REG_5, 0),
|
BPF_MOV64_IMM(BPF_REG_5, 0),
|
||||||
BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
|
BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
|
||||||
BPF_EXIT_INSN(),
|
BPF_EXIT_INSN(),
|
||||||
Error:
|
|
||||||
|
Error::
|
||||||
|
|
||||||
0: (b7) r2 = 0
|
0: (b7) r2 = 0
|
||||||
1: (63) *(u32 *)(r10 -8) = r2
|
1: (63) *(u32 *)(r10 -8) = r2
|
||||||
2: (bf) r2 = r10
|
2: (bf) r2 = r10
|
||||||
|
@ -1519,7 +1625,7 @@ Testing
|
||||||
Next to the BPF toolchain, the kernel also ships a test module that contains
|
Next to the BPF toolchain, the kernel also ships a test module that contains
|
||||||
various test cases for classic and internal BPF that can be executed against
|
various test cases for classic and internal BPF that can be executed against
|
||||||
the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and
|
the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and
|
||||||
enabled via Kconfig:
|
enabled via Kconfig::
|
||||||
|
|
||||||
CONFIG_TEST_BPF=m
|
CONFIG_TEST_BPF=m
|
||||||
|
|
||||||
|
@ -1540,6 +1646,6 @@ The document was written in the hope that it is found useful and in order
|
||||||
to give potential BPF hackers or security auditors a better overview of
|
to give potential BPF hackers or security auditors a better overview of
|
||||||
the underlying architecture.
|
the underlying architecture.
|
||||||
|
|
||||||
Jay Schulist <jschlst@samba.org>
|
- Jay Schulist <jschlst@samba.org>
|
||||||
Daniel Borkmann <daniel@iogearbox.net>
|
- Daniel Borkmann <daniel@iogearbox.net>
|
||||||
Alexei Starovoitov <ast@kernel.org>
|
- Alexei Starovoitov <ast@kernel.org>
|
|
@ -56,6 +56,7 @@ Contents:
|
||||||
driver
|
driver
|
||||||
eql
|
eql
|
||||||
fib_trie
|
fib_trie
|
||||||
|
filter
|
||||||
|
|
||||||
.. only:: subproject and html
|
.. only:: subproject and html
|
||||||
|
|
||||||
|
|
|
@ -1051,7 +1051,7 @@ for more information on hardware timestamps.
|
||||||
-------------------------------------------------------------------------------
|
-------------------------------------------------------------------------------
|
||||||
|
|
||||||
- Packet sockets work well together with Linux socket filters, thus you also
|
- Packet sockets work well together with Linux socket filters, thus you also
|
||||||
might want to have a look at Documentation/networking/filter.txt
|
might want to have a look at Documentation/networking/filter.rst
|
||||||
|
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
+ THANKS
|
+ THANKS
|
||||||
|
|
|
@ -3192,7 +3192,7 @@ Q: https://patchwork.ozlabs.org/project/netdev/list/?delegate=77147
|
||||||
T: git git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git
|
T: git git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git
|
||||||
T: git git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
|
T: git git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
|
||||||
F: Documentation/bpf/
|
F: Documentation/bpf/
|
||||||
F: Documentation/networking/filter.txt
|
F: Documentation/networking/filter.rst
|
||||||
F: arch/*/net/*
|
F: arch/*/net/*
|
||||||
F: include/linux/bpf*
|
F: include/linux/bpf*
|
||||||
F: include/linux/filter.h
|
F: include/linux/filter.h
|
||||||
|
|
|
@ -11,7 +11,7 @@
|
||||||
*
|
*
|
||||||
* How to get into it:
|
* How to get into it:
|
||||||
*
|
*
|
||||||
* 1) read Documentation/networking/filter.txt
|
* 1) read Documentation/networking/filter.rst
|
||||||
* 2) Run `bpf_asm [-c] <filter-prog file>` to translate into binary
|
* 2) Run `bpf_asm [-c] <filter-prog file>` to translate into binary
|
||||||
* blob that is loadable with xt_bpf, cls_bpf et al. Note: -c will
|
* blob that is loadable with xt_bpf, cls_bpf et al. Note: -c will
|
||||||
* pretty print a C-like construct.
|
* pretty print a C-like construct.
|
||||||
|
|
|
@ -13,7 +13,7 @@
|
||||||
* for making a verdict when multiple simple BPF programs are combined
|
* for making a verdict when multiple simple BPF programs are combined
|
||||||
* into one in order to prevent parsing same headers multiple times.
|
* into one in order to prevent parsing same headers multiple times.
|
||||||
*
|
*
|
||||||
* More on how to debug BPF opcodes see Documentation/networking/filter.txt
|
* More on how to debug BPF opcodes see Documentation/networking/filter.rst
|
||||||
* which is the main document on BPF. Mini howto for getting started:
|
* which is the main document on BPF. Mini howto for getting started:
|
||||||
*
|
*
|
||||||
* 1) `./bpf_dbg` to enter the shell (shell cmds denoted with '>'):
|
* 1) `./bpf_dbg` to enter the shell (shell cmds denoted with '>'):
|
||||||
|
|
Loading…
Add table
Reference in a new issue