linux/net
Willem de Bruijn 61fad6816f net/packet: tpacket_rcv: avoid a producer race condition
PACKET_RX_RING can cause multiple writers to access the same slot if a
fast writer wraps the ring while a slow writer is still copying. This
is particularly likely with few, large, slots (e.g., GSO packets).

Synchronize kernel thread ownership of rx ring slots with a bitmap.

Writers acquire a slot race-free by testing tp_status TP_STATUS_KERNEL
while holding the sk receive queue lock. They release this lock before
copying and set tp_status to TP_STATUS_USER to release to userspace
when done. During copying, another writer may take the lock, also see
TP_STATUS_KERNEL, and start writing to the same slot.

Introduce a new rx_owner_map bitmap with a bit per slot. To acquire a
slot, test and set with the lock held. To release race-free, update
tp_status and owner bit as a transaction, so take the lock again.

This is the one of a variety of discussed options (see Link below):

* instead of a shadow ring, embed the data in the slot itself, such as
in tp_padding. But any test for this field may match a value left by
userspace, causing deadlock.

* avoid the lock on release. This leaves a small race if releasing the
shadow slot before setting TP_STATUS_USER. The below reproducer showed
that this race is not academic. If releasing the slot after tp_status,
the race is more subtle. See the first link for details.

* add a new tp_status TP_KERNEL_OWNED to avoid the transactional store
of two fields. But, legacy applications may interpret all non-zero
tp_status as owned by the user. As libpcap does. So this is possible
only opt-in by newer processes. It can be added as an optional mode.

* embed the struct at the tail of pg_vec to avoid extra allocation.
The implementation proved no less complex than a separate field.

The additional locking cost on release adds contention, no different
than scaling on multicore or multiqueue h/w. In practice, below
reproducer nor small packet tcpdump showed a noticeable change in
perf report in cycles spent in spinlock. Where contention is
problematic, packet sockets support mitigation through PACKET_FANOUT.
And we can consider adding opt-in state TP_KERNEL_OWNED.

Easy to reproduce by running multiple netperf or similar TCP_STREAM
flows concurrently with `tcpdump -B 129 -n greater 60000`.

Based on an earlier patchset by Jon Rosen. See links below.

I believe this issue goes back to the introduction of tpacket_rcv,
which predates git history.

Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg237222.html
Suggested-by: Jon Rosen <jrosen@cisco.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jon Rosen <jrosen@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15 00:25:25 -07:00
..
6lowpan
9p
802
8021q
appletalk
atm proc: convert everything to "struct proc_ops" 2020-02-04 03:05:26 +00:00
ax25
batman-adv batman-adv: Don't schedule OGM for disabled interface 2020-02-18 09:07:55 +01:00
bluetooth Bluetooth: Fix race condition in hci_release_sock() 2020-01-26 10:34:17 +02:00
bpf
bpfilter net/bpfilter: fix dprintf usage for /dev/kmsg 2020-03-14 20:58:10 -07:00
bridge net: bridge: fix stale eth hdr pointer in br_dev_xmit 2020-02-24 11:11:19 -08:00
caif net: caif: Add lockdep expression to RCU traversal primitive 2020-03-11 22:55:25 -07:00
can
ceph Merge branch 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-02-08 13:26:41 -08:00
core Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf 2020-03-13 11:13:45 -07:00
dcb
dccp
decnet
dns_resolver
dsa net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed 2020-03-11 23:46:11 -07:00
ethernet net: remove eth_change_mtu 2020-01-27 11:09:31 +01:00
ethtool ethtool: limit bitset size 2020-02-26 11:27:31 -08:00
hsr net: hsr: Pass lockdep expression to RCU lists 2020-02-19 10:55:42 -08:00
ieee802154 nl802154: add missing attribute validation for dev_type 2020-03-03 13:28:48 -08:00
ife
ipv4 net: ip_gre: Separate ERSPAN newlink / changelink callbacks 2020-03-15 00:14:08 -07:00
ipv6 seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number 2020-03-11 23:49:30 -07:00
iucv
kcm
key
l2tp l2tp: Allow duplicate session creation with UDP 2020-02-04 12:35:49 +01:00
l3mdev
lapb
llc
mac80211 mac80211: Do not send mesh HWMP PREQ if HWMP is disabled 2020-03-11 09:04:14 +01:00
mac802154
mpls
mptcp mptcp: always include dack if possible. 2020-03-05 21:34:42 -08:00
ncsi
netfilter netfilter: nft_chain_nat: inet family is missing module ownership 2020-03-06 18:00:43 +01:00
netlabel netlabel_domainhash.c: Use built-in RCU list checking 2020-02-18 12:44:23 -08:00
netlink netlink: Use netlink header as base to calculate bad attribute offset 2020-02-29 21:21:23 -08:00
netrom
nfc net: nfc: fix bounds checking bugs on "pipe" 2020-03-05 21:32:42 -08:00
nsh
openvswitch openvswitch: add missing attribute validation for hash 2020-03-03 13:28:48 -08:00
packet net/packet: tpacket_rcv: avoid a producer race condition 2020-03-15 00:25:25 -07:00
phonet
psample
qrtr
rds net/rds: Track user mapped pages through special API 2020-02-16 18:37:09 -08:00
rfkill
rose Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-01-26 10:40:21 +01:00
rxrpc rxrpc: Fix call RCU cleanup using non-bh-safe locks 2020-02-07 11:20:57 +01:00
sched net_sched: keep alloc_hash updated after hash allocation 2020-03-14 20:42:29 -07:00
sctp inet_diag: return classid for all socket types 2020-03-08 21:57:48 -07:00
smc net/smc: cancel event worker during device removal 2020-03-10 15:40:33 -07:00
strparser
sunrpc xprtrdma: Fix DMA scatter-gather list mapping imbalance 2020-02-13 15:35:33 -05:00
switchdev
tipc tipc: add missing attribute validation for MTU property 2020-03-03 13:28:49 -08:00
tls net/tls: Fix to avoid gettig invalid tls record 2020-02-19 16:32:06 -08:00
unix unix: It's CONFIG_PROC_FS not CONFIG_PROCFS 2020-02-27 11:52:35 -08:00
vmw_vsock vsock: fix potential deadlock in transport->release() 2020-02-27 12:03:56 -08:00
wimax
wireless nl80211: add missing attribute validation for channel switch 2020-03-11 08:58:39 +01:00
x25
xdp xsk: Publish global consumer pointers when NAPI is finished 2020-02-11 15:51:11 +01:00
xfrm xfrm: interface: use icmp_ndo_send helper 2020-02-13 14:19:00 -08:00
compat.c
Kconfig net: disable BRIDGE_NETFILTER by default 2020-02-20 15:02:02 -08:00
Makefile mptcp: Add MPTCP socket stubs 2020-01-24 13:44:07 +01:00
socket.c
sysctl_net.c