mirror of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-08-05 16:54:27 +00:00
![]() PACKET_RX_RING can cause multiple writers to access the same slot if a fast writer wraps the ring while a slow writer is still copying. This is particularly likely with few, large, slots (e.g., GSO packets). Synchronize kernel thread ownership of rx ring slots with a bitmap. Writers acquire a slot race-free by testing tp_status TP_STATUS_KERNEL while holding the sk receive queue lock. They release this lock before copying and set tp_status to TP_STATUS_USER to release to userspace when done. During copying, another writer may take the lock, also see TP_STATUS_KERNEL, and start writing to the same slot. Introduce a new rx_owner_map bitmap with a bit per slot. To acquire a slot, test and set with the lock held. To release race-free, update tp_status and owner bit as a transaction, so take the lock again. This is the one of a variety of discussed options (see Link below): * instead of a shadow ring, embed the data in the slot itself, such as in tp_padding. But any test for this field may match a value left by userspace, causing deadlock. * avoid the lock on release. This leaves a small race if releasing the shadow slot before setting TP_STATUS_USER. The below reproducer showed that this race is not academic. If releasing the slot after tp_status, the race is more subtle. See the first link for details. * add a new tp_status TP_KERNEL_OWNED to avoid the transactional store of two fields. But, legacy applications may interpret all non-zero tp_status as owned by the user. As libpcap does. So this is possible only opt-in by newer processes. It can be added as an optional mode. * embed the struct at the tail of pg_vec to avoid extra allocation. The implementation proved no less complex than a separate field. The additional locking cost on release adds contention, no different than scaling on multicore or multiqueue h/w. In practice, below reproducer nor small packet tcpdump showed a noticeable change in perf report in cycles spent in spinlock. Where contention is problematic, packet sockets support mitigation through PACKET_FANOUT. And we can consider adding opt-in state TP_KERNEL_OWNED. Easy to reproduce by running multiple netperf or similar TCP_STREAM flows concurrently with `tcpdump -B 129 -n greater 60000`. Based on an earlier patchset by Jon Rosen. See links below. I believe this issue goes back to the introduction of tpacket_rcv, which predates git history. Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg237222.html Suggested-by: Jon Rosen <jrosen@cisco.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jon Rosen <jrosen@cisco.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
---|---|---|
.. | ||
6lowpan | ||
9p | ||
802 | ||
8021q | ||
appletalk | ||
atm | ||
ax25 | ||
batman-adv | ||
bluetooth | ||
bpf | ||
bpfilter | ||
bridge | ||
caif | ||
can | ||
ceph | ||
core | ||
dcb | ||
dccp | ||
decnet | ||
dns_resolver | ||
dsa | ||
ethernet | ||
ethtool | ||
hsr | ||
ieee802154 | ||
ife | ||
ipv4 | ||
ipv6 | ||
iucv | ||
kcm | ||
key | ||
l2tp | ||
l3mdev | ||
lapb | ||
llc | ||
mac80211 | ||
mac802154 | ||
mpls | ||
mptcp | ||
ncsi | ||
netfilter | ||
netlabel | ||
netlink | ||
netrom | ||
nfc | ||
nsh | ||
openvswitch | ||
packet | ||
phonet | ||
psample | ||
qrtr | ||
rds | ||
rfkill | ||
rose | ||
rxrpc | ||
sched | ||
sctp | ||
smc | ||
strparser | ||
sunrpc | ||
switchdev | ||
tipc | ||
tls | ||
unix | ||
vmw_vsock | ||
wimax | ||
wireless | ||
x25 | ||
xdp | ||
xfrm | ||
compat.c | ||
Kconfig | ||
Makefile | ||
socket.c | ||
sysctl_net.c |