linux/net
Aaron Conole 30a92c9e3d openvswitch: Set the skbuff pkt_type for proper pmtud support.
Open vSwitch is originally intended to switch at layer 2, only dealing with
Ethernet frames.  With the introduction of l3 tunnels support, it crossed
into the realm of needing to care a bit about some routing details when
making forwarding decisions.  If an oversized packet would need to be
fragmented during this forwarding decision, there is a chance for pmtu
to get involved and generate a routing exception.  This is gated by the
skbuff->pkt_type field.

When a flow is already loaded into the openvswitch module this field is
set up and transitioned properly as a packet moves from one port to
another.  In the case that a packet execute is invoked after a flow is
newly installed this field is not properly initialized.  This causes the
pmtud mechanism to omit sending the required exception messages across
the tunnel boundary and a second attempt needs to be made to make sure
that the routing exception is properly setup.  To fix this, we set the
outgoing packet's pkt_type to PACKET_OUTGOING, since it can only get
to the openvswitch module via a port device or packet command.

Even for bridge ports as users, the pkt_type needs to be reset when
doing the transmit as the packet is truly outgoing and routing needs
to get involved post packet transformations, in the case of
VXLAN/GENEVE/udp-tunnel packets.  In general, the pkt_type on output
gets ignored, since we go straight to the driver, but in the case of
tunnel ports they go through IP routing layer.

This issue is periodically encountered in complex setups, such as large
openshift deployments, where multiple sets of tunnel traversal occurs.
A way to recreate this is with the ovn-heater project that can setup
a networking environment which mimics such large deployments.  We need
larger environments for this because we need to ensure that flow
misses occur.  In these environment, without this patch, we can see:

  ./ovn_cluster.sh start
  podman exec ovn-chassis-1 ip r a 170.168.0.5/32 dev eth1 mtu 1200
  podman exec ovn-chassis-1 ip netns exec sw01p1 ip r flush cache
  podman exec ovn-chassis-1 ip netns exec sw01p1 \
         ping 21.0.0.3 -M do -s 1300 -c2
  PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data.
  From 21.0.0.3 icmp_seq=2 Frag needed and DF set (mtu = 1142)

  --- 21.0.0.3 ping statistics ---
  ...

Using tcpdump, we can also see the expected ICMP FRAG_NEEDED message is not
sent into the server.

With this patch, setting the pkt_type, we see the following:

  podman exec ovn-chassis-1 ip netns exec sw01p1 \
         ping 21.0.0.3 -M do -s 1300 -c2
  PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data.
  From 21.0.0.3 icmp_seq=1 Frag needed and DF set (mtu = 1222)
  ping: local error: message too long, mtu=1222

  --- 21.0.0.3 ping statistics ---
  ...

In this case, the first ping request receives the FRAG_NEEDED message and
a local routing exception is created.

Tested-by: Jaime Caamano <jcaamano@redhat.com>
Reported-at: https://issues.redhat.com/browse/FDP-164
Fixes: 58264848a5 ("openvswitch: Add vxlan tunneling support.")
Signed-off-by: Aaron Conole <aconole@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/20240516200941.16152-1-aconole@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-21 15:34:04 +02:00
..
6lowpan
9p netfs, 9p: Implement helpers for new write code 2024-05-01 18:07:37 +01:00
802
8021q net: annotate writes on dev->mtu from ndo_change_mtu() 2024-05-07 16:19:14 -07:00
appletalk Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2024-05-09 10:01:01 -07:00
atm inet: introduce dst_rtable() helper 2024-04-30 18:32:38 -07:00
ax25 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2024-05-14 10:53:19 -07:00
batman-adv net: annotate writes on dev->mtu from ndo_change_mtu() 2024-05-07 16:19:14 -07:00
bluetooth Bluetooth: hci_core: Fix not handling hdev->le_num_of_adv_sets=1 2024-05-14 10:56:37 -04:00
bpf bpf: check bpf_dummy_struct_ops program params for test runs 2024-04-25 12:42:43 -07:00
bridge Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next 2024-05-15 07:30:49 -07:00
caif caif: Use UTILITY_NAME_LENGTH instead of hard-coding 16 2024-04-02 18:20:00 -07:00
can
ceph
core net: revert partially applied PHY topology series 2024-05-13 18:35:02 -07:00
dcb
dccp net: dccp: Fix ccid2_rtt_estimator() kernel-doc 2024-05-07 16:15:08 -07:00
devlink devlink: extend devlink_param *set pointer 2024-04-22 13:05:19 -07:00
dns_resolver
dsa net: dsa: add support switches global DSCP priority mapping 2024-05-08 10:35:10 +01:00
ethernet ethernet: Add helper for assigning packet type when dest address does not match device address 2024-04-25 08:20:54 -07:00
ethtool net: revert partially applied PHY topology series 2024-05-13 18:35:02 -07:00
handshake net/handshake: remove redundant assignment to variable ret 2024-04-16 17:14:55 -07:00
hsr Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2024-05-09 10:01:01 -07:00
ieee802154 net: Remove the now superfluous sentinel elements from ctl_table array 2024-05-03 13:29:41 +01:00
ife
ipv4 tcp: Fix shift-out-of-bounds in dctcp_update_alpha(). 2024-05-21 13:34:50 +02:00
ipv6 ipv6: sr: fix memleak in seg6_hmac_init_algo 2024-05-21 13:16:25 +02:00
iucv net/iucv: Avoid explicit cpumask var allocation on stack 2024-04-02 18:19:09 -07:00
kcm
key
l2tp l2tp: fix ICMP error handling for UDP-encap sockets 2024-05-17 12:15:22 -07:00
l3mdev
lapb
llc net: Remove ctl_table sentinel elements from several networking subsystems 2024-05-03 13:29:42 +01:00
mac80211 wifi: mac80211: handle color change per link 2024-05-03 10:18:19 +02:00
mac802154
mctp
mpls net: Remove the now superfluous sentinel elements from ctl_table array 2024-05-03 13:29:41 +01:00
mptcp mptcp: include inet_common in mib.h 2024-05-13 18:29:23 -07:00
ncsi
netfilter netfilter pull request 24-05-12 2024-05-13 13:12:35 -07:00
netlabel netlabel: fix RCU annotation for IPv4 options on socket creation 2024-05-13 14:58:12 -07:00
netlink netlink: support all extack types in dumps 2024-04-23 10:09:49 -07:00
netrom netrom: fix possible dead-lock in nr_rt_ioctl() 2024-05-16 19:35:44 -07:00
nfc nfc: nci: Fix uninit-value in nci_rx_work 2024-05-20 11:41:26 +01:00
nsh nsh: Restore skb->{protocol,data,mac_header} for outer header in nsh_gso_segment(). 2024-04-26 12:20:01 +02:00
openvswitch openvswitch: Set the skbuff pkt_type for proper pmtud support. 2024-05-21 15:34:04 +02:00
packet af_packet: do not call packet_read_pending() from tpacket_destruct_skb() 2024-05-16 19:38:05 -07:00
phonet Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2024-05-09 10:01:01 -07:00
psample ip_tunnel: convert __be16 tunnel flags to bitmaps 2024-04-01 10:49:28 +01:00
qrtr net: qrtr: ns: Fix module refcnt 2024-05-16 09:47:45 +01:00
rds net: rds: Remove the now superfluous sentinel elements from ctl_table array 2024-05-03 13:29:42 +01:00
rfkill net: rfkill: gpio: Convert to platform remove callback returning void 2024-03-25 15:40:22 +01:00
rose net: Remove ctl_table sentinel elements from several networking subsystems 2024-05-03 13:29:42 +01:00
rxrpc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2024-05-09 10:01:01 -07:00
sched net/sched: adjust device watchdog timer to detect stopped queue at right time 2024-05-09 20:24:13 -07:00
sctp net: Remove ctl_table sentinel elements from several networking subsystems 2024-05-03 13:29:42 +01:00
smc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2024-05-09 10:01:01 -07:00
strparser
sunrpc net: sunrpc: Remove the now superfluous sentinel elements from ctl_table array 2024-05-03 13:29:42 +01:00
switchdev net: bridge: switchdev: Improve error message for port_obj_add/del functions 2024-05-08 12:19:12 +01:00
tipc net: Remove ctl_table sentinel elements from several networking subsystems 2024-05-03 13:29:42 +01:00
tls Revert "net: mirror skb frag ref/unref helpers" 2024-05-03 16:05:53 -07:00
unix af_unix: Fix garbage collection of embryos carrying OOB with SCM_RIGHTS 2024-05-21 13:42:02 +02:00
vmw_vsock vsock/virtio: fix packet delivery to tap device 2024-04-02 18:00:24 -07:00
wireless Merge wireless into wireless-next 2024-05-06 16:32:51 +02:00
x25 ax.25: x.25: Remove the now superfluous sentinel elements from ctl_table array 2024-05-03 13:29:43 +01:00
xdp xsk: validate user input for XDP_{UMEM|COMPLETION}_FILL_RING 2024-04-05 22:47:22 -07:00
xfrm Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2024-05-09 10:01:01 -07:00
compat.c
devres.c
Kconfig net: add IEEE 802.1q specific helpers 2024-05-08 10:35:09 +01:00
Kconfig.debug
Makefile
socket.c io_uring: separate header for exported net bits 2024-04-15 08:10:26 -06:00
sysctl_net.c sysctl: treewide: constify argument ctl_table_root::permissions(table) 2024-04-24 09:43:54 +02:00