Commit graph

8744 commits

Author SHA1 Message Date
Ido Schimmel
fc1266a061 ipv6: fib_rules: Add port mask matching
Extend IPv6 FIB rules to match on source and destination ports using a
mask. Note that the mask is only set when not matching on a range.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250217134109.311176-5-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-19 18:43:38 -08:00
Willem de Bruijn
5cd2f78886 ipv6: initialize inet socket cookies with sockcm_init
Avoid open coding the same logic.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250214222720.3205500-8-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-18 18:27:20 -08:00
Willem de Bruijn
096208592b ipv6: replace ipcm6_init calls with ipcm6_init_sk
This initializes tclass and dontfrag before cmsg parsing, removing the
need for explicit checks against -1 in each caller.

Leave hlimit set to -1, because its full initialization
(in ip6_sk_dst_hoplimit) requires more state (dst, flowi6, ..).

This also prepares for calling sockcm_init in a follow-on patch.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250214222720.3205500-7-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-18 18:27:20 -08:00
Eric Dumazet
0784d83df3 ndisc: ndisc_send_redirect() cleanup
ndisc_send_redirect() is always called under rcu_read_lock().

It can use dev_net_rcu() and avoid one redundant
rcu_read_lock()/rcu_read_unlock() pair.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250214140705.2105890-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-15 09:08:46 -08:00
Eric Dumazet
a3a128f611 inet: consolidate inet_csk_clone_lock()
Current inet_sock_set_state trace from inet_csk_clone_lock() is missing
many details :

... sock:inet_sock_set_state: family=AF_INET6 protocol=IPPROTO_TCP \
    sport=4901 dport=0 \
    saddr=127.0.0.6 daddr=0.0.0.0 \
    saddrv6=:: daddrv6=:: \
    oldstate=TCP_LISTEN newstate=TCP_SYN_RECV

Only the sport gives the listener port, no other parts of the n-tuple are correct.

In this patch, I initialize relevant fields of the new socket before
calling inet_sk_set_state(newsk, TCP_SYN_RECV).

We now have a trace including all the source/destination bits.

... sock:inet_sock_set_state: family=AF_INET6 protocol=IPPROTO_TCP \
    sport=4901 dport=47648 \
    saddr=127.0.0.6 daddr=127.0.0.6 \
    saddrv6=2002:a05:6830:1f85:: daddrv6=2001:4860:f803:65::3 \
    oldstate=TCP_LISTEN newstate=TCP_SYN_RECV

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250212131328.1514243-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-14 13:40:33 -08:00
Jakub Kicinski
7a7e019713 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc3).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-13 12:43:30 -08:00
Eric Dumazet
a527750d87 ipv6: mcast: add RCU protection to mld_newpack()
mld_newpack() can be called without RTNL or RCU being held.

Note that we no longer can use sock_alloc_send_skb() because
ipv6.igmp_sk uses GFP_KERNEL allocations which can sleep.

Instead use alloc_skb() and charge the net->ipv6.igmp_sk
socket under RCU protection.

Fixes: b8ad0cbc58 ("[NETNS][IPV6] mcast - handle several network namespace")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250212141021.1663666-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-13 08:37:21 -08:00
Kuniyuki Iwashima
5a1ccffd30 ip: fib_rules: Fetch net from fib_rule in fib[46]_rule_configure().
The following patch will not set skb->sk from VRF path.

Let's fetch net from fib_rule->fr_net instead of sock_net(skb->sk)
in fib[46]_rule_configure().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250207072502.87775-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10 19:08:52 -08:00
Eric Dumazet
087c1faa59 ipv6: mcast: extend RCU protection in igmp6_send()
igmp6_send() can be called without RTNL or RCU being held.

Extend RCU protection so that we can safely fetch the net pointer
and avoid a potential UAF.

Note that we no longer can use sock_alloc_send_skb() because
ipv6.igmp_sk uses GFP_KERNEL allocations which can sleep.

Instead use alloc_skb() and charge the net->ipv6.igmp_sk
socket under RCU protection.

Fixes: b8ad0cbc58 ("[NETNS][IPV6] mcast - handle several network namespace")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250207135841.1948589-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10 18:09:10 -08:00
Eric Dumazet
ed6ae1f325 ndisc: extend RCU protection in ndisc_send_skb()
ndisc_send_skb() can be called without RTNL or RCU held.

Acquire rcu_read_lock() earlier, so that we can use dev_net_rcu()
and avoid a potential UAF.

Fixes: 1762f7e88e ("[NETNS][IPV6] ndisc - make socket control per namespace")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250207135841.1948589-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10 18:09:10 -08:00
Eric Dumazet
628e6d1893 ndisc: use RCU protection in ndisc_alloc_skb()
ndisc_alloc_skb() can be called without RTNL or RCU being held.

Add RCU protection to avoid possible UAF.

Fixes: de09334b93 ("ndisc: Introduce ndisc_alloc_skb() helper.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250207135841.1948589-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10 18:09:09 -08:00
Eric Dumazet
48145a57d4 ndisc: ndisc_send_redirect() must use dev_get_by_index_rcu()
ndisc_send_redirect() is called under RCU protection, not RTNL.

It must use dev_get_by_index_rcu() instead of __dev_get_by_index()

Fixes: 2f17becfbe ("vrf: check the original netdevice for generating redirect")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250207135841.1948589-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10 18:09:09 -08:00
Eric Dumazet
b768294d44 ipv6: Use RCU in ip6_input()
Instead of grabbing rcu_read_lock() from ip6_input_finish(),
do it earlier in is caller, so that ip6_input() access
to dev_net() can be validated by LOCKDEP.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250205155120.1676781-13-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06 16:14:15 -08:00
Eric Dumazet
34aef2b0ce ipv6: icmp: convert to dev_net_rcu()
icmp6_send() must acquire rcu_read_lock() sooner to ensure
the dev_net() call done from a safe context.

Other ICMPv6 uses of dev_net() seem safe, change them to
dev_net_rcu() to get LOCKDEP support to catch bugs.

Fixes: 9a43b709a2 ("[NETNS][IPV6] icmp6 - make icmpv6_socket per namespace")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250205155120.1676781-12-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06 16:14:15 -08:00
Eric Dumazet
3c8ffcd248 ipv6: use RCU protection in ip6_default_advmss()
ip6_default_advmss() needs rcu protection to make
sure the net structure it reads does not disappear.

Fixes: 5578689a4e ("[NETNS][IPV6] route6 - make route6 per namespace")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250205155120.1676781-11-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06 16:14:15 -08:00
Yan Zhai
235174b2be udp: gso: do not drop small packets when PMTU reduces
Commit 4094871db1 ("udp: only do GSO if # of segs > 1") avoided GSO
for small packets. But the kernel currently dismisses GSO requests only
after checking MTU/PMTU on gso_size. This means any packets, regardless
of their payload sizes, could be dropped when PMTU becomes smaller than
requested gso_size. We encountered this issue in production and it
caused a reliability problem that new QUIC connection cannot be
established before PMTU cache expired, while non GSO sockets still
worked fine at the same time.

Ideally, do not check any GSO related constraints when payload size is
smaller than requested gso_size, and return EMSGSIZE instead of EINVAL
on MTU/PMTU check failure to be more specific on the error cause.

Fixes: 4094871db1 ("udp: only do GSO if # of segs > 1")
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-02-03 10:13:27 +00:00
Jakub Kicinski
92191dd107 net: ipv6: fix dst ref loops in rpl, seg6 and ioam6 lwtunnels
Some lwtunnels have a dst cache for post-transformation dst.
If the packet destination did not change we may end up recording
a reference to the lwtunnel in its own cache, and the lwtunnel
state will never be freed.

Discovered by the ioam6.sh test, kmemleak was recently fixed
to catch per-cpu memory leaks. I'm not sure if rpl and seg6
can actually hit this, but in principle I don't see why not.

Fixes: 8cb3bf8bff ("ipv6: ioam: Add support for the ip6ip6 encapsulation")
Fixes: 6c8702c60b ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels")
Fixes: a7a29f9c36 ("net: ipv6: add rpl sr tunnel")
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250130031519.2716843-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-01 16:55:44 -08:00
Jakub Kicinski
c71a192976 net: ipv6: fix dst refleaks in rpl, seg6 and ioam6 lwtunnels
dst_cache_get() gives us a reference, we need to release it.

Discovered by the ioam6.sh test, kmemleak was recently fixed
to catch per-cpu memory leaks.

Fixes: 985ec6f5e6 ("net: ipv6: rpl_iptunnel: mitigate 2-realloc issue")
Fixes: 40475b6376 ("net: ipv6: seg6_iptunnel: mitigate 2-realloc issue")
Fixes: dce525185b ("net: ipv6: ioam6_iptunnel: mitigate 2-realloc issue")
Reviewed-by: Justin Iurman <justin.iurman@uliege.be>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250130031519.2716843-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-01 16:55:30 -08:00
Jakub Kicinski
463ec95a16 ipsec-2025-01-27
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmeXIF4ACgkQrB3Eaf9P
 W7fRMA/+Js2x0HNA3+6SMb5nJzY6lywi1BIRAzstyfd6EsxbHlgfdWYCCpixboA0
 /ZfDe7yPND/ewPIQLT9eO6hk9YzuAVhYUkdIcDC5jdFDNbh9dDqBdyu5P/5spsi9
 9SdFEucoOsKBP4ejmSvtwGsVNIf/1vB8hFqYxB+vh8+d/g8PHrI3xxk+2b7KkIGS
 ms+IyDCoVdCGQUOp4BGtEQbzXtx67diH5dcfwg8/DJpSMbfqO3ZFRG7gPu8C5Igt
 cxVSCW67rv/zzPkGPv8B+nczAdVUZ3OFXgEWxdDCN/mUbFKwxUcIDxZVJMfBBAUP
 lcjsbzmNfj2PNMLZFe/5LuU6o+sFEZdxmTPmvbb+lSYrRHx2oz2/Jb871gEj8rTC
 vNZ+1Lu1k7QRjEPiO1fe85vWdmU4G81+WAzC88nD0KYLDUN4c+MmxUFQkKbAxf6p
 e6VCihcKqi5Sa6R73Ohm87iyiSuv8WvkyVSM0XgQrkXWDFy5Jp2Bo25pW0QgVxK+
 l/aHhDA+YHFEOZTcjZsh/EdKlQRIxBNJ3ualITkjd2T+A1WyWm0A3S+kYZQCKqiM
 WGGWM3oVNXkUAaRxvURNvmXqO+hPeKfIElDeVrOUjG8zQ+EktKcg4KpDQb2BGJCj
 s9ksFj0pplR4GHxUrFmkEPxJWYKpFqUYCZMJDnBnHFm1ykC7QGM=
 =pg+h
 -----END PGP SIGNATURE-----

Merge tag 'ipsec-2025-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

Steffen Klassert says:

====================
pull request (net): ipsec 2025-01-27

1) Fix incrementing the upper 32 bit sequence numbers for GSO skbs.
   From Jianbo Liu.

2) Fix an out-of-bounds read on xfrm state lookup.
   From Florian Westphal.

3) Fix secpath handling on packet offload mode.
   From Alexandre Cassen.

4) Fix the usage of skb->sk in the xfrm layer.

5) Don't disable preemption while looking up cache state
   to fix PREEMPT_RT.
   From Sebastian Sewior.

* tag 'ipsec-2025-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  xfrm: Don't disable preemption while looking up cache state.
  xfrm: Fix the usage of skb->sk
  xfrm: delete intermediate secpath entry in packet offload mode
  xfrm: state: fix out-of-bounds read during lookup
  xfrm: replay: Fix the update of replay_esn->oseq_hi for GSO
====================

Link: https://patch.msgid.link/20250127060757.3946314-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-27 15:15:12 -08:00
Kuniyuki Iwashima
7bcf45ddb8 ipv6: Convert inet6_rtm_deladdr() to per-netns RTNL.
Let's register inet6_rtm_deladdr() with RTNL_FLAG_DOIT_PERNET and
hold rtnl_net_lock() before inet6_addr_del().

Now that inet6_addr_del() is always called under per-netns RTNL.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-12-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20 12:16:06 -08:00
Kuniyuki Iwashima
82a1e6aa8f ipv6: Convert inet6_rtm_newaddr() to per-netns RTNL.
Let's register inet6_rtm_newaddr() with RTNL_FLAG_DOIT_PERNET
and hold rtnl_net_lock() before __dev_get_by_index().

Now that inet6_addr_add() and inet6_addr_modify() are always
called under per-netns RTNL.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-11-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20 12:16:06 -08:00
Kuniyuki Iwashima
867b385251 ipv6: Move lifetime validation to inet6_rtm_newaddr().
inet6_addr_add() and inet6_addr_modify() have the same code to validate
IPv6 lifetime that is done under RTNL.

Let's factorise it out to inet6_rtm_newaddr() so that we can validate
the lifetime without RTNL later.

Note that inet6_addr_add() is called from addrconf_add_ifaddr(), but the
lifetime is INFINITY_LIFE_TIME in the path, so expires and flags are 0.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-10-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20 12:16:05 -08:00
Kuniyuki Iwashima
2f1ace4127 ipv6: Set cfg.ifa_flags before device lookup in inet6_rtm_newaddr().
We will convert inet6_rtm_newaddr() to per-netns RTNL.

Except for IFA_F_OPTIMISTIC, cfg.ifa_flags can be set before
__dev_get_by_index().

Let's move ifa_flags setup before __dev_get_by_index() so that
we can set ifa_flags without RTNL.

Also, now it's moved before tb[IFA_CACHEINFO] in preparing for
the next patch.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-9-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20 12:16:05 -08:00
Kuniyuki Iwashima
f7fce98a73 ipv6: Pass dev to inet6_addr_add().
inet6_addr_add() is called from inet6_rtm_newaddr() and
addrconf_add_ifaddr().

inet6_addr_add() looks up dev by __dev_get_by_index(), but
it's already done in inet6_rtm_newaddr().

Let's move the 2nd lookup to addrconf_add_ifaddr() and pass
dev to inet6_addr_add().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20 12:16:05 -08:00
Kuniyuki Iwashima
832128cc44 ipv6: Convert inet6_ioctl() to per-netns RTNL.
These functions are called from inet6_ioctl() with a socket's netns
and hold RTNL.

  * SIOCSIFADDR    : addrconf_add_ifaddr()
  * SIOCDIFADDR    : addrconf_del_ifaddr()
  * SIOCSIFDSTADDR : addrconf_set_dstaddr()

Let's use rtnl_net_lock().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20 12:16:05 -08:00
Kuniyuki Iwashima
cdc5c1196e ipv6: Hold rtnl_net_lock() in addrconf_init() and addrconf_cleanup().
addrconf_init() holds RTNL for blackhole_netdev, which is the global
device in init_net.

addrconf_cleanup() holds RTNL to clean up devices in init_net too.

Let's use rtnl_net_lock(&init_net) there.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20 12:16:05 -08:00
Kuniyuki Iwashima
02cdd78b4e ipv6: Hold rtnl_net_lock() in addrconf_dad_work().
addrconf_dad_work() is per-address work and holds RTNL internally.

We can fetch netns as dev_net(ifp->idev->dev).

Let's use rtnl_net_lock().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20 12:16:05 -08:00
Kuniyuki Iwashima
6550ba0863 ipv6: Hold rtnl_net_lock() in addrconf_verify_work().
addrconf_verify_work() is per-netns work to call addrconf_verify_rtnl()
under RTNL.

Let's use rtnl_net_lock().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20 12:16:05 -08:00
Kuniyuki Iwashima
93c839e3ed ipv6: Convert net.ipv6.conf.${DEV}.XXX sysctl to per-netns RTNL.
net.ipv6.conf.${DEV}.XXX sysctl are changed under RTNL:

  * forwarding
  * ignore_routes_with_linkdown
  * disable_ipv6
  * proxy_ndp
  * addr_gen_mode
  * stable_secret
  * disable_policy

Let's use rtnl_net_lock() there.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20 12:16:04 -08:00
Steffen Klassert
1620c88887 xfrm: Fix the usage of skb->sk
xfrm assumed to always have a full socket at skb->sk.
This is not always true, so fix it by converting to a
full socket before it is used.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
2025-01-20 07:06:53 +01:00
Eric Dumazet
3440fa34ad inet: ipmr: fix data-races
Following fields of 'struct mr_mfc' can be updated
concurrently (no lock protection) from ip_mr_forward()
and ip6_mr_forward()

- bytes
- pkt
- wrong_if
- lastuse

They also can be read from other functions.

Convert bytes, pkt and wrong_if to atomic_long_t,
and use READ_ONCE()/WRITE_ONCE() for lastuse.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250114221049.1190631-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 15:07:23 -08:00
David S. Miller
7b24f164cf ipsec-next-2025-01-09
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmd/mNkACgkQrB3Eaf9P
 W7cQww/9Hnv7+wosuBxFW2o2ptgZThiwj/Ovz0oiWGWvctpN7VlpwmrDOsXQ+XMn
 xF6JBFEjnJanYoDBb78D0dMffJcMcUKZopTUU/ZMitSNr8aIHYiuB4SWPG1tqxl4
 Ete7Mr2m3tS96YePQNnAaRZzEuGsx3BQb28VLTWl9So81MByD2OK4fsAbYz22Gg8
 7A6tDHn1mUd9b2VG+LeeBZaDDFG8C0O2x4E/8Z3DX3z1N8y3LABPwZ38jcgTviKO
 1ZldGrJT+PownBydu23bWDassKE2TuVvGH9e/SOPeQj8DJ4Lmd0bafMZTY6xwfNT
 RJCwhlzZUpYRXFzvcf3+U3egsqEWEemV7/LzAapdT0V9OqfLWMUh3b1jMA4KcblZ
 qmlm/MhZyXutDoeuASwtM4jgM3wGwovOofrKKsb13hD9VLBs5jFZmFSw5TlbmwE3
 sjv7V4pFwNyfJnwQtyMmfuuHiy8w+fzqAA2GCg8mF3OosHABH/FOvEBP8xg1Vqu1
 iKlLkByfyaCFD+GxTPzqSDvSB8nDzeZBgBM/ILGuwH0OWr+0gxBzl1sxrhAkw9hC
 Gf+4J3wg7EknTxfrJk4LyqfyS50GvUIzpLSkHYxDtQRd4zv75bxRzMvoZvuapTtN
 GGpWYftKO8n2kgLORQ0dTtFee3c4w/No+KXWsFdtRVNpyk3N0Yw=
 =U79q
 -----END PGP SIGNATURE-----

Merge tag 'ipsec-next-2025-01-09' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next

Steffen Klassert says:

====================
ipsec-next-2025-01-09

1) Implement the AGGFRAG protocol and basic IP-TFS (RFC9347) functionality.
   From Christian Hopps.

2) Support ESN context update to hardware for TX.
   From Jianbo Liu.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-10 09:15:17 +00:00
Yuyang Huang
33d97a07b3 netlink: add IPv6 anycast join/leave notifications
This change introduces a mechanism for notifying userspace
applications about changes to IPv6 anycast addresses via netlink. It
includes:

* Addition and deletion of IPv6 anycast addresses are reported using
  RTM_NEWANYCAST and RTM_DELANYCAST.
* A new netlink group (RTNLGRP_IPV6_ACADDR) for subscribing to these
  notifications.

This enables user space applications(e.g. ip monitor) to efficiently
track anycast addresses through netlink messages, improving metrics
collection and system monitoring. It also unlocks the potential for
advanced anycast management in user space, such as hardware offload
control and fine grained network control.

Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250107114355.1766086-1-yuyanghuang@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-09 12:54:45 +01:00
Jakub Kicinski
385f186aba Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.13-rc6).

No conflicts.

Adjacent changes:

include/linux/if_vlan.h
  f91a5b8089 ("af_packet: fix vlan_get_protocol_dgram() vs MSG_PEEK")
  3f330db306 ("net: reformat kdoc return statements")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-03 16:29:29 -08:00
Eric Dumazet
260466b576 ila: serialize calls to nf_register_net_hooks()
syzbot found a race in ila_add_mapping() [1]

commit 031ae72825 ("ila: call nf_unregister_net_hooks() sooner")
attempted to fix a similar issue.

Looking at the syzbot repro, we have concurrent ILA_CMD_ADD commands.

Add a mutex to make sure at most one thread is calling nf_register_net_hooks().

[1]
 BUG: KASAN: slab-use-after-free in rht_key_hashfn include/linux/rhashtable.h:159 [inline]
 BUG: KASAN: slab-use-after-free in __rhashtable_lookup.constprop.0+0x426/0x550 include/linux/rhashtable.h:604
Read of size 4 at addr ffff888028f40008 by task dhcpcd/5501

CPU: 1 UID: 0 PID: 5501 Comm: dhcpcd Not tainted 6.13.0-rc4-syzkaller-00054-gd6ef8b40d075 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
Call Trace:
 <IRQ>
  __dump_stack lib/dump_stack.c:94 [inline]
  dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120
  print_address_description mm/kasan/report.c:378 [inline]
  print_report+0xc3/0x620 mm/kasan/report.c:489
  kasan_report+0xd9/0x110 mm/kasan/report.c:602
  rht_key_hashfn include/linux/rhashtable.h:159 [inline]
  __rhashtable_lookup.constprop.0+0x426/0x550 include/linux/rhashtable.h:604
  rhashtable_lookup include/linux/rhashtable.h:646 [inline]
  rhashtable_lookup_fast include/linux/rhashtable.h:672 [inline]
  ila_lookup_wildcards net/ipv6/ila/ila_xlat.c:127 [inline]
  ila_xlat_addr net/ipv6/ila/ila_xlat.c:652 [inline]
  ila_nf_input+0x1ee/0x620 net/ipv6/ila/ila_xlat.c:185
  nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline]
  nf_hook_slow+0xbb/0x200 net/netfilter/core.c:626
  nf_hook.constprop.0+0x42e/0x750 include/linux/netfilter.h:269
  NF_HOOK include/linux/netfilter.h:312 [inline]
  ipv6_rcv+0xa4/0x680 net/ipv6/ip6_input.c:309
  __netif_receive_skb_one_core+0x12e/0x1e0 net/core/dev.c:5672
  __netif_receive_skb+0x1d/0x160 net/core/dev.c:5785
  process_backlog+0x443/0x15f0 net/core/dev.c:6117
  __napi_poll.constprop.0+0xb7/0x550 net/core/dev.c:6883
  napi_poll net/core/dev.c:6952 [inline]
  net_rx_action+0xa94/0x1010 net/core/dev.c:7074
  handle_softirqs+0x213/0x8f0 kernel/softirq.c:561
  __do_softirq kernel/softirq.c:595 [inline]
  invoke_softirq kernel/softirq.c:435 [inline]
  __irq_exit_rcu+0x109/0x170 kernel/softirq.c:662
  irq_exit_rcu+0x9/0x30 kernel/softirq.c:678
  instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1049 [inline]
  sysvec_apic_timer_interrupt+0xa4/0xc0 arch/x86/kernel/apic/apic.c:1049

Fixes: 7f00feaf10 ("ila: Add generic ILA translation facility")
Reported-by: syzbot+47e761d22ecf745f72b9@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6772c9ae.050a0220.2f3838.04c7.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Tom Herbert <tom@herbertland.com>
Link: https://patch.msgid.link/20241230162849.2795486-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-02 18:42:32 -08:00
Yuyang Huang
aa4ad7c3f2 netlink: correct nlmsg size for multicast notifications
Corrected the netlink message size calculation for multicast group
join/leave notifications. The previous calculation did not account for
the inclusion of both IPv4/IPv6 addresses and ifa_cacheinfo in the
payload. This fix ensures that the allocated message size is
sufficient to hold all necessary information.

This patch also includes the following improvements:
* Uses GFP_KERNEL instead of GFP_ATOMIC when holding the RTNL mutex.
* Uses nla_total_size(sizeof(struct in6_addr)) instead of
  nla_total_size(16).
* Removes unnecessary EXPORT_SYMBOL().

Fixes: 2c2b61d213 ("netlink: add IGMP/MLD join/leave notifications")
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
Link: https://patch.msgid.link/20241221100007.1910089-1-yuyanghuang@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-23 10:26:43 -08:00
Stefano Brivio
a502ea6fa9 udp: Deal with race between UDP socket address change and rehash
If a UDP socket changes its local address while it's receiving
datagrams, as a result of connect(), there is a period during which
a lookup operation might fail to find it, after the address is changed
but before the secondary hash (port and address) and the four-tuple
hash (local and remote ports and addresses) are updated.

Secondary hash chains were introduced by commit 30fff9231f ("udp:
bind() optimisation") and, as a result, a rehash operation became
needed to make a bound socket reachable again after a connect().

This operation was introduced by commit 719f835853 ("udp: add
rehash on connect()") which isn't however a complete fix: the
socket will be found once the rehashing completes, but not while
it's pending.

This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a
client sending datagrams to it. After the server receives the first
datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to
the address of the sender, in order to set up a directed flow.

Now, if the client, running on a different CPU thread, happens to
send a (subsequent) datagram while the server's socket changes its
address, but is not rehashed yet, this will result in a failed
lookup and a port unreachable error delivered to the client, as
apparent from the following reproducer:

  LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4))
  dd if=/dev/urandom bs=1 count=${LEN} of=tmp.in

  while :; do
  	taskset -c 1 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
  	sleep 0.1 || sleep 1
  	taskset -c 2 socat OPEN:tmp.in UDP4:localhost:1337,shut-null
  	wait
  done

where the client will eventually get ECONNREFUSED on a write()
(typically the second or third one of a given iteration):

  2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused

This issue was first observed as a seldom failure in Podman's tests
checking UDP functionality while using pasta(1) to connect the
container's network namespace, which leads us to a reproducer with
the lookup error resulting in an ICMP packet on a tap device:

  LOCAL_ADDR="$(ip -j -4 addr show|jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')"

  while :; do
  	./pasta --config-net -p pasta.pcap -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
  	sleep 0.2 || sleep 1
  	socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null
  	wait
  	cmp tmp.in tmp.out
  done

Once this fails:

  tmp.in tmp.out differ: char 8193, line 29

we can finally have a look at what's going on:

  $ tshark -r pasta.pcap
      1   0.000000           :: ? ff02::16     ICMPv6 110 Multicast Listener Report Message v2
      2   0.168690 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
      3   0.168767 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
      4   0.168806 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
      5   0.168827 c6:47:05:8d:dc:04 ? Broadcast    ARP 42 Who has 88.198.0.161? Tell 88.198.0.164
      6   0.168851 9a:55:9a:55:9a:55 ? c6:47:05:8d:dc:04 ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55
      7   0.168875 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
      8   0.168896 88.198.0.164 ? 88.198.0.161 ICMP 590 Destination unreachable (Port unreachable)
      9   0.168926 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
     10   0.168959 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
     11   0.168989 88.198.0.161 ? 88.198.0.164 UDP 4138 60260 ? 1337 Len=4096
     12   0.169010 88.198.0.161 ? 88.198.0.164 UDP 42 60260 ? 1337 Len=0

On the third datagram received, the network namespace of the container
initiates an ARP lookup to deliver the ICMP message.

In another variant of this reproducer, starting the client with:

  strace -f pasta --config-net -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc 2>strace.log &

and connecting to the socat server using a loopback address:

  socat OPEN:tmp.in UDP4:localhost:1337,shut-null

we can more clearly observe a sendmmsg() call failing after the
first datagram is delivered:

  [pid 278012] connect(173, 0x7fff96c95fc0, 16) = 0
  [...]
  [pid 278012] recvmmsg(173, 0x7fff96c96020, 1024, MSG_DONTWAIT, NULL) = -1 EAGAIN (Resource temporarily unavailable)
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = 1
  [...]
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused)

and, somewhat confusingly, after a connect() on the same socket
succeeded.

Until commit 4cdeeee925 ("net: udp: prefer listeners bound to an
address"), the race between receive address change and lookup didn't
actually cause visible issues, because, once the lookup based on the
secondary hash chain failed, we would still attempt a lookup based on
the primary hash (destination port only), and find the socket with the
outdated secondary hash.

That change, however, dropped port-only lookups altogether, as side
effect, making the race visible.

To fix this, while avoiding the need to make address changes and
rehash atomic against lookups, reintroduce primary hash lookups as
fallback, if lookups based on four-tuple and secondary hashes fail.

To this end, introduce a simplified lookup implementation, which
doesn't take care of SO_REUSEPORT groups: if we have one, there are
multiple sockets that would match the four-tuple or secondary hash,
meaning that we can't run into this race at all.

v2:
  - instead of synchronising lookup operations against address change
    plus rehash, reintroduce a simplified version of the original
    primary hash lookup as fallback

v1:
  - fix build with CONFIG_IPV6=n: add ifdef around sk_v6_rcv_saddr
    usage (Kuniyuki Iwashima)
  - directly use sk_rcv_saddr for IPv4 receive addresses instead of
    fetching inet_rcv_saddr (Kuniyuki Iwashima)
  - move inet_update_saddr() to inet_hashtables.h and use that
    to set IPv4/IPv6 addresses as suitable (Kuniyuki Iwashima)
  - rebase onto net-next, update commit message accordingly

Reported-by: Ed Santiago <santiago@redhat.com>
Link: https://github.com/containers/podman/issues/24147
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: 30fff9231f ("udp: bind() optimisation")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-12-23 11:39:55 +00:00
Ido Schimmel
ba4138032a ipv6: Add flow label to route get requests
The default IPv6 multipath hash policy takes the flow label into account
when calculating a multipath hash and previous patches added a flow
label selector to IPv6 FIB rules.

Allow user space to specify a flow label in route get requests by adding
a new netlink attribute and using its value to populate the "flowlabel"
field in the IPv6 flow info structure prior to a route lookup.

Deny the attribute in RTM_{NEW,DEL}ROUTE requests by checking for it in
rtm_to_fib6_config() and returning an error if present.

A subsequent patch will use this capability to test the new flow label
selector in IPv6 FIB rules.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-19 16:02:22 +01:00
Ido Schimmel
9aa77531a1 ipv6: fib_rules: Add flow label support
Implement support for the new flow label selector which allows IPv6 FIB
rules to match on the flow label with a mask. Ensure that both flow
label attributes are specified (or none) and that the mask is valid.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-19 16:02:22 +01:00
Eric Dumazet
a853c60950 inetpeer: do not get a refcount in inet_getpeer()
All inet_getpeer() callers except ip4_frag_init() don't need
to acquire a permanent refcount on the inetpeer.

They can switch to full RCU protection.

Move the refcount_inc_not_zero() into ip4_frag_init(),
so that all the other callers no longer have to
perform a pair of expensive atomic operations on
a possibly contended cache line.

inet_putpeer() no longer needs to be exported.

After this patch, my DUT can receive 8,400,000 UDP packets
per second targeting closed ports, using 50% less cpu cycles
than before.

Also change two calls to l3mdev_master_ifindex() by
l3mdev_master_ifindex_rcu() (Ido ideas)

Fixes: 8c2bd38b95 ("icmp: change the order of rate limits")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241215175629.1248773-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-17 19:37:48 -08:00
Eric Dumazet
661cd8fc8e inetpeer: remove create argument of inet_getpeer_v[46]()
All callers of inet_getpeer_v4() and inet_getpeer_v6()
want to create an inetpeer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241215175629.1248773-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-17 19:37:00 -08:00
Anna Emese Nyiri
a32f3e9d1e sock: support SO_PRIORITY cmsg
The Linux socket API currently allows setting SO_PRIORITY at the
socket level, applying a uniform priority to all packets sent through
that socket. The exception to this is IP_TOS, when the priority value
is calculated during the handling of
ancillary data, as implemented in commit f02db315b8 ("ipv4: IP_TOS
and IP_TTL can be specified as ancillary data").
However, this is a computed
value, and there is currently no mechanism to set a custom priority
via control messages prior to this patch.

According to this patch, if SO_PRIORITY is specified as ancillary data,
the packet is sent with the priority value set through
sockc->priority, overriding the socket-level values
set via the traditional setsockopt() method. This is analogous to
the existing support for SO_MARK, as implemented in
commit c6af0c227a ("ip: support SO_MARK cmsg").

If both cmsg SO_PRIORITY and IP_TOS are passed, then the one that
takes precedence is the last one in the cmsg list.

This patch has the side effect that raw_send_hdrinc now interprets cmsg
IP_TOS.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Suggested-by: Ferenc Fejes <fejes@inf.elte.hu>
Signed-off-by: Anna Emese Nyiri <annaemesenyiri@gmail.com>
Link: https://patch.msgid.link/20241213084457.45120-3-annaemesenyiri@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-16 18:13:44 -08:00
Yuyang Huang
2c2b61d213 netlink: add IGMP/MLD join/leave notifications
This change introduces netlink notifications for multicast address
changes. The following features are included:
* Addition and deletion of multicast addresses are reported using
  RTM_NEWMULTICAST and RTM_DELMULTICAST messages with AF_INET and
  AF_INET6.
* Two new notification groups: RTNLGRP_IPV4_MCADDR and
  RTNLGRP_IPV6_MCADDR are introduced for receiving these events.

This change allows user space applications (e.g., ip monitor) to
efficiently track multicast group memberships by listening for netlink
events. Previously, applications relied on inefficient polling of
procfs, introducing delays. With netlink notifications, applications
receive realtime updates on multicast group membership changes,
enabling more precise metrics collection and system monitoring. 

This change also unlocks the potential for implementing a wide range
of sophisticated multicast related features in user space by allowing
applications to combine kernel provided multicast address information
with user space data and communicate decisions back to the kernel for
more fine grained control. This mechanism can be used for various
purposes, including multicast filtering, IGMP/MLD offload, and
IGMP/MLD snooping.

Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Co-developed-by: Patrick Ruddy <pruddy@vyatta.att-mail.com>
Signed-off-by: Patrick Ruddy <pruddy@vyatta.att-mail.com>
Link: https://lore.kernel.org/r/20180906091056.21109-1-pruddy@vyatta.att-mail.com
Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-12-15 12:31:35 +00:00
Eric Dumazet
00bf2032e9 ipv6: mcast: annotate data-race around psf->sf_count[MCAST_XXX]
psf->sf_count[MCAST_XXX] fields are read locklessly from
ipv6_chk_mcast_addr() and igmp6_mcf_seq_show().

Add READ_ONCE() and WRITE_ONCE() annotations accordingly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241210183352.86530-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-11 20:15:29 -08:00
Eric Dumazet
626962911a ipv6: mcast: annotate data-races around mc->mca_sfcount[MCAST_EXCLUDE]
mc->mca_sfcount[MCAST_EXCLUDE] is read locklessly from
ipv6_chk_mcast_addr().

Add READ_ONCE() and WRITE_ONCE() annotations accordingly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241210183352.86530-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-11 20:15:29 -08:00
Eric Dumazet
d51cfd5f4f ipv6: mcast: reduce ipv6_chk_mcast_addr() indentation
Add a label and two gotos to shorten lines by two tabulations,
to ease code review of following patches.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241210183352.86530-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-11 20:15:28 -08:00
Jakub Kicinski
302cc446cb Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.13-rc2).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05 11:50:14 -08:00
Justin Iurman
985ec6f5e6 net: ipv6: rpl_iptunnel: mitigate 2-realloc issue
This patch mitigates the two-reallocations issue with rpl_iptunnel by
providing the dst_entry (in the cache) to the first call to
skb_cow_head(). As a result, the very first iteration would still
trigger two reallocations (i.e., empty cache), while next iterations
would only trigger a single reallocation.

Performance tests before/after applying this patch, which clearly shows
there is no impact (it even shows improvement):
- before: https://ibb.co/nQJhqwc
- after: https://ibb.co/4ZvW6wV

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Cc: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-05 11:15:56 +01:00
Justin Iurman
40475b6376 net: ipv6: seg6_iptunnel: mitigate 2-realloc issue
This patch mitigates the two-reallocations issue with seg6_iptunnel by
providing the dst_entry (in the cache) to the first call to
skb_cow_head(). As a result, the very first iteration would still
trigger two reallocations (i.e., empty cache), while next iterations
would only trigger a single reallocation.

Performance tests before/after applying this patch, which clearly shows
the improvement:
- before: https://ibb.co/3Cg4sNH
- after: https://ibb.co/8rQ350r

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Cc: David Lebrun <dlebrun@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-05 11:15:56 +01:00
Justin Iurman
dce525185b net: ipv6: ioam6_iptunnel: mitigate 2-realloc issue
This patch mitigates the two-reallocations issue with ioam6_iptunnel by
providing the dst_entry (in the cache) to the first call to
skb_cow_head(). As a result, the very first iteration may still trigger
two reallocations (i.e., empty cache), while next iterations would only
trigger a single reallocation.

Performance tests before/after applying this patch, which clearly shows
the improvement:
- inline mode:
  - before: https://ibb.co/LhQ8V63
  - after: https://ibb.co/x5YT2bS
- encap mode:
  - before: https://ibb.co/3Cjm5m0
  - after: https://ibb.co/TwpsxTC
- encap mode with tunsrc:
  - before: https://ibb.co/Gpy9QPg
  - after: https://ibb.co/PW1bZFT

This patch also fixes an incorrect behavior: after the insertion, the
second call to skb_cow_head() makes sure that the dev has enough
headroom in the skb for layer 2 and stuff. In that case, the "old"
dst_entry was used, which is now fixed. After discussing with Paolo, it
appears that both patches can be merged into a single one -this one-
(for the sake of readability) and target net-next.

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-05 11:15:56 +01:00