Since commit 0e23387491 ("ipv6: fix races in ip6_dst_destroy()"),
'table' is unused in __fib6_drop_pcpu_from(), no need pass it from
fib6_drop_pcpu_from().
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250701041235.1333687-1-yuehaibing@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
change 'Maximium' to 'Maximum'
Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://patch.msgid.link/20250702055820.112190-1-zhaochenguang@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
MSG_ZEROCOPY data must be copied before data is queued to local
sockets, to avoid indefinite timeout until memory release.
This test is performed by skb_orphan_frags_rx, which is called when
looping an egress skb to packet sockets, error queue or ingress path.
To preserve zerocopy for skbs that are looped to ingress but are then
forwarded to an egress device rather than delivered locally, defer
this last check until an skb enters the local IP receive path.
This is analogous to existing behavior of skb_clear_delivery_time.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630194312.1571410-2-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Both functions are always called under RCU.
We remove the extra rcu_read_lock()/rcu_read_unlock().
We use skb_dst_dev_net_rcu() instead of skb_dst_dev_net().
We use dev_net_rcu() instead of dev_net().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-11-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the new helpers as a step to deal with potential dst->dev races.
v2: fix typo in ipv6_rthdr_rcv() (kernel test robot <lkp@intel.com>)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-10-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the new helper as a step to deal with potential dst->dev races.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(dst_entry)->lastuse is read and written locklessly,
add corresponding annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(dst_entry)->expires is read and written locklessly,
add corresponding annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
____cacheline_aligned_in_smp attribute only makes sure to align
a field to a cache line. It does not prevent the linker to use
the remaining of the cache line for other variables, causing
potential false sharing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
____cacheline_aligned_in_smp attribute only makes sure to align
a field to a cache line. It does not prevent the linker to use
the remaining of the cache line for other variables, causing
potential false sharing.
Move tcp_memory_allocated into a dedicated cache line.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The referenced commit replaced a call to __xfrm4|6_udp_encap_rcv() with
a custom check for non-ESP markers. But what the called function also
did was setting the transport header to the ESP header. The function
that follows, esp4|6_gro_receive(), relies on that being set when it calls
xfrm_parse_spi(). We have to set the full offset as the skb's head was
not moved yet so adding just the UDP header length won't work.
Fixes: e3fd057776 ("xfrm: Fix UDP GRO handling for some corner cases")
Signed-off-by: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Fix a typo:
lenghts -> length
The typo has been identified using codespell, and the tool currently
does not report any additional issues in comments.
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250629171226.4988-2-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is possible via the ioctl API:
> ip -6 tunnel change ip6tnl0 mode any
Let's align the netlink API:
> ip link set ip6tnl0 type ip6tnl mode any
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://patch.msgid.link/20250630145602.1027220-1-nicolas.dichtel@6wind.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now inet_rtx_syn_ack() is only used by TCP, it can directly
call tcp_rtx_synack() instead of using an indirect call
to req->rsk_ops->rtx_syn_ack().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250626153017.2156274-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
lwtunnel_fill_encap() has NULL check and return 0, so no need
to check before call it.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250624140015.3929241-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Difference between sock_i_uid() and sk_uid() is that
after sock_orphan(), sock_i_uid() returns GLOBAL_ROOT_UID
while sk_uid() returns the last cached sk->sk_uid value.
None of sock_i_uid() callers care about this.
Use sk_uid() which is much faster and inlined.
Note that diag/dump users are calling sock_i_ino() and
can not see the full benefit yet.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Link: https://patch.msgid.link/20250620133001.4090592-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sk->sk_uid can be read while another thread changes its
value in sockfs_setattr().
Add sk_uid(const struct sock *sk) helper to factorize the needed
READ_ONCE() annotations, and add corresponding WRITE_ONCE()
where needed.
Fixes: 86741ec254 ("net: core: Add a UID field to struct sock.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Link: https://patch.msgid.link/20250620133001.4090592-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit 3e6a0243ff ("gre: Fix again IPv6 link-local address
generation."), addrconf_gre_config() has stopped handling IP6GRE
devices specially and just calls the regular addrconf_addr_gen()
function to create their link-local IPv6 addresses.
We can thus avoid using addrconf_gre_config() for IP6GRE devices and
use the normal IPv6 initialisation path instead (that is, jump directly
to addrconf_dev_config() in addrconf_init_auto_addrs()).
See commit 3e6a0243ff ("gre: Fix again IPv6 link-local address
generation.") for a deeper explanation on how and why GRE devices
started handling their IPv6 link-local address generation specially,
why it was a problem, and why this is not even necessary in most cases
(especially for GRE over IPv6).
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/a9144be9c7ec3cf09f25becae5e8fdf141fde9f6.1750075076.git.gnault@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Multicast routing is today handled in the input path. Locally generated MC
packets don't hit the IPMR code today. Thus if a VXLAN remote address is
multicast, the driver needs to set an OIF during route lookup. Thus MC
routing configuration needs to be kept in sync with the VXLAN FDB and MDB.
Ideally, the VXLAN packets would be routed by the MC routing code instead.
To that end, this patch adds support to route locally generated multicast
packets. The newly-added routines do largely what ip6_mr_input() and
ip6_mr_forward() do: make an MR cache lookup to find where to send the
packets, and use ip6_output() to send each of them. When no cache entry is
found, the packet is punted to the daemon for resolution.
Similarly to the IPv4 case in a previous patch, the new logic is contingent
on a newly-added IP6CB flag being set.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/3bcc034a3ab4d3c291072fff38f78d7fbbeef4e6.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some of the work of ip6mr_forward2() is specific to IPMR forwarding, and
should not take place on the output path. In order to allow reuse of the
common parts, extract out of the function a helper,
ip6mr_prepare_forward().
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/8932bd5c0fbe3f662b158803b8509604fa7bc113.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The netfilter hook is invoked with skb->dev for input netdevice, and
vif_dev for output netdevice. However at the point of invocation, skb->dev
is already set to vif_dev, and MR-forwarded packets are reported with
in=out:
# ip6tables -A FORWARD -j LOG --log-prefix '[forw]'
# cd tools/testing/selftests/net/forwarding
# ./router_multicast.sh
# dmesg | fgrep '[forw]'
[ 1670.248245] [forw]IN=v5 OUT=v5 [...]
For reference, IPv4 MR code shows in and out as appropriate.
Fix by caching skb->dev and using the updated value for output netdev.
Fixes: 7bc570c8b4 ("[IPV6] MROUTE: Support multicast forwarding.")
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/3141ae8386fbe13fef4b793faa75e6bae58d798a.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ip6tunnel_xmit() erases the contents of the SKB control block. In order to
be able to set particular IP6CB flags on the SKB, add a corresponding
parameter, and propagate it to udp_tunnel6_xmit_skb() as well.
In one of the following patches, VXLAN driver will use this facility to
mark packets as subject to IPv6 multicast routing.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/acb4f9f3e40c3a931236c3af08a720b017fbfbfb.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The function always returns zero, thus the return value does not carry any
signal. Just make it void.
Most callers already ignore the return value. However:
- Refold arguments of the call from sctp_v6_xmit() so that they fit into
the 80-column limit.
- tipc_udp_xmit() initializes err from the return value, but that should
already be always zero at that point. So there's no practical change, but
elision of the assignment prompts a couple more tweaks to clean up the
function.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/7facacf9d8ca3ca9391a4aee88160913671b868d.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
iptunnel_xmit() erases the contents of the SKB control block. In order to
be able to set particular IPCB flags on the SKB, add a corresponding
parameter, and propagate it to udp_tunnel_xmit_skb() as well.
In one of the following patches, VXLAN driver will use this facility to
mark packets as subject to IP multicast routing.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Acked-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/89c9daf9f2dc088b6b92ccebcc929f51742de91f.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Extend the End.X behavior to accept an output interface as an optional
attribute and make use of it when resolving a route. This is needed when
user space wants to use a link-local address as the nexthop address.
Before:
# ip route add 2001:db8:1::/64 encap seg6local action End.X nh6 fe80::1 oif eth0 dev sr6
# ip route add 2001:db8:2::/64 encap seg6local action End.X nh6 2001:db8:10::1 dev sr6
$ ip -6 route show
2001:db8:1::/64 encap seg6local action End.X nh6 fe80::1 dev sr6 metric 1024 pref medium
2001:db8:2::/64 encap seg6local action End.X nh6 2001:db8:10::1 dev sr6 metric 1024 pref medium
After:
# ip route add 2001:db8:1::/64 encap seg6local action End.X nh6 fe80::1 oif eth0 dev sr6
# ip route add 2001:db8:2::/64 encap seg6local action End.X nh6 2001:db8:10::1 dev sr6
$ ip -6 route show
2001:db8:1::/64 encap seg6local action End.X nh6 fe80::1 oif eth0 dev sr6 metric 1024 pref medium
2001:db8:2::/64 encap seg6local action End.X nh6 2001:db8:10::1 dev sr6 metric 1024 pref medium
Note that the oif attribute is not dumped to user space when it was not
specified (as an oif of 0) since each entry keeps track of the optional
attributes that it parsed during configuration (see struct
seg6_local_lwt::parsed_optattrs).
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Link: https://patch.msgid.link/20250612122323.584113-4-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
seg6_lookup_nexthop() is a wrapper around seg6_lookup_any_nexthop().
Change End.X behavior to invoke seg6_lookup_any_nexthop() directly so
that we would not need to expose the new output interface argument
outside of the seg6local module.
No functional changes intended.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Link: https://patch.msgid.link/20250612122323.584113-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
seg6_lookup_any_nexthop() is called by the different endpoint behaviors
(e.g., End, End.X) to resolve an IPv6 route. Extend the function with an
output interface argument so that it could be used to resolve a route
with a certain output interface. This will be used by subsequent patches
that will extend the End.X behavior with an output interface as an
optional argument.
ip6_route_input_lookup() cannot be used when an output interface is
specified as it ignores this parameter. Similarly, calling
ip6_pol_route() when a table ID was not specified (e.g., End.X behavior)
is wrong.
Therefore, when an output interface is specified without a table ID,
resolve the route using ip6_route_output() which will take the output
interface into account.
Note that no endpoint behavior currently passes both a table ID and an
output interface, so the oif argument passed to ip6_pol_route() is
always zero and there are no functional changes in this regard.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Link: https://patch.msgid.link/20250612122323.584113-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current release - regressions:
- af_unix: allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD
Current release - new code bugs:
- eth: airoha: correct enable mask for RX queues 16-31
- veth: prevent NULL pointer dereference in veth_xdp_rcv when peer
disappears under traffic
- ipv6: move fib6_config_validate() to ip6_route_add(), prevent invalid
routes
Previous releases - regressions:
- phy: phy_caps: don't skip better duplex match on non-exact match
- dsa: b53: fix untagged traffic sent via cpu tagged with VID 0
- Revert "wifi: mwifiex: Fix HT40 bandwidth issue.", it caused transient
packet loss, exact reason not fully understood, yet
Previous releases - always broken:
- net: clear the dst when BPF is changing skb protocol (IPv4 <> IPv6)
- sched: sfq: fix a potential crash on gso_skb handling
- Bluetooth: intel: improve rx buffer posting to avoid causing issues
in the firmware
- eth: intel: i40e: make reset handling robust against multiple requests
- eth: mlx5: ensure FW pages are always allocated on the local NUMA
node, even when device is configure to 'serve' another node
- wifi: ath12k: fix GCC_GCC_PCIE_HOT_RST definition for WCN7850,
prevent kernel crashes
- wifi: ath11k: avoid burning CPU in ath11k_debugfs_fw_stats_request()
for 3 sec if fw_stats_done is not set
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmhK/3IACgkQMUZtbf5S
IruE5A//RdwiBW/pqoMIiRKLA3HZeUA/beYOl4DwVf8WFQNUIqdboeAi6k4yFrS+
SykKN0s1z8fW45lA46iFv3sR0QKYGln/v/cANsqojYqKBD3PF42dRifFlEAIz2M5
fnXK1VHPJOFK/OBOyKiiW3R6mFv+v9epZM8BKED77vFy7osDV2zkObePeE8/34B7
yVAr6JNTpB5Ex4ziG+e/6tFF6IX9RJLBl4fkRRynLDSsb1NFuy39LxPsxRQPxnzo
tlfHfxEFl5qDNGondUoSxmp38HoO6MRofWp1d1GZoBbTXi0gXV26I5WaaBHBqPkm
jZ7AtIMQq2+JuEg0y4dFFRehZLwLEMuhvlbacbIOKNBngVIsploBzvbG3ntWuUa4
Z5VFayQXumsHB5g7+vEFK6vCPaIpatKt419JsFXogNvVmmQzghALFlSymm/WbyGL
Bj3R448xGDJw+2zDAXSH/nMMHkRaQd2Ptj2czvJ0Y7Fj8bxJgH0okaHOBrk9RQTQ
bdUGCiMY84p6WI7rKDkFyyohMxppdYsY8A9hSdGgpqvu7dZi5yGmzz1Sp9+uSfSF
Lj61am4LSvRsIuTP5cdqmTBn3mZS5R49hvJsFddgXRhF+Y9gB7LSm0sypZbuOEKD
m9ijKcNETglzer0iMCwAVrIbDHGjqqHS74DkRzsuPsQ8kaCjsno=
=0mtm
-----END PGP SIGNATURE-----
Merge tag 'net-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth and wireless.
Current release - regressions:
- af_unix: allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD
Current release - new code bugs:
- eth: airoha: correct enable mask for RX queues 16-31
- veth: prevent NULL pointer dereference in veth_xdp_rcv when peer
disappears under traffic
- ipv6: move fib6_config_validate() to ip6_route_add(), prevent
invalid routes
Previous releases - regressions:
- phy: phy_caps: don't skip better duplex match on non-exact match
- dsa: b53: fix untagged traffic sent via cpu tagged with VID 0
- Revert "wifi: mwifiex: Fix HT40 bandwidth issue.", it caused
transient packet loss, exact reason not fully understood, yet
Previous releases - always broken:
- net: clear the dst when BPF is changing skb protocol (IPv4 <> IPv6)
- sched: sfq: fix a potential crash on gso_skb handling
- Bluetooth: intel: improve rx buffer posting to avoid causing issues
in the firmware
- eth: intel: i40e: make reset handling robust against multiple
requests
- eth: mlx5: ensure FW pages are always allocated on the local NUMA
node, even when device is configure to 'serve' another node
- wifi: ath12k: fix GCC_GCC_PCIE_HOT_RST definition for WCN7850,
prevent kernel crashes
- wifi: ath11k: avoid burning CPU in ath11k_debugfs_fw_stats_request()
for 3 sec if fw_stats_done is not set"
* tag 'net-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits)
selftests: drv-net: rss_ctx: Add test for ntuple rules targeting default RSS context
net: ethtool: Don't check if RSS context exists in case of context 0
af_unix: Allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD.
ipv6: Move fib6_config_validate() to ip6_route_add().
net: drv: netdevsim: don't napi_complete() from netpoll
net/mlx5: HWS, Add error checking to hws_bwc_rule_complex_hash_node_get()
veth: prevent NULL pointer dereference in veth_xdp_rcv
net_sched: remove qdisc_tree_flush_backlog()
net_sched: ets: fix a race in ets_qdisc_change()
net_sched: tbf: fix a race in tbf_change()
net_sched: red: fix a race in __red_change()
net_sched: prio: fix a race in prio_tune()
net_sched: sch_sfq: reject invalid perturb period
net: phy: phy_caps: Don't skip better duplex macth on non-exact match
MAINTAINERS: Update Kuniyuki Iwashima's email address.
selftests: net: add test case for NAT46 looping back dst
net: clear the dst when changing skb protocol
net/mlx5e: Fix number of lanes to UNKNOWN when using data_rate_oper
net/mlx5e: Fix leak of Geneve TLV option object
net/mlx5: HWS, make sure the uplink is the last destination
...
Move this API to the canonical timer_*() namespace.
[ tglx: Redone against pre rc1 ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
The kernel currently validates that the length of the provided nexthop
address does not exceed the specified length. This can lead to the
kernel reading uninitialized memory if user space provided a shorter
length than the specified one.
Fix by validating that the provided length exactly matches the specified
one.
Fixes: d1df6fd8a1 ("ipv6: sr: define core operations for seg6local lightweight tunnel")
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250604113252.371528-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
During ILA address translations, the L4 checksums can be handled in
different ways. One of them, adj-transport, consist in parsing the
transport layer and updating any found checksum. This logic relies on
inet_proto_csum_replace_by_diff and produces an incorrect skb->csum when
in state CHECKSUM_COMPLETE.
This bug can be reproduced with a simple ILA to SIR mapping, assuming
packets are received with CHECKSUM_COMPLETE:
$ ip a show dev eth0
14: eth0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 62:ae:35:9e:0f:8d brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 3333:0:0:1::c078/64 scope global
valid_lft forever preferred_lft forever
inet6 fd00:10:244:1::c078/128 scope global nodad
valid_lft forever preferred_lft forever
inet6 fe80::60ae:35ff:fe9e:f8d/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
$ ip ila add loc_match fd00:10:244:1 loc 3333:0:0:1 \
csum-mode adj-transport ident-type luid dev eth0
Then I hit [fd00:10:244:1::c078]:8000 with a server listening only on
[3333:0:0:1::c078]:8000. With the bug, the SYN packet is dropped with
SKB_DROP_REASON_TCP_CSUM after inet_proto_csum_replace_by_diff changed
skb->csum. The translation and drop are visible on pwru [1] traces:
IFACE TUPLE FUNC
eth0:9 [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp) ipv6_rcv
eth0:9 [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp) ip6_rcv_core
eth0:9 [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp) nf_hook_slow
eth0:9 [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp) inet_proto_csum_replace_by_diff
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) tcp_v6_early_demux
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) ip6_route_input
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) ip6_input
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) ip6_input_finish
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) ip6_protocol_deliver_rcu
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) raw6_local_deliver
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) ipv6_raw_deliver
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) tcp_v6_rcv
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) __skb_checksum_complete
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) kfree_skb_reason(SKB_DROP_REASON_TCP_CSUM)
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) skb_release_head_state
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) skb_release_data
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) skb_free_head
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) kfree_skbmem
This is happening because inet_proto_csum_replace_by_diff is updating
skb->csum when it shouldn't. The L4 checksum is updated such that it
"cancels" the IPv6 address change in terms of checksum computation, so
the impact on skb->csum is null.
Note this would be different for an IPv4 packet since three fields
would be updated: the IPv4 address, the IP checksum, and the L4
checksum. Two would cancel each other and skb->csum would still need
to be updated to take the L4 checksum change into account.
This patch fixes it by passing an ipv6 flag to
inet_proto_csum_replace_by_diff, to skip the skb->csum update if we're
in the IPv6 case. Note the behavior of the only other user of
inet_proto_csum_replace_by_diff, the BPF subsystem, is left as is in
this patch and fixed in the subsequent patch.
With the fix, using the reproduction from above, I can confirm
skb->csum is not touched by inet_proto_csum_replace_by_diff and the TCP
SYN proceeds to the application after the ILA translation.
Link: https://github.com/cilium/pwru [1]
Fixes: 65d7ab8de5 ("net: Identifier Locator Addressing module")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/b5539869e3550d46068504feb02d37653d939c0b.1748509484.git.paul.chaignon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmgwd00ACgkQ1w0aZmrP
KyEfwA//RXQ3i8PCa7lKHxDRhVzG3rEvgXRmiXeNd+JjzsCnybBb7+wRf3dtBGWT
+1s44Utx1JqosWxCVBulqYC5bqSC66789l5X2jhYJmUZxRrbcsqPngwnIrjb/XeK
ZJM62wiRhkBQED7yZLGy+y4VHQiG8CEMt16AOQHk863aruWv1tT7up90CTtzA545
4GF/grU3FC0PsoTLwzWyvqsWK+9uk3Y4Tifp5hU3w6uRD9EjX5tHCZlXXSqOF5gu
KT26OYsePYXhJVZIwDf2oVLGi0EVTPB9IFxZSNgLqyXqu2ILAb9OwRNVTNfTP7Pg
1RWJWmgqvRNs9OM2ecifYgQf/AfvCL0Cja1BJOjmvtICuGegrYH7G5YYQsMl9CoE
7jBoTzpToSASat5+dwoz81Bvzh447dYxRE2VmbxmRTTWToQYS1KGBPc9e3u/n5Rr
ruh8tRZ3/R0Fy+YLDkrJst3grh5RLITbuyu4ElJMArPU50mLTVYxKd6nA3BqwB5G
1GmLfCzvQH3e6PKz6CNke1AytVDy/wLTXtcbLnze2Muaj4AqhtOe5Q8ypnOO0Vyk
PsJ6U3rm2asd3GE9+AIx8gZBv8yCu1w9CiwLK8ybT2NETb2dEnqPgWeDyT7rpcaD
sQOPsBE1q/TEp9gofbYCHBm5E2mX9UP7Q6EHCTekrI97xLq8Q2M=
=fBhd
-----END PGP SIGNATURE-----
Merge tag 'nf-next-25-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following batch contains Netfilter updates for net-next,
specifically 26 patches: 5 patches adding/updating selftests,
4 fixes, 3 PREEMPT_RT fixes, and 14 patches to enhance nf_tables):
1) Improve selftest coverage for pipapo 4 bit group format, from
Florian Westphal.
2) Fix incorrect dependencies when compiling a kernel without
legacy ip{6}tables support, also from Florian.
3) Two patches to fix nft_fib vrf issues, including selftest updates
to improve coverage, also from Florian Westphal.
4) Fix incorrect nesting in nft_tunnel's GENEVE support, from
Fernando F. Mancera.
5) Three patches to fix PREEMPT_RT issues with nf_dup infrastructure
and nft_inner to match in inner headers, from Sebastian Andrzej Siewior.
6) Integrate conntrack information into nft trace infrastructure,
from Florian Westphal.
7) A series of 13 patches to allow to specify wildcard netdevice in
netdev basechain and flowtables, eg.
table netdev filter {
chain ingress {
type filter hook ingress devices = { eth0, eth1, vlan* } priority 0; policy accept;
}
}
This also allows for runtime hook registration on NETDEV_{UN}REGISTER
event, from Phil Sutter.
netfilter pull request 25-05-23
* tag 'nf-next-25-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: (26 commits)
selftests: netfilter: Torture nftables netdev hooks
netfilter: nf_tables: Add notifications for hook changes
netfilter: nf_tables: Support wildcard netdev hook specs
netfilter: nf_tables: Sort labels in nft_netdev_hook_alloc()
netfilter: nf_tables: Handle NETDEV_CHANGENAME events
netfilter: nf_tables: Wrap netdev notifiers
netfilter: nf_tables: Respect NETDEV_REGISTER events
netfilter: nf_tables: Prepare for handling NETDEV_REGISTER events
netfilter: nf_tables: Have a list of nf_hook_ops in nft_hook
netfilter: nf_tables: Pass nf_hook_ops to nft_unregister_flowtable_hook()
netfilter: nf_tables: Introduce nft_register_flowtable_ops()
netfilter: nf_tables: Introduce nft_hook_find_ops{,_rcu}()
netfilter: nf_tables: Introduce functions freeing nft_hook objects
netfilter: nf_tables: add packets conntrack state to debug trace info
netfilter: conntrack: make nf_conntrack_id callable without a module dependency
netfilter: nf_dup_netdev: Move the recursion counter struct netdev_xmit
netfilter: nft_inner: Use nested-BH locking for nft_pcpu_tun_ctx
netfilter: nf_dup{4, 6}: Move duplication check to task_struct
netfilter: nft_tunnel: fix geneve_opt dump
selftests: netfilter: nft_fib.sh: add type and oif tests with and without VRFs
...
====================
Link: https://patch.msgid.link/20250523132712.458507-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
nf_skb_duplicated is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.
Due to the recursion involved, the simplest change is to make it a
per-task variable.
Move the per-CPU variable nf_skb_duplicated to task_struct and name it
in_nf_duplicate. Add it to the existing bitfield so it doesn't use
additional memory.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
fib has two modes:
1. Obtain output device according to source or destination address
2. Obtain the type of the address, e.g. local, unicast, multicast.
'fib daddr type' should return 'local' if the address is configured
in this netns or unicast otherwise.
'fib daddr . iif type' should return 'local' if the address is configured
on the input interface or unicast otherwise, i.e. more restrictive.
However, if the interface is part of a VRF, then 'fib daddr type'
returns unicast even if the address is configured on the incoming
interface.
This is broken for both ipv4 and ipv6.
In the ipv4 case, inet_dev_addr_type must only be used if the
'iif' or 'oif' (strict mode) was requested.
Else inet_addr_type_dev_table() needs to be used and the correct
dev argument must be passed as well so the correct fib (vrf) table
is used.
In the ipv6 case, the bug is similar, without strict mode, dev is NULL
so .flowi6_l3mdev will be set to 0.
Add a new 'nft_fib_l3mdev_master_ifindex_rcu()' helper and use that
to init the .l3mdev structure member.
For ipv6, use it from nft_fib6_flowi_init() which gets called from
both the 'type' and the 'route' mode eval functions.
This provides consistent behaviour for all modes for both ipv4 and ipv6:
If strict matching is requested, the input respectively output device
of the netfilter hooks is used.
Otherwise, use skb->dev to obtain the l3mdev ifindex.
Without this, most type checks in updated nft_fib.sh selftest fail:
FAIL: did not find veth0 . 10.9.9.1 . local in fibtype4
FAIL: did not find veth0 . dead:1::1 . local in fibtype6
FAIL: did not find veth0 . dead:9::1 . local in fibtype6
FAIL: did not find tvrf . 10.0.1.1 . local in fibtype4
FAIL: did not find tvrf . 10.9.9.1 . local in fibtype4
FAIL: did not find tvrf . dead:1::1 . local in fibtype6
FAIL: did not find tvrf . dead:9::1 . local in fibtype6
FAIL: fib expression address types match (iif in vrf)
(fib errounously returns 'unicast' for all of them, even
though all of these addresses are local to the vrf).
Fixes: f6d0cbcf09 ("netfilter: nf_tables: add fib expression")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Cross-merge networking fixes after downstream PR (net-6.15-rc8).
Conflicts:
80f2ab46c2 ("irdma: free iwdev->rf after removing MSI-X")
4bcc063939 ("ice, irdma: fix an off by one in error handling code")
c24a65b6a2 ("iidc/ice/irdma: Update IDC to support multiple consumers")
https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au
No extra adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With a VRF, ipv4 and ipv6 FIB expression behave differently.
fib daddr . iif oif
Will return the input interface name for ipv4, but the real device
for ipv6. Example:
If VRF device name is tvrf and real (incoming) device is veth0.
First round is ok, both ipv4 and ipv6 will yield 'veth0'.
But in the second round (incoming device will be set to "tvrf"), ipv4
will yield "tvrf" whereas ipv6 returns "veth0" for the second round too.
This makes ipv6 behave like ipv4.
A followup patch will add a test case for this, without this change
it will fail with:
get element inet t fibif6iif { tvrf . dead:1::99 . tvrf }
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FAIL: did not find tvrf . dead:1::99 . tvrf in fibif6iif
Alternatively we could either not do anything at all or change
ipv4 to also return the lower/real device, however, nft (userspace)
doc says "iif: if fib lookup provides a route then check its output
interface is identical to the packets input interface." which is what
the nft fib ipv4 behaviour is.
Fixes: f6d0cbcf09 ("netfilter: nf_tables: add fib expression")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmgtYgEACgkQrB3Eaf9P
W7e1ag//da84UIRyJwMfDO4Y3MXDNPslNSDuq0HuvwRtdLIBLFtwitSzU1uhKsxY
yn5v7RSsxvp6lXW2RT+Ycor2qZ/mGHJsHcVfG7m0YjxH6unw7yzjqn5LNNzRbYN4
NcD8P0skuX6d80EFPUB3Hsnmdj1VKR62OsWyk3rAPb4CLBVKJt9OsseVfN4bn1R0
TaZSIkdh5EDGYXTBKb49jc8LFfQo7+uVg/AjtZ/2ZsWt+Qgw3XevTIcwLokH00rt
GzXcLjC1g+b6TeVncOuD1oiNJUtQVGYV23t2yQlk9k2HFzCdNnq0YM9pzawwiI+l
icBV2X/QFjhdCRkvJRF4dkXq/4tnnEmYoY/1vSOoWR9VmY2u8Lr3VRiDD/h0gYJT
KXd8YPMtZLDnLgmH+DwWbv4vdLtHvQTmB8XFzb/4VN6Ikucenry3loJsUsLnS+Je
t1/7unLrg9yyJC6UPzweqjAx+6VgZvem/M5kejIVxHpk+Wg2dXGZ2jz4fsVuZYPB
dMLj1h1MLn4gOt2b/bdI2do0C+p2R1axrTNw+RiqwCrb1h5Ey+7RAhWyXyaHUEs3
1brMAgOcvdbaaeSIpoHJ8eJx/PgRxDrxRnUC3HjCGPNApYQXC3FM3POk7wwJ9C0i
odlHrq+yOdzLCZyU+YKdR1q3kPq9AWpUSmc4Olg359OQ9IxDGQw=
=bgyq
-----END PGP SIGNATURE-----
Merge tag 'ipsec-2025-05-21' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2025-05-21
1) Fix some missing kfree_skb in the error paths of espintcp.
From Sabrina Dubroca.
2) Fix a reference leak in espintcp.
From Sabrina Dubroca.
3) Fix UDP GRO handling for ESPINUDP.
From Tobias Brunner.
4) Fix ipcomp truesize computation on the receive path.
From Sabrina Dubroca.
5) Sanitize marks before policy/state insertation.
From Paul Chaignon.
* tag 'ipsec-2025-05-21' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: Sanitize marks before insert
xfrm: ipcomp: fix truesize computation on receive
xfrm: Fix UDP GRO handling for some corner cases
espintcp: remove encap socket caching to avoid reference leak
espintcp: fix skb leaks
====================
Link: https://patch.msgid.link/20250521054348.4057269-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
These two commits preallocated two per-cpu variables in
ip6_route_info_create() as fib_nh_common_init() and fib6_nh_init()
were expected to be called under RCU.
* commit d27b9c40db ("ipv6: Preallocate nhc_pcpu_rth_output in
ip6_route_info_create().")
* commit 5720a328c3 ("ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in
ip6_route_info_create().")
Now these functions can be called without RCU and can use GFP_KERNEL.
Let's revert the commits.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit c4837b9853 ("ipv6: Split ip6_route_info_create()."),
ip6_route_info_create_nh() uses GFP_ATOMIC as it was expected to be
called under RCU.
Now, we can call it without RCU and use GFP_KERNEL.
Let's pass gfp_flags to ip6_route_info_create_nh().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 71c0efb6d1 ("ipv6: Factorise ip6_route_multipath_add().") split
a loop in ip6_route_multipath_add() so that we can put rcu_read_lock()
between ip6_route_info_create() and ip6_route_info_create_nh().
We no longer need to do so as ip6_route_info_create_nh() does not require
RCU now.
Let's revert the commit to simplify the code.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The previous patch fixed the same issue mentioned in
commit 14a0087e72 ("ipv6: sr: switch to GFP_ATOMIC
flag to allocate memory during seg6local LWT setup").
Let's revert it.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Link: https://patch.msgid.link/20250516022759.44392-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 169fd62799 ("ipv6: Get rid of RTNL for SIOCADDRT and
RTM_NEWROUTE.") added rcu_read_lock() covering
ip6_route_info_create_nh() and __ip6_ins_rt() to guarantee that
nexthop and netdev will not go away.
However, as reported by syzkaller [0], ip_tun_build_state() calls
dst_cache_init() with GFP_KERNEL during the RCU critical section.
ip6_route_info_create_nh() fetches nexthop or netdev depending on
whether RTA_NH_ID is set, and struct fib6_info holds a refcount
of either of them by nexthop_get() or netdev_get_by_index().
netdev_get_by_index() looks up a dev and calls dev_hold() under RCU.
So, we need RCU only around nexthop_find_by_id() and nexthop_get()
( and a few more nexthop code).
Let's add rcu_read_lock() there and remove rcu_read_lock() in
ip6_route_add() and ip6_route_multipath_add().
Now these functions called from fib6_add() need RCU:
- inet6_rt_notify()
- fib6_drop_pcpu_from() (via fib6_purge_rt())
- rt6_flush_exceptions() (via fib6_purge_rt())
- ip6_ignore_linkdown() (via rt6_multipath_rebalance())
All callers of inet6_rt_notify() need RCU, so rcu_read_lock() is
added there.
[0]:
[ BUG: Invalid wait context ]
6.15.0-rc4-syzkaller-00746-g836b313a14a3 #0 Tainted: G W
----------------------------
syz-executor234/5832 is trying to lock:
ffffffff8e021688 (pcpu_alloc_mutex){+.+.}-{4:4}, at:
pcpu_alloc_noprof+0x284/0x16b0 mm/percpu.c:1782
other info that might help us debug this:
context-{5:5}
1 lock held by syz-executor234/5832:
0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire
include/linux/rcupdate.h:331 [inline]
0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock
include/linux/rcupdate.h:841 [inline]
0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at:
ip6_route_add+0x4d/0x2f0 net/ipv6/route.c:3913
stack backtrace:
CPU: 0 UID: 0 PID: 5832 Comm: syz-executor234 Tainted: G W
6.15.0-rc4-syzkaller-00746-g836b313a14a3 #0 PREEMPT(full)
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 04/29/2025
Call Trace:
<TASK>
dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
print_lock_invalid_wait_context kernel/locking/lockdep.c:4831 [inline]
check_wait_context kernel/locking/lockdep.c:4903 [inline]
__lock_acquire+0xbcf/0xd20 kernel/locking/lockdep.c:5185
lock_acquire+0x120/0x360 kernel/locking/lockdep.c:5866
__mutex_lock_common kernel/locking/mutex.c:601 [inline]
__mutex_lock+0x182/0xe80 kernel/locking/mutex.c:746
pcpu_alloc_noprof+0x284/0x16b0 mm/percpu.c:1782
dst_cache_init+0x37/0xc0 net/core/dst_cache.c:145
ip_tun_build_state+0x193/0x6b0 net/ipv4/ip_tunnel_core.c:687
lwtunnel_build_state+0x381/0x4c0 net/core/lwtunnel.c:137
fib_nh_common_init+0x129/0x460 net/ipv4/fib_semantics.c:635
fib6_nh_init+0x15e4/0x2030 net/ipv6/route.c:3669
ip6_route_info_create_nh+0x139/0x870 net/ipv6/route.c:3866
ip6_route_add+0xf6/0x2f0 net/ipv6/route.c:3915
inet6_rtm_newroute+0x284/0x1c50 net/ipv6/route.c:5732
rtnetlink_rcv_msg+0x7cc/0xb70 net/core/rtnetlink.c:6955
netlink_rcv_skb+0x219/0x490 net/netlink/af_netlink.c:2534
netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
netlink_unicast+0x758/0x8d0 net/netlink/af_netlink.c:1339
netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1883
sock_sendmsg_nosec net/socket.c:712 [inline]
__sock_sendmsg+0x219/0x270 net/socket.c:727
____sys_sendmsg+0x505/0x830 net/socket.c:2566
___sys_sendmsg+0x21f/0x2a0 net/socket.c:2620
__sys_sendmsg net/socket.c:2652 [inline]
__do_sys_sendmsg net/socket.c:2657 [inline]
__se_sys_sendmsg net/socket.c:2655 [inline]
__x64_sys_sendmsg+0x19b/0x260 net/socket.c:2655
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xf6/0x210 arch/x86/entry/syscall_64.c:94
Fixes: 169fd62799 ("ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.")
Reported-by: syzbot+bcc12d6799364500fbec@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=bcc12d6799364500fbec
Reported-by: Eric Dumazet <edumazet@google.com>
Closes: https://lore.kernel.org/netdev/CANn89i+r1cGacVC_6n3-A-WSkAa_Nr+pmxJ7Gt+oP-P9by2aGw@mail.gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit f130a0cc1b ("inet: fix lwtunnel_valid_encap_type() lock
imbalance") added the rtnl_is_held argument as a temporary fix while
I'm converting nexthop and IPv6 routing table to per-netns RTNL or RCU.
Now all callers of lwtunnel_valid_encap_type() do not hold RTNL.
Let's remove the argument.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Once allocated, the IPv6 routing table is not freed until
netns is dismantled.
fib6_get_table() uses rcu_read_lock() while iterating
net->ipv6.fib_table_hash[], but it's not needed and
rather confusing.
Because some callers have this pattern,
table = fib6_get_table();
rcu_read_lock();
/* ... use table here ... */
rcu_read_unlock();
[ See: addrconf_get_prefix_route(), ip6_route_del(),
rt6_get_route_info(), rt6_get_dflt_router() ]
and this looks illegal but is actually safe.
Let's remove rcu_read_lock() in fib6_get_table() and pass true
to the last argument of hlist_for_each_entry_rcu() to bypass
the RCU check.
Note that protection is not needed but RCU helper is used to
avoid data-race.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>