Commit graph

444 commits

Author SHA1 Message Date
Jakub Kicinski
c58c18be88 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in late fixes to prepare for the 6.17 net-next PR.

Conflicts:

net/core/neighbour.c
  1bbb76a899 ("neighbour: Fix null-ptr-deref in neigh_flush_dev().")
  13a936bb99 ("neighbour: Protect tbl->phash_buckets[] with a dedicated mutex.")
  03dc03fa04 ("neighbor: Add NTF_EXT_VALIDATED flag for externally validated entries")

Adjacent changes:

drivers/net/usb/usbnet.c
  0d9cfc9b8c ("net: usbnet: Avoid potential RCU stall on LINK_CHANGE event")
  2c04d279e8 ("net: usb: Convert tasklet API to new bottom half workqueue mechanism")

net/ipv6/route.c
  31d7d67ba1 ("ipv6: annotate data-races around rt->fib6_nsiblings")
  1caf272972 ("ipv6: adopt dst_dev() helper")
  3b3ccf9ed0 ("net: Remove unnecessary NULL check for lwtunnel_fill_encap()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-26 11:49:45 -07:00
Kuniyuki Iwashima
1bbb76a899 neighbour: Fix null-ptr-deref in neigh_flush_dev().
kernel test robot reported null-ptr-deref in neigh_flush_dev(). [0]

The cited commit introduced per-netdev neighbour list and converted
neigh_flush_dev() to use it instead of the global hash table.

One thing we missed is that neigh_table_clear() calls neigh_ifdown()
with NULL dev.

Let's restore the hash table iteration.

Note that IPv6 module is no longer unloadable, so neigh_table_clear()
is called only when IPv6 fails to initialise, which is unlikely to
happen.

[0]:
IPv6: Attempt to unregister permanent protocol 136
IPv6: Attempt to unregister permanent protocol 17
Oops: general protection fault, probably for non-canonical address 0xdffffc00000001a0: 0000 [#1] SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000d00-0x0000000000000d07]
CPU: 1 UID: 0 PID: 1 Comm: systemd Tainted: G                T  6.12.0-rc6-01246-gf7f52738637f #1
Tainted: [T]=RANDSTRUCT
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:neigh_flush_dev.llvm.6395807810224103582+0x52/0x570
Code: c1 e8 03 42 8a 04 38 84 c0 0f 85 15 05 00 00 31 c0 41 83 3e 0a 0f 94 c0 48 8d 1c c3 48 81 c3 f8 0c 00 00 48 89 d8 48 c1 e8 03 <42> 80 3c 38 00 74 08 48 89 df e8 f7 49 93 fe 4c 8b 3b 4d 85 ff 0f
RSP: 0000:ffff88810026f408 EFLAGS: 00010206
RAX: 00000000000001a0 RBX: 0000000000000d00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffc0631640
RBP: ffff88810026f470 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffffffffc0625250 R14: ffffffffc0631640 R15: dffffc0000000000
FS:  00007f575cb83940(0000) GS:ffff8883aee00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f575db40008 CR3: 00000002bf936000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 __neigh_ifdown.llvm.6395807810224103582+0x44/0x390
 neigh_table_clear+0xb1/0x268
 ndisc_cleanup+0x21/0x38 [ipv6]
 init_module+0x2f5/0x468 [ipv6]
 do_one_initcall+0x1ba/0x628
 do_init_module+0x21a/0x530
 load_module+0x2550/0x2ea0
 __se_sys_finit_module+0x3d2/0x620
 __x64_sys_finit_module+0x76/0x88
 x64_sys_call+0x7ff/0xde8
 do_syscall_64+0xfb/0x1e8
 entry_SYSCALL_64_after_hwframe+0x67/0x6f
RIP: 0033:0x7f575d6f2719
Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
RSP: 002b:00007fff82a2a268 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 0000557827b45310 RCX: 00007f575d6f2719
RDX: 0000000000000000 RSI: 00007f575d584efd RDI: 0000000000000004
RBP: 00007f575d584efd R08: 0000000000000000 R09: 0000557827b47b00
R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000020000
R13: 0000000000000000 R14: 0000557827b470e0 R15: 00007f575dbb4270
 </TASK>
Modules linked in: ipv6(+)

Fixes: f7f5273863 ("neighbour: Create netdev->neighbour association")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202507200931.7a89ecd8-lkp@intel.com
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250723195443.448163-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-25 13:11:39 -07:00
Kuniyuki Iwashima
dc2a27e524 neighbour: Update pneigh_entry in pneigh_create().
neigh_add() updates pneigh_entry() found or created by pneigh_create().

This update is serialised by RTNL, but we will remove it.

Let's move the update part to pneigh_create() and make it return errno
instead of a pointer of pneigh_entry.

Now, the pneigh code is RTNL free.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-16-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:22 -07:00
Kuniyuki Iwashima
13a936bb99 neighbour: Protect tbl->phash_buckets[] with a dedicated mutex.
tbl->phash_buckets[] is only modified in the slow path by pneigh_create()
and pneigh_delete() under the table lock.

Both of them are called under RTNL, so no extra lock is needed, but we
will remove RTNL from the paths.

pneigh_create() looks up a pneigh_entry, and this part can be lockless,
but it would complicate the logic like

  1. lookup
  2. allocate pengih_entry for GFP_KERNEL
  3. lookup again but under lock
  4. if found, return it after freeing the allocated memory
  5. else, return the new one

Instead, let's add a per-table mutex and run lookup and allocation
under it.

Note that updating pneigh_entry part in neigh_add() is still protected
by RTNL and will be moved to pneigh_create() in the next patch.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-15-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:21 -07:00
Kuniyuki Iwashima
b8b7ed1ea8 neighbour: Drop read_lock_bh(&tbl->lock) in pneigh_lookup().
Now, all callers of pneigh_lookup() are under RCU, and the read
lock there is no longer needed.

Let's drop the lock, inline __pneigh_lookup_1() to pneigh_lookup(),
and call it from pneigh_create().

The next patch will remove tbl->lock from pneigh_create().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-14-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:21 -07:00
Kuniyuki Iwashima
dd103c9a53 neighbour: Remove __pneigh_lookup().
__pneigh_lookup() is the lockless version of pneigh_lookup(),
but its only caller pndisc_is_router() holds the table lock and
reads pneigh_netry.flags.

This is because accessing pneigh_entry after pneigh_lookup() was
illegal unless the caller holds RTNL or the table lock.

Now, pneigh_entry is guaranteed to be alive during the RCU critical
section.

Let's call pneigh_lookup() and use READ_ONCE() for n->flags in
pndisc_is_router() and remove __pneigh_lookup().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-13-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:21 -07:00
Kuniyuki Iwashima
b9c89fa128 neighbour: Use rcu_dereference() in pneigh_get_{first,next}().
Now pneigh_entry is guaranteed to be alive during the
RCU critical section even without holding tbl->lock.

Let's use rcu_dereference() in pneigh_get_{first,next}().

Note that neigh_seq_start() still holds tbl->lock for the
normal neighbour entry.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-12-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:21 -07:00
Kuniyuki Iwashima
32d5eaabf1 neighbour: Drop read_lock_bh(&tbl->lock) in pneigh_dump_table().
Now pneigh_entry is guaranteed to be alive during the
RCU critical section even without holding tbl->lock.

Let's drop read_lock_bh(&tbl->lock) and use rcu_dereference()
to iterate tbl->phash_buckets[] in pneigh_dump_table()

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-11-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:21 -07:00
Kuniyuki Iwashima
ed6e380d2d neighbour: Convert RTM_GETNEIGH to RCU.
Only __dev_get_by_index() is the RTNL dependant in neigh_get().

Let's replace it with dev_get_by_index_rcu() and convert RTM_GETNEIGH
to RCU.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-10-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:21 -07:00
Kuniyuki Iwashima
cc03492c7b neighbour: Annotate access to struct pneigh_entry.{flags,protocol}.
We will convert pneigh readers to RCU, and its flags and protocol
will be read locklessly.

Let's annotate the access to the two fields.

Note that all access to pn->permanent is under RTNL (neigh_add()
and pneigh_ifdown_and_unlock()), so WRITE_ONCE() and READ_ONCE()
are not needed.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-9-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:21 -07:00
Kuniyuki Iwashima
d539d8fbd8 neighbour: Free pneigh_entry after RCU grace period.
We will convert RTM_GETNEIGH to RCU.

neigh_get() looks up pneigh_entry by pneigh_lookup() and passes
it to pneigh_fill_info().

Then, we must ensure that the entry is alive till pneigh_fill_info()
completes, but read_lock_bh(&tbl->lock) in pneigh_lookup() does not
guarantee that.

Also, we will convert all readers of tbl->phash_buckets[] to RCU.

Let's use call_rcu() to free pneigh_entry and update phash_buckets[]
and ->next by rcu_assign_pointer().

pneigh_ifdown_and_unlock() uses list_head to avoid overwriting
->next and moving RCU iterators to another list.

pndisc_destructor() (only IPv6 ndisc uses this) uses a mutex, so it
is not delayed to call_rcu(), where we cannot sleep.  This is fine
because the mcast code works with RCU and ipv6_dev_mc_dec() frees
mcast objects after RCU grace period.

While at it, we change the return type of pneigh_ifdown_and_unlock()
to void.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-8-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:20 -07:00
Kuniyuki Iwashima
d63382aea7 neighbour: Annotate neigh_table.phash_buckets and pneigh_entry.next with __rcu.
The next patch will free pneigh_entry with call_rcu().

Then, we need to annotate neigh_table.phash_buckets[] and
pneigh_entry.next with __rcu.

To make the next patch cleaner, let's annotate the fields in advance.

Currently, all accesses to the fields are under the neigh table lock,
so rcu_dereference_protected() is used with 1 for now, but most of them
(except in pneigh_delete() and pneigh_ifdown_and_unlock()) will be
replaced with rcu_dereference() and rcu_dereference_check().

Note that pneigh_ifdown_and_unlock() changes pneigh_entry.next to a
local list, which is illegal because the RCU iterator could be moved
to another list.  This part will be fixed in the next patch.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:20 -07:00
Kuniyuki Iwashima
e804bd83c1 neighbour: Split pneigh_lookup().
pneigh_lookup() has ASSERT_RTNL() in the middle of the function, which
is confusing.

When called with the last argument, creat, 0, pneigh_lookup() literally
looks up a proxy neighbour entry.  This is the case of the reader path
as the fast path and RTM_GETNEIGH.

pneigh_lookup(), however, creates a pneigh_entry when called with creat 1
from RTM_NEWNEIGH and SIOCSARP, which require RTNL.

Let's split pneigh_lookup() into two functions.

We will convert all the reader paths to RCU, and read_lock_bh(&tbl->lock)
in the new pneigh_lookup() will be dropped.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:20 -07:00
Kuniyuki Iwashima
0e5ac19c78 neighbour: Move neigh_find_table() to neigh_get().
neigh_valid_get_req() calls neigh_find_table() to fetch neigh_tables[].

neigh_find_table() uses rcu_dereference_rtnl(), but RTNL actually does
not protect it at all; neigh_table_clear() can be called without RTNL
and only waits for RCU readers by synchronize_rcu().

Fortunately, there is no bug because IPv4 is built-in, IPv6 cannot be
unloaded, and DECNET was removed.

To fetch neigh_tables[] by rcu_dereference() later, let's move
neigh_find_table() from neigh_valid_get_req() to neigh_get().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:20 -07:00
Kuniyuki Iwashima
3dfe0b57dc neighbour: Allocate skb in neigh_get().
We will remove RTNL for neigh_get() and run it under RCU instead.

neigh_get_reply() and pneigh_get_reply() allocate skb with GFP_KERNEL.

Let's move the allocation before __dev_get_by_index() in neigh_get().

Now, neigh_get_reply() and pneigh_get_reply() are inlined and
rtnl_unicast() is factorised.

We will convert pneigh_lookup() to __pneigh_lookup() later.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:20 -07:00
Kuniyuki Iwashima
f5046fbc1b neighbour: Move two validations from neigh_get() to neigh_valid_get_req().
We will remove RTNL for neigh_get() and run it under RCU instead.

neigh_get() returns -EINVAL in the following cases:

  * NDA_DST is not specified
  * Both ndm->ndm_ifindex and NTF_PROXY are not specified

These validations do not require RCU.

Let's move them to neigh_valid_get_req().

While at it, the extack string for the first case is replaced with
NL_SET_ERR_ATTR_MISS().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:20 -07:00
Kuniyuki Iwashima
caf0a753a8 neighbour: Make neigh_valid_get_req() return ndmsg.
neigh_get() passes 4 local variable pointers to neigh_valid_get_req().

If it returns a pointer of struct ndmsg, we do not need to pass two
of them.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:20 -07:00
Eric Dumazet
8a402bbe54 net: dst: annotate data-races around dst->obsolete
(dst_entry)->obsolete is read locklessly, add corresponding
annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:29 -07:00
Ido Schimmel
03dc03fa04 neighbor: Add NTF_EXT_VALIDATED flag for externally validated entries
tl;dr
=====

Add a new neighbor flag ("extern_valid") that can be used to indicate to
the kernel that a neighbor entry was learned and determined to be valid
externally. The kernel will not try to remove or invalidate such an
entry, leaving these decisions to the user space control plane. This is
needed for EVPN multi-homing where a neighbor entry for a multi-homed
host needs to be synced across all the VTEPs among which the host is
multi-homed.

Background
==========

In a typical EVPN multi-homing setup each host is multi-homed using a
set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf
switches (VTEPs). VTEPs that are connected to the same ES are called ES
peers.

When a neighbor entry is learned on a VTEP, it is distributed to both ES
peers and remote VTEPs using EVPN MAC/IP advertisement routes. ES peers
use the neighbor entry when routing traffic towards the multi-homed host
and remote VTEPs use it for ARP/NS suppression.

Motivation
==========

If the ES link between a host and the VTEP on which the neighbor entry
was locally learned goes down, the EVPN MAC/IP advertisement route will
be withdrawn and the neighbor entries will be removed from both ES peers
and remote VTEPs. Routing towards the multi-homed host and ARP/NS
suppression can fail until another ES peer locally learns the neighbor
entry and distributes it via an EVPN MAC/IP advertisement route.

"draft-rbickhart-evpn-ip-mac-proxy-adv-03" [1] suggests avoiding these
intermittent failures by having the ES peers install the neighbor
entries as before, but also injecting EVPN MAC/IP advertisement routes
with a proxy indication. When the previously mentioned ES link goes down
and the original EVPN MAC/IP advertisement route is withdrawn, the ES
peers will not withdraw their neighbor entries, but instead start aging
timers for the proxy indication.

If an ES peer locally learns the neighbor entry (i.e., it becomes
"reachable"), it will restart its aging timer for the entry and emit an
EVPN MAC/IP advertisement route without a proxy indication. An ES peer
will stop its aging timer for the proxy indication if it observes the
removal of the proxy indication from at least one of the ES peers
advertising the entry.

In the event that the aging timer for the proxy indication expired, an
ES peer will withdraw its EVPN MAC/IP advertisement route. If the timer
expired on all ES peers and they all withdrew their proxy
advertisements, the neighbor entry will be completely removed from the
EVPN fabric.

Implementation
==============

In the above scheme, when the control plane (e.g., FRR) advertises a
neighbor entry with a proxy indication, it expects the corresponding
entry in the data plane (i.e., the kernel) to remain valid and not be
removed due to garbage collection or loss of carrier. The control plane
also expects the kernel to notify it if the entry was learned locally
(i.e., became "reachable") so that it will remove the proxy indication
from the EVPN MAC/IP advertisement route. That is why these entries
cannot be programmed with dummy states such as "permanent" or "noarp".

Instead, add a new neighbor flag ("extern_valid") which indicates that
the entry was learned and determined to be valid externally and should
not be removed or invalidated by the kernel. The kernel can probe the
entry and notify user space when it becomes "reachable" (it is initially
installed as "stale"). However, if the kernel does not receive a
confirmation, have it return the entry to the "stale" state instead of
the "failed" state.

In other words, an entry marked with the "extern_valid" flag behaves
like any other dynamically learned entry other than the fact that the
kernel cannot remove or invalidate it.

One can argue that the "extern_valid" flag should not prevent garbage
collection and that instead a neighbor entry should be programmed with
both the "extern_valid" and "extern_learn" flags. There are two reasons
for not doing that:

1. Unclear why a control plane would like to program an entry that the
   kernel cannot invalidate but can completely remove.

2. The "extern_learn" flag is used by FRR for neighbor entries learned
   on remote VTEPs (for ARP/NS suppression) whereas here we are
   concerned with local entries. This distinction is currently irrelevant
   for the kernel, but might be relevant in the future.

Given that the flag only makes sense when the neighbor has a valid
state, reject attempts to add a neighbor with an invalid state and with
this flag set. For example:

 # ip neigh add 192.0.2.1 nud none dev br0.10 extern_valid
 Error: Cannot create externally validated neighbor with an invalid state.
 # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid
 # ip neigh replace 192.0.2.1 nud failed dev br0.10 extern_valid
 Error: Cannot mark neighbor as externally validated with an invalid state.

The above means that a neighbor cannot be created with the
"extern_valid" flag and flags such as "use" or "managed" as they result
in a neighbor being created with an invalid state ("none") and
immediately getting probed:

 # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
 Error: Cannot create externally validated neighbor with an invalid state.

However, these flags can be used together with "extern_valid" after the
neighbor was created with a valid state:

 # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid
 # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use

One consequence of preventing the kernel from invalidating a neighbor
entry is that by default it will only try to determine reachability
using unicast probes. This can be changed using the "mcast_resolicit"
sysctl:

 # sysctl net.ipv4.neigh.br0/10.mcast_resolicit
 0
 # tcpdump -nn -e -i br0.10 -Q out arp &
 # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 # sysctl -wq net.ipv4.neigh.br0/10.mcast_resolicit=3
 # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28

iproute2 patches can be found here [2].

[1] https://datatracker.ietf.org/doc/html/draft-rbickhart-evpn-ip-mac-proxy-adv-03
[2] https://github.com/idosch/iproute2/tree/submit/extern_valid_v1

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/20250626073111.244534-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-30 18:14:23 -07:00
Yue Haibing
9b19b50c8d neighbour: Remove redundant assignment to err
'err' has been checked against 0 in the if statement.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250624014216.3686659-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-25 15:26:03 -07:00
Nicolas Escande
c7d78566bb neighbour: add support for NUD_PERMANENT proxy entries
As discussesd before in [0] proxy entries (which are more configuration
than runtime data) should stay when the link (carrier) goes does down.
This is what happens for regular neighbour entries.

So lets fix this by:
  - storing in proxy entries the fact that it was added as NUD_PERMANENT
  - not removing NUD_PERMANENT proxy entries when the carrier goes down
    (same as how it's done in neigh_flush_dev() for regular neigh entries)

[0]: https://lore.kernel.org/netdev/c584ef7e-6897-01f3-5b80-12b53f7b4bf4@kernel.org/

Signed-off-by: Nicolas Escande <nico.escande@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250617141334.3724863-1-nico.escande@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19 16:15:34 -07:00
Ingo Molnar
41cb08555c treewide, timers: Rename from_timer() to timer_container_of()
Move this API to the canonical timer_*() namespace.

[ tglx: Redone against pre rc1 ]

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-08 09:07:37 +02:00
Qiu Yutan
e45b7196df net: neigh: use kfree_skb_reason() in neigh_resolve_output() and neigh_connected_output()
Replace kfree_skb() used in neigh_resolve_output() and
neigh_connected_output() with kfree_skb_reason().

Following new skb drop reason is added:
/* failed to fill the device hard header */
SKB_DROP_REASON_NEIGH_HH_FILLFAIL

Signed-off-by: Qiu Yutan <qiu.yutan@zte.com.cn>
Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Xu Xin <xu.xin16@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-26 10:03:13 +01:00
Breno Leitao
9929ba1942 net: Use nlmsg_payload in neighbour file
Leverage the new nlmsg_payload() helper to avoid checking for message
size and then reading the nlmsg data.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250417-nlmsg_v3-v1-1-9b09d9d7e61d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21 18:38:00 -07:00
Breno Leitao
2d1f827f06 neighbour: Use nlmsg_payload in neigh_valid_get_req
Update neigh_valid_get_req function to utilize the new nlmsg_payload()
helper function.

This change improves code clarity and safety by ensuring that the
Netlink message payload is properly validated before accessing its data.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250414-nlmsg-v2-3-3d90cb42c6af@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 08:28:54 -07:00
Breno Leitao
7527efe8a4 neighbour: Use nlmsg_payload in neightbl_valid_dump_info
Update neightbl_valid_dump_info function to utilize the new
nlmsg_payload() helper function.

This change improves code clarity and safety by ensuring that the
Netlink message payload is properly validated before accessing its data.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250414-nlmsg-v2-2-3d90cb42c6af@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 08:28:54 -07:00
Thomas Gleixner
8fa7292fee treewide: Switch/rename to timer_delete[_sync]()
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.

Conversion was done with coccinelle plus manual fixups where necessary.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-05 10:30:12 +02:00
Paolo Abeni
f491593394 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc8).

Conflict:

tools/testing/selftests/net/Makefile
  03544faad7 ("selftest: net: add proc_net_pktgen")
  3ed61b8938 ("selftests: net: test for lwtunnel dst ref loops")

tools/testing/selftests/net/config:
  85cb3711ac ("selftests: net: Add test cases for link and peer netns")
  3ed61b8938 ("selftests: net: test for lwtunnel dst ref loops")

Adjacent commits:

tools/testing/selftests/net/Makefile
  c935af429e ("selftests: net: add support for testing SO_RCVMARK and SO_RCVPRIORITY")
  355d940f4d ("Revert "selftests: Add IPv6 link-local address generation tests for GRE devices."")

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20 21:38:01 +01:00
Lin Ma
90a7138619 net/neighbor: add missing policy for NDTPA_QUEUE_LENBYTES
Previous commit 8b5c171bb3 ("neigh: new unresolved queue limits")
introduces new netlink attribute NDTPA_QUEUE_LENBYTES to represent
approximative value for deprecated QUEUE_LEN. However, it forgot to add
the associated nla_policy in nl_ntbl_parm_policy array. Fix it with one
simple NLA_U32 type policy.

Fixes: 8b5c171bb3 ("neigh: new unresolved queue limits")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Link: https://patch.msgid.link/20250315165113.37600-1-linma@zju.edu.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20 15:23:29 +01:00
Kohei Enju
ef75d8343b neighbour: Replace kvzalloc() with kzalloc() when GFP_ATOMIC is specified
kzalloc() uses page allocator when size is larger than
KMALLOC_MAX_CACHE_SIZE, so the intention of commit ab101c553b
("neighbour: use kvzalloc()/kvfree()") can be achieved by using kzalloc().

When using GFP_ATOMIC, kvzalloc() only tries the kmalloc path,
since the vmalloc path does not support the flag.
In this case, kvzalloc() is equivalent to kzalloc() in that neither try
the vmalloc path, so this replacement brings no functional change.
This is primarily a cleanup change, as the original code functions
correctly.

This patch replaces kvzalloc() introduced by commit 41b3caa7c0
("neighbour: Add hlist_node to struct neighbour"), which is called in
the same context and with the same gfp flag as the aforementioned commit
ab101c553b ("neighbour: use kvzalloc()/kvfree()").

Signed-off-by: Kohei Enju <enjuk@amazon.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://patch.msgid.link/20250219102227.72488-1-enjuk@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20 17:28:01 -08:00
Jakub Kicinski
7a7e019713 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc3).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-13 12:43:30 -08:00
Eric Dumazet
becbd5850c neighbour: use RCU protection in __neigh_notify()
__neigh_notify() can be called without RTNL or RCU protection.

Use RCU protection to avoid potential UAF.

Fixes: 426b5303eb ("[NETNS]: Modify the neighbour table code so it handles multiple network namespaces")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250207135841.1948589-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10 18:09:09 -08:00
Eric Dumazet
a064068bb6 neighbour: remove neigh_parms_destroy()
neigh_parms_destroy() is a simple kfree(), no need for
a forward declaration.

neigh_parms_put() can instead call kfree() directly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250203151152.3163876-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-04 14:01:19 -08:00
Jakub Kicinski
0de6a472c3 net/neighbor: clear error in case strict check is not set
Commit 51183d233b ("net/neighbor: Update neigh_dump_info for strict
data checking") added strict checking. The err variable is not cleared,
so if we find no table to dump we will return the validation error even
if user did not want strict checking.

I think the only way to hit this is to send an buggy request, and ask
for a table which doesn't exist, so there's no point treating this
as a real fix. I only noticed it because a syzbot repro depended on it
to trigger another bug.

Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241115003221.733593-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-18 18:42:21 -08:00
Gilad Naaman
f7f5273863 neighbour: Create netdev->neighbour association
Create a mapping between a netdev and its neighoburs,
allowing for much cheaper flushes.

Signed-off-by: Gilad Naaman <gnaaman@drivenets.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241107160444.2913124-7-gnaaman@drivenets.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09 13:22:57 -08:00
Gilad Naaman
a01a67ab2f neighbour: Remove bare neighbour::next pointer
Remove the now-unused neighbour::next pointer, leaving struct neighbour
solely with the hlist_node implementation.

Signed-off-by: Gilad Naaman <gnaaman@drivenets.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241107160444.2913124-6-gnaaman@drivenets.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09 13:22:57 -08:00
Gilad Naaman
0e3bcb0f78 neighbour: Convert iteration to use hlist+macro
Remove all usage of the bare neighbour::next pointer,
replacing them with neighbour::hash and its for_each macro.

Signed-off-by: Gilad Naaman <gnaaman@drivenets.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241107160444.2913124-5-gnaaman@drivenets.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09 13:22:57 -08:00
Gilad Naaman
00df5e1a3f neighbour: Convert seq_file functions to use hlist
Convert seq_file-related neighbour functionality to use neighbour::hash
and the related for_each macro.

Signed-off-by: Gilad Naaman <gnaaman@drivenets.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241107160444.2913124-4-gnaaman@drivenets.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09 13:22:57 -08:00
Gilad Naaman
41b3caa7c0 neighbour: Add hlist_node to struct neighbour
Add a doubly-linked node to neighbours, so that they
can be deleted without iterating the entire bucket they're in.

Signed-off-by: Gilad Naaman <gnaaman@drivenets.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241107160444.2913124-2-gnaaman@drivenets.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09 13:22:56 -08:00
Eric Dumazet
ab101c553b neighbour: use kvzalloc()/kvfree()
mm layer is providing convenient functions, we do not have
to work around old limitations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Gilad Naaman <gnaaman@drivenets.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241022150059.1345406-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-28 18:12:06 -07:00
Kuniyuki Iwashima
d0d14aef50 neighbour: Use rtnl_register_many().
We will remove rtnl_register() in favour of rtnl_register_many().

When it succeeds, rtnl_register_many() guarantees all rtnetlink types
in the passed array are supported, and there is no chance that a part
of message types is not supported.

Let's use rtnl_register_many() instead.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241014201828.91221-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-15 18:52:25 -07:00
Li Zetao
c25bdd2ac8 neighbour: delete redundant judgment statements
The initial value of err is -ENOBUFS, and err is guaranteed to be
less than 0 before all goto errout. Therefore, on the error path
of errout, there is no need to repeatedly judge that err is less than 0,
and delete redundant judgments to make the code more concise.

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-23 14:27:45 +01:00
Joel Granados
78eb4ea25c sysctl: treewide: constify the ctl_table argument of proc_handlers
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata data which will ensure that proc_handler function
pointers cannot be modified.

This patch has been generated by the following coccinelle script:

```
  virtual patch

  @r1@
  identifier ctl, write, buffer, lenp, ppos;
  identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

  @r2@
  identifier func, ctl, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos)
  { ... }

  @r3@
  identifier func;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int , void *, size_t *, loff_t *);

  @r4@
  identifier func, ctl;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int , void *, size_t *, loff_t *);

  @r5@
  identifier func, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

```

* Code formatting was adjusted in xfs_sysctl.c to comply with code
  conventions. The xfs_stats_clear_proc_handler,
  xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
  adjusted.

* The ctl_table argument in proc_watchdog_common was const qualified.
  This is called from a proc_handler itself and is calling back into
  another proc_handler, making it necessary to change it as part of the
  proc_handler migration.

Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Co-developed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-07-24 20:59:29 +02:00
Thomas Weißschuh
874aa96d78 net/neighbour: constify ctl_table arguments of utility function
The sysctl core is preparing to only expose instances of
struct ctl_table as "const".
This will also affect the ctl_table argument of sysctl handlers.

As the function prototype of all sysctl handlers throughout the tree
needs to stay consistent that change will be done in one commit.

To reduce the size of that final commit, switch utility functions which
are not bound by "typedef proc_handler" to "const struct ctl_table".

No functional change.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20240527-sysctl-const-handler-net-v1-1-16523767d0b2@weissschuh.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-28 19:49:47 -07:00
Joel Granados
ce218712b0 net: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

* Remove sentinel element from ctl_table structs.
* Remove the zeroing out of an array element (to make it look like a
  sentinel) in neigh_sysctl_register and lowpan_frags_ns_sysctl_register
  This is not longer needed and is safe after commit c899710fe7
  ("networking: Update to register_net_sysctl_sz") added the array size
  to the ctl_table registration.
* Replace the for loop stop condition in sysctl_core_net_init that tests
  for procname == NULL with one that depends on array size
* Removed the "-1" in mpls_net_init that adjusted for having an extra
  empty element when looping over ctl_table arrays
* Use a table_size variable to keep the value of ARRAY_SIZE

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03 13:29:41 +01:00
Eric Dumazet
1c04b46cbd neighbour: fix neigh_master_filtered()
If we no longer hold RTNL, we must use netdev_master_upper_dev_get_rcu()
instead of netdev_master_upper_dev_get().

Fixes: ba0f780694 ("neighbour: no longer hold RTNL in neigh_dump_info()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240421185753.1808077-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23 19:04:50 -07:00
Eric Dumazet
ba0f780694 neighbour: no longer hold RTNL in neigh_dump_info()
neigh_dump_table() is already relying on RCU protection.

pneigh_dump_table() is using its own protection (tbl->lock)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 12:39:20 +01:00
Eric Dumazet
7e4975f7e7 neighbour: fix neigh_dump_info() return value
Change neigh_dump_table() and pneigh_dump_table()
to either return 0 or -EMSGSIZE if not enough
space was available in the skb.

Then neigh_dump_info() can do the same.

This allows NLMSG_DONE to be appended to the current
skb at the end of a dump, saving a couple of recvmsg()
system calls.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 12:39:20 +01:00
Eric Dumazet
f8f2eb9de6 neighbour: add RCU protection to neigh_tables[]
In order to remove RTNL protection from neightbl_dump_info()
and neigh_dump_info() later, we need to add
RCU protection to neigh_tables[].

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 12:39:20 +01:00
Zheng Li
eabf425bc6 neighbour: guarantee the localhost connections be established successfully even the ARP table is full
Inter-process communication on localhost should be established successfully
even the ARP table is full, many processes on server machine use the
localhost to communicate such as command-line interface (CLI),
servers hope all CLI commands can be executed successfully even the arp
table is full. Right now CLI commands got timeout when the arp table is
full. Set the parameter of exempt_from_gc to be true for LOOPBACK net
device to keep localhost neigh in arp table, not removed by gc.

the steps of reproduced:
server with "gc_thresh3 = 1024" setting, ping server from more than 1024
same netmask Lan IPv4 addresses, run "ssh localhost" on console interface,
then the command will get timeout.

Signed-off-by: Zheng Li <James.Z.Li@Dell.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240416095343.540-1-lizheng043@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-18 12:01:03 +02:00