linux/net/ipv4
Jon Maloy 8c670bdfa5 tcp: correct handling of extreme memory squeeze
Testing with iperf3 using the "pasta" protocol splicer has revealed
a problem in the way tcp handles window advertising in extreme memory
squeeze situations.

Under memory pressure, a socket endpoint may temporarily advertise
a zero-sized window, but this is not stored as part of the socket data.
The reasoning behind this is that it is considered a temporary setting
which shouldn't influence any further calculations.

However, if we happen to stall at an unfortunate value of the current
window size, the algorithm selecting a new value will consistently fail
to advertise a non-zero window once we have freed up enough memory.
This means that this side's notion of the current window size is
different from the one last advertised to the peer, causing the latter
to not send any data to resolve the sitution.

The problem occurs on the iperf3 server side, and the socket in question
is a completely regular socket with the default settings for the
fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.

The following excerpt of a logging session, with own comments added,
shows more in detail what is happening:

//              tcp_v4_rcv(->)
//                tcp_rcv_established(->)
[5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
[5201<->39222]:     tcp_data_queue(->)
[5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
                       [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                       [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
                       [OFO queue: gap: 65480, len: 0]
[5201<->39222]:     tcp_data_queue(<-)
[5201<->39222]:     __tcp_transmit_skb(->)
                        [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]:       tcp_select_window(->)
[5201<->39222]:         (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? --> TRUE
                        [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
                        returning 0
[5201<->39222]:       tcp_select_window(<-)
[5201<->39222]:       ADVERTISING WIN 0, ACK_SEQ: 265600160
[5201<->39222]:     [__tcp_transmit_skb(<-)
[5201<->39222]:   tcp_rcv_established(<-)
[5201<->39222]: tcp_v4_rcv(<-)

// Receive queue is at 85 buffers and we are out of memory.
// We drop the incoming buffer, although it is in sequence, and decide
// to send an advertisement with a window of zero.
// We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
// we unconditionally shrink the window.

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 0, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
                    [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]:   __tcp_cleanup_rbuf(<-)
                  [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                  [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
                  returning 6104 bytes
[5201<->39222]: tcp_recvmsg_locked(<-)

// After each read, the algorithm for calculating the new receive
// window in __tcp_cleanup_rbuf() finds it is too small to advertise
// or to update tp->rcv_wnd.
// Meanwhile, the peer thinks the window is zero, and will not send
// any more data to trigger an update from the interrupt mode side.

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
                    [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]:   __tcp_cleanup_rbuf(<-)
                  [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                  [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
                  returning 131072 bytes
[5201<->39222]: tcp_recvmsg_locked(<-)

// The above pattern repeats again and again, since nothing changes
// between the reads.

[...]

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
                    [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]:   __tcp_cleanup_rbuf(<-)
                  [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                  [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
                  returning 54672 bytes
[5201<->39222]: tcp_recvmsg_locked(<-)

// The receive queue is empty, but no new advertisement has been sent.
// The peer still thinks the receive window is zero, and sends nothing.
// We have ended up in a deadlock situation.

Note that well behaved endpoints will send win0 probes, so the problem
will not occur.

Furthermore, we have observed that in these situations this side may
send out an updated 'th->ack_seq´ which is not stored in tp->rcv_wup
as it should be. Backing ack_seq seems to be harmless, but is of
course still wrong from a protocol viewpoint.

We fix this by updating the socket state correctly when a packet has
been dropped because of memory exhaustion and we have to advertize
a zero window.

Further testing shows that the connection recovers neatly from the
squeeze situation, and traffic can continue indefinitely.

Fixes: e2142825c1 ("net: tcp: send zero-window ACK when no memory")
Cc: Menglong Dong <menglong8.dong@gmail.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250127231304.1465565-1-jmaloy@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-29 18:58:15 -08:00
..
netfilter netfilter: nf_dup4: Convert nf_dup_ipv4_route() to dscp_t. 2024-11-15 11:00:29 +01:00
af_inet.c ipv4: Define inet_sk_init_flowi4() and use it in inet_sk_rebuild_header(). 2024-12-20 13:50:09 -08:00
ah4.c
arp.c neighbour: Remove bare neighbour::next pointer 2024-11-09 13:22:57 -08:00
bpf_tcp_ca.c bpf: Check unsupported ops from the bpf_struct_ops's cfi_stubs 2024-07-29 12:54:13 -07:00
cipso_ipv4.c move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
datagram.c ipv4: Use inet_sk_init_flowi4() in ip4_datagram_release_cb(). 2024-12-20 13:50:09 -08:00
devinet.c net: convert to nla_get_*_default() 2024-11-11 10:32:06 -08:00
esp4.c ipsec-2025-01-27 2025-01-27 15:15:12 -08:00
esp4_offload.c xfrm: Add an inbound percpu state cache. 2024-10-29 11:56:18 +01:00
fib_frontend.c net: ip: fix unexpected return in fib_validate_source() 2024-11-18 18:57:00 -08:00
fib_lookup.h
fib_notifier.c net: do not acquire rtnl in fib_seq_sum() 2024-10-11 15:35:05 -07:00
fib_rules.c ipv4: fib_rules: Reject flow label attributes 2024-12-19 16:02:21 +01:00
fib_semantics.c ipv4: remove fib_info_devhash[] 2024-10-07 16:46:27 -07:00
fib_trie.c ipv4: output metric as unsigned int 2024-12-15 13:13:40 -08:00
fou_bpf.c
fou_core.c fou: fix initialization of grc 2024-09-09 17:21:47 -07:00
fou_nl.c tools: ynl-gen: use big-endian netlink attribute types 2024-10-22 15:33:24 +02:00
fou_nl.h
gre_demux.c
gre_offload.c
icmp.c inetpeer: do not get a refcount in inet_getpeer() 2024-12-17 19:37:48 -08:00
igmp.c netlink: correct nlmsg size for multicast notifications 2024-12-23 10:26:43 -08:00
inet_connection_sock.c ipv4: Use inet_sk_init_flowi4() in inet_csk_rebuild_route(). 2024-12-20 13:50:09 -08:00
inet_diag.c tcp: annotate data-races around icsk->icsk_pending 2024-10-04 15:34:39 -07:00
inet_fragment.c net: Rename mono_delivery_time to tstamp_type for scalabilty 2024-05-23 14:14:23 -07:00
inet_hashtables.c inet: constify 'struct net' parameter of various lookup helpers 2024-08-05 16:22:45 -07:00
inet_timewait_sock.c tcp: move inet_twsk_schedule helper out of header 2024-06-10 11:54:18 +01:00
inetpeer.c inetpeer: avoid false sharing in inet_peer_xrlim_allow() 2024-12-20 13:04:40 -08:00
ip_forward.c
ip_fragment.c inetpeer: do not get a refcount in inet_getpeer() 2024-12-17 19:37:48 -08:00
ip_gre.c gre: Prepare ipgre_open() to .flowi4_tos conversion. 2025-01-16 16:51:57 -08:00
ip_input.c ipv4: remove useless arg 2025-01-02 17:17:40 -08:00
ip_options.c net: ip: make ip_route_input() return drop reasons 2024-11-12 11:24:51 +01:00
ip_output.c ipv4: Use inet_sk_init_flowi4() in __ip_queue_xmit(). 2024-12-20 13:50:09 -08:00
ip_sockglue.c Networking changes for 6.14. 2025-01-22 08:28:57 -08:00
ip_tunnel.c net: Fix netns for ip_tunnel_init_flow() 2024-12-23 10:04:41 -08:00
ip_tunnel_core.c
ip_vti.c netdev_features: convert NETIF_F_LLTX to dev->lltx 2024-09-03 11:36:43 +02:00
ipcomp.c
ipconfig.c
ipip.c netdev_features: convert NETIF_F_LLTX to dev->lltx 2024-09-03 11:36:43 +02:00
ipmr.c inet: ipmr: fix data-races 2025-01-15 15:07:23 -08:00
ipmr_base.c ipmr: do not call mr_mfc_uses_dev() for unres entries 2025-01-23 07:08:13 -08:00
Kconfig net/tcp: Expand goo.gl link 2024-07-30 18:35:12 -07:00
Makefile
metrics.c net: remove NULL-pointer net parameter in ip_metrics_convert 2024-06-05 10:06:00 +01:00
netfilter.c netfilter: ipv4: Convert ip_route_me_harder() to dscp_t. 2024-11-15 11:00:29 +01:00
netlink.c
nexthop.c net: convert to nla_get_*_default() 2024-11-11 10:32:06 -08:00
ping.c ping: use sk_skb_reason_drop to free rx packets 2024-06-19 12:44:22 +01:00
proc.c tcp: add LINUX_MIB_PAWS_OLD_ACK SNMP counter 2025-01-14 13:28:13 -08:00
protocol.c
raw.c sock: support SO_PRIORITY cmsg 2024-12-16 18:13:44 -08:00
raw_diag.c
route.c ipv4: Prepare inet_rtm_getroute() to .flowi4_tos conversion. 2025-01-16 16:52:29 -08:00
syncookies.c tcp: use sk_skb_reason_drop to free rx packets 2024-06-19 12:44:22 +01:00
sysctl_net_ipv4.c tcp: Add sysctl to configure TIME-WAIT reuse delay 2024-12-11 20:17:33 -08:00
tcp.c tcp: only release congestion control if it has been initialized 2024-10-31 18:22:48 -07:00
tcp_ao.c net/tcp: Add missing lockdep annotations for TCP-AO hlist traversals 2024-11-03 12:10:11 -08:00
tcp_bbr.c tcp: Add new args for cong_control in tcp_congestion_ops 2024-05-02 16:26:56 -07:00
tcp_bic.c
tcp_bpf.c tcp_bpf: Fix copied value in tcp_bpf_sendmsg 2024-12-20 22:53:36 +01:00
tcp_cdg.c
tcp_cong.c tcp: only release congestion control if it has been initialized 2024-10-31 18:22:48 -07:00
tcp_cubic.c tcp_cubic: fix incorrect HyStart round start detection 2025-01-20 12:26:41 +00:00
tcp_dctcp.c tcp: Fix shift-out-of-bounds in dctcp_update_alpha(). 2024-05-21 13:34:50 +02:00
tcp_dctcp.h
tcp_diag.c
tcp_fastopen.c net: use unrcu_pointer() helper 2024-06-06 11:52:52 +02:00
tcp_highspeed.c
tcp_htcp.c tcp: Use clamp() in htcp_alpha_update() 2024-08-06 12:16:25 -07:00
tcp_hybla.c
tcp_illinois.c
tcp_input.c tcp: add LINUX_MIB_PAWS_OLD_ACK SNMP counter 2025-01-14 13:28:13 -08:00
tcp_ipv4.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2025-01-09 16:11:47 -08:00
tcp_lp.c
tcp_metrics.c tcp_metrics: use netlink policy for IPv6 addr len validation 2024-08-19 17:42:57 -07:00
tcp_minisocks.c tcp: Measure TIME-WAIT reuse delay with millisecond precision 2024-12-11 20:17:33 -08:00
tcp_nv.c
tcp_offload.c net: gso: fix tcp fraglist segmentation after pull from frag_list 2024-10-02 17:21:47 -07:00
tcp_output.c tcp: correct handling of extreme memory squeeze 2025-01-29 18:58:15 -08:00
tcp_plb.c
tcp_rate.c
tcp_recovery.c
tcp_scalable.c
tcp_sigpool.c net/tcp_sigpool: Use nested-BH locking for sigpool_scratch. 2024-06-24 16:41:22 -07:00
tcp_timer.c net: tcp: refresh tcp_mstamp for compressed ack in timer 2024-10-07 16:01:39 -07:00
tcp_ulp.c
tcp_vegas.c
tcp_vegas.h
tcp_veno.c
tcp_westwood.c
tcp_yeah.c
tunnel4.c
udp.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2025-01-16 10:34:59 -08:00
udp_bpf.c
udp_diag.c
udp_impl.h
udp_offload.c gso: fix udp gso fraglist segmentation after pull from frag_list 2024-10-02 17:29:31 -07:00
udp_tunnel_core.c ipv4: udp_tunnel: Unmask upper DSCP bits in udp_tunnel_dst_lookup() 2024-09-09 14:14:53 +01:00
udp_tunnel_nic.c
udp_tunnel_stub.c
udplite.c
xfrm4_input.c ipv4: Convert ip_route_input_noref() to dscp_t. 2024-10-03 16:21:21 -07:00
xfrm4_output.c
xfrm4_policy.c xfrm: Convert struct xfrm_dst_lookup_params -> tos to dscp_t. 2024-11-06 12:42:51 +01:00
xfrm4_protocol.c ipv4: Convert ip_route_input_noref() to dscp_t. 2024-10-03 16:21:21 -07:00
xfrm4_state.c
xfrm4_tunnel.c