linux/net
Eric Dumazet c9bee3b7fd tcp: TCP_NOTSENT_LOWAT socket option
Idea of this patch is to add optional limitation of number of
unsent bytes in TCP sockets, to reduce usage of kernel memory.

TCP receiver might announce a big window, and TCP sender autotuning
might allow a large amount of bytes in write queue, but this has little
performance impact if a large part of this buffering is wasted :

Write queue needs to be large only to deal with large BDP, not
necessarily to cope with scheduling delays (incoming ACKS make room
for the application to queue more bytes)

For most workloads, using a value of 128 KB or less is OK to give
applications enough time to react to POLLOUT events in time
(or being awaken in a blocking sendmsg())

This patch adds two ways to set the limit :

1) Per socket option TCP_NOTSENT_LOWAT

2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.

This changes poll()/select()/epoll() to report POLLOUT
only if number of unsent bytes is below tp->nosent_lowat

Note this might increase number of sendmsg()/sendfile() calls
when using non blocking sockets,
and increase number of context switches for blocking sockets.

Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
defined as :
 Specify the minimum number of bytes in the buffer until
 the socket layer will pass the data to the protocol)

Tested:

netperf sessions, and watching /proc/net/protocols "memory" column for TCP

With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

Using 128KB has no bad effect on the throughput or cpu usage
of a single flow, although there is an increase of context switches.

A bonus is that we hold socket lock for a shorter amount
of time and should improve latencies of ACK processing.

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

           412,514 context-switches

     200.034645535 seconds time elapsed

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

         2,675,818 context-switches

     200.029651391 seconds time elapsed

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 17:54:48 -07:00
..
9p net: trans_rdma: remove unused function 2013-07-24 15:46:27 -07:00
802
8021q net: convert resend IGMP to notifier event 2013-07-23 16:52:47 -07:00
appletalk
atm
ax25 net: Convert uses of typedef ctl_table to struct ctl_table 2013-06-13 02:36:09 -07:00
batman-adv net: Unmap fragment page once iterator is done 2013-06-24 01:46:01 -07:00
bluetooth Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2013-07-09 18:24:39 -07:00
bridge net: convert resend IGMP to notifier event 2013-07-23 16:52:47 -07:00
caif
can
ceph net: add sk_stream_is_writeable() helper 2013-07-24 17:54:48 -07:00
core net: add sk_stream_is_writeable() helper 2013-07-24 17:54:48 -07:00
dcb
dccp net: add sk_stream_is_writeable() helper 2013-07-24 17:54:48 -07:00
decnet net: Convert uses of typedef ctl_table to struct ctl_table 2013-06-13 02:36:09 -07:00
dns_resolver net: strict_strtoul is obsolete, use kstrtoul instead 2013-07-12 16:09:14 -07:00
dsa
ethernet net: Fix sysfs_format_mac() code duplication. 2013-07-16 17:09:22 -07:00
ieee802154
ipv4 tcp: TCP_NOTSENT_LOWAT socket option 2013-07-24 17:54:48 -07:00
ipv6 tcp: TCP_NOTSENT_LOWAT socket option 2013-07-24 17:54:48 -07:00
ipx
irda net/irda: fixed style issues in irttp 2013-07-19 17:34:40 -07:00
iucv net: delete __cpuinit usage from all net files 2013-07-14 19:36:58 -04:00
key af_key: fix info leaks in notify messages 2013-06-26 15:15:54 -07:00
l2tp l2tp: make datapath resilient to packet loss when sequence numbers enabled 2013-07-02 16:33:25 -07:00
lapb
llc
mac80211 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2013-07-09 18:24:39 -07:00
mac802154
mpls
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-07-03 14:55:13 -07:00
netlabel
netlink netlink: fix splat in skb_clone with large messages 2013-06-27 22:44:16 -07:00
netrom net: Convert uses of typedef ctl_table to struct ctl_table 2013-06-13 02:36:09 -07:00
nfc NFC: llcp: Fix the well known services endianness 2013-06-14 13:45:10 +02:00
openvswitch openvswitch: Add Kconfig dependency on GRE-DEMUX. 2013-07-01 13:19:43 -07:00
packet net: Provide a generic socket error queue delivery method for Tx time stamps. 2013-07-22 14:58:19 -07:00
phonet net: Convert uses of typedef ctl_table to struct ctl_table 2013-06-13 02:36:09 -07:00
rds net: Convert uses of typedef ctl_table to struct ctl_table 2013-06-13 02:36:09 -07:00
rfkill
rose net: Convert uses of typedef ctl_table to struct ctl_table 2013-06-13 02:36:09 -07:00
rxrpc
sched pkt_sched: sch_qfq: remove a source of high packet delay/jitter 2013-07-18 13:02:00 -07:00
sctp net: sctp: trivial: update mailing list address 2013-07-24 17:53:38 -07:00
sunrpc net: add sk_stream_is_writeable() helper 2013-07-24 17:54:48 -07:00
tipc net/tipc: use %*phC to dump small buffers in hex form 2013-07-11 17:03:36 -07:00
unix Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2013-07-09 18:24:39 -07:00
vmw_vsock VSOCK: Fix VSOCK_HASH and VSOCK_CONN_HASH 2013-06-23 23:51:48 -07:00
wimax
wireless Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem 2013-06-28 13:18:21 -04:00
x25 x25: Fix broken locking in ioctl error paths. 2013-07-01 18:15:25 -07:00
xfrm Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2013-06-26 13:23:13 -07:00
compat.c
Kconfig Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2013-07-09 18:24:39 -07:00
Makefile
nonet.c
socket.c net: rename busy poll socket op and globals 2013-07-10 17:08:27 -07:00
sysctl_net.c