linux/net/ipv4
Daniel Borkmann c5c6a8ab45 net: tcp: add key management to congestion control
This patch adds necessary infrastructure to the congestion control
framework for later per route congestion control support.

For a per route congestion control possibility, our aim is to store
a unique u32 key identifier into dst metrics, which can then be
mapped into a tcp_congestion_ops struct. We argue that an RTAX key
entry is the simplest, most generic and easiest option to manage,
and that it also keeps the memory footprint of dst entries lower on
64-bit than storing a pointer directly, for example. Having a unique
key id also allows for decoupling actual TCP congestion control
module management from the FIB layer, i.e. we don't have to care
about expensive module refcounting inside the FIB at this point.

We first thought of using an IDR store for the realization, which
takes over dynamic assignment of unused key space and also performs
the key to pointer mapping in RCU. While doing so, we stumbled upon
the issue that, due to the nature of dynamic key distribution,
excessive module loads and unloads can, on arguably very rare
occasions, lead to reuse of previously used key space. Stale keys
left in the dst metric would then be mapped to a different
congestion control algorithm, which might lead to unexpected
behaviour. One way to resolve this
would have been to walk FIBs on the actually rare occasion of a
module unload and reset the metric keys for each FIB in each netns,
but that's just very costly.

Therefore, we argue a better solution is to reuse the unique
congestion control algorithm name member and map that into u32 key
space through jhash. For that, we split the flags attribute (as it
currently uses 2 bits only anyway) into two u32 attributes, flags
and key, so that we can keep the cacheline boundary of 2 cachelines
on x86_64 and cache the precalculated key at registration time for
the fast path. On average we might expect 2 - 4 modules to be loaded,
worst case perhaps 15, so the probability of a key collision is
extremely low, and keys are guaranteed collision-free on LE/BE for
all in-tree modules.
Overall this results in much simpler code, and all without the
overhead of an IDR. Due to the deterministic nature, modules can
now be unloaded: the congestion control algorithm for a specific
but unloaded key will fall back to the default one, and on module
reload it will switch back to the expected algorithm
transparently.
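
The name-to-key scheme described above can be illustrated with a small
userspace sketch. It is not the kernel code: the ca_* names, the table
size and the "reno" default are hypothetical, and a 32-bit FNV-1a hash
stands in for jhash so that the example stays self-contained. It only
shows the idea of caching a deterministic key at registration time and
falling back to a default while the matching module is not loaded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CA_NAME_MAX 16	/* hypothetical limits for this sketch */
#define CA_SLOTS    16

struct ca_ops {
	char name[CA_NAME_MAX];
	uint32_t key;	/* precalculated once at registration time */
	int in_use;
};

static struct ca_ops ca_table[CA_SLOTS];
static const char ca_default_name[] = "reno";	/* assumed default */

/* Deterministic name -> u32 key mapping (FNV-1a here, not jhash). */
uint32_t ca_name_to_key(const char *name)
{
	uint32_t hash = 2166136261u;

	for (; *name; name++) {
		hash ^= (unsigned char)*name;
		hash *= 16777619u;
	}
	return hash;
}

/* Returns 0 on success, -1 on key collision or when the table is full. */
int ca_register(const char *name)
{
	uint32_t key = ca_name_to_key(name);
	struct ca_ops *slot = NULL;
	int i;

	assert(strlen(name) < CA_NAME_MAX);
	for (i = 0; i < CA_SLOTS; i++) {
		if (ca_table[i].in_use && ca_table[i].key == key)
			return -1;	/* same key already registered */
		if (!ca_table[i].in_use && !slot)
			slot = &ca_table[i];
	}
	if (!slot)
		return -1;
	strcpy(slot->name, name);	/* length checked above */
	slot->key = key;
	slot->in_use = 1;
	return 0;
}

void ca_unregister(const char *name)
{
	uint32_t key = ca_name_to_key(name);
	int i;

	for (i = 0; i < CA_SLOTS; i++)
		if (ca_table[i].in_use && ca_table[i].key == key)
			ca_table[i].in_use = 0;
}

/* Resolve a stored key; unknown keys fall back to the default. */
const char *ca_name_for_key(uint32_t key)
{
	int i;

	for (i = 0; i < CA_SLOTS; i++)
		if (ca_table[i].in_use && ca_table[i].key == key)
			return ca_table[i].name;
	return ca_default_name;
}
```

Because the key depends only on the name, a key stored while "cubic"
was registered resolves to the default after ca_unregister("cubic")
and to "cubic" again after a later ca_register("cubic"), without
anything having to rewrite the stored key.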

Joint work with Florian Westphal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-05 22:55:24 -05:00
..
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2014-12-11 14:27:06 -08:00
af_inet.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-11-29 20:47:48 -08:00
ah4.c
arp.c neigh: remove dynamic neigh table registration support 2014-11-11 15:23:54 -05:00
cipso_ipv4.c
datagram.c
devinet.c
esp4.c net: esp: Convert NETDEBUG to pr_info 2014-11-06 15:11:10 -05:00
fib_frontend.c fib_trie: Push rcu_read_lock/unlock to callers 2014-12-31 18:25:54 -05:00
fib_lookup.h
fib_rules.c fib_trie: Push rcu_read_lock/unlock to callers 2014-12-31 18:25:54 -05:00
fib_semantics.c ipv4: fix nexthop attlen check in fib_nh_match 2014-10-14 15:59:37 -04:00
fib_trie.c fib_trie: Add tracking value for suffix length 2014-12-31 18:25:55 -05:00
fou.c ip: Move checksum convert defines to inet 2015-01-05 22:44:46 -05:00
geneve.c geneve: Check family when reusing sockets. 2015-01-04 22:21:33 -05:00
gre_demux.c
gre_offload.c gre: Set inner mac header in gro complete 2014-12-05 21:18:34 -08:00
icmp.c icmp: Remove some spurious dropped packet profile hits from the ICMP path 2014-11-18 15:28:28 -05:00
igmp.c ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs 2014-11-16 16:55:06 -05:00
inet_connection_sock.c
inet_diag.c
inet_fragment.c net: Convert LIMIT_NETDEBUG to net_dbg_ratelimited 2014-11-11 14:10:31 -05:00
inet_hashtables.c
inet_lro.c
inet_timewait_sock.c
inetpeer.c
ip_forward.c
ip_fragment.c net: Convert LIMIT_NETDEBUG to net_dbg_ratelimited 2014-11-11 14:10:31 -05:00
ip_gre.c gre: allow live address change 2014-12-31 14:18:28 -05:00
ip_input.c
ip_options.c
ip_output.c put iov_iter into msghdr 2014-12-09 16:29:03 -05:00
ip_sockglue.c ip: Add offset parameter to ip_cmsg_recv 2015-01-05 22:44:46 -05:00
ip_tunnel.c ip_tunnel: Add missing validation of encap type to ip_tunnel_encap_setup() 2014-12-16 15:20:41 -05:00
ip_tunnel_core.c
ip_vti.c ip_tunnel: the lack of vti_link_ops' dellink() cause kernel panic 2014-11-23 21:11:17 -05:00
ipcomp.c
ipconfig.c
ipip.c fou: Fix typo in returning flags in netlink 2014-11-05 22:18:20 -05:00
ipmr.c
Kconfig net: Move fou_build_header into fou.c and refactor 2014-11-05 16:30:02 -05:00
Makefile
netfilter.c
ping.c put iov_iter into msghdr 2014-12-09 16:29:03 -05:00
proc.c tcp_cubic: add SNMP counters to track how effective is Hystart 2014-12-09 14:58:23 -05:00
protocol.c
raw.c put iov_iter into msghdr 2014-12-09 16:29:03 -05:00
route.c
syncookies.c
sysctl_net_ipv4.c
tcp.c Merge branch 'for-davem-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-12-10 13:17:23 -05:00
tcp_bic.c
tcp_cong.c net: tcp: add key management to congestion control 2015-01-05 22:55:24 -05:00
tcp_cubic.c tcp_cubic: refine Hystart delay threshold 2014-12-09 14:58:23 -05:00
tcp_dctcp.c
tcp_diag.c
tcp_fastopen.c
tcp_highspeed.c
tcp_htcp.c
tcp_hybla.c
tcp_illinois.c
tcp_input.c switch tcp_sock->ucopy from iovec (ucopy.iov) to msghdr (ucopy.msg) 2014-12-09 16:28:22 -05:00
tcp_ipv4.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-12-10 15:48:20 -05:00
tcp_lp.c
tcp_memcontrol.c mm: memcontrol: lockless page counters 2014-12-10 17:41:04 -08:00
tcp_metrics.c
tcp_minisocks.c
tcp_offload.c net: Remove MPLS GSO feature. 2014-11-05 23:52:33 -08:00
tcp_output.c Merge branch 'for-davem-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-12-10 13:17:23 -05:00
tcp_probe.c
tcp_scalable.c
tcp_timer.c net: Convert LIMIT_NETDEBUG to net_dbg_ratelimited 2014-11-11 14:10:31 -05:00
tcp_vegas.c
tcp_vegas.h
tcp_veno.c
tcp_westwood.c
tcp_yeah.c
tunnel4.c
udp.c ip: Add offset parameter to ip_cmsg_recv 2015-01-05 22:44:46 -05:00
udp_diag.c
udp_impl.h
udp_offload.c net: Remove MPLS GSO feature. 2014-11-05 23:52:33 -08:00
udp_tunnel.c ip: Move checksum convert defines to inet 2015-01-05 22:44:46 -05:00
udplite.c
xfrm4_input.c
xfrm4_mode_beet.c
xfrm4_mode_transport.c
xfrm4_mode_tunnel.c
xfrm4_output.c
xfrm4_policy.c
xfrm4_protocol.c
xfrm4_state.c
xfrm4_tunnel.c