linux/drivers
Eric Dumazet 19757cebf0 tcp: switch orphan_count to bare per-cpu counters
Use of percpu_counter structure to track count of orphaned
sockets is causing problems on modern hosts with 256 cpus
or more.

Stefan Bach reported a serious spinlock contention in real workloads,
that I was able to reproduce with a netfilter rule dropping
incoming FIN packets.

    53.56%  server  [kernel.kallsyms]      [k] queued_spin_lock_slowpath
            |
            ---queued_spin_lock_slowpath
               |
                --53.51%--_raw_spin_lock_irqsave
                          |
                           --53.51%--__percpu_counter_sum
                                     tcp_check_oom
                                     |
                                     |--39.03%--__tcp_close
                                     |          tcp_close
                                     |          inet_release
                                     |          inet6_release
                                     |          sock_close
                                     |          __fput
                                     |          ____fput
                                     |          task_work_run
                                     |          exit_to_usermode_loop
                                     |          do_syscall_64
                                     |          entry_SYSCALL_64_after_hwframe
                                     |          __GI___libc_close
                                     |
                                      --14.48%--tcp_out_of_resources
                                                tcp_write_timeout
                                                tcp_retransmit_timer
                                                tcp_write_timer_handler
                                                tcp_write_timer
                                                call_timer_fn
                                                expire_timers
                                                __run_timers
                                                run_timer_softirq
                                                __softirqentry_text_start

As explained in commit cf86a086a1 ("net/dst: use a smaller percpu_counter
batch for dst entries accounting"), default batch size is too big
for the default value of tcp_max_orphans (262144).

But even if we reduce batch sizes, there would still be cases
where the estimated count of orphans is beyond the limit,
and where tcp_too_many_orphans() has to call the expensive
percpu_counter_sum_positive().

One solution is to use plain per-cpu counters, and have
a timer to periodically refresh this cache.

Updating this cache every 100ms seems about right, tcp pressure
state is not radically changing over shorter periods.

percpu_counter was nice 15 years ago while hosts had less
than 16 cpus, not anymore by current standards.

v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m,
    reported by kernel test robot <lkp@intel.com>
    Remove unused socket argument from tcp_too_many_orphans()

Fixes: dd24c00191 ("net: Use a percpu_counter for orphan_count")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Stefan Bach <sfb@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-15 11:28:34 +01:00
..
accessibility
acpi arm64 fixes: 2021-10-12 11:16:38 -07:00
amba
android binder: make sure fd closes complete 2021-09-14 09:02:13 +02:00
ata libata: Add ATA_HORKAGE_NO_NCQ_ON_ATI for Samsung 860 and 870 SSD. 2021-09-03 08:06:02 -06:00
atm
auxdisplay
base Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-10-14 16:50:14 -07:00
bcma bcma: drop unneeded initialization value 2021-10-05 08:32:30 +03:00
block nbd: use shifts rather than multiplies 2021-09-29 20:31:41 -06:00
bluetooth Bluetooth: Rename driver .prevent_wake to .wakeup 2021-10-01 15:46:15 -07:00
bus bus: ti-sysc: Use CLKDM_NOAUTO for dra7 dcan1 for errata i893 2021-10-06 08:01:13 +03:00
cdrom
char IPMI: A couple of very minor fixes for style and rate limiting 2021-09-12 11:44:58 -07:00
clk One patch to fix an unused variable warning in a Qualcomm clk driver. 2021-09-11 10:05:56 -07:00
clocksource - converted Pistachio platform to use MIPS generic kernel 2021-09-03 11:11:54 -07:00
comedi comedi: Fix memory leak in compat_insnlist() 2021-09-21 17:53:54 +02:00
connector
counter
cpufreq Power management fixes for 5.15-rc2 2021-09-17 12:05:04 -07:00
cpuidle - Core Frameworks 2021-09-07 12:38:59 -07:00
crypto crypto: ccp - fix resource leaks in ccp_run_aes_gcm_cmd() 2021-09-24 15:58:41 +08:00
cxl cxl for v5.15 2021-09-09 11:48:27 -07:00
dax libnvdimm for v5.15 2021-09-09 11:39:57 -07:00
dca
devfreq devfreq: use HZ macros 2021-09-08 11:50:26 -07:00
dio
dma dmaengine updates for v5.15-rc1 2021-09-09 11:07:47 -07:00
dma-buf dma-buf: DMABUF_SYSFS_STATS should depend on DMA_SHARED_BUFFER 2021-09-07 12:42:21 +05:30
edac EDAC/dmc520: Assign the proper type to dimm->edac_mode 2021-09-16 11:00:12 +02:00
eisa
extcon
firewire FireWire (IEEE 1394) subsystem updates: 2021-09-11 09:47:33 -07:00
firmware asm-generic: build fixes for v5.15 2021-10-08 11:57:54 -07:00
fpga fpga: dfl: Avoid reads to AFU CSRs during enumeration 2021-09-16 15:20:55 -07:00
fsi
gnss
gpio gpio fixes for v5.15-rc4 2021-09-30 12:11:35 -07:00
gpu asm-generic: build fixes for v5.15 2021-10-08 11:57:54 -07:00
greybus
hid HID: amd_sfh: Fix potential NULL pointer dereference 2021-09-27 10:00:43 +02:00
hsi net: remove single-byte netdev->dev_addr writes 2021-10-13 10:03:59 -07:00
hv hyperv-fixes for 5.15-rc2 2021-09-15 17:18:56 -07:00
hwmon hwmon: (w83793) Fix NULL pointer dereference by removing unnecessary structure field 2021-10-02 05:14:11 -07:00
hwspinlock
hwtracing coresight: syscfg: Fix compiler warning 2021-09-14 09:03:16 +02:00
i2c i2c: mlxcpld: Modify register setting for 400KHz frequency 2021-10-04 21:56:20 +02:00
i3c
idle
iio iio/test-format: build kunit tests without structleak plugin 2021-10-06 17:53:36 -06:00
infiniband mlx4: replace mlx4_mac_to_u64() with ether_addr_to_u64() 2021-10-05 13:15:35 +01:00
input Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2021-09-11 09:08:28 -07:00
interconnect interconnect: qcom: sdm660: Add missing a2noc qos clocks 2021-09-13 15:49:55 +03:00
iommu asm-generic: build fixes for v5.15 2021-10-08 11:57:54 -07:00
ipack ipack: ipoctal: fix module reference leak 2021-09-27 17:38:49 +02:00
irqchip irqchip/gic: Work around broken Renesas integration 2021-09-22 14:44:25 +01:00
isdn isdn: mISDN: Fix sleeping function called from invalid context 2021-10-09 13:42:51 +01:00
leds leds: pca955x: Switch to i2c probe_new 2021-08-20 11:00:08 +02:00
macintosh memblock: introduce saner 'memblock_free_ptr()' interface 2021-09-14 13:23:22 -07:00
mailbox mailbox: cmdq: add multi-gce clocks support for mt8195 2021-08-31 22:57:45 -05:00
mcb mcb: fix error handling in mcb_alloc_bus() 2021-09-14 11:22:26 +02:00
md md: fix a lock order reversal in md_alloc 2021-09-22 08:45:58 -07:00
media asm-generic: build fixes for v5.15 2021-10-08 11:57:54 -07:00
memory
memstick Driver core update for 5.15-rc1 2021-09-01 08:44:42 -07:00
message
mfd - Core Frameworks 2021-09-07 12:38:59 -07:00
misc misc: bcm-vk: fix tty registration race 2021-09-21 16:17:15 +02:00
mmc asm-generic: build fixes for v5.15 2021-10-08 11:57:54 -07:00
most
mtd Merge branch 'akpm' (patches from Andrew) 2021-09-08 12:55:35 -07:00
mux
net tcp: switch orphan_count to bare per-cpu counters 2021-10-15 11:28:34 +01:00
nfc nfc: microread: drop unneeded debug prints 2021-10-11 17:00:52 -07:00
ntb Bug fixes and clean-ups for Linux v5.15 2021-09-07 13:05:02 -07:00
nubus
nvdimm nvdimm/pmem: fix creating the dax group 2021-09-27 11:40:43 -07:00
nvme nvme: add command id quirk for apple controllers 2021-09-27 10:02:07 -06:00
nvmem nvmem: NVMEM_NINTENDO_OTP should depend on WII 2021-09-21 17:38:37 +02:00
of Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-10-14 16:50:14 -07:00
opp Merge branches 'pm-pci', 'pm-sleep', 'pm-domains' and 'powercap' 2021-08-30 19:25:42 +02:00
parisc parisc: Move pci_dev_is_behind_card_dino to where it is used 2021-09-09 12:44:31 +02:00
parport parisc architecture updates for kernel 5.15: 2021-09-02 13:16:00 -07:00
pci s390 update for v5.15-rc5 2021-10-08 16:46:09 -07:00
pcmcia ethernet: replace netdev->dev_addr assignment loops 2021-10-14 09:22:25 -07:00
perf KVM: arm64: Fix PMU probe ordering 2021-09-20 12:43:34 +01:00
phy Merge branch 'akpm' (patches from Andrew) 2021-09-08 12:55:35 -07:00
pinctrl asm-generic: build fixes for v5.15 2021-10-08 11:57:54 -07:00
platform platform/x86: int1092: Fix non sequential device mode handling 2021-10-11 16:39:25 +02:00
pnp
power power supply and reset changes for the v5.15 series 2021-08-30 11:47:32 -07:00
powercap powercap: Add Power Limit4 support for Alder Lake SoC 2021-08-25 20:12:16 +02:00
pps
ps3
ptp Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-10-07 15:24:06 -07:00
pwm pwm: mtk-disp: Implement atomic API .get_state() 2021-09-02 22:27:46 +02:00
rapidio
ras
regulator regulator: max14577: Revert "regulator: max14577: Add proper module aliases strings" 2021-09-17 13:16:38 +01:00
remoteproc
reset ARM: SoC drivers for 5.15 2021-09-01 15:25:28 -07:00
rpmsg
rtc rtc: cmos: Disable irq around direct invocation of cmos_interrupt() 2021-09-14 10:20:19 +02:00
s390 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-10-07 15:24:06 -07:00
sbus
scsi Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-10-14 16:50:14 -07:00
sh
siox
slimbus Driver core update for 5.15-rc1 2021-09-01 08:44:42 -07:00
soc Fixes for omaps for v5.15 2021-10-07 21:13:57 +02:00
soundwire sound updates for 5.15-rc1 2021-09-01 10:29:29 -07:00
spi spi: Fix modalias issues 2021-09-22 11:58:24 -07:00
spmi
ssb
staging Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-10-07 15:24:06 -07:00
target scsi: target: Fix spelling mistake "CONFLIFT" -> "CONFLICT" 2021-09-22 00:17:29 -04:00
tc
tee tee/optee/shm_pool: fix application of sizeof to pointer 2021-09-14 07:54:56 +02:00
thermal thermal/drivers/tsens: Fix wrong check for tzd in irq handlers 2021-09-21 15:17:11 +02:00
thunderbolt thunderbolt: build kunit tests without structleak plugin 2021-10-06 17:53:49 -06:00
tty xen: branch for v5.15-rc5 2021-10-08 12:55:23 -07:00
uio
usb Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-10-14 16:50:14 -07:00
vdpa vdpa/mlx5: Avoid executing set_vq_ready() if device is reset 2021-09-14 18:10:43 -04:00
vfio vfio/pci: add missing identifier name in argument of function prototype 2021-09-23 14:12:36 -06:00
vhost virtio,vdpa: fixes 2021-09-28 07:27:29 -07:00
video video: fbdev: gbefb: Only instantiate device when built for IP32 2021-10-06 11:12:28 +02:00
virt
virtio virtio: don't fail on !of_device_is_compatible 2021-09-14 18:09:57 -04:00
visorbus
vlynq
vme
w1
watchdog watchdog/sb_watchdog: fix compilation problem due to COMPILE_TEST 2021-09-27 11:57:19 -07:00
xen xen: branch for v5.15-rc5 2021-10-08 12:55:23 -07:00
zorro
Kconfig firmware: include drivers/firmware/Kconfig unconditionally 2021-10-07 16:51:26 +02:00
Makefile