linux/arch/x86/lib/copy_user_64.S

/* SPDX-License-Identifier: GPL-2.0-only */
/*
 * Copyright 2008 Vitaly Mayatskikh <vmayatsk@redhat.com>
 * Copyright 2002 Andi Kleen, SuSE Labs.
 *
 * Functions to copy from and to user space.
 */

#include <linux/export.h>
#include <linux/linkage.h>
#include <linux/cfi_types.h>
#include <linux/objtool.h>
#include <asm/cpufeatures.h>
#include <asm/alternative.h>
#include <asm/asm.h>

/*
 * rep_movs_alternative - memory copy with exception handling.
 * This version is for CPUs that don't have FSRM (Fast Short Rep Movs)
 *
 * Input:
 * rdi destination
 * rsi source
 * rcx count
 *
 * Output:
 * rcx uncopied bytes or 0 if successful.
 *
 * NOTE! The calling convention is very intentionally the same as
 * for 'rep movs', so that we can rewrite the function call with
 * just a plain 'rep movs' on machines that have FSRM. But to make
 * it simpler for us, we can clobber rsi/rdi and rax freely.
 */
SYM_FUNC_START(rep_movs_alternative)
	ANNOTATE_NOENDBR
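	/*
	 * Size dispatch: 64 bytes and up take the .Llarge path, 8..63
	 * bytes the word-at-a-time loop, a zero count returns immediately,
	 * and 1..7 bytes fall through to the byte-at-a-time tail.
	 */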
	cmpq $64,%rcx
	jae .Llarge

	cmp $8,%ecx
	jae .Lword

	testl %ecx,%ecx
	je .Lexit

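	/*
	 * Byte-at-a-time tail copy. This is also the fixup target for the
	 * word and movsq paths below: on a fault they arrive here with
	 * %rcx holding the number of bytes still to copy.
	 */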
.Lcopy_user_tail:
0:	movb (%rsi),%al
1:	movb %al,(%rdi)
	inc %rdi
	inc %rsi
	dec %rcx
	jne .Lcopy_user_tail
.Lexit:
	RET

	_ASM_EXTABLE_UA( 0b, .Lexit)
	_ASM_EXTABLE_UA( 1b, .Lexit)

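	/*
	 * Word loop: copy 8 bytes at a time for counts of 8..63. A fault
	 * on either access drops back to the byte tail with %rcx still
	 * holding the remaining byte count.
	 */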
	.p2align 4
.Lword:
2:	movq (%rsi),%rax
3:	movq %rax,(%rdi)
	addq $8,%rsi
	addq $8,%rdi
	sub $8,%ecx
	je .Lexit
	cmp $8,%ecx
	jae .Lword
	jmp .Lcopy_user_tail

	_ASM_EXTABLE_UA( 2b, .Lcopy_user_tail)
	_ASM_EXTABLE_UA( 3b, .Lcopy_user_tail)

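	/*
	 * 64 bytes and up: on CPUs with ERMS the ALTERNATIVE patches in a
	 * plain 'rep movsb', everything else jumps to the movsq-based
	 * copy below.
	 */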
.Llarge:
0:	ALTERNATIVE "jmp .Llarge_movsq", "rep movsb", X86_FEATURE_ERMS
1:	RET

	_ASM_EXTABLE_UA( 0b, 1b)

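	/*
	 * A fault in the 'rep movsb' above is fixed up to the RET at 1:,
	 * with %rcx already counted down to the number of uncopied bytes,
	 * so the usual return convention still holds.
	 */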
.Llarge_movsq:
	/* Do the first possibly unaligned word */
0:	movq (%rsi),%rax
1:	movq %rax,(%rdi)
	_ASM_EXTABLE_UA( 0b, .Lcopy_user_tail)
	_ASM_EXTABLE_UA( 1b, .Lcopy_user_tail)

	/* What would be the offset to the aligned destination? */
	leaq 8(%rdi),%rax
	andq $-8,%rax
	subq %rdi,%rax
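	/*
	 * e.g. a destination ending in ...5 gives %rax = ((%rdi + 8) & -8) - %rdi = 3,
	 * skipping the three bytes already written by the unaligned word above;
	 * an already aligned destination gives %rax = 8, skipping the whole word.
	 */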

	/* .. and update pointers and count to match */
	addq %rax,%rdi
	addq %rax,%rsi
	subq %rax,%rcx

	/* make %rcx contain the number of words, %rax the remainder */
	movq %rcx,%rax
	shrq $3,%rcx
	andl $7,%eax
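	/* i.e. %rcx = count / 8 words for 'rep movsq', %eax = count % 8 tail bytes */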
0:	rep movsq
	movl %eax,%ecx
	testl %ecx,%ecx
	jne .Lcopy_user_tail
	RET

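	/*
	 * Fixup for a fault inside 'rep movsq': %rcx holds the remaining
	 * word count, so the remaining byte count is %rax + 8*%rcx. Let
	 * the byte tail finish the job (or fault again at the exact byte).
	 */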
1:	leaq (%rax,%rcx,8),%rcx
	jmp .Lcopy_user_tail
	_ASM_EXTABLE_UA( 0b, 1b)
SYM_FUNC_END(rep_movs_alternative)
EXPORT_SYMBOL(rep_movs_alternative)