Commit graph

3 commits

Author SHA1 Message Date
Eric Biggers
118da22eb6 lib/crc: x86/crc32c: Enable VPCLMULQDQ optimization where beneficial
Improve crc32c() performance on lengths >= 512 bytes by using
crc32_lsb_vpclmul_avx512() instead of crc32c_x86_3way(), when the CPU
supports VPCLMULQDQ and has a "good" implementation of AVX-512.  For now
that means AMD Zen 4 and later, and Intel Sapphire Rapids and later.
Pass crc32_lsb_vpclmul_avx512() the table of constants needed to make it
use the CRC-32C polynomial.

Rationale: VPCLMULQDQ performance has improved on newer CPUs, making
crc32_lsb_vpclmul_avx512() faster than crc32c_x86_3way(), even though
crc32_lsb_vpclmul_avx512() is designed for generic 32-bit CRCs and does
not utilize x86_64's dedicated CRC-32C instructions.

Performance results for len=4096 using crc_kunit:

    CPU                        Before (MB/s)     After (MB/s)
    ======================     =============     ============
    AMD Zen 4 (Genoa)                  19868            28618
    AMD Zen 5 (Ryzen AI 9 365)         24080            46940
    AMD Zen 5 (Turin)                  29566            58468
    Intel Sapphire Rapids              22340            73794
    Intel Emerald Rapids               24696            78666

Performance results for len=512 using crc_kunit:

    CPU                        Before (MB/s)     After (MB/s)
    ======================     =============     ============
    AMD Zen 4 (Genoa)                   7251             7758
    AMD Zen 5 (Ryzen AI 9 365)         17481            19135
    AMD Zen 5 (Turin)                  21332            25424
    Intel Sapphire Rapids              18886            29312
    Intel Emerald Rapids               19675            29045

That being said, in the above benchmarks the ZMM registers are "warm",
so they don't quite tell the whole story.  While significantly improved
from older Intel CPUs, Intel still has ~2000 ns of ZMM warm-up time
where 512-bit instructions execute 4 times more slowly than they
normally do.  In contrast, AMD does better and has virtually zero ZMM
warm-up time (at most ~60 ns).  Thus, while this change is always
beneficial on AMD, strictly speaking there are cases in which it is not
beneficial on Intel, e.g. a small number of 512-byte messages with
"cold" ZMM registers.  But typically, it is beneficial even on Intel.

Note that on AMD Zen 3--5, crc32c() performance could be further
improved with implementations that interleave crc32q and VPCLMULQDQ
instructions.  Unfortunately, it appears that a different such
implementation would be optimal on *each* of these microarchitectures.
Such improvements are left for future work.  This commit just improves
the way that we choose the implementations we already have.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250719224938.126512-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-07-20 20:52:34 -07:00
Eric Biggers
110628e55a lib/crc: x86: Reorganize crc-pclmul static_call initialization
Reorganize the crc-pclmul static_call initialization to place more of
the logic in the *_mod_init_arch() functions instead of in the
INIT_CRC_PCLMUL macro.  This provides the flexibility to do more than a
single static_call update for each CPU feature check.  Right away,
optimize crc64_mod_init_arch() to check the CPU features just once
instead of twice, doing both the crc64_msb and crc64_lsb static_call
updates together.  A later commit will also use this to initialize an
additional static_key when crc32_lsb_vpclmul_avx512() is enabled.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250719224938.126512-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-07-20 20:52:28 -07:00
Eric Biggers
b10749d89f lib/crc: x86: Migrate optimized CRC code into lib/crc/
Move the x86-optimized CRC code from arch/x86/lib/crc* into its new
location in lib/crc/x86/, and wire it up in the new way.  This new way
of organizing the CRC code eliminates the need to artificially split the
code for each CRC variant into separate arch and generic modules,
enabling better inlining and dead code elimination.  For more details,
see "lib/crc: Prepare for arch-optimized code in subdirs of lib/crc/".

Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: "Jason A. Donenfeld" <Jason@zx2c4.com>
Link: https://lore.kernel.org/r/20250607200454.73587-12-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-06-30 09:31:57 -07:00