linux

mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-18 22:14:16 +00:00

Author	SHA1	Message	Date
Ard Biesheuvel	779cee8209	crypto: arm64/crct10dif - Remove remaining 64x64 PMULL fallback code The only remaining user of the fallback implementation of 64x64 polynomial multiplication using 8x8 PMULL instructions is the final reduction from a 16 byte vector to a 16-bit CRC. The fallback code is complicated and messy, and this reduction has little impact on the overall performance, so instead, let's calculate the final CRC by passing the 16 byte vector to the generic CRC-T10DIF implementation when running the fallback version. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-11-15 19:52:51 +08:00
Ard Biesheuvel	67dfb1b73f	crypto: arm64/crct10dif - Use faster 16x64 bit polynomial multiply The CRC-T10DIF implementation for arm64 has a version that uses 8x8 polynomial multiplication, for cores that lack the crypto extensions, which cover the 64x64 polynomial multiplication instruction that the algorithm was built around. This fallback version rather naively adopted the 64x64 polynomial multiplication algorithm that I ported from ARM for the GHASH driver, which needs 8 PMULL8 instructions to implement one PMULL64. This is reasonable, given that each 8-bit vector element needs to be multiplied with each element in the other vector, producing 8 vectors with partial results that need to be combined to yield the correct result. However, most PMULL64 invocations in the CRC-T10DIF code involve multiplication by a pair of 16-bit folding coefficients, and so all the partial results from higher order bytes will be zero, and there is no need to calculate them to begin with. Then, the CRC-T10DIF algorithm always XORs the output values of the PMULL64 instructions being issued in pairs, and so there is no need to faithfully implement each individual PMULL64 instruction, as long as XORing the results pairwise produces the expected result. Implementing these improvements results in a speedup of 3.3x on low-end platforms such as Raspberry Pi 4 (Cortex-A72) Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-11-15 19:52:51 +08:00
Ard Biesheuvel	489a4a05fe	crypto: arm64/crct10dif - use frame_push/pop macros consistently Use the frame_push and frame_pop macros to set up the stack frame so that return address protections will be enabled automically when configured. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2022-12-09 18:45:00 +08:00
Ard Biesheuvel	fc754c024a	crypto: arm64/crc-t10dif - move NEON yield to C code Instead of yielding from the bowels of the asm routine if a reschedule is needed, divide up the input into 4 KB chunks in the C glue. This simplifies the code substantially, and avoids scheduling out the task with the asm routine on the call stack, which is undesirable from a CFI/instrumentation point of view. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-02-10 17:55:58 +11:00
Mark Brown	3ca73b70a3	crypto: arm64 - Consistently enable extension Currently most of the crypto files enable the crypto extension using the .arch directive but crct10dif-ce-core.S uses .cpu instead. Move that over to .arch for consistency. Signed-off-by: Mark Brown <broonie@kernel.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2020-04-24 17:42:16 +10:00
Mark Brown	0e89640b64	crypto: arm64 - Use modern annotations for assembly functions In an effort to clarify and simplify the annotation of assembly functions in the kernel new macros have been introduced. These replace ENTRY and ENDPROC and also add a new annotation for static functions which previously had no ENTRY equivalent. Update the annotations in the crypto code to the new macros. There are a small number of files imported from OpenSSL where the assembly is generated using perl programs, these are not currently annotated at all and have not been modified. Signed-off-by: Mark Brown <broonie@kernel.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2019-12-20 14:58:35 +08:00
Eric Biggers	6227cd12e5	crypto: arm64/crct10dif-ce - cleanup and optimizations The x86, arm, and arm64 asm implementations of crct10dif are very difficult to understand partly because many of the comments, labels, and macros are named incorrectly: the lengths mentioned are usually off by a factor of two from the actual code. Many other things are unnecessarily convoluted as well, e.g. there are many more fold constants than actually needed and some aren't fully reduced. This series therefore cleans up all these implementations to be much more maintainable. I also made some small optimizations where I saw opportunities, resulting in slightly better performance. This patch cleans up the arm64 version. Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2019-02-08 15:29:48 +08:00
Ard Biesheuvel	1b2ca568ca	crypto: arm64/crct10dif - remove dead code Remove some code that is no longer called now that we make sure never to invoke the SIMD routine with less than 16 bytes of input. Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2019-02-01 14:45:52 +08:00
Ard Biesheuvel	2fffee536c	crypto: arm64/crct10dif - implement non-Crypto Extensions alternative The arm64 implementation of the CRC-T10DIF algorithm uses the 64x64 bit polynomial multiplication instructions, which are optional in the architecture, and if these instructions are not available, we fall back to the C routine which is slow and inefficient. So let's reuse the 64x64 bit PMULL alternative from the GHASH driver that uses a sequence of ~40 instructions involving 8x8 bit PMULL and some shifting and masking. This is a lot slower than the original, but it is still twice as fast as the current [unoptimized] C code on Cortex-A53, and it is time invariant and much easier on the D-cache. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2018-09-04 11:37:04 +08:00
Ard Biesheuvel	6c1b0da13e	crypto: arm64/crct10dif - preparatory refactor for 8x8 PMULL version Reorganize the CRC-T10DIF asm routine so we can easily instantiate an alternative version based on 8x8 polynomial multiplication in a subsequent patch. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2018-09-04 11:37:04 +08:00
Ard Biesheuvel	5b3da65177	crypto: arm64/crct10dif-ce - yield NEON after every block of input Avoid excessive scheduling delays under a preemptible kernel by yielding the NEON after every block of input. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2018-05-12 00:13:11 +08:00
Ard Biesheuvel	325f562d8f	crypto: arm64/crct10dif - move literal data to .rodata section Move the CRC-T10DIF literal data to the .rodata section where it is safe from being exploited by speculative execution. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2018-01-18 23:00:31 +11:00
Ard Biesheuvel	6ef5737f39	crypto: arm64/crct10dif - port x86 SSE implementation to arm64 This is a transliteration of the Intel algorithm implemented using SSE and PCLMULQDQ instructions that resides in the file arch/x86/crypto/crct10dif-pcl-asm_64.S, but simplified to only operate on buffers that are 16 byte aligned (but of any size) Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2016-12-07 20:01:17 +08:00

13 commits