user/sven/linux.git/lib/crypto/x86, branch v6.17

lib/crypto: x86/sha1-ni: Convert to use rounds macros

2025-07-21T04:42:42Z

The assembly code that does all 80 rounds of SHA-1 is highly repetitive. Replace it with 20 expansions of a macro that does 4 rounds, using the macro arguments and .if directives to handle the slight variations between rounds.

This reduces the length of sha1-ni-asm.S by 129 lines while still producing the exact same object file. This mirrors sha256-ni-asm.S which uses this same strategy.

Reviewed-by: Ard Biesheuvel
Link: https://lore.kernel.org/r/20250718191900.42877-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers
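The same structuring idea can be sketched in portable C. This is not the SHA-NI assembly from the commit; it is a plain scalar SHA-1 block function in which a four-round macro is expanded 20 times and the round index selects the per-round variations. The names rol32, SHA1_ROUND, DO_4ROUNDS and sha1_block_sketch are hypothetical and chosen only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

static inline uint32_t rol32(uint32_t x, unsigned int n)
{
	return (x << n) | (x >> (32 - n));
}

/*
 * One SHA-1 round. 'i' is always a literal constant, so the compiler can
 * fold the comparisons away, loosely analogous to .if in an assembler macro.
 * w[] is a 16-word ring buffer holding the message schedule.
 */
#define SHA1_ROUND(i) do {						\
	uint32_t f, k, t;						\
	if ((i) >= 16)							\
		w[(i) & 15] = rol32(w[((i) + 13) & 15] ^		\
				    w[((i) + 8) & 15] ^			\
				    w[((i) + 2) & 15] ^			\
				    w[(i) & 15], 1);			\
	if ((i) < 20) {							\
		f = (b & c) | (~b & d);					\
		k = 0x5a827999;						\
	} else if ((i) < 40) {						\
		f = b ^ c ^ d;						\
		k = 0x6ed9eba1;						\
	} else if ((i) < 60) {						\
		f = (b & c) | (b & d) | (c & d);			\
		k = 0x8f1bbcdc;						\
	} else {							\
		f = b ^ c ^ d;						\
		k = 0xca62c1d6;						\
	}								\
	t = rol32(a, 5) + f + e + k + w[(i) & 15];			\
	e = d; d = c; c = rol32(b, 30); b = a; a = t;			\
} while (0)

/* Four consecutive rounds starting at round i. */
#define DO_4ROUNDS(i) do {						\
	SHA1_ROUND(i);							\
	SHA1_ROUND((i) + 1);						\
	SHA1_ROUND((i) + 2);						\
	SHA1_ROUND((i) + 3);						\
} while (0)

/* Process one 64-byte block, updating the five-word SHA-1 state. */
static void sha1_block_sketch(uint32_t state[5], const uint8_t block[64])
{
	uint32_t w[16];
	uint32_t a = state[0], b = state[1], c = state[2];
	uint32_t d = state[3], e = state[4];
	int j;

	/* Load the block as 16 big-endian 32-bit words. */
	for (j = 0; j < 16; j++)
		w[j] = (uint32_t)block[4 * j] << 24 |
		       (uint32_t)block[4 * j + 1] << 16 |
		       (uint32_t)block[4 * j + 2] << 8 |
		       (uint32_t)block[4 * j + 3];

	/* 20 expansions of the 4-round macro cover all 80 rounds. */
	DO_4ROUNDS(0);  DO_4ROUNDS(4);  DO_4ROUNDS(8);  DO_4ROUNDS(12);
	DO_4ROUNDS(16); DO_4ROUNDS(20); DO_4ROUNDS(24); DO_4ROUNDS(28);
	DO_4ROUNDS(32); DO_4ROUNDS(36); DO_4ROUNDS(40); DO_4ROUNDS(44);
	DO_4ROUNDS(48); DO_4ROUNDS(52); DO_4ROUNDS(56); DO_4ROUNDS(60);
	DO_4ROUNDS(64); DO_4ROUNDS(68); DO_4ROUNDS(72); DO_4ROUNDS(76);

	state[0] += a; state[1] += b; state[2] += c;
	state[3] += d; state[4] += e;
}
```

Because every round index is a constant at the point of expansion, an optimizing compiler is free to emit straight-line code for all 80 rounds, which is the C-level counterpart of the macro expansion producing the same instruction stream as the hand-unrolled version.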

lib/crypto: x86/sha1-ni: Minor optimizations and cleanup

2025-07-21T04:42:34Z

- Store the previous state in %xmm8-%xmm9 instead of spilling it to the stack. There are plenty of unused XMM registers here, so there is no reason to spill to the stack. (While 32-bit code is limited to %xmm0-%xmm7, this is 64-bit code, so it's free to use %xmm8-%xmm15.)

- Remove the unnecessary check for nblocks == 0. sha1_ni_transform() is always passed a positive nblocks.

- To get an XMM register with 'e' in the high dword and the rest zeroes, just zeroize the register using pxor, then load 'e'. Previously the code loaded 'e', then zeroized the lower dwords by AND-ing with a constant, which was slightly less efficient.

- Instead of computing &DATA_PTR[NBLOCKS << 6] and stopping when DATA_PTR reaches that value, just decrement NBLOCKS on each iteration and stop when it reaches 0. This is fewer instructions.

- Rename DIGEST_PTR to STATE_PTR. It points to the SHA-1 internal state, not a SHA-1 digest value.

This commit shrinks the code size of sha1_ni_transform() from 624 bytes to 589 bytes and also shrinks rodata by 16 bytes.

Reviewed-by: Ard Biesheuvel
Link: https://lore.kernel.org/r/20250718191900.42877-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers
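The loop-control change in the list above can be sketched in C. This is an illustration only, not the assembly: process_block() is a hypothetical stand-in that merely counts blocks, and the two transform_* functions are hypothetical names for the old and new loop shapes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHA1_BLOCK_SIZE 64

/* Hypothetical stand-in for the per-block work; it only counts blocks. */
static size_t blocks_seen;

static void process_block(const uint8_t *block)
{
	(void)block;
	blocks_seen++;
}

/* Old shape: compute &data[nblocks << 6] up front and compare pointers. */
static void transform_end_pointer(const uint8_t *data, size_t nblocks)
{
	const uint8_t *end = data + (nblocks << 6);

	if (nblocks == 0)	/* the check the commit removes */
		return;
	do {
		process_block(data);
		data += SHA1_BLOCK_SIZE;
	} while (data != end);
}

/* New shape: the caller guarantees nblocks >= 1; count down to zero. */
static void transform_countdown(const uint8_t *data, size_t nblocks)
{
	assert(nblocks >= 1);
	do {
		process_block(data);
		data += SHA1_BLOCK_SIZE;
	} while (--nblocks != 0);
}
```

Both functions visit the same blocks for any nblocks >= 1. The countdown form needs no end-pointer computation and no separate zero check, which matches the commit's stated reason of using fewer instructions.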

lib/crypto: x86/sha1: Migrate optimized code into library

2025-07-14T18:28:35Z

Instead of exposing the x86-optimized SHA-1 code via x86-specific crypto_shash algorithms, just implement the sha1_blocks() library function. This is much simpler, it makes the SHA-1 library functions x86-optimized, and it fixes the longstanding issue where the x86-optimized SHA-1 code was disabled by default. SHA-1 still remains available through crypto_shash, but individual architectures no longer need to handle it.

To match sha1_blocks(), change the type of the nblocks parameter of the assembly functions from int to size_t. The assembly functions actually already treated it as size_t.

Reviewed-by: Ard Biesheuvel
Link: https://lore.kernel.org/r/20250712232329.818226-14-ebiggers@kernel.org
Signed-off-by: Eric Biggers

lib/crypto: x86/poly1305: Fix performance regression on short messages

2025-07-11T21:29:42Z

Restore the len >= 288 condition on using the AVX implementation, which was incidentally removed by commit 318c53ae02f2 ("crypto: x86/poly1305 - Add block-only interface"). This check took into account the overhead in key power computation, kernel-mode "FPU", and tail handling associated with the AVX code. Indeed, restoring this check slightly improves performance for len < 256 as measured using poly1305_kunit on an "AMD Ryzen AI 9 365" (Zen 5) CPU:

    Length  Before      After
    ======  ==========  ==========
         1     30 MB/s     36 MB/s
        16    516 MB/s    598 MB/s
        64   1700 MB/s   1882 MB/s
       127   2265 MB/s   2651 MB/s
       128   2457 MB/s   2827 MB/s
       200   2702 MB/s   3238 MB/s
       256   3841 MB/s   3768 MB/s
       511   4580 MB/s   4585 MB/s
       512   5430 MB/s   5398 MB/s
      1024   7268 MB/s   7305 MB/s
      3173   8999 MB/s   8948 MB/s
      4096   9942 MB/s   9921 MB/s
     16384  10557 MB/s  10545 MB/s

While the optimal threshold for this CPU might be slightly lower than 288 (see the len == 256 case), other CPUs would need to be tested too, and these sorts of benchmarks can underestimate the true cost of kernel-mode "FPU". Therefore, for now just restore the 288 threshold.

Fixes: 318c53ae02f2 ("crypto: x86/poly1305 - Add block-only interface")
Cc: stable@vger.kernel.org
Reviewed-by: Ard Biesheuvel
Link: https://lore.kernel.org/r/20250706231100.176113-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers
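A minimal sketch of what such a length threshold looks like in C follows. Only the value 288 comes from the commit message; the enum, the macro name POLY1305_AVX_MIN_LEN, the have_avx parameter and the function poly1305_pick_impl() are hypothetical and do not reflect the actual kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

/* From the commit message: below this length the AVX path is not used. */
#define POLY1305_AVX_MIN_LEN 288

/* Hypothetical labels for the two code paths. */
enum poly1305_impl {
	POLY1305_IMPL_SCALAR,
	POLY1305_IMPL_AVX,
};

/*
 * Pick an implementation for a message of 'len' bytes. Short inputs stay
 * on the scalar path so they do not pay for key power computation,
 * kernel-mode FPU entry and exit, and tail handling.
 */
static enum poly1305_impl poly1305_pick_impl(size_t len, bool have_avx)
{
	if (have_avx && len >= POLY1305_AVX_MIN_LEN)
		return POLY1305_IMPL_AVX;
	return POLY1305_IMPL_SCALAR;
}
```

With this shape, a 287-byte input takes the scalar path even on an AVX-capable CPU, while a 288-byte input takes the AVX path.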

lib/crypto: x86/poly1305: Fix register corruption in no-SIMD contexts

2025-07-11T21:29:42Z

Restore the SIMD usability check and base conversion that were removed by commit 318c53ae02f2 ("crypto: x86/poly1305 - Add block-only interface").

This safety check is cheap and is well worth eliminating a footgun. While the Poly1305 functions should not be called when SIMD registers are unusable, if they are anyway, they should just do the right thing instead of corrupting random tasks' registers and/or computing incorrect MACs. Fixing this is also needed for poly1305_kunit to pass.

Just use irq_fpu_usable() instead of the original crypto_simd_usable(), since poly1305_kunit won't rely on crypto_simd_disabled_for_test.

Fixes: 318c53ae02f2 ("crypto: x86/poly1305 - Add block-only interface")
Cc: stable@vger.kernel.org
Reviewed-by: Ard Biesheuvel
Link: https://lore.kernel.org/r/20250706231100.176113-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers
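The guard described above can be sketched as a userspace model. Everything here is a stand-in: irq_fpu_usable_stub() imitates the kernel's irq_fpu_usable(), the two *_stub block functions only count calls, and poly1305_blocks_sketch() is a hypothetical name. The base conversion mentioned in the message is represented only by a comment, since the message does not spell out its details.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for irq_fpu_usable(); a test can flip this flag. */
static bool fake_fpu_usable = true;

static bool irq_fpu_usable_stub(void)
{
	return fake_fpu_usable;
}

/* Stand-ins for the SIMD and scalar block functions; they count calls. */
static int simd_calls, scalar_calls;

static void blocks_simd_stub(const uint8_t *in, size_t len)
{
	(void)in; (void)len;
	simd_calls++;
}

static void blocks_scalar_stub(const uint8_t *in, size_t len)
{
	(void)in; (void)len;
	scalar_calls++;
}

static void poly1305_blocks_sketch(const uint8_t *in, size_t len)
{
	if (!irq_fpu_usable_stub()) {
		/*
		 * SIMD registers must not be touched here. Per the commit
		 * message, the real code also performs a base conversion of
		 * the state before using the scalar code (omitted here).
		 */
		blocks_scalar_stub(in, len);
		return;
	}
	/* Assumption: the real code brackets this call in a kernel-mode FPU section. */
	blocks_simd_stub(in, len);
}
```

The point of the guard is that a caller in a context where SIMD is unusable still gets a correct result through the scalar path instead of clobbering another task's register state.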

lib/crypto: x86/sha256: Remove unnecessary checks for nblocks==0

2025-07-04T17:23:56Z

Since sha256_blocks() is called only with nblocks >= 1, remove unnecessary checks for nblocks == 0 from the x86 SHA-256 assembly code.

Acked-by: Ard Biesheuvel
Link: https://lore.kernel.org/r/20250704023958.73274-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers

lib/crypto: x86/sha256: Move static_call above kernel-mode FPU section

2025-07-04T17:23:55Z

As I did for sha512_blocks(), reorganize x86's sha256_blocks() to be just a static_call. To achieve that, for each assembly function add a C function that handles the kernel-mode FPU section and fallback. While this increases total code size slightly, the amount of code actually executed on a given system does not increase, and it is slightly more efficient since it eliminates the extra static_key. It also makes the assembly functions be called with standard direct calls instead of static calls, eliminating the need for ANNOTATE_NOENDBR.

Acked-by: Ard Biesheuvel
Link: https://lore.kernel.org/r/20250704023958.73274-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers
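The resulting call structure can be modeled in userspace C, with a plain function pointer playing the role of the static_call. This is a sketch under stated assumptions, not the kernel code: all names ending in _sketch or _stub are hypothetical, the fpu_usable flag stands in for the kernel's usability check, and the "assembly" function is a C stub that only counts calls.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the typed SHA-256 state. */
struct sha256_block_state_sketch {
	uint32_t h[8];
};

typedef void (*sha256_blocks_fn)(struct sha256_block_state_sketch *state,
				 const uint8_t *data, size_t nblocks);

static bool fpu_usable = true;		/* stand-in for the usability check */
static int generic_calls, asm_calls;	/* call counters for illustration */

static void sha256_blocks_generic_stub(struct sha256_block_state_sketch *state,
				       const uint8_t *data, size_t nblocks)
{
	(void)state; (void)data; (void)nblocks;
	generic_calls++;
}

/* Stand-in for one assembly routine; the wrapper calls it directly. */
static void sha256_ni_transform_stub(struct sha256_block_state_sketch *state,
				     const uint8_t *data, size_t nblocks)
{
	(void)state; (void)data; (void)nblocks;
	asm_calls++;
}

/* Per-implementation C wrapper: owns the FPU section and the fallback. */
static void sha256_blocks_ni_sketch(struct sha256_block_state_sketch *state,
				    const uint8_t *data, size_t nblocks)
{
	if (fpu_usable) {
		/* the kernel-mode FPU section would begin here */
		sha256_ni_transform_stub(state, data, nblocks);
		/* and end here */
	} else {
		sha256_blocks_generic_stub(state, data, nblocks);
	}
}

/* Plays the role of the static_call target; defaults to generic code. */
static sha256_blocks_fn sha256_blocks_target = sha256_blocks_generic_stub;

/* Chosen once at init time based on CPU features. */
static void sha256_mod_init_sketch(bool cpu_has_sha_ni)
{
	if (cpu_has_sha_ni)
		sha256_blocks_target = sha256_blocks_ni_sketch;
}

/* The public entry point is nothing but the indirect dispatch. */
static void sha256_blocks_sketch(struct sha256_block_state_sketch *state,
				 const uint8_t *data, size_t nblocks)
{
	sha256_blocks_target(state, data, nblocks);
}
```

In this arrangement the only indirection is at the entry point. The FPU handling and fallback live in the per-implementation wrapper, so no second runtime switch is needed, and the assembly routine is reached by an ordinary direct call.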

lib/crypto: sha256: Consolidate into single module

2025-07-04T17:23:11Z

Consolidate the CPU-based SHA-256 code into a single module, following what I did with SHA-512:

- Each arch now provides a header file lib/crypto/$(SRCARCH)/sha256.h, replacing lib/crypto/$(SRCARCH)/sha256.c. The header defines sha256_blocks() and optionally sha256_mod_init_arch(). It is included by lib/crypto/sha256.c, and thus the code gets built into the single libsha256 module, with proper inlining and dead code elimination.

- sha256_blocks_generic() is moved from lib/crypto/sha256-generic.c into lib/crypto/sha256.c. It's now a static function marked with __maybe_unused, so the compiler automatically eliminates it in any cases where it's not used.

- Whether arch-optimized SHA-256 is buildable is now controlled centrally by lib/crypto/Kconfig instead of by lib/crypto/$(SRCARCH)/Kconfig. The conditions for enabling it remain the same as before, and it remains enabled by default.

- Any additional arch-specific translation units for the optimized SHA-256 code (such as assembly files) are now compiled by lib/crypto/Makefile instead of lib/crypto/$(SRCARCH)/Makefile.

Acked-by: Ard Biesheuvel
Link: https://lore.kernel.org/r/20250630160645.3198-13-ebiggers@kernel.org
Signed-off-by: Eric Biggers
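The single-translation-unit layout described in the list above can be approximated in one C file. This is only a sketch: HAVE_ARCH_SHA256_SKETCH is a hypothetical macro standing in for the Kconfig decision, MAYBE_UNUSED imitates the kernel's __maybe_unused using a GCC/Clang attribute, and the #ifdef branch holds inline what an arch header would normally provide.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace imitation of __maybe_unused (assumes GCC or Clang). */
#define MAYBE_UNUSED __attribute__((unused))

static int generic_calls, arch_calls;	/* call counters for illustration */

/*
 * Always compiled, but static and possibly unreferenced, so the compiler
 * can drop it when the arch code never falls back to it.
 */
static MAYBE_UNUSED void sha256_blocks_generic_sketch(uint32_t state[8],
						      const uint8_t *data,
						      size_t nblocks)
{
	(void)state; (void)data; (void)nblocks;
	generic_calls++;
}

#ifdef HAVE_ARCH_SHA256_SKETCH
/* In the described layout, this part would come from an included arch header. */
static void sha256_blocks_sketch(uint32_t state[8], const uint8_t *data,
				 size_t nblocks)
{
	(void)state; (void)data; (void)nblocks;
	arch_calls++;
}
#else
/* No arch code selected: the entry point is just the generic function. */
static void sha256_blocks_sketch(uint32_t state[8], const uint8_t *data,
				 size_t nblocks)
{
	sha256_blocks_generic_sketch(state, data, nblocks);
}
#endif
```

Because everything ends up in one translation unit, the compiler sees both the generic and the arch code at once, which is what makes the inlining and dead code elimination mentioned in the commit message possible.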

lib/crypto: sha256: Remove sha256_is_arch_optimized()

2025-07-04T17:23:11Z

Remove sha256_is_arch_optimized(), since it is no longer used.

Acked-by: Ard Biesheuvel
Link: https://lore.kernel.org/r/20250630160645.3198-12-ebiggers@kernel.org
Signed-off-by: Eric Biggers

lib/crypto: sha256: Propagate sha256_block_state type to implementations

2025-07-04T17:22:57Z

The previous commit made the SHA-256 compression function state be strongly typed, but it wasn't propagated all the way down to the implementations of it. Do that now.

Acked-by: Ard Biesheuvel
Link: https://lore.kernel.org/r/20250630160645.3198-8-ebiggers@kernel.org
Signed-off-by: Eric Biggers