field: force-inline 5x52 mul and sqr #1859
Ran the benchmarks on an i9-14900HX, built with GCC 12.3, and confirmed the speedups.
Results (Min us, lower is better):
Seeing a ~12.5% speedup for both ECDSA and Schnorr verification and ~3% for signing on my arm64 machine (Snapdragon X Elite - X1E-78-100), using GCC 14.2.0:
master:
PR:
Applying this change to the Bitcoin Core secp256k1 subtree (Branch apply-secp-pr1859) shows the speedup in the script verification benchmarks as well (run via
master (commit theStack/bitcoin@654a522):
PR applied (commit theStack/bitcoin@494a473):
Concept ACK
That's a very interesting observation. So far, we tried to stay away from guiding the compiler too much, but the ratio of added complexity vs. gains here is pretty good.
@l0rinc What I always wanted to try is profile-guided optimization, e.g., where the profile is generated in a benchmark run that only performs signature verification (this could even be done automatically as part of the build process). I imagine there could be more low-hanging fruit. Would you be interested in looking into this as well?
The 5x52 field multiplication and squaring routines are hot in group arithmetic and scalar multiplication. Use the new `SECP256K1_FORCE_INLINE` for the thin wrappers and `int128` inner helpers so compilers can schedule the 64x64->128 arithmetic without a call boundary.
The helper uses forced inlining in optimized release-style builds, but falls back to `SECP256K1_INLINE` when no-inline, size optimization, or debug-style macros ask not to force it.
Across the measured GCC and MSVC Release builds, this improves ECDSA verification by 0.6% to 9.1%, ECDH by 0.7% to 9.3%, and Schnorr verification by 0.6% to 9.6%. The direct field benchmarks generally show the intended effect on field squaring and multiplication, while Clang results are mostly flat and less consistently positive.
This is a code-size tradeoff: the tested static library builds grew by about 4.6% to 4.7%, and the tested Windows Release DLL grew by 14.1%.
Co-authored-by: Sebastian Falbesoner <sebastian.falbesoner@gmail.com>
Concept ACK. Master: This PR: (GCC 15.2.0 on Ryzen 5950X)
Force-pushed from ac915c9 to 1c537ab.
Problem: The 5x52 field multiplication and squaring routines are hot in group arithmetic and scalar multiplication. Some compilers leave the thin wrappers and int128 inner helpers out of line, which keeps a call boundary in this hot path and limits scheduling of the 64x64->128 arithmetic.
Fix: Define `SECP256K1_FORCE_INLINE` next to the existing inline helper and use it for the 5x52 multiplication and squaring wrappers and `int128` inner helpers.
For default optimized builds, this expands to `__forceinline` on MSVC-compatible compilers and to `__attribute__((always_inline))` on GCC-compatible compilers. It falls back to the existing inline spelling when inlining is disabled, when optimization is disabled, when optimizing for size on GCC/Clang, or when `_DEBUG` is defined.
Benchmarks: Values are relative changes in `Min(us)`, lower is better.
Tradeoffs: The speedups reproduce most consistently with GCC and MSVC. Clang was less consistently positive.
Inlining also increases code size:
`libsecp256k1.a`, `libsecp256k1.a`, `libsecp256k1-*.dll`
Linux benchmarking script
Linux size comparison script
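The fallback rules described under "Fix" can be sketched roughly as follows. This is an illustration only, not the macro from the PR: the `DEMO_*` names, the `DEMO_NO_FORCE_INLINE` opt-out switch, and both helper functions are hypothetical, and the sketch assumes that GCC-compatible compilers signal optimization through their predefined `__OPTIMIZE__` and `__OPTIMIZE_SIZE__` macros. The helper pair mimics the "thin wrapper over an inner 64x64->128 helper" shape that the PR targets.

```c
#include <stdint.h>

/* Hypothetical stand-in for the existing inline helper macro. */
#define DEMO_INLINE inline

/* Pick a forced-inline spelling only for optimized, non-debug builds. */
#if defined(DEMO_NO_FORCE_INLINE) || defined(_DEBUG)
   /* Explicit opt-out or debug-style build: keep the plain inline hint. */
#  define DEMO_FORCE_INLINE DEMO_INLINE
#elif defined(_MSC_VER)
   /* MSVC-compatible compilers. */
#  define DEMO_FORCE_INLINE __forceinline
#elif defined(__GNUC__) && defined(__OPTIMIZE__) && !defined(__OPTIMIZE_SIZE__)
   /* GCC-compatible compilers, optimizing, but not for size. */
#  define DEMO_FORCE_INLINE DEMO_INLINE __attribute__((always_inline))
#else
   /* Unoptimized, size-optimized, or unknown compiler: plain inline. */
#  define DEMO_FORCE_INLINE DEMO_INLINE
#endif

/* Inner helper: 64x64->128 multiply. Returns the low 64 bits and stores
 * the high 64 bits in *hi. */
static DEMO_FORCE_INLINE uint64_t demo_mul128(uint64_t a, uint64_t b, uint64_t *hi) {
#if defined(__SIZEOF_INT128__)
    /* Native 128-bit integer type (GCC/Clang on 64-bit targets). */
    unsigned __int128 r = (unsigned __int128)a * b;
    *hi = (uint64_t)(r >> 64);
    return (uint64_t)r;
#else
    /* Portable fallback: schoolbook multiply on 32-bit halves. */
    uint64_t a0 = a & 0xFFFFFFFFu, a1 = a >> 32;
    uint64_t b0 = b & 0xFFFFFFFFu, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    uint64_t mid = (p00 >> 32) + (p01 & 0xFFFFFFFFu) + (p10 & 0xFFFFFFFFu);
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    return (mid << 32) | (p00 & 0xFFFFFFFFu);
#endif
}

/* Thin wrapper over the inner helper: returns only the high 64 bits. */
static DEMO_FORCE_INLINE uint64_t demo_mul_hi(uint64_t a, uint64_t b) {
    uint64_t hi;
    (void)demo_mul128(a, b, &hi);
    return hi;
}
```

With both functions forced inline, a caller of `demo_mul_hi` in a hot loop sees the multiplication directly, with no call boundary between wrapper and helper. Building with `-Os`, with `-O0`, or with `-DDEMO_NO_FORCE_INLINE` selects the plain `inline` spelling and leaves the decision to the compiler.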
host: M4-Max.local, compiler: gcc-14 (Homebrew GCC 14.3.0) 14.3.0
host: WIN-A2EHOAU4JET (Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz), system: Microsoft Windows NT 10.0.20348.0, compiler: Microsoft (R) C/C++ Optimizing Compiler Version 19.50.35728 for x64
host: i9-ssd, compiler: gcc (GCC) 16.1.0
host: i7-hdd, compiler: gcc (Ubuntu 14.2.0-19ubuntu2) 14.2.0
host: rpi5-16-3, compiler: gcc (Ubuntu 14.2.0-19ubuntu2) 14.2.0
host: rpi4-2-1, compiler: gcc (Ubuntu 14.2.0-19ubuntu2) 14.2.0
host: umbrel (Intel(R) N150), compiler: gcc (Debian 12.2.0-14+deb12u1) 12.2.0
host: nodl (Cortex-A53), compiler: gcc (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Reviewer measurements
andrewtoth, i9-14900HX, GCC 12.3
theStack, Snapdragon X Elite X1E-78-100, GCC 14.2.0
Bitcoin Core subtree
`bench_bitcoin -filter=VerifyScript.*`:
sipa, Ryzen 5950X, GCC 15.2.0
clang:
host: i9-ssd, compiler: Ubuntu clang version 22.1.6 (++20260508084839+c0262e742787-1~exp1~20260508204859.77)
reindex-chainstate:
2026-05-28 | reindex-chainstate | 950059 blocks | dbcache 5000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | SSD