perf: Speed up (i)NTT #260
Prior to this change, machines with a large number of cores could exhaust the machine's resources, causing a hang in the library. In particular, I observed this on the machine called “megingjord” when executing `Polynomial::par_fast_interpolate()`. This change does not affect performance in any meaningful way.
Not used inside this repo, and not used in downstream dependencies of triton-vm, tasm-lib, or neptune-core.
changelog: ignore
Sword-Smith
left a comment
I'd like to see this benchmarked in the TVM context, as that's the only context we really care about. Previously, we saw that only the branch free bitreversal calculation gave a speed up. I'd like to make sure that all this caching actually speeds up the TVM prover.
Using the suggested changes in Triton VM, I observe the following improvements on megingjord. The padded height in question is

Baseline

The mean runtime is 99.2 seconds.

Example performance profile

### Triton VM – Prove  99.33s  #Reps  Share  Category  107.8 GiB
├─trace execution  907.01ms  1  0.91%  (gen – 5.13%)  +513.4 MiB
├─Fiat-Shamir: claim  5.81µs  1  0.00%  (hash – 0.00%)  ±0 B
├─derive additional parameters  117.51µs  1  0.00%  ±0 B
├─main tables  47.55s  1  47.87%  +67.7 GiB
│ ├─create  8.39s  1  8.44%  (gen – 47.47%)  +6.2 GiB
│ ├─pad  407.30ms  1  0.41%  (gen – 2.31%)  +15.3 MiB
│ │ ├─pad original tables  262.08ms  1  0.26%  +12.3 MiB
│ │ └─fill degree-lowering table  145.08ms  1  0.15%  +3.0 MiB
│ ├─LDE  23.84s  1  24.00%  (LDE – 54.76%)  +62.2 GiB
│ │ ├─polynomial zero-initialization  1.77µs  1  0.00%  ±0 B
│ │ ├─interpolation  1.81s  1  1.82%  +14.5 GiB
│ │ ├─resize  1.25s  1  1.26%  +47.3 GiB
│ │ ├─evaluation  20.77s  1  20.91%  +192.2 MiB
│ │ └─memoize  1.01µs  1  0.00%  ±0 B
│ ├─Merkle tree  6.67s  1  6.71%  +1.7 GiB
│ │ ├─leafs  5.97s  1  6.01%  +477.0 MiB
│ │ │ └─hash rows  5.97s  1  6.01%  (hash – 45.27%)  +477.0 MiB
│ │ └─Merkle tree  667.53ms  1  0.67%  (hash – 5.07%)  +1.2 GiB
│ ├─Fiat-Shamir  25.21µs  1  0.00%  (hash – 0.00%)  ±0 B
│ └─extend  7.97s  1  8.02%  (gen – 45.09%)  +4.1 GiB
│   ├─initialize master table  159.30ms  1  0.16%  +4.1 GiB
│   ├─slice master table  4.31µs  1  0.00%  ±0 B
│   ├─all tables  7.73s  1  7.79%  +2.8 MiB
│   └─fill degree lowering table  72.28ms  1  0.07%  ±0 B
├─aux tables  19.81s  1  19.95%  +34.3 GiB
│ ├─LDE  14.09s  1  14.19%  (LDE – 32.38%)  +37.2 GiB
│ │ ├─polynomial zero-initialization  1.66µs  1  0.00%  ±0 B
│ │ ├─interpolation  1.79s  1  1.80%  +4.2 GiB
│ │ ├─resize  869.73ms  1  0.88%  +32.9 GiB
│ │ ├─evaluation  11.43s  1  11.51%  +116.9 MiB
│ │ └─memoize  831.00ns  1  0.00%  ±0 B
│ ├─Merkle tree  5.52s  1  5.56%  +1.1 GiB
│ │ ├─leafs  4.81s  1  4.84%  +534.0 MiB
│ │ │ └─hash rows  4.81s  1  4.84%  (hash – 36.50%)  +534.0 MiB
│ │ └─Merkle tree  675.73ms  1  0.68%  (hash – 5.13%)  +1.2 GiB
│ └─Fiat-Shamir  116.92µs  1  0.00%  (hash – 0.00%)  ±0 B
├─quotient calculation (cached)  5.94s  1  5.98%  (CC – 64.14%)  +397.3 MiB
│ ├─zerofier inverse  1.67s  1  1.68%  +521.6 MiB
│ └─evaluate AIR, compute quotient codeword  4.24s  1  4.27%  +385.5 MiB
├─quotient LDE  5.60s  1  5.64%  (LDE – 12.86%)  +1.3 GiB
├─hash rows of quotient segments  394.85ms  1  0.40%  (hash – 3.00%)  +628.5 MiB
├─Merkle tree  663.78ms  1  0.67%  (hash – 5.04%)  +1.3 GiB
├─out-of-domain rows  4.86s  1  4.90%  +108.3 MiB
├─Fiat-Shamir  68.85µs  1  0.00%  (hash – 0.00%)  ±0 B
├─linear combination  4.86s  1  4.89%  +609.3 MiB
│ ├─main  424.33ms  1  0.43%  (CC – 4.58%)  +16.7 MiB
│ ├─aux  383.92ms  1  0.39%  (CC – 4.15%)  +16.6 MiB
│ └─quotient  2.13s  1  2.15%  (CC – 23.04%)  +239.1 MiB
├─DEEP  1.05s  1  1.06%  +1.1 GiB
│ ├─main&aux curr row  346.67ms  1  0.35%  +310.7 MiB
│ ├─main&aux next row  352.10ms  1  0.35%  +458.6 MiB
│ └─segmented quotient  349.27ms  1  0.35%  +350.8 MiB
├─combined DEEP polynomial  378.00ms  1  0.38%  -790.9 MiB
│ └─sum  377.94ms  1  0.38%  (CC – 4.08%)  -790.9 MiB
├─FRI  2.65s  1  2.66%  +43.9 MiB
└─open trace leafs  833.93µs  1  0.00%  ±0 B

Cache Swap-Indices (ef186b5)

The mean runtime is 94.2 seconds.

Example performance profile

### Triton VM – Prove  93.33s  #Reps  Share  Category  107.1 GiB
├─trace execution  927.96ms  1  0.99%  (gen – 5.42%)  +505.3 MiB
├─Fiat-Shamir: claim  6.17µs  1  0.00%  (hash – 0.00%)  ±0 B
├─derive additional parameters  120.17µs  1  0.00%  ±0 B
├─main tables  45.46s  1  48.71%  +67.9 GiB
│ ├─create  7.88s  1  8.44%  (gen – 46.02%)  +6.3 GiB
│ ├─pad  412.08ms  1  0.44%  (gen – 2.41%)  +8.3 MiB
│ │ ├─pad original tables  261.83ms  1  0.28%  +8.3 MiB
│ │ └─fill degree-lowering table  150.12ms  1  0.16%  ±0 B
│ ├─LDE  22.98s  1  24.62%  (LDE – 54.80%)  +62.1 GiB
│ │ ├─polynomial zero-initialization  2.21µs  1  0.00%  ±0 B
│ │ ├─interpolation  1.86s  1  1.99%  +14.5 GiB
│ │ ├─resize  1.25s  1  1.34%  +47.4 GiB
│ │ ├─evaluation  19.87s  1  21.29%  +303.9 MiB
│ │ └─memoize  721.00ns  1  0.00%  ±0 B
│ ├─Merkle tree  6.01s  1  6.44%  +1.7 GiB
│ │ ├─leafs  5.77s  1  6.18%  +499.5 MiB
│ │ │ └─hash rows  5.77s  1  6.18%  (hash – 51.12%)  +499.5 MiB
│ │ └─Merkle tree  204.48ms  1  0.22%  (hash – 1.81%)  +1.2 GiB
│ ├─Fiat-Shamir  24.70µs  1  0.00%  (hash – 0.00%)  ±0 B
│ └─extend  7.90s  1  8.47%  (gen – 46.16%)  +4.1 GiB
│   ├─initialize master table  161.12ms  1  0.17%  +4.1 GiB
│   ├─slice master table  4.62µs  1  0.00%  ±0 B
│   ├─all tables  7.67s  1  8.22%  +1.5 MiB
│   └─fill degree lowering table  72.25ms  1  0.08%  ±0 B
├─aux tables  19.05s  1  20.41%  +34.3 GiB
│ ├─LDE  14.00s  1  15.00%  (LDE – 33.39%)  +37.3 GiB
│ │ ├─polynomial zero-initialization  1.86µs  1  0.00%  ±0 B
│ │ ├─interpolation  1.78s  1  1.90%  +4.2 GiB
│ │ ├─resize  870.28ms  1  0.93%  +32.1 GiB
│ │ ├─evaluation  11.36s  1  12.17%  +162.5 MiB
│ │ └─memoize  772.00ns  1  0.00%  ±0 B
│ ├─Merkle tree  4.85s  1  5.19%  +1.1 GiB
│ │ ├─leafs  4.61s  1  4.94%  +493.5 MiB
│ │ │ └─hash rows  4.61s  1  4.94%  (hash – 40.87%)  +493.5 MiB
│ │ └─Merkle tree  198.63ms  1  0.21%  (hash – 1.76%)  +1.2 GiB
│ └─Fiat-Shamir  127.75µs  1  0.00%  (hash – 0.00%)  ±0 B
├─quotient calculation (cached)  6.00s  1  6.43%  (CC – 67.83%)  +381.9 MiB
│ ├─zerofier inverse  1.67s  1  1.78%  +524.9 MiB
│ └─evaluate AIR, compute quotient codeword  4.31s  1  4.61%  +364.5 MiB
├─quotient LDE  4.95s  1  5.30%  (LDE – 11.80%)  +1.3 GiB
├─hash rows of quotient segments  290.35ms  1  0.31%  (hash – 2.57%)  +625.5 MiB
├─Merkle tree  209.86ms  1  0.22%  (hash – 1.86%)  +1.3 GiB
├─out-of-domain rows  4.79s  1  5.13%  +120.3 MiB
├─Fiat-Shamir  67.41µs  1  0.00%  (hash – 0.00%)  ±0 B
├─linear combination  4.01s  1  4.29%  +608.6 MiB
│ ├─main  382.33ms  1  0.41%  (CC – 4.32%)  +20.3 MiB
│ ├─aux  348.81ms  1  0.37%  (CC – 3.94%)  +12.3 MiB
│ └─quotient  1.74s  1  1.87%  (CC – 19.70%)  +239.1 MiB
├─DEEP  1.06s  1  1.13%  +1.8 GiB
│ ├─main&aux curr row  351.39ms  1  0.38%  +306.5 MiB
│ ├─main&aux next row  354.57ms  1  0.38%  +467.1 MiB
│ └─segmented quotient  350.81ms  1  0.38%  +334.6 MiB
├─combined DEEP polynomial  371.43ms  1  0.40%  -761.5 MiB
│ └─sum  371.38ms  1  0.40%  (CC – 4.20%)  -761.5 MiB
├─FRI  1.57s  1  1.69%  -10.2 MiB
└─open trace leafs  849.67µs  1  0.00%  ±0 B

Cache Twiddle Factors (1cecaeb)

The mean runtime is 93.1 seconds.

Example performance profile

### Triton VM – Prove  93.31s  #Reps  Share  Category  102.8 GiB
├─trace execution  920.64ms  1  0.99%  (gen – 5.52%)  +509.8 MiB
├─Fiat-Shamir: claim  7.13µs  1  0.00%  (hash – 0.00%)  ±0 B
├─derive additional parameters  128.80µs  1  0.00%  ±0 B
├─main tables  45.50s  1  48.76%  +59.2 GiB
│ ├─create  7.30s  1  7.83%  (gen – 43.81%)  +6.3 GiB
│ ├─pad  400.23ms  1  0.43%  (gen – 2.40%)  +3.0 MiB
│ │ ├─pad original tables  260.77ms  1  0.28%  +1.5 MiB
│ │ └─fill degree-lowering table  139.32ms  1  0.15%  +1.5 MiB
│ ├─LDE  23.08s  1  24.73%  (LDE – 54.44%)  +62.2 GiB
│ │ ├─polynomial zero-initialization  2.24µs  1  0.00%  ±0 B
│ │ ├─interpolation  1.92s  1  2.06%  +14.5 GiB
│ │ ├─resize  1.25s  1  1.34%  +47.4 GiB
│ │ ├─evaluation  19.91s  1  21.34%  +371.5 MiB
│ │ └─memoize  891.00ns  1  0.00%  ±0 B
│ ├─Merkle tree  5.95s  1  6.38%  +1.1 GiB
│ │ ├─leafs  5.72s  1  6.13%  +534.0 MiB
│ │ │ └─hash rows  5.72s  1  6.13%  (hash – 51.09%)  +534.0 MiB
│ │ └─Merkle tree  196.24ms  1  0.21%  (hash – 1.75%)  +1.2 GiB
│ ├─Fiat-Shamir  24.76µs  1  0.00%  (hash – 0.00%)  ±0 B
│ └─extend  8.05s  1  8.62%  (gen – 48.27%)  +4.2 GiB
│   ├─initialize master table  151.05ms  1  0.16%  +4.1 GiB
│   ├─slice master table  4.87µs  1  0.00%  ±0 B
│   ├─all tables  7.82s  1  8.38%  +36.5 MiB
│   └─fill degree lowering table  74.22ms  1  0.08%  ±0 B
├─aux tables  19.67s  1  21.08%  +34.3 GiB
│ ├─LDE  14.64s  1  15.70%  (LDE – 34.54%)  +37.2 GiB
│ │ ├─polynomial zero-initialization  1.95µs  1  0.00%  ±0 B
│ │ ├─interpolation  2.56s  1  2.74%  +4.2 GiB
│ │ ├─resize  869.29ms  1  0.93%  +32.9 GiB
│ │ ├─evaluation  11.22s  1  12.02%  +138.1 MiB
│ │ └─memoize  992.00ns  1  0.00%  ±0 B
│ ├─Merkle tree  4.83s  1  5.17%  +1.1 GiB
│ │ ├─leafs  4.58s  1  4.91%  +508.5 MiB
│ │ │ └─hash rows  4.58s  1  4.91%  (hash – 40.94%)  +508.5 MiB
│ │ └─Merkle tree  209.24ms  1  0.22%  (hash – 1.87%)  +1.3 GiB
│ └─Fiat-Shamir  110.92µs  1  0.00%  (hash – 0.00%)  ±0 B
├─quotient calculation (cached)  5.85s  1  6.27%  (CC – 69.13%)  +376.5 MiB
│ ├─zerofier inverse  1.61s  1  1.72%  +505.1 MiB
│ └─evaluate AIR, compute quotient codeword  4.21s  1  4.51%  +381.0 MiB
├─quotient LDE  4.67s  1  5.01%  (LDE – 11.02%)  +2.9 GiB
├─hash rows of quotient segments  281.02ms  1  0.30%  (hash – 2.51%)  +642.0 MiB
├─Merkle tree  205.39ms  1  0.22%  (hash – 1.83%)  +1.2 GiB
├─out-of-domain rows  4.85s  1  5.20%  +2.3 GiB
├─Fiat-Shamir  67.82µs  1  0.00%  (hash – 0.00%)  ±0 B
├─linear combination  3.63s  1  3.89%  +607.8 MiB
│ ├─main  358.94ms  1  0.38%  (CC – 4.24%)  +15.3 MiB
│ ├─aux  327.89ms  1  0.35%  (CC – 3.88%)  +18.1 MiB
│ └─quotient  1.57s  1  1.68%  (CC – 18.54%)  +238.5 MiB
├─DEEP  1.10s  1  1.18%  +1.8 GiB
│ ├─main&aux curr row  355.26ms  1  0.38%  +298.3 MiB
│ ├─main&aux next row  372.52ms  1  0.40%  +464.8 MiB
│ └─segmented quotient  376.37ms  1  0.40%  +341.6 MiB
├─combined DEEP polynomial  356.77ms  1  0.38%  -759.2 MiB
│ └─sum  356.71ms  1  0.38%  (CC – 4.22%)  -759.2 MiB
├─FRI  1.59s  1  1.70%  +27.9 MiB
└─open trace leafs  842.97µs  1  0.00%  ±0 B
Previously, n = 1. Good benchmarks take time.
Sword-Smith
left a comment
LGTM. Very cool speedup!!
aszepieniec
left a comment
On the whole it looks good. I added some comments inline. Please address how you please before merging.
Uses branch-free bitreverse, which was developed with help from https://stackoverflow.com/a/9144870/2574407.
On the machine known as “megingjord”:
bfe_ntt/len/23 time: [304.41 ms 304.85 ms 305.33 ms]
change: [−29.258% −28.787% −28.391%] (p = 0.00 < 0.05)
xfe_ntt/len/23 time: [661.13 ms 662.03 ms 663.05 ms]
change: [−19.874% −19.685% −19.443%] (p = 0.00 < 0.05)
bfe_intt/len/23 time: [311.42 ms 311.89 ms 312.42 ms]
change: [−28.255% −27.928% −27.626%] (p = 0.00 < 0.05)
xfe_intt/len/23 time: [681.64 ms 682.69 ms 683.84 ms]
change: [−18.723% −18.525% −18.328%] (p = 0.00 < 0.05)
Co-authored-by: Jan Ferdinand Sauer <ferdinand@neptune.cash>
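For readers unfamiliar with the idea, here is a minimal sketch of a branch-free bit reversal as it is commonly written with masks and shifts. It is not the code from this commit: the function names `bitreverse_u32`, `bitreverse`, and `bitreverse_order` are hypothetical and chosen for illustration only.

```rust
/// Reverse all 32 bits of `n` using only masks and shifts.
/// There is no loop and no data-dependent branch.
fn bitreverse_u32(mut n: u32) -> u32 {
    // Swap adjacent bits, then bit pairs, nibbles, bytes, and half-words.
    n = ((n & 0x5555_5555) << 1) | ((n >> 1) & 0x5555_5555);
    n = ((n & 0x3333_3333) << 2) | ((n >> 2) & 0x3333_3333);
    n = ((n & 0x0f0f_0f0f) << 4) | ((n >> 4) & 0x0f0f_0f0f);
    n = ((n & 0x00ff_00ff) << 8) | ((n >> 8) & 0x00ff_00ff);
    (n << 16) | (n >> 16)
}

/// Reverse only the lowest `log2_len` bits of `index`.
/// Assumes 1 <= log2_len <= 32.
fn bitreverse(index: u32, log2_len: u32) -> u32 {
    debug_assert!((1..=32).contains(&log2_len));
    bitreverse_u32(index) >> (32 - log2_len)
}

/// Permute a power-of-two-length slice into bit-reversed order, which is
/// the reordering step of an iterative NTT.
fn bitreverse_order<T>(values: &mut [T]) {
    let len = values.len();
    if len < 2 {
        return;
    }
    assert!(len.is_power_of_two() && len <= 1 << 31);
    let log2_len = len.trailing_zeros();
    for i in 0..len as u32 {
        let j = bitreverse(i, log2_len);
        // Swap each pair exactly once.
        if i < j {
            values.swap(i as usize, j as usize);
        }
    }
}
```

"Branch-free" here refers to the index calculation only; the permutation loop still compares `i` and `j`. Rust's standard library also offers `u32::reverse_bits`, which computes the same full-width reversal.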
Performance change within noise threshold. changelog: ignore
On the machine known as “megingjord”:
bfe_ntt/len/23 time: [238.36 ms 238.66 ms 239.00 ms]
change: [−21.801% −21.654% −21.507%] (p = 0.00 < 0.05)
xfe_ntt/len/23 time: [601.28 ms 603.79 ms 607.49 ms]
change: [−7.2785% −6.8430% −6.2522%] (p = 0.00 < 0.05)
bfe_intt/len/23 time: [242.37 ms 242.79 ms 243.28 ms]
change: [−22.017% −21.810% −21.598%] (p = 0.00 < 0.05)
xfe_intt/len/23 time: [621.95 ms 622.50 ms 623.08 ms]
change: [−9.1162% −8.9436% −8.7779%] (p = 0.00 < 0.05)
Co-authored-by: Jan Ferdinand Sauer <ferdinand@neptune.cash>
2594c64 to 1cecaeb
Cache anything that can be cached to speed up NTT and iNTT. The overall performance improvements are quite nice.
On megingjord:
On my laptop:
Each of the commits is quite self-contained; feel free to review them in isolation.
Supersedes #258.
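To illustrate what "cache anything that can be cached" can mean for an NTT, here is a minimal sketch that precomputes the swap indices and twiddle factors once per transform length and reuses them on later calls. It rests on assumptions that are not taken from this PR: it works in a toy prime field (p = 998244353 with generator 3) rather than the library's field, and the names `NttTables`, `build_tables`, `cached_tables`, and `ntt` are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

// Toy NTT-friendly prime (119 * 2^23 + 1); NOT the field used by the library.
const P: u64 = 998_244_353;
const GENERATOR: u64 = 3;

fn mul(a: u64, b: u64) -> u64 {
    a * b % P // operands are below 2^30, so the product fits in a u64
}

fn pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Everything about a length-2^k transform that does not depend on the input.
struct NttTables {
    swap_indices: Vec<(usize, usize)>, // pairs (i, j) with i < j to swap
    twiddles: Vec<u64>,                // omega^0, ..., omega^(n/2 - 1)
}

fn build_tables(log2_len: u32) -> NttTables {
    assert!(log2_len <= 23, "the toy field only has 2^23-th roots of unity");
    let n = 1usize << log2_len;
    let omega = pow(GENERATOR, (P - 1) / n as u64);
    let mut twiddles = Vec::with_capacity(n / 2);
    let mut w = 1;
    for _ in 0..n / 2 {
        twiddles.push(w);
        w = mul(w, omega);
    }
    let swap_indices: Vec<(usize, usize)> = (0..n)
        .filter_map(|i| {
            let j = if log2_len == 0 { 0 } else { i.reverse_bits() >> (usize::BITS - log2_len) };
            (i < j).then_some((i, j))
        })
        .collect();
    NttTables { swap_indices, twiddles }
}

/// Return the tables for the given length, computing them on first use only.
fn cached_tables(log2_len: u32) -> Arc<NttTables> {
    static CACHE: OnceLock<Mutex<HashMap<u32, Arc<NttTables>>>> = OnceLock::new();
    let mut cache = CACHE.get_or_init(|| Mutex::new(HashMap::new())).lock().unwrap();
    let tables = cache.entry(log2_len).or_insert_with(|| Arc::new(build_tables(log2_len)));
    Arc::clone(tables)
}

/// In-place iterative NTT; all entries of `values` must be smaller than P.
fn ntt(values: &mut [u64]) {
    let n = values.len();
    assert!(n.is_power_of_two());
    let tables = cached_tables(n.trailing_zeros());
    for &(i, j) in &tables.swap_indices {
        values.swap(i, j);
    }
    let mut half = 1;
    while half < n {
        let stride = n / (2 * half); // step through the shared twiddle table
        for block in values.chunks_mut(2 * half) {
            for k in 0..half {
                let t = mul(tables.twiddles[k * stride], block[k + half]);
                let u = block[k];
                block[k] = (u + t) % P;
                block[k + half] = (u + P - t) % P;
            }
        }
        half *= 2;
    }
}
```

In this sketch the cache lock is held while a table is built, which keeps the code short but serializes the first call per length. A production implementation would likely want a different strategy under heavy parallelism.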