perf: Speed up (i)NTT by jan-ferdinand · Pull Request #260 · Neptune-Crypto/twenty-first

jan-ferdinand · 2025-06-13T09:32:39Z

Cache anything that can be cached to speed up NTT and iNTT. The overall performance improvements are quite nice.

On megingjord:

| $2^{23}$ | bfe | xfe |
|:---|---:|---:|
| ntt | -45.0% | -25.9% |
| intt | -44.6% | -25.5% |

On my laptop:

| $2^{23}$ | bfe | xfe |
|:---|---:|---:|
| ntt | -49.2% | -24.5% |
| intt | -49.2% | -23.2% |
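
As an illustration of the kind of caching described above, here is a minimal sketch of memoizing twiddle factors per domain size. It is not the code from this PR: the toy prime `P`, the generator `G`, and the names `pow_mod`, `compute_twiddles`, and `cached_twiddles` are all assumptions made for the example, and the field is deliberately not the one used by twenty-first.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

// Toy NTT-friendly prime 119 * 2^23 + 1 (an assumption for this sketch,
// not the field used by twenty-first) and a primitive root modulo P.
const P: u64 = 998_244_353;
const G: u64 = 3;

/// Square-and-multiply exponentiation modulo P.
fn pow_mod(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1;
    base %= P;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % P;
        }
        base = base * base % P;
        exp >>= 1;
    }
    acc
}

/// Powers omega^0, ..., omega^(n/2 - 1) of a primitive n-th root of unity,
/// where n = 2^log2_n. Only valid for log2_n <= 23 with this toy prime.
fn compute_twiddles(log2_n: u32) -> Vec<u64> {
    let n = 1u64 << log2_n;
    let omega = pow_mod(G, (P - 1) / n);
    let mut twiddles = Vec::with_capacity((n / 2) as usize);
    let mut w = 1;
    for _ in 0..n / 2 {
        twiddles.push(w);
        w = w * omega % P;
    }
    twiddles
}

/// Hypothetical process-wide cache: the twiddles for each domain size are
/// computed once and shared between all later transforms of that size.
fn cached_twiddles(log2_n: u32) -> Arc<Vec<u64>> {
    static CACHE: OnceLock<Mutex<HashMap<u32, Arc<Vec<u64>>>>> = OnceLock::new();
    let mut cache = CACHE
        .get_or_init(|| Mutex::new(HashMap::new()))
        .lock()
        .unwrap();
    Arc::clone(
        cache
            .entry(log2_n)
            .or_insert_with(|| Arc::new(compute_twiddles(log2_n))),
    )
}
```

Calling `cached_twiddles(23)` twice would compute the $2^{22}$ powers only on the first call; the second call returns a clone of the same `Arc`. Whether a global map behind a mutex is the right trade-off depends on contention and memory budget, which this sketch does not address.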

Each of the commits is quite self-contained; feel free to review them in isolation.

Supersedes #258.

Prior to this change, machines with a large number of cores could exhaust the machine's resources, causing a hang in the library. In particular, I observed this on the machine called “megingjord” when executing `Polynomial::par_fast_interpolate()`. This change does not affect performance in any meaningful way.

Not used inside this repo, and not used in downstream dependencies of triton-vm, tasm-lib, or neptune-core.

changelog: ignore

coveralls · 2025-06-13T10:10:55Z

coverage: 97.923% (+0.03%) from 97.895%
when pulling 1cecaeb on jfs/ntt-perf
into 9e96e84 on master.

Sword-Smith

I'd like to see this benchmarked in the TVM context, as that's the only context we really care about. Previously, we saw that only the branch-free bit-reversal calculation gave a speedup. I'd like to make sure that all this caching actually speeds up the TVM prover.
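
For readers unfamiliar with the bit-reversal step mentioned here, the following is a minimal sketch of precomputing the index pairs that a bit-reversal permutation swaps, so that they can be reused across transforms of the same size. The function names are hypothetical, and this is only one plausible reading of "swap indices", not the PR's implementation.

```rust
/// All pairs (i, j) with i < j such that j is the bit-reversal of i
/// when both are viewed as `log2_n`-bit numbers.
fn compute_swap_indices(log2_n: u32) -> Vec<(usize, usize)> {
    let n = 1usize << log2_n;
    (0..n)
        .filter_map(|i| {
            // Reverse all bits of the machine word, then shift the result
            // down so that only the lowest `log2_n` bits are reversed.
            // The guard avoids shifting by the full word width.
            let j = if log2_n == 0 {
                0
            } else {
                i.reverse_bits() >> (usize::BITS - log2_n)
            };
            (i < j).then_some((i, j))
        })
        .collect()
}

/// Apply the bit-reversal permutation in place using precomputed pairs.
/// `values.len()` must equal the domain size the pairs were computed for.
fn bit_reverse_permute<T>(values: &mut [T], swaps: &[(usize, usize)]) {
    for &(i, j) in swaps {
        values.swap(i, j);
    }
}
```

For a domain of size 8, the precomputed pairs are `(1, 4)` and `(3, 6)`; applying them to `[0, 1, 2, 3, 4, 5, 6, 7]` yields `[0, 4, 2, 6, 1, 5, 3, 7]`. The per-transform loop then contains no bit manipulation at all, at the cost of keeping the pair list in memory.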

jan-ferdinand · 2025-06-13T11:24:37Z

Using the suggested changes in Triton VM, I observe the following improvements on megingjord. The padded height in question is $2^{21}$, translating to a FRI domain length of $2^{24}$.

Baseline

The mean runtime is 99.2 seconds.

Benchmark 1: cargo run --release -- --profile prove --program ~/tmp_spin.tasm --input 21
  Time (mean ± σ):     99.238 s ±  0.766 s    [User: 8051.447 s, System: 974.266 s]
  Range (min … max):   98.286 s … 100.893 s    10 runs

Example performance profile

### Triton VM – Prove  99.33s  #Reps  Share  Category  107.8 GiB
├─trace execution  907.01ms  1  0.91%  (gen – 5.13%)  +513.4 MiB
├─Fiat-Shamir: claim  5.81µs  1  0.00%  (hash – 0.00%)  ±0 B
├─derive additional parameters  117.51µs  1  0.00%  ±0 B
├─main tables  47.55s  1  47.87%  +67.7 GiB
│ ├─create  8.39s  1  8.44%  (gen – 47.47%)  +6.2 GiB
│ ├─pad  407.30ms  1  0.41%  (gen – 2.31%)  +15.3 MiB
│ │ ├─pad original tables  262.08ms  1  0.26%  +12.3 MiB
│ │ └─fill degree-lowering table  145.08ms  1  0.15%  +3.0 MiB
│ ├─LDE  23.84s  1  24.00%  (LDE – 54.76%)  +62.2 GiB
│ │ ├─polynomial zero-initialization  1.77µs  1  0.00%  ±0 B
│ │ ├─interpolation  1.81s  1  1.82%  +14.5 GiB
│ │ ├─resize  1.25s  1  1.26%  +47.3 GiB
│ │ ├─evaluation  20.77s  1  20.91%  +192.2 MiB
│ │ └─memoize  1.01µs  1  0.00%  ±0 B
│ ├─Merkle tree  6.67s  1  6.71%  +1.7 GiB
│ │ ├─leafs  5.97s  1  6.01%  +477.0 MiB
│ │ │ └─hash rows  5.97s  1  6.01%  (hash – 45.27%)  +477.0 MiB
│ │ └─Merkle tree  667.53ms  1  0.67%  (hash – 5.07%)  +1.2 GiB
│ ├─Fiat-Shamir  25.21µs  1  0.00%  (hash – 0.00%)  ±0 B
│ └─extend  7.97s  1  8.02%  (gen – 45.09%)  +4.1 GiB
│   ├─initialize master table  159.30ms  1  0.16%  +4.1 GiB
│   ├─slice master table  4.31µs  1  0.00%  ±0 B
│   ├─all tables  7.73s  1  7.79%  +2.8 MiB
│   └─fill degree lowering table  72.28ms  1  0.07%  ±0 B
├─aux tables  19.81s  1  19.95%  +34.3 GiB
│ ├─LDE  14.09s  1  14.19%  (LDE – 32.38%)  +37.2 GiB
│ │ ├─polynomial zero-initialization  1.66µs  1  0.00%  ±0 B
│ │ ├─interpolation  1.79s  1  1.80%  +4.2 GiB
│ │ ├─resize  869.73ms  1  0.88%  +32.9 GiB
│ │ ├─evaluation  11.43s  1  11.51%  +116.9 MiB
│ │ └─memoize  831.00ns  1  0.00%  ±0 B
│ ├─Merkle tree  5.52s  1  5.56%  +1.1 GiB
│ │ ├─leafs  4.81s  1  4.84%  +534.0 MiB
│ │ │ └─hash rows  4.81s  1  4.84%  (hash – 36.50%)  +534.0 MiB
│ │ └─Merkle tree  675.73ms  1  0.68%  (hash – 5.13%)  +1.2 GiB
│ └─Fiat-Shamir  116.92µs  1  0.00%  (hash – 0.00%)  ±0 B
├─quotient calculation (cached)  5.94s  1  5.98%  (CC – 64.14%)  +397.3 MiB
│ ├─zerofier inverse  1.67s  1  1.68%  +521.6 MiB
│ └─evaluate AIR, compute quotient codeword  4.24s  1  4.27%  +385.5 MiB
├─quotient LDE  5.60s  1  5.64%  (LDE – 12.86%)  +1.3 GiB
├─hash rows of quotient segments  394.85ms  1  0.40%  (hash – 3.00%)  +628.5 MiB
├─Merkle tree  663.78ms  1  0.67%  (hash – 5.04%)  +1.3 GiB
├─out-of-domain rows  4.86s  1  4.90%  +108.3 MiB
├─Fiat-Shamir  68.85µs  1  0.00%  (hash – 0.00%)  ±0 B
├─linear combination  4.86s  1  4.89%  +609.3 MiB
│ ├─main  424.33ms  1  0.43%  (CC – 4.58%)  +16.7 MiB
│ ├─aux  383.92ms  1  0.39%  (CC – 4.15%)  +16.6 MiB
│ └─quotient  2.13s  1  2.15%  (CC – 23.04%)  +239.1 MiB
├─DEEP  1.05s  1  1.06%  +1.1 GiB
│ ├─main&aux curr row  346.67ms  1  0.35%  +310.7 MiB
│ ├─main&aux next row  352.10ms  1  0.35%  +458.6 MiB
│ └─segmented quotient  349.27ms  1  0.35%  +350.8 MiB
├─combined DEEP polynomial  378.00ms  1  0.38%  -790.9 MiB
│ └─sum  377.94ms  1  0.38%  (CC – 4.08%)  -790.9 MiB
├─FRI  2.65s  1  2.66%  +43.9 MiB
└─open trace leafs  833.93µs  1  0.00%  ±0 B

Cache Swap-Indices (`ef186b5`)

The mean runtime is 94.2 seconds.

Benchmark 1: cargo run --release -- --profile prove --program ~/tmp_spin.tasm --input 21
  Time (mean ± σ):     94.169 s ±  0.373 s    [User: 7881.071 s, System: 1030.685 s]
  Range (min … max):   93.695 s … 94.849 s    10 runs

Example performance profile

### Triton VM – Prove  93.33s  #Reps  Share  Category  107.1 GiB
├─trace execution  927.96ms  1  0.99%  (gen – 5.42%)  +505.3 MiB
├─Fiat-Shamir: claim  6.17µs  1  0.00%  (hash – 0.00%)  ±0 B
├─derive additional parameters  120.17µs  1  0.00%  ±0 B
├─main tables  45.46s  1  48.71%  +67.9 GiB
│ ├─create  7.88s  1  8.44%  (gen – 46.02%)  +6.3 GiB
│ ├─pad  412.08ms  1  0.44%  (gen – 2.41%)  +8.3 MiB
│ │ ├─pad original tables  261.83ms  1  0.28%  +8.3 MiB
│ │ └─fill degree-lowering table  150.12ms  1  0.16%  ±0 B
│ ├─LDE  22.98s  1  24.62%  (LDE – 54.80%)  +62.1 GiB
│ │ ├─polynomial zero-initialization  2.21µs  1  0.00%  ±0 B
│ │ ├─interpolation  1.86s  1  1.99%  +14.5 GiB
│ │ ├─resize  1.25s  1  1.34%  +47.4 GiB
│ │ ├─evaluation  19.87s  1  21.29%  +303.9 MiB
│ │ └─memoize  721.00ns  1  0.00%  ±0 B
│ ├─Merkle tree  6.01s  1  6.44%  +1.7 GiB
│ │ ├─leafs  5.77s  1  6.18%  +499.5 MiB
│ │ │ └─hash rows  5.77s  1  6.18%  (hash – 51.12%)  +499.5 MiB
│ │ └─Merkle tree  204.48ms  1  0.22%  (hash – 1.81%)  +1.2 GiB
│ ├─Fiat-Shamir  24.70µs  1  0.00%  (hash – 0.00%)  ±0 B
│ └─extend  7.90s  1  8.47%  (gen – 46.16%)  +4.1 GiB
│   ├─initialize master table  161.12ms  1  0.17%  +4.1 GiB
│   ├─slice master table  4.62µs  1  0.00%  ±0 B
│   ├─all tables  7.67s  1  8.22%  +1.5 MiB
│   └─fill degree lowering table  72.25ms  1  0.08%  ±0 B
├─aux tables  19.05s  1  20.41%  +34.3 GiB
│ ├─LDE  14.00s  1  15.00%  (LDE – 33.39%)  +37.3 GiB
│ │ ├─polynomial zero-initialization  1.86µs  1  0.00%  ±0 B
│ │ ├─interpolation  1.78s  1  1.90%  +4.2 GiB
│ │ ├─resize  870.28ms  1  0.93%  +32.1 GiB
│ │ ├─evaluation  11.36s  1  12.17%  +162.5 MiB
│ │ └─memoize  772.00ns  1  0.00%  ±0 B
│ ├─Merkle tree  4.85s  1  5.19%  +1.1 GiB
│ │ ├─leafs  4.61s  1  4.94%  +493.5 MiB
│ │ │ └─hash rows  4.61s  1  4.94%  (hash – 40.87%)  +493.5 MiB
│ │ └─Merkle tree  198.63ms  1  0.21%  (hash – 1.76%)  +1.2 GiB
│ └─Fiat-Shamir  127.75µs  1  0.00%  (hash – 0.00%)  ±0 B
├─quotient calculation (cached)  6.00s  1  6.43%  (CC – 67.83%)  +381.9 MiB
│ ├─zerofier inverse  1.67s  1  1.78%  +524.9 MiB
│ └─evaluate AIR, compute quotient codeword  4.31s  1  4.61%  +364.5 MiB
├─quotient LDE  4.95s  1  5.30%  (LDE – 11.80%)  +1.3 GiB
├─hash rows of quotient segments  290.35ms  1  0.31%  (hash – 2.57%)  +625.5 MiB
├─Merkle tree  209.86ms  1  0.22%  (hash – 1.86%)  +1.3 GiB
├─out-of-domain rows  4.79s  1  5.13%  +120.3 MiB
├─Fiat-Shamir  67.41µs  1  0.00%  (hash – 0.00%)  ±0 B
├─linear combination  4.01s  1  4.29%  +608.6 MiB
│ ├─main  382.33ms  1  0.41%  (CC – 4.32%)  +20.3 MiB
│ ├─aux  348.81ms  1  0.37%  (CC – 3.94%)  +12.3 MiB
│ └─quotient  1.74s  1  1.87%  (CC – 19.70%)  +239.1 MiB
├─DEEP  1.06s  1  1.13%  +1.8 GiB
│ ├─main&aux curr row  351.39ms  1  0.38%  +306.5 MiB
│ ├─main&aux next row  354.57ms  1  0.38%  +467.1 MiB
│ └─segmented quotient  350.81ms  1  0.38%  +334.6 MiB
├─combined DEEP polynomial  371.43ms  1  0.40%  -761.5 MiB
│ └─sum  371.38ms  1  0.40%  (CC – 4.20%)  -761.5 MiB
├─FRI  1.57s  1  1.69%  -10.2 MiB
└─open trace leafs  849.67µs  1  0.00%  ±0 B

Cache Twiddle Factors (`1cecaeb`)

The mean runtime is 93.1 seconds.

Benchmark 1: cargo run --release -- --profile prove --program ~/tmp_spin.tasm --input 21
  Time (mean ± σ):     93.134 s ±  0.368 s    [User: 7820.986 s, System: 1031.994 s]
  Range (min … max):   92.506 s … 93.633 s    10 runs

Example performance profile

### Triton VM – Prove  93.31s  #Reps  Share  Category  102.8 GiB
├─trace execution  920.64ms  1  0.99%  (gen – 5.52%)  +509.8 MiB
├─Fiat-Shamir: claim  7.13µs  1  0.00%  (hash – 0.00%)  ±0 B
├─derive additional parameters  128.80µs  1  0.00%  ±0 B
├─main tables  45.50s  1  48.76%  +59.2 GiB
│ ├─create  7.30s  1  7.83%  (gen – 43.81%)  +6.3 GiB
│ ├─pad  400.23ms  1  0.43%  (gen – 2.40%)  +3.0 MiB
│ │ ├─pad original tables  260.77ms  1  0.28%  +1.5 MiB
│ │ └─fill degree-lowering table  139.32ms  1  0.15%  +1.5 MiB
│ ├─LDE  23.08s  1  24.73%  (LDE – 54.44%)  +62.2 GiB
│ │ ├─polynomial zero-initialization  2.24µs  1  0.00%  ±0 B
│ │ ├─interpolation  1.92s  1  2.06%  +14.5 GiB
│ │ ├─resize  1.25s  1  1.34%  +47.4 GiB
│ │ ├─evaluation  19.91s  1  21.34%  +371.5 MiB
│ │ └─memoize  891.00ns  1  0.00%  ±0 B
│ ├─Merkle tree  5.95s  1  6.38%  +1.1 GiB
│ │ ├─leafs  5.72s  1  6.13%  +534.0 MiB
│ │ │ └─hash rows  5.72s  1  6.13%  (hash – 51.09%)  +534.0 MiB
│ │ └─Merkle tree  196.24ms  1  0.21%  (hash – 1.75%)  +1.2 GiB
│ ├─Fiat-Shamir  24.76µs  1  0.00%  (hash – 0.00%)  ±0 B
│ └─extend  8.05s  1  8.62%  (gen – 48.27%)  +4.2 GiB
│   ├─initialize master table  151.05ms  1  0.16%  +4.1 GiB
│   ├─slice master table  4.87µs  1  0.00%  ±0 B
│   ├─all tables  7.82s  1  8.38%  +36.5 MiB
│   └─fill degree lowering table  74.22ms  1  0.08%  ±0 B
├─aux tables  19.67s  1  21.08%  +34.3 GiB
│ ├─LDE  14.64s  1  15.70%  (LDE – 34.54%)  +37.2 GiB
│ │ ├─polynomial zero-initialization  1.95µs  1  0.00%  ±0 B
│ │ ├─interpolation  2.56s  1  2.74%  +4.2 GiB
│ │ ├─resize  869.29ms  1  0.93%  +32.9 GiB
│ │ ├─evaluation  11.22s  1  12.02%  +138.1 MiB
│ │ └─memoize  992.00ns  1  0.00%  ±0 B
│ ├─Merkle tree  4.83s  1  5.17%  +1.1 GiB
│ │ ├─leafs  4.58s  1  4.91%  +508.5 MiB
│ │ │ └─hash rows  4.58s  1  4.91%  (hash – 40.94%)  +508.5 MiB
│ │ └─Merkle tree  209.24ms  1  0.22%  (hash – 1.87%)  +1.3 GiB
│ └─Fiat-Shamir  110.92µs  1  0.00%  (hash – 0.00%)  ±0 B
├─quotient calculation (cached)  5.85s  1  6.27%  (CC – 69.13%)  +376.5 MiB
│ ├─zerofier inverse  1.61s  1  1.72%  +505.1 MiB
│ └─evaluate AIR, compute quotient codeword  4.21s  1  4.51%  +381.0 MiB
├─quotient LDE  4.67s  1  5.01%  (LDE – 11.02%)  +2.9 GiB
├─hash rows of quotient segments  281.02ms  1  0.30%  (hash – 2.51%)  +642.0 MiB
├─Merkle tree  205.39ms  1  0.22%  (hash – 1.83%)  +1.2 GiB
├─out-of-domain rows  4.85s  1  5.20%  +2.3 GiB
├─Fiat-Shamir  67.82µs  1  0.00%  (hash – 0.00%)  ±0 B
├─linear combination  3.63s  1  3.89%  +607.8 MiB
│ ├─main  358.94ms  1  0.38%  (CC – 4.24%)  +15.3 MiB
│ ├─aux  327.89ms  1  0.35%  (CC – 3.88%)  +18.1 MiB
│ └─quotient  1.57s  1  1.68%  (CC – 18.54%)  +238.5 MiB
├─DEEP  1.10s  1  1.18%  +1.8 GiB
│ ├─main&aux curr row  355.26ms  1  0.38%  +298.3 MiB
│ ├─main&aux next row  372.52ms  1  0.40%  +464.8 MiB
│ └─segmented quotient  376.37ms  1  0.40%  +341.6 MiB
├─combined DEEP polynomial  356.77ms  1  0.38%  -759.2 MiB
│ └─sum  356.71ms  1  0.38%  (CC – 4.22%)  -759.2 MiB
├─FRI  1.59s  1  1.70%  +27.9 MiB
└─open trace leafs  842.97µs  1  0.00%  ±0 B
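
To put the three mean runtimes reported above side by side, the relative change against the baseline works out as follows. The figures are taken from the benchmark means (99.238 s, 94.169 s, 93.134 s); the percentages are rounded to one decimal.

$$
\frac{94.169 - 99.238}{99.238} \approx -5.1\%,
\qquad
\frac{93.134 - 99.238}{99.238} \approx -6.2\%
$$

Going from the swap-index cache alone to the additional twiddle-factor cache is a further change of

$$
\frac{93.134 - 94.169}{94.169} \approx -1.1\%,
$$

that is, about 1.0 s, which is a little under three times the reported σ of roughly 0.37 s for those two runs.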

jan-ferdinand merged 6 commits into master from jfs/ntt-perf.

aszepieniec left a comment

4 participants
