perf: Speed up (i)NTT#260

Merged
jan-ferdinand merged 6 commits into master from jfs/ntt-perf
Jun 16, 2025

@jan-ferdinand
Member

Cache anything that can be cached to speed up NTT and iNTT. In the benchmarks below, for inputs of length $2^{23}$, runtimes drop by roughly 45% to 49% for bfe and 23% to 26% for xfe.

On megingjord:

| $2^{23}$ | bfe | xfe |
|----------|--------|--------|
| ntt | -45.0% | -25.9% |
| intt | -44.6% | -25.5% |

On my laptop:

| $2^{23}$ | bfe | xfe |
|----------|--------|--------|
| ntt | -49.2% | -24.5% |
| intt | -49.2% | -23.2% |

Each of the commits is quite self-contained; feel free to review them in isolation.

Supersedes #258.
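
To illustrate what "cache anything that can be cached" can mean for an NTT, here is a minimal sketch of a twiddle-factor cache. It is not the code from this PR: the toy modulus `998_244_353` with generator `3` stands in for the library's field, and the names `TWIDDLES`, `twiddle_factors`, and `pow_mod` are hypothetical. The point is only that the powers of the root of unity for a given domain length are computed once and then shared.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// Toy NTT-friendly prime 119 * 2^23 + 1; a stand-in, not the library's field.
pub const P: u64 = 998_244_353;
/// 3 generates the multiplicative group modulo `P`.
pub const GENERATOR: u64 = 3;

/// Square-and-multiply exponentiation modulo `P`.
pub fn pow_mod(mut base: u64, mut exp: u64) -> u64 {
    let mut acc: u64 = 1;
    base %= P;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % P; // both factors < 2^30, so no overflow
        }
        base = base * base % P;
        exp >>= 1;
    }
    acc
}

/// Hypothetical cache: log2 of the domain length -> powers of the root of unity.
static TWIDDLES: OnceLock<Mutex<HashMap<u32, Arc<Vec<u64>>>>> = OnceLock::new();

/// Returns [w^0, w^1, ..., w^(n/2 - 1)] for n = 2^log2_len, computing the
/// table only on the first request for that length.
pub fn twiddle_factors(log2_len: u32) -> Arc<Vec<u64>> {
    assert!((1..=23).contains(&log2_len), "P has 2-power roots up to 2^23");
    let cache = TWIDDLES.get_or_init(|| Mutex::new(HashMap::new()));
    let mut cache = cache.lock().unwrap();
    let entry = cache.entry(log2_len).or_insert_with(|| {
        let n = 1_u64 << log2_len;
        let omega = pow_mod(GENERATOR, (P - 1) / n);
        let mut table = Vec::with_capacity((n / 2) as usize);
        let mut power: u64 = 1;
        for _ in 0..n / 2 {
            table.push(power);
            power = power * omega % P;
        }
        Arc::new(table)
    });
    Arc::clone(entry)
}
```

A second call such as `twiddle_factors(3)` returns a clone of the same `Arc`, so repeated transforms of the same length skip the exponentiations entirely. The sketch holds the lock while filling the table, which is simple but serializes first-time requests; a real implementation may choose differently.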

jan-ferdinand and others added 3 commits June 11, 2025 09:20
Prior to this change, machines with a large number of cores could
exhaust the machine's resources, causing a hang in the library. In
particular, I observed this on the machine called “megingjord” when
executing `Polynomial::par_fast_interpolate()`.

This change does not affect performance in any meaningful way.

Not used inside this repo, and not used in downstream dependencies of
triton-vm, tasm-lib, or neptune-core.
@coveralls

coveralls commented Jun 13, 2025

coverage: 97.923% (+0.03%) from 97.895%
when pulling 1cecaeb on jfs/ntt-perf
into 9e96e84 on master.

Member

@Sword-Smith Sword-Smith left a comment

I'd like to see this benchmarked in the TVM context, as that's the only context we really care about. Previously, we saw that only the branch-free bit-reversal calculation gave a speedup. I'd like to make sure that all this caching actually speeds up the TVM prover.

@jan-ferdinand
Member Author

jan-ferdinand commented Jun 13, 2025

Using the suggested changes in Triton VM, I observe the following improvements on megingjord. The padded height in question is $2^{21}$, translating to a FRI domain length of $2^{24}$.

Baseline

The mean runtime is 99.2 seconds.

Benchmark 1: cargo run --release -- --profile prove --program ~/tmp_spin.tasm --input 21
  Time (mean ± σ):     99.238 s ±  0.766 s    [User: 8051.447 s, System: 974.266 s]
  Range (min … max):   98.286 s … 100.893 s    10 runs
Example performance profile
### Triton VM – Prove                          99.33s    #Reps   Share  Category          107.8 GiB
├─trace execution                             907.01ms       1   0.91%  (gen  –  5.13%)  +513.4 MiB
├─Fiat-Shamir: claim                            5.81µs       1   0.00%  (hash –  0.00%)      ±0 B
├─derive additional parameters                117.51µs       1   0.00%                       ±0 B
├─main tables                                  47.55s        1  47.87%                    +67.7 GiB
│ ├─create                                      8.39s        1   8.44%  (gen  – 47.47%)    +6.2 GiB
│ ├─pad                                       407.30ms       1   0.41%  (gen  –  2.31%)   +15.3 MiB
│ │ ├─pad original tables                     262.08ms       1   0.26%                    +12.3 MiB
│ │ └─fill degree-lowering table              145.08ms       1   0.15%                     +3.0 MiB
│ ├─LDE                                        23.84s        1  24.00%  (LDE  – 54.76%)   +62.2 GiB
│ │ ├─polynomial zero-initialization            1.77µs       1   0.00%                       ±0 B
│ │ ├─interpolation                             1.81s        1   1.82%                    +14.5 GiB
│ │ ├─resize                                    1.25s        1   1.26%                    +47.3 GiB
│ │ ├─evaluation                               20.77s        1  20.91%                   +192.2 MiB
│ │ └─memoize                                   1.01µs       1   0.00%                       ±0 B
│ ├─Merkle tree                                 6.67s        1   6.71%                     +1.7 GiB
│ │ ├─leafs                                     5.97s        1   6.01%                   +477.0 MiB
│ │ │ └─hash rows                               5.97s        1   6.01%  (hash – 45.27%)  +477.0 MiB
│ │ └─Merkle tree                             667.53ms       1   0.67%  (hash –  5.07%)    +1.2 GiB
│ ├─Fiat-Shamir                                25.21µs       1   0.00%  (hash –  0.00%)      ±0 B
│ └─extend                                      7.97s        1   8.02%  (gen  – 45.09%)    +4.1 GiB
│   ├─initialize master table                 159.30ms       1   0.16%                     +4.1 GiB
│   ├─slice master table                        4.31µs       1   0.00%                       ±0 B
│   ├─all tables                                7.73s        1   7.79%                     +2.8 MiB
│   └─fill degree lowering table               72.28ms       1   0.07%                       ±0 B
├─aux tables                                   19.81s        1  19.95%                    +34.3 GiB
│ ├─LDE                                        14.09s        1  14.19%  (LDE  – 32.38%)   +37.2 GiB
│ │ ├─polynomial zero-initialization            1.66µs       1   0.00%                       ±0 B
│ │ ├─interpolation                             1.79s        1   1.80%                     +4.2 GiB
│ │ ├─resize                                  869.73ms       1   0.88%                    +32.9 GiB
│ │ ├─evaluation                               11.43s        1  11.51%                   +116.9 MiB
│ │ └─memoize                                 831.00ns       1   0.00%                       ±0 B
│ ├─Merkle tree                                 5.52s        1   5.56%                     +1.1 GiB
│ │ ├─leafs                                     4.81s        1   4.84%                   +534.0 MiB
│ │ │ └─hash rows                               4.81s        1   4.84%  (hash – 36.50%)  +534.0 MiB
│ │ └─Merkle tree                             675.73ms       1   0.68%  (hash –  5.13%)    +1.2 GiB
│ └─Fiat-Shamir                               116.92µs       1   0.00%  (hash –  0.00%)      ±0 B
├─quotient calculation (cached)                 5.94s        1   5.98%  (CC   – 64.14%)  +397.3 MiB
│ ├─zerofier inverse                            1.67s        1   1.68%                   +521.6 MiB
│ └─evaluate AIR, compute quotient codeword     4.24s        1   4.27%                   +385.5 MiB
├─quotient LDE                                  5.60s        1   5.64%  (LDE  – 12.86%)    +1.3 GiB
├─hash rows of quotient segments              394.85ms       1   0.40%  (hash –  3.00%)  +628.5 MiB
├─Merkle tree                                 663.78ms       1   0.67%  (hash –  5.04%)    +1.3 GiB
├─out-of-domain rows                            4.86s        1   4.90%                   +108.3 MiB
├─Fiat-Shamir                                  68.85µs       1   0.00%  (hash –  0.00%)      ±0 B
├─linear combination                            4.86s        1   4.89%                   +609.3 MiB
│ ├─main                                      424.33ms       1   0.43%  (CC   –  4.58%)   +16.7 MiB
│ ├─aux                                       383.92ms       1   0.39%  (CC   –  4.15%)   +16.6 MiB
│ └─quotient                                    2.13s        1   2.15%  (CC   – 23.04%)  +239.1 MiB
├─DEEP                                          1.05s        1   1.06%                     +1.1 GiB
│ ├─main&aux curr row                         346.67ms       1   0.35%                   +310.7 MiB
│ ├─main&aux next row                         352.10ms       1   0.35%                   +458.6 MiB
│ └─segmented quotient                        349.27ms       1   0.35%                   +350.8 MiB
├─combined DEEP polynomial                    378.00ms       1   0.38%                   -790.9 MiB
│ └─sum                                       377.94ms       1   0.38%  (CC   –  4.08%)  -790.9 MiB
├─FRI                                           2.65s        1   2.66%                    +43.9 MiB
└─open trace leafs                            833.93µs       1   0.00%                       ±0 B

Categories

| Category | Time | Share |
|----------|--------|--------|
| LDE | 43.53s | 43.82% |
| gen | 17.67s | 17.79% |
| hash | 13.18s | 13.27% |
| CC | 9.26s | 9.32% |

Clock frequency is 7087 Hz (704027 clock cycles / (99328 ms / 1 iterations))
Optimal clock frequency is 21113 Hz (2097152 padded height / (99328 ms / 1 iterations))
FRI domain length is 2^24
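
The two frequency figures in the profile footer can be reproduced from the numbers printed next to them. The sketch below assumes truncating integer division, which happens to match the reported values; the function name `frequency_hz` is hypothetical and not taken from the profiler.

```rust
/// Items (clock cycles or padded rows) handled per second of proving time.
/// `runtime_ms` is the total runtime in milliseconds.
pub fn frequency_hz(count: u64, runtime_ms: u64) -> u64 {
    // Scale to seconds first, then truncate towards zero.
    count * 1000 / runtime_ms
}
```

With the baseline numbers, `frequency_hz(704_027, 99_328)` gives 7087 and `frequency_hz(2_097_152, 99_328)` gives 21113, where 2097152 is the padded height $2^{21}$.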

Cache Swap-Indices (ef186b5)

The mean runtime is 94.2 seconds.

Benchmark 1: cargo run --release -- --profile prove --program ~/tmp_spin.tasm --input 21
  Time (mean ± σ):     94.169 s ±  0.373 s    [User: 7881.071 s, System: 1030.685 s]
  Range (min … max):   93.695 s … 94.849 s    10 runs
Example performance profile
### Triton VM – Prove                          93.33s    #Reps   Share  Category          107.1 GiB
├─trace execution                             927.96ms       1   0.99%  (gen  –  5.42%)  +505.3 MiB
├─Fiat-Shamir: claim                            6.17µs       1   0.00%  (hash –  0.00%)      ±0 B
├─derive additional parameters                120.17µs       1   0.00%                       ±0 B
├─main tables                                  45.46s        1  48.71%                    +67.9 GiB
│ ├─create                                      7.88s        1   8.44%  (gen  – 46.02%)    +6.3 GiB
│ ├─pad                                       412.08ms       1   0.44%  (gen  –  2.41%)    +8.3 MiB
│ │ ├─pad original tables                     261.83ms       1   0.28%                     +8.3 MiB
│ │ └─fill degree-lowering table              150.12ms       1   0.16%                       ±0 B
│ ├─LDE                                        22.98s        1  24.62%  (LDE  – 54.80%)   +62.1 GiB
│ │ ├─polynomial zero-initialization            2.21µs       1   0.00%                       ±0 B
│ │ ├─interpolation                             1.86s        1   1.99%                    +14.5 GiB
│ │ ├─resize                                    1.25s        1   1.34%                    +47.4 GiB
│ │ ├─evaluation                               19.87s        1  21.29%                   +303.9 MiB
│ │ └─memoize                                 721.00ns       1   0.00%                       ±0 B
│ ├─Merkle tree                                 6.01s        1   6.44%                     +1.7 GiB
│ │ ├─leafs                                     5.77s        1   6.18%                   +499.5 MiB
│ │ │ └─hash rows                               5.77s        1   6.18%  (hash – 51.12%)  +499.5 MiB
│ │ └─Merkle tree                             204.48ms       1   0.22%  (hash –  1.81%)    +1.2 GiB
│ ├─Fiat-Shamir                                24.70µs       1   0.00%  (hash –  0.00%)      ±0 B
│ └─extend                                      7.90s        1   8.47%  (gen  – 46.16%)    +4.1 GiB
│   ├─initialize master table                 161.12ms       1   0.17%                     +4.1 GiB
│   ├─slice master table                        4.62µs       1   0.00%                       ±0 B
│   ├─all tables                                7.67s        1   8.22%                     +1.5 MiB
│   └─fill degree lowering table               72.25ms       1   0.08%                       ±0 B
├─aux tables                                   19.05s        1  20.41%                    +34.3 GiB
│ ├─LDE                                        14.00s        1  15.00%  (LDE  – 33.39%)   +37.3 GiB
│ │ ├─polynomial zero-initialization            1.86µs       1   0.00%                       ±0 B
│ │ ├─interpolation                             1.78s        1   1.90%                     +4.2 GiB
│ │ ├─resize                                  870.28ms       1   0.93%                    +32.1 GiB
│ │ ├─evaluation                               11.36s        1  12.17%                   +162.5 MiB
│ │ └─memoize                                 772.00ns       1   0.00%                       ±0 B
│ ├─Merkle tree                                 4.85s        1   5.19%                     +1.1 GiB
│ │ ├─leafs                                     4.61s        1   4.94%                   +493.5 MiB
│ │ │ └─hash rows                               4.61s        1   4.94%  (hash – 40.87%)  +493.5 MiB
│ │ └─Merkle tree                             198.63ms       1   0.21%  (hash –  1.76%)    +1.2 GiB
│ └─Fiat-Shamir                               127.75µs       1   0.00%  (hash –  0.00%)      ±0 B
├─quotient calculation (cached)                 6.00s        1   6.43%  (CC   – 67.83%)  +381.9 MiB
│ ├─zerofier inverse                            1.67s        1   1.78%                   +524.9 MiB
│ └─evaluate AIR, compute quotient codeword     4.31s        1   4.61%                   +364.5 MiB
├─quotient LDE                                  4.95s        1   5.30%  (LDE  – 11.80%)    +1.3 GiB
├─hash rows of quotient segments              290.35ms       1   0.31%  (hash –  2.57%)  +625.5 MiB
├─Merkle tree                                 209.86ms       1   0.22%  (hash –  1.86%)    +1.3 GiB
├─out-of-domain rows                            4.79s        1   5.13%                   +120.3 MiB
├─Fiat-Shamir                                  67.41µs       1   0.00%  (hash –  0.00%)      ±0 B
├─linear combination                            4.01s        1   4.29%                   +608.6 MiB
│ ├─main                                      382.33ms       1   0.41%  (CC   –  4.32%)   +20.3 MiB
│ ├─aux                                       348.81ms       1   0.37%  (CC   –  3.94%)   +12.3 MiB
│ └─quotient                                    1.74s        1   1.87%  (CC   – 19.70%)  +239.1 MiB
├─DEEP                                          1.06s        1   1.13%                     +1.8 GiB
│ ├─main&aux curr row                         351.39ms       1   0.38%                   +306.5 MiB
│ ├─main&aux next row                         354.57ms       1   0.38%                   +467.1 MiB
│ └─segmented quotient                        350.81ms       1   0.38%                   +334.6 MiB
├─combined DEEP polynomial                    371.43ms       1   0.40%                   -761.5 MiB
│ └─sum                                       371.38ms       1   0.40%  (CC   –  4.20%)  -761.5 MiB
├─FRI                                           1.57s        1   1.69%                    -10.2 MiB
└─open trace leafs                            849.67µs       1   0.00%                       ±0 B

Categories

| Category | Time | Share |
|----------|--------|--------|
| LDE | 41.93s | 44.93% |
| gen | 17.12s | 18.34% |
| hash | 11.29s | 12.09% |
| CC | 8.84s | 9.48% |

Clock frequency is 7543 Hz (704027 clock cycles / (93332 ms / 1 iterations))
Optimal clock frequency is 22469 Hz (2097152 padded height / (93332 ms / 1 iterations))
FRI domain length is 2^24
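
For context on what "swap indices" refers to: the bit-reversal permutation used by an in-place NTT swaps element `i` with the element at the bit-reversed index, and the list of such pairs depends only on the input length. The following is a minimal sketch of caching that list, not the implementation in `ntt.rs`; the names `SWAP_INDICES`, `swap_indices`, and `bit_reverse_permute` are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

type SwapList = Arc<Vec<(u32, u32)>>;

/// Hypothetical cache: log2 of the input length -> index pairs to swap.
static SWAP_INDICES: OnceLock<Mutex<HashMap<u32, SwapList>>> = OnceLock::new();

/// All pairs (i, j) with i < j where j is the `log2_len`-bit reversal of i.
/// Computed once per length, then served from the cache.
pub fn swap_indices(log2_len: u32) -> SwapList {
    assert!((1..=31).contains(&log2_len));
    let cache = SWAP_INDICES.get_or_init(|| Mutex::new(HashMap::new()));
    let mut cache = cache.lock().unwrap();
    let entry = cache.entry(log2_len).or_insert_with(|| {
        let pairs = (0..(1_u32 << log2_len))
            .map(|i| (i, i.reverse_bits() >> (32 - log2_len)))
            .filter(|&(i, j)| i < j) // keep each swap exactly once
            .collect::<Vec<_>>();
        Arc::new(pairs)
    });
    Arc::clone(entry)
}

/// Applies the bit-reversal permutation in place using the cached pairs.
pub fn bit_reverse_permute<T>(values: &mut [T]) {
    if values.len() < 2 {
        return; // nothing to permute
    }
    assert!(values.len().is_power_of_two());
    let log2_len = values.len().trailing_zeros();
    for &(i, j) in swap_indices(log2_len).iter() {
        values.swap(i as usize, j as usize);
    }
}
```

For a slice of length 8, the cached list is `[(1, 4), (3, 6)]`, so `[0, 1, 2, 3, 4, 5, 6, 7]` becomes `[0, 4, 2, 6, 1, 5, 3, 7]`. The trade-off is memory: the list grows linearly with the input length.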

Cache Twiddle Factors (1cecaeb)

The mean runtime is 93.1 seconds.

Benchmark 1: cargo run --release -- --profile prove --program ~/tmp_spin.tasm --input 21
  Time (mean ± σ):     93.134 s ±  0.368 s    [User: 7820.986 s, System: 1031.994 s]
  Range (min … max):   92.506 s … 93.633 s    10 runs
Example performance profile
### Triton VM – Prove                          93.31s    #Reps   Share  Category          102.8 GiB
├─trace execution                             920.64ms       1   0.99%  (gen  –  5.52%)  +509.8 MiB
├─Fiat-Shamir: claim                            7.13µs       1   0.00%  (hash –  0.00%)      ±0 B
├─derive additional parameters                128.80µs       1   0.00%                       ±0 B
├─main tables                                  45.50s        1  48.76%                    +59.2 GiB
│ ├─create                                      7.30s        1   7.83%  (gen  – 43.81%)    +6.3 GiB
│ ├─pad                                       400.23ms       1   0.43%  (gen  –  2.40%)    +3.0 MiB
│ │ ├─pad original tables                     260.77ms       1   0.28%                     +1.5 MiB
│ │ └─fill degree-lowering table              139.32ms       1   0.15%                     +1.5 MiB
│ ├─LDE                                        23.08s        1  24.73%  (LDE  – 54.44%)   +62.2 GiB
│ │ ├─polynomial zero-initialization            2.24µs       1   0.00%                       ±0 B
│ │ ├─interpolation                             1.92s        1   2.06%                    +14.5 GiB
│ │ ├─resize                                    1.25s        1   1.34%                    +47.4 GiB
│ │ ├─evaluation                               19.91s        1  21.34%                   +371.5 MiB
│ │ └─memoize                                 891.00ns       1   0.00%                       ±0 B
│ ├─Merkle tree                                 5.95s        1   6.38%                     +1.1 GiB
│ │ ├─leafs                                     5.72s        1   6.13%                   +534.0 MiB
│ │ │ └─hash rows                               5.72s        1   6.13%  (hash – 51.09%)  +534.0 MiB
│ │ └─Merkle tree                             196.24ms       1   0.21%  (hash –  1.75%)    +1.2 GiB
│ ├─Fiat-Shamir                                24.76µs       1   0.00%  (hash –  0.00%)      ±0 B
│ └─extend                                      8.05s        1   8.62%  (gen  – 48.27%)    +4.2 GiB
│   ├─initialize master table                 151.05ms       1   0.16%                     +4.1 GiB
│   ├─slice master table                        4.87µs       1   0.00%                       ±0 B
│   ├─all tables                                7.82s        1   8.38%                    +36.5 MiB
│   └─fill degree lowering table               74.22ms       1   0.08%                       ±0 B
├─aux tables                                   19.67s        1  21.08%                    +34.3 GiB
│ ├─LDE                                        14.64s        1  15.70%  (LDE  – 34.54%)   +37.2 GiB
│ │ ├─polynomial zero-initialization            1.95µs       1   0.00%                       ±0 B
│ │ ├─interpolation                             2.56s        1   2.74%                     +4.2 GiB
│ │ ├─resize                                  869.29ms       1   0.93%                    +32.9 GiB
│ │ ├─evaluation                               11.22s        1  12.02%                   +138.1 MiB
│ │ └─memoize                                 992.00ns       1   0.00%                       ±0 B
│ ├─Merkle tree                                 4.83s        1   5.17%                     +1.1 GiB
│ │ ├─leafs                                     4.58s        1   4.91%                   +508.5 MiB
│ │ │ └─hash rows                               4.58s        1   4.91%  (hash – 40.94%)  +508.5 MiB
│ │ └─Merkle tree                             209.24ms       1   0.22%  (hash –  1.87%)    +1.3 GiB
│ └─Fiat-Shamir                               110.92µs       1   0.00%  (hash –  0.00%)      ±0 B
├─quotient calculation (cached)                 5.85s        1   6.27%  (CC   – 69.13%)  +376.5 MiB
│ ├─zerofier inverse                            1.61s        1   1.72%                   +505.1 MiB
│ └─evaluate AIR, compute quotient codeword     4.21s        1   4.51%                   +381.0 MiB
├─quotient LDE                                  4.67s        1   5.01%  (LDE  – 11.02%)    +2.9 GiB
├─hash rows of quotient segments              281.02ms       1   0.30%  (hash –  2.51%)  +642.0 MiB
├─Merkle tree                                 205.39ms       1   0.22%  (hash –  1.83%)    +1.2 GiB
├─out-of-domain rows                            4.85s        1   5.20%                     +2.3 GiB
├─Fiat-Shamir                                  67.82µs       1   0.00%  (hash –  0.00%)      ±0 B
├─linear combination                            3.63s        1   3.89%                   +607.8 MiB
│ ├─main                                      358.94ms       1   0.38%  (CC   –  4.24%)   +15.3 MiB
│ ├─aux                                       327.89ms       1   0.35%  (CC   –  3.88%)   +18.1 MiB
│ └─quotient                                    1.57s        1   1.68%  (CC   – 18.54%)  +238.5 MiB
├─DEEP                                          1.10s        1   1.18%                     +1.8 GiB
│ ├─main&aux curr row                         355.26ms       1   0.38%                   +298.3 MiB
│ ├─main&aux next row                         372.52ms       1   0.40%                   +464.8 MiB
│ └─segmented quotient                        376.37ms       1   0.40%                   +341.6 MiB
├─combined DEEP polynomial                    356.77ms       1   0.38%                   -759.2 MiB
│ └─sum                                       356.71ms       1   0.38%  (CC   –  4.22%)  -759.2 MiB
├─FRI                                           1.59s        1   1.70%                    +27.9 MiB
└─open trace leafs                            842.97µs       1   0.00%                       ±0 B

Categories

| Category | Time | Share |
|----------|--------|--------|
| LDE | 42.40s | 45.44% |
| gen | 16.67s | 17.87% |
| hash | 11.20s | 12.00% |
| CC | 8.46s | 9.07% |

Clock frequency is 7545 Hz (704027 clock cycles / (93307 ms / 1 iterations))
Optimal clock frequency is 22475 Hz (2097152 padded height / (93307 ms / 1 iterations))
FRI domain length is 2^24
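
To put the three mean runtimes reported in this comment side by side, the relative change between two measurements can be computed as follows. This is plain arithmetic on the figures above; the helper name `relative_change_percent` is hypothetical.

```rust
/// Relative change from `before_s` to `after_s` in percent.
/// Negative values mean the run got faster.
pub fn relative_change_percent(before_s: f64, after_s: f64) -> f64 {
    (after_s - before_s) / before_s * 100.0
}
```

Using the means 99.238 s (baseline), 94.169 s (cached swap indices), and 93.134 s (cached twiddle factors), this gives roughly -5.1% for the first step, -1.1% for the second step, and -6.2% overall for the prover run.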

@jan-ferdinand
Member Author

Previously […]

Previously, n = 1. Good benchmarks take time.

Member

@Sword-Smith Sword-Smith left a comment

LGTM. Very cool speedup!!

Collaborator

@aszepieniec aszepieniec left a comment

On the whole it looks good. I added some comments inline. Please address them as you see fit before merging.

Comment thread twenty-first/src/math/ntt.rs
Comment thread twenty-first/src/math/ntt.rs
Comment thread twenty-first/src/math/ntt.rs
@Sword-Smith Sword-Smith mentioned this pull request Jun 15, 2025
jan-ferdinand and others added 3 commits June 16, 2025 11:01
Uses branch-free bitreverse, which was developed with help from
https://stackoverflow.com/a/9144870/2574407.
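
As an illustration of the term, here is one classic branch-free way to reverse bits using masks and shifts. It is a sketch only and not necessarily the variant from the linked answer or the one in `ntt.rs`; the function names are hypothetical.

```rust
/// Reverses all 32 bits of `x` by swapping progressively larger groups:
/// neighbouring bits, bit pairs, nibbles, bytes, then 16-bit halves.
pub fn bit_reverse_u32(mut x: u32) -> u32 {
    x = ((x >> 1) & 0x5555_5555) | ((x & 0x5555_5555) << 1);
    x = ((x >> 2) & 0x3333_3333) | ((x & 0x3333_3333) << 2);
    x = ((x >> 4) & 0x0F0F_0F0F) | ((x & 0x0F0F_0F0F) << 4);
    x = ((x >> 8) & 0x00FF_00FF) | ((x & 0x00FF_00FF) << 8);
    (x >> 16) | (x << 16)
}

/// Reverses only the lowest `log2_len` bits of `index`.
/// Requires 1 <= log2_len <= 32 and index < 2^log2_len.
pub fn bit_reverse(index: u32, log2_len: u32) -> u32 {
    debug_assert!((1..=32).contains(&log2_len));
    bit_reverse_u32(index) >> (32 - log2_len)
}
```

Every input takes the same fixed sequence of shifts, masks, and ORs, with no data-dependent loop or conditional. Rust's standard library also offers `u32::reverse_bits`, which computes the same result as `bit_reverse_u32`.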

On the machine known as “megingjord”:

bfe_ntt/len/23  time:   [304.41 ms 304.85 ms 305.33 ms]
              change:   [−29.258% −28.787% −28.391%] (p = 0.00 < 0.05)

xfe_ntt/len/23 time:    [661.13 ms 662.03 ms 663.05 ms]
              change:   [−19.874% −19.685% −19.443%] (p = 0.00 < 0.05)

bfe_intt/len/23 time:   [311.42 ms 311.89 ms 312.42 ms]
              change:   [−28.255% −27.928% −27.626%] (p = 0.00 < 0.05)

xfe_intt/len/23 time:   [681.64 ms 682.69 ms 683.84 ms]
              change:   [−18.723% −18.525% −18.328%] (p = 0.00 < 0.05)

Co-authored-by: Jan Ferdinand Sauer <ferdinand@neptune.cash>

Performance change within noise threshold.

changelog: ignore

On the machine known as “megingjord”:

bfe_ntt/len/23  time:   [238.36 ms 238.66 ms 239.00 ms]
              change:   [−21.801% −21.654% −21.507%] (p = 0.00 < 0.05)

xfe_ntt/len/23 time:    [601.28 ms 603.79 ms 607.49 ms]
              change:   [−7.2785% −6.8430% −6.2522%] (p = 0.00 < 0.05)

bfe_intt/len/23 time:   [242.37 ms 242.79 ms 243.28 ms]
              change:   [−22.017% −21.810% −21.598%] (p = 0.00 < 0.05)

xfe_intt/len/23 time:   [621.95 ms 622.50 ms 623.08 ms]
              change:   [−9.1162% −8.9436% −8.7779%] (p = 0.00 < 0.05)

Co-authored-by: Jan Ferdinand Sauer <ferdinand@neptune.cash>
@jan-ferdinand jan-ferdinand merged commit a72ff37 into master Jun 16, 2025
5 checks passed
@jan-ferdinand jan-ferdinand deleted the jfs/ntt-perf branch June 16, 2025 09:11