Commit 9cc6e41

Update README.
1 parent 84f8fa8 commit 9cc6e41

1 file changed: +11 -8 lines changed

README.md

Lines changed: 11 additions & 8 deletions
@@ -99,14 +99,17 @@ Benchmarks comparing cuTile.jl against cuTile Python on an RTX 5080 (`tileiras`
 |--------|------|-------|--------|--------|
 | Vector Addition | 2^27 f32 | 842 GB/s | 844 GB/s | OK (=) |
 | Matrix Transpose | 8192² f32 | 801 GB/s | 810 GB/s | OK (-1%) |
-| Layer Normalization | 4096² f32 fwd | 691 GB/s | 720 GB/s | -4% |
-| Matrix Multiplication | 4096³ f32 | 43.3 TFLOPS | 43.3 TFLOPS | OK (=) |
-| Batch Matrix Multiply | 1024×512×2048 ×8 f32 | 30.4 TFLOPS | 30.8 TFLOPS | OK (-1%) |
-| FFT (3-stage Cooley-Tukey) | 1024-pt ×64 c64 | 3283 μs | 3133 μs | -5% |
-| Mixture of Experts | 256tok 1024h 32e 2048i f16 | 18.3 TFLOPS | 20.1 TFLOPS | -9% |
-| Attention (FMHA) | 8×16×1024² ×64 f16 causal | 88.1 TFLOPS | 61.3 TFLOPS | +44%* |
-
-\* Likely due to Python's compiler splitting the causal masking loop into two
+| Layer Normalization | 4096² f32 fwd | 687 GB/s | 720 GB/s | -5% |
+| Matrix Multiplication | 4096³ f32 | 47.2 TFLOPS | 43.3 TFLOPS | +9%* |
+| Batch Matrix Multiply | 1024×512×2048 ×8 f32 | 33.5 TFLOPS | 30.8 TFLOPS | +9%* |
+| FFT (3-stage Cooley-Tukey) | 1024-pt ×64 c64 | 3264 μs | 3133 μs | -4% |
+| Mixture of Experts | 256tok 1024h 32e 2048i f16 | 19.5 TFLOPS | 20.1 TFLOPS | -3% |
+| Attention (FMHA) | 8×16×1024² ×64 f16 causal | 87.8 TFLOPS | 61.3 TFLOPS | +43%** |
+
+\* Likely because Julia's `for` loop guards give `tileiras` a guarantee that the
+loop body executes at least once, enabling more aggressive warp scheduling.
+
+\*\* Likely due to Python's compiler splitting the causal masking loop into two
 loops, duplicating the loop body. Julia emits a single loop with a conditional.
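
The two loop shapes named in the `**` footnote can be sketched in plain Python. This is only an illustration of the control-flow difference, not cuTile code and not what either compiler actually emits: `BLOCK`, `score`, `block_sum`, `single_loop`, and `split_loops` are all hypothetical names invented for this sketch.

```python
# Illustrative only: plain Python, not cuTile. All names here are hypothetical.

BLOCK = 4  # assumed tile size along both the query and key axes


def score(q, k):
    """Stand-in for a query-key score; any deterministic function would do."""
    return (q * 31 + k * 17) % 7


def block_sum(q_block, k_block, masked):
    """Sum the scores of one tile, skipping k > q entries when the causal mask applies."""
    total = 0
    for q in range(q_block * BLOCK, (q_block + 1) * BLOCK):
        for k in range(k_block * BLOCK, (k_block + 1) * BLOCK):
            if masked and k > q:
                continue
            total += score(q, k)
    return total


def single_loop(q_block):
    """One loop over key tiles; a conditional decides per tile whether to mask."""
    total = 0
    for k_block in range(q_block + 1):
        total += block_sum(q_block, k_block, masked=(k_block == q_block))
    return total


def split_loops(q_block):
    """Two loops: unmasked tiles below the diagonal, then the masked diagonal tile.

    The call to block_sum (the loop body) appears twice.
    """
    total = 0
    for k_block in range(q_block):
        total += block_sum(q_block, k_block, masked=False)
    for k_block in range(q_block, q_block + 1):
        total += block_sum(q_block, k_block, masked=True)
    return total
```

Both functions compute the same causal sum for every query tile; they differ only in whether the masking decision lives inside one loop or is hoisted out by duplicating the body. The sketch makes no claim about which shape is faster on a GPU.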
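
As a sanity check on the last column of the updated table, the percentages can be recomputed from the two measurement columns. The sketch below assumes the first measurement column is cuTile.jl and the second is cuTile Python (the header row lies outside this hunk), and that the delta is a speed ratio rounded to a whole percent; `relative_delta` is a hypothetical helper, not part of either project.

```python
def relative_delta(julia, python, lower_is_better=False):
    """Signed whole-percent speed difference of the Julia figure relative to Python.

    Positive means Julia is faster. For time measurements (lower is better)
    the ratio is inverted so the sign keeps the same meaning.
    Assumption: this rounding convention matches the README's; the source does not say.
    """
    ratio = python / julia if lower_is_better else julia / python
    return round((ratio - 1.0) * 100)


# Rows from the "+" side of the diff: (name, julia, python, lower_is_better)
rows = [
    ("Layer Normalization", 687, 720, False),
    ("Matrix Multiplication", 47.2, 43.3, False),
    ("Batch Matrix Multiply", 33.5, 30.8, False),
    ("FFT (3-stage Cooley-Tukey)", 3264, 3133, True),  # microseconds
    ("Mixture of Experts", 19.5, 20.1, False),
    ("Attention (FMHA)", 87.8, 61.3, False),
]

for name, julia, python, lower in rows:
    print(f"{name}: {relative_delta(julia, python, lower):+d}%")
```

Under these assumptions the recomputed values agree with the committed column, including the FFT row, where a larger time for Julia yields a negative delta.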