Commit 850a808

Merge pull request #144 from JuliaGPU/tb/examples
Update examples
2 parents: 522f65a + 96c765b

4 files changed: 169 additions, 193 deletions

README.md
10 additions, 10 deletions
@@ -96,16 +96,16 @@ Benchmarks comparing cuTile.jl against cuTile Python on an RTX 5080:
 
 | Kernel | Julia | Python | Status |
 |--------|-------|--------|--------|
-| Vector Addition | 813 GB/s | 834 GB/s | OK (-3%) |
-| Matrix Transpose | 769 GB/s | 795 GB/s | OK (-3%) |
-| Matrix Multiplication | 48.3 TFLOPS | 48.6 TFLOPS | OK (=) |
-| Layer Normalization | 254 GB/s | 683 GB/s | https://github.com/JuliaGPU/cuTile.jl/issues/1 (-63%) |
-| Batch Matrix Multiply | 31.7 TFLOPS | 31.6 TFLOPS | OK (=) |
-| FFT (3-stage Cooley-Tukey) | 508 μs | 230 μs | (-55%) |
-
-Compute-intensive kernels (matmul, batch matmul) perform identically to Python. Memory-bound
-kernels (vadd, transpose) are within ~3% of Python. The layernorm kernel is slower due to
-conservative token threading in the compiler (see https://github.com/JuliaGPU/cuTile.jl/issues/1).
+| Vector Addition | 840 GB/s | 844 GB/s | OK (=) |
+| Matrix Transpose | 806 GB/s | 816 GB/s | OK (-1%) |
+| Layer Normalization | 1074 GB/s | 761 GB/s | OK (+41%) |
+| Matrix Multiplication | 36.8 TFLOPS | 50.7 TFLOPS | -27% |
+| Batch Matrix Multiply | 28.3 TFLOPS | 40.0 TFLOPS | -29% |
+| FFT (3-stage Cooley-Tukey) | 571 μs | 192 μs | -66% |
+
+Memory-bound kernels (vadd, transpose, layernorm) match or beat Python. Compute-intensive
+kernels (matmul, batch matmul, FFT) are slower due to conservative token threading in the
+generated Tile IR, which serializes loads that could otherwise be pipelined.
 
 
 ## Supported Operations
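For context on the GB/s figures in the table above, effective bandwidth for a memory-bound kernel is derived from element count, element size, and kernel time. A minimal sketch (the timing value here is hypothetical, chosen only to land near the table's vector-addition number):

```python
# Effective bandwidth of a vector-addition kernel C = A + B.
# The kernel reads A and B and writes C, so it moves 3 * N * itemsize bytes.
def effective_bandwidth_gbs(n_elements: int, itemsize: int, seconds: float) -> float:
    bytes_moved = 3 * n_elements * itemsize
    return bytes_moved / seconds / 1e9

# Hypothetical run: 2**28 float32 elements (itemsize 4) in 3.83 ms
# comes out to roughly 841 GB/s, in the same ballpark as the table.
bw = effective_bandwidth_gbs(2**28, 4, 3.83e-3)
print(round(bw))
```

This is why vadd, transpose, and layernorm are compared in GB/s while matmul kernels are compared in TFLOPS: the former are limited by memory traffic, the latter by arithmetic throughput.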

examples/batchmatmul.py
1 addition, 1 deletion
@@ -45,7 +45,7 @@ def batchmatmul_cutile_kernel(A, B, C, tm: ct.Constant[int], tn: ct.Constant[int
 # Example harness
 #=============================================================================
 
-def prepare(*, benchmark: bool = False, Batch: int = None, M: int = None, K: int = None, N: int = None, dtype=np.float16):
+def prepare(*, benchmark: bool = False, Batch: int = None, M: int = None, K: int = None, N: int = None, dtype=np.float32):
     """Allocate and initialize data for batch matmul."""
     if Batch is None:
         Batch = 8 if benchmark else 4
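A sketch of what a harness like `prepare` presumably allocates for a batch matmul C[b] = A[b] @ B[b]. Only the `Batch = 8 if benchmark else 4` default and the `np.float32` dtype come from the diff; the M/K/N defaults and the random initialization here are hypothetical stand-ins:

```python
import numpy as np

def prepare_sketch(*, benchmark: bool = False, Batch: int = None,
                   M: int = 256, K: int = 256, N: int = 256,
                   dtype=np.float32):
    """Allocate random inputs and a zeroed output for C[b] = A[b] @ B[b]."""
    if Batch is None:
        Batch = 8 if benchmark else 4  # default taken from the diff above
    # M, K, N defaults are illustrative only; the real harness resolves them
    # from its own None-handling logic.
    A = np.random.rand(Batch, M, K).astype(dtype)
    B = np.random.rand(Batch, K, N).astype(dtype)
    C = np.zeros((Batch, M, N), dtype=dtype)
    return A, B, C

A, B, C = prepare_sketch()
print(A.shape, B.shape, C.shape)  # (4, 256, 256) for all three
```

Switching the default dtype from float16 to float32, as this commit does, doubles the bytes moved per element, which matters when quoting the TFLOPS numbers in the README table.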
