Add Mixture of Experts (MoE) example by maleadt · Pull Request #163 · JuliaGPU/cuTile.jl

maleadt · 2026-04-01T10:54:23Z

Port of cuTile Python's MoE.py sample with two kernels:

- fused_moe_kernel: tiled matmul with gather/scatter for expert routing
- silu_and_mul_kernel: element-wise SiLU activation
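For readers unfamiliar with MoE routing, the gather/matmul/scatter pattern named above can be sketched in plain Python. This is only a reference for the semantics, not the kernel from this PR: the names `moe_reference`, `topk_ids` and `topk_weights` are hypothetical, there is no tiling or GPU code, and applying the routing weight inside the loop is an assumption (the real kernel may leave weighting and reduction to a separate step).

```python
# Reference semantics only: plain Python lists, no GPU, no tiling.
# All names below are hypothetical and not taken from cuTile.jl.

def matvec(weight, x):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def moe_reference(tokens, expert_weights, topk_ids, topk_weights):
    """tokens:          list of input vectors
    expert_weights:  one weight matrix per expert
    topk_ids[t]:     experts selected for token t
    topk_weights[t]: routing weights matching topk_ids[t]
    """
    out_dim = len(expert_weights[0])
    output = [[0.0] * out_dim for _ in tokens]

    # Gather: group (token, slot) pairs by the expert they are routed to,
    # so each expert's weights are used for one contiguous batch of work.
    by_expert = {}
    for t, experts in enumerate(topk_ids):
        for slot, e in enumerate(experts):
            by_expert.setdefault(e, []).append((t, slot))

    for e, pairs in by_expert.items():
        for t, slot in pairs:
            y = matvec(expert_weights[e], tokens[t])  # per-expert matmul
            w = topk_weights[t][slot]
            # Scatter: accumulate the weighted result into the token's row.
            for j in range(out_dim):
                output[t][j] += w * y[j]
    return output
```

Grouping work by expert is what lets a tiled kernel reuse one expert's weight tile across many tokens; the scatter step then restores the original token order.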

The Julia fused_moe_kernel uses opt_level=0 as a workaround for a tileiras optimizer crash (NVIDIA/cuda-tile#17). Because of that, performance isn't great, so marking this draft.

maleadt · 2026-04-01T13:11:19Z

> The Julia fused_moe_kernel uses opt_level=0 as a workaround for a tileiras optimizer crash (NVIDIA/cuda-tile#17). Because of that, performance isn't great, so marking this draft.

Found a workaround that improves performance from a 6x slowdown to 1.6x. That's still not great, so I'll leave this as a draft until we have the necessary passes to improve performance.

maleadt · 2026-04-01T14:40:38Z

After replacing the indexing kernel with a call to cuMemcpy2D, we're down to 1.17x overhead.
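The comment does not say what the indexing kernel fetched, so the following is only an illustration of what a pitched 2D copy such as cuMemcpy2D does: it copies `height` rows of `width_bytes` each, stepping through source and destination with independent pitches. The function `memcpy_2d` is a hypothetical host-side emulation on byte buffers; its parameters loosely mirror the `WidthInBytes`, `Height`, `srcPitch` and `dstPitch` fields of `CUDA_MEMCPY2D`, with the start offsets collapsed into a single byte offset per side.

```python
def memcpy_2d(dst, dst_pitch, src, src_pitch, width_bytes, height,
              src_offset=0, dst_offset=0):
    """Emulate a pitched 2D copy on flat byte buffers (hypothetical helper).

    Copies `height` rows of `width_bytes` bytes. Consecutive rows start
    `src_pitch` bytes apart in `src` and `dst_pitch` bytes apart in `dst`.
    """
    for row in range(height):
        s = src_offset + row * src_pitch
        d = dst_offset + row * dst_pitch
        dst[d:d + width_bytes] = src[s:s + width_bytes]

# A 3x4 row-major byte matrix holding 0..11.
src = bytes(range(12))

# Extract column 1 as a packed vector: one byte per row, source pitch 4.
column = bytearray(3)
memcpy_2d(column, 1, src, 4, width_bytes=1, height=3, src_offset=1)
print(list(column))  # [1, 5, 9]
```

Because strided element fetches like this map onto a single copy call, they do not need a dedicated kernel launch, which is one plausible reason the change reduced overhead.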

Port of cuTile Python's MoE.py sample with two kernels:

- fused_moe_kernel: tiled matmul with gather/scatter for expert routing
- silu_and_mul_kernel: element-wise SiLU activation

The Julia fused_moe_kernel uses opt_level=0 as a workaround for a tileiras optimizer crash caused by token loop-carries from gather ops.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
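As a point of reference for the second kernel, here is a minimal sketch of the element-wise operation, assuming the conventional meaning of "SiLU and mul" in MoE feed-forward layers: the input row is split into a gate half and an up half, and the output is `silu(gate) * up`. The split layout and the function names are assumptions and are not taken from this PR.

```python
import math

def silu(x):
    """SiLU (swish): x * sigmoid(x), written to avoid overflow in exp."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)  # x < 0, so e is in (0, 1) and cannot overflow
    return x * e / (1.0 + e)

def silu_and_mul(row):
    """Assumed layout: first half of `row` is the gate, second half is up."""
    half = len(row) // 2
    gate, up = row[:half], row[half:]
    return [silu(g) * u for g, u in zip(gate, up)]

# Gate values 0.0 and 1.0, up values 5.0 and 7.0.
print(silu_and_mul([0.0, 1.0, 5.0, 7.0]))
```

The output has half the width of the input, which is why this step is usually kept as its own lightweight kernel between the two expert matmuls.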

Use order=(2,1,3) on ct.load for the B tile to produce a dim_map in the partition view, eliminating the explicit permute [1,0] in the Tile IR loop body. This matches Python cuTile's order=(0,2,1) approach.

The permute was generating 96 extra PRMT (byte permute) and 8 STS (shared memory store) instructions per loop iteration, plus 6x more L1 cache traffic from the shared memory round-trip.

SASS instruction counts (fused_moe v1, no mul_routed_weight):

Before: 7,920
After: 5,712
Python: 5,792

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
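The idea behind this commit, folding a transpose into the load's addressing instead of permuting the tile afterwards, can be illustrated with a small sketch. This is not the ct.load API: `load_tile`, its `order` argument and `permute` are hypothetical, 2-D only, and operate on a flat Python list with explicit strides.

```python
def load_tile(buf, shape, strides, order=(0, 1)):
    """Read a 2-D tile from a flat buffer with its axes in `order`.

    order=(0, 1) returns the tile as stored; order=(1, 0) returns it
    transposed by swapping which stride each output axis walks, so no
    separate permute step is needed afterwards.
    """
    rows, cols = shape[order[0]], shape[order[1]]
    s0, s1 = strides[order[0]], strides[order[1]]
    return [[buf[i * s0 + j * s1] for j in range(cols)] for i in range(rows)]

def permute(tile):
    """Explicit transpose of an already-loaded tile (the extra step)."""
    return [list(col) for col in zip(*tile)]

buf = list(range(6))            # a 2x3 row-major matrix
shape, strides = (2, 3), (3, 1)

via_permute = permute(load_tile(buf, shape, strides))     # load, then shuffle
via_order = load_tile(buf, shape, strides, order=(1, 0))  # layout in the load
print(via_permute == via_order)  # True
```

Both paths produce the same tile; the difference is that the second never materializes the untransposed data, which corresponds to the shared-memory round-trip the commit message describes removing.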

Layer norm forward improved from -9% to -5% (algebra rule improvements help its index patterns).

Add MoE benchmark (-8% vs Python). All other kernels unchanged within measurement noise.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

maleadt · 2026-04-02T09:24:22Z

With #166 and #168 (enabled by #167), performance is good here: 8% slower than Python. Those improvements also narrowed LayerNorm's gap from -9% to -5%.
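To put the figures quoted in this thread on a common footing, the short sketch below converts the SASS instruction counts from the commit message into relative changes and the earlier slowdown factors into percent overhead. The helper names are hypothetical, and the thread does not state how the benchmark percentages (such as -8%) were computed, so this only covers the raw counts and factors.

```python
def relative_change(new, baseline):
    """Signed change of `new` relative to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

def overhead_percent(slowdown_factor):
    """Convert an 'N x slower' factor into percent extra time."""
    return (slowdown_factor - 1.0) * 100.0

# SASS instruction counts quoted in the commit message above.
sass_before, sass_after, sass_python = 7920, 5712, 5792

print(f"{relative_change(sass_after, sass_before):.1f}%")  # -27.9%
print(f"{relative_change(sass_after, sass_python):.1f}%")  # -1.4%

# Slowdown factors reported in the earlier comments.
for factor in (6.0, 1.6, 1.17):
    print(f"{factor}x -> {overhead_percent(factor):.0f}% overhead")
```

Read this way, removing the permute cut the instruction count by a little over a quarter and brought it slightly below the Python version's count.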

maleadt force-pushed the tb/moe branch 2 times, most recently from f5bb5ce to 9dc6864 April 1, 2026 13:10

maleadt mentioned this pull request Apr 1, 2026

Add a LICM pass #165

Merged

maleadt and others added 7 commits April 2, 2026 11:20

MoE: Switch dimensions to make generated IR match cuTile Python's.

d283e14

MoE: Remove unnecessary specialization

09311e9

Also add examples deps to the test project.

c5f14ad

Avoid expensive kernel to fetch elements.

7cb2ea3

maleadt force-pushed the tb/moe branch from c511ec5 to 86f873d April 2, 2026 09:23

maleadt marked this pull request as ready for review April 2, 2026 09:24

maleadt merged commit 416f3b3 into main Apr 2, 2026
9 checks passed

maleadt deleted the tb/moe branch April 2, 2026 09:58
