Add Mixture of Experts (MoE) example #163

Merged
maleadt merged 7 commits into main from tb/moe
Apr 2, 2026

maleadt commented Apr 1, 2026

Port of cuTile Python's MoE.py sample with two kernels:

  • fused_moe_kernel: tiled matmul with gather/scatter for expert routing
  • silu_and_mul_kernel: element-wise SiLU activation
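For readers unfamiliar with the two kernels, here is a plain-Python reference of what they compute, as a minimal sketch and not the tiled implementation in this PR. It assumes the usual convention that `silu_and_mul` treats the first half of each row as the gate and the second half as the up projection, and that each token carries a list of selected expert ids; `matvec` and `routed_matmul` are hypothetical names used only for illustration.

```python
import math

def silu(x):
    # SiLU (swish): x * sigmoid(x), written as x / (1 + exp(-x)).
    return x / (1.0 + math.exp(-x))

def silu_and_mul(row):
    # Assumed layout: first half is the gate, second half is the up projection.
    half = len(row) // 2
    return [silu(g) * u for g, u in zip(row[:half], row[half:])]

def matvec(weight, x):
    # weight is a list of output rows, each as long as x.
    return [sum(w * v for w, v in zip(w_row, x)) for w_row in weight]

def routed_matmul(tokens, expert_weights, topk_ids):
    # Gather each token's row, multiply it by the weights of every expert it
    # was routed to, and scatter the result into slot (token, k).
    return [
        [matvec(expert_weights[e], x) for e in topk_ids[t]]
        for t, x in enumerate(tokens)
    ]

# Two tokens, two experts: expert 0 is the identity, expert 1 doubles its input.
tokens = [[1.0, 2.0], [3.0, 4.0]]
expert_weights = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
]
topk_ids = [[0], [1]]
routed = routed_matmul(tokens, expert_weights, topk_ids)
# routed == [[[1.0, 2.0]], [[6.0, 8.0]]]
```

A fused kernel would typically group tokens by expert and process them in tiles instead of looping token by token; the loop above only pins down the expected values.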

The Julia fused_moe_kernel uses opt_level=0 as a workaround for a tileiras optimizer crash (NVIDIA/cuda-tile#17). Because of that, performance isn't great, so marking this draft.

maleadt force-pushed the tb/moe branch 2 times, most recently from f5bb5ce to 9dc6864 on April 1, 2026 13:10

maleadt commented Apr 1, 2026

> The Julia fused_moe_kernel uses opt_level=0 as a workaround for a tileiras optimizer crash (NVIDIA/cuda-tile#17). Because of that, performance isn't great, so marking this draft.

Found a workaround, improving performance from a 6x slowdown to 1.6x. Still not great, so I'll leave this draft until we have the necessary passes to improve performance.

@maleadt maleadt mentioned this pull request Apr 1, 2026

maleadt commented Apr 1, 2026

Replacing the indexing kernel with a call to cuMemcpy2D brings the overhead down to 1.17x.
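For context, a 2D memcpy copies a rectangle of `height` rows of `width` bytes between two buffers that may have different row pitches, which is why it can stand in for a hand-written strided copy. The sketch below models only those semantics in plain Python. It is an assumption that the indexing kernel was extracting such a strided block (the PR does not say), and `memcpy_2d` is a hypothetical helper, not the driver API signature.

```python
def memcpy_2d(dst, dst_pitch, src, src_pitch, width, height):
    """Copy `height` rows of `width` bytes between pitched buffers."""
    for row in range(height):
        s = row * src_pitch  # byte offset of this row in the source
        d = row * dst_pitch  # byte offset of this row in the destination
        dst[d:d + width] = src[s:s + width]

# Source: 3 rows with a pitch of 4 bytes; copy the first 2 bytes of each row
# into a densely packed destination (pitch of 2 bytes).
src = bytes(range(12))
dst = bytearray(6)
memcpy_2d(dst, 2, src, 4, 2, 3)
# dst now holds the bytes 0, 1, 4, 5, 8, 9
```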

maleadt and others added 7 commits April 2, 2026 11:20
Port of cuTile Python's MoE.py sample with two kernels:
- fused_moe_kernel: tiled matmul with gather/scatter for expert routing
- silu_and_mul_kernel: element-wise SiLU activation

The Julia fused_moe_kernel uses opt_level=0 as a workaround for a
tileiras optimizer crash caused by token loop-carries from gather ops.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Use order=(2,1,3) on ct.load for the B tile to produce a dim_map in the
partition view, eliminating the explicit permute [1,0] in the Tile IR loop
body. This matches Python cuTile's order=(0,2,1) approach.
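For reference, the two order tuples describe the same permutation once the index base and axis order are accounted for. A minimal sketch, assuming Julia axes are the reverse of Python's (column-major vs. row-major) and 1-based rather than 0-based; `julia_order_to_python` is a hypothetical helper for illustration, not part of either cuTile package:

```python
def julia_order_to_python(order):
    """Map a 1-based, column-major dimension order to the equivalent
    0-based, row-major order by reversing both positions and axis numbers."""
    n = len(order)
    return tuple(n - order[n - 1 - i] for i in range(n))

print(julia_order_to_python((2, 1, 3)))  # (0, 2, 1)
print(julia_order_to_python((1, 2, 3)))  # (0, 1, 2), the identity order
```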

The permute was generating 96 extra PRMT (byte permute) and 8 STS
(shared memory store) instructions per loop iteration, plus 6x more L1
cache traffic from the shared memory round-trip.

SASS instruction counts (fused_moe v1, no mul_routed_weight):
  Before: 7,920
  After:  5,712
  Python: 5,792
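As a quick check on the counts above (plain arithmetic on the numbers listed, nothing measured here):

```python
before, after, python = 7920, 5712, 5792

# Relative reduction from removing the explicit permute.
reduction = (before - after) / before
# How far the Julia kernel now sits below the Python instruction count.
below_python = (python - after) / python

print(f"{reduction:.1%} fewer instructions than before")   # 27.9%
print(f"{below_python:.1%} fewer instructions than Python")  # 1.4%
```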

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Layer norm forward improved from -9% to -5% (algebra rule improvements
help its index patterns). Add MoE benchmark (-8% vs Python). All other
kernels unchanged within measurement noise.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

maleadt commented Apr 2, 2026

With #166 and #168 (enabled by #167) performance is good here; 8% slower than Python, and those improvements also sped up LayerNorm from -9% to -5%.

maleadt marked this pull request as ready for review April 2, 2026 09:24
maleadt merged commit 416f3b3 into main Apr 2, 2026
9 checks passed
maleadt deleted the tb/moe branch April 2, 2026 09:58