Found a workaround that improves performance from a 6x slowdown to 1.6x. That's still not great, so I'll leave this as a draft until we have the necessary passes to improve performance.
Replacing the indexing kernel with a call to |
Port of cuTile Python's MoE.py sample with two kernels:
- fused_moe_kernel: tiled matmul with gather/scatter for expert routing
- silu_and_mul_kernel: element-wise SiLU activation

The Julia fused_moe_kernel uses opt_level=0 as a workaround for a tileiras optimizer crash caused by token loop-carries from gather ops.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
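For readers unfamiliar with the two kernels, here is a minimal CPU reference of the math they compute, written in plain Python. It is a sketch under assumptions, not the cuTile implementation: the names `silu_and_mul`, `matvec`, and `moe_reference` are hypothetical, the gate/up split (first half gated, second half multiplied) follows the common convention and is assumed here, and the tiling, token sorting, and padding that the real fused kernel performs are left out.

```python
import math

def silu(x):
    # SiLU (swish) activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def silu_and_mul(row):
    # Assumed layout: first half of the row is the gate, second half is the "up" values.
    half = len(row) // 2
    return [silu(g) * u for g, u in zip(row[:half], row[half:])]

def matvec(w, x):
    # w is (out_features x in_features) as nested lists, x is a vector of in_features.
    return [sum(wi * xi for wi, xi in zip(w_row, x)) for w_row in w]

def moe_reference(tokens, expert_weights, topk_ids, topk_weights):
    # For each token: gather its selected experts, apply each expert's matmul,
    # and scatter the routing-weighted results back into that token's output row.
    out = []
    for t, x in enumerate(tokens):
        acc = [0.0] * len(expert_weights[0])
        for e, w in zip(topk_ids[t], topk_weights[t]):
            y = matvec(expert_weights[e], x)
            acc = [a + w * yi for a, yi in zip(acc, y)]
        out.append(acc)
    return out
```

A reference like this can serve as a correctness oracle for small inputs when comparing the Julia and Python kernels.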
Use order=(2,1,3) on ct.load for the B tile to produce a dim_map in the partition view, eliminating the explicit permute [1,0] in the Tile IR loop body. This matches Python cuTile's order=(0,2,1) approach.

The permute was generating 96 extra PRMT (byte permute) and 8 STS (shared memory store) instructions per loop iteration, plus 6x more L1 cache traffic from the shared memory round-trip.

SASS instruction counts (fused_moe v1, no mul_routed_weight):
  Before: 7,920
  After:  5,712
  Python: 5,792

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
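The idea behind replacing the permute with a dimension order on the load can be illustrated with a simplified 2D sketch in plain Python. This is not the cuTile API: `load_then_permute` and `load_with_dim_map` are hypothetical names, the buffer is assumed to be row-major, and `order` is 0-based here. The first function copies the tile and then materializes a transposed copy (analogous to the shared memory round-trip); the second folds the permutation into the address computation and reads each element once.

```python
def load_then_permute(buf, rows, cols):
    # Step 1: copy the tile as stored (rows x cols, row-major).
    tile = [[buf[r * cols + c] for c in range(cols)] for r in range(rows)]
    # Step 2: materialize the transpose as a second copy (the explicit permute).
    return [[tile[r][c] for r in range(rows)] for c in range(cols)]

def load_with_dim_map(buf, rows, cols, order=(1, 0)):
    # Fold the permutation into the strides: one pass, no intermediate tile.
    shape = (rows, cols)
    strides = (cols, 1)  # row-major strides of the stored tile
    out_shape = tuple(shape[d] for d in order)
    out_strides = tuple(strides[d] for d in order)
    return [
        [buf[i * out_strides[0] + j * out_strides[1]] for j in range(out_shape[1])]
        for i in range(out_shape[0])
    ]
```

Both functions return the same transposed tile; the difference is only in how many times the data is moved, which is where the extra PRMT and STS instructions came from in the permute version.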
Layer norm forward improved from -9% to -5% (algebra rule improvements help its index patterns). Add MoE benchmark (-8% vs Python). All other kernels unchanged within measurement noise.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
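Port of cuTile Python's MoE.py sample with two kernels:
- fused_moe_kernel: tiled matmul with gather/scatter for expert routing
- silu_and_mul_kernel: element-wise SiLU activation

The Julia fused_moe_kernel uses opt_level=0 as a workaround for a tileiras optimizer crash (NVIDIA/cuda-tile#17). Because of that, performance isn't great, so I'm marking this as a draft.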