#89 added an alias-aware token ordering pass that correctly separates memory operations by alias set, improving matmul (-15%) and batch matmul (-7%). However, it regresses the layernorm forward kernel by 1.67x (0.24 ms → 0.40 ms at 4096×4096).
Our pass architecture (alias analysis, `LAST_OP`/`LAST_STORE` tracking, eager `join_tokens`) matches cuTile Python's `token_order.py` exactly. The difference is one optimization Python has that we don't: loop parallel store.
When a `TileStore` in a for-loop uses the induction variable as its index (so writes do not overlap across iterations), the store can use the token from before the loop instead of a loop-carried token. This breaks the dependency chain through the loop. Once the store is parallelized, the token carries and joins for read-only alias sets become dead code, and DCE removes them.
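The legality check above can be sketched as follows. This is a hypothetical toy model, not the real pass: `TileStore`, `can_parallelize_store`, and the symbolic index representation are illustrative stand-ins, and the real `_try_loop_parallel_store` additionally proves the index is monotone in the induction variable rather than merely depending on it.

```python
from dataclasses import dataclass

@dataclass
class TileStore:
    array: str       # destination array name
    index_vars: set  # variables appearing in the store's index expression

def can_parallelize_store(store, induction_var, other_loop_stores):
    """A store may take the token from *before* the loop (breaking the
    loop-carried token chain) when:
      1. its index depends on the induction variable, so each iteration
         writes a distinct slice, and
      2. no other store in the loop writes the same array (alias set),
         which would reintroduce an ordering requirement.
    This only mirrors the intent of the real check; proving the slices
    are truly disjoint requires monotonicity, not modeled here."""
    if induction_var not in store.index_vars:
        return False
    return not any(s.array == store.array for s in other_loop_stores)

# Layernorm's normalize loop stores Y[i] each iteration:
y_store = TileStore("Y", {"i"})
print(can_parallelize_store(y_store, "i", []))  # True: disjoint slices
```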
Without this optimization, layernorm's normalize loop (loads X, W, B; stores Y — 4 non-aliasing arrays) generates 5 loop-carried tokens with `join_tokens` after every load. Python generates 0.
Benchmark impact (RTX 5080):
| Kernel | Before pass | After pass | Δ |
|---|---|---|---|
| Layernorm fwd | 0.24 ms | 0.40 ms | +67% (regression) |
| Matrix Multiply | 3.73 ms | 3.19 ms | -15% (improvement) |
| Batch MatMul | 0.61 ms | 0.57 ms | -7% (improvement) |
Fix: Port `_try_loop_parallel_store` from `res/cutile-python/src/cuda/tile/_passes/token_order.py` (lines 425–541) and add a DCE pass on the structured IR after token ordering. See `PLAN2.md` for details.
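The follow-up DCE step could look roughly like this. It is a toy sketch, not the real structured-IR pass: `dce_loop_tokens`, the token names, and the `uses` map are all hypothetical. The idea is that a loop-carried token whose only consumer is the loop's own yield is dead once the store that needed it has been parallelized.

```python
def dce_loop_tokens(carried_tokens, uses):
    """carried_tokens: token names carried through the loop.
    uses: dict mapping token -> set of consuming ops, where 'yield'
    marks the loop's own back-edge. A token with no consumer other
    than the yield is dead and its carry can be dropped. (The real
    pass would iterate to a fixed point, since removing one carry
    can make a feeding join_tokens dead as well.)"""
    return [t for t in carried_tokens if uses.get(t, set()) - {"yield"}]

# After parallelizing the Y store, every read token in layernorm's
# normalize loop feeds only the yield, so all carries are removed —
# matching the 0 loop-carried tokens Python generates:
tokens = ["tok_X", "tok_W", "tok_B", "tok_Y"]
uses = {t: {"yield"} for t in tokens}
print(dce_loop_tokens(tokens, uses))  # []
```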