#89 added an alias-aware token ordering pass that correctly separates memory operations by alias set, improving matmul (-15%) and batch matmul (-7%). However, it regresses the layernorm forward kernel by 1.67x (0.24 ms → 0.40 ms at 4096×4096).
Our pass architecture (alias analysis, `LAST_OP`/`LAST_STORE` tracking, eager `join_tokens`) matches cuTile Python's `token_order.py` exactly. The difference is one optimization Python has that we don't: loop parallel store.
When a `TileStore` in a for-loop uses the induction variable as its index (so writes do not overlap across iterations), the store can use the token from before the loop instead of a loop-carried token. This breaks the dependency chain through the loop. Once the store is parallelized, the token carries and joins for read-only alias sets become dead code, and DCE removes them.
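The legality check above can be sketched as follows. This is a hypothetical toy model, not the real pass: `TileStore`, `can_parallelize_store`, and the symbolic index representation are illustrative stand-ins, and the real `_try_loop_parallel_store` additionally proves the index is monotone in the induction variable rather than merely depending on it.

```python
from dataclasses import dataclass

@dataclass
class TileStore:
    array: str       # destination array name
    index_vars: set  # variables appearing in the store's index expression

def can_parallelize_store(store, induction_var, other_loop_stores):
    """A store may take the token from *before* the loop (breaking the
    loop-carried token chain) when:
      1. its index depends on the induction variable, so each iteration
         writes a distinct slice, and
      2. no other store in the loop writes the same array (alias set),
         which would reintroduce an ordering requirement.
    This only mirrors the intent of the real check; proving the slices
    are truly disjoint requires monotonicity, not modeled here."""
    if induction_var not in store.index_vars:
        return False
    return not any(s.array == store.array for s in other_loop_stores)

# Layernorm's normalize loop stores Y[i] each iteration:
y_store = TileStore("Y", {"i"})
print(can_parallelize_store(y_store, "i", []))  # True: disjoint slices
```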
Without this optimization, layernorm's normalize loop (loads X, W, B; stores Y — 4 non-aliasing arrays) generates 5 loop-carried tokens with `join_tokens` after every load. Python generates 0.
Benchmark impact (RTX 5080):
| Kernel | Before pass | After pass | Δ |
|---|---|---|---|
| Layernorm fwd | 0.24 ms | 0.40 ms | +67% (regression) |
| Matrix Multiply | 3.73 ms | 3.19 ms | -15% (improvement) |
| Batch MatMul | 0.61 ms | 0.57 ms | -7% (improvement) |
Fix: Port `_try_loop_parallel_store` from `res/cutile-python/src/cuda/tile/_passes/token_order.py` (lines 425–541) and add a DCE pass on the structured IR after token ordering. See `PLAN2.md` for details.
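The follow-up DCE step could look roughly like this. It is a toy sketch, not the real structured-IR pass: `dce_loop_tokens`, the token names, and the `uses` map are all hypothetical. The idea is that a loop-carried token whose only consumer is the loop's own yield is dead once the store that needed it has been parallelized.

```python
def dce_loop_tokens(carried_tokens, uses):
    """carried_tokens: token names carried through the loop.
    uses: dict mapping token -> set of consuming ops, where 'yield'
    marks the loop's own back-edge. A token with no consumer other
    than the yield is dead and its carry can be dropped. (The real
    pass would iterate to a fixed point, since removing one carry
    can make a feeding join_tokens dead as well.)"""
    return [t for t in carried_tokens if uses.get(t, set()) - {"yield"}]

# After parallelizing the Y store, every read token in layernorm's
# normalize loop feeds only the yield, so all carries are removed —
# matching the 0 loop-carried tokens Python generates:
tokens = ["tok_X", "tok_W", "tok_B", "tok_Y"]
uses = {t: {"yield"} for t in tokens}
print(dce_loop_tokens(tokens, uses))  # []
```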