Add a LICM pass by maleadt · Pull Request #165 · JuliaGPU/cuTile.jl

maleadt · 2026-04-01T14:31:01Z

This was an experiment for #163. The for loops we generate contain a broadcast + reshape coming from something .+ one(T) (oh how I've come to hate 1-based indexing), but hoisting it outside the loop doesn't improve performance, so I'm not sure we want this. cuTile Python does have it though, and the implementation here is based on that.

Depends on maleadt/IRStructurizer.jl#24

The previous LICM pass hoisted all loop-invariant operations (arithmetic, broadcasts, view constructors, etc.), all of which are marked Pure in the MLIR Tile IR dialect and already hoisted by MLIR's built-in LICM at optLevel >= 2. Benchmarks confirmed zero performance difference when the pass was disabled entirely.

The new pass focuses on what MLIR structurally cannot do: hoisting memory loads out of loops. After token ordering, loads have token dependencies that anchor them inside loops. By hoisting before token insertion, we avoid creating unnecessary token carries.

Key changes:
- Run alias_analysis_pass! before licm_pass! (was after)
- Only hoist loads, not pure ops (MLIR handles those)
- Verify alias safety: a load is only hoisted when no store in the loop body writes to an overlapping alias set
- Simplified from 200 to 150 lines with clearer structure

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
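
The alias-safety rule from this commit message can be sketched on a toy model. This is only an illustration, not the pass's actual code: `MemOp`, its fields, and `hoistable_loads` are hypothetical names, alias sets are modelled as plain sets of array labels, and a real pass would also have to check that the load's address operands are loop-invariant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    kind: str             # "load" or "store"
    name: str             # hypothetical label for the operation
    alias_set: frozenset  # hypothetical set of arrays the op may touch


def hoistable_loads(loop_body):
    """Return names of loads that no store in the loop body may alias."""
    # Union of everything any store in the loop may write to.
    written = frozenset().union(
        *(op.alias_set for op in loop_body if op.kind == "store")
    )
    # A load is safe only if its alias set is disjoint from the written set.
    return [
        op.name
        for op in loop_body
        if op.kind == "load" and not (op.alias_set & written)
    ]


body = [
    MemOp("load", "load_a", frozenset({"A"})),
    MemOp("load", "load_c", frozenset({"C"})),
    MemOp("store", "store_c", frozenset({"C"})),
]
result = hoistable_loads(body)
```

Here `load_a` is reported as hoistable, while `load_c` is not because `store_c` writes to the same alias set.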

The previous LICM only targeted loads and ran before token ordering, but failed to hoist anything because load dependencies (make_partition_view, Core.tuple) were always generated inline inside the loop body.

The new approach mirrors cuTile Python's code_motion.py: run after token_order_pass! and hoist ALL loop-invariant operations based on data dependencies. Token dependencies naturally prevent unsafe hoisting of loads that alias with stores, so no separate alias analysis is needed for LICM.

This correctly hoists loop-invariant loads and their entire dependency chain (tensor_view → partition_view → load → reshape → broadcast).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
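
To illustrate why token dependencies make a separate alias check unnecessary, here is a minimal sketch under a simplified token model. All names are hypothetical: each op is a `(name, operands)` pair, tokens are treated as ordinary operands, and `tok_loop` stands for a token carried by the loop while `tok_in` comes from outside it.

```python
def invariant_ops(body, loop_defined):
    """Return names of ops whose operands, tokens included, all come from
    outside the loop, in the order they would be hoisted."""
    # Values defined by the loop itself (induction variable, carried tokens)
    # plus every value defined by an op in the body.
    inside = set(loop_defined) | {name for name, _ in body}
    hoisted = []
    changed = True
    while changed:
        changed = False
        for name, operands in body:
            if name in inside and not any(o in inside for o in operands):
                inside.discard(name)  # now counts as defined outside
                hoisted.append(name)
                changed = True
    return hoisted


body = [
    ("view_a", ["A"]),
    ("load_a", ["view_a", "tok_in"]),    # token from outside: hoistable
    ("view_b", ["B"]),
    ("load_b", ["view_b", "tok_loop"]),  # loop-carried token: stays
    ("store_b", ["view_b", "load_b", "tok_loop"]),
]
result = invariant_ops(body, loop_defined=["i", "tok_loop"])
```

In this model `load_b` stays in the loop purely because one of its operands is the loop-carried token, which is the effect the commit message describes.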

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Rewrite LICM from a 200-line stack-based depth-tracking algorithm to a simple fixpoint loop using IRStructurizer's is_defined_outside, move_before!, and operands. Processes innermost loops first (post-order), repeatedly hoisting ops whose operands are all defined outside the loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

maleadt · 2026-04-06T17:49:45Z

Even though this doesn't improve performance of any of the examples we have, it's what cuTile Python does, and with some IRStructurizer utilities the implementation is really short. So let's merge this.

maleadt and others added 5 commits April 6, 2026 11:28

Add LICM pass.

cde62c4

Use IRStructurizer APIs.

e93e199

Drop underscore prefixes from internal LICM names.

c927a47

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

maleadt force-pushed the tb/licm branch from 6bd9133 to c927a47 April 6, 2026 09:32

maleadt marked this pull request as ready for review April 6, 2026 17:49

maleadt merged commit 958aa7d into main Apr 6, 2026
9 of 17 checks passed

maleadt deleted the tb/licm branch April 6, 2026 17:49

maleadt merged 6 commits into main from tb/licm
