The previous LICM pass hoisted all loop-invariant operations (arithmetic, broadcasts, view constructors, etc.), all of which are marked Pure in the MLIR Tile IR dialect and are already hoisted by MLIR's built-in LICM at optLevel >= 2. Benchmarks confirmed zero performance difference when the pass was disabled entirely.

The new pass focuses on what MLIR structurally cannot do: hoisting memory loads out of loops. After token ordering, loads have token dependencies that anchor them inside loops. By hoisting before token insertion, we avoid creating unnecessary token carries.

Key changes:
- Run alias_analysis_pass! before licm_pass! (was after)
- Only hoist loads, not pure ops (MLIR handles those)
- Verify alias safety: a load is only hoisted when no store in the loop body writes to an overlapping alias set
- Simplified from 200 to 150 lines with clearer structure

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
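The alias-safety rule in the list above can be illustrated with a small Python sketch. It is not the code from this commit: the dictionary-based op representation and the names `hoistable_loads`, `kind`, `alias`, and `operands` are hypothetical, and the sketch assumes a flat loop body with no nested control flow.

```python
def hoistable_loads(loop_body, defined_inside):
    """Return names of loads that could leave the loop under the rule above.

    Each op is a dict with hypothetical keys:
      name     - result name
      kind     - "load", "store", or anything else
      operands - names of values the op reads
      alias    - set of alias-set ids touched (loads and stores only)
    """
    # Collect the alias sets written by every store in the loop body.
    store_sets = [op["alias"] for op in loop_body if op["kind"] == "store"]

    result = []
    for op in loop_body:
        if op["kind"] != "load":
            continue  # this variant only hoists loads
        if any(o in defined_inside for o in op["operands"]):
            continue  # address depends on a value computed inside the loop
        if any(op["alias"] & s for s in store_sets):
            continue  # a store in the loop may overwrite what this load reads
        result.append(op["name"])
    return result


body = [
    {"name": "ld_a", "kind": "load", "operands": ["view_a"], "alias": {"A"}},
    {"name": "ld_b", "kind": "load", "operands": ["view_b"], "alias": {"B"}},
    {"name": "st_b", "kind": "store", "operands": ["view_b", "ld_a"], "alias": {"B"}},
]
inside = {op["name"] for op in body}
print(hoistable_loads(body, inside))  # ['ld_a']
```

Here `ld_b` stays in the loop because `st_b` writes to the same alias set, while `ld_a` reads memory that no store in the loop touches.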
The previous LICM only targeted loads and ran before token ordering, but failed to hoist anything because load dependencies (make_partition_view, Core.tuple) were always generated inline inside the loop body.

The new approach mirrors cuTile Python's code_motion.py: run after token_order_pass! and hoist ALL loop-invariant operations based on data dependencies. Token dependencies naturally prevent unsafe hoisting of loads that alias with stores, so no separate alias analysis is needed for LICM.

This correctly hoists loop-invariant loads and their entire dependency chain (tensor_view → partition_view → load → reshape → broadcast).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rewrite LICM from a 200-line stack-based depth-tracking algorithm to a simple fixpoint loop using IRStructurizer's is_defined_outside, move_before!, and operands. Processes innermost loops first (post-order), repeatedly hoisting ops whose operands are all defined outside the loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
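As a rough illustration of the fixpoint described in this commit message, here is a minimal Python sketch on a toy IR. It does not use IRStructurizer: the `Op` class, its fields, and the helpers `defined_inside` and `licm` are hypothetical stand-ins for is_defined_outside, move_before!, and operands. It assumes every value has a unique name and that token dependencies are modelled as ordinary operands, so an op with a loop-carried token is never hoisted.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Op:
    result: str                                 # name of the value this op defines
    operands: list                              # names of the values it reads
    args: list = field(default_factory=list)    # loop block arguments (e.g. induction variable)
    body: list = None                           # non-None marks a loop and holds its ops


def defined_inside(loop):
    """Names of all values defined in the loop, including nested loops."""
    names = set(loop.args)
    for op in loop.body:
        names.add(op.result)
        if op.body is not None:
            names |= defined_inside(op)
    return names


def licm(block):
    """Hoist loop-invariant ops in `block`, innermost loops first."""
    i = 0
    while i < len(block):
        op = block[i]
        if op.body is not None:
            licm(op.body)  # post-order: handle nested loops before this one
            changed = True
            while changed:  # repeat until no op can be hoisted any more
                changed = False
                inside = defined_inside(op)
                for inner in list(op.body):
                    if inner.body is not None:
                        continue  # this sketch never hoists whole loops
                    if any(o in inside for o in inner.operands):
                        continue  # depends on something computed in the loop
                    op.body.remove(inner)
                    block.insert(i, inner)  # place it just before the loop
                    i += 1
                    changed = True
        i += 1


loop = Op("loop", [], args=["i"], body=[
    Op("partition_view", ["tensor_view"]),
    Op("load", ["partition_view"]),
    Op("reshape", ["load"]),
    Op("acc", ["reshape", "i"]),
])
program = [Op("tensor_view", ["arg"]), loop]
licm(program)
print([op.result for op in program])
# ['tensor_view', 'partition_view', 'load', 'reshape', 'loop']
print([op.result for op in loop.body])
# ['acc']
```

Each pass of the inner `while` peels one link off the dependency chain, which is why the whole chain ends up in front of the loop while `acc`, which reads the induction variable, stays put.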
Even though this doesn't improve performance of any of the examples we have, it's what cuTile Python does, and with some IRStructurizer utilities the implementation is really short. So let's merge this.
This was an experiment for #163. The for loops we generate contain a `broadcast`+`reshape` coming from `something .+ one(T)` (oh how I've come to hate 1-based indexing), but hoisting it outside the loop doesn't improve performance, so I'm not sure we want this. cuTile Python does have it though, and the implementation here is based on that.

Depends on maleadt/IRStructurizer.jl#24