Avoid full-context routed-expert padding #1794
Open
xeophon wants to merge 1 commit into
Approvability verdict: Needs human review. This change modifies how expert routing arrays are padded during model inference, shifting from upfront full-context padding to lazy final-node padding. Although it is an optimization, it changes processing logic in ML infrastructure and warrants verification.
Overview
Keep the V1 routed-expert payload unchanged and construct the engine's one-row-short padding only for the final node slice. Node alignment and owned-array behavior stay the same, while the transient work scales with the retained suffix instead of the full prompt context.
Design
The engine may omit routing for the final token because no forward pass follows it. Attribution still repeats the last available routing row, but now does so when the final token-bearing node ends exactly one row beyond the payload. Complete slices continue to receive their existing owned copies, and empty or out-of-range slices remain unset.
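The slicing rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in graph.py: the helper name `slice_for_node` and its parameters (`off`, `count`, `is_final`) are hypothetical, and treating a final node that starts at or past the end of the payload as out of range is also an assumption.

```python
import numpy as np


def slice_for_node(arr, off, count, is_final):
    """Return an owned routed-expert slice for one node, or None if unset.

    Hypothetical helper: `arr` is the unpadded routing payload, `off` is the
    node's first row, `count` is its token count, and `is_final` marks the
    final token-bearing node.
    """
    rows = arr.shape[0]
    end = off + count
    if count <= 0 or off < 0 or off >= rows:
        # Empty or out-of-range slices remain unset (assumed to mean None).
        return None
    if end <= rows:
        # Complete slice: hand back an owned copy, not a view of the payload.
        return arr[off:end].copy()
    if is_final and end == rows + 1:
        # The payload is exactly one row short: repeat the last available
        # row, touching only arr[off:] instead of the whole context.
        return np.concatenate([arr[off:], arr[-1:]], axis=0)
    return None


if __name__ == "__main__":
    payload = np.arange(5 * 2 * 2, dtype=np.uint8).reshape(5, 2, 2)
    padded = slice_for_node(payload, 3, 3, is_final=True)
    print(padded.shape)
```

In this sketch the only temporary is the concatenated suffix, which is also the array the node keeps, so no context-sized intermediate is created.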
This localizes the padding to `arr[off:]` plus the final row. It avoids materializing a padded full-context array that is immediately discarded after new-node attribution.

Performance
A PEP 723 benchmark used a `uint8` routed-expert tensor shaped `(1,000,000, 8, 4)`, with 900,000 reused rows and 100,001 retained new rows.

The retained result remains 3,200,032 bytes in both cases (before and after the change); the savings come entirely from removing the context-sized temporary copy.
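The sizes involved follow directly from the tensor shape. The arithmetic below is a back-of-the-envelope estimate derived from the stated shape and row counts, not a reproduction of the benchmark or its measurements; the variable names are illustrative.

```python
# One uint8 routing row holds 8 * 4 entries at one byte each.
ROW_BYTES = 8 * 4

payload_rows = 1_000_000   # rows present in the routed-expert payload
reused_rows = 900_000      # rows attributed to reused nodes
new_rows = 100_001         # rows retained for new nodes, incl. the padded row

# The token count exceeds the payload by exactly one row.
assert reused_rows + new_rows == payload_rows + 1

# Size of the retained new-node result (same before and after the change).
retained_bytes = new_rows * ROW_BYTES

# Estimated size of a padded full-context temporary: every payload row
# plus one repeated row.
full_padded_temp_bytes = (payload_rows + 1) * ROW_BYTES

print(retained_bytes)          # 3200032
print(full_padded_temp_bytes)  # 32000032
```

By this estimate the avoided temporary is roughly ten times the size of the retained result, which matches the description of transient work scaling with the retained suffix rather than the full context.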
Note
Fix full-context padding in `_attribute_routed_experts` to pad only the final node

Previously, when the routing array was one row short, the entire array was padded globally by appending a copy of the last row. Now, padding is applied locally and only to the final node that consumes the missing row, keeping all other nodes' `routed_experts` slices aligned to their actual token counts.

Changes are isolated to graph.py.
Macroscope summarized fe33a92.