Avoid full-context routed-expert padding #1794

Open
xeophon wants to merge 1 commit into feat/nano-as-v1 from codex/avoid-routed-expert-padding
xeophon (Member) commented Jun 21, 2026

Overview

Keep the V1 routed-expert payload unchanged and construct the engine's one-row-short padding only for the final node slice. Node alignment and owned-array behavior stay the same, while the transient work scales with the retained suffix instead of the full prompt context.

Design

The engine may omit routing for the final token because no forward pass follows it. Attribution still repeats the last available routing row, but now does so when the final token-bearing node ends exactly one row beyond the payload. Complete slices continue to receive their existing owned copies, and empty or out-of-range slices remain unset.

This localizes the padding to `arr[off:]` plus the final row. It avoids materializing a padded full-context array that is immediately discarded after new-node attribution.
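As a rough illustration of the slicing rule described above, here is a minimal NumPy sketch. The function name `slice_routed_experts` and its parameters (`off`, `end`, `is_final`) are hypothetical and are not taken from `graph.py`; the sketch only assumes that the payload is a NumPy array indexed by token row.

```python
from __future__ import annotations

import numpy as np


def slice_routed_experts(
    arr: np.ndarray, off: int, end: int, is_final: bool
) -> np.ndarray | None:
    """Return an owned routing slice for the node span [off, end), or None.

    Hypothetical helper: names and signature are illustrative assumptions.
    """
    n = arr.shape[0]
    if off < 0 or end <= off:
        return None  # empty slice stays unset
    if end <= n:
        return arr[off:end].copy()  # complete slice: plain owned copy
    if is_final and end == n + 1 and n > 0:
        # Payload is exactly one row short: allocate only the node-sized
        # result and repeat the last available routing row at the end.
        out = np.empty((end - off, *arr.shape[1:]), dtype=arr.dtype)
        out[:-1] = arr[off:]
        out[-1] = arr[-1]
        return out
    return None  # out-of-range slice stays unset


# Usage: a 5-row payload and a final node covering tokens 3..5 (6 tokens total).
arr = np.arange(5 * 2 * 3, dtype=np.uint8).reshape(5, 2, 3)
tail = slice_routed_experts(arr, 3, 6, is_final=True)
print(tail.shape)  # (3, 2, 3)
```

The only allocation in the padded branch is the node-sized output, so the temporary work depends on `end - off` rather than on the full context length.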

Performance

A PEP 723 benchmark used a `uint8` routed-expert tensor shaped (1,000,000, 8, 4), with 900,000 reused rows and 100,001 retained new rows (the 100,000 remaining payload rows plus the repeated final row):

| Measurement | Before | After | Saved |
| --- | --- | --- | --- |
| Median synchronous copy time | 0.495 ms | 0.046 ms | 0.449 ms (90.7%) |
| Peak traced allocation | 33.570 MiB | 3.052 MiB | 30.518 MiB (90.9%) |
| Fresh-process maximum RSS | 98.141 MiB | 67.391 MiB | 30.750 MiB (31.3%) |
| Large-array bytes copied | 35,200,064 B | 3,200,032 B | 32,000,032 B (90.9%) |

The retained result remains 3,200,032 bytes in both cases; the savings come entirely from removing the context-sized temporary copy.
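The byte counts in the last table row can be reproduced with plain arithmetic. The sketch below assumes that the old path copied a padded full-context array (payload rows plus one) and then the retained slice, while the new path copies only the retained slice; that breakdown is inferred from the reported totals and is not quoted from the benchmark script.

```python
# Each routing row holds 8 * 4 uint8 values, i.e. 32 bytes.
ROW_BYTES = 8 * 4
PAYLOAD_ROWS = 1_000_000   # rows in the routed-expert payload
RETAINED_ROWS = 100_001    # rows kept for the new node, including the repeated row

retained_bytes = RETAINED_ROWS * ROW_BYTES
padded_context_bytes = (PAYLOAD_ROWS + 1) * ROW_BYTES  # assumed temporary copy

before = padded_context_bytes + retained_bytes
after = retained_bytes
saved = before - after

print(f"before={before:,} B after={after:,} B saved={saved:,} B")
print(f"saved fraction: {saved / before:.1%}")
```

Under these assumptions the script reports 35,200,064 B before, 3,200,032 B after, and 32,000,032 B saved, matching the table.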


Note

Fix full-context padding in `_attribute_routed_experts` to pad only the final node

Previously, when the routing array was one row short, the entire array was padded globally by appending a copy of the last row. Now, padding is applied locally and only to the final node that consumes the missing row, keeping all other nodes' `routed_experts` slices aligned to their actual token counts.

Changes are isolated to `graph.py`.
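To make the before/after contrast concrete, the following sketch compares a global-padding approach with a local one and checks that both yield the same final-node slice. Both helper functions are hypothetical stand-ins written for this comparison; they are not the code in `graph.py`.

```python
import numpy as np


def final_slice_global(arr: np.ndarray, off: int) -> np.ndarray:
    """Assumed old behavior: pad the whole array, then slice the final node."""
    padded = np.concatenate([arr, arr[-1:]], axis=0)  # context-sized temporary
    return padded[off:].copy()


def final_slice_local(arr: np.ndarray, off: int) -> np.ndarray:
    """Assumed new behavior: allocate only the final node's rows."""
    out = np.empty((arr.shape[0] + 1 - off, *arr.shape[1:]), dtype=arr.dtype)
    out[:-1] = arr[off:]
    out[-1] = arr[-1]  # repeat the last available routing row
    return out


rng = np.random.default_rng(0)
arr = rng.integers(0, 64, size=(1000, 8, 4), dtype=np.uint8)
off = 900

old = final_slice_global(arr, off)
new = final_slice_local(arr, off)
print(np.array_equal(old, new))  # True
print(new.shape)  # (101, 8, 4)
```

The two results are identical; the difference is that the global variant allocates a temporary as large as the whole context before slicing.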

Macroscope summarized fe33a92.

macroscopeapp (Bot) commented Jun 21, 2026

Approvability

Verdict: Needs human review

This change modifies how expert routing arrays are padded during model inference, shifting from upfront full-context padding to lazy final-node padding. Although it is an optimization, it changes processing logic in ML infrastructure and therefore warrants verification.
