Reuse rollout token counts across limit checks#1799

Open
xeophon wants to merge 1 commit into feat/nano-as-v1 from codex/reuse-rollout-token-counts

@xeophon xeophon commented Jun 21, 2026

Member

Overview

Reduce synchronous token-limit overhead by avoiding repeated reconstruction of the same derived branch paths. The trace node graph remains the source of truth, limit precedence and soft-cap behavior are unchanged, and the commit only changes the production interception server.

What changed

  • Return immediately when no token cap is configured.
  • Count directly from Trace.nodes when canonical append order proves the graph is a single root-to-leaf path.
  • Materialize Trace.branches once for compacted, subagent, or otherwise non-linear graphs, then reuse that view across enabled input, output, and total token checks.
  • Preserve the existing max_turns → input → output → total precedence and >= boundaries.
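
The control flow described by the bullets above can be sketched as follows. This is a minimal illustration rather than the code in `server.py`: `LimitsSketch`, its field names, and the `count_tokens` callable are hypothetical, and checking `max_turns` before the no-cap early return is an assumption. The point is that token counts are derived at most once and then reused for every enabled cap.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class LimitsSketch:
    """Hypothetical stand-in for the real limit configuration."""

    max_turns: Optional[int] = None
    max_input_tokens: Optional[int] = None
    max_output_tokens: Optional[int] = None
    max_total_tokens: Optional[int] = None

    def reached(
        self,
        turns: int,
        count_tokens: Callable[[], Tuple[int, int, int]],
    ) -> Optional[str]:
        # max_turns has the highest precedence and needs no token counting.
        if self.max_turns is not None and turns >= self.max_turns:
            return "max_turns"

        token_caps = (
            self.max_input_tokens,
            self.max_output_tokens,
            self.max_total_tokens,
        )
        # Early return: with no token cap configured, never touch the graph.
        if all(cap is None for cap in token_caps):
            return None

        # Derive (input, output, total) once and reuse it for every check.
        input_tokens, output_tokens, total_tokens = count_tokens()
        checks = (
            ("input", input_tokens, self.max_input_tokens),
            ("output", output_tokens, self.max_output_tokens),
            ("total", total_tokens, self.max_total_tokens),
        )
        for name, value, cap in checks:
            # A cap is reached when the count is equal to or above it.
            if cap is not None and value >= cap:
                return name
        return None
```

Passing the counter as a callable keeps the sketch independent of the trace structure; in the PR, the equivalent step is choosing between the linear node scan and the single branch snapshot.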

Why

Trace.nodes stores each message once, while Trace.branches is an uncached derived view built by finding leaves and walking each parent chain. The previous input, output, and total properties each requested that view independently. When several token caps were enabled and still below their thresholds, the same graph paths were reconstructed up to three times.
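
To make the cost concrete, here is a rough sketch of a derived branch view of the kind described above: find the leaves, then walk each parent chain back to the root. The `Node` class and `build_branches` function are hypothetical and assume parents are stored as indices into the node list; the real `Trace.branches` may differ in detail.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical node: parent is an index into the node list, None for the root."""

    parent: Optional[int]


def build_branches(nodes: List[Node]) -> List[List[int]]:
    """Return one root-to-leaf path of node indices per leaf."""
    # A leaf is any node that no other node names as its parent.
    parents = {node.parent for node in nodes if node.parent is not None}
    leaves = [i for i in range(len(nodes)) if i not in parents]

    branches: List[List[int]] = []
    for leaf in leaves:
        path: List[int] = []
        current: Optional[int] = leaf
        # Walk the parent chain up to the root, then flip to root-first order.
        while current is not None:
            path.append(current)
            current = nodes[current].parent
        path.reverse()
        branches.append(path)
    return branches
```

Shared trunk nodes are revisited once per branch, so a single call already does more work than one pass over the nodes. Requesting the view separately for input, output, and total repeats that whole walk each time.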

The canonical linear case can safely use the existing node order without allocating a branch view. Other graph shapes continue through the established branch abstraction, but share one snapshot rather than rebuilding it for every count. This keeps arbitrary node ordering and branching semantics on the existing path.
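
A sketch of the linear fast path, under stated assumptions: the `Node` fields below are hypothetical, and the convention that a mask value of 1 marks output tokens and 0 marks input tokens is assumed for illustration only. The check accepts a graph only when every node's parent is the previous index, so a chain stored in any other order falls through to the branch path.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical node with per-token mask (assumed: 1 = output, 0 = input)."""

    parent: Optional[int]
    token_ids: List[int]
    mask: List[int]


def is_canonical_linear(nodes: List[Node]) -> bool:
    """True when node 0 is the root and node i's parent is node i - 1."""
    return all(
        node.parent == (i - 1 if i > 0 else None)
        for i, node in enumerate(nodes)
    )


def count_linear(nodes: List[Node]) -> Tuple[int, int, int]:
    """Return (input, output, total) token counts in one pass over the nodes."""
    total = sum(len(node.token_ids) for node in nodes)
    output = sum(sum(node.mask) for node in nodes)
    return total - output, output, total
```

Because the check is purely index-based, it needs no leaf search and allocates no path lists, which matches the near-zero peak allocation reported for the linear cases.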

Performance

Measurements use median time.perf_counter() wall time with GC before each repetition; peak Python allocation is measured separately with tracemalloc so tracing overhead does not affect timings.
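
The benchmark scripts are not part of the commit, so the following is only a sketch of the methodology described above, not the harness that produced the numbers. The helper names and the default repeat count are hypothetical. Timing and allocation tracing run in separate passes so that `tracemalloc` overhead never lands in the wall-time samples.

```python
import gc
import statistics
import time
import tracemalloc
from typing import Callable


def median_wall_ms(fn: Callable[[], object], repeats: int = 5) -> float:
    """Median wall time of fn in milliseconds, with a GC pass before each run."""
    samples = []
    for _ in range(repeats):
        gc.collect()  # start every repetition from a collected heap
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def peak_alloc_mib(fn: Callable[[], object]) -> float:
    """Peak Python allocation during one run of fn, in MiB."""
    gc.collect()
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / (1024 * 1024)
```

For example, `median_wall_ms(lambda: limits.reached(trace))` and `peak_alloc_mib(lambda: limits.reached(trace))` would be called independently for the before and after builds, where `limits` and `trace` stand in for whatever objects the real benchmark constructs.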

| Workload | Before | After | Time saved | Peak allocation |
| --- | --- | --- | --- | --- |
| 200k-node linear stress case, 1 token/node | 88.784 ms | 23.526 ms | 65.258 ms (73.5%) | 12.00 MiB → below 0.01 MiB display precision |
| 2k nodes / 1k shared-trunk branches, 1 token/node | 217.673 ms | 125.654 ms | 92.019 ms (42.3%) | 8.23 MiB → 8.23 MiB |
| 2k-node linear long-horizon case, 32 tokens/node | 0.732 ms | 0.315 ms | 0.417 ms (57.0%) | 0.16 MiB → below 0.01 MiB display precision |
| 2k nodes / 10 branches, 32 tokens/node | 5.341 ms | 3.600 ms | 1.741 ms (32.6%) | 0.19 MiB → 0.19 MiB |
| 10k nodes / 10 branches, 32 tokens/node | 26.767 ms | 18.172 ms | 8.595 ms (32.1%) | 0.93 MiB → 0.93 MiB |

The branched peak remains unchanged because both paths hold at most one full branch snapshot at a time; the saving is reduced allocation churn and graph-walk CPU from eliminating additional snapshots. Since limit checks run synchronously from the interception session, the wall-time reduction also shortens the corresponding event-loop stall.

At higher token density, mask summation becomes a larger share of the work: the 2k-node / 10-branch case at 128 tokens per node measured 8.711 ms → 6.964 ms, saving 1.747 ms (20.1%). This is expected because the change targets graph reconstruction rather than token-mask arithmetic.

Scope

The commit contains only verifiers/v1/interception/server.py. Benchmark scripts, focused test scaffolding, project metadata, and lockfiles are intentionally excluded.


Note

Medium Risk
Limit enforcement semantics must stay aligned with existing branch-based totals on branched traces; mistakes would allow extra turns or stop rollouts early.

Overview
RolloutLimits.reached no longer relies on Trace’s prompt_len / completion_len / total_tokens helpers, which each walked Trace.branches independently when several token caps were enabled.

It returns immediately when no token caps are set. For traces in canonical linear append order (each node’s parent is the previous index), it counts from trace.nodes directly and skips building a branch view. For branched or non-linear graphs, it builds trace.branches once and sums per-branch lengths for input, output, and total checks.

max_turns precedence and >= stop boundaries are unchanged; the only change is how counts are derived for the synchronous pre-turn check in the interception server.

Reviewed by Cursor Bugbot for commit af6db01.

Note

Reuse rollout token counts across limit checks in RolloutLimits.reached

  • Adds an early return in RolloutLimits.reached when all token caps are None, avoiding unnecessary computation.
  • For traces forming a single linear chain, computes token counts directly from node-level data (node.token_ids lengths and masked token counts) rather than trace-level aggregates.
  • For non-linear (branched) traces, replaces single trace-level aggregates (trace.prompt_len, etc.) with sums across trace.branches.
  • Behavioral Change: token cap comparisons now use different aggregation paths depending on graph topology, which may produce different values than before for branched traces.

Macroscope summarized af6db01.

macroscopeapp Bot commented Jun 21, 2026

Approvability

Verdict: Needs human review

This PR changes how token limits are calculated during rollouts - from using trace-level properties directly to summing across branches. Since this modifies limit-checking logic that controls when rollouts stop, the behavioral implications warrant human review.
