Honor rollout concurrency for grouped V1 evals #1815

Open
xeophon wants to merge 6 commits into feat/nano-as-v1 from codex/v1-group-rollout-cap

xeophon (Member) commented Jun 21, 2026

Overview

Honor max_concurrent as a rollout-level resource bound for group-scored V1 server evals. A group request owns num_rollouts inseparable rollouts, so the runner now derives request concurrency as max(1, max_concurrent // num_rollouts).

This keeps cross-rollout scoring groups intact, guarantees progress for a group larger than the cap, and leaves independent server rollouts, local eval, resume behavior, and output handling unchanged.
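
The permit derivation described above can be sketched as follows. This is a minimal illustration, not the code in runner.py: the function name `group_request_permits` and the `group_scored` flag are assumptions made for the example; only the expression `max(1, max_concurrent // num_rollouts)` comes from this PR.

```python
def group_request_permits(max_concurrent: int, num_rollouts: int, group_scored: bool) -> int:
    """Return how many requests may be in flight at once.

    Hypothetical helper: for independent scoring, one request is one rollout,
    so the cap is used as-is. For group scoring, each request owns
    num_rollouts rollouts, so the cap is divided by the group size.
    """
    if num_rollouts < 1:
        raise ValueError("num_rollouts must be at least 1")
    if not group_scored:
        return max_concurrent
    # Floor division keeps live rollouts at or below the cap; max(1, ...)
    # guarantees progress when a single group is larger than the cap.
    return max(1, max_concurrent // num_rollouts)


print(group_request_permits(128, 8, group_scored=True))   # 16
print(group_request_permits(4, 8, group_scored=True))     # 1
print(group_request_permits(128, 8, group_scored=False))  # 128
```

The returned value would typically be passed to something like `asyncio.Semaphore(permits)` on the client side.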

Reasoning

The env-server semaphore previously counted requests. That is equivalent to rollout concurrency for independent scoring, but a group-scored request expands into multiple rollouts inside the worker. With a rollout cap of 128 and group size 8, 128 request permits therefore allowed 1,024 live rollouts.

Scaling group-request permits by group size aligns the server path with the rollout-level meaning of max_concurrent. Floor division intentionally leaves spare capacity for non-divisible group sizes; one oversized group still receives a permit because group scoring cannot split it.
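
As a worked example of the arithmetic in this section, the sketch below computes permits, peak live rollouts, and spare capacity for a few group sizes. The `RolloutBudget` type and `rollout_budget` function are hypothetical names used only for illustration; the group sizes 5 and 200 are made-up inputs chosen to show the non-divisible and oversized cases.

```python
from typing import NamedTuple


class RolloutBudget(NamedTuple):
    permits: int  # concurrent group requests allowed
    live: int     # peak live rollouts = permits * group size
    spare: int    # unused rollout capacity (negative if one group exceeds the cap)


def rollout_budget(max_concurrent: int, num_rollouts: int) -> RolloutBudget:
    permits = max(1, max_concurrent // num_rollouts)
    live = permits * num_rollouts
    return RolloutBudget(permits, live, max_concurrent - live)


# Old behavior from the description: 128 request permits, 8 rollouts each.
print(128 * 8)                   # 1024 live rollouts

print(rollout_budget(128, 8))    # RolloutBudget(permits=16, live=128, spare=0)
print(rollout_budget(128, 5))    # RolloutBudget(permits=25, live=125, spare=3)
print(rollout_budget(128, 200))  # RolloutBudget(permits=1, live=200, spare=-72)
```

The last case shows the one deliberate exception to the cap: a group larger than `max_concurrent` still runs, alone, because it cannot be split.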

Reducing request permits must not suppress elastic worker fan-out, so the broker now counts a run_group request as load n for least-busy dispatch and scale-up. Other requests consume one rollout slot.
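
A rough sketch of slot-weighted least-busy dispatch, under stated assumptions: `SlotBroker`, `rollout_slots`, and the worker ids are hypothetical, and a plain dict stands in for the decoded request payload. It is not the broker in pool.py; it only illustrates counting a `run_group` request as load `n` and everything else as 1.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


def rollout_slots(method: str, payload: dict) -> int:
    """Load contributed by one request, measured in rollout slots."""
    if method == "run_group":
        return max(1, int(payload.get("n", 1)))
    return 1


@dataclass
class SlotBroker:
    active: Dict[str, int]  # worker id -> rollout slots currently assigned
    in_flight: int = 0      # total rollout slots across all workers

    def dispatch(self, method: str, payload: dict) -> Tuple[str, int]:
        slots = rollout_slots(method, payload)
        # Least-busy by slots, not by request count.
        worker = min(self.active, key=self.active.get)
        self.active[worker] += slots
        self.in_flight += slots
        return worker, slots

    def complete(self, worker: str, slots: int) -> None:
        self.active[worker] -= slots
        self.in_flight -= slots


broker = SlotBroker(active={"w0": 0, "w1": 0})
print(broker.dispatch("run_group", {"n": 8}))  # ('w0', 8)
print(broker.dispatch("other", {}))            # ('w1', 1)
print(broker.dispatch("other", {}))            # ('w1', 1), since w1 holds 1 slot vs 8 on w0
print(broker.in_flight)                        # 10
```

With request counting, the third request would have seen both workers as equally busy; with slot counting, the worker holding the group is correctly treated as the busier one.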

Performance and resource impact

A scheduler-focused reproduction used 256 groups × 8 rollouts, cap 128, and 64 KiB retained per active rollout. Values are medians from three fresh processes per variant.

| Measurement | Before | After | Saved |
| --- | --- | --- | --- |
| Peak group requests | 128 | 16 | 112 (87.5%) |
| Peak live rollouts | 1,024 | 128 | 896 (87.5%) |
| Peak traced allocation | 65.139 MiB | 9.036 MiB | 56.103 MiB (86.1%) |
| Peak RSS | 195.641 MiB | 124.344 MiB | 71.297 MiB (36.4%) |
| Retained RSS increase | 84.172 MiB | 12.938 MiB | 71.234 MiB (84.6%) |
| Maximum event-loop lag | 46.257 ms | 41.786 ms | 4.471 ms (9.7%) |
| Wall time | 0.068005 s | 0.206936 s | -0.138931 s (-204.3%) |
| Returned traces | 2,048 | 2,048 | unchanged |

The wall-time result is deliberately reported as negative time saved: enforcing the resource cap creates more request waves in an unconstrained synthetic workload. This is a resource-bound correction, not a throughput claim.
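
The trade-off can be illustrated with a toy asyncio simulation. This is not the reproduction script used for the table above; the function names are hypothetical and `asyncio.sleep(0)` stands in for real rollout work. It only shows how the number of request permits bounds peak live rollouts and increases the number of request waves.

```python
import asyncio
import math


async def _peak_live(groups: int, n: int, permits: int) -> int:
    sem = asyncio.Semaphore(permits)
    live = 0
    peak = 0

    async def run_group() -> None:
        nonlocal live, peak
        async with sem:
            live += n  # a group request expands into n rollouts
            peak = max(peak, live)
            await asyncio.sleep(0)  # placeholder for rollout work
            live -= n

    await asyncio.gather(*(run_group() for _ in range(groups)))
    return peak


def peak_live_rollouts(groups: int, n: int, permits: int) -> int:
    return asyncio.run(_peak_live(groups, n, permits))


def request_waves(groups: int, permits: int) -> int:
    return math.ceil(groups / permits)


# 256 groups of 8 rollouts, cap 128.
print(peak_live_rollouts(256, 8, permits=128))  # 1024 (permits counted per request)
print(peak_live_rollouts(256, 8, permits=16))   # 128  (permits = 128 // 8)
print(request_waves(256, 128), request_waves(256, 16))  # 2 16
```

Going from 2 to 16 waves is consistent with the longer wall time in the synthetic workload, where nothing else limits throughput.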


Note

Medium Risk
Changes scheduling and pool scaling for group-scored server evals only; incorrect slot accounting could under- or over-utilize workers, but auth and data paths are untouched.

Overview
max_concurrent is now enforced as a rollout budget on the V1 env-server path when the taskset uses group scoring, instead of counting one permit per HTTP-style run_group request (which could fan out to many rollouts per worker).

In runner.py, the client-side semaphore for group-scored server evals is set to max(1, max_concurrent // num_rollouts) so in-flight groups stay within the intended rollout cap while still allowing at least one indivisible group to run.

In pool.py, least-busy dispatch, per-worker active load, and elastic scale-up use rollout slots (n from run_group payloads; other methods count as 1), with a global in_flight counter aligned to that model.

Reviewed by Cursor Bugbot for commit 0580585. Bugbot is set up for automated code reviews on this repo.

Note

Honor rollout concurrency limits for grouped V1 evals

  • In runner.py, the asyncio.Semaphore for run_eval_server now uses max(1, max_concurrent // num_rollouts) permits when group scoring is required, preventing oversubscription when each run_group request consumes multiple rollout slots.
  • In pool.py, the broker loop tracks rollout_slots per pending request, decoding run_group payloads as RunGroupRequest to extract n. Worker active counts and the in_flight total are incremented/decremented by slot count rather than request count.
  • Elastic scaling via _maybe_scale_up now receives the in_flight slot total instead of the raw pending request count, so scale-up decisions reflect actual rollout load.
  • Behavioral Change: least-busy worker dispatch and scaling thresholds now operate on rollout slots; clusters serving large run_group requests will scale up sooner than before.
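
To illustrate why slot-based load triggers scale-up sooner, here is a sketch with an invented threshold policy. The function `should_scale_up`, its parameters, and the numbers are assumptions for the example; the actual policy inside `_maybe_scale_up` is not shown in this PR description.

```python
def should_scale_up(load: int, workers: int, target_per_worker: int, max_workers: int) -> bool:
    """Hypothetical policy: add a worker when load exceeds the pool's target capacity."""
    return workers < max_workers and load > workers * target_per_worker


# 16 pending run_group requests with n=8, on a pool of 2 workers.
requests = 16
slots = requests * 8  # 128 rollout slots

print(should_scale_up(requests, workers=2, target_per_worker=32, max_workers=8))  # False
print(should_scale_up(slots, workers=2, target_per_worker=32, max_workers=8))     # True
```

The same workload looks light when measured in requests and heavy when measured in rollout slots, which matches the behavioral change noted in the summary.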

Macroscope summarized 0580585.

macroscopeapp Bot commented Jun 21, 2026

Approvability

Verdict: Needs human review

This PR modifies concurrency control and resource scaling logic with two unresolved review comments identifying potential issues: broken elastic pool worker fan-out and possible resource exhaustion from unvalidated group sizes. These substantive concerns require human review.

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 317c0432ec

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread verifiers/v1/cli/eval/runner.py

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a606aece47

Comment thread verifiers/v1/serve/pool.py Outdated
Comment thread verifiers/v1/serve/pool.py Outdated

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f2245272d7

Comment thread verifiers/v1/serve/pool.py Outdated