Honor rollout concurrency for grouped V1 evals by xeophon · Pull Request #1815 · PrimeIntellect-ai/verifiers

xeophon · 2026-06-21T14:00:14Z

Overview

Honor max_concurrent as a rollout-level resource bound for group-scored V1 server evals. A group request owns num_rollouts inseparable rollouts, so the runner now derives request concurrency as max(1, max_concurrent // num_rollouts).

This keeps cross-rollout scoring groups intact, guarantees progress for a group larger than the cap, and leaves independent server rollouts, local eval, resume behavior, and output handling unchanged.

Reasoning

The env-server semaphore previously counted requests. That is equivalent to rollout concurrency for independent scoring, but a group-scored request expands into multiple rollouts inside the worker. With a rollout cap of 128 and group size 8, 128 request permits therefore allowed 1,024 live rollouts.

Scaling group-request permits by group size aligns the server path with the rollout-level meaning of max_concurrent. Floor division intentionally leaves spare capacity for non-divisible group sizes; one oversized group still receives a permit because group scoring cannot split it.
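As a minimal sketch of the arithmetic above (the helper name `group_request_permits` is hypothetical and not taken from runner.py), the permit count can be derived like this:

```python
def group_request_permits(max_concurrent: int, num_rollouts: int) -> int:
    """Request-level permits that keep live rollouts within a rollout cap.

    Hypothetical helper for illustration; the formula is the one quoted in
    the PR description: max(1, max_concurrent // num_rollouts).
    """
    if max_concurrent < 1 or num_rollouts < 1:
        raise ValueError("max_concurrent and num_rollouts must be positive")
    # Floor division leaves spare capacity for non-divisible group sizes;
    # the max(1, ...) floor lets one oversized group still make progress.
    return max(1, max_concurrent // num_rollouts)


for cap, group_size in [(128, 8), (128, 12), (128, 256)]:
    permits = group_request_permits(cap, group_size)
    print(f"cap={cap} group={group_size} permits={permits} "
          f"max live rollouts={permits * group_size}")
```

With a cap of 128, a group size of 8 gives 16 permits (128 live rollouts), a group size of 12 gives 10 permits (120 live rollouts, 8 slots unused), and a group size of 256 gives 1 permit, so that single group runs even though it exceeds the cap.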

Reducing request permits must not suppress elastic worker fan-out, so the broker now counts a run_group request as load n for least-busy dispatch and scale-up. Other requests consume one rollout slot.
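The slot-weighted accounting described above might look roughly like the following. This is an illustrative sketch only: the `Worker` class, the plain-dict payload, and the `run_rollout` method name are assumptions, not the actual pool.py types.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical stand-in for a pool worker; tracks load in rollout slots."""
    name: str
    active_slots: int = 0


def rollout_slots(method: str, payload: dict) -> int:
    """Load of one request: n for run_group, one slot for anything else."""
    if method != "run_group":
        return 1
    n = payload.get("n")
    # Reject missing, non-integer, boolean, or non-positive group sizes
    # instead of letting them distort dispatch and scale-up decisions.
    if isinstance(n, bool) or not isinstance(n, int) or n < 1:
        raise ValueError("run_group payload needs a positive integer n")
    return n


def dispatch(workers: list[Worker], method: str, payload: dict) -> Worker:
    """Send the request to the worker with the fewest active rollout slots."""
    slots = rollout_slots(method, payload)
    worker = min(workers, key=lambda w: w.active_slots)
    worker.active_slots += slots
    return worker


pool = [Worker("a"), Worker("b")]
dispatch(pool, "run_group", {"n": 8})  # goes to "a", which now holds 8 slots
dispatch(pool, "run_rollout", {})      # goes to "b", which now holds 1 slot
dispatch(pool, "run_rollout", {})      # "b" again: 1 slot is less than 8
print([(w.name, w.active_slots) for w in pool])
```

Counting requests instead of slots would treat both workers as equally busy after the first two dispatches, even though one of them is running eight rollouts.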

Performance and resource impact

A scheduler-focused reproduction used 256 groups × 8 rollouts, cap 128, and 64 KiB retained per active rollout. Values are medians from three fresh processes per variant.

| Measurement | Before | After | Saved |
| --- | --- | --- | --- |
| Peak group requests | 128 | 16 | 112 (87.5%) |
| Peak live rollouts | 1,024 | 128 | 896 (87.5%) |
| Peak traced allocation | 65.139 MiB | 9.036 MiB | 56.103 MiB (86.1%) |
| Peak RSS | 195.641 MiB | 124.344 MiB | 71.297 MiB (36.4%) |
| Retained RSS increase | 84.172 MiB | 12.938 MiB | 71.234 MiB (84.6%) |
| Maximum event-loop lag | 46.257 ms | 41.786 ms | 4.471 ms (9.7%) |
| Wall time | 0.068005 s | 0.206936 s | -0.138931 s (-204.3%) |
| Returned traces | 2,048 | 2,048 | unchanged |

The wall-time result is deliberately reported as negative time saved: enforcing the resource cap creates more request waves in an unconstrained synthetic workload. This is a resource-bound correction, not a throughput claim.
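The two peak-count rows can be reproduced in spirit with a small asyncio simulation. This is not the author's reproduction script: the `simulate` function and its structure are assumptions, and it models only permit counting, not memory or timing.

```python
import asyncio


async def simulate(num_groups: int, group_size: int, permits: int) -> tuple[int, int]:
    """Return (peak in-flight groups, peak live rollouts) for a permit count."""
    sem = asyncio.Semaphore(permits)
    live_groups = live_rollouts = 0
    peak_groups = peak_rollouts = 0

    async def rollout() -> None:
        nonlocal live_rollouts, peak_rollouts
        live_rollouts += 1
        peak_rollouts = max(peak_rollouts, live_rollouts)
        await asyncio.sleep(0)  # yield once so sibling rollouts overlap
        live_rollouts -= 1

    async def group() -> None:
        nonlocal live_groups, peak_groups
        async with sem:  # one permit per group request
            live_groups += 1
            peak_groups = max(peak_groups, live_groups)
            # A group owns all of its rollouts; they cannot be split up.
            await asyncio.gather(*(rollout() for _ in range(group_size)))
            live_groups -= 1

    await asyncio.gather(*(group() for _ in range(num_groups)))
    return peak_groups, peak_rollouts


cap, group_size, num_groups = 128, 8, 256
before = asyncio.run(simulate(num_groups, group_size, cap))
after = asyncio.run(simulate(num_groups, group_size, max(1, cap // group_size)))
print("request-counted permits:", before)  # (128, 1024)
print("rollout-counted permits:", after)   # (16, 128)
```

The second run also needs more waves to drain the same 256 groups, which is consistent with the longer wall time reported in the table.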

Note

Medium Risk
Changes scheduling and pool scaling for group-scored server evals only; incorrect slot accounting could under- or over-utilize workers, but auth and data paths are untouched.

Overview
max_concurrent is now enforced as a rollout budget on the V1 env-server path when the taskset uses group scoring, instead of counting one permit per HTTP-style run_group request (which could fan out to many rollouts per worker).

In runner.py, the client-side semaphore for group-scored server evals is set to max(1, max_concurrent // num_rollouts) so in-flight groups stay within the intended rollout cap while still allowing at least one indivisible group to run.

In pool.py, least-busy dispatch, per-worker active load, and elastic scale-up use rollout slots (n from run_group payloads; other methods count as 1), with a global in_flight counter aligned to that model.

Reviewed by Cursor Bugbot for commit 0580585.

Note

Honor rollout concurrency limits for grouped V1 evals

In runner.py, the asyncio.Semaphore for run_eval_server now uses max_concurrent // num_rollouts (minimum 1) when group scoring is required, preventing oversubscription when each run_group request consumes multiple rollout slots.
In pool.py, the broker loop tracks rollout_slots per pending request, decoding run_group payloads as RunGroupRequest to extract n. Worker active counts and the in_flight total are incremented/decremented by slot count rather than request count.
Elastic scaling via _maybe_scale_up now receives the in_flight slot total instead of the raw pending request count, so scale-up decisions reflect actual rollout load.
Behavioral Change: least-busy worker dispatch and scaling thresholds now operate on rollout slots; clusters serving large run_group requests will scale up sooner than before.

Macroscope summarized 0580585.

macroscopeapp · 2026-06-21T14:01:59Z

Approvability

Verdict: Needs human review

This PR modifies concurrency control and resource scaling logic with two unresolved review comments identifying potential issues: broken elastic pool worker fan-out and possible resource exhaustion from unvalidated group sizes. These substantive concerns require human review.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 317c0432ec

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a606aece47

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f2245272d7

Honor rollout concurrency for grouped evals

317c043

chatgpt-codex-connector Bot reviewed Jun 21, 2026

Comment thread verifiers/v1/cli/eval/runner.py

Weight grouped requests in elastic pool

a606aec

chatgpt-codex-connector Bot reviewed Jun 21, 2026

Comment thread verifiers/v1/serve/pool.py Outdated

macroscopeapp Bot reviewed Jun 21, 2026

Comment thread verifiers/v1/serve/pool.py Outdated

Handle malformed grouped pool requests

f224527

chatgpt-codex-connector Bot reviewed Jun 21, 2026

Comment thread verifiers/v1/serve/pool.py Outdated

xeophon added 3 commits June 21, 2026 16:45

Name pool load as rollout slots

ea07ec7

Remove pool regression test

a07ad7d

Validate group load before scaling pool

0580585

xeophon wants to merge 6 commits into feat/nano-as-v1 from codex/v1-group-rollout-cap
