Honor rollout concurrency for grouped V1 evals #1815
Approvability
Verdict: Needs human review
This PR modifies concurrency control and resource scaling logic with two unresolved review comments identifying potential issues: broken elastic pool worker fan-out and possible resource exhaustion from unvalidated group sizes. These substantive concerns require human review.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 317c0432ec
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a606aece47
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f2245272d7
Overview
Honor `max_concurrent` as a rollout-level resource bound for group-scored V1 server evals. A group request owns `num_rollouts` inseparable rollouts, so the runner now derives request concurrency as `max(1, max_concurrent // num_rollouts)`.
This keeps cross-rollout scoring groups intact, guarantees progress for a group larger than the cap, and leaves independent server rollouts, local eval, resume behavior, and output handling unchanged.
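
A minimal sketch of the permit derivation described above, for illustration only. The function name `group_request_permits` and the input validation are assumptions and are not taken from the PR; only the `max(1, max_concurrent // num_rollouts)` expression comes from the description.

```python
def group_request_permits(max_concurrent: int, num_rollouts: int) -> int:
    """Return how many group requests may be in flight at once.

    Hypothetical helper: the name and the validation below are assumptions,
    only the max(1, a // b) expression is taken from the PR description.
    """
    if max_concurrent < 1 or num_rollouts < 1:
        raise ValueError("max_concurrent and num_rollouts must be positive")
    # Floor division keeps live rollouts at or below the cap; the max(1, ...)
    # floor lets a single oversized group make progress.
    return max(1, max_concurrent // num_rollouts)


# Divisible case: 128 // 8 = 16 groups, 16 * 8 = 128 live rollouts.
divisible = group_request_permits(128, 8)
# Non-divisible case: 128 // 12 = 10 groups, 120 live rollouts, 8 slots spare.
non_divisible = group_request_permits(128, 12)
# Oversized group: 4 // 8 = 0, raised to 1 so the group still runs.
oversized = group_request_permits(4, 8)
```

The oversized case is the only one where live rollouts can exceed the cap, which is the trade-off the description accepts because a group cannot be split.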
Reasoning
The env-server semaphore previously counted requests. That is equivalent to rollout concurrency for independent scoring, but a group-scored request expands into multiple rollouts inside the worker. With a rollout cap of 128 and group size 8, 128 request permits therefore allowed 1,024 live rollouts.
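
The oversubscription can be reproduced with a toy model. The sketch below is not the PR's benchmark or the real runner: `peak_live_rollouts` is a hypothetical helper that guards whole groups with an `asyncio.Semaphore` and records the peak number of live rollouts, assuming each group holds its permit for one event-loop turn.

```python
import asyncio


async def peak_live_rollouts(num_groups: int, group_size: int, permits: int) -> int:
    """Toy model: each permit admits one whole group of `group_size` rollouts."""
    sem = asyncio.Semaphore(permits)
    live = 0
    peak = 0

    async def run_group() -> None:
        nonlocal live, peak
        async with sem:
            live += group_size          # the group expands inside the worker
            peak = max(peak, live)
            await asyncio.sleep(0)      # yield so other groups can start
            live -= group_size

    await asyncio.gather(*(run_group() for _ in range(num_groups)))
    return peak


# Request-counted permits: 128 permits, each admitting 8 rollouts.
request_counted = asyncio.run(peak_live_rollouts(256, 8, permits=128))
# Rollout-scaled permits: max(1, 128 // 8) = 16 permits.
rollout_scaled = asyncio.run(peak_live_rollouts(256, 8, permits=max(1, 128 // 8)))
```

In this model `request_counted` reaches 1,024 live rollouts while `rollout_scaled` stays at 128, matching the arithmetic in the paragraph above.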
Scaling group-request permits by group size aligns the server path with the rollout-level meaning of `max_concurrent`. Floor division intentionally leaves spare capacity for non-divisible group sizes; one oversized group still receives a permit because group scoring cannot split it.
Reducing request permits must not suppress elastic worker fan-out, so the broker now counts a `run_group` request as load `n` for least-busy dispatch and scale-up. Other requests consume one rollout slot.
Performance and resource impact
A scheduler-focused reproduction used 256 groups × 8 rollouts, cap 128, and 64 KiB retained per active rollout. Values are medians from three fresh processes per variant.
The wall-time result is deliberately reported as negative time saved: enforcing the resource cap creates more request waves in an unconstrained synthetic workload. This is a resource-bound correction, not a throughput claim.
Note
Medium Risk
Changes scheduling and pool scaling for group-scored server evals only; incorrect slot accounting could under- or over-utilize workers, but auth and data paths are untouched.
Overview
`max_concurrent` is now enforced as a rollout budget on the V1 env-server path when the taskset uses group scoring, instead of counting one permit per HTTP-style `run_group` request (which could fan out to many rollouts per worker).
In `runner.py`, the client-side semaphore for group-scored server evals is set to `max(1, max_concurrent // num_rollouts)` so in-flight groups stay within the intended rollout cap while still allowing at least one indivisible group to run.
In `pool.py`, least-busy dispatch, per-worker `active` load, and elastic scale-up use rollout slots (`n` from `run_group` payloads; other methods count as 1), with a global `in_flight` counter aligned to that model.
Reviewed by Cursor Bugbot for commit 0580585. Bugbot is set up for automated code reviews on this repo.
Note
Honor rollout concurrency limits for grouped V1 evals
- The `asyncio.Semaphore` for `run_eval_server` now uses `max_concurrent // num_rollouts` (minimum 1) when group scoring is required, preventing oversubscription when each `run_group` request consumes multiple rollout slots.
- The pool tracks `rollout_slots` per pending request, decoding `run_group` payloads as `RunGroupRequest` to extract `n`. Worker active counts and the `in_flight` total are incremented/decremented by slot count rather than request count.
- `_maybe_scale_up` now receives the `in_flight` slot total instead of the raw pending request count, so scale-up decisions reflect actual rollout load.
- Pools serving `run_group` requests will scale up sooner than before.
Macroscope summarized 0580585.
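
The slot accounting summarized above could look roughly like the following sketch. It is not the code in `pool.py`: the `SlotPool` and `Worker` classes, the plain-dict payload standing in for `RunGroupRequest`, and the `slots_per_worker` scale-up threshold are all assumptions made for illustration. Only the names `rollout_slots`, `active`, `in_flight`, and `_maybe_scale_up` and the rule "`run_group` counts as `n`, everything else as 1" come from the summaries.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    active: int = 0  # rollout slots in use on this worker, not request count


@dataclass
class SlotPool:
    max_workers: int
    slots_per_worker: int  # hypothetical threshold; the real policy is not shown in the PR text
    workers: list = field(default_factory=lambda: [Worker()])
    in_flight: int = 0     # total rollout slots across all workers

    @staticmethod
    def rollout_slots(method: str, payload: dict) -> int:
        # A run_group request carries n rollouts; any other method counts as 1.
        if method == "run_group":
            return max(1, int(payload.get("n", 1)))
        return 1

    def dispatch(self, method: str, payload: dict):
        slots = self.rollout_slots(method, payload)
        self.in_flight += slots
        self._maybe_scale_up()
        worker = min(self.workers, key=lambda w: w.active)  # least-busy by slots
        worker.active += slots
        return worker, slots

    def release(self, worker: Worker, slots: int) -> None:
        worker.active -= slots
        self.in_flight -= slots

    def _maybe_scale_up(self) -> None:
        # Scale on rollout load so fewer, larger requests still fan out.
        capacity = len(self.workers) * self.slots_per_worker
        if self.in_flight > capacity and len(self.workers) < self.max_workers:
            self.workers.append(Worker())


pool = SlotPool(max_workers=4, slots_per_worker=8)
first_worker, first_slots = pool.dispatch("run_group", {"n": 8})
second_worker, second_slots = pool.dispatch("run", {})
```

With request counting, two pending requests would look like light load. Counted in slots, the first group already fills the single worker in this sketch, so the second request triggers a scale-up and lands on the new worker.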