feat: persistent runtime pool + warm in-runtime scoring #1766
Closed
mikasenghaas wants to merge 2 commits into
Add `persistent` to the base runtime config: a persistent runtime is taken from an eval/train-level pool, reused across rollouts (per-rollout workspace reset between uses), and torn down only at the end of the run, so expensive provisioning and warm in-runtime workers are paid once, not per rollout.

- BaseRuntimeConfig.persistent (all four runtime configs); RuntimePool wired into Environment.serving() (eval + env-server/train) and injected into rollouts; Rollout acquires/releases instead of make_runtime/stop. The env server's serving() spans the whole run, so on train the pool lives for the entire run.
- Runtime.reset() clears the per-rollout workspace (subprocess/docker/prime/modal) so reuse stays isolated; warm workers + the provisioned resource survive.
- run_uv_script(warm=True): the subprocess runtime routes to a long-lived worker that imports the script's deps once (gated on persistent, since the worker dies with an ephemeral runtime). Scripts opt in via main(argv)->str, staying uv-run-able cold. gsm8k verify.py converted + its reward passes warm=True.
- Split the runtime factory (RuntimeConfig/make_runtime) into runtimes/factory.py so the pool can build runtimes without importing the package (a cycle).
- bench/persistent_runtime.py: gsm8k scoring ~12x faster at 1000 rollouts / 128 concurrency (3.3x@256 → 29x@4000), import paid once per pooled runtime not per call.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
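The "imports the script's deps once" idea can be sketched in isolation. The snippet below is a minimal in-process illustration, not the PR's worker: `WarmWorker`, `SCRIPT`, and the module name `warm_script` are hypothetical, and the actual worker presumably runs as a separate long-lived process with its own protocol.

```python
import importlib.util
import tempfile
from pathlib import Path

# Hypothetical dual-mode script. Expensive imports would sit at module top
# level, so they run once when the module is loaded.
SCRIPT = '''
import sys


def main(argv):
    return str(sum(int(a) for a in argv))


if __name__ == "__main__":
    print(main(sys.argv[1:]))
'''


class WarmWorker:
    """Loads a script as a module once, then serves many main(argv) calls."""

    def __init__(self, script_path: Path) -> None:
        spec = importlib.util.spec_from_file_location("warm_script", script_path)
        if spec is None or spec.loader is None:
            raise ImportError(f"cannot load {script_path}")
        self._module = importlib.util.module_from_spec(spec)
        # Top-level code (including heavy imports) executes here, exactly once.
        # The module name is not "__main__", so the CLI footer is skipped.
        spec.loader.exec_module(self._module)
        self.calls = 0

    def call(self, argv: list) -> str:
        # Each call reuses the already-loaded module: args in, string out.
        self.calls += 1
        return self._module.main(argv)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "script.py"
    path.write_text(SCRIPT)
    worker = WarmWorker(path)
    results = [worker.call(["1", "2"]), worker.call(["10", "20", "30"])]
```

Loading through `importlib` rather than spawning `python script.py` per call is what moves the import cost from every call to the first load.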
Same math-verify pattern as gsm8k: convert verify.py to main(argv)->str (early returns instead of sys.exit, so a reused warm worker survives them) plus a __main__ footer, and pass warm=True. aime/math scoring then pays import math_verify once per pooled runtime on a persistent runtime, like gsm8k.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
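For illustration, the dual-mode shape described in this commit might look like the sketch below. It is a simplified stand-in and not the repository's `verify.py`: `parse_number` is a hypothetical helper, and plain float comparison replaces `math_verify`.

```python
import sys
from typing import List, Optional


def parse_number(text: str) -> Optional[float]:
    """Parse a numeric answer, tolerating thousands separators like 1,000."""
    try:
        return float(text.replace(",", "").strip())
    except ValueError:
        return None


def main(argv: List[str]) -> str:
    # Early returns replace sys.exit(): a long-lived worker that calls main()
    # repeatedly must not be terminated by one malformed input.
    if len(argv) != 2:
        return "0.0"
    predicted, expected = parse_number(argv[0]), parse_number(argv[1])
    if predicted is None or expected is None:
        return "0.0"
    return "1.0" if predicted == expected else "0.0"


if __name__ == "__main__":
    # Cold path: running the file directly still prints the score to stdout.
    print(main(sys.argv[1:]))
```

The same function serves both paths: a warm worker imports the module and calls `main(argv)` directly, while a cold run goes through the footer.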
| warm worker would die with the rollout that spawned it, so it only pays off when reused).""" | ||
| data = script.encode() if isinstance(script, str) else script | ||
| interpreter, path = await self._resolve_interpreter(data) | ||
| if warm and self.config.persistent: |
🟡 Medium runtimes/subprocess.py:186
When warm=True and self.config.persistent is true, the env parameter is silently dropped — _run_warm has no env parameter and the warm worker protocol has no mechanism to set per-call environment variables. A caller passing env={"FOO": "bar"} with warm=True will get execution without those variables applied, with no error or warning. Consider falling back to the non-warm path when env is provided.
Suggested change
| if warm and self.config.persistent: | |
| if warm and self.config.persistent and not env: |
Evidence trail:
verifiers/v1/runtimes/subprocess.py lines 170-188 (REVIEWED_COMMIT): `run_uv_script` signature accepts `env`, line 186-187 warm path drops it. Lines 190-204: `_run_warm` has no `env` parameter. Lines 206-229: `_spawn_warm` and the warm worker protocol have no mechanism for per-call env vars. Line 188: non-warm path uses `env` via `self.run(...)`.
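The routing the review suggests can be expressed as a small pure function. This is a sketch of the decision only, under the assumption that the warm worker cannot receive per-call environment variables; `choose_path` is a hypothetical name and is not part of `subprocess.py`.

```python
from typing import Mapping, Optional


def choose_path(warm: bool, persistent: bool, env: Optional[Mapping[str, str]]) -> str:
    """Return "warm" only when the warm worker can honor the whole request."""
    # Assumption from the review: the warm protocol has no channel for
    # per-call env vars, so a non-empty env falls back to the cold path,
    # where env is applied as usual.
    if warm and persistent and not env:
        return "warm"
    return "cold"
```

With this condition, `choose_path(True, True, {"FOO": "bar"})` takes the cold path instead of silently dropping `FOO`. An alternative design would raise an error when `env` and `warm=True` are combined, which is louder but breaks callers that pass both.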
issue was nfs, not code
Summary
Adds a persistent runtime mode + warm in-runtime scoring so the heavy per-rollout costs (sandbox/container provisioning, and re-importing a verifier's deps every rollout) are paid once per run instead of once per rollout.

- `persistent` on the base runtime config (`BaseRuntimeConfig.persistent`, inherited by subprocess/docker/prime/modal; CLI `--harness.runtime.persistent true`). A persistent runtime is taken from an eval/train-level `RuntimePool`, reused across rollouts, and torn down only at the end of the run.
- `RuntimePool` wired into `Environment.serving()` alongside the existing shared-tools / interception pools, and injected into every `Rollout` via `episode()`, so it covers both the eval CLI and the env server. The env server's `serving()` spans the whole run, so on training the pool lives for the entire run.
- `Rollout` acquires/releases a pooled runtime instead of `make_runtime`/`stop` (ephemeral path unchanged). Acquire happens inside the rollout's `try`, so a provisioning failure is captured on the trace like a normal `start()` failure.
- `Runtime.reset()` clears the per-rollout workspace between reuses (subprocess recreates `/tmp/<name>`; docker/prime/modal empty the workdir) so reuse stays isolated; the provisioned resource and warm workers survive. Persistent mode therefore suits tasksets whose per-rollout state is workspace-local (e.g. gsm8k / math).
- `run_uv_script(..., warm=True)`: the subprocess runtime routes to a long-lived worker that loads the script as a module once (its heavy top-level imports paid once) and answers many `args → stdout` calls. Gated on `persistent` (a worker would otherwise die with its rollout). Scripts opt in by exposing `main(argv) -> str` while staying `uv run`-able cold via an `if __name__ == "__main__": print(main(sys.argv[1:]))` footer.
- gsm8k, aime, and math `verify.py` converted to the dual-mode `main(argv)` shape (the math ones return instead of `sys.exit`, so a reused worker survives the early-out paths) and the `correct` reward passes `warm=True`.
- Split the runtime factory (`RuntimeConfig`/`make_runtime`/`runtime_is_local`) into `runtimes/factory.py` so the pool can build runtimes without importing the package (avoids a cycle); `runtimes/__init__` re-exports.

Why two layers
The provisioning win (skip re-creating a container/sandbox per rollout) needs only the pool, and helps remote runtimes most (the tb2 bench measured 2.5–7.4 s sandbox+tunnel provisioning per rollout). The import-once win needs a warm worker, which only survives if the runtime persists, so the pool is the foundation and warm scoring rides on it. For subprocess, provisioning is ~free (`start`/`stop` is 0.2 ms); the entire per-rollout cost is the fresh-process `import math_verify` (~0.22 s), which the warm worker removes.

Verification
`uv run python bench/persistent_runtime.py <N> <concurrency>` runs gsm8k's real `verify.py` through the framework API (no model generation, to isolate runtime/scoring overhead). `ephemeral` = `make_runtime + start + run_uv_script(cold) + stop` per call (today's per-rollout scoring path); `persistent` = `pool.acquire + run_uv_script(warm=True) + release`.

The win grows with N: ephemeral stays pinned at ~15.3 ms/call (the per-call import), while persistent's one-time worker import (one per pooled runtime) amortizes away, so it peaks when `N ≫ concurrency` (a long training run).

Correctness: a small gsm8k-v1 eval with `--harness.runtime.persistent true` vs default produces identical rewards (`reward=1.000`, 0 errors), logs rollouts as `(pooled)`, and tears the pool down at the end (`runtime pool: tearing down N persistent runtime(s)`). Warm worker output verified correct across edge cases (e.g. `1,000`, malformed → `0.0`). `ruff` clean; `tests/v1/test_configs.py` green; full test suite collects.

Scope / follow-ups
- […] `expose`/`host_endpoint` plumbing, structurally like a tool server) are a natural follow-up.
- […] `reset`, but anything a taskset installs outside the workspace persists, so it's opt-in and best for workspace-local state. Tasks with per-task images get a separate pool per image (pool keyed by resolved config).

Breaking
None.
`persistent` defaults to `False` (ephemeral, today's behavior); `warm` defaults to `False`. `run_uv_script` gains an optional trailing `warm` kwarg.
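The pool lifecycle described in this PR (acquire, per-rollout reset, release, one teardown at the end, one pool per resolved config) can be sketched as follows. This is a toy model under stated assumptions: `FakeRuntime` and `RuntimePoolSketch` are hypothetical names, a string key stands in for the resolved config, and resetting on release rather than on acquire is a choice made here, not something the PR specifies.

```python
import asyncio
from collections import defaultdict
from typing import Dict, List, Tuple


class FakeRuntime:
    """Hypothetical stand-in for a provisioned runtime."""

    def __init__(self, key: str) -> None:
        self.key = key
        self.workspace: Dict[str, str] = {}
        self.stopped = False

    async def reset(self) -> None:
        # Clear per-rollout state; the provisioned resource itself survives.
        self.workspace.clear()

    async def stop(self) -> None:
        self.stopped = True


class RuntimePoolSketch:
    def __init__(self) -> None:
        self._idle: Dict[str, List[FakeRuntime]] = defaultdict(list)
        self._all: List[FakeRuntime] = []

    async def acquire(self, key: str) -> FakeRuntime:
        if self._idle[key]:
            return self._idle[key].pop()
        runtime = FakeRuntime(key)  # provisioning happens only on a pool miss
        self._all.append(runtime)
        return runtime

    async def release(self, runtime: FakeRuntime) -> None:
        await runtime.reset()  # isolate the next rollout from this one
        self._idle[runtime.key].append(runtime)

    async def close(self) -> int:
        # Tear everything down once, at the end of the run.
        for runtime in self._all:
            await runtime.stop()
        return len(self._all)


async def demo() -> Tuple[int, bool, bool]:
    pool = RuntimePoolSketch()
    first = await pool.acquire("image-a")
    first.workspace["answer.txt"] = "42"
    await pool.release(first)
    second = await pool.acquire("image-a")  # same key: reused
    other = await pool.acquire("image-b")   # different key: separate runtime
    reused = second is first
    clean = second.workspace == {}
    await pool.release(second)
    await pool.release(other)
    return await pool.close(), reused, clean


torn_down, reused, clean = asyncio.run(demo())
```

In this sketch anything stored outside `workspace` would survive `reset()`, which mirrors the caveat that persistent mode fits tasksets whose per-rollout state is workspace-local.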