feat: persistent runtime pool + warm in-runtime scoring by mikasenghaas · Pull Request #1766 · PrimeIntellect-ai/verifiers

mikasenghaas · 2026-06-19T05:41:28Z

Summary

Adds a persistent runtime mode + warm in-runtime scoring so the heavy per-rollout costs (sandbox/container provisioning, and re-importing a verifier's deps every rollout) are paid once per run instead of once per rollout.

persistent on the base runtime config (BaseRuntimeConfig.persistent, inherited by subprocess/docker/prime/modal; CLI --harness.runtime.persistent true). A persistent runtime is taken from an eval/train-level RuntimePool, reused across rollouts, and torn down only at the end of the run.
Wired into Environment.serving() alongside the existing shared-tools / interception pools, and injected into every Rollout via episode() — so it covers both the eval CLI and the env server. The env server's serving() spans the whole run, so on training the pool lives for the entire run.
Rollout acquires/releases a pooled runtime instead of make_runtime/stop (ephemeral path unchanged). Acquire happens inside the rollout's try, so a provisioning failure is captured on the trace like a normal start() failure.
Runtime.reset() clears the per-rollout workspace between reuses (subprocess recreates /tmp/<name>; docker/prime/modal empty the workdir) so reuse stays isolated; the provisioned resource and warm workers survive. Persistent mode therefore suits tasksets whose per-rollout state is workspace-local (e.g. gsm8k / math).
run_uv_script(..., warm=True): the subprocess runtime routes to a long-lived worker that loads the script as a module once (so its heavy top-level imports are paid once) and answers many args → stdout calls. This is gated on persistent, because a worker would otherwise die with its rollout. Scripts opt in by exposing main(argv) -> str while staying uv run-able cold via an if __name__ == "__main__": print(main(sys.argv[1:])) footer.
gsm8k-v1 / aime24-v1 / math-env-v1 opt in: each verify.py converted to the dual-mode main(argv) shape (the math ones return instead of sys.exit, so a reused worker survives the early-out paths) and the correct reward passes warm=True.
Split the runtime factory (RuntimeConfig / make_runtime / runtime_is_local) into runtimes/factory.py so the pool can build runtimes without importing the package (avoids a cycle); runtimes/__init__ re-exports.
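The acquire / reset / release / teardown lifecycle described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's RuntimePool: `FakeRuntime`, `SketchPool`, and `rollout` are hypothetical names, and the real pool additionally keys runtimes by resolved config and is wired through Environment.serving().

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeRuntime:
    """Hypothetical stand-in for a provisioned runtime (not the real class)."""
    name: str
    workspace: dict = field(default_factory=dict)
    started: bool = False

    async def start(self) -> None:
        self.started = True  # the expensive provisioning step would go here

    async def reset(self) -> None:
        self.workspace.clear()  # drop per-rollout state, keep the resource

    async def stop(self) -> None:
        self.started = False


class SketchPool:
    """Illustrative pool: provision lazily, reuse idle runtimes, stop at teardown."""

    def __init__(self) -> None:
        self._idle: list[FakeRuntime] = []
        self._all: list[FakeRuntime] = []

    async def acquire(self) -> FakeRuntime:
        if self._idle:
            return self._idle.pop()
        runtime = FakeRuntime(name=f"rt-{len(self._all)}")
        self._all.append(runtime)  # registered before awaiting so names stay unique
        await runtime.start()
        return runtime

    async def release(self, runtime: FakeRuntime) -> None:
        await runtime.reset()  # isolate the next rollout from this one
        self._idle.append(runtime)

    async def teardown(self) -> int:
        for runtime in self._all:
            await runtime.stop()
        count = len(self._all)
        self._all.clear()
        self._idle.clear()
        return count


async def rollout(pool: SketchPool, i: int) -> str:
    runtime = None
    try:
        # Acquire inside the try, so a provisioning failure is handled
        # on the same path as any other rollout error.
        runtime = await pool.acquire()
        runtime.workspace["answer"] = i
        return runtime.name
    finally:
        if runtime is not None:
            await pool.release(runtime)


async def demo() -> tuple:
    pool = SketchPool()
    names = set()
    for i in range(5):  # sequential rollouts all reuse one runtime
        names.add(await rollout(pool, i))
    return names, await pool.teardown()
```

Running `asyncio.run(demo())` provisions a single runtime for all five rollouts and stops it once at the end, which is the behavior the persistent mode is meant to give.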

Why two layers

The provisioning win (skip re-creating a container/sandbox per rollout) needs only the pool, and helps remote runtimes most (the tb2 bench measured 2.5–7.4s sandbox+tunnel provisioning per rollout). The import-once win needs a warm worker, which only survives if the runtime persists — so the pool is the foundation and warm scoring rides on it. For subprocess, provisioning is ~free (start/stop is 0.2 ms); the entire per-rollout cost is the fresh-process import math_verify (~0.22 s), which the warm worker removes.
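To make the import-once idea concrete, here is a small in-process sketch. It is an assumption-laden simplification: the PR's warm worker is a long-lived process owned by the subprocess runtime, and its protocol is not shown in this thread. `InProcessWarmWorker` and the toy `SCRIPT` are hypothetical; the `LOADS` list only stands in for a heavy top-level import such as math_verify.

```python
import importlib.util
import tempfile
from pathlib import Path

# Toy dual-mode script: top-level code runs once per load, main() runs per call.
SCRIPT = '''
import sys

LOADS = []
LOADS.append("loaded")  # stands in for an expensive top-level import


def main(argv):
    return str(sum(int(a) for a in argv))


if __name__ == "__main__":
    print(main(sys.argv[1:]))
'''


class InProcessWarmWorker:
    """Load a script as a module once, then answer many argv -> str calls."""

    def __init__(self, script_path: Path) -> None:
        spec = importlib.util.spec_from_file_location("warm_script", str(script_path))
        self.module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(self.module)  # top-level imports are paid here, once

    def call(self, argv: list) -> str:
        return self.module.main(argv)


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "verify_sketch.py"
        path.write_text(SCRIPT)
        worker = InProcessWarmWorker(path)
        outputs = [worker.call(["1", "2"]), worker.call(["10", "20", "30"])]
        return outputs, len(worker.module.LOADS)
```

Because the module is loaded under a name other than `__main__`, the footer does not fire inside the worker, while the same file still works as a cold script.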

Verification

uv run python bench/persistent_runtime.py <N> <concurrency> — runs gsm8k's real verify.py through the framework API (no model generation, to isolate runtime/scoring overhead). ephemeral = make_runtime + start + run_uv_script(cold) + stop per call (today's per-rollout scoring path); persistent = pool.acquire + run_uv_script(warm=True) + release.
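The difference between the two benchmark arms can be illustrated by counting setups instead of measuring time. The sketch below is not bench/persistent_runtime.py; `run_bench` is a hypothetical function and `asyncio.sleep(0)` merely stands in for the scoring work. It assumes a semaphore caps concurrency, which may differ from how the real benchmark limits it.

```python
import asyncio


async def run_bench(n: int, concurrency: int) -> tuple:
    """Return (ephemeral_setups, persistent_setups) for n scoring calls."""
    gate = asyncio.Semaphore(concurrency)
    ephemeral_setups = 0
    persistent_setups = 0
    idle: list = []  # ids of pooled runtimes whose worker is already warm

    async def ephemeral_call() -> None:
        nonlocal ephemeral_setups
        async with gate:
            ephemeral_setups += 1  # fresh runtime and cold import on every call
            await asyncio.sleep(0)

    async def persistent_call() -> None:
        nonlocal persistent_setups
        async with gate:
            if idle:
                runtime_id = idle.pop()  # reuse: no setup cost
            else:
                persistent_setups += 1  # first use of a pooled runtime
                runtime_id = persistent_setups
            await asyncio.sleep(0)
            idle.append(runtime_id)  # release back to the pool

    await asyncio.gather(*(ephemeral_call() for _ in range(n)))
    await asyncio.gather(*(persistent_call() for _ in range(n)))
    return ephemeral_setups, persistent_setups
```

In this model the ephemeral arm pays one setup per call, while the persistent arm pays at most one per concurrently used runtime, so its setup count is bounded by the concurrency rather than by N.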

N	concurrency	ephemeral	persistent	speedup
256	128	3.95s (15.5 ms/call)	1.21s (4.7 ms/call)	3.3x
1000	128	15.31s (15.3 ms/call)	1.27s (1.27 ms/call)	12.0x
4000	128	63.28s (15.8 ms/call)	2.16s (0.54 ms/call)	29.4x

The win grows with N: ephemeral stays at roughly 15–16 ms/call (the per-call import), while persistent's one-time worker import (one per pooled runtime) amortizes away, so the speedup is largest when N ≫ concurrency (a long training run).
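The per-call figures and speedups can be recomputed from the reported wall-clock totals. The snippet uses only the totals from the table; `summarize` is a hypothetical helper. A few recomputed values differ from the table in the last digit (for example 15.4 vs 15.5 ms/call at N=256, and 29.3 vs 29.4 at N=4000), which would be consistent with the table having been derived from unrounded timings, though that is an assumption.

```python
# (N, ephemeral total seconds, persistent total seconds), copied from the table.
ROWS = [(256, 3.95, 1.21), (1000, 15.31, 1.27), (4000, 63.28, 2.16)]


def summarize(rows: list) -> list:
    """Derive ms/call and speedup for each benchmark row."""
    out = []
    for n, ephemeral_s, persistent_s in rows:
        out.append(
            {
                "n": n,
                "ephemeral_ms": 1000 * ephemeral_s / n,
                "persistent_ms": 1000 * persistent_s / n,
                "speedup": ephemeral_s / persistent_s,
            }
        )
    return out


for row in summarize(ROWS):
    print(
        f"N={row['n']}: ephemeral {row['ephemeral_ms']:.2f} ms/call, "
        f"persistent {row['persistent_ms']:.2f} ms/call, "
        f"speedup {row['speedup']:.1f}x"
    )
```

The ephemeral per-call cost is nearly flat across rows while the persistent per-call cost keeps falling, which is the amortization effect described above.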

Correctness: a small gsm8k-v1 eval with --harness.runtime.persistent true vs default produces identical rewards (reward=1.000, 0 errors), logs rollouts as (pooled), and tears the pool down at the end (runtime pool: tearing down N persistent runtime(s)). Warm worker output verified correct across edge cases (e.g. 1,000, malformed → 0.0). ruff clean; tests/v1/test_configs.py green; full test suite collects.

Scope / follow-ups

Warm workers are currently implemented on the subprocess runtime. Persistence works on every runtime (and on remote runtimes also skips the per-rollout sandbox+tunnel provisioning). Warm workers on docker/prime/modal (a small in-runtime HTTP service reached via the existing expose/host_endpoint plumbing — structurally like a tool server) are a natural follow-up.
Persistent mode reuses a runtime across rollouts/tasks; the per-rollout workspace is reset, but anything a taskset installs outside the workspace persists — so it's opt-in and best for workspace-local state. Tasks with per-task images get a separate pool per image (pool keyed by resolved config).
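Keying the pool by resolved config, so that tasks with different images never share a runtime, might look something like this. `ResolvedConfig` and `KeyedPools` are hypothetical names for illustration; the actual config classes and key derivation in the PR are not shown in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen dataclasses are hashable, so they can be dict keys
class ResolvedConfig:
    kind: str
    image: Optional[str] = None


class KeyedPools:
    """One list of idle runtimes per distinct resolved config."""

    def __init__(self) -> None:
        self._pools: dict = {}

    def pool_for(self, config: ResolvedConfig) -> list:
        # Equal configs map to the same pool; a different image gets its own.
        return self._pools.setdefault(config, [])

    def __len__(self) -> int:
        return len(self._pools)
```

With this shape, two tasks that resolve to the same config share runtimes, and a task with a different image transparently gets a separate pool.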

Breaking

None. persistent defaults to False (ephemeral, today's behavior); warm defaults to False. run_uv_script gains an optional trailing warm kwarg.

Add `persistent` to the base runtime config: a persistent runtime is taken from an eval/train-level pool, reused across rollouts (per-rollout workspace reset between uses), and torn down only at the end of the run, so expensive provisioning and warm in-runtime workers are paid once, not per rollout.

- BaseRuntimeConfig.persistent (all four runtime configs); RuntimePool wired into Environment.serving() (eval + env-server/train) and injected into rollouts; Rollout acquires/releases instead of make_runtime/stop. The env server's serving() spans the whole run, so on train the pool lives for the entire run.
- Runtime.reset() clears the per-rollout workspace (subprocess/docker/prime/modal) so reuse stays isolated; warm workers + the provisioned resource survive.
- run_uv_script(warm=True): the subprocess runtime routes to a long-lived worker that imports the script's deps once (gated on persistent, since the worker dies with an ephemeral runtime). Scripts opt in via main(argv)->str, staying uv-run-able cold. gsm8k verify.py converted + its reward passes warm=True.
- Split the runtime factory (RuntimeConfig/make_runtime) into runtimes/factory.py so the pool can build runtimes without importing the package (a cycle).
- bench/persistent_runtime.py: gsm8k scoring ~12x faster at 1000 rollouts / 128 concurrency (3.3x@256 → 29x@4000), import paid once per pooled runtime not per call.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Same math-verify pattern as gsm8k: convert verify.py to main(argv)->str (early returns instead of sys.exit, so a reused warm worker survives them) plus a __main__ footer, and pass warm=True. On a persistent runtime, aime/math scoring then pays import math_verify once per pooled runtime, like gsm8k.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
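The dual-mode shape with early returns could look like the sketch below. It is not the repository's verify.py: a plain numeric comparison stands in for math_verify, and the two-argument `argv` layout (predicted, expected) is an assumption made for illustration.

```python
import sys


def main(argv: list) -> str:
    """Score one answer; always return a string instead of exiting."""
    if len(argv) != 2:
        return "0.0"  # early return: sys.exit() here would kill a reused worker
    predicted, expected = argv
    try:
        # Strip thousands separators so "1,000" compares equal to "1000".
        same = float(predicted.replace(",", "")) == float(expected.replace(",", ""))
    except ValueError:
        return "0.0"  # malformed input scores zero rather than raising
    return "1.0" if same else "0.0"


if __name__ == "__main__":
    # Cold path: the file stays runnable as an ordinary script.
    print(main(sys.argv[1:]))
```

Returning a string from every branch is what lets one worker serve many calls, since no code path terminates the interpreter.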

macroscopeapp · 2026-06-19T05:55:22Z

+        warm worker would die with the rollout that spawned it, so it only pays off when reused)."""
+        data = script.encode() if isinstance(script, str) else script
+        interpreter, path = await self._resolve_interpreter(data)
+        if warm and self.config.persistent:


🟡 Medium runtimes/subprocess.py:186

When warm=True and self.config.persistent is true, the env parameter is silently dropped — _run_warm has no env parameter and the warm worker protocol has no mechanism to set per-call environment variables. A caller passing env={"FOO": "bar"} with warm=True will get execution without those variables applied, with no error or warning. Consider falling back to the non-warm path when env is provided.

Suggested change

-        if warm and self.config.persistent:
+        if warm and self.config.persistent and not env:

Evidence trail: verifiers/v1/runtimes/subprocess.py lines 170-188 (REVIEWED_COMMIT): `run_uv_script` signature accepts `env`; the warm path at lines 186-187 drops it. Lines 190-204: `_run_warm` has no `env` parameter. Lines 206-229: `_spawn_warm` and the warm worker protocol have no mechanism for per-call env vars. Line 188: the non-warm path uses `env` via `self.run(...)`.
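The routing rule the reviewer suggests can be isolated as a small pure function. This is only a sketch of the decision: `choose_path` is a hypothetical helper, whereas the real check lives inline in `run_uv_script` and reads `self.config.persistent`.

```python
from typing import Mapping, Optional


def choose_path(
    warm: bool,
    persistent: bool,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return "warm" only when the warm worker can honor the whole request."""
    # The warm worker has no per-call env mechanism, so any non-empty env
    # falls back to the cold path instead of being silently dropped.
    if warm and persistent and not env:
        return "warm"
    return "cold"
```

Falling back keeps the call correct at the cost of a cold import; raising an error or logging a warning when `env` is combined with `warm=True` would be an alternative design that makes the downgrade visible to the caller.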

mikasenghaas · 2026-06-22T23:29:59Z

issue was nfs, not code

mikasenghaas and others added 2 commits June 19, 2026 05:41

macroscopeapp Bot reviewed Jun 19, 2026

mikasenghaas closed this Jun 22, 2026

mikasenghaas wants to merge 2 commits into feat/nano-as-v1 from feat/v1-persistent-runtime
