Restore isolated R2E test staging with a validation fast path #1776
xeophon wants to merge 7 commits into
Approvability
Verdict: Needs human review
This PR introduces new security-sensitive functionality including custom cryptographic operations for test archive encryption, process management to kill agent processes, and host-side archive storage. The complexity and security implications of these new capabilities warrant human review. You can customize Macroscope's approvability policy.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 770f6b3254
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Dismissing prior approval to re-evaluate db3b042
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit db3b042.
Reviewed commit: e70e1d3a9f
Reviewed commit: 2bda5ed924
Reviewed commit: 5e234c7154
Reviewed commit: 5e33a74384
Reviewed commit: a4631c1b8e
decrypt = await runtime.run(
    ["/usr/bin/python3", "-I", "-S", "-c", DECRYPT_TESTS],
    SYSTEM_ENV | {"R2E_ARCHIVE_KEY": key.hex()},
Do not hand the archive key to mutable sandbox Python
In root-capable R2E sandboxes, the agent can replace /usr/bin/python3 or its stdlib/shared libraries during generation; the process sweep only kills live PIDs and does not restore filesystem changes. This line then executes that agent-mutated interpreter with R2E_ARCHIVE_KEY, so a wrapper can copy the key, decrypt /tmp/r2e_tests.tar.gz.sealed, or patch /testbed before delegating, leaking the hidden tests despite the ciphertext upload. Please keep decryption/extraction in a clean trusted runtime or otherwise run the key through immutable verified helpers.

Overview
Restore the R2E Gym isolation contract introduced in 3e2a851 and keep it intact through scoring:
This changes only environments/r2e_gym_v1/r2e_gym_v1/taskset.py.
Execution topology
This does not create a second sandbox. Each rollout still provisions exactly one agent sandbox.
The two /tmp paths are on different filesystems:
The evaluator process retains the archive bytes and process-identity baseline between setup and scoring. It does not execute the agent or tests in another sandbox.
Why the archive format does not matter
Tar/gzip is serialization, not the isolation boundary. An uncompressed tarball—or any other readable representation left inside the sandbox—would be equally visible to a root-capable agent.
The V1 port left /opt/r2e_tests.tar.gz inside the agent sandbox. That hid individual paths from ordinary directory traversal, but the agent could still list, extract, or stream the known archive. The default path now transfers the archive to a collision-free, mode-0600 evaluator temporary file and removes both /r2e_tests and the sandbox archive before the harness starts.
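The evaluator-side staging described above can be sketched with the standard library. This is a minimal illustration, not the code in taskset.py: `stage_archive`, `is_owner_only`, and the `r2e_tests_` file prefix are hypothetical names, and it assumes a POSIX host where `tempfile.mkstemp` creates the file exclusively with owner-only permissions.

```python
import os
import stat
import tempfile


def stage_archive(archive_bytes: bytes, runtime_id: str) -> str:
    """Write archive bytes to a fresh evaluator-side temp file and return its path.

    Hypothetical helper: the real taskset may name and place the file differently.
    """
    # mkstemp opens the file with O_EXCL, so an existing name is never reused,
    # and on POSIX the file is readable and writable only by its owner (0600).
    fd, path = tempfile.mkstemp(prefix=f"r2e_tests_{runtime_id}_", suffix=".tar.gz")
    with os.fdopen(fd, "wb") as handle:
        handle.write(archive_bytes)
    return path


def is_owner_only(path: str) -> bool:
    """Return True when no group or other permission bits are set."""
    return stat.S_IMODE(os.stat(path).st_mode) & 0o077 == 0
```

Including a per-rollout identifier in the prefix only helps with debugging; uniqueness comes from `mkstemp` itself.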
The end-of-harness boundary matters too: detached processes can outlive the harness command. Before the archive returns, scoring preserves the original runtime/control processes and its own exec ancestry, then repeatedly freezes and kills every process created after setup until a complete scan finds none. It seals the archive with a fresh per-rollout key, sends only authenticated ciphertext through runtime.write(), repeats the stable cleanup to remove helpers or watchers created by that upload, and only then reveals the key to an isolated absolute system interpreter that creates the plaintext archive. The security properties therefore come from keeping the bytes outside the sandbox during generation and removing leftover agent execution before restore—not from compression.
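As an illustration of sealing with a fresh per-rollout key, the sketch below derives a SHAKE-256 keystream, XORs it with the archive, and appends an HMAC-SHA256 tag that is checked before anything is decrypted. The PR states only that the bytes are SHAKE-encrypted and HMAC-authenticated; the subkey derivation, nonce length, byte layout, and the names `seal` and `unseal` are assumptions made for this example, and the construction has not been reviewed as production cryptography.

```python
import hashlib
import hmac
import secrets

NONCE_LEN = 16  # assumed nonce size for this sketch
TAG_LEN = 32    # HMAC-SHA256 output size


def _subkeys(key: bytes) -> tuple[bytes, bytes]:
    """Derive separate encryption and MAC keys (assumed domain separation)."""
    return hashlib.sha256(b"enc" + key).digest(), hashlib.sha256(b"mac" + key).digest()


def _xor_keystream(enc_key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with a SHAKE-256 keystream of the same length."""
    stream = hashlib.shake_256(enc_key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


def seal(key: bytes, plaintext: bytes) -> bytes:
    """Return nonce || ciphertext || tag."""
    enc_key, mac_key = _subkeys(key)
    nonce = secrets.token_bytes(NONCE_LEN)
    ciphertext = _xor_keystream(enc_key, nonce, plaintext)
    tag = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()
    return nonce + ciphertext + tag


def unseal(key: bytes, sealed: bytes) -> bytes:
    """Verify the tag first; fail closed on any modification."""
    if len(sealed) < NONCE_LEN + TAG_LEN:
        raise ValueError("sealed archive is too short")
    enc_key, mac_key = _subkeys(key)
    nonce = sealed[:NONCE_LEN]
    ciphertext = sealed[NONCE_LEN:-TAG_LEN]
    tag = sealed[-TAG_LEN:]
    expected = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed")
    return _xor_keystream(enc_key, nonce, ciphertext)
```

Under this layout, a helper that only ever sees the sealed bytes learns nothing about the archive until the key is revealed, and a single flipped bit makes `unseal` raise instead of producing a modified archive.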
The runtime stop path explicitly removes the evaluator file and taskset state even when a rollout fails before scoring. A runtime finalizer remains as the interpreter-exit backstop.
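The combination of an explicit stop path and an interpreter-exit backstop maps naturally onto `weakref.finalize`. The sketch below is an assumption about how such cleanup could be wired, not the PR's code: `Runtime`, `register_cleanup`, and `_remove_file` are hypothetical names.

```python
import os
import weakref


class Runtime:
    """Hypothetical stand-in for the per-rollout runtime object."""


def _remove_file(path: str) -> None:
    """Delete the evaluator-side archive, tolerating an earlier removal."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


def register_cleanup(runtime: Runtime, path: str) -> weakref.finalize:
    """Tie removal of the archive to the runtime's lifetime.

    The finalizer runs at most once: when called explicitly, when the runtime
    is garbage collected, or at interpreter exit, whichever comes first.
    """
    return weakref.finalize(runtime, _remove_file, path)
```

A stop path would call the returned finalizer directly, which makes later invocations no-ops; the interpreter-exit behaviour covers rollouts that never reach stop.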
Lifecycle breakdown
Agent-safe path (default)
Setup:
The start time is part of the identity so later PID reuse cannot accidentally whitelist an agent process.
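A minimal sketch of such an identity, assuming the Linux /proc layout in which field 22 of /proc/<pid>/stat is the process start time in clock ticks since boot. The function names are hypothetical and the taskset may read the value differently.

```python
def parse_start_time(stat_text: str) -> int:
    """Extract field 22 (starttime) from the contents of /proc/<pid>/stat."""
    # Field 2 (comm) is parenthesised and may itself contain spaces or ')',
    # so split on the last ')' and count from field 3 (state) onwards.
    fields = stat_text.rsplit(")", 1)[1].split()
    return int(fields[19])  # field 22 is index 19 when field 3 is index 0


def process_identity(pid: int) -> tuple[int, int]:
    """Return (pid, start time) for a live process on Linux."""
    with open(f"/proc/{pid}/stat") as handle:
        return pid, parse_start_time(handle.read())


def is_baseline(identity: tuple[int, int], baseline: set[tuple[int, int]]) -> bool:
    """A reused PID with a different start time is not in the baseline."""
    return identity in baseline
```

With this pairing, a process that inherits PID 1234 after the original exited compares unequal to the baseline entry, because its start time differs.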
Harness execution:
Scoring:
SIGKILL cannot be trapped by a watcher. Repeating the scan closes the fork-after-expansion race; zombies are harmless and ignored, while an unkillable live process keeps scoring fail-closed until its timeout. Preserving both the setup baseline and the active exec ancestry keeps provider/runtime control processes alive while removing detached agent children. Cleanup and restore invoke /bin/sh directly and exclude /testbed/.venv/bin and /root/.local/bin from PATH, so agent-planted tar, sed, or shell shims are not selected after the archive returns. The upload boundary is separately protected: sandbox-side mkdir/cat helpers receive only SHAKE-encrypted, HMAC-authenticated bytes; their descendants are reaped before the key exists in the sandbox, and modified ciphertext fails closed. The test command separately retains the project PATH required by the repository environment. Archive, transfer, cleanup, and restore failures stay inside the existing setup or scoring error boundaries.
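The repeat-until-clean sweep can be sketched as a loop over process snapshots. This is an illustration under assumptions rather than the PR's implementation: `keep` stands for the union of the setup baseline and the active exec ancestry, processes are identified by (PID, start time), and the snapshot and signal functions are injected so the loop can be exercised without touching real processes.

```python
import os
import signal
from typing import Callable

Identity = tuple[int, int]  # (pid, start time)


def proc_snapshot() -> dict[Identity, str]:
    """Map each process identity to its state letter by reading Linux /proc."""
    snapshot = {}
    for name in os.listdir("/proc"):
        if not name.isdigit():
            continue
        try:
            with open(f"/proc/{name}/stat") as handle:
                fields = handle.read().rsplit(")", 1)[1].split()
        except (FileNotFoundError, ProcessLookupError):
            continue  # the process exited during the scan
        snapshot[(int(name), int(fields[19]))] = fields[0]
    return snapshot


def kill_quietly(pid: int, sig: int) -> None:
    """Send a signal, ignoring processes that have already exited."""
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        pass


def select_targets(snapshot: dict[Identity, str], keep: set[Identity]) -> list[Identity]:
    """Pick live processes that are neither preserved nor zombies."""
    return sorted(i for i, state in snapshot.items() if i not in keep and state != "Z")


def sweep(
    take_snapshot: Callable[[], dict[Identity, str]],
    keep: set[Identity],
    send_signal: Callable[[int, int], None],
    max_rounds: int = 50,
) -> bool:
    """Freeze, then kill, until a complete scan finds no target."""
    for _ in range(max_rounds):
        targets = select_targets(take_snapshot(), keep)
        if not targets:
            return True
        for pid, _start in targets:
            send_signal(pid, signal.SIGSTOP)  # freeze first so targets stop forking
        for pid, _start in targets:
            send_signal(pid, signal.SIGKILL)
    return False  # something survived every round: the caller should fail closed
```

In real use the call would look like `sweep(proc_snapshot, keep, kill_quietly)`. The round limit is an assumption standing in for the scoring timeout mentioned above.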
This cleanup relies on Linux /proc, which matches the container runtimes and Linux R2E images required by the taskset. It is a user-space boundary inside the existing sandbox; compromising PID 1, the provider control plane, or the kernel remains outside the taskset threat model.
No-agent validation fast path
When hide_tests_from_agent = false, no harness inspects the sandbox, so setup uses:
rm -rf /testbed/r2e_tests && mv /r2e_tests /testbed/r2e_tests
Validation then applies the gold patch and runs the same scoring command without an archive round trip or process sweep.
Validation-path timing breakdown
The benchmark used 2,000 deterministic 4 KiB files across 20 directories (8,192,000 bytes total), one warmup per strategy, and nine alternating measured rounds on fresh copies. Fixture preparation and hashing were outside the timed region; shell startup and replacement of a pre-existing destination were included.
For no-agent validation, rename is 93.54× faster at the median, a 98.93% elapsed-time reduction:
These timings cover filesystem operations, not provider-specific transfer. The default agent-safe path prioritizes isolation and pays the sealing, archive transfer, and two process-sweep costs; the rename result applies only when hide_tests_from_agent is explicitly disabled.
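The two headline figures are the same measurement expressed two ways: a speedup of s corresponds to an elapsed-time reduction of 1 - 1/s. The short check below uses only the 93.54× figure quoted above; `elapsed_reduction` is a name chosen for this example.

```python
def elapsed_reduction(speedup: float) -> float:
    """Fraction of elapsed time saved when an operation becomes `speedup` times faster."""
    return 1.0 - 1.0 / speedup


# 1 / 93.54 is about 0.01069, so the fast path takes roughly 1.07% of the original time.
print(f"{elapsed_reduction(93.54):.2%}")  # prints 98.93%
```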
Filesystem assumption
Only the validation fast path depends on a same-filesystem rename between /r2e_tests and /testbed/r2e_tests. Across mounts, mv may fall back to recursive copy/delete and lose its constant-time and atomic-rename properties.
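One way to make the same-filesystem assumption explicit is to use a plain rename, which never falls back to copying. The sketch below is a hypothetical alternative to the shell `mv` used by the taskset, shown only to illustrate the failure mode: `os.rename` raises `OSError` with `errno.EXDEV` when source and destination are on different mounts. `rename_or_fail` is an invented name, and the destination is assumed not to exist yet.

```python
import errno
import os


def rename_or_fail(source: str, destination: str) -> None:
    """Rename atomically, refusing to degrade into a recursive copy."""
    try:
        os.rename(source, destination)
    except OSError as exc:
        if exc.errno == errno.EXDEV:
            # Different filesystems: a copy/delete would be needed, which is
            # neither constant-time nor atomic, so surface the condition.
            raise RuntimeError(
                f"{source} and {destination} are on different filesystems"
            ) from exc
        raise
```

Surfacing the cross-device case as an error keeps the benchmark numbers meaningful, since they describe a rename and not a copy.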
Note
Overview
Restore the R2E Gym isolation contract that was introduced in 3e2a8510d5 but lost during the V1 port:
- […] /r2e_tests, transfer the archive to the evaluator process, and remove both the test tree and remote archive before the harness starts
- […] /testbed/r2e_tests, removes the temporary archive, and runs the existing reward path
- […] hide_tests_from_agent = false to move the directory directly into place
This changes only environments/r2e_gym_v1/r2e_gym_v1/taskset.py.
Execution topology
This does not create a second sandbox. Each rollout still provisions exactly one agent sandbox.
The two /tmp paths are on different filesystems:
The evaluator process retains only the archive bytes between setup and scoring. It does not execute the agent or tests in another sandbox.
Why the archive matters
Compression is not the isolation boundary; moving the archive out of the sandbox is.
The V1 port left /opt/r2e_tests.tar.gz inside the same root-capable sandbox as the agent. Although that kept individual files out of ordinary directory traversal, an agent could still list, extract, or stream the known archive.
With the default hide_tests_from_agent = true path, the sandbox archive exists only during taskset setup and scoring, outside the harness execution window. Setup transfers it to a collision-free mode-0600 evaluator temporary file whose prefix includes the per-rollout runtime ID, then deletes both /r2e_tests and the sandbox archive. The harness therefore has neither representation available. Scoring reverses the transfer into the same sandbox and deletes the evaluator copy after restoration.
A runtime finalizer also owns cleanup of the evaluator file if a rollout exits before scoring.
Lifecycle breakdown
Agent-safe path (default)
Setup:
Harness execution:
Scoring:
Archive, transfer, removal, and restore failures remain attributable to the existing setup or scoring boundaries.
No-agent validation fast path
When hide_tests_from_agent = false, no harness will inspect the sandbox, so setup uses:
rm -rf /testbed/r2e_tests && mv /r2e_tests /testbed/r2e_tests
Validation then applies the gold patch and runs the same scoring command without an archive round trip.
Validation-path timing breakdown
The benchmark used 2,000 deterministic 4 KiB files across 20 directories (8,192,000 bytes total), one warmup per strategy, and nine alternating measured rounds on fresh copies. Fixture preparation and hashing were outside the timed region; shell startup and replacement of a pre-existing destination were included.
For no-agent validation, rename is 93.54× faster at the median, a 98.93% elapsed-time reduction:
These timings intentionally cover the filesystem operations, not provider-specific transfer. The default agent-safe path prioritizes isolation and pays the evaluator transfer cost; the rename result applies only when hide_tests_from_agent is explicitly disabled.
Filesystem assumption
Only the validation fast path depends on a same-filesystem rename between /r2e_tests and /testbed/r2e_tests. Across mounts, mv may fall back to recursive copy/delete and lose its constant-time and atomic-rename properties.
Changes since #1776 opened
- […] R2EGymTaskset [e70e1d3]
- […] HOST_TESTS_ARCHIVE constant and updated module imports in r2e_gym_v1.taskset [e70e1d3]
- […] R2EGymTaskset.solved reward method that kills non-baseline processes before test restoration [2bda5ed]
- […] R2EGymTaskset runtime setup [2bda5ed]
- […] SYSTEM_ENV constant within r2e_gym_v1.taskset with PATH restricted to system directories and documentation indicating hidden-test control commands must not resolve tools from agent-writable project paths [5e234c7]
- […] runtime.run invocations within R2EGymTaskset.solved method to use /bin/sh shell and updated environment configurations [5e234c7]
- […] r2e_gym_v1.taskset module to repeatedly scan and terminate non-ancestor, non-baseline processes until none remain [5e33a74]
- […] runtime.stop method in R2EGymTaskset runtime setup to clean up host archive and weakref entries before stopping [5e33a74]