Restore isolated R2E test staging with a validation fast path #1776

Open
xeophon wants to merge 7 commits into feat/nano-as-v1 from codex/r2e-gym-rename-hidden-tests

xeophon (Member) commented Jun 20, 2026

Overview

Restore the R2E Gym isolation contract introduced in 3e2a851 and keep it intact through scoring:

  • agent rollouts archive /r2e_tests, transfer the archive to the evaluator process, remove both sandbox copies, record the pre-agent process baseline, and register archive cleanup on runtime stop
  • scoring kills post-setup agent processes, uploads only authenticated ciphertext through the runtime, kills any upload-spawned processes, and creates the plaintext archive only through an isolated system interpreter invoked by absolute path, before running the existing reward path
  • no-agent validation can explicitly set hide_tests_from_agent = false to move the directory directly into place

This changes only environments/r2e_gym_v1/r2e_gym_v1/taskset.py.

Execution topology

This does not create a second sandbox. Each rollout still provisions exactly one agent sandbox.

machine/process running Verifiers
    /tmp/r2e_tests_<runtime-id>_<random>.tar.gz
    in-memory baseline: sandbox PID + kernel start time
                    ⇅ authenticated ciphertext transfer / exec
single agent sandbox
    /tmp/r2e_tests.tar.gz.sealed -> /tmp/r2e_tests.tar.gz

The two /tmp paths are on different filesystems:

  • with a local Docker runtime, the evaluator path is on the Docker host
  • with a Prime runtime, the evaluator path is on the machine or worker running the Verifiers evaluation process
  • the sandbox path belongs to the one agent sandbox provisioned for that rollout

The evaluator process retains the archive bytes and process-identity baseline between setup and scoring. It does not execute the agent or tests in another sandbox.

Why the archive format does not matter

Tar/gzip is serialization, not the isolation boundary. An uncompressed tarball—or any other readable representation left inside the sandbox—would be equally visible to a root-capable agent.

The V1 port left /opt/r2e_tests.tar.gz inside the agent sandbox. That hid individual paths from ordinary directory traversal, but the agent could still list, extract, or stream the known archive. The default path now transfers the archive to a collision-free, mode-0600 evaluator temporary file and removes both /r2e_tests and the sandbox archive before the harness starts.
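A minimal sketch of how a collision-free, mode-0600 evaluator file could be created, assuming Python's standard tempfile module is used. The helper name make_evaluator_archive and the runtime ID "rt-example" are hypothetical and do not come from taskset.py.

```python
import os
import stat
import tempfile


def make_evaluator_archive(runtime_id: str) -> str:
    """Create an empty, owner-only temporary file and return its path."""
    # mkstemp creates the file exclusively with mode 0o600 on POSIX, so the
    # name cannot collide and other users cannot read it at any point.
    fd, path = tempfile.mkstemp(
        prefix=f"r2e_tests_{runtime_id}_", suffix=".tar.gz"
    )
    os.close(fd)
    return path


path = make_evaluator_archive("rt-example")  # hypothetical runtime ID
print(oct(stat.S_IMODE(os.stat(path).st_mode)))
os.remove(path)
```

The random component of the name comes from mkstemp itself, so two rollouts with the same runtime ID prefix still receive distinct files.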

The end-of-harness boundary matters too: detached processes can outlive the harness command. Before the archive returns, scoring preserves the original runtime/control processes and the ancestors of its own cleanup exec, then repeatedly freezes and kills every process created after setup until a complete scan finds none. It then seals the archive with a fresh per-rollout key and sends only authenticated ciphertext through runtime.write(). The same sweep runs again to remove helpers or watchers created by that upload. Only then is the key revealed to an isolated system interpreter, invoked by absolute path, that creates the plaintext archive. The security properties therefore come from keeping the bytes outside the sandbox during generation and removing leftover agent execution before restore, not from compression.

The runtime stop path explicitly removes the evaluator file and taskset state even when a rollout fails before scoring. A runtime finalizer remains as the interpreter-exit backstop.
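The explicit stop-path cleanup with a finalizer backstop could look roughly like the sketch below. It assumes the runtime is a plain Python object with a stop() method; FakeRuntime, register_archive_cleanup, and _remove_quietly are hypothetical names used only for illustration.

```python
import os
import tempfile
import weakref


class FakeRuntime:
    """Hypothetical stand-in for a runtime object that exposes stop()."""

    def stop(self) -> None:
        self.stopped = True


def _remove_quietly(path: str) -> None:
    try:
        os.remove(path)
    except FileNotFoundError:
        pass  # already cleaned up by scoring or an earlier stop


def register_archive_cleanup(runtime, path: str) -> weakref.finalize:
    # Backstop: fires at garbage collection or interpreter exit if stop()
    # is never reached. The callback must not reference the runtime itself.
    finalizer = weakref.finalize(runtime, _remove_quietly, path)
    original_stop = runtime.stop

    def stop_with_cleanup():
        finalizer()  # explicit path; a finalizer runs its callback at most once
        return original_stop()

    runtime.stop = stop_with_cleanup
    return finalizer


fd, archive = tempfile.mkstemp()
os.close(fd)
rt = FakeRuntime()
register_archive_cleanup(rt, archive)
rt.stop()
print(os.path.exists(archive))
```

Calling the finalizer from the wrapped stop() keeps one cleanup routine for both paths, so the file is removed exactly once regardless of which path runs first.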

Lifecycle breakdown

Agent-safe path (default)

Setup:

/r2e_tests in the single agent sandbox
    -> /tmp/r2e_tests.tar.gz in that sandbox
    -> /tmp/r2e_tests_<runtime-id>_<random>.tar.gz on the evaluator filesystem
    -> delete /r2e_tests and the sandbox archive
    -> record each remaining sandbox process as PID + kernel start time

The start time is part of the identity so later PID reuse cannot accidentally whitelist an agent process.
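As an illustration of the PID + start time identity, the sketch below reads field 22 (starttime) of a Linux /proc/<pid>/stat line. The function names are hypothetical, and the PR's own snapshot is implemented as shell script constants, so this is only an equivalent in Python.

```python
def parse_start_time(stat_line: str) -> int:
    """Return field 22 (starttime, in clock ticks since boot) of a stat line."""
    # Field 2 (comm) is parenthesized and may itself contain spaces or ')',
    # so split after the last ')' rather than on whitespace alone.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), which makes field 22 rest[19].
    return int(rest[19])


def process_identity(pid: int) -> tuple[int, int]:
    """Return (pid, start time) for a live process on Linux."""
    with open(f"/proc/{pid}/stat") as f:
        return pid, parse_start_time(f.read())
```

Two processes that reuse the same PID will differ in the second element, so a baseline stored as a set of these tuples does not match the newer process.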

Harness execution:

the sandbox contains neither /r2e_tests nor a test archive
processes created by the harness/agent are absent from the setup baseline

Scoring:

start /bin/sh with PATH=/usr/sbin:/usr/bin:/sbin:/bin
    -> preserve PID 1, setup-baseline identities, and the cleanup exec's ancestors
    -> SIGSTOP then SIGKILL every other live process
    -> rescan until no live non-baseline process remains
evaluator archive
    -> seal with a fresh key and upload /tmp/r2e_tests.tar.gz.sealed
    -> repeat process cleanup; upload helpers have seen ciphertext only
    -> verify and decrypt as /tmp/r2e_tests.tar.gz via /usr/bin/python3 -I -S
    -> extract as /testbed/r2e_tests under the system-only PATH
    -> delete sandbox and evaluator archives
    -> run /bin/bash run_tests.sh
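The preserve/kill/rescan steps at the top of the scoring flow can be sketched as a small loop. This is a simplified model, not the shell script from the PR: scan and kill are injected callables (assumptions), identities are (pid, start time) tuples, and ancestors are tracked by PID. A real kill callable would send SIGSTOP and then SIGKILL.

```python
def select_victims(live, baseline, ancestors):
    """Identities to kill: live, not PID 1, not in the baseline, not an ancestor."""
    return sorted(
        ident
        for ident in live
        if ident[0] != 1 and ident not in baseline and ident[0] not in ancestors
    )


def sweep(scan, kill, baseline, ancestors, max_rounds=100):
    """Rescan until a complete scan finds no non-baseline process.

    Returns False if processes are still alive after max_rounds, so the
    caller can fail closed instead of restoring the tests.
    """
    for _ in range(max_rounds):
        victims = select_victims(scan(), baseline, ancestors)
        if not victims:
            return True
        for pid, _start in victims:
            kill(pid)
    return False


# PID 50 was in the baseline with start time 100 but has since been reused.
baseline = {(1, 1), (50, 100)}
live = {(1, 1), (50, 999), (60, 5), (70, 7)}
print(select_victims(live, baseline, ancestors={70}))
```

The example shows why the start time matters: the reused PID 50 is selected even though PID 50 appears in the baseline.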

SIGKILL cannot be trapped by a watcher. Repeating the scan closes the fork-after-expansion race; zombies are harmless and ignored, while an unkillable live process keeps scoring fail-closed until its timeout. Preserving both the setup baseline and the active exec ancestry keeps provider/runtime control processes alive while removing detached agent children.

Cleanup and restore invoke /bin/sh directly and exclude /testbed/.venv/bin and /root/.local/bin from PATH, so agent-planted tar, sed, or shell shims are not selected after the archive returns. The test command separately retains the project PATH required by the repository environment.

The upload boundary is separately protected: sandbox-side mkdir/cat helpers receive only SHAKE-encrypted, HMAC-authenticated bytes; their descendants are reaped before the key exists in the sandbox, and modified ciphertext fails closed.

Archive, transfer, cleanup, and restore failures stay inside the existing setup or scoring error boundaries.
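For readers unfamiliar with the sealing scheme, here is a minimal encrypt-then-MAC sketch using a SHAKE-256 keystream and HMAC-SHA256 from the Python standard library. The nonce length, key derivation labels, and byte layout are assumptions made for illustration; the PR's actual format may differ.

```python
import hashlib
import hmac
import secrets

NONCE_LEN = 16  # assumed layout: nonce || ciphertext || tag
TAG_LEN = 32


def _keys(key: bytes) -> tuple[bytes, bytes]:
    # Derive separate encryption and MAC keys from the per-rollout key.
    return (
        hashlib.sha256(b"enc" + key).digest(),
        hashlib.sha256(b"mac" + key).digest(),
    )


def _xor_stream(enc_key: bytes, nonce: bytes, data: bytes) -> bytes:
    stream = hashlib.shake_256(enc_key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


def seal(key: bytes, plaintext: bytes) -> bytes:
    enc_key, mac_key = _keys(key)
    nonce = secrets.token_bytes(NONCE_LEN)
    body = nonce + _xor_stream(enc_key, nonce, plaintext)
    return body + hmac.new(mac_key, body, hashlib.sha256).digest()


def unseal(key: bytes, sealed: bytes) -> bytes:
    enc_key, mac_key = _keys(key)
    body, tag = sealed[:-TAG_LEN], sealed[-TAG_LEN:]
    expected = hmac.new(mac_key, body, hashlib.sha256).digest()
    # Verify before decrypting so modified ciphertext fails closed.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("sealed archive failed authentication")
    return _xor_stream(enc_key, body[:NONCE_LEN], body[NONCE_LEN:])


key = secrets.token_bytes(32)  # fresh key per rollout
sealed = seal(key, b"archive bytes")
print(unseal(key, sealed) == b"archive bytes")
```

Authenticating the nonce together with the ciphertext means a helper that flips any uploaded byte causes unseal to raise rather than produce altered plaintext.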

This cleanup relies on Linux /proc, which matches the container runtimes and Linux R2E images required by the taskset. It is a user-space boundary inside the existing sandbox; compromising PID 1, the provider control plane, or the kernel remains outside the taskset threat model.

No-agent validation fast path

When hide_tests_from_agent = false, no harness inspects the sandbox, so setup uses:

rm -rf /testbed/r2e_tests && mv /r2e_tests /testbed/r2e_tests

Validation then applies the gold patch and runs the same scoring command without an archive round trip or process sweep.

Validation-path timing breakdown

The benchmark used 2,000 deterministic 4 KiB files across 20 directories (8,192,000 bytes total), one warmup per strategy, and nine alternating measured rounds on fresh copies. Fixture preparation and hashing were outside the timed region; shell startup and replacement of a pre-existing destination were included.

| Strategy | Hide (range) | Restore (range) | Total median | Total range |
|---|---|---|---|---|
| tar/gzip, remove, extract | 0.839079–1.035526 s | 0.615873–0.860195 s | 1.686329 s | 1.454952–1.895722 s |
| remove, rename | 0.008513–0.010382 s | 0.008714–0.009748 s | 0.018027 s | 0.017227–0.020131 s |

For no-agent validation, rename is 93.54× faster at the median, a 98.93% elapsed-time reduction:

(1.686329 - 0.018027) / 1.686329 × 100 = 98.93%
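Both figures can be reproduced from the two total medians reported in the timing results; the variable names below are only illustrative.

```python
tar_median = 1.686329     # seconds: tar/gzip, remove, extract
rename_median = 0.018027  # seconds: remove, rename

speedup = tar_median / rename_median
reduction = (tar_median - rename_median) / tar_median * 100

print(f"{speedup:.2f}x faster, {reduction:.2f}% reduction")
# prints "93.54x faster, 98.93% reduction"
```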

These timings cover filesystem operations, not provider-specific transfer. The default agent-safe path prioritizes isolation and pays the sealing, archive transfer, and two process-sweep costs; the rename result applies only when hide_tests_from_agent is explicitly disabled.

Filesystem assumption

Only the validation fast path depends on a same-filesystem rename between /r2e_tests and /testbed/r2e_tests. Across mounts, rename fails and mv falls back to a recursive copy and delete, losing its constant-time and atomic-rename properties.
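One way to observe the assumption directly is to attempt a bare rename and treat EXDEV as the cross-mount signal instead of silently copying. This is a sketch with a hypothetical helper name; the PR itself uses the shell mv command.

```python
import errno
import os
import tempfile


def try_atomic_rename(src: str, dst: str) -> bool:
    """Rename src to dst; return False instead of copying across mounts."""
    try:
        os.rename(src, dst)  # a single rename call, no copy fallback
        return True
    except OSError as exc:
        if exc.errno == errno.EXDEV:  # source and destination are on different mounts
            return False
        raise


with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "r2e_tests")
    dst = os.path.join(root, "testbed_r2e_tests")
    os.mkdir(src)
    print(try_atomic_rename(src, dst), os.path.isdir(dst))
```

A False result would indicate that the fast path's timing numbers do not apply to that sandbox layout.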


Note

Cursor Bugbot is generating a summary for commit e70e1d3.

Note

Overview

Restore the R2E Gym isolation contract that was introduced in 3e2a8510d5 but lost during the V1 port:

  • agent rollouts archive /r2e_tests, transfer the archive to the evaluator process, and remove both the test tree and remote archive before the harness starts
  • scoring uploads the archive back to the same sandbox, restores /testbed/r2e_tests, removes the temporary archive, and runs the existing reward path
  • no-agent validation can explicitly set hide_tests_from_agent = false to move the directory directly into place

This changes only environments/r2e_gym_v1/r2e_gym_v1/taskset.py.

Execution topology

This does not create a second sandbox. Each rollout still provisions exactly one agent sandbox.

machine/process running Verifiers
    /tmp/r2e_tests_<runtime-id>_<random>.tar.gz
                    ⇅ runtime file transfer
single agent sandbox
    /tmp/r2e_tests.tar.gz

The two /tmp paths are on different filesystems:

  • with a local Docker runtime, the evaluator path is on the Docker host
  • with a Prime runtime, the evaluator path is on the machine or worker running the Verifiers evaluation process
  • the sandbox path belongs to the one agent sandbox provisioned for that rollout

The evaluator process retains only the archive bytes between setup and scoring. It does not execute the agent or tests in another sandbox.

Why the archive matters

Compression is not the isolation boundary; moving the archive out of the sandbox is.

The V1 port left /opt/r2e_tests.tar.gz inside the same root-capable sandbox as the agent. Although that kept individual files out of ordinary directory traversal, an agent could still list, extract, or stream the known archive.

With the default hide_tests_from_agent = true path, the sandbox archive exists only during taskset setup and scoring—outside the harness execution window. Setup transfers it to a collision-free mode-0600 evaluator temporary file whose prefix includes the per-rollout runtime ID, then deletes both /r2e_tests and the sandbox archive. The harness therefore has neither representation available. Scoring reverses the transfer into the same sandbox and deletes the evaluator copy after restoration.

A runtime finalizer also owns cleanup of the evaluator file if a rollout exits before scoring.

Lifecycle breakdown

Agent-safe path (default)

Setup:

/r2e_tests in the single agent sandbox
    -> /tmp/r2e_tests.tar.gz in that sandbox
    -> /tmp/r2e_tests_<runtime-id>_<random>.tar.gz on the evaluator filesystem
    -> delete /r2e_tests and the sandbox archive

Harness execution:

the agent sandbox contains neither /r2e_tests nor a test archive

Scoring:

evaluator archive
    -> /tmp/r2e_tests.tar.gz in the same agent sandbox
    -> extract as /testbed/r2e_tests
    -> delete sandbox and evaluator archives
    -> run /bin/bash run_tests.sh

Archive, transfer, removal, and restore failures remain attributable to the existing setup or scoring boundaries.

No-agent validation fast path

When hide_tests_from_agent = false, no harness will inspect the sandbox, so setup uses:

rm -rf /testbed/r2e_tests && mv /r2e_tests /testbed/r2e_tests

Validation then applies the gold patch and runs the same scoring command without an archive round trip.

Validation-path timing breakdown

The benchmark used 2,000 deterministic 4 KiB files across 20 directories (8,192,000 bytes total), one warmup per strategy, and nine alternating measured rounds on fresh copies. Fixture preparation and hashing were outside the timed region; shell startup and replacement of a pre-existing destination were included.

| Strategy | Hide (range) | Restore (range) | Total median | Total range |
|---|---|---|---|---|
| tar/gzip, remove, extract | 0.839079–1.035526 s | 0.615873–0.860195 s | 1.686329 s | 1.454952–1.895722 s |
| remove, rename | 0.008513–0.010382 s | 0.008714–0.009748 s | 0.018027 s | 0.017227–0.020131 s |

For no-agent validation, rename is 93.54× faster at the median, a 98.93% elapsed-time reduction:

(1.686329 - 0.018027) / 1.686329 × 100 = 98.93%

These timings intentionally cover the filesystem operations, not provider-specific transfer. The default agent-safe path prioritizes isolation and pays the evaluator transfer cost; the rename result applies only when hide_tests_from_agent is explicitly disabled.

Filesystem assumption

Only the validation fast path depends on a same-filesystem rename between /r2e_tests and /testbed/r2e_tests. Across mounts, rename fails and mv falls back to a recursive copy and delete, losing its constant-time and atomic-rename properties.



Changes since #1776 opened

  • Replaced static host test archive path with per-runtime temporary file management in R2EGymTaskset [e70e1d3]
  • Removed HOST_TESTS_ARCHIVE constant and updated module imports in r2e_gym_v1.taskset [e70e1d3]
  • Added process cleanup to R2EGymTaskset.solved reward method that kills non-baseline processes before test restoration [2bda5ed]
  • Added shell script constants for process snapshotting and selective killing [2bda5ed]
  • Added baseline process snapshot capture to R2EGymTaskset runtime setup [2bda5ed]
  • Introduced SYSTEM_ENV constant within r2e_gym_v1.taskset with PATH restricted to system directories and documentation indicating hidden-test control commands must not resolve tools from agent-writable project paths [5e234c7]
  • Modified three runtime.run invocations within R2EGymTaskset.solved method to use /bin/sh shell and updated environment configurations [5e234c7]
  • Reworked the process cleanup shell script in r2e_gym_v1.taskset module to repeatedly scan and terminate non-ancestor, non-baseline processes until none remain [5e33a74]
  • Wrapped runtime.stop method in R2EGymTaskset runtime setup to clean up host archive and weakref entries before stopping [5e33a74]
  • Implemented encrypted test archive staging using XOR encryption with SHAKE-256 keystream and HMAC-SHA256 authentication [a4631c1]
  • Modified test archive restoration orchestration to decrypt sealed archives inside the sandbox runtime after terminating agent processes [a4631c1]
  • Updated shell cleanup and restore commands to remove sealed test archive files [a4631c1]

@macroscopeapp

macroscopeapp Bot commented Jun 20, 2026


Approvability

Verdict: Needs human review

This PR introduces new security-sensitive functionality including custom cryptographic operations for test archive encryption, process management to kill agent processes, and host-side archive storage. The complexity and security implications of these new capabilities warrant human review.

macroscopeapp Bot previously approved these changes Jun 20, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 770f6b3254

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread environments/r2e_gym_v1/r2e_gym_v1/taskset.py Outdated
@macroscopeapp macroscopeapp Bot dismissed their stale review June 20, 2026 12:43

Dismissing prior approval to re-evaluate db3b042

xeophon changed the title from "Optimize R2E Gym hidden-test staging" to "Restore isolated R2E test staging with a validation fast path" Jun 20, 2026

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit db3b042.

Comment thread environments/r2e_gym_v1/r2e_gym_v1/taskset.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e70e1d3a9f


Comment thread environments/r2e_gym_v1/r2e_gym_v1/taskset.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2bda5ed924


Comment thread environments/r2e_gym_v1/r2e_gym_v1/taskset.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5e234c7154


Comment thread environments/r2e_gym_v1/r2e_gym_v1/taskset.py Outdated
Comment thread environments/r2e_gym_v1/r2e_gym_v1/taskset.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5e33a74384


Comment thread environments/r2e_gym_v1/r2e_gym_v1/taskset.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a4631c1b8e


Comment on lines +400 to +402
    decrypt = await runtime.run(
        ["/usr/bin/python3", "-I", "-S", "-c", DECRYPT_TESTS],
        SYSTEM_ENV | {"R2E_ARCHIVE_KEY": key.hex()},


P1: Do not hand the archive key to mutable sandbox Python

In root-capable R2E sandboxes, the agent can replace /usr/bin/python3 or its stdlib/shared libraries during generation; the process sweep only kills live PIDs and does not restore filesystem changes. This line then executes that agent-mutated interpreter with R2E_ARCHIVE_KEY, so a wrapper can copy the key, decrypt /tmp/r2e_tests.tar.gz.sealed, or patch /testbed before delegating, leaking the hidden tests despite the ciphertext upload. Please keep decryption/extraction in a clean trusted runtime or otherwise run the key through immutable verified helpers.



2 participants