feat: multimodal browser-agent debug environment by mikasenghaas · Pull Request #1764 · PrimeIntellect-ai/verifiers

mikasenghaas · 2026-06-19T04:36:52Z

Summary

New v1 environment: a sandboxed local-app browser-agent environment (flight-search web-app tasks). Each task boots a local SPA + headless-Chromium CDP service inside a per-task Docker image; a browser agent drives it by screenshots → vision model → click/type actions, then submits a structured JSON result, which a structured-output LLM judge (default openai/gpt-4.1-mini on Prime inference) scores against a deterministic answer key (answer_key reward, weight 1.0). Requires a multimodal model.
The browser agent is private and is not vendored in this repo. The harness fetches it at run time from a private, auth-gated GitHub repo (pinned to a commit), caches it locally, and stages it into the sandbox. Configure via --harness.agent-repo / --harness.agent-ref with a token in --harness.agent-token-env; a --harness.agent-path local override is provided for development. A .gitignore keeps the agent out of version control.
Tasks are pulled dynamically from the private Prime hub, not bundled in the repo: load_tasks fetches the dataset via prime env pull and caches it under ~/.cache/verifiers/. Override with --taskset.dataset_path, or repoint --taskset.hub_env_id / --taskset.hub_version.
The judge config is a JudgeConfig subconfig (--taskset.judge.model + a verifiers BaseClientConfig), so the endpoint, team header, and API key auto-resolve to Prime inference. The judge uses strict structured output (json_schema) so its verdict is always valid JSON (with a tolerant parser as a backstop).
Co-packages the taskset and a thin (public) harness shim in one module (both resolved via __all__, so --taskset.id and --harness.id share the name). Registered as an editable workspace member (pyproject.toml + uv.lock).
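
As a rough illustration of the run-time fetch described above, the sketch below builds (but does not send) an authenticated request for a tarball pinned to a ref, using GitHub's REST tarball endpoint. The PR does not show how the harness actually downloads the agent, so the helper name `build_agent_request`, the env-var name, and the repo/ref values are all placeholders.

```python
import os
import urllib.request


def build_agent_request(repo: str, ref: str, token_env: str) -> urllib.request.Request:
    """Build a request for a tarball of `repo` at `ref` (hypothetical helper)."""
    token = os.environ.get(token_env)
    if not token:
        raise RuntimeError(f"set {token_env} to a token with read access to {repo}")
    # GitHub REST endpoint: GET /repos/{owner}/{repo}/tarball/{ref}
    url = f"https://api.github.com/repos/{repo}/tarball/{ref}"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
```

Reading the token from an environment variable whose name is configurable keeps the secret itself out of the command line and the config file, which appears to be the intent of --harness.agent-token-env.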

Adapted to the current feat/nano-as-v1 API: boundary error types (HarnessError / TasksetError); module discovery via __all__ (no load_environment wiring). Config is kept minimal — timeouts, task shuffling, and selection are left to the framework; the sandbox image is set per-task.

Run

The task image is a Prime-registry image, so use the prime runtime; supply the agent token + pinned ref and a vision model:

export <AGENT_TOKEN_ENV>=<token>
uv run eval <env-id> --harness.id <env-id> \
  --harness.runtime.type prime --harness.agent-ref <agent-commit-sha> \
  -m <multimodal-model> -n 1 -r 1 -c 1

Status

Verified end-to-end on the prime runtime: dynamic hub task pull, plugin discovery, private-source agent staging, sandbox provisioning, the agent completing a task and submitting a result, and the structured-output judge scoring it. A sample run scored 0.83 (5/6 fields) — the agent solved the search but omitted baggage fees from the total.
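
To make the 0.83 figure concrete: with six scored fields and one mismatch, the matched fraction is 5/6 ≈ 0.83. The sketch below computes that fraction with plain equality. In the environment itself an LLM judge makes the per-field call, and the field names and values here are invented for illustration.

```python
def fraction_matched(answer_key: dict, submitted: dict) -> float:
    """Share of answer-key fields whose submitted value equals the expected one."""
    if not answer_key:
        raise ValueError("answer key must contain at least one field")
    matched = sum(
        1 for field, expected in answer_key.items() if submitted.get(field) == expected
    )
    return matched / len(answer_key)


# Hypothetical six-field task; only the total differs (baggage fees omitted).
answer_key = {
    "origin": "AAA",
    "destination": "BBB",
    "date": "2026-01-01",
    "airline": "Example Air",
    "stops": 0,
    "total": 250,
}
submitted = {**answer_key, "total": 200}
print(round(fraction_matched(answer_key, submitted), 2))  # 0.83
```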

🤖 Generated with Claude Code

Sandboxed local-app browser-agent environment ported to the current v1 API. Tasks are pulled dynamically from the Prime hub and cached locally. The browser agent is proprietary and fetched at run time from a private repo (not vendored); the harness stages it into the sandbox. Co-packages the taskset and harness. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The judge model can return JSON with the root object left unclosed (it stops after the final array); the old fallback grabbed a nested `}` and then crashed the rollout on an unguarded json.loads. Strip code fences, balance open brackets, and fall back to a default verdict instead of raising. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
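
The three steps named in this commit message (strip code fences, balance open brackets, fall back to a default verdict) can be sketched as follows. This is not the PR's parser: the function names and the shape of `DEFAULT_VERDICT` are assumptions made for the example.

```python
import json
import re

# Hypothetical fallback; the real default verdict is not shown in the PR.
DEFAULT_VERDICT = {"score": 0.0, "reason": "unparseable judge output"}


def _balance(text: str) -> str:
    """Append closers for any brackets left open outside of string literals."""
    stack = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    return text + "".join(reversed(stack))


def parse_verdict(raw: str) -> dict:
    """Parse judge output, tolerating code fences and an unclosed root object."""
    text = re.sub(r"^```[a-zA-Z]*\s*|\s*```$", "", raw.strip())
    for candidate in (text, _balance(text)):
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    # Never raise into the rollout; return a copy of the default instead.
    return dict(DEFAULT_VERDICT)
```

Trying the raw text first and the balanced text second means well-formed output is never modified, and the fallback only triggers when both attempts fail.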

Walrus for the env lookup, .exists() instead of the nested try/except, and drop the defensive isinstance/str cast (the prime config is always a dict). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Judge config is now a JudgeConfig subconfig (model + verifiers BaseClientConfig), so the endpoint, team header, and API key auto-resolve to Prime inference; drop the bespoke prime_team_id / prime_default_headers and the flat judge_* fields.
- Judge uses a structured-output (json_schema) model (default openai/gpt-4.1-mini) so the verdict is always valid JSON.
- Drop config knobs that are framework-internal or belong on the task: per-task timeouts, sandbox_image (now set directly on the task), task shuffling, task_indices / task_profile, seed_store_path (+ the sqlite seed-store path), and the configurable cache dir (now a constant).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

macroscopeapp · 2026-06-19T05:09:35Z

+            staging = dest / (self.config.agent_package + ".tmp")
+            if staging.exists():
+                shutil.rmtree(staging)
+            shutil.copytree(matches[0], staging)
+            os.replace(staging, dest / self.config.agent_package)


🟢 Low harness/__init__.py:239

The _download_agent staging path is hard-coded (dest / "<package>.tmp"), so concurrent processes race on it: one may rmtree the directory while another is copytree-ing into it, causing a failed or corrupted copy. The final os.replace atomically commits the finished directory but does not protect the staging work from interleaving. Use a unique staging directory per process (e.g. tempfile.mkdtemp inside dest) to prevent collisions.

-            staging = dest / (self.config.agent_package + ".tmp")
-            if staging.exists():
-                shutil.rmtree(staging)
-            shutil.copytree(matches[0], staging)
+            staging = Path(tempfile.mkdtemp(dir=dest, prefix=self.config.agent_package + ".tmp."))
+            shutil.copytree(matches[0], staging, dirs_exist_ok=True)
             os.replace(staging, dest / self.config.agent_package)

Evidence trail: environments/mini_browse_apps_platform_v1/mini_browse_apps_platform_v1/harness/__init__.py lines 196-197 (TOCTOU guard), lines 238-243 (deterministic staging path and non-atomic staging operations) at REVIEWED_COMMIT.
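
A minimal sketch of the per-process staging idea from this review comment, written as a standalone helper rather than as the PR's `_download_agent`. The name `stage_package` and its policy of keeping an already-committed copy are assumptions. Note that `tempfile.mkdtemp` creates the directory, so `shutil.copytree` needs `dirs_exist_ok=True`.

```python
import os
import shutil
import tempfile
from pathlib import Path


def stage_package(src: Path, dest: Path, package: str) -> Path:
    """Copy `src` to `dest/package` through a staging directory unique to this call."""
    dest.mkdir(parents=True, exist_ok=True)
    final = dest / package
    # mkdtemp picks a unique name, so concurrent callers never share a staging dir.
    staging = Path(tempfile.mkdtemp(dir=dest, prefix=package + ".tmp."))
    try:
        # mkdtemp already created the directory, hence dirs_exist_ok=True.
        shutil.copytree(src, staging, dirs_exist_ok=True)
        try:
            os.replace(staging, final)
        except OSError:
            # The target already exists and is not empty (for example, another
            # process committed first); keep that copy and discard ours.
            if not final.is_dir():
                raise
    finally:
        # After a successful replace the staging path is gone; this is a no-op.
        shutil.rmtree(staging, ignore_errors=True)
    return final
```

This version never deletes another process's work in progress. A real harness would still have to decide how to refresh a stale copy at `dest/package`, which the sketch deliberately leaves alone.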

macroscopeapp · 2026-06-19T05:09:35Z

+            with tarfile.open(archive) as tar:
+                tar.extractall(extract, filter="data")


🟢 Low harness/__init__.py:230

tar.extractall(extract, filter="data") raises TypeError: extractall() got an unexpected keyword argument 'filter' on Python 3.10.0–3.10.11 and 3.11.0–3.11.3, because the filter parameter was only backported in 3.10.12 / 3.11.4. Since _download_agent runs on the host (not inside the 3.12 Docker container), hosts on those earlier patch versions will crash with an unhelpful error. Consider guarding with hasattr(tarfile, 'data_filter') to fall back gracefully.

             with tarfile.open(archive) as tar:
-                tar.extractall(extract, filter="data")
+                if hasattr(tarfile, 'data_filter'):
+                    tar.extractall(extract, filter="data")
+                else:
+                    tar.extractall(extract)

Evidence trail: environments/mini_browse_apps_platform_v1/mini_browse_apps_platform_v1/harness/__init__.py lines 230-231 (REVIEWED_COMMIT): `tar.extractall(extract, filter="data")`. environments/mini_browse_apps_platform_v1/pyproject.toml line 4: `requires-python = ">=3.10"`. Root pyproject.toml line 14: `requires-python = ">=3.11,<3.14"`. Python 3.11 docs (https://docs.python.org/uk/3.11/library/tarfile.html): 'New in version 3.11.4' for extraction filters. Python 3.14 docs (https://docs.python.org/3/library/tarfile.html): 'Changed in version 3.12: Added the filter parameter.' and recommendation to use `hasattr(tarfile, 'data_filter')`. scikit-learn issue #31521 for identical bug.
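
The guard suggested in this comment can be sketched as a small standalone helper. `safe_extract` is a hypothetical name, and the fallback branch follows the reviewer's suggestion of extracting without a filter, which trusts the archive contents.

```python
import tarfile
from pathlib import Path


def safe_extract(archive: Path, target: Path) -> None:
    """Extract `archive` into `target`, using the "data" filter when available."""
    with tarfile.open(archive) as tar:
        if hasattr(tarfile, "data_filter"):
            # Extraction filters exist on this interpreter; reject unsafe members.
            tar.extractall(target, filter="data")
        else:
            # Older patch releases lack the `filter` parameter entirely.
            tar.extractall(target)
```

Since the archive here comes from a pinned commit in a repo the team controls, the unfiltered fallback may be acceptable; for untrusted archives, raising a clear error asking for a newer Python patch release would be the more conservative choice.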

- Group the harness agent-source fields under an `agent` subconfig (agent.repo / agent.ref / agent.path / ...), dropping the agent_ prefix.
- Pass the model endpoint/key/model to the in-sandbox agent via a JSON file (--model-client) instead of OPENAI_* env vars; program.py builds the client and passes it to the agent explicitly, keeping the secret out of the process env.
- Drop the redundant harness path config fields (use the shared contract paths directly) and inline taskset constants that only fed config defaults.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
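
One way to picture the JSON-file handoff described in the second bullet is sketched below. Only the `--model-client` flag is named in the PR; the JSON keys, the helper names, and the 0600 file mode are assumptions.

```python
import json
import os
from pathlib import Path


def write_model_client(path: Path, base_url: str, api_key: str, model: str) -> None:
    """Write client settings to a file readable only by its owner (hypothetical layout)."""
    payload = {"base_url": base_url, "api_key": api_key, "model": model}
    # Create the file with mode 0600 so the key is not world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle)


def read_model_client(path: Path) -> dict:
    """Load the settings on the agent side; the key never enters os.environ."""
    return json.loads(path.read_text())
```

Compared with OPENAI_* environment variables, a file passed by path is not inherited by child processes the agent spawns, such as the browser, which is presumably the point of keeping the secret out of the process env.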

The framework enforces --max-turns at the interception layer for any harness, so the harness-specific max_steps knob was redundant. Remove it; program.py keeps its own default step backstop, and rollouts are capped with --max-turns. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Default the agent source to PrimeIntellect-ai/mini-browse pinned at 157b449 (the private browser-agent repo), so the env fetches it out of the box.
- Rename the proxy env var to MINI_BROWSE_HTTP_PROXY and say "private" rather than "proprietary" in the harness/README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

macroscopeapp Bot reviewed Jun 19, 2026


mikasenghaas force-pushed the feat/mini-browse-apps-platform-v1 branch from 1d2b73b to 0631af5 on June 19, 2026 04:52

mikasenghaas and others added 3 commits June 19, 2026 04:57

refactor(v1): simplify prime_team_id in the browse-apps judge

e020bbe

Walrus for the env lookup, .exists() instead of the nested try/except, and drop the defensive isinstance/str cast (the prime config is always a dict). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mikasenghaas changed the title ~~feat: add mini-browse-apps-platform-v1 environment (dynamic hub tasks)~~ feat: multimodal browser-agent debug environment Jun 19, 2026

macroscopeapp Bot reviewed Jun 19, 2026


mikasenghaas and others added 3 commits June 19, 2026 05:10


mikasenghaas wants to merge 7 commits into feat/nano-as-v1 from feat/mini-browse-apps-platform-v1
