feat: multimodal browser-agent debug environment #1764
Sandboxed local-app browser-agent environment ported to the current v1 API. Tasks are pulled dynamically from the Prime hub and cached locally. The browser agent is proprietary and fetched at run time from a private repo (not vendored); the harness stages it into the sandbox. Co-packages the taskset and harness. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1d2b73b to 0631af5
The judge model can return JSON with the root object left unclosed (it stops after the final array); the old fallback grabbed a nested `}` and then crashed the rollout on an unguarded json.loads. Strip code fences, balance open brackets, and fall back to a default verdict instead of raising. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
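The recovery path described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the names `parse_verdict`, `balance_brackets`, and `default_verdict`, and the shape of the default verdict, are all hypothetical.

```python
import json
import re

# Matches an opening fence (with optional language tag) or a closing fence.
_FENCE = re.compile(r"^```[a-zA-Z0-9_-]*\s*\n?|\n?```\s*$")


def default_verdict() -> dict:
    # Hypothetical default; the real verdict schema is not shown in the PR.
    return {"passed": False, "reason": "unparseable judge output"}


def balance_brackets(text: str) -> str:
    """Append closers for any brackets left open outside string literals."""
    stack = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack and stack[-1] == ch:
            stack.pop()
    return text + "".join(reversed(stack))


def parse_verdict(raw: str) -> dict:
    """Parse judge output, falling back to a default instead of raising."""
    text = _FENCE.sub("", raw.strip()).strip()
    try:
        parsed = json.loads(balance_brackets(text))
    except json.JSONDecodeError:
        return default_verdict()
    return parsed if isinstance(parsed, dict) else default_verdict()
```

Tracking string state matters here: a `}` inside a quoted value must not be counted as a closer, which is the class of mistake the old "grab a nested `}`" fallback made.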
Walrus for the env lookup, .exists() instead of the nested try/except, and drop the defensive isinstance/str cast (the prime config is always a dict). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Judge config is now a JudgeConfig subconfig (model + verifiers BaseClientConfig), so the endpoint, team header, and API key auto-resolve to Prime inference; drop the bespoke prime_team_id / prime_default_headers and the flat judge_* fields.
- Judge uses a structured-output (json_schema) model (default openai/gpt-4.1-mini) so the verdict is always valid JSON.
- Drop config knobs that are framework-internal or belong on the task: per-task timeouts, sandbox_image (now set directly on the task), task shuffling, task_indices / task_profile, seed_store_path (+ the sqlite seed-store path), and the configurable cache dir (now a constant).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
staging = dest / (self.config.agent_package + ".tmp")
if staging.exists():
    shutil.rmtree(staging)
shutil.copytree(matches[0], staging)
os.replace(staging, dest / self.config.agent_package)
🟢 Low harness/__init__.py:239
The _download_agent staging path is hard-coded (dest / "<package>.tmp"), so concurrent processes race on it: one may rmtree the directory while another is copytree-ing into it, causing a failed or corrupted copy. The final os.replace atomically commits the finished directory but does not protect the staging work from interleaving. Use a unique staging directory per process (e.g. tempfile.mkdtemp inside dest) to prevent collisions.
- staging = dest / (self.config.agent_package + ".tmp")
- if staging.exists():
-     shutil.rmtree(staging)
- shutil.copytree(matches[0], staging)
+ staging = Path(tempfile.mkdtemp(dir=dest, prefix=self.config.agent_package + ".tmp."))
+ shutil.copytree(matches[0], staging, dirs_exist_ok=True)  # mkdtemp has already created staging
  os.replace(staging, dest / self.config.agent_package)
Evidence trail:
environments/mini_browse_apps_platform_v1/mini_browse_apps_platform_v1/harness/__init__.py lines 196-197 (TOCTOU guard), lines 238-243 (deterministic staging path and non-atomic staging operations) at REVIEWED_COMMIT.
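A self-contained sketch of the per-process staging idea, under a few assumptions: `stage_package` and its parameters are hypothetical names, and the "another process already committed" handling is an illustrative policy rather than what the harness does. Note that `tempfile.mkdtemp` creates the directory, so `shutil.copytree` needs `dirs_exist_ok=True` to copy into it.

```python
import os
import shutil
import tempfile
from pathlib import Path


def stage_package(source: Path, dest: Path, package: str) -> Path:
    """Copy `source` into a unique staging dir, then commit it as dest/package."""
    dest.mkdir(parents=True, exist_ok=True)
    final = dest / package
    # Unique per call, so concurrent processes never share a staging path.
    staging = Path(tempfile.mkdtemp(dir=dest, prefix=package + ".tmp."))
    try:
        shutil.copytree(source, staging, dirs_exist_ok=True)
        try:
            os.replace(staging, final)
        except OSError:
            # Renaming onto a non-empty directory fails; treat an existing
            # final directory as "another process already committed".
            if not final.is_dir():
                raise
    finally:
        # No-op after a successful rename; cleans up a losing or failed copy.
        shutil.rmtree(staging, ignore_errors=True)
    return final
```

Keeping the staging directory inside `dest` keeps the final rename on one filesystem, which is what lets `os.replace` commit the directory in a single step.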
with tarfile.open(archive) as tar:
    tar.extractall(extract, filter="data")
🟢 Low harness/__init__.py:230
tar.extractall(extract, filter="data") raises TypeError: extractall() got an unexpected keyword argument 'filter' on Python 3.10.0–3.10.11 and 3.11.0–3.11.3, because the filter parameter was only backported in 3.10.12 / 3.11.4. Since _download_agent runs on the host (not inside the 3.12 Docker container), hosts on those earlier patch versions will crash with an unhelpful error. Consider guarding with hasattr(tarfile, 'data_filter') to fall back gracefully.
  with tarfile.open(archive) as tar:
-     tar.extractall(extract, filter="data")
+     if hasattr(tarfile, 'data_filter'):
+         tar.extractall(extract, filter="data")
+     else:
+         tar.extractall(extract)
Evidence trail:
environments/mini_browse_apps_platform_v1/mini_browse_apps_platform_v1/harness/__init__.py lines 230-231 (REVIEWED_COMMIT): `tar.extractall(extract, filter="data")`. environments/mini_browse_apps_platform_v1/pyproject.toml line 4: `requires-python = ">=3.10"`. Root pyproject.toml line 14: `requires-python = ">=3.11,<3.14"`. Python 3.11 docs (https://docs.python.org/uk/3.11/library/tarfile.html): 'New in version 3.11.4' for extraction filters. Python 3.14 docs (https://docs.python.org/3/library/tarfile.html): 'Changed in version 3.12: Added the filter parameter.' and recommendation to use `hasattr(tarfile, 'data_filter')`. scikit-learn issue #31521 for identical bug.
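One possible shape for the guarded extraction is sketched below. The function name `safe_extract` is hypothetical, and the manual member check in the fallback branch is an added assumption: it is stricter than the plain `tar.extractall(extract)` fallback suggested in the review, which performs no filtering on interpreters that lack extraction filters.

```python
import tarfile
from pathlib import Path


def safe_extract(archive: Path, target: Path) -> None:
    """Extract `archive` into `target`, using the data filter when available."""
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tar:
        if hasattr(tarfile, "data_filter"):
            # Interpreters with extraction filters (3.12+, or patched 3.10/3.11).
            tar.extractall(target, filter="data")
            return
        # Older interpreters: reject links and paths that escape the target.
        root = target.resolve()
        for member in tar.getmembers():
            resolved = (root / member.name).resolve()
            if resolved != root and root not in resolved.parents:
                raise tarfile.TarError(f"unsafe path in archive: {member.name}")
            if member.issym() or member.islnk():
                raise tarfile.TarError(f"link in archive: {member.name}")
        tar.extractall(target)
```

Rejecting links outright is a blunt choice that may be too strict for archives that legitimately contain symlinks; it is shown only to make the fallback branch safe by default.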
- Group the harness agent-source fields under an `agent` subconfig (agent.repo / agent.ref / agent.path / ...), dropping the agent_ prefix.
- Pass the model endpoint/key/model to the in-sandbox agent via a JSON file (--model-client) instead of OPENAI_* env vars; program.py builds the client and passes it to the agent explicitly, keeping the secret out of the process env.
- Drop the redundant harness path config fields (use the shared contract paths directly) and inline taskset constants that only fed config defaults.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
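The file-based handoff could look something like the sketch below. Everything here is illustrative: `ModelClientSpec`, its field names, and the two helper functions are hypothetical, and the actual JSON layout read by `--model-client` is not shown in this PR.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ModelClientSpec:
    # Hypothetical fields; the real file format may differ.
    base_url: str
    api_key: str
    model: str


def write_model_client(spec: ModelClientSpec, directory: Path) -> Path:
    """Write the spec to a fresh file; mkstemp creates it owner-readable only."""
    fd, name = tempfile.mkstemp(
        dir=directory, prefix="model-client.", suffix=".json"
    )
    with os.fdopen(fd, "w") as fh:
        json.dump(asdict(spec), fh)
    return Path(name)


def read_model_client(path: Path) -> ModelClientSpec:
    """Load the spec on the agent side, without touching os.environ."""
    return ModelClientSpec(**json.loads(path.read_text()))
```

The point of the design is that the key travels as an explicit argument rather than ambient state, so child processes spawned by the agent do not inherit it through the environment.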
The framework enforces --max-turns at the interception layer for any harness, so the harness-specific max_steps knob was redundant. Remove it; program.py keeps its own default step backstop, and rollouts are capped with --max-turns. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Default the agent source to PrimeIntellect-ai/mini-browse pinned at 157b449 (the private browser-agent repo), so the env fetches it out of the box.
- Rename the proxy env var to MINI_BROWSE_HTTP_PROXY and say "private" rather than "proprietary" in the harness/README.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
- The judge (`openai/gpt-4.1-mini` on Prime inference) scores against a deterministic answer key (`answer_key` reward, weight 1.0). Requires a multimodal model.
- The private agent is fetched at run time via `--harness.agent-repo` / `--harness.agent-ref`, with a token in `--harness.agent-token-env`; a `--harness.agent-path` local override is provided for development. A `.gitignore` keeps the agent out of version control.
- `load_tasks` fetches the dataset via `prime env pull` and caches it under `~/.cache/verifiers/`. Override with `--taskset.dataset_path`, or repoint `--taskset.hub_env_id` / `--taskset.hub_version`.
- Judge config is a `JudgeConfig` subconfig (`--taskset.judge.model` + a verifiers `BaseClientConfig`), so the endpoint, team header, and API key auto-resolve to Prime inference. The judge uses strict structured output (`json_schema`) so its verdict is always valid JSON (with a tolerant parser as a backstop).
- Taskset and harness are co-packaged (discovered via `__all__`, so `--taskset.id` and `--harness.id` share the name). Registered as an editable workspace member (`pyproject.toml` + `uv.lock`).

Adapted to the current `feat/nano-as-v1` API: boundary error types (`HarnessError` / `TasksetError`); module discovery via `__all__` (no `load_environment` wiring). Config is kept minimal: timeouts, task shuffling, and selection are left to the framework; the sandbox image is set per-task.

Run
The task image is a Prime-registry image, so use the `prime` runtime; supply the agent token + pinned ref and a vision model.

Status
Verified end-to-end on the `prime` runtime: dynamic hub task pull, plugin discovery, private-source agent staging, sandbox provisioning, the agent completing a task and submitting a result, and the structured-output judge scoring it. A sample run scored 0.83 (5/6 fields): the agent solved the search but omitted baggage fees from the total.

🤖 Generated with Claude Code