Skip to content

feat: multimodal browser-agent debug environment#1764

Draft
mikasenghaas wants to merge 7 commits into
feat/nano-as-v1from
feat/mini-browse-apps-platform-v1
Draft

feat: multimodal browser-agent debug environment#1764
mikasenghaas wants to merge 7 commits into
feat/nano-as-v1from
feat/mini-browse-apps-platform-v1

Conversation

@mikasenghaas

@mikasenghaas mikasenghaas commented Jun 19, 2026

Copy link
Copy Markdown
Member

Summary

  • New v1 environment: a sandboxed local-app browser-agent environment (flight-search web-app tasks). Each task boots a local SPA + headless-Chromium CDP service inside a per-task Docker image; a browser agent drives it by screenshots → vision model → click/type actions, then submits a structured JSON result, which a structured-output LLM judge (default openai/gpt-4.1-mini on Prime inference) scores against a deterministic answer key (answer_key reward, weight 1.0). Requires a multimodal model.
  • The browser agent is private and is not vendored in this repo. The harness fetches it at run time from a private, auth-gated GitHub repo (pinned to a commit), caches it locally, and stages it into the sandbox. Configure via --harness.agent-repo / --harness.agent-ref with a token in --harness.agent-token-env; a --harness.agent-path local override is provided for development. A .gitignore keeps the agent out of version control.
  • Tasks are pulled dynamically from the private Prime hub, not bundled in the repo: load_tasks fetches the dataset via prime env pull and caches it under ~/.cache/verifiers/. Override with --taskset.dataset_path, or repoint --taskset.hub_env_id / --taskset.hub_version.
  • The judge config is a JudgeConfig subconfig (--taskset.judge.model + a verifiers BaseClientConfig), so the endpoint, team header, and API key auto-resolve to Prime inference. The judge uses strict structured output (json_schema) so its verdict is always valid JSON (with a tolerant parser as a backstop).
  • Co-packages the taskset and a thin (public) harness shim in one module (both resolved via __all__, so --taskset.id and --harness.id share the name). Registered as an editable workspace member (pyproject.toml + uv.lock).

Adapted to the current feat/nano-as-v1 API: boundary error types (HarnessError / TasksetError); module discovery via __all__ (no load_environment wiring). Config is kept minimal — timeouts, task shuffling, and selection are left to the framework; the sandbox image is set per-task.

Run

The task image is a Prime-registry image, so use the prime runtime; supply the agent token + pinned ref and a vision model:

export <AGENT_TOKEN_ENV>=<token>
uv run eval <env-id> --harness.id <env-id> \
  --harness.runtime.type prime --harness.agent-ref <agent-commit-sha> \
  -m <multimodal-model> -n 1 -r 1 -c 1

Status

Verified end-to-end on the prime runtime: dynamic hub task pull, plugin discovery, private-source agent staging, sandbox provisioning, the agent completing a task and submitting a result, and the structured-output judge scoring it. A sample run scored 0.83 (5/6 fields) — the agent solved the search but omitted baggage fees from the total.

🤖 Generated with Claude Code

Comment thread environments/mini_browse_apps_platform_v1/mini_browse_apps_platform_v1/judge.py Outdated
Comment thread environments/mini_browse_apps_platform_v1/mini_browse_apps_platform_v1/taskset.py Outdated
Sandboxed local-app browser-agent environment ported to the current v1 API.
Tasks are pulled dynamically from the Prime hub and cached locally. The browser
agent is proprietary and fetched at run time from a private repo (not vendored);
the harness stages it into the sandbox. Co-packages the taskset and harness.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas force-pushed the feat/mini-browse-apps-platform-v1 branch from 1d2b73b to 0631af5 Compare June 19, 2026 04:52
mikasenghaas and others added 3 commits June 19, 2026 04:57
The judge model can return JSON with the root object left unclosed (it stops
after the final array); the old fallback grabbed a nested `}` and then crashed
the rollout on an unguarded json.loads. Strip code fences, balance open
brackets, and fall back to a default verdict instead of raising.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Walrus for the env lookup, .exists() instead of the nested try/except, and drop
the defensive isinstance/str cast (the prime config is always a dict).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Judge config is now a JudgeConfig subconfig (model + verifiers BaseClientConfig),
  so the endpoint, team header, and API key auto-resolve to Prime inference; drop
  the bespoke prime_team_id / prime_default_headers and the flat judge_* fields.
- Judge uses a structured-output (json_schema) model (default openai/gpt-4.1-mini)
  so the verdict is always valid JSON.
- Drop config knobs that are framework-internal or belong on the task: per-task
  timeouts, sandbox_image (now set directly on the task), task shuffling,
  task_indices / task_profile, seed_store_path (+ the sqlite seed-store path), and
  the configurable cache dir (now a constant).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas changed the title feat: add mini-browse-apps-platform-v1 environment (dynamic hub tasks) feat: multimodal browser-agent debug environment Jun 19, 2026
Comment on lines +239 to +243
staging = dest / (self.config.agent_package + ".tmp")
if staging.exists():
shutil.rmtree(staging)
shutil.copytree(matches[0], staging)
os.replace(staging, dest / self.config.agent_package)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low harness/__init__.py:239

The _download_agent staging path is hard-coded (dest / "<package>.tmp"), so concurrent processes race on it: one may rmtree the directory while another is copytree-ing into it, causing a failed or corrupted copy. The final os.replace atomically commits the finished directory but does not protect the staging work from interleaving. Use a unique staging directory per process (e.g. tempfile.mkdtemp inside dest) to prevent collisions.

-            staging = dest / (self.config.agent_package + ".tmp")
-            if staging.exists():
-                shutil.rmtree(staging)
-            shutil.copytree(matches[0], staging)
+            staging = Path(tempfile.mkdtemp(dir=dest, prefix=self.config.agent_package + ".tmp."))
+            shutil.copytree(matches[0], staging)
             os.replace(staging, dest / self.config.agent_package)
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @environments/mini_browse_apps_platform_v1/mini_browse_apps_platform_v1/harness/__init__.py around lines 239-243:

The `_download_agent` staging path is hard-coded (`dest / "<package>.tmp"`), so concurrent processes race on it: one may `rmtree` the directory while another is `copytree`-ing into it, causing a failed or corrupted copy. The final `os.replace` atomically commits the finished directory but does not protect the staging work from interleaving. Use a unique staging directory per process (e.g. `tempfile.mkdtemp` inside `dest`) to prevent collisions.

Evidence trail:
environments/mini_browse_apps_platform_v1/mini_browse_apps_platform_v1/harness/__init__.py lines 196-197 (TOCTOU guard), lines 238-243 (deterministic staging path and non-atomic staging operations) at REVIEWED_COMMIT.

Comment on lines +230 to +231
with tarfile.open(archive) as tar:
tar.extractall(extract, filter="data")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low harness/__init__.py:230

tar.extractall(extract, filter="data") raises TypeError: extractall() got an unexpected keyword argument 'filter' on Python 3.10.0–3.10.11 and 3.11.0–3.11.3, because the filter parameter was only backported in 3.10.12 / 3.11.4. Since _download_agent runs on the host (not inside the 3.12 Docker container), hosts on those earlier patch versions will crash with an unhelpful error. Consider guarding with hasattr(tarfile, 'data_filter') to fall back gracefully.

 with tarfile.open(archive) as tar:
-                tar.extractall(extract, filter="data")
+                if hasattr(tarfile, 'data_filter'):
+                    tar.extractall(extract, filter="data")
+                else:
+                    tar.extractall(extract)
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @environments/mini_browse_apps_platform_v1/mini_browse_apps_platform_v1/harness/__init__.py around lines 230-231:

`tar.extractall(extract, filter="data")` raises `TypeError: extractall() got an unexpected keyword argument 'filter'` on Python 3.10.0–3.10.11 and 3.11.0–3.11.3, because the `filter` parameter was only backported in 3.10.12 / 3.11.4. Since `_download_agent` runs on the host (not inside the 3.12 Docker container), hosts on those earlier patch versions will crash with an unhelpful error. Consider guarding with `hasattr(tarfile, 'data_filter')` to fall back gracefully.

Evidence trail:
environments/mini_browse_apps_platform_v1/mini_browse_apps_platform_v1/harness/__init__.py lines 230-231 (REVIEWED_COMMIT): `tar.extractall(extract, filter="data")`. environments/mini_browse_apps_platform_v1/pyproject.toml line 4: `requires-python = ">=3.10"`. Root pyproject.toml line 14: `requires-python = ">=3.11,<3.14"`. Python 3.11 docs (https://docs.python.org/uk/3.11/library/tarfile.html): 'New in version 3.11.4' for extraction filters. Python 3.14 docs (https://docs.python.org/3/library/tarfile.html): 'Changed in version 3.12: Added the filter parameter.' and recommendation to use `hasattr(tarfile, 'data_filter')`. scikit-learn issue #31521 for identical bug.

mikasenghaas and others added 3 commits June 19, 2026 05:10
- Group the harness agent-source fields under an `agent` subconfig (agent.repo /
  agent.ref / agent.path / ...), dropping the agent_ prefix.
- Pass the model endpoint/key/model to the in-sandbox agent via a JSON file
  (--model-client) instead of OPENAI_* env vars; program.py builds the client and
  passes it to the agent explicitly, keeping the secret out of the process env.
- Drop the redundant harness path config fields (use the shared contract paths
  directly) and inline taskset constants that only fed config defaults.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The framework enforces --max-turns at the interception layer for any harness, so
the harness-specific max_steps knob was redundant. Remove it; program.py keeps its
own default step backstop, and rollouts are capped with --max-turns.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Default the agent source to PrimeIntellect-ai/mini-browse pinned at 157b449
  (the private browser-agent repo), so the env fetches it out of the box.
- Rename the proxy env var to MINI_BROWSE_HTTP_PROXY and say "private" rather
  than "proprietary" in the harness/README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant