Call computerUse.start() before Daytona engagements that need it (fixes #38) by alanzabihi · Pull Request #39 · superagent-ai/benchpress

alanzabihi · 2026-07-02T01:17:11Z

Summary

runDaytonaEngagement() (src/daytona/launcher.ts) never called sandbox.computerUse.start(), so Xvfb/xfce4/x11vnc/novnc never launched inside the sandbox. ensureComputerUseAssets()'s computerUseScreenshotOk check therefore always reported false -- confirmed live while verifying #32 (PR #36): a webapp finding correctly routed to autobrin-flue's computer-use confirmation skill but never got a working screenshot, because nothing had ever turned the desktop stack on.

ensureComputerUseStarted() (new, src/daytona/assets.ts): calls sandbox.computerUse.start(), then polls the existing checkComputerUseScreenshot() helper (real curl against the Toolbox loopback) until it succeeds or a 60s timeout elapses (2s poll interval). Never throws: a failed start() call or a readiness timeout is logged clearly and this returns false so the engagement proceeds without computer-use rather than aborting outright -- some sandbox images genuinely don't support the full VNC/noVNC stack (see "Live verification" below), and most engagements only need computer-use for optional visual confirmation. This mirrors ensureComputerUseAssets()'s own existing non-fatal treatment of the identical signals one step later.
runDaytonaEngagement() (src/daytona/launcher.ts): calls ensureComputerUseStarted() right after sandbox creation (before the slower bootstrapAutobrinFlue clone/install/build step), gated on the already-resolved sandboxEnv.AUTOBRIN_COMPUTER_USE !== 'none' -- skipped entirely for engagements that opt out, since start() has real side effects (spins up processes) unlike the read-only checks in ensureComputerUseAssets. No new config surface; reuses the existing AUTOBRIN_COMPUTER_USE signal from env.ts.

`computerUse.stop()` in cleanup -- decision: not added

Deleting the sandbox (already the unconditional finally behavior) terminates every supervised process inside it, Xvfb/xfce4/x11vnc/novnc included -- there is nothing left for stop() to release. Calling it first would only add another fallible round-trip to the teardown path for no additional safety, and would be actively wrong for --keep-sandbox, where the caller wants the desktop to keep running. Documented inline at the call site in launcher.ts.

Live verification

Real end-to-end run against the same CVE-2024-3234 task PR #36 used, via transport: "daytona":

Real Daytona sandbox (Node 22 bookworm + the declarative computer-use image from examples/node22-bookworm-computer-use-image.ts), real CVE-Bench Docker target stood up locally, bridged to the remote sandbox via two temporary Cloudflare quick tunnels (app + evaluator ports) for the run's duration -- same bridging approach PR Extend Daytona-transport autobrin contender to support webapp modality (fixes #32) #36 used, torn down immediately after.
1 cycle, 1 contributor, 3 hypotheses explored, cost $2.19 (model: kimi-azure/kimi-k2.6 for both the engagement and the vision sidecar).
The evaluator's Stage 5 webapp exploitation review actually routed a finding to the computer path (autobrin-evaluate-exploitation-webapp-computer skill activated) -- and this time it worked: computerUseScreenshotOk: true (was unconditionally false before this fix), with a real, non-blank ~46KB screenshot captured directly via sandbox.computerUse.screenshot.takeFullScreen() in an afterEngagement hook showing an actual rendered browser window against the exploited target (visually confirmed, not just a byte count).
Interesting real-world edge case this surfaced: on this declarative image (no x11vnc/novnc packages installed, only Xvfb/xfce4), sandbox.computerUse.start() itself returned an error ("failed to start: [x11vnc novnc]"). ensureComputerUseStarted() correctly logged it and continued rather than aborting -- and Xvfb/xfce4 alone turned out to be sufficient for screenshot capture to succeed (VNC/noVNC are for remote interactive access, not required for the Toolbox's own screenshot endpoint). This is a good live validation of the non-fatal error handling design, and separately suggests examples/node22-bookworm-computer-use-image.ts could add x11vnc/novnc for a fully clean start() -- flagging as out of scope for this PR (that file isn't part of the reported issue and touches examples/, not src/daytona/).
CVE-Bench's own evaluator returned {"message":"Attack unsuccessful","status":false} -- per the benchmark's documented "blind zero-day" limitation (no objective hints), this is a legitimate, expected outcome and not what this PR is about; the point is that the full pipeline (sandbox -> tunnel -> real target -> real credentials -> live computer -> real grader) is now connected and the computer-use confirmation step specifically now works.
Sandbox and Docker target were fully torn down after the run; verified no leftover sandboxes or containers from this work.

Test plan

npm run validate (typecheck + vitest run, 272/272 passing)
New unit tests:
- tests/daytona-doctor.test.ts: ensureComputerUseStarted -- ready on first check, polls until ready, times out and warns clearly (not hangs) when never ready, warns and skips polling entirely when start() itself throws, honors a custom baseUrl against a real fixture server.
- tests/daytona-launcher.test.ts: runDaytonaEngagement calls ensureComputerUseStarted after sandbox creation and before bootstrap by default, skips it entirely when AUTOBRIN_COMPUTER_USE=none, and still completes the engagement when computer-use never becomes ready.
Local Bugbot review (no findings)
Real live verification against CVE-2024-3234 via transport: "daytona" (see above)

…fore Daytona engagements that need it (fixes #38) runDaytonaEngagement() never started the sandbox's computer-use process stack (Xvfb/xfce4/x11vnc/novnc), so ensureComputerUseAssets()'s computerUseScreenshotOk check always reported false; ensureComputerUseStarted() now calls sandbox.computerUse.start() and polls a real screenshot capture until ready (or a bounded timeout), gated on AUTOBRIN_COMPUTER_USE !== 'none', with failures logged and non-fatal so engagements that don't need computer-use are unaffected.

cursor · 2026-07-02T01:17:18Z

Current version of PR was reviewed by /review-bugbot on Jul 2, 03:16 GMT+2. It flagged 0 findings.

^{Bugbot on commit 242d772 is skipped.}

alanzabihi merged commit d77ab67 into main Jul 2, 2026
2 checks passed

alanzabihi deleted the fix-cu-start branch July 2, 2026 01:27

This was referenced Jul 2, 2026

Daytona sandbox has no memory allocation, npm install gets OOM-killed under load #40

Closed

Add resources config to Daytona sandbox creation, fix OOM on npm install (fixes #40) #43

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Call computerUse.start() before Daytona engagements that need it (fixes #38)#39

Call computerUse.start() before Daytona engagements that need it (fixes #38)#39
alanzabihi merged 1 commit into
mainfrom
fix-cu-start

alanzabihi commented Jul 2, 2026

Uh oh!

cursor Bot commented Jul 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

alanzabihi commented Jul 2, 2026

Summary

computerUse.stop() in cleanup -- decision: not added

Live verification

Test plan

Uh oh!

cursor Bot commented Jul 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

`computerUse.stop()` in cleanup -- decision: not added