Skip to content

feat(daemon): opt-in S3 cache restore + snapshot for sandbox runner#1182

Open
nicacioliveira wants to merge 1 commit into
mainfrom
feat/sandbox-cache-restore
Open

feat(daemon): opt-in S3 cache restore + snapshot for sandbox runner#1182
nicacioliveira wants to merge 1 commit into
mainfrom
feat/sandbox-cache-restore

Conversation

@nicacioliveira

@nicacioliveira nicacioliveira commented May 5, 2026

Copy link
Copy Markdown
Contributor

Summary

Sandbox cold start today re-downloads ~200MB of Deno modules every wake-up because /deno-dir is on the pod's writable layer (lost between claims). On a typical Storefront site this is the largest single chunk in the 12-15s wake-up budget.

This patch caches /deno-dir in S3 per site (one cache per repo, shared across envs of the same site since deps are identical):

s3://<bucket>/<site>/cache.tgz

Relationship to the existing daemon/cache.ts (downloadCache)

There is already an S3 cache flow for sandbox mode:

// daemon/main.ts:312
if (SANDBOX_MODE) {
  await downloadCache(siteName).catch(...)
}

It downloads cache.tar.zst from s3://deco-assets-storage/deco-sites/<site>/ (sa-east-1, static ADMIN_S3_* keys) into DENO_DIR_RUN. The cache is produced by an external build pipeline (decocms/platform-infra/site-builder), not by the runner — so it only updates when the build pipeline runs, not after a runtime session adds new modules.

This PR is complementary, not a replacement:

existing downloadCache this PR restoreCache/snapshotCache
Producer external build pipeline the runner itself, post-boot
Direction download only download + upload
Bucket deco-assets-storage (sa-east-1) configurable via deploy body (us-west-2 in prod)
Auth static ADMIN_S3_* keys Pod Identity (default credential chain)
Target DENO_DIR_RUN (worker) /deno-dir (daemon)
Update cadence per build per wake-up

A future PR can consolidate (move the build-pipeline-produced cache to the same us-west-2 bucket and Pod Identity, retire the sa-east-1 + static keys path). Out of scope here.

How activation works (this PR)

Config arrives via the /sandbox/deploy POST body (envs field) — not via container env vars. The agent-sandbox operator (v0.4.x) rejects per-claim env when a warm pool is bound to the claim:

// extensions/controllers/sandboxclaim_controller.go
if policy != WarmPoolPolicyNone && len(claim.Spec.Env) > 0 {
    err := fmt.Errorf("custom environment variables are not supported when using a warm pool")

So the admin forwards cache hints through the deploy body, and this PR consumes them.

Hint env vars (all optional, all forwarded by admin via the deploy body):

  • DECO_CACHE_S3_BUCKET (presence enables the feature)
  • DECO_CACHE_S3_REGION (default us-west-2)
  • DECO_CACHE_S3_KEY (admin sets to <site>/cache.tgz)

Lifecycle in the runner

onDeploy:
  setSiteName(site)
  await restoreCache(cfg)            ← BLOCKS deploy: fetch tarball, untar to /deno-dir
  // existing downloadCache(siteName) still runs unchanged after this
  createSiteApp({...})
  currentSite.gitReady
    .then(() => snapshotCache(cfg))   ← DETACHED: tar /deno-dir + upload to S3

Rollout-safe

  • Off by default. When the deploy body has no cache hints (DECO_CACHE_S3_BUCKET unset on the admin pod), isEnabled(cfg) is false and both restoreCache and snapshotCache return { ok: false, reason: "disabled" } immediately. Zero behaviour change vs. main.
  • Cache is advisory. Every error path logs and returns a no-op result — missing key on first wake-up of a site (NoSuchKey is normal, logged at info level), S3 outage, malformed tarball, tar process failure. The wake-up may be slower, never broken.
  • Auth via standard AWS SDK credential chain → Pod Identity in EKS supplies credentials transparently. No static keys baked in for this path.

Files

  • daemon/sandboxCache.ts (new): cacheConfigFromEnvs, restoreCache, snapshotCache. All wrapped in try/catch.
  • daemon/main.ts: two hooks in the SANDBOX_MODE onDeploy — restore before createSiteApp, snapshot detached after gitReady. The existing downloadCache(siteName) invocation a few lines below is left intact.

Companion PRs

  • decocms/terraform-eks-cluster#5 — bucket IAM + Pod Identity associations
  • deco-sites/admin#3098 — admin forwards DECO_CACHE_S3_* in deploy body + S3 cleanup on site delete
  • deco-sites/admin#3096 — wake-up splash (independent of this work)

Future work (out of scope)

  • Consolidate the legacy daemon/cache.ts downloadCache into the same path (us-west-2 bucket, Pod Identity, switchable bucket via env). The legacy flow is still useful as a "build-pipeline-produced" cache, but the auth + region should match this PR for consistency.
  • Decide whether to spin up a separate sandbox-runner image (Dockerfile + workflow) tailored just for sandbox mode — currently the runner image (ghcr.io/deco-cx/deco/ai) carries non-sandbox modes too.

Test plan

  • First wake-up of a site (no cache exists): runner logs [sandbox-cache] no cache yet for s3://... (cold start), deploy completes normally, then logs [sandbox-cache] snapshotted YMB → s3://....
  • Second wake-up after snapshot: runner logs [sandbox-cache] restored cache from s3://..., deno cache resolution noticeably faster.
  • No DECO_CACHE_S3_BUCKET in deploy body: disabled returned, no S3 calls. Existing behaviour preserved.
  • S3 outage during restore: restore failed warning, deploy still completes via cold-start path.
  • S3 outage during snapshot: snapshot failed warning, deploy result already returned to admin (detached).

🤖 Generated with Claude Code


Summary by cubic

Adds opt-in S3-backed restore and snapshot of the Deno cache to cut sandbox cold starts by reusing /deno-dir (saves ~5s on typical sites). Restores on deploy, snapshots after startup; off by default and never blocks or breaks deploys.

  • New Features
    • Restore: download s3://<bucket>/<site>/cache.tgz into /deno-dir during /sandbox/deploy; awaited before worker starts.
    • Snapshot: tar and upload /deno-dir after gitReady; detached from the deploy response.
    • Enable via deploy body envs: DECO_CACHE_S3_BUCKET (enables), DECO_CACHE_S3_REGION (default us-west-2), DECO_CACHE_S3_KEY (e.g., <site>/cache.tgz).
    • Auth via AWS default credential chain (Pod Identity). No static keys.
    • Works alongside existing downloadCache; that flow is unchanged.
    • Advisory behavior: errors are logged and ignored; cold start path remains.

Written for commit c6e44db. Summary will update on new commits.

Summary by CodeRabbit

  • New Features
    • Added Deno module caching integration for sandbox deployments, automatically restoring cached modules during initialization and capturing snapshots after deployment completes for improved build efficiency.

Sandbox cold start today re-downloads ~200MB of Deno modules every wake-up
because /deno-dir is on the pod's writable layer (lost between claims).
On a typical Storefront site this is the largest single chunk in the
12-15s wake-up budget (~5s).

This patch caches /deno-dir in S3 per site (one cache per repo, shared
across envs of the same site since deps are identical):

  s3://<bucket>/<site>/cache.tgz

Config arrives via the /sandbox/deploy POST body (envs field). The admin
forwards the cache hints there because the agent-sandbox operator
(v0.4.x) rejects per-claim env when a warm pool is bound to the claim:

    Error seen: custom environment variables are not supported when
    using a warm pool
      — kubernetes-sigs/agent-sandbox extensions/controllers

Hint env vars (all optional, all forwarded by admin via deploy body):
  - DECO_CACHE_S3_BUCKET   (presence enables the feature)
  - DECO_CACHE_S3_REGION   (default: us-west-2)
  - DECO_CACHE_S3_KEY      (admin sets to "<site>/cache.tgz")

Lifecycle:
- restoreCache(cfg) runs at the top of onDeploy, awaited so /deno-dir
  is populated before the worker subprocess starts requiring modules.
- snapshotCache(cfg) fires after gitReady, detached from the deploy
  response — the user-visible deploy never waits on the upload.

Reliability invariant: the cache is *advisory*. Every error path logs
and returns a no-op result. A missing key (first wake-up of a site),
S3 outage, malformed tarball, tar process failure — all fall back to
the cold-start path. The wake-up may be slower, never broken.

Auth via standard AWS SDK credential chain → Pod Identity in EKS
provides credentials transparently (no static keys baked in).

Companion changes (separate PRs):
- decocms/terraform-eks-cluster: bucket + IAM + Pod Identity
- deco-sites/admin: forward DECO_CACHE_S3_* in /sandbox/deploy body

Co-Authored-By: Nicacio Oliveira <nicacio@deco.cx>
@github-actions

github-actions Bot commented May 5, 2026

Copy link
Copy Markdown
Contributor

Tagging Options

Should a new tag be published when this PR is merged?

  • 👍 for Patch 1.197.1 update
  • 🎉 for Minor 1.198.0 update
  • 🚀 for Major 2.0.0 update

@coderabbitai

coderabbitai Bot commented May 5, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

A new S3-backed cache module is added to store and restore Deno module dependencies for sandbox deploys. The daemon integrates this cache by restoring modules at deploy start and snapshotting them afterward, with graceful error handling throughout.

Changes

Deno Module Cache Integration

Layer / File(s) Summary
Cache Infrastructure
daemon/sandboxCache.ts
Defines CacheConfig and CacheResult types. Implements cacheConfigFromEnvs() to extract S3 bucket, region, and key from environment variables. Adds region-scoped S3 client caching to avoid re-instantiation.
Cache Operations
daemon/sandboxCache.ts
restoreCache() downloads cache.tgz from S3 and streams it into tar -xzf - to unpack into /. snapshotCache() creates a gzipped tarball of /deno-dir, buffers it in memory, and uploads to S3 via PutObjectCommand. Both operations return structured results with timing and reason; failures are logged but not thrown.
Deploy Lifecycle Integration
daemon/main.ts
Imports cache helpers and computes CacheConfig from deploy envs. Calls restoreCache() synchronously before deploy proceeds. After site's git-ready phase, triggers snapshotCache() asynchronously without blocking the deploy response.

Sequence Diagram

sequenceDiagram
    participant Deploy as Deploy Lifecycle
    participant Cache as Cache Module
    participant S3 as S3 Storage
    participant Tar as Tar Process
    participant FS as Filesystem<br>(/deno-dir)

    Deploy->>Cache: restoreCache(config)
    Cache->>S3: GetObjectCommand(bucket, key)
    S3-->>Cache: cache.tgz stream
    Cache->>Tar: pipe stream → tar -xzf -
    Tar->>FS: extract to /
    Tar-->>Cache: exit status
    Cache-->>Deploy: CacheResult {ok, ms}

    Deploy->>Deploy: Deploy proceeds

    Deploy->>Deploy: git-ready resolves

    Deploy->>Cache: snapshotCache(config) [async]
    Cache->>FS: Deno.stat(/deno-dir)
    FS-->>Cache: directory exists
    Cache->>Tar: tar -czf - /deno-dir
    Tar-->>Cache: gzipped archive (buffered)
    Cache->>S3: PutObjectCommand(bucket, key, archive)
    S3-->>Cache: upload complete
    Cache-->>Deploy: CacheResult {ok, ms} [fire & forget]
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • deco-cx/deco#1061: Implements the same sandbox Deno-dir S3 cache restore functionality with a non-fatal cache restore step in the deploy lifecycle.

Poem

🐰 A warren of modules, cached in the cloud,
Tar-zipped and shipped with no fuss, no crowd,
S3 keeps the bundles, we fetch them with speed,
No errors will break us—we plant the seed.
Swift deploys, swift snapshots, the rabbit's delight!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(daemon): opt-in S3 cache restore + snapshot for sandbox runner' accurately summarizes the main change: adding optional S3-backed caching for Deno module restoration and snapshots in the sandbox runner.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/sandbox-cache-restore

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@daemon/main.ts`:
- Around line 623-629: The snapshot is being triggered too early from
currentSite.gitReady which can cause uploading a cold/incomplete /deno-dir; move
the call to snapshotCache(cacheCfg) off the gitReady promise and instead invoke
it only after runtime dependencies are fully initialized (e.g., after the first
successful worker/module resolution or a deps-ready signal). Concretely, remove
or stop using currentSite.gitReady.then(...snapshotCache...) and wire
snapshotCache(cacheCfg) to the lifecycle event that represents "deps fully
initialized" (add or use an existing currentSite.depsReady, worker.initComplete,
or the first successful module resolution hook) so the snapshot runs only once
dependencies and manifest/block generation have completed; keep snapshotCache
detached and swallow errors as before.

In `@daemon/sandboxCache.ts`:
- Around line 95-100: The spawned Deno.Command instances for "tar" (the variable
tar used in restore and snapshot) set stderr to "piped" but only read it after
await tar.status(), which can deadlock if stderr fills; fix by draining
tar.stderr concurrently (e.g., spawn an async task that reads/consumes
tar.stderr.readable or pipes it to /dev/null or processLogger) immediately after
.spawn() so stderr cannot block the child, and apply the same change to the
other tar invocations noted (the Deno.Command calls around the regions
referenced).
- Around line 80-87: The restoreCache function can hang because S3 download and
untar have no timeout; add a hard timeout (e.g. using AbortController with
setTimeout or Promise.race) around the S3 client(region).send(new
GetObjectCommand(...)) and the subsequent untar/stream processing so the whole
restore path aborts after a configured max (e.g. 30s); when timeout triggers
abort the S3 request, kill/cleanup any tar streams, and return { ok: false, ms:
elapsed, reason: "timeout" } from restoreCache (update the code paths around
restoreCache, GetObjectCommand handling and the untar/pipeline section
referenced at lines ~102-107 to respect the abort signal and perform cleanup).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1739163f-da8f-417b-8738-e21ebfe034f4

📥 Commits

Reviewing files that changed from the base of the PR and between 2666148 and c6e44db.

📒 Files selected for processing (2)
  • daemon/main.ts
  • daemon/sandboxCache.ts

Comment thread daemon/main.ts
Comment on lines +623 to +629
// Best-effort cache snapshot — fires after the worker is up so
// /deno-dir reflects every module the site actually needed.
// Detached: never blocks the deploy response. Failures are
// logged and swallowed inside snapshotCache().
currentSite.gitReady
.then(() => snapshotCache(cacheCfg))
.catch(() => {/* never throws */});

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Snapshot trigger is too early; it can upload a cold/incomplete /deno-dir.

This hook runs after gitReady, which resolves before manifest/block generation and before runtime dependency warmup. You can end up overwriting S3 with a mostly empty cache on cold deploys.

Please trigger snapshot from a “deps fully initialized” signal (or after first successful worker/module resolution), not from gitReady.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@daemon/main.ts` around lines 623 - 629, The snapshot is being triggered too
early from currentSite.gitReady which can cause uploading a cold/incomplete
/deno-dir; move the call to snapshotCache(cacheCfg) off the gitReady promise and
instead invoke it only after runtime dependencies are fully initialized (e.g.,
after the first successful worker/module resolution or a deps-ready signal).
Concretely, remove or stop using currentSite.gitReady.then(...snapshotCache...)
and wire snapshotCache(cacheCfg) to the lifecycle event that represents "deps
fully initialized" (add or use an existing currentSite.depsReady,
worker.initComplete, or the first successful module resolution hook) so the
snapshot runs only once dependencies and manifest/block generation have
completed; keep snapshotCache detached and swallow errors as before.

Comment thread daemon/sandboxCache.ts
Comment on lines +80 to +87
export async function restoreCache(cfg: CacheConfig): Promise<CacheResult> {
if (!isEnabled(cfg)) return { ok: false, ms: 0, reason: "disabled" };
const { bucket, key, region = "us-west-2" } = cfg;
const start = performance.now();
try {
const resp = await client(region).send(
new GetObjectCommand({ Bucket: bucket, Key: key }),
);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add a hard timeout to restore to avoid blocking deploy startup.

restoreCache is awaited in deploy startup, but the S3 fetch/untar path has no timeout. A stalled S3 stream or tar process can hang deploy indefinitely instead of falling back to cold start.

Proposed change
 export async function restoreCache(cfg: CacheConfig): Promise<CacheResult> {
   if (!isEnabled(cfg)) return { ok: false, ms: 0, reason: "disabled" };
   const { bucket, key, region = "us-west-2" } = cfg;
   const start = performance.now();
+  const timeoutMs = 15_000;
   try {
+    const abortSignal = AbortSignal.timeout(timeoutMs);
     const resp = await client(region).send(
       new GetObjectCommand({ Bucket: bucket, Key: key }),
+      { abortSignal },
     );
@@
-    await stream.pipeTo(tar.stdin);
-    const status = await tar.status;
+    await Promise.race([
+      stream.pipeTo(tar.stdin),
+      new Promise((_, reject) =>
+        setTimeout(() => {
+          tar.kill("SIGKILL");
+          reject(new Error(`restore timeout after ${timeoutMs}ms`));
+        }, timeoutMs)
+      ),
+    ]);
+    const status = await tar.status;

Also applies to: 102-107

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@daemon/sandboxCache.ts` around lines 80 - 87, The restoreCache function can
hang because S3 download and untar have no timeout; add a hard timeout (e.g.
using AbortController with setTimeout or Promise.race) around the S3
client(region).send(new GetObjectCommand(...)) and the subsequent untar/stream
processing so the whole restore path aborts after a configured max (e.g. 30s);
when timeout triggers abort the S3 request, kill/cleanup any tar streams, and
return { ok: false, ms: elapsed, reason: "timeout" } from restoreCache (update
the code paths around restoreCache, GetObjectCommand handling and the
untar/pipeline section referenced at lines ~102-107 to respect the abort signal
and perform cleanup).

Comment thread daemon/sandboxCache.ts
Comment on lines +95 to +100
const tar = new Deno.Command("tar", {
args: ["-xzf", "-", "-C", "/"],
stdin: "piped",
stdout: "null",
stderr: "piped",
}).spawn();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Drain tar.stderr concurrently to prevent child-process deadlock.

In both restore and snapshot, stderr is piped but only read after waiting for process completion. If stderr output exceeds pipe capacity, the process can block before exit.

Proposed change
     const tar = new Deno.Command("tar", {
@@
       stderr: "piped",
     }).spawn();
+    const stderrText = new Response(tar.stderr).text().catch(() => "");
@@
-    const status = await tar.status;
+    const status = await tar.status;
@@
-      const stderr = await new Response(tar.stderr).text().catch(() => "");
+      const stderr = await stderrText;
     const tar = new Deno.Command("tar", {
@@
       stderr: "piped",
     }).spawn();
+    const stderrText = new Response(tar.stderr).text().catch(() => "");
 
     const body = new Uint8Array(await new Response(tar.stdout).arrayBuffer());
     const status = await tar.status;
     if (!status.success) {
-      const stderr = await new Response(tar.stderr).text().catch(() => "");
+      const stderr = await stderrText;

Also applies to: 106-110, 158-162, 165-168

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@daemon/sandboxCache.ts` around lines 95 - 100, The spawned Deno.Command
instances for "tar" (the variable tar used in restore and snapshot) set stderr
to "piped" but only read it after await tar.status(), which can deadlock if
stderr fills; fix by draining tar.stderr concurrently (e.g., spawn an async task
that reads/consumes tar.stderr.readable or pipes it to /dev/null or
processLogger) immediately after .spawn() so stderr cannot block the child, and
apply the same change to the other tar invocations noted (the Deno.Command calls
around the regions referenced).

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 issues found across 2 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="daemon/sandboxCache.ts">

<violation number="1" location="daemon/sandboxCache.ts:105">
P1: Potential child-process deadlock: `stderr` is set to `"piped"` but is only consumed after `await tar.status`. If tar writes more than the OS pipe buffer capacity (~64KB) to stderr, it will block waiting for a reader, and `tar.status` will never resolve. Start draining stderr concurrently before awaiting status:
```ts
const stderrText = new Response(tar.stderr).text().catch(() => "");
await stream.pipeTo(tar.stdin);
const status = await tar.status;
// then: const stderr = await stderrText;
```</violation>

<violation number="2" location="daemon/sandboxCache.ts:154">
P2: The empty-cache guard is incomplete: this only checks that `DENO_DIR` exists, so an empty directory can still be snapshotted and overwrite a valid cache in S3.</violation>

<violation number="3" location="daemon/sandboxCache.ts:164">
P1: `stderr` is piped but not drained until after `await tar.status`; this can block the tar process on a full stderr pipe and hang snapshotting.</violation>
</file>

<file name="daemon/main.ts">

<violation number="1" location="daemon/main.ts:587">
P1: `restoreCache` is awaited on the deploy critical path but has no timeout. A stalled S3 stream or hung `tar` process will block deploy startup indefinitely instead of falling back to the cold-start path. Add an `AbortSignal.timeout()` to the S3 request and a timeout/race around `stream.pipeTo(tar.stdin)` to ensure the deploy proceeds within a bounded time.</violation>

<violation number="2" location="daemon/main.ts:628">
P2: `snapshotCache` is scheduled on `gitReady`, which resolves right after git clone (before startup/cache resolution). This can upload a premature cache snapshot that misses dependencies needed by the running site.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread daemon/sandboxCache.ts
Comment on lines +164 to +167
const body = new Uint8Array(await new Response(tar.stdout).arrayBuffer());
const status = await tar.status;
if (!status.success) {
const stderr = await new Response(tar.stderr).text().catch(() => "");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: stderr is piped but not drained until after await tar.status; this can block the tar process on a full stderr pipe and hang snapshotting.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At daemon/sandboxCache.ts, line 164:

<comment>`stderr` is piped but not drained until after `await tar.status`; this can block the tar process on a full stderr pipe and hang snapshotting.</comment>

<file context>
@@ -0,0 +1,198 @@
+      stderr: "piped",
+    }).spawn();
+
+    const body = new Uint8Array(await new Response(tar.stdout).arrayBuffer());
+    const status = await tar.status;
+    if (!status.success) {
</file context>
Suggested change
const body = new Uint8Array(await new Response(tar.stdout).arrayBuffer());
const status = await tar.status;
if (!status.success) {
const stderr = await new Response(tar.stderr).text().catch(() => "");
const stderrPromise = new Response(tar.stderr).text().catch(() => "");
const body = new Uint8Array(await new Response(tar.stdout).arrayBuffer());
const status = await tar.status;
const stderr = await stderrPromise;
if (!status.success) {

Comment thread daemon/sandboxCache.ts
const stream =
(body as { transformToWebStream: () => ReadableStream<Uint8Array> })
.transformToWebStream();
await stream.pipeTo(tar.stdin);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Potential child-process deadlock: stderr is set to "piped" but is only consumed after await tar.status. If tar writes more than the OS pipe buffer capacity (~64KB) to stderr, it will block waiting for a reader, and tar.status will never resolve. Start draining stderr concurrently before awaiting status:

const stderrText = new Response(tar.stderr).text().catch(() => "");
await stream.pipeTo(tar.stdin);
const status = await tar.status;
// then: const stderr = await stderrText;
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At daemon/sandboxCache.ts, line 105:

<comment>Potential child-process deadlock: `stderr` is set to `"piped"` but is only consumed after `await tar.status`. If tar writes more than the OS pipe buffer capacity (~64KB) to stderr, it will block waiting for a reader, and `tar.status` will never resolve. Start draining stderr concurrently before awaiting status:
```ts
const stderrText = new Response(tar.stderr).text().catch(() => "");
await stream.pipeTo(tar.stdin);
const status = await tar.status;
// then: const stderr = await stderrText;
```</comment>

<file context>
@@ -0,0 +1,198 @@
+    const stream =
+      (body as { transformToWebStream: () => ReadableStream<Uint8Array> })
+        .transformToWebStream();
+    await stream.pipeTo(tar.stdin);
+    const status = await tar.status;
+
</file context>

Comment thread daemon/main.ts
// because we want it done before the worker subprocess starts
// requesting modules.
const cacheCfg = cacheConfigFromEnvs(envs);
await restoreCache(cacheCfg).catch(() => {/* never throws */});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: restoreCache is awaited on the deploy critical path but has no timeout. A stalled S3 stream or hung tar process will block deploy startup indefinitely instead of falling back to the cold-start path. Add an AbortSignal.timeout() to the S3 request and a timeout/race around stream.pipeTo(tar.stdin) to ensure the deploy proceeds within a bounded time.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At daemon/main.ts, line 587:

<comment>`restoreCache` is awaited on the deploy critical path but has no timeout. A stalled S3 stream or hung `tar` process will block deploy startup indefinitely instead of falling back to the cold-start path. Add an `AbortSignal.timeout()` to the S3 request and a timeout/race around `stream.pipeTo(tar.stdin)` to ensure the deploy proceeds within a bounded time.</comment>

<file context>
@@ -570,6 +575,17 @@ if (SANDBOX_MODE) {
+      // because we want it done before the worker subprocess starts
+      // requesting modules.
+      const cacheCfg = cacheConfigFromEnvs(envs);
+      await restoreCache(cacheCfg).catch(() => {/* never throws */});
+
       // Use run command from deploy request.
</file context>

Comment thread daemon/sandboxCache.ts
Comment on lines +154 to +156
if (!stat?.isDirectory) {
return { ok: false, ms: 0, reason: "no deno-dir" };
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: The empty-cache guard is incomplete: this only checks that DENO_DIR exists, so an empty directory can still be snapshotted and overwrite a valid cache in S3.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At daemon/sandboxCache.ts, line 154:

<comment>The empty-cache guard is incomplete: this only checks that `DENO_DIR` exists, so an empty directory can still be snapshotted and overwrite a valid cache in S3.</comment>

<file context>
@@ -0,0 +1,198 @@
+  try {
+    // Skip if /deno-dir doesn't exist or is empty
+    const stat = await Deno.stat(DENO_DIR).catch(() => null);
+    if (!stat?.isDirectory) {
+      return { ok: false, ms: 0, reason: "no deno-dir" };
+    }
</file context>
Suggested change
if (!stat?.isDirectory) {
return { ok: false, ms: 0, reason: "no deno-dir" };
}
if (!stat?.isDirectory) {
return { ok: false, ms: 0, reason: "no deno-dir" };
}
let hasEntries = false;
for await (const _entry of Deno.readDir(DENO_DIR)) {
hasEntries = true;
break;
}
if (!hasEntries) {
return { ok: false, ms: 0, reason: "empty deno-dir" };
}

Comment thread daemon/main.ts
// Detached: never blocks the deploy response. Failures are
// logged and swallowed inside snapshotCache().
currentSite.gitReady
.then(() => snapshotCache(cacheCfg))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: snapshotCache is scheduled on gitReady, which resolves right after git clone (before startup/cache resolution). This can upload a premature cache snapshot that misses dependencies needed by the running site.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At daemon/main.ts, line 628:

<comment>`snapshotCache` is scheduled on `gitReady`, which resolves right after git clone (before startup/cache resolution). This can upload a premature cache snapshot that misses dependencies needed by the running site.</comment>

<file context>
@@ -604,6 +620,14 @@ if (SANDBOX_MODE) {
+      // Detached: never blocks the deploy response. Failures are
+      // logged and swallowed inside snapshotCache().
+      currentSite.gitReady
+        .then(() => snapshotCache(cacheCfg))
+        .catch(() => {/* never throws */});
+
</file context>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant