Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 24 additions & 0 deletions daemon/main.ts
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,11 @@ import {
import { downloadCache } from "./cache.ts";
import { createAIHandlers } from "./ai/handlers.ts";
import { createSandboxHandlers, type DeployParams } from "./sandbox.ts";
import {
cacheConfigFromEnvs,
restoreCache,
snapshotCache,
} from "./sandboxCache.ts";
import { register, type TunnelConnection } from "./tunnel.ts";
import {
createWorker,
Expand Down Expand Up @@ -570,6 +575,17 @@ if (SANDBOX_MODE) {
// Also update the module-level variable so getSiteName() returns the value
setSiteName(site);

// Best-effort cache restore — populates /deno-dir from S3 so the
// upcoming deno cache resolution is mostly a no-op. Opt-in via
// DECO_CACHE_S3_BUCKET in the deploy body envs (admin forwards it
// there, NOT via SandboxClaim.spec.env: the operator rejects
// per-claim env when warm pool is in use).
// Any failure logs and continues with the cold-start path. Awaited
// because we want it done before the worker subprocess starts
// requesting modules.
const cacheCfg = cacheConfigFromEnvs(envs);
await restoreCache(cacheCfg).catch(() => {/* never throws */});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: restoreCache is awaited on the deploy critical path but has no timeout. A stalled S3 stream or hung tar process will block deploy startup indefinitely instead of falling back to the cold-start path. Add an AbortSignal.timeout() to the S3 request and a timeout/race around stream.pipeTo(tar.stdin) to ensure the deploy proceeds within a bounded time.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At daemon/main.ts, line 587:

<comment>`restoreCache` is awaited on the deploy critical path but has no timeout. A stalled S3 stream or hung `tar` process will block deploy startup indefinitely instead of falling back to the cold-start path. Add an `AbortSignal.timeout()` to the S3 request and a timeout/race around `stream.pipeTo(tar.stdin)` to ensure the deploy proceeds within a bounded time.</comment>

<file context>
@@ -570,6 +575,17 @@ if (SANDBOX_MODE) {
+      // because we want it done before the worker subprocess starts
+      // requesting modules.
+      const cacheCfg = cacheConfigFromEnvs(envs);
+      await restoreCache(cacheCfg).catch(() => {/* never throws */});
+
       // Use run command from deploy request.
</file context>


// Use run command from deploy request.
// If no runCommand provided, default to Deno runner only if dev.ts exists.
// This prevents a timeout loop when the sandbox is used for non-Deco repos
Expand Down Expand Up @@ -604,6 +620,14 @@ if (SANDBOX_MODE) {
branch,
});

// Best-effort cache snapshot — fires after the worker is up so
// /deno-dir reflects every module the site actually needed.
// Detached: never blocks the deploy response. Failures are
// logged and swallowed inside snapshotCache().
currentSite.gitReady
.then(() => snapshotCache(cacheCfg))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: snapshotCache is scheduled on gitReady, which resolves right after git clone (before startup/cache resolution). This can upload a premature cache snapshot that misses dependencies needed by the running site.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At daemon/main.ts, line 628:

<comment>`snapshotCache` is scheduled on `gitReady`, which resolves right after git clone (before startup/cache resolution). This can upload a premature cache snapshot that misses dependencies needed by the running site.</comment>

<file context>
@@ -604,6 +620,14 @@ if (SANDBOX_MODE) {
+      // Detached: never blocks the deploy response. Failures are
+      // logged and swallowed inside snapshotCache().
+      currentSite.gitReady
+        .then(() => snapshotCache(cacheCfg))
+        .catch(() => {/* never throws */});
+
</file context>

.catch(() => {/* never throws */});
Comment on lines +623 to +629

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Snapshot trigger is too early; it can upload a cold/incomplete /deno-dir.

This hook runs after gitReady, which resolves before manifest/block generation and before runtime dependency warmup. You can end up overwriting S3 with a mostly empty cache on cold deploys.

Please trigger snapshot from a “deps fully initialized” signal (or after first successful worker/module resolution), not from gitReady.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@daemon/main.ts` around lines 623 - 629, The snapshot is being triggered too
early from currentSite.gitReady which can cause uploading a cold/incomplete
/deno-dir; move the call to snapshotCache(cacheCfg) off the gitReady promise and
instead invoke it only after runtime dependencies are fully initialized (e.g.,
after the first successful worker/module resolution or a deps-ready signal).
Concretely, remove or stop using currentSite.gitReady.then(...snapshotCache...)
and wire snapshotCache(cacheCfg) to the lifecycle event that represents "deps
fully initialized" (add or use an existing currentSite.depsReady,
worker.initComplete, or the first successful module resolution hook) so the
snapshot runs only once dependencies and manifest/block generation have
completed; keep snapshotCache detached and swallow errors as before.


// Always create AI handlers — OAuth can be used when no API key is set
aiHandlers = createAIHandlers({
cwd: Deno.cwd(),
Expand Down
198 changes: 198 additions & 0 deletions daemon/sandboxCache.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,198 @@
/**
* Sandbox runner cache — opt-in S3-backed restore/snapshot for /deno-dir.
*
* Skipping the Deno module download on cold start saves ~5s per wake-up
* on Deco/Storefront sites. The cache is *advisory*: every code path here
* swallows errors and returns a no-op result. A failed restore must let
* the runner fall back to the cold-start git clone + deno cache flow; a
* failed snapshot must not affect the user-visible deploy.
*
* Config arrives via the /sandbox/deploy POST body (envs field) — NOT via
* container env vars. The agent-sandbox operator rejects per-claim env when
* a warm pool is in use, so the admin forwards cache hints through the
* deploy body instead. The runner reads them from the request and passes
* them into restoreCache/snapshotCache as explicit config.
*
* Auth: standard AWS SDK credential chain — Pod Identity in EKS supplies
* credentials transparently via container metadata env vars.
*
* Layout in S3:
* s3://<bucket>/<site>/cache.tgz (tar.gz of /deno-dir)
*/

import {
GetObjectCommand,
PutObjectCommand,
S3Client,
} from "npm:@aws-sdk/client-s3@3.569.0";

const DENO_DIR = Deno.env.get("DENO_DIR") ?? "/deno-dir";

export interface CacheConfig {
bucket?: string;
region?: string;
/** S3 key (path within bucket), e.g. "<site>/cache.tgz" */
key?: string;
}

const isEnabled = (
cfg: CacheConfig,
): cfg is Required<Pick<CacheConfig, "bucket" | "key">> & CacheConfig =>
Boolean(cfg.bucket && cfg.key);

let cachedClient: S3Client | null = null;
let cachedRegion: string | null = null;
const client = (region: string): S3Client => {
if (!cachedClient || cachedRegion !== region) {
cachedClient = new S3Client({ region });
cachedRegion = region;
}
return cachedClient;
};

export interface CacheResult {
ok: boolean;
ms: number;
reason?: string;
}

const log = (msg: string) => console.log(`[sandbox-cache] ${msg}`);
const warn = (msg: string) => console.warn(`[sandbox-cache] ${msg}`);

/**
* Build a CacheConfig from the envs map received via /sandbox/deploy.
* Helper so callers don't have to know the env-var names.
*/
export function cacheConfigFromEnvs(
envs?: Record<string, string>,
): CacheConfig {
return {
bucket: envs?.DECO_CACHE_S3_BUCKET,
region: envs?.DECO_CACHE_S3_REGION ?? "us-west-2",
key: envs?.DECO_CACHE_S3_KEY,
};
}

/**
* Restore the Deno-cache tarball from S3 into /deno-dir.
* Streams S3 body directly into `tar -xzf -` without buffering in Deno.
*/
export async function restoreCache(cfg: CacheConfig): Promise<CacheResult> {
if (!isEnabled(cfg)) return { ok: false, ms: 0, reason: "disabled" };
const { bucket, key, region = "us-west-2" } = cfg;
const start = performance.now();
try {
const resp = await client(region).send(
new GetObjectCommand({ Bucket: bucket, Key: key }),
);
Comment on lines +80 to +87

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add a hard timeout to restore to avoid blocking deploy startup.

restoreCache is awaited in deploy startup, but the S3 fetch/untar path has no timeout. A stalled S3 stream or tar process can hang deploy indefinitely instead of falling back to cold start.

Proposed change
 export async function restoreCache(cfg: CacheConfig): Promise<CacheResult> {
   if (!isEnabled(cfg)) return { ok: false, ms: 0, reason: "disabled" };
   const { bucket, key, region = "us-west-2" } = cfg;
   const start = performance.now();
+  const timeoutMs = 15_000;
   try {
+    const abortSignal = AbortSignal.timeout(timeoutMs);
     const resp = await client(region).send(
       new GetObjectCommand({ Bucket: bucket, Key: key }),
+      { abortSignal },
     );
@@
-    await stream.pipeTo(tar.stdin);
-    const status = await tar.status;
+    await Promise.race([
+      stream.pipeTo(tar.stdin),
+      new Promise((_, reject) =>
+        setTimeout(() => {
+          tar.kill("SIGKILL");
+          reject(new Error(`restore timeout after ${timeoutMs}ms`));
+        }, timeoutMs)
+      ),
+    ]);
+    const status = await tar.status;

Also applies to: 102-107

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@daemon/sandboxCache.ts` around lines 80 - 87, The restoreCache function can
hang because S3 download and untar have no timeout; add a hard timeout (e.g.
using AbortController with setTimeout or Promise.race) around the S3
client(region).send(new GetObjectCommand(...)) and the subsequent untar/stream
processing so the whole restore path aborts after a configured max (e.g. 30s);
when timeout triggers abort the S3 request, kill/cleanup any tar streams, and
return { ok: false, ms: elapsed, reason: "timeout" } from restoreCache (update
the code paths around restoreCache, GetObjectCommand handling and the
untar/pipeline section referenced at lines ~102-107 to respect the abort signal
and perform cleanup).

const body = resp.Body;
if (!body) {
return { ok: false, ms: performance.now() - start, reason: "empty body" };
}

// Stream body → tar. tar reads the whole archive from stdin and unpacks
// into /deno-dir directly. No intermediate file or buffer.
const tar = new Deno.Command("tar", {
args: ["-xzf", "-", "-C", "/"],
stdin: "piped",
stdout: "null",
stderr: "piped",
}).spawn();
Comment on lines +95 to +100

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Drain tar.stderr concurrently to prevent child-process deadlock.

In both restore and snapshot, stderr is piped but only read after waiting for process completion. If stderr output exceeds pipe capacity, the process can block before exit.

Proposed change
     const tar = new Deno.Command("tar", {
@@
       stderr: "piped",
     }).spawn();
+    const stderrText = new Response(tar.stderr).text().catch(() => "");
@@
-    const status = await tar.status;
+    const status = await tar.status;
@@
-      const stderr = await new Response(tar.stderr).text().catch(() => "");
+      const stderr = await stderrText;
     const tar = new Deno.Command("tar", {
@@
       stderr: "piped",
     }).spawn();
+    const stderrText = new Response(tar.stderr).text().catch(() => "");
 
     const body = new Uint8Array(await new Response(tar.stdout).arrayBuffer());
     const status = await tar.status;
     if (!status.success) {
-      const stderr = await new Response(tar.stderr).text().catch(() => "");
+      const stderr = await stderrText;

Also applies to: 106-110, 158-162, 165-168

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@daemon/sandboxCache.ts` around lines 95 - 100, The spawned Deno.Command
instances for "tar" (the variable tar used in restore and snapshot) set stderr
to "piped" but only read it after await tar.status(), which can deadlock if
stderr fills; fix by draining tar.stderr concurrently (e.g., spawn an async task
that reads/consumes tar.stderr.readable or pipes it to /dev/null or
processLogger) immediately after .spawn() so stderr cannot block the child, and
apply the same change to the other tar invocations noted (the Deno.Command calls
around the regions referenced).


const stream =
(body as { transformToWebStream: () => ReadableStream<Uint8Array> })
.transformToWebStream();
await stream.pipeTo(tar.stdin);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Potential child-process deadlock: stderr is set to "piped" but is only consumed after await tar.status. If tar writes more than the OS pipe buffer capacity (~64KB) to stderr, it will block waiting for a reader, and tar.status will never resolve. Start draining stderr concurrently before awaiting status:

const stderrText = new Response(tar.stderr).text().catch(() => "");
await stream.pipeTo(tar.stdin);
const status = await tar.status;
// then: const stderr = await stderrText;
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At daemon/sandboxCache.ts, line 105:

<comment>Potential child-process deadlock: `stderr` is set to `"piped"` but is only consumed after `await tar.status`. If tar writes more than the OS pipe buffer capacity (~64KB) to stderr, it will block waiting for a reader, and `tar.status` will never resolve. Start draining stderr concurrently before awaiting status:
```ts
const stderrText = new Response(tar.stderr).text().catch(() => "");
await stream.pipeTo(tar.stdin);
const status = await tar.status;
// then: const stderr = await stderrText;
```</comment>

<file context>
@@ -0,0 +1,198 @@
+    const stream =
+      (body as { transformToWebStream: () => ReadableStream<Uint8Array> })
+        .transformToWebStream();
+    await stream.pipeTo(tar.stdin);
+    const status = await tar.status;
+
</file context>

const status = await tar.status;

if (!status.success) {
const stderr = await new Response(tar.stderr).text().catch(() => "");
return {
ok: false,
ms: performance.now() - start,
reason: `tar exit ${status.code}: ${stderr.slice(0, 200)}`,
};
}

const ms = performance.now() - start;
log(`restored cache from s3://${bucket}/${key} in ${ms.toFixed(0)}ms`);
return { ok: true, ms };
} catch (err) {
const ms = performance.now() - start;
// NoSuchKey on first wake-up of a site is normal — log at info level.
const isMissing = (err as { name?: string }).name === "NoSuchKey" ||
String(err).includes("NoSuchKey");
const fmt = `${(err as Error).message ?? err}`;
if (isMissing) {
log(
`no cache yet for s3://${bucket}/${key} (cold start, ${
ms.toFixed(0)
}ms)`,
);
} else {
warn(`restore failed (${ms.toFixed(0)}ms): ${fmt}`);
}
return { ok: false, ms, reason: isMissing ? "missing" : fmt };
}
}

/**
* Tar /deno-dir and upload to S3. Best-effort — failures are logged and
* swallowed so a failed snapshot can't break a successful deploy.
*
* Buffers the tarball in memory before upload (one-shot PUT). For a typical
* site with ~200MB compressed cache this is fine; if cache sizes grow we
* should switch to @aws-sdk/lib-storage's streaming Upload.
*/
export async function snapshotCache(cfg: CacheConfig): Promise<CacheResult> {
if (!isEnabled(cfg)) return { ok: false, ms: 0, reason: "disabled" };
const { bucket, key, region = "us-west-2" } = cfg;
const start = performance.now();
try {
// Skip if /deno-dir doesn't exist or is empty
const stat = await Deno.stat(DENO_DIR).catch(() => null);
if (!stat?.isDirectory) {
return { ok: false, ms: 0, reason: "no deno-dir" };
}
Comment on lines +154 to +156

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: The empty-cache guard is incomplete: this only checks that DENO_DIR exists, so an empty directory can still be snapshotted and overwrite a valid cache in S3.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At daemon/sandboxCache.ts, line 154:

<comment>The empty-cache guard is incomplete: this only checks that `DENO_DIR` exists, so an empty directory can still be snapshotted and overwrite a valid cache in S3.</comment>

<file context>
@@ -0,0 +1,198 @@
+  try {
+    // Skip if /deno-dir doesn't exist or is empty
+    const stat = await Deno.stat(DENO_DIR).catch(() => null);
+    if (!stat?.isDirectory) {
+      return { ok: false, ms: 0, reason: "no deno-dir" };
+    }
</file context>
Suggested change
if (!stat?.isDirectory) {
return { ok: false, ms: 0, reason: "no deno-dir" };
}
if (!stat?.isDirectory) {
return { ok: false, ms: 0, reason: "no deno-dir" };
}
let hasEntries = false;
for await (const _entry of Deno.readDir(DENO_DIR)) {
hasEntries = true;
break;
}
if (!hasEntries) {
return { ok: false, ms: 0, reason: "empty deno-dir" };
}


const tar = new Deno.Command("tar", {
args: ["-czf", "-", "-C", "/", DENO_DIR.replace(/^\//, "")],
stdout: "piped",
stderr: "piped",
}).spawn();

const body = new Uint8Array(await new Response(tar.stdout).arrayBuffer());
const status = await tar.status;
if (!status.success) {
const stderr = await new Response(tar.stderr).text().catch(() => "");
Comment on lines +164 to +167

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: stderr is piped but not drained until after await tar.status; this can block the tar process on a full stderr pipe and hang snapshotting.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At daemon/sandboxCache.ts, line 164:

<comment>`stderr` is piped but not drained until after `await tar.status`; this can block the tar process on a full stderr pipe and hang snapshotting.</comment>

<file context>
@@ -0,0 +1,198 @@
+      stderr: "piped",
+    }).spawn();
+
+    const body = new Uint8Array(await new Response(tar.stdout).arrayBuffer());
+    const status = await tar.status;
+    if (!status.success) {
</file context>
Suggested change
const body = new Uint8Array(await new Response(tar.stdout).arrayBuffer());
const status = await tar.status;
if (!status.success) {
const stderr = await new Response(tar.stderr).text().catch(() => "");
const stderrPromise = new Response(tar.stderr).text().catch(() => "");
const body = new Uint8Array(await new Response(tar.stdout).arrayBuffer());
const status = await tar.status;
const stderr = await stderrPromise;
if (!status.success) {

return {
ok: false,
ms: performance.now() - start,
reason: `tar exit ${status.code}: ${stderr.slice(0, 200)}`,
};
}

await client(region).send(
new PutObjectCommand({
Bucket: bucket,
Key: key,
Body: body,
ContentType: "application/gzip",
}),
);

const ms = performance.now() - start;
log(
`snapshotted ${
(body.length / 1024 / 1024).toFixed(1)
}MB → s3://${bucket}/${key} in ${ms.toFixed(0)}ms`,
);
return { ok: true, ms };
} catch (err) {
const ms = performance.now() - start;
warn(
`snapshot failed (${ms.toFixed(0)}ms): ${(err as Error).message ?? err}`,
);
return { ok: false, ms, reason: `${(err as Error).message ?? err}` };
}
}
Loading