docs: add release-caused-outage runbook + orchestration skill #2462

Open: ramboz wants to merge 4 commits into main from docs/release-outage-runbook

@ramboz ramboz commented May 21, 2026

Summary

Adds an operations runbook and a companion Claude Code skill for responding to a
release-caused outage of the spacecat-api-service Lambda — the scenario where a
deploy makes every invocation fail at startup (e.g. TypeError: main2 is not a function).

Two ways to run the same playbook:

  • Manual: docs/runbooks/release-caused-outage.md:
    a step-by-step runbook (triage → correlate the release → verify locally →
    revert → confirm recovery → follow-up) with copy-pasteable commands.
  • LLM-led: .claude/skills/release-outage-response/: a skill that drives an
    agent through the same phases, enforcing diagnose-before-revert discipline and a
    hard human-confirmation gate before any revert/push to prod main.
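
The hard gate in the second bullet can be pictured as a guard that refuses to continue unless the exact commit being reverted was confirmed by a human. This is a minimal sketch, not the skill's implementation: `assertRevertApproved`, `planRevert`, and the `approvals` set are hypothetical names introduced only for illustration.

```javascript
// Hypothetical guard: every name here is illustrative, not taken from SKILL.md.
// A revert may proceed only if a human confirmed this exact commit SHA.
function assertRevertApproved(commitSha, approvals) {
  if (!(approvals instanceof Set) || !approvals.has(commitSha)) {
    throw new Error(`Revert of ${commitSha} not confirmed by a human; stopping.`);
  }
  return true;
}

// Build the list of mutating commands. The gate runs first, so nothing that
// touches main is even listed without an explicit per-commit approval.
function planRevert(commitSha, approvals) {
  assertRevertApproved(commitSha, approvals);
  return [`git revert --no-edit ${commitSha}`, 'git push origin main'];
}
```

Keying the approval on the commit SHA, rather than on a single boolean, is one way to express "re-confirm per commit": an approval given for one revert cannot be reused for another.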

What's included

  • docs/runbooks/release-caused-outage.md — the runbook, with a deep-dive appendix
    on the main is not a function bundle-failure signature: a recurring
    helix-deploy gotcha where a file read from disk at module load isn't shipped in
    the bundle (not declared in hlx.static), so the main export never initializes.
  • .claude/skills/release-outage-response/SKILL.md — the orchestration skill. It
    delegates the actual commands to the runbook and adds the sequencing + safety
    gates.
  • README.md — new "Operations & Runbooks" section linking to the runbook.

Motivation

Distilled from the 2026-05-21 v1.508.0 incident: commit 882dbab (#2456) read a
JSON asset at module load that wasn't included in the Lambda bundle, so every
invocation failed with main2 is not a function. Reverted in 873eddb. The runbook
captures how that was diagnosed — and how to confirm a release is the cause before
reverting — so the next responder doesn't start from scratch.

Notes

  • Docs + skill only; no runtime code changed (docs:, so no version bump).
  • The recommended durable fix — a CI bundle-smoke test asserting
    typeof main === 'function' on the built artifact — is called out as follow-up,
    not included here.
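
For orientation, the follow-up smoke test mentioned above could look roughly like this. It is a sketch under assumptions: `assertMainIsCallable` is a hypothetical helper, and the path to the built artifact is deliberately left out because it depends on the build layout.

```javascript
// Sketch of a bundle smoke check. `moduleNamespace` is whatever a dynamic
// import() of the built artifact returns; loading the artifact is left to
// the caller because its path is build-specific.
function assertMainIsCallable(moduleNamespace) {
  const kind = typeof moduleNamespace?.main;
  if (kind !== 'function') {
    throw new TypeError(`bundle smoke check failed: main is ${kind}, expected function`);
  }
  return true;
}
```

In CI this would be called on the result of importing the bundle, so that a module whose `main` export never initialized fails the job instead of failing in production.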

Test plan

@github-actions

This PR will trigger no release when merged.

codecov Bot commented May 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@tkotthakota-adobe
Contributor

Hi @ramboz Thanks for the PR. Conditionally approving it with below comments.

Overall

Good addition — capturing the 2026-05-21 v1.508.0 incident as both a runbook and a reusable skill is exactly the right move. The "diagnose before revert" framing is the most valuable part. A few issues need fixing before merge.


SKILL.md — Defects

  1. Malformed frontmatter title. The title renders as ---se-Caused Outage Response — looks like a paste/copy artifact where the leading Relea was dropped. Should be Release-Caused Outage Response.
  2. Truncated phase headers. Several phase names appear cut off: Ph. (instead of Phase N: ...), eit. (likely Identify or similar), recovech (probably Recovery Check). These need to be the full readable strings - the skill file is loaded verbatim by the runtime.
  3. Hard confirmation gate placement. The human-confirmation gate before revert/push is correct in principle, but verify it appears before the git commands, not after. If it's at the end of the revert phase rather than the start, an agent could execute the push before the user confirms.

runbook/release-caused-outage.md — Substantive Gaps

  1. No post-incident comms step. The runbook ends at "confirm recovery" with no mention of: notifying the incident channel, updating the status page, or drafting the post-mortem ticket. Add a Follow-Up section that at minimum links to the post-mortem template and names the communication channel.
  2. CI smoke test uses a hardcoded version. The snippet that validates the deployed version pins to a literal version string (e.g., v1.508.0). That will be wrong on the next incident. Use a parameter or derive it from the release tag dynamically.
  3. npm run test:bundle vs npm run build inconsistency. The Appendix B verification steps reference npm run test:bundle in one place and npm run build in another for the same local bundle verification. Pick one and use it consistently — test:bundle is more precise if it exists.
  4. Appendix A: three fix options need a recommended path. The appendix lists three ways to fix main is not a function (move the read inside the handler, add to hlx.static, use --test-bundle in CI) but doesn't state which is preferred. Given that hlx.static is the allow-list pattern the runtime already uses for non-JS assets, that should be called out as the canonical fix. Readers landing on this under incident pressure shouldn't have to evaluate tradeoffs.
  5. hlx.static explanation is incomplete. Appendix B describes that the file must be in the hlx.static allow-list but doesn't show what the config entry looks like. Add a concrete one-liner example (e.g., "hlx.static": ["locations.json"] in helix-deploy.yaml or wherever it lives) so engineers don't have to grep for an example while an outage is active.

README.md

Looks fine. One suggestion: link directly to the specific runbook file rather than just the docs/runbooks/ directory — deep links are more useful from a README.


Minor

  • Runbook intro says "This runbook covers releases to the API service" — if it's intended to generalize to other services (audit-worker, etc.), broaden the scope statement or create a sibling runbook per service. As written it reads API-service-specific.
  • Recommend adding a "Last validated" date or version anchor at the top of the runbook so readers know when it was last exercised against a real incident.

@ramboz ramboz temporarily deployed to dev-branches May 21, 2026 22:29 — with GitHub Actions Inactive

ramboz commented May 21, 2026

This is detailed feedback, and a chunk of it describes things that don't match what I actually wrote — "malformed title," "truncated headers" like Ph./eit./recovech, a README that links to a directory, and Appendix A "not stating a preferred fix" (I did). Those smell like a corrupted/truncated rendering on the reviewer's side, not real defects. But I shouldn't trust my memory either — let me read the actual files on disk and verify each claim before responding.

The on-disk SKILL.md is clean: title is # Release-Caused Outage Response (line 16), all phase headers are full ("Phase 1 — Confirm and frame the outage" … "Phase 6 — Confirm recovery" … "Phase 7 — Follow-up"), and the confirmation gate (Phase 4) is correctly before the revert/push (Phase 5, "only after Phase 4 approval"). So the first three "defects" are phantom — a corrupted rendering on the reviewer's side.

Now let me verify the runbook claims and check whether test:bundle actually exists.

The on-disk runbook is also clean. Now let me verify two things the feedback hinges on: the README link target, and whether npm run test:bundle actually exists (and what it does vs build).

I read both files on disk and checked every claim against the branch. Useful review overall — and the "diagnose before revert" appreciation is the right read of the intent. But a large share of the specific defects describe content that isn't in the files. The ---se-Caused, Ph., eit., recovech strings are the tell: those are left-truncated lines (# Relea|se-Caused, Pha|se, Confirm re|cove|ry), which means your reviewer almost certainly looked at a corrupted/truncated rendering, not the actual file. Here's the split.

A. Phantom — don't match what's on disk (verified, with line refs)

  1. Malformed title: SKILL.md:16 is # Release-Caused Outage Response. Frontmatter (lines 1–14) has only name + description, no title field. ---se-Caused = the --- delimiter + a truncated heading.
  2. Truncated phase headers — all seven are full strings: SKILL.md:41,49,60,75,90,101,109 → "Phase 1 — Confirm and frame the outage" … "Phase 6 — Confirm recovery" … "Phase 7 — Follow-up". Nothing reads Ph./recovech on disk.
  3. Gate placement — actually correct. Phase 4 is the gate (SKILL.md:75), Phase 5 is the revert (:90, titled "only after Phase 4 approval"), and the skill delegates git to the runbook — the only git before the gate is read-only fetch/log (:51). Good instinct to check, though.
  4. README links to a directory: it deep-links to the file. README.md:280 points at docs/runbooks/release-caused-outage.md.
  5. Appendix A "doesn't state preferred fix": runbook:300 literally says "prefer the first — it removes the runtime FS read entirely." Also, the three options the reviewer listed ("move read inside handler / hlx.static / --test-bundle in CI") aren't the three in the doc (inline-JS / hlx.static / JSON import attribute); they conflated Appendix A with the CI material, another garbled-read signal.
  6. test:bundle vs build "inconsistency" — the runbook only ever uses npm run build; test:bundle appears nowhere in it. (But there's a real improvement hiding here — see B2.)
  7. Intro quote — "This runbook covers releases to the API service" isn't in the file. The intro (runbook:3) is "How to respond when spacecat-api-service starts failing broadly…".

So before anything else: ask them to re-pull docs/release-outage-runbook (or view on GitHub) — I think they reviewed a bad copy.

B. Genuinely valid — worth doing

  1. No comms / post-mortem step. Real gap. Step 7 has the fix-ticket + regression guard but nothing on notifying the incident channel or filing the post-mortem. I'll add it — but I don't know your team's actual incident channel / status-page / post-mortem-template, so I'd add a clearly-labeled placeholder rather than invent specifics. Tell me those and I'll wire them in.
  2. CI smoke test versioning — and the repo makes this better than the reviewer realized: package.json already has test:bundle, which derives the version dynamically (npm pkg get version) and runs test/index.test.js against the bundle. So Step 7 shouldn't hand-roll an import() snippet (mine used a <version> placeholder, not a literal v1.508.0 as claimed) — it should point at the existing npm run test:bundle, recommend making it a blocking CI gate, and ensure that test asserts main is callable. Uses existing infra and removes the version question entirely.
  3. Concrete hlx.static example — cheap, useful. One correction to fold in: the reviewer's "helix-deploy.yaml" is wrong — it lives in package.json under the hlx key. I'll show the real snippet.
  4. "Last validated" anchor — good, cheap. I'll add Last validated: 2026-05-21 (v1.508.0 incident) at the top.
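
To illustrate the point in B2 about deriving the version instead of pinning it, here is a rough sketch. It assumes the output of `npm pkg get version` (which npm prints JSON-encoded, quotes included) has already been captured as a string. `artifactPathFor` and its `{version}` template are hypothetical, since the real artifact layout is not described in this thread.

```javascript
// `npm pkg get version` prints the value JSON-encoded, e.g. "1.508.0" with
// the quotes, so it has to be parsed rather than used verbatim.
function parseNpmPkgVersion(stdout) {
  const version = JSON.parse(stdout.trim());
  if (typeof version !== 'string' || version.length === 0) {
    throw new Error('could not read a version string from npm output');
  }
  return version;
}

// Hypothetical helper: fill a caller-supplied path template. The template
// format is an assumption; the real artifact layout is build-specific.
function artifactPathFor(template, version) {
  return template.replace(/\{version\}/g, version);
}
```

Because the version comes from package.json at run time, the check stays correct on the next release without anyone editing a literal.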

C. Judgment call — I lean the other way

Canonical fix (Appendix A). I recommended inlining the data as a JS module (eliminates the FS read → can't recur); the reviewer wants hlx.static as canonical (matches the existing allow-list pattern). My reasoning: this is the deliberate re-land fix, not the incident-time mitigation — the mitigation is the revert, already covered — so "no time to weigh tradeoffs under pressure" doesn't apply here. For a considered re-land, removing the failure mode beats patching around it; hlx.static keeps the fragile readFileSync and the next data file hits the same trap. I'd keep inline-JS as the recommendation but make it louder and add a one-line "why not just hlx.static." Your call as PR author, though — happy to flip it if you prefer pattern-consistency.

D. Minor pushback

Scope. The runbook is intentionally api-service-specific (hard-coded lambda name, hlx.static, npm scripts). That's correct, not a defect — and the sentence that triggered the concern doesn't exist. If other services need this, sibling per-service runbooks beat one vague doc.


With that in mind, I'll address B1-B4, and skip A, C & D.

@ramboz ramboz temporarily deployed to dev-branches May 21, 2026 22:58 — with GitHub Actions Inactive
@solaris007 solaris007 requested a review from MysticatBot May 22, 2026 03:44
@MysticatBot MysticatBot added the ai-reviewed Reviewed by AI label May 22, 2026

@MysticatBot MysticatBot left a comment

Now I'll produce the consolidated review. Both agents recommend approval with no blocking issues.

Hey @ramboz,

Strengths

  • Diagnosis-before-action discipline is the right architectural choice (SKILL.md:28-32, runbook Step 3). The explicit "verify before reverting" workflow with concrete evidence criteria prevents the most common incident-response anti-pattern: speculative reverts that waste the one cheap mitigation and obscure the real cause.

  • Clean separation of orchestration from procedure (SKILL.md:20-23). The skill defines sequence and decision gates; the runbook holds the concrete commands. This layers correctly - the skill can evolve with new tooling without rewriting procedural steps, and the runbook remains usable by a human without the LLM orchestrator.

  • Worked-example anchoring grounds the theory (runbook:9-14, Appendix A). Tying every step to the actual 2026-05-21 incident with specific SHAs, timestamps, and error messages makes this immediately actionable. The main is not a function appendix captures failure-mode reasoning that would otherwise be lost to Slack threads.

  • Timeline caveat is operationally valuable (runbook:92-105). Explicitly calling out that the release-commit timestamp is not the deploy timestamp, with the aws lambda get-function verification, prevents a subtle correlation error during incident response.

  • Hard gate design is correct (SKILL.md:72-87). Requiring explicit re-confirmation per commit (not a blanket prior approval) is the right safety boundary for an LLM-driven workflow touching shared production state.

  • The "gotchas" section is honest operational documentation (runbook Step 5). Documenting node version, pre-commit hook deps, and branch protection behavior under stress is the kind of detail that only gets written by someone who actually hit these issues.
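
The timeline caveat above amounts to comparing the function's actual last-modified time with the first observed error, rather than trusting the release-commit timestamp. A minimal sketch follows, assuming the `LastModified` value from `aws lambda get-function` has the documented `+0000` offset form. The helper names, the 30-minute window, and all timestamps are illustrative.

```javascript
// Lambda reports LastModified like "2026-05-21T21:10:03.000+0000"; insert the
// colon so Date.parse sees a strict ISO-8601 offset.
function parseLambdaTimestamp(lastModified) {
  return Date.parse(lastModified.replace(/([+-]\d{2})(\d{2})$/, '$1:$2'));
}

// True when the first error falls within `windowMinutes` after the deploy.
// The default window is an arbitrary illustrative choice.
function deployPrecedesErrors(lastModified, firstErrorIso, windowMinutes = 30) {
  const deltaMs = Date.parse(firstErrorIso) - parseLambdaTimestamp(lastModified);
  return deltaMs >= 0 && deltaMs <= windowMinutes * 60_000;
}
```

A `false` result does not clear the release on its own; it only says the deploy time and the error onset do not line up within the chosen window.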

Issues

Minor (Nice to Have)

  1. TODO placeholders reduce operational readiness - docs/runbooks/release-caused-outage.md:194,235-238: Steps 5 and 7 contain _(TODO: name it)_, _(TODO: SLA)_, _(TODO: link)_ for the incident channel, post-mortem template, and status page. Acceptable for a first merge (acknowledged in discussion), but if these linger, the runbook becomes a "read and interpret" document rather than "follow and execute" during a real incident. Recommend filing a follow-up ticket to resolve within a sprint.

Recommendations

  • Consider adding a "Prerequisites" section to the runbook listing required access (AWS credentials with Lambda read, admin push rights on main, nvm with node 24 installed). During an incident, discovering you lack permissions at Step 5 is costly. A small addition that improves time-to-mitigate for responders unfamiliar with this flow.

  • Track the TODOs explicitly. A single follow-up ticket with a reasonable deadline prevents the placeholders from becoming permanent gaps.

Assessment

Ready to merge? Yes

The documents are well-structured, operationally grounded, and encode the correct incident-response discipline. The architecture decision to separate LLM orchestration from human-readable procedure is sound and will age well. The existing review discussion has been addressed (Last validated anchor, comms/post-mortem TODOs, test:bundle reference, hlx.static example all present in the current diff). No runtime code was changed, CI passes, and the TODO placeholders are appropriate acknowledged debt for a first version.


Skill: pr-review | Model: us.anthropic.claude-opus-4-6-v1[1m] | Duration: 0m 43s | Cost: $3.04 | Commit: 39497452c7a9e5b95294beb805fc68fd2bbfb7c2
If this code review was useful, please react with 👍. Otherwise, react with 👎.

rainer-friederich added a commit that referenced this pull request May 22, 2026
…CLAUDE.md guidance (#2466)

## Re-land of [#2456](#2456) with bundling fix + new pre-merge gate + codified guidance

Tracks [SITES-45260](https://jira.corp.adobe.com/browse/SITES-45260).
The semrush AIO proxy merged as commit `882dbab5` (v1.508.0) on
2026-05-21 and took the api-service Lambda down — every invocation
failed with `TypeError: main2 is not a function at lambdaAdapter`.
Reverted at `873eddb9`. This PR brings the feature back with three
layered defences against the same class of failure recurring.

## Root cause (recap)

`src/support/semrush/handlers/projects.js` was reading
`data/locations.json` at module load via
`readFileSync(import.meta.url)`. The JSON file was not in `package.json`
`hlx.static`, so `helix-deploy` never copied it into the Lambda zip.
Cold start hit `ENOENT … dist/data/locations.json`, the module's export
went undefined, and the deploy wrapper's reference (`main2`) crashed on
every invocation.

Tests passed because they import from source, where `import.meta.url`
resolves to the original source path and the JSON lives alongside the
handler. The failure only manifests in the bundled artifact.
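
The source-versus-bundle difference can be reproduced with plain URL resolution. This sketch assumes the handler built its path relative to `import.meta.url`; the exact expression is not shown in this message. Both module URLs, and in particular the bundle location, are hypothetical placeholders.

```javascript
// Resolve a sibling asset the way `new URL(rel, import.meta.url)` would.
function resolveSiblingAsset(moduleUrl, relativePath) {
  return new URL(relativePath, moduleUrl).pathname;
}

// From source, the module URL is the handler file, so the JSON is found.
const fromSource = resolveSiblingAsset(
  'file:///repo/src/support/semrush/handlers/projects.js',
  '../data/locations.json',
);

// In a bundle, every module shares the bundle file's URL (hypothetical path
// here), so the same relative path lands somewhere the asset was never copied.
const fromBundle = resolveSiblingAsset(
  'file:///var/task/dist/src/index.js',
  '../data/locations.json',
);

console.log(fromSource); // /repo/src/support/semrush/data/locations.json
console.log(fromBundle); // /var/task/dist/data/locations.json
```

The relative path is identical in both calls; only the base URL changes, which is why unit tests that import from source cannot see the problem.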

## What changes

### 1. Bundling fix (`a4b198fa`)

`src/support/semrush/data/locations.json` ->
`src/support/semrush/data/locations.js` (`export const LOCATIONS = { ...
}`). `handlers/projects.js` imports it via normal ESM resolution. The
bundler picks it up automatically — no `hlx.static` registry to
maintain, no `import.meta.url` path arithmetic, no `fs`/`path`/`url`
imports in the handler.

This is fix option 1 from the SITES-45260 ticket (the recommended one).
Diff is a clean swap; data values are unchanged.
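
For anyone applying the same swap to another data file, the conversion can be scripted. The following is a sketch only: `renderDataModule` is a hypothetical helper, the sample data is made up, and this is not necessarily how `locations.js` was produced.

```javascript
// Render a plain data object as the source text of an ES module that
// exports it under `exportName`.
function renderDataModule(exportName, data) {
  if (!/^[A-Za-z_$][\w$]*$/.test(exportName)) {
    throw new Error(`invalid export name: ${exportName}`);
  }
  return `export const ${exportName} = ${JSON.stringify(data, null, 2)};\n`;
}

// Made-up sample data, purely to show the output shape.
const source = renderDataModule('LOCATIONS', { us: 'United States' });
console.log(source);
```

Since JSON is valid JavaScript object-literal syntax, the generated file needs no further transformation before the bundler can pick it up through a normal import.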

### 2. Pre-merge CI gate (`f1cde0db`)

The mysticat-ci reusable workflow runs lint + test + coverage only — it
**never** executes `npm run build` (`hedy -v --test-bundle`). That's the
gap that let the broken bundle through.

Add a repo-local `bundle-build` job to `.github/workflows/ci.yaml` that
runs `npm run build` on every push / PR. `hedy --test-bundle` bundles,
zips, **and invokes `lambda()` against a synthetic healthcheck event**,
exiting non-zero on any non-2xx response. Verified locally:

- Healthy bundle (this PR's tip): exit 0, healthcheck returns 200
- Broken bundle (reproduction with `readFileSync` restored): exit 1,
"Validation failed: 500" with `x-error: ENOENT … data/locations.json`

This is the load-bearing catch for SITES-45260's class of failure. It
addresses acceptance criteria 2 + 4 in one step — a separate `typeof
main === 'function'` smoke check would have been a strict subset of what
`hedy --test-bundle` already does, so it was dropped.
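
Conceptually, the gate behaves like the sketch below. This is not hedy's code: `smokeTestBundle`, `exitCodeForStatus`, and the event shape are hypothetical, and the handler is assumed to resolve to an object with a numeric `status`.

```javascript
// Map an HTTP status to a process exit code: 2xx passes, anything else fails.
function exitCodeForStatus(status) {
  return status >= 200 && status < 300 ? 0 : 1;
}

// Invoke the bundled entry point once with a synthetic event (hypothetical
// shape) and translate the outcome into an exit code.
async function smokeTestBundle(handler, event = { path: '/healthcheck' }) {
  try {
    const response = await handler(event);
    return exitCodeForStatus(response.status);
  } catch {
    // A throw at invocation time, such as "main2 is not a function",
    // fails the gate just like a non-2xx response does.
    return 1;
  }
}
```

Invoking the handler, rather than only checking its type, is what makes the separate `typeof` check redundant: an entry point that is not callable throws on the call and lands in the `catch` branch.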

Until adobe/mysticat-ci's reusable build action grows the equivalent
step, this repo-local job is the gate. Worth lifting upstream in a
follow-up so every spacecat service inherits it.

### 3. Lambda bundle constraints in CLAUDE.md (`040ed590`)

The CI gate catches the failure, but the right place to prevent the next
attempt is in the guidance an AI agent (or new contributor) reads first.
New `## Lambda Bundle Constraints` section in `CLAUDE.md`, placed right
after the build/deploy commands so it's hard to miss when working on
anything that touches `src/`. Three short rules:

1. **Do NOT** use `readFileSync(import.meta.url, ...)` or any
sibling-file reads at module load — the bundled artifact does not
preserve source-relative paths.
2. **Prefer JS-module imports for static data** — inline JSON / locale
data / lookup tables as `export const FOO = { ... }` in a `.js` file.
See `src/support/semrush/data/locations.js`.
3. **If a non-JS asset is unavoidable**, declare its repo-relative path
in `package.json` `hlx.static` AND read it from the Lambda task root,
not from `import.meta.url`.

Plus a pointer to the `bundle-build` CI gate and the SITES-45260
post-mortem.
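
Rule 3 lends itself to a mechanical check. A minimal sketch, assuming `hlx.static` in `package.json` is an array of repo-relative paths as described above: `findUndeclaredAssets` is a hypothetical helper, and the list of assets read at runtime has to be supplied by the caller.

```javascript
// `pkg` is a parsed package.json. Returns every runtime asset that is not
// declared under `hlx.static` and therefore would not ship with the bundle.
function findUndeclaredAssets(pkg, assetsReadAtRuntime) {
  const declared = new Set(pkg?.hlx?.static ?? []);
  return assetsReadAtRuntime.filter((asset) => !declared.has(asset));
}

// Illustrative input: one asset declared, one forgotten.
const missing = findUndeclaredAssets(
  { hlx: { static: ['static/index.html'] } },
  ['static/index.html', 'src/support/semrush/data/locations.json'],
);
console.log(missing); // [ 'src/support/semrush/data/locations.json' ]
```

An empty result only shows that the declared list and the supplied list agree; it does not prove the code reads those files from the right location at runtime.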

## Verification

- Local `npm run build` -> exit 0, healthcheck OK
- Local reproduction with the `readFileSync` re-introduced -> exit 1
(confirmed gate works)
- 178 semrush / contract / workspace-resolver tests passing
- `git diff e3618d9..HEAD --stat` is small — only the four files this
PR is meant to touch (locations.js, projects.js, ci.yaml, CLAUDE.md).
The reapply commit `e3618d92` is exactly the diff of the original
`882dbab5`.

## Cross-repo notes

- `mysticat-data-service` migration
`20260528000000_brand_to_semrush_projects.sql`: already merged +
deployed pre-revert. Schema is in place.
- `@adobe/spacecat-shared-data-access` `3.68.0`: shipped, pinned by this
PR.
- Operator runbook for incident mitigation:
[#2462](#2462)
(ramboz).
- Follow-up:
[#2467](#2467)
makes `SEMRUSH_PROJECTS_BASE_URL` strict / Vault-backed (no source
default).

## Follow-ups (out of scope)

- Lift the `bundle-build` job into `adobe/mysticat-ci`'s reusable
workflow so every spacecat/mysticat service inherits the gate.
- Phase 7 (dev smoke tests against
`https://spacecat.experiencecloud.live/api/ci/`) — runs after this lands
and deploys.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Alicia Adriani <aadriani+adobe@adobe.com>
@ramboz ramboz temporarily deployed to dev-branches May 22, 2026 15:34 — with GitHub Actions Inactive
@solaris007
Member

@ramboz Should this go to experience-success-skills instead, so that it is easily applicable to all the Lambdas we have? It is automatically available to workspace users.

ramboz commented May 22, 2026

@solaris007 Not sure. Filed it here since this was the scope of the CSO I worked on. If you prefer having that somewhere else, I'm fine with that, but we need to make sure it's easily discoverable by whoever is handling the CSO at that time. At the very least, I would then make sure we have a pointer to the workspace from this repo
