docs: add release-caused-outage runbook + orchestration skill#2462
docs: add release-caused-outage runbook + orchestration skill#2462ramboz wants to merge 4 commits into
Conversation
|
This PR will trigger no release when merged. |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
Hi @ramboz Thanks for the PR. Conditionally approving it with below comments. Overall Good addition — capturing the 2026-05-21 v1.508.0 incident as both a runbook and a reusable skill is exactly the right move. The "diagnose before revert" framing is the most valuable part. A few issues need fixing before merge. SKILL.md — Defects
runbook/release-caused-outage.md — Substantive Gaps
README.md Looks fine. One suggestion: link directly to the specific runbook file rather than just the docs/runbooks/ directory — deep links are more useful from a README. Minor
|
|
This is detailed feedback, and a chunk of it describes things that don't match what I actually wrote — "malformed title," "truncated headers" like The on-disk SKILL.md is clean: title is Now let me verify the runbook claims and check whether The on-disk runbook is also clean. Now let me verify two things the feedback hinges on: the README link target, and whether I read both files on disk and checked every claim against the branch. Useful review overall — and the "diagnose before revert" appreciation is the right read of the intent. But a large share of the specific defects describe content that isn't in the files. The A. Phantom — don't match what's on disk (verified, with line refs)
So before anything else: ask them to re-pull B. Genuinely valid — worth doing
C. Judgment call — I lean the other wayCanonical fix (Appendix A). I recommended inlining the data as a JS module (eliminates the FS read → can't recur); the reviewer wants D. Minor pushbackScope. The runbook is intentionally api-service-specific (hard-coded lambda name, With that in mind, I'll address B1-B4, and skip A, C & D. |
MysticatBot
left a comment
There was a problem hiding this comment.
Now I'll produce the consolidated review. Both agents recommend approval with no blocking issues.
Hey @ramboz,
Strengths
-
Diagnosis-before-action discipline is the right architectural choice (SKILL.md:28-32, runbook Step 3). The explicit "verify before reverting" workflow with concrete evidence criteria prevents the most common incident-response anti-pattern: speculative reverts that waste the one cheap mitigation and obscure the real cause.
-
Clean separation of orchestration from procedure (SKILL.md:20-23). The skill defines sequence and decision gates; the runbook holds the concrete commands. This layers correctly - the skill can evolve with new tooling without rewriting procedural steps, and the runbook remains usable by a human without the LLM orchestrator.
-
Worked-example anchoring grounds the theory (runbook:9-14, Appendix A). Tying every step to the actual 2026-05-21 incident with specific SHAs, timestamps, and error messages makes this immediately actionable. The
main is not a functionappendix captures failure-mode reasoning that would otherwise be lost to Slack threads. -
Timeline caveat is operationally valuable (runbook:92-105). Explicitly calling out that the release-commit timestamp is not the deploy timestamp, with the
aws lambda get-functionverification, prevents a subtle correlation error during incident response. -
Hard gate design is correct (SKILL.md:72-87). Requiring explicit re-confirmation per commit (not a blanket prior approval) is the right safety boundary for an LLM-driven workflow touching shared production state.
-
The "gotchas" section is honest operational documentation (runbook Step 5). Documenting node version, pre-commit hook deps, and branch protection behavior under stress is the kind of detail that only gets written by someone who actually hit these issues.
Issues
Minor (Nice to Have)
- TODO placeholders reduce operational readiness -
docs/runbooks/release-caused-outage.md:194,235-238: Steps 5 and 7 contain_(TODO: name it)_,_(TODO: SLA)_,_(TODO: link)_for the incident channel, post-mortem template, and status page. Acceptable for a first merge (acknowledged in discussion), but if these linger, the runbook becomes a "read and interpret" document rather than "follow and execute" during a real incident. Recommend filing a follow-up ticket to resolve within a sprint.
Recommendations
-
Consider adding a "Prerequisites" section to the runbook listing required access (AWS credentials with Lambda read, admin push rights on main, nvm with node 24 installed). During an incident, discovering you lack permissions at Step 5 is costly. A small addition that improves time-to-mitigate for responders unfamiliar with this flow.
-
Track the TODOs explicitly. A single follow-up ticket with a reasonable deadline prevents the placeholders from becoming permanent gaps.
Assessment
Ready to merge? Yes
The documents are well-structured, operationally grounded, and encode the correct incident-response discipline. The architecture decision to separate LLM orchestration from human-readable procedure is sound and will age well. The existing review discussion has been addressed (Last validated anchor, comms/post-mortem TODOs, test:bundle reference, hlx.static example all present in the current diff). No runtime code was changed, CI passes, and the TODO placeholders are appropriate acknowledged debt for a first version.
Skill: pr-review | Model: us.anthropic.claude-opus-4-6-v1[1m] | Duration: 0m 43s | Cost: $3.04 | Commit: 39497452c7a9e5b95294beb805fc68fd2bbfb7c2
If this code review was useful, please react with 👍. Otherwise, react with 👎.
…CLAUDE.md guidance (#2466) ## Re-land of [#2456](#2456) with bundling fix + new pre-merge gate + codified guidance Tracks [SITES-45260](https://jira.corp.adobe.com/browse/SITES-45260). The semrush AIO proxy merged as commit `882dbab5` (v1.508.0) on 2026-05-21 and took the api-service Lambda down — every invocation failed with `TypeError: main2 is not a function at lambdaAdapter`. Reverted at `873eddb9`. This PR brings the feature back with three layered defences against the same class of failure recurring. ## Root cause (recap) `src/support/semrush/handlers/projects.js` was reading `data/locations.json` at module load via `readFileSync(import.meta.url)`. The JSON file was not in `package.json` `hlx.static`, so `helix-deploy` never copied it into the Lambda zip. Cold start hit `ENOENT … dist/data/locations.json`, the module's export went undefined, and the deploy wrapper's reference (`main2`) crashed on every invocation. Tests passed because they import from source, where `import.meta.url` resolves to the original source path and the JSON lives alongside the handler. The failure only manifests in the bundled artifact. ## What changes ### 1. Bundling fix (`a4b198fa`) `src/support/semrush/data/locations.json` -> `src/support/semrush/data/locations.js` (`export const LOCATIONS = { ... }`). `handlers/projects.js` imports it via normal ESM resolution. The bundler picks it up automatically — no `hlx.static` registry to maintain, no `import.meta.url` path arithmetic, no `fs`/`path`/`url` imports in the handler. This is fix option 1 from the SITES-45260 ticket (the recommended one). Diff is a clean swap; data values are unchanged. ### 2. Pre-merge CI gate (`f1cde0db`) The mysticat-ci reusable workflow runs lint + test + coverage only — it **never** executes `npm run build` (`hedy -v --test-bundle`). That's the gap that let the broken bundle through. Add a repo-local `bundle-build` job to `.github/workflows/ci.yaml` that runs `npm run build` on every push / PR. `hedy --test-bundle` bundles, zips, **and invokes `lambda()` against a synthetic healthcheck event**, exiting non-zero on any non-2xx response. Verified locally: - Healthy bundle (this PR's tip): exit 0, healthcheck returns 200 - Broken bundle (reproduction with `readFileSync` restored): exit 1, "Validation failed: 500" with `x-error: ENOENT … data/locations.json` This is the load-bearing catch for SITES-45260's class of failure. It addresses acceptance criteria 2 + 4 in one step — a separate `typeof main === 'function'` smoke check would have been a strict subset of what `hedy --test-bundle` already does, so it was dropped. Until adobe/mysticat-ci's reusable build action grows the equivalent step, this repo-local job is the gate. Worth lifting upstream in a follow-up so every spacecat service inherits it. ### 3. Lambda bundle constraints in CLAUDE.md (`040ed590`) The CI gate catches the failure, but the right place to prevent the next attempt is in the guidance an AI agent (or new contributor) reads first. New `## Lambda Bundle Constraints` section in `CLAUDE.md`, placed right after the build/deploy commands so it's hard to miss when working on anything that touches `src/`. Three short rules: 1. **Do NOT** use `readFileSync(import.meta.url, ...)` or any sibling-file reads at module load — the bundled artifact does not preserve source-relative paths. 2. **Prefer JS-module imports for static data** — inline JSON / locale data / lookup tables as `export const FOO = { ... }` in a `.js` file. See `src/support/semrush/data/locations.js`. 3. **If a non-JS asset is unavoidable**, declare its repo-relative path in `package.json` `hlx.static` AND read it from the Lambda task root, not from `import.meta.url`. Plus a pointer to the `bundle-build` CI gate and the SITES-45260 post-mortem. ## Verification - Local `npm run build` -> exit 0, healthcheck OK - Local reproduction with the `readFileSync` re-introduced -> exit 1 (confirmed gate works) - 178 semrush / contract / workspace-resolver tests passing - `git diff e3618d9..HEAD --stat` is small — only the four files this PR is meant to touch (locations.js, projects.js, ci.yaml, CLAUDE.md). The reapply commit `e3618d92` is exactly the diff of the original `882dbab5`. ## Cross-repo notes - `mysticat-data-service` migration `20260528000000_brand_to_semrush_projects.sql`: already merged + deployed pre-revert. Schema is in place. - `@adobe/spacecat-shared-data-access` `3.68.0`: shipped, pinned by this PR. - Operator runbook for incident mitigation: [#2462](#2462) (ramboz). - Follow-up: [#2467](#2467) makes `SEMRUSH_PROJECTS_BASE_URL` strict / Vault-backed (no source default). ## Follow-ups (out of scope) - Lift the `bundle-build` job into `adobe/mysticat-ci`'s reusable workflow so every spacecat/mysticat service inherits the gate. - Phase 7 (dev smoke tests against `https://spacecat.experiencecloud.live/api/ci/`) — runs after this lands and deploys. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Alicia Adriani <aadriani+adobe@adobe.com>
|
@ramboz should this go to experience-success-skills instead to be easily applicable to all lambdas we have? it is automatically available to workspace users |
|
@solaris007 Not sure. Filed it here since this was the scope of the CSO I worked on. If you prefer having that somewhere else, I'm fine with that, but we need to make sure it's easily discoverable by whoever is handling the CSO at that time. At the very least, I would then make sure we have a pointer to the workspace from this repo |
Summary
Adds an operations runbook and a companion Claude Code skill for responding to a
release-caused outage of the
spacecat-api-serviceLambda — the scenario where adeploy makes every invocation fail at startup (e.g.
TypeError: main2 is not a function).Two ways to run the same playbook:
docs/runbooks/release-caused-outage.md:a step-by-step runbook (triage → correlate the release → verify locally →
revert → confirm recovery → follow-up) with copy-pasteable commands.
.claude/skills/release-outage-response/: a skill that drives anagent through the same phases, enforcing diagnose-before-revert discipline and a
hard human-confirmation gate before any revert/push to prod
main.What's included
docs/runbooks/release-caused-outage.md— the runbook, with a deep-dive appendixon the
main is not a functionbundle-failure signature: a recurringhelix-deploy gotcha where a file read from disk at module load isn't shipped in
the bundle (not declared in
hlx.static), so themainexport never initializes..claude/skills/release-outage-response/SKILL.md— the orchestration skill. Itdelegates the actual commands to the runbook and adds the sequencing + safety
gates.
README.md— new "Operations & Runbooks" section linking to the runbook.Motivation
Distilled from the 2026-05-21 v1.508.0 incident: commit 882dbab (#2456) read a
JSON asset at module load that wasn't included in the Lambda bundle, so every
invocation failed with
main2 is not a function. Reverted in 873eddb. The runbookcaptures how that was diagnosed — and how to confirm a release is the cause before
reverting — so the next responder doesn't start from scratch.
Notes
docs:, so no version bump).typeof main === 'function'on the built artifact — is called out as follow-up,not included here.
Test plan
main(hlx.static,npmscripts,node
engines, deploy-zip path)SHAs / PR feat(semrush): IMS-bearer-only AIO proxy + onboarding + DB-backed projects #2456