31 changes: 20 additions & 11 deletions _shared/recipes/self-review-gate.md
@@ -4,13 +4,15 @@
A quality gate that reviews code changes before they are pushed or submitted
as a PR. Used by workflows that make code changes (bugfix, implement, cve-fix,
e2e) to catch issues before external review.

This gate is one peer in a defence-in-depth chain: self-review catches
mechanical issues that automated checks are good at finding (convention
violations, obvious bugs, missing error handling, inconsistencies with
surrounding code), while downstream reviewers (coderabbit, human reviewers)
provide genuinely independent perspectives on design, architecture, and
subtle correctness issues. A same-model subagent improves review quality
over inline review but does not replace independent review.
This gate is one peer in a defence-in-depth chain. It evaluates all
criteria in the review protocol using a same-model subagent for
isolation. A subagent reviewing only the diff, without the
implementation rationale loaded, can catch issues visible from the code
that the implementor is too close to see. It will not catch everything
a human reviewer would — subtle correctness issues and cross-system
design concerns benefit from genuinely independent review by downstream
reviewers (CodeRabbit, human reviewers). The goal is to catch what can
be caught early, not to replace independent review.

## Parameters

@@ -19,6 +21,7 @@ over inline review but does not replace independent review.
| DIFF_COMMAND | No | Must be a `git diff` invocation (the gate appends `--name-status` to it). Note: `git diff` only shows tracked file changes — if the workflow creates new untracked files, the calling workflow should stage them first or use a DIFF_COMMAND that captures them (see the sketch below the table). | `git diff HEAD` |
| MAX_ROUNDS | No | Maximum fix-and-re-review iterations | `3` |
| CONTEXT_FILES | No | Workflow artifacts providing review context (e.g., design docs, requirements, implementation notes) | None |
| SUPPLEMENTARY_CRITERIA | No | Additional evaluation criteria beyond the review protocol. Passed to the reviewer alongside the standard criteria. Use for domain-specific checks (e.g., e2e anti-patterns) or review focus directives (e.g., cross-cutting concerns to prioritize, issues to skip). | None |
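
As the DIFF_COMMAND note above mentions, `git diff` does not see untracked files. A minimal workaround sketch, assuming a POSIX shell and that newly created files should be reviewed too:

```sh
# Make untracked files visible to `git diff` without staging their content
# (--intent-to-add records only the paths), then produce the diff the gate
# consumes. The gate itself appends --name-status for the file list.
git add --intent-to-add .
git diff HEAD
```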

## Procedure

@@ -86,13 +89,18 @@
independence. Load it with:
- The diff output
- The project's AGENTS.md/CLAUDE.md (if they exist)
- Any CONTEXT_FILES provided by the calling workflow
- Any SUPPLEMENTARY_CRITERIA provided by the calling workflow
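
The staging mechanism is runtime-specific; one minimal sketch, assuming a POSIX shell and a scratch directory (all paths are illustrative):

```sh
# Hypothetical staging of the reviewer's inputs.
mkdir -p /tmp/self-review
git diff HEAD --name-status > /tmp/self-review/changed-files.txt
git diff HEAD > /tmp/self-review/diff.patch
for f in AGENTS.md CLAUDE.md; do
  [ -f "$f" ] && cp "$f" /tmp/self-review/   # only if they exist
done
```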

Retain the subagent's ID for use in Step 4 — resuming the same reviewer
gives it memory of its previous findings and concerns, producing more
coherent follow-up reviews.

**If subagents are not available**, perform the review inline. Adopt the
reviewer perspective: evaluate the code as if you did not write it.
**If subagents are not available**, perform the review inline. Apply
any SUPPLEMENTARY_CRITERIA alongside the standard evaluation criteria.
Adopt the reviewer perspective: evaluate the code as if you did not
write it. Do not let your knowledge of the implementation rationale
excuse issues that a fresh reviewer would flag — review the diff on
its own merits.

### Step 3: Validate and Assess Findings

@@ -127,8 +135,9 @@ If Step 3 produced code changes:
original findings and lets it verify they were addressed correctly.

**If resumption is not available:** Spawn a new subagent loaded with the
review protocol, the updated diff, and a summary of the previous round's
findings and fixes so it has full context.
review protocol, the updated diff, the project's AGENTS.md/CLAUDE.md,
any CONTEXT_FILES, any SUPPLEMENTARY_CRITERIA, and a summary of the
previous round's findings and fixes so it has full context.

**If subagents are not available:** Re-review inline, focusing on the
current state of the diff (not just the delta from last round).
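
Conceptually, the whole gate is a bounded loop. A sketch in shell pseudocode, where `run_review`, `validate_findings`, and `apply_fixes` are hypothetical stand-ins for Steps 2-4, not a real runtime API:

```sh
# Conceptual control flow only.
round=1
while [ "$round" -le "${MAX_ROUNDS:-3}" ]; do
  run_review            # Step 2 (first round) or Step 4 (re-review)
  validate_findings     # Step 3: discard hallucinated refs, assess value
  apply_fixes || break  # stop once no fixes are needed
  round=$((round + 1))
done
```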
32 changes: 24 additions & 8 deletions _shared/review-protocol.md
@@ -6,10 +6,16 @@
workflow and the self-review gates embedded in other workflows.

## Core Principles

- **Review what the developer wrote, not what you would have written.** Evaluate
correctness, clarity, safety, and adherence to project conventions. Do not
impose stylistic preferences that are not codified in the project's linting
or contribution guidelines.
- **Challenge decisions, not just implementation.** Don't just verify that
the code does what it intends — question whether the intent is right.
Should this abstraction exist? Is this the right boundary between
components? Would a different approach avoid the complexity being managed
here? A review that only checks correctness is a linter with extra steps.
- **Review what the developer wrote, not what you would have written.** The
above principle challenges decisions; this one constrains how. Evaluate
the approach on its merits — does it work, is it clear, is it safe? Do
not impose stylistic preferences or rewrite working code in your preferred
idiom. Challenge the design, not the taste.
- **Findings must be actionable.** Every finding must include a concrete
suggestion the developer can evaluate and apply. Vague observations
("consider improving this") are not findings.
@@ -36,13 +42,23 @@ Evaluate changes against these categories, prioritized by impact:
appropriately? Are failure modes handled?
3. **Security** — Are there injection risks, unsafe operations, exposed secrets,
or other OWASP-category concerns?
4. **Performance** — Are there unnecessary allocations, N+1 queries, unbounded
4. **Design** — Does each new abstraction earn its complexity? Are
responsibilities clearly divided — no god functions, no single type
accumulating unrelated concerns? Do interfaces hide implementation details
and expose coherent contracts? Are there implicit assumptions between
components that should be explicit (shared constants, expected call order,
assumed preconditions)? Signals: a function taking many parameters
(missing intermediate type), a type importing from several unrelated
packages (mixed concerns), identical error-handling blocks repeated
across call sites (missing shared helper), a new interface that exposes
internal types to callers.
5. **Performance** — Are there unnecessary allocations, N+1 queries, unbounded
operations, or other efficiency concerns?
5. **Naming and clarity** — Are names descriptive? Is the intent clear from
6. **Naming and clarity** — Are names descriptive? Is the intent clear from
reading the code?
6. **Test coverage** — Are the changes tested? Are edge cases covered? Are
7. **Test coverage** — Are the changes tested? Are edge cases covered? Are
tests testing contracts, not implementation?
7. **Project conventions** — Do the changes follow the conventions discovered
8. **Project conventions** — Do the changes follow the conventions discovered
from the project's AGENTS.md, CLAUDE.md, linting configs, and contribution
guidelines?

3 changes: 1 addition & 2 deletions code-review/skills/start.md
@@ -247,8 +247,7 @@
change interact correctly with surrounding code? The diff shows what
changed; the surrounding code reveals whether the change fits.

The reviewer should evaluate all categories defined in
`../../_shared/review-protocol.md` (Correctness, Error handling, Security,
Performance, Naming and clarity, Test coverage, Project conventions).
`../../_shared/review-protocol.md`.

If the user provided focus guidance, prioritize those areas but still
report CRITICAL and HIGH findings in other categories.
5 changes: 3 additions & 2 deletions design/README.md
@@ -127,12 +127,13 @@
The decomposition follows the team's Jira hierarchy:

- **Feature** (exists in Jira — input to this workflow)
- **Epic** — user-value oriented, standalone, T-shirt sized
- **Story** — right-sized, includes functionality + testing, prefixed with `[DEV]`/`[UI]`/`[UX]`/`[QE]`/`[DOCS]`/`[CI]`
- **Story** — right-sized, prefixed with `[DEV]`/`[UI]`/`[UX]`/`[QE]`/`[DOCS]`/`[CI]`

Key constraints:
- Each epic delivers complete functionality independently
- Each story leaves the system in a stable state (CI/CD to main)
- Every story includes both functionality and testing (no deferred test stories)
- Every `[DEV]` story includes both functionality and testing (no deferred test stories)
- `[DOCS]` stories use a dedicated template with Documentation Scope and Documentation Inputs instead of Implementation Guidance and Testing Approach
- Tests validate the software's contract, not its implementation — use test types appropriate to the change (unit, integration, e2e)
- A coverage matrix ensures all PRD requirements are addressed

160 changes: 160 additions & 0 deletions design/decomposition-review.md
@@ -0,0 +1,160 @@
# Decomposition Review Protocol
Collaborator:
This file seems very desirable, but I wonder about its token consumption/economics.
There are a lot of judgement calls and high-level reviews here that try to capture the big picture.
I guess we just need to try and improve over time.
Just to be clear, I don't see an action item in this review comment. It's more for sharing thoughts.

Contributor Author:
Agreed on the token economics concern — this is expensive. The tradeoff we discussed: decomposition errors caught early (missing requirements, broken dependency chains, vague acceptance criteria) tend to compound downstream. A story with a missing AC becomes a PR with missing tests becomes a feature gap in production. The review cost is front-loaded but should reduce rework in the implementation and QE phases. We'll need to see how it plays out in practice and adjust.


Evaluation criteria, finding format, and assessment model used by the
decomposition review step in the `/decompose` phase. This protocol is
runtime-agnostic: it defines what to evaluate and how to report findings,
whether the reviewer is a subagent or the decomposer performing inline
self-review.

The reviewer receives the PRD and the decomposition artifacts (epics,
stories, coverage matrix) but intentionally does **not** receive the
design document. This is the independence property: the reviewer
evaluates what the artifacts actually say, not what the decomposer
intended. If a story's acceptance criteria are unclear without
cross-referencing the design document, that is a real finding — the
implementor needs self-contained acceptance criteria. Stories are
designed to be read alongside the design document during implementation
(Implementation Guidance sections may reference design sections by
number), so design-section references in Implementation Guidance are
expected and should not be flagged as a clarity problem.

## Core Principles

- **Challenge the decomposition structure, not just the content.** Don't
just verify that each story describes work — question whether the epic
boundaries are right, whether stories are right-sized, and whether the
dependency order would actually produce a working feature. A review
that only checks formatting is a linter with extra steps.
- **Evaluate against the PRD, not against assumed design intent.** The
reviewer intentionally does not receive the design document. Evaluate
whether the decomposition fully addresses the PRD's requirements. If a
requirement appears missing, verify against the PRD before flagging —
the design may have legitimately restructured how a requirement is
addressed.
- **Findings must be actionable.** Every finding must cite a specific
artifact file and section, and include a concrete suggestion the
decomposer can evaluate and apply. Vague observations ("consider
adding more detail") are not findings.
- **Severity must be honest.** CRITICAL and HIGH indicate problems that
would cause incorrect implementation or significant rework. MEDIUM and
LOW are suggestions that improve clarity or structure but are not
blocking. Inflating severity erodes trust; deflating it hides real
problems.
- **Assess on value, not severity.** A LOW finding that genuinely
improves story clarity is worth implementing. A MEDIUM finding that is
a structural preference without real impact is not. Severity indicates
potential damage; value determines whether the finding is worth acting
on.
- **Do not flag design-document cross-references in Implementation
Guidance.** Stories are designed to be read alongside the design
document during implementation. References like "See §4.3 of the
design document" are intentional and expected. Flag instead: acceptance
criteria that require the design document to understand, or
Implementation Guidance that lacks enough direction for an implementor
to start work. Note: `[DOCS]` stories use `Documentation Scope`
instead of `Implementation Guidance` — evaluate those sections on
whether they describe what the reader needs to understand, not on
implementation direction.

## Evaluation Criteria

Evaluate the decomposition against these categories, prioritized by
impact:

1. **Requirement Coverage** — Does the coverage matrix (`06-coverage.md`)
map every FR and NFR from the PRD? Do story acceptance criteria
collectively cover the PRD's acceptance criteria, not just the FR/NFR
descriptions? Are there stories without PRD traceability — and if so,
are they justified (e.g., infrastructure prerequisites)? Has any
requirement been silently dropped, deferred to "v2", or reduced in
scope?

2. **Epic Structure** — Is each epic organized around user value, not
technical layers? Does each epic deliver complete functionality
independently? Are T-shirt sizes assigned and reasonable? Is the epic
dependency order documented and logical? Could implementing the epics
in the documented order actually work, or are there hidden
dependencies?

3. **Story Completeness** — Are acceptance criteria behavioral outcomes
(what the system does, testable from outside), not implementation
specifications? Does each story have a title prefix assigned (`[DEV]`,
`[QE]`, `[UI]`, `[CI]`, `[DOCS]`)? Is the story implementation order
within each epic documented? Are stories right-sized — not so small
they add process overhead, not so large they resist meaningful review?
For non-`[DOCS]` stories, does the Implementation Guidance give
enough direction to start work?
If a story has more than approximately 8 acceptance criteria, should
it be split?

4. **Testing Commitment** — Does every `[DEV]` story include both
functionality AND testing? Do `[QE]` stories define concrete e2e test
scenarios that validate user-facing behavior, not duplicate
unit/integration tests from `[DEV]` stories? Does each epic with
e2e-testable surface include at least one `[QE]` story? Does any
story fall back to "manual validation" or "document procedure for QE"
where an automated test could be written?

5. **Integration Stability** — Would implementing all stories in
dependency order produce a working feature end-to-end? Does any
integration work fall between stories — work that is needed but not
captured in any story? Are story dependencies documented and
complete? Does each story leave the system in a stable state (all
tests passing after merge)?

6. **Documentation** — Do `[DOCS]` stories use the `[DOCS]` template
(Documentation Scope and Documentation Inputs instead of
Implementation Guidance and Testing Approach)? Does the Documentation
Scope describe what the reader needs to understand without prescribing
how to write or organize the documentation? Does the Documentation
Inputs section inventory all user-facing artifacts from the
implementation stories it references? Is the inventory detailed enough
for a documentation author to know what to look for in the
implementation?
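
For illustration, a single coverage matrix row in `06-coverage.md` might look like the following (the requirement, epic, and story names are hypothetical):

| PRD Requirement | Covered By | Acceptance Criteria |
|-----------------|------------|---------------------|
| FR-3: Export results as CSV | Epic 2 / `story-03-export-csv` | Story AC 2, AC 5 |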

## Finding Format

Each finding must include:

- **File:** artifact file path (e.g.,
`05-stories/epic-1/story-02-add-validation.md`)
- **Section:** section within the artifact (e.g., "Acceptance Criteria",
"Dependencies", "Testing Approach", "Documentation Scope")
- **Severity:** CRITICAL | HIGH | MEDIUM | LOW
- **Category:** one of the evaluation criteria above
- **Issue:** what the problem is
- **Suggestion:** concrete, actionable fix

Findings that cannot cite a specific file and section in the actual
artifacts must be discarded — they indicate hallucinated references.
Coverage matrix findings should cite `06-coverage.md` and the specific
row or gap.
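
For illustration, a well-formed finding might read (the file, story, and acceptance criterion are hypothetical):

- **File:** `05-stories/epic-1/story-04-export-csv.md`
- **Section:** Acceptance Criteria
- **Severity:** MEDIUM
- **Category:** Story Completeness
- **Issue:** AC 3 ("export works correctly") is not testable from outside the system.
- **Suggestion:** Restate as a behavioral outcome, e.g. "the exported CSV contains one row per record, in the documented column order".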

## Severity Definitions

| Severity | Meaning | Action |
|----------|---------|--------|
| CRITICAL | Would cause incorrect implementation — missing requirements, contradictory acceptance criteria, broken dependency chain, scope reduction | Must fix before presenting |
| HIGH | Would cause rework — story too large to review in a single PR, missing testing commitment, integration gap between stories, `[QE]` story duplicating `[DEV]` test scope | Should fix before presenting |
| MEDIUM | Would reduce clarity — vague acceptance criteria, imprecise implementation guidance or documentation scope, sizing concerns, minor coverage matrix inconsistencies | Fix if it adds value |
| LOW | Structural improvements — naming consistency, ordering suggestions, documentation clarity | Fix only if clearly valuable |

## Validation Rules

After obtaining a review (whether from a subagent or inline), validate
every finding:

1. **Verify file references.** Confirm the cited artifact file exists in
the decomposition output (under
`.artifacts/design/{issue-number}/`).
2. **Verify section references.** Confirm the cited section exists in
the artifact file (e.g., an "Acceptance Criteria" heading actually
appears in the story file; see the sketch after this list).
3. **Discard hallucinated findings.** If a finding references an artifact
file or section that does not exist, discard it silently. Do not
present it.
4. **Cross-check against the PRD.** For findings that claim a
requirement is missing or inadequately covered, verify the claim
against the actual PRD. The PRD is the source of truth — if the
reviewer misidentified a requirement or the requirement does not
exist, discard the finding.
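
A minimal sketch of rules 1 and 2, assuming the artifact layout above (the issue number and paths are illustrative):

```sh
# Hypothetical reference check: accept a finding only if the cited file
# exists and the cited section heading appears in it.
root=".artifacts/design/1234"
file="$root/05-stories/epic-1/story-02-add-validation.md"
if [ -f "$file" ] && grep -q '^#\{1,6\} *Acceptance Criteria' "$file"; then
  echo "reference verified"
else
  echo "discard finding"   # rule 3: hallucinated reference
fi
```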
2 changes: 1 addition & 1 deletion design/guidelines.md
@@ -39,7 +39,7 @@
- Follow the design document template (`templates/design.md`). Do not invent new sections or omit existing ones without user approval. Sections with no impact should say so explicitly (e.g., "No schema changes required") rather than being omitted.
- Follow the section guidance (`templates/section-guidance.md`) for content standards.
- Design-scoped goals are **implementation constraints**, not product outcomes. They complement — not duplicate — the PRD's goals.
- Acceptance criteria must be **behavioral outcomes** (what the system does, testable from outside), not activities or implementation specifications. Implementation details belong in Implementation Guidance.
- Acceptance criteria must be **behavioral outcomes** (what the system does, testable from outside), not activities or implementation specifications. For non-`[DOCS]` stories, implementation details belong in Implementation Guidance. For `[DOCS]` stories, implementation details belong in Documentation Scope.
Collaborator:
Referring to DOCS again...

Contributor Author:
See reply on decompose.md.

- Open questions must have owners and impact statements.

## Escalation