31 changes: 20 additions & 11 deletions _shared/recipes/self-review-gate.md
@@ -4,13 +4,15 @@
A quality gate that reviews code changes before they are pushed or submitted
as a PR. Used by workflows that make code changes (bugfix, implement, cve-fix,
e2e) to catch issues before external review.

This gate is one peer in a defence-in-depth chain: self-review catches
mechanical issues that automated checks are good at finding (convention
violations, obvious bugs, missing error handling, inconsistencies with
surrounding code), while downstream reviewers (coderabbit, human reviewers)
provide genuinely independent perspectives on design, architecture, and
subtle correctness issues. A same-model subagent improves review quality
over inline review but does not replace independent review.
This gate is one peer in a defence-in-depth chain. It evaluates all
criteria in the review protocol using a same-model subagent for
isolation. A subagent reviewing only the diff, without the
implementation rationale loaded, can catch issues visible from the code
that the implementor is too close to see. It will not catch everything
a human reviewer would — subtle correctness issues and cross-system
design concerns benefit from genuinely independent review by downstream
reviewers (CodeRabbit, human reviewers). The goal is to catch what can
be caught early, not to replace independent review.

## Parameters

@@ -19,6 +21,7 @@ over inline review but does not replace independent review.
| DIFF_COMMAND | No | Must be a `git diff` invocation (the gate appends `--name-status` to it). Note: `git diff` only shows tracked file changes — if the workflow creates new untracked files, the calling workflow should stage them first or use a DIFF_COMMAND that captures them (see the sketch below the table). | `git diff HEAD` |
| MAX_ROUNDS | No | Maximum fix-and-re-review iterations | `3` |
| CONTEXT_FILES | No | Workflow artifacts providing review context (e.g., design docs, requirements, implementation notes) | None |
| SUPPLEMENTARY_CRITERIA | No | Additional evaluation criteria beyond the review protocol. Passed to the reviewer alongside the standard criteria. Use for domain-specific checks (e.g., e2e anti-patterns) or review focus directives (e.g., cross-cutting concerns to prioritize, issues to skip). | None |
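
As the DIFF_COMMAND note above mentions, `git diff` does not see untracked files. A minimal workaround sketch, assuming a POSIX shell and that newly created files should be reviewed too:

```sh
# Make untracked files visible to `git diff` without staging their content
# (--intent-to-add records only the paths), then produce the diff the gate
# consumes. The gate itself appends --name-status for the file list.
git add --intent-to-add .
git diff HEAD
```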

## Procedure

@@ -86,13 +89,18 @@
independence. Load it with:
- The diff output
- The project's AGENTS.md/CLAUDE.md (if they exist)
- Any CONTEXT_FILES provided by the calling workflow
- Any SUPPLEMENTARY_CRITERIA provided by the calling workflow
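
The staging mechanism is runtime-specific; one minimal sketch, assuming a POSIX shell and a scratch directory (all paths are illustrative):

```sh
# Hypothetical staging of the reviewer's inputs.
mkdir -p /tmp/self-review
git diff HEAD --name-status > /tmp/self-review/changed-files.txt
git diff HEAD > /tmp/self-review/diff.patch
for f in AGENTS.md CLAUDE.md; do
  [ -f "$f" ] && cp "$f" /tmp/self-review/   # only if they exist
done
```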

Retain the subagent's ID for use in Step 4 — resuming the same reviewer
gives it memory of its previous findings and concerns, producing more
coherent follow-up reviews.

**If subagents are not available**, perform the review inline. Adopt the
reviewer perspective: evaluate the code as if you did not write it.
**If subagents are not available**, perform the review inline. Apply
any SUPPLEMENTARY_CRITERIA alongside the standard evaluation criteria.
Adopt the reviewer perspective: evaluate the code as if you did not
write it. Do not let your knowledge of the implementation rationale
excuse issues that a fresh reviewer would flag — review the diff on
its own merits.

### Step 3: Validate and Assess Findings

@@ -127,8 +135,9 @@ If Step 3 produced code changes:
original findings and lets it verify they were addressed correctly.

**If resumption is not available:** Spawn a new subagent loaded with the
review protocol, the updated diff, and a summary of the previous round's
findings and fixes so it has full context.
review protocol, the updated diff, the project's AGENTS.md/CLAUDE.md,
any CONTEXT_FILES, any SUPPLEMENTARY_CRITERIA, and a summary of the
previous round's findings and fixes so it has full context.

**If subagents are not available:** Re-review inline, focusing on the
current state of the diff (not just the delta from last round).
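
Conceptually, the whole gate is a bounded loop. A sketch in shell pseudocode, where `run_review`, `validate_findings`, and `apply_fixes` are hypothetical stand-ins for Steps 2-4, not a real runtime API:

```sh
# Conceptual control flow only.
round=1
while [ "$round" -le "${MAX_ROUNDS:-3}" ]; do
  run_review            # Step 2 (first round) or Step 4 (re-review)
  validate_findings     # Step 3: discard hallucinated refs, assess value
  apply_fixes || break  # stop once no fixes are needed
  round=$((round + 1))
done
```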
32 changes: 24 additions & 8 deletions _shared/review-protocol.md
@@ -6,10 +6,16 @@
workflow and the self-review gates embedded in other workflows.

## Core Principles

- **Review what the developer wrote, not what you would have written.** Evaluate
correctness, clarity, safety, and adherence to project conventions. Do not
impose stylistic preferences that are not codified in the project's linting
or contribution guidelines.
- **Challenge decisions, not just implementation.** Don't just verify that
the code does what it intends — question whether the intent is right.
Should this abstraction exist? Is this the right boundary between
components? Would a different approach avoid the complexity being managed
here? A review that only checks correctness is a linter with extra steps.
- **Review what the developer wrote, not what you would have written.** The
above principle challenges decisions; this one constrains how. Evaluate
the approach on its merits — does it work, is it clear, is it safe? Do
not impose stylistic preferences or rewrite working code in your preferred
idiom. Challenge the design, not the taste.
- **Findings must be actionable.** Every finding must include a concrete
suggestion the developer can evaluate and apply. Vague observations
("consider improving this") are not findings.
@@ -36,13 +42,23 @@ Evaluate changes against these categories, prioritized by impact:
appropriately? Are failure modes handled?
3. **Security** — Are there injection risks, unsafe operations, exposed secrets,
or other OWASP-category concerns?
4. **Performance** — Are there unnecessary allocations, N+1 queries, unbounded
4. **Design** — Does each new abstraction earn its complexity? Are
responsibilities clearly divided — no god functions, no single type
accumulating unrelated concerns? Do interfaces hide implementation details
and expose coherent contracts? Are there implicit assumptions between
components that should be explicit (shared constants, expected call order,
assumed preconditions)? Signals: a function taking many parameters
(missing intermediate type), a type importing from several unrelated
packages (mixed concerns), identical error-handling blocks repeated
across call sites (missing shared helper), a new interface that exposes
internal types to callers.
5. **Performance** — Are there unnecessary allocations, N+1 queries, unbounded
operations, or other efficiency concerns?
5. **Naming and clarity** — Are names descriptive? Is the intent clear from
6. **Naming and clarity** — Are names descriptive? Is the intent clear from
reading the code?
6. **Test coverage** — Are the changes tested? Are edge cases covered? Are
7. **Test coverage** — Are the changes tested? Are edge cases covered? Are
tests testing contracts, not implementation?
7. **Project conventions** — Do the changes follow the conventions discovered
8. **Project conventions** — Do the changes follow the conventions discovered
from the project's AGENTS.md, CLAUDE.md, linting configs, and contribution
guidelines?

3 changes: 1 addition & 2 deletions code-review/skills/start.md
@@ -247,8 +247,7 @@
change interact correctly with surrounding code? The diff shows what
changed; the surrounding code reveals whether the change fits.

The reviewer should evaluate all categories defined in
`../../_shared/review-protocol.md` (Correctness, Error handling, Security,
Performance, Naming and clarity, Test coverage, Project conventions).
`../../_shared/review-protocol.md`.

If the user provided focus guidance, prioritize those areas but still
report CRITICAL and HIGH findings in other categories.
5 changes: 3 additions & 2 deletions design/README.md
@@ -127,12 +127,13 @@
The decomposition follows the team's Jira hierarchy:

- **Feature** (exists in Jira — input to this workflow)
- **Epic** — user-value oriented, standalone, T-shirt sized
- **Story** — right-sized, includes functionality + testing, prefixed with `[DEV]`/`[UI]`/`[UX]`/`[QE]`/`[DOCS]`/`[CI]`
- **Story** — right-sized, prefixed with `[DEV]`/`[UI]`/`[UX]`/`[QE]`/`[DOCS]`/`[CI]`

Key constraints:
- Each epic delivers complete functionality independently
- Each story leaves the system in a stable state (CI/CD to main)
- Every story includes both functionality and testing (no deferred test stories)
- Every `[DEV]` story includes both functionality and testing (no deferred test stories)
- `[DOCS]` stories use a dedicated template with Documentation Scope and Documentation Inputs instead of Implementation Guidance and Testing Approach
- Tests validate the software's contract, not its implementation — use test types appropriate to the change (unit, integration, e2e)
- A coverage matrix ensures all PRD requirements are addressed

160 changes: 160 additions & 0 deletions design/decomposition-review.md
@@ -0,0 +1,160 @@
# Decomposition Review Protocol
Collaborator:
This file seems very desirable, but I wonder about its token consumption/economics.
There are a lot of judgement calls and high-level reviews here that try to capture the big picture.
I guess we just need to try and improve over time.
Just to be clear, I don't see an action item in this review comment. It's more for sharing thoughts.

Contributor Author:
Agreed on the token economics concern — this is expensive. The tradeoff we discussed: decomposition errors caught early (missing requirements, broken dependency chains, vague acceptance criteria) tend to compound downstream. A story with a missing AC becomes a PR with missing tests becomes a feature gap in production. The review cost is front-loaded but should reduce rework in the implementation and QE phases. We'll need to see how it plays out in practice and adjust.


Evaluation criteria, finding format, and assessment model used by the
decomposition review step in the `/decompose` phase. This protocol is
runtime-agnostic: it defines what to evaluate and how to report findings,
whether the reviewer is a subagent or the decomposer performing inline
self-review.

The reviewer receives the PRD and the decomposition artifacts (epics,
stories, coverage matrix) but intentionally does **not** receive the
design document. This is the independence property: the reviewer
evaluates what the artifacts actually say, not what the decomposer
intended. If a story's acceptance criteria are unclear without
cross-referencing the design document, that is a real finding — the
implementor needs self-contained acceptance criteria. Stories are
designed to be read alongside the design document during implementation
(Implementation Guidance sections may reference design sections by
number), so design-section references in Implementation Guidance are
expected and should not be flagged as a clarity problem.

## Core Principles

- **Challenge the decomposition structure, not just the content.** Don't
just verify that each story describes work — question whether the epic
boundaries are right, whether stories are right-sized, and whether the
dependency order would actually produce a working feature. A review
that only checks formatting is a linter with extra steps.
- **Evaluate against the PRD, not against assumed design intent.** The
reviewer intentionally does not receive the design document. Evaluate
whether the decomposition fully addresses the PRD's requirements. If a
requirement appears missing, verify against the PRD before flagging —
the design may have legitimately restructured how a requirement is
addressed.
- **Findings must be actionable.** Every finding must cite a specific
artifact file and section, and include a concrete suggestion the
decomposer can evaluate and apply. Vague observations ("consider
adding more detail") are not findings.
- **Severity must be honest.** CRITICAL and HIGH indicate problems that
would cause incorrect implementation or significant rework. MEDIUM and
LOW are suggestions that improve clarity or structure but are not
blocking. Inflating severity erodes trust; deflating it hides real
problems.
- **Assess on value, not severity.** A LOW finding that genuinely
improves story clarity is worth implementing. A MEDIUM finding that is
a structural preference without real impact is not. Severity indicates
potential damage; value determines whether the finding is worth acting
on.
- **Do not flag design-document cross-references in Implementation
Guidance.** Stories are designed to be read alongside the design
document during implementation. References like "See §4.3 of the
design document" are intentional and expected. Flag instead: acceptance
criteria that require the design document to understand, or
Implementation Guidance that lacks enough direction for an implementor
to start work. Note: `[DOCS]` stories use `Documentation Scope`
instead of `Implementation Guidance` — evaluate those sections on
whether they describe what the reader needs to understand, not on
implementation direction.

## Evaluation Criteria

Evaluate the decomposition against these categories, prioritized by
impact:

1. **Requirement Coverage** — Does the coverage matrix (`06-coverage.md`)
map every FR and NFR from the PRD? Do story acceptance criteria
collectively cover the PRD's acceptance criteria, not just the FR/NFR
descriptions? Are there stories without PRD traceability — and if so,
are they justified (e.g., infrastructure prerequisites)? Has any
requirement been silently dropped, deferred to "v2", or reduced in
scope?

2. **Epic Structure** — Is each epic organized around user value, not
technical layers? Does each epic deliver complete functionality
independently? Are T-shirt sizes assigned and reasonable? Is the epic
dependency order documented and logical? Could implementing the epics
in the documented order actually work, or are there hidden
dependencies?

3. **Story Completeness** — Are acceptance criteria behavioral outcomes
(what the system does, testable from outside), not implementation
specifications? Does each story have a title prefix assigned (`[DEV]`,
`[QE]`, `[UI]`, `[CI]`, `[DOCS]`)? Is the story implementation order
within each epic documented? Are stories right-sized — not so small
they add process overhead, not so large they resist meaningful review?
For non-`[DOCS]` stories, does the Implementation Guidance give
enough direction to start work?
If a story has more than approximately 8 acceptance criteria, should
it be split?

4. **Testing Commitment** — Does every `[DEV]` story include both
functionality AND testing? Do `[QE]` stories define concrete e2e test
scenarios that validate user-facing behavior, not duplicate
unit/integration tests from `[DEV]` stories? Does each epic with
e2e-testable surface include at least one `[QE]` story? Does any
story fall back to "manual validation" or "document procedure for QE"
where an automated test could be written?

5. **Integration Stability** — Would implementing all stories in
dependency order produce a working feature end-to-end? Does any
integration work fall between stories — work that is needed but not
captured in any story? Are story dependencies documented and
complete? Does each story leave the system in a stable state (all
tests passing after merge)?

6. **Documentation** — Do `[DOCS]` stories use the `[DOCS]` template
(Documentation Scope and Documentation Inputs instead of
Implementation Guidance and Testing Approach)? Does the Documentation
Scope describe what the reader needs to understand without prescribing
how to write or organize the documentation? Does the Documentation
Inputs section inventory all user-facing artifacts from the
implementation stories it references? Is the inventory detailed enough
for a documentation author to know what to look for in the
implementation?
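
For illustration, a single coverage matrix row in `06-coverage.md` might look like the following (the requirement, epic, and story names are hypothetical):

| PRD Requirement | Covered By | Acceptance Criteria |
|-----------------|------------|---------------------|
| FR-3: Export results as CSV | Epic 2 / `story-03-export-csv` | Story AC 2, AC 5 |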

## Finding Format

Each finding must include:

- **File:** artifact file path (e.g.,
`05-stories/epic-1/story-02-add-validation.md`)
- **Section:** section within the artifact (e.g., "Acceptance Criteria",
"Dependencies", "Testing Approach", "Documentation Scope")
- **Severity:** CRITICAL | HIGH | MEDIUM | LOW
- **Category:** one of the evaluation criteria above
- **Issue:** what the problem is
- **Suggestion:** concrete, actionable fix

Findings that cannot cite a specific file and section in the actual
artifacts must be discarded — they indicate hallucinated references.
Coverage matrix findings should cite `06-coverage.md` and the specific
row or gap.
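
For illustration, a well-formed finding might read (the file, story, and acceptance criterion are hypothetical):

- **File:** `05-stories/epic-1/story-04-export-csv.md`
- **Section:** Acceptance Criteria
- **Severity:** MEDIUM
- **Category:** Story Completeness
- **Issue:** AC 3 ("export works correctly") is not testable from outside the system.
- **Suggestion:** Restate as a behavioral outcome, e.g. "the exported CSV contains one row per record, in the documented column order".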

## Severity Definitions

| Severity | Meaning | Action |
|----------|---------|--------|
| CRITICAL | Would cause incorrect implementation — missing requirements, contradictory acceptance criteria, broken dependency chain, scope reduction | Must fix before presenting |
| HIGH | Would cause rework — story too large to review in a single PR, missing testing commitment, integration gap between stories, `[QE]` story duplicating `[DEV]` test scope | Should fix before presenting |
| MEDIUM | Would reduce clarity — vague acceptance criteria, imprecise implementation guidance or documentation scope, sizing concerns, minor coverage matrix inconsistencies | Fix if it adds value |
| LOW | Structural improvements — naming consistency, ordering suggestions, documentation clarity | Fix only if clearly valuable |

## Validation Rules

After obtaining a review (whether from a subagent or inline), validate
every finding:

1. **Verify file references.** Confirm the cited artifact file exists in
the decomposition output (under
`.artifacts/design/{issue-number}/`).
2. **Verify section references.** Confirm the cited section exists in
the artifact file (e.g., an "Acceptance Criteria" heading actually
appears in the story file; see the sketch after this list).
3. **Discard hallucinated findings.** If a finding references an artifact
file or section that does not exist, discard it silently. Do not
present it.
4. **Cross-check against the PRD.** For findings that claim a
requirement is missing or inadequately covered, verify the claim
against the actual PRD. The PRD is the source of truth — if the
reviewer misidentified a requirement or the requirement does not
exist, discard the finding.
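
A minimal sketch of rules 1 and 2, assuming the artifact layout above (the issue number and paths are illustrative):

```sh
# Hypothetical reference check: accept a finding only if the cited file
# exists and the cited section heading appears in it.
root=".artifacts/design/1234"
file="$root/05-stories/epic-1/story-02-add-validation.md"
if [ -f "$file" ] && grep -q '^#\{1,6\} *Acceptance Criteria' "$file"; then
  echo "reference verified"
else
  echo "discard finding"   # rule 3: hallucinated reference
fi
```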
2 changes: 1 addition & 1 deletion design/guidelines.md
@@ -39,7 +39,7 @@
- Follow the design document template (`templates/design.md`). Do not invent new sections or omit existing ones without user approval. Sections with no impact should say so explicitly (e.g., "No schema changes required") rather than being omitted.
- Follow the section guidance (`templates/section-guidance.md`) for content standards.
- Design-scoped goals are **implementation constraints**, not product outcomes. They complement — not duplicate — the PRD's goals.
- Acceptance criteria must be **behavioral outcomes** (what the system does, testable from outside), not activities or implementation specifications. Implementation details belong in Implementation Guidance.
- Acceptance criteria must be **behavioral outcomes** (what the system does, testable from outside), not activities or implementation specifications. For non-`[DOCS]` stories, implementation details belong in Implementation Guidance. For `[DOCS]` stories, implementation details belong in Documentation Scope.
Collaborator:
Referring to DOCS again...

Contributor Author:
See reply on decompose.md.

- Open questions must have owners and impact statements.

## Escalation