Skip to content

feat(routing): Goal Classifier Router for planner model selection (TD-195)#40

Merged
engkimo merged 19 commits into
mainfrom
feature/goal-classifier-router
May 21, 2026
Merged

feat(routing): Goal Classifier Router for planner model selection (TD-195)#40
engkimo merged 19 commits into
mainfrom
feature/goal-classifier-router

Conversation

@engkimo

@engkimo engkimo commented May 20, 2026

Copy link
Copy Markdown
Owner

Summary

  • Per-goal routing of LLMPlanner between Sonnet 4.6 and Haiku 4.5 via a pure-LLM GoalClassifierPort, behind the new MORPHIC_PLANNER_ROUTER flag (default disabled — production routing unchanged on merge).
  • Two interchangeable classifier adapters share a byte-identical SYSTEM_PROMPT (TD-190 stable-prefix invariant): LLMGoalClassifier (Anthropic Haiku 4.5 via LiteLLM) and LocalGoalClassifier (qwen3:8b via Ollama, $0 ops).
  • Privacy: raw goal never serialized — GoalClassified events carry sha256(goal)[:16] only. Fail-safe: classifier timeout / parse error → PlannerModel.SONNET (never to Haiku).

Live A/B verdict (3 arms × 10 goals × 3 trials, $0.97 total, dump at docs/benchmarks/planner_ab_router_2026_05_20.json):

metric Sonnet 4.6 Haiku 4.5 Router Δ vs Sonnet Acceptance
entity_preserved 83.8% 71.3% 81.3% −2.5pt ✓ (≥−5pt)
plan_eval 0.898 0.838 0.884 −0.014 ✓ (≥−0.030)
avg cost / call $0.01351 $0.00715 $0.01218 −9.85%

Captured-saving: 20.9% (paper bar 30% missed). Documented in TD-195 as a workload-mix structural effect of the entity-stressed benchmark (6/10 goals carry CJK / quoted entities → classifier correctly routes them to Sonnet at confidence ≥0.9). The router Pareto-dominates the Sonnet baseline (cheaper at near-identical quality).

Spec, plan, memo

Test plan

  • `uv run --extra dev pytest tests/unit/` → 3360 passed
  • `uv run --extra dev ruff check .` → clean
  • Domain layer purity (rg "from (sqlalchemy|...|infrastructure|application|interface)" domain/...) → 0 hits
  • `rg "from infrastructure.routing" application/` → 0 hits
  • T141 local live test (qwen3:8b) → 3/3 PASS, $0
  • T142 remote live test (Haiku 4.5) → 3/3 PASS, ≤$0.003
  • T145 regression guard (`MORPHIC_PLANNER_ROUTER=disabled` + planner unit tests) → 28/28 PASS
  • T144 spec-compliance audit (privacy / fail-safe / KV-cache) → PASS via `morphic-pr-reviewer`

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Optional per-goal planner router that routes tasks between Sonnet and Haiku, with CLI flags to enable router and tune acceptance/cost thresholds; benchmark runs now include router metadata and chosen-model attribution.
  • Documentation

    • ADR, specs, changelog, continuation, and environment-variable docs updated with router behavior and new env vars for enabling and tuning the router.
  • Tests

    • Extensive unit and integration tests added for classifier, router, observability, metrics, and planner integration.

Review Change Stack

engkimo and others added 18 commits May 19, 2026 23:00
SDD pilot 3 — spec.md / plan.md / tasks.md generated. Targets the
−11.4pt entity_preserved regression from the 2026-05-19 Haiku A/B
while retaining ≥30% of the 47.6% cost saving on the eligible slice.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ied event

Three pure-domain VOs for the goal-classifier-router pilot:
- `PlannerModel` (Sonnet/Haiku enum) with `to_gateway_id()`
- `GoalClassification` (Pydantic VO, confidence/latency/cost bounded)
- `GoalClassified` discriminated-union variant added to DebateEvent

Privacy: GoalClassified carries `goal_hash` (sha256[:16]) only — raw
goal is never serialized into events. Existing council events untouched
and round-trip-compatible.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Async port: `classify(goal: str) -> GoalClassification`. Empty or
whitespace goal raises ValueError. Pure abstract; impls go in
infrastructure/routing/ (LLM + Ollama).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…lClassifier (Ollama)

LLMGoalClassifier targets Anthropic Haiku 4.5 via LLMGateway.
LocalGoalClassifier targets Ollama qwen3:8b via LLMGateway (LiteLLM routes
locally); cost_usd forced to 0.0 since compute is local-only. Both adapters
share SYSTEM_PROMPT + parse_classification from _prompts.py so the contract
stays byte-identical for fair A/B and stable LiteLLM prompt-cache windows
(NFR-5 / TD-190). Parser tolerates qwen3 <think> blocks and fenced JSON.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
LLMPlanner accepts an optional PlannerModelRouter. When present it is
consulted per goal and the chosen PlannerModel is resolved to its
gateway id (claude-haiku-4-5-20251001 / claude-sonnet-4-6) which is
passed to LLMGateway.complete. When the router is None (the default),
behavior is byte-identical to the pre-router planner — the existing
constructor-time ``model`` argument is honored unchanged.

Verified: 8 new router-integration tests, 41 pre-existing planner tests
all still pass. TD-190 stable system prefix preserved (SYSTEM_PROMPT is
byte-identical across Haiku/Sonnet routes).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Build LLMGoalClassifier (Haiku 4.5) when anthropic_api_key is set,
LocalGoalClassifier (Ollama qwen3:8b) otherwise. Inject the resulting
PlannerModelRouter into LLMPlanner only when planner_router_mode=enabled;
otherwise pass router=None for byte-identical pre-router behavior.
TD-195 router observability layer:
- RouterMetrics: dependency-free Prometheus-style counters (decisions_total
  by (model, reason_category)) + latency_ms histogram buffer. Cardinality
  bounded at 12 by AD-3 closed ReasonCategory set.
- RouterObservingEventBus: EventBusPort decorator that taps GoalClassified
  events for metrics + one INFO log line per decision (goal_hash,
  chosen_model, reason_category, classifier_latency_ms, classifier_cost_usd)
  and forwards all events to the inner bus. Privacy: only sha256[:16] hash
  is ever logged, raw goal never carried.

20 new tests, 3360 unit tests total, 0 regressions, ruff clean.
…tBus wiring

- T110: LocalGoalClassifier live (qwen3:8b, $0) — 3 AD-3 buckets all PASS
- T111: LLMGoalClassifier live (Haiku 4.5) — 3 AD-3 buckets all PASS
- Refine SYSTEM_PROMPT with quoted-entity clarification + 3 few-shot
  examples (byte-identical across local/remote per TD-190)
- Wire RouterObservingEventBus + RouterMetrics into AppContainer so
  production picks up metrics + structured logs
- Cost ceiling per Haiku call relaxed to <=$0.001 (observed ~$0.0007)

Verified: 3,360 unit + 6 live integration green, 0 regressions.
Adds a 3rd arm to planner_quality_ab.py that calls PlannerModelRouter
once per goal, then runs the planner with the router-chosen model.

Reports:
- Router-gated mean vs Sonnet baseline (entity_preserved, plan_eval)
- Captured-saving ratio = (Sonnet − Router) / (Sonnet − Haiku-only)
- Per-goal routing breakdown (HAIKU vs SONNET counts)

Acceptance thresholds (configurable):
- entity_preserved Δ >= -5pt vs Sonnet
- plan_eval Δ >= -0.030 vs Sonnet
- captured-saving >= 30%

Default mode (no --router) unchanged.
Run: --router --trials 3, 190 LLM calls, $0.97 total.

Quality PASS:
  entity_preserved -2.5pt (>= -5pt threshold)
  plan_eval -0.014 (>= -0.030 threshold)

Captured-saving 20.9% (< 30% threshold) — structural property of
this benchmark mix (6/10 goals carry entities → Sonnet, 4/10 Haiku).
Router still strictly Pareto-dominates Sonnet at lower cost; defect is
in benchmark composition, not router logic. Memo:
memory/planner_router_ab_2026_05_20.md recommends shipping.
- TECH_DECISIONS.md: TD-195 "Goal Classifier Router for Planner Model
  Selection" — decision/rationale/consequences/follow-ups. Captures
  the 2026-05-20 live A/B verdict (entity_preserved -2.5pt, plan_eval
  -0.014, cost -9.85% / call) and explains the 20.9% captured-saving
  vs 30% paper bar as workload-mix structural, not router defect.
- ENV_VARS.md: 3 new vars under "Planner Model Router (v0.6.3, TD-195)"
  — MORPHIC_PLANNER_ROUTER (disabled|remote|local),
  MORPHIC_PLANNER_ROUTER_HAIKU_CONFIDENCE_THRESHOLD (default 0.7),
  MORPHIC_PLANNER_ROUTER_CLASSIFIER_TIMEOUT_MS (default 5000).
- CONTINUATION.md: new Sprint 91 (TD-195) section at top with branch
  HEAD, live A/B numbers, and the memo pointer.
- Wrap two integration-test docstring run-commands across lines so they
  fit the 100-char limit (E501).
- Apply ruff import-sort (I001) on auto-fix for the same files.
- Replace blind `Exception` in test_config_router with `ValidationError`
  (B017); the test asserts pydantic rejects an invalid enum value, and
  the specific type matches that contract.

Ruff: All checks passed.
- ENV_VARS.md: MORPHIC_PLANNER_ROUTER is `disabled | enabled`
  (remote/local is an auto-selected DI choice, not an env value).
- ENV_VARS.md: CLASSIFIER_TIMEOUT_MS default is 1500, not 5000
  (matches shared/config.py:172 and plan.md).
- TECH_DECISIONS.md TD-195: rewrite the opt-in sentence to reflect the
  actual env contract + DI-time adapter selection.
@coderabbitai

coderabbitai Bot commented May 20, 2026

Copy link
Copy Markdown

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 67c0c9fa-862b-46e0-b718-9bd68a042a03

📥 Commits

Reviewing files that changed from the base of the PR and between ba605c1 and 4b2c75e.

📒 Files selected for processing (5)
  • docs/CONTINUATION.md
  • docs/ENV_VARS.md
  • domain/value_objects/council_events.py
  • infrastructure/routing/_prompts.py
  • shared/config.py

📝 Walkthrough

Walkthrough

Per-goal planner-model routing added: classifier port/VOs/events, PlannerModelRouter service with confidence/timeouts/fallbacks, shared prompt + parser, remote/local classifier adapters, RouterMetrics and observing event bus, LLMPlanner wiring for per-goal model selection, DI/settings wiring, extended benchmark and JSON output, docs/specs, and comprehensive unit/integration/live tests.

Changes

Goal Classifier Router Feature

Layer / File(s) Summary
Benchmark tooling and results
benchmarks/planner_quality_ab.py, docs/benchmarks/planner_ab_router_2026_05_20.json
Benchmark gains an optional Router arm that classifies each goal, rolls classifier overhead into router trial cost, records chosen underlying model and classifier cost per trial, and emits router-specific acceptance/captured-saving summaries.
Domain value objects and types
domain/value_objects/planner_model.py, domain/value_objects/goal_classification.py, domain/value_objects/council_events.py
Adds PlannerModel enum with to_gateway_id(), frozen GoalClassification VO, and GoalClassified event variant with ReasonCategory, hashed goal id, chosen_model, and classifier metadata.
Domain port and router service
domain/ports/goal_classifier.py, domain/services/planner_model_router.py
Introduces GoalClassifierPort and PlannerModelRouter that invokes a classifier under timeout, gates HAiku by confidence threshold, demotes low-confidence HAiku to SONNET, falls back to SONNET on classifier failure/timeout, normalizes reason categories, and publishes GoalClassified events (best-effort).
Shared classifier prompt and parsing
infrastructure/routing/_prompts.py
Adds stable SYSTEM_PROMPT and parse_classification with think/fence stripping, balanced-brace extraction, JSON decoding, model coercion, field validation/truncation, and ClassificationParseError.
Goal classifier adapters
infrastructure/routing/llm_goal_classifier.py, infrastructure/routing/local_goal_classifier.py
LLMGoalClassifier calls Anthropic Haiku model and returns parsed GoalClassification with latency/cost; LocalGoalClassifier calls Ollama local model and returns parsed GoalClassification with cost forced to 0.0.
Router observability and metrics
infrastructure/metrics/router_metrics.py, infrastructure/observability/router_observer.py
RouterMetrics records decision counters by (model, reason_category) and latency samples; RouterObservingEventBus records metrics and logs structured INFO for GoalClassified events while forwarding to inner EventBus.
LLMPlanner router integration
infrastructure/fractal/llm_planner.py
LLMPlanner accepts optional router in constructor; per-goal _resolve_model_id(goal) queries router and converts chosen PlannerModel to gateway model id for LLMGateway.complete.
Configuration and DI wiring
shared/config.py, interface/api/container.py
Settings adds planner_router_mode, planner_router_haiku_confidence_threshold, and planner_router_classifier_timeout_ms; AppContainer._build_planner_router() wires local vs remote classifier, wraps event bus for router observation, and constructs PlannerModelRouter when enabled.
Test fakes & unit tests (domain, parsing, adapters, router, observability, integration, container)
tests/unit/..., tests/integration/...
Adds InMemoryGoalClassifier fake, FakeLLMGateway/FakeOllamaGateway, comprehensive unit tests for VOs, port, parsing, adapters, PlannerModelRouter behavior (including reason-category normalization, timeouts, privacy), RouterMetrics and RouterObservingEventBus tests, planner+router integration tests, container wiring tests, settings env-var parsing tests, and live integration tests that skip when backends absent.
Specification and docs
specs/goal-classifier-router/*, docs/CHANGELOG.md, docs/CONTINUATION.md, docs/ENV_VARS.md, docs/TECH_DECISIONS.md, infrastructure/routing/__init__.py
Adds feature spec, implementation plan, tasks, changelog entry, continuation handoff, ENV_VARS docs for router settings, TD-195 ADR, and a routing package docstring.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant LLMPlanner
  participant PlannerModelRouter
  participant GoalClassifier
  participant LLMGateway
  participant EventBus

  Client->>LLMPlanner: submit goal
  LLMPlanner->>PlannerModelRouter: select_for(goal)
  PlannerModelRouter->>GoalClassifier: classify(goal)
  GoalClassifier->>PlannerModelRouter: GoalClassification (model, confidence, latency, cost)
  PlannerModelRouter->>EventBus: publish GoalClassified(goal_hash, chosen_model, metadata)
  PlannerModelRouter->>LLMPlanner: chosen PlannerModel
  LLMPlanner->>LLMGateway: complete(system+user, model=chosen_model_id)
  LLMGateway-->>LLMPlanner: completion (candidates)
  LLMPlanner-->>Client: plans
Loading

Poem

🐰 A router hops through goals with grace,
Choosing Haiku fast or Sonnet for pace—
Each choice hashed, no raw goal shown,
Costs and confidence gently known.
Tiny classifier, big plan cheer!

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/goal-classifier-router

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 18

🧹 Nitpick comments (5)
tests/unit/infrastructure/observability/test_router_observer.py (1)

26-29: ⚡ Quick win

Remove type: ignore and type reason_category as ReasonCategory.

The helper currently bypasses the closed-set contract with str + ignore. Using ReasonCategory keeps test fixtures aligned with production typing.

♻️ Proposed fix
 from domain.value_objects.council_events import (
     DebateStarted,
     GoalClassified,
+    ReasonCategory,
 )
@@
 def _goal_classified(
@@
-    reason_category: str = "generic_tech_english",
+    reason_category: ReasonCategory = "generic_tech_english",
@@
-        reason_category=reason_category,  # type: ignore[arg-type]
+        reason_category=reason_category,

Also applies to: 40-50

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/infrastructure/observability/test_router_observer.py` around lines
26 - 29, The test helper currently types reason_category as a plain str with a
type: ignore; instead import and use the production enum/type ReasonCategory and
remove the ignore to keep fixtures aligned with runtime types. Update the
imports in test_router_observer.py to include ReasonCategory (alongside
DebateStarted and GoalClassified) and change the helper variable/parameter named
reason_category to be typed as ReasonCategory (remove the # type: ignore).
Ensure any test fixtures or factory calls that construct reason_category use a
valid ReasonCategory member.
infrastructure/metrics/router_metrics.py (1)

46-51: ⚡ Quick win

Prevent mutation of internal metric state via public properties.

Returning internal defaultdict/list exposes mutable state, so callers can modify counters/latency outside record(). Prefer defensive copies (or immutable views).

♻️ Proposed fix
 from collections import defaultdict
 from collections.abc import Mapping
+from types import MappingProxyType
@@
     `@property`
     def decisions_total(self) -> Mapping[tuple[str, str], int]:
-        return self._decisions
+        return MappingProxyType(dict(self._decisions))
@@
     `@property`
     def latency_samples(self) -> list[int]:
-        return self._latency
+        return list(self._latency)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infrastructure/metrics/router_metrics.py` around lines 46 - 51, The public
accessors decisions_total and latency_samples expose internal mutable state
(_decisions and _latency); change them to return defensive copies (e.g., a
shallow copy of the mapping and a copy of the list or an immutable view) so
callers cannot mutate internal counters or latency samples; locate the methods
decisions_total and latency_samples and modify their return values to return
copies (or tuples/frozendict if available) instead of returning _decisions and
_latency directly, leaving record() and internal mutation logic unchanged.
tests/unit/infrastructure/routing/test_prompts_parser.py (1)

110-115: ⚡ Quick win

Add a regression test for reason values containing braces.

Please add a case like "reason": "token {date}" to ensure valid JSON with brace characters inside strings keeps parsing successfully.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/infrastructure/routing/test_prompts_parser.py` around lines 110 -
115, Add a regression test that ensures JSON-encoded reasons containing brace
characters parse correctly: create a new test (e.g.,
test_reason_with_braces_parses) or extend
test_long_reason_is_truncated_not_rejected to set raw =
f'{{"model":"haiku","confidence":0.9,"reason":"token {{date}}"}}' (or similar
with braces) and call parse_classification(raw, latency_ms=10, cost_usd=0.0),
then assert result.reason contains the expected string (and/or length
constraints). Reference parse_classification and the raw variable so the test
verifies parsing of braces inside the "reason" field.
benchmarks/planner_quality_ab.py (1)

438-441: ⚡ Quick win

Normalize classifier overhead precision before writing trial rows.

At Line 439, raw float carry-through creates artifacts like 0.0006969999999999999 in dumped benchmark records. Round once, then reuse the rounded value for classifier_cost_usd and cost_usd.

Proposed patch
-                    row.classifier_cost_usd = cls_cost
+                    row.classifier_cost_usd = round(cls_cost, 6)
                     # Roll the per-goal classifier overhead into the router cost.
-                    row.cost_usd = round(row.cost_usd + cls_cost, 6)
+                    row.cost_usd = round(row.cost_usd + row.classifier_cost_usd, 6)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/planner_quality_ab.py` around lines 438 - 441, The code currently
assigns the raw float cls_cost to row.classifier_cost_usd and then adds the
unrounded cls_cost to row.cost_usd, causing floating-point artifacts; round
cls_cost once (e.g., rounded_cls_cost = round(cls_cost, 6)) and then set
row.classifier_cost_usd = rounded_cls_cost and row.cost_usd = round(row.cost_usd
+ rounded_cls_cost, 6) so both stored values use the same normalized precision;
update the lines handling cls_cost, row.classifier_cost_usd and row.cost_usd
accordingly.
tests/unit/domain/test_council_events_goal_classified.py (1)

38-50: ⚡ Quick win

Test does not actually prove discriminator immutability

The method name says “fixed”, but this only checks the default serialized value. Add a negative-path assertion that invalid kind is rejected at validation time.

Suggested test tightening
+import pytest
+from pydantic import ValidationError
@@
     def test_kind_discriminator_is_fixed(self) -> None:
@@
         dumped = ev.model_dump()
         assert dumped["kind"] == "goal_classified"
+
+        with pytest.raises(ValidationError):
+            GoalClassified.model_validate(
+                {
+                    "kind": "debate_started",
+                    "goal_hash": _hash16(),
+                    "chosen_model": PlannerModel.SONNET,
+                    "confidence": 0.5,
+                    "reason_category": "low_confidence",
+                    "classifier_latency_ms": 180,
+                    "classifier_cost_usd": 0.0002,
+                }
+            )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/domain/test_council_events_goal_classified.py` around lines 38 -
50, The test test_kind_discriminator_is_fixed currently only checks the
serialized default; extend it to assert that the discriminator is immutable by
attempting to create a GoalClassified with an invalid kind and expecting a
validation failure: call the GoalClassified constructor (or model_validate) with
kind set to an incorrect string and assert that pydantic.errors.ValidationError
(or pydantic.ValidationError) is raised; keep the existing positive assertion
(dumped["kind"] == "goal_classified") and add this negative-path assertion to
prove the Literal discriminator rejects invalid values.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/planner_quality_ab.py`:
- Around line 504-506: The JSON dump call using
Path(args.dump).write_text(json.dumps(dump_payload, indent=2,
ensure_ascii=False)) produces non-deterministic key ordering; update the call to
include sort_keys=True so json.dumps(dump_payload, indent=2, ensure_ascii=False,
sort_keys=True) to produce deterministic, stable output for dump_payload and
reduce diff noise.

In `@docs/CONTINUATION.md`:
- Around line 27-28: The doc entry for MORPHIC_PLANNER_ROUTER is incorrect:
update the description to use the config contract values `disabled|enabled`
(default `disabled`, opt-in `enabled`) instead of `remote`/`local`; ensure the
text for MORPHIC_PLANNER_ROUTER in CONTINUATION.md matches the spec and other
environment docs and adjust any nearby examples or usage guidance to reference
`enabled` rather than `remote`.

In `@docs/ENV_VARS.md`:
- Line 51: The doc line for MORPHIC_PLANNER_ROUTER incorrectly states adapter
selection depends only on ANTHROPIC_API_KEY; update the description to explain
that selection uses the MORPHIC_PLANNER_ROUTER setting plus the LOCAL_FIRST flag
and runtime budget signals: when LOCAL_FIRST is true the local adapter is
preferred, otherwise if ANTHROPIC_API_KEY is present the remote adapter is used
unless budget signals force a local fallback, and if no key is present the local
adapter is used; mention these three determinants (MORPHIC_PLANNER_ROUTER,
ANTHROPIC_API_KEY, LOCAL_FIRST) and include a brief note about budget-triggered
overrides.

In `@docs/TECH_DECISIONS.md`:
- Around line 8126-8127: Update the TD reference for the cache metric: locate
the text "cache_hit_rate (TD-189)" and replace TD-189 with the correct decision
ID that documents the cache_hit_rate work (e.g., "cache_hit_rate (TD-190)");
ensure the token "cache_hit_rate" and the old "(TD-189)" are the ones changed so
traceability points to the correct TD entry.

In `@domain/value_objects/council_events.py`:
- Line 74: The goal_hash Field currently only enforces length and can accept
non-hex values; tighten validation on the goal_hash attribute in the
CouncilEvent value object by requiring a 16-character lowercase hex string
(e.g., add a regex constraint or a Pydantic validator for goal_hash such as
validating against ^[0-9a-f]{16}$) so it matches the sha256(goal)[:16] contract;
update the Field/validator on the goal_hash symbol in
domain/value_objects/council_events.py accordingly and ensure error messages
reflect invalid hex input.
- Around line 66-80: GoalClassified is currently a mutable Pydantic model
(inherits BaseModel) but must be immutable; update the class definition for
GoalClassified to make it a frozen/immutable Pydantic v2 model by adding the
Pydantic config (e.g. model_config = {"frozen": True}) inside the class so
instances cannot be mutated after creation; keep all existing fields (goal_hash,
chosen_model, confidence, reason_category, classifier_latency_ms,
classifier_cost_usd, classified_at) and their validators intact while ensuring
immutability for the domain value object.

In `@infrastructure/routing/__init__.py`:
- Around line 1-6: This module's first line is the module docstring but is
missing the required future import; add the line "from __future__ import
annotations" as the very first statement in infrastructure.routing.__init__.py
(before the module docstring) so the file complies with the repo-wide rule; no
other code changes needed—the rest of the module (LLMGoalClassifier,
LocalGoalClassifier, and the GoalClassifierPort-compatible definitions) can
remain unchanged.

In `@infrastructure/routing/_prompts.py`:
- Around line 64-83: The current _extract_json_blob uses the brittle
_FIRST_OBJECT_RE = re.compile(r"\{.*?\}", re.DOTALL) which can stop at a brace
inside a quoted string and cause false ClassificationParseError; replace the
regex-based extraction in _extract_json_blob with a JSON-aware scan: find the
first '{' in cleaned (after applying _THINK_BLOCK_RE and _JSON_FENCE_RE
handling), then iterate char-by-char tracking nesting depth, an in_string flag
and escape characters to find the matching closing brace for the top-level
object, and return that substring; keep raising ClassificationParseError when no
opening brace is found or the scan ends without closing the object, and preserve
existing references to _THINK_BLOCK_RE, _JSON_FENCE_RE and
ClassificationParseError.

In `@shared/config.py`:
- Around line 167-174: The two config fields
planner_router_haiku_confidence_threshold and
planner_router_classifier_timeout_ms accept invalid values; update their Field
definitions to enforce bounds (e.g., set
planner_router_haiku_confidence_threshold to accept only 0.0–1.0 using le=1.0
and ge=0.0, and set planner_router_classifier_timeout_ms to require a positive
integer using gt=0 or ge=1) while preserving their validation_alias names so
invalid runtime values raise validation errors instead of silently
misconfiguring routing.

In `@specs/goal-classifier-router/plan.md`:
- Line 185: The markdown contains a fenced code block that opens with triple
backticks (```) but omits a language hint; update that opening fence to include
a language identifier (for example change ``` to ```text or ```markdown) so the
block is lint-compliant and satisfies markdownlint rules for fenced code block
language specification.

In `@specs/goal-classifier-router/spec.md`:
- Around line 130-133: The markdown table row containing the cell
"sqlalchemy|fastapi|litellm|..." is being parsed as extra columns; update that
cell to escape the pipe characters or wrap the entire regex in backticks and
escape the pipes (e.g., use `sqlalchemy\|fastapi\|litellm\|...`) so the table
renders as a single cell and MD056 is avoided; locate the table row with the
"Framework imports in new domain files" metric and apply this change.

In `@specs/goal-classifier-router/tasks.md`:
- Line 43: The markdown line using sha256(goal)[:16] is being parsed as
reversed-link syntax; update the text so the literal is wrapped in inline code
backticks (e.g., `sha256(goal)[:16]`) where `goal_hash` is described to prevent
MD011 markdownlint errors and ensure the raw goal isn't exposed.
- Line 129: The fenced code block opener currently uses plain backticks (```);
add a language tag (for example ```text) to the code fence so the markdown
linter (MD040) passes and formatting remains stable—update the triple-backtick
opener for the block starting at the current fence to include a language
identifier such as text or bash.

In `@tests/integration/test_goal_classifier_local_live.py`:
- Around line 122-123: The assertion serializes the event with
event.model_dump_json() without deterministic ordering; change the serialization
to use a deterministic JSON dump (e.g., call
event.model_dump_json(sort_keys=True) or otherwise ensure JSON is produced with
sort_keys=True) so the payload variable is stable for the privacy assertion;
update the same pattern in test_goal_classifier_remote_live.py as well.

In `@tests/integration/test_goal_classifier_remote_live.py`:
- Around line 113-114: The assertion uses non-deterministic model_dump_json();
change it to produce deterministic JSON by importing json and serializing the
model dump with json.dumps(..., sort_keys=True) (e.g., replace
event.model_dump_json() with json.dumps(event.model_dump(), sort_keys=True)) so
the payload string is stable for the privacy check that compares goal against
the GoalClassified payload.

In `@tests/unit/domain/test_council_events_goal_classified.py`:
- Line 60: The test uses non-deterministic JSON serialization via raw =
original.model_dump_json(); change these calls to pass deterministic sorting
(e.g., original.model_dump_json(sort_keys=True)) so keys are ordered
consistently; update the three occurrences referenced (the call that assigns raw
at the shown location and the other two occurrences at the same file) to include
sort_keys=True on model_dump_json or, if that method is unavailable in your
model class, replace with json.dumps(original.model_dump(), sort_keys=True).

In `@tests/unit/infrastructure/observability/test_router_observer.py`:
- Around line 9-10: Update the module docstring in
tests/unit/infrastructure/observability/test_router_observer.py to remove the
incorrect claim that inner-bus errors are "swallowed"; instead state that events
are forwarded to the inner bus and that errors from the inner bus are propagated
(this aligns with the test asserting propagation). Reference the module-level
docstring near the RouterObserver/router_observer tests and change the phrase
"Forwards the event to the inner bus (best-effort, errors swallowed)." to
language indicating that inner-bus errors are propagated.

In `@tests/unit/infrastructure/test_llm_planner_router_integration.py`:
- Around line 52-63: The JSON payload helper currently returns json.dumps([...])
without deterministic ordering; modify the call to json.dumps in the helper that
returns the test payload so it passes sort_keys=True (e.g., change json.dumps([
... ]) to json.dumps([ ... ], sort_keys=True)) to ensure deterministic
serialization per repo policy.

---

Nitpick comments:
In `@benchmarks/planner_quality_ab.py`:
- Around line 438-441: The code currently assigns the raw float cls_cost to
row.classifier_cost_usd and then adds the unrounded cls_cost to row.cost_usd,
causing floating-point artifacts; round cls_cost once (e.g., rounded_cls_cost =
round(cls_cost, 6)) and then set row.classifier_cost_usd = rounded_cls_cost and
row.cost_usd = round(row.cost_usd + rounded_cls_cost, 6) so both stored values
use the same normalized precision; update the lines handling cls_cost,
row.classifier_cost_usd and row.cost_usd accordingly.

In `@infrastructure/metrics/router_metrics.py`:
- Around line 46-51: The public accessors decisions_total and latency_samples
expose internal mutable state (_decisions and _latency); change them to return
defensive copies (e.g., a shallow copy of the mapping and a copy of the list or
an immutable view) so callers cannot mutate internal counters or latency
samples; locate the methods decisions_total and latency_samples and modify their
return values to return copies (or tuples/frozendict if available) instead of
returning _decisions and _latency directly, leaving record() and internal
mutation logic unchanged.

In `@tests/unit/domain/test_council_events_goal_classified.py`:
- Around line 38-50: The test test_kind_discriminator_is_fixed currently only
checks the serialized default; extend it to assert that the discriminator is
immutable by attempting to create a GoalClassified with an invalid kind and
expecting a validation failure: call the GoalClassified constructor (or
model_validate) with kind set to an incorrect string and assert that
pydantic.errors.ValidationError (or pydantic.ValidationError) is raised; keep
the existing positive assertion (dumped["kind"] == "goal_classified") and add
this negative-path assertion to prove the Literal discriminator rejects invalid
values.

In `@tests/unit/infrastructure/observability/test_router_observer.py`:
- Around line 26-29: The test helper currently types reason_category as a plain
str with a type: ignore; instead import and use the production enum/type
ReasonCategory and remove the ignore to keep fixtures aligned with runtime
types. Update the imports in test_router_observer.py to include ReasonCategory
(alongside DebateStarted and GoalClassified) and change the helper
variable/parameter named reason_category to be typed as ReasonCategory (remove
the # type: ignore). Ensure any test fixtures or factory calls that construct
reason_category use a valid ReasonCategory member.

In `@tests/unit/infrastructure/routing/test_prompts_parser.py`:
- Around line 110-115: Add a regression test that ensures JSON-encoded reasons
containing brace characters parse correctly: create a new test (e.g.,
test_reason_with_braces_parses) or extend
test_long_reason_is_truncated_not_rejected to set raw =
f'{{"model":"haiku","confidence":0.9,"reason":"token {{date}}"}}' (or similar
with braces) and call parse_classification(raw, latency_ms=10, cost_usd=0.0),
then assert result.reason contains the expected string (and/or length
constraints). Reference parse_classification and the raw variable so the test
verifies parsing of braces inside the "reason" field.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d41bf214-f0bd-41f9-a503-a665a7b3e483

📥 Commits

Reviewing files that changed from the base of the PR and between 706c89d and ba605c1.

📒 Files selected for processing (47)
  • benchmarks/planner_quality_ab.py
  • docs/CHANGELOG.md
  • docs/CONTINUATION.md
  • docs/ENV_VARS.md
  • docs/TECH_DECISIONS.md
  • docs/benchmarks/planner_ab_router_2026_05_20.json
  • domain/ports/goal_classifier.py
  • domain/services/planner_model_router.py
  • domain/value_objects/council_events.py
  • domain/value_objects/goal_classification.py
  • domain/value_objects/planner_model.py
  • infrastructure/fractal/llm_planner.py
  • infrastructure/metrics/__init__.py
  • infrastructure/metrics/router_metrics.py
  • infrastructure/observability/__init__.py
  • infrastructure/observability/router_observer.py
  • infrastructure/routing/__init__.py
  • infrastructure/routing/_prompts.py
  • infrastructure/routing/llm_goal_classifier.py
  • infrastructure/routing/local_goal_classifier.py
  • interface/api/container.py
  • shared/config.py
  • specs/goal-classifier-router/plan.md
  • specs/goal-classifier-router/spec.md
  • specs/goal-classifier-router/tasks.md
  • tests/integration/test_goal_classifier_local_live.py
  • tests/integration/test_goal_classifier_remote_live.py
  • tests/unit/application/_fakes/in_memory_goal_classifier.py
  • tests/unit/application/_fakes/test_in_memory_goal_classifier.py
  • tests/unit/domain/test_council_events_goal_classified.py
  • tests/unit/domain/test_goal_classification.py
  • tests/unit/domain/test_goal_classifier_port.py
  • tests/unit/domain/test_planner_model.py
  • tests/unit/domain/test_planner_model_router.py
  • tests/unit/infrastructure/metrics/__init__.py
  • tests/unit/infrastructure/metrics/test_router_metrics.py
  • tests/unit/infrastructure/observability/__init__.py
  • tests/unit/infrastructure/observability/test_router_observer.py
  • tests/unit/infrastructure/routing/__init__.py
  • tests/unit/infrastructure/routing/test_llm_goal_classifier.py
  • tests/unit/infrastructure/routing/test_local_goal_classifier.py
  • tests/unit/infrastructure/routing/test_prompts_parser.py
  • tests/unit/infrastructure/test_llm_planner_router_integration.py
  • tests/unit/interface/api/__init__.py
  • tests/unit/interface/api/test_container_router_wiring.py
  • tests/unit/interface/test_fractal_container_wiring.py
  • tests/unit/shared/test_config_router.py

Comment on lines 504 to 506
Path(args.dump).write_text(
json.dumps(
{
"judge": JUDGE,
"trials": args.trials,
"rows": [r.__dict__ for r in rows],
"summary": {
"sonnet": sonnet_sum.__dict__,
"haiku": haiku_sum.__dict__,
},
"total_cost_usd": round(total_cost, 6),
},
indent=2,
ensure_ascii=False,
)
json.dumps(dump_payload, indent=2, ensure_ascii=False)
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add deterministic key ordering to benchmark JSON dumps.

At Lines 504-506, serialization omits sort_keys=True, which makes dump ordering unstable across runs and creates avoidable diff noise.

Proposed patch
         Path(args.dump).write_text(
-            json.dumps(dump_payload, indent=2, ensure_ascii=False)
+            json.dumps(dump_payload, indent=2, ensure_ascii=False, sort_keys=True)
         )

As per coding guidelines, JSON/YAML serialization must use sort_keys=True for deterministic output.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/planner_quality_ab.py` around lines 504 - 506, The JSON dump call
using Path(args.dump).write_text(json.dumps(dump_payload, indent=2,
ensure_ascii=False)) produces non-deterministic key ordering; update the call to
include sort_keys=True so json.dumps(dump_payload, indent=2, ensure_ascii=False,
sort_keys=True) to produce deterministic, stable output for dump_payload and
reduce diff noise.

Comment thread docs/CONTINUATION.md Outdated
Comment thread docs/ENV_VARS.md Outdated
Comment thread docs/TECH_DECISIONS.md
Comment on lines +8126 to +8127
existing observability dashboard alongside `cache_hit_rate` (TD-189) so
that classifier latency, decisions_total, and fallback rate are

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the TD reference for cache_hit_rate follow-up.

Line 8126 references cache_hit_rate (TD-189), but this doc does not include TD-189; cache-hit-rate work appears under TD-188/TD-190/TD-193. Please correct the TD number to avoid traceability confusion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/TECH_DECISIONS.md` around lines 8126 - 8127, Update the TD reference for
the cache metric: locate the text "cache_hit_rate (TD-189)" and replace TD-189
with the correct decision ID that documents the cache_hit_rate work (e.g.,
"cache_hit_rate (TD-190)"); ensure the token "cache_hit_rate" and the old
"(TD-189)" are the ones changed so traceability points to the correct TD entry.

Comment thread domain/value_objects/council_events.py
Comment on lines +122 to +123
payload = event.model_dump_json()
assert goal not in payload, "Raw goal leaked into GoalClassified payload"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

# First, find the file and check if it exists
find . -name "test_goal_classifier_local_live.py" -type f

Repository: engkimo/open-morphic

Length of output: 119


🏁 Script executed:

# Check the specific lines mentioned in the review
git ls-files | grep -i "test_goal_classifier_local_live"

Repository: engkimo/open-morphic

Length of output: 117


🏁 Script executed:

# Get file size to understand if we can read it fully
wc -l tests/integration/test_goal_classifier_local_live.py 2>/dev/null || echo "File not found"

Repository: engkimo/open-morphic

Length of output: 121


🏁 Script executed:

cat -n tests/integration/test_goal_classifier_local_live.py

Repository: engkimo/open-morphic

Length of output: 5352


🏁 Script executed:

# Check if there's a remote live test file mentioned in review
find . -name "*goal_classifier*remote*" -o -name "*goal_classifier*live*" | head -20

Repository: engkimo/open-morphic

Length of output: 175


🏁 Script executed:

# Search for other uses of model_dump_json() in test files to see patterns
rg "model_dump_json\(\)" tests/ -A 2 -B 2 --max-count=10

Repository: engkimo/open-morphic

Length of output: 4076


🏁 Script executed:

# Check if json module is imported elsewhere in test files for serialization
rg "json\.dumps.*sort_keys" . --type py | head -20

Repository: engkimo/open-morphic

Length of output: 638


🏁 Script executed:

# Verify Pydantic v2 model_dump_json() signature to confirm it doesn't support sort_keys
rg "model_dump_json" . --type py -B 2 -A 2 | grep -E "(sort_keys|indent)" | head -10

Repository: engkimo/open-morphic

Length of output: 46


🏁 Script executed:

# Check test_council_events_goal_classified.py to see how it handles similar privacy checks
sed -n '50,80p' tests/unit/domain/test_council_events_goal_classified.py

Repository: engkimo/open-morphic

Length of output: 1366


🏁 Script executed:

# Check the remote live test file for the same pattern
sed -n '120,125p' tests/integration/test_goal_classifier_remote_live.py

Repository: engkimo/open-morphic

Length of output: 46


🏁 Script executed:

# Check if there are imports of json module in these test files
head -50 tests/integration/test_goal_classifier_local_live.py | grep -E "^import|^from"

Repository: engkimo/open-morphic

Length of output: 909


Use deterministic JSON serialization in the privacy assertion.

The payload serialization should explicitly use sort_keys=True to comply with the coding guideline requiring deterministic JSON output.

Proposed fix
+import json
...
-    payload = event.model_dump_json()
+    payload = json.dumps(
+        event.model_dump(mode="json"),
+        sort_keys=True,
+        ensure_ascii=False,
+    )

Per the coding guidelines, all JSON serialization across **/*.py (including tests) must use sort_keys=True for deterministic output. The same issue exists in test_goal_classifier_remote_live.py and should be fixed identically.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
payload = event.model_dump_json()
assert goal not in payload, "Raw goal leaked into GoalClassified payload"
import json
...
payload = json.dumps(
event.model_dump(mode="json"),
sort_keys=True,
ensure_ascii=False,
)
assert goal not in payload, "Raw goal leaked into GoalClassified payload"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/test_goal_classifier_local_live.py` around lines 122 - 123,
The assertion serializes the event with event.model_dump_json() without
deterministic ordering; change the serialization to use a deterministic JSON
dump (e.g., call event.model_dump_json(sort_keys=True) or otherwise ensure JSON
is produced with sort_keys=True) so the payload variable is stable for the
privacy assertion; update the same pattern in
test_goal_classifier_remote_live.py as well.

Comment on lines +113 to +114
payload = event.model_dump_json()
assert goal not in payload, "Raw goal leaked into GoalClassified payload"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

# First, locate the file and verify it exists
find . -name "test_goal_classifier_remote_live.py" -type f

Repository: engkimo/open-morphic

Length of output: 120


🏁 Script executed:

# Read the file to check the exact contents at lines 113-114 and surrounding context
cat -n ./tests/integration/test_goal_classifier_remote_live.py | head -120 | tail -20

Repository: engkimo/open-morphic

Length of output: 972


🏁 Script executed:

# Check the imports and the beginning of the file, including line 20
cat -n ./tests/integration/test_goal_classifier_remote_live.py | head -30

Repository: engkimo/open-morphic

Length of output: 1209


🏁 Script executed:

# Check for all JSON serialization patterns in the file
cat -n ./tests/integration/test_goal_classifier_remote_live.py | grep -E "model_dump|json|JSON"

Repository: engkimo/open-morphic

Length of output: 109


🏁 Script executed:

# Let's check the broader context of the test and also verify if there are other similar patterns in related test files
# First check if there's a similar test file (local version)
find . -name "*goal_classifier*" -type f | grep -E "\.py$"

Repository: engkimo/open-morphic

Length of output: 613


🏁 Script executed:

# Check the local test file for comparison - look for similar JSON serialization patterns
grep -n "model_dump_json\|json.dumps" ./tests/integration/test_goal_classifier_local_live.py | head -20

Repository: engkimo/open-morphic

Length of output: 106


🏁 Script executed:

# Check if there are examples in the codebase where sort_keys=True is used
rg "sort_keys\s*=\s*True" --type py -A 2 -B 2

Repository: engkimo/open-morphic

Length of output: 2857


🏁 Script executed:

# Check the Pydantic version and model_dump usage in the codebase
rg "model_dump" --type py -B 1 -A 1 | head -40

Repository: engkimo/open-morphic

Length of output: 3044


🏁 Script executed:

# Check if json is already imported in the file
grep -n "^import json" ./tests/integration/test_goal_classifier_remote_live.py

Repository: engkimo/open-morphic

Length of output: 46


🏁 Script executed:

# Check the full import section of the file
cat -n ./tests/integration/test_goal_classifier_remote_live.py | head -50

Repository: engkimo/open-morphic

Length of output: 2039


🏁 Script executed:

# Verify the exact test function and full context around the issue
cat -n ./tests/integration/test_goal_classifier_remote_live.py | sed -n '85,120p'

Repository: engkimo/open-morphic

Length of output: 1538


Add deterministic JSON serialization to the privacy invariant assertion.

The test payload serialization at line 113 must use sort_keys=True per the coding guideline requiring deterministic JSON output. Import json and replace model_dump_json() with an explicit sort.

Proposed fix
+import json
 import os
 
 import pytest
     # Privacy invariant: raw goal MUST NOT appear in the event payload.
-    payload = event.model_dump_json()
+    payload = json.dumps(
+        event.model_dump(mode="json"),
+        sort_keys=True,
+        ensure_ascii=False,
+    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/test_goal_classifier_remote_live.py` around lines 113 -
114, The assertion uses non-deterministic model_dump_json(); change it to
produce deterministic JSON by importing json and serializing the model dump with
json.dumps(..., sort_keys=True) (e.g., replace event.model_dump_json() with
json.dumps(event.model_dump(), sort_keys=True)) so the payload string is stable
for the privacy check that compares goal against the GoalClassified payload.

classifier_latency_ms=1500,
classifier_cost_usd=0.0,
)
raw = original.model_dump_json()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use deterministic JSON serialization in tests

These serializations should explicitly enforce sorted keys to meet deterministic output requirements.

Deterministic serialization patch
-        raw = original.model_dump_json()
+        raw = json.dumps(original.model_dump(mode="json"), sort_keys=True)
@@
-        raw = abandoned.model_dump_json()
+        raw = json.dumps(abandoned.model_dump(mode="json"), sort_keys=True)
@@
-        payload = json.loads(ev.model_dump_json())
+        serialized = json.dumps(ev.model_dump(mode="json"), sort_keys=True)
+        payload = json.loads(serialized)

As per coding guidelines, "JSON/YAML serialization must use sort_keys=True for deterministic output".

Also applies to: 72-72, 86-86

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/domain/test_council_events_goal_classified.py` at line 60, The
test uses non-deterministic JSON serialization via raw =
original.model_dump_json(); change these calls to pass deterministic sorting
(e.g., original.model_dump_json(sort_keys=True)) so keys are ordered
consistently; update the three occurrences referenced (the call that assigns raw
at the shown location and the other two occurrences at the same file) to include
sort_keys=True on model_dump_json or, if that method is unavailable in your
model class, replace with json.dumps(original.model_dump(), sort_keys=True).

Comment on lines +9 to +10
3. Forwards the event to the inner bus (best-effort, errors swallowed).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the error-handling claim in the module docstring.

The docstring says inner-bus errors are swallowed, but this suite validates propagation. Align wording to avoid a misleading contract.

📝 Proposed fix
-3. Forwards the event to the inner bus (best-effort, errors swallowed).
+3. Forwards the event to the inner bus (errors propagate).
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
3. Forwards the event to the inner bus (best-effort, errors swallowed).
3. Forwards the event to the inner bus (errors propagate).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/infrastructure/observability/test_router_observer.py` around lines
9 - 10, Update the module docstring in
tests/unit/infrastructure/observability/test_router_observer.py to remove the
incorrect claim that inner-bus errors are "swallowed"; instead state that events
are forwarded to the inner bus and that errors from the inner bus are propagated
(this aligns with the test asserting propagation). Reference the module-level
docstring near the RouterObserver/router_observer tests and change the phrase
"Forwards the event to the inner bus (best-effort, errors swallowed)." to
language indicating that inner-bus errors are propagated.

Comment on lines +52 to +63
return json.dumps(
[
{
"description": "Step 1: do something",
"is_terminal": True,
"score": 0.9,
"condition": None,
"input_artifacts": {},
"output_artifacts": {},
}
]
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use deterministic JSON serialization in test payload helper.

Add sort_keys=True to keep serialization deterministic per repo policy.

🔧 Proposed fix
 def _sample_payload() -> str:
     return json.dumps(
         [
             {
                 "description": "Step 1: do something",
                 "is_terminal": True,
                 "score": 0.9,
                 "condition": None,
                 "input_artifacts": {},
                 "output_artifacts": {},
             }
-        ]
+        ],
+        sort_keys=True,
     )

As per coding guidelines, "JSON/YAML serialization must use sort_keys=True for deterministic output".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
return json.dumps(
[
{
"description": "Step 1: do something",
"is_terminal": True,
"score": 0.9,
"condition": None,
"input_artifacts": {},
"output_artifacts": {},
}
]
)
return json.dumps(
[
{
"description": "Step 1: do something",
"is_terminal": True,
"score": 0.9,
"condition": None,
"input_artifacts": {},
"output_artifacts": {},
}
],
sort_keys=True,
)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/infrastructure/test_llm_planner_router_integration.py` around
lines 52 - 63, The JSON payload helper currently returns json.dumps([...])
without deterministic ordering; modify the call to json.dumps in the helper that
returns the test payload so it passes sort_keys=True (e.g., change json.dumps([
... ]) to json.dumps([ ... ], sort_keys=True)) to ensure deterministic
serialization per repo policy.

- domain/value_objects/council_events: freeze GoalClassified VO + enforce
  16-hex pattern on goal_hash (immutability + privacy contract).
- infrastructure/routing/_prompts: replace non-greedy regex with balanced
  brace scanner so JSON strings containing '}' parse correctly.
- shared/config: bound planner_router_haiku_confidence_threshold to
  [0.0, 1.0] and planner_router_classifier_timeout_ms to > 0; reject
  silent misconfig.
- docs/CONTINUATION + docs/ENV_VARS: correct opt-in value to `enabled`
  (was stale `remote`/`local`) and document adapter-selection priority
  (LOCAL_FIRST → ANTHROPIC_API_KEY+budget → off).

3,360 unit tests pass; ruff clean.
@engkimo engkimo merged commit 1bcbbe3 into main May 21, 2026
5 of 6 checks passed
@engkimo engkimo deleted the feature/goal-classifier-router branch May 21, 2026 06:01
engkimo added a commit that referenced this pull request May 22, 2026
- specs/goal-classifier-router/tasks.md: mark all 43 boxes complete
  (T001-T154 shipped via PR #40).
- docs/CHANGELOG: enrich Unreleased entry with PR #40, merge date,
  A/B headline numbers, and pointer to the authoritative memo.
engkimo added a commit that referenced this pull request May 22, 2026
- Promote Unreleased CHANGELOG entry to v0.6.3 (2026-05-22), rolling up
  7 PRs since v0.6.2: TD-195 router (#40), plan-evaluator fix (#39),
  Haiku A/B (#38), Haiku pricing fix (#37), cost-sim harness (#36),
  TD-189 step 5 CLI (#35).
- Bump pyproject + main.py + version consistency test + uv.lock.
- Refresh CLAUDE.md version banner to 0.6.3 / 2026-05-22.
engkimo added a commit that referenced this pull request May 22, 2026
Cosmetic-only cleanup of 9 residual Minor items from PR #40 review:

- benchmarks/planner_quality_ab.py: add sort_keys=True to JSON dump
- docs/TECH_DECISIONS.md: fix cache_hit_rate TD ref (TD-189 → TD-193)
- infrastructure/routing/__init__.py: add `from __future__ import annotations`
- specs/goal-classifier-router/plan.md: tag fenced block with `text`
- specs/goal-classifier-router/spec.md: escape pipes in table cell (MD056)
- specs/goal-classifier-router/tasks.md: backtick sha256(...) literal (MD011)
  + language tag on fenced block (MD040)
- tests/unit/domain/test_council_events_goal_classified.py: switch to
  json.dumps(model_dump(mode="json"), sort_keys=True) in 3 spots
- tests/unit/infrastructure/observability/test_router_observer.py: align
  docstring with actual contract (errors propagate, not swallowed)
- tests/unit/infrastructure/test_llm_planner_router_integration.py: add
  sort_keys=True to _sample_payload helper

No behavioral changes. 22 touched unit tests pass; ruff clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant