feat(routing): Goal Classifier Router for planner model selection (TD-195)#40
Conversation
SDD pilot 3 — spec.md / plan.md / tasks.md generated. Targets the −11.4pt entity_preserved regression from the 2026-05-19 Haiku A/B while retaining ≥30% of the 47.6% cost saving on the eligible slice. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ied event Three pure-domain VOs for the goal-classifier-router pilot: - `PlannerModel` (Sonnet/Haiku enum) with `to_gateway_id()` - `GoalClassification` (Pydantic VO, confidence/latency/cost bounded) - `GoalClassified` discriminated-union variant added to DebateEvent Privacy: GoalClassified carries `goal_hash` (sha256[:16]) only — raw goal is never serialized into events. Existing council events untouched and round-trip-compatible. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Async port: `classify(goal: str) -> GoalClassification`. Empty or whitespace goal raises ValueError. Pure abstract; impls go in infrastructure/routing/ (LLM + Ollama). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…lClassifier (Ollama) LLMGoalClassifier targets Anthropic Haiku 4.5 via LLMGateway. LocalGoalClassifier targets Ollama qwen3:8b via LLMGateway (LiteLLM routes locally); cost_usd forced to 0.0 since compute is local-only. Both adapters share SYSTEM_PROMPT + parse_classification from _prompts.py so the contract stays byte-identical for fair A/B and stable LiteLLM prompt-cache windows (NFR-5 / TD-190). Parser tolerates qwen3 <think> blocks and fenced JSON. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
LLMPlanner accepts an optional PlannerModelRouter. When present it is consulted per goal and the chosen PlannerModel is resolved to its gateway id (claude-haiku-4-5-20251001 / claude-sonnet-4-6) which is passed to LLMGateway.complete. When the router is None (the default), behavior is byte-identical to the pre-router planner — the existing constructor-time ``model`` argument is honored unchanged. Verified: 8 new router-integration tests, 41 pre-existing planner tests all still pass. TD-190 stable system prefix preserved (SYSTEM_PROMPT is byte-identical across Haiku/Sonnet routes). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Build LLMGoalClassifier (Haiku 4.5) when anthropic_api_key is set, LocalGoalClassifier (Ollama qwen3:8b) otherwise. Inject the resulting PlannerModelRouter into LLMPlanner only when planner_router_mode=enabled; otherwise pass router=None for byte-identical pre-router behavior.
TD-195 router observability layer: - RouterMetrics: dependency-free Prometheus-style counters (decisions_total by (model, reason_category)) + latency_ms histogram buffer. Cardinality bounded at 12 by AD-3 closed ReasonCategory set. - RouterObservingEventBus: EventBusPort decorator that taps GoalClassified events for metrics + one INFO log line per decision (goal_hash, chosen_model, reason_category, classifier_latency_ms, classifier_cost_usd) and forwards all events to the inner bus. Privacy: only sha256[:16] hash is ever logged, raw goal never carried. 20 new tests, 3360 unit tests total, 0 regressions, ruff clean.
…tBus wiring - T110: LocalGoalClassifier live (qwen3:8b, $0) — 3 AD-3 buckets all PASS - T111: LLMGoalClassifier live (Haiku 4.5) — 3 AD-3 buckets all PASS - Refine SYSTEM_PROMPT with quoted-entity clarification + 3 few-shot examples (byte-identical across local/remote per TD-190) - Wire RouterObservingEventBus + RouterMetrics into AppContainer so production picks up metrics + structured logs - Cost ceiling per Haiku call relaxed to <=$0.001 (observed ~$0.0007) Verified: 3,360 unit + 6 live integration green, 0 regressions.
Adds a 3rd arm to planner_quality_ab.py that calls PlannerModelRouter once per goal, then runs the planner with the router-chosen model. Reports: - Router-gated mean vs Sonnet baseline (entity_preserved, plan_eval) - Captured-saving ratio = (Sonnet − Router) / (Sonnet − Haiku-only) - Per-goal routing breakdown (HAIKU vs SONNET counts) Acceptance thresholds (configurable): - entity_preserved Δ >= -5pt vs Sonnet - plan_eval Δ >= -0.030 vs Sonnet - captured-saving >= 30% Default mode (no --router) unchanged.
Run: --router --trials 3, 190 LLM calls, $0.97 total. Quality PASS: entity_preserved -2.5pt (>= -5pt threshold) plan_eval -0.014 (>= -0.030 threshold) Captured-saving 20.9% (< 30% threshold) — structural property of this benchmark mix (6/10 goals carry entities → Sonnet, 4/10 Haiku). Router still strictly Pareto-dominates Sonnet at lower cost; defect is in benchmark composition, not router logic. Memo: memory/planner_router_ab_2026_05_20.md recommends shipping.
- TECH_DECISIONS.md: TD-195 "Goal Classifier Router for Planner Model Selection" — decision/rationale/consequences/follow-ups. Captures the 2026-05-20 live A/B verdict (entity_preserved -2.5pt, plan_eval -0.014, cost -9.85% / call) and explains the 20.9% captured-saving vs 30% paper bar as workload-mix structural, not router defect. - ENV_VARS.md: 3 new vars under "Planner Model Router (v0.6.3, TD-195)" — MORPHIC_PLANNER_ROUTER (disabled|remote|local), MORPHIC_PLANNER_ROUTER_HAIKU_CONFIDENCE_THRESHOLD (default 0.7), MORPHIC_PLANNER_ROUTER_CLASSIFIER_TIMEOUT_MS (default 5000). - CONTINUATION.md: new Sprint 91 (TD-195) section at top with branch HEAD, live A/B numbers, and the memo pointer.
- Wrap two integration-test docstring run-commands across lines so they fit the 100-char limit (E501). - Apply ruff import-sort (I001) on auto-fix for the same files. - Replace blind `Exception` in test_config_router with `ValidationError` (B017); the test asserts pydantic rejects an invalid enum value, and the specific type matches that contract. Ruff: All checks passed.
- ENV_VARS.md: MORPHIC_PLANNER_ROUTER is `disabled | enabled` (remote/local is an auto-selected DI choice, not an env value). - ENV_VARS.md: CLASSIFIER_TIMEOUT_MS default is 1500, not 5000 (matches shared/config.py:172 and plan.md). - TECH_DECISIONS.md TD-195: rewrite the opt-in sentence to reflect the actual env contract + DI-time adapter selection.
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
📝 WalkthroughWalkthroughPer-goal planner-model routing added: classifier port/VOs/events, PlannerModelRouter service with confidence/timeouts/fallbacks, shared prompt + parser, remote/local classifier adapters, RouterMetrics and observing event bus, LLMPlanner wiring for per-goal model selection, DI/settings wiring, extended benchmark and JSON output, docs/specs, and comprehensive unit/integration/live tests. ChangesGoal Classifier Router Feature
Estimated code review effort 🎯 4 (Complex) | ⏱️ ~60 minutes Sequence Diagram(s)sequenceDiagram
participant Client
participant LLMPlanner
participant PlannerModelRouter
participant GoalClassifier
participant LLMGateway
participant EventBus
Client->>LLMPlanner: submit goal
LLMPlanner->>PlannerModelRouter: select_for(goal)
PlannerModelRouter->>GoalClassifier: classify(goal)
GoalClassifier->>PlannerModelRouter: GoalClassification (model, confidence, latency, cost)
PlannerModelRouter->>EventBus: publish GoalClassified(goal_hash, chosen_model, metadata)
PlannerModelRouter->>LLMPlanner: chosen PlannerModel
LLMPlanner->>LLMGateway: complete(system+user, model=chosen_model_id)
LLMGateway-->>LLMPlanner: completion (candidates)
LLMPlanner-->>Client: plans
Poem
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 18
🧹 Nitpick comments (5)
tests/unit/infrastructure/observability/test_router_observer.py (1)
26-29: ⚡ Quick winRemove
type: ignoreand typereason_categoryasReasonCategory.The helper currently bypasses the closed-set contract with
str+ ignore. UsingReasonCategorykeeps test fixtures aligned with production typing.♻️ Proposed fix
from domain.value_objects.council_events import ( DebateStarted, GoalClassified, + ReasonCategory, ) @@ def _goal_classified( @@ - reason_category: str = "generic_tech_english", + reason_category: ReasonCategory = "generic_tech_english", @@ - reason_category=reason_category, # type: ignore[arg-type] + reason_category=reason_category,Also applies to: 40-50
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/infrastructure/observability/test_router_observer.py` around lines 26 - 29, The test helper currently types reason_category as a plain str with a type: ignore; instead import and use the production enum/type ReasonCategory and remove the ignore to keep fixtures aligned with runtime types. Update the imports in test_router_observer.py to include ReasonCategory (alongside DebateStarted and GoalClassified) and change the helper variable/parameter named reason_category to be typed as ReasonCategory (remove the # type: ignore). Ensure any test fixtures or factory calls that construct reason_category use a valid ReasonCategory member.infrastructure/metrics/router_metrics.py (1)
46-51: ⚡ Quick winPrevent mutation of internal metric state via public properties.
Returning internal
defaultdict/listexposes mutable state, so callers can modify counters/latency outsiderecord(). Prefer defensive copies (or immutable views).♻️ Proposed fix
from collections import defaultdict from collections.abc import Mapping +from types import MappingProxyType @@ `@property` def decisions_total(self) -> Mapping[tuple[str, str], int]: - return self._decisions + return MappingProxyType(dict(self._decisions)) @@ `@property` def latency_samples(self) -> list[int]: - return self._latency + return list(self._latency)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@infrastructure/metrics/router_metrics.py` around lines 46 - 51, The public accessors decisions_total and latency_samples expose internal mutable state (_decisions and _latency); change them to return defensive copies (e.g., a shallow copy of the mapping and a copy of the list or an immutable view) so callers cannot mutate internal counters or latency samples; locate the methods decisions_total and latency_samples and modify their return values to return copies (or tuples/frozendict if available) instead of returning _decisions and _latency directly, leaving record() and internal mutation logic unchanged.tests/unit/infrastructure/routing/test_prompts_parser.py (1)
110-115: ⚡ Quick winAdd a regression test for
reasonvalues containing braces.Please add a case like
"reason": "token {date}"to ensure valid JSON with brace characters inside strings keeps parsing successfully.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/infrastructure/routing/test_prompts_parser.py` around lines 110 - 115, Add a regression test that ensures JSON-encoded reasons containing brace characters parse correctly: create a new test (e.g., test_reason_with_braces_parses) or extend test_long_reason_is_truncated_not_rejected to set raw = f'{{"model":"haiku","confidence":0.9,"reason":"token {{date}}"}}' (or similar with braces) and call parse_classification(raw, latency_ms=10, cost_usd=0.0), then assert result.reason contains the expected string (and/or length constraints). Reference parse_classification and the raw variable so the test verifies parsing of braces inside the "reason" field.benchmarks/planner_quality_ab.py (1)
438-441: ⚡ Quick winNormalize classifier overhead precision before writing trial rows.
At Line 439, raw float carry-through creates artifacts like
0.0006969999999999999in dumped benchmark records. Round once, then reuse the rounded value forclassifier_cost_usdandcost_usd.Proposed patch
- row.classifier_cost_usd = cls_cost + row.classifier_cost_usd = round(cls_cost, 6) # Roll the per-goal classifier overhead into the router cost. - row.cost_usd = round(row.cost_usd + cls_cost, 6) + row.cost_usd = round(row.cost_usd + row.classifier_cost_usd, 6)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/planner_quality_ab.py` around lines 438 - 441, The code currently assigns the raw float cls_cost to row.classifier_cost_usd and then adds the unrounded cls_cost to row.cost_usd, causing floating-point artifacts; round cls_cost once (e.g., rounded_cls_cost = round(cls_cost, 6)) and then set row.classifier_cost_usd = rounded_cls_cost and row.cost_usd = round(row.cost_usd + rounded_cls_cost, 6) so both stored values use the same normalized precision; update the lines handling cls_cost, row.classifier_cost_usd and row.cost_usd accordingly.tests/unit/domain/test_council_events_goal_classified.py (1)
38-50: ⚡ Quick winTest does not actually prove discriminator immutability
The method name says “fixed”, but this only checks the default serialized value. Add a negative-path assertion that invalid
kindis rejected at validation time.Suggested test tightening
+import pytest +from pydantic import ValidationError @@ def test_kind_discriminator_is_fixed(self) -> None: @@ dumped = ev.model_dump() assert dumped["kind"] == "goal_classified" + + with pytest.raises(ValidationError): + GoalClassified.model_validate( + { + "kind": "debate_started", + "goal_hash": _hash16(), + "chosen_model": PlannerModel.SONNET, + "confidence": 0.5, + "reason_category": "low_confidence", + "classifier_latency_ms": 180, + "classifier_cost_usd": 0.0002, + } + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/domain/test_council_events_goal_classified.py` around lines 38 - 50, The test test_kind_discriminator_is_fixed currently only checks the serialized default; extend it to assert that the discriminator is immutable by attempting to create a GoalClassified with an invalid kind and expecting a validation failure: call the GoalClassified constructor (or model_validate) with kind set to an incorrect string and assert that pydantic.errors.ValidationError (or pydantic.ValidationError) is raised; keep the existing positive assertion (dumped["kind"] == "goal_classified") and add this negative-path assertion to prove the Literal discriminator rejects invalid values.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmarks/planner_quality_ab.py`:
- Around line 504-506: The JSON dump call using
Path(args.dump).write_text(json.dumps(dump_payload, indent=2,
ensure_ascii=False)) produces non-deterministic key ordering; update the call to
include sort_keys=True so json.dumps(dump_payload, indent=2, ensure_ascii=False,
sort_keys=True) to produce deterministic, stable output for dump_payload and
reduce diff noise.
In `@docs/CONTINUATION.md`:
- Around line 27-28: The doc entry for MORPHIC_PLANNER_ROUTER is incorrect:
update the description to use the config contract values `disabled|enabled`
(default `disabled`, opt-in `enabled`) instead of `remote`/`local`; ensure the
text for MORPHIC_PLANNER_ROUTER in CONTINUATION.md matches the spec and other
environment docs and adjust any nearby examples or usage guidance to reference
`enabled` rather than `remote`.
In `@docs/ENV_VARS.md`:
- Line 51: The doc line for MORPHIC_PLANNER_ROUTER incorrectly states adapter
selection depends only on ANTHROPIC_API_KEY; update the description to explain
that selection uses the MORPHIC_PLANNER_ROUTER setting plus the LOCAL_FIRST flag
and runtime budget signals: when LOCAL_FIRST is true the local adapter is
preferred, otherwise if ANTHROPIC_API_KEY is present the remote adapter is used
unless budget signals force a local fallback, and if no key is present the local
adapter is used; mention these three determinants (MORPHIC_PLANNER_ROUTER,
ANTHROPIC_API_KEY, LOCAL_FIRST) and include a brief note about budget-triggered
overrides.
In `@docs/TECH_DECISIONS.md`:
- Around line 8126-8127: Update the TD reference for the cache metric: locate
the text "cache_hit_rate (TD-189)" and replace TD-189 with the correct decision
ID that documents the cache_hit_rate work (e.g., "cache_hit_rate (TD-190)");
ensure the token "cache_hit_rate" and the old "(TD-189)" are the ones changed so
traceability points to the correct TD entry.
In `@domain/value_objects/council_events.py`:
- Line 74: The goal_hash Field currently only enforces length and can accept
non-hex values; tighten validation on the goal_hash attribute in the
CouncilEvent value object by requiring a 16-character lowercase hex string
(e.g., add a regex constraint or a Pydantic validator for goal_hash such as
validating against ^[0-9a-f]{16}$) so it matches the sha256(goal)[:16] contract;
update the Field/validator on the goal_hash symbol in
domain/value_objects/council_events.py accordingly and ensure error messages
reflect invalid hex input.
- Around line 66-80: GoalClassified is currently a mutable Pydantic model
(inherits BaseModel) but must be immutable; update the class definition for
GoalClassified to make it a frozen/immutable Pydantic v2 model by adding the
Pydantic config (e.g. model_config = {"frozen": True}) inside the class so
instances cannot be mutated after creation; keep all existing fields (goal_hash,
chosen_model, confidence, reason_category, classifier_latency_ms,
classifier_cost_usd, classified_at) and their validators intact while ensuring
immutability for the domain value object.
In `@infrastructure/routing/__init__.py`:
- Around line 1-6: This module's first line is the module docstring but is
missing the required future import; add the line "from __future__ import
annotations" as the very first statement in infrastructure.routing.__init__.py
(before the module docstring) so the file complies with the repo-wide rule; no
other code changes needed—the rest of the module (LLMGoalClassifier,
LocalGoalClassifier, and the GoalClassifierPort-compatible definitions) can
remain unchanged.
In `@infrastructure/routing/_prompts.py`:
- Around line 64-83: The current _extract_json_blob uses the brittle
_FIRST_OBJECT_RE = re.compile(r"\{.*?\}", re.DOTALL) which can stop at a brace
inside a quoted string and cause false ClassificationParseError; replace the
regex-based extraction in _extract_json_blob with a JSON-aware scan: find the
first '{' in cleaned (after applying _THINK_BLOCK_RE and _JSON_FENCE_RE
handling), then iterate char-by-char tracking nesting depth, an in_string flag
and escape characters to find the matching closing brace for the top-level
object, and return that substring; keep raising ClassificationParseError when no
opening brace is found or the scan ends without closing the object, and preserve
existing references to _THINK_BLOCK_RE, _JSON_FENCE_RE and
ClassificationParseError.
In `@shared/config.py`:
- Around line 167-174: The two config fields
planner_router_haiku_confidence_threshold and
planner_router_classifier_timeout_ms accept invalid values; update their Field
definitions to enforce bounds (e.g., set
planner_router_haiku_confidence_threshold to accept only 0.0–1.0 using le=1.0
and ge=0.0, and set planner_router_classifier_timeout_ms to require a positive
integer using gt=0 or ge=1) while preserving their validation_alias names so
invalid runtime values raise validation errors instead of silently
misconfiguring routing.
In `@specs/goal-classifier-router/plan.md`:
- Line 185: The markdown contains a fenced code block that opens with triple
backticks (```) but omits a language hint; update that opening fence to include
a language identifier (for example change ``` to ```text or ```markdown) so the
block is lint-compliant and satisfies markdownlint rules for fenced code block
language specification.
In `@specs/goal-classifier-router/spec.md`:
- Around line 130-133: The markdown table row containing the cell
"sqlalchemy|fastapi|litellm|..." is being parsed as extra columns; update that
cell to escape the pipe characters or wrap the entire regex in backticks and
escape the pipes (e.g., use `sqlalchemy\|fastapi\|litellm\|...`) so the table
renders as a single cell and MD056 is avoided; locate the table row with the
"Framework imports in new domain files" metric and apply this change.
In `@specs/goal-classifier-router/tasks.md`:
- Line 43: The markdown line using sha256(goal)[:16] is being parsed as
reversed-link syntax; update the text so the literal is wrapped in inline code
backticks (e.g., `sha256(goal)[:16]`) where `goal_hash` is described to prevent
MD011 markdownlint errors and ensure the raw goal isn't exposed.
- Line 129: The fenced code block opener currently uses plain backticks (```);
add a language tag (for example ```text) to the code fence so the markdown
linter (MD040) passes and formatting remains stable—update the triple-backtick
opener for the block starting at the current fence to include a language
identifier such as text or bash.
In `@tests/integration/test_goal_classifier_local_live.py`:
- Around line 122-123: The assertion serializes the event with
event.model_dump_json() without deterministic ordering; change the serialization
to use a deterministic JSON dump (e.g., call
event.model_dump_json(sort_keys=True) or otherwise ensure JSON is produced with
sort_keys=True) so the payload variable is stable for the privacy assertion;
update the same pattern in test_goal_classifier_remote_live.py as well.
In `@tests/integration/test_goal_classifier_remote_live.py`:
- Around line 113-114: The assertion uses non-deterministic model_dump_json();
change it to produce deterministic JSON by importing json and serializing the
model dump with json.dumps(..., sort_keys=True) (e.g., replace
event.model_dump_json() with json.dumps(event.model_dump(), sort_keys=True)) so
the payload string is stable for the privacy check that compares goal against
the GoalClassified payload.
In `@tests/unit/domain/test_council_events_goal_classified.py`:
- Line 60: The test uses non-deterministic JSON serialization via raw =
original.model_dump_json(); change these calls to pass deterministic sorting
(e.g., original.model_dump_json(sort_keys=True)) so keys are ordered
consistently; update the three occurrences referenced (the call that assigns raw
at the shown location and the other two occurrences at the same file) to include
sort_keys=True on model_dump_json or, if that method is unavailable in your
model class, replace with json.dumps(original.model_dump(), sort_keys=True).
In `@tests/unit/infrastructure/observability/test_router_observer.py`:
- Around line 9-10: Update the module docstring in
tests/unit/infrastructure/observability/test_router_observer.py to remove the
incorrect claim that inner-bus errors are "swallowed"; instead state that events
are forwarded to the inner bus and that errors from the inner bus are propagated
(this aligns with the test asserting propagation). Reference the module-level
docstring near the RouterObserver/router_observer tests and change the phrase
"Forwards the event to the inner bus (best-effort, errors swallowed)." to
language indicating that inner-bus errors are propagated.
In `@tests/unit/infrastructure/test_llm_planner_router_integration.py`:
- Around line 52-63: The JSON payload helper currently returns json.dumps([...])
without deterministic ordering; modify the call to json.dumps in the helper that
returns the test payload so it passes sort_keys=True (e.g., change json.dumps([
... ]) to json.dumps([ ... ], sort_keys=True)) to ensure deterministic
serialization per repo policy.
---
Nitpick comments:
In `@benchmarks/planner_quality_ab.py`:
- Around line 438-441: The code currently assigns the raw float cls_cost to
row.classifier_cost_usd and then adds the unrounded cls_cost to row.cost_usd,
causing floating-point artifacts; round cls_cost once (e.g., rounded_cls_cost =
round(cls_cost, 6)) and then set row.classifier_cost_usd = rounded_cls_cost and
row.cost_usd = round(row.cost_usd + rounded_cls_cost, 6) so both stored values
use the same normalized precision; update the lines handling cls_cost,
row.classifier_cost_usd and row.cost_usd accordingly.
In `@infrastructure/metrics/router_metrics.py`:
- Around line 46-51: The public accessors decisions_total and latency_samples
expose internal mutable state (_decisions and _latency); change them to return
defensive copies (e.g., a shallow copy of the mapping and a copy of the list or
an immutable view) so callers cannot mutate internal counters or latency
samples; locate the methods decisions_total and latency_samples and modify their
return values to return copies (or tuples/frozendict if available) instead of
returning _decisions and _latency directly, leaving record() and internal
mutation logic unchanged.
In `@tests/unit/domain/test_council_events_goal_classified.py`:
- Around line 38-50: The test test_kind_discriminator_is_fixed currently only
checks the serialized default; extend it to assert that the discriminator is
immutable by attempting to create a GoalClassified with an invalid kind and
expecting a validation failure: call the GoalClassified constructor (or
model_validate) with kind set to an incorrect string and assert that
pydantic.errors.ValidationError (or pydantic.ValidationError) is raised; keep
the existing positive assertion (dumped["kind"] == "goal_classified") and add
this negative-path assertion to prove the Literal discriminator rejects invalid
values.
In `@tests/unit/infrastructure/observability/test_router_observer.py`:
- Around line 26-29: The test helper currently types reason_category as a plain
str with a type: ignore; instead import and use the production enum/type
ReasonCategory and remove the ignore to keep fixtures aligned with runtime
types. Update the imports in test_router_observer.py to include ReasonCategory
(alongside DebateStarted and GoalClassified) and change the helper
variable/parameter named reason_category to be typed as ReasonCategory (remove
the # type: ignore). Ensure any test fixtures or factory calls that construct
reason_category use a valid ReasonCategory member.
In `@tests/unit/infrastructure/routing/test_prompts_parser.py`:
- Around line 110-115: Add a regression test that ensures JSON-encoded reasons
containing brace characters parse correctly: create a new test (e.g.,
test_reason_with_braces_parses) or extend
test_long_reason_is_truncated_not_rejected to set raw =
f'{{"model":"haiku","confidence":0.9,"reason":"token {{date}}"}}' (or similar
with braces) and call parse_classification(raw, latency_ms=10, cost_usd=0.0),
then assert result.reason contains the expected string (and/or length
constraints). Reference parse_classification and the raw variable so the test
verifies parsing of braces inside the "reason" field.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d41bf214-f0bd-41f9-a503-a665a7b3e483
📒 Files selected for processing (47)
benchmarks/planner_quality_ab.pydocs/CHANGELOG.mddocs/CONTINUATION.mddocs/ENV_VARS.mddocs/TECH_DECISIONS.mddocs/benchmarks/planner_ab_router_2026_05_20.jsondomain/ports/goal_classifier.pydomain/services/planner_model_router.pydomain/value_objects/council_events.pydomain/value_objects/goal_classification.pydomain/value_objects/planner_model.pyinfrastructure/fractal/llm_planner.pyinfrastructure/metrics/__init__.pyinfrastructure/metrics/router_metrics.pyinfrastructure/observability/__init__.pyinfrastructure/observability/router_observer.pyinfrastructure/routing/__init__.pyinfrastructure/routing/_prompts.pyinfrastructure/routing/llm_goal_classifier.pyinfrastructure/routing/local_goal_classifier.pyinterface/api/container.pyshared/config.pyspecs/goal-classifier-router/plan.mdspecs/goal-classifier-router/spec.mdspecs/goal-classifier-router/tasks.mdtests/integration/test_goal_classifier_local_live.pytests/integration/test_goal_classifier_remote_live.pytests/unit/application/_fakes/in_memory_goal_classifier.pytests/unit/application/_fakes/test_in_memory_goal_classifier.pytests/unit/domain/test_council_events_goal_classified.pytests/unit/domain/test_goal_classification.pytests/unit/domain/test_goal_classifier_port.pytests/unit/domain/test_planner_model.pytests/unit/domain/test_planner_model_router.pytests/unit/infrastructure/metrics/__init__.pytests/unit/infrastructure/metrics/test_router_metrics.pytests/unit/infrastructure/observability/__init__.pytests/unit/infrastructure/observability/test_router_observer.pytests/unit/infrastructure/routing/__init__.pytests/unit/infrastructure/routing/test_llm_goal_classifier.pytests/unit/infrastructure/routing/test_local_goal_classifier.pytests/unit/infrastructure/routing/test_prompts_parser.pytests/unit/infrastructure/test_llm_planner_router_integration.pytests/unit/interface/api/__init__.pytests/unit/interface/api/test_container_router_wiring.pytests/unit/interface/test_fractal_container_wiring.pytests/unit/shared/test_config_router.py
| Path(args.dump).write_text( | ||
| json.dumps( | ||
| { | ||
| "judge": JUDGE, | ||
| "trials": args.trials, | ||
| "rows": [r.__dict__ for r in rows], | ||
| "summary": { | ||
| "sonnet": sonnet_sum.__dict__, | ||
| "haiku": haiku_sum.__dict__, | ||
| }, | ||
| "total_cost_usd": round(total_cost, 6), | ||
| }, | ||
| indent=2, | ||
| ensure_ascii=False, | ||
| ) | ||
| json.dumps(dump_payload, indent=2, ensure_ascii=False) | ||
| ) |
There was a problem hiding this comment.
Add deterministic key ordering to benchmark JSON dumps.
At Lines 504-506, serialization omits sort_keys=True, which makes dump ordering unstable across runs and creates avoidable diff noise.
Proposed patch
Path(args.dump).write_text(
- json.dumps(dump_payload, indent=2, ensure_ascii=False)
+ json.dumps(dump_payload, indent=2, ensure_ascii=False, sort_keys=True)
)As per coding guidelines, JSON/YAML serialization must use sort_keys=True for deterministic output.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@benchmarks/planner_quality_ab.py` around lines 504 - 506, The JSON dump call
using Path(args.dump).write_text(json.dumps(dump_payload, indent=2,
ensure_ascii=False)) produces non-deterministic key ordering; update the call to
include sort_keys=True so json.dumps(dump_payload, indent=2, ensure_ascii=False,
sort_keys=True) to produce deterministic, stable output for dump_payload and
reduce diff noise.
| existing observability dashboard alongside `cache_hit_rate` (TD-189) so | ||
| that classifier latency, decisions_total, and fallback rate are |
There was a problem hiding this comment.
Fix the TD reference for cache_hit_rate follow-up.
Line 8126 references cache_hit_rate (TD-189), but this doc does not include TD-189; cache-hit-rate work appears under TD-188/TD-190/TD-193. Please correct the TD number to avoid traceability confusion.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/TECH_DECISIONS.md` around lines 8126 - 8127, Update the TD reference for
the cache metric: locate the text "cache_hit_rate (TD-189)" and replace TD-189
with the correct decision ID that documents the cache_hit_rate work (e.g.,
"cache_hit_rate (TD-190)"); ensure the token "cache_hit_rate" and the old
"(TD-189)" are the ones changed so traceability points to the correct TD entry.
| payload = event.model_dump_json() | ||
| assert goal not in payload, "Raw goal leaked into GoalClassified payload" |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
# First, find the file and check if it exists
find . -name "test_goal_classifier_local_live.py" -type fRepository: engkimo/open-morphic
Length of output: 119
🏁 Script executed:
# Check the specific lines mentioned in the review
git ls-files | grep -i "test_goal_classifier_local_live"Repository: engkimo/open-morphic
Length of output: 117
🏁 Script executed:
# Get file size to understand if we can read it fully
wc -l tests/integration/test_goal_classifier_local_live.py 2>/dev/null || echo "File not found"Repository: engkimo/open-morphic
Length of output: 121
🏁 Script executed:
cat -n tests/integration/test_goal_classifier_local_live.pyRepository: engkimo/open-morphic
Length of output: 5352
🏁 Script executed:
# Check if there's a remote live test file mentioned in review
find . -name "*goal_classifier*remote*" -o -name "*goal_classifier*live*" | head -20Repository: engkimo/open-morphic
Length of output: 175
🏁 Script executed:
# Search for other uses of model_dump_json() in test files to see patterns
rg "model_dump_json\(\)" tests/ -A 2 -B 2 --max-count=10Repository: engkimo/open-morphic
Length of output: 4076
🏁 Script executed:
# Check if json module is imported elsewhere in test files for serialization
rg "json\.dumps.*sort_keys" . --type py | head -20Repository: engkimo/open-morphic
Length of output: 638
🏁 Script executed:
# Verify Pydantic v2 model_dump_json() signature to confirm it doesn't support sort_keys
rg "model_dump_json" . --type py -B 2 -A 2 | grep -E "(sort_keys|indent)" | head -10Repository: engkimo/open-morphic
Length of output: 46
🏁 Script executed:
# Check test_council_events_goal_classified.py to see how it handles similar privacy checks
sed -n '50,80p' tests/unit/domain/test_council_events_goal_classified.pyRepository: engkimo/open-morphic
Length of output: 1366
🏁 Script executed:
# Check the remote live test file for the same pattern
sed -n '120,125p' tests/integration/test_goal_classifier_remote_live.pyRepository: engkimo/open-morphic
Length of output: 46
🏁 Script executed:
# Check if there are imports of json module in these test files
head -50 tests/integration/test_goal_classifier_local_live.py | grep -E "^import|^from"Repository: engkimo/open-morphic
Length of output: 909
Use deterministic JSON serialization in the privacy assertion.
The payload serialization should explicitly use sort_keys=True to comply with the coding guideline requiring deterministic JSON output.
Proposed fix
+import json
...
- payload = event.model_dump_json()
+ payload = json.dumps(
+ event.model_dump(mode="json"),
+ sort_keys=True,
+ ensure_ascii=False,
+ )Per the coding guidelines, all JSON serialization across **/*.py (including tests) must use sort_keys=True for deterministic output. The same issue exists in test_goal_classifier_remote_live.py and should be fixed identically.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| payload = event.model_dump_json() | |
| assert goal not in payload, "Raw goal leaked into GoalClassified payload" | |
| import json | |
| ... | |
| payload = json.dumps( | |
| event.model_dump(mode="json"), | |
| sort_keys=True, | |
| ensure_ascii=False, | |
| ) | |
| assert goal not in payload, "Raw goal leaked into GoalClassified payload" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/test_goal_classifier_local_live.py` around lines 122 - 123,
The assertion serializes the event with event.model_dump_json() without
deterministic ordering; change the serialization to use a deterministic JSON
dump (e.g., call event.model_dump_json(sort_keys=True) or otherwise ensure JSON
is produced with sort_keys=True) so the payload variable is stable for the
privacy assertion; update the same pattern in
test_goal_classifier_remote_live.py as well.
| payload = event.model_dump_json() | ||
| assert goal not in payload, "Raw goal leaked into GoalClassified payload" |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
# First, locate the file and verify it exists
find . -name "test_goal_classifier_remote_live.py" -type fRepository: engkimo/open-morphic
Length of output: 120
🏁 Script executed:
# Read the file to check the exact contents at lines 113-114 and surrounding context
cat -n ./tests/integration/test_goal_classifier_remote_live.py | head -120 | tail -20Repository: engkimo/open-morphic
Length of output: 972
🏁 Script executed:
# Check the imports and the beginning of the file, including line 20
cat -n ./tests/integration/test_goal_classifier_remote_live.py | head -30Repository: engkimo/open-morphic
Length of output: 1209
🏁 Script executed:
# Check for all JSON serialization patterns in the file
cat -n ./tests/integration/test_goal_classifier_remote_live.py | grep -E "model_dump|json|JSON"Repository: engkimo/open-morphic
Length of output: 109
🏁 Script executed:
# Let's check the broader context of the test and also verify if there are other similar patterns in related test files
# First check if there's a similar test file (local version)
find . -name "*goal_classifier*" -type f | grep -E "\.py$"Repository: engkimo/open-morphic
Length of output: 613
🏁 Script executed:
# Check the local test file for comparison - look for similar JSON serialization patterns
grep -n "model_dump_json\|json.dumps" ./tests/integration/test_goal_classifier_local_live.py | head -20Repository: engkimo/open-morphic
Length of output: 106
🏁 Script executed:
# Check if there are examples in the codebase where sort_keys=True is used
rg "sort_keys\s*=\s*True" --type py -A 2 -B 2Repository: engkimo/open-morphic
Length of output: 2857
🏁 Script executed:
# Check the Pydantic version and model_dump usage in the codebase
rg "model_dump" --type py -B 1 -A 1 | head -40Repository: engkimo/open-morphic
Length of output: 3044
🏁 Script executed:
# Check if json is already imported in the file
grep -n "^import json" ./tests/integration/test_goal_classifier_remote_live.pyRepository: engkimo/open-morphic
Length of output: 46
🏁 Script executed:
# Check the full import section of the file
cat -n ./tests/integration/test_goal_classifier_remote_live.py | head -50Repository: engkimo/open-morphic
Length of output: 2039
🏁 Script executed:
# Verify the exact test function and full context around the issue
cat -n ./tests/integration/test_goal_classifier_remote_live.py | sed -n '85,120p'Repository: engkimo/open-morphic
Length of output: 1538
Add deterministic JSON serialization to the privacy invariant assertion.
The test payload serialization at line 113 must use sort_keys=True per the coding guideline requiring deterministic JSON output. Import json and replace model_dump_json() with an explicit sort.
Proposed fix
+import json
import os
import pytest # Privacy invariant: raw goal MUST NOT appear in the event payload.
- payload = event.model_dump_json()
+ payload = json.dumps(
+ event.model_dump(mode="json"),
+ sort_keys=True,
+ ensure_ascii=False,
+ )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/test_goal_classifier_remote_live.py` around lines 113 -
114, The assertion uses non-deterministic model_dump_json(); change it to
produce deterministic JSON by importing json and serializing the model dump with
json.dumps(..., sort_keys=True) (e.g., replace event.model_dump_json() with
json.dumps(event.model_dump(), sort_keys=True)) so the payload string is stable
for the privacy check that compares goal against the GoalClassified payload.
| classifier_latency_ms=1500, | ||
| classifier_cost_usd=0.0, | ||
| ) | ||
| raw = original.model_dump_json() |
There was a problem hiding this comment.
Use deterministic JSON serialization in tests
These serializations should explicitly enforce sorted keys to meet deterministic output requirements.
Deterministic serialization patch
- raw = original.model_dump_json()
+ raw = json.dumps(original.model_dump(mode="json"), sort_keys=True)
@@
- raw = abandoned.model_dump_json()
+ raw = json.dumps(abandoned.model_dump(mode="json"), sort_keys=True)
@@
- payload = json.loads(ev.model_dump_json())
+ serialized = json.dumps(ev.model_dump(mode="json"), sort_keys=True)
+ payload = json.loads(serialized)As per coding guidelines, "JSON/YAML serialization must use sort_keys=True for deterministic output".
Also applies to: 72-72, 86-86
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/domain/test_council_events_goal_classified.py` at line 60, The
test uses non-deterministic JSON serialization via raw =
original.model_dump_json(); change these calls to pass deterministic sorting
(e.g., original.model_dump_json(sort_keys=True)) so keys are ordered
consistently; update the three occurrences referenced (the call that assigns raw
at the shown location and the other two occurrences at the same file) to include
sort_keys=True on model_dump_json or, if that method is unavailable in your
model class, replace with json.dumps(original.model_dump(), sort_keys=True).
| 3. Forwards the event to the inner bus (best-effort, errors swallowed). | ||
|
|
There was a problem hiding this comment.
Fix the error-handling claim in the module docstring.
The docstring says inner-bus errors are swallowed, but this suite validates propagation. Align wording to avoid a misleading contract.
📝 Proposed fix
-3. Forwards the event to the inner bus (best-effort, errors swallowed).
+3. Forwards the event to the inner bus (errors propagate).📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| 3. Forwards the event to the inner bus (best-effort, errors swallowed). | |
| 3. Forwards the event to the inner bus (errors propagate). |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/infrastructure/observability/test_router_observer.py` around lines
9 - 10, Update the module docstring in
tests/unit/infrastructure/observability/test_router_observer.py to remove the
incorrect claim that inner-bus errors are "swallowed"; instead state that events
are forwarded to the inner bus and that errors from the inner bus are propagated
(this aligns with the test asserting propagation). Reference the module-level
docstring near the RouterObserver/router_observer tests and change the phrase
"Forwards the event to the inner bus (best-effort, errors swallowed)." to
language indicating that inner-bus errors are propagated.
| return json.dumps( | ||
| [ | ||
| { | ||
| "description": "Step 1: do something", | ||
| "is_terminal": True, | ||
| "score": 0.9, | ||
| "condition": None, | ||
| "input_artifacts": {}, | ||
| "output_artifacts": {}, | ||
| } | ||
| ] | ||
| ) |
There was a problem hiding this comment.
Use deterministic JSON serialization in test payload helper.
Add sort_keys=True to keep serialization deterministic per repo policy.
🔧 Proposed fix
def _sample_payload() -> str:
return json.dumps(
[
{
"description": "Step 1: do something",
"is_terminal": True,
"score": 0.9,
"condition": None,
"input_artifacts": {},
"output_artifacts": {},
}
- ]
+ ],
+ sort_keys=True,
)As per coding guidelines, "JSON/YAML serialization must use sort_keys=True for deterministic output".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| return json.dumps( | |
| [ | |
| { | |
| "description": "Step 1: do something", | |
| "is_terminal": True, | |
| "score": 0.9, | |
| "condition": None, | |
| "input_artifacts": {}, | |
| "output_artifacts": {}, | |
| } | |
| ] | |
| ) | |
| return json.dumps( | |
| [ | |
| { | |
| "description": "Step 1: do something", | |
| "is_terminal": True, | |
| "score": 0.9, | |
| "condition": None, | |
| "input_artifacts": {}, | |
| "output_artifacts": {}, | |
| } | |
| ], | |
| sort_keys=True, | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/infrastructure/test_llm_planner_router_integration.py` around
lines 52 - 63, The JSON payload helper currently returns json.dumps([...])
without deterministic ordering; modify the call to json.dumps in the helper that
returns the test payload so it passes sort_keys=True (e.g., change json.dumps([
... ]) to json.dumps([ ... ], sort_keys=True)) to ensure deterministic
serialization per repo policy.
- domain/value_objects/council_events: freeze GoalClassified VO + enforce 16-hex pattern on goal_hash (immutability + privacy contract). - infrastructure/routing/_prompts: replace non-greedy regex with balanced brace scanner so JSON strings containing '}' parse correctly. - shared/config: bound planner_router_haiku_confidence_threshold to [0.0, 1.0] and planner_router_classifier_timeout_ms to > 0; reject silent misconfig. - docs/CONTINUATION + docs/ENV_VARS: correct opt-in value to `enabled` (was stale `remote`/`local`) and document adapter-selection priority (LOCAL_FIRST → ANTHROPIC_API_KEY+budget → off). 3,360 unit tests pass; ruff clean.
- Promote Unreleased CHANGELOG entry to v0.6.3 (2026-05-22), rolling up 7 PRs since v0.6.2: TD-195 router (#40), plan-evaluator fix (#39), Haiku A/B (#38), Haiku pricing fix (#37), cost-sim harness (#36), TD-189 step 5 CLI (#35). - Bump pyproject + main.py + version consistency test + uv.lock. - Refresh CLAUDE.md version banner to 0.6.3 / 2026-05-22.
Cosmetic-only cleanup of 9 residual Minor items from PR #40 review: - benchmarks/planner_quality_ab.py: add sort_keys=True to JSON dump - docs/TECH_DECISIONS.md: fix cache_hit_rate TD ref (TD-189 → TD-193) - infrastructure/routing/__init__.py: add `from __future__ import annotations` - specs/goal-classifier-router/plan.md: tag fenced block with `text` - specs/goal-classifier-router/spec.md: escape pipes in table cell (MD056) - specs/goal-classifier-router/tasks.md: backtick sha256(...) literal (MD011) + language tag on fenced block (MD040) - tests/unit/domain/test_council_events_goal_classified.py: switch to json.dumps(model_dump(mode="json"), sort_keys=True) in 3 spots - tests/unit/infrastructure/observability/test_router_observer.py: align docstring with actual contract (errors propagate, not swallowed) - tests/unit/infrastructure/test_llm_planner_router_integration.py: add sort_keys=True to _sample_payload helper No behavioral changes. 22 touched unit tests pass; ruff clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
LLMPlannerbetween Sonnet 4.6 and Haiku 4.5 via a pure-LLMGoalClassifierPort, behind the newMORPHIC_PLANNER_ROUTERflag (defaultdisabled— production routing unchanged on merge).SYSTEM_PROMPT(TD-190 stable-prefix invariant):LLMGoalClassifier(Anthropic Haiku 4.5 via LiteLLM) andLocalGoalClassifier(qwen3:8b via Ollama, $0 ops).GoalClassifiedevents carrysha256(goal)[:16]only. Fail-safe: classifier timeout / parse error →PlannerModel.SONNET(never to Haiku).Live A/B verdict (3 arms × 10 goals × 3 trials, $0.97 total, dump at
docs/benchmarks/planner_ab_router_2026_05_20.json):Captured-saving: 20.9% (paper bar 30% missed). Documented in TD-195 as a workload-mix structural effect of the entity-stressed benchmark (6/10 goals carry CJK / quoted entities → classifier correctly routes them to Sonnet at confidence ≥0.9). The router Pareto-dominates the Sonnet baseline (cheaper at near-identical quality).
Spec, plan, memo
specs/goal-classifier-router/spec.mdspecs/goal-classifier-router/plan.mdspecs/goal-classifier-router/tasks.mdmemory/planner_router_ab_2026_05_20.mddocs/TECH_DECISIONS.mdTD-195Test plan
rg "from (sqlalchemy|...|infrastructure|application|interface)" domain/...) → 0 hits🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Tests