bench(planner): Haiku 4.5 vs Sonnet 4.6 plan-quality A/B by engkimo · Pull Request #38 · engkimo/open-morphic

engkimo · 2026-05-19T01:49:42Z

Summary

Adds benchmarks/planner_quality_ab.py — live 4-axis A/B comparing Haiku 4.5 and Sonnet 4.6 as the LLMPlanner model.
Axes: parse_success, schema_valid, entity_preserved (programmatic distinctive-token presence), plan_eval (Sonnet-judged via LLMPlanEvaluator).
Pass criterion: Haiku within −5pt of Sonnet on every axis. Hard cost cap (default $1.00) aborts mid-run if exceeded.
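
The entity_preserved axis is described here only as "programmatic distinctive-token presence". As a rough illustration of what such a check might look like, the sketch below assumes a distinctive token is any whitespace-separated word containing a digit, an uppercase letter, or a non-ASCII character. The helper names `distinctive_tokens` and `entity_preserved` and the heuristic itself are hypothetical, not taken from benchmarks/planner_quality_ab.py.

```python
# Hypothetical sketch of a "distinctive-token presence" score.
# The token heuristic below is an assumption, not the benchmark's actual rule.
_PUNCT = ".,;:!?()[]{}\"'"


def distinctive_tokens(goal: str) -> set[str]:
    """Collect words that look like names, versions, or non-English terms."""
    tokens = (raw.strip(_PUNCT) for raw in goal.split())
    return {
        t
        for t in tokens
        if t
        and (
            any(c.isdigit() for c in t)
            or any(c.isupper() for c in t)
            or not t.isascii()
        )
    }


def entity_preserved(goal: str, plan_text: str) -> float:
    """Fraction of the goal's distinctive tokens found verbatim in the plan."""
    wanted = distinctive_tokens(goal)
    if not wanted:
        return 1.0  # nothing distinctive to lose
    found = sum(1 for t in wanted if t in plan_text)
    return found / len(wanted)
```

Note that this crude heuristic also treats a capitalized sentence-initial verb as distinctive; a real implementation would likely filter those out.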

Live result (60 planner + 60 evaluator calls, $0.629)

metric	Sonnet 4.6	Haiku 4.5	Δ	OK?
parse_success	100.0%	100.0%	+0.0pt	✓
schema_valid	100.0%	100.0%	+0.0pt	✓
entity_preserved	85.5%	74.1%	−11.4pt	✗
plan_eval (0–1)	0.904	0.834	−0.070	✗
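
The verdict can be re-derived from the table. The sketch below applies the −5pt criterion to the reported numbers, assuming the 0–1 plan_eval score is rescaled to points (×100) and that a delta of exactly −5pt would still pass; the function name `failing_axes` is hypothetical.

```python
THRESHOLD_PT = -5.0  # Haiku must be within -5pt of Sonnet on every axis

# (Sonnet, Haiku) scores from the table; plan_eval rescaled from 0-1 to points.
RESULTS = {
    "parse_success": (100.0, 100.0),
    "schema_valid": (100.0, 100.0),
    "entity_preserved": (85.5, 74.1),
    "plan_eval": (90.4, 83.4),
}


def failing_axes(results, threshold_pt=THRESHOLD_PT):
    """Return the axes where Haiku trails Sonnet by more than the threshold."""
    return [
        axis
        for axis, (sonnet, haiku) in results.items()
        if haiku - sonnet < threshold_pt
    ]


failed = failing_axes(RESULTS)
print("FAIL" if failed else "PASS", failed)
# prints: FAIL ['entity_preserved', 'plan_eval']
```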

Cost: Sonnet $0.01375/call vs Haiku $0.00721/call = 47.6% saving. Both per-call figures include the Sonnet judge cost; the planner-only saving matches the prior simulation's ~66.7%.
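
A quick arithmetic check of the cost line, assuming the 60 planner calls split evenly into 30 per model (10 goals × 3 trials). Under that assumption the two per-call rates reproduce the reported $0.629 total, which suggests each rate already bundles one judge call.

```python
SONNET_PER_CALL = 0.01375  # USD, as reported
HAIKU_PER_CALL = 0.00721   # USD, as reported
CALLS_PER_MODEL = 30       # assumption: 10 goals x 3 trials per model

saving = 1 - HAIKU_PER_CALL / SONNET_PER_CALL
total = CALLS_PER_MODEL * (SONNET_PER_CALL + HAIKU_PER_CALL)

print(f"saving: {saving:.1%}")  # saving: 47.6%
print(f"total:  ${total:.3f}")  # total:  $0.629
```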

Verdict: FAIL — Haiku regresses beyond −5pt on two axes. Recommendation: hold planner default at Sonnet 4.6 until a per-goal router (cheap classifier → Haiku for generic-tech / English-only, Sonnet for entity-heavy / multilingual) is added. Full analysis pinned to memory haiku_planner_ab_2026_05_19.md.
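
The recommended per-goal router is not part of this PR. Purely as an illustration of the idea, the sketch below routes on two cheap signals: non-ASCII text as a stand-in for "multilingual", and a count of capitalized or numeric words as a stand-in for "entity-heavy". The function names, the threshold, and the model labels are all hypothetical placeholders, not real API model IDs or project code.

```python
HAIKU = "haiku-4.5"    # placeholder label, not a real API model ID
SONNET = "sonnet-4.6"  # placeholder label, not a real API model ID


def is_multilingual(goal: str) -> bool:
    """Crude proxy: any non-ASCII character counts as non-English."""
    return not goal.isascii()


def entity_count(goal: str) -> int:
    """Count words with digits, or uppercase letters past the first word."""
    words = goal.split()
    return sum(
        1
        for i, w in enumerate(words)
        if any(c.isdigit() for c in w)
        or (i > 0 and any(c.isupper() for c in w))
    )


def route_planner_model(goal: str, max_entities: int = 1) -> str:
    """Send entity-heavy or multilingual goals to Sonnet, the rest to Haiku."""
    if is_multilingual(goal) or entity_count(goal) > max_entities:
        return SONNET
    return HAIKU
```

A real classifier would need to be validated against the same four axes before the planner default changes.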

Follow-up bug noticed but not addressed here: LLMPlanEvaluator._parse_evaluation raised AttributeError ('list' object has no attribute 'get') 3 times when the model returned a list of lists. The evaluator falls back gracefully; tracked separately.
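
The error message suggests `.get` was called on a parsed JSON value that turned out to be a list. One defensive pattern, sketched below, is to search the parsed payload for the first dict before reading fields from it. This is not the actual `_parse_evaluation` code; the function names, the `"score"` key, and the fallback value are assumptions made for illustration.

```python
import json
from typing import Any, Optional


def first_dict(payload: Any) -> Optional[dict]:
    """Depth-first search for the first dict inside arbitrarily nested lists."""
    if isinstance(payload, dict):
        return payload
    if isinstance(payload, list):
        for item in payload:
            found = first_dict(item)
            if found is not None:
                return found
    return None


def parse_score(raw: str, default_score: float = 0.0) -> float:
    """Extract a numeric score, falling back instead of raising."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return default_score
    record = first_dict(payload)
    if record is None:
        return default_score  # e.g. a list of lists with no dict inside
    try:
        return float(record.get("score", default_score))
    except (TypeError, ValueError):
        return default_score
```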

Test plan

uv run --extra dev ruff check benchmarks/planner_quality_ab.py
Live run end-to-end ($0.629, 120 calls, exits cleanly with non-zero status on FAIL)
CI green on this PR
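
The hard cost cap and the non-zero exit on FAIL can be pictured roughly as follows. This is a minimal sketch, not the script's actual control flow: the `CostTracker` class, the `exit_code_for` helper, and the specific exit codes 1 and 2 are hypothetical (the PR only states that the run aborts when the cap is exceeded and exits non-zero on FAIL).

```python
import sys


class CostCapExceeded(RuntimeError):
    """Raised as soon as cumulative spend passes the cap."""


class CostTracker:
    def __init__(self, cap_usd: float = 1.00):  # default cap from the PR
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, usd: float) -> None:
        self.spent_usd += usd
        if self.spent_usd > self.cap_usd:
            raise CostCapExceeded(
                f"spent ${self.spent_usd:.3f} > cap ${self.cap_usd:.2f}"
            )


def exit_code_for(call_costs, verdict_passed: bool, cap_usd: float = 1.00) -> int:
    """0 on PASS, 1 on FAIL, 2 if the cost cap aborts the run mid-way."""
    tracker = CostTracker(cap_usd)
    try:
        for cost in call_costs:  # one entry per live API call
            tracker.charge(cost)
    except CostCapExceeded as exc:
        print(f"aborted: {exc}", file=sys.stderr)
        return 2
    return 0 if verdict_passed else 1
```

A script entry point would then pass the result to `sys.exit(...)` so that CI sees the failure.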

🤖 Generated with Claude Code

Live 4-axis benchmark (parse, schema, entity-preservation, plan-eval) over 10 goals × 2 models × 3 trials with Sonnet judge. Pass criterion: Haiku within -5pt of Sonnet on every axis. Includes hard cost-cap safety.

Verdict: FAIL. Haiku saves 47.6%/call but regresses -11.4pt on entity preservation and -7pt on plan-eval, exceeding the -5pt threshold on two axes. Recommendation logged: hold planner default at Sonnet 4.6 until a per-goal router is added.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

coderabbitai · 2026-05-19T01:49:48Z

Warning

Rate limit exceeded

@engkimo has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 1 minute and 55 seconds before requesting another review.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ef2d2668-f9bd-474d-bf6c-145aee7d9825

📥 Commits

Reviewing files that changed from the base of the PR and between 8ccb3f3 and 8556ddc.

📒 Files selected for processing (1)

benchmarks/planner_quality_ab.py

- Promote Unreleased CHANGELOG entry to v0.6.3 (2026-05-22), rolling up 7 PRs since v0.6.2: TD-195 router (#40), plan-evaluator fix (#39), Haiku A/B (#38), Haiku pricing fix (#37), cost-sim harness (#36), TD-189 step 5 CLI (#35).
- Bump pyproject + main.py + version consistency test + uv.lock.
- Refresh CLAUDE.md version banner to 0.6.3 / 2026-05-22.

engkimo merged commit 989eb6c into main May 19, 2026
6 checks passed

engkimo deleted the feat/planner-quality-ab branch May 19, 2026 01:52

engkimo merged 1 commit into main from feat/planner-quality-ab