bench(planner): Haiku 4.5 vs Sonnet 4.6 plan-quality A/B #38

Merged: engkimo merged 1 commit into main from feat/planner-quality-ab on May 19, 2026
engkimo commented May 19, 2026

Summary

  • Adds benchmarks/planner_quality_ab.py — live 4-axis A/B comparing Haiku 4.5 and Sonnet 4.6 as the LLMPlanner model.
  • Axes: parse_success, schema_valid, entity_preserved (programmatic distinctive-token presence), plan_eval (Sonnet-judged via LLMPlanEvaluator).
  • Pass criterion: Haiku within −5pt of Sonnet on every axis. Hard cost cap (default $1.00) aborts mid-run if exceeded.
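
The entity_preserved axis is described only as "programmatic distinctive-token presence", so the following is a minimal sketch of one way such a check could work, not the code in benchmarks/planner_quality_ab.py. The stopword list, the 3-character minimum, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical list of generic words; the benchmark's real notion of
# "distinctive" is not shown in this PR.
GENERIC = {"the", "and", "for", "with", "into", "from", "build", "create", "using"}


def distinctive_tokens(goal):
    """Lower-cased tokens of 3+ word characters that are not generic (assumed heuristic)."""
    tokens = re.findall(r"\w{3,}", goal.lower())
    return {t for t in tokens if t not in GENERIC}


def entity_preserved(goal, plan_text):
    """Fraction of the goal's distinctive tokens that appear verbatim in the plan text."""
    wanted = distinctive_tokens(goal)
    if not wanted:
        return 1.0  # nothing distinctive to preserve
    haystack = plan_text.lower()
    return sum(t in haystack for t in wanted) / len(wanted)


score = entity_preserved(
    "Deploy Grafana with Loki",
    "1. Install grafana. 2. Configure dashboards.",
)
```

Here only "grafana" out of {"deploy", "grafana", "loki"} survives into the plan, so the score is one third. Averaging such per-plan scores over all goals and trials would give a percentage comparable to the one reported below.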

Live result (60 planner + 60 evaluator calls, $0.629)

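| metric | Sonnet 4.6 | Haiku 4.5 | Δ | OK? |
| --- | --- | --- | --- | --- |
| parse_success | 100.0% | 100.0% | +0.0pt | yes |
| schema_valid | 100.0% | 100.0% | +0.0pt | yes |
| entity_preserved | 85.5% | 74.1% | −11.4pt | no |
| plan_eval (0–1) | 0.904 | 0.834 | −0.070 | no |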
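
As a sanity check, the pass criterion can be applied to the reported numbers in a few lines. This is a sketch, not the benchmark's own code: the function names are hypothetical, and plan_eval is scaled from 0–1 to percentage points, which is an assumption consistent with the −7pt quoted in the commit message.

```python
THRESHOLD_PT = -5.0  # Haiku may trail Sonnet by at most 5 points

# (sonnet, haiku) per axis in percentage points, taken from the live result.
# plan_eval is reported on a 0-1 scale and multiplied by 100 here (assumption).
results = {
    "parse_success": (100.0, 100.0),
    "schema_valid": (100.0, 100.0),
    "entity_preserved": (85.5, 74.1),
    "plan_eval": (90.4, 83.4),
}


def axis_deltas(results):
    """Haiku minus Sonnet for each axis, rounded to one decimal place."""
    return {axis: round(haiku - sonnet, 1) for axis, (sonnet, haiku) in results.items()}


def failing_axes(results, threshold=THRESHOLD_PT):
    """Axes on which Haiku falls more than the allowed margin below Sonnet."""
    return [axis for axis, delta in axis_deltas(results).items() if delta < threshold]


deltas = axis_deltas(results)
failed = failing_axes(results)
verdict = "FAIL" if failed else "PASS"
print(deltas, failed, verdict)
```

With these inputs the failing axes are entity_preserved and plan_eval, which matches the two-axis FAIL stated in the verdict.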

Cost: Sonnet $0.01375/call vs Haiku $0.00721/call, a 47.6% saving. These per-call figures include the Sonnet judge cost; the planner-only saving matches the prior simulation's ~66.7%.
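
The cost figures can be cross-checked with simple arithmetic. The sketch below assumes 30 planner calls per model (10 goals × 3 trials, as stated in the commit message) and that each per-call figure already includes the judge call for that plan.

```python
sonnet_per_call = 0.01375  # USD per planner call, from the PR description
haiku_per_call = 0.00721   # USD per planner call, from the PR description
planner_calls_per_model = 30  # 10 goals x 3 trials

# Relative saving of Haiku over Sonnet per call.
saving = 1 - haiku_per_call / sonnet_per_call

# Total spend if both models run the full set of planner calls.
total = planner_calls_per_model * (sonnet_per_call + haiku_per_call)

print(f"saving: {saving:.1%}")  # saving: 47.6%
print(f"total:  ${total:.3f}")  # total:  $0.629
```

Both values agree with the reported 47.6% saving and the $0.629 total for the live run.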

Verdict: FAIL — Haiku regresses beyond −5pt on two axes. Recommendation: hold planner default at Sonnet 4.6 until a per-goal router (cheap classifier → Haiku for generic-tech / English-only, Sonnet for entity-heavy / multilingual) is added. Full analysis pinned to memory haiku_planner_ab_2026_05_19.md.
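
To make the router recommendation concrete, here is a rough sketch of a per-goal router built on cheap heuristics. It is not the router that was later implemented: the model labels are placeholders rather than real API model IDs, and the non-ASCII test, the entity heuristic, and the max_entities threshold are all illustrative assumptions.

```python
HAIKU = "haiku-4.5"    # placeholder label, not a real API model ID
SONNET = "sonnet-4.6"  # placeholder label, not a real API model ID


def is_multilingual(goal):
    """Treat any non-ASCII character as a sign of non-English text (crude assumption)."""
    return not goal.isascii()


def entity_count(goal):
    """Count tokens that look like named entities.

    A token counts if it contains a digit, or if it is capitalized and is
    not the first word of the goal (assumed heuristic).
    """
    count = 0
    for i, tok in enumerate(goal.split()):
        word = tok.strip(".,;:!?()\"'")
        if not word:
            continue
        has_digit = any(c.isdigit() for c in word)
        capitalized = i > 0 and word[0].isupper()
        if has_digit or capitalized:
            count += 1
    return count


def route(goal, max_entities=2):
    """Send entity-heavy or multilingual goals to Sonnet, everything else to Haiku."""
    if is_multilingual(goal) or entity_count(goal) > max_entities:
        return SONNET
    return HAIKU


print(route("write a script that sorts a list"))
print(route("Migrate Acme Corp invoices from SAP to NetSuite by Q3"))
```

A real classifier would need to be validated against the same four axes, since a misrouted entity-heavy goal reintroduces the regression measured above.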

Follow-up bug noticed but not addressed here: LLMPlanEvaluator._parse_evaluation raised 'list' object has no attribute 'get' 3 times when the model returned a list-of-lists. Falls back gracefully; tracked separately.
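
The error message is what Python raises when `.get` is called on a list, so a parser that expects a dict would fail this way on a list-of-lists response. The sketch below shows one defensive approach; it is a hypothetical stand-in rather than the actual LLMPlanEvaluator code, and the "score" key and the default of 0.0 are assumptions.

```python
import json
from typing import Any, Optional


def first_mapping(payload: Any) -> Optional[dict]:
    """Return the first dict found in payload, descending into nested lists."""
    if isinstance(payload, dict):
        return payload
    if isinstance(payload, list):
        for item in payload:
            found = first_mapping(item)
            if found is not None:
                return found
    return None


def parse_evaluation(raw: str, default_score: float = 0.0) -> float:
    """Hypothetical stand-in for a parser that tolerates list-shaped output."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return default_score
    mapping = first_mapping(payload)
    if mapping is None:
        return default_score  # e.g. a list of lists with no dict inside
    score = mapping.get("score", default_score)  # "score" key is an assumption
    return float(score) if isinstance(score, (int, float)) else default_score


print(parse_evaluation('{"score": 0.9}'))
print(parse_evaluation('[[{"score": 0.8}]]'))
print(parse_evaluation('[[1, 2]]'))
```

Checking the payload type before calling `.get` keeps the fallback explicit instead of relying on a caught exception.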

Test plan

  • uv run --extra dev ruff check benchmarks/planner_quality_ab.py
  • Live run end-to-end ($0.629, 120 calls, exits cleanly with non-zero status on FAIL)
  • CI green on this PR
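
The run behaviour exercised by this test plan (a hard cost cap that aborts mid-run, and a non-zero exit status on FAIL) could be structured roughly as follows. This is a sketch under assumptions: the class and function names are hypothetical, and the specific exit codes 1 and 2 are illustrative, since the PR only states that the status is non-zero on FAIL.

```python
import sys


class CostCapExceeded(RuntimeError):
    """Raised as soon as cumulative spend goes over the cap."""


class CostTracker:
    def __init__(self, cap_usd=1.00):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, amount_usd):
        """Record the cost of one call and abort if the cap is exceeded."""
        self.spent_usd += amount_usd
        if self.spent_usd > self.cap_usd:
            raise CostCapExceeded(
                f"spent ${self.spent_usd:.3f} > cap ${self.cap_usd:.2f}"
            )


def run(call_costs, cap_usd=1.00, passed=True):
    """Return an exit code: 0 on PASS, 1 on FAIL, 2 if the cap aborted the run."""
    tracker = CostTracker(cap_usd)
    try:
        for cost in call_costs:
            tracker.charge(cost)
    except CostCapExceeded as exc:
        print(f"aborted: {exc}", file=sys.stderr)
        return 2
    return 0 if passed else 1


# The live run: 30 calls per model at the reported per-call cost, verdict FAIL.
exit_code = run([0.01375] * 30 + [0.00721] * 30, passed=False)
```

A script would pass the returned code to `sys.exit`, so CI can tell a FAIL verdict apart from a run that was cut short by the cap.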

🤖 Generated with Claude Code

Live 4-axis benchmark (parse, schema, entity-preservation, plan-eval)
over 10 goals × 2 models × 3 trials with Sonnet judge. Pass criterion:
Haiku within -5pt of Sonnet on every axis. Includes hard cost-cap safety.

Verdict: FAIL. Haiku saves 47.6%/call but regresses -11.4pt on entity
preservation and -7pt on plan-eval — exceeds the -5pt threshold on two
axes. Recommendation logged: hold planner default at Sonnet 4.6 until a
per-goal router is added.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

engkimo merged commit 989eb6c into main on May 19, 2026 (6 checks passed)
engkimo deleted the feat/planner-quality-ab branch on May 19, 2026 01:52
engkimo added a commit that referenced this pull request May 22, 2026
- Promote Unreleased CHANGELOG entry to v0.6.3 (2026-05-22), rolling up
  7 PRs since v0.6.2: TD-195 router (#40), plan-evaluator fix (#39),
  Haiku A/B (#38), Haiku pricing fix (#37), cost-sim harness (#36),
  TD-189 step 5 CLI (#35).
- Bump pyproject + main.py + version consistency test + uv.lock.
- Refresh CLAUDE.md version banner to 0.6.3 / 2026-05-22.