bench(planner): Haiku 4.5 vs Sonnet 4.6 plan-quality A/B #38
Live 4-axis benchmark (parse, schema, entity-preservation, plan-eval) over 10 goals × 2 models × 3 trials with a Sonnet judge. Pass criterion: Haiku within -5pt of Sonnet on every axis. Includes a hard cost-cap safety. Verdict: FAIL. Haiku saves 47.6%/call but regresses -11.4pt on entity preservation and -7pt on plan-eval, exceeding the -5pt threshold on two axes. Recommendation logged: hold planner default at Sonnet 4.6 until a per-goal router is added.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
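The pass criterion above can be sketched as a small threshold check. This is only an illustration: the names `failing_axes` and `verdict` are hypothetical and not taken from `benchmarks/planner_quality_ab.py`, and it assumes a delta of exactly -5pt still counts as "within" the threshold. Only the two deltas reported in this PR are used; the parse and schema deltas are not stated here, so they are left out.

```python
# Hypothetical sketch of the pass criterion; not the benchmark's actual code.
THRESHOLD_PT = -5.0  # "Haiku within -5pt of Sonnet on every axis"


def failing_axes(deltas_pt: dict[str, float], threshold_pt: float = THRESHOLD_PT) -> list[str]:
    """Return axes whose Haiku-minus-Sonnet delta (in points) is worse than the threshold."""
    # Assumption: a delta exactly equal to the threshold still passes.
    return [axis for axis, delta in deltas_pt.items() if delta < threshold_pt]


def verdict(deltas_pt: dict[str, float]) -> str:
    """PASS only if no axis regresses beyond the threshold."""
    return "FAIL" if failing_axes(deltas_pt) else "PASS"


# Deltas reported in this PR (the other two axes are not given in the text).
reported = {"entity_preserved": -11.4, "plan_eval": -7.0}
print(verdict(reported), failing_axes(reported))
# FAIL ['entity_preserved', 'plan_eval']
```

A single axis beyond the threshold is enough to fail, which is why the verdict does not depend on the unreported parse and schema deltas.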
- Promote Unreleased CHANGELOG entry to v0.6.3 (2026-05-22), rolling up 7 PRs since v0.6.2: TD-195 router (#40), plan-evaluator fix (#39), Haiku A/B (#38), Haiku pricing fix (#37), cost-sim harness (#36), TD-189 step 5 CLI (#35).
- Bump pyproject + main.py + version consistency test + uv.lock.
- Refresh CLAUDE.md version banner to 0.6.3 / 2026-05-22.
Summary
- benchmarks/planner_quality_ab.py: live 4-axis A/B comparing Haiku 4.5 and Sonnet 4.6 as the LLMPlanner model.
- Axes: parse_success, schema_valid, entity_preserved (programmatic distinctive-token presence), plan_eval (Sonnet-judged via LLMPlanEvaluator).

Live result (60 planner + 60 evaluator calls, $0.629)
Cost: Sonnet $0.01375/call vs Haiku $0.00721/call = 47.6% saving (planner-only saving matches the prior simulation's ~66.7%; the table above includes Sonnet judge cost).
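The 47.6% figure and the call count follow from the numbers quoted in this PR. A minimal check, where the constant and function names are illustrative only:

```python
# Per-call costs as quoted in the PR description (these include Sonnet judge cost).
SONNET_COST_PER_CALL = 0.01375
HAIKU_COST_PER_CALL = 0.00721


def saving_pct(baseline: float, candidate: float) -> float:
    """Relative saving of candidate over baseline, in percent."""
    return (baseline - candidate) / baseline * 100


# 10 goals x 2 models x 3 trials, as stated in the commit message.
planner_calls = 10 * 2 * 3

print(planner_calls)  # 60
print(f"{saving_pct(SONNET_COST_PER_CALL, HAIKU_COST_PER_CALL):.1f}%")  # 47.6%
```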
Verdict: FAIL. Haiku regresses beyond −5pt on two axes. Recommendation: hold planner default at Sonnet 4.6 until a per-goal router (cheap classifier → Haiku for generic-tech / English-only, Sonnet for entity-heavy / multilingual) is added. Full analysis pinned to memory haiku_planner_ab_2026_05_19.md.

Follow-up bug noticed but not addressed here: LLMPlanEvaluator._parse_evaluation raised 'list' object has no attribute 'get' 3 times when the model returned a list-of-lists. Falls back gracefully; tracked separately.

Test plan
- uv run --extra dev ruff check benchmarks/planner_quality_ab.py

🤖 Generated with Claude Code
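For the follow-up bug noted above, one possible defensive shape is to look for the first dict inside whatever JSON the model returns before calling `.get` on it. This is a sketch under assumptions, not the actual `LLMPlanEvaluator._parse_evaluation` or its eventual fix: the function names and the `score` key in the usage lines are hypothetical.

```python
import json
from typing import Any, Optional


def first_dict(payload: Any) -> Optional[dict]:
    """Return the first dict found in payload, descending through nested lists."""
    if isinstance(payload, dict):
        return payload
    if isinstance(payload, list):
        for item in payload:
            found = first_dict(item)
            if found is not None:
                return found
    return None


def parse_evaluation(raw: str) -> Optional[dict]:
    """Parse a judge response; return None so the caller can fall back."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # A list-of-lists no longer reaches `.get`; it is unwrapped or rejected here.
    return first_dict(payload)


print(parse_evaluation('[[{"score": 0.8}]]'))  # {'score': 0.8}
print(parse_evaluation('[[1, 2]]'))            # None
print(parse_evaluation('not json'))            # None
```

Returning `None` instead of raising keeps the graceful fallback described above, while making the list-of-lists case explicit rather than an `AttributeError`.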