Commit 0eb5a9e
feat(tests): routing-recall benchmark for tool-selection quality (v0.11.4)
Adds tests/routing_bench.rs — 20 natural-language queries mapped to oracle
tool expectations, executed via Claude API against live ToolRegistry schemas.
Turns "does Claude Code invoke the right tool?" from vibe-check into P@1.
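As a rough illustration of the metric named above, P@1 here can be read as the fraction of queries whose first selected tool matches the oracle's expected tool. The sketch below assumes that reading; `OracleCase`, `precision_at_1`, and the `selected` slice are hypothetical names, not taken from tests/routing_bench.rs.

```rust
/// One benchmark case: a natural-language query and the tool the oracle expects.
/// Hypothetical shape; the real test file may model this differently.
pub struct OracleCase {
    pub query: &'static str,
    pub expected_tool: &'static str,
}

/// Precision at 1: the share of cases whose first selected tool equals the
/// oracle's expected tool. `selected[i]` is the first tool chosen for
/// `cases[i]`, or `None` when no tool was called (counted as a miss).
pub fn precision_at_1(cases: &[OracleCase], selected: &[Option<&str>]) -> f64 {
    assert_eq!(cases.len(), selected.len(), "one selection per case");
    if cases.is_empty() {
        return 0.0;
    }
    let hits = cases
        .iter()
        .zip(selected)
        .filter(|(case, got)| **got == Some(case.expected_tool))
        .count();
    hits as f64 / cases.len() as f64
}
```

Under this definition, 14 hits out of 20 queries gives exactly 0.70, the lowest score that would still pass a `>= 0.70` assertion.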
- oracle_well_formed (default) — asserts every oracle entry references a
real tool AND every registered tool has at least one oracle query; catches
drift when tools are renamed or added.
- routing_recall_benchmark (#[ignore]) — requires ANTHROPIC_API_KEY; runs
20 queries through claude-sonnet-4-6, prints per-miss detail, asserts
P@1 >= 0.70. Cost ~$0.10/run.
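The two-way check described for oracle_well_formed could look roughly like the following. This is a minimal sketch under assumed types: the tool names come in as plain string slices and the oracle as `(query, expected_tool)` pairs, and `oracle_problems` is a hypothetical helper, not the function in the commit.

```rust
use std::collections::BTreeSet;

/// Returns a description of each well-formedness problem; an empty list means
/// the oracle and the registry agree. Hypothetical helper for illustration.
pub fn oracle_problems(registered: &[&str], oracle: &[(&str, &str)]) -> Vec<String> {
    let registered_set: BTreeSet<&str> = registered.iter().copied().collect();
    let covered: BTreeSet<&str> = oracle.iter().map(|(_, tool)| *tool).collect();
    let mut problems = Vec::new();

    // Direction 1: every oracle entry must reference a registered tool
    // (catches renamed or removed tools).
    for (query, tool) in oracle {
        if !registered_set.contains(tool) {
            problems.push(format!("query {query:?} expects unknown tool {tool:?}"));
        }
    }

    // Direction 2: every registered tool needs at least one oracle query
    // (catches newly added tools with no coverage).
    for tool in registered_set.difference(&covered) {
        problems.push(format!("tool {tool:?} has no oracle query"));
    }

    problems
}
```

A default test could then assert that the returned list is empty and print its contents on failure, so a rename or an uncovered new tool shows up without any API call.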
Run locally:
    ANTHROPIC_API_KEY=sk-... cargo test --test routing_bench -- --ignored --nocapture
New dev-dep: reqwest 0.12 with blocking + rustls-tls (no OpenSSL pull-in).
CI wiring intentionally deferred — add a gated step when release strategy is
settled.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 3d0cef6 commit 0eb5a9e
4 files changed, +652 -1 lines changed