Commit 0eb5a9e

sdsrss and claude committed
feat(tests): routing-recall benchmark for tool-selection quality (v0.11.4)
Adds tests/routing_bench.rs — 20 natural-language queries mapped to oracle tool expectations, executed via Claude API against live ToolRegistry schemas. Turns "does Claude Code invoke the right tool?" from vibe-check into P@1.

- oracle_well_formed (default) — asserts every oracle entry references a real tool AND every registered tool has at least one oracle query; catches drift when tools are renamed or added.
- routing_recall_benchmark (#[ignore]) — requires ANTHROPIC_API_KEY; runs 20 queries through claude-sonnet-4-6, prints per-miss detail, asserts P@1 >= 0.70. Cost ~$0.10/run.

Run locally:

    ANTHROPIC_API_KEY=sk-... cargo test --test routing_bench -- --ignored --nocapture

New dev-dep: reqwest 0.12 with blocking + rustls-tls (no OpenSSL pull-in).

CI wiring intentionally deferred — add a gated step when release strategy is settled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
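The two-way check that `oracle_well_formed` is described as performing can be sketched roughly as follows. This is a minimal illustration, not the code in `tests/routing_bench.rs`: the `OracleEntry` and `Drift` types, the `oracle_drift` function, and the plain slice of tool names standing in for `ToolRegistry` are all assumptions made for the example.

```rust
use std::collections::BTreeSet;

/// One oracle entry: a natural-language query and the tool expected to serve it.
/// Hypothetical shape; the real struct in tests/routing_bench.rs may differ.
struct OracleEntry {
    query: &'static str,
    expected_tool: &'static str,
}

/// Result of comparing the oracle against the registered tool names.
#[derive(Debug, Default, PartialEq)]
struct Drift {
    /// (query, expected_tool) pairs whose tool is not registered.
    unknown: Vec<(String, String)>,
    /// Registered tools that no oracle query targets.
    uncovered: Vec<String>,
}

/// Checks both directions: oracle -> registry and registry -> oracle.
/// `registered` stands in for the names a ToolRegistry would expose (assumption).
fn oracle_drift(oracle: &[OracleEntry], registered: &[&str]) -> Drift {
    let registered_set: BTreeSet<&str> = registered.iter().copied().collect();
    let covered: BTreeSet<&str> = oracle.iter().map(|e| e.expected_tool).collect();

    // Direction 1: every oracle entry must name a registered tool.
    let unknown = oracle
        .iter()
        .filter(|e| !registered_set.contains(e.expected_tool))
        .map(|e| (e.query.to_string(), e.expected_tool.to_string()))
        .collect();

    // Direction 2: every registered tool needs at least one oracle query.
    // BTreeSet::difference yields names in sorted order.
    let uncovered = registered_set
        .difference(&covered)
        .map(|name| name.to_string())
        .collect();

    Drift { unknown, uncovered }
}
```

A test built on this sketch would assert that the returned `Drift` equals `Drift::default()`. Renaming a tool then surfaces in `unknown`, and adding a tool without oracle queries surfaces in `uncovered`, which matches the drift cases the commit message mentions.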
1 parent 3d0cef6 commit 0eb5a9e

4 files changed, +652 -1 lines changed


CHANGELOG.md

Lines changed: 19 additions & 0 deletions
@@ -58,6 +58,25 @@ New module `src/search/acronyms.rs`. `strip_outer_generic` in
 `src/mcp/server/tools.rs`, plus one flat_map augmentation in
 `storage::queries::fts5_search_impl`.
 
+### Routing-recall benchmark (new)
+
+`tests/routing_bench.rs` — turns "does Claude Code naturally call our tools
+for the right intents?" from vibe-check into a P@1 number. 20 oracle queries
+(3 per tool for 6 tools + 2 for `find_references`), each sent to the Claude
+API with the live 7-tool schemas from `ToolRegistry`; asserts the picked
+tool matches the oracle expectation.
+
+- `oracle_well_formed` runs in default `cargo test` and verifies every
+  oracle entry references a real tool *and* every registered tool has at
+  least one oracle query — catches drift when tools are renamed/added.
+- `routing_recall_benchmark` is `#[ignore]` (requires `ANTHROPIC_API_KEY`).
+  Run locally: `ANTHROPIC_API_KEY=sk-... cargo test --test routing_bench -- --ignored --nocapture`.
+  Cost ≈ $0.10/run with `claude-sonnet-4-6` (20 queries × ~1.2K in + ~150 out).
+  Threshold starts at P@1 ≥ 0.70; tighten as descriptions improve.
+- New dev-dep `reqwest` (blocking + rustls-tls, no TLS-OpenSSL pulled in).
+- CI wiring deliberately not added yet — run manually or add a gated step
+  (`env: ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}`) when ready.
+
 ## v0.11.3 — Doc: "hidden but callable" clarified (Claude Code vs. raw MCP)
 
 User-facing: no behavior change; corrects a misleading claim in the adopted
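For readers unfamiliar with the metric, the P@1 score and the 0.70 gate described in the changelog entry could be computed along these lines. This is a sketch under assumptions: `Outcome`, `precision_at_1`, `meets_threshold`, and `P_AT_1_THRESHOLD` are illustrative names, and the API call that produces each picked tool is left out entirely.

```rust
/// Outcome of one benchmark query (hypothetical type): the oracle's tool and the
/// tool the model picked, or None when the model replied without a tool call.
struct Outcome {
    expected: &'static str,
    picked: Option<&'static str>,
}

/// Threshold quoted in the changelog entry (P@1 >= 0.70).
const P_AT_1_THRESHOLD: f64 = 0.70;

/// Precision at 1: the fraction of queries whose picked tool equals the oracle tool.
/// An empty run scores 0.0 rather than dividing by zero.
fn precision_at_1(outcomes: &[Outcome]) -> f64 {
    if outcomes.is_empty() {
        return 0.0;
    }
    let hits = outcomes
        .iter()
        .filter(|o| o.picked == Some(o.expected))
        .count();
    hits as f64 / outcomes.len() as f64
}

/// True when the run clears the benchmark's pass bar.
fn meets_threshold(outcomes: &[Outcome]) -> bool {
    precision_at_1(outcomes) >= P_AT_1_THRESHOLD
}
```

With 20 queries, a threshold of 0.70 tolerates at most 6 misses: 14 hits gives 14/20 = 0.70 and passes, while 13 hits gives 0.65 and fails. Counting a missing tool call as a miss is a design choice of this sketch; the source does not say how the real benchmark treats that case.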
