v0.7.4: malformed-args → tool-error channel + 32GB eval tier by antoinezambelli · Pull Request #102 · antoinezambelli/forge

antoinezambelli · 2026-06-03T05:42:29Z

v0.7.4 — malformed-args → tool-error channel + 32GB eval tier

Two self-scoped changes, plus the dashboard infra to surface the larger models.

Malformed tool-call arguments ride the tool-error channel

A structurally valid tool call whose arguments are unparseable or not an object is now corrected via a tool-error result (role="tool", anchored to its tool_call_id, draining max_tool_errors) — uniformly across all OpenAI-shape clients and all three integration modes (WorkflowRunner, proxy, Guardrails facade). A single decode_tool_args helper (clients/base.py) is the one place args are decoded; shape validation moved to ResponseValidator. Supersedes 0.7.3's "malformed args drive a retry nudge."

Framed as a native-mode conditioning bet, not an ontology claim — a small model plausibly self-corrects better on the channel it was pretrained on. In prompt mode the tool role downgrades to a user message, so behavior there is unchanged. See docs/decisions/016.

ToolCall/TextResponse are now plain dataclasses (args: Any); attribute access + keyword construction unchanged, pydantic .model_* API gone. (Not labeled BREAKING — blast radius is callers who serialized these via pydantic; reserving the badge for forced-migration changes like 0.7.3's --mode rename.)
CheckResult.action gains "tool_error"; proxy exposes --max-tool-errors (default 2); StepTracker.check_prerequisites guards non-dict args.

32GB eval tier + eval-generation dashboard infra

6 models moved Unpublished → Current: Mistral-Small-3.2 24B, Qwen3.5 27B / 35B-A3B, Qwen3.6 27B / 35B-A3B, Nemotron-3 Nano 30B-A3B (Q4, rig-02).
Eval generations: each result row carries a gen comparability epoch (decoupled from version + filename); report.py dedups latest-gen-per-config, badges lagging rows, hides Retired behind a toggle. The tool-arg change was a no-op on these 32GB models (error type didn't surface) → same gen as the 8–14B lineup.

Validation

Unit suite: 1116 passed.
No eval re-run needed — correction paths are an 8–14% minority and regression tests were clean. The 8–14B reasoning-replay re-sweep (gen 3) will exercise the tool-channel path fully.

🤖 Generated with Claude Code

Models occasionally emit a structurally valid tool call with malformed args content (e.g. arguments="" instead of arguments="{}"). Pydantic rejected at ToolCall construction, crashing the stage with ValidationError. Observed at 86% of error rows on Qwen3-Next prompt mode (77/89), same family on Qwen3.6 (rig-02). This is conceptually "tool called with bad args" — the call exists, the inputs are wrong — same as FileNotFoundError at runtime. Should ride the tool-error channel with max_tool_errors=2 budget, not crash. - ToolCall / TextResponse: BaseModel → @DataClass. args is no longer validated at construction; ResponseValidator enforces dict-shape. Audit: no .model_* API on ToolCall anywhere in forge. - ResponseValidator: new args-shape branch after unknown-tool check. Unknown-tool runs first (cheap; no point validating args on a hallucinated tool name). - nudges.tool_arg_validation_nudge: schema-derived message naming the tool, the received args type, and the required JSON-object shape. - inference: parse-error nudges drain max_tool_errors (record_result) not max_retries (record_retry). Message prefix [ToolArgValidationError] vs [UnknownTool]. - Exhaustion message simplified: includes which budget and nudge kind. Smoke (Ministral-3-14B-Reasoning, 26 scenarios × 25 runs prompt-mode): score 78.77% (vs v0.7.0 baseline 79.5% at n=50). Delta within ±1.6% noise band — no regression. Patch never tripped on this model; this is a regression check, full bake on a model that hits the path to follow. 884 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…UD, Nemotron-3-Nano GGUF + server-flag + sampling-default entries for the three rig-02 32GB-tier models added for the v0.7.1 run. Launch config of record for eval_results_rig-02_v0.7.1.jsonl (31,200 rows).

Unify all OpenAI-shape clients on one decode_tool_args helper (clients/base.py): JSON-string args are parsed; malformed or non-dict payloads ride through on the ToolCall as raw (non-dict) args instead of collapsing to a TextResponse (openai_compat, vllm, llamafile) or raising (anthropic streaming). ResponseValidator's args-shape check then routes them to the tool-error channel + max_tool_errors budget — the same lane as a runtime tool error — rather than a retry nudge. Completes the client normalization #86 started (one decoder, all clients) and keeps fail-loud (never coerced to {}). Also closes an unguarded json.loads crash in the anthropic streaming finalize. Behavior change: structural malformed-args now drains max_tool_errors (2), not max_retries (3), in proxy mode. Wire-invisible; no public signature changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…le malformed test From the helper-vs-inline structural review (verdict: keep the shared decode_tool_args helper). Two correctness/honesty caveats actioned: - ToolCall.args annotated dict[str, Any] while the runtime contract now intentionally allows non-dicts (the docstring already says so). Widen to Any so the type stops lying. - Stale comments in openai_compat/vllm streaming finalize still claimed malformed args yield a retry-driving TextResponse; corrected to the raw-args → tool-error-channel routing (they invited the exact drift the helper prevents). - Add an explicit llamafile malformed-args test (non-stream native path). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… expose proxy max_tool_errors From the H1/H2 design review. Two consistency gaps closed: 1. The Guardrails middleware facade recorded EVERY validator failure as a retry (max_retries) and returned action='retry' — diverging from run_inference/proxy, where malformed args drain the tool-error budget and tool-call faults ride the tool channel. The facade now: - routes malformed args (tool_arg_validation) to max_tool_errors, - returns a new action='tool_error' for tool-call faults (unknown tool name OR malformed args), with nudge.role='tool' so callers emit the correction on the tool-result channel. Channel vs budget are now two explicit kind-sets in nudge.py (TOOL_CHANNEL_KINDS ⊃ TOOL_ERROR_KINDS); unknown-tool rides the tool channel but still drains the retry budget, matching run_inference. _TOOL_ERROR_KINDS moved from inference.py to the shared nudge module. 2. Proxy exposed max_retries but not max_tool_errors, hiding the budget exactly where malformed-arg recovery now matters. Added --max-tool-errors (default 2) threaded ProxyServer → HTTPServer → handler → ErrorTracker. Nobody depends on the middleware facade yet, so the CheckResult.action addition is free. Channel parity in run_inference unchanged (it emits role=tool for list-branch corrections regardless of nudge.role). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Closes the last crash vector from the design review (hole 4): StepTracker.check_prerequisites did args.get(match_arg), which raises on a non-dict args. ResponseValidator fences this before dispatch in the runner/proxy, but a granular caller that bypasses check() could reach it directly. Treat non-dict args as unsatisfied (block, don't crash). ADR-016 records the malformed-args → tool-error-channel decision in its honest framing: a native-mode conditioning bet, not an ontology claim; prompt mode degrades to the prior retry shape; the tool-error budget coupling is deliberate but revisitable. CHANGELOG held for release time. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Inject a per-row `gen` field so one dashboard can fold eval waves run against different code states. gen is a comparability epoch, not a release version: v0.6.0 -> gen 1 (carries the Anthropic ablation + Retired-tier models, neither re-run since), and v0.7.0 plus the new 32GB tier -> gen 2. Rename the 32GB wave to eval_results_v0.7.4.jsonl (its landing release) and keep it as a separate file beside v0.7.0 — same gen, distinct wave, so each wave keeps its own landing commit for reproducibility. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

report.py now accepts multiple result files and keeps the newest gen per config (dedup_latest_gen), so the board folds all generations into one view. Lagging rows (gen < newest) get a superscript badge backed by a commit/date legend; Retired-tier models are carried forward but hidden by default (--include-retired, or a sidebar checkbox in the HTML). Adds MODEL_FAMILIES entries for the 6 32GB models so they render clean family names and cross-backend keys instead of raw GGUF stems. React dashboard: Show-retired checkbox (dimmed rows + a "retired" pill), superscript gen badges with provenance tooltips from the data blob. Regenerated docs/results/ from the three gen-tagged datasets. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Bump 0.7.3 -> 0.7.4 and add the 0.7.4 CHANGELOG entry (malformed args -> tool-error channel; 32GB eval tier + dashboard eval-generations). Move the 6 32GB models (Mistral-Small-3.2, Qwen3.5/3.6 27-35B, Nemotron-3 Nano) from Unpublished to Current now that they're in the published eval, and reword the tier definitions for the dashboard's eval generations. Also scrub a stale bring-up note from the Qwen3.5-122B footnote (it leaked operator smoke-probe process into a public doc) and exclude the built dashboard dist/ from the hatchling sdist sweep. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The ToolCall/TextResponse pydantic->dataclass move only breaks callers who serialized these via the pydantic .model_* API or relied on construction-time validation; attribute reads and keyword construction are unchanged. Reserving BREAKING for forced-migration changes (cf. 0.7.3 --mode rename) keeps the badge meaningful.

antoinezambelli and others added 10 commits June 2, 2026 19:08

batch_eval/sampling: 32GB-eval lineup — Qwen3.6-27B, Qwen3.6-35B-A3B-…

b3c1816

…UD, Nemotron-3-Nano GGUF + server-flag + sampling-default entries for the three rig-02 32GB-tier models added for the v0.7.1 run. Launch config of record for eval_results_rig-02_v0.7.1.jsonl (31,200 rows).

antoinezambelli merged commit bd99f4d into main Jun 3, 2026
2 checks passed

antoinezambelli deleted the az/tool-arg-fwd branch June 3, 2026 05:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.7.4: malformed-args → tool-error channel + 32GB eval tier#102

v0.7.4: malformed-args → tool-error channel + 32GB eval tier#102
antoinezambelli merged 10 commits into
mainfrom
az/tool-arg-fwd

antoinezambelli commented Jun 3, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

antoinezambelli commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!