Skip to content

v0.7.4: malformed-args → tool-error channel + 32GB eval tier#102

Merged
antoinezambelli merged 10 commits into
mainfrom
az/tool-arg-fwd
Jun 3, 2026
Merged

v0.7.4: malformed-args → tool-error channel + 32GB eval tier#102
antoinezambelli merged 10 commits into
mainfrom
az/tool-arg-fwd

Conversation

@antoinezambelli

@antoinezambelli antoinezambelli commented Jun 3, 2026

Copy link
Copy Markdown
Owner

v0.7.4 — malformed-args → tool-error channel + 32GB eval tier

Two self-scoped changes, plus the dashboard infra to surface the larger models.

Malformed tool-call arguments ride the tool-error channel

A structurally valid tool call whose arguments are unparseable or not an object is now corrected via a tool-error result (role="tool", anchored to its tool_call_id, draining max_tool_errors) — uniformly across all OpenAI-shape clients and all three integration modes (WorkflowRunner, proxy, Guardrails facade). A single decode_tool_args helper (clients/base.py) is the one place args are decoded; shape validation moved to ResponseValidator. Supersedes 0.7.3's "malformed args drive a retry nudge."

Framed as a native-mode conditioning bet, not an ontology claim — a small model plausibly self-corrects better on the channel it was pretrained on. In prompt mode the tool role downgrades to a user message, so behavior there is unchanged. See docs/decisions/016.

  • ToolCall/TextResponse are now plain dataclasses (args: Any); attribute access + keyword construction unchanged, pydantic .model_* API gone. (Not labeled BREAKING — blast radius is callers who serialized these via pydantic; reserving the badge for forced-migration changes like 0.7.3's --mode rename.)
  • CheckResult.action gains "tool_error"; proxy exposes --max-tool-errors (default 2); StepTracker.check_prerequisites guards non-dict args.

32GB eval tier + eval-generation dashboard infra

  • 6 models moved Unpublished → Current: Mistral-Small-3.2 24B, Qwen3.5 27B / 35B-A3B, Qwen3.6 27B / 35B-A3B, Nemotron-3 Nano 30B-A3B (Q4, rig-02).
  • Eval generations: each result row carries a gen comparability epoch (decoupled from version + filename); report.py dedups latest-gen-per-config, badges lagging rows, hides Retired behind a toggle. The tool-arg change was a no-op on these 32GB models (error type didn't surface) → same gen as the 8–14B lineup.

Validation

  • Unit suite: 1116 passed.
  • No eval re-run needed — correction paths are an 8–14% minority and regression tests were clean. The 8–14B reasoning-replay re-sweep (gen 3) will exercise the tool-channel path fully.

🤖 Generated with Claude Code

antoinezambelli and others added 10 commits June 2, 2026 19:08
Models occasionally emit a structurally valid tool call with malformed
args content (e.g. arguments="" instead of arguments="{}"). Pydantic
rejected at ToolCall construction, crashing the stage with
ValidationError. Observed at 86% of error rows on Qwen3-Next prompt
mode (77/89), same family on Qwen3.6 (rig-02).

This is conceptually "tool called with bad args" — the call exists,
the inputs are wrong — same as FileNotFoundError at runtime. Should
ride the tool-error channel with max_tool_errors=2 budget, not crash.

- ToolCall / TextResponse: BaseModel → @DataClass. args is no longer
  validated at construction; ResponseValidator enforces dict-shape.
  Audit: no .model_* API on ToolCall anywhere in forge.
- ResponseValidator: new args-shape branch after unknown-tool check.
  Unknown-tool runs first (cheap; no point validating args on a
  hallucinated tool name).
- nudges.tool_arg_validation_nudge: schema-derived message naming the
  tool, the received args type, and the required JSON-object shape.
- inference: parse-error nudges drain max_tool_errors (record_result)
  not max_retries (record_retry). Message prefix
  [ToolArgValidationError] vs [UnknownTool].
- Exhaustion message simplified: includes which budget and nudge kind.

Smoke (Ministral-3-14B-Reasoning, 26 scenarios × 25 runs prompt-mode):
score 78.77% (vs v0.7.0 baseline 79.5% at n=50). Delta within
±1.6% noise band — no regression. Patch never tripped on this model;
this is a regression check, full bake on a model that hits the path
to follow.

884 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…UD, Nemotron-3-Nano

GGUF + server-flag + sampling-default entries for the three rig-02
32GB-tier models added for the v0.7.1 run. Launch config of record
for eval_results_rig-02_v0.7.1.jsonl (31,200 rows).
Unify all OpenAI-shape clients on one decode_tool_args helper
(clients/base.py): JSON-string args are parsed; malformed or non-dict
payloads ride through on the ToolCall as raw (non-dict) args instead of
collapsing to a TextResponse (openai_compat, vllm, llamafile) or raising
(anthropic streaming). ResponseValidator's args-shape check then routes
them to the tool-error channel + max_tool_errors budget — the same lane
as a runtime tool error — rather than a retry nudge.

Completes the client normalization #86 started (one decoder, all
clients) and keeps fail-loud (never coerced to {}). Also closes an
unguarded json.loads crash in the anthropic streaming finalize.

Behavior change: structural malformed-args now drains max_tool_errors
(2), not max_retries (3), in proxy mode. Wire-invisible; no public
signature changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…le malformed test

From the helper-vs-inline structural review (verdict: keep the shared
decode_tool_args helper). Two correctness/honesty caveats actioned:

- ToolCall.args annotated dict[str, Any] while the runtime contract now
  intentionally allows non-dicts (the docstring already says so). Widen
  to Any so the type stops lying.
- Stale comments in openai_compat/vllm streaming finalize still claimed
  malformed args yield a retry-driving TextResponse; corrected to the
  raw-args → tool-error-channel routing (they invited the exact drift
  the helper prevents).
- Add an explicit llamafile malformed-args test (non-stream native path).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… expose proxy max_tool_errors

From the H1/H2 design review. Two consistency gaps closed:

1. The Guardrails middleware facade recorded EVERY validator failure as a
   retry (max_retries) and returned action='retry' — diverging from
   run_inference/proxy, where malformed args drain the tool-error budget
   and tool-call faults ride the tool channel. The facade now:
     - routes malformed args (tool_arg_validation) to max_tool_errors,
     - returns a new action='tool_error' for tool-call faults (unknown
       tool name OR malformed args), with nudge.role='tool' so callers
       emit the correction on the tool-result channel.
   Channel vs budget are now two explicit kind-sets in nudge.py
   (TOOL_CHANNEL_KINDS ⊃ TOOL_ERROR_KINDS); unknown-tool rides the tool
   channel but still drains the retry budget, matching run_inference.
   _TOOL_ERROR_KINDS moved from inference.py to the shared nudge module.

2. Proxy exposed max_retries but not max_tool_errors, hiding the budget
   exactly where malformed-arg recovery now matters. Added --max-tool-errors
   (default 2) threaded ProxyServer → HTTPServer → handler → ErrorTracker.

Nobody depends on the middleware facade yet, so the CheckResult.action
addition is free. Channel parity in run_inference unchanged (it emits
role=tool for list-branch corrections regardless of nudge.role).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Closes the last crash vector from the design review (hole 4):
StepTracker.check_prerequisites did args.get(match_arg), which raises on
a non-dict args. ResponseValidator fences this before dispatch in the
runner/proxy, but a granular caller that bypasses check() could reach it
directly. Treat non-dict args as unsatisfied (block, don't crash).

ADR-016 records the malformed-args → tool-error-channel decision in its
honest framing: a native-mode conditioning bet, not an ontology claim;
prompt mode degrades to the prior retry shape; the tool-error budget
coupling is deliberate but revisitable. CHANGELOG held for release time.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Inject a per-row `gen` field so one dashboard can fold eval waves run
against different code states. gen is a comparability epoch, not a
release version: v0.6.0 -> gen 1 (carries the Anthropic ablation +
Retired-tier models, neither re-run since), and v0.7.0 plus the new
32GB tier -> gen 2.

Rename the 32GB wave to eval_results_v0.7.4.jsonl (its landing release)
and keep it as a separate file beside v0.7.0 — same gen, distinct wave,
so each wave keeps its own landing commit for reproducibility.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
report.py now accepts multiple result files and keeps the newest gen per
config (dedup_latest_gen), so the board folds all generations into one
view. Lagging rows (gen < newest) get a superscript badge backed by a
commit/date legend; Retired-tier models are carried forward but hidden
by default (--include-retired, or a sidebar checkbox in the HTML). Adds
MODEL_FAMILIES entries for the 6 32GB models so they render clean family
names and cross-backend keys instead of raw GGUF stems.

React dashboard: Show-retired checkbox (dimmed rows + a "retired" pill),
superscript gen badges with provenance tooltips from the data blob.

Regenerated docs/results/ from the three gen-tagged datasets.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Bump 0.7.3 -> 0.7.4 and add the 0.7.4 CHANGELOG entry (malformed args ->
tool-error channel; 32GB eval tier + dashboard eval-generations). Move
the 6 32GB models (Mistral-Small-3.2, Qwen3.5/3.6 27-35B, Nemotron-3
Nano) from Unpublished to Current now that they're in the published eval,
and reword the tier definitions for the dashboard's eval generations.

Also scrub a stale bring-up note from the Qwen3.5-122B footnote (it leaked
operator smoke-probe process into a public doc) and exclude the built
dashboard dist/ from the hatchling sdist sweep.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The ToolCall/TextResponse pydantic->dataclass move only breaks callers
who serialized these via the pydantic .model_* API or relied on
construction-time validation; attribute reads and keyword construction
are unchanged. Reserving BREAKING for forced-migration changes (cf.
0.7.3 --mode rename) keeps the badge meaningful.
@antoinezambelli antoinezambelli merged commit bd99f4d into main Jun 3, 2026
2 checks passed
@antoinezambelli antoinezambelli deleted the az/tool-arg-fwd branch June 3, 2026 05:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant