antoinezambelli · antoinezambelli · Jun 12, 2026 · Jun 2, 2026 · Jun 2, 2026 · Jun 2, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,19 @@
 
 All notable changes to forge are documented here.
 
+## [0.7.5] — 2026-06-11
+
+Reasoning replay is now a measured, bounded policy. Reasoning-capable backends return hidden reasoning alongside tool calls, and forge previously re-serialized all of it into backend-facing history on every later turn. The new `reasoning_replay` knob bounds that — and after a full re-sweep of the published eval grid showed that dropping replayed reasoning is quality-free and token-cheaper, the default is `none`. The release also re-baselines the Claude eval tier with extended thinking enabled and adds Anthropic prompt caching with cache-aware cost accounting.
+
+### Added
+- **`reasoning_replay {full, keep-last, none}`** on `WorkflowRunner(reasoning_replay=…)` and the proxy (`--reasoning-replay`). `full` replays every captured reasoning block (the historical behavior), `keep-last` only the most recent, `none` keeps reasoning out of backend-facing history entirely. Serialization-only: reasoning is still captured and still surfaces in `on_message` and internal history. In OpenAI-compatible proxy responses, `keep-last` exposes current reasoning as `reasoning_content` rather than assistant `content`, so clients that preserve reasoning fields can replay just the latest block. See [ADR-017](docs/decisions/017-reasoning-replay-policy.md).
+- **Reasoning-replay eval grid** (`eval_results_v0.7.5.jsonl`, a new eval generation): the full 8–14B lineup re-swept across all three policies × both ablations × native/prompt — ~170k runs. The policy is part of the eval resume key and a first-class report/dashboard dimension: row labels carry `:keep-last` / `:full` tags (untagged = `none`), the dashboard gains a Reasoning Replay filter, the report a `--reasoning-replay` filter, and a dedicated [reasoning-replay view](docs/results/raw/reasoning-replay.md) compares policies per config. A wire-level counter (`reasoning_wire`) validates each policy's on-wire behavior (`none` → exactly 0 replayed reasoning across every run).
+- **Anthropic extended thinking — `AnthropicClient(thinking=…)`** — request-side extended-thinking config (e.g. `{"type": "adaptive"}`). When set, a forced `tool_choice` is suppressed (the API requires `auto` with thinking on) and `max_tokens` is raised to fit the thinking budget. The Claude eval baseline now runs Sonnet and Opus with adaptive thinking — all prior Claude rows had thinking off, the wrong baseline for a reasoning-flavored suite; Haiku does not support adaptive thinking and stays non-thinking.
+- **Anthropic prompt caching — `AnthropicClient(prompt_caching=True)`** — marks a static ephemeral cache breakpoint over the tool definitions + system prompt (byte-identical every turn, so it read-hits from turn 2 onward instead of re-billing the re-sent schema). `TokenUsage` gains generic `cache_creation_input_tokens` / `cache_read_input_tokens` counters, and eval cost accounting prices cache writes (1.25×) and reads (0.1×) at their actual rates.
+
+### Changed
+- **Captured reasoning is no longer replayed to the backend by default.** Pre-0.7.5 behavior replayed every captured reasoning block (equivalent to `reasoning_replay="full"`); the default is now `"none"`. On the published eval suite, `none` is statistically indistinguishable from replay-all in aggregate while saving the replayed tokens every turn; no per-config regression survives multiple-comparison correction (closest: a mild raw drop on Ministral-3 14B Reasoning Q4, where `none` and `keep-last` are indistinguishable from each other). The knob is inert for models that emit no reasoning. Migration: `--reasoning-replay full` (proxy) or `WorkflowRunner(reasoning_replay="full")` restores the historical behavior. Anthropic-protocol proxy responses emit reasoning text only under `full` — forge does not synthesize signed Anthropic thinking blocks.
+
 ## [0.7.4] — 2026-06-03
 
 Malformed tool-call arguments now self-correct on the tool-error channel, and the eval suite gains its first model-size upgrade — a 32GB tier (Qwen3.5 / 3.6 27–35B, Nemotron-3 Nano, Mistral-Small-3.2) surfaced in the dashboard alongside the existing 8–14B lineup.

diff --git a/README.md b/README.md
@@ -128,7 +128,7 @@ For multi-step workflows, multi-turn conversations, and backend auto-management,
 
 Drop-in proxy that sits between any client and a local model server, speaking both the OpenAI chat-completions API and the Anthropic Messages API (`/v1/messages`). Point your client at the proxy (e.g. `http://localhost:8081/v1`) and forge applies its guardrails transparently — the client thinks it's talking to a smarter model.
 
-This is the path for **using forge with an existing harness** (opencode, Continue, aider, Cline, anything that speaks the OpenAI chat-completions schema — or Claude Code, which speaks the Anthropic Messages API). No Python rewrite.
+This is the path for **using forge with an existing harness** (opencode, Continue, aider, Cline, anything that speaks the OpenAI chat-completions schema — or Claude Code, which speaks the Anthropic Messages API). No Python rewrite. Reasoning replay defaults to `none`: Forge still captures reasoning for observability, but keeps it out of backend-facing history on later turns — the most token-efficient policy, and statistically indistinguishable from replay-all on the eval suite (see [reasoning-replay results](docs/results/raw/reasoning-replay.md)). Use `--reasoning-replay keep-last` to replay only the latest reasoning block, or `--reasoning-replay full` for the historical replay-all behavior.
 
 ```bash
 # External mode — you manage the backend, forge proxies it

diff --git a/docs/BACKEND_SETUP.md b/docs/BACKEND_SETUP.md
@@ -75,7 +75,7 @@ llamafile --server --nobrowser -m path/to/model.gguf --port 8080 -ngl 999
 
 `LlamafileClient` is **native-first**: `mode="native"` (the default) forwards tools via the backend's `tools` parameter and requires native function calling (llama.cpp with `--jinja`). For a backend without native FC, declare `mode="prompt"` to inject tool descriptions into the prompt and parse the JSON call back out. The capability is declared at construction and frozen — there is no runtime auto-detection. Native-first is the default because local-model FC support has matured into the more reliable path; prompt-injection stays fully supported as an explicit opt-in, but note that on more complex, multi-step interactions models tend to struggle to drive the prompt-injected protocol reliably, so reach for it only when the backend leaves no alternative.
 
-> **Proxy note:** the OpenAI-compatible proxy is **native-first**. By default (`--backend-capability native`) it forwards the client's tools verbatim to an FC-capable backend (llama.cpp with `--jinja`, vLLM, Ollama, Anthropic) — the recommended setup. For a non-FC llama.cpp/llamafile backend, opt into prompt-injection with `--backend-capability prompt` (strips tools into the prompt, parses the JSON call back; reuses the same prompt path as the WorkflowRunner). The choice is frozen at startup — there is no runtime auto-detect in the proxy. See ADR-012.
+> **Proxy note:** the OpenAI-compatible proxy is **native-first**. By default (`--backend-capability native`) it forwards the client's tools verbatim to an FC-capable backend (llama.cpp with `--jinja`, vLLM, Ollama, Anthropic) — the recommended setup. For a non-FC llama.cpp/llamafile backend, opt into prompt-injection with `--backend-capability prompt` (strips tools into the prompt, parses the JSON call back; reuses the same prompt path as the WorkflowRunner). The choice is frozen at startup — there is no runtime auto-detect in the proxy. Reasoning replay is controlled separately with `--reasoning-replay {full,keep-last,none}`; the default `none` keeps captured reasoning out of backend-facing history (`keep-last` replays only the latest captured reasoning block, `full` replays everything). See ADR-012.
 
 Smoke-test:
 

diff --git a/docs/MODEL_REGISTRY.md b/docs/MODEL_REGISTRY.md
@@ -4,7 +4,7 @@ Every model forge knows about, classified by eval-suite status.
 
 ## Status meanings
 
-- **Current** — in the published eval. The dashboard folds multiple eval *generations* into one view (the v0.7.0 8–14B lineup, plus the v0.7.4 32GB tier); runs not yet re-swept against the latest code — e.g. the Anthropic ablation — are carried forward and superscript-tagged. Numbers in [`docs/results/`](results/) and the [dashboard](results/dashboard.html).
+- **Current** — in the published eval. The dashboard folds multiple eval *generations* into one view (the v0.7.5 reasoning-replay grid for the 8–14B lineup and Claude tier, plus the v0.7.4 32GB tier); runs not yet re-swept against the latest code — e.g. the 32GB tier and the Claude deep-ablation rows — are carried forward and superscript-tagged. Numbers in [`docs/results/`](results/) and the [dashboard](results/dashboard.html).
 - **Retired** — appeared in a prior eval suite, cut from the current one. Either too weak (bare scores below the threshold for informative comparison) or superseded by a newer family member. Sampling defaults retained for backward compatibility.
 - **Unpublished** — sampling defaults are present, but no eval numbers have been published. Forge will work with these models; performance is undocumented.
 
@@ -20,7 +20,7 @@ Sampling values are sourced from the model's HuggingFace card unless noted. Valu
 | Ministral-3 14B Instruct 2512 | Q4_K_M | 0.05¹ | — | — | — | — | — | [HF](https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512) |
 | Ministral-3 8B Reasoning 2512 | Q4_K_M, Q8_0 | 0.7 | —² | — | — | — | — | [HF](https://huggingface.co/mistralai/Ministral-3-8B-Reasoning-2512) |
 | Ministral-3 14B Reasoning 2512 | Q4_K_M | 1.0 | —² | — | — | — | — | [HF](https://huggingface.co/mistralai/Ministral-3-14B-Reasoning-2512) |
-| Qwen3 8B | Q4_K_M, Q8_0 | 0.6 | 0.95 | 20 | 0.0 | — | — | [HF](https://huggingface.co/Qwen/Qwen3-8B) |
+| Qwen3 8B | Q4_K_M, Q8_0⁸ | 0.6 | 0.95 | 20 | 0.0 | — | — | [HF](https://huggingface.co/Qwen/Qwen3-8B) |
 | Qwen3 14B | Q4_K_M | 0.6 | 0.95 | 20 | 0.0 | — | — | [HF](https://huggingface.co/Qwen/Qwen3-14B) |
 | Granite 4.1 8B | Q4_K_M, Q8_0 | 0.0³ | 1.0 | 0 | — | — | — | (IBM convention, unconfirmed) |
 | Gemma-4 E4B-it | Q4_K_M, Q8_0 | 1.0 | 0.95 | 64 | — | — | — | [HF](https://huggingface.co/google/gemma-4-e4b-it) |
@@ -33,15 +33,16 @@ Sampling values are sourced from the model's HuggingFace card unless noted. Valu
 | Nemotron-3 Nano 30B-A3B | Q4_K_M | 0.6 | 0.95 | — | — | — | —⁷ | [HF](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) |
 | Claude Haiku 4.5⁵ | — | — | — | — | — | — | — | (SDK-managed) |
 | Claude Sonnet 4.6⁵ | — | — | — | — | — | — | — | (SDK-managed) |
-| Claude Opus 4.6⁵ | — | — | — | — | — | — | — | (SDK-managed) |
+| Claude Opus 4.8⁵ | — | — | — | — | — | — | — | (SDK-managed) |
 
 ¹ Ministral-3 Instruct cards say "temperature below 0.1 for production"; 0.05 picked within that range.
 ² Ministral-3 Reasoning cards show `top_p=0.95` in code examples but do NOT include it in the formal "Recommended Settings" section. Add explicitly if you want to follow the examples.
 ³ Granite 4.1 sampling mirrors the Granite 4.0 IBM convention (greedy decoding); marked unconfirmed pending IBM publication for the 4.1 family specifically.
 ⁴ Phi-4: no formal sampling recommendation from any official source (Microsoft HF card, model docs). Falls through to backend defaults.
-⁵ **Claude numbers are carried forward from the v0.6.0 dataset** — gen 1 on the dashboard, superscript-tagged. The Anthropic ablation has not been re-run since, owing to cost (~$272 for the full 11,700-row matrix). Backend support is unchanged; numbers are stable to within tool-error-channel sensitivity (small).
+⁵ **Claude baseline re-measured in the v0.7.5 dataset** with extended thinking enabled (adaptive) for Sonnet 4.6 and Opus 4.8; Haiku 4.5 does not support adaptive thinking and runs non-thinking. Earlier Claude rows ran thinking-off: Opus 4.6 and the Anthropic deep-ablation rows are carried forward from the v0.6.0 dataset (gen 1 on the dashboard, superscript-tagged) — the ablation has not been re-run owing to cost (~$272 for the full 11,700-row matrix).
 ⁶ Qwen3.6 27B (dense) deliberately diverges from its A3B siblings: its card drops the `presence_penalty=1.5` the MoE variants recommend, so forge sends `0.0` (no penalty).
 ⁷ Nemotron-3 Nano: the card splits sampling into a Reasoning preset (T=1.0, top_p=1.0) and a Tool-calling preset (T=0.6, top_p=0.95); the tool-calling preset is used here, with thinking enabled via `chat_template_kwargs`.
+⁸ **Qwen3 8B Q8_0 will be cut (→ Retired) in a future eval generation** on compute-cost vs signal-value grounds, not quality: it was the single most expensive model in the v0.7.5 grid (~108 GPU-hours, ~23% of the full sweep) while adding little information over its Q4_K_M sibling (the Q4/Q8 delta is a couple of points on a mid-board model, and the quant-comparison axis is preserved by the cheaper Ministral and Gemma Q4/Q8 pairs). Its numbers stay Current while they are part of the published dataset.
 
 ---
 

diff --git a/docs/USER_GUIDE.md b/docs/USER_GUIDE.md
@@ -85,6 +85,8 @@ claude
 
 **Function-calling capability.** `--backend-capability native` (default) uses the backend's chat-template tool-calling and is the smoother default for Claude Code's heavy multi-turn tool use. `--backend-capability prompt` injects the tool surface into the prompt for llama.cpp/llamafile backends without a tool-calling template; whether a model stays coherent across multi-turn tool results in prompt mode varies by model — and tends to degrade on more complex, multi-step interactions — so prefer native whenever the backend supports it. The capability is declared at startup and frozen.
 
+**Reasoning replay.** Reasoning-capable backends may return hidden reasoning alongside tool calls. Forge captures that reasoning for observability, then controls how much is replayed to the backend on later turns with `--reasoning-replay {full,keep-last,none}`. The default is `none`: captured reasoning stays out of backend-facing history entirely. This is the most token-efficient policy, and on forge's eval suite it is statistically indistinguishable from replay-all (no aggregate score cost; see [reasoning-replay results](results/raw/reasoning-replay.md)). `keep-last` replays only the latest captured reasoning block. `full` preserves the historical behavior and replays every captured reasoning block. In OpenAI-compatible proxy responses, `keep-last` exposes current reasoning as `reasoning_content` instead of normal assistant `content` so clients that preserve reasoning fields can replay only the latest block without turning it into plain text; under the default `none`, proxy responses omit captured reasoning. Anthropic proxy responses only emit reasoning text under `full`; Forge does not synthesize signed Anthropic thinking blocks, so default Anthropic proxy responses do not expose replayable reasoning. See [ADR-017](decisions/017-reasoning-replay-policy.md) for the policy design and the eval evidence behind the default.
+
 **Downstream protocol.**
 
 - **Local model (default, `--backend-protocol openai`)** — forge translates Claude Code's Anthropic requests to OpenAI for llama.cpp / Ollama and converts the reply back to Anthropic SSE. Anthropic-only fields with no OpenAI analog (`cache_control`, `thinking`, `document` blocks) are dropped at that boundary; see [ADR-015](decisions/015-cache-control-preservation-path1.md).
@@ -283,6 +285,8 @@ await server.stop()
 
 `WorkflowRunner` accepts an optional `on_message` callback that fires each time a `Message` is appended to the conversation during `run()`. This is the primary observability hook — use it for logging, eval metric collection, or building conversation history for multi-turn flows.
 
+`WorkflowRunner(reasoning_replay=...)` uses the same policy as the proxy: `none` by default (captured reasoning is not replayed to the backend), `keep-last` to replay only the latest reasoning block, and `full` for the historical replay-all behavior. The policy affects backend-facing serialization only; `MessageType.REASONING` entries still appear in `on_message` and internal history unless context compaction removes them.
+
 - **Single-turn (default):** `on_message` fires for every message the runner creates — system prompt, user input, assistant responses, tool results, nudges.
 - **Multi-turn (`initial_messages`):** `run()` accepts an optional `initial_messages` parameter that seeds the conversation with prior history. `on_message` fires **only for new messages created during this turn**, not for the replayed history.