Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 13 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,19 @@

All notable changes to forge are documented here.

## [0.7.5] — 2026-06-11

Reasoning replay is now a measured, bounded policy. Reasoning-capable backends return hidden reasoning alongside tool calls, and forge previously re-serialized all of it into backend-facing history on every later turn. The new `reasoning_replay` knob bounds that — and after a full re-sweep of the published eval grid showed that dropping replayed reasoning is quality-free and token-cheaper, the default is `none`. The release also re-baselines the Claude eval tier with extended thinking enabled and adds Anthropic prompt caching with cache-aware cost accounting.

### Added
- **`reasoning_replay {full, keep-last, none}`** on `WorkflowRunner(reasoning_replay=…)` and the proxy (`--reasoning-replay`). `full` replays every captured reasoning block (the historical behavior), `keep-last` only the most recent, `none` keeps reasoning out of backend-facing history entirely. Serialization-only: reasoning is still captured and still surfaces in `on_message` and internal history. In OpenAI-compatible proxy responses, `keep-last` exposes current reasoning as `reasoning_content` rather than assistant `content`, so clients that preserve reasoning fields can replay just the latest block. See [ADR-017](docs/decisions/017-reasoning-replay-policy.md).
- **Reasoning-replay eval grid** (`eval_results_v0.7.5.jsonl`, a new eval generation): the full 8–14B lineup re-swept across all three policies × both ablations × native/prompt — ~170k runs. The policy is part of the eval resume key and a first-class report/dashboard dimension: row labels carry `:keep-last` / `:full` tags (untagged = `none`), the dashboard gains a Reasoning Replay filter, the report a `--reasoning-replay` filter, and a dedicated [reasoning-replay view](docs/results/raw/reasoning-replay.md) compares policies per config. A wire-level counter (`reasoning_wire`) validates each policy's on-wire behavior (`none` → exactly 0 replayed reasoning across every run).
- **Anthropic extended thinking — `AnthropicClient(thinking=…)`** — request-side extended-thinking config (e.g. `{"type": "adaptive"}`). When set, a forced `tool_choice` is suppressed (the API requires `auto` with thinking on) and `max_tokens` is raised to fit the thinking budget. The Claude eval baseline now runs Sonnet and Opus with adaptive thinking — all prior Claude rows had thinking off, the wrong baseline for a reasoning-flavored suite; Haiku does not support adaptive thinking and stays non-thinking.
- **Anthropic prompt caching — `AnthropicClient(prompt_caching=True)`** — marks a static ephemeral cache breakpoint over the tool definitions + system prompt (byte-identical every turn, so it read-hits from turn 2 onward instead of re-billing the re-sent schema). `TokenUsage` gains generic `cache_creation_input_tokens` / `cache_read_input_tokens` counters, and eval cost accounting prices cache writes (1.25×) and reads (0.1×) at their actual rates.

### Changed
- **Captured reasoning is no longer replayed to the backend by default.** Pre-0.7.5 behavior replayed every captured reasoning block (equivalent to `reasoning_replay="full"`); the default is now `"none"`. On the published eval suite, `none` is statistically indistinguishable from replay-all in aggregate while saving the replayed tokens every turn; no per-config regression survives multiple-comparison correction (closest: a mild raw drop on Ministral-3 14B Reasoning Q4, where `none` and `keep-last` are indistinguishable from each other). The knob is inert for models that emit no reasoning. Migration: `--reasoning-replay full` (proxy) or `WorkflowRunner(reasoning_replay="full")` restores the historical behavior. Anthropic-protocol proxy responses emit reasoning text only under `full` — forge does not synthesize signed Anthropic thinking blocks.

## [0.7.4] — 2026-06-03

Malformed tool-call arguments now self-correct on the tool-error channel, and the eval suite gains its first model-size upgrade — a 32GB tier (Qwen3.5 / 3.6 27–35B, Nemotron-3 Nano, Mistral-Small-3.2) surfaced in the dashboard alongside the existing 8–14B lineup.
Expand Down
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -128,7 +128,7 @@ For multi-step workflows, multi-turn conversations, and backend auto-management,

Drop-in proxy that sits between any client and a local model server, speaking both the OpenAI chat-completions API and the Anthropic Messages API (`/v1/messages`). Point your client at the proxy (e.g. `http://localhost:8081/v1`) and forge applies its guardrails transparently — the client thinks it's talking to a smarter model.

This is the path for **using forge with an existing harness** (opencode, Continue, aider, Cline, anything that speaks the OpenAI chat-completions schema — or Claude Code, which speaks the Anthropic Messages API). No Python rewrite.
This is the path for **using forge with an existing harness** (opencode, Continue, aider, Cline, anything that speaks the OpenAI chat-completions schema — or Claude Code, which speaks the Anthropic Messages API). No Python rewrite. Reasoning replay defaults to `none`: Forge still captures reasoning for observability, but keeps it out of backend-facing history on later turns — the most token-efficient policy, and statistically indistinguishable from replay-all on the eval suite (see [reasoning-replay results](docs/results/raw/reasoning-replay.md)). Use `--reasoning-replay keep-last` to replay only the latest reasoning block, or `--reasoning-replay full` for the historical replay-all behavior.

```bash
# External mode — you manage the backend, forge proxies it
Expand Down
2 changes: 1 addition & 1 deletion docs/BACKEND_SETUP.md
Original file line number Diff line number Diff line change
Expand Up @@ -75,7 +75,7 @@ llamafile --server --nobrowser -m path/to/model.gguf --port 8080 -ngl 999

`LlamafileClient` is **native-first**: `mode="native"` (the default) forwards tools via the backend's `tools` parameter and requires native function calling (llama.cpp with `--jinja`). For a backend without native FC, declare `mode="prompt"` to inject tool descriptions into the prompt and parse the JSON call back out. The capability is declared at construction and frozen — there is no runtime auto-detection. Native-first is the default because local-model FC support has matured into the more reliable path; prompt-injection stays fully supported as an explicit opt-in, but note that on more complex, multi-step interactions models tend to struggle to drive the prompt-injected protocol reliably, so reach for it only when the backend leaves no alternative.

> **Proxy note:** the OpenAI-compatible proxy is **native-first**. By default (`--backend-capability native`) it forwards the client's tools verbatim to an FC-capable backend (llama.cpp with `--jinja`, vLLM, Ollama, Anthropic) — the recommended setup. For a non-FC llama.cpp/llamafile backend, opt into prompt-injection with `--backend-capability prompt` (strips tools into the prompt, parses the JSON call back; reuses the same prompt path as the WorkflowRunner). The choice is frozen at startup — there is no runtime auto-detect in the proxy. See ADR-012.
> **Proxy note:** the OpenAI-compatible proxy is **native-first**. By default (`--backend-capability native`) it forwards the client's tools verbatim to an FC-capable backend (llama.cpp with `--jinja`, vLLM, Ollama, Anthropic) — the recommended setup. For a non-FC llama.cpp/llamafile backend, opt into prompt-injection with `--backend-capability prompt` (strips tools into the prompt, parses the JSON call back; reuses the same prompt path as the WorkflowRunner). The choice is frozen at startup — there is no runtime auto-detect in the proxy. Reasoning replay is controlled separately with `--reasoning-replay {full,keep-last,none}`; the default `none` keeps captured reasoning out of backend-facing history (`keep-last` replays only the latest captured reasoning block, `full` replays everything). See ADR-012.

Smoke-test:

Expand Down
9 changes: 5 additions & 4 deletions docs/MODEL_REGISTRY.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ Every model forge knows about, classified by eval-suite status.

## Status meanings

- **Current** — in the published eval. The dashboard folds multiple eval *generations* into one view (the v0.7.0 8–14B lineup, plus the v0.7.4 32GB tier); runs not yet re-swept against the latest code — e.g. the Anthropic ablation — are carried forward and superscript-tagged. Numbers in [`docs/results/`](results/) and the [dashboard](results/dashboard.html).
- **Current** — in the published eval. The dashboard folds multiple eval *generations* into one view (the v0.7.5 reasoning-replay grid for the 8–14B lineup and Claude tier, plus the v0.7.4 32GB tier); runs not yet re-swept against the latest code — e.g. the 32GB tier and the Claude deep-ablation rows — are carried forward and superscript-tagged. Numbers in [`docs/results/`](results/) and the [dashboard](results/dashboard.html).
- **Retired** — appeared in a prior eval suite, cut from the current one. Either too weak (bare scores below the threshold for informative comparison) or superseded by a newer family member. Sampling defaults retained for backward compatibility.
- **Unpublished** — sampling defaults are present, but no eval numbers have been published. Forge will work with these models; performance is undocumented.

Expand All @@ -20,7 +20,7 @@ Sampling values are sourced from the model's HuggingFace card unless noted. Valu
| Ministral-3 14B Instruct 2512 | Q4_K_M | 0.05¹ | — | — | — | — | — | [HF](https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512) |
| Ministral-3 8B Reasoning 2512 | Q4_K_M, Q8_0 | 0.7 | —² | — | — | — | — | [HF](https://huggingface.co/mistralai/Ministral-3-8B-Reasoning-2512) |
| Ministral-3 14B Reasoning 2512 | Q4_K_M | 1.0 | —² | — | — | — | — | [HF](https://huggingface.co/mistralai/Ministral-3-14B-Reasoning-2512) |
| Qwen3 8B | Q4_K_M, Q8_0 | 0.6 | 0.95 | 20 | 0.0 | — | — | [HF](https://huggingface.co/Qwen/Qwen3-8B) |
| Qwen3 8B | Q4_K_M, Q8_0 | 0.6 | 0.95 | 20 | 0.0 | — | — | [HF](https://huggingface.co/Qwen/Qwen3-8B) |
| Qwen3 14B | Q4_K_M | 0.6 | 0.95 | 20 | 0.0 | — | — | [HF](https://huggingface.co/Qwen/Qwen3-14B) |
| Granite 4.1 8B | Q4_K_M, Q8_0 | 0.0³ | 1.0 | 0 | — | — | — | (IBM convention, unconfirmed) |
| Gemma-4 E4B-it | Q4_K_M, Q8_0 | 1.0 | 0.95 | 64 | — | — | — | [HF](https://huggingface.co/google/gemma-4-e4b-it) |
Expand All @@ -33,15 +33,16 @@ Sampling values are sourced from the model's HuggingFace card unless noted. Valu
| Nemotron-3 Nano 30B-A3B | Q4_K_M | 0.6 | 0.95 | — | — | — | —⁷ | [HF](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) |
| Claude Haiku 4.5⁵ | — | — | — | — | — | — | — | (SDK-managed) |
| Claude Sonnet 4.6⁵ | — | — | — | — | — | — | — | (SDK-managed) |
| Claude Opus 4.6⁵ | — | — | — | — | — | — | — | (SDK-managed) |
| Claude Opus 4.8⁵ | — | — | — | — | — | — | — | (SDK-managed) |

¹ Ministral-3 Instruct cards say "temperature below 0.1 for production"; 0.05 picked within that range.
² Ministral-3 Reasoning cards show `top_p=0.95` in code examples but do NOT include it in the formal "Recommended Settings" section. Add explicitly if you want to follow the examples.
³ Granite 4.1 sampling mirrors the Granite 4.0 IBM convention (greedy decoding); marked unconfirmed pending IBM publication for the 4.1 family specifically.
⁴ Phi-4: no formal sampling recommendation from any official source (Microsoft HF card, model docs). Falls through to backend defaults.
⁵ **Claude numbers are carried forward from the v0.6.0 dataset** — gen 1 on the dashboard, superscript-tagged. The Anthropic ablation has not been re-run since, owing to cost (~$272 for the full 11,700-row matrix). Backend support is unchanged; numbers are stable to within tool-error-channel sensitivity (small).
⁵ **Claude baseline re-measured in the v0.7.5 dataset** with extended thinking enabled (adaptive) for Sonnet 4.6 and Opus 4.8; Haiku 4.5 does not support adaptive thinking and runs non-thinking. Earlier Claude rows ran thinking-off: Opus 4.6 and the Anthropic deep-ablation rows are carried forward from the v0.6.0 dataset (gen 1 on the dashboard, superscript-tagged) — the ablation has not been re-run owing to cost (~$272 for the full 11,700-row matrix).
⁶ Qwen3.6 27B (dense) deliberately diverges from its A3B siblings: its card drops the `presence_penalty=1.5` the MoE variants recommend, so forge sends `0.0` (no penalty).
⁷ Nemotron-3 Nano: the card splits sampling into a Reasoning preset (T=1.0, top_p=1.0) and a Tool-calling preset (T=0.6, top_p=0.95); the tool-calling preset is used here, with thinking enabled via `chat_template_kwargs`.
⁸ **Qwen3 8B Q8_0 will be cut (→ Retired) in a future eval generation** on compute-cost vs signal-value grounds, not quality: it was the single most expensive model in the v0.7.5 grid (~108 GPU-hours, ~23% of the full sweep) while adding little information over its Q4_K_M sibling (the Q4/Q8 delta is a couple of points on a mid-board model, and the quant-comparison axis is preserved by the cheaper Ministral and Gemma Q4/Q8 pairs). Its numbers stay Current while they are part of the published dataset.

---

Expand Down
4 changes: 4 additions & 0 deletions docs/USER_GUIDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -85,6 +85,8 @@ claude

**Function-calling capability.** `--backend-capability native` (default) uses the backend's chat-template tool-calling and is the smoother default for Claude Code's heavy multi-turn tool use. `--backend-capability prompt` injects the tool surface into the prompt for llama.cpp/llamafile backends without a tool-calling template; whether a model stays coherent across multi-turn tool results in prompt mode varies by model — and tends to degrade on more complex, multi-step interactions — so prefer native whenever the backend supports it. The capability is declared at startup and frozen.

**Reasoning replay.** Reasoning-capable backends may return hidden reasoning alongside tool calls. Forge captures that reasoning for observability, then controls how much is replayed to the backend on later turns with `--reasoning-replay {full,keep-last,none}`. The default is `none`: captured reasoning stays out of backend-facing history entirely. This is the most token-efficient policy, and on forge's eval suite it is statistically indistinguishable from replay-all (no aggregate score cost; see [reasoning-replay results](results/raw/reasoning-replay.md)). `keep-last` replays only the latest captured reasoning block. `full` preserves the historical behavior and replays every captured reasoning block. In OpenAI-compatible proxy responses, `keep-last` exposes current reasoning as `reasoning_content` instead of normal assistant `content` so clients that preserve reasoning fields can replay only the latest block without turning it into plain text; under the default `none`, proxy responses omit captured reasoning. Anthropic proxy responses only emit reasoning text under `full`; Forge does not synthesize signed Anthropic thinking blocks, so default Anthropic proxy responses do not expose replayable reasoning. See [ADR-017](decisions/017-reasoning-replay-policy.md) for the policy design and the eval evidence behind the default.

**Downstream protocol.**

- **Local model (default, `--backend-protocol openai`)** — forge translates Claude Code's Anthropic requests to OpenAI for llama.cpp / Ollama and converts the reply back to Anthropic SSE. Anthropic-only fields with no OpenAI analog (`cache_control`, `thinking`, `document` blocks) are dropped at that boundary; see [ADR-015](decisions/015-cache-control-preservation-path1.md).
Expand Down Expand Up @@ -283,6 +285,8 @@ await server.stop()

`WorkflowRunner` accepts an optional `on_message` callback that fires each time a `Message` is appended to the conversation during `run()`. This is the primary observability hook — use it for logging, eval metric collection, or building conversation history for multi-turn flows.

`WorkflowRunner(reasoning_replay=...)` uses the same policy as the proxy: `none` by default (captured reasoning is not replayed to the backend), `keep-last` to replay only the latest reasoning block, and `full` for the historical replay-all behavior. The policy affects backend-facing serialization only; `MessageType.REASONING` entries still appear in `on_message` and internal history unless context compaction removes them.

- **Single-turn (default):** `on_message` fires for every message the runner creates — system prompt, user input, assistant responses, tool results, nudges.
- **Multi-turn (`initial_messages`):** `run()` accepts an optional `initial_messages` parameter that seeds the conversation with prior history. `on_message` fires **only for new messages created during this turn**, not for the replayed history.

Expand Down
Loading
Loading