diff --git a/CHANGELOG.md b/CHANGELOG.md
index e4e0a3f..f6961eb 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,19 @@
 
 All notable changes to forge are documented here.
 
+## [0.7.5] — 2026-06-11
+
+Reasoning replay is now a measured, bounded policy. Reasoning-capable backends return hidden reasoning alongside tool calls, and forge previously re-serialized all of it into backend-facing history on every later turn. The new `reasoning_replay` knob bounds that — and after a full re-sweep of the published eval grid showed that dropping replayed reasoning is quality-free and token-cheaper, the default is `none`. The release also re-baselines the Claude eval tier with extended thinking enabled and adds Anthropic prompt caching with cache-aware cost accounting.
+
+### Added
+- **`reasoning_replay {full, keep-last, none}`** on `WorkflowRunner(reasoning_replay=…)` and the proxy (`--reasoning-replay`). `full` replays every captured reasoning block (the historical behavior), `keep-last` only the most recent, `none` keeps reasoning out of backend-facing history entirely. Serialization-only: reasoning is still captured and still surfaces in `on_message` and internal history. In OpenAI-compatible proxy responses, `keep-last` exposes current reasoning as `reasoning_content` rather than assistant `content`, so clients that preserve reasoning fields can replay just the latest block. See [ADR-017](docs/decisions/017-reasoning-replay-policy.md).
+- **Reasoning-replay eval grid** (`eval_results_v0.7.5.jsonl`, a new eval generation): the full 8–14B lineup re-swept across all three policies × both ablations × native/prompt — ~170k runs. The policy is part of the eval resume key and a first-class report/dashboard dimension: row labels carry `:keep-last` / `:full` tags (untagged = `none`), the dashboard gains a Reasoning Replay filter, the report a `--reasoning-replay` filter, and a dedicated [reasoning-replay view](docs/results/raw/reasoning-replay.md) compares policies per config. A wire-level counter (`reasoning_wire`) validates each policy's on-wire behavior (`none` → exactly 0 replayed reasoning across every run).
+- **Anthropic extended thinking — `AnthropicClient(thinking=…)`** — request-side extended-thinking config (e.g. `{"type": "adaptive"}`). When set, a forced `tool_choice` is suppressed (the API requires `auto` with thinking on) and `max_tokens` is raised to fit the thinking budget. The Claude eval baseline now runs Sonnet and Opus with adaptive thinking — all prior Claude rows had thinking off, the wrong baseline for a reasoning-flavored suite; Haiku does not support adaptive thinking and stays non-thinking.
+- **Anthropic prompt caching — `AnthropicClient(prompt_caching=True)`** — marks a static ephemeral cache breakpoint over the tool definitions + system prompt (byte-identical every turn, so it read-hits from turn 2 onward instead of re-billing the re-sent schema). `TokenUsage` gains generic `cache_creation_input_tokens` / `cache_read_input_tokens` counters, and eval cost accounting prices cache writes (1.25×) and reads (0.1×) at their actual rates.
+
+### Changed
+- **Captured reasoning is no longer replayed to the backend by default.** Pre-0.7.5 behavior replayed every captured reasoning block (equivalent to `reasoning_replay="full"`); the default is now `"none"`. On the published eval suite, `none` is statistically indistinguishable from replay-all in aggregate while saving the replayed tokens every turn; no per-config regression survives multiple-comparison correction (closest: a mild raw drop on Ministral-3 14B Reasoning Q4, where `none` and `keep-last` are indistinguishable from each other). The knob is inert for models that emit no reasoning. Migration: `--reasoning-replay full` (proxy) or `WorkflowRunner(reasoning_replay="full")` restores the historical behavior. Anthropic-protocol proxy responses emit reasoning text only under `full` — forge does not synthesize signed Anthropic thinking blocks.
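The cache-aware pricing in the prompt-caching entry can be sketched as follows. This is a minimal illustration, not forge's implementation: `Usage` and `input_cost_usd` are hypothetical names, the per-million rate is an arbitrary example number, and the sketch assumes `input_tokens` counts only uncached input. The 1.25× and 0.1× multipliers are the ones stated in the changelog entry.

```python
from dataclasses import dataclass

# Multipliers taken from the changelog entry above.
CACHE_WRITE_MULTIPLIER = 1.25
CACHE_READ_MULTIPLIER = 0.1


@dataclass
class Usage:
    """Hypothetical stand-in for per-request token counters."""
    input_tokens: int = 0                 # assumed: uncached input only
    cache_creation_input_tokens: int = 0  # tokens written to the cache
    cache_read_input_tokens: int = 0      # tokens served from the cache


def input_cost_usd(usage: Usage, usd_per_million_input: float) -> float:
    """Price uncached, cache-write, and cache-read tokens at their own rates."""
    weighted_tokens = (
        usage.input_tokens
        + CACHE_WRITE_MULTIPLIER * usage.cache_creation_input_tokens
        + CACHE_READ_MULTIPLIER * usage.cache_read_input_tokens
    )
    return weighted_tokens * usd_per_million_input / 1_000_000
```

Under these assumptions, a turn that reads a 2,000-token tools-plus-system prefix from cache is billed for 200 weighted tokens of that prefix instead of 2,000, which is why a byte-identical prefix pays off from the second turn onward.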
+
 ## [0.7.4] — 2026-06-03
 
 Malformed tool-call arguments now self-correct on the tool-error channel, and the eval suite gains its first model-size upgrade — a 32GB tier (Qwen3.5 / 3.6 27–35B, Nemotron-3 Nano, Mistral-Small-3.2) surfaced in the dashboard alongside the existing 8–14B lineup.
diff --git a/README.md b/README.md
index e1ed4a6..0696369 100644
--- a/README.md
+++ b/README.md
@@ -128,7 +128,7 @@ For multi-step workflows, multi-turn conversations, and backend auto-management,
 
 Drop-in proxy that sits between any client and a local model server, speaking both the OpenAI chat-completions API and the Anthropic Messages API (`/v1/messages`). Point your client at the proxy (e.g. `http://localhost:8081/v1`) and forge applies its guardrails transparently — the client thinks it's talking to a smarter model.
 
-This is the path for **using forge with an existing harness** (opencode, Continue, aider, Cline, anything that speaks the OpenAI chat-completions schema — or Claude Code, which speaks the Anthropic Messages API). No Python rewrite.
+This is the path for **using forge with an existing harness** (opencode, Continue, aider, Cline, anything that speaks the OpenAI chat-completions schema — or Claude Code, which speaks the Anthropic Messages API). No Python rewrite. Reasoning replay defaults to `none`: Forge still captures reasoning for observability, but keeps it out of backend-facing history on later turns — the most token-efficient policy, and statistically indistinguishable from replay-all on the eval suite (see [reasoning-replay results](docs/results/raw/reasoning-replay.md)). Use `--reasoning-replay keep-last` to replay only the latest reasoning block, or `--reasoning-replay full` for the historical replay-all behavior.
 
 ```bash
 # External mode — you manage the backend, forge proxies it
diff --git a/docs/BACKEND_SETUP.md b/docs/BACKEND_SETUP.md
index 75e667d..2c75fe3 100644
--- a/docs/BACKEND_SETUP.md
+++ b/docs/BACKEND_SETUP.md
@@ -75,7 +75,7 @@ llamafile --server --nobrowser -m path/to/model.gguf --port 8080 -ngl 999
 
 `LlamafileClient` is **native-first**: `mode="native"` (the default) forwards tools via the backend's `tools` parameter and requires native function calling (llama.cpp with `--jinja`). For a backend without native FC, declare `mode="prompt"` to inject tool descriptions into the prompt and parse the JSON call back out. The capability is declared at construction and frozen — there is no runtime auto-detection. Native-first is the default because local-model FC support has matured into the more reliable path; prompt-injection stays fully supported as an explicit opt-in, but note that on more complex, multi-step interactions models tend to struggle to drive the prompt-injected protocol reliably, so reach for it only when the backend leaves no alternative.
 
-> **Proxy note:** the OpenAI-compatible proxy is **native-first**. By default (`--backend-capability native`) it forwards the client's tools verbatim to an FC-capable backend (llama.cpp with `--jinja`, vLLM, Ollama, Anthropic) — the recommended setup. For a non-FC llama.cpp/llamafile backend, opt into prompt-injection with `--backend-capability prompt` (strips tools into the prompt, parses the JSON call back; reuses the same prompt path as the WorkflowRunner). The choice is frozen at startup — there is no runtime auto-detect in the proxy. See ADR-012.
+> **Proxy note:** the OpenAI-compatible proxy is **native-first**. By default (`--backend-capability native`) it forwards the client's tools verbatim to an FC-capable backend (llama.cpp with `--jinja`, vLLM, Ollama, Anthropic) — the recommended setup. For a non-FC llama.cpp/llamafile backend, opt into prompt-injection with `--backend-capability prompt` (strips tools into the prompt, parses the JSON call back; reuses the same prompt path as the WorkflowRunner). The choice is frozen at startup — there is no runtime auto-detect in the proxy. Reasoning replay is controlled separately with `--reasoning-replay {full,keep-last,none}`; the default `none` keeps captured reasoning out of backend-facing history (`keep-last` replays only the latest captured reasoning block, `full` replays everything). See ADR-012.
 
 Smoke-test:
 
diff --git a/docs/MODEL_REGISTRY.md b/docs/MODEL_REGISTRY.md
index 370a14b..9d84d4c 100644
--- a/docs/MODEL_REGISTRY.md
+++ b/docs/MODEL_REGISTRY.md
@@ -4,7 +4,7 @@ Every model forge knows about, classified by eval-suite status.
 
 ## Status meanings
 
-- **Current** — in the published eval. The dashboard folds multiple eval *generations* into one view (the v0.7.0 8–14B lineup, plus the v0.7.4 32GB tier); runs not yet re-swept against the latest code — e.g. the Anthropic ablation — are carried forward and superscript-tagged. Numbers in [`docs/results/`](results/) and the [dashboard](results/dashboard.html).
+- **Current** — in the published eval. The dashboard folds multiple eval *generations* into one view (the v0.7.5 reasoning-replay grid for the 8–14B lineup and Claude tier, plus the v0.7.4 32GB tier); runs not yet re-swept against the latest code — e.g. the 32GB tier and the Claude deep-ablation rows — are carried forward and superscript-tagged. Numbers in [`docs/results/`](results/) and the [dashboard](results/dashboard.html).
 - **Retired** — appeared in a prior eval suite, cut from the current one. Either too weak (bare scores below the threshold for informative comparison) or superseded by a newer family member. Sampling defaults retained for backward compatibility.
 - **Unpublished** — sampling defaults are present, but no eval numbers have been published. Forge will work with these models; performance is undocumented.
 
@@ -20,7 +20,7 @@ Sampling values are sourced from the model's HuggingFace card unless noted. Valu
 | Ministral-3 14B Instruct 2512 | Q4_K_M | 0.05¹ | — | — | — | — | — | [HF](https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512) |
 | Ministral-3 8B Reasoning 2512 | Q4_K_M, Q8_0 | 0.7 | —² | — | — | — | — | [HF](https://huggingface.co/mistralai/Ministral-3-8B-Reasoning-2512) |
 | Ministral-3 14B Reasoning 2512 | Q4_K_M | 1.0 | —² | — | — | — | — | [HF](https://huggingface.co/mistralai/Ministral-3-14B-Reasoning-2512) |
-| Qwen3 8B | Q4_K_M, Q8_0 | 0.6 | 0.95 | 20 | 0.0 | — | — | [HF](https://huggingface.co/Qwen/Qwen3-8B) |
+| Qwen3 8B | Q4_K_M, Q8_0⁸ | 0.6 | 0.95 | 20 | 0.0 | — | — | [HF](https://huggingface.co/Qwen/Qwen3-8B) |
 | Qwen3 14B | Q4_K_M | 0.6 | 0.95 | 20 | 0.0 | — | — | [HF](https://huggingface.co/Qwen/Qwen3-14B) |
 | Granite 4.1 8B | Q4_K_M, Q8_0 | 0.0³ | 1.0 | 0 | — | — | — | (IBM convention, unconfirmed) |
 | Gemma-4 E4B-it | Q4_K_M, Q8_0 | 1.0 | 0.95 | 64 | — | — | — | [HF](https://huggingface.co/google/gemma-4-e4b-it) |
@@ -33,15 +33,16 @@ Sampling values are sourced from the model's HuggingFace card unless noted. Valu
 | Nemotron-3 Nano 30B-A3B | Q4_K_M | 0.6 | 0.95 | — | — | — | —⁷ | [HF](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) |
 | Claude Haiku 4.5⁵ | — | — | — | — | — | — | — | (SDK-managed) |
 | Claude Sonnet 4.6⁵ | — | — | — | — | — | — | — | (SDK-managed) |
-| Claude Opus 4.6⁵ | — | — | — | — | — | — | — | (SDK-managed) |
+| Claude Opus 4.8⁵ | — | — | — | — | — | — | — | (SDK-managed) |
 
 ¹ Ministral-3 Instruct cards say "temperature below 0.1 for production"; 0.05 picked within that range.
 ² Ministral-3 Reasoning cards show `top_p=0.95` in code examples but do NOT include it in the formal "Recommended Settings" section. Add explicitly if you want to follow the examples.
 ³ Granite 4.1 sampling mirrors the Granite 4.0 IBM convention (greedy decoding); marked unconfirmed pending IBM publication for the 4.1 family specifically.
 ⁴ Phi-4: no formal sampling recommendation from any official source (Microsoft HF card, model docs). Falls through to backend defaults.
-⁵ **Claude numbers are carried forward from the v0.6.0 dataset** — gen 1 on the dashboard, superscript-tagged. The Anthropic ablation has not been re-run since, owing to cost (~$272 for the full 11,700-row matrix). Backend support is unchanged; numbers are stable to within tool-error-channel sensitivity (small).
+⁵ **Claude baseline re-measured in the v0.7.5 dataset** with extended thinking enabled (adaptive) for Sonnet 4.6 and Opus 4.8; Haiku 4.5 does not support adaptive thinking and runs non-thinking. Earlier Claude rows ran thinking-off: Opus 4.6 and the Anthropic deep-ablation rows are carried forward from the v0.6.0 dataset (gen 1 on the dashboard, superscript-tagged) — the ablation has not been re-run owing to cost (~$272 for the full 11,700-row matrix).
 ⁶ Qwen3.6 27B (dense) deliberately diverges from its A3B siblings: its card drops the `presence_penalty=1.5` the MoE variants recommend, so forge sends `0.0` (no penalty).
 ⁷ Nemotron-3 Nano: the card splits sampling into a Reasoning preset (T=1.0, top_p=1.0) and a Tool-calling preset (T=0.6, top_p=0.95); the tool-calling preset is used here, with thinking enabled via `chat_template_kwargs`.
+⁸ **Qwen3 8B Q8_0 will be cut (→ Retired) in a future eval generation** on compute-cost vs signal-value grounds, not quality: it was the single most expensive model in the v0.7.5 grid (~108 GPU-hours, ~23% of the full sweep) while adding little information over its Q4_K_M sibling (the Q4/Q8 delta is a couple of points on a mid-board model, and the quant-comparison axis is preserved by the cheaper Ministral and Gemma Q4/Q8 pairs). Its numbers stay Current while they are part of the published dataset.
 
 ---
 
diff --git a/docs/USER_GUIDE.md b/docs/USER_GUIDE.md
index ac04a1f..de10ec1 100644
--- a/docs/USER_GUIDE.md
+++ b/docs/USER_GUIDE.md
@@ -85,6 +85,8 @@ claude
 
 **Function-calling capability.** `--backend-capability native` (default) uses the backend's chat-template tool-calling and is the smoother default for Claude Code's heavy multi-turn tool use. `--backend-capability prompt` injects the tool surface into the prompt for llama.cpp/llamafile backends without a tool-calling template; whether a model stays coherent across multi-turn tool results in prompt mode varies by model — and tends to degrade on more complex, multi-step interactions — so prefer native whenever the backend supports it. The capability is declared at startup and frozen.
 
+**Reasoning replay.** Reasoning-capable backends may return hidden reasoning alongside tool calls. Forge captures that reasoning for observability, then controls how much is replayed to the backend on later turns with `--reasoning-replay {full,keep-last,none}`. The default is `none`: captured reasoning stays out of backend-facing history entirely. This is the most token-efficient policy, and on forge's eval suite it is statistically indistinguishable from replay-all (no aggregate score cost; see [reasoning-replay results](results/raw/reasoning-replay.md)). `keep-last` replays only the latest captured reasoning block. `full` preserves the historical behavior and replays every captured reasoning block. In OpenAI-compatible proxy responses, `keep-last` exposes current reasoning as `reasoning_content` instead of normal assistant `content` so clients that preserve reasoning fields can replay only the latest block without turning it into plain text; under the default `none`, proxy responses omit captured reasoning. Anthropic proxy responses only emit reasoning text under `full`; Forge does not synthesize signed Anthropic thinking blocks, so default Anthropic proxy responses do not expose replayable reasoning. See [ADR-017](decisions/017-reasoning-replay-policy.md) for the policy design and the eval evidence behind the default.
+
 **Downstream protocol.**
 
 - **Local model (default, `--backend-protocol openai`)** — forge translates Claude Code's Anthropic requests to OpenAI for llama.cpp / Ollama and converts the reply back to Anthropic SSE. Anthropic-only fields with no OpenAI analog (`cache_control`, `thinking`, `document` blocks) are dropped at that boundary; see [ADR-015](decisions/015-cache-control-preservation-path1.md).
@@ -283,6 +285,8 @@ await server.stop()
 
 `WorkflowRunner` accepts an optional `on_message` callback that fires each time a `Message` is appended to the conversation during `run()`. This is the primary observability hook — use it for logging, eval metric collection, or building conversation history for multi-turn flows.
 
+`WorkflowRunner(reasoning_replay=...)` uses the same policy as the proxy: `none` by default (captured reasoning is not replayed to the backend), `keep-last` to replay only the latest reasoning block, and `full` for the historical replay-all behavior. The policy affects backend-facing serialization only; `MessageType.REASONING` entries still appear in `on_message` and internal history unless context compaction removes them.
+
 - **Single-turn (default):** `on_message` fires for every message the runner creates — system prompt, user input, assistant responses, tool results, nudges.
 - **Multi-turn (`initial_messages`):** `run()` accepts an optional `initial_messages` parameter that seeds the conversation with prior history. `on_message` fires **only for new messages created during this turn**, not for the replayed history.
 
diff --git a/docs/decisions/017-reasoning-replay-policy.md b/docs/decisions/017-reasoning-replay-policy.md
new file mode 100644
index 0000000..e2bba03
--- /dev/null
+++ b/docs/decisions/017-reasoning-replay-policy.md
@@ -0,0 +1,51 @@
+# ADR-017: Reasoning replay is a bounded policy, default `none`
+
+**Status:** accepted (unreleased)
+
+## Context
+
+Reasoning-capable backends (Ministral Reasoning, Qwen3 thinking, gemma 4, …) return hidden reasoning alongside tool calls. forge captures that reasoning for observability (`MessageType.REASONING`), and historically re-serialized **all** of it into backend-facing history on every later turn — unbounded accumulation, with no way to turn it off.
+
+Two problems motivated bounding this:
+
+- **Convergence.** A proxy non-convergence investigation traced runaway context growth to captured reasoning being replayed back to the backend each turn. Frontier labs practice *scoped* reasoning retention, not replay-everything.
+- **Cost.** Replayed reasoning grows the prompt every turn. On long multi-step workflows it competes with real history for the context budget and inflates per-turn token cost.
+
+A serializer reality check sharpened the question: even the legacy behavior was not a faithful 1:1 re-send — `fold_and_serialize` collapses consecutive reasoning blocks (only the one preceding a tool call survives), so only ~29% of generated reasoning reached the wire on real transcripts. "Replay everything" was already an approximation, not a ground truth worth preserving by default.
+
+## Decision
+
+One knob, `reasoning_replay ∈ {"full", "keep-last", "none"}`, shared by `WorkflowRunner` and the proxy (`--reasoning-replay`), **default `"none"`**.
+
+- **`none` (default)** — captured reasoning never enters backend-facing history.
+- **`keep-last`** — only the most recent captured reasoning block is replayed.
+- **`full`** — legacy behavior; every captured reasoning block is replayed. Pre-knob forge ≡ `full`.
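The three policies listed above can be sketched as a pure filter over captured history. This is an illustrative sketch only: `select_for_backend`, `reasoning_on_wire`, and the dict-based message shape are hypothetical and not forge's serializer. It also ignores the folding of consecutive reasoning blocks that the Context section attributes to `fold_and_serialize`.

```python
from typing import Literal

ReplayPolicy = Literal["full", "keep-last", "none"]


def select_for_backend(history: list[dict], policy: ReplayPolicy = "none") -> list[dict]:
    """Return the subset of history that would be sent to the backend.

    Hypothetical message shape: {"type": "reasoning" | "user" | ..., "text": str}.
    The input list is never mutated, so internal history keeps every block.
    """
    if policy not in ("full", "keep-last", "none"):
        raise ValueError(f"unknown reasoning_replay policy: {policy!r}")
    if policy == "full":
        return list(history)
    # Positions of all captured reasoning blocks, oldest first.
    reasoning_positions = [i for i, m in enumerate(history) if m["type"] == "reasoning"]
    # keep-last retains only the newest block; none retains nothing.
    kept = set(reasoning_positions[-1:]) if policy == "keep-last" else set()
    return [m for i, m in enumerate(history) if m["type"] != "reasoning" or i in kept]


def reasoning_on_wire(history: list[dict], policy: ReplayPolicy) -> int:
    """Count reasoning blocks that survive serialization under a policy."""
    return sum(m["type"] == "reasoning" for m in select_for_backend(history, policy))
```

Because the filter returns a new list and leaves its input untouched, the captured reasoning remains available to observers of the internal history, which mirrors the serialization-only scope of the knob.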
+
+The policy affects **backend-facing serialization only**. Reasoning is still captured, still surfaces in `on_message` and internal history, and still lands in eval transcripts — observability is unchanged.
+
+Proxy response shaping follows the policy: under `keep-last` current reasoning is exposed as `reasoning_content` (so clients that preserve reasoning fields can replay just the latest block); under `full` it rides assistant `content`; under `none` it is omitted. Anthropic-protocol responses emit reasoning text only under `full`; forge does not synthesize signed Anthropic thinking blocks.
+
+## Evidence
+
+The default was chosen from a dedicated re-sweep (the v0.7.5 grid): 14 models × {none, keep-last, full} × {bare, reforged} × {native, prompt}, 50 runs × 26 scenarios per cell, 170k runs total. Scoring treats the **scenario** as the sampling unit (runs cluster hard within scenarios), paired against the v0.7.0 legacy/`full` baseline.
+
+- **`full` reproduces the pre-knob baseline** on all reasoning models (n.s. everywhere) — the knob is a clean superset of legacy behavior; the message-processing refactor did not regress the legacy path.
+- **`none` is statistically indistinguishable from legacy overall** (+0.49pp, p=0.17), and in the reforged-only read (−0.35pp, p=0.45). Bounding replay is a free token saving on this suite.
+- **`none` edges out `keep-last` overall** (+0.86pp, p=0.007); the two are indistinguishable reforged-only.
+- **No robust per-config downside survives multiple-comparison correction.** The closest is the Ministral-14B-Reasoning-Q4 family (reforged-only raw drop ~1.5pp, p≈0.04–0.06, with `none` ≈ `keep-last`) — a family/quantization caveat, not a blocker.
+- **Wire-level validation:** `none` → exactly 0 reasoning on the wire across every row; `keep-last` ∈ {0, 1}; per-transcript ordering full ≥ keep-last ≥ none holds by construction.
+
+Full per-config tables: [results/raw/reasoning-replay.md](../results/raw/reasoning-replay.md).
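To make "the scenario is the sampling unit" concrete, here is a minimal sketch of a scenario-paired comparison. The helper names and the `(scenario, score)` tuple shape are hypothetical. The ADR does not name the significance test used, so this sketch stops at the paired per-scenario deltas and computes no p-value.

```python
from collections import defaultdict
from statistics import mean


def scenario_means(runs):
    """Collapse (scenario, score) runs to one mean score per scenario."""
    buckets = defaultdict(list)
    for scenario, score in runs:
        buckets[scenario].append(score)
    return {scenario: mean(scores) for scenario, scores in buckets.items()}


def paired_scenario_delta(policy_runs, baseline_runs):
    """Pair policy and baseline by scenario; return (mean delta, per-scenario deltas).

    Runs are averaged within a scenario first, so 50 correlated runs of one
    scenario contribute a single paired observation rather than 50.
    Raises statistics.StatisticsError if the two sides share no scenario.
    """
    policy = scenario_means(policy_runs)
    baseline = scenario_means(baseline_runs)
    shared = sorted(policy.keys() & baseline.keys())
    deltas = [policy[s] - baseline[s] for s in shared]
    return mean(deltas), deltas
```

Averaging within scenarios before pairing keeps the effective sample size at the number of scenarios, which is the conservative reading when runs of the same scenario are strongly correlated.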
+
+## Consequences
+
+- **Behavioral change for reasoning-capable backends.** Upgraders who want the old behavior pin `--reasoning-replay full` (proxy) or `WorkflowRunner(reasoning_replay="full")`. For non-reasoning/instruct models the knob is inert and nothing changes.
+- **Token savings by default.** Backend-facing history stops accumulating reasoning; `full` remains the cost wildcard (context grows with run length).
+- **Eval surface.** `reasoning_replay` is part of the eval resume key and a first-class report/dashboard dimension; rows predating the knob count as `full` (that is what they ran).
+- **Claude rows are unaffected.** The Anthropic client drops returned thinking blocks rather than capturing them into history, so the knob is request-inert there; carrying thinking across turns natively is deferred pending evidence it moves scores.
+
+## Alternatives considered
+
+- **Default `keep-last`** (the knob's initial default while evidence was pending). A reasonable middle ground — but it measured slightly *below* `none` overall, still pays a replay cost, and busts rolling prompt-cache prefixes (earlier messages re-serialize differently each turn). Rejected once the grid showed `none` is quality-free.
+- **Default `full` (legacy).** Preserves bug-for-bug continuity, but it is the most expensive policy, delivers no measured score benefit, and is the very accumulation pathology that motivated the knob.
+- **Drop replay entirely (no knob).** Simplest, but unfalsifiable — `full`/`keep-last` exist precisely so the policy stays a measured variable and per-model exceptions (e.g. the Ministral-Q4 caveat) remain one flag away.
diff --git a/docs/results/dashboard.html b/docs/results/dashboard.html
index 530f8d6..1519700 100644
--- a/docs/results/dashboard.html
+++ b/docs/results/dashboard.html
@@ -4,19 +4,20 @@
{const ul=E?.[String(x.gen)];return ul?`gen ${x.gen}: ${ul.note} (commit ${ul.commit}, ${ul.date})`:`gen ${x.gen}`})(),children:zr(x.gen)}),x.retired&&_.jsx("span",{className:"ml-1.5 align-middle text-[0.55rem] uppercase tracking-wider text-zinc-500 border border-zinc-700 rounded px-1",children:"retired"})]}),_.jsx("td",{className:`p-1.5 text-right tabular-nums ${Eu(x.score)}`,children:Xn(x.score,1)}),_.jsx("td",{className:`p-1.5 text-right tabular-nums ${Eu(x.accuracy)}`,children:Xn(x.accuracy,1)}),_.jsx("td",{className:`p-1.5 text-right tabular-nums ${Eu(x.completeness)}`,children:Xn(x.completeness,1)}),_.jsx("td",{className:`p-1.5 text-right tabular-nums ${Eu(x.efficiency)}`,children:Xn(x.efficiency)}),_.jsx("td",{className:"p-1.5 text-right tabular-nums text-zinc-400",children:x.wasted.toFixed(1)}),_.jsxs("td",{className:"p-1.5 text-right tabular-nums text-zinc-400",children:[x.speed.toFixed(1),"s"]}),_.jsx("td",{className:"p-1.5 text-right tabular-nums text-zinc-500",children:x.n}),M.map(ul=>{const zl=x.scenarios[ul],$=x.scenarioRuns?.[ul]??0;let nl,al;return zl!=null?(nl=String(zl),al=Eu(zl)):$===0?(nl="I",al="text-zinc-700"):(nl="—",al="text-zinc-600"),_.jsx("td",{className:`p-1.5 text-right tabular-nums ${al}`,children:nl},ul)})]})]},x.label)})})]})})}const Nr=[{key:"score",label:"Score",fmt:h=>h==null?"—":`${h.toFixed(1)}%`,higherBetter:!0},{key:"accuracy",label:"Accuracy",fmt:h=>h==null?"—":`${h.toFixed(1)}%`,higherBetter:!0},{key:"completeness",label:"Completeness",fmt:h=>h==null?"—":`${h.toFixed(1)}%`,higherBetter:!0},{key:"efficiency",label:"Efficiency",fmt:h=>h==null?"—":`${h.toFixed(1)}%`,higherBetter:!0},{key:"wasted",label:"Avg Wasted",fmt:h=>h==null?"—":h.toFixed(1),higherBetter:!1},{key:"speed",label:"Speed",fmt:h=>h==null?"—":`${h.toFixed(1)}s`,higherBetter:!1}];function Ho({va:h,vb:M,higherBetter:C}){if(h==null||M==null)return _.jsx("td",{className:"p-1.5 text-right text-zinc-600",children:"—"});const 
o=M-h,Y=(o>0?"+":"")+(Number.isInteger(o)?o:o.toFixed(1));let H="text-zinc-500";return o!==0&&(H=o>0===C?"text-emerald-400":"text-red-400"),_.jsx("td",{className:`p-1.5 text-right tabular-nums font-medium ${H}`,children:Y})}function Ur({a:h,b:M,scenarios:C,scenarioAbbrev:o,onSwap:q,onClear:Y}){const H=(R,p)=>p in R.scenarios?R.scenarios[p]:R[p]??null;return _.jsxs("div",{className:"mt-6 border border-zinc-800 rounded-lg p-4 max-w-2xl",children:[_.jsxs("div",{className:"flex items-center justify-between mb-3",children:[_.jsx("h3",{className:"text-sm font-semibold",children:"Compare"}),_.jsxs("div",{className:"flex gap-2",children:[_.jsx("button",{onClick:q,className:"text-xs px-2.5 py-1 rounded border border-zinc-700 hover:border-zinc-500 transition-colors",children:"Swap A↔B"}),_.jsx("button",{onClick:Y,className:"text-xs px-2.5 py-1 rounded border border-zinc-700 hover:border-red-500/50 hover:text-red-400 transition-colors",children:"Clear"})]})]}),_.jsxs("table",{className:"text-xs w-full border-collapse",children:[_.jsx("thead",{children:_.jsxs("tr",{className:"border-b border-zinc-800",children:[_.jsx("th",{className:"p-1.5 text-left text-zinc-500",children:"Metric"}),_.jsx("th",{className:"p-1.5 text-right text-zinc-400 max-w-48 truncate",title:h.label,children:h.label}),_.jsx("th",{className:"p-1.5 text-right text-zinc-500 w-16",children:"Delta"}),_.jsx("th",{className:"p-1.5 text-right text-zinc-400 max-w-48 truncate",title:M.label,children:M.label})]})}),_.jsxs("tbody",{children:[Nr.map(R=>{const p=H(h,R.key),E=H(M,R.key);return _.jsxs("tr",{className:"border-b border-zinc-900/50",children:[_.jsx("td",{className:"p-1.5 text-zinc-400",children:R.label}),_.jsx("td",{className:"p-1.5 text-right tabular-nums",children:R.fmt(p)}),_.jsx(Ho,{va:p,vb:E,higherBetter:R.higherBetter}),_.jsx("td",{className:"p-1.5 text-right tabular-nums",children:R.fmt(E)})]},R.key)}),_.jsx("tr",{children:_.jsx("td",{colSpan:4,className:"py-1",children:_.jsx("div",{className:"border-t 
border-zinc-800"})})}),C.map(R=>{const p=h.scenarios[R],E=M.scenarios[R],G=(U,x)=>U!=null?`${U}%`:(x.scenarioRuns?.[R]??0)===0?"I":"—";return _.jsxs("tr",{className:"border-b border-zinc-900/50",children:[_.jsx("td",{className:"p-1.5 text-zinc-500",children:o[R]||R}),_.jsx("td",{className:"p-1.5 text-right tabular-nums",children:G(p,h)}),_.jsx(Ho,{va:p,vb:E,higherBetter:!0}),_.jsx("td",{className:"p-1.5 text-right tabular-nums",children:G(E,M)})]},R)})]})]})]})}function Cr(h){const M={};for(const C of gf)M[C]=new Set(h.map(o=>o[C]));return M}function Rr(){const[h,M]=ml.useState(null),[C,o]=ml.useState(null),[q,Y]=ml.useState({col:"score",asc:!1}),[H,R]=ml.useState([]),[p,E]=ml.useState("reforged"),[G,U]=ml.useState("all"),[x,P]=ml.useState("all"),[k,vl]=ml.useState("all"),[ul,zl]=ml.useState(!1);ml.useEffect(()=>{Sr().then(g=>{M(g),o(Cr(g.rows))})},[]);const $=ml.useMemo(()=>h?ul?h.rows:h.rows.filter(g=>!g.retired):[],[h,ul]),nl=ml.useMemo(()=>h?h.rows.some(g=>g.retired):!1,[h]),al=ml.useMemo(()=>!h||!C?[]:Co($,p).filter(O=>gf.every(N=>!C[N]||C[N].has(O[N]))),[h,C,p,$]),{rows:Hl,scenarios:Tl}=ml.useMemo(()=>Er(al,h?.scenarios??[],x,k,h?.scenarioSuite??{}),[al,h,x,k]),W=ml.useMemo(()=>{if(!h)return{};const g=h.scenarioAbbrev,O=new Set(Tl),N={};for(const[X,K]of Object.entries(g))O.has(X)&&(N[X]=K);return N},[h,Tl]),Bl=ml.useMemo(()=>hf.find(g=>g.id===G)??hf[0],[G]),{sorted:Kl,groups:$l}=ml.useMemo(()=>pr(Hl,Bl,q,Tl,p),[Hl,Bl,q,Tl,p]),kl=ml.useMemo(()=>al.reduce((g,O)=>g+O.n*Tl.length,0),[al,Tl]),fl=ml.useCallback((g,O,N)=>{o(X=>{if(!X)return X;const K={...X,[g]:new Set(X[g])};return 
N?K[g].add(O):K[g].delete(O),K}),R([])},[]),Rt=ml.useCallback(g=>{E(g),R([])},[]),_t=ml.useCallback(g=>{U(g),R([])},[]),ut=ml.useCallback(g=>{P(g),R([])},[]),z=ml.useCallback(g=>{vl(g),R([])},[]),D=ml.useCallback(g=>{zl(g),R([])},[]),V=ml.useCallback(g=>{Y(O=>O.col===g?{col:g,asc:!O.asc}:{col:g,asc:g==="label"})},[]),dl=ml.useCallback((g,O)=>{R(N=>O?N.length>=2?[N[1],g]:[...N,g]:N.filter(X=>X!==g))},[]),yl=ml.useCallback(()=>{R(g=>[...g].reverse())},[]),d=ml.useCallback(()=>{R([])},[]);return!h||!C?_.jsx("div",{className:"flex items-center justify-center min-h-screen text-zinc-500",children:"Loading..."}):_.jsxs("div",{className:"flex min-h-screen",children:[_.jsx(xr,{rows:$,filters:C,onFilterChange:fl,activeScreen:p,onScreenChange:Rt,activeView:G,onViewChange:_t,scenarioScope:x,onScopeChange:ut,suiteScope:k,onSuiteChange:z,showRetired:ul,onShowRetiredChange:D,hasRetired:nl,filteredCount:al.length,totalCount:Co($,p).length,totalRuns:kl,timestamp:h.timestamp}),_.jsxs("main",{className:"flex-1 min-w-0 p-4 flex flex-col",children:[_.jsx(Dr,{rows:Kl,scenarios:Tl,scenarioAbbrev:W,sort:q,onSort:V,checked:H,onCompareToggle:dl,groups:$l,maxGen:h.maxGen??0,genInfo:h.genInfo}),H.length===2&&_.jsx(Ur,{a:Kl[H[0]],b:Kl[H[1]],scenarios:Tl,scenarioAbbrev:W,onSwap:yl,onClear:d}),_.jsxs("p",{className:"text-[0.6rem] text-zinc-600 mt-6",children:["Generated ",h.timestamp]})]})]})}rr.createRoot(document.getElementById("root")).render(_.jsx(ml.StrictMode,{children:_.jsx(Rr,{})})); - +
+