From 7a6bc856e5b64ed15671b07dfd7b41f28c082003 Mon Sep 17 00:00:00 2001
From: Antoine Zambelli
Date: Tue, 2 Jun 2026 10:41:42 -0500
Subject: [PATCH 01/14] feat(reasoning): add reasoning_replay knob (full/keep-last/none)

Bound reasoning accumulation in the forge->backend direction. Adds
core/reasoning.py policy module and threads reasoning_replay through
inference, runner, and proxy convert/handler paths. keep-last emits
reasoning via reasoning_content for round-trip re-capture and trims
older reasoning; none strips it; full preserves prior behavior. The
Anthropic path drops reasoning under keep-last (no signable channel).
Includes docs (README, BACKEND_SETUP, USER_GUIDE) and unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 README.md                                  |   2 +-
 docs/BACKEND_SETUP.md                      |   2 +-
 docs/USER_GUIDE.md                         |   4 +
 src/forge/__init__.py                      |  12 +-
 src/forge/core/inference.py                |  65 ++++++++--
 src/forge/core/reasoning.py                |  53 ++++++++
 src/forge/core/runner.py                   |   6 +
 src/forge/proxy/__main__.py                |   9 ++
 src/forge/proxy/convert.py                 |  44 +++++--
 src/forge/proxy/convert_anthropic.py       |   9 +-
 src/forge/proxy/handler.py                 |  43 +++++--
 src/forge/proxy/proxy.py                   |   6 +
 src/forge/proxy/server.py                  |   4 +
 tests/unit/test_inference_passthrough.py   |  63 +++++++++-
 tests/unit/test_proxy_convert.py           |  62 +++++++++-
 tests/unit/test_proxy_convert_anthropic.py |  18 ++-
 tests/unit/test_proxy_handler.py           |  37 +++++-
 tests/unit/test_reasoning_replay.py        | 136 +++++++++++++++++++++
 18 files changed, 537 insertions(+), 38 deletions(-)
 create mode 100644 src/forge/core/reasoning.py
 create mode 100644 tests/unit/test_reasoning_replay.py

diff --git a/README.md b/README.md
index e1ed4a6..e1b827a 100644
--- a/README.md
+++ b/README.md
@@ -128,7 +128,7 @@ For multi-step workflows, multi-turn conversations, and backend auto-management,
 Drop-in proxy that sits between any client and a local model server, speaking both the OpenAI chat-completions API and the Anthropic Messages API (`/v1/messages`).
Point your client at the proxy (e.g. `http://localhost:8081/v1`) and forge applies its guardrails transparently — the client thinks it's talking to a smarter model. -This is the path for **using forge with an existing harness** (opencode, Continue, aider, Cline, anything that speaks the OpenAI chat-completions schema — or Claude Code, which speaks the Anthropic Messages API). No Python rewrite. +This is the path for **using forge with an existing harness** (opencode, Continue, aider, Cline, anything that speaks the OpenAI chat-completions schema — or Claude Code, which speaks the Anthropic Messages API). No Python rewrite. Reasoning replay defaults to `keep-last`, so Forge captures reasoning for observability and replays only the latest available reasoning block to the backend on later turns; use `--reasoning-replay full` for the historical replay-all behavior or `--reasoning-replay none` to keep captured reasoning out of backend-facing history. ```bash # External mode — you manage the backend, forge proxies it diff --git a/docs/BACKEND_SETUP.md b/docs/BACKEND_SETUP.md index 75e667d..26702d3 100644 --- a/docs/BACKEND_SETUP.md +++ b/docs/BACKEND_SETUP.md @@ -75,7 +75,7 @@ llamafile --server --nobrowser -m path/to/model.gguf --port 8080 -ngl 999 `LlamafileClient` is **native-first**: `mode="native"` (the default) forwards tools via the backend's `tools` parameter and requires native function calling (llama.cpp with `--jinja`). For a backend without native FC, declare `mode="prompt"` to inject tool descriptions into the prompt and parse the JSON call back out. The capability is declared at construction and frozen — there is no runtime auto-detection. 
Native-first is the default because local-model FC support has matured into the more reliable path; prompt-injection stays fully supported as an explicit opt-in, but note that on more complex, multi-step interactions models tend to struggle to drive the prompt-injected protocol reliably, so reach for it only when the backend leaves no alternative. -> **Proxy note:** the OpenAI-compatible proxy is **native-first**. By default (`--backend-capability native`) it forwards the client's tools verbatim to an FC-capable backend (llama.cpp with `--jinja`, vLLM, Ollama, Anthropic) — the recommended setup. For a non-FC llama.cpp/llamafile backend, opt into prompt-injection with `--backend-capability prompt` (strips tools into the prompt, parses the JSON call back; reuses the same prompt path as the WorkflowRunner). The choice is frozen at startup — there is no runtime auto-detect in the proxy. See ADR-012. +> **Proxy note:** the OpenAI-compatible proxy is **native-first**. By default (`--backend-capability native`) it forwards the client's tools verbatim to an FC-capable backend (llama.cpp with `--jinja`, vLLM, Ollama, Anthropic) — the recommended setup. For a non-FC llama.cpp/llamafile backend, opt into prompt-injection with `--backend-capability prompt` (strips tools into the prompt, parses the JSON call back; reuses the same prompt path as the WorkflowRunner). The choice is frozen at startup — there is no runtime auto-detect in the proxy. Reasoning replay is controlled separately with `--reasoning-replay {full,keep-last,none}`; the default `keep-last` replays only the latest captured reasoning block to the backend when that reasoning is available in the conversation history. See ADR-012. 
Smoke-test: diff --git a/docs/USER_GUIDE.md b/docs/USER_GUIDE.md index ac04a1f..e8ec90f 100644 --- a/docs/USER_GUIDE.md +++ b/docs/USER_GUIDE.md @@ -85,6 +85,8 @@ claude **Function-calling capability.** `--backend-capability native` (default) uses the backend's chat-template tool-calling and is the smoother default for Claude Code's heavy multi-turn tool use. `--backend-capability prompt` injects the tool surface into the prompt for llama.cpp/llamafile backends without a tool-calling template; whether a model stays coherent across multi-turn tool results in prompt mode varies by model — and tends to degrade on more complex, multi-step interactions — so prefer native whenever the backend supports it. The capability is declared at startup and frozen. +**Reasoning replay.** Reasoning-capable backends may return hidden reasoning alongside tool calls. Forge captures that reasoning for observability, then controls how much is replayed to the backend on later turns with `--reasoning-replay {full,keep-last,none}`. The default is `keep-last`: only the latest captured reasoning block is replayed. `full` preserves the historical behavior and replays every captured reasoning block. `none` keeps reasoning out of backend-facing history. In OpenAI-compatible proxy responses, `keep-last` exposes current reasoning as `reasoning_content` instead of normal assistant `content` so clients that preserve reasoning fields can replay only the latest block without turning it into plain text. Anthropic proxy responses only emit reasoning text under `full`; Forge does not synthesize signed Anthropic thinking blocks, so default Anthropic proxy responses do not expose replayable reasoning. + **Downstream protocol.** - **Local model (default, `--backend-protocol openai`)** — forge translates Claude Code's Anthropic requests to OpenAI for llama.cpp / Ollama and converts the reply back to Anthropic SSE. 
Anthropic-only fields with no OpenAI analog (`cache_control`, `thinking`, `document` blocks) are dropped at that boundary; see [ADR-015](decisions/015-cache-control-preservation-path1.md). @@ -283,6 +285,8 @@ await server.stop() `WorkflowRunner` accepts an optional `on_message` callback that fires each time a `Message` is appended to the conversation during `run()`. This is the primary observability hook — use it for logging, eval metric collection, or building conversation history for multi-turn flows. +`WorkflowRunner(reasoning_replay=...)` uses the same policy as the proxy: `keep-last` by default, `full` for the historical replay-all behavior, and `none` to avoid replaying captured reasoning to the backend. The policy affects backend-facing serialization only; `MessageType.REASONING` entries still appear in `on_message` and internal history unless context compaction removes them. + - **Single-turn (default):** `on_message` fires for every message the runner creates — system prompt, user input, assistant responses, tool results, nudges. - **Multi-turn (`initial_messages`):** `run()` accepts an optional `initial_messages` parameter that seeds the conversation with prior history. `on_message` fires **only for new messages created during this turn**, not for the replayed history. 
diff --git a/src/forge/__init__.py b/src/forge/__init__.py index b6b8a27..7680419 100644 --- a/src/forge/__init__.py +++ b/src/forge/__init__.py @@ -17,7 +17,13 @@ Workflow, ) from forge.core.steps import StepTracker -from forge.core.inference import InferenceResult, fold_and_serialize, run_inference +from forge.core.inference import ( + InferenceResult, + fold_and_serialize, + prepare_backend_messages, + run_inference, +) +from forge.core.reasoning import DEFAULT_REASONING_REPLAY, REASONING_REPLAY_CHOICES, ReasoningReplay from forge.core.runner import WorkflowRunner from forge.core.slot_worker import SlotWorker from forge.clients.base import ChunkType, LLMClient, StreamChunk, TokenUsage @@ -87,7 +93,11 @@ # Inference (front half — shared by runner and proxy) "InferenceResult", "fold_and_serialize", + "prepare_backend_messages", "run_inference", + "DEFAULT_REASONING_REPLAY", + "REASONING_REPLAY_CHOICES", + "ReasoningReplay", # Runner "WorkflowRunner", # Slot worker diff --git a/src/forge/core/inference.py b/src/forge/core/inference.py index 599b5e7..aa1d365 100644 --- a/src/forge/core/inference.py +++ b/src/forge/core/inference.py @@ -23,6 +23,12 @@ ) from forge.context.manager import ContextManager from forge.core.messages import Message, MessageMeta, MessageRole, MessageType, ToolCallInfo +from forge.core.reasoning import ( + DEFAULT_REASONING_REPLAY, + ReasoningReplay, + filter_openai_reasoning_messages, + validate_reasoning_replay, +) from forge.core.workflow import LLMResponse, TextResponse, ToolCall, ToolSpec from forge.errors import StreamError, ToolCallError from forge.guardrails import ErrorTracker, ResponseValidator @@ -77,19 +83,32 @@ class InferenceResult: def fold_and_serialize( messages: list[Message], api_format: str, + reasoning_replay: ReasoningReplay = DEFAULT_REASONING_REPLAY, ) -> list[dict[str, Any]]: """Reasoning-fold and serialize forge Messages to API dicts. 
- Folds REASONING messages into the following TOOL_CALL message's content - field so the wire format has one assistant message with both content and - tool_calls (valid OpenAI format). Internal Message list stays separate - for compaction. + ``full`` folds every REASONING message into the following TOOL_CALL + message's content field, preserving the historical wire behavior. + ``keep-last`` folds only the most recent REASONING message in the + serialized history. ``none`` skips all REASONING messages on the wire. + Internal Message history stays separate for compaction and observability. """ + reasoning_replay = validate_reasoning_replay(reasoning_replay) api_messages: list[dict[str, Any]] = [] pending_reasoning: str | None = None + last_reasoning_index: int | None = None + + if reasoning_replay == "keep-last": + for i, m in enumerate(messages): + if m.metadata.type == MessageType.REASONING and m.role == MessageRole.ASSISTANT: + last_reasoning_index = i - for m in messages: + for i, m in enumerate(messages): if m.metadata.type == MessageType.REASONING and m.role == MessageRole.ASSISTANT: + if reasoning_replay == "none": + continue + if reasoning_replay == "keep-last" and i != last_reasoning_index: + continue pending_reasoning = m.content continue d = m.to_api_dict(format=api_format) @@ -107,6 +126,29 @@ def fold_and_serialize( return api_messages +def prepare_backend_messages( + messages: list[Message], + api_format: str, + reasoning_replay: ReasoningReplay = DEFAULT_REASONING_REPLAY, + raw_openai_messages: RawOpenAIMessages | None = None, + use_raw_messages: bool = False, +) -> list[dict[str, Any]]: + """Prepare backend-facing messages from raw OpenAI or forge history. + + This is the single backend replay-policy choke point. Raw OpenAI messages + preserve client-authored shape while filtering only reasoning fields; forge + history is folded with the same reasoning replay policy. 
+ """ + reasoning_replay = validate_reasoning_replay(reasoning_replay) + if use_raw_messages and raw_openai_messages is not None: + return filter_openai_reasoning_messages( + raw_openai_messages, reasoning_replay=reasoning_replay, + ) + return fold_and_serialize( + messages, api_format, reasoning_replay=reasoning_replay, + ) + + def _build_tool_call_infos( tool_calls: list[ToolCall], tool_call_counter: int, @@ -138,6 +180,7 @@ async def run_inference( inbound_anthropic_body: dict[str, Any] | None = None, raw_openai_messages: RawOpenAIMessages | None = None, raw_openai_tools: RawOpenAITools | None = None, + reasoning_replay: ReasoningReplay = DEFAULT_REASONING_REPLAY, ) -> InferenceResult | None: """Send messages to the LLM with compaction, folding, validation, and retry. @@ -177,6 +220,7 @@ async def run_inference( ToolCallError: If retry budget (max_retries) is exhausted. StreamError: If streaming ends without a FINAL chunk. """ + reasoning_replay = validate_reasoning_replay(reasoning_replay) api_format = getattr(client, "api_format", "ollama") new_messages: list[Message] = [] max_retries = error_tracker.max_retries @@ -219,10 +263,13 @@ async def run_inference( and compacted is messages and not context_warning ) - if use_raw_messages: - api_messages = raw_openai_messages - else: - api_messages = fold_and_serialize(messages, api_format) + api_messages = prepare_backend_messages( + messages, + api_format, + reasoning_replay=reasoning_replay, + raw_openai_messages=raw_openai_messages, + use_raw_messages=use_raw_messages, + ) # Inject context warning as transient user message (not persisted # in conversation history). 
Uses "user" role because mid-conversation diff --git a/src/forge/core/reasoning.py b/src/forge/core/reasoning.py new file mode 100644 index 0000000..df0acec --- /dev/null +++ b/src/forge/core/reasoning.py @@ -0,0 +1,53 @@ +"""Reasoning replay policy shared by runner and proxy.""" + +from __future__ import annotations + +from copy import deepcopy +from typing import Any, Literal + + +ReasoningReplay = Literal["full", "keep-last", "none"] +REASONING_REPLAY_CHOICES: tuple[ReasoningReplay, ...] = ("full", "keep-last", "none") +DEFAULT_REASONING_REPLAY: ReasoningReplay = "keep-last" + + +def validate_reasoning_replay(value: str) -> ReasoningReplay: + """Validate and normalize a reasoning replay policy.""" + if value not in REASONING_REPLAY_CHOICES: + choices = ", ".join(REASONING_REPLAY_CHOICES) + raise ValueError(f"reasoning_replay must be one of: {choices}") + return value # type: ignore[return-value] + + +REASONING_MESSAGE_FIELDS = ("reasoning_content", "reasoning", "reasoning_text") + + +def filter_openai_reasoning_messages( + messages: list[dict[str, Any]], + reasoning_replay: ReasoningReplay = DEFAULT_REASONING_REPLAY, +) -> list[dict[str, Any]]: + """Copy raw OpenAI messages and apply the reasoning replay policy. + + Non-reasoning fields are preserved verbatim so proxy passthrough keeps + client-authored extensions, multimodal blocks, names, and other metadata. 
+ """ + reasoning_replay = validate_reasoning_replay(reasoning_replay) + filtered = [deepcopy(msg) for msg in messages] + if reasoning_replay == "full": + return filtered + + last_reasoning_index: int | None = None + if reasoning_replay == "keep-last": + for i, msg in enumerate(filtered): + if msg.get("role") == "assistant" and any( + msg.get(field) for field in REASONING_MESSAGE_FIELDS + ): + last_reasoning_index = i + + for i, msg in enumerate(filtered): + if msg.get("role") != "assistant": + continue + if reasoning_replay == "none" or i != last_reasoning_index: + for field in REASONING_MESSAGE_FIELDS: + msg.pop(field, None) + return filtered diff --git a/src/forge/core/runner.py b/src/forge/core/runner.py index 79c2c0e..a880595 100644 --- a/src/forge/core/runner.py +++ b/src/forge/core/runner.py @@ -11,6 +11,7 @@ from forge.context.manager import ContextManager from forge.core.inference import _NUDGE_KIND_TO_TYPE, _build_tool_call_infos, run_inference from forge.core.messages import Message, MessageMeta, MessageRole, MessageType, ToolCallInfo +from forge.core.reasoning import DEFAULT_REASONING_REPLAY, ReasoningReplay, validate_reasoning_replay from forge.core.workflow import ToolCall, TextResponse, Workflow, ToolSpec from forge.errors import MaxIterationsError, PrerequisiteError, StepEnforcementError, ToolCallError, ToolExecutionError, ToolResolutionError, WorkflowCancelledError from forge.guardrails import ErrorTracker, ResponseValidator, StepEnforcer @@ -42,6 +43,7 @@ def __init__( on_message: Callable[[Message], None] | None = None, rescue_enabled: bool = True, retry_nudge: Callable[[str], str] | str | None = None, + reasoning_replay: ReasoningReplay = DEFAULT_REASONING_REPLAY, ): """ Args: @@ -65,6 +67,8 @@ def __init__( retry_nudge: Custom nudge for bare text responses. Pass a string for a static message, or a callable ``(raw_response) -> str`` for dynamic nudges. If None, uses the default. 
+ reasoning_replay: How much captured reasoning to replay to the + backend on later turns: ``full``, ``keep-last``, or ``none``. """ self.client = client self.context_manager = context_manager @@ -75,6 +79,7 @@ def __init__( self.on_chunk = on_chunk self.on_message = on_message self.rescue_enabled = rescue_enabled + self.reasoning_replay = validate_reasoning_replay(reasoning_replay) if isinstance(retry_nudge, str): self._retry_nudge_fn: Callable[[str], str] | None = lambda _raw, _msg=retry_nudge: _msg else: @@ -180,6 +185,7 @@ def _emit(msg: Message) -> None: max_attempts=self.max_iterations - iteration, stream=self.stream, on_chunk=self.on_chunk, + reasoning_replay=self.reasoning_replay, ) # max_attempts exhausted — iteration budget spent if result is None: diff --git a/src/forge/proxy/__main__.py b/src/forge/proxy/__main__.py index 3b55ac9..d29f61a 100644 --- a/src/forge/proxy/__main__.py +++ b/src/forge/proxy/__main__.py @@ -8,6 +8,7 @@ import sys import time +from forge.core.reasoning import DEFAULT_REASONING_REPLAY, REASONING_REPLAY_CHOICES from forge.proxy.proxy import ProxyServer from forge.server import BudgetMode @@ -85,6 +86,13 @@ def main() -> None: help="Inject forge's synthetic respond() tool when the client sends " "tools (keeps small models in tool-calling mode). 
Default off.", ) + parser.add_argument( + "--reasoning-replay", + choices=REASONING_REPLAY_CHOICES, + default=DEFAULT_REASONING_REPLAY, + help="How much captured reasoning to replay to the backend " + "(default: keep-last).", + ) parser.add_argument("--verbose", "-v", action="store_true", help="Verbose logging") args = parser.parse_args() @@ -124,6 +132,7 @@ def main() -> None: inject_respond_tool=args.inject_respond_tool, backend_protocol=args.backend_protocol, backend_timeout=args.backend_timeout, + reasoning_replay=args.reasoning_replay, ) def _shutdown(sig: int, _frame: object) -> None: diff --git a/src/forge/proxy/convert.py b/src/forge/proxy/convert.py index 33fbd5f..2af7819 100644 --- a/src/forge/proxy/convert.py +++ b/src/forge/proxy/convert.py @@ -7,6 +7,7 @@ from typing import Any from forge.core.messages import Message, MessageMeta, MessageRole, MessageType, ToolCallInfo +from forge.core.reasoning import DEFAULT_REASONING_REPLAY, ReasoningReplay, validate_reasoning_replay from forge.core.workflow import ToolCall, TextResponse @@ -42,6 +43,17 @@ def openai_to_messages(openai_messages: list[dict[str, Any]]) -> list[Message]: )) elif role_str == "assistant": + reasoning = ( + msg.get("reasoning_content") + or msg.get("reasoning") + or msg.get("reasoning_text") + ) + if reasoning: + messages.append(Message( + MessageRole.ASSISTANT, + str(reasoning), + MessageMeta(MessageType.REASONING), + )) if "tool_calls" in msg and msg["tool_calls"]: tc_infos = [] for tc in msg["tool_calls"]: @@ -61,7 +73,7 @@ def openai_to_messages(openai_messages: list[dict[str, Any]]) -> list[Message]: MessageMeta(MessageType.TOOL_CALL), tool_calls=tc_infos, )) - else: + elif content: messages.append(Message( MessageRole.ASSISTANT, content, @@ -96,8 +108,10 @@ def tool_calls_to_openai( tool_calls: list[ToolCall], model: str = "forge", usage: Any | None = None, + reasoning_replay: ReasoningReplay = DEFAULT_REASONING_REPLAY, ) -> dict[str, Any]: """Convert forge ToolCalls to an OpenAI 
chat completions response object.""" + reasoning_replay = validate_reasoning_replay(reasoning_replay) tc_list = [] for i, tc in enumerate(tool_calls): tc_list.append({ @@ -109,17 +123,22 @@ def tool_calls_to_openai( }, }) + reasoning = tool_calls[0].reasoning if tool_calls else None + message: dict[str, Any] = { + "role": "assistant", + "content": reasoning if reasoning_replay == "full" else None, + "tool_calls": tc_list, + } + if reasoning and reasoning_replay == "keep-last": + message["reasoning_content"] = reasoning + response = { "id": f"chatcmpl-{uuid.uuid4().hex[:12]}", "object": "chat.completion", "model": model, "choices": [{ "index": 0, - "message": { - "role": "assistant", - "content": tool_calls[0].reasoning or None, - "tool_calls": tc_list, - }, + "message": message, "finish_reason": "tool_calls", }], "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}, @@ -172,24 +191,31 @@ def tool_calls_to_sse_events( tool_calls: list[ToolCall], model: str = "forge", usage: Any | None = None, + reasoning_replay: ReasoningReplay = DEFAULT_REASONING_REPLAY, ) -> list[dict[str, Any]]: """Convert forge ToolCalls to a sequence of SSE chunk objects. Returns the complete list of chunk dicts ready to be formatted as SSE data lines. The caller handles the actual SSE wire format. 
""" + reasoning_replay = validate_reasoning_replay(reasoning_replay) cmpl_id = f"chatcmpl-{uuid.uuid4().hex[:12]}" events: list[dict[str, Any]] = [] - # If there's reasoning, send it as a content delta first - if tool_calls[0].reasoning: + reasoning = tool_calls[0].reasoning if tool_calls else None + if reasoning and reasoning_replay != "none": + delta: dict[str, Any] = {"role": "assistant"} + if reasoning_replay == "full": + delta["content"] = reasoning + else: + delta["reasoning_content"] = reasoning events.append({ "id": cmpl_id, "object": "chat.completion.chunk", "model": model, "choices": [{ "index": 0, - "delta": {"role": "assistant", "content": tool_calls[0].reasoning}, + "delta": delta, "finish_reason": None, }], }) diff --git a/src/forge/proxy/convert_anthropic.py b/src/forge/proxy/convert_anthropic.py index d7e5d55..df40f9b 100644 --- a/src/forge/proxy/convert_anthropic.py +++ b/src/forge/proxy/convert_anthropic.py @@ -11,6 +11,7 @@ from typing import Any from forge.core.messages import Message, MessageMeta, MessageRole, MessageType, ToolCallInfo +from forge.core.reasoning import DEFAULT_REASONING_REPLAY, ReasoningReplay, validate_reasoning_replay from forge.core.workflow import ToolCall, ToolSpec @@ -233,11 +234,13 @@ def tool_calls_to_anthropic( tool_calls: list[ToolCall], model: str = "forge", usage: Any | None = None, + reasoning_replay: ReasoningReplay = DEFAULT_REASONING_REPLAY, ) -> dict[str, Any]: """Convert forge ToolCalls to an Anthropic Messages API response object.""" + reasoning_replay = validate_reasoning_replay(reasoning_replay) blocks: list[dict[str, Any]] = [] - if tool_calls and tool_calls[0].reasoning: + if tool_calls and tool_calls[0].reasoning and reasoning_replay == "full": blocks.append({"type": "text", "text": tool_calls[0].reasoning}) for tc in tool_calls: @@ -284,6 +287,7 @@ def tool_calls_to_anthropic_sse( tool_calls: list[ToolCall], model: str = "forge", usage: Any | None = None, + reasoning_replay: ReasoningReplay = 
DEFAULT_REASONING_REPLAY, ) -> list[dict[str, Any]]: """Build the Anthropic SSE event sequence for a tool-use response. @@ -291,6 +295,7 @@ def tool_calls_to_anthropic_sse( formatter reads that to emit ``event: `` lines. Spec: https://platform.claude.com/docs/en/build-with-claude/streaming """ + reasoning_replay = validate_reasoning_replay(reasoning_replay) au = _anthropic_usage(usage) msg_id = f"msg_{uuid.uuid4().hex[:24]}" events: list[dict[str, Any]] = [] @@ -313,7 +318,7 @@ def tool_calls_to_anthropic_sse( # Reasoning text first, if present. reasoning = tool_calls[0].reasoning if tool_calls else None - if reasoning: + if reasoning and reasoning_replay == "full": events.append({ "type": "content_block_start", "index": block_idx, diff --git a/src/forge/proxy/handler.py b/src/forge/proxy/handler.py index c19dcd2..3e3712c 100644 --- a/src/forge/proxy/handler.py +++ b/src/forge/proxy/handler.py @@ -8,7 +8,12 @@ from forge.clients.base import LLMClient, format_tool from forge.context.manager import ContextManager -from forge.core.inference import _get_usage, fold_and_serialize, run_inference +from forge.core.inference import _get_usage, prepare_backend_messages, run_inference +from forge.core.reasoning import ( + DEFAULT_REASONING_REPLAY, + ReasoningReplay, + validate_reasoning_replay, +) from forge.core.workflow import ToolCall, ToolSpec, TextResponse from forge.errors import ToolCallError from forge.guardrails import ErrorTracker, ResponseValidator @@ -125,6 +130,7 @@ async def handle_chat_completions( native_passthrough: bool = True, inject_respond_tool: bool = False, protocol: Literal["openai", "anthropic"] = "openai", + reasoning_replay: ReasoningReplay = DEFAULT_REASONING_REPLAY, ) -> dict[str, Any] | list[dict[str, Any]]: """Handle an inbound completions request. @@ -153,11 +159,14 @@ async def handle_chat_completions( untouched unless explicitly opted in. protocol: Inbound wire format. 
``openai`` for ``/v1/chat/completions``; ``anthropic`` for ``/v1/messages``. + reasoning_replay: How much captured reasoning to replay to the + backend and expose to clients. Returns: If stream=false: a single response dict (protocol-shaped). If stream=true: a list of SSE event dicts (protocol-shaped). """ + reasoning_replay = validate_reasoning_replay(reasoning_replay) is_stream = body.get("stream", False) model_name = body.get("model", "forge") @@ -229,7 +238,13 @@ async def handle_chat_completions( if not tool_specs: logger.info("No tools in request, passing through to backend") api_format = getattr(client, "api_format", "ollama") - api_messages = raw_messages_for_backend or fold_and_serialize(messages, api_format) + api_messages = prepare_backend_messages( + messages, + api_format, + reasoning_replay=reasoning_replay, + raw_openai_messages=raw_messages_for_backend, + use_raw_messages=raw_messages_for_backend is not None, + ) response = await client.send( api_messages, tools=None, sampling=sampling, passthrough=passthrough, inbound_anthropic_body=inbound_anthropic_body, @@ -256,6 +271,7 @@ async def handle_chat_completions( inbound_anthropic_body=inbound_anthropic_body, raw_openai_messages=raw_messages_for_backend, raw_openai_tools=raw_tools_for_backend, + reasoning_replay=reasoning_replay, ) except ToolCallError as exc: # Retries exhausted — the model kept returning text instead of tool @@ -289,7 +305,10 @@ async def handle_chat_completions( if other_calls: # Real tool calls (possibly mixed with respond) — return the # real tool calls only, drop respond. 
- return _emit_tool_calls(other_calls, model_name, protocol, is_stream, usage=usage) + return _emit_tool_calls( + other_calls, model_name, protocol, is_stream, usage=usage, + reasoning_replay=reasoning_replay, + ) # Shouldn't happen, but handle empty tool_calls gracefully return _emit_text("", model_name, protocol, is_stream, usage=usage) @@ -318,12 +337,22 @@ def _emit_tool_calls( protocol: str, is_stream: bool, usage: Any | None = None, + reasoning_replay: ReasoningReplay = DEFAULT_REASONING_REPLAY, ) -> dict[str, Any] | list[dict[str, Any]]: """Protocol-aware tool-call response emitter.""" if protocol == "anthropic": if is_stream: - return tool_calls_to_anthropic_sse(tool_calls, model=model, usage=usage) - return tool_calls_to_anthropic(tool_calls, model=model, usage=usage) + return tool_calls_to_anthropic_sse( + tool_calls, model=model, usage=usage, + reasoning_replay=reasoning_replay, + ) + return tool_calls_to_anthropic( + tool_calls, model=model, usage=usage, reasoning_replay=reasoning_replay, + ) if is_stream: - return tool_calls_to_sse_events(tool_calls, model=model, usage=usage) - return tool_calls_to_openai(tool_calls, model=model, usage=usage) + return tool_calls_to_sse_events( + tool_calls, model=model, usage=usage, reasoning_replay=reasoning_replay, + ) + return tool_calls_to_openai( + tool_calls, model=model, usage=usage, reasoning_replay=reasoning_replay, + ) diff --git a/src/forge/proxy/proxy.py b/src/forge/proxy/proxy.py index d77c0b4..a937324 100644 --- a/src/forge/proxy/proxy.py +++ b/src/forge/proxy/proxy.py @@ -20,6 +20,7 @@ from forge.clients.vllm import VLLMClient from forge.context.manager import ContextManager from forge.context.strategies import TieredCompact +from forge.core.reasoning import DEFAULT_REASONING_REPLAY, ReasoningReplay, validate_reasoning_replay from forge.proxy.server import HTTPServer from forge.server import BudgetMode, ServerManager, setup_backend @@ -72,6 +73,7 @@ def __init__( inject_respond_tool: bool = False, 
backend_protocol: Literal["openai", "anthropic"] = "openai", backend_timeout: float = 300.0, + reasoning_replay: ReasoningReplay = DEFAULT_REASONING_REPLAY, ) -> None: """ Args: @@ -118,6 +120,8 @@ def __init__( Only meaningful in external mode; ignored in managed mode. backend_timeout: Timeout in seconds for requests from the proxy to the downstream backend. + reasoning_replay: How much captured reasoning to replay to the + backend on later turns: ``full``, ``keep-last``, or ``none``. """ if backend_url is None and backend is None: raise ValueError("Provide either backend_url (external) or backend (managed)") @@ -177,6 +181,7 @@ def __init__( self._inject_respond_tool = inject_respond_tool self._backend_protocol = backend_protocol self._backend_timeout = backend_timeout + self._reasoning_replay = validate_reasoning_replay(reasoning_replay) # Auto-detect serialization: managed (no external url) = single local # GPU = serialize. External callers manage their own concurrency. @@ -262,6 +267,7 @@ async def _async_start(self, ready: threading.Event) -> None: rescue_enabled=self._rescue_enabled, native_passthrough=self._backend_capability == "native", inject_respond_tool=self._inject_respond_tool, + reasoning_replay=self._reasoning_replay, ) await self._http_server.start() self._started = True diff --git a/src/forge/proxy/server.py b/src/forge/proxy/server.py index 1a0328c..1f8e16a 100644 --- a/src/forge/proxy/server.py +++ b/src/forge/proxy/server.py @@ -15,6 +15,7 @@ from forge.clients.base import LLMClient from forge.context.manager import ContextManager +from forge.core.reasoning import DEFAULT_REASONING_REPLAY, ReasoningReplay, validate_reasoning_replay from forge.proxy.handler import handle_chat_completions logger = logging.getLogger("forge.proxy") @@ -52,6 +53,7 @@ def __init__( rescue_enabled: bool = True, native_passthrough: bool = True, inject_respond_tool: bool = False, + reasoning_replay: ReasoningReplay = DEFAULT_REASONING_REPLAY, ) -> None: self._client = 
client self._context_manager = context_manager @@ -62,6 +64,7 @@ def __init__( self._rescue_enabled = rescue_enabled self._native_passthrough = native_passthrough self._inject_respond_tool = inject_respond_tool + self._reasoning_replay = validate_reasoning_replay(reasoning_replay) self._server: asyncio.Server | None = None self._serialize = serialize_requests self._queue: asyncio.Queue[_QueueItem] = asyncio.Queue() @@ -316,6 +319,7 @@ async def _run_handler( native_passthrough=self._native_passthrough, inject_respond_tool=self._inject_respond_tool, protocol=protocol, + reasoning_replay=self._reasoning_replay, ) except Exception as exc: logger.exception("Handler error") diff --git a/tests/unit/test_inference_passthrough.py b/tests/unit/test_inference_passthrough.py index 78aa851..3829dd5 100644 --- a/tests/unit/test_inference_passthrough.py +++ b/tests/unit/test_inference_passthrough.py @@ -47,7 +47,22 @@ async def test_raw_used_on_first_attempt_folded_on_retry(): MessageRole.USER, "folded-form", MessageMeta(MessageType.USER_INPUT), )] - raw_messages = [{"role": "user", "content": "VERBATIM", "name": "u1"}] + raw_messages = [ + { + "role": "assistant", + "content": None, + "reasoning_content": "old", + "tool_calls": [], + "name": "a1", + }, + { + "role": "assistant", + "content": None, + "reasoning_content": "latest", + "tool_calls": [], + "name": "a2", + }, + ] raw_tools = [{"type": "function", "function": {"name": "search", "parameters": {}}}] result = await run_inference( @@ -59,6 +74,7 @@ async def test_raw_used_on_first_attempt_folded_on_retry(): tool_specs=[_search_spec()], raw_openai_messages=raw_messages, raw_openai_tools=raw_tools, + reasoning_replay="full", ) assert result is not None @@ -98,3 +114,48 @@ async def test_no_raw_falls_back_to_fold(): call = client.send.call_args assert call.args[0][0]["content"] == "hello" assert "raw_openai_tools" not in call.kwargs + + +@pytest.mark.asyncio +async def 
test_non_full_reasoning_replay_filters_raw_reasoning_but_keeps_raw_shape(): + client = _client([ToolCall(tool="search", args={})]) + messages = [Message( + MessageRole.USER, "folded-form", + MessageMeta(MessageType.USER_INPUT), + )] + raw_messages = [ + { + "role": "assistant", + "content": None, + "reasoning_content": "old", + "tool_calls": [], + "name": "a1", + }, + { + "role": "assistant", + "content": None, + "reasoning_content": "latest", + "tool_calls": [], + "name": "a2", + }, + ] + raw_tools = [{"type": "function", "function": {"name": "search", "parameters": {}}}] + + await run_inference( + messages=messages, + client=client, + context_manager=_ctx(), + validator=ResponseValidator(["search"], rescue_enabled=True), + error_tracker=ErrorTracker(max_retries=1), + tool_specs=[_search_spec()], + raw_openai_messages=raw_messages, + raw_openai_tools=raw_tools, + reasoning_replay="keep-last", + ) + + call = client.send.call_args + assert call.args[0][0]["name"] == "a1" + assert "reasoning_content" not in call.args[0][0] + assert call.args[0][1]["name"] == "a2" + assert call.args[0][1]["reasoning_content"] == "latest" + assert call.kwargs["raw_openai_tools"] == raw_tools diff --git a/tests/unit/test_proxy_convert.py b/tests/unit/test_proxy_convert.py index 5ce0feb..11b5a2e 100644 --- a/tests/unit/test_proxy_convert.py +++ b/tests/unit/test_proxy_convert.py @@ -149,12 +149,28 @@ def test_multiple_tool_calls(self): ]) assert len(result["choices"][0]["message"]["tool_calls"]) == 2 - def test_reasoning_in_content(self): + def test_reasoning_default_exposed_as_reasoning_content(self): result = tool_calls_to_openai([ ToolCall(tool="search", args={}, reasoning="Let me think..."), ]) + msg = result["choices"][0]["message"] + assert msg["content"] is None + assert msg["reasoning_content"] == "Let me think..." 
+ + def test_full_reasoning_replay_exposes_reasoning_in_content(self): + result = tool_calls_to_openai([ + ToolCall(tool="search", args={}, reasoning="Let me think..."), + ], reasoning_replay="full") assert result["choices"][0]["message"]["content"] == "Let me think..." + def test_none_reasoning_replay_omits_reasoning(self): + result = tool_calls_to_openai([ + ToolCall(tool="search", args={}, reasoning="Let me think..."), + ], reasoning_replay="none") + msg = result["choices"][0]["message"] + assert msg["content"] is None + assert "reasoning_content" not in msg + def test_no_reasoning_content_is_none(self): result = tool_calls_to_openai([ToolCall(tool="search", args={})]) assert result["choices"][0]["message"]["content"] is None @@ -200,14 +216,27 @@ def test_single_tool_call_structure(self): assert events[-1]["choices"][0]["finish_reason"] == "tool_calls" assert events[-1]["choices"][0]["delta"] == {} - def test_reasoning_prepended(self): + def test_reasoning_prepended_as_reasoning_content_by_default(self): events = tool_calls_to_sse_events([ ToolCall(tool="search", args={}, reasoning="Thinking..."), ]) # reasoning delta + tool call delta + final assert len(events) == 3 + assert events[0]["choices"][0]["delta"]["reasoning_content"] == "Thinking..." + + def test_full_reasoning_replay_streams_content_delta(self): + events = tool_calls_to_sse_events([ + ToolCall(tool="search", args={}, reasoning="Thinking..."), + ], reasoning_replay="full") assert events[0]["choices"][0]["delta"]["content"] == "Thinking..." 
+ def test_none_reasoning_replay_omits_stream_reasoning_delta(self): + events = tool_calls_to_sse_events([ + ToolCall(tool="search", args={}, reasoning="Thinking..."), + ], reasoning_replay="none") + assert len(events) == 2 + assert "tool_calls" in events[0]["choices"][0]["delta"] + def test_multiple_tool_calls(self): events = tool_calls_to_sse_events([ ToolCall(tool="a", args={}), @@ -254,3 +283,32 @@ def test_consistent_completion_id(self): events = text_to_sse_events("test", chunk_size=1) ids = {e["id"] for e in events} assert len(ids) == 1 + + +class TestOpenaiReasoningFields: + def test_reasoning_content_becomes_reasoning_message(self): + msgs = openai_to_messages([{ + "role": "assistant", + "content": None, + "reasoning_content": "Think.", + "tool_calls": [{ + "id": "call_1", + "function": {"name": "search", "arguments": "{}"}, + }], + }]) + + assert [m.metadata.type for m in msgs] == [ + MessageType.REASONING, MessageType.TOOL_CALL, + ] + assert msgs[0].content == "Think." + assert msgs[1].content == "" + + def test_reasoning_only_message_does_not_add_blank_text_response(self): + msgs = openai_to_messages([{ + "role": "assistant", + "content": None, + "reasoning_content": "Think.", + }]) + + assert len(msgs) == 1 + assert msgs[0].metadata.type == MessageType.REASONING diff --git a/tests/unit/test_proxy_convert_anthropic.py b/tests/unit/test_proxy_convert_anthropic.py index d7b9fa6..23fa62a 100644 --- a/tests/unit/test_proxy_convert_anthropic.py +++ b/tests/unit/test_proxy_convert_anthropic.py @@ -250,11 +250,18 @@ def test_shape(self): assert tu_blocks[0]["input"] == {"city": "Paris"} assert tu_blocks[0]["id"].startswith("toolu_") - def test_reasoning_emitted_as_text_block(self): + def test_default_omits_reasoning_text_block(self): result = tool_calls_to_anthropic([ ToolCall(tool="search", args={"q": "x"}, reasoning="Let me search."), ]) text_blocks = [b for b in result["content"] if b["type"] == "text"] + assert text_blocks == [] + + def 
test_full_reasoning_replay_emits_reasoning_as_text_block(self): + result = tool_calls_to_anthropic([ + ToolCall(tool="search", args={"q": "x"}, reasoning="Let me search."), + ], reasoning_replay="full") + text_blocks = [b for b in result["content"] if b["type"] == "text"] assert text_blocks and text_blocks[0]["text"] == "Let me search." def test_multiple_tool_calls(self): @@ -298,10 +305,17 @@ def test_event_sequence(self): delta = next(e for e in events if e["type"] == "message_delta") assert delta["delta"]["stop_reason"] == "tool_use" - def test_reasoning_block_precedes_tool_use(self): + def test_default_omits_reasoning_stream_block(self): events = tool_calls_to_anthropic_sse([ ToolCall(tool="search", args={"q": "x"}, reasoning="Hmm."), ]) + starts = [e for e in events if e["type"] == "content_block_start"] + assert starts[0]["content_block"]["type"] == "tool_use" + + def test_full_reasoning_replay_streams_text_block_before_tool_use(self): + events = tool_calls_to_anthropic_sse([ + ToolCall(tool="search", args={"q": "x"}, reasoning="Hmm."), + ], reasoning_replay="full") # First content_block_start should be type=text (reasoning), then tool_use starts = [e for e in events if e["type"] == "content_block_start"] assert starts[0]["content_block"]["type"] == "text" diff --git a/tests/unit/test_proxy_handler.py b/tests/unit/test_proxy_handler.py index 11de5ff..c4689c4 100644 --- a/tests/unit/test_proxy_handler.py +++ b/tests/unit/test_proxy_handler.py @@ -493,8 +493,39 @@ async def test_system_top_level_flows_into_messages(self): class TestNativePassthrough: - """The proxy forwards the client's OpenAI tools/messages verbatim on the - clean first attempt, bypassing the lossy ToolSpec round-trip.""" + """Native proxy passthrough keeps raw tools by default; raw messages + are forwarded verbatim only under full reasoning replay (the old behavior).""" + + @pytest.mark.asyncio + async def test_default_reasoning_replay_filters_raw_reasoning_only(self): + client = 
_mock_client([ToolCall(tool="search", args={"q": "x"})]) + messages = [ + { + "role": "assistant", + "content": None, + "reasoning_content": "old", + "tool_calls": [], + "name": "a1", + "vendor": {"kept": True}, + }, + { + "role": "assistant", + "content": None, + "reasoning_content": "latest", + "tool_calls": [], + "name": "a2", + }, + ] + await handle_chat_completions( + _body(messages=messages, tools=[_tool_def("search")]), + client, _context_manager(), + ) + sent_messages = client.send.call_args.args[0] + assert sent_messages[0]["name"] == "a1" + assert sent_messages[0]["vendor"] == {"kept": True} + assert "reasoning_content" not in sent_messages[0] + assert sent_messages[1]["name"] == "a2" + assert sent_messages[1]["reasoning_content"] == "latest" @pytest.mark.asyncio async def test_raw_tools_forwarded_verbatim(self): @@ -525,7 +556,7 @@ async def test_raw_messages_forwarded_verbatim(self): messages = [{"role": "user", "content": "hi", "name": "u1"}] await handle_chat_completions( _body(messages=messages, tools=[_tool_def("search")]), - client, _context_manager(), + client, _context_manager(), reasoning_replay="full", ) sent_messages = client.send.call_args.args[0] assert sent_messages == messages diff --git a/tests/unit/test_reasoning_replay.py b/tests/unit/test_reasoning_replay.py new file mode 100644 index 0000000..96868cc --- /dev/null +++ b/tests/unit/test_reasoning_replay.py @@ -0,0 +1,136 @@ +"""Tests for reasoning replay policy serialization.""" + +import pytest + +from forge.core.inference import fold_and_serialize, prepare_backend_messages +from forge.core.messages import Message, MessageMeta, MessageRole, MessageType, ToolCallInfo +from forge.core.reasoning import filter_openai_reasoning_messages, validate_reasoning_replay + + +def _reasoning(text: str) -> Message: + return Message(MessageRole.ASSISTANT, text, MessageMeta(MessageType.REASONING)) + + +def _tool_call(name: str) -> Message: + return Message( + MessageRole.ASSISTANT, "", 
MessageMeta(MessageType.TOOL_CALL), + tool_calls=[ToolCallInfo(name=name, args={}, call_id=f"call_{name}")], + ) + + +def test_full_replays_every_reasoning_block(): + messages = [ + _reasoning("first"), _tool_call("a"), + _reasoning("second"), _tool_call("b"), + ] + + result = fold_and_serialize(messages, "openai", reasoning_replay="full") + + assert [m["content"] for m in result] == ["first", "second"] + + +def test_keep_last_replays_only_latest_reasoning_block(): + messages = [ + _reasoning("first"), _tool_call("a"), + _reasoning("second"), _tool_call("b"), + ] + + result = fold_and_serialize(messages, "openai", reasoning_replay="keep-last") + + assert [m["content"] for m in result] == ["", "second"] + + +def test_none_replays_no_reasoning_blocks(): + messages = [ + _reasoning("first"), _tool_call("a"), + _reasoning("second"), _tool_call("b"), + ] + + result = fold_and_serialize(messages, "openai", reasoning_replay="none") + + assert [m["content"] for m in result] == ["", ""] + + +def test_keep_last_orphan_reasoning_is_preserved_as_orphan(): + messages = [_reasoning("first"), _tool_call("a"), _reasoning("orphan")] + + result = fold_and_serialize(messages, "openai", reasoning_replay="keep-last") + + assert result[-1] == {"role": "assistant", "content": "orphan"} + + +def test_validate_reasoning_replay_rejects_unknown_policy(): + with pytest.raises(ValueError, match="reasoning_replay must be one of"): + validate_reasoning_replay("latest") + + +def test_filter_openai_reasoning_messages_only_filters_assistant_messages(): + messages = [ + { + "role": "user", + "content": "keep this", + "reasoning_content": "user metadata", + }, + { + "role": "assistant", + "content": None, + "reasoning_content": "old assistant reasoning", + "name": "a1", + }, + { + "role": "assistant", + "content": None, + "reasoning_content": "latest assistant reasoning", + "name": "a2", + }, + ] + + result = filter_openai_reasoning_messages(messages, reasoning_replay="keep-last") + + assert 
result[0]["reasoning_content"] == "user metadata" + assert result[1]["name"] == "a1" + assert "reasoning_content" not in result[1] + assert result[2]["name"] == "a2" + assert result[2]["reasoning_content"] == "latest assistant reasoning" + + +def test_filter_openai_reasoning_messages_none_preserves_user_reasoning_fields(): + messages = [ + {"role": "user", "content": "keep", "reasoning": "user value"}, + {"role": "assistant", "content": None, "reasoning": "drop"}, + ] + + result = filter_openai_reasoning_messages(messages, reasoning_replay="none") + + assert result[0]["reasoning"] == "user value" + assert "reasoning" not in result[1] + + +def test_prepare_backend_messages_filters_raw_openai_reasoning(): + raw_messages = [ + {"role": "assistant", "content": None, "reasoning_content": "old", "name": "a1"}, + {"role": "assistant", "content": None, "reasoning_content": "latest", "name": "a2"}, + ] + + result = prepare_backend_messages( + [], + "openai", + raw_openai_messages=raw_messages, + use_raw_messages=True, + reasoning_replay="keep-last", + ) + + assert result[0]["name"] == "a1" + assert "reasoning_content" not in result[0] + assert result[1]["name"] == "a2" + assert result[1]["reasoning_content"] == "latest" + + +def test_prepare_backend_messages_folds_forge_history_without_raw_messages(): + messages = [_reasoning("first"), _tool_call("a"), _reasoning("second"), _tool_call("b")] + + result = prepare_backend_messages( + messages, "openai", reasoning_replay="keep-last", + ) + + assert [m["content"] for m in result] == ["", "second"] From 9fd2508318bec222f0e9213c8b70b9062491b99e Mon Sep 17 00:00:00 2001 From: Antoine Zambelli Date: Tue, 2 Jun 2026 11:55:49 -0500 Subject: [PATCH 02/14] eval: thread reasoning_replay through batch_eval + eval_runner Make reasoning_replay a first-class, resumable, recorded eval axis so the re-sweep can run none-on-all (regression) and keep-last/full on reasoning models without collisions. 
- EvalConfig gains reasoning_replay; run_scenario passes it to WorkflowRunner (the backend pipeline already consumes it). run_eval propagates it. - batch_eval: run-wide --reasoning-replay choice (mirrors --ablation), threaded into run_batch, every EvalConfig, and the JSONL row. - Centralize the 6 inline resume keys into _run_key(); reasoning_replay is now part of the key so distinct policies for the same model+scenario are independent runs. _count_completed_runs defaults pre-knob rows (no field) to keep-last, so old dumps resume cleanly under the default. - Both CLIs expose --reasoning-replay {full,keep-last,none}; banners print it. - Add test_batch_eval_resume.py covering key distinctness, row recording, and per-policy resume counting incl. the legacy-default fold. Co-Authored-By: Claude Opus 4.8 (1M context) --- tests/eval/batch_eval.py | 86 ++++++++++++++++++------- tests/eval/eval_runner.py | 14 +++++ tests/unit/test_batch_eval_resume.py | 93 ++++++++++++++++++++++++++++ 3 files changed, 171 insertions(+), 22 deletions(-) create mode 100644 tests/unit/test_batch_eval_resume.py diff --git a/tests/eval/batch_eval.py b/tests/eval/batch_eval.py index 4417f17..5e2084d 100644 --- a/tests/eval/batch_eval.py +++ b/tests/eval/batch_eval.py @@ -19,6 +19,7 @@ from pathlib import Path from typing import Any +from forge.core.reasoning import DEFAULT_REASONING_REPLAY, REASONING_REPLAY_CHOICES, ReasoningReplay from forge.server import BudgetMode, ServerManager from tests.eval.ablation import ABLATION_PRESETS, AblationConfig @@ -215,15 +216,39 @@ def _config_key(model: str, backend: str, mode: str) -> str: return f"{model}|{backend}|{mode}" +def _run_key( + model: str, + backend: str, + mode: str, + ablation_name: str, + tool_choice: str, + reasoning_replay: str, + scenario: str, +) -> str: + """Canonical per-run resume key. + + Single source of truth for the resume/dedup dimensions so the counting + pass and every run-loop lookup stay in lockstep. 
reasoning_replay is part + of the key: distinct policies (none/keep-last/full) on the same + model+scenario are independent runs and must not collide. + """ + return ( + f"{model}|{backend}|{mode}" + f"|{ablation_name}|{tool_choice}|{reasoning_replay}|{scenario}" + ) + + def _count_completed_runs( jsonl_path: Path, ablation_name: str = "reforged", ) -> dict[str, int]: - """Scan JSONL and count completed runs per (model, backend, mode, ablation, tool_choice, scenario). + """Scan JSONL and count completed runs per resume key (see ``_run_key``). - Returns dict mapping "model|backend|mode|ablation|tool_choice|scenario" → count. - Records without an ablation field are treated as "reforged". - Records without a tool_choice field are treated as "auto". + Returns dict mapping the canonical run key → count. Records without an + ablation field are treated as "reforged", without tool_choice as "auto", + and without reasoning_replay as the default policy (keep-last) — so + pre-knob dumps resume cleanly under the default and are re-run under a + different policy. 
""" counts: dict[str, int] = {} if not jsonl_path.exists(): @@ -241,9 +266,10 @@ def _count_completed_runs( if row_ablation != ablation_name: continue row_tc = row.get("tool_choice", "auto") - key = ( - f"{row['model']}|{row['backend']}|{row['mode']}" - f"|{row_ablation}|{row_tc}|{row['scenario']}" + row_rr = row.get("reasoning_replay", DEFAULT_REASONING_REPLAY) + key = _run_key( + row["model"], row["backend"], row["mode"], + row_ablation, row_tc, row_rr, row["scenario"], ) counts[key] = counts.get(key, 0) + 1 return counts @@ -256,6 +282,7 @@ def _run_result_to_row( run_idx: int, budget_tokens: int | None = None, ablation_name: str = "reforged", + reasoning_replay: str = DEFAULT_REASONING_REPLAY, ) -> dict[str, Any]: """Convert a RunResult into a flat dict for JSONL output.""" row: dict[str, Any] = { @@ -264,6 +291,7 @@ def _run_result_to_row( "mode": config.mode, "ablation": ablation_name, "tool_choice": config.tool_choice or "auto", + "reasoning_replay": reasoning_replay, "scenario": result.scenario_name, "run": run_idx, "completeness": result.completeness, @@ -563,6 +591,7 @@ async def run_batch( tags: list[str] | None = None, scenario_names: list[str] | None = None, ablation: AblationConfig | None = None, + reasoning_replay: ReasoningReplay = DEFAULT_REASONING_REPLAY, ) -> None: """Run all configs × scenarios, appending each result to JSONL. 
@@ -602,9 +631,9 @@ async def run_batch( ) if scenario.name in _COMPACTION_SCENARIOS and skip_compaction: continue - key = ( - f"{config.model}|{config.backend}|{config.mode}" - f"|{ablation_name}|{tc_label_pre}|{scenario.name}" + key = _run_key( + config.model, config.backend, config.mode, + ablation_name, tc_label_pre, reasoning_replay, scenario.name, ) existing = completed_counts.get(key, 0) total_expected += max(0, runs_per_scenario - existing) @@ -642,9 +671,9 @@ async def run_batch( if scenario.name in _COMPACTION_SCENARIOS and skip_compaction: print(f" {scenario.name}: SKIP (compaction N/A)") continue - key = ( - f"{config.model}|{config.backend}|{config.mode}" - f"|{ablation_name}|{tc_label}|{scenario.name}" + key = _run_key( + config.model, config.backend, config.mode, + ablation_name, tc_label, reasoning_replay, scenario.name, ) existing = completed_counts.get(key, 0) remaining = max(0, runs_per_scenario - existing) @@ -674,9 +703,9 @@ async def run_batch( total_skipped += 1 continue - key = ( - f"{config.model}|{config.backend}|{config.mode}" - f"|{ablation_name}|{tc_label}|{scenario.name}" + key = _run_key( + config.model, config.backend, config.mode, + ablation_name, tc_label, reasoning_replay, scenario.name, ) existing = completed_counts.get(key, 0) remaining = max(0, runs_per_scenario - existing) @@ -693,6 +722,7 @@ async def run_batch( keep_message_history=True, verbose=verbose, budget_override=scenario_budget, + reasoning_replay=reasoning_replay, ) eta = _format_eta(total_ran, total_expected, batch_start) @@ -732,9 +762,9 @@ async def run_batch( ) if scenario.name in _COMPACTION_SCENARIOS and skip_compaction: continue - key_check = ( - f"{config.model}|{config.backend}|{config.mode}" - f"|{ablation_name}|{tc_label}|{scenario.name}" + key_check = _run_key( + config.model, config.backend, config.mode, + ablation_name, tc_label, reasoning_replay, scenario.name, ) if completed_counts.get(key_check, 0) < runs_per_scenario: has_work = True @@ -816,9 
+846,9 @@ async def run_batch( total_skipped += 1 continue - key = ( - f"{config.model}|{config.backend}|{config.mode}" - f"|{ablation_name}|{tc_label}|{scenario.name}" + key = _run_key( + config.model, config.backend, config.mode, + ablation_name, tc_label, reasoning_replay, scenario.name, ) existing = completed_counts.get(key, 0) remaining = max(0, runs_per_scenario - existing) @@ -843,6 +873,7 @@ async def run_batch( verbose=verbose, budget_override=scenario_budget, strategy_overrides={"compaction": TieredCompact(keep_recent=2)}, + reasoning_replay=reasoning_replay, ) eta = _format_eta(total_ran, total_expected, batch_start) @@ -900,6 +931,7 @@ async def run_batch( result, config, scenario, run_idx + 1, budget_tokens=scenario_budget, ablation_name=ablation_name, + reasoning_replay=reasoning_replay, ) with output_path.open("a") as f: f.write(json.dumps(row) + "\n") @@ -970,6 +1002,14 @@ async def main() -> None: default="reforged", help="Ablation preset: selectively disable guardrails (default: reforged = all enabled)", ) + parser.add_argument( + "--reasoning-replay", + choices=list(REASONING_REPLAY_CHOICES), + default=DEFAULT_REASONING_REPLAY, + help="How much captured reasoning to replay to the backend each turn: " + "full (legacy), keep-last (default), none. 
Part of the resume key, so " + "distinct policies for the same model/scenario are independent runs.", + ) parser.add_argument( "--model", type=str, @@ -1009,6 +1049,7 @@ async def main() -> None: print(f" Config set: {args.config} ({len(configs)} configs)") print(f" Budget mode: {budget_mode.value}") print(f" Ablation: {ablation.name}") + print(f" Reasoning replay: {args.reasoning_replay}") if args.scenario: print(f" Scenarios: {', '.join(args.scenario)}") elif args.tags: @@ -1033,6 +1074,7 @@ async def main() -> None: tags=args.tags, scenario_names=args.scenario, ablation=ablation, + reasoning_replay=args.reasoning_replay, ) diff --git a/tests/eval/eval_runner.py b/tests/eval/eval_runner.py index b503594..77707b9 100644 --- a/tests/eval/eval_runner.py +++ b/tests/eval/eval_runner.py @@ -13,6 +13,7 @@ from forge.context.manager import CompactEvent, ContextManager from forge.context.strategies import CompactStrategy, NoCompact, SlidingWindowCompact, TieredCompact from forge.core.messages import Message, MessageType +from forge.core.reasoning import DEFAULT_REASONING_REPLAY, REASONING_REPLAY_CHOICES, ReasoningReplay from forge.core.runner import WorkflowRunner from forge.core.workflow import ToolCall, ToolDef, ToolSpec, Workflow from forge.errors import ForgeError, StreamError @@ -63,6 +64,7 @@ class EvalConfig: verbose: bool = False budget_override: int | None = None stream_retries: int = 2 + reasoning_replay: ReasoningReplay = DEFAULT_REASONING_REPLAY class CountingClientWrapper: @@ -279,6 +281,7 @@ def on_message(msg: Message) -> None: stream=config.stream, on_message=on_message, rescue_enabled=rescue_enabled, + reasoning_replay=config.reasoning_replay, ) start = time.monotonic() @@ -435,6 +438,7 @@ async def run_eval( verbose=config.verbose, budget_override=scenario_budget, stream_retries=config.stream_retries, + reasoning_replay=config.reasoning_replay, ) scenario_results: list[RunResult] = [] @@ -548,6 +552,13 @@ async def main() -> None: default="reforged", 
help="Ablation preset: selectively disable guardrails (default: reforged = all enabled)", ) + parser.add_argument( + "--reasoning-replay", + choices=list(REASONING_REPLAY_CHOICES), + default=DEFAULT_REASONING_REPLAY, + help="How much captured reasoning to replay to the backend each turn: " + "full (legacy: replay all), keep-last (default: only most recent), none (drop all).", + ) parser.add_argument( "--tool-choice", choices=["auto", "any"], @@ -654,6 +665,7 @@ async def main() -> None: budget_override=resolved_budget, compact_strategy=cli_strategy, strategy_overrides={}, + reasoning_replay=args.reasoning_replay, ) else: config = EvalConfig( @@ -665,6 +677,7 @@ async def main() -> None: strategy_overrides={ "compaction": TieredCompact(keep_recent=2), }, + reasoning_replay=args.reasoning_replay, ) ablation = ABLATION_PRESETS[args.ablation] @@ -677,6 +690,7 @@ async def main() -> None: print(f"Resolved budget: {resolved_budget} tokens") print(f"Compact strategy: {strategy_label}") print(f"Ablation: {ablation.name}") + print(f"Reasoning replay: {args.reasoning_replay}") print(f"Tags filter: {args.tags or 'all'}") print(f"Scenario filter: {args.scenario or 'all'}") print() diff --git a/tests/unit/test_batch_eval_resume.py b/tests/unit/test_batch_eval_resume.py new file mode 100644 index 0000000..8ce3f73 --- /dev/null +++ b/tests/unit/test_batch_eval_resume.py @@ -0,0 +1,93 @@ +"""Resume-key behavior for the reasoning_replay eval axis (batch_eval). + +reasoning_replay is part of the canonical run key: distinct policies +(none / keep-last / full) on the same model+scenario are independent runs +and must not collide in resume counting, or a multi-policy sweep would +under-count and skip work it never actually ran. 
+""" + +from __future__ import annotations + +import json + +from forge.core.reasoning import DEFAULT_REASONING_REPLAY + +from tests.eval.batch_eval import ( + BatchConfig, + _count_completed_runs, + _run_key, + _run_result_to_row, +) +from tests.eval.eval_runner import RunResult +from tests.eval.scenarios import basic_2step + + +def _row(model: str, scenario: str, reasoning_replay: str) -> dict: + """Build a JSONL row via the production path for a given policy.""" + cfg = BatchConfig(model=model, backend="llamaserver", mode="native", think=None) + res = RunResult( + scenario_name=scenario, + completeness=True, + iterations_used=3, + accuracy=True, + messages=None, + ) + return _run_result_to_row( + res, cfg, basic_2step, run_idx=1, + ablation_name="reforged", reasoning_replay=reasoning_replay, + ) + + +def test_run_key_distinguishes_reasoning_replay() -> None: + base = dict( + model="m", backend="llamaserver", mode="native", + ablation_name="reforged", tool_choice="auto", scenario="s", + ) + k_none = _run_key(reasoning_replay="none", **base) + k_keep = _run_key(reasoning_replay="keep-last", **base) + k_full = _run_key(reasoning_replay="full", **base) + + # All three policies yield distinct keys... + assert len({k_none, k_keep, k_full}) == 3 + # ...and the key is stable for the same inputs. + assert _run_key(reasoning_replay="none", **base) == k_none + assert "none" in k_none + + +def test_run_result_to_row_records_reasoning_replay() -> None: + row = _row("M", "sc", "none") + assert row["reasoning_replay"] == "none" + + # Default when the caller doesn't pass one (legacy callers / inert axis). 
+ cfg = BatchConfig(model="M", backend="llamaserver", mode="native", think=None) + res = RunResult(scenario_name="sc", completeness=True, iterations_used=2, messages=None) + default_row = _run_result_to_row(res, cfg, basic_2step, run_idx=1) + assert default_row["reasoning_replay"] == DEFAULT_REASONING_REPLAY + + +def test_count_completed_runs_separates_policies(tmp_path) -> None: + rows = [ + _row("M", "sc", "none"), + _row("M", "sc", "none"), + _row("M", "sc", "full"), + _row("M", "sc", "keep-last"), + ] + # A pre-knob row (no reasoning_replay field) must fold into the default + # policy, so a default-policy resume skips it and a different policy re-runs. + legacy = _row("M", "sc", "keep-last") + del legacy["reasoning_replay"] + rows.append(legacy) + + path = tmp_path / "results.jsonl" + path.write_text("\n".join(json.dumps(r) for r in rows) + "\n") + + counts = _count_completed_runs(path, ablation_name="reforged") + + def key(rr: str) -> str: + return _run_key("M", "llamaserver", "native", "reforged", "auto", rr, "sc") + + assert counts[key("none")] == 2 + assert counts[key("full")] == 1 + # explicit keep-last + the legacy row defaulting to keep-last + assert counts[key("keep-last")] == 2 + assert counts[key("none")] + counts[key("full")] + counts[key("keep-last")] == 5 From 5b77d0964679bf65d236d45e0535e3060dd59876 Mon Sep 17 00:00:00 2001 From: Antoine Zambelli Date: Tue, 2 Jun 2026 18:49:22 -0500 Subject: [PATCH 03/14] eval: add on-wire reasoning counter to validate replay knob Add count_wire_reasoning() (eval-only, no src change): serialize the recorded transcript through the real fold_and_serialize choke point and count which reasoning blocks survive onto the backend wire. Emit reasoning_wire (survived) and reasoning_wire_total (non-empty blocks) per batch_eval row, so the sweep records an actual replay rate. Validated on a reasoning model (N=10, all 3 policies, 26 scenarios): none -> 0 on the wire across all 260 rows; keep-last in {0,1}; full in [0, total]. 
Surfaces that legacy/full replay is itself lossy (~29% of generated reasoning reaches the wire) due to consecutive-block collapse and empty-reasoning omission in fold_and_serialize. Co-Authored-By: Claude Opus 4.8 (1M context) --- tests/eval/batch_eval.py | 11 ++++++- tests/eval/metrics.py | 45 ++++++++++++++++++++++++++++- tests/unit/test_reasoning_replay.py | 32 ++++++++++++++++++++ 3 files changed, 86 insertions(+), 2 deletions(-) diff --git a/tests/eval/batch_eval.py b/tests/eval/batch_eval.py index 5e2084d..d588b25 100644 --- a/tests/eval/batch_eval.py +++ b/tests/eval/batch_eval.py @@ -24,7 +24,7 @@ from tests.eval.ablation import ABLATION_PRESETS, AblationConfig from tests.eval.eval_runner import EvalConfig, RunResult, run_scenario -from tests.eval.metrics import analyze_history, compute_metrics +from tests.eval.metrics import analyze_history, compute_metrics, count_wire_reasoning from tests.eval.scenarios import ALL_SCENARIOS, EvalScenario # ── GGUF paths ────────────────────────────────────────────────── @@ -313,11 +313,20 @@ def _run_result_to_row( row["step_nudges"] = stats.step_nudges row["tool_errors"] = stats.tool_errors row["reasoning_msgs"] = stats.reasoning_messages + # On-wire reasoning that survives the replay policy (independent + # validation of the knob): none->0, keep-last->{0,1}, full->[0,total]. + # reasoning_wire_total is the denominator (non-empty reasoning blocks), + # so reasoning_wire / reasoning_wire_total is the actual replay rate. 
+ wire_survived, wire_total = count_wire_reasoning(result.messages, reasoning_replay) + row["reasoning_wire"] = wire_survived + row["reasoning_wire_total"] = wire_total else: row["retry_nudges"] = None row["step_nudges"] = None row["tool_errors"] = None row["reasoning_msgs"] = None + row["reasoning_wire"] = None + row["reasoning_wire_total"] = None # Correctness row["accuracy"] = result.accuracy diff --git a/tests/eval/metrics.py b/tests/eval/metrics.py index 65d8616..bd97538 100644 --- a/tests/eval/metrics.py +++ b/tests/eval/metrics.py @@ -2,10 +2,13 @@ from __future__ import annotations +import json from dataclasses import dataclass, field from typing import TYPE_CHECKING -from forge.core.messages import Message, MessageType +from forge.core.inference import fold_and_serialize +from forge.core.messages import Message, MessageRole, MessageType +from forge.core.reasoning import ReasoningReplay if TYPE_CHECKING: from tests.eval.eval_runner import RunResult @@ -51,6 +54,46 @@ def analyze_history(messages: list[Message]) -> HistoryStats: return stats +def count_wire_reasoning( + messages: list[Message], + reasoning_replay: ReasoningReplay, + api_format: str = "openai", +) -> tuple[int, int]: + """Count reasoning messages whose text actually survives onto the backend wire. + + This is an *independent* validation of the reasoning-replay knob, not a + reimplementation of it: we serialize the recorded transcript through the + real production serializer (``fold_and_serialize`` — the single replay-policy + choke point) and then check which of the run's REASONING contents are present + in the resulting payload. Returns ``(survived, total)``. 
+ + Expected by policy on a transcript with N>0 reasoning messages: + * ``none`` -> survived == 0 (knob strips all reasoning from the wire) + * ``keep-last`` -> survived <= 1 (at most the final reasoning is folded) + * ``full`` -> survived <= N (folding can collapse consecutive blocks) + + Semantics note: this re-derives the *final-snapshot* wire payload, so for + ``keep-last`` it reflects the last reasoning in the completed history, not the + cumulative count sent turn-by-turn. ``none``==0 is exact for every prefix + because the drop is unconditional. Reasoning survival is independent of + ``api_format`` (the drop happens before ``to_api_dict``), so the default is fine. + """ + reasoning_texts = [ + m.content + for m in messages + if m.metadata.type == MessageType.REASONING + and m.role == MessageRole.ASSISTANT + and m.content + ] + total = len(reasoning_texts) + if total == 0: + return 0, 0 + wire = fold_and_serialize(messages, api_format, reasoning_replay=reasoning_replay) + blob = json.dumps(wire, ensure_ascii=False) + survived = sum(1 for text in reasoning_texts if json.dumps(text, ensure_ascii=False)[1:-1] in blob) + return survived, total + + # ── Aggregated metrics ─────────────────────────────────────────── diff --git a/tests/unit/test_reasoning_replay.py b/tests/unit/test_reasoning_replay.py index 96868cc..c9ec1e5 100644 --- a/tests/unit/test_reasoning_replay.py +++ b/tests/unit/test_reasoning_replay.py @@ -6,6 +6,8 @@ from forge.core.messages import Message, MessageMeta, MessageRole, MessageType, ToolCallInfo from forge.core.reasoning import filter_openai_reasoning_messages, validate_reasoning_replay +from tests.eval.metrics import count_wire_reasoning + def _reasoning(text: str) -> Message: return Message(MessageRole.ASSISTANT, text, MessageMeta(MessageType.REASONING)) @@ -134,3 +136,33 @@ def test_prepare_backend_messages_folds_forge_history_without_raw_messages(): ) assert [m["content"] for m in result] == ["", "second"] + + +# ── Eval-side on-wire reasoning counter (validates the knob
end to end) ── + +def _wire_transcript() -> list[Message]: + return [ + _reasoning("first"), _tool_call("a"), + _reasoning("second"), _tool_call("b"), + ] + + +def test_count_wire_reasoning_full_keeps_all(): + survived, total = count_wire_reasoning(_wire_transcript(), "full") + assert (survived, total) == (2, 2) + + +def test_count_wire_reasoning_keep_last_keeps_one(): + survived, total = count_wire_reasoning(_wire_transcript(), "keep-last") + assert (survived, total) == (1, 2) + + +def test_count_wire_reasoning_none_strips_all(): + # The core claim: none puts zero reasoning on the wire. + survived, total = count_wire_reasoning(_wire_transcript(), "none") + assert (survived, total) == (0, 2) + + +def test_count_wire_reasoning_no_reasoning_is_zero_zero(): + survived, total = count_wire_reasoning([_tool_call("a"), _tool_call("b")], "full") + assert (survived, total) == (0, 0) From 6f6b3460bbfd3244780080d92a044c3573773f82 Mon Sep 17 00:00:00 2001 From: Antoine Zambelli Date: Thu, 4 Jun 2026 21:58:01 -0500 Subject: [PATCH 04/14] feat(eval): add Anthropic prompt caching + cache-aware cost Cache the re-sent tool defs + system prompt on the Anthropic eval path so the repeated input prefix bills at 0.1x (read) instead of full price every turn. Billing-only: identical model behavior, accuracy, and iteration counts (safe for cross-run comparability). - AnthropicClient gains opt-in `prompt_caching` (default off, so the proxy verbatim path and existing request shape are untouched). When on, a static ephemeral breakpoint marks the tools + system prefix in the rebuild path. - Static-only on purpose: a rolling conversation breakpoint is NOT placed. The default reasoning_replay="keep-last" re-serializes earlier tool-call messages each turn, which busts a rolling prefix cache (1.25x writes, no reads). The conversation prefix is only stable under none/full, and reasoning_replay is a measured variable we won't pin, so caching is confined to the always-stable tools+system region. 
- TokenUsage carries cache_creation/cache_read counts (additive, defaults 0); captured in send() and send_stream(); accumulated through CountingClientWrapper and RunResult into the JSONL row. - _compute_cost is cache-aware (write 1.25x, read 0.1x of input rate); applied at the row and both eval_runner cost summaries. - Enabled by default for batch_eval sweeps; eval_runner gains --no-anthropic-cache for a cache-free cost-floor comparison. - Bump claude-opus-4-6 -> claude-opus-4-8 (configs + pricing, $5/$25 verified). Validated: 1148 unit tests pass (incl. new cache tests) + a live one-run smoke on compaction_chain_baseline (20,523 cache reads, behavior unchanged). Co-Authored-By: Claude Opus 4.8 --- src/forge/clients/anthropic.py | 52 ++++++++++++++++++ src/forge/clients/base.py | 9 ++++ tests/eval/batch_eval.py | 54 ++++++++++++++++--- tests/eval/eval_runner.py | 47 ++++++++++++++-- tests/unit/test_anthropic_client.py | 81 +++++++++++++++++++++++++++- tests/unit/test_batch_eval_resume.py | 50 +++++++++++++++++ tests/unit/test_proxy_path1.py | 2 + 7 files changed, 282 insertions(+), 13 deletions(-) diff --git a/src/forge/clients/anthropic.py b/src/forge/clients/anthropic.py index 6437eb7..ddb1349 100644 --- a/src/forge/clients/anthropic.py +++ b/src/forge/clients/anthropic.py @@ -40,10 +40,17 @@ def __init__( tool_choice: str | None = None, recommended_sampling: bool = False, base_url: str | None = None, + prompt_caching: bool = False, ) -> None: self.model = model self.max_tokens = max_tokens self._tool_choice = tool_choice # "auto", "any", or None (default=auto) + # Opt-in Anthropic prompt caching (billing-only). When on, the rebuild + # path marks a static cache breakpoint over the tool defs + system + # prompt (re-sent verbatim every turn). Off by default so the proxy + # verbatim path and existing request shape are untouched. See + # _apply_static_cache for why caching is static-only here. 
+ self._prompt_caching = prompt_caching # Accepted for API symmetry across clients but currently a no-op: # AnthropicClient does not expose sampling kwargs through forge today. # The Anthropic SDK manages sampling internally. @@ -279,8 +286,41 @@ def _build_kwargs( kwargs["tools"] = self._convert_tools(tools) if self._tool_choice and "tool_choice" not in kwargs: kwargs["tool_choice"] = {"type": self._tool_choice} + if self._prompt_caching: + self._apply_static_cache(kwargs) return kwargs + @staticmethod + def _apply_static_cache(kwargs: dict[str, Any]) -> None: + """Mark a static ephemeral cache breakpoint over tool defs + system. + + The tool block and system prompt are byte-identical on every turn of a + run, so this prefix reliably read-hits (at 0.1×) from turn 2 onward + instead of re-billing the re-sent schema + prompt at full price. + + Static-only on purpose: a *rolling* per-turn breakpoint over the growing + conversation is NOT placed here. The eval's default + ``reasoning_replay="keep-last"`` re-serializes earlier tool-call messages + differently each turn (it keeps only the latest reasoning), which busts a + rolling prefix cache — you'd pay 1.25× writes with no reads. The + conversation prefix is only stable under ``none``/``full``, and + ``reasoning_replay`` is a measured variable we won't pin, so caching is + confined to the always-stable tools+system region. + + The cached prefix is ordered tools → system → messages, so a single + breakpoint on the system block subsumes the tools; we additionally mark + the last tool so the tool prefix still caches when ``system`` is absent. 
+ """ + ephemeral = {"type": "ephemeral"} + tools = kwargs.get("tools") + if tools: + tools[-1]["cache_control"] = ephemeral + system = kwargs.get("system") + if isinstance(system, str) and system: + kwargs["system"] = [ + {"type": "text", "text": system, "cache_control": ephemeral} + ] + async def send( self, messages: list[dict[str, str]], @@ -319,6 +359,12 @@ async def send( prompt_tokens=response.usage.input_tokens, completion_tokens=response.usage.output_tokens, total_tokens=response.usage.input_tokens + response.usage.output_tokens, + cache_creation_input_tokens=getattr( + response.usage, "cache_creation_input_tokens", 0 + ) or 0, + cache_read_input_tokens=getattr( + response.usage, "cache_read_input_tokens", 0 + ) or 0, ) } return self._parse_response(response) @@ -402,6 +448,12 @@ async def send_stream( completion_tokens=final_message.usage.output_tokens, total_tokens=final_message.usage.input_tokens + final_message.usage.output_tokens, + cache_creation_input_tokens=getattr( + final_message.usage, "cache_creation_input_tokens", 0 + ) or 0, + cache_read_input_tokens=getattr( + final_message.usage, "cache_read_input_tokens", 0 + ) or 0, ) } except anthropic.APIError as exc: diff --git a/src/forge/clients/base.py b/src/forge/clients/base.py index 2a084a3..7aaa3d2 100644 --- a/src/forge/clients/base.py +++ b/src/forge/clients/base.py @@ -25,11 +25,20 @@ class TokenUsage: llama-server). Backends that don't report usage leave the client's ``last_usage`` empty and the context manager falls back to heuristic estimation. + + ``cache_creation_input_tokens`` / ``cache_read_input_tokens`` are + Anthropic prompt-cache counters (0 for backends without caching, or when + caching is off). ``prompt_tokens`` stays the *uncached* input sliver and + ``total_tokens`` stays ``prompt + completion`` — the cache counters are + carried separately so cost can price them (write 1.25×, read 0.1× of the + input rate) without shifting any existing consumer's semantics. 
""" prompt_tokens: int completion_tokens: int total_tokens: int + cache_creation_input_tokens: int = 0 + cache_read_input_tokens: int = 0 # Both Ollama and llama-server use the OpenAI tool schema format today. diff --git a/tests/eval/batch_eval.py b/tests/eval/batch_eval.py index d588b25..3052bbc 100644 --- a/tests/eval/batch_eval.py +++ b/tests/eval/batch_eval.py @@ -156,13 +156,13 @@ class BatchConfig: ANTHROPIC_CONFIGS: list[BatchConfig] = [ BatchConfig(model="claude-haiku-4-5-20251001", backend="anthropic", mode="native", think=None), BatchConfig(model="claude-sonnet-4-6", backend="anthropic", mode="native", think=None), - BatchConfig(model="claude-opus-4-6", backend="anthropic", mode="native", think=None), + BatchConfig(model="claude-opus-4-8", backend="anthropic", mode="native", think=None), ] ANTHROPIC_ANY_CONFIGS: list[BatchConfig] = [ BatchConfig(model="claude-haiku-4-5-20251001", backend="anthropic", mode="native", think=None, tool_choice="any"), BatchConfig(model="claude-sonnet-4-6", backend="anthropic", mode="native", think=None, tool_choice="any"), - BatchConfig(model="claude-opus-4-6", backend="anthropic", mode="native", think=None, tool_choice="any"), + BatchConfig(model="claude-opus-4-8", backend="anthropic", mode="native", think=None, tool_choice="any"), ] ALL_CONFIGS: list[BatchConfig] = ( @@ -196,16 +196,40 @@ class BatchConfig: "claude-haiku-4-5-20251001": (1.0, 5.0), "claude-sonnet-4-6": (3.0, 15.0), "claude-opus-4-6": (5.0, 25.0), + # Opus 4.8 standard mode: $5 input / $25 output per Mtok (anthropic.com, + # confirmed 2026-06). Same as 4-6. (Fast mode is 2× — $10/$50 — not used here.) + "claude-opus-4-8": (5.0, 25.0), } +# Prompt-cache token multipliers on the input rate, uniform across current +# Anthropic models: writes bill 1.25×, reads bill 0.1×. +_CACHE_WRITE_MULTIPLIER = 1.25 +_CACHE_READ_MULTIPLIER = 0.1 -def _compute_cost(model: str, input_tokens: int, output_tokens: int) -> float: - """Compute USD cost from token counts. 
Returns 0.0 for unknown models.""" + +def _compute_cost( + model: str, + input_tokens: int, + output_tokens: int, + cache_creation_tokens: int = 0, + cache_read_tokens: int = 0, +) -> float: + """Compute USD cost from token counts. Returns 0.0 for unknown models. + + ``input_tokens`` is the *uncached* input sliver; cached writes/reads are + priced separately off the input rate so prompt caching is reflected + accurately (the API reports these as distinct usage fields). + """ rates = _ANTHROPIC_PRICING.get(model) if not rates: return 0.0 input_rate, output_rate = rates - return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000 + return ( + input_tokens * input_rate + + cache_creation_tokens * input_rate * _CACHE_WRITE_MULTIPLIER + + cache_read_tokens * input_rate * _CACHE_READ_MULTIPLIER + + output_tokens * output_rate + ) / 1_000_000 # ── JSONL helpers ─────────────────────────────────────────────── @@ -342,11 +366,19 @@ def _run_result_to_row( row["wasted_calls"] = None # Token usage and cost (Anthropic only — local backends report 0) - if result.input_tokens or result.output_tokens: + if ( + result.input_tokens or result.output_tokens + or result.cache_creation_tokens or result.cache_read_tokens + ): row["input_tokens"] = result.input_tokens row["output_tokens"] = result.output_tokens + row["cache_creation_input_tokens"] = result.cache_creation_tokens + row["cache_read_input_tokens"] = result.cache_read_tokens row["cost_usd"] = round( - _compute_cost(config.model, result.input_tokens, result.output_tokens), + _compute_cost( + config.model, result.input_tokens, result.output_tokens, + result.cache_creation_tokens, result.cache_read_tokens, + ), 6, ) @@ -562,7 +594,13 @@ def _build_client(config: BatchConfig, models_dir: Path) -> Any: elif config.backend == "anthropic": from forge.clients.anthropic import AnthropicClient - return AnthropicClient(model=config.model, tool_choice=config.tool_choice) + # Prompt caching on for sweeps: billing-only 
(identical model behavior + # and accuracy/iterations metrics), caches the re-sent tool defs + + # system prompt. Static-only — see AnthropicClient._apply_static_cache. + return AnthropicClient( + model=config.model, tool_choice=config.tool_choice, + prompt_caching=True, + ) else: raise ValueError(f"Unknown backend: {config.backend}") diff --git a/tests/eval/eval_runner.py b/tests/eval/eval_runner.py index 77707b9..505628a 100644 --- a/tests/eval/eval_runner.py +++ b/tests/eval/eval_runner.py @@ -49,6 +49,8 @@ class RunResult: stream_retries: int = 0 input_tokens: int = 0 output_tokens: int = 0 + cache_creation_tokens: int = 0 + cache_read_tokens: int = 0 cost_usd: float = 0.0 @@ -75,6 +77,8 @@ def __init__(self, client: LLMClient) -> None: self.call_count = 0 self.total_input_tokens = 0 self.total_output_tokens = 0 + self.total_cache_creation_tokens = 0 + self.total_cache_read_tokens = 0 def __getattr__(self, name: str) -> Any: return getattr(self._client, name) @@ -88,6 +92,13 @@ def _collect_usage(self) -> None: for tu in usage.values(): self.total_input_tokens += tu.prompt_tokens self.total_output_tokens += tu.completion_tokens + # Anthropic prompt-cache counters; 0 for other backends. 
+ self.total_cache_creation_tokens += getattr( + tu, "cache_creation_input_tokens", 0 + ) + self.total_cache_read_tokens += getattr( + tu, "cache_read_input_tokens", 0 + ) async def send( self, @@ -332,6 +343,8 @@ def on_message(msg: Message) -> None: stream_retries=attempt, input_tokens=counting_client.total_input_tokens, output_tokens=counting_client.total_output_tokens, + cache_creation_tokens=counting_client.total_cache_creation_tokens, + cache_read_tokens=counting_client.total_cache_read_tokens, ) except StreamError as exc: last_stream_error = exc @@ -350,6 +363,8 @@ def on_message(msg: Message) -> None: stream_retries=attempt, input_tokens=counting_client.total_input_tokens, output_tokens=counting_client.total_output_tokens, + cache_creation_tokens=counting_client.total_cache_creation_tokens, + cache_read_tokens=counting_client.total_cache_read_tokens, ) except Exception as exc: elapsed = time.monotonic() - start @@ -365,6 +380,8 @@ def on_message(msg: Message) -> None: stream_retries=attempt, input_tokens=counting_client.total_input_tokens, output_tokens=counting_client.total_output_tokens, + cache_creation_tokens=counting_client.total_cache_creation_tokens, + cache_read_tokens=counting_client.total_cache_read_tokens, ) # All stream retries exhausted @@ -457,13 +474,15 @@ async def run_eval( else: status = "OK" cost_str = "" - if result.input_tokens: + if result.input_tokens or result.cache_read_tokens or result.cache_creation_tokens: from tests.eval.batch_eval import _compute_cost cost = _compute_cost( client.model if hasattr(client, "model") else "", result.input_tokens, result.output_tokens, + result.cache_creation_tokens, + result.cache_read_tokens, ) if cost > 0: cost_str = f", ${cost:.4f}" @@ -570,6 +589,13 @@ async def main() -> None: action="store_true", help="Disable llama-server prompt caching (default: enabled)", ) + parser.add_argument( + "--no-anthropic-cache", + action="store_true", + help="Disable Anthropic prompt caching for --backend 
anthropic " + "(default: enabled). Caching is billing-only; use this for a " + "cache-free cost-floor comparison.", + ) parser.add_argument( "--compact-strategy", choices=["tiered", "sliding", "none"], @@ -616,7 +642,10 @@ async def main() -> None: elif args.backend == "anthropic": from forge.clients.anthropic import AnthropicClient - client = AnthropicClient(model=args.model, tool_choice=args.tool_choice) + client = AnthropicClient( + model=args.model, tool_choice=args.tool_choice, + prompt_caching=not args.no_anthropic_cache, + ) else: from forge.clients.llamafile import LlamafileClient @@ -710,14 +739,24 @@ async def main() -> None: all_runs = [r for runs in results.values() for r in runs] total_input = sum(r.input_tokens for r in all_runs) total_output = sum(r.output_tokens for r in all_runs) - if total_input: + total_cache_creation = sum(r.cache_creation_tokens for r in all_runs) + total_cache_read = sum(r.cache_read_tokens for r in all_runs) + if total_input or total_cache_read or total_cache_creation: from tests.eval.batch_eval import _compute_cost - total_cost = _compute_cost(args.model, total_input, total_output) + total_cost = _compute_cost( + args.model, total_input, total_output, + total_cache_creation, total_cache_read, + ) print( f"Token usage: {total_input:,} input + {total_output:,} output" f" = {total_input + total_output:,} total" ) + if total_cache_creation or total_cache_read: + print( + f"Prompt cache: {total_cache_creation:,} written + " + f"{total_cache_read:,} read" + ) if total_cost > 0: n_runs = len(all_runs) print(f"Total cost: ${total_cost:.4f} ({n_runs} runs, ${total_cost / n_runs:.4f}/run)") diff --git a/tests/unit/test_anthropic_client.py b/tests/unit/test_anthropic_client.py index d00b514..edb4548 100644 --- a/tests/unit/test_anthropic_client.py +++ b/tests/unit/test_anthropic_client.py @@ -6,7 +6,7 @@ import json from typing import Literal -from unittest.mock import MagicMock +from unittest.mock import AsyncMock, MagicMock import 
pytest from pydantic import BaseModel, Field @@ -428,6 +428,10 @@ async def test_send_records_slot_keyed_usage(self) -> None: response.content = [text_block] response.usage.input_tokens = 12 response.usage.output_tokens = 7 + # Real Anthropic Usage reports these as ints (0 without caching); set + # them so the MagicMock doesn't auto-create truthy attrs. + response.usage.cache_creation_input_tokens = 0 + response.usage.cache_read_input_tokens = 0 async def fake_create(**kwargs): return response @@ -441,3 +445,78 @@ async def fake_create(**kwargs): assert client.last_usage == {0: expected} # Cross-client contract: _get_usage resolves slot 0 to the TokenUsage. assert _get_usage(client) == expected + + +# ── Prompt caching (static tools+system breakpoint) ────────────── + + +class TestPromptCaching: + """Opt-in prompt caching marks a static breakpoint over tool defs + system + in the rebuild path only; off by default; never touches the verbatim path.""" + + _MESSAGES = [ + {"role": "system", "content": "stable system prompt"}, + {"role": "user", "content": "hi"}, + ] + + def test_static_cache_marks_tools_and_system(self) -> None: + client = AnthropicClient( + model="claude-test", api_key="dummy", prompt_caching=True + ) + tools = [_make_spec("a"), _make_spec("b")] + kwargs = client._build_kwargs(self._MESSAGES, tools) + + # Last tool carries the ephemeral breakpoint (caches the tool prefix). + assert kwargs["tools"][-1]["cache_control"] == {"type": "ephemeral"} + # System is converted to a cached text block (caches tools+system). 
+ assert isinstance(kwargs["system"], list) + assert kwargs["system"][0]["text"] == "stable system prompt" + assert kwargs["system"][0]["cache_control"] == {"type": "ephemeral"} + + def test_no_cache_control_by_default(self) -> None: + client = AnthropicClient(model="claude-test", api_key="dummy") + tools = [_make_spec("a"), _make_spec("b")] + kwargs = client._build_kwargs(self._MESSAGES, tools) + + assert "cache_control" not in kwargs["tools"][-1] + # System stays a plain string when caching is off. + assert kwargs["system"] == "stable system prompt" + + def test_cache_does_not_touch_verbatim_inbound(self) -> None: + """prompt_caching must not mutate the path-1 verbatim body — that path + carries the proxy's own cache_control and bypasses the rebuild.""" + client = AnthropicClient( + model="claude-test", api_key="dummy", prompt_caching=True + ) + inbound = { + "max_tokens": 10, + "system": "verbatim system", + "messages": [{"role": "user", "content": "hi"}], + } + kwargs = client._build_kwargs([], None, None, inbound) + + # System stays the verbatim string (NOT converted to a cached block). 
+ assert kwargs["system"] == "verbatim system" + + @pytest.mark.asyncio + async def test_send_records_cache_usage(self) -> None: + client = AnthropicClient( + model="claude-test", api_key="dummy", prompt_caching=True + ) + text_block = MagicMock() + text_block.type = "text" + text_block.text = "ok" + response = MagicMock() + response.content = [text_block] + response.usage.input_tokens = 5 + response.usage.output_tokens = 3 + response.usage.cache_creation_input_tokens = 100 + response.usage.cache_read_input_tokens = 200 + client._client.messages.create = AsyncMock(return_value=response) + + await client.send([{"role": "user", "content": "hi"}]) + + tu = client.last_usage[0] + assert tu.prompt_tokens == 5 + assert tu.cache_creation_input_tokens == 100 + assert tu.cache_read_input_tokens == 200 diff --git a/tests/unit/test_batch_eval_resume.py b/tests/unit/test_batch_eval_resume.py index 8ce3f73..8becd20 100644 --- a/tests/unit/test_batch_eval_resume.py +++ b/tests/unit/test_batch_eval_resume.py @@ -14,6 +14,7 @@ from tests.eval.batch_eval import ( BatchConfig, + _compute_cost, _count_completed_runs, _run_key, _run_result_to_row, @@ -91,3 +92,52 @@ def key(rr: str) -> str: # explicit keep-last + the legacy row defaulting to keep-last assert counts[key("keep-last")] == 2 assert counts[key("none")] + counts[key("full")] + counts[key("keep-last")] == 5 + + +def test_compute_cost_prices_cache_tokens() -> None: + """Cache writes bill 1.25× and reads 0.1× of the input rate; uncached input + and output keep their base rates. (sonnet: $3 input / $15 output per Mtok.)""" + cost = _compute_cost( + "claude-sonnet-4-6", + input_tokens=1_000, + output_tokens=500, + cache_creation_tokens=2_000, + cache_read_tokens=4_000, + ) + expected = ( + 1_000 * 3.0 + + 2_000 * 3.0 * 1.25 + + 4_000 * 3.0 * 0.1 + + 500 * 15.0 + ) / 1_000_000 + assert cost == expected + + # Back-compat: omitting cache args matches the old input+output formula. 
+ assert _compute_cost("claude-sonnet-4-6", 1_000, 500) == ( + 1_000 * 3.0 + 500 * 15.0 + ) / 1_000_000 + + # Opus 4.8 is priced at the verified $5/$25 rate, not an unknown-model 0.0. + assert _compute_cost("claude-opus-4-8", 1_000, 0) > 0 + + +def test_run_result_to_row_emits_cache_tokens() -> None: + cfg = BatchConfig(model="claude-sonnet-4-6", backend="anthropic", mode="native", think=None) + res = RunResult( + scenario_name="sc", + completeness=True, + iterations_used=3, + accuracy=True, + messages=None, + input_tokens=1_000, + output_tokens=500, + cache_creation_tokens=2_000, + cache_read_tokens=4_000, + ) + row = _run_result_to_row(res, cfg, basic_2step, run_idx=1) + + assert row["cache_creation_input_tokens"] == 2_000 + assert row["cache_read_input_tokens"] == 4_000 + assert row["cost_usd"] == round( + _compute_cost("claude-sonnet-4-6", 1_000, 500, 2_000, 4_000), 6 + ) diff --git a/tests/unit/test_proxy_path1.py b/tests/unit/test_proxy_path1.py index 4cc8862..49ecc59 100644 --- a/tests/unit/test_proxy_path1.py +++ b/tests/unit/test_proxy_path1.py @@ -183,6 +183,8 @@ def _stub_anthropic_response(): msg.content = [MagicMock(type="text", text="ok")] msg.usage.input_tokens = 1 msg.usage.output_tokens = 1 + msg.usage.cache_creation_input_tokens = 0 + msg.usage.cache_read_input_tokens = 0 return msg From b1217c7174048a4bf7ab9fc817c67f293bf9ddb4 Mon Sep 17 00:00:00 2001 From: Antoine Zambelli Date: Sat, 6 Jun 2026 05:44:17 -0500 Subject: [PATCH 05/14] Add v0.7.5 reasoning-replay eval results (rig-02 dual-GPU sweep, 67.6k runs) Co-Authored-By: Claude Opus 4.8 (1M context) --- eval_results_v0.7.5.jsonl | 3 +++ 1 file changed, 3 insertions(+) create mode 100644 eval_results_v0.7.5.jsonl diff --git a/eval_results_v0.7.5.jsonl b/eval_results_v0.7.5.jsonl new file mode 100644 index 0000000..d185976 --- /dev/null +++ b/eval_results_v0.7.5.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:598db5bfeddbdf7c5e74bd59bdff821b462ce3e396f525cc860dd6c3fba51c0c
+size 41620228 From e450b43be05af35a4f5097c4e549b8df8d34eab0 Mon Sep 17 00:00:00 2001 From: Antoine Zambelli Date: Sat, 6 Jun 2026 06:02:39 -0500 Subject: [PATCH 06/14] Complete v0.7.5 reasoning-replay eval grid (78 cells, 101.4k runs) Adds the remaining single-GPU sweep partition (33.8k runs, 26 config cells) to the existing dual-GPU results, completing the full reasoning_replay grid across all 14 models x {none,keep-last,full} x {bare,reforged} x {native,prompt}. All 78 cells verified complete (26 scenarios x 50 runs each), zero duplicate run-keys. Rows stamped gen=3 (v0.6.0=1, v0.7.0=2) so cross-generation report dedup keeps this suite over older generations of the same config. Co-Authored-By: Claude Opus 4.8 (1M context) --- eval_results_v0.7.5.jsonl | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/eval_results_v0.7.5.jsonl b/eval_results_v0.7.5.jsonl index d185976..fd28663 100644 --- a/eval_results_v0.7.5.jsonl +++ b/eval_results_v0.7.5.jsonl @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:598db5bfeddbdf7c5e74bd59bdff821b462ce3e396f525cc860dd6c3fba51c0c -size 41620228 +oid sha256:a677cffdb3f1f018fe2b61bf3fe37ccc99e7193dddb78f603e0ff52d9df7e6da +size 63568949 From 694110bc3b847789bd2a24f6bc6ca582ca9586a0 Mon Sep 17 00:00:00 2001 From: Antoine Zambelli Date: Sun, 7 Jun 2026 19:36:56 -0500 Subject: [PATCH 07/14] Add Anthropic v0.7.5 eval rows --- eval_results_v0.7.5.jsonl | 4 +-- src/forge/clients/anthropic.py | 13 +++++++- tests/eval/batch_eval.py | 21 ++++++++++--- tests/unit/test_anthropic_client.py | 39 +++++++++++++++++++++++ tests/unit/test_batch_eval_resume.py | 46 ++++++++++++++++++++++++++++ 5 files changed, 116 insertions(+), 7 deletions(-) diff --git a/eval_results_v0.7.5.jsonl b/eval_results_v0.7.5.jsonl index fd28663..9c6a070 100644 --- a/eval_results_v0.7.5.jsonl +++ b/eval_results_v0.7.5.jsonl @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid 
sha256:a677cffdb3f1f018fe2b61bf3fe37ccc99e7193dddb78f603e0ff52d9df7e6da -size 63568949 +oid sha256:be698e8707360458c0e2310516149eed2e295c186bb10b847424a5f554418ab5 +size 70832005 diff --git a/src/forge/clients/anthropic.py b/src/forge/clients/anthropic.py index ddb1349..7cea440 100644 --- a/src/forge/clients/anthropic.py +++ b/src/forge/clients/anthropic.py @@ -41,6 +41,7 @@ def __init__( recommended_sampling: bool = False, base_url: str | None = None, prompt_caching: bool = False, + thinking: dict[str, Any] | None = None, ) -> None: self.model = model self.max_tokens = max_tokens @@ -51,6 +52,12 @@ def __init__( # verbatim path and existing request shape are untouched. See # _apply_static_cache for why caching is static-only here. self._prompt_caching = prompt_caching + # Extended-thinking request config, e.g. {"type": "adaptive"}. When set, + # merged into every messages.create call (and a forced tool_choice is + # suppressed — Anthropic requires tool_choice="auto" with thinking on). + # None = thinking off; the proxy passthrough path can still carry its + # own ``thinking`` via ``passthrough``. + self._thinking = thinking # Accepted for API symmetry across clients but currently a no-op: # AnthropicClient does not expose sampling kwargs through forge today. # The Anthropic SDK manages sampling internally. @@ -284,8 +291,12 @@ def _build_kwargs( kwargs["system"] = system if tools: kwargs["tools"] = self._convert_tools(tools) - if self._tool_choice and "tool_choice" not in kwargs: + # Extended thinking is incompatible with a forced tool_choice; + # Anthropic requires "auto" (the default) when thinking is on. 
+ if self._tool_choice and not self._thinking and "tool_choice" not in kwargs: kwargs["tool_choice"] = {"type": self._tool_choice} + if self._thinking and "thinking" not in kwargs: + kwargs["thinking"] = self._thinking if self._prompt_caching: self._apply_static_cache(kwargs) return kwargs diff --git a/tests/eval/batch_eval.py b/tests/eval/batch_eval.py index 3052bbc..25ee699 100644 --- a/tests/eval/batch_eval.py +++ b/tests/eval/batch_eval.py @@ -154,9 +154,13 @@ class BatchConfig: ] ANTHROPIC_CONFIGS: list[BatchConfig] = [ - BatchConfig(model="claude-haiku-4-5-20251001", backend="anthropic", mode="native", think=None), - BatchConfig(model="claude-sonnet-4-6", backend="anthropic", mode="native", think=None), - BatchConfig(model="claude-opus-4-8", backend="anthropic", mode="native", think=None), + # think=True -> adaptive extended thinking ("Claude with reasoning" baseline + # rows). Haiku has no adaptive support (API rejects it) so it stays a + # non-thinking baseline. Wired in _build_client. NOT part of the + # reasoning_replay sweep — thinking here is request-only, no replay folding. + BatchConfig(model="claude-haiku-4-5-20251001", backend="anthropic", mode="native", think=False), + BatchConfig(model="claude-sonnet-4-6", backend="anthropic", mode="native", think=True), + BatchConfig(model="claude-opus-4-8", backend="anthropic", mode="native", think=True), ] ANTHROPIC_ANY_CONFIGS: list[BatchConfig] = [ @@ -597,9 +601,17 @@ def _build_client(config: BatchConfig, models_dir: Path) -> Any: # Prompt caching on for sweeps: billing-only (identical model behavior # and accuracy/iterations metrics), caches the re-sent tool defs + # system prompt. Static-only — see AnthropicClient._apply_static_cache. + # + # Adaptive extended thinking when think=True ("Claude with reasoning" + # baselines). Gated off for tool_choice="any" (forced tool choice is + # incompatible with thinking) and for models without adaptive support + # (Haiku, configured think=False). 
Request-only: no reasoning_replay + # folding — these are baseline rows, not part of the replay sweep. + thinking = {"type": "adaptive"} if (config.think and config.tool_choice != "any") else None return AnthropicClient( model=config.model, tool_choice=config.tool_choice, - prompt_caching=True, + prompt_caching=True, thinking=thinking, + max_tokens=16384 if thinking else 4096, ) else: @@ -794,6 +806,7 @@ async def run_batch( result, config, scenario, run_idx + 1, budget_tokens=scenario_budget, ablation_name=ablation_name, + reasoning_replay=reasoning_replay, ) with output_path.open("a") as f: f.write(json.dumps(row) + "\n") diff --git a/tests/unit/test_anthropic_client.py b/tests/unit/test_anthropic_client.py index edb4548..bfc580b 100644 --- a/tests/unit/test_anthropic_client.py +++ b/tests/unit/test_anthropic_client.py @@ -520,3 +520,42 @@ async def test_send_records_cache_usage(self) -> None: assert tu.prompt_tokens == 5 assert tu.cache_creation_input_tokens == 100 assert tu.cache_read_input_tokens == 200 + + +class TestThinking: + """Adaptive extended-thinking request wiring (baseline rows). Request-only: + thinking is merged into the rebuild path and forces tool_choice=auto.""" + + _MESSAGES = [ + {"role": "system", "content": "sys"}, + {"role": "user", "content": "hi"}, + ] + + def test_thinking_merged_into_kwargs(self) -> None: + client = AnthropicClient( + model="claude-test", api_key="dummy", thinking={"type": "adaptive"} + ) + kwargs = client._build_kwargs(self._MESSAGES, [_make_spec("a")]) + assert kwargs["thinking"] == {"type": "adaptive"} + + def test_no_thinking_by_default(self) -> None: + client = AnthropicClient(model="claude-test", api_key="dummy") + kwargs = client._build_kwargs(self._MESSAGES, [_make_spec("a")]) + assert "thinking" not in kwargs + + def test_thinking_suppresses_forced_tool_choice(self) -> None: + # Anthropic forbids a forced tool_choice with thinking on -> must drop it. 
+ client = AnthropicClient( + model="claude-test", api_key="dummy", + tool_choice="any", thinking={"type": "adaptive"}, + ) + kwargs = client._build_kwargs(self._MESSAGES, [_make_spec("a")]) + assert "tool_choice" not in kwargs + assert kwargs["thinking"] == {"type": "adaptive"} + + def test_forced_tool_choice_kept_when_no_thinking(self) -> None: + client = AnthropicClient( + model="claude-test", api_key="dummy", tool_choice="any" + ) + kwargs = client._build_kwargs(self._MESSAGES, [_make_spec("a")]) + assert kwargs["tool_choice"] == {"type": "any"} diff --git a/tests/unit/test_batch_eval_resume.py b/tests/unit/test_batch_eval_resume.py index 8becd20..937c666 100644 --- a/tests/unit/test_batch_eval_resume.py +++ b/tests/unit/test_batch_eval_resume.py @@ -10,14 +10,18 @@ import json +import pytest + from forge.core.reasoning import DEFAULT_REASONING_REPLAY +import tests.eval.batch_eval as batch_eval from tests.eval.batch_eval import ( BatchConfig, _compute_cost, _count_completed_runs, _run_key, _run_result_to_row, + run_batch, ) from tests.eval.eval_runner import RunResult from tests.eval.scenarios import basic_2step @@ -66,6 +70,48 @@ def test_run_result_to_row_records_reasoning_replay() -> None: assert default_row["reasoning_replay"] == DEFAULT_REASONING_REPLAY +@pytest.mark.asyncio +async def test_anthropic_batch_rows_record_selected_reasoning_replay( + tmp_path, monkeypatch, +) -> None: + """Anthropic rows must use the runtime policy, not the module default.""" + cfg = BatchConfig( + model="claude-sonnet-4-6", + backend="anthropic", + mode="native", + think=True, + ) + output = tmp_path / "results.jsonl" + + monkeypatch.setattr(batch_eval, "ALL_SCENARIOS", [basic_2step]) + monkeypatch.setattr(batch_eval, "_build_client", lambda config, models_dir: object()) + + async def fake_run_with_timeout(client, scenario, eval_config, ablation): + assert eval_config.reasoning_replay == "none" + return RunResult( + scenario_name=scenario.name, + completeness=True, + 
iterations_used=3, + accuracy=True, + messages=None, + ) + + monkeypatch.setattr(batch_eval, "_run_with_timeout", fake_run_with_timeout) + + await run_batch( + configs=[cfg], + runs_per_scenario=1, + output_path=output, + tags=["plumbing"], + reasoning_replay="none", + ) + + row = json.loads(output.read_text().strip()) + assert row["model"] == "claude-sonnet-4-6" + assert row["backend"] == "anthropic" + assert row["reasoning_replay"] == "none" + + def test_count_completed_runs_separates_policies(tmp_path) -> None: rows = [ _row("M", "sc", "none"), From 2268c99122c51a52021ce77ea458e8c5ed75e6d5 Mon Sep 17 00:00:00 2001 From: Antoine Zambelli Date: Thu, 11 Jun 2026 17:06:07 -0500 Subject: [PATCH 08/14] Add GPU-A catch-up replay eval shard --- eval_results_catchup_gpuA_v0.7.5.jsonl | 3 +++ 1 file changed, 3 insertions(+) create mode 100644 eval_results_catchup_gpuA_v0.7.5.jsonl diff --git a/eval_results_catchup_gpuA_v0.7.5.jsonl b/eval_results_catchup_gpuA_v0.7.5.jsonl new file mode 100644 index 0000000..3a92c15 --- /dev/null +++ b/eval_results_catchup_gpuA_v0.7.5.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:73a07bd8916acae8a7ec2fcb7c79a8e6d8f05e2bcfa9db6aa894c5c0ab020420 +size 12170481 From ab6f5e5550b3923e3577da73f8ba89af8c2fb4d2 Mon Sep 17 00:00:00 2001 From: Antoine Zambelli Date: Thu, 11 Jun 2026 18:41:24 -0500 Subject: [PATCH 09/14] eval: merge reasoning replay catch-up results Merge the catch-up reasoning-replay eval rows into the canonical v0.7.5 dataset and remove the temporary GPU-A shard. Add FORGE_EVAL_PORT so concurrent local eval workers can use separate llama-server ports. 
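
For illustration, the port override described above can be sketched as follows. This is a minimal sketch, assuming each worker process is started with its own FORGE_EVAL_PORT; the eval_base_url helper and the environ parameter are hypothetical additions for the example and are not part of this patch.

```python
import os
from collections.abc import Mapping


def eval_port(environ: Mapping[str, str] = os.environ) -> int:
    """Port for the local llama-server; falls back to 8080 when the variable is unset."""
    return int(environ.get("FORGE_EVAL_PORT", "8080"))


def eval_base_url(environ: Mapping[str, str] = os.environ) -> str:
    """Base URL derived from the resolved port (hypothetical helper)."""
    return f"http://localhost:{eval_port(environ)}/v1"


# Two workers on one machine, each launched with its own environment (assumed values).
worker_a = {"FORGE_EVAL_PORT": "8080"}
worker_b = {"FORGE_EVAL_PORT": "8081"}
print(eval_base_url(worker_a))
print(eval_base_url(worker_b))
```

Resolving the port at call time rather than at import time means a wrapper script can set the variable per worker without touching the eval code.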
--- eval_results_catchup_gpuA_v0.7.5.jsonl | 3 --- eval_results_v0.7.5.jsonl | 4 ++-- tests/eval/batch_eval.py | 13 ++++++++++--- 3 files changed, 12 insertions(+), 8 deletions(-) delete mode 100644 eval_results_catchup_gpuA_v0.7.5.jsonl diff --git a/eval_results_catchup_gpuA_v0.7.5.jsonl b/eval_results_catchup_gpuA_v0.7.5.jsonl deleted file mode 100644 index 3a92c15..0000000 --- a/eval_results_catchup_gpuA_v0.7.5.jsonl +++ /dev/null @@ -1,3 +0,0 @@ -version https://git-lfs.github.com/spec/v1 -oid sha256:73a07bd8916acae8a7ec2fcb7c79a8e6d8f05e2bcfa9db6aa894c5c0ab020420 -size 12170481 diff --git a/eval_results_v0.7.5.jsonl b/eval_results_v0.7.5.jsonl index 9c6a070..9182af5 100644 --- a/eval_results_v0.7.5.jsonl +++ b/eval_results_v0.7.5.jsonl @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:be698e8707360458c0e2310516149eed2e295c186bb10b847424a5f554418ab5 -size 70832005 +oid sha256:8d0feb4f9f4456ba7b45cdb9a9fe407f593abecf7293c40fc40d30a12ebccfba +size 106268073 diff --git a/tests/eval/batch_eval.py b/tests/eval/batch_eval.py index 25ee699..98e295f 100644 --- a/tests/eval/batch_eval.py +++ b/tests/eval/batch_eval.py @@ -12,6 +12,7 @@ import asyncio import json +import os import subprocess import sys import time @@ -31,6 +32,11 @@ MODELS_DIR_DEFAULT = Path("models") + +def _eval_port() -> int: + """llama-server port for eval workers; overridden by rig wrappers.""" + return int(os.environ.get("FORGE_EVAL_PORT", "8080")) + # GGUF and llamafile model files for local-server backends. # Each entry is just the filename — paired into a BatchConfig below # alongside the canonical identity (the file stem, no extension). 
@@ -580,6 +586,7 @@ def _build_client(config: BatchConfig, models_dir: Path) -> Any: return LlamafileClient( gguf_path=str(models_dir / config.gguf_filename), mode=config.mode, think=think_val, + base_url=f"http://localhost:{_eval_port()}/v1", recommended_sampling=recommended_sampling, ) @@ -591,7 +598,7 @@ def _build_client(config: BatchConfig, models_dir: Path) -> Any: gguf_path=str(models_dir / config.gguf_filename), mode=config.mode, think=think_val, - base_url="http://localhost:8080/v1", + base_url=f"http://localhost:{_eval_port()}/v1", recommended_sampling=recommended_sampling, ) @@ -703,7 +710,7 @@ async def run_batch( total_ran = 0 total_failed_connect = 0 batch_start = time.monotonic() - server = ServerManager(backend="ollama", port=8080, models_dir=models_dir) + server = ServerManager(backend="ollama", port=_eval_port(), models_dir=models_dir) prev_backend: str | None = None prev_server: ServerManager | None = None @@ -846,7 +853,7 @@ async def run_batch( if prev_server is not None and prev_backend != "ollama": await prev_server.stop() server = ServerManager( - backend=config.backend, port=8080, models_dir=models_dir + backend=config.backend, port=_eval_port(), models_dir=models_dir ) # Resolve GGUF/llamafile path for non-Ollama backends From 870f2dbfdeda52505f5cb072735ac770a0c0d66d Mon Sep 17 00:00:00 2001 From: Antoine Zambelli Date: Thu, 11 Jun 2026 20:39:48 -0500 Subject: [PATCH 10/14] feat(reasoning): default reasoning_replay to none The v0.7.5 eval grid showed dropping replayed reasoning is statistically indistinguishable from replay-all on score while saving the replayed tokens every turn, so the bounded policy becomes the default. Help strings, the resume-fold docstring, and the anthropic prompt-caching rationale updated to match; default-behavior tests now assert omission, with fold/exposure mechanics re-pinned under explicit keep-last. 
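
As a rough illustration of what the three policies mean for backend-facing history, here is a minimal sketch modelled on the proxy handler tests in this series. trim_reasoning is a hypothetical name and this is not forge's implementation; it only handles a reasoning_content field on raw messages and ignores the content-folding path.

```python
from typing import Any, Literal

ReasoningReplay = Literal["full", "keep-last", "none"]


def trim_reasoning(
    messages: list[dict[str, Any]], policy: ReasoningReplay = "none"
) -> list[dict[str, Any]]:
    """Return shallow copies of messages with reasoning_content trimmed per policy."""
    if policy == "full":
        # Replay everything: no reasoning is removed.
        return [dict(m) for m in messages]
    # Index of the most recent message that carries reasoning, if any.
    last = max(
        (i for i, m in enumerate(messages) if m.get("reasoning_content")),
        default=None,
    )
    trimmed = []
    for i, message in enumerate(messages):
        copy = dict(message)
        # "none" drops every reasoning field; "keep-last" drops all but the latest.
        if policy == "none" or i != last:
            copy.pop("reasoning_content", None)
        trimmed.append(copy)
    return trimmed


history = [
    {"role": "assistant", "reasoning_content": "old", "name": "a1"},
    {"role": "assistant", "reasoning_content": "latest", "name": "a2"},
]
for policy in ("full", "keep-last", "none"):
    kept = [m.get("reasoning_content") for m in trim_reasoning(history, policy)]
    print(policy, kept)
```

Only the reasoning field is touched, so other keys on each raw message (name, vendor extensions) pass through unchanged under every policy.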
Co-Authored-By: Claude Fable 5 --- src/forge/clients/anthropic.py | 14 +++++++------- src/forge/core/reasoning.py | 2 +- src/forge/proxy/__main__.py | 2 +- tests/eval/batch_eval.py | 4 ++-- tests/eval/eval_runner.py | 2 +- tests/unit/test_batch_eval_resume.py | 6 +++--- tests/unit/test_proxy_convert.py | 21 +++++++++++++++++++-- tests/unit/test_proxy_handler.py | 17 +++++++++++++++++ tests/unit/test_runner.py | 26 +++++++++++++++++++++++--- 9 files changed, 74 insertions(+), 20 deletions(-) diff --git a/src/forge/clients/anthropic.py b/src/forge/clients/anthropic.py index 7cea440..6e57286 100644 --- a/src/forge/clients/anthropic.py +++ b/src/forge/clients/anthropic.py @@ -310,13 +310,13 @@ def _apply_static_cache(kwargs: dict[str, Any]) -> None: instead of re-billing the re-sent schema + prompt at full price. Static-only on purpose: a *rolling* per-turn breakpoint over the growing - conversation is NOT placed here. The eval's default - ``reasoning_replay="keep-last"`` re-serializes earlier tool-call messages - differently each turn (it keeps only the latest reasoning), which busts a - rolling prefix cache — you'd pay 1.25× writes with no reads. The - conversation prefix is only stable under ``none``/``full``, and - ``reasoning_replay`` is a measured variable we won't pin, so caching is - confined to the always-stable tools+system region. + conversation is NOT placed here. Under ``reasoning_replay="keep-last"`` + earlier tool-call messages re-serialize differently each turn (only the + latest reasoning is kept), which busts a rolling prefix cache — you'd + pay 1.25× writes with no reads. The conversation prefix is only stable + under ``none``/``full``, and ``reasoning_replay`` is a measured eval + variable we won't pin, so caching is confined to the always-stable + tools+system region. 
The cached prefix is ordered tools → system → messages, so a single breakpoint on the system block subsumes the tools; we additionally mark diff --git a/src/forge/core/reasoning.py b/src/forge/core/reasoning.py index df0acec..0750598 100644 --- a/src/forge/core/reasoning.py +++ b/src/forge/core/reasoning.py @@ -8,7 +8,7 @@ ReasoningReplay = Literal["full", "keep-last", "none"] REASONING_REPLAY_CHOICES: tuple[ReasoningReplay, ...] = ("full", "keep-last", "none") -DEFAULT_REASONING_REPLAY: ReasoningReplay = "keep-last" +DEFAULT_REASONING_REPLAY: ReasoningReplay = "none" def validate_reasoning_replay(value: str) -> ReasoningReplay: diff --git a/src/forge/proxy/__main__.py b/src/forge/proxy/__main__.py index d29f61a..455defa 100644 --- a/src/forge/proxy/__main__.py +++ b/src/forge/proxy/__main__.py @@ -91,7 +91,7 @@ def main() -> None: choices=REASONING_REPLAY_CHOICES, default=DEFAULT_REASONING_REPLAY, help="How much captured reasoning to replay to the backend " - "(default: keep-last).", + "(default: none).", ) parser.add_argument("--verbose", "-v", action="store_true", help="Verbose logging") diff --git a/tests/eval/batch_eval.py b/tests/eval/batch_eval.py index 98e295f..6a5d7e7 100644 --- a/tests/eval/batch_eval.py +++ b/tests/eval/batch_eval.py @@ -280,7 +280,7 @@ def _count_completed_runs( Returns dict mapping the canonical run key → count. Records without an ablation field are treated as "reforged", without tool_choice as "auto", - and without reasoning_replay as the default policy (keep-last) — so + and without reasoning_replay as the default policy (none) — so pre-knob dumps resume cleanly under the default and are re-run under a different policy. """ @@ -1074,7 +1074,7 @@ async def main() -> None: choices=list(REASONING_REPLAY_CHOICES), default=DEFAULT_REASONING_REPLAY, help="How much captured reasoning to replay to the backend each turn: " - "full (legacy), keep-last (default), none. Part of the resume key, so " + "full (legacy), keep-last, none (default). 
Part of the resume key, so " "distinct policies for the same model/scenario are independent runs.", ) parser.add_argument( diff --git a/tests/eval/eval_runner.py b/tests/eval/eval_runner.py index 505628a..94cb3ac 100644 --- a/tests/eval/eval_runner.py +++ b/tests/eval/eval_runner.py @@ -576,7 +576,7 @@ async def main() -> None: choices=list(REASONING_REPLAY_CHOICES), default=DEFAULT_REASONING_REPLAY, help="How much captured reasoning to replay to the backend each turn: " - "full (legacy: replay all), keep-last (default: only most recent), none (drop all).", + "full (legacy: replay all), keep-last (only most recent), none (default: drop all).", ) parser.add_argument( "--tool-choice", diff --git a/tests/unit/test_batch_eval_resume.py b/tests/unit/test_batch_eval_resume.py index 937c666..a32cab9 100644 --- a/tests/unit/test_batch_eval_resume.py +++ b/tests/unit/test_batch_eval_resume.py @@ -133,10 +133,10 @@ def test_count_completed_runs_separates_policies(tmp_path) -> None: def key(rr: str) -> str: return _run_key("M", "llamaserver", "native", "reforged", "auto", rr, "sc") - assert counts[key("none")] == 2 + # explicit none ×2 + the legacy row defaulting to none + assert counts[key("none")] == 3 assert counts[key("full")] == 1 - # explicit keep-last + the legacy row defaulting to keep-last - assert counts[key("keep-last")] == 2 + assert counts[key("keep-last")] == 1 assert counts[key("none")] + counts[key("full")] + counts[key("keep-last")] == 5 diff --git a/tests/unit/test_proxy_convert.py b/tests/unit/test_proxy_convert.py index 11b5a2e..07be9bb 100644 --- a/tests/unit/test_proxy_convert.py +++ b/tests/unit/test_proxy_convert.py @@ -149,12 +149,21 @@ def test_multiple_tool_calls(self): ]) assert len(result["choices"][0]["message"]["tool_calls"]) == 2 - def test_reasoning_default_exposed_as_reasoning_content(self): + def test_reasoning_omitted_by_default(self): + # Default policy is "none": reasoning is not exposed on the response. 
result = tool_calls_to_openai([ ToolCall(tool="search", args={}, reasoning="Let me think..."), ]) msg = result["choices"][0]["message"] assert msg["content"] is None + assert "reasoning_content" not in msg + + def test_keep_last_reasoning_replay_exposed_as_reasoning_content(self): + result = tool_calls_to_openai([ + ToolCall(tool="search", args={}, reasoning="Let me think..."), + ], reasoning_replay="keep-last") + msg = result["choices"][0]["message"] + assert msg["content"] is None assert msg["reasoning_content"] == "Let me think..." def test_full_reasoning_replay_exposes_reasoning_in_content(self): @@ -216,10 +225,18 @@ def test_single_tool_call_structure(self): assert events[-1]["choices"][0]["finish_reason"] == "tool_calls" assert events[-1]["choices"][0]["delta"] == {} - def test_reasoning_prepended_as_reasoning_content_by_default(self): + def test_reasoning_omitted_from_stream_by_default(self): + # Default policy is "none": no reasoning delta is streamed. events = tool_calls_to_sse_events([ ToolCall(tool="search", args={}, reasoning="Thinking..."), ]) + assert len(events) == 2 + assert "tool_calls" in events[0]["choices"][0]["delta"] + + def test_keep_last_reasoning_replay_streams_reasoning_content_delta(self): + events = tool_calls_to_sse_events([ + ToolCall(tool="search", args={}, reasoning="Thinking..."), + ], reasoning_replay="keep-last") # reasoning delta + tool call delta + final assert len(events) == 3 assert events[0]["choices"][0]["delta"]["reasoning_content"] == "Thinking..." 
diff --git a/tests/unit/test_proxy_handler.py b/tests/unit/test_proxy_handler.py index c4689c4..8b42f18 100644 --- a/tests/unit/test_proxy_handler.py +++ b/tests/unit/test_proxy_handler.py @@ -520,11 +520,28 @@ async def test_default_reasoning_replay_filters_raw_reasoning_only(self): _body(messages=messages, tools=[_tool_def("search")]), client, _context_manager(), ) + # Default policy is "none": every reasoning field is stripped, but the + # rest of each raw message survives verbatim. sent_messages = client.send.call_args.args[0] assert sent_messages[0]["name"] == "a1" assert sent_messages[0]["vendor"] == {"kept": True} assert "reasoning_content" not in sent_messages[0] assert sent_messages[1]["name"] == "a2" + assert "reasoning_content" not in sent_messages[1] + + @pytest.mark.asyncio + async def test_keep_last_reasoning_replay_keeps_latest_only(self): + client = _mock_client([ToolCall(tool="search", args={"q": "x"})]) + messages = [ + {"role": "assistant", "content": None, "reasoning_content": "old", "tool_calls": [], "name": "a1"}, + {"role": "assistant", "content": None, "reasoning_content": "latest", "tool_calls": [], "name": "a2"}, + ] + await handle_chat_completions( + _body(messages=messages, tools=[_tool_def("search")]), + client, _context_manager(), reasoning_replay="keep-last", + ) + sent_messages = client.send.call_args.args[0] + assert "reasoning_content" not in sent_messages[0] assert sent_messages[1]["reasoning_content"] == "latest" @pytest.mark.asyncio diff --git a/tests/unit/test_runner.py b/tests/unit/test_runner.py index ac69900..9a33c07 100644 --- a/tests/unit/test_runner.py +++ b/tests/unit/test_runner.py @@ -122,6 +122,7 @@ def _make_runner( stream: bool = False, on_chunk=None, budget_tokens: int = 100_000, + reasoning_replay: str = "none", ) -> WorkflowRunner: """Create a WorkflowRunner with NoCompact strategy and generous budget.""" ctx = ContextManager(strategy=NoCompact(), budget_tokens=budget_tokens) @@ -133,6 +134,7 @@ def _make_runner( 
max_tool_errors=max_tool_errors, stream=stream, on_chunk=on_chunk, + reasoning_replay=reasoning_replay, ) @@ -1335,8 +1337,8 @@ def spy_compact(messages, step_index=0, step_hint=""): assert types[reasoning_idx + 1] == MessageType.TOOL_CALL @pytest.mark.asyncio - async def test_reasoning_folded_into_tool_call_on_wire(self): - """Reasoning is folded into the tool_call message's content on the wire.""" + async def test_reasoning_not_on_wire_by_default(self): + """Default reasoning_replay="none": reasoning never reaches the wire.""" client = MockClient([ ToolCall(tool="fetch", args={}, reasoning="Thinking about this..."), ToolCall(tool="submit", args={}), @@ -1344,6 +1346,24 @@ async def test_reasoning_folded_into_tool_call_on_wire(self): runner = _make_runner(client) await runner.run(_make_workflow(), "go", prompt_vars={"role": "agent"}) + # Second send call: the tool_call message carries no reasoning content + second_call_msgs = client.send_calls[1][0] + # system, user, tool_call(assistant), tool_result(tool) + assert len(second_call_msgs) == 4 + assert second_call_msgs[2]["role"] == "assistant" + assert second_call_msgs[2]["content"] == "" + assert "tool_calls" in second_call_msgs[2] + + @pytest.mark.asyncio + async def test_reasoning_folded_into_tool_call_on_wire(self): + """With a replaying policy, reasoning is folded into the tool_call message's content on the wire.""" + client = MockClient([ + ToolCall(tool="fetch", args={}, reasoning="Thinking about this..."), + ToolCall(tool="submit", args={}), + ]) + runner = _make_runner(client, reasoning_replay="keep-last") + await runner.run(_make_workflow(), "go", prompt_vars={"role": "agent"}) + # Second send call: reasoning folded into tool_call content second_call_msgs = client.send_calls[1][0] # system, user, tool_call(assistant with content), tool_result(tool) @@ -1360,7 +1380,7 @@ async def test_text_response_not_folded_into_tool_call(self): ToolCall(tool="fetch", args={}, reasoning="Now I know what to do"), 
ToolCall(tool="submit", args={}), ]) - runner = _make_runner(client) + runner = _make_runner(client, reasoning_replay="keep-last") await runner.run(_make_workflow(), "go", prompt_vars={"role": "agent"}) # Third send call: after text_response+nudge recovery, then reasoning+fetch From d6b4a57155ed8a31d9018356e37dbb049a9aa984 Mon Sep 17 00:00:00 2001 From: Antoine Zambelli Date: Thu, 11 Jun 2026 20:39:58 -0500 Subject: [PATCH 11/14] data(eval): re-stamp Haiku v0.7.5 rows reasoning_replay keep-last -> none The Haiku baseline ran before the default-policy decision and recorded keep-last; Sonnet/Opus recorded none. The knob is request-inert for Claude rows (no captured reasoning is replayed), so the field is a label, not a behavioral difference - re-stamped for a consistent board. Targeted byte-level edit of the 3,900 Haiku rows; all other lines byte-identical. Post-edit validation: 170,300 rows, 0 bad JSON, 0 duplicate run keys. Co-Authored-By: Claude Fable 5 --- eval_results_v0.7.5.jsonl | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/eval_results_v0.7.5.jsonl b/eval_results_v0.7.5.jsonl index 9182af5..06e82f1 100644 --- a/eval_results_v0.7.5.jsonl +++ b/eval_results_v0.7.5.jsonl @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:8d0feb4f9f4456ba7b45cdb9a9fe407f593abecf7293c40fc40d30a12ebccfba -size 106268073 +oid sha256:7cf0797e5f77e04871635d4e61fb473922c4ac1de1b028954ec39476dca4c787 +size 106248573 From 81dcbfb8f49452fa828f497ff76c21943552e169 Mon Sep 17 00:00:00 2001 From: Antoine Zambelli Date: Thu, 11 Jun 2026 20:40:07 -0500 Subject: [PATCH 12/14] feat(eval): reasoning_replay as a first-class report/dashboard dimension ConfigKey (display identity) gains the policy so none/keep-last/full render as separate rows, tagged :keep-last/:full (untagged = the none default; pre-knob rows count as full - that is what they ran). 
The dedup identity (_config_tuple) deliberately excludes it so latest-gen-wins still supersedes pre-knob rows whole-config instead of keeping them as stale :full duplicates. Adds the reasoning-replay.md policy-comparison view, a --reasoning-replay report filter, a Reasoning Replay dashboard filter dimension with canonical ordering, and the gen-3 legend entry (tag ref v0.7.5; the squash SHA does not exist pre-merge). Reports and dashboard regenerated from all four dataset files. Co-Authored-By: Claude Fable 5 --- docs/results/dashboard.html | 13 +- docs/results/index.md | 3 +- docs/results/raw/native-vs-prompt.md | 232 ++++---- docs/results/raw/reasoning-replay.md | 420 ++++++++++++++ docs/results/raw/reforged-vs-bare.md | 708 +++++++++++++----------- docs/results/raw/reforged/all.md | 151 +++-- docs/results/raw/reforged/by-backend.md | 172 +++--- docs/results/raw/reforged/by-family.md | 263 +++++---- tests/eval/dashboard/src/Sidebar.tsx | 8 +- tests/eval/dashboard/src/types.ts | 10 +- tests/eval/dashboard/src/utils.ts | 10 +- tests/eval/report.py | 120 +++- 12 files changed, 1434 insertions(+), 676 deletions(-) create mode 100644 docs/results/raw/reasoning-replay.md diff --git a/docs/results/dashboard.html b/docs/results/dashboard.html index 530f8d6..1519700 100644 --- a/docs/results/dashboard.html +++ b/docs/results/dashboard.html @@ -4,19 +4,20 @@ Forge Eval Dashboard - +`+a.stack}}var Jn=Object.prototype.hasOwnProperty,wn=h.unstable_scheduleCallback,Wn=h.unstable_cancelCallback,qo=h.unstable_shouldYield,Yo=h.unstable_requestPaint,nt=h.unstable_now,Go=h.unstable_getCurrentPriorityLevel,zf=h.unstable_ImmediatePriority,Ef=h.unstable_UserBlockingPriority,Au=h.unstable_NormalPriority,Qo=h.unstable_LowPriority,Tf=h.unstable_IdlePriority,Xo=h.log,Zo=h.unstable_setDisableYieldValue,Ma=null,ct=null;function It(l){if(typeof Xo=="function"&&Zo(l),ct&&typeof ct.setStrictMode=="function")try{ct.setStrictMode(Ma,l)}catch{}}var 
it=Math.clz32?Math.clz32:Ko,Vo=Math.log,Lo=Math.LN2;function Ko(l){return l>>>=0,l===0?32:31-(Vo(l)/Lo|0)|0}var pu=256,_u=262144,Ou=4194304;function pe(l){var t=l&42;if(t!==0)return t;switch(l&-l){case 1:return 1;case 2:return 2;case 4:return 4;case 8:return 8;case 16:return 16;case 32:return 32;case 64:return 64;case 128:return 128;case 256:case 512:case 1024:case 2048:case 4096:case 8192:case 16384:case 32768:case 65536:case 131072:return l&261888;case 262144:case 524288:case 1048576:case 2097152:return l&3932160;case 4194304:case 8388608:case 16777216:case 33554432:return l&62914560;case 67108864:return 67108864;case 134217728:return 134217728;case 268435456:return 268435456;case 536870912:return 536870912;case 1073741824:return 0;default:return l}}function Mu(l,t,e){var a=l.pendingLanes;if(a===0)return 0;var u=0,n=l.suspendedLanes,c=l.pingedLanes;l=l.warmLanes;var i=a&134217727;return i!==0?(a=i&~n,a!==0?u=pe(a):(c&=i,c!==0?u=pe(c):e||(e=i&~l,e!==0&&(u=pe(e))))):(i=a&~n,i!==0?u=pe(i):c!==0?u=pe(c):e||(e=a&~l,e!==0&&(u=pe(e)))),u===0?0:t!==0&&t!==u&&(t&n)===0&&(n=u&-u,e=t&-t,n>=e||n===32&&(e&4194048)!==0)?t:u}function xa(l,t){return(l.pendingLanes&~(l.suspendedLanes&~l.pingedLanes)&t)===0}function Jo(l,t){switch(l){case 1:case 2:case 4:case 8:case 64:return t+250;case 16:case 32:case 128:case 256:case 512:case 1024:case 2048:case 4096:case 8192:case 16384:case 32768:case 65536:case 131072:case 262144:case 524288:case 1048576:case 2097152:return t+5e3;case 4194304:case 8388608:case 16777216:case 33554432:return-1;case 67108864:case 134217728:case 268435456:case 536870912:case 1073741824:return-1;default:return-1}}function Af(){var l=Ou;return Ou<<=1,(Ou&62914560)===0&&(Ou=4194304),l}function $n(l){for(var t=[],e=0;31>e;e++)t.push(l);return t}function Da(l,t){l.pendingLanes|=t,t!==268435456&&(l.suspendedLanes=0,l.pingedLanes=0,l.warmLanes=0)}function wo(l,t,e,a,u,n){var 
c=l.pendingLanes;l.pendingLanes=e,l.suspendedLanes=0,l.pingedLanes=0,l.warmLanes=0,l.expiredLanes&=e,l.entangledLanes&=e,l.errorRecoveryDisabledLanes&=e,l.shellSuspendCounter=0;var i=l.entanglements,f=l.expirationTimes,r=l.hiddenUpdates;for(e=c&~e;0"u")return null;try{return l.activeElement||l.body}catch{return l.body}}var Po=/[\n"\\]/g;function vt(l){return l.replace(Po,function(t){return"\\"+t.charCodeAt(0).toString(16)+" "})}function tc(l,t,e,a,u,n,c,i){l.name="",c!=null&&typeof c!="function"&&typeof c!="symbol"&&typeof c!="boolean"?l.type=c:l.removeAttribute("type"),t!=null?c==="number"?(t===0&&l.value===""||l.value!=t)&&(l.value=""+ht(t)):l.value!==""+ht(t)&&(l.value=""+ht(t)):c!=="submit"&&c!=="reset"||l.removeAttribute("value"),t!=null?ec(l,c,ht(t)):e!=null?ec(l,c,ht(e)):a!=null&&l.removeAttribute("value"),u==null&&n!=null&&(l.defaultChecked=!!n),u!=null&&(l.checked=u&&typeof u!="function"&&typeof u!="symbol"),i!=null&&typeof i!="function"&&typeof i!="symbol"&&typeof i!="boolean"?l.name=""+ht(i):l.removeAttribute("name")}function Bf(l,t,e,a,u,n,c,i){if(n!=null&&typeof n!="function"&&typeof n!="symbol"&&typeof n!="boolean"&&(l.type=n),t!=null||e!=null){if(!(n!=="submit"&&n!=="reset"||t!=null)){lc(l);return}e=e!=null?""+ht(e):"",t=t!=null?""+ht(t):e,i||t===l.value||(l.value=t),l.defaultValue=t}a=a??u,a=typeof a!="function"&&typeof a!="symbol"&&!!a,l.checked=i?l.checked:!!a,l.defaultChecked=!!a,c!=null&&typeof c!="function"&&typeof c!="symbol"&&typeof c!="boolean"&&(l.name=c),lc(l)}function ec(l,t,e){t==="number"&&Nu(l.ownerDocument)===l||l.defaultValue===""+e||(l.defaultValue=""+e)}function we(l,t,e,a){if(l=l.options,t){t={};for(var u=0;u"u"||typeof window.document>"u"||typeof window.document.createElement>"u"),ic=!1;if(Bt)try{var Ra={};Object.defineProperty(Ra,"passive",{get:function(){ic=!0}}),window.addEventListener("test",Ra,Ra),window.removeEventListener("test",Ra,Ra)}catch{ic=!1}var le=null,fc=null,Cu=null;function Vf(){if(Cu)return Cu;var 
l,t=fc,e=t.length,a,u="value"in le?le.value:le.textContent,n=u.length;for(l=0;l=Ba),$f=" ",kf=!1;function Ff(l,t){switch(l){case"keyup":return Mm.indexOf(t.keyCode)!==-1;case"keydown":return t.keyCode!==229;case"keypress":case"mousedown":case"focusout":return!0;default:return!1}}function If(l){return l=l.detail,typeof l=="object"&&"data"in l?l.data:null}var Fe=!1;function Dm(l,t){switch(l){case"compositionend":return If(t);case"keypress":return t.which!==32?null:(kf=!0,$f);case"textInput":return l=t.data,l===$f&&kf?null:l;default:return null}}function Nm(l,t){if(Fe)return l==="compositionend"||!yc&&Ff(l,t)?(l=Vf(),Cu=fc=le=null,Fe=!1,l):null;switch(l){case"paste":return null;case"keypress":if(!(t.ctrlKey||t.altKey||t.metaKey)||t.ctrlKey&&t.altKey){if(t.char&&1=t)return{node:e,offset:t-l};l=a}l:{for(;e;){if(e.nextSibling){e=e.nextSibling;break l}e=e.parentNode}e=void 0}e=cs(e)}}function fs(l,t){return l&&t?l===t?!0:l&&l.nodeType===3?!1:t&&t.nodeType===3?fs(l,t.parentNode):"contains"in l?l.contains(t):l.compareDocumentPosition?!!(l.compareDocumentPosition(t)&16):!1:!1}function ss(l){l=l!=null&&l.ownerDocument!=null&&l.ownerDocument.defaultView!=null?l.ownerDocument.defaultView:window;for(var t=Nu(l.document);t instanceof l.HTMLIFrameElement;){try{var e=typeof t.contentWindow.location.href=="string"}catch{e=!1}if(e)l=t.contentWindow;else break;t=Nu(l.document)}return t}function vc(l){var t=l&&l.nodeName&&l.nodeName.toLowerCase();return t&&(t==="input"&&(l.type==="text"||l.type==="search"||l.type==="tel"||l.type==="url"||l.type==="password")||t==="textarea"||l.contentEditable==="true")}var Ym=Bt&&"documentMode"in document&&11>=document.documentMode,Ie=null,gc=null,Qa=null,Sc=!1;function ds(l,t,e){var a=e.window===e?e.document:e.nodeType===9?e:e.ownerDocument;Sc||Ie==null||Ie!==Nu(a)||(a=Ie,"selectionStart"in 
a&&vc(a)?a={start:a.selectionStart,end:a.selectionEnd}:(a=(a.ownerDocument&&a.ownerDocument.defaultView||window).getSelection(),a={anchorNode:a.anchorNode,anchorOffset:a.anchorOffset,focusNode:a.focusNode,focusOffset:a.focusOffset}),Qa&&Ga(Qa,a)||(Qa=a,a=On(gc,"onSelect"),0>=c,u-=c,Dt=1<<32-it(t)+u|e<w?(tl=B,B=null):tl=B.sibling;var il=v(m,B,y[w],T);if(il===null){B===null&&(B=tl);break}l&&B&&il.alternate===null&&t(m,B),s=n(il,s,w),cl===null?Q=il:cl.sibling=il,cl=il,B=tl}if(w===y.length)return e(m,B),el&&Yt(m,w),Q;if(B===null){for(;ww?(tl=B,B=null):tl=B.sibling;var Te=v(m,B,il.value,T);if(Te===null){B===null&&(B=tl);break}l&&B&&Te.alternate===null&&t(m,B),s=n(Te,s,w),cl===null?Q=Te:cl.sibling=Te,cl=Te,B=tl}if(il.done)return e(m,B),el&&Yt(m,w),Q;if(B===null){for(;!il.done;w++,il=y.next())il=A(m,il.value,T),il!==null&&(s=n(il,s,w),cl===null?Q=il:cl.sibling=il,cl=il);return el&&Yt(m,w),Q}for(B=a(B);!il.done;w++,il=y.next())il=S(B,m,w,il.value,T),il!==null&&(l&&il.alternate!==null&&B.delete(il.key===null?w:il.key),s=n(il,s,w),cl===null?Q=il:cl.sibling=il,cl=il);return l&&B.forEach(function(ur){return t(m,ur)}),el&&Yt(m,w),Q}function bl(m,s,y,T){if(typeof y=="object"&&y!==null&&y.type===vl&&y.key===null&&(y=y.props.children),typeof y=="object"&&y!==null){switch(y.$$typeof){case P:l:{for(var Q=y.key;s!==null;){if(s.key===Q){if(Q=y.type,Q===vl){if(s.tag===7){e(m,s.sibling),T=u(s,y.props.children),T.return=m,m=T;break l}}else if(s.elementType===Q||typeof Q=="object"&&Q!==null&&Q.$$typeof===Bl&&He(Q)===s.type){e(m,s.sibling),T=u(s,y.props),Ja(T,y),T.return=m,m=T;break l}e(m,s);break}else t(m,s);s=s.sibling}y.type===vl?(T=Ne(y.props.children,m.mode,T,y.key),T.return=m,m=T):(T=Zu(y.type,y.key,y.props,null,m.mode,T),Ja(T,y),T.return=m,m=T)}return c(m);case k:l:{for(Q=y.key;s!==null;){if(s.key===Q)if(s.tag===4&&s.stateNode.containerInfo===y.containerInfo&&s.stateNode.implementation===y.implementation){e(m,s.sibling),T=u(s,y.children||[]),T.return=m,m=T;break 
l}else{e(m,s);break}else t(m,s);s=s.sibling}T=_c(y,m.mode,T),T.return=m,m=T}return c(m);case Bl:return y=He(y),bl(m,s,y,T)}if(ut(y))return j(m,s,y,T);if(fl(y)){if(Q=fl(y),typeof Q!="function")throw Error(o(150));return y=Q.call(y),Z(m,s,y,T)}if(typeof y.then=="function")return bl(m,s,$u(y),T);if(y.$$typeof===nl)return bl(m,s,Ku(m,y),T);ku(m,y)}return typeof y=="string"&&y!==""||typeof y=="number"||typeof y=="bigint"?(y=""+y,s!==null&&s.tag===6?(e(m,s.sibling),T=u(s,y),T.return=m,m=T):(e(m,s),T=pc(y,m.mode,T),T.return=m,m=T),c(m)):e(m,s)}return function(m,s,y,T){try{Ka=0;var Q=bl(m,s,y,T);return sa=null,Q}catch(B){if(B===fa||B===wu)throw B;var cl=st(29,B,null,m.mode);return cl.lanes=T,cl.return=m,cl}}}var qe=Rs(!0),js=Rs(!1),ne=!1;function qc(l){l.updateQueue={baseState:l.memoizedState,firstBaseUpdate:null,lastBaseUpdate:null,shared:{pending:null,lanes:0,hiddenCallbacks:null},callbacks:null}}function Yc(l,t){l=l.updateQueue,t.updateQueue===l&&(t.updateQueue={baseState:l.baseState,firstBaseUpdate:l.firstBaseUpdate,lastBaseUpdate:l.lastBaseUpdate,shared:l.shared,callbacks:null})}function ce(l){return{lane:l,tag:0,payload:null,callback:null,next:null}}function ie(l,t,e){var a=l.updateQueue;if(a===null)return null;if(a=a.shared,(sl&2)!==0){var u=a.pending;return u===null?t.next=t:(t.next=u.next,u.next=t),a.pending=t,t=Xu(l),gs(l,null,e),t}return Qu(l,a,t,e),Xu(l)}function wa(l,t,e){if(t=t.updateQueue,t!==null&&(t=t.shared,(e&4194048)!==0)){var a=t.lanes;a&=l.pendingLanes,e|=a,t.lanes=e,_f(l,e)}}function Gc(l,t){var e=l.updateQueue,a=l.alternate;if(a!==null&&(a=a.updateQueue,e===a)){var u=null,n=null;if(e=e.firstBaseUpdate,e!==null){do{var c={lane:e.lane,tag:e.tag,payload:e.payload,callback:null,next:null};n===null?u=n=c:n=n.next=c,e=e.next}while(e!==null);n===null?u=n=t:n=n.next=t}else 
u=n=t;e={baseState:a.baseState,firstBaseUpdate:u,lastBaseUpdate:n,shared:a.shared,callbacks:a.callbacks},l.updateQueue=e;return}l=e.lastBaseUpdate,l===null?e.firstBaseUpdate=t:l.next=t,e.lastBaseUpdate=t}var Qc=!1;function Wa(){if(Qc){var l=ia;if(l!==null)throw l}}function $a(l,t,e,a){Qc=!1;var u=l.updateQueue;ne=!1;var n=u.firstBaseUpdate,c=u.lastBaseUpdate,i=u.shared.pending;if(i!==null){u.shared.pending=null;var f=i,r=f.next;f.next=null,c===null?n=r:c.next=r,c=f;var b=l.alternate;b!==null&&(b=b.updateQueue,i=b.lastBaseUpdate,i!==c&&(i===null?b.firstBaseUpdate=r:i.next=r,b.lastBaseUpdate=f))}if(n!==null){var A=u.baseState;c=0,b=r=f=null,i=n;do{var v=i.lane&-536870913,S=v!==i.lane;if(S?(ll&v)===v:(a&v)===v){v!==0&&v===ca&&(Qc=!0),b!==null&&(b=b.next={lane:0,tag:i.tag,payload:i.payload,callback:null,next:null});l:{var j=l,Z=i;v=t;var bl=e;switch(Z.tag){case 1:if(j=Z.payload,typeof j=="function"){A=j.call(bl,A,v);break l}A=j;break l;case 3:j.flags=j.flags&-65537|128;case 0:if(j=Z.payload,v=typeof j=="function"?j.call(bl,A,v):j,v==null)break l;A=U({},A,v);break l;case 2:ne=!0}}v=i.callback,v!==null&&(l.flags|=64,S&&(l.flags|=8192),S=u.callbacks,S===null?u.callbacks=[v]:S.push(v))}else S={lane:v,tag:i.tag,payload:i.payload,callback:i.callback,next:null},b===null?(r=b=S,f=A):b=b.next=S,c|=v;if(i=i.next,i===null){if(i=u.shared.pending,i===null)break;S=i,i=S.next,S.next=null,u.lastBaseUpdate=S,u.shared.pending=null}}while(!0);b===null&&(f=A),u.baseState=f,u.firstBaseUpdate=r,u.lastBaseUpdate=b,n===null&&(u.shared.lanes=0),me|=c,l.lanes=c,l.memoizedState=A}}function Hs(l,t){if(typeof l!="function")throw Error(o(191,l));l.call(t)}function Bs(l,t){var e=l.callbacks;if(e!==null)for(l.callbacks=null,l=0;ln?n:8;var c=z.T,i={};z.T=i,ni(l,!1,t,e);try{var f=u(),r=z.S;if(r!==null&&r(i,f),f!==null&&typeof f=="object"&&typeof f.then=="function"){var b=wm(f,a);Ia(l,t,b,rt(l))}else 
Ia(l,t,a,rt(l))}catch(A){Ia(l,t,{then:function(){},status:"rejected",reason:A},rt())}finally{D.p=n,c!==null&&i.types!==null&&(c.types=i.types),z.T=c}}function Pm(){}function ai(l,t,e,a){if(l.tag!==5)throw Error(o(476));var u=r0(l).queue;y0(l,u,t,V,e===null?Pm:function(){return h0(l),e(a)})}function r0(l){var t=l.memoizedState;if(t!==null)return t;t={memoizedState:V,baseState:V,baseQueue:null,queue:{pending:null,lanes:0,dispatch:null,lastRenderedReducer:Zt,lastRenderedState:V},next:null};var e={};return t.next={memoizedState:e,baseState:e,baseQueue:null,queue:{pending:null,lanes:0,dispatch:null,lastRenderedReducer:Zt,lastRenderedState:e},next:null},l.memoizedState=t,l=l.alternate,l!==null&&(l.memoizedState=t),t}function h0(l){var t=r0(l);t.next===null&&(t=l.alternate.memoizedState),Ia(l,t.next.queue,{},rt())}function ui(){return Zl(hu)}function v0(){return Nl().memoizedState}function g0(){return Nl().memoizedState}function ly(l){for(var t=l.return;t!==null;){switch(t.tag){case 24:case 3:var e=rt();l=ce(e);var a=ie(t,l,e);a!==null&&(at(a,t,e),wa(a,t,e)),t={cache:Rc()},l.payload=t;return}t=t.return}}function ty(l,t,e){var a=rt();e={lane:a,revertLane:0,gesture:null,action:e,hasEagerState:!1,eagerState:null,next:null},cn(l)?b0(t,e):(e=Tc(l,t,e,a),e!==null&&(at(e,l,a),z0(e,t,a)))}function S0(l,t,e){var a=rt();Ia(l,t,e,a)}function Ia(l,t,e,a){var u={lane:a,revertLane:0,gesture:null,action:e,hasEagerState:!1,eagerState:null,next:null};if(cn(l))b0(t,u);else{var n=l.alternate;if(l.lanes===0&&(n===null||n.lanes===0)&&(n=t.lastRenderedReducer,n!==null))try{var c=t.lastRenderedState,i=n(c,e);if(u.hasEagerState=!0,u.eagerState=i,ft(i,c))return Qu(l,t,u,0),El===null&&Gu(),!1}catch{}if(e=Tc(l,t,u,a),e!==null)return at(e,l,a),z0(e,t,a),!0}return!1}function ni(l,t,e,a){if(a={lane:2,revertLane:qi(),gesture:null,action:a,hasEagerState:!1,eagerState:null,next:null},cn(l)){if(t)throw Error(o(479))}else t=Tc(l,e,a,2),t!==null&&at(t,l,2)}function cn(l){var t=l.alternate;return 
l===J||t!==null&&t===J}function b0(l,t){oa=Pu=!0;var e=l.pending;e===null?t.next=t:(t.next=e.next,e.next=t),l.pending=t}function z0(l,t,e){if((e&4194048)!==0){var a=t.lanes;a&=l.pendingLanes,e|=a,t.lanes=e,_f(l,e)}}var Pa={readContext:Zl,use:en,useCallback:Ol,useContext:Ol,useEffect:Ol,useImperativeHandle:Ol,useLayoutEffect:Ol,useInsertionEffect:Ol,useMemo:Ol,useReducer:Ol,useRef:Ol,useState:Ol,useDebugValue:Ol,useDeferredValue:Ol,useTransition:Ol,useSyncExternalStore:Ol,useId:Ol,useHostTransitionStatus:Ol,useFormState:Ol,useActionState:Ol,useOptimistic:Ol,useMemoCache:Ol,useCacheRefresh:Ol};Pa.useEffectEvent=Ol;var E0={readContext:Zl,use:en,useCallback:function(l,t){return Wl().memoizedState=[l,t===void 0?null:t],l},useContext:Zl,useEffect:u0,useImperativeHandle:function(l,t,e){e=e!=null?e.concat([l]):null,un(4194308,4,f0.bind(null,t,l),e)},useLayoutEffect:function(l,t){return un(4194308,4,l,t)},useInsertionEffect:function(l,t){un(4,2,l,t)},useMemo:function(l,t){var e=Wl();t=t===void 0?null:t;var a=l();if(Ye){It(!0);try{l()}finally{It(!1)}}return e.memoizedState=[a,t],a},useReducer:function(l,t,e){var a=Wl();if(e!==void 0){var u=e(t);if(Ye){It(!0);try{e(t)}finally{It(!1)}}}else u=t;return a.memoizedState=a.baseState=u,l={pending:null,lanes:0,dispatch:null,lastRenderedReducer:l,lastRenderedState:u},a.queue=l,l=l.dispatch=ty.bind(null,J,l),[a.memoizedState,l]},useRef:function(l){var t=Wl();return l={current:l},t.memoizedState=l},useState:function(l){l=Ic(l);var t=l.queue,e=S0.bind(null,J,t);return t.dispatch=e,[l.memoizedState,e]},useDebugValue:ti,useDeferredValue:function(l,t){var e=Wl();return ei(e,l,t)},useTransition:function(){var l=Ic(!1);return l=y0.bind(null,J,l.queue,!0,!1),Wl().memoizedState=l,[!1,l]},useSyncExternalStore:function(l,t,e){var a=J,u=Wl();if(el){if(e===void 0)throw Error(o(407));e=e()}else{if(e=t(),El===null)throw Error(o(349));(ll&127)!==0||Zs(a,t,e)}u.memoizedState=e;var n={value:e,getSnapshot:t};return 
u.queue=n,u0(Ls.bind(null,a,n,l),[l]),a.flags|=2048,ya(9,{destroy:void 0},Vs.bind(null,a,n,e,t),null),e},useId:function(){var l=Wl(),t=El.identifierPrefix;if(el){var e=Nt,a=Dt;e=(a&~(1<<32-it(a)-1)).toString(32)+e,t="_"+t+"R_"+e,e=ln++,0<\/script>",n=n.removeChild(n.firstChild);break;case"select":n=typeof a.is=="string"?c.createElement("select",{is:a.is}):c.createElement("select"),a.multiple?n.multiple=!0:a.size&&(n.size=a.size);break;default:n=typeof a.is=="string"?c.createElement(u,{is:a.is}):c.createElement(u)}}n[Ql]=t,n[Fl]=a;l:for(c=t.child;c!==null;){if(c.tag===5||c.tag===6)n.appendChild(c.stateNode);else if(c.tag!==4&&c.tag!==27&&c.child!==null){c.child.return=c,c=c.child;continue}if(c===t)break l;for(;c.sibling===null;){if(c.return===null||c.return===t)break l;c=c.return}c.sibling.return=c.return,c=c.sibling}t.stateNode=n;l:switch(Ll(n,u,a),u){case"button":case"input":case"select":case"textarea":a=!!a.autoFocus;break l;case"img":a=!0;break l;default:a=!1}a&&Lt(t)}}return pl(t),bi(t,t.type,l===null?null:l.memoizedProps,t.pendingProps,e),null;case 6:if(l&&t.stateNode!=null)l.memoizedProps!==a&&Lt(t);else{if(typeof a!="string"&&t.stateNode===null)throw Error(o(166));if(l=K.current,ua(t)){if(l=t.stateNode,e=t.memoizedProps,a=null,u=Xl,u!==null)switch(u.tag){case 27:case 5:a=u.memoizedProps}l[Ql]=t,l=!!(l.nodeValue===e||a!==null&&a.suppressHydrationWarning===!0||Qd(l.nodeValue,e)),l||ae(t,!0)}else l=Mn(l).createTextNode(a),l[Ql]=t,t.stateNode=l}return pl(t),null;case 31:if(e=t.memoizedState,l===null||l.memoizedState!==null){if(a=ua(t),e!==null){if(l===null){if(!a)throw Error(o(318));if(l=t.memoizedState,l=l!==null?l.dehydrated:null,!l)throw Error(o(557));l[Ql]=t}else Ue(),(t.flags&128)===0&&(t.memoizedState=null),t.flags|=4;pl(t),l=!1}else e=Dc(),l!==null&&l.memoizedState!==null&&(l.memoizedState.hydrationErrors=e),l=!0;if(!l)return t.flags&256?(ot(t),t):(ot(t),null);if((t.flags&128)!==0)throw Error(o(558))}return pl(t),null;case 
13:if(a=t.memoizedState,l===null||l.memoizedState!==null&&l.memoizedState.dehydrated!==null){if(u=ua(t),a!==null&&a.dehydrated!==null){if(l===null){if(!u)throw Error(o(318));if(u=t.memoizedState,u=u!==null?u.dehydrated:null,!u)throw Error(o(317));u[Ql]=t}else Ue(),(t.flags&128)===0&&(t.memoizedState=null),t.flags|=4;pl(t),u=!1}else u=Dc(),l!==null&&l.memoizedState!==null&&(l.memoizedState.hydrationErrors=u),u=!0;if(!u)return t.flags&256?(ot(t),t):(ot(t),null)}return ot(t),(t.flags&128)!==0?(t.lanes=e,t):(e=a!==null,l=l!==null&&l.memoizedState!==null,e&&(a=t.child,u=null,a.alternate!==null&&a.alternate.memoizedState!==null&&a.alternate.memoizedState.cachePool!==null&&(u=a.alternate.memoizedState.cachePool.pool),n=null,a.memoizedState!==null&&a.memoizedState.cachePool!==null&&(n=a.memoizedState.cachePool.pool),n!==u&&(a.flags|=2048)),e!==l&&e&&(t.child.flags|=8192),mn(t,t.updateQueue),pl(t),null);case 4:return xl(),l===null&&Xi(t.stateNode.containerInfo),pl(t),null;case 10:return Qt(t.type),pl(t),null;case 19:if(g(Dl),a=t.memoizedState,a===null)return pl(t),null;if(u=(t.flags&128)!==0,n=a.rendering,n===null)if(u)tu(a,!1);else{if(Ml!==0||l!==null&&(l.flags&128)!==0)for(l=t.child;l!==null;){if(n=Iu(l),n!==null){for(t.flags|=128,tu(a,!1),l=n.updateQueue,t.updateQueue=l,mn(t,l),t.subtreeFlags=0,l=e,e=t.child;e!==null;)Ss(e,l),e=e.sibling;return O(Dl,Dl.current&1|2),el&&Yt(t,a.treeForkCount),t.child}l=l.sibling}a.tail!==null&&nt()>gn&&(t.flags|=128,u=!0,tu(a,!1),t.lanes=4194304)}else{if(!u)if(l=Iu(n),l!==null){if(t.flags|=128,u=!0,l=l.updateQueue,t.updateQueue=l,mn(t,l),tu(a,!0),a.tail===null&&a.tailMode==="hidden"&&!n.alternate&&!el)return pl(t),null}else 2*nt()-a.renderingStartTime>gn&&e!==536870912&&(t.flags|=128,u=!0,tu(a,!1),t.lanes=4194304);a.isBackwards?(n.sibling=t.child,t.child=n):(l=a.last,l!==null?l.sibling=n:t.child=n,a.last=n)}return 
a.tail!==null?(l=a.tail,a.rendering=l,a.tail=l.sibling,a.renderingStartTime=nt(),l.sibling=null,e=Dl.current,O(Dl,u?e&1|2:e&1),el&&Yt(t,a.treeForkCount),l):(pl(t),null);case 22:case 23:return ot(t),Zc(),a=t.memoizedState!==null,l!==null?l.memoizedState!==null!==a&&(t.flags|=8192):a&&(t.flags|=8192),a?(e&536870912)!==0&&(t.flags&128)===0&&(pl(t),t.subtreeFlags&6&&(t.flags|=8192)):pl(t),e=t.updateQueue,e!==null&&mn(t,e.retryQueue),e=null,l!==null&&l.memoizedState!==null&&l.memoizedState.cachePool!==null&&(e=l.memoizedState.cachePool.pool),a=null,t.memoizedState!==null&&t.memoizedState.cachePool!==null&&(a=t.memoizedState.cachePool.pool),a!==e&&(t.flags|=2048),l!==null&&g(je),null;case 24:return e=null,l!==null&&(e=l.memoizedState.cache),t.memoizedState.cache!==e&&(t.flags|=2048),Qt(Ul),pl(t),null;case 25:return null;case 30:return null}throw Error(o(156,t.tag))}function cy(l,t){switch(Mc(t),t.tag){case 1:return l=t.flags,l&65536?(t.flags=l&-65537|128,t):null;case 3:return Qt(Ul),xl(),l=t.flags,(l&65536)!==0&&(l&128)===0?(t.flags=l&-65537|128,t):null;case 26:case 27:case 5:return Tu(t),null;case 31:if(t.memoizedState!==null){if(ot(t),t.alternate===null)throw Error(o(340));Ue()}return l=t.flags,l&65536?(t.flags=l&-65537|128,t):null;case 13:if(ot(t),l=t.memoizedState,l!==null&&l.dehydrated!==null){if(t.alternate===null)throw Error(o(340));Ue()}return l=t.flags,l&65536?(t.flags=l&-65537|128,t):null;case 19:return g(Dl),null;case 4:return xl(),null;case 10:return Qt(t.type),null;case 22:case 23:return ot(t),Zc(),l!==null&&g(je),l=t.flags,l&65536?(t.flags=l&-65537|128,t):null;case 24:return Qt(Ul),null;case 25:return null;default:return null}}function K0(l,t){switch(Mc(t),t.tag){case 3:Qt(Ul),xl();break;case 26:case 27:case 5:Tu(t);break;case 4:xl();break;case 31:t.memoizedState!==null&&ot(t);break;case 13:ot(t);break;case 19:g(Dl);break;case 10:Qt(t.type);break;case 22:case 23:ot(t),Zc(),l!==null&&g(je);break;case 24:Qt(Ul)}}function eu(l,t){try{var 
e=t.updateQueue,a=e!==null?e.lastEffect:null;if(a!==null){var u=a.next;e=u;do{if((e.tag&l)===l){a=void 0;var n=e.create,c=e.inst;a=n(),c.destroy=a}e=e.next}while(e!==u)}}catch(i){hl(t,t.return,i)}}function de(l,t,e){try{var a=t.updateQueue,u=a!==null?a.lastEffect:null;if(u!==null){var n=u.next;a=n;do{if((a.tag&l)===l){var c=a.inst,i=c.destroy;if(i!==void 0){c.destroy=void 0,u=t;var f=e,r=i;try{r()}catch(b){hl(u,f,b)}}}a=a.next}while(a!==n)}}catch(b){hl(t,t.return,b)}}function J0(l){var t=l.updateQueue;if(t!==null){var e=l.stateNode;try{Bs(t,e)}catch(a){hl(l,l.return,a)}}}function w0(l,t,e){e.props=Ge(l.type,l.memoizedProps),e.state=l.memoizedState;try{e.componentWillUnmount()}catch(a){hl(l,t,a)}}function au(l,t){try{var e=l.ref;if(e!==null){switch(l.tag){case 26:case 27:case 5:var a=l.stateNode;break;case 30:a=l.stateNode;break;default:a=l.stateNode}typeof e=="function"?l.refCleanup=e(a):e.current=a}}catch(u){hl(l,t,u)}}function Ut(l,t){var e=l.ref,a=l.refCleanup;if(e!==null)if(typeof a=="function")try{a()}catch(u){hl(l,t,u)}finally{l.refCleanup=null,l=l.alternate,l!=null&&(l.refCleanup=null)}else if(typeof e=="function")try{e(null)}catch(u){hl(l,t,u)}else e.current=null}function W0(l){var t=l.type,e=l.memoizedProps,a=l.stateNode;try{l:switch(t){case"button":case"input":case"select":case"textarea":e.autoFocus&&a.focus();break l;case"img":e.src?a.src=e.src:e.srcSet&&(a.srcset=e.srcSet)}}catch(u){hl(l,l.return,u)}}function zi(l,t,e){try{var a=l.stateNode;xy(a,l.type,e,t),a[Fl]=t}catch(u){hl(l,l.return,u)}}function $0(l){return l.tag===5||l.tag===3||l.tag===26||l.tag===27&&ge(l.type)||l.tag===4}function Ei(l){l:for(;;){for(;l.sibling===null;){if(l.return===null||$0(l.return))return null;l=l.return}for(l.sibling.return=l.return,l=l.sibling;l.tag!==5&&l.tag!==6&&l.tag!==18;){if(l.tag===27&&ge(l.type)||l.flags&2||l.child===null||l.tag===4)continue l;l.child.return=l,l=l.child}if(!(l.flags&2))return l.stateNode}}function Ti(l,t,e){var 
a=l.tag;if(a===5||a===6)l=l.stateNode,t?(e.nodeType===9?e.body:e.nodeName==="HTML"?e.ownerDocument.body:e).insertBefore(l,t):(t=e.nodeType===9?e.body:e.nodeName==="HTML"?e.ownerDocument.body:e,t.appendChild(l),e=e._reactRootContainer,e!=null||t.onclick!==null||(t.onclick=Ht));else if(a!==4&&(a===27&&ge(l.type)&&(e=l.stateNode,t=null),l=l.child,l!==null))for(Ti(l,t,e),l=l.sibling;l!==null;)Ti(l,t,e),l=l.sibling}function yn(l,t,e){var a=l.tag;if(a===5||a===6)l=l.stateNode,t?e.insertBefore(l,t):e.appendChild(l);else if(a!==4&&(a===27&&ge(l.type)&&(e=l.stateNode),l=l.child,l!==null))for(yn(l,t,e),l=l.sibling;l!==null;)yn(l,t,e),l=l.sibling}function k0(l){var t=l.stateNode,e=l.memoizedProps;try{for(var a=l.type,u=t.attributes;u.length;)t.removeAttributeNode(u[0]);Ll(t,a,e),t[Ql]=l,t[Fl]=e}catch(n){hl(l,l.return,n)}}var Kt=!1,jl=!1,Ai=!1,F0=typeof WeakSet=="function"?WeakSet:Set,Gl=null;function iy(l,t){if(l=l.containerInfo,Li=jn,l=ss(l),vc(l)){if("selectionStart"in l)var e={start:l.selectionStart,end:l.selectionEnd};else l:{e=(e=l.ownerDocument)&&e.defaultView||window;var a=e.getSelection&&e.getSelection();if(a&&a.rangeCount!==0){e=a.anchorNode;var u=a.anchorOffset,n=a.focusNode;a=a.focusOffset;try{e.nodeType,n.nodeType}catch{e=null;break l}var c=0,i=-1,f=-1,r=0,b=0,A=l,v=null;t:for(;;){for(var S;A!==e||u!==0&&A.nodeType!==3||(i=c+u),A!==n||a!==0&&A.nodeType!==3||(f=c+a),A.nodeType===3&&(c+=A.nodeValue.length),(S=A.firstChild)!==null;)v=A,A=S;for(;;){if(A===l)break t;if(v===e&&++r===u&&(i=c),v===n&&++b===a&&(f=c),(S=A.nextSibling)!==null)break;A=v,v=A.parentNode}A=S}e=i===-1||f===-1?null:{start:i,end:f}}else e=null}e=e||{start:0,end:0}}else e=null;for(Ki={focusedElem:l,selectionRange:e},jn=!1,Gl=t;Gl!==null;)if(t=Gl,l=t.child,(t.subtreeFlags&1028)!==0&&l!==null)l.return=t,Gl=l;else for(;Gl!==null;){switch(t=Gl,n=t.alternate,l=t.flags,t.tag){case 0:if((l&4)!==0&&(l=t.updateQueue,l=l!==null?l.events:null,l!==null))for(e=0;e title"))),Ll(n,a,e),n[Ql]=l,Yl(n),a=n;break 
l;case"link":var c=ao("link","href",u).get(a+(e.href||""));if(c){for(var i=0;ibl&&(c=bl,bl=Z,Z=c);var m=is(i,Z),s=is(i,bl);if(m&&s&&(S.rangeCount!==1||S.anchorNode!==m.node||S.anchorOffset!==m.offset||S.focusNode!==s.node||S.focusOffset!==s.offset)){var y=A.createRange();y.setStart(m.node,m.offset),S.removeAllRanges(),Z>bl?(S.addRange(y),S.extend(s.node,s.offset)):(y.setEnd(s.node,s.offset),S.addRange(y))}}}}for(A=[],S=i;S=S.parentNode;)S.nodeType===1&&A.push({element:S,left:S.scrollLeft,top:S.scrollTop});for(typeof i.focus=="function"&&i.focus(),i=0;ie?32:e,z.T=null,e=Ni,Ni=null;var n=re,c=kt;if(ql=0,Sa=re=null,kt=0,(sl&6)!==0)throw Error(o(331));var i=sl;if(sl|=4,fd(n.current),nd(n,n.current,c,e),sl=i,su(0,!1),ct&&typeof ct.onPostCommitFiberRoot=="function")try{ct.onPostCommitFiberRoot(Ma,n)}catch{}return!0}finally{D.p=u,z.T=a,Od(l,t)}}function xd(l,t,e){t=St(e,t),t=si(l.stateNode,t,2),l=ie(l,t,2),l!==null&&(Da(l,2),Ct(l))}function hl(l,t,e){if(l.tag===3)xd(l,l,e);else for(;t!==null;){if(t.tag===3){xd(t,l,e);break}else if(t.tag===1){var a=t.stateNode;if(typeof t.type.getDerivedStateFromError=="function"||typeof a.componentDidCatch=="function"&&(ye===null||!ye.has(a))){l=St(e,l),e=D0(2),a=ie(t,e,2),a!==null&&(N0(e,a,t,l),Da(a,2),Ct(a));break}}t=t.return}}function ji(l,t,e){var a=l.pingCache;if(a===null){a=l.pingCache=new dy;var u=new Set;a.set(t,u)}else u=a.get(t),u===void 0&&(u=new Set,a.set(t,u));u.has(e)||(Oi=!0,u.add(e),l=hy.bind(null,l,t,e),t.then(l,l))}function hy(l,t,e){var a=l.pingCache;a!==null&&a.delete(t),l.pingedLanes|=l.suspendedLanes&e,l.warmLanes&=~e,El===l&&(ll&e)===e&&(Ml===4||Ml===3&&(ll&62914560)===ll&&300>nt()-vn?(sl&2)===0&&ba(l,0):Mi|=e,ga===ll&&(ga=0)),Ct(l)}function Dd(l,t){t===0&&(t=Af()),l=De(l,t),l!==null&&(Da(l,t),Ct(l))}function vy(l){var t=l.memoizedState,e=0;t!==null&&(e=t.retryLane),Dd(l,e)}function gy(l,t){var e=0;switch(l.tag){case 31:case 13:var a=l.stateNode,u=l.memoizedState;u!==null&&(e=u.retryLane);break;case 
19:a=l.stateNode;break;case 22:a=l.stateNode._retryCache;break;default:throw Error(o(314))}a!==null&&a.delete(t),Dd(l,e)}function Sy(l,t){return wn(l,t)}var An=null,Ea=null,Hi=!1,pn=!1,Bi=!1,ve=0;function Ct(l){l!==Ea&&l.next===null&&(Ea===null?An=Ea=l:Ea=Ea.next=l),pn=!0,Hi||(Hi=!0,zy())}function su(l,t){if(!Bi&&pn){Bi=!0;do for(var e=!1,a=An;a!==null;){if(l!==0){var u=a.pendingLanes;if(u===0)var n=0;else{var c=a.suspendedLanes,i=a.pingedLanes;n=(1<<31-it(42|l)+1)-1,n&=u&~(c&~i),n=n&201326741?n&201326741|1:n?n|2:0}n!==0&&(e=!0,Rd(a,n))}else n=ll,n=Mu(a,a===El?n:0,a.cancelPendingCommit!==null||a.timeoutHandle!==-1),(n&3)===0||xa(a,n)||(e=!0,Rd(a,n));a=a.next}while(e);Bi=!1}}function by(){Nd()}function Nd(){pn=Hi=!1;var l=0;ve!==0&&Ny()&&(l=ve);for(var t=nt(),e=null,a=An;a!==null;){var u=a.next,n=Ud(a,t);n===0?(a.next=null,e===null?An=u:e.next=u,u===null&&(Ea=e)):(e=a,(l!==0||(n&3)!==0)&&(pn=!0)),a=u}ql!==0&&ql!==5||su(l),ve!==0&&(ve=0)}function Ud(l,t){for(var e=l.suspendedLanes,a=l.pingedLanes,u=l.expirationTimes,n=l.pendingLanes&-62914561;0i)break;var b=f.transferSize,A=f.initiatorType;b&&Xd(A)&&(f=f.responseEnd,c+=b*(f"u"?null:document;function Pd(l,t,e){var a=Ta;if(a&&typeof t=="string"&&t){var u=vt(t);u='link[rel="'+l+'"][href="'+u+'"]',typeof e=="string"&&(u+='[crossorigin="'+e+'"]'),Id.has(u)||(Id.add(u),l={rel:l,crossOrigin:e,href:t},a.querySelector(u)===null&&(t=a.createElement("link"),Ll(t,"link",l),Yl(t),a.head.appendChild(t)))}}function Gy(l){Ft.D(l),Pd("dns-prefetch",l,null)}function Qy(l,t){Ft.C(l,t),Pd("preconnect",l,t)}function Xy(l,t,e){Ft.L(l,t,e);var a=Ta;if(a&&l&&t){var u='link[rel="preload"][as="'+vt(t)+'"]';t==="image"&&e&&e.imageSrcSet?(u+='[imagesrcset="'+vt(e.imageSrcSet)+'"]',typeof e.imageSizes=="string"&&(u+='[imagesizes="'+vt(e.imageSizes)+'"]')):u+='[href="'+vt(l)+'"]';var n=u;switch(t){case"style":n=Aa(l);break;case"script":n=pa(l)}pt.has(n)||(l=U({rel:"preload",href:t==="image"&&e&&e.imageSrcSet?void 
0:l,as:t},e),pt.set(n,l),a.querySelector(u)!==null||t==="style"&&a.querySelector(yu(n))||t==="script"&&a.querySelector(ru(n))||(t=a.createElement("link"),Ll(t,"link",l),Yl(t),a.head.appendChild(t)))}}function Zy(l,t){Ft.m(l,t);var e=Ta;if(e&&l){var a=t&&typeof t.as=="string"?t.as:"script",u='link[rel="modulepreload"][as="'+vt(a)+'"][href="'+vt(l)+'"]',n=u;switch(a){case"audioworklet":case"paintworklet":case"serviceworker":case"sharedworker":case"worker":case"script":n=pa(l)}if(!pt.has(n)&&(l=U({rel:"modulepreload",href:l},t),pt.set(n,l),e.querySelector(u)===null)){switch(a){case"audioworklet":case"paintworklet":case"serviceworker":case"sharedworker":case"worker":case"script":if(e.querySelector(ru(n)))return}a=e.createElement("link"),Ll(a,"link",l),Yl(a),e.head.appendChild(a)}}}function Vy(l,t,e){Ft.S(l,t,e);var a=Ta;if(a&&l){var u=Ke(a).hoistableStyles,n=Aa(l);t=t||"default";var c=u.get(n);if(!c){var i={loading:0,preload:null};if(c=a.querySelector(yu(n)))i.loading=5;else{l=U({rel:"stylesheet",href:l,"data-precedence":t},e),(e=pt.get(n))&&Ii(l,e);var f=c=a.createElement("link");Yl(f),Ll(f,"link",l),f._p=new Promise(function(r,b){f.onload=r,f.onerror=b}),f.addEventListener("load",function(){i.loading|=1}),f.addEventListener("error",function(){i.loading|=2}),i.loading|=4,Dn(c,t,a)}c={type:"stylesheet",instance:c,count:1,state:i},u.set(n,c)}}}function Ly(l,t){Ft.X(l,t);var e=Ta;if(e&&l){var a=Ke(e).hoistableScripts,u=pa(l),n=a.get(u);n||(n=e.querySelector(ru(u)),n||(l=U({src:l,async:!0},t),(t=pt.get(u))&&Pi(l,t),n=e.createElement("script"),Yl(n),Ll(n,"link",l),e.head.appendChild(n)),n={type:"script",instance:n,count:1,state:null},a.set(u,n))}}function Ky(l,t){Ft.M(l,t);var e=Ta;if(e&&l){var a=Ke(e).hoistableScripts,u=pa(l),n=a.get(u);n||(n=e.querySelector(ru(u)),n||(l=U({src:l,async:!0,type:"module"},t),(t=pt.get(u))&&Pi(l,t),n=e.createElement("script"),Yl(n),Ll(n,"link",l),e.head.appendChild(n)),n={type:"script",instance:n,count:1,state:null},a.set(u,n))}}function 
lo(l,t,e,a){var u=(u=K.current)?xn(u):null;if(!u)throw Error(o(446));switch(l){case"meta":case"title":return null;case"style":return typeof e.precedence=="string"&&typeof e.href=="string"?(t=Aa(e.href),e=Ke(u).hoistableStyles,a=e.get(t),a||(a={type:"style",instance:null,count:0,state:null},e.set(t,a)),a):{type:"void",instance:null,count:0,state:null};case"link":if(e.rel==="stylesheet"&&typeof e.href=="string"&&typeof e.precedence=="string"){l=Aa(e.href);var n=Ke(u).hoistableStyles,c=n.get(l);if(c||(u=u.ownerDocument||u,c={type:"stylesheet",instance:null,count:0,state:{loading:0,preload:null}},n.set(l,c),(n=u.querySelector(yu(l)))&&!n._p&&(c.instance=n,c.state.loading=5),pt.has(l)||(e={rel:"preload",as:"style",href:e.href,crossOrigin:e.crossOrigin,integrity:e.integrity,media:e.media,hrefLang:e.hrefLang,referrerPolicy:e.referrerPolicy},pt.set(l,e),n||Jy(u,l,e,c.state))),t&&a===null)throw Error(o(528,""));return c}if(t&&a!==null)throw Error(o(529,""));return null;case"script":return t=e.async,e=e.src,typeof e=="string"&&t&&typeof t!="function"&&typeof t!="symbol"?(t=pa(e),e=Ke(u).hoistableScripts,a=e.get(t),a||(a={type:"script",instance:null,count:0,state:null},e.set(t,a)),a):{type:"void",instance:null,count:0,state:null};default:throw Error(o(444,l))}}function Aa(l){return'href="'+vt(l)+'"'}function yu(l){return'link[rel="stylesheet"]['+l+"]"}function to(l){return U({},l,{"data-precedence":l.precedence,precedence:null})}function Jy(l,t,e,a){l.querySelector('link[rel="preload"][as="style"]['+t+"]")?a.loading=1:(t=l.createElement("link"),a.preload=t,t.addEventListener("load",function(){return a.loading|=1}),t.addEventListener("error",function(){return a.loading|=2}),Ll(t,"link",e),Yl(t),l.head.appendChild(t))}function pa(l){return'[src="'+vt(l)+'"]'}function ru(l){return"script[async]"+l}function eo(l,t,e){if(t.count++,t.instance===null)switch(t.type){case"style":var a=l.querySelector('style[data-href~="'+vt(e.href)+'"]');if(a)return t.instance=a,Yl(a),a;var 
u=U({},e,{"data-href":e.href,"data-precedence":e.precedence,href:null,precedence:null});return a=(l.ownerDocument||l).createElement("style"),Yl(a),Ll(a,"style",u),Dn(a,e.precedence,l),t.instance=a;case"stylesheet":u=Aa(e.href);var n=l.querySelector(yu(u));if(n)return t.state.loading|=4,t.instance=n,Yl(n),n;a=to(e),(u=pt.get(u))&&Ii(a,u),n=(l.ownerDocument||l).createElement("link"),Yl(n);var c=n;return c._p=new Promise(function(i,f){c.onload=i,c.onerror=f}),Ll(n,"link",a),t.state.loading|=4,Dn(n,e.precedence,l),t.instance=n;case"script":return n=pa(e.src),(u=l.querySelector(ru(n)))?(t.instance=u,Yl(u),u):(a=e,(u=pt.get(n))&&(a=U({},e),Pi(a,u)),l=l.ownerDocument||l,u=l.createElement("script"),Yl(u),Ll(u,"link",a),l.head.appendChild(u),t.instance=u);case"void":return null;default:throw Error(o(443,t.type))}else t.type==="stylesheet"&&(t.state.loading&4)===0&&(a=t.instance,t.state.loading|=4,Dn(a,e.precedence,l));return t.instance}function Dn(l,t,e){for(var a=e.querySelectorAll('link[rel="stylesheet"][data-precedence],style[data-precedence]'),u=a.length?a[a.length-1]:null,n=u,c=0;c title"):null)}function wy(l,t,e){if(e===1||t.itemProp!=null)return!1;switch(l){case"meta":case"title":return!0;case"style":if(typeof t.precedence!="string"||typeof t.href!="string"||t.href==="")break;return!0;case"link":if(typeof t.rel!="string"||typeof t.href!="string"||t.href===""||t.onLoad||t.onError)break;return t.rel==="stylesheet"?(l=t.disabled,typeof t.precedence=="string"&&l==null):!0;case"script":if(t.async&&typeof t.async!="function"&&typeof t.async!="symbol"&&!t.onLoad&&!t.onError&&t.src&&typeof t.src=="string")return!0}return!1}function no(l){return!(l.type==="stylesheet"&&(l.state.loading&3)===0)}function Wy(l,t,e,a){if(e.type==="stylesheet"&&(typeof a.media!="string"||matchMedia(a.media).matches!==!1)&&(e.state.loading&4)===0){if(e.instance===null){var u=Aa(a.href),n=t.querySelector(yu(u));if(n){t=n._p,t!==null&&typeof t=="object"&&typeof 
t.then=="function"&&(l.count++,l=Un.bind(l),t.then(l,l)),e.state.loading|=4,e.instance=n,Yl(n);return}n=t.ownerDocument||t,a=to(a),(u=pt.get(u))&&Ii(a,u),n=n.createElement("link"),Yl(n);var c=n;c._p=new Promise(function(i,f){c.onload=i,c.onerror=f}),Ll(n,"link",a),e.instance=n}l.stylesheets===null&&(l.stylesheets=new Map),l.stylesheets.set(e,t),(t=e.state.preload)&&(e.state.loading&3)===0&&(l.count++,e=Un.bind(l),t.addEventListener("load",e),t.addEventListener("error",e))}}var lf=0;function $y(l,t){return l.stylesheets&&l.count===0&&Rn(l,l.stylesheets),0lf?50:800)+t);return l.unsuspend=e,function(){l.unsuspend=null,clearTimeout(a),clearTimeout(u)}}:null}function Un(){if(this.count--,this.count===0&&(this.imgCount===0||!this.waitingForImages)){if(this.stylesheets)Rn(this,this.stylesheets);else if(this.unsuspend){var l=this.unsuspend;this.unsuspend=null,l()}}}var Cn=null;function Rn(l,t){l.stylesheets=null,l.unsuspend!==null&&(l.count++,Cn=new Map,t.forEach(ky,l),Cn=null,Un.call(l))}function ky(l,t){if(!(t.state.loading&4)){var e=Cn.get(l);if(e)var a=e.get(null);else{e=new Map,Cn.set(l,e);for(var u=l.querySelectorAll("link[data-precedence],style[data-precedence]"),n=0;n"u"||typeof __REACT_DEVTOOLS_GLOBAL_HOOK__.checkDCE!="function"))try{__REACT_DEVTOOLS_GLOBAL_HOOK__.checkDCE(h)}catch(M){console.error(M)}}return h(),df.exports=mr(),df.exports}var rr=yr();const hr=[{id:"all",label:"All"},{id:"lambda",label:"Lambda"},{id:"stateful",label:"Stateful"}],vr=[{id:"all",label:"All"},{id:"og18",label:"OG-18"},{id:"advanced_reasoning",label:"Advanced Reasoning"}],gf=["backend","mode","family","quant","replay"],gr=[{id:"reforged",label:"Reforged"},{id:"bare-vs-reforged",label:"Reforged vs Bare"},{id:"ablation",label:"Full Ablation"}],No=["reforged","bare","no_rescue","no_nudge","no_steps","no_recovery","no_compact"],Uo=["none","keep-last","full"],hf=[{id:"all",label:"All",groupBy:[]},{id:"by-backend",label:"By 
Backend",groupBy:["model","quant","ablation","replay"],intraSort:"backend"},{id:"by-family",label:"By Family",groupBy:["family"]}];async function Sr(){const h=window;if(!h.__FORGE_DATA__)throw new Error("window.__FORGE_DATA__ not injected — build via `python -m tests.eval.report --html `");return h.__FORGE_DATA__}function Co(h,M){if(M==="reforged")return h.filter(o=>o.ablation==="reforged");if(M==="bare-vs-reforged")return h.filter(o=>o.ablation==="reforged"||o.ablation==="bare");const C=new Set;for(const o of h)o.ablation.startsWith("no_")&&C.add(`${o.model}\0${o.backend}\0${o.mode}`);return h.filter(o=>C.has(`${o.model}\0${o.backend}\0${o.mode}`))}function Ro(h){const M=No.indexOf(h);return M===-1?No.length:M}function Zn(h){const M=Uo.indexOf(h);return M===-1?Uo.length:M}function Eu(h){return h==null?"":h>=95?"text-emerald-400":h>=90?"text-emerald-500/80":h>=70?"text-amber-400":h>=50?"text-orange-400":"text-red-400"}function Xn(h,M=0){return h==null?"—":`${h.toFixed(M)}%`}const br={0:"⁰",1:"¹",2:"²",3:"³",4:"⁴",5:"⁵",6:"⁶",7:"⁷",8:"⁸",9:"⁹"};function zr(h){return String(h).split("").map(M=>br[M]??M).join("")}function Er(h,M,C,o,q){const Y=p=>p.endsWith("_stateful");let H=M;return C==="lambda"?H=H.filter(p=>!Y(p)):C==="stateful"&&(H=H.filter(Y)),o==="og18"?H=H.filter(p=>q[p]==="og18"):o==="advanced_reasoning"&&(H=H.filter(p=>q[p]==="advanced_reasoning")),H.length===0?{rows:h,scenarios:M}:H.length===M.length?{rows:h,scenarios:M}:{rows:h.map(p=>{let E=0,G=0,U=0,x=0,P=0,k=0,vl=0,ul=0,zl=0,$=0;for(const fl of H)E+=p.scenarioRuns?.[fl]??0,G+=p.scenarioCorrect?.[fl]??0,U+=p.scenarioCompleted?.[fl]??0,x+=p.scenarioValidated?.[fl]??0,P+=p.scenarioIdealCalls?.[fl]??0,k+=p.scenarioActualCalls?.[fl]??0,vl+=p.scenarioWastedSum?.[fl]??0,ul+=p.scenarioWastedN?.[fl]??0,zl+=p.scenarioSpeedSum?.[fl]??0,$+=p.scenarioSpeedN?.[fl]??0;const 
nl=fl=>Math.round(fl*10)/10,al=E>0?nl(G/E*100):0,Hl=x>0?nl(G/x*100):null,Tl=E>0?nl(U/E*100):0,W=k>0?nl(Math.min(P/k,1)*100):0,Bl=ul>0?nl(vl/ul):0,Kl=$>0?nl(zl/$):0,$l=p.scenarioCompleted!==void 0,kl=Math.max(0,...H.map(fl=>p.scenarioRuns?.[fl]??0));return{...p,score:al,accuracy:$l?Hl:p.accuracy,completeness:$l?Tl:p.completeness,efficiency:$l?W:p.efficiency,wasted:$l?Bl:p.wasted,speed:$l?Kl:p.speed,n:kl}}),scenarios:H}}function Tr(h,M,C){const o=[...h];return o.sort((q,Y)=>{let H,R;return M.col==="label"?(H=q.label,R=Y.label):C.includes(M.col)?(H=q.scenarios[M.col]??-1,R=Y.scenarios[M.col]??-1):(H=q[M.col]??-1,R=Y[M.col]??-1),typeof H=="string"&&typeof R=="string"?M.asc?H.localeCompare(R):R.localeCompare(H):M.asc?H-R:R-H}),o}function Ar(h,M){return M.map(C=>String(h[C])).join("\0")}function pr(h,M,C,o,q){const Y=q==="reforged"?M:{id:M.id,label:M.label,groupBy:["model","backend","mode"]},H=q!=="reforged";if(Y.groupBy.length===0)return{sorted:Tr(h,C,o),groups:[]};const R=new Map;for(const G of h){const U=Ar(G,Y.groupBy);R.has(U)||R.set(U,[]),R.get(U).push(G)}const p=[];for(const[G,U]of R){U.sort((P,k)=>{if(H){const ul=Ro(P.ablation)-Ro(k.ablation);if(ul!==0)return ul;const zl=Zn(P.replay)-Zn(k.replay);return zl!==0?zl:k.score-P.score}const vl=k.score-P.score;return vl!==0?vl:Y.intraSort?String(P[Y.intraSort]).localeCompare(String(k[Y.intraSort])):0});const x=Y.groupBy.map(P=>U[0][P]).join(" / ");p.push({key:G,label:x,rows:U})}return p.sort((G,U)=>{const x=Math.max(...G.rows.map(k=>k.score));return Math.max(...U.rows.map(k=>k.score))-x}),{sorted:p.flatMap(G=>G.rows),groups:p}}function _r({active:h,onChange:M}){return _.jsxs("fieldset",{className:"mb-4",children:[_.jsx("legend",{className:"text-[0.65rem] font-semibold uppercase tracking-wider text-zinc-400 px-1 mb-1",children:"Screen"}),_.jsx("div",{className:"flex flex-col rounded border border-zinc-700 overflow-hidden",children:gr.map((C,o)=>_.jsx("button",{onClick:()=>M(C.id),className:`text-xs px-2 py-1.5 text-left 
transition-colors ${o>0?"border-t border-zinc-700":""} ${h===C.id?"bg-emerald-500/20 text-emerald-300 font-medium":"bg-zinc-900/40 text-zinc-400 hover:bg-zinc-900/70 hover:text-zinc-200"}`,children:C.label},C.id))})]})}function Or({active:h,onChange:M}){return _.jsxs("fieldset",{className:"mb-3 border border-zinc-800 rounded p-2",children:[_.jsx("legend",{className:"text-[0.65rem] font-semibold uppercase tracking-wider text-zinc-400 px-1",children:"View"}),_.jsx("div",{className:"flex flex-wrap gap-1",children:hf.map(C=>_.jsx("button",{onClick:()=>M(C.id),className:`text-[0.65rem] px-2 py-0.5 rounded-full border transition-colors ${h===C.id?"border-emerald-500 bg-emerald-500/15 text-emerald-400":"border-zinc-700 text-zinc-500 hover:border-zinc-500 hover:text-zinc-300"}`,children:C.label},C.id))})]})}const Mr={backend:"Backend",mode:"Mode",family:"Family",quant:"Quant",replay:"Reasoning Replay"};function xr({rows:h,filters:M,onFilterChange:C,activeScreen:o,onScreenChange:q,activeView:Y,onViewChange:H,scenarioScope:R,onScopeChange:p,suiteScope:E,onSuiteChange:G,showRetired:U,onShowRetiredChange:x,hasRetired:P,filteredCount:k,totalCount:vl,totalRuns:ul,timestamp:zl}){return _.jsxs("nav",{className:"w-52 min-w-52 shrink-0 border-r border-zinc-800 p-4 sticky top-0 h-screen overflow-y-auto bg-zinc-950/80",children:[_.jsx("h1",{className:"text-lg font-semibold mb-0.5",children:"Forge Eval"}),_.jsxs("p",{className:"text-xs text-zinc-500 mb-3",children:[k,"/",vl," configs ·"," ",ul.toLocaleString()," runs"]}),_.jsx(_r,{active:o,onChange:q}),_.jsxs("fieldset",{className:"mb-3 border border-zinc-800 rounded p-2",children:[_.jsx("legend",{className:"text-[0.65rem] font-semibold uppercase tracking-wider text-zinc-400 px-1",children:"Suite"}),_.jsx("div",{className:"flex flex-wrap gap-1",children:vr.map($=>_.jsx("button",{onClick:()=>G($.id),className:`text-[0.65rem] px-2 py-0.5 rounded-full border transition-colors ${E===$.id?"border-emerald-500 bg-emerald-500/15 
text-emerald-400":"border-zinc-700 text-zinc-500 hover:border-zinc-500 hover:text-zinc-300"}`,children:$.label},$.id))})]}),_.jsxs("fieldset",{className:"mb-3 border border-zinc-800 rounded p-2",children:[_.jsx("legend",{className:"text-[0.65rem] font-semibold uppercase tracking-wider text-zinc-400 px-1",children:"Scenarios"}),_.jsx("div",{className:"flex flex-wrap gap-1",children:hr.map($=>_.jsx("button",{onClick:()=>p($.id),className:`text-[0.65rem] px-2 py-0.5 rounded-full border transition-colors ${R===$.id?"border-emerald-500 bg-emerald-500/15 text-emerald-400":"border-zinc-700 text-zinc-500 hover:border-zinc-500 hover:text-zinc-300"}`,children:$.label},$.id))})]}),o==="reforged"&&_.jsx(Or,{active:Y,onChange:H}),gf.map($=>{const nl=[...new Set(h.map(al=>al[$]))].sort($==="replay"?(al,Hl)=>Zn(al)-Zn(Hl):void 0);return nl.length<2?null:_.jsxs("fieldset",{className:"mb-3 border border-zinc-800 rounded p-2",children:[_.jsx("legend",{className:"text-[0.65rem] font-semibold uppercase tracking-wider text-zinc-400 px-1",children:Mr[$]}),nl.map(al=>_.jsxs("label",{className:"flex items-center gap-1.5 text-xs py-0.5 cursor-pointer hover:text-zinc-200",children:[_.jsx("input",{type:"checkbox",checked:M[$]?.has(al)??!0,onChange:Hl=>C($,al,Hl.target.checked),className:"w-3.5 h-3.5 rounded border-zinc-600 bg-zinc-800 accent-emerald-500"}),_.jsx("span",{children:al})]},al))]},$)}),P&&_.jsxs("label",{className:"flex items-center gap-1.5 text-xs py-0.5 mt-2 cursor-pointer text-zinc-500 hover:text-zinc-300",children:[_.jsx("input",{type:"checkbox",checked:U,onChange:$=>x($.target.checked),className:"w-3.5 h-3.5 rounded border-zinc-600 bg-zinc-800 accent-emerald-500"}),_.jsx("span",{children:"Show retired models"})]}),_.jsx("p",{className:"text-[0.6rem] text-zinc-600 mt-4",children:zl})]})}const 
jo=[{key:"score",label:"Scr%"},{key:"accuracy",label:"Acc%"},{key:"completeness",label:"Cmp%"},{key:"efficiency",label:"Eff%"},{key:"wasted",label:"Wst"},{key:"speed",label:"Spd"},{key:"n",label:"N"}];function rf({col:h,sort:M}){return M.col!==h?null:_.jsx("span",{className:"ml-0.5 text-emerald-400",children:M.asc?"▲":"▼"})}function Dr({rows:h,scenarios:M,scenarioAbbrev:C,sort:o,onSort:q,checked:Y,onCompareToggle:H,groups:R,maxGen:p,genInfo:E}){const G=new Map;if(R.length>0){let x=0;for(const P of R)G.set(x,P.label),x+=P.rows.length}const U=2+jo.length+M.length;return _.jsx("div",{className:"w-full overflow-x-auto",children:_.jsxs("table",{className:"text-xs whitespace-nowrap border-collapse",children:[_.jsx("thead",{children:_.jsxs("tr",{className:"border-b border-zinc-800",children:[_.jsx("th",{className:"p-1.5 w-8"}),_.jsxs("th",{className:"p-1.5 text-left cursor-pointer select-none hover:text-emerald-400 sticky left-0 bg-zinc-950 z-10",onClick:()=>q("label"),children:["Model/Backend",_.jsx(rf,{col:"label",sort:o})]}),jo.map(x=>_.jsxs("th",{className:"p-1.5 text-right cursor-pointer select-none hover:text-emerald-400",onClick:()=>q(x.key),children:[x.label,_.jsx(rf,{col:x.key,sort:o})]},x.key)),M.map(x=>_.jsxs("th",{className:"p-1.5 text-right cursor-pointer select-none hover:text-emerald-400",onClick:()=>q(x),title:x,children:[C[x]||x.slice(0,3),_.jsx(rf,{col:x,sort:o})]},x))]})}),_.jsx("tbody",{children:h.map((x,P)=>{const k=Y.includes(P),vl=G.get(P);return _.jsxs(ml.Fragment,{children:[vl!=null&&_.jsx("tr",{className:"bg-zinc-900/30",children:_.jsx("td",{colSpan:U,className:"px-2 py-1 text-[0.6rem] font-semibold text-zinc-400 uppercase tracking-wider border-t border-zinc-700/50",children:vl})}),_.jsxs("tr",{className:`border-b border-zinc-900 hover:bg-zinc-900/50 transition-colors ${k?"bg-zinc-800/40":""} ${x.retired?"opacity-60":""}`,children:[_.jsx("td",{className:"p-1.5 
text-center",children:_.jsx("input",{type:"checkbox",checked:k,onChange:ul=>H(P,ul.target.checked),className:"w-3.5 h-3.5 rounded border-zinc-600 bg-zinc-800 accent-emerald-500 cursor-pointer"})}),_.jsxs("td",{className:"p-1.5 font-mono sticky left-0 bg-zinc-950 z-10",children:[x.label,p>0&&x.gen{const ul=E?.[String(x.gen)];return ul?`gen ${x.gen}: ${ul.note} (commit ${ul.commit}, ${ul.date})`:`gen ${x.gen}`})(),children:zr(x.gen)}),x.retired&&_.jsx("span",{className:"ml-1.5 align-middle text-[0.55rem] uppercase tracking-wider text-zinc-500 border border-zinc-700 rounded px-1",children:"retired"})]}),_.jsx("td",{className:`p-1.5 text-right tabular-nums ${Eu(x.score)}`,children:Xn(x.score,1)}),_.jsx("td",{className:`p-1.5 text-right tabular-nums ${Eu(x.accuracy)}`,children:Xn(x.accuracy,1)}),_.jsx("td",{className:`p-1.5 text-right tabular-nums ${Eu(x.completeness)}`,children:Xn(x.completeness,1)}),_.jsx("td",{className:`p-1.5 text-right tabular-nums ${Eu(x.efficiency)}`,children:Xn(x.efficiency)}),_.jsx("td",{className:"p-1.5 text-right tabular-nums text-zinc-400",children:x.wasted.toFixed(1)}),_.jsxs("td",{className:"p-1.5 text-right tabular-nums text-zinc-400",children:[x.speed.toFixed(1),"s"]}),_.jsx("td",{className:"p-1.5 text-right tabular-nums text-zinc-500",children:x.n}),M.map(ul=>{const zl=x.scenarios[ul],$=x.scenarioRuns?.[ul]??0;let nl,al;return zl!=null?(nl=String(zl),al=Eu(zl)):$===0?(nl="I",al="text-zinc-700"):(nl="—",al="text-zinc-600"),_.jsx("td",{className:`p-1.5 text-right tabular-nums ${al}`,children:nl},ul)})]})]},x.label)})})]})})}const Nr=[{key:"score",label:"Score",fmt:h=>h==null?"—":`${h.toFixed(1)}%`,higherBetter:!0},{key:"accuracy",label:"Accuracy",fmt:h=>h==null?"—":`${h.toFixed(1)}%`,higherBetter:!0},{key:"completeness",label:"Completeness",fmt:h=>h==null?"—":`${h.toFixed(1)}%`,higherBetter:!0},{key:"efficiency",label:"Efficiency",fmt:h=>h==null?"—":`${h.toFixed(1)}%`,higherBetter:!0},{key:"wasted",label:"Avg 
Wasted",fmt:h=>h==null?"—":h.toFixed(1),higherBetter:!1},{key:"speed",label:"Speed",fmt:h=>h==null?"—":`${h.toFixed(1)}s`,higherBetter:!1}];function Ho({va:h,vb:M,higherBetter:C}){if(h==null||M==null)return _.jsx("td",{className:"p-1.5 text-right text-zinc-600",children:"—"});const o=M-h,Y=(o>0?"+":"")+(Number.isInteger(o)?o:o.toFixed(1));let H="text-zinc-500";return o!==0&&(H=o>0===C?"text-emerald-400":"text-red-400"),_.jsx("td",{className:`p-1.5 text-right tabular-nums font-medium ${H}`,children:Y})}function Ur({a:h,b:M,scenarios:C,scenarioAbbrev:o,onSwap:q,onClear:Y}){const H=(R,p)=>p in R.scenarios?R.scenarios[p]:R[p]??null;return _.jsxs("div",{className:"mt-6 border border-zinc-800 rounded-lg p-4 max-w-2xl",children:[_.jsxs("div",{className:"flex items-center justify-between mb-3",children:[_.jsx("h3",{className:"text-sm font-semibold",children:"Compare"}),_.jsxs("div",{className:"flex gap-2",children:[_.jsx("button",{onClick:q,className:"text-xs px-2.5 py-1 rounded border border-zinc-700 hover:border-zinc-500 transition-colors",children:"Swap A↔B"}),_.jsx("button",{onClick:Y,className:"text-xs px-2.5 py-1 rounded border border-zinc-700 hover:border-red-500/50 hover:text-red-400 transition-colors",children:"Clear"})]})]}),_.jsxs("table",{className:"text-xs w-full border-collapse",children:[_.jsx("thead",{children:_.jsxs("tr",{className:"border-b border-zinc-800",children:[_.jsx("th",{className:"p-1.5 text-left text-zinc-500",children:"Metric"}),_.jsx("th",{className:"p-1.5 text-right text-zinc-400 max-w-48 truncate",title:h.label,children:h.label}),_.jsx("th",{className:"p-1.5 text-right text-zinc-500 w-16",children:"Delta"}),_.jsx("th",{className:"p-1.5 text-right text-zinc-400 max-w-48 truncate",title:M.label,children:M.label})]})}),_.jsxs("tbody",{children:[Nr.map(R=>{const p=H(h,R.key),E=H(M,R.key);return _.jsxs("tr",{className:"border-b border-zinc-900/50",children:[_.jsx("td",{className:"p-1.5 
text-zinc-400",children:R.label}),_.jsx("td",{className:"p-1.5 text-right tabular-nums",children:R.fmt(p)}),_.jsx(Ho,{va:p,vb:E,higherBetter:R.higherBetter}),_.jsx("td",{className:"p-1.5 text-right tabular-nums",children:R.fmt(E)})]},R.key)}),_.jsx("tr",{children:_.jsx("td",{colSpan:4,className:"py-1",children:_.jsx("div",{className:"border-t border-zinc-800"})})}),C.map(R=>{const p=h.scenarios[R],E=M.scenarios[R],G=(U,x)=>U!=null?`${U}%`:(x.scenarioRuns?.[R]??0)===0?"I":"—";return _.jsxs("tr",{className:"border-b border-zinc-900/50",children:[_.jsx("td",{className:"p-1.5 text-zinc-500",children:o[R]||R}),_.jsx("td",{className:"p-1.5 text-right tabular-nums",children:G(p,h)}),_.jsx(Ho,{va:p,vb:E,higherBetter:!0}),_.jsx("td",{className:"p-1.5 text-right tabular-nums",children:G(E,M)})]},R)})]})]})]})}function Cr(h){const M={};for(const C of gf)M[C]=new Set(h.map(o=>o[C]));return M}function Rr(){const[h,M]=ml.useState(null),[C,o]=ml.useState(null),[q,Y]=ml.useState({col:"score",asc:!1}),[H,R]=ml.useState([]),[p,E]=ml.useState("reforged"),[G,U]=ml.useState("all"),[x,P]=ml.useState("all"),[k,vl]=ml.useState("all"),[ul,zl]=ml.useState(!1);ml.useEffect(()=>{Sr().then(g=>{M(g),o(Cr(g.rows))})},[]);const $=ml.useMemo(()=>h?ul?h.rows:h.rows.filter(g=>!g.retired):[],[h,ul]),nl=ml.useMemo(()=>h?h.rows.some(g=>g.retired):!1,[h]),al=ml.useMemo(()=>!h||!C?[]:Co($,p).filter(O=>gf.every(N=>!C[N]||C[N].has(O[N]))),[h,C,p,$]),{rows:Hl,scenarios:Tl}=ml.useMemo(()=>Er(al,h?.scenarios??[],x,k,h?.scenarioSuite??{}),[al,h,x,k]),W=ml.useMemo(()=>{if(!h)return{};const g=h.scenarioAbbrev,O=new Set(Tl),N={};for(const[X,K]of Object.entries(g))O.has(X)&&(N[X]=K);return N},[h,Tl]),Bl=ml.useMemo(()=>hf.find(g=>g.id===G)??hf[0],[G]),{sorted:Kl,groups:$l}=ml.useMemo(()=>pr(Hl,Bl,q,Tl,p),[Hl,Bl,q,Tl,p]),kl=ml.useMemo(()=>al.reduce((g,O)=>g+O.n*Tl.length,0),[al,Tl]),fl=ml.useCallback((g,O,N)=>{o(X=>{if(!X)return X;const K={...X,[g]:new Set(X[g])};return 
N?K[g].add(O):K[g].delete(O),K}),R([])},[]),Rt=ml.useCallback(g=>{E(g),R([])},[]),_t=ml.useCallback(g=>{U(g),R([])},[]),ut=ml.useCallback(g=>{P(g),R([])},[]),z=ml.useCallback(g=>{vl(g),R([])},[]),D=ml.useCallback(g=>{zl(g),R([])},[]),V=ml.useCallback(g=>{Y(O=>O.col===g?{col:g,asc:!O.asc}:{col:g,asc:g==="label"})},[]),dl=ml.useCallback((g,O)=>{R(N=>O?N.length>=2?[N[1],g]:[...N,g]:N.filter(X=>X!==g))},[]),yl=ml.useCallback(()=>{R(g=>[...g].reverse())},[]),d=ml.useCallback(()=>{R([])},[]);return!h||!C?_.jsx("div",{className:"flex items-center justify-center min-h-screen text-zinc-500",children:"Loading..."}):_.jsxs("div",{className:"flex min-h-screen",children:[_.jsx(xr,{rows:$,filters:C,onFilterChange:fl,activeScreen:p,onScreenChange:Rt,activeView:G,onViewChange:_t,scenarioScope:x,onScopeChange:ut,suiteScope:k,onSuiteChange:z,showRetired:ul,onShowRetiredChange:D,hasRetired:nl,filteredCount:al.length,totalCount:Co($,p).length,totalRuns:kl,timestamp:h.timestamp}),_.jsxs("main",{className:"flex-1 min-w-0 p-4 flex flex-col",children:[_.jsx(Dr,{rows:Kl,scenarios:Tl,scenarioAbbrev:W,sort:q,onSort:V,checked:H,onCompareToggle:dl,groups:$l,maxGen:h.maxGen??0,genInfo:h.genInfo}),H.length===2&&_.jsx(Ur,{a:Kl[H[0]],b:Kl[H[1]],scenarios:Tl,scenarioAbbrev:W,onSwap:yl,onClear:d}),_.jsxs("p",{className:"text-[0.6rem] text-zinc-600 mt-6",children:["Generated ",h.timestamp]})]})]})}rr.createRoot(document.getElementById("root")).render(_.jsx(ml.StrictMode,{children:_.jsx(Rr,{})})); - +
+
diff --git a/docs/results/index.md b/docs/results/index.md
index 0c4e176..7574620 100644
--- a/docs/results/index.md
+++ b/docs/results/index.md
@@ -15,5 +15,6 @@ For model and backend recommendations, see [Model Guide](../MODEL_GUIDE.md).
 ## Other cross-cuts
 
 - [native-vs-prompt.md](raw/native-vs-prompt.md) — llama-server native FC vs prompt-injected, reforged only
+- [reasoning-replay.md](raw/reasoning-replay.md) — reasoning_replay policy comparison (none / keep-last / full) per config
 
-*Generated 2026-06-03 00:09*
+*Generated 2026-06-11 20:28*
diff --git a/docs/results/raw/native-vs-prompt.md b/docs/results/raw/native-vs-prompt.md
index 7243f1b..6636389 100644
--- a/docs/results/raw/native-vs-prompt.md
+++ b/docs/results/raw/native-vs-prompt.md
@@ -6,20 +6,24 @@
 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s
 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-Ministral-3-14B-Instruct-2512-Q4_K_M LS/N [reforged] 78.1% 78.1% 100.0% 97% 0.3 4.0s 50 100 100 100 100 100 100 100 100 100 16 0 0 100 100 100 100 100 100 100 100 100 100 14 0 0 100
-Ministral-3-14B-Instruct-2512-Q4_K_M LS/P [reforged] 80.2% 80.2% 100.0% 100% 0.0 2.9s 50 100 100 100 100 100 100 100 100 100 0 0 36 100 100 100 100 100 100 100 100 100 100 0 0 50 100
+Ministral-3-14B-Instruct-2512-Q4_K_M LS/N [reforged] 77.8% 77.8% 100.0% 97% 0.3 4.0s 50 100 100 100 100 100 100 100 100 100 8 0 0 100 100 100 100 100 100 100
100 100 100 14 0 0 100 +Ministral-3-14B-Instruct-2512-Q4_K_M LS/P [reforged] 80.6% 80.6% 100.0% 100% 0.0 3.0s 50 100 100 100 100 100 100 100 100 100 0 0 46 100 100 100 100 100 100 100 100 100 98 0 0 52 100 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Ministral-3-14B-Reasoning-2512-Q4_K_M ``` ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [reforged] 84.5% 84.5% 100.0% 97% 0.6 5.4s 50 100 100 100 100 100 88 100 100 70 44 48 76 94 100 100 100 100 100 96 98 100 76 38 26 62 82 -Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [reforged] 79.5% 80.5% 98.7% 96% 0.5 3.7s 50 100 100 100 100 100 82 100 100 78 30 6 58 92 100 100 100 100 100 74 100 100 80 20 6 56 84 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [reforged] 83.3% 83.3% 100.0% 96% 0.6 4.8s 50 100 100 100 100 100 100 100 100 60 32 34 78 94 100 100 100 100 100 92 100 100 62 30 28 78 78 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [reforged:keep-last] 82.9% 82.9% 100.0% 95% 0.6 5.0s 50 100 100 100 100 100 92 100 100 68 40 30 60 94 100 100 100 100 100 96 100 100 62 36 20 72 86 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [reforged:full] 83.2% 83.2% 100.0% 97% 0.6 5.6s 50 100 100 100 100 100 86 98 100 68 40 32 76 96 100 100 100 100 100 92 100 100 62 34 30 68 82 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [reforged] 77.7% 78.8% 98.5% 95% 0.6 3.8s 50 100 100 100 100 100 74 100 100 78 32 2 46 90 100 100 100 100 100 66 100 100 74 28 4 48 78 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [reforged:keep-last] 78.0% 78.8% 98.9% 95% 0.6 3.8s 50 100 100 100 100 100 82 100 100 70 28 2 52 78 100 100 100 100 100 74 100 100 80 18 4 48 92 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [reforged:full] 81.2% 82.0% 98.9% 98% 0.6 3.8s 50 100 100 100 100 100 74 100 100 68 40 4 72 88 100 100 100 100 100 82 100 100 78 38 10 64 92 
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Ministral-3-8B-Instruct-2512-Q4_K_M @@ -28,8 +32,8 @@ Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [reforged] 79.5% 80.5% 98.7% --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Ministral-3-8B-Instruct-2512-Q4_K_M LS/N [reforged] 78.3% 78.4% 99.8% 95% 0.4 3.2s 50 100 100 100 100 100 100 100 98 100 22 0 0 100 100 100 100 100 100 100 100 98 100 14 2 2 100 -Ministral-3-8B-Instruct-2512-Q4_K_M LS/P [reforged] 75.6% 83.8% 90.2% 79% 1.3 3.0s 50 98 100 0 100 100 100 100 100 100 22 12 56 100 100 100 0 100 100 100 100 100 100 28 0 50 100 +Ministral-3-8B-Instruct-2512-Q4_K_M LS/N [reforged] 78.3% 78.3% 100.0% 95% 0.4 3.2s 50 100 100 100 100 100 100 100 100 100 18 0 0 100 100 100 100 100 100 100 100 100 100 16 2 0 100 +Ministral-3-8B-Instruct-2512-Q4_K_M LS/P [reforged] 74.9% 83.3% 89.9% 79% 1.3 3.1s 50 100 100 0 100 100 100 100 100 100 20 2 42 100 100 100 0 100 100 100 100 100 100 22 0 62 100 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` @@ -39,163 +43,193 @@ Ministral-3-8B-Instruct-2512-Q4_K_M LS/P [reforged] 75.6% 83.8% 90.2% ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Ministral-3-8B-Instruct-2512-Q8_0 LS/N [reforged] 81.4% 81.4% 100.0% 100% 0.3 4.0s 50 100 100 100 100 100 100 100 100 100 38 4 0 100 100 100 100 100 100 100 100 100 100 74 0 0 100 -Ministral-3-8B-Instruct-2512-Q8_0 LS/P [reforged] 84.4% 91.1% 92.6% 92% 0.7 4.6s 50 100 100 6 100 100 100 100 100 100 98 8 100 80 100 100 4 100 100 100 100 100 100 100 0 98 100 +Ministral-3-8B-Instruct-2512-Q8_0 LS/N [reforged] 81.0% 81.0% 100.0% 100% 0.3 4.1s 50 100 100 100 100 100 100 100 100 100 30 0 4 100 100 100 100 100 100 100 100 100 100 68 0 4 100 +Ministral-3-8B-Instruct-2512-Q8_0 LS/P [reforged] 84.2% 90.9% 92.7% 91% 0.7 4.6s 50 100 100 6 100 100 100 100 100 100 98 8 100 80 100 100 4 100 100 100 100 100 100 96 0 98 100 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Ministral-3-8B-Reasoning-2512-Q4_K_M ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [reforged] 82.8% 82.8% 99.9% 95% 0.5 4.1s 50 100 100 100 100 100 100 100 100 98 66 24 34 92 100 100 100 100 100 96 100 100 100 70 0 30 42 -Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [reforged] 80.5% 83.0% 97.0% 96% 0.7 2.7s 50 100 98 98 100 100 98 98 100 96 74 10 36 70 100 98 96 100 100 98 100 98 94 70 2 38 20 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s 
arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [reforged] 81.4% 81.6% 99.7% 94% 0.6 3.8s 50 100 100 98 100 100 100 100 100 100 68 24 18 86 100 100 98 100 100 100 100 100 96 70 4 28 26 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [reforged:keep-last] 82.4% 83.0% 99.3% 92% 0.6 4.2s 50 100 100 100 100 100 98 100 100 100 68 16 28 86 100 100 100 100 100 96 98 100 96 64 6 24 62 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [reforged:full] 81.8% 81.8% 100.0% 95% 0.5 4.3s 50 100 100 98 100 98 98 98 100 98 62 8 30 96 100 100 98 100 100 98 96 100 98 62 2 40 46 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [reforged] 79.5% 81.9% 97.0% 94% 0.7 2.8s 50 100 100 98 100 100 96 100 90 96 68 4 32 68 100 98 98 100 100 96 96 94 98 70 0 34 30 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [reforged:keep-last] 79.8% 82.6% 96.7% 95% 0.6 2.8s 50 100 100 96 100 100 98 98 94 98 66 8 36 64 100 100 96 100 100 100 98 96 94 76 0 34 24 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [reforged:full] 81.0% 83.1% 97.5% 95% 0.7 3.0s 50 100 98 88 100 100 96 100 98 98 84 24 38 70 100 100 100 100 100 96 98 100 96 72 2 22 26 +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Ministral-3-8B-Reasoning-2512-Q8_0 ``` 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [reforged] 84.2% 84.2% 100.0% 95% 0.5 6.0s 50 100 100 100 100 100 100 100 100 98 74 26 54 88 100 100 100 100 100 100 100 100 100 68 2 26 52 -Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [reforged] 81.3% 85.0% 95.7% 96% 0.7 3.9s 50 100 100 96 100 100 98 100 92 98 56 14 46 70 100 100 88 100 100 98 100 94 100 82 0 48 34 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [reforged] 84.5% 84.8% 99.7% 96% 0.6 5.3s 50 100 100 96 100 100 98 100 100 100 76 18 44 92 100 100 100 100 100 100 98 100 98 80 2 42 54 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [reforged:keep-last] 84.8% 85.6% 99.1% 94% 0.6 5.9s 50 100 100 100 100 100 100 100 100 98 70 24 42 86 100 100 100 100 100 100 100 100 100 82 6 36 62 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [reforged:full] 83.1% 83.1% 99.9% 96% 0.5 6.0s 50 100 100 100 100 100 100 98 100 100 66 20 36 88 100 100 100 100 100 100 100 100 98 74 2 26 52 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [reforged] 80.9% 84.8% 95.4% 95% 0.7 3.9s 50 100 100 90 100 100 98 100 98 100 72 6 50 76 100 100 86 100 100 100 100 92 96 64 2 42 32 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [reforged:keep-last] 81.2% 84.6% 96.0% 97% 0.7 4.1s 50 100 100 94 100 100 94 100 92 98 72 12 42 78 100 100 86 100 100 98 100 98 100 64 0 52 32 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [reforged:full] 81.4% 84.8% 96.0% 95% 0.7 4.0s 50 100 100 88 100 100 94 100 96 100 72 12 58 80 100 98 88 100 100 96 100 92 100 64 2 36 40 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ ``` ## Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N 
rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M LS/N [reforged] 71.0% 71.2% 99.8% 96% 0.5 6.5s 50 100 100 100 100 98 58 100 100 68 4 12 20 90 100 100 100 100 100 50 100 100 4 22 0 20 100 -Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M LS/P [reforged] 78.2% 84.3% 92.7% 78% 1.1 3.6s 50 100 100 100 100 100 100 100 100 28 0 0 100 94 100 100 100 100 100 100 100 100 34 0 0 100 76 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M LS/N [reforged:full]² 71.0% 71.2% 99.8% 96% 0.5 6.5s 50 100 100 100 100 98 58 100 100 68 4 12 20 90 100 100 100 100 100 50 100 100 4 22 0 20 100 +Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M LS/P 
[reforged:full]² 78.2% 84.3% 92.7% 78% 1.1 3.6s 50 100 100 100 100 100 100 100 100 28 0 0 100 94 100 100 100 100 100 100 100 100 34 0 0 100 76 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Nemotron-3-Nano-30B-A3B-Q4_K_M ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Nemotron-3-Nano-30B-A3B-Q4_K_M LS/N [reforged] 71.3% 81.0% 88.0% 72% 1.5 21.4s 50 100 100 100 100 100 66 98 52 92 28 4 34 34 100 100 100 98 100 86 92 68 98 24 8 34 38 -Nemotron-3-Nano-30B-A3B-Q4_K_M LS/P [reforged] 70.2% 70.7% 99.4% 89% 0.4 10.8s 50 100 100 100 100 98 52 100 84 90 6 4 0 100 100 100 100 100 100 42 100 80 92 6 2 4 66 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Nemotron-3-Nano-30B-A3B-Q4_K_M LS/N [reforged:full]² 71.3% 81.0% 88.0% 72% 1.5 21.4s 50 100 100 100 100 100 66 98 52 92 28 4 34 34 100 100 100 98 100 86 92 68 98 24 8 34 38 +Nemotron-3-Nano-30B-A3B-Q4_K_M LS/P [reforged:full]² 70.2% 70.7% 99.4% 89% 0.4 10.8s 50 100 100 100 100 98 52 100 84 90 6 4 0 100 100 100 100 100 100 42 100 80 92 6 2 4 66 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Qwen3-14B-Q4_K_M ``` --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
-Qwen3-14B-Q4_K_M LS/N [reforged] 67.7% 67.7% 99.9% 85% 0.9 20.8s 50 100 100 100 100 100 94 100 62 36 4 22 44 22 100 100 100 100 98 84 100 66 24 12 18 42 32 -Qwen3-14B-Q4_K_M LS/P [reforged] 70.5% 70.8% 99.7% 86% 0.5 24.2s 50 100 100 100 100 100 94 100 64 68 0 0 32 72 100 100 100 100 98 94 100 58 66 0 0 30 58 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Qwen3-14B-Q4_K_M LS/N [reforged] 68.5% 68.5% 100.0% 100% 0.4 21.8s 50 100 100 100 100 98 100 100 56 72 14 4 58 4 100 100 100 100 100 100 100 42 72 10 0 52 0 +Qwen3-14B-Q4_K_M LS/N [reforged:keep-last] 64.0% 64.0% 99.9% 91% 0.6 20.3s 50 100 100 100 100 100 90 98 48 40 18 6 38 4 100 100 100 98 96 88 100 54 30 16 2 38 0 +Qwen3-14B-Q4_K_M LS/N [reforged:full] 68.4% 68.4% 99.9% 83% 0.9 21.9s 50 100 100 100 100 98 90 98 60 32 20 18 50 38 100 100 100 100 100 86 100 74 34 6 18 34 22 +Qwen3-14B-Q4_K_M LS/P [reforged] 71.4% 71.4% 99.9% 86% 0.5 25.6s 50 100 100 100 100 100 98 100 70 70 0 0 30 80 100 100 100 100 100 92 100 76 56 0 0 22 62 +Qwen3-14B-Q4_K_M LS/P [reforged:keep-last] 71.8% 71.8% 100.0% 86% 0.5 23.8s 50 100 100 100 100 100 98 100 72 58 2 4 28 72 100 
100 100 100 100 94 100 76 74 0 0 32 56 +Qwen3-14B-Q4_K_M LS/P [reforged:full] 71.8% 71.9% 99.8% 87% 0.5 24.3s 50 100 100 100 100 100 96 100 72 72 2 0 30 74 100 100 100 100 100 92 100 74 68 0 0 38 48 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ ``` ## Qwen3-8B-Q4_K_M ``` -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Qwen3-8B-Q4_K_M LS/N [reforged] 68.2% 68.4% 99.6% 86% 0.7 16.1s 50 98 100 100 100 100 92 100 48 78 0 44 8 38 100 100 100 100 100 90 100 40 76 0 8 14 38 -Qwen3-8B-Q4_K_M LS/P [reforged] 70.4% 70.7% 99.6% 86% 0.5 17.8s 50 100 100 100 100 100 94 100 56 64 0 14 12 92 100 100 100 100 100 94 100 62 58 0 0 6 78 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s 
s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3-8B-Q4_K_M LS/N [reforged] 67.3% 67.5% 99.7% 96% 0.3 15.6s 50 100 100 100 100 100 100 100 40 98 6 14 22 2 100 100 100 100 100 100 100 48 86 2 0 26 6 +Qwen3-8B-Q4_K_M LS/N [reforged:keep-last] 64.5% 64.6% 99.9% 91% 0.4 15.0s 50 100 100 100 100 100 100 100 30 82 0 28 12 10 100 100 100 98 100 96 100 22 86 2 2 6 4 +Qwen3-8B-Q4_K_M LS/N [reforged:full] 65.8% 66.0% 99.7% 84% 0.7 17.2s 50 100 100 100 100 100 94 100 34 66 0 18 10 38 100 100 100 96 100 86 100 34 74 0 10 12 40 +Qwen3-8B-Q4_K_M LS/P [reforged] 71.1% 71.2% 99.8% 87% 0.5 18.0s 50 100 100 100 100 100 96 100 70 66 0 18 8 96 100 100 100 100 98 96 100 60 60 0 0 10 70 +Qwen3-8B-Q4_K_M LS/P [reforged:keep-last] 72.2% 72.3% 99.9% 88% 0.5 17.9s 50 100 100 100 100 100 100 100 62 66 0 30 8 90 100 100 100 100 100 98 100 74 68 0 0 8 74 +Qwen3-8B-Q4_K_M LS/P [reforged:full] 70.5% 70.8% 99.6% 88% 0.4 17.4s 50 100 100 100 100 100 88 100 58 66 0 24 10 88 100 100 100 100 100 94 98 62 66 0 0 4 76 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Qwen3-8B-Q8_0 ``` ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s 
grs_s iar_s ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Qwen3-8B-Q8_0 LS/N [reforged] 70.3% 70.5% 99.7% 88% 0.6 24.1s 50 100 100 100 100 100 100 100 60 82 4 22 20 32 100 100 100 100 98 94 100 58 66 2 12 28 50 -Qwen3-8B-Q8_0 LS/P [reforged] 73.1% 73.2% 99.8% 89% 0.4 28.4s 50 100 100 100 100 100 100 100 58 96 0 8 28 94 100 100 100 100 96 100 98 64 88 0 0 12 58 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3-8B-Q8_0 LS/N [reforged] 68.2% 68.5% 99.5% 95% 0.3 24.8s 50 100 100 100 100 100 100 100 56 78 6 24 28 6 98 100 100 100 100 100 100 52 84 6 2 30 2 +Qwen3-8B-Q8_0 LS/N [reforged:keep-last] 67.0% 67.2% 99.8% 92% 0.4 23.2s 50 100 100 100 100 100 94 98 48 84 2 22 10 12 100 100 100 100 100 96 100 52 80 0 18 20 6 +Qwen3-8B-Q8_0 LS/N [reforged:full] 69.3% 69.6% 99.6% 88% 0.6 24.7s 50 98 100 100 100 100 94 100 48 76 2 28 24 46 100 100 100 100 100 98 100 46 76 2 8 16 40 +Qwen3-8B-Q8_0 LS/P [reforged] 72.0% 72.3% 99.6% 88% 
0.4 28.6s 50 100 100 100 100 100 96 100 56 90 0 2 10 96 100 100 100 100 96 98 100 58 88 0 0 20 62 +Qwen3-8B-Q8_0 LS/P [reforged:keep-last] 72.8% 73.0% 99.7% 89% 0.4 28.0s 50 100 100 100 100 100 98 98 80 90 0 6 8 94 100 100 100 100 96 96 100 60 98 0 2 12 54 +Qwen3-8B-Q8_0 LS/P [reforged:full] 72.8% 72.9% 99.8% 88% 0.4 28.9s 50 100 100 100 100 100 98 100 70 90 0 4 20 96 100 100 100 100 92 100 96 66 92 0 0 12 56 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Qwen3.5-27B-Q4_K_M ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Qwen3.5-27B-Q4_K_M LS/N [reforged] 93.2% 93.3% 99.8% 82% 1.4 37.6s 50 100 100 100 100 100 100 100 98 100 74 38 88 98 100 100 100 100 100 98 100 100 100 78 56 96 98 -Qwen3.5-27B-Q4_K_M LS/P [reforged] 86.8% 86.8% 100.0% 100% 0.1 24.4s 50 100 100 100 100 100 100 100 100 100 42 10 78 100 100 100 100 100 100 100 100 100 100 36 10 80 100 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3.5-27B-Q4_K_M LS/N [reforged:full]² 93.2% 93.3% 99.8% 82% 1.4 37.6s 50 100 100 100 100 100 100 100 98 100 74 38 88 98 100 100 100 100 100 98 100 100 100 78 56 96 98 +Qwen3.5-27B-Q4_K_M LS/P [reforged:full]² 86.8% 86.8% 100.0% 100% 0.1 24.4s 50 100 100 100 100 100 100 100 100 100 42 10 78 100 100 100 100 100 100 100 100 100 100 36 10 80 100 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Qwen3.5-35B-A3B-Q4_K_M ``` --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
-Qwen3.5-35B-A3B-Q4_K_M LS/N [reforged] 92.1% 92.4% 99.7% 82% 1.3 11.1s 50 100 100 100 100 100 96 98 100 100 96 14 84 100 100 100 100 100 100 96 100 100 100 94 20 96 100 -Qwen3.5-35B-A3B-Q4_K_M LS/P [reforged] 82.8% 82.8% 100.0% 100% 0.2 10.4s 50 48 100 100 100 100 94 98 100 100 74 16 62 90 56 100 100 100 100 96 100 100 98 68 14 58 82 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3.5-35B-A3B-Q4_K_M LS/N [reforged:full]² 92.1% 92.4% 99.7% 82% 1.3 11.1s 50 100 100 100 100 100 96 98 100 100 96 14 84 100 100 100 100 100 100 96 100 100 100 94 20 96 100 +Qwen3.5-35B-A3B-Q4_K_M LS/P [reforged:full]² 82.8% 82.8% 100.0% 100% 0.2 10.4s 50 48 100 100 100 100 94 98 100 100 74 16 62 90 56 100 100 100 100 96 100 100 98 68 14 58 82 +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Qwen3.6-27B-Q4_K_M ``` 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Qwen3.6-27B-Q4_K_M LS/N [reforged] 92.2% 92.5% 99.6% 100% 0.4 37.9s 50 100 100 100 100 100 100 100 98 100 22 74 98 100 100 100 100 100 100 100 100 96 98 36 78 96 100 -Qwen3.6-27B-Q4_K_M LS/P [reforged] 83.5% 85.0% 98.2% 97% 0.4 53.9s 50 100 100 100 100 100 100 100 100 98 6 66 52 90 100 100 100 100 100 98 100 96 90 2 56 36 80 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3.6-27B-Q4_K_M LS/N [reforged:full]² 92.2% 92.5% 99.6% 100% 0.4 37.9s 50 100 100 
100 100 100 100 100 98 100 22 74 98 100 100 100 100 100 100 100 100 96 98 36 78 96 100 +Qwen3.6-27B-Q4_K_M LS/P [reforged:full]² 83.5% 85.0% 98.2% 97% 0.4 53.9s 50 100 100 100 100 100 100 100 100 98 6 66 52 90 100 100 100 100 100 98 100 96 90 2 56 36 80 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Qwen3.6-35B-A3B-UD-Q4_K_M ``` ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Qwen3.6-35B-A3B-UD-Q4_K_M LS/N [reforged] 94.8% 95.1% 99.7% 100% 0.6 12.7s 50 100 100 100 100 100 100 96 100 100 72 78 92 100 98 100 100 100 100 98 92 100 100 68 76 94 100 -Qwen3.6-35B-A3B-UD-Q4_K_M LS/P [reforged] 82.2% 82.2% 100.0% 100% 0.3 23.6s 50 96 100 100 100 100 90 92 98 92 16 46 62 98 88 100 100 100 100 88 94 96 88 8 42 50 94 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3.6-35B-A3B-UD-Q4_K_M LS/N [reforged:full]² 94.8% 95.1% 99.7% 100% 0.6 12.7s 50 100 100 100 100 100 100 96 100 100 72 78 92 100 98 100 100 100 100 98 92 100 100 68 76 94 100 +Qwen3.6-35B-A3B-UD-Q4_K_M LS/P [reforged:full]² 82.2% 82.2% 100.0% 100% 0.3 23.6s 50 96 100 100 100 100 90 92 98 92 16 46 62 98 88 100 100 100 100 88 94 96 88 8 42 50 94 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## gemma-4-E4B-it-Q4_K_M ``` -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
-gemma-4-E4B-it-Q4_K_M LS/N [reforged] 78.2% 82.2% 95.1% 98% 0.5 9.0s 50 100 100 100 100 100 92 98 98 90 0 24 80 50 100 100 100 100 100 94 90 94 98 0 0 84 40 -gemma-4-E4B-it-Q4_K_M LS/P [reforged] 72.8% 72.8% 100.0% 85% 0.6 8.5s 50 100 100 100 100 100 38 98 94 94 0 18 26 96 100 100 100 100 100 26 98 100 92 0 2 22 90 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +gemma-4-E4B-it-Q4_K_M LS/N [reforged] 79.7% 79.9% 99.8% 100% 0.3 8.1s 50 100 100 100 100 100 94 98 100 84 8 30 64 92 100 100 100 100 100 88 94 100 86 2 0 48 84 +gemma-4-E4B-it-Q4_K_M LS/N [reforged:keep-last] 78.7% 81.4% 96.7% 99% 0.5 9.3s 50 100 100 100 100 100 96 92 96 96 6 20 64 62 100 100 100 100 100 88 88 100 98 2 0 82 56 +gemma-4-E4B-it-Q4_K_M LS/N [reforged:full] 79.2% 82.2% 96.3% 99% 0.5 10.0s 50 100 100 100 100 100 96 88 94 100 2 40 78 54 100 100 100 100 100 94 84 98 96 0 0 82 52 +gemma-4-E4B-it-Q4_K_M LS/P [reforged] 72.4% 72.4% 99.9% 85% 0.6 8.9s 50 100 100 100 100 100 56 96 100 86 0 14 28 84 100 100 100 100 100 34 100 96 74 0 0 22 92 +gemma-4-E4B-it-Q4_K_M LS/P [reforged:keep-last] 73.2% 73.2% 100.0% 85% 0.6 8.5s 50 100 
100 100 100 100 40 98 96 92 0 10 24 96 100 100 100 100 100 38 100 100 86 0 0 32 92 +gemma-4-E4B-it-Q4_K_M LS/P [reforged:full] 72.9% 72.9% 100.0% 85% 0.6 8.5s 50 100 100 100 100 100 54 98 98 86 0 10 28 88 100 100 100 100 100 26 98 96 94 0 0 30 90 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## gemma-4-E4B-it-Q8_0 ``` ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -gemma-4-E4B-it-Q8_0 LS/N [reforged] 76.2% 80.7% 94.5% 98% 0.6 12.8s 50 100 100 100 100 100 84 88 90 96 2 14 80 44 100 100 100 100 100 88 90 96 94 4 0 80 32 -gemma-4-E4B-it-Q8_0 LS/P [reforged] 74.7% 74.7% 100.0% 85% 0.6 12.7s 50 100 100 100 100 100 70 100 90 88 0 16 34 94 100 100 100 100 100 48 100 98 84 0 0 30 90 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +gemma-4-E4B-it-Q8_0 LS/N [reforged] 77.8% 77.9% 99.8% 100% 0.2 10.8s 50 100 100 100 100 100 76 90 100 100 0 18 38 98 100 100 100 100 100 76 98 100 98 0 0 36 94 +gemma-4-E4B-it-Q8_0 LS/N [reforged:keep-last] 75.7% 79.3% 95.5% 99% 0.5 13.1s 50 100 100 100 100 100 76 92 98 96 4 16 82 48 100 100 100 100 100 50 86 98 94 2 0 86 40 +gemma-4-E4B-it-Q8_0 LS/N [reforged:full] 75.6% 80.8% 93.6% 98% 0.6 12.5s 50 100 100 100 100 100 92 94 92 88 0 18 82 28 100 100 100 100 100 76 92 94 98 2 0 84 26 +gemma-4-E4B-it-Q8_0 LS/P [reforged] 73.2% 73.3% 99.8% 85% 0.6 13.3s 50 100 100 100 100 100 54 98 94 80 0 28 20 94 100 100 100 100 100 40 100 98 90 0 0 12 96 +gemma-4-E4B-it-Q8_0 LS/P [reforged:keep-last] 74.1% 74.2% 99.8% 85% 0.6 13.4s 50 100 100 100 100 100 48 100 96 84 0 22 28 94 100 100 100 100 100 52 100 98 90 0 0 18 96 +gemma-4-E4B-it-Q8_0 LS/P [reforged:full] 73.7% 73.7% 100.0% 86% 0.6 12.7s 50 100 100 100 100 100 48 100 92 90 0 36 28 88 100 100 100 100 100 48 94 98 82 0 0 20 92 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## granite-4.1-8b-Q4_K_M ``` 
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -granite-4.1-8b-Q4_K_M LS/N [reforged] 65.4% 68.0% 96.2% 90% 0.8 1.8s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 -granite-4.1-8b-Q4_K_M LS/P [reforged] 61.5% 61.5% 100.0% 90% 0.3 2.5s 50 100 100 100 100 100 0 100 100 0 0 100 0 0 100 100 100 100 100 0 100 100 0 0 100 0 0 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +granite-4.1-8b-Q4_K_M LS/N [reforged] 65.4% 68.0% 96.2% 90% 0.8 1.9s 50 
100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q4_K_M LS/N [reforged:keep-last] 65.4% 68.0% 96.2% 90% 0.8 1.8s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q4_K_M LS/N [reforged:full] 65.4% 68.0% 96.2% 90% 0.8 1.9s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q4_K_M LS/P [reforged] 61.5% 61.5% 100.0% 90% 0.3 2.5s 50 100 100 100 100 100 0 100 100 0 0 100 0 0 100 100 100 100 100 0 100 100 0 0 100 0 0 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## granite-4.1-8b-Q8_0 @@ -204,7 +238,7 @@ granite-4.1-8b-Q4_K_M LS/P [reforged] 61.5% 61.5% 100.0% 90% 0.3 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -granite-4.1-8b-Q8_0 LS/N [reforged] 65.4% 65.4% 100.0% 88% 1.4 2.8s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q8_0 LS/N [reforged] 65.4% 65.4% 100.0% 88% 1.3 2.8s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 granite-4.1-8b-Q8_0 LS/P [reforged] 61.5% 66.7% 92.3% 73% 1.0 5.2s 50 0 
100 100 100 100 100 100 100 0 0 0 100 0 0 100 100 100 100 100 100 100 0 0 0 100 0 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` @@ -212,8 +246,10 @@ granite-4.1-8b-Q8_0 LS/P [reforged] 61.5% 66.7% 92.3% 73% 1.0 5. Scr=score(correct/total), Acc=accuracy(correct/total, excl validate errors), Cmp=completeness(completed/total), Eff=efficiency(ideal/actual calls), Wst=avg wasted calls, Spd=avg time(excl compaction) rel=relevance_detection, arg=argument_fidelity, tsl=tool_selection, b2s=basic_2step, s3s=sequential_3step, crt=conditional_routing, srn=sequential_reasoning, err=error_recovery, dgr=data_gap_recovery, dge=data_gap_recovery_extended, art=argument_transformation, grs=grounded_synthesis, iar=inconsistent_api_recovery, rel_s=relevance_detection_stateful, arg_s=argument_fidelity_stateful, tsl_s=tool_selection_stateful, b2s_s=basic_2step_stateful, s3s_s=sequential_3step_stateful, crt_s=conditional_routing_stateful, srn_s=sequential_reasoning_stateful, err_s=error_recovery_stateful, dgr_s=data_gap_recovery_stateful, dge_s=data_gap_recovery_extended_stateful, art_s=argument_transformation_stateful, grs_s=grounded_synthesis_stateful, iar_s=inconsistent_api_recovery_stateful Ablation: full=all guardrails, no_rescue=no rescue loop, no_nudge=no rescue/retry nudge, no_steps=no step enforcement, no_recovery=no error recovery, no_compact=no compaction, bare=all guardrails off +Replay: ':keep-last'/':full' tags = reasoning_replay policy (how much captured reasoning is re-sent to the backend each turn); untagged = none (default). Rows predating the knob ran unbounded replay and count as full. Eval generations (older runs carried forward, superscript-tagged): ¹ gen 1 — v0.6.0 suite — incl. 
Anthropic ablation (commit 2b05dc4, 2026-05-08) + ² gen 2 — v0.7.0 lineup refresh (8–14B) + 32GB tier debut (v0.7.4) (commit 655e1f6, 2026-05-22) -*Generated 2026-06-03 00:09* +*Generated 2026-06-11 20:28* diff --git a/docs/results/raw/reasoning-replay.md b/docs/results/raw/reasoning-replay.md new file mode 100644 index 0000000..ae2f00e --- /dev/null +++ b/docs/results/raw/reasoning-replay.md @@ -0,0 +1,420 @@ +# Forge Eval — Reasoning Replay Policies + +## Ministral-3-8B-Reasoning-2512-Q8_0 (llamaserver/native) [reforged] + +``` +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [reforged] 84.5% 84.8% 99.7% 96% 0.6 5.3s 50 100 100 96 100 100 98 100 100 100 76 18 44 92 100 100 100 100 100 100 98 100 98 80 2 42 54 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [reforged:keep-last] 84.8% 85.6% 99.1% 94% 0.6 5.9s 50 100 100 100 100 100 100 100 100 98 70 24 42 86 100 100 100 100 100 100 100 100 100 82 6 36 62 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [reforged:full] 83.1% 83.1% 99.9% 96% 0.5 6.0s 50 100 100 100 100 100 100 98 100 100 66 20 36 88 100 100 100 100 100 100 100 100 98 74 2 26 52 
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +``` + +## Ministral-3-14B-Reasoning-2512-Q4_K_M (llamaserver/native) [reforged] + +``` +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [reforged] 83.3% 83.3% 100.0% 96% 0.6 4.8s 50 100 100 100 100 100 100 100 100 60 32 34 78 94 100 100 100 100 100 92 100 100 62 30 28 78 78 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [reforged:keep-last] 82.9% 82.9% 100.0% 95% 0.6 5.0s 50 100 100 100 100 100 92 100 100 68 40 30 60 94 100 100 100 100 100 96 100 100 62 36 20 72 86 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [reforged:full] 83.2% 83.2% 100.0% 97% 0.6 5.6s 50 100 100 100 100 100 86 98 100 68 40 32 76 96 100 100 100 100 100 92 100 100 62 34 30 68 82 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## Ministral-3-8B-Reasoning-2512-Q4_K_M (llamaserver/native) [reforged] + 
+``` +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [reforged] 81.4% 81.6% 99.7% 94% 0.6 3.8s 50 100 100 98 100 100 100 100 100 100 68 24 18 86 100 100 98 100 100 100 100 100 96 70 4 28 26 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [reforged:keep-last] 82.4% 83.0% 99.3% 92% 0.6 4.2s 50 100 100 100 100 100 98 100 100 100 68 16 28 86 100 100 100 100 100 96 98 100 96 64 6 24 62 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [reforged:full] 81.8% 81.8% 100.0% 95% 0.5 4.3s 50 100 100 98 100 98 98 98 100 98 62 8 30 96 100 100 98 100 100 98 96 100 98 62 2 40 46 +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## Ministral-3-8B-Reasoning-2512-Q8_0 (llamaserver/prompt) [reforged] + +``` +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar 
rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [reforged] 80.9% 84.8% 95.4% 95% 0.7 3.9s 50 100 100 90 100 100 98 100 98 100 72 6 50 76 100 100 86 100 100 100 100 92 96 64 2 42 32 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [reforged:keep-last] 81.2% 84.6% 96.0% 97% 0.7 4.1s 50 100 100 94 100 100 94 100 92 98 72 12 42 78 100 100 86 100 100 98 100 98 100 64 0 52 32 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [reforged:full] 81.4% 84.8% 96.0% 95% 0.7 4.0s 50 100 100 88 100 100 94 100 96 100 72 12 58 80 100 98 88 100 100 96 100 92 100 64 2 36 40 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +``` + +## Ministral-3-14B-Reasoning-2512-Q4_K_M (llamaserver/prompt) [reforged] + +``` +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
+Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [reforged] 77.7% 78.8% 98.5% 95% 0.6 3.8s 50 100 100 100 100 100 74 100 100 78 32 2 46 90 100 100 100 100 100 66 100 100 74 28 4 48 78 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [reforged:keep-last] 78.0% 78.8% 98.9% 95% 0.6 3.8s 50 100 100 100 100 100 82 100 100 70 28 2 52 78 100 100 100 100 100 74 100 100 80 18 4 48 92 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [reforged:full] 81.2% 82.0% 98.9% 98% 0.6 3.8s 50 100 100 100 100 100 74 100 100 68 40 4 72 88 100 100 100 100 100 82 100 100 78 38 10 64 92 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## Ministral-3-8B-Reasoning-2512-Q4_K_M (llamaserver/prompt) [reforged] + +``` +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [reforged] 79.5% 81.9% 97.0% 94% 0.7 2.8s 50 100 100 98 100 100 96 100 90 96 68 4 32 68 100 98 98 100 100 96 96 94 98 70 0 34 30 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [reforged:keep-last] 79.8% 82.6% 96.7% 95% 0.6 2.8s 50 100 100 96 100 100 98 98 94 98 66 8 36 64 100 100 96 100 100 100 98 96 94 76 0 34 24 
+Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [reforged:full] 81.0% 83.1% 97.5% 95% 0.7 3.0s 50 100 98 88 100 100 96 100 98 98 84 24 38 70 100 100 100 100 100 96 98 100 96 72 2 22 26 +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## gemma-4-E4B-it-Q4_K_M (llamaserver/native) [reforged] + +``` +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +gemma-4-E4B-it-Q4_K_M LS/N [reforged] 79.7% 79.9% 99.8% 100% 0.3 8.1s 50 100 100 100 100 100 94 98 100 84 8 30 64 92 100 100 100 100 100 88 94 100 86 2 0 48 84 +gemma-4-E4B-it-Q4_K_M LS/N [reforged:keep-last] 78.7% 81.4% 96.7% 99% 0.5 9.3s 50 100 100 100 100 100 96 92 96 96 6 20 64 62 100 100 100 100 100 88 88 100 98 2 0 82 56 +gemma-4-E4B-it-Q4_K_M LS/N [reforged:full] 79.2% 82.2% 96.3% 99% 0.5 10.0s 50 100 100 100 100 100 96 88 94 100 2 40 78 54 100 100 100 100 100 94 84 98 96 0 0 82 52 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## gemma-4-E4B-it-Q8_0 
(llamaserver/native) [reforged] + +``` +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +gemma-4-E4B-it-Q8_0 LS/N [reforged] 77.8% 77.9% 99.8% 100% 0.2 10.8s 50 100 100 100 100 100 76 90 100 100 0 18 38 98 100 100 100 100 100 76 98 100 98 0 0 36 94 +gemma-4-E4B-it-Q8_0 LS/N [reforged:keep-last] 75.7% 79.3% 95.5% 99% 0.5 13.1s 50 100 100 100 100 100 76 92 98 96 4 16 82 48 100 100 100 100 100 50 86 98 94 2 0 86 40 +gemma-4-E4B-it-Q8_0 LS/N [reforged:full] 75.6% 80.8% 93.6% 98% 0.6 12.5s 50 100 100 100 100 100 92 94 92 88 0 18 82 28 100 100 100 100 100 76 92 94 98 2 0 84 26 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## gemma-4-E4B-it-Q8_0 (llamaserver/prompt) [reforged] + +``` +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +gemma-4-E4B-it-Q8_0 LS/P [reforged] 73.2% 73.3% 99.8% 85% 0.6 13.3s 50 100 100 100 100 100 54 98 94 80 0 28 20 94 100 100 100 100 100 40 100 98 90 0 0 12 96 +gemma-4-E4B-it-Q8_0 LS/P [reforged:keep-last] 74.1% 74.2% 99.8% 85% 0.6 13.4s 50 100 100 100 100 100 48 100 96 84 0 22 28 94 100 100 100 100 100 52 100 98 90 0 0 18 96 +gemma-4-E4B-it-Q8_0 LS/P [reforged:full] 73.7% 73.7% 100.0% 86% 0.6 12.7s 50 100 100 100 100 100 48 100 92 90 0 36 28 88 100 100 100 100 100 48 94 98 82 0 0 20 92 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## gemma-4-E4B-it-Q4_K_M (llamaserver/prompt) [reforged] + +``` +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +gemma-4-E4B-it-Q4_K_M LS/P [reforged] 72.4% 72.4% 99.9% 85% 0.6 8.9s 50 100 100 100 100 100 56 96 100 86 0 14 28 84 100 100 100 100 100 34 100 96 74 0 0 22 92 +gemma-4-E4B-it-Q4_K_M LS/P [reforged:keep-last] 73.2% 73.2% 100.0% 85% 
0.6 8.5s 50 100 100 100 100 100 40 98 96 92 0 10 24 96 100 100 100 100 100 38 100 100 86 0 0 32 92 +gemma-4-E4B-it-Q4_K_M LS/P [reforged:full] 72.9% 72.9% 100.0% 85% 0.6 8.5s 50 100 100 100 100 100 54 98 98 86 0 10 28 88 100 100 100 100 100 26 98 96 94 0 0 30 90 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## Qwen3-8B-Q8_0 (llamaserver/prompt) [reforged] + +``` +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3-8B-Q8_0 LS/P [reforged] 72.0% 72.3% 99.6% 88% 0.4 28.6s 50 100 100 100 100 100 96 100 56 90 0 2 10 96 100 100 100 100 96 98 100 58 88 0 0 20 62 +Qwen3-8B-Q8_0 LS/P [reforged:keep-last] 72.8% 73.0% 99.7% 89% 0.4 28.0s 50 100 100 100 100 100 98 98 80 90 0 6 8 94 100 100 100 100 96 96 100 60 98 0 2 12 54 +Qwen3-8B-Q8_0 LS/P [reforged:full] 72.8% 72.9% 99.8% 88% 0.4 28.9s 50 100 100 100 100 100 98 100 70 90 0 4 20 96 100 100 100 100 92 100 96 66 92 0 0 12 56 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## 
Qwen3-8B-Q4_K_M (llamaserver/prompt) [reforged] + +``` +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3-8B-Q4_K_M LS/P [reforged] 71.1% 71.2% 99.8% 87% 0.5 18.0s 50 100 100 100 100 100 96 100 70 66 0 18 8 96 100 100 100 100 98 96 100 60 60 0 0 10 70 +Qwen3-8B-Q4_K_M LS/P [reforged:keep-last] 72.2% 72.3% 99.9% 88% 0.5 17.9s 50 100 100 100 100 100 100 100 62 66 0 30 8 90 100 100 100 100 100 98 100 74 68 0 0 8 74 +Qwen3-8B-Q4_K_M LS/P [reforged:full] 70.5% 70.8% 99.6% 88% 0.4 17.4s 50 100 100 100 100 100 88 100 58 66 0 24 10 88 100 100 100 100 100 94 98 62 66 0 0 4 76 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## Qwen3-14B-Q4_K_M (llamaserver/prompt) [reforged] + +``` +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Qwen3-14B-Q4_K_M LS/P [reforged] 71.4% 71.4% 99.9% 86% 0.5 25.6s 50 100 100 100 100 100 98 100 70 70 0 0 30 80 100 100 100 100 100 92 100 76 56 0 0 22 62 +Qwen3-14B-Q4_K_M LS/P [reforged:keep-last] 71.8% 71.8% 100.0% 86% 0.5 23.8s 50 100 100 100 100 100 98 100 72 58 2 4 28 72 100 100 100 100 100 94 100 76 74 0 0 32 56 +Qwen3-14B-Q4_K_M LS/P [reforged:full] 71.8% 71.9% 99.8% 87% 0.5 24.3s 50 100 100 100 100 100 96 100 72 72 2 0 30 74 100 100 100 100 100 92 100 74 68 0 0 38 48 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +``` + +## Qwen3-8B-Q8_0 (llamaserver/native) [reforged] + +``` +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3-8B-Q8_0 LS/N [reforged] 68.2% 68.5% 99.5% 95% 0.3 24.8s 50 100 100 100 100 100 100 100 56 78 6 24 28 6 98 100 100 100 100 100 100 52 84 6 2 30 2 +Qwen3-8B-Q8_0 LS/N [reforged:keep-last] 67.0% 67.2% 99.8% 92% 0.4 23.2s 50 100 100 100 100 100 94 98 48 84 2 22 10 12 
100 100 100 100 100 96 100 52 80 0 18 20 6 +Qwen3-8B-Q8_0 LS/N [reforged:full] 69.3% 69.6% 99.6% 88% 0.6 24.7s 50 98 100 100 100 100 94 100 48 76 2 28 24 46 100 100 100 100 100 98 100 46 76 2 8 16 40 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## Qwen3-14B-Q4_K_M (llamaserver/native) [reforged] + +``` +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Qwen3-14B-Q4_K_M LS/N [reforged] 68.5% 68.5% 100.0% 100% 0.4 21.8s 50 100 100 100 100 98 100 100 56 72 14 4 58 4 100 100 100 100 100 100 100 42 72 10 0 52 0 +Qwen3-14B-Q4_K_M LS/N [reforged:keep-last] 64.0% 64.0% 99.9% 91% 0.6 20.3s 50 100 100 100 100 100 90 98 48 40 18 6 38 4 100 100 100 98 96 88 100 54 30 16 2 38 0 +Qwen3-14B-Q4_K_M LS/N [reforged:full] 68.4% 68.4% 99.9% 83% 0.9 21.9s 50 100 100 100 100 98 90 98 60 32 20 18 50 38 100 100 100 100 100 86 100 74 34 6 18 34 22 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +``` + +## gemma-4-E4B-it-Q4_K_M (llamaserver/native) [bare] + +``` 
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +gemma-4-E4B-it-Q4_K_M LS/N [bare] 67.5% 73.5% 91.8% 100% 0.3 7.5s 50 100 100 100 100 100 88 96 0 90 2 28 44 70 100 100 100 100 100 82 100 0 88 4 0 62 0 +gemma-4-E4B-it-Q4_K_M LS/N [bare:keep-last] 66.6% 75.6% 88.1% 100% 0.2 9.3s 50 100 100 68 100 100 86 90 0 96 0 24 78 84 98 100 66 100 100 76 88 0 96 4 0 78 0 +gemma-4-E4B-it-Q4_K_M LS/N [bare:full] 67.9% 77.5% 87.7% 100% 0.2 9.5s 50 96 98 78 100 100 94 84 0 92 2 24 80 80 98 100 78 100 100 92 92 0 94 0 0 84 0 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## Qwen3-8B-Q4_K_M (llamaserver/native) [reforged] + +``` +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3-8B-Q4_K_M LS/N [reforged] 67.3% 67.5% 99.7% 96% 0.3 15.6s 50 100 100 100 100 100 100 100 40 98 6 14 22 2 100 100 100 100 100 100 100 48 86 2 0 26 6 +Qwen3-8B-Q4_K_M LS/N [reforged:keep-last] 64.5% 64.6% 99.9% 91% 0.4 15.0s 50 100 100 100 100 100 100 100 30 82 0 28 12 10 100 100 100 98 100 96 100 22 86 2 2 6 4 +Qwen3-8B-Q4_K_M LS/N [reforged:full] 65.8% 66.0% 99.7% 84% 0.7 17.2s 50 100 100 100 100 100 94 100 34 66 0 18 10 38 100 100 100 96 100 86 100 34 74 0 10 12 40 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## gemma-4-E4B-it-Q8_0 (llamaserver/native) [bare] + +``` +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +gemma-4-E4B-it-Q8_0 LS/N [bare] 66.2% 72.2% 91.7% 100% 0.2 10.1s 50 98 100 100 100 100 82 94 0 94 0 30 48 86 100 100 100 100 100 74 96 0 94 0 0 26 0 +gemma-4-E4B-it-Q8_0 LS/N [bare:keep-last] 67.0% 75.4% 88.8% 100% 0.2 14.0s 50 96 100 72 98 100 74 96 0 96 0 28 78 94 98 
100 74 100 98 74 92 0 90 0 0 84 0 +gemma-4-E4B-it-Q8_0 LS/N [bare:full] 65.5% 75.3% 87.0% 100% 0.3 14.4s 50 96 96 76 100 100 88 86 0 80 0 14 84 80 92 100 74 100 100 76 88 0 88 0 0 86 0 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## granite-4.1-8b-Q4_K_M (llamaserver/native) [reforged] + +``` +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +granite-4.1-8b-Q4_K_M LS/N [reforged] 65.4% 68.0% 96.2% 90% 0.8 1.9s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q4_K_M LS/N [reforged:keep-last] 65.4% 68.0% 96.2% 90% 0.8 1.8s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q4_K_M LS/N [reforged:full] 65.4% 68.0% 96.2% 90% 0.8 1.9s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## Qwen3-8B-Q8_0 (llamaserver/prompt) 
[bare] + +``` +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3-8B-Q8_0 LS/P [bare] 63.5% 69.6% 91.3% 97% 0.2 27.0s 50 100 100 96 98 100 98 96 0 90 0 8 18 92 100 100 94 100 100 64 100 0 84 0 2 12 0 +Qwen3-8B-Q8_0 LS/P [bare:keep-last] 63.4% 69.7% 90.9% 96% 0.2 28.5s 50 100 100 94 100 100 98 98 0 88 0 4 16 92 100 100 88 100 94 72 98 0 96 0 0 10 0 +Qwen3-8B-Q8_0 LS/P [bare:full] 63.9% 69.9% 91.5% 96% 0.2 27.3s 50 100 100 94 100 100 96 98 0 88 0 2 14 90 100 100 98 100 100 68 98 0 96 0 0 20 0 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## Ministral-3-14B-Reasoning-2512-Q4_K_M (llamaserver/prompt) [bare] + +``` +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [bare] 63.6% 78.5% 81.0% 100% 0.3 3.3s 50 100 100 100 100 98 82 90 0 60 20 4 46 72 100 100 100 100 96 72 96 0 56 22 0 40 0 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [bare:keep-last] 63.4% 76.7% 82.7% 100% 0.3 3.2s 50 100 100 100 100 98 68 98 0 64 20 4 46 78 100 100 96 100 100 68 92 0 58 14 0 44 0 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [bare:full] 63.5% 77.3% 82.1% 100% 0.3 3.2s 50 100 100 100 100 100 76 96 0 64 12 2 40 70 100 98 100 100 100 70 92 0 60 18 0 52 0 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## gemma-4-E4B-it-Q4_K_M (llamaserver/prompt) [bare] + +``` +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +gemma-4-E4B-it-Q4_K_M LS/P [bare] 62.5% 67.8% 92.2% 91% 0.4 9.0s 50 100 100 100 100 100 56 100 0 88 0 22 30 100 100 100 100 100 100 10 100 0 88 0 0 30 0 +gemma-4-E4B-it-Q4_K_M LS/P 
[bare:keep-last] 59.9% 65.0% 92.2% 92% 0.4 8.7s 50 100 100 100 100 100 44 98 0 90 0 4 22 92 100 100 100 100 100 12 92 0 92 0 0 12 0 +gemma-4-E4B-it-Q4_K_M LS/P [bare:full] 61.5% 66.7% 92.2% 91% 0.4 9.1s 50 100 100 100 100 100 46 94 0 86 0 14 28 88 100 100 100 100 100 22 96 0 94 0 0 30 0 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## gemma-4-E4B-it-Q8_0 (llamaserver/prompt) [bare] + +``` +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +gemma-4-E4B-it-Q8_0 LS/P [bare] 61.2% 66.3% 92.3% 94% 0.3 12.1s 50 100 100 100 100 100 64 96 0 84 0 22 22 94 100 100 100 100 100 0 100 0 80 0 0 30 0 +gemma-4-E4B-it-Q8_0 LS/P [bare:keep-last] 61.2% 66.4% 92.2% 95% 0.3 12.5s 50 100 100 100 100 100 68 100 0 84 0 22 20 90 100 100 100 100 100 2 100 0 82 0 0 24 0 +gemma-4-E4B-it-Q8_0 LS/P [bare:full] 60.5% 65.7% 92.1% 94% 0.3 12.6s 50 100 100 100 100 100 60 98 0 84 0 14 24 90 100 100 100 100 100 2 96 0 88 0 0 18 0 
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## Ministral-3-8B-Reasoning-2512-Q8_0 (llamaserver/prompt) [bare] + +``` +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [bare] 58.6% 80.0% 73.3% 100% 0.4 3.3s 50 52 88 32 100 98 100 100 0 86 58 4 18 64 36 92 48 100 96 92 100 0 90 60 2 6 2 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [bare:keep-last] 59.2% 80.5% 73.5% 100% 0.4 3.1s 50 44 88 36 100 96 96 100 0 82 66 4 10 82 34 92 38 100 98 96 100 0 92 66 0 16 4 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [bare:full] 58.5% 79.5% 73.5% 100% 0.3 3.5s 50 26 98 40 100 98 96 100 0 92 54 6 14 86 44 92 38 100 96 92 100 0 92 44 0 8 4 +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## Qwen3-8B-Q4_K_M (llamaserver/prompt) [bare] + +``` 
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3-8B-Q4_K_M LS/P [bare] 57.4% 66.6% 86.2% 97% 0.2 17.2s 50 98 100 56 96 100 90 100 0 62 0 22 6 94 100 100 48 94 100 56 100 0 64 0 0 4 2 +Qwen3-8B-Q4_K_M LS/P [bare:keep-last] 59.1% 68.0% 86.9% 98% 0.2 16.6s 50 98 100 56 98 100 94 100 0 58 0 22 8 94 100 100 68 94 98 72 100 0 72 0 0 2 2 +Qwen3-8B-Q4_K_M LS/P [bare:full] 57.8% 66.9% 86.5% 97% 0.2 17.5s 50 100 100 66 94 100 96 100 0 54 0 24 8 96 100 100 58 96 98 68 100 0 44 0 0 2 0 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## Ministral-3-8B-Reasoning-2512-Q4_K_M (llamaserver/prompt) [bare] + +``` +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [bare] 55.2% 79.8% 69.2% 100% 0.4 2.3s 50 58 70 14 100 92 96 100 0 72 48 4 16 66 58 84 10 100 92 96 98 0 88 60 0 12 2 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [bare:keep-last] 54.3% 78.8% 68.9% 100% 0.3 2.4s 50 50 78 4 100 92 98 100 0 68 50 10 12 64 64 72 10 100 92 92 96 0 84 62 0 12 2 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [bare:full] 54.6% 78.7% 69.4% 100% 0.4 2.4s 50 58 74 4 100 86 94 98 0 76 54 2 12 62 66 78 8 100 94 96 98 0 86 54 0 18 2 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## Qwen3-14B-Q4_K_M (llamaserver/prompt) [bare] + +``` +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3-14B-Q4_K_M LS/P [bare] 54.1% 63.2% 85.6% 95% 0.2 22.3s 50 100 100 14 100 100 92 100 0 64 0 0 34 46 100 100 20 88 98 50 100 0 68 0 0 16 16 +Qwen3-14B-Q4_K_M LS/P [bare:keep-last] 53.8% 63.2% 85.2% 95% 0.2 23.1s 50 100 100 20 
100 100 94 100 0 60 0 2 30 56 100 100 18 88 98 38 100 0 62 0 0 24 10 +Qwen3-14B-Q4_K_M LS/P [bare:full] 53.9% 62.9% 85.8% 94% 0.2 24.4s 50 100 100 22 100 100 90 100 0 66 0 0 30 56 100 100 14 76 100 30 100 0 70 0 0 36 12 +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## granite-4.1-8b-Q4_K_M (llamaserver/native) [bare] + +``` +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +granite-4.1-8b-Q4_K_M LS/N [bare] 53.8% 70.0% 76.9% 96% 0.2 1.9s 50 0 100 100 100 100 100 100 0 100 0 0 0 0 0 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q4_K_M LS/N [bare:keep-last] 53.8% 70.0% 76.9% 96% 0.2 1.9s 50 0 100 100 100 100 100 100 0 100 0 0 0 0 0 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q4_K_M LS/N [bare:full] 53.8% 70.0% 76.9% 96% 0.2 2.0s 50 0 100 100 100 100 100 100 0 100 0 0 0 0 0 100 100 100 100 100 100 0 100 0 0 0 0 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## Qwen3-8B-Q4_K_M (llamaserver/native) [bare] + +``` 
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3-8B-Q4_K_M LS/N [bare] 53.2% 64.9% 82.1% 100% 0.1 15.0s 50 92 100 4 86 92 100 100 0 62 2 32 26 12 86 100 6 96 90 100 98 0 76 2 0 22 0 +Qwen3-8B-Q4_K_M LS/N [bare:keep-last] 46.4% 64.3% 72.2% 100% 0.1 15.0s 50 96 78 2 86 70 94 76 0 74 4 34 4 8 80 76 4 76 76 96 86 0 76 0 4 6 0 +Qwen3-8B-Q4_K_M LS/N [bare:full] 45.2% 63.7% 71.0% 100% 0.1 13.6s 50 92 80 2 86 76 100 80 0 74 0 14 12 2 88 68 4 74 82 94 70 0 60 0 4 14 0 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## Qwen3-8B-Q8_0 (llamaserver/native) [bare] + +``` +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3-8B-Q8_0 LS/N [bare] 50.4% 63.4% 79.5% 100% 0.1 23.1s 50 88 100 2 62 98 100 98 0 70 2 22 40 2 92 100 4 46 88 94 98 0 70 2 4 28 0 +Qwen3-8B-Q8_0 LS/N [bare:keep-last] 49.7% 65.4% 76.0% 100% 0.1 21.6s 50 86 78 2 88 88 100 80 0 64 2 18 18 8 90 88 2 84 94 96 86 0 92 0 12 16 0 +Qwen3-8B-Q8_0 LS/N [bare:full] 46.8% 63.0% 74.4% 100% 0.1 20.9s 50 84 70 0 84 96 96 82 0 62 0 12 14 8 88 70 2 98 88 84 86 0 74 0 4 16 0 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## Qwen3-14B-Q4_K_M (llamaserver/native) [bare] + +``` +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3-14B-Q4_K_M LS/N [bare] 48.1% 60.0% 80.2% 100% 0.1 21.2s 50 100 96 0 38 66 100 100 0 62 8 0 48 2 100 98 0 42 56 94 100 0 72 12 0 56 0 +Qwen3-14B-Q4_K_M LS/N [bare:keep-last] 30.8% 52.1% 59.2% 100% 0.0 18.3s 50 100 28 6 24 60 72 34 0 12 16 4 34 0 100 12 8 46 52 60 46 0 24 28 6 30 0 +Qwen3-14B-Q4_K_M LS/N [bare:full] 27.5% 48.5% 56.8% 100% 0.0 16.5s 
50 96 16 4 24 48 62 36 0 22 4 6 28 0 100 18 6 52 62 60 18 0 14 4 4 32 0 +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## Ministral-3-14B-Reasoning-2512-Q4_K_M (llamaserver/native) [bare] + +``` +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [bare] 40.2% 81.0% 49.7% 100% 0.4 4.5s 50 0 100 50 76 94 70 0 0 14 6 28 32 86 0 100 52 92 94 78 0 0 10 4 16 24 20 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [bare:keep-last] 41.5% 77.2% 53.7% 100% 0.3 5.5s 50 0 100 52 64 94 68 2 0 12 4 34 40 82 0 100 64 90 100 82 2 0 24 4 16 30 14 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [bare:full] 45.6% 80.2% 56.8% 100% 0.3 5.8s 50 0 100 72 78 98 72 6 0 26 18 34 36 94 0 100 44 96 92 84 2 0 16 28 26 40 24 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## Ministral-3-8B-Reasoning-2512-Q8_0 (llamaserver/native) [bare] + +``` 
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [bare] 43.5% 80.3% 54.2% 100% 0.3 3.8s 50 2 100 98 100 100 86 0 0 2 14 12 14 72 2 100 98 100 100 86 0 0 6 16 0 22 0 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [bare:keep-last] 44.2% 79.3% 55.7% 100% 0.3 4.8s 50 10 100 92 100 100 94 0 0 2 16 26 30 52 4 100 94 100 100 78 0 0 4 18 4 20 4 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [bare:full] 44.3% 77.9% 56.8% 100% 0.3 5.7s 50 4 100 94 100 100 84 0 0 2 10 20 24 86 4 100 94 100 100 84 0 0 2 10 4 28 2 +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## Ministral-3-8B-Reasoning-2512-Q4_K_M (llamaserver/native) [bare] + +``` +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [bare] 40.2% 77.0% 52.2% 98% 0.3 2.8s 50 6 100 84 100 100 68 0 0 4 6 18 8 62 10 100 86 100 100 70 0 0 2 12 0 8 0 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [bare:keep-last] 41.2% 77.3% 53.3% 100% 0.2 3.4s 50 6 100 82 100 100 78 0 0 4 10 20 20 58 10 100 90 100 100 70 0 0 4 6 4 10 0 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [bare:full] 41.8% 72.7% 57.5% 100% 0.3 3.7s 50 6 100 84 100 100 70 2 0 2 12 8 12 90 14 100 82 100 100 66 2 0 0 18 0 16 2 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +Scr=score(correct/total), Acc=accuracy(correct/total, excl validate errors), Cmp=completeness(completed/total), Eff=efficiency(ideal/actual calls), Wst=avg wasted calls, Spd=avg time(excl compaction) +rel=relevance_detection, arg=argument_fidelity, tsl=tool_selection, b2s=basic_2step, s3s=sequential_3step, crt=conditional_routing, srn=sequential_reasoning, err=error_recovery, dgr=data_gap_recovery, dge=data_gap_recovery_extended, art=argument_transformation, grs=grounded_synthesis, iar=inconsistent_api_recovery, rel_s=relevance_detection_stateful, arg_s=argument_fidelity_stateful, tsl_s=tool_selection_stateful, b2s_s=basic_2step_stateful, s3s_s=sequential_3step_stateful, crt_s=conditional_routing_stateful, srn_s=sequential_reasoning_stateful, err_s=error_recovery_stateful, dgr_s=data_gap_recovery_stateful, dge_s=data_gap_recovery_extended_stateful, art_s=argument_transformation_stateful, grs_s=grounded_synthesis_stateful, 
iar_s=inconsistent_api_recovery_stateful +Ablation: full=all guardrails, no_rescue=no rescue loop, no_nudge=no rescue/retry nudge, no_steps=no step enforcement, no_recovery=no error recovery, no_compact=no compaction, bare=all guardrails off +Replay: ':keep-last'/':full' tags = reasoning_replay policy (how much captured reasoning is re-sent to the backend each turn); untagged = none (default). Rows predating the knob ran unbounded replay and count as full. + +Eval generations (older runs carried forward, superscript-tagged): + ¹ gen 1 — v0.6.0 suite — incl. Anthropic ablation (commit 2b05dc4, 2026-05-08) + ² gen 2 — v0.7.0 lineup refresh (8–14B) + 32GB tier debut (v0.7.4) (commit 655e1f6, 2026-05-22) + +*Generated 2026-06-11 20:28* diff --git a/docs/results/raw/reforged-vs-bare.md b/docs/results/raw/reforged-vs-bare.md index cd40f62..c557e14 100644 --- a/docs/results/raw/reforged-vs-bare.md +++ b/docs/results/raw/reforged-vs-bare.md @@ -1,105 +1,121 @@ # Forge Eval — Reforged vs Bare -## claude-opus-4-6 (anthropic/native) +## claude-sonnet-4-6 (anthropic/native) ``` --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -claude-opus-4-6 AN/N [reforged]¹ 99.2% 99.8% 99.4% 100% 0.0 15.6s 50 100 100 100 100 100 98 100 100 100 100 98 94 98 100 100 100 100 100 100 100 96 100 100 98 100 98 -claude-opus-4-6 AN/N [bare+any]¹ 87.1% 95.4% 91.3% 100% 0.0 12.9s 50 100 100 100 100 98 
100 100 0 100 100 80 100 100 100 100 100 100 100 100 100 0 100 100 86 100 0 -claude-opus-4-6 AN/N [bare]¹ 87.9% 95.8% 91.8% 100% 0.0 16.5s 50 100 100 100 98 100 100 100 0 100 100 100 100 100 100 100 100 100 98 100 100 0 98 100 96 96 0 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +claude-sonnet-4-6 AN/N [reforged] 100.0% 100.0% 100.0% 100% 0.0 18.2s 50 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 +claude-sonnet-4-6 AN/N [bare+any] 81.2% 87.9% 92.3% 100% 0.0 9.8s 50 100 100 100 100 100 100 100 0 100 100 6 100 100 100 100 100 100 100 100 100 0 100 100 4 100 0 +claude-sonnet-4-6 AN/N [bare] 88.4% 95.8% 92.3% 100% 0.0 18.0s 50 100 100 100 100 100 100 100 0 100 100 98 100 100 100 100 100 100 100 100 100 0 100 100 100 100 0 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## claude-sonnet-4-6 (anthropic/native) +## claude-opus-4-8 (anthropic/native) ``` 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -claude-sonnet-4-6 AN/N [reforged]¹ 98.4% 98.5% 99.9% 100% 0.1 13.1s 50 100 100 100 100 100 100 100 100 100 98 74 98 100 100 100 100 100 100 100 100 100 100 100 88 100 100 -claude-sonnet-4-6 AN/N [bare]¹ 85.1% 95.0% 89.5% 100% 0.0 14.3s 50 100 100 68 98 100 100 100 0 100 98 86 98 100 100 100 66 100 100 100 100 0 100 100 98 100 0 -claude-sonnet-4-6 AN/N [bare+any]¹ 81.5% 88.2% 92.3% 100% 0.0 11.6s 50 100 100 100 100 100 100 100 0 100 100 12 100 100 100 100 100 100 100 100 100 0 100 100 16 90 0 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +claude-opus-4-8 AN/N [reforged] 100.0% 100.0% 100.0% 100% 0.0 13.3s 50 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 +claude-opus-4-8 AN/N [bare] 88.0% 95.8% 91.8% 100% 0.0 13.7s 50 90 100 100 100 100 100 100 0 100 100 100 100 100 98 100 100 100 100 100 100 0 100 100 100 100 0 +claude-opus-4-8 AN/N [bare+any] 83.6% 90.7% 92.2% 100% 0.0 9.3s 50 100 100 100 100 100 100 100 0 100 100 30 100 100 100 100 100 100 100 100 100 0 100 100 44 100 0 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## claude-opus-4-6 (anthropic/native) + +``` +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +claude-opus-4-6 AN/N [reforged:full]¹ 99.2% 99.8% 99.4% 100% 0.0 15.6s 50 100 100 100 100 100 98 100 100 100 100 98 94 98 100 100 100 100 100 100 100 96 100 100 98 100 98 +claude-opus-4-6 AN/N [bare+any:full]¹ 87.1% 95.4% 91.3% 100% 0.0 12.9s 50 100 100 100 100 98 100 100 0 100 100 80 100 
100 100 100 100 100 100 100 100 0 100 100 86 100 0 +claude-opus-4-6 AN/N [bare:full]¹ 87.9% 95.8% 91.8% 100% 0.0 16.5s 50 100 100 100 98 100 100 100 0 100 100 100 100 100 100 100 100 100 98 100 100 0 98 100 96 96 0 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Qwen3.6-35B-A3B-UD-Q4_K_M (llamaserver/native) ``` ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Qwen3.6-35B-A3B-UD-Q4_K_M LS/N [reforged] 94.8% 95.1% 99.7% 100% 0.6 12.7s 50 100 100 100 100 100 100 96 100 100 72 78 92 100 98 100 100 100 100 98 92 100 100 68 76 94 100 -Qwen3.6-35B-A3B-UD-Q4_K_M LS/N [bare] 52.8% 88.4% 59.8% 100% 0.1 11.8s 50 14 72 2 92 68 92 42 0 98 58 66 52 90 10 80 4 86 56 90 40 0 90 70 56 46 0 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3.6-35B-A3B-UD-Q4_K_M LS/N [reforged:full]² 94.8% 95.1% 99.7% 100% 0.6 12.7s 50 100 100 100 100 100 100 96 100 100 72 78 92 100 98 100 100 100 100 98 92 100 100 68 76 94 100 +Qwen3.6-35B-A3B-UD-Q4_K_M LS/N [bare:full]² 52.8% 88.4% 59.8% 100% 0.1 11.8s 50 14 72 2 92 68 92 42 0 98 58 66 52 90 10 80 4 86 56 90 40 0 90 70 56 46 0 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## claude-haiku-4-5-20251001 (anthropic/native) ``` ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -claude-haiku-4-5-20251001 AN/N [reforged]¹ 94.5% 94.9% 99.6% 100% 0.3 8.5s 50 100 100 100 100 100 100 100 100 100 80 80 98 100 100 100 100 100 100 100 94 100 100 76 36 94 100 -claude-haiku-4-5-20251001 AN/N [bare]¹ 46.5% 86.2% 54.0% 100% 0.0 8.2s 50 0 96 100 0 100 100 0 0 4 4 74 100 100 0 96 96 0 100 100 0 0 0 6 36 98 0 -claude-haiku-4-5-20251001 AN/N [bare+any]¹ 74.0% 80.2% 92.3% 100% 0.0 5.5s 50 100 100 100 100 100 100 100 0 100 92 0 22 100 100 100 100 100 100 100 100 0 100 82 2 26 0 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +claude-haiku-4-5-20251001 AN/N [reforged] 94.2% 94.2% 99.9% 100% 0.3 6.6s 50 100 100 100 100 100 100 98 100 100 74 74 98 100 100 100 100 100 100 100 100 100 100 72 38 94 100 +claude-haiku-4-5-20251001 AN/N [bare] 47.2% 87.1% 54.2% 100% 0.0 7.4s 50 0 100 100 2 100 100 0 0 0 2 82 100 100 0 98 100 0 100 100 0 0 0 2 42 100 
0 +claude-haiku-4-5-20251001 AN/N [bare+any] 74.0% 80.2% 92.3% 100% 0.0 5.2s 50 100 100 100 100 100 100 100 0 100 86 0 32 100 100 100 100 100 100 100 100 0 100 84 0 22 0 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Qwen3.5-27B-Q4_K_M (llamaserver/native) ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Qwen3.5-27B-Q4_K_M LS/N [reforged] 93.2% 93.3% 99.8% 82% 1.4 37.6s 50 100 100 100 100 100 100 100 98 100 74 38 88 98 100 100 100 100 100 98 100 100 100 78 56 96 98 -Qwen3.5-27B-Q4_K_M LS/N [bare] 15.8% 100.0% 15.8% 100% 0.0 11.0s 50 88 0 12 30 2 32 0 0 2 0 0 16 0 96 0 24 56 4 40 0 0 4 0 0 6 0 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn 
err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3.5-27B-Q4_K_M LS/N [reforged:full]² 93.2% 93.3% 99.8% 82% 1.4 37.6s 50 100 100 100 100 100 100 100 98 100 74 38 88 98 100 100 100 100 100 98 100 100 100 78 56 96 98 +Qwen3.5-27B-Q4_K_M LS/N [bare:full]² 15.8% 100.0% 15.8% 100% 0.0 11.0s 50 88 0 12 30 2 32 0 0 2 0 0 16 0 96 0 24 56 4 40 0 0 4 0 0 6 0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Qwen3.6-27B-Q4_K_M (llamaserver/native) ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Qwen3.6-27B-Q4_K_M LS/N [reforged] 92.2% 92.5% 99.6% 100% 0.4 37.9s 50 100 100 100 100 100 100 100 98 100 22 74 98 100 100 100 100 100 100 100 100 96 98 36 78 96 100 -Qwen3.6-27B-Q4_K_M LS/N [bare] 47.0% 92.4% 50.8% 100% 0.0 26.3s 50 52 88 82 68 28 42 48 0 72 26 22 62 22 62 88 80 66 40 42 54 0 80 24 12 62 0 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3.6-27B-Q4_K_M LS/N [reforged:full]² 92.2% 92.5% 99.6% 100% 0.4 37.9s 50 100 100 100 100 100 100 100 98 100 22 74 98 100 100 100 100 100 100 100 100 96 98 36 78 96 100 +Qwen3.6-27B-Q4_K_M LS/N [bare:full]² 47.0% 92.4% 50.8% 100% 0.0 26.3s 50 52 88 82 68 28 42 48 0 72 26 22 62 22 62 88 80 66 40 42 54 0 80 24 12 62 0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Qwen3.5-35B-A3B-Q4_K_M (llamaserver/native) ``` --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Qwen3.5-35B-A3B-Q4_K_M LS/N [reforged] 92.1% 92.4% 99.7% 82% 1.3 11.1s 50 100 100 100 100 100 96 98 100 100 96 14 84 100 100 100 100 100 100 96 100 100 100 94 20 96 100 -Qwen3.5-35B-A3B-Q4_K_M LS/N [bare] 12.2% 97.5% 12.5% 100% 0.0 3.9s 50 76 2 0 28 8 26 6 0 6 0 0 2 0 70 0 0 28 8 32 10 0 4 2 0 8 0 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3.5-35B-A3B-Q4_K_M LS/N [reforged:full]² 92.1% 92.4% 99.7% 82% 1.3 11.1s 50 100 100 100 100 100 96 98 100 100 96 14 84 100 100 100 100 100 100 96 100 100 100 94 20 96 100 +Qwen3.5-35B-A3B-Q4_K_M LS/N [bare:full]² 12.2% 97.5% 12.5% 100% 0.0 3.9s 50 76 2 0 28 8 26 6 0 6 0 0 2 0 70 0 0 28 8 32 10 0 4 2 0 8 0 
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Qwen3.5-27B-Q4_K_M (llamaserver/prompt) ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Qwen3.5-27B-Q4_K_M LS/P [reforged] 86.8% 86.8% 100.0% 100% 0.1 24.4s 50 100 100 100 100 100 100 100 100 100 42 10 78 100 100 100 100 100 100 100 100 100 100 36 10 80 100 -Qwen3.5-27B-Q4_K_M LS/P [bare] 74.3% 81.0% 91.8% 100% 0.0 24.0s 50 100 100 100 100 100 100 100 0 92 38 14 84 100 100 100 100 100 100 68 100 0 98 44 6 88 0 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3.5-27B-Q4_K_M LS/P [reforged:full]² 86.8% 86.8% 100.0% 100% 0.1 24.4s 50 100 100 100 100 100 100 100 100 100 42 10 78 100 100 100 100 100 100 100 100 100 100 36 10 80 100 +Qwen3.5-27B-Q4_K_M LS/P [bare:full]² 74.3% 81.0% 91.8% 100% 0.0 24.0s 50 100 100 100 100 100 100 100 0 92 38 14 84 100 100 100 100 100 100 68 100 0 98 44 6 88 0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## Ministral-3-14B-Reasoning-2512-Q4_K_M (llamaserver/native) +## Ministral-3-8B-Reasoning-2512-Q8_0 (llamaserver/native) ``` ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [reforged] 84.5% 84.5% 100.0% 97% 0.6 5.4s 50 100 100 100 100 100 88 100 100 70 44 48 76 94 100 100 100 100 100 96 98 100 76 38 26 62 82 -Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [bare] 43.3% 78.5% 55.2% 100% 0.3 5.6s 50 0 100 54 74 94 74 8 0 22 22 28 40 84 0 100 52 90 94 76 6 
0 24 24 20 26 14 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [reforged] 84.5% 84.8% 99.7% 96% 0.6 5.3s 50 100 100 96 100 100 98 100 100 100 76 18 44 92 100 100 100 100 100 100 98 100 98 80 2 42 54 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [reforged:keep-last] 84.8% 85.6% 99.1% 94% 0.6 5.9s 50 100 100 100 100 100 100 100 100 98 70 24 42 86 100 100 100 100 100 100 100 100 100 82 6 36 62 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [reforged:full] 83.1% 83.1% 99.9% 96% 0.5 6.0s 50 100 100 100 100 100 100 98 100 100 66 20 36 88 100 100 100 100 100 100 100 100 98 74 2 26 52 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [bare] 43.5% 80.3% 54.2% 100% 0.3 3.8s 50 2 100 98 100 100 86 0 0 2 14 12 14 72 2 100 98 100 100 86 0 0 6 16 0 22 0 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [bare:keep-last] 44.2% 79.3% 55.7% 100% 0.3 4.8s 50 10 100 92 100 100 94 0 0 2 16 26 30 52 4 100 94 100 100 78 0 0 4 18 4 20 4 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [bare:full] 44.3% 77.9% 56.8% 100% 0.3 5.7s 50 4 100 94 100 100 84 0 0 2 10 20 24 86 
4 100 94 100 100 84 0 0 2 10 4 28 2 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ ``` ## Ministral-3-8B-Instruct-2512-Q8_0 (llamaserver/prompt) @@ -108,97 +124,128 @@ Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [bare] 43.3% 78.5% 55.2% ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Ministral-3-8B-Instruct-2512-Q8_0 LS/P [reforged] 84.4% 91.1% 92.6% 92% 0.7 4.6s 50 100 100 6 100 100 100 100 100 100 98 8 100 80 100 100 4 100 100 100 100 100 100 100 0 98 100 -Ministral-3-8B-Instruct-2512-Q8_0 LS/P [bare] 50.2% 82.2% 61.0% 100% 0.2 2.6s 50 100 100 0 100 100 100 0 0 90 0 4 0 70 100 100 0 100 100 100 6 0 88 0 0 2 44 +Ministral-3-8B-Instruct-2512-Q8_0 LS/P [reforged] 84.2% 90.9% 92.7% 91% 0.7 4.6s 50 100 100 6 100 100 100 100 100 100 98 8 100 80 100 100 4 100 100 100 100 100 100 96 0 98 100 +Ministral-3-8B-Instruct-2512-Q8_0 LS/P [bare] 50.2% 83.0% 60.5% 100% 0.2 2.7s 50 100 100 0 100 98 100 0 0 82 0 8 0 78 100 100 0 100 94 100 0 0 90 0 0 0 54 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## Ministral-3-8B-Reasoning-2512-Q8_0 (llamaserver/native) +## Qwen3.6-27B-Q4_K_M (llamaserver/prompt) ``` --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [reforged] 84.2% 84.2% 100.0% 95% 0.5 6.0s 50 100 100 100 100 100 100 100 100 98 74 26 54 88 100 100 100 100 100 100 100 100 100 68 2 26 52 -Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [bare] 43.6% 76.5% 57.0% 100% 0.2 5.3s 50 4 100 94 100 100 84 0 0 4 18 12 24 76 2 100 96 100 100 82 0 0 0 10 0 28 0 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge 
art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3.6-27B-Q4_K_M LS/P [reforged:full]² 83.5% 85.0% 98.2% 97% 0.4 53.9s 50 100 100 100 100 100 100 100 100 98 6 66 52 90 100 100 100 100 100 98 100 96 90 2 56 36 80 +Qwen3.6-27B-Q4_K_M LS/P [bare:full]² 69.0% 75.5% 91.4% 100% 0.2 50.5s 50 100 100 100 100 100 100 100 0 100 0 54 46 94 100 98 100 100 92 4 100 0 98 4 70 34 0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## Qwen3.6-27B-Q4_K_M (llamaserver/prompt) +## Ministral-3-14B-Reasoning-2512-Q4_K_M (llamaserver/native) ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Qwen3.6-27B-Q4_K_M LS/P [reforged] 83.5% 85.0% 98.2% 97% 0.4 53.9s 50 100 100 100 100 100 100 100 100 98 6 66 52 90 100 100 100 100 100 98 100 96 90 2 56 36 80 -Qwen3.6-27B-Q4_K_M LS/P [bare] 69.0% 75.5% 91.4% 100% 0.2 50.5s 50 100 100 100 100 100 100 100 0 100 0 54 46 94 100 98 100 100 92 4 100 0 98 
4 70 34 0 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [reforged] 83.3% 83.3% 100.0% 96% 0.6 4.8s 50 100 100 100 100 100 100 100 100 60 32 34 78 94 100 100 100 100 100 92 100 100 62 30 28 78 78 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [reforged:keep-last] 82.9% 82.9% 100.0% 95% 0.6 5.0s 50 100 100 100 100 100 92 100 100 68 40 30 60 94 100 100 100 100 100 96 100 100 62 36 20 72 86 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [reforged:full] 83.2% 83.2% 100.0% 97% 0.6 5.6s 50 100 100 100 100 100 86 98 100 68 40 32 76 96 100 100 100 100 100 92 100 100 62 34 30 68 82 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [bare] 40.2% 81.0% 49.7% 100% 0.4 4.5s 50 0 100 50 76 94 70 0 0 14 6 28 32 86 0 100 52 92 94 78 0 0 10 4 16 24 20 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [bare:keep-last] 41.5% 77.2% 53.7% 100% 0.3 5.5s 50 0 100 52 64 94 68 2 0 12 4 34 40 82 0 100 64 90 100 82 2 0 24 4 16 30 14 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [bare:full] 45.6% 80.2% 56.8% 100% 0.3 5.8s 50 0 100 72 78 98 72 6 0 26 18 34 36 94 0 100 
44 96 92 84 2 0 16 28 26 40 24 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Qwen3.5-35B-A3B-Q4_K_M (llamaserver/prompt) ``` --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Qwen3.5-35B-A3B-Q4_K_M LS/P [reforged] 82.8% 82.8% 100.0% 100% 0.2 10.4s 50 48 100 100 100 100 94 98 100 100 74 16 62 90 56 100 100 100 100 96 100 100 98 68 14 58 82 -Qwen3.5-35B-A3B-Q4_K_M LS/P [bare] 59.7% 77.0% 77.5% 100% 0.1 9.8s 50 30 100 2 98 98 96 98 0 92 60 20 44 96 48 100 4 92 100 90 96 0 88 58 12 28 2 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s 
dge_s art_s grs_s iar_s +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3.5-35B-A3B-Q4_K_M LS/P [reforged:full]² 82.8% 82.8% 100.0% 100% 0.2 10.4s 50 48 100 100 100 100 94 98 100 100 74 16 62 90 56 100 100 100 100 96 100 100 98 68 14 58 82 +Qwen3.5-35B-A3B-Q4_K_M LS/P [bare:full]² 59.7% 77.0% 77.5% 100% 0.1 9.8s 50 30 100 2 98 98 96 98 0 92 60 20 44 96 48 100 4 92 100 90 96 0 88 58 12 28 2 +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Ministral-3-8B-Reasoning-2512-Q4_K_M (llamaserver/native) ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [reforged] 82.8% 82.8% 99.9% 95% 0.5 4.1s 50 100 100 100 100 100 100 100 100 98 66 24 34 92 100 100 100 100 100 96 100 100 100 70 0 30 42 -Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [bare] 42.5% 74.8% 56.8% 100% 0.3 3.6s 50 12 100 90 100 100 72 2 0 2 12 12 14 84 20 100 78 100 100 64 2 0 0 10 2 24 4 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [reforged] 81.4% 81.6% 99.7% 94% 0.6 3.8s 50 100 100 98 100 100 100 100 100 100 68 24 18 86 100 100 98 100 100 100 100 100 96 70 4 28 26 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [reforged:keep-last] 82.4% 83.0% 99.3% 92% 0.6 4.2s 50 100 100 100 100 100 98 100 100 100 68 16 28 86 100 100 100 100 100 96 98 100 96 64 6 24 62 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [reforged:full] 81.8% 81.8% 100.0% 95% 0.5 4.3s 50 100 100 98 100 98 98 98 100 98 62 8 30 96 100 100 98 100 100 98 96 100 98 62 2 40 46 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [bare] 40.2% 77.0% 52.2% 98% 0.3 2.8s 50 6 100 84 100 100 68 0 0 4 6 18 8 62 10 100 86 100 100 70 0 0 2 12 0 8 0 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [bare:keep-last] 41.2% 77.3% 53.3% 100% 0.2 3.4s 50 6 100 82 100 100 78 0 0 4 10 20 20 58 10 100 90 100 100 70 0 0 4 6 4 10 0 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [bare:full] 41.8% 72.7% 57.5% 100% 0.3 3.7s 50 6 100 84 100 100 70 2 0 2 12 8 12 90 14 100 82 100 
100 66 2 0 0 18 0 16 2 +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Qwen3.6-35B-A3B-UD-Q4_K_M (llamaserver/prompt) ``` ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Qwen3.6-35B-A3B-UD-Q4_K_M LS/P [reforged] 82.2% 82.2% 100.0% 100% 0.3 23.6s 50 96 100 100 100 100 90 92 98 92 16 46 62 98 88 100 100 100 100 88 94 96 88 8 42 50 94 -Qwen3.6-35B-A3B-UD-Q4_K_M LS/P [bare] 65.5% 75.6% 86.6% 100% 0.1 23.0s 50 96 100 32 100 100 94 96 0 94 16 52 46 98 88 100 28 100 100 60 92 0 100 10 50 50 0 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s 
srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3.6-35B-A3B-UD-Q4_K_M LS/P [reforged:full]² 82.2% 82.2% 100.0% 100% 0.3 23.6s 50 96 100 100 100 100 90 92 98 92 16 46 62 98 88 100 100 100 100 88 94 96 88 8 42 50 94 +Qwen3.6-35B-A3B-UD-Q4_K_M LS/P [bare:full]² 65.5% 75.6% 86.6% 100% 0.1 23.0s 50 96 100 32 100 100 94 96 0 94 16 52 46 98 88 100 28 100 100 60 92 0 100 10 50 50 0 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## Ministral-3-8B-Instruct-2512-Q8_0 (llamaserver/native) +## Ministral-3-8B-Reasoning-2512-Q8_0 (llamaserver/prompt) ``` -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Ministral-3-8B-Instruct-2512-Q8_0 LS/N [reforged] 81.4% 81.4% 100.0% 100% 0.3 4.0s 50 100 100 100 100 100 100 100 100 100 38 4 0 100 100 100 100 100 100 100 100 100 100 74 0 0 100 -Ministral-3-8B-Instruct-2512-Q8_0 LS/N [bare] 32.5% 62.0% 52.5% 100% 0.0 4.4s 50 0 100 0 0 100 100 0 0 
66 0 4 0 100 0 100 0 0 100 100 0 0 74 2 0 0 0 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [reforged] 80.9% 84.8% 95.4% 95% 0.7 3.9s 50 100 100 90 100 100 98 100 98 100 72 6 50 76 100 100 86 100 100 100 100 92 96 64 2 42 32 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [reforged:keep-last] 81.2% 84.6% 96.0% 97% 0.7 4.1s 50 100 100 94 100 100 94 100 92 98 72 12 42 78 100 100 86 100 100 98 100 98 100 64 0 52 32 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [reforged:full] 81.4% 84.8% 96.0% 95% 0.7 4.0s 50 100 100 88 100 100 94 100 96 100 72 12 58 80 100 98 88 100 100 96 100 92 100 64 2 36 40 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [bare] 58.6% 80.0% 73.3% 100% 0.4 3.3s 50 52 88 32 100 98 100 100 0 86 58 4 18 64 36 92 48 100 96 92 100 0 90 60 2 6 2 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [bare:keep-last] 59.2% 80.5% 73.5% 100% 0.4 3.1s 50 44 88 36 100 96 96 100 0 82 66 4 10 82 34 92 38 100 98 96 100 0 92 66 0 16 4 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [bare:full] 58.5% 79.5% 73.5% 100% 0.3 3.5s 50 26 98 40 100 98 96 100 
0 92 54 6 14 86 44 92 38 100 96 92 100 0 92 44 0 8 4 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ ``` -## Ministral-3-8B-Reasoning-2512-Q8_0 (llamaserver/prompt) +## Ministral-3-14B-Reasoning-2512-Q4_K_M (llamaserver/prompt) ``` --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [reforged] 81.3% 85.0% 95.7% 96% 0.7 3.9s 50 100 100 96 100 100 98 100 92 98 56 14 46 70 100 100 88 100 100 98 100 94 100 82 0 48 34 -Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [bare] 59.7% 80.7% 74.0% 100% 0.4 3.2s 50 48 92 38 100 96 98 100 0 94 54 10 16 78 54 92 32 100 98 98 100 0 90 48 0 14 2 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [reforged] 77.7% 78.8% 98.5% 95% 0.6 3.8s 50 100 100 100 100 100 74 100 100 78 32 2 46 90 100 100 100 100 100 66 100 100 74 28 4 48 78 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [reforged:keep-last] 78.0% 78.8% 98.9% 95% 0.6 3.8s 50 100 100 100 100 100 82 100 100 70 28 2 52 78 100 100 100 100 100 74 100 100 80 18 4 48 92 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [reforged:full] 81.2% 82.0% 98.9% 98% 0.6 3.8s 50 100 100 100 100 100 74 100 100 68 40 4 72 88 100 100 100 100 100 82 100 100 78 38 10 64 92 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [bare] 63.6% 78.5% 81.0% 100% 0.3 3.3s 50 100 100 100 100 98 82 90 0 60 20 4 46 72 100 100 100 100 96 72 96 0 56 22 0 40 0 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [bare:keep-last] 63.4% 76.7% 82.7% 100% 0.3 3.2s 50 100 100 100 100 98 68 98 0 64 20 4 46 78 100 100 96 100 100 68 92 0 58 14 0 44 0 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [bare:full] 63.5% 77.3% 82.1% 100% 0.3 3.2s 50 100 100 100 100 100 76 96 0 64 12 2 40 70 100 98 100 100 100 70 92 0 60 18 0 52 0 
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Ministral-3-8B-Reasoning-2512-Q4_K_M (llamaserver/prompt) ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [reforged] 80.5% 83.0% 97.0% 96% 0.7 2.7s 50 100 98 98 100 100 98 98 100 96 74 10 36 70 100 98 96 100 100 98 100 98 94 70 2 38 20 -Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [bare] 54.9% 79.3% 69.2% 100% 0.3 2.6s 50 56 78 8 100 96 98 100 0 68 60 10 18 64 60 82 8 100 96 92 96 0 72 56 0 10 0 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art 
grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [reforged] 79.5% 81.9% 97.0% 94% 0.7 2.8s 50 100 100 98 100 100 96 100 90 96 68 4 32 68 100 98 98 100 100 96 96 94 98 70 0 34 30 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [reforged:keep-last] 79.8% 82.6% 96.7% 95% 0.6 2.8s 50 100 100 96 100 100 98 98 94 98 66 8 36 64 100 100 96 100 100 100 98 96 94 76 0 34 24 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [reforged:full] 81.0% 83.1% 97.5% 95% 0.7 3.0s 50 100 98 88 100 100 96 100 98 98 84 24 38 70 100 100 100 100 100 96 98 100 96 72 2 22 26 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [bare] 55.2% 79.8% 69.2% 100% 0.4 2.3s 50 58 70 14 100 92 96 100 0 72 48 4 16 66 58 84 10 100 92 96 98 0 88 60 0 12 2 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [bare:keep-last] 54.3% 78.8% 68.9% 100% 0.3 2.4s 50 50 78 4 100 92 98 100 0 68 50 10 12 64 64 72 10 100 92 92 96 0 84 62 0 12 2 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [bare:full] 54.6% 78.7% 69.4% 100% 0.4 2.4s 50 58 74 4 100 86 94 98 0 76 54 2 12 62 66 78 8 100 94 96 98 0 86 54 0 18 2 +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## Ministral-3-8B-Instruct-2512-Q8_0 (llamaserver/native) + +``` +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
+Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Ministral-3-8B-Instruct-2512-Q8_0 LS/N [reforged] 81.0% 81.0% 100.0% 100% 0.3 4.1s 50 100 100 100 100 100 100 100 100 100 30 0 4 100 100 100 100 100 100 100 100 100 100 68 0 4 100 +Ministral-3-8B-Instruct-2512-Q8_0 LS/N [bare] 33.0% 62.2% 53.1% 100% 0.0 4.4s 50 0 100 0 0 100 100 0 0 68 0 2 2 100 0 100 0 0 100 100 0 0 78 2 4 2 0 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Ministral-3-14B-Instruct-2512-Q4_K_M (llamaserver/prompt) @@ -207,31 +254,35 @@ Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [bare] 54.9% 79.3% 69.2% ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Ministral-3-14B-Instruct-2512-Q4_K_M LS/P [reforged] 80.2% 80.2% 100.0% 100% 0.0 2.9s 50 100 100 100 100 100 100 100 100 100 
0 0 36 100 100 100 100 100 100 100 100 100 100 0 0 50 100 -Ministral-3-14B-Instruct-2512-Q4_K_M LS/P [bare] 66.5% 80.0% 83.1% 100% 0.0 2.4s 50 100 100 100 100 100 100 100 0 100 0 0 0 100 100 100 100 100 100 100 100 0 96 0 0 0 32 +Ministral-3-14B-Instruct-2512-Q4_K_M LS/P [reforged] 80.6% 80.6% 100.0% 100% 0.0 3.0s 50 100 100 100 100 100 100 100 100 100 0 0 46 100 100 100 100 100 100 100 100 100 98 0 0 52 100 +Ministral-3-14B-Instruct-2512-Q4_K_M LS/P [bare] 66.5% 79.9% 83.2% 100% 0.0 2.4s 50 100 100 100 100 100 100 100 0 100 0 0 0 100 100 100 100 100 100 100 100 0 96 0 0 0 32 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## Ministral-3-14B-Reasoning-2512-Q4_K_M (llamaserver/prompt) +## gemma-4-E4B-it-Q4_K_M (llamaserver/native) ``` ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [reforged] 79.5% 80.5% 98.7% 96% 0.5 3.7s 50 100 100 100 100 100 82 100 100 78 30 6 58 92 100 100 100 100 100 74 100 100 80 20 6 56 84 -Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [bare] 65.2% 78.1% 83.5% 100% 0.3 3.4s 50 100 98 100 100 98 74 92 0 62 30 0 56 70 100 100 100 100 96 78 96 0 72 
14 0 58 0 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +gemma-4-E4B-it-Q4_K_M LS/N [reforged] 79.7% 79.9% 99.8% 100% 0.3 8.1s 50 100 100 100 100 100 94 98 100 84 8 30 64 92 100 100 100 100 100 88 94 100 86 2 0 48 84 +gemma-4-E4B-it-Q4_K_M LS/N [reforged:keep-last] 78.7% 81.4% 96.7% 99% 0.5 9.3s 50 100 100 100 100 100 96 92 96 96 6 20 64 62 100 100 100 100 100 88 88 100 98 2 0 82 56 +gemma-4-E4B-it-Q4_K_M LS/N [reforged:full] 79.2% 82.2% 96.3% 99% 0.5 10.0s 50 100 100 100 100 100 96 88 94 100 2 40 78 54 100 100 100 100 100 94 84 98 96 0 0 82 52 +gemma-4-E4B-it-Q4_K_M LS/N [bare] 67.5% 73.5% 91.8% 100% 0.3 7.5s 50 100 100 100 100 100 88 96 0 90 2 28 44 70 100 100 100 100 100 82 100 0 88 4 0 62 0 +gemma-4-E4B-it-Q4_K_M LS/N [bare:keep-last] 66.6% 75.6% 88.1% 100% 0.2 9.3s 50 100 100 68 100 100 86 90 0 96 0 24 78 84 98 100 66 100 100 76 88 0 96 4 0 78 0 +gemma-4-E4B-it-Q4_K_M LS/N [bare:full] 67.9% 77.5% 87.7% 100% 0.2 9.5s 50 96 98 78 100 100 94 84 0 92 2 24 80 80 98 100 78 100 100 92 92 0 94 0 0 84 0 
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## qwen3:14b-q4_K_M (ollama/native) ``` --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -qwen3:14b-q4_K_M OL/N [reforged] 78.6% 78.7% 99.9% 77% 1.2 38.5s 50 100 100 100 100 100 100 100 100 74 4 12 68 78 100 100 100 100 100 100 100 94 88 4 0 54 68 -qwen3:14b-q4_K_M OL/N [bare] 46.5% 61.3% 75.8% 87% 0.7 34.7s 50 90 92 4 4 86 100 100 0 78 6 4 48 6 86 92 2 26 84 68 100 0 86 0 4 42 0 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +qwen3:14b-q4_K_M OL/N [reforged:full]² 78.6% 78.7% 99.9% 77% 1.2 38.5s 50 100 100 100 100 100 100 100 100 74 4 12 68 78 100 100 100 100 100 100 100 94 88 4 0 54 68 +qwen3:14b-q4_K_M OL/N [bare:full]² 46.5% 61.3% 75.8% 87% 0.7 34.7s 50 90 92 4 4 86 100 100 0 78 6 4 48 6 86 92 2 26 84 68 100 0 86 0 4 42 0 +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Ministral-3-8B-Instruct-2512-Q4_K_M (llamaserver/native) @@ -240,31 +291,20 @@ qwen3:14b-q4_K_M OL/N [bare] 46.5% 61.3% 75.8% 87% 0.7 34.7s --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Ministral-3-8B-Instruct-2512-Q4_K_M LS/N [reforged] 78.3% 78.4% 99.8% 95% 0.4 3.2s 50 100 100 100 100 100 100 100 98 100 22 0 0 100 100 100 100 100 100 100 100 98 100 14 2 2 100 -Ministral-3-8B-Instruct-2512-Q4_K_M LS/N [bare] 27.3% 61.4% 44.5% 100% 0.0 3.8s 50 0 100 0 0 100 100 0 0 8 0 0 0 100 0 100 0 0 100 100 0 0 2 0 0 0 0 
+Ministral-3-8B-Instruct-2512-Q4_K_M LS/N [reforged] 78.3% 78.3% 100.0% 95% 0.4 3.2s 50 100 100 100 100 100 100 100 100 100 18 0 0 100 100 100 100 100 100 100 100 100 100 16 2 0 100 +Ministral-3-8B-Instruct-2512-Q4_K_M LS/N [bare] 27.3% 62.8% 43.5% 100% 0.0 3.8s 50 0 100 0 0 100 100 0 0 4 0 0 2 100 0 100 0 0 100 100 0 0 4 0 0 0 0 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## gemma-4-E4B-it-Q4_K_M (llamaserver/native) - -``` -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -gemma-4-E4B-it-Q4_K_M LS/N [reforged] 78.2% 82.2% 95.1% 98% 0.5 9.0s 50 100 100 100 100 100 92 98 98 90 0 24 80 50 100 100 100 100 100 94 90 94 98 0 0 84 40 -gemma-4-E4B-it-Q4_K_M LS/N [bare] 67.8% 77.0% 88.1% 100% 0.2 9.4s 50 96 100 82 100 100 96 86 0 94 0 18 80 78 96 100 86 100 96 92 88 0 94 0 0 82 0 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -``` - ## Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M (llamaserver/prompt) ``` 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M LS/P [reforged] 78.2% 84.3% 92.7% 78% 1.1 3.6s 50 100 100 100 100 100 100 100 100 28 0 0 100 94 100 100 100 100 100 100 100 100 34 0 0 100 76 -Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M LS/P [bare] 65.7% 81.1% 81.0% 80% 0.9 3.4s 50 100 100 96 100 98 100 100 0 32 0 0 100 46 100 100 98 100 98 100 100 0 40 0 0 100 0 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M LS/P [reforged:full]² 78.2% 84.3% 92.7% 78% 1.1 3.6s 50 100 100 100 100 100 100 100 100 28 0 0 100 94 100 100 100 100 100 100 100 100 34 0 0 100 76 +Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M LS/P [bare:full]² 65.7% 81.1% 81.0% 80% 0.9 3.4s 50 100 100 96 100 98 100 100 0 32 0 0 100 46 100 100 98 100 98 100 100 0 40 0 0 100 0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Ministral-3-14B-Instruct-2512-Q4_K_M (llamaserver/native) @@ -273,20 +313,35 @@ Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M LS/P [bare] 65.7% 81.1% ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Ministral-3-14B-Instruct-2512-Q4_K_M LS/N [reforged] 78.1% 78.1% 100.0% 97% 0.3 4.0s 50 100 100 100 100 100 100 100 100 100 16 0 0 100 100 100 100 100 100 100 100 100 100 14 0 0 100 -Ministral-3-14B-Instruct-2512-Q4_K_M LS/N 
[bare] 27.5% 80.8% 34.1% 100% 0.0 3.0s 50 0 100 0 0 100 100 0 0 10 0 0 0 100 0 100 0 0 100 98 0 0 8 0 0 0 0 +Ministral-3-14B-Instruct-2512-Q4_K_M LS/N [reforged] 77.8% 77.8% 100.0% 97% 0.3 4.0s 50 100 100 100 100 100 100 100 100 100 8 0 0 100 100 100 100 100 100 100 100 100 100 14 0 0 100 +Ministral-3-14B-Instruct-2512-Q4_K_M LS/N [bare] 27.6% 82.2% 33.6% 100% 0.0 3.0s 50 0 100 0 0 100 100 0 0 12 0 0 0 100 0 100 0 0 100 98 0 0 8 0 0 0 0 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## gemma-4-E4B-it-Q8_0 (llamaserver/native) ``` ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -gemma-4-E4B-it-Q8_0 LS/N [reforged] 76.2% 80.7% 94.5% 98% 0.6 12.8s 50 100 100 100 100 100 84 88 90 96 2 14 80 44 100 100 100 100 100 88 90 96 94 4 0 80 32 -gemma-4-E4B-it-Q8_0 LS/N [bare] 67.7% 76.9% 88.1% 100% 0.3 13.8s 50 100 100 78 100 98 92 90 0 98 4 10 84 92 98 94 84 100 100 84 88 0 84 4 0 78 0 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +gemma-4-E4B-it-Q8_0 LS/N [reforged] 77.8% 77.9% 99.8% 100% 0.2 10.8s 50 100 100 100 100 100 76 90 100 100 0 18 38 98 100 100 100 100 100 76 98 100 98 0 0 36 94 +gemma-4-E4B-it-Q8_0 LS/N [reforged:keep-last] 75.7% 79.3% 95.5% 99% 0.5 13.1s 50 100 100 100 100 100 76 92 98 96 4 16 82 48 100 100 100 100 100 50 86 98 94 2 0 86 40 +gemma-4-E4B-it-Q8_0 LS/N [reforged:full] 75.6% 80.8% 93.6% 98% 0.6 12.5s 50 100 100 100 100 100 92 94 92 88 0 18 82 28 100 100 100 100 100 76 92 94 98 2 0 84 26 +gemma-4-E4B-it-Q8_0 LS/N [bare] 66.2% 72.2% 91.7% 100% 0.2 10.1s 50 98 100 100 100 100 82 94 0 94 0 30 48 86 100 100 100 100 100 74 96 0 94 0 0 26 0 +gemma-4-E4B-it-Q8_0 LS/N [bare:keep-last] 67.0% 75.4% 88.8% 100% 0.2 14.0s 50 96 100 72 98 100 74 96 0 96 0 28 78 94 98 100 74 100 98 74 92 0 90 0 0 84 0 +gemma-4-E4B-it-Q8_0 LS/N [bare:full] 65.5% 75.3% 87.0% 100% 0.3 14.4s 50 96 96 76 100 100 88 86 0 80 0 14 84 80 92 100 74 100 100 76 88 0 88 0 0 86 0 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +## phi-4-Q4_K_M (llamaserver/prompt) + +``` 
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +phi-4-Q4_K_M LS/P [reforged] 75.3% 75.4% 99.8% 83% 0.9 4.2s 50 100 100 100 100 100 26 62 94 96 62 34 66 70 100 100 100 100 100 28 84 98 94 42 0 60 42 +phi-4-Q4_K_M LS/P [bare] 57.9% 68.0% 85.2% 91% 0.5 3.4s 50 100 100 100 98 100 22 58 0 80 28 20 16 54 100 100 100 98 100 14 68 0 82 36 2 28 2 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Ministral-3-8B-Instruct-2512-Q4_K_M (llamaserver/prompt) @@ -295,229 +350,239 @@ gemma-4-E4B-it-Q8_0 LS/N [bare] 67.7% 76.9% 88.1% 100% 0.3 13. 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Ministral-3-8B-Instruct-2512-Q4_K_M LS/P [reforged] 75.6% 83.8% 90.2% 79% 1.3 3.0s 50 98 100 0 100 100 100 100 100 100 22 12 56 100 100 100 0 100 100 100 100 100 100 28 0 50 100 -Ministral-3-8B-Instruct-2512-Q4_K_M LS/P [bare] 45.8% 81.4% 56.3% 94% 0.8 2.6s 50 0 100 0 100 100 100 0 0 98 0 16 26 100 0 100 0 100 100 100 0 0 100 0 0 52 0 +Ministral-3-8B-Instruct-2512-Q4_K_M LS/P [reforged] 74.9% 83.3% 89.9% 79% 1.3 3.1s 50 100 100 0 100 100 100 100 100 100 20 2 42 100 100 100 0 100 100 100 100 100 100 22 0 62 100 +Ministral-3-8B-Instruct-2512-Q4_K_M LS/P [bare] 45.8% 81.5% 56.2% 94% 0.7 2.6s 50 0 100 0 100 100 100 2 0 98 0 16 36 100 0 100 0 100 100 100 4 0 98 0 0 38 0 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## ministral-3:14b-instruct-2512-q4_K_M (ollama/native) ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N 
rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -ministral-3:14b-instruct-2512-q4_K_M OL/N [reforged] 74.8% 74.8% 100.0% 81% 1.0 6.2s 50 100 100 100 100 100 100 100 100 96 56 0 0 80 100 100 100 100 100 100 100 100 98 0 0 0 16 -ministral-3:14b-instruct-2512-q4_K_M OL/N [bare] 32.0% 84.4% 37.9% 100% 0.1 3.3s 50 100 100 100 0 100 0 0 0 0 0 0 0 30 100 100 100 0 100 0 0 0 2 0 0 0 0 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +ministral-3:14b-instruct-2512-q4_K_M OL/N [reforged:full]² 74.8% 74.8% 100.0% 81% 1.0 6.2s 50 100 100 100 100 100 100 100 100 96 56 0 0 80 100 100 100 100 100 100 100 100 98 0 0 0 16 +ministral-3:14b-instruct-2512-q4_K_M OL/N [bare:full]² 32.0% 84.4% 37.9% 100% 0.1 3.3s 50 100 100 100 0 100 0 0 0 
0 0 0 0 30 100 100 100 0 100 0 0 0 2 0 0 0 0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## gemma4:e4b-it-q4_K_M (ollama/native) ``` ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -gemma4:e4b-it-q4_K_M OL/N [reforged] 74.8% 75.0% 99.8% 83% 0.8 11.3s 50 100 100 100 100 100 94 100 100 92 0 0 44 66 100 100 100 100 100 90 100 100 78 0 0 40 42 -gemma4:e4b-it-q4_K_M OL/N [bare] 62.3% 71.7% 86.9% 90% 0.4 8.7s 50 98 100 88 100 98 96 100 0 82 0 2 50 24 96 100 92 100 100 84 98 0 72 2 0 38 0 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s 
grs_s iar_s +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +gemma4:e4b-it-q4_K_M OL/N [reforged:full]² 74.8% 75.0% 99.8% 83% 0.8 11.3s 50 100 100 100 100 100 94 100 100 92 0 0 44 66 100 100 100 100 100 90 100 100 78 0 0 40 42 +gemma4:e4b-it-q4_K_M OL/N [bare:full]² 62.3% 71.7% 86.9% 90% 0.4 8.7s 50 98 100 88 100 98 96 100 0 82 0 2 50 24 96 100 92 100 100 84 98 0 72 2 0 38 0 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ ``` ## gemma-4-E4B-it-Q8_0 (llamaserver/prompt) ``` ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -gemma-4-E4B-it-Q8_0 LS/P [reforged] 74.7% 74.7% 100.0% 85% 0.6 12.7s 50 100 100 100 100 100 70 100 90 88 0 16 34 94 100 100 100 100 100 48 100 98 84 0 0 30 90 -gemma-4-E4B-it-Q8_0 LS/P [bare] 61.9% 67.1% 92.2% 94% 0.3 12.2s 50 100 100 100 100 100 58 100 0 90 0 24 20 98 100 100 100 100 100 4 100 0 90 0 0 26 0 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +gemma-4-E4B-it-Q8_0 LS/P [reforged] 73.2% 73.3% 99.8% 85% 0.6 13.3s 50 100 100 100 100 100 54 98 94 80 0 28 20 94 100 100 100 100 100 40 100 98 90 0 0 12 96 +gemma-4-E4B-it-Q8_0 LS/P [reforged:keep-last] 74.1% 74.2% 99.8% 85% 0.6 13.4s 50 100 100 100 100 100 48 100 96 84 0 22 28 94 100 100 100 100 100 52 100 98 90 0 0 18 96 +gemma-4-E4B-it-Q8_0 LS/P [reforged:full] 73.7% 73.7% 100.0% 86% 0.6 12.7s 50 100 100 100 100 100 48 100 92 90 0 36 28 88 100 100 100 100 100 48 94 98 82 0 0 20 92 +gemma-4-E4B-it-Q8_0 LS/P [bare] 61.2% 66.3% 92.3% 94% 0.3 12.1s 50 100 100 100 100 100 64 96 0 84 0 22 22 94 100 100 100 100 100 0 100 0 80 0 0 30 0 +gemma-4-E4B-it-Q8_0 LS/P [bare:keep-last] 61.2% 66.4% 92.2% 95% 0.3 12.5s 50 100 100 100 100 100 68 100 0 84 0 22 20 90 100 100 100 100 100 2 100 0 82 0 0 24 0 +gemma-4-E4B-it-Q8_0 LS/P [bare:full] 60.5% 65.7% 92.1% 94% 0.3 12.6s 50 100 100 100 100 100 60 98 0 84 0 14 24 90 100 100 100 100 100 2 96 0 88 0 0 18 0 
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## gemma4:e4b-it-q8_0 (ollama/native) ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -gemma4:e4b-it-q8_0 OL/N [reforged] 73.6% 73.8% 99.8% 85% 0.8 12.8s 50 100 100 100 100 100 78 98 100 100 0 8 34 60 100 100 100 100 100 78 94 100 96 0 0 34 34 -gemma4:e4b-it-q8_0 OL/N [bare] 62.5% 69.9% 89.3% 89% 0.5 12.0s 50 90 100 100 100 100 86 92 0 72 0 2 48 50 100 100 98 100 100 66 94 0 84 0 0 42 0 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -``` - -## Qwen3-8B-Q8_0 (llamaserver/prompt) - -``` ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Qwen3-8B-Q8_0 LS/P [reforged] 73.1% 73.2% 99.8% 89% 0.4 28.4s 50 100 100 100 100 100 100 100 58 96 0 8 28 94 100 100 100 100 96 100 98 64 88 0 0 12 58 -Qwen3-8B-Q8_0 LS/P [bare] 64.5% 70.4% 91.6% 96% 0.2 27.4s 50 100 100 92 100 100 96 100 0 86 0 4 20 100 100 100 100 98 98 76 98 0 88 0 0 20 0 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -``` - -## phi-4-Q4_K_M (llamaserver/prompt) - -``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -phi-4-Q4_K_M LS/P [reforged] 72.9% 73.3% 99.5% 85% 0.9 4.1s 50 100 100 100 100 100 34 56 96 90 52 24 38 70 100 100 100 100 100 24 62 92 98 52 0 60 48 -phi-4-Q4_K_M LS/P [bare] 59.2% 69.4% 85.3% 92% 0.5 3.3s 50 100 100 100 100 100 22 68 0 88 44 28 24 32 100 100 100 100 100 10 76 0 92 38 0 18 0 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +gemma4:e4b-it-q8_0 OL/N [reforged:full]² 73.6% 73.8% 99.8% 85% 0.8 12.8s 50 100 100 100 100 100 78 98 100 100 0 8 34 60 100 100 100 100 100 78 94 100 96 0 0 34 34 +gemma4:e4b-it-q8_0 OL/N [bare:full]² 62.5% 69.9% 89.3% 89% 0.5 12.0s 50 90 100 100 100 100 86 92 0 72 0 2 48 50 100 100 98 100 100 66 94 0 84 0 0 42 0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## gemma-4-E4B-it-Q4_K_M (llamaserver/prompt) ``` -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -gemma-4-E4B-it-Q4_K_M LS/P [reforged] 72.8% 72.8% 100.0% 85% 0.6 8.5s 50 100 100 100 100 100 38 98 94 94 0 18 26 96 100 100 100 100 100 26 98 100 92 0 2 22 90 -gemma-4-E4B-it-Q4_K_M LS/P [bare] 60.6% 65.8% 92.1% 92% 0.5 8.4s 50 100 100 100 100 100 54 96 0 90 0 14 22 92 100 100 100 100 100 12 100 0 76 0 0 20 0 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +gemma-4-E4B-it-Q4_K_M LS/P [reforged] 72.4% 72.4% 99.9% 85% 0.6 8.9s 50 100 100 100 100 100 56 96 100 86 0 14 28 84 100 100 100 100 100 34 100 96 74 0 0 22 92 +gemma-4-E4B-it-Q4_K_M LS/P [reforged:keep-last] 73.2% 73.2% 100.0% 85% 0.6 8.5s 50 100 100 100 100 100 40 98 96 92 0 10 24 96 100 100 100 100 100 38 100 100 86 0 0 32 92 +gemma-4-E4B-it-Q4_K_M LS/P [reforged:full] 72.9% 72.9% 100.0% 85% 0.6 8.5s 50 100 100 100 100 100 54 98 98 86 0 10 28 88 100 100 100 100 100 26 98 96 94 0 0 30 90 
+gemma-4-E4B-it-Q4_K_M LS/P [bare] 62.5% 67.8% 92.2% 91% 0.4 9.0s 50 100 100 100 100 100 56 100 0 88 0 22 30 100 100 100 100 100 100 10 100 0 88 0 0 30 0 +gemma-4-E4B-it-Q4_K_M LS/P [bare:keep-last] 59.9% 65.0% 92.2% 92% 0.4 8.7s 50 100 100 100 100 100 44 98 0 90 0 4 22 92 100 100 100 100 100 12 92 0 92 0 0 12 0 +gemma-4-E4B-it-Q4_K_M LS/P [bare:full] 61.5% 66.7% 92.2% 91% 0.4 9.1s 50 100 100 100 100 100 46 94 0 86 0 14 28 88 100 100 100 100 100 22 96 0 94 0 0 30 0 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## Nemotron-3-Nano-30B-A3B-Q4_K_M (llamaserver/native) +## Qwen3-8B-Q8_0 (llamaserver/prompt) ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Nemotron-3-Nano-30B-A3B-Q4_K_M LS/N [reforged] 71.3% 81.0% 88.0% 72% 1.5 21.4s 50 100 100 100 100 100 66 98 52 92 28 4 34 34 100 100 100 98 100 86 92 68 98 24 8 34 38 -Nemotron-3-Nano-30B-A3B-Q4_K_M LS/N [bare] 37.5% 85.7% 43.7% 95% 0.2 6.6s 50 32 100 88 74 86 70 6 0 20 10 0 14 4 14 100 94 72 92 56 0 0 18 14 0 10 0 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3-8B-Q8_0 LS/P [reforged] 72.0% 72.3% 99.6% 88% 0.4 28.6s 50 100 100 100 100 100 96 100 56 90 0 2 10 96 100 100 100 100 96 98 100 58 88 0 0 20 62 +Qwen3-8B-Q8_0 LS/P [reforged:keep-last] 72.8% 73.0% 99.7% 89% 0.4 28.0s 50 100 100 100 100 100 98 98 80 90 0 6 8 94 100 100 100 100 96 96 100 60 98 0 2 12 54 +Qwen3-8B-Q8_0 LS/P [reforged:full] 72.8% 72.9% 99.8% 88% 0.4 28.9s 50 100 100 100 100 100 98 100 70 90 0 4 20 96 100 100 100 100 92 100 96 66 92 0 0 12 56 +Qwen3-8B-Q8_0 LS/P [bare] 63.5% 69.6% 91.3% 97% 0.2 27.0s 50 100 100 96 98 100 98 96 0 90 0 8 18 92 100 100 94 100 100 64 100 0 84 0 2 12 0 +Qwen3-8B-Q8_0 LS/P [bare:keep-last] 63.4% 69.7% 90.9% 96% 0.2 28.5s 50 100 100 94 100 100 98 98 0 88 0 4 16 92 100 100 88 100 94 72 98 0 96 0 0 10 0 +Qwen3-8B-Q8_0 LS/P [bare:full] 63.9% 69.9% 91.5% 96% 0.2 27.3s 50 100 100 94 100 100 96 98 0 88 0 2 14 90 100 100 98 100 100 68 98 0 96 0 0 20 0 
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M (llamaserver/native) +## Qwen3-8B-Q4_K_M (llamaserver/prompt) ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M LS/N [reforged] 71.0% 71.2% 99.8% 96% 0.5 6.5s 50 100 100 100 100 98 58 100 100 68 4 12 20 90 100 100 100 100 100 50 100 100 4 22 0 20 100 -Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M LS/N [bare] 42.5% 67.4% 63.1% 100% 0.0 6.1s 50 0 100 100 100 100 68 10 0 4 4 12 12 98 0 100 100 100 100 56 20 0 0 0 2 20 0 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel 
arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3-8B-Q4_K_M LS/P [reforged] 71.1% 71.2% 99.8% 87% 0.5 18.0s 50 100 100 100 100 100 96 100 70 66 0 18 8 96 100 100 100 100 98 96 100 60 60 0 0 10 70 +Qwen3-8B-Q4_K_M LS/P [reforged:keep-last] 72.2% 72.3% 99.9% 88% 0.5 17.9s 50 100 100 100 100 100 100 100 62 66 0 30 8 90 100 100 100 100 100 98 100 74 68 0 0 8 74 +Qwen3-8B-Q4_K_M LS/P [reforged:full] 70.5% 70.8% 99.6% 88% 0.4 17.4s 50 100 100 100 100 100 88 100 58 66 0 24 10 88 100 100 100 100 100 94 98 62 66 0 0 4 76 +Qwen3-8B-Q4_K_M LS/P [bare] 57.4% 66.6% 86.2% 97% 0.2 17.2s 50 98 100 56 96 100 90 100 0 62 0 22 6 94 100 100 48 94 100 56 100 0 64 0 0 4 2 +Qwen3-8B-Q4_K_M LS/P [bare:keep-last] 59.1% 68.0% 86.9% 98% 0.2 16.6s 50 98 100 56 98 100 94 100 0 58 0 22 8 94 100 100 68 94 98 72 100 0 72 0 0 2 2 +Qwen3-8B-Q4_K_M LS/P [bare:full] 57.8% 66.9% 86.5% 97% 0.2 17.5s 50 100 100 66 94 100 96 100 0 54 0 24 8 96 100 100 58 96 98 68 100 0 44 0 0 2 0 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## ministral-3:8b-instruct-2512-q8_0 (ollama/native) +## Qwen3-14B-Q4_K_M (llamaserver/prompt) ``` -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs 
iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -ministral-3:8b-instruct-2512-q8_0 OL/N [reforged] 70.7% 74.5% 94.9% 74% 1.1 5.9s 50 100 100 100 100 100 100 100 100 92 12 42 0 42 100 100 100 100 100 100 100 100 26 4 8 0 12 -ministral-3:8b-instruct-2512-q8_0 OL/N [bare] 17.8% 49.6% 36.0% 93% 0.4 6.8s 50 0 0 0 0 68 100 0 0 14 0 20 0 96 0 0 0 0 64 100 0 0 0 0 2 0 0 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Qwen3-14B-Q4_K_M LS/P [reforged] 71.4% 71.4% 99.9% 86% 0.5 25.6s 50 100 100 100 100 100 98 100 70 70 0 0 30 80 100 100 100 100 100 92 100 76 56 0 0 22 62 +Qwen3-14B-Q4_K_M LS/P [reforged:keep-last] 71.8% 71.8% 100.0% 86% 0.5 23.8s 50 100 100 100 100 100 98 100 72 58 2 4 28 72 100 100 100 100 100 94 100 76 74 0 0 32 56 +Qwen3-14B-Q4_K_M LS/P [reforged:full] 71.8% 71.9% 99.8% 87% 0.5 24.3s 50 100 
100 100 100 100 96 100 72 72 2 0 30 74 100 100 100 100 100 92 100 74 68 0 0 38 48 +Qwen3-14B-Q4_K_M LS/P [bare] 54.1% 63.2% 85.6% 95% 0.2 22.3s 50 100 100 14 100 100 92 100 0 64 0 0 34 46 100 100 20 88 98 50 100 0 68 0 0 16 16 +Qwen3-14B-Q4_K_M LS/P [bare:keep-last] 53.8% 63.2% 85.2% 95% 0.2 23.1s 50 100 100 20 100 100 94 100 0 60 0 2 30 56 100 100 18 88 98 38 100 0 62 0 0 24 10 +Qwen3-14B-Q4_K_M LS/P [bare:full] 53.9% 62.9% 85.8% 94% 0.2 24.4s 50 100 100 22 100 100 90 100 0 66 0 0 30 56 100 100 14 76 100 30 100 0 70 0 0 36 12 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ ``` -## Qwen3-14B-Q4_K_M (llamaserver/prompt) +## Nemotron-3-Nano-30B-A3B-Q4_K_M (llamaserver/native) ``` --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Qwen3-14B-Q4_K_M LS/P [reforged] 70.5% 70.8% 99.7% 86% 0.5 24.2s 50 100 100 100 100 100 94 100 64 68 0 0 32 72 100 100 100 100 98 94 100 58 66 0 0 30 58 -Qwen3-14B-Q4_K_M LS/P [bare] 53.5% 62.7% 85.3% 96% 0.2 22.5s 50 100 100 18 100 100 92 100 0 60 0 0 34 50 100 100 16 84 100 56 100 0 62 0 2 10 6 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Nemotron-3-Nano-30B-A3B-Q4_K_M LS/N [reforged:full]² 71.3% 81.0% 88.0% 72% 1.5 21.4s 50 100 100 100 100 100 66 98 52 92 28 4 34 34 100 100 100 98 100 86 92 68 98 24 8 34 38 +Nemotron-3-Nano-30B-A3B-Q4_K_M LS/N [bare:full]² 37.5% 85.7% 43.7% 95% 0.2 6.6s 50 32 100 88 74 86 70 6 0 20 10 0 14 4 14 100 94 72 92 56 0 0 18 14 0 10 0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## Qwen3-8B-Q4_K_M (llamaserver/prompt) +## Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M (llamaserver/native) ``` -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s 
arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Qwen3-8B-Q4_K_M LS/P [reforged] 70.4% 70.7% 99.6% 86% 0.5 17.8s 50 100 100 100 100 100 94 100 56 64 0 14 12 92 100 100 100 100 100 94 100 62 58 0 0 6 78 -Qwen3-8B-Q4_K_M LS/P [bare] 57.7% 66.7% 86.5% 97% 0.2 17.3s 50 100 100 50 98 100 92 100 0 50 0 14 12 92 100 100 56 94 100 76 100 0 62 0 0 4 0 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M LS/N [reforged:full]² 71.0% 71.2% 99.8% 96% 0.5 6.5s 50 100 100 100 100 98 58 100 100 68 4 12 20 90 100 100 100 100 100 50 100 100 4 22 0 20 100 +Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M LS/N [bare:full]² 42.5% 67.4% 63.1% 100% 0.0 6.1s 50 0 100 100 100 100 68 10 0 4 4 12 12 98 0 100 100 100 100 56 20 0 0 0 2 20 0 
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## Qwen3-8B-Q8_0 (llamaserver/native) +## ministral-3:8b-instruct-2512-q8_0 (ollama/native) ``` ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Qwen3-8B-Q8_0 LS/N [reforged] 70.3% 70.5% 99.7% 88% 0.6 24.1s 50 100 100 100 100 100 100 100 60 82 4 22 20 32 100 100 100 100 98 94 100 58 66 2 12 28 50 -Qwen3-8B-Q8_0 LS/N [bare] 46.6% 64.0% 72.8% 100% 0.1 20.5s 50 80 76 0 86 94 94 82 0 74 0 14 16 4 88 70 0 88 84 94 76 0 66 0 4 18 4 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +ministral-3:8b-instruct-2512-q8_0 OL/N [reforged:full]² 70.7% 74.5% 94.9% 74% 1.1 5.9s 50 100 100 100 100 100 100 100 100 92 12 42 0 42 100 100 100 100 100 100 100 100 26 4 8 0 12 +ministral-3:8b-instruct-2512-q8_0 OL/N [bare:full]² 17.8% 49.6% 36.0% 93% 0.4 6.8s 50 0 0 0 0 68 100 0 0 14 0 20 0 96 0 0 0 0 64 100 0 0 0 0 2 0 0 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Nemotron-3-Nano-30B-A3B-Q4_K_M (llamaserver/prompt) ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Nemotron-3-Nano-30B-A3B-Q4_K_M LS/P [reforged] 70.2% 70.7% 99.4% 89% 0.4 10.8s 50 100 100 100 100 98 52 100 84 90 6 4 0 100 100 100 100 100 100 42 100 80 92 6 2 4 66 -Nemotron-3-Nano-30B-A3B-Q4_K_M LS/P [bare] 58.6% 65.1% 90.1% 100% 0.1 10.0s 50 100 100 100 98 98 48 100 0 88 2 4 0 98 100 100 100 94 96 6 100 0 88 2 0 2 0 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Nemotron-3-Nano-30B-A3B-Q4_K_M LS/P [reforged:full]² 70.2% 70.7% 99.4% 89% 0.4 10.8s 50 100 100 100 100 98 52 100 84 90 6 4 0 100 100 100 100 100 100 42 100 80 92 6 2 4 66 +Nemotron-3-Nano-30B-A3B-Q4_K_M LS/P [bare:full]² 58.6% 65.1% 90.1% 100% 0.1 10.0s 50 100 100 100 98 98 48 100 0 88 2 4 0 98 100 100 100 94 96 6 100 0 88 2 0 2 0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## granite4.1:8b-q8_0 (ollama/native) +## Qwen3-8B-Q8_0 (llamaserver/native) ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s 
b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -granite4.1:8b-q8_0 OL/N [reforged] 69.2% 69.2% 100.0% 83% 1.1 2.9s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 100 100 0 0 0 0 -granite4.1:8b-q8_0 OL/N [bare] 46.2% 60.0% 76.9% 95% 0.7 3.1s 50 0 100 0 100 100 100 100 0 100 0 0 0 0 0 100 0 100 100 100 100 0 100 0 0 0 0 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3-8B-Q8_0 LS/N [reforged] 68.2% 68.5% 99.5% 95% 0.3 24.8s 50 100 100 100 100 100 100 100 56 78 6 24 28 6 98 100 100 100 100 100 100 52 84 6 2 30 2 +Qwen3-8B-Q8_0 LS/N [reforged:keep-last] 67.0% 67.2% 99.8% 92% 0.4 23.2s 50 100 100 100 100 100 94 98 48 84 2 22 10 12 100 100 100 100 100 96 100 52 80 0 18 20 6 +Qwen3-8B-Q8_0 LS/N [reforged:full] 69.3% 69.6% 99.6% 88% 0.6 24.7s 50 98 100 100 100 100 94 100 48 76 2 28 24 46 100 100 100 100 100 98 100 46 76 2 8 16 40 
+Qwen3-8B-Q8_0 LS/N [bare] 50.4% 63.4% 79.5% 100% 0.1 23.1s 50 88 100 2 62 98 100 98 0 70 2 22 40 2 92 100 4 46 88 94 98 0 70 2 4 28 0 +Qwen3-8B-Q8_0 LS/N [bare:keep-last] 49.7% 65.4% 76.0% 100% 0.1 21.6s 50 86 78 2 88 88 100 80 0 64 2 18 18 8 90 88 2 84 94 96 86 0 92 0 12 16 0 +Qwen3-8B-Q8_0 LS/N [bare:full] 46.8% 63.0% 74.4% 100% 0.1 20.9s 50 84 70 0 84 96 96 82 0 62 0 12 14 8 88 70 2 98 88 84 86 0 74 0 4 16 0 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## Qwen3-8B-Q4_K_M (llamaserver/native) +## granite4.1:8b-q8_0 (ollama/native) ``` -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Qwen3-8B-Q4_K_M LS/N [reforged] 68.2% 68.4% 99.6% 86% 0.7 16.1s 50 98 100 100 100 100 92 100 48 78 0 44 8 38 100 100 100 100 100 90 100 40 76 0 8 14 38 -Qwen3-8B-Q4_K_M LS/N [bare] 44.6% 63.0% 70.8% 100% 0.1 13.8s 50 90 74 2 88 74 90 86 0 60 2 16 16 6 88 82 6 90 72 76 70 0 60 0 8 4 0 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +granite4.1:8b-q8_0 OL/N [reforged:full]² 69.2% 69.2% 100.0% 83% 1.1 2.9s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 100 100 0 0 0 0 +granite4.1:8b-q8_0 OL/N [bare:full]² 46.2% 60.0% 76.9% 95% 0.7 3.1s 50 0 100 0 100 100 100 100 0 100 0 0 0 0 0 100 0 100 100 100 100 0 100 0 0 0 0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## Qwen3-14B-Q4_K_M (llamaserver/native) ``` --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Qwen3-14B-Q4_K_M LS/N [reforged] 67.7% 67.7% 99.9% 
85% 0.9 20.8s 50 100 100 100 100 100 94 100 62 36 4 22 44 22 100 100 100 100 98 84 100 66 24 12 18 42 32 -Qwen3-14B-Q4_K_M LS/N [bare] 28.7% 50.1% 57.2% 100% 0.0 17.2s 50 100 4 0 24 46 62 30 0 20 6 10 44 0 100 12 8 54 54 68 34 0 20 14 4 32 0 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Qwen3-14B-Q4_K_M LS/N [reforged] 68.5% 68.5% 100.0% 100% 0.4 21.8s 50 100 100 100 100 98 100 100 56 72 14 4 58 4 100 100 100 100 100 100 100 42 72 10 0 52 0 +Qwen3-14B-Q4_K_M LS/N [reforged:keep-last] 64.0% 64.0% 99.9% 91% 0.6 20.3s 50 100 100 100 100 100 90 98 48 40 18 6 38 4 100 100 100 98 96 88 100 54 30 16 2 38 0 +Qwen3-14B-Q4_K_M LS/N [reforged:full] 68.4% 68.4% 99.9% 83% 0.9 21.9s 50 100 100 100 100 98 90 98 60 32 20 18 50 38 100 100 100 100 100 86 100 74 34 6 18 34 22 +Qwen3-14B-Q4_K_M LS/N [bare] 48.1% 60.0% 80.2% 100% 0.1 21.2s 50 100 96 0 38 66 100 100 0 62 8 0 48 2 100 98 0 42 56 94 100 0 72 12 0 56 0 +Qwen3-14B-Q4_K_M LS/N [bare:keep-last] 30.8% 52.1% 59.2% 100% 0.0 18.3s 50 100 28 6 24 60 72 34 0 12 16 4 34 0 100 12 8 46 52 60 46 0 24 28 6 30 0 +Qwen3-14B-Q4_K_M LS/N [bare:full] 27.5% 48.5% 56.8% 100% 0.0 
16.5s 50 96 16 4 24 48 62 36 0 22 4 6 28 0 100 18 6 52 62 60 18 0 14 4 4 32 0 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ ``` ## qwen3:8b-q8_0 (ollama/native) ``` ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -qwen3:8b-q8_0 OL/N [reforged] 67.5% 67.6% 99.9% 85% 0.6 31.0s 50 100 100 100 100 100 100 100 26 88 0 2 6 66 100 100 100 100 100 100 100 40 82 0 0 2 44 -qwen3:8b-q8_0 OL/N [bare] 47.3% 56.8% 83.3% 96% 0.1 24.0s 50 86 100 2 34 92 100 100 0 64 0 4 6 16 88 98 2 68 92 98 92 0 84 0 0 0 4 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +qwen3:8b-q8_0 OL/N [reforged:full]² 67.5% 67.6% 99.9% 85% 0.6 31.0s 50 100 100 100 100 100 100 100 26 88 0 2 6 66 100 100 100 100 100 100 100 40 82 0 0 2 44 +qwen3:8b-q8_0 OL/N [bare:full]² 47.3% 56.8% 83.3% 96% 0.1 24.0s 50 86 100 2 34 92 100 100 0 64 0 4 6 16 88 98 2 68 92 98 92 0 84 0 0 0 4 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## ministral-3:8b-instruct-2512-q4_K_M (ollama/native) +## Qwen3-8B-Q4_K_M (llamaserver/native) ``` ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -ministral-3:8b-instruct-2512-q4_K_M OL/N [reforged] 66.8% 71.9% 92.9% 68% 1.4 5.4s 50 100 100 100 100 76 100 100 68 90 0 0 4 64 100 100 100 100 28 100 100 80 98 0 0 6 22 -ministral-3:8b-instruct-2512-q4_K_M OL/N [bare] 14.2% 45.1% 31.4% 91% 0.2 5.2s 50 0 0 0 0 76 100 0 0 4 0 0 2 8 0 0 0 0 86 70 0 0 22 0 0 0 0 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3-8B-Q4_K_M LS/N [reforged] 67.3% 67.5% 99.7% 96% 0.3 15.6s 50 100 100 100 100 100 100 100 40 98 6 14 22 2 100 100 100 100 100 100 100 48 86 2 0 26 6 +Qwen3-8B-Q4_K_M LS/N [reforged:keep-last] 64.5% 64.6% 99.9% 91% 0.4 15.0s 50 100 100 100 100 100 100 100 30 82 0 28 12 10 100 100 100 98 100 96 100 22 86 2 2 6 4 +Qwen3-8B-Q4_K_M LS/N [reforged:full] 65.8% 66.0% 99.7% 84% 0.7 17.2s 50 100 100 100 100 100 94 100 34 66 0 18 10 38 100 100 100 96 100 86 100 34 74 0 10 12 40 +Qwen3-8B-Q4_K_M LS/N [bare] 53.2% 64.9% 82.1% 100% 0.1 15.0s 50 92 100 4 86 92 100 100 0 62 2 32 26 12 86 100 6 96 90 100 98 0 76 2 0 22 0 +Qwen3-8B-Q4_K_M LS/N [bare:keep-last] 46.4% 64.3% 72.2% 100% 0.1 15.0s 50 96 78 2 86 70 94 76 0 74 4 34 4 8 80 76 4 76 76 96 86 0 76 0 4 6 0 +Qwen3-8B-Q4_K_M LS/N [bare:full] 45.2% 63.7% 71.0% 100% 0.1 13.6s 50 92 80 2 86 76 100 80 0 74 0 14 12 2 88 68 4 74 82 94 70 0 60 0 4 14 0 
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## granite-4.1-8b-Q4_K_M (llamaserver/native) +## ministral-3:8b-instruct-2512-q4_K_M (ollama/native) ``` -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -granite-4.1-8b-Q4_K_M LS/N [reforged] 65.4% 68.0% 96.2% 90% 0.8 1.8s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 -granite-4.1-8b-Q4_K_M LS/N [bare] 53.8% 70.0% 76.9% 96% 0.2 1.9s 50 0 100 100 100 100 100 100 0 100 0 0 0 0 0 100 100 100 100 100 100 0 100 0 0 0 0 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s 
dgr_s dge_s art_s grs_s iar_s +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +ministral-3:8b-instruct-2512-q4_K_M OL/N [reforged:full]² 66.8% 71.9% 92.9% 68% 1.4 5.4s 50 100 100 100 100 76 100 100 68 90 0 0 4 64 100 100 100 100 28 100 100 80 98 0 0 6 22 +ministral-3:8b-instruct-2512-q4_K_M OL/N [bare:full]² 14.2% 45.1% 31.4% 91% 0.2 5.2s 50 0 0 0 0 76 100 0 0 4 0 0 2 8 0 0 0 0 86 70 0 0 22 0 0 0 0 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## granite-4.1-8b-Q8_0 (llamaserver/native) @@ -526,20 +591,35 @@ granite-4.1-8b-Q4_K_M LS/N [bare] 53.8% 70.0% 76.9% 96% 0.2 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -granite-4.1-8b-Q8_0 LS/N [reforged] 65.4% 65.4% 100.0% 88% 1.4 2.8s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q8_0 LS/N [reforged] 65.4% 65.4% 100.0% 88% 1.3 2.8s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 
granite-4.1-8b-Q8_0 LS/N [bare] 46.2% 60.0% 77.0% 95% 1.1 3.2s 50 0 100 2 100 100 100 100 0 100 0 0 0 0 0 100 0 100 100 100 100 0 100 0 0 0 0 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` +## granite-4.1-8b-Q4_K_M (llamaserver/native) + +``` +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +granite-4.1-8b-Q4_K_M LS/N [reforged] 65.4% 68.0% 96.2% 90% 0.8 1.9s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q4_K_M LS/N [reforged:keep-last] 65.4% 68.0% 96.2% 90% 0.8 1.8s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q4_K_M LS/N [reforged:full] 65.4% 68.0% 96.2% 90% 0.8 1.9s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q4_K_M LS/N [bare] 53.8% 70.0% 76.9% 96% 0.2 1.9s 50 0 100 100 100 100 100 100 0 100 0 0 0 0 0 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q4_K_M LS/N [bare:keep-last] 53.8% 70.0% 76.9% 96% 0.2 1.9s 50 0 100 100 100 100 100 100 0 100 0 0 0 0 0 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q4_K_M LS/N [bare:full] 53.8% 70.0% 76.9% 96% 
0.2 2.0s 50 0 100 100 100 100 100 100 0 100 0 0 0 0 0 100 100 100 100 100 100 0 100 0 0 0 0 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + ## qwen3:8b-q4_K_M (ollama/native) ``` -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -qwen3:8b-q4_K_M OL/N [reforged] 64.9% 65.1% 99.8% 85% 0.6 21.0s 50 100 100 100 100 100 96 98 30 62 2 6 2 74 98 100 100 100 100 98 100 26 70 4 0 4 18 -qwen3:8b-q4_K_M OL/N [bare] 40.7% 53.0% 76.8% 96% 0.1 15.8s 50 56 98 2 4 100 94 100 2 38 0 4 0 12 68 100 6 78 98 74 96 0 28 0 0 0 0 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +qwen3:8b-q4_K_M OL/N [reforged:full]² 64.9% 65.1% 99.8% 85% 0.6 21.0s 50 100 100 100 100 100 96 98 30 62 2 6 2 74 98 100 100 100 100 98 100 26 70 4 0 4 18 +qwen3:8b-q4_K_M OL/N [bare:full]² 40.7% 53.0% 76.8% 96% 0.1 15.8s 50 56 98 2 4 100 94 100 2 38 0 4 0 12 68 100 6 78 98 74 96 0 28 0 0 0 0 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## granite-4.1-8b-Q4_K_M (llamaserver/prompt) @@ -560,26 +640,28 @@ granite-4.1-8b-Q4_K_M LS/P [bare] 46.2% 50.0% 92.3% 100% 0.0 Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- granite-4.1-8b-Q8_0 LS/P [reforged] 61.5% 66.7% 92.3% 73% 1.0 5.2s 50 0 100 100 100 100 100 100 100 0 0 0 100 0 0 100 100 100 100 100 100 100 0 0 0 100 0 -granite-4.1-8b-Q8_0 LS/P [bare] 42.3% 50.0% 84.6% 86% 0.4 2.3s 50 0 100 0 100 100 100 100 0 0 0 0 100 0 0 100 0 100 100 0 100 0 0 0 0 100 0 +granite-4.1-8b-Q8_0 LS/P [bare] 42.3% 50.0% 84.6% 86% 0.4 2.4s 50 0 100 0 100 100 100 100 0 0 0 0 100 0 0 100 0 100 100 0 100 0 0 0 0 100 0 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## granite4.1:8b-q4_K_M (ollama/native) ``` ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -granite4.1:8b-q4_K_M OL/N [reforged] 57.8% 57.8% 100.0% 81% 1.3 1.9s 50 100 100 100 100 100 100 100 100 2 0 0 0 0 100 100 100 100 100 100 100 0 2 0 0 0 0 -granite4.1:8b-q4_K_M OL/N [bare] 38.6% 50.2% 76.9% 94% 1.0 2.1s 50 0 100 0 100 100 100 100 0 2 0 0 0 0 0 100 0 100 100 100 100 0 2 0 0 0 0 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +granite4.1:8b-q4_K_M OL/N [reforged:full]² 57.8% 57.8% 100.0% 81% 1.3 1.9s 50 100 100 100 100 100 100 100 100 2 0 0 0 0 100 100 100 100 100 100 100 0 2 0 0 0 0 +granite4.1:8b-q4_K_M OL/N [bare:full]² 38.6% 50.2% 76.9% 94% 1.0 2.1s 50 0 100 0 100 100 100 100 0 2 0 0 0 0 0 100 0 100 100 100 100 0 2 0 0 0 0 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ ``` Scr=score(correct/total), Acc=accuracy(correct/total, excl validate errors), Cmp=completeness(completed/total), Eff=efficiency(ideal/actual calls), Wst=avg wasted calls, Spd=avg time(excl compaction) rel=relevance_detection, arg=argument_fidelity, tsl=tool_selection, b2s=basic_2step, s3s=sequential_3step, crt=conditional_routing, srn=sequential_reasoning, err=error_recovery, dgr=data_gap_recovery, dge=data_gap_recovery_extended, art=argument_transformation, grs=grounded_synthesis, iar=inconsistent_api_recovery, rel_s=relevance_detection_stateful, arg_s=argument_fidelity_stateful, tsl_s=tool_selection_stateful, b2s_s=basic_2step_stateful, s3s_s=sequential_3step_stateful, crt_s=conditional_routing_stateful, srn_s=sequential_reasoning_stateful, err_s=error_recovery_stateful, dgr_s=data_gap_recovery_stateful, dge_s=data_gap_recovery_extended_stateful, art_s=argument_transformation_stateful, grs_s=grounded_synthesis_stateful, iar_s=inconsistent_api_recovery_stateful Ablation: full=all guardrails, no_rescue=no rescue loop, no_nudge=no rescue/retry nudge, no_steps=no step enforcement, no_recovery=no error recovery, no_compact=no compaction, bare=all guardrails off 
+Replay: ':keep-last'/':full' tags = reasoning_replay policy (how much captured reasoning is re-sent to the backend each turn); untagged = none (default). Rows predating the knob ran unbounded replay and count as full. Eval generations (older runs carried forward, superscript-tagged): ¹ gen 1 — v0.6.0 suite — incl. Anthropic ablation (commit 2b05dc4, 2026-05-08) + ² gen 2 — v0.7.0 lineup refresh (8–14B) + 32GB tier debut (v0.7.4) (commit 655e1f6, 2026-05-22) -*Generated 2026-06-03 00:09* +*Generated 2026-06-11 20:28* diff --git a/docs/results/raw/reforged/all.md b/docs/results/raw/reforged/all.md index ee89de6..437e7ab 100644 --- a/docs/results/raw/reforged/all.md +++ b/docs/results/raw/reforged/all.md @@ -1,69 +1,106 @@ # Forge Eval — Reforged Leaderboard ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -claude-opus-4-6 AN/N [reforged]¹ 99.2% 99.8% 99.4% 100% 0.0 15.6s 50 100 100 100 100 100 98 100 100 100 100 98 94 98 100 100 100 100 100 100 100 96 100 100 98 100 98 -claude-sonnet-4-6 AN/N [reforged]¹ 98.4% 98.5% 99.9% 100% 0.1 13.1s 50 100 100 100 100 100 100 100 100 100 98 74 98 100 100 100 100 100 100 100 100 100 100 100 88 100 100 -claude-haiku-4-5-20251001 AN/N [reforged]¹ 94.5% 94.9% 99.6% 100% 0.3 8.5s 50 100 100 100 100 100 100 100 100 100 80 80 98 100 100 100 100 100 100 100 94 100 100 76 36 94 100 
-Qwen3.6-35B-A3B-UD-Q4_K_M LS/N [reforged] 94.8% 95.1% 99.7% 100% 0.6 12.7s 50 100 100 100 100 100 100 96 100 100 72 78 92 100 98 100 100 100 100 98 92 100 100 68 76 94 100 -Qwen3.5-27B-Q4_K_M LS/N [reforged] 93.2% 93.3% 99.8% 82% 1.4 37.6s 50 100 100 100 100 100 100 100 98 100 74 38 88 98 100 100 100 100 100 98 100 100 100 78 56 96 98 -Qwen3.6-27B-Q4_K_M LS/N [reforged] 92.2% 92.5% 99.6% 100% 0.4 37.9s 50 100 100 100 100 100 100 100 98 100 22 74 98 100 100 100 100 100 100 100 100 96 98 36 78 96 100 -Qwen3.5-35B-A3B-Q4_K_M LS/N [reforged] 92.1% 92.4% 99.7% 82% 1.3 11.1s 50 100 100 100 100 100 96 98 100 100 96 14 84 100 100 100 100 100 100 96 100 100 100 94 20 96 100 -Qwen3.5-27B-Q4_K_M LS/P [reforged] 86.8% 86.8% 100.0% 100% 0.1 24.4s 50 100 100 100 100 100 100 100 100 100 42 10 78 100 100 100 100 100 100 100 100 100 100 36 10 80 100 -Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [reforged] 84.5% 84.5% 100.0% 97% 0.6 5.4s 50 100 100 100 100 100 88 100 100 70 44 48 76 94 100 100 100 100 100 96 98 100 76 38 26 62 82 -Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [reforged] 84.2% 84.2% 100.0% 95% 0.5 6.0s 50 100 100 100 100 100 100 100 100 98 74 26 54 88 100 100 100 100 100 100 100 100 100 68 2 26 52 -Ministral-3-8B-Instruct-2512-Q8_0 LS/P [reforged] 84.4% 91.1% 92.6% 92% 0.7 4.6s 50 100 100 6 100 100 100 100 100 100 98 8 100 80 100 100 4 100 100 100 100 100 100 100 0 98 100 -Qwen3.5-35B-A3B-Q4_K_M LS/P [reforged] 82.8% 82.8% 100.0% 100% 0.2 10.4s 50 48 100 100 100 100 94 98 100 100 74 16 62 90 56 100 100 100 100 96 100 100 98 68 14 58 82 -Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [reforged] 82.8% 82.8% 99.9% 95% 0.5 4.1s 50 100 100 100 100 100 100 100 100 98 66 24 34 92 100 100 100 100 100 96 100 100 100 70 0 30 42 -Qwen3.6-27B-Q4_K_M LS/P [reforged] 83.5% 85.0% 98.2% 97% 0.4 53.9s 50 100 100 100 100 100 100 100 100 98 6 66 52 90 100 100 100 100 100 98 100 96 90 2 56 36 80 -Qwen3.6-35B-A3B-UD-Q4_K_M LS/P [reforged] 82.2% 82.2% 100.0% 100% 0.3 23.6s 50 96 100 100 100 100 90 92 
98 92 16 46 62 98 88 100 100 100 100 88 94 96 88 8 42 50 94 -Ministral-3-8B-Instruct-2512-Q8_0 LS/N [reforged] 81.4% 81.4% 100.0% 100% 0.3 4.0s 50 100 100 100 100 100 100 100 100 100 38 4 0 100 100 100 100 100 100 100 100 100 100 74 0 0 100 -Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [reforged] 81.3% 85.0% 95.7% 96% 0.7 3.9s 50 100 100 96 100 100 98 100 92 98 56 14 46 70 100 100 88 100 100 98 100 94 100 82 0 48 34 -Ministral-3-14B-Instruct-2512-Q4_K_M LS/P [reforged] 80.2% 80.2% 100.0% 100% 0.0 2.9s 50 100 100 100 100 100 100 100 100 100 0 0 36 100 100 100 100 100 100 100 100 100 100 0 0 50 100 -Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [reforged] 80.5% 83.0% 97.0% 96% 0.7 2.7s 50 100 98 98 100 100 98 98 100 96 74 10 36 70 100 98 96 100 100 98 100 98 94 70 2 38 20 -qwen3:14b-q4_K_M OL/N [reforged] 78.6% 78.7% 99.9% 77% 1.2 38.5s 50 100 100 100 100 100 100 100 100 74 4 12 68 78 100 100 100 100 100 100 100 94 88 4 0 54 68 -Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [reforged] 79.5% 80.5% 98.7% 96% 0.5 3.7s 50 100 100 100 100 100 82 100 100 78 30 6 58 92 100 100 100 100 100 74 100 100 80 20 6 56 84 -Ministral-3-14B-Instruct-2512-Q4_K_M LS/N [reforged] 78.1% 78.1% 100.0% 97% 0.3 4.0s 50 100 100 100 100 100 100 100 100 100 16 0 0 100 100 100 100 100 100 100 100 100 100 14 0 0 100 -Ministral-3-8B-Instruct-2512-Q4_K_M LS/N [reforged] 78.3% 78.4% 99.8% 95% 0.4 3.2s 50 100 100 100 100 100 100 100 98 100 22 0 0 100 100 100 100 100 100 100 100 98 100 14 2 2 100 -gemma-4-E4B-it-Q4_K_M LS/N [reforged] 78.2% 82.2% 95.1% 98% 0.5 9.0s 50 100 100 100 100 100 92 98 98 90 0 24 80 50 100 100 100 100 100 94 90 94 98 0 0 84 40 -Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M LS/P [reforged] 78.2% 84.3% 92.7% 78% 1.1 3.6s 50 100 100 100 100 100 100 100 100 28 0 0 100 94 100 100 100 100 100 100 100 100 34 0 0 100 76 -gemma-4-E4B-it-Q8_0 LS/N [reforged] 76.2% 80.7% 94.5% 98% 0.6 12.8s 50 100 100 100 100 100 84 88 90 96 2 14 80 44 100 100 100 100 100 88 90 96 94 4 0 80 32 
-Ministral-3-8B-Instruct-2512-Q4_K_M LS/P [reforged] 75.6% 83.8% 90.2% 79% 1.3 3.0s 50 98 100 0 100 100 100 100 100 100 22 12 56 100 100 100 0 100 100 100 100 100 100 28 0 50 100 -gemma-4-E4B-it-Q8_0 LS/P [reforged] 74.7% 74.7% 100.0% 85% 0.6 12.7s 50 100 100 100 100 100 70 100 90 88 0 16 34 94 100 100 100 100 100 48 100 98 84 0 0 30 90 -gemma4:e4b-it-q4_K_M OL/N [reforged] 74.8% 75.0% 99.8% 83% 0.8 11.3s 50 100 100 100 100 100 94 100 100 92 0 0 44 66 100 100 100 100 100 90 100 100 78 0 0 40 42 -ministral-3:14b-instruct-2512-q4_K_M OL/N [reforged] 74.8% 74.8% 100.0% 81% 1.0 6.2s 50 100 100 100 100 100 100 100 100 96 56 0 0 80 100 100 100 100 100 100 100 100 98 0 0 0 16 -gemma4:e4b-it-q8_0 OL/N [reforged] 73.6% 73.8% 99.8% 85% 0.8 12.8s 50 100 100 100 100 100 78 98 100 100 0 8 34 60 100 100 100 100 100 78 94 100 96 0 0 34 34 -Qwen3-8B-Q8_0 LS/P [reforged] 73.1% 73.2% 99.8% 89% 0.4 28.4s 50 100 100 100 100 100 100 100 58 96 0 8 28 94 100 100 100 100 96 100 98 64 88 0 0 12 58 -phi-4-Q4_K_M LS/P [reforged] 72.9% 73.3% 99.5% 85% 0.9 4.1s 50 100 100 100 100 100 34 56 96 90 52 24 38 70 100 100 100 100 100 24 62 92 98 52 0 60 48 -gemma-4-E4B-it-Q4_K_M LS/P [reforged] 72.8% 72.8% 100.0% 85% 0.6 8.5s 50 100 100 100 100 100 38 98 94 94 0 18 26 96 100 100 100 100 100 26 98 100 92 0 2 22 90 -Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M LS/N [reforged] 71.0% 71.2% 99.8% 96% 0.5 6.5s 50 100 100 100 100 98 58 100 100 68 4 12 20 90 100 100 100 100 100 50 100 100 4 22 0 20 100 -Qwen3-14B-Q4_K_M LS/P [reforged] 70.5% 70.8% 99.7% 86% 0.5 24.2s 50 100 100 100 100 100 94 100 64 68 0 0 32 72 100 100 100 100 98 94 100 58 66 0 0 30 58 -ministral-3:8b-instruct-2512-q8_0 OL/N [reforged] 70.7% 74.5% 94.9% 74% 1.1 5.9s 50 100 100 100 100 100 100 100 100 92 12 42 0 42 100 100 100 100 100 100 100 100 26 4 8 0 12 -Nemotron-3-Nano-30B-A3B-Q4_K_M LS/N [reforged] 71.3% 81.0% 88.0% 72% 1.5 21.4s 50 100 100 100 100 100 66 98 52 92 28 4 34 34 100 100 100 98 100 86 92 68 98 24 8 34 38 -Qwen3-8B-Q8_0 LS/N 
[reforged] 70.3% 70.5% 99.7% 88% 0.6 24.1s 50 100 100 100 100 100 100 100 60 82 4 22 20 32 100 100 100 100 98 94 100 58 66 2 12 28 50 -Qwen3-8B-Q4_K_M LS/P [reforged] 70.4% 70.7% 99.6% 86% 0.5 17.8s 50 100 100 100 100 100 94 100 56 64 0 14 12 92 100 100 100 100 100 94 100 62 58 0 0 6 78 -Nemotron-3-Nano-30B-A3B-Q4_K_M LS/P [reforged] 70.2% 70.7% 99.4% 89% 0.4 10.8s 50 100 100 100 100 98 52 100 84 90 6 4 0 100 100 100 100 100 100 42 100 80 92 6 2 4 66 -granite4.1:8b-q8_0 OL/N [reforged] 69.2% 69.2% 100.0% 83% 1.1 2.9s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 100 100 0 0 0 0 -Qwen3-8B-Q4_K_M LS/N [reforged] 68.2% 68.4% 99.6% 86% 0.7 16.1s 50 98 100 100 100 100 92 100 48 78 0 44 8 38 100 100 100 100 100 90 100 40 76 0 8 14 38 -Qwen3-14B-Q4_K_M LS/N [reforged] 67.7% 67.7% 99.9% 85% 0.9 20.8s 50 100 100 100 100 100 94 100 62 36 4 22 44 22 100 100 100 100 98 84 100 66 24 12 18 42 32 -qwen3:8b-q8_0 OL/N [reforged] 67.5% 67.6% 99.9% 85% 0.6 31.0s 50 100 100 100 100 100 100 100 26 88 0 2 6 66 100 100 100 100 100 100 100 40 82 0 0 2 44 -ministral-3:8b-instruct-2512-q4_K_M OL/N [reforged] 66.8% 71.9% 92.9% 68% 1.4 5.4s 50 100 100 100 100 76 100 100 68 90 0 0 4 64 100 100 100 100 28 100 100 80 98 0 0 6 22 -granite-4.1-8b-Q8_0 LS/N [reforged] 65.4% 65.4% 100.0% 88% 1.4 2.8s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 -qwen3:8b-q4_K_M OL/N [reforged] 64.9% 65.1% 99.8% 85% 0.6 21.0s 50 100 100 100 100 100 96 98 30 62 2 6 2 74 98 100 100 100 100 98 100 26 70 4 0 4 18 -granite-4.1-8b-Q4_K_M LS/N [reforged] 65.4% 68.0% 96.2% 90% 0.8 1.8s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 -granite-4.1-8b-Q4_K_M LS/P [reforged] 61.5% 61.5% 100.0% 90% 0.3 2.5s 50 100 100 100 100 100 0 100 100 0 0 100 0 0 100 100 100 100 100 0 100 100 0 0 100 0 0 -granite-4.1-8b-Q8_0 LS/P [reforged] 61.5% 66.7% 92.3% 73% 1.0 5.2s 50 0 100 100 100 100 100 100 100 0 0 0 100 0 0 100 100 100 
100 100 100 100 0 0 0 100 0 -granite4.1:8b-q4_K_M OL/N [reforged] 57.8% 57.8% 100.0% 81% 1.3 1.9s 50 100 100 100 100 100 100 100 100 2 0 0 0 0 100 100 100 100 100 100 100 0 2 0 0 0 0 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +claude-opus-4-8 AN/N [reforged] 100.0% 100.0% 100.0% 100% 0.0 13.3s 50 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 +claude-sonnet-4-6 AN/N [reforged] 100.0% 100.0% 100.0% 100% 0.0 18.2s 50 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 +claude-opus-4-6 AN/N [reforged:full]¹ 99.2% 99.8% 99.4% 100% 0.0 15.6s 50 100 100 100 100 100 98 100 100 100 100 98 94 98 100 100 100 100 100 100 100 96 100 100 98 100 98 +Qwen3.6-35B-A3B-UD-Q4_K_M LS/N [reforged:full]² 94.8% 95.1% 99.7% 100% 0.6 12.7s 50 100 100 100 100 100 100 96 100 100 72 78 92 100 98 100 100 100 100 98 92 100 100 68 76 94 100 +claude-haiku-4-5-20251001 AN/N [reforged] 94.2% 94.2% 99.9% 100% 0.3 6.6s 50 100 100 100 100 100 100 98 100 100 74 74 
98 100 100 100 100 100 100 100 100 100 100 72 38 94 100 +Qwen3.5-27B-Q4_K_M LS/N [reforged:full]² 93.2% 93.3% 99.8% 82% 1.4 37.6s 50 100 100 100 100 100 100 100 98 100 74 38 88 98 100 100 100 100 100 98 100 100 100 78 56 96 98 +Qwen3.6-27B-Q4_K_M LS/N [reforged:full]² 92.2% 92.5% 99.6% 100% 0.4 37.9s 50 100 100 100 100 100 100 100 98 100 22 74 98 100 100 100 100 100 100 100 100 96 98 36 78 96 100 +Qwen3.5-35B-A3B-Q4_K_M LS/N [reforged:full]² 92.1% 92.4% 99.7% 82% 1.3 11.1s 50 100 100 100 100 100 96 98 100 100 96 14 84 100 100 100 100 100 100 96 100 100 100 94 20 96 100 +Qwen3.5-27B-Q4_K_M LS/P [reforged:full]² 86.8% 86.8% 100.0% 100% 0.1 24.4s 50 100 100 100 100 100 100 100 100 100 42 10 78 100 100 100 100 100 100 100 100 100 100 36 10 80 100 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [reforged] 84.5% 84.8% 99.7% 96% 0.6 5.3s 50 100 100 96 100 100 98 100 100 100 76 18 44 92 100 100 100 100 100 100 98 100 98 80 2 42 54 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [reforged:keep-last] 84.8% 85.6% 99.1% 94% 0.6 5.9s 50 100 100 100 100 100 100 100 100 98 70 24 42 86 100 100 100 100 100 100 100 100 100 82 6 36 62 +Ministral-3-8B-Instruct-2512-Q8_0 LS/P [reforged] 84.2% 90.9% 92.7% 91% 0.7 4.6s 50 100 100 6 100 100 100 100 100 100 98 8 100 80 100 100 4 100 100 100 100 100 100 96 0 98 100 +Qwen3.5-35B-A3B-Q4_K_M LS/P [reforged:full]² 82.8% 82.8% 100.0% 100% 0.2 10.4s 50 48 100 100 100 100 94 98 100 100 74 16 62 90 56 100 100 100 100 96 100 100 98 68 14 58 82 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [reforged:full] 83.2% 83.2% 100.0% 97% 0.6 5.6s 50 100 100 100 100 100 86 98 100 68 40 32 76 96 100 100 100 100 100 92 100 100 62 34 30 68 82 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [reforged] 83.3% 83.3% 100.0% 96% 0.6 4.8s 50 100 100 100 100 100 100 100 100 60 32 34 78 94 100 100 100 100 100 92 100 100 62 30 28 78 78 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [reforged:full] 83.1% 83.1% 99.9% 96% 0.5 6.0s 50 100 100 100 100 100 100 98 100 100 66 20 36 88 100 100 100 100 100 
100 100 100 98 74 2 26 52 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [reforged:keep-last] 82.9% 82.9% 100.0% 95% 0.6 5.0s 50 100 100 100 100 100 92 100 100 68 40 30 60 94 100 100 100 100 100 96 100 100 62 36 20 72 86 +Qwen3.6-27B-Q4_K_M LS/P [reforged:full]² 83.5% 85.0% 98.2% 97% 0.4 53.9s 50 100 100 100 100 100 100 100 100 98 6 66 52 90 100 100 100 100 100 98 100 96 90 2 56 36 80 +Qwen3.6-35B-A3B-UD-Q4_K_M LS/P [reforged:full]² 82.2% 82.2% 100.0% 100% 0.3 23.6s 50 96 100 100 100 100 90 92 98 92 16 46 62 98 88 100 100 100 100 88 94 96 88 8 42 50 94 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [reforged:full] 81.8% 81.8% 100.0% 95% 0.5 4.3s 50 100 100 98 100 98 98 98 100 98 62 8 30 96 100 100 98 100 100 98 96 100 98 62 2 40 46 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [reforged:keep-last] 82.4% 83.0% 99.3% 92% 0.6 4.2s 50 100 100 100 100 100 98 100 100 100 68 16 28 86 100 100 100 100 100 96 98 100 96 64 6 24 62 +Ministral-3-14B-Instruct-2512-Q4_K_M LS/P [reforged] 80.6% 80.6% 100.0% 100% 0.0 3.0s 50 100 100 100 100 100 100 100 100 100 0 0 46 100 100 100 100 100 100 100 100 100 98 0 0 52 100 +Ministral-3-8B-Instruct-2512-Q8_0 LS/N [reforged] 81.0% 81.0% 100.0% 100% 0.3 4.1s 50 100 100 100 100 100 100 100 100 100 30 0 4 100 100 100 100 100 100 100 100 100 100 68 0 4 100 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [reforged] 81.4% 81.6% 99.7% 94% 0.6 3.8s 50 100 100 98 100 100 100 100 100 100 68 24 18 86 100 100 98 100 100 100 100 100 96 70 4 28 26 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [reforged:full] 81.2% 82.0% 98.9% 98% 0.6 3.8s 50 100 100 100 100 100 74 100 100 68 40 4 72 88 100 100 100 100 100 82 100 100 78 38 10 64 92 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [reforged:full] 81.0% 83.1% 97.5% 95% 0.7 3.0s 50 100 98 88 100 100 96 100 98 98 84 24 38 70 100 100 100 100 100 96 98 100 96 72 2 22 26 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [reforged:keep-last] 81.2% 84.6% 96.0% 97% 0.7 4.1s 50 100 100 94 100 100 94 100 92 98 72 12 42 78 100 100 86 100 100 98 100 
98 100 64 0 52 32 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [reforged:full] 81.4% 84.8% 96.0% 95% 0.7 4.0s 50 100 100 88 100 100 94 100 96 100 72 12 58 80 100 98 88 100 100 96 100 92 100 64 2 36 40 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [reforged] 80.9% 84.8% 95.4% 95% 0.7 3.9s 50 100 100 90 100 100 98 100 98 100 72 6 50 76 100 100 86 100 100 100 100 92 96 64 2 42 32 +gemma-4-E4B-it-Q4_K_M LS/N [reforged] 79.7% 79.9% 99.8% 100% 0.3 8.1s 50 100 100 100 100 100 94 98 100 84 8 30 64 92 100 100 100 100 100 88 94 100 86 2 0 48 84 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [reforged:keep-last] 79.8% 82.6% 96.7% 95% 0.6 2.8s 50 100 100 96 100 100 98 98 94 98 66 8 36 64 100 100 96 100 100 100 98 96 94 76 0 34 24 +qwen3:14b-q4_K_M OL/N [reforged:full]² 78.6% 78.7% 99.9% 77% 1.2 38.5s 50 100 100 100 100 100 100 100 100 74 4 12 68 78 100 100 100 100 100 100 100 94 88 4 0 54 68 +gemma-4-E4B-it-Q4_K_M LS/N [reforged:keep-last] 78.7% 81.4% 96.7% 99% 0.5 9.3s 50 100 100 100 100 100 96 92 96 96 6 20 64 62 100 100 100 100 100 88 88 100 98 2 0 82 56 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [reforged] 79.5% 81.9% 97.0% 94% 0.7 2.8s 50 100 100 98 100 100 96 100 90 96 68 4 32 68 100 98 98 100 100 96 96 94 98 70 0 34 30 +gemma-4-E4B-it-Q4_K_M LS/N [reforged:full] 79.2% 82.2% 96.3% 99% 0.5 10.0s 50 100 100 100 100 100 96 88 94 100 2 40 78 54 100 100 100 100 100 94 84 98 96 0 0 82 52 +gemma-4-E4B-it-Q8_0 LS/N [reforged] 77.8% 77.9% 99.8% 100% 0.2 10.8s 50 100 100 100 100 100 76 90 100 100 0 18 38 98 100 100 100 100 100 76 98 100 98 0 0 36 94 +Ministral-3-14B-Instruct-2512-Q4_K_M LS/N [reforged] 77.8% 77.8% 100.0% 97% 0.3 4.0s 50 100 100 100 100 100 100 100 100 100 8 0 0 100 100 100 100 100 100 100 100 100 100 14 0 0 100 +Ministral-3-8B-Instruct-2512-Q4_K_M LS/N [reforged] 78.3% 78.3% 100.0% 95% 0.4 3.2s 50 100 100 100 100 100 100 100 100 100 18 0 0 100 100 100 100 100 100 100 100 100 100 16 2 0 100 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [reforged] 77.7% 78.8% 98.5% 95% 0.6 3.8s 50 
100 100 100 100 100 74 100 100 78 32 2 46 90 100 100 100 100 100 66 100 100 74 28 4 48 78 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [reforged:keep-last] 78.0% 78.8% 98.9% 95% 0.6 3.8s 50 100 100 100 100 100 82 100 100 70 28 2 52 78 100 100 100 100 100 74 100 100 80 18 4 48 92 +Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M LS/P [reforged:full]² 78.2% 84.3% 92.7% 78% 1.1 3.6s 50 100 100 100 100 100 100 100 100 28 0 0 100 94 100 100 100 100 100 100 100 100 34 0 0 100 76 +gemma-4-E4B-it-Q8_0 LS/N [reforged:keep-last] 75.7% 79.3% 95.5% 99% 0.5 13.1s 50 100 100 100 100 100 76 92 98 96 4 16 82 48 100 100 100 100 100 50 86 98 94 2 0 86 40 +gemma-4-E4B-it-Q8_0 LS/N [reforged:full] 75.6% 80.8% 93.6% 98% 0.6 12.5s 50 100 100 100 100 100 92 94 92 88 0 18 82 28 100 100 100 100 100 76 92 94 98 2 0 84 26 +phi-4-Q4_K_M LS/P [reforged] 75.3% 75.4% 99.8% 83% 0.9 4.2s 50 100 100 100 100 100 26 62 94 96 62 34 66 70 100 100 100 100 100 28 84 98 94 42 0 60 42 +gemma4:e4b-it-q4_K_M OL/N [reforged:full]² 74.8% 75.0% 99.8% 83% 0.8 11.3s 50 100 100 100 100 100 94 100 100 92 0 0 44 66 100 100 100 100 100 90 100 100 78 0 0 40 42 +ministral-3:14b-instruct-2512-q4_K_M OL/N [reforged:full]² 74.8% 74.8% 100.0% 81% 1.0 6.2s 50 100 100 100 100 100 100 100 100 96 56 0 0 80 100 100 100 100 100 100 100 100 98 0 0 0 16 +Ministral-3-8B-Instruct-2512-Q4_K_M LS/P [reforged] 74.9% 83.3% 89.9% 79% 1.3 3.1s 50 100 100 0 100 100 100 100 100 100 20 2 42 100 100 100 0 100 100 100 100 100 100 22 0 62 100 +gemma-4-E4B-it-Q8_0 LS/P [reforged:full] 73.7% 73.7% 100.0% 86% 0.6 12.7s 50 100 100 100 100 100 48 100 92 90 0 36 28 88 100 100 100 100 100 48 94 98 82 0 0 20 92 +gemma4:e4b-it-q8_0 OL/N [reforged:full]² 73.6% 73.8% 99.8% 85% 0.8 12.8s 50 100 100 100 100 100 78 98 100 100 0 8 34 60 100 100 100 100 100 78 94 100 96 0 0 34 34 +gemma-4-E4B-it-Q8_0 LS/P [reforged:keep-last] 74.1% 74.2% 99.8% 85% 0.6 13.4s 50 100 100 100 100 100 48 100 96 84 0 22 28 94 100 100 100 100 100 52 100 98 90 0 0 18 96 +Qwen3-8B-Q8_0 LS/P 
[reforged:keep-last] 72.8% 73.0% 99.7% 89% 0.4 28.0s 50 100 100 100 100 100 98 98 80 90 0 6 8 94 100 100 100 100 96 96 100 60 98 0 2 12 54 +Qwen3-8B-Q8_0 LS/P [reforged:full] 72.8% 72.9% 99.8% 88% 0.4 28.9s 50 100 100 100 100 100 98 100 70 90 0 4 20 96 100 100 100 100 92 100 96 66 92 0 0 12 56 +gemma-4-E4B-it-Q4_K_M LS/P [reforged:keep-last] 73.2% 73.2% 100.0% 85% 0.6 8.5s 50 100 100 100 100 100 40 98 96 92 0 10 24 96 100 100 100 100 100 38 100 100 86 0 0 32 92 +gemma-4-E4B-it-Q4_K_M LS/P [reforged:full] 72.9% 72.9% 100.0% 85% 0.6 8.5s 50 100 100 100 100 100 54 98 98 86 0 10 28 88 100 100 100 100 100 26 98 96 94 0 0 30 90 +gemma-4-E4B-it-Q8_0 LS/P [reforged] 73.2% 73.3% 99.8% 85% 0.6 13.3s 50 100 100 100 100 100 54 98 94 80 0 28 20 94 100 100 100 100 100 40 100 98 90 0 0 12 96 +Qwen3-8B-Q4_K_M LS/P [reforged:keep-last] 72.2% 72.3% 99.9% 88% 0.5 17.9s 50 100 100 100 100 100 100 100 62 66 0 30 8 90 100 100 100 100 100 98 100 74 68 0 0 8 74 +Qwen3-8B-Q8_0 LS/P [reforged] 72.0% 72.3% 99.6% 88% 0.4 28.6s 50 100 100 100 100 100 96 100 56 90 0 2 10 96 100 100 100 100 96 98 100 58 88 0 0 20 62 +Qwen3-14B-Q4_K_M LS/P [reforged:full] 71.8% 71.9% 99.8% 87% 0.5 24.3s 50 100 100 100 100 100 96 100 72 72 2 0 30 74 100 100 100 100 100 92 100 74 68 0 0 38 48 +Qwen3-14B-Q4_K_M LS/P [reforged:keep-last] 71.8% 71.8% 100.0% 86% 0.5 23.8s 50 100 100 100 100 100 98 100 72 58 2 4 28 72 100 100 100 100 100 94 100 76 74 0 0 32 56 +gemma-4-E4B-it-Q4_K_M LS/P [reforged] 72.4% 72.4% 99.9% 85% 0.6 8.9s 50 100 100 100 100 100 56 96 100 86 0 14 28 84 100 100 100 100 100 34 100 96 74 0 0 22 92 +Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M LS/N [reforged:full]² 71.0% 71.2% 99.8% 96% 0.5 6.5s 50 100 100 100 100 98 58 100 100 68 4 12 20 90 100 100 100 100 100 50 100 100 4 22 0 20 100 +Qwen3-8B-Q4_K_M LS/P [reforged:full] 70.5% 70.8% 99.6% 88% 0.4 17.4s 50 100 100 100 100 100 88 100 58 66 0 24 10 88 100 100 100 100 100 94 98 62 66 0 0 4 76 +Qwen3-8B-Q4_K_M LS/P [reforged] 71.1% 71.2% 99.8% 87% 0.5 
18.0s 50 100 100 100 100 100 96 100 70 66 0 18 8 96 100 100 100 100 98 96 100 60 60 0 0 10 70 +Qwen3-14B-Q4_K_M LS/P [reforged] 71.4% 71.4% 99.9% 86% 0.5 25.6s 50 100 100 100 100 100 98 100 70 70 0 0 30 80 100 100 100 100 100 92 100 76 56 0 0 22 62 +ministral-3:8b-instruct-2512-q8_0 OL/N [reforged:full]² 70.7% 74.5% 94.9% 74% 1.1 5.9s 50 100 100 100 100 100 100 100 100 92 12 42 0 42 100 100 100 100 100 100 100 100 26 4 8 0 12 +Nemotron-3-Nano-30B-A3B-Q4_K_M LS/N [reforged:full]² 71.3% 81.0% 88.0% 72% 1.5 21.4s 50 100 100 100 100 100 66 98 52 92 28 4 34 34 100 100 100 98 100 86 92 68 98 24 8 34 38 +Nemotron-3-Nano-30B-A3B-Q4_K_M LS/P [reforged:full]² 70.2% 70.7% 99.4% 89% 0.4 10.8s 50 100 100 100 100 98 52 100 84 90 6 4 0 100 100 100 100 100 100 42 100 80 92 6 2 4 66 +Qwen3-14B-Q4_K_M LS/N [reforged] 68.5% 68.5% 100.0% 100% 0.4 21.8s 50 100 100 100 100 98 100 100 56 72 14 4 58 4 100 100 100 100 100 100 100 42 72 10 0 52 0 +Qwen3-8B-Q8_0 LS/N [reforged:full] 69.3% 69.6% 99.6% 88% 0.6 24.7s 50 98 100 100 100 100 94 100 48 76 2 28 24 46 100 100 100 100 100 98 100 46 76 2 8 16 40 +granite4.1:8b-q8_0 OL/N [reforged:full]² 69.2% 69.2% 100.0% 83% 1.1 2.9s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 100 100 0 0 0 0 +qwen3:8b-q8_0 OL/N [reforged:full]² 67.5% 67.6% 99.9% 85% 0.6 31.0s 50 100 100 100 100 100 100 100 26 88 0 2 6 66 100 100 100 100 100 100 100 40 82 0 0 2 44 +Qwen3-14B-Q4_K_M LS/N [reforged:full] 68.4% 68.4% 99.9% 83% 0.9 21.9s 50 100 100 100 100 98 90 98 60 32 20 18 50 38 100 100 100 100 100 86 100 74 34 6 18 34 22 +Qwen3-8B-Q8_0 LS/N [reforged] 68.2% 68.5% 99.5% 95% 0.3 24.8s 50 100 100 100 100 100 100 100 56 78 6 24 28 6 98 100 100 100 100 100 100 52 84 6 2 30 2 +Qwen3-8B-Q4_K_M LS/N [reforged] 67.3% 67.5% 99.7% 96% 0.3 15.6s 50 100 100 100 100 100 100 100 40 98 6 14 22 2 100 100 100 100 100 100 100 48 86 2 0 26 6 +Qwen3-8B-Q8_0 LS/N [reforged:keep-last] 67.0% 67.2% 99.8% 92% 0.4 23.2s 50 100 100 100 100 100 94 98 48 84 2 22 10 
12 100 100 100 100 100 96 100 52 80 0 18 20 6 +ministral-3:8b-instruct-2512-q4_K_M OL/N [reforged:full]² 66.8% 71.9% 92.9% 68% 1.4 5.4s 50 100 100 100 100 76 100 100 68 90 0 0 4 64 100 100 100 100 28 100 100 80 98 0 0 6 22 +Qwen3-8B-Q4_K_M LS/N [reforged:full] 65.8% 66.0% 99.7% 84% 0.7 17.2s 50 100 100 100 100 100 94 100 34 66 0 18 10 38 100 100 100 96 100 86 100 34 74 0 10 12 40 +Qwen3-8B-Q4_K_M LS/N [reforged:keep-last] 64.5% 64.6% 99.9% 91% 0.4 15.0s 50 100 100 100 100 100 100 100 30 82 0 28 12 10 100 100 100 98 100 96 100 22 86 2 2 6 4 +granite-4.1-8b-Q8_0 LS/N [reforged] 65.4% 65.4% 100.0% 88% 1.3 2.8s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +qwen3:8b-q4_K_M OL/N [reforged:full]² 64.9% 65.1% 99.8% 85% 0.6 21.0s 50 100 100 100 100 100 96 98 30 62 2 6 2 74 98 100 100 100 100 98 100 26 70 4 0 4 18 +granite-4.1-8b-Q4_K_M LS/N [reforged:keep-last] 65.4% 68.0% 96.2% 90% 0.8 1.8s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q4_K_M LS/N [reforged:full] 65.4% 68.0% 96.2% 90% 0.8 1.9s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q4_K_M LS/N [reforged] 65.4% 68.0% 96.2% 90% 0.8 1.9s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +Qwen3-14B-Q4_K_M LS/N [reforged:keep-last] 64.0% 64.0% 99.9% 91% 0.6 20.3s 50 100 100 100 100 100 90 98 48 40 18 6 38 4 100 100 100 98 96 88 100 54 30 16 2 38 0 +granite-4.1-8b-Q4_K_M LS/P [reforged] 61.5% 61.5% 100.0% 90% 0.3 2.5s 50 100 100 100 100 100 0 100 100 0 0 100 0 0 100 100 100 100 100 0 100 100 0 0 100 0 0 +granite-4.1-8b-Q8_0 LS/P [reforged] 61.5% 66.7% 92.3% 73% 1.0 5.2s 50 0 100 100 100 100 100 100 100 0 0 0 100 0 0 100 100 100 100 100 100 100 0 0 0 100 0 +granite4.1:8b-q4_K_M OL/N [reforged:full]² 57.8% 57.8% 100.0% 81% 1.3 1.9s 50 100 100 100 100 100 100 100 100 2 0 0 0 0 100 100 100 100 100 100 100 0 2 0 0 0 0 
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` Scr=score(correct/total), Acc=accuracy(correct/total, excl validate errors), Cmp=completeness(completed/total), Eff=efficiency(ideal/actual calls), Wst=avg wasted calls, Spd=avg time(excl compaction) rel=relevance_detection, arg=argument_fidelity, tsl=tool_selection, b2s=basic_2step, s3s=sequential_3step, crt=conditional_routing, srn=sequential_reasoning, err=error_recovery, dgr=data_gap_recovery, dge=data_gap_recovery_extended, art=argument_transformation, grs=grounded_synthesis, iar=inconsistent_api_recovery, rel_s=relevance_detection_stateful, arg_s=argument_fidelity_stateful, tsl_s=tool_selection_stateful, b2s_s=basic_2step_stateful, s3s_s=sequential_3step_stateful, crt_s=conditional_routing_stateful, srn_s=sequential_reasoning_stateful, err_s=error_recovery_stateful, dgr_s=data_gap_recovery_stateful, dge_s=data_gap_recovery_extended_stateful, art_s=argument_transformation_stateful, grs_s=grounded_synthesis_stateful, iar_s=inconsistent_api_recovery_stateful Ablation: full=all guardrails, no_rescue=no rescue loop, no_nudge=no rescue/retry nudge, no_steps=no step enforcement, no_recovery=no error recovery, no_compact=no compaction, bare=all guardrails off +Replay: ':keep-last'/':full' tags = reasoning_replay policy (how much captured reasoning is re-sent to the backend each turn); untagged = none (default). Rows predating the knob ran unbounded replay and count as full. Eval generations (older runs carried forward, superscript-tagged): ¹ gen 1 — v0.6.0 suite — incl. 
Anthropic ablation (commit 2b05dc4, 2026-05-08) + ² gen 2 — v0.7.0 lineup refresh (8–14B) + 32GB tier debut (v0.7.4) (commit 655e1f6, 2026-05-22) -*Generated 2026-06-03 00:09* +*Generated 2026-06-11 20:28* diff --git a/docs/results/raw/reforged/by-backend.md b/docs/results/raw/reforged/by-backend.md index 59ee6e2..6314fd5 100644 --- a/docs/results/raw/reforged/by-backend.md +++ b/docs/results/raw/reforged/by-backend.md @@ -3,128 +3,152 @@ ## ministral-3-8b-instruct-q8_0 ``` -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Ministral-3-8B-Instruct-2512-Q8_0 LS/P [reforged] 84.4% 91.1% 92.6% 92% 0.7 4.6s 50 100 100 6 100 100 100 100 100 100 98 8 100 80 100 100 4 100 100 100 100 100 100 100 0 98 100 -Ministral-3-8B-Instruct-2512-Q8_0 LS/N [reforged] 81.4% 81.4% 100.0% 100% 0.3 4.0s 50 100 100 100 100 100 100 100 100 100 38 4 0 100 100 100 100 100 100 100 100 100 100 74 0 0 100 -ministral-3:8b-instruct-2512-q8_0 OL/N [reforged] 70.7% 74.5% 94.9% 74% 1.1 5.9s 50 100 100 100 100 100 100 100 100 92 12 42 0 42 100 100 100 100 100 100 100 100 26 4 8 0 12 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Ministral-3-8B-Instruct-2512-Q8_0 LS/P [reforged] 84.2% 90.9% 92.7% 91% 0.7 4.6s 50 100 100 6 100 100 100 100 100 100 98 8 100 80 100 100 4 100 100 100 100 100 100 96 0 98 100 +Ministral-3-8B-Instruct-2512-Q8_0 LS/N [reforged] 81.0% 81.0% 100.0% 100% 0.3 4.1s 50 100 100 100 100 100 100 100 100 100 30 0 4 100 100 100 100 100 100 100 100 100 100 68 0 4 100 +ministral-3:8b-instruct-2512-q8_0 OL/N [reforged:full]² 70.7% 74.5% 94.9% 74% 1.1 5.9s 50 100 100 100 100 100 100 100 100 92 12 42 0 42 100 100 100 100 100 100 100 100 26 4 8 0 12 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## ministral-3-14b-instruct-q4_K_M ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Ministral-3-14B-Instruct-2512-Q4_K_M LS/P [reforged] 80.2% 80.2% 100.0% 100% 0.0 2.9s 50 100 100 100 100 100 100 100 100 100 0 0 36 100 100 100 100 100 100 100 100 100 100 0 0 50 100 -Ministral-3-14B-Instruct-2512-Q4_K_M LS/N [reforged] 78.1% 78.1% 100.0% 97% 0.3 4.0s 50 100 100 100 100 100 100 100 100 100 16 0 0 100 100 100 100 100 100 100 100 100 100 14 0 0 100 -ministral-3:14b-instruct-2512-q4_K_M OL/N [reforged] 74.8% 74.8% 100.0% 81% 1.0 6.2s 50 100 100 100 100 100 100 100 100 96 56 0 0 80 100 100 100 100 100 100 100 100 98 0 0 0 16 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Ministral-3-14B-Instruct-2512-Q4_K_M LS/P [reforged] 80.6% 80.6% 100.0% 100% 0.0 3.0s 50 100 100 100 100 100 100 100 100 100 0 0 46 100 100 100 100 100 100 100 100 100 98 0 0 52 100 
+Ministral-3-14B-Instruct-2512-Q4_K_M LS/N [reforged] 77.8% 77.8% 100.0% 97% 0.3 4.0s 50 100 100 100 100 100 100 100 100 100 8 0 0 100 100 100 100 100 100 100 100 100 100 14 0 0 100 +ministral-3:14b-instruct-2512-q4_K_M OL/N [reforged:full]² 74.8% 74.8% 100.0% 81% 1.0 6.2s 50 100 100 100 100 100 100 100 100 96 56 0 0 80 100 100 100 100 100 100 100 100 98 0 0 0 16 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## qwen3-14b-q4_K_M +## gemma4-e4b-q4_K_M ``` --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -qwen3:14b-q4_K_M OL/N [reforged] 78.6% 78.7% 99.9% 77% 1.2 38.5s 50 100 100 100 100 100 100 100 100 74 4 12 68 78 100 100 100 100 100 100 100 94 88 4 0 54 68 -Qwen3-14B-Q4_K_M LS/P [reforged] 70.5% 70.8% 99.7% 86% 0.5 24.2s 50 100 100 100 100 100 94 100 64 68 0 0 32 72 100 100 100 100 98 94 100 58 66 0 0 30 58 -Qwen3-14B-Q4_K_M LS/N [reforged] 67.7% 67.7% 99.9% 85% 0.9 20.8s 50 100 100 100 100 100 94 100 62 36 4 22 44 22 100 100 100 100 98 84 100 66 24 12 18 42 32 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +gemma-4-E4B-it-Q4_K_M LS/N [reforged] 79.7% 79.9% 99.8% 100% 0.3 8.1s 50 100 100 100 100 100 94 98 100 84 8 30 64 92 100 100 100 100 100 88 94 100 86 2 0 48 84 +gemma-4-E4B-it-Q4_K_M LS/N [reforged:full] 79.2% 82.2% 96.3% 99% 0.5 10.0s 50 100 100 100 100 100 96 88 94 100 2 40 78 54 100 100 100 100 100 94 84 98 96 0 0 82 52 +gemma-4-E4B-it-Q4_K_M LS/N [reforged:keep-last] 78.7% 81.4% 96.7% 99% 0.5 9.3s 50 100 100 100 100 100 96 92 96 96 6 20 64 62 100 100 100 100 100 88 88 100 98 2 0 82 56 +gemma4:e4b-it-q4_K_M OL/N [reforged:full]² 74.8% 75.0% 99.8% 83% 0.8 11.3s 50 100 100 100 100 100 94 100 100 92 0 0 44 66 100 100 100 100 100 90 100 100 78 0 0 40 42 +gemma-4-E4B-it-Q4_K_M LS/P [reforged:keep-last] 73.2% 73.2% 100.0% 85% 0.6 8.5s 50 100 100 100 100 100 40 98 96 92 0 10 24 96 100 100 100 100 100 38 100 100 86 0 0 32 92 +gemma-4-E4B-it-Q4_K_M LS/P [reforged:full] 72.9% 72.9% 100.0% 85% 0.6 8.5s 50 100 100 100 100 100 54 98 98 86 0 10 28 88 100 100 100 100 100 26 98 96 94 0 0 30 90 +gemma-4-E4B-it-Q4_K_M LS/P [reforged] 72.4% 72.4% 99.9% 85% 0.6 8.9s 
50 100 100 100 100 100 56 96 100 86 0 14 28 84 100 100 100 100 100 34 100 96 74 0 0 22 92 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## ministral-3-8b-instruct-q4_K_M +## qwen3-14b-q4_K_M ``` ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Ministral-3-8B-Instruct-2512-Q4_K_M LS/N [reforged] 78.3% 78.4% 99.8% 95% 0.4 3.2s 50 100 100 100 100 100 100 100 98 100 22 0 0 100 100 100 100 100 100 100 100 98 100 14 2 2 100 -Ministral-3-8B-Instruct-2512-Q4_K_M LS/P [reforged] 75.6% 83.8% 90.2% 79% 1.3 3.0s 50 98 100 0 100 100 100 100 100 100 22 12 56 100 100 100 0 100 100 100 100 100 100 28 0 50 100 -ministral-3:8b-instruct-2512-q4_K_M OL/N [reforged] 66.8% 71.9% 92.9% 68% 1.4 5.4s 50 100 100 100 100 76 100 100 68 90 0 0 4 64 100 100 100 100 28 100 100 80 98 0 0 6 22 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +qwen3:14b-q4_K_M OL/N [reforged:full]² 78.6% 78.7% 99.9% 77% 1.2 38.5s 50 100 100 100 100 100 100 100 100 74 4 12 68 78 100 100 100 100 100 100 100 94 88 4 0 54 68 +Qwen3-14B-Q4_K_M LS/P [reforged:keep-last] 71.8% 71.8% 100.0% 86% 0.5 23.8s 50 100 100 100 100 100 98 100 72 58 2 4 28 72 100 100 100 100 100 94 100 76 74 0 0 32 56 +Qwen3-14B-Q4_K_M LS/P [reforged:full] 71.8% 71.9% 99.8% 87% 0.5 24.3s 50 100 100 100 100 100 96 100 72 72 2 0 30 74 100 100 100 100 100 92 100 74 68 0 0 38 48 +Qwen3-14B-Q4_K_M LS/P [reforged] 71.4% 71.4% 99.9% 86% 0.5 25.6s 50 100 100 100 100 100 98 100 70 70 0 0 30 80 100 100 100 100 100 92 100 76 56 0 0 22 62 +Qwen3-14B-Q4_K_M LS/N [reforged] 68.5% 68.5% 100.0% 100% 0.4 21.8s 50 100 100 100 100 98 100 100 56 72 14 4 58 4 100 100 100 100 100 100 100 42 72 10 0 52 0 +Qwen3-14B-Q4_K_M LS/N [reforged:full] 68.4% 68.4% 99.9% 83% 0.9 21.9s 50 100 100 100 100 98 90 98 60 32 20 18 50 38 100 100 100 100 100 86 100 74 34 6 18 34 22 +Qwen3-14B-Q4_K_M LS/N [reforged:keep-last] 64.0% 64.0% 99.9% 91% 0.6 20.3s 50 100 100 100 100 100 90 98 48 40 18 6 38 4 100 100 100 98 96 88 100 54 30 16 2 38 0 
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ ``` -## gemma4-e4b-q4_K_M +## ministral-3-8b-instruct-q4_K_M ``` -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -gemma-4-E4B-it-Q4_K_M LS/N [reforged] 78.2% 82.2% 95.1% 98% 0.5 9.0s 50 100 100 100 100 100 92 98 98 90 0 24 80 50 100 100 100 100 100 94 90 94 98 0 0 84 40 -gemma4:e4b-it-q4_K_M OL/N [reforged] 74.8% 75.0% 99.8% 83% 0.8 11.3s 50 100 100 100 100 100 94 100 100 92 0 0 44 66 100 100 100 100 100 90 100 100 78 0 0 40 42 -gemma-4-E4B-it-Q4_K_M LS/P [reforged] 72.8% 72.8% 100.0% 85% 0.6 8.5s 50 100 100 100 100 100 38 98 94 94 0 18 26 96 100 100 100 100 100 26 98 100 92 0 2 22 90 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend 
Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Ministral-3-8B-Instruct-2512-Q4_K_M LS/N [reforged] 78.3% 78.3% 100.0% 95% 0.4 3.2s 50 100 100 100 100 100 100 100 100 100 18 0 0 100 100 100 100 100 100 100 100 100 100 16 2 0 100 +Ministral-3-8B-Instruct-2512-Q4_K_M LS/P [reforged] 74.9% 83.3% 89.9% 79% 1.3 3.1s 50 100 100 0 100 100 100 100 100 100 20 2 42 100 100 100 0 100 100 100 100 100 100 22 0 62 100 +ministral-3:8b-instruct-2512-q4_K_M OL/N [reforged:full]² 66.8% 71.9% 92.9% 68% 1.4 5.4s 50 100 100 100 100 76 100 100 68 90 0 0 4 64 100 100 100 100 28 100 100 80 98 0 0 6 22 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## gemma4-e4b-q8_0 ``` ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -gemma-4-E4B-it-Q8_0 LS/N [reforged] 76.2% 80.7% 94.5% 98% 0.6 12.8s 50 100 
100 100 100 100 84 88 90 96 2 14 80 44 100 100 100 100 100 88 90 96 94 4 0 80 32 -gemma-4-E4B-it-Q8_0 LS/P [reforged] 74.7% 74.7% 100.0% 85% 0.6 12.7s 50 100 100 100 100 100 70 100 90 88 0 16 34 94 100 100 100 100 100 48 100 98 84 0 0 30 90 -gemma4:e4b-it-q8_0 OL/N [reforged] 73.6% 73.8% 99.8% 85% 0.8 12.8s 50 100 100 100 100 100 78 98 100 100 0 8 34 60 100 100 100 100 100 78 94 100 96 0 0 34 34 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +gemma-4-E4B-it-Q8_0 LS/N [reforged] 77.8% 77.9% 99.8% 100% 0.2 10.8s 50 100 100 100 100 100 76 90 100 100 0 18 38 98 100 100 100 100 100 76 98 100 98 0 0 36 94 +gemma-4-E4B-it-Q8_0 LS/N [reforged:keep-last] 75.7% 79.3% 95.5% 99% 0.5 13.1s 50 100 100 100 100 100 76 92 98 96 4 16 82 48 100 100 100 100 100 50 86 98 94 2 0 86 40 +gemma-4-E4B-it-Q8_0 LS/N [reforged:full] 75.6% 80.8% 93.6% 98% 0.6 12.5s 50 100 100 100 100 100 92 94 92 88 0 18 82 28 100 100 100 100 100 76 92 94 98 2 0 84 26 +gemma-4-E4B-it-Q8_0 LS/P [reforged:keep-last] 74.1% 74.2% 99.8% 85% 0.6 13.4s 50 100 100 100 100 100 48 100 96 84 0 22 28 94 100 100 100 100 100 52 100 98 90 0 0 18 96 
+gemma-4-E4B-it-Q8_0 LS/P [reforged:full] 73.7% 73.7% 100.0% 86% 0.6 12.7s 50 100 100 100 100 100 48 100 92 90 0 36 28 88 100 100 100 100 100 48 94 98 82 0 0 20 92 +gemma4:e4b-it-q8_0 OL/N [reforged:full]² 73.6% 73.8% 99.8% 85% 0.8 12.8s 50 100 100 100 100 100 78 98 100 100 0 8 34 60 100 100 100 100 100 78 94 100 96 0 0 34 34 +gemma-4-E4B-it-Q8_0 LS/P [reforged] 73.2% 73.3% 99.8% 85% 0.6 13.3s 50 100 100 100 100 100 54 98 94 80 0 28 20 94 100 100 100 100 100 40 100 98 90 0 0 12 96 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## qwen3-8b-q8_0 ``` ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Qwen3-8B-Q8_0 LS/P [reforged] 73.1% 73.2% 99.8% 89% 0.4 28.4s 50 100 100 100 100 100 100 100 58 96 0 8 28 94 100 100 100 100 96 100 98 64 88 0 0 12 58 -Qwen3-8B-Q8_0 LS/N [reforged] 70.3% 70.5% 99.7% 88% 0.6 24.1s 50 100 100 100 100 100 100 100 60 82 4 22 20 32 100 100 100 100 98 94 100 58 66 2 12 28 50 -qwen3:8b-q8_0 OL/N [reforged] 67.5% 67.6% 99.9% 85% 0.6 31.0s 50 100 100 100 100 100 100 100 26 88 0 2 6 66 100 100 100 100 100 100 100 40 82 0 0 2 44 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3-8B-Q8_0 LS/P [reforged:keep-last] 72.8% 73.0% 99.7% 89% 0.4 28.0s 50 100 100 100 100 100 98 98 80 90 0 6 8 94 100 100 100 100 96 96 100 60 98 0 2 12 54 +Qwen3-8B-Q8_0 LS/P [reforged:full] 72.8% 72.9% 99.8% 88% 0.4 28.9s 50 100 100 100 100 100 98 100 70 90 0 4 20 96 100 100 100 100 92 100 96 66 92 0 0 12 56 +Qwen3-8B-Q8_0 LS/P [reforged] 72.0% 72.3% 99.6% 88% 0.4 28.6s 50 100 100 100 100 100 96 100 56 90 0 2 10 96 100 100 100 100 96 98 100 58 88 0 0 20 62 +Qwen3-8B-Q8_0 LS/N [reforged:full] 69.3% 69.6% 99.6% 88% 0.6 24.7s 50 98 100 100 100 100 94 100 48 76 2 28 24 46 100 100 100 100 100 98 100 46 76 2 8 16 40 +Qwen3-8B-Q8_0 LS/N [reforged] 68.2% 68.5% 99.5% 95% 0.3 24.8s 50 100 100 100 100 100 100 100 56 78 6 24 28 6 98 100 100 100 100 100 100 52 84 6 2 30 2 +qwen3:8b-q8_0 OL/N [reforged:full]² 67.5% 67.6% 99.9% 85% 0.6 31.0s 50 100 100 100 100 100 100 100 26 88 0 2 6 66 100 100 100 100 100 100 100 40 82 0 0 2 44 +Qwen3-8B-Q8_0 LS/N [reforged:keep-last] 67.0% 67.2% 99.8% 92% 0.4 23.2s 50 100 100 100 100 100 94 98 48 84 2 22 10 12 100 100 100 100 100 96 100 52 80 
0 18 20 6 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## qwen3-8b-q4_K_M ``` -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Qwen3-8B-Q4_K_M LS/P [reforged] 70.4% 70.7% 99.6% 86% 0.5 17.8s 50 100 100 100 100 100 94 100 56 64 0 14 12 92 100 100 100 100 100 94 100 62 58 0 0 6 78 -Qwen3-8B-Q4_K_M LS/N [reforged] 68.2% 68.4% 99.6% 86% 0.7 16.1s 50 98 100 100 100 100 92 100 48 78 0 44 8 38 100 100 100 100 100 90 100 40 76 0 8 14 38 -qwen3:8b-q4_K_M OL/N [reforged] 64.9% 65.1% 99.8% 85% 0.6 21.0s 50 100 100 100 100 100 96 98 30 62 2 6 2 74 98 100 100 100 100 98 100 26 70 4 0 4 18 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s 
arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3-8B-Q4_K_M LS/P [reforged:keep-last] 72.2% 72.3% 99.9% 88% 0.5 17.9s 50 100 100 100 100 100 100 100 62 66 0 30 8 90 100 100 100 100 100 98 100 74 68 0 0 8 74 +Qwen3-8B-Q4_K_M LS/P [reforged] 71.1% 71.2% 99.8% 87% 0.5 18.0s 50 100 100 100 100 100 96 100 70 66 0 18 8 96 100 100 100 100 98 96 100 60 60 0 0 10 70 +Qwen3-8B-Q4_K_M LS/P [reforged:full] 70.5% 70.8% 99.6% 88% 0.4 17.4s 50 100 100 100 100 100 88 100 58 66 0 24 10 88 100 100 100 100 100 94 98 62 66 0 0 4 76 +Qwen3-8B-Q4_K_M LS/N [reforged] 67.3% 67.5% 99.7% 96% 0.3 15.6s 50 100 100 100 100 100 100 100 40 98 6 14 22 2 100 100 100 100 100 100 100 48 86 2 0 26 6 +Qwen3-8B-Q4_K_M LS/N [reforged:full] 65.8% 66.0% 99.7% 84% 0.7 17.2s 50 100 100 100 100 100 94 100 34 66 0 18 10 38 100 100 100 96 100 86 100 34 74 0 10 12 40 +qwen3:8b-q4_K_M OL/N [reforged:full]² 64.9% 65.1% 99.8% 85% 0.6 21.0s 50 100 100 100 100 100 96 98 30 62 2 6 2 74 98 100 100 100 100 98 100 26 70 4 0 4 18 +Qwen3-8B-Q4_K_M LS/N [reforged:keep-last] 64.5% 64.6% 99.9% 91% 0.4 15.0s 50 100 100 100 100 100 100 100 30 82 0 28 12 10 100 100 100 98 100 96 100 22 86 2 2 6 4 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## granite-4.1-8b-q8_0 ``` ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Model/Backend Scr Acc Cmp Eff Wst 
Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -granite4.1:8b-q8_0 OL/N [reforged] 69.2% 69.2% 100.0% 83% 1.1 2.9s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 100 100 0 0 0 0 -granite-4.1-8b-Q8_0 LS/N [reforged] 65.4% 65.4% 100.0% 88% 1.4 2.8s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 -granite-4.1-8b-Q8_0 LS/P [reforged] 61.5% 66.7% 92.3% 73% 1.0 5.2s 50 0 100 100 100 100 100 100 100 0 0 0 100 0 0 100 100 100 100 100 100 100 0 0 0 100 0 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +granite4.1:8b-q8_0 OL/N [reforged:full]² 69.2% 69.2% 100.0% 83% 1.1 2.9s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 100 100 0 0 0 0 +granite-4.1-8b-Q8_0 LS/N [reforged] 65.4% 65.4% 100.0% 
88% 1.3 2.8s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q8_0 LS/P [reforged] 61.5% 66.7% 92.3% 73% 1.0 5.2s 50 0 100 100 100 100 100 100 100 0 0 0 100 0 0 100 100 100 100 100 100 100 0 0 0 100 0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## granite-4.1-8b-q4_K_M ``` -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -granite-4.1-8b-Q4_K_M LS/N [reforged] 65.4% 68.0% 96.2% 90% 0.8 1.8s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 -granite-4.1-8b-Q4_K_M LS/P [reforged] 61.5% 61.5% 100.0% 90% 0.3 2.5s 50 100 100 100 100 100 0 100 100 0 0 100 0 0 100 100 100 100 100 0 100 100 0 0 100 0 0 -granite4.1:8b-q4_K_M OL/N [reforged] 57.8% 57.8% 100.0% 81% 1.3 1.9s 50 100 100 100 100 100 100 100 100 2 0 0 0 0 100 100 100 100 100 100 100 0 2 0 0 0 0 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +granite-4.1-8b-Q4_K_M LS/N [reforged] 65.4% 68.0% 96.2% 90% 0.8 1.9s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q4_K_M LS/N [reforged:keep-last] 65.4% 68.0% 96.2% 90% 0.8 1.8s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q4_K_M LS/N [reforged:full] 65.4% 68.0% 96.2% 90% 0.8 1.9s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q4_K_M LS/P [reforged] 61.5% 61.5% 100.0% 90% 0.3 2.5s 50 100 100 100 100 100 0 100 100 0 0 100 0 0 100 100 100 100 100 0 100 100 0 0 100 0 0 +granite4.1:8b-q4_K_M OL/N [reforged:full]² 57.8% 57.8% 100.0% 81% 1.3 1.9s 50 100 100 100 100 100 100 100 100 2 0 0 0 0 100 100 100 100 100 100 100 0 2 0 0 0 0 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` Scr=score(correct/total), Acc=accuracy(correct/total, excl validate errors), Cmp=completeness(completed/total), Eff=efficiency(ideal/actual calls), Wst=avg wasted calls, Spd=avg time(excl compaction) rel=relevance_detection, 
arg=argument_fidelity, tsl=tool_selection, b2s=basic_2step, s3s=sequential_3step, crt=conditional_routing, srn=sequential_reasoning, err=error_recovery, dgr=data_gap_recovery, dge=data_gap_recovery_extended, art=argument_transformation, grs=grounded_synthesis, iar=inconsistent_api_recovery, rel_s=relevance_detection_stateful, arg_s=argument_fidelity_stateful, tsl_s=tool_selection_stateful, b2s_s=basic_2step_stateful, s3s_s=sequential_3step_stateful, crt_s=conditional_routing_stateful, srn_s=sequential_reasoning_stateful, err_s=error_recovery_stateful, dgr_s=data_gap_recovery_stateful, dge_s=data_gap_recovery_extended_stateful, art_s=argument_transformation_stateful, grs_s=grounded_synthesis_stateful, iar_s=inconsistent_api_recovery_stateful Ablation: full=all guardrails, no_rescue=no rescue loop, no_nudge=no rescue/retry nudge, no_steps=no step enforcement, no_recovery=no error recovery, no_compact=no compaction, bare=all guardrails off +Replay: ':keep-last'/':full' tags = reasoning_replay policy (how much captured reasoning is re-sent to the backend each turn); untagged = none (default). Rows predating the knob ran unbounded replay and count as full. Eval generations (older runs carried forward, superscript-tagged): ¹ gen 1 — v0.6.0 suite — incl. 
Anthropic ablation (commit 2b05dc4, 2026-05-08) + ² gen 2 — v0.7.0 lineup refresh (8–14B) + 32GB tier debut (v0.7.4) (commit 655e1f6, 2026-05-22) -*Generated 2026-06-03 00:09* +*Generated 2026-06-11 20:28* diff --git a/docs/results/raw/reforged/by-family.md b/docs/results/raw/reforged/by-family.md index 315d025..a06cc09 100644 --- a/docs/results/raw/reforged/by-family.md +++ b/docs/results/raw/reforged/by-family.md @@ -2,144 +2,154 @@ ## claude -``` ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -claude-opus-4-6 AN/N [reforged]¹ 99.2% 99.8% 99.4% 100% 0.0 15.6s 50 100 100 100 100 100 98 100 100 100 100 98 94 98 100 100 100 100 100 100 100 96 100 100 98 100 98 -claude-sonnet-4-6 AN/N [reforged]¹ 98.4% 98.5% 99.9% 100% 0.1 13.1s 50 100 100 100 100 100 100 100 100 100 98 74 98 100 100 100 100 100 100 100 100 100 100 100 88 100 100 -claude-haiku-4-5-20251001 AN/N [reforged]¹ 94.5% 94.9% 99.6% 100% 0.3 8.5s 50 100 100 100 100 100 100 100 100 100 80 80 98 100 100 100 100 100 100 100 94 100 100 76 36 94 100 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -``` - -## qwen3.6-35b-a3b - ``` 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Qwen3.6-35B-A3B-UD-Q4_K_M LS/N [reforged] 94.8% 95.1% 99.7% 100% 0.6 12.7s 50 100 100 100 100 100 100 96 100 100 72 78 92 100 98 100 100 100 100 98 92 100 100 68 76 94 100 -Qwen3.6-35B-A3B-UD-Q4_K_M LS/P [reforged] 82.2% 82.2% 100.0% 100% 0.3 23.6s 50 96 100 100 100 100 90 92 98 92 16 46 62 98 88 100 100 100 100 88 94 96 88 8 42 50 94 +claude-sonnet-4-6 AN/N [reforged] 100.0% 100.0% 100.0% 100% 0.0 18.2s 50 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 +claude-opus-4-8 AN/N [reforged] 100.0% 100.0% 100.0% 100% 0.0 13.3s 50 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 +claude-opus-4-6 AN/N [reforged:full]¹ 99.2% 99.8% 99.4% 100% 0.0 15.6s 50 100 100 100 100 100 98 100 100 100 100 98 94 98 100 100 100 100 100 100 100 96 100 100 98 100 98 +claude-haiku-4-5-20251001 AN/N [reforged] 94.2% 94.2% 99.9% 100% 0.3 6.6s 50 100 100 100 100 100 100 98 100 100 74 74 98 100 100 100 100 100 100 100 100 100 100 72 38 94 100 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## qwen3.5-27b +## 
qwen3.6-35b-a3b ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Qwen3.5-27B-Q4_K_M LS/N [reforged] 93.2% 93.3% 99.8% 82% 1.4 37.6s 50 100 100 100 100 100 100 100 98 100 74 38 88 98 100 100 100 100 100 98 100 100 100 78 56 96 98 -Qwen3.5-27B-Q4_K_M LS/P [reforged] 86.8% 86.8% 100.0% 100% 0.1 24.4s 50 100 100 100 100 100 100 100 100 100 42 10 78 100 100 100 100 100 100 100 100 100 100 36 10 80 100 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3.6-35B-A3B-UD-Q4_K_M LS/N 
[reforged:full]² 94.8% 95.1% 99.7% 100% 0.6 12.7s 50 100 100 100 100 100 100 96 100 100 72 78 92 100 98 100 100 100 100 98 92 100 100 68 76 94 100 +Qwen3.6-35B-A3B-UD-Q4_K_M LS/P [reforged:full]² 82.2% 82.2% 100.0% 100% 0.3 23.6s 50 96 100 100 100 100 90 92 98 92 16 46 62 98 88 100 100 100 100 88 94 96 88 8 42 50 94 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## qwen3.6-27b +## qwen3.5-27b ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Qwen3.6-27B-Q4_K_M LS/N [reforged] 92.2% 92.5% 99.6% 100% 0.4 37.9s 50 100 100 100 100 100 100 100 98 100 22 74 98 100 100 100 100 100 100 100 100 96 98 36 78 96 100 -Qwen3.6-27B-Q4_K_M LS/P [reforged] 83.5% 85.0% 98.2% 97% 0.4 53.9s 50 100 100 100 100 100 100 100 100 98 6 66 52 90 100 100 100 100 100 98 100 96 90 2 56 36 80 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3.5-27B-Q4_K_M LS/N [reforged:full]² 93.2% 93.3% 99.8% 82% 1.4 37.6s 50 100 100 100 100 100 100 100 98 100 74 38 88 98 100 100 100 100 100 98 100 100 100 78 56 96 98 +Qwen3.5-27B-Q4_K_M LS/P [reforged:full]² 86.8% 86.8% 100.0% 100% 0.1 24.4s 50 100 100 100 100 100 100 100 100 100 42 10 78 100 100 100 100 100 100 100 100 100 100 36 10 80 100 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## qwen3.5-35b-a3b +## qwen3.6-27b ``` --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
-Qwen3.5-35B-A3B-Q4_K_M LS/N [reforged] 92.1% 92.4% 99.7% 82% 1.3 11.1s 50 100 100 100 100 100 96 98 100 100 96 14 84 100 100 100 100 100 100 96 100 100 100 94 20 96 100 -Qwen3.5-35B-A3B-Q4_K_M LS/P [reforged] 82.8% 82.8% 100.0% 100% 0.2 10.4s 50 48 100 100 100 100 94 98 100 100 74 16 62 90 56 100 100 100 100 96 100 100 98 68 14 58 82 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3.6-27B-Q4_K_M LS/N [reforged:full]² 92.2% 92.5% 99.6% 100% 0.4 37.9s 50 100 100 100 100 100 100 100 98 100 22 74 98 100 100 100 100 100 100 100 100 96 98 36 78 96 100 +Qwen3.6-27B-Q4_K_M LS/P [reforged:full]² 83.5% 85.0% 98.2% 97% 0.4 53.9s 50 100 100 100 100 100 100 100 100 98 6 66 52 90 100 100 100 100 100 98 100 96 90 2 56 36 80 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## ministral-14b +## qwen3.5-35b-a3b ``` 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [reforged] 84.5% 84.5% 100.0% 97% 0.6 5.4s 50 100 100 100 100 100 88 100 100 70 44 48 76 94 100 100 100 100 100 96 98 100 76 38 26 62 82 -Ministral-3-14B-Instruct-2512-Q4_K_M LS/P [reforged] 80.2% 80.2% 100.0% 100% 0.0 2.9s 50 100 100 100 100 100 100 100 100 100 0 0 36 100 100 100 100 100 100 100 100 100 100 0 0 50 100 -Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [reforged] 79.5% 80.5% 98.7% 96% 0.5 3.7s 50 100 100 100 100 100 82 100 100 78 30 6 58 92 100 100 100 100 100 74 100 100 80 20 6 56 84 -Ministral-3-14B-Instruct-2512-Q4_K_M LS/N [reforged] 78.1% 78.1% 100.0% 97% 0.3 4.0s 50 100 100 100 100 100 100 100 100 100 16 0 0 100 100 100 100 100 100 100 100 100 100 14 0 0 100 -ministral-3:14b-instruct-2512-q4_K_M OL/N [reforged] 74.8% 74.8% 100.0% 81% 1.0 6.2s 50 100 100 100 100 100 100 100 100 96 56 0 0 80 100 100 100 100 100 100 100 100 98 0 0 0 16 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3.5-35B-A3B-Q4_K_M LS/N [reforged:full]² 92.1% 92.4% 99.7% 82% 1.3 11.1s 50 100 100 100 100 100 96 98 100 100 96 14 84 100 100 100 100 100 100 96 100 100 100 94 20 96 100 +Qwen3.5-35B-A3B-Q4_K_M LS/P [reforged:full]² 82.8% 82.8% 100.0% 100% 0.2 10.4s 50 48 100 100 100 100 94 98 100 100 74 16 62 90 56 100 100 100 100 96 100 100 98 68 14 58 82 +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## ministral-8b ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Ministral-3-8B-Instruct-2512-Q8_0 LS/P [reforged] 84.4% 91.1% 92.6% 92% 0.7 4.6s 50 100 100 6 100 100 100 100 100 100 98 8 100 80 100 100 4 100 100 100 100 100 100 100 0 98 100 -Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [reforged] 84.2% 84.2% 100.0% 95% 0.5 6.0s 50 100 100 100 100 100 100 100 100 98 74 26 54 88 100 100 100 100 100 100 100 100 100 68 2 26 52 -Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [reforged] 82.8% 82.8% 99.9% 95% 0.5 4.1s 50 100 100 100 100 100 100 100 100 98 66 24 34 92 100 100 100 100 100 96 100 100 100 70 0 30 42 -Ministral-3-8B-Instruct-2512-Q8_0 LS/N [reforged] 81.4% 81.4% 100.0% 100% 0.3 4.0s 50 100 100 100 100 100 100 100 100 100 38 4 0 100 100 100 100 100 100 100 100 100 100 74 0 0 100 -Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [reforged] 81.3% 85.0% 95.7% 96% 0.7 3.9s 50 100 100 96 100 100 98 100 92 98 56 14 46 70 100 100 88 100 100 98 100 94 100 82 0 48 34 -Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [reforged] 80.5% 83.0% 97.0% 96% 0.7 2.7s 50 100 98 98 100 100 98 98 100 96 74 10 36 70 100 98 96 100 100 98 100 98 94 70 2 38 20 -Ministral-3-8B-Instruct-2512-Q4_K_M LS/N [reforged] 78.3% 78.4% 99.8% 95% 0.4 3.2s 50 100 100 100 100 100 100 100 98 100 22 0 0 100 100 100 100 100 100 100 100 98 100 14 2 2 100 -Ministral-3-8B-Instruct-2512-Q4_K_M LS/P [reforged] 75.6% 83.8% 90.2% 79% 1.3 3.0s 50 98 100 0 100 100 100 100 100 100 22 12 56 100 100 100 0 100 100 100 100 100 100 28 0 50 100 -ministral-3:8b-instruct-2512-q8_0 OL/N [reforged] 70.7% 74.5% 94.9% 74% 1.1 5.9s 50 100 100 100 100 100 100 100 100 92 12 42 0 42 100 100 100 100 100 100 100 100 26 4 8 0 12 -ministral-3:8b-instruct-2512-q4_K_M OL/N [reforged] 66.8% 71.9% 92.9% 68% 1.4 5.4s 50 100 100 100 100 76 100 100 68 90 0 0 4 64 100 
100 100 100 28 100 100 80 98 0 0 6 22 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [reforged:keep-last] 84.8% 85.6% 99.1% 94% 0.6 5.9s 50 100 100 100 100 100 100 100 100 98 70 24 42 86 100 100 100 100 100 100 100 100 100 82 6 36 62 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [reforged] 84.5% 84.8% 99.7% 96% 0.6 5.3s 50 100 100 96 100 100 98 100 100 100 76 18 44 92 100 100 100 100 100 100 98 100 98 80 2 42 54 +Ministral-3-8B-Instruct-2512-Q8_0 LS/P [reforged] 84.2% 90.9% 92.7% 91% 0.7 4.6s 50 100 100 6 100 100 100 100 100 100 98 8 100 80 100 100 4 100 100 100 100 100 100 96 0 98 100 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/N [reforged:full] 83.1% 83.1% 99.9% 96% 0.5 6.0s 50 100 100 100 100 100 100 98 100 100 66 20 36 88 100 100 100 100 100 100 100 100 98 74 2 26 52 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [reforged:keep-last] 82.4% 83.0% 99.3% 92% 0.6 4.2s 50 100 100 100 100 100 98 100 100 100 68 16 28 86 100 100 100 100 100 96 98 100 96 64 6 24 62 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [reforged:full] 
81.8% 81.8% 100.0% 95% 0.5 4.3s 50 100 100 98 100 98 98 98 100 98 62 8 30 96 100 100 98 100 100 98 96 100 98 62 2 40 46 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/N [reforged] 81.4% 81.6% 99.7% 94% 0.6 3.8s 50 100 100 98 100 100 100 100 100 100 68 24 18 86 100 100 98 100 100 100 100 100 96 70 4 28 26 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [reforged:full] 81.4% 84.8% 96.0% 95% 0.7 4.0s 50 100 100 88 100 100 94 100 96 100 72 12 58 80 100 98 88 100 100 96 100 92 100 64 2 36 40 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [reforged:keep-last] 81.2% 84.6% 96.0% 97% 0.7 4.1s 50 100 100 94 100 100 94 100 92 98 72 12 42 78 100 100 86 100 100 98 100 98 100 64 0 52 32 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [reforged:full] 81.0% 83.1% 97.5% 95% 0.7 3.0s 50 100 98 88 100 100 96 100 98 98 84 24 38 70 100 100 100 100 100 96 98 100 96 72 2 22 26 +Ministral-3-8B-Instruct-2512-Q8_0 LS/N [reforged] 81.0% 81.0% 100.0% 100% 0.3 4.1s 50 100 100 100 100 100 100 100 100 100 30 0 4 100 100 100 100 100 100 100 100 100 100 68 0 4 100 +Ministral-3-8B-Reasoning-2512-Q8_0 LS/P [reforged] 80.9% 84.8% 95.4% 95% 0.7 3.9s 50 100 100 90 100 100 98 100 98 100 72 6 50 76 100 100 86 100 100 100 100 92 96 64 2 42 32 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [reforged:keep-last] 79.8% 82.6% 96.7% 95% 0.6 2.8s 50 100 100 96 100 100 98 98 94 98 66 8 36 64 100 100 96 100 100 100 98 96 94 76 0 34 24 +Ministral-3-8B-Reasoning-2512-Q4_K_M LS/P [reforged] 79.5% 81.9% 97.0% 94% 0.7 2.8s 50 100 100 98 100 100 96 100 90 96 68 4 32 68 100 98 98 100 100 96 96 94 98 70 0 34 30 +Ministral-3-8B-Instruct-2512-Q4_K_M LS/N [reforged] 78.3% 78.3% 100.0% 95% 0.4 3.2s 50 100 100 100 100 100 100 100 100 100 18 0 0 100 100 100 100 100 100 100 100 100 100 16 2 0 100 +Ministral-3-8B-Instruct-2512-Q4_K_M LS/P [reforged] 74.9% 83.3% 89.9% 79% 1.3 3.1s 50 100 100 0 100 100 100 100 100 100 20 2 42 100 100 100 0 100 100 100 100 100 100 22 0 62 100 +ministral-3:8b-instruct-2512-q8_0 OL/N [reforged:full]² 70.7% 74.5% 94.9% 74% 1.1 
5.9s 50 100 100 100 100 100 100 100 100 92 12 42 0 42 100 100 100 100 100 100 100 100 26 4 8 0 12 +ministral-3:8b-instruct-2512-q4_K_M OL/N [reforged:full]² 66.8% 71.9% 92.9% 68% 1.4 5.4s 50 100 100 100 100 76 100 100 68 90 0 0 4 64 100 100 100 100 28 100 100 80 98 0 0 6 22 +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## qwen3-14b +## ministral-14b ``` --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -qwen3:14b-q4_K_M OL/N [reforged] 78.6% 78.7% 99.9% 77% 1.2 38.5s 50 100 100 100 100 100 100 100 100 74 4 12 68 78 100 100 100 100 100 100 100 94 88 4 0 54 68 -Qwen3-14B-Q4_K_M LS/P [reforged] 70.5% 70.8% 99.7% 86% 0.5 24.2s 50 100 100 100 100 100 94 100 64 68 0 0 32 72 100 100 100 100 98 94 100 58 66 0 0 30 58 -Qwen3-14B-Q4_K_M LS/N [reforged] 67.7% 67.7% 99.9% 85% 0.9 20.8s 50 100 100 100 100 100 94 100 62 36 4 22 44 22 100 100 100 100 98 84 100 66 24 12 18 42 32 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [reforged] 83.3% 83.3% 100.0% 96% 0.6 4.8s 50 100 100 100 100 100 100 100 100 60 32 34 78 94 100 100 100 100 100 92 100 100 62 30 28 78 78 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [reforged:full] 83.2% 83.2% 100.0% 97% 0.6 5.6s 50 100 100 100 100 100 86 98 100 68 40 32 76 96 100 100 100 100 100 92 100 100 62 34 30 68 82 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/N [reforged:keep-last] 82.9% 82.9% 100.0% 95% 0.6 5.0s 50 100 100 100 100 100 92 100 100 68 40 30 60 94 100 100 100 100 100 96 100 100 62 36 20 72 86 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [reforged:full] 81.2% 82.0% 98.9% 98% 0.6 3.8s 50 100 100 100 100 100 74 100 100 68 40 4 72 88 100 100 100 100 100 82 100 100 78 38 10 64 92 +Ministral-3-14B-Instruct-2512-Q4_K_M LS/P [reforged] 80.6% 80.6% 100.0% 100% 0.0 3.0s 50 100 100 100 100 100 100 100 100 100 0 0 46 100 100 100 100 100 100 100 100 100 98 0 0 52 100 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [reforged:keep-last] 78.0% 78.8% 98.9% 95% 0.6 3.8s 50 100 100 100 100 100 82 100 100 70 28 2 52 78 100 100 100 100 100 74 100 100 80 18 4 48 92 +Ministral-3-14B-Instruct-2512-Q4_K_M LS/N [reforged] 77.8% 77.8% 100.0% 97% 0.3 4.0s 50 100 100 100 100 100 100 100 100 100 8 0 0 100 100 100 100 100 100 100 
100 100 100 14 0 0 100 +Ministral-3-14B-Reasoning-2512-Q4_K_M LS/P [reforged] 77.7% 78.8% 98.5% 95% 0.6 3.8s 50 100 100 100 100 100 74 100 100 78 32 2 46 90 100 100 100 100 100 66 100 100 74 28 4 48 78 +ministral-3:14b-instruct-2512-q4_K_M OL/N [reforged:full]² 74.8% 74.8% 100.0% 81% 1.0 6.2s 50 100 100 100 100 100 100 100 100 96 56 0 0 80 100 100 100 100 100 100 100 100 98 0 0 0 16 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## gemma4-e4b ``` -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -gemma-4-E4B-it-Q4_K_M LS/N [reforged] 78.2% 82.2% 95.1% 98% 0.5 9.0s 50 100 100 100 100 100 92 98 98 90 0 24 80 50 100 100 100 100 100 94 90 94 98 0 0 84 40 -gemma-4-E4B-it-Q8_0 LS/N [reforged] 76.2% 80.7% 94.5% 98% 0.6 12.8s 50 100 100 100 100 100 84 88 90 96 2 14 80 44 100 100 100 100 100 88 90 96 94 4 0 80 32 -gemma4:e4b-it-q4_K_M OL/N [reforged] 74.8% 75.0% 99.8% 83% 0.8 11.3s 50 100 100 100 100 100 94 100 100 92 0 0 44 66 100 100 100 100 100 90 100 100 78 0 0 40 42 -gemma-4-E4B-it-Q8_0 LS/P [reforged] 74.7% 74.7% 100.0% 85% 0.6 12.7s 50 100 100 100 100 100 70 100 90 88 0 16 34 94 100 100 100 100 100 48 100 98 84 0 0 30 90 
-gemma4:e4b-it-q8_0 OL/N [reforged] 73.6% 73.8% 99.8% 85% 0.8 12.8s 50 100 100 100 100 100 78 98 100 100 0 8 34 60 100 100 100 100 100 78 94 100 96 0 0 34 34 -gemma-4-E4B-it-Q4_K_M LS/P [reforged] 72.8% 72.8% 100.0% 85% 0.6 8.5s 50 100 100 100 100 100 38 98 94 94 0 18 26 96 100 100 100 100 100 26 98 100 92 0 2 22 90 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +gemma-4-E4B-it-Q4_K_M LS/N [reforged] 79.7% 79.9% 99.8% 100% 0.3 8.1s 50 100 100 100 100 100 94 98 100 84 8 30 64 92 100 100 100 100 100 88 94 100 86 2 0 48 84 +gemma-4-E4B-it-Q4_K_M LS/N [reforged:full] 79.2% 82.2% 96.3% 99% 0.5 10.0s 50 100 100 100 100 100 96 88 94 100 2 40 78 54 100 100 100 100 100 94 84 98 96 0 0 82 52 +gemma-4-E4B-it-Q4_K_M LS/N [reforged:keep-last] 78.7% 81.4% 96.7% 99% 0.5 9.3s 50 100 100 100 100 100 96 92 96 96 6 20 64 62 100 100 100 100 100 88 88 100 98 2 0 82 56 +gemma-4-E4B-it-Q8_0 LS/N [reforged] 77.8% 77.9% 99.8% 100% 0.2 10.8s 50 100 100 100 100 100 76 90 100 100 0 18 38 98 100 100 100 100 100 76 98 100 98 0 0 36 94 +gemma-4-E4B-it-Q8_0 LS/N [reforged:keep-last] 75.7% 79.3% 95.5% 99% 0.5 13.1s 50 100 
100 100 100 100 76 92 98 96 4 16 82 48 100 100 100 100 100 50 86 98 94 2 0 86 40 +gemma-4-E4B-it-Q8_0 LS/N [reforged:full] 75.6% 80.8% 93.6% 98% 0.6 12.5s 50 100 100 100 100 100 92 94 92 88 0 18 82 28 100 100 100 100 100 76 92 94 98 2 0 84 26 +gemma4:e4b-it-q4_K_M OL/N [reforged:full]² 74.8% 75.0% 99.8% 83% 0.8 11.3s 50 100 100 100 100 100 94 100 100 92 0 0 44 66 100 100 100 100 100 90 100 100 78 0 0 40 42 +gemma-4-E4B-it-Q8_0 LS/P [reforged:keep-last] 74.1% 74.2% 99.8% 85% 0.6 13.4s 50 100 100 100 100 100 48 100 96 84 0 22 28 94 100 100 100 100 100 52 100 98 90 0 0 18 96 +gemma-4-E4B-it-Q8_0 LS/P [reforged:full] 73.7% 73.7% 100.0% 86% 0.6 12.7s 50 100 100 100 100 100 48 100 92 90 0 36 28 88 100 100 100 100 100 48 94 98 82 0 0 20 92 +gemma4:e4b-it-q8_0 OL/N [reforged:full]² 73.6% 73.8% 99.8% 85% 0.8 12.8s 50 100 100 100 100 100 78 98 100 100 0 8 34 60 100 100 100 100 100 78 94 100 96 0 0 34 34 +gemma-4-E4B-it-Q8_0 LS/P [reforged] 73.2% 73.3% 99.8% 85% 0.6 13.3s 50 100 100 100 100 100 54 98 94 80 0 28 20 94 100 100 100 100 100 40 100 98 90 0 0 12 96 +gemma-4-E4B-it-Q4_K_M LS/P [reforged:keep-last] 73.2% 73.2% 100.0% 85% 0.6 8.5s 50 100 100 100 100 100 40 98 96 92 0 10 24 96 100 100 100 100 100 38 100 100 86 0 0 32 92 +gemma-4-E4B-it-Q4_K_M LS/P [reforged:full] 72.9% 72.9% 100.0% 85% 0.6 8.5s 50 100 100 100 100 100 54 98 98 86 0 10 28 88 100 100 100 100 100 26 98 96 94 0 0 30 90 +gemma-4-E4B-it-Q4_K_M LS/P [reforged] 72.4% 72.4% 99.9% 85% 0.6 8.9s 50 100 100 100 100 100 56 96 100 86 0 14 28 84 100 100 100 100 100 34 100 96 74 0 0 22 92 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` -## mistral-small-3.2 +## qwen3-14b ``` 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M LS/P [reforged] 78.2% 84.3% 92.7% 78% 1.1 3.6s 50 100 100 100 100 100 100 100 100 28 0 0 100 94 100 100 100 100 100 100 100 100 34 0 0 100 76 -Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M LS/N [reforged] 71.0% 71.2% 99.8% 96% 0.5 6.5s 50 100 100 100 100 98 58 100 100 68 4 12 20 90 100 100 100 100 100 50 100 100 4 22 0 20 100 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +qwen3:14b-q4_K_M OL/N [reforged:full]² 78.6% 78.7% 99.9% 77% 1.2 38.5s 50 100 100 100 100 100 100 100 100 74 4 12 68 78 100 100 100 100 100 100 100 94 88 4 0 54 68 +Qwen3-14B-Q4_K_M LS/P [reforged:keep-last] 71.8% 71.8% 100.0% 86% 0.5 23.8s 50 100 100 100 100 100 98 100 72 58 2 4 28 72 100 100 100 100 100 94 100 76 74 0 0 32 56 +Qwen3-14B-Q4_K_M LS/P [reforged:full] 71.8% 71.9% 99.8% 87% 0.5 24.3s 50 100 100 100 100 100 96 100 72 72 2 0 30 74 100 100 100 100 100 92 100 74 68 0 0 38 48 +Qwen3-14B-Q4_K_M LS/P [reforged] 71.4% 71.4% 99.9% 86% 0.5 25.6s 50 100 100 100 100 100 98 100 70 70 0 0 30 80 100 100 100 100 100 92 100 76 56 0 0 22 62 +Qwen3-14B-Q4_K_M LS/N [reforged] 68.5% 68.5% 100.0% 100% 0.4 21.8s 50 100 100 100 100 98 100 100 56 72 14 4 58 4 100 100 100 100 100 100 100 42 72 10 0 52 0 +Qwen3-14B-Q4_K_M LS/N [reforged:full] 68.4% 68.4% 99.9% 83% 0.9 21.9s 50 100 100 100 100 98 90 98 60 32 20 18 50 38 100 100 100 100 100 86 100 74 34 6 18 34 22 +Qwen3-14B-Q4_K_M LS/N [reforged:keep-last] 64.0% 64.0% 99.9% 91% 0.6 20.3s 50 100 100 100 100 100 90 98 48 40 18 6 38 4 100 100 100 98 96 88 100 54 30 16 2 38 0 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ ``` -## qwen3-8b +## mistral-small-3.2 ``` -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr 
dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Qwen3-8B-Q8_0 LS/P [reforged] 73.1% 73.2% 99.8% 89% 0.4 28.4s 50 100 100 100 100 100 100 100 58 96 0 8 28 94 100 100 100 100 96 100 98 64 88 0 0 12 58 -Qwen3-8B-Q4_K_M LS/P [reforged] 70.4% 70.7% 99.6% 86% 0.5 17.8s 50 100 100 100 100 100 94 100 56 64 0 14 12 92 100 100 100 100 100 94 100 62 58 0 0 6 78 -Qwen3-8B-Q8_0 LS/N [reforged] 70.3% 70.5% 99.7% 88% 0.6 24.1s 50 100 100 100 100 100 100 100 60 82 4 22 20 32 100 100 100 100 98 94 100 58 66 2 12 28 50 -Qwen3-8B-Q4_K_M LS/N [reforged] 68.2% 68.4% 99.6% 86% 0.7 16.1s 50 98 100 100 100 100 92 100 48 78 0 44 8 38 100 100 100 100 100 90 100 40 76 0 8 14 38 -qwen3:8b-q8_0 OL/N [reforged] 67.5% 67.6% 99.9% 85% 0.6 31.0s 50 100 100 100 100 100 100 100 26 88 0 2 6 66 100 100 100 100 100 100 100 40 82 0 0 2 44 -qwen3:8b-q4_K_M OL/N [reforged] 64.9% 65.1% 99.8% 85% 0.6 21.0s 50 100 100 100 100 100 96 98 30 62 2 6 2 74 98 100 100 100 100 98 100 26 70 4 0 4 18 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M LS/P [reforged:full]² 78.2% 84.3% 92.7% 78% 1.1 3.6s 50 100 100 100 100 100 100 100 100 28 0 0 100 94 100 100 100 100 100 100 100 100 34 0 0 100 76 +Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M LS/N [reforged:full]² 71.0% 71.2% 99.8% 96% 0.5 6.5s 50 100 100 100 100 98 58 100 100 68 4 12 20 90 100 100 100 100 100 50 100 100 4 22 0 20 100 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## phi-4 @@ -148,41 +158,68 @@ qwen3:8b-q4_K_M OL/N [reforged] 64.9% 65.1% 99.8% 85% 0.6 21.0s ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -phi-4-Q4_K_M LS/P [reforged] 72.9% 73.3% 99.5% 85% 0.9 4.1s 50 100 100 100 100 100 34 56 96 90 52 24 38 70 100 100 100 100 100 24 62 92 98 52 0 60 48 +phi-4-Q4_K_M LS/P [reforged] 75.3% 75.4% 99.8% 83% 0.9 4.2s 50 100 100 100 100 100 26 62 94 96 62 34 66 70 100 100 100 100 100 28 84 98 94 42 0 60 42 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` +## qwen3-8b + +``` +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Qwen3-8B-Q8_0 LS/P [reforged:keep-last] 72.8% 73.0% 99.7% 89% 0.4 28.0s 50 100 100 100 100 100 98 98 80 90 0 6 8 94 100 100 100 100 96 96 100 60 98 0 2 12 54 +Qwen3-8B-Q8_0 LS/P [reforged:full] 72.8% 72.9% 99.8% 88% 0.4 28.9s 50 100 100 100 100 100 98 100 70 90 0 4 20 96 100 100 100 100 92 100 96 66 92 0 0 12 56 +Qwen3-8B-Q4_K_M LS/P [reforged:keep-last] 72.2% 72.3% 99.9% 88% 0.5 17.9s 50 100 100 100 100 100 100 100 62 66 0 30 8 90 100 100 100 100 100 98 100 74 68 0 0 8 74 +Qwen3-8B-Q8_0 LS/P [reforged] 72.0% 72.3% 99.6% 88% 0.4 28.6s 50 100 100 100 100 100 96 100 56 90 0 2 10 96 100 100 100 100 96 98 100 58 88 0 0 20 62 +Qwen3-8B-Q4_K_M LS/P [reforged] 71.1% 71.2% 99.8% 87% 0.5 18.0s 50 100 100 100 100 100 96 100 70 66 0 18 8 96 100 100 100 100 98 96 100 60 60 0 0 10 70 +Qwen3-8B-Q4_K_M LS/P [reforged:full] 70.5% 70.8% 99.6% 88% 0.4 17.4s 50 100 100 100 100 100 88 100 58 66 0 24 10 88 100 100 100 100 100 94 98 62 66 0 0 4 76 +Qwen3-8B-Q8_0 LS/N [reforged:full] 69.3% 69.6% 99.6% 88% 0.6 24.7s 50 98 100 100 100 100 94 100 48 76 2 28 24 46 100 
100 100 100 100 98 100 46 76 2 8 16 40 +Qwen3-8B-Q8_0 LS/N [reforged] 68.2% 68.5% 99.5% 95% 0.3 24.8s 50 100 100 100 100 100 100 100 56 78 6 24 28 6 98 100 100 100 100 100 100 52 84 6 2 30 2 +qwen3:8b-q8_0 OL/N [reforged:full]² 67.5% 67.6% 99.9% 85% 0.6 31.0s 50 100 100 100 100 100 100 100 26 88 0 2 6 66 100 100 100 100 100 100 100 40 82 0 0 2 44 +Qwen3-8B-Q4_K_M LS/N [reforged] 67.3% 67.5% 99.7% 96% 0.3 15.6s 50 100 100 100 100 100 100 100 40 98 6 14 22 2 100 100 100 100 100 100 100 48 86 2 0 26 6 +Qwen3-8B-Q8_0 LS/N [reforged:keep-last] 67.0% 67.2% 99.8% 92% 0.4 23.2s 50 100 100 100 100 100 94 98 48 84 2 22 10 12 100 100 100 100 100 96 100 52 80 0 18 20 6 +Qwen3-8B-Q4_K_M LS/N [reforged:full] 65.8% 66.0% 99.7% 84% 0.7 17.2s 50 100 100 100 100 100 94 100 34 66 0 18 10 38 100 100 100 96 100 86 100 34 74 0 10 12 40 +qwen3:8b-q4_K_M OL/N [reforged:full]² 64.9% 65.1% 99.8% 85% 0.6 21.0s 50 100 100 100 100 100 96 98 30 62 2 6 2 74 98 100 100 100 100 98 100 26 70 4 0 4 18 +Qwen3-8B-Q4_K_M LS/N [reforged:keep-last] 64.5% 64.6% 99.9% 91% 0.4 15.0s 50 100 100 100 100 100 100 100 30 82 0 28 12 10 100 100 100 98 100 96 100 22 86 2 2 6 4 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + ## nemotron-3-nano ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Nemotron-3-Nano-30B-A3B-Q4_K_M LS/N [reforged] 71.3% 81.0% 88.0% 72% 1.5 21.4s 50 100 100 100 100 100 66 98 52 92 28 4 34 34 100 100 100 98 100 86 92 68 98 24 8 34 38 -Nemotron-3-Nano-30B-A3B-Q4_K_M LS/P [reforged] 70.2% 70.7% 99.4% 89% 0.4 10.8s 50 100 100 100 100 98 52 100 84 90 6 4 0 100 100 100 100 100 100 42 100 80 92 6 2 4 66 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Nemotron-3-Nano-30B-A3B-Q4_K_M LS/N [reforged:full]² 71.3% 81.0% 88.0% 72% 1.5 21.4s 50 100 100 100 100 100 66 98 52 92 28 4 34 34 100 100 100 98 100 86 92 68 98 24 8 34 38 +Nemotron-3-Nano-30B-A3B-Q4_K_M LS/P [reforged:full]² 70.2% 70.7% 99.4% 89% 0.4 10.8s 50 100 100 100 100 98 52 100 84 90 6 4 0 100 100 100 100 100 100 42 100 80 92 6 2 4 66 
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` ## granite-4.1-8b ``` -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -granite4.1:8b-q8_0 OL/N [reforged] 69.2% 69.2% 100.0% 83% 1.1 2.9s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 100 100 0 0 0 0 -granite-4.1-8b-Q4_K_M LS/N [reforged] 65.4% 68.0% 96.2% 90% 0.8 1.8s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 -granite-4.1-8b-Q8_0 LS/N [reforged] 65.4% 65.4% 100.0% 88% 1.4 2.8s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 -granite-4.1-8b-Q4_K_M LS/P [reforged] 61.5% 61.5% 100.0% 90% 0.3 2.5s 50 100 100 100 100 100 0 100 100 0 0 100 0 0 100 100 100 100 100 0 100 100 0 0 100 0 0 -granite-4.1-8b-Q8_0 LS/P [reforged] 61.5% 66.7% 92.3% 73% 1.0 5.2s 50 0 100 100 100 100 100 100 100 0 0 0 100 0 0 100 100 100 100 100 100 100 0 0 0 100 0 -granite4.1:8b-q4_K_M OL/N [reforged] 57.8% 57.8% 100.0% 81% 1.3 1.9s 50 100 100 100 100 100 100 100 100 2 0 0 0 0 100 100 100 100 100 100 100 0 2 0 0 0 0 
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Model/Backend Scr Acc Cmp Eff Wst Spd N rel arg tsl b2s s3s crt srn err dgr dge art grs iar rel_s arg_s tsl_s b2s_s s3s_s crt_s srn_s err_s dgr_s dge_s art_s grs_s iar_s +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +granite4.1:8b-q8_0 OL/N [reforged:full]² 69.2% 69.2% 100.0% 83% 1.1 2.9s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 100 100 0 0 0 0 +granite-4.1-8b-Q8_0 LS/N [reforged] 65.4% 65.4% 100.0% 88% 1.3 2.8s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q4_K_M LS/N [reforged] 65.4% 68.0% 96.2% 90% 0.8 1.9s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q4_K_M LS/N [reforged:keep-last] 65.4% 68.0% 96.2% 90% 0.8 1.8s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q4_K_M LS/N [reforged:full] 65.4% 68.0% 96.2% 90% 0.8 1.9s 50 100 100 100 100 100 100 100 100 100 0 0 0 0 100 100 100 100 100 100 100 0 100 0 0 0 0 +granite-4.1-8b-Q8_0 LS/P [reforged] 61.5% 66.7% 92.3% 73% 1.0 5.2s 50 0 100 100 100 100 100 100 100 0 0 0 100 0 0 100 100 100 100 100 100 100 0 0 0 100 0 +granite-4.1-8b-Q4_K_M LS/P [reforged] 61.5% 61.5% 100.0% 90% 0.3 2.5s 50 100 100 100 100 
100 0 100 100 0 0 100 0 0 100 100 100 100 100 0 100 100 0 0 100 0 0 +granite4.1:8b-q4_K_M OL/N [reforged:full]² 57.8% 57.8% 100.0% 81% 1.3 1.9s 50 100 100 100 100 100 100 100 100 2 0 0 0 0 100 100 100 100 100 100 100 0 2 0 0 0 0 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ``` Scr=score(correct/total), Acc=accuracy(correct/total, excl validate errors), Cmp=completeness(completed/total), Eff=efficiency(ideal/actual calls), Wst=avg wasted calls, Spd=avg time(excl compaction) rel=relevance_detection, arg=argument_fidelity, tsl=tool_selection, b2s=basic_2step, s3s=sequential_3step, crt=conditional_routing, srn=sequential_reasoning, err=error_recovery, dgr=data_gap_recovery, dge=data_gap_recovery_extended, art=argument_transformation, grs=grounded_synthesis, iar=inconsistent_api_recovery, rel_s=relevance_detection_stateful, arg_s=argument_fidelity_stateful, tsl_s=tool_selection_stateful, b2s_s=basic_2step_stateful, s3s_s=sequential_3step_stateful, crt_s=conditional_routing_stateful, srn_s=sequential_reasoning_stateful, err_s=error_recovery_stateful, dgr_s=data_gap_recovery_stateful, dge_s=data_gap_recovery_extended_stateful, art_s=argument_transformation_stateful, grs_s=grounded_synthesis_stateful, iar_s=inconsistent_api_recovery_stateful Ablation: full=all guardrails, no_rescue=no rescue loop, no_nudge=no rescue/retry nudge, no_steps=no step enforcement, no_recovery=no error recovery, no_compact=no compaction, bare=all guardrails off +Replay: ':keep-last'/':full' tags = reasoning_replay policy (how much captured reasoning is re-sent to the backend each turn); untagged = none (default). Rows predating the knob ran unbounded replay and count as full. Eval generations (older runs carried forward, superscript-tagged): ¹ gen 1 — v0.6.0 suite — incl. 
Anthropic ablation (commit 2b05dc4, 2026-05-08) + ² gen 2 — v0.7.0 lineup refresh (8–14B) + 32GB tier debut (v0.7.4) (commit 655e1f6, 2026-05-22) -*Generated 2026-06-03 00:09* +*Generated 2026-06-11 20:28* diff --git a/tests/eval/dashboard/src/Sidebar.tsx b/tests/eval/dashboard/src/Sidebar.tsx index 573a6bd..35ef2bc 100644 --- a/tests/eval/dashboard/src/Sidebar.tsx +++ b/tests/eval/dashboard/src/Sidebar.tsx @@ -1,5 +1,6 @@ import type { ConfigRow, FilterDimension, Filters, ScenarioScope, ScreenId, SuiteScope, ViewId } from "./types"; import { FILTER_DIMENSIONS, SCENARIO_SCOPES, SUITE_SCOPES } from "./types"; +import { replayRank } from "./utils"; import { ScreenSelector } from "./ScreenSelector"; import { ViewSelector } from "./ViewSelector"; @@ -8,6 +9,7 @@ const DIMENSION_LABELS: Record = { mode: "Mode", family: "Family", quant: "Quant", + replay: "Reasoning Replay", }; interface SidebarProps { @@ -108,7 +110,11 @@ export function Sidebar({ )} {FILTER_DIMENSIONS.map((dim) => { - const vals = [...new Set(rows.map((r) => r[dim]))].sort(); + const vals = [...new Set(rows.map((r) => r[dim]))].sort( + dim === "replay" + ? (a, b) => replayRank(a) - replayRank(b) + : undefined, + ); if (vals.length < 2) return null; return ( diff --git a/tests/eval/dashboard/src/types.ts b/tests/eval/dashboard/src/types.ts index d52ef86..19e2715 100644 --- a/tests/eval/dashboard/src/types.ts +++ b/tests/eval/dashboard/src/types.ts @@ -5,6 +5,8 @@ export interface ConfigRow { backend: string; mode: string; ablation: string; + /** reasoning_replay policy ("none" | "keep-last" | "full"); pre-knob rows count as "full". */ + replay: string; family: string; quant: string; /** Eval generation this row's data came from (see report.py dedup_latest_gen). 
*/ @@ -72,7 +74,7 @@ export const SUITE_SCOPES: { id: SuiteScope; label: string }[] = [ { id: "advanced_reasoning", label: "Advanced Reasoning" }, ]; -export const FILTER_DIMENSIONS = ["backend", "mode", "family", "quant"] as const; +export const FILTER_DIMENSIONS = ["backend", "mode", "family", "quant", "replay"] as const; export type FilterDimension = (typeof FILTER_DIMENSIONS)[number]; export type Filters = Record>; @@ -102,6 +104,10 @@ export const ABLATION_ORDER: readonly string[] = [ "no_compact", ]; +/** Intra-group ordering for reasoning_replay rows: default policy first, + * then increasing replay volume. Mirrors _REPLAY_ORDER in report.py. */ +export const REPLAY_ORDER: readonly string[] = ["none", "keep-last", "full"]; + /** Pre-baked view definitions — control grouping within the active screen's row set. */ export type ViewId = "all" | "by-backend" | "by-family"; @@ -119,7 +125,7 @@ export const VIEWS: ViewDef[] = [ { id: "by-backend", label: "By Backend", - groupBy: ["model", "quant", "ablation"], + groupBy: ["model", "quant", "ablation", "replay"], intraSort: "backend", }, { diff --git a/tests/eval/dashboard/src/utils.ts b/tests/eval/dashboard/src/utils.ts index 1481425..eca06f1 100644 --- a/tests/eval/dashboard/src/utils.ts +++ b/tests/eval/dashboard/src/utils.ts @@ -1,5 +1,5 @@ import type { ConfigRow, ScenarioScope, ScreenId, SortState, SuiteScope, ViewDef } from "./types"; -import { ABLATION_ORDER } from "./types"; +import { ABLATION_ORDER, REPLAY_ORDER } from "./types"; /** Filter rows according to the active screen. * @@ -34,6 +34,12 @@ function ablationRank(name: string): number { return idx === -1 ? ABLATION_ORDER.length : idx; } +/** Rank for sorting reasoning_replay rows in canonical order; unknowns land last. */ +export function replayRank(policy: string): number { + const idx = REPLAY_ORDER.indexOf(policy); + return idx === -1 ? REPLAY_ORDER.length : idx; +} + /** Heat-map color class based on percentage value. 
*/ export function heatClass(v: number | null): string { if (v == null) return ""; @@ -253,6 +259,8 @@ export function groupRows( if (byAblationRank) { const diff = ablationRank(a.ablation) - ablationRank(b.ablation); if (diff !== 0) return diff; + const rDiff = replayRank(a.replay) - replayRank(b.replay); + if (rDiff !== 0) return rDiff; return b.score - a.score; } const scoreDiff = b.score - a.score; diff --git a/tests/eval/report.py b/tests/eval/report.py index 9b061a2..5e09604 100644 --- a/tests/eval/report.py +++ b/tests/eval/report.py @@ -116,13 +116,23 @@ class ConfigKey: mode: str ablation: str = "reforged" tool_choice: str = "auto" + # Pre-knob rows ran unbounded replay — legacy behavior == "full". + reasoning_replay: str = "full" @property def _tag(self) -> str: - """Ablation + tool_choice tag, e.g. '[full]', '[bare]', '[bare+any]'.""" + """Ablation + tool_choice + replay tag, e.g. '[bare]', '[bare+any]', '[reforged:full]'. + + The replay policy is tagged only when it differs from the default + ("none"), so default-policy rows keep the familiar clean label. 
+ """ if self.ablation != "reforged" and self.tool_choice != "auto": - return f"[{self.ablation}+{self.tool_choice}]" - return f"[{self.ablation}]" + base = f"{self.ablation}+{self.tool_choice}" + else: + base = self.ablation + if self.reasoning_replay != "none": + base = f"{base}:{self.reasoning_replay}" + return f"[{base}]" @property def label(self) -> str: @@ -144,7 +154,10 @@ def short_label(self) -> str: return f"{m} {b}/{mode_char} {self._tag}" def __hash__(self) -> int: - return hash((self.model, self.backend, self.mode, self.ablation, self.tool_choice)) + return hash(( + self.model, self.backend, self.mode, + self.ablation, self.tool_choice, self.reasoning_replay, + )) def __eq__(self, other: object) -> bool: if not isinstance(other, ConfigKey): @@ -155,6 +168,7 @@ def __eq__(self, other: object) -> bool: and self.mode == other.mode and self.ablation == other.ablation and self.tool_choice == other.tool_choice + and self.reasoning_replay == other.reasoning_replay ) @@ -172,8 +186,26 @@ def load_jsonl(path: Path) -> list[dict]: return rows +def _row_replay(row: dict) -> str: + """Row-level reasoning_replay, defaulting pre-knob rows to "full". + + Pre-knob rows (no field) ran unbounded replay, which the knob names + "full" — so carried-forward older generations surface with an honest + ':full' tag rather than masquerading as the current default. + """ + return row.get("reasoning_replay", "full") + + def _config_tuple(row: dict) -> tuple[str, str, str, str, str]: - """The identity a config is deduped on — mirrors ConfigKey's fields.""" + """The identity a config is deduped on — ConfigKey's fields minus reasoning_replay. + + reasoning_replay is deliberately NOT part of the dedup identity: pre-knob + rows have no field, and a newer-gen re-sweep should supersede them + regardless of which policies it ran (else every v0.7.0 row would survive + as a stale ':full' duplicate next to its re-swept config). 
Within one + generation all policy rows share the gen, so none/keep-last/full survive + dedup side by side as separate display rows (see group_rows). + """ return ( row["model"], row["backend"], @@ -220,7 +252,10 @@ def group_rows( for row in rows: ablation = row.get("ablation", "reforged") tc = row.get("tool_choice", "auto") - key = ConfigKey(row["model"], row["backend"], row["mode"], ablation, tc) + key = ConfigKey( + row["model"], row["backend"], row["mode"], ablation, tc, + _row_replay(row), + ) grouped[key][row["scenario"]].append(row) return grouped @@ -662,6 +697,10 @@ def extract_quant(model: str) -> str: "note": "v0.6.0 suite — incl. Anthropic ablation"}, 2: {"commit": "655e1f6", "date": "2026-05-22", "note": "v0.7.0 lineup refresh (8–14B) + 32GB tier debut (v0.7.4)"}, + # Tag ref, not a commit SHA: gen 3 landed via a branch whose squash-merge + # SHA didn't exist when this entry was written; the v0.7.5 tag resolves to it. + 3: {"commit": "v0.7.5", "date": "2026-06-11", + "note": "reasoning-replay grid (8–14B × none/keep-last/full) + Claude thinking-on baseline"}, } # Coarse families in the "Retired" tier of docs/MODEL_REGISTRY.md. Retired @@ -739,6 +778,11 @@ def _legend_lines(scenarios: list[str]) -> list[str]: "no_steps=no step enforcement, no_recovery=no error recovery, no_compact=no compaction, " "bare=all guardrails off" ) + lines.append( + "Replay: ':keep-last'/':full' tags = reasoning_replay policy (how much captured " + "reasoning is re-sent to the backend each turn); untagged = none (default). " + "Rows predating the knob ran unbounded replay and count as full." 
+ ) return lines @@ -942,6 +986,7 @@ def _metrics_to_json_row(m: ConfigMetrics, scenarios: list[str]) -> dict: "backend": m.key.backend, "mode": m.key.mode, "ablation": m.key.ablation, + "replay": m.key.reasoning_replay, "family": extract_family(m.key.model), "quant": extract_quant(m.key.model), "gen": m.gen, @@ -1069,6 +1114,19 @@ def _ablation_rank(name: str) -> int: return len(_ABLATION_ORDER) +# Ordering for reasoning_replay rows within a group: default policy first, +# then increasing replay volume. Mirrors REPLAY_ORDER in the dashboard's types.ts. +_REPLAY_ORDER = ("none", "keep-last", "full") + + +def _replay_rank(policy: str) -> int: + """Rank for sorting reasoning_replay rows; unknowns land last.""" + try: + return _REPLAY_ORDER.index(policy) + except ValueError: + return len(_REPLAY_ORDER) + + def write_markdown_views( all_metrics: list[ConfigMetrics], scenarios: list[str], @@ -1087,6 +1145,7 @@ def write_markdown_views( reforged-vs-bare.md — per-(model,backend,mode) reforged+bare pair ablation.md — deep-ablation configs only, 7-row tower per config native-vs-prompt.md — llama-server paired native vs prompt (reforged) + reasoning-replay.md — replay policy comparison per config (>1 policy) budget.md — compaction scenarios only (reforged) """ import datetime @@ -1195,7 +1254,10 @@ def _grouped_view( "Forge Eval — Reforged vs Bare", "Forge lift: reforged vs bare for each (model, backend, mode)", [ - (f"{model} ({backend}/{mode})", sorted(group, key=lambda m: _ablation_rank(m.key.ablation))) + ( + f"{model} ({backend}/{mode})", + sorted(group, key=lambda m: (_ablation_rank(m.key.ablation), _replay_rank(m.key.reasoning_replay))), + ) for (model, backend, mode), group in sorted_rb ], ) @@ -1216,7 +1278,10 @@ def _grouped_view( "Forge Eval — Full Ablation", "Per-guardrail ablation: each config shows all ablation variants", [ - (f"{model} ({backend}/{mode})", sorted(group, key=lambda m: _ablation_rank(m.key.ablation))) + ( + f"{model} ({backend}/{mode})", + 
sorted(group, key=lambda m: (_ablation_rank(m.key.ablation), _replay_rank(m.key.reasoning_replay))), + ) for (model, backend, mode), group in sorted_abl ], ) @@ -1233,11 +1298,33 @@ def _grouped_view( "Forge Eval — Native vs Prompt (llama-server)", "llama-server native FC vs prompt-injected, reforged only", [ - (model, sorted(group, key=lambda m: m.key.mode)) + (model, sorted(group, key=lambda m: (m.key.mode, _replay_rank(m.key.reasoning_replay)))) for model, group in sorted(ls_paired.items()) ], ) + # ── Orthogonal: reasoning-replay.md ─────────────────────────── + # Policy comparison: same config (model, backend, mode, ablation), one row + # per reasoning_replay policy. Only configs that ran >1 policy appear. + rr_groups: dict[tuple[str, str, str, str], list[ConfigMetrics]] = defaultdict(list) + for m in complete: + if m.key.ablation in ("reforged", "bare"): + rr_groups[(m.key.model, m.key.backend, m.key.mode, m.key.ablation)].append(m) + rr_multi = {k: v for k, v in rr_groups.items() if len({m.key.reasoning_replay for m in v}) > 1} + sorted_rr = sorted(rr_multi.items(), key=lambda kv: max(m.score for m in kv[1]), reverse=True) + _grouped_view( + "reasoning-replay.md", + "Forge Eval — Reasoning Replay Policies", + "reasoning_replay policy comparison (none / keep-last / full) per config", + [ + ( + f"{model} ({backend}/{mode}) [{ablation}]", + sorted(group, key=lambda m: _replay_rank(m.key.reasoning_replay)), + ) + for (model, backend, mode, ablation), group in sorted_rr + ], + ) + # ── Orthogonal: budget.md ───────────────────────────────────── compaction_scenarios = [sc for sc in scenarios if sc in { "compaction_stress", "phase2_compaction", @@ -1261,7 +1348,7 @@ def _grouped_view( ("## Reforged — which model should I run?", lambda rp: rp.startswith("reforged/")), ("## Reforged vs Bare — how much does forge lift a model?", lambda rp: rp == "reforged-vs-bare.md"), ("## Full Ablation — which guardrails do the work?", lambda rp: rp == "ablation.md"), - ("## Other 
cross-cuts", lambda rp: rp in ("native-vs-prompt.md", "budget.md")), + ("## Other cross-cuts", lambda rp: rp in ("native-vs-prompt.md", "reasoning-replay.md", "budget.md")), ] index_lines = [ "# Forge Eval Reports\n", @@ -1308,6 +1395,11 @@ def main() -> None: help="Filter to specific ablation preset(s) (e.g. --ablation reforged bare). " "Default: show all.", ) + parser.add_argument( + "--reasoning-replay", nargs="*", + help="Filter to specific reasoning_replay polic(ies) (e.g. --reasoning-replay none). " + "Rows predating the knob count as 'full'. Default: show all.", + ) parser.add_argument( "--exclude-scenario", nargs="*", metavar="NAME", help="Exclude scenario(s) from aggregates and columns " @@ -1357,6 +1449,14 @@ def main() -> None: print(f"No data for ablation preset(s): {', '.join(args.ablation)}") sys.exit(0) + # Filter by reasoning_replay policy if requested + if args.reasoning_replay: + rr_set = set(args.reasoning_replay) + rows = [r for r in rows if _row_replay(r) in rr_set] + if not rows: + print(f"No data for reasoning_replay polic(ies): {', '.join(args.reasoning_replay)}") + sys.exit(0) + # Filter rows by scenario tag before detection if args.tags: _TAG_FILTERS = { From 330d57feb371224a80e92fa797c8e5f01f55bf47 Mon Sep 17 00:00:00 2001 From: Antoine Zambelli Date: Thu, 11 Jun 2026 20:40:21 -0500 Subject: [PATCH 13/14] docs: reasoning_replay knob + ADR-017 + model-registry updates Document the knob and the new none default across README, User Guide, and Backend Setup, with links to the eval evidence. ADR-017 records the policy design, the grid results behind the default, and the alternatives considered. Model Registry: Claude footnote updated for the v0.7.5 thinking-on re-baseline (Sonnet 4.6 / Opus 4.8; Opus 4.6 and the deep-ablation rows stay carried forward), and Qwen3 8B Q8_0 is flagged for future retirement on compute-cost vs signal-value grounds (~23% of the full sweep for a small Q4/Q8 delta). 
Co-Authored-By: Claude Fable 5 --- README.md | 2 +- docs/BACKEND_SETUP.md | 2 +- docs/MODEL_REGISTRY.md | 9 ++-- docs/USER_GUIDE.md | 4 +- docs/decisions/017-reasoning-replay-policy.md | 51 +++++++++++++++++++ 5 files changed, 60 insertions(+), 8 deletions(-) create mode 100644 docs/decisions/017-reasoning-replay-policy.md diff --git a/README.md b/README.md index e1b827a..0696369 100644 --- a/README.md +++ b/README.md @@ -128,7 +128,7 @@ For multi-step workflows, multi-turn conversations, and backend auto-management, Drop-in proxy that sits between any client and a local model server, speaking both the OpenAI chat-completions API and the Anthropic Messages API (`/v1/messages`). Point your client at the proxy (e.g. `http://localhost:8081/v1`) and forge applies its guardrails transparently — the client thinks it's talking to a smarter model. -This is the path for **using forge with an existing harness** (opencode, Continue, aider, Cline, anything that speaks the OpenAI chat-completions schema — or Claude Code, which speaks the Anthropic Messages API). No Python rewrite. Reasoning replay defaults to `keep-last`, so Forge captures reasoning for observability and replays only the latest available reasoning block to the backend on later turns; use `--reasoning-replay full` for the historical replay-all behavior or `--reasoning-replay none` to keep captured reasoning out of backend-facing history. +This is the path for **using forge with an existing harness** (opencode, Continue, aider, Cline, anything that speaks the OpenAI chat-completions schema — or Claude Code, which speaks the Anthropic Messages API). No Python rewrite. Reasoning replay defaults to `none`: Forge still captures reasoning for observability, but keeps it out of backend-facing history on later turns — the most token-efficient policy, and statistically indistinguishable from replay-all on the eval suite (see [reasoning-replay results](docs/results/raw/reasoning-replay.md)). 
Use `--reasoning-replay keep-last` to replay only the latest reasoning block, or `--reasoning-replay full` for the historical replay-all behavior. ```bash # External mode — you manage the backend, forge proxies it diff --git a/docs/BACKEND_SETUP.md b/docs/BACKEND_SETUP.md index 26702d3..2c75fe3 100644 --- a/docs/BACKEND_SETUP.md +++ b/docs/BACKEND_SETUP.md @@ -75,7 +75,7 @@ llamafile --server --nobrowser -m path/to/model.gguf --port 8080 -ngl 999 `LlamafileClient` is **native-first**: `mode="native"` (the default) forwards tools via the backend's `tools` parameter and requires native function calling (llama.cpp with `--jinja`). For a backend without native FC, declare `mode="prompt"` to inject tool descriptions into the prompt and parse the JSON call back out. The capability is declared at construction and frozen — there is no runtime auto-detection. Native-first is the default because local-model FC support has matured into the more reliable path; prompt-injection stays fully supported as an explicit opt-in, but note that on more complex, multi-step interactions models tend to struggle to drive the prompt-injected protocol reliably, so reach for it only when the backend leaves no alternative. -> **Proxy note:** the OpenAI-compatible proxy is **native-first**. By default (`--backend-capability native`) it forwards the client's tools verbatim to an FC-capable backend (llama.cpp with `--jinja`, vLLM, Ollama, Anthropic) — the recommended setup. For a non-FC llama.cpp/llamafile backend, opt into prompt-injection with `--backend-capability prompt` (strips tools into the prompt, parses the JSON call back; reuses the same prompt path as the WorkflowRunner). The choice is frozen at startup — there is no runtime auto-detect in the proxy. Reasoning replay is controlled separately with `--reasoning-replay {full,keep-last,none}`; the default `keep-last` replays only the latest captured reasoning block to the backend when that reasoning is available in the conversation history. 
See ADR-012. +> **Proxy note:** the OpenAI-compatible proxy is **native-first**. By default (`--backend-capability native`) it forwards the client's tools verbatim to an FC-capable backend (llama.cpp with `--jinja`, vLLM, Ollama, Anthropic) — the recommended setup. For a non-FC llama.cpp/llamafile backend, opt into prompt-injection with `--backend-capability prompt` (strips tools into the prompt, parses the JSON call back; reuses the same prompt path as the WorkflowRunner). The choice is frozen at startup — there is no runtime auto-detect in the proxy. Reasoning replay is controlled separately with `--reasoning-replay {full,keep-last,none}`; the default `none` keeps captured reasoning out of backend-facing history (`keep-last` replays only the latest captured reasoning block, `full` replays everything). See ADR-012. Smoke-test: diff --git a/docs/MODEL_REGISTRY.md b/docs/MODEL_REGISTRY.md index 370a14b..9d84d4c 100644 --- a/docs/MODEL_REGISTRY.md +++ b/docs/MODEL_REGISTRY.md @@ -4,7 +4,7 @@ Every model forge knows about, classified by eval-suite status. ## Status meanings -- **Current** — in the published eval. The dashboard folds multiple eval *generations* into one view (the v0.7.0 8–14B lineup, plus the v0.7.4 32GB tier); runs not yet re-swept against the latest code — e.g. the Anthropic ablation — are carried forward and superscript-tagged. Numbers in [`docs/results/`](results/) and the [dashboard](results/dashboard.html). +- **Current** — in the published eval. The dashboard folds multiple eval *generations* into one view (the v0.7.5 reasoning-replay grid for the 8–14B lineup and Claude tier, plus the v0.7.4 32GB tier); runs not yet re-swept against the latest code — e.g. the 32GB tier and the Claude deep-ablation rows — are carried forward and superscript-tagged. Numbers in [`docs/results/`](results/) and the [dashboard](results/dashboard.html). - **Retired** — appeared in a prior eval suite, cut from the current one. 
Either too weak (bare scores below the threshold for informative comparison) or superseded by a newer family member. Sampling defaults retained for backward compatibility. - **Unpublished** — sampling defaults are present, but no eval numbers have been published. Forge will work with these models; performance is undocumented. @@ -20,7 +20,7 @@ Sampling values are sourced from the model's HuggingFace card unless noted. Valu | Ministral-3 14B Instruct 2512 | Q4_K_M | 0.05¹ | — | — | — | — | — | [HF](https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512) | | Ministral-3 8B Reasoning 2512 | Q4_K_M, Q8_0 | 0.7 | —² | — | — | — | — | [HF](https://huggingface.co/mistralai/Ministral-3-8B-Reasoning-2512) | | Ministral-3 14B Reasoning 2512 | Q4_K_M | 1.0 | —² | — | — | — | — | [HF](https://huggingface.co/mistralai/Ministral-3-14B-Reasoning-2512) | -| Qwen3 8B | Q4_K_M, Q8_0 | 0.6 | 0.95 | 20 | 0.0 | — | — | [HF](https://huggingface.co/Qwen/Qwen3-8B) | +| Qwen3 8B | Q4_K_M, Q8_0⁸ | 0.6 | 0.95 | 20 | 0.0 | — | — | [HF](https://huggingface.co/Qwen/Qwen3-8B) | | Qwen3 14B | Q4_K_M | 0.6 | 0.95 | 20 | 0.0 | — | — | [HF](https://huggingface.co/Qwen/Qwen3-14B) | | Granite 4.1 8B | Q4_K_M, Q8_0 | 0.0³ | 1.0 | 0 | — | — | — | (IBM convention, unconfirmed) | | Gemma-4 E4B-it | Q4_K_M, Q8_0 | 1.0 | 0.95 | 64 | — | — | — | [HF](https://huggingface.co/google/gemma-4-e4b-it) | @@ -33,15 +33,16 @@ Sampling values are sourced from the model's HuggingFace card unless noted. Valu | Nemotron-3 Nano 30B-A3B | Q4_K_M | 0.6 | 0.95 | — | — | — | —⁷ | [HF](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) | | Claude Haiku 4.5⁵ | — | — | — | — | — | — | — | (SDK-managed) | | Claude Sonnet 4.6⁵ | — | — | — | — | — | — | — | (SDK-managed) | -| Claude Opus 4.6⁵ | — | — | — | — | — | — | — | (SDK-managed) | +| Claude Opus 4.8⁵ | — | — | — | — | — | — | — | (SDK-managed) | ¹ Ministral-3 Instruct cards say "temperature below 0.1 for production"; 0.05 picked within that range. 
² Ministral-3 Reasoning cards show `top_p=0.95` in code examples but do NOT include it in the formal "Recommended Settings" section. Add explicitly if you want to follow the examples. ³ Granite 4.1 sampling mirrors the Granite 4.0 IBM convention (greedy decoding); marked unconfirmed pending IBM publication for the 4.1 family specifically. ⁴ Phi-4: no formal sampling recommendation from any official source (Microsoft HF card, model docs). Falls through to backend defaults. -⁵ **Claude numbers are carried forward from the v0.6.0 dataset** — gen 1 on the dashboard, superscript-tagged. The Anthropic ablation has not been re-run since, owing to cost (~$272 for the full 11,700-row matrix). Backend support is unchanged; numbers are stable to within tool-error-channel sensitivity (small). +⁵ **Claude baseline re-measured in the v0.7.5 dataset** with extended thinking enabled (adaptive) for Sonnet 4.6 and Opus 4.8; Haiku 4.5 does not support adaptive thinking and runs non-thinking. Earlier Claude rows ran thinking-off: Opus 4.6 and the Anthropic deep-ablation rows are carried forward from the v0.6.0 dataset (gen 1 on the dashboard, superscript-tagged) — the ablation has not been re-run owing to cost (~$272 for the full 11,700-row matrix). ⁶ Qwen3.6 27B (dense) deliberately diverges from its A3B siblings: its card drops the `presence_penalty=1.5` the MoE variants recommend, so forge sends `0.0` (no penalty). ⁷ Nemotron-3 Nano: the card splits sampling into a Reasoning preset (T=1.0, top_p=1.0) and a Tool-calling preset (T=0.6, top_p=0.95); the tool-calling preset is used here, with thinking enabled via `chat_template_kwargs`. 
+⁸ **Qwen3 8B Q8_0 will be cut (→ Retired) in a future eval generation** on compute-cost vs signal-value grounds, not quality: it was the single most expensive model in the v0.7.5 grid (~108 GPU-hours, ~23% of the full sweep) while adding little information over its Q4_K_M sibling (the Q4/Q8 delta is a couple of points on a mid-board model, and the quant-comparison axis is preserved by the cheaper Ministral and Gemma Q4/Q8 pairs). Its numbers stay Current while they are part of the published dataset. --- diff --git a/docs/USER_GUIDE.md b/docs/USER_GUIDE.md index e8ec90f..de10ec1 100644 --- a/docs/USER_GUIDE.md +++ b/docs/USER_GUIDE.md @@ -85,7 +85,7 @@ claude **Function-calling capability.** `--backend-capability native` (default) uses the backend's chat-template tool-calling and is the smoother default for Claude Code's heavy multi-turn tool use. `--backend-capability prompt` injects the tool surface into the prompt for llama.cpp/llamafile backends without a tool-calling template; whether a model stays coherent across multi-turn tool results in prompt mode varies by model — and tends to degrade on more complex, multi-step interactions — so prefer native whenever the backend supports it. The capability is declared at startup and frozen. -**Reasoning replay.** Reasoning-capable backends may return hidden reasoning alongside tool calls. Forge captures that reasoning for observability, then controls how much is replayed to the backend on later turns with `--reasoning-replay {full,keep-last,none}`. The default is `keep-last`: only the latest captured reasoning block is replayed. `full` preserves the historical behavior and replays every captured reasoning block. `none` keeps reasoning out of backend-facing history. In OpenAI-compatible proxy responses, `keep-last` exposes current reasoning as `reasoning_content` instead of normal assistant `content` so clients that preserve reasoning fields can replay only the latest block without turning it into plain text. 
Anthropic proxy responses only emit reasoning text under `full`; Forge does not synthesize signed Anthropic thinking blocks, so default Anthropic proxy responses do not expose replayable reasoning. +**Reasoning replay.** Reasoning-capable backends may return hidden reasoning alongside tool calls. Forge captures that reasoning for observability, then controls how much is replayed to the backend on later turns with `--reasoning-replay {full,keep-last,none}`. The default is `none`: captured reasoning stays out of backend-facing history entirely. This is the most token-efficient policy, and on forge's eval suite it is statistically indistinguishable from replay-all (no aggregate score cost; see [reasoning-replay results](results/raw/reasoning-replay.md)). `keep-last` replays only the latest captured reasoning block. `full` preserves the historical behavior and replays every captured reasoning block. In OpenAI-compatible proxy responses, `keep-last` exposes current reasoning as `reasoning_content` instead of normal assistant `content` so clients that preserve reasoning fields can replay only the latest block without turning it into plain text; under the default `none`, proxy responses omit captured reasoning. Anthropic proxy responses only emit reasoning text under `full`; Forge does not synthesize signed Anthropic thinking blocks, so default Anthropic proxy responses do not expose replayable reasoning. See [ADR-017](decisions/017-reasoning-replay-policy.md) for the policy design and the eval evidence behind the default. **Downstream protocol.** @@ -285,7 +285,7 @@ await server.stop() `WorkflowRunner` accepts an optional `on_message` callback that fires each time a `Message` is appended to the conversation during `run()`. This is the primary observability hook — use it for logging, eval metric collection, or building conversation history for multi-turn flows. 
-`WorkflowRunner(reasoning_replay=...)` uses the same policy as the proxy: `keep-last` by default, `full` for the historical replay-all behavior, and `none` to avoid replaying captured reasoning to the backend. The policy affects backend-facing serialization only; `MessageType.REASONING` entries still appear in `on_message` and internal history unless context compaction removes them. +`WorkflowRunner(reasoning_replay=...)` uses the same policy as the proxy: `none` by default (captured reasoning is not replayed to the backend), `keep-last` to replay only the latest reasoning block, and `full` for the historical replay-all behavior. The policy affects backend-facing serialization only; `MessageType.REASONING` entries still appear in `on_message` and internal history unless context compaction removes them. - **Single-turn (default):** `on_message` fires for every message the runner creates — system prompt, user input, assistant responses, tool results, nudges. - **Multi-turn (`initial_messages`):** `run()` accepts an optional `initial_messages` parameter that seeds the conversation with prior history. `on_message` fires **only for new messages created during this turn**, not for the replayed history. diff --git a/docs/decisions/017-reasoning-replay-policy.md b/docs/decisions/017-reasoning-replay-policy.md new file mode 100644 index 0000000..e2bba03 --- /dev/null +++ b/docs/decisions/017-reasoning-replay-policy.md @@ -0,0 +1,51 @@ +# ADR-017: Reasoning replay is a bounded policy, default `none` + +**Status:** accepted (unreleased) + +## Context + +Reasoning-capable backends (Ministral Reasoning, Qwen3 thinking, gemma 4, …) return hidden reasoning alongside tool calls. forge captures that reasoning for observability (`MessageType.REASONING`), and historically re-serialized **all** of it into backend-facing history on every later turn — unbounded accumulation, with no way to turn it off. 
+
+Two problems motivated bounding this:
+
+- **Convergence.** A proxy non-convergence investigation traced runaway context growth to captured reasoning being replayed back to the backend each turn. Frontier labs practice *scoped* reasoning retention, not replay-everything.
+- **Cost.** Replayed reasoning grows the prompt every turn. On long multi-step workflows it competes with real history for the context budget and inflates per-turn token cost.
+
+A serializer reality check sharpened the question: even the legacy behavior was not a faithful 1:1 re-send — `fold_and_serialize` collapses consecutive reasoning blocks (only the one preceding a tool call survives), so only ~29% of generated reasoning reached the wire on real transcripts. "Replay everything" was already an approximation, not a ground truth worth preserving by default.
+
+## Decision
+
+One knob, `reasoning_replay ∈ {"full", "keep-last", "none"}`, shared by `WorkflowRunner` and the proxy (`--reasoning-replay`), **default `"none"`**.
+
+- **`none` (default)** — captured reasoning never enters backend-facing history.
+- **`keep-last`** — only the most recent captured reasoning block is replayed.
+- **`full`** — legacy behavior; every captured reasoning block is replayed. Pre-knob forge ≡ `full`.
+
+The policy affects **backend-facing serialization only**. Reasoning is still captured, still surfaces in `on_message` and internal history, and still lands in eval transcripts — observability is unchanged.
+
+Proxy response shaping follows the policy: under `keep-last` current reasoning is exposed as `reasoning_content` (so clients that preserve reasoning fields can replay just the latest block); under `full` it rides assistant `content`; under `none` it is omitted. Anthropic-protocol responses emit reasoning text only under `full`; forge does not synthesize signed Anthropic thinking blocks.
+
+## Evidence
+
+The default was chosen from a dedicated re-sweep (the v0.7.5 grid): 14 models × {none, keep-last, full} × {bare, reforged} × {native, prompt}, 50 runs × 26 scenarios per cell, 170k runs total. Scoring treats the **scenario** as the sampling unit (runs cluster hard within scenarios), paired against the v0.7.0 legacy/`full` baseline.
+
+- **`full` reproduces the pre-knob baseline** on all reasoning models (n.s. everywhere) — the knob is a clean superset of legacy behavior; the message-processing refactor did not regress the legacy path.
+- **`none` is statistically indistinguishable from legacy overall** (+0.49pp, p=0.17), and in the reforged-only read (−0.35pp, p=0.45). Bounding replay is a free token saving on this suite.
+- **`none` edges out `keep-last` overall** (+0.86pp, p=0.007); the two are indistinguishable reforged-only.
+- **No robust per-config downside survives multiple-comparison correction.** The closest is the Ministral-14B-Reasoning-Q4 family (reforged-only raw drop ~1.5pp, p≈0.04–0.06, with `none` ≈ `keep-last`) — a family/quantization caveat, not a blocker.
+- **Wire-level validation:** `none` → exactly 0 reasoning on the wire across every row; `keep-last` ∈ {0, 1}; per-transcript ordering full ≥ keep-last ≥ none holds by construction.
+
+Full per-config tables: [results/raw/reasoning-replay.md](../results/raw/reasoning-replay.md).
+
+## Consequences
+
+- **Behavioral change for reasoning-capable backends.** Upgraders who want the old behavior pin `--reasoning-replay full` (proxy) or `WorkflowRunner(reasoning_replay="full")`. For non-reasoning/instruct models the knob is inert and nothing changes.
+- **Token savings by default.** Backend-facing history stops accumulating reasoning; `full` remains the cost wildcard (context grows with run length).
+- **Eval surface.** `reasoning_replay` is part of the eval resume key and a first-class report/dashboard dimension; rows predating the knob count as `full` (that is what they ran).
+- **Claude rows are unaffected.** The Anthropic client drops returned thinking blocks rather than capturing them into history, so the knob is request-inert there; carrying thinking across turns natively is deferred pending evidence it moves scores.
+
+## Alternatives considered
+
+- **Default `keep-last`** (the knob's initial default while evidence was pending). A reasonable middle ground — but it measured slightly *below* `none` overall, still pays a replay cost, and busts rolling prompt-cache prefixes (earlier messages re-serialize differently each turn). Rejected once the grid showed `none` is quality-free.
+- **Default `full` (legacy).** Preserves bug-for-bug continuity, but it is the most expensive policy, delivers no measured score benefit, and is the very accumulation pathology that motivated the knob.
+- **Drop replay entirely (no knob).** Simplest, but unfalsifiable — `full`/`keep-last` exist precisely so the policy stays a measured variable and per-model exceptions (e.g. the Ministral-Q4 caveat) remain one flag away.
From 8ba0f1eb03101e04e9b37a5eb8eb2f130a1f6c9e Mon Sep 17 00:00:00 2001
From: Antoine Zambelli
Date: Thu, 11 Jun 2026 20:40:28 -0500
Subject: [PATCH 14/14] release: v0.7.5 version bump + changelog

Co-Authored-By: Claude Fable 5
---
 CHANGELOG.md   | 13 +++++++++++++
 pyproject.toml |  2 +-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index e4e0a3f..f6961eb 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,19 @@
 
 All notable changes to forge are documented here.
 
+## [0.7.5] — 2026-06-11
+
+Reasoning replay is now a measured, bounded policy. Reasoning-capable backends return hidden reasoning alongside tool calls, and forge previously re-serialized all of it into backend-facing history on every later turn. The new `reasoning_replay` knob bounds that — and after a full re-sweep of the published eval grid showed that dropping replayed reasoning is quality-free and token-cheaper, the default is `none`. The release also re-baselines the Claude eval tier with extended thinking enabled and adds Anthropic prompt caching with cache-aware cost accounting.
+
+### Added
+- **`reasoning_replay {full, keep-last, none}`** on `WorkflowRunner(reasoning_replay=…)` and the proxy (`--reasoning-replay`). `full` replays every captured reasoning block (the historical behavior), `keep-last` only the most recent, `none` keeps reasoning out of backend-facing history entirely. Serialization-only: reasoning is still captured and still surfaces in `on_message` and internal history. In OpenAI-compatible proxy responses, `keep-last` exposes current reasoning as `reasoning_content` rather than assistant `content`, so clients that preserve reasoning fields can replay just the latest block. See [ADR-017](docs/decisions/017-reasoning-replay-policy.md).
+- **Reasoning-replay eval grid** (`eval_results_v0.7.5.jsonl`, a new eval generation): the full 8–14B lineup re-swept across all three policies × both ablations × native/prompt — ~170k runs. The policy is part of the eval resume key and a first-class report/dashboard dimension: row labels carry `:keep-last` / `:full` tags (untagged = `none`), the dashboard gains a Reasoning Replay filter, the report a `--reasoning-replay` filter, and a dedicated [reasoning-replay view](docs/results/raw/reasoning-replay.md) compares policies per config. A wire-level counter (`reasoning_wire`) validates each policy's on-wire behavior (`none` → exactly 0 replayed reasoning across every run).
+- **Anthropic extended thinking — `AnthropicClient(thinking=…)`** — request-side extended-thinking config (e.g. `{"type": "adaptive"}`). When set, a forced `tool_choice` is suppressed (the API requires `auto` with thinking on) and `max_tokens` is raised to fit the thinking budget. The Claude eval baseline now runs Sonnet and Opus with adaptive thinking — all prior Claude rows had thinking off, the wrong baseline for a reasoning-flavored suite; Haiku does not support adaptive thinking and stays non-thinking.
+- **Anthropic prompt caching — `AnthropicClient(prompt_caching=True)`** — marks a static ephemeral cache breakpoint over the tool definitions + system prompt (byte-identical every turn, so it read-hits from turn 2 onward instead of re-billing the re-sent schema). `TokenUsage` gains generic `cache_creation_input_tokens` / `cache_read_input_tokens` counters, and eval cost accounting prices cache writes (1.25×) and reads (0.1×) at their actual rates.
+
+### Changed
+- **Captured reasoning is no longer replayed to the backend by default.** Pre-0.7.5 behavior replayed every captured reasoning block (equivalent to `reasoning_replay="full"`); the default is now `"none"`. On the published eval suite, `none` is statistically indistinguishable from replay-all in aggregate while saving the replayed tokens every turn; no per-config regression survives multiple-comparison correction (closest: a mild raw drop on Ministral-3 14B Reasoning Q4, where `none` and `keep-last` are indistinguishable from each other). The knob is inert for models that emit no reasoning. Migration: `--reasoning-replay full` (proxy) or `WorkflowRunner(reasoning_replay="full")` restores the historical behavior. Anthropic-protocol proxy responses emit reasoning text only under `full` — forge does not synthesize signed Anthropic thinking blocks.
+
 ## [0.7.4] — 2026-06-03
 
 Malformed tool-call arguments now self-correct on the tool-error channel, and the eval suite gains its first model-size upgrade — a 32GB tier (Qwen3.5 / 3.6 27–35B, Nemotron-3 Nano, Mistral-Small-3.2) surfaced in the dashboard alongside the existing 8–14B lineup.
diff --git a/pyproject.toml b/pyproject.toml
index 05826c9..c42aa85 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "forge-guardrails"
-version = "0.7.4"
+version = "0.7.5"
 description = "A reliability layer for self-hosted LLM tool-calling. Guardrails, context management, and backend adapters for multi-step agentic workflows."
 requires-python = ">=3.12"
 license = "MIT"