unslothai · danielhanchen · May 1, 2026 · May 1, 2026
diff --git a/docs/diagnosis/arch.md b/docs/diagnosis/arch.md
@@ -0,0 +1,69 @@
+# Phase 9 — model architecture diff (llama.cpp `mistral3.cpp` vs HF `ministral3`)
+
+## TL;DR
+
+The two implementations are **structurally equivalent** for the inference path
+that Mistral-Medium-3.5 actually exercises. SwiGLU MLP, RMSNorm with FP32 cast,
+1/sqrt(head_dim) attention scale, pre-norm decoder layers, and YARN-scaled RoPE
+all match. The only optional code path is **Llama-4-style attention temperature
+scaling**, which is gated on `llama_4_scaling_beta != 0` in HF and on
+`hparams.f_attn_temp_scale != 0.0` in llama.cpp — both gates evaluate false for
+this model (`llama_4_scaling_beta = 0` per HF config, no
+`attn_temperature_scale` key in the GGUF), so the path is skipped on both
+sides. **Architecture is not the cause** of the long-context degradation.
+
+## Side-by-side mapping
+
+| concern | HF `Ministral3*` | llama.cpp `mistral3.cpp` |
+| --- | --- | --- |
+| pre-attention norm | `Ministral3RMSNorm(eps=rms_norm_eps)` cast→fp32 then back | `build_norm(LLM_NORM_RMS)` |
+| Q/K/V proj | `nn.Linear(bias=False)` | `build_qkv(...)` |
+| RoPE | `apply_rotary_pos_emb` with cached `cos,sin` (yarn) | `ggml_rope_ext` with `n_ctx_orig, freq_base, freq_scale, ext_factor, attn_factor, beta_fast, beta_slow` |
+| attn temperature scale | `Q *= 1 + beta * log(1 + floor(pos/orig_max))` (only if `beta!=0`) | `Q = ggml_mul(Q, inp_attn_scale)` (only if `f_attn_temp_scale!=0`) |
+| attention | `attention_interface(scale=1/sqrt(head_dim), sliding_window=None)` | `build_attn(..., kq_scale = 1/sqrt(n_embd_head))` |
+| sliding window | `getattr(config,'sliding_window',None)` → `None` for this model | not configured (correct) |
+| O proj | `o_proj = nn.Linear(bias=False)` | `model.layers[il].wo` |
+| residual after attn | `hidden = residual + hidden` | `ffn_inp = ggml_add(cur, inpSA)` |
+| post-attn norm | `Ministral3RMSNorm` | `build_norm(LLM_NORM_RMS)` (ffn_norm) |
+| MLP | SwiGLU: `down(silu(gate(x)) * up(x))` | `build_ffn(LLM_FFN_SILU, LLM_FFN_PAR)` (parallel = silu(gate)*up then down) |
+| residual after MLP | `hidden = residual + hidden` | `ggml_add(cur, ffn_inp)` |
+| MoE branch | n/a (Mistral-Medium-3.5 is dense) | guarded by `model.layers[il].ffn_gate_inp == nullptr` → dense path |
+
+## RoPE specifically
+
+Both honour every YARN parameter in the GGUF (`yarn_beta_fast=4`,
+`yarn_beta_slow=1`, `factor=64`, `original_context_length=4096`,
+`freq_base=1e6`, `yarn_log_multiplier=1.0`) — all of which match
+`rope_parameters` in `mistral_medium_check/config.json`. RoPE math is correct.
+
+## Llama-4 attn temperature scale
+
+```python
+def get_llama_4_attn_scale(positions_ids, beta, max_position_embeddings):
+    return (1 + beta * torch.log(1 + torch.floor(positions_ids / max_position_embeddings)))[:, None, :, None]
+```
+
+`beta = config.rope_parameters["llama_4_scaling_beta"] = 0`, so the multiplier
+collapses to **1.0** at every position. llama.cpp side-steps the entire
+multiplication when the GGUF doesn't carry the key. Both safe, both equivalent.
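+
+A scalar sketch of the same multiplier makes the no-op explicit. It uses only
+the standard library; the function name, the sample positions, the value `4096`
+(taken from `original_context_length`) and the non-zero `beta` are illustrative
+assumptions, not values read from either codebase.
+
+```python
+import math
+
+def llama4_attn_scale(position: int, beta: float, original_max_position: int) -> float:
+    """Scalar form of the multiplier above for a single position."""
+    return 1.0 + beta * math.log(1 + math.floor(position / original_max_position))
+
+positions = (0, 4095, 4096, 100_000)
+
+# beta = 0 (this model): the multiplier is exactly 1.0 at every position.
+scales_beta0 = [llama4_attn_scale(p, 0.0, 4096) for p in positions]
+
+# Hypothetical non-zero beta: Q is only rescaled once the position reaches
+# original_max_position.
+scales_beta01 = [llama4_attn_scale(p, 0.1, 4096) for p in positions]
+```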
+
+## What this rules out
+
+- ❌ Sliding-window attention mismatch
+- ❌ Wrong RoPE dimensions / wrong YARN parameters
+- ❌ Missing Llama-4 temperature scaling (it's a no-op for this model)
+- ❌ Activation function mismatch (both SwiGLU)
+- ❌ Pre-norm vs post-norm placement
+- ❌ Attention scale factor
+
+## What it does NOT rule out
+
+- Numerical precision in CUDA kernels (FP16 accumulators in `ggml-cuda` for
+  the matmul or attention path).
+- Q8_0 quantization rounding error in long-attention contexts.
+- KV-cache dtype (default F16 → numerical drift over many tokens).
+- Sampler behaviour (min_p default of 0.05 in llama-server vs vLLM not applying
+  it).
+
+Phase 6 (F16/BF16/Q8_0 KV-cache experiment), Phase 7 (rebuild with
+`GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F=1`) and Phase 8 (matched samplers) cover
+those.
diff --git a/docs/diagnosis/chat_template.md b/docs/diagnosis/chat_template.md
@@ -0,0 +1,44 @@
+# Phase 2 — chat template parity
+
+## TL;DR
+
+The four chat templates considered — (A) GGUF-embedded jinja in
+`bartowski/mistralai_Mistral-Medium-3.5-128B-GGUF`, (B) `unsloth/Mistral-Medium-3.5-128B`,
+(C) `mistralai/Mistral-Medium-3.5-128B` upstream, (D) the `tokenizer.chat_template`
+attribute exposed by HF AutoTokenizer for the upstream model — are
+**semantically equivalent for normal multi-turn chat**. The only diff that
+could change rendered output is in error-path code that nobody is hitting in
+this test. **Templates are not the cause** of the long-context degradation.
+
+## Files
+
+- `outputs/template_llamacpp.jinja` (13281 bytes) — A
+- `outputs/template_unsloth.jinja` (14475 bytes) — B
+- `outputs/template_mistralai.jinja` (13479 bytes) — C
+- `outputs/template_hftokenizer.jinja` (13479 bytes) — D
+- `outputs/template_diff_*` — pairwise diffs
+
+## Diff summary
+
+| pair | size | nature of diff |
+| --- | --- | --- |
+| C vs D (mistralai vs HF tokenizer) | 0 bytes | **identical** |
+| A vs C (GGUF vs upstream) | 686 bytes | one hunk: a disabled `{%- if false %}` assertion at line 201 of the GGUF copy where upstream has the real check `(content == '' or content is none) and (no tool calls)` → raise. Both branches are no-ops on valid messages. |
+| B vs C (unsloth vs upstream) | 2040 bytes | (1) unsloth uses `strftime_now` for date defaults; upstream hardcodes `today=29-04-2026`, `yesterday=28-04-2026`. Both get overridden by template arguments at render time, so no effect on output. (2) unsloth `arguments\|tojson\|safe` vs upstream `arguments\|tojson` for tool-call serialisation — only relevant when tools are used. |
+
+## Cross-check via tokenization parity (Phase 1)
+
+When all four templates render the same multi-turn fixture (system prompt + 3
+user/assistant pairs + final user turn, `reasoning_effort='none'`), they
+produce **byte-identical 434-token sequences**, including `<s>`,
+`[SYSTEM_PROMPT]`, `[/SYSTEM_PROMPT]`, `[MODEL_SETTINGS]`, `[/MODEL_SETTINGS]`,
+`[INST]`, `[/INST]`, and the per-assistant `</s>` markers in the right
+positions. See `outputs/diagnosis_tokenization.md`.
+
+## Conclusion
+
+If we replace llama-server's GGUF-embedded template with the upstream mistralai
+copy via `--chat-template-file outputs/template_mistralai.jinja`, the rendered
+text and resulting token stream are identical to what the GGUF template
+produces today. **No improvement is expected from a template swap, and indeed
+no improvement was observed empirically** — see Phase 8 matched-sampler runs.
diff --git a/docs/diagnosis/config.md b/docs/diagnosis/config.md
diff --git a/docs/diagnosis/first_divergence.md b/docs/diagnosis/first_divergence.md
@@ -0,0 +1,88 @@
+# First-divergence pin-down (vLLM vs llama-server, greedy)
+
+Reproducer: `repro_first_divergence.py` in this directory.
+
+Same input ("Create a Flappy Bird Python game" + Mistral-Medium-3.5 SYSTEM_PROMPT
+with `reasoning_effort=none`), greedy (temperature=0.0), max_tokens=100, on:
+
+- vLLM 0.20.1rc1.dev127 FP8 + `--attention-backend FLASH_ATTN` (port 8765)
+- llama-server (b1-aab68217b unslothai fork, also tested ggml-org master) Q8_0 (port 8766)
+
+The two outputs share an **identical 13-token prefix**:
+
+```
+# Flappy Bird Game in Python\n\nHere's a complete implementation of
+```
+
+Token 14 is where they first diverge. **The top-2 tokens at this position are
+the same in both servers**, but their *relative order* is flipped:
+
+| rank | vLLM | logprob | llama-server | logprob |
+| --- | --- | --- | --- | --- |
+| 1 | ` a` | **−0.314** | ` Fl` | **−0.289** |
+| 2 | ` Fl` | −1.314 | ` a` | −1.430 |
+| 3 | ` the` | −7.689 | ` the` | −4.434 |
+| 4 | `Fl` | −12.564 | `Fl` | −11.420 |
+
+Both servers have the same top-2 candidates, but relative to vLLM,
+llama-server lowers ` a` from −0.314 to −1.430 (Δ ≈ 1.1 nats, i.e. about a
+factor of 3 in probability) and raises ` Fl` from −1.314 to −0.289 (Δ ≈ 1.0
+nat in the other direction). In each server the two tokens sit roughly 1 nat
+apart (1.00 in vLLM, 1.14 in llama-server); greedy decoding *must* pick
+exactly one, and the order flip sends the two backends down different
+trajectories.
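+
+The arithmetic behind those figures can be reproduced from the table. This is
+a minimal sketch using only the four top-2 logprobs listed above; the variable
+and function names are illustrative.
+
+```python
+import math
+
+# Top-2 logprobs at token 14, copied from the table above.
+vllm = {" a": -0.314, " Fl": -1.314}
+llama = {" a": -1.430, " Fl": -0.289}
+
+def gap(logprobs):
+    """Log-odds of ' a' over ' Fl'; positive means ' a' is ranked first."""
+    return logprobs[" a"] - logprobs[" Fl"]
+
+gap_vllm = gap(vllm)          # ' a' ahead
+gap_llama = gap(llama)        # ' Fl' ahead
+swing = gap_vllm - gap_llama  # total log-odds shift between the servers
+
+# Probability ratio for ' a' between the two servers.
+ratio_a = math.exp(vllm[" a"] - llama[" a"])
+```
+
+With these inputs the gap is about +1.00 nat in vLLM and about −1.14 nats in
+llama-server, so the log-odds swing between the two servers is roughly 2.1
+nats.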
+
+After that single token flip:
+- vLLM continues with `... a Flappy Bird game using Pygame. This version includes the core mechanics ...`
+- llama-server continues with `... Flappy Bird using Pygame. This version includes all the classic elements ...`
+
+Both trajectories are coherent at this point. Both outputs produce
+clean code for ~600–1000 generated tokens, after which **only the
+llama-server trajectory** degenerates into broken syntax and repetition (e.g.
+`pygame.display.set mode((800, 600)` — `set_mode` becomes `set mode`).
+
+## Logits progression at fixed prefix
+
+Reproducer: `repro_logits_progression.py`.
+
+When we take vLLM's full greedy output and **feed it as a fixed prefix** at
+checkpoints 50/200/500/1000/1400, both servers' top-1 next-token agrees at
+*every* checkpoint:
+
+| n_decoded | vLLM top-1 | vLLM logprob | llama-server top-1 | llama-server logprob | match |
+| --- | --- | --- | --- | --- | --- |
+| 50 | ` ``` ` | −0.0233 | ` ``` ` | −0.0000 | ✓ |
+| 200 | `Y` | ~0 | `Y` | ~0 | ✓ |
+| 500 | `.rect` | −0.0001 | `.rect` | −0.2432 | ✓ |
+| 1000 | `_p` | ~0 | `_p` | −0.0601 | ✓ |
+| 1400 | `   ` | ~0 | `   ` | −0.0028 | ✓ |
+
+So the long-context degeneration is **not** caused by the model converging on
+different next-token answers given the same prefix. It is caused by a single
+~1-nat precision flip near the start, after which vLLM and llama-server walk
+different (still individually plausible) decoding paths — and the
+llama-server path happens to land in a degenerate attractor.
+
+## Cross-check on Q4_K_M
+
+Repeating the experiment with `bartowski/.../Q4_K_M` (~74 GiB GGUF) on
+llama-server gives the same kind of degeneration tail. The same flipped top-2
+ranking at token 14 occurs, and the trajectory then degenerates *more* than
+with Q8_0, for example `pygame.display.set mode sdl hWSIZER, sdl lg2` syntax
+garbage. So the behaviour is common to both llama.cpp quants tested, not a
+Q8_0-specific bug.
+
+## Implication
+
+The remaining hypothesis is that ggml-cuda's accumulator precision in the
+matmul or attention path for this specific model shape (88 layers,
+head_count=96, head_count_kv=8, head_dim=128, vocab=131072,
+intermediate=28672, rope_freq_base=1e6 with YARN factor=64) is producing
+logits that deviate from the reference. The ranking flip on token 14 is
+consistent with this: the top-2 tokens are about 1 nat apart in both servers,
+but the log-odds of ` a` over ` Fl` move from about 1.00 in vLLM to about
+−1.14 in llama-server (a swing of ~2.1 nats), which puts ` Fl` ahead of ` a`
+instead of behind.
+
+A targeted next step is to force the matmul accumulator in
+`ggml-cuda/mmq.cu` (or the relevant GEMM path) to FP32 specifically for
+`LLM_ARCH_MISTRAL3` and re-test. `GGML_CUDA_FORCE_CUBLAS=1` was already tried
+without effect, and `GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F=1` only helps the
+cuBLAS half-precision path; Q8_0 dequant + matmul takes a different path.
diff --git a/docs/diagnosis/logits.md b/docs/diagnosis/logits.md
@@ -0,0 +1,29 @@
+# Phase 3 — top-10 logprobs comparison (vLLM vs llama-server)
+
+prompt: `What is the capital of France? Answer in exactly one word.`  
+temperature: 0.0 (greedy), max_tokens: 1, top_logprobs: 10
+
+vLLM completion: `Paris`  
+llama-server completion: `Paris`
+
+## Top-10 next-token logprobs
+
+| rank | vLLM token | vLLM logprob | llama token | llama logprob |
+| --- | --- | --- | --- | --- |
+| 1 | `token_id:42572` | -0.0001 | `Paris` | -0.0000 |
+| 2 | `token_id:6993` | -9.8751 | ` Paris` | -11.4609 |
+| 3 | `token_id:1784` | -12.0001 | `Par` | -12.5013 |
+| 4 | `token_id:2029` | -12.9376 | `PAR` | -14.4139 |
+| 5 | `token_id:3814` | -13.2501 | `巴黎` | -14.6281 |
+| 6 | `token_id:102726` | -13.6251 | `Pars` | -15.6960 |
+| 7 | `token_id:38166` | -13.6251 | `par` | -16.1068 |
+| 8 | `token_id:72056` | -13.8751 | ` Париж` | -16.8023 |
+| 9 | `token_id:75613` | -15.1876 | `Berlin` | -17.0093 |
+| 10 | `token_id:126441` | -15.3751 | ` París` | -17.9099 |
+
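+Jaccard(top-10 token set): **0.00**  
+Top-1 match: **False**  
+
+Both figures compare vLLM's `token_id:N` labels against llama-server's decoded
+strings, so they reflect the differing token representations rather than a
+real disagreement.
+
+## TL;DR
+
+Greedy top-1 token: reported as **DIFFERS** by the string comparison above,
+although both completions are `Paris`.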
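+
+A comparison of this kind is only meaningful once both sides use the same
+token representation. The sketch below shows one way to normalise
+`token_id:N` labels before computing the Jaccard index. The helper names and
+the toy vocabulary are hypothetical, and the toy ids are deliberately not the
+ids from the table.
+
+```python
+def jaccard(a, b):
+    """Jaccard index of two token collections (1.0 if both are empty)."""
+    a, b = set(a), set(b)
+    return len(a & b) / len(a | b) if (a or b) else 1.0
+
+def normalise(tokens, id_to_text):
+    """Replace `token_id:N` labels with text from a caller-supplied vocabulary."""
+    out = []
+    for tok in tokens:
+        if tok.startswith("token_id:"):
+            tok = id_to_text.get(int(tok.split(":", 1)[1]), tok)
+        out.append(tok)
+    return out
+
+# Toy data for illustration only; not the real vocabulary.
+toy_vocab = {1: "Paris", 2: " Paris"}
+ids_side = ["token_id:1", "token_id:2"]
+text_side = ["Paris", " Paris"]
+
+raw_score = jaccard(ids_side, text_side)                          # mixed forms
+fixed_score = jaccard(normalise(ids_side, toy_vocab), text_side)  # same form
+```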
diff --git a/docs/diagnosis/mistral-medium-3.5-long-context.md b/docs/diagnosis/mistral-medium-3.5-long-context.md
@@ -0,0 +1,131 @@
+# Mistral-Medium-3.5-128B llama.cpp long-context degradation — diagnosis
+
+## TL;DR
+
+llama-server (Q8_0 GGUF from `bartowski/mistralai_Mistral-Medium-3.5-128B-GGUF`)
+collapses into deterministic repetition loops after ~1000–1500 generated tokens,
+producing broken Python syntax (e.g. `pygame.display.set mode((800, 600)` —
+missing the underscore in `set_mode` and the closing paren). vLLM (FP8 of the
+same model) does not exhibit this; it stops naturally at 1496 tokens. **HF
+transformers BF16 inference of the same model also degrades** in the same way,
+ruling out llama.cpp/Q8_0 as the *unique* culprit.
+
+This matches what unsloth has already published on
+[`unsloth/Mistral-Medium-3.5-128B-GGUF`](https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF):
+
+> Testing shows that this behavior occurs **regardless of who or how** the model
+> was converted GGUF. The model initially responds correctly, but over long
+> context, does not work properly. **Mistral has now labeled GGUF support as a
+> WIP**.
+
+This document records the experiments that were run against vLLM (port 8765,
+FP8) and llama-server (port 8766, Q8_0) and what they ruled out. It is
+intended as input to a deeper fix.
+
+## Setup
+
+- vLLM 0.20.1rc1.dev127, FP8, `--tensor-parallel-size 2 --port 8765 --tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral --max_num_batched_tokens 16384 --max_num_seqs 128 --gpu_memory_utilization 0.8 --attention-backend FLASH_ATTN`, GPUs 4,5.
+- llama-server `b1-aab6821` (unslothai fork) and ggml-org master, Q8_0,
+  `--tensor-split 1,1 --port 8766 --jinja --ctx-size 32768 --parallel 1`,
+  GPUs 2,3.
+- HF transformers 5.7.0, `Mistral3ForConditionalGeneration` BF16,
+  `attn_implementation="eager"`, GPUs 6,7.
+
+## Phases run
+
+| phase | topic | result | location |
+| --- | --- | --- | --- |
+| 1 | Tokenization parity (vLLM / llama-server / mistral-common / HF) | **identical** 434 tokens | `outputs/diagnosis_tokenization.md` |
+| 2 | Chat template (GGUF jinja vs unsloth vs mistralai vs HF tokenizer) | semantically equivalent (single trivial diff: a disabled `if false` assertion at line 201) | `outputs/diagnosis_chat_template.md` (and `outputs/template_*.jinja`) |
+| 3 | Top-k logits agreement | top-1 token agrees on simple prompts; logit *distributions* differ noticeably (vLLM is less peaked) | `outputs/diagnosis_logits.md` |
+| 4 | GGUF metadata vs HF config | match. RoPE: `factor=64`, `freq_base=1e6`, `original_context_length=4096`, `yarn_beta_fast=4`, `yarn_beta_slow=1`, `yarn_log_mul=1.0`. All YARN parameters present in GGUF and equal to `text_config.rope_parameters` | `outputs/diagnosis_config.md`, `outputs/gguf_metadata_full.txt` |
+| 5 | HF transformers BF16 ground truth | **also degrades** (recall score 0/4 on the interleaved 11-turn test). T2 Flappy Bird hits 1500-token cap with looping syntax errors in tail | `outputs/hf_groundtruth_*` |
+| 6 | KV cache dtype: F16 (default) / BF16 / F32 | **no effect on degradation** | `outputs/diag_single_flappy_*.json` |
+| 7 | Rebuild llama.cpp with `-DGGML_CUDA_FORCE_CUBLAS=ON -DGGML_CUDA_FORCE_MMQ=OFF` and runtime `GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F=1` | **no effect on degradation** | `outputs/llama_server_compute32f_*.log` |
+| 8 | Matched samplers (`temp=0.1, top_p=1, top_k=64, min_p=0.05, seed=42`) | **no effect**; tried `repetition_penalty ∈ {1.0,1.05,1.1,1.2}`, `frequency_penalty ∈ {0,0.1,0.3,0.5}`, `dry_multiplier ∈ {0,0.5,0.8}` — all still loop | `outputs/matched_*` |
+| 9 | Architecture diff — `llama.cpp/src/models/mistral3.cpp` vs HF `transformers/models/ministral3/modeling_ministral3.py` | **equivalent**. Same SwiGLU, RMSNorm-fp32, kq_scale=1/√head_dim, optional Llama-4 attn temperature scale (gated and disabled for this model since `llama_4_scaling_beta=0`). YARN `attn_factor` chain in `llama-context.cpp` matches HF's `_compute_yarn_parameters` — both produce `1.0` for `mscale=mscale_all_dim=1.0`. | `outputs/diagnosis_arch.md` |
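+
+The `attn_factor` claim in row 9 can be checked with a short sketch. It
+assumes the usual YARN mscale formula, `0.1 * mscale * ln(factor) + 1`, and a
+ratio of two such terms when both `mscale` and `mscale_all_dim` are set. The
+function names are illustrative and are not copied from either codebase.
+
+```python
+import math
+
+def get_mscale(scale: float, mscale: float = 1.0) -> float:
+    """YARN magnitude scale; 1.0 when no context extension is applied."""
+    if scale <= 1:
+        return 1.0
+    return 0.1 * mscale * math.log(scale) + 1.0
+
+def yarn_attention_factor(factor, mscale=None, mscale_all_dim=None):
+    """Assumed rule: ratio of the two mscale terms when both are given."""
+    if mscale and mscale_all_dim:
+        return get_mscale(factor, mscale) / get_mscale(factor, mscale_all_dim)
+    return get_mscale(factor)
+
+# factor=64 with mscale = mscale_all_dim = 1.0: the ratio cancels to 1.0.
+attn_factor = yarn_attention_factor(64.0, mscale=1.0, mscale_all_dim=1.0)
+
+# For contrast, the single-term form with no mscale values supplied.
+default_factor = yarn_attention_factor(64.0)
+```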
+
+## Empirical convergence point
+
+Single-turn `Create a Flappy Bird Python game` (greedy, temperature=0):
+
+```
+| backend                          | finish | n_out | tail snippet
+| -------------------------------- | ------ | ----- | -----------
+| vLLM FP8                         | stop   | 1496  | "...Would you like me to explain any specific part?"
+| llama-server Q8_0 cuBLAS         | length | 2048  | "if __name__ == \"__main__,\\n    sys.exit()\\n```\\n\\n" (loop)
+| llama-server Q8_0 BF16 KV        | length | 2048  | "if pipe.x < 0\\n    self.pipes.remove(pipe)" (loop)
+| llama-server Q8_0 F32 KV + no FA | length | 2048  | identical loop
+| HF transformers BF16             | length | 1500  | "pipe1 = pipe(50, 100, 50, 100\\n    pipe2 = pipe(50, 100, 50, 100" (loop)
+```
+
+For llama-server, output is clean up to ~1000 tokens then degrades:
+
+```
+mt=  600: clean
+mt= 1000: still clean
+mt= 1500: looping
+```
+
+Common prefix between vLLM and llama-server greedy outputs is exactly **66
+characters**: `# Flappy Bird Game in Python\n\nHere's a complete implementation of `
+
+Then:
+- vLLM picks `a Flappy Bird game using Pygame`
+- llama-server picks `Flappy Bird using Pygame` (no leading `a`).
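+
+The 66-character figure can be reproduced from the two openings quoted above.
+This is a minimal sketch: the strings are truncated to the quoted text and the
+variable names are illustrative.
+
+```python
+import os
+
+head = "# Flappy Bird Game in Python\n\nHere's a complete implementation of "
+vllm_out = head + "a Flappy Bird game using Pygame"
+llama_out = head + "Flappy Bird using Pygame"
+
+# os.path.commonprefix compares character by character, so it works on any strings.
+prefix = os.path.commonprefix([vllm_out, llama_out])
+prefix_len = len(prefix)
+```
+
+Each newline counts as one character, and the prefix ends with the space after
+`of`.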
+
+## Working hypotheses (still open)
+
+- **H1: Q8_0 quantization error**. Q8_0 stores 8-bit integer weights with one
+  FP16 scale per block of 32. Cumulative error across 88 layers × 1000 decode
+  steps may be enough to flip top-1 tokens that compound into loops.
+  *Counter-evidence:* HF BF16 with the FP8 safetensors **also** loops, so
+  quantization can't be the whole story.
+- **H2: Numerical kernel issue specific to Mistral-Medium-3.5's shape (88
+  layers, head_dim=128, head_count=96, head_count_kv=8, vocab=131072,
+  intermediate=28672, rope_freq_base=1e6 with YARN factor 64)** — common to
+  llama.cpp ggml-cuda *and* HF eager. vLLM avoids it because it uses CUTLASS
+  FP8 GEMMs with FP32 accumulators and possibly a different attention kernel
+  (`FLASH_ATTN` selected by vLLM in our setup).
+- **H3: Some prefill artifact** that vLLM mitigates via chunked prefill
+  (`--max_num_batched_tokens 16384`). Not yet tested in isolation.
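+
+To give H1 a sense of scale, the sketch below round-trips one block of 32
+values through a simplified Q8_0-style scheme (int8 values sharing one FP16
+scale). It is an approximation for intuition, not ggml's implementation, and
+the sample block is made up.
+
+```python
+import struct
+
+def to_fp16(x: float) -> float:
+    """Round a Python float to the nearest IEEE half-precision value."""
+    return struct.unpack("<e", struct.pack("<e", x))[0]
+
+def q8_0_roundtrip(block):
+    """Quantize 32 floats to int8 with a shared FP16 scale, then dequantize."""
+    assert len(block) == 32
+    amax = max(abs(v) for v in block)
+    scale = to_fp16(amax / 127.0) if amax > 0 else 0.0
+    if scale == 0.0:
+        return [0.0] * 32
+    quants = [max(-127, min(127, round(v / scale))) for v in block]
+    return [scale * q for q in quants]
+
+# Made-up block with values in [-1, 1).
+block = [((i * 37) % 64 - 32) / 32.0 for i in range(32)]
+restored = q8_0_roundtrip(block)
+max_abs_err = max(abs(a - b) for a, b in zip(block, restored))
+```
+
+The per-weight error is bounded by roughly half the block scale. Whether such
+errors accumulate enough to flip a token is exactly what H1 leaves open.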
+
+## What is NOT the cause
+
+- Tokenization (4 tokenizers byte-identical).
+- Chat template (4 templates render to identical token streams).
+- GGUF metadata (all numerical config matches HF, including all YARN params).
+- Sampler (no setting of temp/top_p/top_k/min_p/repetition_penalty/freq_penalty/dry that breaks the loop).
+- KV cache dtype.
+- Flash attention vs default attention in llama.cpp.
+- cuBLAS vs MMQ kernels.
+- Architecture code (mistral3.cpp implements the same residual/RMSNorm/SwiGLU/RoPE flow as HF).
+- Sliding window (None for this model — both honour that).
+
+## Artefacts
+
+All under `workspace_5/outputs/`:
+
+- `diagnosis_tokenization*.{md,txt}` + `tok_ids_*.json`
+- `diagnosis_chat_template.md` + `template_*.jinja` + `template_diff_*.txt`
+- `diagnosis_config.md` + `gguf_metadata_full.txt`
+- `diagnosis_arch.md`
+- `diagnosis_logits.md` + `diagnosis_logits_raw.json`
+- `hf_groundtruth_alphabet_*.txt` + `hf_groundtruth_interleaved_*.json` + `hf_groundtruth_tok_ids_*.json`
+- `recall_interleaved_{vllm,llamacpp}_*.txt`
+- `multi_turn_recall_{vllm,llamacpp}_*.txt`
+- `diag_single_flappy_*.json` + `.log`
+- `matched_*.{json,txt}`
+- All llama-server / vllm / HF logs in `workspace_5/logs/`.
+
+## Suggested next steps for a real fix
+
+1. Dump per-layer activations from both vLLM and HF for the same input, and
+   compare layer-by-layer to find where they first diverge meaningfully.
+2. If that divergence localises to RMSNorm or attention softmax, force FP32
+   accumulators in those ops in llama.cpp's CUDA kernels for `mistral3`.
+3. Compare `llama-perplexity` numbers per quant (Q8_0 vs Q6_K vs
+   F16-converted-from-FP8 GGUF) to determine whether the issue scales with
+   precision.
+4. Consider writing a reference forward pass in PyTorch using the GGUF weights
+   (via `gguf-py`) and the HF arch and comparing token-by-token.
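+
+Step 1 could be organised along the following lines. This is a sketch only:
+it assumes per-layer activations have already been dumped into plain lists of
+floats, and the function names, the tolerance and the toy data are all
+hypothetical.
+
+```python
+import math
+
+def rel_l2_error(ref, test):
+    """Relative L2 distance between two activation vectors."""
+    num = math.sqrt(sum((r - t) ** 2 for r, t in zip(ref, test)))
+    den = math.sqrt(sum(r * r for r in ref))
+    return num / den if den > 0 else num
+
+def first_divergent_layer(ref_layers, test_layers, tol=1e-2):
+    """Index of the first layer whose relative error exceeds tol, else None."""
+    for i, (ref, test) in enumerate(zip(ref_layers, test_layers)):
+        if rel_l2_error(ref, test) > tol:
+            return i
+    return None
+
+# Toy activations (made up): layers 0 and 1 agree, layer 2 drifts.
+ref_acts = [[1.0, 2.0, 3.0], [0.5, -0.5, 1.5], [2.0, 2.0, 2.0]]
+test_acts = [[1.0, 2.0, 3.0], [0.5, -0.5, 1.501], [2.0, 2.3, 1.7]]
+layer = first_divergent_layer(ref_acts, test_acts)
+```
+
+A relative metric is used so that layers with large activation norms do not
+dominate; the threshold would need tuning against BF16 or FP8 noise levels.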