69 changes: 69 additions & 0 deletions docs/diagnosis/arch.md
@@ -0,0 +1,69 @@
# Phase 9 — model architecture diff (llama.cpp `mistral3.cpp` vs HF `ministral3`)

## TL;DR

The two implementations are **structurally equivalent** for the inference path
that Mistral-Medium-3.5 actually exercises. SwiGLU MLP, RMSNorm with FP32 cast,
1/sqrt(head_dim) attention scale, pre-norm decoder layers, and YARN-scaled RoPE
all match. The only optional code path is **Llama-4-style attention temperature
scaling**, which is gated on `llama_4_scaling_beta != 0` in HF and on
`hparams.f_attn_temp_scale != 0.0` in llama.cpp — both gates evaluate false for
this model (`llama_4_scaling_beta = 0` per HF config, no
`attn_temperature_scale` key in the GGUF), so the path is skipped on both
sides. **Architecture is not the cause** of the long-context degradation.

## Side-by-side mapping

| concern | HF `Ministral3*` | llama.cpp `mistral3.cpp` |
| --- | --- | --- |
| pre-attention norm | `Ministral3RMSNorm(eps=rms_norm_eps)` cast→fp32 then back | `build_norm(LLM_NORM_RMS)` |
| Q/K/V proj | `nn.Linear(bias=False)` | `build_qkv(...)` |
| RoPE | `apply_rotary_pos_emb` with cached `cos,sin` (yarn) | `ggml_rope_ext` with `n_ctx_orig, freq_base, freq_scale, ext_factor, attn_factor, beta_fast, beta_slow` |
| attn temperature scale | `Q *= 1 + beta * log(1 + floor(pos/orig_max))` (only if `beta!=0`) | `Q = ggml_mul(Q, inp_attn_scale)` (only if `f_attn_temp_scale!=0`) |
| attention | `attention_interface(scale=1/sqrt(head_dim), sliding_window=None)` | `build_attn(..., kq_scale = 1/sqrt(n_embd_head))` |
| sliding window | `getattr(config,'sliding_window',None)` → `None` for this model | not configured (correct) |
| O proj | `o_proj = nn.Linear(bias=False)` | `model.layers[il].wo` |
| residual after attn | `hidden = residual + hidden` | `ffn_inp = ggml_add(cur, inpSA)` |
| post-attn norm | `Ministral3RMSNorm` | `build_norm(LLM_NORM_RMS)` (ffn_norm) |
| MLP | SwiGLU: `down(silu(gate(x)) * up(x))` | `build_ffn(LLM_FFN_SILU, LLM_FFN_PAR)` (parallel = silu(gate)*up then down) |
| residual after MLP | `hidden = residual + hidden` | `ggml_add(cur, ffn_inp)` |
| MoE branch | n/a (Mistral-Medium-3.5 is dense) | guarded by `model.layers[il].ffn_gate_inp == nullptr` → dense path |

## RoPE specifically

Both honour every YARN parameter in the GGUF (`yarn_beta_fast=4`,
`yarn_beta_slow=1`, `factor=64`, `original_context_length=4096`,
`freq_base=1e6`, `yarn_log_multiplier=1.0`) — all of which match
`rope_parameters` in `mistral_medium_check/config.json`. RoPE math is correct.
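One piece of the YARN setup that is easy to re-check by hand is the attention factor. The sketch below uses the YaRN-style magnitude term `0.1 * mscale * ln(factor) + 1` and takes the ratio of two such terms. The function names and the ratio form are assumptions for illustration, modelled loosely on how HF's `_compute_yarn_parameters` is usually described; this is not a copy of either code base. It assumes `mscale = mscale_all_dim = 1.0`.

```python
import math


def yarn_mscale(factor: float, mscale: float = 1.0) -> float:
    """YaRN-style magnitude correction; 1.0 when no context extension is applied."""
    if factor <= 1.0:
        return 1.0
    return 0.1 * mscale * math.log(factor) + 1.0


def yarn_attention_factor(factor: float, mscale: float, mscale_all_dim: float) -> float:
    """Ratio form (assumed) used when both multipliers are provided."""
    return yarn_mscale(factor, mscale) / yarn_mscale(factor, mscale_all_dim)


# factor=64 comes from the GGUF metadata; the two multipliers are assumed equal.
attn_factor = yarn_attention_factor(factor=64.0, mscale=1.0, mscale_all_dim=1.0)
print(attn_factor)
```

With equal multipliers the numerator and denominator are the same number, so the result does not depend on `factor` at all.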

## Llama-4 attn temperature scale

```python
def get_llama_4_attn_scale(positions_ids, beta, max_position_embeddings):
    return (1 + beta * torch.log(1 + torch.floor(positions_ids / max_position_embeddings)))[:, None, :, None]
```

`beta = config.rope_parameters["llama_4_scaling_beta"] = 0`, so the multiplier
collapses to **1.0** at every position. llama.cpp side-steps the entire
multiplication when the GGUF doesn't carry the key. Both safe, both equivalent.
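A scalar re-statement of the formula makes the no-op easy to verify without loading the model. This is a minimal sketch in plain Python, not the HF function itself; `llama4_attn_scale` and the choice of `orig_max_pos=4096` (taken from `original_context_length`) are assumptions for illustration.

```python
import math


def llama4_attn_scale(position: int, beta: float, orig_max_pos: int) -> float:
    """Scalar version of the Llama-4 temperature multiplier for one position."""
    return 1.0 + beta * math.log(1.0 + math.floor(position / orig_max_pos))


positions = [0, 4095, 4096, 32768, 131071]

# beta = 0 is the value reported for this model.
scales_beta0 = [llama4_attn_scale(p, beta=0.0, orig_max_pos=4096) for p in positions]

# A non-zero beta, shown only for contrast.
scales_beta01 = [llama4_attn_scale(p, beta=0.1, orig_max_pos=4096) for p in positions]

print(scales_beta0)
print(scales_beta01)
```

With `beta = 0` every entry is 1.0; with a non-zero `beta` the multiplier only starts growing once the position passes the original context length.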

## What this rules out

- ❌ Sliding-window attention mismatch
- ❌ Wrong RoPE dimensions / wrong YARN parameters
- ❌ Missing Llama-4 temperature scaling (it's a no-op for this model)
- ❌ Activation function mismatch (both SwiGLU)
- ❌ Pre-norm vs post-norm placement
- ❌ Attention scale factor

## What it does NOT rule out

- Numerical precision in CUDA kernels (FP16 accumulators in `ggml-cuda` for
the matmul or attention path).
- Q8_0 quantization rounding error in long-attention contexts.
- KV-cache dtype (default F16 → numerical drift over many tokens).
- Sampler behaviour (min_p default of 0.05 in llama-server vs vLLM not applying
it).

Phase 6 (F16/BF16/Q8_0 KV-cache experiment), Phase 7 (rebuild with
`GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F=1`), and Phase 8 (matched samplers) cover
those.
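To give a feel for the size of Q8_0 rounding error mentioned above, here is a simplified round trip for one block: 32 values are scaled to signed 8-bit integers with a single scale that is stored in half precision. It is a sketch of the general scheme only; rounding mode and memory layout in ggml may differ, and `q8_0_roundtrip` and `to_fp16` are hypothetical helper names.

```python
import random
import struct

BLOCK = 32  # Q8_0 block size


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def q8_0_roundtrip(block):
    """Quantize one block to int8 plus one FP16 scale, then dequantize."""
    amax = max(abs(v) for v in block)
    d = amax / 127.0
    if d == 0.0:
        return list(block), 0.0
    q = [max(-127, min(127, round(v / d))) for v in block]
    d16 = to_fp16(d)  # the per-block scale is stored in half precision
    return [qi * d16 for qi in q], d


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(BLOCK)]  # made-up weights
restored, step = q8_0_roundtrip(weights)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"step={step:.6f} max_abs_err={max_err:.6f}")
```

The per-weight error stays below one quantization step, which is why any effect on long generations would have to come from accumulation across layers and decode steps rather than from a single weight.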
44 changes: 44 additions & 0 deletions docs/diagnosis/chat_template.md
@@ -0,0 +1,44 @@
# Phase 2 — chat template parity

## TL;DR

The four chat templates considered — (A) GGUF-embedded jinja in
`bartowski/mistralai_Mistral-Medium-3.5-128B-GGUF`, (B) `unsloth/Mistral-Medium-3.5-128B`,
(C) `mistralai/Mistral-Medium-3.5-128B` upstream, (D) the `tokenizer.chat_template`
attribute exposed by HF AutoTokenizer for the upstream model — are
**semantically equivalent for normal multi-turn chat**. The only diff that
could change rendered output is in error-path code that nobody is hitting in
this test. **Templates are not the cause** of the long-context degradation.

## Files

- `outputs/template_llamacpp.jinja` (13281 bytes) — A
- `outputs/template_unsloth.jinja` (14475 bytes) — B
- `outputs/template_mistralai.jinja` (13479 bytes) — C
- `outputs/template_hftokenizer.jinja` (13479 bytes) — D
- `outputs/template_diff_*` — pairwise diffs
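
Pairwise diffs of this kind can be produced with the standard library. The sketch below is one possible way to do it, not necessarily how the `template_diff_*` files were generated; the two template strings are tiny stand-ins, not the real files.

```python
import difflib


def template_diff(a_text: str, b_text: str, a_name: str, b_name: str) -> str:
    """Return a unified diff between two template strings."""
    lines = difflib.unified_diff(
        a_text.splitlines(keepends=True),
        b_text.splitlines(keepends=True),
        fromfile=a_name,
        tofile=b_name,
    )
    return "".join(lines)


# Stand-in snippets that mimic a disabled vs. an active assertion.
gguf_like = "{%- if false %}\n{{ raise_exception('empty') }}\n{%- endif %}\n"
upstream_like = "{%- if content == '' %}\n{{ raise_exception('empty') }}\n{%- endif %}\n"

diff_text = template_diff(gguf_like, upstream_like, "A", "C")
same_text = template_diff(upstream_like, upstream_like, "C", "D")
print(len(diff_text.encode("utf-8")), len(same_text.encode("utf-8")))
```

An empty diff string corresponds to a 0-byte diff file, which is the "identical" case.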

## Diff summary

| pair | size | nature of diff |
| --- | --- | --- |
| C vs D (mistralai vs HF tokenizer) | 0 bytes | **identical** |
| A vs C (GGUF vs upstream) | 686 bytes | one hunk: a disabled `{%- if false %}` assertion at line 201 of the GGUF copy where upstream has the real check `(content == '' or content is none) and (no tool calls)` → raise. Both branches are no-ops on valid messages. |
| B vs C (unsloth vs upstream) | 2040 bytes | (1) unsloth uses `strftime_now` for date defaults; upstream hardcodes `today=29-04-2026`, `yesterday=28-04-2026`. Both get overridden by template arguments at render time, so no effect on output. (2) unsloth `arguments\|tojson\|safe` vs upstream `arguments\|tojson` for tool-call serialisation — only relevant when tools are used. |

## Cross-check via tokenization parity (Phase 1)

When all four templates render the same multi-turn fixture (system prompt + 3
user/assistant pairs + final user turn, `reasoning_effort='none'`), they
produce **byte-identical 434-token sequences**, including `<s>`,
`[SYSTEM_PROMPT]`, `[/SYSTEM_PROMPT]`, `[MODEL_SETTINGS]`, `[/MODEL_SETTINGS]`,
`[INST]`, `[/INST]`, and the per-assistant `</s>` markers in the right
positions. See `outputs/diagnosis_tokenization.md`.

## Conclusion

If we replace llama-server's GGUF-embedded template with the upstream mistralai
copy via `--chat-template-file outputs/template_mistralai.jinja`, the rendered
text and resulting token stream are identical to what the GGUF template
produces today. **No improvement is expected from a template swap, and indeed
no improvement was observed empirically** — see Phase 8 matched-sampler runs.
Binary file added docs/diagnosis/config.md
Binary file not shown.
88 changes: 88 additions & 0 deletions docs/diagnosis/first_divergence.md
@@ -0,0 +1,88 @@
# First-divergence pin-down (vLLM vs llama-server, greedy)

Reproducer: `repro_first_divergence.py` in this directory.

Same input ("Create a Flappy Bird Python game" + Mistral-Medium-3.5 SYSTEM_PROMPT
with `reasoning_effort=none`), greedy (temperature=0.0), max_tokens=100, on:

- vLLM 0.20.1rc1.dev127 FP8 + `--attention-backend FLASH_ATTN` (port 8765)
- llama-server (b1-aab68217b unslothai fork, also tested ggml-org master) Q8_0 (port 8766)

The two outputs share an **identical 13-token prefix**:

```
# Flappy Bird Game in Python\n\nHere's a complete implementation of
```

Token 14 is where they first diverge. **The top-2 tokens at this position are
the same in both servers**, but their *relative order* is flipped:

| rank | vLLM | logprob | llama-server | logprob |
| --- | --- | --- | --- | --- |
| 1 | ` a` | **−0.314** | ` Fl` | **−0.289** |
| 2 | ` Fl` | −1.314 | ` a` | −1.430 |
| 3 | ` the` | −7.689 | ` the` | −4.434 |
| 4 | `Fl` | −12.564 | `Fl` | −11.420 |

Both have the same top-2 candidates, but the two servers disagree by about
1 nat on each of them. llama-server scores ` a` at −1.43 where vLLM has −0.31
(a Δ of ~1.1 in logprob, i.e. a factor of ~3 in probability), and it scores
` Fl` at −0.29 where vLLM has −1.31 (a Δ of ~1.0 in the other direction).
Within each server the two tokens are only about 1 nat apart;
greedy decoding *must* pick exactly one, and the order flip
sends the two backends down different trajectories.
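
The arithmetic behind those figures can be reproduced directly from the table. This is a small sketch using only the four logprobs listed there; the helper name `gap` is made up for illustration.

```python
import math

# Logprobs copied from the table above.
vllm = {" a": -0.314, " Fl": -1.314}
llama = {" a": -1.430, " Fl": -0.289}


def gap(logprobs, first=" a", second=" Fl"):
    """Log-odds of `first` over `second`, in nats."""
    return logprobs[first] - logprobs[second]


shift_a = vllm[" a"] - llama[" a"]   # how far ` a` drops in llama-server, in nats
ratio_a = math.exp(shift_a)          # the same shift as a probability ratio
gap_vllm, gap_llama = gap(vllm), gap(llama)

print(f"shift={shift_a:.3f} nat, ratio={ratio_a:.2f}x")
print(f"gap vLLM={gap_vllm:+.3f}, gap llama-server={gap_llama:+.3f}")
```

The sign change between the two gaps is the rank flip: ` a` leads in vLLM and trails in llama-server.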

After that single token flip:
- vLLM continues with `... a Flappy Bird game using Pygame. This version includes the core mechanics ...`
- llama-server continues with `... Flappy Bird using Pygame. This version includes all the classic elements ...`

Both trajectories are coherent at this point. The two outputs both produce
clean code for ~600–1000 generated tokens, after which **only the
llama-server trajectory** degenerates into broken syntax and repetition (e.g.
`pygame.display.set mode((800, 600)` — `set_mode` becomes `set mode`).

## Logits progression at fixed prefix

Reproducer: `repro_logits_progression.py`.

When we take vLLM's full greedy output and **feed it as a fixed prefix** at
checkpoints 50/200/500/1000/1400, both servers' top-1 next-token agrees at
*every* checkpoint:

| n_decoded | vLLM top-1 | vLLM logprob | llama-server top-1 | llama-server logprob | match |
| --- | --- | --- | --- | --- | --- |
| 50 | ` ``` ` | −0.0233 | ` ``` ` | −0.0000 | ✓ |
| 200 | `Y` | ~0 | `Y` | ~0 | ✓ |
| 500 | `.rect` | −0.0001 | `.rect` | −0.2432 | ✓ |
| 1000 | `_p` | ~0 | `_p` | −0.0601 | ✓ |
| 1400 | ` ` | ~0 | ` ` | −0.0028 | ✓ |

So the long-context degeneration is **not** caused by the model converging on
different next-token answers given the same prefix. It is caused by a single
~1-nat precision flip near the start, after which vLLM and llama-server walk
different (still individually plausible) decoding paths — and the
llama-server path happens to land in a degenerate attractor.

## Cross-check on Q4_K_M

Repeating the experiment with `bartowski/.../Q4_K_M` (~74 GiB GGUF) on
llama-server gives the same kind of degeneration tail. The same top-2 ranking
flip at token 14 occurs, then the trajectory degenerates *more* than Q8_0, for
example `pygame.display.set mode sdl hWSIZER, sdl lg2` syntax garbage. So
this appears in both llama.cpp quants tested and is not a Q8_0-specific bug.

## Implication

The remaining hypothesis is that ggml-cuda's accumulator precision in the
matmul or attention path for this specific model shape (88 layers,
head_count=96, head_count_kv=8, head_dim=128, vocab=131072,
intermediate=28672, rope_freq_base=1e6 with YARN factor=64) is producing
logits that are subtly *flatter* than the reference. The ranking flip on
token 14 is consistent with a numerical shift of this kind: the top-2 tokens
are about 1 nat apart in both servers, but llama-server moves the ` a`/` Fl`
log-odds from +1.00 to −1.14, which puts ` Fl` ahead of ` a` instead of behind.

A targeted next step is to force the matmul accumulator in
`ggml-cuda/mmq.cu` (or the relevant GEMM path) to FP32 specifically for
`LLM_ARCH_MISTRAL3` and re-test. `GGML_CUDA_FORCE_CUBLAS=1` was already tried
without effect, and `GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F=1` only helps the
cuBLAS half-precision path; Q8_0 dequant + matmul takes a different path.
29 changes: 29 additions & 0 deletions docs/diagnosis/logits.md
@@ -0,0 +1,29 @@
# Phase 3 — top-10 logprobs comparison (vLLM vs llama-server)

prompt: `What is the capital of France? Answer in exactly one word.`
temperature: 0.0 (greedy), max_tokens: 1, top_logprobs: 10

vLLM completion: `Paris`
llama-server completion: `Paris`

## Top-10 next-token logprobs

| rank | vLLM token | vLLM logprob | llama token | llama logprob |
| --- | --- | --- | --- | --- |
| 1 | `token_id:42572` | -0.0001 | `Paris` | -0.0000 |
| 2 | `token_id:6993` | -9.8751 | ` Paris` | -11.4609 |
| 3 | `token_id:1784` | -12.0001 | `Par` | -12.5013 |
| 4 | `token_id:2029` | -12.9376 | `PAR` | -14.4139 |
| 5 | `token_id:3814` | -13.2501 | `巴黎` | -14.6281 |
| 6 | `token_id:102726` | -13.6251 | `Pars` | -15.6960 |
| 7 | `token_id:38166` | -13.6251 | `par` | -16.1068 |
| 8 | `token_id:72056` | -13.8751 | ` Париж` | -16.8023 |
| 9 | `token_id:75613` | -15.1876 | `Berlin` | -17.0093 |
| 10 | `token_id:126441` | -15.3751 | ` París` | -17.9099 |

Jaccard(top-10 token set): **0.00**
Top-1 match: **False**
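
One thing to keep in mind when reading these two figures: the vLLM column holds raw `token_id:N` strings while the llama-server column holds decoded text, so a set comparison of the two columns cannot find any overlap. The sketch below shows the effect on the first three rows. The ID-to-text mapping is hypothetical and exists only to illustrate what normalising would look like; the real mapping has to come from the model's tokenizer.

```python
def jaccard(a, b) -> float:
    """Jaccard similarity of two collections, treated as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# First three entries per server, as they appear in the table above.
vllm_top = ["token_id:42572", "token_id:6993", "token_id:1784"]
llama_top = ["Paris", " Paris", "Par"]

mismatched = jaccard(vllm_top, llama_top)

# HYPOTHETICAL mapping, for illustration only.
assumed_vocab = {42572: "Paris", 6993: " Paris", 1784: "Par"}
decoded = [assumed_vocab[int(t.split(":", 1)[1])] for t in vllm_top]
normalised = jaccard(decoded, llama_top)

print(mismatched, normalised)
```

Decoding one side (or encoding the other) before comparing is what makes the metric meaningful.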

## TL;DR

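Greedy top-1 token: reported as **DIFFERS**, but this is an artifact of the
comparison: vLLM's entries are raw `token_id:N` strings and llama-server's are
decoded text, so they never compare equal. Both servers complete with `Paris`.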
131 changes: 131 additions & 0 deletions docs/diagnosis/mistral-medium-3.5-long-context.md
@@ -0,0 +1,131 @@
# Mistral-Medium-3.5-128B llama.cpp long-context degradation — diagnosis

## TL;DR

llama-server (Q8_0 GGUF from `bartowski/mistralai_Mistral-Medium-3.5-128B-GGUF`)
collapses into deterministic repetition loops after ~1000–1500 generated tokens,
producing broken Python syntax (e.g. `pygame.display.set mode((800, 600)` —
missing the underscore in `set_mode` and the closing paren). vLLM (FP8 of the
same model) does not exhibit this; it stops naturally at 1496 tokens. **HF
transformers BF16 inference of the same model also degrades** in the same way,
ruling out llama.cpp/Q8_0 as the *unique* culprit.

This matches what unsloth has already published on
[`unsloth/Mistral-Medium-3.5-128B-GGUF`](https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF):

> Testing shows that this behavior occurs **regardless of who or how** the model
> was converted GGUF. The model initially responds correctly, but over long
> context, does not work properly. **Mistral has now labeled GGUF support as a
> WIP**.

This document records the experiments that were run against vLLM (port 8765,
FP8) and llama-server (port 8766, Q8_0) and what they ruled out. It is
intended as input to a deeper fix.

## Setup

- vLLM 0.20.1rc1.dev127, FP8, `--tensor-parallel-size 2 --port 8765 --tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral --max_num_batched_tokens 16384 --max_num_seqs 128 --gpu_memory_utilization 0.8 --attention-backend FLASH_ATTN`, GPUs 4,5.
- llama-server `b1-aab6821` (unslothai fork) and ggml-org master, Q8_0,
`--tensor-split 1,1 --port 8766 --jinja --ctx-size 32768 --parallel 1`,
GPUs 2,3.
- HF transformers 5.7.0, `Mistral3ForConditionalGeneration` BF16,
`attn_implementation="eager"`, GPUs 6,7.

## Phases run

| phase | topic | result | location |
| --- | --- | --- | --- |
| 1 | Tokenization parity (vLLM / llama-server / mistral-common / HF) | **identical** 434 tokens | `outputs/diagnosis_tokenization.md` |
| 2 | Chat template (GGUF jinja vs unsloth vs mistralai vs HF tokenizer) | semantically equivalent (single trivial diff: a disabled `if false` assertion at line 201) | `outputs/diagnosis_chat_template.md` (and `outputs/template_*.jinja`) |
| 3 | Top-k logits agreement | top-1 token agrees on simple prompts; logit *distributions* differ noticeably (vLLM is less peaked) | `outputs/diagnosis_logits.md` |
| 4 | GGUF metadata vs HF config | match. RoPE: `factor=64`, `freq_base=1e6`, `original_context_length=4096`, `yarn_beta_fast=4`, `yarn_beta_slow=1`, `yarn_log_mul=1.0`. All YARN parameters present in GGUF and equal to `text_config.rope_parameters` | `outputs/diagnosis_config.md`, `outputs/gguf_metadata_full.txt` |
| 5 | HF transformers BF16 ground truth | **also degrades** (recall score 0/4 on the interleaved 11-turn test). T2 Flappy Bird hits 1500-token cap with looping syntax errors in tail | `outputs/hf_groundtruth_*` |
| 6 | KV cache dtype: F16 (default) / BF16 / F32 | **no effect on degradation** | `outputs/diag_single_flappy_*.json` |
| 7 | Rebuild llama.cpp with `-DGGML_CUDA_FORCE_CUBLAS=ON -DGGML_CUDA_FORCE_MMQ=OFF` and runtime `GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F=1` | **no effect on degradation** | `outputs/llama_server_compute32f_*.log` |
| 8 | Matched samplers (`temp=0.1, top_p=1, top_k=64, min_p=0.05, seed=42`) | **no effect**; tried `repetition_penalty ∈ {1.0,1.05,1.1,1.2}`, `frequency_penalty ∈ {0,0.1,0.3,0.5}`, `dry_multiplier ∈ {0,0.5,0.8}` — all still loop | `outputs/matched_*` |
| 9 | Architecture diff — `llama.cpp/src/models/mistral3.cpp` vs HF `transformers/models/ministral3/modeling_ministral3.py` | **equivalent**. Same SwiGLU, RMSNorm-fp32, kq_scale=1/√head_dim, optional Llama-4 attn temperature scale (gated and disabled for this model since `llama_4_scaling_beta=0`). YARN `attn_factor` chain in `llama-context.cpp` matches HF's `_compute_yarn_parameters` — both produce `1.0` for `mscale=mscale_all_dim=1.0`. | `outputs/diagnosis_arch.md` |

## Empirical convergence point

Single-turn `Create a Flappy Bird Python game` (greedy, temperature=0):

```
| backend | finish | n_out | tail snippet
| -------------------------------- | ------ | ----- | -----------
| vLLM FP8 | stop | 1496 | "...Would you like me to explain any specific part?"
| llama-server Q8_0 cuBLAS | length | 2048 | "if __name__ == \"__main__,\\n sys.exit()\\n```\\n\\n" (loop)
| llama-server Q8_0 BF16 KV | length | 2048 | "if pipe.x < 0\\n self.pipes.remove(pipe)" (loop)
| llama-server Q8_0 F32 KV + no FA | length | 2048 | identical loop
| HF transformers BF16 | length | 1500 | "pipe1 = pipe(50, 100, 50, 100\\n pipe2 = pipe(50, 100, 50, 100" (loop)
```

For llama-server, output is clean up to ~1000 tokens then degrades:

```
mt= 600: clean
mt= 1000: still clean
mt= 1500: looping
```

Common prefix between vLLM and llama-server greedy outputs is exactly **66
characters**: `# Flappy Bird Game in Python\n\nHere's a complete implementation of `

Then:
- vLLM picks `a Flappy Bird game using Pygame`
- llama-server picks `Flappy Bird using Pygame` (no leading `a`).
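
The 66-character figure can be re-derived from the two openings quoted above. This is a minimal sketch that rebuilds just the start of each output from those snippets rather than loading the saved generations.

```python
import os

SHARED = "# Flappy Bird Game in Python\n\nHere's a complete implementation of "

# Openings reconstructed from the snippets quoted in this section.
vllm_out = SHARED + "a Flappy Bird game using Pygame"
llama_out = SHARED + "Flappy Bird using Pygame"

prefix = os.path.commonprefix([vllm_out, llama_out])
next_vllm = vllm_out[len(prefix)]
next_llama = llama_out[len(prefix)]
print(len(prefix), repr(next_vllm), repr(next_llama))
```

`os.path.commonprefix` works character by character on plain strings, so it is a convenient way to locate the first differing character in two generations.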

## Working hypotheses (still open)

- **H1: Q8_0 quantization error**. Q8_0 stores 8-bit integer weights with one
FP16 scale per block of 32. Cumulative error across 88 layers × 1000 decode steps may be
enough to flip top-1 tokens that compound into loops. *Counter-evidence:* HF
BF16 with the FP8 safetensors **also** loops, so quantization can't be the
whole story.
- **H2: Numerical kernel issue specific to Mistral-Medium-3.5's shape (88
layers, head_dim=128, head_count=96, head_count_kv=8, vocab=131072,
intermediate=28672, rope_freq_base=1e6 with YARN factor 64)** — common to
llama.cpp ggml-cuda *and* HF eager. vLLM avoids it because it uses CUTLASS
FP8 GEMMs with FP32 accumulators and possibly a different attention kernel
(`FLASH_ATTN` selected by vLLM in our setup).
- **H3: Some prefill artifact** that vLLM mitigates via chunked prefill
(`--max_num_batched_tokens 16384`). Not yet tested in isolation.
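
As background for the accumulator-precision argument in H2, the toy below sums the same values twice: once with the running total rounded to half precision after every addition, and once in Python's native float (FP64, standing in for a wider accumulator). It is an illustration of the general effect only and does not model any ggml-cuda, HF, or CUTLASS kernel; the helper names are made up.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def accumulate(values, half_precision: bool) -> float:
    """Sum values, optionally rounding the running total to FP16 each step."""
    total = 0.0
    for v in values:
        total = total + v
        if half_precision:
            total = to_fp16(total)
    return total


values = [to_fp16(0.01)] * 4096  # inputs are already FP16-representable
wide = accumulate(values, half_precision=False)
narrow = accumulate(values, half_precision=True)
print(f"wide={wide:.4f} narrow={narrow:.4f}")
```

Once the running total is large relative to each addend, the narrow accumulator stops absorbing new terms, which is the kind of systematic error a wider accumulator avoids.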

## What is NOT the cause

- Tokenization (all 4 tokenizers produce identical token IDs).
- Chat template (4 templates render to identical token streams).
- GGUF metadata (all numerical config matches HF, including all YARN params).
- Sampler (no setting of temp/top_p/top_k/min_p/repetition_penalty/freq_penalty/dry that breaks the loop).
- KV cache dtype.
- Flash attention vs default attention in llama.cpp.
- cuBLAS vs MMQ kernels.
- Architecture code (mistral3.cpp implements the same residual/RMSNorm/SwiGLU/RoPE flow as HF).
- Sliding window (None for this model — both honour that).

## Artefacts

All under `workspace_5/outputs/`:

- `diagnosis_tokenization*.{md,txt}` + `tok_ids_*.json`
- `diagnosis_chat_template.md` + `template_*.jinja` + `template_diff_*.txt`
- `diagnosis_config.md` + `gguf_metadata_full.txt`
- `diagnosis_arch.md`
- `diagnosis_logits.md` + `diagnosis_logits_raw.json`
- `hf_groundtruth_alphabet_*.txt` + `hf_groundtruth_interleaved_*.json` + `hf_groundtruth_tok_ids_*.json`
- `recall_interleaved_{vllm,llamacpp}_*.txt`
- `multi_turn_recall_{vllm,llamacpp}_*.txt`
- `diag_single_flappy_*.json` + `.log`
- `matched_*.{json,txt}`
- All llama-server / vllm / HF logs in `workspace_5/logs/`.

## Suggested next steps for a real fix

1. Dump per-layer activations from both vLLM and HF for the same input, and
compare layer-by-layer to find where they first diverge meaningfully.
2. If that divergence localises to RMSNorm or attention softmax, force FP32
accumulators in those ops in llama.cpp's CUDA kernels for `mistral3`.
3. Compare against `llama-perplexity` numbers per quant (Q8_0 vs Q6_K
vs F16-converted-from-FP8 GGUF) to determine whether the issue scales with
precision.
4. Consider writing a reference forward pass in PyTorch using the GGUF weights
(via `gguf-py`) and the HF arch and comparing token-by-token.
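
The layer-by-layer comparison in step 1 could be organised along these lines. This is a sketch under stated assumptions: activations are taken to be already dumped as one flat list of floats per layer, the tolerance is arbitrary, and the function names are hypothetical. How the dumps are obtained from each backend is out of scope here.

```python
def max_rel_error(a, b, eps=1e-6):
    """Largest element-wise relative error between two equal-length vectors."""
    return max(abs(x - y) / (abs(y) + eps) for x, y in zip(a, b))


def first_divergent_layer(acts_a, acts_b, tol=1e-2):
    """Index of the first layer whose activations differ by more than tol, else None."""
    for layer, (a, b) in enumerate(zip(acts_a, acts_b)):
        if max_rel_error(a, b) > tol:
            return layer
    return None


# Made-up activations for three layers, for illustration only.
reference = [[1.00, -2.00], [0.50, 0.25], [3.00, -1.00]]
candidate = [[1.00, -2.00], [0.50, 0.26], [3.30, -1.20]]

diverged_at = first_divergent_layer(reference, candidate)
print(diverged_at)
```

Reporting the first layer over tolerance, rather than the worst layer, keeps the focus on where the error is introduced instead of where it has grown largest.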