mistral3: diagnosis of long-context degradation on Mistral-Medium-3.5-128B by danielhanchen · Pull Request #16 · unslothai/llama.cpp

danielhanchen · 2026-05-01T11:15:59Z

Summary

Adds docs/diagnosis/ with a structured writeup of why Mistral-Medium-3.5-128B
degrades on long generations under llama.cpp / llama-server, and what the
experiments rule out.

This is documentation only — no code changes. It is intended as a starting
point for a real fix; the experimental harness for each phase is reproducible
from the artefacts described in the docs.

The TL;DR matches what unsloth has already published on
unsloth/Mistral-Medium-3.5-128B-GGUF:
the model's GGUF support is WIP, and the long-context degeneration occurs
regardless of who converted the GGUF.

What the experiments ruled out

Tokenization (vLLM render, llama-server tokenize, mistral-common, and HF
AutoTokenizer all produce byte-identical 434-token streams for the same
fixture).
Chat template (4 templates render identical token streams; the only diff
vs upstream is a {%- if false %} no-op assertion).
GGUF metadata (all numerical config matches HF text_config.rope_parameters,
including yarn_beta_fast=4, beta_slow=1, factor=64, freq_base=1e6,
original_context_length=4096, yarn_log_mul=1).
KV cache dtype (F16, BF16, F32 all degrade the same way).
Flash attention on/off.
cuBLAS build flag + GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F=1.
Sampler — matched temp=0.1, top_p=1, top_k=64, min_p=0.05, seed=42, and
sweeps over repetition_penalty, frequency_penalty, dry_multiplier.
Architecture code in src/models/mistral3.cpp vs HF
transformers/models/ministral3/modeling_ministral3.py — equivalent residual /
RMSNorm / SwiGLU / RoPE flow; YARN attn_factor resolves to 1.0 on both
sides for mscale=mscale_all_dim=1.0.

What is interesting

HF transformers BF16 inference of the FP8 safetensors also degenerates
on the same input, ruling out Q8_0 quantization or llama.cpp-specific kernel
bugs as the unique cause.
vLLM FP8 with --attention-backend FLASH_ATTN does not degenerate. It
is the only setup in our matrix that stays coherent past ~1000 generated
tokens.

Suggested next steps

Per-layer activation dump comparing vLLM and HF/llama.cpp on the same input
— find the first layer where they diverge meaningfully.
If divergence localises to RMSNorm or attention softmax, force FP32
accumulators in those ops in ggml-cuda for LLM_ARCH_MISTRAL3.
Sweep quants (Q8_0 / Q6_K / Q4_K_M / F16-from-FP8) for perplexity vs base.
Optional: a PyTorch reference forward pass driven by the GGUF weights via
gguf-py and the HF arch, token-by-token.

Test plan

CI: docs-only change; no behaviour to validate.
Reviewer reads docs/diagnosis/mistral-medium-3.5-long-context.md and
confirms the conclusions are reproducible from the listed artefacts.

…um-3.5-128B Documents the experiments and findings from comparing llama-server (Q8_0 GGUF), vLLM (FP8 safetensors), and HF transformers (BF16) inference of mistralai/Mistral-Medium-3.5-128B on the same multi-turn fixtures. Key findings: - Tokenization is identical across vLLM /v1/chat/completions/render, llama-server /apply-template + /tokenize, mistral-common, and HF AutoTokenizer (byte-identical 434-token streams). - Chat templates from the GGUF, unsloth, mistralai upstream, and the HF tokenizer all render to identical token streams for normal multi-turn chat. - GGUF metadata matches HF text_config including all YARN parameters (factor=64, freq_base=1e6, original_context_length=4096, beta_fast=4, beta_slow=1, yarn_log_mul=1.0). - llama.cpp's mistral3.cpp implements the same residual / RMSNorm / SwiGLU / RoPE flow as transformers' Ministral3DecoderLayer. - KV cache dtype (F16, BF16, F32), flash attention on/off, and CUDA build with -DGGML_CUDA_FORCE_CUBLAS=ON + GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F=1 do not change the looping behaviour. - Sampler choice (matched min_p / top_k / top_p / seed; sweeps over repetition_penalty, frequency_penalty, dry_multiplier) does not change it either. - HF transformers BF16 also degrades on the same input. So this is not Q8_0-specific; it's a model-wide property that vLLM happens to avoid. This matches what unsloth has already published on https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF (Mistral has labeled GGUF support WIP). Empirical convergence point: same input, greedy temperature=0: - vLLM FP8 stops naturally at 1496 tokens - llama-server Q8_0 hits length cap, looping after ~1000 tokens - HF transformers BF16 hits length cap, looping after ~1000 tokens This commit only adds documentation; no code changes. The next step is to dump per-layer activations from vLLM and llama-server for the same input and find where they first diverge significantly.

…ripts Adds docs/diagnosis/first_divergence.md and the two reproducer scripts that exercise the OpenAI Chat Completions API on the running vLLM (8765) and llama-server (8766) endpoints. Findings (greedy temperature=0 on "Create a Flappy Bird Python game" with the Mistral-Medium-3.5 SYSTEM_PROMPT, reasoning_effort=none): - The two servers produce a byte-identical 13-token prefix: "# Flappy Bird Game in Python\n\nHere's a complete implementation of" - At token 14, the top-2 tokens (' a' and ' Fl') are the same on both, but their relative ranking is flipped: * vLLM: ' a' -0.314, ' Fl' -1.314 * llama-server:' Fl' -0.289, ' a' -1.430 - Both paths are individually coherent for ~600-1000 generated tokens; only the llama-server path degenerates after that into broken syntax and repetition. - When vLLM's full output is fed as a fixed prefix to llama-server at checkpoints 50/200/500/1000/1400, llama-server's top-1 next-token AGREES with vLLM's at every checkpoint. So the degeneration is not the model picking a different answer given the same context - it's the cumulative effect of one early ~1-nat ranking flip pushing llama-server's trajectory into a different (eventually degenerate) attractor. - The same divergence pattern is observed on Q4_K_M (74 GiB GGUF), with even worse downstream syntax garbage. So this is uniform across llama.cpp quants and not a Q8_0-specific issue. Hypothesis: ggml-cuda's matmul accumulator precision for the dequant+matmul path on this model's shape (88L, 96H, 8 KV-H, head_dim=128, vocab=131072, intermediate=28672, rope_freq_base=1e6 with YARN factor=64) yields logits that are subtly flatter than the FP8 reference, and the flatness manifests as a top-2 ranking flip on close calls. GGML_CUDA_FORCE_CUBLAS=1 was already tried with no effect; the FP32 compute mode only helps the FP16 cuBLAS path, not Q8_0 dequant + matmul. Targeted fix would be FP32 accumulators in mmq.cu / mmvq.cu specifically for LLM_ARCH_MISTRAL3.

danielhanchen · 2026-05-01T11:24:52Z

Update — first divergence pinned, see commit `e617ca3`

The two servers (vLLM FP8 and llama-server Q8_0) produce a byte-identical 13-token prefix then diverge at token 14. The top-2 candidates are the same on both, but the relative ranking is flipped:

rank	vLLM	logprob	llama-server	logprob
1	`a`	−0.314	`Fl`	−0.289
2	`Fl`	−1.314	`a`	−1.430

Both trajectories are individually clean for ~600–1000 generated tokens; only llama-server's degenerates after that into broken syntax (pygame.display.set mode((800, 600) with the underscore in set_mode lost).

When vLLM's full output is fed back as a fixed prefix to llama-server at checkpoints 50/200/500/1000/1400, llama-server's top-1 next-token agrees with vLLM at every checkpoint. So this is not a per-step disagreement — it's the cumulative effect of one early ~1-nat ranking flip pushing the path into a degenerate attractor.

Q4_K_M (74 GiB GGUF) makes the same wrong choice at token 14 and degenerates more downstream — uniform across quants.

GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F=1 only helps the FP16 cuBLAS path; Q8_0 / Q4_K_M dequant + matmul takes a different code path. The targeted experiment that we have not been able to run is forcing FP32 accumulators in ggml-cuda/mmq*.cu (or wherever the dequant matmul lives) for LLM_ARCH_MISTRAL3 only.

Reproducer scripts are in docs/diagnosis/repro_*.py. Both run against the OpenAI Chat Completions API on a vLLM server (port 8765) and a llama-server (port 8766).

danielhanchen added 2 commits May 1, 2026 11:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mistral3: diagnosis of long-context degradation on Mistral-Medium-3.5-128B#16

mistral3: diagnosis of long-context degradation on Mistral-Medium-3.5-128B#16
danielhanchen wants to merge 2 commits into
masterfrom
mistral-medium-3.5-long-context-diag-20260501

danielhanchen commented May 1, 2026

Uh oh!

danielhanchen commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

danielhanchen commented May 1, 2026

Summary

What the experiments ruled out

What is interesting

Suggested next steps

Test plan

Uh oh!

danielhanchen commented May 1, 2026

Update — first divergence pinned, see commit e617ca3

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Update — first divergence pinned, see commit `e617ca3`