mistral3: diagnosis of long-context degradation on Mistral-Medium-3.5-128B #16
danielhanchen wants to merge 2 commits into
…um-3.5-128B
Documents the experiments and findings from comparing llama-server (Q8_0 GGUF), vLLM (FP8 safetensors), and HF transformers (BF16) inference of mistralai/Mistral-Medium-3.5-128B on the same multi-turn fixtures.
Key findings:
- Tokenization is identical across vLLM /v1/chat/completions/render, llama-server /apply-template + /tokenize, mistral-common, and HF AutoTokenizer (byte-identical 434-token streams).
- Chat templates from the GGUF, unsloth, mistralai upstream, and the HF tokenizer all render to identical token streams for normal multi-turn chat.
- GGUF metadata matches HF text_config including all YARN parameters (factor=64, freq_base=1e6, original_context_length=4096, beta_fast=4, beta_slow=1, yarn_log_mul=1.0).
- llama.cpp's mistral3.cpp implements the same residual / RMSNorm / SwiGLU / RoPE flow as transformers' Ministral3DecoderLayer.
- KV cache dtype (F16, BF16, F32), flash attention on/off, and CUDA build with -DGGML_CUDA_FORCE_CUBLAS=ON + GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F=1 do not change the looping behaviour.
- Sampler choice (matched min_p / top_k / top_p / seed; sweeps over repetition_penalty, frequency_penalty, dry_multiplier) does not change it either.
- HF transformers BF16 also degrades on the same input. So this is not Q8_0-specific; it's a model-wide property that vLLM happens to avoid.
This matches what unsloth has already published on https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF (Mistral has labeled GGUF support WIP).
Empirical convergence point: same input, greedy temperature=0:
- vLLM FP8 stops naturally at 1496 tokens
- llama-server Q8_0 hits length cap, looping after ~1000 tokens
- HF transformers BF16 hits length cap, looping after ~1000 tokens
This commit only adds documentation; no code changes. The next step is to dump per-layer activations from vLLM and llama-server for the same input and find where they first diverge significantly.
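The "looping after ~1000 tokens" observations need some working definition of a loop. A minimal sketch, assuming a loop means a short token pattern repeated back-to-back several times; the function name and the `max_period` / `min_repeats` parameters are illustrative choices, not something the diagnosis docs specify:

```python
from typing import Optional, Sequence, Tuple


def first_loop_start(
    tokens: Sequence[int],
    max_period: int = 64,
    min_repeats: int = 3,
) -> Optional[Tuple[int, int]]:
    """Return (start_index, period) of the earliest back-to-back repetition.

    Hypothetical definition: a pattern of `period` tokens that occurs
    `min_repeats` times in a row, starting at `start_index`.
    """
    n = len(tokens)
    for start in range(n):
        for period in range(1, max_period + 1):
            span = period * min_repeats
            if start + span > n:
                break  # longer periods cannot fit either
            # Every token must equal the token one period later.
            if all(
                tokens[start + j] == tokens[start + j + period]
                for j in range(span - period)
            ):
                return start, period
    return None


# Ten distinct tokens, then the pattern (7, 8, 9) repeated.
sample = list(range(10)) + [7, 8, 9] * 5
loop = first_loop_start(sample)
```

Running this on the token IDs of each backend's output would give a comparable "loop onset" position per backend, which is one way to make the ~1000-token figure checkable.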
…ripts
Adds docs/diagnosis/first_divergence.md and the two reproducer scripts that
exercise the OpenAI Chat Completions API on the running vLLM (8765) and
llama-server (8766) endpoints.
Findings (greedy temperature=0 on "Create a Flappy Bird Python game" with the
Mistral-Medium-3.5 SYSTEM_PROMPT, reasoning_effort=none):
- The two servers produce a byte-identical 13-token prefix:
"# Flappy Bird Game in Python\n\nHere's a complete implementation of"
- At token 14, the top-2 tokens (' a' and ' Fl') are the same on both, but
their relative ranking is flipped:
* vLLM: ' a' -0.314, ' Fl' -1.314
* llama-server: ' Fl' -0.289, ' a' -1.430
- Both paths are individually coherent for ~600-1000 generated tokens; only
the llama-server path degenerates after that into broken syntax and
repetition.
- When vLLM's full output is fed as a fixed prefix to llama-server at
checkpoints 50/200/500/1000/1400, llama-server's top-1 next-token AGREES
with vLLM's at every checkpoint. So the degeneration is not the model
picking a different answer given the same context - it's the cumulative
effect of one early ~1-nat ranking flip pushing llama-server's trajectory
into a different (eventually degenerate) attractor.
- The same divergence pattern is observed on Q4_K_M (74 GiB GGUF), with even
worse downstream syntax garbage. So this is uniform across llama.cpp quants
and not a Q8_0-specific issue.
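The first-divergence finding can be reproduced from two token sequences plus the per-token log-probabilities. A small sketch, treating the reported values as natural-log probabilities (which the "~1-nat" wording suggests); `first_divergence` and `top2_margin` are hypothetical helper names, and the token lists here are placeholders rather than the real outputs:

```python
from typing import Dict, Optional, Sequence, Tuple


def first_divergence(a: Sequence[str], b: Sequence[str]) -> Optional[int]:
    """0-based index of the first position where the sequences differ."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def top2_margin(logprobs: Dict[str, float]) -> Tuple[str, float]:
    """Return (top-1 token, its lead over the runner-up in nats)."""
    ranked = sorted(logprobs.items(), key=lambda kv: kv[1], reverse=True)
    (best, lp1), (_, lp2) = ranked[0], ranked[1]
    return best, lp1 - lp2


# Values reported for token 14 on each server.
vllm_top2 = {" a": -0.314, " Fl": -1.314}
llama_top2 = {" Fl": -0.289, " a": -1.430}

vllm_best, vllm_margin = top2_margin(vllm_top2)
llama_best, llama_margin = top2_margin(llama_top2)

# Placeholder sequences: 13 shared tokens, then a different 14th token.
shared = [f"t{i}" for i in range(13)]
divergence_index = first_divergence(shared + [" a"], shared + [" Fl"])
```

With these numbers each server prefers its own top-1 by roughly one nat, and the divergence index is 13 (0-based), which is token 14 in the 1-based counting used in the findings.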
Hypothesis: ggml-cuda's matmul accumulator precision for the dequant+matmul
path on this model's shape (88L, 96H, 8 KV-H, head_dim=128, vocab=131072,
intermediate=28672, rope_freq_base=1e6 with YARN factor=64) yields logits
that are subtly flatter than the FP8 reference, and the flatness manifests
as a top-2 ranking flip on close calls. GGML_CUDA_FORCE_CUBLAS=1 was already
tried with no effect; the FP32 compute mode only helps the FP16 cuBLAS path,
not Q8_0 dequant + matmul. Targeted fix would be FP32 accumulators in
mmq.cu / mmvq.cu specifically for LLM_ARCH_MISTRAL3.
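To make the accumulator-precision idea concrete, here is a deliberately exaggerated toy in pure Python. It is a generic illustration of how a narrow running sum can change a dot product, not a model of ggml-cuda's Q8_0 kernels; whether those kernels lose precision in this way is exactly what the hypothesis leaves open. The helper names are made up for this sketch:

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot_fp16_accum(xs, ws):
    """Dot product whose running sum is rounded to FP16 after every add."""
    acc = 0.0
    for x, w in zip(xs, ws):
        acc = to_fp16(acc + to_fp16(x) * to_fp16(w))
    return acc


def dot_wide_accum(xs, ws):
    """Same FP16-rounded inputs, accumulated in double precision."""
    return sum(to_fp16(x) * to_fp16(w) for x, w in zip(xs, ws))


xs = [1.0] * 4096
ws = [1.0] * 4096

narrow = dot_fp16_accum(xs, ws)
wide = dot_wide_accum(xs, ws)
# FP16 has an 11-bit significand, so the narrow sum stalls at 2048
# while the wide sum reaches 4096.
```

Real logit errors from accumulation would be far smaller than this, but a small systematic error is enough to flip a close top-2 ranking.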
Update: first divergence pinned, see commit e617ca3
The two servers (vLLM FP8 and llama-server Q8_0) produce a byte-identical 13-token prefix, then diverge at token 14. The top-2 candidates are the same on both, but the relative ranking is flipped: vLLM puts ' a' first (-0.314 vs -1.314 for ' Fl'), while llama-server puts ' Fl' first (-0.289 vs -1.430 for ' a').
Both trajectories are individually clean for ~600–1000 generated tokens; only llama-server's degenerates after that into broken syntax and repetition.
When vLLM's full output is fed back as a fixed prefix to llama-server at checkpoints 50/200/500/1000/1400, llama-server's top-1 next-token agrees with vLLM at every checkpoint. So this is not a per-step disagreement; it's the cumulative effect of one early ~1-nat ranking flip pushing the path into a degenerate attractor.
Q4_K_M (74 GiB GGUF) makes the same wrong choice at token 14 and degenerates more downstream, so the behaviour is uniform across quants.
Reproducer scripts are in |
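The fixed-prefix checkpoint experiment described in this update can be sketched independently of any server. In the sketch below, `next_token` is a hypothetical stand-in for "send this prefix to the candidate backend with greedy decoding and return its top-1 next token"; the actual reproducer scripts may be organised differently:

```python
from typing import Callable, Dict, Sequence

# Hypothetical interface: prefix tokens in, greedy next token out.
NextToken = Callable[[Sequence[str]], str]


def checkpoint_agreement(
    reference: Sequence[str],
    next_token: NextToken,
    checkpoints: Sequence[int],
) -> Dict[int, bool]:
    """For each checkpoint k, feed reference[:k] to the candidate and check
    whether its greedy next token equals reference[k]."""
    result = {}
    for k in checkpoints:
        if k >= len(reference):
            continue  # no reference token to compare against
        result[k] = next_token(reference[:k]) == reference[k]
    return result


# Stand-in data: a 1496-token reference and a candidate that always agrees.
reference_tokens = [f"tok{i}" for i in range(1496)]
agreeing_backend = lambda prefix: f"tok{len(prefix)}"

agreement = checkpoint_agreement(
    reference_tokens, agreeing_backend, [50, 200, 500, 1000, 1400]
)
```

Agreement at every checkpoint, as reported above, is what separates "the backend disagrees at each step" from "one early flip sent the backend down a different trajectory".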
Summary
Adds docs/diagnosis/ with a structured writeup of why Mistral-Medium-3.5-128B degrades on long generations under llama.cpp / llama-server, and what the
experiments rule out.
This is documentation only — no code changes. It is intended as a starting
point for a real fix; the experimental harness for each phase is reproducible
from the artefacts described in the docs.
The TL;DR matches what unsloth has already published on
unsloth/Mistral-Medium-3.5-128B-GGUF: the model's GGUF support is WIP, and the long-context degeneration occurs
regardless of who converted the GGUF.
What the experiments ruled out
- … AutoTokenizer all produce byte-identical 434-token streams for the same fixture).
- … vs upstream is a {%- if false %} no-op assertion).
- … text_config.rope_parameters, including yarn_beta_fast=4, beta_slow=1, factor=64, freq_base=1e6, original_context_length=4096, yarn_log_mul=1).
- … GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F=1.
- … temp=0.1, top_p=1, top_k=64, min_p=0.05, seed=42, and sweeps over repetition_penalty, frequency_penalty, dry_multiplier.
- src/models/mistral3.cpp vs HF transformers/models/ministral3/modeling_ministral3.py: equivalent residual / RMSNorm / SwiGLU / RoPE flow; YARN attn_factor resolves to 1.0 on both sides for mscale=mscale_all_dim=1.0.
What is interesting
- … on the same input, ruling out Q8_0 quantization or llama.cpp-specific kernel bugs as the unique cause.
- … --attention-backend FLASH_ATTN does not degenerate. It is the only setup in our matrix that stays coherent past ~1000 generated tokens.
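On the earlier architecture point that YARN attn_factor resolves to 1.0 when mscale = mscale_all_dim: this follows directly from the usual YaRN attention-scaling formula. The sketch below is written from memory of how the YaRN helper in HF transformers computes it, so treat the exact function shape as an assumption to verify against the library source:

```python
import math
from typing import Optional


def get_mscale(scale: float, mscale: float = 1.0) -> float:
    """YaRN magnitude scale: 0.1 * mscale * ln(scale) + 1 for scale > 1."""
    if scale <= 1.0:
        return 1.0
    return 0.1 * mscale * math.log(scale) + 1.0


def yarn_attention_factor(
    factor: float,
    mscale: Optional[float] = None,
    mscale_all_dim: Optional[float] = None,
) -> float:
    """Ratio form when both mscale values are given, plain form otherwise."""
    if mscale and mscale_all_dim:
        return get_mscale(factor, mscale) / get_mscale(factor, mscale_all_dim)
    return get_mscale(factor)


# Parameters quoted in this PR: factor=64, mscale=mscale_all_dim=1.0.
with_mscales = yarn_attention_factor(64.0, mscale=1.0, mscale_all_dim=1.0)
without_mscales = yarn_attention_factor(64.0)
```

With equal mscale values the numerator and denominator are identical, so the factor is exactly 1.0; without them the same factor=64 would give about 1.416, which is why checking both sides resolve the same way matters.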
Suggested next steps
- … find the first layer where they diverge meaningfully.
- … accumulators in those ops in ggml-cuda for LLM_ARCH_MISTRAL3.
- … gguf-py and the HF arch, token-by-token.
Test plan
- … docs/diagnosis/mistral-medium-3.5-long-context.md and confirms the conclusions are reproducible from the listed artefacts.
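For the suggested per-layer comparison, one possible way to locate the first layer that diverges "meaningfully" is a relative L2 error against a threshold. This is only a sketch: the threshold value, the function names, and the assumption that activations are available as flat lists of floats per layer are all illustrative, and the dump format is not defined in this PR:

```python
import math
from typing import List, Optional, Sequence


def relative_l2_error(ref: Sequence[float], other: Sequence[float]) -> float:
    """||ref - other|| / ||ref||, with a guard for an all-zero reference."""
    num = math.sqrt(sum((r - o) ** 2 for r, o in zip(ref, other)))
    den = math.sqrt(sum(r * r for r in ref))
    if den == 0.0:
        return 0.0 if num == 0.0 else math.inf
    return num / den


def first_divergent_layer(
    ref_layers: Sequence[Sequence[float]],
    other_layers: Sequence[Sequence[float]],
    threshold: float = 1e-2,  # hypothetical cut-off for "meaningful"
) -> Optional[int]:
    """Index of the first layer whose relative error exceeds the threshold."""
    for idx, (ref, other) in enumerate(zip(ref_layers, other_layers)):
        if relative_l2_error(ref, other) > threshold:
            return idx
    return None


# Toy activations: layers 0 and 1 match, layer 2 drifts.
ref_acts: List[List[float]] = [[1.0, 2.0] for _ in range(4)]
other_acts: List[List[float]] = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.5], [1.0, 2.0]]

layer = first_divergent_layer(ref_acts, other_acts)
```

Because FP8 and Q8_0 both introduce some error at every layer, the threshold would need calibrating against a known-good pair (for example two runs of the same backend) before the reported layer index means much.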