Skip to content

mistral3: diagnosis of long-context degradation on Mistral-Medium-3.5-128B#16

Open
danielhanchen wants to merge 2 commits into
masterfrom
mistral-medium-3.5-long-context-diag-20260501
Open

mistral3: diagnosis of long-context degradation on Mistral-Medium-3.5-128B#16
danielhanchen wants to merge 2 commits into
masterfrom
mistral-medium-3.5-long-context-diag-20260501

Conversation

@danielhanchen
Copy link
Copy Markdown
Member

Summary

Adds docs/diagnosis/ with a structured writeup of why Mistral-Medium-3.5-128B
degrades on long generations under llama.cpp / llama-server, and what the
experiments rule out.

This is documentation only — no code changes. It is intended as a starting
point for a real fix; the experimental harness for each phase is reproducible
from the artefacts described in the docs.

The TL;DR matches what unsloth has already published on
unsloth/Mistral-Medium-3.5-128B-GGUF:
the model's GGUF support is WIP, and the long-context degeneration occurs
regardless of who converted the GGUF.

What the experiments ruled out

  • Tokenization (vLLM render, llama-server tokenize, mistral-common, and HF
    AutoTokenizer all produce byte-identical 434-token streams for the same
    fixture).
  • Chat template (4 templates render identical token streams; the only diff
    vs upstream is a {%- if false %} no-op assertion).
  • GGUF metadata (all numerical config matches HF text_config.rope_parameters,
    including yarn_beta_fast=4, beta_slow=1, factor=64, freq_base=1e6,
    original_context_length=4096, yarn_log_mul=1).
  • KV cache dtype (F16, BF16, F32 all degrade the same way).
  • Flash attention on/off.
  • cuBLAS build flag + GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F=1.
  • Sampler — matched temp=0.1, top_p=1, top_k=64, min_p=0.05, seed=42, and
    sweeps over repetition_penalty, frequency_penalty, dry_multiplier.
  • Architecture code in src/models/mistral3.cpp vs HF
    transformers/models/ministral3/modeling_ministral3.py — equivalent residual /
    RMSNorm / SwiGLU / RoPE flow; YARN attn_factor resolves to 1.0 on both
    sides for mscale=mscale_all_dim=1.0.

What is interesting

  • HF transformers BF16 inference of the FP8 safetensors also degenerates
    on the same input, ruling out Q8_0 quantization or llama.cpp-specific kernel
    bugs as the unique cause.
  • vLLM FP8 with --attention-backend FLASH_ATTN does not degenerate. It
    is the only setup in our matrix that stays coherent past ~1000 generated
    tokens.

Suggested next steps

  1. Per-layer activation dump comparing vLLM and HF/llama.cpp on the same input
    — find the first layer where they diverge meaningfully.
  2. If divergence localises to RMSNorm or attention softmax, force FP32
    accumulators in those ops in ggml-cuda for LLM_ARCH_MISTRAL3.
  3. Sweep quants (Q8_0 / Q6_K / Q4_K_M / F16-from-FP8) for perplexity vs base.
  4. Optional: a PyTorch reference forward pass driven by the GGUF weights via
    gguf-py and the HF arch, token-by-token.

Test plan

  • CI: docs-only change; no behaviour to validate.
  • Reviewer reads docs/diagnosis/mistral-medium-3.5-long-context.md and
    confirms the conclusions are reproducible from the listed artefacts.

…um-3.5-128B

Documents the experiments and findings from comparing llama-server (Q8_0
GGUF), vLLM (FP8 safetensors), and HF transformers (BF16) inference of
mistralai/Mistral-Medium-3.5-128B on the same multi-turn fixtures.

Key findings:

- Tokenization is identical across vLLM /v1/chat/completions/render,
  llama-server /apply-template + /tokenize, mistral-common, and HF AutoTokenizer
  (byte-identical 434-token streams).
- Chat templates from the GGUF, unsloth, mistralai upstream, and the HF
  tokenizer all render to identical token streams for normal multi-turn chat.
- GGUF metadata matches HF text_config including all YARN parameters
  (factor=64, freq_base=1e6, original_context_length=4096, beta_fast=4,
  beta_slow=1, yarn_log_mul=1.0).
- llama.cpp's mistral3.cpp implements the same residual / RMSNorm / SwiGLU /
  RoPE flow as transformers' Ministral3DecoderLayer.
- KV cache dtype (F16, BF16, F32), flash attention on/off, and CUDA build
  with -DGGML_CUDA_FORCE_CUBLAS=ON + GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F=1 do
  not change the looping behaviour.
- Sampler choice (matched min_p / top_k / top_p / seed; sweeps over
  repetition_penalty, frequency_penalty, dry_multiplier) does not change
  it either.
- HF transformers BF16 also degrades on the same input. So this is not
  Q8_0-specific; it's a model-wide property that vLLM happens to avoid.
  This matches what unsloth has already published on
  https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF
  (Mistral has labeled GGUF support WIP).

Empirical convergence point: same input, greedy temperature=0:
  - vLLM FP8                  stops naturally at 1496 tokens
  - llama-server Q8_0         hits length cap, looping after ~1000 tokens
  - HF transformers BF16      hits length cap, looping after ~1000 tokens

This commit only adds documentation; no code changes. The next step is to
dump per-layer activations from vLLM and llama-server for the same input
and find where they first diverge significantly.
…ripts

Adds docs/diagnosis/first_divergence.md and the two reproducer scripts that
exercise the OpenAI Chat Completions API on the running vLLM (8765) and
llama-server (8766) endpoints.

Findings (greedy temperature=0 on "Create a Flappy Bird Python game" with the
Mistral-Medium-3.5 SYSTEM_PROMPT, reasoning_effort=none):

- The two servers produce a byte-identical 13-token prefix:
  "# Flappy Bird Game in Python\n\nHere's a complete implementation of"
- At token 14, the top-2 tokens (' a' and ' Fl') are the same on both, but
  their relative ranking is flipped:
    * vLLM:        ' a' -0.314, ' Fl' -1.314
    * llama-server:' Fl' -0.289, ' a' -1.430
- Both paths are individually coherent for ~600-1000 generated tokens; only
  the llama-server path degenerates after that into broken syntax and
  repetition.
- When vLLM's full output is fed as a fixed prefix to llama-server at
  checkpoints 50/200/500/1000/1400, llama-server's top-1 next-token AGREES
  with vLLM's at every checkpoint. So the degeneration is not the model
  picking a different answer given the same context - it's the cumulative
  effect of one early ~1-nat ranking flip pushing llama-server's trajectory
  into a different (eventually degenerate) attractor.
- The same divergence pattern is observed on Q4_K_M (74 GiB GGUF), with even
  worse downstream syntax garbage. So this is uniform across llama.cpp quants
  and not a Q8_0-specific issue.

Hypothesis: ggml-cuda's matmul accumulator precision for the dequant+matmul
path on this model's shape (88L, 96H, 8 KV-H, head_dim=128, vocab=131072,
intermediate=28672, rope_freq_base=1e6 with YARN factor=64) yields logits
that are subtly flatter than the FP8 reference, and the flatness manifests
as a top-2 ranking flip on close calls. GGML_CUDA_FORCE_CUBLAS=1 was already
tried with no effect; the FP32 compute mode only helps the FP16 cuBLAS path,
not Q8_0 dequant + matmul. Targeted fix would be FP32 accumulators in
mmq.cu / mmvq.cu specifically for LLM_ARCH_MISTRAL3.
@danielhanchen
Copy link
Copy Markdown
Member Author

Update — first divergence pinned, see commit e617ca3

The two servers (vLLM FP8 and llama-server Q8_0) produce a byte-identical 13-token prefix then diverge at token 14. The top-2 candidates are the same on both, but the relative ranking is flipped:

rank vLLM logprob llama-server logprob
1 a −0.314 Fl −0.289
2 Fl −1.314 a −1.430

Both trajectories are individually clean for ~600–1000 generated tokens; only llama-server's degenerates after that into broken syntax (pygame.display.set mode((800, 600) with the underscore in set_mode lost).

When vLLM's full output is fed back as a fixed prefix to llama-server at checkpoints 50/200/500/1000/1400, llama-server's top-1 next-token agrees with vLLM at every checkpoint. So this is not a per-step disagreement — it's the cumulative effect of one early ~1-nat ranking flip pushing the path into a degenerate attractor.

Q4_K_M (74 GiB GGUF) makes the same wrong choice at token 14 and degenerates more downstream — uniform across quants.

GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F=1 only helps the FP16 cuBLAS path; Q8_0 / Q4_K_M dequant + matmul takes a different code path. The targeted experiment that we have not been able to run is forcing FP32 accumulators in ggml-cuda/mmq*.cu (or wherever the dequant matmul lives) for LLM_ARCH_MISTRAL3 only.

Reproducer scripts are in docs/diagnosis/repro_*.py. Both run against the OpenAI Chat Completions API on a vLLM server (port 8765) and a llama-server (port 8766).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant