feat(configs): Kimi-K2.7-Code PD-disaggregated wide-EP inference config #2827

Draft: hallerite wants to merge 3 commits into main from feat/kimi-k27-code-disagg-config
hallerite commented Jun 15, 2026

Summary

Adds configs/kimi_k27_code_disagg_inference/inference.toml — a prefill/decode disaggregated, wide-EP serving config for moonshotai/Kimi-K2.7-Code (1T MoE, DeepSeek-V3-style MLA, 384 experts, INT4 compressed-tensors) on H200 nodes at 130k context.

No prime-rl or vLLM code changes are needed: vLLM natively serves KimiK25ForConditionalGeneration (the architecture K2.7-Code declares) and ships the kimi_k2 reasoning/tool parsers. The config works on the current vLLM pin and on 0.23.0.

Topology & sizing (H200, 141 GiB)

Wide-EP (no [parallel] section, so tp=1): DP-replicated MLA attention + EP-sharded experts, the same pattern as configs/glm5_disagg_inference. EP width = (nodes / replicas) × 8.

  • prefill: 4 nodes → EP32 (12 experts/GPU, ~38.6 GiB weights/GPU)
  • decode: 2 nodes → EP16 (24 experts/GPU, ~55 GiB weights/GPU)
  • 6 nodes / 48 H200 total
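
The sizing above follows directly from the EP-width formula. A minimal sketch of that arithmetic, assuming 8 GPUs per node and 384 routed experts as stated in this description (the function names are illustrative and not part of prime-rl):

```python
GPUS_PER_NODE = 8    # from "EP width = (nodes / replicas) x 8"
NUM_EXPERTS = 384    # routed experts, per the summary


def ep_width(nodes: int, replicas: int = 1) -> int:
    """EP width of one replica: (nodes / replicas) * GPUs per node."""
    if nodes % replicas != 0:
        raise ValueError("nodes must divide evenly across replicas")
    return (nodes // replicas) * GPUS_PER_NODE


def experts_per_gpu(width: int) -> int:
    """Routed experts held by each GPU when sharded across `width` ranks."""
    if NUM_EXPERTS % width != 0:
        raise ValueError("expert count must divide evenly across EP ranks")
    return NUM_EXPERTS // width


prefill = ep_width(nodes=4)
decode = ep_width(nodes=2)
print(prefill, experts_per_gpu(prefill))  # 32 12
print(decode, experts_per_gpu(decode))    # 16 24
```

The same helper covers the 16-node variant mentioned under Scaling: 8 nodes per role split into 2 replicas gives `ep_width(nodes=8, replicas=2) == 32`.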

Memory math: weights ~553 GiB ≈ ~22 GiB bf16 (attn/shared/embed/lm_head, replicated per GPU) + ~532 GiB INT4 experts (EP-sharded). MLA keeps KV tiny (~8.6 GiB per full 130k sequence), so the EP16 decode replica holds ~117 concurrent 130k sequences.
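
The per-GPU weight figures can be reproduced from the numbers above. A rough sketch follows; the concurrency helper pools free memory across all ranks in a replica, and its `usable_gib_per_gpu` argument is an assumption, since this description does not state the per-GPU memory budget behind the ~117 figure:

```python
REPLICATED_GIB = 22.0   # bf16 attn/shared/embed/lm_head, copied to every GPU
EXPERTS_GIB = 532.0     # INT4 routed experts, sharded across EP ranks
KV_GIB_PER_SEQ = 8.6    # KV cache for one full 130k-token sequence


def weights_per_gpu(ep_width: int) -> float:
    """Replicated weights plus this rank's share of the experts, in GiB."""
    return REPLICATED_GIB + EXPERTS_GIB / ep_width


def max_sequences(ep_width: int, usable_gib_per_gpu: float) -> int:
    """Pooled estimate of full-length sequences one replica can hold.

    usable_gib_per_gpu is a hypothetical budget (HBM left after the
    memory-utilization cap and runtime overhead); it is not taken from the PR.
    """
    free_per_gpu = usable_gib_per_gpu - weights_per_gpu(ep_width)
    return int(ep_width * free_per_gpu / KV_GIB_PER_SEQ)


print(weights_per_gpu(32))  # 38.625
print(weights_per_gpu(16))  # 55.25

# Placeholder budget for illustration only.
print(max_sequences(16, usable_gib_per_gpu=120.0))
```

Because the estimate pools memory across ranks, it is an upper bound: with DP attention each sequence's KV cache lives on a single GPU, so the achievable count can be slightly lower.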

Key delta vs GLM-5: use_deep_gemm = false (K2.7-Code is INT4 compressed-tensors, not block-wise FP8; DeepGEMM is FP8-only).

Scaling

To scale out, prefer more replicas over wider EP (cheaper cross-node all-to-all, router load-balances, fault isolation). The file documents the 16-node variant (2× EP32 per role, ~590 concurrent 130k seqs).

Optimizations

Most of the heavy lifting is auto-applied by the disaggregated launcher (no config needed): DeepEP all-to-all (prefill deepep_high_throughput / decode deepep_low_latency), FULL_DECODE_ONLY decode cudagraphs, NIXL KV transfer, and FlashMLA attention on Hopper. The INT4 (W4A16) compressed-tensors MoE kernel is auto-selected.

Added in the config:

  • enable_prefix_caching = true: reuses the shared codebase/system prefix across agentic turns.
  • use_deep_gemm = false: overrides the launcher's default VLLM_USE_DEEP_GEMM=1 via setup_vllm_env, since DeepGEMM is FP8-only.

Documented opt-in levers (commented): fp8 KV cache (vllm_extra.kv_cache_dtype="fp8" — FlashMLA supports it, ~2× decode concurrency at a small long-context accuracy cost), chunked-prefill batch size for 130k prompts, and EPLB.

Validation

Loads cleanly through the prime-rl config parser; derived parallelism confirmed (tp=1, prefill EP32 / decode EP16). Not yet run on hardware: the first bring-up should confirm that the INT4-MoE kernels (machete/Marlin W4A16) perform acceptably on Hopper and that the …ForConditionalGeneration base serves text-only requests cleanly.

🤖 Generated with Claude Code

hallerite and others added 3 commits June 15, 2026 22:11
…config

Add configs/kimi_k27_code_disagg_inference/inference.toml: a prefill/decode
disaggregated, wide-EP (tp=1, DP attention + EP experts) serving config for
moonshotai/Kimi-K2.7-Code (1T MoE, DeepSeek-V3 MLA, 384 experts, INT4
compressed-tensors) on H200 nodes at 130k context.

Mirrors configs/glm5_disagg_inference (prefill EP32 / decode EP16, 6 nodes /
48 H200) with use_deep_gemm=false (INT4, not FP8) and the native kimi_k2
reasoning/tool parsers. vLLM serves KimiK25ForConditionalGeneration natively;
no custom model arch. Validated through the prime-rl config parser.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ions

Set enable_prefix_caching=true (reuse the shared codebase prefix across agentic
turns). Document the optimizations the disaggregated launcher already auto-applies
(DeepEP all-to-all, FULL_DECODE_ONLY decode cudagraphs, NIXL, FlashMLA) and the
opt-in levers (fp8 KV cache via FlashMLA for ~2x decode concurrency,
chunked-prefill batch size, EPLB).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>