feat(configs): Kimi-K2.7-Code PD-disaggregated wide-EP inference config#2827
Draft
hallerite wants to merge 3 commits into
…config Add configs/kimi_k27_code_disagg_inference/inference.toml: a prefill/decode disaggregated, wide-EP (tp=1, DP attention + EP experts) serving config for moonshotai/Kimi-K2.7-Code (1T MoE, DeepSeek-V3 MLA, 384 experts, INT4 compressed-tensors) on H200 nodes at 130k context. Mirrors configs/glm5_disagg_inference (prefill EP32 / decode EP16, 6 nodes / 48 H200) with use_deep_gemm=false (INT4, not FP8) and the native kimi_k2 reasoning/tool parsers. vLLM serves KimiK25ForConditionalGeneration natively; no custom model arch. Validated through the prime-rl config parser. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ions Set enable_prefix_caching=true (reuse the shared codebase prefix across agentic turns). Document the optimizations the disaggregated launcher already auto-applies (DeepEP all-to-all, FULL_DECODE_ONLY decode cudagraphs, NIXL, FlashMLA) and the opt-in levers (fp8 KV cache via FlashMLA for ~2x decode concurrency, chunked-prefill batch size, EPLB). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Adds configs/kimi_k27_code_disagg_inference/inference.toml, a prefill/decode disaggregated, wide-EP serving config for moonshotai/Kimi-K2.7-Code (1T MoE, DeepSeek-V3-style MLA, 384 experts, INT4 compressed-tensors) on H200 nodes at 130k context.
No prime-rl or vLLM code changes: vLLM serves KimiK25ForConditionalGeneration natively (the arch K2.7-Code declares) and ships the kimi_k2 reasoning/tool parsers. Works on the current pin and on 0.23.0.
Topology & sizing (H200, 141 GiB)
Wide-EP (no [parallel] → tp=1): DP-replicated MLA attention + EP-sharded experts, same pattern as configs/glm5_disagg_inference. EP width = (nodes / replicas) × 8, i.e. nodes per replica times 8 GPUs per node.
Memory math: weights ~553 GiB = ~22 GiB bf16 (attn/shared/embed/lm_head, replicated per GPU) + 532 GiB INT4 experts (EP-sharded). MLA keeps KV tiny, ~8.6 GiB per full 130k sequence, so the EP16 decode replica holds ~117 concurrent 130k sequences.
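The sizing above can be reproduced with a few lines of arithmetic. This is only a rough sketch: the weight and KV figures are the ones quoted in this description, while RESERVE_GIB (per-GPU headroom for activations, cudagraphs and the memory-utilization cap) is not stated anywhere in the PR and is back-solved here so that the result lands on the quoted ~117. The MLA dimensions (61 layers, 512-wide latent plus 64 RoPE dims, bf16) and the 131072-token reading of "130k" are DeepSeek-V3's published values, assumed to carry over to this model.

```python
GPUS_PER_NODE = 8        # H200 nodes, 8 GPUs each (6 nodes = 48 GPUs)
HBM_GIB = 141.0          # per-GPU memory quoted in the PR
REPLICATED_GIB = 22.0    # bf16 attn/shared/embed/lm_head, copied to every GPU
EXPERT_GIB = 532.0       # INT4 experts, sharded across the EP group
KV_GIB_PER_SEQ = 8.6     # one full 130k sequence, as quoted in the PR
RESERVE_GIB = 23.0       # ASSUMPTION: back-solved headroom, not from the PR


def ep_width(nodes: int, replicas: int) -> int:
    """EP width = (nodes / replicas) x GPUs per node."""
    if nodes % replicas:
        raise ValueError("nodes must divide evenly across replicas")
    return nodes // replicas * GPUS_PER_NODE


def weights_per_gpu_gib(ep: int) -> float:
    """Replicated dense weights plus this GPU's 1/ep share of the experts."""
    return REPLICATED_GIB + EXPERT_GIB / ep


def mla_kv_gib_per_seq(tokens: int = 131072, layers: int = 61,
                       latent_dim: int = 512, rope_dim: int = 64,
                       bytes_per_elem: int = 2) -> float:
    """MLA caches one compressed latent + RoPE key per token per layer.

    All defaults are ASSUMED DeepSeek-V3-style values, not read from the config.
    """
    return tokens * layers * (latent_dim + rope_dim) * bytes_per_elem / 2**30


def concurrent_seqs(ep: int, reserve_gib: float = RESERVE_GIB) -> float:
    """Aggregate full-length sequences one replica of width `ep` can hold."""
    kv_per_gpu = HBM_GIB - weights_per_gpu_gib(ep) - reserve_gib
    return kv_per_gpu * ep / KV_GIB_PER_SEQ


if __name__ == "__main__":
    print("prefill EP width:", ep_width(nodes=4, replicas=1))
    print("decode EP width:", ep_width(nodes=2, replicas=1))
    print("weights per GPU at EP16 (GiB):", weights_per_gpu_gib(16))
    print("KV per 130k sequence (GiB):", round(mla_kv_gib_per_seq(), 2))
    print("EP16 decode concurrency:", round(concurrent_seqs(16)))
```

Passing bytes_per_elem=1 to mla_kv_gib_per_seq halves the per-sequence figure, which is the rough basis for the ~2× concurrency quoted for the fp8 KV cache lever; real fp8 layouts may carry extra scale data, so treat that as an upper bound.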
Key delta vs GLM-5: use_deep_gemm = false (K2.7-Code is INT4 compressed-tensors, not block-wise FP8; DeepGEMM is FP8-only).
Scaling
To scale out, prefer more replicas over wider EP (cheaper cross-node all-to-all, router load-balances, fault isolation). The file documents the 16-node variant (2× EP32 per role, ~590 concurrent 130k seqs).
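As a quick sanity check on the two layouts mentioned here, the sketch below counts nodes and estimates aggregate decode capacity for the 6-node default and the 16-node variant. The Role class and helper names are hypothetical and exist only for this illustration. The 23 GiB per-GPU reserve is an assumption (it is not in the PR), chosen because it reproduces the ~117 figure for a single EP16 decode replica; the same value also lands close to the quoted ~590 for 2× EP32.

```python
from dataclasses import dataclass

GPUS_PER_NODE = 8        # H200 nodes, 8 GPUs each
HBM_GIB = 141.0          # per-GPU memory quoted in the PR
REPLICATED_GIB = 22.0    # bf16 weights replicated on every GPU
EXPERT_GIB = 532.0       # INT4 experts, sharded across the EP group
KV_GIB_PER_SEQ = 8.6     # one full 130k sequence
RESERVE_GIB = 23.0       # ASSUMPTION: back-solved headroom, not from the PR


@dataclass(frozen=True)
class Role:
    """Hypothetical description of one serving role (prefill or decode)."""
    name: str
    replicas: int
    ep_width: int

    @property
    def nodes(self) -> int:
        return self.replicas * self.ep_width // GPUS_PER_NODE


def total_nodes(roles: list[Role]) -> int:
    return sum(role.nodes for role in roles)


def replica_capacity(ep: int) -> float:
    """Full-length sequences one replica of width `ep` can hold."""
    kv_per_gpu = HBM_GIB - (REPLICATED_GIB + EXPERT_GIB / ep) - RESERVE_GIB
    return kv_per_gpu * ep / KV_GIB_PER_SEQ


def decode_capacity(roles: list[Role]) -> float:
    """Sum capacity over all decode replicas; prefill holds no long-lived KV."""
    return sum(role.replicas * replica_capacity(role.ep_width)
               for role in roles if role.name == "decode")


SIX_NODE = [Role("prefill", 1, 32), Role("decode", 1, 16)]
SIXTEEN_NODE = [Role("prefill", 2, 32), Role("decode", 2, 32)]

if __name__ == "__main__":
    for label, layout in (("6-node", SIX_NODE), ("16-node", SIXTEEN_NODE)):
        print(label, "nodes:", total_nodes(layout),
              "decode seqs:", round(decode_capacity(layout)))
```

The estimate covers memory only. It says nothing about the all-to-all cost, router balance or fault isolation that motivate preferring replicas over wider EP.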
Optimizations
Most of the heavy lifting is auto-applied by the disaggregated launcher (no config needed): DeepEP all-to-all (prefill deepep_high_throughput / decode deepep_low_latency), FULL_DECODE_ONLY decode cudagraphs, NIXL KV transfer, and FlashMLA attention on Hopper. The INT4 (W4A16) compressed-tensors MoE kernel is auto-selected.
Added in the config:
enable_prefix_caching = true (reuse the shared codebase/system prefix across agentic turns).
use_deep_gemm = false (config overrides the launcher's default VLLM_USE_DEEP_GEMM=1 via setup_vllm_env; DeepGEMM is FP8-only).
Documented opt-in levers (commented): fp8 KV cache (vllm_extra.kv_cache_dtype="fp8"; FlashMLA supports it, ~2× decode concurrency at a small long-context accuracy cost), chunked-prefill batch size for 130k prompts, and EPLB.
Validation
Loads cleanly through the prime-rl config parser; derived parallelism confirmed (tp=1, prefill EP32 / decode EP16). Not run on hardware yet: first bring-up should confirm that the INT4-MoE kernels (machete/Marlin W4A16) perform on Hopper and that the …ForConditionalGeneration base serves text-only cleanly.
🤖 Generated with Claude Code