feat(configs): Kimi-K2.7-Code PD-disaggregated wide-EP inference config by hallerite · Pull Request #2827 · PrimeIntellect-ai/prime-rl

hallerite · 2026-06-15T22:11:35Z

Summary

Adds configs/kimi_k27_code_disagg_inference/inference.toml — a prefill/decode disaggregated, wide-EP serving config for moonshotai/Kimi-K2.7-Code (1T MoE, DeepSeek-V3-style MLA, 384 experts, INT4 compressed-tensors) on H200 nodes at 130k context.

No prime-rl or vLLM code changes: vLLM serves KimiK25ForConditionalGeneration natively (the arch K2.7-Code declares) and ships the kimi_k2 reasoning/tool parsers. Works on the current vLLM pin and on 0.23.0.

Topology & sizing (H200, 141 GiB)

Wide-EP (no [parallel] → tp=1): DP-replicated MLA attention + EP-sharded experts, same pattern as configs/glm5_disagg_inference. EP width = (nodes / replicas) × 8.

prefill: 4 nodes → EP32 (12 experts/GPU, ~38.6 GiB weights/GPU)
decode: 2 nodes → EP16 (24 experts/GPU, ~55 GiB weights/GPU)
6 nodes / 48 H200 total
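
The EP widths and per-GPU expert counts above follow directly from the formula. A minimal Python sketch of that arithmetic, assuming 8 GPUs per node (as 6 nodes / 48 H200 implies); the function names are illustrative and not part of prime-rl:

```python
GPUS_PER_NODE = 8   # assumption: 6 nodes / 48 H200 implies 8 GPUs per node
NUM_EXPERTS = 384   # expert count from the PR summary


def ep_width(nodes: int, replicas: int = 1) -> int:
    """EP width = (nodes / replicas) * 8, per the formula in this PR."""
    if replicas < 1 or nodes % replicas != 0:
        raise ValueError("nodes must split evenly across replicas")
    return (nodes // replicas) * GPUS_PER_NODE


def experts_per_gpu(ep: int) -> int:
    """Experts hosted by each GPU when they are sharded evenly over ep ranks."""
    if NUM_EXPERTS % ep != 0:
        raise ValueError("expert count must divide evenly across EP ranks")
    return NUM_EXPERTS // ep


for role, nodes in (("prefill", 4), ("decode", 2)):
    ep = ep_width(nodes)
    print(f"{role}: EP{ep}, {experts_per_gpu(ep)} experts/GPU")
```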

Memory math: weights ~553 GiB = ~22 GiB bf16 (attn/shared/embed/lm_head, replicated per GPU) + 532 GiB INT4 experts (EP-sharded). MLA keeps KV tiny — ~8.6 GiB per full 130k sequence — so the EP16 decode replica holds ~117 concurrent 130k sequences.
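
The per-GPU weight figures can be reproduced from the split above. The sketch below is a rough check, not the sizing code used for the config: the helper names are hypothetical, and the KV budget is back-solved from the stated ~117 sequences because the PR does not give it directly. It also treats the KV memory of all DP ranks in a replica as one pool, which is a simplification.

```python
REPLICATED_GIB = 22.0   # bf16 attn/shared/embed/lm_head, copied to every GPU
EXPERT_GIB = 532.0      # INT4 experts, sharded across the EP group
KV_PER_SEQ_GIB = 8.6    # one full 130k-token sequence with MLA (from the PR)


def weights_per_gpu_gib(ep: int) -> float:
    """Replicated weights plus this GPU's 1/ep share of the expert weights."""
    return REPLICATED_GIB + EXPERT_GIB / ep


def implied_kv_gib_per_gpu(ep: int, sequences: int) -> float:
    """Back-solve the per-GPU KV budget from a stated concurrency figure."""
    return sequences * KV_PER_SEQ_GIB / ep


def replica_capacity(ep: int, kv_gib_per_gpu: float) -> int:
    """Full-length sequences that fit if the replica's KV memory is pooled."""
    return int(ep * kv_gib_per_gpu / KV_PER_SEQ_GIB)


print(f"prefill EP32: {weights_per_gpu_gib(32):.1f} GiB weights/GPU")
print(f"decode  EP16: {weights_per_gpu_gib(16):.1f} GiB weights/GPU")
kv = implied_kv_gib_per_gpu(16, 117)
print(f"~117 seqs on EP16 implies ~{kv:.0f} GiB of KV cache per GPU")
```

Under these assumptions, weights plus KV cache come to roughly 118 GiB of the 141 GiB on each decode GPU, with the remainder left for activations, CUDA graphs, and other runtime overhead.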

Key delta vs GLM-5: use_deep_gemm = false (K2.7-Code is INT4 compressed-tensors, not block-wise FP8; DeepGEMM is FP8-only).

Scaling

To scale out, prefer more replicas over wider EP (cheaper cross-node all-to-all, router load-balances, fault isolation). The file documents the 16-node variant (2× EP32 per role, ~590 concurrent 130k seqs).
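
As an illustration of scaling by replicas at fixed EP width, the sketch below lays out one role of the 16-node variant. It assumes 8 GPUs per node and an even 8/8 node split between prefill and decode (inferred from "2× EP32 per role"); `plan_role` is a hypothetical helper, not a prime-rl function.

```python
GPUS_PER_NODE = 8  # assumption: same 8-GPU H200 nodes as the 6-node layout


def plan_role(nodes: int, nodes_per_replica: int) -> tuple[int, int]:
    """Return (replicas, ep_width) for one role, keeping EP width fixed."""
    if nodes_per_replica < 1 or nodes % nodes_per_replica != 0:
        raise ValueError("nodes must be a multiple of nodes_per_replica")
    return nodes // nodes_per_replica, nodes_per_replica * GPUS_PER_NODE


# 16-node variant: 8 nodes per role, each replica spanning 4 nodes.
replicas, ep = plan_role(nodes=8, nodes_per_replica=4)
print(f"{replicas}x EP{ep} per role")

# The stated ~590 concurrent sequences then split across the decode replicas.
print(f"~{590 // replicas} sequences per decode replica")
```

Adding another 4 nodes to a role under this scheme adds a third EP32 replica behind the router instead of widening the all-to-all group.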

Optimizations

Most of the heavy lifting is auto-applied by the disaggregated launcher (no config needed): DeepEP all-to-all (prefill deepep_high_throughput / decode deepep_low_latency), FULL_DECODE_ONLY decode cudagraphs, NIXL KV transfer, and FlashMLA attention on Hopper. The INT4 (W4A16) compressed-tensors MoE kernel is auto-selected.

Added in the config:

enable_prefix_caching = true (reuse the shared codebase/system prefix across agentic turns)
use_deep_gemm = false (config overrides the launcher's default VLLM_USE_DEEP_GEMM=1 via setup_vllm_env; DeepGEMM is FP8-only)

Documented opt-in levers (commented): fp8 KV cache (vllm_extra.kv_cache_dtype="fp8" — FlashMLA supports it, ~2× decode concurrency at a small long-context accuracy cost), chunked-prefill batch size for 130k prompts, and EPLB.
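
To see where the ~2× from an fp8 KV cache comes from, here is a back-of-the-envelope sketch of MLA cache size per sequence. The layer count and latent dimensions are assumptions taken from DeepSeek-V3's published configuration (the PR only says K2.7-Code is "DeepSeek-V3-style"), "130k" is read as 131,072 tokens, and fp8 is modeled as exactly one byte per element with no scale metadata.

```python
LAYERS = 61               # assumption: DeepSeek-V3 layer count
KV_LORA_RANK = 512        # assumption: compressed KV latent width
ROPE_DIM = 64             # assumption: decoupled RoPE key width
CONTEXT_TOKENS = 131_072  # assumption: "130k" read as 128 * 1024 tokens


def kv_gib_per_sequence(bytes_per_elem: int) -> float:
    """MLA caches one latent vector plus one RoPE key per token per layer."""
    per_token_bytes = LAYERS * (KV_LORA_RANK + ROPE_DIM) * bytes_per_elem
    return per_token_bytes * CONTEXT_TOKENS / 2**30


bf16 = kv_gib_per_sequence(2)
fp8 = kv_gib_per_sequence(1)
print(f"bf16 KV: {bf16:.1f} GiB per sequence")
print(f"fp8  KV: {fp8:.1f} GiB per sequence ({bf16 / fp8:.0f}x more sequences)")
```

With these assumed dimensions the bf16 figure lands close to the ~8.6 GiB quoted in the sizing section, and halving the element size doubles how many full-length sequences fit in the same KV budget.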

Validation

Loads cleanly through the prime-rl config parser; derived parallelism confirmed (tp=1, prefill EP32 / decode EP16). Not run on hardware yet — first bring-up should confirm the INT4-MoE kernels (machete/Marlin W4A16) perform on Hopper and that the …ForConditionalGeneration base serves text-only cleanly.

🤖 Generated with Claude Code

…config

Add configs/kimi_k27_code_disagg_inference/inference.toml: a prefill/decode disaggregated, wide-EP (tp=1, DP attention + EP experts) serving config for moonshotai/Kimi-K2.7-Code (1T MoE, DeepSeek-V3 MLA, 384 experts, INT4 compressed-tensors) on H200 nodes at 130k context.

Mirrors configs/glm5_disagg_inference (prefill EP32 / decode EP16, 6 nodes / 48 H200) with use_deep_gemm=false (INT4, not FP8) and the native kimi_k2 reasoning/tool parsers. vLLM serves KimiK25ForConditionalGeneration natively; no custom model arch. Validated through the prime-rl config parser.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ions

Set enable_prefix_caching=true (reuse the shared codebase prefix across agentic turns). Document the optimizations the disaggregated launcher already auto-applies (DeepEP all-to-all, FULL_DECODE_ONLY decode cudagraphs, NIXL, FlashMLA) and the opt-in levers (fp8 KV cache via FlashMLA for ~2x decode concurrency, chunked-prefill batch size, EPLB).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

hallerite and others added 3 commits June 15, 2026 22:11

Merge branch 'main' into feat/kimi-k27-code-disagg-config

3c18555

hallerite wants to merge 3 commits into main from feat/kimi-k27-code-disagg-config
