The first public implementation of complexity + VRAM-aware routing for local dual-tier LLM serving.
Adaptive inference routing, honest Blackwell benchmarks, MTP reality check, and the only
open-source implementation of complexity + VRAM routing for consumer GPU dual-tier deployment.
# Full production stack for 200 users — one command
git clone https://github.com/angelnicolasc/Stratum
cd Stratum
cp .env.example .env # fill in HF_TOKEN
bash scripts/start.sh # vLLM + llama.cpp + adaptive router + LiteLLM + OpenWebUI + Grafana

# Or just the routing module, which works with any local LLM stack
pip install gemma4-adaptive-router

from adaptive_router import AdaptiveRouter, RoutingConfig
router = AdaptiveRouter(RoutingConfig(complexity_threshold=0.65, vram_headroom_gb=1.5))
tier = router.route("Prove the Riemann hypothesis step by step") # → "tier_high"
tier = router.route("What time is it in Tokyo?") # → "tier_low"

95% of Gemma 4 deployment repos ignore the VRAM math below. This one doesn't.
| Model | Quant | VRAM (weights) | KV headroom | Viable in vLLM | Verdict |
|---|---|---|---|---|---|
| Gemma 4 26B | BF16 | ~52GB | ❌ | ❌ | Does not fit |
| Gemma 4 26B | AWQ INT4 | ~15GB | ❌ production | No KV headroom | |
| Gemma 4 26B | Q4_K_M (GGUF) | ~16GB | ❌ | llama.cpp only | This repo's tier_high |
| Gemma 4 E4B | BF16 | ~8GB | ✅ ~6-8GB | ✅ | Production primary |
| Gemma 4 E4B | FP8 KV | ~8GB + FP8 | ✅ more | ✅ with workarounds | See §FP8 KV Gotchas |
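
The weight figures in the table follow from simple arithmetic: parameter count times bytes per weight. The sketch below reproduces that estimate. The function name is hypothetical, and the ~4B resident parameter count for E4B is an assumption inferred from the ~8GB row, not a published figure. Real quantized checkpoints come out somewhat larger than the raw estimate (the table lists ~15GB for AWQ INT4), presumably because of scales and layers kept at higher precision.

```python
def weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM for model weights only, in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and runtime overhead.
    """
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters x bytes per parameter = gigabytes
    return params_billion * bytes_per_weight


CARD_GB = 16.0  # the 16GB consumer card this README targets

for name, params_b, bits in [
    ("26B BF16", 26, 16),
    ("26B INT4", 26, 4),
    ("E4B BF16", 4, 16),  # assumption: ~4B resident parameters
]:
    weights = weight_vram_gb(params_b, bits)
    remaining = CARD_GB - weights
    print(f"{name}: ~{weights:.0f} GB weights, {remaining:+.0f} GB left for KV cache")
```

Whatever remains after the weights has to cover KV cache and runtime overhead, which is why a 26B model that nominally "fits" at INT4 still leaves too little headroom for production serving in vLLM.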
The only honest 16GB architecture:
Tier low: Gemma 4 E4B → vLLM (fast, BF16, ~90% of requests)
Tier high: Gemma 4 26B → llama.cpp (quality, ~10% of requests)
Router: adaptive_router decides in < 1ms which tier to use
The piece that doesn't exist in any other public repo.
┌─────────────────────────────────────────────────────────────┐
│ adaptive_router/ │
│ │
│ Layer 1: ComplexityScorer (sub-ms, no inference) │
│ ├── 6 dimensions: math · code · depth · tokens │
│ ├── entities · negation │
│ └── Precompiled regexes — zero cost per call │
│ │
│ Layer 2: VRAMMonitor (daemon thread) │
│ ├── pynvml direct — no subprocess nvidia-smi overhead │
│ ├── Polling at 10-20ms with atomic shared state │
│ └── Circuit: GPU util > 90% → force tier_low │
│ │
│ Layer 3: RoutingDecision (chain of rules) │
│ ├── complexity_rule: score < 0.65 → tier_low │
│ ├── vram_rule: free VRAM < 1.5GB → tier_low │
│ └── sla_rule: EMA latency > 2s → tier_low │
└─────────────────────────────────────────────────────────────┘
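
To make the three layers concrete, here is a minimal sketch of a regex-based scorer feeding a rule chain. It is not the package's implementation: the patterns, weights, and function names are hypothetical, only three of the six dimensions are shown, and the VRAM and latency readings are passed in as plain numbers rather than read from a monitor thread. The thresholds are the ones from the diagram.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns and weights, for illustration only.
# Compiled once at import time so each call only pays for the search.
_SIGNALS = {
    "math": (re.compile(r"\b(prove|theorem|integral|derivative|hypothesis)\b", re.I), 0.4),
    "code": (re.compile(r"\b(function|class|refactor|debug|compile)\b", re.I), 0.3),
    "depth": (re.compile(r"\b(step by step|explain why|compare|trade-?offs?)\b", re.I), 0.3),
}


def complexity_score(prompt: str) -> float:
    """Sum the weights of every dimension that matches, capped at 1.0."""
    score = sum(weight for pattern, weight in _SIGNALS.values() if pattern.search(prompt))
    return min(score, 1.0)


@dataclass(frozen=True)
class Limits:
    complexity_threshold: float = 0.65  # from the diagram
    vram_headroom_gb: float = 1.5       # from the diagram
    sla_latency_s: float = 2.0          # from the diagram


def route(prompt: str, free_vram_gb: float, ema_latency_s: float,
          limits: Limits = Limits()) -> str:
    """Apply the rules in order; the first rule that fires sends the request to tier_low."""
    if complexity_score(prompt) < limits.complexity_threshold:
        return "tier_low"   # complexity_rule
    if free_vram_gb < limits.vram_headroom_gb:
        return "tier_low"   # vram_rule
    if ema_latency_s > limits.sla_latency_s:
        return "tier_low"   # sla_rule
    return "tier_high"


print(route("Prove the Riemann hypothesis step by step", free_vram_gb=4.0, ema_latency_s=0.5))
print(route("What time is it in Tokyo?", free_vram_gb=4.0, ema_latency_s=0.5))
```

Ordering the rules this way means a simple prompt never consults the hardware state, and a complex prompt is downgraded only when the high tier is short on memory or already slow.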
# Deploy as ASGI proxy
pip install gemma4-adaptive-router
python -m adaptive_router.middleware \
--tier-high-url http://llama-cpp:8080/v1 \
--tier-low-url http://vllm:8000/v1 \
--port 9000

→ See adaptive_router/README.md for full documentation.
Numbers require running on your hardware. The methodology and scripts are production-grade. Fill this table by running
python benchmarks/bench.py.
| Engine | TTFT p50 (c=1) | TTFT p50 (c=10) | Multi-user | Notes |
|---|---|---|---|---|
| vLLM E4B BF16 | [run bench.py] | [run bench.py] | ✅ Production | Primary tier_low |
| SGLang E4B BF16 | [run bench.py] | [run bench.py] | ✅ Production | RAG workloads (prefix > 60%) |
| EXL2 E4B | [run bench.py] | ❌ Single-user | TabbyAPI maintainer caveat* | |
| llama.cpp 26B Q4 | [run bench.py] | [run bench.py] | tier_high: quality over speed | |
* TabbyAPI maintainers explicitly state it's not designed for multi-user production. See
docs/BENCHMARKS.md.
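
For readers unfamiliar with the metric, the sketch below shows one way to measure time to first token (TTFT) and take its p50. It is not what bench.py does; the helper names are hypothetical, and a sleeping generator stands in for a streaming response so the example runs without a server.

```python
import statistics
import time
from typing import Callable, Iterable


def ttft_seconds(start_stream: Callable[[], Iterable[str]]) -> float:
    """Time from issuing the request until the first streamed chunk arrives."""
    t0 = time.perf_counter()
    for _ in start_stream():
        return time.perf_counter() - t0  # stop at the first chunk
    raise RuntimeError("stream produced no tokens")


def fake_stream():
    """Stand-in for a streaming completion: ~10 ms of 'prefill', then tokens."""
    time.sleep(0.01)
    yield "first"
    yield "second"


samples = [ttft_seconds(fake_stream) for _ in range(5)]
p50 = statistics.median(samples)  # p50 is the median of the samples
print(f"TTFT p50: {p50 * 1000:.1f} ms over {len(samples)} requests")
```

The c=1 and c=10 columns differ only in how many such requests are in flight at once; the per-request measurement is the same.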
The document that doesn't exist in any other Gemma 4 deployment repo.
| Prefix Overlap | Cache Hit | TTFT Benefit | vs vLLM prefix_cache | Use |
|---|---|---|---|---|
| > 75% | 75-95% | 3-6x faster | ~10-15% better | ✅ SGLang |
| 40-75% | 30-60% | 1.5-2x | Neutral | |
| < 30% | < 20% | < 1.2x | vLLM wins | ❌ Don't use SGLang |
The counterintuitive prompt-structure insight:
# ❌ WRONG — docs before system prompt breaks prefix sharing
prompt = f"{retrieved_docs}\nSystem: {system_prompt}\nUser: {query}"
# ✅ CORRECT — fixed system prompt first, radix tree caches it across requests
prompt = f"System: {system_prompt}\n\nContext: {retrieved_docs}\n\nUser: {query}"

→ See demo: rag-integration/sglang_radixattention_demo.py
→ Run sweep: python benchmarks/radixattention_prefix_sweep.py
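
The effect of prompt ordering can be seen without a GPU. The sketch below compares how much of two requests' prompts is identical from the start under each ordering. It measures shared characters as a rough proxy; the cache itself works on token prefixes, so treat the numbers as illustrative. The sample documents, queries, and helper names are made up for the example.

```python
import os


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the longest common leading substring of two prompts."""
    return len(os.path.commonprefix([a, b]))


system_prompt = "You answer using only the provided context."

# Two requests with different retrieved documents and different questions.
requests = [
    ("Doc A: pricing tiers...", "How much is the pro plan?"),
    ("Doc B: refund policy...", "Can I get a refund?"),
]


def docs_first(docs: str, query: str) -> str:
    return f"{docs}\nSystem: {system_prompt}\nUser: {query}"


def system_first(docs: str, query: str) -> str:
    return f"System: {system_prompt}\n\nContext: {docs}\n\nUser: {query}"


wrong = shared_prefix_chars(*(docs_first(d, q) for d, q in requests))
right = shared_prefix_chars(*(system_first(d, q) for d, q in requests))

print(f"docs first:   {wrong} shared leading characters")
print(f"system first: {right} shared leading characters")
```

With the documents first, the prompts diverge almost immediately, so there is nothing reusable. With the fixed system prompt first, the whole system prompt is a shared prefix across requests.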
| Configuration | Status | Notes |
|---|---|---|
| BF16 + --speculative-config | ✅ Works | Correct pair: E4B + E4B-it-assistant |
| NVFP4 + speculative | ❌ Garbage output | Datacenter-only |
| Marlin + speculative | ❌ -22% throughput | Do not use |
| FP8 KV + speculative (FlashInfer) | Measure with mtp_bench.py | |
Correct syntax (vLLM 0.19+):
# ❌ Old syntax — removed
--speculative-model google/gemma-4-e2b-it
# ✅ Correct — dedicated pair, JSON config
--speculative-config '{"model": "google/gemma-4-e4b-it-assistant", "num_speculative_tokens": 4}'

→ See docs/MTP-SPECULATIVE-DECODING.md
- Linux (Ubuntu 22.04 LTS recommended), CUDA 12.9+
- RTX 5060 Ti 16GB (or any GPU with ≥ 16GB VRAM)
- Docker + Docker Compose + NVIDIA Container Toolkit
- HuggingFace account with Gemma 4 access
git clone https://github.com/angelnicolasc/Stratum
cd Stratum
# On a fresh machine
bash scripts/install-cuda-linux.sh
# Configure
cp .env.example .env
# Edit .env: add your HF_TOKEN
# Download 26B GGUF (tier_high model)
huggingface-cli download bartowski/gemma-4-27b-it-GGUF \
--include "gemma-4-27b-it-Q4_K_M.gguf" \
--local-dir docker/llama-cpp/models
# Start everything
bash scripts/start.sh

bash scripts/health-check.sh
# Quick inference test
curl http://localhost:9000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gemma4-fast","messages":[{"role":"user","content":"What is 2+2?"}],"max_tokens":16}'

Open http://localhost:3000 and use model gemma4-fast for the adaptive router, or gemma4-smart to force the 26B tier.
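
The same smoke test can be issued from Python with only the standard library. This is a sketch that mirrors the curl call above; the helper names are hypothetical, and the response is assumed to be JSON in the usual chat-completions shape. Sending the request needs the stack to be running, so that line is left commented out.

```python
import json
import urllib.request

ROUTER_URL = "http://localhost:9000/v1/chat/completions"  # router port from the curl example


def build_request(model: str, content: str, max_tokens: int = 16) -> urllib.request.Request:
    """Build the same POST request the curl command sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        ROUTER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


req = build_request("gemma4-fast", "What is 2+2?")
# reply = send(req)  # uncomment once scripts/start.sh has brought the stack up
```

Swapping the model name to gemma4-smart exercises the forced 26B path with the same code.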
| Document | What it covers |
|---|---|
| docs/VRAM-REALITY-CHECK.md | GPU VRAM limits, 3 SM120 FP8 KV gotchas, --enforce-eager rationale |
| docs/MTP-SPECULATIVE-DECODING.md | Corrected MTP architecture, SM120 state, honest framing |
| docs/RADIXATTENTION-OPERATIVE-TABLE.md | When SGLang beats vLLM, prompt structure instruction |
| docs/MIGRATION.md | LM Studio → Linux, 4 LiteLLM/llama.cpp bridge gaps |
| docs/VISION-BUDGET.md | Token budgets, flags, VRAM implications |
| docs/HARDWARE-RECOMMENDATIONS.md | Upgrade paths 16GB→48GB, formula |
| docs/BENCHMARKS.md | 4-engine comparison methodology |
| docs/HANDOVER.md | Ops guide, failure modes, recovery |
| adaptive_router/README.md | Standalone router docs: install, configure, extend |
These go in the README because honesty is what makes this repo credible.
- 16GB doesn't fit 26B in vLLM natively. With llama.cpp it works, but throughput is lower. The real fix for 200 users at 26B quality is dual-GPU. The router mitigates the problem; it doesn't eliminate it.
- Blackwell SM120 has rough edges in vLLM. FP8 KV and MTP require SM120-specific workarounds, documented in docs/VRAM-REALITY-CHECK.md.
- EXL2 is not designed for multi-user production. TabbyAPI maintainers say so explicitly. The benchmark includes it with that caveat attached.
- SGLang doesn't always win. The RadixAttention benefit depends on prefix overlap and prompt structure. The operative table documents this: the 29% claim circulates without the context that makes it true.
- MTP on SM120 is not plug-and-play. BF16 + workarounds works. NVFP4 produces garbage. The throughput delta needs empirical measurement on your hardware.
- Cold-start burst on tier_high. With sla_warmup_seed_ms=0.0 (default), the EMA starts optimistic. The first ~15-20 complex queries all hit tier_high before the EMA converges. Set sla_warmup_seed_ms=800 if you know your expected p50 tier_high latency.
- adaptive_router is v0.1. Functional and tested, but not yet proven at thousands of users. Treat it as a solid starting point.
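
The cold-start item is easier to reason about with numbers. The sketch below simulates an exponential moving average of tier_high latency and counts how many requests pass before it crosses the 2 s SLA threshold. The smoothing factor (alpha=0.1) and the 3000 ms observed latency are assumptions chosen for illustration; the router's actual smoothing factor is not stated here, so the counts will not match the ~15-20 figure exactly.

```python
from typing import Optional


def ema_update(previous_ms: float, sample_ms: float, alpha: float) -> float:
    """Standard exponential moving average step."""
    return alpha * sample_ms + (1 - alpha) * previous_ms


def requests_until_trip(seed_ms: float, observed_ms: float,
                        threshold_ms: float = 2000.0, alpha: float = 0.1,
                        limit: int = 1000) -> Optional[int]:
    """Count requests served on tier_high before the EMA exceeds the threshold.

    Returns None if the threshold is never crossed within `limit` requests.
    """
    ema = seed_ms
    for n in range(1, limit + 1):
        ema = ema_update(ema, observed_ms, alpha)
        if ema > threshold_ms:
            return n
    return None


# Assumed scenario: tier_high is overloaded and every request takes 3000 ms.
cold = requests_until_trip(seed_ms=0.0, observed_ms=3000.0)
warm = requests_until_trip(seed_ms=800.0, observed_ms=3000.0)
print(f"seed 0 ms:   SLA rule trips after {cold} requests")
print(f"seed 800 ms: SLA rule trips after {warm} requests")
```

A seed closer to the real latency shortens the window in which slow complex queries keep landing on tier_high, which is the reason to set sla_warmup_seed_ms when the expected p50 is known.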
These are the measurements this repo needs from the community and from running the hardware.
| Gap | Script | Document |
|---|---|---|
| Startup time with vs without --enforce-eager on SM120 | time docker compose up | docs/VRAM-REALITY-CHECK.md |
| FP8 KV VRAM savings in E4B (FlashInfer, no --calculate-kv-scales) | rag-integration/vision_demo.py | docs/VRAM-REALITY-CHECK.md |
| MTP throughput delta with BF16 + speculative-config on SM120 | benchmarks/mtp_bench.py | docs/MTP-SPECULATIVE-DECODING.md |
| EXL2 throughput on E4B at RTX 5060 Ti | benchmarks/bench.py --engine exl2_e4b | docs/BENCHMARKS.md |
| RadixAttention prefix sweep on this hardware | benchmarks/radixattention_prefix_sweep.py | docs/RADIXATTENTION-OPERATIVE-TABLE.md |
Issues, PRs, and benchmark results from different hardware are welcome. The most valuable contribution is running bench.py or radixattention_prefix_sweep.py on your hardware and sharing the results.
Apache-2.0. See LICENSE.
Built May 2026. Corrects v1 architectural errors (MTP cross-tier, FP8 KV flags, SGLang 29% claim without context, --enforce-eager). Introduces adaptive_router, EXL2 benchmark, RadixAttention operative table, Vision budget, and LiteLLM/llama.cpp bridge gaps.