angelnicolasc/Stratum

Gemma 4 Production Blueprint 2026

The first public implementation of complexity + VRAM-aware routing for local dual-tier LLM serving on consumer GPUs.
Adaptive inference routing, honest Blackwell benchmarks, and an MTP reality check.

CI · PyPI · Python 3.10+ · vLLM 0.19+ · License: Apache-2.0


TL;DR

# Full production stack for 200 users — one command
git clone https://github.com/angelnicolasc/Stratum
cd Stratum
cp .env.example .env   # fill in HF_TOKEN
bash scripts/start.sh  # vLLM + llama.cpp + adaptive router + LiteLLM + OpenWebUI + Grafana
# Or just the routing module, which works with any local LLM stack
pip install gemma4-adaptive-router

# Then, from Python:
from adaptive_router import AdaptiveRouter, RoutingConfig

router = AdaptiveRouter(RoutingConfig(complexity_threshold=0.65, vram_headroom_gb=1.5))
tier = router.route("Prove the Riemann hypothesis step by step")  # → "tier_high"
tier = router.route("What time is it in Tokyo?")                  # → "tier_low"

VRAM Reality Check — Read This First

Most Gemma 4 deployment repos ignore this. This one doesn't.

| Model | Quant | VRAM (weights) | KV headroom | Viable in vLLM | Verdict |
|---|---|---|---|---|---|
| Gemma 4 26B | BF16 | ~52GB | | | Does not fit |
| Gemma 4 26B | AWQ INT4 | ~15GB | ⚠️ ~1GB | ❌ production | No KV headroom |
| Gemma 4 26B | Q4_K_M (GGUF) | ~16GB | | llama.cpp only | This repo's tier_high |
| Gemma 4 E4B | BF16 | ~8GB | ✅ ~6-8GB | | Production primary |
| Gemma 4 E4B | FP8 KV | ~8GB + FP8 | ✅ more | ✅ with workarounds | See §FP8 KV Gotchas |
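
The weight figures in the table follow from parameter count times bytes per parameter. Below is a rough sketch of that arithmetic. The helper names are illustrative and not part of this repo, and real quantized checkpoints carry scales and some unquantized layers, so the quantized result is a lower bound rather than a prediction.

```python
def weight_vram_gb(params_billion: float, bits_per_param: float) -> float:
    """Lower-bound VRAM for the weights alone, in decimal GB."""
    return params_billion * bits_per_param / 8


def kv_headroom_gb(gpu_total_gb: float, weights_gb: float) -> float:
    """VRAM left for KV cache and activations once the weights are loaded."""
    return gpu_total_gb - weights_gb


bf16 = weight_vram_gb(26, 16)  # 26B parameters at 16 bits each
int4 = weight_vram_gb(26, 4)   # same model at 4 bits, before quantization overhead

print(f"26B BF16 weights: {bf16:.0f} GB")
print(f"26B INT4 weights: at least {int4:.0f} GB")
# Using the table's ~15GB figure for AWQ INT4 on a 16GB card:
print(f"KV headroom on a 16 GB card: {kv_headroom_gb(16, 15):.0f} GB")
```

The gap between the 13 GB lower bound and the table's ~15GB is what leaves only about 1GB for KV cache on a 16GB card.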

The only honest 16GB architecture:

Tier low:   Gemma 4 E4B  → vLLM  (fast, BF16, ~90% of requests)
Tier high:  Gemma 4 26B  → llama.cpp (quality, ~10% of requests)
Router:     adaptive_router decides in < 1ms which tier to use

Adaptive Router

The piece that doesn't exist in any other public repo.

┌─────────────────────────────────────────────────────────────┐
│                      adaptive_router/                       │
│                                                             │
│  Layer 1: ComplexityScorer (sub-ms, no inference)          │
│  ├── 6 dimensions: math · code · depth · tokens            │
│  ├── entities · negation                                    │
│  └── Precompiled regexes — zero cost per call              │
│                                                             │
│  Layer 2: VRAMMonitor (daemon thread)                       │
│  ├── pynvml direct — no subprocess nvidia-smi overhead     │
│  ├── Polling at 10-20ms with atomic shared state           │
│  └── Circuit: GPU util > 90% → force tier_low              │
│                                                             │
│  Layer 3: RoutingDecision (chain of rules)                  │
│  ├── complexity_rule: score < 0.65 → tier_low              │
│  ├── vram_rule: free VRAM < 1.5GB → tier_low               │
│  └── sla_rule: EMA latency > 2s → tier_low                 │
└─────────────────────────────────────────────────────────────┘
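
The rule chain in Layer 3 can be pictured as an ordered list of checks where the first one that fires sends the request to tier_low. This is a minimal sketch using the thresholds from the diagram, not the module's actual code: the `Signals` class, the `route` function, and the rule order (circuit breaker first) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Inputs to one routing decision (field names are illustrative)."""
    complexity: float     # 0.0-1.0 score from the complexity scorer
    free_vram_gb: float   # latest reading from the VRAM monitor
    gpu_util_pct: float   # latest GPU utilization reading
    ema_latency_s: float  # moving average of tier_high latency


def route(s: Signals,
          complexity_threshold: float = 0.65,
          vram_headroom_gb: float = 1.5,
          sla_latency_s: float = 2.0,
          max_gpu_util_pct: float = 90.0) -> tuple[str, str]:
    """Return (tier, reason). The first rule that fires wins."""
    rules = [
        # Assumed order: the hardware circuit breaker is checked first.
        ("gpu_circuit", s.gpu_util_pct > max_gpu_util_pct),
        ("complexity_rule", s.complexity < complexity_threshold),
        ("vram_rule", s.free_vram_gb < vram_headroom_gb),
        ("sla_rule", s.ema_latency_s > sla_latency_s),
    ]
    for name, fires in rules:
        if fires:
            return "tier_low", name
    # Only a complex request on a healthy GPU reaches the large model.
    return "tier_high", "all_rules_passed"


print(route(Signals(complexity=0.9, free_vram_gb=4.0, gpu_util_pct=50.0, ema_latency_s=0.5)))
print(route(Signals(complexity=0.9, free_vram_gb=1.0, gpu_util_pct=50.0, ema_latency_s=0.5)))
```

Returning the reason alongside the tier makes it easy to log why a complex query was downgraded.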
# Deploy as ASGI proxy
pip install gemma4-adaptive-router
python -m adaptive_router.middleware \
    --tier-high-url http://llama-cpp:8080/v1 \
    --tier-low-url  http://vllm:8000/v1 \
    --port 9000

→ See adaptive_router/README.md for full documentation.
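
For readers who want a feel for Layer 2 before opening the module docs, here is a hedged sketch of a polling monitor. It uses real `pynvml` calls (`nvmlInit`, `nvmlDeviceGetHandleByIndex`, `nvmlDeviceGetMemoryInfo`, `nvmlDeviceGetUtilizationRates`), but the class name `VRAMMonitorSketch`, its methods, and the injectable reader are assumptions and may differ from the shipped `VRAMMonitor`.

```python
import threading
from typing import Callable, Optional, Tuple

Reading = Tuple[float, float]  # (free VRAM in GB, GPU utilization in percent)


def pynvml_reader(device_index: int = 0) -> Callable[[], Reading]:
    """Build a reader backed by NVML. Needs the pynvml package and an NVIDIA driver."""
    import pynvml  # imported lazily so the rest of the sketch runs without a GPU

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)

    def read() -> Reading:
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        return mem.free / 1e9, float(util.gpu)

    return read


class VRAMMonitorSketch:
    """Polls a reader in a daemon thread and publishes the latest reading."""

    def __init__(self, reader: Callable[[], Reading], interval_s: float = 0.02):
        self._reader = reader
        self._interval_s = interval_s
        self._latest: Optional[Reading] = None
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def poll_once(self) -> Reading:
        # Rebinding one attribute to an immutable tuple means a reader in
        # CPython sees a complete (free_gb, util_pct) pair, never half of one.
        self._latest = self._reader()
        return self._latest

    def _run(self) -> None:
        while not self._stop.is_set():
            self.poll_once()
            self._stop.wait(self._interval_s)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join(timeout=1.0)

    @property
    def latest(self) -> Optional[Reading]:
        return self._latest
```

On a machine with a GPU, `VRAMMonitorSketch(pynvml_reader()).start()` would begin polling; passing a fake reader keeps the class testable without hardware.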


Benchmarks

The methodology and scripts are production-grade, but the numbers have to come from your hardware. Fill in this table by running python benchmarks/bench.py.

| Engine | TTFT p50 (c=1) | TTFT p50 (c=10) | Multi-user | Notes |
|---|---|---|---|---|
| vLLM E4B BF16 | [run bench.py] | [run bench.py] | ✅ Production | Primary tier_low |
| SGLang E4B BF16 | [run bench.py] | [run bench.py] | ✅ Production | RAG workloads (prefix > 60%) |
| EXL2 E4B | [run bench.py] | ⚠️ degrades | ❌ Single-user | TabbyAPI maintainer caveat* |
| llama.cpp 26B Q4 | [run bench.py] | [run bench.py] | ⚠️ Low throughput | tier_high: quality over speed |

* TabbyAPI maintainers explicitly state it's not designed for multi-user production. See docs/BENCHMARKS.md.
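
For reference, TTFT p50 is the median time from issuing a request to receiving its first streamed token. The sketch below shows one way to compute it. The function names are illustrative and this is not necessarily how benchmarks/bench.py measures it.

```python
import statistics
import time
from typing import Callable, Iterable


def time_to_first_token(stream: Iterable[str],
                        clock: Callable[[], float] = time.perf_counter) -> float:
    """Seconds from starting to consume a lazy token stream until the first token arrives."""
    start = clock()
    for _ in stream:
        return clock() - start
    raise ValueError("stream produced no tokens")


def ttft_p50(samples: list[float]) -> float:
    """p50 is the median of the per-request TTFT samples."""
    return statistics.median(samples)


print(ttft_p50([0.21, 0.35, 0.28]))
```

The timer only captures the full request latency if the stream is lazy, meaning the request is actually sent when iteration begins.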


RadixAttention Operative Table

The document that doesn't exist in any other Gemma 4 deployment repo.

| Prefix Overlap | Cache Hit | TTFT Benefit | vs vLLM prefix_cache | Use |
|---|---|---|---|---|
| > 75% | 75-95% | 3-6x faster | ~10-15% better | ✅ SGLang |
| 40-75% | 30-60% | 1.5-2x | Neutral | ⚠️ Benchmark both |
| < 30% | < 20% | < 1.2x | vLLM wins | ❌ Don't use SGLang |

The counterintuitive prompt-structure insight:

# ❌ WRONG — docs before system prompt breaks prefix sharing
prompt = f"{retrieved_docs}\nSystem: {system_prompt}\nUser: {query}"

# ✅ CORRECT — fixed system prompt first, radix tree caches it across requests
prompt = f"System: {system_prompt}\n\nContext: {retrieved_docs}\n\nUser: {query}"

→ See demo: rag-integration/sglang_radixattention_demo.py
→ Run sweep: python benchmarks/radixattention_prefix_sweep.py
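
To see why prompt ordering matters for prefix caching, it helps to measure how much of the prompt two requests share from the start. The sketch below compares characters for simplicity, while real caches match on tokens. The sample prompts and the helper name are made up for illustration.

```python
import os


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


system_prompt = "You are a support assistant. Answer only from the context."
requests = [
    ("Refund policy: 30 days.", "Can I get a refund?"),
    ("Shipping takes 5 days.", "When will my order arrive?"),
]

# Retrieved docs first: the prompts diverge at the very first character.
docs_first = [f"{docs}\nSystem: {system_prompt}\nUser: {q}" for docs, q in requests]
# Fixed system prompt first: every request shares the same leading span.
system_first = [f"System: {system_prompt}\n\nContext: {docs}\n\nUser: {q}" for docs, q in requests]

print("docs first:  ", shared_prefix_len(*docs_first))
print("system first:", shared_prefix_len(*system_first))
```

With the system prompt first, the whole fixed preamble is reusable across requests. With retrieved documents first, nothing is.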


MTP Reality Check — SM120

| Configuration | Status | Notes |
|---|---|---|
| BF16 + --speculative-config | ✅ Works | Correct pair: E4B + E4B-it-assistant |
| NVFP4 + speculative | ❌ Garbage output | Datacenter-only |
| Marlin + speculative | ❌ -22% throughput | Do not use |
| FP8 KV + speculative (FlashInfer) | ⚠️ Experimental | Measure with mtp_bench.py |

Correct syntax (vLLM 0.19+):

# ❌ Old syntax — removed
--speculative-model google/gemma-4-e2b-it

# ✅ Correct — dedicated pair, JSON config
--speculative-config '{"model": "google/gemma-4-e4b-it-assistant", "num_speculative_tokens": 4}'

→ See docs/MTP-SPECULATIVE-DECODING.md
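
Because the flag takes a JSON string, hand-written quoting is an easy place to slip. A small sketch that builds the argument programmatically, using only the two keys shown above (how you pass the result to your launcher is up to you):

```python
import json
import shlex

spec_config = {
    "model": "google/gemma-4-e4b-it-assistant",
    "num_speculative_tokens": 4,
}

# json.dumps guarantees valid JSON; shlex.quote makes it safe to paste into a shell.
flag = "--speculative-config " + shlex.quote(json.dumps(spec_config))
print(flag)
```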


Quick Start

Prerequisites

  • Linux (Ubuntu 22.04 LTS recommended), CUDA 12.9+
  • RTX 5060 Ti 16GB (or any GPU with ≥ 16GB VRAM)
  • Docker + Docker Compose + NVIDIA Container Toolkit
  • HuggingFace account with Gemma 4 access

Install

git clone https://github.com/angelnicolasc/Stratum
cd Stratum

# On a fresh machine
bash scripts/install-cuda-linux.sh

# Configure
cp .env.example .env
# Edit .env: add your HF_TOKEN

# Download 26B GGUF (tier_high model)
huggingface-cli download bartowski/gemma-4-27b-it-GGUF \
    --include "gemma-4-27b-it-Q4_K_M.gguf" \
    --local-dir docker/llama-cpp/models

# Start everything
bash scripts/start.sh

Verify

bash scripts/health-check.sh

# Quick inference test
curl http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma4-fast","messages":[{"role":"user","content":"What is 2+2?"}],"max_tokens":16}'
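
The same check can be scripted from Python with only the standard library. This is a sketch that mirrors the curl call above. The helper names are illustrative, and `ask` assumes the router returns an OpenAI-style response body with `choices[0].message.content`.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       model: str = "gemma4-fast",
                       base_url: str = "http://localhost:9000/v1",
                       max_tokens: int = 16) -> urllib.request.Request:
    """Build the same POST as the curl test."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout_s: float = 30.0) -> str:
    """Send the request; assumes an OpenAI-style response body."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout_s) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("What is 2+2?")` requires the stack to be running on port 9000.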

OpenWebUI

Open http://localhost:3000 — use model gemma4-fast for the adaptive router, or gemma4-smart to force the 26B tier.


Documentation

| Document | What it covers |
|---|---|
| docs/VRAM-REALITY-CHECK.md | GPU VRAM limits, 3 SM120 FP8 KV gotchas, --enforce-eager rationale |
| docs/MTP-SPECULATIVE-DECODING.md | Corrected MTP architecture, SM120 state, honest framing |
| docs/RADIXATTENTION-OPERATIVE-TABLE.md | When SGLang beats vLLM, prompt structure instruction |
| docs/MIGRATION.md | LM Studio → Linux, 4 LiteLLM/llama.cpp bridge gaps |
| docs/VISION-BUDGET.md | Token budgets, flags, VRAM implications |
| docs/HARDWARE-RECOMMENDATIONS.md | Upgrade paths 16GB→48GB, formula |
| docs/BENCHMARKS.md | 4-engine comparison methodology |
| docs/HANDOVER.md | Ops guide, failure modes, recovery |
| adaptive_router/README.md | Standalone router docs: install, configure, extend |

Known Limitations

These go in the README because honesty is what makes this repo credible.

  1. 16GB doesn't fit 26B in vLLM natively. With llama.cpp it works, but throughput is lower. The real fix for 200 users at 26B quality is dual-GPU. The router mitigates the problem; it doesn't eliminate it.

  2. Blackwell SM120 has rough edges in vLLM. FP8 KV and MTP require workarounds specific to SM120. Documented exactly in docs/VRAM-REALITY-CHECK.md.

  3. EXL2 is not designed for multi-user production. TabbyAPI maintainers say so explicitly. The benchmark includes it with that caveat, not without.

  4. SGLang doesn't always win. The RadixAttention benefit depends on prefix overlap and prompt structure. The operative table documents this — the 29% claim circulates without the context that makes it true.

  5. MTP on SM120 is not plug-and-play. BF16 with workarounds works. NVFP4 produces garbage. The throughput delta needs empirical measurement on your hardware.

  6. Cold-start burst on tier_high. With sla_warmup_seed_ms=0.0 (default), the EMA starts optimistic. The first ~15-20 complex queries all hit tier_high before the EMA converges. Set sla_warmup_seed_ms=800 if you know your expected p50 tier_high latency.

  7. adaptive_router is v0.1. It is functional and tested, but it has not been exercised at thousands of users. Treat it as a solid starting point, not a battle-hardened system.
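
The cold-start behaviour in item 6 is easy to reproduce with a toy simulation. The smoothing factor (`alpha=0.05`) and the observed tier_high latency (3500 ms) below are assumed values chosen for illustration. The router's real EMA parameters are not stated here, so treat the counts as a demonstration of the effect rather than a prediction.

```python
def queries_until_sla_trips(seed_ms: float,
                            observed_ms: float = 3500.0,
                            sla_ms: float = 2000.0,
                            alpha: float = 0.05,
                            max_queries: int = 1000) -> int:
    """Count tier_high queries served before the latency EMA exceeds the SLA."""
    ema = seed_ms
    for n in range(1, max_queries + 1):
        ema += alpha * (observed_ms - ema)  # standard exponential moving average update
        if ema > sla_ms:
            return n
    return max_queries


print("seed 0 ms:  ", queries_until_sla_trips(seed_ms=0.0))
print("seed 800 ms:", queries_until_sla_trips(seed_ms=800.0))
```

Under these assumptions the unseeded EMA lets 17 slow queries through before the 2s rule fires, while seeding at 800 ms cuts that to 12. A seed closer to the real latency shortens the burst further.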


Empirical Gaps

These are the measurements this repo needs from the community and from running the hardware.

| Gap | Script | Document |
|---|---|---|
| Startup time with vs without --enforce-eager on SM120 | time docker compose up | docs/VRAM-REALITY-CHECK.md |
| FP8 KV VRAM savings in E4B (FlashInfer, no --calculate-kv-scales) | rag-integration/vision_demo.py | docs/VRAM-REALITY-CHECK.md |
| MTP throughput delta with BF16 + speculative-config on SM120 | benchmarks/mtp_bench.py | docs/MTP-SPECULATIVE-DECODING.md |
| EXL2 throughput on E4B at RTX 5060 Ti | benchmarks/bench.py --engine exl2_e4b | docs/BENCHMARKS.md |
| RadixAttention prefix sweep on this hardware | benchmarks/radixattention_prefix_sweep.py | docs/RADIXATTENTION-OPERATIVE-TABLE.md |

Contributing

Issues, PRs, and benchmark results from different hardware are welcome. The most valuable contribution is running bench.py or radixattention_prefix_sweep.py on your hardware and sharing the results.

License

Apache-2.0. See LICENSE.


Built May 2026. Corrects v1 architectural errors (MTP cross-tier, FP8 KV flags, SGLang 29% claim without context, --enforce-eager). Introduces adaptive_router, EXL2 benchmark, RadixAttention operative table, Vision budget, and LiteLLM/llama.cpp bridge gaps.

About

Adaptive dual-tier serving for Gemma 4 on consumer 16GB GPUs. Complexity + real-time VRAM routing between vLLM E4B and llama.cpp 26B. Production stack with OpenWebUI, monitoring, and more.
