Gemma 4 Production Blueprint 2026

The first public, open-source implementation of complexity- and VRAM-aware routing for local dual-tier LLM serving on consumer GPUs.
Also included: honest Blackwell benchmarks and an MTP reality check.

TL;DR

# Full production stack for 200 users — one command
git clone https://github.com/angelnicolasc/Stratum
cd Stratum
cp .env.example .env   # fill in HF_TOKEN
bash scripts/start.sh  # vLLM + llama.cpp + adaptive router + LiteLLM + OpenWebUI + Grafana

# Or just the routing module — works with any local LLM stack
pip install gemma4-adaptive-router

from adaptive_router import AdaptiveRouter, RoutingConfig

router = AdaptiveRouter(RoutingConfig(complexity_threshold=0.65, vram_headroom_gb=1.5))
tier = router.route("Prove the Riemann hypothesis step by step")  # → "tier_high"
tier = router.route("What time is it in Tokyo?")                  # → "tier_low"

VRAM Reality Check — Read This First

Most Gemma 4 deployment repos ignore this. This one doesn't.

Model	Quant	VRAM (weights)	KV headroom	Viable in vLLM	Verdict
Gemma 4 26B	BF16	~52GB	❌	❌	Does not fit
Gemma 4 26B	AWQ INT4	~15GB	⚠️ ~1GB	❌ production	No KV headroom
Gemma 4 26B	Q4_K_M (GGUF)	~16GB	❌	llama.cpp only	This repo's tier_high
Gemma 4 E4B	BF16	~8GB	✅ ~6-8GB	✅	Production primary
Gemma 4 E4B	FP8 KV	~8GB + FP8	✅ more	✅ with workarounds	See §FP8 KV Gotchas
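
The weight column above is roughly parameter count times bits per weight. Below is a minimal sketch of that arithmetic, not code from this repo. It assumes decimal gigabytes, a nominal 4B parameter count for E4B, about 4.85 bits per weight for Q4_K_M (an approximate average), and 1GB of runtime overhead. `weight_vram_gb` and `kv_headroom_gb` are hypothetical helper names.

```python
def weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM needed for the weights alone, in decimal GB."""
    return params_billion * bits_per_weight / 8


def kv_headroom_gb(total_gb: float, weights_gb: float,
                   runtime_overhead_gb: float = 1.0) -> float:
    """VRAM left for KV cache after weights and an assumed runtime overhead."""
    return total_gb - weights_gb - runtime_overhead_gb


if __name__ == "__main__":
    rows = [
        ("26B BF16", 26, 16),
        ("26B INT4", 26, 4),
        ("26B Q4_K_M", 26, 4.85),  # assumed average bits per weight
        ("E4B BF16", 4, 16),
    ]
    for name, params, bits in rows:
        weights = weight_vram_gb(params, bits)
        headroom = kv_headroom_gb(16, weights)
        print(f"{name}: ~{weights:.1f}GB weights, {headroom:+.1f}GB headroom on a 16GB card")
```

The raw INT4 figure comes out near 13GB, below the table's ~15GB. The gap is presumably quantization metadata and layers kept at higher precision, which this sketch does not model.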

The only honest 16GB architecture:

Tier low:   Gemma 4 E4B  → vLLM  (fast, BF16, ~90% of requests)
Tier high:  Gemma 4 26B  → llama.cpp (quality, ~10% of requests)
Router:     adaptive_router decides in < 1ms which tier to use

Adaptive Router

The piece that doesn't exist in any other public repo.

┌─────────────────────────────────────────────────────────────┐
│                      adaptive_router/                       │
│                                                             │
│  Layer 1: ComplexityScorer (sub-ms, no inference)          │
│  ├── 6 dimensions: math · code · depth · tokens            │
│  ├── entities · negation                                    │
│  └── Precompiled regexes — zero cost per call              │
│                                                             │
│  Layer 2: VRAMMonitor (daemon thread)                       │
│  ├── pynvml direct — no subprocess nvidia-smi overhead     │
│  ├── Polling at 10-20ms with atomic shared state           │
│  └── Circuit: GPU util > 90% → force tier_low              │
│                                                             │
│  Layer 3: RoutingDecision (chain of rules)                  │
│  ├── complexity_rule: score < 0.65 → tier_low              │
│  ├── vram_rule: free VRAM < 1.5GB → tier_low               │
│  └── sla_rule: EMA latency > 2s → tier_low                 │
└─────────────────────────────────────────────────────────────┘
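
Layer 1 in the diagram can be pictured as a handful of regexes compiled once and combined into a single score. The sketch below illustrates that idea only; it is not the actual ComplexityScorer. The patterns, the weights, the 50-word cutoff, and the name `complexity_score` are all assumptions chosen for the example.

```python
import re

# Hypothetical patterns, compiled once at import time so each call only runs searches.
_PATTERNS = {
    "math": re.compile(r"\b(?:prove|theorem|lemma|integral|derivative|equation)\b", re.I),
    "code": re.compile(r"\b(?:def|class|function|traceback|refactor)\b", re.I),
    "depth": re.compile(r"\b(?:step by step|explain why|compare|trade-?offs?)\b", re.I),
    "entities": re.compile(r"\b[A-Z][a-z]+\b"),
    "negation": re.compile(r"\b(?:not|never|without|unless|except)\b", re.I),
}

# Hypothetical weights; "tokens" is a length signal rather than a regex.
_WEIGHTS = {"math": 0.5, "code": 0.5, "depth": 0.3,
            "tokens": 0.2, "entities": 0.1, "negation": 0.1}
_LONG_PROMPT_WORDS = 50


def complexity_score(prompt: str) -> float:
    """Return a score in [0, 1]; higher means the prompt looks harder."""
    hits = {name for name, pattern in _PATTERNS.items() if pattern.search(prompt)}
    if len(prompt.split()) > _LONG_PROMPT_WORDS:
        hits.add("tokens")
    return min(1.0, sum((_WEIGHTS[name] for name in hits), 0.0))


if __name__ == "__main__":
    for text in ("Prove the Riemann hypothesis step by step",
                 "What time is it in Tokyo?"):
        tier = "tier_high" if complexity_score(text) >= 0.65 else "tier_low"
        print(f"{tier}: {text}")
```

With these assumed weights the two TL;DR prompts land on opposite sides of the 0.65 threshold. A real scorer would need weights tuned against labeled traffic.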

# Deploy as ASGI proxy
pip install gemma4-adaptive-router
python -m adaptive_router.middleware \
    --tier-high-url http://llama-cpp:8080/v1 \
    --tier-low-url  http://vllm:8000/v1 \
    --port 9000

→ See adaptive_router/README.md for full documentation.
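
The Layer 3 rules read as an ordered chain in which the first rule that fires sends the request to tier_low, and only a request that passes every rule reaches tier_high. Here is a minimal sketch under that reading. The thresholds come from the diagram; `Snapshot`, `RULES`, and `decide` are hypothetical names, and the real module may structure this differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Inputs to one routing decision (hypothetical structure)."""
    complexity: float      # Layer 1 score in [0, 1]
    free_vram_gb: float    # from the VRAM monitor
    gpu_util_pct: float    # from the VRAM monitor
    ema_latency_s: float   # smoothed tier_high latency


# Each rule returns True when the request should fall back to tier_low.
RULES = [
    ("complexity_rule", lambda s: s.complexity < 0.65),
    ("vram_rule", lambda s: s.free_vram_gb < 1.5),
    ("gpu_util_circuit", lambda s: s.gpu_util_pct > 90.0),
    ("sla_rule", lambda s: s.ema_latency_s > 2.0),
]


def decide(snapshot: Snapshot) -> tuple[str, str]:
    """Return (tier, name of the rule that decided)."""
    for name, fires in RULES:
        if fires(snapshot):
            return "tier_low", name
    return "tier_high", "default"


if __name__ == "__main__":
    print(decide(Snapshot(0.9, 4.0, 50.0, 0.5)))  # hard prompt, healthy GPU
    print(decide(Snapshot(0.9, 1.0, 50.0, 0.5)))  # hard prompt, low free VRAM
```

Returning the rule name alongside the tier makes each decision easy to log and chart.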

Benchmarks

These numbers must be measured on your own hardware, so the table ships empty. Fill it in by running python benchmarks/bench.py; the methodology is documented in docs/BENCHMARKS.md.

Engine	TTFT p50 (c=1)	TTFT p50 (c=10)	Multi-user	Notes
vLLM E4B BF16	`[run bench.py]`	`[run bench.py]`	✅ Production	Primary tier_low
SGLang E4B BF16	`[run bench.py]`	`[run bench.py]`	✅ Production	RAG workloads (prefix > 60%)
EXL2 E4B	`[run bench.py]`	⚠️ degrades	❌ Single-user	TabbyAPI maintainer caveat*
llama.cpp 26B Q4	`[run bench.py]`	`[run bench.py]`	⚠️ Low throughput	tier_high — quality over speed

* TabbyAPI maintainers explicitly state it's not designed for multi-user production. See docs/BENCHMARKS.md.

RadixAttention Operative Table

The document that doesn't exist in any other Gemma 4 deployment repo.

Prefix Overlap	Cache Hit	TTFT Benefit	vs vLLM prefix_cache	Use
> 75%	75-95%	3-6x faster	~10-15% better	✅ SGLang
40-75%	30-60%	1.5-2x	Neutral	⚠️ Benchmark both
< 30%	< 20%	< 1.2x	vLLM wins	❌ Don't use SGLang

The counterintuitive prompt-structure insight:

# ❌ WRONG — docs before system prompt breaks prefix sharing
prompt = f"{retrieved_docs}\nSystem: {system_prompt}\nUser: {query}"

# ✅ CORRECT — fixed system prompt first, radix tree caches it across requests
prompt = f"System: {system_prompt}\n\nContext: {retrieved_docs}\n\nUser: {query}"
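
To see why the ordering matters, compare how much of the prompt two different requests share from the first character. The sketch below uses character-level prefix length as a rough stand-in for the token-level prefix a radix cache would match. The system prompt, documents, and helper names are made up for illustration.

```python
import os

SYSTEM_PROMPT = "You answer questions using only the provided context."


def docs_first(docs: str, query: str) -> str:
    return f"{docs}\nSystem: {SYSTEM_PROMPT}\nUser: {query}"


def system_first(docs: str, query: str) -> str:
    return f"System: {SYSTEM_PROMPT}\n\nContext: {docs}\n\nUser: {query}"


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the common leading substring of two prompts."""
    return len(os.path.commonprefix([a, b]))


# Two requests with different retrieved documents and different questions.
REQUESTS = [
    ("Refund policy: 30 days.", "Can I return this?"),
    ("Shipping policy: 5 days.", "When will it arrive?"),
]

if __name__ == "__main__":
    for build in (docs_first, system_first):
        first, second = (build(docs, query) for docs, query in REQUESTS)
        print(f"{build.__name__}: {shared_prefix_chars(first, second)} shared characters")
```

With documents first, the prompts diverge at the first character and nothing is reusable. With the system prompt first, the whole fixed header is shared.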

→ See demo: rag-integration/sglang_radixattention_demo.py
→ Run sweep: python benchmarks/radixattention_prefix_sweep.py

MTP Reality Check — SM120

Configuration	Status	Notes
BF16 + `--speculative-config`	✅ Works	Correct pair: E4B + E4B-it-assistant
NVFP4 + speculative	❌ Garbage output	Datacenter-only
Marlin + speculative	❌ -22% throughput	Do not use
FP8 KV + speculative (FlashInfer)	⚠️ Experimental	Measure with `mtp_bench.py`

Correct syntax (vLLM 0.19+):

# ❌ Old syntax — removed
--speculative-model google/gemma-4-e2b-it

# ✅ Correct — dedicated pair, JSON config
--speculative-config '{"model": "google/gemma-4-e4b-it-assistant", "num_speculative_tokens": 4}'
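
If the launch command is generated from Python, building the JSON with `json.dumps` avoids hand-quoting mistakes. A small sketch, assuming only the flag and keys shown above; `speculative_config_args` is a hypothetical helper, not part of vLLM or this repo.

```python
import json
import shlex


def speculative_config_args(draft_model: str, num_speculative_tokens: int) -> list[str]:
    """Build the flag and its JSON value as separate argv entries."""
    config = {"model": draft_model, "num_speculative_tokens": num_speculative_tokens}
    return ["--speculative-config", json.dumps(config)]


args = speculative_config_args("google/gemma-4-e4b-it-assistant", 4)

if __name__ == "__main__":
    # shlex.join applies shell quoting, so the result can be pasted into a script.
    print(shlex.join(args))
```

Passing `args` as list elements to `subprocess.run` sidesteps shell quoting entirely.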

→ See docs/MTP-SPECULATIVE-DECODING.md

Quick Start

Prerequisites

Linux (Ubuntu 22.04 LTS recommended), CUDA 12.9+
RTX 5060 Ti 16GB (or any GPU with ≥ 16GB VRAM)
Docker + Docker Compose + NVIDIA Container Toolkit
HuggingFace account with Gemma 4 access

Install

git clone https://github.com/angelnicolasc/Stratum
cd Stratum

# On a fresh machine
bash scripts/install-cuda-linux.sh

# Configure
cp .env.example .env
# Edit .env: add your HF_TOKEN

# Download 26B GGUF (tier_high model)
huggingface-cli download bartowski/gemma-4-27b-it-GGUF \
    --include "gemma-4-27b-it-Q4_K_M.gguf" \
    --local-dir docker/llama-cpp/models

# Start everything
bash scripts/start.sh

Verify

bash scripts/health-check.sh

# Quick inference test
curl http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma4-fast","messages":[{"role":"user","content":"What is 2+2?"}],"max_tokens":16}'
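
The same check can be scripted with the Python standard library. This is a sketch that mirrors the curl call above. It assumes the router returns an OpenAI-style response body, and `build_chat_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 16) -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_chat_request("http://localhost:9000", "gemma4-fast", "What is 2+2?")

# Uncomment once the stack from scripts/start.sh is running.
# reply = send(request)
# print(reply["choices"][0]["message"]["content"])  # assumes an OpenAI-style body
```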

OpenWebUI

Open http://localhost:3000 — use model gemma4-fast for the adaptive router, or gemma4-smart to force the 26B tier.

Documentation

Document	What it covers
`docs/VRAM-REALITY-CHECK.md`	GPU VRAM limits, 3 SM120 FP8 KV gotchas, `--enforce-eager` rationale
`docs/MTP-SPECULATIVE-DECODING.md`	Corrected MTP architecture, SM120 state, honest framing
`docs/RADIXATTENTION-OPERATIVE-TABLE.md`	When SGLang beats vLLM, prompt structure instruction
`docs/MIGRATION.md`	LM Studio → Linux, 4 LiteLLM/llama.cpp bridge gaps
`docs/VISION-BUDGET.md`	Token budgets, flags, VRAM implications
`docs/HARDWARE-RECOMMENDATIONS.md`	Upgrade paths 16GB→48GB, formula
`docs/BENCHMARKS.md`	4-engine comparison methodology
`docs/HANDOVER.md`	Ops guide, failure modes, recovery
`adaptive_router/README.md`	Standalone router docs — install, configure, extend

Known Limitations

These go in the README because honesty is what makes this repo credible.

26B doesn't fit natively in 16GB under vLLM. It works with llama.cpp, but throughput is lower. The real fix for 200 users at 26B quality is dual-GPU. The router mitigates the problem; it doesn't eliminate it.
Blackwell SM120 has rough edges in vLLM. FP8 KV and MTP require workarounds specific to SM120. Documented exactly in docs/VRAM-REALITY-CHECK.md.
EXL2 is not designed for multi-user production. TabbyAPI maintainers say so explicitly. The benchmark includes it with that caveat, not without.
SGLang doesn't always win. The RadixAttention benefit depends on prefix overlap and prompt structure. The operative table documents this — the 29% claim circulates without the context that makes it true.
MTP on SM120 is not plug-and-play. BF16 with the workarounds works. NVFP4 produces garbage output. The throughput delta needs empirical measurement on your hardware.
Cold-start burst on tier_high. With sla_warmup_seed_ms=0.0 (default), the EMA starts optimistic. The first ~15-20 complex queries all hit tier_high before the EMA converges. Set sla_warmup_seed_ms=800 if you know your expected p50 tier_high latency.
adaptive_router is v0.1. It is functional and tested, but not battle-tested at thousands of users. Treat it as a solid starting point rather than a hardened system.
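
The cold-start item above follows from how an exponential moving average warms up. The sketch below is a simplified sequential model, not the router's code. It assumes a smoothing factor of 0.1, a steady 2500 ms tier_high latency, and the 2 s threshold from sla_rule. `queries_before_sla_trip` is a hypothetical name.

```python
def queries_before_sla_trip(seed_ms: float, observed_ms: float,
                            alpha: float = 0.1, threshold_ms: float = 2000.0,
                            max_queries: int = 1000) -> int:
    """Count queries routed to tier_high before the EMA crosses the threshold.

    Assumes each query completes, and updates the EMA, before the next arrives.
    """
    ema = seed_ms
    routed = 0
    while ema <= threshold_ms and routed < max_queries:
        routed += 1
        ema = alpha * observed_ms + (1 - alpha) * ema
    return routed


if __name__ == "__main__":
    for seed in (0.0, 800.0):
        count = queries_before_sla_trip(seed, observed_ms=2500.0)
        print(f"seed={seed:.0f} ms: {count} queries reach tier_high before sla_rule fires")
```

Under these assumptions a zero seed lets 16 queries through and an 800 ms seed lets 12 through. The real burst size depends on the router's actual smoothing factor and on how many requests are in flight at once.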

Empirical Gaps

These are the measurements this repo needs from the community and from running the hardware.

Gap	Script	Document
Startup time with vs without `--enforce-eager` on SM120	`time docker compose up`	`docs/VRAM-REALITY-CHECK.md`
FP8 KV VRAM savings in E4B (FlashInfer, no `--calculate-kv-scales`)	`rag-integration/vision_demo.py`	`docs/VRAM-REALITY-CHECK.md`
MTP throughput delta with BF16 + speculative-config on SM120	`benchmarks/mtp_bench.py`	`docs/MTP-SPECULATIVE-DECODING.md`
EXL2 throughput on E4B at RTX 5060 Ti	`benchmarks/bench.py --engine exl2_e4b`	`docs/BENCHMARKS.md`
RadixAttention prefix sweep on this hardware	`benchmarks/radixattention_prefix_sweep.py`	`docs/RADIXATTENTION-OPERATIVE-TABLE.md`

Contributing

Issues, PRs, and benchmark results from different hardware are welcome. The most valuable contribution is running bench.py or radixattention_prefix_sweep.py on your hardware and sharing the results.

License

Apache-2.0. See LICENSE.

Built May 2026. Corrects v1 architectural errors (MTP cross-tier, FP8 KV flags, SGLang 29% claim without context, --enforce-eager). Introduces adaptive_router, EXL2 benchmark, RadixAttention operative table, Vision budget, and LiteLLM/llama.cpp bridge gaps.
