Inference Router

OpenAI-compatible inference router for remote vLLM backends. The router runs locally or on a control-plane VM, deploys models to GPU servers over SSH, and forwards /v1/chat/completions requests to vLLM using sticky prefix routing.

What It Does

Installs vLLM on a remote GPU server over SSH.
Downloads a Hugging Face model into the remote HF cache.
Starts a remote OpenAI-compatible vLLM runtime.
Runs a FastAPI router in front of one or more vLLM backends.
Uses sticky routing keyed on tenant, model, system prompt, and prompt prefix for KV cache locality.
Emits lightweight JSON metrics at /metrics.
Adds router.stream_stats SSE events reporting cache outcome, time to first token (TTFT), tokens per second (TPS), the selected backend, and cached/novel token estimates.
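As a rough illustration of the stream stats described in the last item, the sketch below derives TTFT and TPS from three timestamps and formats them as an SSE frame. The field names, the choice to measure TPS over the decode phase only, and the frame layout are assumptions made for this example; the router's actual payload may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class StreamStats:
    # Hypothetical field names; the router's real payload may use others.
    backend: str
    cache_outcome: str
    ttft_ms: float
    tps: float


def compute_stats(
    backend: str,
    cache_outcome: str,
    t_start: float,
    t_first_token: float,
    t_end: float,
    completion_tokens: int,
) -> StreamStats:
    # TTFT: delay between receiving the request and the first streamed token.
    ttft_ms = (t_first_token - t_start) * 1000.0
    # TPS: assumed here to be measured over the decode phase only.
    decode_seconds = max(t_end - t_first_token, 1e-9)
    tps = completion_tokens / decode_seconds
    return StreamStats(backend, cache_outcome, round(ttft_ms, 3), round(tps, 3))


def to_sse(stats: StreamStats) -> str:
    # Assumed framing: a named SSE event followed by a JSON data line.
    return "event: router.stream_stats\ndata: " + json.dumps(asdict(stats)) + "\n\n"
```

In a streaming handler, a frame like this would be written after the last model chunk and before the terminating [DONE] line.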

Architecture

Client
  -> FastAPI router
  -> admission / prefix key / backend selection
  -> vLLM OpenAI API on remote GPU host
  -> stream stats appended before [DONE]

Prefix key:

sha256(tenant | model | normalized_system_prompt | normalized_prefix_window)[:32]
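A minimal sketch of how a key of this shape could be computed. The normalize helper (collapse whitespace and trim) and the use of a bare "|" as the field separator are assumptions for illustration; the router's own normalization rules and prefix window size are not shown here.

```python
import hashlib


def normalize(text: str) -> str:
    # Assumption: normalization means collapsing whitespace runs and trimming.
    return " ".join(text.split())


def prefix_key(tenant: str, model: str, system_prompt: str, prefix_window: str) -> str:
    # Join the four fields with "|" and keep the first 32 hex characters
    # of the SHA-256 digest, matching the formula above.
    material = "|".join(
        [tenant, model, normalize(system_prompt), normalize(prefix_window)]
    )
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:32]
```

Because the key depends on the tenant and the normalized prompt prefix, two requests from the same tenant that share a system prompt and opening context map to the same key, while a different tenant gets a different key for identical text.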

The tenant is taken from the first of these that is present: X-Tenant-ID, then X-User-ID, then X-Forwarded-For, then the client IP.
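The fallback order can be sketched as a small function. The function name is hypothetical, and taking only the first hop of a comma-separated X-Forwarded-For value is an assumption, not something the router is known to do.

```python
from typing import Mapping


def resolve_tenant(headers: Mapping[str, str], client_ip: str) -> str:
    # HTTP header names are case-insensitive, so compare in lowercase.
    lowered = {name.lower(): value for name, value in headers.items()}
    for name in ("x-tenant-id", "x-user-id", "x-forwarded-for"):
        value = lowered.get(name, "")
        if name == "x-forwarded-for":
            # Assumption: use the first address in a proxy chain.
            value = value.split(",")[0]
        value = value.strip()
        if value:
            return value
    # No identifying header present: fall back to the socket peer address.
    return client_ip
```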

Install

uv sync --extra dev
cp .env.example .env

Deploy vLLM To A GPU Server

uv run python scripts/deploy_remote.py \
  --host root@154.59.156.29 \
  --port 42189 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --served-model llama-3.1-8b \
  --vllm-port 8000 \
  --tp 1 \
  --max-model-len 32768

If the vLLM server is bound to localhost on the remote host, keep an SSH tunnel open so the router can reach it:

ssh -p 42189 -N -L 8000:127.0.0.1:8000 root@154.59.156.29

Run Router

export SERVED_MODEL=llama-3.1-8b
export BACKEND_URLS=http://127.0.0.1:8000
uv run python main.py

Test

curl -N http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Tenant-ID: demo-user' \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
    "temperature": 0,
    "stream": true
  }'

Health, metrics, and architecture endpoints:

curl http://localhost:8080/health
curl http://localhost:8080/metrics
curl http://localhost:8080/architecture

Multi-Backend Routing

Set BACKEND_URLS to a comma-separated list of vLLM base URLs:

export BACKEND_URLS=http://127.0.0.1:8000,http://127.0.0.1:8001,http://127.0.0.1:8002
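For reference, a comma-separated value like this can be split into a backend list as follows. This is an illustrative helper, not the router's actual config loader; trimming whitespace and trailing slashes is an assumption.

```python
from typing import List


def parse_backend_urls(raw: str) -> List[str]:
    # Split on commas, trim whitespace and trailing slashes, drop empty entries.
    urls = [part.strip().rstrip("/") for part in raw.split(",")]
    return [url for url in urls if url]
```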

If REDIS_URL is set, prefix ownership is stored in Redis. Otherwise the router keeps prefix ownership in memory.

export REDIS_URL=redis://localhost:6379/0
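To illustrate what prefix ownership means, here is a sketch of an in-memory store where the first backend assigned to a prefix key keeps it. The class and function names are hypothetical, and choosing the initial backend by hashing the key is an assumption; the router may use a different selection policy. With Redis, a comparable first-writer-wins claim could be expressed with SET using the NX option.

```python
import hashlib
from typing import Dict, List


class InMemoryOwnership:
    """Hypothetical ownership store: maps a prefix key to its owning backend."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}

    def claim(self, prefix_key: str, candidate: str) -> str:
        # First writer wins: an existing owner is returned unchanged.
        return self._owners.setdefault(prefix_key, candidate)


def pick_backend(prefix_key: str, backends: List[str], store: InMemoryOwnership) -> str:
    # Assumption: the initial candidate is derived deterministically from the key.
    digest = hashlib.sha256(prefix_key.encode("utf-8")).hexdigest()
    candidate = backends[int(digest, 16) % len(backends)]
    return store.claim(prefix_key, candidate)
```

Once a key has an owner, later requests with the same key return to that backend even if the backend list is reordered, which is what keeps the KV cache warm.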

Development

uv run ruff format .
uv run ruff check .
uv run mypy app/ --ignore-missing-imports
