OpenAI-compatible inference router for remote vLLM backends. The router runs locally or on a control-plane VM, deploys models to GPU servers over SSH, and forwards /v1/chat/completions to vLLM with sticky prefix routing.
- Installs vLLM on a remote GPU server over SSH.
- Downloads a Hugging Face model into the remote HF cache.
- Starts a remote OpenAI-compatible vLLM runtime.
- Runs a FastAPI router in front of one or more vLLM backends.
- Uses tenant/model/system/prefix sticky routing for KV locality.
- Emits lightweight JSON metrics at
/metrics. - Adds
router.stream_statsSSE events with cache outcome, TTFT, TPS, backend, and cached/novel token estimates.
Client
-> FastAPI router
-> admission / prefix key / backend selection
-> vLLM OpenAI API on remote GPU host
-> stream stats appended before [DONE]
Prefix key:
sha256(tenant | model | normalized_system_prompt | normalized_prefix_window)[:32]
Tenant is read from X-Tenant-ID, then X-User-ID, then X-Forwarded-For, then client IP.
uv sync --extra dev
cp .env.example .envuv run python scripts/deploy_remote.py \
--host root@154.59.156.29 \
--port 42189 \
--model meta-llama/Llama-3.1-8B-Instruct \
--served-model llama-3.1-8b \
--vllm-port 8000 \
--tp 1 \
--max-model-len 32768Keep a tunnel open if the vLLM server is bound to remote localhost:
ssh -p 42189 -N -L 8000:127.0.0.1:8000 root@154.59.156.29export SERVED_MODEL=llama-3.1-8b
export BACKEND_URLS=http://127.0.0.1:8000
uv run python main.pycurl -N http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'X-Tenant-ID: demo-user' \
-d '{
"model": "llama-3.1-8b",
"messages": [{"role": "user", "content": "Say hello in one sentence."}],
"max_tokens": 64,
"temperature": 0,
"stream": true
}'Health and metrics:
curl http://localhost:8080/health
curl http://localhost:8080/metrics
curl http://localhost:8080/architectureexport BACKEND_URLS=http://127.0.0.1:8000,http://127.0.0.1:8001,http://127.0.0.1:8002If REDIS_URL is set, prefix ownership is stored in Redis. Otherwise it uses in-memory ownership.
export REDIS_URL=redis://localhost:6379/0uv run ruff format .
uv run ruff check .
uv run mypy app/ --ignore-missing-imports