Skip to content

bas3line/lora-train-pipeline

Repository files navigation

Inference Router

OpenAI-compatible inference router for remote vLLM backends. The router runs locally or on a control-plane VM, deploys models to GPU servers over SSH, and forwards /v1/chat/completions to vLLM with sticky prefix routing.

What It Does

  • Installs vLLM on a remote GPU server over SSH.
  • Downloads a Hugging Face model into the remote HF cache.
  • Starts a remote OpenAI-compatible vLLM runtime.
  • Runs a FastAPI router in front of one or more vLLM backends.
  • Uses tenant/model/system/prefix sticky routing for KV locality.
  • Emits lightweight JSON metrics at /metrics.
  • Adds router.stream_stats SSE events with cache outcome, TTFT, TPS, backend, and cached/novel token estimates.

Architecture

Client
  -> FastAPI router
  -> admission / prefix key / backend selection
  -> vLLM OpenAI API on remote GPU host
  -> stream stats appended before [DONE]

Prefix key:

sha256(tenant | model | normalized_system_prompt | normalized_prefix_window)[:32]

Tenant is read from X-Tenant-ID, then X-User-ID, then X-Forwarded-For, then client IP.

Install

uv sync --extra dev
cp .env.example .env

Deploy vLLM To A GPU Server

uv run python scripts/deploy_remote.py \
  --host root@154.59.156.29 \
  --port 42189 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --served-model llama-3.1-8b \
  --vllm-port 8000 \
  --tp 1 \
  --max-model-len 32768

Keep a tunnel open if the vLLM server is bound to remote localhost:

ssh -p 42189 -N -L 8000:127.0.0.1:8000 root@154.59.156.29

Run Router

export SERVED_MODEL=llama-3.1-8b
export BACKEND_URLS=http://127.0.0.1:8000
uv run python main.py

Test

curl -N http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Tenant-ID: demo-user' \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
    "temperature": 0,
    "stream": true
  }'

Health and metrics:

curl http://localhost:8080/health
curl http://localhost:8080/metrics
curl http://localhost:8080/architecture

Multi-Backend Routing

export BACKEND_URLS=http://127.0.0.1:8000,http://127.0.0.1:8001,http://127.0.0.1:8002

If REDIS_URL is set, prefix ownership is stored in Redis. Otherwise it uses in-memory ownership.

export REDIS_URL=redis://localhost:6379/0

Development

uv run ruff format .
uv run ruff check .
uv run mypy app/ --ignore-missing-imports

About

LoRA fine-tuning

Resources

License

Code of conduct

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors