model-gateway

An authenticated, metered API gateway for LLM inference.

Documentation

Full documentation is published at:

https://gperdrizet.github.io/model-gateway/

How it works

Users register at https://promptlyapi.com and receive a trial allocation (100k tokens, 7 days)
API calls are made to https://promptlyapi.com/v1 with a bearer token
Each request deducts tokens from the user's balance; requests are rejected with 402 when exhausted
Users can top up via Stripe (card) or BTCPay Server (Bitcoin)
All usage is recorded for metering and display on the dashboard
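
The metering steps above can be sketched in a few lines. This is an illustrative in-memory model, not the gateway's actual code: `Account`, `charge`, and `InsufficientBalance` are hypothetical names, and the real service stores balances and usage events in PostgreSQL. The sketch also assumes the balance is checked before the request runs and never drops below zero, which the source does not specify.

```python
from dataclasses import dataclass, field

PAYMENT_REQUIRED = 402  # HTTP status returned when the balance is exhausted


class InsufficientBalance(Exception):
    """Hypothetical error that a gateway would translate into HTTP 402."""

    status_code = PAYMENT_REQUIRED


@dataclass
class Account:
    balance: int                                    # tokens remaining
    usage_log: list = field(default_factory=list)   # recorded usage events


def charge(account: Account, prompt_tokens: int, completion_tokens: int) -> int:
    """Deduct one request's tokens, record the usage event, return the new balance."""
    if account.balance <= 0:
        # Assumption: an exhausted balance is rejected before any inference runs.
        raise InsufficientBalance("token balance exhausted")
    total = prompt_tokens + completion_tokens
    # Assumption: the balance is floored at zero rather than going negative.
    account.balance = max(0, account.balance - total)
    account.usage_log.append((prompt_tokens, completion_tokens))
    return account.balance
```

Starting from the 100,000-token trial, a request that uses 12 prompt tokens and 30 completion tokens leaves 99,958 tokens.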

Stack

FastAPI + uvicorn: API server
PostgreSQL: user accounts, token balances, usage events, purchases
Docker Compose: gateway + db + adminer
nginx: TLS termination and reverse proxy on the gateway server

Using the API

1. Register

Go to https://promptlyapi.com and click Create an account. Enter your email address; your API key will arrive by email within a few seconds.

Your account starts with a free trial: 100,000 tokens valid for 7 days.

Lost your key? Go back to https://promptlyapi.com/register and enter the same email address. A new key will be issued and sent to you; your token balance is preserved, but the old key is immediately invalidated.

2. Make your first request

The API is compatible with the current OpenAI Python SDK. Promptly supports both OpenAI-style Chat Completions (/v1/chat/completions) and a text-focused Responses API profile (/v1/responses).

Use this pattern with the Python SDK:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://promptlyapi.com/v1",
    api_key=os.environ["PROMPTLY_API_KEY"],
)

completion = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(completion.choices[0].message.content)

Or with LangChain:

import os

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(
    base_url="https://promptlyapi.com/v1",
    api_key=os.environ["PROMPTLY_API_KEY"],
    model="default",
)

response = llm.invoke([HumanMessage(content="Hello!")])
print(response.content)

Or with curl:

curl https://promptlyapi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

3. Check your balance

Go to https://promptlyapi.com, enter your API key in the Already have a key? box, and click View dashboard. You can also go directly to https://promptlyapi.com/dashboard?key=sk-your-key-here.

4. Top up

When your trial runs out, top up via Stripe (card) or BTCPay Server (Bitcoin) from the dashboard. Token packs are charged at cost.

API notes

All /v1/* routes accept OpenAI-compatible request bodies and return OpenAI-compatible responses
Streaming is supported ("stream": true)
Requests are rejected with 402 Payment Required when your balance is exhausted
Rate limits: 120 requests/min per IP, 60 requests/min per API key

Responses API compatibility profile

Promptly supports a text-generation profile of the OpenAI Responses API at /v1/responses.

Supported:

input as a plain string
input as message-style text content
instructions as a system-style instruction string
stream: true for text output events
Usage reporting in the response payload

Not currently enabled:

Multimodal input (input_image, audio, or other non-text content types)
Tool calling and hosted tool features (tools, tool_choice, parallel_tool_calls)

Unsupported features are rejected with a clear 400 error instead of being silently ignored.
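
A client can check a request against this profile before sending it. The sketch below is a hypothetical helper, not part of the gateway or the OpenAI SDK: `find_unsupported` and `UNSUPPORTED_FIELDS` are illustrative names, and it assumes text content parts use the `input_text` type from the OpenAI Responses API.

```python
UNSUPPORTED_FIELDS = ("tools", "tool_choice", "parallel_tool_calls")


def find_unsupported(payload: dict) -> list[str]:
    """Return a description of each feature outside the text-only profile."""
    problems = [name for name in UNSUPPORTED_FIELDS if name in payload]
    items = payload.get("input", "")
    if isinstance(items, str):           # plain-string input is always text
        return problems
    for item in items:                   # message-style input
        content = item.get("content", "")
        if isinstance(content, str):
            continue
        for part in content:
            # Assumption: "input_text" is the only content type the profile accepts.
            if part.get("type") != "input_text":
                problems.append(f"content type {part.get('type')!r}")
    return problems
```

An empty list means the payload stays within the supported profile; anything else would presumably trigger the 400 response described above.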

Model name and response format

The model field in your request is accepted but ignored; the server always uses whichever model is currently loaded. The model name returned in the response reflects the actual loaded model (e.g. gpt-oss-20b-mxfp4.gguf). You can query the current model name with:

curl https://promptlyapi.com/v1/models \
  -H "Authorization: Bearer sk-your-key-here"

The currently loaded model is a reasoning model. Responses include a non-standard reasoning_content field alongside the standard content field:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello!",
      "reasoning_content": "The user says: \"Say hello.\" ..."
    }
  }]
}

Always read choices[0].message.content. The reasoning_content field contains the model's internal chain-of-thought and is not part of the OpenAI spec; most clients will ignore it automatically.
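
For clients that parse the raw JSON themselves, a small helper can separate the two fields. This is a minimal sketch; `split_message` is a hypothetical name, and the sample payload mirrors the example above.

```python
import json
from typing import Optional


def split_message(response: dict) -> tuple[str, Optional[str]]:
    """Return (content, reasoning_content) from a chat completion response."""
    message = response["choices"][0]["message"]
    # reasoning_content is non-standard, so treat it as optional.
    return message["content"], message.get("reasoning_content")


sample = json.loads("""
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello!",
      "reasoning_content": "The user says: \\"Say hello.\\" ..."
    }
  }]
}
""")

answer, reasoning = split_message(sample)
print(answer)
```

Using `.get` for `reasoning_content` keeps the helper working if a different, non-reasoning model is loaded later.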

For a compact reference designed to be dropped into an AI agent's context, see AGENTS.md.

Development

Requirements

Python 3.12
Docker + Docker Compose

Setup

git clone git@github.com:gperdrizet/model-gateway.git
cd model-gateway
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt -r requirements-dev.txt
cp .env.template .env
# Edit .env: set DATABASE_URL, LLAMA_BASE_URL, etc.

Run locally

docker compose up -d db          # start postgres only
uvicorn app.app:app --reload --port 8503

Or run the full stack:

docker compose up --build

Run tests

pytest tests/ -v

Tests use an in-memory SQLite database; no Docker required. All 17 tests should pass.

Deployment

Infrastructure

Model server: runs llama-server on :8502, accessible over a private WireGuard tunnel
Gateway server: VPS running nginx + Docker; model-gateway runs here behind nginx

Production stack on the gateway server

/opt/model-gateway/          # production git repo and .env
/opt/model-gateway-staging/  # staging git repo and .env

nginx proxies https://<your-domain> to http://127.0.0.1:8503 (production gateway).

Environment files

Copy .env.template and fill in values. Key production overrides vs. defaults:

Variable	Production value
`BASE_URL`	`https://promptlyapi.com`
`LLAMA_BASE_URL`	`http://100.64.0.2:8502`
`ADMINER_BIND_HOST`	`100.64.0.1` (tailnet only)
`GATEWAY_BIND`	`127.0.0.1` (behind nginx)
`GATEWAY_PORT`	`8503`

Staging .env is the same but with GATEWAY_PORT=8505, ADMINER_PORT=8506, and BASE_URL=http://127.0.0.1:8505.

CI/CD

Branches

dev: active development branch. All work happens here.
main: production-ready code only. Protected; direct pushes are blocked.

Workflow

Work on dev, commit and push changes
Open a pull request from dev to main
GitHub Actions runs the test suite automatically on the PR
Branch protection blocks merge until all tests pass
Merge the PR; staging deploy triggers automatically
Verify staging, then trigger production deploy manually

On every push to `main` (after PR merge)

GitHub Actions runs the test suite (pytest tests/ -v)
If tests pass, SSHs to the gateway server and deploys to staging at port 8505
Smoke tests http://127.0.0.1:8505/health
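
The smoke-test step amounts to a single HTTP check. The following stdlib-only sketch is an assumption about what such a check might look like, not the workflow's actual script: `check_health` is a hypothetical name, and it only verifies that `/health` answers with status 200, since the source does not describe the response body.

```python
import urllib.error
import urllib.request


def check_health(base_url: str, timeout: float = 5.0,
                 opener=urllib.request.urlopen) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/health"
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status raised as HTTPError.
        return False
```

Calling `check_health("http://127.0.0.1:8505")` on the gateway server would exercise the staging deployment. The `opener` parameter exists only so the function can be tested without a network.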

Production deploy

Manual trigger only. In GitHub, go to Actions, select Deploy to Production, and click Run workflow. Enter a version number (e.g. 1.0.0) and type deploy to confirm.

The workflow:

SSHs to the gateway server, pulls the latest commit into /opt/model-gateway/
Runs docker compose up --build -d
Smoke tests the health endpoint
Tags the commit as v<version> and creates a GitHub release with auto-generated notes

Required GitHub secrets

Secret	Value
`GATEKEEPER_HOST`	Gateway server public IP
`GATEKEEPER_USER`	SSH username on the gateway server
`GATEKEEPER_SSH_KEY`	Private key (matching public key in gateway server's `authorized_keys`)

Generate a dedicated deploy key:

ssh-keygen -t ed25519 -C "github-actions-deploy" -f ~/.ssh/github_deploy -N ""
# Add ~/.ssh/github_deploy.pub to the gateway server's authorized_keys
# Add contents of ~/.ssh/github_deploy as the GATEKEEPER_SSH_KEY secret

Staging environment

Staging runs on the same gateway server at port 8505, accessible over the private WireGuard/tailnet network. It is deployed automatically on every merge to main.

Tailnet address: http://100.64.0.1:8505

Manually testing the registration flow on staging

SSH into any tailnet machine (or use the gateway server itself):
```
ssh user@100.64.0.1 -p 44441
```
Open the registration page in a browser pointed at the tailnet address, or use curl:
```
curl -s http://100.64.0.1:8505/register
```

Submit a registration:

curl -s -X POST http://100.64.0.1:8505/register \
  -d "email=test@example.com" \
  --include

The API key is sent by email. If SMTP is configured in the staging .env, check the inbox. Otherwise retrieve it directly from the admin panel or Adminer.

Test an authenticated request:

curl http://100.64.0.1:8505/v1/chat/completions \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "ping"}]}'

Check the dashboard:

curl -s "http://100.64.0.1:8505/dashboard?key=sk-your-key-here"

Smoke test

scripts/smoke-test.py is a stdlib-only Python script that runs a full end-to-end check against any gateway deployment. It covers: health, registration, auth, admin panel, inference (optional), rate limiting, and cleanup.

Quick start (reads ADMIN_KEY from .env automatically):

python3 scripts/smoke-test.py

With inference test (supply an existing valid API key):

SMOKE_API_KEY=sk-your-key-here python3 scripts/smoke-test.py

Against production (skip the slow rate-limit hammer):

python3 scripts/smoke-test.py \
  --url http://100.64.0.1:8503 \
  --skip-rate-limit

All options:

  --url URL           Base URL (default: http://100.64.0.1:8505)
  --admin-key KEY     Admin key (default: $ADMIN_KEY or .env file)
  --skip-rate-limit   Skip the 130-request rate-limit stress test
  --verbose           Print extra detail on failures

The script creates a uniquely-named test user, runs all checks, then deletes the user via the admin panel. If cleanup fails, the user email is printed so you can remove it manually.

Note on inference testing: the raw API key is emailed on registration and cannot be recovered from the admin panel (only the key prefix is stored). Set SMOKE_API_KEY to any existing valid key to enable the inference phase.

Admin panel

The admin panel is at /admin?key=<ADMIN_KEY>.

Access is restricted by two independent layers:

ADMIN_KEY: must match the ADMIN_KEY env var (compared in constant time)
IP CIDR check: request must originate from the private WireGuard/tailnet range, localhost, or Docker bridge. Configured via ADMIN_ALLOWED_CIDRS in .env.

From outside the private network, /admin returns 403 regardless of key.
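
The two layers can be illustrated with the standard library. This is a sketch under stated assumptions rather than the gateway's code: `key_matches`, `ip_allowed`, and `admin_status` are hypothetical names, and the CIDR ranges in the usage note are examples, not the project's actual ADMIN_ALLOWED_CIDRS value.

```python
import hmac
import ipaddress


def key_matches(supplied: str, expected: str) -> bool:
    """Compare keys in constant time to avoid leaking information via timing."""
    return hmac.compare_digest(supplied.encode(), expected.encode())


def ip_allowed(client_ip: str, allowed_cidrs: list[str]) -> bool:
    """Return True if the client address falls inside any allowed network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(cidr, strict=False)
               for cidr in allowed_cidrs)


def admin_status(supplied_key: str, expected_key: str,
                 client_ip: str, allowed_cidrs: list[str]) -> int:
    """Return 200 only when both layers pass, otherwise 403."""
    if not ip_allowed(client_ip, allowed_cidrs):
        return 403  # outside the private network: rejected regardless of key
    if not key_matches(supplied_key, expected_key):
        return 403
    return 200
```

With example ranges such as `["100.64.0.0/10", "127.0.0.0/8", "172.16.0.0/12"]`, a request from 100.64.0.5 with the right key returns 200, while the same key from a public address returns 403.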

Admin panel features

View all users: email, key prefix, token balance, trial status, 30-day usage, join date
Email filter search box for finding users quickly
Adjust tokens: add or subtract from any user's paid token balance
Grant trial: give a user a new trial allocation (tokens + days)
Delete user: permanently removes the user and all associated records

Admin operations via API

All admin actions can also be scripted directly against the API from any tailnet machine.

Adjust a user's paid token balance (positive = add, negative = deduct):

curl -X POST http://100.64.0.1:8503/admin/adjust \
  -d "key=<ADMIN_KEY>&email=user@example.com&delta=1000000"

Grant a trial allocation:

curl -X POST http://100.64.0.1:8503/admin/grant \
  -d "key=<ADMIN_KEY>&email=user@example.com&tokens=100000&days=7"

Delete a user (requires the numeric user ID from the admin panel):

curl -X POST http://100.64.0.1:8503/admin/delete \
  -d "key=<ADMIN_KEY>&user_id=42"

All three endpoints accept application/x-www-form-urlencoded bodies. They return 200 on success and 403 if the key is wrong or the request IP is not in the allowed CIDR range.
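
The same calls can be made from Python with only the standard library. The helper below is a hypothetical convenience wrapper (`build_admin_request` is not part of the project); it builds the form-encoded POST request and leaves sending it to the caller.

```python
import urllib.parse
import urllib.request


def build_admin_request(base_url: str, action: str, admin_key: str,
                        **fields) -> urllib.request.Request:
    """Build a form-encoded POST for /admin/adjust, /admin/grant, or /admin/delete."""
    body = urllib.parse.urlencode({"key": admin_key, **fields}).encode()
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/admin/{action}",
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Example: deduct 500 paid tokens from a user (the key here is a placeholder).
req = build_admin_request(
    "http://100.64.0.1:8503", "adjust", "ADMIN_KEY_PLACEHOLDER",
    email="user@example.com", delta=-500,
)
```

Passing `req` to `urllib.request.urlopen` sends it; a 403 response surfaces as `urllib.error.HTTPError`. Unlike the raw curl form, `urlencode` percent-encodes special characters in the email address and key.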

Adminer (database GUI)

Adminer runs at port 8504 (production) or 8506 (staging), bound to the private WireGuard/tailnet IP on the gateway server. It is not accessible from the public internet.

Access from a machine on the tailnet:

http://100.64.0.1:8504
Server:   db
Username: gateway
Password: (POSTGRES_PASSWORD from .env)
Database: gateway

Useful queries

View all users and balances:

SELECT u.email, u.key_prefix, b.paid_tokens, b.trial_tokens, b.trial_expiry
FROM users u
JOIN token_purchases b ON b.user_id = u.id
ORDER BY u.created_at DESC;

View recent usage events:

SELECT u.email, e.prompt_tokens, e.completion_tokens, e.created_at
FROM usage_events e
JOIN users u ON u.id = e.user_id
ORDER BY e.created_at DESC
LIMIT 50;

Manually credit a user (paid balance):

UPDATE token_purchases
SET paid_tokens = paid_tokens + 1000000
WHERE user_id = (SELECT id FROM users WHERE email = 'user@example.com');

Billing

Stripe

Set STRIPE_SECRET_KEY and STRIPE_WEBHOOK_SECRET in .env. Register a webhook at https://<your-domain>/stripe/webhook in the Stripe dashboard with the checkout.session.completed event.

BTCPay Server (Bitcoin)

A separate compose stack (docker-compose.btcpay.yml) runs BTCPay Server bound to your private network IP. Set BTCPAY_URL, BTCPAY_API_KEY, BTCPAY_STORE_ID, and BTCPAY_WEBHOOK_SECRET in .env after configuring the store.

Token packs

Pack	Tokens	Price
`starter`	5M	$0.50
`standard`	25M	$2.00
`pro`	100M	$6.00
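
The per-token price falls as pack size grows. The short calculation below derives the price per million tokens from the table; `PACKS` and `price_per_million` are illustrative names, not identifiers from the codebase.

```python
from decimal import Decimal

# Pack name -> (tokens, price in USD), copied from the table above.
PACKS = {
    "starter": (5_000_000, Decimal("0.50")),
    "standard": (25_000_000, Decimal("2.00")),
    "pro": (100_000_000, Decimal("6.00")),
}


def price_per_million(pack: str) -> Decimal:
    """Return the USD price per one million tokens for a pack."""
    tokens, price = PACKS[pack]
    return price / (Decimal(tokens) / Decimal(1_000_000))


for name in PACKS:
    print(f"{name}: ${price_per_million(name):.2f} per 1M tokens")
```

This works out to $0.10, $0.08, and $0.06 per million tokens for the starter, standard, and pro packs respectively.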
