An authenticated, metered API gateway for LLM inference.
Full documentation is published at:
https://gperdrizet.github.io/model-gateway/
- Users register at https://promptlyapi.com and receive a trial allocation (100k tokens, 7 days)
- API calls are made to https://promptlyapi.com/v1 with a bearer token
- Each request deducts tokens from the user's balance; requests are rejected with 402 when exhausted
- Users can top up via Stripe (card) or BTCPay Server (Bitcoin)
- All usage is recorded for metering and display on the dashboard
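The deduct-or-reject flow described in the list above can be sketched roughly as follows. This is an illustration only, not the gateway's actual code: the `Account` class and the `admit` and `settle` functions are hypothetical names, and clamping the balance at zero is an assumption.

```python
from dataclasses import dataclass

PAYMENT_REQUIRED = 402  # status the gateway returns when the balance is exhausted


@dataclass
class Account:
    """Hypothetical in-memory stand-in for a user's token balance."""
    balance: int


def admit(account: Account) -> int:
    """Return 200 if a request may proceed, 402 if no tokens remain."""
    return 200 if account.balance > 0 else PAYMENT_REQUIRED


def settle(account: Account, prompt_tokens: int, completion_tokens: int) -> int:
    """Deduct the tokens a finished request used and return the new balance.

    Clamping at zero is an assumption made for this sketch.
    """
    used = prompt_tokens + completion_tokens
    account.balance = max(0, account.balance - used)
    return account.balance
```

In this sketch a request is admitted while any balance remains and is charged once its token usage is known, so the final request before exhaustion may use more tokens than were left.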
- FastAPI + uvicorn: API server
- PostgreSQL: user accounts, token balances, usage events, purchases
- Docker Compose: gateway + db + adminer
- nginx: TLS termination and reverse proxy on the gateway server
Go to https://promptlyapi.com and click Create an account. Enter your email address and your API key will arrive by email within a few seconds.
Your account starts with a free trial: 100,000 tokens valid for 7 days.
Lost your key? Go back to https://promptlyapi.com/register and enter the same email address. A new key will be issued and sent to you; your token balance is preserved, but the old key is immediately invalidated.
The API is compatible with the current OpenAI Python SDK. Promptly supports both OpenAI-style Chat Completions (/v1/chat/completions) and a text-focused Responses API profile (/v1/responses).
Use this pattern with the Python SDK:
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://promptlyapi.com/v1",
    api_key=os.environ["PROMPTLY_API_KEY"],
)
completion = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)

Or with LangChain:
import os
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(
    base_url="https://promptlyapi.com/v1",
    api_key=os.environ["PROMPTLY_API_KEY"],
    model="default",
)
response = llm.invoke([HumanMessage(content="Hello!")])
print(response.content)

Or with curl:
curl https://promptlyapi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Go to https://promptlyapi.com, enter your API key in the Already have a key? box, and click View dashboard. You can also go directly to https://promptlyapi.com/dashboard?key=sk-your-key-here.
When your trial runs out, top up via Stripe (card) or BTCPay Server (Bitcoin) from the dashboard. Token packs are charged at cost.
- All /v1/* routes accept OpenAI-compatible request bodies and return OpenAI-compatible responses
- Streaming is supported ("stream": true)
- Requests are rejected with 402 Payment Required when your balance is exhausted
- Rate limits: 120 requests/min per IP, 60 requests/min per API key
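A client that sends many requests can pace itself to stay under the per-key limit listed above. The sketch below is a client-side sliding-window throttle. The class name and the sliding-window approach are assumptions for illustration; the README does not say how the gateway itself counts requests.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowThrottle:
    """Client-side limiter: allow at most `limit` calls per `window` seconds."""

    def __init__(self, limit: int = 60, window: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.sent: Deque[float] = deque()

    def wait_time(self) -> float:
        """Seconds to wait before the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) < self.limit:
            return 0.0
        return self.window - (now - self.sent[0])

    def record(self) -> None:
        """Note that a request was just sent."""
        self.sent.append(self.clock())
```

Before each API call, sleep for `wait_time()` seconds, send the request, then call `record()`. The default of 60 calls per 60 seconds mirrors the documented per-key limit; several clients sharing one IP address would also need to respect the 120 requests/min per-IP limit.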
Promptly supports a text-generation profile of the OpenAI Responses API at /v1/responses.
Supported:
- input as a plain string
- input as message-style text content
- instructions as a system-style instruction string
- stream: true for text output events
- Usage reporting in the response payload
Not currently enabled:
- Multimodal input (input_image, audio, or other non-text content types)
- Tool calling and hosted tool features (tools, tool_choice, parallel_tool_calls)
Unsupported features are rejected with a clear 400 error instead of being silently ignored.
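Since unsupported features produce a 400, it can help to check a request body locally before sending it. The sketch below mirrors the rules listed above. It is a hypothetical client-side helper, not the server's validation logic, and treating only `input_text` and `output_text` content parts as text is an assumption based on the OpenAI Responses API naming.

```python
from typing import Any, Dict, List

# Request fields the README lists as not currently enabled.
UNSUPPORTED_FIELDS = ("tools", "tool_choice", "parallel_tool_calls")
# Content-part types treated as plain text in this sketch (an assumption).
TEXT_PART_TYPES = {"input_text", "output_text"}


def unsupported_features(body: Dict[str, Any]) -> List[str]:
    """Return a description of each feature in `body` that would be rejected."""
    problems = [field for field in UNSUPPORTED_FIELDS if field in body]
    items = body.get("input")
    if isinstance(items, list):
        for item in items:
            content = item.get("content") if isinstance(item, dict) else None
            if not isinstance(content, list):
                continue  # plain-string content is text and needs no check
            for part in content:
                kind = part.get("type") if isinstance(part, dict) else None
                if kind not in TEXT_PART_TYPES:
                    problems.append(f"content type {kind!r}")
    return problems
```

An empty list means the body uses only the text profile described above, for example `unsupported_features({"input": "Hello"})`.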
The model field in your request is accepted but ignored; the server always uses whichever model is currently loaded. The model name returned in the response reflects the actual loaded model (e.g. gpt-oss-20b-mxfp4.gguf). You can query the current model name with:
curl https://promptlyapi.com/v1/models \
  -H "Authorization: Bearer sk-your-key-here"

The currently loaded model is a reasoning model. Responses include a non-standard reasoning_content field alongside the standard content field:
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello!",
      "reasoning_content": "The user says: \"Say hello.\" ..."
    }
  }]
}

Always read choices[0].message.content. The reasoning_content field contains the model's internal chain-of-thought and is not part of the OpenAI spec; most clients will ignore it automatically.
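When working with the raw JSON rather than an SDK, the two fields can be separated as in the sketch below. The helper names are hypothetical, and the sample payload is the response shape shown above.

```python
import json
from typing import Any, Dict, Optional


def answer_text(response: Dict[str, Any]) -> str:
    """Return the standard answer from choices[0].message.content."""
    return response["choices"][0]["message"].get("content") or ""


def reasoning_text(response: Dict[str, Any]) -> Optional[str]:
    """Return the non-standard reasoning_content field, or None if absent."""
    return response["choices"][0]["message"].get("reasoning_content")


sample = json.loads(
    '{"choices": [{"message": {"role": "assistant", "content": "Hello!",'
    ' "reasoning_content": "The user says hello."}}]}'
)
print(answer_text(sample))  # the user-facing answer only
```

Keeping the two accessors separate makes it harder to show chain-of-thought text to end users by accident.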
For a compact reference designed to be dropped into an AI agent's context, see AGENTS.md.
- Python 3.12
- Docker + Docker Compose
git clone git@github.com:gperdrizet/model-gateway.git
cd model-gateway
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt -r requirements-dev.txt
cp .env.template .env
# Edit .env: set DATABASE_URL, LLAMA_BASE_URL, etc.

docker compose up -d db   # start postgres only
uvicorn app.app:app --reload --port 8503

Or run the full stack:
docker compose up --build

pytest tests/ -v

Tests use an in-memory SQLite database; no Docker required. All 17 tests should pass.
- Model server: runs llama-server on :8502, accessible over a private WireGuard tunnel
- Gateway server: VPS running nginx + Docker; model-gateway runs here behind nginx
/opt/model-gateway/ # production git repo and .env
/opt/model-gateway-staging/ # staging git repo and .env
nginx proxies https://<your-domain> to http://127.0.0.1:8503 (production gateway).
Copy .env.template and fill in values. Key production overrides vs. defaults:
| Variable | Production value |
|---|---|
| BASE_URL | https://promptlyapi.com |
| LLAMA_BASE_URL | http://100.64.0.2:8502 |
| ADMINER_BIND_HOST | 100.64.0.1 (tailnet only) |
| GATEWAY_BIND | 127.0.0.1 (behind nginx) |
| GATEWAY_PORT | 8503 |
Staging .env is the same but with GATEWAY_PORT=8505, ADMINER_PORT=8506, and BASE_URL=http://127.0.0.1:8505.
- dev: active development branch. All work happens here.
- main: production-ready code only. Protected; direct pushes are blocked.
- Work on dev, commit and push changes
- Open a pull request from dev to main
- GitHub Actions runs the test suite automatically on the PR
- Branch protection blocks merge until all tests pass
- Merge the PR; staging deploy triggers automatically
- Verify staging, then trigger production deploy manually
- GitHub Actions runs the test suite (pytest tests/ -v)
- If tests pass, SSHs to the gateway server and deploys to staging at port 8505
- Smoke tests http://127.0.0.1:8505/health
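A health smoke test like the one in the last step can be approximated with the standard library. The sketch below polls a URL until it returns HTTP 200. The function names, retry count, and delay are assumptions; the actual workflow step may be implemented differently, and nothing is assumed about the body that /health returns.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for `url`, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def wait_for_health(url: str, attempts: int = 5, delay: float = 2.0,
                    probe: Callable[[str], int] = http_status,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `url` until it answers 200 or the attempts run out."""
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)  # give a freshly started container time to come up
    return False
```

Calling `wait_for_health("http://127.0.0.1:8505/health")` on the gateway server returns True once staging is serving. The `probe` and `sleep` parameters exist only so the retry logic can be exercised without a network.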
Manual trigger only: go to Actions, then Deploy to Production, then Run workflow, enter a version number (e.g. 1.0.0) and type deploy to confirm.
The workflow:
- SSHs to the gateway server, pulls the latest commit into /opt/model-gateway/
- Runs docker compose up --build -d
- Smoke tests the health endpoint
- Tags the commit as v<version> and creates a GitHub release with auto-generated notes
| Secret | Value |
|---|---|
| GATEKEEPER_HOST | Gateway server public IP |
| GATEKEEPER_USER | SSH username on the gateway server |
| GATEKEEPER_SSH_KEY | Private key (matching public key in gateway server's authorized_keys) |
Generate a dedicated deploy key:
ssh-keygen -t ed25519 -C "github-actions-deploy" -f ~/.ssh/github_deploy -N ""
# Add ~/.ssh/github_deploy.pub to the gateway server's authorized_keys
# Add contents of ~/.ssh/github_deploy as the GATEKEEPER_SSH_KEY secret

Staging runs on the same gateway server at port 8505, accessible over the private WireGuard/tailnet network. It is deployed automatically on every merge to main.
Tailnet address: http://100.64.0.1:8505
- SSH into any tailnet machine (or use the gateway server itself):
  ssh user@100.64.0.1 -p 44441
- Open the registration page in a browser pointed at the tailnet address, or use curl:
  curl -s http://100.64.0.1:8505/register
- Submit a registration:
  curl -s -X POST http://100.64.0.1:8505/register \
    -d "email=test@example.com" \
    --include
- The API key is sent by email. If SMTP is configured in the staging .env, check the inbox. Otherwise retrieve it directly from the admin panel or Adminer.
- Test an authenticated request:
  curl http://100.64.0.1:8505/v1/chat/completions \
    -H "Authorization: Bearer sk-your-key-here" \
    -H "Content-Type: application/json" \
    -d '{"model": "default", "messages": [{"role": "user", "content": "ping"}]}'
- Check the dashboard (the URL is quoted so the shell does not treat ? as a glob character):
  curl -s "http://100.64.0.1:8505/dashboard?key=sk-your-key-here"
scripts/smoke-test.py is a stdlib-only Python script that runs a full end-to-end check against any gateway deployment. It covers: health, registration, auth, admin panel, inference (optional), rate limiting, and cleanup.
Quick start (reads ADMIN_KEY from .env automatically):
python3 scripts/smoke-test.py

With inference test (supply an existing valid API key):
SMOKE_API_KEY=sk-your-key-here python3 scripts/smoke-test.py

Against production (skip the slow rate-limit hammer):
python3 scripts/smoke-test.py \
  --url http://100.64.0.1:8503 \
  --skip-rate-limit

All options:
--url URL Base URL (default: http://100.64.0.1:8505)
--admin-key KEY Admin key (default: $ADMIN_KEY or .env file)
--skip-rate-limit Skip the 130-request rate-limit stress test
--verbose Print extra detail on failures
The script creates a uniquely-named test user, runs all checks, then deletes the user via the admin panel. If cleanup fails, the user email is printed so you can remove it manually.
Note on inference testing: the raw API key is emailed on registration and cannot be recovered from the admin panel (only the key prefix is stored). Set SMOKE_API_KEY to any existing valid key to enable the inference phase.
The admin panel is at /admin?key=<ADMIN_KEY>.
Access is restricted by two independent layers:
- ADMIN_KEY: must match the ADMIN_KEY env var (compared in constant time)
- IP CIDR check: request must originate from the private WireGuard/tailnet range, localhost, or Docker bridge. Configured via ADMIN_ALLOWED_CIDRS in .env.
From outside the private network, /admin returns 403 regardless of key.
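The two layers described above can be sketched with the standard library as follows. This is not the gateway's implementation. The function name, the comma-separated format assumed for ADMIN_ALLOWED_CIDRS, and the example ranges (100.64.0.0/10 for the tailnet, 127.0.0.0/8 for localhost, 172.16.0.0/12 for the Docker bridge) are all assumptions.

```python
import hmac
import ipaddress

# Assumed example value for ADMIN_ALLOWED_CIDRS; the real value lives in .env.
EXAMPLE_ALLOWED_CIDRS = "100.64.0.0/10,127.0.0.0/8,172.16.0.0/12"


def admin_allowed(client_ip: str, supplied_key: str, admin_key: str,
                  allowed_cidrs: str = EXAMPLE_ALLOWED_CIDRS) -> bool:
    """Return True only if both the source IP and the admin key check out."""
    networks = [ipaddress.ip_network(cidr.strip())
                for cidr in allowed_cidrs.split(",") if cidr.strip()]
    address = ipaddress.ip_address(client_ip)
    ip_ok = any(address in network for network in networks)
    # compare_digest takes time independent of where the inputs first differ.
    key_ok = hmac.compare_digest(supplied_key.encode(), admin_key.encode())
    return ip_ok and key_ok
```

Both checks are always evaluated, and a failure of either one maps to the same 403 response, so a caller cannot tell which layer rejected the request.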
- View all users: email, key prefix, token balance, trial status, 30-day usage, join date
- Email filter search box for finding users quickly
- Adjust tokens: add or subtract from any user's paid token balance
- Grant trial: give a user a new trial allocation (tokens + days)
- Delete user: permanently removes the user and all associated records
All admin actions can also be scripted directly against the API from any tailnet machine:
Adjust a user's paid token balance (positive = add, negative = deduct):
curl -X POST http://100.64.0.1:8503/admin/adjust \
  -d "key=<ADMIN_KEY>&email=user@example.com&delta=1000000" \
  --data-urlencode ""

Grant a trial allocation:
curl -X POST http://100.64.0.1:8503/admin/grant \
  -d "key=<ADMIN_KEY>&email=user@example.com&tokens=100000&days=7"

Delete a user (requires the numeric user ID from the admin panel):
curl -X POST http://100.64.0.1:8503/admin/delete \
  -d "key=<ADMIN_KEY>&user_id=42"

All three endpoints accept application/x-www-form-urlencoded. They return 200 on success, 403 if the key is wrong or the request IP is not in the allowed CIDR range.
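The same calls can be made from Python with only the standard library. The sketch below builds the form-encoded POST request; `admin_request` is a hypothetical helper name, while the endpoint paths and field names are taken from the curl examples above.

```python
import urllib.parse
import urllib.request


def admin_request(base_url: str, action: str, admin_key: str,
                  **fields: object) -> urllib.request.Request:
    """Build a form-encoded POST for /admin/<action> (adjust, grant, or delete)."""
    data = urllib.parse.urlencode({"key": admin_key, **fields}).encode("ascii")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/admin/{action}",
        data=data,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Example: add one million paid tokens to a user's balance.
request = admin_request("http://100.64.0.1:8503", "adjust", "<ADMIN_KEY>",
                        email="user@example.com", delta=1000000)
```

Pass the result to `urllib.request.urlopen(request)` from a tailnet machine to send it. A 403 response surfaces as `urllib.error.HTTPError`.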
Adminer runs at port 8504 (production) or 8506 (staging). It is bound to the private WireGuard/tailnet IP on the gateway server and is not accessible from the public internet.
Access from a machine on the tailnet:
http://100.64.0.1:8504
Server: db
Username: gateway
Password: (POSTGRES_PASSWORD from .env)
Database: gateway
View all users and balances:
SELECT u.email, u.key_prefix, b.paid_tokens, b.trial_tokens, b.trial_expiry
FROM users u
JOIN token_purchases b ON b.user_id = u.id
ORDER BY u.created_at DESC;

View recent usage events:
SELECT u.email, e.prompt_tokens, e.completion_tokens, e.created_at
FROM usage_events e
JOIN users u ON u.id = e.user_id
ORDER BY e.created_at DESC
LIMIT 50;

Manually credit a user (paid balance):
UPDATE token_purchases
SET paid_tokens = paid_tokens + 1000000
WHERE user_id = (SELECT id FROM users WHERE email = 'user@example.com');

Set STRIPE_SECRET_KEY and STRIPE_WEBHOOK_SECRET in .env. Register a webhook at https://<your-domain>/stripe/webhook in the Stripe dashboard with the checkout.session.completed event.
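To show what STRIPE_WEBHOOK_SECRET is used for, here is a stdlib-only sketch of Stripe's documented signature scheme: an HMAC-SHA256 over `<timestamp>.<payload>`, compared against the `v1` values in the `Stripe-Signature` header. The function name and the 300-second tolerance are choices made for this example. How the gateway actually verifies webhooks is not stated here; production code would normally call `stripe.Webhook.construct_event` from the official library.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_stripe_signature(payload: bytes, header: str, secret: str,
                            tolerance: int = 300,
                            now: Optional[float] = None) -> bool:
    """Check a Stripe-Signature header ("t=...,v1=...") against the raw body."""
    pairs = [item.strip().split("=", 1) for item in header.split(",") if "=" in item]
    timestamp = next((value for key, value in pairs if key == "t"), None)
    candidates = [value for key, value in pairs if key == "v1"]
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance:
        return False  # too old (or too far in the future): possible replay
    signed_payload = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    return any(hmac.compare_digest(expected, candidate) for candidate in candidates)
```

The signature covers the raw request bytes, so the body must be verified before any JSON parsing or re-serialization.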
A separate compose stack (docker-compose.btcpay.yml) runs BTCPay Server bound to your private network IP. Set BTCPAY_URL, BTCPAY_API_KEY, BTCPAY_STORE_ID, and BTCPAY_WEBHOOK_SECRET in .env after configuring the store.
| Pack | Tokens | Price |
|---|---|---|
| starter | 5M | $0.50 |
| standard | 25M | $2.00 |
| pro | 100M | $6.00 |
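The effective price per million tokens follows directly from the table. The short calculation below uses the pack sizes and prices listed above; the `PACKS` mapping and the function name are only for illustration.

```python
from decimal import Decimal

# Pack name -> (tokens, price in USD), copied from the table above.
PACKS = {
    "starter": (5_000_000, Decimal("0.50")),
    "standard": (25_000_000, Decimal("2.00")),
    "pro": (100_000_000, Decimal("6.00")),
}


def usd_per_million(tokens: int, price: Decimal) -> Decimal:
    """Price of one million tokens, rounded to the nearest cent."""
    return (price * 1_000_000 / tokens).quantize(Decimal("0.01"))


for name, (tokens, price) in PACKS.items():
    print(f"{name}: ${usd_per_million(tokens, price)} per 1M tokens")
```

This works out to $0.10, $0.08, and $0.06 per million tokens for the starter, standard, and pro packs respectively.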