OpenAI + Anthropic compatible REST API proxy backed by NousResearch Portal OAuth — run your own inference gateway with subscription-based rate limits. Works with agentic coding tools like OpenCode, Claude Code, and OpenAI Codex CLI.
```
App Client --[API Key]--> NousProxy --[OAuth Agent Key]--> NousResearch Inference API
```
- Proxy handles the OAuth device code flow (one-time setup).
- Auto-refreshes the access token and agent key in the background.
- Clients authenticate with a static proxy API key.
- Forwards requests to the NousResearch inference API with product attribution (`tags: ["product=hermes-agent"]`).
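The forwarding step above can be sketched as a pure function: swap the client's proxy key for the OAuth agent key and tag the payload for attribution. This is an illustrative sketch; `prepare_upstream_request` is a hypothetical name, not the proxy's actual internal API.

```python
def prepare_upstream_request(body: dict, agent_key: str) -> tuple[dict, dict]:
    """Build (headers, payload) for the upstream inference call."""
    headers = {
        "Authorization": f"Bearer {agent_key}",  # agent key, never the client's np- key
        "Content-Type": "application/json",
    }
    payload = dict(body)  # shallow copy; don't mutate the client's body
    payload["tags"] = list(body.get("tags", [])) + ["product=hermes-agent"]
    return headers, payload

headers, payload = prepare_upstream_request(
    {"model": "xiaomi/mimo-v2-pro", "messages": []}, "agent-key-123"
)
print(payload["tags"])  # → ['product=hermes-agent']
```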
```bash
cd /opt/nous-proxy

# Build & start
docker compose up -d

# Get your proxy API key
cat data/api_keys.json

# Auth (one-time, auto-polls until you authorize)
curl -X POST http://localhost:8090/auth/device-code \
  -H "Authorization: Bearer np-xxx"
# → Open the URL, enter the code
# → Polling happens automatically in the background

# Check auth status
curl http://localhost:8090/auth/status \
  -H "Authorization: Bearer np-xxx"

# Use it
curl http://localhost:8090/v1/chat/completions \
  -H "Authorization: Bearer np-xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "xiaomi/mimo-v2-pro",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Logs
docker logs -f nous-proxy

# Stop
docker compose down

# Rebuild after code changes
docker compose up -d --build
```

Data (tokens, API keys) persists in `./data/` via a bind mount.
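The one-time auth step boils down to polling the portal until the user authorizes the device code. A hedged sketch of that loop (the real implementation lives in `nous_proxy/auth.py`; names and response fields here are assumptions, and `poll_once` is injected so the loop can run without a live portal):

```python
import time

def wait_for_authorization(poll_once, interval: float = 5.0, max_attempts: int = 60):
    """Poll until the user authorizes the device code, then return tokens."""
    for _ in range(max_attempts):
        result = poll_once()
        if result.get("status") == "pending":
            time.sleep(interval)  # portal says "not yet"; wait and retry
            continue
        if "access_token" in result:
            return result
        raise RuntimeError(f"authorization failed: {result}")
    raise TimeoutError("user never authorized the device code")

# Simulated portal: pending twice, then authorized.
responses = iter([
    {"status": "pending"},
    {"status": "pending"},
    {"access_token": "tok", "refresh_token": "ref"},
])
tokens = wait_for_authorization(lambda: next(responses), interval=0.0)
print(tokens["access_token"])  # → tok
```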
| Method | Path | Description |
|---|---|---|
| GET | `/` | Landing page |
| GET | `/health` | Health check + token status |
| POST | `/auth/device-code` | Start OAuth + auto-poll |
| GET | `/auth/status` | Check auth/polling status |
| POST | `/auth/poll` | Fallback: wait for auth completion |
| POST | `/v1/chat/completions` | OpenAI-compatible chat completions |
| GET | `/v1/models` | List available models |
| POST | `/v1/messages` | Anthropic Messages API (Claude Code) |
| POST | `/v1/messages/count_tokens` | Anthropic token counting stub |
| POST | `/v1/responses` | OpenAI Responses API (Codex CLI) |
| POST | `/admin/generate-key` | Generate a new proxy API key |
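The `/admin/generate-key` and auth endpoints imply key generation and validation logic. A sketch of the likely shape, under the assumption that keys are random `np-` prefixed tokens (the real logic is in `nous_proxy/api_keys.py`; these function names are illustrative):

```python
import secrets

def generate_key() -> str:
    """Mint a new proxy API key with the np- prefix."""
    return "np-" + secrets.token_urlsafe(24)

def is_valid(key: str, known_keys: set) -> bool:
    """Accept only known keys with the expected prefix."""
    return key.startswith("np-") and key in known_keys

key = generate_key()
store = {key}
print(is_valid(key, store))         # → True
print(is_valid("np-wrong", store))  # → False
```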
OpenCode uses the AI SDK with an OpenAI-compatible API. Configure it in `~/.opencode.json`:
```json
{
  "provider": {
    "nous": {
      "options": {
        "baseURL": "http://localhost:8090/v1",
        "apiKey": "np-YOUR_PROXY_KEY"
      }
    }
  }
}
```

Then run `/models` in OpenCode to select a model (e.g. `xiaomi/mimo-v2-pro`).
Claude Code uses the Anthropic Messages API. Configure it in `.claude/settings.json`:
```json
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:8090",
    "ANTHROPIC_AUTH_TOKEN": "np-***",
    "ANTHROPIC_MODEL": "xiaomi/mimo-v2-pro",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "xiaomi/mimo-v2-pro",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "xiaomi/mimo-v2-pro"
  }
}
```

The proxy translates the Anthropic Messages API to OpenAI chat completions automatically, including tool use and streaming.
Note: Thinking blocks are filtered out to reduce latency — the Xiaomi model generates excessive reasoning (20-30 events) that slows down responses. Tool calls work correctly with proper `input_json_delta` streaming.
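The translation described above can be sketched as a function that flattens Anthropic content blocks into OpenAI chat messages, dropping thinking blocks along the way. This is an assumption about the shape of `nous_proxy/anthropic.py`, not a copy of it:

```python
def anthropic_to_openai(system: str, messages: list) -> list:
    """Convert Anthropic-style messages to OpenAI chat format."""
    out = []
    if system:
        out.append({"role": "system", "content": system})
    for msg in messages:
        content = msg["content"]
        if isinstance(content, list):
            # Block form: keep text blocks, drop thinking blocks.
            content = "".join(
                b["text"] for b in content if b.get("type") == "text"
            )
        out.append({"role": msg["role"], "content": content})
    return out

msgs = anthropic_to_openai(
    "Be terse.",
    [{"role": "user", "content": [
        {"type": "thinking", "thinking": "(dropped)"},
        {"type": "text", "text": "Hello!"},
    ]}],
)
print(msgs)
```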
Codex CLI uses the OpenAI Responses API. Configure it in `~/.codex/config.toml`:
```toml
model = "xiaomi/mimo-v2-pro"
model_provider = "nous"

[model_providers.nous]
name = "NousResearch"
base_url = "http://localhost:8090/v1"
wire_api = "responses"
experimental_bearer_token = "np-***"
```

The proxy translates the Responses API to Chat Completions format automatically.
Note: The `web_search` tool is filtered out (not supported). The `apply_patch` custom tool is converted to a function tool. The Xiaomi model can generate tool calls but may exhibit "planning mode" behavior, generating text instead of executing tools when the system prompt is complex. This is a model limitation, not a proxy issue.
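The tool normalization just described (drop `web_search`, re-emit `custom` tools as `function` tools) might look like the following sketch. Field names are assumptions based on the note above; the authoritative version is `nous_proxy/responses.py`:

```python
def normalize_tools(tools: list) -> list:
    """Filter/convert Responses API tools into Chat Completions function tools."""
    out = []
    for tool in tools:
        if tool.get("type") == "web_search":
            continue  # upstream has no web search; drop it
        if tool.get("type") == "custom":
            # Re-emit custom tools (e.g. apply_patch) as function tools.
            out.append({
                "type": "function",
                "function": {
                    "name": tool["name"],
                    "description": tool.get("description", ""),
                    "parameters": {"type": "object", "properties": {}},
                },
            })
        else:
            out.append(tool)
    return out

tools = normalize_tools([
    {"type": "web_search"},
    {"type": "custom", "name": "apply_patch"},
])
print([t["type"] for t in tools])  # → ['function']
```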
A NousResearch agent key from the free OAuth subscription has model access restrictions:
| Category | Access | Notes |
|---|---|---|
| Standard models | ✅ | 350+ models available |
| `:free` models (26) | ❌ | Blocked: "OpenRouter free models are not supported" |
| `openrouter/*` models (4) | ❌ | Blocked: "This model is not supported on Free Tier" |
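Given these restrictions, a client can pre-filter the `/v1/models` output to the IDs a free-tier agent key can actually call. An illustrative check (the authoritative errors still come from the upstream API):

```python
def usable_on_free_tier(model_id: str) -> bool:
    """Heuristic filter matching the restrictions in the table above."""
    if model_id.endswith(":free"):
        return False  # "OpenRouter free models are not supported"
    if model_id.startswith("openrouter/"):
        return False  # "This model is not supported on Free Tier"
    return True

ids = ["xiaomi/mimo-v2-pro", "meta-llama/llama-3-8b:free", "openrouter/auto"]
print([m for m in ids if usable_on_free_tier(m)])  # → ['xiaomi/mimo-v2-pro']
```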
These models cost $0 on the standard (non-`:free`) path and work with the free subscription:
| Model | Context | Max Output | Tools | Reasoning |
|---|---|---|---|---|
| `xiaomi/mimo-v2-pro` | 1M | 131K | ✅ | ✅ |
| `xiaomi/mimo-v2-omni` | 262K | 65K | ✅ | ✅ |
Other $0 models exist (image gen, video gen, reranking) but are not text chat LLMs:
- `black-forest-labs/flux.2-*` — image generation
- `alibaba/wan-2.6`, `alibaba/wan-2.7` — video generation
- `bytedance/seedance-*` — video generation
- `openai/sora-2-pro` — video generation
- `google/veo-3.1` — video generation
- `cohere/rerank-*` — reranking (not chat)
For access beyond the $0 models, check which models your key can reach via:
```bash
curl http://localhost:8090/v1/models \
  -H "Authorization: Bearer np-xxx" | python3 -m json.tool
```

Edit `/opt/nous-proxy/.env`:
```bash
NOUS_PORTAL_URL=https://portal.nousresearch.com
NOUS_INFERENCE_URL=https://inference-api.nousresearch.com/v1
PROXY_PORT=8090
PROXY_API_KEYS=np-xxx   # Comma-separated, or auto-generated
```

```
/opt/nous-proxy/
├── pyproject.toml          # Dependencies & build config
├── Dockerfile              # Multi-stage Docker build
├── docker-compose.yml      # Container orchestration
├── docker-entrypoint.sh    # Auto-fix data dir ownership
├── .dockerignore           # Docker build exclusions
├── .env                    # Environment config
├── .env.example            # Template
├── data/                   # Persisted tokens & API keys (bind mount)
│   ├── tokens.json
│   └── api_keys.json
└── nous_proxy/
    ├── __init__.py
    ├── config.py           # Settings (pydantic-settings)
    ├── auth.py             # OAuth device code flow
    ├── token_manager.py    # Token lifecycle + auto-refresh
    ├── api_keys.py         # API key validation
    ├── proxy.py            # Request forwarding + attribution
    ├── anthropic.py        # Anthropic Messages API translator (Claude Code)
    ├── responses.py        # OpenAI Responses API translator (Codex CLI)
    └── main.py             # FastAPI app + CLI
```
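The auto-refresh behavior in `token_manager.py` typically hinges on one check: refresh the access token a little before it expires so no request hits the upstream with a stale bearer. A sketch with assumed field names (`expires_at` as a Unix timestamp) and an assumed 5-minute margin:

```python
import time

REFRESH_MARGIN = 300  # seconds before expiry to refresh (assumed value)

def needs_refresh(tokens: dict, now=None) -> bool:
    """True when the access token is inside the refresh window."""
    now = time.time() if now is None else now
    return now >= tokens["expires_at"] - REFRESH_MARGIN

print(needs_refresh({"expires_at": 1000.0}, now=800.0))   # → True
print(needs_refresh({"expires_at": 2000.0}, now=800.0))   # → False
```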
```bash
cd /opt/nous-proxy
uv venv .venv
source .venv/bin/activate
uv pip install -e .

# Run with auto-reload
python -m nous_proxy.main --reload
```

- Thinking blocks filtered: the Xiaomi model generates 20-30 reasoning events per response, causing high latency, so thinking is filtered out at the proxy level.
- Tool call arguments: fixed — `input_json_delta` now properly streams tool arguments.
- `developer` role: merged into the system message (Responses API specific).
- `web_search` tool: filtered out (not a function tool type).
- `apply_patch` tool: converted from `custom` type to `function` type.
- "Planning mode" behavior: the Xiaomi model may generate text instead of tool calls when Codex's complex system prompt (~25K chars) is present. This is a model limitation — the model works fine with simpler prompts (e.g., hermes-agent).
- Streaming: Responses API streaming works with proper SSE events (`response.created`, `response.output_item.added`, `response.function_call_arguments.delta`, `response.completed`).
- Model compatibility: `xiaomi/mimo-v2-pro` supports reasoning and tool calling, but behavior varies with client system prompt complexity.
- Rate limits: the NousResearch free tier has rate limits; upgrade for higher limits.
- Context window: large conversations (>100KB body) may cause timeouts.
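The thinking-block filtering noted above could be implemented as a simple pass over streamed deltas. This sketch uses an assumed `reasoning_content` field name for reasoning deltas; it is not the proxy's actual code:

```python
def filter_thinking(deltas):
    """Drop reasoning deltas from a stream; pass content and tool calls through."""
    for delta in deltas:
        if "reasoning_content" in delta:
            continue  # drop model reasoning to cut latency
        yield delta

stream = [
    {"reasoning_content": "step 1..."},
    {"content": "Hi"},
    {"tool_calls": [{"index": 0}]},
]
print(list(filter_thinking(stream)))  # → [{'content': 'Hi'}, {'tool_calls': [{'index': 0}]}]
```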