diff --git a/CHANGELOG.md b/CHANGELOG.md index bcfdeeb0..b331c9b9 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,11 @@ # apollo +## 2.0.0 + +### Major Changes + +- 782b846: Force authorisation on all API keys + ## 1.5.0 ### Minor Changes diff --git a/CLAUDE.md b/CLAUDE.md index c0ed0931..25b2e406 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -6,6 +6,9 @@ code in this repository. **Never chain Bash commands with `&&`, `;`, or `cd ... &&`. Use separate Bash calls instead.** +**Do not end single-sentence console/log output strings with a full stop.** +`console.log("rejected request")` not `console.log("rejected request.")` + ## Commands ### Development @@ -46,7 +49,7 @@ TypeScript) service modules. any `services//` directory not starting with `_`. Detects service type by checking for `.py` (Python) or `.ts` (TypeScript) index file. - **Instance auth** (`platform/src/auth/`): `/services/*` uses a - map-if-known-else-forward auth hook that is always active (no flag). The auth surface + known-client-only auth hook that is always active (no flag). The auth surface is split into three named concerns: the client-credential authenticate hook and Anthropic-key resolver on the injectable `InstanceAuth` class (`instance-auth.ts`), the internal-call exemption (`internal-token.ts`), and the shared `hashToken` @@ -57,13 +60,19 @@ TypeScript) service modules. `POSTGRES_URL` when unset, so local dev needs only the one var; staging/prod point `APOLLO_CLIENTS_DB_URL` at a separate credentials DB). The inbound `api_key` is treated purely as a credential and is **never** forwarded to the LLM on a known - match: it is replaced with the matched client's stored `anthropic_api_key`, or - stripped (falling back to the global `ANTHROPIC_API_KEY`) when that column is - `NULL`. An *unknown* key is forwarded unchanged only if it is `sk-ant-`-shaped - (bring-your-own key); an unknown non-`sk-ant-` key is rejected with `401` - (likely a Lightning credential that must not leak to the LLM). No `api_key` - falls back to the global key. The resolver (`InstanceAuth.resolveKey`) returns a - tagged `KeyResolution` (`useKey`/`useGlobal`/`forward`/`passthrough`) dispatched + match: it is replaced with the matched client's stored `anthropic_api_key`. A + known client must have a stored key; a `NULL` `anthropic_api_key` is a + server-side misconfiguration and is rejected with `500` (`CLIENT_MISCONFIGURED`, + reported to Sentry via `captureException`), never billed to the global key. An + *unknown* key is **rejected** regardless of shape: `401` when the + lookup completes and confirms no such client (a verified unknown that must not + leak to the LLM), and `503` when the + client store can't be reached (we can't verify, so don't guess — retryable, + never a misleading `401`). A request with no `api_key` at all is served by the + global `ANTHROPIC_API_KEY` when one is configured (the field is dropped); with no + global key to serve it, the keyless request is rejected with `401`. The resolver + (`InstanceAuth.resolveKey`) returns a + tagged `KeyResolution` (`useKey`/`useGlobal`/`passthrough`) dispatched by a named switch in `services.ts`. The stored `anthropic_api_key` may be plaintext or AES-256-GCM-encrypted (`enc:v1:` values, decrypted with `APOLLO_ENC_KEY`; see @@ -74,9 +83,8 @@ TypeScript) service modules. stale-while-revalidate, so a burst of requests at the TTL boundary shares one DB read (cold start awaits it; a warm-but-stale cache is served while one background refresh runs) rather than stampeding the DB. If the table can't be - reached, known-client swaps don't resolve and callers degrade to the - shape-checked forward path (the same `sk-ant-` rule applies; it does not - blanket-reject). The auth hook is scoped to + reached, known-client swaps don't resolve and a caller with an `api_key` gets a + retryable `503` (we can't verify them). The auth hook is scoped to `/services/*`, so health/root endpoints outside that group are unaffected. Internal Apollo-to-Apollo `apollo()` calls are exempt via a per-process internal token (`APOLLO_INTERNAL_TOKEN`, minted at startup; `bridge.ts` injects diff --git a/README.md b/README.md index bf9bc91a..f2603d85 100644 --- a/README.md +++ b/README.md @@ -16,16 +16,18 @@ This repo contains: This README covers running, debugging and deploying the server. Deeper docs live alongside the code: -- **Architecture** — see [Server Architecture](#server-architecture) below, and +- **Architecture** - see [Server Architecture](#server-architecture) below, and each service's own README for service-specific detail. -- **Contributing & service conventions** — [`CONTRIBUTING.md`](CONTRIBUTING.md) +- **Contributing & service conventions** - [`CONTRIBUTING.md`](CONTRIBUTING.md) explains how to add and structure a Python service (entry.py, imports, logging, code quality). -- **Services** — every service has its own README with payload specs and +- **Services** - every service has its own README with payload specs and examples. See the [Services](#services) index below. -- **Instance auth** — [`platform/src/auth/README.md`](platform/src/auth/README.md) - covers authenticating `/services/*` per client and managing per-client Anthropic keys. -- **Testing** — [`services/testing/README.md`](services/testing/README.md) +- **Instance auth** - + [`platform/src/auth/README.md`](platform/src/auth/README.md) covers + authenticating `/services/*` per client and managing per-client Anthropic + keys. +- **Testing** - [`services/testing/README.md`](services/testing/README.md) documents the shared acceptance-test harness (YAML specs + LLM-as-judge); per-service guides live in each service's `tests/`. @@ -213,53 +215,11 @@ the [Contribution Guide](CONTRIBUTING.md) for details. ## Services Each service lives in `services//` and is auto-mounted by service -discovery. Every mounted service has its own README with payload specs and -examples — start there for anything service-specific. - -### Chat & orchestration - -| Service | What it does | -| --- | --- | -| [`global_chat`](services/global_chat/README.md) | Orchestrator and single entry point for OpenFn AI chat; routes to subagents or escalates to a planner. Also see its [`PAYLOAD_SPEC.md`](services/global_chat/PAYLOAD_SPEC.md). | -| [`job_chat`](services/job_chat/README.md) | AI chat for OpenFn job code, with a code-suggestions/auto-patch mode. | -| [`workflow_chat`](services/workflow_chat/README.md) | Generates and edits OpenFn workflow YAML, preserving job code and IDs. | -| [`doc_agent_chat`](services/doc_agent_chat/README.md) | Agentic chat over a project's uploaded documents (RAG). | - -### Docs, search & RAG - -| Service | What it does | -| --- | --- | -| [`search_docsite`](services/search_docsite/README.md) | Semantic search over the OpenFn docs (`docsite` Pinecone index). | -| [`embed_docsite`](services/embed_docsite/README.md) | Downloads and indexes the OpenFn docs into the `docsite` index. | -| [`doc_agent_upload`](services/doc_agent_upload/README.md) | Fetches and indexes project documents into the `doc-agent` index. | - -### Adaptors - -| Service | What it does | -| --- | --- | -| [`load_adaptor_docs`](services/load_adaptor_docs/README.md) | Parses adaptor function docs into Postgres. | -| [`search_adaptor_docs`](services/search_adaptor_docs/README.md) | Queries adaptor docs back out of Postgres by version. | -| [`latest_adaptors`](services/latest_adaptors/README.md) | Fetches the latest adaptor versions from the OpenFn repo. | -| [`adaptor_apis`](services/adaptor_apis/README.md) | **TypeScript** service: produces a JSON schema of an adaptor's API. | - -### Medical vocab & embeddings - -| Service | What it does | -| --- | --- | -| `vocab_mapper` | Maps medical vocabularies (LOINC/SNOMED) against the `apollo-mappings` index. (No README yet.) | -| [`embeddings`](services/embeddings/README.md) | Vector-store wrapper used by the vocab services. | -| [`embed_loinc_dataset`](services/embed_loinc_dataset/README.md) | Embeds the LOINC dataset into `apollo-mappings`. | -| [`embed_snomed_dataset`](services/embed_snomed_dataset/README.md) | Embeds the SNOMED dataset into `apollo-mappings`. | -| [`embeddings_demo`](services/embeddings_demo/README.md) | Standalone embeddings demo (Zilliz). | - -### Utilities & support - -| Service | What it does | -| --- | --- | -| [`status`](services/status/README.md) | Health check: validates Anthropic, OpenAI and Pinecone keys. | -| [`echo`](services/echo/README.md) | Test service that returns its input; useful for verifying the pipeline. | -| [`auth`](platform/src/auth/README.md) | Instance-auth hook + provisioning (server layer, under `platform/`, not a mounted service). See [Database](#database). | -| [`testing`](services/testing/README.md) | Shared acceptance-test harness (not a mounted service). | +discovery. + +Start the server with `bun start` and head to `http://localhost:3000` to see an +overview of active services. Or go to `https://apollo.openfn.org` to see the +production services. ## Database @@ -272,12 +232,12 @@ apply with `psql`, both written with `CREATE TABLE IF NOT EXISTS` so re-running them is safe: - [`services/load_adaptor_docs/schema.sql`](services/load_adaptor_docs/schema.sql) - — `adaptor_function_docs`. This table is also created lazily the first time - `load_adaptor_docs` runs, so applying it by hand is optional. -- `lightning_clients` — created and kept current by the migration runner + - `adaptor_function_docs`. This table is also created lazily the first time + `load_adaptor_docs` runs, so applying it by hand is optional. +- `lightning_clients` - created and kept current by the migration runner (`platform/src/db/migrate.ts`, migrations under - [`platform/migrations/`](platform/migrations/)). It is applied automatically at - Apollo startup when `POSTGRES_URL` is set; no manual `psql` step is needed. + [`platform/migrations/`](platform/migrations/)). It is applied automatically + at Apollo startup when `POSTGRES_URL` is set; no manual `psql` step is needed. First, make sure you've configured your desired `POSTGRES_URL` in your `.env` file. @@ -299,47 +259,33 @@ Create a Postgres DB matching your POSTGRES_URL from the `.env` file ### Instance authentication (optional) -`/services/*` can be authenticated so that only known clients (e.g. specific Lightning -instances) may call it, with Apollo using **each client's own Anthropic API -key** for that client's requests. - -- It is **transparent and backward compatible** (map-if-known-else-forward): the - auth hook is always active but only swaps in a key when it recognises the caller. - Clients are looked up in the `lightning_clients` table via `POSTGRES_URL`; if - that table can't be reached, known-client swaps simply don't resolve and every - caller degrades to the forward path (it does **not** blanket-reject). -- The credential is the **`api_key` the caller already sends in the request - body** — there is no bearer token, no `Authorization` header, and no - Lightning-side change. Apollo stores only a SHA-256 hash of it. -- On a match, the inbound `api_key` is treated purely as a credential and is - **never** forwarded to the LLM: it is replaced with the client's stored - Anthropic key (so LLM usage bills to that client), or stripped — falling back - to the global `ANTHROPIC_API_KEY` — if the client has no stored key. -- An **unrecognised** key is forwarded unchanged **only if it looks like an - Anthropic key** (prefix `sk-ant-`) — this is the bring-your-own-key path. An - unrecognised key that is _not_ `sk-ant-`-shaped is a likely Lightning - credential, so it is **rejected** (`401`) rather than forwarded, which would - leak it to the LLM. A request with no `api_key` falls back to the global key. -- Health/root endpoints (`/livez`, `/status`, `/`) are outside `/services/*` and - never subject to the auth hook. Internal Apollo-to-Apollo `apollo()` calls are exempt via a - per-process internal token (`APOLLO_INTERNAL_TOKEN`), not by network position. +`/services/*` is authenticated so that only known clients (e.g. specific +Lightning instances) may call it, with Apollo using **each client's own +Anthropic API key** for that client's requests. + +The auth hook is always active. A request that carries an `api_key` must resolve +to a known Lightning client or it is **rejected**. Clients are looked up in the +`lightning_clients` table via `APOLLO_CLIENTS_DB_URL` (falling back to +`POSTGRES_URL` when unset). To enable it and provision clients, see [`platform/src/auth/`](platform/src/auth/README.md). #### Deploying: pin `APOLLO_INTERNAL_TOKEN` in production -`APOLLO_INTERNAL_TOKEN` is the mechanism that lets internal `apollo()` self-calls -through the auth hook: a self-call carries it in the `x-apollo-internal` header and the -hook matches it. Because the global `ANTHROPIC_API_KEY` is dev-only, a token that -fails to match is a dead end (a `401`), not a soft fallback to the global key — so -the match has to work in every topology. +`APOLLO_INTERNAL_TOKEN` is the mechanism that lets internal `apollo()` +self-calls through the auth hook: a self-call carries it in the +`x-apollo-internal` header and the hook matches it. Because the global +`ANTHROPIC_API_KEY` is dev-only, a token that fails to match is a dead end (a +`401`), not a soft fallback to the global key - so the match has to work in +every topology. -- **Always set `APOLLO_INTERNAL_TOKEN` to a shared value across the deployment in - production.** When it is set, the per-process minting path never runs. +- **Always set `APOLLO_INTERNAL_TOKEN` to a shared value across the deployment + in production.** When it is set, the per-process minting path never runs. - The per-process random mint is a **dev-only convenience**. Apollo assumes one - Bun process per host; the `apollo()` self-call relies on loopback calls landing - on the same process, so a minted token only works single-process-per-host. + Bun process per host; the `apollo()` self-call relies on loopback calls + landing on the same process, so a minted token only works + single-process-per-host. - If `reusePort` clustering is ever enabled, a shared `APOLLO_INTERNAL_TOKEN` is **required**, not optional: a self-call can otherwise be routed to a sibling process that minted a different token and will `401`. Startup logs the token's diff --git a/package.json b/package.json index d25cbc20..3da0d9f6 100644 --- a/package.json +++ b/package.json @@ -1,7 +1,7 @@ { "name": "apollo", "module": "platform/index.ts", - "version": "1.5.0", + "version": "2.0.0", "type": "module", "scripts": { "start": "NODE_ENV=production bun platform/src/index.ts", diff --git a/platform/src/auth/README.md b/platform/src/auth/README.md index 878ad41b..8b03b4a4 100644 --- a/platform/src/auth/README.md +++ b/platform/src/auth/README.md @@ -22,21 +22,32 @@ migrations under `platform/migrations/`). hashes it, and looks for a matching row. The inbound `api_key` is treated **purely as a credential and is never forwarded to the LLM** on a known match. - On a match it is replaced with the client's stored `anthropic_api_key`, so all - LLM usage for that request bills to the key Apollo controls. If the column is - `NULL`, the inbound key is **stripped** and Apollo falls back to its global - `ANTHROPIC_API_KEY`. Either way the caller's key cannot pass through. + LLM usage for that request bills to the key Apollo controls. A known client + **must** have a stored key: a `NULL` `anthropic_api_key` is a server-side + misconfiguration, so such a request is rejected with **`500`** (reported to + Sentry), never silently billed to the global key. The caller's key never passes + through to the LLM. - **Performance:** lookups are cached per client on a ~60s TTL with single-flight, stale-while-revalidate refresh, so the database is queried at most once per minute per process per token, never on the per-request path to Anthropic. The per-request cost is a hash plus a map lookup. -- **Transparent / backward compatible (map-if-known-else-forward):** the auth hook - is always active but only swaps in a key when it recognises the caller. An - unrecognised key is forwarded unchanged if it is `sk-ant-`-shaped - (bring-your-own key) and rejected (`401`) otherwise; a non-`sk-ant-` key is a - likely Lightning credential that must not reach the LLM. A request with no - `api_key` falls back to the global key. When this table can't be reached, - known-client swaps don't resolve and every caller degrades to that forward - path; it does **not** blanket-reject. +- **Known-client-only:** the auth hook is always active. A request that carries an + `api_key` must resolve to a known Lightning client or it is **rejected**, + whatever the key's shape (`sk-ant-` or not). The rejection splits on whose fault + the failure is: + - **`401`** when the lookup completes and confirms no such client (a verified + unknown key, which must not reach the LLM). + - **`503`** when the client store can't be reached (DB never came up, or the + read threw). We can't verify the caller, so we don't guess: this is our + outage and is retryable, never a misleading `401`. + - **`500`** when the lookup finds the client but its stored `anthropic_api_key` + is `NULL`. A recognised client with no key is a server-side misconfiguration, + reported to Sentry, not a caller error. + + A request with **no `api_key` at all** is served by the global + `ANTHROPIC_API_KEY` when one is configured (the field is simply dropped); when + no global key is configured there is nothing to serve it, so it is rejected with + **`401`**. - The health endpoints (`/livez`, `/status`, `/`) sit outside `/services/*` and are never subject to the auth hook. Internal Apollo-to-Apollo `apollo()` calls are exempt via a per-process internal token (`APOLLO_INTERNAL_TOKEN`), not by network position. @@ -110,8 +121,9 @@ lands in shell history or `ps`; the client **name** is a positional argument. 4. The client is active as soon as its row is in the table — there is no flag to set or restart needed. The startup log shows `Apollo instance auth: lightning_clients lookup ready.` once the DB is reachable. (If the table is - missing or the DB is down, the log warns and callers fall to the forward path - rather than being rejected; known-client swaps just won't resolve.) + missing or the DB is down, the log warns and callers with an `api_key` get a + retryable `503` — we can't verify them, so known-client swaps just won't + resolve until the DB is back.) 5. Give the printed `api_key` to the Lightning instance. It keeps sending it as `api_key` exactly as it does today — no other Lightning-side change. @@ -126,8 +138,9 @@ lands in shell history or `ps`; the client **name** is a positional argument. ``` - **Verify** that a client's stored key resolves under the current `APOLLO_ENC_KEY` - — reports `decrypts` / `plaintext` / `global` (NULL) / `DECRYPT_FAILED`, and exits - non-zero on failure: + — reports `decrypts` / `plaintext` / `DECRYPT_FAILED`, and exits non-zero on + failure. A `NULL` stored key is an invalid (keyless) client row, not a usable + state: ```sh bun run client verify acme @@ -140,10 +153,10 @@ lands in shell history or `ps`; the client **name** is a positional argument. ### `encrypt` — the lower-level subcommand `bun run client encrypt` prints the `enc:v1:…` value for the key on stdin and makes -**no DB write**. Useful for manual SQL / row-seeding — e.g. to add a client whose -`anthropic_api_key` is `NULL` (so it uses Apollo's global `ANTHROPIC_API_KEY`), -which `add` doesn't cover. Pair the printed value with an `auth_token_hash` you -compute yourself: +**no DB write**. Useful for manual SQL / row-seeding when you need to write the row +yourself. Every client row needs a non-`NULL` `anthropic_api_key`: a recognised +client with no stored key is a misconfiguration that the auth hook rejects with +`500`. Pair the printed value with an `auth_token_hash` you compute yourself: ```sh echo "$KEY" | bun run client encrypt @@ -158,7 +171,8 @@ compatibility. - **Fail closed.** If an `enc:v1:` row can't be decrypted (wrong/missing `APOLLO_ENC_KEY` or corrupt value), that client is dropped from the allow-list and its requests get `401` — Apollo never falls back to the global key for an - encrypted-but-undecryptable row. A `NULL` key still means "use the global key". + encrypted-but-undecryptable row. A recognised client must have a usable stored + key; a `NULL` `anthropic_api_key` is a misconfiguration that yields `500`. - **Rotation** is manual: re-encrypt every `enc:v1:` row with the new key, then swap `APOLLO_ENC_KEY` and restart. - **What it protects.** The ciphertext is useless without `APOLLO_ENC_KEY`, so this diff --git a/platform/src/auth/client/cli.ts b/platform/src/auth/client/cli.ts index a1aad825..86b3ec56 100644 --- a/platform/src/auth/client/cli.ts +++ b/platform/src/auth/client/cli.ts @@ -124,9 +124,9 @@ function reportVerify(name: string, status: VerifyStatus): number { case "plaintext": console.log(`Client "${name}": anthropic_api_key is stored as plaintext (used as-is).`); return 0; - case "global": - console.log(`Client "${name}": anthropic_api_key is NULL, falls back to the global ANTHROPIC_API_KEY.`); - return 0; + case "no_key": + console.error(`Client "${name}": anthropic_api_key is NULL. This is an invalid (keyless) client row; the auth hook will reject every request with 500. Set a key with \`rotate\`.`); + return 1; case "decrypt_failed": console.error(`Client "${name}": DECRYPT_FAILED. The stored enc:v1: key cannot be decrypted with the current APOLLO_ENC_KEY.`); return 1; diff --git a/platform/src/auth/client/commands.ts b/platform/src/auth/client/commands.ts index c9d56931..d0f06a53 100644 --- a/platform/src/auth/client/commands.ts +++ b/platform/src/auth/client/commands.ts @@ -61,20 +61,22 @@ export function encryptValue(encKey: Buffer, plaintext: string): string { export type VerifyStatus = | "decrypts" // "enc:v1:…" that decrypts cleanly | "plaintext" // legacy plaintext, used as-is - | "global" // NULL -> falls back to the global ANTHROPIC_API_KEY + | "no_key" // NULL -> a misconfigured client row; the auth hook rejects it with 500 | "decrypt_failed" // "enc:v1:…" with no/wrong key or a corrupt blob | "unknown_client"; // no row by that name /** Classify a stored value the same way instance-auth's decryptStoredKey does, but * reporting the outcome instead of dropping the client. Pure; no DB. The branch - * order here (NULL -> global, no-prefix -> plaintext, no-key/decrypt-error -> + * order here (NULL -> no_key, no-prefix -> plaintext, no-key/decrypt-error -> * fail) must track decryptStoredKey in instance-auth.ts: verify exists to predict - * the auth hook's behaviour, so the two cannot be allowed to diverge. */ + * the auth hook's behaviour, so the two cannot be allowed to diverge. A NULL stored + * key is not a usable state — the auth hook rejects such a client with 500 rather + * than falling back to the global key. */ export function classifyStoredKey( stored: string | null, encKey: Buffer | null ): VerifyStatus { - if (stored === null) return "global"; + if (stored === null) return "no_key"; if (!stored.startsWith(ENC_PREFIX)) return "plaintext"; if (!encKey) return "decrypt_failed"; try { diff --git a/platform/src/auth/hash.ts b/platform/src/auth/hash.ts index b3abd7d7..56e9ef0b 100644 --- a/platform/src/auth/hash.ts +++ b/platform/src/auth/hash.ts @@ -1,13 +1,7 @@ import { createHash } from "node:crypto"; -/** SHA-256 hex of a client credential, computed over its UTF-8 bytes. This is the - * one definition of the hash contract: provisioning writes it, the auth hook looks - * it up. Apollo is TS-only on this path, so there is no cross-language duplicate. - * - * Trims leading and trailing whitespace before hashing. The credential is its - * trimmed form: a token minted with `randomBytes(32).toString("base64url")` has - * no whitespace, so trim is identity for every stored hash, but a copy-paste - * newline or stray space at either call site can no longer split the contract. */ +/** SHA-256 hex of a client credential. Trimmed before hashing so a copy-paste + * newline or stray space hashes to the same value as the clean credential. */ export function hashToken(token: string): string { return createHash("sha256").update(token.trim()).digest("hex"); } diff --git a/platform/src/auth/instance-auth.ts b/platform/src/auth/instance-auth.ts index a352d745..b120d60e 100644 --- a/platform/src/auth/instance-auth.ts +++ b/platform/src/auth/instance-auth.ts @@ -1,6 +1,14 @@ import type { ApolloError } from "../util/errors"; -import { serviceUnavailable, unauthorized } from "../util/errors"; -import { ENC_PREFIX, decryptKey, parseEncKey } from "../util/instance-key-crypto"; +import { + clientMisconfigured, + serviceUnavailable, + unauthorized, +} from "../util/errors"; +import { + ENC_PREFIX, + decryptKey, + parseEncKey, +} from "../util/instance-key-crypto"; import { clientsDbUrl, getDb } from "../db"; import { hashToken } from "./hash"; import { checkInternalHeader } from "./internal-token"; @@ -8,20 +16,18 @@ import { captureException } from "../util/sentry"; export type Client = { name: string; anthropicKey: string | null }; -// Resolve a client by token hash. Used both for the injected pre-cache bypass -// (lookup) and the injected per-hash DB read driving the cache (dbLookup). type Lookup = (hash: string) => Promise | Client | null; -// Per-hash cache entry. An absent key means "never checked"; a "miss" means -// "checked, confirmed unknown". The two are distinguishable so a verified miss -// can be cached without being mistaken for a cold slot. +// Absent = never checked; "miss" = checked, confirmed unknown, so a verified +// miss isn't mistaken for a cold slot. type CacheEntry = | { kind: "hit"; client: Client; checkedAt: number } | { kind: "miss"; checkedAt: number }; -// Outcome of resolving one token hash. "unavailable" (DB never came up, or the read -// threw) is kept distinct from "absent" (lookup completed, no such client) so the auth -// hook can answer a non-sk-ant- caller with a retryable 503 instead of a misleading 401. +// Outcome of resolving one token hash. "unavailable" (DB never came up, or +// the read threw) is kept distinct from "absent" (lookup completed, no such +// client) so the auth hook can answer with a retryable 503 instead of a +// misleading 401. type LookupResult = | { kind: "found"; client: Client } | { kind: "absent" } @@ -29,42 +35,57 @@ type LookupResult = /** Resolution of which key the outgoing payload carries. The names are * load-bearing: services.ts dispatches them in a named switch so the - * inbound-credential-never-forwarded invariant is structural, not positional. */ + * inbound-credential-never-forwarded invariant is structural, not + * positional. */ export type KeyResolution = - | { kind: "useKey"; key: string } // known client: swap in its stored Anthropic key - | { kind: "useGlobal" } // known client, NULL stored key: drop field -> global key - | { kind: "forward" } // unknown caller past the shape check: leave body as received - | { kind: "passthrough" }; // internal apollo() hop: leave body exactly as received - -// Anthropic keys are prefixed sk-ant-; our client credentials are base64url, so the -// two shapes don't collide. An unknown key without this prefix is treated as a -// (likely Lightning) credential and rejected rather than forwarded to the LLM. -const ANTHROPIC_KEY_PREFIX = "sk-ant-"; + // known client: swap in its stored Anthropic key + | { kind: "useKey"; key: string } + // drop field -> global key: a request with no api_key at all + | { kind: "useGlobal" } + // internal apollo() hop: leave body exactly as received + | { kind: "passthrough" }; const CACHE_TTL_MS = 60_000; -// Tunable ceiling: the longest a stale entry survives a broken DB before eviction. -// Scales with the TTL by design (Stu signed off on the 3x multiple). +// Longest a stale entry survives a broken DB before eviction. Scales with +// the TTL. const MAX_STALENESS_MS = CACHE_TTL_MS * 3; -// Returned when an encrypted key can't be decrypted: that client is omitted (fail -// closed) rather than falling back to the global key, which would mis-bill its usage. +// Returned when an encrypted key can't be decrypted: that client is omitted +// (fail closed) rather than falling back to the global key, which would +// mis-bill its usage. const DECRYPT_FAILED = Symbol("decrypt-failed"); +interface AuthCtx { + body?: { api_key?: unknown }; + query?: { api_key?: unknown }; + request?: { url?: unknown }; + set?: { status?: number }; + internalCall?: boolean; + lightningClient?: Client; +} + export interface InstanceAuthOptions { - // Master key for decrypting stored anthropic_api_key values. When omitted, read - // from APOLLO_ENC_KEY at init(); pass explicitly in tests to skip the env. + // Master key for decrypting stored anthropic_api_key values. When omitted, + // read from APOLLO_ENC_KEY at init(); pass explicitly in tests to skip the + // env. encKey?: Buffer | null; - // Test injection: a synchronous client resolver that bypasses the cache/DB path - // entirely. Production leaves this unset. + // Test injection: a synchronous client resolver that bypasses the cache/DB + // path entirely. Production leaves this unset. lookup?: Lookup | null; - // Test injection: a per-hash DB read that drives the real cache/single-flight/ - // staleness logic in place of Postgres. Setting it implies the DB is ready. + // Test injection: a per-hash DB read that drives the real + // cache/single-flight/staleness logic in place of Postgres. Setting it + // implies the DB is ready. dbLookup?: Lookup | null; + // Whether a global ANTHROPIC_API_KEY is configured to serve keyless + // requests. A boolean is snapshotted; a thunk is read live. When omitted, + // defaults to reading process.env.ANTHROPIC_API_KEY through a thunk so + // env/test changes stay live. + hasGlobalKey?: boolean | (() => boolean); } -// Read a query param straight off the request URL. Fallback for the WS-upgrade hook, -// where Elysia may not have populated ctx.query before beforeHandle runs. -function queryParam(ctx: any, name: string): string { +// WS-upgrade fallback: Elysia may not have populated ctx.query before +// beforeHandle runs. +function queryParam(ctx: AuthCtx, name: string): string { const url = ctx?.request?.url; if (typeof url !== "string") return ""; try { @@ -74,77 +95,82 @@ function queryParam(ctx: any, name: string): string { } } -// Map a completed lookup (Client | null) to a found/absent LookupResult. "unavailable" -// is never produced here; it arises only where the DB-backed path cannot complete a -// read (db not ready, or the read threw). function toLookupResult(client: Client | null): LookupResult { return client ? { kind: "found", client } : { kind: "absent" }; } /** - * The instance-auth surface, owning all of what used to be module-level state: - * the per-client cache, single-flight handles, dbReady flag, and the encryption - * key. One instance per process is created in server.ts; tests construct their own - * with injected lookups, so there are no test-seam exports and no module globals. + * The instance-auth surface, owning all per-process state: the per-client + * cache, single-flight handles, dbReady flag, and the encryption key. One + * instance per process is created in server.ts; tests construct their own + * with injected lookups, so there are no test-seam exports and no module + * globals. */ export class InstanceAuth { - // False until the lightning_clients lookup is usable. A down DB means no client - // can be resolved, so every caller degrades to the shape-checked forward path. + // False until the lightning_clients lookup is usable. A down DB means no + // client can be resolved, so a caller with an api_key gets a retryable 503 + // (we can't verify). private dbReady = false; private readonly clientCache = new Map(); - // Single-flight handles per hash: concurrent lookups for the same token share one - // promise, so a burst can never trigger more than one DB read for that hash. + // Single-flight handles per hash: concurrent lookups for the same token + // share one promise, so a burst can never trigger more than one DB read for + // that hash. private readonly lookupInFlight = new Map>(); private encKey: Buffer | null; private readonly lookupOverride: Lookup | null; private readonly dbLookupOverride: Lookup | null; + private readonly hasGlobalKey?: boolean | (() => boolean); constructor(opts: InstanceAuthOptions = {}) { this.encKey = opts.encKey ?? null; this.lookupOverride = opts.lookup ?? null; this.dbLookupOverride = opts.dbLookup ?? null; - // Injecting a per-hash DB read implies a reachable DB; mirror the old test seam. + // A dbLookup override implies a reachable DB. if (this.dbLookupOverride) this.dbReady = true; + this.hasGlobalKey = opts.hasGlobalKey; } - // The auth hook is always active. This reads APOLLO_ENC_KEY and probes the - // lightning_clients lookup so known clients can be resolved; if the DB is - // unreachable, dbReady stays false and every caller degrades to the shape-checked - // forward path (it does not blanket-reject). + globalKeyConfigured(): boolean { + if (typeof this.hasGlobalKey === "function") return this.hasGlobalKey(); + if (this.hasGlobalKey !== undefined) return this.hasGlobalKey; + return Boolean(process.env.ANTHROPIC_API_KEY); + } + + // Reads APOLLO_ENC_KEY and probes the lightning_clients DB. If unreachable, + // dbReady stays false and api_key callers get a 503. async init(): Promise { this.encKey = parseEncKey(process.env.APOLLO_ENC_KEY); if (process.env.APOLLO_ENC_KEY && !this.encKey) { console.error( - "Apollo instance auth: APOLLO_ENC_KEY is set but is not valid base64 of 32 bytes; encrypted client keys cannot be decrypted and those clients will be REJECTED." + "Apollo instance auth: APOLLO_ENC_KEY is set but is not valid base64 of 32 bytes; encrypted client keys cannot be decrypted" ); } if (!clientsDbUrl()) { this.dbReady = false; console.warn( - "Apollo instance auth: neither APOLLO_CLIENTS_DB_URL nor POSTGRES_URL is set, so known clients cannot be looked up; callers fall to the shape-checked forward path." + "Apollo instance auth: neither APOLLO_CLIENTS_DB_URL nor POSTGRES_URL is set" ); return; } - // Migrations have already run (server.ts), so the table exists if the DB is up. - // The probe is now just a reachability check that sets dbReady. + // Migrations already ran in server.ts; this only probes reachability. try { await getDb()`SELECT 1`; this.dbReady = true; - this.clientCache.clear(); // force a fresh load on the first request - console.log("Apollo instance auth: lightning_clients lookup ready."); + this.clientCache.clear(); + console.log("Apollo instance auth: lightning_clients lookup ready"); } catch (err) { this.dbReady = false; console.error( - "Apollo instance auth: the database could not be reached, so known-client swaps will not resolve; callers fall to the shape-checked forward path.", + "Apollo instance auth: the database could not be reached", err ); } } - // null => global env key; "enc:v1:…" => AES-256-GCM decrypt (DECRYPT_FAILED on - // error); anything else => legacy plaintext. + // null => global env key; "enc:v1:…" => AES-256-GCM decrypt + // (DECRYPT_FAILED on error); anything else => plaintext key, used as-is. private decryptStoredKey( stored: string | null, clientName: string @@ -153,7 +179,7 @@ export class InstanceAuth { if (!stored.startsWith(ENC_PREFIX)) return stored; if (!this.encKey) { console.error( - `Apollo instance auth: client "${clientName}" has an encrypted anthropic_api_key but APOLLO_ENC_KEY is unset/invalid; omitting this client (fail closed).` + `Apollo instance auth: client "${clientName}" has an encrypted anthropic_api_key but APOLLO_ENC_KEY is unset/invalid` ); captureException( new Error( @@ -167,7 +193,7 @@ export class InstanceAuth { return decryptKey(stored, this.encKey); } catch (err) { console.error( - `Apollo instance auth: could not decrypt anthropic_api_key for client "${clientName}"; omitting this client (fail closed).`, + `Apollo instance auth: could not decrypt anthropic_api_key for client "${clientName}";`, err ); captureException(err, { reason: "decrypt-error", client: clientName }); @@ -175,8 +201,6 @@ export class InstanceAuth { } } - /** Turn one lightning_clients row into a Client, dropping it (fail closed) if its - * encrypted key can't be decrypted. */ rowToClient(row: { name: string; anthropic_api_key: string | null; @@ -186,8 +210,8 @@ export class InstanceAuth { return { name: row.name, anthropicKey: key }; } - // One targeted query for a single hash. Replaces the unbounded bulk SELECT: an - // unknown token now costs at most one round-trip on first sight, then a cached miss. + // One targeted query per hash: an unknown token costs one round-trip on + // first sight, then a cached miss. private async queryClient(hash: string): Promise { const rows = (await getDb()` SELECT name, anthropic_api_key @@ -200,10 +224,8 @@ export class InstanceAuth { return row ? this.rowToClient(row) : null; } - // Single-flight DB read for one hash: a concurrent caller for the same hash joins - // the in-flight promise (set synchronously before the await yields, so an eviction - // followed by a burst still produces one query). The handle is cleared in finally - // so a failed lookup never wedges the slot. + // Set the in-flight handle synchronously before the await yields, so a + // burst after eviction still produces one query. private loadClient(hash: string): Promise { const inFlight = this.lookupInFlight.get(hash); if (inFlight) return inFlight; @@ -215,149 +237,134 @@ export class InstanceAuth { return promise; } - private cacheResult(hash: string, client: Client | null): Client | null { + private cacheResult(hash: string, client: Client | null): void { this.clientCache.set( hash, client ? { kind: "hit", client, checkedAt: Date.now() } : { kind: "miss", checkedAt: Date.now() } ); - return client; + } + + private entryResult(entry: CacheEntry): LookupResult { + return toLookupResult(entry.kind === "hit" ? entry.client : null); + } + + private refreshInBackground(hash: string, entry: CacheEntry): void { + void this.loadClient(hash) + .then((client) => this.cacheResult(hash, client)) + .catch((err) => { + // A throw in the log/capture on this voided chain would surface + // as an unhandledRejection. + try { + console.error( + "Apollo instance auth: background refresh of a cached client failed; serving until expiry", + err + ); + captureException(err, { + reason: "stale-refresh-error", + client: entry.kind === "hit" ? entry.client.name : null, + }); + } catch {} + }); } private async lookupClient(hash: string): Promise { - if (!this.dbReady) return { kind: "unavailable" }; // lookup never came up -> cannot verify + // lookup never came up -> cannot verify + if (!this.dbReady) return { kind: "unavailable" }; const entry = this.clientCache.get(hash); if (entry) { const age = Date.now() - entry.checkedAt; if (age <= CACHE_TTL_MS) { - return toLookupResult(entry.kind === "hit" ? entry.client : null); + return this.entryResult(entry); } if (age <= MAX_STALENESS_MS) { - // Stale but within the ceiling: serve the cached value now and refresh once - // in the background. A failed refresh leaves the entry to age further; only - // the single-flight read fires, so a burst at the boundary triggers one call. - void this.loadClient(hash) - .then((client) => this.cacheResult(hash, client)) - .catch((err) => { - // Guard the catch body: a throw inside the log/capture on this voided - // chain would otherwise surface as an unhandledRejection. - try { - console.error( - "Apollo instance auth: background refresh of a cached client failed; serving the stale entry until it ages out.", - err - ); - captureException(err, { - reason: "stale-refresh-error", - client: entry.kind === "hit" ? entry.client.name : null, - }); - } catch {} - }); - return toLookupResult(entry.kind === "hit" ? entry.client : null); + // Stale but within ceiling: serve cached now, refresh once in + // background. A failed refresh leaves it to age out. + this.refreshInBackground(hash, entry); + return this.entryResult(entry); } - // Beyond the ceiling: evict and fall through to a cold, awaited lookup rather - // than serve a possibly-revoked client. + // Beyond ceiling: evict and fall through to a cold awaited lookup + // rather than serve a possibly-revoked client. this.clientCache.delete(hash); - console.warn( - `Apollo instance auth: cache entry for a client token exceeded the max-staleness ceiling (${MAX_STALENESS_MS}ms); evicting and re-checking the database.` - ); } - // Cold (or just-evicted): every concurrent caller awaits the one shared read. On - // failure cache nothing and fail closed (a former hit now rejects; a former miss - // simply re-queries next time). try { const client = await this.loadClient(hash); this.cacheResult(hash, client); return toLookupResult(client); } catch (err) { console.error( - "Apollo instance auth: client lookup failed against the database; rejecting this request (fail closed).", + "Apollo instance auth: client lookup failed against the database", err ); return { kind: "unavailable" }; } } - /** - * Client-credential authentication: the /services/* onBeforeHandle hook. Internal-call - * exemption is checked first and short-circuits; otherwise the inbound api_key is - * hashed and looked up. The auth hook is always active but only rejects two cases: a - * forged internal header, and an unknown non-sk-ant- key (a likely Lightning - * credential we must not forward to the LLM). Returning a value short-circuits the - * request with that body. - */ + private extractApiKey(ctx: AuthCtx): string { + // WS upgrade is a bodyless GET, so the credential rides as ?api_key= + // (query, URL fallback). A query token shows in logs, acceptable: it's + // hashed at rest. Don't move to a header; browsers can't set them on a + // WS upgrade. + return ( + (typeof ctx.body?.api_key === "string" ? ctx.body.api_key.trim() : "") || + (typeof ctx.query?.api_key === "string" + ? ctx.query.api_key.trim() + : "") || + queryParam(ctx, "api_key") + ); + } + authenticate = async (ctx: any): Promise => { - // Internal-call exemption wins precedence: Python children echo back the - // internal token (services/util.py apollo()), so such calls skip the api_key - // check. External callers can't forge it; it's a per-process secret never sent - // to clients. const internal = checkInternalHeader(ctx); if (internal.kind === "match") { - // Flag so the key resolves to passthrough: the forwarded api_key is left - // untouched rather than stripped, which would mis-bill a per-client key - // passed down an apollo() hop. ctx.internalCall = true; return; } if (internal.kind === "mismatch") { - // A non-empty internal header is a claim to be Apollo itself; a mismatch is - // either a sibling process without a shared APOLLO_INTERNAL_TOKEN or a forged - // header. Reject outright, never re-try as an external api_key caller, so a - // wrong internal header can't ride in on a valid body credential. + // A wrong internal header is rejected outright, never retried as an + // external api_key caller, so it can't ride in on a valid body + // credential. console.warn( - "Apollo internal token MISMATCH: x-apollo-internal present but does not match; likely a sibling process without a shared APOLLO_INTERNAL_TOKEN, or a forged header. Rejecting with 401." - ); - captureException( - new Error("Apollo instance auth: internal token mismatch"), - { reason: "internal-token-mismatch" } + "Apollo internal token MISMATCH: x-apollo-internal present but does not match; likely a sibling process without a shared APOLLO_INTERNAL_TOKEN, or a forged header. Rejecting with 401" ); return unauthorized(ctx); } - // The credential is the api_key the caller sends. POST puts it in the body; a WS - // upgrade is a bodyless GET, so it rides as the ?api_key= query param instead - // (ctx.query, with a URL fallback for hooks where Elysia hasn't parsed it yet). - // A query-string token shows up in access/proxy logs, which is acceptable here: Apollo - // is internal and the token is hashed at rest, so a log leak doesn't expose the - // stored Anthropic key. Don't "fix" this by moving to a header: browsers can't - // set headers on a WS upgrade. No key at all takes the forward path (-> global - // key), so only proceed to lookup when one is present. - const apiKey = - (typeof ctx.body?.api_key === "string" ? ctx.body.api_key.trim() : "") || - (typeof ctx.query?.api_key === "string" ? ctx.query.api_key.trim() : "") || - queryParam(ctx, "api_key"); - if (!apiKey) return; + const apiKey = this.extractApiKey(ctx); + console.log(apiKey); + if (!apiKey) { + return this.globalKeyConfigured() ? undefined : unauthorized(ctx); + } const hash = hashToken(apiKey); const result: LookupResult = this.lookupOverride ? toLookupResult(await this.lookupOverride(hash)) : await this.lookupClient(hash); if (result.kind === "found") { + // Recognised client with no stored key is a server-side + // misconfiguration, not a caller error. Mis-billing the global key + // would hide it, so 500 + report. + if (!result.client.anthropicKey) { + captureException( + new Error( + "Apollo instance auth: recognised client has no anthropic_api_key configured" + ), + { reason: "client-misconfigured-no-key", client: result.client.name } + ); + return clientMisconfigured(ctx); + } ctx.lightningClient = result.client; return; } - // An sk-ant- key is a bring-your-own Anthropic key: it needs no client lookup and - // is forwarded unchanged even during a store outage, so it NEVER reaches the 503 - // below. On a WS upgrade it only rode the query string, so record it for the - // message handler to fold into the outgoing payload. - if (apiKey.startsWith(ANTHROPIC_KEY_PREFIX)) { - ctx.forwardApiKey = apiKey; - return; - } - - // Non-sk-ant- from here. Split on whose fault the failure is: - // - unavailable: we could not complete the lookup (DB never came up, or the read - // threw). That is our outage and is retryable, so 503, never a misleading 401, - // never a silent forward of a likely Lightning credential. - // - absent: the lookup completed and confirmed no such client, so 401 as before. + // unavailable: lookup couldn't complete (our outage) -> retryable 503. + // absent: lookup confirmed no such client -> 401. if (result.kind === "unavailable") { captureException( - new Error( - "Apollo instance auth: client store unavailable; returning 503 rather than a misleading 401" - ), + new Error("Apollo instance auth: client store unavailable"), { reason: "client-store-unavailable-503", tokenHash: hash } ); return serviceUnavailable(ctx); @@ -365,21 +372,13 @@ export class InstanceAuth { return unauthorized(ctx); }; - /** - * Anthropic-key resolver. The result is dispatched by a named switch in - * services.ts; the inbound credential is never forwarded to the LLM in the - * useKey/useGlobal cases. Internal hops pass through untouched; a known client - * either swaps in its stored key or (NULL) falls back to the global key; every - * other caller forwards the body as received. - */ - resolveKey = (ctx: any): KeyResolution => { + resolveKey = (ctx: AuthCtx): KeyResolution => { if (ctx?.internalCall) return { kind: "passthrough" }; const client = ctx?.lightningClient as Client | undefined; - if (client) { - return client.anthropicKey - ? { kind: "useKey", key: client.anthropicKey } - : { kind: "useGlobal" }; - } - return { kind: "forward" }; + // A null-key client never reaches here: authenticate() rejects it with + // 500 and never sets ctx.lightningClient, so any client present here has + // a stored key. + if (client) return { kind: "useKey", key: client.anthropicKey as string }; + return { kind: "useGlobal" }; }; } diff --git a/platform/src/middleware/services.ts b/platform/src/middleware/services.ts index 05379844..1d5d6073 100644 --- a/platform/src/middleware/services.ts +++ b/platform/src/middleware/services.ts @@ -32,11 +32,11 @@ export default async (app: Elysia, port: number, auth: InstanceAuth) => { // Apply the resolved key to an outgoing payload with an explicit switch so the // inbound-credential-never-forwarded invariant is structural, not positional: a - // known client's stored key is swapped in (useKey), a NULL stored key drops the - // field so Python uses the global key (useGlobal), and every other caller forwards - // the body exactly as received (forward/passthrough). `ctx` is the upgrade-time - // context that carries lightningClient/internalCall: on POST the route ctx, on WS - // the captured ws.data, never a fresh per-message one. + // known client's stored key is swapped in (useKey), a NULL stored key (or a request + // with no api_key) drops the field so Python uses the global key (useGlobal), and an + // internal apollo() hop forwards the body exactly as received (passthrough). `ctx` is + // the upgrade-time context that carries lightningClient/internalCall: on POST the + // route ctx, on WS the captured ws.data, never a fresh per-message one. const applyKey = (payload: Record, ctx: any) => { const resolution = auth.resolveKey(ctx); switch (resolution.kind) { @@ -46,7 +46,6 @@ export default async (app: Elysia, port: number, auth: InstanceAuth) => { case "useGlobal": delete payload.api_key; break; - case "forward": case "passthrough": break; default: { @@ -65,9 +64,9 @@ export default async (app: Elysia, port: number, auth: InstanceAuth) => { applyKey({ ...(ctx.body ?? {}), session_id: ctx.uuid }, ctx); app.group("/services", (app) => { - // Resolve every /services/* caller: swap a known client's key, forward an - // unknown sk-ant- (or absent) key, reject a forged internal header or an - // unknown non-sk-ant- key. + // Resolve every /services/* caller: swap a known client's key, drop the field for + // a no-key request (global key), and reject anything else (forged internal header, + // or an api_key that isn't a known client -> 401/503). app.onBeforeHandle(auth.authenticate); modules.forEach((m) => { @@ -197,14 +196,10 @@ export default async (app: Elysia, port: number, auth: InstanceAuth) => { }); }; - // The credential rode the upgrade query string, not the message body. - // Seed a forwardable unknown key onto the payload so applyKey's - // forward case preserves it; a known client's useKey/useGlobal then - // overrides or drops it exactly as on POST. + // The credential rode the upgrade query string and was resolved by the + // auth hook onto ws.data; applyKey reads the resolution off it. A known + // client's useKey/useGlobal swaps or drops the field exactly as on POST. const base: Record = { ...(message.data ?? {}) }; - if (base.api_key == null && (ws.data as any)?.forwardApiKey) { - base.api_key = (ws.data as any).forwardApiKey; - } const payload = applyKey(base, ws.data); callService(m, port, payload as any, onLog, onEvent).then( diff --git a/platform/src/util/errors.ts b/platform/src/util/errors.ts index 541aac87..f967acb4 100644 --- a/platform/src/util/errors.ts +++ b/platform/src/util/errors.ts @@ -34,3 +34,12 @@ export function serviceUnavailable(ctx: any): ApolloError { "Client verification is temporarily unavailable" ); } + +export function clientMisconfigured(ctx: any): ApolloError { + return apolloError( + ctx, + 500, + "CLIENT_MISCONFIGURED", + "Client has no API key configured" + ); +} diff --git a/platform/test/auth/client/commands.test.ts b/platform/test/auth/client/commands.test.ts index b399e65a..1d0fd216 100644 --- a/platform/test/auth/client/commands.test.ts +++ b/platform/test/auth/client/commands.test.ts @@ -75,8 +75,8 @@ describe("client/commands key-prep (no DB)", () => { describe("client/commands classifyStoredKey (no DB)", () => { const encKey = randomBytes(32); - it("NULL -> global", () => { - expect(classifyStoredKey(null, encKey)).toBe("global"); + it("NULL -> no_key", () => { + expect(classifyStoredKey(null, encKey)).toBe("no_key"); }); it("a non-enc value -> plaintext", () => { expect(classifyStoredKey("sk-ant-plain", encKey)).toBe("plaintext"); @@ -127,17 +127,19 @@ describe("addClient -> auth-hook resolution (no DB)", () => { expect(gated.resolveKey(ctx)).toEqual({ kind: "useKey", key: "sk-ant-provisioned-secret" }); }); - it("a different api_key does not resolve the provisioned client", async () => { + it("a different api_key does not resolve the provisioned client and is rejected", async () => { const encKey = randomBytes(32); const sql = captureSql(); await addClient(sql as any, encKey, "acme", "sk-ant-secret"); const [{ values }] = sql.calls; const gated = gatedFor(encKey, values[1] as string, values[2] as string); + // The lookup completes and confirms no such client, so the present-but-unknown + // key is rejected outright (401). const ctx = ctxFor("sk-ant-some-other-key"); await gated.authenticate(ctx); expect(ctx.lightningClient).toBeUndefined(); - expect(gated.resolveKey(ctx)).toEqual({ kind: "forward" }); + expect(ctx.set.status).toBe(401); }); }); diff --git a/platform/test/server.test.ts b/platform/test/server.test.ts index 15667f0a..72ac4e96 100644 --- a/platform/test/server.test.ts +++ b/platform/test/server.test.ts @@ -33,11 +33,15 @@ const capturedExtras = ( // The shared listening app gets a synchronous lookup driven by a test-controlled // map. Setting `knownClients` per test is how a fresh configuration is applied -// without any module-global poke seam; an empty/absent map routes everyone by the -// shape rule, exactly as a down DB would. +// without any module-global poke seam; an empty/absent map makes every caller +// resolve as absent, so any present api_key is rejected. hasGlobalKey is forced +// true so keyless requests on this app take the global-key path (the test env has +// no real ANTHROPIC_API_KEY); the keyless-without-a-global-key 401 is covered by a +// dedicated instance below. let knownClients: Record | null = null; const sharedAuth = new InstanceAuth({ lookup: (hash) => knownClients?.[hash] ?? null, + hasGlobalKey: true, }); const app = await setup(port, sharedAuth); @@ -218,8 +222,9 @@ describe("Sentry", () => { describe("Instance authentication", () => { // No real DB — the seam keys clients by SHA-256 of the api_key they send. ALPHA - // has a stored Anthropic key (swapped in); BETA has none (credential stripped). - // Any other key is unknown and routed by the shape check. + // has a stored Anthropic key (swapped in); BETA is a recognised client whose + // stored key is NULL, which is a server-side misconfiguration (500), not a + // "use the global key" shortcut. Any other key is unknown and rejected. const ALPHA = "lightning-cred-alpha"; const BETA = "lightning-cred-beta"; const clients: Record = { @@ -231,8 +236,8 @@ describe("Instance authentication", () => { post(path, { ...data, ...(apiKey ? { api_key: apiKey } : {}) }); // One mode now: the auth hook is always active. Point the shared instance's injected - // lookup at the known-client map so rows 1/2 resolve; unknown keys fall to the - // shape check regardless. + // lookup at the known-client map so rows 1/2 resolve; unknown keys are rejected + // (DB reachable -> the lookup confirms absent -> 401). beforeEach(() => { knownClients = clients; }); @@ -251,28 +256,40 @@ describe("Instance authentication", () => { expect(body.api_key).not.toBe(ALPHA); }); - // Row 2 - it("strips the credential when the client has no stored key", async () => { - const res = await app.handle(postKey("services/echo", { x: 2 }, BETA)); - expect(res.status).toBe(200); - const body = await res.json(); - // No stored key → api_key dropped entirely (Apollo uses its global key). - expect(body.api_key).toBeUndefined(); + // Row 2 — a recognised client whose stored key is NULL is a server-side + // misconfiguration: the caller authenticated with a known credential, so we + // reject with 500 and report it rather than quietly billing the global key. + it("rejects a known client with no stored key as misconfigured (500) and reports it", async () => { + const capture = spyOn(sentry, "captureException"); + try { + const res = await app.handle(postKey("services/echo", { x: 2 }, BETA)); + expect(res.status).toBe(500); + const body = await res.json(); + expect(body.code).toBe(500); + expect(body.type).toBe("CLIENT_MISCONFIGURED"); + // The misconfiguration is reported to Sentry with a non-secret reason/name. + const extras = capturedExtras(capture, "client-misconfigured-no-key"); + expect(extras).toBeDefined(); + expect(extras?.client).toBe("beta"); + } finally { + capture.mockRestore(); + } }); - // Row 3 — unknown but sk-ant-shaped: bring-your-own key, forwarded unchanged. - it("forwards an unknown sk-ant-shaped key unchanged (bring-your-own)", async () => { + // Row 3 — an unknown key is rejected regardless of shape. The DB is reachable, so + // the lookup confirms there is no such client: 401. An Anthropic-key-shaped key is + // treated the same as any other unknown key. + it("rejects an unknown Anthropic-key-shaped key with 401", async () => { const res = await app.handle( - postKey("services/echo", { x: 1 }, "sk-ant-byo") + postKey("services/echo", { x: 1 }, "sk-ant-unknown") ); - expect(res.status).toBe(200); + expect(res.status).toBe(401); const body = await res.json(); - expect(body.api_key).toBe("sk-ant-byo"); + expect(body.code).toBe(401); + expect(body.type).toBe("UNAUTHORIZED"); }); - // Row 3b — unknown and NOT sk-ant-shaped: a likely Lightning credential; reject - // rather than forward it to the LLM. - it("rejects an unknown non-sk-ant- key with 401 (never forwarded)", async () => { + it("rejects an unknown non-Anthropic-shaped key with 401", async () => { const res = await app.handle( postKey("services/echo", { x: 1 }, "lightning-cred-unknown") ); @@ -282,24 +299,43 @@ describe("Instance authentication", () => { expect(body.type).toBe("UNAUTHORIZED"); }); - // Row 4 — no api_key at all: forwarded without the field (global key fallback). - it("forwards a request with no api_key (no 401), field absent", async () => { + // Row 4 — no api_key at all, and a global key is configured: served by the global + // key with the field absent. (This app's auth has hasGlobalKey: true.) + it("serves a keyless request with the global key (no 401), field absent", async () => { const res = await app.handle(post("services/echo", { x: 1 })); expect(res.status).toBe(200); const body = await res.json(); expect(body.api_key).toBeUndefined(); }); - // Row 1 and row 3 coexist: known-client swap and bring-your-own forward in one run. - it("serves the known-client swap and the bring-your-own forward side by side", async () => { - const swapped = await ( - await app.handle(postKey("services/echo", { x: 1 }, ALPHA)) - ).json(); - const forwarded = await ( - await app.handle(postKey("services/echo", { x: 1 }, "sk-ant-byo")) - ).json(); - expect(swapped.api_key).toBe("sk-ant-stored-alpha"); - expect(forwarded.api_key).toBe("sk-ant-byo"); + // A keyless request when NO global key is configured has nothing to serve it, so + // it is rejected with 401. Driven directly against an InstanceAuth built with + // hasGlobalKey: false (the shared app forces it true). + it("rejects a keyless request with 401 when no global key is configured", async () => { + const auth = new InstanceAuth({ + lookup: () => null, + hasGlobalKey: false, + }); + const ctx = { + request: { headers: { get: () => null } }, + body: { x: 1 }, + set: { status: 200 }, + } as any; + await auth.authenticate(ctx); + expect(ctx.set.status).toBe(401); + expect(ctx.lightningClient).toBeUndefined(); + }); + + // Row 1 and row 3 coexist: a known client swaps to its stored key while an unknown + // key is rejected, in one run. + it("swaps a known client's key and rejects an unknown key side by side", async () => { + const swap = await app.handle(postKey("services/echo", { x: 1 }, ALPHA)); + const unknown = await app.handle( + postKey("services/echo", { x: 1 }, "sk-ant-unknown") + ); + expect(swap.status).toBe(200); + expect((await swap.json()).api_key).toBe("sk-ant-stored-alpha"); + expect(unknown.status).toBe(401); }); it("leaves health and root endpoints open", async () => { @@ -361,7 +397,7 @@ describe("Instance authentication", () => { } }); - it("captures the internal-token mismatch and still rejects with 401", async () => { + it("rejects an internal-token mismatch with 401 without reporting to Sentry", async () => { const warn = spyOn(console, "warn").mockImplementation(() => {}); const capture = spyOn(sentry, "captureException"); try { @@ -374,10 +410,11 @@ describe("Instance authentication", () => { }, }); const res = await app.handle(req); - // Behaviour unchanged: a forged internal header still rejects. + // A forged or stale internal header still rejects... expect(res.status).toBe(401); - // ...and the mismatch is no longer silent. - expect(capturedExtras(capture, "internal-token-mismatch")).toBeDefined(); + // ...but a mismatch is a config/forgery signal, not a code bug, so it + // stays out of Sentry (the warn covers it). Don't make Sentry noisy. + expect(capturedExtras(capture, "internal-token-mismatch")).toBeUndefined(); } finally { capture.mockRestore(); warn.mockRestore(); @@ -412,7 +449,7 @@ describe("Instance authentication", () => { it("does not emit the mismatch warn on the normal external path (no internal header)", async () => { const warn = spyOn(console, "warn").mockImplementation(() => {}); try { - // An unknown non-sk-ant- key takes the explicit-fail path (no internal header). + // An unknown key takes the explicit-fail path (no internal header). const res = await app.handle( postKey("services/echo", { x: 1 }, "lightning-cred-unknown") ); @@ -433,20 +470,20 @@ describe("Instance authentication", () => { // the payload rather than being stripped to the global key. const req = new Request(`${baseUrl}/services/echo`, { method: "POST", - body: JSON.stringify({ x: 9, api_key: "sk-ant-forwarded" }), + body: JSON.stringify({ x: 9, api_key: "sk-ant-internal-hop" }), headers: { "Content-Type": "application/json", ...internalAuthHeader() }, }); const res = await app.handle(req); expect(res.status).toBe(200); const body = await res.json(); - expect(body.api_key).toBe("sk-ant-forwarded"); + expect(body.api_key).toBe("sk-ant-internal-hop"); }); - // WS upgrade auth decision via app.handle(): under the forward model a bare - // upgrade is no longer rejected. app.handle() never performs a real socket - // upgrade (Bun upgrades only through the listening server), so this proves the - // auth hook forwarded rather than 401'd, not that the upgrade itself succeeds. The - // 101/end-to-end no-regression proof is the live-socket echo test above. + // WS upgrade auth decision via app.handle(): a bare upgrade carries no api_key, so + // it takes the global-key path and is not rejected. app.handle() never performs a + // real socket upgrade (Bun upgrades only through the listening server), so this + // proves the auth hook let it through rather than 401'd, not that the upgrade itself + // succeeds. The 101/end-to-end no-regression proof is the live-socket echo test above. it("passes an unauthenticated WebSocket upgrade through the auth hook", async () => { const req = new Request(`${baseUrl}/services/echo`, { method: "GET", @@ -505,10 +542,12 @@ describe("Instance authentication", () => { expect(body.ws).toBe(1); }); - // AC3 — an unrecognised sk-ant- token on the upgrade connects and forwards as-is. - it("forwards an unknown sk-ant- token on a WS upgrade unchanged", async () => { - const body = await wsRoundTrip(`?api_key=sk-ant-ws-byo`); - expect(body.api_key).toBe("sk-ant-ws-byo"); + // AC3 — an unrecognised token on the upgrade (any shape) is rejected: the auth hook + // returns a 401 from beforeHandle, so the upgrade is refused and the socket never + // opens. The round-trip rejects (error/close) rather than completing. Known clients + // are unaffected (see the swap test above); an unknown key is rejected whatever its shape. + it("rejects an unknown token on a WS upgrade", async () => { + await expect(wsRoundTrip(`?api_key=sk-ant-ws-unknown`)).rejects.toBeDefined(); }); // Internal exemption holds on WS: the upgrade GET carries the internal header, so @@ -547,32 +586,6 @@ describe("Instance authentication", () => { }); }); -describe("Instance auth — DB-down forward path", () => { - // No known clients: every caller is "unknown". The shape rule still applies — an - // sk-ant- key forwards, a non-sk-ant- key fails explicitly. - beforeEach(() => { - knownClients = null; - }); - - // Row 7 - it("forwards an unknown sk-ant- key when the DB is down", async () => { - const res = await app.handle( - post("services/echo", { x: 1, api_key: "sk-ant-byo" }) - ); - expect(res.status).toBe(200); - const body = await res.json(); - expect(body.api_key).toBe("sk-ant-byo"); - }); - - // Row 8 - it("rejects an unknown non-sk-ant- key when the DB is down (never forwarded)", async () => { - const res = await app.handle( - post("services/echo", { x: 1, api_key: "lightning-cred-unknown" }) - ); - expect(res.status).toBe(401); - }); -}); - describe("Instance auth — lookup never came up (dbReady false)", () => { // The shared app injects a `lookup`, so it never reaches lookupClient's dbReady // guard. Construct a bare InstanceAuth (no lookup/dbLookup => dbReady stays false) @@ -586,21 +599,22 @@ describe("Instance auth — lookup never came up (dbReady false)", () => { set: { status: 200 }, } as any); - it("forwards an unknown sk-ant- key via the shape rule (no DB)", async () => { + // When the lookup never came up we cannot verify any caller, whatever the key's + // shape — that is our outage, not a bad credential: 503, never a misleading 401. + it("returns 503 for an unknown sk-ant- key when the lookup never came up (no DB)", async () => { const auth = new InstanceAuth(); - const ctx = fakeCtx("sk-ant-byo"); + const ctx = fakeCtx("sk-ant-unknown"); await auth.authenticate(ctx); - expect(ctx.forwardApiKey).toBe("sk-ant-byo"); - expect(ctx.set.status).toBe(200); + expect(ctx.set.status).toBe(503); + expect(ctx.lightningClient).toBeUndefined(); }); it("returns 503 for an unknown non-sk-ant- key when the lookup never came up (no DB)", async () => { const auth = new InstanceAuth(); const ctx = fakeCtx("lightning-cred-unknown"); await auth.authenticate(ctx); - // dbReady is false, so we cannot verify the caller — that is our outage, not a bad credential: 503, never a misleading 401, never a forward. expect(ctx.set.status).toBe(503); - expect(ctx.forwardApiKey).toBeUndefined(); + expect(ctx.lightningClient).toBeUndefined(); }); }); @@ -682,7 +696,7 @@ describe("Instance auth cache refresh", () => { const first = fakeCtx(UNKNOWN); await auth.authenticate(first); - // sk-ant-shaped? no -> unknown non-anthropic key is rejected. + // The lookup completed and confirmed no such client -> 401 (unknown is unknown). expect(first.set.status).toBe(401); const second = fakeCtx(UNKNOWN); @@ -822,26 +836,19 @@ describe("Instance auth cache refresh", () => { // awaited cold lookup fails, so the request is rejected rather than served stale. fail = true; advanceClock(OVER_CEILING); - const warn = spyOn(console, "warn").mockImplementation(() => {}); const error = spyOn(console, "error").mockImplementation(() => {}); try { const ctx = fakeCtx(ALPHA); await auth.authenticate(ctx); expect(ctx.lightningClient).toBeUndefined(); - // ALPHA is not sk-ant-shaped and the evicted-then-failed lookup could not verify it, so we 503 (our outage) rather than a misleading 401. + // The evicted-then-failed lookup could not verify the caller, so we 503 (our outage) rather than a misleading 401. expect(ctx.set.status).toBe(503); - expect( - warn.mock.calls.some(([m]) => - String(m).includes("max-staleness ceiling") - ) - ).toBe(true); expect( error.mock.calls.some(([m]) => String(m).includes("client lookup failed") ) ).toBe(true); } finally { - warn.mockRestore(); error.mockRestore(); } }); @@ -906,7 +913,7 @@ describe("Instance auth cache refresh", () => { } }); - it("returns 503 (not 401) when a cold DB read fails for a non-sk-ant- caller, capturing the outage", async () => { + it("returns 503 (not 401) when a cold DB read fails, capturing the outage", async () => { const auth = new InstanceAuth({ dbLookup: async () => { throw new Error("db down"); @@ -915,11 +922,10 @@ describe("Instance auth cache refresh", () => { const error = spyOn(console, "error").mockImplementation(() => {}); const capture = spyOn(sentry, "captureException"); try { - const ctx = fakeCtx(ALPHA); // non-sk-ant-shaped credential + const ctx = fakeCtx(ALPHA); await auth.authenticate(ctx); expect(ctx.set.status).toBe(503); expect(ctx.lightningClient).toBeUndefined(); - expect(ctx.forwardApiKey).toBeUndefined(); const extras = capturedExtras(capture, "client-store-unavailable-503"); expect(extras).toBeDefined(); @@ -932,7 +938,9 @@ describe("Instance auth cache refresh", () => { } }); - it("still forwards an sk-ant- caller when a cold DB read fails (BYO key needs no lookup)", async () => { + // A failed cold read can't verify an sk-ant-shaped key, so it 503s exactly like any + // other unknown key. + it("503s an sk-ant- caller too when a cold DB read fails", async () => { const auth = new InstanceAuth({ dbLookup: async () => { throw new Error("db down"); @@ -940,10 +948,10 @@ describe("Instance auth cache refresh", () => { }); const error = spyOn(console, "error").mockImplementation(() => {}); try { - const ctx = fakeCtx("sk-ant-byo"); + const ctx = fakeCtx("sk-ant-unknown"); await auth.authenticate(ctx); - expect(ctx.set.status).toBe(200); - expect(ctx.forwardApiKey).toBe("sk-ant-byo"); + expect(ctx.set.status).toBe(503); + expect(ctx.lightningClient).toBeUndefined(); } finally { error.mockRestore(); }