docs(ongoing): design doc for production-ready OpenID & SSO#3907
docs(ongoing): design doc for production-ready OpenID & SSO#3907sea-snake wants to merge 6 commits into
Conversation
Adds docs/ongoing/openid-sso-prod-readiness.md, a design doc for removing the SSO allowlist and putting the OpenID stack on a production footing. Covers five concrete problems and their solutions: 1. Replace Vec<Box<dyn OpenIdProvider>> + parallel DISCOVERY_TASKS with a data-only config registry and free-function verify; per-provider mutable state moves into a bounded cache. 2. Replace always-on discovery/JWKS timers with an on-demand LRU cache plus DoH-style dedup primitive. 1000 entries per cache, 1h TTL. 3. Set max_response_bytes on every outcall, validate response shape in transforms before parsing, ceiling per-call cycles. 4. Switch the OAuth callback from response_mode=fragment to form_post. New canister POST handler returns certified HTML that delivers the payload via BroadcastChannel (popup) or sessionStorage (same-tab). Apple Sign In's name/email handling is the operational driver. 5. Drop add_discoverable_oidc_config and the persistent OIDC_CONFIGS list. Add discover_sso(domain), an anonymous update call backed by the cache. Threads discovery_domain through the four JWT-consuming API methods. The salt + nonce + caller() binding that secures JWT redemption against compromised transports is preserved end-to-end. Rollout layout (§9) has form_post, outcall safety, and the cache stack landing in parallel. cc @aterga 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds a single design document (docs/ongoing/openid-sso-prod-readiness.md) outlining how to make the OpenID/SSO stack production-ready and lift the canary allowlist. No code changes.
Changes:
- New RFC-style design doc covering five problems (stateful provider registry, timer-driven outcalls, HTTP outcall safety, fragment callback, allowlist removal) and proposed solutions.
- Includes threat model, API/persistent-state deltas, migration approach, and a rollout/PR-dependency plan.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
session_sk lives in the browser's memory, not a mailbox. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CONFIG_REGISTRY now only holds direct providers (Google/MS/Apple) from the init arg. SSO domains live nowhere on the canister beyond the transient cache — the previous enum variant for them contradicted the no-persistent-SSO-state principle in §8. Cache primitive simplified to one HashMap per cache with an enum-typed entry (Done | Pending) so a single key has at most one entry. Renamed §5.4 to "Cache sizing and TTL" and inlined the TTL discussion that had gotten stranded. Stripped framing-noise references to how the dedup primitive relates to other in-flight work; the doc describes what is, not how lines move.
DiscoverySsoResult now carries the OIDC fields the FE actually needs to build the authorize URL (authorization_endpoint, scopes_supported) on top of the previously-spec'd client_id/issuer/name. With those in the response, the FE's ssoDiscovery.ts module is no longer needed — the canister already does the two-hop fetch and cache, so we just expose the result earlier in the flow. §8.3 grew the response shape; §8.7 calls out that the ~560-line FE discovery module is deleted; §9.3 mentions it in the "what changes" summary. Trade-off: discover_sso is an update call (~2-4s warm, ~5-8s cold), whereas direct FE fetches today are ~200-500ms. Accepted because discovery runs once per SSO per session and the canister-side cache already had to do the same outcalls for JWT verification anyway, so nothing is duplicated work-wise — we just stop having two separate caches (one in FE, one on canister) chasing the same data.
|
|
||
| | Change | Effect on trust model | | ||
| | ------ | --------------------- | | ||
| | Stateless verifier + cache (§4, §5) | None. The verifier still runs `SHA256(salt | caller()) == jwt.nonce` against `caller()` from a signed ingress message. The cache holds JWKS + discovery results, not JWTs or salts. | |
|
|
||
| - Compromise of the IdP's signing key. Every credential issued by that IdP is at risk; that is the unavoidable trust assumption of any OIDC integration. | ||
| - Compromise of the user's device / browser that holds the session_sk. Standard endpoint compromise — out of scope. | ||
| - Phishing the user into completing an OAuth flow on a malicious site impersonating Internet Identity. This is the standard "website phishing" problem; we don't try to solve it in this layer. |
There was a problem hiding this comment.
Isn't this mitigated by the fact that the SSO provider needs to add a well-known configuration fiel specifically for II?
There was a problem hiding this comment.
Yes this is indeed mitigated by that, seems like this was a bad hallucination.
| salt: &[u8; 32], | ||
| discovery_domain: Option<&str>, | ||
| cache: &mut OidcCache, | ||
| ) -> Result<OpenIdCredential, OpenIDJWTVerificationError> { |
There was a problem hiding this comment.
The planned caching design isn't clear from these comments; e.g., how many caches do we have?
I recommend outlining the lifecycle for a particular SSO <> II integration (first time it's a cache miss, second time it's a hit, third time it's a miss due to cache eviction, etc.)
There was a problem hiding this comment.
Yes, probably a clear diagram doesn't hurt here either.
| enum CacheEntry<V> { | ||
| /// Fetched and currently usable. | ||
| Done { value: V, expires_at_secs: u64 }, | ||
| /// First caller is fetching; concurrent arrivals register their `Waker` |
There was a problem hiding this comment.
Can we omit this new terminology?
At the very least, please clearly define all the terms, e.g., "waker"
|
|
||
| Entry TTL: 1 hour for both. Matches the upstream OIDC discovery doc cache headers we've observed, and is well under the JWKS rotation window of every mainstream IdP (Google rotates ~every 4–6 h; Microsoft daily; Apple monthly). On TTL lapse, the next verifier call observes the entry as expired and re-fetches inline — same code path as a cold start. | ||
|
|
||
| Numbers tunable; they live in two consts in the cache module: |
There was a problem hiding this comment.
Indeed section 5.4 has unintended repeated code block and paragraph.
| pub const JWKS_CACHE_CAPACITY: usize = 1000; | ||
| ``` | ||
|
|
||
| Entry TTL: 1 hour for both. Matches the upstream OIDC discovery doc cache headers we've observed, and matches the JWKS rotation window of every mainstream IdP (Google rotates ~every 4–6 h; Microsoft daily; Apple monthly). On TTL lapse, the next verifier call observes a cache miss and re-fetches — same code path as a cold start. |
There was a problem hiding this comment.
Refer to the section where it's explained how the TTL is implemented.
|
|
||
| Proposed worst case at steady state with a hot cache: ~0 cycles/h (no scheduled outcalls). Per-sign-in cost: ~30 Gcycles on cache miss (cold), ~0 on cache hit (warm). Multiplied by sign-in volume rather than provider count. | ||
|
|
||
| For DoS analysis (anonymous `discover_sso` flood): N attacker calls × ~60 Gcycles per cold path (hop-1 + hop-2 + JWKS fetch) = ~60N Gcycles. Bounded by per-call cycle accounting (§6) and the IC's existing reject-on-low-cycles behaviour. |
There was a problem hiding this comment.
reject-on-low-cycles behaviour
I don't think this is applicable to II, let's make it clearer what DoS risks really are on a system subnet.
Hint: We could rely on the possibility to rate limit via boundary nodes, if necessary.
|
|
||
| ### 6.3 What's deliberately not done | ||
|
|
||
| - **TLS certificate pinning.** The IC's HTTPS outcall layer terminates TLS at the boundary node; we trust the gateway's certificate chain. Pinning per-SSO certs would require an out-of-band trust anchor distribution, which is the kind of governance question (§2) we're explicitly punting on. |
There was a problem hiding this comment.
we trust the gateway's certificate chain
double check this
|
|
||
| --- | ||
|
|
||
| ## 7. Problem 4 — Fragment callback |
There was a problem hiding this comment.
This section wasn't easy enough for me to understand.
Could you ELI5 what the problem and the proposed solutions are, please?
| Existing credentials in stable storage have no `sso_domain` stamp. There are two upgrade-time approaches: | ||
|
|
||
| - **Backfill from `OIDC_CONFIGS`.** The upgrade hook reads the about-to-be-deleted `persistent_state.oidc_configs`. For each registered domain, it runs the discovery cache lookup (cold, so triggers outcalls — this could be deferred until the cache is warm post-upgrade). Then iterates credentials, looking for any whose `iss + aud` matches a discovered provider, and stamps `sso_domain` on each. One-shot, runs in `post_upgrade`. Risk: discovery outcalls from an upgrade hook are unusual; if the upgrade fails the canister rolls back, which is the safe failure mode. | ||
| - **Lazy backfill.** Existing credentials keep `sso_domain = None`. The FE supplies `discovery_domain` on the next use of each credential. The canister, on successful JWT verification with `Some(domain)`, stamps `sso_domain = domain` on the matching stored credential. Slower convergence; no upgrade-time outcalls; works for credentials that are actually used. |
There was a problem hiding this comment.
This is the right approach.
There was a problem hiding this comment.
Note that outcalls aren't possible on upgrade anyway, we'd need a batched migration (if we didn't have this approach)
There was a problem hiding this comment.
We wouldn't need outcalls since we could migrate with the data from the sso configuration (if not removed yet at this stage).
| - The 1-click SSO path (`authorize/+page.svelte:145-167`) replaces its `actor.add_discoverable_oidc_config({...})` + FE-side discovery chain with a single `actor.discover_sso(domain)` call. Result carries everything needed to build the authorize URL. | ||
| - **`src/frontend/src/lib/utils/ssoDiscovery.ts` is deleted.** All ~560 lines of two-hop fetch, per-hop cache, exponential-backoff retry, abort plumbing, hostname-match validation, HTTPS enforcement, and `DomainNotConfiguredError` classification go away. The canister now does the fetching, caching, and validation; the FE just calls `discover_sso` and reads the result. Removes the only FE-side direct fetch to arbitrary SSO domains (and with it, the CORS, IPv6, and timeout edge cases that module currently handles). | ||
|
|
||
| The FE-side simplification has its own cost: `discover_sso` is an update call (~2-4 s warm, ~5-8 s cold) whereas the FE's direct fetches today are ~200-500 ms when the SSO host is fast. For the input-debounce UX in `SignInWithSso.svelte` we accept this tradeoff — the user only ever runs discovery once per SSO before signing in, and the canister-cache means subsequent runs (same domain or other users on the same domain) hit the warm path. |
There was a problem hiding this comment.
We could improve the UX significantly by also having a query, e.g. is_sso_discovered, that would provide data in the warmed-up cache case; only if that returns null would the FE need to call the update function.
| or in the same PR. | ||
|
|
||
| Day 0+3N: | ||
| PR-C4 (lazy backfill) — opens after C3 lands. |
There was a problem hiding this comment.
Intuitively, this shouldn't be the last step.
Is is safe to release the feature without this PR? If not, then it should take place before, so that we always have releasable code on the main branch.
There was a problem hiding this comment.
I think data migration should be consider as early as possible indeed, instead of being an after thought.
|
|
||
| - Heap memory from registered SSOs: O(cache_size) ≤ ~20 MB. | ||
| - Background outcall fanout: zero. Discovery and JWKS fetching are sign-in-driven. | ||
| - Apple Sign In returns name + email. Okta / Auth0 / Keycloak / Authelia / generic OIDC become viable without per-IdP workarounds for fragment-mode quirks. |
There was a problem hiding this comment.
Is this true? I certainly hope so!
NCR
There was a problem hiding this comment.
I've manually confirmed a while back that Apple returns names and emails in this case.
|
|
||
| ## 10. Open questions | ||
|
|
||
| - **Cache size revisit.** 1000 entries per cache is a guess based on "1000 simultaneously-hot SSOs is more than we expect" and "10 MB total heap is comfortably under budget." Once we have telemetry on real-world miss rates we should revisit. If miss rate is non-trivial we either raise the cap or shorten the TTL — both have cycle-budget implications. |
There was a problem hiding this comment.
I would start more conservatively at 100
There was a problem hiding this comment.
Also, please add a section to describe how we add observability:
- metrics to record the number of cached SSO configs
- histogram metrics for how many credentials we have for each SSO domain (bonus points)
- plausible funnels (if they need any changes)
| - **TTL on `Pending` entries.** `PENDING_STALE_AFTER_SECS = 120 s` is comfortable for IC HTTP outcalls under normal conditions; if we see verifier latency spikes above that floor we'd revisit. | ||
| - **Lazy backfill convergence.** Credentials that are never used again never get `sso_domain` stamped. The metadata layer (`sso_fields_for` callers) gracefully handles `None` today (renders as the bare domain or falls through to the direct-provider config), so this is not a functional problem — but a future cleanup pass could backfill in bulk if telemetry shows long-tailed credentials hampering attribute-sharing UX. | ||
| - **`response_mode=form_post` and FedCM coexistence.** Today's `requestJWT` falls back to popup when FedCM isn't available. Form_post applies only to the popup path. If a future FedCM upgrade adds a redirect fallback we'd revisit; for now the two paths are independent. | ||
| - **Direct provider JWKS cache behaviour.** Hardcoded providers' JWKS today is fetched eagerly at canister bring-up via `Provider::create`. Under §5 it becomes lazy on first sign-in attempt. First-after-deploy sign-in pays one cache-miss outcall. Acceptable, but worth flagging in the rollout notes. |
There was a problem hiding this comment.
We should ideally provide visual feedback to the user to indicate that while signing them in, II is fetching a fresh SSO domain config (better progress indication ==> less frustration)
|
|
||
| - **Cache size revisit.** 1000 entries per cache is a guess based on "1000 simultaneously-hot SSOs is more than we expect" and "10 MB total heap is comfortably under budget." Once we have telemetry on real-world miss rates we should revisit. If miss rate is non-trivial we either raise the cap or shorten the TTL — both have cycle-budget implications. | ||
| - **TTL on `Pending` entries.** `PENDING_STALE_AFTER_SECS = 120 s` is comfortable for IC HTTP outcalls under normal conditions; if we see verifier latency spikes above that floor we'd revisit. | ||
| - **Lazy backfill convergence.** Credentials that are never used again never get `sso_domain` stamped. The metadata layer (`sso_fields_for` callers) gracefully handles `None` today (renders as the bare domain or falls through to the direct-provider config), so this is not a functional problem — but a future cleanup pass could backfill in bulk if telemetry shows long-tailed credentials hampering attribute-sharing UX. |
There was a problem hiding this comment.
which exact telemetry would provide this data?
Addresses the 28 inline review comments on #3907: - §3.2/glossary: fix MD pipe rendering inside table cells (L99, L25). - §3.3: reframe attack #3 (no boundary node in outcall path; replication is the defense). - §3.4: phishing bullet acknowledges the II-specific .well-known file as a soft mitigation against impersonating a registered SSO (L116). - §4.3: clarify that DirectProviderConfig holds boot-time direct providers (Google/Microsoft/Apple), NOT SSO — forestalls the rename suggestion (L168/L176). - §5.2: drop bare 'Waker' terminology, define it inline against the existing doh/cache.rs precedent (L260). Restructure §5.2 to lead with the cache count (two: discovery + JWKS). - §5.2.1 NEW: cache lifecycle sequence diagram (mermaid) showing cold miss → warm hit → expired refetch → eviction for one SSO (L190). - §5.4: dedupe repeated const block + paragraph, cross-ref to CacheEntry::Done.expires_at_secs (L313, L320). Cap → 100 per cache (L749). - §5.6: drop reject-on-low-cycles framing for the system subnet; ground DoS analysis in cycle ceilings, LRU cap, and boundary-node throttling (L336). - §6.3: rewrite the TLS-trust bullet — HTTPS outcalls run per-replica against the public CA system, no boundary node in the outcall path (L415). - §7.0 NEW: TL;DR for form_post including a mermaid sequence diagram of the full popup flow (L421). - §7.2: footnote that Apple form_post name/email behaviour was manually confirmed (L733). - §8.3: add discover_sso_query — query-mode shortcut that returns warm data without an update call (L679). - §8.5: storage OpenIdCredential gains sso_domain + sso_name fields. - §8.6 REWRITTEN: replace lazy backfill with sso_credential_migration upgrade arg. Migration is now the root of the rollout, deterministic, outcall-free, no cache dependency. Includes mermaid sequence diagram of the pre-flight query → NNS proposal → post_upgrade flow (L669, L724). - §8.7: progress-UI spinner on cold discovery; tied to plausible events in §11 (L753). - §9 REWRITTEN: rollout dependency layout as mermaid graph TD with PR-M as the root. C2 has a HARD dep on M (DISCOVERY_TASKS removal requires every credential pre-stamped); A/B/C1 have soft deps on M to keep main releasable. Suggested PR sequence updated. - §10: rewrite Cache size revisit and Pending-TTL questions to reference §11 metrics by name. Drop lazy backfill convergence question. - §11 NEW: Observability — cache gauges/counters, credential distribution histograms, plausible funnels, one-shot migration verification metrics (L749 reply, L751).
…he actual mitigation, not the .well-known file
|
✅ No security or compliance issues detected. Reviewed everything up to 0940827. Security Overview
Detected Code Changes| Change Type | Relevant files ... (code changes summary truncated to fit VCS comment limits.) |

📄 Read the design doc directly:
docs/ongoing/openid-sso-prod-readiness.mdon the forkSummary
Adds
docs/ongoing/openid-sso-prod-readiness.md— a design doc for putting the OpenID/SSO stack on a production footing so the canary allowlist can be lifted.The doc is problem-driven: it identifies the five concrete things the allowlist masks today, and proposes a solution for each:
Vec<Box<dyn OpenIdProvider>>+ parallelDISCOVERY_TASKSwith a data-onlyCONFIG_REGISTRYand a free-functionverify_jwt. Trait dispatch goes away; per-provider differences become enum-tagged config fields.init_discovery_timersandschedule_fetch_certswith an on-demand LRU cache (1000 entries per cache, 1 h TTL). Concurrent-fetch dedup uses the samePending/Wakerprimitive as the DoH cache in feat(doh): DoH fallback for DKIM/DMARC TXT records on unsigned domains #3841, copy-pasted rather than lifted.max_response_bytescaps (8 KB hop-1, 64 KB hop-2, 64 KB JWKS), transform validation before parse, tightened per-call cycle ceilings.response_mode=fragment→response_mode=form_post. New canister POST handler (update mode for response certification) parses the form body, returns certified HTML that delivers{id_token, state}via BroadcastChannel (popup) or sessionStorage (same-tab). Apple Sign In returningname/emailonly under form_post is the operational driver. Discussion in §7.4 of why the canister can't consume the JWT itself.add_discoverable_oidc_configand the persistedOIDC_CONFIGSlist. Adddiscover_sso(domain) -> DiscoverySsoResult, anonymous, cache-backed. Threaddiscovery_domain: opt textthrough the four JWT-consuming API methods. Stampsso_domainonOpenIdCredentialat creation; lazy backfill for existing credentials.The salt + nonce +
caller()binding that secures the JWT against compromised transports stays end-to-end identical. §3 walks through why several otherwise-attractive simplifications (consume the JWT in the POST, 302-translate to fragment, etc.) were considered and rejected.§9 lays out the rollout: tracks A (form_post), B (outcall safety), C1→C2→C3→C4 (cache stack) can run in parallel; only within Track C is there a stack.
Reading order
Per request, this PR is the design doc only — no code changes.
cc @aterga
🤖 Generated with Claude Code