docs(ongoing): design doc for production-ready OpenID & SSO by sea-snake · Pull Request #3907 · dfinity/internet-identity

sea-snake · 2026-05-18T13:35:16Z

📄 Read the design doc directly: docs/ongoing/openid-sso-prod-readiness.md on the fork

Summary

Adds docs/ongoing/openid-sso-prod-readiness.md — a design doc for putting the OpenID/SSO stack on a production footing so the canary allowlist can be lifted.

The doc is problem-driven: it identifies the five concrete things the allowlist masks today, and proposes a solution for each:

Stateful provider registry (§4) — replace Vec<Box<dyn OpenIdProvider>> + parallel DISCOVERY_TASKS with a data-only CONFIG_REGISTRY and a free-function verify_jwt. Trait dispatch goes away; per-provider differences become enum-tagged config fields.
Timer-driven outcalls (§5) — replace init_discovery_timers and schedule_fetch_certs with an on-demand LRU cache (1000 entries per cache, 1 h TTL). Concurrent-fetch dedup uses the same Pending/Waker primitive as the DoH cache in feat(doh): DoH fallback for DKIM/DMARC TXT records on unsigned domains #3841, copy-pasted rather than lifted.
HTTP outcall safety (§6) — max_response_bytes caps (8 KB hop-1, 64 KB hop-2, 64 KB JWKS), transform validation before parse, tightened per-call cycle ceilings.
Fragment callback (§7) — switch response_mode=fragment → response_mode=form_post. New canister POST handler (update mode for response certification) parses the form body, returns certified HTML that delivers {id_token, state} via BroadcastChannel (popup) or sessionStorage (same-tab). Apple Sign In returning name/email only under form_post is the operational driver. Discussion in §7.4 of why the canister can't consume the JWT itself.
Allowlist removal (§8) — drop add_discoverable_oidc_config and the persisted OIDC_CONFIGS list. Add discover_sso(domain) -> DiscoverySsoResult, anonymous, cache-backed. Thread discovery_domain: opt text through the four JWT-consuming API methods. Stamp sso_domain on OpenIdCredential at creation; lazy backfill for existing credentials.

The salt + nonce + caller() binding that secures the JWT against compromised transports stays end-to-end identical. §3 walks through why several otherwise-attractive simplifications (consume the JWT in the POST, 302-translate to fragment, etc.) were considered and rejected.

§9 lays out the rollout: tracks A (form_post), B (outcall safety), C1→C2→C3→C4 (cache stack) can run in parallel; only within Track C is there a stack.

Reading order

§1: the five problems framed.
§2-§3: scope + threat model.
§4-§8: one section per problem, each with what's broken today + what we're changing.
§9: dependency layout and suggested PR sequence.
§10: open questions.

Per request, this PR is the design doc only — no code changes.

cc @aterga

🤖 Generated with Claude Code

@aterga

Adds docs/ongoing/openid-sso-prod-readiness.md, a design doc for removing the SSO allowlist and putting the OpenID stack on a production footing. Covers five concrete problems and their solutions: 1. Replace Vec<Box<dyn OpenIdProvider>> + parallel DISCOVERY_TASKS with a data-only config registry and free-function verify; per-provider mutable state moves into a bounded cache. 2. Replace always-on discovery/JWKS timers with an on-demand LRU cache plus DoH-style dedup primitive. 1000 entries per cache, 1h TTL. 3. Set max_response_bytes on every outcall, validate response shape in transforms before parsing, ceiling per-call cycles. 4. Switch the OAuth callback from response_mode=fragment to form_post. New canister POST handler returns certified HTML that delivers the payload via BroadcastChannel (popup) or sessionStorage (same-tab). Apple Sign In's name/email handling is the operational driver. 5. Drop add_discoverable_oidc_config and the persistent OIDC_CONFIGS list. Add discover_sso(domain), an anonymous update call backed by the cache. Threads discovery_domain through the four JWT-consuming API methods. The salt + nonce + caller() binding that secures JWT redemption against compromised transports is preserved end-to-end. Rollout layout (§9) has form_post, outcall safety, and the cache stack landing in parallel. cc @aterga 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

Adds a single design document (docs/ongoing/openid-sso-prod-readiness.md) outlining how to make the OpenID/SSO stack production-ready and lift the canary allowlist. No code changes.

Changes:

New RFC-style design doc covering five problems (stateful provider registry, timer-driven outcalls, HTTP outcall safety, fragment callback, allowlist removal) and proposed solutions.
Includes threat model, API/persistent-state deltas, migration approach, and a rollout/PR-dependency plan.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

session_sk lives in the browser's memory, not a mailbox. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CONFIG_REGISTRY now only holds direct providers (Google/MS/Apple) from the init arg. SSO domains live nowhere on the canister beyond the transient cache — the previous enum variant for them contradicted the no-persistent-SSO-state principle in §8. Cache primitive simplified to one HashMap per cache with an enum-typed entry (Done | Pending) so a single key has at most one entry. Renamed §5.4 to "Cache sizing and TTL" and inlined the TTL discussion that had gotten stranded. Stripped framing-noise references to how the dedup primitive relates to other in-flight work; the doc describes what is, not how lines move.

DiscoverySsoResult now carries the OIDC fields the FE actually needs to build the authorize URL (authorization_endpoint, scopes_supported) on top of the previously-spec'd client_id/issuer/name. With those in the response, the FE's ssoDiscovery.ts module is no longer needed — the canister already does the two-hop fetch and cache, so we just expose the result earlier in the flow. §8.3 grew the response shape; §8.7 calls out that the ~560-line FE discovery module is deleted; §9.3 mentions it in the "what changes" summary. Trade-off: discover_sso is an update call (~2-4s warm, ~5-8s cold), whereas direct FE fetches today are ~200-500ms. Accepted because discovery runs once per SSO per session and the canister-side cache already had to do the same outcalls for JWT verification anyway, so nothing is duplicated work-wise — we just stop having two separate caches (one in FE, one on canister) chasing the same data.

aterga · 2026-05-20T11:55:24Z

+
+| Change | Effect on trust model |
+| ------ | --------------------- |
+| Stateless verifier + cache (§4, §5) | None. The verifier still runs `SHA256(salt | caller()) == jwt.nonce` against `caller()` from a signed ingress message. The cache holds JWKS + discovery results, not JWTs or salts. |


MD rendering is broken on this line:

aterga · 2026-05-20T12:09:12Z

+
+- Compromise of the IdP's signing key. Every credential issued by that IdP is at risk; that is the unavoidable trust assumption of any OIDC integration.
+- Compromise of the user's device / browser that holds the session_sk. Standard endpoint compromise — out of scope.
+- Phishing the user into completing an OAuth flow on a malicious site impersonating Internet Identity. This is the standard "website phishing" problem; we don't try to solve it in this layer.


Isn't this mitigated by the fact that the SSO provider needs to add a well-known configuration fiel specifically for II?

Yes this is indeed mitigated by that, seems like this was a bad hallucination.

aterga · 2026-05-20T12:23:33Z

+    salt: &[u8; 32],
+    discovery_domain: Option<&str>,
+    cache: &mut OidcCache,
+) -> Result<OpenIdCredential, OpenIDJWTVerificationError> {


The planned caching design isn't clear from these comments; e.g., how many caches do we have?

I recommend outlining the lifecycle for a particular SSO <> II integration (first time it's a cache miss, second time it's a hit, third time it's a miss due to cache eviction, etc.)

Yes, probably a clear diagram doesn't hurt here either.

aterga · 2026-05-20T14:52:27Z

+enum CacheEntry<V> {
+    /// Fetched and currently usable.
+    Done { value: V, expires_at_secs: u64 },
+    /// First caller is fetching; concurrent arrivals register their `Waker`


Can we omit this new terminology?

At the very least, please clearly define all the terms, e.g., "waker"

aterga · 2026-05-20T14:56:12Z

+
+Entry TTL: 1 hour for both. Matches the upstream OIDC discovery doc cache headers we've observed, and is well under the JWKS rotation window of every mainstream IdP (Google rotates ~every 4–6 h; Microsoft daily; Apple monthly). On TTL lapse, the next verifier call observes the entry as expired and re-fetches inline — same code path as a cold start.
+
+Numbers tunable; they live in two consts in the cache module:


nit. a bit repetitive?

Indeed section 5.4 has unintended repeated code block and paragraph.

aterga · 2026-05-20T14:58:14Z

+pub const JWKS_CACHE_CAPACITY:      usize = 1000;
+```
+
+Entry TTL: 1 hour for both. Matches the upstream OIDC discovery doc cache headers we've observed, and matches the JWKS rotation window of every mainstream IdP (Google rotates ~every 4–6 h; Microsoft daily; Apple monthly). On TTL lapse, the next verifier call observes a cache miss and re-fetches — same code path as a cold start.


Refer to the section where it's explained how the TTL is implemented.

aterga · 2026-05-20T15:03:13Z

+
+Proposed worst case at steady state with a hot cache: ~0 cycles/h (no scheduled outcalls). Per-sign-in cost: ~30 Gcycles on cache miss (cold), ~0 on cache hit (warm). Multiplied by sign-in volume rather than provider count.
+
+For DoS analysis (anonymous `discover_sso` flood): N attacker calls × ~60 Gcycles per cold path (hop-1 + hop-2 + JWKS fetch) = ~60N Gcycles. Bounded by per-call cycle accounting (§6) and the IC's existing reject-on-low-cycles behaviour.


reject-on-low-cycles behaviour

I don't think this is applicable to II, let's make it clearer what DoS risks really are on a system subnet.

Hint: We could rely on the possibility to rate limit via boundary nodes, if necessary.

aterga · 2026-05-20T15:04:09Z

+
+### 6.3 What's deliberately not done
+
+- **TLS certificate pinning.** The IC's HTTPS outcall layer terminates TLS at the boundary node; we trust the gateway's certificate chain. Pinning per-SSO certs would require an out-of-band trust anchor distribution, which is the kind of governance question (§2) we're explicitly punting on.


we trust the gateway's certificate chain

double check this

aterga · 2026-05-20T15:06:39Z

+
+---
+
+## 7. Problem 4 — Fragment callback


This section wasn't easy enough for me to understand.

Could you ELI5 what the problem and the proposed solutions are, please?

aterga · 2026-05-20T15:11:43Z

+Existing credentials in stable storage have no `sso_domain` stamp. There are two upgrade-time approaches:
+
+- **Backfill from `OIDC_CONFIGS`.** The upgrade hook reads the about-to-be-deleted `persistent_state.oidc_configs`. For each registered domain, it runs the discovery cache lookup (cold, so triggers outcalls — this could be deferred until the cache is warm post-upgrade). Then iterates credentials, looking for any whose `iss + aud` matches a discovered provider, and stamps `sso_domain` on each. One-shot, runs in `post_upgrade`. Risk: discovery outcalls from an upgrade hook are unusual; if the upgrade fails the canister rolls back, which is the safe failure mode.
+- **Lazy backfill.** Existing credentials keep `sso_domain = None`. The FE supplies `discovery_domain` on the next use of each credential. The canister, on successful JWT verification with `Some(domain)`, stamps `sso_domain = domain` on the matching stored credential. Slower convergence; no upgrade-time outcalls; works for credentials that are actually used.


This is the right approach.

Note that outcalls aren't possible on upgrade anyway, we'd need a batched migration (if we didn't have this approach)

We wouldn't need outcalls since we could migrate with the data from the sso configuration (if not removed yet at this stage).

aterga · 2026-05-20T15:16:24Z

+- The 1-click SSO path (`authorize/+page.svelte:145-167`) replaces its `actor.add_discoverable_oidc_config({...})` + FE-side discovery chain with a single `actor.discover_sso(domain)` call. Result carries everything needed to build the authorize URL.
+- **`src/frontend/src/lib/utils/ssoDiscovery.ts` is deleted.** All ~560 lines of two-hop fetch, per-hop cache, exponential-backoff retry, abort plumbing, hostname-match validation, HTTPS enforcement, and `DomainNotConfiguredError` classification go away. The canister now does the fetching, caching, and validation; the FE just calls `discover_sso` and reads the result. Removes the only FE-side direct fetch to arbitrary SSO domains (and with it, the CORS, IPv6, and timeout edge cases that module currently handles).
+
+The FE-side simplification has its own cost: `discover_sso` is an update call (~2-4 s warm, ~5-8 s cold) whereas the FE's direct fetches today are ~200-500 ms when the SSO host is fast. For the input-debounce UX in `SignInWithSso.svelte` we accept this tradeoff — the user only ever runs discovery once per SSO before signing in, and the canister-cache means subsequent runs (same domain or other users on the same domain) hit the warm path.


We could improve the UX significantly by also having a query, e.g. is_sso_discovered, that would provide data in the warmed-up cache case; only if that returns null would the FE need to call the update function.

aterga · 2026-05-20T15:19:47Z

+                                                or in the same PR.
+
+Day 0+3N:
+  PR-C4 (lazy backfill)                       — opens after C3 lands.


Intuitively, this shouldn't be the last step.

Is is safe to release the feature without this PR? If not, then it should take place before, so that we always have releasable code on the main branch.

I think data migration should be consider as early as possible indeed, instead of being an after thought.

aterga · 2026-05-20T15:22:44Z

+
+- Heap memory from registered SSOs: O(cache_size) ≤ ~20 MB.
+- Background outcall fanout: zero. Discovery and JWKS fetching are sign-in-driven.
+- Apple Sign In returns name + email. Okta / Auth0 / Keycloak / Authelia / generic OIDC become viable without per-IdP workarounds for fragment-mode quirks.


Is this true? I certainly hope so!

NCR

I've manually confirmed a while back that Apple returns names and emails in this case.

aterga · 2026-05-20T15:23:12Z

+
+## 10. Open questions
+
+- **Cache size revisit.** 1000 entries per cache is a guess based on "1000 simultaneously-hot SSOs is more than we expect" and "10 MB total heap is comfortably under budget." Once we have telemetry on real-world miss rates we should revisit. If miss rate is non-trivial we either raise the cap or shorten the TTL — both have cycle-budget implications.


I would start more conservatively at 100

Also, please add a section to describe how we add observability:

metrics to record the number of cached SSO configs

histogram metrics for how many credentials we have for each SSO domain (bonus points)

plausible funnels (if they need any changes)

aterga · 2026-05-20T15:27:33Z

+- **TTL on `Pending` entries.** `PENDING_STALE_AFTER_SECS = 120 s` is comfortable for IC HTTP outcalls under normal conditions; if we see verifier latency spikes above that floor we'd revisit.
+- **Lazy backfill convergence.** Credentials that are never used again never get `sso_domain` stamped. The metadata layer (`sso_fields_for` callers) gracefully handles `None` today (renders as the bare domain or falls through to the direct-provider config), so this is not a functional problem — but a future cleanup pass could backfill in bulk if telemetry shows long-tailed credentials hampering attribute-sharing UX.
+- **`response_mode=form_post` and FedCM coexistence.** Today's `requestJWT` falls back to popup when FedCM isn't available. Form_post applies only to the popup path. If a future FedCM upgrade adds a redirect fallback we'd revisit; for now the two paths are independent.
+- **Direct provider JWKS cache behaviour.** Hardcoded providers' JWKS today is fetched eagerly at canister bring-up via `Provider::create`. Under §5 it becomes lazy on first sign-in attempt. First-after-deploy sign-in pays one cache-miss outcall. Acceptable, but worth flagging in the rollout notes.


We should ideally provide visual feedback to the user to indicate that while signing them in, II is fetching a fresh SSO domain config (better progress indication ==> less frustration)

aterga · 2026-05-20T15:28:57Z

+
+- **Cache size revisit.** 1000 entries per cache is a guess based on "1000 simultaneously-hot SSOs is more than we expect" and "10 MB total heap is comfortably under budget." Once we have telemetry on real-world miss rates we should revisit. If miss rate is non-trivial we either raise the cap or shorten the TTL — both have cycle-budget implications.
+- **TTL on `Pending` entries.** `PENDING_STALE_AFTER_SECS = 120 s` is comfortable for IC HTTP outcalls under normal conditions; if we see verifier latency spikes above that floor we'd revisit.
+- **Lazy backfill convergence.** Credentials that are never used again never get `sso_domain` stamped. The metadata layer (`sso_fields_for` callers) gracefully handles `None` today (renders as the bare domain or falls through to the direct-provider config), so this is not a functional problem — but a future cleanup pass could backfill in bulk if telemetry shows long-tailed credentials hampering attribute-sharing UX.


which exact telemetry would provide this data?

Addresses the 28 inline review comments on #3907: - §3.2/glossary: fix MD pipe rendering inside table cells (L99, L25). - §3.3: reframe attack #3 (no boundary node in outcall path; replication is the defense). - §3.4: phishing bullet acknowledges the II-specific .well-known file as a soft mitigation against impersonating a registered SSO (L116). - §4.3: clarify that DirectProviderConfig holds boot-time direct providers (Google/Microsoft/Apple), NOT SSO — forestalls the rename suggestion (L168/L176). - §5.2: drop bare 'Waker' terminology, define it inline against the existing doh/cache.rs precedent (L260). Restructure §5.2 to lead with the cache count (two: discovery + JWKS). - §5.2.1 NEW: cache lifecycle sequence diagram (mermaid) showing cold miss → warm hit → expired refetch → eviction for one SSO (L190). - §5.4: dedupe repeated const block + paragraph, cross-ref to CacheEntry::Done.expires_at_secs (L313, L320). Cap → 100 per cache (L749). - §5.6: drop reject-on-low-cycles framing for the system subnet; ground DoS analysis in cycle ceilings, LRU cap, and boundary-node throttling (L336). - §6.3: rewrite the TLS-trust bullet — HTTPS outcalls run per-replica against the public CA system, no boundary node in the outcall path (L415). - §7.0 NEW: TL;DR for form_post including a mermaid sequence diagram of the full popup flow (L421). - §7.2: footnote that Apple form_post name/email behaviour was manually confirmed (L733). - §8.3: add discover_sso_query — query-mode shortcut that returns warm data without an update call (L679). - §8.5: storage OpenIdCredential gains sso_domain + sso_name fields. - §8.6 REWRITTEN: replace lazy backfill with sso_credential_migration upgrade arg. Migration is now the root of the rollout, deterministic, outcall-free, no cache dependency. Includes mermaid sequence diagram of the pre-flight query → NNS proposal → post_upgrade flow (L669, L724). - §8.7: progress-UI spinner on cold discovery; tied to plausible events in §11 (L753). - §9 REWRITTEN: rollout dependency layout as mermaid graph TD with PR-M as the root. C2 has a HARD dep on M (DISCOVERY_TASKS removal requires every credential pre-stamped); A/B/C1 have soft deps on M to keep main releasable. Suggested PR sequence updated. - §10: rewrite Cache size revisit and Pending-TTL questions to reference §11 metrics by name. Drop lazy backfill convergence question. - §11 NEW: Observability — cache gauges/counters, credential distribution histograms, plausible funnels, one-shot migration verification metrics (L749 reply, L751).

…he actual mitigation, not the .well-known file

zeropath-ai · 2026-05-27T15:32:31Z

✅ No security or compliance issues detected. Reviewed everything up to 0940827.

Security Overview

🔎 Scanned files: 1 changed file(s)
🔗 Scan Link: https://zeropath.com/app/repositories/069435eb-c5c7-4d1b-a71d-02c03f1fc8d3?scanId=440e0944-50af-4f67-8006-1633df0c0b6c&codeScanTypes=PrScan&tab=issues

Detected Code Changes

| Change Type | Relevant files

... (code changes summary truncated to fit VCS comment limits.)

sea-snake requested a review from a team as a code owner May 18, 2026 13:35

Copilot AI review requested due to automatic review settings May 18, 2026 13:35

Copilot started reviewing on behalf of sea-snake May 18, 2026 13:36 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

sea-snake and others added 3 commits May 18, 2026 15:42

fix: s/mailbox/browser in threat model §3.4

098fa90

session_sk lives in the browser's memory, not a mailbox. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

aterga reviewed May 20, 2026

View reviewed changes

Comment thread docs/ongoing/openid-sso-prod-readiness.md

aterga reviewed May 20, 2026

View reviewed changes

Comment thread docs/ongoing/openid-sso-prod-readiness.md

aterga reviewed May 20, 2026

View reviewed changes

aterga approved these changes May 20, 2026

View reviewed changes

sea-snake added 2 commits May 27, 2026 11:47

docs(ongoing): fix L116 phishing bullet — redirect_uri allowlist is t…

0940827

…he actual mitigation, not the .well-known file


		Entry TTL: 1 hour for both. Matches the upstream OIDC discovery doc cache headers we've observed, and is well under the JWKS rotation window of every mainstream IdP (Google rotates ~every 4–6 h; Microsoft daily; Apple monthly). On TTL lapse, the next verifier call observes the entry as expired and re-fetches inline — same code path as a cold start.

		Numbers tunable; they live in two consts in the cache module:


		Proposed worst case at steady state with a hot cache: ~0 cycles/h (no scheduled outcalls). Per-sign-in cost: ~30 Gcycles on cache miss (cold), ~0 on cache hit (warm). Multiplied by sign-in volume rather than provider count.

		For DoS analysis (anonymous `discover_sso` flood): N attacker calls × ~60 Gcycles per cold path (hop-1 + hop-2 + JWKS fetch) = ~60N Gcycles. Bounded by per-call cycle accounting (§6) and the IC's existing reject-on-low-cycles behaviour.


		### 6.3 What's deliberately not done

		- TLS certificate pinning. The IC's HTTPS outcall layer terminates TLS at the boundary node; we trust the gateway's certificate chain. Pinning per-SSO certs would require an out-of-band trust anchor distribution, which is the kind of governance question (§2) we're explicitly punting on.


		## 10. Open questions

		- Cache size revisit. 1000 entries per cache is a guess based on "1000 simultaneously-hot SSOs is more than we expect" and "10 MB total heap is comfortably under budget." Once we have telemetry on real-world miss rates we should revisit. If miss rate is non-trivial we either raise the cap or shorten the TTL — both have cycle-budget implications.

Conversation

sea-snake commented May 18, 2026

Summary

Reading order

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zeropath-ai Bot commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants