Skip to content

docs(ongoing): design doc for production-ready OpenID & SSO#3907

Open
sea-snake wants to merge 6 commits into
mainfrom
design/openid-sso-prod
Open

docs(ongoing): design doc for production-ready OpenID & SSO#3907
sea-snake wants to merge 6 commits into
mainfrom
design/openid-sso-prod

Conversation

@sea-snake

Copy link
Copy Markdown
Contributor

📄 Read the design doc directly: docs/ongoing/openid-sso-prod-readiness.md on the fork


Summary

Adds docs/ongoing/openid-sso-prod-readiness.md — a design doc for putting the OpenID/SSO stack on a production footing so the canary allowlist can be lifted.

The doc is problem-driven: it identifies the five concrete things the allowlist masks today, and proposes a solution for each:

  1. Stateful provider registry (§4) — replace Vec<Box<dyn OpenIdProvider>> + parallel DISCOVERY_TASKS with a data-only CONFIG_REGISTRY and a free-function verify_jwt. Trait dispatch goes away; per-provider differences become enum-tagged config fields.
  2. Timer-driven outcalls (§5) — replace init_discovery_timers and schedule_fetch_certs with an on-demand LRU cache (1000 entries per cache, 1 h TTL). Concurrent-fetch dedup uses the same Pending/Waker primitive as the DoH cache in feat(doh): DoH fallback for DKIM/DMARC TXT records on unsigned domains #3841, copy-pasted rather than lifted.
  3. HTTP outcall safety (§6) — max_response_bytes caps (8 KB hop-1, 64 KB hop-2, 64 KB JWKS), transform validation before parse, tightened per-call cycle ceilings.
  4. Fragment callback (§7) — switch response_mode=fragmentresponse_mode=form_post. New canister POST handler (update mode for response certification) parses the form body, returns certified HTML that delivers {id_token, state} via BroadcastChannel (popup) or sessionStorage (same-tab). Apple Sign In returning name/email only under form_post is the operational driver. Discussion in §7.4 of why the canister can't consume the JWT itself.
  5. Allowlist removal (§8) — drop add_discoverable_oidc_config and the persisted OIDC_CONFIGS list. Add discover_sso(domain) -> DiscoverySsoResult, anonymous, cache-backed. Thread discovery_domain: opt text through the four JWT-consuming API methods. Stamp sso_domain on OpenIdCredential at creation; lazy backfill for existing credentials.

The salt + nonce + caller() binding that secures the JWT against compromised transports stays end-to-end identical. §3 walks through why several otherwise-attractive simplifications (consume the JWT in the POST, 302-translate to fragment, etc.) were considered and rejected.

§9 lays out the rollout: tracks A (form_post), B (outcall safety), C1→C2→C3→C4 (cache stack) can run in parallel; only within Track C is there a stack.

Reading order

  • §1: the five problems framed.
  • §2-§3: scope + threat model.
  • §4-§8: one section per problem, each with what's broken today + what we're changing.
  • §9: dependency layout and suggested PR sequence.
  • §10: open questions.

Per request, this PR is the design doc only — no code changes.

cc @aterga

🤖 Generated with Claude Code

Adds docs/ongoing/openid-sso-prod-readiness.md, a design doc for
removing the SSO allowlist and putting the OpenID stack on a production
footing. Covers five concrete problems and their solutions:

1. Replace Vec<Box<dyn OpenIdProvider>> + parallel DISCOVERY_TASKS with
   a data-only config registry and free-function verify; per-provider
   mutable state moves into a bounded cache.
2. Replace always-on discovery/JWKS timers with an on-demand LRU cache
   plus DoH-style dedup primitive. 1000 entries per cache, 1h TTL.
3. Set max_response_bytes on every outcall, validate response shape in
   transforms before parsing, ceiling per-call cycles.
4. Switch the OAuth callback from response_mode=fragment to form_post.
   New canister POST handler returns certified HTML that delivers the
   payload via BroadcastChannel (popup) or sessionStorage (same-tab).
   Apple Sign In's name/email handling is the operational driver.
5. Drop add_discoverable_oidc_config and the persistent OIDC_CONFIGS
   list. Add discover_sso(domain), an anonymous update call backed by
   the cache. Threads discovery_domain through the four JWT-consuming
   API methods.

The salt + nonce + caller() binding that secures JWT redemption against
compromised transports is preserved end-to-end. Rollout layout (§9) has
form_post, outcall safety, and the cache stack landing in parallel.

cc @aterga

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@sea-snake sea-snake requested a review from a team as a code owner May 18, 2026 13:35
Copilot AI review requested due to automatic review settings May 18, 2026 13:35

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a single design document (docs/ongoing/openid-sso-prod-readiness.md) outlining how to make the OpenID/SSO stack production-ready and lift the canary allowlist. No code changes.

Changes:

  • New RFC-style design doc covering five problems (stateful provider registry, timer-driven outcalls, HTTP outcall safety, fragment callback, allowlist removal) and proposed solutions.
  • Includes threat model, API/persistent-state deltas, migration approach, and a rollout/PR-dependency plan.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

sea-snake and others added 3 commits May 18, 2026 15:42
session_sk lives in the browser's memory, not a mailbox.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CONFIG_REGISTRY now only holds direct providers (Google/MS/Apple) from
the init arg. SSO domains live nowhere on the canister beyond the
transient cache — the previous enum variant for them contradicted the
no-persistent-SSO-state principle in §8.

Cache primitive simplified to one HashMap per cache with an enum-typed
entry (Done | Pending) so a single key has at most one entry. Renamed
§5.4 to "Cache sizing and TTL" and inlined the TTL discussion that had
gotten stranded.

Stripped framing-noise references to how the dedup primitive relates to
other in-flight work; the doc describes what is, not how lines move.
DiscoverySsoResult now carries the OIDC fields the FE actually needs to
build the authorize URL (authorization_endpoint, scopes_supported) on
top of the previously-spec'd client_id/issuer/name. With those in the
response, the FE's ssoDiscovery.ts module is no longer needed — the
canister already does the two-hop fetch and cache, so we just expose
the result earlier in the flow.

§8.3 grew the response shape; §8.7 calls out that the ~560-line FE
discovery module is deleted; §9.3 mentions it in the "what changes"
summary.

Trade-off: discover_sso is an update call (~2-4s warm, ~5-8s cold),
whereas direct FE fetches today are ~200-500ms. Accepted because
discovery runs once per SSO per session and the canister-side cache
already had to do the same outcalls for JWT verification anyway, so
nothing is duplicated work-wise — we just stop having two separate
caches (one in FE, one on canister) chasing the same data.

| Change | Effect on trust model |
| ------ | --------------------- |
| Stateless verifier + cache (§4, §5) | None. The verifier still runs `SHA256(salt | caller()) == jwt.nonce` against `caller()` from a signed ingress message. The cache holds JWKS + discovery results, not JWTs or salts. |

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MD rendering is broken on this line:

Image


- Compromise of the IdP's signing key. Every credential issued by that IdP is at risk; that is the unavoidable trust assumption of any OIDC integration.
- Compromise of the user's device / browser that holds the session_sk. Standard endpoint compromise — out of scope.
- Phishing the user into completing an OAuth flow on a malicious site impersonating Internet Identity. This is the standard "website phishing" problem; we don't try to solve it in this layer.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this mitigated by the fact that the SSO provider needs to add a well-known configuration fiel specifically for II?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is indeed mitigated by that, seems like this was a bad hallucination.

Comment thread docs/ongoing/openid-sso-prod-readiness.md
Comment thread docs/ongoing/openid-sso-prod-readiness.md
salt: &[u8; 32],
discovery_domain: Option<&str>,
cache: &mut OidcCache,
) -> Result<OpenIdCredential, OpenIDJWTVerificationError> {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The planned caching design isn't clear from these comments; e.g., how many caches do we have?

I recommend outlining the lifecycle for a particular SSO <> II integration (first time it's a cache miss, second time it's a hit, third time it's a miss due to cache eviction, etc.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, probably a clear diagram doesn't hurt here either.

enum CacheEntry<V> {
/// Fetched and currently usable.
Done { value: V, expires_at_secs: u64 },
/// First caller is fetching; concurrent arrivals register their `Waker`

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we omit this new terminology?

At the very least, please clearly define all the terms, e.g., "waker"


Entry TTL: 1 hour for both. Matches the upstream OIDC discovery doc cache headers we've observed, and is well under the JWKS rotation window of every mainstream IdP (Google rotates ~every 4–6 h; Microsoft daily; Apple monthly). On TTL lapse, the next verifier call observes the entry as expired and re-fetches inline — same code path as a cold start.

Numbers tunable; they live in two consts in the cache module:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. a bit repetitive?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed section 5.4 has unintended repeated code block and paragraph.

pub const JWKS_CACHE_CAPACITY: usize = 1000;
```

Entry TTL: 1 hour for both. Matches the upstream OIDC discovery doc cache headers we've observed, and matches the JWKS rotation window of every mainstream IdP (Google rotates ~every 4–6 h; Microsoft daily; Apple monthly). On TTL lapse, the next verifier call observes a cache miss and re-fetches — same code path as a cold start.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refer to the section where it's explained how the TTL is implemented.


Proposed worst case at steady state with a hot cache: ~0 cycles/h (no scheduled outcalls). Per-sign-in cost: ~30 Gcycles on cache miss (cold), ~0 on cache hit (warm). Multiplied by sign-in volume rather than provider count.

For DoS analysis (anonymous `discover_sso` flood): N attacker calls × ~60 Gcycles per cold path (hop-1 + hop-2 + JWKS fetch) = ~60N Gcycles. Bounded by per-call cycle accounting (§6) and the IC's existing reject-on-low-cycles behaviour.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reject-on-low-cycles behaviour

I don't think this is applicable to II, let's make it clearer what DoS risks really are on a system subnet.

Hint: We could rely on the possibility to rate limit via boundary nodes, if necessary.


### 6.3 What's deliberately not done

- **TLS certificate pinning.** The IC's HTTPS outcall layer terminates TLS at the boundary node; we trust the gateway's certificate chain. Pinning per-SSO certs would require an out-of-band trust anchor distribution, which is the kind of governance question (§2) we're explicitly punting on.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we trust the gateway's certificate chain

double check this


---

## 7. Problem 4 — Fragment callback

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section wasn't easy enough for me to understand.

Could you ELI5 what the problem and the proposed solutions are, please?

Existing credentials in stable storage have no `sso_domain` stamp. There are two upgrade-time approaches:

- **Backfill from `OIDC_CONFIGS`.** The upgrade hook reads the about-to-be-deleted `persistent_state.oidc_configs`. For each registered domain, it runs the discovery cache lookup (cold, so triggers outcalls — this could be deferred until the cache is warm post-upgrade). Then iterates credentials, looking for any whose `iss + aud` matches a discovered provider, and stamps `sso_domain` on each. One-shot, runs in `post_upgrade`. Risk: discovery outcalls from an upgrade hook are unusual; if the upgrade fails the canister rolls back, which is the safe failure mode.
- **Lazy backfill.** Existing credentials keep `sso_domain = None`. The FE supplies `discovery_domain` on the next use of each credential. The canister, on successful JWT verification with `Some(domain)`, stamps `sso_domain = domain` on the matching stored credential. Slower convergence; no upgrade-time outcalls; works for credentials that are actually used.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the right approach.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that outcalls aren't possible on upgrade anyway, we'd need a batched migration (if we didn't have this approach)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We wouldn't need outcalls since we could migrate with the data from the sso configuration (if not removed yet at this stage).

- The 1-click SSO path (`authorize/+page.svelte:145-167`) replaces its `actor.add_discoverable_oidc_config({...})` + FE-side discovery chain with a single `actor.discover_sso(domain)` call. Result carries everything needed to build the authorize URL.
- **`src/frontend/src/lib/utils/ssoDiscovery.ts` is deleted.** All ~560 lines of two-hop fetch, per-hop cache, exponential-backoff retry, abort plumbing, hostname-match validation, HTTPS enforcement, and `DomainNotConfiguredError` classification go away. The canister now does the fetching, caching, and validation; the FE just calls `discover_sso` and reads the result. Removes the only FE-side direct fetch to arbitrary SSO domains (and with it, the CORS, IPv6, and timeout edge cases that module currently handles).

The FE-side simplification has its own cost: `discover_sso` is an update call (~2-4 s warm, ~5-8 s cold) whereas the FE's direct fetches today are ~200-500 ms when the SSO host is fast. For the input-debounce UX in `SignInWithSso.svelte` we accept this tradeoff — the user only ever runs discovery once per SSO before signing in, and the canister-cache means subsequent runs (same domain or other users on the same domain) hit the warm path.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could improve the UX significantly by also having a query, e.g. is_sso_discovered, that would provide data in the warmed-up cache case; only if that returns null would the FE need to call the update function.

or in the same PR.

Day 0+3N:
PR-C4 (lazy backfill) — opens after C3 lands.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intuitively, this shouldn't be the last step.

Is is safe to release the feature without this PR? If not, then it should take place before, so that we always have releasable code on the main branch.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think data migration should be consider as early as possible indeed, instead of being an after thought.


- Heap memory from registered SSOs: O(cache_size) ≤ ~20 MB.
- Background outcall fanout: zero. Discovery and JWKS fetching are sign-in-driven.
- Apple Sign In returns name + email. Okta / Auth0 / Keycloak / Authelia / generic OIDC become viable without per-IdP workarounds for fragment-mode quirks.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true? I certainly hope so!

NCR

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've manually confirmed a while back that Apple returns names and emails in this case.


## 10. Open questions

- **Cache size revisit.** 1000 entries per cache is a guess based on "1000 simultaneously-hot SSOs is more than we expect" and "10 MB total heap is comfortably under budget." Once we have telemetry on real-world miss rates we should revisit. If miss rate is non-trivial we either raise the cap or shorten the TTL — both have cycle-budget implications.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would start more conservatively at 100

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please add a section to describe how we add observability:

  1. metrics to record the number of cached SSO configs
  2. histogram metrics for how many credentials we have for each SSO domain (bonus points)
  3. plausible funnels (if they need any changes)

- **TTL on `Pending` entries.** `PENDING_STALE_AFTER_SECS = 120 s` is comfortable for IC HTTP outcalls under normal conditions; if we see verifier latency spikes above that floor we'd revisit.
- **Lazy backfill convergence.** Credentials that are never used again never get `sso_domain` stamped. The metadata layer (`sso_fields_for` callers) gracefully handles `None` today (renders as the bare domain or falls through to the direct-provider config), so this is not a functional problem — but a future cleanup pass could backfill in bulk if telemetry shows long-tailed credentials hampering attribute-sharing UX.
- **`response_mode=form_post` and FedCM coexistence.** Today's `requestJWT` falls back to popup when FedCM isn't available. Form_post applies only to the popup path. If a future FedCM upgrade adds a redirect fallback we'd revisit; for now the two paths are independent.
- **Direct provider JWKS cache behaviour.** Hardcoded providers' JWKS today is fetched eagerly at canister bring-up via `Provider::create`. Under §5 it becomes lazy on first sign-in attempt. First-after-deploy sign-in pays one cache-miss outcall. Acceptable, but worth flagging in the rollout notes.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should ideally provide visual feedback to the user to indicate that while signing them in, II is fetching a fresh SSO domain config (better progress indication ==> less frustration)


- **Cache size revisit.** 1000 entries per cache is a guess based on "1000 simultaneously-hot SSOs is more than we expect" and "10 MB total heap is comfortably under budget." Once we have telemetry on real-world miss rates we should revisit. If miss rate is non-trivial we either raise the cap or shorten the TTL — both have cycle-budget implications.
- **TTL on `Pending` entries.** `PENDING_STALE_AFTER_SECS = 120 s` is comfortable for IC HTTP outcalls under normal conditions; if we see verifier latency spikes above that floor we'd revisit.
- **Lazy backfill convergence.** Credentials that are never used again never get `sso_domain` stamped. The metadata layer (`sso_fields_for` callers) gracefully handles `None` today (renders as the bare domain or falls through to the direct-provider config), so this is not a functional problem — but a future cleanup pass could backfill in bulk if telemetry shows long-tailed credentials hampering attribute-sharing UX.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which exact telemetry would provide this data?

sea-snake added 2 commits May 27, 2026 11:47
Addresses the 28 inline review comments on #3907:

- §3.2/glossary: fix MD pipe rendering inside table cells (L99, L25).
- §3.3: reframe attack #3 (no boundary node in outcall path; replication is
  the defense).
- §3.4: phishing bullet acknowledges the II-specific .well-known file as a
  soft mitigation against impersonating a registered SSO (L116).
- §4.3: clarify that DirectProviderConfig holds boot-time direct providers
  (Google/Microsoft/Apple), NOT SSO — forestalls the rename suggestion
  (L168/L176).
- §5.2: drop bare 'Waker' terminology, define it inline against the
  existing doh/cache.rs precedent (L260). Restructure §5.2 to lead with
  the cache count (two: discovery + JWKS).
- §5.2.1 NEW: cache lifecycle sequence diagram (mermaid) showing cold miss
  → warm hit → expired refetch → eviction for one SSO (L190).
- §5.4: dedupe repeated const block + paragraph, cross-ref to
  CacheEntry::Done.expires_at_secs (L313, L320). Cap → 100 per cache (L749).
- §5.6: drop reject-on-low-cycles framing for the system subnet; ground
  DoS analysis in cycle ceilings, LRU cap, and boundary-node throttling
  (L336).
- §6.3: rewrite the TLS-trust bullet — HTTPS outcalls run per-replica
  against the public CA system, no boundary node in the outcall path
  (L415).
- §7.0 NEW: TL;DR for form_post including a mermaid sequence diagram of
  the full popup flow (L421).
- §7.2: footnote that Apple form_post name/email behaviour was manually
  confirmed (L733).
- §8.3: add discover_sso_query — query-mode shortcut that returns warm
  data without an update call (L679).
- §8.5: storage OpenIdCredential gains sso_domain + sso_name fields.
- §8.6 REWRITTEN: replace lazy backfill with sso_credential_migration
  upgrade arg. Migration is now the root of the rollout, deterministic,
  outcall-free, no cache dependency. Includes mermaid sequence diagram
  of the pre-flight query → NNS proposal → post_upgrade flow (L669, L724).
- §8.7: progress-UI spinner on cold discovery; tied to plausible events
  in §11 (L753).
- §9 REWRITTEN: rollout dependency layout as mermaid graph TD with PR-M
  as the root. C2 has a HARD dep on M (DISCOVERY_TASKS removal requires
  every credential pre-stamped); A/B/C1 have soft deps on M to keep main
  releasable. Suggested PR sequence updated.
- §10: rewrite Cache size revisit and Pending-TTL questions to reference
  §11 metrics by name. Drop lazy backfill convergence question.
- §11 NEW: Observability — cache gauges/counters, credential distribution
  histograms, plausible funnels, one-shot migration verification metrics
  (L749 reply, L751).
…he actual mitigation, not the .well-known file
@zeropath-ai

zeropath-ai Bot commented May 27, 2026

Copy link
Copy Markdown

No security or compliance issues detected. Reviewed everything up to 0940827.

Security Overview
Detected Code Changes

| Change Type | Relevant files

... (code changes summary truncated to fit VCS comment limits.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants