diff --git a/CHANGELOG.md b/CHANGELOG.md index 5cc259ae..323a90f5 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,6 +8,32 @@ Per-chunk entries preserve the full implementation record: contract amendments, ## [Unreleased] +### Per-route store swapping for the experiment switcher (issue #145) + +Closes the Phase 12c §3.6 deferral: the cross-experiment switcher shipped in 12c (the `/admin/experiments/` dashboard, the `Session.selected_experiment_id` cookie field, and `POST /admin/experiments/{E}/select`) recorded the operator's selection but every per-experiment route still read the startup-bound `app.state.store` / `experiment_id` / `experiment_config`, so "select experiment Y" only relabelled the page. This chunk makes the selection load-bearing: every per-experiment web-ui route now resolves the active experiment per-request and operates against its store / config / repo. Reference-impl web-ui only — **no spec / wire / JSON-schema / Pydantic / conformance change** (Decision 10/11; per-route swapping is web-ui behavior with no observable signal at the chapter-9 §6 IUT contract). + +**Active-experiment resolution.** A new `resolve_active_context(request)` helper ([`routes/_helpers.py`](reference/services/web-ui/src/eden_web_ui/routes/_helpers.py)) is the single entry point each handler calls after its session guard: it reads `Session.selected_experiment_id`, falls back to the deployment default (`--experiment-id`), and returns either a ready `ActiveContext` (resolved `experiment_id` + per-experiment `store` / `admin_store` / `config`) or a `Response` the handler returns verbatim. With no control plane configured it always returns the deployment default with **zero** validation overhead — single-experiment deployments are observably unchanged. 
For a non-default selection in control-plane mode it (1) validates existence in the control plane (a missing experiment raises `StaleSelection` → dashboard redirect + cleared session field) and (2) classifies seeded vs registered-but-unseeded against the task-store-server (Decision 8's three-state model: an experiment registered on the dashboard but not yet bootstrapped by `setup-experiment` / checkpoint-import renders an "initialize me" page rather than being mis-classified as stale, Risk 11). `experiment_id` moved from a render-time Jinja global to a per-request template context processor so every page reflects the active experiment. + +**Per-experiment store vending.** [`store_factory.py`](reference/services/web-ui/src/eden_web_ui/store_factory.py): the live `StoreFactory` vends per-`(experiment_id, role)` `StoreClient` views against the one deployment-wide task-store URL (12c Decision 11 — no service discovery; only the `experiment_id` path segment varies), sharing one `httpx.Client` so connection-pooling is preserved. Worker-role views are JIT-credentialed on first access by `BearerCache`, which reuses [`eden_service_common.auth.bootstrap_worker_credential`](reference/services/_common/src/eden_service_common/auth.py) verbatim (the per-`worker_id` lock + idempotent-register-then-reissue + persisted-token `/whoami` verify) — never reimplementing those disciplines — under a per-experiment credential subtree `//.token`. The auth-disabled posture (no admin token, no persisted credential) returns a `None` bearer, mirroring the prior `resolve_worker_bearer` posture 3. `StaticStoreFactory` vends one pre-built store for the single-experiment / test path. `make_app` now takes `store_factory` as its sole store dependency — the legacy `store=` / `admin_store=` kwargs and the `app.state.store` / `admin_store` attributes are gone (tests construct via the `conftest._one_experiment_factory` helper). 
+ +**Credential plumbing (Posture B/C/D, §3.2).** [`credentials.py`](reference/services/web-ui/src/eden_web_ui/credentials.py) adds a deployment-scoped control-plane worker credential (`bootstrap_control_plane_credential`, persisted at `/control-plane/.token`) so the switcher's control-plane reads (`list_experiments` / `read_experiment_metadata`, which accept any authenticated principal) keep working after the operator rotates the admin token out of the runtime env (Posture C). `resolve_credential_dir` resolves `--credential-dir` / `$EDEN_CREDENTIAL_DIR` → the common `--credentials-dir` / `$EDEN_WORKER_CREDENTIALS_DIR` → an XDG default, so the web-ui (itself a worker host) shares the established credentials volume by default. New CLI flags: `--credential-dir`, `--experiment-config-dir`, `--control-plane-worker-id`. + +**Per-experiment config + repo.** Each experiment's `ExperimentConfig` is loaded lazily from `<--experiment-config-dir>/.yaml` (Decision 6; the deployment default still uses `--experiment-config`); `setup-experiment.sh` drops each experiment's YAML there. The executor module's local integrator clone is per-experiment via [`repo_factory.py`](reference/services/web-ui/src/eden_web_ui/repo_factory.py)'s `RepoMaterializer` (clones `/.git` from the substituted `--forgejo-url` org base, fetch-on-access); `repo_for` returns the startup clone for the default experiment. + +**Switcher + safety.** [`base.html`](reference/services/web-ui/src/eden_web_ui/templates/base.html) gains a no-JS top-nav switcher dropdown (CSRF-protected `select` POSTs; a 5s in-process `list_experiments` cache, §3.7; hidden without a control plane). Every worker submit form carries a hidden `form_experiment_id`; `form_experiment_guard` discards a submission whose form was rendered against a different experiment than the now-active one and redirects with a clear banner rather than writing to the wrong experiment (§3.6). 
The dashboard renders the resolution-failure banners (stale-selection / control-plane-unreachable / cannot-bootstrap-credential / task-store-unreachable / config-missing / config-invalid / switched-mid-form). The `AdminGateMiddleware` admins-group check now follows the active experiment (deployment-scoped `/admin/experiments` + `/admin/control` pages gate against the default and are exempt from per-experiment resolution — they are the redirect target, so resolving them per-experiment would loop). + +**Compose.** The web-ui service gains `--experiment-config-dir /var/lib/eden/web-ui-configs` (+ bind-mount) and an explicit `--credentials-dir /var/lib/eden/credentials` (the new resolver otherwise falls back to a non-persisted in-container XDG path); `setup-experiment.sh` creates the `web-ui-configs/` dir and copies each experiment's config in. + +**Tests.** New `test_store_factory.py`, `test_resolve_active.py`, `test_per_experiment_repo.py`; `test_admin_experiments_routes.py` extended (switcher render/highlight, resolution-failure banners, no-control-plane switcher absence). Full web-ui suite green (667), including the real-subprocess e2e tests; `ruff` + `pyright` + `complexity-gate` clean. + +**Deferred (tracked):** + +- *`GET /v0/experiments/{E}/config` wire endpoint* (Decision 6 alternative) — the cleaner long-term shape that removes the on-disk config-dir and its drift risk (Risk 12); a normative chapter-7 amendment, out of scope here. Filed as [#259](https://github.com/ealt/eden/issues/259). +- *Per-request active-experiment resolution cache* (Decision 8 5s TTL) — only the switcher's `list_experiments` cache shipped; the per-request seeded/unseeded classification is uncached (correct, but a latency cost on non-default admin pages). Filed as [#260](https://github.com/ealt/eden/issues/260). +- *Tab-scoped `?exp=` permalink override + draft-survives-switch* (§7.5 / §7.2, v1 affordances). Filed as [#261](https://github.com/ealt/eden/issues/261). 
+- *`form_experiment_id` guard on admin mutating forms* — shipped on the worker submit forms; admin forms fail safe (cross-experiment id → `NotFound`) but lack the explicit guard banner. Filed as [#262](https://github.com/ealt/eden/issues/262). +- *Multi-experiment Compose smoke* remains the existing [#147](https://github.com/ealt/eden/issues/147); the single-experiment smoke is the golden path and stays green through this chunk. The [#128](https://github.com/ealt/eden/issues/128) / [#140](https://github.com/ealt/eden/issues/140) / [#141](https://github.com/ealt/eden/issues/141) retrofits are unchanged (§4.2). + ### Move deployment CLI flags into experiment-config fields (issue #157) Audits the orchestrator / worker-host CLI surface (issue #157) and moves five flags whose values two experiments sharing one deployment plausibly want different values for, from deployment-wide CLI flags / env vars into typed `experiment-config.yaml` fields — validated on both the JSON Schema and the Pydantic `eden-contracts` side per the repo's schema↔model parity discipline. Mirrors the [#133](https://github.com/ealt/eden/pull/215) `ideation_policy` template (discriminated-union YAML block + `build_*` factory + flag removal). diff --git a/docs/glossary.md b/docs/glossary.md index c0a3ee93..233740bf 100644 --- a/docs/glossary.md +++ b/docs/glossary.md @@ -247,6 +247,11 @@ EDEN maintains three branch namespaces in the experiment's git repo: | **`checkpoint:sha256:` URI** | Content-addressed scheme used only inside a portable-checkpoint archive: each `` is the lowercase SHA-256 of an artifact's bytes; the corresponding bytes live at `artifacts/sha256/` in the archive ([`spec/v0/10-checkpoints.md`](../spec/v0/10-checkpoints.md) §7). On import, the receiving Store rewrites each occurrence to its deployment-local URI (`file://`, `s3://`, …). Not a wire-resolvable scheme outside the archive. 
| | **import provenance** | The `Experiment.imported_from` field carrying `{checkpoint_exported_at, checkpoint_format_version}` set at import time; recovery-probe anchor for the lost-201 case in [`spec/v0/10-checkpoints.md`](../spec/v0/10-checkpoints.md) §10. Absent on natively-created experiments. | | **manifest** (in setup-experiment context) | A `.env` + `experiment-config.yaml` + forgejo credential helper produced for one experiment. Different from "evaluation manifest" above. | +| **experiment switcher** | The reference web-ui affordance (cross-experiment dashboard + top-nav dropdown) by which an operator selects which registered experiment subsequent per-experiment pages operate against. Shipped in Phase 12c (selection recorded) and made load-bearing in issue [#145](https://github.com/ealt/eden/issues/145) (per-route data follows the selection). Reference-impl, not protocol. | +| **selected experiment** | The value of `Session.selected_experiment_id` — the experiment the operator picked in the switcher, or `None` (no selection / no control plane). Distinct from the **active experiment** below. | +| **active experiment / `active_experiment_id`** | The per-request *resolved* experiment a web-ui route operates against: the selected experiment when set and valid, else the deployment default (`--experiment-id`). Always a concrete id (never `None`). The verb is "resolve the active experiment" (`resolve_active_experiment`). Reference-impl term (issue #145). | +| **`StoreFactory`** | The reference web-ui's per-process object that vends per-experiment `StoreClient` views against the one deployment-wide task-store URL, JIT-bootstrapping each experiment's worker credential and sharing one connection pool. Its `for_experiment(experiment_id, role)` returns the **active store** for a request. Reference-impl term (issue #145); not a protocol component. | +| **active store** | The per-request `StoreClient` produced by `StoreFactory.for_experiment(active_experiment_id)`. 
A helper name, not a separate concept. | ## 9. Build / packaging vocabulary diff --git a/docs/operations/README.md b/docs/operations/README.md index 7dcf2f15..9b37add7 100644 --- a/docs/operations/README.md +++ b/docs/operations/README.md @@ -20,6 +20,9 @@ the wire-observable end-state should look like. experiment (operator wire op + orchestrator policy-driven path), reference termination policies, drain semantics, idempotent re-terminate (Phase 12a-3). +- [Web UI multi-experiment operation](web-ui-multi-experiment.md) — the + experiment switcher, the four credential-bootstrap postures, and the + per-experiment config / repo layout (issue #145). These docs assume the reference Compose deployment + the [`docs/glossary.md`](../glossary.md) vocabulary. For the underlying diff --git a/docs/operations/web-ui-multi-experiment.md b/docs/operations/web-ui-multi-experiment.md new file mode 100644 index 00000000..6a4814e4 --- /dev/null +++ b/docs/operations/web-ui-multi-experiment.md @@ -0,0 +1,92 @@ +# Web UI: multi-experiment operation + +How the reference Web UI serves multiple experiments from one deployment +once a control plane is configured (`--control-plane-url`). Shipped in +issue [#145](https://github.com/ealt/eden/issues/145) (per-route store +swapping for the 12c experiment switcher). Reference-impl behavior, not +protocol. + +## What "select an experiment" does + +The cross-experiment dashboard (`/admin/experiments/`) and the top-nav +switcher dropdown record the operator's choice in the session cookie +(`Session.selected_experiment_id`). Every per-experiment page then +resolves the **active experiment** per request — the selected one when +set and valid, else the deployment default (`--experiment-id`) — and +operates against that experiment's store, config, and (for the executor +module) integrator repo. 
No control plane → the switcher is hidden and +every page uses the deployment default with zero resolution overhead, so +single-experiment deployments are unchanged. + +## Credentials: the four postures + +Workers are per-experiment-scoped (each experiment has its own worker +registry). The web-ui's startup credential is registered only in its +`--experiment-id` experiment; talking to another experiment needs a +credential there. How that credential is obtained defines four postures +(plan §3.2): + +| Posture | Admin token | Switcher works? | +|---|---|---| +| **A — no control plane** | optional | n/a (switcher hidden; single experiment) | +| **B — control plane, admin token at runtime** | present | yes — per-experiment worker credentials are minted just-in-time on first switch | +| **C — control plane, admin token bootstrap-only** | present at first boot, then rotated out | yes for experiments already credentialed on disk; new experiments redirect with `error=cannot-bootstrap-credential` | +| **D — control plane, no admin token ever** | absent | dashboard read fails; a startup warning is logged; switcher effectively unavailable | + +Posture **B** is the default Compose path. Posture **C** is the +production hardening (rotate the admin token out of the runtime env after +first boot, matching the worker-host pattern): the web-ui persists a +deployment-scoped control-plane worker credential at first boot so the +switcher's control-plane reads keep working without the admin token, and +per-experiment worker credentials persist on disk for reuse. + +If you switch to a new experiment in Posture C/D and see +`cannot-bootstrap-credential`, the web-ui has no persisted credential for +that experiment and no admin token to mint one. Either (a) provide the +admin token at runtime (`--admin-token` / `$EDEN_ADMIN_TOKEN`), or +(b) pre-provision the credential by running the web-ui once against that +experiment with the admin token available. 
+ +## Credential + config + repo layout + +Resolved by `--credential-dir` / `$EDEN_CREDENTIAL_DIR`, falling back to +the common `--credentials-dir` / `$EDEN_WORKER_CREDENTIALS_DIR`, then an +XDG default: + +```text +/ + control-plane/.token # Posture B/C deployment-scoped + /.token # per-experiment worker (JIT) +``` + +Per-experiment **config** is loaded from +`<--experiment-config-dir>/.yaml` (the deployment default +still uses `--experiment-config`); `setup-experiment.sh` drops each +experiment's YAML there. Per-experiment **integrator repos** (executor +module) live at `/.git`, cloned from the +`--forgejo-url` org base with the experiment id substituted. + +> **Config drift (known limitation).** The on-disk config dir is a +> separate source from the task-store-server's internal config text. If +> you hand-edit one but not the other, the web-ui's and the worker hosts' +> views of an experiment's objective / evaluation_schema can diverge. +> Issue [#259](https://github.com/ealt/eden/issues/259) (a +> `GET /v0/experiments/{E}/config` wire read) closes this by construction. + +## Compose + +The web-ui service passes `--experiment-config-dir +/var/lib/eden/web-ui-configs` (bind-mounted from +`${EDEN_EXPERIMENT_DATA_ROOT}/web-ui-configs`) and `--credentials-dir +/var/lib/eden/credentials` (the existing credentials bind-mount). +`setup-experiment.sh` creates `web-ui-configs/` and copies each +experiment's config in. Register additional experiments via the +dashboard's admin form, then run `setup-experiment.sh` per experiment to +seed its task-store data + config YAML. + +A single-stack Compose deployment running multiple experiments through +one control plane is the path of intent; running separate Compose +projects per experiment (§12.2 of the user guide) remains valid for hard +isolation. The end-to-end multi-experiment Compose smoke is tracked in +[#147](https://github.com/ealt/eden/issues/147); the single-experiment +smoke remains the golden path. 
diff --git a/docs/plans/review/issue-145/impl/20260601T185805/0-review.md b/docs/plans/review/issue-145/impl/20260601T185805/0-review.md new file mode 100644 index 00000000..21eaf4a0 --- /dev/null +++ b/docs/plans/review/issue-145/impl/20260601T185805/0-review.md @@ -0,0 +1,13 @@ +# Codex review — round 0 + +1. Bug — [cli.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/cli.py:100), [cli.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/cli.py:385), [routes/_helpers.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/routes/_helpers.py:246). Decision 6 / §3.4 says control-plane mode should load non-default configs from `--experiment-config-dir`, allow dir-only startup, and fail fast on default-file/dir mismatch. The implementation still makes `--experiment-config` mandatory, unconditionally loads it at startup, and if `experiment_config_dir` is absent `active_config()` silently reuses the default config for every non-default experiment. That makes multi-experiment pages render the wrong objective/evaluation schema. Fix: make `--experiment-config` optional in control-plane mode, require `--experiment-config-dir` there, and reject missing/mismatched default-config state instead of falling back to the default config for other experiments. + +2. 
Bug — [store_factory.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/store_factory.py:142), [store_factory.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/store_factory.py:224), [routes/_helpers.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/routes/_helpers.py:219), [routes/_helpers.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/routes/_helpers.py:347). Decision 8’s posture C/D branch is not implemented correctly. When the control plane knows experiment `Y` but the web-ui has neither an admin token nor a persisted per-experiment credential, `BearerCache` returns `None`, `StoreFactory` builds an unauthenticated `StoreClient`, and `resolve_active_experiment()` only treats `NotFound` as special. On an auth-enabled task-store that turns into an unhandled 401 instead of the planned `?error=cannot-bootstrap-credential&exp=Y` redirect. Fix: don’t synthesize an unauthenticated worker client for this branch; either raise `MissingAdminToken` directly unless auth-disabled has been positively established, or catch `Unauthorized` in `resolve_active_experiment()` and apply the plan’s 401 classification ladder. + +3. Bug — [cli.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/cli.py:327), [repo_factory.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/repo_factory.py:67), [compose.yaml](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/compose/compose.yaml:470). §3.5 / Decision 9 says per-experiment bare clones live under `/.git`. 
`_build_repo_materializer()` instead roots them at `args.repo_path.parent`, so non-default repos become siblings of the default bare repo. In Compose only `/var/lib/eden/repo` is mounted, so `/var/lib/eden/.git` lands on the container filesystem, not the durable bind mount. Fix: store non-default clones under a directory actually rooted at `--repo-path`, or change the CLI/mount contract explicitly and update the plan/docs/tests in lockstep. + +4. Risk — [store_factory.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/store_factory.py:138), [store_factory.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/store_factory.py:218), [cli.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/cli.py:413). Credential staleness across ephemeral task-stores is still broken. The only `/whoami` revalidation happens once at startup for the default experiment; after that both the bearer and `StoreClient` are cached forever. If the task-store is reseeded or a credential is reissued server-side, the web-ui has no eviction/rebootstrap path and selected experiments stay stuck on stale 401s until process restart. Fix: add a one-shot refresh path on `Unauthorized` that clears the per-experiment bearer/client cache and reruns bootstrap. + +5. Risk — [cli.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/cli.py:71), [routes/_helpers.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/routes/_helpers.py:435), [base.html](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/templates/base.html:12). Posture D is only half-implemented. 
The plan says the switcher should be hidden when the web-ui has no usable control-plane credential; instead `_build_control_plane_client()` returns an unauthenticated client, `switcher_context()` still exposes the switcher, and the template renders an empty dropdown. That is misleading UI, and a stale selected session can still drive unauthenticated control-plane reads later. Fix: represent “control-plane reads unavailable” explicitly and suppress the switcher in that posture, rather than rendering an empty one. + +Overall assessment: the route-by-route refactor is mostly wired through correctly, but the multi-experiment edge cases the plan called out as load-bearing are not done yet. I would not treat this as plan-complete or merge-ready until the config contract, posture C/D credential handling, and per-experiment repo layout are fixed. diff --git a/docs/plans/review/issue-145/impl/20260601T185805/0.md b/docs/plans/review/issue-145/impl/20260601T185805/0.md new file mode 100644 index 00000000..45ff6778 --- /dev/null +++ b/docs/plans/review/issue-145/impl/20260601T185805/0.md @@ -0,0 +1,157 @@ +# Implementation Review: issue-145-per-route-store-swap + +## Plan +- Path: `docs/plans/issue-145-per-route-store-swap.md` +- Summary: Make the 12c experiment switcher load-bearing — every per-experiment + web-ui route resolves the active experiment per request and operates against + its store / config / repo. Reference-impl web-ui only; no spec / wire / + schema / conformance change. 
+ +## Implementation Files + +| File | Action | Plan Section | +|---|---|---| +| `reference/services/web-ui/src/eden_web_ui/store_factory.py` | Created | §3.3 StoreFactory + BearerCache + StaticStoreFactory | +| `reference/services/web-ui/src/eden_web_ui/credentials.py` | Created | §3.2 control-plane credential bootstrap + credential-dir resolution | +| `reference/services/web-ui/src/eden_web_ui/repo_factory.py` | Created | §3.5 RepoMaterializer (per-experiment integrator clones) | +| `reference/services/web-ui/src/eden_web_ui/routes/_helpers.py` | Modified | §3.1 resolve_active_experiment / active_config / resolve_active_context / repo_for / switcher_context / form_experiment_guard | +| `reference/services/web-ui/src/eden_web_ui/app.py` | Modified | §3.3 make_app takes store_factory; per-request experiment_id context processor | +| `reference/services/web-ui/src/eden_web_ui/cli.py` | Modified | §3.2/§3.5 build live StoreFactory + RepoMaterializer + control-plane credential; new flags | +| `reference/services/web-ui/src/eden_web_ui/middleware.py` | Modified | AdminGateMiddleware resolves the active experiment's store | +| `reference/services/web-ui/src/eden_web_ui/routes/{ideator,executor,evaluator}.py` | Modified | §3.1 per-handler resolve; §3.6 form_experiment_guard on submit; §3.5 repo_for (executor) | +| `reference/services/web-ui/src/eden_web_ui/routes/admin/{observability,actions,work_refs,index}.py` | Modified | §3.1 per-handler resolve (store + admin_store); §3.5 repo_for | +| `reference/services/web-ui/src/eden_web_ui/routes/{admin_workers,admin_groups,index}.py` | Modified | §3.1 per-handler resolve | +| `reference/services/web-ui/src/eden_web_ui/routes/admin_experiments.py` | Modified | §4 resolution-failure banners on the dashboard | +| `reference/services/web-ui/src/eden_web_ui/templates/base.html` | Modified | §3.7 top-nav switcher dropdown | +| `reference/services/web-ui/src/eden_web_ui/templates/{ideator,executor,evaluator}_claim.html` | Modified | 
§3.6 hidden form_experiment_id | +| `reference/services/web-ui/src/eden_web_ui/templates/_unseeded.html` | Created | Decision 8 registered-but-unseeded page | +| `reference/compose/compose.yaml`, `reference/scripts/setup-experiment/setup-experiment.sh` | Modified | §3.4/§4.1 per-experiment config dir + credentials-dir wiring | +| `docs/glossary.md`, `docs/user-guide.md`, `docs/operations/web-ui-multi-experiment.md`, `docs/roadmap.md`, `CHANGELOG.md` | Modified/Created | §4.1 docs | +| `reference/services/web-ui/tests/*` | Modified/Created | §6 new + swept tests | + +## Change Summary +- New `StoreFactory` vends per-`(experiment_id, role)` `StoreClient` views over + one shared `httpx.Client`; worker creds JIT-bootstrapped via `BearerCache` + reusing `eden_service_common.auth.bootstrap_worker_credential`. `make_app` + now takes `store_factory` as its sole store dependency (legacy `store=` / + `admin_store=` + `app.state.store/admin_store` removed). +- `resolve_active_context(request)` is the single per-handler entry point: reads + the session selection, falls back to the deployment default (no validation / + zero overhead when no control plane), validates non-default selections + (stale → redirect; seeded vs registered-but-unseeded classification), and + returns an `ActiveContext` or a `Response`. +- Per-experiment config (`--experiment-config-dir`) + integrator repo + (`RepoMaterializer`). Control-plane-scoped credential (Posture C). Top-nav + switcher + `form_experiment_id` switch-mid-form guard + dashboard + resolution-failure banners. `AdminGateMiddleware` follows the active + experiment (deployment-scoped admin pages exempt to avoid a redirect loop). +- Full web-ui suite green (667); ruff / pyright / complexity-gate / + rename-discipline / spec-xref / markdownlint clean. + +## Known deferrals (filed) +- #259 `GET /v0/experiments/{E}/config` wire endpoint (replaces the on-disk + config-dir; closes the drift risk). 
+- #260 per-request active-experiment resolution cache (Decision 8 TTL; only the + switcher `list_experiments` cache shipped). +- #261 v1 switcher affordances (`?exp=` permalink override + draft-survives-switch). +- #262 `form_experiment_id` guard on admin mutating forms (worker forms covered; + admin forms fail safe via NotFound). +- #147 multi-experiment Compose smoke (pre-existing). + +## Review Status +Round 0 — awaiting initial review. + +## Round 1 — fixes applied + +Addressed all 5 round-0 findings: + +1. **Bug 1 (config contract).** `active_config` no longer silently reuses the + default experiment's config for a non-default experiment when no + `--experiment-config-dir` is set — it raises `ExperimentConfigMissing` + (→ `config-missing` redirect). `--experiment-config` is now optional in + control-plane mode (the default experiment may load from the config dir); + `_resolve_default_config` validates single-experiment mode still requires it, + control-plane mode requires config OR config-dir, and fails fast on a + default-config / config-dir mismatch. `make_app.experiment_config` is now + `ExperimentConfig | None`. +2. **Bug 2 (Posture C/D 401 classification).** `resolve_active_experiment` now + catches `Unauthorized` from the seed probe: it evicts the cached credential, + re-bootstraps once, and on a persistent 401 raises `MissingAdminToken` + (→ `cannot-bootstrap-credential` redirect) rather than letting the 401 + escape or mis-classifying it as unseeded. +3. **Bug 3 (per-experiment repo durability).** Added a `--repo-root` flag (the + durable directory holding `.git` clones), defaulting to the + parent of `--repo-path` for local/dev. Compose passes + `--repo-root /var/lib/eden/web-ui-repos` + bind-mounts it; `setup-experiment.sh` + creates the dir. 
+ (Deviation from the plan's literal `/.git`, + which would nest a repo inside the default bare clone and land on the + non-durable container fs in Compose — the `--repo-root` form is the + durability-correct realization.) +4. **Risk 4 (credential staleness eviction).** `StoreFactory.evict(experiment_id)` + drops the cached bearer + clients; the Bug 2 fix above wires it into the + 401 recovery so a reseeded task-store / server-side reissue self-heals on the + next request instead of being stuck until process restart. +5. **Risk 5 (Posture D switcher hidden).** `_cached_experiment_ids` returns + `None` on a cold-cache control-plane read failure (vs an empty list), and + `switcher_context` hides the switcher entirely in that case rather than + rendering an empty dropdown. + +New/updated tests: `test_context_non_default_without_config_dir_redirects`, +`test_resolve_401_evicts_then_raises_missing_admin`, +`test_switcher_hidden_when_control_plane_unreadable`; happy-path test now +provides a config dir. Web-ui suite re-run + ruff/pyright clean. + +## Round 2 — fixes applied + +Codex round-1 verdict: 4/5 round-0 findings resolved; finding 4 partial + 1 new Risk. + +- **New Risk (control-plane bootstrap robustness).** `_build_control_plane_client` + now also catches `httpx.TransportError` / control-plane `WireError` (not just + `RuntimeError`): a control-plane outage / rejection at startup degrades to the + Posture-D banners + hidden-switcher posture (log a warning, return a client + that the cold-cache switcher path hides) rather than aborting web-ui startup. +- **Finding 4 (default-experiment credential staleness) — scoped + tracked.** + The round-1 evict-and-rebootstrap recovery covers non-default selected + experiments. 
The deployment-default fast path intentionally does no + per-request probe (the plan's zero-overhead single-experiment guarantee), so a + default-experiment credential that goes stale server-side after startup stays + on the cached client until restart — identical to pre-#145 behavior (the old + `app.state.store` was cached for the process lifetime with no eviction). Not a + #145 regression; folded into #260 (default-experiment auto-refresh via a + point-of-use retry wrapper, without adding a probe to the fast path). + +Verified: ruff / pyright clean; control-plane-touching + resolve tests green. + +## Convergence + +Codex round-2 verdict: **no remaining Bug/Risk findings** in `cli.py`, +`routes/_helpers.py`, `store_factory.py`, or the reviewed wiring/docs slice. The +control-plane bootstrap robustness Risk is resolved; the default-experiment +credential-staleness gap is accepted as a documented, non-regression follow-up +(#260). Review converged after 3 rounds (0 → 1 → 2). + +## Round 3 — post-merge smoke regression fix + +Surfaced by PR-CI (8 compose-smoke + compose-e2e jobs red, "container eden-web-ui +is unhealthy") after merging `origin/main`. Root cause: the round-1 Bug-1 fix +added a "fail fast on default-config / config-dir mismatch" check in +`_resolve_default_config`. Every compose smoke appends fields +(`ideation_policy` / `max_quiescent_iterations`) to the mounted +`experiment-config.yaml` *after* `setup-experiment` has copied the pre-append +config into the web-ui `--experiment-config-dir`. The mismatch check then saw +the two differ and exited the web-ui (rc=1) → unhealthy → every stack-bringing +smoke failed at `up --wait`. + +Fix: removed the mismatch fail-fast. It was over-strict — the deployment-default +experiment ALWAYS resolves its config from `--experiment-config` (active_config's +default branch returns `app.state.experiment_config` and never reads +`/.yaml`), so a divergent default entry in the config-dir is +harmless. 
The posture validation (single-exp requires `--experiment-config`; +control-plane mode requires config OR config-dir) stays. This refines — does not +regress — codex round-0 Bug 1 (the real concern was the silent default→non-default +config fallback, which `active_config` still prevents via `ExperimentConfigMissing`). + +Verified locally: `smoke.sh` → PASS, `smoke-subprocess.sh` → PASS (subprocess +mode); web-ui healthy, quiescence reached, all assertions pass. All 8 compose +smokes share this single root cause (all append post-setup), so the one fix +clears them. ruff + pyright clean. diff --git a/docs/plans/review/issue-145/impl/20260601T185805/1-review.md b/docs/plans/review/issue-145/impl/20260601T185805/1-review.md new file mode 100644 index 00000000..6f0169cb --- /dev/null +++ b/docs/plans/review/issue-145/impl/20260601T185805/1-review.md @@ -0,0 +1,21 @@ +# Codex review — round 1 + +**Findings** + +1. Round-0 finding 1: resolved. The config contract now matches the intended behavior: `--experiment-config` is optional in control-plane mode, `_resolve_default_config()` enforces the single-experiment vs control-plane startup rules, and `active_config()` no longer falls back from a non-default experiment to the default experiment’s config ([cli.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/cli.py:397), [routes/_helpers.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/routes/_helpers.py:247)). + +2. Round-0 finding 2: resolved. Non-default experiment resolution now handles 401s by evicting cached state, retrying once, and routing a persistent auth failure to `cannot-bootstrap-credential` instead of leaking the 401 or misclassifying it as unseeded ([routes/_helpers.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/routes/_helpers.py:220)). + +3. 
Round-0 finding 3: resolved. The per-experiment repo clones now have an explicit durable root via `--repo-root`, and Compose mounts that directory, so non-default experiment clones no longer land on transient container storage ([cli.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/cli.py:171), [cli.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/cli.py:347), [compose.yaml](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/compose/compose.yaml:472)). + +4. Round-0 finding 4: not fully resolved. The new stale-credential eviction path only runs for non-default selected experiments. The deployment-default fast path still returns immediately without any probe or retry, so if the default experiment’s credential goes stale after startup, requests against the default experiment remain stuck on the cached client until restart ([routes/_helpers.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/routes/_helpers.py:203), [routes/_helpers.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/routes/_helpers.py:220), [store_factory.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/store_factory.py:222)). + Fix: give the default-experiment path the same evict-and-rebootstrap recovery, either in `resolve_active_experiment()` or at the first worker-store 401. + +5. Round-0 finding 5: resolved. A cold-cache control-plane read failure now hides the switcher instead of rendering an empty dropdown ([routes/_helpers.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/routes/_helpers.py:435)). + +**New issue** + +1. 
Risk — [cli.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/cli.py:71). `_build_control_plane_client()` only catches `RuntimeError`. If control-plane bootstrap hits a transport or protocol failure during `whoami` / register / reissue, that exception still escapes and can fail web-ui startup outright. That is stricter than the rest of the design, which treats control-plane unavailability as a degraded runtime posture with banners / hidden switcher, not a hard service-start dependency. + Fix: catch control-plane transport/protocol failures here as well, log a warning, and degrade to the same “switcher unavailable” posture rather than aborting startup. + +Overall: 4 of the 5 round-0 findings are resolved; finding 4 is only partially fixed, and there is one new startup robustness risk in the control-plane bootstrap path. diff --git a/docs/plans/review/issue-145/impl/20260601T185805/1.md b/docs/plans/review/issue-145/impl/20260601T185805/1.md new file mode 100644 index 00000000..eeaa4d0b --- /dev/null +++ b/docs/plans/review/issue-145/impl/20260601T185805/1.md @@ -0,0 +1,102 @@ +# Implementation Review: issue-145-per-route-store-swap + +## Plan +- Path: `docs/plans/issue-145-per-route-store-swap.md` +- Summary: Make the 12c experiment switcher load-bearing — every per-experiment + web-ui route resolves the active experiment per request and operates against + its store / config / repo. Reference-impl web-ui only; no spec / wire / + schema / conformance change. 
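
The per-request resolution summarized above can be pictured with a small, self-contained sketch. This is not the code in `routes/_helpers.py`: `resolve_active`, `Outcome`, `Resolution`, and the two callables are hypothetical names chosen for illustration, and the real helper returns an `ActiveContext` or a `Response` rather than an enum. The sketch only assumes the behavior described in this review: the deployment default takes a fast path with no probes, and a non-default selection is checked for existence first and for seeding second.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Set


class Outcome(Enum):
    ACTIVE = "active"      # usable: serve the request against this experiment
    STALE = "stale"        # selection no longer registered: redirect to the dashboard
    UNSEEDED = "unseeded"  # registered but not bootstrapped: show the initialize page


@dataclass(frozen=True)
class Resolution:
    outcome: Outcome
    experiment_id: str


def resolve_active(
    selected: Optional[str],
    default_id: str,
    list_registered: Optional[Callable[[], Set[str]]],  # None = no control plane (assumption)
    is_seeded: Callable[[str], bool],
) -> Resolution:
    """Illustrative decision order only; all names here are hypothetical."""
    experiment_id = selected or default_id
    # Fast path: no control plane, or the deployment default is active.
    # No probe is issued, so single-experiment deployments pay nothing.
    if list_registered is None or experiment_id == default_id:
        return Resolution(Outcome.ACTIVE, default_id)
    # Non-default selection: first confirm it still exists in the control plane.
    if experiment_id not in list_registered():
        return Resolution(Outcome.STALE, experiment_id)
    # Then distinguish seeded from registered-but-unseeded.
    if not is_seeded(experiment_id):
        return Resolution(Outcome.UNSEEDED, experiment_id)
    return Resolution(Outcome.ACTIVE, experiment_id)
```

Ordering the existence check before the seed probe keeps a deleted experiment from being reported as merely unseeded, which is the distinction the three-state model is meant to preserve.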
+ +## Implementation Files + +| File | Action | Plan Section | +|---|---|---| +| `reference/services/web-ui/src/eden_web_ui/store_factory.py` | Created | §3.3 StoreFactory + BearerCache + StaticStoreFactory | +| `reference/services/web-ui/src/eden_web_ui/credentials.py` | Created | §3.2 control-plane credential bootstrap + credential-dir resolution | +| `reference/services/web-ui/src/eden_web_ui/repo_factory.py` | Created | §3.5 RepoMaterializer (per-experiment integrator clones) | +| `reference/services/web-ui/src/eden_web_ui/routes/_helpers.py` | Modified | §3.1 resolve_active_experiment / active_config / resolve_active_context / repo_for / switcher_context / form_experiment_guard | +| `reference/services/web-ui/src/eden_web_ui/app.py` | Modified | §3.3 make_app takes store_factory; per-request experiment_id context processor | +| `reference/services/web-ui/src/eden_web_ui/cli.py` | Modified | §3.2/§3.5 build live StoreFactory + RepoMaterializer + control-plane credential; new flags | +| `reference/services/web-ui/src/eden_web_ui/middleware.py` | Modified | AdminGateMiddleware resolves the active experiment's store | +| `reference/services/web-ui/src/eden_web_ui/routes/{ideator,executor,evaluator}.py` | Modified | §3.1 per-handler resolve; §3.6 form_experiment_guard on submit; §3.5 repo_for (executor) | +| `reference/services/web-ui/src/eden_web_ui/routes/admin/{observability,actions,work_refs,index}.py` | Modified | §3.1 per-handler resolve (store + admin_store); §3.5 repo_for | +| `reference/services/web-ui/src/eden_web_ui/routes/{admin_workers,admin_groups,index}.py` | Modified | §3.1 per-handler resolve | +| `reference/services/web-ui/src/eden_web_ui/routes/admin_experiments.py` | Modified | §4 resolution-failure banners on the dashboard | +| `reference/services/web-ui/src/eden_web_ui/templates/base.html` | Modified | §3.7 top-nav switcher dropdown | +| `reference/services/web-ui/src/eden_web_ui/templates/{ideator,executor,evaluator}_claim.html` | Modified | 
§3.6 hidden form_experiment_id | +| `reference/services/web-ui/src/eden_web_ui/templates/_unseeded.html` | Created | Decision 8 registered-but-unseeded page | +| `reference/compose/compose.yaml`, `reference/scripts/setup-experiment/setup-experiment.sh` | Modified | §3.4/§4.1 per-experiment config dir + credentials-dir wiring | +| `docs/glossary.md`, `docs/user-guide.md`, `docs/operations/web-ui-multi-experiment.md`, `docs/roadmap.md`, `CHANGELOG.md` | Modified/Created | §4.1 docs | +| `reference/services/web-ui/tests/*` | Modified/Created | §6 new + swept tests | + +## Change Summary +- New `StoreFactory` vends per-`(experiment_id, role)` `StoreClient` views over + one shared `httpx.Client`; worker creds JIT-bootstrapped via `BearerCache` + reusing `eden_service_common.auth.bootstrap_worker_credential`. `make_app` + now takes `store_factory` as its sole store dependency (legacy `store=` / + `admin_store=` + `app.state.store/admin_store` removed). +- `resolve_active_context(request)` is the single per-handler entry point: reads + the session selection, falls back to the deployment default (no validation / + zero overhead when no control plane), validates non-default selections + (stale → redirect; seeded vs registered-but-unseeded classification), and + returns an `ActiveContext` or a `Response`. +- Per-experiment config (`--experiment-config-dir`) + integrator repo + (`RepoMaterializer`). Control-plane-scoped credential (Posture C). Top-nav + switcher + `form_experiment_id` switch-mid-form guard + dashboard + resolution-failure banners. `AdminGateMiddleware` follows the active + experiment (deployment-scoped admin pages exempt to avoid a redirect loop). +- Full web-ui suite green (667); ruff / pyright / complexity-gate / + rename-discipline / spec-xref / markdownlint clean. + +## Known deferrals (filed) +- #259 `GET /v0/experiments/{E}/config` wire endpoint (replaces the on-disk + config-dir; closes the drift risk). 
+- #260 per-request active-experiment resolution cache (Decision 8 TTL; only the + switcher `list_experiments` cache shipped). +- #261 v1 switcher affordances (`?exp=` permalink override + draft-survives-switch). +- #262 `form_experiment_id` guard on admin mutating forms (worker forms covered; + admin forms fail safe via NotFound). +- #147 multi-experiment Compose smoke (pre-existing). + +## Review Status +Round 0 — awaiting initial review. + +## Round 1 — fixes applied + +Addressed all 5 round-0 findings: + +1. **Bug 1 (config contract).** `active_config` no longer silently reuses the + default experiment's config for a non-default experiment when no + `--experiment-config-dir` is set — it raises `ExperimentConfigMissing` + (→ `config-missing` redirect). `--experiment-config` is now optional in + control-plane mode (the default experiment may load from the config dir); + `_resolve_default_config` validates single-experiment mode still requires it, + control-plane mode requires config OR config-dir, and fails fast on a + default-config / config-dir mismatch. `make_app.experiment_config` is now + `ExperimentConfig | None`. +2. **Bug 2 (Posture C/D 401 classification).** `resolve_active_experiment` now + catches `Unauthorized` from the seed probe: it evicts the cached credential, + re-bootstraps once, and on a persistent 401 raises `MissingAdminToken` + (→ `cannot-bootstrap-credential` redirect) rather than letting the 401 + escape or mis-classifying it as unseeded. +3. **Bug 3 (per-experiment repo durability).** Added a `--repo-root` flag (the + durable directory holding `.git` clones), defaulting to the + parent of `--repo-path` for local/dev. Compose passes + `--repo-root /var/lib/eden/web-ui-repos` + bind-mounts it; `setup-experiment.sh` + creates the dir. 
(Deviation from the plan's literal `/.git`, + which would nest a repo inside the default bare clone and land on the + non-durable container fs in Compose — the `--repo-root` form is the + durability-correct realization.) +4. **Risk 4 (credential staleness eviction).** `StoreFactory.evict(experiment_id)` + drops the cached bearer + clients; the Bug 2 fix above wires it into the + 401 recovery so a reseeded task-store / server-side reissue self-heals on the + next request instead of being stuck until process restart. +5. **Risk 5 (Posture D switcher hidden).** `_cached_experiment_ids` returns + `None` on a cold-cache control-plane read failure (vs an empty list), and + `switcher_context` hides the switcher entirely in that case rather than + rendering an empty dropdown. + +New/updated tests: `test_context_non_default_without_config_dir_redirects`, +`test_resolve_401_evicts_then_raises_missing_admin`, +`test_switcher_hidden_when_control_plane_unreadable`; happy-path test now +provides a config dir. Web-ui suite re-run + ruff/pyright clean. diff --git a/docs/plans/review/issue-145/impl/20260601T185805/2-review.md b/docs/plans/review/issue-145/impl/20260601T185805/2-review.md new file mode 100644 index 00000000..b35018eb --- /dev/null +++ b/docs/plans/review/issue-145/impl/20260601T185805/2-review.md @@ -0,0 +1,9 @@ +# Codex review — round 2 + +No remaining Bug/Risk findings in the files I re-reviewed. + +The round-1 startup robustness risk is resolved. [`_build_control_plane_client()`](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/cli.py:44) now catches both `httpx.TransportError` and control-plane `WireError` and degrades to the existing hidden-switcher / banner posture instead of aborting startup. + +The remaining default-experiment credential-staleness gap is acceptable for `#145` as currently scoped. 
[resolve_active_experiment()](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/routes/_helpers.py:180) still preserves the zero-overhead default fast path at [lines 202-205](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/routes/_helpers.py:202), and the evict-and-rebootstrap recovery remains correctly applied to non-default selections at [lines 220-234](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/routes/_helpers.py:220). Since that default-path behavior matches pre-`#145` process-lifetime caching and is explicitly tracked in `#260`, I would treat it as a documented follow-up, not a blocker or regression in this change. + +Overall assessment: the two round-2 points are acceptable, and I do not see any remaining Bug/Risk-level issues in `cli.py`, `routes/_helpers.py`, `store_factory.py`, or the reviewed wiring/docs slice. diff --git a/docs/plans/review/issue-145/impl/20260601T185805/2.md b/docs/plans/review/issue-145/impl/20260601T185805/2.md new file mode 100644 index 00000000..87e3d7f9 --- /dev/null +++ b/docs/plans/review/issue-145/impl/20260601T185805/2.md @@ -0,0 +1,123 @@ +# Implementation Review: issue-145-per-route-store-swap + +## Plan +- Path: `docs/plans/issue-145-per-route-store-swap.md` +- Summary: Make the 12c experiment switcher load-bearing — every per-experiment + web-ui route resolves the active experiment per request and operates against + its store / config / repo. Reference-impl web-ui only; no spec / wire / + schema / conformance change. 
+ +## Implementation Files + +| File | Action | Plan Section | +|---|---|---| +| `reference/services/web-ui/src/eden_web_ui/store_factory.py` | Created | §3.3 StoreFactory + BearerCache + StaticStoreFactory | +| `reference/services/web-ui/src/eden_web_ui/credentials.py` | Created | §3.2 control-plane credential bootstrap + credential-dir resolution | +| `reference/services/web-ui/src/eden_web_ui/repo_factory.py` | Created | §3.5 RepoMaterializer (per-experiment integrator clones) | +| `reference/services/web-ui/src/eden_web_ui/routes/_helpers.py` | Modified | §3.1 resolve_active_experiment / active_config / resolve_active_context / repo_for / switcher_context / form_experiment_guard | +| `reference/services/web-ui/src/eden_web_ui/app.py` | Modified | §3.3 make_app takes store_factory; per-request experiment_id context processor | +| `reference/services/web-ui/src/eden_web_ui/cli.py` | Modified | §3.2/§3.5 build live StoreFactory + RepoMaterializer + control-plane credential; new flags | +| `reference/services/web-ui/src/eden_web_ui/middleware.py` | Modified | AdminGateMiddleware resolves the active experiment's store | +| `reference/services/web-ui/src/eden_web_ui/routes/{ideator,executor,evaluator}.py` | Modified | §3.1 per-handler resolve; §3.6 form_experiment_guard on submit; §3.5 repo_for (executor) | +| `reference/services/web-ui/src/eden_web_ui/routes/admin/{observability,actions,work_refs,index}.py` | Modified | §3.1 per-handler resolve (store + admin_store); §3.5 repo_for | +| `reference/services/web-ui/src/eden_web_ui/routes/{admin_workers,admin_groups,index}.py` | Modified | §3.1 per-handler resolve | +| `reference/services/web-ui/src/eden_web_ui/routes/admin_experiments.py` | Modified | §4 resolution-failure banners on the dashboard | +| `reference/services/web-ui/src/eden_web_ui/templates/base.html` | Modified | §3.7 top-nav switcher dropdown | +| `reference/services/web-ui/src/eden_web_ui/templates/{ideator,executor,evaluator}_claim.html` | Modified | 
§3.6 hidden form_experiment_id | +| `reference/services/web-ui/src/eden_web_ui/templates/_unseeded.html` | Created | Decision 8 registered-but-unseeded page | +| `reference/compose/compose.yaml`, `reference/scripts/setup-experiment/setup-experiment.sh` | Modified | §3.4/§4.1 per-experiment config dir + credentials-dir wiring | +| `docs/glossary.md`, `docs/user-guide.md`, `docs/operations/web-ui-multi-experiment.md`, `docs/roadmap.md`, `CHANGELOG.md` | Modified/Created | §4.1 docs | +| `reference/services/web-ui/tests/*` | Modified/Created | §6 new + swept tests | + +## Change Summary +- New `StoreFactory` vends per-`(experiment_id, role)` `StoreClient` views over + one shared `httpx.Client`; worker creds JIT-bootstrapped via `BearerCache` + reusing `eden_service_common.auth.bootstrap_worker_credential`. `make_app` + now takes `store_factory` as its sole store dependency (legacy `store=` / + `admin_store=` + `app.state.store/admin_store` removed). +- `resolve_active_context(request)` is the single per-handler entry point: reads + the session selection, falls back to the deployment default (no validation / + zero overhead when no control plane), validates non-default selections + (stale → redirect; seeded vs registered-but-unseeded classification), and + returns an `ActiveContext` or a `Response`. +- Per-experiment config (`--experiment-config-dir`) + integrator repo + (`RepoMaterializer`). Control-plane-scoped credential (Posture C). Top-nav + switcher + `form_experiment_id` switch-mid-form guard + dashboard + resolution-failure banners. `AdminGateMiddleware` follows the active + experiment (deployment-scoped admin pages exempt to avoid a redirect loop). +- Full web-ui suite green (667); ruff / pyright / complexity-gate / + rename-discipline / spec-xref / markdownlint clean. + +## Known deferrals (filed) +- #259 `GET /v0/experiments/{E}/config` wire endpoint (replaces the on-disk + config-dir; closes the drift risk). 
+- #260 per-request active-experiment resolution cache (Decision 8 TTL; only the + switcher `list_experiments` cache shipped). +- #261 v1 switcher affordances (`?exp=` permalink override + draft-survives-switch). +- #262 `form_experiment_id` guard on admin mutating forms (worker forms covered; + admin forms fail safe via NotFound). +- #147 multi-experiment Compose smoke (pre-existing). + +## Review Status +Round 0 — awaiting initial review. + +## Round 1 — fixes applied + +Addressed all 5 round-0 findings: + +1. **Bug 1 (config contract).** `active_config` no longer silently reuses the + default experiment's config for a non-default experiment when no + `--experiment-config-dir` is set — it raises `ExperimentConfigMissing` + (→ `config-missing` redirect). `--experiment-config` is now optional in + control-plane mode (the default experiment may load from the config dir); + `_resolve_default_config` validates single-experiment mode still requires it, + control-plane mode requires config OR config-dir, and fails fast on a + default-config / config-dir mismatch. `make_app.experiment_config` is now + `ExperimentConfig | None`. +2. **Bug 2 (Posture C/D 401 classification).** `resolve_active_experiment` now + catches `Unauthorized` from the seed probe: it evicts the cached credential, + re-bootstraps once, and on a persistent 401 raises `MissingAdminToken` + (→ `cannot-bootstrap-credential` redirect) rather than letting the 401 + escape or mis-classifying it as unseeded. +3. **Bug 3 (per-experiment repo durability).** Added a `--repo-root` flag (the + durable directory holding `.git` clones), defaulting to the + parent of `--repo-path` for local/dev. Compose passes + `--repo-root /var/lib/eden/web-ui-repos` + bind-mounts it; `setup-experiment.sh` + creates the dir. 
(Deviation from the plan's literal `/.git`, + which would nest a repo inside the default bare clone and land on the + non-durable container fs in Compose — the `--repo-root` form is the + durability-correct realization.) +4. **Risk 4 (credential staleness eviction).** `StoreFactory.evict(experiment_id)` + drops the cached bearer + clients; the Bug 2 fix above wires it into the + 401 recovery so a reseeded task-store / server-side reissue self-heals on the + next request instead of being stuck until process restart. +5. **Risk 5 (Posture D switcher hidden).** `_cached_experiment_ids` returns + `None` on a cold-cache control-plane read failure (vs an empty list), and + `switcher_context` hides the switcher entirely in that case rather than + rendering an empty dropdown. + +New/updated tests: `test_context_non_default_without_config_dir_redirects`, +`test_resolve_401_evicts_then_raises_missing_admin`, +`test_switcher_hidden_when_control_plane_unreadable`; happy-path test now +provides a config dir. Web-ui suite re-run + ruff/pyright clean. + +## Round 2 — fixes applied + +Codex round-1 verdict: 4/5 round-0 findings resolved; finding 4 partial + 1 new Risk. + +- **New Risk (control-plane bootstrap robustness).** `_build_control_plane_client` + now also catches `httpx.TransportError` / control-plane `WireError` (not just + `RuntimeError`): a control-plane outage / rejection at startup degrades to the + Posture-D banners + hidden-switcher posture (log a warning, return a client + that the cold-cache switcher path hides) rather than aborting web-ui startup. +- **Finding 4 (default-experiment credential staleness) — scoped + tracked.** + The round-1 evict-and-rebootstrap recovery covers non-default selected + experiments. 
The deployment-default fast path intentionally does no + per-request probe (the plan's zero-overhead single-experiment guarantee), so a + default-experiment credential that goes stale server-side after startup stays + on the cached client until restart — identical to pre-#145 behavior (the old + `app.state.store` was cached for the process lifetime with no eviction). Not a + #145 regression; folded into #260 (default-experiment auto-refresh via a + point-of-use retry wrapper, without adding a probe to the fast path). + +Verified: ruff / pyright clean; control-plane-touching + resolve tests green. diff --git a/docs/plans/review/issue-145/impl/20260601T185805/3-review.md b/docs/plans/review/issue-145/impl/20260601T185805/3-review.md new file mode 100644 index 00000000..854eec9d --- /dev/null +++ b/docs/plans/review/issue-145/impl/20260601T185805/3-review.md @@ -0,0 +1,17 @@ +# Codex review — round 3 (smoke-fix confirmation) + +No Bug/Risk findings from this follow-up. + +The fix is acceptable and it does not reintroduce round-0 Bug 1. + +The key reason is the separation between startup resolution and per-request resolution: + +- [`_resolve_default_config()`](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/cli.py:412) now just decides whether the deployment-default config comes from `--experiment-config`, from `/.yaml`, or is a startup misconfiguration. Removing the startup mismatch check only stops rejecting a harmless divergence. 
+- [active_config()](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/routes/_helpers.py:248) still preserves the load-bearing Bug-1 fix: + - default experiment with a startup config returns `app.state.experiment_config` at [lines 262-264](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/routes/_helpers.py:262) + - non-default experiments do not fall back to the default config; if no config dir is available they raise `ExperimentConfigMissing` at [lines 269-274](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/src/eden_web_ui/routes/_helpers.py:269) +- The existing regression test for the real Bug-1 condition is still the right guard: [test_resolve_active.py](/Users/ericalt/Documents/eden-worktrees/impl-issue-145-per-route-store-swap/reference/services/web-ui/tests/test_resolve_active.py:249). + +I also checked for other web-ui code paths that might read `/.yaml` directly; I didn’t find any outside `active_config()`. So, in the current implementation, a stale default entry in the config dir is operationally harmless. + +Overall assessment: acceptable fix, no remaining Bug/Risk issues in this area. I did not rerun the Compose smokes myself in this pass. diff --git a/docs/plans/review/issue-145/impl/20260601T185805/3.md b/docs/plans/review/issue-145/impl/20260601T185805/3.md new file mode 100644 index 00000000..e3dc752a --- /dev/null +++ b/docs/plans/review/issue-145/impl/20260601T185805/3.md @@ -0,0 +1,131 @@ +# Implementation Review: issue-145-per-route-store-swap + +## Plan +- Path: `docs/plans/issue-145-per-route-store-swap.md` +- Summary: Make the 12c experiment switcher load-bearing — every per-experiment + web-ui route resolves the active experiment per request and operates against + its store / config / repo. 
Reference-impl web-ui only; no spec / wire / + schema / conformance change. + +## Implementation Files + +| File | Action | Plan Section | +|---|---|---| +| `reference/services/web-ui/src/eden_web_ui/store_factory.py` | Created | §3.3 StoreFactory + BearerCache + StaticStoreFactory | +| `reference/services/web-ui/src/eden_web_ui/credentials.py` | Created | §3.2 control-plane credential bootstrap + credential-dir resolution | +| `reference/services/web-ui/src/eden_web_ui/repo_factory.py` | Created | §3.5 RepoMaterializer (per-experiment integrator clones) | +| `reference/services/web-ui/src/eden_web_ui/routes/_helpers.py` | Modified | §3.1 resolve_active_experiment / active_config / resolve_active_context / repo_for / switcher_context / form_experiment_guard | +| `reference/services/web-ui/src/eden_web_ui/app.py` | Modified | §3.3 make_app takes store_factory; per-request experiment_id context processor | +| `reference/services/web-ui/src/eden_web_ui/cli.py` | Modified | §3.2/§3.5 build live StoreFactory + RepoMaterializer + control-plane credential; new flags | +| `reference/services/web-ui/src/eden_web_ui/middleware.py` | Modified | AdminGateMiddleware resolves the active experiment's store | +| `reference/services/web-ui/src/eden_web_ui/routes/{ideator,executor,evaluator}.py` | Modified | §3.1 per-handler resolve; §3.6 form_experiment_guard on submit; §3.5 repo_for (executor) | +| `reference/services/web-ui/src/eden_web_ui/routes/admin/{observability,actions,work_refs,index}.py` | Modified | §3.1 per-handler resolve (store + admin_store); §3.5 repo_for | +| `reference/services/web-ui/src/eden_web_ui/routes/{admin_workers,admin_groups,index}.py` | Modified | §3.1 per-handler resolve | +| `reference/services/web-ui/src/eden_web_ui/routes/admin_experiments.py` | Modified | §4 resolution-failure banners on the dashboard | +| `reference/services/web-ui/src/eden_web_ui/templates/base.html` | Modified | §3.7 top-nav switcher dropdown | +| 
`reference/services/web-ui/src/eden_web_ui/templates/{ideator,executor,evaluator}_claim.html` | Modified | §3.6 hidden form_experiment_id | +| `reference/services/web-ui/src/eden_web_ui/templates/_unseeded.html` | Created | Decision 8 registered-but-unseeded page | +| `reference/compose/compose.yaml`, `reference/scripts/setup-experiment/setup-experiment.sh` | Modified | §3.4/§4.1 per-experiment config dir + credentials-dir wiring | +| `docs/glossary.md`, `docs/user-guide.md`, `docs/operations/web-ui-multi-experiment.md`, `docs/roadmap.md`, `CHANGELOG.md` | Modified/Created | §4.1 docs | +| `reference/services/web-ui/tests/*` | Modified/Created | §6 new + swept tests | + +## Change Summary +- New `StoreFactory` vends per-`(experiment_id, role)` `StoreClient` views over + one shared `httpx.Client`; worker creds JIT-bootstrapped via `BearerCache` + reusing `eden_service_common.auth.bootstrap_worker_credential`. `make_app` + now takes `store_factory` as its sole store dependency (legacy `store=` / + `admin_store=` + `app.state.store/admin_store` removed). +- `resolve_active_context(request)` is the single per-handler entry point: reads + the session selection, falls back to the deployment default (no validation / + zero overhead when no control plane), validates non-default selections + (stale → redirect; seeded vs registered-but-unseeded classification), and + returns an `ActiveContext` or a `Response`. +- Per-experiment config (`--experiment-config-dir`) + integrator repo + (`RepoMaterializer`). Control-plane-scoped credential (Posture C). Top-nav + switcher + `form_experiment_id` switch-mid-form guard + dashboard + resolution-failure banners. `AdminGateMiddleware` follows the active + experiment (deployment-scoped admin pages exempt to avoid a redirect loop). +- Full web-ui suite green (667); ruff / pyright / complexity-gate / + rename-discipline / spec-xref / markdownlint clean. 
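
The store vending and eviction described in this change summary can be illustrated with a stdlib-only sketch. It is a simplified stand-in, not the contents of `store_factory.py`: `SketchStoreFactory`, `StoreView`, and `bootstrap_bearer` are hypothetical names, the shared `transport` object stands in for the shared `httpx.Client`, and the sketch credentials every role the same way, which the real `BearerCache` may not do.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class StoreView:
    """Hypothetical stand-in for a per-experiment StoreClient view."""
    base_url: str
    experiment_id: str
    role: str
    bearer: Optional[str]
    transport: object  # the one shared connection pool

    def url(self, suffix: str) -> str:
        # Only the experiment_id path segment varies between views.
        return f"{self.base_url}/v0/experiments/{self.experiment_id}/{suffix}"


@dataclass
class SketchStoreFactory:
    base_url: str
    transport: object
    bootstrap_bearer: Callable[[str], Optional[str]]  # hypothetical credential hook
    _bearers: Dict[str, Optional[str]] = field(default_factory=dict)
    _views: Dict[Tuple[str, str], StoreView] = field(default_factory=dict)

    def get(self, experiment_id: str, role: str) -> StoreView:
        key = (experiment_id, role)
        if key not in self._views:
            if experiment_id not in self._bearers:
                # Credential is obtained lazily on first access, then cached.
                self._bearers[experiment_id] = self.bootstrap_bearer(experiment_id)
            self._views[key] = StoreView(
                self.base_url, experiment_id, role,
                self._bearers[experiment_id], self.transport,
            )
        return self._views[key]

    def evict(self, experiment_id: str) -> None:
        # Drop the cached bearer and every view for this experiment so the
        # next get() re-bootstraps instead of reusing a stale credential.
        self._bearers.pop(experiment_id, None)
        for key in [k for k in self._views if k[0] == experiment_id]:
            del self._views[key]
```

Keying the cache on `(experiment_id, role)` while holding a single transport is what lets many views coexist without multiplying connection pools; `evict` is scoped to one experiment so a stale credential for one selection does not disturb the others.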
+ +## Known deferrals (filed) +- #259 `GET /v0/experiments/{E}/config` wire endpoint (replaces the on-disk + config-dir; closes the drift risk). +- #260 per-request active-experiment resolution cache (Decision 8 TTL; only the + switcher `list_experiments` cache shipped). +- #261 v1 switcher affordances (`?exp=` permalink override + draft-survives-switch). +- #262 `form_experiment_id` guard on admin mutating forms (worker forms covered; + admin forms fail safe via NotFound). +- #147 multi-experiment Compose smoke (pre-existing). + +## Review Status +Round 0 — awaiting initial review. + +## Round 1 — fixes applied + +Addressed all 5 round-0 findings: + +1. **Bug 1 (config contract).** `active_config` no longer silently reuses the + default experiment's config for a non-default experiment when no + `--experiment-config-dir` is set — it raises `ExperimentConfigMissing` + (→ `config-missing` redirect). `--experiment-config` is now optional in + control-plane mode (the default experiment may load from the config dir); + `_resolve_default_config` validates single-experiment mode still requires it, + control-plane mode requires config OR config-dir, and fails fast on a + default-config / config-dir mismatch. `make_app.experiment_config` is now + `ExperimentConfig | None`. +2. **Bug 2 (Posture C/D 401 classification).** `resolve_active_experiment` now + catches `Unauthorized` from the seed probe: it evicts the cached credential, + re-bootstraps once, and on a persistent 401 raises `MissingAdminToken` + (→ `cannot-bootstrap-credential` redirect) rather than letting the 401 + escape or mis-classifying it as unseeded. +3. **Bug 3 (per-experiment repo durability).** Added a `--repo-root` flag (the + durable directory holding `.git` clones), defaulting to the + parent of `--repo-path` for local/dev. Compose passes + `--repo-root /var/lib/eden/web-ui-repos` + bind-mounts it; `setup-experiment.sh` + creates the dir. 
(Deviation from the plan's literal `/.git`, + which would nest a repo inside the default bare clone and land on the + non-durable container fs in Compose — the `--repo-root` form is the + durability-correct realization.) +4. **Risk 4 (credential staleness eviction).** `StoreFactory.evict(experiment_id)` + drops the cached bearer + clients; the Bug 2 fix above wires it into the + 401 recovery so a reseeded task-store / server-side reissue self-heals on the + next request instead of being stuck until process restart. +5. **Risk 5 (Posture D switcher hidden).** `_cached_experiment_ids` returns + `None` on a cold-cache control-plane read failure (vs an empty list), and + `switcher_context` hides the switcher entirely in that case rather than + rendering an empty dropdown. + +New/updated tests: `test_context_non_default_without_config_dir_redirects`, +`test_resolve_401_evicts_then_raises_missing_admin`, +`test_switcher_hidden_when_control_plane_unreadable`; happy-path test now +provides a config dir. Web-ui suite re-run + ruff/pyright clean. + +## Round 2 — fixes applied + +Codex round-1 verdict: 4/5 round-0 findings resolved; finding 4 partial + 1 new Risk. + +- **New Risk (control-plane bootstrap robustness).** `_build_control_plane_client` + now also catches `httpx.TransportError` / control-plane `WireError` (not just + `RuntimeError`): a control-plane outage / rejection at startup degrades to the + Posture-D banners + hidden-switcher posture (log a warning, return a client + that the cold-cache switcher path hides) rather than aborting web-ui startup. +- **Finding 4 (default-experiment credential staleness) — scoped + tracked.** + The round-1 evict-and-rebootstrap recovery covers non-default selected + experiments. 
The deployment-default fast path intentionally does no + per-request probe (the plan's zero-overhead single-experiment guarantee), so a + default-experiment credential that goes stale server-side after startup stays + on the cached client until restart — identical to pre-#145 behavior (the old + `app.state.store` was cached for the process lifetime with no eviction). Not a + #145 regression; folded into #260 (default-experiment auto-refresh via a + point-of-use retry wrapper, without adding a probe to the fast path). + +Verified: ruff / pyright clean; control-plane-touching + resolve tests green. + +## Convergence + +Codex round-2 verdict: **no remaining Bug/Risk findings** in `cli.py`, +`routes/_helpers.py`, `store_factory.py`, or the reviewed wiring/docs slice. The +control-plane bootstrap robustness Risk is resolved; the default-experiment +credential-staleness gap is accepted as a documented, non-regression follow-up +(#260). Review converged after 3 rounds (0 → 1 → 2). diff --git a/docs/plans/review/issue-145/impl/20260601T185805/session.txt b/docs/plans/review/issue-145/impl/20260601T185805/session.txt new file mode 100644 index 00000000..87ce044e --- /dev/null +++ b/docs/plans/review/issue-145/impl/20260601T185805/session.txt @@ -0,0 +1 @@ +thread_id 019e860e-2af2-7653-8d95-58e6df706249 diff --git a/docs/roadmap.md b/docs/roadmap.md index 100b9937..9351b84a 100644 --- a/docs/roadmap.md +++ b/docs/roadmap.md @@ -261,7 +261,7 @@ Units and chunking to be named closer to execution — too far ahead to estimate - [12b](plans/eden-phase-12b-portable-checkpoints.md) — Portable checkpoints — **shipped 2026-05-19** (see [CHANGELOG](../CHANGELOG.md)) - [12c](plans/eden-phase-12c-control-plane.md) — Control plane — **shipped 2026-05-19** (see [CHANGELOG](../CHANGELOG.md)) -- [#157](plans/issue-157-cli-flags-to-config.md) — Move deployment CLI flags into experiment-config fields (`termination_policy` / `max_quiescent_iterations` / `*_task_deadline`) — **in review** (see 
[CHANGELOG](../CHANGELOG.md)) +- [#145](plans/issue-145-per-route-store-swap.md) — Per-route store swapping for the experiment switcher (12c §3.6 backfill) — **shipped 2026-06-02** (see [CHANGELOG](../CHANGELOG.md)) --- diff --git a/docs/user-guide.md b/docs/user-guide.md index 8f143bbd..75fc898a 100644 --- a/docs/user-guide.md +++ b/docs/user-guide.md @@ -475,13 +475,26 @@ If you hand-craft a wire call for `claim` or `submit` and include `worker_id` in ## 12. Multi-experiment deployments -Each task-store-server instance serves exactly one `--experiment-id`. The Compose stack is single-experiment by construction (one `EDEN_EXPERIMENT_ID` in `.env`). +A single task-store-server URL serves many experiments — the wire path is `/v0/experiments/{id}/…`, so the `{id}` segment selects the experiment per call. With a **control plane** configured (`--control-plane-url`), the Web UI lets one deployment register and switch between multiple experiments. -To run two experiments side-by-side: +### 12.1 The experiment switcher + +Register experiments on the cross-experiment dashboard at `/admin/experiments/`, then pick the active one from the **top-nav switcher dropdown** (present on every page when a control plane is configured). The switcher shows `Active: ` (or `Default: ` before you've selected one). Selecting an experiment is load-bearing: every per-experiment page — ideator, executor, evaluator, `/admin/tasks`, `/admin/variants`, `/admin/workers`, `/admin/groups`, `/admin/work-refs`, … — now reads that experiment's data, not just a relabelled header. + +Notes: + +- The selection is **per session**, not per tab. Switching in one tab changes the other tab's data on its next request. +- If you switch experiments while a draft form is open and then submit it, the submission is **discarded** (a banner explains why) rather than written to the experiment you switched to. Re-enter it under the now-active experiment. 
+- Selecting an experiment the control plane no longer knows about clears the stale selection and returns you to the dashboard. +- An experiment registered on the dashboard but not yet seeded (no `setup-experiment` / checkpoint-import run for it) renders an "initialize me" page rather than empty data. + +Operationally the web-ui needs a worker credential in each experiment it talks to; with the deployment admin token present at runtime it mints these on first switch. See [`docs/operations/web-ui-multi-experiment.md`](operations/web-ui-multi-experiment.md) for the credential-bootstrap postures and the per-experiment config / repo layout. + +### 12.2 Separate stacks (isolation) + +For hard isolation (separate Postgres, Forgejo, secrets) run two Compose projects instead: 1. Use distinct `COMPOSE_PROJECT_NAME` values (e.g. `eden-exp-a` and `eden-exp-b`). 2. Override host ports so they don't collide: `POSTGRES_HOST_PORT`, `FORGEJO_HOST_PORT`, `FORGEJO_SSH_HOST_PORT`, `WEB_UI_HOST_PORT`. 3. Use distinct `--env-file` paths so each project has its own secrets. 4. Build the shared image once and re-use it across projects (Compose's image cache makes this automatic). - -Native multi-experiment support is Phase 12 work (see [`docs/roadmap.md`](roadmap.md)). diff --git a/reference/compose/compose.yaml b/reference/compose/compose.yaml index 5f10cc0c..d225c6d8 100644 --- a/reference/compose/compose.yaml +++ b/reference/compose/compose.yaml @@ -438,12 +438,29 @@ services: - ${EDEN_ADMIN_TOKEN:?} - --experiment-config - /etc/eden/experiment-config.yaml + # Issue #145: per-experiment config + credential dirs for the + # experiment switcher. --experiment-config is the deployment + # default; non-default experiments load /.yaml from the + # config dir. --credentials-dir keeps the web-ui's per-experiment + # worker tokens on the mounted credentials volume (the new + # resolve_credential_dir otherwise falls back to an in-container + # XDG path that is not persisted). 
+ - --experiment-config-dir + - /var/lib/eden/web-ui-configs + - --credentials-dir + - /var/lib/eden/credentials - --session-secret - ${EDEN_SESSION_SECRET:?EDEN_SESSION_SECRET must be set (run setup-experiment)} - --artifacts-dir - /var/lib/eden/artifacts - --repo-path - /var/lib/eden/repo + # Issue #145: durable per-experiment clone dir for the switcher + # (non-default experiments clone /.git here). Bind- + # mounted below — the parent of --repo-path is the container + # filesystem and would not survive a restart. + - --repo-root + - /var/lib/eden/web-ui-repos - --forgejo-url - ${FORGEJO_REMOTE_URL:?} - --credential-helper @@ -477,6 +494,10 @@ services: - ${EDEN_EXPERIMENT_DATA_ROOT:?}/web-ui-repo:/var/lib/eden/repo - ${EDEN_EXPERIMENT_DATA_ROOT:?}/artifacts:/var/lib/eden/artifacts - ${EDEN_EXPERIMENT_DATA_ROOT:?}/credentials/web-ui:/var/lib/eden/credentials + # Issue #145: per-experiment config YAMLs + per-experiment clones + # for the switcher. + - ${EDEN_EXPERIMENT_DATA_ROOT:?}/web-ui-configs:/var/lib/eden/web-ui-configs + - ${EDEN_EXPERIMENT_DATA_ROOT:?}/web-ui-repos:/var/lib/eden/web-ui-repos - ${EDEN_FORGEJO_CREDS_DIR_HOST:?}/credential-helper.sh:/etc/eden/credential-helper.sh:ro # Issue #109: per-service log bind-mount. 
- ${EDEN_EXPERIMENT_DATA_ROOT:?}/logs/web-ui:/var/lib/eden/logs diff --git a/reference/scripts/setup-experiment/setup-experiment.sh b/reference/scripts/setup-experiment/setup-experiment.sh index 56bb76a2..23287750 100755 --- a/reference/scripts/setup-experiment/setup-experiment.sh +++ b/reference/scripts/setup-experiment/setup-experiment.sh @@ -349,6 +349,8 @@ mkdir -p \ "${EDEN_EXPERIMENT_DATA_ROOT}/credentials/executor" \ "${EDEN_EXPERIMENT_DATA_ROOT}/credentials/evaluator" \ "${EDEN_EXPERIMENT_DATA_ROOT}/credentials/web-ui" \ + "${EDEN_EXPERIMENT_DATA_ROOT}/web-ui-configs" \ + "${EDEN_EXPERIMENT_DATA_ROOT}/web-ui-repos" \ "${EDEN_EXPERIMENT_DATA_ROOT}/logs/task-store-server" \ "${EDEN_EXPERIMENT_DATA_ROOT}/logs/orchestrator" \ "${EDEN_EXPERIMENT_DATA_ROOT}/logs/ideator-host" \ @@ -382,6 +384,8 @@ if ! chmod 0777 \ "${EDEN_EXPERIMENT_DATA_ROOT}/credentials/executor" \ "${EDEN_EXPERIMENT_DATA_ROOT}/credentials/evaluator" \ "${EDEN_EXPERIMENT_DATA_ROOT}/credentials/web-ui" \ + "${EDEN_EXPERIMENT_DATA_ROOT}/web-ui-configs" \ + "${EDEN_EXPERIMENT_DATA_ROOT}/web-ui-repos" \ "${EDEN_EXPERIMENT_DATA_ROOT}/logs" \ "${EDEN_EXPERIMENT_DATA_ROOT}/logs/task-store-server" \ "${EDEN_EXPERIMENT_DATA_ROOT}/logs/orchestrator" \ @@ -473,6 +477,13 @@ fi # --- Copy experiment config into compose dir --- cp "$CONFIG_PATH" "${COMPOSE_DIR}/experiment-config.yaml" +# Issue #145: also drop the config into the web-ui's per-experiment +# config dir as .yaml. The web-ui's experiment switcher +# loads non-default experiments' objective / evaluation_schema from this +# dir (the deployment default still reads --experiment-config). Idempotent +# overwrite; running setup-experiment per experiment populates the dir. +cp "$CONFIG_PATH" "${EDEN_EXPERIMENT_DATA_ROOT}/web-ui-configs/${EXPERIMENT_ID}.yaml" + # --- Write the partial .env (no EDEN_BASE_COMMIT_SHA yet) --- # Postgres DSN points at the in-network postgres service hostname. 
# Percent-encode the password so user-supplied passwords containing diff --git a/reference/services/web-ui/src/eden_web_ui/app.py b/reference/services/web-ui/src/eden_web_ui/app.py index 3a7389a7..1bec1683 100644 --- a/reference/services/web-ui/src/eden_web_ui/app.py +++ b/reference/services/web-ui/src/eden_web_ui/app.py @@ -8,14 +8,15 @@ from __future__ import annotations -from collections.abc import Callable +from collections.abc import AsyncIterator, Callable +from contextlib import asynccontextmanager from datetime import UTC, datetime from pathlib import Path +from typing import Any from eden_contracts import ExperimentConfig from eden_control_plane import ControlPlaneClient from eden_git import GitRepo -from eden_storage import Store from eden_storage.errors import NotFound as StorageNotFound from fastapi import FastAPI, Request from fastapi.responses import HTMLResponse @@ -23,6 +24,7 @@ from fastapi.templating import Jinja2Templates from .middleware import AdminGateMiddleware +from .repo_factory import RepoMaterializer from .routes import admin as admin_routes from .routes import admin_artifacts as admin_artifacts_routes from .routes import admin_experiments as admin_experiments_routes @@ -34,8 +36,10 @@ from .routes import executor as executor_routes from .routes import ideator as ideator_routes from .routes import index as index_routes +from .routes._helpers import switcher_context from .routes.admin import control as admin_control_routes from .sessions import SessionCodec +from .store_factory import StaticStoreFactory, StoreFactory _TEMPLATES_DIR = Path(__file__).parent / "templates" _STATIC_DIR = Path(__file__).parent / "static" @@ -48,6 +52,26 @@ def _now() -> datetime: return _now +def _experiment_context(request: Request) -> dict[str, Any]: + """Template context processor: per-request active ``experiment_id``. 
+ + Issue #145 moves ``experiment_id`` from a render-time Jinja global + (one value for the process lifetime) to a per-request value, so + every template reflects the experiment the operator selected. The + active id is stashed on ``request.state.active_experiment_id`` by + ``resolve_active_context``; absent that (unauthenticated pages, + handlers that don't resolve), it falls back to the deployment + default. + """ + active = getattr(request.state, "active_experiment_id", None) + default = request.app.state.experiment_id + return { + "experiment_id": active or default, + "active_experiment_id": active, + "default_experiment_id": default, + } + + def _register_routers( app: FastAPI, *, @@ -72,9 +96,9 @@ def _register_routers( def make_app( *, - store: Store, + store_factory: StoreFactory | StaticStoreFactory, experiment_id: str, - experiment_config: ExperimentConfig, + experiment_config: ExperimentConfig | None, worker_id: str, session_secret: str, claim_ttl_seconds: int, @@ -82,36 +106,58 @@ def make_app( secure_cookies: bool = False, now: Callable[[], datetime] | None = None, repo: GitRepo | None = None, + repo_materializer: RepoMaterializer | None = None, clone_url: str | None = None, base_commit_sha: str | None = None, - admin_store: Store | None = None, control_plane: ControlPlaneClient | None = None, + experiment_config_dir: Path | None = None, ) -> FastAPI: """Construct the FastAPI app. ``now`` is injected so tests can pin time deterministically. ``repo`` gates the executor module (None → routes unregistered). - ``admin_store`` is a separate ``Store`` bearing the deployment - admin credential for admin-gated wire ops; ``None`` → mutation - controls render disabled and POSTs 303 to ``?error=admin-disabled`` - (plan §D.3 four-posture matrix). + + Issue #145: per-experiment routing goes through ``store_factory``. 
+ The CLI builds a live :class:`StoreFactory`; tests build a + :class:`StaticStoreFactory` (one pre-built store for the single + deployment experiment). When the factory's admin view is ``None`` + (no admin token configured), mutation controls render disabled and + admin POSTs 303 to ``?error=admin-disabled`` (plan §D.3 four-posture + matrix). ``experiment_config`` is the deployment-default config (and + the single-experiment config source); ``experiment_config_dir`` + enables per-experiment config loading in control-plane mode (plan + Decision 6). """ - app = FastAPI(title="EDEN reference Web UI", version="0.0.1") + + @asynccontextmanager + async def _lifespan(_app: FastAPI) -> AsyncIterator[None]: + try: + yield + finally: + store_factory.close() + + app = FastAPI( + title="EDEN reference Web UI", version="0.0.1", lifespan=_lifespan + ) app.mount( "/static", StaticFiles(directory=str(_STATIC_DIR)), name="static", ) - templates = Jinja2Templates(directory=str(_TEMPLATES_DIR)) - templates.env.globals["experiment_id"] = experiment_id + templates = Jinja2Templates( + directory=str(_TEMPLATES_DIR), + context_processors=[_experiment_context, switcher_context], + ) templates.env.globals["executor_enabled"] = repo is not None - templates.env.globals["admin_enabled"] = admin_store is not None + templates.env.globals["admin_enabled"] = store_factory.admin_enabled templates.env.globals["control_plane_enabled"] = control_plane is not None - app.state.store = store - app.state.admin_store = admin_store + app.state.store_factory = store_factory app.state.experiment_id = experiment_id + app.state.experiments_cache = {} app.state.experiment_config = experiment_config + app.state.experiment_config_dir = experiment_config_dir + app.state.experiment_config_cache = {} app.state.worker_id = worker_id app.state.session_codec = SessionCodec(session_secret) app.state.claim_ttl_seconds = claim_ttl_seconds @@ -120,12 +166,21 @@ def make_app( app.state.now = now or _now_factory() 
app.state.templates = templates app.state.repo = repo + app.state.repo_materializer = repo_materializer app.state.clone_url = clone_url app.state.base_commit_sha = base_commit_sha app.state.control_plane = control_plane app.add_middleware(AdminGateMiddleware) _register_routers(app, control_plane=control_plane, repo=repo) + _install_healthz_and_error_handlers(app, templates) + return app + + +def _install_healthz_and_error_handlers( + app: FastAPI, templates: Jinja2Templates +) -> None: + """Wire the unauthenticated healthcheck + the 404 / NotFound pages.""" @app.get("/healthz", include_in_schema=False) async def _healthz() -> dict[str, str]: @@ -151,5 +206,3 @@ async def _storage_not_found(request: Request, exc: StorageNotFound) -> HTMLResp {"title": "Not found", "message": str(exc)}, status_code=404, ) - - return app diff --git a/reference/services/web-ui/src/eden_web_ui/cli.py b/reference/services/web-ui/src/eden_web_ui/cli.py index d543ab08..d2b385ba 100644 --- a/reference/services/web-ui/src/eden_web_ui/cli.py +++ b/reference/services/web-ui/src/eden_web_ui/cli.py @@ -17,6 +17,7 @@ from pathlib import Path from typing import Any +import httpx import uvicorn from eden_contracts import ExperimentConfig from eden_control_plane import ControlPlaneClient @@ -27,20 +28,38 @@ load_experiment_config, parse_log_level, resolve_admin_token, - resolve_worker_bearer, wait_for_task_store, ) from eden_service_common.logging import configure_logging -from eden_wire import StoreClient +from eden_storage import Store from eden_wire.errors import Unauthorized +from eden_wire.errors import WireError as ControlPlaneWireError from .app import make_app +from .credentials import bootstrap_control_plane_credential, resolve_credential_dir +from .repo_factory import RepoMaterializer +from .store_factory import BearerCache, StoreFactory def _build_control_plane_client( - args: argparse.Namespace, *, admin_token: str | None + args: argparse.Namespace, + *, + admin_token: str | None, + 
credential_dir: Path, + log: Any, ) -> ControlPlaneClient | None: - """Construct the optional ControlPlaneClient from CLI flags.""" + """Construct the optional ControlPlaneClient from CLI flags. + + Issue #145 §3.2: the switcher's reads (``list_experiments`` / + ``read_experiment_metadata``) accept any authenticated principal. To + keep "no admin token at runtime, but switcher still works" viable + (Posture C), the web-ui bootstraps a deployment-scoped control-plane + worker credential (persisted under ``/control-plane/``) + when an admin token is available, and reuses the persisted credential + on later boots. When neither an admin token nor a persisted + credential is available (Posture D), the switcher reads will fail; a + startup warning surfaces it rather than silently degrading. + """ url = args.control_plane_url if url is None: return None @@ -49,8 +68,42 @@ def _build_control_plane_client( or os.environ.get("EDEN_CONTROL_PLANE_ADMIN_TOKEN") or admin_token ) - bearer = f"admin:{cp_token}" if cp_token else None - return ControlPlaneClient(url, bearer=bearer) + cp_worker_id = args.control_plane_worker_id or f"{args.worker_id}-cp" + try: + credential = bootstrap_control_plane_credential( + base_url=url, + worker_id=cp_worker_id, + credential_dir=credential_dir, + admin_token=cp_token, + ) + return ControlPlaneClient(url, bearer=credential.bearer) + except RuntimeError as exc: + # No admin token AND no persisted control-plane credential. + if cp_token is not None: + # Defensive: an admin token was available but bootstrap + # still failed; fall back to the admin bearer so the + # dashboard keeps working. 
+ return ControlPlaneClient(url, bearer=f"admin:{cp_token}") + log.warning( + "control_plane_url set but no admin token and no persisted " + "web-ui credential; switcher reads will fail (Posture D)", + error=str(exc), + ) + return ControlPlaneClient(url, bearer=None) + except (httpx.TransportError, ControlPlaneWireError) as exc: + # Control-plane unreachable / rejected the bootstrap at startup. + # Treat as a degraded runtime posture (Posture D — banners + + # hidden switcher) rather than a hard service-start dependency; + # the dashboard's per-request reads surface the failure, and the + # switcher hides on the cold-cache read error. + log.warning( + "control-plane credential bootstrap failed; the cross-experiment " + "switcher/dashboard will be degraded until the control plane is " + "reachable", + error=f"{exc.__class__.__name__}: {exc}", + ) + bearer = f"admin:{cp_token}" if cp_token is not None else None + return ControlPlaneClient(url, bearer=bearer) # slop-allow: argparse builder; one add_argument per CLI flag with no @@ -60,12 +113,17 @@ def parse_args(argv: list[str] | None = None) -> argparse.Namespace: add_common_arguments(parser) parser.add_argument( "--experiment-config", - required=True, + required=False, + default=None, help=( - "YAML experiment-config file — read for objective and " - "evaluation_schema. Drift between this file and the " - "task-store-server's copy is a known reference-impl " - "limitation; Phase 12's control plane fixes it." + "YAML experiment-config file for the deployment-default " + "experiment — read for objective and evaluation_schema. " + "Required in single-experiment mode (no --control-plane-url). " + "In control-plane mode it is optional when " + "--experiment-config-dir is set (the default experiment then " + "loads from /.yaml like every other). " + "Operators who hand-edit this file do NOT affect non-default " + "experiments — each experiment's config is independent (#145)." 
), ) parser.add_argument("--host", default="127.0.0.1") @@ -119,7 +177,22 @@ def parse_args(argv: list[str] | None = None) -> argparse.Namespace: "Optional: when set, the executor module is registered " "and the user can claim execution tasks via the UI; when " "omitted, the executor module is not available and the " - "/executor/* routes return 404." + "/executor/* routes return 404. This is the deployment-default " + "experiment's clone; non-default experiments clone under " + "--repo-root (issue #145)." + ), + ) + parser.add_argument( + "--repo-root", + type=Path, + default=None, + help=( + "Directory holding per-experiment bare clones " + "(/.git) for non-default experiments " + "in control-plane mode (issue #145 §3.5). Must be a DURABLE " + "location (in Compose, a bind-mounted volume — the parent of " + "--repo-path is the container filesystem and would not " + "survive). Defaults to the parent of --repo-path when unset." ), ) parser.add_argument( @@ -188,6 +261,40 @@ def parse_args(argv: list[str] | None = None) -> argparse.Namespace: "a single admin secret across the two services)." ), ) + parser.add_argument( + "--control-plane-worker-id", + default=None, + help=( + "worker_id for the deployment-scoped control-plane credential " + "the switcher uses for read calls (issue #145 §3.2). Defaults " + "to '<--worker-id>-cp'. Only used when --control-plane-url is set." + ), + ) + parser.add_argument( + "--credential-dir", + default=None, + help=( + "Directory for the web-ui's per-experiment + control-plane " + "credentials (issue #145). One '/.token' " + "per experiment, plus 'control-plane/.token'. " + "Defaults to $EDEN_CREDENTIAL_DIR, then " + "${XDG_STATE_HOME:-~/.local/state}/eden/web-ui/." + ), + ) + parser.add_argument( + "--experiment-config-dir", + default=None, + type=Path, + help=( + "Directory of per-experiment '.yaml' configs " + "(issue #145 Decision 6). 
When the operator switches to a " + "non-default experiment in control-plane mode, the web-ui loads " + "that experiment's objective / evaluation_schema from this dir. " + "The deployment-default still uses --experiment-config. Operators " + "who hand-edit --experiment-config do NOT affect non-default " + "experiments — each experiment's config is independent." + ), + ) return parser.parse_args(argv) @@ -221,10 +328,10 @@ def _wait_and_announce() -> None: class _WebUIRuntime: """Constructed runtime objects ready for :func:`make_app` + uvicorn.""" - config: ExperimentConfig - store: StoreClient - admin_store: StoreClient | None + config: ExperimentConfig | None + store_factory: StoreFactory repo: GitRepo | None + repo_materializer: RepoMaterializer | None control_plane: ControlPlaneClient | None @@ -252,32 +359,44 @@ def _materialize_repo(args: argparse.Namespace) -> GitRepo | None: return repo -def _build_admin_store( +def _build_repo_materializer( args: argparse.Namespace, - *, - admin_token: str | None, - log: Any, - store: StoreClient, -) -> StoreClient | None | int: - """Construct + auth-probe the optional admin StoreClient. - - Returns None when no admin token is configured, the admin - StoreClient on success, or exit code 1 when the admin bearer is - rejected (after closing ``store`` to avoid a leak). +) -> RepoMaterializer | None: + """Materializer for non-default experiments' integrator clones (#145 §3.5). + + Per-experiment bare clones live under ``--repo-root`` (a durable, + operator-supplied directory), at ``/.git``; + ``--repo-root`` defaults to the parent of ``--repo-path`` when unset + (fine for local/dev, but in Compose it MUST be a bind-mounted volume — + the container filesystem parent would not survive a restart). Returns + ``None`` when the executor module is disabled (no ``--repo-path``). 
+ The deployment-default experiment keeps using the flat ``--repo-path`` + clone via ``app.state.repo``; the materializer is consulted only for + non-default experiments (control-plane mode). """ - if admin_token is None: + if args.repo_path is None: return None - admin_store = StoreClient( - base_url=args.task_store_url, - experiment_id=args.experiment_id, - bearer=f"admin:{admin_token}", + repo_root = args.repo_root if args.repo_root is not None else args.repo_path.parent + return RepoMaterializer( + repo_root=repo_root, + forgejo_url=args.forgejo_url, + credential_helper=args.credential_helper, ) - # Validate the admin bearer at startup; a stale or wrong - # token would otherwise surface only when the operator - # tried to register a worker, as an opaque "transport" - # banner (plan §8.1 risk note). list_workers is either- - # gated, so the call succeeds with the admin bearer when - # the bearer parses cleanly and 401s otherwise. + + +def _validate_admin_store( + admin_store: Store | None, *, log: Any +) -> bool: + """Probe the admin bearer at startup; return False if rejected. + + A stale or wrong admin token would otherwise surface only when the + operator tried to register a worker, as an opaque "transport" banner + (plan §8.1 risk note). ``list_workers`` is either-gated, so the call + succeeds with the admin bearer when it parses cleanly and 401s + otherwise. + """ + if admin_store is None: + return True try: admin_store.list_workers() except Unauthorized: @@ -286,27 +405,67 @@ def _build_admin_store( "--admin-token / $EDEN_ADMIN_TOKEN matches the " "task-store-server's --admin-token" ) - with contextlib.suppress(Exception): - store.close() - with contextlib.suppress(Exception): - admin_store.close() + return False + return True + + +def _resolve_default_config( + args: argparse.Namespace, log: Any +) -> ExperimentConfig | None | int: + """Resolve the deployment-default experiment's config (issue #145 Decision 6). 
+ + Returns the parsed config, ``None`` (control-plane mode with no + ``--experiment-config`` — the default experiment then loads from + ``<--experiment-config-dir>/.yaml`` like every other), or exit + code 1 on a misconfiguration. Single-experiment mode requires + ``--experiment-config``; control-plane mode requires at least one of + ``--experiment-config`` / ``--experiment-config-dir``. + """ + has_control_plane = args.control_plane_url is not None + if args.experiment_config is not None: + # The deployment-default experiment ALWAYS resolves to this config + # (active_config's default-experiment branch returns + # app.state.experiment_config and never reads the config-dir copy), + # so --experiment-config is authoritative for the default and a + # divergent /.yaml is harmless — we do NOT + # fail on a mismatch. (Compose legitimately produces one: smoke + # appends fields to the mounted config after setup-experiment has + # already copied the pre-append config into the config-dir.) + return load_experiment_config(args.experiment_config) + if not has_control_plane: + log.error( + "--experiment-config is required in single-experiment mode " + "(no --control-plane-url)" + ) + return 1 + if args.experiment_config_dir is None: + log.error( + "control-plane mode requires --experiment-config or " + "--experiment-config-dir" + ) return 1 - return admin_store + return None def _build_runtime( args: argparse.Namespace, log: Any ) -> _WebUIRuntime | int: - """Build the web-ui's wire-side runtime; return exit code 1 on auth rejection.""" - config = load_experiment_config(args.experiment_config) + """Build the web-ui's wire-side runtime; return exit code 1 on auth rejection. + + Issue #145: the runtime is built around a :class:`StoreFactory` that + vends per-experiment ``StoreClient`` views against one task-store URL + over a shared ``httpx.Client``. 
The deployment-default experiment's + worker + admin views are vended (and auth-probed) at startup so the + no-selection / single-experiment posture is validated exactly as + before; non-default experiments are JIT-credentialed on first switch. + """ + config = _resolve_default_config(args, log) + if isinstance(config, int): + return config log.info("waiting_for_task_store", url=args.task_store_url) # The readiness probe accepts 200/401/403 ("server is up") so the - # web-ui can run before it has its per-worker credential. The - # bootstrap below registers / verifies / reissues against the - # admin bearer. Without this preflight, a direct launch where - # the task-store is still binding would surface as a confusing - # connection failure from inside resolve_worker_bearer rather - # than a clean readiness timeout. + # web-ui can run before it has its per-worker credential. The factory + # registers / verifies / reissues against the admin bearer below. wait_for_task_store( base_url=args.task_store_url, experiment_id=args.experiment_id, @@ -314,21 +473,28 @@ def _build_runtime( deadline_seconds=args.startup_timeout, ) repo = _materialize_repo(args) - bearer = resolve_worker_bearer( - args, worker_id=args.worker_id, labels={"role": "web-ui"} + admin_token = resolve_admin_token(args) + credential_dir = resolve_credential_dir(args) + shared_client = httpx.Client(timeout=30.0) + bearer_cache = BearerCache( + base_url=args.task_store_url, + worker_id=args.worker_id, + credential_dir=credential_dir, + admin_token=admin_token, ) - store = StoreClient( + store_factory = StoreFactory( base_url=args.task_store_url, - experiment_id=args.experiment_id, - bearer=bearer, + bearer_cache=bearer_cache, + admin_token=admin_token, + shared_client=shared_client, ) - # Posture-D guard (plan §D.3): if the task-store is auth-enabled - # and the worker bearer doesn't authenticate, fail fast at - # startup rather than running a silently-broken service whose - # every wire call will 401. 
We probe /whoami because it's the - # authenticated ping op that requires a worker bearer. - # Auth-disabled task-stores return "anonymous" — also fine. + + # Vend + auth-probe the deployment-default experiment's worker view. + # Posture-D guard (plan §D.3): fail fast if the bearer doesn't + # authenticate rather than running a silently-broken service. try: + store = store_factory.for_experiment(args.experiment_id, role="worker") + assert store is not None # worker role never returns None store.whoami() except Unauthorized: log.error( @@ -337,23 +503,26 @@ def _build_runtime( "or persist a worker credential via the admin module's " "reissue-credential endpoint" ) - with contextlib.suppress(Exception): - store.close() + store_factory.close() + return 1 + except (RuntimeError, httpx.TransportError) as exc: + log.error("failed to bootstrap default-experiment credential", error=str(exc)) + store_factory.close() return 1 - admin_token = resolve_admin_token(args) - admin_store = _build_admin_store( - args, admin_token=admin_token, log=log, store=store - ) - if isinstance(admin_store, int): - return admin_store + admin_store = store_factory.for_experiment(args.experiment_id, role="admin") + if not _validate_admin_store(admin_store, log=log): + store_factory.close() + return 1 - control_plane = _build_control_plane_client(args, admin_token=admin_token) + control_plane = _build_control_plane_client( + args, admin_token=admin_token, credential_dir=credential_dir, log=log + ) return _WebUIRuntime( config=config, - store=store, - admin_store=admin_store, + store_factory=store_factory, repo=repo, + repo_materializer=_build_repo_materializer(args), control_plane=control_plane, ) @@ -371,16 +540,17 @@ def main(argv: list[str] | None = None) -> int: return runtime app = make_app( - store=runtime.store, - admin_store=runtime.admin_store, + store_factory=runtime.store_factory, experiment_id=args.experiment_id, experiment_config=runtime.config, + 
experiment_config_dir=args.experiment_config_dir, worker_id=args.worker_id, session_secret=args.session_secret, claim_ttl_seconds=args.claim_ttl_seconds, artifacts_dir=args.artifacts_dir, secure_cookies=args.secure_cookies, repo=runtime.repo, + repo_materializer=runtime.repo_materializer, clone_url=args.clone_url, base_commit_sha=args.base_commit_sha, control_plane=runtime.control_plane, @@ -408,9 +578,12 @@ def _stop(*_: Any) -> None: try: server.run() finally: + # The factory owns the shared httpx.Client every vended store + # rides on; closing it tears down all per-experiment views. The + # lifespan shutdown hook also calls this — close() is idempotent. with contextlib.suppress(Exception): - runtime.store.close() - if runtime.admin_store is not None: + runtime.store_factory.close() + if runtime.control_plane is not None: with contextlib.suppress(Exception): - runtime.admin_store.close() + runtime.control_plane.close() return 0 diff --git a/reference/services/web-ui/src/eden_web_ui/credentials.py b/reference/services/web-ui/src/eden_web_ui/credentials.py new file mode 100644 index 00000000..c6983b22 --- /dev/null +++ b/reference/services/web-ui/src/eden_web_ui/credentials.py @@ -0,0 +1,192 @@ +"""Web-ui credential-directory resolution + control-plane bootstrap. + +Two credential domains matter to the multi-experiment web-ui (issue +#145 §3.2): + +1. **Per-experiment worker credentials** — one ``.token`` per + experiment, bootstrapped JIT against the *task-store-server* by + :class:`eden_web_ui.store_factory.BearerCache`, which reuses + :func:`eden_service_common.auth.bootstrap_worker_credential` verbatim. + This module does NOT touch that path. + +2. **The deployment-scoped control-plane credential** — one long-lived + worker the ``ControlPlaneClient`` uses for its read calls + (``list_experiments`` / ``read_experiment_metadata``), so the + switcher keeps working after the operator rotates the admin token + out of the runtime environment (Posture C). 
The control plane is a + different transport than the task-store-server, so + ``bootstrap_worker_credential`` cannot be reused; this module + provides :func:`bootstrap_control_plane_credential`, which mirrors + its register / verify / reissue shape against the + :class:`~eden_control_plane.ControlPlaneClient`. The lock and + atomic-write disciplines ARE reused from ``eden_service_common.auth`` + (never reimplemented). Those two reads gate on *any* authenticated + principal (``_get_principal`` — any registered worker bearer or the + admin token), not membership in a specific group, so the bootstrap + only needs to register the worker; no group-add is required. + +Credential layout under ```` (resolved by +:func:`resolve_credential_dir`):: + + / + control-plane/.token # deployment-scoped + control-plane/.token.lock + /.token # per-experiment (BearerCache) + /.token.lock +""" + +from __future__ import annotations + +import argparse +import os +from pathlib import Path + +from eden_control_plane import ControlPlaneClient +from eden_service_common.auth import ( + WorkerCredential, + _bootstrap_lock, + _read_token, + _write_token, + credential_path, +) +from eden_wire import Unauthorized +from eden_wire.errors import WireError as ControlPlaneWireError + +CONTROL_PLANE_SCOPE = "control-plane" +"""Subdirectory under the credential dir holding the deployment-scoped +control-plane worker credential.""" + + +def default_credential_dir() -> Path: + """The XDG-based default for the web-ui credential directory.""" + state_home = os.environ.get("XDG_STATE_HOME") + base = Path(state_home) if state_home else Path.home() / ".local" / "state" + return base / "eden" / "web-ui" + + +def resolve_credential_dir(args: argparse.Namespace) -> Path: + """Resolve the base credential dir for the web-ui's per-experiment tokens. + + Precedence (first set wins): + + 1. ``--credential-dir`` / ``$EDEN_CREDENTIAL_DIR`` — the web-ui-specific + override (issue #145). + 2. 
``--credentials-dir`` / ``$EDEN_WORKER_CREDENTIALS_DIR`` — the common + worker-host credential dir. The web-ui IS a worker host, so it shares + this dir with the other reference hosts by default; honoring it keeps + deployments (and the isolated per-test dirs the e2e suite passes) on a + single credential location. Per-experiment tokens live under + ``//.token``. + 3. ``${XDG_STATE_HOME:-~/.local/state}/eden/web-ui/`` — the final fallback + when no credential dir is configured at all. + """ + candidates = ( + getattr(args, "credential_dir", None), + os.environ.get("EDEN_CREDENTIAL_DIR"), + getattr(args, "credentials_dir", None), + os.environ.get("EDEN_WORKER_CREDENTIALS_DIR"), + ) + for candidate in candidates: + if candidate: + return Path(candidate) + return default_credential_dir() + + +def bootstrap_control_plane_credential( + *, + base_url: str, + worker_id: str, + credential_dir: Path, + admin_token: str | None, + timeout: float = 30.0, +) -> WorkerCredential: + """Return the deployment-scoped control-plane worker credential. + + Mirrors :func:`eden_service_common.auth.bootstrap_worker_credential` + but against the control plane (chapter 11 §6): + + 1. If a persisted ``/control-plane/.token`` + verifies via ``GET /v0/control/whoami``, use it. + 2. If persisted but stale (401), reissue with ``admin_token``. + 3. If absent, register with ``admin_token`` (``POST + /v0/control/workers``) and persist the issued token. + + The switcher's reads (``list_experiments`` / + ``read_experiment_metadata``) accept any authenticated principal, so + no group membership is needed beyond a successful registration. + + Raises ``RuntimeError`` when a register / reissue is required but no + admin token is available — the caller surfaces this as a startup + warning (Posture D: switcher hidden) rather than a hard failure. 
+ """ + scope_dir = credential_dir / CONTROL_PLANE_SCOPE + path = credential_path(scope_dir, worker_id) + with _bootstrap_lock(scope_dir, worker_id): + persisted = _read_token(path) + if persisted is not None: + bearer = f"{worker_id}:{persisted}" + if _verify_control_plane(base_url, bearer, worker_id, timeout): + return WorkerCredential(worker_id=worker_id, token=persisted) + if admin_token is None: + raise RuntimeError( + f"persisted control-plane credential for worker_id=" + f"{worker_id!r} is stale; reissue requires the admin token" + ) + new_token = _reissue_control_plane( + base_url, admin_token, worker_id, timeout + ) + _write_token(path, new_token) + return WorkerCredential(worker_id=worker_id, token=new_token) + + if admin_token is None: + raise RuntimeError( + f"no persisted control-plane credential for worker_id=" + f"{worker_id!r} at {path}; registration requires the admin token" + ) + token = _register_control_plane(base_url, admin_token, worker_id, timeout) + _write_token(path, token) + return WorkerCredential(worker_id=worker_id, token=token) + + +def _verify_control_plane( + base_url: str, bearer: str, worker_id: str, timeout: float +) -> bool: + with ControlPlaneClient(base_url, bearer=bearer, timeout=timeout) as probe: + try: + return probe.whoami() == worker_id + except Unauthorized: + return False + except ControlPlaneWireError: + return False + + +def _reissue_control_plane( + base_url: str, admin_token: str, worker_id: str, timeout: float +) -> str: + with ControlPlaneClient( + base_url, bearer=f"admin:{admin_token}", timeout=timeout + ) as admin: + data = admin.reissue_credential(worker_id) + token = data.get("registration_token") + if not isinstance(token, str): + raise RuntimeError("control-plane reissue_credential returned no token") + return token + + +def _register_control_plane( + base_url: str, admin_token: str, worker_id: str, timeout: float +) -> str: + with ControlPlaneClient( + base_url, bearer=f"admin:{admin_token}", 
timeout=timeout + ) as admin: + data = admin.register_worker(worker_id, labels={"role": "web-ui"}) + token = data.get("registration_token") + if token is None: + # Idempotent re-register hit an existing row with no fresh + # token; the only recovery is reissue (mirrors the + # task-store bootstrap's idempotent-register branch). + reissued = admin.reissue_credential(worker_id) + token = reissued.get("registration_token") + if not isinstance(token, str): + raise RuntimeError("control-plane register_worker returned no token") + return token diff --git a/reference/services/web-ui/src/eden_web_ui/middleware.py b/reference/services/web-ui/src/eden_web_ui/middleware.py index f778c1d6..734f16fd 100644 --- a/reference/services/web-ui/src/eden_web_ui/middleware.py +++ b/reference/services/web-ui/src/eden_web_ui/middleware.py @@ -24,7 +24,7 @@ from starlette.requests import Request from starlette.responses import RedirectResponse, Response -from .routes._helpers import get_session +from .routes._helpers import ActiveContext, get_session, resolve_active_context ADMINS_GROUP_ID = "admins" @@ -33,6 +33,21 @@ def _is_admin_path(path: str) -> bool: return path == "/admin" or path.startswith("/admin/") +# Deployment-scoped admin pages operate against the control plane, not a +# per-experiment store (the cross-experiment dashboard + the deployment +# worker/group registries). They are also the redirect target for +# active-experiment resolution failures, so they MUST NOT be subject to +# per-experiment resolution themselves — otherwise a stale/unreachable +# selection would redirect the dashboard to itself in a loop. Their +# admins-group gate runs against the deployment-default experiment, +# preserving the pre-#145 behavior. 
+_DEPLOYMENT_SCOPED_ADMIN_PREFIXES = ("/admin/experiments", "/admin/control") + + +def _is_deployment_scoped_admin_path(path: str) -> bool: + return any(path.startswith(prefix) for prefix in _DEPLOYMENT_SCOPED_ADMIN_PREFIXES) + + def _forbidden_response(request: Request) -> Response: return request.app.state.templates.TemplateResponse( request, @@ -73,11 +88,17 @@ class AdminGateMiddleware(BaseHTTPMiddleware): 1. Non-``/admin`` paths pass through untouched. 2. Missing or invalid session → 303 redirect to ``/signin`` (matches the per-handler behavior we are otherwise replacing). - 3. ``Store.resolve_worker_in_group(worker_id, "admins")`` raises → + 3. The active experiment cannot be resolved (stale selection, + control-plane / task-store unreachable, or a missing credential) + → the dashboard redirect / unseeded page that + ``resolve_active_context`` returns. The admins-group membership + check runs against the ACTIVE experiment's worker store, so the + gate follows the operator's experiment selection (issue #145). + 4. ``Store.resolve_worker_in_group(worker_id, "admins")`` raises → 502 ``_error.html`` (same shape as the chunk-9e dashboard read failures; the operator refreshes to retry). - 4. Membership returns ``False`` → 403 ``_error.html``. - 5. Membership returns ``True`` → request proceeds. + 5. Membership returns ``False`` → 403 ``_error.html``. + 6. Membership returns ``True`` → request proceeds. The membership check adds one extra wire round-trip per ``/admin/*`` page load when the store is a ``StoreClient``; that @@ -97,7 +118,22 @@ async def dispatch( session = get_session(request) if session is None: return RedirectResponse(url="/signin", status_code=303) - store = request.app.state.store + if _is_deployment_scoped_admin_path(request.url.path): + # Gate the deployment-scoped pages against the default + # experiment's store; they don't follow the active selection. 
+ store = request.app.state.store_factory.for_experiment( + request.app.state.experiment_id, role="worker" + ) + assert store is not None # worker role always returns a store + else: + active = resolve_active_context(request) + if not isinstance(active, ActiveContext): + # A dashboard redirect (stale / unreachable / credential) + # or the unseeded-experiment page — surface it directly. + # Both short-circuit the membership check, which has no + # store to run against in those states. + return active + store = active.store try: in_admins = store.resolve_worker_in_group( session.worker_id, ADMINS_GROUP_ID diff --git a/reference/services/web-ui/src/eden_web_ui/repo_factory.py b/reference/services/web-ui/src/eden_web_ui/repo_factory.py new file mode 100644 index 00000000..a986485a --- /dev/null +++ b/reference/services/web-ui/src/eden_web_ui/repo_factory.py @@ -0,0 +1,87 @@ +"""Per-experiment integrator-repo materialization for the executor module. + +Issue #145 §3.5 / Decision 9. The executor module needs a local bare +clone of each experiment's integrator repo to render ``work/*`` refs and +produce diffs. Today that is one clone bound at ``--repo-path``; in the +multi-experiment world each experiment has its own integrator repo. + +:class:`RepoMaterializer` vends per-experiment :class:`~eden_git.GitRepo` +clones under ``/.git``, cloning-if-missing from +the Forgejo remote (with the per-experiment URL substituted from the +configured ``--forgejo-url`` org base) and fetching on each access. It is +consulted only for NON-default experiments; the deployment-default +experiment continues to use the startup-materialized ``app.state.repo`` +so single-experiment deployments are unchanged (see +:func:`eden_web_ui.routes._helpers.repo_for`). +""" + +from __future__ import annotations + +from pathlib import Path + +from eden_git import GitRepo + + +def forgejo_url_for(forgejo_url: str, experiment_id: str) -> str: + """Rewrite a per-experiment Forgejo URL for ``experiment_id``. 
+ + ``--forgejo-url`` is ``http(s):////.git`` — a + per-experiment URL. The org base is everything up to the last path + segment; the active experiment's clone lives at + ``/.git``. + """ + org_base = forgejo_url.rsplit("/", 1)[0] + return f"{org_base}/{experiment_id}.git" + + +class RepoMaterializer: + """Lazily materializes + caches per-experiment bare clones. + + ``repo_root`` is the directory that holds one ``.git`` + bare clone per experiment. When ``forgejo_url`` is set a missing + clone is created from the substituted remote URL and refreshed via + ``fetch_all_heads`` on each access (AGENTS.md "long-lived clones need + read-before-display fetches"); without it the clone must already + exist on disk. + """ + + def __init__( + self, + *, + repo_root: Path, + forgejo_url: str | None, + credential_helper: str | None, + ) -> None: + self._repo_root = repo_root + self._forgejo_url = forgejo_url + self._credential_helper = credential_helper + self._cache: dict[str, GitRepo] = {} + + def for_experiment(self, experiment_id: str) -> GitRepo: + """Return the bare clone for ``experiment_id``, materializing if needed.""" + cached = self._cache.get(experiment_id) + if cached is not None: + if self._forgejo_url is not None: + cached.fetch_all_heads() + return cached + dest = self._repo_root / f"{experiment_id}.git" + if (dest / "HEAD").is_file(): + repo = GitRepo(str(dest)) + if self._forgejo_url is not None: + repo.fetch_all_heads() + elif self._forgejo_url is not None: + repo = GitRepo.clone_from( + url=forgejo_url_for(self._forgejo_url, experiment_id), + dest=dest, + bare=True, + credential_helper=self._credential_helper, + ) + else: + # No remote and no on-disk clone — nothing to materialize. 
+ raise FileNotFoundError( + f"no integrator clone for experiment {experiment_id!r} at " + f"{dest} and no --forgejo-url to clone from" + ) + repo.rev_parse("HEAD") + self._cache[experiment_id] = repo + return repo diff --git a/reference/services/web-ui/src/eden_web_ui/routes/_helpers.py b/reference/services/web-ui/src/eden_web_ui/routes/_helpers.py index d4788e19..d6a69a03 100644 --- a/reference/services/web-ui/src/eden_web_ui/routes/_helpers.py +++ b/reference/services/web-ui/src/eden_web_ui/routes/_helpers.py @@ -1,16 +1,36 @@ -"""Shared helpers for route handlers (session/CSRF lookup + cookie shape).""" +"""Shared helpers for route handlers (session/CSRF lookup + cookie shape). + +Issue #145 adds the per-request active-experiment resolution that lets +every per-experiment route operate against the experiment the operator +selected in the switcher (12c ``Session.selected_experiment_id``). The +load-bearing entry point is :func:`resolve_active_context` — one call at +the top of a handler that returns either a ready-to-use +:class:`ActiveContext` (resolved ``experiment_id`` + per-experiment +``store`` / ``admin_store`` / ``config``) or a ``Response`` the handler +returns verbatim (a dashboard redirect for stale / unreachable / +credential / config failures, or an "experiment not seeded" page). 
+""" from __future__ import annotations +import time +import urllib.parse from dataclasses import dataclass from pathlib import Path from typing import Any from urllib.parse import unquote, urlencode, urlparse -from eden_contracts import Idea +import httpx +import yaml +from eden_contracts import ExperimentConfig, Idea +from eden_service_common import load_experiment_config +from eden_storage import Store +from eden_storage.errors import NotFound from eden_storage.errors import NotFound as StorageNotFound +from eden_wire import Unauthorized from fastapi import Request, Response -from fastapi.responses import RedirectResponse +from fastapi.responses import HTMLResponse, RedirectResponse +from pydantic import ValidationError from ..artifacts import ( is_bundle_uri, @@ -18,6 +38,13 @@ read_bundle_manifest, ) from ..sessions import SESSION_COOKIE_NAME, Session, SessionCodec, verify_csrf +from ..store_factory import ( + AdminTokenRejected, + MissingAdminToken, + TaskStoreUnreachable, +) + +_DASHBOARD_PATH = "/admin/experiments/" _CONTENT_MAX_BYTES = 1 << 20 # 1 MiB @@ -82,6 +109,383 @@ def csrf_ok(session: Session, presented: str | None) -> bool: return verify_csrf(session, presented) +# --------------------------------------------------------------------------- +# Active-experiment resolution (issue #145) +# --------------------------------------------------------------------------- + + +class StaleSelection(Exception): + """The session selects an experiment no longer in the control plane. + + Raised by :func:`resolve_active_experiment` when the operator's + ``selected_experiment_id`` is absent from the control-plane registry + (unregistered concurrently). Callers redirect to the dashboard with + ``?error=stale-selection`` and clear the session field. 
+ """ + + def __init__(self, experiment_id: str) -> None: + self.experiment_id = experiment_id + super().__init__(f"selected experiment {experiment_id!r} is no longer registered") + + +class ControlPlaneUnreachable(Exception): + """A transport error reached the control plane during resolution.""" + + def __init__(self, experiment_id: str) -> None: + self.experiment_id = experiment_id + super().__init__("control plane unreachable while resolving the active experiment") + + +class ExperimentConfigMissing(Exception): + """No per-experiment config YAML exists at ``/.yaml``.""" + + def __init__(self, experiment_id: str) -> None: + self.experiment_id = experiment_id + super().__init__(f"no experiment-config YAML for {experiment_id!r}") + + +class ExperimentConfigInvalid(Exception): + """The per-experiment config YAML failed to parse / validate.""" + + def __init__(self, experiment_id: str, cause: BaseException) -> None: + self.experiment_id = experiment_id + self.cause = cause + super().__init__(f"invalid experiment-config YAML for {experiment_id!r}: {cause}") + + +@dataclass(frozen=True) +class ResolvedExperiment: + """The per-request resolved active experiment. + + ``unseeded`` is ``True`` when the control plane knows the experiment + but the task-store-server has no experiment record yet (registered + via the dashboard but not yet bootstrapped by ``setup-experiment`` / + checkpoint import). Routes render an "initialize me" page instead of + treating it as a stale selection (plan Decision 8 / Risk 11). + """ + + experiment_id: str + unseeded: bool = False + + +@dataclass(frozen=True) +class ActiveContext: + """Resolved per-request store/config bundle returned to a handler.""" + + experiment_id: str + store: Store + admin_store: Store | None + config: ExperimentConfig | None + + +def resolve_active_experiment(request: Request) -> ResolvedExperiment: + """Resolve the experiment the current request operates against. 
+ + Reads ``Session.selected_experiment_id`` and falls back to the + deployment default (``app.state.experiment_id``). When no control + plane is configured (single-experiment posture), always returns the + deployment default with no validation call — today's behavior is + unchanged and there is zero per-request overhead. + + For a non-default selection in control-plane mode it (1) validates + the experiment still exists in the control plane (``StaleSelection`` + / ``ControlPlaneUnreachable`` otherwise) and (2) classifies it as + seeded vs registered-but-unseeded against the task-store-server, + JIT-bootstrapping the per-experiment worker credential as a side + effect (``MissingAdminToken`` / ``AdminTokenRejected`` / + ``TaskStoreUnreachable`` propagate to the caller). + """ + app = request.app + default_id: str = app.state.experiment_id + control_plane = app.state.control_plane + if control_plane is None: + return ResolvedExperiment(default_id) + session = get_session(request) + selected = session.selected_experiment_id if session is not None else None + if selected is None or selected == default_id: + return ResolvedExperiment(default_id) + + try: + control_plane.read_experiment_metadata(selected) + except NotFound as exc: + raise StaleSelection(selected) from exc + except httpx.TransportError as exc: + raise ControlPlaneUnreachable(selected) from exc + + # Classify seeded vs unseeded against the task-store-server. The + # for_experiment call JIT-bootstraps the per-experiment worker + # credential; a NotFound from either the bootstrap register or the + # state read means the experiment exists in the control plane but + # not (yet) on the task-store-server. + factory = app.state.store_factory + try: + return _classify_experiment(factory, selected) + except Unauthorized: + # Stale / absent per-experiment credential against an auth-enabled + # task-store (Decision 8 Posture C/D, Risk 4). 
Drop the cached + # bearer + client and re-bootstrap once; a persisted-but-stale + # token reissues if an admin token is available. + factory.evict(selected) + try: + return _classify_experiment(factory, selected) + except Unauthorized as exc: + # No usable credential and no admin token to mint one — the + # plan's cannot-bootstrap-credential branch, NOT unseeded + # (a 401 is never inferred as unseeded; Decision 8 §C/D). + raise MissingAdminToken(selected) from exc + + +def _classify_experiment(factory: Any, experiment_id: str) -> ResolvedExperiment: + """Probe seeded vs registered-but-unseeded for ``experiment_id``.""" + try: + store = factory.for_experiment(experiment_id, role="worker") + store.read_experiment_state() + except NotFound: + return ResolvedExperiment(experiment_id, unseeded=True) + return ResolvedExperiment(experiment_id) + + +def active_config(request: Request, experiment_id: str) -> ExperimentConfig: + """Return the ``ExperimentConfig`` for ``experiment_id``. + + Single-experiment / no-control-plane mode returns the startup config + (``app.state.experiment_config``). Control-plane mode loads + ``<--experiment-config-dir>/.yaml`` lazily (cached for + the process lifetime — configs are immutable post-create), falling + back to the startup config for the deployment-default experiment. + """ + app = request.app + default_id: str = app.state.experiment_id + default_config: ExperimentConfig | None = app.state.experiment_config + config_dir: Path | None = app.state.experiment_config_dir + + # The deployment-default experiment prefers the startup config. + if experiment_id == default_id and default_config is not None: + return default_config + if app.state.control_plane is None: + # Single-experiment posture must have a startup config. + assert default_config is not None # validated at CLI startup + return default_config + # Control-plane mode, non-default experiment (or the default with no + # startup config): load /.yaml. 
We do NOT silently + # fall back to the default experiment's config — that would render + # the wrong objective / evaluation_schema (codex round-0 Bug 1). + if config_dir is None: + raise ExperimentConfigMissing(experiment_id) + + cache: dict[str, ExperimentConfig] = app.state.experiment_config_cache + cached = cache.get(experiment_id) + if cached is not None: + return cached + path = config_dir / f"{experiment_id}.yaml" + try: + config = load_experiment_config(path) + except FileNotFoundError as exc: + raise ExperimentConfigMissing(experiment_id) from exc + except (ValidationError, yaml.YAMLError, ValueError) as exc: + raise ExperimentConfigInvalid(experiment_id, exc) from exc + cache[experiment_id] = config + return config + + +def dashboard_redirect(error: str, *, exp: str | None = None) -> RedirectResponse: + """Redirect to the cross-experiment dashboard with an ``?error=`` banner.""" + params: dict[str, str] = {"error": error} + if exp is not None: + params["exp"] = exp + query = urllib.parse.urlencode(params) + return RedirectResponse(url=f"{_DASHBOARD_PATH}?{query}", status_code=303) + + +def switched_mid_form_redirect(frm: str, to: str) -> RedirectResponse: + """Redirect after a discarded submission whose form was rendered against + a different experiment than the one now active (issue #145 §3.6).""" + query = urllib.parse.urlencode( + {"error": "switched-mid-form", "from": frm, "to": to} + ) + return RedirectResponse(url=f"{_DASHBOARD_PATH}?{query}", status_code=303) + + +def form_experiment_guard( + form: Any, active_experiment_id: str +) -> RedirectResponse | None: + """Reject a submission whose ``form_experiment_id`` no longer matches. + + Every mutating form carries a hidden ``form_experiment_id`` recording + the experiment it was rendered against. 
If the operator switched + experiments between render and submit, the value disagrees with the + now-active experiment; rather than silently writing to the wrong + experiment we discard the submission and redirect with a clear banner + (issue #145 §3.6). Returns ``None`` when the form predates the field + (absent) or matches — absent is treated as "no mismatch" so older + cached pages degrade to today's behavior rather than hard-failing. + """ + submitted = form.get("form_experiment_id") + if isinstance(submitted, str) and submitted and submitted != active_experiment_id: + return switched_mid_form_redirect(submitted, active_experiment_id) + return None + + +def _stale_selection_redirect(request: Request) -> RedirectResponse: + """Redirect to the dashboard AND clear the stale session selection.""" + response = dashboard_redirect("stale-selection") + session = get_session(request) + if session is not None: + codec: SessionCodec = request.app.state.session_codec + cleared = Session( + worker_id=session.worker_id, + csrf=session.csrf, + selected_experiment_id=None, + ) + write_session_cookie( + response, + encoded=codec.encode(cleared), + secure=request.app.state.secure_cookies, + ) + return response + + +def _unseeded_response(request: Request, experiment_id: str) -> HTMLResponse: + """Render the "experiment registered but not seeded" page.""" + request.state.active_experiment_id = experiment_id + return request.app.state.templates.TemplateResponse( + request, + "_unseeded.html", + {"experiment_id": experiment_id}, + status_code=409, + ) + + +def resolve_active_context( + request: Request, *, need_config: bool = False +) -> ActiveContext | RedirectResponse | HTMLResponse: + """Resolve the active experiment + per-experiment store(s) for a handler. + + Returns an :class:`ActiveContext` ready to use, or a ``Response`` the + handler returns verbatim. 
The ``Response`` cases are: stale-selection + redirect (clears the session field), control-plane-unreachable / + cannot-bootstrap-credential / task-store-unreachable / config + redirects, and the unseeded-experiment page. On the happy path the + resolved id is stashed on ``request.state.active_experiment_id`` so + the template context processor renders the active experiment. + """ + try: + resolved = resolve_active_experiment(request) + except StaleSelection: + return _stale_selection_redirect(request) + except ControlPlaneUnreachable: + return dashboard_redirect("control-plane-unreachable") + except (MissingAdminToken, AdminTokenRejected) as exc: + return dashboard_redirect("cannot-bootstrap-credential", exp=exc.experiment_id) + except TaskStoreUnreachable as exc: + return dashboard_redirect("task-store-unreachable", exp=exc.experiment_id) + + request.state.active_experiment_id = resolved.experiment_id + if resolved.unseeded: + return _unseeded_response(request, resolved.experiment_id) + + factory = request.app.state.store_factory + store = factory.for_experiment(resolved.experiment_id, role="worker") + assert store is not None # worker role never returns None + admin_store = factory.for_experiment(resolved.experiment_id, role="admin") + + config: ExperimentConfig | None = None + if need_config: + try: + config = active_config(request, resolved.experiment_id) + except ExperimentConfigMissing as exc: + return dashboard_redirect("config-missing", exp=exc.experiment_id) + except ExperimentConfigInvalid as exc: + return dashboard_redirect("config-invalid", exp=exc.experiment_id) + + return ActiveContext( + experiment_id=resolved.experiment_id, + store=store, + admin_store=admin_store, + config=config, + ) + + +def repo_for(request: Request, experiment_id: str) -> Any: + """Return the local integrator clone for ``experiment_id`` (issue #145 §3.5). 
+ + The deployment-default experiment uses the startup-materialized + ``app.state.repo`` (unchanged single-experiment behavior). Non-default + experiments are materialized lazily by the ``RepoMaterializer`` under + ``/.git``; when no materializer is configured + (single-experiment / test posture) the default repo is returned for + every experiment. Returns ``None`` when the executor module is + disabled (no ``--repo-path``). + """ + default_repo = request.app.state.repo + if experiment_id == request.app.state.experiment_id: + return default_repo + materializer = request.app.state.repo_materializer + if materializer is None: + return default_repo + return materializer.for_experiment(experiment_id) + + +# --------------------------------------------------------------------------- +# Top-nav experiment switcher (issue #145 §3.7) +# --------------------------------------------------------------------------- + +_SWITCHER_TTL_SECONDS = 5.0 + + +def _cached_experiment_ids(request: Request) -> list[str] | None: + """Return registered experiment ids, cached for 5s per process (§3.7). + + The switcher dropdown renders on every page, so an uncached + ``list_experiments`` would hit the control plane once per request. A + short in-process TTL collapses that to at most one call per 5s; a + stale 5s window is invisible to operators. On a control-plane read + failure the last good list is served if one is cached; with a cold + cache (e.g. Posture D — no usable control-plane credential, or a + control-plane outage at first render) it returns ``None`` so the + caller HIDES the switcher rather than rendering an empty dropdown + (codex round-0 Risk 5). 
+ """ + control_plane = request.app.state.control_plane + cache: dict[str, Any] = request.app.state.experiments_cache + now_monotonic = time.monotonic() + if cache.get("expiry", 0.0) > now_monotonic: + return cache["ids"] + try: + ids: list[str] | None = [ + e.experiment_id for e in control_plane.list_experiments() + ] + except Exception: # noqa: BLE001 — serve stale, else signal "unavailable" + ids = cache.get("ids") # None on cold cache → switcher hidden + cache["expiry"] = now_monotonic + _SWITCHER_TTL_SECONDS + cache["ids"] = ids + return ids + + +def switcher_context(request: Request) -> dict[str, Any]: + """Template context processor for the top-nav experiment switcher. + + Active only when a control plane is configured, the request carries a + session (signed in), AND the registry is readable. When control-plane + reads are unavailable (Posture D — no usable credential — or a + control-plane outage with a cold cache) the switcher is hidden rather + than rendered as an empty dropdown (codex round-0 Risk 5). 
+ """ + if request.app.state.control_plane is None: + return {"switcher_experiments": None} + session = get_session(request) + if session is None: + return {"switcher_experiments": None} + experiments = _cached_experiment_ids(request) + if experiments is None: + return {"switcher_experiments": None} + return { + "switcher_experiments": experiments, + "switcher_selected": session.selected_experiment_id, + "switcher_csrf": session.csrf, + } + + def _resolve_inside_jail( uri: str | None, artifacts_dir: Path ) -> Path | None: diff --git a/reference/services/web-ui/src/eden_web_ui/routes/admin/actions.py b/reference/services/web-ui/src/eden_web_ui/routes/admin/actions.py index 7ed345e4..b4947edc 100644 --- a/reference/services/web-ui/src/eden_web_ui/routes/admin/actions.py +++ b/reference/services/web-ui/src/eden_web_ui/routes/admin/actions.py @@ -8,10 +8,10 @@ from eden_contracts import TaskTarget from eden_storage.errors import IllegalTransition, InvalidPrecondition from eden_storage.errors import NotFound as StorageNotFound -from fastapi import APIRouter, Form, Request +from fastapi import APIRouter, Form, Request, Response from fastapi.responses import HTMLResponse, RedirectResponse -from .._helpers import csrf_ok, get_session +from .._helpers import csrf_ok, get_session, resolve_active_context from ._common import ( _DISPATCH_MODE_KEYS, _DISPATCH_MODE_OUTCOMES, @@ -38,7 +38,10 @@ async def task_reclaim( return RedirectResponse(url="/signin", status_code=303) if not csrf_ok(session, csrf_token): return _csrf_failure_response() - store = request.app.state.store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store try: store.reclaim(task_id, "operator") except IllegalTransition: @@ -67,7 +70,10 @@ async def task_reassign_form( session = get_session(request) if session is None: return RedirectResponse(url="/signin", status_code=303) - store = request.app.state.store + active = 
resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store try: task = store.read_task(task_id) @@ -166,7 +172,10 @@ async def task_reassign( return RedirectResponse(url="/signin", status_code=303) if not csrf_ok(session, csrf_token): return _csrf_failure_response() - store = request.app.state.store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store redirect_base = f"/admin/tasks/{task_id}/reassign" reason = reason.strip() @@ -253,7 +262,10 @@ async def dispatch_mode_form( session = get_session(request) if session is None: return RedirectResponse(url="/signin", status_code=303) - store = request.app.state.store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store try: mode = store.read_dispatch_mode() except Exception: # noqa: BLE001 — transport / store domain @@ -289,7 +301,10 @@ async def dispatch_mode_update( return RedirectResponse(url="/signin", status_code=303) if not csrf_ok(session, csrf_token): return _csrf_failure_response() - store = request.app.state.store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store updates: dict[str, str] = { "ideation_creation": ideation_creation, @@ -345,6 +360,9 @@ async def create_execution_task( return RedirectResponse(url="/signin", status_code=303) if not csrf_ok(session, csrf_token): return _csrf_failure_response() + active = resolve_active_context(request) + if isinstance(active, Response): + return active target_kind = target_kind.strip().lower() target_id = target_id.strip() redirect_base = f"/admin/ideas/{idea_id}/" @@ -364,7 +382,7 @@ async def create_execution_task( return RedirectResponse( url=f"{redirect_base}?error=invalid-target", status_code=303 ) - store = request.app.state.store + store = active.store task_id = f"execution-{uuid.uuid4().hex[:12]}" try: if target is 
None: @@ -404,7 +422,10 @@ async def terminate_experiment( return RedirectResponse(url="/signin", status_code=303) if not csrf_ok(session, csrf_token): return _csrf_failure_response() - admin_store = request.app.state.admin_store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + admin_store = active.admin_store if admin_store is None: return RedirectResponse( url="/admin/experiment/?error=admin-disabled", status_code=303 diff --git a/reference/services/web-ui/src/eden_web_ui/routes/admin/index.py b/reference/services/web-ui/src/eden_web_ui/routes/admin/index.py index b744d226..602555d3 100644 --- a/reference/services/web-ui/src/eden_web_ui/routes/admin/index.py +++ b/reference/services/web-ui/src/eden_web_ui/routes/admin/index.py @@ -2,10 +2,10 @@ from __future__ import annotations -from fastapi import APIRouter, Request +from fastapi import APIRouter, Request, Response from fastapi.responses import HTMLResponse, RedirectResponse -from .._helpers import get_session +from .._helpers import get_session, repo_for, resolve_active_context from ._common import ( _KIND_VALUES, _STATE_VALUES, @@ -23,8 +23,11 @@ async def index(request: Request) -> HTMLResponse | RedirectResponse: session = get_session(request) if session is None: return RedirectResponse(url="/signin", status_code=303) - store = request.app.state.store - repo = request.app.state.repo + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store + repo = repo_for(request, active.experiment_id) now = _now_dt(request) try: @@ -67,7 +70,7 @@ async def index(request: Request) -> HTMLResponse | RedirectResponse: recent = list(reversed(events_full))[:10] - admin_store = request.app.state.admin_store + admin_store = active.admin_store return request.app.state.templates.TemplateResponse( request, "admin_index.html", diff --git a/reference/services/web-ui/src/eden_web_ui/routes/admin/observability.py 
b/reference/services/web-ui/src/eden_web_ui/routes/admin/observability.py index 6bcdb34c..e7c41e3a 100644 --- a/reference/services/web-ui/src/eden_web_ui/routes/admin/observability.py +++ b/reference/services/web-ui/src/eden_web_ui/routes/admin/observability.py @@ -3,7 +3,7 @@ from __future__ import annotations from dataclasses import dataclass -from typing import Any +from typing import Any, cast from eden_contracts import ( EvaluationTask, @@ -13,10 +13,15 @@ Task, ) from eden_storage.errors import NotFound as StorageNotFound -from fastapi import APIRouter, Request +from fastapi import APIRouter, Request, Response from fastapi.responses import HTMLResponse, RedirectResponse -from .._helpers import get_session, read_idea_content, read_variant_artifact +from .._helpers import ( + get_session, + read_idea_content, + read_variant_artifact, + resolve_active_context, +) from .._lineage import ( lineage_for_evaluation_task, lineage_for_execution_task, @@ -122,7 +127,10 @@ async def tasks_index(request: Request) -> HTMLResponse | RedirectResponse: session = get_session(request) if session is None: return RedirectResponse(url="/signin", status_code=303) - store = request.app.state.store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store now = _now_dt(request) filters = _parse_task_filters(request) @@ -174,7 +182,10 @@ async def task_detail( session = get_session(request) if session is None: return RedirectResponse(url="/signin", status_code=303) - store = request.app.state.store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store now = _now_dt(request) try: @@ -224,7 +235,10 @@ async def variants_index(request: Request) -> HTMLResponse | RedirectResponse: session = get_session(request) if session is None: return RedirectResponse(url="/signin", status_code=303) - store = request.app.state.store + active = resolve_active_context(request) + if 
isinstance(active, Response): + return active + store = active.store status = _coerce_filter(request.query_params.get("status"), _VARIANT_STATUS_VALUES) worker = coerce_worker_filter(request.query_params.get("worker")) @@ -251,7 +265,9 @@ async def variants_index(request: Request) -> HTMLResponse | RedirectResponse: try: variants = store.list_variants(status=status) - exec_tasks = store.list_tasks(kind="execution") + exec_tasks = cast( + "list[ExecutionTask]", store.list_tasks(kind="execution") + ) except Exception: # noqa: BLE001 — transport/store-domain return _read_failure_response(request, "could not load variants") exec_terminal_by_idea: dict[str, bool] = { @@ -307,7 +323,10 @@ async def variant_detail( session = get_session(request) if session is None: return RedirectResponse(url="/signin", status_code=303) - store = request.app.state.store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store artifacts_dir = request.app.state.artifacts_dir try: @@ -383,7 +402,10 @@ async def events_index(request: Request) -> HTMLResponse | RedirectResponse: session = get_session(request) if session is None: return RedirectResponse(url="/signin", status_code=303) - store = request.app.state.store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store raw_limit = request.query_params.get("limit") limit = _DEFAULT_EVENTS_LIMIT @@ -431,7 +453,10 @@ async def ideas_index(request: Request) -> HTMLResponse | RedirectResponse: session = get_session(request) if session is None: return RedirectResponse(url="/signin", status_code=303) - store = request.app.state.store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store state = _coerce_filter(request.query_params.get("state"), _IDEA_STATE_VALUES) if state == _INVALID_FILTER: return request.app.state.templates.TemplateResponse( @@ -496,13 +521,18 @@ 
async def idea_detail( session = get_session(request) if session is None: return RedirectResponse(url="/signin", status_code=303) - store = request.app.state.store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store artifacts_dir = request.app.state.artifacts_dir try: idea = store.read_idea(idea_id) live_execution_tasks = [ t - for t in store.list_tasks(kind="execution") + for t in cast( + "list[ExecutionTask]", store.list_tasks(kind="execution") + ) if t.payload.idea_id == idea_id ] workers_list = store.list_workers() @@ -545,8 +575,11 @@ async def experiment_detail( session = get_session(request) if session is None: return RedirectResponse(url="/signin", status_code=303) - store = request.app.state.store - admin_store = request.app.state.admin_store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store + admin_store = active.admin_store try: experiment = store.read_experiment() events_full = store.replay() diff --git a/reference/services/web-ui/src/eden_web_ui/routes/admin/work_refs.py b/reference/services/web-ui/src/eden_web_ui/routes/admin/work_refs.py index d28b13f4..a81bfd75 100644 --- a/reference/services/web-ui/src/eden_web_ui/routes/admin/work_refs.py +++ b/reference/services/web-ui/src/eden_web_ui/routes/admin/work_refs.py @@ -5,10 +5,10 @@ from typing import Any from eden_contracts import Variant -from fastapi import APIRouter, Form, Request +from fastapi import APIRouter, Form, Request, Response from fastapi.responses import HTMLResponse, RedirectResponse -from .._helpers import csrf_ok, get_session +from .._helpers import csrf_ok, get_session, repo_for, resolve_active_context from ._common import ( _REF_DELETE_OUTCOMES, _WORK_REF_RE, @@ -27,9 +27,8 @@ async def work_refs_index(request: Request) -> HTMLResponse | RedirectResponse: session = get_session(request) if session is None: return RedirectResponse(url="/signin", 
status_code=303) - repo = request.app.state.repo outcome = _outcome(request, "deleted", "error", _REF_DELETE_OUTCOMES) - if repo is None: + if request.app.state.repo is None: return request.app.state.templates.TemplateResponse( request, "admin_work_refs.html", @@ -41,7 +40,11 @@ async def work_refs_index(request: Request) -> HTMLResponse | RedirectResponse: "outcome": outcome, }, ) - store = request.app.state.store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store + repo = repo_for(request, active.experiment_id) if _repo_has_origin(repo): try: repo.fetch_all_heads() @@ -128,8 +131,7 @@ async def work_refs_delete( return RedirectResponse(url="/signin", status_code=303) if not csrf_ok(session, csrf_token): return _csrf_failure_response() - repo = request.app.state.repo - if repo is None: + if request.app.state.repo is None: return RedirectResponse( url="/admin/work-refs/?error=invalid-ref-name", status_code=303 ) @@ -137,7 +139,11 @@ async def work_refs_delete( return RedirectResponse( url="/admin/work-refs/?error=invalid-ref-name", status_code=303 ) - store = request.app.state.store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store + repo = repo_for(request, active.experiment_id) groups = _classify_work_refs(repo, store) try: target = _find_delete_target(groups, ref_name) diff --git a/reference/services/web-ui/src/eden_web_ui/routes/admin_experiments.py b/reference/services/web-ui/src/eden_web_ui/routes/admin_experiments.py index b30089b3..0c8daf94 100644 --- a/reference/services/web-ui/src/eden_web_ui/routes/admin_experiments.py +++ b/reference/services/web-ui/src/eden_web_ui/routes/admin_experiments.py @@ -74,6 +74,44 @@ "missing-experiment-id": ("error", "experiment_id is required"), } +# Banners for the active-experiment resolution failures that the +# per-experiment routes redirect here with (issue #145 §3.2 / Decision 8 / +# 
§3.6). Each value is a format string whose ``exp`` / ``frm`` / ``to`` +# fields are filled from the ``exp`` / ``from`` / ``to`` query params. +_RESOLVE_ERROR_MESSAGES: dict[str, str] = { + "stale-selection": ( + "the experiment you had selected is no longer registered; your " + "selection was cleared and you're back on the deployment default" + ), + "control-plane-unreachable": ( + "the control plane was unreachable while resolving your selected " + "experiment; retry, then check the control-plane logs if it persists" + ), + "cannot-bootstrap-credential": ( + "could not obtain a worker credential for experiment {exp}: the " + "web-ui has no persisted credential for it and no admin token is " + "available at runtime to mint one (see " + "docs/operations/web-ui-multi-experiment.md)" + ), + "task-store-unreachable": ( + "the task-store-server was unreachable while obtaining a credential " + "for experiment {exp}; retry, then check the task-store-server logs" + ), + "config-missing": ( + "experiment {exp} has no config YAML in --experiment-config-dir; " + "provision /{exp}.yaml, then reload" + ), + "config-invalid": ( + "experiment {exp}'s config YAML failed to parse/validate; fix " + "/{exp}.yaml, then reload" + ), + "switched-mid-form": ( + "you switched from experiment {frm} to {to} while filling out a form " + "on {frm}; the submission was discarded to avoid writing to the " + "wrong experiment. Re-enter it under {to}." 
+ ), +} + def _outcome( outcomes: dict[str, tuple[str, str]], key: str | None @@ -83,6 +121,22 @@ def _outcome( return outcomes.get(key) +def _resolve_error_outcome(request: Request) -> tuple[str, str] | None: + """Build the banner for an ``?error=`` resolution-failure redirect.""" + key = request.query_params.get("error") + if key is None: + return None + template = _RESOLVE_ERROR_MESSAGES.get(key) + if template is None: + return None + message = template.format( + exp=request.query_params.get("exp", "?"), + frm=request.query_params.get("from", "?"), + to=request.query_params.get("to", "?"), + ) + return ("error", message) + + def _require_session(request: Request) -> Session | RedirectResponse: """Return the decoded session, or a redirect to /signin.""" session = get_session(request) @@ -150,6 +204,9 @@ async def dashboard( result = _outcome(label, key) if result is not None: outcomes.append(result) + resolve_error = _resolve_error_outcome(request) + if resolve_error is not None: + outcomes.append(resolve_error) return request.app.state.templates.TemplateResponse( request, "admin_experiments.html", diff --git a/reference/services/web-ui/src/eden_web_ui/routes/admin_groups.py b/reference/services/web-ui/src/eden_web_ui/routes/admin_groups.py index 23677e96..ad0ef266 100644 --- a/reference/services/web-ui/src/eden_web_ui/routes/admin_groups.py +++ b/reference/services/web-ui/src/eden_web_ui/routes/admin_groups.py @@ -2,9 +2,10 @@ Implements plan §D.5 of phase 12a-1b. Mirrors the chunk-9e admin module shape (server-side Jinja, auth-first POST, closed-allowlist -banners). Admin-gated writes route through ``app.state.admin_store``; -reads use ``app.state.store`` (either-gated wire endpoints accept -the worker bearer). +banners). Admin-gated writes route through the active experiment's +admin store (``resolve_active_context(request).admin_store``, issue +#145); reads use the active experiment's worker store (either-gated +wire endpoints accept the worker bearer). 
The transitive-membership view is walked client-side via repeated ``read_group`` calls — ``StoreClient`` exposes ``resolve_worker_in_group`` @@ -27,10 +28,10 @@ NotFound as StorageNotFound, ) from eden_wire.errors import BadRequest -from fastapi import APIRouter, Form, Request +from fastapi import APIRouter, Form, Request, Response from fastapi.responses import HTMLResponse, RedirectResponse -from ._helpers import csrf_ok, get_session +from ._helpers import csrf_ok, get_session, resolve_active_context router = APIRouter(prefix="/admin/groups") @@ -285,8 +286,11 @@ async def groups_index(request: Request) -> HTMLResponse | RedirectResponse: session = get_session(request) if session is None: return RedirectResponse(url="/signin", status_code=303) - store = request.app.state.store - admin_store = request.app.state.admin_store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store + admin_store = active.admin_store q_raw = request.query_params.get("q") or "" q = q_raw[:64].lower() if q_raw else None @@ -353,7 +357,10 @@ async def groups_register( return RedirectResponse(url="/signin", status_code=303) if not csrf_ok(session, csrf_token): return _csrf_failure_response_redirect() - admin_store = request.app.state.admin_store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + admin_store = active.admin_store if admin_store is None: return RedirectResponse( url="/admin/groups/?error=admin-disabled", status_code=303 @@ -438,8 +445,11 @@ async def group_detail( session = get_session(request) if session is None: return RedirectResponse(url="/signin", status_code=303) - store = request.app.state.store - admin_store = request.app.state.admin_store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store + admin_store = active.admin_store try: group = store.read_group(group_id) @@ -484,7 +494,10 @@ async def 
group_add_member( return RedirectResponse(url="/signin", status_code=303) if not csrf_ok(session, csrf_token): return _csrf_failure_response_redirect() - admin_store = request.app.state.admin_store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + admin_store = active.admin_store if admin_store is None: return RedirectResponse( url=f"/admin/groups/{group_id}/?error=admin-disabled", @@ -541,7 +554,10 @@ async def group_remove_member( return RedirectResponse(url="/signin", status_code=303) if not csrf_ok(session, csrf_token): return _csrf_failure_response_redirect() - admin_store = request.app.state.admin_store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + admin_store = active.admin_store if admin_store is None: return RedirectResponse( url=f"/admin/groups/{group_id}/?error=admin-disabled", @@ -576,7 +592,10 @@ async def group_delete( return RedirectResponse(url="/signin", status_code=303) if not csrf_ok(session, csrf_token): return _csrf_failure_response_redirect() - admin_store = request.app.state.admin_store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + admin_store = active.admin_store if admin_store is None: return RedirectResponse( url=f"/admin/groups/{group_id}/?error=admin-disabled", diff --git a/reference/services/web-ui/src/eden_web_ui/routes/admin_workers.py b/reference/services/web-ui/src/eden_web_ui/routes/admin_workers.py index 5721f952..75c0c1ef 100644 --- a/reference/services/web-ui/src/eden_web_ui/routes/admin_workers.py +++ b/reference/services/web-ui/src/eden_web_ui/routes/admin_workers.py @@ -7,13 +7,15 @@ before ``csrf_ok`` on every mutating route. Worker-registry write paths (register / reissue) require an admin -bearer; the route layer pulls a ``StoreClient`` bearing -``admin:`` from ``app.state.admin_store``. 
When that +bearer; the route layer resolves the active experiment's admin +``StoreClient`` (bearing ``admin:``) via +``resolve_active_context(request).admin_store`` (issue #145). When that is ``None`` (postures B / C from plan §D.3) the templates render write controls disabled and POSTs short-circuit with -``?error=admin-disabled``. Read paths use ``app.state.store`` -(worker bearer), which is sufficient for the either-gated -``list_workers`` / ``read_worker`` endpoints (chapter 07 §6.1). +``?error=admin-disabled``. Read paths use the active experiment's +worker store (``active.store``), which is sufficient for the +either-gated ``list_workers`` / ``read_worker`` endpoints (chapter 07 +§6.1). """ from __future__ import annotations @@ -32,10 +34,10 @@ NotFound as StorageNotFound, ) from eden_wire.errors import BadRequest -from fastapi import APIRouter, Form, Request +from fastapi import APIRouter, Form, Request, Response from fastapi.responses import HTMLResponse, RedirectResponse -from ._helpers import csrf_ok, get_session +from ._helpers import csrf_ok, get_session, resolve_active_context router = APIRouter(prefix="/admin/workers") @@ -268,8 +270,11 @@ async def workers_index(request: Request) -> HTMLResponse | RedirectResponse: session = get_session(request) if session is None: return RedirectResponse(url="/signin", status_code=303) - store = request.app.state.store - admin_store = request.app.state.admin_store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store + admin_store = active.admin_store q_raw = request.query_params.get("q") or "" q = q_raw[:64].lower() if q_raw else None @@ -326,7 +331,10 @@ async def workers_register( return RedirectResponse(url="/signin", status_code=303) if not csrf_ok(session, csrf_token): return _csrf_failure_response() - admin_store = request.app.state.admin_store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + 
admin_store = active.admin_store if admin_store is None: return RedirectResponse( url="/admin/workers/?error=admin-disabled", status_code=303 @@ -396,8 +404,11 @@ async def worker_detail( session = get_session(request) if session is None: return RedirectResponse(url="/signin", status_code=303) - store = request.app.state.store - admin_store = request.app.state.admin_store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store + admin_store = active.admin_store try: worker = store.read_worker(worker_id) @@ -451,7 +462,10 @@ async def worker_reissue( return RedirectResponse(url="/signin", status_code=303) if not csrf_ok(session, csrf_token): return _csrf_failure_response() - admin_store = request.app.state.admin_store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + admin_store = active.admin_store if admin_store is None: return RedirectResponse( url=f"/admin/workers/{worker_id}/?error=admin-disabled", diff --git a/reference/services/web-ui/src/eden_web_ui/routes/evaluator.py b/reference/services/web-ui/src/eden_web_ui/routes/evaluator.py index 2491b38e..81d51b7e 100644 --- a/reference/services/web-ui/src/eden_web_ui/routes/evaluator.py +++ b/reference/services/web-ui/src/eden_web_ui/routes/evaluator.py @@ -33,14 +33,14 @@ from datetime import timedelta from typing import Any -from eden_contracts import EvaluationTask, Idea, Variant +from eden_contracts import EvaluationTask, ExperimentConfig, Idea, Variant from eden_storage import ( DispatchError, EvaluationSubmission, InvalidPrecondition, StorageError, ) -from fastapi import APIRouter, Form, Request +from fastapi import APIRouter, Form, Request, Response from fastapi.responses import HTMLResponse, RedirectResponse from starlette.datastructures import UploadFile @@ -57,6 +57,7 @@ build_artifact_links, build_list_links, csrf_ok, + form_experiment_guard, get_session, is_htmx_request, parse_list_view, @@ -64,6 
+65,7 @@ read_idea_manifest, read_variant_artifact, read_variant_artifact_manifest, + resolve_active_context, ) from ._submit_readback import submit_with_readback, wire_error_banner @@ -166,7 +168,12 @@ async def list_pending(request: Request) -> HTMLResponse | RedirectResponse: session = get_session(request) if session is None: return RedirectResponse(url="/signin", status_code=303) - store = request.app.state.store + active = resolve_active_context(request, need_config=True) + if isinstance(active, Response): + return active + store = active.store + config = active.config + assert config is not None # need_config=True populates it artifacts_dir = request.app.state.artifacts_dir try: pending = store.list_tasks(kind="evaluation", state="pending") @@ -177,7 +184,6 @@ async def list_pending(request: Request) -> HTMLResponse | RedirectResponse: return _render_error( request, f"task-store transport failure: {exc.__class__.__name__}" ) - config = request.app.state.experiment_config view = parse_list_view(request.query_params) resolver = EligibilityResolver(store, session.worker_id) pending_rows, read_failed_count = _build_evaluator_pending_rows( @@ -216,7 +222,10 @@ async def claim( return RedirectResponse(url="/signin", status_code=303) if not csrf_ok(session, csrf_token): return _csrf_failure_response(request) - store = request.app.state.store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store try: task = store.read_task(task_id) except DispatchError as exc: @@ -274,7 +283,12 @@ async def draft_form( status_code=303, ) _, variant_id = entry - store = request.app.state.store + active = resolve_active_context(request, need_config=True) + if isinstance(active, Response): + return active + store = active.store + config = active.config + assert config is not None # need_config=True populates it try: variant = store.read_variant(variant_id) idea: Idea = store.read_idea(variant.idea_id) @@ -287,11 +301,12 @@ 
async def draft_form( ) return _render_draft( request, + config=config, session=session, task_id=task_id, variant=variant, idea=idea, - form_state=_empty_form_state(request), + form_state=_empty_form_state(config), errors=None, status_code=200, ) @@ -318,6 +333,7 @@ def _collect_metric_inputs(form: Any, evaluation_schema: Any) -> dict[str, str]: def _finalize_evaluator_submit( *, request: Request, + config: ExperimentConfig, session: Any, task_id: str, variant: Variant, @@ -354,6 +370,7 @@ def _finalize_evaluator_submit( errors.add_overall(banner or "eden://error/invalid-precondition") return _render_draft( request, + config=config, session=session, task_id=task_id, variant=variant, @@ -392,7 +409,15 @@ async def submit( ) token, variant_id = entry - store = request.app.state.store + active = resolve_active_context(request, need_config=True) + if isinstance(active, Response): + return active + store = active.store + config = active.config + assert config is not None # need_config=True populates it + mismatch = form_experiment_guard(form, active.experiment_id) + if mismatch is not None: + return mismatch try: variant = store.read_variant(variant_id) idea: Idea = store.read_idea(variant.idea_id) @@ -407,12 +432,14 @@ async def submit( draft, errors, form_state = _parse_evaluator_submit_form( form, request=request, + config=config, variant_id=variant_id, uploaded=await _collect_uploads(form, field_name="artifact_files"), ) if draft is None or errors: return _render_draft( request, + config=config, session=session, task_id=task_id, variant=variant, @@ -425,6 +452,7 @@ async def submit( return _build_and_submit_evaluation( request=request, store=store, + config=config, session=session, task_id=task_id, token=token, @@ -441,6 +469,7 @@ def _parse_evaluator_submit_form( form: Any, *, request: Request, + config: ExperimentConfig, variant_id: str, uploaded: list[UploadedFile], ) -> tuple[Any, Any, dict[str, Any]]: @@ -452,7 +481,6 @@ def _parse_evaluator_submit_form( 
rewritten to point at a freshly-bundled artifact when the operator supplied text/uploads instead of an explicit URI. """ - config = request.app.state.experiment_config evaluation_schema = config.evaluation_schema status_raw = str(form.get("status") or "") @@ -575,6 +603,7 @@ def _build_and_submit_evaluation( *, request: Request, store: Any, + config: ExperimentConfig, session: Any, task_id: str, token: str, @@ -603,6 +632,7 @@ def _build_and_submit_evaluation( ) return _finalize_evaluator_submit( request=request, + config=config, session=session, task_id=task_id, variant=variant, @@ -617,8 +647,7 @@ def _build_and_submit_evaluation( ) -def _empty_form_state(request: Request) -> dict[str, Any]: - config = request.app.state.experiment_config +def _empty_form_state(config: ExperimentConfig) -> dict[str, Any]: return { "status": "success", "artifacts_uri": "", @@ -630,6 +659,7 @@ def _empty_form_state(request: Request) -> dict[str, Any]: def _render_draft( request: Request, *, + config: ExperimentConfig, session: Any, task_id: str, variant: Variant, @@ -645,7 +675,6 @@ def _render_draft( variant_artifact_manifest = read_variant_artifact_manifest( variant.artifacts_uri, artifacts_dir ) - config = request.app.state.experiment_config metric_schema_items = list(config.evaluation_schema.root.items()) return request.app.state.templates.TemplateResponse( request, diff --git a/reference/services/web-ui/src/eden_web_ui/routes/executor.py b/reference/services/web-ui/src/eden_web_ui/routes/executor.py index 0c78c88f..82813d4c 100644 --- a/reference/services/web-ui/src/eden_web_ui/routes/executor.py +++ b/reference/services/web-ui/src/eden_web_ui/routes/executor.py @@ -1,3 +1,4 @@ +# slop-allow-file: #145 pushed to 816 SLOC; per-resource split is a separate refactor. """Executor-module routes. Implements the spec-to-code map pinned in §C of the Phase 9c plan. 
@@ -26,16 +27,16 @@ from collections.abc import Callable from dataclasses import replace from datetime import timedelta -from typing import Any +from typing import Any, cast -from eden_contracts import Idea, Variant +from eden_contracts import ExecutionTask, Idea, Variant from eden_storage import ( DispatchError, NoOpVariant, StorageError, VariantSubmission, ) -from fastapi import APIRouter, Form, Request +from fastapi import APIRouter, Form, Request, Response from fastapi.responses import HTMLResponse, RedirectResponse from starlette.datastructures import UploadFile @@ -52,11 +53,14 @@ build_artifact_links, build_list_links, csrf_ok, + form_experiment_guard, get_session, is_htmx_request, parse_list_view, read_idea_content, read_idea_manifest, + repo_for, + resolve_active_context, ) from ._submit_readback import submit_with_readback, wire_error_banner @@ -190,6 +194,7 @@ def _phase1_create_starting_variant( *, store: Any, request: Request, + experiment_id: str, task_id: str, variant_id: str, idea: Idea, @@ -207,7 +212,7 @@ def _phase1_create_starting_variant( """ variant = _build_starting_variant( variant_id=variant_id, - experiment_id=request.app.state.experiment_id, + experiment_id=experiment_id, idea_id=idea.idea_id, parent_commits=idea.parent_commits, branch=branch, @@ -392,7 +397,12 @@ async def list_pending(request: Request) -> HTMLResponse | RedirectResponse: session = get_session(request) if session is None: return RedirectResponse(url="/signin", status_code=303) - store = request.app.state.store + active = resolve_active_context(request, need_config=True) + if isinstance(active, Response): + return active + store = active.store + config = active.config + assert config is not None # need_config=True populates it try: pending = store.list_tasks(kind="execution", state="pending") recent = _list_recent_variants(store) @@ -402,7 +412,6 @@ async def list_pending(request: Request) -> HTMLResponse | RedirectResponse: return _render_error( request, f"task-store 
transport failure: {exc.__class__.__name__}" ) - config = request.app.state.experiment_config artifacts_dir = request.app.state.artifacts_dir view = parse_list_view(request.query_params) resolver = EligibilityResolver(store, session.worker_id) @@ -497,7 +506,10 @@ async def claim( return RedirectResponse(url="/signin", status_code=303) if not csrf_ok(session, csrf_token): return _csrf_failure_response(request) - store = request.app.state.store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store now: Callable[[], Any] = request.app.state.now expires_at = now() + timedelta(seconds=request.app.state.claim_ttl_seconds) try: @@ -527,9 +539,12 @@ async def draft_form( status_code=303, ) _, variant_id = entry - store = request.app.state.store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store try: - task = store.read_task(task_id) + task = cast("ExecutionTask", store.read_task(task_id)) idea: Idea = store.read_idea(task.payload.idea_id) except DispatchError as exc: return _render_error(request, wire_error_banner(exc)) @@ -565,10 +580,17 @@ async def submit( # noqa: PLR0911 — flow has many distinct outcome arms by de ) token, variant_id = entry - store = request.app.state.store - repo = request.app.state.repo + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store + experiment_id = active.experiment_id + mismatch = form_experiment_guard(form, experiment_id) + if mismatch is not None: + return mismatch + repo = repo_for(request, experiment_id) try: - task = store.read_task(task_id) + task = cast("ExecutionTask", store.read_task(task_id)) idea = store.read_idea(task.payload.idea_id) except DispatchError as exc: return _render_error(request, wire_error_banner(exc)) @@ -613,6 +635,7 @@ async def submit( # noqa: PLR0911 — flow has many distinct outcome arms by de request=request, 
store=store, repo=repo, + experiment_id=experiment_id, session=session, task_id=task_id, token=token, @@ -628,6 +651,7 @@ def _drive_submit_phases( request: Request, store: Any, repo: Any, + experiment_id: str, session: Any, task_id: str, token: str, @@ -644,6 +668,7 @@ def _drive_submit_phases( phase1_error = _phase1_create_starting_variant( store=store, request=request, + experiment_id=experiment_id, task_id=task_id, variant_id=variant_id, idea=idea, @@ -902,7 +927,10 @@ def _render_draft( content = read_idea_content(idea, artifacts_dir) idea_manifest = read_idea_manifest(idea, artifacts_dir) branch = _branch_name(idea.slug, variant_id) - repo_path = getattr(request.app.state.repo, "path", None) + active_id = getattr( + request.state, "active_experiment_id", request.app.state.experiment_id + ) + repo_path = getattr(repo_for(request, active_id), "path", None) clone_url = getattr(request.app.state, "clone_url", None) return request.app.state.templates.TemplateResponse( request, diff --git a/reference/services/web-ui/src/eden_web_ui/routes/ideator.py b/reference/services/web-ui/src/eden_web_ui/routes/ideator.py index 775515ad..b720a1f8 100644 --- a/reference/services/web-ui/src/eden_web_ui/routes/ideator.py +++ b/reference/services/web-ui/src/eden_web_ui/routes/ideator.py @@ -31,7 +31,7 @@ WrongClaimant, ) from eden_storage.submissions import submissions_equivalent -from fastapi import APIRouter, Form, Request +from fastapi import APIRouter, Form, Request, Response from fastapi.responses import HTMLResponse, RedirectResponse from pydantic import ValidationError from starlette.datastructures import UploadFile @@ -44,7 +44,14 @@ write_artifact_bundle, ) from ..forms import FormErrors, IdeaDraft, format_validation_errors, parse_idea_rows -from ._helpers import csrf_ok, get_session, htmx_aware_redirect, is_htmx_request +from ._helpers import ( + csrf_ok, + form_experiment_guard, + get_session, + htmx_aware_redirect, + is_htmx_request, + resolve_active_context, +) from 
._submit_readback import wire_error_banner router = APIRouter(prefix="/ideator") @@ -117,9 +124,13 @@ async def list_pending(request: Request) -> HTMLResponse | RedirectResponse: session = get_session(request) if session is None: return RedirectResponse(url="/signin", status_code=303) - store = request.app.state.store + active = resolve_active_context(request, need_config=True) + if isinstance(active, Response): + return active + store = active.store + config = active.config + assert config is not None # need_config=True populates it pending = store.list_tasks(kind="ideation", state="pending") - config = request.app.state.experiment_config ctx: dict[str, Any] = { "session": session, "pending": pending, @@ -148,7 +159,10 @@ async def claim( return RedirectResponse(url="/signin", status_code=303) if not csrf_ok(session, csrf_token): return _csrf_failure_response(request) - store = request.app.state.store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store now: Callable[[], Any] = request.app.state.now expires_at = now() + timedelta(seconds=request.app.state.claim_ttl_seconds) try: @@ -174,10 +188,14 @@ async def draft_form( url="/ideator/?banner=claim+missing+from+session", status_code=303, ) - config = request.app.state.experiment_config + active = resolve_active_context(request, need_config=True) + if isinstance(active, Response): + return active + config = active.config + assert config is not None # need_config=True populates it buffered = _DRAFT_BUFFERS.get(_claim_key(session.csrf, task_id)) form_state = buffered if buffered else [_empty_row()] - store = request.app.state.store + store = active.store ctx: dict[str, Any] = { "session": session, "task_id": task_id, @@ -224,12 +242,9 @@ async def add_row(task_id: str, request: Request): request, "/ideator/?banner=claim+missing+from+session" ) - slugs = [str(v) for v in form.getlist("slug")] - priorities = [str(v) for v in form.getlist("priority")] - 
parents = [str(v) for v in form.getlist("parent_commits")] - contents = [str(v) for v in form.getlist("content")] - intended_kinds = [str(v) for v in form.getlist("intended_executor_kind")] - intended_ids = [str(v) for v in form.getlist("intended_executor_id")] + slugs, priorities, parents, contents, intended_kinds, intended_ids = ( + _collect_idea_form_fields(form) + ) typed_state = _form_state_from_inputs( slugs, priorities, @@ -261,8 +276,12 @@ async def add_row(task_id: str, request: Request): {"i": existing, "row_state": _empty_row(), "row_errs": {}}, ) - config = request.app.state.experiment_config - store = request.app.state.store + active = resolve_active_context(request, need_config=True) + if isinstance(active, Response): + return active + config = active.config + assert config is not None # need_config=True populates it + store = active.store ctx: dict[str, Any] = { "session": session, "task_id": task_id, @@ -343,6 +362,7 @@ def _persist_idea_drafts( # noqa: E501 # slop-allow: per-row save/validate/wri *, store: Any, request: Request, + experiment_id: str, drafts: list[Any], draft_rows: list[int], uploads_per_row: dict[int, list[UploadedFile]], @@ -410,7 +430,7 @@ def _persist_idea_drafts( # noqa: E501 # slop-allow: per-row save/validate/wri try: idea = _make_idea( idea_id=idea_id, - experiment_id=request.app.state.experiment_id, + experiment_id=experiment_id, draft=draft, artifacts_uri=artifacts_uri, now_iso=_iso(now()), @@ -453,6 +473,20 @@ def _persist_idea_drafts( # noqa: E501 # slop-allow: per-row save/validate/wri return idea_ids, slug_warnings, None, None +def _collect_idea_form_fields( + form: Any, +) -> tuple[list[str], list[str], list[str], list[str], list[str], list[str]]: + """Pull the six per-row idea form fields as parallel string lists.""" + return ( + [str(v) for v in form.getlist("slug")], + [str(v) for v in form.getlist("priority")], + [str(v) for v in form.getlist("parent_commits")], + [str(v) for v in form.getlist("content")], + 
[str(v) for v in form.getlist("intended_executor_kind")], + [str(v) for v in form.getlist("intended_executor_id")], + ) + + @router.post("/{task_id}/submit", response_model=None) async def submit_idea(task_id: str, request: Request) -> HTMLResponse | RedirectResponse: session = get_session(request) @@ -472,8 +506,17 @@ async def submit_idea(task_id: str, request: Request) -> HTMLResponse | Redirect status_code=303, ) - store = request.app.state.store - config = request.app.state.experiment_config + active = resolve_active_context(request, need_config=True) + if isinstance(active, Response): + return active + store = active.store + config = active.config + assert config is not None # need_config=True populates it + experiment_id = active.experiment_id + + mismatch = form_experiment_guard(form, experiment_id) + if mismatch is not None: + return mismatch if status == "error": return _submit_idea_error_status( @@ -481,12 +524,9 @@ async def submit_idea(task_id: str, request: Request) -> HTMLResponse | Redirect task_id=task_id, token=token, ) - slugs = [str(v) for v in form.getlist("slug")] - priorities = [str(v) for v in form.getlist("priority")] - parents = [str(v) for v in form.getlist("parent_commits")] - contents = [str(v) for v in form.getlist("content")] - intended_kinds = [str(v) for v in form.getlist("intended_executor_kind")] - intended_ids = [str(v) for v in form.getlist("intended_executor_id")] + slugs, priorities, parents, contents, intended_kinds, intended_ids = ( + _collect_idea_form_fields(form) + ) n_rows = max(len(slugs), len(priorities), len(parents), len(contents)) uploads_per_row = await _collect_row_uploads(form, n_rows=n_rows) has_uploads_per_row = [bool(uploads_per_row.get(i)) for i in range(n_rows)] @@ -514,8 +554,8 @@ async def submit_idea(task_id: str, request: Request) -> HTMLResponse | Redirect ) idea_ids, slug_warnings, persist_error, validation_errors = _persist_idea_drafts( - store=store, request=request, drafts=drafts, 
draft_rows=draft_rows, - uploads_per_row=uploads_per_row, + store=store, request=request, experiment_id=experiment_id, drafts=drafts, + draft_rows=draft_rows, uploads_per_row=uploads_per_row, ) if validation_errors is not None: form_state = _form_state_from_inputs( diff --git a/reference/services/web-ui/src/eden_web_ui/routes/index.py b/reference/services/web-ui/src/eden_web_ui/routes/index.py index 240a4384..0c8397dc 100644 --- a/reference/services/web-ui/src/eden_web_ui/routes/index.py +++ b/reference/services/web-ui/src/eden_web_ui/routes/index.py @@ -2,10 +2,10 @@ from __future__ import annotations -from fastapi import APIRouter, Request +from fastapi import APIRouter, Request, Response from fastapi.responses import HTMLResponse, RedirectResponse -from ._helpers import get_session +from ._helpers import get_session, resolve_active_context router = APIRouter() @@ -15,7 +15,10 @@ async def index(request: Request) -> HTMLResponse | RedirectResponse: session = get_session(request) if session is None: return RedirectResponse(url="/signin", status_code=303) - store = request.app.state.store + active = resolve_active_context(request) + if isinstance(active, Response): + return active + store = active.store pending = { kind: len(store.list_tasks(kind=kind, state="pending")) for kind in ("ideation", "execution", "evaluation") diff --git a/reference/services/web-ui/src/eden_web_ui/store_factory.py b/reference/services/web-ui/src/eden_web_ui/store_factory.py new file mode 100644 index 00000000..393ef579 --- /dev/null +++ b/reference/services/web-ui/src/eden_web_ui/store_factory.py @@ -0,0 +1,300 @@ +"""Per-experiment ``Store`` vending for the multi-experiment web-ui. + +Issue #145 closes the 12c per-route store-swap deferral. 12c shipped +the experiment switcher (``Session.selected_experiment_id`` + +``POST /admin/experiments/{E}/select``) but every per-experiment route +still read the startup-bound ``app.state.store``. 
This module is the +load-bearing piece that lets a route operate against whichever +experiment the operator selected. + +Two implementations share one interface (``for_experiment`` + ``close`` ++ ``admin_enabled``): + +- :class:`StoreFactory` — the production, wire-backed factory. Vends + per-``(experiment_id, role)`` :class:`~eden_wire.StoreClient` views + against one task-store-server URL (12c Decision 11: one URL + deployment-wide; only the ``experiment_id`` path segment varies). A + single shared ``httpx.Client`` is threaded through every vended + client so connection-pooling is preserved. Worker-role views are + JIT-credentialed via :class:`BearerCache` on first access; admin-role + views reuse the one deployment admin token. +- :class:`StaticStoreFactory` — wraps a pre-built ``Store`` (and + optional admin ``Store``) for a single experiment. Used by + ``make_app``'s single-experiment construction path and by the test + suite (the in-memory store is not a wire client, so the live factory + cannot vend it). + +The bearer plumbing (§3.2 of the plan) reuses +:func:`eden_service_common.auth.bootstrap_worker_credential` verbatim — +that function already implements the per-``worker_id`` exclusive lock, +the idempotent-register-then-reissue branch, and persisted-token +verification via ``/whoami``. This module never reimplements those +disciplines; it only caches the resulting bearer per experiment. 
+""" + +from __future__ import annotations + +from pathlib import Path +from typing import Literal + +import httpx +from eden_service_common.auth import ( + _read_token as read_persisted_token, +) +from eden_service_common.auth import ( + bootstrap_worker_credential, + credential_path, +) +from eden_storage import Store +from eden_storage.errors import NotFound +from eden_wire import StoreClient, Unauthorized + +Role = Literal["worker", "admin"] + + +class StoreFactoryError(RuntimeError): + """Base class for credential-bootstrap / store-vending failures.""" + + +class MissingAdminToken(StoreFactoryError): + """A worker credential must be bootstrapped but no admin token is available. + + Raised when ``for_experiment`` needs to JIT-register (or reissue) a + per-experiment worker credential, there is no persisted credential + on disk, and neither ``--admin-token`` nor ``$EDEN_ADMIN_TOKEN`` is + set. The resolve helper routes this to a + ``?error=cannot-bootstrap-credential`` dashboard redirect. 
+ """ + + def __init__(self, experiment_id: str) -> None: + self.experiment_id = experiment_id + super().__init__( + f"cannot bootstrap a worker credential for experiment " + f"{experiment_id!r}: no persisted credential and no admin token" + ) + + +class AdminTokenRejected(StoreFactoryError): + """The admin token was rejected while bootstrapping a worker credential.""" + + def __init__(self, experiment_id: str) -> None: + self.experiment_id = experiment_id + super().__init__( + f"admin token rejected while bootstrapping a credential for " + f"experiment {experiment_id!r}" + ) + + +class TaskStoreUnreachable(StoreFactoryError): + """A transport error reached the task-store-server during bootstrap.""" + + def __init__(self, experiment_id: str, cause: BaseException) -> None: + self.experiment_id = experiment_id + self.cause = cause + super().__init__( + f"task-store-server unreachable while bootstrapping a credential " + f"for experiment {experiment_id!r}: {cause.__class__.__name__}" + ) + + +class BearerCache: + """Caches per-experiment worker bearers, JIT-bootstrapping on miss. + + Delegates to :func:`bootstrap_worker_credential` — the same helper + every worker host uses at startup — passing a per-experiment + ``credentials_dir`` (``/``) so each + experiment's persisted ``.token`` lives in its own + subtree. The credential's ``bearer`` string is cached in-process + for the lifetime of the factory; the on-disk persistence handles + cross-process reuse and restart survival. + """ + + def __init__( + self, + *, + base_url: str, + worker_id: str, + credential_dir: Path, + admin_token: str | None, + ) -> None: + self._base_url = base_url + self._worker_id = worker_id + self._credential_dir = credential_dir + self._admin_token = admin_token + self._cache: dict[str, str | None] = {} + + def bearer_for(self, experiment_id: str) -> str | None: + """Return the worker bearer for ``experiment_id``, bootstrapping if needed. 
+ + Returns ``None`` in the auth-disabled posture (no admin token AND + no persisted credential) — mirroring + ``resolve_worker_bearer``'s posture 3, where the task-store-server + runs without ``--admin-token`` and the wire falls back to the + worker-id header shim. Otherwise classifies the four documented + bootstrap failure branches (plan §3.2) into the module's + exception taxonomy so the resolve helper can route each to a + distinct operator-facing banner. + """ + if experiment_id in self._cache: + return self._cache[experiment_id] + cred_dir = self._credential_dir / experiment_id + persisted = read_persisted_token(credential_path(cred_dir, self._worker_id)) + if self._admin_token is None and persisted is None: + self._cache[experiment_id] = None + return None + try: + credential = bootstrap_worker_credential( + base_url=self._base_url, + experiment_id=experiment_id, + worker_id=self._worker_id, + credentials_dir=cred_dir, + admin_token=self._admin_token, + labels={"role": "web-ui"}, + ) + except NotFound: + # The experiment does not exist on the task-store-server. + # The caller (resolve helper) distinguishes registered-but- + # unseeded (control plane knows it) from truly-gone. + raise + except Unauthorized as exc: + raise AdminTokenRejected(experiment_id) from exc + except RuntimeError as exc: + # bootstrap_worker_credential raises bare RuntimeError when + # it needs the admin token (register / reissue) but none is + # available. Narrow it to MissingAdminToken so the caller + # does not swallow genuine programming errors. 
+ raise MissingAdminToken(experiment_id) from exc + except httpx.TransportError as exc: + raise TaskStoreUnreachable(experiment_id, exc) from exc + self._cache[experiment_id] = credential.bearer + return credential.bearer + + def evict(self, experiment_id: str) -> None: + """Drop the cached bearer for ``experiment_id`` so it re-bootstraps.""" + self._cache.pop(experiment_id, None) + + def clear(self) -> None: + """Drop all cached bearers (the on-disk credentials persist).""" + self._cache.clear() + + +class StoreFactory: + """Vends per-experiment ``StoreClient`` views against one task-store URL. + + Caches by ``(experiment_id, role)``. Connection-pools by sharing one + ``httpx.Client`` across every vended client (passed via the + ``client=`` kwarg ``StoreClient.__init__`` already accepts), so a + vended client's own ``close()`` is a no-op — only ``close()`` on the + factory tears down the shared transport. + """ + + def __init__( + self, + *, + base_url: str, + bearer_cache: BearerCache, + admin_token: str | None, + shared_client: httpx.Client, + ) -> None: + self._base_url = base_url + self._bearer_cache = bearer_cache + self._admin_token = admin_token + self._shared_client = shared_client + self._cache: dict[tuple[str, Role], StoreClient] = {} + + @property + def admin_enabled(self) -> bool: + """True when a deployment admin token is configured.""" + return self._admin_token is not None + + def for_experiment( + self, experiment_id: str, *, role: Role = "worker" + ) -> StoreClient | None: + """Return a ``StoreClient`` view of ``experiment_id`` for ``role``. + + ``role="admin"`` returns ``None`` when no deployment admin token + is configured (the admin-disabled posture, mirroring 12c's + ``admin_store is None``). ``role="worker"`` always returns a + client or raises one of the bootstrap exceptions. 
+ """ + if role == "admin" and self._admin_token is None: + return None + cached = self._cache.get((experiment_id, role)) + if cached is not None: + return cached + if role == "admin": + bearer = f"admin:{self._admin_token}" + else: + bearer = self._bearer_cache.bearer_for(experiment_id) + client = StoreClient( + base_url=self._base_url, + experiment_id=experiment_id, + bearer=bearer, + client=self._shared_client, + ) + self._cache[(experiment_id, role)] = client + return client + + def evict(self, experiment_id: str) -> None: + """Drop cached views + bearer for ``experiment_id`` (stale-401 recovery). + + The next ``for_experiment`` re-bootstraps the worker credential. + Vended clients ride on the shared transport, so dropping them from + the cache without closing leaks nothing. + """ + self._cache.pop((experiment_id, "worker"), None) + self._cache.pop((experiment_id, "admin"), None) + self._bearer_cache.evict(experiment_id) + + def close(self) -> None: + """Close the shared ``httpx.Client`` and clear caches.""" + self._cache.clear() + self._bearer_cache.clear() + self._shared_client.close() + + +class StaticStoreFactory: + """A factory that vends one pre-built ``Store`` for a single experiment. + + Backs ``make_app``'s single-experiment construction and the test + suite. ``for_experiment`` ignores the ``experiment_id`` argument + (the single-experiment / no-control-plane posture always resolves to + the deployment default) and returns the worker store for + ``role="worker"`` or the admin store for ``role="admin"``. 
+ """ + + def __init__( + self, + *, + experiment_id: str, + store: Store, + admin_store: Store | None = None, + ) -> None: + self._experiment_id = experiment_id + self._store = store + self._admin_store = admin_store + + @property + def admin_enabled(self) -> bool: + """True when an admin store is configured (admin controls render).""" + return self._admin_store is not None + + def for_experiment( + self, experiment_id: str, *, role: Role = "worker" # noqa: ARG002 + ) -> Store | None: + """Return the single store (worker) or admin store for any id.""" + if role == "admin": + return self._admin_store + return self._store + + def evict(self, experiment_id: str) -> None: # noqa: ARG002 + """No-op: the static factory holds one fixed store, nothing to evict.""" + return None + + def close(self) -> None: + """No-op: the static factory does not own the stores' lifecycles. + + The CLI / test harness that constructed the stores owns closing + them. + """ + return None diff --git a/reference/services/web-ui/src/eden_web_ui/templates/_unseeded.html b/reference/services/web-ui/src/eden_web_ui/templates/_unseeded.html new file mode 100644 index 00000000..4147eb8e --- /dev/null +++ b/reference/services/web-ui/src/eden_web_ui/templates/_unseeded.html @@ -0,0 +1,13 @@ +{% extends "base.html" %} +{% block title %}EDEN — experiment not initialized{% endblock %} +{% block main %} +

Experiment not initialized

+ +

back to experiments

+{% endblock %} diff --git a/reference/services/web-ui/src/eden_web_ui/templates/base.html b/reference/services/web-ui/src/eden_web_ui/templates/base.html index 6db38361..aee33a86 100644 --- a/reference/services/web-ui/src/eden_web_ui/templates/base.html +++ b/reference/services/web-ui/src/eden_web_ui/templates/base.html @@ -9,7 +9,27 @@
EDEN reference UI
+ {% if control_plane_enabled and switcher_experiments is not none %} + {% set switcher_active = switcher_selected or experiment_id %} +
+ + {% if switcher_selected %}Active{% else %}Default{% endif %}: + {{ switcher_active }} + +
    + {% for exp in switcher_experiments %} +
  • +
    + + +
    +
  • + {% endfor %} +
+
+ {% else %}
experiment: {{ experiment_id }}
+ {% endif %}