diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 36b64620..f61cfb40 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -537,6 +537,17 @@ jobs: working-directory: reference/compose run: bash healthcheck/smoke-checkpoint.sh + # Backfill of the Phase 12c CHANGELOG-narrated deferral (issue #147), + # re-scoped: the reference impl cannot host >1 experiment per + # deployment (single-experiment task-store-server; cross-experiment + # isolation deferred to #254), so this exercises the control-plane as + # a first-class Compose service + the chapter-11 lease lifecycle and + # lease-handoff chaos drill on the deployed stack. Not required by + # branch protection in this PR; same posture as the other newly-added + # smoke jobs — bump to required-status after staying clean on main + # for ~2 weeks. + compose-smoke-multi-experiment: + name: compose-smoke-multi-experiment # Issue #110: exercises the opt-in Loki + Alloy + Grafana log-search # overlay (compose.logging.yaml) end-to-end — brings up base + # subprocess + logging, asserts Loki ingests EDEN lines, Grafana is @@ -566,6 +577,9 @@ jobs: jq --version python3 --version + - name: Run compose smoke (control-plane + lease-handoff drill) + working-directory: reference/compose + run: bash healthcheck/smoke-multi-experiment.sh - name: Run compose smoke (log-search overlay) working-directory: reference/compose run: bash healthcheck/smoke-logging.sh diff --git a/AGENTS.md b/AGENTS.md index 8edd1766..31d2d83c 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -69,6 +69,7 @@ At Phase 10 chunk 10d follow-up A, markdown linting, JSON Schema validation, and | `bash reference/compose/healthcheck/smoke-subprocess.sh` | Phase 10d subprocess-mode smoke (mirrors the `compose-smoke-subprocess` CI job; layers `compose.subprocess.yaml` over the base stack and runs against the fixture's `ideation.py` / `execution.py` / `evaluation.py`). 
| | `bash reference/compose/healthcheck/smoke-subprocess-docker.sh` | Phase 10d follow-up A docker-mode smoke (mirrors the `compose-smoke-subprocess-docker` CI job; runs setup-experiment with `--exec-mode docker` so each `*_command` runs in a sibling container via DooD; asserts no orphan executor/evaluator containers post-quiescence + ideator-orphan reaped after `compose stop`). | | `bash reference/compose/healthcheck/smoke-checkpoint.sh` | Phase 12b portable-checkpoint round-trip smoke (mirrors the `compose-smoke-checkpoint` CI job — runs setup-experiment + brings up the full stack + waits for quiescence, exports via `POST /v0/experiments//checkpoint`, tears down + wipes the data root, brings up only postgres + task-store-server against the same `.env` for an empty-store receiver, imports via `POST /v0/checkpoints/import`, asserts pre/post wire state matches + `imported_from` is stamped). Issue [#152](https://github.com/ealt/eden/issues/152). | +| `bash reference/compose/healthcheck/smoke-multi-experiment.sh` | Issue [#147](https://github.com/ealt/eden/issues/147) control-plane + lease-handoff smoke (mirrors the `compose-smoke-multi-experiment` CI job — brings up the always-on `control-plane` service + TWO orchestrator replicas in chapter-11 lease-driven mode against ONE registered experiment, asserts the lease-singleton invariant, kills the lease holder and asserts clean hand-off to the standby, then drives to `experiment.terminated` and asserts the control-plane `last_known_state` converges). Re-scoped from the original two-experiment smoke; cross-experiment isolation is deferred to [#254](https://github.com/ealt/eden/issues/254) (the reference impl is single-experiment per task-store-server). 
| | `bash reference/compose/healthcheck/smoke-logging.sh` | Issue #110 log-search overlay smoke (mirrors the `compose-smoke-logging` CI job — runs setup-experiment, statically merge-gates the privileged `compose.logging-infra.yaml` overlay, brings up base + subprocess + logging, asserts Loki ingests EDEN lines + Grafana is healthy with the Loki datasource + `eden-explore` dashboard provisioned + a `{service="orchestrator"}` LogQL query returns ≥1 line; when a docker socket is reachable it also layers `compose.logging-infra.yaml` and asserts postgres stdout reaches Loki). | | `uv run pytest -q -m docker` | Run the docker-backed `container_exec` integration tests (gated on a reachable docker daemon; skipped otherwise). | | `bash reference/compose/healthcheck/e2e.sh` | Phase 10e end-to-end smoke (mirrors the `compose-e2e` CI job — staged bring-up, Web UI ideator walkthrough + admin-reclaim drill via `e2e_drive.py`, full-stack quiescence wait, termination drill). Requires `httpx` importable from `python3` — locally, prefix with `PATH="/path/to/.venv/bin:$PATH"` or activate the workspace venv. | diff --git a/CHANGELOG.md b/CHANGELOG.md index 41a99191..ee8d1370 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,6 +8,20 @@ Per-chunk entries preserve the full implementation record: contract amendments, ## [Unreleased] +### Control-plane as a first-class Compose service + lease-handoff smoke (issue #147; re-scoped) + +Backfills the Phase 12c CHANGELOG-narrated deferral of a `compose-smoke-multi-experiment` CI job. **Re-scoped during impl** (operator-authorized): the draft plan's headline — two experiments end-to-end with cross-experiment isolation asserted via wire reads — is **not buildable** on the reference impl, because it hosts exactly one experiment per deployment. 
Three sites enforce this: the task-store-server's `Store` binds a single `experiment_id` and the wire layer rejects any other (`ExperimentIdMismatch` at [`_dependencies.py:73`](reference/packages/eden-wire/src/eden_wire/_dependencies.py)); the orchestrator multi-experiment loop targets one task-store URL for all experiments ([`multi_loop.py`](reference/services/orchestrator/src/eden_orchestrator/multi_loop.py) `make_runtime_factory`); and the integrator is one shared bare repo deployment-wide ([`cli.py`](reference/services/orchestrator/src/eden_orchestrator/cli.py) `_build_runtime_factory`). 12c's multi-experiment surface was validated only against fake stores + the single-IUT conformance binding. **True multi-experiment hosting + the cross-experiment-isolation smoke are deferred to [#254](https://github.com/ealt/eden/issues/254)** (filed at re-scope time). This chunk ships the genuinely-new, genuinely-shippable substrate piece instead: the control plane as a first-class Compose service, plus a lease-lifecycle + lease-handoff chaos smoke. + +**Control-plane Compose service.** [`compose.yaml`](reference/compose/compose.yaml) gains an always-on `control-plane` service (Postgres-backed; chapter 11 §3.4 Option A — a separate `eden_control_plane` database in the same instance, created by the new [`init-control-plane-db.sh`](reference/compose/init-control-plane-db.sh) postgres init hook). A `/healthz` endpoint was added to the control-plane server ([`app.py`](reference/services/control-plane/src/eden_control_plane_server/app.py); unauthenticated, outside `/v0/control`) for the container healthcheck. The service is always-on but **opt-in**: the orchestrator and web-ui only talk to it when `EDEN_CONTROL_PLANE_URL` is non-empty, so the existing six Compose smokes are unchanged in posture. 
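The opt-in toggle described above (the services only talk to the control plane when `EDEN_CONTROL_PLANE_URL` is non-empty) can be sketched as follows. This is a minimal illustration, not the actual orchestrator or web-ui CLI code: the helper names `env_or_none`, `parse_args`, and `select_mode` are hypothetical, and only the flag name `--control-plane-url` and the env var name come from the text.

```python
import argparse
import os
from typing import List, Optional


def env_or_none(name: str) -> Optional[str]:
    """Return the env var's value, treating empty or whitespace-only as unset."""
    value = os.environ.get(name, "").strip()
    return value or None


def parse_args(argv: List[str]) -> argparse.Namespace:
    # Hypothetical parser: the env var only supplies the default, so an
    # explicit --control-plane-url on the command line still wins.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--control-plane-url",
        default=env_or_none("EDEN_CONTROL_PLANE_URL"),
    )
    return parser.parse_args(argv)


def select_mode(args: argparse.Namespace) -> str:
    # Mode is chosen solely on whether a control-plane URL is present.
    if args.control_plane_url is not None:
        return "lease-driven"
    return "single-experiment"
```

With this shape, a single service definition can carry `EDEN_CONTROL_PLANE_URL: ${EDEN_CONTROL_PLANE_URL:-}` unconditionally: an empty value leaves the default at `None`, so existing smokes keep their single-experiment behavior.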
+ +**Env-fallback instead of entrypoint wrappers.** Rather than the draft plan's bash wrapper scripts, the orchestrator + web-ui CLIs gained an `EDEN_CONTROL_PLANE_URL` env fallback for `--control-plane-url` (empty treated as unset), mirroring the existing `EDEN_CONTROL_PLANE_ADMIN_TOKEN` fallback. The orchestrator selects mode solely on `--control-plane-url` being set, and `--experiment-id` is harmless in lease mode (a logging label), so a single compose service definition flips between single- and multi-experiment mode purely via the env var — no wrapper scripts, no Dockerfile change. `compose.control-plane.yaml` (whose only content was web-ui flag-passing) is **deleted**; `docs/observability.md` §3.4 is rewritten to the first-class-service + env-toggle flow. + +**Lease-handoff smoke.** New [`compose.multi-experiment.yaml`](reference/compose/compose.multi-experiment.yaml) overlay (a second `orchestrator-2` replica in lease mode) + [`smoke-multi-experiment.sh`](reference/compose/healthcheck/smoke-multi-experiment.sh) + the `compose-smoke-multi-experiment` CI job (unrequired initially; bump to required-status after ~2 weeks clean on main). The smoke brings up the control plane + two lease-contending replicas against one registered experiment, asserts the lease-singleton invariant, kills the lease holder and asserts clean hand-off to the standby, has the surviving replica drive the full pipeline to `≥2 variant.integrated`, then issues an **operator-driven** `terminate_experiment` and asserts the control-plane `last_known_state` converges to `terminated`. Validated locally end-to-end (PASS). 
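The two properties the smoke asserts (at most one lease holder at any instant, and hand-off to the standby once the holder stops renewing) can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not the chapter-11 lease semantics or the orchestrator's `LeaseManager` API: `ToyLeaseTable`, `try_acquire`, and `holder` are hypothetical names, and expiry-based hand-off is assumed for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Lease:
    holder: str        # worker id of the current lease holder
    expires_at: float  # time after which the lease may be taken over


class ToyLeaseTable:
    """Hypothetical single-process stand-in for a lease registry."""

    def __init__(self) -> None:
        self._leases: Dict[str, Lease] = {}

    def try_acquire(
        self, experiment_id: str, worker_id: str, now: float, duration: float
    ) -> bool:
        current = self._leases.get(experiment_id)
        # Refuse if a different worker holds an unexpired lease.
        if (
            current is not None
            and current.holder != worker_id
            and current.expires_at > now
        ):
            return False
        # Fresh acquire, renewal by the holder, or takeover after expiry.
        self._leases[experiment_id] = Lease(worker_id, now + duration)
        return True

    def holder(self, experiment_id: str, now: float) -> Optional[str]:
        current = self._leases.get(experiment_id)
        if current is None or current.expires_at <= now:
            return None
        return current.holder
```

In this model, "killing the lease holder" simply means it stops calling `try_acquire`; the standby's next attempt after expiry succeeds, which is why a short lease duration (the smoke pins `EDEN_LEASE_DURATION_SECONDS=10`) keeps the drill fast.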
+ +**Two pre-existing gaps surfaced by the smoke (filed, not fixed here).** (1) In lease-driven mode the orchestrator joins only the *control-plane* `orchestrators` group, not the *task-store* one (single-experiment mode self-joins via `_ensure_orchestrators_membership`; the multi-experiment path does not), so without seeding, the lease holder's §3.7-gated dispatch/integrate calls 403 — the smoke seeds the task-store group as a workaround; folded into [#254](https://github.com/ealt/eden/issues/254). (2) The orchestrator's *auto*-termination decision (`dispatch_mode.termination = "auto"`) 403s under wire auth because `terminate_experiment` is `admins`-gated while the orchestrator is in `orchestrators` (a spec inter-chapter drift between 03 §6.2 and 07 §2.9 / 04 §8.2, never caught because existing smokes use `never_terminate` + quiescence-exit and dispatch tests run auth-disabled) — filed as [#256](https://github.com/ealt/eden/issues/256). The smoke uses the supported operator-driven termination path instead. + +**setup-experiment.** Emits the control-plane store DSN (`EDEN_CONTROL_PLANE_STORE_URL`, `POSTGRES_DB_CONTROL_PLANE`) + `EDEN_CONTROL_PLANE_URL=` (empty) and creates the `logs/control-plane` substrate dir. No `--register-additional-experiment` flag (that was the two-experiment path; deferred to #254). Closes #147. + ### Evaluatable baseline variant (issue #122) Elevates the experiment seed — the single commit on `main` at experiment start — to a first-class `kind == "baseline"` variant so operators have a "what did the seed score?" comparison anchor and the lineage tree has a colored root. Default-on (suppressed by `baseline.enabled: false`). Spans spec + schemas + contracts + storage + wire + dispatch + orchestrator (both modes) + web-ui. diff --git a/docs/observability.md b/docs/observability.md index d7b48b6b..5e40a917 100644 --- a/docs/observability.md +++ b/docs/observability.md @@ -57,7 +57,7 @@ Notes: - These are read views over the wire API. 
Filter changes do not mutate state. - Reclaim / reassign / terminate / dispatch-mode toggles do mutate, behind CSRF + the same authorization model as the wire API. - **Auth.** Every `/admin/*` page load requires the signed-in session's worker to be a transitive member of the `admins` group; non-admin sessions get a 403 forbidden page from the route-layer middleware (issue #144). The `setup-experiment.sh` script seeds the `admins` group with the web-ui's worker so the default Compose deployment already meets this requirement. Sign-ups created after a deployment is up are not added to `admins` by default and will hit the 403 page until an existing admin adds them via `/admin/groups/admins/`. -- `/admin/experiments/` is only mounted when the web-ui is started with `--control-plane-url`. The default Compose stack omits this flag; the route returns 404 ("page does not exist") until a control-plane-server is wired up. To enable it on the demo stack, run a sibling control-plane container and recreate the web-ui with the [`compose.control-plane.yaml`](../reference/compose/compose.control-plane.yaml) overlay — see [§3.4](#34-enabling-the-multi-experiment-control-plane). +- `/admin/experiments/` is only mounted when the web-ui is started with `--control-plane-url`. Since issue #147 the control-plane-server runs as a first-class always-on Compose service, but the web-ui's `--control-plane-url` is still opt-in via the `EDEN_CONTROL_PLANE_URL` env var (empty by default), so the route returns 404 ("page does not exist") on the default stack. To enable it, set `EDEN_CONTROL_PLANE_URL` and recreate the web-ui — see [§3.4](#34-enabling-the-multi-experiment-control-plane). ### 2.2 Forgejo Web UI (`http://localhost:3001`) @@ -333,40 +333,24 @@ If you prefer a desktop tool over Adminer, connect directly to `localhost:5433` ### 3.4 Enabling the multi-experiment control plane -The `/admin/experiments/` route is gated on the web-ui being started with `--control-plane-url`. 
Phase 12c shipped the control-plane-server as a separate service that the default Compose stack does not start. To enable the route on a running demo stack: +Since issue #147 the control-plane-server is a **first-class always-on Compose service** (`control-plane`, port 8081), Postgres-backed (a separate `eden_control_plane` database in the same instance, created by the postgres init hook). `setup-experiment.sh` provisions its store DSN. So the control plane is already running on any `docker compose up` stack — what's opt-in is whether the **web-ui** talks to it. + +The `/admin/experiments/` route is gated on the web-ui being started with `--control-plane-url`, which the web-ui CLI reads from the `EDEN_CONTROL_PLANE_URL` env var (empty by default → route stays 404). To enable the cross-experiment dashboard on a running stack: ```bash -# 1. Spin up control-plane-server as a sibling container on the eden network. -ADMIN=$(grep '^EDEN_ADMIN_TOKEN=' reference/compose/.env | cut -d= -f2) -docker run --rm -d --name eden-demo-control-plane \ - --network eden-reference_default \ - -p 8081:8081 \ - eden-reference:dev \ - python -m eden_control_plane_server \ - --store-url ':memory:' \ - --host 0.0.0.0 --port 8081 \ - --admin-token "$ADMIN" \ - --task-store-url http://task-store-server:8080 - -# 2. Tell the web-ui's overlay where to find it, then recreate web-ui only. cd reference/compose -echo 'EDEN_CONTROL_PLANE_URL=http://eden-demo-control-plane:8081' >> .env -docker compose --env-file .env \ - -f compose.yaml -f compose.control-plane.yaml \ - up -d --force-recreate --no-deps web-ui +# Point the web-ui at the in-network control-plane service, then +# recreate web-ui only. The control-plane admin token defaults to +# EDEN_ADMIN_TOKEN inside the CLI, so no separate token is needed. 
+echo 'EDEN_CONTROL_PLANE_URL=http://control-plane:8081' >> .env +docker compose --env-file .env up -d --force-recreate --no-deps web-ui ``` -`compose.control-plane.yaml` is an overlay that re-declares the web-ui's `command:` with the two extra flags (`--control-plane-url` + `--control-plane-admin-token`). Compose replaces (not merges) list-shaped command keys, so the overlay carries the full command — keep it in lockstep with `compose.yaml` if web-ui flags change. - -`/admin/experiments/` now resolves (303 → sign-in if you're not authenticated, 200 once you are). The control-plane API itself listens at `http://localhost:8081/v0/control/*` (14 endpoints; bearer-authed with the same admin token); fetch its `/openapi.json` with the bearer for the full surface. - -State-sync caveat: the demo command above uses `:memory:` storage, so the control-plane forgets its experiment registry on container restart. For a persistent deployment you'd point `--store-url` at Postgres (a separate database / schema from the task store), and likely run it as a first-class Compose service rather than a sibling container. +`/admin/experiments/` now resolves (303 → sign-in if you're not authenticated, 200 once you are). The control-plane API itself listens at `http://localhost:${CONTROL_PLANE_HOST_PORT:-8081}/v0/control/*` (bearer-authed with the same admin token; `/healthz` is unauthenticated); fetch its `/openapi.json` with the bearer for the full surface. Note the registry is empty until an experiment is registered via `POST /v0/control/experiments` (the lease-handoff smoke and a lease-driven orchestrator do this). -Tear down: +Tear down (revert the web-ui to the no-control-plane command): ```bash -docker rm -f eden-demo-control-plane -# Optional: revert the web-ui to the no-control-plane command. 
cd reference/compose sed -i.bak '/^EDEN_CONTROL_PLANE_URL=/d' .env && rm .env.bak docker compose --env-file .env up -d --force-recreate --no-deps web-ui diff --git a/docs/plans/issue-110-loki-grafana-overlay.md b/docs/plans/issue-110-loki-grafana-overlay.md index 475421b1..7ec14b06 100644 --- a/docs/plans/issue-110-loki-grafana-overlay.md +++ b/docs/plans/issue-110-loki-grafana-overlay.md @@ -52,7 +52,7 @@ These are the load-bearing calls this plan makes. They are defensible defaults, ### 3.1 New overlay: `reference/compose/compose.logging.yaml` -Sibling to `compose.subprocess.yaml` / `compose.docker-exec.yaml` / `compose.control-plane.yaml` / `compose.multi-orchestrator.yaml`. Layered as: +Sibling to `compose.subprocess.yaml` / `compose.docker-exec.yaml` / `compose.multi-orchestrator.yaml` / `compose.multi-experiment.yaml`. Layered as: ```bash cd reference/compose @@ -166,7 +166,7 @@ No renames of existing identifiers. New identifiers introduced (validated agains | New identifier | Kind | Convention followed | |---|---|---| -| `compose.logging.yaml` / `compose.logging-infra.yaml` | overlay files | `compose..yaml` (matches `compose.subprocess.yaml`, `compose.docker-exec.yaml`, `compose.control-plane.yaml`, `compose.multi-orchestrator.yaml`) | +| `compose.logging.yaml` / `compose.logging-infra.yaml` | overlay files | `compose..yaml` (matches `compose.subprocess.yaml`, `compose.docker-exec.yaml`, `compose.multi-orchestrator.yaml`, `compose.multi-experiment.yaml`) | | `loki` / `alloy` / `grafana` | compose service names | upstream tool names, lowercase (matches `forgejo`, `postgres`) | | `EDEN_GRAFANA_ADMIN_PASSWORD` | env var (secret) | `EDEN__` (matches `EDEN_READONLY_PASSWORD`, `EDEN_ADMIN_TOKEN`, `EDEN_SESSION_SECRET`) | | `EDEN_LOGGING_DOCKER_GID` | env var (infra-overlay required) | `EDEN__` — distinct from docker-exec's `EDEN_DOCKER_GID` so the infra overlay fails fast instead of inheriting the default `0` (§3.4) | diff --git 
a/docs/plans/issue-147-compose-smoke-multi-experiment.md b/docs/plans/issue-147-compose-smoke-multi-experiment.md index b4f3b120..e92900d1 100644 --- a/docs/plans/issue-147-compose-smoke-multi-experiment.md +++ b/docs/plans/issue-147-compose-smoke-multi-experiment.md @@ -1,6 +1,6 @@ # Issue #147 — Compose-smoke-multi-experiment CI job (Phase 12c backfill) -**Status.** Draft (plan). +**Status.** Re-scoped (impl) — see §0. **Predecessors.** Phase 12c (control plane) merged ([CHANGELOG](../../CHANGELOG.md) §"Phase 12c"); chapter 11 normative surface + `eden-control-plane` package + `reference/services/control-plane/` reference service + orchestrator `LeaseManager` + web-ui `/admin/experiments/` dashboard are all shipped. Reference impl is `v1+roles+orchestrator-substrate+lifecycle+checkpoints+multi-experiment` conformant: 246/246 conformance scenarios pass at the chapter-7 binding level. What 12c deferred was the **deployment-substrate** integration — `control-plane` is not yet a first-class Compose service, and there is no end-to-end multi-experiment smoke. This plan backfills both. @@ -16,6 +16,33 @@ - "Cross-experiment isolation" is the smoke's load-bearing assertion shape — no task-id / event-stream / variant-id leakage between two registered experiments sharing the deployment substrate. Not a new spec term; observational only. - The new smoke script and CI job follow the existing naming convention: `smoke-multi-experiment.sh` (parallel to `smoke.sh` / `smoke-subprocess.sh` / `smoke-multi-orchestrator.sh` / `smoke-checkpoint.sh`) and `compose-smoke-multi-experiment` (parallel to `compose-smoke-multi-orchestrator` / `compose-smoke-checkpoint`). +## 0. 
Re-scope (2026-05-31, operator-authorized) — THIS GOVERNS + +During impl, a read of the focal code paths surfaced a structural blocker the draft plan did not anticipate: **the reference implementation cannot host more than one experiment per deployment substrate.** Three independent sites enforce single-experiment hosting: + +1. **Task-store-server is single-experiment-bound.** `build_store(...)` ([`reference/services/task-store-server/src/eden_task_store_server/app.py`](../../reference/services/task-store-server/src/eden_task_store_server/app.py)) constructs the `Store` with one fixed `experiment_id`, and the wire layer rejects any other path id: [`reference/packages/eden-wire/src/eden_wire/_dependencies.py:73`](../../reference/packages/eden-wire/src/eden_wire/_dependencies.py) → `if path_exp != deps.store.experiment_id: raise ExperimentIdMismatch(...)`. No multi-experiment `Store` class exists in `eden-storage`. +2. **The orchestrator multi-experiment loop targets a single task-store URL for all experiments.** [`reference/services/orchestrator/src/eden_orchestrator/multi_loop.py:255-302`](../../reference/services/orchestrator/src/eden_orchestrator/multi_loop.py) builds `StoreClient(task_store_url, experiment_id)` per experiment against the same CLI-supplied URL — no `experiment_id → endpoint` mapping. +3. **The orchestrator integrator is one shared bare repo / forgejo remote deployment-wide.** [`reference/services/orchestrator/src/eden_orchestrator/cli.py:633-657`](../../reference/services/orchestrator/src/eden_orchestrator/cli.py) documents the v0 design: "one task-store-server (and one canonical bare repo) deployment-wide." + +12c's multi-experiment surface was validated only against fake stores (`test_multi_loop_unit.py`) + the conformance suite's single-IUT chapter-7 binding (the 9 documented skips are precisely the ones that need >1 hosted experiment). The deployed reference stack has never hosted two experiments — and as written, it cannot. 
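The first of the three sites above is the easiest to picture in isolation. The following is a simplified sketch of a single-experiment guard, assuming a store bound to one fixed `experiment_id`; it is not the actual `eden_wire` code. `ToyStore` and `check_path_experiment` are hypothetical names, and only the exception name `ExperimentIdMismatch` and the comparison itself are taken from the text.

```python
from dataclasses import dataclass


class ExperimentIdMismatch(Exception):
    """Raised when a request path names an experiment the store does not host."""


@dataclass(frozen=True)
class ToyStore:
    # The store is constructed with exactly one experiment id and never
    # learns about any other.
    experiment_id: str


def check_path_experiment(store: ToyStore, path_exp: str) -> None:
    # Every request carrying an experiment id in its path goes through a
    # check like this, so a second experiment is rejected up front.
    if path_exp != store.experiment_id:
        raise ExperimentIdMismatch(
            f"path experiment {path_exp!r} != store experiment {store.experiment_id!r}"
        )
```

Under this shape, registering a second experiment with the control plane would not help: any wire call for it against the same task-store-server fails the guard, which is why cross-experiment hosting needs a store-level change rather than a Compose-level one.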
+ +**Decision (operator-authorized 2026-05-31): re-scope #147 to what the reference impl actually supports.** This plan now delivers: + +- **The control-plane as a first-class Compose service** (the genuinely-new, genuinely-shippable 12c substrate piece) — §3.1.1–§3.1.4 below are retained. +- **A lease-handoff chaos smoke** against the deployed stack: ONE registered experiment, TWO orchestrator replicas contending for its single lease (multi-experiment / lease-driven mode), with the chaos drill killing the lease holder and asserting the standby replica picks it up cleanly and the experiment still completes. This exercises the chapter-11 control-plane + lease lifecycle end-to-end on the real Compose substrate. + +**Deferred to [#254](https://github.com/ealt/eden/issues/254) (multi-experiment task-store-server hosting — the prereq):** the cross-experiment-isolation smoke (two experiments end-to-end; disjoint task-id / variant-id / idea-id / event streams). The following draft-plan content is **SUPERSEDED** by this re-scope and folded into #254: + +- **Decision 4** (two experiments end-to-end) and **Decision 5** (per-experiment worker host trios + per-experiment forgejo repos) — see the rewritten Decisions below. +- **§3.1.5** (`compose.multi-experiment.yaml` second host trio) — replaced by a second *orchestrator replica* in lease mode, §3.1.5′. +- **§3.2** (`setup-experiment --register-additional-experiment`) — not needed; a single experiment is registered with the control plane (§3.2′). +- **§3.4** (per-experiment `_2` env-var namespacing) — not needed. +- The two-experiment portions of **§3.3** (smoke phases 4/6 cross-experiment isolation) — replaced by the lease-handoff smoke design, §3.3′. + +**Impl refinement (no entrypoint wrappers).** The draft plan's §3.1.3 Shape A used a bash entrypoint wrapper to omit `--experiment-id` in multi-experiment mode. 
That is unnecessary: the orchestrator CLI selects mode **solely** on `--control-plane-url` being set ([`cli.py:359`](../../reference/services/orchestrator/src/eden_orchestrator/cli.py) `if args.control_plane_url is not None`), and `--experiment-id` is merely a logging label in multi mode. So the impl instead adds an **env fallback** for `--control-plane-url` to the orchestrator + web-ui CLIs (mirroring the existing `EDEN_CONTROL_PLANE_ADMIN_TOKEN` fallback): an empty `${EDEN_CONTROL_PLANE_URL:-}` → single-experiment mode (unchanged); a non-empty value → lease-driven mode. No wrapper scripts, no Dockerfile change. `--experiment-id` and `--lease-duration-seconds` stay as always-present flags (harmless in the mode that ignores them). This supersedes §3.1.3 Shape A and the `orchestrator-entrypoint.sh` / `web-ui-entrypoint.sh` / Dockerfile-COPY items in §4.1/§5. + +Where the rest of this document (written pre-re-scope) describes "two experiments" / "cross-experiment isolation," read it as historical context superseded by §0 + the primed (′) sections. §0 governs on any conflict. + ## 1. Context ### 1.1 What 12c shipped vs what's missing @@ -77,17 +104,13 @@ These are the load-bearing design calls; §3 unpacks each. Together, **the existing 6 smokes need NO change** (they don't set `EDEN_ORCHESTRATOR_MULTI_EXPERIMENT` or `EDEN_CONTROL_PLANE_URL` in their generated `.env`, so the orchestrator runs single-experiment and the web-ui ignores the control-plane). The new multi-experiment smoke sets both. This satisfies the [CHANGELOG](../../CHANGELOG.md) note's "The existing 6 Compose smokes are unchanged in posture" pledge. -4. **The multi-experiment smoke runs TWO experiments end-to-end, not one.** A smoke that registers one experiment via the control plane is not meaningfully different from `smoke.sh` (which exercises a single experiment without a control plane). 
The substrate-level value of this job is exercising the multi-experiment topology: two registered experiments, two leases held simultaneously, cross-experiment isolation asserted via wire reads. The smoke MUST therefore set up two distinct experiments end-to-end. +4. **[RE-SCOPED — see §0] The smoke runs ONE registered experiment with TWO orchestrator replicas contending for its lease.** The draft plan ran two experiments end-to-end; that is unbuildable (§0) and deferred to [#254](https://github.com/ealt/eden/issues/254). The substrate-level value retained here is exercising the chapter-11 control-plane + lease lifecycle on the real Compose stack: one experiment registered with the control plane, two orchestrator replicas in lease-driven (multi-experiment) mode contending for its single lease, with a chaos drill that kills the lease holder and asserts clean hand-off. This is meaningfully different from `smoke.sh` (which runs a single orchestrator with no control plane and no lease machinery). -5. **The two experiments share the SAME forgejo + postgres + task-store-server + control-plane + ONE multi-experiment orchestrator, but use DISTINCT per-experiment worker hosts.** Per §1.2: worker hosts are single-experiment-scoped in the v0 reference impl. Two experiments means two forgejo repos (existing setup-experiment shape supports this — each experiment-id maps to `eden/.git`) and two sets of host containers (six total: `ideator-host-A`, `ideator-host-B`, `executor-host-A`, `executor-host-B`, `evaluator-host-A`, `evaluator-host-B`). The orchestrator runs in multi-experiment mode (no `--experiment-id`); it acquires both leases and drives both loops. +5. **[RE-SCOPED — see §0] One experiment, one forgejo repo, one task-store-server, one control-plane, the existing single worker-host trio, and TWO orchestrator replicas in lease mode.** The per-experiment worker-host trios + per-experiment forgejo repos from the draft plan are deferred to #254. 
The two orchestrator replicas (`orchestrator`, `orchestrator-2`) both run with `EDEN_CONTROL_PLANE_URL` set (lease-driven mode via the `--control-plane-url` env fallback, not an entrypoint wrapper; `--experiment-id` stays present as a logging label, per the §0 impl refinement); they self-register deployment-scoped credentials, join the control-plane `orchestrators` group, and contend for the single experiment's lease. Exactly one holds it at any instant; the standby idles. Each replica has its own bare-clone + credentials volumes (mirrors `compose.multi-orchestrator.yaml`). - - **Alternative considered: one set of worker hosts shared across experiments.** Rejected: the worker-host CLIs require `--experiment-id` and have per-experiment forgejo credentials + per-experiment substrate paths. Refactoring host CLIs to multi-experiment is a separate, much bigger lift that arguably belongs in a future phase; it is not required to expose multi-experiment ORCHESTRATION at the smoke level. +6. **[RE-SCOPED — see §0] No `setup-experiment --register-additional-experiment`.** A single experiment is provisioned by the normal `setup-experiment.sh` flow; the smoke then registers that one experiment with the control plane via an admin-authenticated `POST /v0/control/experiments` (§3.2′). The `_2`-namespaced env convention from the draft plan is not needed. -6. **setup-experiment.sh becomes idempotently re-runnable against the same data root for a different experiment-id.** Today, running setup-experiment a second time against a different `--experiment-id` clobbers the `.env` file with the new experiment's settings. For the multi-experiment smoke, we need either (a) two `.env` files merged, or (b) `setup-experiment` extended to support a "register-additional-experiment" mode. - - **Decision: option (b) — add `--register-additional-experiment ` flag.** When passed, setup-experiment treats the existing `.env` as the BASELINE (postgres password, admin token, control-plane URL, etc. 
are reused as-is from the first invocation), provisions only the experiment-specific resources (forgejo repo + creds dir + data subdirs + bare-repo seed for that experiment), and appends per-experiment env vars under a namespaced prefix (`EDEN_EXPERIMENT_ID_2`, `EDEN_BASE_COMMIT_SHA_2`, etc.). The smoke script then renders the per-experiment host containers using those prefixed values via compose's variable substitution. - - **Why not option (a) — merged .env files:** compose doesn't naturally support that; either we'd have a custom merge step or move to two compose projects sharing a network. Both add complexity orthogonal to the smoke's intent. Option (b) is bounded — setup-experiment grows one new code path + the env-namespacing convention is restricted to the new multi-experiment-overlay scope. - -7. **The multi-experiment overlay is a new compose file: `compose.multi-experiment.yaml`.** It defines the second per-experiment host trio (`ideator-host-2`, `executor-host-2`, `evaluator-host-2`) plus any per-experiment-2 volumes; it does NOT redefine shared services (task-store-server, control-plane, orchestrator, postgres, forgejo, web-ui). Layered as `-f compose.yaml -f compose.multi-experiment.yaml` (mirrors `compose.multi-orchestrator.yaml`'s pattern). +7. **[RE-SCOPED — see §0] The overlay (`compose.multi-experiment.yaml`) adds a second orchestrator replica in lease mode**, not a second host trio. It does NOT redefine shared services. Layered as `-f compose.yaml -f compose.multi-experiment.yaml` (mirrors `compose.multi-orchestrator.yaml`). See §3.1.5′. 8. **CI job follows the established not-required-then-bump posture.** Same as compose-smoke-multi-orchestrator (12a-2) and compose-smoke-checkpoint (#152): the new `compose-smoke-multi-experiment` job is added unrequired in the implementation PR; bumped to required-status after staying clean on main for ~2 weeks. Documented in the implementation PR description. 
@@ -257,7 +280,32 @@ Notes: - Worker-ids are deterministic (`ideator-host-2`, etc.) so the `_ensure_orchestrators_membership`-style bootstrap works idempotently. - The host trio shares the same `task-store-server` healthcheck dependency as the experiment-1 trio; both trios register against the same task-store-server. -### 3.2 setup-experiment changes +#### 3.1.5′ `compose.multi-experiment.yaml` — RE-SCOPED overlay (second orchestrator replica in lease mode) + +Per §0, the overlay adds a SECOND orchestrator replica (`orchestrator-2`) in lease-driven mode, NOT a second host trio. Structurally it mirrors [`compose.multi-orchestrator.yaml`](../../reference/compose/compose.multi-orchestrator.yaml)'s `orchestrator-2` (its own bare-clone + credentials volumes, worker_id `orchestrator-2`), with two additions: it passes `--lease-duration-seconds ${EDEN_LEASE_DURATION_SECONDS:-30}` and sets `EDEN_CONTROL_PLANE_URL: ${EDEN_CONTROL_PLANE_URL:-}` in its `environment:` (the env-fallback that flips it into lease mode — see the §0 impl refinement). The command stays a plain `python -m eden_orchestrator …` list (no wrapper). It `depends_on` `control-plane: service_healthy` in addition to `task-store-server`. The base-compose `orchestrator` flips to lease mode the same way (it carries the same `EDEN_CONTROL_PLANE_URL` env), so both replicas contend for the one experiment's lease when the smoke sets the env var. Per-replica volumes: + +```yaml +volumes: + eden-orchestrator-2-repo: + eden-orchestrator-2-credentials: +``` + +### 3.2′ setup-experiment + control-plane registration (RE-SCOPED — see §0) + +No `setup-experiment` change is needed. The normal `setup-experiment.sh ` flow provisions the single experiment (forgejo repo, creds, seed, `.env`). 
The smoke then registers that one experiment with the control plane via an admin-authenticated wire call (issued from inside the control-plane container so no host curl/port-guessing is needed, mirroring setup-experiment's `bootstrap_curl`): + +```text +POST http://control-plane:8081/v0/control/experiments + Authorization: Bearer admin:${EDEN_ADMIN_TOKEN} + {"experiment_id": "${EDEN_EXPERIMENT_ID}", "config_uri": "file:///etc/eden/experiment-config.yaml"} + → accept 201 (first register) or 200 (idempotent replay; chapter 11 §2 / 12c round-6). +``` + +Both orchestrator replicas' multi-experiment loops then observe the registered experiment via `manager.refresh()` and contend for its lease. `config_uri` is informational here — the orchestrator reads ideation/termination policy from its `--experiment-config` CLI flag, and the control-plane state-sync poller reads `experiment.state` from `--task-store-url`, not from `config_uri`. + +The numbered `--register-additional-experiment` steps below are SUPERSEDED by §0 and folded into [#254](https://github.com/ealt/eden/issues/254); retained as historical context only. + +### 3.2 setup-experiment changes [SUPERSEDED — see §3.2′ + §0] Add a new `--register-additional-experiment ` flag to [`reference/scripts/setup-experiment/setup-experiment.sh`](../../reference/scripts/setup-experiment/setup-experiment.sh). When passed: @@ -273,85 +321,75 @@ Add a new `--register-additional-experiment ` flag to [`reference/scripts/se The `--register-additional-experiment` flag is intentionally suffixed `_2` rather than building a fully-generic N-experiment registry. For the smoke's needs, two experiments is enough; a future generalization to N can follow the same pattern with `_` suffixes if needed. 
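For reference, the §3.2′ registration call can be sketched in Python. This is a minimal illustration, not the smoke's actual implementation (the plan issues the call from inside the control-plane container): `build_register_request` and `registration_accepted` are hypothetical helper names, and the `Content-Type` header is an assumption; the URL, bearer format, body fields, and accepted status codes are taken from the text above.

```python
import json
import urllib.request

# In-network control-plane URL, as given in the plan.
CONTROL_PLANE = "http://control-plane:8081"


def build_register_request(
    experiment_id,
    admin_token,
    config_uri="file:///etc/eden/experiment-config.yaml",
):
    """Build (but do not send) the admin-authenticated registration request."""
    body = json.dumps(
        {"experiment_id": experiment_id, "config_uri": config_uri}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{CONTROL_PLANE}/v0/control/experiments",
        data=body,
        headers={
            "Authorization": f"Bearer admin:{admin_token}",
            # Assumption: the server expects a JSON content type.
            "Content-Type": "application/json",
        },
        method="POST",
    )


def registration_accepted(status):
    """201 is the first register; 200 is an idempotent replay."""
    return status in (200, 201)
```

Separating request construction from the status check keeps the idempotency rule (201 or 200 both pass) testable without a running stack.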
-### 3.3 The smoke script — `reference/compose/healthcheck/smoke-multi-experiment.sh` +### 3.3′ The smoke script — `reference/compose/healthcheck/smoke-multi-experiment.sh` (RE-SCOPED, lease-handoff — see §0) -Structure (mirrors smoke-checkpoint.sh / smoke-multi-orchestrator.sh patterns): +ONE registered experiment + TWO orchestrator replicas contending for its lease. Structure mirrors `smoke-checkpoint.sh` / `smoke-multi-orchestrator.sh`: ```text -Phase 0 — Preflight (docker / jq / curl / python3 available; docker compose v2) - -Phase 1 — Provision both experiments - setup-experiment.sh --experiment-id exp-A --env-file $ENV --data-root $ROOT - setup-experiment.sh --register-additional-experiment exp-B \ - --env-file $ENV --data-root $ROOT - - # The smoke pins: - # EDEN_ORCHESTRATOR_MULTI_EXPERIMENT=1 - # EDEN_CONTROL_PLANE_URL=http://control-plane:8081 - # EDEN_IDEATION_POLICY_MAX_TOTAL=2 for both experiments - # EDEN_LEASE_DURATION_SECONDS=10 (faster for the chaos drill) +Phase 0 — Preflight (docker / jq / curl / python3 available; docker compose v2). + Volume cleanup before run (AGENTS.md: rotate-password trap). + +Phase 1 — Provision the single experiment + pin lease-mode env. + setup-experiment.sh --experiment-id --env-file $ENV \ + --data-root $(mktemp -d) # per-run data root → no rotate-password trap + # The smoke rewrites EDEN_CONTROL_PLANE_URL (setup wrote it empty) and + # appends the lease knobs to $ENV (it does NOT hand-edit baseline secrets): + # EDEN_CONTROL_PLANE_URL=http://control-plane:8081 (flips lease mode on) + # EDEN_LEASE_DURATION_SECONDS=10 (fast hand-off drill) # EDEN_STATE_SYNC_INTERVAL_SECONDS=5 + # Cap ideation in the experiment-config YAML (ideation_policy fixed_total:3) + # so the run is bounded. Termination is OPERATOR-DRIVEN (Phase 5), NOT + # dispatch_mode.termination=auto — the orchestrator's auto-termination + # decision 403s under wire auth (terminate is admins-gated; #256). 
-Phase 2 — Bring up the stack with multi-experiment overlay +Phase 2 — Bring up the stack with the lease overlay. docker compose -f compose.yaml -f compose.multi-experiment.yaml \ --env-file $ENV up -d --wait --wait-timeout 300 - - # Assertions: - # - control-plane /healthz returns 200 - # - control-plane /v0/control/experiments contains both ids - # - control-plane /v0/control/leases lists 2 active leases (one per - # experiment) held by the orchestrator worker_id within ~30s - -Phase 3 — Drive both experiments to quiescence - # Both experiments use max_total=2 → 2 integrated variants each. The - # orchestrator's multi-experiment loop runs both lease loops; quiescence - # exit fires when ALL held leases have drained. - - Wait for orchestrator container to exit 0 (timeout 300s). - -Phase 4 — Cross-experiment isolation assertions - curl -fsS .../experiments/exp-A/events | jq … - curl -fsS .../experiments/exp-B/events | jq … - + # control-plane comes up healthy (depends_on postgres + task-store-server). + + Assertions: + - control-plane /healthz returns 200. + - Register the experiment: POST /v0/control/experiments (admin bearer, + from inside the control-plane container) → 201 or 200 (§3.2′). + - control-plane /v0/control/experiments lists the experiment. + - Seed the task-store `orchestrators` group with both replica worker_ids + (the lease-driven path joins only the CONTROL-PLANE orchestrators group, + not the task-store one — without this the lease holder's §3.7-gated + dispatch/integrate calls 403; folded into #254). + - Within a 60s deadline: exactly ONE active lease exists for the + experiment, held by one of {orchestrator, orchestrator-2}. Record the + holder worker_id as $HOLDER. (lease-singleton invariant — chapter 11 §4.) + +Phase 3 — Lease-handoff drill (chaos). + # Kill the current lease holder; assert the standby acquires the lease. + docker rm -f eden-<$HOLDER> # e.g. 
eden-orchestrator or eden-orchestrator-2 + Within lease_duration*2 + poll slack (~45s deadline): a single active lease + exists again, held by the OTHER replica ($HOLDER changed) — no split-brain. + +Phase 4 — The surviving replica drives the pipeline. + Poll the events stream (240s deadline) until ≥2 variant.integrated. Assert + ≥2 variant.integrated AND ≥2 execution-task.completed AND ≥2 + evaluation-task.completed (the post-hand-off holder drove dispatch + + execute + evaluate + integrate end-to-end on the deployed stack). + +Phase 5 — Operator-driven termination + state-sync convergence. + # Register a throwaway worker, add it to `admins`, terminate via its + # worker bearer (terminate_experiment rejects the literal admin bearer). + POST /v0/experiments//terminate (admins worker bearer) Assert: - - exp-A events count >= some floor (≥6 task.completed, ≥2 variant.integrated) - - exp-B events count >= same floor - - Per-experiment event task_ids are disjoint - (no exp-A task_id appears in exp-B's event stream, vice versa) - - exp-A variant_ids and exp-B variant_ids are disjoint - - Each experiment's idea_ids are disjoint - - control-plane registry shows both with last_known_state observed - (running OR terminated depending on policy; the smoke's - termination policy in the experiment-config drives terminated) - -Phase 5 — Lease-handoff drill (chaos) - # Bring up a second orchestrator replica via compose.multi-orchestrator.yaml - # NO — that conflicts with this overlay. Instead: this overlay layers - # a second orchestrator-multi instance directly. - # - # Decision: include `orchestrator-2` (multi-experiment shape) in - # compose.multi-experiment.yaml itself, so the chaos drill works - # without needing a third overlay file. Two replicas; chaos-kill - # the lease holder; assert the other replica picks up its lease. 
- - docker rm -f eden-orchestrator # current lease holder - Wait up to lease_duration * 2 (= 20s) for orchestrator-2 to acquire - both leases via control-plane /v0/control/leases. - Assert: orchestrator-2 now holds both leases; experiment-A and - experiment-B both continue to make progress (or are already - quiesced). - -Phase 6 — Final cross-experiment cardinality cross-check - Re-fetch /v0/control/experiments; assert: - - Both experiments still registered. - - Both have last_known_state == "terminated" (the smoke's - termination-policy drives this). - - Neither leak across experiment boundaries. + - an experiment.terminated event appears (60s deadline). + - control-plane /v0/control/experiments shows last_known_state == + "terminated" (state-sync poller running→terminated convergence, + chapter 11 §3; 30s deadline). PASS ``` -Substrate-cleanup posture mirrors `smoke-checkpoint.sh`: the smoke's `cleanup()` trap runs `docker compose down -v`, removes the per-experiment forgejo creds dirs, and wipes the bind-mount data root via a sibling Alpine container (uid-mismatch dance). +Substrate-cleanup posture mirrors `smoke-checkpoint.sh`: a per-run `mktemp -d` data root, and the `cleanup()` trap runs `docker compose ... down -v` + wipes the data root via a sibling Alpine container (uid-mismatch dance). bash-3.2 discipline applies (no `mapfile`/assoc-arrays). + +#### 3.3 The smoke script [SUPERSEDED — see §3.3′ + §0] + +The two-experiment / cross-experiment-isolation smoke design below is deferred to [#254](https://github.com/ealt/eden/issues/254); retained as historical context only. (Original Phase 4/6 cross-experiment-isolation assertions presuppose multi-experiment hosting the reference impl does not provide.) ### 3.4 Per-experiment env-var namespacing convention @@ -382,27 +420,25 @@ The convention is: **Code (reference impl):** -- Verify a `/healthz` endpoint exists on the control-plane server; add if missing. 
(Verify in [`reference/services/control-plane/src/eden_control_plane_server/app.py`](../../reference/services/control-plane/src/eden_control_plane_server/app.py); shape mirrors the web-ui's `/healthz`.) -- Orchestrator entrypoint wrapper script (~15 lines bash) under [`reference/compose/`](../../reference/compose/) (e.g. `orchestrator-entrypoint.sh`); web-ui entrypoint wrapper (~15 lines bash). -- Modify the runtime image's Dockerfile so the entrypoint wrappers are installed (small COPY + chmod). +- Add a `/healthz` endpoint to the control-plane server ([`reference/services/control-plane/src/eden_control_plane_server/app.py`](../../reference/services/control-plane/src/eden_control_plane_server/app.py); unauthenticated, outside `/v0/control`; shape mirrors the web-ui's `/healthz`). + a unit test. +- **[RE-SCOPED — see §0]** Add an `EDEN_CONTROL_PLANE_URL` env fallback for `--control-plane-url` in the orchestrator CLI ([`reference/services/orchestrator/src/eden_orchestrator/cli.py`](../../reference/services/orchestrator/src/eden_orchestrator/cli.py)) and the web-ui CLI ([`reference/services/web-ui/src/eden_web_ui/cli.py`](../../reference/services/web-ui/src/eden_web_ui/cli.py)), treating empty as unset. **No entrypoint wrapper scripts, no Dockerfile change** (supersedes the draft plan's wrapper items). **Compose:** - Add `control-plane` service to [`compose.yaml`](../../reference/compose/compose.yaml). - Add `init-control-plane-db.sh` postgres-init hook + mount. -- Modify `orchestrator` and `web-ui` services in compose.yaml to invoke the entrypoint wrappers. -- Add `POSTGRES_DB_CONTROL_PLANE`, `CONTROL_PLANE_HOST_PORT`, `EDEN_ORCHESTRATOR_MULTI_EXPERIMENT`, `EDEN_CONTROL_PLANE_URL`, `EDEN_LEASE_DURATION_SECONDS`, `EDEN_STATE_SYNC_INTERVAL_SECONDS`, `EDEN_STATE_SYNC_FAILURE_THRESHOLD` to [`.env.example`](../../reference/compose/.env.example). -- New `compose.multi-experiment.yaml` overlay. 
+- Modify the `orchestrator` + `web-ui` services in compose.yaml: add the `EDEN_CONTROL_PLANE_URL` env (the lease-mode toggle via the §0 CLI env fallback), the `--lease-duration-seconds` flag + `control-plane` `depends_on` on the orchestrator. **No entrypoint wrappers.** +- Add `POSTGRES_DB_CONTROL_PLANE`, `EDEN_CONTROL_PLANE_STORE_URL`, `CONTROL_PLANE_HOST_PORT`, `EDEN_CONTROL_PLANE_URL`, `EDEN_LEASE_DURATION_SECONDS`, `EDEN_STATE_SYNC_INTERVAL_SECONDS`, `EDEN_STATE_SYNC_FAILURE_THRESHOLD` to [`.env.example`](../../reference/compose/.env.example). (`setup-experiment` emits the first three + `EDEN_CONTROL_PLANE_URL=` empty.) +- New `compose.multi-experiment.yaml` overlay — **RE-SCOPED**: a second `orchestrator-2` replica in lease mode (§3.1.5′), NOT a second host trio. - Delete `compose.control-plane.yaml`. **setup-experiment.sh:** -- Add `--register-additional-experiment ` mode with the §3.2 semantics. Existing single-experiment flow unchanged. -- Per-experiment-2 env-var namespacing convention documented in the script's help text. +- **[RE-SCOPED — see §0] No change.** The single-experiment flow is used as-is; the smoke registers the one experiment with the control plane via a wire call (§3.2′). The `--register-additional-experiment` mode is deferred to [#254](https://github.com/ealt/eden/issues/254). **Smoke + CI:** -- New [`reference/compose/healthcheck/smoke-multi-experiment.sh`](../../reference/compose/healthcheck/smoke-multi-experiment.sh). +- New [`reference/compose/healthcheck/smoke-multi-experiment.sh`](../../reference/compose/healthcheck/smoke-multi-experiment.sh) — the **lease-handoff** smoke (§3.3′). - New `compose-smoke-multi-experiment` job in [`.github/workflows/ci.yml`](../../.github/workflows/ci.yml) (20-minute timeout, mirrors compose-smoke-multi-orchestrator + compose-smoke-checkpoint shape, not branch-protected initially). 
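The `EDEN_CONTROL_PLANE_URL` env fallback listed under **Code** above could look roughly like the following. This is a sketch under stated assumptions, not the contents of either `cli.py`: `parse_control_plane_url` is a hypothetical name, the parser is reduced to the one flag, and treating an explicitly empty flag value as unset (without consulting the env) is an illustrative choice.

```python
import argparse
import os


def parse_control_plane_url(argv, environ=None):
    """Resolve --control-plane-url, falling back to EDEN_CONTROL_PLANE_URL.

    Returns None when neither source provides a non-empty value, which
    corresponds to the default single-experiment mode.
    """
    env = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(prog="eden-sketch")  # hypothetical prog name
    parser.add_argument("--control-plane-url", default=None)
    args = parser.parse_args(argv)

    # An explicit flag wins; the env var is consulted only when the flag is absent.
    value = args.control_plane_url
    if value is None:
        value = env.get("EDEN_CONTROL_PLANE_URL")

    # Empty (or whitespace-only) is treated as unset.
    if value is None or not value.strip():
        return None
    return value.strip()
```

With this shape, Compose can always pass `EDEN_CONTROL_PLANE_URL: ${EDEN_CONTROL_PLANE_URL:-}` in `environment:`; an empty value leaves the service in single-experiment mode, so no entrypoint wrapper is needed to omit the flag.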
**Docs:** @@ -416,10 +452,11 @@ The convention is: ### 4.2 Out of scope (followups; file as issues if not already) - **Helm-chart multi-experiment substrate.** Folded into the existing Phase 13a plan ([`docs/plans/eden-phase-13a-helm-base-chart.md`](eden-phase-13a-helm-base-chart.md)) when that lands; this plan is Compose-only. No new issue needed — Phase 13a's existing scope covers it. -- **N-experiment generalization beyond N=2.** The `_2` suffix convention is bounded; a generic N-experiment registry would generalize to `_N` suffixes. Out of scope for the smoke's needs. +- **[RE-SCOPED] Cross-experiment-isolation smoke (two experiments end-to-end).** Deferred to [#254](https://github.com/ealt/eden/issues/254) — the reference impl cannot host >1 experiment per deployment (§0). This is the headline deferral of the re-scope. +- **N-experiment generalization beyond N=2.** Subsumed by #254. - **Multi-experiment load testing.** Per the issue: this is a smoke, not a stress test. - **Cross-experiment scheduling intelligence** (e.g. lease-stealing for fair work distribution). Out of scope; chapter 11 §3.9 alternatives-considered documents this as a future v1 amendment. -- **Worker-host multi-experiment refactor.** Per Decision 5, worker hosts stay single-experiment-scoped; a future refactor that lets one host trio serve multiple experiments would simplify deployment but is a different concern. +- **Worker-host multi-experiment refactor.** Worker hosts stay single-experiment-scoped; a future refactor that lets one host trio serve multiple experiments is part of the #254 family. - **`eden_control_plane` admin-pages in the web-ui.** Phase 12c shipped the read-only `/admin/experiments/` dashboard; a parallel `/admin/control/workers/` + `/admin/control/groups/` admin surface for the deployment-scoped registry is a follow-up (already deferred per the 12c CHANGELOG entry "Deployment-scoped worker/group registry admin pages not shipped"). 
### 4.3 Non-goals @@ -431,16 +468,15 @@ The convention is: | File | Change | |---|---| -| [`reference/compose/compose.yaml`](../../reference/compose/compose.yaml) | Add `control-plane` service block (§3.1.1). Modify `orchestrator` and `web-ui` services to invoke entrypoint wrappers (§3.1.3). Add postgres-init script mount on the `postgres` service (§3.1.2). | +| [`reference/compose/compose.yaml`](../../reference/compose/compose.yaml) | Add `control-plane` service block (§3.1.1). Add `EDEN_CONTROL_PLANE_URL` env + `--lease-duration-seconds` flag to `orchestrator`; `control-plane` to its `depends_on`. Add `EDEN_CONTROL_PLANE_URL` env to `web-ui`. Add postgres-init script mount on `postgres` (§3.1.2). | | `reference/compose/init-control-plane-db.sh` (new) | Postgres init hook creating `eden_control_plane` database (§3.1.2). | -| `reference/compose/orchestrator-entrypoint.sh` (new) | Bash wrapper that decides single- vs multi-experiment invocation from env (§3.1.3). | -| `reference/compose/web-ui-entrypoint.sh` (new) | Bash wrapper that conditionally adds `--control-plane-url` (§3.1.3). | -| [`reference/compose/Dockerfile`](../../reference/compose/Dockerfile) | `COPY` + `chmod +x` the two entrypoint scripts into the runtime image. | +| [`reference/services/orchestrator/src/eden_orchestrator/cli.py`](../../reference/services/orchestrator/src/eden_orchestrator/cli.py) | **[RE-SCOPED]** `EDEN_CONTROL_PLANE_URL` env fallback for `--control-plane-url` (empty→unset). | +| [`reference/services/web-ui/src/eden_web_ui/cli.py`](../../reference/services/web-ui/src/eden_web_ui/cli.py) | **[RE-SCOPED]** `EDEN_CONTROL_PLANE_URL` env fallback for `--control-plane-url` (empty→unset). | | [`reference/compose/compose.control-plane.yaml`](../../reference/compose/compose.control-plane.yaml) | **Delete** (§3.1.4). 
| -| `reference/compose/compose.multi-experiment.yaml` (new) | Second host trio + second orchestrator-multi-instance, layered as `-f compose.yaml -f compose.multi-experiment.yaml` (§3.1.5). | -| [`reference/compose/.env.example`](../../reference/compose/.env.example) | Document the per-experiment env-var namespacing convention (§3.4). | -| [`reference/scripts/setup-experiment/setup-experiment.sh`](../../reference/scripts/setup-experiment/setup-experiment.sh) | Add `--register-additional-experiment ` mode (§3.2). | -| `reference/compose/healthcheck/smoke-multi-experiment.sh` (new) | The smoke (§3.3). | +| `reference/compose/compose.multi-experiment.yaml` (new) | **RE-SCOPED**: second `orchestrator-2` replica in lease mode, layered as `-f compose.yaml -f compose.multi-experiment.yaml` (§3.1.5′). | +| [`reference/compose/.env.example`](../../reference/compose/.env.example) | Document the control-plane + lease env vars (§3.1.1/§3.1.3). (`_2` per-experiment namespacing deferred to #254.) | +| ~~`reference/scripts/setup-experiment/setup-experiment.sh`~~ | **RE-SCOPED — no change** (§3.2′). `--register-additional-experiment` deferred to #254. | +| `reference/compose/healthcheck/smoke-multi-experiment.sh` (new) | The lease-handoff smoke (§3.3′). | | [`.github/workflows/ci.yml`](../../.github/workflows/ci.yml) | New `compose-smoke-multi-experiment` job mirroring `compose-smoke-checkpoint` shape. | | [`reference/services/control-plane/src/eden_control_plane_server/app.py`](../../reference/services/control-plane/src/eden_control_plane_server/app.py) | Verify `/healthz` endpoint exists; add if missing. | | [`AGENTS.md`](../../AGENTS.md) | New row in "Commands" table for the smoke. | @@ -453,13 +489,12 @@ The convention is: This is a substrate-level smoke; the assertions ARE the test. There are no new unit tests, no new wire tests, no new conformance scenarios (all of those shipped with 12c). 
Verification gates: -### 6.1 Smoke-level assertions (per §3.3 above) +### 6.1 Smoke-level assertions (per §3.3′ above — RE-SCOPED) -- **Stack-startup**: control-plane `/healthz` 200; control-plane lists 2 registered experiments; 2 active leases held by the orchestrator within ~30s. -- **Cross-experiment isolation**: exp-A and exp-B event streams disjoint by task_id; variant_ids disjoint; idea_ids disjoint. -- **Per-experiment progress**: each experiment reaches `≥2 variant.integrated` events, `≥6 task.completed` events. -- **Control-plane state-sync**: each experiment's `last_known_state` in `read_experiment_metadata` converges to `"terminated"` after the in-experiment policy fires (smoke configures `max_variants_policy(2)` for both). -- **Chaos drill** (Phase 5): killing the lease-holding orchestrator → second orchestrator acquires both leases within `lease_duration * 2`; experiments still complete. +- **Stack-startup**: control-plane `/healthz` 200; control-plane lists the one registered experiment; exactly ONE active lease held by one of the two replicas within ~60s (lease-singleton invariant). +- **Lease-handoff chaos**: killing the lease holder → the standby replica acquires the lease within `lease_duration * 2` + poll slack; at no observed instant are there two active leases (no split-brain). +- **Progress**: the post-hand-off holder reaches `≥2 variant.integrated`, `≥2 execution-task.completed`, `≥2 evaluation-task.completed` (full dispatch→execute→evaluate→integrate pipeline on the deployed stack). +- **Control-plane state-sync**: after an OPERATOR-DRIVEN `terminate_experiment` (admins worker — the orchestrator's auto-termination decision 403s under wire auth, [#256](https://github.com/ealt/eden/issues/256)), the experiment's `last_known_state` converges to `"terminated"` (chapter 11 §3 poller). 
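The lease-singleton and hand-off assertions above can be expressed as two small checks. This is an illustrative sketch only: the lease record shape (`experiment_id` / `worker_id` keys, list containing only active leases) is an assumption about the `/v0/control/leases` response, and `active_holder` / `assert_handoff` are hypothetical helper names.

```python
REPLICAS = ("orchestrator", "orchestrator-2")


def active_holder(leases, experiment_id):
    """Return the worker_id of the single active lease for the experiment.

    Assumes `leases` lists only ACTIVE leases, each a dict with
    `experiment_id` and `worker_id` keys (hypothetical shape).
    Raises AssertionError on zero leases or on split-brain (two or more).
    """
    holders = [
        lease["worker_id"]
        for lease in leases
        if lease["experiment_id"] == experiment_id
    ]
    if len(holders) != 1:
        raise AssertionError(
            f"expected exactly one active lease for {experiment_id}, got {holders}"
        )
    return holders[0]


def assert_handoff(holder_before, holder_after, replicas=REPLICAS):
    """Check that the lease moved to the OTHER known replica."""
    if holder_after not in replicas:
        raise AssertionError(f"unexpected lease holder: {holder_after}")
    if holder_after == holder_before:
        raise AssertionError(f"lease did not move off {holder_before}")
```

In the smoke these checks would run inside a polling loop bounded by the stated deadlines; calling `active_holder` on every poll is what makes the "at no observed instant are there two active leases" condition checkable.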
### 6.2 Local-repro discipline @@ -472,7 +507,7 @@ Per AGENTS.md "Local repro beats log-tail reading", the smoke MUST be runnable l - `uv run pytest -q` passes (regression: the entrypoint-wrapper scripts and setup-experiment changes don't break unit tests). - `python3 scripts/check-rename-discipline.py` clean. - `npx --yes markdownlint-cli2@0.14.0 "**/*.md" "#node_modules" "#.venv" "#docs/archive/**" "#docs/plans/review/**"` clean. -- Manual UI smoke: spin up the stack with the multi-experiment overlay; verify `/admin/experiments/` shows both experiments; switch between them via the dashboard's select form. +- Manual UI smoke (RE-SCOPED): spin up the stack with the lease overlay + control-plane env; verify `/admin/experiments/` shows the one registered experiment (the cross-experiment switcher is exercised by #254). ## 7. Chunked execution plan @@ -489,16 +524,14 @@ The work is bounded enough to land as ONE impl PR after this plan PR merges, but - Verify control-plane `/healthz` endpoint (add if missing). - **Validation gate**: existing 6 smokes (`smoke.sh`, `smoke-subprocess.sh`, `smoke-subprocess-docker.sh`, `smoke-manual-mode.sh`, `smoke-multi-orchestrator.sh`, `smoke-checkpoint.sh`, `e2e.sh`) all pass unchanged. -**Wave 2 — Multi-experiment substrate + setup-experiment ergonomics** (covers Decisions 4, 5, 6, 7 + §3.1.5, §3.2, §3.4): +**Wave 2 — Lease overlay (RE-SCOPED — see §0)** (covers Decisions 4, 5, 7 + §3.1.5′): -- New `compose.multi-experiment.yaml` overlay. -- `setup-experiment.sh --register-additional-experiment ` flag. -- `.env.example` updates for the per-experiment-2 namespaced vars. -- **Validation gate**: `setup-experiment.sh --experiment-id exp-A && setup-experiment.sh --register-additional-experiment exp-B --env-file ` produces an `.env` with both groups of vars; `docker compose -f compose.yaml -f compose.multi-experiment.yaml up -d --wait` brings up the full multi-experiment stack. 
+- New `compose.multi-experiment.yaml` overlay adding `orchestrator-2` in lease mode (§3.1.5′). +- **Validation gate**: `docker compose -f compose.yaml -f compose.multi-experiment.yaml --env-file up -d --wait` (with `EDEN_ORCHESTRATOR_MULTI_EXPERIMENT=1` + `EDEN_CONTROL_PLANE_URL` set) brings up both orchestrator replicas in lease mode against the control plane. (setup-experiment is unchanged; `--register-additional-experiment` deferred to #254.) -**Wave 3 — Smoke + CI + docs** (covers Decision 8 + §3.3 + §3.5): +**Wave 3 — Smoke + CI + docs** (covers Decision 8 + §3.3′ + §3.5): -- New `smoke-multi-experiment.sh`. +- New `smoke-multi-experiment.sh` (lease-handoff). - New `compose-smoke-multi-experiment` CI job. - AGENTS.md / README / user-guide updates. - CHANGELOG `[Unreleased]` entry referencing #147. @@ -514,10 +547,11 @@ Each wave's validation gate is the "go / no-go" for the next wave. If wave 1 bre 3. **Compose's `${VAR:+...}` substitution doesn't work inside `command:` lists.** Verify before committing to Shape A (entrypoint wrapper) vs Shape B (two service definitions). The decision is Shape A precisely because compose's flag-omission semantics are awkward in list-style command args. 4. **The chaos drill flake risk.** Lease handoff is bounded by `lease_duration * 2` (20s with the smoke's `EDEN_LEASE_DURATION_SECONDS=10`), but the orchestrator's acquisition thread polls per `poll_interval` (default 1s in the compose config) — so the worst-case detection window is ~22s. CI timeout is 300s on bring-up + 240s on quiescence; the chaos drill adds another ~30s. Total smoke runtime ≤ 10 min on the GitHub Actions runner. Mitigation: explicit `deadline = $((SECONDS + 60))` on the lease-acquisition assertion and `docker compose logs --tail 60` dump on failure (mirrors the existing smokes' diagnostic posture). 5. 
**GitHub Actions runner resource pressure (six host containers + control-plane + multi-orchestrator + postgres + forgejo = ~10 containers).** Mitigation: cap `EDEN_IDEATION_POLICY_MAX_TOTAL=2` for both experiments, run the scripted reference ideator/executor/evaluator (not the LLM ones), and rely on the 20-minute timeout. If memory pressure causes flakes, fall back to running the chaos drill in a separate CI job (split the smoke into base + chaos; base goes required first). -6. **Audit-of-substrate-rename trap.** AGENTS.md "Substrate migrations need a same-PR audit" applies. The compose.control-plane.yaml deletion is the main concrete reference to audit. Grep checklist (run in the impl PR): - - `grep -rn 'compose.control-plane' .` — must return zero hits after the wave-1 commit lands. - - `grep -rn 'EDEN_CONTROL_PLANE_URL' .` — must surface only documented call-sites (web-ui CLI + orchestrator CLI + the new entrypoint wrappers + .env.example + this plan + the CHANGELOG entry). - - `grep -rn 'control-plane' reference/compose/` — must show the new compose.yaml block + the new entrypoint wrappers + nothing else. +6. **Audit-of-substrate-rename trap.** AGENTS.md "Substrate migrations need a same-PR audit" applies. The compose.control-plane.yaml deletion is the main concrete reference to audit. Grep audit run in the impl PR — every hit classified per the AGENTS.md checklist (real consumer vs doc reference): + - **Real consumers** (compose / scripts / CI / operator docs) updated: `docs/observability.md` §2.1/§3.4 rewritten to the first-class-service + `EDEN_CONTROL_PLANE_URL` toggle. No compose/CI/script still references the deleted overlay. + - **Forward-looking plan references** updated: the sibling-overlay example lists in `issue-110` and the §3.4 bring-up step in `issue-182` now point at the first-class service / `compose.multi-experiment.yaml`. 
+ - **Historical analysis preserved** (deliberately not edited): `docs/plans/issue-157-cli-flags-to-config.md` references the overlay's web-ui-only shape as point-in-time analysis of a now-superseded state; rewriting it would corrupt that plan's record. These remaining `compose.control-plane` hits are expected and intentional. + - `grep -rn 'control-plane' reference/compose/` — shows the new compose.yaml `control-plane` service block + the postgres init-hook mount + the `EDEN_CONTROL_PLANE_URL` env wiring + nothing else (no wrapper scripts — see §0 impl refinement). 7. **Worker-host conflict on shared substrate paths.** The exp-2 host containers' substrate paths (`exp-2-artifacts`, etc.) are distinct from the default-shape exp-1 paths by construction (suffix `_2`). The risk is that a shared mount target inside the container (e.g. `/var/lib/eden/artifacts`) collides if both trios mount different host paths to the SAME container target — which they do, but the trios are different containers so this is fine. Mitigation: documented in the compose.multi-experiment.yaml's per-service `volumes:` blocks. 8. **The `--register-additional-experiment` flag interacts poorly with checkpoint-import auto-register.** 12c's checkpoint-import endpoint auto-registers the imported experiment with the control plane (Decision 9 of the 12c plan). Operator workflow: import a checkpoint as experiment B; then run `setup-experiment.sh --register-additional-experiment B` against the existing baseline. The setup-experiment flow's control-plane registration is idempotent (chapter 11 §2 / 12c round-6 fix: 200 on idempotent replay), so this is safe — but the smoke doesn't test it. Mitigation: out-of-scope for this plan; flag as a followup if needed. 9. **EnvVar `EDEN_ADMIN_TOKEN` reuse across both experiments.** Both forgejo repos use the same `EDEN_ADMIN_TOKEN` (deployment-scoped, not per-experiment). 
This is the correct posture — the chapter 11 §6 deployment-scoped worker registry uses the admin token, NOT per-experiment tokens. The risk is conceptual confusion (operators might expect per-experiment admin tokens); mitigation is the `.env.example` documentation explicitly calling out which vars are deployment-scoped vs experiment-scoped. diff --git a/docs/plans/issue-182-coverage-debt-audit.md b/docs/plans/issue-182-coverage-debt-audit.md index 9e7a8572..c2b7866b 100644 --- a/docs/plans/issue-182-coverage-debt-audit.md +++ b/docs/plans/issue-182-coverage-debt-audit.md @@ -219,7 +219,7 @@ Waves are grouped by **stack configuration** so each wave shares one bring-up (t ### Wave 3 — Control-plane stack (§4.3) -- **Concrete bring-up** (per [`docs/observability.md`](../observability.md) §3.4): run the control-plane-server as a sibling container on the `eden-reference_default` network, then recreate `web-ui` with the [`compose.control-plane.yaml`](../../reference/compose/compose.control-plane.yaml) overlay so `--control-plane-url` is set and the `/admin/experiments/` + `/admin/control/*` routes register. The control-plane routes do NOT exist on the default stack — this overlay is the gate. +- **Concrete bring-up** (per [`docs/observability.md`](../observability.md) §3.4): since #147 the control-plane-server is a first-class always-on Compose service; set `EDEN_CONTROL_PLANE_URL=http://control-plane:8081` in `.env` and recreate `web-ui` so `--control-plane-url` is set and the `/admin/experiments/` + `/admin/control/*` routes register. The web-ui routes are 404 on the default stack — that env var is the gate. - Walk the `/admin/experiments/` dashboard (register + select + unregister — now reachable), the lease primitive (acquire → renew → release → expiry hand-off + list/filter), the deployment-scoped worker/group registry (wire `/v0/control/*` + `/admin/control/workers/` + `/admin/control/groups/` UI incl. 
reissue), `GET /v0/control/whoami`, and multi-experiment side-by-side (two experiments registered + leased, port/data-root isolation; cross-ref #147). **Gate:** every §4.3 surface ticked or `blocked by #__`; comment posted; surprises filed + triaged. diff --git a/docs/user-guide.md b/docs/user-guide.md index 9daaa305..72218be6 100644 --- a/docs/user-guide.md +++ b/docs/user-guide.md @@ -190,6 +190,18 @@ docker compose --env-file .env \ up -d --wait ``` +> **Multi-experiment / control-plane mode.** A `control-plane` service +> (chapter 11: experiment registry + orchestrator leases + state-sync) +> runs on every stack but is opt-in — set +> `EDEN_CONTROL_PLANE_URL=http://control-plane:8081` to flip the +> orchestrator into lease-driven mode and surface the web-ui's +> `/admin/experiments/` dashboard. The reference impl hosts one +> experiment per task-store-server; the lease lifecycle (including a +> lease-handoff chaos drill) is exercised by +> `reference/compose/healthcheck/smoke-multi-experiment.sh`. See +> [`reference/compose/README.md`](../reference/compose/README.md) +> "Multi-experiment mode" for details. + ### The orchestrator's quiescence-exit The orchestrator is tuned for CI: the default budget (`max_quiescent_iterations: 3`, and `30` in the smoke-injected configs) × 1s poll is seconds of zero progress before it exits 0. With a human at the keyboard this fires constantly. Since [issue #157](https://github.com/ealt/eden/issues/157) the budget is the experiment-config `max_quiescent_iterations` field (the `EDEN_MAX_QUIESCENT_ITERATIONS` env var was retired). Set it in your experiment-config YAML **before** running setup-experiment: diff --git a/reference/compose/.env.example b/reference/compose/.env.example index 82842e5c..f098d36c 100644 --- a/reference/compose/.env.example +++ b/reference/compose/.env.example @@ -86,6 +86,30 @@ EDEN_SESSION_SECRET=eden-dev-session-secret-change-me # Postgres connection URL the task-store-server uses. 
EDEN_STORE_URL=postgresql://eden:eden-dev-password-change-me@postgres:5432/eden +# --- Control plane (12c / issue #147) --- +# The control-plane-server is a first-class always-on Compose service. +# Its store is a SEPARATE logical database in the same Postgres +# instance (chapter 11 §3.4 Option A), created by the postgres init +# hook (reference/compose/init-control-plane-db.sh). +POSTGRES_DB_CONTROL_PLANE=eden_control_plane +EDEN_CONTROL_PLANE_STORE_URL=postgresql://eden:eden-dev-password-change-me@postgres:5432/eden_control_plane +# Host port the control-plane API is published on. +CONTROL_PLANE_HOST_PORT=8081 +# OPT-IN toggle: when EMPTY (default + the existing six smokes), the +# orchestrator + web-ui run single-experiment and ignore the control +# plane. When set to the in-network URL (http://control-plane:8081), +# the orchestrator CLI's env fallback flips it into the chapter-11 §5 +# lease-driven loop and the web-ui mounts /admin/experiments/. The +# lease-handoff smoke (smoke-multi-experiment.sh) sets this. +EDEN_CONTROL_PLANE_URL= +# Chapter 11 lease duration (seconds). Only meaningful in lease-driven +# mode. The lease-handoff smoke lowers this for a fast chaos drill. +EDEN_LEASE_DURATION_SECONDS=30 +# Control-plane state-sync poller cadence + stale-warning threshold +# (chapter 11 §3.2 / §3.4). 
+EDEN_STATE_SYNC_INTERVAL_SECONDS=30 +EDEN_STATE_SYNC_FAILURE_THRESHOLD=10 + # --- Ideation policy (issue #133) --- # The orchestrator reads the ideation policy from the experiment # config's `ideation_policy` block (see diff --git a/reference/compose/README.md b/reference/compose/README.md index 7e6e429b..a29d4960 100644 --- a/reference/compose/README.md +++ b/reference/compose/README.md @@ -31,6 +31,7 @@ See [`../../docs/operations/experiment-data-durability.md`](../../docs/operation | `executor-host` | `eden_executor_host` | Scripted executor worker; writes work commits | | `evaluator-host` | `eden_evaluator_host` | Scripted evaluator worker | | `web-ui` | `eden_web_ui` | Backend-for-frontend Web UI on `localhost:${WEB_UI_HOST_PORT}` | +| `control-plane` | `eden_control_plane_server` | Always-on chapter-11 control plane (registry + leases + state-sync) on `localhost:${CONTROL_PLANE_HOST_PORT:-8081}` | ### Setup-time services (10c) @@ -216,6 +217,38 @@ after the stack is down. See name. If you renamed the project, use `docker volume ls` to find the historical names. +## Multi-experiment mode (control plane) + +The `control-plane` service (chapter 11) is always-on, but it is +**opt-in** for the orchestrator and web-ui: with `EDEN_CONTROL_PLANE_URL` +empty (the default), the orchestrator runs single-experiment and the +web-ui hides the cross-experiment dashboard, so the control plane just +starts cleanly and idles. Setting `EDEN_CONTROL_PLANE_URL=http://control-plane:8081` +flips both into chapter-11 lease-driven mode (the orchestrator CLI reads +the env var as the fallback for `--control-plane-url`). + +The control plane is Postgres-backed: a separate `eden_control_plane` +database in the same instance (chapter 11 §3.4 Option A), created by the +[`init-control-plane-db.sh`](init-control-plane-db.sh) postgres init +hook. 
That hook runs **only on a fresh Postgres data dir** (upstream +image behavior), so on a data root that predates this feature the +database won't exist and `control-plane` will fail to start. To upgrade +an existing data root, either `docker compose down -v` + re-run +setup-experiment, or create the database manually: + +```bash +docker compose --env-file .env exec postgres \ + psql -U eden -d eden -c 'CREATE DATABASE eden_control_plane OWNER eden;' +``` + +The canonical reference for the lease lifecycle on this substrate is the +[`smoke-multi-experiment.sh`](healthcheck/smoke-multi-experiment.sh) +smoke (control-plane health, two lease-contending orchestrator replicas, +and a lease-handoff chaos drill). Note the reference impl hosts **one** +experiment per task-store-server; true multi-experiment hosting + +cross-experiment isolation is tracked in +[#254](https://github.com/ealt/eden/issues/254). + ## Security note `.env.example` ships intentionally weak development credentials diff --git a/reference/compose/compose.control-plane.yaml b/reference/compose/compose.control-plane.yaml deleted file mode 100644 index 6b9f03d2..00000000 --- a/reference/compose/compose.control-plane.yaml +++ /dev/null @@ -1,42 +0,0 @@ -name: eden-reference - -services: - web-ui: - command: - - python - - -m - - eden_web_ui - - --task-store-url - - http://task-store-server:8080 - - --experiment-id - - ${EDEN_EXPERIMENT_ID:?} - - --admin-token - - ${EDEN_ADMIN_TOKEN:?} - - --experiment-config - - /etc/eden/experiment-config.yaml - - --session-secret - - ${EDEN_SESSION_SECRET:?EDEN_SESSION_SECRET must be set (run setup-experiment)} - - --artifacts-dir - - /var/lib/eden/artifacts - - --repo-path - - /var/lib/eden/repo - - --forgejo-url - - ${FORGEJO_REMOTE_URL:?} - - --credential-helper - - /etc/eden/credential-helper.sh - - --clone-url - - ${FORGEJO_CLONE_URL_HOST:-http://localhost:${FORGEJO_HOST_PORT:-3001}/eden/${EDEN_EXPERIMENT_ID:?}.git} - - --base-commit-sha - - 
${EDEN_BASE_COMMIT_SHA:?EDEN_BASE_COMMIT_SHA must be set (run setup-experiment)} - - --worker-id - - ${EDEN_WEB_UI_WORKER_ID:-web-ui-1} - - --host - - 0.0.0.0 - - --port - - "8090" - - --log-level - - info - - --control-plane-url - - ${EDEN_CONTROL_PLANE_URL:?EDEN_CONTROL_PLANE_URL must be set when layering compose.control-plane.yaml} - - --control-plane-admin-token - - ${EDEN_ADMIN_TOKEN:?} diff --git a/reference/compose/compose.multi-experiment.yaml b/reference/compose/compose.multi-experiment.yaml new file mode 100644 index 00000000..34f1d009 --- /dev/null +++ b/reference/compose/compose.multi-experiment.yaml @@ -0,0 +1,88 @@ +# #147 overlay — second orchestrator replica in lease-driven mode. +# +# Layered on top of compose.yaml for the lease-handoff chaos smoke +# (smoke-multi-experiment.sh / the compose-smoke-multi-experiment CI +# job). It adds ONE service: a second orchestrator (`orchestrator-2`, +# worker_id `orchestrator-2`) so two replicas contend for the single +# registered experiment's lease. +# +# Both replicas run the chapter-11 §5 lease-driven loop only when +# EDEN_CONTROL_PLANE_URL is non-empty: the orchestrator CLI reads that +# env var as the fallback for --control-plane-url (issue #147), so the +# base-compose `orchestrator` AND this `orchestrator-2` both flip into +# lease mode from the same env var. With it empty this overlay is a +# plain second single-experiment orchestrator (the multi-orchestrator +# smoke's shape), but the smoke that layers this overlay always sets +# the URL. +# +# Structurally mirrors compose.multi-orchestrator.yaml's orchestrator-2 +# (own per-replica repo + credentials volumes; no log bind-mount — +# logs go to stdout / the json-file driver, read via `compose logs`). +# NOTE: no `logging: *eden-logging` here — YAML anchors do not cross +# compose files, and the default json-file driver is fine for a smoke. + +name: eden-reference + +services: + orchestrator-2: + image: eden-reference:dev + build: + context: ../.. 
+ dockerfile: reference/compose/Dockerfile + container_name: eden-orchestrator-2 + restart: "on-failure" + depends_on: + task-store-server: + condition: service_healthy + control-plane: + condition: service_healthy + command: + - python + - -m + - eden_orchestrator + - --task-store-url + - http://task-store-server:8080 + # Always passed; only a logging label in lease-driven mode. + - --experiment-id + - ${EDEN_EXPERIMENT_ID:?} + - --admin-token + - ${EDEN_ADMIN_TOKEN:?} + - --worker-id + - orchestrator-2 + - --repo-path + - /var/lib/eden/repo + - --forgejo-url + - ${FORGEJO_REMOTE_URL:?} + - --credential-helper + - /etc/eden/credential-helper.sh + - --experiment-config + - /etc/eden/experiment-config.yaml + - --poll-interval + - "1.0" + - --max-quiescent-iterations + - "${EDEN_MAX_QUIESCENT_ITERATIONS:-30}" + - --lease-duration-seconds + - "${EDEN_LEASE_DURATION_SECONDS:-30}" + - --termination-policy + - ${EDEN_TERMINATION_POLICY:-eden_dispatch.termination:default_termination_policy} + - --log-level + - info + environment: + # The lease-mode toggle (see header). Empty → single-experiment. + EDEN_CONTROL_PLANE_URL: ${EDEN_CONTROL_PLANE_URL:-} + EDEN_TERMINATION_MAX_VARIANTS: ${EDEN_TERMINATION_MAX_VARIANTS:-} + configs: + - source: eden-experiment-config + target: /etc/eden/experiment-config.yaml + volumes: + # Per-replica clones so neither orchestrator's git state collides + # with the other. Per-replica clones is the design posture; the + # wire (+ the control-plane lease) is the only synchronization + # point. 
+ - eden-orchestrator-2-repo:/var/lib/eden/repo + - eden-orchestrator-2-credentials:/var/lib/eden/credentials + - ${EDEN_FORGEJO_CREDS_DIR_HOST:?}/credential-helper.sh:/etc/eden/credential-helper.sh:ro + +volumes: + eden-orchestrator-2-repo: + eden-orchestrator-2-credentials: diff --git a/reference/compose/compose.yaml b/reference/compose/compose.yaml index 4c18e18d..e4b450dc 100644 --- a/reference/compose/compose.yaml +++ b/reference/compose/compose.yaml @@ -16,6 +16,13 @@ services: POSTGRES_DB: ${POSTGRES_DB:-eden} POSTGRES_USER: ${POSTGRES_USER:-eden} POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:?POSTGRES_PASSWORD must be set} + # #147: second logical database for the control-plane store + # (chapter 11 §3.4 Option A — same instance, separate database). + # Read by the init hook mounted below; the hook creates it only + # on a fresh data dir (upstream image behavior). See + # reference/compose/README.md "Multi-experiment mode" for the + # existing-data-root upgrade path. + POSTGRES_DB_CONTROL_PLANE: ${POSTGRES_DB_CONTROL_PLANE:-eden_control_plane} ports: - "${POSTGRES_HOST_PORT:-5433}:5432" volumes: @@ -24,6 +31,10 @@ services: # Desktop VM rebuilds, factory resets, and any other event that # destroys named-volume backing storage. See chapter 01 §13. - ${EDEN_EXPERIMENT_DATA_ROOT:?EDEN_EXPERIMENT_DATA_ROOT must be set (run setup-experiment)}/postgres:/var/lib/postgresql/data + # #147: create the control-plane database on first init. The + # upstream postgres image runs /docker-entrypoint-initdb.d/*.sh + # exactly once, on a fresh data dir. 
+ - ./init-control-plane-db.sh:/docker-entrypoint-initdb.d/01-control-plane-db.sh:ro healthcheck: test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-eden} -d ${POSTGRES_DB:-eden}"] interval: 5s @@ -230,6 +241,76 @@ services: retries: 20 start_period: 10s + # --------------------------------------------------------------- + # 12c / #147: deployment-level control plane (chapter 11) + # --------------------------------------------------------------- + # + # First-class always-on service (issue #147). The orchestrator and + # web-ui only TALK to it when EDEN_CONTROL_PLANE_URL is set in the + # env; with it empty (the default + the existing six smokes) they + # run single-experiment and ignore the control plane, so this + # container is exercised for clean startup but otherwise idle. + # State-sync poller is always on (--task-store-url set); with no + # registered experiments it no-ops. Store is Postgres-backed (the + # eden_control_plane database created by the postgres init hook) so + # lease + registry state survives a control-plane restart, per + # chapter 11 §3.4 Option A. + control-plane: + image: eden-reference:dev + build: + context: ../.. + dockerfile: reference/compose/Dockerfile + container_name: eden-control-plane + restart: unless-stopped + logging: *eden-logging + depends_on: + postgres: + condition: service_healthy + task-store-server: + condition: service_healthy + environment: + # Issue #109: file-handler companion to docker json-file driver. 
+ EDEN_LOG_DIR: /var/lib/eden/logs + command: + - python + - -m + - eden_control_plane_server + - --store-url + - ${EDEN_CONTROL_PLANE_STORE_URL:?EDEN_CONTROL_PLANE_STORE_URL must be set (run setup-experiment)} + - --host + - 0.0.0.0 + - --port + - "8081" + - --admin-token + - ${EDEN_ADMIN_TOKEN:?} + - --task-store-url + - http://task-store-server:8080 + - --lease-duration-seconds + - "${EDEN_LEASE_DURATION_SECONDS:-30}" + - --state-sync-interval-seconds + - "${EDEN_STATE_SYNC_INTERVAL_SECONDS:-30}" + - --state-sync-failure-threshold + - "${EDEN_STATE_SYNC_FAILURE_THRESHOLD:-10}" + - --log-level + - info + volumes: + # Issue #109: per-service log bind-mount (writable). setup-experiment + # creates the host-side dir. + - ${EDEN_EXPERIMENT_DATA_ROOT:?}/logs/control-plane:/var/lib/eden/logs + ports: + - "${CONTROL_PLANE_HOST_PORT:-8081}:8081" + healthcheck: + # /healthz lives outside /v0/control so it needs no auth header. + test: + - CMD + - curl + - -fsS + - http://localhost:8081/healthz + interval: 5s + timeout: 3s + retries: 20 + start_period: 10s + orchestrator: image: eden-reference:dev build: @@ -245,12 +326,22 @@ services: depends_on: task-store-server: condition: service_healthy + # #147: the control plane is always-on; wait for it so that when + # EDEN_CONTROL_PLANE_URL is set (lease-driven mode) the startup + # control-plane bootstrap doesn't race the server coming up. In + # single-experiment mode (URL empty) the orchestrator ignores it. + control-plane: + condition: service_healthy command: - python - -m - eden_orchestrator - --task-store-url - http://task-store-server:8080 + # --experiment-id is always passed; it is the active experiment in + # single-experiment mode and only a logging label in lease-driven + # mode (the CLI selects mode solely on --control-plane-url / the + # EDEN_CONTROL_PLANE_URL env fallback being set). See issue #147. 
- --experiment-id - ${EDEN_EXPERIMENT_ID:?} - --admin-token @@ -265,23 +356,47 @@ services: - /etc/eden/credential-helper.sh - --experiment-config - /etc/eden/experiment-config.yaml - # Issue #157: the single-experiment orchestrator reads its - # ideation_policy, termination_policy, and max_quiescent_iterations - # from the experiment-config YAML — not from CLI flags / env vars. - # The smoke scripts inject these fields into the config before - # bringing the stack up (max_quiescent_iterations: 30 reproduces the - # retired EDEN_MAX_QUIESCENT_ITERATIONS:-30 deployment default — 30 - # iterations × 1s poll = 30s of zero progress before quiescence-exit). - # Manual-UI sessions (a human at the keyboard) want a much higher - # value — set max_quiescent_iterations in the experiment-config YAML. - # See GitHub issue #98 for the design discussion (the structural fix + # The orchestrator reads the ideation_policy block from the + # experiment config to build the policy callable invoked when + # dispatch_mode.ideation_creation == "auto". Absent block → + # maintain_pending(target=3). + # Compose-friendly quiescence: 30 iterations × 1s = 30s of + # zero progress before the orchestrator declares the + # experiment done. Tuned for the smoke tests where worker + # hosts auto-claim within milliseconds. Manual-UI sessions + # (a human at the keyboard) want a much higher value — set + # EDEN_MAX_QUIESCENT_ITERATIONS in .env to override. See + # GitHub issue #98 for the design discussion (the structural fix # — orchestrator as cross-experiment infra — is deferred to # Phase 12c control plane). - --poll-interval - "1.0" + - --max-quiescent-iterations + - "${EDEN_MAX_QUIESCENT_ITERATIONS:-30}" + # #147: lease duration for chapter 11 lease-driven mode. Only + # meaningful when EDEN_CONTROL_PLANE_URL is set; harmless (parsed, + # unused) in single-experiment mode. 
+ - --lease-duration-seconds + - "${EDEN_LEASE_DURATION_SECONDS:-30}" + # 12a-3 wave 4: --termination-policy gates the new decision-type 0 + # branch. Default `default_termination_policy` returns + # `never_terminate`; deployments that want policy-driven + # termination point this at a configured factory (see + # docs/operations/experiment-lifecycle.md for the patterns). + - --termination-policy + - ${EDEN_TERMINATION_POLICY:-eden_dispatch.termination:default_termination_policy} - --log-level - info environment: + # 12a-3 wave 4: per-policy configuration. The + # `env_max_variants_policy` factory reads this; other + # deployment-supplied policies MAY follow the same convention. + EDEN_TERMINATION_MAX_VARIANTS: ${EDEN_TERMINATION_MAX_VARIANTS:-} + # #147: lease-driven (chapter 11 multi-experiment) mode toggle. + # Empty (default + the existing six smokes) → single-experiment + # mode, control plane ignored. Non-empty → the orchestrator CLI's + # env fallback sets --control-plane-url and the §5 lease loop runs. + EDEN_CONTROL_PLANE_URL: ${EDEN_CONTROL_PLANE_URL:-} # Issue #109: file-handler companion to docker json-file driver. EDEN_LOG_DIR: /var/lib/eden/logs configs: @@ -486,6 +601,15 @@ services: - --log-level - info environment: + # #147: when set, the web-ui CLI's env fallback supplies + # --control-plane-url and mounts the cross-experiment + # /admin/experiments/ dashboard (chapter 11). Empty (default + + # the existing six smokes) → the dashboard route stays 404 and + # the web-ui operates against the single --experiment-id. + # Replaces the retired compose.control-plane.yaml overlay. The + # control-plane admin token defaults to EDEN_ADMIN_TOKEN inside + # the CLI, so no separate token env is needed here. + EDEN_CONTROL_PLANE_URL: ${EDEN_CONTROL_PLANE_URL:-} # Issue #109: file-handler companion to docker json-file driver. 
EDEN_LOG_DIR: /var/lib/eden/logs configs: diff --git a/reference/compose/healthcheck/smoke-multi-experiment.sh b/reference/compose/healthcheck/smoke-multi-experiment.sh new file mode 100755 index 00000000..fe93b605 --- /dev/null +++ b/reference/compose/healthcheck/smoke-multi-experiment.sh @@ -0,0 +1,380 @@ +#!/usr/bin/env bash +set -euo pipefail + +# #147 — control-plane + lease-handoff smoke for the EDEN reference +# Compose stack. +# +# RE-SCOPED (see docs/plans/issue-147-compose-smoke-multi-experiment.md +# §0): the reference impl cannot host more than one experiment per +# deployment (single-experiment task-store-server), so the original +# "two experiments + cross-experiment isolation" smoke is deferred to +# issue #254. This smoke exercises the genuinely-new chapter-11 +# substrate surface that IS shippable today: +# +# - The control-plane-server running as a first-class Compose service +# (Postgres-backed; /healthz reachable). +# - ONE registered experiment driven through the control plane. +# - TWO orchestrator replicas (orchestrator, orchestrator-2) in +# lease-driven mode contending for that experiment's single lease. +# - The lease-singleton invariant: at most one replica holds the +# lease at any observed instant. +# - The lease-handoff chaos drill: kill the lease holder; the standby +# replica acquires the lease and the experiment still makes +# progress (>= 2 variant.integrated). +# - State-sync convergence: after an operator-driven termination, the +# control-plane's last_known_state for the experiment converges to +# "terminated". +# +# NOTE on termination: this smoke uses the OPERATOR-DRIVEN +# `terminate_experiment` wire op (an `admins` worker), NOT +# `dispatch_mode.termination = "auto"`. The orchestrator's auto- +# termination decision currently 403s under wire auth (terminate is +# admins-gated, the orchestrator is in `orchestrators`) — a pre-existing +# spec inter-chapter drift surfaced by this smoke and tracked in #256. 
+# Operator-driven termination is the supported path per 03-roles.md +# §6.2 ("Termination MAY occur via the operator-driven wire op +# regardless of dispatch_mode"). +# +# bash 3.2 compatible (no mapfile / readarray / associative arrays). + +for tool in docker jq curl python3; do + command -v "$tool" >/dev/null || { + echo "smoke-multi-experiment.sh requires '$tool' on PATH" >&2 + exit 2 + } +done +docker compose version >/dev/null || { + echo "smoke-multi-experiment.sh requires the 'docker compose' v2 plugin" >&2 + exit 2 +} + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +COMPOSE_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)" +REPO_ROOT="$(cd "${COMPOSE_DIR}/../.." && pwd)" + +cd "$COMPOSE_DIR" + +ENV_FILE="$(mktemp)" +# Per-run data root so re-runs don't trip the rotate-password trap +# (setup-experiment generates a fresh POSTGRES_PASSWORD each run, but a +# leftover bind-mounted postgres data dir bakes in the old one). See +# AGENTS.md "Compose smoke tests need explicit volume cleanup". +SMOKE_DATA_ROOT="$(mktemp -d -t eden-smoke-multi-XXXXXX)" +EXPERIMENT_ID="smoke-multi-exp" +COMPOSE_FILES=(-f compose.yaml -f compose.multi-experiment.yaml) + +cleanup() { + local rc=$? + if [[ $rc -ne 0 ]]; then + echo "--- cleanup: dumping compose state for diagnostics ---" >&2 + docker compose "${COMPOSE_FILES[@]}" --env-file "$ENV_FILE" \ + ps -a >&2 2>&1 || true + docker compose "${COMPOSE_FILES[@]}" --env-file "$ENV_FILE" \ + logs --tail 80 control-plane orchestrator orchestrator-2 >&2 2>&1 || true + fi + docker compose "${COMPOSE_FILES[@]}" --env-file "$ENV_FILE" \ + down -v >/dev/null 2>&1 || true + rm -f "$ENV_FILE" + rm -f "${COMPOSE_DIR}/experiment-config.yaml" + # Substrate bind-mount subdirs contain files owned by container + # uids the host doesn't match; delete via a root sibling container, + # then rmdir from the host. Defense-in-depth empty/`/` guard. 
+ if [[ -n "${SMOKE_DATA_ROOT:-}" \ + && "$SMOKE_DATA_ROOT" != "/" \ + && -d "$SMOKE_DATA_ROOT" ]]; then + docker run --rm -v "$SMOKE_DATA_ROOT:/cleanup" alpine:3.20 \ + sh -c 'find /cleanup -mindepth 1 -delete' >/dev/null 2>&1 || true + fi + rm -rf "$SMOKE_DATA_ROOT" || true + exit "$rc" +} +trap cleanup EXIT + +# --------------------------------------------------------------------- +# Phase 1 — provision the single experiment + pin lease-mode env +# --------------------------------------------------------------------- +echo "--- running setup-experiment ---" +bash "${REPO_ROOT}/reference/scripts/setup-experiment/setup-experiment.sh" \ + "${REPO_ROOT}/tests/fixtures/experiment/.eden/config.yaml" \ + --experiment-id "$EXPERIMENT_ID" \ + --env-file "$ENV_FILE" \ + --data-root "$SMOKE_DATA_ROOT" + +# Flip the orchestrators + web-ui into chapter-11 lease-driven mode and +# pick a fast lease for a snappy chaos drill. setup-experiment writes +# EDEN_CONTROL_PLANE_URL= (empty); replace it. The lease/state-sync +# knobs are not written by setup-experiment (compose defaults apply), +# so appending them is unambiguous. +sed -i.bak \ + "s|^EDEN_CONTROL_PLANE_URL=.*$|EDEN_CONTROL_PLANE_URL=http://control-plane:8081|" \ + "$ENV_FILE" +rm -f "${ENV_FILE}.bak" +cat >>"$ENV_FILE" <<'EOF' +EDEN_LEASE_DURATION_SECONDS=10 +EDEN_STATE_SYNC_INTERVAL_SECONDS=5 +EOF + +# Bound ideation so the run is finite (mirrors smoke-multi-orchestrator; +# edit the experiment-config YAML, not env, per issue #133). Termination +# is operator-driven below, so a small fixed budget is enough. 
+EXPERIMENT_CONFIG="${COMPOSE_DIR}/experiment-config.yaml" +cat >>"$EXPERIMENT_CONFIG" <<'YAML' +ideation_policy: + kind: fixed_total + total: 3 +YAML + +EDEN_ADMIN_TOKEN="$(grep -E '^EDEN_ADMIN_TOKEN=' "$ENV_FILE" | cut -d= -f2-)" +test -n "$EDEN_ADMIN_TOKEN" +EXP_BASE="/v0/experiments/${EXPERIMENT_ID}" + +# Admin-authenticated wire call against the task-store-server, issued +# from inside its container (no host port-guessing; works whether or +# not the host has curl). +call_ts() { + local method="$1" path="$2" body="${3:-}" + local args=( + -fsS -X "$method" + -H "Authorization: Bearer admin:${EDEN_ADMIN_TOKEN}" + -H "X-Eden-Experiment-Id: ${EXPERIMENT_ID}" + -H "Content-Type: application/json" + ) + [[ -n "$body" ]] && args+=(-d "$body") + args+=("http://localhost:8080${path}") + docker compose "${COMPOSE_FILES[@]}" --env-file "$ENV_FILE" \ + exec -T task-store-server curl "${args[@]}" +} + +# Worker-bearer wire call (for the §3.7 admins-gated terminate_experiment +# op, which rejects the literal `admin` bearer). +call_ts_worker() { + local method="$1" path="$2" bearer="$3" body="${4:-}" + local args=( + -fsS -X "$method" + -H "Authorization: Bearer ${bearer}" + -H "X-Eden-Experiment-Id: ${EXPERIMENT_ID}" + -H "Content-Type: application/json" + ) + [[ -n "$body" ]] && args+=(-d "$body") + args+=("http://localhost:8080${path}") + docker compose "${COMPOSE_FILES[@]}" --env-file "$ENV_FILE" \ + exec -T task-store-server curl "${args[@]}" +} + +# Admin-authenticated call against the control-plane, from inside its +# container (it listens on 8081). 
+call_cp() { + local method="$1" path="$2" body="${3:-}" + local args=( + -fsS -X "$method" + -H "Authorization: Bearer admin:${EDEN_ADMIN_TOKEN}" + -H "Content-Type: application/json" + ) + [[ -n "$body" ]] && args+=(-d "$body") + args+=("http://localhost:8081${path}") + docker compose "${COMPOSE_FILES[@]}" --env-file "$ENV_FILE" \ + exec -T control-plane curl "${args[@]}" +} + +# --------------------------------------------------------------------- +# Phase 2 — bring up the stack with the lease overlay +# --------------------------------------------------------------------- +echo "--- bringing up the full stack with the lease overlay ---" +docker compose "${COMPOSE_FILES[@]}" --env-file "$ENV_FILE" \ + up -d --wait --wait-timeout 300 + +echo "--- asserting control-plane /healthz ---" +docker compose "${COMPOSE_FILES[@]}" --env-file "$ENV_FILE" \ + exec -T control-plane curl -fsS http://localhost:8081/healthz >/dev/null || { + echo "control-plane /healthz did not return 200" >&2 + exit 1 + } + +# Register the experiment with the control plane so the lease-driven +# orchestrators pick it up (admin-gated; 201 first / 200 idempotent). +echo "--- registering experiment with the control plane ---" +call_cp POST /v0/control/experiments \ + "{\"experiment_id\":\"${EXPERIMENT_ID}\",\"config_uri\":\"file:///etc/eden/experiment-config.yaml\"}" \ + >/dev/null +call_cp GET "/v0/control/experiments/${EXPERIMENT_ID}" \ + | jq -e --arg id "$EXPERIMENT_ID" '.experiment_id == $id' >/dev/null || { + echo "control-plane registry does not list ${EXPERIMENT_ID}" >&2 + exit 1 + } + +# Pre-seed the task-store `orchestrators` group with both replica +# worker_ids. In single-experiment mode the orchestrator self-joins +# this group at startup; the chapter-11 lease-driven path only joins +# the CONTROL-PLANE orchestrators group, NOT the task-store one, so +# without this seeding the lease holder's §3.7-gated dispatch/integrate +# calls would 403. 
Tracked for a proper in-orchestrator fix under #254 +# (multi-experiment hosting). add_to_group is admin-gated + idempotent. +echo "--- seeding task-store orchestrators group with both replicas ---" +for wid in orchestrator orchestrator-2; do + call_ts POST "${EXP_BASE}/workers" "{\"worker_id\":\"${wid}\"}" >/dev/null + call_ts POST "${EXP_BASE}/groups/orchestrators/members" \ + "{\"member_id\":\"${wid}\"}" >/dev/null +done + +# Return the single lease holder for the experiment (empty if none). +lease_holder() { + call_cp GET "/v0/control/experiments/${EXPERIMENT_ID}" 2>/dev/null \ + | jq -r '.lease.holder // empty' +} + +# Count active leases for $EXPERIMENT_ID held by $1 (0 or 1). +leases_for_holder() { + call_cp GET "/v0/control/leases?holder=$1" 2>/dev/null \ + | jq --arg id "$EXPERIMENT_ID" \ + '[.leases[]? | select(.experiment_id == $id)] | length' +} + +echo "--- waiting for a single lease holder (lease-singleton invariant) ---" +HOLDER="" +deadline=$((SECONDS + 60)) +while [[ $SECONDS -lt $deadline ]]; do + HOLDER="$(lease_holder || true)" + if [[ "$HOLDER" = "orchestrator" || "$HOLDER" = "orchestrator-2" ]]; then + break + fi + sleep 2 +done +case "$HOLDER" in + orchestrator|orchestrator-2) ;; + *) + echo "no replica acquired the lease within 60s (holder='${HOLDER}')" >&2 + exit 1 + ;; +esac +n_total=$(( $(leases_for_holder orchestrator) + $(leases_for_holder orchestrator-2) )) +test "$n_total" -eq 1 || { + echo "lease-singleton violated: ${n_total} active leases across both replicas" >&2 + exit 1 +} +echo " lease held by: ${HOLDER}" + +# --------------------------------------------------------------------- +# Phase 3 — lease-handoff chaos drill +# --------------------------------------------------------------------- +# Kill AND remove the lease holder so the on-failure restart policy has +# nothing to restart; the standby MUST acquire the lease within +# lease_duration*2 + poll slack. 
+OTHER="orchestrator-2" +[[ "$HOLDER" = "orchestrator-2" ]] && OTHER="orchestrator" +echo "--- chaos: killing lease holder eden-${HOLDER}; expecting ${OTHER} to take over ---" +docker rm -f "eden-${HOLDER}" >/dev/null + +deadline=$((SECONDS + 45)) +NEW_HOLDER="" +while [[ $SECONDS -lt $deadline ]]; do + NEW_HOLDER="$(lease_holder || true)" + if [[ "$NEW_HOLDER" = "$OTHER" ]]; then + break + fi + sleep 2 +done +test "$NEW_HOLDER" = "$OTHER" || { + echo "lease did not hand off to ${OTHER} within 45s (holder='${NEW_HOLDER}')" >&2 + exit 1 +} +n_total=$(( $(leases_for_holder orchestrator) + $(leases_for_holder orchestrator-2) )) +test "$n_total" -eq 1 || { + echo "post-handoff lease-singleton violated: ${n_total} active leases" >&2 + exit 1 +} +echo " lease handed off to: ${NEW_HOLDER}" + +# --------------------------------------------------------------------- +# Phase 4 — the surviving replica drives the pipeline to >= 2 integrated +# --------------------------------------------------------------------- +echo "--- waiting for the surviving replica to integrate >= 2 variants ---" +deadline=$((SECONDS + 240)) +integrated=0 +while [[ $SECONDS -lt $deadline ]]; do + events="$(call_ts GET "${EXP_BASE}/events" || true)" + if [[ -n "$events" ]]; then + integrated="$(echo "$events" \ + | jq '[.events[]? | select(.type == "variant.integrated")] | length')" + if [[ "${integrated:-0}" -ge 2 ]]; then + break + fi + fi + sleep 3 +done +events="$(call_ts GET "${EXP_BASE}/events")" +integrated="$(echo "$events" \ + | jq '[.events[]? | select(.type == "variant.integrated")] | length')" +exec_completed="$(echo "$events" | jq '[.events[]? | select( + .type == "task.completed" and (.data.task_id | startswith("execution-")) + )] | length')" +eval_completed="$(echo "$events" | jq '[.events[]? 
| select( + .type == "task.completed" and (.data.task_id | startswith("evaluate-")) + )] | length')" +for name in integrated exec_completed eval_completed; do + count="${!name}" + if (( count < 2 )); then + echo "expected >= 2 ${name}; got ${count}" >&2 + exit 1 + fi +done +echo " integrated=${integrated} exec_completed=${exec_completed} eval_completed=${eval_completed}" + +# --------------------------------------------------------------------- +# Phase 5 — operator-driven termination + state-sync convergence +# --------------------------------------------------------------------- +# terminate_experiment is admins-group-gated and rejects the literal +# admin bearer, so register a throwaway worker, add it to `admins`, and +# call terminate with its worker bearer. (The orchestrator's OWN auto- +# termination decision can't do this under wire auth — see #256; the +# operator-driven op is the supported path.) +echo "--- operator-driven terminate (admins worker) ---" +TERM_ADMIN="smoke-term-admin" +REG_JSON="$(call_ts POST "${EXP_BASE}/workers" "{\"worker_id\":\"${TERM_ADMIN}\"}")" +TERM_TOKEN="$(echo "$REG_JSON" | jq -r '.registration_token // empty')" +if [[ -z "$TERM_TOKEN" ]]; then + REG_JSON="$(call_ts POST "${EXP_BASE}/workers/${TERM_ADMIN}/reissue-credential" "")" + TERM_TOKEN="$(echo "$REG_JSON" | jq -r '.registration_token // empty')" +fi +test -n "$TERM_TOKEN" +call_ts POST "${EXP_BASE}/groups/admins/members" \ + "{\"member_id\":\"${TERM_ADMIN}\"}" >/dev/null +call_ts_worker POST "${EXP_BASE}/terminate" \ + "${TERM_ADMIN}:${TERM_TOKEN}" '{"reason":"smoke-multi-experiment"}' >/dev/null + +echo "--- asserting experiment.terminated ---" +deadline=$((SECONDS + 60)) +terminated=0 +while [[ $SECONDS -lt $deadline ]]; do + events="$(call_ts GET "${EXP_BASE}/events" || true)" + if [[ -n "$events" ]]; then + n="$(echo "$events" \ + | jq '[.events[]? 
| select(.type == "experiment.terminated")] | length')" + if [[ "${n:-0}" -ge 1 ]]; then + terminated=1 + break + fi + fi + sleep 2 +done +test "$terminated" -eq 1 || { + echo "experiment did not reach experiment.terminated within 60s" >&2 + exit 1 +} + +echo "--- asserting control-plane state-sync convergence (last_known_state) ---" +deadline=$((SECONDS + 30)) +last_state="" +while [[ $SECONDS -lt $deadline ]]; do + last_state="$(call_cp GET "/v0/control/experiments/${EXPERIMENT_ID}" 2>/dev/null \ + | jq -r '.last_known_state // empty')" + if [[ "$last_state" = "terminated" ]]; then + break + fi + sleep 2 +done +test "$last_state" = "terminated" || { + echo "control-plane last_known_state did not converge to terminated (got '${last_state}')" >&2 + exit 1 +} + +echo "PASS" diff --git a/reference/compose/init-control-plane-db.sh b/reference/compose/init-control-plane-db.sh new file mode 100755 index 00000000..09e5f0ca --- /dev/null +++ b/reference/compose/init-control-plane-db.sh @@ -0,0 +1,27 @@ +#!/bin/sh +# Postgres init hook (issue #147): create the control-plane database. +# +# Mounted into /docker-entrypoint-initdb.d/ on the upstream +# postgres:16-alpine image. That directory's *.sh files run exactly +# ONCE, on a fresh data dir, after the server is up and $POSTGRES_DB +# has been created — but POSTGRES_DB only ever creates ONE database, +# so the control-plane's second logical database (chapter 11 §3.4 +# Option A) needs an explicit CREATE here. +# +# POSIX sh (the alpine image's /bin/sh is busybox ash; no bashisms). +set -eu + +DB_NAME="${POSTGRES_DB_CONTROL_PLANE:-eden_control_plane}" + +# CREATE DATABASE has no IF NOT EXISTS; gate on a catalog lookup so a +# re-run (e.g. a future image that re-runs the hook) is idempotent +# rather than erroring the whole init. 
+exists="$(psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -tAc \ + "SELECT 1 FROM pg_database WHERE datname = '${DB_NAME}'")" +if [ "$exists" = "1" ]; then + echo "init-control-plane-db: database ${DB_NAME} already exists; skipping" +else + echo "init-control-plane-db: creating database ${DB_NAME}" + psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" \ + -c "CREATE DATABASE \"${DB_NAME}\" OWNER \"$POSTGRES_USER\";" +fi diff --git a/reference/scripts/setup-experiment/setup-experiment.sh b/reference/scripts/setup-experiment/setup-experiment.sh index 23287750..0b888370 100755 --- a/reference/scripts/setup-experiment/setup-experiment.sh +++ b/reference/scripts/setup-experiment/setup-experiment.sh @@ -357,6 +357,7 @@ mkdir -p \ "${EDEN_EXPERIMENT_DATA_ROOT}/logs/executor-host" \ "${EDEN_EXPERIMENT_DATA_ROOT}/logs/evaluator-host" \ "${EDEN_EXPERIMENT_DATA_ROOT}/logs/web-ui" \ + "${EDEN_EXPERIMENT_DATA_ROOT}/logs/control-plane" \ "${EDEN_EXPERIMENT_DATA_ROOT}/loki" \ "${EDEN_EXPERIMENT_DATA_ROOT}/alloy" # Issue #110: loki/ + alloy/ are DERIVED / observability storage for the @@ -393,6 +394,7 @@ if ! chmod 0777 \ "${EDEN_EXPERIMENT_DATA_ROOT}/logs/executor-host" \ "${EDEN_EXPERIMENT_DATA_ROOT}/logs/evaluator-host" \ "${EDEN_EXPERIMENT_DATA_ROOT}/logs/web-ui" \ + "${EDEN_EXPERIMENT_DATA_ROOT}/logs/control-plane" \ "${EDEN_EXPERIMENT_DATA_ROOT}/loki" \ "${EDEN_EXPERIMENT_DATA_ROOT}/alloy" 2>/dev/null then @@ -499,6 +501,13 @@ EDEN_STORE_URL="postgresql://eden:${POSTGRES_PASSWORD_ENC}@postgres:5432/eden" EDEN_READONLY_PASSWORD_ENC="$(python3 -c 'import sys, urllib.parse; sys.stdout.write(urllib.parse.quote(sys.argv[1], safe=""))' "$EDEN_READONLY_PASSWORD")" EDEN_READONLY_STORE_URL="postgresql://eden_readonly:${EDEN_READONLY_PASSWORD_ENC}@postgres:5432/eden" +# #147: control-plane store DSN. Same Postgres instance as the task +# store, a SEPARATE logical database (chapter 11 §3.4 Option A) created +# by the postgres init hook (reference/compose/init-control-plane-db.sh). 
+# Same percent-encoded eden-superuser password as EDEN_STORE_URL. +POSTGRES_DB_CONTROL_PLANE="eden_control_plane" +EDEN_CONTROL_PLANE_STORE_URL="postgresql://eden:${POSTGRES_PASSWORD_ENC}@postgres:5432/${POSTGRES_DB_CONTROL_PLANE}" + ENV_TMP="$(mktemp)" cat >"$ENV_TMP" < dict[str, str]: + """Unauthenticated liveness probe. + + Lives outside the ``/v0/control`` prefix, so the §13 auth + middleware (which only guards ``/v0/control/...``) lets it + through. Mirrors the web-ui's ``/healthz`` so the Compose + healthcheck can poll the control-plane the same way. + """ + return {"status": "ok"} + if admin_token is not None: _install_auth_middleware(app, admin_token=admin_token, store=store) diff --git a/reference/services/control-plane/tests/test_server.py b/reference/services/control-plane/tests/test_server.py index 44f963c4..fbd22b12 100644 --- a/reference/services/control-plane/tests/test_server.py +++ b/reference/services/control-plane/tests/test_server.py @@ -37,6 +37,17 @@ def client_noauth(store: InMemoryControlPlaneStore) -> Iterator[TestClient]: # --------------------------------------------------------------------- +def test_healthz_unauthenticated_ok() -> None: + # /healthz lives outside /v0/control, so even with auth enabled the + # middleware lets it through unauthenticated (Compose healthcheck path). 
+ store = InMemoryControlPlaneStore() + app = make_app(store, admin_token="secret", lease_duration_seconds=30) + with TestClient(app) as c: + r = c.get("/healthz") + assert r.status_code == 200 + assert r.json() == {"status": "ok"} + + def test_register_experiment_creates_201(client_noauth: TestClient) -> None: r = client_noauth.post( "/v0/control/experiments", diff --git a/reference/services/orchestrator/src/eden_orchestrator/cli.py b/reference/services/orchestrator/src/eden_orchestrator/cli.py index 54dffb7a..2ad2ee49 100644 --- a/reference/services/orchestrator/src/eden_orchestrator/cli.py +++ b/reference/services/orchestrator/src/eden_orchestrator/cli.py @@ -4,6 +4,7 @@ import argparse import importlib +import os import re from eden_contracts import ExperimentConfig @@ -222,9 +223,15 @@ def parse_args(argv: list[str] | None = None) -> argparse.Namespace: ) parser.add_argument( "--control-plane-url", - default=None, + # Env fallback (#147): defaults to $EDEN_CONTROL_PLANE_URL, treating + # an empty value as unset. This lets a single Compose service + # definition flip between single- and multi-experiment mode purely + # via the env var (`${EDEN_CONTROL_PLANE_URL:-}` empty → single, + # non-empty → lease-driven) without a conditional command wrapper. + default=os.environ.get("EDEN_CONTROL_PLANE_URL") or None, help=( - "Optional control-plane base URL (e.g. 'http://control-plane:8081'). " + "Optional control-plane base URL (e.g. 'http://control-plane:8081'; " + "defaults to $EDEN_CONTROL_PLANE_URL, empty treated as unset). " "When set, the orchestrator subscribes to the chapter-11 §2 " "experiment registry and runs the §5 multi-experiment loop: " "acquires/renews a lease per registered experiment, drives " @@ -861,8 +868,6 @@ def _resolve_control_plane_admin_token(args) -> str | None: # noqa: ANN001 credential. Subsequent lease ops authenticate as the worker bearer returned by `bootstrap_control_plane_worker`. 
""" - import os - token = args.control_plane_admin_token or os.environ.get( "EDEN_CONTROL_PLANE_ADMIN_TOKEN" ) diff --git a/reference/services/web-ui/src/eden_web_ui/cli.py b/reference/services/web-ui/src/eden_web_ui/cli.py index d2b385ba..fe31cb80 100644 --- a/reference/services/web-ui/src/eden_web_ui/cli.py +++ b/reference/services/web-ui/src/eden_web_ui/cli.py @@ -240,10 +240,15 @@ def parse_args(argv: list[str] | None = None) -> argparse.Namespace: ) parser.add_argument( "--control-plane-url", - default=None, + # Env fallback (#147): defaults to $EDEN_CONTROL_PLANE_URL, treating + # an empty value as unset. Lets the Compose web-ui service flip the + # cross-experiment surface on/off purely via `${EDEN_CONTROL_PLANE_URL:-}` + # without a conditional command override (retires compose.control-plane.yaml). + default=os.environ.get("EDEN_CONTROL_PLANE_URL") or None, help=( "Optional control-plane base URL (e.g. " - "'http://control-plane:8081'). When set, the web-ui exposes " + "'http://control-plane:8081'; defaults to $EDEN_CONTROL_PLANE_URL, " + "empty treated as unset). When set, the web-ui exposes " "the cross-experiment admin views at /admin/experiments/ " "(chapter 11 §2 / §3 / §4) and a top-nav 'experiments' " "link. When unset, the cross-experiment surface is hidden "