14 changes: 14 additions & 0 deletions .github/workflows/ci.yml
@@ -537,6 +537,17 @@ jobs:
working-directory: reference/compose
run: bash healthcheck/smoke-checkpoint.sh

# Backfill of the Phase 12c CHANGELOG-narrated deferral (issue #147),
# re-scoped: the reference impl cannot host >1 experiment per
# deployment (single-experiment task-store-server; cross-experiment
# isolation deferred to #254), so this exercises the control-plane as
# a first-class Compose service + the chapter-11 lease lifecycle and
# lease-handoff chaos drill on the deployed stack. Not required by
# branch protection in this PR; same posture as the other newly-added
# smoke jobs — bump to required-status after staying clean on main
# for ~2 weeks.
compose-smoke-multi-experiment:
name: compose-smoke-multi-experiment
# Issue #110: exercises the opt-in Loki + Alloy + Grafana log-search
# overlay (compose.logging.yaml) end-to-end — brings up base +
# subprocess + logging, asserts Loki ingests EDEN lines, Grafana is
@@ -566,6 +577,9 @@ jobs:
jq --version
python3 --version

- name: Run compose smoke (control-plane + lease-handoff drill)
working-directory: reference/compose
run: bash healthcheck/smoke-multi-experiment.sh
- name: Run compose smoke (log-search overlay)
working-directory: reference/compose
run: bash healthcheck/smoke-logging.sh
1 change: 1 addition & 0 deletions AGENTS.md
@@ -69,6 +69,7 @@ At Phase 10 chunk 10d follow-up A, markdown linting, JSON Schema validation, and
| `bash reference/compose/healthcheck/smoke-subprocess.sh` | Phase 10d subprocess-mode smoke (mirrors the `compose-smoke-subprocess` CI job; layers `compose.subprocess.yaml` over the base stack and runs against the fixture's `ideation.py` / `execution.py` / `evaluation.py`). |
| `bash reference/compose/healthcheck/smoke-subprocess-docker.sh` | Phase 10d follow-up A docker-mode smoke (mirrors the `compose-smoke-subprocess-docker` CI job; runs setup-experiment with `--exec-mode docker` so each `*_command` runs in a sibling container via DooD; asserts no orphan executor/evaluator containers post-quiescence + ideator-orphan reaped after `compose stop`). |
| `bash reference/compose/healthcheck/smoke-checkpoint.sh` | Phase 12b portable-checkpoint round-trip smoke (mirrors the `compose-smoke-checkpoint` CI job — runs setup-experiment + brings up the full stack + waits for quiescence, exports via `POST /v0/experiments/<id>/checkpoint`, tears down + wipes the data root, brings up only postgres + task-store-server against the same `.env` for an empty-store receiver, imports via `POST /v0/checkpoints/import`, asserts pre/post wire state matches + `imported_from` is stamped). Issue [#152](https://github.com/ealt/eden/issues/152). |
| `bash reference/compose/healthcheck/smoke-multi-experiment.sh` | Issue [#147](https://github.com/ealt/eden/issues/147) control-plane + lease-handoff smoke (mirrors the `compose-smoke-multi-experiment` CI job — brings up the always-on `control-plane` service + TWO orchestrator replicas in chapter-11 lease-driven mode against ONE registered experiment, asserts the lease-singleton invariant, kills the lease holder and asserts clean hand-off to the standby, then drives to `experiment.terminated` and asserts the control-plane `last_known_state` converges). Re-scoped from the original two-experiment smoke; cross-experiment isolation is deferred to [#254](https://github.com/ealt/eden/issues/254) (the reference impl is single-experiment per task-store-server). |
| `bash reference/compose/healthcheck/smoke-logging.sh` | Issue #110 log-search overlay smoke (mirrors the `compose-smoke-logging` CI job — runs setup-experiment, statically merge-gates the privileged `compose.logging-infra.yaml` overlay, brings up base + subprocess + logging, asserts Loki ingests EDEN lines + Grafana is healthy with the Loki datasource + `eden-explore` dashboard provisioned + a `{service="orchestrator"}` LogQL query returns ≥1 line; when a docker socket is reachable it also layers `compose.logging-infra.yaml` and asserts postgres stdout reaches Loki). |
| `uv run pytest -q -m docker` | Run the docker-backed `container_exec` integration tests (gated on a reachable docker daemon; skipped otherwise). |
| `bash reference/compose/healthcheck/e2e.sh` | Phase 10e end-to-end smoke (mirrors the `compose-e2e` CI job — staged bring-up, Web UI ideator walkthrough + admin-reclaim drill via `e2e_drive.py`, full-stack quiescence wait, termination drill). Requires `httpx` importable from `python3` — locally, prefix with `PATH="/path/to/.venv/bin:$PATH"` or activate the workspace venv. |
14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,20 @@ Per-chunk entries preserve the full implementation record: contract amendments,

## [Unreleased]

### Control-plane as a first-class Compose service + lease-handoff smoke (issue #147; re-scoped)

Backfills the Phase 12c CHANGELOG-narrated deferral of a `compose-smoke-multi-experiment` CI job. **Re-scoped during implementation** (operator-authorized): the draft plan's headline (two experiments end-to-end, with cross-experiment isolation asserted via wire reads) is **not buildable** on the reference implementation, because it hosts exactly one experiment per deployment. Three sites enforce this: the task-store-server's `Store` binds a single `experiment_id` and the wire layer rejects any other (`ExperimentIdMismatch` at [`_dependencies.py:73`](reference/packages/eden-wire/src/eden_wire/_dependencies.py)); the orchestrator multi-experiment loop targets one task-store URL for all experiments ([`multi_loop.py`](reference/services/orchestrator/src/eden_orchestrator/multi_loop.py) `make_runtime_factory`); and the integrator is one shared bare repo deployment-wide ([`cli.py`](reference/services/orchestrator/src/eden_orchestrator/cli.py) `_build_runtime_factory`). 12c's multi-experiment surface was validated only against fake stores + the single-IUT conformance binding. **True multi-experiment hosting + the cross-experiment-isolation smoke are deferred to [#254](https://github.com/ealt/eden/issues/254)** (filed at re-scope time). This chunk instead ships the piece of substrate that is genuinely new and shippable today: the control plane as a first-class Compose service, plus a lease-lifecycle + lease-handoff chaos smoke.

**Control-plane Compose service.** [`compose.yaml`](reference/compose/compose.yaml) gains an always-on `control-plane` service (Postgres-backed; chapter 11 §3.4 Option A: a separate `eden_control_plane` database in the same instance, created by the new [`init-control-plane-db.sh`](reference/compose/init-control-plane-db.sh) postgres init hook). The control-plane server ([`app.py`](reference/services/control-plane/src/eden_control_plane_server/app.py)) gains a `/healthz` endpoint (unauthenticated, outside `/v0/control`) for the container healthcheck. The service always runs, but consuming it is **opt-in**: the orchestrator and web-ui talk to it only when `EDEN_CONTROL_PLANE_URL` is non-empty, so the posture of the existing six Compose smokes is unchanged.

**Env-fallback instead of entrypoint wrappers.** Rather than the draft plan's bash wrapper scripts, the orchestrator + web-ui CLIs gained an `EDEN_CONTROL_PLANE_URL` env fallback for `--control-plane-url` (empty treated as unset), mirroring the existing `EDEN_CONTROL_PLANE_ADMIN_TOKEN` fallback. The orchestrator selects mode solely on `--control-plane-url` being set, and `--experiment-id` is harmless in lease mode (a logging label), so a single compose service definition flips between single- and multi-experiment mode purely via the env var — no wrapper scripts, no Dockerfile change. `compose.control-plane.yaml` (whose only content was web-ui flag-passing) is **deleted**; `docs/observability.md` §3.4 is rewritten to the first-class-service + env-toggle flow.
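
As an illustration of the env-fallback behaviour described above, here is a minimal sketch in Python. It is not the code in the orchestrator or web-ui CLIs: the `parse_args` and `env_or_none` helpers, the `sketch-cli` program name, and the `lease_mode` attribute are all hypothetical, and only the `--control-plane-url` / `--experiment-id` flags and the `EDEN_CONTROL_PLANE_URL` variable come from this entry.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence


def env_or_none(env: Mapping[str, str], name: str) -> Optional[str]:
    """Return env[name], treating a missing or empty value as unset."""
    return env.get(name, "") or None


def parse_args(
    argv: Sequence[str], env: Mapping[str, str] = os.environ
) -> argparse.Namespace:
    """Hypothetical CLI parser: the flag wins, the env var is the fallback."""
    parser = argparse.ArgumentParser(prog="sketch-cli")
    parser.add_argument(
        "--control-plane-url",
        default=env_or_none(env, "EDEN_CONTROL_PLANE_URL"),
    )
    parser.add_argument("--experiment-id", default=None)
    args = parser.parse_args(list(argv))
    # An explicitly empty flag value is normalised to "unset" as well.
    if not args.control_plane_url:
        args.control_plane_url = None
    # Hypothetical attribute: mode is chosen solely on the URL being set.
    args.lease_mode = args.control_plane_url is not None
    return args
```

With this shape, one service definition can pass the same command line in both modes and let the environment decide, which is the property the entry relies on.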

**Lease-handoff smoke.** New [`compose.multi-experiment.yaml`](reference/compose/compose.multi-experiment.yaml) overlay (a second `orchestrator-2` replica in lease mode) + [`smoke-multi-experiment.sh`](reference/compose/healthcheck/smoke-multi-experiment.sh) + the `compose-smoke-multi-experiment` CI job (unrequired initially; bump to required-status after ~2 weeks clean on main). The smoke brings up the control plane + two lease-contending replicas against one registered experiment, asserts the lease-singleton invariant, kills the lease holder and asserts clean hand-off to the standby, has the surviving replica drive the full pipeline to `≥2 variant.integrated`, then issues an **operator-driven** `terminate_experiment` and asserts the control-plane `last_known_state` converges to `terminated`. Validated locally end-to-end (PASS).
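
The two lease assertions in the smoke can be pictured with a small Python sketch. The real checks live in `smoke-multi-experiment.sh` and read state from the deployed stack; the data shape below (a mapping from replica name to "holds the lease") and both function names are assumptions made purely for illustration.

```python
from typing import Mapping


def lease_holder(holds_lease: Mapping[str, bool]) -> str:
    """Return the single replica holding the lease; raise if not exactly one."""
    holders = sorted(name for name, held in holds_lease.items() if held)
    if len(holders) != 1:
        raise AssertionError(
            f"lease-singleton invariant violated: holders={holders}"
        )
    return holders[0]


def assert_clean_handoff(
    before: Mapping[str, bool], after: Mapping[str, bool], killed: str
) -> str:
    """Check that `killed` held the lease and exactly one survivor took over."""
    if lease_holder(before) != killed:
        raise AssertionError(f"{killed!r} was not the lease holder before the kill")
    new_holder = lease_holder(after)
    if new_holder == killed:
        raise AssertionError(f"{killed!r} still holds the lease after the kill")
    return new_holder
```

For example, `assert_clean_handoff({"orchestrator": True, "orchestrator-2": False}, {"orchestrator-2": True}, killed="orchestrator")` returns `"orchestrator-2"`, while a snapshot with zero or two holders fails the singleton check.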

**Two pre-existing gaps surfaced by the smoke (filed, not fixed here).** (1) In lease-driven mode the orchestrator joins only the *control-plane* `orchestrators` group, not the *task-store* one (single-experiment mode self-joins via `_ensure_orchestrators_membership`; the multi-experiment path does not), so without seeding, the lease holder's §3.7-gated dispatch/integrate calls 403 — the smoke seeds the task-store group as a workaround; folded into [#254](https://github.com/ealt/eden/issues/254). (2) The orchestrator's *auto*-termination decision (`dispatch_mode.termination = "auto"`) 403s under wire auth because `terminate_experiment` is `admins`-gated while the orchestrator is in `orchestrators` (a spec inter-chapter drift between 03 §6.2 and 07 §2.9 / 04 §8.2, never caught because existing smokes use `never_terminate` + quiescence-exit and dispatch tests run auth-disabled) — filed as [#256](https://github.com/ealt/eden/issues/256). The smoke uses the supported operator-driven termination path instead.

**setup-experiment.** Emits the control-plane store DSN (`EDEN_CONTROL_PLANE_STORE_URL`, `POSTGRES_DB_CONTROL_PLANE`) + `EDEN_CONTROL_PLANE_URL=` (empty) and creates the `logs/control-plane` substrate dir. No `--register-additional-experiment` flag (that was the two-experiment path; deferred to #254). Closes #147.

### Evaluatable baseline variant (issue #122)

Elevates the experiment seed — the single commit on `main` at experiment start — to a first-class `kind == "baseline"` variant so operators have a "what did the seed score?" comparison anchor and the lineage tree has a colored root. Default-on (suppressed by `baseline.enabled: false`). Spans spec + schemas + contracts + storage + wire + dispatch + orchestrator (both modes) + web-ui.
38 changes: 11 additions & 27 deletions docs/observability.md
@@ -57,7 +57,7 @@ Notes:
- These are read views over the wire API. Filter changes do not mutate state.
- Reclaim / reassign / terminate / dispatch-mode toggles do mutate, behind CSRF + the same authorization model as the wire API.
- **Auth.** Every `/admin/*` page load requires the signed-in session's worker to be a transitive member of the `admins` group; non-admin sessions get a 403 forbidden page from the route-layer middleware (issue #144). The `setup-experiment.sh` script seeds the `admins` group with the web-ui's worker so the default Compose deployment already meets this requirement. Sign-ups created after a deployment is up are not added to `admins` by default and will hit the 403 page until an existing admin adds them via `/admin/groups/admins/`.
- `/admin/experiments/` is only mounted when the web-ui is started with `--control-plane-url`. The default Compose stack omits this flag; the route returns 404 ("page does not exist") until a control-plane-server is wired up. To enable it on the demo stack, run a sibling control-plane container and recreate the web-ui with the [`compose.control-plane.yaml`](../reference/compose/compose.control-plane.yaml) overlay — see [§3.4](#34-enabling-the-multi-experiment-control-plane).
- `/admin/experiments/` is only mounted when the web-ui is started with `--control-plane-url`. Since issue #147 the control-plane-server runs as a first-class always-on Compose service, but the web-ui's `--control-plane-url` is still opt-in via the `EDEN_CONTROL_PLANE_URL` env var (empty by default), so the route returns 404 ("page does not exist") on the default stack. To enable it, set `EDEN_CONTROL_PLANE_URL` and recreate the web-ui — see [§3.4](#34-enabling-the-multi-experiment-control-plane).

### 2.2 Forgejo Web UI (`http://localhost:3001`)

@@ -333,40 +333,24 @@ If you prefer a desktop tool over Adminer, connect directly to `localhost:5433`

### 3.4 Enabling the multi-experiment control plane

The `/admin/experiments/` route is gated on the web-ui being started with `--control-plane-url`. Phase 12c shipped the control-plane-server as a separate service that the default Compose stack does not start. To enable the route on a running demo stack:
Since issue #147 the control-plane-server is a **first-class always-on Compose service** (`control-plane`, port 8081), Postgres-backed (a separate `eden_control_plane` database in the same instance, created by the postgres init hook). `setup-experiment.sh` provisions its store DSN. So the control plane is already running on any `docker compose up` stack — what's opt-in is whether the **web-ui** talks to it.

The `/admin/experiments/` route is gated on the web-ui being started with `--control-plane-url`, which the web-ui CLI reads from the `EDEN_CONTROL_PLANE_URL` env var (empty by default → route stays 404). To enable the cross-experiment dashboard on a running stack:

```bash
# 1. Spin up control-plane-server as a sibling container on the eden network.
ADMIN=$(grep '^EDEN_ADMIN_TOKEN=' reference/compose/.env | cut -d= -f2)
docker run --rm -d --name eden-demo-control-plane \
--network eden-reference_default \
-p 8081:8081 \
eden-reference:dev \
python -m eden_control_plane_server \
--store-url ':memory:' \
--host 0.0.0.0 --port 8081 \
--admin-token "$ADMIN" \
--task-store-url http://task-store-server:8080

# 2. Tell the web-ui's overlay where to find it, then recreate web-ui only.
cd reference/compose
echo 'EDEN_CONTROL_PLANE_URL=http://eden-demo-control-plane:8081' >> .env
docker compose --env-file .env \
-f compose.yaml -f compose.control-plane.yaml \
up -d --force-recreate --no-deps web-ui
# Point the web-ui at the in-network control-plane service, then
# recreate web-ui only. The control-plane admin token defaults to
# EDEN_ADMIN_TOKEN inside the CLI, so no separate token is needed.
echo 'EDEN_CONTROL_PLANE_URL=http://control-plane:8081' >> .env
docker compose --env-file .env up -d --force-recreate --no-deps web-ui
```

`compose.control-plane.yaml` is an overlay that re-declares the web-ui's `command:` with the two extra flags (`--control-plane-url` + `--control-plane-admin-token`). Compose replaces (not merges) list-shaped command keys, so the overlay carries the full command — keep it in lockstep with `compose.yaml` if web-ui flags change.

`/admin/experiments/` now resolves (303 → sign-in if you're not authenticated, 200 once you are). The control-plane API itself listens at `http://localhost:8081/v0/control/*` (14 endpoints; bearer-authed with the same admin token); fetch its `/openapi.json` with the bearer for the full surface.

State-sync caveat: the demo command above uses `:memory:` storage, so the control-plane forgets its experiment registry on container restart. For a persistent deployment you'd point `--store-url` at Postgres (a separate database / schema from the task store), and likely run it as a first-class Compose service rather than a sibling container.
`/admin/experiments/` now resolves (303 → sign-in if you're not authenticated, 200 once you are). The control-plane API itself listens at `http://localhost:${CONTROL_PLANE_HOST_PORT:-8081}/v0/control/*` (bearer-authed with the same admin token; `/healthz` is unauthenticated); fetch its `/openapi.json` with the bearer for the full surface. Note the registry is empty until an experiment is registered via `POST /v0/control/experiments` (the lease-handoff smoke and a lease-driven orchestrator do this).
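
To make the authenticated/unauthenticated split concrete, here is a hedged Python sketch that only builds the requests, without sending them. The `control_plane_request` helper is hypothetical, and the `Authorization: Bearer <token>` header form is an assumption based on the "bearer-authed" wording above; the paths and the default port 8081 are taken from this section.

```python
import urllib.request
from typing import Optional


def control_plane_request(
    base_url: str, path: str, admin_token: Optional[str] = None
) -> urllib.request.Request:
    """Build (but do not send) a GET request to the control plane.

    Pass admin_token for bearer-authed paths such as /v0/control/* and
    /openapi.json; omit it for the unauthenticated /healthz probe.
    """
    request = urllib.request.Request(base_url.rstrip("/") + path, method="GET")
    if admin_token:
        # Assumed header shape for bearer auth.
        request.add_header("Authorization", f"Bearer {admin_token}")
    return request


# Unauthenticated liveness probe.
health = control_plane_request("http://localhost:8081", "/healthz")
# Bearer-authed registry listing; "example-token" is a placeholder value.
listing = control_plane_request(
    "http://localhost:8081/", "/v0/control/experiments", "example-token"
)
```

Passing either object to `urllib.request.urlopen` would perform the call against a running stack; on a fresh deployment the registry listing is expected to be empty, as noted above.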

Tear down:
Tear down (revert the web-ui to the no-control-plane command):

```bash
docker rm -f eden-demo-control-plane
# Optional: revert the web-ui to the no-control-plane command.
cd reference/compose
sed -i.bak '/^EDEN_CONTROL_PLANE_URL=/d' .env && rm .env.bak
docker compose --env-file .env up -d --force-recreate --no-deps web-ui
4 changes: 2 additions & 2 deletions docs/plans/issue-110-loki-grafana-overlay.md
@@ -52,7 +52,7 @@ These are the load-bearing calls this plan makes. They are defensible defaults,

### 3.1 New overlay: `reference/compose/compose.logging.yaml`

Sibling to `compose.subprocess.yaml` / `compose.docker-exec.yaml` / `compose.control-plane.yaml` / `compose.multi-orchestrator.yaml`. Layered as:
Sibling to `compose.subprocess.yaml` / `compose.docker-exec.yaml` / `compose.multi-orchestrator.yaml` / `compose.multi-experiment.yaml`. Layered as:

```bash
cd reference/compose
@@ -166,7 +166,7 @@ No renames of existing identifiers. New identifiers introduced (validated agains

| New identifier | Kind | Convention followed |
|---|---|---|
| `compose.logging.yaml` / `compose.logging-infra.yaml` | overlay files | `compose.<concern>.yaml` (matches `compose.subprocess.yaml`, `compose.docker-exec.yaml`, `compose.control-plane.yaml`, `compose.multi-orchestrator.yaml`) |
| `compose.logging.yaml` / `compose.logging-infra.yaml` | overlay files | `compose.<concern>.yaml` (matches `compose.subprocess.yaml`, `compose.docker-exec.yaml`, `compose.multi-orchestrator.yaml`, `compose.multi-experiment.yaml`) |
| `loki` / `alloy` / `grafana` | compose service names | upstream tool names, lowercase (matches `forgejo`, `postgres`) |
| `EDEN_GRAFANA_ADMIN_PASSWORD` | env var (secret) | `EDEN_<THING>_<ROLE>` (matches `EDEN_READONLY_PASSWORD`, `EDEN_ADMIN_TOKEN`, `EDEN_SESSION_SECRET`) |
| `EDEN_LOGGING_DOCKER_GID` | env var (infra-overlay required) | `EDEN_<CONCERN>_<THING>` — distinct from docker-exec's `EDEN_DOCKER_GID` so the infra overlay fails fast instead of inheriting the default `0` (§3.4) |