Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 14 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,20 @@ Per-chunk entries preserve the full implementation record: contract amendments,

## [Unreleased]

### Fix: align spec prose + integrator manifest to the `evaluation` field name (issue #273)

Resolves the pre-existing drift the #122 entry surfaced (below, "Pre-existing drift surfaced"): the variant's evaluation-payload field is `evaluation` in `variant.schema.json` + the `eden_contracts.Variant` model + the storage/wire impl, but the spec **prose** spelled it `metrics` in several chapters and the integrator-manifest table. Two names for one field is a readability + onboarding hazard and a latent parity trap (the `schema-parity` job only checks schema↔model — both already say `evaluation` — so the prose `metrics` naming was unguarded). **Option 1** from the issue was selected: align prose + manifest to the schema/model `evaluation` naming, keeping the wire + on-tree manifest key stable (the reference integrator already emitted `evaluation` in `.eden/variants/<id>/evaluation.json` — see [`_manifest.py`](reference/packages/eden-git/src/eden_git/_manifest.py) — so this is a prose-and-docstring correction, not a manifest-shape change).

**Scope was wider than the issue's two named locations.** The issue body named `02-data-model.md` §9.1 and `06-integrator.md` §4.2, but a full audit (per the AGENTS.md "spec inter-chapter restatement is a conflict surface" pitfall — grep chapters 03/04/05/07/08 for the same field) found the same `metrics` field-name spelling restated across **eight** chapters. Renaming only the two named spots would have relocated the drift rather than removing it, so the fix renames every backtick'd field reference in lockstep:

- **Spec prose.** `02-data-model.md` §9.1 variant field table + §9.2 + the `evaluated_by` description; `03-roles.md` §4.2 (evaluator output) + §4.4 (the submission field, the variant-side write rule, the retry-exhausted no-graft rule, and the resubmission-equivalence formula); `04-task-protocol.md` §4.2 content-equivalence formula; `05-event-protocol.md` §6 (the read-the-entity boundary example); `06-integrator.md` §4.1 (validation prose) + §4.2 manifest-shape table; `07-wire-protocol.md` §11 reference-helper endpoint (`/_reference/.../validate/metrics` → `validate/evaluation`, which the impl already exposed); `08-storage.md` §4.1 + §4.3; `10-checkpoints.md` §13 (variant round-trip field list).
- **JSON Schema.** `variant.schema.json` `evaluated_by` description prose (the field key was already `evaluation`).
- **Impl docstrings + one error string.** `submissions.py` (`EvaluationSubmission` docstring), `variants.py` (`declare_variant_evaluation_error` docstring), and the `validate_acceptance` reason string (`"success submission requires metrics"` → `"… requires evaluation"`).

**Latent conformance bug fixed.** `conformance/scenarios/test_evaluator_submission.py` asserted `variant.get("metrics") is None` in two evaluation_error scenarios — but the wire variant uses `evaluation`, so those reads were always `None` and the assertions passed trivially without testing the no-graft guarantee they claimed. Corrected to `variant.get("evaluation")` so they actually verify the variant carries no evaluation payload; the surrounding docstrings (which quote the renamed §4.4 prose) were aligned too. A stale checkpoint round-trip fixture (`test_checkpoint_roundtrip.py`) and a hardening-test reason assertion (`test_store_hardening.py`) were updated to match.

**Deliberately left as-is.** The `baseline.metrics` **config** block (`02-data-model.md` §2.7, `experiment-config.schema.json`, `eden_contracts.config.BaselineConfig.metrics`) keeps the `metrics` name — it is a distinct config field that *writes into* `variant.evaluation`, and renaming it would touch the config surface (out of scope for Option 1, which keeps the wire/config stable). Plain-English / concept uses of "metric" (metric values §1.3, metric names in the evaluation schema §8, "objective over metrics") are left untouched — a metric is a real domain concept; the field that *holds* the metrics is `evaluation`. The generated `docs/conformance-coverage.md` (a non-CI-enforced snapshot last regenerated at #112) is **not** regenerated in this PR: it has drifted ~40 keyword lines since #122, so regenerating it now would dump unrelated churn into this focused rename. It will pick up the renamed prose on its next routine regeneration.

### Control-plane as a first-class Compose service + lease-handoff smoke (issue #147; re-scoped)

Backfills the Phase 12c CHANGELOG-narrated deferral of a `compose-smoke-multi-experiment` CI job. **Re-scoped during impl** (operator-authorized): the draft plan's headline — two experiments end-to-end with cross-experiment isolation asserted via wire reads — is **not buildable** on the reference impl, because it hosts exactly one experiment per deployment. Three sites enforce this: the task-store-server's `Store` binds a single `experiment_id` and the wire layer rejects any other (`ExperimentIdMismatch` at [`_dependencies.py:73`](reference/packages/eden-wire/src/eden_wire/_dependencies.py)); the orchestrator multi-experiment loop targets one task-store URL for all experiments ([`multi_loop.py`](reference/services/orchestrator/src/eden_orchestrator/multi_loop.py) `make_runtime_factory`); and the integrator is one shared bare repo deployment-wide ([`cli.py`](reference/services/orchestrator/src/eden_orchestrator/cli.py) `_build_runtime_factory`). 12c's multi-experiment surface was validated only against fake stores + the single-IUT conformance binding. **True multi-experiment hosting + the cross-experiment-isolation smoke are deferred to [#254](https://github.com/ealt/eden/issues/254)** (filed at re-scope time). This chunk ships the genuinely-new, genuinely-shippable substrate piece instead: the control plane as a first-class Compose service, plus a lease-lifecycle + lease-handoff chaos smoke.
Expand Down
40 changes: 20 additions & 20 deletions conformance/scenarios/test_evaluator_submission.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ def test_submit_with_mismatched_variant_id_rejected(wire_client: WireClient) ->

"An evaluator submits with: variant_id — the variant it evaluated."
The task store enforces this so an evaluator cannot misroute a
metrics result onto an unrelated variant.
evaluation result onto an unrelated variant.
"""
eval_tid, _variant_id = _drive_to_starting_variant(wire_client)
c = _seed.claim(wire_client, eval_tid)
Expand All @@ -44,9 +44,9 @@ def test_success_evaluation_outside_schema_must_not_complete_variant(
) -> None:
"""spec/v0/03-roles.md §4.2 — evaluation keys MUST be a subset of evaluation_schema.

"Produce a `metrics` object whose keys are a subset of the
"Produce an `evaluation` object whose keys are a subset of the
declared `evaluation_schema` keys." A conforming IUT MUST reject a
success submission whose metrics include a key the schema does
success submission whose evaluation includes a key the schema does
not declare; the variant MUST NOT terminalize as success. Where in
the pipeline the rejection surfaces is implementation-defined,
so the assertion checks the observable end-state.
Expand Down Expand Up @@ -130,7 +130,7 @@ def test_success_writes_variant_fields_post_accept(
"""spec/v0/03-roles.md §4.4 — accepted success writes evaluation + uri.

Asserts the §4.4 variant-side write rule: after /accept on a success
submission, the variant's `status == "success"`, `metrics`,
submission, the variant's `status == "success"`, `evaluation`,
`artifacts_uri`, and `completed_at` carry the submitted values,
and `variant.succeeded` is in the event log. The §4.4 atomicity
claim ("written atomically with the event") is asserted in
Expand Down Expand Up @@ -168,12 +168,12 @@ def test_success_writes_variant_fields_post_accept(
def test_status_error_writes_variant_evaluation_and_artifacts(
wire_client: WireClient,
) -> None:
"""spec/v0/03-roles.md §4.4 — status=error MUST write variant metrics + artifacts_uri.
"""spec/v0/03-roles.md §4.4 — status=error MUST write variant evaluation + artifacts_uri.

"metrics — set to the submission's `metrics` when status ∈
"evaluation — set to the submission's `evaluation` when status ∈
{'success', 'error'}." Distinct from the evaluation_error case (which
discards metrics): the §4.4 variant-side write rule is per-status,
and the error path keeps the metrics around because the variant
discards the evaluation): the §4.4 variant-side write rule is per-status,
and the error path keeps the evaluation around because the variant
DID run; only the run failed. The reject reason is
`worker_error` — `validation_error` would discard the payload
instead.
Expand Down Expand Up @@ -204,13 +204,13 @@ def test_status_error_writes_variant_evaluation_and_artifacts(
def test_eval_error_keeps_variant_starting_and_does_not_graft_evaluation(
wire_client: WireClient,
) -> None:
"""spec/v0/03-roles.md §4.4 — evaluation_error MUST keep variant in starting; metrics discarded.
"""spec/v0/03-roles.md §4.4 — evaluation_error keeps variant starting; evaluation discarded.

"When status == evaluation_error, the orchestrator MUST NOT write
metrics on the variant; any submission-carried metrics is
discarded." Observed: after submitting evaluation_error with metrics
evaluation on the variant; any submission-carried evaluation is
discarded." Observed: after submitting evaluation_error with an evaluation
and rejecting the task, the variant stays in `starting` and its
`metrics` field is unset.
`evaluation` field is unset.
"""
eval_tid, variant_id = _drive_to_starting_variant(wire_client)
c = _seed.claim(wire_client, eval_tid)
Expand All @@ -227,18 +227,18 @@ def test_eval_error_keeps_variant_starting_and_does_not_graft_evaluation(
assert 200 <= rejected.status_code < 300, rejected.text
variant = _seed.read_variant(wire_client, variant_id)
assert variant["status"] == "starting"
assert variant.get("metrics") is None, variant
assert variant.get("evaluation") is None, variant
assert variant.get("artifacts_uri") is None, variant


def test_retry_exhausted_eval_error_does_not_graft_prior_evaluation(
wire_client: WireClient,
) -> None:
"""spec/v0/03-roles.md §4.4 — retry-exhausted evaluation_error MUST NOT graft prior metrics.
"""spec/v0/03-roles.md §4.4 — retry-exhausted evaluation_error MUST NOT graft prior evaluation.

"On the retry-exhausted evaluation_error terminal transition itself,
the orchestrator MUST NOT graft metrics or artifacts from any
prior evaluation_error submission onto the variant; the variant's metrics
the orchestrator MUST NOT graft the evaluation payload or artifacts from any
prior evaluation_error submission onto the variant; the variant's evaluation
and artifacts_uri fields remain unset."
"""
eval_tid, variant_id = _drive_to_starting_variant(wire_client)
Expand All @@ -259,7 +259,7 @@ def test_retry_exhausted_eval_error_does_not_graft_prior_evaluation(
assert 200 <= decl.status_code < 300, decl.text
variant = _seed.read_variant(wire_client, variant_id)
assert variant["status"] == "evaluation_error"
assert variant.get("metrics") is None, variant
assert variant.get("evaluation") is None, variant
assert variant.get("artifacts_uri") is None, variant


Expand All @@ -269,8 +269,8 @@ def test_resubmit_idempotent_under_role_rules(
"""spec/v0/03-roles.md §4.4 — identical resubmission MUST be accepted; only one task.submitted.

Per the §4.4 amendment in this chunk: "identical normative
fields (`variant_id`, `status`, `metrics`) MUST be accepted." The
test holds all four fields (`variant_id`, `status`, `metrics`,
fields (`variant_id`, `status`, `evaluation`) MUST be accepted." The
test holds all four fields (`variant_id`, `status`, `evaluation`,
`artifacts_uri`) identical between the two submits as the
baseline-equivalence check; the artifacts_uri-non-equivalence
test below pins the half of the amendment that says
Expand Down Expand Up @@ -314,7 +314,7 @@ def test_resubmit_with_different_artifacts_uri_is_idempotent(
Per the §4.4 amendment in this chunk: "`artifacts_uri` is NOT
part of equivalence — the first submission's `artifacts_uri` is
the committed one." Two submits with identical
`variant_id`+`status`+`metrics` but DIFFERENT `artifacts_uri`
`variant_id`+`status`+`evaluation` but DIFFERENT `artifacts_uri`
MUST both return 200, MUST emit only one `task.submitted`, and
after `/accept` the variant's `artifacts_uri` MUST equal the
*first* submission's value.
Expand Down
1 change: 1 addition & 0 deletions docs/roadmap.md
Original file line number Diff line number Diff line change
Expand Up @@ -263,6 +263,7 @@ Units and chunking to be named closer to execution — too far ahead to estimate
- [12c](plans/eden-phase-12c-control-plane.md) — Control plane — **shipped 2026-05-19** (see [CHANGELOG](../CHANGELOG.md))
- [#145](plans/issue-145-per-route-store-swap.md) — Per-route store swapping for the experiment switcher (12c §3.6 backfill) — **shipped 2026-06-02** (see [CHANGELOG](../CHANGELOG.md))
- [#122](plans/issue-122-baseline-variant.md) — Evaluatable baseline variant (seed becomes a `kind="baseline"` Variant) — **shipped 2026-06-02** (see [CHANGELOG](../CHANGELOG.md))
- [#273](https://github.com/ealt/eden/issues/273) — Fix spec/impl drift: align prose + integrator manifest to the `evaluation` field name (Option 1) — **shipped 2026-06-03** (see [CHANGELOG](../CHANGELOG.md))

---

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -74,7 +74,7 @@ def _build_payload() -> dict[str, list[dict[str, object]]]:
"variant_commit_sha": "c" * 40,
"artifacts_uri": "checkpoint:sha256:" + sha256_hex(b"var-1 artifacts"),
"description": "first variant",
"metrics": {"accuracy": 0.95},
"evaluation": {"accuracy": 0.95},
"started_at": "2026-05-06T15:03:00Z",
"completed_at": "2026-05-06T15:05:00Z",
"executed_by": "executor-2",
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -980,7 +980,7 @@ def _validate_evaluate_acceptance(
if variant.commit_sha is None:
return f"variant {submission.variant_id!r} has no commit_sha"
if submission.evaluation is None:
return "success submission requires metrics (03-roles.md §4.4)"
return "success submission requires evaluation (03-roles.md §4.4)"
try:
self._validate_evaluation(submission.evaluation)
except InvalidPrecondition as exc:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -97,7 +97,7 @@ def create_variant(self, variant: Variant) -> None:
def declare_variant_evaluation_error(self, variant_id: str) -> None:
"""Retry-exhausted: ``starting → evaluation_error`` (``05-event-protocol.md`` §2.2).

Writes ``completed_at`` atomically; MUST NOT set metrics or
Writes ``completed_at`` atomically; MUST NOT set evaluation or
artifacts_uri (``03-roles.md`` §4.4).
"""
with self._atomic_operation():
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -13,12 +13,12 @@
- ``IdeaSubmission`` — status + idea_ids (set-equivalent
per 04 §4.2).
- ``VariantSubmission`` — status + variant_id + commit_sha (03 §3.4).
- ``EvaluationSubmission`` — status + variant_id + metrics + optional
artifacts_uri (03 §4.4, 04 §4.2 on metrics equivalence).
- ``EvaluationSubmission`` — status + variant_id + evaluation + optional
artifacts_uri (03 §4.4, 04 §4.2 on evaluation equivalence).

All three are ``frozen=True`` so callers cannot rebind fields after
construction. ``submit`` still deep-copies on entry and
``read_submission`` on exit, because the ``metrics`` dict on
``read_submission`` on exit, because the ``evaluation`` dict on
``EvaluationSubmission`` is not itself frozen.
"""

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -275,7 +275,7 @@ def test_evaluate_success_without_evaluation_routed_to_validation_error(
)
reason = store.validate_acceptance("t-eval")
assert reason is not None
assert "metrics" in reason
assert "evaluation" in reason

def test_driver_routes_malformed_success_to_validation_error(
self, make_store: Callable[..., Store]
Expand Down
6 changes: 3 additions & 3 deletions spec/v0/02-data-model.md
Original file line number Diff line number Diff line change
Expand Up @@ -381,15 +381,15 @@ A variant is one completed attempt.
| `artifacts_uri` | no | string (URI) | Where the variant's evaluator-produced artifacts live. Written by the orchestrator at evaluation-task terminal time from the evaluator's submission ([`03-roles.md`](03-roles.md) §4.4). |
| `executor_artifacts_uri` | no | string (URI) | Where the variant's executor-produced artifacts live (build logs, coverage reports, generated screenshots — output not appropriate to commit to the worker branch). Written by the orchestrator at execution-task terminal time from the executor's submission ([`03-roles.md`](03-roles.md) §3.4). Disjoint from `artifacts_uri`: the executor writes one, the evaluator writes the other, and the orchestrator preserves both. |
| `description` | no | string | Human-readable summary. |
| `metrics` | no | object | Evaluation payload; shape dictated by the experiment's evaluation schema. |
| `evaluation` | no | object | Evaluation payload; shape dictated by the experiment's evaluation schema. |
| `started_at` | yes | timestamp | When the executor began. |
| `completed_at` | no | timestamp | Set when the variant reaches a terminal status. Written exactly once by the orchestrator, atomically with the transition from `"starting"` to `"success"`, `"error"`, or `"evaluation_error"` (see [`04-task-protocol.md`](04-task-protocol.md) §4.3 and [`03-roles.md`](03-roles.md) §4.4). |
| `executed_by` | no | string (worker_id) | The executor's `worker_id`; written at execution-task submit time (atomically with the variant's status transition out of `"starting"`). |
| `evaluated_by` | no | string (worker_id) | The evaluator's `worker_id` whose metrics were committed; written at evaluation-task submit time. |
| `evaluated_by` | no | string (worker_id) | The evaluator's `worker_id` whose evaluation was committed; written at evaluation-task submit time. |

### 9.2 Evaluation payload

If present, `metrics` MUST be an object whose keys are a subset of the declared evaluation-schema keys and whose values match the declared types (§1.3) or are `null`. Because the evaluation-schema is per-experiment, [`schemas/variant.schema.json`](schemas/variant.schema.json) cannot express this constraint generically and leaves `metrics` as an open object; the per-metric type check is a runtime responsibility of the orchestrator. A conforming orchestrator MUST reject a evaluation payload that violates the experiment's evaluation-schema, and MUST NOT record the variant as `"success"` in that case.
If present, `evaluation` MUST be an object whose keys are a subset of the declared evaluation-schema keys and whose values match the declared types (§1.3) or are `null`. Because the evaluation-schema is per-experiment, [`schemas/variant.schema.json`](schemas/variant.schema.json) cannot express this constraint generically and leaves `evaluation` as an open object; the per-metric type check is a runtime responsibility of the orchestrator. A conforming orchestrator MUST reject a evaluation payload that violates the experiment's evaluation-schema, and MUST NOT record the variant as `"success"` in that case.

### 9.3 Status transitions

Expand Down
Loading
Loading