ealt · ealt · Jun 4, 2026 · Jun 3, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,6 +8,20 @@ Per-chunk entries preserve the full implementation record: contract amendments,
 
 ## [Unreleased]
 
+### Fix: align spec prose + integrator manifest to the `evaluation` field name (issue #273)
+
+Resolves the pre-existing drift the #122 entry surfaced (below, "Pre-existing drift surfaced"): the variant's evaluation-payload field is `evaluation` in `variant.schema.json` + the `eden_contracts.Variant` model + the storage/wire impl, but the spec **prose** spelled it `metrics` in several chapters and the integrator-manifest table. Two names for one field is a readability + onboarding hazard and a latent parity trap (the `schema-parity` job only checks schema↔model — both already say `evaluation` — so the prose `metrics` naming was unguarded). **Option 1** from the issue was selected: align prose + manifest to the schema/model `evaluation` naming, keeping the wire + on-tree manifest key stable (the reference integrator already emitted `evaluation` in `.eden/variants/<id>/evaluation.json` — see [`_manifest.py`](reference/packages/eden-git/src/eden_git/_manifest.py) — so this is a prose-and-docstring correction, not a manifest-shape change).
+
+**Scope was wider than the issue's two named locations.** The issue body named `02-data-model.md` §9.1 and `06-integrator.md` §4.2, but a full audit (per the AGENTS.md "spec inter-chapter restatement is a conflict surface" pitfall — grep chapters 03/04/05/07/08 for the same field) found the same `metrics` field-name spelling restated across **eight** chapters. Renaming only the two named spots would have relocated the drift rather than removing it, so the fix renames every backtick'd field reference in lockstep:
+
+- **Spec prose.** `02-data-model.md` §9.1 variant field table + §9.2 + the `evaluated_by` description; `03-roles.md` §4.2 (evaluator output) + §4.4 (the submission field, the variant-side write rule, the retry-exhausted no-graft rule, and the resubmission-equivalence formula); `04-task-protocol.md` §4.2 content-equivalence formula; `05-event-protocol.md` §6 (the read-the-entity boundary example); `06-integrator.md` §4.1 (validation prose) + §4.2 manifest-shape table; `07-wire-protocol.md` §11 reference-helper endpoint (`/_reference/.../validate/metrics` → `validate/evaluation`, which the impl already exposed); `08-storage.md` §4.1 + §4.3; `10-checkpoints.md` §13 (variant round-trip field list).
+- **JSON Schema.** `variant.schema.json` `evaluated_by` description prose (the field key was already `evaluation`).
+- **Impl docstrings + one error string.** `submissions.py` (`EvaluationSubmission` docstring), `variants.py` (`declare_variant_evaluation_error` docstring), and the `validate_acceptance` reason string (`"success submission requires metrics"` → `"… requires evaluation"`).
+
+**Latent conformance bug fixed.** `conformance/scenarios/test_evaluator_submission.py` asserted `variant.get("metrics") is None` in two evaluation_error scenarios — but the wire variant uses `evaluation`, so those reads were always `None` and the assertions passed trivially without testing the no-graft guarantee they claimed. Corrected to `variant.get("evaluation")` so they actually verify the variant carries no evaluation payload; the surrounding docstrings (which quote the renamed §4.4 prose) were aligned too. A stale checkpoint round-trip fixture (`test_checkpoint_roundtrip.py`) and a hardening-test reason assertion (`test_store_hardening.py`) were updated to match.
+
+**Deliberately left as-is.** The `baseline.metrics` **config** block (`02-data-model.md` §2.7, `experiment-config.schema.json`, `eden_contracts.config.BaselineConfig.metrics`) keeps the `metrics` name — it is a distinct config field that *writes into* `variant.evaluation`, and renaming it would touch the config surface (out of scope for Option 1, which keeps the wire/config stable). Plain-English / concept uses of "metric" (metric values §1.3, metric names in the evaluation schema §8, "objective over metrics") are left untouched — a metric is a real domain concept; the field that *holds* the metrics is `evaluation`. The generated `docs/conformance-coverage.md` (a non-CI-enforced snapshot last regenerated at #112) is **not** regenerated in this PR: it has drifted ~40 keyword lines since #122, so regenerating it now would dump unrelated churn into this focused rename. It will pick up the renamed prose on its next routine regeneration.
+
 ### Control-plane as a first-class Compose service + lease-handoff smoke (issue #147; re-scoped)
 
 Backfills the Phase 12c CHANGELOG-narrated deferral of a `compose-smoke-multi-experiment` CI job. **Re-scoped during impl** (operator-authorized): the draft plan's headline — two experiments end-to-end with cross-experiment isolation asserted via wire reads — is **not buildable** on the reference impl, because it hosts exactly one experiment per deployment. Three sites enforce this: the task-store-server's `Store` binds a single `experiment_id` and the wire layer rejects any other (`ExperimentIdMismatch` at [`_dependencies.py:73`](reference/packages/eden-wire/src/eden_wire/_dependencies.py)); the orchestrator multi-experiment loop targets one task-store URL for all experiments ([`multi_loop.py`](reference/services/orchestrator/src/eden_orchestrator/multi_loop.py) `make_runtime_factory`); and the integrator is one shared bare repo deployment-wide ([`cli.py`](reference/services/orchestrator/src/eden_orchestrator/cli.py) `_build_runtime_factory`). 12c's multi-experiment surface was validated only against fake stores + the single-IUT conformance binding. **True multi-experiment hosting + the cross-experiment-isolation smoke are deferred to [#254](https://github.com/ealt/eden/issues/254)** (filed at re-scope time). This chunk ships the genuinely-new, genuinely-shippable substrate piece instead: the control plane as a first-class Compose service, plus a lease-lifecycle + lease-handoff chaos smoke.

diff --git a/conformance/scenarios/test_evaluator_submission.py b/conformance/scenarios/test_evaluator_submission.py
@@ -24,7 +24,7 @@ def test_submit_with_mismatched_variant_id_rejected(wire_client: WireClient) ->
 
     "An evaluator submits with: variant_id — the variant it evaluated."
     The task store enforces this so an evaluator cannot misroute a
-    metrics result onto an unrelated variant.
+    evaluation result onto an unrelated variant.
     """
     eval_tid, _variant_id = _drive_to_starting_variant(wire_client)
     c = _seed.claim(wire_client, eval_tid)
@@ -44,9 +44,9 @@ def test_success_evaluation_outside_schema_must_not_complete_variant(
 ) -> None:
     """spec/v0/03-roles.md §4.2 — evaluation keys MUST be a subset of evaluation_schema.
 
-    "Produce a `metrics` object whose keys are a subset of the
+    "Produce an `evaluation` object whose keys are a subset of the
     declared `evaluation_schema` keys." A conforming IUT MUST reject a
-    success submission whose metrics include a key the schema does
+    success submission whose evaluation includes a key the schema does
     not declare; the variant MUST NOT terminalize as success. Where in
     the pipeline the rejection surfaces is implementation-defined,
     so the assertion checks the observable end-state.
@@ -130,7 +130,7 @@ def test_success_writes_variant_fields_post_accept(
     """spec/v0/03-roles.md §4.4 — accepted success writes evaluation + uri.
 
     Asserts the §4.4 variant-side write rule: after /accept on a success
-    submission, the variant's `status == "success"`, `metrics`,
+    submission, the variant's `status == "success"`, `evaluation`,
     `artifacts_uri`, and `completed_at` carry the submitted values,
     and `variant.succeeded` is in the event log. The §4.4 atomicity
     claim ("written atomically with the event") is asserted in
@@ -168,12 +168,12 @@ def test_success_writes_variant_fields_post_accept(
 def test_status_error_writes_variant_evaluation_and_artifacts(
     wire_client: WireClient,
 ) -> None:
-    """spec/v0/03-roles.md §4.4 — status=error MUST write variant metrics + artifacts_uri.
+    """spec/v0/03-roles.md §4.4 — status=error MUST write variant evaluation + artifacts_uri.
 
-    "metrics — set to the submission's `metrics` when status ∈
+    "evaluation — set to the submission's `evaluation` when status ∈
     {'success', 'error'}." Distinct from the evaluation_error case (which
-    discards metrics): the §4.4 variant-side write rule is per-status,
-    and the error path keeps the metrics around because the variant
+    discards the evaluation): the §4.4 variant-side write rule is per-status,
+    and the error path keeps the evaluation around because the variant
     DID run; only the run failed. The reject reason is
     `worker_error` — `validation_error` would discard the payload
     instead.
@@ -204,13 +204,13 @@ def test_status_error_writes_variant_evaluation_and_artifacts(
 def test_eval_error_keeps_variant_starting_and_does_not_graft_evaluation(
     wire_client: WireClient,
 ) -> None:
-    """spec/v0/03-roles.md §4.4 — evaluation_error MUST keep variant in starting; metrics discarded.
+    """spec/v0/03-roles.md §4.4 — evaluation_error keeps variant starting; evaluation discarded.
 
     "When status == evaluation_error, the orchestrator MUST NOT write
-    metrics on the variant; any submission-carried metrics is
-    discarded." Observed: after submitting evaluation_error with metrics
+    evaluation on the variant; any submission-carried evaluation is
+    discarded." Observed: after submitting evaluation_error with an evaluation
     and rejecting the task, the variant stays in `starting` and its
-    `metrics` field is unset.
+    `evaluation` field is unset.
     """
     eval_tid, variant_id = _drive_to_starting_variant(wire_client)
     c = _seed.claim(wire_client, eval_tid)
@@ -227,18 +227,18 @@ def test_eval_error_keeps_variant_starting_and_does_not_graft_evaluation(
     assert 200 <= rejected.status_code < 300, rejected.text
     variant = _seed.read_variant(wire_client, variant_id)
     assert variant["status"] == "starting"
-    assert variant.get("metrics") is None, variant
+    assert variant.get("evaluation") is None, variant
     assert variant.get("artifacts_uri") is None, variant
 
 
 def test_retry_exhausted_eval_error_does_not_graft_prior_evaluation(
     wire_client: WireClient,
 ) -> None:
-    """spec/v0/03-roles.md §4.4 — retry-exhausted evaluation_error MUST NOT graft prior metrics.
+    """spec/v0/03-roles.md §4.4 — retry-exhausted evaluation_error MUST NOT graft prior evaluation.
 
     "On the retry-exhausted evaluation_error terminal transition itself,
-    the orchestrator MUST NOT graft metrics or artifacts from any
-    prior evaluation_error submission onto the variant; the variant's metrics
+    the orchestrator MUST NOT graft the evaluation payload or artifacts from any
+    prior evaluation_error submission onto the variant; the variant's evaluation
     and artifacts_uri fields remain unset."
     """
     eval_tid, variant_id = _drive_to_starting_variant(wire_client)
@@ -259,7 +259,7 @@ def test_retry_exhausted_eval_error_does_not_graft_prior_evaluation(
     assert 200 <= decl.status_code < 300, decl.text
     variant = _seed.read_variant(wire_client, variant_id)
     assert variant["status"] == "evaluation_error"
-    assert variant.get("metrics") is None, variant
+    assert variant.get("evaluation") is None, variant
     assert variant.get("artifacts_uri") is None, variant
 
 
@@ -269,8 +269,8 @@ def test_resubmit_idempotent_under_role_rules(
     """spec/v0/03-roles.md §4.4 — identical resubmission MUST be accepted; only one task.submitted.
 
     Per the §4.4 amendment in this chunk: "identical normative
-    fields (`variant_id`, `status`, `metrics`) MUST be accepted." The
-    test holds all four fields (`variant_id`, `status`, `metrics`,
+    fields (`variant_id`, `status`, `evaluation`) MUST be accepted." The
+    test holds all four fields (`variant_id`, `status`, `evaluation`,
     `artifacts_uri`) identical between the two submits as the
     baseline-equivalence check; the artifacts_uri-non-equivalence
     test below pins the half of the amendment that says
@@ -314,7 +314,7 @@ def test_resubmit_with_different_artifacts_uri_is_idempotent(
     Per the §4.4 amendment in this chunk: "`artifacts_uri` is NOT
     part of equivalence — the first submission's `artifacts_uri` is
     the committed one." Two submits with identical
-    `variant_id`+`status`+`metrics` but DIFFERENT `artifacts_uri`
+    `variant_id`+`status`+`evaluation` but DIFFERENT `artifacts_uri`
     MUST both return 200, MUST emit only one `task.submitted`, and
     after `/accept` the variant's `artifacts_uri` MUST equal the
     *first* submission's value.

diff --git a/docs/roadmap.md b/docs/roadmap.md
@@ -263,6 +263,7 @@ Units and chunking to be named closer to execution — too far ahead to estimate
 - [12c](plans/eden-phase-12c-control-plane.md) — Control plane — **shipped 2026-05-19** (see [CHANGELOG](../CHANGELOG.md))
 - [#145](plans/issue-145-per-route-store-swap.md) — Per-route store swapping for the experiment switcher (12c §3.6 backfill) — **shipped 2026-06-02** (see [CHANGELOG](../CHANGELOG.md))
 - [#122](plans/issue-122-baseline-variant.md) — Evaluatable baseline variant (seed becomes a `kind="baseline"` Variant) — **shipped 2026-06-02** (see [CHANGELOG](../CHANGELOG.md))
+- [#273](https://github.com/ealt/eden/issues/273) — Fix spec/impl drift: align prose + integrator manifest to the `evaluation` field name (Option 1) — **shipped 2026-06-03** (see [CHANGELOG](../CHANGELOG.md))
 
 ---
 

diff --git a/reference/packages/eden-checkpoint/tests/test_checkpoint_roundtrip.py b/reference/packages/eden-checkpoint/tests/test_checkpoint_roundtrip.py
@@ -74,7 +74,7 @@ def _build_payload() -> dict[str, list[dict[str, object]]]:
                 "variant_commit_sha": "c" * 40,
                 "artifacts_uri": "checkpoint:sha256:" + sha256_hex(b"var-1 artifacts"),
                 "description": "first variant",
-                "metrics": {"accuracy": 0.95},
+                "evaluation": {"accuracy": 0.95},
                 "started_at": "2026-05-06T15:03:00Z",
                 "completed_at": "2026-05-06T15:05:00Z",
                 "executed_by": "executor-2",

diff --git a/reference/packages/eden-storage/src/eden_storage/_ops/tasks_lifecycle.py b/reference/packages/eden-storage/src/eden_storage/_ops/tasks_lifecycle.py
@@ -980,7 +980,7 @@ def _validate_evaluate_acceptance(
         if variant.commit_sha is None:
             return f"variant {submission.variant_id!r} has no commit_sha"
         if submission.evaluation is None:
-            return "success submission requires metrics (03-roles.md §4.4)"
+            return "success submission requires evaluation (03-roles.md §4.4)"
         try:
             self._validate_evaluation(submission.evaluation)
         except InvalidPrecondition as exc:

diff --git a/reference/packages/eden-storage/src/eden_storage/_ops/variants.py b/reference/packages/eden-storage/src/eden_storage/_ops/variants.py
@@ -97,7 +97,7 @@ def create_variant(self, variant: Variant) -> None:
     def declare_variant_evaluation_error(self, variant_id: str) -> None:
         """Retry-exhausted: ``starting → evaluation_error`` (``05-event-protocol.md`` §2.2).
 
-        Writes ``completed_at`` atomically; MUST NOT set metrics or
+        Writes ``completed_at`` atomically; MUST NOT set evaluation or
         artifacts_uri (``03-roles.md`` §4.4).
         """
         with self._atomic_operation():

diff --git a/reference/packages/eden-storage/src/eden_storage/submissions.py b/reference/packages/eden-storage/src/eden_storage/submissions.py
@@ -13,12 +13,12 @@
 - ``IdeaSubmission``        — status + idea_ids (set-equivalent
   per 04 §4.2).
 - ``VariantSubmission``   — status + variant_id + commit_sha (03 §3.4).
-- ``EvaluationSubmission``    — status + variant_id + metrics + optional
-  artifacts_uri (03 §4.4, 04 §4.2 on metrics equivalence).
+- ``EvaluationSubmission``    — status + variant_id + evaluation + optional
+  artifacts_uri (03 §4.4, 04 §4.2 on evaluation equivalence).
 
 All three are ``frozen=True`` so callers cannot rebind fields after
 construction. ``submit`` still deep-copies on entry and
-``read_submission`` on exit, because the ``metrics`` dict on
+``read_submission`` on exit, because the ``evaluation`` dict on
 ``EvaluationSubmission`` is not itself frozen.
 """
 

diff --git a/reference/packages/eden-storage/tests/test_store_hardening.py b/reference/packages/eden-storage/tests/test_store_hardening.py
@@ -275,7 +275,7 @@ def test_evaluate_success_without_evaluation_routed_to_validation_error(
         )
         reason = store.validate_acceptance("t-eval")
         assert reason is not None
-        assert "metrics" in reason
+        assert "evaluation" in reason
 
     def test_driver_routes_malformed_success_to_validation_error(
         self, make_store: Callable[..., Store]

diff --git a/spec/v0/02-data-model.md b/spec/v0/02-data-model.md
@@ -381,15 +381,15 @@ A variant is one completed attempt.
 | `artifacts_uri` | no | string (URI) | Where the variant's evaluator-produced artifacts live. Written by the orchestrator at evaluation-task terminal time from the evaluator's submission ([`03-roles.md`](03-roles.md) §4.4). |
 | `executor_artifacts_uri` | no | string (URI) | Where the variant's executor-produced artifacts live (build logs, coverage reports, generated screenshots — output not appropriate to commit to the worker branch). Written by the orchestrator at execution-task terminal time from the executor's submission ([`03-roles.md`](03-roles.md) §3.4). Disjoint from `artifacts_uri`: the executor writes one, the evaluator writes the other, and the orchestrator preserves both. |
 | `description` | no | string | Human-readable summary. |
-| `metrics` | no | object | Evaluation payload; shape dictated by the experiment's evaluation schema. |
+| `evaluation` | no | object | Evaluation payload; shape dictated by the experiment's evaluation schema. |
 | `started_at` | yes | timestamp | When the executor began. |
 | `completed_at` | no | timestamp | Set when the variant reaches a terminal status. Written exactly once by the orchestrator, atomically with the transition from `"starting"` to `"success"`, `"error"`, or `"evaluation_error"` (see [`04-task-protocol.md`](04-task-protocol.md) §4.3 and [`03-roles.md`](03-roles.md) §4.4). |
 | `executed_by` | no | string (worker_id) | The executor's `worker_id`; written at execution-task submit time (atomically with the variant's status transition out of `"starting"`). |
-| `evaluated_by` | no | string (worker_id) | The evaluator's `worker_id` whose metrics were committed; written at evaluation-task submit time. |
+| `evaluated_by` | no | string (worker_id) | The evaluator's `worker_id` whose evaluation was committed; written at evaluation-task submit time. |
 
 ### 9.2 Evaluation payload
 
-If present, `metrics` MUST be an object whose keys are a subset of the declared evaluation-schema keys and whose values match the declared types (§1.3) or are `null`. Because the evaluation-schema is per-experiment, [`schemas/variant.schema.json`](schemas/variant.schema.json) cannot express this constraint generically and leaves `metrics` as an open object; the per-metric type check is a runtime responsibility of the orchestrator. A conforming orchestrator MUST reject a evaluation payload that violates the experiment's evaluation-schema, and MUST NOT record the variant as `"success"` in that case.
+If present, `evaluation` MUST be an object whose keys are a subset of the declared evaluation-schema keys and whose values match the declared types (§1.3) or are `null`. Because the evaluation-schema is per-experiment, [`schemas/variant.schema.json`](schemas/variant.schema.json) cannot express this constraint generically and leaves `evaluation` as an open object; the per-metric type check is a runtime responsibility of the orchestrator. A conforming orchestrator MUST reject a evaluation payload that violates the experiment's evaluation-schema, and MUST NOT record the variant as `"success"` in that case.
 
 ### 9.3 Status transitions