From d654cb6edb2a76042387cf1b93ae632ce6b7f4c8 Mon Sep 17 00:00:00 2001
From: Eric Alt <13019253+ealt@users.noreply.github.com>
Date: Wed, 3 Jun 2026 16:25:58 -0700
Subject: [PATCH] Fix #273: align prose + manifest to schema/model 'evaluation'
 naming
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The variant evaluation-payload field is `evaluation` in variant.schema.json,
the eden_contracts.Variant model, and the storage/wire impl, but the spec
prose spelled it `metrics` across eight chapters plus the integrator-manifest
table. Option 1 from the issue: rename the prose + manifest to `evaluation`,
keeping the wire and on-tree manifest key stable (the reference integrator
already emitted `evaluation` in evaluation.json).

A full audit (per the AGENTS.md inter-chapter-restatement pitfall) found the
drift was wider than the issue's two named locations — every backtick'd field
reference is renamed in lockstep across spec chapters 02/03/04/05/06/07/08/10,
the variant.schema.json evaluated_by description, impl docstrings, and the
validate_acceptance reason string.
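
As a rough illustration of that kind of audit (a sketch only, not the
procedure actually used; the helper name `backticked_field_refs` and the
sample lines are hypothetical), scanning for the backticked field name
separates field references from plain-English "metric" uses and from the
distinct `baseline.metrics` config block:

```python
import re


def backticked_field_refs(text: str, field: str) -> list[int]:
    """Return 1-based line numbers where `field` appears alone in backticks."""
    # Match the field only when a backtick sits directly on both sides, so
    # `baseline.metrics` and the bare word "metrics" are not reported.
    pattern = re.compile("`" + re.escape(field) + "`")
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]


# Hypothetical sample lines modelled on the spec prose before the rename.
sample = "\n".join(
    [
        "| `metrics` | no | object | Evaluation payload. |",
        "The variant ran and produced metrics.",
        "The `baseline.metrics` config block is distinct.",
        "identical normative fields (`variant_id`, `status`, `metrics`)",
    ]
)

hits = backticked_field_refs(sample, "metrics")
print(hits)  # [1, 4]
```

Only lines 1 and 4 are field references; the concept use on line 2 and the
config block on line 3 are left alone, matching the scope of this rename.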

Also fixes a latent conformance bug: test_evaluator_submission.py asserted
`variant.get("metrics") is None`, which was always None on the wire (field is
`evaluation`) and so never tested the no-graft guarantee it claimed.
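
A minimal sketch of why the old assertion was vacuous (the variant dict is
a made-up example, not a real wire response, and `old_check` / `new_check`
are hypothetical names used only here):

```python
# Hypothetical wire variant on which an evaluation payload was wrongly
# grafted. The wire field is `evaluation`; there is no `metrics` key.
grafted_variant = {
    "variant_id": "var-1",
    "status": "evaluation_error",
    "evaluation": {"accuracy": 0.95},
}


def old_check(variant: dict) -> bool:
    """Old assertion: reads a key that never exists on the wire."""
    return variant.get("metrics") is None


def new_check(variant: dict) -> bool:
    """Corrected assertion: reads the field the wire actually carries."""
    return variant.get("evaluation") is None


# The old check passes even though the payload was grafted, so it could
# never fail; the corrected check catches the violation.
print(old_check(grafted_variant))  # True
print(new_check(grafted_variant))  # False
```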

The baseline.metrics config block keeps its name (distinct config field that
writes into variant.evaluation; out of scope for Option 1). Plain-English
"metric" concept uses are untouched. docs/conformance-coverage.md (a stale,
non-CI-enforced generated snapshot) is left for routine regeneration to avoid
unrelated #122-era churn in this focused rename.

Closes #273.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 CHANGELOG.md                                  | 14 +++++++
 .../scenarios/test_evaluator_submission.py    | 40 +++++++++----------
 docs/roadmap.md                               |  1 +
 .../tests/test_checkpoint_roundtrip.py        |  2 +-
 .../src/eden_storage/_ops/tasks_lifecycle.py  |  2 +-
 .../src/eden_storage/_ops/variants.py         |  2 +-
 .../src/eden_storage/submissions.py           |  6 +--
 .../tests/test_store_hardening.py             |  2 +-
 spec/v0/02-data-model.md                      |  6 +--
 spec/v0/03-roles.md                           | 12 +++---
 spec/v0/04-task-protocol.md                   |  2 +-
 spec/v0/05-event-protocol.md                  |  2 +-
 spec/v0/06-integrator.md                      |  4 +-
 spec/v0/07-wire-protocol.md                   |  2 +-
 spec/v0/08-storage.md                         |  4 +-
 spec/v0/10-checkpoints.md                     |  2 +-
 spec/v0/schemas/variant.schema.json           |  2 +-
 17 files changed, 60 insertions(+), 45 deletions(-)
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 5ce47ecc..9cf9b06e 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,6 +8,20 @@ Per-chunk entries preserve the full implementation record: contract amendments,
 
 ## [Unreleased]
 
+### Fix: align spec prose + integrator manifest to the `evaluation` field name (issue #273)
+
+Resolves the pre-existing drift the #122 entry surfaced (below, "Pre-existing drift surfaced"): the variant's evaluation-payload field is `evaluation` in `variant.schema.json` + the `eden_contracts.Variant` model + the storage/wire impl, but the spec **prose** spelled it `metrics` in several chapters and the integrator-manifest table. Two names for one field is a readability + onboarding hazard and a latent parity trap (the `schema-parity` job only checks schema↔model — both already say `evaluation` — so the prose `metrics` naming was unguarded). **Option 1** from the issue was selected: align prose + manifest to the schema/model `evaluation` naming, keeping the wire + on-tree manifest key stable (the reference integrator already emitted `evaluation` in `.eden/variants/<id>/evaluation.json` — see [`_manifest.py`](reference/packages/eden-git/src/eden_git/_manifest.py) — so this is a prose-and-docstring correction, not a manifest-shape change).
+
+**Scope was wider than the issue's two named locations.** The issue body named `02-data-model.md` §9.1 and `06-integrator.md` §4.2, but a full audit (per the AGENTS.md "spec inter-chapter restatement is a conflict surface" pitfall — grep chapters 03/04/05/07/08 for the same field) found the same `metrics` field-name spelling restated across **eight** chapters. Renaming only the two named spots would have relocated the drift rather than removing it, so the fix renames every backtick'd field reference in lockstep:
+
+- **Spec prose.** `02-data-model.md` §9.1 variant field table + §9.2 + the `evaluated_by` description; `03-roles.md` §4.2 (evaluator output) + §4.4 (the submission field, the variant-side write rule, the retry-exhausted no-graft rule, and the resubmission-equivalence formula); `04-task-protocol.md` §4.2 content-equivalence formula; `05-event-protocol.md` §6 (the read-the-entity boundary example); `06-integrator.md` §4.1 (validation prose) + §4.2 manifest-shape table; `07-wire-protocol.md` §11 reference-helper endpoint (`/_reference/.../validate/metrics` → `validate/evaluation`, which the impl already exposed); `08-storage.md` §4.1 + §4.3; `10-checkpoints.md` §13 (variant round-trip field list).
+- **JSON Schema.** `variant.schema.json` `evaluated_by` description prose (the field key was already `evaluation`).
+- **Impl docstrings + one error string.** `submissions.py` (`EvaluationSubmission` docstring), `variants.py` (`declare_variant_evaluation_error` docstring), and the `validate_acceptance` reason string (`"success submission requires metrics"` → `"… requires evaluation"`).
+
+**Latent conformance bug fixed.** `conformance/scenarios/test_evaluator_submission.py` asserted `variant.get("metrics") is None` in two evaluation_error scenarios — but the wire variant uses `evaluation`, so those reads were always `None` and the assertions passed trivially without testing the no-graft guarantee they claimed. Corrected to `variant.get("evaluation")` so they actually verify the variant carries no evaluation payload; the surrounding docstrings (which quote the renamed §4.4 prose) were aligned too. A stale checkpoint round-trip fixture (`test_checkpoint_roundtrip.py`) and a hardening-test reason assertion (`test_store_hardening.py`) were updated to match.
+
+**Deliberately left as-is.** The `baseline.metrics` **config** block (`02-data-model.md` §2.7, `experiment-config.schema.json`, `eden_contracts.config.BaselineConfig.metrics`) keeps the `metrics` name — it is a distinct config field that *writes into* `variant.evaluation`, and renaming it would touch the config surface (out of scope for Option 1, which keeps the wire/config stable). Plain-English / concept uses of "metric" (metric values §1.3, metric names in the evaluation schema §8, "objective over metrics") are left untouched — a metric is a real domain concept; the field that *holds* the metrics is `evaluation`. The generated `docs/conformance-coverage.md` (a non-CI-enforced snapshot last regenerated at #112) is **not** regenerated in this PR: it has drifted ~40 keyword lines since #122, so regenerating it now would dump unrelated churn into this focused rename. It will pick up the renamed prose on its next routine regeneration.
+
 ### Control-plane as a first-class Compose service + lease-handoff smoke (issue #147; re-scoped)
 
 Backfills the Phase 12c CHANGELOG-narrated deferral of a `compose-smoke-multi-experiment` CI job. **Re-scoped during impl** (operator-authorized): the draft plan's headline — two experiments end-to-end with cross-experiment isolation asserted via wire reads — is **not buildable** on the reference impl, because it hosts exactly one experiment per deployment. Three sites enforce this: the task-store-server's `Store` binds a single `experiment_id` and the wire layer rejects any other (`ExperimentIdMismatch` at [`_dependencies.py:73`](reference/packages/eden-wire/src/eden_wire/_dependencies.py)); the orchestrator multi-experiment loop targets one task-store URL for all experiments ([`multi_loop.py`](reference/services/orchestrator/src/eden_orchestrator/multi_loop.py) `make_runtime_factory`); and the integrator is one shared bare repo deployment-wide ([`cli.py`](reference/services/orchestrator/src/eden_orchestrator/cli.py) `_build_runtime_factory`). 12c's multi-experiment surface was validated only against fake stores + the single-IUT conformance binding. **True multi-experiment hosting + the cross-experiment-isolation smoke are deferred to [#254](https://github.com/ealt/eden/issues/254)** (filed at re-scope time). This chunk ships the genuinely-new, genuinely-shippable substrate piece instead: the control plane as a first-class Compose service, plus a lease-lifecycle + lease-handoff chaos smoke.
diff --git a/conformance/scenarios/test_evaluator_submission.py b/conformance/scenarios/test_evaluator_submission.py
index ecc08c26..e1e2d2e4 100644
--- a/conformance/scenarios/test_evaluator_submission.py
+++ b/conformance/scenarios/test_evaluator_submission.py
@@ -24,7 +24,7 @@ def test_submit_with_mismatched_variant_id_rejected(wire_client: WireClient) ->
 
     "An evaluator submits with: variant_id — the variant it evaluated."
     The task store enforces this so an evaluator cannot misroute a
-    metrics result onto an unrelated variant.
+    variant's evaluation result onto an unrelated variant.
     """
     eval_tid, _variant_id = _drive_to_starting_variant(wire_client)
     c = _seed.claim(wire_client, eval_tid)
@@ -44,9 +44,9 @@ def test_success_evaluation_outside_schema_must_not_complete_variant(
 ) -> None:
     """spec/v0/03-roles.md §4.2 — evaluation keys MUST be a subset of evaluation_schema.
 
-    "Produce a `metrics` object whose keys are a subset of the
+    "Produce an `evaluation` object whose keys are a subset of the
     declared `evaluation_schema` keys." A conforming IUT MUST reject a
-    success submission whose metrics include a key the schema does
+    success submission whose evaluation includes a key the schema does
     not declare; the variant MUST NOT terminalize as success. Where in
     the pipeline the rejection surfaces is implementation-defined,
     so the assertion checks the observable end-state.
@@ -130,7 +130,7 @@ def test_success_writes_variant_fields_post_accept(
     """spec/v0/03-roles.md §4.4 — accepted success writes evaluation + uri.
 
     Asserts the §4.4 variant-side write rule: after /accept on a success
-    submission, the variant's `status == "success"`, `metrics`,
+    submission, the variant's `status == "success"`, `evaluation`,
     `artifacts_uri`, and `completed_at` carry the submitted values,
     and `variant.succeeded` is in the event log. The §4.4 atomicity
     claim ("written atomically with the event") is asserted in
@@ -168,12 +168,12 @@ def test_success_writes_variant_fields_post_accept(
 def test_status_error_writes_variant_evaluation_and_artifacts(
     wire_client: WireClient,
 ) -> None:
-    """spec/v0/03-roles.md §4.4 — status=error MUST write variant metrics + artifacts_uri.
+    """spec/v0/03-roles.md §4.4 — status=error MUST write variant evaluation + artifacts_uri.
 
-    "metrics — set to the submission's `metrics` when status ∈
+    "evaluation — set to the submission's `evaluation` when status ∈
     {'success', 'error'}." Distinct from the evaluation_error case (which
-    discards metrics): the §4.4 variant-side write rule is per-status,
-    and the error path keeps the metrics around because the variant
+    discards the evaluation): the §4.4 variant-side write rule is per-status,
+    and the error path keeps the evaluation around because the variant
     DID run; only the run failed. The reject reason is
     `worker_error` — `validation_error` would discard the payload
     instead.
@@ -204,13 +204,13 @@ def test_status_error_writes_variant_evaluation_and_artifacts(
 def test_eval_error_keeps_variant_starting_and_does_not_graft_evaluation(
     wire_client: WireClient,
 ) -> None:
-    """spec/v0/03-roles.md §4.4 — evaluation_error MUST keep variant in starting; metrics discarded.
+    """spec/v0/03-roles.md §4.4 — evaluation_error keeps variant starting; evaluation discarded.
 
     "When status == evaluation_error, the orchestrator MUST NOT write
-    metrics on the variant; any submission-carried metrics is
-    discarded." Observed: after submitting evaluation_error with metrics
+    evaluation on the variant; any submission-carried evaluation is
+    discarded." Observed: after submitting evaluation_error with an evaluation
     and rejecting the task, the variant stays in `starting` and its
-    `metrics` field is unset.
+    `evaluation` field is unset.
     """
     eval_tid, variant_id = _drive_to_starting_variant(wire_client)
     c = _seed.claim(wire_client, eval_tid)
@@ -227,18 +227,18 @@ def test_eval_error_keeps_variant_starting_and_does_not_graft_evaluation(
     assert 200 <= rejected.status_code < 300, rejected.text
     variant = _seed.read_variant(wire_client, variant_id)
     assert variant["status"] == "starting"
-    assert variant.get("metrics") is None, variant
+    assert variant.get("evaluation") is None, variant
     assert variant.get("artifacts_uri") is None, variant
 
 
 def test_retry_exhausted_eval_error_does_not_graft_prior_evaluation(
     wire_client: WireClient,
 ) -> None:
-    """spec/v0/03-roles.md §4.4 — retry-exhausted evaluation_error MUST NOT graft prior metrics.
+    """spec/v0/03-roles.md §4.4 — retry-exhausted evaluation_error MUST NOT graft prior evaluation.
 
     "On the retry-exhausted evaluation_error terminal transition itself,
-    the orchestrator MUST NOT graft metrics or artifacts from any
-    prior evaluation_error submission onto the variant; the variant's metrics
+    the orchestrator MUST NOT graft the evaluation payload or artifacts from any
+    prior evaluation_error submission onto the variant; the variant's evaluation
     and artifacts_uri fields remain unset."
     """
     eval_tid, variant_id = _drive_to_starting_variant(wire_client)
@@ -259,7 +259,7 @@ def test_retry_exhausted_eval_error_does_not_graft_prior_evaluation(
     assert 200 <= decl.status_code < 300, decl.text
     variant = _seed.read_variant(wire_client, variant_id)
     assert variant["status"] == "evaluation_error"
-    assert variant.get("metrics") is None, variant
+    assert variant.get("evaluation") is None, variant
     assert variant.get("artifacts_uri") is None, variant
 
 
@@ -269,8 +269,8 @@ def test_resubmit_idempotent_under_role_rules(
     """spec/v0/03-roles.md §4.4 — identical resubmission MUST be accepted; only one task.submitted.
 
     Per the §4.4 amendment in this chunk: "identical normative
-    fields (`variant_id`, `status`, `metrics`) MUST be accepted." The
-    test holds all four fields (`variant_id`, `status`, `metrics`,
+    fields (`variant_id`, `status`, `evaluation`) MUST be accepted." The
+    test holds all four fields (`variant_id`, `status`, `evaluation`,
     `artifacts_uri`) identical between the two submits as the
     baseline-equivalence check; the artifacts_uri-non-equivalence
     test below pins the half of the amendment that says
@@ -314,7 +314,7 @@ def test_resubmit_with_different_artifacts_uri_is_idempotent(
     Per the §4.4 amendment in this chunk: "`artifacts_uri` is NOT
     part of equivalence — the first submission's `artifacts_uri` is
     the committed one." Two submits with identical
-    `variant_id`+`status`+`metrics` but DIFFERENT `artifacts_uri`
+    `variant_id`+`status`+`evaluation` but DIFFERENT `artifacts_uri`
     MUST both return 200, MUST emit only one `task.submitted`, and
     after `/accept` the variant's `artifacts_uri` MUST equal the
     *first* submission's value.
diff --git a/docs/roadmap.md b/docs/roadmap.md
index 09fadab6..dcf2c35d 100644
--- a/docs/roadmap.md
+++ b/docs/roadmap.md
@@ -263,6 +263,7 @@ Units and chunking to be named closer to execution — too far ahead to estimate
 - [12c](plans/eden-phase-12c-control-plane.md) — Control plane — **shipped 2026-05-19** (see [CHANGELOG](../CHANGELOG.md))
 - [#145](plans/issue-145-per-route-store-swap.md) — Per-route store swapping for the experiment switcher (12c §3.6 backfill) — **shipped 2026-06-02** (see [CHANGELOG](../CHANGELOG.md))
 - [#122](plans/issue-122-baseline-variant.md) — Evaluatable baseline variant (seed becomes a `kind="baseline"` Variant) — **shipped 2026-06-02** (see [CHANGELOG](../CHANGELOG.md))
+- [#273](https://github.com/ealt/eden/issues/273) — Fix spec/impl drift: align prose + integrator manifest to the `evaluation` field name (Option 1) — **shipped 2026-06-03** (see [CHANGELOG](../CHANGELOG.md))
 
 ---
 
diff --git a/reference/packages/eden-checkpoint/tests/test_checkpoint_roundtrip.py b/reference/packages/eden-checkpoint/tests/test_checkpoint_roundtrip.py
index fab6aa91..b9482e60 100644
--- a/reference/packages/eden-checkpoint/tests/test_checkpoint_roundtrip.py
+++ b/reference/packages/eden-checkpoint/tests/test_checkpoint_roundtrip.py
@@ -74,7 +74,7 @@ def _build_payload() -> dict[str, list[dict[str, object]]]:
                 "variant_commit_sha": "c" * 40,
                 "artifacts_uri": "checkpoint:sha256:" + sha256_hex(b"var-1 artifacts"),
                 "description": "first variant",
-                "metrics": {"accuracy": 0.95},
+                "evaluation": {"accuracy": 0.95},
                 "started_at": "2026-05-06T15:03:00Z",
                 "completed_at": "2026-05-06T15:05:00Z",
                 "executed_by": "executor-2",
diff --git a/reference/packages/eden-storage/src/eden_storage/_ops/tasks_lifecycle.py b/reference/packages/eden-storage/src/eden_storage/_ops/tasks_lifecycle.py
index c94f5de9..fcc341fd 100644
--- a/reference/packages/eden-storage/src/eden_storage/_ops/tasks_lifecycle.py
+++ b/reference/packages/eden-storage/src/eden_storage/_ops/tasks_lifecycle.py
@@ -980,7 +980,7 @@ def _validate_evaluate_acceptance(
         if variant.commit_sha is None:
             return f"variant {submission.variant_id!r} has no commit_sha"
         if submission.evaluation is None:
-            return "success submission requires metrics (03-roles.md §4.4)"
+            return "success submission requires evaluation (03-roles.md §4.4)"
         try:
             self._validate_evaluation(submission.evaluation)
         except InvalidPrecondition as exc:
diff --git a/reference/packages/eden-storage/src/eden_storage/_ops/variants.py b/reference/packages/eden-storage/src/eden_storage/_ops/variants.py
index cfdaf9df..c38df490 100644
--- a/reference/packages/eden-storage/src/eden_storage/_ops/variants.py
+++ b/reference/packages/eden-storage/src/eden_storage/_ops/variants.py
@@ -97,7 +97,7 @@ def create_variant(self, variant: Variant) -> None:
     def declare_variant_evaluation_error(self, variant_id: str) -> None:
         """Retry-exhausted: ``starting → evaluation_error`` (``05-event-protocol.md`` §2.2).
 
-        Writes ``completed_at`` atomically; MUST NOT set metrics or
+        Writes ``completed_at`` atomically; MUST NOT set evaluation or
         artifacts_uri (``03-roles.md`` §4.4).
         """
         with self._atomic_operation():
diff --git a/reference/packages/eden-storage/src/eden_storage/submissions.py b/reference/packages/eden-storage/src/eden_storage/submissions.py
index ca004c0e..62f246b6 100644
--- a/reference/packages/eden-storage/src/eden_storage/submissions.py
+++ b/reference/packages/eden-storage/src/eden_storage/submissions.py
@@ -13,12 +13,12 @@
 - ``IdeaSubmission``        — status + idea_ids (set-equivalent
   per 04 §4.2).
 - ``VariantSubmission``   — status + variant_id + commit_sha (03 §3.4).
-- ``EvaluationSubmission``    — status + variant_id + metrics + optional
-  artifacts_uri (03 §4.4, 04 §4.2 on metrics equivalence).
+- ``EvaluationSubmission``    — status + variant_id + evaluation + optional
+  artifacts_uri (03 §4.4, 04 §4.2 on evaluation equivalence).
 
 All three are ``frozen=True`` so callers cannot rebind fields after
 construction. ``submit`` still deep-copies on entry and
-``read_submission`` on exit, because the ``metrics`` dict on
+``read_submission`` on exit, because the ``evaluation`` dict on
 ``EvaluationSubmission`` is not itself frozen.
 """
 
diff --git a/reference/packages/eden-storage/tests/test_store_hardening.py b/reference/packages/eden-storage/tests/test_store_hardening.py
index 1f4b4dfd..9443a580 100644
--- a/reference/packages/eden-storage/tests/test_store_hardening.py
+++ b/reference/packages/eden-storage/tests/test_store_hardening.py
@@ -275,7 +275,7 @@ def test_evaluate_success_without_evaluation_routed_to_validation_error(
         )
         reason = store.validate_acceptance("t-eval")
         assert reason is not None
-        assert "metrics" in reason
+        assert "evaluation" in reason
 
     def test_driver_routes_malformed_success_to_validation_error(
         self, make_store: Callable[..., Store]
diff --git a/spec/v0/02-data-model.md b/spec/v0/02-data-model.md
index 23ef9273..c9308b69 100644
--- a/spec/v0/02-data-model.md
+++ b/spec/v0/02-data-model.md
@@ -381,15 +381,15 @@ A variant is one completed attempt.
 | `artifacts_uri` | no | string (URI) | Where the variant's evaluator-produced artifacts live. Written by the orchestrator at evaluation-task terminal time from the evaluator's submission ([`03-roles.md`](03-roles.md) §4.4). |
 | `executor_artifacts_uri` | no | string (URI) | Where the variant's executor-produced artifacts live (build logs, coverage reports, generated screenshots — output not appropriate to commit to the worker branch). Written by the orchestrator at execution-task terminal time from the executor's submission ([`03-roles.md`](03-roles.md) §3.4). Disjoint from `artifacts_uri`: the executor writes one, the evaluator writes the other, and the orchestrator preserves both. |
 | `description` | no | string | Human-readable summary. |
-| `metrics` | no | object | Evaluation payload; shape dictated by the experiment's evaluation schema. |
+| `evaluation` | no | object | Evaluation payload; shape dictated by the experiment's evaluation schema. |
 | `started_at` | yes | timestamp | When the executor began. |
 | `completed_at` | no | timestamp | Set when the variant reaches a terminal status. Written exactly once by the orchestrator, atomically with the transition from `"starting"` to `"success"`, `"error"`, or `"evaluation_error"` (see [`04-task-protocol.md`](04-task-protocol.md) §4.3 and [`03-roles.md`](03-roles.md) §4.4). |
 | `executed_by` | no | string (worker_id) | The executor's `worker_id`; written at execution-task submit time (atomically with the variant's status transition out of `"starting"`). |
-| `evaluated_by` | no | string (worker_id) | The evaluator's `worker_id` whose metrics were committed; written at evaluation-task submit time. |
+| `evaluated_by` | no | string (worker_id) | The evaluator's `worker_id` whose evaluation was committed; written at evaluation-task submit time. |
 
 ### 9.2 Evaluation payload
 
-If present, `metrics` MUST be an object whose keys are a subset of the declared evaluation-schema keys and whose values match the declared types (§1.3) or are `null`. Because the evaluation-schema is per-experiment, [`schemas/variant.schema.json`](schemas/variant.schema.json) cannot express this constraint generically and leaves `metrics` as an open object; the per-metric type check is a runtime responsibility of the orchestrator. A conforming orchestrator MUST reject a evaluation payload that violates the experiment's evaluation-schema, and MUST NOT record the variant as `"success"` in that case.
+If present, `evaluation` MUST be an object whose keys are a subset of the declared evaluation-schema keys and whose values match the declared types (§1.3) or are `null`. Because the evaluation-schema is per-experiment, [`schemas/variant.schema.json`](schemas/variant.schema.json) cannot express this constraint generically and leaves `evaluation` as an open object; the per-metric type check is a runtime responsibility of the orchestrator. A conforming orchestrator MUST reject an evaluation payload that violates the experiment's evaluation-schema, and MUST NOT record the variant as `"success"` in that case.
 
 ### 9.3 Status transitions
 
diff --git a/spec/v0/03-roles.md b/spec/v0/03-roles.md
index edca8465..44d4de7f 100644
--- a/spec/v0/03-roles.md
+++ b/spec/v0/03-roles.md
@@ -133,10 +133,10 @@ An evaluator receives:
 
 The evaluator MUST:
 
-1. Produce a `metrics` object whose keys are a subset of the declared `evaluation_schema` keys and whose values satisfy the per-metric type rules ([`02-data-model.md`](02-data-model.md) §1.3, §9.2).
+1. Produce an `evaluation` object whose keys are a subset of the declared `evaluation_schema` keys and whose values satisfy the per-metric type rules ([`02-data-model.md`](02-data-model.md) §1.3, §9.2).
 2. Optionally upload supporting artifacts (logs, captured outputs, diagnostic files).
 
-The evaluator MUST NOT modify the worker branch or any protocol-owned mutable state other than the variant fields the submission writes (§4.4) and the task it holds a claim on. In particular, the evaluator MUST NOT write to the variant's `completed_at`, `metrics`, `artifacts_uri`, `description`, or `status` directly; those writes are performed by the orchestrator when the submitted task reaches its terminal state (§4.4, [`04-task-protocol.md`](04-task-protocol.md) §4.3).
+The evaluator MUST NOT modify the worker branch or any protocol-owned mutable state other than the variant fields the submission writes (§4.4) and the task it holds a claim on. In particular, the evaluator MUST NOT write to the variant's `completed_at`, `evaluation`, `artifacts_uri`, `description`, or `status` directly; those writes are performed by the orchestrator when the submitted task reaches its terminal state (§4.4, [`04-task-protocol.md`](04-task-protocol.md) §4.3).
 
 ### 4.3 Non-interference
 
@@ -151,19 +151,19 @@ The evaluator submits with:
   - `"success"` — the variant ran and produced metrics.
   - `"error"` — the variant could not be evaluated for reasons attributable to the variant's own code (build failure, test failure, etc.). The evaluator MAY still include partial metrics.
   - `"evaluation_error"` — the evaluator itself failed for reasons unrelated to the variant's code (infrastructure fault, evaluator bug). While a fresh evaluation task MAY still be created for this variant, the variant's status MUST remain `"starting"`. If the orchestrator's retry policy is exhausted (or the operator abandons evaluation), the orchestrator MUST transition the variant's status to `"evaluation_error"`, making that status terminal for the variant ([`04-task-protocol.md`](04-task-protocol.md) §4.3).
-- `metrics` — the evaluation object described in §4.2. MAY be absent when `status == "evaluation_error"`.
+- `evaluation` — the evaluation object described in §4.2. MAY be absent when `status == "evaluation_error"`.
 - `artifacts_uri` — OPTIONAL. A URI the evaluator uploaded supporting artifacts to.
 
 On a `submitted → completed` or `submitted → failed` transition (per [`04-task-protocol.md`](04-task-protocol.md) §4.3), the orchestrator MUST write the following variant fields atomically with the event:
 
 - `status` — the variant status implied by the submission: `"success"` when the submission's `status == "success"`; `"error"` when the submission's `status == "error"`; unchanged from `"starting"` when the submission's `status == "evaluation_error"` (see §4.4 above for the terminal-retry case).
-- `metrics` — set to the submission's `metrics` when `status ∈ {"success", "error"}`. When `status == "evaluation_error"` the orchestrator MUST NOT write `metrics` on the variant; any submission-carried `metrics` is discarded.
+- `evaluation` — set to the submission's `evaluation` when `status ∈ {"success", "error"}`. When `status == "evaluation_error"` the orchestrator MUST NOT write `evaluation` on the variant; any submission-carried `evaluation` is discarded.
 - `artifacts_uri` — set to the submission's `artifacts_uri` when provided and `status ∈ {"success", "error"}`. When `status == "evaluation_error"` the orchestrator MUST NOT write `artifacts_uri` on the variant; any submission-carried `artifacts_uri` is discarded. (An evaluator that wishes to retain diagnostic artifacts from a failed attempt MAY reference them in the `task.failed` event for that evaluation task; that channel is defined in [`05-event-protocol.md`](05-event-protocol.md).)
 - `completed_at` — set to the time of the terminal variant transition, i.e. written exactly once, when the variant's status leaves `"starting"` (either on a `"success"`/`"error"` submission, or on the retry-exhausted `"evaluation_error"` transition). Intermediate `evaluation_error` submissions MUST NOT advance `completed_at`.
 
-On the retry-exhausted `"evaluation_error"` terminal transition itself, the orchestrator MUST NOT graft metrics or artifacts from any prior `evaluation_error` submission onto the variant; the variant's `metrics` and `artifacts_uri` fields remain unset. This keeps the variant object canonical: a variant either carries the outputs of a successful or code-level-failed evaluation, or it carries nothing.
+On the retry-exhausted `"evaluation_error"` terminal transition itself, the orchestrator MUST NOT graft the evaluation payload or artifacts from any prior `evaluation_error` submission onto the variant; the variant's `evaluation` and `artifacts_uri` fields remain unset. This keeps the variant object canonical: a variant either carries the outputs of a successful or code-level-failed evaluation, or it carries nothing.
 
-Resubmission is idempotent under the same rules as §3.4 and [`04-task-protocol.md`](04-task-protocol.md) §4.2: identical normative fields (`variant_id`, `status`, `metrics`) MUST be accepted; inconsistent resubmission MUST be rejected. `artifacts_uri` is NOT part of equivalence — the first submission's `artifacts_uri` is the committed one. (Earlier drafts of this section listed `artifacts_uri` as part of the equivalence formula; the §4.2 statement is canonical and this section now defers to it.)
+Resubmission is idempotent under the same rules as §3.4 and [`04-task-protocol.md`](04-task-protocol.md) §4.2: identical normative fields (`variant_id`, `status`, `evaluation`) MUST be accepted; inconsistent resubmission MUST be rejected. `artifacts_uri` is NOT part of equivalence — the first submission's `artifacts_uri` is the committed one. (Earlier drafts of this section listed `artifacts_uri` as part of the equivalence formula; the §4.2 statement is canonical and this section now defers to it.)
 
 ## 5. Integrator
 
diff --git a/spec/v0/04-task-protocol.md b/spec/v0/04-task-protocol.md
index 257faab3..ba5dbbe2 100644
--- a/spec/v0/04-task-protocol.md
+++ b/spec/v0/04-task-protocol.md
@@ -134,7 +134,7 @@ When the claimant matches, the task store MUST handle the resubmission as follow
 - If the resubmission's result payload is **content-equivalent** to the already-recorded payload, the task store MUST accept it and MUST NOT change the task's state or recorded result. "Content equivalence" means the normative fields identified per role agree:
   - `ideation` — the set of `idea_ids` (compared as sets; order is not significant per [`03-roles.md`](03-roles.md) §2.4) and `status`.
   - `execution` — `variant_id`, `status`, and `commit_sha` (when present).
-  - `evaluation` — `variant_id`, `status`, and `metrics` (compared as JSON values; key order does not matter).
+  - `evaluation` — `variant_id`, `status`, and `evaluation` (compared as JSON values; key order does not matter).
 - If the resubmission's result payload is **not** content-equivalent, the task store MUST reject it. The first submission's result is the committed result.
 
 This rule exists so that a worker may safely retry a submit after a network or process failure without risk of advancing state twice or corrupting the recorded result. Bindings MAY additionally accept an optional caller-supplied `submission_id` field on the wire payload to act as an explicit idempotency key; that is a binding-layer extension and does not weaken the content-equivalence rule.
diff --git a/spec/v0/05-event-protocol.md b/spec/v0/05-event-protocol.md
index e9e03b44..fb670094 100644
--- a/spec/v0/05-event-protocol.md
+++ b/spec/v0/05-event-protocol.md
@@ -165,7 +165,7 @@ Note on `experiment.policy_error`: this event records an orchestrator-side fault
 
 Every protocol-owned state-machine transition in v0 is covered by exactly one event type in §3.1–§3.4. A subscriber that consumes every event with a registered `type`, in log order, MUST be able to reconstruct the **lifecycle history** of every task, idea, variant, and experiment-scoped configuration in the experiment: which entities exist, in what states, in what order. A conforming implementation MUST NOT expose a state-machine transition that is not marked by its corresponding registered event.
 
-Registered event payloads carry only the fields needed to *identify* the transitioning entity and any cross-entity references (e.g. `idea.dispatched` carries the implement `task_id`). They do not carry full entity snapshots. A subscriber that needs the content of an entity (the idea's `parent_commits` and `artifacts_uri`, the variant's `metrics` and `completed_at`) MUST read that entity from its store using the identifier the event carries; the read returns the entity's current state ([`08-storage.md`](08-storage.md) §1.7). This boundary is deliberate: events mark what happened; the entity stores hold what the entity *is*. Coupling the two into a full event-sourced projection — a subscriber reconstructing every intermediate entity value from events alone — is a deployment MAY implement on top of v0 but is not a protocol requirement, since v0 does not mandate historical reads on the entity stores.
+Registered event payloads carry only the fields needed to *identify* the transitioning entity and any cross-entity references (e.g. `idea.dispatched` carries the implement `task_id`). They do not carry full entity snapshots. A subscriber that needs the content of an entity (the idea's `parent_commits` and `artifacts_uri`, the variant's `evaluation` and `completed_at`) MUST read that entity from its store using the identifier the event carries; the read returns the entity's current state ([`08-storage.md`](08-storage.md) §1.7). This boundary is deliberate: events mark what happened; the entity stores hold what the entity *is*. Coupling the two into a full event-sourced projection (a subscriber reconstructing every intermediate entity value from events alone) is something a deployment MAY implement on top of v0, but it is not a protocol requirement, since v0 does not mandate historical reads on the entity stores.
 
 The transactional invariant (§2) combined with the atomic event + state-change rule guarantees that any entity reachable via an event's identifier is already durable at read time.
 
diff --git a/spec/v0/06-integrator.md b/spec/v0/06-integrator.md
index 173d83f2..b4a0e371 100644
--- a/spec/v0/06-integrator.md
+++ b/spec/v0/06-integrator.md
@@ -48,7 +48,7 @@ A conforming integrator MUST NOT integrate variants in any other status. In part
 
 A conforming integrator MUST NOT integrate a `kind == "baseline"` variant ([`02-data-model.md`](02-data-model.md) §9.4), regardless of its `status`. A baseline has no `work/*` branch to squash and already points at the seed on `main`, so it receives no `variant/*` commit, no `variant_commit_sha`, and no `variant.integrated` event. This carve is paired with the `integration` decision predicate and the termination-drain rule ([`02-data-model.md`](02-data-model.md) §2.4, §2.5), both of which exclude baselines so a successful baseline does not block termination. A binding MAY additionally reject a manual/operator `integrate_variant` call against a baseline with `eden://error/invalid-precondition` ([`07-wire-protocol.md`](07-wire-protocol.md) §5) as defense in depth.
 
-The integrator MUST NOT integrate a variant whose `metrics` do not validate against the experiment's `evaluation_schema` ([`02-data-model.md`](02-data-model.md) §9.2, [`08-storage.md`](08-storage.md) §4). The orchestrator's acceptance of a `success` submission is the primary guard for this; the integrator MAY additionally re-validate as defense in depth but MUST NOT silently drop or coerce invalid metrics.
+The integrator MUST NOT integrate a variant whose `evaluation` does not validate against the experiment's `evaluation_schema` ([`02-data-model.md`](02-data-model.md) §9.2, [`08-storage.md`](08-storage.md) §4). The orchestrator's acceptance of a `success` submission is the primary guard for this; the integrator MAY additionally re-validate as defense in depth but MUST NOT silently drop or coerce an invalid evaluation payload.
 
 ## 3. Integration output
 
@@ -127,7 +127,7 @@ The manifest is a JSON object with the following required fields. Each required
 | `idea_id` | string | The variant's `idea_id` ([`02-data-model.md`](02-data-model.md) §9.1). |
 | `commit_sha` | string | The worker-branch tip the evaluator measured ([`02-data-model.md`](02-data-model.md) §9.1). |
 | `parent_commits` | array of string | The variant's `parent_commits`, in order ([`02-data-model.md`](02-data-model.md) §9.1). |
-| `metrics` | object | The evaluator's evaluation payload ([`03-roles.md`](03-roles.md) §4.4), conforming to the experiment's `evaluation_schema`. |
+| `evaluation` | object | The evaluator's evaluation payload ([`03-roles.md`](03-roles.md) §4.4), conforming to the experiment's `evaluation_schema`. |
 | `completed_at` | timestamp | The variant's `completed_at` ([`02-data-model.md`](02-data-model.md) §9.1). UTC, RFC 3339 profile as elsewhere in the data model. |
 
 Optional fields:
diff --git a/spec/v0/07-wire-protocol.md b/spec/v0/07-wire-protocol.md
index 39199b64..0c406603 100644
--- a/spec/v0/07-wire-protocol.md
+++ b/spec/v0/07-wire-protocol.md
@@ -364,7 +364,7 @@ Claim, reject, reclaim, and accept are **not** blindly retry-safe on transport f
 The reference `eden_wire` server also exposes:
 
 - `GET /_reference/experiments/{E}/tasks/{T}/validate-terminal`
-- `POST /_reference/experiments/{E}/validate/metrics`
+- `POST /_reference/experiments/{E}/validate/evaluation`
 
 These are conveniences for the Phase-5 dispatch driver and are **not** part of the normative binding. A conforming third-party client MUST NOT rely on them being present. A conforming third-party orchestrator implementing its own accept/reject decision inline is free to do so; the [`04-task-protocol.md`](04-task-protocol.md) §4.3 decision rules are all that matter for the state machine.
 
diff --git a/spec/v0/08-storage.md b/spec/v0/08-storage.md
index cc27c9fa..edf167dd 100644
--- a/spec/v0/08-storage.md
+++ b/spec/v0/08-storage.md
@@ -151,7 +151,7 @@ Every experiment declares an evaluation schema in its `experiment_config` ([`02-
 
 ### 4.1 Registration
 
-At experiment registration time, a conforming deployment MUST persist the experiment's evaluation schema durably and atomically with the experiment's other configuration. A subsequent write of a variant's `metrics` field MUST be validated against the schema registered for that experiment; no write MAY bypass this validation.
+At experiment registration time, a conforming deployment MUST persist the experiment's evaluation schema durably and atomically with the experiment's other configuration. A subsequent write of a variant's `evaluation` field MUST be validated against the schema registered for that experiment; no write MAY bypass this validation.
 
 ### 4.2 Immutability during an experiment
 
@@ -163,7 +163,7 @@ The content is canonicality: comparing variants across an experiment only has me
 
 A successful write of a `variant.evaluation` payload MUST satisfy:
 
-- Every key in `metrics` is present in the experiment's evaluation schema.
+- Every key in the `evaluation` payload is present in the experiment's evaluation schema.
 - Every value either satisfies the declared type of its key — per the type mapping in [`02-data-model.md`](02-data-model.md) §1.3 (`integer`, `real`, `text`) — or is `null`.
 - No reserved name ([`02-data-model.md`](02-data-model.md) §8.2) appears as a key.
 
diff --git a/spec/v0/10-checkpoints.md b/spec/v0/10-checkpoints.md
index 9f1535bf..8b8aadca 100644
--- a/spec/v0/10-checkpoints.md
+++ b/spec/v0/10-checkpoints.md
@@ -184,7 +184,7 @@ The contract per object kind:
 
 **Tasks.** Every `task_id` in the source is present in the import. `kind`, `payload`, `target`, `created_by`, `submitted_by`, `created_at`, `updated_at` round-trip verbatim. `state` round-trips verbatim EXCEPT `claimed` becomes `pending` per (c) above. The `claim` field is empty on every imported task.
 
-**Ideas, variants, submissions.** Round-trip identical to their schema-validated forms, except `artifacts_uri` per (a). Variant `metrics`, `commit_sha`, `variant_commit_sha`, `branch`, `parent_commits`, `description`, `executed_by`, `evaluated_by`, `completed_at`, `status` all round-trip verbatim.
+**Ideas, variants, submissions.** Round-trip identical to their schema-validated forms, except `artifacts_uri` per (a). Variant `evaluation`, `commit_sha`, `variant_commit_sha`, `branch`, `parent_commits`, `description`, `executed_by`, `evaluated_by`, `completed_at`, `status` all round-trip verbatim.
 
 **Events.** Replay in the same order with the same per-event `type` / `occurred_at` / `experiment_id` / `data` payload. The `event_id` MAY differ per (b).
 
diff --git a/spec/v0/schemas/variant.schema.json b/spec/v0/schemas/variant.schema.json
index e9c95de7..16b23499 100644
--- a/spec/v0/schemas/variant.schema.json
+++ b/spec/v0/schemas/variant.schema.json
@@ -94,7 +94,7 @@
       "minLength": 1,
       "maxLength": 64,
       "pattern": "^[a-z0-9][a-z0-9_-]{0,63}$",
-      "description": "worker_id of the evaluator whose metrics were committed; written at evaluation-task submit time."
+      "description": "worker_id of the evaluator whose evaluation was committed; written at evaluation-task submit time."
     }
   },
   "allOf": [