From d654cb6edb2a76042387cf1b93ae632ce6b7f4c8 Mon Sep 17 00:00:00 2001
From: Eric Alt <13019253+ealt@users.noreply.github.com>
Date: Wed, 3 Jun 2026 16:25:58 -0700
Subject: [PATCH] Fix #273: align prose + manifest to schema/model 'evaluation' naming
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The variant evaluation-payload field is `evaluation` in variant.schema.json, the eden_contracts.Variant model, and the storage/wire impl, but the spec prose spelled it `metrics` across eight chapters plus the integrator-manifest table. Option 1 from the issue: rename the prose + manifest to `evaluation`, keeping the wire and on-tree manifest key stable (the reference integrator already emitted `evaluation` in evaluation.json).

A full audit (per the AGENTS.md inter-chapter-restatement pitfall) found the drift was wider than the issue's two named locations — every backtick'd field reference is renamed in lockstep across spec chapters 02/03/04/05/06/07/08/10, the variant.schema.json evaluated_by description, impl docstrings, and the validate_acceptance reason string.

Also fixes a latent conformance bug: test_evaluator_submission.py asserted `variant.get("metrics") is None`, which was always None on the wire (field is `evaluation`) and so never tested the no-graft guarantee it claimed.

The baseline.metrics config block keeps its name (distinct config field that writes into variant.evaluation; out of scope for Option 1). Plain-English "metric" concept uses are untouched. docs/conformance-coverage.md (a stale, non-CI-enforced generated snapshot) is left for routine regeneration to avoid unrelated #122-era churn in this focused rename.

Closes #273.
Co-Authored-By: Claude Opus 4.8 (1M context) --- CHANGELOG.md | 14 +++++++ .../scenarios/test_evaluator_submission.py | 40 +++++++++---------- docs/roadmap.md | 1 + .../tests/test_checkpoint_roundtrip.py | 2 +- .../src/eden_storage/_ops/tasks_lifecycle.py | 2 +- .../src/eden_storage/_ops/variants.py | 2 +- .../src/eden_storage/submissions.py | 6 +-- .../tests/test_store_hardening.py | 2 +- spec/v0/02-data-model.md | 6 +-- spec/v0/03-roles.md | 12 +++--- spec/v0/04-task-protocol.md | 2 +- spec/v0/05-event-protocol.md | 2 +- spec/v0/06-integrator.md | 4 +- spec/v0/07-wire-protocol.md | 2 +- spec/v0/08-storage.md | 4 +- spec/v0/10-checkpoints.md | 2 +- spec/v0/schemas/variant.schema.json | 2 +- 17 files changed, 60 insertions(+), 45 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 5ce47ecc..9cf9b06e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,6 +8,20 @@ Per-chunk entries preserve the full implementation record: contract amendments, ## [Unreleased] +### Fix: align spec prose + integrator manifest to the `evaluation` field name (issue #273) + +Resolves the pre-existing drift the #122 entry surfaced (below, "Pre-existing drift surfaced"): the variant's evaluation-payload field is `evaluation` in `variant.schema.json` + the `eden_contracts.Variant` model + the storage/wire impl, but the spec **prose** spelled it `metrics` in several chapters and the integrator-manifest table. Two names for one field is a readability + onboarding hazard and a latent parity trap (the `schema-parity` job only checks schema↔model — both already say `evaluation` — so the prose `metrics` naming was unguarded). 
**Option 1** from the issue was selected: align prose + manifest to the schema/model `evaluation` naming, keeping the wire + on-tree manifest key stable (the reference integrator already emitted `evaluation` in `.eden/variants//evaluation.json` — see [`_manifest.py`](reference/packages/eden-git/src/eden_git/_manifest.py) — so this is a prose-and-docstring correction, not a manifest-shape change). + +**Scope was wider than the issue's two named locations.** The issue body named `02-data-model.md` §9.1 and `06-integrator.md` §4.2, but a full audit (per the AGENTS.md "spec inter-chapter restatement is a conflict surface" pitfall — grep chapters 03/04/05/07/08 for the same field) found the same `metrics` field-name spelling restated across **eight** chapters. Renaming only the two named spots would have relocated the drift rather than removing it, so the fix renames every backtick'd field reference in lockstep: + +- **Spec prose.** `02-data-model.md` §9.1 variant field table + §9.2 + the `evaluated_by` description; `03-roles.md` §4.2 (evaluator output) + §4.4 (the submission field, the variant-side write rule, the retry-exhausted no-graft rule, and the resubmission-equivalence formula); `04-task-protocol.md` §4.2 content-equivalence formula; `05-event-protocol.md` §6 (the read-the-entity boundary example); `06-integrator.md` §4.1 (validation prose) + §4.2 manifest-shape table; `07-wire-protocol.md` §11 reference-helper endpoint (`/_reference/.../validate/metrics` → `validate/evaluation`, which the impl already exposed); `08-storage.md` §4.1 + §4.3; `10-checkpoints.md` §13 (variant round-trip field list). +- **JSON Schema.** `variant.schema.json` `evaluated_by` description prose (the field key was already `evaluation`). 
+- **Impl docstrings + one error string.** `submissions.py` (`EvaluationSubmission` docstring), `variants.py` (`declare_variant_evaluation_error` docstring), and the `validate_acceptance` reason string (`"success submission requires metrics"` → `"… requires evaluation"`). + +**Latent conformance bug fixed.** `conformance/scenarios/test_evaluator_submission.py` asserted `variant.get("metrics") is None` in two evaluation_error scenarios — but the wire variant uses `evaluation`, so those reads were always `None` and the assertions passed trivially without testing the no-graft guarantee they claimed. Corrected to `variant.get("evaluation")` so they actually verify the variant carries no evaluation payload; the surrounding docstrings (which quote the renamed §4.4 prose) were aligned too. A stale checkpoint round-trip fixture (`test_checkpoint_roundtrip.py`) and a hardening-test reason assertion (`test_store_hardening.py`) were updated to match. + +**Deliberately left as-is.** The `baseline.metrics` **config** block (`02-data-model.md` §2.7, `experiment-config.schema.json`, `eden_contracts.config.BaselineConfig.metrics`) keeps the `metrics` name — it is a distinct config field that *writes into* `variant.evaluation`, and renaming it would touch the config surface (out of scope for Option 1, which keeps the wire/config stable). Plain-English / concept uses of "metric" (metric values §1.3, metric names in the evaluation schema §8, "objective over metrics") are left untouched — a metric is a real domain concept; the field that *holds* the metrics is `evaluation`. The generated `docs/conformance-coverage.md` (a non-CI-enforced snapshot last regenerated at #112) is **not** regenerated in this PR: it has drifted ~40 keyword lines since #122, so regenerating it now would dump unrelated churn into this focused rename. It will pick up the renamed prose on its next routine regeneration. 
+ ### Control-plane as a first-class Compose service + lease-handoff smoke (issue #147; re-scoped) Backfills the Phase 12c CHANGELOG-narrated deferral of a `compose-smoke-multi-experiment` CI job. **Re-scoped during impl** (operator-authorized): the draft plan's headline — two experiments end-to-end with cross-experiment isolation asserted via wire reads — is **not buildable** on the reference impl, because it hosts exactly one experiment per deployment. Three sites enforce this: the task-store-server's `Store` binds a single `experiment_id` and the wire layer rejects any other (`ExperimentIdMismatch` at [`_dependencies.py:73`](reference/packages/eden-wire/src/eden_wire/_dependencies.py)); the orchestrator multi-experiment loop targets one task-store URL for all experiments ([`multi_loop.py`](reference/services/orchestrator/src/eden_orchestrator/multi_loop.py) `make_runtime_factory`); and the integrator is one shared bare repo deployment-wide ([`cli.py`](reference/services/orchestrator/src/eden_orchestrator/cli.py) `_build_runtime_factory`). 12c's multi-experiment surface was validated only against fake stores + the single-IUT conformance binding. **True multi-experiment hosting + the cross-experiment-isolation smoke are deferred to [#254](https://github.com/ealt/eden/issues/254)** (filed at re-scope time). This chunk ships the genuinely-new, genuinely-shippable substrate piece instead: the control plane as a first-class Compose service, plus a lease-lifecycle + lease-handoff chaos smoke. diff --git a/conformance/scenarios/test_evaluator_submission.py b/conformance/scenarios/test_evaluator_submission.py index ecc08c26..e1e2d2e4 100644 --- a/conformance/scenarios/test_evaluator_submission.py +++ b/conformance/scenarios/test_evaluator_submission.py @@ -24,7 +24,7 @@ def test_submit_with_mismatched_variant_id_rejected(wire_client: WireClient) -> "An evaluator submits with: variant_id — the variant it evaluated." 
The task store enforces this so an evaluator cannot misroute a - metrics result onto an unrelated variant. + submitted evaluation result onto an unrelated variant. """ eval_tid, _variant_id = _drive_to_starting_variant(wire_client) c = _seed.claim(wire_client, eval_tid) @@ -44,9 +44,9 @@ def test_success_evaluation_outside_schema_must_not_complete_variant( ) -> None: """spec/v0/03-roles.md §4.2 — evaluation keys MUST be a subset of evaluation_schema. - "Produce a `metrics` object whose keys are a subset of the + "Produce an `evaluation` object whose keys are a subset of the declared `evaluation_schema` keys." A conforming IUT MUST reject a - success submission whose metrics include a key the schema does + success submission whose evaluation includes a key the schema does not declare; the variant MUST NOT terminalize as success. Where in the pipeline the rejection surfaces is implementation-defined, so the assertion checks the observable end-state. @@ -130,7 +130,7 @@ def test_success_writes_variant_fields_post_accept( """spec/v0/03-roles.md §4.4 — accepted success writes evaluation + uri. Asserts the §4.4 variant-side write rule: after /accept on a success - submission, the variant's `status == "success"`, `metrics`, + submission, the variant's `status == "success"`, `evaluation`, `artifacts_uri`, and `completed_at` carry the submitted values, and `variant.succeeded` is in the event log. The §4.4 atomicity claim ("written atomically with the event") is asserted in @@ -168,12 +168,12 @@ def test_success_writes_variant_fields_post_accept( def test_status_error_writes_variant_evaluation_and_artifacts( wire_client: WireClient, ) -> None: - """spec/v0/03-roles.md §4.4 — status=error MUST write variant metrics + artifacts_uri. + """spec/v0/03-roles.md §4.4 — status=error MUST write variant evaluation + artifacts_uri. - "metrics — set to the submission's `metrics` when status ∈ + "evaluation — set to the submission's `evaluation` when status ∈ {'success', 'error'}." 
Distinct from the evaluation_error case (which - discards metrics): the §4.4 variant-side write rule is per-status, - and the error path keeps the metrics around because the variant + discards the evaluation): the §4.4 variant-side write rule is per-status, + and the error path keeps the evaluation around because the variant DID run; only the run failed. The reject reason is `worker_error` — `validation_error` would discard the payload instead. @@ -204,13 +204,13 @@ def test_status_error_writes_variant_evaluation_and_artifacts( def test_eval_error_keeps_variant_starting_and_does_not_graft_evaluation( wire_client: WireClient, ) -> None: - """spec/v0/03-roles.md §4.4 — evaluation_error MUST keep variant in starting; metrics discarded. + """spec/v0/03-roles.md §4.4 — evaluation_error keeps variant starting; evaluation discarded. "When status == evaluation_error, the orchestrator MUST NOT write - metrics on the variant; any submission-carried metrics is - discarded." Observed: after submitting evaluation_error with metrics + evaluation on the variant; any submission-carried evaluation is + discarded." Observed: after submitting evaluation_error with an evaluation and rejecting the task, the variant stays in `starting` and its - `metrics` field is unset. + `evaluation` field is unset. 
""" eval_tid, variant_id = _drive_to_starting_variant(wire_client) c = _seed.claim(wire_client, eval_tid) @@ -227,18 +227,18 @@ def test_eval_error_keeps_variant_starting_and_does_not_graft_evaluation( assert 200 <= rejected.status_code < 300, rejected.text variant = _seed.read_variant(wire_client, variant_id) assert variant["status"] == "starting" - assert variant.get("metrics") is None, variant + assert variant.get("evaluation") is None, variant assert variant.get("artifacts_uri") is None, variant def test_retry_exhausted_eval_error_does_not_graft_prior_evaluation( wire_client: WireClient, ) -> None: - """spec/v0/03-roles.md §4.4 — retry-exhausted evaluation_error MUST NOT graft prior metrics. + """spec/v0/03-roles.md §4.4 — retry-exhausted evaluation_error MUST NOT graft prior evaluation. "On the retry-exhausted evaluation_error terminal transition itself, - the orchestrator MUST NOT graft metrics or artifacts from any - prior evaluation_error submission onto the variant; the variant's metrics + the orchestrator MUST NOT graft the evaluation payload or artifacts from any + prior evaluation_error submission onto the variant; the variant's evaluation and artifacts_uri fields remain unset." """ eval_tid, variant_id = _drive_to_starting_variant(wire_client) @@ -259,7 +259,7 @@ def test_retry_exhausted_eval_error_does_not_graft_prior_evaluation( assert 200 <= decl.status_code < 300, decl.text variant = _seed.read_variant(wire_client, variant_id) assert variant["status"] == "evaluation_error" - assert variant.get("metrics") is None, variant + assert variant.get("evaluation") is None, variant assert variant.get("artifacts_uri") is None, variant @@ -269,8 +269,8 @@ def test_resubmit_idempotent_under_role_rules( """spec/v0/03-roles.md §4.4 — identical resubmission MUST be accepted; only one task.submitted. Per the §4.4 amendment in this chunk: "identical normative - fields (`variant_id`, `status`, `metrics`) MUST be accepted." 
The - test holds all four fields (`variant_id`, `status`, `metrics`, + fields (`variant_id`, `status`, `evaluation`) MUST be accepted." The + test holds all four fields (`variant_id`, `status`, `evaluation`, `artifacts_uri`) identical between the two submits as the baseline-equivalence check; the artifacts_uri-non-equivalence test below pins the half of the amendment that says @@ -314,7 +314,7 @@ def test_resubmit_with_different_artifacts_uri_is_idempotent( Per the §4.4 amendment in this chunk: "`artifacts_uri` is NOT part of equivalence — the first submission's `artifacts_uri` is the committed one." Two submits with identical - `variant_id`+`status`+`metrics` but DIFFERENT `artifacts_uri` + `variant_id`+`status`+`evaluation` but DIFFERENT `artifacts_uri` MUST both return 200, MUST emit only one `task.submitted`, and after `/accept` the variant's `artifacts_uri` MUST equal the *first* submission's value. diff --git a/docs/roadmap.md b/docs/roadmap.md index 09fadab6..dcf2c35d 100644 --- a/docs/roadmap.md +++ b/docs/roadmap.md @@ -263,6 +263,7 @@ Units and chunking to be named closer to execution — too far ahead to estimate - [12c](plans/eden-phase-12c-control-plane.md) — Control plane — **shipped 2026-05-19** (see [CHANGELOG](../CHANGELOG.md)) - [#145](plans/issue-145-per-route-store-swap.md) — Per-route store swapping for the experiment switcher (12c §3.6 backfill) — **shipped 2026-06-02** (see [CHANGELOG](../CHANGELOG.md)) - [#122](plans/issue-122-baseline-variant.md) — Evaluatable baseline variant (seed becomes a `kind="baseline"` Variant) — **shipped 2026-06-02** (see [CHANGELOG](../CHANGELOG.md)) +- [#273](https://github.com/ealt/eden/issues/273) — Fix spec/impl drift: align prose + integrator manifest to the `evaluation` field name (Option 1) — **shipped 2026-06-03** (see [CHANGELOG](../CHANGELOG.md)) --- diff --git a/reference/packages/eden-checkpoint/tests/test_checkpoint_roundtrip.py b/reference/packages/eden-checkpoint/tests/test_checkpoint_roundtrip.py 
index fab6aa91..b9482e60 100644 --- a/reference/packages/eden-checkpoint/tests/test_checkpoint_roundtrip.py +++ b/reference/packages/eden-checkpoint/tests/test_checkpoint_roundtrip.py @@ -74,7 +74,7 @@ def _build_payload() -> dict[str, list[dict[str, object]]]: "variant_commit_sha": "c" * 40, "artifacts_uri": "checkpoint:sha256:" + sha256_hex(b"var-1 artifacts"), "description": "first variant", - "metrics": {"accuracy": 0.95}, + "evaluation": {"accuracy": 0.95}, "started_at": "2026-05-06T15:03:00Z", "completed_at": "2026-05-06T15:05:00Z", "executed_by": "executor-2", diff --git a/reference/packages/eden-storage/src/eden_storage/_ops/tasks_lifecycle.py b/reference/packages/eden-storage/src/eden_storage/_ops/tasks_lifecycle.py index c94f5de9..fcc341fd 100644 --- a/reference/packages/eden-storage/src/eden_storage/_ops/tasks_lifecycle.py +++ b/reference/packages/eden-storage/src/eden_storage/_ops/tasks_lifecycle.py @@ -980,7 +980,7 @@ def _validate_evaluate_acceptance( if variant.commit_sha is None: return f"variant {submission.variant_id!r} has no commit_sha" if submission.evaluation is None: - return "success submission requires metrics (03-roles.md §4.4)" + return "success submission requires evaluation (03-roles.md §4.4)" try: self._validate_evaluation(submission.evaluation) except InvalidPrecondition as exc: diff --git a/reference/packages/eden-storage/src/eden_storage/_ops/variants.py b/reference/packages/eden-storage/src/eden_storage/_ops/variants.py index cfdaf9df..c38df490 100644 --- a/reference/packages/eden-storage/src/eden_storage/_ops/variants.py +++ b/reference/packages/eden-storage/src/eden_storage/_ops/variants.py @@ -97,7 +97,7 @@ def create_variant(self, variant: Variant) -> None: def declare_variant_evaluation_error(self, variant_id: str) -> None: """Retry-exhausted: ``starting → evaluation_error`` (``05-event-protocol.md`` §2.2). 
- Writes ``completed_at`` atomically; MUST NOT set metrics or + Writes ``completed_at`` atomically; MUST NOT set evaluation or artifacts_uri (``03-roles.md`` §4.4). """ with self._atomic_operation(): diff --git a/reference/packages/eden-storage/src/eden_storage/submissions.py b/reference/packages/eden-storage/src/eden_storage/submissions.py index ca004c0e..62f246b6 100644 --- a/reference/packages/eden-storage/src/eden_storage/submissions.py +++ b/reference/packages/eden-storage/src/eden_storage/submissions.py @@ -13,12 +13,12 @@ - ``IdeaSubmission`` — status + idea_ids (set-equivalent per 04 §4.2). - ``VariantSubmission`` — status + variant_id + commit_sha (03 §3.4). -- ``EvaluationSubmission`` — status + variant_id + metrics + optional - artifacts_uri (03 §4.4, 04 §4.2 on metrics equivalence). +- ``EvaluationSubmission`` — status + variant_id + evaluation + optional + artifacts_uri (03 §4.4, 04 §4.2 on evaluation equivalence). All three are ``frozen=True`` so callers cannot rebind fields after construction. ``submit`` still deep-copies on entry and -``read_submission`` on exit, because the ``metrics`` dict on +``read_submission`` on exit, because the ``evaluation`` dict on ``EvaluationSubmission`` is not itself frozen. 
""" diff --git a/reference/packages/eden-storage/tests/test_store_hardening.py b/reference/packages/eden-storage/tests/test_store_hardening.py index 1f4b4dfd..9443a580 100644 --- a/reference/packages/eden-storage/tests/test_store_hardening.py +++ b/reference/packages/eden-storage/tests/test_store_hardening.py @@ -275,7 +275,7 @@ def test_evaluate_success_without_evaluation_routed_to_validation_error( ) reason = store.validate_acceptance("t-eval") assert reason is not None - assert "metrics" in reason + assert "evaluation" in reason def test_driver_routes_malformed_success_to_validation_error( self, make_store: Callable[..., Store] diff --git a/spec/v0/02-data-model.md b/spec/v0/02-data-model.md index 23ef9273..c9308b69 100644 --- a/spec/v0/02-data-model.md +++ b/spec/v0/02-data-model.md @@ -381,15 +381,15 @@ A variant is one completed attempt. | `artifacts_uri` | no | string (URI) | Where the variant's evaluator-produced artifacts live. Written by the orchestrator at evaluation-task terminal time from the evaluator's submission ([`03-roles.md`](03-roles.md) §4.4). | | `executor_artifacts_uri` | no | string (URI) | Where the variant's executor-produced artifacts live (build logs, coverage reports, generated screenshots — output not appropriate to commit to the worker branch). Written by the orchestrator at execution-task terminal time from the executor's submission ([`03-roles.md`](03-roles.md) §3.4). Disjoint from `artifacts_uri`: the executor writes one, the evaluator writes the other, and the orchestrator preserves both. | | `description` | no | string | Human-readable summary. | -| `metrics` | no | object | Evaluation payload; shape dictated by the experiment's evaluation schema. | +| `evaluation` | no | object | Evaluation payload; shape dictated by the experiment's evaluation schema. | | `started_at` | yes | timestamp | When the executor began. | | `completed_at` | no | timestamp | Set when the variant reaches a terminal status. 
Written exactly once by the orchestrator, atomically with the transition from `"starting"` to `"success"`, `"error"`, or `"evaluation_error"` (see [`04-task-protocol.md`](04-task-protocol.md) §4.3 and [`03-roles.md`](03-roles.md) §4.4). | | `executed_by` | no | string (worker_id) | The executor's `worker_id`; written at execution-task submit time (atomically with the variant's status transition out of `"starting"`). | -| `evaluated_by` | no | string (worker_id) | The evaluator's `worker_id` whose metrics were committed; written at evaluation-task submit time. | +| `evaluated_by` | no | string (worker_id) | The evaluator's `worker_id` whose evaluation was committed; written at evaluation-task submit time. | ### 9.2 Evaluation payload -If present, `metrics` MUST be an object whose keys are a subset of the declared evaluation-schema keys and whose values match the declared types (§1.3) or are `null`. Because the evaluation-schema is per-experiment, [`schemas/variant.schema.json`](schemas/variant.schema.json) cannot express this constraint generically and leaves `metrics` as an open object; the per-metric type check is a runtime responsibility of the orchestrator. A conforming orchestrator MUST reject a evaluation payload that violates the experiment's evaluation-schema, and MUST NOT record the variant as `"success"` in that case. +If present, `evaluation` MUST be an object whose keys are a subset of the declared evaluation-schema keys and whose values match the declared types (§1.3) or are `null`. Because the evaluation-schema is per-experiment, [`schemas/variant.schema.json`](schemas/variant.schema.json) cannot express this constraint generically and leaves `evaluation` as an open object; the per-metric type check is a runtime responsibility of the orchestrator. A conforming orchestrator MUST reject an evaluation payload that violates the experiment's evaluation-schema, and MUST NOT record the variant as `"success"` in that case. 
### 9.3 Status transitions diff --git a/spec/v0/03-roles.md b/spec/v0/03-roles.md index edca8465..44d4de7f 100644 --- a/spec/v0/03-roles.md +++ b/spec/v0/03-roles.md @@ -133,10 +133,10 @@ An evaluator receives: The evaluator MUST: -1. Produce a `metrics` object whose keys are a subset of the declared `evaluation_schema` keys and whose values satisfy the per-metric type rules ([`02-data-model.md`](02-data-model.md) §1.3, §9.2). +1. Produce an `evaluation` object whose keys are a subset of the declared `evaluation_schema` keys and whose values satisfy the per-metric type rules ([`02-data-model.md`](02-data-model.md) §1.3, §9.2). 2. Optionally upload supporting artifacts (logs, captured outputs, diagnostic files). -The evaluator MUST NOT modify the worker branch or any protocol-owned mutable state other than the variant fields the submission writes (§4.4) and the task it holds a claim on. In particular, the evaluator MUST NOT write to the variant's `completed_at`, `metrics`, `artifacts_uri`, `description`, or `status` directly; those writes are performed by the orchestrator when the submitted task reaches its terminal state (§4.4, [`04-task-protocol.md`](04-task-protocol.md) §4.3). +The evaluator MUST NOT modify the worker branch or any protocol-owned mutable state other than the variant fields the submission writes (§4.4) and the task it holds a claim on. In particular, the evaluator MUST NOT write to the variant's `completed_at`, `evaluation`, `artifacts_uri`, `description`, or `status` directly; those writes are performed by the orchestrator when the submitted task reaches its terminal state (§4.4, [`04-task-protocol.md`](04-task-protocol.md) §4.3). ### 4.3 Non-interference @@ -151,19 +151,19 @@ The evaluator submits with: - `"success"` — the variant ran and produced metrics. - `"error"` — the variant could not be evaluated for reasons attributable to the variant's own code (build failure, test failure, etc.). The evaluator MAY still include partial metrics. 
- `"evaluation_error"` — the evaluator itself failed for reasons unrelated to the variant's code (infrastructure fault, evaluator bug). While a fresh evaluation task MAY still be created for this variant, the variant's status MUST remain `"starting"`. If the orchestrator's retry policy is exhausted (or the operator abandons evaluation), the orchestrator MUST transition the variant's status to `"evaluation_error"`, making that status terminal for the variant ([`04-task-protocol.md`](04-task-protocol.md) §4.3). -- `metrics` — the evaluation object described in §4.2. MAY be absent when `status == "evaluation_error"`. +- `evaluation` — the evaluation object described in §4.2. MAY be absent when `status == "evaluation_error"`. - `artifacts_uri` — OPTIONAL. A URI the evaluator uploaded supporting artifacts to. On a `submitted → completed` or `submitted → failed` transition (per [`04-task-protocol.md`](04-task-protocol.md) §4.3), the orchestrator MUST write the following variant fields atomically with the event: - `status` — the variant status implied by the submission: `"success"` when the submission's `status == "success"`; `"error"` when the submission's `status == "error"`; unchanged from `"starting"` when the submission's `status == "evaluation_error"` (see §4.4 above for the terminal-retry case). -- `metrics` — set to the submission's `metrics` when `status ∈ {"success", "error"}`. When `status == "evaluation_error"` the orchestrator MUST NOT write `metrics` on the variant; any submission-carried `metrics` is discarded. +- `evaluation` — set to the submission's `evaluation` when `status ∈ {"success", "error"}`. When `status == "evaluation_error"` the orchestrator MUST NOT write `evaluation` on the variant; any submission-carried `evaluation` is discarded. - `artifacts_uri` — set to the submission's `artifacts_uri` when provided and `status ∈ {"success", "error"}`. 
When `status == "evaluation_error"` the orchestrator MUST NOT write `artifacts_uri` on the variant; any submission-carried `artifacts_uri` is discarded. (An evaluator that wishes to retain diagnostic artifacts from a failed attempt MAY reference them in the `task.failed` event for that evaluation task; that channel is defined in [`05-event-protocol.md`](05-event-protocol.md).) - `completed_at` — set to the time of the terminal variant transition, i.e. written exactly once, when the variant's status leaves `"starting"` (either on a `"success"`/`"error"` submission, or on the retry-exhausted `"evaluation_error"` transition). Intermediate `evaluation_error` submissions MUST NOT advance `completed_at`. -On the retry-exhausted `"evaluation_error"` terminal transition itself, the orchestrator MUST NOT graft metrics or artifacts from any prior `evaluation_error` submission onto the variant; the variant's `metrics` and `artifacts_uri` fields remain unset. This keeps the variant object canonical: a variant either carries the outputs of a successful or code-level-failed evaluation, or it carries nothing. +On the retry-exhausted `"evaluation_error"` terminal transition itself, the orchestrator MUST NOT graft the evaluation payload or artifacts from any prior `evaluation_error` submission onto the variant; the variant's `evaluation` and `artifacts_uri` fields remain unset. This keeps the variant object canonical: a variant either carries the outputs of a successful or code-level-failed evaluation, or it carries nothing. -Resubmission is idempotent under the same rules as §3.4 and [`04-task-protocol.md`](04-task-protocol.md) §4.2: identical normative fields (`variant_id`, `status`, `metrics`) MUST be accepted; inconsistent resubmission MUST be rejected. `artifacts_uri` is NOT part of equivalence — the first submission's `artifacts_uri` is the committed one. 
(Earlier drafts of this section listed `artifacts_uri` as part of the equivalence formula; the §4.2 statement is canonical and this section now defers to it.) +Resubmission is idempotent under the same rules as §3.4 and [`04-task-protocol.md`](04-task-protocol.md) §4.2: identical normative fields (`variant_id`, `status`, `evaluation`) MUST be accepted; inconsistent resubmission MUST be rejected. `artifacts_uri` is NOT part of equivalence — the first submission's `artifacts_uri` is the committed one. (Earlier drafts of this section listed `artifacts_uri` as part of the equivalence formula; the §4.2 statement is canonical and this section now defers to it.) ## 5. Integrator diff --git a/spec/v0/04-task-protocol.md b/spec/v0/04-task-protocol.md index 257faab3..ba5dbbe2 100644 --- a/spec/v0/04-task-protocol.md +++ b/spec/v0/04-task-protocol.md @@ -134,7 +134,7 @@ When the claimant matches, the task store MUST handle the resubmission as follow - If the resubmission's result payload is **content-equivalent** to the already-recorded payload, the task store MUST accept it and MUST NOT change the task's state or recorded result. "Content equivalence" means the normative fields identified per role agree: - `ideation` — the set of `idea_ids` (compared as sets; order is not significant per [`03-roles.md`](03-roles.md) §2.4) and `status`. - `execution` — `variant_id`, `status`, and `commit_sha` (when present). - - `evaluation` — `variant_id`, `status`, and `metrics` (compared as JSON values; key order does not matter). + - `evaluation` — `variant_id`, `status`, and `evaluation` (compared as JSON values; key order does not matter). - If the resubmission's result payload is **not** content-equivalent, the task store MUST reject it. The first submission's result is the committed result. This rule exists so that a worker may safely retry a submit after a network or process failure without risk of advancing state twice or corrupting the recorded result. 
Bindings MAY additionally accept an optional caller-supplied `submission_id` field on the wire payload to act as an explicit idempotency key; that is a binding-layer extension and does not weaken the content-equivalence rule. diff --git a/spec/v0/05-event-protocol.md b/spec/v0/05-event-protocol.md index e9e03b44..fb670094 100644 --- a/spec/v0/05-event-protocol.md +++ b/spec/v0/05-event-protocol.md @@ -165,7 +165,7 @@ Note on `experiment.policy_error`: this event records an orchestrator-side fault Every protocol-owned state-machine transition in v0 is covered by exactly one event type in §3.1–§3.4. A subscriber that consumes every event with a registered `type`, in log order, MUST be able to reconstruct the **lifecycle history** of every task, idea, variant, and experiment-scoped configuration in the experiment: which entities exist, in what states, in what order. A conforming implementation MUST NOT expose a state-machine transition that is not marked by its corresponding registered event. -Registered event payloads carry only the fields needed to *identify* the transitioning entity and any cross-entity references (e.g. `idea.dispatched` carries the implement `task_id`). They do not carry full entity snapshots. A subscriber that needs the content of an entity (the idea's `parent_commits` and `artifacts_uri`, the variant's `metrics` and `completed_at`) MUST read that entity from its store using the identifier the event carries; the read returns the entity's current state ([`08-storage.md`](08-storage.md) §1.7). This boundary is deliberate: events mark what happened; the entity stores hold what the entity *is*. Coupling the two into a full event-sourced projection — a subscriber reconstructing every intermediate entity value from events alone — is a deployment MAY implement on top of v0 but is not a protocol requirement, since v0 does not mandate historical reads on the entity stores. 
+Registered event payloads carry only the fields needed to *identify* the transitioning entity and any cross-entity references (e.g. `idea.dispatched` carries the implement `task_id`). They do not carry full entity snapshots. A subscriber that needs the content of an entity (the idea's `parent_commits` and `artifacts_uri`, the variant's `evaluation` and `completed_at`) MUST read that entity from its store using the identifier the event carries; the read returns the entity's current state ([`08-storage.md`](08-storage.md) §1.7). This boundary is deliberate: events mark what happened; the entity stores hold what the entity *is*. Coupling the two into a full event-sourced projection — a subscriber reconstructing every intermediate entity value from events alone — is a deployment MAY implement on top of v0 but is not a protocol requirement, since v0 does not mandate historical reads on the entity stores. The transactional invariant (§2) combined with the atomic event + state-change rule guarantees that any entity reachable via an event's identifier is already durable at read time. diff --git a/spec/v0/06-integrator.md b/spec/v0/06-integrator.md index 173d83f2..b4a0e371 100644 --- a/spec/v0/06-integrator.md +++ b/spec/v0/06-integrator.md @@ -48,7 +48,7 @@ A conforming integrator MUST NOT integrate variants in any other status. In part A conforming integrator MUST NOT integrate a `kind == "baseline"` variant ([`02-data-model.md`](02-data-model.md) §9.4), regardless of its `status`. A baseline has no `work/*` branch to squash and already points at the seed on `main`, so it receives no `variant/*` commit, no `variant_commit_sha`, and no `variant.integrated` event. This carve is paired with the `integration` decision predicate and the termination-drain rule ([`02-data-model.md`](02-data-model.md) §2.4, §2.5), both of which exclude baselines so a successful baseline does not block termination. 
A binding MAY additionally reject a manual/operator `integrate_variant` call against a baseline with `eden://error/invalid-precondition` ([`07-wire-protocol.md`](07-wire-protocol.md) §5) as defense in depth. -The integrator MUST NOT integrate a variant whose `metrics` do not validate against the experiment's `evaluation_schema` ([`02-data-model.md`](02-data-model.md) §9.2, [`08-storage.md`](08-storage.md) §4). The orchestrator's acceptance of a `success` submission is the primary guard for this; the integrator MAY additionally re-validate as defense in depth but MUST NOT silently drop or coerce invalid metrics. +The integrator MUST NOT integrate a variant whose `evaluation` does not validate against the experiment's `evaluation_schema` ([`02-data-model.md`](02-data-model.md) §9.2, [`08-storage.md`](08-storage.md) §4). The orchestrator's acceptance of a `success` submission is the primary guard for this; the integrator MAY additionally re-validate as defense in depth but MUST NOT silently drop or coerce an invalid evaluation payload. ## 3. Integration output @@ -127,7 +127,7 @@ The manifest is a JSON object with the following required fields. Each required | `idea_id` | string | The variant's `idea_id` ([`02-data-model.md`](02-data-model.md) §9.1). | | `commit_sha` | string | The worker-branch tip the evaluator measured ([`02-data-model.md`](02-data-model.md) §9.1). | | `parent_commits` | array of string | The variant's `parent_commits`, in order ([`02-data-model.md`](02-data-model.md) §9.1). | -| `metrics` | object | The evaluator's evaluation payload ([`03-roles.md`](03-roles.md) §4.4), conforming to the experiment's `evaluation_schema`. | +| `evaluation` | object | The evaluator's evaluation payload ([`03-roles.md`](03-roles.md) §4.4), conforming to the experiment's `evaluation_schema`. | | `completed_at` | timestamp | The variant's `completed_at` ([`02-data-model.md`](02-data-model.md) §9.1). UTC, RFC 3339 profile as elsewhere in the data model. 
| Optional fields: diff --git a/spec/v0/07-wire-protocol.md b/spec/v0/07-wire-protocol.md index 39199b64..0c406603 100644 --- a/spec/v0/07-wire-protocol.md +++ b/spec/v0/07-wire-protocol.md @@ -364,7 +364,7 @@ Claim, reject, reclaim, and accept are **not** blindly retry-safe on transport f The reference `eden_wire` server also exposes: - `GET /_reference/experiments/{E}/tasks/{T}/validate-terminal` -- `POST /_reference/experiments/{E}/validate/metrics` +- `POST /_reference/experiments/{E}/validate/evaluation` These are conveniences for the Phase-5 dispatch driver and are **not** part of the normative binding. A conforming third-party client MUST NOT rely on them being present. A conforming third-party orchestrator implementing its own accept/reject decision inline is free to do so; the [`04-task-protocol.md`](04-task-protocol.md) §4.3 decision rules are all that matter for the state machine. diff --git a/spec/v0/08-storage.md b/spec/v0/08-storage.md index cc27c9fa..edf167dd 100644 --- a/spec/v0/08-storage.md +++ b/spec/v0/08-storage.md @@ -151,7 +151,7 @@ Every experiment declares an evaluation schema in its `experiment_config` ([`02- ### 4.1 Registration -At experiment registration time, a conforming deployment MUST persist the experiment's evaluation schema durably and atomically with the experiment's other configuration. A subsequent write of a variant's `metrics` field MUST be validated against the schema registered for that experiment; no write MAY bypass this validation. +At experiment registration time, a conforming deployment MUST persist the experiment's evaluation schema durably and atomically with the experiment's other configuration. A subsequent write of a variant's `evaluation` field MUST be validated against the schema registered for that experiment; no write MAY bypass this validation. 
### 4.2 Immutability during an experiment @@ -163,7 +163,7 @@ The content is canonicality: comparing variants across an experiment only has me A successful write of a `variant.evaluation` payload MUST satisfy: -- Every key in `metrics` is present in the experiment's evaluation schema. +- Every key in the `evaluation` payload is present in the experiment's evaluation schema. - Every value either satisfies the declared type of its key — per the type mapping in [`02-data-model.md`](02-data-model.md) §1.3 (`integer`, `real`, `text`) — or is `null`. - No reserved name ([`02-data-model.md`](02-data-model.md) §8.2) appears as a key. diff --git a/spec/v0/10-checkpoints.md b/spec/v0/10-checkpoints.md index 9f1535bf..8b8aadca 100644 --- a/spec/v0/10-checkpoints.md +++ b/spec/v0/10-checkpoints.md @@ -184,7 +184,7 @@ The contract per object kind: **Tasks.** Every `task_id` in the source is present in the import. `kind`, `payload`, `target`, `created_by`, `submitted_by`, `created_at`, `updated_at` round-trip verbatim. `state` round-trips verbatim EXCEPT `claimed` becomes `pending` per (c) above. The `claim` field is empty on every imported task. -**Ideas, variants, submissions.** Round-trip identical to their schema-validated forms, except `artifacts_uri` per (a). Variant `metrics`, `commit_sha`, `variant_commit_sha`, `branch`, `parent_commits`, `description`, `executed_by`, `evaluated_by`, `completed_at`, `status` all round-trip verbatim. +**Ideas, variants, submissions.** Round-trip identical to their schema-validated forms, except `artifacts_uri` per (a). Variant `evaluation`, `commit_sha`, `variant_commit_sha`, `branch`, `parent_commits`, `description`, `executed_by`, `evaluated_by`, `completed_at`, `status` all round-trip verbatim. **Events.** Replay in the same order with the same per-event `type` / `occurred_at` / `experiment_id` / `data` payload. The `event_id` MAY differ per (b). 
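The three write rules for a `variant.evaluation` payload in the `08-storage.md` §4.3 hunk above can be sketched as a small validator. This is a minimal illustration, not the `eden_storage` implementation: the function name `evaluation_errors`, the flat `{key: type_name}` schema shape, the placeholder `RESERVED_NAMES` set, and the choices to reject booleans and to let integers satisfy `real` are all assumptions made for the example. The authoritative type mapping and reserved names live in `02-data-model.md` §1.3 and §8.2.

```python
from typing import Any, List, Mapping

# Hypothetical placeholder; the real reserved names are defined in
# 02-data-model.md §8.2 and are not reproduced here.
RESERVED_NAMES = frozenset({"example_reserved"})


def _matches(declared: str, value: Any) -> bool:
    """Return True if value satisfies the declared type or is null."""
    if value is None:
        return True  # null is always permitted by the §4.3 rule
    if isinstance(value, bool):
        return False  # assumption: booleans are not integer/real/text
    if declared == "integer":
        return isinstance(value, int)
    if declared == "real":
        return isinstance(value, (int, float))  # assumption: ints count as real
    if declared == "text":
        return isinstance(value, str)
    return False  # unknown declared type: treat as a mismatch


def evaluation_errors(
    evaluation: Mapping[str, Any],
    schema: Mapping[str, str],
    reserved: frozenset = RESERVED_NAMES,
) -> List[str]:
    """Collect one message per key that violates the §4.3 write rules."""
    errors: List[str] = []
    for key, value in evaluation.items():
        if key in reserved:
            errors.append(f"reserved name used as key: {key}")
        elif key not in schema:
            errors.append(f"key not in evaluation schema: {key}")
        elif not _matches(schema[key], value):
            errors.append(f"value for {key} is not {schema[key]} or null")
    return errors
```

A write would be accepted only when the returned list is empty. Returning every violation, rather than raising on the first one, is a design choice for the sketch: it lets a caller build a single rejection reason that names all offending keys.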
diff --git a/spec/v0/schemas/variant.schema.json b/spec/v0/schemas/variant.schema.json index e9c95de7..16b23499 100644 --- a/spec/v0/schemas/variant.schema.json +++ b/spec/v0/schemas/variant.schema.json @@ -94,7 +94,7 @@ "minLength": 1, "maxLength": 64, "pattern": "^[a-z0-9][a-z0-9_-]{0,63}$", - "description": "worker_id of the evaluator whose metrics were committed; written at evaluation-task submit time." + "description": "worker_id of the evaluator whose evaluation was committed; written at evaluation-task submit time." } }, "allOf": [