Fix #273: align prose + manifest to schema/model 'evaluation' naming #279

Merged
ealt merged 1 commit into main from impl/issue-273-evaluation-metrics-drift
Jun 4, 2026

ealt commented Jun 3, 2026

Summary

  • Why: The variant evaluation-payload field is evaluation in variant.schema.json, the eden_contracts.Variant model, and the storage/wire impl, but the spec prose spelled it metrics across eight chapters plus the integrator-manifest table. Two names for one field is a readability/onboarding hazard and a latent parity trap — the schema-parity job only checks schema↔model (both already evaluation), so the prose metrics naming was unguarded.
  • Option 1 (selected during triage): align prose + manifest to the schema/model evaluation naming, keeping the wire + on-tree manifest key stable. The reference integrator already emitted evaluation in .eden/variants/<id>/evaluation.json (see _manifest.py), so this is a prose-and-docstring correction, not a manifest-shape change.
  • Scope was wider than the issue's two named locations. The issue named 02-data-model.md §9.1 and 06-integrator.md §4.2, but a full audit (per the AGENTS.md "spec inter-chapter restatement is a conflict surface" pitfall) found the same field-name spelling restated across eight chapters (02/03/04/05/06/07/08/10). Renaming only the two named spots would have relocated the drift, so every backtick'd field reference is renamed in lockstep.
  • Latent conformance bug fixed: test_evaluator_submission.py asserted variant.get("metrics") is None in two evaluation_error scenarios — always None on the wire (field is evaluation), so those assertions passed trivially and never tested the no-graft guarantee they claimed. Corrected to variant.get("evaluation").
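Why the old assertion could never fail can be shown with a minimal sketch. Only the field names `evaluation` and `metrics` and the `evaluation_error` scenario come from this PR; the payload shape, the `status` key, and the `no_graft` helper are hypothetical stand-ins, not the real conformance test.

```python
# Hypothetical wire payload after an evaluation_error (shape is assumed;
# only the `evaluation` field name is taken from the PR description).
variant = {"id": "v-1", "status": "evaluation_error", "evaluation": None}

# Simulated regression: evaluation data wrongly grafted onto the variant.
grafted = {**variant, "evaluation": {"score": 0.9}}

# Old assertion: "metrics" is never a key on the wire, so .get() returns
# None for both payloads and the check passes even when the graft happened.
assert variant.get("metrics") is None
assert grafted.get("metrics") is None  # vacuous: the regression is missed


def no_graft(v: dict) -> bool:
    """Corrected check (hypothetical helper): inspect the real field name."""
    return v.get("evaluation") is None


assert no_graft(variant)       # clean payload passes
assert not no_graft(grafted)   # grafted payload is now detected
```

The general pattern: `dict.get` on a key that can never be present always yields `None`, so an `is None` assertion on it checks nothing.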

What this does NOT cover

  • baseline.metrics config block (02-data-model.md §2.7, experiment-config.schema.json, eden_contracts.config.BaselineConfig.metrics) — deliberately keeps the metrics name. It is a distinct config field that writes into variant.evaluation; renaming it would touch the config surface, which Option 1 explicitly keeps stable. Not a deferral — an intentional boundary.
  • Plain-English "metric" concept uses (metric values §1.3, metric names in the evaluation schema §8, "objective over metrics") — left untouched. A metric is a real domain concept; the field that holds the metrics is evaluation.
  • docs/conformance-coverage.md: a non-CI-enforced generated snapshot, last regenerated at #112 (Issue #99: Per-claim MUST/SHOULD audit, chapters 02/03/04/05/06/07/08/09) and drifted ~40 keyword lines since #122 (Evaluatable baseline variant: seed becomes a kind=baseline Variant). Regenerating it in this PR would dump unrelated #122-era churn into a focused rename, so it is left for routine regeneration (it will pick up the renamed prose then). No tracked deferral: the doc self-documents that it is generated and re-runnable.
  • No new runtime behavior: the only impl changes are docstrings + one human-readable validate_acceptance reason string. No wire/schema/model field changes.
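The boundary between the `baseline.metrics` config block and the variant `evaluation` field can be pictured with a small sketch. The two dataclasses below are simplified stand-ins, not the real `eden_contracts` models, and `seed_baseline_variant` is a hypothetical name; the only details taken from this PR are that the config field is called `metrics`, the variant field is called `evaluation`, and the former writes into the latter.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class BaselineConfig:
    """Stand-in for the config block; keeps the name `metrics`."""
    metrics: Dict[str, float] = field(default_factory=dict)


@dataclass
class Variant:
    """Stand-in for the variant; its payload field is `evaluation`."""
    kind: str
    evaluation: Optional[Dict[str, float]] = None


def seed_baseline_variant(cfg: BaselineConfig) -> Variant:
    # Hypothetical helper: config-side `metrics` is copied into the
    # variant-side `evaluation` field, so the two names never collide.
    return Variant(kind="baseline", evaluation=dict(cfg.metrics))


baseline = seed_baseline_variant(BaselineConfig(metrics={"accuracy": 0.5}))
```

Under these assumptions, renaming the variant field in prose leaves the config surface untouched, which is the boundary Option 1 draws.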

Fresh-operator walkthrough

N/A — internal change only (spec prose, docstrings, one error-message string, test data). No operator-facing surface changes. The 07-wire-protocol.md §11 reference-helper endpoint path was corrected to /validate/evaluation to match the impl, which already exposed that path — no behavior change.

Test plan

All literal pre-push gates from AGENTS.md "Commands" run locally:

  • python3 scripts/check-rename-discipline.py — clean
  • uv run ruff check . — clean
  • uv run pyright — 0 errors
  • markdownlint-cli2 (spec + CHANGELOG + roadmap) — 0 errors
  • check-jsonschema --check-metaschema — ok
  • uv run pytest reference/packages/eden-contracts/tests/test_schema_parity.py — 219 passed
  • python3 scripts/spec-xref-check.py — all 1105 §-refs resolve
  • check_citations.py — 267 scenarios cite valid spec MUSTs
  • uv run pytest -q — 2121 passed, 230 skipped, 3 failed (all pre-existing/environmental: error: 1Password: failed to fill whole buffer on git commit in temp-repo tests — local SSH-signing lock, unrelated to this change; none in touched files)
  • uv run pytest -q conformance/ -n auto — 255 passed, 13 skipped
  • bash reference/compose/healthcheck/smoke.sh — PASS (includes "variant/* refs published to forgejo", exercising the integrator evaluation manifest end-to-end)

Related issues

🤖 Generated with Claude Code

The variant evaluation-payload field is `evaluation` in variant.schema.json,
the eden_contracts.Variant model, and the storage/wire impl, but the spec
prose spelled it `metrics` across eight chapters plus the integrator-manifest
table. Option 1 from the issue: rename the prose + manifest to `evaluation`,
keeping the wire and on-tree manifest key stable (the reference integrator
already emitted `evaluation` in evaluation.json).

A full audit (per the AGENTS.md inter-chapter-restatement pitfall) found the
drift was wider than the issue's two named locations — every backtick'd field
reference is renamed in lockstep across spec chapters 02/03/04/05/06/07/08/10,
the variant.schema.json evaluated_by description, impl docstrings, and the
validate_acceptance reason string.

Also fixes a latent conformance bug: test_evaluator_submission.py asserted
`variant.get("metrics") is None`, which was always None on the wire (field is
`evaluation`) and so never tested the no-graft guarantee it claimed.

The baseline.metrics config block keeps its name (distinct config field that
writes into variant.evaluation; out of scope for Option 1). Plain-English
"metric" concept uses are untouched. docs/conformance-coverage.md (a stale,
non-CI-enforced generated snapshot) is left for routine regeneration to avoid
unrelated #122-era churn in this focused rename.

Closes #273.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ealt force-pushed the impl/issue-273-evaluation-metrics-drift branch from c2f0519 to d654cb6 June 4, 2026 02:42
ealt merged commit edf2a83 into main Jun 4, 2026
39 of 40 checks passed

Development

Successfully merging this pull request may close these issues.

Spec/impl drift: variant evaluation-payload field is evaluation (schema/model) vs metrics (02-data-model §9.1 prose + integrator manifest)