Fix #273: align prose + manifest to schema/model 'evaluation' naming#279
Merged
Conversation
The variant evaluation-payload field is `evaluation` in variant.schema.json,
the eden_contracts.Variant model, and the storage/wire impl, but the spec
prose spelled it `metrics` across eight chapters plus the integrator-manifest
table. Option 1 from the issue: rename the prose + manifest to `evaluation`,
keeping the wire and on-tree manifest key stable (the reference integrator
already emitted `evaluation` in evaluation.json).
A full audit (per the AGENTS.md inter-chapter-restatement pitfall) found the
drift was wider than the issue's two named locations — every backtick'd field
reference is renamed in lockstep across spec chapters 02/03/04/05/06/07/08/10,
the variant.schema.json evaluated_by description, impl docstrings, and the
validate_acceptance reason string.
Also fixes a latent conformance bug: test_evaluator_submission.py asserted
`variant.get("metrics") is None`, which was always None on the wire (field is
`evaluation`) and so never tested the no-graft guarantee it claimed.
The baseline.metrics config block keeps its name (distinct config field that
writes into variant.evaluation; out of scope for Option 1). Plain-English
"metric" concept uses are untouched. docs/conformance-coverage.md (a stale,
non-CI-enforced generated snapshot) is left for routine regeneration to avoid
unrelated #122-era churn in this focused rename.
Closes #273.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
c2f0519 to
d654cb6
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
evaluationinvariant.schema.json, theeden_contracts.Variantmodel, and the storage/wire impl, but the spec prose spelled itmetricsacross eight chapters plus the integrator-manifest table. Two names for one field is a readability/onboarding hazard and a latent parity trap — theschema-parityjob only checks schema↔model (both alreadyevaluation), so the prosemetricsnaming was unguarded.evaluationnaming, keeping the wire + on-tree manifest key stable. The reference integrator already emittedevaluationin.eden/variants/<id>/evaluation.json(see_manifest.py), so this is a prose-and-docstring correction, not a manifest-shape change.02-data-model.md§9.1 and06-integrator.md§4.2, but a full audit (per the AGENTS.md "spec inter-chapter restatement is a conflict surface" pitfall) found the same field-name spelling restated across eight chapters (02/03/04/05/06/07/08/10). Renaming only the two named spots would have relocated the drift, so every backtick'd field reference is renamed in lockstep.test_evaluator_submission.pyassertedvariant.get("metrics") is Nonein twoevaluation_errorscenarios — alwaysNoneon the wire (field isevaluation), so those assertions passed trivially and never tested the no-graft guarantee they claimed. Corrected tovariant.get("evaluation").What this does NOT cover
baseline.metricsconfig block (02-data-model.md§2.7,experiment-config.schema.json,eden_contracts.config.BaselineConfig.metrics) — deliberately keeps themetricsname. It is a distinct config field that writes intovariant.evaluation; renaming it would touch the config surface, which Option 1 explicitly keeps stable. Not a deferral — an intentional boundary.evaluation.docs/conformance-coverage.md— a non-CI-enforced generated snapshot, last regenerated at Issue #99: Per-claim MUST/SHOULD audit (chapters 02/03/04/05/06/07/08/09) #112 and drifted ~40 keyword lines since Evaluatable baseline variant (seed becomes a kind=baseline Variant) #122. Regenerating it in this PR would dump unrelated Evaluatable baseline variant (seed becomes a kind=baseline Variant) #122-era churn into a focused rename, so it is left for routine regeneration (it will pick up the renamed prose then). No tracked deferral — the doc self-documents that it is generated and re-runnable.validate_acceptancereason string. No wire/schema/model field changes.Fresh-operator walkthrough
N/A — internal change only (spec prose, docstrings, one error-message string, test data). No operator-facing surface changes. The
07-wire-protocol.md§11 reference-helper endpoint path was corrected to/validate/evaluationto match the impl, which already exposed that path — no behavior change.Test plan
All literal pre-push gates from AGENTS.md "Commands" run locally:
python3 scripts/check-rename-discipline.py— cleanuv run ruff check .— cleanuv run pyright— 0 errorsmarkdownlint-cli2(spec + CHANGELOG + roadmap) — 0 errorscheck-jsonschema --check-metaschema— okuv run pytest reference/packages/eden-contracts/tests/test_schema_parity.py— 219 passedpython3 scripts/spec-xref-check.py— all 1105 §-refs resolvecheck_citations.py— 267 scenarios cite valid spec MUSTsuv run pytest -q— 2121 passed, 230 skipped, 3 failed (all pre-existing/environmental:error: 1Password: failed to fill whole bufferongit commitin temp-repo tests — local SSH-signing lock, unrelated to this change; none in touched files)uv run pytest -q conformance/ -n auto— 255 passed, 13 skippedbash reference/compose/healthcheck/smoke.sh— PASS (includes "variant/* refs published to forgejo", exercising the integratorevaluationmanifest end-to-end)Related issues
evaluation(schema/model) vsmetrics(02-data-model §9.1 prose + integrator manifest) #273 — Spec/impl drift: variant evaluation-payload field isevaluation(schema/model) vsmetrics(prose + manifest)🤖 Generated with Claude Code