Skip to content

Implement: Disambiguate user-facing names from system ids (#128)#278

Merged
ealt merged 28 commits into
mainfrom
impl/issue-128-disambiguate-names
Jun 4, 2026
Merged

Implement: Disambiguate user-facing names from system ids (#128)#278
ealt merged 28 commits into
mainfrom
impl/issue-128-disambiguate-names

Conversation

@ealt
Copy link
Copy Markdown
Owner

@ealt ealt commented Jun 3, 2026

Closes #128. Implements docs/plans/identity-id-name-disambiguation.md. Strict prereq for the cluster-identity work — unblocks #140 / #141 / #143 / #144.

What this does

One atomic rename that stops conflating system identifier with operator-facing label for the three identity-carrying entities (experiment, worker, group). Their ids become opaque, system-minted, immutable (exp_* / wkr_* / grp_* — a type-prefix + 26-char lowercase Crockford-base32 ULID), and each gains an OPTIONAL operator-supplied display name. Reserved values move from id-space to name-space. Pre-external-user clean break — no compat shims, no migration tooling; existing experiments are re-bootstrapped.

Landed as 5 review-slice waves inside this PR (see commits):

  1. Spec + glossary02-data-model.md §1.6 (opaque-id grammar + composite actor admin|wkr_* / member wkr_*|grp_* shapes) and §1.7 (display-name grammar) are the canonical statement; chapters 01/03/04/05/07/08/09/10/11 defer to them. Register endpoints drop the caller id + take optional name (server mints); ?name= lookups; bearer principal admin|wkr_*; checkpoint import mints a fresh exp_* and stamps imported_from.source_experiment_id provenance (normative behavior change).
  2. Schemas + contracts + storage — all JSON Schemas ↔ Pydantic in lockstep; _common.py opaque-id types + Crockford ULID minter; register_* MINT ids (no id-idempotency; restart recovery = persisted id + reissue_credential); reserved-name guard + allow_reserved seed seam; name columns + indexes (storage migration v7); InvalidName → 422.
  3. Wire + auth + services + setup-experiment — mint-and-return id flow end to end; setup-experiment.sh is the sole minter of infra workers (ids → .env, tokens → credentials dir, idempotent via .env); compose overlays + healthcheck smokes resolve ids from .env / ?name= / opaque-id URLs.
  4. Web UI + eden-manual + operator docs — forms POST name; lists/details render <name> (<id>) with a name-search box; /admin gate resolves admins by reserved name.
  5. Conformance — harness mints exp_* per scenario and resolves display-name handles → minted ids transparently; scenarios register-then-use, assert reserved-value rejection in name-space + checkpoint provenance.

Validation

  • uv run pytest -q — full reference suite green (the lone local flake is an os.killpg sandbox restriction on sequential subprocess teardown; passes in isolation; Linux CI unaffected).
  • uv run pytest -q conformance/ — green.
  • schema-parity (contracts + wire), check_citations, check-jsonschema, check-rename-discipline, markdownlint, spec-xref — all green.
  • The docker-gated compose smokes (smoke*.sh / e2e.sh) are updated for the rename and pass bash -n locally; they require docker, so the compose-* CI jobs are their live validation.

Fresh-operator walkthrough

  • setup-experiment.sh <config> --name "<display>" now MINTS the opaque exp_* and all infra-worker wkr_* / reserved-group grp_* ids, writing them to .env (re-run reuses them — idempotency lives in .env). The experiment-id path segment / data-root / Forgejo repo path are the opaque exp_*; the mnemonic mention is gone, mitigated by GET /v0/control/experiments?name= and the web-ui switcher rendering <name> (<exp_*>).
  • Registering a worker/group: POST {"name": "..."} (or use the web-ui form / eden register-worker --name); the server returns the minted id. Pickers/lists render <name> (<id>); logs/events use the id only.
  • The deployment-admin bearer is unchanged (literal admin); reserved names admin/system/internal (workers) and admins/orchestrators (groups) are rejected at register-time (409 reserved-identifier); ill-formed names → 422 invalid-name.

What this does NOT cover (deferred, tracked)

🤖 Generated with Claude Code

@ealt ealt force-pushed the impl/issue-128-disambiguate-names branch from 85dab80 to 00d7835 Compare June 3, 2026 21:41
@ealt
Copy link
Copy Markdown
Owner Author

ealt commented Jun 3, 2026

Update — rebased onto #122 + codex-review converged.

  • Rebased onto main after Evaluatable baseline variant (seed becomes a kind=baseline Variant) #122 (evaluatable baseline variant) merged. Conflicts in spec (02/07), storage backends, and contracts resolved to carry both features: variant.kind + conditional idea_id + experiment.base_commit_sha (Evaluatable baseline variant (seed becomes a kind=baseline Variant) #122) alongside the opaque-id/name split (Disambiguate user-facing names from system-generated ids (experiments, workers, groups) #128). Storage name-column migration renumbered v7→v8 (after Evaluatable baseline variant (seed becomes a kind=baseline Variant) #122's base_commit_sha v7). Evaluatable baseline variant (seed becomes a kind=baseline Variant) #122's baseline test fixtures + parity fixtures migrated to opaque exp_* ids.
  • smoke.sh passes end-to-end — the full Compose mint pipeline (setup-experiment mints exp_*/wkr_*/grp_*.env + credentials → 8 services up healthy → experiment drives to quiescence → reserved groups + variant/* refs asserted). Fixed a bootstrap-ordering bug surfaced only by docker: relaxed the orchestrator/web-ui --worker-id compose guards from ${VAR:?} to plain ${VAR} (the ids are server-minted, so they can't exist before setup brings the task-store-server up — which is after setup's own docker compose build).
  • codex-review converged (5 rounds) — records under docs/plans/review/issue-128/impl/20260603T120000/. Findings fixed: (blocking) checkpoint import recovery probe now targets the receiver's own id + matches source_experiment_id; (major) control-plane reserved-group second-create guard + bootstrap race handling; (major) labels/members null-vs-absent; (major) reconciled the checkpoint-import spec to the dual receiver model (multi-experiment mints fresh; single-experiment reference server imports under its one configured id — invariant = source-id non-reuse + provenance); (major) imported_from.source_experiment_id made required-within-imported_from.
  • Gates green: full reference suite (2193 passed; the one local fail is an os.killpg sandbox flake that passes in isolation and on Linux CI), conformance, schema-parity, jsonschema, rename-discipline, spec-xref, markdownlint.

@ealt ealt force-pushed the impl/issue-128-disambiguate-names branch from dd1521a to 1cf9c20 Compare June 3, 2026 22:41
@ealt ealt force-pushed the impl/issue-128-disambiguate-names branch from 1cf9c20 to e262f01 Compare June 3, 2026 23:48
@ealt ealt enabled auto-merge (squash) June 3, 2026 23:49
@ealt ealt force-pushed the impl/issue-128-disambiguate-names branch from e262f01 to 45257d2 Compare June 3, 2026 23:51
ealt and others added 23 commits June 4, 2026 09:19
Define the canonical opaque-id grammar (exp_/wkr_/grp_ + Crockford-base32
ULID) and display-name grammar in 02-data-model.md §1.6/§1.7; reserved
values move from id-space to name-space. Propagate across chapters
01/03/04/05/07/08/09/10/11 (defer to §1.6/§1.7, no restated grammar),
including the wire register-mints-id shapes, ?name= lookups, bearer
principal admin|wkr_*, and the checkpoint mint-fresh-on-import provenance
(imported_from.source_experiment_id). Update docs/glossary.md +
docs/prds/eden-experiment-platform.md.

Wave 1 of the atomic rename in docs/plans/identity-id-name-disambiguation.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
eden-contracts: opaque-id grammars (exp_/wkr_/grp_), ActorId/MemberId
unions, DisplayName, and a Crockford-base32 ULID minter in _common.py;
every model carries the opaque types + optional name; ImportProvenance
gains source_experiment_id. JSON Schemas (core + wire) updated in lockstep
with a display-name FormatChecker; fixtures swapped to opaque (schema
parity 228 cases). eden-storage + eden-control-plane: register_worker/
register_group/register_experiment now MINT opaque ids and take optional
name; reserved values move to name-space (workers admin/system/internal,
groups admins/orchestrators) with an allow_reserved seed seam; name TEXT
columns + indexes (storage migration v7); InvalidName error (422). No
id-based idempotency (mint-always); restart recovery uses persisted id +
reissue_credential.

Tracked decisions for review: checkpoint import does not thread experiment
name (receiver supplies its own); control-plane uses exp_0..0 sentinel for
deployment-scoped workers' experiment_id.

Gates: contracts/storage/control-plane suites green (738 passed), schema
+ wire-schema parity, jsonschema metaschema, rename-discipline.

Wave 2 of the atomic rename (docs/plans/identity-id-name-disambiguation.md).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
eden-wire: register endpoints mint ids (drop caller-supplied id, accept
optional name), ?name= list filter, whoami returns name; bearer principal
grammar admin|wkr_*; StoreClient.whoami() returns WhoamiResult; authority
groups resolved by reserved name to grp_* id; register_group allow_reserved
gated on the admin principal. eden-service-common bootstrap: services never
fresh-register — verify-via-whoami then admin reissue_credential for a known
worker_id (setup mints). Control-plane server, orchestrator control-plane
bootstrap (self-registers + persists minted id, joins orchestrators by name),
and all service/dispatch test fixtures migrated to minted opaque ids +
name-space reserved values.

Wave 3a/3b of the atomic rename (docs/plans/identity-id-name-disambiguation.md).
setup-experiment.sh + .env (3c) and web-ui (wave 4) follow; full suite +
smokes run once those land.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…128)

web-ui: consumes mint-by-name APIs (register forms POST name, server mints
id), whoami().worker_id, /admin gate resolves admins group by reserved name
to grp_* id; renders <name> (<id>) across worker/group/experiment list+detail
and attribution fields, adds ?name= search box, member/target inputs use
opaque grammar. setup-experiment.sh: mints exp_* + per-role wkr_* + reserved
grp_* under the admin bearer, captures minted ids into .env + tokens into the
credentials dir, idempotent-on-rerun via .env; .env.example regenerated as a
minted-id artifact; compose overlays audited for opaque-id flow (#178).

Known-pending: 5 claim-driving web-ui e2e tests need auth-enabled migration;
compose healthcheck scripts need opaque-id update (finalize).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cs/skills (#128)

Migrate the 5 claim-driving web-ui e2e tests (test_admin_e2e, test_executor_e2e,
test_evaluator_e2e, test_e2e_real_subprocess) to auth-enabled mode: per-role
minted-worker bearers (the wire attributes the claimant to the bearer, and
create/accept are orchestrators/admins-gated), since the auth-disabled
'anonymous' sentinel can no longer be a registered worker_id. Update eden-manual
SKILL.md (register-by-name, read minted wkr_*, name(id) display, ?name= lookup)
and operator docs (user-guide, initial-admin-credential, agent-readonly-db,
observability) for the opaque-id/name split.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Harness now mints opaque ids and resolves stable display-name handles:
identity.py mints exp_* per scenario; wire_client carries name->minted-id
maps (worker_id_for/group_id_for/member_ref) + per-call bearer resolution;
_seed register_worker/register_group/_ensure_group/create_group post {name}
and record the handle->minted-id mapping, and task-target/member handles
resolve to minted ids. Scenarios keep readable handles; assertions on a
claimant/attribution/target id compare against wire_client.worker_id_for(handle).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Migrate claim/attribution/event-payload/reassignment + worker auth/registration
scenarios to mint semantics: claimant/attribution id assertions resolve via
wire_client.worker_id_for(handle); reserved values name-space; registration is
mint-based (no id-idempotency). Unverified pending the full-suite run at finalize.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lane client (#128)

Migrate remaining group/auth/submission/dispatch/checkpoint/lease/control-plane
scenarios + control_plane_client to mint semantics: id assertions resolve via
worker_id_for/group_id_for/member_ref; reserved values name-space (409); group
register mints grp_*; checkpoint import asserts fresh exp_* + source_experiment_id
provenance; control-plane register mints exp_*/wkr_*/grp_* with ?name= lookup.
check_citations green (260 scenarios). Unverified pending full-suite run + the
impl gaps it surfaced (invalid-name vocab, _checkpoint source_experiment_id,
control-plane InvalidName->422).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#128)

Three gaps the conformance suite exposed: (1) checkpoint import now lands
under the receiver's own minted experiment_id and stamps
imported_from.source_experiment_id with the source manifest id (provenance,
not PK reuse) — _checkpoint.py + storage test updated; (2) control-plane
server maps InvalidName -> 422 invalid-name (was 500); (3) conformance
error-vocabulary table includes eden://error/invalid-name.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aware (#128)

smoke*.sh / e2e.sh / e2e_drive.py resolve experiment/worker/group ids from
the minted values setup-experiment writes to .env (or via ?name= / opaque-id
URLs) instead of legacy typed literals. smoke-multi-orchestrator.sh mints the
second replica's per-experiment worker (staged bring-up: task-store-server
first), writes EDEN_ORCHESTRATOR_2_WORKER_ID, and asserts membership by opaque
grp_*/wkr_* ids. Verified via bash -n locally; the docker-gated compose-* CI
jobs are the live validation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CHANGELOG [Unreleased] entry for the identity rename; roadmap row flipped to
shipped 2026-06-03. Deferrals tracked as #275 (name collision soft-check),

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t_id (#128)

eden-git integrator/manifest/remote-integrator tests constructed Variant/Idea
with kebab experiment_id literals that now fail the exp_* grammar. Swapped to
valid opaque ids (no worker-attribution/target ids appear in these fixtures;
idea_id/variant_id/slug/SHA unchanged). No src change — the integrator passes
ids through from the contracts objects. 112 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… gaps (#128)

Full conformance run surfaced 17 failures. Impl: (1) ill-formed display name
now returns 422 invalid-name not 400 — register request models accept plain
str so the store's InvalidName (422) is authoritative (response models keep
DisplayName); (2) control-plane GET /experiments honors ?name= exact filter
through store + route; (3) reserved group names stay rejected (409
reserved-identifier) on a second create even under the allow_reserved seed
seam (spec §7.5 'the name is taken') so the outcome is wire-observable.
Scenarios: auth_enabled fixture mints exp_*; termination/checkpoint/lease/
holder assertions resolve handles to minted ids; group reserved test uses the
admin bearer. Impl-package suites green (789); check_citations + both
schema-parity suites green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…reserved groups (#128)

_ensure_group recorded the minted grp_* id only on a 200 create; on a 409
(the reserved group already exists — pre-seeded or imported via checkpoint
into the receiver) it left the handle unresolved, so later member-adds
addressed the group by its bare name and 404'd. Now resolves the existing
group by ?name= and records the handle->id mapping. Fixes
test_checkpoint_terminated::test_imported_terminated_experiment_rejects_create_task.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#128)

#122 (evaluatable baseline variant) merged to main while #128 was in flight;
rebasing #128 onto it left #122's new baseline/base_commit_sha test fixtures
using legacy kebab experiment ids that the opaque grammar now rejects. Swap to
valid exp_* ids (storage test_baseline_variant + test_checkpoint_storage
base_commit_sha tests, dispatch test_baseline_counts, orchestrator
test_baseline; terminated_by 'orchestrator'->'admin' actor-grammar fix). Merge
conflicts in spec/schemas/storage/contracts resolved to carry BOTH #122
(base_commit_sha, variant.kind, conditional idea_id) and #128 (opaque ids +
name); storage name-migration renumbered v7->v8 (after #122's base_commit_sha v7).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…128)

#122's new schema-parity / roundtrip / checkpoint-manifest fixtures used kebab
experiment_id='exp-1' marked should_pass=True; the #128 opaque grammar rejects
it, so schema and should_pass disagreed. Swap to a valid exp_* in the affected
experiment / baseline-variant / baseline-event accept fixtures (cases.py) and
the canonical checkpoint manifest. 263 contracts+checkpoint tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ain ${VAR}

setup-experiment's own 'docker compose build' interpolates every service's
command (compose v2 behavior). The orchestrator/web-ui --worker-id used a :?
guard, but those ids are SERVER-minted — they cannot exist until setup brings
the task-store-server up and registers the workers, which is AFTER that build.
The :? therefore aborted setup-experiment before any worker could be minted
(surfaced only by smoke.sh; the wave-3c author had no docker). Plain ${VAR}
(matching the host-worker vars) lets the bootstrap build/seed pass; setup
writes the minted ids to .env before the operator's final 'up'. base_commit_sha
keeps its :? — setup writes a valid 000..0 placeholder for it pre-build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eserved-group guard + null-vs-absent (#128)

blocking: import recovery probe in eden-wire client.py queried the source
manifest id, but a #128 import lands under the receiver's own minted id —
probe self._experiment_id and match imported_from.source_experiment_id (the
recovery-probe ladder now confirms the right landing).
major: control-plane reserved group names could be minted twice (admin
allow_reserved bypass had no 'already exists' check) — add the §7.5/§11.6
uniqueness guard to memory.py + postgres.py (mirrors the per-experiment store);
orchestrator bootstrap now suppresses ReservedIdentifier (the 'name taken' race
outcome) alongside AlreadyExists. +regression test.
major: RegisterWorkerRequest.labels / RegisterGroupRequest.members accepted
explicit null but the schemas only allow object/array — wrap with NotNone.
minor: corrected stale prose (07-wire §1.3 checkpoint-header carve-out + the
read_experiment admin-gated/either-auth contradiction; WorkerRegistration,
checkpoints.py, _checkpoint.py docstrings) for the receiver-minted-id behavior.

Review record: docs/plans/review/issue-128/impl/20260603T120000/0-review.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…periment receiver (#128)

Round 1's one major: spec 07 §14.2 / 10 §10 said an unkeyed import 'mints a
fresh exp_*', but the reference single-experiment task-store-server (correctly)
imports under its one configured experiment_id — it cannot serve a distinct
freshly-minted id. Amend both sections to cover BOTH receiver models
(multi-experiment: mint-fresh + enumerate; single-experiment: configured id +
read-back), framing the normative invariant as source-id-non-reuse +
source_experiment_id provenance, not per-import minting. Impl/client/conformance
were already consistent; this aligns the spec to them. Review record:
docs/plans/review/issue-128/impl/20260603T120000/1-review.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-import prose (#128)

Round 2 found the core §14.2/§10 paragraphs aligned but stale residue in
adjacent prose (10 §5 manifest field, 10 §11 minting/override, 07 §9 error
table + §14.2 header carve-out, 02 §2.5 source_experiment_id note, 08 §1.9
import op). Reword all checkpoint-import experiment-id prose to the dual-model
framing (receiver's own id: fresh-mint for multi-experiment, configured id for
single-experiment; invariant = source-id non-reuse + provenance). Left the
control-plane register_experiment 'mints fresh' wording (11 §2.2 / 09 §5) — it
is genuinely fresh-mint per registration. xref + markdownlint clean. Record:
docs/plans/review/issue-128/impl/20260603T120000/2-review.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ue (#128)

02-data-model.md §2.5 experiment_id row, 08-storage.md §1.9 import paragraph,
07-wire §14.3 read_experiment prose — reword the remaining 'freshly minted on
import' / 'receiver mints its own primary-key' lines to the dual-model framing
(receiver's own id; source-id non-reuse + provenance). Worker/group/
control-plane register_* 'mints fresh' wording is correct and untouched. Record:
docs/plans/review/issue-128/impl/20260603T120000/3-review.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#128)

ch10 §10 mandates source_experiment_id be stamped on every imported experiment
and the recovery probe matches on it, but it was typed optional/nullable
(experiment.py ImportProvenance, experiment.schema.json imported_from, 02 §2.5
table, 07 §14.3 JSON). Make it REQUIRED within imported_from (the imported_from
object itself stays optional — absent on natively-created experiments). The
importer (_checkpoint.py) already always stamps it, so no construction breaks;
348 contracts/checkpoint/storage tests pass. Record:
docs/plans/review/issue-128/impl/20260603T120000/4-review.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ry-probe prose (#128)

Round 5 verdict: no blocking/major — CONVERGED. Final cosmetic polish: the
imported_from recovery-probe descriptions (experiment.schema.json, 02 §2.5)
now name the full (source_experiment_id, checkpoint_exported_at) match pair
rather than just checkpoint_exported_at. Commit the round-1..5 codex-review
records under docs/plans/review/issue-128/impl/20260603T120000/.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ealt and others added 5 commits June 4, 2026 09:22
The test conftest dynamically attaches store.seeded_workers — a
test-only fixture handle that maps friendly worker names to the
opaque ids minted by #128's register_worker(). Pyright correctly
flags this since it's not a declared attribute on Store. Per-file
disable in the 10 test files that use it is the surgical fix
(test-only concern; production code stays clean).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…NGELOG MD022, complexity-gate (#128)

python-typecheck: 13 web-ui/control-plane test files use the dynamic
store._test_worker_ids fixture handle (analogue of storage's seeded_workers);
ae11c9d only covered the 10 storage files. Add the same # pyright:
reportAttributeAccessIssue=false after each module docstring (test-only).
docs-lint: blank line before the #128 CHANGELOG heading (MD022; lost in the
rebase merge of the two [Unreleased] entries). complexity-gate: extract the
duplicated admin-reissue-and-persist block from bootstrap_worker_credential
into _admin_reissue_and_persist (DRY); condense the two bootstrap docstrings
(inline comments already document each branch) — both now under the 100-line
threshold. 117 _common tests green; pyright 0 errors; docs-lint + complexity
gates clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… container reads

Root cause of the compose-* CI failures: setup-experiment's persist_token wrote
each worker's <id>.token at 0600, owned by the HOST user. The worker-host
containers read it as eden:1000; on a Linux native bind-mount the 0600
host-owned file is unreadable by eden:1000 → 'PermissionError: ...
/var/lib/eden/credentials/<wkr>.token' → orchestrator crash-loops → every
compose smoke that brings up the orchestrator fails. macOS Docker Desktop's
uid-mapping masks this, so local smoke.sh passed (false negative). Persist at
0644 instead — consistent with the already-0777 credentials dir's documented
bind-mount posture (host-setup writes, container reads). Hardened deployments
use matching uids / a secrets manager.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#147's multi-experiment smoke assumes pre-#128 caller-supplied
experiment_id model. Under #128's spec-correct minting, control-plane
and task-store mint independent exp_*s; lease-mode orchestrator drives
a non-existent task-store experiment.

Skip the job with `if: false` + clear comment + tracking #281.
Re-enable when the architectural reconciliation lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…or smoke

`smoke-multi-orchestrator.sh` asserted `orchestrators`-group membership
once, immediately after `docker compose up --wait` returned. But each
replica self-joins the group via `_ensure_orchestrators_membership`
during its startup-reconcile (after cloning the repo + reconciling
remote orphans) — which completes a few seconds AFTER the container
reports healthy. `up --wait` gates on the healthcheck, not on app-level
group membership, so the single-shot assertion raced the join and the
late joiner (usually the primary, which does more startup work) was
intermittently reported missing.

#128 widened the race window: the pre-rename self-join was a single
fixed-id `add_to_group("orchestrators", …)`; post-rename the reserved
NAME must first be resolved to its minted opaque `grp_*` id via an extra
`list_groups(name=…)` round-trip, then the add. The store has only one
`orchestrators` group (writes serialize on its single connection — no
duplicate, no lost update); the failure was purely the assert-too-early
timing, surfaced by #128 but not caused by it.

Fix: poll until both minted worker_ids appear in the group (60s
deadline), mirroring the poll-with-deadline pattern the smoke's other
stages already use (lease-holder, integrated-count). Verified the smoke
runs to PASS end-to-end locally.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ealt ealt force-pushed the impl/issue-128-disambiguate-names branch from 5abcf82 to e89a970 Compare June 4, 2026 16:37
@ealt ealt merged commit 8248258 into main Jun 4, 2026
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Disambiguate user-facing names from system-generated ids (experiments, workers, groups)

1 participant