tapgarden: large scale hardening#2153
Conversation
MintingBatch.batchState (in-memory) and the asset_minting_batches.state DB column were written by separate code paths and could diverge on failure. Several store methods that transitioned state as a side-effect (AddSproutsToBatch, CommitSignedGenesisTx, MarkBatchConfirmed) updated the DB while the in-memory mirror was only patched up later by advanceStateUntil's blanket post-step write -- or, in places, *before* the DB write was issued. Funnel every state mutation through MintingStore methods that take a *MintingBatch and advance the in-memory field only after the DB write succeeds. The public UpdateState mutator is removed (renamed to unexported setState); an exported SetStateOnDBSuccess remains as the bridge for store implementations. Two redundant in-memory writes in BatchCaretaker that either re-mirrored state already updated by a store call or got ahead of the DB are dropped.
The preceding refactor established an invariant -- the in-memory batch state may never advance unless the corresponding DB write committed -- but nothing in the test suite asserted it directly. A regression that re-introduced an in-memory write ahead of the DB call would slip past review. TestUpdateBatchStateMemoryCoherence forces UpdateBatchState to fail via a pre-cancelled context and asserts that both the in-memory batch.State() and a fresh on-disk read remain at the prior value.
fundBatch did one of two things depending on whether its workingBatch arg was nil: create a fresh batch and persist via CommitMintingBatch, or attach funding to an existing on-disk batch via CommitBatchFunding. This overload was the shape of bug that caused lightninglabs#2136 -- the wrong in-memory reference silently flipped which path ran, and downstream code that thought it was modifying batch A was in fact creating a fresh batch B. Split into three pieces with non-overlapping responsibilities. createFundedBatch derives, funds, and persists a new batch and never mutates planter state. applyFundingToBatch funds an existing on-disk batch atomically and rejects nil up front. fundPendingBatch is a thin nil-dispatch wrapper used by callers that genuinely have the create-or-update ambiguity. Start()'s recovery path calls applyFundingToBatch directly. Shared funding prep (sibling persistence and the computeFunding closure) is factored into prepareFunding.
prepAssetSeedling assigned or mutated c.pendingBatch before the DB write that justified the mutation. A failed commit left the in-memory state ahead of disk: either pointing at a batch that did not exist, or holding a seedling that the DB had never seen. The next call to QueueNewSeedling would then either trip a uniqueness check or wedge the failed seedling in memory indefinitely. Reorder both branches to persist first and mirror into memory only on success. To make the existing-batch branch separable, MintingBatch grows validateSeedling (pure) and commitSeedling (mutation); AddSeedling remains as the combined wrapper for callers that have no persistence boundary between the two. The fresh-batch branch builds entirely on a local newBatch and only assigns c.pendingBatch after CommitMintingBatch returns.
The validateSeedling/commitSeedling split exists so seedling persistence precedes in-memory mirroring. Nothing prevented a future change from re-merging the two halves or letting validateSeedling mutate the batch -- both of which would reintroduce the failure mode the split exists to prevent. TestSeedlingValidateCommitSplit asserts that validateSeedling neither mutates Seedlings nor sets SupplyCommitments on success, failure, or first-seedling paths; only commitSeedling may.
fetchDelegationKey and fetchPreCommitGroupKey each iterated pendingBatch.Seedlings and broke after the first entry, picking either the anchor directly or one that referenced it. Go's map iteration is non-deterministic, so correctness depended on an invariant ( <= 1 anchor per universe-commitment batch) enforced by validateUniCommitment elsewhere; break that invariant by one line and the loops silently picked at random. Replace both call sites with MintingBatch.uniqueAnchorSeedling, a deterministic scan that returns the single seedling with GroupAnchor == nil and errors loudly on zero or more than one match. The invariant is now asserted at the site that depends on it rather than presumed.
The new uniqueAnchorSeedling helper relies on a deterministic scan plus an exactly-one check. A regression that reintroduced break-after-first or dropped the multi-anchor check could pass review unnoticed and only fail stochastically under particular map orderings. TestUniqueAnchorSeedling exercises the five shapes that matter: anchor only; anchor plus children (run 32x to defeat incidental map ordering); multiple anchors; no anchor; empty batch. The first two assert the expected return; the last three assert the expected error.
MintingOutputKey memoized its result on first call; subsequent calls silently ignored the sibling argument. The caretaker's BatchStateCommitted branch was the only caller that exploited this, passing nil and relying on the Frozen branch having cached the real sibling. Any path that reached Committed without going through Frozen first -- a future RBF retry, a partial restart, a refactor -- would have silently computed and cached the wrong output key. Drop the cache. MintingBatch loses its memoized mintingPubKey and taprootAssetScriptRoot fields; MintingOutputKey recomputes from (BatchKey, RootAssetCommitment, sibling) on every call. The caretaker's Committed branch now loads the sibling preimage before calling MintingOutputKey and passes it explicitly; the same preimage is already used a few lines below for siblingBytes, so this is a reorder rather than a new I/O cost. The work is a single tapscript hash plus an ECC tweak -- not expensive enough to justify a memoizing getter that introduces correctness footguns.
The prior commit removed MintingOutputKey's memoization cache so the function depends genuinely on its sibling argument. Nothing in the test suite asserted that contract, and the previously-exploited "pass nil after the cache is warm" pattern would re-emerge silently if the cache were ever reintroduced. TestMintingOutputKeyPureInSibling asserts that distinct siblings produce distinct keys, that repeated calls with the same sibling are stable, and that a nil sibling produces a key distinct from any non-nil one. If the memoizing cache is ever reintroduced, both the "different siblings" and "nil is distinct" cases will fail.
The planter has long treated a single "current" pre-broadcast batch as a singleton in code, but the schema admitted plurality. Older versions of the planter could in some failure scenarios desync the in-memory slot from disk and leave two pre-broadcast rows behind -- making the in-code invariant unenforceable on a recovering daemon. Add migration 000060, a partial unique index on asset_minting_batches that forbids more than one row in BatchStatePending or BatchStateFrozen at any time. The "(1) WHERE batch_state IN (0, 1)" idiom indexes a constant expression filtered by the active-set predicate; both modernc SQLite >= 3.9 and Postgres accept it. Legacy databases that already hold duplicates fail on startup with the migration error; the operator recovery path arrives in a follow-up commit. Supply-commit test fixtures that don't exercise the planter state machine advance each batch past Frozen immediately after insert so they no longer trip the constraint. LatestMigrationVersion bumps from 59 to 60.
The DB-level singleton invariant from migration 000060 still leaves the planter with two operational gaps: at startup, against a legacy DB that survived migration but otherwise has duplicates, and on caretaker failure, where the gardener used to leave the orphaned (typically Frozen) batch behind. Two cancel-related error messages also predate the invariant and were ambiguous. Add a defensive pre-Start check that refuses to come up if more than one non-final batch is Pending or Frozen, naming the offending batch keys and pointing the operator at --repair.cancel-duplicate-batches. The gardener's BroadcastErrChan handler now cancels the failed batch in the same step, mirroring caretaker.Cancel()'s state rule (Pending/Frozen -> SeedlingCancelled, Committed -> SproutCancelled). The cancel-ambiguity error wording is rewritten to reflect the actual source (the cancel API has no batch-key parameter), and an obsolete TODO becomes an invariant statement referencing migration 000060. Tests follow: TestCheckSingletonInvariant covers the relevant shapes; finalize_with_tapscript_tree expects SeedlingCancelled on caretaker error; fund_seal_on_restart drops the now-unrepresentable two-pending scenario.
A legacy database populated by an older version of tapd may hold more than one pre-broadcast batch, in which case migration 000060 fails on startup with a unique-constraint violation and blocks the daemon indefinitely with no documented recovery path. Add a one-shot operator flag that opens the database with migrations skipped, finds all batches in BatchStatePending or BatchStateFrozen, preserves the most recent (by CreationTime), and cancels the rest by transitioning them to BatchStateSeedlingCancelled. On exit (status 0) the operator restarts tapd normally and migration 000060 applies against the now-clean database. The default posture is still "refuse to start" -- the daemon never silently mutates state on its own; the operator must explicitly invoke the repair flag. Cancellations are state transitions rather than row deletions, so cancelled batches remain on disk for later inspection.
A re-run of the minting caretaker's Confirmed branch fires SendMintEvent again for each minted asset; until now this produced a fresh supply_update_events row per re-fire because the table held no uniqueness constraint on event content. The downstream supply-commit state machine then saw the same logical mint twice and double-counted the update log. Add a deterministic 32-byte content hash column event_key on supply_update_events with hash = sha256(group_key || be32(update_type_id) || event_data), guarded by a unique index. The InsertSupplyUpdateEvent query becomes an ON CONFLICT DO NOTHING upsert so a re-insert of the same logical event is silently absorbed. Two migrations: 000061 adds the column nullable plus the unique index (NULLs are distinct in both backends, so pre-backfill rows don't collide); 000062 is a programmatic step that computes hashes for pre-existing rows (SQLite lacks a native SHA-256). After 000062 the index enforces dedup across the whole table. LatestMigrationVersion bumps from 60 to 62.
In the Committed branch of BatchCaretaker.stateStep, Wallet.ImportTaprootOutput ran after CommitSignedGenesisTx, which is the state-transition write that flips the batch from Committed to Broadcast. A crash between the two left the genesis transaction persisted (state=Broadcast) but the minting taproot output never imported into lnd's wallet. On restart, the Broadcast branch executes, which does not import the output, so lnd remained unaware of it indefinitely. Move the import to before the state-transition write. The import is already idempotent (the "already exists" error is caught and swallowed), so a crash anywhere in this branch now resumes from Committed and re-runs the sign-import-commit sequence cleanly.
The minting caretaker's Committed branch reloads an unsigned PSBT from disk on restart and re-signs it; the resulting signed bytes are implicitly compared with the prior attempt through downstream side effects. lnd uses BIP-340 RFC-6979 deterministic Schnorr nonces, so the property should hold by construction -- but it is load-bearing for restart semantics and worth checking in CI rather than assuming. Add an itest that calls Wallet.SignAndFinalizePsbt twice on the same unsigned PSBT and asserts byte-identical output.
testBasicAssetCreation pins the "restart at every observable boundary" mint case in one fixed order. A regression that only fires under a particular interleaving of restart points would slip past it and would not shrink to a small reproducer. Add a rapid property test that models the legal BatchState transition graph and samples every subset of two well-synchronized restart points (after disk-state Committed; after the Broadcast branch's publish call), asserting the mint flow always ends in BatchStateFinalized with no errors and no leftover caretakers. The harness is structured so a third restart point can be added trivially.
MintingBatch.Copy and the FundedMintAnchorPsbt.Copy it delegated to were "deep" in name only. Seedlings, AssetMetas, GenesisPacket's nested slices, and RootAssetCommitment all shared substructure with the source. RPC payloads and subscriber notifications used Copy believing they had a snapshot, when they actually held a smaller window into the same data -- callers reasoned as if they had isolation, with no actual barrier. Walk every reachable pointer, slice, map, and byte buffer and duplicate it. Helpers cover the leaf types (Genesis, GroupKey, ScriptKey, MetaReveal, etc.); RootAssetCommitment goes through commitment.TapCommitment.Copy (already exercised by TestTapCommitmentDeepCopy); FundedMintAnchorPsbt.Copy round-trips the psbt.Packet through Serialize/NewFromRawBytes, which duplicates every nested PInput / POutput / Unknown without hand-rolling per-type copies. TestMintingBatchCopyIsDeep builds a fully-populated batch, snapshots it, mutates every reachable substructure on the source, and asserts the snapshot is unmoved. Impossible error paths panic rather than fall back to shared pointers.
rapid's default of 100 iterations per Check call is gratuitous for the 4-element subset space TestCaretakerRestartRecoveryRapid explores. Combined with the now-genuinely-deep MintingBatch.Copy, the test grew from 289s to 309s on the full package suite. Cap iterations to 30 by default, which still hits every subset case roughly 7-8 times and recovers ~220s on the full package suite. Operators wanting deeper exploration can override via -rapid.checks=N; the test only overrides when the flag still carries its default value.
The cancel protocol used two shared per-caretaker channels (CancelReqChan + CancelRespChan). Matching a response to its specific request relied on the gardener serializing all cancel calls -- a discipline, not a property of the protocol itself. Replace the pair with a single CancelReqChan carrying cancelReq values, each embedding its own per-call reply channel. The binding is now intrinsic and no longer depends on caller serialization. CancelRespChan is dropped from BatchCaretakerConfig as a side-effect.
The finalize-batch state-request handler returned the caretaker's live MintingBatch pointer once BroadcastCompleteChan fired. The caretaker goroutine continues to mutate the batch after broadcast (Broadcast -> Confirmed -> Finalized writes GenesisPacket and RootAssetCommitment), so the caller's reads raced those writes. Take a deep snapshot via Batch.Copy before resolving, matching what every other state-request handler in the same select arm already does. Document the ownership invariant on BatchCaretakerConfig.Batch: the live pointer is owned by the caretaker goroutine; external observers must Copy first.
Two latent issues in the caretaker. BatchCaretaker.anchorOutputIndex was a uint32 field populated only by the Frozen state step; on any restart that re-entered the state machine at Committed, Broadcast, or Confirmed, the Frozen step never ran and downstream reads used the zero value. The bug was latent because FundPsbt(..., -1) typically leaves the asset anchor at output 0. Separately, the confirmation-watching goroutine launched in the Broadcast step read CancelReqChan in parallel with the cultivator's main loop, giving each per-call request two potential receivers; today every post-broadcast Cancel() rejects, but the per-call respCh contract is undermined if that ever changes. Remove the cached field and read AssetAnchorOutIdx directly from GenesisPacket at each use site; the value is always present once the batch has progressed past Frozen. Drop the inner CancelReqChan reads so the cultivator loop is the sole reader post-broadcast, and document the single-reader invariant on the field.
The caretaker's SignalCompletion callback did an unbuffered send on completionSignals with no select on Quit. The gardener drains that channel only from its main select; stopCaretakers runs from a defer that fires after the select has exited. If a caretaker reaches SignalCompletion in the same window that Stop closes Quit, the send blocks forever and the nested Wg.Wait inside stopCaretakers deadlocks the entire shutdown path. Wrap the send in a select on Quit so it can be abandoned. On shutdown the planter is stopping the caretaker anyway and does not need the completion notification.
InsertPendingUpdate's no-pending-transition arm created a new supply_commit_transitions row, inserted the update event, and moved the state machine to UpdatesPendingState. With the event_key UNIQUE index added in migration 000061, a re-fired event that already lives on a prior (now-finalized) transition is silently absorbed by ON CONFLICT DO NOTHING -- leaving the freshly-created transition with no events to commit and the state machine stuck in UpdatesPending with nothing pending. Switch InsertSupplyUpdateEvent to :execrows so the caller can distinguish "new row" (1) from "dedup" (0). When the no-pending-transition arm sees 0 rows affected, return an internal sentinel error from the ExecTx closure to roll the whole tx back; the outer wrapper translates the sentinel into a successful nil for the caller (the event is already in the log, which is what they wanted). The pending-exists arm is unchanged: dedup there is a benign no-op against the existing transition. TestSupplyCommitInsertPendingUpdateRefiredAfterFinalize drives an event through to a finalized transition, re-fires it, and asserts no orphan pending transition lands.
ChainBridge, WalletAnchor, KeyRing, and GroupFetcher describe tapd's view of its node-side dependencies (chain backend, wallet, key ring, group lookups), but they lived in tapgarden/interface.go. They were consumed by tapfreighter, tapchannel, tapconfig, lndservices, universe/supplycommit, universe/supplyverifier, and backup -- which made the tapgarden home a documented accident of authorship rather than a property of what the package is. Introduce a tapnode package that houses the four interfaces alone. Concrete implementations remain in lndservices; the rest of the tree imports tapnode for the interface only. tapgarden retains the minting-specific symbols (MintingStore, MintSupplyCommitter, Planter, BatchState, FundBatchResp, etc.). No behavior change.
copyMetaReveal collapsed a non-nil, empty UnknownOddTypes map to nil
because the guard was len(...) > 0. reflect.DeepEqual treats {} and
nil as distinct for maps, so AssertCopyEqual flagged the discrepancy
whenever FillFakeData produced an empty (non-nil) map. The bug was
latent because the random fixture only exercised the shape some
fraction of the time; reordering of init() in unrelated test work
shifted the rand state enough to surface it consistently.
Switch the guard to != nil so an empty source produces an empty
copy.
MockChainBridge, MockWalletAnchor, MockKeyRing, and GenMockGroupVerifier implement the tapnode interfaces but lived in tapgarden/mock.go. They were consumed across tapgarden, tapfreighter, and universe/supplyverifier tests; the tapgarden home was an accident of original authorship. Move them into tapnode/tapnodemock as ChainBridge, WalletAnchor, KeyRing, and GenGroupVerifier (the package-name-prefix stutter is dropped on the way through). The NewGenesisTx / FundGenesisTx helpers that lived alongside MockWalletAnchor move with it. tapnode/tapnodemock is a subpackage of tapnode so production code cannot transitively depend on the mocks; tests opt in by importing the subpackage explicitly.
The re-org watcher guards proof integrity in the face of chain re-orgs, which is unrelated to growing seedlings into assets. It lived in tapgarden only because that was the package that first happened to need it. Move tapgarden/re-org_watcher.go to tapreorg/watcher.go, renaming ReOrgWatcher -> Watcher, ReOrgWatcherConfig -> Config, and NewReOrgWatcher -> NewWatcher to drop the package-name stutter. The new package owns its own DefaultTimeout and logger (subsystem "RORG"). Drop a dead pre-overwrite assignment in DefaultUpdateCallback while we're here. tapconfig, tapcfg/server.go, and the root-level log.go are updated for the new symbols.
The Custodian receives assets transferred to this node; the Planter mints new ones. Keeping them in the same package forced every consumer that only cared about receipt to import the minting code, and conflated the two responsibilities at the package boundary. Hoist the Custodian, its config (renamed Config, dropping the type stutter), its event surface (AssetReceiveEvent, AddrImportErrEvent, AddrImportCompleteEvent), its helpers (AddrMatchesAsset, EventMatchesProof, WaitForAddrImport, AddrImportStatus), and MockAssetSyncer into a new tapcustody package. Subsystem CSTD. GenHeaderVerifier, needed by caretaker, planter, and custodian plus tapfreighter, supplyverifier, rpcserver, and tapcfg, moves to tapnode alongside the ChainBridge interface it wraps -- a node-level concern, not a tapgarden one.
The gardener's request protocol shipped (closure, reply channel) pairs to the main loop indirectly: a stateRequest interface, two generic structs (stateReq[T], stateParamReq[T,S]), a reqType enum with seven constants, a typedParam runtime type assertion, and a seven-arm dispatch switch -- all just to encode which method the caller wanted to invoke. Replace it with the only thing that machinery was approximating: a chan of plain func(). Each public method builds an inline closure that captures its own parameters and response channel; the loop invokes whatever closure arrives. A small dispatchStateReq helper handles the send-or-quit + receive-or-quit boilerplate, parameterised by the response type. The wait side gains a small semantic improvement: a FinalizeBatch broadcast wait now unblocks on Quit instead of hanging on a never-resolved response channel.
The copyX helpers introduced for MintingBatch.Copy lived in tapgarden but each described how to clone an asset or proof type; they sat in the wrong package for the symbol they named. Move them onto their natural types: Copy methods on asset.Genesis, asset.GroupKey, asset.AssetGroup, asset.ExternalKey, asset.TweakedScriptKey, asset.ScriptKey, and proof.MetaReveal; free helpers asset.CopyKeyDescriptor and asset.CopyPubKey for the foreign keychain / btcec types we cannot add methods to. PreCommitmentOutput is tapgarden-local so it stays put as a Copy method on the type. tapgarden/batch.go shrinks by ~120 lines; seedling.go drops copyExternalKey; call sites read more straightforwardly (s.Meta.Copy() vs copyMetaReveal(s.Meta)). asset.Asset.Copy is intentionally left alone -- it has its own inline copy logic and its callers are out of scope.
MintingStore had grown to seventeen methods on a single interface, mixing the lifecycle of a minting batch (state transitions, sprouts, signed genesis tx, confirmation) with read-only reference lookups (script keys, asset metas, group keys, delegation keys). The name promised "the place where minting batches are stored"; the methods delivered an inventory. Split into BatchStore (fourteen lifecycle methods) and MintingRefReader (four read-only reference lookups). GardenKit.Log becomes GardenKit.BatchStore; a MintingRefs MintingRefReader field is added for the read-only lookups. The free helpers freezeMintingBatch, fetchFinalizedBatch, and listBatches now take the narrower interfaces they actually need. The single concrete *tapdb.AssetMintingStore satisfies both interfaces. The embedded asset.TapscriptTreeManager is dropped from the umbrella in favor of TreeStore at the point of injection; the unused FetchGroupByGenesis is dropped from the tapgarden-facing surface but kept on the tapdb struct where it's still consumed internally. NewFallibleTapscriptTreeMgr now takes asset.TapscriptTreeManager directly.
The caretaker is constructed by the planter from the same kit, but BatchCaretakerConfig embedded GardenKit by value -- duplicating the planter's kit per caretaker for no functional reason. Embed *GardenKit instead so the caretaker reads the planter's kit directly. The kit is populated once at planter construction and not mutated thereafter, so sharing by reference is safe. Promoted field access (b.cfg.BatchStore, b.cfg.ChainBridge, ...) is unchanged: Go promotes fields through pointer embedding.
The FundBatchResp wrapper added nothing beyond the *VerboseBatch it held. Drop the wrapper and return *VerboseBatch directly so the type names what it is.
Five MintingState values were declared but only MintingStateSeed was ever emitted, in a single SeedlingUpdate site. The other four were declared but never produced. Success or failure of a seedling add is already conveyed by Error and PendingBatch on the update. Drop the enum and the NewState field rather than continue exposing a transition surface the package does not actually track.
A long-standing TODO at NewBatchCaretaker asked for this rename; the package's gardening metaphor (Seedling, Sprout, Planter) fits Cultivator and not Caretaker. Rename the type, its config, its constructor, the planter's cultivator field and helpers, and the file itself. Log strings and local variables follow.
PendingAssetGroup embedded asset.GroupKeyRequest and asset.GroupVirtualTx anonymously, hiding the fact that the struct is a pair: a request together with the virtual tx that fulfils it. Replace the silent embeddings with named KeyRequest and VirtualTx fields and update the call sites that relied on field promotion.
AssetMintEvent carried both an explicit BatchState field and a Batch pointer whose Batch.State() might lag, and a comment frankly admitted the contradiction by giving BatchState precedence. Drop the field and make the constructors mirror the just-executed state into the copied batch so Batch.State() always reports it. The RPC marshaler now reads Batch.State(); the wire-level mintrpc.MintEvent.batch_state is unchanged.
UpdateTapSibling is exported only because Go visibility forces it across the tapgarden/tapdb boundary; its only legitimate callers are BatchStore implementations. The contract lived only in the call graph. Mirror SetStateOnDBSuccess's doc onto UpdateTapSibling so the contract is in the source.
ChainPlanter was the only implementation; no mocks exist anywhere in the repo, and tests already hold *ChainPlanter directly. The interface added abstraction without any second implementation to abstract over. Replace the two config-field types that held the interface with *tapgarden.ChainPlanter, delete the interface and its compile-time assertion, and strip the stale "NOTE: This is part of the Planter interface" comments. If a fake minter is ever needed, the interface can be reintroduced from the concrete type's actual surface.
tapdb's persistence path for the supply pre-commit row was coupled to tapgarden's FundedMintAnchorPsbt.PreCommitmentOutput. Extracting the supply-commit code into its own package would require breaking that coupling first. Introduce tapgarden.PreCommitBindData, a typed persistence payload. The five BatchStore methods that bind a funded genesis transaction or seal a batch -- CommitMintingBatch, CommitBatchTx, CommitBatchFunding, AddSproutsToBatch, SealBatch -- gain an fn.Option[PreCommitBindData] parameter that tapdb writes alongside its usual columns. For now the planter constructs the payload by lifting it off FundedMintAnchorPsbt.PreCommitmentOutput via a new PreCommitBindData helper; the same SQL writes happen in the same transactions. Two inner helpers (upsertPreCommit + insertMintAnchorTx) merge into a single upsertPreCommitRow that takes the typed payload plus the genesis tx hash.
The supply-commit code needs to read mint_supply_pre_commits rows (for intake checks and verifier lookups) but importing all of AssetMintingStore just for that surface drags every other minting concern along with it. Create a dedicated SupplyPreCommitStore backed by the same db handle as AssetMintingStore, exposing the mint_supply_pre_commits reads (currently just FetchDelegationKey) that the supply-commit code needs. AssetMintingStore.FetchDelegationKey becomes a thin shim that delegates to it; the shim survives for now because the MintingRefReader interface in tapgarden still has the method, and that method is removed in a later commit once the augmenter takes over delegation-key prep. The write side stays on AssetMintingStore because pre-commit rows must land atomically with the batch's chain update, which only the binding tx in AssetMintingStore can offer.
tapgarden's minting flow has supply-commit logic hardcoded throughout (seedling delegation-key prep, universe-commitment intake validation, pre-commitment output stamping, PreCommitBindData production, post-confirmation events). The two concerns are entangled across the planter and cultivator, and the supply-commit code cannot move to its own package until tapgarden's call sites stop reaching directly into it. Introduce tapgarden.GenesisTxAugmenter, an interface tapgarden will invoke at well-defined lifecycle moments to let an external participant act on batch minting without tapgarden knowing the details. Six hooks: PrepareSeedling and ValidateSeedling for intake; ExtraOutputs and PostFund for genesis-tx contributions; BindData for the typed persistence payload; OnBatchConfirmed for downstream events. A NoOpAugmenter is provided so call sites don't nil-check. universe/supplycommit gains a concrete GenesisAugmenter that captures every prior supply-commit-specific tapgarden path in one place. The augmenter is not yet wired into GardenKit; that happens in the following commit.
With GenesisTxAugmenter defined but unwired, tapgarden still
executed the legacy supply-commit code paths directly. The
indirection only earns its keep once production paths route through
it.
Wire the augmenter through GardenKit and replace the planter's
direct supply-commit references with augmenter calls at every
lifecycle moment: seedling intake goes through PrepareSeedling
instead of prepSeedlingDelegationKey; the intake gate through
ValidateSeedling instead of validateUniCommitment plus
validateDelegationKey; funding consults ExtraOutputs to splice
additional anchor outputs and PostFund to stamp PSBT metadata;
persistence payloads at funding and seal come from BindData;
post-confirmation events go through OnBatchConfirmed. The planter
and cultivator each gain an augmenter() helper that returns the
configured augmenter or NoOpAugmenter{}, so call sites never
nil-check. MintingBatch.validateGroupAnchor is exported as
ValidateGroupAnchor so the augmenter can invoke it.
tapgarden/mock.go gets a local mockSupplyCommitAugmenter that
mirrors the real one without the dependency cycle, so existing
tests still get fully-formed funded batches.
With GenesisTxAugmenter wired through and tapgarden's production paths routed through it, the legacy supply-commit types and helpers in tapgarden are dead code. Delete them: MintSupplyCommitter and the matching GardenKit fields (MintSupplyCommitter, DelegationKeyChecker); PreCommitmentOutput, its Copy method, NewPreCommitmentOutput, and the PreCommitmentOutput field on FundedMintAnchorPsbt; PreCommitTxOut (moved to supplycommit and exported there); the DelegationKey alias; fetchDelegationKey / fetchPreCommitGroupKey; prepSeedlingDelegationKey; sealBatchPreCommit; the validateUniCommitment / validateDelegationKey methods on MintingBatch; the sendSupplyCommitEvents Cultivator method; FetchDelegationKey on MintingRefReader; the AssetMintingStore.FetchDelegationKey shim and PreCommitmentOutput reconstruction in marshalMintingBatch; the PreCommitOutIdx tracking in AnchorTxOutputIndexes. tapconfig wires a real supplycommit.GenesisAugmenter into GardenKit. The obsolete supply_commit_test.go is replaced by a unit test on the augmenter at universe/supplycommit/genesis_augmenter_test.go; TestValidateUniCommitment is removed since the invariants now live on the augmenter. tapgarden/mock.go gains a MockBindDataForBatch helper so tapdb tests can derive a typed payload without going through the deleted method.
The supply-commit extraction moved PreCommitTxOut into supplycommit; two test files now reach into supplycommit for it and no longer reference tapgarden. go vet caught these once the unit suite ran end-to-end. Drop the unused imports.
GenGroupVerifier, GenGroupAnchorVerifier, and GenRawGroupAnchorVerifier construct closures over tapnode.GroupFetcher that feed proof.VerifierCtx; they sit naturally alongside tapnode.GenHeaderVerifier rather than inside the minting state machine. Verifier consumption is downstream of tapgarden's end (a verifiable asset in the local store); the generators belong to the proof-verifier code, not the minting one. Move the three generators to tapnode along with ErrGroupKeyUnknown, ErrGenesisNotGroupAnchor, and the LRU cache helpers. Call sites are updated accordingly.
The cultivator's batch-confirmation handler assembled universe.Item values inline (storeMintingProof returned both the proof blob and the item) and streamed them to a local universe via UpsertProofLeafBatch. The planter's re-org handler had its own UpsertProofLeaf loop with the same wire-shape construction. Both sites encode universe publication, which is downstream of tapgarden's natural end (a verifiable asset in the local store). Introduce tapgarden.MintProofPublisher, owned by the consumer, with PublishMintBatch and PublishMintProofUpdates. The universe/mintpublish package provides the BatchRegistrar-backed implementation. tapgarden now passes raw (asset, proof, mintTxHash, outIdx) inputs and never constructs a universe.Item, BaseLeafKey, Identifier, Leaf, or GenesisWithGroup itself. GardenKit loses the Universe and UniversePushBatchSize fields in favor of a single MintProofPublisher field; NoOpMintProofPublisher covers test configurations with no downstream universe.
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a large-scale hardening of the tapgarden minting subsystem. The changes focus on ensuring structural correctness and restart-idempotence for minting batches. By enforcing strict singleton invariants on pending batches, ensuring atomic persistence, and implementing deterministic state transitions, the system is now significantly more resilient to crashes and re-orgs. The refactor also improves code modularity by extracting node-level service interfaces and dedicated subsystems into their own packages. Highlights
New Features🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request refactors the tapd codebase by modularizing substrate dependencies into the tapnode package and separating concerns into tapreorg, tapcustody, and mintpublish. It also introduces database constraints to enforce a singleton pre-broadcast batch invariant, adds a repair tool to resolve legacy database violations, and implements deduplication for supply update events. The review feedback highlights a critical migration failure risk where the programmatic backfill in migration 62 could fail on legacy databases containing duplicate events. Additionally, feedback points out a shallow copy issue in GroupKey.Copy(), potential nil pointer dereferences in UpdateBatchState and Cultivator.Cancel, and a potential infinite loop in PublishMintBatch if batchSize is non-positive.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
The bug this PR closes -- restart-triggered re-fires of mint events inserting duplicate supply_update_events rows -- could have already fired against an unupgraded database. Two rows with identical content hash to the same event_key, so the second SetSupplyUpdateEventKey in the migration 62 backfill would violate the unique index added in migration 61 and abort the migration. Track hashes seen during the backfill loop and delete any row whose hash a prior row already claimed. The deleted rows are by definition the same logical event as the one that survives. A new DeleteSupplyUpdateEvent sqlc query supports the delete-by-event_id that the dedupe needs. TestMigration62BackfillDedupesLegacyDuplicates seeds three identical rows pre-backfill and asserts the migration leaves exactly one with the expected hash, never erroring out.
Address golangci-lint findings (lll, gofmt, whitespace, exhaustive, gosec, govet copylocks) and two reviewer comments: use uint for mintpublish's batchSize so the negative-value case is ruled out at the type boundary, and use the already-computed batchKey in the Cultivator.Cancel default arm instead of re-deriving it from the live batch. The copylocks fixes switch the augmenter staging-batch construction from `*batch` to `batch.Copy()` so the atomic state field isn't copied by value. The exhaustive cases get explicit default arms on the BatchState switches in checkSingletonInvariant and RunRepairTool. Rename Migration62BackfillSupplyUpdateEventKeys to Migration62BackfillEventKeys so the aligned map entry fits the 80-column limit. Make TestMigration62BackfillSupplyUpdateEventKeys and TestMigration62BackfillDedupesLegacyDuplicates portable to Postgres: the first one used the SQLite-only `?` placeholder, the second used MAX(bytea) which isn't defined in Postgres. Switch to sqlc's InsertSupplyUpdateEvent and split COUNT + key lookup into two queries. Regenerate sqlc bindings whose source-comment references the event_key column's migration number. Add a [repair] section to sample-tapd.conf with an example for repair.cancel-duplicate-batches; the sample-conf check requires every tapd flag to have a default or example in the file.
7f01e15 to
f21a6d7
Compare
Ok, hang on, hear me out..
I can hear you now. 47 commits? +6500/-4000?! April 1st was like two months ago, what gives?
This is the result of a general tapgarden refactor I embarked on after making #2141. That was sort of the straw that broke the camel's back; by the time I'd made that PR, I had literally lost count of the number of problems I'd encountered and fixed in tapgarden. So, I decided to investigate more deeply, by way of a holistic analysis, why the package has historically been so buggy.
The result is probably best described by summarizing what problems, previously possible, are now made structurally impossible, and by what broad change each was accomplished. So, a summary, ctsy Opus:
Persistence atomicity. Previously it was possible for an in-memory batch state to advance while the corresponding database write failed, leaving memory and disk disagreeing about what state the batch was in; now, because every state mutation flows through a store method that touches memory only after the database commit succeeds, the two views cannot drift.
Funding overload. Previously it was possible for a fund-batch call to silently create a new batch when the caller meant to attach funding to an existing one (the bug behind [bug]: Mint batch can get stuck in FROZEN state after funding failure #2136); now, because "create" and "attach" are separate functions with non-overlapping signatures, the wrong intent will not type-check.
Singleton invariant. Previously it was possible for two unbroadcast minting batches to coexist on disk and block all subsequent minting; now, because the schema itself forbids more than one row in the pre-broadcast subset, that state cannot exist.
Determinism. Previously it was possible for the planter to pick a different "anchor seedling" on different runs of the same input because the choice depended on Go map iteration order; now, because the lookup scans deterministically and refuses to return when the singleton invariant fails, the choice is fixed.
Purity in output-key derivation. Previously it was possible for MintingOutputKey to return a cached value that didn't reflect the sibling argument the caller actually passed; now, because the function holds no state, its output is determined entirely by its inputs.
Restart-idempotence of side effects. Previously it was possible for a crash between persisting a batch as broadcast and importing its taproot output to leave the on-chain output unknown to the wallet forever; now, because import precedes the state-transition write, on-disk state can never outpace the side effect it commits.
Event deduplication. Previously it was possible for a restart-triggered re-fire of a mint event to log the same logical event twice and double-count the supply-commit update stream; now, because each event is keyed by a content hash with a uniqueness constraint, the second insert is silently absorbed by the schema.
No orphaned state-machine transitions. Previously it was possible for a deduplicated event to leave an empty supply-commit transition wedged in UpdatesPending; now, because the insert transaction rolls back whenever it detects a no-op dedupe, no empty transition can ever land.
Snapshot isolation. Previously it was possible for a subscriber holding a "snapshot" of a batch to share substructure with the live caretaker and observe mid-flight mutations through it; now, because copies are genuinely deep, snapshots are real isolation barriers.
Request/response binding. Previously it was possible for a cancel response to land on a different cancel request than the one that produced it, because the matching depended on a serialization discipline rather than the protocol itself; now, because each request carries its own reply channel, the binding is intrinsic.
Shutdown safety. Previously it was possible for a caretaker reaching completion in the same instant the planter was stopping to deadlock the entire shutdown path; now, because the completion send is abandonable on Quit, no notification can outlive its receiver.
Of the net ~2500 LoC added, roughly 900 lines are in added tests (mostly state machine property tests, IIRC), 800 are around reorganization (stuff that moves various 'complecting' machinery into other/new packages), and the rest is mostly in interface scaffolding + singleton-invariant/dedup/repair-flag machinery (the repair flag being a safety mechanism for old databases that disobeyed the implicit singleton invariant to upgrade).
So yes, It's huge, but: it passes our tests, plus new ones, and IMO enough LLM-enhanced eyes on it should be able to vet it appropriately.