Skip to content

tapgarden: large scale hardening#2153

Open
jtobin wants to merge 50 commits into
lightninglabs:mainfrom
jtobin:tapgarden-hardening
Open

tapgarden: large scale hardening#2153
jtobin wants to merge 50 commits into
lightninglabs:mainfrom
jtobin:tapgarden-hardening

Conversation

@jtobin

@jtobin jtobin commented Jun 4, 2026

Copy link
Copy Markdown
Member

Ok, hang on, hear me out..

I can hear you now. 47 commits? +6500/-4000?! April 1st was like two months ago, what gives?

This is the result of a general tapgarden refactor I embarked on after making #2141. That was sort of the straw that broke the camel's back; by the time I'd made that PR, I had literally lost count of the number of problems I'd encountered and fixed in tapgarden. So, I decided to investigate more deeply, by way of a holistic analysis, why the package has historically been so buggy.

The result is probably best described by summarizing what problems, previously possible, are now made structurally impossible, and by what broad change each was accomplished. So, a summary, ctsy Opus:

  • Persistence atomicity. Previously it was possible for an in-memory batch state to advance while the corresponding database write failed, leaving memory and disk disagreeing about what state the batch was in; now, because every state mutation flows through a store method that touches memory only after the database commit succeeds, the two views cannot drift.

  • Funding overload. Previously it was possible for a fund-batch call to silently create a new batch when the caller meant to attach funding to an existing one (the bug behind [bug]: Mint batch can get stuck in FROZEN state after funding failure #2136); now, because "create" and "attach" are separate functions with non-overlapping signatures, the wrong intent will not type-check.

  • Singleton invariant. Previously it was possible for two unbroadcast minting batches to coexist on disk and block all subsequent minting; now, because the schema itself forbids more than one row in the pre-broadcast subset, that state cannot exist.

  • Determinism. Previously it was possible for the planter to pick a different "anchor seedling" on different runs of the same input because the choice depended on Go map iteration order; now, because the lookup scans deterministically and refuses to return when the singleton invariant fails, the choice is fixed.

  • Purity in output-key derivation. Previously it was possible for MintingOutputKey to return a cached value that didn't reflect the sibling argument the caller actually passed; now, because the function holds no state, its output is determined entirely by its inputs.

  • Restart-idempotence of side effects. Previously it was possible for a crash between persisting a batch as broadcast and importing its taproot output to leave the on-chain output unknown to the wallet forever; now, because import precedes the state-transition write, on-disk state can never outpace the side effect it commits.

  • Event deduplication. Previously it was possible for a restart-triggered re-fire of a mint event to log the same logical event twice and double-count the supply-commit update stream; now, because each event is keyed by a content hash with a uniqueness constraint, the second insert is silently absorbed by the schema.

  • No orphaned state-machine transitions. Previously it was possible for a deduplicated event to leave an empty supply-commit transition wedged in UpdatesPending; now, because the insert transaction rolls back whenever it detects a no-op dedupe, no empty transition can ever land.

  • Snapshot isolation. Previously it was possible for a subscriber holding a "snapshot" of a batch to share substructure with the live caretaker and observe mid-flight mutations through it; now, because copies are genuinely deep, snapshots are real isolation barriers.

  • Request/response binding. Previously it was possible for a cancel response to land on a different cancel request than the one that produced it, because the matching depended on a serialization discipline rather than the protocol itself; now, because each request carries its own reply channel, the binding is intrinsic.

  • Shutdown safety. Previously it was possible for a caretaker reaching completion in the same instant the planter was stopping to deadlock the entire shutdown path; now, because the completion send is abandonable on Quit, no notification can outlive its receiver.

Of the net ~2500 LoC added, roughly 900 lines are in added tests (mostly state machine property tests, IIRC), 800 are around reorganization (stuff that moves various 'complecting' machinery into other/new packages), and the rest is mostly in interface scaffolding + singleton-invariant/dedup/repair-flag machinery (the repair flag being a safety mechanism for old databases that disobeyed the implicit singleton invariant to upgrade).

So yes, It's huge, but: it passes our tests, plus new ones, and IMO enough LLM-enhanced eyes on it should be able to vet it appropriately.

jtobin added 30 commits June 4, 2026 11:16
MintingBatch.batchState (in-memory) and the asset_minting_batches.state
DB column were written by separate code paths and could diverge on
failure. Several store methods that transitioned state as a side-effect
(AddSproutsToBatch, CommitSignedGenesisTx, MarkBatchConfirmed) updated
the DB while the in-memory mirror was only patched up later by
advanceStateUntil's blanket post-step write -- or, in places, *before*
the DB write was issued.

Funnel every state mutation through MintingStore methods that take a
*MintingBatch and advance the in-memory field only after the DB write
succeeds. The public UpdateState mutator is removed (renamed to
unexported setState); an exported SetStateOnDBSuccess remains as the
bridge for store implementations. Two redundant in-memory writes in
BatchCaretaker that either re-mirrored state already updated by a store
call or got ahead of the DB are dropped.
The preceding refactor established an invariant -- the in-memory batch
state may never advance unless the corresponding DB write committed --
but nothing in the test suite asserted it directly. A regression that
re-introduced an in-memory write ahead of the DB call would slip past
review.

TestUpdateBatchStateMemoryCoherence forces UpdateBatchState to fail via
a pre-cancelled context and asserts that both the in-memory
batch.State() and a fresh on-disk read remain at the prior value.
fundBatch did one of two things depending on whether its workingBatch
arg was nil: create a fresh batch and persist via CommitMintingBatch,
or attach funding to an existing on-disk batch via CommitBatchFunding.
This overload was the shape of bug that caused lightninglabs#2136 -- the wrong
in-memory reference silently flipped which path ran, and downstream
code that thought it was modifying batch A was in fact creating a
fresh batch B.

Split into three pieces with non-overlapping responsibilities.
createFundedBatch derives, funds, and persists a new batch and never
mutates planter state. applyFundingToBatch funds an existing on-disk
batch atomically and rejects nil up front. fundPendingBatch is a thin
nil-dispatch wrapper used by callers that genuinely have the
create-or-update ambiguity. Start()'s recovery path calls
applyFundingToBatch directly. Shared funding prep (sibling persistence
and the computeFunding closure) is factored into prepareFunding.
prepAssetSeedling assigned or mutated c.pendingBatch before the DB
write that justified the mutation. A failed commit left the in-memory
state ahead of disk: either pointing at a batch that did not exist, or
holding a seedling that the DB had never seen. The next call to
QueueNewSeedling would then either trip a uniqueness check or wedge
the failed seedling in memory indefinitely.

Reorder both branches to persist first and mirror into memory only on
success. To make the existing-batch branch separable, MintingBatch
grows validateSeedling (pure) and commitSeedling (mutation);
AddSeedling remains as the combined wrapper for callers that have no
persistence boundary between the two. The fresh-batch branch builds
entirely on a local newBatch and only assigns c.pendingBatch after
CommitMintingBatch returns.
The validateSeedling/commitSeedling split exists so seedling
persistence precedes in-memory mirroring. Nothing prevented a future
change from re-merging the two halves or letting validateSeedling
mutate the batch -- both of which would reintroduce the failure mode
the split exists to prevent.

TestSeedlingValidateCommitSplit asserts that validateSeedling neither
mutates Seedlings nor sets SupplyCommitments on success, failure, or
first-seedling paths; only commitSeedling may.
fetchDelegationKey and fetchPreCommitGroupKey each iterated
pendingBatch.Seedlings and broke after the first entry, picking either
the anchor directly or one that referenced it. Go's map iteration is
non-deterministic, so correctness depended on an invariant ( <= 1
anchor per universe-commitment batch) enforced by validateUniCommitment
elsewhere; break that invariant by one line and the loops silently
picked at random.

Replace both call sites with MintingBatch.uniqueAnchorSeedling, a
deterministic scan that returns the single seedling with GroupAnchor
== nil and errors loudly on zero or more than one match. The invariant
is now asserted at the site that depends on it rather than presumed.
The new uniqueAnchorSeedling helper relies on a deterministic scan
plus an exactly-one check. A regression that reintroduced
break-after-first or dropped the multi-anchor check could pass review
unnoticed and only fail stochastically under particular map orderings.

TestUniqueAnchorSeedling exercises the five shapes that matter: anchor
only; anchor plus children (run 32x to defeat incidental map ordering);
multiple anchors; no anchor; empty batch. The first two assert the
expected return; the last three assert the expected error.
MintingOutputKey memoized its result on first call; subsequent calls
silently ignored the sibling argument. The caretaker's
BatchStateCommitted branch was the only caller that exploited this,
passing nil and relying on the Frozen branch having cached the real
sibling. Any path that reached Committed without going through Frozen
first -- a future RBF retry, a partial restart, a refactor -- would
have silently computed and cached the wrong output key.

Drop the cache. MintingBatch loses its memoized mintingPubKey and
taprootAssetScriptRoot fields; MintingOutputKey recomputes from
(BatchKey, RootAssetCommitment, sibling) on every call. The caretaker's
Committed branch now loads the sibling preimage before calling
MintingOutputKey and passes it explicitly; the same preimage is already
used a few lines below for siblingBytes, so this is a reorder rather
than a new I/O cost. The work is a single tapscript hash plus an ECC
tweak -- not expensive enough to justify a memoizing getter that
introduces correctness footguns.
The prior commit removed MintingOutputKey's memoization cache so the
function depends genuinely on its sibling argument. Nothing in the
test suite asserted that contract, and the previously-exploited
"pass nil after the cache is warm" pattern would re-emerge silently
if the cache were ever reintroduced.

TestMintingOutputKeyPureInSibling asserts that distinct siblings
produce distinct keys, that repeated calls with the same sibling are
stable, and that a nil sibling produces a key distinct from any
non-nil one. If the memoizing cache is ever reintroduced, both the
"different siblings" and "nil is distinct" cases will fail.
The planter has long treated a single "current" pre-broadcast batch
as a singleton in code, but the schema admitted plurality. Older
versions of the planter could in some failure scenarios desync the
in-memory slot from disk and leave two pre-broadcast rows behind --
making the in-code invariant unenforceable on a recovering daemon.

Add migration 000060, a partial unique index on asset_minting_batches
that forbids more than one row in BatchStatePending or BatchStateFrozen
at any time. The "(1) WHERE batch_state IN (0, 1)" idiom indexes a
constant expression filtered by the active-set predicate; both
modernc SQLite >= 3.9 and Postgres accept it. Legacy databases that
already hold duplicates fail on startup with the migration error; the
operator recovery path arrives in a follow-up commit. Supply-commit
test fixtures that don't exercise the planter state machine advance
each batch past Frozen immediately after insert so they no longer
trip the constraint. LatestMigrationVersion bumps from 59 to 60.
The DB-level singleton invariant from migration 000060 still leaves
the planter with two operational gaps: at startup, against a legacy
DB that survived migration but otherwise has duplicates, and on
caretaker failure, where the gardener used to leave the orphaned
(typically Frozen) batch behind. Two cancel-related error messages
also predate the invariant and were ambiguous.

Add a defensive pre-Start check that refuses to come up if more than
one non-final batch is Pending or Frozen, naming the offending batch
keys and pointing the operator at --repair.cancel-duplicate-batches.
The gardener's BroadcastErrChan handler now cancels the failed batch
in the same step, mirroring caretaker.Cancel()'s state rule
(Pending/Frozen -> SeedlingCancelled, Committed -> SproutCancelled).
The cancel-ambiguity error wording is rewritten to reflect the actual
source (the cancel API has no batch-key parameter), and an obsolete
TODO becomes an invariant statement referencing migration 000060.
Tests follow: TestCheckSingletonInvariant covers the relevant shapes;
finalize_with_tapscript_tree expects SeedlingCancelled on caretaker
error; fund_seal_on_restart drops the now-unrepresentable two-pending
scenario.
A legacy database populated by an older version of tapd may hold
more than one pre-broadcast batch, in which case migration 000060
fails on startup with a unique-constraint violation and blocks the
daemon indefinitely with no documented recovery path.

Add a one-shot operator flag that opens the database with migrations
skipped, finds all batches in BatchStatePending or BatchStateFrozen,
preserves the most recent (by CreationTime), and cancels the rest by
transitioning them to BatchStateSeedlingCancelled. On exit (status 0)
the operator restarts tapd normally and migration 000060 applies
against the now-clean database. The default posture is still "refuse
to start" -- the daemon never silently mutates state on its own; the
operator must explicitly invoke the repair flag. Cancellations are
state transitions rather than row deletions, so cancelled batches
remain on disk for later inspection.
A re-run of the minting caretaker's Confirmed branch fires
SendMintEvent again for each minted asset; until now this produced a
fresh supply_update_events row per re-fire because the table held no
uniqueness constraint on event content. The downstream supply-commit
state machine then saw the same logical mint twice and double-counted
the update log.

Add a deterministic 32-byte content hash column event_key on
supply_update_events with hash = sha256(group_key
|| be32(update_type_id) || event_data), guarded by a unique index. The
InsertSupplyUpdateEvent query becomes an ON CONFLICT DO NOTHING upsert
so a re-insert of the same logical event is silently absorbed. Two
migrations: 000061 adds the column nullable plus the unique index
(NULLs are distinct in both backends, so pre-backfill rows don't
collide); 000062 is a programmatic step that computes hashes for
pre-existing rows (SQLite lacks a native SHA-256). After 000062 the
index enforces dedup across the whole table. LatestMigrationVersion
bumps from 60 to 62.
In the Committed branch of BatchCaretaker.stateStep,
Wallet.ImportTaprootOutput ran after CommitSignedGenesisTx, which is
the state-transition write that flips the batch from Committed to
Broadcast. A crash between the two left the genesis transaction
persisted (state=Broadcast) but the minting taproot output never
imported into lnd's wallet. On restart, the Broadcast branch executes,
which does not import the output, so lnd remained unaware of it
indefinitely.

Move the import to before the state-transition write. The import is
already idempotent (the "already exists" error is caught and
swallowed), so a crash anywhere in this branch now resumes from
Committed and re-runs the sign-import-commit sequence cleanly.
The minting caretaker's Committed branch reloads an unsigned PSBT
from disk on restart and re-signs it; the resulting signed bytes are
implicitly compared with the prior attempt through downstream side
effects. lnd uses BIP-340 RFC-6979 deterministic Schnorr nonces, so
the property should hold by construction -- but it is load-bearing
for restart semantics and worth checking in CI rather than assuming.

Add an itest that calls Wallet.SignAndFinalizePsbt twice on the same
unsigned PSBT and asserts byte-identical output.
testBasicAssetCreation pins the "restart at every observable
boundary" mint case in one fixed order. A regression that only fires
under a particular interleaving of restart points would slip past it
and would not shrink to a small reproducer.

Add a rapid property test that models the legal BatchState
transition graph and samples every subset of two well-synchronized
restart points (after disk-state Committed; after the Broadcast
branch's publish call), asserting the mint flow always ends in
BatchStateFinalized with no errors and no leftover caretakers. The
harness is structured so a third restart point can be added
trivially.
MintingBatch.Copy and the FundedMintAnchorPsbt.Copy it delegated to
were "deep" in name only. Seedlings, AssetMetas, GenesisPacket's
nested slices, and RootAssetCommitment all shared substructure with
the source. RPC payloads and subscriber notifications used Copy
believing they had a snapshot, when they actually held a smaller
window into the same data -- callers reasoned as if they had
isolation, with no actual barrier.

Walk every reachable pointer, slice, map, and byte buffer and
duplicate it. Helpers cover the leaf types (Genesis, GroupKey,
ScriptKey, MetaReveal, etc.); RootAssetCommitment goes through
commitment.TapCommitment.Copy (already exercised by
TestTapCommitmentDeepCopy); FundedMintAnchorPsbt.Copy round-trips the
psbt.Packet through Serialize/NewFromRawBytes, which duplicates every
nested PInput / POutput / Unknown without hand-rolling per-type
copies. TestMintingBatchCopyIsDeep builds a fully-populated batch,
snapshots it, mutates every reachable substructure on the source,
and asserts the snapshot is unmoved. Impossible error paths panic
rather than fall back to shared pointers.
rapid's default of 100 iterations per Check call is gratuitous for
the 4-element subset space TestCaretakerRestartRecoveryRapid
explores. Combined with the now-genuinely-deep MintingBatch.Copy, the
test grew from 289s to 309s on the full package suite.

Cap iterations to 30 by default, which still hits every subset case
roughly 7-8 times and recovers ~220s on the full package suite.
Operators wanting deeper exploration can override via
-rapid.checks=N; the test only overrides when the flag still carries
its default value.
The cancel protocol used two shared per-caretaker channels
(CancelReqChan + CancelRespChan). Matching a response to its specific
request relied on the gardener serializing all cancel calls -- a
discipline, not a property of the protocol itself.

Replace the pair with a single CancelReqChan carrying cancelReq
values, each embedding its own per-call reply channel. The binding is
now intrinsic and no longer depends on caller serialization.
CancelRespChan is dropped from BatchCaretakerConfig as a side-effect.
The finalize-batch state-request handler returned the caretaker's
live MintingBatch pointer once BroadcastCompleteChan fired. The
caretaker goroutine continues to mutate the batch after broadcast
(Broadcast -> Confirmed -> Finalized writes GenesisPacket and
RootAssetCommitment), so the caller's reads raced those writes.

Take a deep snapshot via Batch.Copy before resolving, matching what
every other state-request handler in the same select arm already
does. Document the ownership invariant on
BatchCaretakerConfig.Batch: the live pointer is owned by the
caretaker goroutine; external observers must Copy first.
Two latent issues in the caretaker. BatchCaretaker.anchorOutputIndex
was a uint32 field populated only by the Frozen state step; on any
restart that re-entered the state machine at Committed, Broadcast, or
Confirmed, the Frozen step never ran and downstream reads used the
zero value. The bug was latent because FundPsbt(..., -1) typically
leaves the asset anchor at output 0. Separately, the
confirmation-watching goroutine launched in the Broadcast step read
CancelReqChan in parallel with the cultivator's main loop, giving
each per-call request two potential receivers; today every
post-broadcast Cancel() rejects, but the per-call respCh contract is
undermined if that ever changes.

Remove the cached field and read AssetAnchorOutIdx directly from
GenesisPacket at each use site; the value is always present once the
batch has progressed past Frozen. Drop the inner CancelReqChan reads
so the cultivator loop is the sole reader post-broadcast, and
document the single-reader invariant on the field.
The caretaker's SignalCompletion callback did an unbuffered send on
completionSignals with no select on Quit. The gardener drains that
channel only from its main select; stopCaretakers runs from a defer
that fires after the select has exited. If a caretaker reaches
SignalCompletion in the same window that Stop closes Quit, the send
blocks forever and the nested Wg.Wait inside stopCaretakers
deadlocks the entire shutdown path.

Wrap the send in a select on Quit so it can be abandoned. On
shutdown the planter is stopping the caretaker anyway and does not
need the completion notification.
InsertPendingUpdate's no-pending-transition arm created a new
supply_commit_transitions row, inserted the update event, and moved
the state machine to UpdatesPendingState. With the event_key UNIQUE
index added in migration 000061, a re-fired event that already lives
on a prior (now-finalized) transition is silently absorbed by ON
CONFLICT DO NOTHING -- leaving the freshly-created transition with
no events to commit and the state machine stuck in UpdatesPending
with nothing pending.

Switch InsertSupplyUpdateEvent to :execrows so the caller can
distinguish "new row" (1) from "dedup" (0). When the
no-pending-transition arm sees 0 rows affected, return an internal
sentinel error from the ExecTx closure to roll the whole tx back;
the outer wrapper translates the sentinel into a successful nil for
the caller (the event is already in the log, which is what they
wanted). The pending-exists arm is unchanged: dedup there is a
benign no-op against the existing transition.
TestSupplyCommitInsertPendingUpdateRefiredAfterFinalize drives an
event through to a finalized transition, re-fires it, and asserts no
orphan pending transition lands.
ChainBridge, WalletAnchor, KeyRing, and GroupFetcher describe tapd's
view of its node-side dependencies (chain backend, wallet, key ring,
group lookups), but they lived in tapgarden/interface.go. They were
consumed by tapfreighter, tapchannel, tapconfig, lndservices,
universe/supplycommit, universe/supplyverifier, and backup -- which
made the tapgarden home a documented accident of authorship rather
than a property of what the package is.

Introduce a tapnode package that houses the four interfaces alone.
Concrete implementations remain in lndservices; the rest of the tree
imports tapnode for the interface only. tapgarden retains the
minting-specific symbols (MintingStore, MintSupplyCommitter, Planter,
BatchState, FundBatchResp, etc.). No behavior change.
copyMetaReveal collapsed a non-nil, empty UnknownOddTypes map to nil
because the guard was len(...) > 0. reflect.DeepEqual treats {} and
nil as distinct for maps, so AssertCopyEqual flagged the discrepancy
whenever FillFakeData produced an empty (non-nil) map. The bug was
latent because the random fixture only exercised the shape some
fraction of the time; reordering of init() in unrelated test work
shifted the rand state enough to surface it consistently.

Switch the guard to != nil so an empty source produces an empty
copy.
MockChainBridge, MockWalletAnchor, MockKeyRing, and
GenMockGroupVerifier implement the tapnode interfaces but lived in
tapgarden/mock.go. They were consumed across tapgarden, tapfreighter,
and universe/supplyverifier tests; the tapgarden home was an
accident of original authorship.

Move them into tapnode/tapnodemock as ChainBridge, WalletAnchor,
KeyRing, and GenGroupVerifier (the package-name-prefix stutter is
dropped on the way through). The NewGenesisTx / FundGenesisTx
helpers that lived alongside MockWalletAnchor move with it.
tapnode/tapnodemock is a subpackage of tapnode so production code
cannot transitively depend on the mocks; tests opt in by importing
the subpackage explicitly.
The re-org watcher guards proof integrity in the face of chain
re-orgs, which is unrelated to growing seedlings into assets. It
lived in tapgarden only because that was the package that first
happened to need it.

Move tapgarden/re-org_watcher.go to tapreorg/watcher.go, renaming
ReOrgWatcher -> Watcher, ReOrgWatcherConfig -> Config, and
NewReOrgWatcher -> NewWatcher to drop the package-name stutter. The
new package owns its own DefaultTimeout and logger (subsystem
"RORG"). Drop a dead pre-overwrite assignment in
DefaultUpdateCallback while we're here. tapconfig, tapcfg/server.go,
and the root-level log.go are updated for the new symbols.
The Custodian receives assets transferred to this node; the Planter
mints new ones. Keeping them in the same package forced every
consumer that only cared about receipt to import the minting code,
and conflated the two responsibilities at the package boundary.

Hoist the Custodian, its config (renamed Config, dropping the type
stutter), its event surface (AssetReceiveEvent,
AddrImportErrEvent, AddrImportCompleteEvent), its helpers
(AddrMatchesAsset, EventMatchesProof, WaitForAddrImport,
AddrImportStatus), and MockAssetSyncer into a new tapcustody
package. Subsystem CSTD. GenHeaderVerifier, needed by caretaker,
planter, and custodian plus tapfreighter, supplyverifier, rpcserver,
and tapcfg, moves to tapnode alongside the ChainBridge interface it
wraps -- a node-level concern, not a tapgarden one.
The gardener's request protocol shipped (closure, reply channel)
pairs to the main loop indirectly: a stateRequest interface, two
generic structs (stateReq[T], stateParamReq[T,S]), a reqType enum
with seven constants, a typedParam runtime type assertion, and a
seven-arm dispatch switch -- all just to encode which method the
caller wanted to invoke.

Replace it with the only thing that machinery was approximating: a
chan of plain func(). Each public method builds an inline closure
that captures its own parameters and response channel; the loop
invokes whatever closure arrives. A small dispatchStateReq helper
handles the send-or-quit + receive-or-quit boilerplate,
parameterised by the response type. The wait side gains a small
semantic improvement: a FinalizeBatch broadcast wait now unblocks on
Quit instead of hanging on a never-resolved response channel.
The copyX helpers introduced for MintingBatch.Copy lived in
tapgarden but each described how to clone an asset or proof type;
they sat in the wrong package for the symbol they named.

Move them onto their natural types: Copy methods on asset.Genesis,
asset.GroupKey, asset.AssetGroup, asset.ExternalKey,
asset.TweakedScriptKey, asset.ScriptKey, and proof.MetaReveal; free
helpers asset.CopyKeyDescriptor and asset.CopyPubKey for the foreign
keychain / btcec types we cannot add methods to. PreCommitmentOutput
is tapgarden-local so it stays put as a Copy method on the type.
tapgarden/batch.go shrinks by ~120 lines; seedling.go drops
copyExternalKey; call sites read more straightforwardly
(s.Meta.Copy() vs copyMetaReveal(s.Meta)). asset.Asset.Copy is
intentionally left alone -- it has its own inline copy logic and
its callers are out of scope.
jtobin added 17 commits June 4, 2026 11:42
MintingStore had grown to seventeen methods on a single interface,
mixing the lifecycle of a minting batch (state transitions, sprouts,
signed genesis tx, confirmation) with read-only reference lookups
(script keys, asset metas, group keys, delegation keys). The name
promised "the place where minting batches are stored"; the methods
delivered an inventory.

Split into BatchStore (fourteen lifecycle methods) and
MintingRefReader (four read-only reference lookups). GardenKit.Log
becomes GardenKit.BatchStore; a MintingRefs MintingRefReader field
is added for the read-only lookups. The free helpers
freezeMintingBatch, fetchFinalizedBatch, and listBatches now take
the narrower interfaces they actually need. The single concrete
*tapdb.AssetMintingStore satisfies both interfaces. The embedded
asset.TapscriptTreeManager is dropped from the umbrella in favor of
TreeStore at the point of injection; the unused FetchGroupByGenesis
is dropped from the tapgarden-facing surface but kept on the tapdb
struct where it's still consumed internally.
NewFallibleTapscriptTreeMgr now takes asset.TapscriptTreeManager
directly.
The caretaker is constructed by the planter from the same kit, but
BatchCaretakerConfig embedded GardenKit by value -- duplicating the
planter's kit per caretaker for no functional reason.

Embed *GardenKit instead so the caretaker reads the planter's kit
directly. The kit is populated once at planter construction and not
mutated thereafter, so sharing by reference is safe. Promoted field
access (b.cfg.BatchStore, b.cfg.ChainBridge, ...) is unchanged: Go
promotes fields through pointer embedding.
The FundBatchResp wrapper added nothing beyond the *VerboseBatch it
held.

Drop the wrapper and return *VerboseBatch directly so the type names
what it is.
Five MintingState values were declared but only MintingStateSeed was
ever emitted, in a single SeedlingUpdate site. The other four were
declared but never produced. Success or failure of a seedling add is
already conveyed by Error and PendingBatch on the update.

Drop the enum and the NewState field rather than continue exposing a
transition surface the package does not actually track.
A long-standing TODO at NewBatchCaretaker asked for this rename; the
package's gardening metaphor (Seedling, Sprout, Planter) fits
Cultivator and not Caretaker.

Rename the type, its config, its constructor, the planter's
cultivator field and helpers, and the file itself. Log strings and
local variables follow.
PendingAssetGroup embedded asset.GroupKeyRequest and
asset.GroupVirtualTx anonymously, hiding the fact that the struct is
a pair: a request together with the virtual tx that fulfils it.

Replace the silent embeddings with named KeyRequest and VirtualTx
fields and update the call sites that relied on field promotion.
AssetMintEvent carried both an explicit BatchState field and a Batch
pointer whose Batch.State() might lag, and a comment frankly
admitted the contradiction by giving BatchState precedence.

Drop the field and make the constructors mirror the just-executed
state into the copied batch so Batch.State() always reports it. The
RPC marshaler now reads Batch.State(); the wire-level
mintrpc.MintEvent.batch_state is unchanged.
UpdateTapSibling is exported only because Go visibility forces it
across the tapgarden/tapdb boundary; its only legitimate callers are
BatchStore implementations. The contract lived only in the call
graph.

Mirror SetStateOnDBSuccess's doc onto UpdateTapSibling so the
contract is in the source.
ChainPlanter was the only implementation; no mocks exist anywhere
in the repo, and tests already hold *ChainPlanter directly. The
interface added abstraction without any second implementation to
abstract over.

Replace the two config-field types that held the interface with
*tapgarden.ChainPlanter, delete the interface and its compile-time
assertion, and strip the stale "NOTE: This is part of the Planter
interface" comments. If a fake minter is ever needed, the interface
can be reintroduced from the concrete type's actual surface.
tapdb's persistence path for the supply pre-commit row was coupled
to tapgarden's FundedMintAnchorPsbt.PreCommitmentOutput. Extracting
the supply-commit code into its own package would require breaking
that coupling first.

Introduce tapgarden.PreCommitBindData, a typed persistence payload.
The five BatchStore methods that bind a funded genesis transaction
or seal a batch -- CommitMintingBatch, CommitBatchTx,
CommitBatchFunding, AddSproutsToBatch, SealBatch -- gain an
fn.Option[PreCommitBindData] parameter that tapdb writes alongside
its usual columns. For now the planter constructs the payload by
lifting it off FundedMintAnchorPsbt.PreCommitmentOutput via a new
PreCommitBindData helper; the same SQL writes happen in the same
transactions. Two inner helpers (upsertPreCommit + insertMintAnchorTx)
merge into a single upsertPreCommitRow that takes the typed payload
plus the genesis tx hash.
The supply-commit code needs to read mint_supply_pre_commits rows
(for intake checks and verifier lookups) but importing all of
AssetMintingStore just for that surface drags every other minting
concern along with it.

Create a dedicated SupplyPreCommitStore backed by the same db handle
as AssetMintingStore, exposing the mint_supply_pre_commits reads
(currently just FetchDelegationKey) that the supply-commit code
needs. AssetMintingStore.FetchDelegationKey becomes a thin shim that
delegates to it; the shim survives for now because the
MintingRefReader interface in tapgarden still has the method, and
that method is removed in a later commit once the augmenter takes
over delegation-key prep. The write side stays on AssetMintingStore
because pre-commit rows must land atomically with the batch's chain
update, which only the binding tx in AssetMintingStore can offer.
tapgarden's minting flow has supply-commit logic hardcoded
throughout (seedling delegation-key prep, universe-commitment intake
validation, pre-commitment output stamping, PreCommitBindData
production, post-confirmation events). The two concerns are
entangled across the planter and cultivator, and the supply-commit
code cannot move to its own package until tapgarden's call sites
stop reaching directly into it.

Introduce tapgarden.GenesisTxAugmenter, an interface tapgarden will
invoke at well-defined lifecycle moments to let an external
participant act on batch minting without tapgarden knowing the
details. Six hooks: PrepareSeedling and ValidateSeedling for intake;
ExtraOutputs and PostFund for genesis-tx contributions; BindData for
the typed persistence payload; OnBatchConfirmed for downstream
events. A NoOpAugmenter is provided so call sites don't nil-check.
universe/supplycommit gains a concrete GenesisAugmenter that
captures every prior supply-commit-specific tapgarden path in one
place. The augmenter is not yet wired into GardenKit; that happens
in the following commit.
With GenesisTxAugmenter defined but unwired, tapgarden still
executed the legacy supply-commit code paths directly. The
indirection only earns its keep once production paths route through
it.

Wire the augmenter through GardenKit and replace the planter's
direct supply-commit references with augmenter calls at every
lifecycle moment: seedling intake goes through PrepareSeedling
instead of prepSeedlingDelegationKey; the intake gate through
ValidateSeedling instead of validateUniCommitment plus
validateDelegationKey; funding consults ExtraOutputs to splice
additional anchor outputs and PostFund to stamp PSBT metadata;
persistence payloads at funding and seal come from BindData;
post-confirmation events go through OnBatchConfirmed. The planter
and cultivator each gain an augmenter() helper that returns the
configured augmenter or NoOpAugmenter{}, so call sites never
nil-check. MintingBatch.validateGroupAnchor is exported as
ValidateGroupAnchor so the augmenter can invoke it.
tapgarden/mock.go gets a local mockSupplyCommitAugmenter that
mirrors the real one without the dependency cycle, so existing
tests still get fully-formed funded batches.
With GenesisTxAugmenter wired through and tapgarden's production
paths routed through it, the legacy supply-commit types and helpers
in tapgarden are dead code.

Delete them: MintSupplyCommitter and the matching GardenKit fields
(MintSupplyCommitter, DelegationKeyChecker); PreCommitmentOutput, its
Copy method, NewPreCommitmentOutput, and the PreCommitmentOutput
field on FundedMintAnchorPsbt; PreCommitTxOut (moved to supplycommit
and exported there); the DelegationKey alias;
fetchDelegationKey / fetchPreCommitGroupKey;
prepSeedlingDelegationKey; sealBatchPreCommit; the
validateUniCommitment / validateDelegationKey methods on
MintingBatch; the sendSupplyCommitEvents Cultivator method;
FetchDelegationKey on MintingRefReader; the
AssetMintingStore.FetchDelegationKey shim and PreCommitmentOutput
reconstruction in marshalMintingBatch; the PreCommitOutIdx tracking
in AnchorTxOutputIndexes. tapconfig wires a real
supplycommit.GenesisAugmenter into GardenKit. The obsolete
supply_commit_test.go is replaced by a unit test on the augmenter at
universe/supplycommit/genesis_augmenter_test.go;
TestValidateUniCommitment is removed since the invariants now live
on the augmenter. tapgarden/mock.go gains a MockBindDataForBatch
helper so tapdb tests can derive a typed payload without going
through the deleted method.
The supply-commit extraction moved PreCommitTxOut into supplycommit;
two test files now reach into supplycommit for it and no longer
reference tapgarden. go vet caught these once the unit suite ran
end-to-end.

Drop the unused imports.
GenGroupVerifier, GenGroupAnchorVerifier, and
GenRawGroupAnchorVerifier construct closures over
tapnode.GroupFetcher that feed proof.VerifierCtx; they sit naturally
alongside tapnode.GenHeaderVerifier rather than inside the minting
state machine. Verifier consumption is downstream of tapgarden's
end (a verifiable asset in the local store); the generators belong
to the proof-verifier code, not the minting one.

Move the three generators to tapnode along with
ErrGroupKeyUnknown, ErrGenesisNotGroupAnchor, and the LRU cache
helpers. Call sites are updated accordingly.
The cultivator's batch-confirmation handler assembled universe.Item
values inline (storeMintingProof returned both the proof blob and
the item) and streamed them to a local universe via
UpsertProofLeafBatch. The planter's re-org handler had its own
UpsertProofLeaf loop with the same wire-shape construction. Both
sites encode universe publication, which is downstream of
tapgarden's natural end (a verifiable asset in the local store).

Introduce tapgarden.MintProofPublisher, owned by the consumer, with
PublishMintBatch and PublishMintProofUpdates. The
universe/mintpublish package provides the BatchRegistrar-backed
implementation. tapgarden now passes raw
(asset, proof, mintTxHash, outIdx) inputs and never constructs a
universe.Item, BaseLeafKey, Identifier, Leaf, or GenesisWithGroup
itself. GardenKit loses the Universe and UniversePushBatchSize
fields in favor of a single MintProofPublisher field;
NoOpMintProofPublisher covers test configurations with no downstream
universe.
@jtobin jtobin added this to the v0.9 milestone Jun 4, 2026
@jtobin jtobin self-assigned this Jun 4, 2026
@gemini-code-assist

Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a large-scale hardening of the tapgarden minting subsystem. The changes focus on ensuring structural correctness and restart-idempotence for minting batches. By enforcing strict singleton invariants on pending batches, ensuring atomic persistence, and implementing deterministic state transitions, the system is now significantly more resilient to crashes and re-orgs. The refactor also improves code modularity by extracting node-level service interfaces and dedicated subsystems into their own packages.

Highlights

  • Persistence Atomicity: State mutations now flow through a store method that ensures memory is only updated after a successful database commit, preventing drift between memory and disk.
  • Singleton Invariant: A new partial unique index on the minting batches table enforces that at most one batch exists in a pre-broadcast state (Pending or Frozen), preventing deadlock and state corruption.
  • Restart Idempotence: The minting caretaker's restart path is now idempotent; crashes during the sign-and-finalize or broadcast phases are handled by re-running the logic, with duplicate events absorbed by a new content-hash index.
  • Package Reorganization: Subsystems like 'custodian' and 'reorg' have been moved to dedicated packages (tapcustody, tapreorg), and shared node-side interfaces have been consolidated into a new tapnode package.
  • One-Shot Repair Tool: Added a new repair flag (--repair.cancel-duplicate-batches) to allow operators to recover legacy databases that violate the new singleton invariant before applying migrations.
New Features

🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the tapd codebase by modularizing substrate dependencies into the tapnode package and separating concerns into tapreorg, tapcustody, and mintpublish. It also introduces database constraints to enforce a singleton pre-broadcast batch invariant, adds a repair tool to resolve legacy database violations, and implements deduplication for supply update events. The review feedback highlights a critical migration failure risk where the programmatic backfill in migration 62 could fail on legacy databases containing duplicate events. Additionally, feedback points out a shallow copy issue in GroupKey.Copy(), potential nil pointer dereferences in UpdateBatchState and Cultivator.Cancel, and a potential infinite loop in PublishMintBatch if batchSize is non-positive.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread tapdb/programmatic_migrations.go
Comment thread asset/group_key.go
Comment thread tapdb/asset_minting.go
Comment thread tapgarden/cultivator.go
Comment thread universe/mintpublish/publisher.go Outdated
jtobin added 3 commits June 4, 2026 12:34
The bug this PR closes -- restart-triggered re-fires of mint events
inserting duplicate supply_update_events rows -- could have already
fired against an unupgraded database. Two rows with identical content
hash to the same event_key, so the second SetSupplyUpdateEventKey in
the migration 62 backfill would violate the unique index added in
migration 61 and abort the migration.

Track hashes seen during the backfill loop and delete any row whose
hash a prior row already claimed. The deleted rows are by definition
the same logical event as the one that survives. A new
DeleteSupplyUpdateEvent sqlc query supports the delete-by-event_id
that the dedupe needs.

TestMigration62BackfillDedupesLegacyDuplicates seeds three identical
rows pre-backfill and asserts the migration leaves exactly one with
the expected hash, never erroring out.
Address golangci-lint findings (lll, gofmt, whitespace, exhaustive,
gosec, govet copylocks) and two reviewer comments: use uint for
mintpublish's batchSize so the negative-value case is ruled out at
the type boundary, and use the already-computed batchKey in the
Cultivator.Cancel default arm instead of re-deriving it from the
live batch.

The copylocks fixes switch the augmenter staging-batch construction
from `*batch` to `batch.Copy()` so the atomic state field isn't
copied by value. The exhaustive cases get explicit default arms on
the BatchState switches in checkSingletonInvariant and RunRepairTool.
Rename Migration62BackfillSupplyUpdateEventKeys to
Migration62BackfillEventKeys so the aligned map entry fits the
80-column limit.

Make TestMigration62BackfillSupplyUpdateEventKeys and
TestMigration62BackfillDedupesLegacyDuplicates portable to Postgres:
the first one used the SQLite-only `?` placeholder, the second used
MAX(bytea) which isn't defined in Postgres. Switch to sqlc's
InsertSupplyUpdateEvent and split COUNT + key lookup into two
queries. Regenerate sqlc bindings whose source-comment references
the event_key column's migration number.

Add a [repair] section to sample-tapd.conf with an example for
repair.cancel-duplicate-batches; the sample-conf check requires every
tapd flag to have a default or example in the file.
@jtobin jtobin force-pushed the tapgarden-hardening branch from 7f01e15 to f21a6d7 Compare June 4, 2026 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: 🆕 New

Development

Successfully merging this pull request may close these issues.

1 participant