Skip to content

Commit 779e572

Browse files
authored
chore: Secondary Index Cleanup Hardening (#563)
* harden and cleanup secondary index
1 parent 28d38c2 commit 779e572

30 files changed

+4423
-2254
lines changed
Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,44 @@
1+
# Backlog: Checkpoint drops old active root before transaction GC can release it
2+
3+
## Summary
4+
5+
Table checkpoint currently drops the old active-root guard immediately after publishing a new table-file root, before the checkpoint transaction commits and before transaction GC can release it at a snapshot-safe point.
6+
7+
## Reference
8+
9+
docs/tasks/000121-secondary-index-cleanup-hardening.md; docs/rfcs/0014-dual-tree-secondary-index.md Phase 5; doradb-storage/src/table/persistence.rs; doradb-storage/src/file/cow_file.rs; doradb-storage/src/trx/mod.rs
10+
11+
## Deferred From (Optional)
12+
13+
docs/tasks/000121-secondary-index-cleanup-hardening.md; docs/rfcs/0014-dual-tree-secondary-index.md Phase 5
14+
15+
## Deferral Context (Optional)
16+
17+
- Defer Reason: Task 000121 is focused on secondary MemIndex cleanup. Fixing old active-root ownership affects checkpoint publication and transaction GC payloads, so it should be planned separately from the local cleanup API hardening.
18+
- Findings: Table::checkpoint receives OldRoot from MutableTableFile::commit and currently drops it immediately before committing the checkpoint transaction. ActiveTrx, PrecommitTrxPayload, and CommittedTrxPayload carry row/index GC data but do not carry old table roots.
19+
- Direction Hint: Prefer making root retention part of the transaction lifecycle: checkpoint owns the swapped root until commit, committed transaction GC releases it after the snapshot horizon proves no active transaction can still see it, and rollback/error paths release it safely.
20+
21+
## Scope Hint
22+
23+
Move old-root ownership into transaction commit and GC payloads, or an equivalent root-retention structure, so swapped active-root objects are released only after no active transaction can observe them.
24+
25+
## Acceptance Hint
26+
27+
Checkpoint no longer drops old_root directly; old active-root guards are released through transaction GC or equivalent visibility-gated cleanup; regression coverage proves a transaction that began before checkpoint can safely keep using the previous root object until it ends.
28+
29+
## Notes (Optional)
30+
31+
32+
## Close Reason (Added When Closed)
33+
34+
When a backlog item is moved to `docs/backlogs/closed/`, append:
35+
36+
```md
37+
## Close Reason
38+
39+
- Type: <implemented|stale|replaced|duplicate|wontfix|already-implemented|other>
40+
- Detail: <reason detail>
41+
- Closed By: <backlog close>
42+
- Reference: <task/issue/pr reference>
43+
- Closed At: <YYYY-MM-DD>
44+
```
Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,44 @@
1+
# Backlog: Protect obsolete DiskTree roots from checkpoint reclamation while transactions can access them
2+
3+
## Summary
4+
5+
Define and implement retention rules so secondary DiskTree roots and pages reachable from active transaction snapshots are not reclaimed, reused, or overwritten until no active transaction can access them.
6+
7+
## Reference
8+
9+
docs/tasks/000121-secondary-index-cleanup-hardening.md; docs/rfcs/0014-dual-tree-secondary-index.md Phase 5; doradb-storage/src/table/gc.rs; doradb-storage/src/table/persistence.rs; doradb-storage/src/file/cow_file.rs
10+
11+
## Deferred From (Optional)
12+
13+
docs/tasks/000121-secondary-index-cleanup-hardening.md; docs/rfcs/0014-dual-tree-secondary-index.md Phase 5
14+
15+
## Deferral Context (Optional)
16+
17+
- Defer Reason: Task 000121 can make cleanup transaction-scoped, but full DiskTree root/page reclamation policy is broader and should be designed with checkpoint GC rather than folded into the local cleanup fix.
18+
- Findings: Cleanup can use a transaction STS to protect its snapshot, but DiskTree rewritten blocks are currently not reclaimed through a formal root-reachability policy. The weak duplicate candidate docs/backlogs/000086-secondary-index-dual-tree-access-path.md is about access-path optimization, not root lifetime.
19+
- Direction Hint: After old active-root ownership is fixed, design root-reachability GC around transaction visibility: old roots and pages remain readable for active snapshots, then become reclaimable only after the GC horizon passes their owning checkpoint.
20+
21+
## Scope Hint
22+
23+
Cover checkpoint A/B root swap, future DiskTree root-reachability GC, table-file gc_block_list integration, and calc_min_active_sts_for_gc visibility boundaries.
24+
25+
## Acceptance Hint
26+
27+
DiskTree root and page reclamation consults active transaction visibility; a cleanup transaction that captured a DiskTree root can finish reads after later checkpoints; regression tests cover cleanup/checkpoint overlap and root-GC safety.
28+
29+
## Notes (Optional)
30+
31+
32+
## Close Reason (Added When Closed)
33+
34+
When a backlog item is moved to `docs/backlogs/closed/`, append:
35+
36+
```md
37+
## Close Reason
38+
39+
- Type: <implemented|stale|replaced|duplicate|wontfix|already-implemented|other>
40+
- Detail: <reason detail>
41+
- Closed By: <backlog close>
42+
- Reference: <task/issue/pr reference>
43+
- Closed At: <YYYY-MM-DD>
44+
```
Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,45 @@
1+
# Backlog: Page-level secondary MemIndex cleanup scan
2+
3+
## Summary
4+
5+
Current secondary MemIndex cleanup materializes every MemIndex entry for an index before processing it. Future work should replace that with a page-level or bounded-batch scan interface that preserves cleanup proof rules while limiting peak memory and reducing unnecessary lookup work. The scan should also filter out live hot entries before encoded-key materialization when their row id is at or above the captured pivot, because those entries are not cleanup candidates.
6+
7+
## Reference
8+
9+
doradb-storage/src/table/gc.rs; doradb-storage/src/index/unique_index.rs; doradb-storage/src/index/non_unique_index.rs; doradb-storage/src/index/disk_tree.rs; docs/garbage-collect.md; docs/backlogs/000086-secondary-index-dual-tree-access-path.md
10+
11+
## Deferred From (Optional)
12+
13+
docs/tasks/000121-secondary-index-cleanup-hardening.md; docs/rfcs/0014-dual-tree-secondary-index.md Phase 5
14+
15+
## Deferral Context (Optional)
16+
17+
- Defer Reason: The direction is valid, but avoiding the memory spike and lookup work safely needs a separate design pass around BTree page cursors, DiskTree matching, and page-local revalidation/deletion. Folding it into the current hardening task would widen the scope.
18+
- Findings: Current cleanup calls scan_mem_entries() on each secondary index, which delegates to scan_encoded_entries() and returns a full Vec without filtering. The cleanup then performs per-entry DiskTree checks and compare-style MemIndex deletion. It already retains live entries with row_id >= captured pivot without a DiskTree probe, but only after those entries have been copied into the full scan vector and iterated. A page-level design can bound memory and skip these live hot entries before copying encoded keys. Cleanup decisions for remaining candidates still depend on async DiskTree/column-index reads, so the design must not simply hold leaf-page locks while deciding. Related backlog 000086 covers the broader dual-tree access contract; this follow-up is narrower and specifically targets cleanup scan shape, memory use, candidate filtering, and avoidable lookup cost.
19+
- Direction Hint: Prefer a page-batched or cursor-based MemIndex scan. First inspect row id and delete state from each leaf slot, skip live hot entries with row_id >= captured pivot, and only allocate/copy encoded keys for cleanup candidates. Deleted entries must not be skipped by the live-hot rule, and live entries below pivot remain candidates because they may be redundant with the captured DiskTree root. Evaluate whether DiskTree facts should be gathered by sorted range matching per batch, and whether final deletion should use page-local revalidation without falling back to root traversal. Retain entries when the scanned page/key no longer matches rather than weakening correctness.
20+
21+
## Scope Hint
22+
23+
Design and implement a bounded cleanup scan path for user-table secondary MemIndex cleanup. Keep DiskTree immutable during cleanup and preserve captured-root proof semantics. Avoid holding MemIndex leaf latches across async DiskTree or column-index reads. Include candidate filtering so live hot entries at or above the captured pivot are skipped before encoded-key materialization and before any DiskTree or compare-delete work.
24+
25+
## Acceptance Hint
26+
27+
Cleanup processes MemIndex entries in bounded page-sized batches or an equivalent streaming form, does not allocate one vector for all entries in an index, skips live hot entries before encoded-key allocation, exposes a separate skipped-live-hot statistic, preserves existing safety tests, and includes targeted coverage for cleanup results when entries change between scan and delete.
28+
29+
## Notes (Optional)
30+
31+
Future implementation should keep scanned/retained/removed focused on processed cleanup candidates and add a separate skipped_live_hot or equivalent per-index metric for live hot entries skipped by the scan filter.
32+
33+
## Close Reason (Added When Closed)
34+
35+
When a backlog item is moved to `docs/backlogs/closed/`, append:
36+
37+
```md
38+
## Close Reason
39+
40+
- Type: <implemented|stale|replaced|duplicate|wontfix|already-implemented|other>
41+
- Detail: <reason detail>
42+
- Closed By: <backlog close>
43+
- Reference: <task/issue/pr reference>
44+
- Closed At: <YYYY-MM-DD>
45+
```

docs/backlogs/next-id

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1 @@
1-
000089
1+
000092

docs/checkpoint-and-recovery.md

Lines changed: 7 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -66,7 +66,7 @@ Instead:
6666
the table checkpoint
6767
- hot secondary-index state is rebuilt through normal redo replay of hot row
6868
operations
69-
- stale cold entries are shadowed at runtime by `MemTree` and deletion-buffer
69+
- stale cold entries are shadowed at runtime by `MemIndex` and deletion-buffer
7070
state until the next table checkpoint publishes updated `DiskTree` roots
7171

7272
This removes the old `Index_Rec_CTS` problem entirely.
@@ -77,7 +77,7 @@ When a transaction commits on a table:
7777

7878
1. its redo record is appended to the commit log
7979
2. heap undo state and deletion-buffer state become visible through commit CTS
80-
3. secondary-index changes remain in `MemTree`
80+
3. secondary-index changes remain in `MemIndex`
8181

8282
Foreground commit never updates persistent `DiskTree` pages directly.
8383

@@ -166,7 +166,7 @@ metadata-only root to advance `deletion_cutoff_ts` to the checkpoint cutoff.
166166

167167
The design explicitly does **not** do the following:
168168

169-
- scan dirty `MemTree` entries looking for committed work
169+
- scan dirty `MemIndex` entries looking for committed work
170170
- merge arbitrary committed batches into `DiskTree`
171171
- advance an index-only replay watermark from a batch max CTS
172172

@@ -206,7 +206,7 @@ For each redo record after the coarse replay floor:
206206
- Heap / hot RowStore:
207207
- replay if the row belongs to hot RowStore and
208208
`CTS >= Heap_Redo_Start_TS`
209-
- row replay also rebuilds hot secondary-index `MemTree` state through the
209+
- row replay also rebuilds hot secondary-index `MemIndex` state through the
210210
normal row/index update logic
211211
- Cold-row deletions:
212212
- replay if `row_id < pivot_row_id` and
@@ -232,17 +232,17 @@ After redo reaches log end:
232232

233233
- RowStore is reconstructed for hot rows
234234
- deletion buffer is reconstructed for post-checkpoint cold deletes
235-
- `MemTree` is reconstructed for post-checkpoint hot secondary-index state
235+
- `MemIndex` is reconstructed for post-checkpoint hot secondary-index state
236236
- `DiskTree` remains the checkpointed source of truth for cold secondary-index
237237
state
238238

239239
The engine can then serve traffic.
240240

241241
## 6. Garbage Collection
242242

243-
### 6.1 MemTree Cleanup
243+
### 6.1 MemIndex Cleanup
244244

245-
`MemTree` entries may be cleaned or evicted only after the corresponding table
245+
`MemIndex` entries may be cleaned or evicted only after the corresponding table
246246
checkpoint makes their state durable:
247247

248248
- hot rows checkpointed into persistent LWC plus companion `DiskTree` entries

docs/garbage-collect.md

Lines changed: 114 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,114 @@
1+
# Garbage Collection
2+
3+
Garbage collection in the storage engine is split by ownership boundary. Each
4+
collector needs a proof that its target is unreachable for every active
5+
snapshot before it can physically remove state.
6+
7+
## Transaction Purge
8+
9+
Transaction purge is governed by `Global_Min_STS`, the oldest active snapshot
10+
timestamp used by the purge workers. A committed undo or index cleanup record is
11+
purgeable only when its commit timestamp is strictly older than
12+
`Global_Min_STS`.
13+
14+
Purge removes runtime-only obligations:
15+
16+
- row undo links whose next version is no longer visible to active snapshots
17+
- index undo branches after their owning row undo can no longer be rolled back
18+
- delete overlays created by transaction cleanup when the row/deletion proof is
19+
available
20+
21+
## Row-Page Undo GC
22+
23+
Row-page undo-chain GC is local to the row page and keeps historical versions
24+
needed by active readers. Runtime unique-key links are part of that same
25+
lifecycle: they preserve old unique-key ownership across row chains and are not
26+
owned by secondary-index full-scan cleanup.
27+
28+
Runtime unique-key links are not collectible merely because a row crossed the
29+
column-store pivot, disappeared from the deletion buffer, or became cold. They
30+
are collectible only when rollback/index-undo obligations are gone and
31+
`Global_Min_STS` proves no active snapshot can require the old owner.
32+
33+
## MemIndex Full-Scan Cleanup
34+
35+
User-table secondary indexes use a hot `MemIndex` and a checkpointed cold
36+
`DiskTree`. Full-scan cleanup is a memory cleanup pass only. It never mutates
37+
`DiskTree` and never rebuilds checkpointed cold entries into `MemIndex`.
38+
39+
The cleanup pass captures:
40+
41+
- table checkpoint timestamp
42+
- `pivot_row_id`
43+
- `ColumnBlockIndex` root
44+
- secondary `DiskTree` roots
45+
- deletion checkpoint cutoff
46+
- caller-supplied `Global_Min_STS`
47+
48+
Live entries can be removed when the captured row id is below the captured
49+
pivot and the captured `DiskTree` already has the same durable entry:
50+
51+
- unique: same encoded logical key maps to the same row id
52+
- non-unique: same encoded exact `(logical_key, row_id)` key exists
53+
54+
Delete overlays require overlay-obsolescence proof, not `DiskTree` absence. A
55+
unique delete-shadow or non-unique delete-marked exact entry can be removed
56+
when one of these facts is true:
57+
58+
- a deletion-buffer marker is committed and older than `Global_Min_STS`
59+
- the captured table root is older than `Global_Min_STS`, the row id is below
60+
the captured pivot, and the captured `ColumnBlockIndex` proves the row id is
61+
absent
62+
- the captured table root is older than `Global_Min_STS`, the row id is below
63+
the captured pivot, and the captured cold LWC row still exists but its current
64+
indexed values encode to a different scanned MemIndex key
65+
66+
The last case covers committed key changes for cold rows: the row still exists,
67+
but the scanned delete overlay no longer protects the row's current secondary
68+
key. If the captured cold row still owns the same encoded key, cleanup retains
69+
the overlay unless whole-row deletion is proven. Hot row-page key-obsolescence
70+
proof remains transaction index GC's job because it needs row-page undo-chain
71+
visibility.
72+
73+
A matching stale cold entry may still exist in the captured `DiskTree` after a
74+
row-deletion overlay is removed. That is safe because normal row/deletion
75+
visibility checks filter the row; `DiskTree` mutation remains checkpoint-owned.
76+
77+
Invalid cleanup proofs:
78+
79+
- deletion-buffer absence
80+
- `row_id < pivot_row_id` by itself
81+
- a `RowLocation::NotFound` result from a moving current root
82+
- the existence of a newer `DiskTree` root not captured with the table snapshot
83+
- hot row-page key mismatch without transaction index GC's row-page proof
84+
85+
Cleanup removes scanned entries with encoded compare-delete operations that
86+
also check the expected row id or delete-bit state. If an entry changed after
87+
the scan, cleanup retains it.
88+
89+
## DiskTree And CoW Roots
90+
91+
`DiskTree` roots are published as companion state of table checkpoint and
92+
deletion checkpoint. Old `DiskTree` pages become reclaimable only after the
93+
table-file CoW root that references them is no longer reachable by active
94+
readers. Root reachability GC is separate from `MemIndex` cleanup.
95+
96+
## Deletion Buffer
97+
98+
The deletion buffer tracks tombstones for persisted column-store rows. A marker
99+
is globally purgeable only after its delete timestamp is committed and older
100+
than `Global_Min_STS`.
101+
102+
Deletion-buffer absence is deliberately not a general proof. Some hot-origin
103+
secondary-index overlays never had a cold delete marker, and a missing marker
104+
does not prove that every active snapshot can ignore the old row/key owner.
105+
106+
## Summary
107+
108+
Use the narrowest proof owned by the component being collected:
109+
110+
- transaction purge uses `Global_Min_STS` and undo/index-undo ownership
111+
- row-page GC uses row undo-chain visibility
112+
- runtime unique-key links use the undo GC horizon
113+
- `MemIndex` cleanup uses captured checkpoint roots plus deletion proof
114+
- `DiskTree` page GC uses table-file CoW root reachability

docs/index-design.md

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -32,7 +32,7 @@ This pattern appears in both indexing layers:
3232
- hot routing in memory
3333
- cold routing in persisted CoW state
3434
- secondary index:
35-
- hot `MemTree`
35+
- hot `MemIndex`
3636
- cold `DiskTree`
3737

3838
This design keeps foreground writes memory-first while letting checkpoint
@@ -51,11 +51,17 @@ Use the following documents as the living source of truth:
5151

5252
2. [`secondary-index.md`](./secondary-index.md)
5353
- unique and non-unique secondary-index models
54-
- `MemTree` and `DiskTree`
54+
- `MemIndex` and `DiskTree`
5555
- read/write behavior
5656
- companion checkpoint maintenance
5757
- recovery behavior
5858

59+
3. [`garbage-collect.md`](./garbage-collect.md)
60+
- transaction and row undo purge
61+
- runtime unique-key link lifecycle
62+
- `MemIndex` cleanup proofs
63+
- `DiskTree` and table-file CoW root reclamation
64+
5965
## 4. Summary
6066

6167
`RowID` is the common identity across the storage engine. Block index resolves

docs/rfcs/0014-dual-tree-secondary-index.md

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -525,10 +525,10 @@ SAFETY:` comments, and run the repository lint gate. [D10], [C4], [C7]
525525
intentionally deferred optimization work as backlogs.
526526
- Non-goals: generic B+Tree backend unification and broad query-planner range
527527
scan features.
528-
- Task Doc: `docs/tasks/TBD.md`
529-
- Task Issue: `#0`
530-
- Phase Status: `pending`
531-
- Implementation Summary: `pending`
528+
- Task Doc: `docs/tasks/000121-secondary-index-cleanup-hardening.md`
529+
- Task Issue: `#562`
530+
- Phase Status: done
531+
- Implementation Summary: Implemented Phase 5 secondary-index cleanup hardening: removed the transitional MemIndex wrapper, added proof-based MemIndex cleanup, preserved no-cold-backfill recovery, refreshed docs, and validated with clippy and nextest. [Task Resolve Sync: docs/tasks/000121-secondary-index-cleanup-hardening.md @ 2026-04-17]
532532

533533
## Test Strategy
534534

0 commit comments

Comments
 (0)