📝 Walkthrough

Adds a composite secondary index combining MemTree + DiskTree runtimes, refactors DiskTree into runtime/snapshot surfaces, expands MemTree encoded-entry helpers, adjusts B-tree scan callback error semantics, and removes the ReadonlyBackingFile indirection across buffer/file/write paths.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant DualTree as DualTreeUniqueIndex
    participant MemTree as MemTree
    participant DiskRuntime as SecondaryDiskTreeRuntime
    participant DiskTree as DiskTreeSnapshot
    Client->>DualTree: lookup(key)
    DualTree->>MemTree: lookup(key)
    alt mem hit
        MemTree-->>DualTree: Some(row_id)
        DualTree-->>Client: row_id
    else mem miss
        DualTree->>DiskRuntime: open_unique_at(root, guard)
        DiskRuntime-->>DiskTree: DiskTree snapshot
        DualTree->>DiskTree: lookup(key)
        alt disk match
            DiskTree-->>DualTree: row_id
            DualTree->>MemTree: insert overlay for disk match
            MemTree-->>DualTree: inserted
            DualTree-->>Client: row_id
        else no match
            DiskTree-->>DualTree: None
            DualTree-->>Client: None
        end
    end
```

```mermaid
sequenceDiagram
    participant Client
    participant DualTree as DualTreeNonUniqueIndex
    participant MemTree as MemTree
    participant DiskRuntime as SecondaryDiskTreeRuntime
    participant DiskTree as DiskTreeSnapshot
    Client->>DualTree: lookup(prefix)
    DualTree->>MemTree: lookup_encoded_entries(prefix)
    MemTree-->>DualTree: mem_entries (encoded, row_id, deleted)
    DualTree->>DiskRuntime: open_non_unique_at(root, guard)
    DiskRuntime-->>DiskTree: DiskTree snapshot
    DualTree->>DiskTree: prefix_scan_entries(prefix)
    DiskTree-->>DualTree: disk_entries (encoded_exact_key, row_id)
    DualTree->>DualTree: merge_non_unique_entries(mem_entries, disk_entries)
    DualTree-->>Client: deduplicated merged RowIDs
```
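Taken together, the two diagrams describe a mem-first probe with a disk fallback that warms the mem overlay on a hit. A minimal sketch of the unique-lookup path, using plain `BTreeMap`s as hypothetical stand-ins for the MemTree overlay and the DiskTree snapshot (only the names follow the diagram; this is not the real API):

```rust
use std::collections::BTreeMap;

type RowID = u64;

// Hypothetical stand-ins: the real MemTree/DiskTree are concurrent tree
// structures; plain maps keep the sketch self-contained.
struct DualTreeUniqueIndex {
    mem: BTreeMap<Vec<u8>, RowID>,
    disk: BTreeMap<Vec<u8>, RowID>,
}

impl DualTreeUniqueIndex {
    // Mem-first probe; on a disk hit, insert an overlay entry into the
    // mem tree so the next lookup is served warm, as in the diagram.
    fn lookup(&mut self, key: &[u8]) -> Option<RowID> {
        if let Some(&row_id) = self.mem.get(key) {
            return Some(row_id);
        }
        let row_id = *self.disk.get(key)?;
        self.mem.insert(key.to_vec(), row_id);
        Some(row_id)
    }
}

fn demo() -> (Option<RowID>, bool, Option<RowID>) {
    let mut idx = DualTreeUniqueIndex {
        mem: BTreeMap::new(),
        disk: BTreeMap::from([(b"k1".to_vec(), 42)]),
    };
    let hit = idx.lookup(b"k1"); // disk hit
    let cached = idx.mem.contains_key(b"k1".as_slice()); // now overlaid in mem
    let miss = idx.lookup(b"k2"); // absent everywhere
    (hit, cached, miss)
}

fn main() {
    assert_eq!(demo(), (Some(42), true, None));
    println!("ok");
}
```

The second lookup of `k1` would be served from `mem` without touching `disk`, which is the point of the overlay insert in the diagram.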
Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Codecov Report

❌ Patch coverage is … Additional details and impacted files:

```diff
@@ Coverage Diff @@
##             main     #558      +/-   ##
==========================================
+ Coverage   91.59%   91.67%   +0.08%
==========================================
  Files         100      101       +1
  Lines       53220    54203     +983
==========================================
+ Hits        48746    49691     +945
- Misses       4474     4512      +38
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 1
🧹 Nitpick comments (2)
doradb-storage/src/index/disk_tree.rs (1)
1520-1545: Keep `prefix_scan()` on the row-id-only path.

Line 1521 now routes the legacy `Vec<RowID>` API through `prefix_scan_entries()`, which clones every matching exact key before immediately discarding it. That adds avoidable allocation/copy cost to an existing lookup path; please share the filtering loop without materializing keys unless the caller actually asks for them.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@doradb-storage/src/index/disk_tree.rs` around lines 1520 - 1545, prefix_scan currently calls prefix_scan_entries and thus decodes and clones full exact keys only to discard them, causing extra allocations; change prefix_scan(&self, key: &[Val]) to perform the same loop as prefix_scan_entries but only collect RowID values: compute the prefix via self.tree.encoder.encode_prefix(...), iterate over self.tree.collect_entries().await?, check entry.key.starts_with(prefix.as_bytes()), call unpack_row_id_from_exact_key(&entry.key)? to get the RowID and push that into the Vec<RowID>, and return it—leave prefix_scan_entries unchanged for callers that need (Vec<u8>, RowID).

doradb-storage/src/index/composite_secondary_index.rs (1)
519-577: Reduce allocation churn in non-unique merge dedup tracking.

`push_row_once` currently allocates (`to_vec`) on every emitted key. For large scans, this can become avoidable overhead. Consider reusing one buffer for the last emitted key.

Optional perf-oriented refactor
```diff
 fn merge_non_unique_entries(
     mem_entries: &[NonUniqueMemTreeEntry],
     disk_entries: &[(Vec<u8>, RowID)],
     values: &mut Vec<RowID>,
 ) {
     let mut mem_idx = 0;
     let mut disk_idx = 0;
-    let mut last_emitted_key: Option<Vec<u8>> = None;
+    let mut last_emitted_key = Vec::new();
+    let mut has_last = false;
     while mem_idx < mem_entries.len() && disk_idx < disk_entries.len() {
         let mem = &mem_entries[mem_idx];
         let (disk_key, disk_row_id) = &disk_entries[disk_idx];
         match mem.encoded_key.as_slice().cmp(disk_key.as_slice()) {
             Ordering::Less => {
-                push_active_mem_entry(mem, values, &mut last_emitted_key);
+                push_active_mem_entry(mem, values, &mut last_emitted_key, &mut has_last);
                 mem_idx += 1;
             }
             Ordering::Equal => {
                 if !mem.deleted {
-                    push_row_once(&mem.encoded_key, mem.row_id, values, &mut last_emitted_key);
+                    push_row_once(
+                        &mem.encoded_key,
+                        mem.row_id,
+                        values,
+                        &mut last_emitted_key,
+                        &mut has_last,
+                    );
                 }
                 mem_idx += 1;
                 disk_idx += 1;
             }
             Ordering::Greater => {
-                push_row_once(disk_key, *disk_row_id, values, &mut last_emitted_key);
+                push_row_once(
+                    disk_key,
+                    *disk_row_id,
+                    values,
+                    &mut last_emitted_key,
+                    &mut has_last,
+                );
                 disk_idx += 1;
             }
         }
     }
     for mem in &mem_entries[mem_idx..] {
-        push_active_mem_entry(mem, values, &mut last_emitted_key);
+        push_active_mem_entry(mem, values, &mut last_emitted_key, &mut has_last);
     }
     for (disk_key, row_id) in &disk_entries[disk_idx..] {
-        push_row_once(disk_key, *row_id, values, &mut last_emitted_key);
+        push_row_once(disk_key, *row_id, values, &mut last_emitted_key, &mut has_last);
     }
 }

 fn push_active_mem_entry(
     entry: &NonUniqueMemTreeEntry,
     values: &mut Vec<RowID>,
-    last_emitted_key: &mut Option<Vec<u8>>,
+    last_emitted_key: &mut Vec<u8>,
+    has_last: &mut bool,
 ) {
     if !entry.deleted {
-        push_row_once(&entry.encoded_key, entry.row_id, values, last_emitted_key);
+        push_row_once(&entry.encoded_key, entry.row_id, values, last_emitted_key, has_last);
     }
 }

 fn push_row_once(
     encoded_key: &[u8],
     row_id: RowID,
     values: &mut Vec<RowID>,
-    last_emitted_key: &mut Option<Vec<u8>>,
+    last_emitted_key: &mut Vec<u8>,
+    has_last: &mut bool,
 ) {
-    if last_emitted_key.as_deref() != Some(encoded_key) {
+    if !*has_last || last_emitted_key.as_slice() != encoded_key {
         values.push(row_id);
-        *last_emitted_key = Some(encoded_key.to_vec());
+        last_emitted_key.clear();
+        last_emitted_key.extend_from_slice(encoded_key);
+        *has_last = true;
     }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@doradb-storage/src/index/composite_secondary_index.rs` around lines 519 - 577, The current push_row_once causes a new allocation for every emitted key by calling encoded_key.to_vec(); change push_row_once (and callers merge_non_unique_entries and push_active_mem_entry) to reuse the single Optional buffer in last_emitted_key instead of always allocating: compare encoded_key with last_emitted_key.as_deref() as now, but when updating last_emitted_key, if last_emitted_key.as_mut() yields Some(buf) then clear and extend_from_slice(encoded_key) (reusing capacity), otherwise set last_emitted_key to Some(encoded_key.to_vec()) only once; keep the same push semantics (values.push(row_id) only when key differs). This reduces allocation churn while keeping function signatures push_row_once(encoded_key, row_id, values, last_emitted_key) and push_active_mem_entry(entry, values, last_emitted_key) intact.
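For context on what the allocating variant under review actually computes, here is a self-contained sketch of the two-pointer merge with the per-push `to_vec()` dedup tracking the comment critiques. `MemEntry`, `push_once`, and the slice-based inputs are simplified stand-ins, not the real crate types:

```rust
use std::cmp::Ordering;

type RowID = u64;

// Hypothetical simplified MemTree entry: encoded key, row id, and a
// deletion marker that masks a matching disk entry.
struct MemEntry {
    encoded_key: Vec<u8>,
    row_id: RowID,
    deleted: bool,
}

// Dedup helper: emit a row only when the key differs from the last one
// emitted. This is the allocating variant the review suggests tuning.
fn push_once(key: &[u8], row: RowID, out: &mut Vec<RowID>, last: &mut Option<Vec<u8>>) {
    if last.as_deref() != Some(key) {
        out.push(row);
        *last = Some(key.to_vec());
    }
}

// Two-pointer merge of key-sorted mem/disk entries: on equal keys the
// mem entry wins, so a deleted mem entry suppresses the disk row.
fn merge_non_unique(mem: &[MemEntry], disk: &[(Vec<u8>, RowID)]) -> Vec<RowID> {
    let mut out = Vec::new();
    let mut last: Option<Vec<u8>> = None;
    let (mut m, mut d) = (0, 0);
    while m < mem.len() && d < disk.len() {
        match mem[m].encoded_key.as_slice().cmp(disk[d].0.as_slice()) {
            Ordering::Less => {
                if !mem[m].deleted {
                    push_once(&mem[m].encoded_key, mem[m].row_id, &mut out, &mut last);
                }
                m += 1;
            }
            Ordering::Equal => {
                if !mem[m].deleted {
                    push_once(&mem[m].encoded_key, mem[m].row_id, &mut out, &mut last);
                }
                m += 1;
                d += 1;
            }
            Ordering::Greater => {
                push_once(&disk[d].0, disk[d].1, &mut out, &mut last);
                d += 1;
            }
        }
    }
    for e in &mem[m..] {
        if !e.deleted {
            push_once(&e.encoded_key, e.row_id, &mut out, &mut last);
        }
    }
    for (k, r) in &disk[d..] {
        push_once(k, *r, &mut out, &mut last);
    }
    out
}

fn demo() -> Vec<RowID> {
    let mem = vec![
        MemEntry { encoded_key: b"a".to_vec(), row_id: 1, deleted: false },
        MemEntry { encoded_key: b"b".to_vec(), row_id: 2, deleted: true },
    ];
    let disk = vec![(b"b".to_vec(), 20), (b"c".to_vec(), 30)];
    merge_non_unique(&mem, &disk)
}

fn main() {
    assert_eq!(demo(), vec![1, 30]);
    println!("ok");
}
```

Note how the deleted mem entry for `b` suppresses the disk row with the same key, which is the overlay semantics the walkthrough describes.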
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@doradb-storage/src/index/non_unique_index.rs`:
- Around line 163-176: The scan currently swallows decode failures from
push_encoded_exact_entry (via CollectEncodedExactEntries) causing
lookup_encoded_entries to return partial results instead of failing; change the
callback (CollectEncodedExactEntries / push_encoded_exact_entry) to return a
Result and propagate any decode error (e.g. Error::InvalidState) up instead of
converting it to a stop signal, ensure prefix_scanner/scan_prefix returns that
error, and make lookup_encoded_entries propagate the scan_prefix Result (no
longer treating a stopped scan as success) so concrete decode failures surface
to the caller.
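The error-propagation shape this comment asks for can be sketched in miniature: a scan drives a fallible callback, and a decode failure aborts the scan and surfaces to the caller instead of becoming an early-stop signal. All names here (`scan_prefix`, `decode_entry`, `lookup_encoded_entries`) are simplified stand-ins for the real functions:

```rust
// Hypothetical minimal shapes for a fallible scan callback.
#[derive(Debug, PartialEq)]
enum Error {
    InvalidState,
}

// Treat anything that is not exactly 8 bytes as a decode failure.
fn decode_entry(raw: &[u8]) -> Result<u64, Error> {
    let bytes: [u8; 8] = raw.try_into().map_err(|_| Error::InvalidState)?;
    Ok(u64::from_be_bytes(bytes))
}

// The scan drives the callback; a callback error aborts the scan and
// propagates, rather than being swallowed as "stop scanning".
fn scan_prefix<F>(raw_entries: &[Vec<u8>], mut callback: F) -> Result<(), Error>
where
    F: FnMut(&[u8]) -> Result<(), Error>,
{
    for raw in raw_entries {
        callback(raw.as_slice())?;
    }
    Ok(())
}

// The caller sees either all decoded entries or a concrete error,
// never a silently partial result.
fn lookup_encoded_entries(raw_entries: &[Vec<u8>]) -> Result<Vec<u64>, Error> {
    let mut out = Vec::new();
    scan_prefix(raw_entries, |raw| {
        out.push(decode_entry(raw)?);
        Ok(())
    })?;
    Ok(out)
}

fn main() {
    let good = vec![1u64.to_be_bytes().to_vec(), 2u64.to_be_bytes().to_vec()];
    assert_eq!(lookup_encoded_entries(&good), Ok(vec![1, 2]));
    let bad = vec![vec![0u8; 3]];
    assert_eq!(lookup_encoded_entries(&bad), Err(Error::InvalidState));
    println!("ok");
}
```

The key design point is the callback's `Result` return type: with a plain `bool` stop signal, the distinction between "done" and "corrupt entry" is lost at the scan boundary.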
---
Nitpick comments:
In `@doradb-storage/src/index/composite_secondary_index.rs`:
- Around line 519-577: The current push_row_once causes a new allocation for
every emitted key by calling encoded_key.to_vec(); change push_row_once (and
callers merge_non_unique_entries and push_active_mem_entry) to reuse the single
Optional buffer in last_emitted_key instead of always allocating: compare
encoded_key with last_emitted_key.as_deref() as now, but when updating
last_emitted_key, if last_emitted_key.as_mut() yields Some(buf) then clear and
extend_from_slice(encoded_key) (reusing capacity), otherwise set
last_emitted_key to Some(encoded_key.to_vec()) only once; keep the same push
semantics (values.push(row_id) only when key differs). This reduces allocation
churn while keeping function signatures push_row_once(encoded_key, row_id,
values, last_emitted_key) and push_active_mem_entry(entry, values,
last_emitted_key) intact.
In `@doradb-storage/src/index/disk_tree.rs`:
- Around line 1520-1545: prefix_scan currently calls prefix_scan_entries and
thus decodes and clones full exact keys only to discard them, causing extra
allocations; change prefix_scan(&self, key: &[Val]) to perform the same loop as
prefix_scan_entries but only collect RowID values: compute the prefix via
self.tree.encoder.encode_prefix(...), iterate over
self.tree.collect_entries().await?, check
entry.key.starts_with(prefix.as_bytes()), call
unpack_row_id_from_exact_key(&entry.key)? to get the RowID and push that into
the Vec<RowID>, and return it—leave prefix_scan_entries unchanged for callers
that need (Vec<u8>, RowID).
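The row-id-only loop this prompt describes can be sketched without the async tree plumbing: filter raw keys by prefix and unpack the trailing RowID, never cloning the exact keys. `unpack_row_id` and the `Vec<Vec<u8>>` entry list are hypothetical simplifications of `unpack_row_id_from_exact_key` and the collected tree entries:

```rust
type RowID = u64;
const ROW_ID_SIZE: usize = 8;

// Hypothetical encoding, mirroring unpack_row_id_from_exact_key: an
// exact key is the encoded prefix followed by the big-endian RowID.
fn unpack_row_id(exact_key: &[u8]) -> Option<RowID> {
    let split = exact_key.len().checked_sub(ROW_ID_SIZE)?;
    let tail: [u8; ROW_ID_SIZE] = exact_key[split..].try_into().ok()?;
    Some(RowID::from_be_bytes(tail))
}

// Row-id-only scan: filter by prefix and unpack the trailing RowID
// without materializing or cloning the matching keys.
fn prefix_scan(entries: &[Vec<u8>], prefix: &[u8]) -> Vec<RowID> {
    entries
        .iter()
        .filter(|k| k.starts_with(prefix))
        .filter_map(|k| unpack_row_id(k))
        .collect()
}

fn demo() -> Vec<RowID> {
    // Build exact keys as prefix + big-endian row id.
    let mk = |p: &[u8], id: RowID| {
        let mut k = p.to_vec();
        k.extend_from_slice(&id.to_be_bytes());
        k
    };
    let entries = vec![mk(b"ab", 1), mk(b"ab", 2), mk(b"ac", 3)];
    prefix_scan(&entries, b"ab")
}

fn main() {
    assert_eq!(demo(), vec![1, 2]);
    println!("ok");
}
```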
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 543e40ad-f329-4eaa-a3cc-10443e230681
⛔ Files ignored due to path filters (4)
- `docs/rfcs/0014-dual-tree-secondary-index.md` is excluded by `!docs/**`, `!**/*.md`
- `docs/tasks/000119-composite-secondary-index-core.md` is excluded by `!docs/**`, `!**/*.md`
- `docs/tasks/next-id` is excluded by `!docs/**`
- `docs/unsafe-usage-baseline.md` is excluded by `!docs/**`, `!**/*.md`
📒 Files selected for processing (6)
- doradb-storage/src/index/btree_node.rs
- doradb-storage/src/index/composite_secondary_index.rs
- doradb-storage/src/index/disk_tree.rs
- doradb-storage/src/index/mod.rs
- doradb-storage/src/index/non_unique_index.rs
- doradb-storage/src/index/unique_index.rs
Actionable comments posted: 2
🧹 Nitpick comments (2)
doradb-storage/src/index/disk_tree.rs (1)
638-649: Let `open()` create and own its `PoolGuard`.

The runtime already owns the readonly pool, but callers still have to thread an external guard with the same lifetime. Making the snapshot self-contained here would remove an easy-to-misuse parameter and simplify downstream APIs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@doradb-storage/src/index/disk_tree.rs` around lines 638 - 649, open currently requires callers to pass a PoolGuard; instead have open() acquire and own the guard itself by calling self.disk_pool.pool_guard() so callers no longer thread a guard. Change open(&self, root_block_id: BlockID, disk_pool_guard: &PoolGuard) -> DiskTree to remove the external PoolGuard parameter and inside call self.disk_pool.pool_guard(), then pass that owned PoolGuard into DiskTree::from_root_snapshot(root_block_id, self, pool_guard). Update DiskTree::from_root_snapshot signature if needed to accept an owned PoolGuard (or adjust lifetimes) and remove usages that expect an external guard; you may also remove or keep disk_pool_guard() helper (disk_pool_guard) if still useful.

doradb-storage/src/index/composite_secondary_index.rs (1)
601-612: Consider reducing allocations in `push_row_once`.

The function clones `encoded_key` on every successful push (line 610). Since this is called in a tight merge loop, consider tracking the last emitted position/slice rather than cloning the full key each time.

♻️ Alternative using last index tracking
One approach is to track which source (mem or disk) and index was last emitted, then compare against the original slice directly instead of cloning:
```diff
-fn push_row_once(
-    encoded_key: &[u8],
-    row_id: RowID,
-    values: &mut Vec<RowID>,
-    last_emitted_key: &mut Option<Vec<u8>>,
-) {
-    if last_emitted_key.as_deref() != Some(encoded_key) {
-        values.push(row_id);
-        *last_emitted_key = Some(encoded_key.to_vec());
-    }
-}
+fn push_row_once<'a>(
+    encoded_key: &'a [u8],
+    row_id: RowID,
+    values: &mut Vec<RowID>,
+    last_emitted_key: &mut Option<&'a [u8]>,
+) {
+    if *last_emitted_key != Some(encoded_key) {
+        values.push(row_id);
+        *last_emitted_key = Some(encoded_key);
+    }
+}
```

This avoids per-push allocations by borrowing the slice directly.
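As a quick behavioral check, the borrowed-slice variant can be exercised in a standalone toy harness (types aliased locally; the entry list is illustrative, not the crate's data):

```rust
type RowID = u64;

// Borrowed-slice dedup: track the last emitted key as &[u8] tied to the
// input's lifetime, so no per-push allocation is needed.
fn push_row_once<'a>(
    encoded_key: &'a [u8],
    row_id: RowID,
    values: &mut Vec<RowID>,
    last_emitted_key: &mut Option<&'a [u8]>,
) {
    if *last_emitted_key != Some(encoded_key) {
        values.push(row_id);
        *last_emitted_key = Some(encoded_key);
    }
}

fn demo() -> Vec<RowID> {
    let entries: Vec<(Vec<u8>, RowID)> =
        vec![(b"a".to_vec(), 1), (b"a".to_vec(), 2), (b"b".to_vec(), 3)];
    let mut values = Vec::new();
    let mut last: Option<&[u8]> = None;
    for (key, row_id) in &entries {
        push_row_once(key, *row_id, &mut values, &mut last);
    }
    values // duplicate key "a" emits only its first row id
}

fn main() {
    assert_eq!(demo(), vec![1, 3]);
    println!("ok");
}
```

The lifetime parameter is what makes this safe: the tracked slice cannot outlive the entry list it borrows from, which is exactly the caveat the AI prompt raises about storing borrowed slices across iterations.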
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@doradb-storage/src/index/composite_secondary_index.rs` around lines 601 - 612, push_row_once currently clones encoded_key into last_emitted_key on each push, causing allocations in a hot merge loop; change the tracking to store a lightweight identifier instead of a Vec<u8> (for example replace last_emitted_key: &mut Option<Vec<u8>> with a small enum/tuple like Option<(SourceTag, usize)> or Option<(source_id, index)> that records which input source and index produced the last emitted key) and then compare that identifier to the current source+index before pushing into values (keep function name push_row_once, parameters encoded_key and row_id, and update callers that maintain last_emitted_key to provide and update the new lightweight identifier). Ensure no long-lived borrowed slices are stored across iterations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@doradb-storage/src/index/composite_secondary_index.rs`:
- Around line 452-474: In compare_delete, if mem.lookup_unique(...) returns None
you must still consult the DiskTree like the unique variant does: after the mem
check, perform a disk lookup for the key+row_id (or call the disk-layer
compare_delete equivalent) on self.disk using the same pool_guard, key, row_id
and ts; if the disk-layer lookup indicates the entry exists, invoke the disk
compare_delete behavior and return its Result<bool>, otherwise return Ok(true).
Update the compare_delete function to call self.disk (e.g.,
self.disk.lookup_non_unique or self.disk.compare_delete) when mem lookup is None
so non-unique indexes correctly verify cold entries.
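The mem-miss fallthrough this comment requires can be sketched with deliberately simplified semantics (maps keyed by `(key, row_id)`, a boolean tombstone flag, and a `compare_delete` that always succeeds; none of this mirrors the real signatures):

```rust
use std::collections::BTreeMap;

type RowID = u64;

// Hypothetical simplified dual-tree shape: the mem overlay maps
// (key, row_id) to a tombstone flag; `disk` stands in for the cold tree.
struct DualTree {
    mem: BTreeMap<(Vec<u8>, RowID), bool>, // true => tombstone
    disk: BTreeMap<(Vec<u8>, RowID), ()>,
}

impl DualTree {
    // A mem miss must still consult the disk layer, because the entry
    // may exist only in the cold tree; skipping that check would leave
    // the cold entry visible after the delete.
    fn compare_delete(&mut self, key: &[u8], row_id: RowID) -> bool {
        let k = (key.to_vec(), row_id);
        match self.mem.get(&k).copied() {
            Some(true) => true, // already tombstoned
            Some(false) => {
                self.mem.insert(k, true);
                true
            }
            None => {
                if self.disk.contains_key(&k) {
                    // Cold entry: mask it with a mem tombstone.
                    self.mem.insert(k, true);
                }
                true
            }
        }
    }
}

fn demo() -> bool {
    let mut t = DualTree { mem: BTreeMap::new(), disk: BTreeMap::new() };
    t.disk.insert((b"k".to_vec(), 9), ());
    t.compare_delete(b"k", 9);
    // The disk-only entry is now masked by a tombstone in the overlay.
    t.mem.get(&(b"k".to_vec(), 9)).copied().unwrap_or(false)
}

fn main() {
    assert!(demo());
    println!("ok");
}
```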
In `@doradb-storage/src/index/disk_tree.rs`:
- Around line 1557-1566: prefix_scan_entries currently calls collect_entries()
which materializes the entire tree and then filters, causing O(N) scans; change
it to a streamed prefix scan: compute the prefix bytes via
self.encoder().encode_prefix(key, Some(ROW_ID_SIZE)), obtain an async
iterator/stream that starts scanning at that prefix (replace collect_entries()
with the crate/engine method that yields entries from a start key), iterate
entries one-by-one, for each entry check
entry.key.starts_with(prefix.as_bytes()) and
unpack_row_id_from_exact_key(&entry.key) and push into results, and break the
loop as soon as an entry no longer matches the prefix to avoid scanning the rest
of the tree. Use prefix_scan_entries, encoder().encode_prefix,
unpack_row_id_from_exact_key and remove the collect_entries() materialization.
---
Nitpick comments:
In `@doradb-storage/src/index/composite_secondary_index.rs`:
- Around line 601-612: push_row_once currently clones encoded_key into
last_emitted_key on each push, causing allocations in a hot merge loop; change
the tracking to store a lightweight identifier instead of a Vec<u8> (for example
replace last_emitted_key: &mut Option<Vec<u8>> with a small enum/tuple like
Option<(SourceTag, usize)> or Option<(source_id, index)> that records which
input source and index produced the last emitted key) and then compare that
identifier to the current source+index before pushing into values (keep function
name push_row_once, parameters encoded_key and row_id, and update callers that
maintain last_emitted_key to provide and update the new lightweight identifier).
Ensure no long-lived borrowed slices are stored across iterations.
In `@doradb-storage/src/index/disk_tree.rs`:
- Around line 638-649: open currently requires callers to pass a PoolGuard;
instead have open() acquire and own the guard itself by calling
self.disk_pool.pool_guard() so callers no longer thread a guard. Change
open(&self, root_block_id: BlockID, disk_pool_guard: &PoolGuard) -> DiskTree to
remove the external PoolGuard parameter and inside call
self.disk_pool.pool_guard(), then pass that owned PoolGuard into
DiskTree::from_root_snapshot(root_block_id, self, pool_guard). Update
DiskTree::from_root_snapshot signature if needed to accept an owned PoolGuard
(or adjust lifetimes) and remove usages that expect an external guard; you may
also remove or keep disk_pool_guard() helper (disk_pool_guard) if still useful.
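The ownership change this prompt describes amounts to the snapshot holding its own guard. A minimal sketch with `Arc`-based stand-ins for the pool types (only the shape matters; the real guard presumably does more than keep the pool alive):

```rust
use std::sync::Arc;

// Hypothetical stand-ins for the pool types: the point is only that the
// snapshot owns its guard instead of borrowing one from the caller.
struct Pool;

struct PoolGuard {
    _pool: Arc<Pool>, // keeps the pool alive while the snapshot is open
}

struct Runtime {
    disk_pool: Arc<Pool>,
}

struct DiskTreeSnapshot {
    root_block_id: u64,
    _guard: PoolGuard, // owned, so there is no external lifetime to thread
}

impl Runtime {
    fn pool_guard(&self) -> PoolGuard {
        PoolGuard { _pool: Arc::clone(&self.disk_pool) }
    }

    // open() acquires its own guard; callers just pass the root block id.
    fn open(&self, root_block_id: u64) -> DiskTreeSnapshot {
        DiskTreeSnapshot { root_block_id, _guard: self.pool_guard() }
    }
}

fn demo() -> u64 {
    let rt = Runtime { disk_pool: Arc::new(Pool) };
    rt.open(7).root_block_id
}

fn main() {
    assert_eq!(demo(), 7);
    println!("ok");
}
```

Because the guard is a struct field rather than a borrowed parameter, the snapshot can be returned, stored, or sent across call boundaries without the caller managing a matching guard lifetime.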
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 26d41c2e-0490-40f0-a8bf-4d27e84cdc6b
⛔ Files ignored due to path filters (2)
- `docs/rfcs/0014-dual-tree-secondary-index.md` is excluded by `!docs/**`, `!**/*.md`
- `docs/tasks/000119-composite-secondary-index-core.md` is excluded by `!docs/**`, `!**/*.md`
📒 Files selected for processing (11)
- doradb-storage/src/buffer/mod.rs
- doradb-storage/src/buffer/readonly.rs
- doradb-storage/src/file/cow_file.rs
- doradb-storage/src/file/mod.rs
- doradb-storage/src/file/multi_table_file.rs
- doradb-storage/src/file/table_file.rs
- doradb-storage/src/index/composite_secondary_index.rs
- doradb-storage/src/index/disk_tree.rs
- doradb-storage/src/table/mod.rs
- doradb-storage/src/table/persistence.rs
- doradb-storage/src/table/tests.rs
✅ Files skipped from review due to trivial changes (1)
- doradb-storage/src/buffer/readonly.rs
Closes #557
Summary by CodeRabbit
New Features
Bug Fixes
Tests