feat(index): support configurable multi-segment FM-Index builds#7123
Conversation
|
Important This PR touches the Lance format specification. Substantive changes to the format specification — the If this is a meaningful format change:
|
|
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
5822f75 to
40bec4d
Compare
jackye1995
left a comment
There was a problem hiding this comment.
Found a correctness issue in segmented FMIndex search and a few behavioral divergences from the existing segmented-index pattern. Details inline.
| ), | ||
| }; | ||
| let match_count = row_addrs.true_rows().row_addrs().unwrap().count(); | ||
| assert!( |
There was a problem hiding this comment.
This assertion is too weak and appears to hide a correctness bug in segmented FMIndex search. FMIndex training currently does not request _rowid or _rowaddr, so collect_texts falls back to a local counter starting at 0 for each segment, and search returns those ids as row addresses. With this fixture, "quick" should match 5 rows, but segmented search can collapse local ids across segments and still satisfy >= 3.
Can we fix FMIndex to store global row addresses / stable row ids and make this test assert the exact expected rows or count?
There was a problem hiding this comment.
Good catch! I fixed this by:
- Updating
FMIndexPluginto request global row addresses (.with_row_addr()) during training. - Updating
collect_textsto parse the global_rowaddr(falling back to_rowidif needed), rather than falling back to local counter starting at 0. - Adding a new
search_row_addrsmethod onFMIndexto return the full 64-bit row address during logical searches, avoiding any 32-bit offset truncation. - Tightening the test assertion to assert exactly
5matching rows.
| } | ||
|
|
||
| self.dataset | ||
| .commit_existing_index_segments(&index_name, &column, segment_metadatas) |
There was a problem hiding this comment.
This commit path does not preserve the normal replace=true behavior. The standard execute path removes all existing same-name indices when replace is set, but commit_existing_index_segments only removes overlapping same-type segments. Replacing an existing same-name BTree/bitmap/etc. index with segmented FMIndex can leave mixed metadata under one index name, and empty/untrained FMIndex coverage can leave old same-type segments too.
Can this path either build the transaction with the same removed_indices as the normal path, or extend the segment commit API with explicit replace semantics?
There was a problem hiding this comment.
Fixed. I replaced the commit_existing_index_segments calls with explicit TransactionBuilder + apply_commit, collecting all same-name indices into removed_indices when replace=true. This matches the standard execute path behavior and ensures replacing a BTree/bitmap index with a segmented FMIndex properly cleans up all old metadata.
|
|
||
| /// Build FM-Index with multiple segments, each covering a subset of fragments. | ||
| async fn execute_multi_segment_fmindex(&mut self, num_segments: u32) -> Result<IndexMetadata> { | ||
| let column_input = &self.columns[0]; |
There was a problem hiding this comment.
This bypasses the standard execute_uncommitted validation that requires exactly one column. For num_segments > 1, an empty column list will panic here, and a multi-column request silently indexes only the first column instead of returning the existing error.
Can we run the same self.columns.len() != 1 validation before entering the segmented FMIndex path?
There was a problem hiding this comment.
Added. execute_multi_segment_fmindex now runs the same self.columns.len() != 1 check at entry, matching execute_uncommitted validation.
|
|
||
| let mut fragment_bitmap = RoaringBitmap::new(); | ||
| for segment in &segments { | ||
| fragment_bitmap |= segment.fragment_bitmap.as_ref().cloned().ok_or_else(|| { |
There was a problem hiding this comment.
This merge path should mirror the btree merge behavior and intersect stored segment coverage with the dataset's current live fragments. As written, it unions historical fragment ids and passes them to build_scalar_index; after compaction retires one of those fragments, the rebuild can fail with No fragment with id ....
Can we use effective live coverage here, and add a compaction/retired-fragment merge test similar to the btree coverage?
There was a problem hiding this comment.
Fixed. merge_segments now intersects the unioned fragment bitmap with dataset.fragment_bitmap to drop retired/compacted fragments before rebuilding, mirroring the btree merge behavior. When the intersection is empty (all fragments retired), it produces an empty index instead of crashing. Added test_fmindex_merge_after_compaction_drops_retired_fragments to cover this scenario.
… FM-Index builds - Fix local row address bug in collect_texts by requesting _rowaddr and returning u64 row addresses in search_row_addrs - Rebuild segmented index commit path in create.rs using TransactionBuilder to properly handle replace=true with same-name index removal - Run single-column validation at the entry of execute_multi_segment_fmindex - Intersect segment fragment coverage with live dataset fragments in merge_segments to prevent failures after compaction retires fragments - Add test_fmindex_merge_after_compaction_drops_retired_fragments integration test Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
| let value_col = batch.column(0); | ||
| // Prefer _rowaddr (global row address) over _rowid to ensure stable, | ||
| // globally unique identifiers across segments. | ||
| let row_addrs: &arrow_array::UInt64Array = batch |
There was a problem hiding this comment.
[P1] This new hard requirement for _rowaddr also needs to be reflected in FMIndex update/optimize paths. FMIndexScalarIndex::update_criteria() still returns TrainingCriteria::new(TrainingOrdering::None), so optimize_indices() / append maintenance can build a training stream with only the value column and fail here. Also, FMIndexScalarIndex::update() currently writes a fresh index from new_data only and ignores the existing index plus old_data_filter; if we only add .with_row_addr(), the generic single-segment scalar update path can replace the old FMIndex with one containing only appended rows. Can we either make FMIndex update merge existing rows correctly, or force FMIndex maintenance to rebuild from the full target fragment bitmap instead of taking the single-segment update() shortcut?
There was a problem hiding this comment.
Good catch. Fixed both issues:
update_criteria()now returnsrequires_old_datawith.with_row_addr(), so the training stream includes all existing + new rows with global row addresses.update()now applies theold_data_filterto exclude deleted/compacted rows before rebuilding the BWT, so the single-segment update path produces a complete index covering both old and new data instead of silently dropping existing rows.
| let texts = collect_texts(new_data).await?; | ||
| let mut texts = collect_texts(new_data).await?; | ||
| if let Some(filter) = old_data_filter { | ||
| texts.retain(|(row_addr, _)| match &filter { |
There was a problem hiding this comment.
[P1] This filter drops the appended rows in the optimize/append path. Because update_criteria() now uses requires_old_data, load_unindexed_training_data() passes fragments=None and scans all current rows, not just old rows. But the old_data_filter here only describes the selected old segment coverage, so retaining only to_keep / valid removes the unindexed appended fragments from texts before rebuilding. I think FMIndex either needs to rebuild from the intended union fragment set without this old-only filter, or the update path needs a filter that keeps both selected old rows and new/unindexed rows while excluding deleted/retired old rows.
There was a problem hiding this comment.
Good catch. The filter now excludes to_remove fragments instead of keeping only to_keep, so both old valid rows and new unindexed/appended rows are retained in the rebuilt index.
Add num_segments parameter to FM-Index creation that distributes dataset fragments across multiple independent index segments. Each segment is a complete FM-Index covering a disjoint subset of fragments, enabling incremental indexing and segment merge support. - Add num_segments field to FMIndexIndexDetails proto - Add multi-segment build path in CreateIndexBuilder that splits fragments into groups, builds one segment per group, and commits atomically via commit_existing_index_segments - Add segment_has_fmindex_details predicate and FM-Index branch in merge_existing_index_segments dispatch (merge = rebuild from source) - Add dataset-level fmindex::merge_segments function - Add FMINDEX type mapping and num_segments kwarg in Python bindings Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
Segment topology is managed by the manifest (fragment_bitmap on IndexMetadata), not by index-type-specific protos. num_segments flows through ScalarIndexParams JSON at build time only. Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
- Keep FMIndexIndexDetails empty (segment topology is managed by manifest) - Collapse nested if-let in dataset.rs to satisfy clippy collapsible_if Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
- test_fmindex_segments_commit_and_query_as_logical_index: builds one segment per fragment, commits, queries across segments - test_fmindex_segments_merge_and_query: builds two segments, merges into one, verifies query results on merged index - Fix segment_has_fmindex_details predicate (FMIndexIndexDetails not FmIndexIndexDetails) Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
…-segment path - Enforce existing-index name collision check (replace=false now errors) - Honor train=false by building an empty index instead of scanning - Handle empty datasets (zero fragments) without panicking Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
… FM-Index builds - Fix local row address bug in collect_texts by requesting _rowaddr and returning u64 row addresses in search_row_addrs - Rebuild segmented index commit path in create.rs using TransactionBuilder to properly handle replace=true with same-name index removal - Run single-column validation at the entry of execute_multi_segment_fmindex - Intersect segment fragment coverage with live dataset fragments in merge_segments to prevent failures after compaction retires fragments - Add test_fmindex_merge_after_compaction_drops_retired_fragments integration test Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
…ew data - Change update_criteria() to requires_old_data with .with_row_addr() so optimize_indices/append maintenance streams include both existing and new rows with global row addresses - Apply old_data_filter in update() to exclude deleted/compacted rows before rebuilding the BWT, preventing the single-segment update shortcut from silently dropping existing indexed rows Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
…FMIndex update The old_data_filter.to_keep only covers old segment fragments, so filtering by it dropped all new/appended rows. Use to_remove exclusion instead so both old valid rows and new unindexed rows are retained. Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
16f611d to
eba411d
Compare
Summary
num_segmentsparameter to FM-Index creation that distributes dataset fragments across multiple independent index segmentscommit_existing_index_segments,merge_existing_index_segments)FMINDEXtype mapping andnum_segmentskwarg in Python bindingsUsage
Test plan
cargo check -p lance -p lance-index)cargo fmtpasses🤖 Generated with Claude Code