feat(index): support configurable multi-segment FM-Index builds #7123
beinan wants to merge 7 commits into
Important: This PR touches the Lance format specification.
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
Add num_segments parameter to FM-Index creation that distributes dataset fragments across multiple independent index segments. Each segment is a complete FM-Index covering a disjoint subset of fragments, enabling incremental indexing and segment merge support.

- Add num_segments field to FMIndexIndexDetails proto
- Add multi-segment build path in CreateIndexBuilder that splits fragments into groups, builds one segment per group, and commits atomically via commit_existing_index_segments
- Add segment_has_fmindex_details predicate and FM-Index branch in merge_existing_index_segments dispatch (merge = rebuild from source)
- Add dataset-level fmindex::merge_segments function
- Add FMINDEX type mapping and num_segments kwarg in Python bindings

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
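The fragment split described in this commit can be sketched roughly as follows. This is only an illustration: the commit message does not say which grouping policy CreateIndexBuilder uses, so the contiguous, near-equal split below and the helper name `split_fragments` are assumptions, not the PR's code.

```rust
/// Hypothetical helper (not from the PR): split fragment ids into at most
/// `num_segments` contiguous, disjoint groups of near-equal size.
fn split_fragments(fragment_ids: &[u32], num_segments: usize) -> Vec<Vec<u32>> {
    // Nothing to index, or no segments requested: return no groups.
    if fragment_ids.is_empty() || num_segments == 0 {
        return Vec::new();
    }
    // Never create more groups than there are fragments.
    let groups = num_segments.min(fragment_ids.len());
    let base = fragment_ids.len() / groups;
    let extra = fragment_ids.len() % groups;

    let mut out = Vec::with_capacity(groups);
    let mut start = 0;
    for i in 0..groups {
        // The first `extra` groups take one additional fragment.
        let len = base + usize::from(i < extra);
        out.push(fragment_ids[start..start + len].to_vec());
        start += len;
    }
    out
}
```

Each returned group would then back one independent segment. Because the groups are disjoint and together cover every input fragment, no row is indexed twice.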
Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
Segment topology is managed by the manifest (fragment_bitmap on IndexMetadata), not by index-type-specific protos. num_segments flows through ScalarIndexParams JSON at build time only.

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
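To illustrate what it means for the manifest to own segment topology, here is a minimal sketch that treats each segment's fragment_bitmap as a plain set of fragment ids. The `SegmentMeta` type and both helpers are hypothetical stand-ins, and `BTreeSet<u32>` is used in place of whatever bitmap type the manifest actually stores.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for one segment's manifest entry.
struct SegmentMeta {
    /// Fragment ids this segment covers (stand-in for fragment_bitmap).
    fragment_bitmap: BTreeSet<u32>,
}

/// Returns true if no fragment id appears in more than one segment.
fn segments_are_disjoint(segments: &[SegmentMeta]) -> bool {
    let mut seen = BTreeSet::new();
    segments
        .iter()
        .flat_map(|s| s.fragment_bitmap.iter())
        // `insert` returns false when the id was already present.
        .all(|id| seen.insert(*id))
}

/// Fragment ids of the dataset that no segment covers yet.
fn uncovered_fragments(dataset: &BTreeSet<u32>, segments: &[SegmentMeta]) -> BTreeSet<u32> {
    let mut rest = dataset.clone();
    for s in segments {
        for id in &s.fragment_bitmap {
            rest.remove(id);
        }
    }
    rest
}
```

Under this reading, the number of segments is simply the number of manifest entries for the index, which is consistent with num_segments not needing to be persisted in the index details.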
- Keep FMIndexIndexDetails empty (segment topology is managed by manifest)
- Collapse nested if-let in dataset.rs to satisfy clippy collapsible_if

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
5822f75 to 40bec4d
- test_fmindex_segments_commit_and_query_as_logical_index: builds one segment per fragment, commits, queries across segments
- test_fmindex_segments_merge_and_query: builds two segments, merges into one, verifies query results on merged index
- Fix segment_has_fmindex_details predicate (FMIndexIndexDetails not FmIndexIndexDetails)

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
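The property these two tests check can be sketched with a toy model: querying several segments as one logical index should return the same rows as querying the merged segment. Everything below is hypothetical. `ToySegment` scans its rows with `str::contains` as a stand-in for a real FM-Index lookup, and none of these names come from the Lance codebase.

```rust
/// Toy stand-in for one index segment: the (row id, text) pairs it covers.
/// A real FM-Index answers substring queries without scanning; this one scans.
struct ToySegment {
    rows: Vec<(u64, String)>,
}

impl ToySegment {
    /// Row ids whose text contains `pattern`.
    fn search(&self, pattern: &str) -> Vec<u64> {
        self.rows
            .iter()
            .filter(|(_, text)| text.contains(pattern))
            .map(|(id, _)| *id)
            .collect()
    }
}

/// Query every segment and combine the hits, as a logical index would.
fn search_logical(segments: &[ToySegment], pattern: &str) -> Vec<u64> {
    let mut hits: Vec<u64> = segments.iter().flat_map(|s| s.search(pattern)).collect();
    hits.sort_unstable();
    hits
}

/// Merge by rebuilding one segment from all source rows.
fn merge(segments: Vec<ToySegment>) -> ToySegment {
    ToySegment {
        rows: segments.into_iter().flat_map(|s| s.rows).collect(),
    }
}
```

Since segments cover disjoint fragments, the combined result needs no deduplication in this model. Sorting only makes the output order deterministic.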
…-segment path

- Enforce existing-index name collision check (replace=false now errors)
- Honor train=false by building an empty index instead of scanning
- Handle empty datasets (zero fragments) without panicking

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
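The three guards in this commit can be pictured as a small planning step that runs before any segment is built. The sketch below is an assumption-heavy illustration: `BuildPlan`, `plan_build`, the error text, the round-robin grouping, and the choice to treat a zero-fragment dataset as an empty build are all hypothetical and not taken from the PR.

```rust
/// Hypothetical outcome of planning a multi-segment build.
#[derive(Debug, PartialEq)]
enum BuildPlan {
    /// Build an empty index without scanning any data.
    Empty,
    /// Build one segment per group of fragment ids.
    Segments(Vec<Vec<u32>>),
}

/// Hypothetical planner applying the three guards in order.
fn plan_build(
    existing_names: &[&str],
    name: &str,
    replace: bool,
    train: bool,
    fragment_ids: &[u32],
    num_segments: usize,
) -> Result<BuildPlan, String> {
    // Guard 1: a name collision is an error unless replace is requested.
    if !replace && existing_names.contains(&name) {
        return Err(format!("index '{}' already exists", name));
    }
    // Guards 2 and 3: train=false or zero fragments means nothing to scan.
    if !train || fragment_ids.is_empty() {
        return Ok(BuildPlan::Empty);
    }
    // At least one group, never more groups than fragments.
    let groups = num_segments.max(1).min(fragment_ids.len());
    let mut out = vec![Vec::new(); groups];
    for (i, id) in fragment_ids.iter().enumerate() {
        out[i % groups].push(*id); // round-robin assignment
    }
    Ok(BuildPlan::Segments(out))
}
```

Checking for an empty fragment list before computing the group count matters in this sketch: with zero fragments `groups` would be 0, and the modulo in the loop setup would otherwise be a division by zero.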
Summary
- Add num_segments parameter to FM-Index creation that distributes dataset fragments across multiple independent index segments
- … (commit_existing_index_segments, merge_existing_index_segments)
- Add FMINDEX type mapping and num_segments kwarg in Python bindings

Usage
Test plan
- … (cargo check -p lance -p lance-index)
- cargo fmt passes

🤖 Generated with Claude Code