feat(index): support configurable multi-segment FM-Index builds by beinan · Pull Request #7123 · lance-format/lance

beinan · 2026-06-05T06:46:26Z

Summary

Add num_segments parameter to FM-Index creation that distributes dataset fragments across multiple independent index segments
Each segment is a complete FM-Index covering a disjoint set of fragments, enabling incremental indexing and segment merge
Wire FM-Index into Lance's segmented index infrastructure (commit_existing_index_segments, merge_existing_index_segments)
Add FMINDEX type mapping and num_segments kwarg in Python bindings

Usage

# Single segment (default, backward compatible)
ds.create_index("text_col", index_type="FMIndex")

# Multi-segment: fragments distributed across 4 segments
ds.create_index("text_col", index_type="FMIndex", num_segments=4)

Test plan

Existing 20 FM-Index unit tests pass (verified locally)
Build compiles clean (cargo check -p lance -p lance-index)
cargo fmt passes
CI passes
Test multi-segment build with real dataset

🤖 Generated with Claude Code

github-actions · 2026-06-05T06:46:34Z

Important

This PR touches the Lance format specification.

Substantive changes to the format specification — the .proto definitions
and the spec docs under docs/src/format/ — require a PMC vote before merge.
Minor edits such as typo fixes, wording, or formatting are excluded; use your
judgment.

If this is a meaningful format change:

Start a vote following the Lance community voting process.
Format specification modifications need 3 binding +1 votes (excluding the
proposer), held on GitHub Discussions, with a minimum voting period of 1 week.
Once the vote passes, link the completed vote in this PR. It should not be
merged until the vote is linked.

github-actions · 2026-06-05T06:46:44Z

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

codecov · 2026-06-05T08:24:34Z

Codecov Report

❌ Patch coverage is 66.08443% with 233 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance/src/index/create.rs	2.95%	162 Missing and 2 partials ⚠️
rust/lance/src/index/scalar/fmindex.rs	51.35%	32 Missing and 4 partials ⚠️
rust/lance-index/src/scalar/fmindex.rs	57.62%	21 Missing and 4 partials ⚠️
rust/lance/src/index/scalar_logical.rs	98.17%	5 Missing ⚠️
rust/lance-index/src/lib.rs	85.71%	1 Missing ⚠️
rust/lance/src/index.rs	85.71%	1 Missing ⚠️
rust/lance/src/index/append.rs	98.88%	0 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

jackye1995

Found a correctness issue in segmented FMIndex search and a few behavioral divergences from the existing segmented-index pattern. Details inline.

jackye1995 · 2026-06-09T03:53:27Z

+            ),
+        };
+        let match_count = row_addrs.true_rows().row_addrs().unwrap().count();
+        assert!(


This assertion is too weak and appears to hide a correctness bug in segmented FMIndex search. FMIndex training currently does not request _rowid or _rowaddr, so collect_texts falls back to a local counter starting at 0 for each segment, and search returns those ids as row addresses. With this fixture, "quick" should match 5 rows, but segmented search can collapse local ids across segments and still satisfy >= 3.

Can we fix FMIndex to store global row addresses / stable row ids and make this test assert the exact expected rows or count?

Good catch! I fixed this by:

Updating FMIndexPlugin to request global row addresses (.with_row_addr()) during training.

Updating collect_texts to parse the global _rowaddr (falling back to _rowid if needed), rather than falling back to local counter starting at 0.

Adding a new search_row_addrs method on FMIndex to return the full 64-bit row address during logical searches, avoiding any 32-bit offset truncation.

Tightening the test assertion to assert exactly 5 matching rows.

jackye1995 · 2026-06-09T03:53:27Z

+        }
+
+        self.dataset
+            .commit_existing_index_segments(&index_name, &column, segment_metadatas)


This commit path does not preserve the normal replace=true behavior. The standard execute path removes all existing same-name indices when replace is set, but commit_existing_index_segments only removes overlapping same-type segments. Replacing an existing same-name BTree/bitmap/etc. index with segmented FMIndex can leave mixed metadata under one index name, and empty/untrained FMIndex coverage can leave old same-type segments too.

Can this path either build the transaction with the same removed_indices as the normal path, or extend the segment commit API with explicit replace semantics?

Fixed. I replaced the commit_existing_index_segments calls with explicit TransactionBuilder + apply_commit, collecting all same-name indices into removed_indices when replace=true. This matches the standard execute path behavior and ensures replacing a BTree/bitmap index with a segmented FMIndex properly cleans up all old metadata.

jackye1995 · 2026-06-09T03:53:27Z

+
+    /// Build FM-Index with multiple segments, each covering a subset of fragments.
+    async fn execute_multi_segment_fmindex(&mut self, num_segments: u32) -> Result<IndexMetadata> {
+        let column_input = &self.columns[0];


This bypasses the standard execute_uncommitted validation that requires exactly one column. For num_segments > 1, an empty column list will panic here, and a multi-column request silently indexes only the first column instead of returning the existing error.

Can we run the same self.columns.len() != 1 validation before entering the segmented FMIndex path?

Added. execute_multi_segment_fmindex now runs the same self.columns.len() != 1 check at entry, matching execute_uncommitted validation.

jackye1995 · 2026-06-09T03:53:27Z

+
+    let mut fragment_bitmap = RoaringBitmap::new();
+    for segment in &segments {
+        fragment_bitmap |= segment.fragment_bitmap.as_ref().cloned().ok_or_else(|| {


This merge path should mirror the btree merge behavior and intersect stored segment coverage with the dataset's current live fragments. As written, it unions historical fragment ids and passes them to build_scalar_index; after compaction retires one of those fragments, the rebuild can fail with No fragment with id ....

Can we use effective live coverage here, and add a compaction/retired-fragment merge test similar to the btree coverage?

Fixed. merge_segments now intersects the unioned fragment bitmap with dataset.fragment_bitmap to drop retired/compacted fragments before rebuilding, mirroring the btree merge behavior. When the intersection is empty (all fragments retired), it produces an empty index instead of crashing. Added test_fmindex_merge_after_compaction_drops_retired_fragments to cover this scenario.

… FM-Index builds - Fix local row address bug in collect_texts by requesting _rowaddr and returning u64 row addresses in search_row_addrs - Rebuild segmented index commit path in create.rs using TransactionBuilder to properly handle replace=true with same-name index removal - Run single-column validation at the entry of execute_multi_segment_fmindex - Intersect segment fragment coverage with live dataset fragments in merge_segments to prevent failures after compaction retires fragments - Add test_fmindex_merge_after_compaction_drops_retired_fragments integration test Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>

jackye1995 · 2026-06-09T17:19:59Z

-        let value_col = batch.column(0);
+        // Prefer _rowaddr (global row address) over _rowid to ensure stable,
+        // globally unique identifiers across segments.
+        let row_addrs: &arrow_array::UInt64Array = batch


[P1] This new hard requirement for _rowaddr also needs to be reflected in FMIndex update/optimize paths. FMIndexScalarIndex::update_criteria() still returns TrainingCriteria::new(TrainingOrdering::None), so optimize_indices() / append maintenance can build a training stream with only the value column and fail here. Also, FMIndexScalarIndex::update() currently writes a fresh index from new_data only and ignores the existing index plus old_data_filter; if we only add .with_row_addr(), the generic single-segment scalar update path can replace the old FMIndex with one containing only appended rows. Can we either make FMIndex update merge existing rows correctly, or force FMIndex maintenance to rebuild from the full target fragment bitmap instead of taking the single-segment update() shortcut?

Good catch. Fixed both issues:

update_criteria() now returns requires_old_data with .with_row_addr(), so the training stream includes all existing + new rows with global row addresses.

update() now applies the old_data_filter to exclude deleted/compacted rows before rebuilding the BWT, so the single-segment update path produces a complete index covering both old and new data instead of silently dropping existing rows.

jackye1995 · 2026-06-09T18:35:09Z

-        let texts = collect_texts(new_data).await?;
+        let mut texts = collect_texts(new_data).await?;
+        if let Some(filter) = old_data_filter {
+            texts.retain(|(row_addr, _)| match &filter {


[P1] This filter drops the appended rows in the optimize/append path. Because update_criteria() now uses requires_old_data, load_unindexed_training_data() passes fragments=None and scans all current rows, not just old rows. But the old_data_filter here only describes the selected old segment coverage, so retaining only to_keep / valid removes the unindexed appended fragments from texts before rebuilding. I think FMIndex either needs to rebuild from the intended union fragment set without this old-only filter, or the update path needs a filter that keeps both selected old rows and new/unindexed rows while excluding deleted/retired old rows.

Good catch. The filter now excludes to_remove fragments instead of keeping only to_keep, so both old valid rows and new unindexed/appended rows are retained in the rebuilt index.

Add num_segments parameter to FM-Index creation that distributes dataset fragments across multiple independent index segments. Each segment is a complete FM-Index covering a disjoint subset of fragments, enabling incremental indexing and segment merge support. - Add num_segments field to FMIndexIndexDetails proto - Add multi-segment build path in CreateIndexBuilder that splits fragments into groups, builds one segment per group, and commits atomically via commit_existing_index_segments - Add segment_has_fmindex_details predicate and FM-Index branch in merge_existing_index_segments dispatch (merge = rebuild from source) - Add dataset-level fmindex::merge_segments function - Add FMINDEX type mapping and num_segments kwarg in Python bindings Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>

Segment topology is managed by the manifest (fragment_bitmap on IndexMetadata), not by index-type-specific protos. num_segments flows through ScalarIndexParams JSON at build time only. Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>

- Keep FMIndexIndexDetails empty (segment topology is managed by manifest) - Collapse nested if-let in dataset.rs to satisfy clippy collapsible_if Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>

- test_fmindex_segments_commit_and_query_as_logical_index: builds one segment per fragment, commits, queries across segments - test_fmindex_segments_merge_and_query: builds two segments, merges into one, verifies query results on merged index - Fix segment_has_fmindex_details predicate (FMIndexIndexDetails not FmIndexIndexDetails) Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>

…-segment path - Enforce existing-index name collision check (replace=false now errors) - Honor train=false by building an empty index instead of scanning - Handle empty datasets (zero fragments) without panicking Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>

… FM-Index builds - Fix local row address bug in collect_texts by requesting _rowaddr and returning u64 row addresses in search_row_addrs - Rebuild segmented index commit path in create.rs using TransactionBuilder to properly handle replace=true with same-name index removal - Run single-column validation at the entry of execute_multi_segment_fmindex - Intersect segment fragment coverage with live dataset fragments in merge_segments to prevent failures after compaction retires fragments - Add test_fmindex_merge_after_compaction_drops_retired_fragments integration test Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>

…ew data - Change update_criteria() to requires_old_data with .with_row_addr() so optimize_indices/append maintenance streams include both existing and new rows with global row addresses - Apply old_data_filter in update() to exclude deleted/compacted rows before rebuilding the BWT, preventing the single-segment update shortcut from silently dropping existing indexed rows Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>

…FMIndex update The old_data_filter.to_keep only covers old segment fragments, so filtering by it dropped all new/appended rows. Use to_remove exclusion instead so both old valid rows and new unindexed rows are retained. Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>

github-actions Bot added A-python Python bindings A-index Vector index, linalg, tokenizer A-format On-disk format: protos and format spec docs and removed A-python Python bindings A-index Vector index, linalg, tokenizer A-format On-disk format: protos and format spec docs labels Jun 5, 2026

beinan changed the title ~~Support configurable multi-segment FM-Index builds~~ feat(index): support configurable multi-segment FM-Index builds Jun 5, 2026

github-actions Bot added enhancement New feature or request A-python Python bindings A-index Vector index, linalg, tokenizer A-format On-disk format: protos and format spec docs labels Jun 5, 2026

beinan marked this pull request as ready for review June 5, 2026 07:41

claude Bot reviewed Jun 5, 2026

View reviewed changes

beinan force-pushed the beinan/fmindex-segmented-index branch from 5822f75 to 40bec4d Compare June 5, 2026 19:46

jackye1995 requested changes Jun 9, 2026

View reviewed changes

jackye1995 reviewed Jun 9, 2026

View reviewed changes

github-actions Bot added the A-java Java bindings + JNI label Jun 9, 2026

Beinan Wang added 8 commits June 9, 2026 17:22

fix: collapse nested if-let to satisfy clippy collapsible_if lint

eed8827

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>

style: match CI rustfmt formatting for let-chain expression

c072d60

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>

fix: revert proto change and collapse nested if in Python bindings

bf5a510

- Keep FMIndexIndexDetails empty (segment topology is managed by manifest) - Collapse nested if-let in dataset.rs to satisfy clippy collapsible_if Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>

Beinan Wang and others added 4 commits June 9, 2026 17:23

fix: rebuild fmindex segments needing old data

decd0d0

fix(index): rename FM index type to Fm

eba411d

jackye1995 force-pushed the beinan/fmindex-segmented-index branch from 16f611d to eba411d Compare June 10, 2026 00:43

jackye1995 approved these changes Jun 10, 2026

View reviewed changes

jackye1995 merged commit 70f5695 into lance-format:main Jun 10, 2026
31 checks passed

Conversation

beinan commented Jun 5, 2026

Summary

Usage

Test plan

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

codecov Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jackye1995 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Jun 5, 2026 •

edited

Loading