
feat(index): support configurable multi-segment FM-Index builds #7123

Open: beinan wants to merge 7 commits into lance-format:main from beinan:beinan/fmindex-segmented-index


@beinan beinan commented Jun 5, 2026

Summary

  • Add num_segments parameter to FM-Index creation that distributes dataset fragments across multiple independent index segments
  • Each segment is a complete FM-Index covering a disjoint set of fragments, enabling incremental indexing and segment merge
  • Wire FM-Index into Lance's segmented index infrastructure (commit_existing_index_segments, merge_existing_index_segments)
  • Add FMINDEX type mapping and num_segments kwarg in Python bindings
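The fragment distribution described above can be pictured with a small sketch. The PR does not state the grouping policy, so the contiguous, near-equal split below is an assumption, and `split_fragments` is a hypothetical helper rather than a function from the Lance codebase:

```python
def split_fragments(fragment_ids, num_segments):
    """Hypothetical helper: split fragment IDs into disjoint contiguous groups.

    Returns at most `num_segments` groups whose sizes differ by at most one.
    The real grouping policy in the PR may differ; this is only a sketch.
    """
    if num_segments < 1:
        raise ValueError("num_segments must be >= 1")
    ids = list(fragment_ids)
    if not ids:
        return []  # nothing to index
    n = min(num_segments, len(ids))  # never create empty groups
    base, extra = divmod(len(ids), n)
    groups, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        groups.append(ids[start:start + size])
        start += size
    return groups


groups = split_fragments(range(10), 4)
```

Every fragment lands in exactly one group, which is the disjointness property the summary relies on for incremental indexing and merging.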

Usage

# Single segment (default, backward compatible)
ds.create_index("text_col", index_type="FMIndex")

# Multi-segment: fragments distributed across 4 segments
ds.create_index("text_col", index_type="FMIndex", num_segments=4)

Test plan

  • Existing 20 FM-Index unit tests pass (verified locally)
  • Build compiles clean (cargo check -p lance -p lance-index)
  • cargo fmt passes
  • CI passes
  • Test multi-segment build with real dataset

🤖 Generated with Claude Code

github-actions Bot commented Jun 5, 2026

Important

This PR touches the Lance format specification.

Substantive changes to the format specification — the .proto definitions
and the spec docs under docs/src/format/ — require a PMC vote before merge.
Minor edits such as typo fixes, wording, or formatting are excluded; use your
judgment.

If this is a meaningful format change:

  • Start a vote following the Lance community voting process.
    Format specification modifications need 3 binding +1 votes (excluding the
    proposer), held on GitHub Discussions, with a minimum voting period of 1 week.
  • Once the vote passes, link the completed vote in this PR. It should not be
    merged until the vote is linked.

@github-actions github-actions Bot added A-python (Python bindings), A-index (Vector index, linalg, tokenizer), A-format (On-disk format: protos and format spec docs) and removed the same three labels Jun 5, 2026

github-actions Bot commented Jun 5, 2026

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@beinan beinan changed the title from "Support configurable multi-segment FM-Index builds" to "feat(index): support configurable multi-segment FM-Index builds" Jun 5, 2026
@github-actions github-actions Bot added the labels enhancement (New feature or request), A-python (Python bindings), A-index (Vector index, linalg, tokenizer), A-format (On-disk format: protos and format spec docs) Jun 5, 2026
@beinan beinan marked this pull request as ready for review June 5, 2026 07:41

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

codecov Bot commented Jun 5, 2026

Codecov Report

❌ Patch coverage is 60.35503% with 134 lines in your changes missing coverage. Please review.

Files with missing lines                  Patch %   Lines
rust/lance/src/index/create.rs            3.20%     121 Missing ⚠️
rust/lance/src/index/scalar/fmindex.rs    77.77%    6 Missing and 4 partials ⚠️
rust/lance/src/index/scalar_logical.rs    98.14%    3 Missing ⚠️


Beinan Wang added 5 commits June 5, 2026 12:45
Add num_segments parameter to FM-Index creation that distributes dataset
fragments across multiple independent index segments. Each segment is a
complete FM-Index covering a disjoint subset of fragments, enabling
incremental indexing and segment merge support.

- Add num_segments field to FMIndexIndexDetails proto
- Add multi-segment build path in CreateIndexBuilder that splits
  fragments into groups, builds one segment per group, and commits
  atomically via commit_existing_index_segments
- Add segment_has_fmindex_details predicate and FM-Index branch in
  merge_existing_index_segments dispatch (merge = rebuild from source)
- Add dataset-level fmindex::merge_segments function
- Add FMINDEX type mapping and num_segments kwarg in Python bindings

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
Segment topology is managed by the manifest (fragment_bitmap on
IndexMetadata), not by index-type-specific protos. num_segments
flows through ScalarIndexParams JSON at build time only.

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
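To illustrate the point that segment topology lives in the manifest rather than in an index-specific proto, here is a minimal sketch that models each segment's `fragment_bitmap` as a plain Python set of fragment IDs. The function names and the dictionary layout are hypothetical and only stand in for the manifest; they are not Lance APIs:

```python
def validate_topology(segments):
    """Check that no fragment is covered by two segments.

    `segments` maps a (hypothetical) segment name to the set of fragment IDs
    it covers. Returns a fragment -> segment lookup on success.
    """
    owner = {}
    for name, fragments in segments.items():
        for frag in fragments:
            if frag in owner:
                raise ValueError(
                    f"fragment {frag} is covered by both {owner[frag]!r} and {name!r}"
                )
            owner[frag] = name
    return owner


def unindexed_fragments(dataset_fragments, segments):
    """Fragments present in the dataset but not covered by any segment."""
    covered = set().union(*segments.values())
    return sorted(set(dataset_fragments) - covered)


segments = {"seg0": {0, 1}, "seg1": {2}}
owner = validate_topology(segments)
todo = unindexed_fragments([0, 1, 2, 3, 4], segments)
```

In this model, incremental indexing amounts to building a new segment over `todo` and recording its fragment set alongside the existing ones, without touching any per-index details.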
- Keep FMIndexIndexDetails empty (segment topology is managed by manifest)
- Collapse nested if-let in dataset.rs to satisfy clippy collapsible_if

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
@beinan beinan force-pushed the beinan/fmindex-segmented-index branch from 5822f75 to 40bec4d Compare June 5, 2026 19:46
Beinan Wang added 2 commits June 5, 2026 14:04
- test_fmindex_segments_commit_and_query_as_logical_index: builds one
  segment per fragment, commits, queries across segments
- test_fmindex_segments_merge_and_query: builds two segments, merges
  into one, verifies query results on merged index
- Fix segment_has_fmindex_details predicate (FMIndexIndexDetails
  not FmIndexIndexDetails)

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
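The property these two tests exercise (querying several segments as one logical index, then merging by rebuilding from source) can be sketched with a toy FM-Index. This is not the Lance implementation: `ToyFMIndex`, its naive suffix-array build, and the separator characters are assumptions chosen to keep the example short, and patterns are assumed not to contain the separator or sentinel.

```python
from collections import Counter

SEP, END = "\x01", "\x00"  # document separator and end-of-text sentinel


class ToyFMIndex:
    """Tiny FM-Index over a list of documents (slow build, illustration only)."""

    def __init__(self, docs):
        text = SEP.join(docs) + END
        # Naive suffix array: sort suffix start positions by suffix text.
        sa = sorted(range(len(text)), key=lambda i: text[i:])
        # Burrows-Wheeler transform: the character preceding each sorted suffix.
        bwt = [text[i - 1] for i in sa]
        self.n = len(bwt)
        # first[c] = number of characters in the text strictly smaller than c.
        self.first, total = {}, 0
        for c, k in sorted(Counter(text).items()):
            self.first[c] = total
            total += k
        # occ[c][i] = occurrences of c in bwt[:i].
        self.occ = {c: [0] * (self.n + 1) for c in self.first}
        for i, ch in enumerate(bwt):
            for c, row in self.occ.items():
                row[i + 1] = row[i] + (c == ch)

    def count(self, pattern):
        """Count occurrences of a non-empty pattern via backward search."""
        lo, hi = 0, self.n
        for ch in reversed(pattern):
            if ch not in self.first:
                return 0
            lo = self.first[ch] + self.occ[ch][lo]
            hi = self.first[ch] + self.occ[ch][hi]
            if lo >= hi:
                return 0
        return hi - lo


def count_across_segments(segments, pattern):
    """Query every segment and combine the results, as a logical index would."""
    return sum(seg.count(pattern) for seg in segments)


# Two "fragments" of source text, one segment per fragment.
fragments = [["banana", "bandana"], ["cabana"]]
segments = [ToyFMIndex(docs) for docs in fragments]

# Merge = rebuild a single index from the source rows of all fragments.
merged = ToyFMIndex([doc for docs in fragments for doc in docs])
```

Because the segments cover disjoint documents, the summed per-segment counts match the count on the rebuilt index, which is the equivalence the merge test checks on real data.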
…-segment path

- Enforce existing-index name collision check (replace=false now errors)
- Honor train=false by building an empty index instead of scanning
- Handle empty datasets (zero fragments) without panicking

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
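The three guards listed in this commit can be summarised as a small decision function. This is a hypothetical sketch: the function name, its return values, and the choice to treat an empty dataset like `train=False` are assumptions, since the commit message only says that empty datasets no longer panic.

```python
def decide_build_action(existing_names, name, num_fragments, replace=False, train=True):
    """Hypothetical sketch of the guard order on the multi-segment path."""
    # 1. Name collision: refuse to overwrite unless replace=True.
    if name in existing_names and not replace:
        raise ValueError(f"index {name!r} already exists; pass replace=True to overwrite")
    # 2. train=False: build an empty index and skip the data scan.
    if not train:
        return "empty"
    # 3. Empty dataset: nothing to split, so return early (assumed behaviour).
    if num_fragments == 0:
        return "empty"
    return "build"
```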