feat(index): support configurable multi-segment FM-Index builds #7123
Open
beinan wants to merge 11 commits into lance-format:main from beinan:beinan/fmindex-segmented-index
- eac88a9 feat(index): support configurable multi-segment FM-Index builds
- 585ea4b fix: collapse nested if-let to satisfy clippy collapsible_if lint
- 7c4ece3 style: match CI rustfmt formatting for let-chain expression
- b3c98bd refactor: revert proto change, keep FMIndexIndexDetails empty
- 40bec4d fix: revert proto change and collapse nested if in Python bindings
- 6887808 test: add segmented FM-Index tests for commit, query, and merge
- c900564 fix: add replace validation and train/empty-dataset handling to multi…
- bc2e471 fix(index): address PR #7123 review comments on segmented FM-Index bu…
- b52000b fix(index): make FMIndex update path request _rowaddr and merge old+n…
- d7cd89e fix(index): exclude retired fragments instead of keeping only old in …
- 923a63e fix: rebuild fmindex segments needing old data
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,64 @@ | ||
# FM-Index (Full-text / Substring / Regex Search)

The FM-Index (Ferragina-Manzini Index) is a compressed substring index based on the Burrows-Wheeler Transform (BWT). Unlike traditional inverted indexes (used for full-text search), which index distinct words, the FM-Index enables efficient **arbitrary substring search**, **prefix match**, and **suffix/regular-expression search** directly on raw bytes.

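To make the BWT concrete, here is a minimal Python sketch that computes the transform by sorting all rotations of the input. It is a naive illustration only and not Lance's construction algorithm; the function name `bwt` is hypothetical, and the use of `\x00` as the terminator follows the sentinel convention described for the index.

```python
def bwt(text: bytes, terminator: bytes = b"\x00") -> bytes:
    """Naive Burrows-Wheeler Transform via sorted rotations (illustration only)."""
    # The terminator must be unique and smaller than every other byte.
    assert terminator not in text
    s = text + terminator
    # Sort every rotation of the terminated text lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last byte of each sorted rotation.
    return bytes(rot[-1] for rot in rotations)


if __name__ == "__main__":
    print(bwt(b"banana"))
```

For `b"banana"` the sketch returns `b"annb\x00aa"`. Equal bytes tend to cluster in the output, which is what makes the transform both compressible and searchable.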
In Lance, the FM-Index is designed to scale to datasets with millions of documents. It is partitioned using Lance's **Segmented Index** architecture to support incremental appends, disjoint fragment tracking, and segment merging.

## High-Level Architecture

The FM-Index indexes raw text by treating columns of strings or binary payloads as raw byte arrays.

```
+----------------------------------------+
|             Lance Dataset              |
|  (Disjoint groups of Fragments 0..N)   |
+----------------------------------------+
                    |
    Divide fragments into num_segments
                    |
                    v
+----------------------------------------+
|            Segmented Index             |
| +-----------+ +-----------+ +-------+  |
| | Segment 1 | | Segment 2 | |  ...  |  |
| | (FM-Idx)  | | (FM-Idx)  | |       |  |
| +-----------+ +-----------+ +-------+  |
+----------------------------------------+
```

Each segment holds a self-contained physical FM-Index that maps byte sub-sequences to Lance global row IDs.

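One way to picture the mapping from byte positions to row IDs is a concatenated text plus a sorted table of row start offsets. The sketch below is an assumption about layout made for illustration, not the actual on-disk format: the helper names `build_segment_text` and `offset_to_row_id` are hypothetical, and the `\xFF` separator and `\x00` terminator are taken from the sentinel bytes this document reserves.

```python
from bisect import bisect_right

SEPARATOR = b"\xff"   # row separator byte (assumed layout)
TERMINATOR = b"\x00"  # BWT terminator byte (assumed layout)


def build_segment_text(rows):
    """Concatenate (row_id, value) pairs and record where each row starts."""
    starts, row_ids, parts = [], [], []
    offset = 0
    for row_id, value in rows:
        starts.append(offset)
        row_ids.append(row_id)
        parts.append(value + SEPARATOR)
        offset += len(value) + len(SEPARATOR)
    return b"".join(parts) + TERMINATOR, starts, row_ids


def offset_to_row_id(offset, starts, row_ids):
    """Map a byte offset in the concatenated text to the row that contains it."""
    # The owning row is the last one whose start offset is <= the given offset.
    return row_ids[bisect_right(starts, offset) - 1]


if __name__ == "__main__":
    text, starts, row_ids = build_segment_text([(100, b"abc"), (205, b"de")])
    print(text, starts, offset_to_row_id(5, starts, row_ids))
```

Because the start offsets are sorted, each lookup is a binary search, so mapping a match position back to a row stays cheap even for large segments.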
## Data Normalization & Sanitization

The FM-Index is **normalization-independent by design** because it operates entirely on raw bytes.

### Byte Sanitization vs. Text Normalization

1. **Byte Sanitization (Core Index Layer)**:
   The physical FM-Index uses specific sentinel bytes internally to mark boundaries:
   - `\x00` is reserved as the global Burrows-Wheeler Transform (BWT) terminator character.
   - `\xFF` is reserved as the document/row separator character.

   To avoid breaking the indexing structures, any incoming occurrences of `\x00` or `\xFF` are sanitized by remapping them to space (`\x20`) characters at index-build time. No other bytes are changed in this layer.

2. **Text Normalization (User/Application Layer)**:
   Because the index faithfully maps raw bytes, any semantic normalization (such as case folding `Hello` -> `hello`, Unicode NFKC normalization, stemming, or whitespace collapsing) is fully decoupled from the core index engine:
   - To build a case-insensitive search index, users apply a lowercase transform to the column *prior* to indexing.
   - When querying, the user's query text must undergo the exact same normalization pipeline.

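The two layers above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the function names `sanitize`, `normalize`, and `prepare` are hypothetical, and the NFKC-plus-lowercase pipeline is just one example of an application-side normalization, not something the index applies for you.

```python
import unicodedata

# Core-layer rule from the text: remap the two sentinel bytes to a space.
_SANITIZE_TABLE = bytes.maketrans(b"\x00\xff", b"  ")


def sanitize(data: bytes) -> bytes:
    """Replace reserved sentinel bytes (0x00, 0xFF) with 0x20; touch nothing else."""
    return data.translate(_SANITIZE_TABLE)


def normalize(text: str) -> str:
    """Example application-layer pipeline: Unicode NFKC, then lowercase."""
    return unicodedata.normalize("NFKC", text).lower()


def prepare(text: str) -> bytes:
    """Apply the same steps to indexed values and to query strings."""
    return sanitize(normalize(text).encode("utf-8"))


if __name__ == "__main__":
    print(sanitize(b"a\x00b\xffc"))
    print(prepare("Hello"))
```

Routing both the column values and the query text through the same `prepare` function is a simple way to guarantee that the two sides never drift apart.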
## Configurable Segment Partitioning

BWT-based indexes cannot be merged or appended to by simple concatenation: the suffix array and BWT must be rebuilt by re-reading the underlying text. To balance build cost and search performance, Lance allows configuring how fragments map to index segments.

- **`num_segments` parameter**: Configured at index-creation time. If `num_segments` is specified (e.g. `num_segments = 4`), Lance splits the target dataset fragments into disjoint subsets and builds an independent FM-Index segment over each subset.
- **Unindexed Appends**: When new fragments are appended to the dataset, a subsequent `create_index` call that covers the unindexed fragments builds a new, separate segment over only those fragments and leaves existing segments intact.
- **Segment Merging**: Multiple existing index segments can be merged into a single segment under Lance's `merge_segments` protocol. Lance unions the fragment coverage bitmaps of the selected segments, re-reads the raw text from the covered fragments, and constructs a fresh unified FM-Index.

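The fragment bookkeeping behind these three operations can be sketched with plain Python sets. This is a hypothetical model, not Lance's implementation: the function names are invented for illustration, fragment coverage is modeled as lists of fragment IDs rather than bitmaps, and the balanced, contiguous split is an assumption about how fragments might be divided.

```python
def partition_fragments(fragment_ids, num_segments):
    """Split fragment IDs into at most num_segments disjoint, balanced groups."""
    if num_segments < 1:
        raise ValueError("num_segments must be at least 1")
    ids = sorted(fragment_ids)
    n = min(num_segments, len(ids))
    if n == 0:
        return []
    base, extra = divmod(len(ids), n)
    groups, start = [], 0
    for i in range(n):
        # The first `extra` groups take one additional fragment.
        size = base + (1 if i < extra else 0)
        groups.append(ids[start:start + size])
        start += size
    return groups


def unindexed_fragments(all_fragment_ids, segments):
    """Fragments not covered by any existing segment (candidates for a new segment)."""
    covered = set().union(*segments) if segments else set()
    return sorted(set(all_fragment_ids) - covered)


def merged_coverage(selected_segments):
    """Union of fragment coverage for segments being merged into one."""
    return sorted(set().union(*selected_segments)) if selected_segments else []


if __name__ == "__main__":
    segments = partition_fragments(range(10), 4)
    print(segments)
    print(unindexed_fragments(range(12), segments))
    print(merged_coverage(segments[:2]))
```

In this model an append only ever adds a new group, and a merge only changes which group a fragment belongs to, so the groups remain disjoint throughout.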
## Query Evaluation

When a substring query is submitted (e.g., `CONTAINS(column, "query_string")`):

1. The search string is sanitized (remapping any `\x00` or `\xFF` to spaces) and, if the indexed data was normalized before indexing, passed through the same normalization.
2. The query is dispatched across all active segments in the logical index in parallel.
3. Each segment performs a BWT backward-search to locate occurrences of the pattern.
4. Matching offsets are mapped back to absolute dataset row IDs.
5. Results from all segments are unioned to produce the final selection.
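The steps above can be sketched end to end with a toy FM-Index. This is a minimal illustration under stated assumptions, not the Lance implementation: the `Segment` class and `contains` function are hypothetical, the suffix array is built by naive sorting, rank queries are counted directly instead of using sampled tables, and segments are searched sequentially rather than in parallel.

```python
from bisect import bisect_right

SEP, TERM = 0xFF, 0x00
_SANITIZE_TABLE = bytes.maketrans(b"\x00\xff", b"  ")


class Segment:
    """Toy FM-Index over one group of rows (naive construction)."""

    def __init__(self, rows):
        # rows: list of (row_id, bytes); values are assumed already sanitized.
        self.row_ids = [row_id for row_id, _ in rows]
        self.starts, buf = [], bytearray()
        for _, value in rows:
            self.starts.append(len(buf))
            buf += value
            buf.append(SEP)
        buf.append(TERM)
        text = bytes(buf)
        # Suffix array by direct sorting of all suffixes.
        self.sa = sorted(range(len(text)), key=lambda i: text[i:])
        # BWT: the byte preceding each sorted suffix (wraps around at offset 0).
        self.bwt = bytes(text[i - 1] for i in self.sa)
        # c[b] = number of bytes in the text strictly smaller than b.
        counts = [0] * 256
        for b in text:
            counts[b] += 1
        self.c, total = [0] * 256, 0
        for b in range(256):
            self.c[b] = total
            total += counts[b]

    def _occ(self, byte, i):
        # Occurrences of `byte` in bwt[:i].
        return self.bwt.count(byte, 0, i)

    def search(self, pattern):
        """Backward search; returns the set of row IDs containing `pattern`."""
        if not pattern:
            return set(self.row_ids)
        lo, hi = 0, len(self.bwt)
        for byte in reversed(pattern):
            lo = self.c[byte] + self._occ(byte, lo)
            hi = self.c[byte] + self._occ(byte, hi)
            if lo >= hi:
                return set()
        # Map each matching text offset back to its owning row.
        return {
            self.row_ids[bisect_right(self.starts, self.sa[k]) - 1]
            for k in range(lo, hi)
        }


def contains(segments, query: bytes):
    """Sanitize the query, search every segment, and union the results."""
    pattern = query.translate(_SANITIZE_TABLE)
    result = set()
    for segment in segments:  # sequential here; the text describes parallel dispatch
        result |= segment.search(pattern)
    return result


if __name__ == "__main__":
    seg1 = Segment([(0, b"banana"), (1, b"bandana")])
    seg2 = Segment([(7, b"cabana"), (8, b"apple")])
    print(sorted(contains([seg1, seg2], b"ana")))
```

Because a sanitized query can never contain the separator byte, a match cannot span two rows, so every reported offset belongs to exactly one row.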
[P1] This new hard requirement for `_rowaddr` also needs to be reflected in the FMIndex update/optimize paths. `FMIndexScalarIndex::update_criteria()` still returns `TrainingCriteria::new(TrainingOrdering::None)`, so `optimize_indices()` / append maintenance can build a training stream with only the value column and fail here. Also, `FMIndexScalarIndex::update()` currently writes a fresh index from `new_data` only and ignores the existing index plus `old_data_filter`; if we only add `.with_row_addr()`, the generic single-segment scalar update path can replace the old FMIndex with one containing only appended rows. Can we either make FMIndex update merge existing rows correctly, or force FMIndex maintenance to rebuild from the full target fragment bitmap instead of taking the single-segment `update()` shortcut?
Good catch. Fixed both issues:
- `update_criteria()` now returns `requires_old_data` with `.with_row_addr()`, so the training stream includes all existing + new rows with global row addresses.
- `update()` now applies the `old_data_filter` to exclude deleted/compacted rows before rebuilding the BWT, so the single-segment update path produces a complete index covering both old and new data instead of silently dropping existing rows.