Merged
64 changes: 64 additions & 0 deletions docs/src/format/index/scalar/fmindex.md
@@ -0,0 +1,64 @@
# FM-Index (Full-text / Substring / Regex Search)

The FM-Index (Ferragina-Manzini index) is a compressed substring index built on the Burrows-Wheeler Transform (BWT). Unlike a traditional inverted index (Full-Text Search), which indexes distinct words, the FM-Index supports efficient **arbitrary substring search**, **prefix match**, and **suffix/regular-expression search** directly on raw bytes.

In Lance, the FM-Index is partitioned using Lance's **Segmented Index** architecture so that it can scale to datasets with millions of documents. The partitioning supports incremental appends, disjoint fragment tracking, and segment merging.

## High-Level Architecture

The FM-Index indexes raw text by treating columns of strings or binary payloads as raw byte arrays.

```
+----------------------------------------+
|             Lance Dataset              |
|  (Disjoint groups of Fragments 0..N)   |
+----------------------------------------+
                    |
    Divide fragments into num_segments
                    |
                    v
+----------------------------------------+
|            Segmented Index             |
| +-----------+ +-----------+ +-------+  |
| | Segment 1 | | Segment 2 | |  ...  |  |
| | (FM-Idx)  | | (FM-Idx)  | |       |  |
| +-----------+ +-----------+ +-------+  |
+----------------------------------------+
```

Each segment holds a self-contained physical FM-Index that maps byte subsequences to Lance global row IDs.

## Data Normalization & Sanitization

The FM-Index is **normalization-independent by design** because it operates entirely on raw bytes.

### Byte Sanitization vs. Text Normalization

1. **Byte Sanitization (Core Index Layer)**:
   The physical FM-Index reserves two sentinel bytes to mark boundaries:
   - `\x00` is reserved as the global Burrows-Wheeler Transform (BWT) terminator.
   - `\xFF` is reserved as the document/row separator.

   To keep these structures intact, every occurrence of `\x00` or `\xFF` in the input is remapped to a space (`\x20`) at index-build time. No other bytes are changed in this layer.

2. **Text Normalization (User/Application Layer)**:
   Because the index stores the sanitized bytes as-is, any semantic normalization (such as case folding `Hello` -> `hello`, Unicode NFKC normalization, stemming, or whitespace collapsing) is fully decoupled from the core index engine:
   - To build a case-insensitive search index, apply a lowercase transform to the column *prior* to indexing.
   - At query time, the query text must go through the exact same normalization pipeline.
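
The two layers can be sketched as follows. This is a minimal illustration rather than Lance's implementation: the function names `sanitize`, `normalize`, and `prepare` are hypothetical, and lowercase folding merely stands in for whatever normalization an application chooses.

```rust
/// Core-layer sanitization: remap the two reserved sentinel bytes
/// (0x00 and 0xFF) to an ASCII space and leave every other byte untouched.
fn sanitize(bytes: &[u8]) -> Vec<u8> {
    bytes
        .iter()
        .map(|&b| if b == 0x00 || b == 0xFF { b' ' } else { b })
        .collect()
}

/// Application-layer normalization chosen for this sketch: lowercase folding.
/// (Hypothetical; the index itself never applies this.)
fn normalize(text: &str) -> String {
    text.to_lowercase()
}

/// Run a column value (at build time) or a query string (at search time)
/// through the same pipeline, so that both sides see identical bytes.
fn prepare(text: &str) -> Vec<u8> {
    sanitize(normalize(text).as_bytes())
}
```

Calling `prepare` on both the indexed values and the query text keeps the two byte streams comparable, so `prepare("Hello")` and `prepare("hello")` yield the same bytes.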

## Configurable Segment Partitioning

BWT-based indexes cannot be merged or extended by simple concatenation: the suffix array and BWT must be rebuilt from the source text. To balance build cost against search performance, Lance lets you configure how fragments map to index segments.

- **`num_segments` parameter**: Configured at index-creation time. If `num_segments` is specified (e.g. `num_segments = 4`), Lance splits the target dataset fragments into disjoint subsets and builds an independent FM-Index segment over each subset.
- **Unindexed Appends**: When new fragments are appended to the dataset, a subsequent `create_index` call that covers only the unindexed fragments builds a new, separate segment for those fragments and leaves existing segments intact.
- **Segment Merging**: Multiple existing index segments can be merged into a single segment under Lance's `merge_segments` protocol. Lance unions the fragment coverage bitmaps of the selected segments, re-reads the raw text from those covered fragments, and constructs a fresh unified FM-Index.
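
The fragment bookkeeping described above can be sketched as below. The contiguous chunking strategy and the names `partition_fragments` and `merged_coverage` are assumptions made for illustration; this document does not specify how Lance assigns fragments to segments, and real coverage is tracked with bitmaps rather than a `BTreeSet`.

```rust
use std::collections::BTreeSet;

/// Split fragment IDs into at most `num_segments` disjoint, contiguous groups.
/// (Hypothetical strategy; every fragment lands in exactly one group.)
fn partition_fragments(fragment_ids: &[u32], num_segments: usize) -> Vec<Vec<u32>> {
    if fragment_ids.is_empty() || num_segments == 0 {
        return Vec::new();
    }
    // Ceiling division gives the number of fragments per group.
    let chunk_len = (fragment_ids.len() + num_segments - 1) / num_segments;
    fragment_ids.chunks(chunk_len).map(|c| c.to_vec()).collect()
}

/// Union the fragment coverage of the segments selected for merging.
/// The merged segment would then be rebuilt from the text of these fragments.
fn merged_coverage(segments: &[Vec<u32>]) -> BTreeSet<u32> {
    segments.iter().flatten().copied().collect()
}
```

With ten fragments and `num_segments = 4`, this sketch produces groups of sizes 3, 3, 3, and 1. Merging the first two groups yields coverage of fragments 0 through 5, which is the set of fragments that would be re-read to rebuild the unified segment.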

## Query Evaluation

When a substring query is submitted (e.g., `CONTAINS(column, "query_string")`):
1. The search string is sanitized (remapping any `\x00` or `\xFF` to spaces). If the column was normalized before indexing, the caller applies the same normalization to the query.
2. The query is dispatched across all active segments in the logical index in parallel.
3. Each segment performs a BWT backward-search to locate occurrences of the pattern.
4. Matching offsets are mapped back to absolute dataset Row IDs.
5. Results from all segments are unioned to produce the final selection.
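
Steps 3 and 4 can be illustrated with a toy, uncompressed FM-Index for a single segment. This is a sketch under stated assumptions, not Lance's implementation: `ToyFmIndex` is a hypothetical type, the suffix array is built by naive sorting, occurrence counts use a linear scan instead of a rank structure, row indices stand in for Lance row IDs, and both documents and patterns are assumed to be already sanitized.

```rust
/// Toy, uncompressed FM-Index over a handful of documents (illustrative only).
struct ToyFmIndex {
    sa: Vec<usize>,         // suffix array of the concatenated text
    bwt: Vec<u8>,           // Burrows-Wheeler Transform of the text
    c: [usize; 256],        // c[b] = number of bytes in the text smaller than b
    doc_starts: Vec<usize>, // start offset of each document in the text
}

impl ToyFmIndex {
    /// Build from documents, treated as raw bytes and assumed to be sanitized.
    fn build(docs: &[&str]) -> Self {
        let mut text = Vec::new();
        let mut doc_starts = Vec::new();
        for doc in docs {
            doc_starts.push(text.len());
            text.extend_from_slice(doc.as_bytes());
            text.push(0xFF); // document/row separator
        }
        text.push(0x00); // global BWT terminator, the smallest byte
        let n = text.len();
        // Naive suffix sort, good enough for a toy example.
        let mut sa: Vec<usize> = (0..n).collect();
        sa.sort_by(|&a, &b| text[a..].cmp(&text[b..]));
        // bwt[i] is the byte preceding suffix sa[i], wrapping around at 0.
        let bwt: Vec<u8> = sa.iter().map(|&i| text[(i + n - 1) % n]).collect();
        let mut counts = [0usize; 256];
        for &b in &text {
            counts[b as usize] += 1;
        }
        let mut c = [0usize; 256];
        let mut total = 0;
        for b in 0..256 {
            c[b] = total;
            total += counts[b];
        }
        Self { sa, bwt, c, doc_starts }
    }

    /// Number of occurrences of `byte` in bwt[..end] (linear scan for clarity).
    fn occ(&self, byte: u8, end: usize) -> usize {
        self.bwt[..end].iter().filter(|&&b| b == byte).count()
    }

    /// Backward search: sorted, deduplicated row indices containing `pattern`.
    fn search(&self, pattern: &[u8]) -> Vec<usize> {
        if pattern.is_empty() {
            return Vec::new();
        }
        // Half-open range [lo, hi) of suffix-array slots matching the suffix
        // of the pattern processed so far.
        let (mut lo, mut hi) = (0, self.bwt.len());
        for &byte in pattern.iter().rev() {
            lo = self.c[byte as usize] + self.occ(byte, lo);
            hi = self.c[byte as usize] + self.occ(byte, hi);
            if lo >= hi {
                return Vec::new();
            }
        }
        // Map each matching text offset to the document (row) that contains it.
        let mut rows: Vec<usize> = self.sa[lo..hi]
            .iter()
            .map(|&pos| self.doc_starts.partition_point(|&s| s <= pos) - 1)
            .collect();
        rows.sort_unstable();
        rows.dedup();
        rows
    }
}
```

For an index built with `ToyFmIndex::build(&["banana", "bandana", "cabana"])`, `search(b"and")` returns only row 1, while `search(b"ana")` returns all three rows. Because the separator byte never appears in a sanitized pattern, a match cannot span two documents. In a segmented index, the same search would run per segment and the per-segment results would be unioned.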
2 changes: 1 addition & 1 deletion java/lance-jni/src/blocking_dataset.rs
@@ -981,7 +981,7 @@ fn inner_create_index<'local>(
| IndexType::NGram
| IndexType::ZoneMap
| IndexType::BloomFilter
| IndexType::FMIndex
| IndexType::Fm
| IndexType::RTree => {
// For scalar indices, create a scalar IndexParams
let (index_type_str, params_opt) = get_scalar_index_params(env, params_jobj)?;
22 changes: 22 additions & 0 deletions python/src/dataset.rs
@@ -2159,6 +2159,7 @@ impl Dataset {
"LABEL_LIST" => IndexType::LabelList,
"RTREE" => IndexType::RTree,
"INVERTED" | "FTS" => IndexType::Inverted,
"FM" => IndexType::Fm,
"IVF_FLAT" | "IVF_PQ" | "IVF_SQ" | "IVF_RQ" | "IVF_HNSW_FLAT" | "IVF_HNSW_PQ"
| "IVF_HNSW_SQ" => IndexType::Vector,
_ => {
@@ -2198,6 +2199,27 @@ impl Dataset {
index_type: "rtree".to_string(),
params: None,
}),
"FM" => {
let mut params_json = serde_json::Map::new();
if let Some(kwargs) = kwargs
&& let Some(num_segments) = kwargs.get_item("num_segments")?
{
let n: u32 = num_segments.extract()?;
params_json.insert(
"num_segments".to_string(),
serde_json::Value::Number(n.into()),
);
}
let params = if params_json.is_empty() {
None
} else {
Some(serde_json::Value::Object(params_json).to_string())
};
Box::new(ScalarIndexParams {
index_type: "fm".to_string(),
params,
})
}
"SCALAR" => {
let Some(kwargs) = kwargs else {
return Err(PyValueError::new_err(
19 changes: 9 additions & 10 deletions rust/lance-index/src/lib.rs
@@ -125,7 +125,7 @@ pub enum IndexType {

RTree = 10, // RTree

FMIndex = 11, // FM-Index
Fm = 11, // FM-Index

// 100+ and up for vector index.
/// Flat vector index.
@@ -152,7 +152,7 @@ impl std::fmt::Display for IndexType {
Self::ZoneMap => write!(f, "ZoneMap"),
Self::BloomFilter => write!(f, "BloomFilter"),
Self::RTree => write!(f, "RTree"),
Self::FMIndex => write!(f, "FMIndex"),
Self::Fm => write!(f, "Fm"),
Self::Vector | Self::IvfPq => write!(f, "IVF_PQ"),
Self::IvfFlat => write!(f, "IVF_FLAT"),
Self::IvfSq => write!(f, "IVF_SQ"),
@@ -180,7 +180,7 @@ impl TryFrom<i32> for IndexType {
v if v == Self::ZoneMap as i32 => Ok(Self::ZoneMap),
v if v == Self::BloomFilter as i32 => Ok(Self::BloomFilter),
v if v == Self::RTree as i32 => Ok(Self::RTree),
v if v == Self::FMIndex as i32 => Ok(Self::FMIndex),
v if v == Self::Fm as i32 => Ok(Self::Fm),
v if v == Self::Vector as i32 => Ok(Self::Vector),
v if v == Self::IvfFlat as i32 => Ok(Self::IvfFlat),
v if v == Self::IvfSq as i32 => Ok(Self::IvfSq),
@@ -209,7 +209,7 @@ impl TryFrom<&str> for IndexType {
"ZoneMap" | "ZONEMAP" => Ok(Self::ZoneMap),
"BloomFilter" | "BLOOMFILTER" | "BLOOM_FILTER" => Ok(Self::BloomFilter),
"RTree" | "RTREE" | "R_TREE" => Ok(Self::RTree),
"FMIndex" | "FMINDEX" | "FM_INDEX" => Ok(Self::FMIndex),
"Fm" | "FM" => Ok(Self::Fm),
"Vector" | "VECTOR" => Ok(Self::Vector),
"IVF_FLAT" => Ok(Self::IvfFlat),
"IVF_SQ" => Ok(Self::IvfSq),
@@ -241,7 +241,7 @@ impl IndexType {
| Self::ZoneMap
| Self::BloomFilter
| Self::RTree
| Self::FMIndex,
| Self::Fm,
)
}

@@ -281,7 +281,7 @@ impl IndexType {
Self::ZoneMap => 0,
Self::BloomFilter => 0,
Self::RTree => 0,
Self::FMIndex => 0,
Self::Fm => 0,

// IMPORTANT: if any vector index subtype needs a format bump that is
// not backward compatible, its new version must be set to
@@ -396,7 +396,7 @@ mod tests {
IndexType::ZoneMap,
IndexType::BloomFilter,
IndexType::RTree,
IndexType::FMIndex,
IndexType::Fm,
IndexType::Vector,
IndexType::IvfFlat,
IndexType::IvfSq,
@@ -438,9 +438,8 @@ mod tests {
("RTree", IndexType::RTree),
("RTREE", IndexType::RTree),
("R_TREE", IndexType::RTree),
("FMIndex", IndexType::FMIndex),
("FMINDEX", IndexType::FMIndex),
("FM_INDEX", IndexType::FMIndex),
("Fm", IndexType::Fm),
("FM", IndexType::Fm),
("Vector", IndexType::Vector),
("VECTOR", IndexType::Vector),
("IVF_FLAT", IndexType::IvfFlat),
8 changes: 7 additions & 1 deletion rust/lance-index/src/registry.rs
@@ -44,7 +44,12 @@ impl IndexPluginRegistry {
fn get_plugin_name_from_details_name(&self, details_name: &str) -> String {
let details_name = Self::normalize_plugin_name(details_name);
if details_name.ends_with("indexdetails") {
details_name.replace("indexdetails", "")
let plugin_name = details_name.replace("indexdetails", "");
if plugin_name == "fmindex" {
"fm".to_string()
} else {
plugin_name
}
} else {
details_name
}
@@ -156,6 +161,7 @@ mod tests {
("NGRAM", "NGram"),
("ZONEMAP", "ZoneMap"),
("BLOOMFILTER", "BloomFilter"),
("FM", "Fm"),
("JSON", "Json"),
] {
let plugin = registry.get_plugin_by_name(requested_name).unwrap();
6 changes: 3 additions & 3 deletions rust/lance-index/src/scalar.rs
@@ -68,7 +68,7 @@ pub enum BuiltinIndexType {
BloomFilter,
RTree,
Inverted,
FMIndex,
Fm,
}

impl BuiltinIndexType {
@@ -82,7 +82,7 @@ impl BuiltinIndexType {
Self::Inverted => "inverted",
Self::BloomFilter => "bloomfilter",
Self::RTree => "rtree",
Self::FMIndex => "fmindex",
Self::Fm => "fm",
}
}
}
@@ -100,7 +100,7 @@ impl TryFrom<IndexType> for BuiltinIndexType {
IndexType::Inverted => Ok(Self::Inverted),
IndexType::BloomFilter => Ok(Self::BloomFilter),
IndexType::RTree => Ok(Self::RTree),
IndexType::FMIndex => Ok(Self::FMIndex),
IndexType::Fm => Ok(Self::Fm),
_ => Err(Error::index("Invalid index type".to_string())),
}
}