Bug description
flat_bm25_search() in rust/lance-index/src/scalar/inverted/index.rs (v0.39.0, lines
2376–2383)
computes the BM25 document-length normalization term using the query-matching token count
instead
of the actual document token count. This silently breaks relevance scoring for all documents
searched via the unindexed (flat) path — which is used for new fragments that arrive via
merge_insert before they are indexed.
Affected code
// flat_bm25_search() — lines 2376–2383
let doc_tokens = collect_doc_tokens(doc, tokenizer, None); // ALL tokens collected
scorer.update(&doc_tokens); // avgdl updated correctly
let doc_tokens = doc_tokens // ← REBIND shadows original
.into_iter()
.filter(|t| query_tokens.contains(t))
.collect::<Vec<_>>();
let doc_norm = K1 * (1.0 - B + B * doc_tokens.len() as f32 / scorer.avg_doc_length());
// ^^^^^^^^^^^^^^^^
// BUG: filtered (query-matching) count, NOT actual doc length
The variable shadowing of doc_tokens is the root cause: the second binding silently changes
what .len() measures. The compiler emits no warning because the type is identical.
Comparison with the indexed path
The indexed WAND path (~line 174) does this correctly:
let doc_norm = K1 * (1.0 - B + B * self.docs.num_tokens(doc) as f32 /
self.docs.average_length());
// ^^^^^^^^^^^^^^^^^^^^^^^^^^^
// CORRECT: stored full token count
Impact
For a single-term query, every document gets doc_tokens.len() == 1 after filtering, so the
normalization collapses to K1 * (1 - B + B * 1/avgdl) regardless of actual length. This causes:
- Long documents to be over-scored on the flat path relative to the indexed path.
- Score inconsistency between indexed and unindexed fragments after a merge_insert delta sync.
- Incorrect ranking until the new fragment is fully indexed.
Proposed fix
let doc_tokens = collect_doc_tokens(doc, tokenizer, None);
let doc_len = doc_tokens.len(); // save actual length before filter
scorer.update(&doc_tokens);
let doc_tokens = doc_tokens
.into_iter()
.filter(|t| query_tokens.contains(t))
.collect::<Vec<_>>();
let doc_norm = K1 * (1.0 - B + B * doc_len as f32 / scorer.avg_doc_length());
// ^^^^^^^ actual length — consistent with indexed path
Reproduction sketch
1. Create a LanceDB table with a BM25 full-text search index.
2. Insert additional rows via merge_insert (creates an unindexed fragment).
3. Run a single-term FTS query without re-indexing.
4. Compare scores of matching rows in the new fragment vs. the indexed fragment for documents
of different lengths — scores will diverge despite identical term frequency.
Environment
- Lance version: v0.39.0
- File: rust/lance-index/src/scalar/inverted/index.rs
- Lines: 2376–2383 (flat path), ~174 (indexed path for comparison)
Bug description
flat_bm25_search()inrust/lance-index/src/scalar/inverted/index.rs(v0.39.0, lines2376–2383)
computes the BM25 document-length normalization term using the query-matching token count
instead
of the actual document token count. This silently breaks relevance scoring for all documents
searched via the unindexed (flat) path — which is used for new fragments that arrive via
merge_insertbefore they are indexed.Affected code