Skip to content

flat_bm25_search() uses query-matching token count instead of actual doc length in BM25 normalization #7107

@kate2020-summer

Description

@kate2020-summer

Bug description

flat_bm25_search() in rust/lance-index/src/scalar/inverted/index.rs (v0.39.0, lines
2376–2383)
computes the BM25 document-length normalization term using the query-matching token count
instead
of the actual document token count. This silently breaks relevance scoring for all documents
searched via the unindexed (flat) path — which is used for new fragments that arrive via
merge_insert before they are indexed.

Affected code

// flat_bm25_search() — lines 2376–2383
let doc_tokens = collect_doc_tokens(doc, tokenizer, None); // ALL tokens collected
scorer.update(&doc_tokens);                                // avgdl updated correctly

let doc_tokens = doc_tokens                               // ← REBIND shadows original
    .into_iter()                                                               
    .filter(|t| query_tokens.contains(t))
    .collect::<Vec<_>>();

let doc_norm = K1 * (1.0 - B + B * doc_tokens.len() as f32 / scorer.avg_doc_length());
//                                   ^^^^^^^^^^^^^^^^
//                                   BUG: filtered (query-matching) count, NOT actual doc length

The variable shadowing of doc_tokens is the root cause: the second binding silently changes
what .len() measures. The compiler emits no warning because the type is identical.

Comparison with the indexed path

The indexed WAND path (~line 174) does this correctly:

let doc_norm = K1 * (1.0 - B + B * self.docs.num_tokens(doc) as f32 /
self.docs.average_length());
//                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
//                                   CORRECT: stored full token count

Impact

For a single-term query, every document gets doc_tokens.len() == 1 after filtering, so the
normalization collapses to K1 * (1 - B + B * 1/avgdl) regardless of actual length. This causes:

- Long documents to be over-scored on the flat path relative to the indexed path.
- Score inconsistency between indexed and unindexed fragments after a merge_insert delta sync.
- Incorrect ranking until the new fragment is fully indexed.

Proposed fix

let doc_tokens = collect_doc_tokens(doc, tokenizer, None);
let doc_len = doc_tokens.len();          // save actual length before filter
scorer.update(&doc_tokens);

let doc_tokens = doc_tokens
    .into_iter()
    .filter(|t| query_tokens.contains(t))
    .collect::<Vec<_>>();

let doc_norm = K1 * (1.0 - B + B * doc_len as f32 / scorer.avg_doc_length());
//                                   ^^^^^^^ actual length — consistent with indexed path

Reproduction sketch

1. Create a LanceDB table with a BM25 full-text search index.
2. Insert additional rows via merge_insert (creates an unindexed fragment).
3. Run a single-term FTS query without re-indexing.
4. Compare scores of matching rows in the new fragment vs. the indexed fragment for documents
of different lengths — scores will diverge despite identical term frequency.

Environment

- Lance version: v0.39.0
- File: rust/lance-index/src/scalar/inverted/index.rs
- Lines: 23762383 (flat path), ~174 (indexed path for comparison)

Metadata

Metadata

Assignees

Labels

A-indexVector index, linalg, tokenizerbugSomething isn't working

Type

No fields configured for Bug.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions