flat_bm25_search() uses query-matching token count instead of actual doc length in BM25   normalization

 ## Bug description

  `flat_bm25_search()` in `rust/lance-index/src/scalar/inverted/index.rs` (v0.39.0, lines
  2376–2383)
  computes the BM25 document-length normalization term using the **query-matching token count**
  instead
  of the **actual document token count**. This silently breaks relevance scoring for all documents
  searched via the unindexed (flat) path — which is used for new fragments that arrive via
  `merge_insert` before they are indexed.

  ## Affected code
  
  ```rust
  // flat_bm25_search() — lines 2376–2383
  let doc_tokens = collect_doc_tokens(doc, tokenizer, None); // ALL tokens collected
  scorer.update(&doc_tokens);                                // avgdl updated correctly

  let doc_tokens = doc_tokens                               // ← REBIND shadows original
      .into_iter()                                                               
      .filter(|t| query_tokens.contains(t))
      .collect::<Vec<_>>();

  let doc_norm = K1 * (1.0 - B + B * doc_tokens.len() as f32 / scorer.avg_doc_length());
  //                                   ^^^^^^^^^^^^^^^^
  //                                   BUG: filtered (query-matching) count, NOT actual doc length

  The variable shadowing of doc_tokens is the root cause: the second binding silently changes
  what .len() measures. The compiler emits no warning because the type is identical.
  
  Comparison with the indexed path

  The indexed WAND path (~line 174) does this correctly:
  
  let doc_norm = K1 * (1.0 - B + B * self.docs.num_tokens(doc) as f32 /
  self.docs.average_length());
  //                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  //                                   CORRECT: stored full token count
  
  Impact

  For a single-term query, every document gets doc_tokens.len() == 1 after filtering, so the
  normalization collapses to K1 * (1 - B + B * 1/avgdl) regardless of actual length. This causes:

  - Long documents to be over-scored on the flat path relative to the indexed path.
  - Score inconsistency between indexed and unindexed fragments after a merge_insert delta sync.
  - Incorrect ranking until the new fragment is fully indexed.

  Proposed fix

  let doc_tokens = collect_doc_tokens(doc, tokenizer, None);
  let doc_len = doc_tokens.len();          // save actual length before filter
  scorer.update(&doc_tokens);

  let doc_tokens = doc_tokens
      .into_iter()
      .filter(|t| query_tokens.contains(t))
      .collect::<Vec<_>>();

  let doc_norm = K1 * (1.0 - B + B * doc_len as f32 / scorer.avg_doc_length());
  //                                   ^^^^^^^ actual length — consistent with indexed path

  Reproduction sketch

  1. Create a LanceDB table with a BM25 full-text search index.
  2. Insert additional rows via merge_insert (creates an unindexed fragment).
  3. Run a single-term FTS query without re-indexing.
  4. Compare scores of matching rows in the new fragment vs. the indexed fragment for documents
  of different lengths — scores will diverge despite identical term frequency.

  Environment
  
  - Lance version: v0.39.0
  - File: rust/lance-index/src/scalar/inverted/index.rs
  - Lines: 2376–2383 (flat path), ~174 (indexed path for comparison)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

flat_bm25_search() uses query-matching token count instead of actual doc length in BM25 normalization #7107

Bug description

Affected code

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

flat_bm25_search() uses query-matching token count instead of actual doc length in BM25 normalization #7107

Description

Bug description

Affected code

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions