Skip to content

fix: route JSON index queries to the correct sub-parser by path#7072

Open
ztorchan wants to merge 2 commits into
lance-format:mainfrom
ztorchan:fix-json_index_route
Open

fix: route JSON index queries to the correct sub-parser by path#7072
ztorchan wants to merge 2 commits into
lance-format:mainfrom
ztorchan:fix-json_index_route

Conversation

@ztorchan
Copy link
Copy Markdown
Contributor

@ztorchan ztorchan commented Jun 3, 2026

Problem

When a dataset has a JSON column and multiple JSON indices are created on different JSON paths of that same column (e.g. one index on $.a and another on $.b), query routing is incorrect. A query like json_extract(json, '$.b') = 'foo' may hit the $.a index instead of the $.b index, producing wrong results.

Root Cause

maybe_indexed_column obtains a parser from IndexInformationProvider::get_index(), which returns a &dyn ScalarQueryParser pointing to a MultiQueryParser that aggregates all sub-parsers for that column.

The flow was:

  1. get_index() returns MultiQueryParser as &dyn ScalarQueryParser
  2. parser.is_valid_reference(expr, data_type) is called — MultiQueryParser's impl iterates children and returns Some(DataType) from the first child that accepts, but discards which child matched
  3. The same MultiQueryParser is then used for visit_eq / visit_between etc., which also iterate children and return the first non-None result — potentially a different child than the one that validated the reference

This means the query can be dispatched to the wrong JSON index (e.g. the $.a index for a $.b query).

Fix

  • Change IndexInformationProvider::get_index to return (&DataType, &MultiQueryParser) instead of (&DataType, &dyn ScalarQueryParser), so callers can interact with the MultiQueryParser directly
  • Add MultiQueryParser::select(expr, data_type) — iterates child parsers and returns (&dyn ScalarQueryParser, DataType) from the first child whose is_valid_reference accepts the expression, preserving which child matched
  • Update maybe_indexed_column to call multi.select(expr, data_type) instead of parser.is_valid_reference(expr, data_type), obtaining the precise sub-parser for all subsequent operations

Test

Added regression test test_multi_json_indices_route_by_path that:

  • Creates a MultiQueryParser with two JsonQueryParser sub-parsers (for $.a and $.b)
  • Verifies json_extract(json, '$.b') = 'foo' resolves to the json_b_idx index
  • Verifies json_extract(json, '$.a') = 'foo' resolves to the json_a_idx index
  • Verifies json_extract(json, '$.c') = 'foo' (unindexed path) does not bind to any index

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added A-index Vector index, linalg, tokenizer bug Something isn't working labels Jun 3, 2026
@ztorchan
Copy link
Copy Markdown
Contributor Author

ztorchan commented Jun 3, 2026

Hi, @Xuanwo @westonpace
This bug caused me to get incorrect results when using a filter on a JSON column. Please take a look when you have time

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 3, 2026

Codecov Report

❌ Patch coverage is 96.15385% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance-index/src/scalar/expression.rs 96.05% 2 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a good improvement over the existing which would just blindly return a multi-query-parser where the first index might not be capable of serving the query.

I wonder slightly if this is the right spot to be calling select. You're calling it when we extract the field reference but in some cases we might need more of the tree to make that decision.

For example, let's say we have a btree index and a bloom filter index on the same column and the filter is x > 7. At this point all we see is x and we say "both bloom and btree would work so pick bloom". Then, later, we hit > and we bail because bloom doesn't support >. However, if we had made the select decision with the entire scope we might have been able to choose the btree index.

That being said, I'm not aware of any use case for having both a bloom filter and a btree index at the moment, and a partial fix is better than no fix. So we can also wait and add test cases for those other situations when we encounter them.

Ping me if you want to continue with the merge as-is.

self.parsers.push(other);
}

/// Pick the underlying parser whose `is_valid_reference` accepts `expr`.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Pick the underlying parser whose `is_valid_reference` accepts `expr`.
/// Pick the first underlying parser whose `is_valid_reference` accepts `expr`.

I think it's still possible to have multiple indexes that could satisfy a query (e.g. if you had a btree and bitmap index on the same column) so I think we still use first-come-first-serve in those cases.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the comment

@ztorchan ztorchan force-pushed the fix-json_index_route branch from ee44708 to 72e9b32 Compare June 4, 2026 02:17
@ztorchan
Copy link
Copy Markdown
Contributor Author

ztorchan commented Jun 4, 2026

That makes sense. I've filed #7091 to track this. I would be happy to solve this problem.
For this pr, please continue with the merge as-is if you think it's appropriate. @westonpace

@ztorchan ztorchan requested a review from westonpace June 5, 2026 02:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-index Vector index, linalg, tokenizer bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants