fix: route JSON index queries to the correct sub-parser by path #7072
ztorchan wants to merge 2 commits
Hi, @Xuanwo @westonpace
westonpace left a comment
This seems like a good improvement over the existing behavior, which would just blindly return a multi-query-parser whose first index might not be capable of serving the query.
I wonder slightly if this is the right spot to be calling select. You're calling it when we extract the field reference but in some cases we might need more of the tree to make that decision.
For example, let's say we have a btree index and a bloom filter index on the same column and the filter is x > 7. At this point all we see is x and we say "both bloom and btree would work so pick bloom". Then, later, we hit > and we bail because bloom doesn't support >. However, if we had made the select decision with the entire scope we might have been able to choose the btree index.
That being said, I'm not aware of any use case for having both a bloom filter and a btree index at the moment, and a partial fix is better than no fix. So we can also wait and add test cases for those other situations when we encounter them.
Ping me if you want to continue with the merge as-is.
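The btree/bloom scenario above can be sketched in a few lines. This is only an illustration under stated assumptions: `IndexKind`, `Op`, `select_by_column`, and `select_by_predicate` are hypothetical names made up for this sketch, not Lance types or functions, and the support rules (bloom answers only equality, btree also answers ranges) are taken from the example in the comment.

```rust
/// Comparison operators that can appear in a filter such as `x > 7`.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Op {
    Eq,
    Gt,
}

/// Hypothetical index kinds on a single column (not the real Lance types).
#[derive(Clone, Copy, Debug, PartialEq)]
enum IndexKind {
    Bloom,
    BTree,
}

impl IndexKind {
    /// Assumption for this sketch: a bloom filter answers only equality,
    /// while a btree also answers range comparisons.
    fn supports(self, op: Op) -> bool {
        match self {
            IndexKind::Bloom => op == Op::Eq,
            IndexKind::BTree => true,
        }
    }
}

/// Early selection: commit to the first index when only the column is
/// visible, then bail out if that index cannot serve the operator.
fn select_by_column(indices: &[IndexKind], op: Op) -> Option<IndexKind> {
    let chosen = indices.first().copied()?;
    chosen.supports(op).then_some(chosen)
}

/// Late selection: choose with the operator in scope, so an index that
/// cannot serve the predicate is skipped instead of ending the search.
fn select_by_predicate(indices: &[IndexKind], op: Op) -> Option<IndexKind> {
    indices.iter().copied().find(|kind| kind.supports(op))
}
```

With `[Bloom, BTree]` and `Op::Gt`, the early variant finds no usable index while the late variant falls through to the btree. Whether the real planner can defer the decision this way depends on how much of the expression tree is available at the call site.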
        self.parsers.push(other);
    }

    /// Pick the underlying parser whose `is_valid_reference` accepts `expr`.
Suggested change:
- /// Pick the underlying parser whose `is_valid_reference` accepts `expr`.
+ /// Pick the first underlying parser whose `is_valid_reference` accepts `expr`.
I think it's still possible to have multiple indexes that could satisfy a query (e.g. a btree and a bitmap index on the same column), so I think we still use first-come, first-served in those cases.
I've changed the comment
ee44708 to 72e9b32
That makes sense. I've filed #7091 to track this. I would be happy to solve this problem.
Problem
When a dataset has a JSON column and multiple JSON indices are created on different JSON paths of that same column (e.g. one index on `$.a` and another on `$.b`), query routing is incorrect. A query like `json_extract(json, '$.b') = 'foo'` may hit the `$.a` index instead of the `$.b` index, producing wrong results.
Root Cause
`maybe_indexed_column` obtains a parser from `IndexInformationProvider::get_index()`, which returns a `&dyn ScalarQueryParser` pointing to a `MultiQueryParser` that aggregates all sub-parsers for that column.
The flow was:
1. `get_index()` returns `MultiQueryParser` as `&dyn ScalarQueryParser`
2. `parser.is_valid_reference(expr, data_type)` is called. `MultiQueryParser`'s impl iterates children and returns `Some(DataType)` from the first child that accepts, but discards which child matched
3. `MultiQueryParser` is then used for `visit_eq` / `visit_between` etc., which also iterate children and return the first non-`None` result, potentially from a different child than the one that validated the reference
This means the query can be dispatched to the wrong JSON index (e.g. the `$.a` index for a `$.b` query).
Fix
- Changed `IndexInformationProvider::get_index` to return `(&DataType, &MultiQueryParser)` instead of `(&DataType, &dyn ScalarQueryParser)`, so callers can interact with the `MultiQueryParser` directly
- Added `MultiQueryParser::select(expr, data_type)`, which iterates child parsers and returns `(&dyn ScalarQueryParser, DataType)` from the first child whose `is_valid_reference` accepts the expression, preserving which child matched
- Updated `maybe_indexed_column` to call `multi.select(expr, data_type)` instead of `parser.is_valid_reference(expr, data_type)`, obtaining the precise sub-parser for all subsequent operations
Test
Added regression test `test_multi_json_indices_route_by_path` that:
- Builds a `MultiQueryParser` with two `JsonQueryParser` sub-parsers (for `$.a` and `$.b`)
- Verifies `json_extract(json, '$.b') = 'foo'` resolves to the `json_b_idx` index
- Verifies `json_extract(json, '$.a') = 'foo'` resolves to the `json_a_idx` index
- Verifies `json_extract(json, '$.c') = 'foo'` (unindexed path) does not bind to any index
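
For readers who want the routing bug and the fix in miniature, here is a simplified sketch. It is not the actual Lance code: `JsonParser`, `MultiParser`, `route_old`, `route_new`, and `two_path_parser` are hypothetical stand-ins, JSON paths are plain strings rather than expressions, and it assumes a child's `visit_eq` answers without re-checking the path (which is what lets the old flow pick the wrong child).

```rust
/// Simplified stand-in for a per-path JSON sub-parser (hypothetical).
struct JsonParser {
    index_name: &'static str,
    path: &'static str,
}

impl JsonParser {
    /// Accept a reference only when the queried path matches this index.
    fn is_valid_reference(&self, path: &str) -> bool {
        self.path == path
    }

    /// Build an equality query against this parser's index.
    /// Assumption: the path is not re-checked at this stage.
    fn visit_eq(&self, value: &str) -> Option<String> {
        Some(format!("{} = {}", self.index_name, value))
    }
}

/// Simplified stand-in for the aggregate parser over one column.
struct MultiParser {
    parsers: Vec<JsonParser>,
}

impl MultiParser {
    /// Old flow: validate against any child, then dispatch to the first
    /// child that answers, which may not be the child that validated.
    fn route_old(&self, path: &str, value: &str) -> Option<String> {
        if !self.parsers.iter().any(|p| p.is_valid_reference(path)) {
            return None;
        }
        self.parsers.iter().find_map(|p| p.visit_eq(value))
    }

    /// New flow, step 1: remember which child accepted the reference.
    fn select(&self, path: &str) -> Option<&JsonParser> {
        self.parsers.iter().find(|p| p.is_valid_reference(path))
    }

    /// New flow, step 2: run all later operations on that child only.
    fn route_new(&self, path: &str, value: &str) -> Option<String> {
        self.select(path)?.visit_eq(value)
    }
}

/// Two indices on different paths of the same JSON column.
fn two_path_parser() -> MultiParser {
    MultiParser {
        parsers: vec![
            JsonParser { index_name: "json_a_idx", path: "$.a" },
            JsonParser { index_name: "json_b_idx", path: "$.b" },
        ],
    }
}
```

In this sketch, `route_old("$.b", "foo")` lands on `json_a_idx`, while `route_new` resolves `$.b` to `json_b_idx`, `$.a` to `json_a_idx`, and returns `None` for the unindexed `$.c`, mirroring the three cases the regression test covers.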