feat(index): accelerate regex and infix LIKE with the ngram index #7139
Open
wombatu-kun wants to merge 4 commits into
Important: This PR touches the Lance format specification.
The ngram index previously accelerated only `contains(col, 'substr')`. This extends it to regular-expression predicates (`regexp_like` / `regexp_match`) and infix LIKE patterns (`col LIKE '%foo%'`), which until now fell through to a full table scan.
Following Postgres `pg_trgm` and Russ Cox's Google Code Search, a pattern is compiled into a boolean trigram condition that is a necessary condition for any string to match. The condition is evaluated against the inverted index (AND maps to posting-list intersection, OR to union) to produce a candidate superset, and the scan rechecks the true predicate to drop false positives.
The single invariant is soundness: the condition never requires a trigram a matching string could lack, so no real match is dropped. Patterns from which no trigram can be derived (for example `a.b`, `.*`, or case-insensitive matches) are left to a full scan instead of being routed to the index.
A plain-literal `regexp_like(col, 'foo')` is rewritten to `col LIKE '%foo%'` by the optimizer before it reaches the index, so infix LIKE is accelerated through the same machinery. The original LIKE is kept as the recheck filter, so the regex used for candidate generation only needs to be a sound superset.
On a 200k-row benchmark the accelerated queries are 4-5.5x faster (a selective `foo.*bar` drops from 46ms to 8ms), while non-accelerable patterns stay at full-scan cost with no regression.
Coverage includes unit tests for the regex-to-trigram derivation, index-level search tests, a multi-fragment end-to-end scan test, and a regression test ensuring non-accelerable patterns still return correct results.
Closes lance-format#7130
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
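The candidate-then-recheck flow described in this commit message can be illustrated with a small standalone sketch. Everything here is hypothetical and uses only the standard library: the `Cond` enum, `build_index`, and `candidates` are illustrative names, not the PR's types, and the toy index records every raw 3-character window, whereas the real ngram tokenizer applies its own normalization. The condition `foo AND bar` is written by hand as the assumed result for the pattern `foo.*bar`.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical boolean trigram condition (not the PR's actual type).
enum Cond {
    Trigram(&'static str),
    And(Vec<Cond>),
    Or(Vec<Cond>),
}

/// Toy inverted index: every raw 3-character window maps to the rows containing it.
/// Assumption: no normalization, unlike the real tokenizer.
fn build_index(docs: &[&str]) -> HashMap<String, BTreeSet<u64>> {
    let mut index: HashMap<String, BTreeSet<u64>> = HashMap::new();
    for (row, doc) in docs.iter().enumerate() {
        let chars: Vec<char> = doc.chars().collect();
        for w in chars.windows(3) {
            let trigram: String = w.iter().collect();
            index.entry(trigram).or_default().insert(row as u64);
        }
    }
    index
}

/// Candidate rows for `cond`: AND intersects posting lists, OR unions them.
/// This sketch assumes every `And` has at least one child; an empty AND
/// would mean "no constraint" and belongs on the full-scan path.
fn candidates(cond: &Cond, index: &HashMap<String, BTreeSet<u64>>) -> BTreeSet<u64> {
    match cond {
        // A trigram absent from the index has an empty posting list.
        Cond::Trigram(t) => index.get(*t).cloned().unwrap_or_default(),
        Cond::And(parts) => {
            let mut sets = parts.iter().map(|p| candidates(p, index));
            let first = sets.next().expect("And must have at least one child");
            sets.fold(first, |acc, s| acc.intersection(&s).copied().collect())
        }
        Cond::Or(parts) => parts.iter().flat_map(|p| candidates(p, index)).collect(),
    }
}

/// Stand-in for the true predicate `foo.*bar` on single-line input.
fn matches_foo_then_bar(s: &str) -> bool {
    s.find("foo").map_or(false, |i| s[i + 3..].contains("bar"))
}

/// Keep only the candidates that pass the true predicate.
fn recheck(docs: &[&str], cands: &BTreeSet<u64>) -> Vec<u64> {
    cands
        .iter()
        .copied()
        .filter(|&row| matches_foo_then_bar(docs[row as usize]))
        .collect()
}

fn main() {
    let docs = ["foo then bar", "bar then foo", "only foo here"];
    let index = build_index(&docs);
    let cond = Cond::And(vec![Cond::Trigram("foo"), Cond::Trigram("bar")]);
    let cands = candidates(&cond, &index);
    println!("candidates: {:?}", cands);
    println!("after recheck: {:?}", recheck(&docs, &cands));
}
```

Row 1 contains both trigrams in the wrong order, so it survives the index lookup and is only dropped by the recheck. That is the sense in which the index result is a superset: the condition is necessary for a match, not sufficient.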
Adds a unit test for `apply_regex_flags` (inline-flag folding, plus rejection of unrecognized and non-literal flag arguments) and an end-to-end scanner assertion that a case-insensitive `regexp_like(col, 'PAT', 'i')` still returns correct results through the full-scan recheck. Also renames the misleading `anchored_literal` benchmark case to `plain_literal` (the pattern has no anchors; it is rewritten to an infix LIKE) and adds a trailing newline to the ngram docs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The spell checker flagged the deliberate regex test fragments `fo` / `ba` and the spelling variant `unparseable`. Reword the affected comments, mark the two lines that must keep the literal regex strings with `spellchecker:disable-line`, and switch `unparseable` to `unparsable` (including the renamed test). No behavior change.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The feature commit added the `regex-syntax` dependency to `lance-index` but updated only the root `Cargo.lock`. The `python/` crate is a separate Cargo workspace whose lint and build steps run with `--locked`, so its lockfile must also list `regex-syntax` under `lance-index` or `--locked` fails. The package itself was already present (a transitive dep of `regex`); only the missing dependency edge is added. The `java/lance-jni` workspace does not run with `--locked`, so it is left untouched.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
7998f7e to 552595f
What
Extends the ngram index (today only `contains(col, 'substr')`) to also accelerate `regexp_like(col, pat)` / `regexp_match(col, pat)` and infix `LIKE` (`col LIKE '%foo%bar%'`), which until now fell through to a full table scan. Closes #7130.
How
Following Postgres `pg_trgm` and Russ Cox's trigram-index approach, a pattern is compiled into a boolean trigram condition (`TrigramQuery`, an AND/OR tree) that is a necessary condition for any match. This maps onto the inverted index's set algebra: AND is posting-list intersection, OR is union. The index returns a candidate superset and the scan rechecks the true predicate, exactly as `contains` does. The derivation walks the `regex-syntax` HIR bottom-up, tracking per-node `(emptyable, exact, prefix, suffix)` sets and folding boundary trigrams across concatenation, with bounds that fold-then-discard so precision loss never drops a necessary trigram.
Soundness
The derived condition never requires a trigram a matching string could lack, so no real match is dropped. Requirements come from the index's own tokenizer (so sub-trigram runs contribute nothing); character classes and case-insensitive folds are treated as a single unknown character (the index's normalization disagrees with Unicode case folding - `(?i)c` also matches U+2102); and patterns with no derivable trigram (`a.b`, `.*`) are left to a full scan.
Why infix LIKE
A plain-literal `regexp_like(col, 'foo')` is rewritten to `col LIKE '%foo%'` by the optimizer before it reaches the index, so without infix-LIKE the most common "regex" query would not accelerate. The LIKE is translated to a loose regex for candidate generation only; the original LIKE stays as the recheck filter, so the candidate regex need only be a sound superset.
Benchmark
`cargo bench -p lance --bench regex_ngram`, 200k rows, before (main) vs after:
- `regexp_match(doc, 'zqxwvu.*needlexyz')`
- `regexp_match(doc, '(zqxwvu|qwerasdf|needlexyz)')`
- `regexp_match(doc, 'zqxwvu')` (rewritten to LIKE)
- `regexp_match(doc, 'a.b')` (non-accelerable)
Testing
Unit tests for the regex-to-trigram derivation and the regex-flags folding; index-level search tests (AND across `.*`, alternation union, NULL exclusion, absent trigram); a multi-fragment end-to-end scan test asserting correct results and index engagement for `regexp_like`, `regexp_match`, and `LIKE`, plus a case-insensitive query that must fall back to a full recheck; and a regression test ensuring non-accelerable patterns still return all correct matches via full recheck. No binding changes are needed - Python/Java pass these filter strings through the existing scan API, so acceleration applies transparently.
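To make the infix-LIKE and soundness points above concrete, here is a minimal sketch of which trigrams an infix LIKE pattern can safely require. It is not the PR's derivation, which goes through a loose regex and the `regex-syntax` HIR. The function name `required_trigrams` is hypothetical, LIKE escape characters are ignored, and trigrams are taken as raw 3-character windows rather than the output of the index's own tokenizer.

```rust
/// Hypothetical helper: trigrams that every string matching an infix LIKE
/// pattern must contain. Assumes no escape character and no normalization.
fn required_trigrams(like_pattern: &str) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    // `%` and `_` are LIKE wildcards; the text between them must appear verbatim.
    for run in like_pattern.split(|c: char| c == '%' || c == '_') {
        let chars: Vec<char> = run.chars().collect();
        // Runs shorter than three characters yield no window and so no requirement.
        for w in chars.windows(3) {
            let trigram: String = w.iter().collect();
            if !out.contains(&trigram) {
                out.push(trigram);
            }
        }
    }
    out
}

fn main() {
    for pattern in ["%foo%bar%", "%foo%ba%", "%ab%"] {
        println!("{:<12} -> {:?}", pattern, required_trigrams(pattern));
    }
}
```

In this sketch the short run `ba` adds no requirement, which keeps the condition a necessary one. An empty result means nothing can be required, so such a pattern would stay on the full-scan path.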