feat(index): accelerate regex and infix LIKE with the ngram index #7139

Open
wombatu-kun wants to merge 4 commits into lance-format:main from wombatu-kun:issue/7130-ngram-regex

Contributor

@wombatu-kun wombatu-kun commented Jun 6, 2026

What

Extends the ngram index, which today accelerates only contains(col, 'substr'), to also accelerate regexp_like(col, pat) / regexp_match(col, pat) and infix LIKE (col LIKE '%foo%bar%'). Until now these predicates fell through to a full table scan. Closes #7130.

How

Following Postgres pg_trgm and Russ Cox's trigram-index approach, a pattern is compiled into a boolean trigram condition (TrigramQuery, an AND/OR tree) that is a necessary condition for any match. This maps onto the inverted index's set algebra: AND is posting-list intersection, OR is union. The index returns a candidate superset and the scan rechecks the true predicate, exactly as contains does. The derivation walks the regex-syntax HIR bottom-up, tracking per-node (emptyable, exact, prefix, suffix) sets and folding boundary trigrams across concatenation. The sets are bounded in fold-then-discard order, so precision loss never drops a necessary trigram.
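A minimal sketch of how an AND/OR trigram tree can be evaluated against posting lists. The `TrigramQuery` enum below is a hypothetical stand-in for the PR's type of the same name, and `ToyIndex` is an illustrative in-memory index, not Lance's ngram index; tokenization here is a plain three-character sliding window with no normalization.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical shape of the boolean trigram condition (assumption, not the PR's code).
#[derive(Debug, Clone)]
pub enum TrigramQuery {
    /// No constraint could be derived: every row is a candidate.
    All,
    /// Rows whose value contains this trigram.
    Trigram(String),
    /// Every child must hold: posting-list intersection.
    And(Vec<TrigramQuery>),
    /// At least one child must hold: posting-list union.
    Or(Vec<TrigramQuery>),
}

/// Toy inverted index: trigram -> set of row ids.
pub struct ToyIndex {
    postings: HashMap<String, BTreeSet<u64>>,
    all_rows: BTreeSet<u64>,
}

impl ToyIndex {
    /// Index each document under every 3-character window it contains.
    pub fn build(docs: &[&str]) -> Self {
        let mut postings: HashMap<String, BTreeSet<u64>> = HashMap::new();
        let mut all_rows = BTreeSet::new();
        for (row, doc) in docs.iter().enumerate() {
            let row = row as u64;
            all_rows.insert(row);
            let chars: Vec<char> = doc.chars().collect();
            for w in chars.windows(3) {
                postings
                    .entry(w.iter().collect::<String>())
                    .or_default()
                    .insert(row);
            }
        }
        Self { postings, all_rows }
    }

    /// Return a candidate superset; the caller must still recheck the real predicate.
    pub fn candidates(&self, q: &TrigramQuery) -> BTreeSet<u64> {
        match q {
            TrigramQuery::All => self.all_rows.clone(),
            TrigramQuery::Trigram(t) => self.postings.get(t).cloned().unwrap_or_default(),
            TrigramQuery::And(children) => {
                // Start from all rows and narrow with each child.
                let mut acc = self.all_rows.clone();
                for c in children {
                    let s = self.candidates(c);
                    acc = acc.intersection(&s).copied().collect();
                }
                acc
            }
            TrigramQuery::Or(children) => {
                let mut acc = BTreeSet::new();
                for c in children {
                    acc.extend(self.candidates(c));
                }
                acc
            }
        }
    }
}
```

With the documents "foo then bar", "foo only", "bar only", and "neither", the condition for foo.*bar (And of the trigrams foo and bar) yields only row 0, while the condition for (foo|bar) (Or of the same trigrams) yields rows 0, 1, and 2.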

Soundness

The derived condition never requires a trigram a matching string could lack, so no real match is dropped. Requirements come from the index's own tokenizer (so sub-trigram runs contribute nothing); character classes and case-insensitive folds are treated as a single unknown character (the index's normalization disagrees with Unicode case folding - (?i)c also matches U+2102); and patterns with no derivable trigram (a.b, .*) are left to a full scan.
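The soundness argument can be illustrated on plain literals. The sketch below is not the PR's derivation: the function names are hypothetical, and the tokenizer is assumed to be a raw three-character window without the normalization the real index applies. It shows two things: a run shorter than three characters yields no requirement, and the trigram condition admits false positives that only the recheck removes.

```rust
/// Trigrams that every string containing `literal` must also contain.
/// Runs shorter than three characters yield nothing, i.e. no constraint.
pub fn required_trigrams(literal: &str) -> Vec<String> {
    let chars: Vec<char> = literal.chars().collect();
    chars
        .windows(3)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Necessary (not sufficient) condition: `haystack` has every required trigram.
pub fn is_candidate(haystack: &str, literal: &str) -> bool {
    required_trigrams(literal)
        .iter()
        .all(|t| haystack.contains(t.as_str()))
}

/// Index-style filter followed by the true predicate (the recheck).
pub fn search<'a>(rows: &[&'a str], literal: &str) -> Vec<&'a str> {
    rows.iter()
        .copied()
        .filter(|r| is_candidate(r, literal)) // candidate superset
        .filter(|r| r.contains(literal)) // recheck drops false positives
        .collect()
}
```

For the literal abcd the requirements are abc and bcd. The row "abc-bcd" contains both trigrams, so it is a candidate, yet it does not contain abcd and is dropped by the recheck. No row that really contains abcd can fail the candidate test, which is the property the section describes.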

Why infix LIKE

A plain-literal regexp_like(col, 'foo') is rewritten to col LIKE '%foo%' by the optimizer before it reaches the index, so without infix-LIKE the most common "regex" query would not accelerate. The LIKE is translated to a loose regex for candidate generation only; the original LIKE stays as the recheck filter, so the candidate regex need only be a sound superset.
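One way the LIKE-to-regex step could look, as a sketch under assumptions: the PR text does not show its translation, so the function name `like_to_loose_regex` and the details below are hypothetical. LIKE escape sequences (an ESCAPE clause or backslash escapes) are deliberately not handled. The `(?s)` flag lets `.` match newlines, which keeps the regex at least as permissive as LIKE's `%` and `_`.

```rust
/// Characters that must be backslash-escaped to be literal in a regex.
fn is_regex_meta(c: char) -> bool {
    matches!(
        c,
        '\\' | '.' | '+' | '*' | '?' | '(' | ')' | '|' | '[' | ']' | '{' | '}' | '^' | '$'
            | '#' | '&' | '-' | '~'
    )
}

/// Translate a SQL LIKE pattern into a regex string for candidate generation.
/// Assumption: no LIKE escape character is in use.
pub fn like_to_loose_regex(like: &str) -> String {
    // (?s) makes `.` match any character, including newlines, as LIKE wildcards do.
    let mut out = String::from("(?s)^");
    for ch in like.chars() {
        match ch {
            '%' => out.push_str(".*"), // any run of characters
            '_' => out.push('.'),      // exactly one character
            c if is_regex_meta(c) => {
                out.push('\\');
                out.push(c);
            }
            c => out.push(c),
        }
    }
    out.push('$');
    out
}
```

Because the original LIKE remains the recheck filter, a translation like this only has to avoid being stricter than LIKE; being looser costs extra candidates, not wrong results.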

Benchmark

cargo bench -p lance --bench regex_ngram, 200k rows, before (main) vs after:

| Query | Before | After | Change |
|---|---|---|---|
| regexp_match(doc, 'zqxwvu.*needlexyz') | 45.8 ms | 8.2 ms | -82% |
| regexp_match(doc, '(zqxwvu\|qwerasdf\|needlexyz)') | 51.7 ms | 11.6 ms | -77% |
| regexp_match(doc, 'zqxwvu') (rewritten to LIKE) | 36.6 ms | 8.0 ms | -78% |
| regexp_match(doc, 'a.b') (non-accelerable) | 77.8 ms | 76.2 ms | within noise |

Testing

Unit tests for the regex-to-trigram derivation and the regex-flags folding; index-level search tests (AND across .*, alternation union, NULL exclusion, absent trigram); a multi-fragment end-to-end scan test asserting correct results and index engagement for regexp_like, regexp_match, and LIKE, plus a case-insensitive query that must fall back to a full recheck; and a regression test ensuring non-accelerable patterns still return all correct matches via full recheck. No binding changes are needed - Python/Java pass these filter strings through the existing scan API, so acceleration applies transparently.


@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Contributor

github-actions Bot commented Jun 6, 2026

Important

This PR touches the Lance format specification.

Substantive changes to the format specification — the .proto definitions
and the spec docs under docs/src/format/ — require a PMC vote before merge.
Minor edits such as typo fixes, wording, or formatting are excluded; use your
judgment.

If this is a meaningful format change:

  • Start a vote following the Lance community voting process.
    Format specification modifications need 3 binding +1 votes (excluding the
    proposer), held on GitHub Discussions, with a minimum voting period of 1 week.
  • Once the vote passes, link the completed vote in this PR. It should not be
    merged until the vote is linked.

@github-actions github-actions Bot added labels on Jun 6, 2026: A-index (Vector index, linalg, tokenizer), A-deps (Dependency updates), A-format (On-disk format: protos and format spec docs), enhancement (New feature or request), A-python (Python bindings)

codecov Bot commented Jun 6, 2026

Codecov Report

❌ Patch coverage is 95.61043% with 32 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/scalar/ngram/ngram_regex.rs | 94.66% | 20 Missing ⚠️ |
| rust/lance-index/src/scalar/expression.rs | 93.07% | 5 Missing and 4 partials ⚠️ |
| rust/lance-index/src/scalar/fmindex.rs | 0.00% | 3 Missing ⚠️ |

Vova Kolmakov and others added 4 commits June 7, 2026 20:23
The ngram index previously accelerated only `contains(col, 'substr')`. This extends it to regular-expression predicates (`regexp_like` / `regexp_match`) and infix LIKE patterns (`col LIKE '%foo%'`), which until now fell through to a full table scan.

Following Postgres `pg_trgm` and Russ Cox's Google Code Search, a pattern is compiled into a boolean trigram condition that is a necessary condition for any string to match. The condition is evaluated against the inverted index (AND maps to posting-list intersection, OR to union) to produce a candidate superset, and the scan rechecks the true predicate to drop false positives. The single invariant is soundness: the condition never requires a trigram a matching string could lack, so no real match is dropped. Patterns from which no trigram can be derived (for example `a.b`, `.*`, or case-insensitive matches) are left to a full scan instead of being routed to the index.

A plain-literal `regexp_like(col, 'foo')` is rewritten to `col LIKE '%foo%'` by the optimizer before it reaches the index, so infix LIKE is accelerated through the same machinery. The original LIKE is kept as the recheck filter, so the regex used for candidate generation only needs to be a sound superset.

On a 200k-row benchmark the accelerated queries are 4-5.5x faster (a selective `foo.*bar` drops from 46ms to 8ms), while non-accelerable patterns stay at full-scan cost with no regression. Coverage includes unit tests for the regex-to-trigram derivation, index-level search tests, a multi-fragment end-to-end scan test, and a regression test ensuring non-accelerable patterns still return correct results.

Closes lance-format#7130

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a unit test for `apply_regex_flags` (inline-flag folding, plus rejection of unrecognized and non-literal flag arguments) and an end-to-end scanner assertion that a case-insensitive `regexp_like(col, 'PAT', 'i')` still returns correct results through the full-scan recheck. Also renames the misleading `anchored_literal` benchmark case to `plain_literal` (the pattern has no anchors; it is rewritten to an infix LIKE) and adds a trailing newline to the ngram docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The spell checker flagged the deliberate regex test fragments `fo` / `ba` and the spelling variant `unparseable`. Reword the affected comments, mark the two lines that must keep the literal regex strings with `spellchecker:disable-line`, and switch `unparseable` to `unparsable` (including the renamed test). No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The feature commit added the `regex-syntax` dependency to `lance-index` but updated only the root `Cargo.lock`. The `python/` crate is a separate Cargo workspace whose lint and build steps run with `--locked`, so its lockfile must also list `regex-syntax` under `lance-index` or `--locked` fails. The package itself was already present (a transitive dep of `regex`); only the missing dependency edge is added. The `java/lance-jni` workspace does not run with `--locked`, so it is left untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wombatu-kun wombatu-kun force-pushed the issue/7130-ngram-regex branch from 7998f7e to 552595f Compare June 7, 2026 13:32

Development

Successfully merging this pull request may close these issues.

Accelerate regular expressions with ngram index, if present

1 participant