feat(index): accelerate regex and infix LIKE with the ngram index by wombatu-kun · Pull Request #7139 · lance-format/lance

wombatu-kun · 2026-06-06T18:32:38Z

What

Extends the ngram index, which today accelerates only `contains(col, 'substr')`, to also accelerate `regexp_like(col, pat)` / `regexp_match(col, pat)` and infix LIKE (`col LIKE '%foo%bar%'`). Until now these predicates fell through to a full table scan. Closes #7130.

How

Following Postgres `pg_trgm` and Russ Cox's trigram-index approach, a pattern is compiled into a boolean trigram condition (`TrigramQuery`, an AND/OR tree) that is a necessary condition for any match. This maps onto the inverted index's set algebra: AND is posting-list intersection, OR is union. The index returns a candidate superset and the scan rechecks the true predicate, exactly as `contains` does. The derivation walks the `regex-syntax` HIR bottom-up, tracking per-node (emptyable, exact, prefix, suffix) sets and folding boundary trigrams across concatenation. The sets are bounded, and the bounds fold-then-discard: a set that grows too large has its trigrams folded into the condition before it is discarded, so precision loss never drops a necessary trigram.
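A minimal sketch of how such a condition maps onto posting lists, not the PR's implementation. The `TrigramQuery` below is a simplified stand-in for the type named above, and `trigrams`, `build_index`, `literal`, and `candidates` are hypothetical helpers. It assumes raw 3-character windows with no normalization, which the real tokenizer may not match.

```rust
use std::collections::{BTreeSet, HashMap};

/// Simplified stand-in for the PR's `TrigramQuery` (assumed shape).
#[derive(Debug, Clone)]
enum TrigramQuery {
    /// No usable condition: every row stays a candidate.
    All,
    /// Rows whose posting list contains this trigram.
    Trigram(String),
    And(Vec<TrigramQuery>),
    Or(Vec<TrigramQuery>),
}

/// Toy inverted index: trigram -> set of row ids.
type Postings = HashMap<String, BTreeSet<u64>>;

/// Overlapping 3-character windows of `s` (no normalization, an assumption).
fn trigrams(s: &str) -> Vec<String> {
    let chars: Vec<char> = s.chars().collect();
    chars.windows(3).map(|w| w.iter().collect()).collect()
}

/// Build posting lists for a small in-memory corpus.
fn build_index(docs: &[&str]) -> Postings {
    let mut index = Postings::new();
    for (row, doc) in docs.iter().enumerate() {
        for t in trigrams(doc) {
            index.entry(t).or_default().insert(row as u64);
        }
    }
    index
}

/// Necessary condition for a literal: all of its trigrams must be present.
/// A literal shorter than three characters yields no condition at all.
fn literal(s: &str) -> TrigramQuery {
    let ts = trigrams(s);
    if ts.is_empty() {
        TrigramQuery::All
    } else {
        TrigramQuery::And(ts.into_iter().map(TrigramQuery::Trigram).collect())
    }
}

/// Evaluate the tree: AND is intersection, OR is union.
fn candidates(q: &TrigramQuery, index: &Postings, all_rows: &BTreeSet<u64>) -> BTreeSet<u64> {
    match q {
        TrigramQuery::All => all_rows.clone(),
        TrigramQuery::Trigram(t) => index.get(t).cloned().unwrap_or_default(),
        TrigramQuery::And(children) => {
            let mut acc = all_rows.clone();
            for child in children {
                let rows = candidates(child, index, all_rows);
                acc = acc.intersection(&rows).copied().collect();
            }
            acc
        }
        TrigramQuery::Or(children) => {
            let mut acc = BTreeSet::new();
            for child in children {
                acc.extend(candidates(child, index, all_rows));
            }
            acc
        }
    }
}
```

Under these assumptions, `zqxwvu.*needlexyz` corresponds to `And([literal("zqxwvu"), literal("needlexyz")])` and the alternation of the same two literals corresponds to the `Or` of them. Either result is only a candidate superset, so the original predicate still has to be rechecked on the returned rows.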

Soundness

The derived condition never requires a trigram a matching string could lack, so no real match is dropped. Three rules keep it that way. Requirements come from the index's own tokenizer, so sub-trigram runs contribute nothing. Character classes and case-insensitive folds are treated as a single unknown character, because the index's normalization disagrees with Unicode case folding: `(?i)k` also matches U+212A (the Kelvin sign). Patterns with no derivable trigram (`a.b`, `.*`) are left to a full scan.
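The conservative stance can be illustrated with a much smaller derivation than the HIR walk described above. The hypothetical `required_trigrams` below handles only ASCII alphanumeric literals, `.`, and `.*`, and gives up on anything else; it is a sketch of the soundness rule, not the PR's algorithm (which, per the description, treats classes as a single unknown character instead of bailing out).

```rust
/// Hypothetical helper: trigrams that every match of `pattern` must contain,
/// or `None` when no trigram can be required and a full scan is needed.
///
/// Supported syntax (an assumption of this sketch): ASCII alphanumeric
/// literals, `.` and `.*`. Any other syntax returns `None`, which is always
/// sound because it requires nothing.
fn required_trigrams(pattern: &str) -> Option<Vec<String>> {
    let chars: Vec<char> = pattern.chars().collect();
    // Literal runs, split wherever an unknown character or gap may occur.
    let mut runs: Vec<Vec<char>> = vec![Vec::new()];
    for (i, &c) in chars.iter().enumerate() {
        if c.is_ascii_alphanumeric() {
            runs.last_mut().unwrap().push(c);
        } else if c == '.' || (c == '*' && i > 0 && chars[i - 1] == '.') {
            // `.` or `.*`: nothing is known here, so start a new run.
            runs.push(Vec::new());
        } else {
            // Unsupported syntax (for example `d*`, where `d` is optional).
            return None;
        }
    }
    // Only full 3-character windows inside a run are required; runs shorter
    // than three characters contribute nothing.
    let required: Vec<String> = runs
        .iter()
        .flat_map(|run| run.windows(3))
        .map(|w| w.iter().collect())
        .collect();
    if required.is_empty() {
        None
    } else {
        Some(required)
    }
}
```

With this sketch, `a.b`, `.*`, and `ab.cd` all yield `None`, while `zqxwvu.*needlexyz` yields only trigrams that lie wholly inside one literal. No trigram spans the `.*` gap, since a matching string need not contain one.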

Why infix LIKE

A plain-literal regexp_like(col, 'foo') is rewritten to col LIKE '%foo%' by the optimizer before it reaches the index, so without infix-LIKE the most common "regex" query would not accelerate. The LIKE is translated to a loose regex for candidate generation only; the original LIKE stays as the recheck filter, so the candidate regex need only be a sound superset.
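One way such a loose translation could look is sketched below. `like_to_candidate_regex` is a hypothetical name, and the mapping is an assumption chosen to be obviously sound, not the PR's actual translation: anything that is not a plain alphanumeric character is widened to `.*`, so escape sequences can only make the candidate regex match more.

```rust
/// Hypothetical translation of a LIKE pattern into a regex that matches a
/// superset of what the LIKE matches. The result is intended for candidate
/// generation only; the original LIKE must still be rechecked.
fn like_to_candidate_regex(like: &str) -> String {
    let mut out = String::new();
    for c in like.chars() {
        match c {
            // `%` matches any run of characters, including none.
            '%' => out.push_str(".*"),
            // `_` matches exactly one character.
            '_' => out.push('.'),
            // Alphanumerics carry no regex meaning and are copied as-is.
            c if c.is_alphanumeric() => out.push(c),
            // Everything else (escape characters, punctuation) is widened to
            // "anything", which loses precision but never a match.
            _ => out.push_str(".*"),
        }
    }
    out
}
```

For example, `%foo%bar%` becomes `.*foo.*bar.*`, from which the trigrams of `foo` and `bar` can be required. The widening in the last arm is deliberate: precision is recovered by the recheck, whereas a too-strict candidate regex would silently drop rows.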

Benchmark

`cargo bench -p lance --bench regex_ngram`, 200k rows, before (main) vs. after:

| Query | Before | After | Change |
|---|---|---|---|
| `regexp_match(doc, 'zqxwvu.*needlexyz')` | 45.8 ms | 8.2 ms | -82% |
| `regexp_match(doc, '(zqxwvu\|qwerasdf\|needlexyz)')` | 51.7 ms | 11.6 ms | -77% |
| `regexp_match(doc, 'zqxwvu')` (rewritten to LIKE) | 36.6 ms | 8.0 ms | -78% |
| `regexp_match(doc, 'a.b')` (non-accelerable) | 77.8 ms | 76.2 ms | within noise |

Testing

Unit tests cover the regex-to-trigram derivation and the regex-flags folding. Index-level search tests cover AND across `.*`, alternation union, NULL exclusion, and an absent trigram. A multi-fragment end-to-end scan test asserts correct results and index engagement for `regexp_like`, `regexp_match`, and LIKE, plus a case-insensitive query that must fall back to a full recheck. A regression test ensures non-accelerable patterns still return all correct matches via full recheck. No binding changes are needed: Python/Java pass these filter strings through the existing scan API, so acceleration applies transparently.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

github-actions · 2026-06-06T18:32:46Z

Important

This PR touches the Lance format specification.

Substantive changes to the format specification — the .proto definitions
and the spec docs under docs/src/format/ — require a PMC vote before merge.
Minor edits such as typo fixes, wording, or formatting are excluded; use your
judgment.

If this is a meaningful format change:

1. Start a vote following the Lance community voting process. Format specification modifications need 3 binding +1 votes (excluding the proposer), held on GitHub Discussions, with a minimum voting period of 1 week.
2. Once the vote passes, link the completed vote in this PR. It should not be merged until the vote is linked.

codecov · 2026-06-06T19:32:33Z

Codecov Report

❌ Patch coverage is 95.61043% with 32 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/scalar/ngram/ngram_regex.rs | 94.66% | 20 Missing ⚠️ |
| rust/lance-index/src/scalar/expression.rs | 93.07% | 5 Missing and 4 partials ⚠️ |
| rust/lance-index/src/scalar/fmindex.rs | 0.00% | 3 Missing ⚠️ |


The ngram index previously accelerated only `contains(col, 'substr')`. This extends it to regular-expression predicates (`regexp_like` / `regexp_match`) and infix LIKE patterns (`col LIKE '%foo%'`), which until now fell through to a full table scan.

Following Postgres `pg_trgm` and Russ Cox's Google Code Search, a pattern is compiled into a boolean trigram condition that is a necessary condition for any string to match. The condition is evaluated against the inverted index (AND maps to posting-list intersection, OR to union) to produce a candidate superset, and the scan rechecks the true predicate to drop false positives. The single invariant is soundness: the condition never requires a trigram a matching string could lack, so no real match is dropped.

Patterns from which no trigram can be derived (for example `a.b`, `.*`, or case-insensitive matches) are left to a full scan instead of being routed to the index. A plain-literal `regexp_like(col, 'foo')` is rewritten to `col LIKE '%foo%'` by the optimizer before it reaches the index, so infix LIKE is accelerated through the same machinery. The original LIKE is kept as the recheck filter, so the regex used for candidate generation only needs to be a sound superset.

On a 200k-row benchmark the accelerated queries are 4-5.5x faster (a selective `foo.*bar` drops from 46ms to 8ms), while non-accelerable patterns stay at full-scan cost with no regression.

Coverage includes unit tests for the regex-to-trigram derivation, index-level search tests, a multi-fragment end-to-end scan test, and a regression test ensuring non-accelerable patterns still return correct results.

Closes lance-format#7130

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds a unit test for `apply_regex_flags` (inline-flag folding, plus rejection of unrecognized and non-literal flag arguments) and an end-to-end scanner assertion that a case-insensitive `regexp_like(col, 'PAT', 'i')` still returns correct results through the full-scan recheck. Also renames the misleading `anchored_literal` benchmark case to `plain_literal` (the pattern has no anchors; it is rewritten to an infix LIKE) and adds a trailing newline to the ngram docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The spell checker flagged the deliberate regex test fragments `fo` / `ba` and the spelling variant `unparseable`. Reword the affected comments, mark the two lines that must keep the literal regex strings with `spellchecker:disable-line`, and switch `unparseable` to `unparsable` (including the renamed test). No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The feature commit added the `regex-syntax` dependency to `lance-index` but updated only the root `Cargo.lock`. The `python/` crate is a separate Cargo workspace whose lint and build steps run with `--locked`, so its lockfile must also list `regex-syntax` under `lance-index` or `--locked` fails. The package itself was already present (a transitive dep of `regex`); only the missing dependency edge is added. The `java/lance-jni` workspace does not run with `--locked`, so it is left untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

claude Bot reviewed Jun 6, 2026


github-actions Bot added labels on Jun 6, 2026: A-index (Vector index, linalg, tokenizer), A-deps (Dependency updates), A-format (On-disk format: protos and format spec docs), enhancement (New feature or request), A-python (Python bindings)

Vova Kolmakov and others added 4 commits June 7, 2026 20:23

wombatu-kun force-pushed the issue/7130-ngram-regex branch from 7998f7e to 552595f on June 7, 2026 13:32

wombatu-kun wants to merge 4 commits into lance-format:main from wombatu-kun:issue/7130-ngram-regex
