feat(memwal): implement read-path shard pruning for MemWAL queries #7122

Open
beinan wants to merge 5 commits into lance-format:main from beinan:user/beinan/memwal-read-path-shard-pruning

@beinan beinan commented Jun 5, 2026

Summary

This PR implements read-path shard pruning for MemWAL datasets.

When a query's filter contains an equality (=) or IN predicate on the designated sharding column (derived from the dataset's ShardingSpec), the query planner evaluates the target shard IDs using the sharding transform (e.g., identity value matches or bucket Murmur3 hashes). It then calls collect_for_shards() to open and scan only the matching shard subdirectories under _mem_wal/, skipping irrelevant directories entirely.
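
The pruning decision described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code in this PR: the `Filter` enum is a hypothetical stand-in for DataFusion's `Expr`, and `target_values` and its `Option` return convention are names invented for the example. What it tries to show is the conservative fallback: a filter that does not constrain the sharding column yields `None`, meaning every shard must still be scanned.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a filter expression (the PR operates on DataFusion's `Expr`).
#[derive(Debug, Clone)]
enum Filter {
    Eq { column: String, value: i64 },
    In { column: String, values: Vec<i64> },
    And(Box<Filter>, Box<Filter>),
    /// Any predicate this sketch does not understand (ranges, OR, functions, ...).
    Other,
}

/// Returns the set of sharding-column values the filter can match,
/// or `None` when the filter does not constrain the sharding column.
fn target_values(filter: &Filter, shard_col: &str) -> Option<BTreeSet<i64>> {
    match filter {
        Filter::Eq { column, value } if column == shard_col => {
            Some(BTreeSet::from([*value]))
        }
        Filter::In { column, values } if column == shard_col => {
            Some(values.iter().copied().collect())
        }
        // For a conjunction, a constraint on either side is enough to prune;
        // constraints on both sides are intersected.
        Filter::And(l, r) => match (target_values(l, shard_col), target_values(r, shard_col)) {
            (Some(a), Some(b)) => Some(a.intersection(&b).copied().collect()),
            (Some(a), None) | (None, Some(a)) => Some(a),
            (None, None) => None,
        },
        // Unknown predicates, or predicates on other columns: do not prune.
        _ => None,
    }
}
```

Under an identity transform the returned values would correspond directly to shard field values; under a bucket transform each value would first be hashed to a bucket. An empty set means the filter is contradictory and no shard needs to be opened.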

Key Changes

  1. Metadata Recording (data_source.rs): Enrich ShardSnapshot with each shard's partition column values (shard_field_values: HashMap<String, Vec<u8>>), derived from the transaction manifest.
  2. Pruning Analyzer (shard_pruning.rs): Extract literal values from DataFusion expressions (&Expr), coerce them to the sharding column's schema type, and compute the target shards for both Identity and Bucket sharding specifications.
  3. Planner Wiring (collector.rs, planner.rs): Change LsmScanPlanner to call collect_pruned(filter) instead of unconditionally loading all shards via collect().
  4. Validation and Tests (planner.rs, shard_pruning.rs): Add unit tests for pruning evaluation and an integration test verifying only the single correct shard subdirectory is read during active queries.
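
To make the Bucket case concrete, here is a hedged sketch of how a literal might be mapped to a bucket and compared against a shard's recorded field values. The hash is the standard 32-bit Murmur3 function. Everything else is an assumption made for illustration: the seed of 0, the 8-byte little-endian encoding of integers, the sign-bit mask, the encoding of `shard_field_values` entries as little-endian bucket ids, and the names `bucket_for_i64` and `shard_matches`. The PR's own `hash_scalar_to_bucket` helper may differ in any of these details.

```rust
use std::collections::HashMap;

/// Standard MurmurHash3 x86 32-bit variant.
fn murmur3_32(data: &[u8], seed: u32) -> u32 {
    const C1: u32 = 0xcc9e_2d51;
    const C2: u32 = 0x1b87_3593;
    let mut h = seed;
    let mut chunks = data.chunks_exact(4);
    for c in &mut chunks {
        let k = u32::from_le_bytes([c[0], c[1], c[2], c[3]])
            .wrapping_mul(C1)
            .rotate_left(15)
            .wrapping_mul(C2);
        h = (h ^ k).rotate_left(13).wrapping_mul(5).wrapping_add(0xe654_6b64);
    }
    // Mix in the 1..=3 trailing bytes, if any.
    let tail = chunks.remainder();
    if !tail.is_empty() {
        let mut k = 0u32;
        for (i, b) in tail.iter().enumerate() {
            k |= u32::from(*b) << (8 * i);
        }
        h ^= k.wrapping_mul(C1).rotate_left(15).wrapping_mul(C2);
    }
    // Finalization (avalanche) step.
    h ^= data.len() as u32;
    h ^= h >> 16;
    h = h.wrapping_mul(0x85eb_ca6b);
    h ^= h >> 13;
    h = h.wrapping_mul(0xc2b2_ae35);
    h ^= h >> 16;
    h
}

/// Assumed convention: hash the i64 as 8 little-endian bytes with seed 0,
/// clear the sign bit, then reduce modulo `num_buckets` (must be non-zero).
fn bucket_for_i64(value: i64, num_buckets: u32) -> u32 {
    let h = murmur3_32(&value.to_le_bytes(), 0);
    (h & 0x7fff_ffff) % num_buckets
}

/// A shard is kept if its recorded value for `field` equals one of the target buckets.
/// Shards with no recorded value are also kept, since they cannot be ruled out.
fn shard_matches(
    shard_field_values: &HashMap<String, Vec<u8>>,
    field: &str,
    target_buckets: &[u32],
) -> bool {
    match shard_field_values.get(field) {
        Some(bytes) => target_buckets
            .iter()
            .any(|b| b.to_le_bytes().as_slice() == bytes.as_slice()),
        None => true,
    }
}
```

Keeping shards that lack a recorded value trades some pruning for safety: a shard is skipped only when its metadata positively excludes it.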

Test plan

  • All 380 existing MemWAL tests pass.
  • Added 6 unit tests in shard_pruning.rs covering bucket, identity, and unsharded pruning logic as well as literal coercion.
  • Added 2 integration tests in planner.rs verifying correct shard selection and checking that collect_pruned skips unrelated directories under _mem_wal/ when a query runs.
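
The coercion tests matter because integer literals in SQL filters often arrive as 64-bit values, while the sharding column may be narrower. Hashing or comparing the literal at the wrong width would select the wrong shard. The sketch below shows one way such a check could look. `ColumnType`, `Scalar`, `Coerced`, and `coerce_literal` are hypothetical names, not the PR's API, and treating an out-of-range literal as "never matches" is an assumption; falling back to scanning all shards would be an equally safe choice.

```rust
/// Hypothetical subset of column types a sharding column might have.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Int32,
    Int64,
}

/// Hypothetical typed scalar produced by coercion.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Scalar {
    Int32(i32),
    Int64(i64),
}

/// Outcome of coercing a filter literal to the sharding column's type.
#[derive(Debug, PartialEq)]
enum Coerced {
    Value(Scalar),
    /// The literal is out of range for the column, so the predicate matches no row.
    NeverMatches,
}

/// Narrow a 64-bit literal to the column type before it is hashed or compared.
fn coerce_literal(lit: i64, ty: ColumnType) -> Coerced {
    match ty {
        ColumnType::Int64 => Coerced::Value(Scalar::Int64(lit)),
        ColumnType::Int32 => match i32::try_from(lit) {
            Ok(v) => Coerced::Value(Scalar::Int32(v)),
            Err(_) => Coerced::NeverMatches,
        },
    }
}
```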

🤖 Generated with Claude Code

@github-actions github-actions Bot added the enhancement New feature or request label Jun 5, 2026
codecov Bot commented Jun 5, 2026

Codecov Report

❌ Patch coverage is 93.45510% with 43 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...lance/src/dataset/mem_wal/scanner/shard_pruning.rs | 92.26% | 19 Missing and 8 partials ⚠️ |
| rust/lance/src/dataset/mem_wal/sharding.rs | 54.83% | 12 Missing and 2 partials ⚠️ |
| ...ust/lance/src/dataset/mem_wal/scanner/collector.rs | 94.73% | 2 Missing ⚠️ |

@github-actions github-actions Bot added A-python Python bindings A-java Java bindings + JNI labels Jun 5, 2026
Beinan Wang added 3 commits June 5, 2026 12:45
When a query filter contains an equality or IN predicate on the sharding
column, the scan planner now evaluates the filter values through the
sharding transform (bucket/identity) and skips shards whose field values
do not match. This avoids scanning data in non-matching shards entirely.

Key changes:
- Add shard_field_values to ShardSnapshot for read-path access
- Add hash_scalar_to_bucket and source_column_for_field helpers to sharding.rs
- New shard_pruning module with filter-to-shard-id evaluation
- Wire collect_pruned into LsmScanPlanner with type coercion for SQL literals
- Add with_sharding_spec to LsmScanner and LsmDataSourceCollector

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>

Run cargo fmt and address clippy::collapsible_if warning
in collector.rs by merging nested if-let chains.

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>

The ShardSnapshot struct gained a new shard_field_values field for
read-path shard pruning, but the constructors in python/src/mem_wal.rs
and java/lance-jni/src/mem_wal.rs were not updated.

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
@beinan beinan force-pushed the user/beinan/memwal-read-path-shard-pruning branch from 70ada29 to 908a635 Compare June 5, 2026 19:45
Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
@beinan beinan force-pushed the user/beinan/memwal-read-path-shard-pruning branch from 908a635 to 784b9e1 Compare June 5, 2026 20:26
… empty snapshots

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
@beinan beinan force-pushed the user/beinan/memwal-read-path-shard-pruning branch from f4b943e to a2961e6 Compare June 5, 2026 21:13
@beinan beinan marked this pull request as ready for review June 6, 2026 00:13

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.
