feat: add index segment_seq metadata #7048

Open
jackye1995 wants to merge 8 commits into lance-format:main from jackye1995:jack/index-segment-sequence-number

@jackye1995
Contributor

@jackye1995 jackye1995 commented Jun 2, 2026

Adds optional, name-scoped segment_seq metadata to index segments and assigns it in the commit/rebase layer. The change stores LogicalIndexMetadata.max_segment_seq in the manifest index section so sequence numbers are not reused after the highest physical segment for an index name is removed.

This backfills legacy segments on the next segment commit, uses the single writer feature flag FLAG_INDEX_SEGMENT_SEQ for both physical segment_seq and logical high-water metadata, and preserves existing Rust/Python/Java describe/build call patterns by assigning sequence numbers at commit time.
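
To make the description above concrete, here is a minimal sketch of commit-time assignment against a per-name high-water mark. It is not the PR's implementation: `Segment`, `HighWater`, and `assign_segment_seqs` are hypothetical names, and two details are assumptions (numbering starts at 1, and legacy segments are backfilled in slice order).

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a physical index segment; not the real Lance type.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub index_name: String,
    /// `None` models a legacy segment written before `segment_seq` existed.
    pub segment_seq: Option<u64>,
}

/// Hypothetical per-name high-water marks, modelling the role of
/// `LogicalIndexMetadata.max_segment_seq`.
#[derive(Debug, Default)]
pub struct HighWater {
    pub max_segment_seq: HashMap<String, u64>,
}

/// Assigns a sequence number to every segment that lacks one.
/// Assumption: numbering starts at 1 and backfill follows slice order.
pub fn assign_segment_seqs(segments: &mut [Segment], high_water: &mut HighWater) {
    // Fold already-assigned numbers into the high-water mark first, so a
    // new number can never collide with an existing one.
    for seg in segments.iter() {
        if let Some(seq) = seg.segment_seq {
            let max = high_water
                .max_segment_seq
                .entry(seg.index_name.clone())
                .or_insert(0);
            *max = (*max).max(seq);
        }
    }
    // Then hand out fresh numbers above the mark. The mark only grows, so
    // removing the highest segment later does not free its number.
    for seg in segments.iter_mut() {
        if seg.segment_seq.is_none() {
            let max = high_water
                .max_segment_seq
                .entry(seg.index_name.clone())
                .or_insert(0);
            *max += 1;
            seg.segment_seq = Some(*max);
        }
    }
}
```

Because the mark is stored separately from the segments, dropping the segment with the highest number and committing a new one yields the next unused number rather than a reused one.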

Discussion: #7044


@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the A-python (Python bindings), A-java (Java bindings + JNI), and A-format (On-disk format: protos and format spec docs) labels Jun 2, 2026
@github-actions
Contributor

github-actions Bot commented Jun 2, 2026

Important

This PR touches the Lance format specification.

Substantive changes to the format specification — the .proto definitions
and the spec docs under docs/src/format/ — require a PMC vote before merge.
Minor edits such as typo fixes, wording, or formatting are excluded; use your
judgment.

If this is a meaningful format change:

  • Start a vote following the Lance community voting process.
    Format specification modifications need 3 binding +1 votes (excluding the
    proposer), held on GitHub Discussions, with a minimum voting period of 1 week.
  • Once the vote passes, link the completed vote in this PR. It should not be
    merged until the vote is linked.

@github-actions github-actions Bot added the enhancement New feature or request label Jun 2, 2026
@codecov

codecov Bot commented Jun 2, 2026

@LuQQiu
Contributor

LuQQiu commented Jun 2, 2026

Several high-level comments:

  1. This PR does not have max_segment_seq. Is that expected?
  2. Backfill happens when a new segment arrives. Do we want to support backfill when no new segment arrives? How about re-committing the exact old segments, i.e. a no-op commit that still triggers backfill?

Contributor

@wjones127 wjones127 left a comment

The overall idea seems fine, but I have a couple of blocking questions.

Comment thread rust/lance-table/src/feature_flags.rs Outdated
Comment on lines +23 to +26
/// Index metadata uses segment_seq.
pub const FLAG_INDEX_SEGMENT_SEQ: u64 = 64;
/// Index section stores logical high-water marks for segment_seq assignment.
pub const FLAG_INDEX_SEGMENT_SEQ_HIGH_WATER: u64 = 128;
Contributor

issue(blocking): FLAG_INDEX_SEGMENT_SEQ_HIGH_WATER isn't documented in the protobuf.

Also, do we even need two separate flags? It seems like we could just have one flag for this and require both to be set. What do you think?

Contributor Author

Done. I collapsed this back to a single writer flag: FLAG_INDEX_SEGMENT_SEQ now covers both physical IndexMetadata.segment_seq and logical LogicalIndexMetadata.max_segment_seq, and FLAG_UNKNOWN is back to 128.

The format docs and the vote/design discussion were updated with the single-flag compatibility model.
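
As a rough illustration of the single-flag compatibility model, the sketch below shows how a writer might decide whether it understands a manifest's writer feature flags. The constant values 64 and 128 come from this thread. The `can_write` helper and the "every bit below the first unknown flag is understood" rule are assumptions for illustration, as is the idea that a library predating this change treats 64 as its first unknown flag.

```rust
/// Values quoted in the review thread.
pub const FLAG_INDEX_SEGMENT_SEQ: u64 = 64;
pub const FLAG_UNKNOWN: u64 = 128;

/// Hypothetical check: a writer may write only if every set flag bit is
/// below the first flag it does not know about.
/// `first_unknown_flag` is a parameter so that older writers can be modelled.
pub fn can_write(writer_feature_flags: u64, first_unknown_flag: u64) -> bool {
    writer_feature_flags < first_unknown_flag
}
```

Under these assumptions, `can_write(FLAG_INDEX_SEGMENT_SEQ, FLAG_UNKNOWN)` is true for a writer built with this change, while a writer whose first unknown flag is 64 gets false for the same manifest and should refuse to write. One flag is enough here because both `segment_seq` and `max_segment_seq` are protected by the same rejection.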

Comment thread protos/table.proto
Comment on lines +312 to +313
// Metadata about one logical index name.
message LogicalIndexMetadata {
Contributor

praise: I'm in favor of adding this. It's been a long time coming.

Eventually, we should move index details over to this, so there's just one copy per index.

Also, people have requested other fields like a top-level UUID and created_at date. I think we can add those in a future follow up.

Comment thread rust/lance/src/index.rs Outdated
Comment on lines +6380 to +6387
assert_ne!(
    dataset.manifest.writer_feature_flags & FLAG_INDEX_SEGMENT_SEQ,
    0
);
assert_ne!(
    dataset.manifest.writer_feature_flags & FLAG_INDEX_SEGMENT_SEQ_HIGH_WATER,
    0
);
Contributor

question(blocking): It's not clear to me from these tests: how does a user opt in to this segment sequence flag? We can't enable it by default, because then users can't downgrade their Lance version. Any previous Lance library will (or should) error when trying to write to the dataset, given that it won't recognize the writer feature flags.

Contributor Author

The flag is not enabled by a plain dataset write. It is set only once the manifest contains committed index segment sequence metadata, either physical segment_seq or logical max_segment_seq.

So the behavior is:

  • Existing datasets without this metadata keep the flag clear and remain writable by older writers.
  • A new writer enables the flag when it commits index segment metadata and writes segment_seq / logical_indexes.
  • After that point, older writers should reject writes because they do not understand the writer feature flag and could otherwise drop the metadata.

I also added a test assertion that a freshly written dataset has FLAG_INDEX_SEGMENT_SEQ clear before index segments are committed, and set after segment commit.
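
The behavior described above amounts to deriving the flag from what the manifest contains instead of setting it unconditionally. A minimal sketch under that reading follows. `IndexSection` and `writer_flags_for` are hypothetical names, and the struct is a simplified stand-in for the manifest's index section, not the real Lance type.

```rust
/// Value quoted in the review thread.
pub const FLAG_INDEX_SEGMENT_SEQ: u64 = 64;

/// Hypothetical, simplified view of the index section of a manifest.
#[derive(Default)]
pub struct IndexSection {
    /// One entry per physical segment; `None` means no segment_seq recorded.
    pub segment_seqs: Vec<Option<u64>>,
    /// One entry per logical index name that carries max_segment_seq.
    pub logical_max_segment_seqs: Vec<u64>,
}

/// Sets FLAG_INDEX_SEGMENT_SEQ only when sequence metadata is present,
/// leaving all other flag bits untouched.
pub fn writer_flags_for(section: &IndexSection, other_flags: u64) -> u64 {
    let has_seq_metadata = section.segment_seqs.iter().any(|s| s.is_some())
        || !section.logical_max_segment_seqs.is_empty();
    if has_seq_metadata {
        other_flags | FLAG_INDEX_SEGMENT_SEQ
    } else {
        other_flags & !FLAG_INDEX_SEGMENT_SEQ
    }
}
```

With an empty `IndexSection`, the flag bit stays clear, which matches the case of a freshly written dataset that older writers can still modify. Once any segment carries a `segment_seq` or any logical entry carries a high-water mark, the bit is set.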

@jackye1995 jackye1995 force-pushed the jack/index-segment-sequence-number branch from 85211bc to 9182881 Compare June 6, 2026 07:04
@github-actions github-actions Bot added the A-deps Dependency updates label Jun 6, 2026

Labels

  • A-deps: Dependency updates
  • A-format: On-disk format: protos and format spec docs
  • A-java: Java bindings + JNI
  • A-python: Python bindings
  • enhancement: New feature or request

3 participants