feat: add index segment_seq metadata#7048
Conversation
|
Important This PR touches the Lance format specification. Substantive changes to the format specification — the If this is a meaningful format change:
|
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
|
Several high level comments
|
wjones127
left a comment
There was a problem hiding this comment.
The overall idea seems fine, but I have a couple of blocking questions
| /// Index metadata uses segment_seq. | ||
| pub const FLAG_INDEX_SEGMENT_SEQ: u64 = 64; | ||
| /// Index section stores logical high-water marks for segment_seq assignment. | ||
| pub const FLAG_INDEX_SEGMENT_SEQ_HIGH_WATER: u64 = 128; |
There was a problem hiding this comment.
issue(blocking): FLAG_INDEX_SEGMENT_SEQ_HIGH_WATER isn't documented in the protobuf.
Also, do we even need two separate flags? It seems like we could just have one flag for this and require both to be set. What do you think?
There was a problem hiding this comment.
Done. I collapsed this back to a single writer flag: FLAG_INDEX_SEGMENT_SEQ now covers both physical IndexMetadata.segment_seq and logical LogicalIndexMetadata.max_segment_seq, and FLAG_UNKNOWN is back to 128.
The format docs and the vote/design discussion were updated with the single-flag compatibility model.
| // Metadata about one logical index name. | ||
| message LogicalIndexMetadata { |
There was a problem hiding this comment.
praise: I'm in favor of adding this. It's been a long-time coming.
Eventually, we should move index details over to this, so there's just one copy per index.
Also, people have requested other fields like a top-level UUID and created_at date. I think we can add those in a future follow up.
| assert_ne!( | ||
| dataset.manifest.writer_feature_flags & FLAG_INDEX_SEGMENT_SEQ, | ||
| 0 | ||
| ); | ||
| assert_ne!( | ||
| dataset.manifest.writer_feature_flags & FLAG_INDEX_SEGMENT_SEQ_HIGH_WATER, | ||
| 0 | ||
| ); |
There was a problem hiding this comment.
question(blocking): It's not clear to me from these test—how does a user opt-in to this segment sequence flag? We can't enable it by default, because that means that users can't downgrade their Lance version—any previous Lance library will (or should) error when trying to write to it, given it won't recognize the writer feature flags.
There was a problem hiding this comment.
The flag is not enabled by a plain dataset write. It is set only once the manifest contains committed index segment sequence metadata, either physical segment_seq or logical max_segment_seq.
So the behavior is:
- Existing datasets without this metadata keep the flag clear and remain writable by older writers.
- A new writer enables the flag when it commits index segment metadata and writes
segment_seq/logical_indexes. - After that point, older writers should reject writes because they do not understand the writer feature flag and could otherwise drop the metadata.
I also added a test assertion that a freshly written dataset has FLAG_INDEX_SEGMENT_SEQ clear before index segments are committed, and set after segment commit.
85211bc to
9182881
Compare
Adds optional, name-scoped
segment_seqmetadata to index segments and assigns it in the commit/rebase layer. The change storesLogicalIndexMetadata.max_segment_seqin the manifest index section so sequence numbers are not reused after the highest physical segment for an index name is removed.This backfills legacy segments on the next segment commit, uses the single writer feature flag
FLAG_INDEX_SEGMENT_SEQfor both physicalsegment_seqand logical high-water metadata, and preserves existing Rust/Python/Java describe/build call patterns by assigning sequence numbers at commit time.Discussion: #7044