[kernel-spark] Support CDC + schema tracking log in v2 #6801
Open
PorridgeSwim wants to merge 3 commits into
This was referenced May 15, 2026
a78a4ac to f74ee6f
murali-db pushed a commit that referenced this pull request May 16, 2026
## 🥞 Stacked PR

Use this [link](https://github.com/delta-io/delta/pull/6570/files) to review incremental changes.

- [stack/SparkMetadataAdapter](#6546) [[Files changed](https://github.com/delta-io/delta/pull/6546/files)] [MERGED]
- [stack/RefactorMetadataTrackingLog](#6550) [[Files changed](https://github.com/delta-io/delta/pull/6550/files)] [MERGED]
- [stack/RefactorDeltaSourceMetadataEvolutionSupport](#6562) [[Files changed](https://github.com/delta-io/delta/pull/6562/files)] [MERGED]
- [stack/MetadataEvolutionHandler2](#6563) [[Files changed](https://github.com/delta-io/delta/pull/6563/files)] [MERGED]
- [**stack/NonAdditiveSchemaEvolution2**](#6570) [[Files changed](https://github.com/delta-io/delta/pull/6570/files)]
- [stack/NonAdditiveSchemaEvolution3](#6697) [[Files changed](https://github.com/delta-io/delta/pull/6697/files/b7f6c8ebfc0882e7e2cc580f09f376be23a8d43d..dbb6246c14be1ab7f017ad9fc26455ae599ee676)]
- [stack/consecutiveSchemaChangesMerger](#6698) [[Files changed](https://github.com/delta-io/delta/pull/6698/files/dbb6246c14be1ab7f017ad9fc26455ae599ee676..4bf2fa3fa828bcab0b56c4c26ca51ee9cc40b482)]
- [stack/SchemaTrackingWithCDC](#6801) [[Files changed](https://github.com/delta-io/delta/pull/6801/files/4bf2fa3fa828bcab0b56c4c26ca51ee9cc40b482..a78a4ac2bc9a52605278a36b98804230258c12a2)]
- [stack/V1V2MixTest](#6759) [[Files changed](https://github.com/delta-io/delta/pull/6759/files/7f9b7f2724b2245ab7380908616303cf7ea95fca..e146cdc9ebb0572e8b0a928cc6dd3bfdc198d984)]

---------

#### Which Delta project/connector is this regarding?

- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

PR 5/7 in the non-additive schema evolution for V2 streaming connector stack. Wire schema tracking into V2's analysis path so the analyzed plan reflects the persisted (evolved) schema instead of the live snapshot schema.

- `DeltaAnalysis.verifyDeltaSourceSchemaLocation`: extend the duplicate-schema-location check to also visit `StreamingRelationV2`, keyed on the V2 `Table.name`.
- `SparkTable`: open `DeltaSourceMetadataTrackingLog` once during construction (gated on `mergeConsecutiveSchemaChanges`) and seed `SchemaProvider` from the persisted metadata, so analysis-time `schema()` matches what the stream will read at runtime.
- `ApplyV2ReadOptions` (renamed from `ApplyV2Streaming`): generalize the CDC-only rebuild to also fire when `schemaTrackingLocation` arrives via `extraOptions` on the catalog `readStream.table()` path; rebuild `SparkTable` with merged options so the schema-log lookup actually fires.
- `MetadataEvolutionHandler.getMetadataTrackingLogForMicroBatchStream`: V2 port of V1's helper, reused by `SparkTable` (analysis) and `SparkScan` (execution).

## How was this patch tested?

`SparkTableTest`, `MetadataEvolutionHandlerTest`, `ApplyV2ReadOptionsSuite`. Unified `DeltaV2SourceSchemaEvolutionSuite` updated.

## Does this PR introduce _any_ user-facing changes?

No.
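To make the duplicate-schema-location check concrete, here is a minimal sketch of the idea only. Every name in it (`SchemaLocationCheckSketch`, `StreamingSource`, `verifyUniqueSchemaLocations`) is hypothetical and not taken from the Delta code base, and it assumes the rule is simply "no two sources in one query may share a schema tracking location", with each source identified by its table name.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; these names are illustrative, not Delta APIs.
class SchemaLocationCheckSketch {

  /** Stand-in for one streaming source found in the analyzed plan. */
  static final class StreamingSource {
    final String tableName;              // key identifying the source (assumed)
    final String schemaTrackingLocation; // null when the option is not set

    StreamingSource(String tableName, String schemaTrackingLocation) {
      this.tableName = tableName;
      this.schemaTrackingLocation = schemaTrackingLocation;
    }
  }

  /** Throws if two sources in the same query share a schema tracking location. */
  static void verifyUniqueSchemaLocations(List<StreamingSource> sources) {
    Map<String, String> ownerByLocation = new HashMap<>();
    for (StreamingSource source : sources) {
      if (source.schemaTrackingLocation == null) {
        continue; // sources without schema tracking are not checked
      }
      // putIfAbsent returns the earlier owner when the location is already taken.
      String previousOwner =
          ownerByLocation.putIfAbsent(source.schemaTrackingLocation, source.tableName);
      if (previousOwner != null) {
        throw new IllegalStateException(
            "Schema tracking location " + source.schemaTrackingLocation
                + " is used by both " + previousOwner + " and " + source.tableName);
      }
    }
  }
}
```

Under these assumptions, extending the check to another relation type amounts to collecting more `StreamingSource` entries before the same uniqueness pass runs.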
f74ee6f to 89923c7
# Conflicts: # spark/v2/src/test/java/io/delta/spark/internal/v2/read/MetadataEvolutionHandlerTest.java
89923c7 to 4aeacfb
murali-db pushed a commit that referenced this pull request May 16, 2026
…6697)

## 🥞 Stacked PR

Use this [link](https://github.com/delta-io/delta/pull/6697/files) to review incremental changes.

- [stack/SparkMetadataAdapter](#6546) [[Files changed](https://github.com/delta-io/delta/pull/6546/files)] [MERGED]
- [stack/RefactorMetadataTrackingLog](#6550) [[Files changed](https://github.com/delta-io/delta/pull/6550/files)] [MERGED]
- [stack/RefactorDeltaSourceMetadataEvolutionSupport](#6562) [[Files changed](https://github.com/delta-io/delta/pull/6562/files)] [MERGED]
- [stack/MetadataEvolutionHandler2](#6563) [[Files changed](https://github.com/delta-io/delta/pull/6563/files)] [MERGED]
- [stack/NonAdditiveSchemaEvolution2](#6570) [[Files changed](https://github.com/delta-io/delta/pull/6570/files)] [MERGED]
- [**stack/NonAdditiveSchemaEvolution3**](#6697) [[Files changed](https://github.com/delta-io/delta/pull/6697/files)]
- [stack/consecutiveSchemaChangesMerger](#6698) [[Files changed](https://github.com/delta-io/delta/pull/6698/files/f96643aa3cc01e7f70cc13a18b82dc27f277f11d..f612628ad931ec35c237801109f01b6fbd1379f7)]
- [stack/SchemaTrackingWithCDC](#6801) [[Files changed](https://github.com/delta-io/delta/pull/6801/files/f612628ad931ec35c237801109f01b6fbd1379f7..4aeacfb120b33e9cdfe124352290b72f53f7cf89)]
- [stack/V1V2MixTest](#6759) [[Files changed](https://github.com/delta-io/delta/pull/6759/files/f612628ad931ec35c237801109f01b6fbd1379f7..0c818ee431ab417a4f2ffbcc609930be09d25031)]

---------

#### Which Delta project/connector is this regarding?

- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

PR 6/7 in the non-additive schema evolution for V2 streaming connector stack. Wire `MetadataEvolutionHandler` into `SparkMicroBatchStream` and `SparkScan` so V2 streaming reads honor non-additive schema evolution (column rename/drop, type widening).

- `SparkMicroBatchStream`: take `metadataTrackingLog` + `metadataPath` as constructor inputs; when a persisted entry exists, layer it onto the freshly loaded `snapshotAtSourceInit` to derive `readSnapshotAtSourceInit` (mirrors V1's `readSnapshotDescriptor`). Integrate the schema-evolution barrier protocol into `latestOffset` / `commit` / `planInputPartitions`. Skip the on-restart schema-validation check when schema tracking is active; the schema-log evolution exception covers it.
- `SparkScan.toMicroBatchStream`: reload latest snapshot (the analysis-time `initialSnapshot` can be stale by stream start), open the tracking log via `MetadataEvolutionHandler.getMetadataTrackingLogForMicroBatchStream` with `mergeConsecutiveSchemaChanges=false` (the merger only runs at analysis), and pass it through with the checkpoint location.
- `SparkScan` option allow-list: move `allowSourceColumnDrop` / `Rename` / `TypeChange` out of the unsupported list now that they are honored.

## How was this patch tested?

`SparkMicroBatchStreamTest`, `MetadataEvolutionHandlerTest`. Unified suites (`DeltaV2SourceSchemaEvolutionSuite`, `TypeWideningStreamingV2SourceSuite`, `RemoveColumnMappingStreamingReadV2Suite`) move non-merger evolution scenarios from `shouldFailTests` to `shouldPassTests`; merger-dependent tests remain pending until PR 7/7.

## Does this PR introduce _any_ user-facing changes?

No.
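The "layer the persisted entry onto the snapshot" step can be pictured with a small sketch. This is an illustration under assumptions, not the connector's code: `ReadSnapshotSketch`, `SnapshotView`, and `deriveReadSnapshot` are hypothetical names, the schema is reduced to a JSON string, and the sketch assumes the derived view keeps the snapshot's version while taking its schema from the tracking log.

```java
import java.util.Optional;

// Hypothetical sketch; these names are illustrative, not Delta APIs.
class ReadSnapshotSketch {

  /** Minimal stand-in for what the stream reads with: a version and a schema. */
  static final class SnapshotView {
    final long version;
    final String schemaJson;

    SnapshotView(long version, String schemaJson) {
      this.version = version;
      this.schemaJson = schemaJson;
    }
  }

  /**
   * If the tracking log holds a persisted schema, read with that schema at the
   * snapshot's version (assumption); otherwise read with the snapshot as loaded.
   */
  static SnapshotView deriveReadSnapshot(
      SnapshotView snapshotAtSourceInit, Optional<String> persistedSchemaJson) {
    return persistedSchemaJson
        .map(schema -> new SnapshotView(snapshotAtSourceInit.version, schema))
        .orElse(snapshotAtSourceInit);
  }
}
```

The point of the indirection is that the live snapshot may already carry a newer schema than the one the stream last agreed to read with; the persisted entry, when present, wins.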
🥞 Stacked PR
Use this link to review incremental changes.
Which Delta project/connector is this regarding?
Description
Follow-up to the non-additive schema evolution stack: extend V2 streaming schema tracking to CDC reads so a CDC stream stops at metadata- or protocol-change commits with a barrier sentinel instead of either silently reading across the change or failing the read-compat check.
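As a rough illustration of the barrier behaviour described above, the sketch below indexes commits one at a time and then truncates the combined sequence at the first barrier. It is a simplification under stated assumptions: all names (`CdcBarrierSketch`, `Kind`, `IndexedFile`, `indexCommit`, `stopAtBarrier`) are hypothetical, metadata and protocol are collapsed into one descriptor string, a divergent commit yields only the barrier entry, and the barrier itself is kept when truncating.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; these names are illustrative, not Delta APIs.
class CdcBarrierSketch {

  enum Kind { BASE, FILE, END, BARRIER }

  static final class IndexedFile {
    final long version;
    final Kind kind;
    final String path; // null for sentinels

    IndexedFile(long version, Kind kind, String path) {
      this.version = version;
      this.kind = kind;
      this.path = path;
    }
  }

  /**
   * Index one commit. commitDescriptor is null when the commit carries no
   * metadata or protocol action. A commit that diverges from the descriptor
   * captured at source init yields a single barrier; otherwise the usual
   * BASE + files + END sequence is produced.
   */
  static List<IndexedFile> indexCommit(
      long version, String commitDescriptor, List<String> files, String sourceInitDescriptor) {
    List<IndexedFile> out = new ArrayList<>();
    if (commitDescriptor != null && !commitDescriptor.equals(sourceInitDescriptor)) {
      out.add(new IndexedFile(version, Kind.BARRIER, null));
      return out;
    }
    out.add(new IndexedFile(version, Kind.BASE, null));
    for (String path : files) {
      out.add(new IndexedFile(version, Kind.FILE, path));
    }
    out.add(new IndexedFile(version, Kind.END, null));
    return out;
  }

  /** Keep everything up to and including the first barrier (assumed inclusive). */
  static List<IndexedFile> stopAtBarrier(List<IndexedFile> indexed) {
    List<IndexedFile> out = new ArrayList<>();
    for (IndexedFile file : indexed) {
      out.add(file);
      if (file.kind == Kind.BARRIER) {
        break; // later commits in the same batch are dropped
      }
    }
    return out;
  }
}
```

With three commits where only the second changes the descriptor, the truncated sequence ends at the second commit's barrier, so the stream neither reads files written under the new schema nor has to fail a compatibility check to notice the change.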
- `SparkMicroBatchStream.collectAndBuildCDCIndexedFiles`: capture `Protocol` alongside `Metadata` while scanning a commit's actions, then call `MetadataEvolutionHandler.getMetadataOrProtocolChangeIndexedFileIterator` once the scan is done; when the commit diverges from source-init, return a singleton barrier (`METADATA_CHANGE_INDEX`) in place of BASE + files + END. Skip the on-commit `verifyMetadataAction` read-compat check when schema tracking is active; the barrier covers divergence. V1 splits this between `DeltaSourceCDCSupport.filterAndIndexDeltaLogs` (barrier injection) and `IndexedChangeFileSeq.filterFiles` (short-circuit); V2 collapses both into this single method.
- `SparkMicroBatchStream.applyPerCommitCDCAdmission`: pass barrier sentinels through admission unchanged (they can only appear as element 0 of the per-commit list).
- `SparkMicroBatchStream.getFileChangesForCDC`: apply `metadataEvolutionHandler.stopIndexedFileIteratorAtSchemaChangeBarrier` after end-boundary filtering so post-barrier commits in the same batch are truncated. V1's wrap lives in the shared `DeltaSource.getFileChangesWithRateLimit`; V2 places it inside the CDC-specific method because both `planInputPartitions` and the outer `getFileChangesWithRateLimit` reach the CDC iterator through here.
- `MetadataEvolutionHandler.getMergedConsecutiveMetadataChanges`: include `AddCDCFile` in the action set the merger walks and treat any non-null CDC column as a file action that stops the merger walk. Drops the `mergeActionSet` parameter (always `CDC_ACTION_SET` now) so non-CDC and CDC analysis share the same stop semantics. Resolves the `TODO` (#5319, "[Feature Request] Implement kernel-based dsv2 delta streaming source (M2: support advanced read options)") placeholder left by PR 7/7.
- `SparkMicroBatchStream.CDC_ACTION_SET`: promote to `public` so `MetadataEvolutionHandler` can reuse it.

How was this patch tested?
`SparkMicroBatchStreamCDCTest` adds barrier-emission cases:

- `testProcessCommit_emitsBarrierAtSchemaChange`: a metadata-only commit on a CDC stream with seeded tracking emits `[barrier, END]` from `processCommitToIndexedFilesForCDC`.
- `testGetFileChangesForCDC_emitsBarrierAtSchemaChange`: end-to-end check that the barrier fires and the iterator truncates across commits; it exercises metadata/protocol capture, barrier emission, admission passthrough, and cross-commit truncation in one path.

`MetadataEvolutionHandlerTest` extends the merger walk tests to cover CDC file actions stopping the walk. The unified `DeltaV2SourceSchemaEvolutionCDCSuiteBase` moves all evolution scenarios out of `shouldFailTests` into `shouldPassTests` so the CDC variants of the streaming schema-evolution suite now run alongside non-CDC.

Does this PR introduce any user-facing changes?
No.