[kernel-spark] Support schema tracking log in v2 analysis stage#6570
…seable in v2 (#6562) ## 🥞 Stacked PR Use this [link](https://github.com/delta-io/delta/pull/6562/files) to review incremental changes. - [stack/SparkMetadataAdapter](#6546) [[Files changed](https://github.com/delta-io/delta/pull/6546/files)] [MERGED] - [stack/RefactorMetadataTrackingLog](#6550) [[Files changed](https://github.com/delta-io/delta/pull/6550/files)] [MERGED] - [**stack/RefactorDeltaSourceMetadataEvolutionSupport**](#6562) [[Files changed](https://github.com/delta-io/delta/pull/6562/files)] - [stack/MetadataEvolutionHandler2](#6563) [[Files changed](https://github.com/delta-io/delta/pull/6563/files/ed92a0fa2051432b6bc5784034df0b7949bbfb98..e5b2c3295843ec85753e07dc0010aa5ccebaabb7)] - [stack/NonAdditiveSchemaEvolution2](#6570) [[Files changed](https://github.com/delta-io/delta/pull/6570/files/e5b2c3295843ec85753e07dc0010aa5ccebaabb7..7c66bf11a0f1b651cda32ed7f529f552dd9dbfcb)] - [stack/NonAdditiveSchemaEvolution3](#6697) [[Files changed](https://github.com/delta-io/delta/pull/6697/files/7c66bf11a0f1b651cda32ed7f529f552dd9dbfcb..14956ea304c93d2343ccd7eb89a112966f07f906)] - [stack/consecutiveSchemaChangesMerger](#6698) [[Files changed](https://github.com/delta-io/delta/pull/6698/files/14956ea304c93d2343ccd7eb89a112966f07f906..8101b335b892a6a5b6d6fe11f4a202d14102721c)] --------- #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description PR 3/7 in the non-additive schema evolution for V2 streaming connector stack. Refactor `DeltaSourceMetadataEvolutionSupport` and `DeltaColumnMapping` so the schema change detection logic can be called from V2 without depending on V1 instance state. **`DeltaSourceMetadataEvolutionSupport`:** - Extract instance methods (`validateAndResolveMetadataEvolution`, `checkColumnMappingSchemaChangesDuringStreaming`, `resolveMetadataEvolutionForCommitRange`, etc.) 
to companion object statics that accept explicit parameters instead of accessing V1 `DeltaSource` via `this` - V1 trait methods now delegate to the companion object statics **`DeltaColumnMapping`:** - Widen `hasNoColumnMappingSchemaChanges` from V1 `Metadata` to `AbstractMetadata` so V2 can call it via the adapter layer - Extract `assignColumnIdAndPhysicalNameToSchema(StructType, Map)` from `assignColumnIdAndPhysicalName(Metadata, Metadata, ...)` — needed for simulating column mapping upgrades during NoMapping-to-NameMapping transitions All changes are structural refactors with no behavioral change. ## How was this patch tested? Existing tests in `DeltaSourceSchemaEvolutionSuite` continue to pass. No behavioral changes. ## Does this PR introduce _any_ user-facing changes? No.
TimothyW553
left a comment
spark/v2/src/main/java/io/delta/spark/internal/v2/read/SparkMicroBatchStream.java:243-247 (not in this diff): what happens at restart if this snapshot schema differs from what SparkTable.schema() got from the tracking log?
(extraOptions.containsKey(DeltaOptions.SCHEMA_TRACKING_LOCATION) ||
  extraOptions.containsKey(DeltaOptions.SCHEMA_TRACKING_LOCATION_ALIAS)) &&
  !tableOptions.containsKey(DeltaOptions.SCHEMA_TRACKING_LOCATION) &&
  !tableOptions.containsKey(DeltaOptions.SCHEMA_TRACKING_LOCATION_ALIAS)
If extraOptions and the table both set schemaTrackingLocation to different values, the user's value is silently dropped. Error instead?
This is intentional. If the table already carries a schemaTrackingLocation, we treat that as the source of truth. In practice there's no conflict to surface: this rule is currently the only path that propagates schemaTrackingLocation onto SparkTable (via the merged-options rebuild below), so by the time tableOptions contains the key it was put there by us. The !tableOptions.containsKey(...) check is what keeps the rule idempotent — without it, we'd rebuild the StreamingRelationV2 on every pass.
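The idempotency argument in the reply above can be illustrated with a minimal sketch. It assumes the two option keys named in this thread (`schemaTrackingLocation` and its `schemaLocation` alias); the class, the `merge` helper, and its precedence rule are hypothetical stand-ins, not the actual `ApplyV2ReadOptions` code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for the rebuild predicate; not the actual Delta implementation. */
final class SchemaTrackingRebuildSketch {
  // Key names taken from this thread: schemaTrackingLocation and its schemaLocation alias.
  static final String LOCATION = "schemaTrackingLocation";
  static final String LOCATION_ALIAS = "schemaLocation";

  /** Case-insensitive copy, loosely imitating Spark's CaseInsensitiveStringMap. */
  static Map<String, String> caseInsensitive(Map<String, String> options) {
    Map<String, String> copy = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    copy.putAll(options);
    return copy;
  }

  static boolean hasLocation(Map<String, String> options) {
    Map<String, String> ci = caseInsensitive(options);
    return ci.containsKey(LOCATION) || ci.containsKey(LOCATION_ALIAS);
  }

  /** Fires only when the reader supplied a location that the table does not carry yet. */
  static boolean needsRebuild(Map<String, String> tableOptions, Map<String, String> extraOptions) {
    return hasLocation(extraOptions) && !hasLocation(tableOptions);
  }

  /** Folds reader options into the table options (the precedence here is an assumption). */
  static Map<String, String> merge(
      Map<String, String> tableOptions, Map<String, String> extraOptions) {
    Map<String, String> merged = caseInsensitive(extraOptions);
    merged.putAll(tableOptions);
    return merged;
  }

  public static void main(String[] args) {
    Map<String, String> table = new HashMap<>();
    Map<String, String> extra = new HashMap<>();
    extra.put("schemaTrackingLocation", "/tmp/schema-log");
    System.out.println(needsRebuild(table, extra));               // first analyzer pass
    System.out.println(needsRebuild(merge(table, extra), extra)); // second analyzer pass
  }
}
```

Once the merged options carry the key, the predicate stops firing, which is the property that keeps the analyzer rule from rebuilding the relation on every pass.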
// the runtime tableId (V1 uses the Delta UUID, V2 uses Kernel's snapshot id).
// `table.name` is path-aware ("delta.`/path`" for path-based, qualified name for
// catalog-based) and is sufficient to differentiate sources for the conflict check.
val tableId = table.name.replace(":", "").replace("/", "_")
table.name differs for path vs catalog access to the same table, so this misses the conflict V1 catches. Use the path?
Good catch, but this is fail-fast only — at runtime both V1 and V2 use the stable Delta UUID, and DeltaSourceMetadataTrackingLog re-validates, so the conflict still surfaces at first batch.
Getting the path here isn't clean: StreamingRelationV2.table is the generic DSv2 Table (only name()), and SparkTable lives in a module above DeltaAnalysis. A proper fix needs a SparkTable-side change — out of scope. Adding a TODO.
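To make the reviewer's concern concrete, here is a small sketch of the same sanitization applied to two names that could refer to one physical table. The table names are hypothetical examples, and the helper only mirrors the `replace` calls quoted above.

```java
/** Illustrative only: mirrors the name sanitization quoted in this thread. */
final class TableIdSketch {
  /** Drops ':' and turns '/' into '_', as in the snippet under review. */
  static String tableIdFromName(String tableName) {
    return tableName.replace(":", "").replace("/", "_");
  }

  public static void main(String[] args) {
    // Hypothetical names for one physical table reached two different ways.
    String byPath = tableIdFromName("delta.`/data/events`");
    String byCatalog = tableIdFromName("spark_catalog.default.events");
    // The ids differ, so an analysis-time duplicate check keyed on them cannot pair the sources.
    System.out.println(byPath + " vs " + byCatalog);
  }
}
```

Because the two ids differ, a conflict between path-based and catalog-based access would only be caught by the runtime re-validation mentioned in the reply.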
DeltaSQLConf
    .DELTA_STREAMING_ENABLE_SCHEMA_TRACKING_MERGE_CONSECUTIVE_CHANGES());
scala.Option<DeltaSourceMetadataTrackingLog> trackingLog =
    MetadataEvolutionHandler.getMetadataTrackingLogForMicroBatchStream(
Move this out of the constructor, into the streaming entry?
SparkTable's schema is fixed at construction, and Spark uses that schema in the logical plan as the expected read schema. We therefore have to resolve the correct read schema at the SparkTable level.
private def needsSchemaTrackingRebuild(
    table: SparkTable, extraOptions: CaseInsensitiveStringMap): Boolean = {
  val tableOptions = new CaseInsensitiveStringMap(table.getOptions)
  (extraOptions.containsKey(DeltaOptions.SCHEMA_TRACKING_LOCATION) ||
Let's turn this into a static method somewhere in a util?
Moved to MetadataEvolutionHandler.
// Keep this None to force the V2 path; we don't want to fall back to V1 here.
v1Relation = None)

// For catalog-loaded relations (readStream.table("foo")), TableCatalog.loadTable has no
I think this explanation is unnecessary here. Let's just use a one-line TODO.
Map<String, String> options,
DeltaSnapshotManager snapshotManager,
Engine engine,
Set<DeltaLogActionUtils.DeltaAction> mergeActionSet,
They will be used in #6698. Since I am aiming to merge them all this week, it is fine to keep them unused for now; there is no correctness issue.
Option<String> sourceMetadataPathOpt,
boolean mergeConsecutiveSchemaChanges) {
  String location =
      options.getOrDefault(
Should this be a case-insensitive lookup?
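For context on the question above: `getOrDefault` on a plain `java.util.Map` matches keys exactly, so a differently cased option name falls through to the default. The helper below is a hypothetical sketch of a case-insensitive lookup, not code from this PR. Spark's `CaseInsensitiveStringMap`, already used elsewhere in this diff, would be the more natural choice where it is available.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper showing a case-insensitive option lookup over a plain Map. */
final class CaseInsensitiveLookupSketch {
  /** Returns the first value whose key matches the requested key ignoring case, if any. */
  static Optional<String> get(Map<String, String> options, String key) {
    return options.entrySet().stream()
        .filter(e -> e.getKey().equalsIgnoreCase(key))
        .map(Map.Entry::getValue)
        .findFirst();
  }

  public static void main(String[] args) {
    Map<String, String> options = new HashMap<>();
    options.put("SchemaTrackingLocation", "/tmp/schema-log");
    // Exact-match lookup misses the differently cased key and yields the default.
    System.out.println(options.getOrDefault("schemaTrackingLocation", "<missing>"));
    // The case-insensitive helper finds it.
    System.out.println(get(options, "schemaTrackingLocation").orElse("<missing>"));
  }
}
```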
// consecutive metadata-only commits and writes the merged entry back to the durable schema
// log; the execution-time SparkMicroBatchStream then re-reads the same merged entry from the
// log via DeltaSourceMetadataTrackingLog.getCurrentTrackedMetadata.
SparkSession spark = SparkSession.active();
Could you put the reading-from-schema-tracking-log logic into a util method?
Moved to MetadataEvolutionHandler
// Override in CDC suites to add CDC-specific tests.
override protected def shouldFailTests: Set[String] = Set(
  // ========== Schema location validation ==========
  // TODO(#5319): Move tests to shouldPassTests as V2 schema tracking log support is implemented.
## 🥞 Stacked PR Use this [link](https://github.com/delta-io/delta/pull/6563/files) to review incremental changes. - [stack/SparkMetadataAdapter](#6546) [[Files changed](https://github.com/delta-io/delta/pull/6546/files)] [MERGED] - [stack/RefactorMetadataTrackingLog](#6550) [[Files changed](https://github.com/delta-io/delta/pull/6550/files)] [MERGED] - [stack/RefactorDeltaSourceMetadataEvolutionSupport](#6562) [[Files changed](https://github.com/delta-io/delta/pull/6562/files)] [MERGED] - [**stack/MetadataEvolutionHandler2**](#6563) [[Files changed](https://github.com/delta-io/delta/pull/6563/files)] - [stack/NonAdditiveSchemaEvolution2](#6570) [[Files changed](https://github.com/delta-io/delta/pull/6570/files/a20f1f3ab452a75fc954e15c57c17327e0cb9267..0e07f87285becd6be416450ae084df454d9c94a9)] - [stack/NonAdditiveSchemaEvolution3](#6697) [[Files changed](https://github.com/delta-io/delta/pull/6697/files/0e07f87285becd6be416450ae084df454d9c94a9..73e1aa7f4162a3e1480ffd2b88b9ca79d852f2fe)] - [stack/consecutiveSchemaChangesMerger](#6698) [[Files changed](https://github.com/delta-io/delta/pull/6698/files/73e1aa7f4162a3e1480ffd2b88b9ca79d852f2fe..5e5d260b64d45cc11bcfdb58e5aab1b2d2637b33)] - [stack/V1V2MixTest](#6759) [[Files changed](https://github.com/delta-io/delta/pull/6759/files/5e5d260b64d45cc11bcfdb58e5aab1b2d2637b33..738379713040986c74f98dbebfdc6c83ec1d3f16)] --------- #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description PR 4/7 in the non-additive schema evolution for V2 streaming connector stack. Introduce `MetadataEvolutionHandler`, a Java class that implements the V1 barrier protocol for schema evolution in the V2 connector. In V1 this logic lives in `DeltaSourceMetadataEvolutionSupport`, a Scala trait mixed into `DeltaSource` that accesses stream state via `this`. 
Since V2's `SparkMicroBatchStream` is Java and cannot use Scala trait mixins, `MetadataEvolutionHandler` receives all dependencies via constructor injection instead. The handler covers the full schema evolution lifecycle: - **Stream start**: eager metadata tracking log initialization on first batch - **Offset generation**: injects `METADATA_CHANGE_INDEX` / `POST_METADATA_CHANGE_INDEX` barrier sentinels into the file change iterator - **Pending schema offsets**: returns barrier offsets for in-progress schema changes - **Batch commit**: updates the schema log and throws `DELTA_STREAMING_METADATA_EVOLUTION` to trigger stream restart - **Batch planning on restart**: validates and re-initializes the schema log All detection logic delegates to the shared `DeltaSourceMetadataEvolutionSupport$` companion object statics (refactored in PR 3/7). V2-specific orchestration is limited to wiring the barrier protocol into the `CloseableIterator<IndexedFile>` pipeline and collecting metadata/protocol from Kernel commit ranges via `StreamingHelper`. Also extends `StreamingHelper` with `getMetadataAndProtocolForVersionRange` to collect metadata and protocol actions from a range of Kernel commits. ## How was this patch tested? Unit tests in `MetadataEvolutionHandlerTest.java` covering: barrier protocol (METADATA_CHANGE_INDEX / POST_METADATA_CHANGE_INDEX offset generation), tracking state transitions, initialization lifecycle, offset arithmetic, pending schema change handling, and commit-time evolution exception. ## Does this PR introduce _any_ user-facing changes? No.
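The "pending schema offsets" step of the barrier protocol described above can be sketched roughly as follows. This is a simplified model under stated assumptions: the sentinel values are placeholders, the `Offset` class and `pendingSchemaOffset` method are hypothetical, and the rule that a stream advances from `METADATA_CHANGE_INDEX` to `POST_METADATA_CHANGE_INDEX` at the same version (then holds there while the change is unresolved) is an assumption about the protocol, not a quote of `MetadataEvolutionHandler`.

```java
import java.util.Optional;

/** Simplified, hypothetical model of barrier offsets for an in-progress schema change. */
final class PendingSchemaOffsetSketch {
  // Placeholder sentinel values; the real constants are defined by Delta's source offset code.
  static final long METADATA_CHANGE_INDEX = -20L;
  static final long POST_METADATA_CHANGE_INDEX = -19L;

  static final class Offset {
    final long version;
    final long index;

    Offset(long version, long index) {
      this.version = version;
      this.index = index;
    }
  }

  /**
   * Returns the next barrier offset when the previous offset sits on a barrier,
   * or empty when normal offset generation should proceed.
   */
  static Optional<Offset> pendingSchemaOffset(Offset previous, boolean schemaChangeStillPending) {
    if (previous.index == METADATA_CHANGE_INDEX) {
      // Step over the barrier, staying on the same table version.
      return Optional.of(new Offset(previous.version, POST_METADATA_CHANGE_INDEX));
    }
    if (previous.index == POST_METADATA_CHANGE_INDEX && schemaChangeStillPending) {
      // Hold position until the schema log is updated and the stream restarts.
      return Optional.of(previous);
    }
    return Optional.empty();
  }
}
```

Holding the offset at the barrier is what lets the commit step update the schema log and fail the stream before any data written under the new schema is read.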
identifier = Some(ident),
v1Relation = None)

case s @ StreamingRelationV2(_, _, table: SparkTable, extraOptions, _, _, _, _)
I see a lot of overlap between cdc and schema evolution. Could you help reconcile them? We should only run schema augmentation once.
DeltaAction.ADD,
DeltaAction.REMOVE,
DeltaAction.METADATA,
DeltaAction.PROTOCOL,
Should we add this to the CDC path too? One of us (whoever has time) should make sure we test CDC + schema evolution cases.
I will create a new PR to support CDC + schema evolution and add a TODO in MetadataEvolutionHandler.
 * schemaTrackingLocation}/{@code schemaLocation} option was observed, so callers must rebuild the
 * table with the option folded in for its schema to be driven by the tracking log.
 */
public static boolean shouldPropagateSchemaTrackingToTable(
Should we add a test for this?
val rebuilt = if (table.getCatalogTable.isPresent) {
  new SparkTable(table.getIdentifier, table.getCatalogTable.get, merged)
} else {
  new SparkTable(table.getIdentifier, table.getTablePath.toString, merged)
Could we make sure table.getIdentifier is not null? Same for getTablePath.
There is a requireNonNull in SparkTable's constructor for both identifier and tablePath.
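For readers unfamiliar with the pattern, constructor-side null checks of the kind mentioned in the reply usually look like the sketch below. `TableHandleSketch` and its fields are hypothetical; only `java.util.Objects.requireNonNull` is a real API, and the actual `SparkTable` constructor may differ.

```java
import java.util.Objects;

/** Hypothetical class showing constructor-side null checks; not the real SparkTable. */
final class TableHandleSketch {
  final String identifier;
  final String tablePath;

  TableHandleSketch(String identifier, String tablePath) {
    // Fail fast with a descriptive message instead of a later, harder-to-trace NPE.
    this.identifier = Objects.requireNonNull(identifier, "identifier is null");
    this.tablePath = Objects.requireNonNull(tablePath, "tablePath is null");
  }
}
```

With checks like these in the constructor, callers that rebuild the table do not need to repeat the validation.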
…ark 4.2 New code from delta-io#6570 (schema tracking log in v2 analysis) uses a direct StreamingRelationV2 8-arg pattern match that breaks on Spark 4.2 where the case class has 9 parameters. Use StreamingRelationV2Shim instead. Co-authored-by: Isaac
…6697) ## 🥞 Stacked PR Use this [link](https://github.com/delta-io/delta/pull/6697/files) to review incremental changes. - [stack/SparkMetadataAdapter](#6546) [[Files changed](https://github.com/delta-io/delta/pull/6546/files)] [MERGED] - [stack/RefactorMetadataTrackingLog](#6550) [[Files changed](https://github.com/delta-io/delta/pull/6550/files)] [MERGED] - [stack/RefactorDeltaSourceMetadataEvolutionSupport](#6562) [[Files changed](https://github.com/delta-io/delta/pull/6562/files)] [MERGED] - [stack/MetadataEvolutionHandler2](#6563) [[Files changed](https://github.com/delta-io/delta/pull/6563/files)] [MERGED] - [stack/NonAdditiveSchemaEvolution2](#6570) [[Files changed](https://github.com/delta-io/delta/pull/6570/files)] [MERGED] - [**stack/NonAdditiveSchemaEvolution3**](#6697) [[Files changed](https://github.com/delta-io/delta/pull/6697/files)] - [stack/consecutiveSchemaChangesMerger](#6698) [[Files changed](https://github.com/delta-io/delta/pull/6698/files/f96643aa3cc01e7f70cc13a18b82dc27f277f11d..f612628ad931ec35c237801109f01b6fbd1379f7)] - [stack/SchemaTrackingWithCDC](#6801) [[Files changed](https://github.com/delta-io/delta/pull/6801/files/f612628ad931ec35c237801109f01b6fbd1379f7..4aeacfb120b33e9cdfe124352290b72f53f7cf89)] - [stack/V1V2MixTest](#6759) [[Files changed](https://github.com/delta-io/delta/pull/6759/files/f612628ad931ec35c237801109f01b6fbd1379f7..0c818ee431ab417a4f2ffbcc609930be09d25031)] --------- #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description PR 6/7 in the non-additive schema evolution for V2 streaming connector stack. Wire `MetadataEvolutionHandler` into `SparkMicroBatchStream` and `SparkScan` so V2 streaming reads honor non-additive schema evolution (column rename/drop, type widening). 
- `SparkMicroBatchStream`: take `metadataTrackingLog` + `metadataPath` as constructor inputs; when a persisted entry exists, layer it onto the freshly loaded `snapshotAtSourceInit` to derive `readSnapshotAtSourceInit` (mirrors V1's `readSnapshotDescriptor`). Integrate the schema-evolution barrier protocol into `latestOffset` / `commit` / `planInputPartitions`. Skip the on-restart schema-validation check when schema tracking is active — the schema-log evolution exception covers it. - `SparkScan.toMicroBatchStream`: reload latest snapshot (the analysis-time `initialSnapshot` can be stale by stream start), open the tracking log via `MetadataEvolutionHandler.getMetadataTrackingLogForMicroBatchStream` with `mergeConsecutiveSchemaChanges=false` (the merger only runs at analysis), and pass it through with the checkpoint location. - `SparkScan` option allow-list: move `allowSourceColumnDrop` / `Rename` / `TypeChange` out of the unsupported list now that they are honored. ## How was this patch tested? `SparkMicroBatchStreamTest`, `MetadataEvolutionHandlerTest`. Unified suites (`DeltaV2SourceSchemaEvolutionSuite`, `TypeWideningStreamingV2SourceSuite`, `RemoveColumnMappingStreamingReadV2Suite`) move non-merger evolution scenarios from `shouldFailTests` to `shouldPassTests`; merger-dependent tests remain pending until PR 7/7. ## Does this PR introduce _any_ user-facing changes? No.
## 🥞 Stacked PR Use this [link](https://github.com/delta-io/delta/pull/6698/files) to review incremental changes. - [stack/SparkMetadataAdapter](#6546) [[Files changed](https://github.com/delta-io/delta/pull/6546/files)] [MERGED] - [stack/RefactorMetadataTrackingLog](#6550) [[Files changed](https://github.com/delta-io/delta/pull/6550/files)] [MERGED] - [stack/RefactorDeltaSourceMetadataEvolutionSupport](#6562) [[Files changed](https://github.com/delta-io/delta/pull/6562/files)] [MERGED] - [stack/MetadataEvolutionHandler2](#6563) [[Files changed](https://github.com/delta-io/delta/pull/6563/files)] [MERGED] - [stack/NonAdditiveSchemaEvolution2](#6570) [[Files changed](https://github.com/delta-io/delta/pull/6570/files)] [MERGED] - [stack/NonAdditiveSchemaEvolution3](#6697) [[Files changed](https://github.com/delta-io/delta/pull/6697/files)] [MERGED] - [**stack/consecutiveSchemaChangesMerger**](#6698) [[Files changed](https://github.com/delta-io/delta/pull/6698/files)] - [stack/SchemaTrackingWithCDC](#6801) [[Files changed](https://github.com/delta-io/delta/pull/6801/files/e230b46c3acb772d6599662b7c5aaf17e3625498..1ed4903f1b06fd49533dad3a1cf25c9206aef2f3)] - [stack/V1V2MixTest](#6759) [[Files changed](https://github.com/delta-io/delta/pull/6759/files/e230b46c3acb772d6599662b7c5aaf17e3625498..e3c0d530a63150797b7b882fab2dad2070452683)] --------- #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description PR 7/7 in the non-additive schema evolution for V2 streaming connector stack. Implement V2's consecutive-schema-changes merger so analysis-time evolution matches V1: runs of consecutive metadata-only commits collapse to a single tracked entry at the latest version. Without the merger, each metadata-only commit produces its own pending schema offset. 
- `MetadataEvolutionHandler.getMergedConsecutiveMetadataChanges`: V2 port of V1's `DeltaSourceMetadataEvolutionSupport.getMergedConsecutiveMetadataChanges`. Walks Kernel commits forward via `CommitRangeImpl` + `StreamingHelper.getCommitActionsFromRangeUnsafe`; for each commit detects file actions (ADD/REMOVE) and metadata/protocol actions; stops on the first commit with a file action or with neither metadata nor protocol; emits a merged `PersistedMetadata` at the latest metadata-only version. - `MetadataEvolutionHandler.getMetadataTrackingLogForMicroBatchStream`: pass the merger lambda into `DeltaSourceMetadataTrackingLog.create` (was a `null` placeholder). - `DeltaSourceMetadataTrackingLog` (V1): extract `PersistedMetadata.toProtocolJson` helper so V2's merger can reuse the same protocol-JSON encoding. ## How was this patch tested? `MetadataEvolutionHandlerTest` covers merger walk semantics — stop-on-file-action, stop-on-no-metadata-or-protocol, multiple folded changes, protocol-only and combined updates. `DeltaSourceSchemaEvolutionSuite` adds parallel V1 tests for the same scenarios. Unified `DeltaV2SourceSchemaEvolutionSuite` moves the remaining merger-dependent scenarios (`consecutive schema evolutions`, `unblock with sql conf`, `streaming with a column mapping upgrade`) from `shouldFailTests` to `shouldPassTests`. ## Does this PR introduce _any_ user-facing changes? No.
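The stop conditions of the merger walk described above can be sketched in a few lines. This is a reduced model under stated assumptions: the `Commit` class is a hypothetical summary of a Kernel commit, and the sketch returns only the version the merged entry would land on, whereas the real merger also folds the metadata and protocol contents into a `PersistedMetadata`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalLong;

/** Reduced, hypothetical model of the consecutive-metadata-change merger walk. */
final class MergerWalkSketch {
  /** Summary of one commit: which kinds of actions it contains. */
  static final class Commit {
    final long version;
    final boolean hasFileAction;
    final boolean hasMetadata;
    final boolean hasProtocol;

    Commit(long version, boolean hasFileAction, boolean hasMetadata, boolean hasProtocol) {
      this.version = version;
      this.hasFileAction = hasFileAction;
      this.hasMetadata = hasMetadata;
      this.hasProtocol = hasProtocol;
    }
  }

  /** Walks commits in ascending order and returns the last metadata-only version reached. */
  static OptionalLong latestMergeableVersion(List<Commit> commits) {
    OptionalLong latest = OptionalLong.empty();
    for (Commit c : commits) {
      boolean metadataOrProtocol = c.hasMetadata || c.hasProtocol;
      // Stop on the first commit with a file action, or with neither metadata nor protocol.
      if (c.hasFileAction || !metadataOrProtocol) {
        break;
      }
      latest = OptionalLong.of(c.version);
    }
    return latest;
  }

  public static void main(String[] args) {
    List<Commit> commits = Arrays.asList(
        new Commit(5, false, true, false),   // metadata-only
        new Commit(6, false, false, true),   // protocol-only
        new Commit(7, true, true, false),    // file action: walk stops here
        new Commit(8, false, true, false));  // never reached
    System.out.println(latestMergeableVersion(commits));
  }
}
```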
## 🥞 Stacked PR Use this [link](https://github.com/delta-io/delta/pull/6801/files) to review incremental changes. - [stack/SparkMetadataAdapter](#6546) [[Files changed](https://github.com/delta-io/delta/pull/6546/files)] [MERGED] - [stack/RefactorMetadataTrackingLog](#6550) [[Files changed](https://github.com/delta-io/delta/pull/6550/files)] [MERGED] - [stack/RefactorDeltaSourceMetadataEvolutionSupport](#6562) [[Files changed](https://github.com/delta-io/delta/pull/6562/files)] [MERGED] - [stack/MetadataEvolutionHandler2](#6563) [[Files changed](https://github.com/delta-io/delta/pull/6563/files)] [MERGED] - [stack/NonAdditiveSchemaEvolution2](#6570) [[Files changed](https://github.com/delta-io/delta/pull/6570/files)] [MERGED] - [stack/NonAdditiveSchemaEvolution3](#6697) [[Files changed](https://github.com/delta-io/delta/pull/6697/files)] [MERGED] - [stack/consecutiveSchemaChangesMerger](#6698) [[Files changed](https://github.com/delta-io/delta/pull/6698/files)] [MERGED] - [**stack/SchemaTrackingWithCDC**](#6801) [[Files changed](https://github.com/delta-io/delta/pull/6801/files)] - [stack/V1V2MixTest](#6759) [[Files changed](https://github.com/delta-io/delta/pull/6759/files)] --------- #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Follow-up to the non-additive schema evolution stack: extend V2 streaming schema tracking to CDC reads so a CDC stream stops at metadata- or protocol-change commits with a barrier sentinel instead of either silently reading across the change or failing the read-compat check. - `SparkMicroBatchStream.collectAndBuildCDCIndexedFiles`: capture `Protocol` alongside `Metadata` while scanning a commit's actions, then call `MetadataEvolutionHandler.getMetadataOrProtocolChangeIndexedFileIterator` once the scan is done; when the commit diverges from source-init, return a singleton barrier (`METADATA_CHANGE_INDEX`) in place of BASE + files + END. 
Skip the on-commit `verifyMetadataAction` read-compat check when schema tracking is active — the barrier covers divergence. V1 splits this between `DeltaSourceCDCSupport.filterAndIndexDeltaLogs` (barrier injection) and `IndexedChangeFileSeq.filterFiles` (short-circuit); V2 collapses both into this single method. - `SparkMicroBatchStream.applyPerCommitCDCAdmission`: pass barrier sentinels through admission unchanged (they can only appear as element 0 of the per-commit list). - `SparkMicroBatchStream.getFileChangesForCDC`: apply `metadataEvolutionHandler.stopIndexedFileIteratorAtSchemaChangeBarrier` after end-boundary filtering so post-barrier commits in the same batch are truncated. V1's wrap lives in the shared `DeltaSource.getFileChangesWithRateLimit`; V2 places it inside the CDC-specific method because both `planInputPartitions` and the outer `getFileChangesWithRateLimit` reach the CDC iterator through here. - `MetadataEvolutionHandler.getMergedConsecutiveMetadataChanges`: include `AddCDCFile` in the action set the merger walks and treat any non-null CDC column as a file action that stops the merger walk. Drops the `mergeActionSet` parameter (always `CDC_ACTION_SET` now) so non-CDC and CDC analysis share the same stop semantics. Resolves the TODO(#5319) placeholder left by PR 7/7. - `SparkMicroBatchStream.CDC_ACTION_SET`: promote to `public` so `MetadataEvolutionHandler` can reuse it. ## How was this patch tested? `SparkMicroBatchStreamCDCTest` adds barrier-emission cases: - `testProcessCommit_emitsBarrierAtSchemaChange`: a metadata-only commit on a CDC stream with seeded tracking emits `[barrier, END]` from `processCommitToIndexedFilesForCDC`. - `testGetFileChangesForCDC_emitsBarrierAtSchemaChange`: end-to-end check that the barrier fires and the iterator truncates across commits — exercises metadata/protocol capture, barrier emission, admission passthrough, and cross-commit truncation in one path. 
`MetadataEvolutionHandlerTest` extends the merger walk tests to cover CDC file actions stopping the walk. The unified `DeltaV2SourceSchemaEvolutionCDCSuiteBase` moves all evolution scenarios out of `shouldFailTests` into `shouldPassTests` so the CDC variants of the streaming schema-evolution suite now run alongside non-CDC. ## Does this PR introduce _any_ user-facing changes? No. Co-authored-by: Timothy Wang <timothy.art@gmail.com>
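The cross-commit truncation described for `getFileChangesForCDC` can be pictured with the sketch below. It assumes the barrier entry itself is kept and everything after it is dropped; the sentinel value, the `IndexedFile` class, and the list-based signature are placeholders rather than the actual `stopIndexedFileIteratorAtSchemaChangeBarrier` implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical list-based model of truncating indexed files at a schema-change barrier. */
final class StopAtBarrierSketch {
  // Placeholder sentinel value; the real constant is defined by Delta's source offset code.
  static final long METADATA_CHANGE_INDEX = -20L;

  static final class IndexedFile {
    final long version;
    final long index;

    IndexedFile(long version, long index) {
      this.version = version;
      this.index = index;
    }
  }

  /** Keeps entries up to and including the first barrier; a no-op when no barrier exists. */
  static List<IndexedFile> stopAtBarrier(Iterable<IndexedFile> files) {
    List<IndexedFile> out = new ArrayList<>();
    for (IndexedFile f : files) {
      out.add(f);
      if (f.index == METADATA_CHANGE_INDEX) {
        break; // later commits in the same batch are dropped
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<IndexedFile> files = Arrays.asList(
        new IndexedFile(3, 0),
        new IndexedFile(3, 1),
        new IndexedFile(4, METADATA_CHANGE_INDEX),
        new IndexedFile(5, 0));
    System.out.println(stopAtBarrier(files).size());
  }
}
```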
## 🥞 Stacked PR
Use this link to review incremental changes.

#### Which Delta project/connector is this regarding?

## Description
PR 5/7 in the non-additive schema evolution for V2 streaming connector stack.

Wire schema tracking into V2's analysis path so the analyzed plan reflects the persisted (evolved) schema instead of the live snapshot schema.

- `DeltaAnalysis.verifyDeltaSourceSchemaLocation`: extend the duplicate-schema-location check to also visit `StreamingRelationV2`, keyed on the V2 `Table.name`.
- `SparkTable`: open `DeltaSourceMetadataTrackingLog` once during construction (gated on `mergeConsecutiveSchemaChanges`) and seed `SchemaProvider` from the persisted metadata, so analysis-time `schema()` matches what the stream will read at runtime.
- `ApplyV2ReadOptions` (renamed from `ApplyV2Streaming`): generalize the CDC-only rebuild to also fire when `schemaTrackingLocation` arrives via `extraOptions` on the catalog `readStream.table()` path; rebuild `SparkTable` with merged options so the schema-log lookup actually fires.
- `MetadataEvolutionHandler.getMetadataTrackingLogForMicroBatchStream`: V2 port of V1's helper, reused by `SparkTable` (analysis) and `SparkScan` (execution).

## How was this patch tested?
`SparkTableTest`, `MetadataEvolutionHandlerTest`, `ApplyV2ReadOptionsSuite`. Unified `DeltaV2SourceSchemaEvolutionSuite` updated.

## Does this PR introduce _any_ user-facing changes?
No.
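The core idea of this PR, that the analysis-time schema comes from the tracking log when an entry exists and from the live snapshot otherwise, can be reduced to the sketch below. The class, method, and string-typed schemas are hypothetical simplifications; the real code seeds a `SchemaProvider` from persisted metadata rather than returning a string.

```java
import java.util.Optional;

/** Hypothetical reduction of how the analysis-time read schema is chosen. */
final class AnalysisSchemaSketch {
  /**
   * Prefers the schema persisted in the tracking log, so the analyzed plan matches what the
   * stream reads at runtime; falls back to the live snapshot schema when nothing is persisted.
   */
  static String resolveReadSchema(Optional<String> persistedSchema, String snapshotSchema) {
    return persistedSchema.orElse(snapshotSchema);
  }

  public static void main(String[] args) {
    // Made-up schema strings for illustration.
    String snapshot = "id INT, name STRING, email STRING";
    System.out.println(resolveReadSchema(Optional.of("id INT, name STRING"), snapshot));
    System.out.println(resolveReadSchema(Optional.empty(), snapshot));
  }
}
```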