[kernel-spark] Support non-additive schema evolution in v2 connector #6697
d8362c3 to
d344128
Range-diff: stack/NonAdditiveSchemaEvolution2 (d8362c3 -> d344128)
d344128 to
c4fca36
Range-diff: stack/NonAdditiveSchemaEvolution2 (d344128 -> c4fca36)
a35f8b0 to
0148020
Range-diff: stack/NonAdditiveSchemaEvolution2 (a35f8b0 -> 0148020)
0148020 to
db16b9f
Range-diff: stack/NonAdditiveSchemaEvolution2 (0148020 -> db16b9f)
## 🥞 Stacked PR
Use this [link](https://github.com/delta-io/delta/pull/6546/files) to review incremental changes.
- [**stack/SparkMetadataAdapter**](#6546) [[Files changed](https://github.com/delta-io/delta/pull/6546/files)]
- [stack/RefactorMetadataTrackingLog](#6550) [[Files changed](https://github.com/delta-io/delta/pull/6550/files/9271a6262f7a2615b977de0319c7238044b7d0a9..8378d33acda70a34a109b35173a968a4b3401ec1)]
- [stack/RefactorDeltaSourceMetadataEvolutionSupport](#6562) [[Files changed](https://github.com/delta-io/delta/pull/6562/files/8378d33acda70a34a109b35173a968a4b3401ec1..90365431b12640de181446ec9c2033fb1b143b03)]
- [stack/MetadataEvolutionHandler2](#6563) [[Files changed](https://github.com/delta-io/delta/pull/6563/files/28bb7021adb12b055e1b281fdfee0ab48a8732ac..578870181fa81a9146b2fa907244e350ffcabb52)]
- [stack/NonAdditiveSchemaEvolution2](#6570) [[Files changed](https://github.com/delta-io/delta/pull/6570/files/578870181fa81a9146b2fa907244e350ffcabb52..c025b7c3c386e8d46d6142d0727dce95582bb0ef)]
- [stack/NonAdditiveSchemaEvolution3](#6697) [[Files changed](https://github.com/delta-io/delta/pull/6697/files/c025b7c3c386e8d46d6142d0727dce95582bb0ef..db16b9fa80a80c105430c93589126ba8b828458f)]
- [stack/consecutiveSchemaChangesMerger](#6698) [[Files changed](https://github.com/delta-io/delta/pull/6698/files/0148020ffe11e7b079e99fa8c5189a19c354f2be..9a360aa819f20d78b5361b2e997d24433fb793d5)]

---------

#### Which Delta project/connector is this regarding?
- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description
PR 1/7 in the non-additive schema evolution for V2 streaming connector stack.

The shared V1 Scala utilities (`DeltaColumnMapping`, `DeltaSourceMetadataEvolutionSupport`) operate on `AbstractMetadata`/`AbstractProtocol`, but V2 holds Kernel types. This PR creates two adapter classes that bridge the gap:
- `KernelMetadataAdapter`: Kernel `Metadata` → `AbstractMetadata` (schema conversion via `SchemaUtils`, partition columns and configuration converted to Scala collections)
- `KernelProtocolAdapter`: Kernel `Protocol` → `AbstractProtocol` (maps reader/writer features to `Option[Set[String]]`)

Also adds `columnMappingMode` and `partitionSchema` to the `AbstractMetadata` trait. V1's `Metadata` already had these fields; the trait just didn't expose them.

## How was this patch tested?
Unit tests in `ActionAdaptersTest.java`: table-features protocol, legacy protocol, full metadata round-trip, null optional fields, and null constructor rejection.

## Does this PR introduce _any_ user-facing changes?
No.
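
The adapter idea from the Description above can be sketched roughly as follows. This is only an illustration: `AbstractMetadataLike`, `KernelMetadataLike`, and `MetadataAdapterSketch` are hypothetical stand-ins, not the real Delta or Kernel classes, and the schema conversion that the real adapter performs via `SchemaUtils` is left out.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for the shared abstraction that the V1 utilities accept.
interface AbstractMetadataLike {
  String schemaString();
  List<String> partitionColumns();
  Map<String, String> configuration();
}

// Hypothetical stand-in for a Kernel-side metadata object.
final class KernelMetadataLike {
  final String schemaJson;
  final List<String> partitionColumnNames;
  final Map<String, String> config;

  KernelMetadataLike(
      String schemaJson, List<String> partitionColumnNames, Map<String, String> config) {
    this.schemaJson = schemaJson;
    this.partitionColumnNames = partitionColumnNames;
    this.config = config;
  }
}

// Adapter: wraps the Kernel-side object and exposes it through the shared interface.
final class MetadataAdapterSketch implements AbstractMetadataLike {
  private final KernelMetadataLike kernel;

  MetadataAdapterSketch(KernelMetadataLike kernel) {
    // Reject null eagerly, mirroring the "null constructor rejection" test case.
    this.kernel = Objects.requireNonNull(kernel, "kernel metadata is null");
  }

  @Override
  public String schemaString() {
    return kernel.schemaJson;
  }

  @Override
  public List<String> partitionColumns() {
    // Defensive immutable copy so callers cannot mutate the wrapped object.
    return List.copyOf(kernel.partitionColumnNames);
  }

  @Override
  public Map<String, String> configuration() {
    return Map.copyOf(kernel.config);
  }
}
```

With an adapter of this shape, code written against the shared interface does not need to know which connector produced the metadata.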
db16b9f to
ea7cfeb
Range-diff: stack/NonAdditiveSchemaEvolution2 (db16b9f -> ea7cfeb)
6854223 to
13395a7
Range-diff: stack/NonAdditiveSchemaEvolution2 (6854223 -> 13395a7)
…#6550)

## 🥞 Stacked PR
Use this [link](https://github.com/delta-io/delta/pull/6550/files) to review incremental changes.
- [stack/SparkMetadataAdapter](#6546) [[Files changed](https://github.com/delta-io/delta/pull/6546/files)] [MERGED]
- [**stack/RefactorMetadataTrackingLog**](#6550) [[Files changed](https://github.com/delta-io/delta/pull/6550/files)]
- [stack/RefactorDeltaSourceMetadataEvolutionSupport](#6562) [[Files changed](https://github.com/delta-io/delta/pull/6562/files/953f137f8c4ce46d8b8a9605b0c7bed898e30df4..027984b6edcbad0f4731e560425c2ed9bcf8fc27)]
- [stack/MetadataEvolutionHandler2](#6563) [[Files changed](https://github.com/delta-io/delta/pull/6563/files/027984b6edcbad0f4731e560425c2ed9bcf8fc27..ada845895139edcb2727a87b39922c8e16837a99)]
- [stack/NonAdditiveSchemaEvolution2](#6570) [[Files changed](https://github.com/delta-io/delta/pull/6570/files/ada845895139edcb2727a87b39922c8e16837a99..476762fde7b9cb9b9bc3e416c86a260cd29806ed)]
- [stack/NonAdditiveSchemaEvolution3](#6697) [[Files changed](https://github.com/delta-io/delta/pull/6697/files/476762fde7b9cb9b9bc3e416c86a260cd29806ed..13395a7f2a49db4962091e8ee919bebdab5bd4e2)]
- [stack/consecutiveSchemaChangesMerger](#6698) [[Files changed](https://github.com/delta-io/delta/pull/6698/files/13395a7f2a49db4962091e8ee919bebdab5bd4e2..f22ba063eaf35ab69d653a2d5faefdc52f35eab5)]

---------

#### Which Delta project/connector is this regarding?
- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description
PR 2/7 in the non-additive schema evolution for V2 streaming connector stack.

Decouple `DeltaSourceMetadataTrackingLog` and `PersistedMetadata` from V1-specific types so the schema log can be reused by the V2 connector.
- Replace `SnapshotDescriptor` parameter in `create()` with plain `sourceTableId` and `sourceDataPath` strings
- Unify `PersistedMetadata.apply` to accept `AbstractMetadata`/`AbstractProtocol` instead of V1 `Metadata`/`Protocol`
- Extract the consecutive schema changes merger (V1-specific, depends on `DeltaLog`) out of the companion object into `DeltaSourceMetadataEvolutionSupport`, and inject it as a function parameter so V2 can provide its own implementation
- Remove `Protocol`'s `private` constructor modifier to allow construction from abstract protocol fields

All changes are structural refactors with no behavioral change.

## How was this patch tested?
Existing tests in `DeltaSourceSchemaEvolutionSuite` updated to use the new API. No behavioral changes.

## Does this PR introduce _any_ user-facing changes?
No.
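
The "inject it as a function parameter" refactor described above might look roughly like this in Java. The sketch is hypothetical throughout: `SchemaLogSketch` is not the real `DeltaSourceMetadataTrackingLog`, schemas are modeled as plain strings, and the merger signature is an assumption chosen for brevity.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical stand-in for the schema log; schemas are plain strings here.
final class SchemaLogSketch {
  private final String sourceTableId;
  private final String sourceDataPath;
  // Injected merger: returns a merged schema when consecutive changes can be collapsed,
  // or empty when the persisted schema should be used unchanged.
  private final Function<String, Optional<String>> consecutiveChangesMerger;

  private SchemaLogSketch(
      String sourceTableId,
      String sourceDataPath,
      Function<String, Optional<String>> consecutiveChangesMerger) {
    this.sourceTableId = sourceTableId;
    this.sourceDataPath = sourceDataPath;
    this.consecutiveChangesMerger = consecutiveChangesMerger;
  }

  // Takes plain strings plus the merger instead of a connector-specific descriptor.
  static SchemaLogSketch create(
      String sourceTableId,
      String sourceDataPath,
      Function<String, Optional<String>> consecutiveChangesMerger) {
    return new SchemaLogSketch(sourceTableId, sourceDataPath, consecutiveChangesMerger);
  }

  String sourceTableId() {
    return sourceTableId;
  }

  String sourceDataPath() {
    return sourceDataPath;
  }

  // Applies whichever merger the caller supplied.
  String currentSchema(String persistedSchema) {
    return consecutiveChangesMerger.apply(persistedSchema).orElse(persistedSchema);
  }
}
```

Each connector can then pass its own merger (or a no-op one) without the log class depending on connector-specific types.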
9a0ad72 to
14956ea
…seable in v2 (#6562)

## 🥞 Stacked PR
Use this [link](https://github.com/delta-io/delta/pull/6562/files) to review incremental changes.
- [stack/SparkMetadataAdapter](#6546) [[Files changed](https://github.com/delta-io/delta/pull/6546/files)] [MERGED]
- [stack/RefactorMetadataTrackingLog](#6550) [[Files changed](https://github.com/delta-io/delta/pull/6550/files)] [MERGED]
- [**stack/RefactorDeltaSourceMetadataEvolutionSupport**](#6562) [[Files changed](https://github.com/delta-io/delta/pull/6562/files)]
- [stack/MetadataEvolutionHandler2](#6563) [[Files changed](https://github.com/delta-io/delta/pull/6563/files/ed92a0fa2051432b6bc5784034df0b7949bbfb98..e5b2c3295843ec85753e07dc0010aa5ccebaabb7)]
- [stack/NonAdditiveSchemaEvolution2](#6570) [[Files changed](https://github.com/delta-io/delta/pull/6570/files/e5b2c3295843ec85753e07dc0010aa5ccebaabb7..7c66bf11a0f1b651cda32ed7f529f552dd9dbfcb)]
- [stack/NonAdditiveSchemaEvolution3](#6697) [[Files changed](https://github.com/delta-io/delta/pull/6697/files/7c66bf11a0f1b651cda32ed7f529f552dd9dbfcb..14956ea304c93d2343ccd7eb89a112966f07f906)]
- [stack/consecutiveSchemaChangesMerger](#6698) [[Files changed](https://github.com/delta-io/delta/pull/6698/files/14956ea304c93d2343ccd7eb89a112966f07f906..8101b335b892a6a5b6d6fe11f4a202d14102721c)]

---------

#### Which Delta project/connector is this regarding?
- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description
PR 3/7 in the non-additive schema evolution for V2 streaming connector stack.

Refactor `DeltaSourceMetadataEvolutionSupport` and `DeltaColumnMapping` so the schema change detection logic can be called from V2 without depending on V1 instance state.

**`DeltaSourceMetadataEvolutionSupport`:**
- Extract instance methods (`validateAndResolveMetadataEvolution`, `checkColumnMappingSchemaChangesDuringStreaming`, `resolveMetadataEvolutionForCommitRange`, etc.) to companion object statics that accept explicit parameters instead of accessing V1 `DeltaSource` via `this`
- V1 trait methods now delegate to the companion object statics

**`DeltaColumnMapping`:**
- Widen `hasNoColumnMappingSchemaChanges` from V1 `Metadata` to `AbstractMetadata` so V2 can call it via the adapter layer
- Extract `assignColumnIdAndPhysicalNameToSchema(StructType, Map)` from `assignColumnIdAndPhysicalName(Metadata, Metadata, ...)`; this is needed for simulating column mapping upgrades during NoMapping-to-NameMapping transitions

All changes are structural refactors with no behavioral change.

## How was this patch tested?
Existing tests in `DeltaSourceSchemaEvolutionSuite` continue to pass. No behavioral changes.

## Does this PR introduce _any_ user-facing changes?
No.
ebcd911 to
73e1aa7
| isTriggerAvailableNow = true; | ||
| } | ||
| private SnapshotImpl buildReadSnapshotFromPersistedMetadata(PersistedMetadata customMetadata) { |
Can you move this out of SparkMicroBatchStream? The class is already becoming complex, and this would fit better in a class dedicated to interacting with the schema tracking log.
Moved to MetadataEvolutionHandler
| if (endOffset.reservoirVersion() == version | ||
| && endOffset.index() == DeltaSourceOffset.BASE_INDEX()) { | ||
| return false; | ||
| return new CommitValidationResult(false, null, null); |
Here, you'll pass null, null to getMetadataOrProtocolChangeIndexedFileIterator, which doesn't seem particularly safe.
Those nulls have a real meaning ("no metadata/protocol action in this commit") and getMetadataOrProtocolChangeIndexedFileIterator is already written to handle them. I annotated the chain (CommitValidationResult fields → getMetadataOrProtocolChangeIndexedFileIterator → hasMetadataOrProtocolChangeComparedToStreamMetadata) with @Nullable so the contract is enforced by static analysis. I preferred this over Optional<> since Optional as a field type is discouraged in Java.
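
To illustrate the contract discussed in this thread, here is a minimal sketch of a result type whose nullable fields mean "no metadata/protocol action in this commit". The class and field names are hypothetical, actions are modeled as JSON strings, and a plain comment stands in for the `@Nullable` annotation so the block compiles without extra dependencies.

```java
import java.util.Optional;

// Hypothetical result holder; a null field means "this commit carried no such action".
final class ValidationResultSketch {
  final boolean shouldStop;
  final String metadataJson; // nullable: no metadata action in the commit
  final String protocolJson; // nullable: no protocol action in the commit

  ValidationResultSketch(boolean shouldStop, String metadataJson, String protocolJson) {
    this.shouldStop = shouldStop;
    this.metadataJson = metadataJson;
    this.protocolJson = protocolJson;
  }

  // Optional appears only at the accessor boundary, not as a field type.
  Optional<String> metadata() {
    return Optional.ofNullable(metadataJson);
  }

  Optional<String> protocol() {
    return Optional.ofNullable(protocolJson);
  }

  // Consumers must treat "both null" as a valid, meaningful state.
  boolean hasMetadataOrProtocolChange() {
    return metadataJson != null || protocolJson != null;
  }
}
```

Keeping the fields nullable and documenting what null means leaves the choice of `Optional` to the call sites that want it.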
| shouldTrackSchema = | ||
| !DeltaColumnMapping$.MODULE$.hasNoColumnMappingSchemaChanges( | ||
| new KernelMetadataAdapter(newMetadata), | ||
| new KernelMetadataAdapter(oldMetadata), | ||
| schemaReadOptions.allowUnsafeStreamingReadOnPartitionColumnChanges()); |
| // Loads a fresh snapshot as the baseline for schema change detection and table identity | ||
| // checks. SparkScan's initialSnapshot is from analysis time and may be stale by stream | ||
| // start/restart. | ||
| // Matches V1's DeltaDataSource.createSource() behavior. | ||
| Snapshot latestSnapshot = snapshotManager.loadLatestSnapshot(); |
I recall raising something similar on one of @zikangh's PRs but can't find it again.
I know we want to stay really close to V1 behavior, but reloading the snapshot mid-analysis/execution is something we should avoid in V2. Ideally we get a snapshot pinned during table resolution, used through analysis, and yes it may become stale. But that shouldn't be an issue because even here, the snapshot could become stale right after you load it.
Having a consistent snapshot throughout avoids a whole class of bugs and dedicated handling due to state going out of sync.
What's the issue here with using the initialSnapshot?
It's not directly related to your change as you're just moving the call, but still something worth discussing
The motivation for reloading was to match V1's behavior of letting users adopt additive schema changes without refreshing the DataFrame. But I agree — the DataFrame should stay pinned to its analysis-time snapshot, and requiring an explicit refresh to adopt schema changes is the more correct semantics. V1's behavior here isn't quite right.
| } else { | ||
| long startVersionForMetadataLogInit; | ||
| if (previousOffset.index() == DeltaSourceOffset.BASE_INDEX()) { | ||
| startVersionForMetadataLogInit = previousOffset.reservoirVersion() - 1; | ||
| } else { | ||
| startVersionForMetadataLogInit = previousOffset.reservoirVersion(); | ||
| } | ||
| if (metadataEvolutionHandler.shouldInitializeMetadataTrackingEagerly()) { | ||
| metadataEvolutionHandler.initializeMetadataTrackingAndExitStream( | ||
| startVersionForMetadataLogInit, | ||
| /* batchEndVersion= */ null, | ||
| /* alwaysFailUponLogInitialized= */ false); | ||
| } | ||
| checkReadIncompatibleSchemaChangeOnStreamStartOnce(startVersionForMetadataLogInit, null); | ||
| } |
V1 does not eager-init in this else branch (only the first-batch path does), so I think we can drop it.
This is a V1/V2 engine-behavior difference. On restart from a fully committed batch, DSv1's engine first calls getBatch on the previous batch (where V1 inits via validateAndInitMetadataLogForPlannedBatchesDuringStreamStart) before calling latestOffset for the next one. DSv2's engine skips that and calls latestOffset directly, so the init has to move here — otherwise the schema-tracking log stays uninitialized on restart for streams that picked it up after they were already running.
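
The start-version arithmetic in the snippet under discussion can be isolated into a small sketch. `BASE_INDEX` below is a hypothetical placeholder, not the real `DeltaSourceOffset.BASE_INDEX()` constant, and the offset is reduced to two plain `long` values.

```java
final class StartVersionSketch {
  // Hypothetical sentinel; the real value comes from DeltaSourceOffset.BASE_INDEX().
  static final long BASE_INDEX = -100L;

  // Mirrors the branch in the snippet: an offset sitting at the base index of a version
  // starts metadata-log initialization one version earlier; otherwise it starts at the
  // offset's own version.
  static long startVersionForMetadataLogInit(long reservoirVersion, long index) {
    return index == BASE_INDEX ? reservoirVersion - 1 : reservoirVersion;
  }
}
```

For example, with these placeholder values an offset at version 5 and `BASE_INDEX` yields 4, while an offset at version 5 and index 3 yields 5.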
| return new SnapshotImpl( | ||
| snapshotAtSourceInit.getDataPath(), | ||
| customMetadata.deltaCommitVersion(), | ||
| snapshotAtSourceInit.getLazyLogSegment(), | ||
| logReplay, | ||
| readProtocol, | ||
| readMetadata, | ||
| snapshotAtSourceInit.getCommitter(), | ||
| SnapshotQueryContext.forVersionSnapshot( | ||
| snapshotAtSourceInit.getDataPath().toString(), customMetadata.deltaCommitVersion()), | ||
| Optional.empty() /* inCommitTimestampOpt */); |
the version (customMetadata.deltaCommitVersion()) and LazyLogSegment (from snapshotAtSourceInit) come from different commits.
Good catch — they're from different commits. V1 sidesteps this by exposing only version/metadata/protocol via SnapshotDescriptor; V2 has no equivalent in Kernel, and all current consumers only read dataPath/metadata/protocol/version — none touch the log segment, so the mismatch is inert today.
Aligning the log segment would cost an extra loadSnapshotAt(persistedVersion) at every stream init, which V1 doesn't pay. Instead I replaced lazyLogSegment and lazyCrcInfo with traps that throw on access, so any future log-replay read fails loudly rather than silently against the wrong version.
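
A rough sketch of the "trap" idea described in this reply: the lazily evaluated field is replaced by a supplier that throws on access. `Supplier` stands in for Kernel's lazy wrapper here, and `SnapshotViewSketch` is a hypothetical type, not `SnapshotImpl`.

```java
import java.util.function.Supplier;

// Hypothetical snapshot view; only the version and a lazy log segment are modeled.
final class SnapshotViewSketch {
  final long version;
  private final Supplier<String> lazyLogSegment;

  SnapshotViewSketch(long version, Supplier<String> lazyLogSegment) {
    this.version = version;
    this.lazyLogSegment = lazyLogSegment;
  }

  // Evaluates the lazy value; for a trapped view this always throws.
  String logSegment() {
    return lazyLogSegment.get();
  }

  // Builds a view pinned to a persisted version whose log segment must never be read.
  static SnapshotViewSketch withTrappedLogSegment(long persistedVersion) {
    return new SnapshotViewSketch(
        persistedVersion,
        () -> {
          throw new IllegalStateException(
              "log segment is unavailable for a snapshot rebuilt from persisted metadata"
                  + " (version "
                  + persistedVersion
                  + ")");
        });
  }
}
```

Reading `version` stays cheap and safe, while any accidental log-replay access fails immediately instead of reading from a mismatched commit.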
| customMetadata.dataSchemaJson(), | ||
| SchemaUtils.convertSparkSchemaToKernelSchema(customMetadata.dataSchema()), | ||
| VectorUtils.buildArrayValue( | ||
| Arrays.asList(customMetadata.partitionSchema().fieldNames()), StringType.STRING), |
fieldNames() drops column-mapping field IDs. Could we pass the full partition schema?
There is no partitionSchema field in v2 Metadata.
| Offset postInsertOffset = | ||
| streamPostChange.latestOffset(postBarrierOffset, ReadLimit.allAvailable()); | ||
| assertEquals( | ||
| 1L, countRowsBetweenOffsets(streamPostChange, postBarrierOffset, postInsertOffset)); |
A wrong-schema bug would still match the row count. Could we read one post-change column value and assert on it?
changed to exact match
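
The concern raised in this thread can be illustrated with a small sketch: a row-count assertion passes even when rows were read under a stale schema, whereas an exact comparison does not. Rows are modeled as plain maps and the helper names are hypothetical; this is not the actual test code.

```java
import java.util.List;
import java.util.Map;

final class ExactMatchSketch {
  // Weak check: only the number of rows is compared.
  static boolean countMatches(List<Map<String, Object>> actual, int expectedCount) {
    return actual.size() == expectedCount;
  }

  // Strong check: column names and values must match exactly.
  static boolean exactMatches(
      List<Map<String, Object>> actual, List<Map<String, Object>> expected) {
    return actual.equals(expected);
  }
}
```

A row read as `{old_name=1}` when `{new_name=1}` was expected satisfies `countMatches` with a count of 1 but fails `exactMatches`.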
| this.readSchemaAtSourceInit = | ||
| Objects.requireNonNull( | ||
| SchemaUtils.convertKernelSchemaToSparkSchema(snapshotAtSourceInit.getSchema()), | ||
| "readSchemaAtSourceInit is null"); | ||
| SchemaUtils.convertKernelSchemaToSparkSchema(readSnapshotAtSourceInit.getSchema()); | ||
| this.readProtocolAtSourceInit = readSnapshotAtSourceInit.getProtocol(); | ||
| this.readConfigurationsAtSourceInit = readSnapshotAtSourceInit.getMetadata().getConfiguration(); |
the previous requireNonNull on schema/protocol/configuration was dropped in the rewrite.
## 🥞 Stacked PR
Use this [link](https://github.com/delta-io/delta/pull/6563/files) to review incremental changes.
- [stack/SparkMetadataAdapter](#6546) [[Files changed](https://github.com/delta-io/delta/pull/6546/files)] [MERGED]
- [stack/RefactorMetadataTrackingLog](#6550) [[Files changed](https://github.com/delta-io/delta/pull/6550/files)] [MERGED]
- [stack/RefactorDeltaSourceMetadataEvolutionSupport](#6562) [[Files changed](https://github.com/delta-io/delta/pull/6562/files)] [MERGED]
- [**stack/MetadataEvolutionHandler2**](#6563) [[Files changed](https://github.com/delta-io/delta/pull/6563/files)]
- [stack/NonAdditiveSchemaEvolution2](#6570) [[Files changed](https://github.com/delta-io/delta/pull/6570/files/a20f1f3ab452a75fc954e15c57c17327e0cb9267..0e07f87285becd6be416450ae084df454d9c94a9)]
- [stack/NonAdditiveSchemaEvolution3](#6697) [[Files changed](https://github.com/delta-io/delta/pull/6697/files/0e07f87285becd6be416450ae084df454d9c94a9..73e1aa7f4162a3e1480ffd2b88b9ca79d852f2fe)]
- [stack/consecutiveSchemaChangesMerger](#6698) [[Files changed](https://github.com/delta-io/delta/pull/6698/files/73e1aa7f4162a3e1480ffd2b88b9ca79d852f2fe..5e5d260b64d45cc11bcfdb58e5aab1b2d2637b33)]
- [stack/V1V2MixTest](#6759) [[Files changed](https://github.com/delta-io/delta/pull/6759/files/5e5d260b64d45cc11bcfdb58e5aab1b2d2637b33..738379713040986c74f98dbebfdc6c83ec1d3f16)]

---------

#### Which Delta project/connector is this regarding?
- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description
PR 4/7 in the non-additive schema evolution for V2 streaming connector stack.

Introduce `MetadataEvolutionHandler`, a Java class that implements the V1 barrier protocol for schema evolution in the V2 connector. In V1 this logic lives in `DeltaSourceMetadataEvolutionSupport`, a Scala trait mixed into `DeltaSource` that accesses stream state via `this`. Since V2's `SparkMicroBatchStream` is Java and cannot use Scala trait mixins, `MetadataEvolutionHandler` receives all dependencies via constructor injection instead.

The handler covers the full schema evolution lifecycle:
- **Stream start**: eager metadata tracking log initialization on first batch
- **Offset generation**: injects `METADATA_CHANGE_INDEX` / `POST_METADATA_CHANGE_INDEX` barrier sentinels into the file change iterator
- **Pending schema offsets**: returns barrier offsets for in-progress schema changes
- **Batch commit**: updates the schema log and throws `DELTA_STREAMING_METADATA_EVOLUTION` to trigger stream restart
- **Batch planning on restart**: validates and re-initializes the schema log

All detection logic delegates to the shared `DeltaSourceMetadataEvolutionSupport$` companion object statics (refactored in PR 3/7). V2-specific orchestration is limited to wiring the barrier protocol into the `CloseableIterator<IndexedFile>` pipeline and collecting metadata/protocol from Kernel commit ranges via `StreamingHelper`.

Also extends `StreamingHelper` with `getMetadataAndProtocolForVersionRange` to collect metadata and protocol actions from a range of Kernel commits.

## How was this patch tested?
Unit tests in `MetadataEvolutionHandlerTest.java` covering: barrier protocol (METADATA_CHANGE_INDEX / POST_METADATA_CHANGE_INDEX offset generation), tracking state transitions, initialization lifecycle, offset arithmetic, pending schema change handling, and commit-time evolution exception.

## Does this PR introduce _any_ user-facing changes?
No.
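
As a rough illustration of the "Offset generation" step in the Description above, the sketch below injects two barrier sentinel entries ahead of a commit's file entries when that commit carries a metadata change. Everything here is an assumption made for illustration: the sentinel values, the `Entry` type, the placement of the sentinels before the file entries, and the use of a plain list instead of `CloseableIterator<IndexedFile>`.

```java
import java.util.ArrayList;
import java.util.List;

final class BarrierSketch {
  // Hypothetical sentinel values; the real constants live in DeltaSourceOffset.
  static final long METADATA_CHANGE_INDEX = -20L;
  static final long POST_METADATA_CHANGE_INDEX = -19L;

  // Minimal stand-in for an indexed file entry: (commit version, index within the commit).
  static final class Entry {
    final long version;
    final long index;

    Entry(long version, long index) {
      this.version = version;
      this.index = index;
    }
  }

  // Emits the two barrier sentinels first (if needed), then one entry per data file.
  static List<Entry> withBarrier(long version, int fileCount, boolean hasMetadataChange) {
    List<Entry> out = new ArrayList<>();
    if (hasMetadataChange) {
      out.add(new Entry(version, METADATA_CHANGE_INDEX));
      out.add(new Entry(version, POST_METADATA_CHANGE_INDEX));
    }
    for (int i = 0; i < fileCount; i++) {
      out.add(new Entry(version, i));
    }
    return out;
  }
}
```

A consumer of such a sequence can stop at the first sentinel, finish the current batch, and resume from the second sentinel once the schema log has been updated.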
9defa7b to
efbe032
feb5e79 to
c376c08
c376c08 to
dbb6246
## 🥞 Stacked PR
Use this [link](https://github.com/delta-io/delta/pull/6570/files) to review incremental changes.
- [stack/SparkMetadataAdapter](#6546) [[Files changed](https://github.com/delta-io/delta/pull/6546/files)] [MERGED]
- [stack/RefactorMetadataTrackingLog](#6550) [[Files changed](https://github.com/delta-io/delta/pull/6550/files)] [MERGED]
- [stack/RefactorDeltaSourceMetadataEvolutionSupport](#6562) [[Files changed](https://github.com/delta-io/delta/pull/6562/files)] [MERGED]
- [stack/MetadataEvolutionHandler2](#6563) [[Files changed](https://github.com/delta-io/delta/pull/6563/files)] [MERGED]
- [**stack/NonAdditiveSchemaEvolution2**](#6570) [[Files changed](https://github.com/delta-io/delta/pull/6570/files)]
- [stack/NonAdditiveSchemaEvolution3](#6697) [[Files changed](https://github.com/delta-io/delta/pull/6697/files/b7f6c8ebfc0882e7e2cc580f09f376be23a8d43d..dbb6246c14be1ab7f017ad9fc26455ae599ee676)]
- [stack/consecutiveSchemaChangesMerger](#6698) [[Files changed](https://github.com/delta-io/delta/pull/6698/files/dbb6246c14be1ab7f017ad9fc26455ae599ee676..4bf2fa3fa828bcab0b56c4c26ca51ee9cc40b482)]
- [stack/SchemaTrackingWithCDC](#6801) [[Files changed](https://github.com/delta-io/delta/pull/6801/files/4bf2fa3fa828bcab0b56c4c26ca51ee9cc40b482..a78a4ac2bc9a52605278a36b98804230258c12a2)]
- [stack/V1V2MixTest](#6759) [[Files changed](https://github.com/delta-io/delta/pull/6759/files/7f9b7f2724b2245ab7380908616303cf7ea95fca..e146cdc9ebb0572e8b0a928cc6dd3bfdc198d984)]

---------

#### Which Delta project/connector is this regarding?
- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description
PR 5/7 in the non-additive schema evolution for V2 streaming connector stack.

Wire schema tracking into V2's analysis path so the analyzed plan reflects the persisted (evolved) schema instead of the live snapshot schema.
- `DeltaAnalysis.verifyDeltaSourceSchemaLocation`: extend the duplicate-schema-location check to also visit `StreamingRelationV2`, keyed on the V2 `Table.name`.
- `SparkTable`: open `DeltaSourceMetadataTrackingLog` once during construction (gated on `mergeConsecutiveSchemaChanges`) and seed `SchemaProvider` from the persisted metadata, so analysis-time `schema()` matches what the stream will read at runtime.
- `ApplyV2ReadOptions` (renamed from `ApplyV2Streaming`): generalize the CDC-only rebuild to also fire when `schemaTrackingLocation` arrives via `extraOptions` on the catalog `readStream.table()` path; rebuild `SparkTable` with merged options so the schema-log lookup actually fires.
- `MetadataEvolutionHandler.getMetadataTrackingLogForMicroBatchStream`: V2 port of V1's helper, reused by `SparkTable` (analysis) and `SparkScan` (execution).

## How was this patch tested?
`SparkTableTest`, `MetadataEvolutionHandlerTest`, `ApplyV2ReadOptionsSuite`. Unified `DeltaV2SourceSchemaEvolutionSuite` updated.

## Does this PR introduce _any_ user-facing changes?
No.
dbb6246 to
5e57fc8
| DeltaStreamUtils.SchemaReadOptions$.MODULE$.fromSparkSession( | ||
| spark, isStreamingFromColumnMappingTable, isTypeWideningSupportedInProtocol), | ||
| "schemaReadOptions is null"); | ||
| this.metadataEvolutionHandler = |
V1 has a require(...) check on options.failOnDataLoss to keep log retention from affecting schema evolution. Should we add the same check?
Good catch, added the check and a test to verify it
5e57fc8 to
f96643a
## 🥞 Stacked PR
Use this link to review incremental changes.

#### Which Delta project/connector is this regarding?

## Description
PR 6/7 in the non-additive schema evolution for V2 streaming connector stack.

Wire `MetadataEvolutionHandler` into `SparkMicroBatchStream` and `SparkScan` so V2 streaming reads honor non-additive schema evolution (column rename/drop, type widening).
- `SparkMicroBatchStream`: take `metadataTrackingLog` + `metadataPath` as constructor inputs; when a persisted entry exists, layer it onto the freshly loaded `snapshotAtSourceInit` to derive `readSnapshotAtSourceInit` (mirrors V1's `readSnapshotDescriptor`). Integrate the schema-evolution barrier protocol into `latestOffset`/`commit`/`planInputPartitions`. Skip the on-restart schema-validation check when schema tracking is active, since the schema-log evolution exception covers it.
- `SparkScan.toMicroBatchStream`: reload latest snapshot (the analysis-time `initialSnapshot` can be stale by stream start), open the tracking log via `MetadataEvolutionHandler.getMetadataTrackingLogForMicroBatchStream` with `mergeConsecutiveSchemaChanges=false` (the merger only runs at analysis), and pass it through with the checkpoint location.
- `SparkScan` option allow-list: move `allowSourceColumnDrop/Rename/TypeChange` out of the unsupported list now that they are honored.

## How was this patch tested?
`SparkMicroBatchStreamTest`, `MetadataEvolutionHandlerTest`. Unified suites (`DeltaV2SourceSchemaEvolutionSuite`, `TypeWideningStreamingV2SourceSuite`, `RemoveColumnMappingStreamingReadV2Suite`) move non-merger evolution scenarios from `shouldFailTests` to `shouldPassTests`; merger-dependent tests remain pending until PR 7/7.

## Does this PR introduce _any_ user-facing changes?
No.