[kernel-spark] Implement MetadataEvolutionHandler in v2#6563
Conversation
05a2c34 to
21a7fb0
Compare
Range-diff: stack/RefactorDeltaSourceMetadataEvolutionSupport (05a2c34 -> 21a7fb0)
Reproduce locally: |
21a7fb0 to
b69031a
Compare
Range-diff: stack/RefactorDeltaSourceMetadataEvolutionSupport (21a7fb0 -> b69031a)
Reproduce locally: |
b69031a to
82d0cb9
Compare
Range-diff: stack/RefactorDeltaSourceMetadataEvolutionSupport (b69031a -> 82d0cb9)
Reproduce locally: |
82d0cb9 to
9af6e73
Compare
Range-diff: stack/RefactorDeltaSourceMetadataEvolutionSupport (82d0cb9 -> 9af6e73)
Reproduce locally: |
9af6e73 to
cdae0bb
Compare
cdae0bb to
ab61d5e
Compare
Range-diff: stack/RefactorDeltaSourceMetadataEvolutionSupport (cdae0bb -> ab61d5e)
Reproduce locally: |
ab61d5e to
be6d89d
Compare
Range-diff: stack/RefactorDeltaSourceMetadataEvolutionSupport (ab61d5e -> be6d89d)
Reproduce locally: |
be6d89d to
e7376d3
Compare
Range-diff: stack/RefactorDeltaSourceMetadataEvolutionSupport (be6d89d -> e7376d3)
Reproduce locally: |
e7376d3 to
aee34e1
Compare
9a48144 to
696ce03
Compare
Range-diff: stack/RefactorDeltaSourceMetadataEvolutionSupport (9a48144 -> 696ce03)
Reproduce locally: |
696ce03 to
6981a40
Compare
Range-diff: stack/RefactorDeltaSourceMetadataEvolutionSupport (696ce03 -> 6981a40)
Reproduce locally: |
TimothyW553
left a comment
There was a problem hiding this comment.
a few blocking comments
| public CloseableIterator<IndexedFile> stopIndexedFileIteratorAtSchemaChangeBarrier( | ||
| CloseableIterator<IndexedFile> fileActions) { | ||
| // Consume until we hit the barrier, include the barrier itself, discard the rest. | ||
| List<IndexedFile> result = new ArrayList<>(); | ||
| try { | ||
| while (fileActions.hasNext()) { | ||
| IndexedFile file = fileActions.next(); | ||
| result.add(file); | ||
| if (file.getIndex() == DeltaSourceOffset.METADATA_CHANGE_INDEX()) { | ||
| break; | ||
| } | ||
| } | ||
| } finally { | ||
| try { | ||
| fileActions.close(); | ||
| } catch (Exception e) { | ||
| throw new RuntimeException("Failed to close file actions iterator", e); | ||
| } | ||
| } | ||
| return Utils.toCloseableIterator(result.iterator()); | ||
| } |
There was a problem hiding this comment.
V1 was lazy here but this version reads the whole iterator into a list first. For large batches this could use a lot of memory - can we make it lazy?
There was a problem hiding this comment.
Sounds great to me, utilize breakableFilter to make it lazy
| } finally { | ||
| try { | ||
| fileActions.close(); | ||
| } catch (Exception e) { | ||
| throw new RuntimeException("Failed to close file actions iterator", e); | ||
| } | ||
| } |
There was a problem hiding this comment.
If next() throws and then close() also throws, we lose the original error. Use addSuppressed or try-with-resources?
There was a problem hiding this comment.
not a issue anymore after we changed it to lazy
| val upgradedMetadata = new AbstractMetadata { | ||
| val id: String = oldMetadata.id | ||
| val name: String = oldMetadata.name | ||
| val description: String = oldMetadata.description | ||
| val schema: StructType = upgradedSchema | ||
| val partitionColumns: Seq[String] = oldMetadata.partitionColumns | ||
| val configuration: Map[String, String] = oldMetadata.configuration + | ||
| (DeltaConfigs.COLUMN_MAPPING_MODE.key -> newMetadata.columnMappingMode.name, | ||
| DeltaConfigs.COLUMN_MAPPING_MAX_ID.key -> upgradedMaxId.toString) | ||
| val columnMappingMode: DeltaColumnMappingMode = newMetadata.columnMappingMode | ||
| } |
There was a problem hiding this comment.
If a new field is added to AbstractMetadata, this anonymous impl will silently still compile with a trait default - exactly the drift the new equalsByFields is meant to prevent. Can we use a named class or a copy helper?
There was a problem hiding this comment.
I don't think this is a issue, Scala requires anonymous trait instantiations to implement all abstract members
johanl-db
left a comment
There was a problem hiding this comment.
I didn't do a detailed review, only looked at a high-level: the change makes sense, the main evaluation at this stage will be that existing DSV1 tests can be run and pass against this new implementation.
6981a40 to
5788701
Compare
## 🥞 Stacked PR Use this [link](https://github.com/delta-io/delta/pull/6546/files) to review incremental changes. - [**stack/SparkMetadataAdapter**](#6546) [[Files changed](https://github.com/delta-io/delta/pull/6546/files)] - [stack/RefactorMetadataTrackingLog](#6550) [[Files changed](https://github.com/delta-io/delta/pull/6550/files/9271a6262f7a2615b977de0319c7238044b7d0a9..8378d33acda70a34a109b35173a968a4b3401ec1)] - [stack/RefactorDeltaSourceMetadataEvolutionSupport](#6562) [[Files changed](https://github.com/delta-io/delta/pull/6562/files/8378d33acda70a34a109b35173a968a4b3401ec1..90365431b12640de181446ec9c2033fb1b143b03)] - [stack/MetadataEvolutionHandler2](#6563) [[Files changed](https://github.com/delta-io/delta/pull/6563/files/28bb7021adb12b055e1b281fdfee0ab48a8732ac..578870181fa81a9146b2fa907244e350ffcabb52)] - [stack/NonAdditiveSchemaEvolution2](#6570) [[Files changed](https://github.com/delta-io/delta/pull/6570/files/578870181fa81a9146b2fa907244e350ffcabb52..c025b7c3c386e8d46d6142d0727dce95582bb0ef)] - [stack/NonAdditiveSchemaEvolution3](#6697) [[Files changed](https://github.com/delta-io/delta/pull/6697/files/c025b7c3c386e8d46d6142d0727dce95582bb0ef..db16b9fa80a80c105430c93589126ba8b828458f)] - [stack/consecutiveSchemaChangesMerger](#6698) [[Files changed](https://github.com/delta-io/delta/pull/6698/files/0148020ffe11e7b079e99fa8c5189a19c354f2be..9a360aa819f20d78b5361b2e997d24433fb793d5)] --------- #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description PR 1/7 in the non-additive schema evolution for V2 streaming connector stack. The shared V1 Scala utilities (`DeltaColumnMapping`, `DeltaSourceMetadataEvolutionSupport`) operate on `AbstractMetadata`/`AbstractProtocol`, but V2 holds Kernel types. This PR creates two adapter classes that bridge the gap: - `KernelMetadataAdapter`: Kernel `Metadata` → `AbstractMetadata` (schema conversion via `SchemaUtils`, partition columns and configuration converted to Scala collections) - `KernelProtocolAdapter`: Kernel `Protocol` → `AbstractProtocol` (maps reader/writer features to `Option[Set[String]]`) Also adds `columnMappingMode` and `partitionSchema` to the `AbstractMetadata` trait — V1's `Metadata` already had these fields, the trait just didn't expose them. ## How was this patch tested? Unit tests in `ActionAdaptersTest.java`: table-features protocol, legacy protocol, full metadata round-trip, null optional fields, and null constructor rejection. ## Does this PR introduce _any_ user-facing changes? No.
5788701 to
1438111
Compare
Range-diff: stack/RefactorDeltaSourceMetadataEvolutionSupport (5788701 -> 1438111)
Reproduce locally: |
| * <p>V2 port of V1's {@code | ||
| * DeltaSourceMetadataEvolutionSupport.getNextOffsetFromPreviousOffsetIfPendingSchemaChange}. | ||
| */ | ||
| public DeltaSourceOffset getNextOffsetFromPreviousOffsetIfPendingSchemaChange( |
There was a problem hiding this comment.
SparkMicroBatchStream.getNextOffsetFromPreviousOffset returns an Optional<>, we should match that.
| * DeltaSourceMetadataEvolutionSupport.initializeMetadataTrackingAndExitStream}. | ||
| */ | ||
| public void initializeMetadataTrackingAndExitStream( | ||
| long batchStartVersion, Long batchEndVersion, boolean alwaysFailUponLogInitialized) { |
There was a problem hiding this comment.
I don't think Optional<> is encouraged to be used as a method parameter in Java. I can add a @Nullable anotation
| new KernelProtocolAdapter(protocolToUse), | ||
| metadataPath); | ||
|
|
||
| if (replace) { |
There was a problem hiding this comment.
Can we just do metadataTrackingLog.get().writeNewMetadata(schemaToPersist, replace);?
| * picks up the updated schema. | ||
| * </ol> | ||
| * | ||
| * <p>See V1's trait doc for the full barrier protocol details. Validation logic shared with v1 |
There was a problem hiding this comment.
Not immediately clear what this refers to.
1438111 to
ada8458
Compare
Range-diff: stack/RefactorDeltaSourceMetadataEvolutionSupport (1438111 -> ada8458)
Reproduce locally: |
…#6550) ## 🥞 Stacked PR Use this [link](https://github.com/delta-io/delta/pull/6550/files) to review incremental changes. - [stack/SparkMetadataAdapter](#6546) [[Files changed](https://github.com/delta-io/delta/pull/6546/files)] [MERGED] - [**stack/RefactorMetadataTrackingLog**](#6550) [[Files changed](https://github.com/delta-io/delta/pull/6550/files)] - [stack/RefactorDeltaSourceMetadataEvolutionSupport](#6562) [[Files changed](https://github.com/delta-io/delta/pull/6562/files/953f137f8c4ce46d8b8a9605b0c7bed898e30df4..027984b6edcbad0f4731e560425c2ed9bcf8fc27)] - [stack/MetadataEvolutionHandler2](#6563) [[Files changed](https://github.com/delta-io/delta/pull/6563/files/027984b6edcbad0f4731e560425c2ed9bcf8fc27..ada845895139edcb2727a87b39922c8e16837a99)] - [stack/NonAdditiveSchemaEvolution2](#6570) [[Files changed](https://github.com/delta-io/delta/pull/6570/files/ada845895139edcb2727a87b39922c8e16837a99..476762fde7b9cb9b9bc3e416c86a260cd29806ed)] - [stack/NonAdditiveSchemaEvolution3](#6697) [[Files changed](https://github.com/delta-io/delta/pull/6697/files/476762fde7b9cb9b9bc3e416c86a260cd29806ed..13395a7f2a49db4962091e8ee919bebdab5bd4e2)] - [stack/consecutiveSchemaChangesMerger](#6698) [[Files changed](https://github.com/delta-io/delta/pull/6698/files/13395a7f2a49db4962091e8ee919bebdab5bd4e2..f22ba063eaf35ab69d653a2d5faefdc52f35eab5)] --------- #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description PR 2/7 in the non-additive schema evolution for V2 streaming connector stack. Decouple `DeltaSourceMetadataTrackingLog` and `PersistedMetadata` from V1-specific types so the schema log can be reused by the V2 connector. - Replace `SnapshotDescriptor` parameter in `create()` with plain `sourceTableId` and `sourceDataPath` strings - Unify `PersistedMetadata.apply` to accept `AbstractMetadata`/`AbstractProtocol` instead of V1 `Metadata`/`Protocol` - Extract the consecutive schema changes merger (V1-specific, depends on `DeltaLog`) out of the companion object into `DeltaSourceMetadataEvolutionSupport`, and inject it as a function parameter so V2 can provide its own implementation - Remove `Protocol`'s `private` constructor modifier to allow construction from abstract protocol fields All changes are structural refactors with no behavioral change. ## How was this patch tested? Existing tests in `DeltaSourceSchemaEvolutionSuite` updated to use the new API. No behavioral changes. ## Does this PR introduce _any_ user-facing changes? No.
54b456f to
e5b2c32
Compare
…seable in v2 (#6562) ## 🥞 Stacked PR Use this [link](https://github.com/delta-io/delta/pull/6562/files) to review incremental changes. - [stack/SparkMetadataAdapter](#6546) [[Files changed](https://github.com/delta-io/delta/pull/6546/files)] [MERGED] - [stack/RefactorMetadataTrackingLog](#6550) [[Files changed](https://github.com/delta-io/delta/pull/6550/files)] [MERGED] - [**stack/RefactorDeltaSourceMetadataEvolutionSupport**](#6562) [[Files changed](https://github.com/delta-io/delta/pull/6562/files)] - [stack/MetadataEvolutionHandler2](#6563) [[Files changed](https://github.com/delta-io/delta/pull/6563/files/ed92a0fa2051432b6bc5784034df0b7949bbfb98..e5b2c3295843ec85753e07dc0010aa5ccebaabb7)] - [stack/NonAdditiveSchemaEvolution2](#6570) [[Files changed](https://github.com/delta-io/delta/pull/6570/files/e5b2c3295843ec85753e07dc0010aa5ccebaabb7..7c66bf11a0f1b651cda32ed7f529f552dd9dbfcb)] - [stack/NonAdditiveSchemaEvolution3](#6697) [[Files changed](https://github.com/delta-io/delta/pull/6697/files/7c66bf11a0f1b651cda32ed7f529f552dd9dbfcb..14956ea304c93d2343ccd7eb89a112966f07f906)] - [stack/consecutiveSchemaChangesMerger](#6698) [[Files changed](https://github.com/delta-io/delta/pull/6698/files/14956ea304c93d2343ccd7eb89a112966f07f906..8101b335b892a6a5b6d6fe11f4a202d14102721c)] --------- #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description PR 3/7 in the non-additive schema evolution for V2 streaming connector stack. Refactor `DeltaSourceMetadataEvolutionSupport` and `DeltaColumnMapping` so the schema change detection logic can be called from V2 without depending on V1 instance state. **`DeltaSourceMetadataEvolutionSupport`:** - Extract instance methods (`validateAndResolveMetadataEvolution`, `checkColumnMappingSchemaChangesDuringStreaming`, `resolveMetadataEvolutionForCommitRange`, etc.) to companion object statics that accept explicit parameters instead of accessing V1 `DeltaSource` via `this` - V1 trait methods now delegate to the companion object statics **`DeltaColumnMapping`:** - Widen `hasNoColumnMappingSchemaChanges` from V1 `Metadata` to `AbstractMetadata` so V2 can call it via the adapter layer - Extract `assignColumnIdAndPhysicalNameToSchema(StructType, Map)` from `assignColumnIdAndPhysicalName(Metadata, Metadata, ...)` — needed for simulating column mapping upgrades during NoMapping-to-NameMapping transitions All changes are structural refactors with no behavioral change. ## How was this patch tested? Existing tests in `DeltaSourceSchemaEvolutionSuite` continue to pass. No behavioral changes. ## Does this PR introduce _any_ user-facing changes? No.
f963649 to
a20f1f3
Compare
a20f1f3 to
2349027
Compare
## 🥞 Stacked PR Use this [link](https://github.com/delta-io/delta/pull/6570/files) to review incremental changes. - [stack/SparkMetadataAdapter](#6546) [[Files changed](https://github.com/delta-io/delta/pull/6546/files)] [MERGED] - [stack/RefactorMetadataTrackingLog](#6550) [[Files changed](https://github.com/delta-io/delta/pull/6550/files)] [MERGED] - [stack/RefactorDeltaSourceMetadataEvolutionSupport](#6562) [[Files changed](https://github.com/delta-io/delta/pull/6562/files)] [MERGED] - [stack/MetadataEvolutionHandler2](#6563) [[Files changed](https://github.com/delta-io/delta/pull/6563/files)] [MERGED] - [**stack/NonAdditiveSchemaEvolution2**](#6570) [[Files changed](https://github.com/delta-io/delta/pull/6570/files)] - [stack/NonAdditiveSchemaEvolution3](#6697) [[Files changed](https://github.com/delta-io/delta/pull/6697/files/b7f6c8ebfc0882e7e2cc580f09f376be23a8d43d..dbb6246c14be1ab7f017ad9fc26455ae599ee676)] - [stack/consecutiveSchemaChangesMerger](#6698) [[Files changed](https://github.com/delta-io/delta/pull/6698/files/dbb6246c14be1ab7f017ad9fc26455ae599ee676..4bf2fa3fa828bcab0b56c4c26ca51ee9cc40b482)] - [stack/SchemaTrackingWithCDC](#6801) [[Files changed](https://github.com/delta-io/delta/pull/6801/files/4bf2fa3fa828bcab0b56c4c26ca51ee9cc40b482..a78a4ac2bc9a52605278a36b98804230258c12a2)] - [stack/V1V2MixTest](#6759) [[Files changed](https://github.com/delta-io/delta/pull/6759/files/7f9b7f2724b2245ab7380908616303cf7ea95fca..e146cdc9ebb0572e8b0a928cc6dd3bfdc198d984)] --------- #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description PR 5/7 in the non-additive schema evolution for V2 streaming connector stack. Wire schema tracking into V2's analysis path so the analyzed plan reflects the persisted (evolved) schema instead of the live snapshot schema. - `DeltaAnalysis.verifyDeltaSourceSchemaLocation`: extend the duplicate-schema-location check to also visit `StreamingRelationV2`, keyed on the V2 `Table.name`. - `SparkTable`: open `DeltaSourceMetadataTrackingLog` once during construction (gated on `mergeConsecutiveSchemaChanges`) and seed `SchemaProvider` from the persisted metadata, so analysis-time `schema()` matches what the stream will read at runtime. - `ApplyV2ReadOptions` (renamed from `ApplyV2Streaming`): generalize the CDC-only rebuild to also fire when `schemaTrackingLocation` arrives via `extraOptions` on the catalog `readStream.table()` path; rebuild `SparkTable` with merged options so the schema-log lookup actually fires. - `MetadataEvolutionHandler.getMetadataTrackingLogForMicroBatchStream`: V2 port of V1's helper, reused by `SparkTable` (analysis) and `SparkScan` (execution). ## How was this patch tested? `SparkTableTest`, `MetadataEvolutionHandlerTest`, `ApplyV2ReadOptionsSuite`. Unified `DeltaV2SourceSchemaEvolutionSuite` updated. ## Does this PR introduce _any_ user-facing changes? No.
…6697) ## 🥞 Stacked PR Use this [link](https://github.com/delta-io/delta/pull/6697/files) to review incremental changes. - [stack/SparkMetadataAdapter](#6546) [[Files changed](https://github.com/delta-io/delta/pull/6546/files)] [MERGED] - [stack/RefactorMetadataTrackingLog](#6550) [[Files changed](https://github.com/delta-io/delta/pull/6550/files)] [MERGED] - [stack/RefactorDeltaSourceMetadataEvolutionSupport](#6562) [[Files changed](https://github.com/delta-io/delta/pull/6562/files)] [MERGED] - [stack/MetadataEvolutionHandler2](#6563) [[Files changed](https://github.com/delta-io/delta/pull/6563/files)] [MERGED] - [stack/NonAdditiveSchemaEvolution2](#6570) [[Files changed](https://github.com/delta-io/delta/pull/6570/files)] [MERGED] - [**stack/NonAdditiveSchemaEvolution3**](#6697) [[Files changed](https://github.com/delta-io/delta/pull/6697/files)] - [stack/consecutiveSchemaChangesMerger](#6698) [[Files changed](https://github.com/delta-io/delta/pull/6698/files/f96643aa3cc01e7f70cc13a18b82dc27f277f11d..f612628ad931ec35c237801109f01b6fbd1379f7)] - [stack/SchemaTrackingWithCDC](#6801) [[Files changed](https://github.com/delta-io/delta/pull/6801/files/f612628ad931ec35c237801109f01b6fbd1379f7..4aeacfb120b33e9cdfe124352290b72f53f7cf89)] - [stack/V1V2MixTest](#6759) [[Files changed](https://github.com/delta-io/delta/pull/6759/files/f612628ad931ec35c237801109f01b6fbd1379f7..0c818ee431ab417a4f2ffbcc609930be09d25031)] --------- #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description PR 6/7 in the non-additive schema evolution for V2 streaming connector stack. Wire `MetadataEvolutionHandler` into `SparkMicroBatchStream` and `SparkScan` so V2 streaming reads honor non-additive schema evolution (column rename/drop, type widening). - `SparkMicroBatchStream`: take `metadataTrackingLog` + `metadataPath` as constructor inputs; when a persisted entry exists, layer it onto the freshly loaded `snapshotAtSourceInit` to derive `readSnapshotAtSourceInit` (mirrors V1's `readSnapshotDescriptor`). Integrate the schema-evolution barrier protocol into `latestOffset` / `commit` / `planInputPartitions`. Skip the on-restart schema-validation check when schema tracking is active — the schema-log evolution exception covers it. - `SparkScan.toMicroBatchStream`: reload latest snapshot (the analysis-time `initialSnapshot` can be stale by stream start), open the tracking log via `MetadataEvolutionHandler.getMetadataTrackingLogForMicroBatchStream` with `mergeConsecutiveSchemaChanges=false` (the merger only runs at analysis), and pass it through with the checkpoint location. - `SparkScan` option allow-list: move `allowSourceColumnDrop` / `Rename` / `TypeChange` out of the unsupported list now that they are honored. ## How was this patch tested? `SparkMicroBatchStreamTest`, `MetadataEvolutionHandlerTest`. Unified suites (`DeltaV2SourceSchemaEvolutionSuite`, `TypeWideningStreamingV2SourceSuite`, `RemoveColumnMappingStreamingReadV2Suite`) move non-merger evolution scenarios from `shouldFailTests` to `shouldPassTests`; merger-dependent tests remain pending until PR 7/7. ## Does this PR introduce _any_ user-facing changes? No.
🥞 Stacked PR
Use this link to review incremental changes.
Which Delta project/connector is this regarding?
Description
PR 4/7 in the non-additive schema evolution for V2 streaming connector stack.
Introduce
MetadataEvolutionHandler, a Java class that implements the V1 barrier protocol for schema evolution in the V2 connector. In V1 this logic lives inDeltaSourceMetadataEvolutionSupport, a Scala trait mixed intoDeltaSourcethat accesses stream state viathis. Since V2'sSparkMicroBatchStreamis Java and cannot use Scala trait mixins,MetadataEvolutionHandlerreceives all dependencies via constructor injection instead.The handler covers the full schema evolution lifecycle:
METADATA_CHANGE_INDEX/POST_METADATA_CHANGE_INDEXbarrier sentinels into the file change iteratorDELTA_STREAMING_METADATA_EVOLUTIONto trigger stream restartAll detection logic delegates to the shared
DeltaSourceMetadataEvolutionSupport$companion object statics (refactored in PR 3/7). V2-specific orchestration is limited to wiring the barrier protocol into theCloseableIterator<IndexedFile>pipeline and collecting metadata/protocol from Kernel commit ranges viaStreamingHelper.Also extends
StreamingHelperwithgetMetadataAndProtocolForVersionRangeto collect metadata and protocol actions from a range of Kernel commits.How was this patch tested?
Unit tests in
MetadataEvolutionHandlerTest.javacovering: barrier protocol (METADATA_CHANGE_INDEX / POST_METADATA_CHANGE_INDEX offset generation), tracking state transitions, initialization lifecycle, offset arithmetic, pending schema change handling, and commit-time evolution exception.Does this PR introduce any user-facing changes?
No.