[Spark] Fix streaming schema merger dropping metadata change when chain ends with a protocol-only commit #6752
Open
PorridgeSwim wants to merge 2 commits into
Which Delta project/connector is this regarding?
Description
`DeltaSourceMetadataEvolutionSupport.getMergedConsecutiveMetadataChanges` looks ahead through the run of consecutive non-file commits after the current persisted schema entry and writes a single merged `PersistedMetadata` covering the whole run. It used `iteratorLast` to pick only the final commit's actions, then took its `Metadata`/`Protocol` (or fell back to the previously persisted entry).

That breaks for chains shaped like `(..., Metadata, Protocol-only)`: the final commit has no `Metadata` action, so `metadataOpt.getOrElse(currentMetadata.dataSchemaJson)` reverts to the pre-chain schema while `deltaCommitVersion` advances to the protocol-only tail. The persisted entry's `dataSchema` then disagrees with the snapshot at `deltaCommitVersion`, silently dropping the earlier metadata change.

This PR replaces the last-tuple logic with a fold across the chain, tracking the last version, the latest non-`None` `Metadata` action, and the latest non-`None` `Protocol` action separately. The merged entry now reflects the cumulative state at the end of the chain regardless of which commit each action lives in.

How was this patch tested?
Added a regression test `consecutive schema evolutions with protocol-only tail` in `DeltaSourceSchemaEvolutionSuite`. The chain is constructed so that the start schema, the post-first-change schema (which becomes `currentMetadata` for the merger), and the post-chain schema are pairwise distinct, so a coincidence can't mask the failure: `<a, b>` → `rename b→c` → `addColumn d` → `dropColumn c` → protocol-only upgrade → file action.

- Without the fix: `(vTail, <a, c>)`, the schema reverts to the pre-chain one.
- With the fix: `(vTail, <a, d>)`, the schema reflects the final metadata change.

The existing `consecutive schema evolutions` test (all-Metadata chain) still passes, since the fold's "latest metadata in chain" coincides with `iteratorLast`'s result when every commit carries a `Metadata` action.

Does this PR introduce any user-facing changes?
No (bug fix to internal streaming schema-tracking behavior; no API changes).
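
For illustration, the difference between the last-commit logic and the fold described in the Description can be modeled outside Spark. This is only a sketch under stated assumptions: `MetadataAction`, `ProtocolAction`, `CommitActions`, `PersistedEntry`, `mergeByLast`, and `mergeByFold` are hypothetical stand-ins rather than the Delta classes or the code in this PR, and the commit versions and protocol versions are made up for the example.

```scala
// Hypothetical, simplified stand-ins for Delta actions (not the real classes).
final case class MetadataAction(schema: String)
final case class ProtocolAction(minReaderVersion: Int, minWriterVersion: Int)

// One non-file commit in the chain: its version plus whichever actions it carries.
final case class CommitActions(
    version: Long,
    metadata: Option[MetadataAction],
    protocol: Option[ProtocolAction])

// Simplified model of the persisted schema-tracking entry.
final case class PersistedEntry(
    version: Long,
    metadata: MetadataAction,
    protocol: ProtocolAction)

object MergeSketch {
  // Last-commit logic: only the final commit's actions are consulted, and
  // anything missing falls back to the previously persisted entry.
  def mergeByLast(current: PersistedEntry, chain: Seq[CommitActions]): Option[PersistedEntry] =
    chain.lastOption.map { last =>
      PersistedEntry(
        last.version,
        last.metadata.getOrElse(current.metadata),
        last.protocol.getOrElse(current.protocol))
    }

  // Fold logic: track the last version, the latest Metadata, and the latest
  // Protocol independently, then fall back to the persisted entry per field.
  def mergeByFold(current: PersistedEntry, chain: Seq[CommitActions]): Option[PersistedEntry] = {
    val init = (Option.empty[Long], Option.empty[MetadataAction], Option.empty[ProtocolAction])
    val (lastVersion, latestMetadata, latestProtocol) = chain.foldLeft(init) {
      case ((_, m, p), c) => (Some(c.version), c.metadata.orElse(m), c.protocol.orElse(p))
    }
    lastVersion.map { v =>
      PersistedEntry(
        v,
        latestMetadata.getOrElse(current.metadata),
        latestProtocol.getOrElse(current.protocol))
    }
  }

  // Entry persisted after `rename b→c` (version number is illustrative).
  val current: PersistedEntry =
    PersistedEntry(1L, MetadataAction("<a, c>"), ProtocolAction(1, 2))

  // addColumn d, dropColumn c, then a protocol-only upgrade as the tail.
  val chain: Seq[CommitActions] = Seq(
    CommitActions(2L, Some(MetadataAction("<a, c, d>")), None),
    CommitActions(3L, Some(MetadataAction("<a, d>")), None),
    CommitActions(4L, None, Some(ProtocolAction(3, 7))))

  def main(args: Array[String]): Unit = {
    // Version advances to 4, but the schema falls back to <a, c>.
    println(mergeByLast(current, chain))
    // Version advances to 4 and the schema is <a, d>.
    println(mergeByFold(current, chain))
  }
}
```

In this model both functions agree whenever every commit in the chain carries a `MetadataAction`, which mirrors why the existing all-Metadata test is unaffected.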