[Spark] Fix streaming schema merger dropping metadata change when chain ends with a protocol-only commit#6752

Open
PorridgeSwim wants to merge 2 commits into delta-io:master from PorridgeSwim:fixMetadataChangeMergerBug

@PorridgeSwim (Collaborator) commented May 8, 2026

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

DeltaSourceMetadataEvolutionSupport.getMergedConsecutiveMetadataChanges looks ahead through the run of consecutive non-file commits after the current persisted schema entry and writes a single merged PersistedMetadata covering the whole run. It used iteratorLast to pick only the final commit's actions, then took its Metadata/Protocol (or fell back to the previously persisted entry).

That breaks for chains shaped like (..., Metadata, Protocol-only): the final commit has no Metadata action, so metadataOpt.getOrElse(currentMetadata.dataSchemaJson) reverts to the pre-chain schema while deltaCommitVersion advances to the protocol-only tail. The persisted entry's dataSchema then disagrees with the snapshot at deltaCommitVersion, silently dropping the earlier metadata change.

This PR replaces the last-tuple logic with a fold across the chain, tracking the last version, the latest non-None Metadata action, and the latest non-None Protocol action separately. The merged entry now reflects the cumulative state at the end of the chain regardless of which commit each action lives in.
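The difference between the two strategies can be sketched in a few lines of Scala. This is a simplified illustration, not the code in the PR: `Commit`, `Persisted`, `mergeLast`, and `mergeFold` are hypothetical stand-ins, schemas and protocols are plain strings, and the chain is assumed to already contain only the consecutive non-file commits.

```scala
// Hypothetical, simplified model; these are NOT the real Delta classes.
object MergeSketch {
  // One non-file commit: its version plus whichever actions it carries.
  final case class Commit(version: Long, schema: Option[String], protocol: Option[String])

  // Simplified stand-in for a persisted schema-log entry.
  final case class Persisted(version: Long, schema: String, protocol: String)

  // Last-commit-only strategy: consult just the final commit's actions and
  // fall back to the previously persisted entry for anything it lacks.
  def mergeLast(current: Persisted, chain: Seq[Commit]): Option[Persisted] =
    chain.lastOption.map { last =>
      Persisted(
        last.version,
        last.schema.getOrElse(current.schema),
        last.protocol.getOrElse(current.protocol))
    }

  // Fold strategy: track the last version, the latest defined schema, and the
  // latest defined protocol independently across the whole chain.
  def mergeFold(current: Persisted, chain: Seq[Commit]): Option[Persisted] = {
    val start = (Option.empty[Long], Option.empty[String], Option.empty[String])
    val (lastVersion, lastSchema, lastProtocol) = chain.foldLeft(start) {
      case ((_, schema, protocol), commit) =>
        (Some(commit.version), commit.schema.orElse(schema), commit.protocol.orElse(protocol))
    }
    // An empty chain yields no merged entry.
    lastVersion.map { v =>
      Persisted(
        v,
        lastSchema.getOrElse(current.schema),
        lastProtocol.getOrElse(current.protocol))
    }
  }

  // Example chain: two metadata changes followed by a protocol-only tail.
  val current: Persisted = Persisted(1L, "<a, c>", "p1")
  val chain: Seq[Commit] = Seq(
    Commit(2L, Some("<a, c, d>"), None),
    Commit(3L, Some("<a, d>"), None),
    Commit(4L, None, Some("p2")))
}
```

With this chain, `mergeLast` pairs version 4 with the stale `<a, c>` schema, while `mergeFold` pairs version 4 with `<a, d>`. When every commit in the chain carries a schema, the two functions return the same entry.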

How was this patch tested?

Added a regression test, "consecutive schema evolutions with protocol-only tail", to DeltaSourceSchemaEvolutionSuite. The chain is constructed so that the start schema, the post-first-change schema (which becomes currentMetadata for the merger), and the post-chain schema are pairwise distinct, so a coincidence can't mask the failure:

  • <a, b> → rename b→c → addColumn d → dropColumn c → protocol-only upgrade → file action.
  • Pre-fix: merged entry has (vTail, <a, c>) — schema reverts to pre-chain.
  • Post-fix: merged entry has (vTail, <a, d>) — schema reflects the final metadata change.

The existing "consecutive schema evolutions" test (an all-Metadata chain) still passes, since the fold's "latest metadata in chain" coincides with iteratorLast's result when every commit carries a Metadata action.

Does this PR introduce any user-facing changes?

No (bug fix to internal streaming schema-tracking behavior; no API changes).

@PorridgeSwim changed the title from "add test" to "[kernel-spark]" May 8, 2026
@PorridgeSwim changed the title from "[kernel-spark]" to "[Spark] Fix streaming schema merger dropping metadata change when chain ends with a protocol-only commit" May 8, 2026
@PorridgeSwim self-assigned this May 8, 2026
@PorridgeSwim force-pushed the fixMetadataChangeMergerBug branch from 62286e6 to d296b3a May 8, 2026 23:57