Skip to content

Fix V2 streaming row metadata reads#6762

Open
TimothyW553 wants to merge 4 commits into
delta-io:masterfrom
TimothyW553:metadata-row-id
Open

Fix V2 streaming row metadata reads#6762
TimothyW553 wants to merge 4 commits into
delta-io:masterfrom
TimothyW553:metadata-row-id

Conversation

@TimothyW553
Copy link
Copy Markdown
Collaborator

@TimothyW553 TimothyW553 commented May 11, 2026

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Managed tables in the UC test setup have row tracking enabled, so _metadata.row_id is a valid metadata column. Batch reads already work:

SELECT id, _metadata.row_id FROM table

The same projection in V2 streaming analyzed successfully, but failed at execution with ArrayIndexOutOfBoundsException. The streaming scan was planned with _metadata in the output, but the reader factory received a read data schema without _metadata, so it skipped the existing row-tracking metadata path and produced one fewer column than Spark expected.

This kind of query works after this PR:

spark.readStream().format("delta").table(tableName)
    .selectExpr("id", "_metadata.row_id AS rid")
    .writeStream()
    .trigger(Trigger.AvailableNow())
    .option("checkpointLocation", checkpoint)
    .foreachBatch((df, batchId) -> df.collectAsList())
    .start()
    .awaitTermination();

This PR carries the requested _metadata schema from the analyzed V2 streaming relation into SparkScanBuilder using an internal option. SparkScanBuilder adds _metadata back to the read data schema and strips the option before constructing the scan. That lets streaming use the same row-tracking metadata reader path as batch.

How was this patch tested?

Added tests for the streaming metadata schema handoff and for a managed-table streaming read of _metadata.row_id.

build/sbt 'spark/testOnly io.delta.internal.ApplyV2StreamingSuite'
build/sbt 'sparkV2/testOnly io.delta.spark.internal.v2.read.SparkScanBuilderTest'
build/sbt 'sparkUnityCatalog/testOnly io.sparkuctest.UCDeltaTableDataFrameStreamingTest'

Does this PR introduce any user-facing changes?

Yes. V2 streaming reads that select _metadata.row_id now complete instead of crashing during execution.

@TimothyW553 TimothyW553 marked this pull request as ready for review May 11, 2026 21:56
Copy link
Copy Markdown
Collaborator

@johanl-db johanl-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-level comment:

This change doesn't fit with how metadata columns are supposed to work in Spark.
There shouldn't be a need for manually propagating the metadata column, the flow is:

  1. DSv2 conector implements trait SupportsMetadataColumn and surfaces the metadata columns it exposes via metadataColumns
  2. When resolving a relation, Spark appends the metadata columns that the connector exposes to its output schema.
  3. The plan is analyzed, unused metadata columns or individual metadata fields are pruned away.
  4. Spark builds a scan, the read schema contains the metadata columns/fields (e.g. user selected _metadata.row_id) that are effectively used. The connector advertised that it could surface that metadata, it now fills the value in the batches it returns to Spark.

The metadata fields should be part of the read schema, without the need to manually add them here. It's Spark responsibility to do this in a generic manner.

Somehow for streaming, the metadata column got lost along the way. This is what needs to be fixed, and not manually adding it back after the fact.

Also, using options as a way to pass information is not robust. There are very few cases where options are the right way (even though we tend to overuse them). Better to be explicit and use dedicated variable with a proper type.

@huan233usc
Copy link
Copy Markdown
Collaborator

+1, can we dig more why the _metadata column is not resolved vs depending on a read option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants