Fix V2 streaming row metadata reads by TimothyW553 · Pull Request #6762 · delta-io/delta

TimothyW553 · 2026-05-11T20:58:08Z

Which Delta project/connector is this regarding?

Description

Managed tables in the UC test setup have row tracking enabled, so _metadata.row_id is a valid metadata column. Batch reads already work:

SELECT id, _metadata.row_id FROM table

The same projection in V2 streaming analyzed successfully, but failed at execution with ArrayIndexOutOfBoundsException. The streaming scan was planned with _metadata in the output, but the reader factory received a read data schema without _metadata, so it skipped the existing row-tracking metadata path and produced one fewer column than Spark expected.
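
As a rough illustration of that failure mode (not the actual Spark or Delta code path), the following sketch shows how reading one value per planned output column from a row with fewer columns ends in ArrayIndexOutOfBoundsException. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class ColumnCountMismatchSketch {
    /** Copies one value per planned output column out of the row the reader produced. */
    static Object[] projectRow(List<String> plannedOutput, Object[] producedRow) {
        Object[] projected = new Object[plannedOutput.size()];
        for (int ordinal = 0; ordinal < plannedOutput.size(); ordinal++) {
            // Throws ArrayIndexOutOfBoundsException when the reader produced
            // fewer columns than the plan expects.
            projected[ordinal] = producedRow[ordinal];
        }
        return projected;
    }

    public static void main(String[] args) {
        // The scan was planned with _metadata in its output...
        List<String> plannedOutput = Arrays.asList("id", "_metadata");
        // ...but the reader was built from a read data schema without it.
        Object[] rowWithoutMetadata = {1L};
        try {
            projectRow(plannedOutput, rowWithoutMetadata);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("reader produced " + rowWithoutMetadata.length
                + " column(s), plan expected " + plannedOutput.size());
        }
    }
}
```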

This kind of query works after this PR:

spark.readStream().format("delta").table(tableName)
    .selectExpr("id", "_metadata.row_id AS rid")
    .writeStream()
    .trigger(Trigger.AvailableNow())
    .option("checkpointLocation", checkpoint)
    .foreachBatch((df, batchId) -> df.collectAsList())
    .start()
    .awaitTermination();

This PR carries the requested _metadata schema from the analyzed V2 streaming relation into SparkScanBuilder using an internal option. SparkScanBuilder adds _metadata back to the read data schema and strips the option before constructing the scan. That lets streaming use the same row-tracking metadata reader path as batch.
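
A minimal sketch of the handoff described above, using plain collections in place of Spark's schema and option types. The option key, the ScanInputs holder, and the prepare method are assumptions made for illustration; they are not the names used in the PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MetadataOptionHandoffSketch {
    // Hypothetical key; the PR's real internal option name is not shown here.
    static final String REQUESTED_METADATA_OPTION = "__sketch.requestedMetadataFields";

    /** Read data schema and options as they would be handed to the scan. */
    static final class ScanInputs {
        final List<String> readDataSchema;
        final Map<String, String> options;

        ScanInputs(List<String> readDataSchema, Map<String, String> options) {
            this.readDataSchema = readDataSchema;
            this.options = options;
        }
    }

    static ScanInputs prepare(List<String> readDataSchema, Map<String, String> options) {
        List<String> schema = new ArrayList<>(readDataSchema);
        Map<String, String> cleaned = new LinkedHashMap<>(options);
        // Strip the internal option so it never reaches the scan itself.
        String requested = cleaned.remove(REQUESTED_METADATA_OPTION);
        // Add _metadata back only when it was requested and is not already present.
        if (requested != null && !requested.isEmpty() && !schema.contains("_metadata")) {
            schema.add("_metadata");
        }
        return new ScanInputs(schema, cleaned);
    }

    public static void main(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put(REQUESTED_METADATA_OPTION, "row_id");
        options.put("maxFilesPerTrigger", "10");
        ScanInputs inputs = prepare(Arrays.asList("id"), options);
        System.out.println(inputs.readDataSchema + " " + inputs.options);
    }
}
```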

How was this patch tested?

Added tests for the streaming metadata schema handoff and for a managed-table streaming read of _metadata.row_id.

build/sbt 'spark/testOnly io.delta.internal.ApplyV2StreamingSuite'
build/sbt 'sparkV2/testOnly io.delta.spark.internal.v2.read.SparkScanBuilderTest'
build/sbt 'sparkUnityCatalog/testOnly io.sparkuctest.UCDeltaTableDataFrameStreamingTest'

Does this PR introduce any user-facing changes?

Yes. V2 streaming reads that select _metadata.row_id now complete instead of crashing during execution.

johanl-db

High-level comment:

This change doesn't fit with how metadata columns are supposed to work in Spark.
There shouldn't be a need to manually propagate the metadata column. The flow is:

1. The DSv2 connector implements SupportsMetadataColumns and surfaces the metadata columns it exposes via metadataColumns.
2. When resolving a relation, Spark appends the metadata columns that the connector exposes to its output schema.
3. The plan is analyzed, and unused metadata columns or individual metadata fields are pruned away.
4. Spark builds a scan. The read schema contains the metadata columns/fields that are actually used (e.g. the user selected _metadata.row_id). Since the connector advertised that it could surface that metadata, it now fills in the values in the batches it returns to Spark.
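
The flow above can be sketched with plain lists standing in for Spark's schemas. This is a simplified model of the behavior described in this comment, not Spark's analyzer or optimizer code, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class MetadataColumnFlowSketch {
    /** Relation resolution: data columns plus the metadata columns the connector advertises. */
    static List<String> resolveRelationOutput(List<String> dataColumns,
                                              List<String> advertisedMetadataColumns) {
        List<String> output = new ArrayList<>(dataColumns);
        output.addAll(advertisedMetadataColumns);
        return output;
    }

    /** Pruning: keep only the columns the query references; the result is the read schema. */
    static List<String> pruneUnused(List<String> relationOutput, Set<String> referencedColumns) {
        List<String> readSchema = new ArrayList<>();
        for (String column : relationOutput) {
            if (referencedColumns.contains(column)) {
                readSchema.add(column);
            }
        }
        return readSchema;
    }

    public static void main(String[] args) {
        List<String> relationOutput =
            resolveRelationOutput(Arrays.asList("id", "value"), Arrays.asList("_metadata"));
        // The query selects id and _metadata.row_id, so value is pruned away.
        Set<String> referenced = new HashSet<>(Arrays.asList("id", "_metadata"));
        List<String> readSchema = pruneUnused(relationOutput, referenced);
        System.out.println("relation output: " + relationOutput);
        System.out.println("read schema: " + readSchema);
    }
}
```

In this model the connector never has to re-add _metadata itself: if the column is referenced, it is already in the read schema by the time the scan is built.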

The metadata fields should be part of the read schema, without the need to manually add them here. It's Spark's responsibility to do this in a generic manner.

Somehow, for streaming, the metadata column got lost along the way. This is what needs to be fixed, rather than manually adding it back after the fact.

Also, using options as a way to pass information is not robust. There are very few cases where options are the right way (even though we tend to overuse them). Better to be explicit and use a dedicated variable with a proper type.
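
To illustrate the robustness concern, here is a small sketch contrasting a string-keyed option lookup with a dedicated typed value. The option key and both helper types are hypothetical and only stand in for whatever the real code would use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TypedHandoffSketch {
    // Hypothetical option key, used only for this illustration.
    static final String OPTION_KEY = "__sketch.requestedMetadataFields";

    /** Option-based lookup: a misspelled key looks exactly like "nothing requested". */
    static List<String> fromOptions(Map<String, String> options) {
        String raw = options.get(OPTION_KEY);
        if (raw == null || raw.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(raw.split(","));
    }

    /** Typed alternative: the requested fields travel as an explicit, immutable value. */
    static final class RequestedMetadata {
        private final List<String> fields;

        RequestedMetadata(List<String> fields) {
            this.fields = Collections.unmodifiableList(new ArrayList<>(fields));
        }

        List<String> fields() {
            return fields;
        }
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        // A typo in the key is accepted silently and the request is lost.
        options.put("__sketch.requestedMetadataField", "row_id");
        System.out.println("from options: " + fromOptions(options));

        RequestedMetadata typed = new RequestedMetadata(Arrays.asList("row_id"));
        System.out.println("typed: " + typed.fields());
    }
}
```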

huan233usc · 2026-05-15T15:40:18Z

+1, can we dig more into why the _metadata column is not resolved, rather than depending on a read option?

TimothyW553 added 2 commits May 11, 2026 20:52

Fix V2 streaming row metadata reads

83b5996

Merge branch 'master' into metadata-row-id

7d35d6d

TimothyW553 marked this pull request as ready for review May 11, 2026 21:56

TimothyW553 requested review from huan233usc, murali-db, raveeram-db and tdas as code owners May 11, 2026 21:56

TimothyW553 added 2 commits May 13, 2026 13:04

Merge branch 'master' into metadata-row-id

5d8990c

Merge branch 'master' into metadata-row-id

e7f090c

johanl-db reviewed May 15, 2026

TimothyW553 mentioned this pull request May 15, 2026

[BUG] Spark V2 streaming does not push down required columns to the scan builder #6800

Open

TimothyW553 wants to merge 4 commits into delta-io:master from TimothyW553:metadata-row-id
