[Spark] Align DSv2 _metadata with Spark base fields + row tracking by murali-db · Pull Request #6775 · delta-io/delta

murali-db · 2026-05-13T17:48:25Z

What changes were proposed in this pull request?

Align the DSv2 Kernel SparkTable.metadataColumns() with Spark file-source base metadata, and unify the per-row materialisation path for both file-source and row-tracking fields.

SparkTable.metadataColumns() returns a single _metadata struct containing Spark's BASE_METADATA_FIELDS (file_path, file_name, file_size, file_block_start, file_block_length, file_modification_time); when row tracking is enabled, row_id and row_commit_version are appended.
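
To illustrate the resulting field order, here is a minimal, dependency-free sketch. `MetadataFieldsSketch` and `metadataFieldNames` are hypothetical names used only for illustration; the real implementation builds a Spark struct type rather than a list of strings.

```java
import java.util.ArrayList;
import java.util.List;

final class MetadataFieldsSketch {
  // Spark file-source base fields, in the order listed in this description.
  static final List<String> BASE_FIELDS = List.of(
      "file_path", "file_name", "file_size",
      "file_block_start", "file_block_length", "file_modification_time");

  /** Hypothetical helper: subfield names of the single _metadata struct. */
  static List<String> metadataFieldNames(boolean rowTrackingEnabled) {
    List<String> fields = new ArrayList<>(BASE_FIELDS);
    if (rowTrackingEnabled) {
      // Row-tracking fields are appended after the base fields.
      fields.add("row_id");
      fields.add("row_commit_version");
    }
    return List.copyOf(fields);
  }

  public static void main(String[] args) {
    System.out.println(metadataFieldNames(false));
    System.out.println(metadataFieldNames(true));
  }
}
```
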
Materialise _metadata via a per-field setter strategy. One MetadataValueSetterBuilder is registered per requested struct field; MetadataStructReadFunction binds each builder to the current PartitionedFile once and runs the resulting BoundMetadataValueSetters per row, writing into a reused GenericInternalRow (no per-row Object[] copy).
Three setter implementations:
- FileConstantValueSetterBuilder — wraps the extractors exposed by DeltaParquetFileFormat#fileConstantMetadataExtractors (Spark base fields plus Delta extras like base_row_id / default_row_commit_version).
- RowIdValueSetterBuilder / RowCommitVersionValueSetterBuilder — encapsulate the Delta coalesce against materialised row-tracking helper columns, falling back to baseRowId + physicalRowIndex / default_row_commit_version respectively.
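
The bind-once, run-per-row pattern and the row-tracking coalesce can be sketched roughly as follows. This is a simplified model, not the PR's code: `FileInfo`, `Row`, `SetterBuilder`, `BoundSetter`, and `readFile` are hypothetical stand-ins for `PartitionedFile`, the Parquet row, and the real builder and setter types, and a plain `Object[]` stands in for the reused `GenericInternalRow`.

```java
import java.util.List;

final class SetterStrategySketch {
  /** Stand-in for PartitionedFile: only the file-level constants used below. */
  static final class FileInfo {
    final String path; final long baseRowId; final long defaultRowCommitVersion;
    FileInfo(String path, long baseRowId, long defaultRowCommitVersion) {
      this.path = path; this.baseRowId = baseRowId;
      this.defaultRowCommitVersion = defaultRowCommitVersion;
    }
  }

  /** Stand-in for one data row; materialised helper values are null when absent. */
  static final class Row {
    final long physicalRowIndex; final Long materializedRowId; final Long materializedCommitVersion;
    Row(long physicalRowIndex, Long materializedRowId, Long materializedCommitVersion) {
      this.physicalRowIndex = physicalRowIndex; this.materializedRowId = materializedRowId;
      this.materializedCommitVersion = materializedCommitVersion;
    }
  }

  /** Runs once per row, already bound to a single file. */
  interface BoundSetter { void set(Row row, Object[] out, int ordinal); }

  /** Bound once per file to produce the per-row setter. */
  interface SetterBuilder { BoundSetter bind(FileInfo file); }

  // File constant: the value is fixed for every row of the file.
  static final SetterBuilder FILE_PATH = file -> (row, out, i) -> out[i] = file.path;

  // Coalesce: prefer the materialised row id, else baseRowId + physicalRowIndex.
  static final SetterBuilder ROW_ID = file -> (row, out, i) ->
      out[i] = row.materializedRowId != null
          ? row.materializedRowId
          : file.baseRowId + row.physicalRowIndex;

  // Coalesce: prefer the materialised commit version, else the file default.
  static final SetterBuilder ROW_COMMIT_VERSION = file -> (row, out, i) ->
      out[i] = row.materializedCommitVersion != null
          ? row.materializedCommitVersion
          : file.defaultRowCommitVersion;

  static Object[][] readFile(FileInfo file, List<Row> rows, List<SetterBuilder> builders) {
    BoundSetter[] bound = new BoundSetter[builders.size()];
    for (int i = 0; i < bound.length; i++) bound[i] = builders.get(i).bind(file); // once per file
    Object[] reused = new Object[bound.length]; // one buffer reused across rows
    Object[][] result = new Object[rows.size()][];
    for (int r = 0; r < rows.size(); r++) {
      for (int i = 0; i < bound.length; i++) bound[i].set(rows.get(r), reused, i);
      result[r] = reused.clone(); // copied only so this sketch can return every row
    }
    return result;
  }
}
```

In this model a row without materialised values resolves to `baseRowId + physicalRowIndex` and the file's default commit version, while a row with materialised values keeps them.
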
MetadataStructSchemaContext.forSchema(...) is the single owner of the pruned _metadata struct, the parquet read schema (augmented with row-tracking helper columns when needed), data / partition projection ordinals, and the ordered MetadataValueSetterBuilder[]. Returns Optional so it's only constructed when the scan requests _metadata.
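
The `Optional` contract can be pictured with a small sketch. Everything here is an assumption made for illustration: `SchemaContextSketch`, the use of dotted column names such as `_metadata.file_path` to represent a requested schema, and the `needsRowTrackingHelperColumns` flag. The real context works on Spark schemas and also owns projection ordinals and the setter builders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class SchemaContextSketch {
  static final String PREFIX = "_metadata.";

  final List<String> prunedMetadataFields;      // only the requested subfields, in order
  final boolean needsRowTrackingHelperColumns;  // whether the read schema must be augmented

  private SchemaContextSketch(List<String> fields) {
    this.prunedMetadataFields = List.copyOf(fields);
    this.needsRowTrackingHelperColumns =
        fields.contains("row_id") || fields.contains("row_commit_version");
  }

  /** Returns empty when the scan requests no _metadata subfield at all. */
  static Optional<SchemaContextSketch> forSchema(List<String> requestedColumns) {
    List<String> fields = new ArrayList<>();
    for (String column : requestedColumns) {
      if (column.startsWith(PREFIX)) {
        fields.add(column.substring(PREFIX.length()));
      }
    }
    return fields.isEmpty() ? Optional.empty() : Optional.of(new SchemaContextSketch(fields));
  }
}
```
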
Replaces the previous RowTrackingReadFunction / RowTrackingSchemaContext design; row tracking is now modelled as two specialised _metadata fields, keeping per-field plumbing uniform.

How was this patch tested?

Added V2MetadataReadTest covering single-subfield projection (parametrised across the six base fields), bare _metadata struct selection, SELECT _metadata, *, and mixed file_path + row-tracking projection.
V2RowTrackingReadTest.testMixedFileHistoryRowIdResolves exercises both branches of the row-id coalesce against a table with mixed file history (INSERT, UPDATE rewrite, INSERT).
Added MetadataStructSchemaContextTest, MetadataValueSetterTest, MetadataStructReadFunctionTest covering the schema-context, individual setter builders, and end-to-end read-function wiring.
Extended SparkTableTest with metadata-column assertions for both row-tracking-enabled and disabled tables.

Does this PR introduce any user-facing change?

_metadata on the DSv2 Delta connector is a wider struct than before. Name-based access (_metadata.row_id, _metadata.file_path, etc.) stays stable. Positional access against the previous shape may break — callers should switch to name-based access.
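
A small sketch of why positional access is fragile while name-based access is not. The `NARROW` layout is purely hypothetical (this description does not spell out the previous struct shape), and `NameVsPositionSketch` and `byName` are illustrative names, not connector APIs.

```java
import java.util.List;

final class NameVsPositionSketch {
  // HYPOTHETICAL earlier layout, for illustration only.
  static final List<String> NARROW = List.of("row_id", "row_commit_version");

  // Wider layout: base fields first, row-tracking fields appended.
  static final List<String> WIDE = List.of(
      "file_path", "file_name", "file_size",
      "file_block_start", "file_block_length", "file_modification_time",
      "row_id", "row_commit_version");

  /** Resolves a field by name, so the result does not depend on field order. */
  static Object byName(List<String> layout, Object[] values, String name) {
    int ordinal = layout.indexOf(name);
    if (ordinal < 0) {
      throw new IllegalArgumentException("no such field: " + name);
    }
    return values[ordinal];
  }
}
```

With these layouts, `byName(..., "row_id")` returns the same value for both shapes, whereas ordinal 0 points at `row_id` in the narrow layout and at `file_path` in the wide one.
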

- Align DSv2 Kernel SparkTable.metadataColumns() with Spark file-source base metadata (file_path, file_name, file_size, file_block_start, file_block_length, file_modification_time). When row tracking is enabled, append row_id and row_commit_version after the base fields.
- Materialise _metadata via a per-field setter strategy. One MetadataValueSetterBuilder per requested struct field, bound to file-level constants once per PartitionedFile, run per row by MetadataStructReadFunction.
- Three setter implementations: FileConstantValueSetterBuilder (wraps DeltaParquetFileFormat#fileConstantMetadataExtractors), RowIdValueSetterBuilder and RowCommitVersionValueSetterBuilder (encapsulate the coalesce against materialised helper columns).
- MetadataStructSchemaContext.forSchema(...) returns Optional, built only when the scan requests _metadata.
- Replaces the previous RowTrackingReadFunction / RowTrackingSchemaContext design; row tracking is now modelled as two specialised _metadata fields.
- Adds parametrised tests covering each base subfield and the row-tracking coalesce paths.

Reformat 14 Java sources to match Delta OSS google-java-format conventions.

SanJSp

Thanks for the changes - nothing to remark from my side!

murali-db marked this pull request as ready for review May 13, 2026 17:55

murali-db requested review from TimothyW553, huan233usc, raveeram-db and tdas as code owners May 13, 2026 17:55

murali-db added 2 commits May 13, 2026 11:32

Apply javafmt formatting

34ac2a4

murali-db force-pushed the dsv2-dml-nr-pr1-file-path branch from d1ef3a1 to 34ac2a4 Compare May 13, 2026 18:34

SanJSp approved these changes May 15, 2026

murali-db wants to merge 2 commits into delta-io:master from murali-db:dsv2-dml-nr-pr1-file-path