[Spark] Align DSv2 _metadata with Spark base fields + row tracking#6775

Open
murali-db wants to merge 2 commits into delta-io:master from murali-db:dsv2-dml-nr-pr1-file-path

@murali-db
Collaborator

What changes were proposed in this pull request?

Align the DSv2 Kernel SparkTable.metadataColumns() with Spark file-source base metadata, and unify the per-row materialisation path for both file-source and row-tracking fields.

  • SparkTable.metadataColumns() returns a single _metadata struct containing Spark's BASE_METADATA_FIELDS (file_path, file_name, file_size, file_block_start, file_block_length, file_modification_time); when row tracking is enabled, row_id and row_commit_version are appended.
  • Materialise _metadata via a per-field setter strategy. One MetadataValueSetterBuilder is registered per requested struct field; MetadataStructReadFunction binds each builder to the current PartitionedFile once and runs the resulting BoundMetadataValueSetters per row, writing into a reused GenericInternalRow (no per-row Object[] copy).
  • Three setter implementations:
    • FileConstantValueSetterBuilder — wraps the extractors exposed by DeltaParquetFileFormat#fileConstantMetadataExtractors (Spark base fields plus Delta extras like base_row_id / default_row_commit_version).
    • RowIdValueSetterBuilder / RowCommitVersionValueSetterBuilder — encapsulate the Delta coalesce against materialised row-tracking helper columns, falling back to baseRowId + physicalRowIndex / default_row_commit_version respectively.
  • MetadataStructSchemaContext.forSchema(...) is the single owner of the pruned _metadata struct, the parquet read schema (augmented with row-tracking helper columns when needed), data / partition projection ordinals, and the ordered MetadataValueSetterBuilder[]. It returns an Optional, so the context is constructed only when the scan requests _metadata.
  • Replaces the previous RowTrackingReadFunction / RowTrackingSchemaContext design; row tracking is now modelled as two specialised _metadata fields, keeping per-field plumbing uniform.
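
For reviewers who want the shape of the setter strategy without opening the diff, here is a minimal, dependency-free sketch. It is not the code in this PR: `FileInfo`, `SetterBuilder`, `BoundSetter` and `readFile` are hypothetical stand-ins for PartitionedFile, MetadataValueSetterBuilder, BoundMetadataValueSetter and MetadataStructReadFunction, and a plain `Object[]` stands in for GenericInternalRow. It only illustrates the two ideas described above: builders are bound once per file, and the row-id setter coalesces a materialised value with baseRowId + physicalRowIndex.

```java
import java.util.function.Function;

final class MetadataSetterSketch {

  /** Hypothetical per-file descriptor (stand-in for Spark's PartitionedFile). */
  static final class FileInfo {
    final String path;
    final long baseRowId;

    FileInfo(String path, long baseRowId) {
      this.path = path;
      this.baseRowId = baseRowId;
    }
  }

  /** Runs once per row and writes one struct field into the reused output row. */
  interface BoundSetter {
    void set(Object[] dataRow, long physicalRowIndex, Object[] out, int ordinal);
  }

  /** Registered once per requested struct field; bound once per file. */
  interface SetterBuilder {
    BoundSetter bind(FileInfo file);
  }

  /** File-constant field: the value is resolved at bind time and reused for every row. */
  static SetterBuilder fileConstant(Function<FileInfo, Object> extractor) {
    return file -> {
      Object constant = extractor.apply(file);
      return (dataRow, index, out, ordinal) -> out[ordinal] = constant;
    };
  }

  /** row_id: use the materialised helper column if set, else baseRowId + physical row index. */
  static SetterBuilder rowId(int helperColumnOrdinal) {
    return file -> (dataRow, index, out, ordinal) -> {
      Object materialised = dataRow[helperColumnOrdinal];
      if (materialised != null) {
        out[ordinal] = materialised;
      } else {
        out[ordinal] = Long.valueOf(file.baseRowId + index);
      }
    };
  }

  /** Binds every builder once for the file, then runs the bound setters per row. */
  static Object[][] readFile(FileInfo file, Object[][] rows, SetterBuilder[] builders) {
    BoundSetter[] bound = new BoundSetter[builders.length];
    for (int i = 0; i < builders.length; i++) {
      bound[i] = builders[i].bind(file);
    }
    Object[] reused = new Object[builders.length]; // one output row, overwritten per input row
    Object[][] result = new Object[rows.length][];
    for (int r = 0; r < rows.length; r++) {
      for (int i = 0; i < bound.length; i++) {
        bound[i].set(rows[r], r, reused, i);
      }
      result[r] = reused.clone(); // copied here only so the sketch can return every row
    }
    return result;
  }
}
```

In this sketch the only per-row work is one call per requested field, and the file-level lookups happen once in `bind`. The copy at the end of `readFile` exists purely so the example can hand back all rows; the PR description states that the real path avoids a per-row Object[] copy.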

How was this patch tested?

  • Added V2MetadataReadTest covering single-subfield projection (parametrised across the six base fields), bare _metadata struct selection, SELECT _metadata, *, and mixed file_path + row-tracking projection.
  • V2RowTrackingReadTest.testMixedFileHistoryRowIdResolves exercises both branches of the row-id coalesce against a table with mixed file history (INSERT, UPDATE rewrite, INSERT).
  • Added MetadataStructSchemaContextTest, MetadataValueSetterTest, MetadataStructReadFunctionTest covering the schema-context, individual setter builders, and end-to-end read-function wiring.
  • Extended SparkTableTest with metadata-column assertions for both row-tracking-enabled and disabled tables.

Does this PR introduce any user-facing change?

_metadata on the DSv2 Delta connector is a wider struct than before. Name-based access (_metadata.row_id, _metadata.file_path, etc.) stays stable. Positional access against the previous shape may break — callers should switch to name-based access.
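
To make the positional-access risk concrete, here is a small sketch in plain Java. The new field order is taken from this description (row tracking enabled). The old shape, `OLD_FIELDS`, is an assumption made purely for illustration and is not confirmed anywhere in this PR text; `byName` is likewise a hypothetical helper, not a Spark or Delta API.

```java
import java.util.List;

final class MetadataAccessSketch {

  // ASSUMED earlier struct shape, for illustration only.
  static final List<String> OLD_FIELDS = List.of("row_id", "row_commit_version");

  // Field order described in this PR when row tracking is enabled.
  static final List<String> NEW_FIELDS = List.of(
      "file_path", "file_name", "file_size",
      "file_block_start", "file_block_length", "file_modification_time",
      "row_id", "row_commit_version");

  /** Resolves a struct field by name, independent of where it sits in the struct. */
  static Object byName(List<String> fields, Object[] values, String name) {
    int ordinal = fields.indexOf(name);
    if (ordinal < 0) {
      throw new IllegalArgumentException("no such field: " + name);
    }
    return values[ordinal];
  }
}
```

Under that assumed old shape, ordinal 0 used to hold row_id and now holds file_path, while `byName(..., "row_id")` resolves to the same value against either field list.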

@murali-db murali-db marked this pull request as ready for review May 13, 2026 17:55
murali-db added 2 commits May 13, 2026 11:32
- Align DSv2 Kernel SparkTable.metadataColumns() with Spark file-source base metadata (file_path, file_name, file_size, file_block_start, file_block_length, file_modification_time). When row tracking is enabled, append row_id and row_commit_version after the base fields.
- Materialise _metadata via a per-field setter strategy. One MetadataValueSetterBuilder per requested struct field, bound to file-level constants once per PartitionedFile, run per row by MetadataStructReadFunction.
- Three setter implementations: FileConstantValueSetterBuilder (wraps DeltaParquetFileFormat#fileConstantMetadataExtractors), RowIdValueSetterBuilder and RowCommitVersionValueSetterBuilder (encapsulate the coalesce against materialised helper columns).
- MetadataStructSchemaContext.forSchema(...) returns Optional, built only when the scan requests _metadata.
- Replaces the previous RowTrackingReadFunction / RowTrackingSchemaContext design; row tracking is now modelled as two specialised _metadata fields.
- Adds parametrised tests covering each base subfield and the row-tracking coalesce paths.
Reformat 14 Java sources to match Delta OSS google-java-format conventions.
@murali-db murali-db force-pushed the dsv2-dml-nr-pr1-file-path branch from d1ef3a1 to 34ac2a4 Compare May 13, 2026 18:34
Collaborator

@SanJSp SanJSp left a comment

Thanks for the changes - nothing to remark from my side!


2 participants