Metadata-only CREATE TABLE for DSv2 + Kernel + CCv2 path#7
Draft
TimothyW553 wants to merge 11 commits into
Draft
Metadata-only CREATE TABLE for DSv2 + Kernel + CCv2 path#7TimothyW553 wants to merge 11 commits into
TimothyW553 wants to merge 11 commits into
Conversation
<!-- Thanks for sending a pull request! Here are some tips for you: 1. If this is your first time, please read our contributor guidelines: https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md 2. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP] Your PR title ...'. 3. Be sure to keep the PR description updated to reflect all changes. 4. Please write your PR title to summarize what this PR proposes. 5. If possible, provide a concise example to reproduce the issue for a faster review. 6. If applicable, include the corresponding issue number in the PR title and link it in the body. --> #### Which Delta project/connector is this regarding? <!-- Please add the component selected below to the beginning of the pull request title For example: [Spark] Title of my pull request --> - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description Add collations table features, `collations-preview` and `collations`, as writer features. ## How was this patch tested? New test using golden table. ## Does this PR introduce _any_ user-facing changes? No.
delta-io#6337) Clarify in two places that checkpoints and reconciled snapshots should not contain `domainMetadata` actions with `removed=true`: - Action reconciliation: the `domainMetadata` collection excludes tombstones. - Checkpoint contents: the Domain Metadata bullet excludes removed entries.
…6350) <!-- Thanks for sending a pull request! Here are some tips for you: 1. If this is your first time, please read our contributor guidelines: https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md 2. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP] Your PR title ...'. 3. Be sure to keep the PR description updated to reflect all changes. 4. Please write your PR title to summarize what this PR proposes. 5. If possible, provide a concise example to reproduce the issue for a faster review. 6. If applicable, include the corresponding issue number in the PR title and link it in the body. --> #### Which Delta project/connector is this regarding? <!-- Please add the component selected below to the beginning of the pull request title For example: [Spark] Title of my pull request --> - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description test only follow up to delta-io#6322. revises a test to not be reliant on delta-spark versions. ## How was this patch tested? test only. test still passes ## Does this PR introduce _any_ user-facing changes? <!-- If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible. If possible, please also clarify if this is a user-facing change compared to the released Delta Lake versions or within the unreleased branches such as master. If no, write 'No'. --> no
…ale (delta-io#6354) ## Summary - Fix `ExpressionUtils.convertValueToKernelLiteral` crash when a `BigDecimal` has precision less than scale - Normalize precision to `Math.max(bd.precision(), bd.scale() + 1)` before calling `Literal.ofDecimal` - Add unit test for the edge case ## Problem A query with a decimal `IN` predicate containing `0.00` crashes when using the V2 connector: ```sql SELECT * FROM delta_table WHERE dec10_2 IN (0.00, 100.00) ``` This throws an `IllegalArgumentException: Invalid precision and scale combo` from Kernel's `Literal.ofDecimal`. **Root cause:** Java's `BigDecimal("0.00")` reports `precision()=1` and `scale()=2`. This violates the invariant that `precision >= scale` required by Kernel's `Literal.ofDecimal`. The V2 code in `ExpressionUtils.convertValueToKernelLiteral` was passing these values through without normalization. ## Fix Before constructing the Kernel literal, normalize precision: ```java int precision = Math.max(bd.precision(), bd.scale() + 1); ``` For `BigDecimal("0.00")`, this yields `precision=3, scale=2` (i.e., `DECIMAL(3,2)`), which correctly represents the value. ## Test plan - [x] New unit test `testConvertValueToKernelLiteral_DecimalWithScaleExceedingPrecision` passes - [x] All 73 `ExpressionUtilsTest` tests pass
…without a supported IcebergCompat version (delta-io#6352) <!-- Thanks for sending a pull request! Here are some tips for you: 1. If this is your first time, please read our contributor guidelines: https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md 2. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP] Your PR title ...'. 3. Be sure to keep the PR description updated to reflect all changes. 4. Please write your PR title to summarize what this PR proposes. 5. If possible, provide a concise example to reproduce the issue for a faster review. 6. If applicable, include the corresponding issue number in the PR title and link it in the body. --> #### Which Delta project/connector is this regarding? <!-- Please add the component selected below to the beginning of the pull request title For example: [Spark] Title of my pull request --> - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description When a user enables `delta.universalFormat.enabledFormats = 'iceberg'` alongside an unrecognized IcebergCompat property (e.g. `delta.enableIcebergCompatV3`), the error message currently says: > To enable IcebergCompatV2, set the table property 'delta.enableIcebergCompatV2' = 'true'. This is misleading because: 1. Spark Delta silently ignores the unrecognized V3 property (only V1 and V2 are in `IcebergCompat.knownVersions`) 2. The error then directs the user to V2 with no indication that V3 is not supported The fix updates the message to explicitly state that the supported versions are IcebergCompatV1 and IcebergCompatV2, making clear to users that higher versions are not yet supported in this Spark Delta release. Resolves delta-io#6351 ## How was this patch tested? The change is limited to an error message string. Manual reproduction: ```sql SET spark.databricks.delta.allowArbitraryProperties.enabled=true; CREATE TABLE demo.icebergCompatV3 (i INT, s STRING) USING DELTA TBLPROPERTIES ( 'delta.columnMapping.mode' = 'name', 'delta.enableIcebergCompatV3' = 'true', 'delta.enableDeletionVectors' = 'false', 'delta.universalFormat.enabledFormats' = 'iceberg' ); ``` **Before:** `"To enable IcebergCompatV2, set the table property 'delta.enableIcebergCompatV2' = 'true'."` **After:** `"Supported versions are IcebergCompatV1 and IcebergCompatV2."` ## Does this PR introduce _any_ user-facing changes? Yes. The error message for `DELTA_UNIVERSAL_FORMAT_VIOLATION` (when UniForm Iceberg is enabled without a recognized IcebergCompat version) now explicitly lists the supported IcebergCompat versions (V1 and V2), instead of directing the user to enable V2. This helps users who set an unsupported version (e.g. V3) understand why the error occurred. --------- Signed-off-by: openinx <openinx@gmail.com>
…commits (delta-io#6338) ## Description This PR adds a regression test that exposes a **data loss bug** in Delta streaming when used with Coordinated Commits and non-trivial backfill batch sizes. Related issue: delta-io#6339 ### Observed Behavior The test writes 100 sequential single-row commits to a Delta table while a streaming query is running, then verifies all 100 values appear in the sink. - **batchSize = 1**: Test passes consistently. All 100 values present. - **batchSize = 2**: Test fails. ~4 out of 100 commits are lost on average. **Only odd-versioned commits are lost** (these are the versions that are not immediately backfilled). - **batchSize = 3**: Test fails. The loss pattern follows the backfill cycle: - `v % 3 == 0`: no loss (these versions trigger backfill) - `v % 3 == 1`: v may be lost, and if v is lost, v+1 is always lost together (they get backfilled as a pair) - `v % 3 == 2`: v may be lost independently The pattern strongly correlates with which commits are sitting unbackfilled in the coordinator at any given time, suggesting the bug is related to the interaction between backfill and commit listing. ### Hypothesized Root Cause We suspect the issue is a race condition in `CoordinatedCommitsUtils.commitFilesIterator`, which is used by `DeltaSource.getFileChanges` during both `latestOffset` and `getBatch`. This method lists commits in two lazy, sequential steps: 1. List backfilled commits from the filesystem (`listedDeltas`) 2. Query the coordinator for unbackfilled commits (`tailFromSnapshot`) These two steps are **not atomic**. A possible scenario (example with batchSize = 3): 1. **During `latestOffset`**: filesystem has `[0.json]`, coordinator has `[1, 2]`. `latestOffset` correctly computes endOffset covering through version 2. 2. **During `getBatch`**: the filesystem listing iterator runs and sees `[0.json]`. 3. **Between the two iterators**: a concurrent write creates version 3, triggering `backfillToVersion(3)`. This writes versions 1, 2, 3 to the filesystem and **removes them from the coordinator** via `registerBackfill`. 4. **The coordinator query runs**: returns empty — versions 1 and 2 have been removed. 5. **Result**: `getBatch` misses versions 1 and 2. The next batch starts from version 3, so they are never re-read. ## How was this patch tested? Added a new test `"streaming processes 100 sequential single-value commits and contains all values 0 to 99"` that: - Creates a Delta table and starts a streaming query - Appends 100 single-row commits while the stream is running - Verifies all 100 values appear in the sink The test passes for batchSize = 1 but fails for batchSize = 2 and 3 due to the data loss. ## Does this PR introduce _any_ user-facing changes? No. This PR only adds a test to demonstrate the existing bug. A fix will follow in a subsequent PR.
## 🥞 Stacked PR Use this [link](https://github.com/delta-io/delta/pull/6249/files/0d54da51e7eade47b8115d92aaf7be1e8e4c011f..7e400f87189bef892d3b0022c0aede238a0c84de) to review incremental changes. - [stack/ignoreDeletesV2](delta-io#6245) [[Files changed](https://github.com/delta-io/delta/pull/6245/files)] - [stack/skipChangeCommitsV2](delta-io#6246) [[Files changed](https://github.com/delta-io/delta/pull/6246/files/6e9962d4a30a63ed14786830bae2b668004a76c9..2e111cf6ac9d1e5f84d83d94412b05486f543613)] - [**stack/ignoreChangesV2**](delta-io#6249) [[Files changed](https://github.com/delta-io/delta/pull/6249/files/0d54da51e7eade47b8115d92aaf7be1e8e4c011f..7e400f87189bef892d3b0022c0aede238a0c84de)] - [stack/ignoreFileDeletionV2](delta-io#6250) [[Files changed](https://github.com/delta-io/delta/pull/6250/files/7e400f87189bef892d3b0022c0aede238a0c84de..af230590d0d9c1a5c37cc6a7b6af404025a6d415)] --------- <!-- Thanks for sending a pull request! Here are some tips for you: 1. If this is your first time, please read our contributor guidelines: https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md 2. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP] Your PR title ...'. 3. Be sure to keep the PR description updated to reflect all changes. 4. Please write your PR title to summarize what this PR proposes. 5. If possible, provide a concise example to reproduce the issue for a faster review. 6. If applicable, include the corresponding issue number in the PR title and link it in the body. --> #### Which Delta project/connector is this regarding? <!-- Please add the component selected below to the beginning of the pull request title For example: [Spark] Title of my pull request --> - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Support ignoreChanges read option in DSv2, which skip all remove file actions but keep the add file actions in the same commit. <!-- - Describe what this PR changes. - Describe why we need the change. If this PR resolves an issue be sure to include "Resolves #XXX" to correctly link and close the issue upon merge. --> ## How was this patch tested? Unit tests that test parity between v1 and v2 connector on both pure deletes commit (only remove) and change commit (add + remove). Integration tests: - streaming with ignoreChanges = true allows both delete and change commits <!-- If tests were added, say they were added here. Please make sure to test the changes thoroughly including negative and positive cases if possible. If the changes were tested in any way other than unit tests, please clarify how you tested step by step (ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future). If the changes were not tested, please explain why. --> ## Does this PR introduce _any_ user-facing changes? No <!-- If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible. If possible, please also clarify if this is a user-facing change compared to the released Delta Lake versions or within the unreleased branches such as master. If no, write 'No'. -->
…elta-io#6335) <!-- Thanks for sending a pull request! Here are some tips for you: 1. If this is your first time, please read our contributor guidelines: https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md 2. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP] Your PR title ...'. 3. Be sure to keep the PR description updated to reflect all changes. 4. Please write your PR title to summarize what this PR proposes. 5. If possible, provide a concise example to reproduce the issue for a faster review. 6. If applicable, include the corresponding issue number in the PR title and link it in the body. --> ## Description Write atomic-supported property to Iceberg compact tables. This is part of project to support write to UC managed UniForm tables ## How was this patch tested? Add UTs ``` build/sbt -DsparkVersion=4.0 "iceberg/testOnly org.apache.spark.sql.delta.uniform.UniversalFormatSuite" ```
… path Add a clean, sequential CREATE TABLE architecture for the Delta DSv2 catalog path using Kernel as the transaction engine: build → commit → publish → load New components in spark/v2 ddl package: - CreateTableContext: plain data POJO for operation inputs - CreateTableTxnBuilder: encapsulates all prep logic (UC pre-registration, path resolution, property filtering, schema conversion, Kernel txn building) - TableCommitter: generic Kernel commit boundary - CreateTablePublisher: derives catalog publication from committed snapshot - DTOs: PreparedTableTxn, PreparedCreateTableTxn, CommittedTableTxn, CreateTableCatalogPublication - Generic interfaces: TableTxnBuilder, TablePublisher DeltaCatalog.createTable routes to the new path when DeltaV2Mode is STRICT or AUTO (for UC-managed tables). The orchestrator is 4 lines with zero branching — all table-type logic is pushed into the builder/publisher. Also: - build.sbt: Unity Catalog version → 0.5.0-SNAPSHOT - DeltaV2Mode: added shouldUseKernelForCreateTable() - CatalogTableUtils: added isCatalogManagedFromProperties() - AbstractDeltaCatalog: isUnityCatalog visibility → protected Tests: 5 unit tests (property filtering, DataLayoutSpec) + 5 integration tests (full build→commit→publish for path-based tables).
bbb0e88 to
2e86dfc
Compare
Replace 10-file over-engineered design with minimal plumbing: - DDLRequest: generic POJO for all DDL ops (CREATE, CTAS, RTAS) - CreateTableBuilder: prepare() + buildTransaction() - CreateTablePublisher: publish abstraction point DeltaCatalog.createTable() is now 5 lines: prepare → buildTransaction → commit → publish → loadTable
2e86dfc to
5dcc3c8
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
ddlpackage inspark/v2with:CreateTableTxnBuilder,TableCommitter,CreateTablePublisher,CreateTableContext(plain POJO), and supporting DTOs/interfaces0.5.0-SNAPSHOTTest plan
This pull request was AI-assisted by Isaac.