[PROTOCOL] Add commitInfo.isIncrementalSafe spec#6798

Open
scottsand-db wants to merge 2 commits into delta-io:master from scottsand-db:stack/incremental_safe_commits

Collaborator

@scottsand-db scottsand-db commented May 15, 2026

Adds the commitInfo.isIncrementalSafe boolean to the spec and describes exactly when it can be set, what it achieves, and how it is used. This is a new optional field. No known Delta writers (trino, duckdb, delta-rs, flink, spark, etc.) write it today, so I don't expect any conflicts.

While I'm there, also clarify that commitInfo must be a JSON object and not just any JSON value (such as an array). This is just a tiny oversight in the existing spec; no known Delta writers actually write an array instead of an object for commitInfo.
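
A minimal sketch of the check this clarification implies, assuming each line of a commit file is one JSON action keyed by its action name. `commit_info_of` is a hypothetical helper for illustration, not part of any Delta library.

```python
import json
from typing import Optional


def commit_info_of(line: str) -> Optional[dict]:
    """Return the commitInfo payload of one commit-file line.

    Returns None when the line holds a different action, and raises when
    commitInfo is present but is not a JSON object.
    """
    action = json.loads(line)
    if not isinstance(action, dict) or "commitInfo" not in action:
        return None
    info = action["commitInfo"]
    if not isinstance(info, dict):
        raise ValueError("commitInfo must be a JSON object")
    return info
```

A reader built this way rejects an array-valued commitInfo up front instead of failing later on a field lookup.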

Once this PR is merged, we should update existing writers that already confirm or deduce whether a commit is incremental-safe to emit this field.

@scottsand-db scottsand-db requested a review from tdas as a code owner May 15, 2026 20:19
@scottsand-db scottsand-db changed the title from "update commitInfo spec" to "[PROTOCOL] Add commitInfo.isIncrementalSafe spec" May 15, 2026
@scottsand-db scottsand-db self-assigned this May 15, 2026
@scottsand-db scottsand-db changed the title from "[PROTOCOL] Add commitInfo.isIncrementalSafe spec" to "[PROTOCOL] Add commitInfo.isIncrementalSafe spec" May 15, 2026
Collaborator

@scovich scovich left a comment

Ah, this is nice: We've seen buggy clients that add the same file twice in separate INSERT commands, which not only violates (1) but also is unlikely to give the expected outcome (SELECT * returns the same result before and after the INSERT). With this flag, we at least would have a statement of intent that could help identify commits which claimed to be inserting fresh files but which actually re-inserted existing files.
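
A rough sketch of the kind of audit this enables. It assumes that guarantee (1) means an incremental-safe commit never adds a file that is already live, that files are identified by `path` alone (deletion vectors are ignored), and that the caller already knows the set of live paths before the commit. `find_readded_paths` is a hypothetical name.

```python
from typing import Iterable, List, Set


def find_readded_paths(active_paths: Set[str], actions: Iterable[dict]) -> List[str]:
    """Return paths that a commit adds although they are already live.

    `active_paths` holds the paths live in the table before the commit;
    `actions` are the commit's parsed actions in file order.
    """
    seen = set(active_paths)  # copy so the caller's set is not mutated
    readded = []
    for action in actions:
        add = action.get("add")
        if add is None:
            continue
        if add["path"] in seen:
            readded.append(add["path"])
        seen.add(add["path"])
    return readded
```

A non-empty result for a commit that also asserts the flag would point at a writer whose stated intent and actual actions disagree.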

Comment thread PROTOCOL.md Outdated
Collaborator

@scovich scovich left a comment

LGTM

Comment thread PROTOCOL.md

When [In-Commit Timestamps](#in-commit-timestamps) are enabled, writers are required to include a `commitInfo` action with every commit, which must include the `inCommitTimestamp` field. Also, the `commitInfo` action must be first action in the commit.

The `commitInfo` action may include an optional boolean field `isIncrementalSafe`. When `true`, the writer asserts that this commit is incrementally safe: its effect on any aggregate derived from the log (e.g. those recorded in a [Version Checksum](#version-checksum-file)) can be computed from this commit's own `add` and `remove` actions alone. For example, given the Version Checksum at version `N`, a reader can derive `numFiles`, `tableSizeBytes`, and `fileSizeHistogram` at any later version by iteratively applying each subsequent commit's `add.size` and `remove.size`, provided every such commit asserts `isIncrementalSafe=true`. Specifically, the writer guarantees:
Collaborator

If we are allowing any fields to be added here, should we start naming fields that delta specifically mandates with a prefix like "delta."?

Collaborator Author

I like that idea! delta.isIncrementalSafe

Collaborator

These are actual json fields. I guess we could choose delta.whatever as a field name, tho that complicates json parsing and such because of the special character. Unless you mean that delta is a spec-mandated object with spec-mandated fields?

Collaborator

Why does having a key with a period make the JSON parsing more difficult? It's still just a string from the parser's perspective (I don't think "." requires escaping).

Collaborator Author

Yeah +1 my understanding is that these are just JSON keys

{
  "user.name": "alice",
  "config.db.host": "localhost"
}

and that a "." inside of the key has no impact on parsing, right?
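
A quick way to check this with Python's standard `json` module. The second lookup is an illustrative assumption about a consumer that treats "." as a path separator; it is not taken from any specific engine.

```python
import json

doc = json.loads('{"user.name": "alice", "config.db.host": "localhost"}')

# The parser yields two flat keys; the dots are ordinary characters.
flat_keys = sorted(doc)

# A consumer that splits on "." would look for a nested object instead,
# and finds nothing.
nested_lookup = doc.get("user", {}).get("name")
```

So the JSON layer itself is unaffected; any difficulty would come from layers that interpret key names as paths.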

Collaborator

JSON doesn't care. SQL engines that invoke said parsing do care. In theory they can use various escaping mechanisms to handle the fact that the . is not a field separator.

Collaborator

@scovich scovich May 20, 2026

And it also breaks serde to strongly typed objects (delta-spark and kernel-rs both do this). In theory field aliasing trickery can compensate ("parse delta.foo as delta_foo" or similar).
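
A sketch of what such aliasing could look like, using a plain Python dataclass rather than the actual delta-spark or kernel-rs serde layers. The `ALIASES` table and `CommitInfo` class are hypothetical, and `delta.isIncrementalSafe` is only the name floated in this thread.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from a dotted JSON key to a legal attribute name.
ALIASES = {"delta.isIncrementalSafe": "delta_is_incremental_safe"}


@dataclass
class CommitInfo:
    delta_is_incremental_safe: Optional[bool] = None


def parse_commit_info(raw: str) -> CommitInfo:
    """Parse a commitInfo JSON object into a typed CommitInfo."""
    obj = json.loads(raw)
    # Rename aliased keys and drop keys the typed object does not model.
    known = {ALIASES[k]: v for k, v in obj.items() if k in ALIASES}
    return CommitInfo(**known)
```

The cost is that every typed client has to carry a mapping like this for each dotted field.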

Collaborator

Makes sense. So maybe just no dot, and have the prefix be part of the full name? deltaIsIncrementalSafe?

Comment thread PROTOCOL.md

Collaborator

Do we also want to mandate that the numRecords stat is present?

Collaborator

I guess this isn't captured in CRC files today so maybe not necessary.

Collaborator Author

per-column stats (numRecords) is different from table-level stats (add.size) -- IMO okay to exclude numRecords?

TableFeatures that depend on this should require that themselves

Comment thread PROTOCOL.md

Collaborator

I think in the future we are going to need actual back-references attached to actions (for example, for the adaptive metadata tree that was recently discussed in a community sync).

This will cause the information to be stored at up to two levels, commit and file. Are we OK with this?

Collaborator Author

> This will cause the information to be stored at up to two levels, commit and file. Are we OK with this?

Sorry @emkornfield -- I didn't get the impact of this line on the proposal in this PR. Can you elaborate on either (a) how your comment impacts this change or (b) how the change proposed here impacts what you commented about?

Collaborator

> Sorry @emkornfield -- I didn't get the impact of this line on the proposal in this PR. Can you elaborate on either (a) how your comment impacts this change or (b) how the change proposed here impacts what you commented about?

Sorry for not being clear. I think the TL;DR is: did you consider the option of doing something at the action level? If writers know they are replacing files without a corresponding remove in the commit, they could presumably mark the action as "duplicate or replace" or something similar, so an incremental processor could potentially just skip those actions. There are likely a lot of factors here, so it might not be viable. But at least for AMT we will likely need file-action-level metadata that might be somewhat redundant with the information this commit-level metadata summarizes.

Collaborator Author

> If writers know they are replacing files without a corresponding remove in the commit, they could presumably mark the action as "duplicate or replace" or something similar, so an incremental processor could potentially just skip those actions.

Ah, I see! This is a great callout. This "replace" would apply to the entire commit, yes? I wonder if we would ever need per-file-action granularity.. but seems like delta.incrementalOp=true/false/skip(replace) could be a direction worth exploring

Comment thread PROTOCOL.md

Collaborator

Thinking out loud on naming. We are really asserting a few things with this flag:

  1. Action reconciliation is not needed for this file.
  2. Remove actions must reference a file in the table (I guess this was never called out specifically before).
  3. Some optional fields are guaranteed to be written here (driven by the requirements of CRC).

Would something specifically referencing CRC make sense (e.g. crcFileIncrementalSafe)? If it weren't for the updated requirements on fields, naming this something like noExternalReconciliationNeeded might better describe the feature.

Collaborator Author

> Would something specifically referencing CRC make sense

I actually tried to avoid this because technically you could not have a CRC, and sum up files in a checkpoint, and then incrementally sum up delta files, to produce the table-level stats for the table. No CRCs needed in this scenario.

Thus, so far, delta.isIncrementalSafe seems like a correct and generalized property name with no massive red flags. Let me know what you think
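
The incremental summing described above can be sketched as follows. This is an illustration under assumptions, not the spec's algorithm: it uses the field name `isIncrementalSafe` from the PR title, tracks only `numFiles` and `tableSizeBytes`, and expects each action as a parsed dict. `apply_commit` is a hypothetical helper.

```python
from typing import Iterable, Optional, Tuple


def apply_commit(
    num_files: int, size_bytes: int, actions: Iterable[dict]
) -> Optional[Tuple[int, int]]:
    """Advance (numFiles, tableSizeBytes) by one commit.

    Returns None when the commit does not assert isIncrementalSafe=true,
    in which case the caller has to fall back to a full recomputation.
    """
    actions = list(actions)
    safe = any(
        a.get("commitInfo", {}).get("isIncrementalSafe") is True for a in actions
    )
    if not safe:
        return None
    for a in actions:
        if "add" in a:
            num_files += 1
            size_bytes += a["add"]["size"]
        elif "remove" in a:
            num_files -= 1
            size_bytes -= a["remove"]["size"]
    return num_files, size_bytes
```

The starting totals can come from a Version Checksum or from summing the files in a checkpoint, which is why the sketch takes them as plain arguments.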

Collaborator

> I actually tried to avoid this because technically you could not have a CRC, and sum up files in a checkpoint, and then incrementally sum up delta files, to produce the table-level stats for the table. No CRCs needed in this scenario.

I guess the main argument for CRC in the name is that the newly required fields are picked exactly for CRC purposes. numRecords as an example would be another field that I'm sure some people would care about doing incremental updates on but still don't have strong guarantees with this new field IIUC.

> Thus, so far, delta.isIncrementalSafe seems like a correct and generalized property name with no massive red flags. Let me know what you think

Ultimately it is a little bit of bikeshedding, so I'll yield to your preference. I think the only remaining question is what happens if we change the list of fields in CRC (e.g. have a new CRC writer feature or something): would this flag's definition change, or would we add a different flag to capture this?

Collaborator

@emkornfield emkornfield left a comment

A few questions/comments.

3 participants