Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 10 additions & 1 deletion PROTOCOL.md
Original file line number Diff line number Diff line change
Expand Up @@ -782,10 +782,18 @@ A delta file can optionally contain additional provenance information about what
When the `catalogManaged` table feature is enabled, the `commitInfo` action must have a field
`txnId` that stores a unique transaction identifier string.

Implementations are free to store any valid JSON-formatted data via the `commitInfo` action.
The `commitInfo` action must be a JSON object. Implementations are free to store any fields within that object; table features may require additional fields or impose semantics on specific fields.

When [In-Commit Timestamps](#in-commit-timestamps) are enabled, writers are required to include a `commitInfo` action with every commit, which must include the `inCommitTimestamp` field. Also, the `commitInfo` action must be first action in the commit.

The `commitInfo` action may include an optional boolean field `isIncrementalSafe`. When `true`, the writer asserts that this commit is incrementally safe: its effect on any aggregate derived from the log (e.g. those recorded in a [Version Checksum](#version-checksum-file)) can be computed from this commit's own `add` and `remove` actions alone. For example, given the Version Checksum at version `N`, a reader can derive `numFiles`, `tableSizeBytes`, and `fileSizeHistogram` at any later version by iteratively applying each subsequent commit's `add.size` and `remove.size`, provided every such commit asserts `isIncrementalSafe=true`. Specifically, the writer guarantees:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are allowing any fields to be added here, should we start naming fields that delta specifically mandates with a prefix like "delta."?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea! delta.isIncrementalSafe

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are actual json fields. I guess we could choose delta.whatever as a field name, tho that complicates json parsing and such because of the special character. Unless you mean that delta is a spec-mandated object with spec-mandated fields?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does having a key with a period make the JSON parsing more difficult? Its still just string from the the parsers perspective (I don't think "." requires escaping)?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah +1 my understanding is that these are just JSON keys

{
  "user.name": "alice",
  "config.db.host": "localhost"
}

and that . inside of the key has no impact on parsing, right?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSON doesn't care. SQL engines that invoke said parsing do care. In theory they can use various escaping mechanisms to handle the fact that the . is not a field separator.

Copy link
Copy Markdown
Collaborator

@scovich scovich May 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it also breaks serde to strongly typed objects (delta-spark and kernel-rs both do this). In theory field aliasing trickery can compensate ("parse delta.foo as delta_foo" or similar).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense so maybe just no dot, and have it be part of the full name? deltaIsIncrementalSafe?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we also want to mandate stats numRecords is present?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this isn't captured in CRC files today so maybe not necessary.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per-column stats (numRecords) is different from table-level stats (add.size) -- IMO okay to exclude numRecords?

TableFeatures that depend on this should require that themselves

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in the future we are going to need actual back-references attached to actions (for example introduced the an adaptive metadata tree was recently discuss in a community sync).

This will cause the information to be stored at most two level commit and file. Are we OK with this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will cause the information to be stored at most two level commit and file. Are we OK with this?

Sorry @emkornfield -- I didn't get the impact of this line against the proposal in this PR. Can you elaborate? either (a) how does your comment impact this change or (b) how does my change proposed here impact what you commented about?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @emkornfield -- I didn't get the impact of this line against the proposal in this PR. Can you elaborate? either (a) how does your comment impact this change or (b) how does my change proposed here impact what you commented about?

Sorry for not being clear. I think the TL;DR; is did you the consider the option doing something at the action level. If writers know they are replacing files without a corresponding remove in the commit, they could presumably mark the action as "duplicate or replace" or something similar, so an incremental processor could potentially just skip those actions. There are likely a lot of factors here, so it might not be viable. But at least for AMT we will likely need file action level metadata that might be somewhat redundant with the information this commit level metadata summarizes

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If writers know they are replacing files without a corresponding remove in the commit, they could presumably mark the action as "duplicate or replace" or something similar, so an incremental processor could potentially just skip those actions.

Ah, I see! This is a great callout. This "replace" would apply to the entire commit, yes? I wonder if we would ever need per-file-action granularity.. but seems like delta.incrementalOp=true/false/skip(replace) could be a direction worth exploring

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thinking out loud on naming. We are really asserting a few things with this flag:

  1. action reconciliation is not needed for this file.
  2. remove actions must reference a file in the table (I guess this was never called out specifically before).
  3. Some optional fields are guaranteed to be written here (driven by the requirements of CRC).

would something specifically referencing CRC make sense

(e.g. crcFileIncrementalSafe) if it wasn't for the the updated requirements on fields it seems like naming this something like (noExternalReconcilationNeeded) might better name the feature.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would something specifically referencing CRC make sense

I actually tried to avoid this because technically you could not have a CRC, and sum up files in a checkpoint, and then incrementally sum up delta files, to produce the table-level stats for the table. No CRCs needed in this scenario.

Thus, so far, delta.isIncrementalSafe seems like a correct and generalized property name with no massive red flags. Let me know what you think

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually tried to avoid this because technically you could not have a CRC, and sum up files in a checkpoint, and then incrementally sum up delta files, to produce the table-level stats for the table. No CRCs needed in this scenario.

I guess the main argument for CRC in the name is that the newly required fields are picked exactly for CRC purposes. numRecords as an example would be another field that I'm sure some people would care about doing incremental updates on but still don't have strong guarantees with this new field IIUC.

Thus, so far, delta.isIncrementalSafe seems like a correct and generalized property name with no massive red flags. Let me know what you think

Ultimately it is a little bit of bikeshedding, so I'll yield to your preference. I think the only question would become what happens if we change the list of fields in CDC (e.g. have a new CDC writer feature or something), would this flag's definition change? Or would we add a different flag to capture this?


1. **No double adds.** No [logical file](#add-file-and-remove-file) added by this commit was present in the previous snapshot.
2. **No double removes.** Every logical file removed by this commit was present in the previous snapshot.
3. Every `remove` action in this commit includes the `size` field (which is otherwise [optional](#add-file-and-remove-file)).

In practice, ordinary DML and table-maintenance operations (e.g. `INSERT`, `UPDATE`, `DELETE`, `MERGE`, `OPTIMIZE`) naturally satisfy these guarantees. The canonical counter-example is a `COMPUTE STATISTICS`-style operation that recomputes file-level statistics by re-adding logical files already present in the table; such an operation violates guarantee (1).

An example of storing provenance information related to an `INSERT` operation:
```json
{
Expand All @@ -795,6 +803,7 @@ An example of storing provenance information related to an `INSERT` operation:
"userName":"michael@databricks.com",
"operation":"INSERT",
"operationParameters":{"mode":"Append","partitionBy":"[]"},
"isIncrementalSafe":true,
"notebook":{
"notebookId":"4443029",
"notebookPath":"Users/michael@databricks.com/actions"},
Expand Down
Loading