[PROTOCOL] Add commitInfo.isIncrementalSafe spec#6798
Conversation
commitInfo.isIncrementalSafe spec
scovich
left a comment
There was a problem hiding this comment.
Ah, this is nice: We've seen buggy clients that add the same file twice in separate INSERT commands, which not only violates (1) but also is unlikely to give the expected outcome (SELECT * returns the same result before and after the INSERT). With this flag, we at least would have a statement of intent that could help identify commits which claimed to be inserting fresh files but which actually re-inserted existing files.
|
|
||
| When [In-Commit Timestamps](#in-commit-timestamps) are enabled, writers are required to include a `commitInfo` action with every commit, which must include the `inCommitTimestamp` field. Also, the `commitInfo` action must be first action in the commit. | ||
|
|
||
| The `commitInfo` action may include an optional boolean field `isIncrementalSafe`. When `true`, the writer asserts that this commit is incrementally safe: its effect on any aggregate derived from the log (e.g. those recorded in a [Version Checksum](#version-checksum-file)) can be computed from this commit's own `add` and `remove` actions alone. For example, given the Version Checksum at version `N`, a reader can derive `numFiles`, `tableSizeBytes`, and `fileSizeHistogram` at any later version by iteratively applying each subsequent commit's `add.size` and `remove.size`, provided every such commit asserts `isIncrementalSafe=true`. Specifically, the writer guarantees: |
There was a problem hiding this comment.
If we are allowing any fields to be added here, should we start naming fields that delta specifically mandates with a prefix like "delta."?
There was a problem hiding this comment.
I like that idea! delta.isIncrementalSafe
There was a problem hiding this comment.
These are actual json fields. I guess we could choose delta.whatever as a field name, tho that complicates json parsing and such because of the special character. Unless you mean that delta is a spec-mandated object with spec-mandated fields?
There was a problem hiding this comment.
Why does having a key with a period make the JSON parsing more difficult? Its still just string from the the parsers perspective (I don't think "." requires escaping)?
There was a problem hiding this comment.
Yeah +1 my understanding is that these are just JSON keys
{
"user.name": "alice",
"config.db.host": "localhost"
}and that . inside of the key has no impact on parsing, right?
There was a problem hiding this comment.
JSON doesn't care. SQL engines that invoke said parsing do care. In theory they can use various escaping mechanisms to handle the fact that the . is not a field separator.
There was a problem hiding this comment.
And it also breaks serde to strongly typed objects (delta-spark and kernel-rs both do this). In theory field aliasing trickery can compensate ("parse delta.foo as delta_foo" or similar).
There was a problem hiding this comment.
makes sense so maybe just no dot, and have it be part of the full name? deltaIsIncrementalSafe?
|
|
||
| When [In-Commit Timestamps](#in-commit-timestamps) are enabled, writers are required to include a `commitInfo` action with every commit, which must include the `inCommitTimestamp` field. Also, the `commitInfo` action must be first action in the commit. | ||
|
|
||
| The `commitInfo` action may include an optional boolean field `isIncrementalSafe`. When `true`, the writer asserts that this commit is incrementally safe: its effect on any aggregate derived from the log (e.g. those recorded in a [Version Checksum](#version-checksum-file)) can be computed from this commit's own `add` and `remove` actions alone. For example, given the Version Checksum at version `N`, a reader can derive `numFiles`, `tableSizeBytes`, and `fileSizeHistogram` at any later version by iteratively applying each subsequent commit's `add.size` and `remove.size`, provided every such commit asserts `isIncrementalSafe=true`. Specifically, the writer guarantees: |
There was a problem hiding this comment.
do we also want to mandate stats numRecords is present?
There was a problem hiding this comment.
I guess this isn't captured in CRC files today so maybe not necessary.
There was a problem hiding this comment.
per-column stats (numRecords) is different from table-level stats (add.size) -- IMO okay to exclude numRecords?
TableFeatures that depend on this should require that themselves
|
|
||
| When [In-Commit Timestamps](#in-commit-timestamps) are enabled, writers are required to include a `commitInfo` action with every commit, which must include the `inCommitTimestamp` field. Also, the `commitInfo` action must be first action in the commit. | ||
|
|
||
| The `commitInfo` action may include an optional boolean field `isIncrementalSafe`. When `true`, the writer asserts that this commit is incrementally safe: its effect on any aggregate derived from the log (e.g. those recorded in a [Version Checksum](#version-checksum-file)) can be computed from this commit's own `add` and `remove` actions alone. For example, given the Version Checksum at version `N`, a reader can derive `numFiles`, `tableSizeBytes`, and `fileSizeHistogram` at any later version by iteratively applying each subsequent commit's `add.size` and `remove.size`, provided every such commit asserts `isIncrementalSafe=true`. Specifically, the writer guarantees: |
There was a problem hiding this comment.
I think in the future we are going to need actual back-references attached to actions (for example introduced the an adaptive metadata tree was recently discuss in a community sync).
This will cause the information to be stored at most two level commit and file. Are we OK with this?
There was a problem hiding this comment.
This will cause the information to be stored at most two level commit and file. Are we OK with this?
Sorry @emkornfield -- I didn't get the impact of this line against the proposal in this PR. Can you elaborate? either (a) how does your comment impact this change or (b) how does my change proposed here impact what you commented about?
There was a problem hiding this comment.
Sorry @emkornfield -- I didn't get the impact of this line against the proposal in this PR. Can you elaborate? either (a) how does your comment impact this change or (b) how does my change proposed here impact what you commented about?
Sorry for not being clear. I think the TL;DR; is did you the consider the option doing something at the action level. If writers know they are replacing files without a corresponding remove in the commit, they could presumably mark the action as "duplicate or replace" or something similar, so an incremental processor could potentially just skip those actions. There are likely a lot of factors here, so it might not be viable. But at least for AMT we will likely need file action level metadata that might be somewhat redundant with the information this commit level metadata summarizes
There was a problem hiding this comment.
If writers know they are replacing files without a corresponding remove in the commit, they could presumably mark the action as "duplicate or replace" or something similar, so an incremental processor could potentially just skip those actions.
Ah, I see! This is a great callout. This "replace" would apply to the entire commit, yes? I wonder if we would ever need per-file-action granularity.. but seems like delta.incrementalOp=true/false/skip(replace) could be a direction worth exploring
|
|
||
| When [In-Commit Timestamps](#in-commit-timestamps) are enabled, writers are required to include a `commitInfo` action with every commit, which must include the `inCommitTimestamp` field. Also, the `commitInfo` action must be first action in the commit. | ||
|
|
||
| The `commitInfo` action may include an optional boolean field `isIncrementalSafe`. When `true`, the writer asserts that this commit is incrementally safe: its effect on any aggregate derived from the log (e.g. those recorded in a [Version Checksum](#version-checksum-file)) can be computed from this commit's own `add` and `remove` actions alone. For example, given the Version Checksum at version `N`, a reader can derive `numFiles`, `tableSizeBytes`, and `fileSizeHistogram` at any later version by iteratively applying each subsequent commit's `add.size` and `remove.size`, provided every such commit asserts `isIncrementalSafe=true`. Specifically, the writer guarantees: |
There was a problem hiding this comment.
thinking out loud on naming. We are really asserting a few things with this flag:
- action reconciliation is not needed for this file.
- remove actions must reference a file in the table (I guess this was never called out specifically before).
- Some optional fields are guaranteed to be written here (driven by the requirements of CRC).
would something specifically referencing CRC make sense
(e.g. crcFileIncrementalSafe) if it wasn't for the the updated requirements on fields it seems like naming this something like (noExternalReconcilationNeeded) might better name the feature.
There was a problem hiding this comment.
would something specifically referencing CRC make sense
I actually tried to avoid this because technically you could not have a CRC, and sum up files in a checkpoint, and then incrementally sum up delta files, to produce the table-level stats for the table. No CRCs needed in this scenario.
Thus, so far, delta.isIncrementalSafe seems like a correct and generalized property name with no massive red flags. Let me know what you think
There was a problem hiding this comment.
I actually tried to avoid this because technically you could not have a CRC, and sum up files in a checkpoint, and then incrementally sum up delta files, to produce the table-level stats for the table. No CRCs needed in this scenario.
I guess the main argument for CRC in the name is that the newly required fields are picked exactly for CRC purposes. numRecords as an example would be another field that I'm sure some people would care about doing incremental updates on but still don't have strong guarantees with this new field IIUC.
Thus, so far, delta.isIncrementalSafe seems like a correct and generalized property name with no massive red flags. Let me know what you think
Ultimately it is a little bit of bikeshedding, so I'll yield to your preference. I think the only question would become what happens if we change the list of fields in CDC (e.g. have a new CDC writer feature or something), would this flag's definition change? Or would we add a different flag to capture this?
emkornfield
left a comment
There was a problem hiding this comment.
A few questions/comments.
Adds
commitInfo.isIncrementalSafeboolean to the spec and describes exactly when it is can be set, what it achieves, and how it is used. This is a new optional field. No known existing Delta writers are currently writing this today (trino, duckdb, delta-rs, flink, spark, etc.) -- I don't expect any conflicts.While I'm there, also clarify that commitInfo must be json object and not just any json data / value (such as an array). This is just a tiny oversight in the existing spec -- no known Delta writers are actually writing an array data instead of object for commitInfo.
Once this PR is merged we should update existing writers, who already confirm/deduce if a commit is incremental-safe, to emit this field.