-
Notifications
You must be signed in to change notification settings - Fork 2.1k
[PROTOCOL] Add commitInfo.isIncrementalSafe spec
#6798
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -782,10 +782,18 @@ A delta file can optionally contain additional provenance information about what | |
| When the `catalogManaged` table feature is enabled, the `commitInfo` action must have a field | ||
| `txnId` that stores a unique transaction identifier string. | ||
|
|
||
| Implementations are free to store any valid JSON-formatted data via the `commitInfo` action. | ||
| The `commitInfo` action must be a JSON object. Implementations are free to store any fields within that object; table features may require additional fields or impose semantics on specific fields. | ||
|
|
||
| When [In-Commit Timestamps](#in-commit-timestamps) are enabled, writers are required to include a `commitInfo` action with every commit, which must include the `inCommitTimestamp` field. Also, the `commitInfo` action must be first action in the commit. | ||
|
|
||
| The `commitInfo` action may include an optional boolean field `isIncrementalSafe`. When `true`, the writer asserts that this commit is incrementally safe: its effect on any aggregate derived from the log (e.g. those recorded in a [Version Checksum](#version-checksum-file)) can be computed from this commit's own `add` and `remove` actions alone. For example, given the Version Checksum at version `N`, a reader can derive `numFiles`, `tableSizeBytes`, and `fileSizeHistogram` at any later version by iteratively applying each subsequent commit's `add.size` and `remove.size`, provided every such commit asserts `isIncrementalSafe=true`. Specifically, the writer guarantees: | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do we also want to mandate stats numRecords is present?
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I guess this isn't captured in CRC files today so maybe not necessary.
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. per-column stats ( TableFeatures that depend on this should require that themselves
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think in the future we are going to need actual back-references attached to actions (for example introduced the an adaptive metadata tree was recently discuss in a community sync). This will cause the information to be stored at most two level commit and file. Are we OK with this?
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Sorry @emkornfield -- I didn't get the impact of this line against the proposal in this PR. Can you elaborate? either (a) how does your comment impact this change or (b) how does my change proposed here impact what you commented about?
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Sorry for not being clear. I think the TL;DR; is did you the consider the option doing something at the action level. If writers know they are replacing files without a corresponding remove in the commit, they could presumably mark the action as "duplicate or replace" or something similar, so an incremental processor could potentially just skip those actions. There are likely a lot of factors here, so it might not be viable. But at least for AMT we will likely need file action level metadata that might be somewhat redundant with the information this commit level metadata summarizes
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Ah, I see! This is a great callout. This "replace" would apply to the entire commit, yes? I wonder if we would ever need per-file-action granularity.. but seems like
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. thinking out loud on naming. We are really asserting a few things with this flag:
would something specifically referencing CRC make sense (e.g.
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I actually tried to avoid this because technically you could not have a CRC, and sum up files in a checkpoint, and then incrementally sum up delta files, to produce the table-level stats for the table. No CRCs needed in this scenario. Thus, so far,
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I guess the main argument for CRC in the name is that the newly required fields are picked exactly for CRC purposes. numRecords as an example would be another field that I'm sure some people would care about doing incremental updates on but still don't have strong guarantees with this new field IIUC.
Ultimately it is a little bit of bikeshedding, so I'll yield to your preference. I think the only question would become what happens if we change the list of fields in CDC (e.g. have a new CDC writer feature or something), would this flag's definition change? Or would we add a different flag to capture this? |
||
|
|
||
| 1. **No double adds.** No [logical file](#add-file-and-remove-file) added by this commit was present in the previous snapshot. | ||
| 2. **No double removes.** Every logical file removed by this commit was present in the previous snapshot. | ||
| 3. Every `remove` action in this commit includes the `size` field (which is otherwise [optional](#add-file-and-remove-file)). | ||
|
|
||
| In practice, ordinary DML and table-maintenance operations (e.g. `INSERT`, `UPDATE`, `DELETE`, `MERGE`, `OPTIMIZE`) naturally satisfy these guarantees. The canonical counter-example is a `COMPUTE STATISTICS`-style operation that recomputes file-level statistics by re-adding logical files already present in the table; such an operation violates guarantee (1). | ||
|
|
||
| An example of storing provenance information related to an `INSERT` operation: | ||
| ```json | ||
| { | ||
|
|
@@ -795,6 +803,7 @@ An example of storing provenance information related to an `INSERT` operation: | |
| "userName":"michael@databricks.com", | ||
| "operation":"INSERT", | ||
| "operationParameters":{"mode":"Append","partitionBy":"[]"}, | ||
| "isIncrementalSafe":true, | ||
| "notebook":{ | ||
| "notebookId":"4443029", | ||
| "notebookPath":"Users/michael@databricks.com/actions"}, | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are allowing any fields to be added here, should we start naming fields that delta specifically mandates with a prefix like "delta."?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that idea!
delta.isIncrementalSafeThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are actual json fields. I guess we could choose
delta.whateveras a field name, tho that complicates json parsing and such because of the special character. Unless you mean thatdeltais a spec-mandated object with spec-mandated fields?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does having a key with a period make the JSON parsing more difficult? Its still just string from the the parsers perspective (I don't think "." requires escaping)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah +1 my understanding is that these are just JSON keys
{ "user.name": "alice", "config.db.host": "localhost" }and that
.inside of the key has no impact on parsing, right?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JSON doesn't care. SQL engines that invoke said parsing do care. In theory they can use various escaping mechanisms to handle the fact that the
.is not a field separator.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And it also breaks serde to strongly typed objects (delta-spark and kernel-rs both do this). In theory field aliasing trickery can compensate ("parse
delta.fooasdelta_foo" or similar).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense so maybe just no dot, and have it be part of the full name?
deltaIsIncrementalSafe?