-
Notifications
You must be signed in to change notification settings - Fork 2.1k
[Protocol] Add delta.writePartitionColumnsToParquet to tblproperties #6545
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
itsbilal
wants to merge
2
commits into
delta-io:master
from
itsbilal:protocol-writepartcols-property
Closed
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1/ I don't see a section which explains how this relates to https://github.com/delta-io/delta/blob/master/protocol_rfcs/materialize-partition-columns.md
2/ I think this needs to be structured closer to the section which talks about serializing partition values or in a separate section with the serialization section linking to the new section.
Its hard to read this and understand what the algorithm/ behavior would be on the Kernel/ connection side? Maybe have a truth table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does reference writer features that govern materialization of partition values, which are multiple (
IcebergCompatV1, IcebergCompatV2, IcebergCompatV3, IcebergWriterCompatV1, IcebergWriterCompatV3, materializePartitionColumns), and the writer features are treated as one set that are prioritized over this property. I don't see why we should talk about one writer feature in isolation.The protocol doesn't spell out materialization of partition values at all anywhere else, with the exception of the above-listed writer features. We could do it if we want but we would then need to decide if there's a default stance the writers should take.
For this property, should I just reword it to "this property only governs materialization of partition columns if no writer feature requiring materialization of partition columns is enabled on the table" - that's the same point as the existing text with fewer words, and still leaves enough ambiguity for either default true / default false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1/ We might not need to talk about one feature in isolation. But materialization of partition values is an entire functional area. I think we need a top level section covering all features for which we materialize partition values in addition to https://github.com/sanujbasu/delta/blob/e4c1fa465d51a02355b78366151818ffaee342a7/PROTOCOL.md#partition-value-serialization.
2/ "this property only governs materialization of partition columns if no writer feature requiring materialization of partition columns is enabled on the table" is good phrasing but it hard to understand why "materializePartitionColumns" alone doesnt address that use case since its enabled by default?
Also there is no mention of materializePartitionColumns in the protocol.md? Everyone ive discussed this with is confused about why we need the toggle if we already have a feature? If the toggle is to add the materializePartitionColumns writer feature to the list, why not just add the feature to the vector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That one is about how they're serialized in AddFile though. In parquet we don't always use the string type, it's just the same data type that's used if it were any other column.
I've made some references to the optionality of the presence of partition values in data files in this PR now, let me know how that sounds. I'm wary of writing up the whole decision tree in protocol because not all writers will be expected to support iceberg compat, etc., and the table property spec makes it pretty clear that the toggle is best-effort. Really the only thing the protocol should enforce is "readers should understand both states, writers are sometimes required to write partition values".
materializePartitionColumnsis not in the protocol because it's just an RFC that hasn't been merged yet. (funny enough, we did merge it into the protocol... and then reverted it because it hadn't baked as an RFC for long enough yet)Features block writes if they're present in a protocol and a writer doesn't understand them. We want the ability to materialize partition columns without blocking writers, especially since this is a new writer feature and we don't want to block writers all the time but rather we just want a best-effort way to require writers to materialize partition values.
If you're suggesting that we go ahead and default to writing partition values all the time, I'm supportive of a change like that. I just think that in that reality we would still need the ability to configure that behaviour, and for that we'd need a table property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on an offline conversation, we decided to move this change to the materialize partition columns RFC, along with a detailed description of when and when not to materialize partition columns. Opened a separate PR for that: #6738