[Protocol] Add delta.writePartitionColumnsToParquet to tblproperties#6545
[Protocol] Add delta.writePartitionColumnsToParquet to tblproperties#6545itsbilal wants to merge 2 commits into
Conversation
This change adds a definition for the delta.writePartitionColumnsToParquet table property to the list of table properties. This property controls the materialization of partition columns in data parquet files, in the absence of any writer features that would require materialization of partition columns (and therefore take precedence over this property).
29b70e9 to
3877bbb
Compare
| -|-|- | ||
| `delta.parquet.compression.codec` | Compression codec writers SHOULD use for new Parquet data and checkpoint files. Changing this property does not affect existing files; a table may contain files written with different codecs, which is a normal and expected state. | Widely supported values (matched case-insensitively): `uncompressed`/`none` (no compression), `snappy`, `gzip`, `lz4` (deprecated, Hadoop framing), `lz4_raw` ([LZ4 block format](https://parquet.apache.org/docs/file-format/data-pages/compression/#lz4_raw)), `zstd`.<br><br>When absent, writers SHOULD default to `zstd`. If a writer does not support or recognize the specified codec, it SHOULD abort with an appropriate error or fall back to a default codec.<br><br>Readers SHOULD support all codecs listed above regardless of the current property value. Parquet files written with other [parquet-supported codecs](https://parquet.apache.org/docs/file-format/data-pages/compression/) may also exist; readers MAY support reading these files. | ||
| `delta.parquet.format.version` | Parquet data page format writers SHOULD use for new data files. This property is a directive to writers only; readers do not need to consult it, as Parquet pages are self-describing via the `PageType` field in each page header. Changing this property does not affect existing files; a table MAY contain files written with different data page versions, which is a normal and expected state. | Valid values: `1.0.0` (DataPageV1) and `2.x.x` (DataPageV2, where `x.x` is any minor.patch version). Recommended values are `1.0.0` and `2.12.0`.<br><br>When absent, writers SHOULD default to `1.0.0`. Writers SHOULD validate this property and abort if the value does not match `1.0.0` or `2.MINOR.PATCH`.<br><br>Readers SHOULD support both DataPageV1 and DataPageV2 pages regardless of this property's value. Tables intended for access by engines beyond the Delta Lake connectors SHOULD use `1.0.0`, as DataPageV2 support varies across the broader Parquet ecosystem. | ||
| `delta.writePartitionColumnsToParquet` | Controls whether writers SHOULD write partition columns in newly written data parquet files, in the absence of any writer features that necessitate writing of partition columns (eg. `IcebergCompatV1`). In other words, if no writer feature is enabled that requires materialization of partition columns, writers should read this property to decide whether to materialize partition columns in data parquet files or not. Writer features requirements take precedence over this property's value. Readers should continue to read partition values off of AddFile actions, regardless of the presence of partition values in data files. File-level statistics should not be present for partition columns in partitioned tables in any case. This setting does not apply to writers of files of any other file format. | Boolean field, with valid values `false` and `true`. |
There was a problem hiding this comment.
1/ I don't see a section which explains how this relates to https://github.com/delta-io/delta/blob/master/protocol_rfcs/materialize-partition-columns.md
2/ I think this needs to be structured closer to the section which talks about serializing partition values or in a separate section with the serialization section linking to the new section.
Its hard to read this and understand what the algorithm/ behavior would be on the Kernel/ connection side? Maybe have a truth table?
There was a problem hiding this comment.
It does reference writer features that govern materialization of partition values, which are multiple (IcebergCompatV1, IcebergCompatV2, IcebergCompatV3, IcebergWriterCompatV1, IcebergWriterCompatV3, materializePartitionColumns), and the writer features are treated as one set that are prioritized over this property. I don't see why we should talk about one writer feature in isolation.
2/ I think this needs to be structured closer to the section which talks about serializing partition values or in a separate section with the serialization section linking to the new section.
The protocol doesn't spell out materialization of partition values at all anywhere else, with the exception of the above-listed writer features. We could do it if we want but we would then need to decide if there's a default stance the writers should take.
Its hard to read this and understand what the algorithm/ behavior would be on the Kernel/ connection side? Maybe have a truth table?
For this property, should I just reword it to "this property only governs materialization of partition columns if no writer feature requiring materialization of partition columns is enabled on the table" - that's the same point as the existing text with fewer words, and still leaves enough ambiguity for either default true / default false.
There was a problem hiding this comment.
1/ We might not need to talk about one feature in isolation. But materialization of partition values is an entire functional area. I think we need a top level section covering all features for which we materialize partition values in addition to https://github.com/sanujbasu/delta/blob/e4c1fa465d51a02355b78366151818ffaee342a7/PROTOCOL.md#partition-value-serialization.
2/ "this property only governs materialization of partition columns if no writer feature requiring materialization of partition columns is enabled on the table" is good phrasing but it hard to understand why "materializePartitionColumns" alone doesnt address that use case since its enabled by default?
Also there is no mention of materializePartitionColumns in the protocol.md? Everyone ive discussed this with is confused about why we need the toggle if we already have a feature? If the toggle is to add the materializePartitionColumns writer feature to the list, why not just add the feature to the vector?
There was a problem hiding this comment.
1/ We might not need to talk about one feature in isolation. But materialization of partition values is an entire functional area. I think we need a top level section covering all features for which we materialize partition values in addition to https://github.com/sanujbasu/delta/blob/e4c1fa465d51a02355b78366151818ffaee342a7/PROTOCOL.md#partition-value-serialization.
That one is about how they're serialized in AddFile though. In parquet we don't always use the string type, it's just the same data type that's used if it were any other column.
I've made some references to the optionality of the presence of partition values in data files in this PR now, let me know how that sounds. I'm wary of writing up the whole decision tree in protocol because not all writers will be expected to support iceberg compat, etc., and the table property spec makes it pretty clear that the toggle is best-effort. Really the only thing the protocol should enforce is "readers should understand both states, writers are sometimes required to write partition values".
2/ "this property only governs materialization of partition columns if no writer feature requiring materialization of partition columns is enabled on the table" is good phrasing but it hard to understand why "materializePartitionColumns" alone doesnt address that use case since its enabled by default?
Also there is no mention of materializePartitionColumns in the protocol.md? Everyone ive discussed this with is confused about why we need the toggle if we already have a feature? If the toggle is to add the materializePartitionColumns writer feature to the list, why not just add the feature to the vector?
materializePartitionColumns is not in the protocol because it's just an RFC that hasn't been merged yet. (funny enough, we did merge it into the protocol... and then reverted it because it hadn't baked as an RFC for long enough yet)
Everyone ive discussed this with is confused about why we need the toggle if we already have a feature?
Features block writes if they're present in a protocol and a writer doesn't understand them. We want the ability to materialize partition columns without blocking writers, especially since this is a new writer feature and we don't want to block writers all the time but rather we just want a best-effort way to require writers to materialize partition values.
If you're suggesting that we go ahead and default to writing partition values all the time, I'm supportive of a change like that. I just think that in that reality we would still need the ability to configure that behaviour, and for that we'd need a table property.
There was a problem hiding this comment.
Based on an offline conversation, we decided to move this change to the materialize partition columns RFC, along with a detailed description of when and when not to materialize partition columns. Opened a separate PR for that: #6738
Skipping RFC process as it's a backward-compatible, best-effort change (see #6324 for precedence)
Which Delta project/connector is this regarding?
Description
This change adds a definition for the
delta.writePartitionColumnsToParquet table property to the list of table properties. This property controls the materialization of partition columns in data parquet files, in the absence of any writer features that would require materialization of partition columns (and therefore take precedence over this property).
How was this patch tested?
N/a
Does this PR introduce any user-facing changes?
No