[PROTOCOL] Add delta.parquet.compression.codec property to protocol#6324
Conversation
|
Offline feedback around proposal:
|
| @@ -0,0 +1,38 @@ | |||
| # Parquet Compression Codec | |||
There was a problem hiding this comment.
Hey @emkornfield ! I think for something like this you can just make a PR directly against PROTOCOL.md.
WDYT @tdas ?
There was a problem hiding this comment.
i think so too. if its not a breaking change, just fully backward compatible improvements, then we can just add it to the protocol directly.
|
|
||
| Specifies the compression codec writers SHOULD use when writing new Parquet data and checkpoint files. Changing this property does not affect existing files; a table may contain files written with different codecs, which is a normal and expected state. | ||
|
|
||
| Supported values (matched case-insensitively): |
There was a problem hiding this comment.
Should we clarify that this is a best-effort list? We don't definitively state the exhaustive list of supported values?
In other words: is it VALID for a Delta table to have a DIFFERENT value?
There was a problem hiding this comment.
I think this is covered below on writer requirements?
There was a problem hiding this comment.
Somewhat?
- here we state a list of "supported values"
- below it states "If a writer does not support the specified codec, it SHOULD abort with an appropriate error or fall back to a default codec."
- that doesn't answer the question of: is it perfictly fine to use "foofoobarbar" as a codec value? there is "what a writer supports" and there is "is there the concept of an unsupported value"
Is there a simple sentence or clause we can add that clears this up? removes the ambiguity?
Co-authored-by: emkornfield <emkornfield@gmail.com>
|
|
||
| When the property is absent, writers SHOULD default to `zstd`. If a writer does not support or recognize the specified codec, it SHOULD abort with an appropriate error or fall back to a default codec. | ||
|
|
||
| Readers SHOULD be able to read parquet files compressed with any of the supported codecs, regardless of the current table property value. In some cases parquet files might have been written codecs that [parquet supports](https://parquet.apache.org/docs/file-format/data-pages/compression/) that are not in the list above, readers MAY support reading these files. |
There was a problem hiding this comment.
been written *with codecs that ...
Co-authored-by: emkornfield <emkornfield@gmail.com>
|
|
||
| ## Property Details | ||
|
|
||
| ### delta.parquet.compression.codec |
There was a problem hiding this comment.
nit: this is a title. each property will be a title?
shouldnt we be make this into a table. most projects i know defines properties as a table
https://spark.apache.org/docs/latest/configuration.html
https://iceberg.apache.org/docs/latest/configuration/
without a table .. i am not sure how the list of properties will look like
not a blocker for merging the RFC. we can refactor it when merging into the protocol as well. but i suggest following standards eventually.
…ornfield/delta into rfc_for_compression_setting
…elta-io#6324) #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [x] Other (protocol) ## Description Add a proposal RFC to document parquet compression. ## How was this patch tested? N/A ## Does this PR introduce _any_ user-facing changes? No --------- Co-authored-by: Scott Sandre <scott.sandre@databricks.com>
|
fix #6323? |
|
@emkornfield, @scovich, @scottsand-db, @tdas, Spark uses Snappy by default. Should we make changes in Spark-Delta to use ZSTD by default and align with new spec? |
|
@felipepessoto yes that seems reasonable to me -- wdyt @emkornfield ? |
Yes, this seems reasonable to me. |
|
@emkornfield, @scottsand-db are you doing it or should I send a PR? |
If you have bandwidth a PR would be appreciated. |
Which Delta project/connector is this regarding?
Description
Add a proposal RFC to document parquet compression.
How was this patch tested?
N/A
Does this PR introduce any user-facing changes?
No