[PROTOCOL] Add delta.parquet.compression.codec property to protocol #6324
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,39 @@ | ||
| # Parquet Compression Codec | ||
| **Associated GitHub issue for discussions:** https://github.com/delta-io/delta/issues/6323 | ||
|
|
||
|
|
||
| Delta Lake tables store data in Parquet files, and Parquet supports multiple compression codecs. Currently, the Delta protocol does not formally document how compression is specified or which codecs are supported. This RFC introduces the `delta.parquet.compression.codec` table property to persistently configure the compression codec used for new Parquet files. | ||
|
|
||
| -------- | ||
|
|
||
| > ***New top-level Section just before [Appendix](#appendix)*** | ||
|
|
||
| # Table Properties | ||
|
|
||
| Delta Lake tables support a set of properties stored in the `configuration` field of the `metaData` action that control various aspects of table behavior. | ||
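As a rough illustration of where the property lives, the Python sketch below pulls the value out of a single JSON log line. The helper name `codec_from_log_line` and the trimmed example action are hypothetical; a real `metaData` action carries more fields (schema, partition columns, and so on).

```python
import json
from typing import Optional

PROPERTY = "delta.parquet.compression.codec"


def codec_from_log_line(line: str) -> Optional[str]:
    """Return the raw property value from a metaData action, or None."""
    action = json.loads(line)
    meta = action.get("metaData")
    if meta is None:
        # Not a metaData action (for example an add or remove action).
        return None
    # `configuration` is a string-to-string map; treat a missing map as empty.
    return (meta.get("configuration") or {}).get(PROPERTY)


# Hypothetical, heavily trimmed metaData action used only for illustration.
example = json.dumps(
    {"metaData": {"id": "example-table-id", "configuration": {PROPERTY: "zstd"}}}
)
print(codec_from_log_line(example))
```

The value is returned exactly as stored; matching it against the supported values is a separate step.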
|
|
||
| ## Overview | ||
|
|
||
| Property | Description | ||
| -|- | ||
| [`delta.parquet.compression.codec`](#deltaparquetcompressioncodec) | Compression codec for new Parquet data and checkpoint files | ||
|
|
||
| ## Property Details | ||
|
|
||
| ### delta.parquet.compression.codec | ||
|
Contributor
nit: This is a title. Will each property be a title, without a table? I am not sure how the list of properties will look. Not a blocker for merging the RFC; we can refactor it when merging into the protocol as well, but I suggest following standards eventually.
|
||
|
|
||
| Specifies the compression codec writers SHOULD use when writing new Parquet data and checkpoint files. Changing this property does not affect existing files; a table may contain files written with different codecs, which is a normal and expected state. | ||
|
|
||
| Supported values (matched case-insensitively): | ||
|
Collaborator
Should we clarify that this is a best-effort list? We don't definitively state the exhaustive list of supported values. In other words: is it VALID for a Delta table to have a DIFFERENT value?
Collaborator
Author
I think this is covered below under the writer requirements?
Collaborator
Somewhat?
Is there a simple sentence or clause we can add that clears this up and removes the ambiguity?
|
||
|
|
||
| Value | Description | ||
| -|- | ||
| `uncompressed` or `none` | No compression | ||
| `snappy` | Snappy compression (recommended default) | ||
| `gzip` | GZIP compression | ||
| `lz4` | (Deprecated) LZ4 compression (Hadoop framing). For backwards compatibility only. | ||
|
|
||
| `lz4_raw` | [LZ4 compression](https://parquet.apache.org/docs/file-format/data-pages/compression/#lz4_raw) based on the LZ4 block compression format. | ||
|
|
||
| `zstd` | Zstandard compression | ||
|
|
||
| When the property is absent, writers SHOULD default to `zstd`. If a writer does not support the specified codec, it SHOULD abort with an appropriate error or fall back to a default codec. | ||
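One way a writer could implement the resolution rules above is sketched here in Python. The names `resolve_codec`, `UnsupportedCodecError`, `writer_supported`, and the `fallback` parameter are assumptions made for illustration, not part of the RFC. The `zstd` default is taken from the sentence above and is left overridable.

```python
from typing import AbstractSet, Mapping, Optional

PROPERTY = "delta.parquet.compression.codec"
# `none` is listed as an alternative spelling of `uncompressed`.
ALIASES = {"none": "uncompressed"}


class UnsupportedCodecError(ValueError):
    """Raised when the configured codec is unavailable and no fallback is set."""


def resolve_codec(
    configuration: Mapping[str, str],
    writer_supported: AbstractSet[str],
    default: str = "zstd",
    fallback: Optional[str] = None,
) -> str:
    """Pick the codec for new files from the table's `configuration` map."""
    raw = configuration.get(PROPERTY)
    if raw is None:
        return default  # property absent
    name = raw.lower()  # values are matched case-insensitively
    codec = ALIASES.get(name, name)
    if codec in writer_supported:
        return codec
    if fallback is not None:
        return fallback  # writer chose to fall back to a default codec
    # Otherwise abort with an error that names the offending value.
    raise UnsupportedCodecError(f"{PROPERTY}={raw!r} is not supported by this writer")
```

For example, `resolve_codec({PROPERTY: "SNAPPY"}, {"snappy", "zstd"})` returns `"snappy"`, while an unsupported value raises unless the caller passes a `fallback`.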
|
|
||
|
|
||
| Readers SHOULD be able to read Parquet files compressed with any of the supported codecs, regardless of the current table property value. In some cases, Parquet files might have been written with codecs that [Parquet supports](https://parquet.apache.org/docs/file-format/data-pages/compression/) but that are not in the list above; readers MAY support reading these files. | ||
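The reader-side rule depends on the codec recorded in each file, not on the table property. A minimal Python sketch of that distinction follows. `reader_obligation` is a hypothetical helper, and its input is assumed to be the codec name taken from a Parquet file's column chunk metadata (for example `ZSTD` or `BROTLI`).

```python
# Codecs from the supported-values table, with `none` folded into `uncompressed`.
LISTED_CODECS = frozenset(
    {"uncompressed", "snappy", "gzip", "lz4", "lz4_raw", "zstd"}
)


def reader_obligation(file_codec: str) -> str:
    """Map the codec recorded in a Parquet file to the requirement level above."""
    # The current table property value plays no part in this decision.
    return "SHOULD" if file_codec.lower() in LISTED_CODECS else "MAY"
```

Under these assumptions, a file compressed with `BROTLI` falls under the MAY clause even though Parquet itself supports that codec.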
|
Collaborator
been written *with codecs that ...
|
||
Hey @emkornfield ! I think for something like this you can just make a PR directly against PROTOCOL.md.
WDYT @tdas ?
I think so too. If it's not a breaking change, just fully backward-compatible improvements, then we can add it to the protocol directly.