Merged
Changes from 12 commits
39 changes: 39 additions & 0 deletions protocol_rfcs/parquet-compression-codec.md
@@ -0,0 +1,39 @@
# Parquet Compression Codec
Collaborator

Hey @emkornfield! I think for something like this you can just make a PR directly against PROTOCOL.md.

WDYT @tdas?

Contributor

I think so too. If it's not a breaking change, just fully backward-compatible improvements, then we can add it to the protocol directly.

**Associated GitHub issue for discussions: https://github.com/delta-io/delta/issues/6323**

Delta Lake tables store data in Parquet files, and Parquet supports multiple compression codecs. Currently, the Delta protocol does not formally document how compression is specified or which codecs are supported. This RFC introduces the `delta.parquet.compression.codec` table property to persistently configure the compression codec used for new Parquet files.

--------

> ***New top-level Section just before [Appendix](#appendix)***

# Table Properties

Delta Lake tables support a set of properties stored in the `configuration` field of the `metaData` action that control various aspects of table behavior.
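As a rough illustration of where such a property lives, the sketch below builds a trimmed-down `metaData` action and pulls out its `configuration` map. The `id` and `schemaString` values are placeholders, and `table_properties` is a hypothetical helper, not part of any Delta library.

```python
import json

# A trimmed-down metaData action as it might appear on one line of a Delta
# log commit file. The id and schemaString values are placeholders.
log_line = json.dumps({
    "metaData": {
        "id": "00000000-0000-0000-0000-000000000000",
        "format": {"provider": "parquet", "options": {}},
        "schemaString": '{"type":"struct","fields":[]}',
        "partitionColumns": [],
        "configuration": {"delta.parquet.compression.codec": "zstd"},
    }
})


def table_properties(line: str) -> dict:
    """Return the configuration map of a metaData action, or {} for other actions."""
    action = json.loads(line)
    return dict(action.get("metaData", {}).get("configuration", {}))


props = table_properties(log_line)
print(props.get("delta.parquet.compression.codec"))
```

Table properties are plain string-to-string entries, so a reader or writer only needs the latest `metaData` action to learn the current setting.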

## Overview

Property | Description
-|-
[`delta.parquet.compression.codec`](#deltaparquetcompressioncodec) | Compression codec for new Parquet data and checkpoint files

## Property Details

### delta.parquet.compression.codec
Contributor

nit: this is a title. Will each property be a title?
Shouldn't we make this into a table? Most projects I know define properties as a table:
https://spark.apache.org/docs/latest/configuration.html
https://iceberg.apache.org/docs/latest/configuration/

Without a table, I am not sure what the list of properties will look like.

Not a blocker for merging the RFC. We can refactor it when merging into the protocol as well, but I suggest following standards eventually.


Specifies the compression codec writers SHOULD use when writing new Parquet data and checkpoint files. Changing this property does not affect existing files; a table may contain files written with different codecs, which is a normal and expected state.

Widely supported values (matched case-insensitively):

Value | Description
-|-
`uncompressed` or `none` | No compression
`snappy` | Snappy compression (recommended default)
`gzip` | GZIP compression
`lz4` | (Deprecated) LZ4 compression (Hadoop framing). For backwards compatibility only.
`lz4_raw` | [LZ4 compression](https://parquet.apache.org/docs/file-format/data-pages/compression/#lz4_raw) based on the LZ4 block compression format.
`zstd` | Zstandard compression
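
The case-insensitive matching of the values above could be sketched as follows. The names `WIDELY_SUPPORTED` and `normalize_codec` are hypothetical, and folding `none` into `uncompressed` is an assumption based on the two being listed as the same entry.

```python
# Values from the table above, in canonical lowercase form.
WIDELY_SUPPORTED = {"uncompressed", "snappy", "gzip", "lz4", "lz4_raw", "zstd"}

# Assumption: "none" is treated as an alias of "uncompressed".
_ALIASES = {"none": "uncompressed"}


def normalize_codec(value: str) -> str:
    """Lowercase a property value and resolve the "none" alias."""
    name = value.lower()
    return _ALIASES.get(name, name)


for raw in ("SNAPPY", "None", "Lz4_Raw", "brotli"):
    name = normalize_codec(raw)
    print(raw, "->", name, "(widely supported)" if name in WIDELY_SUPPORTED else "(not in list)")
```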

When the property is absent, writers SHOULD default to `snappy`. If a writer does not support or recognize the specified codec, it SHOULD abort with an appropriate error or fall back to a default codec.
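
A minimal sketch of this writer-side decision, assuming a hypothetical `resolve_writer_codec` helper: the default codec and the set of codecs the writer can produce are passed in by the caller, and a `strict` flag selects between the two permitted behaviors (abort or fall back).

```python
PROPERTY = "delta.parquet.compression.codec"


class UnsupportedCodecError(ValueError):
    """Raised when the table asks for a codec this writer cannot produce."""


def resolve_writer_codec(configuration, supported, default, strict=True):
    """Pick the codec for new files from a table's configuration map.

    configuration: the metaData action's configuration dict
    supported:     lowercase codec names this writer can produce
    default:       codec to use when the property is absent (or on fallback)
    strict:        True to abort on an unsupported codec, False to fall back
    """
    requested = configuration.get(PROPERTY)
    if requested is None:
        return default
    name = requested.lower()  # values are matched case-insensitively
    if name == "none":
        name = "uncompressed"
    if name in supported:
        return name
    if strict:
        raise UnsupportedCodecError(
            f"{PROPERTY}={requested!r} is not supported by this writer"
        )
    return default


writer_codecs = {"uncompressed", "snappy", "gzip", "zstd"}
print(resolve_writer_codec({PROPERTY: "ZSTD"}, writer_codecs, default="snappy"))
print(resolve_writer_codec({}, writer_codecs, default="snappy"))
```

Whether to abort or fall back is left to the writer; aborting surfaces misconfiguration early, while falling back keeps writes flowing at the cost of silently ignoring the property.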

Readers SHOULD be able to read Parquet files compressed with any of the supported codecs, regardless of the current table property value. Some Parquet files might have been written with codecs that [Parquet supports](https://parquet.apache.org/docs/file-format/data-pages/compression/) but that are not in the list above; readers MAY support reading these files.