Releases: Eventual-Inc/parquet-format-safe

Patched parquet-format-safe

04 Feb 04:26

fix: Resolve mismatch with Thrift compact protocol

The [Thrift compact
protocol](https://github.com/apache/thrift/blob/master/doc/specs/thrift-compact-protocol.md)
is used for Parquet file metadata.
[parquet-format-safe](https://github.com/jorgecarleitao/parquet-format-safe)
and other Rust implementations of the protocol eagerly read
string/binary fields as UTF-8.

However, the protocol specifies that

> Strings are first encoded to UTF-8, and then send as binary

so strings and binaries are indistinguishable on the wire: without using
the schema to disambiguate the field type, a reader cannot know upfront
whether a field is a string or a binary. As a result, when the field is
actually a binary field and contains invalid UTF-8, Rust libraries fail
to read it with the error
`File out of specification: Invalid thrift: bad data`.

To fix this, we patch the protocol implementation to read string/binary
fields as raw bytes, without assuming that they are valid UTF-8.
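
The difference can be illustrated with a minimal sketch. It is not the
patched crate's code: the function names `read_varint`, `read_bytes`, and
`read_string_eager` are hypothetical, and the sketch only assumes the
wire layout described in the compact protocol spec, where a binary field
is a varint length followed by that many bytes.

```rust
use std::convert::TryFrom;

/// Decode an unsigned LEB128 varint, which the compact protocol uses for lengths.
/// Returns the value and the number of bytes consumed. (Hypothetical helper.)
fn read_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    // A 64-bit varint occupies at most 10 bytes.
    for (i, &byte) in buf.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Read a string/binary field as raw bytes, with no UTF-8 validation.
/// This mirrors the idea of the fix. (Hypothetical helper.)
fn read_bytes(buf: &[u8]) -> Option<Vec<u8>> {
    let (len, used) = read_varint(buf)?;
    let len = usize::try_from(len).ok()?;
    let end = used.checked_add(len)?;
    buf.get(used..end).map(|payload| payload.to_vec())
}

/// Read the same field eagerly as a UTF-8 string.
/// This mirrors the behaviour that fails on binary data. (Hypothetical helper.)
fn read_string_eager(buf: &[u8]) -> Option<String> {
    String::from_utf8(read_bytes(buf)?).ok()
}

fn main() {
    // Length 2 followed by valid UTF-8: both readers succeed.
    let text = [0x02, b'h', b'i'];
    assert_eq!(read_string_eager(&text).as_deref(), Some("hi"));
    assert_eq!(read_bytes(&text), Some(b"hi".to_vec()));

    // Length 2 followed by bytes that are not valid UTF-8:
    // the eager string reader fails, the byte reader does not.
    let blob = [0x02, 0xff, 0xfe];
    assert!(read_string_eager(&blob).is_none());
    assert_eq!(read_bytes(&blob), Some(vec![0xff, 0xfe]));
}
```

Both inputs have the same framing on the wire, so only a caller that
knows the schema can decide whether validating the bytes as UTF-8 is
appropriate.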