Update matrix for parquet-cpp and parquet-java #100
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -29,14 +29,14 @@ Implementations: | |
|
|
||
| | Data type | C++ | Java | Go | Rust | cuDF | | ||
| | ----------------------------------------- | ----- | ----- | ----- | ----- | ----- | | ||
| | BOOLEAN | | | | | ✅ | | ||
| | INT32 | | | | | ✅ | | ||
| | INT64 | | | | | ✅ | | ||
| | INT96 (1) | | | | | ✅ | | ||
| | FLOAT | | | | | ✅ | | ||
| | DOUBLE | | | | | ✅ | | ||
| | BYTE_ARRAY | | | | | ✅ | | ||
| | FIXED_LEN_BYTE_ARRAY | | | | | ✅ | | ||
| | BOOLEAN | ✅ | ✅ | | | ✅ | | ||
| | INT32 | ✅ | ✅ | | | ✅ | | ||
| | INT64 | ✅ | ✅ | | | ✅ | | ||
| | INT96 (1) | ✅ | ✅ | | | ✅ | | ||
| | FLOAT | ✅ | ✅ | | | ✅ | | ||
| | DOUBLE | ✅ | ✅ | | | ✅ | | ||
| | BYTE_ARRAY | ✅ | ✅ | | | ✅ | | ||
| | FIXED_LEN_BYTE_ARRAY | ✅ | ✅ | | | ✅ | | ||
|
|
||
| * \(1) This type is deprecated, but as of 2024 it's common in currently produced parquet files | ||
|
|
||
|
|
@@ -45,64 +45,63 @@ Implementations: | |
|
|
||
| | Data type | C++ | Java | Go | Rust | cuDF | | ||
| | ----------------------------------------- | ----- | ----- | ----- | ----- | ----- | | ||
| | STRING | | | | | ✅ | | ||
| | ENUM | | | | | ❌ | | ||
| | UUID | | | | | ❌ | | ||
| | 8, 16, 32, 64 bit signed and unsigned INT | | | | | ✅ | | ||
| | DECIMAL (INT32) | | | | | ✅ | | ||
| | DECIMAL (INT64) | | | | | ✅ | | ||
| | DECIMAL (BYTE_ARRAY) | | | | | ✅ | | ||
| | DECIMAL (FIXED_LEN_BYTE_ARRAY) | | | | | ✅ | | ||
| | DATE | | | | | ✅ | | ||
| | TIME (INT32) | | | | | ✅ | | ||
| | TIME (INT64) | | | | | ✅ | | ||
| | TIMESTAMP (INT64) | | | | | ✅ | | ||
| | INTERVAL | | | | | ❌ | | ||
| | JSON | | | | | ❌ | | ||
| | BSON | | | | | ❌ | | ||
| | LIST | | | | | ✅ | | ||
| | MAP | | | | | ✅ | | ||
| | UNKNOWN (always null) | | | | | ✅ | | ||
| | FLOAT16 | | | | | ✅ | | ||
| | STRING | ✅ | ✅ | | | ✅ | | ||
| | ENUM | ❌ | ✅ | | | ❌ | | ||
| | UUID | ❌ | ✅ | | | ❌ | | ||
| | 8, 16, 32, 64 bit signed and unsigned INT | ✅ | ✅ | | | ✅ | | ||
| | DECIMAL (INT32) | ✅ | ✅ | | | ✅ | | ||
| | DECIMAL (INT64) | ✅ | ✅ | | | ✅ | | ||
| | DECIMAL (BYTE_ARRAY) | ✅ | ✅ | | | ✅ | | ||
| | DECIMAL (FIXED_LEN_BYTE_ARRAY) | ✅ | ✅ | | | ✅ | | ||
| | DATE | ✅ | ✅ | | | ✅ | | ||
| | TIME (INT32) | ✅ | ✅ | | | ✅ | | ||
| | TIME (INT64) | ✅ | ✅ | | | ✅ | | ||
| | TIMESTAMP (INT64) | ✅ | ✅ | | | ✅ | | ||
| | INTERVAL | ✅ | ✅ | | | ❌ | | ||
| | JSON | ✅ | ✅ | | | ❌ | | ||
| | BSON | ❌ | ✅ | | | ❌ | | ||
| | LIST | ✅ | ✅ | | | ✅ | | ||
| | MAP | ✅ | ✅ | | | ✅ | | ||
| | UNKNOWN (always null) | ✅ | ✅ | | | ✅ | | ||
|
||
| | FLOAT16 | ✅ | ✅ | | | ✅ | | ||
|
||
|
|
||
| ### Encodings | ||
|
|
||
| | Encoding | C++ | Java | Go | Rust | cuDF | | ||
| | ----------------------------------------- | ----- | ----- | ----- | ----- | ----- | | ||
| | PLAIN | | | | | ✅ | | ||
| | PLAIN_DICTIONARY | | | | | ✅ | | ||
| | RLE_DICTIONARY | | | | | ✅ | | ||
| | RLE | | | | | ✅ | | ||
| | BIT_PACKED (deprecated) | | | | | (R) | | ||
| | DELTA_BINARY_PACKED | | | | | ✅ | | ||
| | DELTA_LENGTH_BYTE_ARRAY | | | | | ✅ | | ||
| | DELTA_BYTE_ARRAY | | | | | ✅ | | ||
| | BYTE_STREAM_SPLIT | | | | | ✅ | | ||
| | PLAIN | ✅ | ✅ | | | ✅ | | ||
| | PLAIN_DICTIONARY | ✅ | ✅ | | | ✅ | | ||
| | RLE_DICTIONARY | ✅ | ✅ | | | ✅ | | ||
| | RLE | ✅ | ✅ | | | ✅ | | ||
| | BIT_PACKED (deprecated) | ✅ | ✅ | | | (R) | | ||
| | DELTA_BINARY_PACKED | ✅ | ✅ | | | ✅ | | ||
| | DELTA_LENGTH_BYTE_ARRAY | ✅ | ✅ | | | ✅ | | ||
| | DELTA_BYTE_ARRAY | ✅ | ✅ | | | ✅ | | ||
| | BYTE_STREAM_SPLIT | ✅ | ✅ | | | ✅ | | ||
|
|
||
| ### Compressions | ||
|
|
||
| | Compression | C++ | Java | Go | Rust | cuDF | | ||
| | ----------------------------------------- | ----- | ----- | ----- | ----- | ----- | | ||
| | UNCOMPRESSED | | | | | ✅ | | ||
| | BROTLI | | | | | (R) | | ||
| | GZIP | | | | | (R) | | ||
| | LZ4 (deprecated) | | | | | ❌ | | ||
| | LZ4_RAW | | | | | ✅ | | ||
| | LZO | | | | | ❌ | | ||
| | SNAPPY | | | | | ✅ | | ||
| | ZSTD | | | | | ✅ | | ||
| | UNCOMPRESSED | ✅ | ✅ | | | ✅ | | ||
| | GZIP | ✅ | ✅ | | | (R) | | ||
| | LZ4 (deprecated) | ✅ | ✅ | | | ❌ | | ||
| | LZ4_RAW | ✅ | ✅ | | | ✅ | | ||
| | LZO | ❌ | ✅ | | | ❌ | | ||
| | SNAPPY | ✅ | ✅ | | | ✅ | | ||
| | ZSTD | ✅ | ✅ | | | ✅ | | ||
|
||
|
|
||
| ### Other format level features | ||
|
|
||
| | | C++ | Java | Go | Rust | cuDF | | ||
| | ----------------------------------------- | ----- | ----- | ----- | ----- | ----- | | ||
| | xxHash-based bloom filters | | | | | (R) | | ||
| | Bloom filter length (1) | | | | | (R) | | ||
| | Statistics min_value, max_value | | | | | ✅ | | ||
| | Page index | | | | | ✅ | | ||
| | Page CRC32 checksum | | | | | ❌ | | ||
| | Modular encryption | | | | | ❌ | | ||
| | Size statistics (2) | | | | | ✅ | | ||
| | xxHash-based bloom filters | (R) | ✅ | | | (R) | | ||
| | Bloom filter length (1) | (R) | ✅ | | | (R) | | ||
| | Statistics min_value, max_value | ✅ | ✅ | | | ✅ | | ||
| | Page index | ✅ | ✅ | | | ✅ | | ||
| | Page CRC32 checksum | ✅ | ✅ | | | ❌ | | ||
| | Modular encryption | ✅ | ✅ | | | ❌ | | ||
| | Size statistics (2) | ✅ | ✅ | | | ✅ | | ||
|
|
||
|
|
||
| * \(1) In parquet.thrift: ColumnMetaData->bloom_filter_length | ||
|
|
@@ -113,12 +112,12 @@ Implementations: | |
|
|
||
| | Format | C++ | Java | Go | Rust | cuDF | | ||
| | -------------------------------------------- | ----- | ----- | ----- | ----- | ----- | | ||
| | External column data (1) | | | | | (W) | | ||
| | Row group "Sorting column" metadata (2) | | | | | (W) | | ||
| | Row group pruning using statistics | | | | | ✅ | | ||
| | Row group pruning using bloom filter | | | | | ✅ | | ||
| | Reading select columns only | | | | | ✅ | | ||
| | Page pruning using statistics | | | | | ❌ | | ||
| | External column data (1) | ❌ | ✅ | | | (W) | | ||
|
||
| | Row group "Sorting column" metadata (2) | ✅ | ❌ | | | (W) | | ||
| | Row group pruning using statistics | ✅ | ✅ | | | ✅ | | ||
|
||
| | Row group pruning using bloom filter | ❌ | ✅ | | | ✅ | | ||
| | Reading select columns only | ✅ | ✅ | | | ✅ | | ||
| | Page pruning using statistics | ❌ | ✅ | | | ❌ | | ||
|
|
||
|
|
||
| * \(1) In parquet.thrift: ColumnChunk->file_path | ||
|
|
||
Should we split this out on the unit as well?
Why is that? Is any unit unsupported in parquet-java?
No, but I'm not sure if all the other implementations support `nanos`, since that was added later on.
It was added almost 7 years ago: apache/parquet-format@b879065. We can make the units explicit later if we actually find an implementation that does not support one.
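
For reference, here is a minimal sketch of what splitting a row by unit could look like. It assumes the suggestion targets a time-typed row such as `TIMESTAMP (INT64)`, which the thread does not state explicitly. The helper `split_row_by_unit` is hypothetical and not part of any Parquet tooling. Every support cell is emitted as `?` rather than copied from the existing row, because per-unit support would have to be verified for each implementation.

```python
# The three time units defined for Parquet TIME/TIMESTAMP logical types.
UNITS = ("MILLIS", "MICROS", "NANOS")

# Column order used by the matrix in this PR.
IMPLEMENTATIONS = ("C++", "Java", "Go", "Rust", "cuDF")


def split_row_by_unit(logical_type, physical_type, units=UNITS):
    """Return one Markdown table row per unit (hypothetical helper).

    Support cells are placeholders ('?'): this sketch makes no claim
    about which implementation supports which unit.
    """
    cells = " | ".join("?" for _ in IMPLEMENTATIONS)
    rows = []
    for unit in units:
        label = f"{logical_type} ({physical_type}, {unit})"
        rows.append(f"| {label} | {cells} |")
    return rows


if __name__ == "__main__":
    for row in split_row_by_unit("TIMESTAMP", "INT64"):
        print(row)
```

If no implementation turns out to lack a unit, the single combined row stays simpler, which matches the conclusion reached in the thread.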