From 33d9bce7308ba3b844a93dfb62c9871427d0370b Mon Sep 17 00:00:00 2001
From: Gang Wu
Date: Wed, 5 Feb 2025 23:20:36 +0800
Subject: [PATCH 1/5] update matrix for cpp and java

---
 .../docs/File Format/implementationstatus.md | 113 +++++++++---------
 1 file changed, 56 insertions(+), 57 deletions(-)

diff --git a/content/en/docs/File Format/implementationstatus.md b/content/en/docs/File Format/implementationstatus.md
index 8b32876b..144319b7 100644
--- a/content/en/docs/File Format/implementationstatus.md
+++ b/content/en/docs/File Format/implementationstatus.md
@@ -29,14 +29,14 @@ Implementations:
 
 | Data type | C++ | Java | Go | Rust | cuDF |
 | ----------------------------------------- | ----- | ----- | ----- | ----- | ----- |
-| BOOLEAN | | | | | ✅ |
-| INT32 | | | | | ✅ |
-| INT64 | | | | | ✅ |
-| INT96 (1) | | | | | ✅ |
-| FLOAT | | | | | ✅ |
-| DOUBLE | | | | | ✅ |
-| BYTE_ARRAY | | | | | ✅ |
-| FIXED_LEN_BYTE_ARRAY | | | | | ✅ |
+| BOOLEAN | ✅ | ✅ | | | ✅ |
+| INT32 | ✅ | ✅ | | | ✅ |
+| INT64 | ✅ | ✅ | | | ✅ |
+| INT96 (1) | ✅ | ✅ | | | ✅ |
+| FLOAT | ✅ | ✅ | | | ✅ |
+| DOUBLE | ✅ | ✅ | | | ✅ |
+| BYTE_ARRAY | ✅ | ✅ | | | ✅ |
+| FIXED_LEN_BYTE_ARRAY | ✅ | ✅ | | | ✅ |
 
 * \(1) This type is deprecated, but as of 2024 it's common in currently produced parquet files
 
@@ -45,64 +45,63 @@ Implementations:
 
 | Data type | C++ | Java | Go | Rust | cuDF |
 | ----------------------------------------- | ----- | ----- | ----- | ----- | ----- |
-| STRING | | | | | ✅ |
-| ENUM | | | | | ❌ |
-| UUID | | | | | ❌ |
-| 8, 16, 32, 64 bit signed and unsigned INT | | | | | ✅ |
-| DECIMAL (INT32) | | | | | ✅ |
-| DECIMAL (INT64) | | | | | ✅ |
-| DECIMAL (BYTE_ARRAY) | | | | | ✅ |
-| DECIMAL (FIXED_LEN_BYTE_ARRAY) | | | | | ✅ |
-| DATE | | | | | ✅ |
-| TIME (INT32) | | | | | ✅ |
-| TIME (INT64) | | | | | ✅ |
-| TIMESTAMP (INT64) | | | | | ✅ |
-| INTERVAL | | | | | ❌ |
-| JSON | | | | | ❌ |
-| BSON | | | | | ❌ |
-| LIST | | | | | ✅ |
-| MAP | | | | | ✅ |
-| UNKNOWN (always null) | | | | | ✅ |
-| FLOAT16 | | | | | ✅ |
+| STRING | ✅ | ✅ | | | ✅ |
+| ENUM | ❌ | ✅ | | | ❌ |
+| UUID | ❌ | ✅ | | | ❌ |
+| 8, 16, 32, 64 bit signed and unsigned INT | ✅ | ✅ | | | ✅ |
+| DECIMAL (INT32) | ✅ | ✅ | | | ✅ |
+| DECIMAL (INT64) | ✅ | ✅ | | | ✅ |
+| DECIMAL (BYTE_ARRAY) | ✅ | ✅ | | | ✅ |
+| DECIMAL (FIXED_LEN_BYTE_ARRAY) | ✅ | ✅ | | | ✅ |
+| DATE | ✅ | ✅ | | | ✅ |
+| TIME (INT32) | ✅ | ✅ | | | ✅ |
+| TIME (INT64) | ✅ | ✅ | | | ✅ |
+| TIMESTAMP (INT64) | ✅ | ✅ | | | ✅ |
+| INTERVAL | ✅ | ✅ | | | ❌ |
+| JSON | ✅ | ✅ | | | ❌ |
+| BSON | ❌ | ✅ | | | ❌ |
+| LIST | ✅ | ✅ | | | ✅ |
+| MAP | ✅ | ✅ | | | ✅ |
+| UNKNOWN (always null) | ✅ | ✅ | | | ✅ |
+| FLOAT16 | ✅ | ✅ | | | ✅ |
 
 ### Encodings
 
 | Encoding | C++ | Java | Go | Rust | cuDF |
 | ----------------------------------------- | ----- | ----- | ----- | ----- | ----- |
-| PLAIN | | | | | ✅ |
-| PLAIN_DICTIONARY | | | | | ✅ |
-| RLE_DICTIONARY | | | | | ✅ |
-| RLE | | | | | ✅ |
-| BIT_PACKED (deprecated) | | | | | (R) |
-| DELTA_BINARY_PACKED | | | | | ✅ |
-| DELTA_LENGTH_BYTE_ARRAY | | | | | ✅ |
-| DELTA_BYTE_ARRAY | | | | | ✅ |
-| BYTE_STREAM_SPLIT | | | | | ✅ |
+| PLAIN | ✅ | ✅ | | | ✅ |
+| PLAIN_DICTIONARY | ✅ | ✅ | | | ✅ |
+| RLE_DICTIONARY | ✅ | ✅ | | | ✅ |
+| RLE | ✅ | ✅ | | | ✅ |
+| BIT_PACKED (deprecated) | ✅ | ✅ | | | (R) |
+| DELTA_BINARY_PACKED | ✅ | ✅ | | | ✅ |
+| DELTA_LENGTH_BYTE_ARRAY | ✅ | ✅ | | | ✅ |
+| DELTA_BYTE_ARRAY | ✅ | ✅ | | | ✅ |
+| BYTE_STREAM_SPLIT | ✅ | ✅ | | | ✅ |
 
 ### Compressions
 
 | Compression | C++ | Java | Go | Rust | cuDF |
 | ----------------------------------------- | ----- | ----- | ----- | ----- | ----- |
-| UNCOMPRESSED | | | | | ✅ |
-| BROTLI | | | | | (R) |
-| GZIP | | | | | (R) |
-| LZ4 (deprecated) | | | | | ❌ |
-| LZ4_RAW | | | | | ✅ |
-| LZO | | | | | ❌ |
-| SNAPPY | | | | | ✅ |
-| ZSTD | | | | | ✅ |
+| UNCOMPRESSED | ✅ | ✅ | | | ✅ |
+| GZIP | ✅ | ✅ | | | (R) |
+| LZ4 (deprecated) | ✅ | ✅ | | | ❌ |
+| LZ4_RAW | ✅ | ✅ | | | ✅ |
+| LZO | ❌ | ✅ | | | ❌ |
+| SNAPPY | ✅ | ✅ | | | ✅ |
+| ZSTD | ✅ | ✅ | | | ✅ |
 
 ### Other format level features
 
 | | C++ | Java | Go | Rust | cuDF |
 | ----------------------------------------- | ----- | ----- | ----- | ----- | ----- |
-| xxHash-based bloom filters | | | | | (R) |
-| Bloom filter length (1) | | | | | (R) |
-| Statistics min_value, max_value | | | | | ✅ |
-| Page index | | | | | ✅ |
-| Page CRC32 checksum | | | | | ❌ |
-| Modular encryption | | | | | ❌ |
-| Size statistics (2) | | | | | ✅ |
+| xxHash-based bloom filters | (R) | ✅ | | | (R) |
+| Bloom filter length (1) | (R) | ✅ | | | (R) |
+| Statistics min_value, max_value | ✅ | ✅ | | | ✅ |
+| Page index | ✅ | ✅ | | | ✅ |
+| Page CRC32 checksum | ✅ | ✅ | | | ❌ |
+| Modular encryption | ✅ | ✅ | | | ❌ |
+| Size statistics (2) | ✅ | ✅ | | | ✅ |
 
 * \(1) In parquet.thrift: ColumnMetaData->bloom_filter_length
 
@@ -113,12 +112,12 @@ Implementations:
 
 | Format | C++ | Java | Go | Rust | cuDF |
 | -------------------------------------------- | ----- | ----- | ----- | ----- | ----- |
-| External column data (1) | | | | | (W) |
-| Row group "Sorting column" metadata (2) | | | | | (W) |
-| Row group pruning using statistics | | | | | ✅ |
-| Row group pruning using bloom filter | | | | | ✅ |
-| Reading select columns only | | | | | ✅ |
-| Page pruning using statistics | | | | | ❌ |
+| External column data (1) | ❌ | ✅ | | | (W) |
+| Row group "Sorting column" metadata (2) | ✅ | ❌ | | | (W) |
+| Row group pruning using statistics | ✅ | ✅ | | | ✅ |
+| Row group pruning using bloom filter | ❌ | ✅ | | | ✅ |
+| Reading select columns only | ✅ | ✅ | | | ✅ |
+| Page pruning using statistics | ❌ | ✅ | | | ❌ |
 
 * \(1) In parquet.thrift: ColumnChunk->file_path
 
From ca02ad6dc22f60d270d3d104b04be4fb0d2b43fe Mon Sep 17 00:00:00 2001
From: Gang Wu
Date: Thu, 6 Feb 2025 11:27:21 +0800
Subject: [PATCH 2/5] fix brotli

---
 content/en/docs/File Format/implementationstatus.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/content/en/docs/File Format/implementationstatus.md b/content/en/docs/File Format/implementationstatus.md
index 144319b7..d572f3a8 100644
--- a/content/en/docs/File Format/implementationstatus.md
+++ b/content/en/docs/File Format/implementationstatus.md
@@ -57,9 +57,9 @@ Implementations:
 | TIME (INT32) | ✅ | ✅ | | | ✅ |
 | TIME (INT64) | ✅ | ✅ | | | ✅ |
 | TIMESTAMP (INT64) | ✅ | ✅ | | | ✅ |
-| INTERVAL | ✅ | ✅ | | | ❌ |
+| INTERVAL | ✅ | ❌ | | | ❌ |
 | JSON | ✅ | ✅ | | | ❌ |
-| BSON | ❌ | ✅ | | | ❌ |
+| BSON | ❌ | ❌ | | | ❌ |
 | LIST | ✅ | ✅ | | | ✅ |
 | MAP | ✅ | ✅ | | | ✅ |
 | UNKNOWN (always null) | ✅ | ✅ | | | ✅ |
@@ -84,6 +84,7 @@ Implementations:
 | Compression | C++ | Java | Go | Rust | cuDF |
 | ----------------------------------------- | ----- | ----- | ----- | ----- | ----- |
 | UNCOMPRESSED | ✅ | ✅ | | | ✅ |
+| BROTLI | ✅ | ✅ | | | (R) |
 | GZIP | ✅ | ✅ | | | (R) |
 | LZ4 (deprecated) | ✅ | ✅ | | | ❌ |
 | LZ4_RAW | ✅ | ✅ | | | ✅ |
From 7b572628772816df59ed27f3dc3b09c669b36371 Mon Sep 17 00:00:00 2001
From: Gang Wu
Date: Fri, 7 Feb 2025 22:34:00 +0800
Subject: [PATCH 3/5] address comment

---
 .../en/docs/File Format/implementationstatus.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/content/en/docs/File Format/implementationstatus.md b/content/en/docs/File Format/implementationstatus.md
index d572f3a8..d8b4102c 100644
--- a/content/en/docs/File Format/implementationstatus.md
+++ b/content/en/docs/File Format/implementationstatus.md
@@ -58,12 +58,12 @@ Implementations:
 | TIME (INT64) | ✅ | ✅ | | | ✅ |
 | TIMESTAMP (INT64) | ✅ | ✅ | | | ✅ |
 | INTERVAL | ✅ | ❌ | | | ❌ |
-| JSON | ✅ | ✅ | | | ❌ |
+| JSON | ✅ | ❌ | | | ❌ |
 | BSON | ❌ | ❌ | | | ❌ |
 | LIST | ✅ | ✅ | | | ✅ |
 | MAP | ✅ | ✅ | | | ✅ |
-| UNKNOWN (always null) | ✅ | ✅ | | | ✅ |
-| FLOAT16 | ✅ | ✅ | | | ✅ |
+| UNKNOWN (always null) | ✅ | ❌ | | | ✅ |
+| FLOAT16 | ✅ | ❌ | | | ✅ |
 
 ### Encodings
 
@@ -86,9 +86,9 @@ Implementations:
 | UNCOMPRESSED | ✅ | ✅ | | | ✅ |
 | BROTLI | ✅ | ✅ | | | (R) |
 | GZIP | ✅ | ✅ | | | (R) |
-| LZ4 (deprecated) | ✅ | ✅ | | | ❌ |
+| LZ4 (deprecated) | ✅ | ❌ | | | ❌ |
 | LZ4_RAW | ✅ | ✅ | | | ✅ |
-| LZO | ❌ | ✅ | | | ❌ |
+| LZO | ❌ | ❌ | | | ❌ |
 | SNAPPY | ✅ | ✅ | | | ✅ |
 | ZSTD | ✅ | ✅ | | | ✅ |
 
@@ -113,9 +113,9 @@ Implementations:
 
 | Format | C++ | Java | Go | Rust | cuDF |
 | -------------------------------------------- | ----- | ----- | ----- | ----- | ----- |
-| External column data (1) | ❌ | ✅ | | | (W) |
+| External column data (1) | ✅ | ✅ | | | (W) |
 | Row group "Sorting column" metadata (2) | ✅ | ❌ | | | (W) |
-| Row group pruning using statistics | ✅ | ✅ | | | ✅ |
+| Row group pruning using statistics | ❌ | ✅ | | | ✅ |
 | Row group pruning using bloom filter | ❌ | ✅ | | | ✅ |
 | Reading select columns only | ✅ | ✅ | | | ✅ |
 | Page pruning using statistics | ❌ | ✅ | | | ❌ |
From 77614fa9046f29dc20f54c64c122ee2d763553b1 Mon Sep 17 00:00:00 2001
From: Gang Wu
Date: Fri, 7 Feb 2025 23:38:29 +0800
Subject: [PATCH 4/5] mark all types supported for java

---
 content/en/docs/File Format/implementationstatus.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/content/en/docs/File Format/implementationstatus.md b/content/en/docs/File Format/implementationstatus.md
index d8b4102c..ab05241d 100644
--- a/content/en/docs/File Format/implementationstatus.md
+++ b/content/en/docs/File Format/implementationstatus.md
@@ -57,13 +57,13 @@ Implementations:
 | TIME (INT32) | ✅ | ✅ | | | ✅ |
 | TIME (INT64) | ✅ | ✅ | | | ✅ |
 | TIMESTAMP (INT64) | ✅ | ✅ | | | ✅ |
-| INTERVAL | ✅ | ❌ | | | ❌ |
-| JSON | ✅ | ❌ | | | ❌ |
-| BSON | ❌ | ❌ | | | ❌ |
+| INTERVAL | ✅ | ✅ | | | ❌ |
+| JSON | ✅ | ✅ | | | ❌ |
+| BSON | ❌ | ✅ | | | ❌ |
 | LIST | ✅ | ✅ | | | ✅ |
 | MAP | ✅ | ✅ | | | ✅ |
-| UNKNOWN (always null) | ✅ | ❌ | | | ✅ |
-| FLOAT16 | ✅ | ❌ | | | ✅ |
+| UNKNOWN (always null) | ✅ | ✅ | | | ✅ |
+| FLOAT16 | ✅ | ✅ | | | ✅ |
 
 ### Encodings
 
From c6573fff1039b04f05c86d89ee3bf5db27f44ca6 Mon Sep 17 00:00:00 2001
From: Gang Wu
Date: Sat, 8 Feb 2025 12:36:40 +0800
Subject: [PATCH 5/5] use asterisk to explain supported logical type

---
 content/en/docs/File Format/implementationstatus.md | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/content/en/docs/File Format/implementationstatus.md b/content/en/docs/File Format/implementationstatus.md
index ab05241d..3bd1d23d 100644
--- a/content/en/docs/File Format/implementationstatus.md
+++ b/content/en/docs/File Format/implementationstatus.md
@@ -57,13 +57,15 @@ Implementations:
 | TIME (INT32) | ✅ | ✅ | | | ✅ |
 | TIME (INT64) | ✅ | ✅ | | | ✅ |
 | TIMESTAMP (INT64) | ✅ | ✅ | | | ✅ |
-| INTERVAL | ✅ | ✅ | | | ❌ |
-| JSON | ✅ | ✅ | | | ❌ |
-| BSON | ❌ | ✅ | | | ❌ |
+| INTERVAL | ✅ | ✅(*)| | | ❌ |
+| JSON | ✅ | ✅(*)| | | ❌ |
+| BSON | ❌ | ✅(*)| | | ❌ |
 | LIST | ✅ | ✅ | | | ✅ |
 | MAP | ✅ | ✅ | | | ✅ |
 | UNKNOWN (always null) | ✅ | ✅ | | | ✅ |
-| FLOAT16 | ✅ | ✅ | | | ✅ |
+| FLOAT16 | ✅ | ✅(*)| | | ✅ |
+
+(*): Only supported to use its annotated physical type
 
 ### Encodings
 