PARQUET-2310: implementation status #34
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,124 @@ | ||
| --- | ||
| title: "Implementation status" | ||
| linkTitle: "Implementation status" | ||
| weight: 8 | ||
| --- | ||
|
|
||
| This page summarizes the features supported by different Parquet | ||
| implementations. | ||
|
|
||
| *Note*: This is a work in progress and we would welcome help expanding its scope. | ||
|
|
||
| ### Legend | ||
| The value in each box means: | ||
| * ✅: supported | ||
| * ❌: not supported | ||
| * (blank): no data | ||
|
|
||
| Implementations: | ||
| * `C++`: [parquet-cpp](https://github.com/apache/arrow/tree/main/cpp/src/parquet) | ||
| * `Java`: [parquet-java](https://github.com/apache/parquet-java) | ||
| * `Go`: [parquet-go](https://github.com/apache/arrow/tree/main/go/parquet) | ||
| * `Rust`: [parquet-rs](https://github.com/apache/arrow-rs/blob/master/parquet/README.md) | ||
|
|
||
|
|
||
|
|
||
| ### Physical types | ||
|
||
|
|
||
| | Data type | C++ | Java | Go | Rust | | ||
|
Member
I'm not sure that simply naming our implementations "C++", "Java", etc. is very forward-looking, because at some point e.g. DuckDB might want to add their own info here, and they're also written in C++. Besides, Parquet C++ is also available in Python using PyArrow, in R using R Arrow, and perhaps even in C and Ruby using the GLib bindings. That said, we can also decide to rename the columns later.
Collaborator
I agree -- let's rename the columns later (as we fill out the details) with some name that can be mapped to the implementation (
Member
Let's add a paragraph at the top with a pointer to each implementation (Java, Go, C++, Rust, ...). That will make it easy to add more implementations and clarify which one we're talking about.
Collaborator
Proposed addition (targeting this PR) in alippai#1 |
||
| | ----------------------------------------- | ----- | ------ | ----- | ----- | | ||
| | BOOLEAN | | | | | | ||
| | INT32 | | | | | | ||
| | INT64 | | | | | | ||
| | INT96 (1) | | | | | | ||
| | FLOAT | | | | | | ||
| | DOUBLE | | | | | | ||
| | BYTE_ARRAY | | | | | | ||
| | FIXED_LEN_BYTE_ARRAY | | | | | | ||
|
|
||
| * \(1) This type is deprecated, but as of 2024 it is still commonly written to newly produced Parquet files | ||
|
|
||
|
|
||
| ### Logical types | ||
|
|
||
| | Data type | C++ | Java | Go | Rust | | ||
| | ----------------------------------------- | ----- | ------ | ----- | ----- | | ||
| | STRING | | | | | | ||
| | ENUM | | | | | | ||
|
||
| | UUID | | | | | | ||
| | 8, 16, 32, 64 bit signed and unsigned INT | | | | | | ||
| | DECIMAL (INT32) | | | | | | ||
| | DECIMAL (INT64) | | | | | | ||
| | DECIMAL (BYTE_ARRAY) | | | | | | ||
| | DECIMAL (FIXED_LEN_BYTE_ARRAY) | | | | | | ||
| | DATE | | | | | | ||
| | TIME (INT32) | | | | | | ||
| | TIME (INT64) | | | | | | ||
| | TIMESTAMP (INT64) | | | | | | ||
| | INTERVAL | | | | | | ||
|
||
| | JSON | | | | | | ||
| | BSON | | | | | | ||
| | LIST | | | | | | ||
| | MAP | | | | | | ||
| | UNKNOWN (always null) | | | | | | ||
| | FLOAT16 | | | | | | ||
|
|
||
| ### Encodings | ||
|
|
||
| | Encoding | C++ | Java | Go | Rust | | ||
| | ----------------------------------------- | ----- | ------ | ----- | ----- | | ||
| | PLAIN | | | | | | ||
| | PLAIN_DICTIONARY | | | | | | ||
| | RLE_DICTIONARY | | | | | | ||
| | RLE | | | | | | ||
| | BIT_PACKED (deprecated) | | | | | | ||
| | DELTA_BINARY_PACKED | | | | | | ||
| | DELTA_LENGTH_BYTE_ARRAY | | | | | | ||
| | DELTA_BYTE_ARRAY | | | | | | ||
| | BYTE_STREAM_SPLIT | | | | | | ||
|
Contributor
Should this be split into float/double and int/fixed_len_byte_array, or just use notes if an implementation doesn't yet support the expanded set of data types?
Member
If we're going to give dates or version numbers as @emkornfield suggested, then this should be split into separate lines. |
||
|
|
||
| ### Compressions | ||
|
|
||
| | Compression | C++ | Java | Go | Rust | | ||
| | ----------------------------------------- | ----- | ------ | ----- | ----- | | ||
| | UNCOMPRESSED | | | | | | ||
| | BROTLI | | | | | | ||
| | GZIP | | | | | | ||
| | LZ4 (deprecated) | | | | | | ||
| | LZ4_RAW | | | | | | ||
| | LZO | | | | | | ||
| | SNAPPY | | | | | | ||
| | ZSTD | | | | | | ||
|
|
||
| ### Other format level features | ||
|
|
||
| | Feature | C++ | Java | Go | Rust | | ||
| | ----------------------------------------- | ----- | ------ | ----- | ----- | | ||
| | xxxHash-based bloom filters | | | | | | ||
| | Bloom filter length (1) | | | | | | ||
| | Statistics min_value, max_value | | | | | | ||
| | Page index | | | | | | ||
| | Page CRC32 checksum | | | | | | ||
| | Modular encryption | | | | | | ||
|
||
| | Size statistics (2) | | | | | | ||
|
|
||
|
|
||
| * \(1) In parquet.thrift: ColumnMetaData->bloom_filter_length | ||
|
|
||
| * \(2) In parquet.thrift: ColumnMetaData->size_statistics | ||
|
|
||
| ### High level data APIs for Parquet feature usage | ||
|
|
||
| | Feature | C++ | Java | Go | Rust | | ||
| | -------------------------------------------- | ----- | ------ | ----- | ----- | | ||
| | External column data (1) | | | | | | ||
| | Row group "Sorting column" metadata (2) | | | | | | ||
| | Row group pruning using statistics | | | | | | ||
| | Reading select columns only | | | | | | ||
| | Page pruning using statistics | | | | | | ||
| | Page pruning using bloom filter | | | | | | ||
|
|
||
|
|
||
| * \(1) In parquet.thrift: ColumnChunk->file_path | ||
|
|
||
| * \(2) In parquet.thrift: RowGroup->sorting_columns | ||
Maybe add a legend:
✅ supported
❌ not supported
[blank] no data
The main goal is to clarify the difference between missing information and an unsupported feature.

Proposed addition (targeting this PR) in alippai#1