GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status #36027
@@ -348,3 +348,107 @@ Notes:
* \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)

* \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)

Parquet format public API details
=================================

+-------------------------------------------+-------+--------+--------+-------+-------+
| Format                                    | C++   | Python | Java   | Go    | Rust  |
+===========================================+=======+========+========+=======+=======+
| Basic compression                         |       |        |        |       |       |
|
Contributor: I wonder if we could have separate tables for supported physical types, encodings and compression.

Member: +1 for this.
+-------------------------------------------+-------+--------+--------+-------+-------+
| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| LZ4_RAW                                   |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Hive-style partitioning                   |       |        |        |       |       |
|
Contributor: I'm not sure I'd consider this a feature of the Parquet implementation; it is more a detail of the query engine, IMO.

Contributor (Author): While arrow-rs needs DataFusion for this functionality, Arrow handles it without Acero. I don't have a strong opinion, though.

Member: I agree with @tustvold.
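To make the distinction concrete, here is a minimal sketch (not from the PR) of Hive-style partitioning handled at the `pyarrow.dataset` layer rather than by the core Parquet reader; the `data/year=.../...` layout is hypothetical:

```python
# Sketch: Hive-style partition discovery via pyarrow.dataset,
# the query-engine layer the reviewers attribute this feature to.
# Assumed (hypothetical) layout: data/year=2022/..., data/year=2023/...
import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet", partitioning="hive")
# The directory key "year" surfaces as a regular, filterable column:
table = dataset.to_table(filter=(ds.field("year") == 2023))
```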
+-------------------------------------------+-------+--------+--------+-------+-------+
| File metadata                             |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| RowGroup metadata                         |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Column metadata                           |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
Comment on lines +367 to +373

Contributor: You can't not support this metadata, as otherwise the Parquet file can't be read?

Member: Are these intended to track the completeness of the fields defined in the metadata? If yes, they probably warrant a separate table indicating the state of each field. But that sounds too complicated.

Member: Is the intention to indicate that the metadata is available through a public API, rather than saying whether or not it is supported in general? As @tustvold says, you have to support the metadata, otherwise the file can't be read.
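As a concrete illustration of "available through a public API": a minimal pyarrow sketch (the file name is hypothetical) that walks the file, row-group, and column-chunk metadata:

```python
# Sketch: inspecting Parquet metadata through pyarrow's public API.
import pyarrow.parquet as pq

md = pq.read_metadata("example.parquet")     # FileMetaData
print(md.num_row_groups, md.created_by)

rg = md.row_group(0)                         # RowGroupMetaData
col = rg.column(0)                           # ColumnChunkMetaData
print(col.path_in_schema, col.compression, col.statistics)
```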
+-------------------------------------------+-------+--------+--------+-------+-------+
| Chunk metadata                            |       |        |        |       |       |
|
Contributor: I'm not sure what this is and how it differs from ColumnChunk.
+-------------------------------------------+-------+--------+--------+-------+-------+
| Sorting column                            |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| ColumnIndex statistics                    |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Page statistics                           |       |        |        |       |       |
|
Contributor: What is this referring to?

Contributor (Author): Like I said, there is a good chance I made a mistake here. I saw this in the Thrift spec: ColumnChunk -> ColumnMetaData -> Statistics.

Member: Could we organize these items in a layered fashion? Maybe this is a good starting point: https://arrow.apache.org/docs/cpp/parquet.html#supported-parquet-features
+-------------------------------------------+-------+--------+--------+-------+-------+
| Statistics min_value                      |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| xxHash based bloom filter                 |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| bloom filter length                       |       |        |        |       |       |
|
Contributor: What is this?

Contributor (Author): The `bloom_filter_length` field recently added to the parquet-format Thrift metadata.

Contributor: OMG, they finally added it - amazing, will get that incorporated into the Rust writer/reader.

Member: I just added it recently :) Please note that the latest format is not released yet, so parquet-mr does not know about it yet.
+-------------------------------------------+-------+--------+--------+-------+-------+
| Modular encryption                        |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| External column data                      |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Nanosecond support                        |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Complete Delta encoding support           |       |        |        |       |       |
|
Contributor: I think it would be clearer if you listed the actual encodings, perhaps in a separate table.
+-------------------------------------------+-------+--------+--------+-------+-------+
| Complete RLE support                      |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Partition pruning on the partition column |       |        |        |       |       |
|
Contributor: Again, this is a detail of the query engine, not the Parquet implementation, IMO.

Contributor (Author): Same; it's part of the current API, but I agree it's not consistent across implementations.
+-------------------------------------------+-------+--------+--------+-------+-------+
| RowGroup pruning using statistics         |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| RowGroup pruning using bloom filter       |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Page pruning using projection pushdown    |       |        |        |       |       |
|
Contributor: Suggested change

Member: Isn't this also a detail of the engine choosing what columns to read or not? Or is the intent here to indicate that rows/values can be pruned based on projection directly in the Parquet lib?
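For reference, a minimal pyarrow sketch of statistics-based pruning as exposed to users (the file name and column are hypothetical): predicates passed as `filters` let the reader skip row groups whose min/max statistics cannot match.

```python
# Sketch: RowGroup pruning using statistics via pyarrow's filters.
import pyarrow.parquet as pq

# Row groups whose min/max statistics for "x" rule out values > 100
# can be skipped without being read or decoded.
table = pq.read_table("example.parquet", filters=[("x", ">", 100)])
```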
+-------------------------------------------+-------+--------+--------+-------+-------+
| Page pruning using statistics             |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Page pruning using bloom filter           |       |        |        |       |       |
|
Contributor: I don't think this is supported by the format; bloom filters are per column chunk.
+-------------------------------------------+-------+--------+--------+-------+-------+
| Partition append / delete                 |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| RowGroup append / delete                  |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Page append / delete                      |       |        |        |       |       |
|
Contributor: I don't think any support page appending; the semantics would be peculiar for things like dictionary pages. The Rust implementation does support appending column chunks, though.

Contributor (Author): Yes, likely some / most of the Page references should be ColumnChunk. I'll read about this more.

Member: Isn't Parquet itself a write-once format that can't be appended to? I'm not sure what these are supposed to indicate. The inability to append/delete without re-writing a Parquet file is why table formats like Iceberg and Delta have proliferated.
+-------------------------------------------+-------+--------+--------+-------+-------+
| Page CRC32 checksum                       |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Parallel partition processing             |       |        |        |       |       |
|
Contributor: IMO this is a query engine detail, not a detail of the file format?

Contributor (Author): It's part of the Arrow API in Python.
+-------------------------------------------+-------+--------+--------+-------+-------+
| Parallel RowGroup processing              |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Parallel Page processing                  |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Storage-aware defaults (1)                |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Adaptive concurrency (2)                  |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Adaptive IO when pruning used (3)         |       |        |        |       |       |
|
Comment on lines +428 to +432

Contributor: I'm not sure which Parquet reader these features are based off, but my 2 cents is that they indicate a problematic IO abstraction that relies on prefetching heuristics instead of pushing vectored IO down into the IO subsystem (which the Rust and the proprietary Databricks implementations do).

Contributor (Author): I wanted to capture the IO pushdown section (https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/#io-pushdown) but also added more. Likely out of scope, as none of the implementations goes into detail or provides an API.

Contributor: Perhaps just a "Vectorized IO pushdown". I believe there are efforts to add such an API to parquet-mr.
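A minimal sketch of the IO pushdown idea from the linked blog post, using pyarrow options (the file name is hypothetical): project only the needed columns and let `pre_buffer` coalesce their byte ranges into fewer, larger reads.

```python
# Sketch: projection pushdown plus coalesced ("vectored") reads.
import pyarrow.parquet as pq

table = pq.read_table(
    "example.parquet",
    columns=["x"],    # only these column chunks are fetched
    pre_buffer=True,  # coalesce the selected ranges; helpful on object stores
)
```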
+-------------------------------------------+-------+--------+--------+-------+-------+
| Arrow schema metadata (4)                 |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
| RLE / REE support (5)                     |       |        |        |       |       |
+-------------------------------------------+-------+--------+--------+-------+-------+
Notes:

* *R* = Read supported

* *W* = Write supported

* \(1) In-memory or memory-mapped files, SSD direct IO, HDD, NFS, and local and remote S3 all need different concurrency and buffer size setups

* \(2) Depending on the encoding, compression and row group sizes, different task sizes might be ideal

* \(3) Automatic balancing of the prefetched / block reading and the Page pruning

* \(4) By default, the Arrow schema is serialized and stored in the Parquet file metadata (in the "ARROW:schema" key). When reading the file, if this key is available, it will be used to more faithfully recreate the original Arrow data.
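A minimal sketch of note (4) with pyarrow (the file name is hypothetical), showing the stored key and the round-tripped schema:

```python
# Sketch: the Arrow schema travels in the "ARROW:schema" metadata key.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"ts": pa.array([0, 1], type=pa.timestamp("ns"))})
pq.write_table(table, "example.parquet")

kv = pq.read_metadata("example.parquet").metadata   # key-value metadata
print(b"ARROW:schema" in kv)                        # True by default

# On read, the stored schema recreates the original Arrow types:
print(pq.read_table("example.parquet").schema)
```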
|
|
* \(5) Parquet supports RLE encoding of dictionary *data*. Reading and writing a similar structure (e.g. Arrow REE) without allocating the expanded values might be supported in different implementations
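A minimal sketch of the dictionary half of note (5) with pyarrow (the file name is hypothetical): dictionary-encoded Parquet data read back as an Arrow `DictionaryArray` without materializing the expanded values.

```python
# Sketch: round-tripping dictionary encoding without expansion.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"city": pa.array(["a", "b", "a"]).dictionary_encode()})
pq.write_table(table, "dict.parquet", use_dictionary=True)

back = pq.read_table("dict.parquet", read_dictionary=["city"])
print(back["city"].type)  # dictionary<values=string, indices=int32>
```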
The `Java` column could be misleading here. In the arrow repo, there is a Java Dataset reader that supports reading from a Parquet dataset. If this is for parquet-mr, then it can easily get out of sync.