Update implementation status for parquet-rs #101
Conversation
| | Row group pruning using statistics | ❌ | ✅ | | ✅ | ✅ | |
| | Row group pruning using bloom filter | ❌ | ✅ | | ✅ | ✅ | |
| | Reading select columns only | ✅ | ✅ | | ✅ | ✅ | |
| | Page pruning using statistics | ❌ | ✅ | | ✅ | ❌ | |
I'm not entirely sure how to do the pruning ones...IIUC parquet-rs allows for pruning, but the actual work needs to be done outside the library. Perhaps these should be Xs like C++?
I think it is fair to say pruning is supported, in that the APIs to do it are there, it just isn't batteries included (we don't ship an expression engine)
I agree we should mark parquet-rs as supporting pruning
Specifically, this structure gets the statistics as arrow record batches (either pages or row groups)
- https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/statistics/struct.StatisticsConverter.html
And then you can specify which row groups to read via
- https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_groups
- https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_selection
As @tustvold says, parquet-rs doesn't provide a way to evaluate an expression on those arrow arrays, but you can use a query engine (like DataFusion!) to do so
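The division of labor described above (the library surfaces statistics, the caller decides which row groups to keep) can be sketched with plain Rust types. `RowGroupStats` and `prune_row_groups` below are hypothetical stand-ins, not parquet-rs API; in practice the min/max values would come from `StatisticsConverter` and the resulting indices would be passed to `with_row_groups`:

```rust
// Sketch of statistics-based row group pruning for a predicate like
// `col BETWEEN lo AND hi`. The `RowGroupStats` type is hypothetical.
struct RowGroupStats {
    min: i64,
    max: i64,
}

/// Return the indices of row groups whose [min, max] range overlaps
/// the query range -- only these need to be read at all.
fn prune_row_groups(stats: &[RowGroupStats], lo: i64, hi: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| s.max >= lo && s.min <= hi)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let stats = vec![
        RowGroupStats { min: 0, max: 99 },
        RowGroupStats { min: 100, max: 199 },
        RowGroupStats { min: 200, max: 299 },
    ];
    // Predicate: value BETWEEN 150 AND 250 -> only groups 1 and 2 survive.
    let keep = prune_row_groups(&stats, 150, 250);
    println!("{:?}", keep); // [1, 2]
}
```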
BTW I wonder if we should propose adding a row for "predicate pushdown" (aka evaluating predicates based on scans) -- basically what https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/trait.ArrowPredicate.html provides
I believe "late materialization" is the precise technique, and I agree it would be good to add a row for this
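As a rough illustration of the late-materialization idea (decode only the predicate column, build a row selection, then decode the remaining columns just for the surviving rows), here is a sketch using plain vectors in place of Arrow arrays; all names and types are illustrative, not the `ArrowPredicate` API itself:

```rust
// Late materialization sketch.
// Step 1: decode only the (cheap) predicate column and evaluate the filter.
// Step 2: decode the remaining, possibly expensive, columns only for rows
// that survived the filter.

fn selected_rows(predicate_col: &[i64], threshold: i64) -> Vec<usize> {
    predicate_col
        .iter()
        .enumerate()
        .filter(|(_, v)| **v > threshold)
        .map(|(i, _)| i)
        .collect()
}

fn materialize<T: Clone>(col: &[T], selection: &[usize]) -> Vec<T> {
    selection.iter().map(|&i| col[i].clone()).collect()
}

fn main() {
    let ids = vec![10, 20, 30, 40]; // predicate column
    let payloads = vec!["a".to_string(), "b".into(), "c".into(), "d".into()];

    let sel = selected_rows(&ids, 15); // rows 1, 2, 3 pass the filter
    let out = materialize(&payloads, &sel);
    println!("{:?}", out); // ["b", "c", "d"]
}
```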
| | PLAIN_DICTIONARY | ✅ | ✅ | | ✅ | ✅ | |
| | RLE_DICTIONARY | ✅ | ✅ | | ✅ | ✅ | |
| | RLE | ✅ | ✅ | | ✅ | ✅ | |
| | BIT_PACKED (deprecated) | ✅ | ✅ | | ❌ | (R) | |
I could be mistaken, but I thought we supported this, just refused to write it
If I remember correctly it only worked for bitwidth = 0, and the C++ and Go implementations also had issues with it: apache/arrow-rs#5338
It seems to me that we can read BIT_PACKED level data, but not page data. None of the type-specific implementations of get_decoder support it, nor does get_decoder_default.
Important to remember that boolean data is bitpacked, but for that type that is the PLAIN encoding. The documentation says
Note that the BIT_PACKED encoding method is only supported for encoding repetition and definition levels.
I could change this to R(*) with a note that decoding is only supported for level data.
From what I have seen, only the Java code implements BIT_PACKED levels according to the specification. C++, Rust, Go, and it seems also cuDF read/write bits in a different order than specified, and are thus not interoperable with files using that encoding. This isn't an issue in practice since the encoding is deprecated, and the Java version does not seem to provide a public API for enabling it when writing. If I remember correctly, Java might use the encoding when the bit width is 0; in that case the different bit order does not matter since no actual data is written.
For compatibility reasons, this implementation packs values from the most significant bit to the least significant bit, which is not the same as the RLE/bit-packing hybrid.
All implementations I have seen, except Java, reuse the same bit-packing code for BIT_PACKED and RLE, which makes them read/write incorrect levels. So marking it as unsupported, with a footnote "except for bit width = 0", would be the best solution.
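The bit-order discrepancy described above can be made concrete with a small sketch (my own illustration, not code from any of the implementations): packing the same 2-bit values MSB-first, as the BIT_PACKED specification requires, versus LSB-first, as the RLE/bit-packing hybrid does, yields different bytes:

```rust
/// Pack up to four 2-bit values starting from the most significant bit
/// of the byte, as the deprecated BIT_PACKED encoding specifies.
fn pack_msb_first(values: &[u8]) -> u8 {
    let mut byte = 0u8;
    for (i, &v) in values.iter().enumerate() {
        byte |= (v & 0b11) << (6 - 2 * i);
    }
    byte
}

/// Pack up to four 2-bit values starting from the least significant bit,
/// as the RLE/bit-packing hybrid does.
fn pack_lsb_first(values: &[u8]) -> u8 {
    let mut byte = 0u8;
    for (i, &v) in values.iter().enumerate() {
        byte |= (v & 0b11) << (2 * i);
    }
    byte
}

fn main() {
    let levels = [0u8, 1, 2, 3];
    // MSB-first: 00 01 10 11 -> 0b00011011
    // LSB-first: 11 10 01 00 -> 0b11100100
    println!(
        "{:08b} vs {:08b}",
        pack_msb_first(&levels),
        pack_lsb_first(&levels)
    );
}
```

A reader that applies the LSB-first routine to MSB-first data decodes the wrong levels, which is exactly the interoperability problem noted above.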
If I remember correctly, Java might use the encoding if the bitwidth is 0, in that case the different bit order does not matter since there is no actual data is written.
Yes, I recall seeing files that indicated BIT_PACKED encoding was used, but as you said it was for max level == 0 (thus bitwidth == 0), so in other words, no data is actually present/encoded.
I'll add a note to that effect. Thanks!
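A quick sketch of why max level == 0 carries no encoded data: the level bit width is ceil(log2(max_level + 1)), which is zero when the max level is zero, so zero bytes are written regardless of bit order. The helper name here is mine, not parquet-rs API:

```rust
/// Number of bits needed to encode levels in [0, max_level],
/// i.e. ceil(log2(max_level + 1)).
fn level_bit_width(max_level: u64) -> u32 {
    64 - max_level.leading_zeros()
}

fn main() {
    println!("{}", level_bit_width(0)); // 0 -> no bits per value at all
    println!("{}", level_bit_width(1)); // 1
    println!("{}", level_bit_width(3)); // 2
}
```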
I think the side-by-side visual representation is important, but the diff is definitely a pain. Maybe we could store the actual status of each implementation in separate files (JSON, YAML?) and then automatically render the table
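A minimal sketch of that render-from-data idea, with the status kept in an in-memory map rather than the per-implementation JSON/YAML files suggested above (all names and data here are hypothetical):

```rust
use std::collections::BTreeMap;

/// Render a markdown status table from structured data: one row per
/// feature, one ✅/❌ column per implementation. In practice the map
/// would be populated by parsing per-implementation data files.
fn render_table(features: &[&str], impls: &BTreeMap<&str, Vec<bool>>) -> String {
    let mut out = String::new();
    out.push_str("| Feature |");
    for name in impls.keys() {
        out.push_str(&format!(" {} |", name));
    }
    out.push('\n');
    out.push_str(&"|---".repeat(impls.len() + 1));
    out.push_str("|\n");
    for (row, feature) in features.iter().enumerate() {
        out.push_str(&format!("| {} |", feature));
        for statuses in impls.values() {
            out.push_str(if statuses[row] { " ✅ |" } else { " ❌ |" });
        }
        out.push('\n');
    }
    out
}

fn main() {
    let features = ["Row group pruning using statistics"];
    let mut impls = BTreeMap::new();
    impls.insert("parquet-cpp", vec![true]);
    impls.insert("parquet-rs", vec![true]);
    println!("{}", render_table(&features, &impls));
}
```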
alamb left a comment
Thanks @etseidl -- I reviewed this and it looks great to me
Thank you for filling this out
| | GZIP | ✅ | ✅ | | ✅ | (R) | |
| | LZ4 (deprecated) | ✅ | ❌ | | ✅ | ❌ | |
| | LZ4_RAW | ✅ | ✅ | | ✅ | ✅ | |
| | LZO | ❌ | ❌ | | ❌ | ❌ | |
https://docs.rs/parquet/latest/parquet/basic/enum.Compression.html claims to support LZO
However I did some more digging and I agree that LZO does not appear to be supported https://github.com/apache/arrow-rs/blob/7781bc2170c84ada387901e09b2cdfe4235c3570/parquet/src/compression.rs#L195-L194
| | Statistics min_value, max_value | ✅ | ✅ | | ✅ | ✅ | |
| | Page index | ✅ | ✅ | | ✅ | ✅ | |
| | Page CRC32 checksum | ✅ | ✅ | | ✅ | ❌ | |
| | Modular encryption | ✅ | ✅ | | ❌ | ❌ | |
Have to remember to update this cell when apache/arrow-rs#6637 is merged.
Thanks for updating it! Please let me know when ready to merge.
Thanks @wgtmac. I think it's good to go as far as Rust is concerned. Any fine tuning can be done in a follow up.
Merged! Thanks all!