Conversation


@etseidl commented Feb 13, 2025

No description provided.


etseidl commented Feb 13, 2025

I wonder if it would be better to transpose the matrix here so new implementations add a few rows rather than having to wade through so many lines of diffs.

cc @alamb @tustvold

Comment on lines +120 to +123
| Row group pruning using statistics ||| | ||
| Row group pruning using bloom filter ||| | ||
| Reading select columns only ||| | ||
| Page pruning using statistics ||| | ||
Contributor Author

I'm not entirely sure how to do the pruning ones...IIUC parquet-rs allows for pruning, but the actual work needs to be done outside the library. Perhaps these should be Xs like C++?

@tustvold

I think it is fair to say pruning is supported, in that the APIs to do it are there, it just isn't batteries included (we don't ship an expression engine)

Collaborator (@alamb)

I agree we should mark parquet-rs as supporting pruning

Specifically, this structure gets the statistics as arrow record batches (either pages or row groups)

And then you can specify which row groups to read via

As @tustvold says parquet-rs doesn't provide a way to evaluate an expression on those arrow arrays, but you can use a query engine (like DataFusion!) to do so
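The division of labor described above can be sketched in plain Rust. All names below are illustrative, not parquet-rs APIs: the library exposes per-row-group min/max statistics, and the caller (or a query engine) evaluates the predicate against them to decide which row groups to read.

```rust
/// Hypothetical min/max statistics for one column of one row group
/// (in parquet-rs these would come from the file's metadata).
struct RowGroupStats {
    min: i64,
    max: i64,
}

/// Keep only the row groups whose [min, max] range could contain a row
/// matching the predicate `value == target`. Evaluating the predicate
/// is the part the caller implements; the library just supplies stats.
fn prune_row_groups(stats: &[RowGroupStats], target: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| s.min <= target && target <= s.max)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let stats = [
        RowGroupStats { min: 0, max: 10 },
        RowGroupStats { min: 20, max: 30 },
        RowGroupStats { min: 5, max: 50 },
    ];
    // Only row groups 1 and 2 could contain the value 25.
    assert_eq!(prune_row_groups(&stats, 25), vec![1, 2]);
}
```

The surviving indices would then be handed back to the reader (in parquet-rs, via the reader builder's row-group selection) so the skipped row groups are never decoded.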

Collaborator (@alamb)

BTW I wonder if we should propose adding a row for "predicate pushdown" (aka evaluating predicates during scans) -- basically what https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/trait.ArrowPredicate.html provides

@tustvold

I believe "late materialization" is the precise technique, and I agree it would be good to add a row for this
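A rough illustration of that idea in plain Rust (invented names, not the ArrowPredicate API): the cheap predicate column is decoded and evaluated first, and the more expensive columns are only materialized for the rows that pass.

```rust
fn main() {
    // Column the predicate reads (cheap to decode first).
    let predicate_col = [3_i64, 42, 7, 42];
    // Column we would rather not decode for filtered-out rows.
    let payload_col = ["a", "b", "c", "d"];

    // 1. Evaluate the predicate on the predicate column alone,
    //    producing a row-selection mask.
    let mask: Vec<bool> = predicate_col.iter().map(|&v| v == 42).collect();

    // 2. Materialize the payload column only for the selected rows.
    let selected: Vec<&str> = payload_col
        .iter()
        .zip(&mask)
        .filter(|(_, &keep)| keep)
        .map(|(&s, _)| s)
        .collect();

    assert_eq!(selected, vec!["b", "d"]);
}
```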

| PLAIN_DICTIONARY ||| | ||
| RLE_DICTIONARY ||| | ||
| RLE ||| | ||
| BIT_PACKED (deprecated) ||| | | (R) |


I could be mistaken, but I thought we supported this, just refused to write it


If I remember correctly it only worked for bitwidth = 0, and the C++ and Go implementations also had issues with it: apache/arrow-rs#5338

Contributor Author

It seems to me that we can read BIT_PACKED level data, but not page data. None of the type-specific implementations of get_decoder support it, nor does get_decoder_default.


It's important to remember that boolean data is bit-packed, but for that type it is the PLAIN encoding. The documentation says:

Note that the BIT_PACKED encoding method is only supported for encoding repetition and definition levels.

Contributor Author

I could change this to R(*) with a note that decoding is only supported for level data.

From what I have seen, only the Java code implements BIT_PACKED levels according to the specification. C++, Rust, Go, and it seems also cuDF, read/write bits in a different order than specified and are thus not interoperable with files using that encoding. This isn't an issue in practice since the encoding is deprecated, and the Java version does not seem to provide a public API for enabling it when writing. If I remember correctly, Java might use the encoding if the bitwidth is 0; in that case the different bit order does not matter since no actual data is written.

For compatibility reasons, this implementation packs values from the most significant bit to the least significant bit, which is not the same as the RLE/bit-packing hybrid.

All implementations I have seen, except Java, reuse the same bitpacking code for BIT_PACKED and RLE, which makes them read/write incorrect levels. So marking it as unsupported, with a footnote "except for bitwidth = 0", would be the best solution.
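The bit-order mismatch can be shown with a small, self-contained Rust sketch (illustrative functions, not code from any of the implementations): packing the same 1-bit levels MSB-first, as the BIT_PACKED spec text quoted above describes, versus LSB-first, as the RLE/bit-packing hybrid does, produces different bytes.

```rust
/// Pack 1-bit values MSB-first within each byte, the order the
/// deprecated BIT_PACKED encoding specifies.
fn pack_msb_first(values: &[u8]) -> Vec<u8> {
    let mut out = vec![0u8; (values.len() + 7) / 8];
    for (i, &v) in values.iter().enumerate() {
        out[i / 8] |= (v & 1) << (7 - (i % 8));
    }
    out
}

/// Pack 1-bit values LSB-first within each byte, the order used by the
/// RLE/bit-packing hybrid (and thus by implementations that reuse its
/// bitpacking code for BIT_PACKED).
fn pack_lsb_first(values: &[u8]) -> Vec<u8> {
    let mut out = vec![0u8; (values.len() + 7) / 8];
    for (i, &v) in values.iter().enumerate() {
        out[i / 8] |= (v & 1) << (i % 8);
    }
    out
}

fn main() {
    let levels = [1u8, 1, 0, 0, 0, 0, 0, 0];
    // Same levels, different byte per bit order, so the two orders are
    // not interoperable (except when bitwidth = 0 and no bytes are
    // written at all).
    assert_eq!(pack_msb_first(&levels), vec![0b1100_0000]);
    assert_eq!(pack_lsb_first(&levels), vec![0b0000_0011]);
}
```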

Contributor Author

If I remember correctly, Java might use the encoding if the bitwidth is 0, in that case the different bit order does not matter since there is no actual data is written.

Yes, I recall seeing files that indicated BIT_PACKED encoding was used, but as you said it was for max level == 0 (thus bitwidth == 0), so in other words, no data is actually present/encoded.

I'll add a note to that effect. Thanks!


alamb commented Feb 13, 2025

I wonder if it would be better to transpose the matrix here so new implementations add a few rows rather than having to wade through so many lines of diffs.

cc @alamb @tustvold

I think the side-by-side visual representation is important, but the diff is definitely a pain.

Maybe we could store the actual status of each implementation in separate files (json, yaml?) and then render the table automatically.

@alamb left a comment

Thanks @etseidl -- I reviewed this and it looks great to me

Thank you for filling this out

| Statistics min_value, max_value ||| | ||
| Page index ||| | ||
| Page CRC32 checksum ||| | ||
| Modular encryption ||| | ||
Contributor Author

Have to remember to update this cell when apache/arrow-rs#6637 is merged.


wgtmac commented Feb 18, 2025

Thanks for updating it! Please let me know when ready to merge.


etseidl commented Feb 18, 2025

Thanks for updating it! Please let me know when ready to merge.

Thanks @wgtmac. I think it's good to go as far as Rust is concerned. Any fine tuning can be done in a follow up.

@wgtmac merged commit 76b2f6b into apache:production Feb 19, 2025
1 check passed

wgtmac commented Feb 19, 2025

Merged! Thanks all!
