-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Add size statistics to ParquetMetaData introduced in PARQUET-2261
#5486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from 1 commit
Commits
Show all changes
54 commits
Select commit
Hold shift + click to select a range
a7e41c3
regen thrift with size statistics added
etseidl 788eef3
first cut at adding page size statistics
etseidl 6296ada
add new stats to chunk metadata test
etseidl 84f3d7a
Merge branch 'apache:master' into size_stats
etseidl 0da05a8
fix escapes
etseidl 7301aeb
Merge remote-tracking branch 'origin/master' into size_stats
etseidl 6e5fece
format
etseidl 457eb4a
formatting
etseidl 18a5732
add escapes
etseidl 658512e
Merge remote-tracking branch 'origin/master' into size_stats
etseidl 81c2b2e
Merge remote-tracking branch 'origin/master' into size_stats
etseidl 29dde50
Merge branch 'size_stats' of github.com:etseidl/arrow-rs into size_stats
etseidl 84f8512
Merge remote-tracking branch 'origin/master' into size_stats
etseidl 9635e5e
add test of SizeStatistics.unencoded_byte_array_data_bytes
etseidl c5c07b6
test def histogram as well, rename test
etseidl 6dd160f
add an assert
etseidl 917b412
refactor and add test of def histogram with nulls
etseidl f8961a3
add test of repetition level histogram
etseidl 73fa099
revert changes to test_roundtrip
etseidl 00ca596
suggestion from review
etseidl 6acc500
add to documentation as suggested in review
etseidl 787e3e8
make histograms optional
etseidl 46851f4
add histograms to PageIndex
etseidl 4f8487b
use Vec::push()
etseidl 903b06b
formatting
etseidl fa89836
check size stats in read metadata
etseidl 2800cc7
check unencoded_byte_array_data_bytes is not set for int cols
etseidl 95a0535
rewrite test_byte_array_size_statistics() to not use test_roundtrip()
etseidl fc66a59
add unencoded_byte_array_data_bytes support in page index
etseidl 542570f
Merge remote-tracking branch 'origin/master' into size_stats
etseidl 7be97e5
update expected sizes to account for new stats
etseidl f5ab47b
only write SizeStatistics in ColumnMetaData if statistics are enabled
etseidl a008e9e
add a little documentation
etseidl 87ccec2
add ParquetOffsetIndex to avoid double read of OffsetIndex
etseidl 3eead30
cleanup
etseidl ddf40c3
use less verbose update of variable_length_bytes
etseidl 0ebb72f
add some documentation
etseidl 393aea1
update to latest thrift (as of 11 Jul 2024) from parquet-format
etseidl 1c12fb8
pass None for optional size statistics
etseidl 53cd5fa
escape HTML tags
etseidl 45f25a8
Merge remote-tracking branch 'origin/master' into size_stats
etseidl 98025cc
don't need to escape brackets in arrays
etseidl 7b59246
Merge remote-tracking branch 'github/update_parquet_thrift' into size…
etseidl 65096dd
use consistent naming
etseidl 08065ad
suggested doc changes
etseidl 1cbd4b7
more suggested doc changes
etseidl dce3513
use more asserts in tests
etseidl f661839
move histogram logic into PageMetrics and ColumnMetrics
etseidl 818a614
refactor some to reduce code duplication, finish docs
etseidl c391dec
account for new size statistics in heap size calculations
etseidl 4816a95
add histogram examples to docs
etseidl e2faf2d
Merge remote-tracking branch 'origin/master' into size_stats
etseidl d92ae20
add some fixmes
etseidl 69dd652
leave not to self
etseidl File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
refactor and add test of def histogram with nulls
- Loading branch information
commit 917b412a704910f79306663de547ab7f973957c9
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we please add some test that verifies that the metadata read from the file correctly interprets the new statistics?
It seems to me like
test_roundtriponly verifies a few fields of the metadata after it is read back inarrow-rs/parquet/src/file/writer.rs
Lines 1461 to 1482 in 73fa099
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, file metadata returned from
test_roundtripis created by the writer, not the reader. So I'm currently just testing that the thrift objects are properly formed before writing. The issue I have is that the reader doesn't currently save the new structures, and I don't want to add them on this pass since they won't be used at all. I can instead replicate some of the code inindex_reader.rsto create thrift objects from the file data and test that those match the objects created by the writer.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 it seems like it would be good to have the ability to read the structures as well (otherwise pure rust implementations can't take advantage of these fields )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree (and in fact I know people using DataFusion that want these stats...but are ok reading them directly from the files as a preprocessing step). I think adding the histograms to
Indexis easy enough, but currently theOffsetIndexis returned as a vector ofPageLocations. Are you ok with me adding an object that mirrors the thriftOffsetIndex?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think we should add add our own
SizeStatisticsstructsrc/file/size_statistics.rsthat mirrors the Thrift TSizeStatistics.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing we need to be careful about is not introducing any breaking API changes or else we can't merge this PR until August 2024 per https://github.com/apache/arrow-rs?tab=readme-ov-file#release-versioning-and-schedule
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Adding histograms to
Indexwon't be a breaking change as they'll be optional. I could add another method toindex_reader.rsto return a vector of unencoded byte array sizes. It's less efficient since we'd have to parse the offset index twice.ColumnChunkMetaDatacontains the chunk level histograms and byte array sizes already, just not as a separateSizeStatisticsobject.