-
Notifications
You must be signed in to change notification settings - Fork 1k
[thrift-remodel] Incorporate changes made to geospatial statistics #8528
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[thrift-remodel] Incorporate changes made to geospatial statistics #8528
Conversation
cc @paleolimbot |
parquet/src/basic.rs
Outdated
} | ||
18 => { | ||
let val = GeographyType::read_thrift(&mut *prot)?; | ||
let algorithm = val.algorithm.unwrap_or_default(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change gives me the most pause. If the value is not set in the thrift, do we really want to set a default or leave it unset and handle that downstream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind either way...the interpretation of "the default" is part of the Parquet standard (if unset in Thrift, the interpretation is spherical), my thought was that this would make it so that others don't have to read the spec in order to do the right thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, guess I should have read the spec 😅. I'll leave this as is then. @alamb do you have an opinion here? (relevent section of the spec is here).
I agree this seems pretty clear cut:
If unset, the algorithm defaults to SPHERICAL.
Maybe we could change this to explicitly name SPHERICAL and reference the spec, something like
let algorithm = val.algorithm.unwrap_or_default(); | |
let algorithm = val.algorithm | |
// unset algorithm means spherical, per the spec: | |
// https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#geography | |
.unwrap_or(EdgeInterpolationAlgorithm::Spherical) |
parquet/src/basic.rs
Outdated
// TODO(ets): we need to allow for unknown variants. Either hand code this one, or add a new | ||
// macro that adds an _Unknown variant. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I'm assuming that an unknown algorithm will result in ignoring the stats, so should not be fatal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The algorithm is taken into account by the writer when writing the stats...statistics for Geography are in theory safe to use for pruning even if the algorithm is unrecognized (although it's difficult to imagine a situation where this would occur except a corrupted file). I put UNKNOWN
in the other PR because I couldn't make the From<>
implementation infallible without it but perhaps you don't have that constraint here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If stats are robust to unknown algorithms, then I think we should add an _Unknown
variant here so older readers can still handle newer files. I'll make that change soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this!
parquet/src/basic.rs
Outdated
} | ||
18 => { | ||
let val = GeographyType::read_thrift(&mut *prot)?; | ||
let algorithm = val.algorithm.unwrap_or_default(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind either way...the interpretation of "the default" is part of the Parquet standard (if unset in Thrift, the interpretation is spherical), my thought was that this would make it so that others don't have to read the spec in order to do the right thing.
parquet/src/basic.rs
Outdated
// TODO(ets): we need to allow for unknown variants. Either hand code this one, or add a new | ||
// macro that adds an _Unknown variant. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The algorithm is taken into account by the writer when writing the stats...statistics for Geography are in theory safe to use for pruning even if the algorithm is unrecognized (although it's difficult to imagine a situation where this would occur except a corrupted file). I put UNKNOWN
in the other PR because I couldn't make the From<>
implementation infallible without it but perhaps you don't have that constraint here.
make default to SPHERICAL explicit
format!("GEOGRAPHY({crs:?},{algorithm:?})") | ||
let algorithm = algorithm.unwrap_or_default(); | ||
if let Some(crs) = crs { | ||
format!("GEOGRAPHY({algorithm}, {crs})") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just pointing out that the order of formatting was switched from crs-algorithm to algorithm-crs. I think this is a beneficial change because the CRS could be a long string.
@etseidl You may want to rebase to latest |
Thanks @martin-g. Yes, the long chain of commits is an artifact of how I've been managing this feature branch. I've been branching off of branches so I can work ahead while awaiting reviews. A git rebase at this point would difficult. I could apply the diffs to a clean check out of the feature branch, but as this is likely the last PR on this branch I don't see the need. It all gets squashed in the end 😅 |
Thanks all. I'll merge this now so it's included in #8530. |
Which issue does this PR close?
Note: this targets a feature branch, not main
Rationale for this change
This brings over changes to handling of geo-spatial statistics introduced by @paleolimbot in #8520.
What changes are included in this PR?
Primarily adds documentation and tests to changes already made. The only significant change is adding a
Default
implementation forEdgeInterpolationAlgorithm
.Still TODO is to decide whether to allow unknown algorithms.
Are these changes tested?
Yes
Are there any user-facing changes?
Yes