Conversation

etseidl
Contributor

@etseidl etseidl commented Oct 1, 2025

Which issue does this PR close?

Note: this targets a feature branch, not main

Rationale for this change

This brings over changes to handling of geo-spatial statistics introduced by @paleolimbot in #8520.

What changes are included in this PR?

Primarily adds documentation and tests to changes already made. The only significant change is adding a Default implementation for EdgeInterpolationAlgorithm.

Still TODO is to decide whether to allow unknown algorithms.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes

@github-actions github-actions bot added the "parquet" (Changes to the parquet crate) label Oct 1, 2025
@etseidl
Contributor Author

etseidl commented Oct 1, 2025

cc @paleolimbot

@etseidl etseidl changed the title from "Merge geo spatial" to "[thrift-remodel] Incorporate changes made to geospatial statistics" Oct 1, 2025
}
18 => {
let val = GeographyType::read_thrift(&mut *prot)?;
let algorithm = val.algorithm.unwrap_or_default();
Contributor Author

This change gives me the most pause. If the value is not set in the thrift, do we really want to set a default or leave it unset and handle that downstream?

Member

I don't mind either way...the interpretation of "the default" is part of the Parquet standard (if unset in Thrift, the interpretation is spherical), my thought was that this would make it so that others don't have to read the spec in order to do the right thing.

Contributor Author

Ah, guess I should have read the spec 😅. I'll leave this as is then. @alamb do you have an opinion here? (relevant section of the spec is here).

Contributor

> Ah, guess I should have read the spec 😅. I'll leave this as is then. @alamb do you have an opinion here? (relevant section of the spec is here).

I agree this seems pretty clear cut:

If unset, the algorithm defaults to SPHERICAL.

Maybe we could change this to explicitly name SPHERICAL and reference the spec, something like

Suggested change
- let algorithm = val.algorithm.unwrap_or_default();
+ let algorithm = val.algorithm
+     // unset algorithm means spherical, per the spec:
+     // https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#geography
+     .unwrap_or(EdgeInterpolationAlgorithm::Spherical);
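
For readers following the thread, here is a minimal sketch of the two options being compared. It is not the crate's code: the enum is a cut-down stand-in with only two variants, and `GeographyType` and `resolve_algorithm` are hypothetical names chosen for illustration. It only assumes, as stated above, that `Default` for the algorithm is spherical.

```rust
/// Stand-in for the crate's enum; only two variants are shown here.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum EdgeInterpolationAlgorithm {
    /// The spec's interpretation when the Thrift field is unset.
    #[default]
    Spherical,
    Vincenty,
}

/// Hypothetical mirror of the decoded Thrift struct: both fields are optional.
pub struct GeographyType {
    pub crs: Option<String>,
    pub algorithm: Option<EdgeInterpolationAlgorithm>,
}

/// Resolve the algorithm, naming the spec default explicitly instead of
/// relying on the `Default` impl.
pub fn resolve_algorithm(val: &GeographyType) -> EdgeInterpolationAlgorithm {
    // unset algorithm means spherical, per the spec
    val.algorithm.unwrap_or(EdgeInterpolationAlgorithm::Spherical)
}
```

With `Default` defined as spherical, `unwrap_or_default()` and the explicit `unwrap_or(...)` return the same value; the explicit form just keeps the spec rule visible at the call site.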

Comment on lines 933 to 934
// TODO(ets): we need to allow for unknown variants. Either hand code this one, or add a new
// macro that adds an _Unknown variant.
Contributor Author

Here I'm assuming that an unknown algorithm will result in ignoring the stats, so should not be fatal.

Member

The algorithm is taken into account by the writer when writing the stats...statistics for Geography are in theory safe to use for pruning even if the algorithm is unrecognized (although it's difficult to imagine a situation where this would occur except a corrupted file). I put UNKNOWN in the other PR because I couldn't make the From<> implementation infallible without it but perhaps you don't have that constraint here.

Contributor Author

If stats are robust to unknown algorithms, then I think we should add an _Unknown variant here so older readers can still handle newer files. I'll make that change soon.
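
A rough sketch of what a hand-coded enum with an `_Unknown` variant could look like. This is an illustration, not the change that landed: the `From<i32>` impl and the payload carried by `_Unknown` are assumptions, and the numeric codes reflect my reading of the Thrift definition and should be checked against `parquet.thrift`.

```rust
/// Hypothetical hand-coded version of the enum with a catch-all variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EdgeInterpolationAlgorithm {
    Spherical,
    Vincenty,
    Thomas,
    Andoyer,
    Karney,
    /// Any value this reader does not recognize; keeps the raw Thrift code.
    _Unknown(i32),
}

impl From<i32> for EdgeInterpolationAlgorithm {
    /// Infallible conversion: unrecognized codes are preserved, not rejected.
    fn from(code: i32) -> Self {
        match code {
            // Assumed codes; verify against the Thrift enum definition.
            0 => Self::Spherical,
            1 => Self::Vincenty,
            2 => Self::Thomas,
            3 => Self::Andoyer,
            4 => Self::Karney,
            other => Self::_Unknown(other),
        }
    }
}
```

Because the conversion cannot fail, an older reader can still decode a file written with a newer algorithm and decide downstream what to do with it.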

Member

@paleolimbot paleolimbot left a comment

Thank you for this!

-    format!("GEOGRAPHY({crs:?},{algorithm:?})")
+    let algorithm = algorithm.unwrap_or_default();
+    if let Some(crs) = crs {
+        format!("GEOGRAPHY({algorithm}, {crs})")
Member

Just pointing out that the order of formatting was switched from crs-algorithm to algorithm-crs. I think this is a beneficial change because the CRS could be a long string.
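
To make the new ordering concrete, here is a small self-contained sketch. The `geography_to_string` helper, the upper-case `Display` output, and the no-CRS branch are all assumptions for illustration; only the `GEOGRAPHY({algorithm}, {crs})` shape and the `unwrap_or_default()` call come from the diff above.

```rust
use std::fmt;

/// Cut-down stand-in for the crate's enum.
#[derive(Debug, Clone, Copy, Default)]
pub enum EdgeInterpolationAlgorithm {
    #[default]
    Spherical,
    Vincenty,
}

impl fmt::Display for EdgeInterpolationAlgorithm {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Assumption: variants print as their upper-case spec names.
        let name = match self {
            Self::Spherical => "SPHERICAL",
            Self::Vincenty => "VINCENTY",
        };
        f.write_str(name)
    }
}

/// Hypothetical helper: algorithm first, then the (possibly long) CRS.
pub fn geography_to_string(
    crs: Option<&str>,
    algorithm: Option<EdgeInterpolationAlgorithm>,
) -> String {
    let algorithm = algorithm.unwrap_or_default();
    if let Some(crs) = crs {
        format!("GEOGRAPHY({algorithm}, {crs})")
    } else {
        // Assumption: with no CRS, only the algorithm is printed.
        format!("GEOGRAPHY({algorithm})")
    }
}
```

Putting the short, fixed-vocabulary algorithm first means a long CRS string can only ever trail the part a reader is most likely to scan for.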

@martin-g
Member

martin-g commented Oct 2, 2025

@etseidl You may want to rebase to latest main. There are 220 commits in this PR.

@etseidl
Contributor Author

etseidl commented Oct 2, 2025

Thanks @martin-g. Yes, the long chain of commits is an artifact of how I've been managing this feature branch. I've been branching off of branches so I can work ahead while awaiting reviews. A git rebase at this point would be difficult. I could apply the diffs to a clean checkout of the feature branch, but as this is likely the last PR on this branch I don't see the need. It all gets squashed in the end 😅

@etseidl
Contributor Author

etseidl commented Oct 2, 2025

Thanks all. I'll merge this now so it's included in #8530.

@etseidl etseidl merged commit a6d1d8e into apache:gh5854_thrift_remodel Oct 2, 2025
16 checks passed
@etseidl etseidl deleted the merge_geo_spatial branch October 2, 2025 16:42
