Conversation

Member

@paleolimbot paleolimbot commented Oct 1, 2025

Which issue does this PR close?

Rationale for this change

One of the primary reasons the GeoParquet community was excited about first-class Parquet Geometry/Geography support was the built-in column chunk statistics (we had a workaround that involved adding a struct column, but it was difficult for non-spatial readers to use it and very difficult for non-spatial writers to write it). This PR ensures it is possible for arrow-rs to write files that include those statistics.

What changes are included in this PR?

This PR inserts the minimum required change to enable this support.

Are these changes tested?

Yes!

Are there any user-facing changes?

There are several new functions (which include documentation). Previously it was difficult or impossible to actually write Geometry or Geography logical types, and so it is unlikely any previous usage would be affected.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 1, 2025
Member Author

@paleolimbot paleolimbot left a comment

It works!

@alamb @etseidl I'm aware this would need some tests/improved documentation at a lower level; however, I'd love some feedback on the approach before I go through and clean this up more thoroughly (whenever time allows!)

Contributor

@etseidl etseidl left a comment

Thanks @paleolimbot, looks pretty good on a first pass. I just want to make sure that the size statistics are written properly when geo stats are enabled.

Comment on lines 166 to 168
if let Some(var_bytes) = T::T::variable_length_bytes(slice) {
    *self.variable_length_bytes.get_or_insert(0) += var_bytes;
}
Contributor

I think this should execute regardless of whether geo stats are enabled. The variable_length_bytes are ultimately written to the SizeStatistics which are useful even without min/max statistics.

Member Author

Done!
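
A minimal sketch of the point raised above, assuming a simplified encoder state: the byte count that feeds the size statistics is updated on every write, while the geospatial accumulator is only touched when one exists. `EncoderState`, `with_geo_stats`, `write_slice`, and `geo_values_seen` are hypothetical names for illustration and are not the arrow-rs types.

```rust
/// Hypothetical, simplified stand-in for the column value encoder state.
#[derive(Default)]
struct EncoderState {
    /// Running total of variable-length bytes, destined for the size statistics.
    variable_length_bytes: Option<i64>,
    /// Number of values seen by the optional geospatial accumulator
    /// (`None` means geo stats are not being collected).
    geo_values_seen: Option<usize>,
}

impl EncoderState {
    /// Create a state that also collects (toy) geospatial statistics.
    fn with_geo_stats() -> Self {
        Self {
            variable_length_bytes: None,
            geo_values_seen: Some(0),
        }
    }

    /// Record a batch of byte-array values.
    fn write_slice(&mut self, slice: &[Vec<u8>]) {
        // Geo stats are optional: only update them when an accumulator exists.
        if let Some(seen) = self.geo_values_seen.as_mut() {
            *seen += slice.len();
        }

        // Size statistics are updated unconditionally, whether or not
        // geo stats (or min/max stats) are being collected.
        let var_bytes: i64 = slice.iter().map(|v| v.len() as i64).sum();
        *self.variable_length_bytes.get_or_insert(0) += var_bytes;
    }
}
```

With this shape, enabling or disabling geo stats cannot change the reported byte count.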

drop(file_writer);

// Check that statistics exist in thrift output
thrift_metadata.row_groups[0].columns[0]
Contributor

Heads up that when the thrift stuff merges this will no longer be a format::FileMetaData but file::metadata::ParquetMetaData.

Member Author

Got it! I removed these assertions so that they won't break when the thrift stuff merges (although there will be a few logical type constructors that will need to be updated).

@paleolimbot
Member Author

Thank you for the review! I will clean this up on Monday and add a few more tests.

Member Author

@paleolimbot paleolimbot left a comment

Ok! I think this is ready for review!

Comment on lines +186 to +214
#[test]
fn test_roundtrip_statistics_geospatial() {
    let path = format!(
        "{}/geospatial/geospatial.parquet",
        arrow::util::test_util::parquet_test_data(),
    );

    test_roundtrip_statistics(&path, 2);
}

#[test]
fn test_roundtrip_geospatial_with_nan() {
    let path = format!(
        "{}/geospatial/geospatial-with-nan.parquet",
        arrow::util::test_util::parquet_test_data(),
    );

    test_roundtrip_statistics(&path, 0);
}

#[test]
fn test_roundtrip_statistics_crs() {
    let path = format!(
        "{}/geospatial/crs-default.parquet",
        arrow::util::test_util::parquet_test_data(),
    );

    test_roundtrip_statistics(&path, 0);
}
Member Author

These are the main tests that I was personally using to ensure that this implementation matched the one in Arrow C++...they rewrite the geometry columns for the test files and ensure the statistics are identical.

Comment on lines 27 to 32
/// Create a new [GeoStatsAccumulator] instance
pub fn new_geo_stats_accumulator(descr: &ColumnDescPtr) -> Box<dyn GeoStatsAccumulator> {
    ACCUMULATOR_FACTORY
        .get_or_init(|| Arc::new(DefaultGeoStatsAccumulatorFactory::default()))
        .new_accumulator(descr)
}
Member Author

This is the only part that is used outside this file...my attempt to consolidate any implementation detail we have here with respect to whether this crate was or wasn't built with geospatial support.

Comment on lines 34 to 50
/// Initialize the global [GeoStatsAccumulatorFactory]
///
/// This may only be done once before any calls to [new_geo_stats_accumulator].
/// Clients may use this to implement support for builds of the Parquet crate without
/// geospatial support or to implement support for Geography bounding using external
/// dependencies.
pub fn init_geo_stats_accumulator_factory(
    factory: Arc<dyn GeoStatsAccumulatorFactory>,
) -> Result<(), ParquetError> {
    if ACCUMULATOR_FACTORY.set(factory).is_err() {
        Err(ParquetError::General(
            "Global GeoStatsAccumulatorFactory already set".to_string(),
        ))
    } else {
        Ok(())
    }
}
Member Author

I can take this out if it's too much...this is what I would need in SedonaDB to write files with geospatial stats for Geometry and Geography types (I am not sure if we can enable the geospatial feature on Parquet two levels of dependency deep). We also have the C++ dependencies there to write stats for Geography...while I'd love to rewrite that in Rust and put it in parquet-geospatial, I don't have time to do that today and the C++ dependency to do that (s2geometry) is kind of insane to build inside of a Rust crate.
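
For readers unfamiliar with the set-once pattern discussed here, the following is a minimal, self-contained sketch using `std::sync::OnceLock`. The traits and types (`Accumulator`, `AccumulatorFactory`, `CountingAccumulator`, `DefaultFactory`) and the functions `init_factory` and `new_accumulator` are simplified, hypothetical stand-ins rather than the crate's actual definitions; the sketch only shows how a downstream crate could install its own factory before first use, with a default used otherwise.

```rust
use std::sync::{Arc, OnceLock};

/// Hypothetical, simplified accumulator: it only counts the values it has seen.
trait Accumulator {
    fn update(&mut self, wkb: &[u8]);
    fn count(&self) -> usize;
}

/// Hypothetical factory trait standing in for a geo stats accumulator factory.
trait AccumulatorFactory: Send + Sync {
    fn new_accumulator(&self) -> Box<dyn Accumulator>;
}

#[derive(Default)]
struct CountingAccumulator(usize);

impl Accumulator for CountingAccumulator {
    fn update(&mut self, _wkb: &[u8]) {
        self.0 += 1;
    }

    fn count(&self) -> usize {
        self.0
    }
}

#[derive(Default)]
struct DefaultFactory;

impl AccumulatorFactory for DefaultFactory {
    fn new_accumulator(&self) -> Box<dyn Accumulator> {
        Box::new(CountingAccumulator::default())
    }
}

/// Process-wide factory; it can be set at most once.
static FACTORY: OnceLock<Arc<dyn AccumulatorFactory>> = OnceLock::new();

/// Install a custom factory. Fails if a factory was already set, including
/// the case where the default was installed by an earlier `new_accumulator` call.
fn init_factory(factory: Arc<dyn AccumulatorFactory>) -> Result<(), String> {
    FACTORY
        .set(factory)
        .map_err(|_| "global factory already set".to_string())
}

/// Create an accumulator, falling back to the default factory on first use.
fn new_accumulator() -> Box<dyn Accumulator> {
    FACTORY
        .get_or_init(|| Arc::new(DefaultFactory) as Arc<dyn AccumulatorFactory>)
        .new_accumulator()
}
```

One consequence of this design is ordering sensitivity: a custom factory has to be installed before the first accumulator is created, otherwise the default wins and later initialization returns an error.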

Comment on lines +277 to +284
#[cfg(feature = "geospatial")]
#[test]
fn test_geometry_accumulator() {
    use parquet_geospatial::testing::{wkb_point_xy, wkb_point_xyzm};

    use crate::geospatial::bounding_box::BoundingBox;

    let mut accumulator = ParquetGeoStatsAccumulator::default();
Member Author

These cases are also tested by way of ensuring our rewrite of the test files matches, but this test is more explicit and easier to debug.

Comment on lines 126 to 129

/// Computes [GeospatialStatistics], if any, and resets internal state such that any internal
/// accumulator is prepared to accumulate statistics for the next column chunk.
fn flush_geospatial_statistics(&mut self) -> Option<Box<GeospatialStatistics>>;
Member Author

This is the lowest impact place I could find to insert the GeospatialStatistics calculation. The ColumnMetrics are maybe a better fit but would require passing a reference to something through the various write methods.
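
The "compute and reset" contract in the doc comment above can be sketched as follows, assuming a toy 2D bounding box. `Bounds`, `BoundsAccumulator`, `update_xy`, and `flush` are hypothetical names; the real `GeospatialStatistics` carries more than this. The key idea is that `Option::take` returns the accumulated value and leaves the accumulator empty for the next column chunk.

```rust
/// Hypothetical, simplified 2D bounds used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Bounds {
    xmin: f64,
    ymin: f64,
    xmax: f64,
    ymax: f64,
}

/// Accumulates bounds for one column chunk at a time.
#[derive(Default)]
struct BoundsAccumulator {
    current: Option<Bounds>,
}

impl BoundsAccumulator {
    /// Expand the current bounds to include the coordinate (x, y).
    fn update_xy(&mut self, x: f64, y: f64) {
        let b = self.current.get_or_insert(Bounds {
            xmin: x,
            ymin: y,
            xmax: x,
            ymax: y,
        });
        b.xmin = b.xmin.min(x);
        b.ymin = b.ymin.min(y);
        b.xmax = b.xmax.max(x);
        b.ymax = b.ymax.max(y);
    }

    /// Return the statistics gathered so far, if any, and reset the internal
    /// state so the accumulator is ready for the next column chunk.
    fn flush(&mut self) -> Option<Box<Bounds>> {
        self.current.take().map(Box::new)
    }
}
```

Flushing twice in a row yields `None` the second time, which is what lets a writer call it once per column chunk without carrying state across chunks.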

Comment on lines 503 to 513
/// Explicitly specify the Parquet schema to be used
///
/// If omitted (the default), the [ArrowSchemaConverter] is used to compute the
/// Parquet [SchemaDescriptor]. This may be used when the [SchemaDescriptor] is
/// already known or must be calculated using custom logic.
pub fn with_parquet_schema(self, schema_descr: SchemaDescriptor) -> Self {
    Self {
        schema_descr: Some(schema_descr),
        ..self
    }
}
Member Author

I need this to test the Arrow ByteArrayEncoder, but it is also what somebody would need to write Geometry/Geography types generally.
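
As a rough illustration of the "explicit value overrides the derived default" behaviour that such a builder option implies, here is a self-contained sketch. `WriterOptions`, `with_schema_override`, and `resolve_schema` are hypothetical names, and a `String` stands in for the schema; none of this is the arrow-rs API.

```rust
/// Hypothetical options struct; a `String` stands in for a schema descriptor.
#[derive(Debug, Default, Clone, PartialEq)]
struct WriterOptions {
    skip_metadata: bool,
    schema_override: Option<String>,
}

impl WriterOptions {
    /// Builder-style setter: consumes `self` and keeps all other fields as-is.
    fn with_schema_override(self, schema: String) -> Self {
        Self {
            schema_override: Some(schema),
            ..self
        }
    }

    /// Use the explicit schema if one was provided; otherwise compute it
    /// with the supplied conversion closure.
    fn resolve_schema(&self, derive: impl FnOnce() -> String) -> String {
        self.schema_override.clone().unwrap_or_else(derive)
    }
}
```

The closure is only invoked when no override was set, so callers that already know their schema skip the conversion step entirely.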

@paleolimbot paleolimbot marked this pull request as ready for review October 7, 2025 21:30
@paleolimbot paleolimbot requested a review from kylebarron October 7, 2025 21:49
Contributor

@etseidl etseidl left a comment

Second pass. All style, no substance. I'll try to do a deeper dive tomorrow.

Comment on lines 453 to 460
let geo_stats_accumulator = if matches!(
    descr.logical_type(),
    Some(LogicalType::Geometry) | Some(LogicalType::Geography)
) {
    Some(new_geo_stats_accumulator(descr))
} else {
    None
};
Contributor

I see this pattern twice. I'm wondering if new_geo_stats_accumulator could be try_new_geo_stats_accumulator and perform the check on logical type internally, returning None if not Geometry or Geography.
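
The suggestion above could look roughly like the sketch below, assuming a trimmed-down `LogicalType` enum and an empty `Accumulator` type (both hypothetical, for illustration only). The logical type check moves inside the constructor, so each call site becomes a single expression.

```rust
/// Hypothetical, trimmed-down logical type enum for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum LogicalType {
    String,
    Geometry,
    Geography,
}

/// Hypothetical placeholder for a geo stats accumulator.
#[derive(Debug, Default)]
struct Accumulator;

/// Return an accumulator only for Geometry or Geography columns.
fn try_new_accumulator(logical_type: Option<&LogicalType>) -> Option<Accumulator> {
    match logical_type {
        Some(LogicalType::Geometry | LogicalType::Geography) => Some(Accumulator),
        _ => None,
    }
}
```

Callers then write `let geo_stats_accumulator = try_new_accumulator(...)` without repeating the `matches!` check.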

@etseidl
Contributor

etseidl commented Oct 8, 2025

@paleolimbot I took a stab at resolving the merge conflicts. They are mostly trivial, but I wasn't sure how to resolve the tests. I'll leave that up to you 😄.

@alamb
Contributor

alamb commented Oct 8, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing spatial-stats-write (f9112f7) to d5df352 diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench arrow_writer
BENCH_FILTER=
BENCH_BRANCH_NAME=spatial-stats-write
Results will be posted here when complete

Contributor

@alamb alamb left a comment

Thank you @paleolimbot and @etseidl

I reviewed this PR for test coverage and structure, and from my perspective it is good to go. I had a few minor comments and suggestions, but nothing that I think would prevent merging.

}
}

/// Explicitly specify the Parquet schema to be used
Contributor

this is a nice API addition I think

/// ```
#[derive(Clone, Debug, PartialEq, Default)]
pub struct GeospatialStatistics {
/// Optional bounding defining the spatial extent, where None represents a lack of information.
Contributor

I wonder why remove these comments?

Member Author

I moved them to the accessor methods in a previous change...I'm not sure why they're showing up in this diff. My theory was that they'd be more likely to be read there but I don't mind copying them back.

fn new_accumulator(&self, descr: &ColumnDescPtr) -> Box<dyn GeoStatsAccumulator>;
}

/// Dynamic [`GeospatialStatistics`] accumulator
Contributor

this is a nice API for optional statistics encoding

# Enable parquet variant support
variant_experimental = ["arrow", "parquet-variant", "parquet-variant-json", "parquet-variant-compute"]
# Enable geospatial support
geospatial = ["parquet-geospatial"]
Contributor

Could you please also add the new feature flag to the main crate readme as well?

https://github.com/apache/arrow-rs/blob/main/parquet/README.md#feature-flags

@alamb
Contributor

alamb commented Oct 8, 2025

🤖: Benchmark completed

Details

group                                     main                                   spatial-stats-write
-----                                     ----                                   -------------------
bool/bloom_filter                         1.00    129.9±0.83µs     8.2 MB/sec    1.01    130.8±0.41µs     8.1 MB/sec
bool/default                              1.00     53.2±0.18µs    19.9 MB/sec    1.03     54.8±0.15µs    19.4 MB/sec
bool/parquet_2                            1.00     68.2±0.15µs    15.6 MB/sec    1.05     71.5±0.19µs    14.8 MB/sec
bool/zstd                                 1.00     64.0±0.25µs    16.6 MB/sec    1.02     65.2±0.32µs    16.3 MB/sec
bool/zstd_parquet_2                       1.00     78.6±0.36µs    13.5 MB/sec    1.04     82.0±0.57µs    12.9 MB/sec
bool_non_null/bloom_filter                1.01    106.1±0.39µs     5.4 MB/sec    1.00    105.3±0.60µs     5.4 MB/sec
bool_non_null/default                     1.00     19.8±0.06µs    28.8 MB/sec    1.00     19.9±0.38µs    28.7 MB/sec
bool_non_null/parquet_2                   1.00     37.8±0.49µs    15.1 MB/sec    1.01     38.1±0.13µs    15.0 MB/sec
bool_non_null/zstd                        1.01     28.6±0.18µs    20.0 MB/sec    1.00     28.4±0.14µs    20.1 MB/sec
bool_non_null/zstd_parquet_2              1.00     47.2±0.32µs    12.1 MB/sec    1.01     47.9±0.32µs    12.0 MB/sec
float_with_nans/bloom_filter              1.00   940.8±12.22µs    58.4 MB/sec    1.00    943.2±6.94µs    58.3 MB/sec
float_with_nans/default                   1.00    572.4±2.04µs    96.0 MB/sec    1.00    573.9±4.01µs    95.8 MB/sec
float_with_nans/parquet_2                 1.00    825.7±3.25µs    66.6 MB/sec    1.00    825.7±3.48µs    66.6 MB/sec
float_with_nans/zstd                      1.01    752.6±8.88µs    73.0 MB/sec    1.00    746.3±1.57µs    73.6 MB/sec
float_with_nans/zstd_parquet_2            1.00  1005.0±13.88µs    54.7 MB/sec    1.00   1001.6±4.21µs    54.9 MB/sec
list_primitive/bloom_filter               1.00      4.0±0.06ms   535.5 MB/sec    1.13      4.5±0.17ms   472.7 MB/sec
list_primitive/default                    1.00  1745.4±14.39µs  1221.4 MB/sec    1.06  1846.9±15.71µs  1154.3 MB/sec
list_primitive/parquet_2                  1.00      2.3±0.01ms   908.3 MB/sec    1.14      2.7±0.04ms   794.4 MB/sec
list_primitive/zstd                       1.00      4.2±0.04ms   511.7 MB/sec    1.10      4.6±0.09ms   464.3 MB/sec
list_primitive/zstd_parquet_2             1.00      4.2±0.04ms   510.1 MB/sec    1.04      4.4±0.11ms   489.8 MB/sec
list_primitive_non_null/bloom_filter      1.00      4.8±0.08ms   438.9 MB/sec    1.00      4.9±0.12ms   438.3 MB/sec
list_primitive_non_null/default           1.00  1836.3±11.47µs  1158.5 MB/sec    1.01  1848.6±12.75µs  1150.8 MB/sec
list_primitive_non_null/parquet_2         1.12      3.3±0.04ms   639.9 MB/sec    1.00      3.0±0.02ms   718.3 MB/sec
list_primitive_non_null/zstd              1.00      5.5±0.05ms   385.9 MB/sec    1.00      5.5±0.04ms   387.7 MB/sec
list_primitive_non_null/zstd_parquet_2    1.00      5.9±0.07ms   360.4 MB/sec    1.00      5.9±0.07ms   361.7 MB/sec
primitive/bloom_filter                    1.00      4.3±0.11ms    41.1 MB/sec    1.01      4.3±0.11ms    40.9 MB/sec
primitive/default                         1.00    849.6±2.82µs   207.1 MB/sec    1.03    875.5±7.96µs   200.9 MB/sec
primitive/parquet_2                       1.00   1008.6±5.47µs   174.4 MB/sec    1.02   1033.8±4.40µs   170.2 MB/sec
primitive/zstd                            1.00   1152.6±7.40µs   152.6 MB/sec    1.02   1172.5±6.20µs   150.0 MB/sec
primitive/zstd_parquet_2                  1.00   1343.2±5.55µs   131.0 MB/sec    1.02   1370.7±7.62µs   128.4 MB/sec
primitive_non_null/bloom_filter           1.00      4.3±0.15ms    39.9 MB/sec    1.01      4.4±0.17ms    39.7 MB/sec
primitive_non_null/default                1.00    720.4±3.33µs   239.5 MB/sec    1.01    724.3±3.90µs   238.2 MB/sec
primitive_non_null/parquet_2              1.00   867.1±35.25µs   199.0 MB/sec    1.02    888.2±6.67µs   194.2 MB/sec
primitive_non_null/zstd                   1.00   1000.5±6.52µs   172.4 MB/sec    1.01   1007.3±4.85µs   171.3 MB/sec
primitive_non_null/zstd_parquet_2         1.00   1301.0±7.07µs   132.6 MB/sec    1.01   1311.5±6.67µs   131.5 MB/sec
string/bloom_filter                       1.01      2.4±0.02ms   837.8 MB/sec    1.00      2.4±0.04ms   845.3 MB/sec
string/default                            1.01    776.7±4.10µs     2.6 GB/sec    1.00    772.2±7.42µs     2.6 GB/sec
string/parquet_2                          1.00   1306.4±9.46µs  1567.7 MB/sec    1.00  1308.1±14.37µs  1565.7 MB/sec
string/zstd                               1.00      3.4±0.03ms   594.3 MB/sec    1.00      3.4±0.02ms   596.1 MB/sec
string/zstd_parquet_2                     1.01      3.7±0.04ms   550.3 MB/sec    1.00      3.7±0.03ms   554.7 MB/sec
string_and_binary_view/bloom_filter       1.00    591.3±5.68µs   213.4 MB/sec    1.02    600.4±6.46µs   210.2 MB/sec
string_and_binary_view/default            1.00    351.0±1.94µs   359.5 MB/sec    1.02    356.6±0.91µs   353.9 MB/sec
string_and_binary_view/parquet_2          1.00    383.7±1.66µs   328.9 MB/sec    1.04    398.3±2.63µs   316.9 MB/sec
string_and_binary_view/zstd               1.00    605.6±2.30µs   208.4 MB/sec    1.01    610.5±2.72µs   206.7 MB/sec
string_and_binary_view/zstd_parquet_2     1.00    743.1±2.49µs   169.8 MB/sec    1.00    742.3±4.29µs   170.0 MB/sec
string_dictionary/bloom_filter            1.01    624.0±2.95µs  1653.9 MB/sec    1.00    620.2±6.60µs  1663.9 MB/sec
string_dictionary/default                 1.02    395.7±3.05µs     2.5 GB/sec    1.00    388.6±2.24µs     2.6 GB/sec
string_dictionary/parquet_2               1.01    392.9±1.34µs     2.6 GB/sec    1.00    387.9±3.10µs     2.6 GB/sec
string_dictionary/zstd                    1.01   1163.9±7.26µs   886.7 MB/sec    1.00   1156.6±3.29µs   892.3 MB/sec
string_dictionary/zstd_parquet_2          1.01  1974.5±16.35µs   522.7 MB/sec    1.00  1963.2±18.26µs   525.7 MB/sec
string_non_null/bloom_filter              1.00      3.1±0.04ms   651.7 MB/sec    1.00      3.1±0.05ms   653.9 MB/sec
string_non_null/default                   1.01  1139.8±10.58µs  1796.0 MB/sec    1.00  1127.9±13.53µs  1815.0 MB/sec
string_non_null/parquet_2                 1.02  1885.6±19.25µs  1085.6 MB/sec    1.00  1855.6±19.42µs  1103.2 MB/sec
string_non_null/zstd                      1.00      3.2±0.03ms   633.9 MB/sec    1.00      3.2±0.03ms   636.9 MB/sec
string_non_null/zstd_parquet_2            1.02      5.0±0.09ms   407.2 MB/sec    1.00      4.9±0.06ms   415.1 MB/sec

Contributor

@etseidl etseidl left a comment

Looks good to me, thanks @paleolimbot.

One question I have concerns the column chunk Statistics and the column index. Am I correct that if geo stats are written, the column chunk stats should be None? And should the column index for such a column also be None? If so, could you add a test that verifies this? 🙏 Could be in a later PR.

//! Testing utilities for geospatial Parquet types
/// Build well-known binary representing a point with the given XY coordinate
pub fn wkb_point_xy(x: f64, y: f64) -> Vec<u8> {
Contributor

Will these eventually be used or are they intended to help users?

Member Author

These are used in the parquet/tests/geospatial.rs and parquet/src/geospatial/accumulator.rs tests!
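
For context on what a helper like this produces, here is a minimal sketch of encoding a 2D point as little-endian well-known binary: one byte-order byte, a 4-byte geometry type code (1 for Point), then the X and Y coordinates as 8-byte floats. The function name `point_xy_wkb` is hypothetical, and the sketch is not claimed to match the body of the crate's `wkb_point_xy`.

```rust
/// Build little-endian WKB for a 2D point (hypothetical helper for illustration).
fn point_xy_wkb(x: f64, y: f64) -> Vec<u8> {
    // 1 byte order + 4 bytes geometry type + 2 * 8 bytes of coordinates.
    let mut out = Vec::with_capacity(21);
    out.push(0x01); // byte order marker: 1 = little endian
    out.extend_from_slice(&1u32.to_le_bytes()); // geometry type 1 = Point (XY)
    out.extend_from_slice(&x.to_le_bytes());
    out.extend_from_slice(&y.to_le_bytes());
    out
}
```

Values built this way are convenient inputs for accumulator tests because the expected bounding box is known exactly from the coordinates passed in.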

Development

Successfully merging this pull request may close these issues.

Support writing GeospatialStatistics in Parquet writer
3 participants