4 changes: 3 additions & 1 deletion Cargo.toml
@@ -36,7 +36,7 @@ members = [
"arrow-row",
"arrow-schema",
"arrow-select",
"arrow-string",
"arrow-string", "foo",
"parquet",
"parquet_derive",
"parquet_derive_test",
@@ -57,6 +57,8 @@ exclude = [
     # significantly changing how it is compiled within the workspace, causing the whole workspace to be compiled from
     # scratch this way, this is a stand-alone package that compiles independently of the others.
     "arrow-pyarrow-integration-testing",
+    # parquet integration testing likewise uses different flags
+    "parquet-integration-testing",
     # object_store is excluded because it follows a separate release cycle from the other arrow crates
     "object_store"
 ]
1 change: 1 addition & 0 deletions parquet-integration-testing/.gitignore
@@ -0,0 +1 @@
out
30 changes: 30 additions & 0 deletions parquet-integration-testing/Cargo.toml
@@ -0,0 +1,30 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

[package]
name = "python-integration-testing"
description = "Binaries used for testing parquet-rs compatibility (NOT published to crates.io)"
publish = false
edition = "2021"

[dependencies]
arrow = { path = "../arrow", features = ["prettyprint"] }
parquet = { path = "../parquet", features = ["arrow"]}
serde = "1.0.203"
serde_json = { version = "1.0", default-features = false, features = ["std"] }
pretty_assertions = "1.4.0"
32 changes: 32 additions & 0 deletions parquet-integration-testing/README.md
@@ -0,0 +1,32 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Apache Parquet Rust Integration Testing

The binary in this crate:

1. Reads files from the parquet-testing repo
2. Creates a JSON file with appropriately formatted contents
3. Compares these JSON files against "known good" golden master files

## Running

```shell
cargo run
```
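
A minimal sketch of what the driver's flow might look like, built only from the crate's declared dependencies (`parquet`, `arrow`, `pretty_assertions`, `serde_json`). The paths and the use of arrow's `ArrayWriter` are assumptions for illustration; the actual binary presumably does its own per-type string formatting to produce the golden files under `data/`.

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use pretty_assertions::assert_eq;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Read a file from the parquet-testing repo (path is an assumption)
    let file = File::open("../parquet-testing/data/alltypes_plain.parquet")?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;

    // 2. Render the decoded batches as JSON. arrow's ArrayWriter emits typed
    // values, whereas the golden files use an all-string encoding, so this
    // step stands in for whatever formatting the real binary performs.
    let mut writer = arrow::json::ArrayWriter::new(Vec::new());
    for batch in reader {
        writer.write(&batch?)?;
    }
    writer.finish()?;
    let actual: serde_json::Value = serde_json::from_slice(&writer.into_inner())?;

    // 3. Compare against the checked-in golden master; pretty_assertions
    // prints a readable diff on mismatch
    let expected: serde_json::Value =
        serde_json::from_reader(File::open("data/alltypes_plain.parquet.data.json")?)?;
    assert_eq!(expected, actual);
    Ok(())
}
```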
285 changes: 285 additions & 0 deletions parquet-integration-testing/data/alltypes_plain.parquet.data.json
@@ -0,0 +1,285 @@
{
[Comment, Contributor Author]: The basic idea would be to check in files like this for all the various files in parquet-testing, and each parquet implementation could write a driver that made the equivalent JSON and checked it against the expected output.

"filename": "alltypes_plain.parquet",
"rows": [
[Comment, Contributor]: We probably need to formalize the expected output for each physical/logical type.

The other thing I was thinking is that it might pay to consider the trade-offs between a row-oriented format and a column-oriented format that is closer to what is actually written in parquet (i.e. rep/def levels and values). Both might be useful in some situations. For instance, I've seen ill-formed parquet files in the wild because of inconsistent rep/def levels, so ensuring there are sanity checks at that level makes sense. A row-based format would certainly help with cases I've seen of non-conformant logical nested types like Lists/Maps.

[Comment, Contributor Author]: I agree that if we want to move forward with this approach, we should spend time formalizing and documenting what the expected format means.
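
To illustrate that trade-off, a column-oriented entry might carry the rep/def levels alongside the values, so that level consistency can be checked directly. This shape is purely hypothetical, not part of the PR; a sketch using `serde_json`:

```rust
use serde_json::json;

fn main() {
    // Hypothetical column-oriented encoding: one entry per column with the
    // repetition/definition levels actually written to the file. With a max
    // definition level of 1, a def level of 0 marks a null, so four levels
    // correspond to three stored values here.
    let column_entry = json!({
        "column": "int_col",
        "physical_type": "INT32",
        "rep_levels": [0, 0, 0, 0],
        "def_levels": [1, 1, 0, 1],
        "values": ["0", "1", "1"]
    });
    println!("{}", serde_json::to_string_pretty(&column_entry).unwrap());
}
```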

[
{
"id": "4"
},
{
"bool_col": "true"
},
{
"tinyint_col": "0"
},
{
"smallint_col": "0"
},
{
"int_col": "0"
},
{
"bigint_col": "0"
},
{
"float_col": "0.0"
},
{
"double_col": "0.0"
},
{
"date_string_col": "30332f30312f3039"
},
{
"string_col": "30"
},
{
"timestamp_col": "2009-03-01T00:00:00"
}
],
[
{
"id": "5"
},
{
"bool_col": "false"
},
{
"tinyint_col": "1"
},
{
"smallint_col": "1"
},
{
"int_col": "1"
},
{
"bigint_col": "10"
},
{
"float_col": "1.1"
},
{
"double_col": "10.1"
},
{
"date_string_col": "30332f30312f3039"
},
{
"string_col": "31"
},
{
"timestamp_col": "2009-03-01T00:01:00"
}
],
[
{
"id": "6"
},
{
"bool_col": "true"
},
{
"tinyint_col": "0"
},
{
"smallint_col": "0"
},
{
"int_col": "0"
},
{
"bigint_col": "0"
},
{
"float_col": "0.0"
},
{
"double_col": "0.0"
},
{
"date_string_col": "30342f30312f3039"
},
{
"string_col": "30"
},
{
"timestamp_col": "2009-04-01T00:00:00"
}
],
[
{
"id": "7"
},
{
"bool_col": "false"
},
{
"tinyint_col": "1"
},
{
"smallint_col": "1"
},
{
"int_col": "1"
},
{
"bigint_col": "10"
},
{
"float_col": "1.1"
},
{
"double_col": "10.1"
},
{
"date_string_col": "30342f30312f3039"
},
{
"string_col": "31"
},
{
"timestamp_col": "2009-04-01T00:01:00"
}
],
[
{
"id": "2"
},
{
"bool_col": "true"
},
{
"tinyint_col": "0"
},
{
"smallint_col": "0"
},
{
"int_col": "0"
},
{
"bigint_col": "0"
},
{
"float_col": "0.0"
},
{
"double_col": "0.0"
},
{
"date_string_col": "30322f30312f3039"
},
{
"string_col": "30"
},
{
"timestamp_col": "2009-02-01T00:00:00"
}
],
[
{
"id": "3"
},
{
"bool_col": "false"
},
{
"tinyint_col": "1"
},
{
"smallint_col": "1"
},
{
"int_col": "1"
},
{
"bigint_col": "10"
},
{
"float_col": "1.1"
},
{
"double_col": "10.1"
},
{
"date_string_col": "30322f30312f3039"
},
{
"string_col": "31"
},
{
"timestamp_col": "2009-02-01T00:01:00"
}
],
[
{
"id": "0"
},
{
"bool_col": "true"
},
{
"tinyint_col": "0"
},
{
"smallint_col": "0"
},
{
"int_col": "0"
},
{
"bigint_col": "0"
},
{
"float_col": "0.0"
},
{
"double_col": "0.0"
},
{
"date_string_col": "30312f30312f3039"
},
{
"string_col": "30"
},
{
"timestamp_col": "2009-01-01T00:00:00"
}
],
[
{
"id": "1"
},
{
"bool_col": "false"
},
{
"tinyint_col": "1"
},
{
"smallint_col": "1"
},
{
"int_col": "1"
},
{
"bigint_col": "10"
},
{
"float_col": "1.1"
},
{
"double_col": "10.1"
},
{
"date_string_col": "30312f30312f3039"
},
{
"string_col": "31"
},
{
"timestamp_col": "2009-01-01T00:01:00"
}
]
]
}
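
One property of the golden format worth calling out (an observation from the data above, not something documented in the PR): byte-array columns such as `date_string_col` appear to be hex-encoded raw bytes. Decoding the first value checks out against that row's timestamp:

```rust
// "30332f30312f3039" decodes byte-by-byte to the ASCII string "03/01/09",
// which matches that row's timestamp_col of 2009-03-01T00:00:00.
fn hex_decode(s: &str) -> Vec<u8> {
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).unwrap())
        .collect()
}

fn main() {
    let bytes = hex_decode("30332f30312f3039");
    assert_eq!(String::from_utf8(bytes).unwrap(), "03/01/09");
}
```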