Variant integration fixes #8438

scovich · 2025-09-24T17:48:14Z

Which issue does this PR close?

Closes

Rationale for this change

It turns out we were too permissive in our handling of typed_value columns and certain other exceptional cases that parquet's variant integration tests specifically expect readers to reject.

What changes are included in this PR?

Simplify VariantArray::value to work directly with (optional) value and typed_value columns instead of the ShreddingState enum
Rename rewrite_to_view_types as canonicalize_and_verify_data_type and expand it to also reject all illegal column types (= any that don't map directly to a variant subtype)
Fix several broken integration tests
Remove several illegal unit tests (that were exercising invalid shredding scenarios)

Are these changes tested?

Yes.

Are there any user-facing changes?

Behavior change: We no longer tolerate invalid-type typed_value columns when reading shredded variant data. At least, not in code paths that go through VariantArray::value. There may still be some leakage in the shredded path step handling of variant_get.

scovich · 2025-09-24T17:55:13Z

parquet-variant-compute/src/variant_array.rs

    /// Note: Does not do deep validation of the [`Variant`], so it is up to the
    /// caller to ensure that the metadata and value were constructed correctly.
    pub fn value(&self, index: usize) -> Variant<'_, '_> {
-        match &self.shredding_state {


There was already substantial logic duplication among the different match arms, and it only got worse once typed_value_to_variant started requiring the value column (needed for both error checking now, and later when handling partially shredded objects). It turned out that directly referencing the two fields was a lot simpler.

Follow-up that continues this line of thought:

[Variant] Simpler shredding state #8444

scovich · 2025-09-24T17:58:59Z

parquet-variant-compute/src/variant_array.rs

+    let data_type = typed_value.data_type();
+    if value.is_some_and(|v| !matches!(data_type, DataType::Struct(_)) && v.is_valid(index)) {
+        // Only a partially shredded struct is allowed to have values for both columns
+        panic!("Invalid variant, conflicting value and typed_value");


This whole panic thing is becoming increasingly awkward as more and more valid error cases arise. Especially because:

Variant data is untrusted (coming in from the user), so we have to expect malformed data

All prod uses of VariantArray::value are in fallible code that could return an error, if given the opportunity.

Now that VariantArray no longer implements Array, we have the option to make value fallible (or add a fallible try_value if we really want to keep the panicky version).

I think adding try_value sounds like a good idea to me

However, it seems to me that most of these checks can be done once per array. For example, comparing value against the data type doesn't change row by row, so paying the cost of that validation on every row feels wasteful to me.

Can we perhaps move this check into the constructor of VariantArray 🤔

So this one is a row-oriented check, unlike the columnar type checks I added in rewrite_to_view_types:

For a specific row, both value and typed_value were non-NULL and typed_value is not a struct. I suppose we could try to memoize the "not a struct" part in order to avoid the overhead of that matches! invocation, but (a) checking for a specific enum variant is really cheap; and (b) where would we store the answer between invocations of value method, given that we don't build any kind of a tree?
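To make the row-oriented check concrete, here is a minimal sketch. It is not the arrow-rs code: `TypedValueKind`, `check_row`, and `check_row_or_panic` are hypothetical names, and the validity of the optional value column is modeled as a plain `&[bool]`. It assumes the caller has already established that typed_value is non-NULL at the row.

```rust
/// Hypothetical stand-in for the Arrow data type of a `typed_value` column.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TypedValueKind {
    Int64,
    Struct,
}

/// Fallible row-level check (the kind of thing a `try_value` could run).
/// `value_valid` is the per-row validity of the optional `value` column.
/// Assumes `typed_value` is already known to be non-NULL at `index`.
fn check_row(
    kind: TypedValueKind,
    value_valid: Option<&[bool]>,
    index: usize,
) -> Result<(), String> {
    // "Is it a struct?" is a column-level fact and a cheap enum comparison;
    // "is value non-NULL here?" has to be read for each row.
    let conflict = value_valid.is_some_and(|v| kind != TypedValueKind::Struct && v[index]);
    if conflict {
        // Only a partially shredded struct may have both columns populated.
        return Err("Invalid variant, conflicting value and typed_value".to_string());
    }
    Ok(())
}

/// Panicking wrapper, showing how an infallible `value` could delegate
/// to the fallible version.
fn check_row_or_panic(kind: TypedValueKind, value_valid: Option<&[bool]>, index: usize) {
    check_row(kind, value_valid, index).unwrap_or_else(|e| panic!("{e}"));
}
```

With this split, fallible callers get an error they can propagate, and only callers that opt into the wrapper can panic on malformed data.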


scovich · 2025-09-24T18:07:43Z

parquet-variant-compute/src/variant_array.rs

 }

 /// replaces all instances of Binary with BinaryView in a DataType
 fn rewrite_to_view_types(data_type: &DataType) -> DataType {


If we agree this is the right place for the checks, I should probably rename the function (and make it fallible)?

And also expand it to cover the exhaustive set of valid and invalid data types so there's no confusion about what's legal and what's forbidden. This can be done immediately, even if a given "valid" data type isn't yet supported -- the read will simply fail later on in such cases (exactly the same as already happens today).

Yes, I agree checking the types up front as part of construction is 💯 and avoids the potential for errors later on in value methods

... when possible. Some of the new error checks I had to add are row-based, not column-based
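A rough sketch of what an exhaustive, fallible column-type check could look like. `ToyDataType` is a hypothetical and heavily reduced stand-in for arrow's `DataType`, the function name only echoes the rename proposed in this PR, and the "Unsupported shredded value type" message is an assumption made for illustration.

```rust
/// Hypothetical, heavily reduced stand-in for arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum ToyDataType {
    Int64,
    UInt8,
    Binary,
    BinaryView,
    LargeUtf8,
    Struct(Vec<(String, ToyDataType)>),
}

/// Column-level check, run once at construction time: returns the canonical
/// type or an error. There is no catch-all arm, so adding a new variant to
/// `ToyDataType` forces an explicit legal/illegal decision at compile time.
fn canonicalize_and_verify(data_type: &ToyDataType) -> Result<ToyDataType, String> {
    match data_type {
        // Legal as-is.
        ToyDataType::Int64 | ToyDataType::BinaryView => Ok(data_type.clone()),
        // Legal, but rewritten to the view representation.
        ToyDataType::Binary => Ok(ToyDataType::BinaryView),
        // Variant has no unsigned integers, so these can never be shredded.
        ToyDataType::UInt8 => Err(format!("Illegal shredded value type: {data_type:?}")),
        // Rejected today, though it might be supported some day.
        ToyDataType::LargeUtf8 => {
            Err(format!("Unsupported shredded value type: {data_type:?}"))
        }
        // Recurse so nested fields are checked and canonicalized as well.
        ToyDataType::Struct(fields) => {
            let mut out = Vec::with_capacity(fields.len());
            for (name, field_type) in fields {
                out.push((name.clone(), canonicalize_and_verify(field_type)?));
            }
            Ok(ToyDataType::Struct(out))
        }
    }
}
```

Because the check is columnar, it runs once per array and cannot replace the row-based checks mentioned above.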

scovich · 2025-09-24T18:09:27Z

parquet-variant-compute/src/variant_get.rs

    }

-    #[test]
-    fn get_variant_partially_shredded_uint8_as_variant() {


I'm not sure how exhaustive we want to be about negative testing as a replacement for all these unit tests I deleted?

I don't think we need to worry too much about it. Let's just make sure each error path is hit

scovich · 2025-09-24T18:10:46Z

parquet/tests/variant_integration.rs

 // Is an error case (should be failing as the expected error message indicates)
-variant_test_case!(
-    42,
-    "Expected an error 'Invalid variant, conflicting value and typed_value`, but got no error"


This was just flat out wrong, swallowing the error message that correctly identifies a problem 🤦

scovich · 2025-09-24T18:12:00Z

parquet/tests/variant_integration.rs

+// TODO: Once structs are supported, expect "Invalid variant, non-object value with shredded fields"
 variant_test_case!(87, "Unsupported typed_value type: Struct(");


This is an invalid-case test, but the lack of struct support currently masks the real problem.

(another below)

scovich · 2025-09-24T18:13:23Z

parquet/tests/variant_integration.rs

 // Is an error case (should be failing as the expected error message indicates)
-variant_test_case!(
-    127,
-    "Invalid variant data: InvalidArgumentError(\"Received empty bytes\")"


AFAICT, the test data has invalid empty "" metadata column entries, perhaps because the data is manually generated and the author never expected readers to get beyond the schema checks 🤷

scovich · 2025-09-24T18:18:03Z

parquet/tests/variant_integration.rs

 // Is an error case (should be failing as the expected error message indicates)
+// TODO: Once structs are supported, expect "Invalid variant, non-object value with shredded fields"
 variant_test_case!(128, "Unsupported typed_value type: Struct(");
-variant_test_case!(129, "Invalid variant data: InvalidArgumentError(");


This test verifies an invalid input (value and typed_value both NULL), which the shredding spec mandates should produce Variant::Null:

If a Variant is missing in a context where a value is required, readers must return a Variant null (00): basic type 0 (primitive) and physical type 0 (null).

The parquet footer for this test is:

    Row group 0:  count: 1  123.00 B records  start: 4  total(compressed): 123 B total(uncompressed):123 B
    --------------------------------------------------------------------------------
                     type      encodings count     avg size   nulls   min / max
    id               INT32     _ _       1         27.00 B    0       "1" / "1"
    var.metadata     BINARY    _ _       1         36.00 B    0       "0x010000" / "0x010000"
    var.value        BINARY    _ _       1         30.00 B    1
    var.typed_value  INT32     _ _       1         30.00 B    1

Confirmed the test says this case should return Variant::Null 👍

https://github.com/apache/parquet-testing/blob/a3d96a65e11e2bbca7d22a894e8313ede90a33a3/shredded_variant/cases.json#L764-L768
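For illustration, the per-row resolution for a shredded primitive column might be sketched as below. `ToyVariant` and `resolve_row` are hypothetical names, not part of the parquet-variant crates, and the two cells are modeled as plain `Option`s rather than Arrow arrays.

```rust
/// Hypothetical stand-in for a decoded variant value.
#[derive(Debug, Clone, PartialEq)]
enum ToyVariant {
    Null,
    Int32(i32),
    /// A still-encoded binary variant taken from the `value` column.
    Encoded(Vec<u8>),
}

/// Resolve one row of a shredded primitive column from its two optional cells.
fn resolve_row(value: Option<&[u8]>, typed_value: Option<i32>) -> Result<ToyVariant, String> {
    match (value, typed_value) {
        // Both NULL: readers must return a Variant null instead of failing.
        (None, None) => Ok(ToyVariant::Null),
        // Shredded: the typed column holds the value.
        (None, Some(v)) => Ok(ToyVariant::Int32(v)),
        // Not shredded: fall back to the binary-encoded variant.
        (Some(bytes), None) => Ok(ToyVariant::Encoded(bytes.to_vec())),
        // Both present is only legal for partially shredded objects.
        (Some(_), Some(_)) => {
            Err("Invalid variant, conflicting value and typed_value".to_string())
        }
    }
}
```

The footer above corresponds to the first arm: one row, with a NULL in both var.value and var.typed_value.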

scovich · 2025-09-24T18:20:14Z

parquet/tests/variant_integration.rs

 variant_test_case!(134, "Unsupported typed_value type: Struct(");
 variant_test_case!(135);
 variant_test_case!(136, "Unsupported typed_value type: List(");
-variant_test_case!(137, "Invalid variant data: InvalidArgumentError(");


I'm not sure why this one failed before, but it was apparently for the wrong reason.

The new error seems more like the expected error in cases: https://github.com/apache/parquet-testing/blob/a3d96a65e11e2bbca7d22a894e8313ede90a33a3/shredded_variant/cases.json#L812-L815

scovich · 2025-09-24T18:21:19Z

Attn @alamb @mbrobbel @klion26 who have been dancing around this recently.

alamb

thank you @scovich -- I think this PR is a nice step forward

In my opinion, it would be best to leave value as infallible, and instead check the type validity once as part of constructing VariantArray rather than on each row

alamb · 2025-09-24T19:41:21Z

parquet-variant-compute/src/variant_array.rs

+    let data_type = typed_value.data_type();
+    if value.is_some_and(|v| !matches!(data_type, DataType::Struct(_)) && v.is_valid(index)) {
+        // Only a partially shredded struct is allowed to have values for both columns
+        panic!("Invalid variant, conflicting value and typed_value");


I think adding try_value sounds like a good idea to me

However, it seems to me that most of these checks can be done once per array. For example, comparing value against the data type doesn't change row by row, so paying the cost of that validation on every row feels wasteful to me.

Can we perhaps move this check into the constructor of VariantArray 🤔


alamb · 2025-09-24T19:43:19Z

parquet-variant-compute/src/variant_array.rs

        DataType::Int64 => {
            primitive_conversion_single_value!(Int64Type, typed_value, index)
        }
-        DataType::UInt8 => {


If I understand this correctly, the point is that since the Variant spec has no unsigned types, it wouldn't be permissible to shred out such arrow types

https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types

Exactly. I don't think the shredding spec directly says that, but it's implied because shredding is always presumed to start from binary encoded variant values and is a more efficient representation of the same. So throwing in random other types doesn't really make sense.

Wow, I'm blind... the spec definitely directly says which parquet logical types are allowed for shredded columns -- there's a section for it, including a table:
https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types

alamb · 2025-09-24T19:44:04Z

parquet-variant-compute/src/variant_array.rs

 }

 /// replaces all instances of Binary with BinaryView in a DataType
 fn rewrite_to_view_types(data_type: &DataType) -> DataType {


Yes, I agree checking the types up front as part of construction is 💯 and avoids the potential for errors later on in value methods

alamb · 2025-09-24T19:44:25Z

parquet-variant-compute/src/variant_array.rs

    match data_type {
+        // Unsigned integers are not allowed at all
+        DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64 => {
+            panic!("Illegal shredded value type: {data_type:?}");


this would be a good place to return errors I think

Ok, let me quickly fix that

alamb · 2025-09-24T19:44:49Z

parquet-variant-compute/src/variant_get.rs

    }

-    #[test]
-    fn get_variant_partially_shredded_uint8_as_variant() {


I don't think we need to worry too much about it. Let's just make sure each error path is hit

alamb · 2025-09-24T19:46:32Z

parquet/tests/variant_integration.rs

 // Is an error case (should be failing as the expected error message indicates)
+// TODO: Once structs are supported, expect "Invalid variant, non-object value with shredded fields"
 variant_test_case!(128, "Unsupported typed_value type: Struct(");
-variant_test_case!(129, "Invalid variant data: InvalidArgumentError(");


Confirmed the test says this case should return Variant::Null 👍

https://github.com/apache/parquet-testing/blob/a3d96a65e11e2bbca7d22a894e8313ede90a33a3/shredded_variant/cases.json#L764-L768

alamb · 2025-09-24T19:47:10Z

parquet/tests/variant_integration.rs

 variant_test_case!(134, "Unsupported typed_value type: Struct(");
 variant_test_case!(135);
 variant_test_case!(136, "Unsupported typed_value type: List(");
-variant_test_case!(137, "Invalid variant data: InvalidArgumentError(");


The new error seems more like the expected error in cases: https://github.com/apache/parquet-testing/blob/a3d96a65e11e2bbca7d22a894e8313ede90a33a3/shredded_variant/cases.json#L812-L815

scovich · 2025-09-24T21:45:36Z

@alamb -- I think I addressed all your comments.

I also added complete type checking in the VariantArray constructor now -- every arrow type either succeeds or fails (no catch-all). Some helper methods were renamed accordingly.

This does not eliminate the row-oriented checks that are also required, tho. So we still need to solve the problem of panicky value method.

scovich · 2025-09-24T23:25:50Z

parquet-variant-compute/src/variant_array.rs

+        // We can _possibly_ support (some of) these some day?
+        LargeBinary | LargeUtf8 | Utf8View | ListView(_) | LargeList(_) | LargeListView(_) => {


It's unclear to me what leeway writers have in producing different physical forms of the same logical data. Not just large vs. normal offsets vs. view, but also layout optimizations like dictionary or run-end coding?

scovich · 2025-09-24T23:27:45Z

parquet-variant-compute/src/variant_array.rs

+        Struct(fields) => {
+            // Avoid allocation unless at least one field was rewritten


Observation: The original code unconditionally built a new type from the ground up, which defeated the purpose of having Fields store Arc<Field> (which makes clone much more shallow than it would otherwise be). So this code collects the set of fields changed by deeper layers, and only constructs a new struct if at least one field actually changed. And even then, the other fields are just shallow Arc clones. In the common case where no fields changed, the hashmap is empty (no allocations) and we just return a borrowed version of the input data type.
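A self-contained sketch of that copy-on-write idea, for readers who want to see the shape of it. `ToyType`, `ToyField`, and `canonicalize` are hypothetical stand-ins for arrow's `DataType`, `Field`, and the helper in this PR; only the Binary to BinaryView rewrite is modeled.

```rust
use std::borrow::Cow;
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-ins for arrow's `DataType` and `Field`.
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int64,
    Binary,
    BinaryView,
    Struct(Vec<Arc<ToyField>>),
}

#[derive(Debug, Clone, PartialEq)]
struct ToyField {
    name: String,
    data_type: ToyType,
}

/// Rewrites `Binary` to `BinaryView`, borrowing the input whenever nothing changed.
fn canonicalize(dt: &ToyType) -> Cow<'_, ToyType> {
    match dt {
        ToyType::Binary => Cow::Owned(ToyType::BinaryView),
        ToyType::Int64 | ToyType::BinaryView => Cow::Borrowed(dt),
        ToyType::Struct(fields) => {
            // Collect only the fields that deeper layers rewrote, keyed by position.
            // An empty HashMap does not allocate.
            let mut changed: HashMap<usize, Arc<ToyField>> = HashMap::new();
            for (i, field) in fields.iter().enumerate() {
                if let Cow::Owned(new_type) = canonicalize(&field.data_type) {
                    let new_field = ToyField { name: field.name.clone(), data_type: new_type };
                    changed.insert(i, Arc::new(new_field));
                }
            }
            if changed.is_empty() {
                // Common case: nothing changed, so hand back the input untouched.
                return Cow::Borrowed(dt);
            }
            // Unchanged fields are shallow `Arc` clones; only rewritten ones are new.
            let new_fields: Vec<Arc<ToyField>> = fields
                .iter()
                .enumerate()
                .map(|(i, f)| changed.remove(&i).unwrap_or_else(|| Arc::clone(f)))
                .collect();
            Cow::Owned(ToyType::Struct(new_fields))
        }
    }
}
```

Callers that need an owned type can call `into_owned()` on the result, which only clones when the input was returned borrowed.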

scovich · 2025-09-25T03:22:16Z

Something seems to have gone wrong with CI?

  Error response from daemon: unauthorized: authentication required
  Error: Docker pull failed with exit code 1

mbrobbel

Thanks @scovich

parquet-variant-compute/src/variant_array.rs

scovich · 2025-09-25T12:07:42Z

Ok, this should be ready to go unless somebody has other comments to add?

klion26

LGTM. Sorry for the late reply, I'm struggling with some network issues in recent days.

klion26 · 2025-09-25T14:44:04Z

parquet-variant-compute/src/variant_array.rs

+    index: usize,
+) -> Variant<'a, 'a> {
+    let data_type = typed_value.data_type();
+    if value.is_some_and(|v| !matches!(data_type, DataType::Struct(_)) && v.is_valid(index)) {


We'll panic here if (data_type is not DataType::Struct(_)) and (v.is_valid(index)), do we need to panic if data_type is DataType::Struct and v.is_valid(index) here?

We do not need to panic if we have a struct here -- that corresponds to a partially shredded variant object, where the value is a variant object and the typed_value is a struct. Eventually, the code that handles partial shredding will detect if the value is not a variant object or contains field names that conflict with those of the typed_value, but that will happen in a different location. I have it prototyped locally and can push a PR once this one merges.
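As a rough picture of the later validation described here (not the prototyped code, which is not shown in this thread), the two failure modes could be checked along these lines. `ToyValue` and `check_partially_shredded` are hypothetical, and the wording of the field-conflict error is an assumption; only the non-object message appears in the integration test TODOs.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a decoded `value` cell: either an object
/// (reduced here to its field names) or any non-object variant.
enum ToyValue {
    Object(Vec<String>),
    NonObject,
}

/// Checks one row where `typed_value` is a struct and `value` is also present.
fn check_partially_shredded(value: &ToyValue, shredded_fields: &[&str]) -> Result<(), String> {
    // The residual `value` must itself be a variant object.
    let ToyValue::Object(value_fields) = value else {
        return Err("Invalid variant, non-object value with shredded fields".to_string());
    };
    let shredded: HashSet<&str> = shredded_fields.iter().copied().collect();
    // A field may live in `value` or in `typed_value`, but not in both.
    match value_fields.iter().find(|name| shredded.contains(name.as_str())) {
        Some(name) => Err(format!(
            "Invalid variant, field {name:?} is both shredded and unshredded"
        )),
        None => Ok(()),
    }
}
```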

klion26 · 2025-09-25T15:01:22Z

parquet-variant-compute/src/variant_array.rs

+        };
+    }
+
+    let new_data_type = match data_type {


mbrobbel · 2025-09-25T16:13:37Z

Thanks @scovich, @alamb and @klion26

# Which issue does this PR close?

- Related to #8336

# Rationale for this change

While working on #8438, I noticed that the enum variants of `ShreddingState` actually made the code (a lot) more complex than if I just referenced the (optional) value and typed_value columns directly. That made me wonder if `ShreddingState` would be better as a simple two-field struct.

# What changes are included in this PR?

Change `ShreddingState` to a two-field struct and update the few call sites that noticed. While we're at it, improve the docs about how shredding works.

# Are these changes tested?

Existing tests cover what is mostly an internal change

# Are there any user-facing changes?

`ShreddingState` is pub and changed from enum to struct.

---------

Co-authored-by: Andrew Lamb

scovich added 2 commits September 24, 2025 10:03

[Variant] Fix several incorrect variant integration test cases

e08bcaa

more fixes; remove illegal unit tests

4f45954

github-actions bot added parquet Changes to the parquet crate parquet-variant parquet-variant* crates labels Sep 24, 2025

Merge remote-tracking branch 'oss/main' into variant-integration-fixes

811f4a5

scovich commented Sep 24, 2025


self review

127e3ae

Merge branch 'main' into variant-integration-fixes

fe4d628

alamb approved these changes Sep 24, 2025


scovich added 3 commits September 24, 2025 13:35

review feedback

c98edee

tighten up illegal shredding type checks

ab129d9

comment

54056fe

scovich added 2 commits September 24, 2025 15:21

forgot to delete Float16 tests

8a2e4af

decimal comment

edc7ecd

scovich commented Sep 24, 2025


Merge branch 'main' into variant-integration-fixes

6a6e067

mbrobbel approved these changes Sep 25, 2025


fix comments

5de3489

scovich mentioned this pull request Sep 25, 2025

[Variant] Allow VariantArray::value to work with owned value bytes #8430

Draft

decimal precision fix

8dd3ab7

klion26 approved these changes Sep 25, 2025


scovich mentioned this pull request Sep 25, 2025

[Variant] Simpler shredding state #8444

Merged

mbrobbel merged commit bed9ed8 into apache:main Sep 25, 2025
19 checks passed

scovich mentioned this pull request Sep 25, 2025

[Variant] VariantArray::value supports shredded struct access #8446

Draft

This was referenced Sep 25, 2025

[Variant] Overly aggressive inference of UUID values #8420

Open

[Variant] Shredded typed_value columns must have valid variant types #8435

Closed

