Contributor

@scovich scovich commented Sep 24, 2025

Which issue does this PR close?

Closes

Rationale for this change

It turns out we were too permissive in our handling of typed_value columns and certain other exceptional cases that parquet's variant integration tests specifically expect readers to reject.

What changes are included in this PR?

  • Simplify VariantArray::value to work directly with (optional) value and typed_value columns instead of the ShreddingState enum
  • Rename rewrite_to_view_types as canonicalize_and_verify_data_type and expand it to also reject all illegal column types (= any that don't map directly to a variant subtype)
  • Fix several broken integration tests
  • Remove several illegal unit tests (that were exercising invalid shredding scenarios)

Are these changes tested?

Yes.

Are there any user-facing changes?

Behavior change: We no longer tolerate invalid-type typed_value columns when reading shredded variant data. At least, not in code paths that go through VariantArray::value. There may still be some leakage in the shredded path step handling of variant_get.

@github-actions github-actions bot added parquet Changes to the parquet crate parquet-variant parquet-variant* crates labels Sep 24, 2025
/// Note: Does not do deep validation of the [`Variant`], so it is up to the
/// caller to ensure that the metadata and value were constructed correctly.
pub fn value(&self, index: usize) -> Variant<'_, '_> {
match &self.shredding_state {
Contributor Author

There was already substantial logic duplication among the different match arms, and it only got worse once typed_value_to_variant started requiring the value column (needed for both error checking now, and later when handling partially shredded objects). It turned out that directly referencing the two fields was a lot simpler.

Contributor Author

Follow-up that continues this line of thought:

let data_type = typed_value.data_type();
if value.is_some_and(|v| !matches!(data_type, DataType::Struct(_)) && v.is_valid(index)) {
    // Only a partially shredded struct is allowed to have values for both columns
    panic!("Invalid variant, conflicting value and typed_value");
Contributor Author

This whole panic thing is becoming increasingly awkward as more and more valid error cases arise. Especially because:

  1. Variant data is untrusted (coming in from the user), so we have to expect malformed data
  2. All prod uses of VariantArray::value are in fallible code that could return an error, if given the opportunity.

Now that VariantArray no longer implements Array, we have the option to make value fallible (or add a fallible try_value if we really want to keep the panicky version).
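To make the two options concrete, here is a minimal sketch of a fallible `try_value` with a panicking `value` wrapper layered on top. `ToyVariant` and `ToyVariantArray` are hypothetical stand-ins, not the real parquet-variant types, and the sketch ignores the partially shredded struct case in which both columns may legitimately be populated.

```rust
/// Toy stand-in for a decoded variant value (hypothetical, not the real `Variant`).
#[derive(Debug, PartialEq)]
enum ToyVariant {
    Null,
    Int(i64),
}

/// Toy stand-in for a variant array: one optional `value` and one optional
/// `typed_value` entry per row, where `None` marks a NULL row.
struct ToyVariantArray {
    value: Vec<Option<i64>>,
    typed_value: Vec<Option<i64>>,
}

impl ToyVariantArray {
    /// Fallible accessor: a malformed row surfaces as `Err` instead of a panic.
    /// An out-of-bounds `index` still panics, as slice indexing normally does.
    fn try_value(&self, index: usize) -> Result<ToyVariant, String> {
        match (self.value[index], self.typed_value[index]) {
            // Simplification: this toy has no struct columns, so both columns
            // being populated is always treated as a conflict.
            (Some(_), Some(_)) => {
                Err("Invalid variant, conflicting value and typed_value".to_string())
            }
            (Some(v), None) | (None, Some(v)) => Ok(ToyVariant::Int(v)),
            (None, None) => Ok(ToyVariant::Null),
        }
    }

    /// Panicking convenience wrapper for callers that trust their input.
    fn value(&self, index: usize) -> ToyVariant {
        self.try_value(index).unwrap_or_else(|e| panic!("{e}"))
    }
}
```

With this shape, fallible call sites can propagate the error with `?`, while existing callers of `value` keep their current behavior.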

Contributor

I think adding try_value sounds like a good idea to me

However, it seems to me that most of these checks can be done once per array (e.g. this check of value against the data type doesn't change row by row, so paying the cost of validating on each row feels wasteful to me)

Can we perhaps move this check into the constructor of VariantArray 🤔

Contributor Author

@scovich scovich Sep 24, 2025

So this one is a row-oriented check, unlike the columnar type checks I added in rewrite_to_view_types:

For a specific row, both value and typed_value were non-NULL and typed_value is not a struct. I suppose we could try to memoize the "not a struct" part in order to avoid the overhead of that matches! invocation, but (a) checking for a specific enum variant is really cheap; and (b) where would we store the answer between invocations of the value method, given that we don't build any kind of a tree?
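The split between the columnar part and the row-oriented part of this check can be sketched as follows. This is a simplified illustration: `has_conflict` is a hypothetical helper, the validity bitmap is modeled as a plain `&[bool]`, and the caller is assumed to have already established that typed_value is non-NULL at `index`.

```rust
/// Returns true when a row illegally populates both `value` and `typed_value`.
///
/// `value_validity` is `None` when the array has no `value` column at all;
/// otherwise entry `i` says whether row `i` of `value` is non-NULL.
/// `typed_value_is_struct` stands in for `matches!(data_type, DataType::Struct(_))`
/// and is the only part that is the same for every row.
fn has_conflict(
    value_validity: Option<&[bool]>,
    typed_value_is_struct: bool,
    index: usize,
) -> bool {
    // Only a partially shredded struct may have values in both columns.
    value_validity.is_some_and(|valid| !typed_value_is_struct && valid[index])
}
```

The boolean could be computed once per array, but the `valid[index]` lookup cannot, which is why the check as a whole stays per-row.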

}

/// replaces all instances of Binary with BinaryView in a DataType
fn rewrite_to_view_types(data_type: &DataType) -> DataType {
Contributor Author

If we agree this is the right place for the checks, I should probably rename the function (and make it fallible)?

And also expand it to cover the exhaustive set of valid and invalid data types so there's no confusion about what's legal and what's forbidden. This can be done immediately, even if a given "valid" data type isn't yet supported -- the read will simply fail later on in such cases (exactly the same as already happens today).
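A rough sketch of what a fallible, exhaustive version might look like is shown below. `ToyType` is a hypothetical, heavily reduced stand-in for arrow's `DataType`; the "Illegal shredded value type" message comes from the diff under review, while the "Unsupported" message and the exact grouping of types are assumptions made for illustration.

```rust
/// Hypothetical, reduced stand-in for arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int8,
    Int16,
    Int32,
    Int64,
    UInt8,
    UInt16,
    UInt32,
    UInt64,
    Binary,
    BinaryView,
    LargeBinary,
}

/// Rewrites `Binary` to `BinaryView` and rejects types that cannot be shredded.
/// There is deliberately no `_ =>` arm, so adding a new `ToyType` variant is a
/// compile error until it has been classified as legal or illegal.
fn canonicalize_and_verify(data_type: &ToyType) -> Result<ToyType, String> {
    use ToyType::*;
    match data_type {
        // Already canonical: pass through unchanged.
        Int8 | Int16 | Int32 | Int64 | BinaryView => Ok(data_type.clone()),
        // Legal, but stored in a non-canonical physical form.
        Binary => Ok(BinaryView),
        // Variant has no unsigned integer types, so these are never legal.
        UInt8 | UInt16 | UInt32 | UInt64 => {
            Err(format!("Illegal shredded value type: {data_type:?}"))
        }
        // Might be supported some day (assumed wording for the error).
        LargeBinary => Err(format!("Unsupported shredded value type: {data_type:?}")),
    }
}
```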

Contributor

Yes, I agree checking the types up front as part of construction is 💯 and avoids the potential for errors later on in value methods

Contributor Author

... when possible. Some of the new error checks I had to add are row-based, not column-based

}

#[test]
fn get_variant_partially_shredded_uint8_as_variant() {
Contributor Author

I'm not sure how exhaustive we want to be about negative testing as a replacement for all these unit tests I deleted?

Contributor

I don't think we need to worry too much about it. Let's just make sure each error path is hit

// Is an error case (should be failing as the expected error message indicates)
variant_test_case!(
42,
"Expected an error 'Invalid variant, conflicting value and typed_value`, but got no error"
Contributor Author

This was just flat out wrong, swallowing the error message that correctly identifies a problem 🤦
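For illustration only, here is one way a test harness could compare a reader outcome against an expected error substring. `check_outcome` is hypothetical and is not how `variant_test_case!` is actually implemented; it merely shows where a message like "Expected an error ..., but got no error" could originate, and why a test that expects that message is really asserting that the reader failed to reject bad input.

```rust
/// Compares a reader outcome with the expectation declared by a test case.
/// `expected_error` is `None` for cases that must succeed, or a substring that
/// the error message must contain for cases that must fail.
fn check_outcome(
    outcome: Result<(), String>,
    expected_error: Option<&str>,
) -> Result<(), String> {
    match (outcome, expected_error) {
        (Ok(()), None) => Ok(()),
        (Err(actual), Some(expected)) if actual.contains(expected) => Ok(()),
        // The reader accepted input that the test case says is invalid.
        (Ok(()), Some(expected)) => {
            Err(format!("Expected an error '{expected}', but got no error"))
        }
        (Err(actual), Some(expected)) => {
            Err(format!("Expected an error '{expected}', but got '{actual}'"))
        }
        (Err(actual), None) => Err(format!("Unexpected error: {actual}")),
    }
}
```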

Comment on lines +197 to 198
// TODO: Once structs are supported, expect "Invalid variant, non-object value with shredded fields"
variant_test_case!(87, "Unsupported typed_value type: Struct(");
Contributor Author

This is an invalid-case test, but the lack of struct support currently masks the real problem.

(another below)

// Is an error case (should be failing as the expected error message indicates)
variant_test_case!(
127,
"Invalid variant data: InvalidArgumentError(\"Received empty bytes\")"
Contributor Author

AFAICT, the test data has invalid empty "" metadata column entries, perhaps because the data is manually generated and the author never expected readers to get beyond the schema checks 🤷

// Is an error case (should be failing as the expected error message indicates)
// TODO: Once structs are supported, expect "Invalid variant, non-object value with shredded fields"
variant_test_case!(128, "Unsupported typed_value type: Struct(");
variant_test_case!(129, "Invalid variant data: InvalidArgumentError(");
Contributor Author

This test verifies an invalid input (value and typed_value both NULL), which the shredding spec mandates should produce Variant::Null:

If a Variant is missing in a context where a value is required, readers must return a Variant null (00): basic type 0 (primitive) and physical type 0 (null).

The parquet footer for this test is:

Row group 0:  count: 1  123.00 B records  start: 4  total(compressed): 123 B total(uncompressed):123 B 
--------------------------------------------------------------------------------
                 type      encodings count     avg size   nulls   min / max
id               INT32     _   _     1         27.00 B    0       "1" / "1"
var.metadata     BINARY    _   _     1         36.00 B    0       "0x010000" / "0x010000"
var.value        BINARY    _   _     1         30.00 B    1       
var.typed_value  INT32     _   _     1         30.00 B    1       
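The "Variant null (00)" rule quoted above can be sketched as a small fallback. This assumes, as I understand the variant encoding, that the header byte keeps the basic type in its low 2 bits and the primitive type id in its upper 6 bits; `primitive_header` and `unshredded_row_bytes` are hypothetical helper names.

```rust
/// Basic type 0 means "primitive"; it occupies the low 2 bits of the header byte.
const BASIC_TYPE_PRIMITIVE: u8 = 0;
/// Primitive (physical) type 0 means "null"; it occupies the upper 6 bits.
const PRIMITIVE_TYPE_NULL: u8 = 0;

/// Builds the one-byte header for a primitive variant value.
fn primitive_header(type_id: u8) -> u8 {
    (type_id << 2) | BASIC_TYPE_PRIMITIVE
}

/// Bytes to report for a row whose `typed_value` is NULL: the `value` bytes if
/// present, otherwise an encoded Variant null as the shredding spec requires.
fn unshredded_row_bytes(value: Option<&[u8]>) -> Vec<u8> {
    match value {
        Some(bytes) => bytes.to_vec(),
        None => vec![primitive_header(PRIMITIVE_TYPE_NULL)],
    }
}
```

For the single row in the footer above, both columns are NULL, so this fallback would yield the one-byte encoding `0x00`.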

variant_test_case!(134, "Unsupported typed_value type: Struct(");
variant_test_case!(135);
variant_test_case!(136, "Unsupported typed_value type: List(");
variant_test_case!(137, "Invalid variant data: InvalidArgumentError(");
Contributor Author

I'm not sure why this one failed before, but it was apparently for the wrong reason.

Contributor Author

scovich commented Sep 24, 2025

Attn @alamb @mbrobbel @klion26 who have been dancing around this recently.

Contributor

@alamb alamb left a comment

thank you @scovich -- I think this PR is a nice step forward

In my opinion, it would be best to leave value as infallible, and instead check the type validity once as part of constructing VariantArray rather than on each row

DataType::Int64 => {
primitive_conversion_single_value!(Int64Type, typed_value, index)
}
DataType::UInt8 => {
Contributor

If I understand this correctly, the point is that since the Variant spec has no unsigned types, it wouldn't be permissible to shred out such arrow types

https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types

Contributor Author

Exactly. I don't think the shredding spec directly says that, but it's implied because shredding is always presumed to start from binary encoded variant values and is a more efficient representation of the same. So throwing in random other types doesn't really make sense.

Contributor Author

Wow, I'm blind... the spec definitely directly says which parquet logical types are allowed for shredded columns -- there's a section for it, including a table:
https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types

match data_type {
    // Unsigned integers are not allowed at all
    DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64 => {
        panic!("Illegal shredded value type: {data_type:?}");
Contributor

this would be a good place to return errors I think

Contributor Author

Ok, let me quickly fix that

Contributor Author

Done.

Contributor Author

scovich commented Sep 24, 2025

@alamb -- I think I addressed all your comments.

I also added complete type checking in the VariantArray constructor now -- every arrow type either succeeds or fails (no catch-all). Some helper methods were renamed accordingly.

This does not eliminate the row-oriented checks that are also required, tho. So we still need to solve the problem of panicky value method.

Comment on lines 934 to 935
// We can _possibly_ support (some of) these some day?
LargeBinary | LargeUtf8 | Utf8View | ListView(_) | LargeList(_) | LargeListView(_) => {
Contributor Author

It's unclear to me what leeway writers have in producing different physical forms of the same logical data. Not just large vs. normal offsets vs. view, but also layout optimizations like dictionary or run-end coding?

Comment on lines 944 to 945
Struct(fields) => {
// Avoid allocation unless at least one field was rewritten
Contributor Author

@scovich scovich Sep 24, 2025

Observation: The original code unconditionally built a new type from the ground up, which defeated the purpose of having Fields store Arc<Field> (which makes clone much more shallow than it would otherwise be). So this code collects the set of fields changed by deeper layers, and only constructs a new struct if at least one field actually changed. And even then, the other fields are just shallow Arc clones. In the common case where no fields changed, the hashmap is empty (no allocations) and we just return a borrowed version of the input data type.
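The borrow-unless-changed pattern described here can be sketched with `Cow`. `SketchType` and `SketchField` are hypothetical, reduced stand-ins for arrow's `DataType` and `Field`; the real function also validates types and is fallible, which this sketch leaves out.

```rust
use std::borrow::Cow;
use std::collections::HashMap;
use std::sync::Arc;

#[derive(Debug, Clone, PartialEq)]
enum SketchType {
    Int32,
    Binary,
    BinaryView,
    Struct(Vec<Arc<SketchField>>),
}

#[derive(Debug, Clone, PartialEq)]
struct SketchField {
    name: String,
    data_type: SketchType,
}

/// Rewrites `Binary` to `BinaryView`, borrowing the input whenever nothing changed.
fn canonicalize(data_type: &SketchType) -> Cow<'_, SketchType> {
    match data_type {
        SketchType::Binary => Cow::Owned(SketchType::BinaryView),
        SketchType::Struct(fields) => {
            // Collect only the fields that deeper layers actually rewrote.
            // An empty HashMap performs no heap allocation.
            let mut rewritten: HashMap<usize, Arc<SketchField>> = HashMap::new();
            for (i, field) in fields.iter().enumerate() {
                if let Cow::Owned(new_type) = canonicalize(&field.data_type) {
                    let new_field = SketchField {
                        name: field.name.clone(),
                        data_type: new_type,
                    };
                    rewritten.insert(i, Arc::new(new_field));
                }
            }
            if rewritten.is_empty() {
                // Common case: nothing changed, so hand back the input as-is.
                return Cow::Borrowed(data_type);
            }
            // Unchanged fields are shallow `Arc` clones, not deep copies.
            let new_fields: Vec<Arc<SketchField>> = fields
                .iter()
                .enumerate()
                .map(|(i, field)| rewritten.remove(&i).unwrap_or_else(|| Arc::clone(field)))
                .collect();
            Cow::Owned(SketchType::Struct(new_fields))
        }
        SketchType::Int32 | SketchType::BinaryView => Cow::Borrowed(data_type),
    }
}
```

A caller that needs an owned type can call `into_owned()` on the result, which clones only in the borrowed case.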

Contributor Author

scovich commented Sep 25, 2025

Something seems to have gone wrong with CI?

  Error response from daemon: unauthorized: authentication required
  Error: Docker pull failed with exit code 1

Member

@mbrobbel mbrobbel left a comment

Thanks @scovich

Contributor Author

scovich commented Sep 25, 2025

Ok, this should be ready to go unless somebody has other comments to add?

Member

@klion26 klion26 left a comment

LGTM. Sorry for the late reply, I'm struggling with some network issues in recent days.

    index: usize,
) -> Variant<'a, 'a> {
    let data_type = typed_value.data_type();
    if value.is_some_and(|v| !matches!(data_type, DataType::Struct(_)) && v.is_valid(index)) {
Member

We'll panic here if data_type is not DataType::Struct(_) and v.is_valid(index). Do we also need to panic if data_type is DataType::Struct(_) and v.is_valid(index)?

Contributor Author

@scovich scovich Sep 25, 2025

We do not need to panic if we have a struct here -- that corresponds to a partially shredded variant object, where the value is a variant object and the typed_value is a struct. Eventually, the code that handles partial shredding will detect if the value is not a variant object or contains field names that conflict with those of the typed_value, but that will happen in a different location. I have it prototyped locally and can push a PR once this one merges.
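As a rough illustration of the two follow-up checks described here (not the prototype mentioned above, which is not shown in this conversation), a partially shredded row could be validated along these lines. `Residual` and `check_partially_shredded` are hypothetical names; the first error message is taken from the TODO comments in the integration tests, and the second is invented for the sketch.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified view of one row of the `value` column.
enum Residual<'a> {
    /// The `value` column is NULL for this row.
    Null,
    /// A variant object, reduced here to its field names.
    Object(Vec<&'a str>),
    /// Any other variant value (primitive, array, ...).
    NonObject,
}

/// Validates one row whose `typed_value` is a struct with `shredded_fields`.
fn check_partially_shredded(
    residual: &Residual<'_>,
    shredded_fields: &[&str],
) -> Result<(), String> {
    match residual {
        // Fully shredded row: nothing left over in `value`.
        Residual::Null => Ok(()),
        Residual::NonObject => {
            Err("Invalid variant, non-object value with shredded fields".to_string())
        }
        Residual::Object(names) => {
            let shredded: HashSet<&str> = shredded_fields.iter().copied().collect();
            // A field may live in `value` or in `typed_value`, never in both.
            match names.iter().find(|name| shredded.contains(**name)) {
                Some(name) => Err(format!(
                    "field {name:?} appears in both value and typed_value"
                )),
                None => Ok(()),
            }
        }
    }
}
```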

};
}

let new_data_type = match data_type {
Member

Nice!

@mbrobbel
Member

Thanks @scovich, @alamb and @klion26

@mbrobbel mbrobbel merged commit bed9ed8 into apache:main Sep 25, 2025
19 checks passed
mbrobbel pushed a commit that referenced this pull request Sep 26, 2025
# Which issue does this PR close?

- Related to #8336

# Rationale for this change

While working on #8438, I noticed
that the enum variants of `ShreddingState` actually made the code (a
lot) more complex than if I just referenced the (optional) value and
typed_value columns directly. That made me wonder if `ShreddingState`
would be better as a simple two-field struct.

# What changes are included in this PR?

Change `ShreddingState` to a two-field struct and update the few call
sites that noticed.

While we're at it, improve the docs about how shredding works.

# Are these changes tested?

Existing tests cover what is mostly an internal change

# Are there any user-facing changes?

`ShreddingState` is pub and changed from enum to struct.

---------

Co-authored-by: Andrew Lamb <[email protected]>