-
Notifications
You must be signed in to change notification settings - Fork 1k
Variant integration fixes #8438
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/// Note: Does not do deep validation of the [`Variant`], so it is up to the | ||
/// caller to ensure that the metadata and value were constructed correctly. | ||
pub fn value(&self, index: usize) -> Variant<'_, '_> { | ||
match &self.shredding_state { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was already substantial logic duplication among the different match arms, and it only got worse once typed_value_to_variant
started requiring the value column (needed for both error checking now, and later when handling partially shredded objects). It turned out that directly referencing the two fields was a lot simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow-up that continues this line of thought:
let data_type = typed_value.data_type(); | ||
if value.is_some_and(|v| !matches!(data_type, DataType::Struct(_)) && v.is_valid(index)) { | ||
// Only a partially shredded struct is allowed to have values for both columns | ||
panic!("Invalid variant, conflicting value and typed_value"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole panic thing is becoming increasingly awkward as more and more valid error cases arise. Especially because:
- Variant data is untrusted (coming in from the user), so we have to expect malformed data
- All prod uses of
VariantArray::value
are in fallible code that could return an error, if given the opportunity.
Now that VariantArray
no longer implements Array
, we have the option to make value
fallible (or add a fallible try_value
if we really want to keep the panicky version).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding try_value
sounds like a good idea to me
However, it seems to me that most of these checks can be done once per array (e.g. this check for value
and compare to the datatype doesn't change row by row, so paying the cost to do the validation on each row feels wasteful to me)
Can we perhaps move this check into the constructor of VariantArray 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this one is a row-oriented check, unlike the columnar type checks I added in rewrite_to_view_types
:
For a specific row, both value
and typed_value
were non-NULL and typed_value
is not a struct. I suppose we could try to memoize the "not a struct" part in order to avoid the overhead of that matches!
invocation, but (a) checking for a specific enum variant is really cheap; and (b) where would we store the answer between invocations of value
method, given that we don't build any kind of a tree?
} | ||
|
||
/// replaces all instances of Binary with BinaryView in a DataType | ||
fn rewrite_to_view_types(data_type: &DataType) -> DataType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we agree this is the right place for the checks, I should probably rename the function (and make it fallible)?
And also expand it to cover the exhaustive set of valid and invalid data types so there's no confusion about what's legal and what's forbidden. This can be done immediately, even if a given "valid" data type isn't yet supported -- the read will simply fail later on in such cases (exactly the same as already happens today).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
YesI agree checking the types up front as part of construction is 💯 and avoids the potential for errors later on in value
methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... when possible. Some of the new error checks I had to add are row-based, not column-based
} | ||
|
||
#[test] | ||
fn get_variant_partially_shredded_uint8_as_variant() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how exhaustive we want to be about negative testing as a replacement for all these unit tests I deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to worry too much about it. Let's just makes sure each error path is hit
// Is an error case (should be failing as the expected error message indicates) | ||
variant_test_case!( | ||
42, | ||
"Expected an error 'Invalid variant, conflicting value and typed_value`, but got no error" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was just flat out wrong, swallowing the error message that correctly identifies a problem 🤦
// TODO: Once structs are supported, expect "Invalid variant, non-object value with shredded fields" | ||
variant_test_case!(87, "Unsupported typed_value type: Struct("); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an invalid-case test, but the lack of struct support currently masks the real problem.
(another below)
// Is an error case (should be failing as the expected error message indicates) | ||
variant_test_case!( | ||
127, | ||
"Invalid variant data: InvalidArgumentError(\"Received empty bytes\")" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT, the test data has invalid empty ""
metadata column entries, perhaps because the data is manually generated and the author never expected readers to get beyond the schema checks 🤷
// Is an error case (should be failing as the expected error message indicates) | ||
// TODO: Once structs are supported, expect "Invalid variant, non-object value with shredded fields" | ||
variant_test_case!(128, "Unsupported typed_value type: Struct("); | ||
variant_test_case!(129, "Invalid variant data: InvalidArgumentError("); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test verifies an invalid input (value and typed_value both NULL), which the shredding spec mandates should produce Variant::Null
:
If a Variant is missing in a context where a value is required, readers must return a Variant null (00): basic type 0 (primitive) and physical type 0 (null).
The parquet footer for this test is:
Row group 0: count: 1 123.00 B records start: 4 total(compressed): 123 B total(uncompressed):123 B
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
id INT32 _ _ 1 27.00 B 0 "1" / "1"
var.metadata BINARY _ _ 1 36.00 B 0 "0x010000" / "0x010000"
var.value BINARY _ _ 1 30.00 B 1
var.typed_value INT32 _ _ 1 30.00 B 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed the test says this case should return Variant::Null 👍
variant_test_case!(134, "Unsupported typed_value type: Struct("); | ||
variant_test_case!(135); | ||
variant_test_case!(136, "Unsupported typed_value type: List("); | ||
variant_test_case!(137, "Invalid variant data: InvalidArgumentError("); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why this one failed before, but it was apparently for the wrong reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new error seems more like the expected error in cases: https://github.com/apache/parquet-testing/blob/a3d96a65e11e2bbca7d22a894e8313ede90a33a3/shredded_variant/cases.json#L812-L815
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you @scovich -- I think this PR is a nice step forward
In my opinion, it would be best to leave value
as infallable, and instead check the type validity once as part of constructing VariantArray
rather than on each row
let data_type = typed_value.data_type(); | ||
if value.is_some_and(|v| !matches!(data_type, DataType::Struct(_)) && v.is_valid(index)) { | ||
// Only a partially shredded struct is allowed to have values for both columns | ||
panic!("Invalid variant, conflicting value and typed_value"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding try_value
sounds like a good idea to me
However, it seems to me that most of these checks can be done once per array (e.g. this check for value
and compare to the datatype doesn't change row by row, so paying the cost to do the validation on each row feels wasteful to me)
Can we perhaps move this check into the constructor of VariantArray 🤔
DataType::Int64 => { | ||
primitive_conversion_single_value!(Int64Type, typed_value, index) | ||
} | ||
DataType::UInt8 => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand this correctly, the point is that since the Variant spec has no unsigned types, it wouldn't be permissible to shred out such arrow types
https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly. I don't think the shredding spec directly says that, but it's implied because shredding is always presumed to start from binary encoded variant values and is a more efficient representation of the same. So throwing in random other types doesn't really make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, I'm blind... the spec definitely directly says which parquet logical types are allowed for shredded columns -- there's a section for it, including a table:
https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types
} | ||
|
||
/// replaces all instances of Binary with BinaryView in a DataType | ||
fn rewrite_to_view_types(data_type: &DataType) -> DataType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
YesI agree checking the types up front as part of construction is 💯 and avoids the potential for errors later on in value
methods
match data_type { | ||
// Unsigned integers are not allowed at all | ||
DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64 => { | ||
panic!("Illegal shredded value type: {data_type:?}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would be a good place to return errors I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let me quickly fix that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
} | ||
|
||
#[test] | ||
fn get_variant_partially_shredded_uint8_as_variant() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to worry too much about it. Let's just makes sure each error path is hit
// Is an error case (should be failing as the expected error message indicates) | ||
// TODO: Once structs are supported, expect "Invalid variant, non-object value with shredded fields" | ||
variant_test_case!(128, "Unsupported typed_value type: Struct("); | ||
variant_test_case!(129, "Invalid variant data: InvalidArgumentError("); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed the test says this case should return Variant::Null 👍
variant_test_case!(134, "Unsupported typed_value type: Struct("); | ||
variant_test_case!(135); | ||
variant_test_case!(136, "Unsupported typed_value type: List("); | ||
variant_test_case!(137, "Invalid variant data: InvalidArgumentError("); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new error seems more like the expected error in cases: https://github.com/apache/parquet-testing/blob/a3d96a65e11e2bbca7d22a894e8313ede90a33a3/shredded_variant/cases.json#L812-L815
@alamb -- I think I addressed all your comments. I also added complete type checking in the VariantArray constructor now -- every arrow type either succeeds or fails (no catch-all). Some helper methods were renamed accordingly. This does not eliminate the row-oriented checks that are also required, tho. So we still need to solve the problem of panicky |
// We can _possibly_ support (some of) these some day? | ||
LargeBinary | LargeUtf8 | Utf8View | ListView(_) | LargeList(_) | LargeListView(_) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unclear to me what leeway writers have in producing different physical forms of the same logical data. Not just large vs. normal offsets vs. view, but also layout optimizations like dictionary or run-end coding?
Struct(fields) => { | ||
// Avoid allocation unless at least one field was rewritten |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Observation: The original code unconditionally built a new type from the ground up, which defeated the purpose of having Fields
store Arc<Field>
(which makes clone
much more shallow than it would otherwise be). So this code collects the set of fields changed by deeper layers, and only constructs a new struct if at least one field actually changed. And even then, the other fields are just shallow Arc clones. In the common case where no fields changed, the hashmap is empty (no allocations) and we just return a borrowed version of the input data type.
Something seems to have gone wrong with CI?
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @scovich
Ok, this should be ready to go unless sombody has other comments to add? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Sorry for the late reply, I'm struggling with some network issues in recent days.
index: usize, | ||
) -> Variant<'a, 'a> { | ||
let data_type = typed_value.data_type(); | ||
if value.is_some_and(|v| !matches!(data_type, DataType::Struct(_)) && v.is_valid(index)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll panic here if (data_type
is not DataType::Struct(_)
) and (v.is_valid(index)
), do we need to panic if data_type
is DataType::Struct
and v.is_valid(index)
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not need to panic if we have a struct here -- that corresponds to a partially shredded variant object, where the value is a variant object and the typed_value is a struct. Eventually, the code that handles partial shredding will detect if the value is not a variant object or contains field names that conflict with those of the typed_value
, but that will happen in a different location. I have it prototyped locally and can push a PR once this one merges.
}; | ||
} | ||
|
||
let new_data_type = match data_type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
# Which issue does this PR close? - Related to #8336 # Rationale for this change While working on #8438, I noticed that the enum variants of `ShreddingState` actually made the code (a lot) more complex than if I just referenced the (optional) value and typed_value columns directly. That made me wonder if `ShreddingState` would be better as a simple two-field struct. # What changes are included in this PR? Change `ShreddingState` to a two-field struct and update the few call sites that noticed. While we're at it, improve the docs about how shredding works. # Are these changes tested? Existing tests cover what is mostly an internal change # Are there any user-facing changes? `ShreddingState` is pub and changed from enum to struct. --------- Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Closes
Rationale for this change
It turns out we were too permissive in our handling of
typed_value
columns and certain other exceptional cases that parquet's variant integration tests specifically expect readers to reject.What changes are included in this PR?
VariantArray::value
to work directly with (optional)value
andtyped_value
columns instead of theShreddingState
enumrewrite_to_view_types
ascanonicalize_and_verify_data_type
and expand it to also reject all illegal column types (= any that don't map directly to a variant subtype)Are these changes tested?
Yes.
Are there any user-facing changes?
Behavior change: We no longer tolerate invalid-type
typed_value
columns when reading shredded variant data. At least, not in code paths that go throughVariantArray::value
. There may still be some leakage in the shredded path step handling ofvariant_get
.