-
Notifications
You must be signed in to change notification settings - Fork 1k
[Variant] VariantArray::value supports shredded struct access #8446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
/// | ||
/// # Panics | ||
/// * if the index is out of bounds | ||
/// * if the array value is null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// * if the array value is null |
if value.is_some_and(|v| !matches!(data_type, DataType::Struct(_)) && v.is_valid(index)) { | ||
// Only a partially shredded struct is allowed to have values for both columns | ||
panic!("Invalid variant, conflicting value and typed_value"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check moved to the bottom of the method (after struct has already been handled and returned early)
DataType::Float16 => { | ||
primitive_conversion_single_value!(Float16Type, typed_value, index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Variant integration fixes #8438 missed this case
// Track all shredded field names -- we must ignore them when processing value fields below. | ||
let mut shredded_field_names = std::collections::HashSet::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to build the hash table just once per array, instead of rebuilding at every row access, but there's nowhere to materialize it -- VariantArray
doesn't build any kind of tree and instead directly "mounts" the underlying arrays. So this just adds to the inefficiency that VariantArray::value
docs warn about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, it would be even nicer if Fields
were an IndexMap
, which gives O(1) name lookup that many consumers all over arrow would benefit from. But that would be a pretty major change and potentially impacts efficiency as well.
variant_test_case!(125); | ||
variant_test_case!(126, "Unsupported typed_value type: List("); | ||
// Is an error case (should be failing as the expected error message indicates) | ||
// Is an error case (error message mentions arrow data type instead of parquet logical type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revert, bad merge
)); | ||
}; | ||
for (obj_field_name, obj_field_value) in obj.iter() { | ||
if !shredded_field_names.contains(obj_field_name) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The shredding spec says:
The value column of a partially shredded object must never contain fields represented by the Parquet columns in typed_value (shredded fields). Readers may always assume that data is written correctly and that shredded fields in typed_value are not present in value. As a result, reads when a field is defined in both value and a typed_value shredded field may be inconsistent.
So in theory we don't need this check. But variant integration test cases 43 (testPartiallyShreddedObjectMissingFieldConflict) and 125 (testPartiallyShreddedObjectFieldConflict) both have conflicting value fields they expect readers to ignore.
Is this actually a bug in those tests that we should file an issue for? Or do we keep the current code?
CC @alamb
I will review this one carefully tomorrow morning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @scovich -- I am sorry it took me so long to review this PR
I think I finally understand what you are getting at -- that in order to support value()
for a shredded object we have to be able to reconstruct it somehow
Before we head down the path of having owned Variants, I would really like to explore the option of adding a new Variant::ShreddedObject
that only has references to see what that would look like
I am mostly hoping we can avoid/minimize any per-row allocations when traversing shredded variants.
Would you be willing to give this a try? If not I will try and find time in the next day or two
let metadata = VariantMetadata::new(self.metadata.value(index)); | ||
let mut builder = SingleValueVariantBuilder::new(metadata.clone()); | ||
typed_value_to_variant(typed_value, value, index, &metadata, &mut builder)?; | ||
return Ok(VariantArrayValue::owned( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I see -- this is the core conundrum. If we want to return a Variant
that has references, we can't unshred the Variant into a new object, we have to have some way to return something that references the underling shredded object that may have multiple fields).
I think we are looking at other approaches at the moment, so marking this one as draft |
# Which issue does this PR close? - Closes #8336 - Closes #8334 # Rationale for this change The `VariantArray::value` method was really inefficient but had quietly crept into important code paths like `variant_get`. Since the complex and inefficient code was in support of variant unshredding, we should just add and use a proper `unshred_variant` function (which uses row builders like the other variant manipulating functions). # What changes are included in this PR? * Define the new `unshred_variant` function, which does what it says. It supports all the types `typed_value_to_variant` supported, plus Time64 and Struct as a bonus. The former because it was ~10LoC and the latter because it demonstrates the superiority of this new approach vs. e.g. #8446 * Wire up `variant_get` unshredding path to call it, which immediately benefits from all that function's existing test coverage. * Update the variant_integration test to `unshred_variant` instead of looping over rows calling `value(i)`. # Are these changes tested? Yes, a bunch of variant integration tests now pass that used to fail. # Are there any user-facing changes? Several new pub methods. I don't think any breaking changes.
VariantArray::value
to work with owned value bytes #8430, ignore the first three commits.Which issue does this PR close?
Struct
#8336Rationale for this change
Adding missing feature.
What changes are included in this PR?
Change
VariantArray::value
and its helper methods to use a newVariantBuilderExt
(calledSingleValueVariantBuilder
), which is similar to theVariantValueArrayBuilder
in its use of read-only metadata.Add a new
struct_typed_value_to_variant
helper for recursing into shredded structs.Add a new fallible
VariantArray::try_value
and makeVariantArray::value
unwrap it.Are these changes tested?
Yes! A bunch of variant integration tests now pass.
Are there any user-facing changes?
Add a new public
VariantArray::try_value
method.Make
ValueBuilder::append_null
public (not sure why it wasn't already)