@scovich scovich commented Sep 25, 2025

Which issue does this PR close?

Rationale for this change

Adding missing feature.

What changes are included in this PR?

Change VariantArray::value and its helper methods to use a new VariantBuilderExt (called SingleValueVariantBuilder), which is similar to the VariantValueArrayBuilder in its use of read-only metadata.

Add a new struct_typed_value_to_variant helper for recursing into shredded structs.

Add a new fallible VariantArray::try_value and make VariantArray::value unwrap it.
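
The fallible/infallible pairing described above can be sketched roughly as follows. This is a minimal illustration only: `ToyVariantArray`, `InvalidVariant`, and the `String` result are hypothetical stand-ins, not the real `VariantArray` API (which works on Arrow arrays and returns variant values).

```rust
/// Hypothetical error type for this sketch; not the crate's real error type.
#[derive(Debug, PartialEq)]
pub struct InvalidVariant(pub String);

/// Hypothetical stand-in for a variant array: each row has an optional
/// `value` entry and an optional `typed_value` entry.
pub struct ToyVariantArray {
    pub value: Vec<Option<String>>,
    pub typed_value: Vec<Option<i64>>,
}

impl ToyVariantArray {
    /// Fallible accessor: malformed rows are reported as errors.
    /// Still panics if `index` is out of bounds.
    pub fn try_value(&self, index: usize) -> Result<Option<String>, InvalidVariant> {
        match (&self.value[index], &self.typed_value[index]) {
            // A non-struct row must not populate both columns.
            (Some(_), Some(_)) => Err(InvalidVariant(
                "conflicting value and typed_value".to_string(),
            )),
            (None, Some(typed)) => Ok(Some(typed.to_string())),
            (Some(v), None) => Ok(Some(v.clone())),
            (None, None) => Ok(None),
        }
    }

    /// Infallible wrapper: unwraps `try_value`, so malformed rows panic.
    pub fn value(&self, index: usize) -> Option<String> {
        self.try_value(index).unwrap()
    }
}
```

Callers that cannot rule out malformed input would use `try_value`; `value` keeps the existing panicking behavior for everyone else.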

Are these changes tested?

Yes! A bunch of variant integration tests now pass.

Are there any user-facing changes?

Add a new public VariantArray::try_value method.

Make ValueBuilder::append_null public (not sure why it wasn't already).

@github-actions github-actions bot added the parquet (Changes to the parquet crate) and parquet-variant (parquet-variant* crates) labels Sep 25, 2025
scovich commented Sep 25, 2025

CC @alamb @klion26 @codephage2020

///
/// # Panics
/// * if the index is out of bounds
/// * if the array value is null
Contributor Author

Suggested change
/// * if the array value is null

Comment on lines -911 to -914
if value.is_some_and(|v| !matches!(data_type, DataType::Struct(_)) && v.is_valid(index)) {
// Only a partially shredded struct is allowed to have values for both columns
panic!("Invalid variant, conflicting value and typed_value");
}
Contributor Author

This check moved to the bottom of the method (after struct has already been handled and returned early)

Comment on lines -955 to -956
DataType::Float16 => {
primitive_conversion_single_value!(Float16Type, typed_value, index)
Contributor Author

Comment on lines +1087 to +1088
// Track all shredded field names -- we must ignore them when processing value fields below.
let mut shredded_field_names = std::collections::HashSet::new();
Contributor Author

It would be nice to build the hash table just once per array, instead of rebuilding at every row access, but there's nowhere to materialize it -- VariantArray doesn't build any kind of tree and instead directly "mounts" the underlying arrays. So this just adds to the inefficiency that VariantArray::value docs warn about.
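
To make the cost concrete, here is a rough sketch contrasting the per-row rebuild with a build-once alternative. All names (`unshredded_fields_per_row`, `ShreddedNames`) are hypothetical and exist only for illustration; as noted above, `VariantArray` currently has no place to store such a cached set.

```rust
use std::collections::HashSet;

/// Per-row approach: the set of shredded field names is rebuilt on every
/// call, which is what happens when it is built inside a row accessor.
pub fn unshredded_fields_per_row<'a>(
    shredded: &[&str],
    row_fields: &[&'a str],
) -> Vec<&'a str> {
    let names: HashSet<&str> = shredded.iter().copied().collect();
    row_fields
        .iter()
        .copied()
        .filter(|f| !names.contains(*f))
        .collect()
}

/// Build-once approach: the set is materialized a single time and reused
/// for every row. Hypothetical; it needs somewhere to live between calls.
pub struct ShreddedNames<'a> {
    names: HashSet<&'a str>,
}

impl<'a> ShreddedNames<'a> {
    pub fn new(shredded: &[&'a str]) -> Self {
        Self {
            names: shredded.iter().copied().collect(),
        }
    }

    /// Returns the row's fields that are not shredded, preserving order.
    pub fn unshredded<'b>(&self, row_fields: &[&'b str]) -> Vec<&'b str> {
        row_fields
            .iter()
            .copied()
            .filter(|f| !self.names.contains(*f))
            .collect()
    }
}
```

Both produce the same result per row; the difference is only that the first pays the hashing cost of every shredded name on each access.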

Contributor Author

Honestly, it would be even nicer if Fields were an IndexMap, which gives O(1) name lookup that many consumers all over arrow would benefit from. But that would be a pretty major change and potentially impacts efficiency as well.
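
For readers unfamiliar with the idea, an index-map style container keeps fields in positional order while also offering hashed lookup by name. The following std-only sketch shows the shape of it; `IndexedFields` is a hypothetical type, not arrow's `Fields` and not the `indexmap` crate, and the "first occurrence wins" rule for duplicate names is an arbitrary choice made for this example.

```rust
use std::collections::HashMap;

/// Hypothetical ordered field list with a name-to-position side index.
pub struct IndexedFields {
    names: Vec<String>,
    by_name: HashMap<String, usize>,
}

impl IndexedFields {
    pub fn new(names: &[&str]) -> Self {
        let mut by_name = HashMap::with_capacity(names.len());
        for (i, name) in names.iter().enumerate() {
            // Keep the first position if a name appears more than once.
            by_name.entry(name.to_string()).or_insert(i);
        }
        Self {
            names: names.iter().map(|n| n.to_string()).collect(),
            by_name,
        }
    }

    /// Average O(1) lookup of a field's position by name.
    pub fn index_of(&self, name: &str) -> Option<usize> {
        self.by_name.get(name).copied()
    }

    /// Positional access, as with a plain list of fields.
    pub fn name_at(&self, index: usize) -> Option<&str> {
        self.names.get(index).map(String::as_str)
    }
}
```

The extra hash map is the efficiency trade-off mentioned above: lookups get cheaper, but construction and memory use grow.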

variant_test_case!(125);
variant_test_case!(126, "Unsupported typed_value type: List(");
// Is an error case (should be failing as the expected error message indicates)
// Is an error case (error message mentions arrow data type instead of parquet logical type)
Contributor Author

revert, bad merge

));
};
for (obj_field_name, obj_field_value) in obj.iter() {
if !shredded_field_names.contains(obj_field_name) {
Contributor Author

The shredding spec says:

The value column of a partially shredded object must never contain fields represented by the Parquet columns in typed_value (shredded fields). Readers may always assume that data is written correctly and that shredded fields in typed_value are not present in value. As a result, reads when a field is defined in both value and a typed_value shredded field may be inconsistent.

So in theory we don't need this check. But variant integration test cases 43 (testPartiallyShreddedObjectMissingFieldConflict) and 125 (testPartiallyShreddedObjectFieldConflict) both have conflicting value fields they expect readers to ignore.

Is this actually a bug in those tests that we should file an issue for? Or do we keep the current code?

CC @alamb
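
The "ignore conflicting value fields" behavior under discussion can be sketched as below. This is a simplified model, not the PR's code: `merge_partially_shredded` is a hypothetical helper, fields are plain `i64`s rather than variants, and a `None` shredded value is assumed to model a shredded field that is missing for the row.

```rust
use std::collections::{BTreeMap, HashSet};

/// Rebuilds one object row from its shredded (`typed_value`) fields and its
/// residual `value` fields. Any `value` field whose name matches a shredded
/// field is ignored, even if the shredded field is missing for this row.
pub fn merge_partially_shredded(
    typed_fields: &[(&str, Option<i64>)],
    value_fields: &[(&str, i64)],
) -> BTreeMap<String, i64> {
    // Track all shredded field names, present or not.
    let mut shredded_field_names = HashSet::new();
    let mut out = BTreeMap::new();
    for (name, v) in typed_fields {
        shredded_field_names.insert(*name);
        if let Some(v) = v {
            out.insert(name.to_string(), *v);
        }
    }
    for (name, v) in value_fields {
        // Skip conflicting fields instead of failing or overwriting.
        if !shredded_field_names.contains(*name) {
            out.insert(name.to_string(), *v);
        }
    }
    out
}
```

Dropping the `contains` check would make the result depend on insertion order, which is the kind of inconsistency the spec text warns about.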

alamb commented Sep 25, 2025

I will review this one carefully tomorrow morning

@alamb alamb left a comment

Thank you @scovich -- I am sorry it took me so long to review this PR

I think I finally understand what you are getting at -- that in order to support value() for a shredded object we have to be able to reconstruct it somehow

Before we head down the path of having owned Variants, I would really like to explore the option of adding a new Variant::ShreddedObject that only has references to see what that would look like

I am mostly hoping we can avoid/minimize any per-row allocations when traversing shredded variants.

Would you be willing to give this a try? If not I will try and find time in the next day or two

let metadata = VariantMetadata::new(self.metadata.value(index));
let mut builder = SingleValueVariantBuilder::new(metadata.clone());
typed_value_to_variant(typed_value, value, index, &metadata, &mut builder)?;
return Ok(VariantArrayValue::owned(
Contributor

I think I see -- this is the core conundrum. If we want to return a Variant that has references, we can't unshred the Variant into a new object; we have to have some way to return something that references the underlying shredded object (which may have multiple fields).
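
One way to picture the trade-off is a value that is either borrowed from the array or owned because it had to be rebuilt. The sketch below uses `std::borrow::Cow` for that; `ToyValue` and `row_value` are hypothetical, and concatenating bytes is only a placeholder for real unshredding.

```rust
use std::borrow::Cow;

/// Hypothetical row value: borrowed when the row can be returned as-is,
/// owned when it had to be reconstructed.
pub struct ToyValue<'a> {
    bytes: Cow<'a, [u8]>,
}

impl<'a> ToyValue<'a> {
    pub fn borrowed(bytes: &'a [u8]) -> Self {
        Self { bytes: Cow::Borrowed(bytes) }
    }

    pub fn owned(bytes: Vec<u8>) -> Self {
        Self { bytes: Cow::Owned(bytes) }
    }

    pub fn as_bytes(&self) -> &[u8] {
        &self.bytes
    }

    pub fn is_borrowed(&self) -> bool {
        matches!(self.bytes, Cow::Borrowed(_))
    }
}

/// Unshredded rows borrow from the `value` column with no allocation.
/// Shredded rows allocate a new buffer on every access, which is the
/// per-row cost the review hopes to avoid.
pub fn row_value<'a>(value_col: &'a [u8], typed_col: Option<&[u8]>) -> ToyValue<'a> {
    match typed_col {
        None => ToyValue::borrowed(value_col),
        Some(typed) => {
            // Placeholder for unshredding: combine both sources.
            let mut rebuilt = Vec::with_capacity(typed.len() + value_col.len());
            rebuilt.extend_from_slice(typed);
            rebuilt.extend_from_slice(value_col);
            ToyValue::owned(rebuilt)
        }
    }
}
```

A reference-only representation, as proposed in the review, would avoid the owned branch entirely, at the cost of a value type that has to understand the shredded layout.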

alamb commented Sep 30, 2025

I think we are looking at other approaches at the moment, so marking this one as draft

@alamb alamb marked this pull request as draft September 30, 2025 20:23
alamb pushed a commit that referenced this pull request Oct 1, 2025
# Which issue does this PR close?

- Closes #8336
- Closes #8334

# Rationale for this change

The `VariantArray::value` method was really inefficient but had quietly
crept into important code paths like `variant_get`. Since the complex
and inefficient code was in support of variant unshredding, we should
just add and use a proper `unshred_variant` function (which uses row
builders like the other variant manipulating functions).

# What changes are included in this PR?

* Define the new `unshred_variant` function, which does what it says. It
supports all the types `typed_value_to_variant` supported, plus Time64
and Struct as a bonus. The former because it was ~10LoC and the latter
because it demonstrates the superiority of this new approach vs. e.g.
#8446
* Wire up `variant_get` unshredding path to call it, which immediately
benefits from all that function's existing test coverage.
* Update the variant_integration test to call `unshred_variant` instead of
looping over rows calling `value(i)`.

# Are these changes tested?

Yes, a bunch of variant integration tests now pass that used to fail.

# Are there any user-facing changes?

Several new pub methods. I don't think any breaking changes.
Successfully merging this pull request may close these issues.

[Variant] [Shredding] Support typed_access for Struct