Contributor

@scovich scovich commented Sep 26, 2025

Which issue does this PR close?

Rationale for this change

The VariantArray::value method was really inefficient but had quietly crept into important code paths like variant_get. Since the complex and inefficient code was in support of variant unshredding, we should just add and use a proper unshred_variant function (which uses row builders like the other variant manipulating functions).

What changes are included in this PR?

  • Define the new unshred_variant function, which does what it says. It supports all the types typed_value_to_variant supported, plus Time64 and Struct as a bonus. The former because it was ~10LoC and the latter because it demonstrates the superiority of this new approach vs. e.g. [Variant] VariantArray::value supports shredded struct access #8446
  • Wire up variant_get unshredding path to call it, which immediately benefits from all that function's existing test coverage.
  • Update the variant_integration test to unshred_variant instead of looping over rows calling value(i).
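
For illustration, here is a minimal sketch of what column-at-a-time unshredding means. It assumes a toy model: `ToyVariant`, `ToyShreddedColumn`, and `toy_unshred` are hypothetical names invented for this example and are not the arrow-rs API. In the toy, each row holds either a strongly typed value or a fallback variant value, and one pass appends every row to a single output buffer instead of materializing rows one by one.

```rust
// Toy model only: these types are hypothetical stand-ins, not the arrow-rs API.

/// A toy "variant" value; the real format is a binary encoding.
#[derive(Debug, Clone, PartialEq)]
enum ToyVariant {
    Null,
    Int(i64),
    Str(String),
}

/// A toy partially shredded column. Per row, the strongly typed
/// `typed_value` may be set, the fallback `value` may be set, or neither.
struct ToyShreddedColumn {
    value: Vec<Option<ToyVariant>>,
    typed_value: Vec<Option<i64>>,
}

/// Unshreds the whole column in one pass, appending each row to one output.
fn toy_unshred(col: &ToyShreddedColumn) -> Vec<ToyVariant> {
    let mut out = Vec::with_capacity(col.value.len());
    for (value, typed) in col.value.iter().zip(&col.typed_value) {
        out.push(match (typed, value) {
            // Assumption of this toy: a typed value wins if one is present.
            (Some(i), _) => ToyVariant::Int(*i),
            // Otherwise copy the fallback variant value unchanged.
            (None, Some(v)) => v.clone(),
            // Assumption of this toy: neither column set means null.
            (None, None) => ToyVariant::Null,
        });
    }
    out
}

fn main() {
    let col = ToyShreddedColumn {
        value: vec![None, Some(ToyVariant::Str("hello".to_string())), None],
        typed_value: vec![Some(42), None, None],
    };
    let unshredded = toy_unshred(&col);
    assert_eq!(unshredded.len(), 3);
    assert_eq!(unshredded[0], ToyVariant::Int(42));
    println!("{:?}", unshredded);
}
```

The real function works on Arrow arrays with row builders and handles many more types; the sketch only shows the shape of the single-pass approach.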

Are these changes tested?

Yes, a bunch of variant integration tests now pass that used to fail.

Are there any user-facing changes?

Several new pub methods. I don't think any breaking changes.

@github-actions github-actions bot added the parquet (Changes to the parquet crate) and parquet-variant (parquet-variant* crates) labels Sep 26, 2025
@scovich
Contributor Author

scovich commented Sep 26, 2025

CC @alamb @liamzwbao

@scovich
Contributor Author

scovich commented Sep 29, 2025

Reverting to draft temporarily, there are a bunch of changes in flight.

@alamb
Contributor

alamb commented Sep 29, 2025

This looks great -- thanks @scovich

I will review the other PRs first

Marking this as a draft per your other comment

@alamb alamb marked this pull request as draft September 29, 2025 19:02
@scovich
Contributor Author

scovich commented Sep 29, 2025

Update: Once all of the following prefactoring PRs merge, I can rebase this PR and it will be ready for review:

I also have variant array unshredding support working locally, ~100LoC, which can either go into this PR or become a stacked PR on top.

@scovich scovich marked this pull request as ready for review September 30, 2025 12:47
@scovich
Contributor Author

scovich commented Sep 30, 2025

This should be ready for final review. List support will stack on top very shortly.

@scovich
Contributor Author

scovich commented Sep 30, 2025

@alamb one question -- Once this change lands, should we delete the unshredding code from VariantArray::value, so that it can only work with unshredded binary variant?

  • PRO: One fewer (complex) code path to maintain
  • CON: The value method becomes harder to use

Contributor

@alamb alamb left a comment

👏 🏆 -- I think this approach is super elegant @scovich -- very nicely done

I think this is a much clearer mechanism than trying to unshred in value

@alamb one question -- Once this change lands, should we delete the unshredding code from VariantArray::value, so that it can only work with unshredded binary variant?

I think this is my preference for now. My rationale is that with unshred_variant there is now a way to get a Variant to work with if needed. And it has the nice property that it will work naturally with variant_get ❤️

If it turns out there is an important use case for a row-level unshredder (aka what value() does today), we can always add it back in

Overlap with cast_to_variant implementation

It seems to me like the unshred_variant in this PR subsumes the code in cast_to_variant -- we could probably switch cast_to_variant to create a "dummy" variant array and call into unshred_variant 🤔

https://github.com/apache/arrow-rs/blob/e2db7d4c444a76684c1b17931823367f01459df7/parquet-variant-compute/src/cast_to_variant.rs#L56-L55

An optimization for another PR though I think

cc @liamzwbao @klion26 and @codephage2020

@alamb
Contributor

alamb commented Sep 30, 2025

If we merge this, I think we should file follow on tickets for:

  1. Remove unshredding code in VariantArray::value
  2. Consolidate the implementations of unshred_variant and cast_to_variant

@scovich
Contributor Author

scovich commented Sep 30, 2025

It seems to me like the unshred_variant in this PR subsumes the code in cast_to_variant -- we could probably switch cast_to_variant to create a "dummy" variant array and call into unshred_variant 🤔

There's definitely some overlap and I may have missed some opportunities for builder code reuse, but the two functions are not doing the same thing:

  • cast_to_variant exists to convert fully strongly-typed data to binary variant
    • Among other things, this entails building a new metadata column from scratch, because no metadata dictionaries exist yet.
    • NOTE: The function can't even detect if a caller passes variant data -- shredded or otherwise -- because it takes a &dyn Array that carries no extension type info. So it has to assume that any StructArray it sees is an actual struct array whose rows should be converted to variant objects, and whose fields might just happen to be called metadata and value.
  • unshred_variant does what it says -- unshred already-converted variant data
    • It always works with a read-only metadata builder
    • It needs to handle mixed shredding situations (including partially shredded objects), that can require converting strongly-typed values to variant on some rows while copying variant bytes for other rows.
    • This function is less vulnerable to type mismatches because it takes &VariantArray instead of &dyn Array -- and the VariantArray constructor does at least some basic structural validation. We still technically can't distinguish between a column with variant structure vs. an actual variant tho -- the caller has to check the array's owning field for extension type metadata.
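
To make the metadata distinction above concrete, here is a minimal sketch under a toy model. `ToyDictBuilder` and `ToyReadOnlyDict` are hypothetical names made up for this example, not the crate's builders. The first grows a field-name dictionary from scratch, as a cast from strongly typed data would need to. The second only looks names up in a dictionary that already exists, as unshredding would. Treating an unknown name as an error is an assumption of the sketch.

```rust
// Toy model only: hypothetical types, not the arrow-rs metadata builders.

/// Builds a field-name dictionary from scratch (the cast-like case).
#[derive(Default)]
struct ToyDictBuilder {
    names: Vec<String>,
}

impl ToyDictBuilder {
    /// Returns the id for `name`, inserting it if it is new.
    fn upsert(&mut self, name: &str) -> usize {
        if let Some(id) = self.names.iter().position(|n| n == name) {
            return id;
        }
        self.names.push(name.to_string());
        self.names.len() - 1
    }
}

/// Wraps an existing dictionary without modifying it (the unshred-like case).
struct ToyReadOnlyDict<'a> {
    names: &'a [String],
}

impl ToyReadOnlyDict<'_> {
    /// Looks up `name`. Assumption of this toy: an unknown name is an
    /// error, because a read-only dictionary cannot grow.
    fn lookup(&self, name: &str) -> Result<usize, String> {
        self.names
            .iter()
            .position(|n| n == name)
            .ok_or_else(|| format!("field name {name:?} missing from existing metadata"))
    }
}

fn main() {
    // Cast-like: ids are assigned as new names are first seen.
    let mut builder = ToyDictBuilder::default();
    assert_eq!(builder.upsert("a"), 0);
    assert_eq!(builder.upsert("b"), 1);
    assert_eq!(builder.upsert("a"), 0);

    // Unshred-like: ids come from the dictionary that is already there.
    let existing = vec!["a".to_string(), "b".to_string()];
    let dict = ToyReadOnlyDict { names: &existing };
    assert_eq!(dict.lookup("b"), Ok(1));
    assert!(dict.lookup("c").is_err());
}
```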

@alamb
Contributor

alamb commented Sep 30, 2025

It seems to me like the unshred_variant in this PR subsumes the code in cast_to_variant -- we could probably switch cast_to_variant to create a "dummy" variant array and call into unshred_variant 🤔

There's definitely some overlap and I may have missed some opportunities for builder code reuse, but the two functions are not doing the same thing:

I didn't mean to imply this PR should be changed

  • cast_to_variant exists to convert fully strongly-typed data to binary variant

    • Among other things, this entails building a new metadata column from scratch, because no metadata dictionaries exist yet.

That is true, but I think most of the arrow types that are converted are primitive types (and thus have no metadata to convert -- the output array is just the same metadata (0 byte) repeated over and over)
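
As a rough sketch of the "same metadata repeated" point: assuming the Variant binary encoding's metadata layout (a header byte carrying version 1 and 1-byte offsets, then the dictionary size, then the offsets), a dictionary with zero entries is a fixed three-byte sequence. The constant and helper names below are made up for the example.

```rust
/// Assumed encoding of an empty variant metadata dictionary:
/// header 0x01 (version 1, 1-byte offsets), dictionary size 0, one end offset 0.
const EMPTY_METADATA: [u8; 3] = [0x01, 0x00, 0x00];

/// Produces one metadata entry per row; every row gets identical bytes.
fn repeated_metadata(num_rows: usize) -> Vec<Vec<u8>> {
    vec![EMPTY_METADATA.to_vec(); num_rows]
}

fn main() {
    let metadata = repeated_metadata(4);
    assert_eq!(metadata.len(), 4);
    // All rows share the same empty-dictionary bytes.
    assert!(metadata.iter().all(|m| m.as_slice() == EMPTY_METADATA));
}
```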

  • NOTE: The function can't even detect if a caller passes variant data -- shredded or otherwise -- because it takes a &dyn Array that carries no extension type info. So it has to assume that any StructArray it sees is an actual struct array whose rows should be converted to variant objects, and whose fields might just happen to be called metadata and value.

🤔 that is a good point

@scovich
Contributor Author

scovich commented Sep 30, 2025

It seems to me like the unshred_variant in this PR subsumes the code in cast_to_variant -- we could probably switch cast_to_variant to create a "dummy" variant array and call into unshred_variant 🤔

There's definitely some overlap and I may have missed some opportunities for builder code reuse, but the two functions are not doing the same thing:

I didn't mean to imply this PR should be changed

I think now that #8519 has merged, it should be straightforward to coalesce some of the similar code between the two modules. Definitely a follow-up item tho.

@alamb alamb merged commit ca3b3be into apache:main Oct 1, 2025
20 checks passed
@alamb
Contributor

alamb commented Oct 1, 2025

🚀

alamb pushed a commit that referenced this pull request Oct 2, 2025
* NOTE: Stacked on #8481, ignore
the first commit when reviewing.

# Which issue does this PR close?

- Closes #8337

# Rationale for this change

Add a missing feature.

# What changes are included in this PR?

Leveraging the recently added `ListLikeArray` trait, support all five
list types when unshredding variant data.

# Are these changes tested?

Yes -- all the list-related variant shredding integration tests pass
now.

# Are there any user-facing changes?

No.

Successfully merging this pull request may close these issues.

[Variant] [Shredding] Support typed_access for Struct
[Variant] [Shredding] Support typed_access for Time64(Microsecond)