-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[Variant] Add variant_get and Shredded VariantArray
#8021
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
3ba9fe0
f458ec5
52cf52b
6b66650
924155c
5cb1068
9c22f85
eee199e
264356b
336a0a3
016cee1
3e8820b
55a7ae6
85e7194
6e38490
0ccd0f8
f8b2df4
2355247
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
ShreddingState::try_new
- Loading branch information
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -52,85 +52,6 @@ pub struct VariantArray { | |||||||||||
| shredding_state: ShreddingState, | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| /// Variant arrays can be shredded in one of three states, encoded here | ||||||||||||
| #[derive(Debug)] | ||||||||||||
| pub enum ShreddingState { | ||||||||||||
| /// This variant has no typed_value field | ||||||||||||
| Unshredded { | ||||||||||||
| metadata: BinaryViewArray, | ||||||||||||
| value: BinaryViewArray, | ||||||||||||
| }, | ||||||||||||
| /// This variant has a typed_value field and no value field | ||||||||||||
| /// meaning it is fully shredded (aka the value is stored in typed_value) | ||||||||||||
| FullyShredded { | ||||||||||||
| metadata: BinaryViewArray, | ||||||||||||
| typed_value: ArrayRef, | ||||||||||||
| }, | ||||||||||||
| /// This variant has both a value field and a typed_value field | ||||||||||||
| /// meaning it is partially shredded: first the typed_value is used, and | ||||||||||||
| /// if that is null, the value field is used. | ||||||||||||
| PartiallyShredded { | ||||||||||||
| metadata: BinaryViewArray, | ||||||||||||
| value: BinaryViewArray, | ||||||||||||
| typed_value: ArrayRef, | ||||||||||||
| }, | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| impl ShreddingState { | ||||||||||||
| /// Return a reference to the metadata field | ||||||||||||
| pub fn metadata_field(&self) -> &BinaryViewArray { | ||||||||||||
| match self { | ||||||||||||
| ShreddingState::Unshredded { metadata, .. } => metadata, | ||||||||||||
| ShreddingState::FullyShredded { metadata, .. } => metadata, | ||||||||||||
| ShreddingState::PartiallyShredded { metadata, .. } => metadata, | ||||||||||||
| } | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| /// Return a reference to the value field, if present | ||||||||||||
| pub fn value_field(&self) -> Option<&BinaryViewArray> { | ||||||||||||
| match self { | ||||||||||||
| ShreddingState::Unshredded { value, .. } => Some(value), | ||||||||||||
| ShreddingState::FullyShredded { .. } => None, | ||||||||||||
| ShreddingState::PartiallyShredded { value, .. } => Some(value), | ||||||||||||
| } | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| /// Return a reference to the typed_value field, if present | ||||||||||||
| pub fn typed_value_field(&self) -> Option<&ArrayRef> { | ||||||||||||
| match self { | ||||||||||||
| ShreddingState::Unshredded { .. } => None, | ||||||||||||
| ShreddingState::FullyShredded { typed_value, .. } => Some(typed_value), | ||||||||||||
| ShreddingState::PartiallyShredded { typed_value, .. } => Some(typed_value), | ||||||||||||
| } | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| /// Slice all the underlying arrays | ||||||||||||
| pub fn slice(&self, offset: usize, length: usize) -> Self { | ||||||||||||
| match self { | ||||||||||||
| ShreddingState::Unshredded { metadata, value } => ShreddingState::Unshredded { | ||||||||||||
| metadata: metadata.slice(offset, length), | ||||||||||||
| value: value.slice(offset, length), | ||||||||||||
| }, | ||||||||||||
| ShreddingState::FullyShredded { | ||||||||||||
| metadata, | ||||||||||||
| typed_value, | ||||||||||||
| } => ShreddingState::FullyShredded { | ||||||||||||
| metadata: metadata.slice(offset, length), | ||||||||||||
| typed_value: typed_value.slice(offset, length), | ||||||||||||
| }, | ||||||||||||
| ShreddingState::PartiallyShredded { | ||||||||||||
| metadata, | ||||||||||||
| value, | ||||||||||||
| typed_value, | ||||||||||||
| } => ShreddingState::PartiallyShredded { | ||||||||||||
| metadata: metadata.slice(offset, length), | ||||||||||||
| value: value.slice(offset, length), | ||||||||||||
| typed_value: typed_value.slice(offset, length), | ||||||||||||
| }, | ||||||||||||
| } | ||||||||||||
| } | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| impl VariantArray { | ||||||||||||
| /// Creates a new `VariantArray` from a [`StructArray`]. | ||||||||||||
| /// | ||||||||||||
|
|
@@ -200,27 +121,8 @@ impl VariantArray { | |||||||||||
|
|
||||||||||||
| // Note these clones are cheap, they just bump the ref count | ||||||||||||
| let inner = inner.clone(); | ||||||||||||
| let metadata = metadata.clone(); | ||||||||||||
| let value = value.cloned(); | ||||||||||||
| let typed_value = typed_value.cloned(); | ||||||||||||
|
|
||||||||||||
| let shredding_state = match (metadata, value, typed_value) { | ||||||||||||
| (metadata, Some(value), Some(typed_value)) => ShreddingState::PartiallyShredded { | ||||||||||||
| metadata, | ||||||||||||
| value, | ||||||||||||
| typed_value, | ||||||||||||
| }, | ||||||||||||
| (metadata, Some(value), None) => ShreddingState::Unshredded { metadata, value }, | ||||||||||||
| (metadata, None, Some(typed_value)) => ShreddingState::FullyShredded { | ||||||||||||
| metadata, | ||||||||||||
| typed_value, | ||||||||||||
| }, | ||||||||||||
| (_metadata_field, None, None) => { | ||||||||||||
| return Err(ArrowError::InvalidArgumentError(String::from( | ||||||||||||
| "VariantArray has neither value nor typed_value field", | ||||||||||||
| ))); | ||||||||||||
| } | ||||||||||||
| }; | ||||||||||||
| let shredding_state = | ||||||||||||
| ShreddingState::try_new(metadata.clone(), value.cloned(), typed_value.cloned())?; | ||||||||||||
|
|
||||||||||||
| Ok(Self { | ||||||||||||
| inner, | ||||||||||||
|
|
@@ -308,6 +210,108 @@ impl VariantArray { | |||||||||||
| } | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| /// Variant arrays can be shredded in one of three states, encoded here | ||||||||||||
| #[derive(Debug)] | ||||||||||||
| pub enum ShreddingState { | ||||||||||||
| /// This variant has no typed_value field | ||||||||||||
| Unshredded { | ||||||||||||
| metadata: BinaryViewArray, | ||||||||||||
| value: BinaryViewArray, | ||||||||||||
| }, | ||||||||||||
| /// This variant has a typed_value field and no value field | ||||||||||||
| /// meaning it is fully shredded (aka the value is stored in typed_value) | ||||||||||||
| FullyShredded { | ||||||||||||
| metadata: BinaryViewArray, | ||||||||||||
| typed_value: ArrayRef, | ||||||||||||
| }, | ||||||||||||
| /// This variant has both a value field and a typed_value field | ||||||||||||
| /// meaning it is partially shredded: first the typed_value is used, and | ||||||||||||
| /// if that is null, the value field is used. | ||||||||||||
| PartiallyShredded { | ||||||||||||
| metadata: BinaryViewArray, | ||||||||||||
| value: BinaryViewArray, | ||||||||||||
| typed_value: ArrayRef, | ||||||||||||
| }, | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| impl ShreddingState { | ||||||||||||
| /// try to create a new `ShreddingState` from the given fields | ||||||||||||
| pub fn try_new( | ||||||||||||
| metadata: BinaryViewArray, | ||||||||||||
| value: Option<BinaryViewArray>, | ||||||||||||
| typed_value: Option<ArrayRef>, | ||||||||||||
| ) -> Result<Self, ArrowError> { | ||||||||||||
| match (metadata, value, typed_value) { | ||||||||||||
| (metadata, Some(value), Some(typed_value)) => Ok(Self::PartiallyShredded { | ||||||||||||
| metadata, | ||||||||||||
| value, | ||||||||||||
| typed_value, | ||||||||||||
| }), | ||||||||||||
| (metadata, Some(value), None) => Ok(Self::Unshredded { metadata, value }), | ||||||||||||
| (metadata, None, Some(typed_value)) => Ok(Self::FullyShredded { | ||||||||||||
| metadata, | ||||||||||||
| typed_value, | ||||||||||||
| }), | ||||||||||||
| (_metadata_field, None, None) => Err(ArrowError::InvalidArgumentError(String::from( | ||||||||||||
| "VariantArray has neither value nor typed_value field", | ||||||||||||
|
Comment on lines
+273
to
+274
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm pretty sure the shredding spec allows this case. It corresponds to a perfectly shredded all-null column where the writer chose to omit both of them as unnecessary. (see other comment above)
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Makes sense -- I'll file a follow on ticket to add support
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||||||||||||
| ))), | ||||||||||||
| } | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| /// Return a reference to the metadata field | ||||||||||||
| pub fn metadata_field(&self) -> &BinaryViewArray { | ||||||||||||
| match self { | ||||||||||||
| ShreddingState::Unshredded { metadata, .. } => metadata, | ||||||||||||
| ShreddingState::FullyShredded { metadata, .. } => metadata, | ||||||||||||
| ShreddingState::PartiallyShredded { metadata, .. } => metadata, | ||||||||||||
| } | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| /// Return a reference to the value field, if present | ||||||||||||
| pub fn value_field(&self) -> Option<&BinaryViewArray> { | ||||||||||||
| match self { | ||||||||||||
| ShreddingState::Unshredded { value, .. } => Some(value), | ||||||||||||
| ShreddingState::FullyShredded { .. } => None, | ||||||||||||
| ShreddingState::PartiallyShredded { value, .. } => Some(value), | ||||||||||||
| } | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| /// Return a reference to the typed_value field, if present | ||||||||||||
| pub fn typed_value_field(&self) -> Option<&ArrayRef> { | ||||||||||||
| match self { | ||||||||||||
| ShreddingState::Unshredded { .. } => None, | ||||||||||||
| ShreddingState::FullyShredded { typed_value, .. } => Some(typed_value), | ||||||||||||
| ShreddingState::PartiallyShredded { typed_value, .. } => Some(typed_value), | ||||||||||||
| } | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| /// Slice all the underlying arrays | ||||||||||||
| pub fn slice(&self, offset: usize, length: usize) -> Self { | ||||||||||||
| match self { | ||||||||||||
| ShreddingState::Unshredded { metadata, value } => ShreddingState::Unshredded { | ||||||||||||
| metadata: metadata.slice(offset, length), | ||||||||||||
| value: value.slice(offset, length), | ||||||||||||
| }, | ||||||||||||
| ShreddingState::FullyShredded { | ||||||||||||
| metadata, | ||||||||||||
| typed_value, | ||||||||||||
| } => ShreddingState::FullyShredded { | ||||||||||||
| metadata: metadata.slice(offset, length), | ||||||||||||
| typed_value: typed_value.slice(offset, length), | ||||||||||||
| }, | ||||||||||||
| ShreddingState::PartiallyShredded { | ||||||||||||
| metadata, | ||||||||||||
| value, | ||||||||||||
| typed_value, | ||||||||||||
| } => ShreddingState::PartiallyShredded { | ||||||||||||
| metadata: metadata.slice(offset, length), | ||||||||||||
| value: value.slice(offset, length), | ||||||||||||
| typed_value: typed_value.slice(offset, length), | ||||||||||||
| }, | ||||||||||||
| } | ||||||||||||
| } | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| /// returns the non-null element at index as a Variant | ||||||||||||
| fn typed_value_to_variant(typed_value: &ArrayRef, index: usize) -> Variant { | ||||||||||||
|
||||||||||||
| fn typed_value_to_variant(typed_value: &ArrayRef, index: usize) -> Variant { | |
| fn typed_value_to_variant(typed_value: &ArrayRef, index: usize) -> Variant { | |
| if typed_value.is_null(index) { | |
| return None; | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other Arrow Array apis don't return Option, they instead return the value directly requiring you to check is_null instead -- see https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.value for example
/// Consistently with other Arrow arrays types, this API requires you to
/// check for nulls first using [`Self::is_valid`].
I think the reason it to allow a better chance at LLVM vectorizing the code, which I don't think is likely relevant here.
We could make the Variant API deviate from this pattern (it is already different in several other ways) and return Option<Variant>.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other Arrow Array apis don't return
Option, they instead return the value directly requiring you to checkis_nullinstead
For public API -- 100% agree. This is a private internal API tho, so it seems like we have room to do what we think makes the code simple/maintainable. Pulling important work inside the method instead of requiring all callers to remember it seems like a good example of that.
We can always change it if we discover it hurts performance or readability.
I think the reason it to allow a better chance at LLVM vectorizing the code
AFAIK, arrow normally requires all array entries to be physically valid, even when logically null. That way, one can perform columnar operations blindly and then just use the null masks to ignore the unwanted values after the fact. Here, we're accessing is_null and value both inside the loop -- and the latter conditionally -- so I'd be very surprised if LLVM was willing to inject vectorization code that requires touching values the code said not to touch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For public API -- 100% agree. This is a private internal API tho, so it seems like we have room to do what we think makes the code simple/maintainable. Pulling important work inside the method instead of requiring all callers to remember it seems like a good example of that.
Sorry -- I wasn't clear -- the reason I was talking about VariantArray::value is that it is the only caller of typed_value_to_variant so if we return Option from this value, we would just be stuck at the same place 🤔
Here, we're accessing is_null and value both inside the loop -- and the latter conditionally -- so I'd be very surprised if LLVM was willing to inject vectorization code that requires touching values the code said not to touch.
Yeah I agree I don't think it will make any difference for performance with Cariants. The primary reason in my mind is to be consistent with other APIs.
I think in an earlier version of this PR I actually had changed value() return Option. Maybe changing the signature is a good idea 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a future
FullyShreddedAllNullvariant (neithervaluenortyped_valuepresent), would we still need to store themetadataeven tho it's never actually used? 🤔There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure
I filed a ticket to track adding AllNull:
ShreddingState::AllNullvariant #8088