Skip to content
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions parquet-variant-compute/src/from_json.rs
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ pub fn batch_json_string_to_variant(input: &ArrayRef) -> Result<VariantArray, Ar
#[cfg(test)]
mod test {
use crate::batch_json_string_to_variant;
use arrow::array::{Array, ArrayRef, AsArray, StringArray};
use arrow::array::{Array, ArrayRef, StringArray};
use arrow_schema::ArrowError;
use parquet_variant::{Variant, VariantBuilder};
use std::sync::Arc;
Expand All @@ -69,8 +69,8 @@ mod test {
let array_ref: ArrayRef = Arc::new(input);
let variant_array = batch_json_string_to_variant(&array_ref).unwrap();

let metadata_array = variant_array.metadata_field().as_binary_view();
let value_array = variant_array.value_field().as_binary_view();
let metadata_array = variant_array.metadata_field();
let value_array = variant_array.value_field().expect("value field");

// Compare row 0
assert!(!variant_array.is_null(0));
Expand Down
294 changes: 229 additions & 65 deletions parquet-variant-compute/src/variant_array.rs
Original file line number Diff line number Diff line change
Expand Up @@ -17,8 +17,9 @@

//! [`VariantArray`] implementation

use arrow::array::{Array, ArrayData, ArrayRef, AsArray, StructArray};
use arrow::array::{Array, ArrayData, ArrayRef, AsArray, BinaryViewArray, StructArray};
use arrow::buffer::NullBuffer;
use arrow::datatypes::Int32Type;
use arrow_schema::{ArrowError, DataType};
use parquet_variant::Variant;
use std::any::Any;
Expand All @@ -44,27 +45,11 @@ use std::sync::Arc;
/// [document]: https://docs.google.com/document/d/1pw0AWoMQY3SjD7R4LgbPvMjG_xSCtXp3rZHkVp9jpZ4/edit?usp=sharing
#[derive(Debug)]
pub struct VariantArray {
/// StructArray of up to three fields:
///
/// 1. A required field named `metadata` which is binary, large_binary, or
/// binary_view
///
/// 2. An optional field named `value` that is binary, large_binary, or
/// binary_view
///
/// 3. An optional field named `typed_value` which can be any primitive type
/// or be a list, large_list, list_view or struct
///
/// NOTE: It is also permissible for the metadata field to be
/// Dictionary-Encoded, preferably (but not required) with an index type of
/// int8.
/// Reference to the underlying StructArray
inner: StructArray,

/// Reference to the metadata column of inner
metadata_ref: ArrayRef,

/// Reference to the value column of inner
value_ref: ArrayRef,
/// how is this variant array shredded?
shredding_state: ShreddingState,
}

impl VariantArray {
Expand All @@ -79,12 +64,22 @@ impl VariantArray {
/// # Errors:
/// - If the `StructArray` does not contain the required fields
///
/// # Current support
/// This structure does not (yet) support the full Arrow Variant Array specification.
/// # Requirements of the `StructArray`
///
/// 1. A required field named `metadata` which is binary, large_binary, or
/// binary_view
///
/// 2. An optional field named `value` that is binary, large_binary, or
/// binary_view
///
/// 3. An optional field named `typed_value` which can be any primitive type
/// or be a list, large_list, list_view or struct
///
/// NOTE: It is also permissible for the metadata field to be
/// Dictionary-Encoded, preferably (but not required) with an index type of
/// int8.
///
/// Only `StructArrays` with `metadata` and `value` fields that are
/// [`BinaryViewArray`] are supported. Shredded values are not currently supported
/// nor are using types other than `BinaryViewArray`
/// Currently, only [`BinaryViewArray`] are supported.
///
/// [`BinaryViewArray`]: arrow::array::BinaryViewArray
pub fn try_new(inner: ArrayRef) -> Result<Self, ArrowError> {
Expand All @@ -93,35 +88,47 @@ impl VariantArray {
"Invalid VariantArray: requires StructArray as input".to_string(),
));
};
// Ensure the StructArray has a metadata field of BinaryView

let Some(metadata_field) = VariantArray::find_metadata_field(inner) else {
// Note the specification allows for any order so we must search by name

// Ensure the StructArray has a metadata field of BinaryView
let Some(metadata_field) = inner.column_by_name("metadata") else {
return Err(ArrowError::InvalidArgumentError(
"Invalid VariantArray: StructArray must contain a 'metadata' field".to_string(),
));
};
if metadata_field.data_type() != &DataType::BinaryView {
let Some(metadata) = metadata_field.as_binary_view_opt() else {
return Err(ArrowError::NotYetImplemented(format!(
Comment on lines +98 to 99
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside: Out of curiosity, do we have any style/preference guidelines for

let Some(foo) = foo_opt else {
    return Err(...);
};

vs.

let foo = foo_opt.unwrap_or_else(|| {
    ...
})?;

?

The former seems both more direct and less verbose, at least in this case?
Maybe unwrap_or_else is mostly (only?) useful as part of a bigger monadic chain?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally prefer this pattern for the reasons you suggest. However, I don't know of anything formal we have discussed or agreed on

let Some(foo) = foo_opt else {
    return Err(...);
};

"VariantArray 'metadata' field must be BinaryView, got {}",
metadata_field.data_type()
)));
}
let Some(value_field) = VariantArray::find_value_field(inner) else {
return Err(ArrowError::InvalidArgumentError(
"Invalid VariantArray: StructArray must contain a 'value' field".to_string(),
));
};
if value_field.data_type() != &DataType::BinaryView {
return Err(ArrowError::NotYetImplemented(format!(
"VariantArray 'value' field must be BinaryView, got {}",
value_field.data_type()
)));
}

// Find the value field, if present
let value = inner
.column_by_name("value")
.map(|v| {
let Some(binary_view) = v.as_binary_view_opt() else {
return Err(ArrowError::NotYetImplemented(format!(
"VariantArray 'value' field must be BinaryView, got {}",
v.data_type()
)));
};
Ok(binary_view)
})
.transpose()?;

// Find the typed_value field, if present
let typed_value = inner.column_by_name("typed_value");

// Note these clones are cheap, they just bump the ref count
let inner = inner.clone();
let shredding_state =
ShreddingState::try_new(metadata.clone(), value.cloned(), typed_value.cloned())?;

Ok(Self {
inner: inner.clone(),
metadata_ref: metadata_field,
value_ref: value_field,
inner,
shredding_state,
})
}

Expand All @@ -135,36 +142,195 @@ impl VariantArray {
self.inner
}

/// Return the shredding state of this `VariantArray`
pub fn shredding_state(&self) -> &ShreddingState {
&self.shredding_state
}

/// Return the [`Variant`] instance stored at the given row
///
/// Panics if the index is out of bounds.
/// Consistently with other Arrow arrays types, this API requires you to
/// check for nulls first using [`Self::is_valid`].
///
/// # Panics
/// * if the index is out of bounds
/// * if the array value is null
///
/// If this is a shredded variant but has no value at the shredded location, it
/// will return [`Variant::Null`].
///
///
/// # Performance Note
///
/// This is certainly not the most efficient way to access values in a
/// `VariantArray`, but it is useful for testing and debugging.
///
/// Note: Does not do deep validation of the [`Variant`], so it is up to the
/// caller to ensure that the metadata and value were constructed correctly.
pub fn value(&self, index: usize) -> Variant {
let metadata = self.metadata_field().as_binary_view().value(index);
let value = self.value_field().as_binary_view().value(index);
Variant::new(metadata, value)
match &self.shredding_state {
ShreddingState::Unshredded { metadata, value } => {
Variant::new(metadata.value(index), value.value(index))
}
ShreddingState::FullyShredded { typed_value, .. } => {
if typed_value.is_null(index) {
Variant::Null
} else {
typed_value_to_variant(typed_value, index)
}
}
ShreddingState::PartiallyShredded {
metadata,
value,
typed_value,
} => {
if typed_value.is_null(index) {
Variant::new(metadata.value(index), value.value(index))
} else {
typed_value_to_variant(typed_value, index)
}
}
}
}

fn find_metadata_field(array: &StructArray) -> Option<ArrayRef> {
array.column_by_name("metadata").cloned()
/// Return a reference to the metadata field of the [`StructArray`]
pub fn metadata_field(&self) -> &BinaryViewArray {
self.shredding_state.metadata_field()
}

fn find_value_field(array: &StructArray) -> Option<ArrayRef> {
array.column_by_name("value").cloned()
/// Return a reference to the value field of the `StructArray`
pub fn value_field(&self) -> Option<&BinaryViewArray> {
self.shredding_state.value_field()
}

/// Return a reference to the metadata field of the [`StructArray`]
pub fn metadata_field(&self) -> &ArrayRef {
// spec says fields order is not guaranteed, so we search by name
&self.metadata_ref
/// Return a reference to the typed_value field of the `StructArray`, if present
pub fn typed_value_field(&self) -> Option<&ArrayRef> {
self.shredding_state.typed_value_field()
Comment on lines +195 to +205
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These become trivial if we ditch the ShreddingState enum

}
}

/// Return a reference to the value field of the `StructArray`
pub fn value_field(&self) -> &ArrayRef {
// spec says fields order is not guaranteed, so we search by name
&self.value_ref
/// Variant arrays can be shredded in one of three states, encoded here
#[derive(Debug)]
pub enum ShreddingState {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a future FullyShreddedAllNull variant (neither value nor typed_value present), would we still need to store the metadata even tho it's never actually used? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure

I filed a ticket to track adding AllNull:

/// This variant has no typed_value field
Unshredded {
metadata: BinaryViewArray,
value: BinaryViewArray,
},
/// This variant has a typed_value field and no value field
/// meaning it is fully shredded (aka the value is stored in typed_value)
FullyShredded {
metadata: BinaryViewArray,
typed_value: ArrayRef,
},
/// This variant has both a value field and a typed_value field
/// meaning it is partially shredded: first the typed_value is used, and
/// if that is null, the value field is used.
PartiallyShredded {
metadata: BinaryViewArray,
value: BinaryViewArray,
typed_value: ArrayRef,
},
}

impl ShreddingState {
/// try to create a new `ShreddingState` from the given fields
pub fn try_new(
metadata: BinaryViewArray,
value: Option<BinaryViewArray>,
typed_value: Option<ArrayRef>,
) -> Result<Self, ArrowError> {
match (metadata, value, typed_value) {
(metadata, Some(value), Some(typed_value)) => Ok(Self::PartiallyShredded {
metadata,
value,
typed_value,
}),
(metadata, Some(value), None) => Ok(Self::Unshredded { metadata, value }),
(metadata, None, Some(typed_value)) => Ok(Self::FullyShredded {
metadata,
typed_value,
}),
(_metadata_field, None, None) => Err(ArrowError::InvalidArgumentError(String::from(
"VariantArray has neither value nor typed_value field",
Comment on lines +273 to +274
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure the shredding spec allows this case. It corresponds to a perfectly shredded all-null column where the writer chose to omit both of them as unnecessary.

(see other comment above)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense -- I'll file a follow on ticket to add support

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

))),
}
}

/// Return a reference to the metadata field
pub fn metadata_field(&self) -> &BinaryViewArray {
match self {
ShreddingState::Unshredded { metadata, .. } => metadata,
ShreddingState::FullyShredded { metadata, .. } => metadata,
ShreddingState::PartiallyShredded { metadata, .. } => metadata,
}
}

/// Return a reference to the value field, if present
pub fn value_field(&self) -> Option<&BinaryViewArray> {
match self {
ShreddingState::Unshredded { value, .. } => Some(value),
ShreddingState::FullyShredded { .. } => None,
ShreddingState::PartiallyShredded { value, .. } => Some(value),
}
}

/// Return a reference to the typed_value field, if present
pub fn typed_value_field(&self) -> Option<&ArrayRef> {
match self {
ShreddingState::Unshredded { .. } => None,
ShreddingState::FullyShredded { typed_value, .. } => Some(typed_value),
ShreddingState::PartiallyShredded { typed_value, .. } => Some(typed_value),
}
}

/// Slice all the underlying arrays
pub fn slice(&self, offset: usize, length: usize) -> Self {
match self {
ShreddingState::Unshredded { metadata, value } => ShreddingState::Unshredded {
metadata: metadata.slice(offset, length),
value: value.slice(offset, length),
},
ShreddingState::FullyShredded {
metadata,
typed_value,
} => ShreddingState::FullyShredded {
metadata: metadata.slice(offset, length),
typed_value: typed_value.slice(offset, length),
},
ShreddingState::PartiallyShredded {
metadata,
value,
typed_value,
} => ShreddingState::PartiallyShredded {
metadata: metadata.slice(offset, length),
value: value.slice(offset, length),
typed_value: typed_value.slice(offset, length),
},
}
}
}

/// returns the non-null element at index as a Variant
fn typed_value_to_variant(typed_value: &ArrayRef, index: usize) -> Variant {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to make this return Option<Variant> so callers don't have to check for null themselves.

Suggested change
fn typed_value_to_variant(typed_value: &ArrayRef, index: usize) -> Variant {
fn typed_value_to_variant(typed_value: &ArrayRef, index: usize) -> Variant {
if typed_value.is_null(index) {
return None;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other Arrow Array apis don't return Option, they instead return the value directly requiring you to check is_null instead -- see https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.value for example

    /// Consistently with other Arrow arrays types, this API requires you to
    /// check for nulls first using [`Self::is_valid`].

I think the reason it to allow a better chance at LLVM vectorizing the code, which I don't think is likely relevant here.

We could make the Variant API deviate from this pattern (it is already different in several other ways) and return Option<Variant>.

Copy link
Contributor

@scovich scovich Aug 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other Arrow Array apis don't return Option, they instead return the value directly requiring you to check is_null instead

For public API -- 100% agree. This is a private internal API tho, so it seems like we have room to do what we think makes the code simple/maintainable. Pulling important work inside the method instead of requiring all callers to remember it seems like a good example of that.

We can always change it if we discover it hurts performance or readability.

I think the reason it to allow a better chance at LLVM vectorizing the code

AFAIK, arrow normally requires all array entries to be physically valid, even when logically null. That way, one can perform columnar operations blindly and then just use the null masks to ignore the unwanted values after the fact. Here, we're accessing is_null and value both inside the loop -- and the latter conditionally -- so I'd be very surprised if LLVM was willing to inject vectorization code that requires touching values the code said not to touch.

Copy link
Contributor Author

@alamb alamb Aug 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For public API -- 100% agree. This is a private internal API tho, so it seems like we have room to do what we think makes the code simple/maintainable. Pulling important work inside the method instead of requiring all callers to remember it seems like a good example of that.

Sorry -- I wasn't clear -- the reason I was talking about VariantArray::value is that it is the only caller of typed_value_to_variant so if we return Option from this value, we would just be stuck at the same place 🤔

Here, we're accessing is_null and value both inside the loop -- and the latter conditionally -- so I'd be very surprised if LLVM was willing to inject vectorization code that requires touching values the code said not to touch.

Yeah I agree I don't think it will make any difference for performance with Cariants. The primary reason in my mind is to be consistent with other APIs.

I think in an earlier version of this PR I actually had changed value() return Option. Maybe changing the signature is a good idea 🤔

match typed_value.data_type() {
DataType::Int32 => {
let typed_value = typed_value.as_primitive::<Int32Type>();
Variant::from(typed_value.value(index))
Comment on lines +335 to +338
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good use for downcast_primitive_array macro?

Suggested change
match typed_value.data_type() {
DataType::Int32 => {
let typed_value = typed_value.as_primitive::<Int32Type>();
Variant::from(typed_value.value(index))
downcast_primitive_array! {
typed_value => Some(Variant::from(typed_value.value(index))),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't make that work as it needs more information than just the native type (e.g. the precision/scale for decimals).

@superserious-dev 's cast_conversion macro I think can be adapted eventually to handle this case

https://github.com/apache/arrow-rs/pull/8074/files#diff-753713bd9a1945a5fc998fab6018eac0b08ebb4d04d3258b2b615498152a9303R46

}
// todo other types here (note this is very similar to cast_to_variant.rs)
// so it would be great to figure out how to share this code
_ => {
// We shouldn't panic in production code, but this is a
// placeholder until we implement more types
// TODO tickets: XXXX
debug_assert!(
false,
"Unsupported typed_value type: {:?}",
typed_value.data_type()
);
Variant::Null
}
}
}

Expand All @@ -186,13 +352,11 @@ impl Array for VariantArray {
}

fn slice(&self, offset: usize, length: usize) -> ArrayRef {
let slice = self.inner.slice(offset, length);
let met = self.metadata_ref.slice(offset, length);
let val = self.value_ref.slice(offset, length);
let inner = self.inner.slice(offset, length);
let shredding_state = self.shredding_state.slice(offset, length);
Arc::new(Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid unnecessary allocations for common cases, like -

let shredding_state = match (&self.shredding_state, offset, length) {
    // Fast path: no slice needed for full array
    (state, 0, len) if len == self.len() => state.clone(),
    
    // Fast path: uniform shredding doesn't need slicing
    (ShreddingState::None, _, _) => ShreddingState::None,
    (ShreddingState::All, _, _) => ShreddingState::All,
    
    // Only slice for mixed case
    (ShreddingState::Mixed(bitmap), offset, length) => {
        ShreddingState::Mixed(bitmap.slice(offset, length))
    }
};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could definitely add a fast path for slicing the whole array.

I am not quite sure what ShreddingState::All and ShreddingState::Mixed is supposed to represent

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, would we actually expect a redundant slice call to be notably more expensive than clone?
Seems like they'd do the same thing?

For example, BooleanArray::slice is:

    pub fn slice(&self, offset: usize, length: usize) -> Self {
        Self {
            values: self.values.slice(offset, length),
            nulls: self.nulls.as_ref().map(|n| n.slice(offset, length)),
        }
    }

with BooleanBuffer::slice:

    pub fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(
            offset.saturating_add(len) <= self.len,
            "the length + offset of the sliced BooleanBuffer cannot exceed the existing length"
        );
        Self {
            buffer: self.buffer.clone(),
            offset: self.offset + offset,
            len,
        }
    }

Assuming the compiler inlines things as aggressively as it usually does, it seems like the net difference would be the bounds check and other offset arithmetic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree I don't expect any measurable performance improvement

inner: slice,
metadata_ref: met,
value_ref: val,
inner,
shredding_state,
})
}

Expand Down Expand Up @@ -258,7 +422,7 @@ mod test {
let err = VariantArray::try_new(Arc::new(array));
assert_eq!(
err.unwrap_err().to_string(),
"Invalid argument error: Invalid VariantArray: StructArray must contain a 'value' field"
"Invalid argument error: VariantArray has neither value nor typed_value field"
);
}

Expand Down
Loading
Loading