@cashmand cashmand commented Aug 15, 2025

What changes were proposed in this pull request?

For shredded Variant, we currently always set the `value` column to be nullable. However, when there is no corresponding `typed_value` and the value doesn't represent an object field (where null implies the field is missing from the object), `value` is never null, so we can set the column to be required.
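
The rule described above can be sketched as a small decision function. This is a simplified illustration only, not the code in this patch: `Field`, `ValueNullability`, `valueIsNullable`, and `valueField` are hypothetical names, and `Field` is a stand-in for Spark's `StructField` so the snippet runs without Spark on the classpath.

```scala
// Hypothetical stand-in for Spark's StructField; NOT the real Spark class.
final case class Field(name: String, dataType: String, nullable: Boolean)

object ValueNullability {
  // Assumed inputs (names are illustrative, not from the patch):
  //   hasTypedValue: a sibling `typed_value` column exists in the shredded group.
  //   isObjectField: the group is a field of a shredded object, where a null
  //                  `value` means the field is missing from the object.
  // `value` may be null only if one of these holds; otherwise it is never null.
  def valueIsNullable(hasTypedValue: Boolean, isObjectField: Boolean): Boolean =
    hasTypedValue || isObjectField

  // Builds the `value` column description with the nullability derived above.
  def valueField(hasTypedValue: Boolean, isObjectField: Boolean): Field =
    Field("value", "binary", nullable = valueIsNullable(hasTypedValue, isObjectField))
}
```

Under these assumptions, only the case with no `typed_value` and no enclosing object field yields a required column; every other combination keeps `value` nullable.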

Why are the changes needed?

This shouldn't affect results as read by Spark, but leaving the column nullable may make the Parquet file marginally larger. The spec wording also indicates that `value` must be required in these situations, so a strict reader could reject the schema as it is currently being produced.

Does this PR introduce any user-facing change?

Yes, slightly: in the shredded Variant Parquet file schema, the `value` column may now be required instead of nullable.

How was this patch tested?

Unit test extended to cover this case.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Aug 15, 2025

@gene-db gene-db left a comment

Thanks! I left a minor question.

Seq(
StructField(VariantValueFieldName, BinaryType, nullable = true)

Is the PR description correct? It says:

currently always set the value column to non-nullable

But, it looks like we always set it to nullable?

Contributor Author

Oops, yes, I've updated the PR description.

@cashmand cashmand requested a review from gene-db August 15, 2025 20:13

@gene-db gene-db left a comment

@cashmand Thanks for the fix!

LGTM

@cloud-fan

thanks, merging to master/4.0!

@cloud-fan cloud-fan closed this in fd77ec6 Aug 16, 2025
cloud-fan pushed a commit that referenced this pull request Aug 16, 2025
### What changes were proposed in this pull request?

For shredded Variant, we currently always set the `value` column to be nullable. But when there is no corresponding `typed_value`, and the value doesn't represent an object field (where null implies missing from the object), the `value` is never null, and we can set the column to be required.

### Why are the changes needed?

This shouldn't affect results as read by Spark, but it may cause the parquet file to be marginally larger, and the [spec](https://github.com/apache/parquet-format/blob/master/VariantShredding.md) wording indicates that `value` must be required in these situations, so a strict reader could reject the schema as it's currently being produced.

### Does this PR introduce _any_ user-facing change?

Variant parquet file schema may change slightly.

### How was this patch tested?

Unit test extended to cover this case.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #52043 from cashmand/fix_nullability.

Authored-by: cashmand <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit fd77ec6)
Signed-off-by: Wenchen Fan <[email protected]>
mzhang pushed a commit to mzhang/spark that referenced this pull request Aug 21, 2025