[SPARK-53291][SQL] Fix nullability for value column #52043

cashmand · 2025-08-15T15:12:39Z

What changes were proposed in this pull request?

For shredded Variant, we currently always set the `value` column to be nullable. But when there is no corresponding `typed_value`, and the value doesn't represent an object field (where null implies missing from the object), the `value` is never null, and we can set the column to be required.
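The rule described above can be sketched in a few lines of Scala. This is only an illustration under stated assumptions, not the code in `SparkShreddingUtils.scala`: the `Field` case class, the `ValueNullability` object, and the `hasTypedValue` / `isObjectField` parameters are hypothetical names chosen for this example, and the snippet deliberately avoids Spark's own types so that it runs standalone.

```scala
// Minimal stand-in for a schema field; this is NOT Spark's StructField.
final case class Field(name: String, dataType: String, nullable: Boolean)

// Hypothetical helper object illustrating the nullability rule from the description.
object ValueNullability {
  // hasTypedValue: a sibling `typed_value` column exists, so `value` may be
  //   null when the data lives in `typed_value` instead.
  // isObjectField: the group represents an object field, where a null `value`
  //   implies the field is missing from the object.
  // `value` can be required only when neither condition holds.
  def valueIsNullable(hasTypedValue: Boolean, isObjectField: Boolean): Boolean =
    hasTypedValue || isObjectField

  // Build the illustrative `value` field with the computed nullability.
  def valueField(hasTypedValue: Boolean, isObjectField: Boolean): Field =
    Field("value", "binary", valueIsNullable(hasTypedValue, isObjectField))
}
```

For example, `ValueNullability.valueField(hasTypedValue = false, isObjectField = false)` yields a field with `nullable = false`, which corresponds to the "required" case this PR targets; every other combination keeps the column nullable.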

Why are the changes needed?

This shouldn't affect results as read by Spark, but it may cause the parquet file to be marginally larger, and the [spec](https://github.com/apache/parquet-format/blob/master/VariantShredding.md) wording indicates that `value` must be required in these situations, so a strict reader could reject the schema as it's currently being produced.

Does this PR introduce any user-facing change?

Variant parquet file schema may change slightly.

How was this patch tested?

Unit test extended to cover this case.

Was this patch authored or co-authored using generative AI tooling?

No.

gene-db

Thanks! I left a minor question.

gene-db · 2025-08-15T16:40:30Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/parquet/SparkShreddingUtils.scala

        Seq(
-          StructField(VariantValueFieldName, BinaryType, nullable = true)


Is the PR description correct? It says:

currently always set the value column to non-nullable

But, it looks like we always set it to nullable?

cashmand

Oops, yes, I've updated the PR description.

gene-db

@cashmand Thanks for the fix!

LGTM

cloud-fan · 2025-08-16T03:32:25Z

thanks, merging to master/4.0!

### What changes were proposed in this pull request?

For shredded Variant, we currently always set the `value` column to be nullable. But when there is no corresponding `typed_value`, and the value doesn't represent an object field (where null implies missing from the object), the `value` is never null, and we can set the column to be required.

### Why are the changes needed?

This shouldn't affect results as read by Spark, but it may cause the parquet file to be marginally larger, and the [spec](https://github.com/apache/parquet-format/blob/master/VariantShredding.md) wording indicates that `value` must be required in these situations, so a strict reader could reject the schema as it's currently being produced.

### Does this PR introduce _any_ user-facing change?

Variant parquet file schema may change slightly.

### How was this patch tested?

Unit test extended to cover this case.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #52043 from cashmand/fix_nullability.

Authored-by: cashmand <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit fd77ec6)
Signed-off-by: Wenchen Fan <[email protected]>

Fix

60668aa

github-actions bot added the SQL label Aug 15, 2025

More tests

012463c

gene-db reviewed Aug 15, 2025

cashmand requested a review from gene-db August 15, 2025 20:13

gene-db approved these changes Aug 15, 2025

cloud-fan approved these changes Aug 16, 2025

cloud-fan closed this in fd77ec6 Aug 16, 2025
