[SPARK-53291][SQL] Fix nullability for value column #52043
Conversation
Thanks! I left a minor question.
Seq(
  StructField(VariantValueFieldName, BinaryType, nullable = true)
Is the PR description correct? It says "currently always set the `value` column to non-nullable", but it looks like we always set it to nullable?
Oops, yes, I've updated the PR description.
@cashmand Thanks for the fix!
LGTM
thanks, merging to master/4.0!
What changes were proposed in this pull request?
For shredded Variant, we currently always set the `value` column to be nullable. But when there is no corresponding `typed_value`, and the value doesn't represent an object field (where null implies missing from the object), the `value` is never null, and we can set the column to be required.
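For illustration only, the rule can be sketched as below. This is a minimal Scala sketch with a hypothetical helper name, not the PR's actual implementation.

```scala
// Hypothetical helper, for illustration only: `value` stays nullable only when a
// `typed_value` sibling exists (the data may live there instead) or when this is an
// object field (null means the field is missing from the object). Otherwise the
// column can be marked required (non-nullable).
def valueIsNullable(hasTypedValue: Boolean, isObjectField: Boolean): Boolean =
  hasTypedValue || isObjectField
```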
Why are the changes needed?
This shouldn't affect results as read by Spark, but it may cause the parquet file to be marginally larger, and the [spec](https://github.com/apache/parquet-format/blob/master/VariantShredding.md) wording indicates that `value` must be required in these situations, so a strict reader could reject the schema as it's currently being produced.
Does this PR introduce any user-facing change?
Variant parquet file schema may change slightly.
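As a sketch of the kind of schema change, using Spark's `StructType` API rather than the raw Parquet schema; the exact field layout here is an assumption based on the shredding spec, not output copied from Spark.

```scala
import org.apache.spark.sql.types._

// Illustrative only: a shredded Variant group whose `value` has no `typed_value`
// sibling. Previously `value` was emitted as nullable (optional in Parquet);
// after this change it can be emitted as non-nullable (required).
val before = StructType(Seq(
  StructField("metadata", BinaryType, nullable = false),
  StructField("value", BinaryType, nullable = true)))

val after = StructType(Seq(
  StructField("metadata", BinaryType, nullable = false),
  StructField("value", BinaryType, nullable = false)))
```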
How was this patch tested?
Unit test extended to cover this case.
Was this patch authored or co-authored using generative AI tooling?
No.