-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-24734][SQL] Fix type coercions and nullabilities of nested data types of some functions. #21704
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-24734][SQL] Fix type coercions and nullabilities of nested data types of some functions. #21704
Conversation
|
@ueshin Thanks for bringing this topic! This problem with different |
|
@mn-mikke Thanks! I'll take a look and join the discussion later. |
|
Test build #92562 has finished for PR 21704 at commit
|
| ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull)) | ||
| case dt => dt | ||
| }.getOrElse(StringType) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we handle this case in type coercion (analysis phase)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, Concat for array type has the type coercion to add casts to make all children the same type, but we also have the optimization SimplifyCasts to remove unnecessary casts which might remove casts from arrays not containing null to arrays containing null (optimizer/expressions.scala#L611).
E.g., concat(array(1,2,3), array(4,null,6)) might generate a wrong data type during the execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a test to show the wrong nullability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, I see. But, I just have a hunch that SimplifyCasts cannot simplify array casts in some cases?, e.g., this concat case. Since we basically cannot change plan semantics in optimization phase, I feel a little weird about this simplification. Anyway, I'm ok with your approach because I can't find a better & simpler way to solve this in analysis phase... Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, that also makes sense. I'm not sure we can remove the simplification, though. cc @gatorsmile @cloud-fan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, Coalesce should be fixed, and Least and Greatest are also suspicious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about changing SimplifyCasts rule to start replacing Cast with a new dummy cast expression that will hold only a target data type and won't perform any casting?
This can work, but my point is we should not add the cast to change containsNull, as it may not match the underlying child expression and generates unnecessary null check code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, see. In that case, it would be nice to introduce a method that will resolve the output DataType and merges nullable/containNull flags of non-primitive types recursively for such expressions. For the most cases we could encapsulate the function into a new trait. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for the @mn-mikke idea
|
Test build #92598 has finished for PR 21704 at commit
|
| val dataTypes = children.map(_.dataType) | ||
| dataTypes.headOption.map { | ||
| case ArrayType(et, _) => | ||
| ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we support array of array in concat?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it should work (see a test for it). Did we miss anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then shall we fix the containNull for the inner array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, + valueContainsNull for MapType and nullable for StructField
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the inner nullabilities are coerced during type-coercion? If the inner nullabilities are different, type coercion adds casts and they will remain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ueshin For Concat, Coalesce, etc. it seems to be that case since a coercion rule is executed if there is any nullability difference on any level of nesting. But it's not the case of CaseWhenCoercion rule, since sameType method is used for comparison.
I'm wondering if the goal is to avoid generation of extra Cast expressions, shouldn't other coercion rules utilize sameType method as well? Let's assume that the result of concat is subsequently used by flatten, wouldn't it lead to generation of extra null safe checks as mentioned here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please ignore the part of my previous comment regarding flatten function. The output data type of concat, etc. will be the same regardless what resolves null flags.
This reverts commit d87a8c6.
|
Note: we will be able to use |
|
Test build #92756 has finished for PR 21704 at commit
|
|
Test build #92750 has finished for PR 21704 at commit
|
|
Test build #92805 has finished for PR 21704 at commit
|
|
Test build #92855 has finished for PR 21704 at commit
|
| val types = children.map(_.dataType) | ||
| findWiderCommonType(types) match { | ||
| case Some(finalDataType) => CreateArray(children.map(Cast(_, finalDataType))) | ||
| case Some(finalDataType) => CreateArray(children.map(castIfNotSameType(_, finalDataType))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it needed? I think optimizer can remove unnecessary cast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently the optimizer doesn't remove the cast when the difference of the finalDataType is only the nullabilities of nested types.
|
what's still WIP for this PR? |
| // we need to truncate, but we should not promote one side to string if the other side is | ||
| // string.g | ||
| case g @ Greatest(children) if !haveSameType(children) => | ||
| case g @ Greatest(children) if !g.areInputTypesForMergingEqual => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are not just concat, can you update the PR title?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated PR title and description.
| TypeUtils.checkForSameTypeInputExpr(children.map(_.dataType), s"function $prettyName") | ||
| } | ||
|
|
||
| override def dataType: ArrayType = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why CreateArray doesn't extend ComplexTypeMergingExpression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the data type of the CreateArray itself is not the merged type but ArrayType.
| keys.map(_.dataType.simpleString).mkString("[", ", ", "]")) | ||
| } else if (values.map(_.dataType).distinct.length > 1) { | ||
| } else if (values.length > 1 && | ||
| values.map(_.dataType).sliding(2, 1).exists { case Seq(t1, t2) => !t1.sameType(t2) }) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The checks for keys and values are very similar. Would it be possible to separate the common logic into a private method?
| if (types.isEmpty) { | ||
| None | ||
| } else { | ||
| types.tail.foldLeft(Option(types.head)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Some(types.head) to avoid an extra null check?
|
Test build #92922 has finished for PR 21704 at commit
|
|
|
||
| private def haveSameType(exprs: Seq[Expression]): Boolean = | ||
| exprs.map(_.dataType).distinct.length == 1 | ||
| private def haveSameType(exprs: Seq[Expression]): Boolean = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is duplicated with ComplexTypeMergingExpression.areInputTypesForMergingEqual, can we unify them? We can
- remove
hasSameType. Any expression that needs to do this check should extendComplexTypeMergingExpressionand the TypeCoercion rule should callareInputTypesForMergingEqual. - remove
areInputTypesForMergingEqual.ComplexTypeMergingExpressionshould only define the list of data types that need to be merged, and TypeCoercion rule should callhasSameType(e.inputTypesForMerging)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we have CreateArray and CreateMap, we can't make all such expressions ComplexTypeMergingExpression.
I'd apply 2) approach.
|
Test build #92967 has finished for PR 21704 at commit
|
This reverts commit f701242.
|
LGTM |
|
Test build #93083 has finished for PR 21704 at commit
|
|
Jenkins, retest this please. |
|
Test build #93087 has finished for PR 21704 at commit
|
|
thanks, merging to master! |
What changes were proposed in this pull request?
We have some functions which need to aware the nullabilities of all children, such as
CreateArray,CreateMap,Concat, and so on. Currently we add casts to fix the nullabilities, but the casts might be removed during the optimization phase.After the discussion, we decided to not add extra casts for just fixing the nullabilities of the nested types, but handle them by functions themselves.
How was this patch tested?
Modified and added some tests.