
Conversation


@ueshin ueshin commented Jul 3, 2018

What changes were proposed in this pull request?

We have some functions that need to be aware of the nullability of all their children, such as CreateArray, CreateMap, Concat, and so on. Currently we add casts to fix the nullability, but the casts might be removed during the optimization phase.
After the discussion, we decided not to add extra casts just to fix the nullability of nested types, but to have the functions handle it themselves.
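As a rough standalone sketch of the intended behavior (using simplified stand-ins for Catalyst's type classes, not Spark's actual API), a concat-like function can derive its result type by OR-ing the containsNull flags of its children instead of relying on casts:

```scala
// Simplified stand-ins for Catalyst's DataType hierarchy; the names mirror
// Spark's, but this is only an illustration, not the real API.
sealed trait DataType
case object IntegerType extends DataType
case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType

// The result element may contain null if ANY child's element may contain null.
def concatResultType(childTypes: Seq[ArrayType]): ArrayType =
  ArrayType(childTypes.head.elementType, childTypes.exists(_.containsNull))
```

For example, concatenating an array that cannot contain null with one that can yields a result type with containsNull = true, without any helper Cast that the optimizer could later strip.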

How was this patch tested?

Modified and added some tests.


ueshin commented Jul 3, 2018

cc @mn-mikke @gatorsmile @cloud-fan


mn-mikke commented Jul 3, 2018

@ueshin Thanks for bringing up this topic! This problem with different nullable/containsNull flags seems to be more generic. In 21687, we've touched on a similar problem with the CaseWhen and If expressions. So I think it would be nice if we could think together about a generic and consistent solution for all expressions. WDYT?


ueshin commented Jul 3, 2018

@mn-mikke Thanks! I'll take a look and join the discussion later.


SparkQA commented Jul 3, 2018

Test build #92562 has finished for PR 21704 at commit d87a8c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
case dt => dt
}.getOrElse(StringType)
}
Member

Can't we handle this case in type coercion (analysis phase)?

Member Author

@ueshin ueshin Jul 4, 2018

Actually, Concat for array types has a type coercion rule that adds casts to make all children the same type, but we also have the SimplifyCasts optimization, which removes unnecessary casts and might remove casts from arrays not containing null to arrays containing null (optimizer/expressions.scala#L611).

E.g., concat(array(1,2,3), array(4,null,6)) might end up with a wrong data type during execution.
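To make the failure mode concrete, here is a toy model (not Spark's actual SimplifyCasts rule, and simplified stand-in types) of how removing a cast that only widens containsNull silently reverts the reported type:

```scala
// Toy stand-ins for Catalyst's types and expressions; illustration only.
sealed trait DataType
case object IntegerType extends DataType
case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType

sealed trait Expr { def dataType: DataType }
case class ArrayLit(dt: ArrayType) extends Expr { def dataType: DataType = dt }
case class Cast(child: Expr, target: DataType) extends Expr { def dataType: DataType = target }

// Treat array types as equal when only containsNull differs -- which is
// exactly the comparison that loses the nullability information here.
def sameIgnoringNullability(a: DataType, b: DataType): Boolean = (a, b) match {
  case (ArrayType(e1, _), ArrayType(e2, _)) => sameIgnoringNullability(e1, e2)
  case _ => a == b
}

// Drop the cast when child and target differ only in nullability.
def simplifyCast(e: Expr): Expr = e match {
  case Cast(child, t) if sameIgnoringNullability(child.dataType, t) => child
  case other => other
}

val casted = Cast(ArrayLit(ArrayType(IntegerType, containsNull = false)),
                  ArrayType(IntegerType, containsNull = true))
val simplified = simplifyCast(casted)
```

After simplification the expression again reports containsNull = false, even though the data it feeds into the concat can contain null.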

Member Author

Added a test to show the wrong nullability.

Member

@maropu maropu Jul 4, 2018

Aha, I see. But I have a hunch that SimplifyCasts cannot safely simplify array casts in some cases, e.g., this concat case. Since we basically cannot change plan semantics in the optimization phase, I feel a little weird about this simplification. Anyway, I'm OK with your approach because I can't find a better and simpler way to solve this in the analysis phase... Thanks!

Member Author

Hmm, that also makes sense. I'm not sure we can remove the simplification, though. cc @gatorsmile @cloud-fan

Member Author

Yeah, Coalesce should be fixed, and Least and Greatest are also suspicious.

Contributor

What about changing the SimplifyCasts rule to start replacing Cast with a new dummy cast expression that holds only a target data type and doesn't perform any casting?

This could work, but my point is that we should not add a cast just to change containsNull, as it may not match the underlying child expression and generates unnecessary null-check code.

Contributor

Oh, I see. In that case, it would be nice to introduce a method that resolves the output DataType and merges the nullable/containsNull flags of non-primitive types recursively for such expressions. For most cases we could encapsulate the function in a new trait. WDYT?
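A minimal sketch of that idea (hypothetical names and simplified stand-in types; the discussion in this PR later converges on ComplexTypeMergingExpression for the real implementation): a trait that only asks implementors for the input types and derives the output type by recursively OR-ing the nullability flags:

```scala
// Simplified stand-ins for Catalyst's types; illustration only.
sealed trait DataType
case object IntegerType extends DataType
case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType

// Recursively merge two types that are equal up to nullability flags,
// OR-ing containsNull at every level of nesting.
def mergeNullability(a: DataType, b: DataType): DataType = (a, b) match {
  case (ArrayType(e1, n1), ArrayType(e2, n2)) =>
    ArrayType(mergeNullability(e1, e2), n1 || n2)
  case _ =>
    require(a == b, s"cannot merge $a with $b"); a
}

// Hypothetical trait: expressions declare which input types to merge and
// get a consistent output dataType for free.
trait TypeMergingExpr {
  def inputTypesForMerging: Seq[DataType]
  def dataType: DataType = inputTypesForMerging.reduce(mergeNullability)
}

val merged = new TypeMergingExpr {
  def inputTypesForMerging = Seq(
    ArrayType(ArrayType(IntegerType, containsNull = false), containsNull = false),
    ArrayType(ArrayType(IntegerType, containsNull = true), containsNull = false))
}.dataType
```

Note that the inner containsNull is merged too, covering the nested-array case discussed below.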

Contributor

SGTM

Member

+1 for @mn-mikke's idea


SparkQA commented Jul 4, 2018

Test build #92598 has finished for PR 21704 at commit 30d5aed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val dataTypes = children.map(_.dataType)
dataTypes.headOption.map {
case ArrayType(et, _) =>
ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
Contributor

do we support array of array in concat?

Contributor

Yes, it should work (see a test for it). Did we miss anything?

Contributor

Then shall we fix containsNull for the inner array?

Contributor

Yeah, plus valueContainsNull for MapType and nullable for StructField.

Member Author

I guess the inner nullabilities are coerced during type-coercion? If the inner nullabilities are different, type coercion adds casts and they will remain.

Contributor

@ueshin For Concat, Coalesce, etc. that seems to be the case, since a coercion rule is executed if there is any nullability difference at any level of nesting. But it's not the case for the CaseWhenCoercion rule, since the sameType method is used for comparison.

I'm wondering: if the goal is to avoid generating extra Cast expressions, shouldn't other coercion rules utilize the sameType method as well? Assuming the result of concat is subsequently used by flatten, wouldn't it lead to the generation of extra null-safe checks as mentioned here?

Contributor

Please ignore the part of my previous comment regarding the flatten function. The output data type of concat, etc. will be the same regardless of what resolves the null flags.


ueshin commented Jul 9, 2018

Note: we will be able to use NonPrimitiveTypeMergingExpression in #21687.


SparkQA commented Jul 9, 2018

Test build #92756 has finished for PR 21704 at commit 444383d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 9, 2018

Test build #92750 has finished for PR 21704 at commit 3d8891e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin ueshin changed the title [SPARK-24734][SQL] Fix containsNull of Concat for array type. [WIP][SPARK-24734][SQL] Fix containsNull of Concat for array type. Jul 10, 2018

SparkQA commented Jul 10, 2018

Test build #92805 has finished for PR 21704 at commit 2c54e38.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 11, 2018

Test build #92855 has finished for PR 21704 at commit db254e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val types = children.map(_.dataType)
findWiderCommonType(types) match {
case Some(finalDataType) => CreateArray(children.map(Cast(_, finalDataType)))
case Some(finalDataType) => CreateArray(children.map(castIfNotSameType(_, finalDataType)))
Contributor

Is it needed? I think the optimizer can remove unnecessary casts.

Member Author

Currently the optimizer doesn't remove the cast when the only difference from finalDataType is in the nullability of nested types.
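A sketch of what castIfNotSameType from the diff above could look like (standalone toy model with simplified stand-in types; Catalyst's real sameType also handles maps and structs): cast only when the types differ beyond nullability, so no nullability-only cast exists for the optimizer to strip in the first place:

```scala
// Toy stand-ins for Catalyst's types and expressions; illustration only.
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType
case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType

sealed trait Expr { def dataType: DataType }
case class Lit(dataType: DataType) extends Expr
case class Cast(child: Expr, target: DataType) extends Expr { def dataType: DataType = target }

// Equality ignoring nullability flags (like Catalyst's DataType.sameType).
def sameType(a: DataType, b: DataType): Boolean = (a, b) match {
  case (ArrayType(e1, _), ArrayType(e2, _)) => sameType(e1, e2)
  case _ => a == b
}

// Only wrap in a Cast when the types really differ; a cast existing solely
// to fix nullability would be unreliable after optimization anyway.
def castIfNotSameType(e: Expr, target: DataType): Expr =
  if (sameType(e.dataType, target)) e else Cast(e, target)
```

With this, a child whose type differs only in containsNull is left uncast, and the parent expression is responsible for merging the flags.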

@cloud-fan
Contributor

what's still WIP for this PR?

@ueshin ueshin changed the title [WIP][SPARK-24734][SQL] Fix containsNull of Concat for array type. [SPARK-24734][SQL] Fix containsNull of Concat for array type. Jul 12, 2018
// we need to truncate, but we should not promote one side to string if the other side is
// string.
case g @ Greatest(children) if !haveSameType(children) =>
case g @ Greatest(children) if !g.areInputTypesForMergingEqual =>
Contributor

These are not just for concat; can you update the PR title?

Member Author

Updated PR title and description.

TypeUtils.checkForSameTypeInputExpr(children.map(_.dataType), s"function $prettyName")
}

override def dataType: ArrayType = {
Contributor

Why doesn't CreateArray extend ComplexTypeMergingExpression?

Member Author

Because the data type of CreateArray itself is not the merged type but an ArrayType.

@ueshin ueshin changed the title [SPARK-24734][SQL] Fix containsNull of Concat for array type. [SPARK-24734][SQL] Fix type coercions and nullabilities of nested data types of some functions. Jul 12, 2018
keys.map(_.dataType.simpleString).mkString("[", ", ", "]"))
} else if (values.map(_.dataType).distinct.length > 1) {
} else if (values.length > 1 &&
values.map(_.dataType).sliding(2, 1).exists { case Seq(t1, t2) => !t1.sameType(t2) }) {
Contributor

The checks for keys and values are very similar. Would it be possible to separate the common logic into a private method?
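One possible shape for that private helper (hypothetical and simplified; the real code would compare with DataType.sameType rather than plain equality): a single check shared by the key and value branches:

```scala
// Toy stand-ins for Catalyst's types; illustration only.
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType

// Simplified stand-in for DataType.sameType.
def sameType(a: DataType, b: DataType): Boolean = a == b

// Shared check for both the keys and the values of CreateMap: do any two
// adjacent entries disagree on type?
def hasMixedTypes(types: Seq[DataType]): Boolean =
  types.length > 1 && types.sliding(2).exists { case Seq(t1, t2) => !sameType(t1, t2) }
```

Both the keys branch and the values branch of the check could then call this one method instead of duplicating the sliding comparison.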

if (types.isEmpty) {
None
} else {
types.tail.foldLeft(Option(types.head)) {
Contributor

nit: Some(types.head) to avoid an extra null check?


SparkQA commented Jul 12, 2018

Test build #92922 has finished for PR 21704 at commit f701242.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


private def haveSameType(exprs: Seq[Expression]): Boolean =
exprs.map(_.dataType).distinct.length == 1
private def haveSameType(exprs: Seq[Expression]): Boolean = {
Contributor

This is duplicated with ComplexTypeMergingExpression.areInputTypesForMergingEqual; can we unify them? We can:

  1. remove haveSameType. Any expression that needs this check should extend ComplexTypeMergingExpression, and the TypeCoercion rule should call areInputTypesForMergingEqual.
  2. remove areInputTypesForMergingEqual. ComplexTypeMergingExpression should only define the list of data types that need to be merged, and the TypeCoercion rule should call haveSameType(e.inputTypesForMerging).

Member Author

@ueshin ueshin Jul 13, 2018

Since we have CreateArray and CreateMap, we can't make all such expressions extend ComplexTypeMergingExpression.
I'd go with approach 2).
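Approach 2) can be sketched like this (toy stand-ins, not the actual Catalyst classes): the trait only exposes the types to merge, and the coercion side owns a single haveSameType check:

```scala
// Toy stand-ins for Catalyst's types; illustration only.
sealed trait DataType
case object IntegerType extends DataType
case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType

// The trait only defines WHICH types participate in merging...
trait TypeMergingExpr { def inputTypesForMerging: Seq[DataType] }

// ...while the coercion side keeps the single equality check. Exact equality
// (including nullability flags) means nested-nullability differences still
// trigger the merging path.
def haveSameType(types: Seq[DataType]): Boolean = types.distinct.length == 1

// Hypothetical Coalesce-like expression for demonstration.
case class CoalesceLike(inputTypesForMerging: Seq[DataType]) extends TypeMergingExpr

val e = CoalesceLike(Seq(ArrayType(IntegerType, containsNull = false),
                         ArrayType(IntegerType, containsNull = true)))
val needsCoercion = !haveSameType(e.inputTypesForMerging)
```

Expressions like CreateArray and CreateMap, whose own data type is not the merged type, can still reuse haveSameType without extending the trait.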


SparkQA commented Jul 13, 2018

Test build #92967 has finished for PR 21704 at commit 5115961.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM


SparkQA commented Jul 16, 2018

Test build #93083 has finished for PR 21704 at commit e489e8b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class ArraySetLike extends BinaryArrayExpressionWithImplicitCast


ueshin commented Jul 16, 2018

Jenkins, retest this please.


SparkQA commented Jul 16, 2018

Test build #93087 has finished for PR 21704 at commit e489e8b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class ArraySetLike extends BinaryArrayExpressionWithImplicitCast

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in b045315 Jul 16, 2018