Closed · Changes from 2 commits
@@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
}
}

-  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
+  override def dataType: DataType = {
+    val dataTypes = children.map(_.dataType)
+    dataTypes.headOption.map {
+      case ArrayType(et, _) =>
+        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
Contributor: do we support array of array in concat?

Contributor: Yes, it should work (see a test for it). Did we miss anything?

Contributor: then shall we fix the containsNull for the inner array?

Contributor: yeah, + valueContainsNull for MapType and nullable for StructField

Member Author: I guess the inner nullabilities are coerced during type-coercion? If the inner nullabilities are different, type coercion adds casts and they will remain.

Contributor: @ueshin For Concat, Coalesce, etc. that seems to be the case, since a coercion rule is executed if there is any nullability difference on any level of nesting. But it's not the case for the CaseWhenCoercion rule, since the sameType method is used for comparison.

I'm wondering: if the goal is to avoid generating extra Cast expressions, shouldn't other coercion rules utilize the sameType method as well? If the result of concat is subsequently used by flatten, wouldn't it lead to generation of extra null-safe checks as mentioned here?

Contributor: Please ignore the part of my previous comment regarding the flatten function. The output data type of concat, etc. will be the same regardless of what resolves the null flags.

+      case dt => dt
+    }.getOrElse(StringType)
+  }
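For readers outside the Spark codebase, the merge above can be sketched with simplified stand-ins for the type classes (these are minimal models, not Spark's real DataType/ArrayType): for array children, the result's containsNull is the OR of all children's flags.

```scala
// Minimal stand-ins for Spark's type hierarchy (illustrative only).
sealed trait DataType
case object StringType extends DataType
case object IntegerType extends DataType
case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType

// Mirrors the patched Concat.dataType: for array children, the result's
// containsNull is true iff any child's containsNull is true.
def concatDataType(childTypes: Seq[DataType]): DataType =
  childTypes.headOption.map {
    case ArrayType(et, _) =>
      ArrayType(et, childTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    case dt => dt
  }.getOrElse(StringType)
```

So concatenating a null-free array with one that may contain null yields containsNull = true, while an empty child list falls back to StringType (the string-concat case).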
Member: Can't we handle this case in type coercion (analysis phase)?

Member Author (@ueshin), Jul 4, 2018: Actually, Concat for array type has a type coercion rule that adds casts to make all children the same type, but we also have the optimization SimplifyCasts to remove unnecessary casts, which might remove casts from arrays not containing null to arrays containing null (optimizer/expressions.scala#L611).

E.g., concat(array(1,2,3), array(4,null,6)) might generate a wrong data type during execution.
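That interaction can be modeled in isolation. The sketch below is a toy model (ArrayLit, Cast, and simplify here are made up for illustration, not Spark's classes) of why dropping a containsNull-widening cast changes the type the expression reports:

```scala
// Toy model of the SimplifyCasts hazard (illustrative only).
case class ArrayType(elementType: String, containsNull: Boolean)

sealed trait Expr { def dataType: ArrayType }
case class ArrayLit(tpe: ArrayType) extends Expr { def dataType: ArrayType = tpe }
case class Cast(child: Expr, target: ArrayType) extends Expr { def dataType: ArrayType = target }

// A SimplifyCasts-like rule: removing a cast whose element type already
// matches is value-preserving, but it also discards the widened containsNull.
def simplify(e: Expr): Expr = e match {
  case Cast(child, target) if child.dataType.elementType == target.elementType => child
  case other => other
}

// Type coercion casts array<int> with containsNull = false up to true...
val coerced = Cast(ArrayLit(ArrayType("int", containsNull = false)),
                   ArrayType("int", containsNull = true))
// ...but after simplification the expression reports containsNull = false
// again, so a parent deriving its type from this child sees the wrong flag.
val simplified = simplify(coerced)
```

This is why deriving the flag from all children inside dataType itself, as the patch does, is robust against the cast being optimized away.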

Member Author: Added a test to show the wrong nullability.

Member (@maropu), Jul 4, 2018: Aha, I see. But I have a hunch that SimplifyCasts cannot simplify array casts in some cases, e.g., this concat case. Since we basically cannot change plan semantics in the optimization phase, I feel a little weird about this simplification. Anyway, I'm OK with your approach because I can't find a better & simpler way to solve this in the analysis phase... Thanks!

Member Author: Hmm, that also makes sense. I'm not sure we can remove the simplification, though. cc @gatorsmile @cloud-fan

Member Author: Yeah, Coalesce should be fixed, and Least and Greatest are also suspicious.

Contributor:
> what about changing SimplifyCasts rule to start replacing Cast with a new dummy cast expression that will hold only a target data type and won't perform any casting?

This can work, but my point is that we should not add the cast just to change containsNull, as it may not match the underlying child expression and generates unnecessary null-check code.

Contributor: Oh, I see. In that case, it would be nice to introduce a method that resolves the output DataType and merges the nullable/containsNull flags of non-primitive types recursively for such expressions. In most cases we could encapsulate the function into a new trait. WDYT?
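A rough sketch of that suggestion (the names mergeNullability and MergesNullability are hypothetical, not Spark API, and the type classes are simplified stand-ins): recursively OR the flags of structurally equal types and expose the merged type as dataType.

```scala
// Hypothetical sketch of the suggested trait (illustrative names, not Spark API).
sealed trait DataType
case object IntegerType extends DataType
case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType
case class StructField(name: String, dataType: DataType, nullable: Boolean)
case class StructType(fields: Seq[StructField]) extends DataType

// Merge nullable/containsNull flags of two structurally equal types, recursively.
def mergeNullability(a: DataType, b: DataType): DataType = (a, b) match {
  case (ArrayType(ea, na), ArrayType(eb, nb)) =>
    ArrayType(mergeNullability(ea, eb), na || nb)
  case (StructType(fa), StructType(fb)) =>
    StructType(fa.zip(fb).map { case (x, y) =>
      StructField(x.name, mergeNullability(x.dataType, y.dataType), x.nullable || y.nullable)
    })
  case _ => a
}

// Expressions like Concat or Coalesce could mix this in to derive their
// output type from all children instead of just the first one.
trait MergesNullability {
  def childTypes: Seq[DataType]
  def dataType: DataType = childTypes.reduce(mergeNullability)
}
```

A real version would also handle MapType's valueContainsNull, per the earlier comment.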

Contributor: SGTM

Member: +1 for @mn-mikke's idea


lazy val javaType: String = CodeGenerator.javaType(dataType)

@@ -912,11 +912,11 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper

test("Concat") {
// Primitive-type elements
-    val ai0 = Literal.create(Seq(1, 2, 3), ArrayType(IntegerType))
-    val ai1 = Literal.create(Seq.empty[Integer], ArrayType(IntegerType))
-    val ai2 = Literal.create(Seq(4, null, 5), ArrayType(IntegerType))
-    val ai3 = Literal.create(Seq(null, null), ArrayType(IntegerType))
-    val ai4 = Literal.create(null, ArrayType(IntegerType))
+    val ai0 = Literal.create(Seq(1, 2, 3), ArrayType(IntegerType, containsNull = false))
+    val ai1 = Literal.create(Seq.empty[Integer], ArrayType(IntegerType, containsNull = false))
+    val ai2 = Literal.create(Seq(4, null, 5), ArrayType(IntegerType, containsNull = true))
+    val ai3 = Literal.create(Seq(null, null), ArrayType(IntegerType, containsNull = true))
+    val ai4 = Literal.create(null, ArrayType(IntegerType, containsNull = true))

checkEvaluation(Concat(Seq(ai0)), Seq(1, 2, 3))
checkEvaluation(Concat(Seq(ai0, ai1)), Seq(1, 2, 3))
@@ -929,11 +929,11 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
checkEvaluation(Concat(Seq(ai4, ai0)), null)

// Non-primitive-type elements
-    val as0 = Literal.create(Seq("a", "b", "c"), ArrayType(StringType))
-    val as1 = Literal.create(Seq.empty[String], ArrayType(StringType))
-    val as2 = Literal.create(Seq("d", null, "e"), ArrayType(StringType))
-    val as3 = Literal.create(Seq(null, null), ArrayType(StringType))
-    val as4 = Literal.create(null, ArrayType(StringType))
+    val as0 = Literal.create(Seq("a", "b", "c"), ArrayType(StringType, containsNull = false))
+    val as1 = Literal.create(Seq.empty[String], ArrayType(StringType, containsNull = false))
+    val as2 = Literal.create(Seq("d", null, "e"), ArrayType(StringType, containsNull = true))
+    val as3 = Literal.create(Seq(null, null), ArrayType(StringType, containsNull = true))
+    val as4 = Literal.create(null, ArrayType(StringType, containsNull = true))

val aa0 = Literal.create(Seq(Seq("a", "b"), Seq("c")), ArrayType(ArrayType(StringType)))
val aa1 = Literal.create(Seq(Seq("d"), Seq("e", "f")), ArrayType(ArrayType(StringType)))
@@ -949,6 +949,11 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
checkEvaluation(Concat(Seq(as4, as0)), null)

checkEvaluation(Concat(Seq(aa0, aa1)), Seq(Seq("a", "b"), Seq("c"), Seq("d"), Seq("e", "f")))

+    assert(Concat(Seq(ai0, ai1)).dataType.asInstanceOf[ArrayType].containsNull === false)
+    assert(Concat(Seq(ai0, ai2)).dataType.asInstanceOf[ArrayType].containsNull === true)
+    assert(Concat(Seq(as0, as1)).dataType.asInstanceOf[ArrayType].containsNull === false)
+    assert(Concat(Seq(as0, as2)).dataType.asInstanceOf[ArrayType].containsNull === true)
}

test("Flatten") {
@@ -1492,6 +1492,14 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
assert(errMsg.contains(s"input to function $name requires at least two arguments"))
}
}

+  test("SPARK-24734: Fix containsNull of Concat for array type") {
+    val df = Seq((Seq(1), Seq[Integer](null), Seq("a", "b"))).toDF("k1", "k2", "v")
+    val ex = intercept[RuntimeException] {
+      df.select(map_from_arrays(concat($"k1", $"k2"), $"v")).show()
+    }
+    assert(ex.getMessage.contains("Cannot use null as map key"))
+  }
}

object DataFrameFunctionsSuite {