[SPARK-17981] [SPARK-17957] [SQL] Fix Incorrect Nullability Setting to False in FilterExec #15523

gatorsmile · 2016-10-17T23:47:05Z

What changes were proposed in this pull request?

When FilterExec contains an isNotNull predicate, whether inferred and pushed down or specified by the user, we mark the involved columns as non-nullable if the top-level expression is null-intolerant. However, this is not correct: if the top-level expression is not a leaf expression, it can still tolerate null when it has null-tolerant child expressions.

For example, consider cast(coalesce(a#5, a#15) as double). Although cast is a null-intolerant expression, coalesce is null-tolerant, so the expression as a whole can absorb a null input.
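To make the distinction concrete, here is a minimal sketch using a toy expression model. All types and names below (`NullToleranceSketch`, `Attr`, `Cast`, `Coalesce`, `topLevelOnly`, `wholeTree`) are hypothetical stand-ins for illustration, not Spark's Catalyst classes. It contrasts a check that inspects only the root operator with one that requires every node in the tree to be null-intolerant.

```scala
object NullToleranceSketch {
  // Toy model only: hypothetical stand-ins, not Spark's Catalyst classes.
  sealed trait Expr { def children: Seq[Expr] }
  // Marker for operators that return null whenever any input is null.
  sealed trait NullIntolerant extends Expr

  final case class Attr(name: String) extends NullIntolerant {
    val children: Seq[Expr] = Nil
  }
  final case class Cast(child: Expr) extends NullIntolerant {
    val children: Seq[Expr] = Seq(child)
  }
  // Coalesce returns its first non-null input, so it can absorb nulls.
  final case class Coalesce(children: Seq[Expr]) extends Expr

  // Looks at the root operator and nothing else.
  def topLevelOnly(e: Expr): Boolean = e.isInstanceOf[NullIntolerant]

  // Requires the root and every descendant to be null-intolerant.
  def wholeTree(e: Expr): Boolean = e match {
    case n: NullIntolerant => n.children.forall(wholeTree)
    case _ => false
  }
}
```

In this model, `Cast(Coalesce(Seq(Attr("a#5"), Attr("a#15"))))` passes `topLevelOnly` but fails `wholeTree`, which mirrors the case described above: a non-null result says nothing about whether each input column was non-null.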

When the nullability is wrong, we could generate incorrect results in different cases. For example,

    val df1 = Seq((1, 2), (2, 3)).toDF("a", "b")
    val df2 = Seq((2, 5), (3, 4)).toDF("a", "c")
    val joinedDf = df1.join(df2, Seq("a"), "outer").na.fill(0)
    val df3 = Seq((3, 1)).toDF("a", "d")
    joinedDf.join(df3, "a").show

The optimized plan looks like this:

    Project [a#29, b#30, c#31, d#42]
    +- Join Inner, (a#29 = a#41)
       :- Project [cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as int) AS a#29, cast(coalesce(cast(b#6 as double), 0.0) as int) AS b#30, cast(coalesce(cast(c#16 as double), 0.0) as int) AS c#31]
       :  +- Filter isnotnull(cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as int))
       :     +- Join FullOuter, (a#5 = a#15)
       :        :- LocalRelation [a#5, b#6]
       :        +- LocalRelation [a#15, c#16]
       +- LocalRelation [a#41, d#42]

Without the fix, it returns an empty result. With the fix, it can return a correct answer:

    +---+---+---+---+
    |  a|  b|  c|  d|
    +---+---+---+---+
    |  3|  0|  4|  1|
    +---+---+---+---+

How was this patch tested?

Added test cases to verify the nullability changes in FilterExec. Also added a test case for verifying the reported incorrect result.

SparkQA · 2016-10-18T01:54:57Z

Test build #67096 has finished for PR 15523 at commit 54c3cc8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-10-18T04:18:43Z

cc @cloud-fan @davies @sameeragarwal

viirya · 2016-10-18T12:25:24Z

sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala

+  // One expression is null intolerant iff it and its children are null intolerant
+  private def isNullIntolerant(expr: Expression): Boolean = expr match {
+    case e: NullIntolerant =>
+      if (e.isInstanceOf[LeafExpression]) true else e.children.forall(isNullIntolerant)


Thanks for fixing this!

This change is too conservative. Actually, we only need to consider a non-NullIntolerant expression when it contains attributes from the output. I think we can be more aggressive, e.g.:

    // Split out all the IsNotNulls from condition.
    private val (notNullPreds, otherPreds) = splitConjunctivePredicates(condition).partition {
      case IsNotNull(a: NullIntolerant) => isNullIntolerant(a) && a.references.subsetOf(child.outputSet)
      case _ => false
    }

    private def isNullIntolerant(expr: Expression): Boolean = {
      expr.find { e =>
        !e.isInstanceOf[NullIntolerant] && e.references.subsetOf(child.outputSet)
      }.isEmpty
    }

I just realized the original code was from your PR. In that case, why does your code above still need to keep a.references.subsetOf(child.outputSet)? It looks confusing to me.

Even if a passes the isNullIntolerant check, i.e., it contains no non-NullIntolerant expression that wraps output attributes, we don't need it if it doesn't refer to any output attributes.

Could you show me an example?

IsNotNull(Rand() > 0.5)?

uh, I see.

First, we definitely need test cases to cover each positive and negative scenario. Previously, we did not have any test case to check the validity of nullability changes. Second, the code needs more comments where the variable and function names cannot explain it on their own.

gatorsmile · 2016-10-19T20:31:54Z

@viirya I have not changed the algorithm. I just tried to improve the test case coverage.

Thanks to constructIsNotNullConstraints, the existing solution already covers all the cases, right?

Can you help me check whether any scenario is missing? Thanks!

SparkQA · 2016-10-19T22:33:08Z

Test build #67212 has finished for PR 15523 at commit ce418f9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-10-20T01:41:53Z

@gatorsmile A predicate like IsNotNull(a + b + Rand()) will cause this change to wrongly set the nullability of a and b to true, won't it?

gatorsmile · 2016-10-20T02:14:52Z

The parameter name of the verification function is wrong. It should be expectedNonNullableColumns. Please check the test case again.

SparkQA · 2016-10-20T04:24:46Z

Test build #67229 has finished for PR 15523 at commit 52cb8fb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-24T19:11:51Z

Test build #67457 has finished for PR 15523 at commit 4f2101e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-10-24T20:13:37Z

cc @cloud-fan @davies @sameeragarwal @viirya

viirya · 2016-10-27T02:30:21Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

+    verifyNullabilityInFilterExec(df,
+      expr = "_1", expectedNonNullableColumns = Seq("_1"))
+    verifyNullabilityInFilterExec(df,
+      expr = "_2 + Rand()", expectedNonNullableColumns = Seq("_2"))


This should not work with the current approach. It works now only because we infer redundant IsNotNull constraints. E.g., if Filter has a constraint IsNotNull(_2 + Rand()), we will infer another IsNotNull(_2) from it. Your approach relies on IsNotNull(_2), not IsNotNull(_2 + Rand()), to decide that _2 is a non-nullable column.

I submitted another PR, #15653, to address the redundant IsNotNull constraints. But I am not sure whether we want to fix it, since it doesn't affect correctness. I will leave that to @cloud-fan or @hvanhovell to decide.

I already explained why my current solution works in my previous statement. Personally, I like simple code, which is easy to understand and maintain, especially when it can cover all the cases. Result correctness and code maintainability are always more important.

If constructIsNotNullConstraints is changed by somebody else (i.e., it no longer provides the expected IsNotNull constraints), the test cases added by this PR will fail. Then, we can modify the code.

Yeah, that is why I said I will leave that to @cloud-fan or others to decide...

I agree with the simplicity argument but can you please add a comment here explaining why this particular case is working due to the null inference rule?
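The inference this thread relies on can be sketched roughly as follows. This is a simplified toy model under stated assumptions, not the actual constructIsNotNullConstraints implementation: the types (`Attr`, `Add`, `Rand`, `Coalesce`, `IsNotNull`) and the helpers `nonNullAttrs` and `inferIsNotNull` are hypothetical names introduced for illustration. The idea is that a non-null result from a chain of null-intolerant operators implies each attribute reached through that chain is non-null.

```scala
object InferenceSketch {
  // Hypothetical stand-ins for Catalyst expressions; not Spark's real classes.
  sealed trait Expr { def children: Seq[Expr] }
  sealed trait NullIntolerant extends Expr

  final case class Attr(name: String) extends NullIntolerant {
    val children: Seq[Expr] = Nil
  }
  final case class Add(left: Expr, right: Expr) extends NullIntolerant {
    val children: Seq[Expr] = Seq(left, right)
  }
  // Assumption for this sketch: Rand is a leaf that is not marked null-intolerant.
  case object Rand extends Expr { val children: Seq[Expr] = Nil }
  final case class Coalesce(children: Seq[Expr]) extends Expr
  final case class IsNotNull(child: Expr) extends Expr {
    val children: Seq[Expr] = Seq(child)
  }

  // Attributes that must be non-null for `e` to be non-null: descend through
  // null-intolerant operators only and stop at anything else.
  def nonNullAttrs(e: Expr): Seq[Attr] = e match {
    case a: Attr => Seq(a)
    case n: NullIntolerant => n.children.flatMap(nonNullAttrs)
    case _ => Nil
  }

  // From IsNotNull(expr), derive the simpler IsNotNull(attr) constraints.
  def inferIsNotNull(p: IsNotNull): Seq[IsNotNull] =
    nonNullAttrs(p.child).map(IsNotNull(_))
}
```

In this model, `IsNotNull(Add(Attr("_2"), Rand))` yields `IsNotNull(Attr("_2"))`, which is the simpler constraint the test case in question ends up depending on, while a `Coalesce` at the top yields nothing.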

SparkQA · 2016-10-28T09:36:45Z

Test build #67697 has finished for PR 15523 at commit 49daace.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-10-28T23:32:52Z

LGTM. Waiting for @cloud-fan or others to do a second check.

sameeragarwal

Thanks, this LGTM too, with just a couple of extremely minor comments. As far as the change is concerned, I tend to side with the simplicity argument for now (especially if we're planning to target 2.1).

sameeragarwal · 2016-11-02T23:49:20Z

sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala

+  // If one expression and its children are null intolerant, it is null intolerant.
+  private def isNullIntolerant(expr: Expression): Boolean = expr match {
+    case e: NullIntolerant =>
+      if (e.isInstanceOf[LeafExpression]) true else e.children.forall(isNullIntolerant)


nit: How about something like this for better readability:

    private def isNullIntolerant(expr: Expression): Boolean = expr match {
      case e: NullIntolerant with LeafExpression => true
      case e: NullIntolerant => e.children.forall(isNullIntolerant)
      case _ => false
    }

forall returns true when children is empty, so we can remove the first case. Now it becomes simpler. : )

    private def isNullIntolerant(expr: Expression): Boolean = expr match {
      case e: NullIntolerant => e.children.forall(isNullIntolerant)
      case _ => false
    }
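The simplification in this thread rests on `forall` being vacuously true for an empty collection. A small self-contained sketch can check that the two shapes agree; the miniature expression model here (`Leaf`, `Unary`, `Tolerant`) and the function names are hypothetical, chosen only to mirror the structure of the two versions above.

```scala
object ForallSketch {
  // Hypothetical miniature expression model; not Spark's real classes.
  sealed trait Expr { def children: Seq[Expr] }
  sealed trait NullIntolerant extends Expr

  final case class Leaf(name: String) extends NullIntolerant {
    val children: Seq[Expr] = Nil
  }
  final case class Unary(child: Expr) extends NullIntolerant {
    val children: Seq[Expr] = Seq(child)
  }
  final case class Tolerant(children: Seq[Expr]) extends Expr

  // Shape with an explicit case for childless null-intolerant nodes.
  def withLeafCase(e: Expr): Boolean = e match {
    case n: NullIntolerant if n.children.isEmpty => true
    case n: NullIntolerant => n.children.forall(withLeafCase)
    case _ => false
  }

  // Shape that relies on forall being true for an empty Seq.
  def withoutLeafCase(e: Expr): Boolean = e match {
    case n: NullIntolerant => n.children.forall(withoutLeafCase)
    case _ => false
  }
}
```

Both functions return the same answer for every tree in this model, because the leaf case is exactly the situation in which `children.forall` has nothing to check.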

sameeragarwal · 2016-11-02T23:53:26Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

+    verifyNullabilityInFilterExec(df,
+      expr = "_1", expectedNonNullableColumns = Seq("_1"))
+    verifyNullabilityInFilterExec(df,
+      expr = "_2 + Rand()", expectedNonNullableColumns = Seq("_2"))


I agree with the simplicity argument but can you please add a comment here explaining why this particular case is working due to the null inference rule?

SparkQA · 2016-11-03T06:00:09Z

Test build #68045 has finished for PR 15523 at commit 2364cc2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2016-11-03T15:35:01Z

Merging to master/2.1. Thanks!

…False in FilterExec

Author: gatorsmile

Closes #15523 from gatorsmile/nullabilityFilterExec.

(cherry picked from commit 66a99f4)
Signed-off-by: Herman van Hovell

hvanhovell · 2016-11-03T15:37:57Z

@gatorsmile can you open a PR for 2.0 if we also need to port it to that branch?

gatorsmile · 2016-11-03T16:45:55Z

Sure, @hvanhovell, I will do it. Thanks!

…ty Setting to False in FilterExec

**This PR is to backport the fix #15523 to 2.0.**

Author: gatorsmile

Closes #15781 from gatorsmile/nullabilityFix.

fix

54c3cc8

gatorsmile changed the title ~~[SPARK-17981] [SPARK-17957] [SQL] Incorrectly Set Nullability to False in FilterExec~~ [SPARK-17981] [SPARK-17957] [SQL] Fix Incorrect Nullability Setting to False in FilterExec Oct 17, 2016

viirya reviewed Oct 18, 2016

add more test cases

ce418f9

change the parm name to expectedNonNullableColumns

52cb8fb

gatorsmile added 2 commits October 24, 2016 09:49

Merge remote-tracking branch 'upstream/master' into NoPPDIsNotNull

c25df4d

merge

4f2101e

viirya reviewed Oct 27, 2016

update the comment

49daace

sameeragarwal approved these changes Nov 3, 2016

address comments.

2364cc2

asfgit closed this in 66a99f4 Nov 3, 2016

gatorsmile mentioned this pull request Nov 5, 2016

[SPARK-17981] [SPARK-17957] [SQL] [BACKPORT-2.0] Fix Incorrect Nullability Setting to False in FilterExec #15781

Closed

JoshRosen mentioned this pull request Jun 1, 2019

[SPARK-27915][SQL][WIP] Update logical Filter's output nullability based on IsNotNull conditions #24765

Closed
