Skip to content

Conversation

@dbtsai
Copy link
Member

@dbtsai dbtsai commented Jul 23, 2018

What changes were proposed in this pull request?

When trueValue and falseValue are semantic equivalence, the condition expression in if can be removed to avoid extra computation in runtime.

How was this patch tested?

Test added.

@dbtsai dbtsai changed the title [SPARK-24890] [SQ] Short circuiting the if condition when trueValue and falseValue are the same [SPARK-24890] [SQL] Short circuiting the if condition when trueValue and falseValue are the same Jul 23, 2018
}

case e @ CaseWhen(branches, _) if branches.headOption.map(_._1) == Some(TrueLiteral) =>
case CaseWhen(branches, _) if branches.headOption.map(_._1).contains(TrueLiteral) =>
Copy link
Member

@dongjoon-hyun dongjoon-hyun Jul 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we removed Scala 2.10, it seems to be okay. However, if we revert this irrelevant change, this PR becomes neater (and easier for someone to backport this).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's not a bug fix, I guess it's unlikely someone will backport this :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. In the community, it's not allowed for backport. I mean the others who want to have this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, in any event, wouldn't it be better to revert this change back if there's any actual advantage against a unrelated style change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally, we avoid adding unneeded refactoring in such a PR. Please avoid it next time. Thanks!

@viirya
Copy link
Member

viirya commented Jul 23, 2018

LGTM
We should consider side effect of condition expression. See the discussion below.

}
}


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra space line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we always have two new lines between two objects

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's okay to remove it back though assuming from

Use one or two blank line(s) to separate class definitions.

https://github.com/databricks/scala-style-guide#blank-lines-vertical-whitespace

Looks either way is fine.

case If(TrueLiteral, trueValue, _) => trueValue
case If(FalseLiteral, _, falseValue) => falseValue
case If(Literal(null, _), _, falseValue) => falseValue
case If(_, trueValue, falseValue) if trueValue.semanticEquals(falseValue) => trueValue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not right. The condition must be deterministic

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate?

For trueValue.semanticEquals(falseValue), it's guaranteed that both trueValue and falseValue are deterministic.

def semanticEquals(other: Expression): Boolean =
    deterministic && other.deterministic && canonicalized == other.canonicalized

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understandable that the condition can be non-deterministic, but this doesn't change the result of If.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition could have a side effect. For example, calling a stateful UDF.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

@dongjoon-hyun
Copy link
Member

Since this skips the evaluation of if condition, this will cause the following difference.

This PR

scala> sql("select * from t").show
+----+
|   a|
+----+
|   1|
|null|
+----+

scala> sql("select if(assert_true(a is null),a,a) from t").show
+-----------------------------------------------------+
|(IF(CAST(assert_true((a IS NULL)) AS BOOLEAN), a, a))|
+-----------------------------------------------------+
|                                                    1|
|                                                 null|
+-----------------------------------------------------+

Spark 2.3.1

scala> sql("select * from t").show
+----+
|   a|
+----+
|   1|
|null|
+----+

scala> sql("select if(assert_true(a is null),a,a) from t").show
18/07/23 11:59:11 ERROR Executor: Exception in task 0.0 in stage 20.0 (TID 20)
java.lang.RuntimeException: 'isnull(input[0, int, true])' is not true!

@viirya
Copy link
Member

viirya commented Jul 23, 2018 via email

@SparkQA
Copy link

SparkQA commented Jul 23, 2018

Test build #93454 has finished for PR 21848 at commit 30ca8cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented Jul 23, 2018

For now, seems we don't have a good way to know if an expression has side effect. Some expressions like AssertTrue should be marked as one with side effect. Maybe we should create a trait for this purpose.

@dbtsai
Copy link
Member Author

dbtsai commented Jul 23, 2018

@gatorsmile this can remove some of the expensive condition expressions, so I would like to find a way to properly implement this.

Thank you all for chiming in with many good points. Let me summary here.

  1. The cond expression can only be removed when it doesn't have a side effect.
  2. Stateful expression must have a side effect.
  3. Some of the non-stateful expressions such as AssertTrue have a side effect.
  4. Determinstic expression could have a side effect.
  5. Nondeterministic expressions always have a side effect.

This means determinstic is not enough, and we need another variable to check if an expression has a side effect.

@gatorsmile
Copy link
Member

Currently, we are setting the expressions deterministic to false when they are either having side effect or non-deterministic. We already did it for Hive UDFs who have stateful tags.

We should change deterministic to false for both AssertTrue and AssertNotNull.

@dbtsai
Copy link
Member Author

dbtsai commented Jul 24, 2018

This will simplify the scope of this PR a lot. My concern is the more non-deterministic expressions we have, the less optimization we can do. Luckily, both of them are not used in general expressions.

@SparkQA
Copy link

SparkQA commented Jul 24, 2018

Test build #93471 has finished for PR 21848 at commit bf0b2d9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented Jul 24, 2018

Hmm, seems we have limitation on where non deterministic expressions can be in.

@kiszk
Copy link
Member

kiszk commented Jul 24, 2018

@dbtsai I have a question. How does the current code check the following condition?

Stateful expression must have a side effect.

case class AssertNotNull(child: Expression, walkedTypePath: Seq[String] = Nil)
extends UnaryExpression with NonSQLExpression {

override lazy val deterministic: Boolean = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let us create a separate PR for the changes on deterministic? We need extra changes when we changing the flags.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair. I'll create a followup PR for this.

@dbtsai
Copy link
Member Author

dbtsai commented Jul 24, 2018

@kiszk trait Stateful extends Nondeterministic, and this rule will not be invoked when an expression is nondeterministic.

@dbtsai
Copy link
Member Author

dbtsai commented Jul 24, 2018

Here is a followup JIRA for making AssertTrue and AssertNotNull non-deterministic https://issues.apache.org/jira/browse/SPARK-24913

@SparkQA
Copy link

SparkQA commented Jul 25, 2018

Test build #93523 has finished for PR 21848 at commit b4f1431.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

LGTM

Thanks! Merged to master.

case If(FalseLiteral, _, falseValue) => falseValue
case If(Literal(null, _), _, falseValue) => falseValue
case If(cond, trueValue, falseValue)
if cond.deterministic && trueValue.semanticEquals(falseValue) => trueValue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we'll still have this problem by skipping the evaluation of cond ..
Lately, SPARK-33544 introduced another approach for that. I think that superseded SPARK-24913. I think we can switch it to use SPARK-33544 approach.

@dbtsai, can we try and follow up it with using NoThrow?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @tgravescs too FYI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants