[SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions #32987

Kimahriman · 2021-06-20T13:35:30Z

What changes were proposed in this pull request?

I am proposing to add support for conditionally evaluated expressions during subexpression elimination. Currently, only expressions that will definitely be evaluated at least twice are candidates for subexpression elimination. This PR updates that logic so that expressions that are always evaluated at least once and conditionally evaluated at least once are also candidates for subexpression elimination. This helps optimize a common case during data normalization and cleaning where you want to null out values that don't match a certain pattern, and you have something like:

transformed = F.regexp_replace(F.lower(F.trim('my_column')), <some pattern>, <some replacement>)
df.withColumn('normalized_value', F.when(F.length(transformed) > 0, transformed))

or

df.withColumn('normalized_value', F.when(transformed.rlike(<some regex>), transformed))

In these cases, transformed is currently calculated in full twice whenever the condition is true: it is not treated as a common subexpression because it might only be needed once. I am proposing creating a subexpression for transformed in this case.
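The double evaluation can be illustrated without Spark. The sketch below is plain Python, not Spark code: `transform` is a hypothetical stand-in for the `transformed` column expression, and the two `normalize_*` functions only mimic what happens per row without and with a shared subexpression.

```python
import re

calls = {"n": 0}  # counts how often the expensive transform runs


def transform(value: str) -> str:
    """Hypothetical stand-in for regexp_replace(lower(trim(col)), ...)."""
    calls["n"] += 1
    return re.sub(r"\s+", " ", value.strip().lower())


def normalize_without_cse(value: str):
    # Mirrors when(length(transformed) > 0, transformed) with no shared
    # subexpression: the transform runs in the condition and again in the value.
    if len(transform(value)) > 0:
        return transform(value)
    return None


def normalize_with_cse(value: str):
    # The transform is computed once and reused by the condition and the value.
    shared = transform(value)
    return shared if len(shared) > 0 else None


def count_calls(fn, rows):
    """Run fn over rows and report (results, number of transform calls)."""
    calls["n"] = 0
    results = [fn(r) for r in rows]
    return results, calls["n"]


rows = ["  Foo   Bar ", "   ", "BAZ"]
print(count_calls(normalize_without_cse, rows))
print(count_calls(normalize_with_cse, rows))
```

Both variants return the same values; the difference is only in how many times `transform` runs for rows where the condition holds.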

In practice I've seen a decrease in runtime and codegen size of 10-30% in our production pipelines that heavily make use of this type of logic.

The only potential downside is creating more subexpressions, and therefore more function calls, than necessary. This should only be an issue for certain edge cases where your conditional overwhelmingly evaluates to false. Even then, the only overhead is running your conditional logic potentially in a separate function rather than inlined in the codegen. I added a config to control this behavior in case that is a real concern to anyone, but I'd be happy to just remove the config.

I also updated some of the existing logic for common expressions in coalesce and when that are actually better handled by the new logic, since you are only guaranteed to have the first value of a Coalesce evaluated, as well as the first conditional of a CaseWhen expression.
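A rough way to picture which children are guaranteed to run: the sketch below is a simplified Python model, not the Spark implementation. The tuple encoding and the function name `split_children` are made up for illustration; only the rule itself (first child of a Coalesce, first condition of a CaseWhen) comes from the description above.

```python
def split_children(expr):
    """Return (always_evaluated, conditionally_evaluated) children.

    Expressions are tuples (hypothetical encoding):
      ("coalesce", c1, c2, ...)
      ("casewhen", ((cond, value), ...), else_value_or_None)
      ("op", child, ...)  for anything else, whose children always run.
    """
    op = expr[0]
    if op == "coalesce":
        children = list(expr[1:])
        # Only the first child is guaranteed to be evaluated.
        return children[:1], children[1:]
    if op == "casewhen":
        branches, else_value = expr[1], expr[2]
        always = [branches[0][0]]       # only the first condition
        conditional = [branches[0][1]]  # first value runs only if it matches
        for cond, value in branches[1:]:
            conditional += [cond, value]
        if else_value is not None:
            conditional.append(else_value)
        return always, conditional
    return list(expr[1:]), []


print(split_children(("coalesce", "a", "b", "c")))
print(split_children(("casewhen", (("c1", "v1"), ("c2", "v2")), "e")))
```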

Why are the changes needed?

To increase the performance of conditional expressions.

Does this PR introduce any user-facing change?

No, just performance improvements.

How was this patch tested?

New and updated UT.

cloud-fan · 2021-06-22T13:50:23Z

OK to test

SparkQA · 2021-06-22T15:10:50Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44671/

SparkQA · 2021-06-22T15:19:46Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44671/

viirya · 2021-06-22T16:28:21Z

The only potential downside is creating extra subexpressions, and therefore function calls, more than necessary. This should only be an issue for certain edge cases where your conditional overwhelmingly evaluates to false. And then the only overhead is running your conditional logic potentially in a separate function rather than inlined in the codegen. I added a config to control this behavior if that is actually a real concern to anyone, but I'd be happy to just remove the config.

I don't think the downside is an edge case. On the contrary, I think it is a more common case than the use case proposed here.

After this, any common expression shared between a conditionally evaluated expression and a normal expression will become a subexpression. I have a concern that the generated code will be overwhelmed with such subexpressions.

At least we need a config for this and I don't think it should be enabled by default.

Kimahriman · 2021-06-22T16:58:52Z

After this, any common expression shared between a conditionally evaluated expression and a normal expression will become a subexpression. I have a concern that the generated code will be overwhelmed with such subexpressions.

What exactly is the overwhelming part? I figured smaller overall code size would be beneficial.

viirya · 2021-06-22T17:11:13Z

What exactly is the overwhelming part? I figured smaller overall code size would be beneficial.

It is not zero-cost. For example, too many subexpressions may turn a non-split case into a split case.

Kimahriman · 2021-06-22T17:16:59Z

Could you elaborate on how that could happen? I don't know that much about the codegen process

viirya · 2021-06-22T17:25:50Z

Could you elaborate on how that could happen? I don't know that much about the codegen process

In short, during subexpression codegen, if the total code length exceeds a threshold, we split the subexpressions into functions to avoid reaching the maximum size of a method.
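As a toy illustration of that decision (not Spark's actual codegen), the sketch below keeps subexpression code inline while the total length stays under a threshold and otherwise wraps each snippet in its own function. The function name, the generated-code strings, and the `threshold` parameter are all placeholders.

```python
def emit_subexpressions(snippets, threshold):
    """Return (mode, code) for a list of generated code snippets."""
    total = sum(len(s) for s in snippets)
    if total <= threshold:
        # Non-split case: everything stays inline in the calling method.
        return "inline", "\n".join(snippets)
    # Split case: each subexpression becomes a separate function and the
    # caller only contains the invocations.
    functions = [
        f"private void subExpr_{i}() {{\n{s}\n}}" for i, s in enumerate(snippets)
    ]
    calls = [f"subExpr_{i}();" for i in range(len(snippets))]
    return "split", "\n".join(functions + calls)


existing = ["int a = x + y;", "int b = a * 2;"]
print(emit_subexpressions(existing, threshold=100)[0])
# One additional, larger subexpression pushes the total past the threshold,
# so all subexpressions, including the existing ones, are now split out.
print(emit_subexpressions(existing + ["z" * 90], threshold=100)[0])
```

The second call shows the concern raised here: adding a single subexpression can change how every other subexpression is emitted.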

Kimahriman · 2021-06-22T17:27:57Z

Oh you're specifically talking about the subexpressions being split into functions versus inlined, not the general splitting the whole codegen into functions?

SparkQA · 2021-06-22T18:52:53Z

Test build #140145 has finished for PR 32987 at commit d4d64c7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Kimahriman · 2021-06-24T15:09:34Z

My main assumption in creating this was that it's always faster to run an expression once in a function than twice inlined. If this creates a lot of extra subexpressions that push the code over the 1kb threshold for breaking into functions, then the alternative is that you are running a lot of duplicate inlined logic, so at the end of the day it all comes down to how often a subexpression created by this logic is only evaluated once.

The two extremes of performance impact I can think of would be:

Worst case: Without this logic, you have subexpressions that are just small enough to remain inlined. You add one conditional that creates a new subexpression that pushes your code over the (default) 1kb limit. That conditional never evaluates to true, so your conditional subexpression is evaluated once in a function rather than inlined, and all your other subexpressions are evaluated with a function call instead of inlined as well. This is somewhat bound by the number of subexpressions that can be fit inline in the first place, plus the function calls of the one-time evaluated conditional subexpressions.
Best case: Your existing subexpressions have already been broken out into functions before this change, or the new subexpression fits inline as well, and the conditional always evaluates to true, so you are running the conditional expression once instead of two or more times. This is essentially the existing logic where we create a subexpression for things that are always evaluated at least twice, so obviously a win here.

Realistically things are going to fall somewhere in the middle. Where the extra function calls outweigh the deduped expression execution, who knows. But the upside here is pretty large, and I would expect most Spark users would expect this to logically happen (don't run the same code twice). If we want to leave it with the setting defaulted to disabled I'm fine with that.
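The trade-off can be written down as a back-of-the-envelope cost model. This is an assumption-laden sketch rather than a measurement: `expr_cost` and `call_overhead` are abstract per-row costs in arbitrary units, `p_true` is the fraction of rows for which the conditional branch runs, and the model ignores JIT effects and the inline-versus-split interaction described in the worst case.

```python
def cost_without_subexpr(expr_cost, p_true):
    # Evaluated once unconditionally, plus once more when the branch runs.
    return expr_cost * (1 + p_true)


def cost_with_subexpr(expr_cost, call_overhead):
    # Evaluated exactly once, possibly behind a function call.
    return expr_cost + call_overhead


def break_even_p(expr_cost, call_overhead):
    # Setting the two costs equal gives p_true = call_overhead / expr_cost.
    return call_overhead / expr_cost


# Hypothetical numbers: an expression costing 100 units and a call costing 5.
print(cost_without_subexpr(100, 0.5))
print(cost_with_subexpr(100, 5))
print(break_even_p(100, 5))
```

Under these assumptions the subexpression pays off whenever the branch runs more often than `call_overhead / expr_cost`, which is a small fraction for any expression that is expensive relative to a function call.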

cloud-fan · 2021-06-30T13:31:28Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

+
+  // Finds expressions that are conditionally evaluated, so that if they are definitely evaluated
+  // elsewhere, we can create a subexpression to optimize the conditional case.
+  private def conditionallyEvaluatedChildren(expr: Expression): Seq[Expression] = expr match {


I feel it's a bit overcomplicated: now we have children, commonChildren, conditionallyEvaluatedChildren.

Yeah it's just all different cases that need to be handled. I can think about how to simplify or if #33142 would help simplify

i.e. useCount and conditionalUseCount instead of separate map and all that or something, idk

Let me know if it's less overcomplicated now...

Kimahriman · 2021-07-07T11:37:55Z

Updated based on the refactor. It's still a little rough and needs some cleaning, renaming things, and updating a lot of comments, but wanted to get initial feedback

SparkQA · 2021-07-07T12:51:21Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45267/

SparkQA · 2021-07-07T13:24:18Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45267/

SparkQA · 2021-07-07T16:19:59Z

Test build #140756 has finished for PR 32987 at commit 1111b9b.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ExpressionStats(expr: Expression)(
case class RecurseChildren(

SparkQA · 2021-07-22T13:05:31Z

Test build #141494 has started for PR 32987 at commit d956d22.

SparkQA · 2021-07-22T13:09:32Z

Test build #141497 has started for PR 32987 at commit 5d6e1ad.

SparkQA · 2021-07-22T13:21:24Z

Test build #141498 has started for PR 32987 at commit 80245a6.

Kimahriman · 2021-07-22T13:24:00Z

...src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala

The not-first conditions are now handled as conditionals instead. This supports all the same existing behavior but can additionally create subexpressions for things that appear in only one of the remaining conditions instead of all of them. For example, in CaseWhen((a + b) / (c + d) > 1, 1, a + b > 1, 2, c + d > 1, 3), a + b and c + d will become subexpressions now where they wouldn't previously, though only with this config enabled if we need to keep the config.
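To make the CaseWhen example concrete, here is a small Python model of the counting idea. It is not the code in this PR: the tuple encoding and the helpers `collect` and `candidates` are hypothetical, and the eligibility rule (at least two unconditional uses, or at least one unconditional use plus one conditional use) is taken from the PR description.

```python
from collections import Counter


def collect(expr, conditional, always, maybe):
    """Count non-leaf expressions by whether they are always or conditionally evaluated."""
    if not isinstance(expr, tuple):
        return  # leaves (column names, literals) are not worth sharing
    if expr[0] == "casewhen":
        for i, (cond, value) in enumerate(expr[1]):
            # Only the first condition is guaranteed to run.
            collect(cond, conditional or i > 0, always, maybe)
            collect(value, True, always, maybe)
        return
    (maybe if conditional else always)[expr] += 1
    for child in expr[1:]:
        collect(child, conditional, always, maybe)


def candidates(expr):
    always, maybe = Counter(), Counter()
    collect(expr, False, always, maybe)
    return {
        e for e in set(always) | set(maybe)
        if always[e] >= 2 or (always[e] >= 1 and maybe[e] >= 1)
    }


a_plus_b = ("+", "a", "b")
c_plus_d = ("+", "c", "d")
case_when = ("casewhen", (
    ((">", ("/", a_plus_b, c_plus_d), 1), 1),
    ((">", a_plus_b, 1), 2),
    ((">", c_plus_d, 1), 3),
))
print(candidates(case_when))
```

In this model `a + b` and `c + d` each have one guaranteed use (inside the first condition) and one conditional use (in a later condition), so both qualify.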

Kimahriman · 2021-07-22T13:24:19Z

...src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala

Same as the CaseWhen above

Kimahriman · 2021-07-22T13:31:46Z

#33142 (comment) in other cases it's already accepted that the performance overhead of maybe only using a subexpression once is worth the trade-off of not having to potentially evaluate it twice, so this just expands the places that could happen. Personally I don't think it needs a config defaulting to turned off, but I'm fine leaving it in if necessary. It does effectively prevent all the existing cases of creating a subexpression for an expression that might only be evaluated once, like mentioned in the comment, if the config is turned off.

SparkQA · 2021-07-22T14:40:17Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46014/

cloud-fan · 2025-02-10T03:06:48Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

   */
-  private def updateCommonExprs(
-      exprs: Seq[Expression],
-      map: mutable.HashMap[ExpressionEquals, ExpressionStats],


shall we update the doc of this method? no equivalenceMap in this method now.

Yep good call, haven't kept up with some of the docs

cloud-fan · 2025-02-10T03:09:01Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

+ */
+case class RecurseChildren(
+    alwaysChildren: Seq[Expression],
+    commonChildren: Seq[Seq[Expression]] = Nil,


do we have an example this commonChildren?

Updated the docs a little bit to clarify. Currently it's only If and CaseWhen expressions that commonChildren applies to, should I put one of those as an example in the doc?
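For readers wondering the same thing, here is one hypothetical illustration in Python of what a commonChildren group could mean for an If: expressions that occur in every branch of a group are evaluated no matter which branch is taken. The tuple encoding and helper names are invented for this sketch; it mirrors the idea only, not the RecurseChildren class itself.

```python
def subexpressions(expr):
    """All non-leaf subexpressions of a tuple-encoded expression, including itself."""
    if not isinstance(expr, tuple):
        return set()
    found = {expr}
    for child in expr[1:]:
        found |= subexpressions(child)
    return found


def common_to_all(branches):
    """Subexpressions present in every branch of one commonChildren-style group."""
    sets = [subexpressions(b) for b in branches]
    return set.intersection(*sets) if sets else set()


# Models: if (flag) (a + b) * 2 else (a + b) - 1
a_plus_b = ("+", "a", "b")
true_value = ("*", a_plus_b, 2)
false_value = ("-", a_plus_b, 1)
print(common_to_all([true_value, false_value]))  # {('+', 'a', 'b')}
```

Here `a + b` appears in both the true and the false value, so it is evaluated whichever branch runs and can be counted as an unconditional use.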

github-actions bot added the SQL label Jun 20, 2021

Kimahriman mentioned this pull request Jun 22, 2021

[SPARK-35688][SQL]Subexpressions should be lazy evaluation in GeneratePredicate #32977

Closed

Kimahriman mentioned this pull request Jun 30, 2021

[SPARK-33337][SQL] Support subexpression elimination in branches of conditional expressions #30245

Closed

cloud-fan reviewed Jun 30, 2021

Kimahriman force-pushed the conditional-subexpr-elim branch from d4d64c7 to 1111b9b Compare July 7, 2021 11:36

Kimahriman force-pushed the conditional-subexpr-elim branch 2 times, most recently from d956d22 to 5d6e1ad Compare July 22, 2021 13:04

Kimahriman force-pushed the conditional-subexpr-elim branch from 5d6e1ad to 80245a6 Compare July 22, 2021 13:19

peter-toth mentioned this pull request Jun 21, 2023

[SPARK-42551][SQL] Support more subexpression elimination cases #41119

Closed

Kimahriman force-pushed the conditional-subexpr-elim branch from 3d415ec to be9e9b2 Compare June 21, 2023 11:13

peter-toth mentioned this pull request Jun 21, 2023

[SPARK-35564][SQL] Improve subexpression elimination #41677

Closed

peter-toth pushed a commit to peter-toth/spark that referenced this pull request Jun 21, 2023

tests cherry-picked from Kimahriman's apache#32987

fc38bf6

Kimahriman force-pushed the conditional-subexpr-elim branch from be9e9b2 to 90796d5 Compare August 13, 2023 12:56

Kimahriman force-pushed the conditional-subexpr-elim branch 2 times, most recently from ba70e61 to 2195945 Compare October 4, 2023 11:42

Kimahriman force-pushed the conditional-subexpr-elim branch 2 times, most recently from d3b4716 to 36efb6a Compare January 1, 2024 17:25

Kimahriman force-pushed the conditional-subexpr-elim branch from 36efb6a to aff4565 Compare January 21, 2024 00:01

Kimahriman force-pushed the conditional-subexpr-elim branch from aff4565 to a839d50 Compare March 16, 2024 15:22

Kimahriman force-pushed the conditional-subexpr-elim branch from a839d50 to 19b7846 Compare May 15, 2024 11:25

Kimahriman force-pushed the conditional-subexpr-elim branch from 19b7846 to 51a3902 Compare August 16, 2024 11:43

Kimahriman force-pushed the conditional-subexpr-elim branch 3 times, most recently from 48f5c82 to 299957e Compare October 2, 2024 15:15

Kimahriman force-pushed the conditional-subexpr-elim branch from 299957e to b367368 Compare November 25, 2024 12:28

Kimahriman force-pushed the conditional-subexpr-elim branch from b367368 to 24d6c9f Compare February 8, 2025 13:55

cloud-fan reviewed Feb 10, 2025

Kimahriman force-pushed the conditional-subexpr-elim branch from a749e20 to 065b802 Compare March 17, 2025 19:37

Kimahriman force-pushed the conditional-subexpr-elim branch from 065b802 to ec431b7 Compare May 5, 2025 14:31

Kimahriman force-pushed the conditional-subexpr-elim branch from ec431b7 to a38de68 Compare June 24, 2025 13:54

Kimahriman force-pushed the conditional-subexpr-elim branch 2 times, most recently from 90bfc7f to 3319f8d Compare August 15, 2025 11:20

Kimahriman force-pushed the conditional-subexpr-elim branch from 3319f8d to 3d8d7ce Compare November 4, 2025 21:42

Kimahriman force-pushed the conditional-subexpr-elim branch from 3d8d7ce to a086dd5 Compare November 25, 2025 21:04

Track conditionally evaluated expressions to resolve as subexpressions for cases they are already being evaluated

ffb3e92

Kimahriman force-pushed the conditional-subexpr-elim branch from a086dd5 to ffb3e92 Compare November 28, 2025 18:56
