-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions #32987
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
OK to test |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
I don't think the downside is edge case. On the contrary, I rather think it is common case than the use-case proposed here. After this, any common expression shared between conditionally evaluated expression and a normal expression will be subexpression. I have a concern that gen-ed code will be overwhelmed with such subexpressions. At least we need a config for this and I don't think it should be enabled by default. |
What exactly is the overwhelming part? I figured smaller overall code size would be beneficial. |
It is not zero-cost. For example, too many subexpressions will possibly make non-split case to be split case. |
|
Could you elaborate on how that could happen? I don't know that much about the codegen process |
In short, during subexpressions codegen, if the total code length is more than a threshold, we choose to split it as functions to avoid reach the max size of a method. |
|
Oh you're specifically talking about the subexpressions being split into functions versus inlined, not the general splitting the whole codegen into functions? |
|
Test build #140145 has finished for PR 32987 at commit
|
|
My main assumption in creating this was that it's always faster to run an expression once in a function than twice inlined. If this creates a lot of extra subexpressions that pushes the code over the 1kb threshold for breaking into functions, then the alternative is that you are running a lot of duplicate inlined logic, so at the end of the day it all comes down to how often a subexpression created by this logic is only evaluated once. The two extremes of performance impact I can think of would be:
Realistically things are going to fall somewhere in the middle. Where the extra function calls outweigh the deduped expression execution, who knows. But the upside here is pretty large, and I would expect most Spark users would expect this to logically happen (don't run the same code twice). If we want to leave it with the setting defaulted to disabled I'm fine with that. |
|
|
||
| // Finds expressions that are conditionally evaluated, so that if they are definitely evaluated | ||
| // elsewhere, we can create a subexpression to optimize the conditional case. | ||
| private def conditionallyEvaluatedChildren(expr: Expression): Seq[Expression] = expr match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel it's a bit more overcomplicated: now we have childen, commonChildren, conditionallyEvaluatedChildren.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it's just all different cases that need to be handled. I can think about how to simplify or if #33142 would help simplify
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i.e. useCount and conditionalUseCount instead of separate map and all that or something, idk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if it's less overcomplicated now...
d4d64c7 to
1111b9b
Compare
|
Updated based on the refactor. It's still a little rough and needs some cleaning, renaming things, and updating a lot of comments, but wanted to get initial feedback |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #140756 has finished for PR 32987 at commit
|
d956d22 to
5d6e1ad
Compare
|
Test build #141494 has started for PR 32987 at commit |
|
Test build #141497 has started for PR 32987 at commit |
5d6e1ad to
80245a6
Compare
|
Test build #141498 has started for PR 32987 at commit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The not-first conditions are now handled as a conditional instead. Supports all the same existing behavior but additionally can create subexpressions for things only in one of the remaining conditions instead of all. For example, CaseWhen((a + b) / (c + d) > 1, 1, a + b > 1, 2, c + d > 1, 3), a + b and c + d will become subexpressions now where they wouldn't previously, though only with this config enabled if we need to keep the config
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as the CaseWhen above
|
#33142 (comment) in other cases it's already accepted that the performance overhead of maybe only using a subexpression once is worth the trade-off of not having to potentially evaluate it twice, so this just expands the places that could happen. Personally I don't think it needs a config defaulting to turned off, but I'm fine leaving it in if necessary. It does effectively prevent all the existing cases of creating a subexpression for an expression that might only be evaluated once, like mentioned in the comment, if the config is turned off. |
|
Kubernetes integration test starting |
3d415ec to
be9e9b2
Compare
be9e9b2 to
90796d5
Compare
ba70e61 to
2195945
Compare
d3b4716 to
36efb6a
Compare
36efb6a to
aff4565
Compare
aff4565 to
a839d50
Compare
a839d50 to
19b7846
Compare
19b7846 to
51a3902
Compare
48f5c82 to
299957e
Compare
299957e to
b367368
Compare
b367368 to
24d6c9f
Compare
...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala
Outdated
Show resolved
Hide resolved
| */ | ||
| private def updateCommonExprs( | ||
| exprs: Seq[Expression], | ||
| map: mutable.HashMap[ExpressionEquals, ExpressionStats], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we update the doc of this method? no equivalenceMap in this method now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep good call, haven't kept up with with some of the docs
| */ | ||
| case class RecurseChildren( | ||
| alwaysChildren: Seq[Expression], | ||
| commonChildren: Seq[Seq[Expression]] = Nil, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have an example this commonChildren?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the docs a little bit to clarify. Currently it's only If and CaseWhen expressions that commonChildren applies too, should I put one of those as an example in the doc?
a749e20 to
065b802
Compare
065b802 to
ec431b7
Compare
ec431b7 to
a38de68
Compare
90bfc7f to
3319f8d
Compare
3319f8d to
3d8d7ce
Compare
3d8d7ce to
a086dd5
Compare
…s for cases they are already being evaluated
a086dd5 to
ffb3e92
Compare
What changes were proposed in this pull request?
I am proposing to add support for conditionally evaluated expressions during subexpression elimination. Currently, only expressions that will definitely be always at least twice are candidates for subexpression elimination. This PR updates that logic so that expressions that are always evaluated at least once and conditionally evaluated at least once are also candidates for subexpression elimination. This helps optimize a common case during data normalization and cleaning and want to null out values that don't match a certain pattern, where you have something like:
or
In these cases,
transformedwill always be fully calculated twice, because it might only be needed once. I am proposing creating a subexpression fortransformedin this case.In practice I've seen a decrease in runtime and codegen size of 10-30% in our production pipelines that heavily make use of this type of logic.
The only potential downside is creating extra subexpressions, and therefore function calls, more than necessary. This should only be an issue for certain edge cases where your conditional overwhelming evaluates to false. And then the only overhead is running your conditional logic potentially in a separate function rather than inlined in the codegen. I added a config to control this behavior if that is actually a real concern to anyone, but I'd be happy to just remove the config.
I also updated some of the existing logic for common expressions in coalesce and when that are actually better handled by the new logic, since you are only guaranteed to have the first value of a Coalesce evaluated, as well as the first conditional of a CaseWhen expression.
Why are the changes needed?
To increase the performance of conditional expressions.
Does this PR introduce any user-facing change?
No, just performance improvements.
How was this patch tested?
New and updated UT.