-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-42551][SQL] Support more subexpression elimination cases #41119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Do not reuse the expression if CodegenContext changed.
|
Hi, @cloud-fan @ulysses-you @wangyum could you help to review this PR ? Thanks |
|
cc @rednaxelafx |
|
and @peter-toth |
|
Hi, @rednaxelafx @peter-toth could you help to review this PR ? Thanks |
Hi @wankunde, thanks for pinging me. I can take a look at this PR sometime next week or the week after... |
|
I think it would be easier to compare the difference of the code generated by ProjectExec and FilterExec before reviewing this PR. |
|
Sorry, I still haven't got time to review the PR thoroughly. (Maybe next week...) IMO it is a good idea to lazily evaluate expressions and cache the results runtime in case of any expresions that might be used more than once. In some cases it could achieve much better performance than any kind of static analysis we do currently... |
|
Add a microbenchmark to evaluate the overhead of an additional function call. First benchmark for Project override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
spark.range(1, 20000000, 1, 1)
.selectExpr(
"cast(id + 1 as decimal) as a",
"cast(id + 2 as decimal) as b",
"cast(id + 3 as decimal) as c",
"cast(id + 4 as decimal) as d")
.createOrReplaceTempView("tab")
runBenchmark("Subexpression elimination in ProjectExec") {
val benchmark =
new Benchmark("Subexpression elimination in ProjectExec", 20000000, output = output)
for (expr <- Seq("a * b", "a / b", "a * b / c")) {
benchmark.addCase(s"Test $expr expr") { _ =>
val query =
s"""
|SELECT a, b, c, d, CASE WHEN $expr > 0 THEN 0 WHEN $expr > -1 THEN -1 END AS e
|FROM tab
|""".stripMargin
spark.sql(query).noop()
}
}
benchmark.run()
}
}Benchmark results are almost the same Second benchmark for Filter. override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
spark.range(1, 20000000, 1, 1)
.selectExpr(
"cast(id + 1 as decimal) as a",
"cast(id + 2 as decimal) as b",
"cast(id + 3 as decimal) as c",
"cast(id + 4 as decimal) as d")
.createOrReplaceTempView("tab")
runBenchmark("Subexpression elimination in FilterExec") {
val benchmark =
new Benchmark("Subexpression elimination in FilterExec", 20000000, output = output)
for (expr <- Seq("a * b", "a / b", "a * b / c")) {
benchmark.addCase(s"Test $expr expr") { _ =>
val query =
s"""
|SELECT a, b, c, d
|FROM tab
|WHERE $expr < 0 AND $expr < 1
|""".stripMargin
spark.sql(query).noop()
}
}
benchmark.run()
}
} |
|
@wankunde, I didn't foget this PR just realized that we can improve the current subexpression elimination logic in terms of recognizing the surely evaluated subexpressions and keeping track of conditionally evaluated ones. IMO if surely evaluated count is >= 2 or surely evaluated count = 1 but there is also a certain probability of evaluating the subexpression conditionally then we can include the subexpression for elimination. (This is not a new idea, basically @cloud-fan already mentioned this in #32987 (comment)). So I opened a PR for tracking the expected count of sure and conditional evaluations here: #41677. (Need to add a few more tests so it is just WIP now...) If that PR gets accepted we can revisit those subexpressions that have only conditional evaluations at 2+ places and apply your lazy construct idea proposed in this PR. |
|
Hi, @peter-toth , any update about the expression elimination work ? |
|
Yeah, unfortunately #41677 got a bit stuck. I will try to revive it this week. |
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
How to support more subexpressions elimination cases
Why are the changes needed?
Support more subexpression elimination cases, improve performance.
For example, TPCDS q23b query, we can reuse the result of projection
sum(ss_quantity * ss_sales_price)in the if expressions:q4, q62, q67 are similar to the above.
One of our production query which has 19 case when expressions, it's query time changed from 1.1 hour to 42 seconds.
A simplify benchmark of the above production query.
Local benchmark result:
Before this PR:
After this PR:
Does this PR introduce any user-facing change?
No
How was this patch tested?
Exists UT.