[SPARK-13981][SQL] Defer evaluating variables within Filter operator. #11792
Generated code before: with this patch
Test build #53463 has finished for PR 11792 at commit
output will have different nullability than child.output, so we should use child.output here.
Never mind, it's a nice trick to use output here. Since we already generate the code for BoundReference, its nullability does not matter, but using output helps to simplify the code of the expressions.
Could you add a comment for that?
Test build #53715 has finished for PR 11792 at commit
It's better to use notNullAttributes ++ otherPreds to make sure that IsNotNull always comes first.
That defeats the point of this. Then we're referencing all the attributes first before the predicates.
I cannot find where this assumption (IsNotNull comes first) is guaranteed; if it is not, we could generate wrong results.
Also, the conjuncts are in a sequence like this:
IsNotNull(A), IsNotNull(B), A > xx, B > xxx
We should reorder them explicitly to make sure they are in this order:
IsNotNull(A), A > xx, IsNotNull(B), B > xxx
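The ordering requested above can be sketched in plain Java. This is an illustration only, not the code in the patch: `Pred`, `interleave`, and `demo` are hypothetical names, and each predicate is assumed to reference a single attribute. Each IsNotNull check is emitted at most once, immediately before the first other predicate that uses the same attribute.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ConjunctOrdering {
    /** Hypothetical single-attribute predicate (not a Spark class). */
    static final class Pred {
        final String text;       // display form, e.g. "A > xx"
        final String attr;       // the one attribute this predicate references
        final boolean isNotNull; // true for IsNotNull(attr)

        Pred(String text, String attr, boolean isNotNull) {
            this.text = text;
            this.attr = attr;
            this.isNotNull = isNotNull;
        }
    }

    /** Puts each IsNotNull(attr) right before the first other predicate on attr. */
    static List<String> interleave(List<Pred> conjuncts) {
        List<Pred> nullChecks = new ArrayList<>();
        List<Pred> others = new ArrayList<>();
        for (Pred p : conjuncts) {
            (p.isNotNull ? nullChecks : others).add(p);
        }

        // Tracks which null checks were already emitted, so none is emitted twice.
        boolean[] emitted = new boolean[nullChecks.size()];
        List<String> ordered = new ArrayList<>();
        for (Pred p : others) {
            for (int i = 0; i < nullChecks.size(); i++) {
                if (!emitted[i] && nullChecks.get(i).attr.equals(p.attr)) {
                    emitted[i] = true;
                    ordered.add(nullChecks.get(i).text);
                }
            }
            ordered.add(p.text);
        }
        // Null checks on attributes that no other predicate uses must still run.
        for (int i = 0; i < nullChecks.size(); i++) {
            if (!emitted[i]) {
                ordered.add(nullChecks.get(i).text);
            }
        }
        return ordered;
    }

    /** Applies interleave to the conjunct sequence from the comment above. */
    static List<String> demo() {
        return interleave(Arrays.asList(
            new Pred("IsNotNull(A)", "A", true),
            new Pred("IsNotNull(B)", "B", true),
            new Pred("A > xx", "A", false),
            new Pred("B > xxx", "B", false)));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With this ordering, a row rejected by `A > xx` never touches attribute B at all, which is the point of deferring evaluation.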
Test build #54107 has finished for PR 11792 at commit
@davies I did this a different way. Let me know your thoughts.
… and NULL improvements.
This improves the Filter codegen to optimize IsNotNull filters which are common. This patch defers loading attributes as late as possible within the filter operator. This takes advantage of short-circuiting. Instead of generating code like:
boolean isNull = ...
int value = ...
boolean isNull2 = ...
int value2 = ...
if (isNull) continue;
we will generate:
boolean isNull = ...
int value = ...
if (isNull) continue;
int value2 = ...
if (isNull) continue;
On tpcds q55, this fixes the regression from introducing the IsNotNull predicates.
TPCDS Snappy:     Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)
--------------------------------------------------------------------------------
q55               4564 / 5036          25.2         39.6
q55               4064 / 4340          28.3         35.3
Test build #54228 has finished for PR 11792 at commit
val idx = notNullPreds.indexWhere { n => n.asInstanceOf[IsNotNull].child.semanticEquals(r)}
if (idx != -1 && !generatedIsNotNullChecks(idx)) {
  // Use the child's output. The nullability is what the child produced.
  val code = genPredicate(notNullPreds(idx), input, child.output)
generatedIsNotNullChecks(idx) = true
genPredicate(notNullPreds(idx), input, child.output)
LGTM, just two minor comments.
Test build #54342 has finished for PR 11792 at commit
Test build #2702 has finished for PR 11792 at commit
Test build #2706 has finished for PR 11792 at commit
Thanks - merging in master.
What changes were proposed in this pull request?
This improves the Filter codegen for NULLs by deferring loading the values for IsNotNull.
Instead of generating code like:
boolean isNull = ...
int value = ...
if (isNull) continue;
we will generate:
boolean isNull = ...
if (isNull) continue;
int value = ...
This is useful since retrieving the values can be non-trivial (they can be dictionary encoded
among other things). This currently only works when the attribute comes from the column batch
but could be extended to other cases in the future.
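To make the benefit concrete, here is a minimal Java sketch of the two code shapes. It is not Spark's generated code: `Column`, `countEager`, `countDeferred`, and `sample` are hypothetical names, and the column is assumed to be a dictionary-encoded int column with a separate null bitmap. The `decodes` counter stands in for the cost of retrieving a value.

```java
class DeferredLoad {
    /** Hypothetical dictionary-encoded int column with a null bitmap. */
    static final class Column {
        final boolean[] isNull;
        final int[] ids;        // per-row index into the dictionary
        final int[] dictionary;
        int decodes = 0;        // number of value lookups performed

        Column(boolean[] isNull, int[] ids, int[] dictionary) {
            this.isNull = isNull;
            this.ids = ids;
            this.dictionary = dictionary;
        }

        boolean isNullAt(int row) {
            return isNull[row];
        }

        int getInt(int row) {
            decodes++;
            return dictionary[ids[row]];
        }
    }

    /** Old shape: the value is loaded before the null check. */
    static int countEager(Column c, int threshold) {
        int matches = 0;
        for (int row = 0; row < c.isNull.length; row++) {
            boolean isNull = c.isNullAt(row);
            int value = c.getInt(row);
            if (isNull) continue;
            if (value > threshold) matches++;
        }
        return matches;
    }

    /** New shape: the null check runs first, so null rows are never decoded. */
    static int countDeferred(Column c, int threshold) {
        int matches = 0;
        for (int row = 0; row < c.isNull.length; row++) {
            boolean isNull = c.isNullAt(row);
            if (isNull) continue;
            int value = c.getInt(row);
            if (value > threshold) matches++;
        }
        return matches;
    }

    /** Four rows, two of them null; made-up data for illustration. */
    static Column sample() {
        return new Column(
            new boolean[] {false, true, false, true},
            new int[] {0, 0, 1, 0},
            new int[] {5, 20});
    }

    public static void main(String[] args) {
        Column eager = sample();
        Column deferred = sample();
        System.out.println("eager matches=" + countEager(eager, 10) + " decodes=" + eager.decodes);
        System.out.println("deferred matches=" + countDeferred(deferred, 10) + " decodes=" + deferred.decodes);
    }
}
```

Both loops return the same match count; only the number of dictionary lookups differs, and the gap grows with the fraction of null rows.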
How was this patch tested?
On tpcds q55, this fixes the regression from introducing the IsNotNull predicates.