[SPARK-21707][SQL] Improve a special case for non-deterministic filters in optimizer #18918
Conversation
Force-pushed 4216b6c to b79b9af
ok to test

How about data source tables?

Test build #80545 has finished for PR 18918 at commit

@gatorsmile

This is the plan of a Hive serde table. The fix should not be done in the optimizer. We should fix it in the place that causes the issue.

@gatorsmile the problem can be solved, but I'm not sure whether it will cause other problems. Do you have any suggestions? This approach is also applicable to #18892. Thanks.
Why can't we remove the Project if the condition is not deterministic?
When we split a Project, we prune it here again.
I don't get it from your explanation. If I understand correctly, when there is a Project that selects a subset of the output from the LeafNode, and we remove it with the pattern below, we will retrieve all fields. Is that your purpose?
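The concern above can be modeled with a tiny plan tree. The following Python sketch is purely illustrative (the classes `Relation`, `Project`, `Filter` and the rule `drop_project` are hypothetical stand-ins, not Spark's Catalyst classes); it shows why naively dropping the Project under a non-deterministic Filter widens the scan to every field:

```python
# Miniature model of a logical plan: a Filter over a Project over a leaf relation.
from dataclasses import dataclass


@dataclass
class Relation:
    output: list  # columns the scan produces


@dataclass
class Project:
    output: list  # columns this node keeps
    child: object


@dataclass
class Filter:
    deterministic: bool
    child: object

    @property
    def output(self):
        # A Filter passes its child's output through unchanged.
        return self.child.output


def drop_project(plan):
    """Naive rule: remove a Project sitting directly above a leaf relation."""
    if (isinstance(plan, Filter) and isinstance(plan.child, Project)
            and isinstance(plan.child.child, Relation)):
        return Filter(plan.deterministic, plan.child.child)
    return plan


plan = Filter(False, Project(["a"], Relation(["a", "b", "c"])))
rewritten = drop_project(plan)
print(rewritten.output)  # ['a', 'b', 'c'] -- the scan now reads every column
```

With the Project removed, the relation's full output flows up through the Filter, which is exactly the over-reading behavior the comment questions.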
Actually I don't get what the test title tries to say. Can you try to rephrase it?
Force-pushed b79b9af to bf81c45
Test build #80628 has finished for PR 18918 at commit

Yes. We should fix it in
Force-pushed 9f73949 to 4daec54
@gatorsmile

Test build #80670 has finished for PR 18918 at commit
Force-pushed 4daec54 to 82b82af
Test build #80681 has finished for PR 18918 at commit
Force-pushed 82b82af to df7ecaa
Test build #80686 has finished for PR 18918 at commit
Force-pushed df7ecaa to 97a3270
Test build #80693 has finished for PR 18918 at commit
Force-pushed 3c73556 to 471d81c
Force-pushed 471d81c to 20fc87a
cc @cloud-fan @gatorsmile @viirya Could you take a look?
```scala
val p = path.getAbsolutePath
Seq(1 -> "a").toDF("a", "b").write.partitionBy("a").parquet(p)
val df = spark.read.parquet(p)
checkAnswer(df.filter(rand(10) <= 1.0).select($"a"), Row(1))
```
this test can pass on current master.
what exactly are you proposing?

Can one of the admins verify this patch?

gentle ping @heary-cao
Closes apache#17422 Closes apache#17619 Closes apache#18034 Closes apache#18229 Closes apache#18268 Closes apache#17973 Closes apache#18125 Closes apache#18918 Closes apache#19274 Closes apache#19456 Closes apache#19510 Closes apache#19420 Closes apache#20090 Closes apache#20177 Closes apache#20304 Closes apache#20319 Closes apache#20543 Closes apache#20437 Closes apache#21261 Closes apache#21726 Closes apache#14653 Closes apache#13143 Closes apache#17894 Closes apache#19758 Closes apache#12951 Closes apache#17092 Closes apache#21240 Closes apache#16910 Closes apache#12904 Closes apache#21731 Closes apache#21095 Added: Closes apache#19233 Closes apache#20100 Closes apache#21453 Closes apache#21455 Closes apache#18477 Added: Closes apache#21812 Closes apache#21787 Author: hyukjinkwon <[email protected]> Closes apache#21781 from HyukjinKwon/closing-prs.
What changes were proposed in this pull request?
Currently, the optimizer does a lot of special handling for non-deterministic projects and filters, but it is not good enough. This patch adds a new special case for non-deterministic filters.

In my spark-shell, I executed the following SQL statement:
Before the change, the executed plan's FileScanRDD read userdata:

[0,0,0,4,0,3,0,6,4,2,b800000001,c000000001,c800000001,d000000001,d800000001,e000000001,0,0,0,4010000000000000,0,4008000000000000,0,30,30,30,34,30,33]

After the change, the executed plan's FileScanRDD read userdata:

[0,2,0]

So with this PR, we only read the needed fields.
In addition, in our real cluster environment, HiveTableScans also scan more columns than needed according to the execution plan. HadoopRDD likewise reads extra userdata:

{62340760016026144, 254850, 0, 64F00053E382D3AB, 3, , , null, 550667202, -78, -7.0, 6373, 152963, 114.13232277, 32.16357801, 2, 26, -116.657997, 21, 27, 15, 0.021978, -3, 3, -270543.0, 77187.0, 5041, 560, 7, 187, 003E3820BB8F8CA3, 9, 255, 2, 4, null, , , 101, 37.51, 202.74, , , , , , 39309, 610824, 52, 152, -117, 37900, 0, , , , , , , null, null, , null, null, null, null, null, null, null, null, 0, null, null, null, null, null, null, null, null, 0, 4, null, 26, 26, 20, 15, 15, null, 36, 350182624, 1039, 1, 430, 48, 0, -78, null, "5041,-27055,7719", "5041,-13528,3860", "5041,-5411,1544", "5041,-2706,772", "5041,-1353,386", null, 178, 4, 0.0, 0.0, 37.51, 202.74, 0, 30, 0, null, null, 687, 3696, 14768, 26300, 850.0, 125.0, 263, , 6.97, 3.77, null, null, null, null, 256, __, null, null, null, null, null, null, null, 254850_0, null, null, null, null, null, 0, 15, 0, 0, null, null, null, , -5411, 1544, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null}

This extra I/O affects task performance.
How was this patch tested?
This should be covered by existing test cases; new test cases were also added.
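The column-pruning idea in the description above can be sketched on a toy plan tree. The following Python sketch is illustrative only (the classes and the rule `prune_below_filter` are hypothetical, not Spark's actual `ColumnPruning` rule): rather than removing the Project above a non-deterministic Filter, a narrower Project is pushed below the Filter so the scan keeps only the columns that are actually referenced:

```python
# Toy model: Project over Filter over a leaf relation. The pruning rule
# inserts a scan-side Project that keeps only the referenced columns.
from dataclasses import dataclass


@dataclass
class Relation:
    output: list  # columns the scan produces


@dataclass
class Filter:
    references: list  # columns the filter condition reads
    child: object


@dataclass
class Project:
    output: list  # columns this node keeps
    child: object


def prune_below_filter(plan):
    """Push a column-pruning Project beneath a Filter over a leaf relation."""
    if (isinstance(plan, Project) and isinstance(plan.child, Filter)
            and isinstance(plan.child.child, Relation)):
        flt, rel = plan.child, plan.child.child
        wanted = set(plan.output) | set(flt.references)
        needed = [c for c in rel.output if c in wanted]
        return Project(plan.output, Filter(flt.references, Project(needed, rel)))
    return plan


# A non-deterministic filter that reads no columns (e.g. rand(10) <= 1.0),
# selecting only "a" from a three-column relation.
plan = Project(["a"], Filter([], Relation(["a", "b", "c"])))
pruned = prune_below_filter(plan)
# The scan-side Project now keeps only "a", matching the narrower
# FileScanRDD output shown in the description above.
```

The filter condition stays above the inserted Project, so its non-deterministic semantics are untouched; only the columns fed into it are narrowed.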