[SPARK-24242][SQL] RangeExec should have correct outputOrdering and outputPartitioning #21291
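For context, the core change in this PR is that `RangeExec` starts declaring its own `outputOrdering` and `outputPartitioning` instead of the leaf-node defaults, which is what the test updates and review comments below react to. The following is only an illustrative sketch of that idea using the standard catalyst ordering/partitioning types; it is not the exact code from this PR, and the empty-range corner case debated at the end of the thread is deliberately left out here.

```scala
import org.apache.spark.sql.catalyst.expressions.{Ascending, AttributeReference, Descending, SortOrder}
import org.apache.spark.sql.catalyst.plans.physical.{Partitioning, RangePartitioning, SinglePartition}
import org.apache.spark.sql.types.LongType

// Illustrative sketch only (not RangeExec's actual code): what ordering and
// partitioning a range of longs could report about its own output.
object RangeExecPropertiesSketch {
  // The single output column of a range, modeled here as a bare attribute.
  val id = AttributeReference("id", LongType, nullable = false)()

  // A positive step emits ids in ascending order, a negative step in
  // descending order.
  def outputOrdering(step: Long): Seq[SortOrder] =
    SortOrder(id, if (step > 0) Ascending else Descending) :: Nil

  // Each slice covers a contiguous, non-overlapping sub-range of ids, so the
  // output is range-partitioned on id; a single slice is just one partition.
  def outputPartitioning(step: Long, numSlices: Int): Partitioning =
    if (numSlices == 1) SinglePartition
    else RangePartitioning(outputOrdering(step), numSlices)

  def main(args: Array[String]): Unit = {
    println(outputOrdering(step = 1))
    println(outputPartitioning(step = 1, numSlices = 4))
  }
}
```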
@@ -39,7 +39,9 @@ class ConfigBehaviorSuite extends QueryTest with SharedSQLContext {
     def computeChiSquareTest(): Double = {
       val n = 10000
       // Trigger a sort
-      val data = spark.range(0, n, 1, 1).sort('id.desc)
+      // Range has range partitioning in its output now. To have a range shuffle, we
+      // need to run a repartition first.
+      val data = spark.range(0, n, 1, 1).repartition(10).sort('id.desc)
Contributor: sorry, I am just curious, why is …

Member (Author): This test requires a range shuffle. Previously … For now …

Contributor: I'm also confused here, the range output ordering is …

Member (Author): Because range reports it is just one partition now?

Contributor: then can we change the code to …

Member (Author): This test uses … If we change it from 1 to 10 partitions, the chi-square value will change too. Should we do this?

Contributor: hmm, isn't …

Member (Author): This is a good point. This is the query plan and partition size for … Because …

Contributor: I see, so the …

Member (Author): By … Here we need a redistribution of the data to make sampling difficult. Previously, a repartition was added automatically before …
        .selectExpr("SPARK_PARTITION_ID() pid", "id").as[(Int, Long)].collect()

      // Compute histogram for the number of records per partition post sort
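To make the thread above concrete: the point of adding `repartition(10)` is that the sort alone no longer introduces the range shuffle the chi-square test exercises. A small standalone check along those lines is sketched below; the object and helper names are made up for illustration, and the exact shuffle counts depend on the Spark version and planner behavior.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
import org.apache.spark.sql.functions.col

object RangeSortShuffleCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sketch").getOrCreate()

    // Count shuffle exchanges in the executed physical plan.
    def countShuffles(ds: Dataset[_]): Int =
      ds.queryExecution.executedPlan.collect { case s: ShuffleExchangeExec => s }.length

    // With Range reporting its own ordering/partitioning, sorting a
    // single-slice range may not need a range shuffle at all...
    val plain = spark.range(0, 10000, 1, 1).sort(col("id").desc)
    // ...while an explicit repartition forces a shuffle before the sort.
    val repartitioned = spark.range(0, 10000, 1, 1).repartition(10).sort(col("id").desc)

    println(s"shuffles without repartition: ${countShuffles(plain)}")
    println(s"shuffles with repartition:    ${countShuffles(repartitioned)}")
    spark.stop()
  }
}
```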
@@ -55,7 +55,9 @@ class WholeStageCodegenSuite extends QueryTest with SharedSQLContext {
     val plan = df.queryExecution.executedPlan
     assert(plan.find(p =>
       p.isInstanceOf[WholeStageCodegenExec] &&
-      p.asInstanceOf[WholeStageCodegenExec].child.isInstanceOf[HashAggregateExec]).isDefined)
+      p.asInstanceOf[WholeStageCodegenExec].child.collect {
+        case h: HashAggregateExec => h
+      }.nonEmpty).isDefined)
     assert(df.collect() === Array(Row(0, 1), Row(1, 1), Row(2, 1)))
   }
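The assertion change above swaps a direct-child check for a subtree search: `child.isInstanceOf[HashAggregateExec]` only matches one specific plan shape, while `child.collect { ... }` finds the aggregate anywhere inside the codegen stage, so the test keeps passing when the plan under `WholeStageCodegenExec` changes shape. A minimal sketch of the two styles of check outside the test suite (the session setup and query here are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.WholeStageCodegenExec
import org.apache.spark.sql.execution.aggregate.HashAggregateExec

object FindAggregateInCodegen {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sketch").getOrCreate()
    val plan = spark.range(3).groupBy("id").count().queryExecution.executedPlan

    // Brittle: only true when the aggregate is the direct child of the
    // codegen node.
    val directChild = plan.find { p =>
      p.isInstanceOf[WholeStageCodegenExec] &&
        p.asInstanceOf[WholeStageCodegenExec].child.isInstanceOf[HashAggregateExec]
    }.isDefined

    // Robust: true when an aggregate appears anywhere in the codegen subtree.
    val inSubtree = plan.find { p =>
      p.isInstanceOf[WholeStageCodegenExec] &&
        p.asInstanceOf[WholeStageCodegenExec].child.collect {
          case h: HashAggregateExec => h
        }.nonEmpty
    }.isDefined

    println(s"direct child: $directChild, anywhere in subtree: $inSubtree")
    spark.stop()
  }
}
```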
@@ -34,14 +34,13 @@ class DebuggingSuite extends SparkFunSuite with SharedSQLContext {

   test("debugCodegen") {
     val res = codegenString(spark.range(10).groupBy("id").count().queryExecution.executedPlan)
-    assert(res.contains("Subtree 1 / 2"))
-    assert(res.contains("Subtree 2 / 2"))
+    assert(res.contains("Subtree 1 / 1"))
     assert(res.contains("Object[]"))
   }

   test("debugCodegenStringSeq") {
     val res = codegenStringSeq(spark.range(10).groupBy("id").count().queryExecution.executedPlan)
-    assert(res.length == 2)
+    assert(res.length == 1)
     assert(res.forall{ case (subtree, code) =>
       subtree.contains("Range") && code.contains("Object[]")})
   }
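The expected counts drop from two codegen subtrees to one, presumably because, with `Range` reporting its output partitioning, `spark.range(10).groupBy("id").count()` no longer needs an exchange and the whole query compiles into a single whole-stage-codegen subtree. A rough standalone way to see the number of subtrees, without the suite's internal `codegenString`/`codegenStringSeq` helpers, is to count `WholeStageCodegenExec` nodes (sketch below; exact counts depend on the Spark version):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.WholeStageCodegenExec

object CountCodegenSubtrees {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sketch").getOrCreate()
    val plan = spark.range(10).groupBy("id").count().queryExecution.executedPlan

    // Each WholeStageCodegenExec node corresponds to one generated subtree,
    // i.e. one "Subtree i / n" section in the debug output.
    val subtrees = plan.collect { case w: WholeStageCodegenExec => w }
    println(s"codegen subtrees: ${subtrees.length}")
    spark.stop()
  }
}
```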
`spark.range(-10, -9, -20, 1).select("id").count` in `DataFrameRangeSuite` causes an exception here. `plan.executeCollect().head` pulls from an empty iterator by calling `next`.

I think it is caused by returning `SinglePartition` when there is no data (and therefore no partition). So I think we should fix it there and not here.

Right, that makes sense. Thanks.
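Following up on the comment above, the suggested direction is to stop reporting `SinglePartition` for a range that produces no rows at all. A hedged sketch of that guard, in the same illustrative style as the earlier sketch (again, not the committed fix):

```scala
import org.apache.spark.sql.catalyst.expressions.SortOrder
import org.apache.spark.sql.catalyst.plans.physical.{Partitioning, RangePartitioning, SinglePartition, UnknownPartitioning}

object EmptyRangePartitioningSketch {
  // An empty range has no partitions to describe; claiming SinglePartition
  // would let a consumer assume there is exactly one partition to read,
  // which is what made plan.executeCollect().head call next() on an empty
  // iterator in DataFrameRangeSuite.
  def outputPartitioning(
      numElements: BigInt,
      numSlices: Int,
      ordering: Seq[SortOrder]): Partitioning =
    if (numElements == 0) UnknownPartitioning(0)
    else if (numSlices == 1) SinglePartition
    else RangePartitioning(ordering, numSlices)
}
```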