[SPARK-24242][SQL] RangeExec should have correct outputOrdering and outputPartitioning #21291
Conversation
cc @cloud-fan @kiszk

Test build #90455 has finished for PR 21291 at commit
      shouldHaveSort = true)
  }

  test("RangeExec should have correct output ordering") {
nit: start with SPARK-24242: ...
Ok.
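For reference, a minimal sketch of what the renamed test could look like (illustrative only; the PR's actual test sits next to the `shouldHaveSort` helper shown above and may assert more):

```scala
import org.apache.spark.sql.execution.RangeExec

test("SPARK-24242: RangeExec should have correct output ordering") {
  val plan = spark.range(1, 100, 1, 10).queryExecution.executedPlan
  // Find the physical Range node inside the (possibly whole-stage-codegen'd) plan.
  val range = plan.collectFirst { case r: RangeExec => r }.get
  // The reported ordering should be on the single `id` output attribute.
  assert(range.outputOrdering.map(_.child) == range.output)
}
```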
LGTM except one minor comment
  override val output: Seq[Attribute] = range.output

  override def outputOrdering: Seq[SortOrder] = range.outputOrdering
since we are here, shall we also implement outputPartitioning?
ok.
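For context, a sketch of where the forwarded ordering comes from (an illustrative helper mirroring what the logical Range node computes; names here are not the PR's code):

```scala
import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, Descending, SortOrder}

// Rows produced by Range come out ordered by `id`: ascending for a positive
// step, descending for a negative one. RangeExec can therefore simply forward
// range.outputOrdering instead of recomputing anything.
def rangeOutputOrdering(output: Seq[Attribute], step: Long): Seq[SortOrder] = {
  val direction = if (step > 0) Ascending else Descending
  output.map(a => SortOrder(a, direction))
}
```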
Test build #90460 has finished for PR 21291 at commit
LGTM, +1 for adding outputPartitioning
  override def outputPartitioning: Partitioning = {
    if (numSlices == 1) {
      SinglePartition
Which one is better? SinglePartition or RangePartitioning(outputOrdering, 1)?
SinglePartition is better
Test build #90484 has finished for PR 21291 at commit
retest this please.

HyukjinKwon left a comment:
lgtm
Test build #90487 has finished for PR 21291 at commit
retest this please.
Changed; I will update related tests.
      0
    } else {
      collected.head.getLong(0)
    }
spark.range(-10, -9, -20, 1).select("id").count in DataFrameRangeSuite causes an exception here: plan.executeCollect().head pulls from an empty iterator by calling next.
I think it is caused by returning SinglePartition when there is no data (and therefore no partition). So I think we should fix it there and not here.
Right, that makes sense. Thanks.
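A sketch of the direction agreed on here (parameter names are illustrative; the merged code may differ in detail): only claim a concrete partitioning when the range actually produces rows.

```scala
import org.apache.spark.sql.catalyst.expressions.SortOrder
import org.apache.spark.sql.catalyst.plans.physical.{Partitioning, RangePartitioning, SinglePartition, UnknownPartitioning}

// An empty range has no partitions, so reporting SinglePartition would let the
// planner skip exchanges that the single-partition guarantee is supposed to
// make unnecessary -- which is exactly what broke the empty-range count above.
def rangeOutputPartitioning(
    numElements: BigInt,
    numSlices: Int,
    ordering: Seq[SortOrder]): Partitioning = {
  if (numElements > 0) {
    if (numSlices == 1) SinglePartition
    else RangePartitioning(ordering, numSlices)
  } else {
    UnknownPartitioning(0)
  }
}
```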
Test build #90495 has finished for PR 21291 at commit

Test build #90496 has finished for PR 21291 at commit
retest this please.

Test build #90497 has finished for PR 21291 at commit

retest this please.

Test build #90499 has finished for PR 21291 at commit

Test build #90503 has finished for PR 21291 at commit

retest this please.

Test build #90514 has finished for PR 21291 at commit

retest this please.

Test build #90610 has finished for PR 21291 at commit

retest this please.

Test build #90646 has finished for PR 21291 at commit

retest this please.

Test build #90661 has finished for PR 21291 at commit

retest this please.

Test build #90666 has finished for PR 21291 at commit

Test build #90678 has finished for PR 21291 at commit
  # groupby one column and one sql expression
- result3 = df.groupby(df.id, df.v % 2).agg(sum_udf(df.v))
  expected3 = df.groupby(df.id, df.v % 2).agg(sum(df.v))
+ result3 = df.groupby(df.id, df.v % 2).agg(sum_udf(df.v)).orderBy(df.id, df.v % 2)
why not just orderBy(df.id)? and why was this not failing before this fix?
Simply put, the data ordering of result3 and expected3 differs now.
Previous query plans for the two queries:
== Physical Plan ==
!AggregateInPandas [id#0L, (v#8 % 2.0) AS (v#8 % 2.0)#40], [sum(v#8)], [id#0L, (v#8 % 2.0)#40 AS (v % 2)#22, sum(v)#21 AS sum(v)#23]
+- *(2) Sort [id#0L ASC NULLS FIRST, (v#8 % 2.0) AS (v#8 % 2.0)#40 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(id#0L, (v#8 % 2.0) AS (v#8 % 2.0)#40, 200)
+- Generate explode(vs#4), [id#0L], false, [v#8]
+- *(1) Project [id#0L, array((20.0 + cast(id#0L as double)), (21.0 + cast(id#0L as double)), (22.0 + cast(id#0L as double)), (23.0 + cast(id#0L as double)), (24.0 + cast(id#0L as double)), (25.0 + cast(id#0L as double)), (26.0 + cast(id#0L as double)), (27.0 + cast(id#0L as double)), (28.0 + cast(id#0L as double)), (29.0 + cast(id#0L as double))) AS vs#4]
+- *(1) Range (0, 10, step=1, splits=8)
== Physical Plan ==
*(3) HashAggregate(keys=[id#0L, (v#8 % 2.0)#36], functions=[sum(v#8)], output=[id#0L, (v % 2)#31, sum(v)#32])
+- Exchange hashpartitioning(id#0L, (v#8 % 2.0)#36, 200)
+- *(2) HashAggregate(keys=[id#0L, (v#8 % 2.0) AS (v#8 % 2.0)#36], functions=[partial_sum(v#8)], output=[id#0L, (v#8 % 2.0)#36, sum#38])
+- Generate explode(vs#4), [id#0L], false, [v#8]
+- *(1) Project [id#0L, array((20.0 + cast(id#0L as double)), (21.0 + cast(id#0L as double)), (22.0 + cast(id#0L as double)), (23.0 + cast(id#0L as double)), (24.0 + cast(id#0L as double)), (25.0 + cast(id#0L as double)), (26.0 + cast(id#0L as double)), (27.0 + cast(id#0L as double)), (28.0 + cast(id#0L as double)), (29.0 + cast(id#0L as double))) AS vs#4]
+- *(1) Range (0, 10, step=1, splits=8)
Both have an Exchange hashpartitioning, which previously produced the same data distribution. Note that the Sort doesn't change the observed ordering, because spreading the rows over 200 partitions makes the distribution sparse.
Current query plan:
== Physical Plan ==
!AggregateInPandas [id#388L, (v#396 % 2.0) AS (v#396 % 2.0)#453], [sum(v#396)], [id#388L, (v#396 % 2.0)#453 AS (v % 2)#438, sum(v)#437 AS sum(v)#439]
+- *(2) Sort [id#388L ASC NULLS FIRST, (v#396 % 2.0) AS (v#396 % 2.0)#453 ASC NULLS FIRST], false, 0
+- Generate explode(vs#392), [id#388L], false, [v#396]
+- *(1) Project [id#388L, array((20.0 + cast(id#388L as double)), (21.0 + cast(id#388L as double)), (22.0 + cast(id#388L as double)), (23.0 + cast(id#388L as double)), (24.0 + cast(id#388L as double)), (25.0 + cast(id#388L as double)), (26.0 + cast(id#388L as double)), (27.0 + cast(id#388L as double)), (28.0 + cast(id#388L as double)), (29.0 + cast(id#388L as double))) AS vs#392]
+- *(1) Range (0, 10, step=1, splits=4)
== Physical Plan ==
*(2) HashAggregate(keys=[id#388L, (v#396 % 2.0)#454], functions=[sum(v#396)], output=[id#388L, (v % 2)#447, sum(v)#448])
+- *(2) HashAggregate(keys=[id#388L, (v#396 % 2.0) AS (v#396 % 2.0)#454], functions=[partial_sum(v#396)], output=[id#388L, (v#396 % 2.0)#454, sum#456])
+- Generate explode(vs#392), [id#388L], false, [v#396]
+- *(1) Project [id#388L, array((20.0 + cast(id#388L as double)), (21.0 + cast(id#388L as double)), (22.0 + cast(id#388L as double)), (23.0 + cast(id#388L as double)), (24.0 + cast(id#388L as double)), (25.0 + cast(id#388L as double)), (26.0 + cast(id#388L as double)), (27.0 + cast(id#388L as double)), (28.0 + cast(id#388L as double)), (29.0 + cast(id#388L as double))) AS vs#392]
+- *(1) Range (0, 10, step=1, splits=4)
The Exchange is not there anymore, so both queries still get the same data distribution, but now the Sort changes the observed data ordering.
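A small Scala sketch of how the missing exchange can be observed (illustrative query, not the PR's test): once RangeExec reports RangePartitioning on id, a grouping whose keys include id no longer needs a shuffle in this Spark version.

```scala
import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec

// Group a plain range by its own `id` column; the aggregate's distribution
// requirement is already satisfied by the range's reported partitioning.
val plan = spark.range(0, 10, 1, 4).groupBy("id").count()
  .queryExecution.executedPlan
assert(plan.find(_.isInstanceOf[ShuffleExchangeExec]).isEmpty)
```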
thanks for your detailed explanation. Anyway, can we just use orderBy(df.id) instead of orderBy(df.id, df.v % 2)?
They are already ordered by df.id. This is the partial data:
Expected:
   id  (v % 2)  sum(v)
0   0      0.0   120.0
1   0      1.0   125.0
2   1      1.0   125.0
3   1      0.0   130.0
4   2      0.0   130.0
5   2      1.0   135.0

Result:
   id  (v % 2)  sum(v)
0   0      0.0   120.0
1   0      1.0   125.0
2   1      0.0   130.0
3   1      1.0   125.0
4   2      0.0   130.0
5   2      1.0   135.0
oh I see now, sorry, thanks.
  }

  test("debugCodegen") {
    val res = codegenString(spark.range(10).groupBy("id").count().queryExecution.executedPlan)
can we change to groupBy('id * 2)? We should try our best to keep what the test covers, and keep the shuffle in this query.
Ok.
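A sketch of the suggested adjustment (the exact assertions in DebuggingSuite may differ): grouping on a derived expression is not covered by the range's partitioning, so the shuffle, and with it the second codegen subtree, stays.

```scala
import org.apache.spark.sql.functions.col

test("debugCodegen") {
  // groupBy(col("id") * 2) keeps the exchange, so the query still compiles
  // into two whole-stage codegen subtrees.
  val res = codegenString(
    spark.range(10).groupBy(col("id") * 2).count().queryExecution.executedPlan)
  assert(res.contains("Subtree 1 / 2"))
  assert(res.contains("Subtree 2 / 2"))
}
```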
| test("debugCodegenStringSeq") { | ||
| val res = codegenStringSeq(spark.range(10).groupBy("id").count().queryExecution.executedPlan) | ||
| assert(res.length == 2) | ||
| assert(res.length == 1) |
ditto
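And the analogous sketch for the Seq variant, which lets the original length-2 assertion stay:

```scala
import org.apache.spark.sql.functions.col

test("debugCodegenStringSeq") {
  val res = codegenStringSeq(
    spark.range(10).groupBy(col("id") * 2).count().queryExecution.executedPlan)
  // One codegen subtree below the exchange and one above it.
  assert(res.length == 2)
}
```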
LGTM

Thanks @mgaido91

Test build #90722 has finished for PR 21291 at commit
  assert(plan.find(p =>
    p.isInstanceOf[WholeStageCodegenExec] &&
-     p.asInstanceOf[WholeStageCodegenExec].child.isInstanceOf[HashAggregateExec]).isDefined)
+     p.asInstanceOf[WholeStageCodegenExec].child.collect {
same here, can we change the groupBy instead of the test?
ok.
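A sketch of keeping the original assertion by changing the grouping key instead (query shape assumed from the surrounding diff; the final test may differ):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.execution.WholeStageCodegenExec
import org.apache.spark.sql.execution.aggregate.HashAggregateExec
import org.apache.spark.sql.functions.col

// Grouping by a derived column keeps the shuffle, so the aggregate is still
// the direct child of a WholeStageCodegenExec and the assertion can stay.
val df = spark.range(3).groupBy(col("id") * 2).count().orderBy(col("id") * 2)
val plan = df.queryExecution.executedPlan
assert(plan.find(p =>
  p.isInstanceOf[WholeStageCodegenExec] &&
    p.asInstanceOf[WholeStageCodegenExec].child.isInstanceOf[HashAggregateExec]).isDefined)
assert(df.collect() === Array(Row(0, 1), Row(2, 1), Row(4, 1)))
```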
LGTM

Test build #90772 has finished for PR 21291 at commit

retest this please

Test build #90780 has finished for PR 21291 at commit

Merged to master.
Thanks @cloud-fan @mgaido91 @kiszk @HyukjinKwon |
What changes were proposed in this pull request?

The logical Range node has recently gained an outputOrdering, which is used to eliminate redundant Sort nodes during optimization. However, this outputOrdering does not propagate to the physical RangeExec node. This patch propagates it, and also adds the correct outputPartitioning to the RangeExec node.

How was this patch tested?

Added test.
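As a rough illustration of the effect (a sketch, not the PR's test code): after this patch, the physical RangeExec node reports both properties, so the planner can avoid redundant sorts and shuffles on a range's id column.

```scala
import org.apache.spark.sql.execution.RangeExec

val executed = spark.range(0, 100, 1, 4).queryExecution.executedPlan
val rangeExec = executed.collectFirst { case r: RangeExec => r }.get

// Ascending order on `id` (descending for a negative step).
println(rangeExec.outputOrdering)
// RangePartitioning on `id` with 4 partitions; SinglePartition when numSlices == 1.
println(rangeExec.outputPartitioning)
```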