Conversation

@peter-toth (Contributor) commented Feb 3, 2019

What changes were proposed in this pull request?

This PR is a correctness fix in HashAggregateExec code generation. It forces evaluation of result expressions before calling consume() to avoid multiple executions.

This PR fixes a use case where an aggregate is nested into a broadcast join and appears on the "stream" side. The issue is that the broadcast join generates its own loop, and without forcing evaluation of the resultExpressions of HashAggregateExec before the join's loop, these expressions can be executed multiple times, giving incorrect results.

How was this patch tested?

New UT was added.

@maropu (Member) commented Feb 4, 2019

I think we should handle this case in the planner?
For example, if we turn off broadcast join, the behaviour changes:

scala> val baseTable = Seq((1), (1)).toDF("idx")
scala> val distinctWithId = baseTable.distinct.withColumn("id", functions.monotonically_increasing_id())
scala> baseTable.join(distinctWithId, "idx").show
+---+------------+
|idx|          id|
+---+------------+
|  1|369367187456|
|  1|369367187457|
+---+------------+

sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
scala> baseTable.join(distinctWithId, "idx").show
+---+------------+
|idx|          id|
+---+------------+
|  1|369367187456|
|  1|369367187456|
+---+------------+

Could you check again?

@maropu (Member) commented Feb 4, 2019

btw, could you describe more in the PR description? What's the root cause of this issue? How does this PR fix it? And so on...

@peter-toth (Contributor Author) commented Feb 4, 2019

The reason why I think this is a code generation issue is that if you disable spark.sql.codegen.wholeStage, the result is correct.
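
For reference, the check I did in spark-shell (using the same baseTable and distinctWithId as in the example):

    // Disabling whole-stage codegen makes the example above produce the expected result,
    // which is why I think this is a codegen issue rather than a planning one.
    spark.conf.set("spark.sql.codegen.wholeStage", false)
    baseTable.join(distinctWithId, "idx").show()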

This is the physical plan of the example in the ticket:

== Physical Plan ==
*(3) Project [idx#4, id#6L]
+- *(3) BroadcastHashJoin [idx#4], [idx#9], Inner, BuildLeft
   :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   :  +- *(1) Project [value#1 AS idx#4]
   :     +- LocalTableScan [value#1]
   +- *(3) HashAggregate(keys=[idx#9], functions=[], output=[idx#9, id#6L])
      +- Exchange hashpartitioning(idx#9, 5)
         +- *(2) HashAggregate(keys=[idx#9], functions=[], output=[idx#9])
            +- *(2) Project [value#1 AS idx#9]
               +- LocalTableScan [value#1]

and if you take a look at the code of stage 3 (I left some comments in it about what my PR does):

    ...
    // this method is called for every aggregation key
    private void agg_doAggregateWithKeysOutput_0(UnsafeRow agg_keyTerm_0, UnsafeRow agg_bufferTerm_0)
            throws java.io.IOException {
        ((org.apache.spark.sql.execution.metric.SQLMetric) references[4] /* numOutputRows */).add(1);

        int agg_value_4 = agg_keyTerm_0.getInt(0);
        // this PR moves the agg_value_5 calculation and the agg_count_0 increment from the broadcast join loop to here

        // generate join key for stream side
        boolean bhj_isNull_0 = false;
        long bhj_value_0 = -1L;
        if (!false) {
            bhj_value_0 = (long) agg_value_4;
        }
        // find matches from HashRelation
        scala.collection.Iterator bhj_matches_0 = bhj_isNull_0 ? null
                : (scala.collection.Iterator) bhj_relation_0.get(bhj_value_0);
        if (bhj_matches_0 != null) {
            while (bhj_matches_0.hasNext()) {
                UnsafeRow bhj_matched_0 = (UnsafeRow) bhj_matches_0.next();
                {
                    ((org.apache.spark.sql.execution.metric.SQLMetric) references[6] /* numOutputRows */).add(1);

                    int bhj_value_2 = bhj_matched_0.getInt(0);
                    boolean project_isNull_0 = false;
                    UTF8String project_value_0 = null;
                    if (!false) {
                        project_value_0 = UTF8String.fromString(String.valueOf(bhj_value_2));
                    }
                    final long agg_value_5 = partitionMask + agg_count_0;
                    agg_count_0++;
                    boolean project_isNull_2 = false;
                    UTF8String project_value_2 = null;
                    if (!false) {
                        project_value_2 = UTF8String.fromString(String.valueOf(agg_value_5));
                    }
    ...

So both a hash aggregate and a broadcast join are required in one codegen stage to hit this issue, and it is also important that the aggregate is on the "stream" side. This might be a rare case, which explains why this issue hasn't come up earlier.
(I also think that there might be operators other than broadcast join that generate a loop and so are affected, but I didn't look into that.)
But I think this is an issue with the generated code of HashAggregateExec, and it seems to me that we can force evaluation of resultExpressions before generating the broadcast join code (i.e. calling consume()) without any drawback.
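
Conceptually the change looks like this in the doConsumeWithKeys path (a simplified sketch, not the exact diff):

    // before: consume(ctx, resultVars) hands unevaluated result vars to the parent,
    // so e.g. a broadcast join can re-evaluate them inside its match loop.
    // after: force evaluation once here, then hand over the already evaluated vars:
    val evaluateResultVars = evaluateVariables(resultVars)
    s"""
       |$evaluateResultVars
       |${consume(ctx, resultVars)}
     """.stripMargin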

@mgaido91 (Contributor) commented Feb 4, 2019

The change makes sense to me, but I think this problem was introduced in SPARK-13404, which claimed a significant perf gain (about 30% on TPCDS Q55), so it would be great if we could fix this without introducing a perf regression. @peter-toth could you please run the benchmarks (and post the results) to make sure we are not introducing a perf regression with this PR?

@davies you are the author of that PR, do you have time to check this?

@maropu (Member) commented Feb 4, 2019

ok to test

@maropu (Member) commented Feb 4, 2019

cc: @cloud-fan @hvanhovell

@maropu (Member) commented Feb 4, 2019

Does this issue happen only in the case of stateful exprs? If so, could you modify the code to apply the current fix only if HashAggregateExec has stateful exprs? I worry about the performance regression @mgaido91 pointed out, too. It seems the current fix affects the other queries as well, even though this is a corner case...
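
Something like this guard is what I have in mind (just a sketch, assuming the Catalyst Stateful trait):

    // Only force evaluation when some result expression contains a stateful expression.
    val hasStatefulResultExpr = resultExpressions.exists(
      _.find(_.isInstanceOf[Stateful]).isDefined)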

$evaluateKeyVars
$evaluateBufferVars
$evaluateAggResults
$evaluateResultVars
Member

Do we need this change?

Contributor Author

I think so. If you replace .distinct() with .groupBy("idx").max() in the example, then this code path runs and the change fixes the same issue.

Member

If so, could you please add test cases to cover all the code paths you added in this PR?

Contributor Author

Thanks. I've added that path to the test.

@SparkQA commented Feb 4, 2019

Test build #102034 has finished for PR 23731 at commit b5d079c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@peter-toth (Contributor Author)

Here are my benchmark results for q55. I ran it 3 times on master and 3 times on this PR branch against scale=5 generated data.
Master:

master:
  Stopped after 5 iterations, 29324 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_162-b12 on Mac OS X 10.14.2
Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
TPCDS Snappy:                            Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
q55                                           5683 / 5865          2.6         391.2       1.0X

  Stopped after 5 iterations, 28914 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_162-b12 on Mac OS X 10.14.2
Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
TPCDS Snappy:                            Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
q55                                           5584 / 5783          2.6         384.3       1.0X

  Stopped after 5 iterations, 29905 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_162-b12 on Mac OS X 10.14.2
Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
TPCDS Snappy:                            Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
q55                                           5873 / 5981          2.5         404.3       1.0X

This PR:

this PR:
  Stopped after 5 iterations, 32577 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_162-b12 on Mac OS X 10.14.2
Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
TPCDS Snappy:                            Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
q55                                           6226 / 6515          2.3         428.5       1.0X


  Stopped after 5 iterations, 30612 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_162-b12 on Mac OS X 10.14.2
Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
TPCDS Snappy:                            Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
q55                                           5792 / 6122          2.5         398.6       1.0X

  Stopped after 5 iterations, 32918 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_162-b12 on Mac OS X 10.14.2
Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
TPCDS Snappy:                            Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
q55                                           6415 / 6584          2.3         441.5       1.0X

Although the results vary a bit, it seems this patch would introduce some performance degradation.
I will try to modify the patch to evaluate only Stateful expressions, as @maropu suggested, and run the benchmark again.

@mgaido91 (Contributor) commented Feb 4, 2019

@peter-toth did you also run the benchmark on the other queries? My guess is that q55 may get some perf degradation while others improve. In that case we should average over all the queries to see whether the overall impact is positive or not.

In case we decide to limit this to only some expressions, we should do it for those which are non-deterministic rather than only for the Stateful ones.
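
For example (a sketch), filtering on determinism covers both, since Stateful expressions are non-deterministic by definition:

    // Catches Stateful expressions as well as the other non-deterministic ones.
    val exprsToForceEvaluate = resultExpressions.filterNot(_.deterministic)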

@peter-toth (Contributor Author)

Thanks @mgaido91, then I will run a full benchmark first.

@mgaido91 (Contributor) commented Feb 4, 2019

Thanks @peter-toth!

@dongjoon-hyun (Member)

Retest this please.

@SparkQA commented Feb 8, 2019

Test build #102104 has finished for PR 23731 at commit b5d079c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

The issue is that the broadcast join generates its own loop. And without forcing evaluation of resultExpressions of HashAggregateExec before the join's loop these expressions can be executed multiple times giving incorrect results.

Shouldn't we fix join instead of aggregate?

consume(ctx, eval)
val evaluateResultVars = evaluateVariables(resultVars)
s"""
$evaluateResultVars
Member

For non-broadcast-join cases, the change will force evaluation unnecessarily too. We should move evaluation out of the loop in the broadcast join, if possible.

Member

What I'm a bit concerned about is: is it semantically ok to defer the evaluation of non-deterministic exprs if HashAggregateExec has these exprs?

I think, to fix this issue, it's ok to modify code on the join side if we can find a simpler solution there with no performance regression. But I just have a question about the design, regardless of this issue.

Member

oh.. Kris answered my question.. in #23731 (review)

@peter-toth (Contributor Author)

@mgaido91 @maropu @cloud-fan @viirya I've just collected the results of full TPCDSQueryBenchmark runs on master vs. this PR, and overall forcing the evaluation in the aggregate doesn't seem to have that big an impact. But I will try to change the PR and fix the broadcast join instead, moving evaluation of non-deterministic expressions out of the loop.

Here are the benchmark results if you are interested:

test case (values in ms) master master 2nd run this PR this PR 2nd run
total 27023411 26650436 27076386 27024466
q1 76017 75679 78988 81349
q10 90319 88032 93112 97605
q10a-v2.7 93878 93220 95113 94199
q11 187918 187985 198156 201343
q11-v2.7 186384 183901 191372 195608
q12 35709 34768 37109 39020
q12-v2.7 33617 34629 35666 37382
q13 76088 75420 79224 79844
q14-v2.7 647991 593543 611725 638555
q14a 793687 779841 786366 780845
q14a-v2.7 835538 849404 845488 824285
q14b 632994 659007 647556 628163
q15 50407 48712 53039 52104
q16 173310 176263 181887 185434
q17 358234 349108 384112 380049
q18 67506 67611 70547 71098
q18a-v2.7 260611 263082 268121 260403
q19 45170 45566 48270 48136
q2 62042 63219 64080 65965
q20 37064 36726 38922 38857
q20-v2.7 35919 35109 37882 37615
q21 30900 30886 31394 32062
q22 134810 140961 139015 146047
q22-v2.7 881799 666436 736824 786129
q22a-v2.7 77062 76621 76412 77984
q23a 551085 562690 570236 554450
q23b 509188 529924 514952 504751
q24-v2.7 232464 223424 235800 242875
q24a 235519 239793 240399 239449
q24b 242000 235056 240476 237701
q25 365661 364580 375756 388874
q26 50035 49711 52832 53557
q27 54088 53121 55344 57900
q27a-v2.7 156000 153863 156385 149929
q28 253946 253869 264420 266887
q29 364358 368899 371732 377960
q3 32569 32863 34460 35264
q30 86921 87179 90322 93566
q31 195233 195026 204116 208575
q32 70098 70355 72000 73724
q33 113700 114023 117777 121381
q34 48608 48859 50382 51431
q34-v2.7 47455 46366 48917 49559
q35 96340 93386 97375 99735
q35-v2.7 92379 93076 95759 97744
q35a-v2.7 98323 97102 98990 95888
q36 52161 52314 53873 55724
q36a-v2.7 56251 55359 55688 55844
q37 71609 72788 74007 75574
q38 113581 111851 115416 117293
q39a 67328 70285 69553 67597
q39b 67457 69747 70059 67390
q4 645699 632785 630790 662483
q40 92189 90362 94774 97688
q41 3931 3906 4157 4170
q42 32727 33828 33368 35238
q43 41522 40911 43279 44498
q44 118338 118557 122399 126860
q45 42166 42729 44086 43529
q46 57798 57073 59813 60527
q47 92922 92823 94551 99835
q47-v2.7 90674 90166 91696 96707
q48 64766 65324 66017 69578
q49 290310 306006 294090 303586
q49-v2.7 288623 320468 294292 290968
q5 242664 244383 258006 263323
q50 177935 187770 193884 188241
q51 180839 180103 184773 189253
q51a-v2.7 1261774 1166285 1173536 1156590
q52 32827 32561 33622 33896
q53 42069 42203 43387 44729
q54 178694 180335 185816 193587
q55 32863 32091 33049 33913
q56 113489 110395 116137 116632
q57 77085 78120 80313 80765
q57-v2.7 75062 75184 78971 78424
q58 102682 104439 106576 109776
q59 58396 57408 59314 62043
q5a-v2.7 264955 257989 264370 266317
q6 109696 107457 107418 109251
q6-v2.7 114195 109684 113283 114997
q60 115152 111163 116920 118019
q61 88653 87815 88905 92676
q62 45958 45002 45842 49475
q63 42271 41499 42553 44231
q64 375608 366708 371139 389530
q64-v2.7 398085 387919 378419 381406
q65 120201 121013 121007 124101
q66 107631 110160 110194 116045
q67 491712 496057 502941 498816
q67a-v2.7 582984 582585 588929 561230
q68 57644 57396 58574 62110
q69 92364 86473 87804 92744
q7 53092 54107 56050 56440
q70 76027 76236 78249 80208
q70a-v2.7 81529 81288 80193 79803
q71 111032 118959 114687 115729
q72 1263412 1215258 1204783 1194142
q72-v2.7 1274719 1297230 1224349 1188723
q73 50254 47880 48539 50459
q74 169414 163799 163079 169219
q74-v2.7 162417 163068 161341 164335
q75 365739 366476 372510 377480
q75-v2.7 352554 372481 374001 359940
q76 114328 110486 114099 114004
q77 181185 174504 177397 179485
q77a-v2.7 191487 188109 196948 188439
q78 381549 366170 382482 377300
q78-v2.7 373872 381872 400429 363220
q79 55042 53635 55588 55914
q8 40407 39912 41745 42687
q80 515092 512636 542615 522297
q80a-v2.7 527537 532262 546034 529508
q81 85570 85808 88579 86749
q82 99038 96385 99921 100342
q83 106249 99379 104401 105534
q84 39353 37683 40165 39717
q85 167298 167352 165008 169000
q86 37422 37246 38432 39093
q86a-v2.7 40440 40165 41697 41213
q87 120091 117061 125508 123227
q88 283596 293759 296313 300676
q89 44557 44882 46236 46012
q9 499645 497991 514793 525809
q90 70330 74905 72998 74123
q91 45903 46732 48843 48705
q92 66107 66381 68388 67379
q93 281893 293320 292280 295871
q94 117957 117824 126470 124528
q95 632772 591228 607336 608948
q96 37997 38276 39832 39946
q97 105541 104452 113225 108664
q98 38516 38436 41049 41108
q98-v2.7 37586 36939 39786 39280
q99 50378 51491 53608 52571

@mgaido91 (Contributor)

@cloud-fan @viirya I am not sure fixing this in the join is a good idea. First of all, we have many kinds of joins, so we would likely need to touch all of them, and there may be other operators besides joins that use loops. I don't think it is correct to delegate to the consumer the responsibility of computing variables when needed. Honestly, it seems more reasonable to me to fix it in the aggregate.

@cloud-fan (Contributor)

@mgaido91 are you sure aggregate is the only one that produces unevaluated result expressions? IIRC this is a long-standing optimization in the whole-stage codegen framework, and there is no rule that operators must evaluate their result expressions before calling parent.consume.

also cc @rednaxelafx @kiszk

@dongjoon-hyun (Member)

cc @dbtsai since he is the release manager for 2.4.1.

@rednaxelafx (Contributor) left a comment

This bug and fix touches a basic design area of Spark SQL's whole-stage codegen:

  • Deterministic expressions can be evaluated anywhere as long as their inputs (data dependencies) are available, and they are allowed to be evaluated multiple times (although from a performance point of view it's not preferred to evaluate them repeatedly); non-deterministic expressions have to be evaluated exactly once, and the order of evaluation should respect the order in the original query.

Two rules of thumb are:

  1. In the whole-stage codegen framework, the evaluation of a deterministic expression can be deferred to just before its result is used. To improve performance and reduce code size, we only expect output expressions that are used more than once to be eagerly evaluated. This "used more than once" is expressed by CodegenSupport.usedInputs, and CodegenSupport.consume() handles the eager evaluation of such expressions automatically. That's #11274, already mentioned in one of the comments above.
  2. Any physical plan operator that carries an output projection list, such as ProjectExec and, in this case, HashAggregateExec, has to take special care to force evaluation of non-deterministic expressions before passing the outputVars to consume(), to make sure their side effects happen in the correct order and they are not evaluated repeatedly in the parents' doConsume(). See ProjectExec.doConsume() for an example of what this special treatment should look like.

Note that Stateful expressions are Nondeterministic by design; the latter covers more expressions than the former.

The reason why this special treatment isn't done in the CodegenSupport.consume() framework function is that consume() only gets to see the outputVars from the child as a list of ExprCodes, but not the list of Expressions that produced the code. The former has lost the notion of whether the generated code is deterministic or not, which can only be found on the latter.
consume() also gets to see child.output, but that's a list of Attributes, which doesn't carry the knowledge of whether the original expression was deterministic, so that doesn't help either.
With that, we have to perform the special treatment before calling consume().

This brings us to another related note: in the whole-stage codegen world, it really is preferable to host non-trivial expressions in ProjectExec as much as possible, so that we only have to handle non-trivial expressions in one place. Fusing the output projection list into a fat operator is a design from the past -- it would have helped reduce operator boundaries and thus the materialization/operator dispatch overhead in the Volcano model, but in the whole-stage codegen world that doesn't matter at all.

Here's my suggested fix for HashAggregateExec:

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
index 19a47ffc6d..be457b435b 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
@@ -154,6 +154,14 @@ case class HashAggregateExec(
     child.asInstanceOf[CodegenSupport].inputRDDs()
   }
 
+  // Extract the code to evaluate non-deterministic expressions in the resultExpressions.
+  // NOTE: this function will mutate the state of the `ExprCode`s in `resultVars`: the `code` of
+  // non-deterministic expressions will be cleared.
+  private def evaluateNondeterministicResults(resultVars: Seq[ExprCode]): String = {
+    val nondeterministicAttrs = resultExpressions.filterNot(_.deterministic).map(_.toAttribute)
+    evaluateRequiredVariables(output, resultVars, AttributeSet(nondeterministicAttrs))
+  }
+
   protected override def doProduce(ctx: CodegenContext): String = {
     if (groupingExpressions.isEmpty) {
       doProduceWithoutKeys(ctx)
@@ -208,8 +216,10 @@ case class HashAggregateExec(
       // evaluate result expressions
       ctx.currentVars = aggResults
       val resultVars = bindReferences(resultExpressions, aggregateAttributes).map(_.genCode(ctx))
+      val evaluateNondeterministicAggResults = evaluateNondeterministicResults(resultVars)
       (resultVars, s"""
         |$evaluateAggResults
+        |$evaluateNondeterministicAggResults
         |${evaluateVariables(resultVars)}
        """.stripMargin)
     } else if (modes.contains(Partial) || modes.contains(PartialMerge)) {
@@ -466,10 +476,12 @@ case class HashAggregateExec(
       val resultVars = bindReferences[Expression](
         resultExpressions,
         inputAttrs).map(_.genCode(ctx))
+      val evaluateNondeterministicAggResults = evaluateNondeterministicResults(resultVars)
       s"""
        $evaluateKeyVars
        $evaluateBufferVars
        $evaluateAggResults
+       $evaluateNondeterministicAggResults
        ${consume(ctx, resultVars)}
        """
     } else if (modes.contains(Partial) || modes.contains(PartialMerge)) {
@@ -506,10 +518,14 @@ case class HashAggregateExec(
       // generate result based on grouping key
       ctx.INPUT_ROW = keyTerm
       ctx.currentVars = null
-      val eval = bindReferences[Expression](
+      val resultVars = bindReferences[Expression](
         resultExpressions,
         groupingAttributes).map(_.genCode(ctx))
-      consume(ctx, eval)
+      val evaluateNondeterministicAggResults = evaluateNondeterministicResults(resultVars)
+      s"""
+        |$evaluateNondeterministicAggResults
+        |${consume(ctx, resultVars)}
+       """.stripMargin
     }
     ctx.addNewFunction(funcName,
       s"""

@peter-toth (Contributor Author) commented Feb 12, 2019

I was wondering why the following simple code snippet doesn't have the same issue:

    val baseTable = Seq((1), (1)).toDF("idx")
    val distinctWithId = baseTable.withColumn("id", monotonically_increasing_id())
    val x = baseTable.join(distinctWithId, "idx")
    x.show()

because it produces the expected

+---+----------+
|idx|        id|
+---+----------+
|  1|         0|
|  1|         0|
|  1|8589934592|
|  1|8589934592|
+---+----------+

and it seems that's because doConsume in ProjectExec evaluates non-deterministic result vars before passing them to the join. So I think handling non-determinism in the aggregate would be analogous.

Oops, meanwhile we got the same answer. Thanks @rednaxelafx.
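
For reference, this is roughly the relevant part of ProjectExec.doConsume (paraphrased from memory, not an exact quote of the source):

    override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: ExprCode): String = {
      ctx.currentVars = input
      val resultVars = bindReferences[Expression](projectList, child.output).map(_.genCode(ctx))
      // Evaluation of non-deterministic expressions can't be deferred to the parent.
      val nonDeterministicAttrs = projectList.filterNot(_.deterministic).map(_.toAttribute)
      s"""
         |${evaluateRequiredVariables(output, resultVars, AttributeSet(nonDeterministicAttrs))}
         |${consume(ctx, resultVars)}
       """.stripMargin
    }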

@maropu (Member) commented Feb 12, 2019

Thanks, Kris. I'm just curious whether the @rednaxelafx approach really has no performance regression...

@peter-toth (Contributor Author) commented Feb 12, 2019

So, shall I adjust the fix as @rednaxelafx suggested and maybe run another benchmark? Any objections?

@rednaxelafx (Contributor)

@maropu : my proposed change won't introduce any performance regressions because what used to be both (1) correct and (2) fast will stay the same, no changes whatsoever; whereas what used to be incorrect will be fixed.
You won't see any statistically significant differences in TPC-DS perf numbers because that benchmark doesn't really use a lot of non-deterministic expressions. Such expressions are rare in the SQL world. There isn't even a rand() call in TPC-DS...
We should expect the TPC-DS queries to generate identical whole-stage codegen code before and after my proposed fix.

@mgaido91 (Contributor)

Thanks for your comment @rednaxelafx , huge +1 on everything you just said.

@mgaido91 are you sure aggregate is the only one that produces unevaluated result expressions?

@cloud-fan if it is not the only one, I think we have to fix the others too, but I don't think there are any. ProjectExec is fine, as mentioned by @rednaxelafx, and I can't think of other plans which can generate non-deterministic expressions (there may be some, but none comes to my mind at the moment).

@maropu (Member) commented Feb 12, 2019

@rednaxelafx I was just worried about performance numbers other than TPCDS, but that's certainly true. Thanks, Kris.

nit: btw, could we move evaluateNondeterministicResults into CodegenSupport so that ProjectExec can reuse it?
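
e.g. something like this in CodegenSupport, which both HashAggregateExec and ProjectExec could call (just a sketch; the exact name and signature are up to you):

    /**
     * Returns source code to evaluate the variables for the non-deterministic expressions
     * and clears their code so they are not evaluated twice.
     */
    protected def evaluateNondeterministicVariables(
        attributes: Seq[Attribute],
        variables: Seq[ExprCode],
        expressions: Seq[NamedExpression]): String = {
      val nondeterministicAttrs = expressions.filterNot(_.deterministic).map(_.toAttribute)
      evaluateRequiredVariables(attributes, variables, AttributeSet(nondeterministicAttrs))
    }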

@peter-toth (Contributor Author)

Thank you all for the comments and suggestions.
I pushed a commit with the changes, except for the change in doProduceWithoutKeys(), as resultVars are force-evaluated there and restricting the evaluation to non-deterministic expressions only would be an optimization, not a bugfix.
Also, I didn't change ProjectExec; let me know if these two should be incorporated in this PR.

@rednaxelafx (Contributor) left a comment

Mostly LGTM, with a comment in the test case.

* Returns source code to evaluate the variables for non-deterministic expressions, and clears the
* code of the evaluated variables, to prevent them from being evaluated twice.
*/
protected def evaluateNondeterministicVariables(
Contributor

Nitpick on naming: "variables" are never non-deterministic; only expressions can have the property of being deterministic or not. Two options:

  • I'd prefer naming this utility function evaluateNondeterministicResults to emphasize that it should (mostly) be used on the results of an output projection list.
  • But the existing utility function evaluateRequiredVariables uses the "variable" notion, so keeping consistency there is fine too.

I'm fine either way.

Also, historically Spark SQL's WSCG would use variable names like eval for the ExprCode type, e.g. evals: Seq[ExprCode]. Not sure why it started that way but you can see that naming pattern throughout the WSCG code base.
Again, your new utility function follows the same names used in evaluateRequiredVariables so that's fine. Local consistency is good enough.

Member

To keep the naming consistent, +1 for evaluateNondeterministicVariables.

val baseTable = Seq((1), (1)).toDF("idx")

// BroadcastHashJoinExec with a HashAggregateExec child containing no aggregate expressions
val distinctWithId = baseTable.distinct().withColumn("id", monotonically_increasing_id())
Contributor

I'm not sure how stable the results are going to be if you use monotonically_increasing_id here with an unspecified number of shuffle partitions. Since you're checking the exact value of the resulting id, if the number of shuffle partitions changes (let's say if someone decides to change the default shuffle partitions setting in all tests), this test can become fragile and fail unnecessarily.

It might be worth setting the number of shuffle partitions to 1 explicitly inside this test case. Or go back to grouping by id instead of checking the exact value of id, or just assert that the ids are equal.
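
For illustration, the last option could look roughly like this (a sketch; distinctWithId is assumed to be the joined DataFrame built in the test above):

    // Instead of pinning exact id values, only assert that the duplicated rows
    // got the same id, which is what the correctness fix guarantees.
    val ids = distinctWithId.collect().map(_.getAs[Long]("id"))
    assert(ids.distinct.length === 1)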

@maropu (Member) Feb 13, 2019

Also, how about wrapping it with withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> Long.MaxValue.toString) as a safeguard?

Contributor Author

Thanks. Fixed both.

@SparkQA commented Feb 12, 2019

Test build #102262 has finished for PR 23731 at commit 567f8f6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val distinctWithId = baseTable.distinct().withColumn("id", monotonically_increasing_id())
.join(baseTable, "idx")
assert(distinctWithId.queryExecution.executedPlan.collectFirst {
case BroadcastHashJoinExec(_, _, _, _, _, HashAggregateExec(_, _, Seq(), _, _, _, _), _) =>
Member

How about this?

    assert(distinctWithId.queryExecution.executedPlan.collectFirst {
      case j: BroadcastHashJoinExec if j.left.isInstanceOf[HashAggregateExec] => true
    }.isDefined)

Do we need to strictly check the aggregate exprs? It seems baseTable.distinct() obviously has no aggregate expr?

@peter-toth (Contributor Author) Feb 13, 2019

I prefer avoiding isInstanceOf if possible, but changed it a bit.
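
Roughly along these lines (a sketch; the constructor arities follow the earlier snippet and may differ between Spark versions):

    assert(distinctWithId.queryExecution.executedPlan.collectFirst {
      // Match on the pattern instead of calling isInstanceOf explicitly.
      case j @ BroadcastHashJoinExec(_, _, _, _, _, _: HashAggregateExec, _) => j
    }.isDefined)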

@mgaido91 (Contributor) left a comment

the fix itself looks fine to me. Just some comments on the test; could you please also re-run the benchmark for the query that had a considerable perf issue earlier, in order to confirm we now have no regression? Thanks.

}
}

test("SPARK-26572: fix aggregate codegen result evaluation") {
Contributor

Since this is a problem with whole-stage codegen, what about moving this test to WholeStageCodegenSuite? And adding an assert that whole-stage codegen is actually used, i.e. the HashAggregate is a child of WholeStageCodegenExec?

Contributor Author

I'm fine with moving it to WholeStageCodegenSuite but the plan looks like:

*(3) Project [idx#4, id#6L]
+- *(3) BroadcastHashJoin [idx#4], [idx#9], Inner, BuildRight
   :- *(3) HashAggregate(keys=[idx#4], functions=[], output=[idx#4, id#6L])
   :  +- Exchange hashpartitioning(idx#4, 1)
   :     +- *(1) HashAggregate(keys=[idx#4], functions=[], output=[idx#4])
   :        +- *(1) Project [value#1 AS idx#4]
   :           +- LocalTableScan [value#1]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
      +- *(2) Project [value#1 AS idx#9]
         +- LocalTableScan [value#1]

so I guess you mean checking that WholeStageCodegenExec has a ProjectExec child that has a BroadcastHashJoinExec child?
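
Something like this, I guess (just a sketch of the check):

    assert(distinctWithId.queryExecution.executedPlan.collectFirst {
      // WholeStageCodegenExec -> ProjectExec -> BroadcastHashJoinExec, as in the plan above.
      case WholeStageCodegenExec(ProjectExec(_, _: BroadcastHashJoinExec)) => true
    }.isDefined)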

Contributor Author

Moved and added WholeStageCodegenExec check.

@SparkQA commented Feb 13, 2019

Test build #102288 has finished for PR 23731 at commit 5ae9add.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@peter-toth (Contributor Author)

Hmm, the failing UT doesn't seem to be related to the changes in this PR.

@mgaido91 (Contributor)

retest this please

@SparkQA commented Feb 13, 2019

Test build #102292 has finished for PR 23731 at commit 5ae9add.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@peter-toth (Contributor Author)

the fix itself looks fine to me. Just some comments on the test; could you please also re-run the benchmark for the query that had a considerable perf issue earlier, in order to confirm we now have no regression? Thanks.

@mgaido91, I checked that the PR now doesn't add a perf regression.

@mgaido91 (Contributor)

LGTM

@viirya (Member) commented Feb 14, 2019

Looks good, with a minor comment about variable naming.

@SparkQA commented Feb 14, 2019

Test build #102342 has finished for PR 23731 at commit af861d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan closed this in 2228ee5 Feb 14, 2019
cloud-fan pushed a commit that referenced this pull request Feb 14, 2019

Closes #23731 from peter-toth/SPARK-26572.
Authored-by: Peter Toth <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 2228ee5)

cloud-fan pushed a commit that referenced this pull request Feb 14, 2019
(cherry picked from commit 2228ee5)
@cloud-fan (Contributor)

thanks, merging to master/2.4/2.3!

@peter-toth (Contributor Author)

Thanks @cloud-fan @maropu @mgaido91 @rednaxelafx and @viirya for your review and help.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 25, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023