[SPARK-32018][SQL][3.0] UnsafeRow.setDecimal should set null with overflowed value #29125
Conversation
Thank you for pinging me, @cloud-fan.

Retest this please.

1 similar comment

Retest this please.

Jenkins seems not to be working on this. Oh, this is for 3.0, and GitHub Actions is for master only.
viirya left a comment
No need to fix Sum.scala?
Test build #125906 has finished for PR 29125 at commit

Test build #125916 has finished for PR 29125 at commit
That sum fix is in master only. I don't know if we can backport it as it breaks the streaming state store.

retest this please

Test build #125962 has finished for PR 29125 at commit

Test build #125965 has finished for PR 29125 at commit
+1, LGTM. Thank you, @cloud-fan.
Merged to branch-3.0. (Jenkins passed here #29125 (comment))
…rflowed value partially backport #29026 Closes #29125 from cloud-fan/backport. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
Could you make a backporting PR on branch-2.4 since SPARK-32018 is reported on 2.x too? This partial patch looks safe to have.
…rflowed value partially backport apache#29026 Closes apache#29125 from cloud-fan/backport. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…rflowed value backport #29125 Closes #29141 from cloud-fan/backport. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
@cloud-fan, I noticed this backport only now. This change is more far-reaching in its impact: callers of UnsafeRow.getDecimal that would previously have thrown an exception will now return null. For example, a caller like the aggregate sum will need changes to account for this. Cases where sum would previously throw an error on overflow will now return incorrect results. The new tests added for sum overflow cases in DataFrameSuite on master can be used to reproduce this. Since this PR is closed, I will add a comment to the JIRA as well.
@skambha It's indeed a bug that we can write an overflowed decimal to UnsafeRow but can't read it.
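For readers following along, here is a minimal, self-contained Scala sketch of the semantics being debated. This is not Spark's actual UnsafeRow code; the object and field names are hypothetical. The point is that a write which overflows the declared precision should record null, so a later read returns null instead of failing.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

object DecimalFieldSketch {
  // Hypothetical single "cell"; None stands in for UnsafeRow's null bit being set.
  private var cell: Option[JBigDecimal] = None

  // Intended setDecimal semantics: if the value cannot be represented with the
  // declared precision/scale, store null instead of an overflowed value.
  def setDecimal(value: JBigDecimal, precision: Int, scale: Int): Unit = {
    val adjusted = value.setScale(scale, RoundingMode.HALF_UP)
    cell = if (adjusted.precision() > precision) None else Some(adjusted)
  }

  def getDecimal: Option[JBigDecimal] = cell

  def main(args: Array[String]): Unit = {
    // 12345.68 needs 7 digits, which overflows precision 5, so the write becomes null.
    setDecimal(new JBigDecimal("12345.678"), precision = 5, scale = 2)
    println(getDecimal) // None: the read no longer has to fail on an overflowed value
  }
}
```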
@cloud-fan, the test cases in DataFrameSuite show these scenarios. Here is an example taken from there that I tried on Spark 3.0.1 with and without this change, and you can see the incorrect-result behavior (a sketch of such a repro follows below). This backport by itself causes more scenarios to return incorrect results to the user.

WITHOUT THIS CHANGE: the same test throws an error in both cases, ANSI enabled or not; in particular, the ANSI-enabled scenario also throws an error.
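Since the example output itself was not preserved in this thread, here is a hedged sketch of the kind of sum-overflow repro the DataFrameSuite tests exercise, assuming a local SparkSession and a decimal(38,18) column; the object name, column name, and values are illustrative, not copied from the suite.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object SumOverflowRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("sum-overflow-repro")
      .getOrCreate()
    import spark.implicits._

    // Two values near the decimal(38,18) maximum; their sum cannot be represented.
    val df = Seq(BigDecimal("9" * 20 + ".123"), BigDecimal("9" * 20 + ".123")).toDF("d")

    // Optionally: spark.conf.set("spark.sql.ansi.enabled", "true")
    df.agg(sum($"d")).show()
    // Depending on the branch and the ANSI setting, this either throws an overflow
    // error or silently returns null / a wrong value, which is the behavior
    // difference discussed in this thread.

    spark.stop()
  }
}
```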
It's "one bug hides another bug". I don't think the right choice is to leave the bug there. If we think the decimal sum overflow is serious enough, we should consider backporting the actual fix and evaluating the streaming backward-compatibility impact.

The Spark website states: “Note that, data correctness/data loss bugs are very serious. Make sure the corresponding bug report JIRA ticket is labeled as correctness or data-loss. If the bug report doesn’t get enough attention, please send an email to [email protected], to draw more attentions.” Incorrect results/data correctness issues are very serious. As already discussed, the UnsafeRow change has far-reaching impact and unsafe side effects. In my opinion, we should not backport just this change to the 3.0.x and 2.4.x lines, especially in a point release, and expose wrong results to users for an operation as common as sum. So, my vote would be to not have this UnsafeRow-only change in v3.0.x and v2.x.x.
@skambha you will still hit the sum bug when you disable whole-stage codegen (or when Spark falls back from it because the generated code exceeds 64KB), right? We are not introducing a new correctness bug; it's an existing bug, and the backport makes it more visible. We've added a mechanism in the master branch to check the streaming state store backward compatibility. If we want to backport the actual fix, we need to backport this mechanism as well, and I think that's too many things to backport. How about this: we force ANSI behavior for decimal sum, so that the behavior is the same without fixing the UnsafeRow bug? It's not an ideal fix but should be safer to backport. @skambha what do you think? Can you help to do it?
I'm not sure if I understand correctly, so can you clarify? The reason I ask is: currently, the v3.0 Sum has an ANSI mode in the evaluateExpression, and forcing that to be true will not give us much. We will still run into the problems I mentioned a few comments earlier.
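As a neutral, self-contained illustration of the two overflow policies being weighed in this exchange (this is not Spark's actual CheckOverflow or Sum code; the helper name and thresholds are made up): null-on-overflow is the non-ANSI behavior, and a hard failure is the ANSI-style behavior being proposed.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}
import scala.util.Try

object OverflowPolicySketch {
  // Hypothetical helper mirroring a CheckOverflow-style decision point.
  def checkOverflow(v: JBigDecimal, precision: Int, scale: Int,
                    nullOnOverflow: Boolean): Option[JBigDecimal] = {
    val adjusted = v.setScale(scale, RoundingMode.HALF_UP)
    if (adjusted.precision() <= precision) Some(adjusted)
    else if (nullOnOverflow) None // non-ANSI: quietly produce null
    else throw new ArithmeticException(
      s"$v cannot be represented as decimal($precision, $scale)")
  }

  def main(args: Array[String]): Unit = {
    val big = new JBigDecimal("9" * 21) // needs 21 integer digits
    println(checkOverflow(big, precision = 38, scale = 18, nullOnOverflow = true))      // None
    println(Try(checkOverflow(big, precision = 20, scale = 0, nullOnOverflow = false))) // Failure(ArithmeticException)
  }
}
```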
I don't agree to revert the UnsafeRow bug fix. As I said, I agree that the sum decimal bug becomes more visible with the backport.
IIUC, the solutions you mention were also discussed earlier and were not accepted by you. If you do not want to revert this backport, then I hope you agree it is critical to fix it so users do not run into this incorrectness issue. Please feel free to go ahead with the option you prefer. I have expressed the issues, and I will summarize them below and also put them in the JIRA. The important issue is that we should not return incorrect results. In general, it is not good practice to backport a change to a stable branch and cause more queries to return incorrect results. Just to reiterate:
OK, let me clarify a few things:

I'll ask someone to implement the ANSI behavior for decimal sum in 3.0 and 2.4, so that it fails instead of returning wrong results.
cc @ScrapCodes since he is a release manager for Apache Spark 2.4.7. |
…erflow of sum aggregation

### What changes were proposed in this pull request?
This is a followup of #29125. In branch 3.0:
1. For hash aggregation, before #29125 there would be a runtime exception on decimal overflow of sum aggregation; after #29125, there could be a wrong result.
2. For sort aggregation, with or without #29125, there could be a wrong result on decimal overflow.

While in the master branch (the future 3.1 release), the problem doesn't exist, since in #27627 there is a flag marking whether overflow happened in the aggregation buffer. However, the aggregation buffer is written into streaming checkpoints, so we can't change the aggregation buffer to resolve the issue. As there is no easy solution for returning null/throwing an exception depending on `spark.sql.ansi.enabled` on overflow in branch 3.0, we have to make a choice here: always throw an exception on decimal value overflow of sum aggregation.

### Why are the changes needed?
Avoid returning a wrong result in decimal value sum aggregation.

### Does this PR introduce _any_ user-facing change?
Yes, there is always an exception on decimal value overflow of sum aggregation, instead of a possible wrong result.

### How was this patch tested?
Unit test case

Closes #29404 from gengliangwang/fixSum.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
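To make the trade-off above concrete, here is a hedged, self-contained sketch of the master-branch idea the message refers to (#27627): carry an explicit overflow flag alongside the running sum in the aggregation buffer, so the final evaluation can throw or return null instead of producing a wrong value. The names and the simplified buffer layout are assumptions for illustration, not Spark's actual buffer.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

object SumBufferSketch {
  // Hypothetical buffer: a running sum plus an explicit overflow flag.
  final case class Buffer(sum: JBigDecimal, overflowed: Boolean)

  private val Precision = 38
  private val Scale = 18

  private def fits(v: JBigDecimal): Boolean =
    v.setScale(Scale, RoundingMode.HALF_UP).precision() <= Precision

  // Accumulate one input; once the running sum overflows, stay in the overflowed state.
  def update(buf: Buffer, input: JBigDecimal): Buffer =
    if (buf.overflowed) buf
    else {
      val next = buf.sum.add(input)
      if (fits(next)) buf.copy(sum = next) else buf.copy(overflowed = true)
    }

  // Final evaluation: throw (ANSI-style) or return null (non-ANSI) on overflow.
  def evaluate(buf: Buffer, ansiEnabled: Boolean): Option[JBigDecimal] =
    if (!buf.overflowed) Some(buf.sum)
    else if (ansiEnabled) throw new ArithmeticException("decimal sum overflow")
    else None

  def main(args: Array[String]): Unit = {
    val inputs = Seq.fill(2)(new JBigDecimal("9" * 20 + ".123"))
    val result = inputs.foldLeft(Buffer(JBigDecimal.ZERO, overflowed = false))(update)
    println(evaluate(result, ansiEnabled = false)) // None, rather than a silently wrong sum
  }
}
```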
### What changes were proposed in this pull request?
Revert SPARK-32018 related changes in branch 3.0: #29125 and #29404

### Why are the changes needed?
#29404 is made to fix the correctness regression introduced by #29125. However, the behavior of decimal overflow is strange in non-ANSI mode:
1. From 3.0.0 to 3.0.1: decimal overflow will throw exceptions instead of returning null on decimal overflow.
2. From 3.0.1 to 3.1.0: decimal overflow will return null instead of throwing exceptions.

So, this PR proposes to revert both #29404 and #29125, so that Spark will return null on decimal overflow in both Spark 3.0.0 and Spark 3.0.1.

### Does this PR introduce _any_ user-facing change?
Yes, Spark will return null on decimal overflow in Spark 3.0.1.

### How was this patch tested?
Unit tests

Closes #29450 from gengliangwang/revertDecimalOverflow.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
partially backport #29026