[SPARK-32018][SQL][3.0] UnsafeRow.setDecimal should set null with overflowed value #29125
Conversation
Thank you for pinging me, @cloud-fan.

Retest this please.

1 similar comment

Retest this please.

Jenkins seems not to be working on this. Oh, this is for 3.0, and GitHub Actions is for master only.
viirya left a comment
No need to fix Sum.scala?
Test build #125906 has finished for PR 29125 at commit

Test build #125916 has finished for PR 29125 at commit
That sum fix is in master only. I don't know if we can backport it as it breaks the streaming state store.

retest this please

Test build #125962 has finished for PR 29125 at commit

Test build #125965 has finished for PR 29125 at commit
+1, LGTM. Thank you, @cloud-fan.
Merged to branch-3.0. (Jenkins passed here #29125 (comment))
…rflowed value partially backport #29026 Closes #29125 from cloud-fan/backport. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
Could you make a backporting PR on branch-2.4 since SPARK-32018 is reported on 2.x too? This partial patch looks safe to have.
…rflowed value partially backport apache#29026 Closes apache#29125 from cloud-fan/backport. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…rflowed value backport #29125 Closes #29141 from cloud-fan/backport. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
@cloud-fan, I noticed this backport only now. This change is more far-reaching in its impact: callers of UnsafeRow.getDecimal that would previously have thrown an exception will now return null. For example, a caller like the aggregate sum will need changes to account for this. Cases where sum would previously throw an error on overflow will now return incorrect results. The new tests added for sum overflow cases in DataFrameSuite on master can be used to reproduce this. Since this PR is closed, I will add a comment to the JIRA as well.
@skambha It's indeed a bug that we can write an overflowed decimal to UnsafeRow but can't read it.
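For readers following along, here is a minimal, self-contained Scala sketch of the semantics being debated. This is not Spark's actual UnsafeRow code; the object and field names are hypothetical. The point is that a write which overflows the declared precision should record null, so a later read returns null instead of failing.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

object DecimalFieldSketch {
  // Hypothetical single "cell"; None stands in for UnsafeRow's null bit being set.
  private var cell: Option[JBigDecimal] = None

  // Intended setDecimal semantics: if the value cannot be represented with the
  // declared precision/scale, store null instead of an overflowed value.
  def setDecimal(value: JBigDecimal, precision: Int, scale: Int): Unit = {
    val adjusted = value.setScale(scale, RoundingMode.HALF_UP)
    cell = if (adjusted.precision() > precision) None else Some(adjusted)
  }

  def getDecimal: Option[JBigDecimal] = cell

  def main(args: Array[String]): Unit = {
    // 12345.68 needs 7 digits, which overflows precision 5, so the write becomes null.
    setDecimal(new JBigDecimal("12345.678"), precision = 5, scale = 2)
    println(getDecimal) // None: the read no longer has to fail on an overflowed value
  }
}
```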
@cloud-fan, the test cases in DataFrameSuite show these scenarios. Here is an example taken from there that I tried on Spark 3.0.1 with and without this change, and you can see the incorrect-result behavior (a sketch of such a repro follows below). This backport by itself causes more scenarios to return incorrect results to the user.

WITHOUT THIS CHANGE: the same test throws an error in both cases, ANSI enabled or not; in particular, the ANSI-enabled scenario also throws an error.
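Since the example output itself was not preserved in this thread, here is a hedged sketch of the kind of sum-overflow repro the DataFrameSuite tests exercise, assuming a local SparkSession and a decimal(38,18) column; the object name, column name, and values are illustrative, not copied from the suite.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object SumOverflowRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("sum-overflow-repro")
      .getOrCreate()
    import spark.implicits._

    // Two values near the decimal(38,18) maximum; their sum cannot be represented.
    val df = Seq(BigDecimal("9" * 20 + ".123"), BigDecimal("9" * 20 + ".123")).toDF("d")

    // Optionally: spark.conf.set("spark.sql.ansi.enabled", "true")
    df.agg(sum($"d")).show()
    // Depending on the branch and the ANSI setting, this either throws an overflow
    // error or silently returns null / a wrong value, which is the behavior
    // difference discussed in this thread.

    spark.stop()
  }
}
```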
It's "one bug hides another bug". I don't think the right choice is to leave the bug there. If we think the decimal sum overflow is serious enough, we should consider backporting the actual fix and evaluating the streaming backward-compatibility impact.

The Spark website states: “Note that, data correctness/data loss bugs are very serious. Make sure the corresponding bug report JIRA ticket is labeled as correctness or data-loss. If the bug report doesn’t get enough attention, please send an email to [email protected], to draw more attentions.” Incorrect results/data correctness issues are very serious. As already discussed, the UnsafeRow change has far-reaching impact and unsafe side effects. In my opinion, we should not backport just this change to the 3.0.x and 2.4.x lines, especially in a point release, and expose wrong results to users for an operation as common as sum. So, my vote would be to not have this UnsafeRow-only change in v3.0.x and v2.x.x.
@skambha you will still hit the sum bug when you disable whole-stage codegen (or when Spark falls back from it because the generated code exceeds 64KB), right? We are not introducing a new correctness bug; it's an existing bug, and the backport makes it more visible. We've added a mechanism in the master branch to check the streaming state store backward compatibility. If we want to backport the actual fix, we need to backport this mechanism as well, and I think that's too many things to backport. How about this: we force ANSI behavior for decimal sum, so that the behavior is the same without fixing the UnsafeRow bug? It's not an ideal fix but should be safer to backport. @skambha what do you think? Can you help to do it?
I'm not sure if I understand correctly, so can you clarify? The reason I ask is: currently, the v3.0 Sum has an ANSI mode in the evaluateExpression, and forcing that to be true will not give us much. We will still run into the problems I mentioned a few comments earlier.
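As a neutral, self-contained illustration of the two overflow policies being weighed in this exchange (this is not Spark's actual CheckOverflow or Sum code; the helper name and thresholds are made up): null-on-overflow is the non-ANSI behavior, and a hard failure is the ANSI-style behavior being proposed.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}
import scala.util.Try

object OverflowPolicySketch {
  // Hypothetical helper mirroring a CheckOverflow-style decision point.
  def checkOverflow(v: JBigDecimal, precision: Int, scale: Int,
                    nullOnOverflow: Boolean): Option[JBigDecimal] = {
    val adjusted = v.setScale(scale, RoundingMode.HALF_UP)
    if (adjusted.precision() <= precision) Some(adjusted)
    else if (nullOnOverflow) None // non-ANSI: quietly produce null
    else throw new ArithmeticException(
      s"$v cannot be represented as decimal($precision, $scale)")
  }

  def main(args: Array[String]): Unit = {
    val big = new JBigDecimal("9" * 21) // needs 21 integer digits
    println(checkOverflow(big, precision = 38, scale = 18, nullOnOverflow = true))      // None
    println(Try(checkOverflow(big, precision = 20, scale = 0, nullOnOverflow = false))) // Failure(ArithmeticException)
  }
}
```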
I don't agree to revert the UnsafeRow bug fix. As I said, I agree that the sum decimal bug becomes more visible with the backport.
IIUC, the solutions you mention were also discussed earlier and were not accepted by you. If you do not want to revert this backport, then I hope you agree it is critical to fix it so users do not run into this incorrectness issue. Please feel free to go ahead with the option you prefer. I have expressed the issues, and I will summarize them below and also put them in the JIRA. The important issue is that we should not return incorrect results. In general, it is not good practice to backport a change to a stable branch and cause more queries to return incorrect results. Just to reiterate:
OK, let me clarify a few things:

I'll ask someone to implement the ANSI behavior for decimal sum in 3.0 and 2.4, so that it fails instead of returning wrong results.
cc @ScrapCodes since he is a release manager for Apache Spark 2.4.7. |
…erflow of sum aggregation

### What changes were proposed in this pull request?
This is a followup of #29125. In branch 3.0:
1. For hash aggregation, before #29125 there would be a runtime exception on decimal overflow of sum aggregation; after #29125, there could be a wrong result.
2. For sort aggregation, with or without #29125, there could be a wrong result on decimal overflow.

While in the master branch (the future 3.1 release), the problem doesn't exist, since in #27627 there is a flag marking whether overflow happened in the aggregation buffer. However, the aggregation buffer is written into streaming checkpoints, so we can't change the aggregation buffer to resolve the issue. As there is no easy solution for returning null/throwing an exception depending on `spark.sql.ansi.enabled` on overflow in branch 3.0, we have to make a choice here: always throw an exception on decimal value overflow of sum aggregation.

### Why are the changes needed?
Avoid returning a wrong result in decimal value sum aggregation.

### Does this PR introduce _any_ user-facing change?
Yes, there is always an exception on decimal value overflow of sum aggregation, instead of a possible wrong result.

### How was this patch tested?
Unit test case

Closes #29404 from gengliangwang/fixSum.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
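To make the trade-off above concrete, here is a hedged, self-contained sketch of the master-branch idea the message refers to (#27627): carry an explicit overflow flag alongside the running sum in the aggregation buffer, so the final evaluation can throw or return null instead of producing a wrong value. The names and the simplified buffer layout are assumptions for illustration, not Spark's actual buffer.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

object SumBufferSketch {
  // Hypothetical buffer: a running sum plus an explicit overflow flag.
  final case class Buffer(sum: JBigDecimal, overflowed: Boolean)

  private val Precision = 38
  private val Scale = 18

  private def fits(v: JBigDecimal): Boolean =
    v.setScale(Scale, RoundingMode.HALF_UP).precision() <= Precision

  // Accumulate one input; once the running sum overflows, stay in the overflowed state.
  def update(buf: Buffer, input: JBigDecimal): Buffer =
    if (buf.overflowed) buf
    else {
      val next = buf.sum.add(input)
      if (fits(next)) buf.copy(sum = next) else buf.copy(overflowed = true)
    }

  // Final evaluation: throw (ANSI-style) or return null (non-ANSI) on overflow.
  def evaluate(buf: Buffer, ansiEnabled: Boolean): Option[JBigDecimal] =
    if (!buf.overflowed) Some(buf.sum)
    else if (ansiEnabled) throw new ArithmeticException("decimal sum overflow")
    else None

  def main(args: Array[String]): Unit = {
    val inputs = Seq.fill(2)(new JBigDecimal("9" * 20 + ".123"))
    val result = inputs.foldLeft(Buffer(JBigDecimal.ZERO, overflowed = false))(update)
    println(evaluate(result, ansiEnabled = false)) // None, rather than a silently wrong sum
  }
}
```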
### What changes were proposed in this pull request?
Revert SPARK-32018 related changes in branch 3.0: #29125 and #29404

### Why are the changes needed?
#29404 is made to fix the correctness regression introduced by #29125. However, the behavior of decimal overflow is strange in non-ANSI mode:
1. From 3.0.0 to 3.0.1: decimal overflow will throw exceptions instead of returning null on decimal overflow.
2. From 3.0.1 to 3.1.0: decimal overflow will return null instead of throwing exceptions.

So, this PR proposes to revert both #29404 and #29125, so that Spark will return null on decimal overflow in both Spark 3.0.0 and Spark 3.0.1.

### Does this PR introduce _any_ user-facing change?
Yes, Spark will return null on decimal overflow in Spark 3.0.1.

### How was this patch tested?
Unit tests

Closes #29450 from gengliangwang/revertDecimalOverflow.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
partially backport #29026