Conversation

@cloud-fan
Contributor

@cloud-fan cloud-fan commented Jul 7, 2020

What changes were proposed in this pull request?

This is a followup of #27627 to fix the remaining issues. There are 2 issues fixed in this PR:

  1. UnsafeRow.setDecimal can set an overflowed decimal, which causes an error when reading it. The expected behavior is to return null.
  2. The update/merge expression for decimal type in Sum is wrong. We shouldn't turn the sum value back to 0 after it becomes null due to overflow. This issue was hidden because:
    2.1 for hash aggregate, the buffer is unsafe row. Due to the first bug, we fail when overflow happens, so there is no chance to mistakenly turn null back to 0.
    2.2 for sort-based aggregate, the buffer is generic row. The decimal can overflow (the Decimal class has unlimited precision) and we don't have the null problem.

If we only fix the first bug, then the second bug is exposed and test fails. If we only fix the second bug, there is no way to test it. This PR fixes these 2 bugs together.
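The fixed update semantics for the decimal sum buffer can be illustrated with a minimal, self-contained sketch (plain BigDecimal/Optional rather than Spark's Decimal and Catalyst expressions; all names here are illustrative, not Spark's API): once the buffer becomes null due to overflow, it must stay null instead of being reset to 0.

```java
import java.math.BigDecimal;
import java.util.Optional;

// Illustrative sketch of the fixed Sum update semantics for decimals.
// An empty Optional plays the role of a null buffer slot.
public class DecimalSumSketch {
    static final int MAX_PRECISION = 38;

    // Empty result means the running sum no longer fits the precision.
    static Optional<BigDecimal> fit(BigDecimal d) {
        return d.precision() <= MAX_PRECISION ? Optional.of(d) : Optional.empty();
    }

    // An empty buffer means the sum already overflowed: it must stay
    // empty forever, which flatMap guarantees here. The bug was turning
    // this state back into 0.
    static Optional<BigDecimal> update(Optional<BigDecimal> buffer, BigDecimal input) {
        return buffer.flatMap(sum -> fit(sum.add(input)));
    }
}
```

Modeling the buffer as an Optional makes the key property explicit: update is absorbing on the null state, which is exactly the invariant the broken update/merge expressions violated.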

Why are the changes needed?

Fix issues during decimal sum when overflow happens

Does this PR introduce any user-facing change?

Yes. Now decimal sum can return null correctly for overflow under non-ansi mode.

How was this patch tested?

A new test and an updated test.

@cloud-fan
Contributor Author

cc @skambha @rednaxelafx @viirya

```diff
   Platform.putLong(baseObject, baseOffset + cursor + 8, 0L);

-  if (value == null) {
+  if (value == null || !value.changePrecision(precision, value.scale())) {
```
Contributor Author

Thanks to @allisonwang-db for catching this bug!
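The one-line change in the writer can be paraphrased with a small sketch (illustrative names, not Spark's actual API; the real Decimal.changePrecision also rounds to the target scale, which is elided here): a value that cannot be represented at the target precision is treated the same as null when writing.

```java
import java.math.BigDecimal;
import java.util.Optional;

// Simplified sketch of the fixed write path: overflow is folded into
// the existing null handling instead of storing a bad value.
public class DecimalWriteSketch {
    // Mimics a changePrecision-style check: true iff the value fits.
    static boolean fitsPrecision(BigDecimal d, int precision) {
        return d.precision() <= precision;
    }

    // The value to actually store: empty (null) for null input OR overflow.
    static Optional<BigDecimal> toStore(Optional<BigDecimal> value, int precision) {
        return value.filter(d -> fitsPrecision(d, precision));
    }
}
```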

Member

This looks valid for branch-3.0 and branch-2.4. Do you think we need to backport to branch-2.4?

Contributor Author

Yes, we should.

@dongjoon-hyun
Member

Hi, @cloud-fan . SPARK-28067 is merged for 3.1.0 only. This also aims 3.1.0 only?

@SparkQA

SparkQA commented Jul 7, 2020

Test build #125235 has finished for PR 29026 at commit 3717fc6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Jul 7, 2020

retest this please

@viirya viirya left a comment

Good catch!

@HyukjinKwon HyukjinKwon changed the title [SPARK-28067][SPARK-32018] fix decimal overflow issues [SPARK-28067][SPARK-32018] Fix decimal overflow issues Jul 8, 2020
Contributor

@rednaxelafx rednaxelafx left a comment

LGTM, thanks for fixing this!

@skambha
Contributor

skambha commented Jul 8, 2020

@cloud-fan, thanks for looking into covering some more cases. If we change the UnsafeRow.setDecimal overflow logic, I agree the implementation of the sum needs to change. This comes back to the discussion we had earlier (skambha#1): the sum overflow logic depends heavily on the underlying overflow handling in UnsafeRowWriter, UnsafeRow, etc.

I have a few comments related to the fix in UnsafeRow.

  1. So now UnsafeRow.setDecimal silently stores null for an overflowed decimal value, but getDecimal throws an error. There is an inconsistency here. Why is that OK? Also, they don't honor the ANSI mode.

  2. In this scenario, we are now more aggressive in checking for overflow: we have moved the overflow check further down to return null. Earlier, I think the decision was not to do the check per row, but now don't we end up doing that in some cases?

```scala
assert(unsafeRow.getDecimal(0, 38, 18) === d1)
val d2 = (d1 * Decimal(10)).toPrecision(39, 18)
unsafeRow.setDecimal(0, d2, 38)
assert(unsafeRow.getDecimal(0, 38, 18) === null)
```
Contributor

What happens with ansi mode true?

Contributor Author

UnsafeRow is a low-level entity and doesn't respect the ANSI flag.

@cloud-fan
Contributor Author

So now UnsafeRow.setDecimal silently stores null for an overflowed decimal value, but getDecimal throws an error. There is an inconsistency here. Why is that OK?

Correction: UnsafeRow.setDecimal returns void. This PR fixes UnsafeRow.setDecimal so that getDecimal can return null if the value is overflowed.

Earlier, I think the decision was not to do the check per row, but now don't we end up doing that in some cases

Under ANSI mode, you have to check overflow per row, as is done by the Add expression. This is not changed in this PR.
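The difference between the two modes can be sketched as follows (illustrative only, not Spark's actual CheckOverflow expression or its signature): ANSI mode throws on a per-row overflow, while non-ANSI mode degrades the value to null.

```java
import java.math.BigDecimal;
import java.util.Optional;

// Illustrative per-row overflow check, mirroring the described behavior:
// ANSI mode throws, non-ANSI mode returns null (empty).
public class OverflowCheckSketch {
    static Optional<BigDecimal> check(BigDecimal d, int precision, boolean ansi) {
        if (d.precision() <= precision) {
            return Optional.of(d);
        }
        if (ansi) {
            throw new ArithmeticException(
                "Decimal precision " + d.precision() + " exceeds max precision " + precision);
        }
        return Optional.empty();  // non-ANSI: overflow becomes null
    }
}
```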

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Jul 8, 2020

Test build #125254 has finished for PR 29026 at commit 3717fc6.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Jul 8, 2020

Test build #125301 has finished for PR 29026 at commit 3717fc6.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 8, 2020

Test build #125302 has finished for PR 29026 at commit 3717fc6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 8, 2020

Test build #125292 has finished for PR 29026 at commit 3717fc6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Jul 8, 2020

Test build #125315 has started for PR 29026 at commit 3717fc6.

@dongjoon-hyun
Member

It's a build time out during PySpark test.

Build timed out (after 500 minutes). Marking the build as aborted.
Build was aborted

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Jul 8, 2020

Test build #125383 has finished for PR 29026 at commit 3717fc6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jul 9, 2020

retest this please

@SparkQA

SparkQA commented Jul 9, 2020

Test build #125422 has finished for PR 29026 at commit 3717fc6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

dongjoon-hyun pushed a commit that referenced this pull request Jul 16, 2020
…rflowed value

partially backport #29026

Closes #29125 from cloud-fan/backport.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
cloud-fan added a commit to cloud-fan/spark that referenced this pull request Jul 17, 2020
…rflowed value

partially backport apache#29026

Closes apache#29125 from cloud-fan/backport.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
gengliangwang added a commit that referenced this pull request Aug 19, 2020
…erflow in sum aggregation

### What changes were proposed in this pull request?

Add migration guide for decimal value overflow behavior in sum aggregation, introduced in #29026

### Why are the changes needed?

Add migration guide for the behavior changes from 3.0 to 3.1.
See also: #29450 (comment)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Build docs and preview:
![image](https://user-images.githubusercontent.com/1097932/90589256-8b7e3380-e192-11ea-8ff1-05a447c20722.png)

Closes #29458 from gengliangwang/migrationGuideDecimalOverflow.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
@anuragmantri

@cloud-fan @dongjoon-hyun - This PR (at least part of it) seems relevant to branch-2.4. Has there been a backport PR for this? If not, I would like to start a backport PR.

@maropu
Member

maropu commented Oct 7, 2020

@anuragmantri

@maropu - Thanks for pointing me to the discussion thread. Sorry I got lost a bit with so many threads. I will park this idea for now since SPARK-28067 is not in any release branches yet.

@cloud-fan
Contributor Author

cloud-fan commented Oct 8, 2020

There was an attempt to backport this fix, but we decided not to, as it breaks streaming backward compatibility. We can't backport part of the fix either; see #29448 (comment).

jlfsdtc added a commit to Kyligence/spark that referenced this pull request Jul 23, 2021
* KE-24858
[SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow

* [SPARK-28067][SPARK-32018] Fix decimal overflow issues

### What changes were proposed in this pull request?

This is a followup of apache#27627 to fix the remaining issues. There are 2 issues fixed in this PR:
1. `UnsafeRow.setDecimal` can set an overflowed decimal and causes an error when reading it. The expected behavior is to return null.
2. The update/merge expression for decimal type in `Sum` is wrong. We shouldn't turn the `sum` value back to 0 after it becomes null due to overflow. This issue was hidden because:
2.1 for hash aggregate, the buffer is unsafe row. Due to the first bug, we fail when overflow happens, so there is no chance to mistakenly turn null back to 0.
2.2 for sort-based aggregate, the buffer is generic row. The decimal can overflow (the Decimal class has unlimited precision) and we don't have the null problem.

If we only fix the first bug, then the second bug is exposed and test fails. If we only fix the second bug, there is no way to test it. This PR fixes these 2 bugs together.

### Why are the changes needed?

Fix issues during decimal sum when overflow happens

### Does this PR introduce _any_ user-facing change?

Yes. Now decimal sum can return null correctly for overflow under non-ansi mode.

### How was this patch tested?

new test and updated test

Closes apache#29026 from cloud-fan/decimal.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

* KE-24858
fix error: java.lang.IllegalArgumentException: Can not interpolate java.lang.Boolean into code block.

* KE-24858
fix ci error

* KE-24858 update pom version

Co-authored-by: Sunitha Kambhampati <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: longfei.jiang <[email protected]>