
Conversation

@cloud-fan

No description provided.

if (!ansiEnabled) {
  try {
    checkAnswer(df, expectedAnswer)
  } catch {
@skambha (Owner)

  • This has changed the tests to say it is OK to throw an exception even when ansiEnabled is false. Our ansiEnabled flag then isn't doing what it is supposed to do, right? That is a bug.
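
For reference, a minimal sketch (not test code from this PR; the session setup and column name are made up for illustration) of what the ansiEnabled contract is supposed to mean for a decimal sum that overflows:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types.DecimalType

val spark = SparkSession.builder().master("local[2]").getOrCreate()
import spark.implicits._

// Two values at the top of decimal(38, 0); their sum cannot be represented in that type.
val df = Seq("9" * 38, "9" * 38).toDF("s")
  .select($"s".cast(DecimalType(38, 0)).as("d"))

spark.conf.set("spark.sql.ansi.enabled", "false")
df.agg(sum($"d")).show()   // non-ANSI mode: the overflowed sum should come back as null, not fail

spark.conf.set("spark.sql.ansi.enabled", "true")
df.agg(sum($"d")).show()   // ANSI mode: the overflow should raise an ArithmeticException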

@cloud-fan (Author)

This is a bug in the unsafe row writer, and I think we should fix it. According to apache#27627 (comment), this bug has been there for a long time.

This is a less critical bug, as Spark fails instead of returning a wrong result.

@skambha (Owner)

Just to clarify, the issue we have is that the UnsafeRow accepts an overflowed decimal value but throws an exception when fetching it.

@skambha (Owner) commented May 22, 2020

In this PR, the tests have been changed to say it is OK to throw an exception even when ansiEnabled is false.

I have a few concerns about this simplification. This does not cover all the scenarios that we discussed in apache#27627, right? Basically, here we will error out on overflow in some cases even if ansiEnabled is false. That is a bug. Also, when whole-stage codegen is disabled and ansiEnabled is false, we should get null as the result, but that won't be the case now.

I also have some reservations about how safe it is to make the logic in Sum depend on the underlying implementation detail of UnsafeRowWriter writing out null for an overflowed value. What if that were to change in the future? Is the expectation that the unit tests would uncover the issue?

@cloud-fan (Author)

Checking overflow for each input row is not acceptable. We must allow temporary overflow in the middle of a calculation, to support cases like large_positive_decimal + large_positive_decimal + large_negative_decimal, where we can still get a valid result. That said, we can't merge apache#27627 as it is, as it can break queries that return a correct result.

On the other hand, this patch fixes most of the problems, but we still suffer from the unsafe row writer bug. It's not a big deal, and we can fix it later. See #1 (comment).
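
To make the large_positive + large_positive + large_negative case concrete, here is a hypothetical example (the values, column name, and session setup are illustrative, not from the PR): the intermediate sum overflows decimal(38, 0), yet the final result is representable, which is exactly what a per-row overflow check would reject.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types.DecimalType

val spark = SparkSession.builder().master("local[2]").getOrCreate()
import spark.implicits._
spark.conf.set("spark.sql.ansi.enabled", "false")    // non-ANSI mode for this example

val big = "9" * 38                                   // the largest value of decimal(38, 0)
val df = Seq(big, big, "-" + big).toDF("s")
  .select($"s".cast(DecimalType(38, 0)).as("d"))

// big + big overflows decimal(38, 0) part-way through, but big + big - big == big fits,
// so the sum should still produce a valid value. coalesce(1) keeps the temporary
// overflow inside a single aggregation buffer.
df.coalesce(1).agg(sum($"d")).show(truncate = false)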

@skambha (Owner) commented May 26, 2020

Checking overflow for each input row is not acceptable.
We must allow temporary overflow in the middle of calculation, to support cases like large_positive_decimal + large_positive_decimal + large_negative_decimal, where we can still get a valid result.

I see. This means that sum is not deterministic (when there is an intermediate overflow value), since the sum output will depend on the values in each partition.

For example, suppose we have the following scenario:
a1+a2 overflows, but with a3 added it no longer overflows. Neither a4 nor a5 overflows.

Partition 1 -> a1, a2, a3 (Note a1+a2 overflows, but a1+a2+a3 doesn't)
Partition 2 -> a4, a5
Sum value = a1+a2+a3+a4+a5 - let's say this doesn't overflow. This is the result.

But if the same values were split into different partitions, for example:
Partition 1 -> a1,a2
Partition 2 -> a3, a4, a5

In this case, we would get null from a1+a2, and then the end result would be null.

So we have the same values, but the output of sum is different.
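
A plain-Scala sketch of that partition dependence (the helper names and the null-on-overflow merge model are illustrative, not Spark source): each partition emits a partial sum that nulls out on overflow, the final merge propagates the null, and the same five values give different answers under the two layouts above.

// Cap of decimal(38, 0); a partial result beyond this is treated as overflow.
val MaxDec = BigDecimal("9" * 38)

// One partition's partial aggregate: temporary overflow inside the partition is fine,
// but an overflowed partial result is emitted as null (None) in non-ANSI mode.
def partialSum(part: Seq[BigDecimal]): Option[BigDecimal] = {
  val s = part.sum
  if (s.abs > MaxDec) None else Some(s)
}

// Final merge: any null partial result makes the whole sum null.
def totalSum(partitions: Seq[Seq[BigDecimal]]): Option[BigDecimal] = {
  val partials = partitions.map(partialSum)
  if (partials.contains(None)) None
  else {
    val s = partials.flatten.sum
    if (s.abs > MaxDec) None else Some(s)
  }
}

val a1 = MaxDec; val a2 = MaxDec; val a3 = -MaxDec   // a1 + a2 overflows, a1 + a2 + a3 does not
val a4 = BigDecimal(-1); val a5 = BigDecimal(-2)     // small values that never overflow

totalSum(Seq(Seq(a1, a2, a3), Seq(a4, a5)))          // Some(MaxDec - 3): a valid result
totalSum(Seq(Seq(a1, a2), Seq(a3, a4, a5)))          // None: the a1 + a2 partition nulled out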

--
Although not ideal, this does get us closer, and it fixes the incorrect-results scenario for some code paths. The other scenarios throw exceptions, which is less severe than incorrect results, so I am fine with these changes. This will still have the non-deterministic behavior I mentioned above for some scenarios.

@cloud-fan, please let me know how you want to proceed. Would you like me to merge this into the other PR? Let me know if you need me to do anything. Thanks.

@cloud-fan (Author)

Yes, please merge this, and we can polish your PR further to get it merged.

This is a hard problem and we are not going to fix it completely here. There are two things we should keep thinking about:

  1. how to fix the error in non-ANSI mode.
  2. how to fix the non-deterministic result.

Both of them are existing problems, and we are not making things worse.

@skambha (Owner) commented May 28, 2020

Hi @cloud-fan, I merged this locally and am running tests, and I see a test failure in SQLQueryTestSuite. I wanted to check whether you saw it, and whether you maybe missed adding an update here?

@cloud-fan (Author)

Can you let me know which test fails?

@skambha (Owner) commented May 28, 2020

This test: SQLQueryTestSuite for postgreSQL/window_part4.sql. Here is the diff:

[info] - postgreSQL/window_part4.sql *** FAILED *** (8 seconds, 897 milliseconds)
[info]   postgreSQL/window_part4.sql
[info]   Expected "struct<[sum(n) OVER (ORDER BY i ASC NULLS FIRST ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING):decimal(22,2)]>", but got "struct<[]>" Schema did not match for query #8
[info]   SELECT SUM(n) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
[info]     FROM (VALUES(1,1.01),(2,2),(3,3)) v(i,n): -- !query
[info]   SELECT SUM(n) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
[info]     FROM (VALUES(1,1.01),(2,2),(3,3)) v(i,n)
[info]   -- !query schema
[info]   struct<>
[info]   -- !query output
[info]   java.lang.ClassCastException
[info]   org.apache.spark.sql.catalyst.expressions.SpecificInternalRow cannot be cast to org.apache.spark.sql.types.Decimal (SQLQueryTestSuite.scala:464)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedExc

@skambha (Owner) commented May 29, 2020

There are a few more failures in the Hive suite bucket, for:
org.apache.spark.sql.hive.execution.HashAggregationQuerySuite
org.apache.spark.sql.hive.execution.HashAggregationQueryWithControlledFallbackSuite

I went ahead and fixed them. Here is the diff:

--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/decimalExpressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/decimalExpressions.scala
@@ -159,7 +159,7 @@ case class CheckOverflowInSum(
    if (value == null) {
      if (nullOnOverflow) null else throw new ArithmeticException("Overflow in sum of decimals.")
    } else {
-      input.asInstanceOf[Decimal].toPrecision(
+      value.asInstanceOf[Decimal].toPrecision(
        dataType.precision,
        dataType.scale,
        Decimal.ROUND_HALF_UP,
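
For context on why this one-character change matters, a hedged reconstruction (not verbatim source) of the surrounding method: in CheckOverflowInSum.eval, `input` is the incoming InternalRow argument, so casting it to Decimal is what produced the SpecificInternalRow ClassCastException above in the interpreted path; `value` is the evaluated child, i.e. the decimal that actually needs the final precision check. The trailing nullOnOverflow argument is assumed from the class's own flag.

// Approximate shape of CheckOverflowInSum.eval after the fix (reconstructed, not verbatim):
override def eval(input: InternalRow): Any = {
  val value = child.eval(input)                      // the aggregated decimal, or null on overflow
  if (value == null) {
    if (nullOnOverflow) null
    else throw new ArithmeticException("Overflow in sum of decimals.")
  } else {
    // Cast the evaluated child, not the input row, before re-checking precision.
    value.asInstanceOf[Decimal].toPrecision(
      dataType.precision, dataType.scale, Decimal.ROUND_HALF_UP, nullOnOverflow)
  }
}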

@skambha merged commit de2d68f into skambha:decaggfixwrongresults on May 29, 2020