
Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR proposes to support type coercion between ArrayTypes where the element types are compatible.

Before

```
Seq(Array(1)).toDF("a").selectExpr("greatest(a, array(1D))")
org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(`a`, array(1.0D))' due to data type mismatch: The expressions should all have the same type, got GREATEST(array<int>, array<double>).; line 1 pos 0;

Seq(Array(1)).toDF("a").selectExpr("least(a, array(1D))")
org.apache.spark.sql.AnalysisException: cannot resolve 'least(`a`, array(1.0D))' due to data type mismatch: The expressions should all have the same type, got LEAST(array<int>, array<double>).; line 1 pos 0;

sql("SELECT * FROM values (array(0)), (array(1D)) as data(a)")
org.apache.spark.sql.AnalysisException: incompatible types found in column a for inline table; line 1 pos 14

Seq(Array(1)).toDF("a").union(Seq(Array(1D)).toDF("b"))
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. ArrayType(DoubleType,false) <> ArrayType(IntegerType,false) at the first column of the second table;;

sql("SELECT IF(1=1, array(1), array(1D))")
org.apache.spark.sql.AnalysisException: cannot resolve '(IF((1 = 1), array(1), array(1.0D)))' due to data type mismatch: differing types in '(IF((1 = 1), array(1), array(1.0D)))' (array<int> and array<double>).; line 1 pos 7;
```

After

```scala
Seq(Array(1)).toDF("a").selectExpr("greatest(a, array(1D))")
res5: org.apache.spark.sql.DataFrame = [greatest(a, array(1.0)): array<double>]

Seq(Array(1)).toDF("a").selectExpr("least(a, array(1D))")
res6: org.apache.spark.sql.DataFrame = [least(a, array(1.0)): array<double>]

sql("SELECT * FROM values (array(0)), (array(1D)) as data(a)")
res8: org.apache.spark.sql.DataFrame = [a: array<double>]

Seq(Array(1)).toDF("a").union(Seq(Array(1D)).toDF("b"))
res10: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: array<double>]

sql("SELECT IF(1=1, array(1), array(1D))")
res15: org.apache.spark.sql.DataFrame = [(IF((1 = 1), array(1), array(1.0))): array<double>]
```
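A minimal sketch of the coercion rule this PR adds, using toy types rather than Spark's actual `TypeCoercion` API (the `DType`/`widen` names below are illustrative assumptions): widening simply recurses into array element types.

```scala
// Toy model of Spark's data types; not the real org.apache.spark.sql.types API.
sealed trait DType
case object IntT extends DType
case object DoubleT extends DType
case object StringT extends DType
case class ArrayT(elem: DType) extends DType

// Tightest common type between two primitive types, if any.
def widenPrimitive(a: DType, b: DType): Option[DType] = (a, b) match {
  case (x, y) if x == y => Some(x)
  case (IntT, DoubleT) | (DoubleT, IntT) => Some(DoubleT)
  case _ => None
}

// The change proposed here, in miniature: two array types are compatible
// when their element types widen to a common type.
def widen(a: DType, b: DType): Option[DType] = (a, b) match {
  case (ArrayT(e1), ArrayT(e2)) => widen(e1, e2).map(ArrayT(_))
  case _ => widenPrimitive(a, b)
}

println(widen(ArrayT(IntT), ArrayT(DoubleT)))  // Some(ArrayT(DoubleT))
println(widen(ArrayT(IntT), ArrayT(StringT)))  // None: incompatible elements
```

Nested arrays fall out of the recursion for free, e.g. `array<array<int>>` and `array<array<double>>` widen to `array<array<double>>`.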

How was this patch tested?

Unit tests in TypeCoercion, Jenkins tests, and a build with Scala 2.10:

```
./dev/change-scala-version.sh 2.10
./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
```

Member Author

findTightestCommonType was removed as it seems to be used nowhere.

Member

It becomes harder for reviewers to read this PR. Could you submit a separate PR for code cleaning? Thanks!

Member Author

Sure, I submitted #16786 for this.

Member Author

It seems we now support implicit cast of ArrayType via https://issues.apache.org/jira/browse/SPARK-18624.

@HyukjinKwon
Member Author

HyukjinKwon commented Feb 2, 2017

cc @hvanhovell, could you maybe take a look, please? I saw this PR is related to SPARK-18624.

@SparkQA

SparkQA commented Feb 2, 2017

Test build #72280 has finished for PR 16777 at commit bd0d9f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 3, 2017

Test build #72290 has finished for PR 16777 at commit b860d25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 3, 2017

Test build #72307 has finished for PR 16777 at commit d187ad3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ghost pushed a commit to dbtsai/spark that referenced this pull request Feb 4, 2017
## What changes were proposed in this pull request?

This PR proposes to

- remove unused `findTightestCommonType` in `TypeCoercion` as suggested in apache#16777 (comment)
- rename `findTightestCommonTypeOfTwo` to `findTightestCommonType`.
- fix comments accordingly

The usage was removed while refactoring/fixing in several JIRAs such as SPARK-16714, SPARK-16735 and SPARK-16646.

## How was this patch tested?

Existing tests.

Author: hyukjinkwon <[email protected]>

Closes apache#16786 from HyukjinKwon/SPARK-19446.
@SparkQA

SparkQA commented Feb 4, 2017

Test build #72385 has finished for PR 16777 at commit c1eca9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

cc @cloud-fan, WDYT?

Contributor

This becomes a method that returns a function; is that what we need?

Member Author

Oh, I will fix it.

@cloud-fan
Contributor

@HyukjinKwon are you sure this is a common behavior in databases?

@HyukjinKwon
Member Author

Let me check other DBMSs and get back.

@HyukjinKwon
Member Author

Postgres

```
postgres=# SELECT greatest(array[1], array[0.1]);
 greatest
----------
 {1}
(1 row)

postgres=# SELECT least(array[1], array[0.1]);
 least
-------
 {0.1}
(1 row)

postgres=# SELECT * FROM (values (array[1]), (array[0.1])) as foo;
 column1
---------
 {1}
 {0.1}
(2 rows)

postgres=# SELECT array[1] UNION SELECT array[0.1];
 array
-------
 {0.1}
 {1}
(2 rows)

postgres=# SELECT CASE WHEN TRUE THEN array[0.1] ELSE array[1] END;
 array
-------
 {0.1}
(1 row)
```

(Sorry, I could not find a proper way to test this with IF, so I used CASE/WHEN in Postgres.)

Hive: does not support this type coercion.

```
SELECT least(array(1), array(1D));
FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments '1': least only takes primitive types, got array<int>
```

least/greatest seem to support only primitive types.

```
SELECT inline(array(struct(array(0)), struct(array(1D))));
FAILED: SemanticException [Error 10016]: Line 1:38 Argument type mismatch '1D': Argument type "struct<col1:array<double>>" is different from preceding arguments. Previous type was "struct<col1:array<int>>"

SELECT array(1) UNION SELECT array(1D);
FAILED: SemanticException Schema of both sides of union should match: Column _c0 is of type array<int> on first table and type array<double> on second table. Cannot tell the position of null AST.

SELECT IF(1=1, array(1), array(1D));
FAILED: SemanticException [Error 10016]: Line 1:25 Argument type mismatch '1D': The second and the third arguments of function IF should have the same type, but they are different: "array<int>" and "array<double>"
```

MySQL: does not seem to support arrays.

@HyukjinKwon
Member Author

HyukjinKwon commented Feb 11, 2017

@cloud-fan, to cut this short: it seems Postgres supports this whereas Hive does not.

It seems we now support implicit casts between ArrayTypes via SPARK-18624, for example:

```scala
sql("SELECT percentile_approx(10.0, array('1', '1', '1'), 100)").show()
```

Wouldn't it be more reasonable to allow the type coercion between them as well?

@SparkQA

SparkQA commented Feb 11, 2017

Test build #72740 has finished for PR 16777 at commit 59a1cbf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 11, 2017

Test build #72741 has finished for PR 16777 at commit def432a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

tightestCommon -> expected?

Contributor

widenTest already tests symmetry.

Contributor

These tests are too similar. We just need one test case for numeric, one for string, and one for nested arrays.

@HyukjinKwon
Member Author

HyukjinKwon commented Feb 12, 2017

@cloud-fan, I just addressed your comments and tested a build with Scala 2.10.

@SparkQA

SparkQA commented Feb 12, 2017

Test build #72762 has started for PR 16777 at commit 180f3c1.

@SparkQA

SparkQA commented Feb 12, 2017

Test build #72760 has finished for PR 16777 at commit 10afdcb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented Feb 12, 2017

nit: how about we always put the decimal type on the left side? Then we won't mistakenly add a symmetric test.

Contributor

I think we can just test int or long, not both

@HyukjinKwon (Member Author) commented Feb 12, 2017

LongType test is removed.

@cloud-fan
Contributor

cloud-fan commented Feb 12, 2017

LGTM except several minor comments about tests, thanks for working on it!

@HyukjinKwon
Member Author

Thanks @cloud-fan for your detailed review. I will keep in mind those comments.

@HyukjinKwon HyukjinKwon force-pushed the SPARK-19435 branch 2 times, most recently from 7d72e8b to 5990f6f Compare February 12, 2017 12:47
@SparkQA

SparkQA commented Feb 12, 2017

Test build #72780 has finished for PR 16777 at commit 5990f6f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 12, 2017

Test build #72777 has finished for PR 16777 at commit 31d3163.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 12, 2017

Test build #72779 has finished for PR 16777 at commit 768c7e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 12, 2017

Test build #72781 has finished for PR 16777 at commit c961270.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 12, 2017

Test build #72782 has finished for PR 16777 at commit 510a0ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Previously, findWiderTypeForDecimal was before findTightestCommonTypeToString. Thus, the results could be different. cc @cloud-fan

You changed the order. I am not sure whether this should be documented in the release notes.

Member Author

Yes, it is true that the type dispatch order was changed, but findTightestCommonType does not take care of DecimalType; therefore, the results would be the same.

Member Author

@cloud-fan refactored this logic recently and I believe he didn't miss this part.


Member

The original findTightestCommonTypeToString does not handle DecimalType. However, this PR calls findTightestCommonType first.

@HyukjinKwon (Member Author) commented Feb 13, 2017

Aha, thank you for correcting me. I overlooked that, but the result should be the same, shouldn't it?

If they differ, we were already applying different type coercion rules in findWiderTypeWithoutStringPromotion and findWiderTypeForTwo; I guess we should make them the same, given #14439?

Member

We should make them consistent. That is why I think it is right to make the change, even if it causes behavior changes.

Member Author

I see. Thank you for catching it.

@gatorsmile
Member

I think we need to separate these changes from the support of type coercion between ArrayTypes. Could you submit another PR first? We might need extra test cases for this change.

@HyukjinKwon
Member Author

HyukjinKwon commented Feb 13, 2017

Do you mean two PRs (one of which is this one): one for cleaning up the logic here, and one for the support of array type coercion?

@gatorsmile
Member

Yeah, the first PR is for refactoring and cleaning up findWiderTypeForTwo. We need to add the test cases for the behavior changes. We might also need to document this in the release note, because it changes the output types.

The second one is for Type coercion between ArrayTypes.

@HyukjinKwon
Member Author

I see what you mean. The code paths are now different. Let me try to investigate it and split them.

@HyukjinKwon
Member Author

HyukjinKwon commented Feb 13, 2017

@gatorsmile, can we merge this first and then add test cases for them separately? It seems the results are the same and there is no behaviour change, unless I have missed some cases. I ran two tests as below:

```scala
// Run inside TypeCoercionSuite, where `test` and the Spark type imports are in scope.
import scala.util.control.Exception.allCatch

val integralTypes = Seq(ByteType, ShortType, IntegerType, LongType)
val decimalTypes = (-38 to 38).flatMap { p =>
  (-38 to 38).flatMap(s => allCatch opt DecimalType(p, s))
}

assert(decimalTypes.nonEmpty)

integralTypes.foreach { it =>
  test(s"$it test") {
    decimalTypes.foreach { d =>
      val maybeType1 = TypeCoercion.findWiderTypeForDecimal(d, it)
      val maybeType2 = TypeCoercion.findTightestCommonType(d, it)

      if (maybeType2.isDefined) {
        val t1 = maybeType1.get
        val t2 = maybeType2.get
        assert(t1 === t2)
      }
    }
  }
}

integralTypes.foreach { it =>
  test(s"$it test subset") {
    val widenDecimals =
      decimalTypes.flatMap(TypeCoercion.findWiderTypeForDecimal(_, it)).toSet
    val tightDecimals =
      decimalTypes.flatMap(TypeCoercion.findTightestCommonType(_, it)).toSet

    assert(widenDecimals.nonEmpty)
    assert(tightDecimals.nonEmpty)
    assert(tightDecimals.subsetOf(widenDecimals))
  }
}
```

Edited: I made the test simpler for readability in this comment.

@HyukjinKwon
Member Author

HyukjinKwon commented Feb 13, 2017

(I just rebased for a conflict and added private[analysis] for consistency)

```
      findTightestCommonTypeToString(t1, t2)
    }
      findTightestCommonType(t1, t2)
        .orElse(findWiderTypeForDecimal(t1, t2))
```
Contributor

Yea, we changed the order, but it looks like it won't change the result. findWiderTypeForDecimal will always return a result for a decimal type and a numeric type, and if findTightestCommonType can return a result, findWiderTypeForDecimal will return the same result. So it doesn't matter whether we run findTightestCommonType before or after it.

Member

Yes. Integer will be promoted to a wider Decimal anyway.
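
This argument can be checked with a toy model (the `Dec`/`widerForDecimal` names and the widening formula below are simplified assumptions, not Spark's implementation): the tightest-common-type rule only fires for identical decimal types, and in exactly those cases the decimal-widening rule returns the same type, so the dispatch order cannot change the result.

```scala
// Toy stand-in for DecimalType(precision, scale); not Spark's real class.
case class Dec(p: Int, s: Int)

// The tightest-common-type rule: defined only when both sides are identical.
def tightest(a: Dec, b: Dec): Option[Dec] = if (a == b) Some(a) else None

// Simplified decimal widening: keep enough integer digits and scale for both.
def widerForDecimal(a: Dec, b: Dec): Option[Dec] = {
  val intDigits = math.max(a.p - a.s, b.p - b.s)
  val scale = math.max(a.s, b.s)
  Some(Dec(intDigits + scale, scale))
}

val decimals = for (p <- 1 to 38; s <- 0 to p) yield Dec(p, s)
// Wherever tightest is defined, widening agrees with it, so
// tightest(...).orElse(widerForDecimal(...)) is order-insensitive here.
assert(decimals.forall(d => tightest(d, d) == widerForDecimal(d, d)))
println("dispatch-order check passed")
```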

@SparkQA

SparkQA commented Feb 13, 2017

Test build #72819 has finished for PR 16777 at commit 98b46af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM

@gatorsmile
Member

Thanks! Merging to master.

@asfgit asfgit closed this in 9af8f74 Feb 13, 2017
@HyukjinKwon
Member Author

Thank you @gatorsmile

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017

Closes apache#16777 from HyukjinKwon/SPARK-19435.
@HyukjinKwon HyukjinKwon deleted the SPARK-19435 branch January 2, 2018 03:38