
Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR proposes to support type coercion between ArrayTypes where the element types are compatible.

Before

```
Seq(Array(1)).toDF("a").selectExpr("greatest(a, array(1D))")
org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(`a`, array(1.0D))' due to data type mismatch: The expressions should all have the same type, got GREATEST(array<int>, array<double>).; line 1 pos 0;

Seq(Array(1)).toDF("a").selectExpr("least(a, array(1D))")
org.apache.spark.sql.AnalysisException: cannot resolve 'least(`a`, array(1.0D))' due to data type mismatch: The expressions should all have the same type, got LEAST(array<int>, array<double>).; line 1 pos 0;

sql("SELECT * FROM values (array(0)), (array(1D)) as data(a)")
org.apache.spark.sql.AnalysisException: incompatible types found in column a for inline table; line 1 pos 14

Seq(Array(1)).toDF("a").union(Seq(Array(1D)).toDF("b"))
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. ArrayType(DoubleType,false) <> ArrayType(IntegerType,false) at the first column of the second table;;

sql("SELECT IF(1=1, array(1), array(1D))")
org.apache.spark.sql.AnalysisException: cannot resolve '(IF((1 = 1), array(1), array(1.0D)))' due to data type mismatch: differing types in '(IF((1 = 1), array(1), array(1.0D)))' (array<int> and array<double>).; line 1 pos 7;
```

After

```scala
Seq(Array(1)).toDF("a").selectExpr("greatest(a, array(1D))")
res5: org.apache.spark.sql.DataFrame = [greatest(a, array(1.0)): array<double>]

Seq(Array(1)).toDF("a").selectExpr("least(a, array(1D))")
res6: org.apache.spark.sql.DataFrame = [least(a, array(1.0)): array<double>]

sql("SELECT * FROM values (array(0)), (array(1D)) as data(a)")
res8: org.apache.spark.sql.DataFrame = [a: array<double>]

Seq(Array(1)).toDF("a").union(Seq(Array(1D)).toDF("b"))
res10: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: array<double>]

sql("SELECT IF(1=1, array(1), array(1D))")
res15: org.apache.spark.sql.DataFrame = [(IF((1 = 1), array(1), array(1.0))): array<double>]
```
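A minimal sketch of the coercion rule this PR adds, using toy types rather than Spark's actual `TypeCoercion` API (the `DType`/`widen` names below are illustrative assumptions): widening simply recurses into array element types.

```scala
// Toy model of Spark's data types; not the real org.apache.spark.sql.types API.
sealed trait DType
case object IntT extends DType
case object DoubleT extends DType
case object StringT extends DType
case class ArrayT(elem: DType) extends DType

// Tightest common type between two primitive types, if any.
def widenPrimitive(a: DType, b: DType): Option[DType] = (a, b) match {
  case (x, y) if x == y => Some(x)
  case (IntT, DoubleT) | (DoubleT, IntT) => Some(DoubleT)
  case _ => None
}

// The change proposed here, in miniature: two array types are compatible
// when their element types widen to a common type.
def widen(a: DType, b: DType): Option[DType] = (a, b) match {
  case (ArrayT(e1), ArrayT(e2)) => widen(e1, e2).map(ArrayT(_))
  case _ => widenPrimitive(a, b)
}

println(widen(ArrayT(IntT), ArrayT(DoubleT)))  // Some(ArrayT(DoubleT))
println(widen(ArrayT(IntT), ArrayT(StringT)))  // None: incompatible elements
```

Nested arrays fall out of the recursion for free, e.g. `array<array<int>>` and `array<array<double>>` widen to `array<array<double>>`.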

How was this patch tested?

Unit tests in TypeCoercion, Jenkins tests, and a build with Scala 2.10:

```
./dev/change-scala-version.sh 2.10
./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
```

Member Author

findTightestCommonType was removed as it seems to be used nowhere.

Member

It becomes harder for reviewers to read this PR. Could you submit a separate PR for code cleaning? Thanks!

Member Author

Sure, I submitted #16786 for this.

Member Author

It seems we now support implicit cast of ArrayType via https://issues.apache.org/jira/browse/SPARK-18624.

@HyukjinKwon
Member Author

HyukjinKwon commented Feb 2, 2017

cc @hvanhovell, could you maybe take a look, please? I saw this PR is related to SPARK-18624.

@SparkQA

SparkQA commented Feb 2, 2017

Test build #72280 has finished for PR 16777 at commit bd0d9f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 3, 2017

Test build #72290 has finished for PR 16777 at commit b860d25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 3, 2017

Test build #72307 has finished for PR 16777 at commit d187ad3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ghost pushed a commit to dbtsai/spark that referenced this pull request Feb 4, 2017
## What changes were proposed in this pull request?

This PR proposes to

- remove unused `findTightestCommonType` in `TypeCoercion` as suggested in apache#16777 (comment)
- rename `findTightestCommonTypeOfTwo` to `findTightestCommonType`.
- fix comments accordingly

The usage was removed while refactoring/fixing in several JIRAs such as SPARK-16714, SPARK-16735 and SPARK-16646.

## How was this patch tested?

Existing tests.

Author: hyukjinkwon <[email protected]>

Closes apache#16786 from HyukjinKwon/SPARK-19446.
@SparkQA

SparkQA commented Feb 4, 2017

Test build #72385 has finished for PR 16777 at commit c1eca9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

cc @cloud-fan, WDYT?

Contributor

This becomes a method that returns a function; is that what we need?

Member Author

Oh, I will fix it.

@cloud-fan
Contributor

@HyukjinKwon are you sure this is a common behavior in databases?

@HyukjinKwon
Member Author

Let me check other DBMSs and get back.

@HyukjinKwon
Member Author

Postgres

```
postgres=# SELECT greatest(array[1], array[0.1]);
 greatest
----------
 {1}
(1 row)

postgres=# SELECT least(array[1], array[0.1]);
 least
-------
 {0.1}
(1 row)

postgres=# SELECT * FROM (values (array[1]), (array[0.1])) as foo;
 column1
---------
 {1}
 {0.1}
(2 rows)

postgres=# SELECT array[1] UNION SELECT array[0.1];
 array
-------
 {0.1}
 {1}
(2 rows)

postgres=# SELECT CASE WHEN TRUE THEN array[0.1] ELSE array[1] END;
 array
-------
 {0.1}
(1 row)
```

(Sorry, I could not find a proper way to test this with IF, so I used CASE/WHEN in Postgres.)

Hive: does not support this type coercion.

```
SELECT least(array(1), array(1D));
FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments '1': least only takes primitive types, got array<int>
```

least/greatest seem to support only primitive types.

```
SELECT inline(array(struct(array(0)), struct(array(1D))));
FAILED: SemanticException [Error 10016]: Line 1:38 Argument type mismatch '1D': Argument type "struct<col1:array<double>>" is different from preceding arguments. Previous type was "struct<col1:array<int>>"

SELECT array(1) UNION SELECT array(1D);
FAILED: SemanticException Schema of both sides of union should match: Column _c0 is of type array<int> on first table and type array<double> on second table. Cannot tell the position of null AST.

SELECT IF(1=1, array(1), array(1D));
FAILED: SemanticException [Error 10016]: Line 1:25 Argument type mismatch '1D': The second and the third arguments of function IF should have the same type, but they are different: "array<int>" and "array<double>"
```

MySQL: does not seem to support arrays.

@HyukjinKwon
Member Author

HyukjinKwon commented Feb 11, 2017

@cloud-fan, to cut this short: it seems Postgres supports this whereas Hive does not.

It seems we now support implicit casts between ArrayTypes via SPARK-18624, for example:

```scala
sql("SELECT percentile_approx(10.0, array('1', '1', '1'), 100)").show()
```

Wouldn't it be more reasonable to allow the type coercion between them as well?

@SparkQA

SparkQA commented Feb 11, 2017

Test build #72740 has finished for PR 16777 at commit 59a1cbf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 11, 2017

Test build #72741 has finished for PR 16777 at commit def432a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

tightestCommon -> expected?

Contributor

widenTest already tests symmetry.

Contributor

These tests are too similar. We just need one test case for numeric, one for string, and one for nested arrays.

@HyukjinKwon
Member Author

HyukjinKwon commented Feb 12, 2017

@cloud-fan, I just addressed your comments and tested a build with Scala 2.10.

@SparkQA

SparkQA commented Feb 12, 2017

Test build #72762 has started for PR 16777 at commit 180f3c1.

@SparkQA

SparkQA commented Feb 12, 2017

Test build #72760 has finished for PR 16777 at commit 10afdcb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented Feb 12, 2017

nit: how about we always put the decimal type on the left side? Then we won't mistakenly add a symmetric test.

Contributor

I think we can just test int or long, not both

@HyukjinKwon (Member Author) commented Feb 12, 2017

LongType test is removed.

@cloud-fan
Contributor

cloud-fan commented Feb 12, 2017

LGTM except several minor comments about tests, thanks for working on it!

@HyukjinKwon
Member Author

Thanks @cloud-fan for your detailed review. I will keep in mind those comments.

@HyukjinKwon HyukjinKwon force-pushed the SPARK-19435 branch 2 times, most recently from 7d72e8b to 5990f6f Compare February 12, 2017 12:47
@SparkQA

SparkQA commented Feb 12, 2017

Test build #72780 has finished for PR 16777 at commit 5990f6f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 12, 2017

Test build #72777 has finished for PR 16777 at commit 31d3163.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 12, 2017

Test build #72779 has finished for PR 16777 at commit 768c7e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 12, 2017

Test build #72781 has finished for PR 16777 at commit c961270.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 12, 2017

Test build #72782 has finished for PR 16777 at commit 510a0ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Previously, findWiderTypeForDecimal was before findTightestCommonTypeToString. Thus, the results could be different. cc @cloud-fan

You changed the order. I am not sure whether this should be documented in the release notes.

Member Author

Yes, it is true that the type dispatch order was changed, but findTightestCommonType does not take care of DecimalType; therefore, the results would be the same.

Member Author

@cloud-fan refactored this logic recently and I believe he didn't miss this part.


Member

The original findTightestCommonTypeToString does not handle DecimalType. However, this PR calls findTightestCommonType first.

@HyukjinKwon (Member Author) commented Feb 13, 2017

Aha, thank you for correcting me. I overlooked that, but the result should be the same, shouldn't it?

If they differ, we were already applying different type coercion rules in findWiderTypeWithoutStringPromotion and findWiderTypeForTwo; I guess we should make them the same, given #14439?

Member

We should make them consistent. That is why I think it is right to make the change, even if it causes behavior changes.

Member Author

I see. Thank you for catching it.

@gatorsmile
Member

I think we need to separate these changes from the support of type coercion between ArrayTypes. Could you submit another PR first? We might need extra test cases for this change.

@HyukjinKwon
Member Author

HyukjinKwon commented Feb 13, 2017

Do you mean two PRs (one of which is this one): one for cleaning up the logic here, and one for the support of array type coercion?

@gatorsmile
Member

Yeah, the first PR is for refactoring and cleaning up findWiderTypeForTwo. We need to add the test cases for the behavior changes. We might also need to document this in the release note, because it changes the output types.

The second one is for Type coercion between ArrayTypes.

@HyukjinKwon
Member Author

I see what you mean. The code paths are now different. Let me try to investigate it and split them.

@HyukjinKwon
Member Author

HyukjinKwon commented Feb 13, 2017

@gatorsmile, can we merge this first and then add test cases for them separately? It seems the results are the same and there is no behaviour change, unless I have missed some cases. I ran two tests as below:

```scala
// Run inside TypeCoercionSuite, where `test` and the Spark type imports are in scope.
import scala.util.control.Exception.allCatch

val integralTypes = Seq(ByteType, ShortType, IntegerType, LongType)
val decimalTypes = (-38 to 38).flatMap { p =>
  (-38 to 38).flatMap(s => allCatch opt DecimalType(p, s))
}

assert(decimalTypes.nonEmpty)

integralTypes.foreach { it =>
  test(s"$it test") {
    decimalTypes.foreach { d =>
      val maybeType1 = TypeCoercion.findWiderTypeForDecimal(d, it)
      val maybeType2 = TypeCoercion.findTightestCommonType(d, it)

      if (maybeType2.isDefined) {
        val t1 = maybeType1.get
        val t2 = maybeType2.get
        assert(t1 === t2)
      }
    }
  }
}

integralTypes.foreach { it =>
  test(s"$it test subset") {
    val widenDecimals =
      decimalTypes.flatMap(TypeCoercion.findWiderTypeForDecimal(_, it)).toSet
    val tightDecimals =
      decimalTypes.flatMap(TypeCoercion.findTightestCommonType(_, it)).toSet

    assert(widenDecimals.nonEmpty)
    assert(tightDecimals.nonEmpty)
    assert(tightDecimals.subsetOf(widenDecimals))
  }
}
```

Edited: I made the test simpler for readability in this comment.

@HyukjinKwon
Member Author

HyukjinKwon commented Feb 13, 2017

(I just rebased for a conflict and added private[analysis] for consistency)

```
      findTightestCommonTypeToString(t1, t2)
    }
      findTightestCommonType(t1, t2)
        .orElse(findWiderTypeForDecimal(t1, t2))
```
Contributor

Yea, we changed the order, but it looks like it won't change the result. findWiderTypeForDecimal will always return a result for a decimal type and a numeric type, and if findTightestCommonType can return a result, findWiderTypeForDecimal will return the same result. So it doesn't matter whether we run findTightestCommonType before or after it.

Member

Yes. Integer will be promoted to a wider Decimal anyway.
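
This argument can be checked with a toy model (the `Dec`/`widerForDecimal` names and the widening formula below are simplified assumptions, not Spark's implementation): the tightest-common-type rule only fires for identical decimal types, and in exactly those cases the decimal-widening rule returns the same type, so the dispatch order cannot change the result.

```scala
// Toy stand-in for DecimalType(precision, scale); not Spark's real class.
case class Dec(p: Int, s: Int)

// The tightest-common-type rule: defined only when both sides are identical.
def tightest(a: Dec, b: Dec): Option[Dec] = if (a == b) Some(a) else None

// Simplified decimal widening: keep enough integer digits and scale for both.
def widerForDecimal(a: Dec, b: Dec): Option[Dec] = {
  val intDigits = math.max(a.p - a.s, b.p - b.s)
  val scale = math.max(a.s, b.s)
  Some(Dec(intDigits + scale, scale))
}

val decimals = for (p <- 1 to 38; s <- 0 to p) yield Dec(p, s)
// Wherever tightest is defined, widening agrees with it, so
// tightest(...).orElse(widerForDecimal(...)) is order-insensitive here.
assert(decimals.forall(d => tightest(d, d) == widerForDecimal(d, d)))
println("dispatch-order check passed")
```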

@SparkQA

SparkQA commented Feb 13, 2017

Test build #72819 has finished for PR 16777 at commit 98b46af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM

@gatorsmile
Member

Thanks! Merging to master.

@asfgit asfgit closed this in 9af8f74 Feb 13, 2017
@HyukjinKwon
Member Author

Thank you @gatorsmile

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017

Closes apache#16777 from HyukjinKwon/SPARK-19435.
@HyukjinKwon HyukjinKwon deleted the SPARK-19435 branch January 2, 2018 03:38