[SPARK-19435][SQL] Type coercion between ArrayTypes #16777
Conversation
`findTightestCommonType` was removed as it seems to be used nowhere.
It becomes harder for reviewers to read this PR. Could you submit a separate PR for code cleaning? Thanks!
Sure, I submitted a separate PR for this in #16786.
It seems we now support implicit cast of ArrayType via https://issues.apache.org/jira/browse/SPARK-18624.
cc @hvanhovell, could you maybe take a look please? I saw this PR is related to SPARK-18624.
Test build #72280 has finished for PR 16777 at commit
Test build #72290 has finished for PR 16777 at commit
Test build #72307 has finished for PR 16777 at commit
## What changes were proposed in this pull request?

This PR proposes to

- remove unused `findTightestCommonType` in `TypeCoercion` as suggested in apache#16777 (comment)
- rename `findTightestCommonTypeOfTwo` to `findTightestCommonType`
- fix comments accordingly

The usage was removed while refactoring/fixing in several JIRAs such as SPARK-16714, SPARK-16735 and SPARK-16646.

## How was this patch tested?

Existing tests.

Author: hyukjinkwon <[email protected]>

Closes apache#16786 from HyukjinKwon/SPARK-19446.
Force-pushed: d187ad3 to c1eca9d
Test build #72385 has finished for PR 16777 at commit
cc @cloud-fan, WDYT?
this becomes a method that returns a function, is it what we need?
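For context, the pattern being questioned can be sketched like this (a hypothetical illustration with made-up names, not the actual PR code):

```scala
// Hypothetical illustration of the review comment: a method that returns a
// function value, versus a plain method that computes the result directly.
object Example {
  // A method returning a function: an extra level of indirection.
  def coercerFor(widen: Boolean): (Int, Int) => Int =
    if (widen) (a, b) => math.max(a, b)
    else (a, b) => math.min(a, b)

  // The equivalent plain method, usually preferred when the function value
  // is not passed around or applied later.
  def coerce(widen: Boolean, a: Int, b: Int): Int =
    if (widen) math.max(a, b) else math.min(a, b)
}
```

Returning a function only pays off when the result is reused or handed to a higher-order API; otherwise the plain method reads more directly.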
Oh, I will fix it.
@HyukjinKwon are you sure this is a common behavior in databases?
Let me check other DBMSs and get back.
Postgres: supports this (sorry, I could not find a proper way to test this with Hive).
Hive: not supporting this type coercion; least/greatest seems to only support primitive types.
MySQL: seems not to support arrays.
@cloud-fan, to cut this short, it seems Postgres supports this whereas Hive does not. It seems we now support the implicit cast between them:

```scala
sql("SELECT percentile_approx(10.0, array('1', '1', '1'), 100)").show()
```

Wouldn't it be more reasonable to allow the type coercion between them as well?
Test build #72740 has finished for PR 16777 at commit
Test build #72741 has finished for PR 16777 at commit
tightestCommon -> expected?
`widenTest` already tests the symmetric case.
These tests are too similar. We just need one test case for numeric, one for string, and one for nested arrays.
@cloud-fan, I just addressed your comments and tested a build with Scala 2.10.
Test build #72762 has started for PR 16777 at commit
Test build #72760 has finished for PR 16777 at commit
nit: how about we always put the decimal type on the left side? Then we won't mistakenly add a symmetric test.
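The symmetry concern can be made concrete with a hypothetical helper (simplified names, not the actual test suite): asserting both argument orders in one place removes the temptation to write mirrored duplicate cases.

```scala
// Hypothetical sketch: a tiny widening function plus a helper that asserts
// both argument orders, so symmetric cases need not be written twice.
def widerOf(t1: String, t2: String): Option[String] =
  (t1, t2) match {
    case (a, b) if a == b                      => Some(a)
    case ("int", "double") | ("double", "int") => Some("double")
    case _                                     => None
  }

def checkSymmetric(t1: String, t2: String, expected: Option[String]): Unit = {
  assert(widerOf(t1, t2) == expected) // given order
  assert(widerOf(t2, t1) == expected) // mirrored order, checked automatically
}
```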
I think we can just test int or long, not both
LongType test is removed.
LGTM except several minor comments about tests, thanks for working on it!
Thanks @cloud-fan for your detailed review. I will keep those comments in mind.
Force-pushed: 7d72e8b to 5990f6f
Force-pushed: 5990f6f to c961270
Test build #72780 has finished for PR 16777 at commit
Test build #72777 has finished for PR 16777 at commit
Test build #72779 has finished for PR 16777 at commit
Test build #72781 has finished for PR 16777 at commit
Test build #72782 has finished for PR 16777 at commit
Previously, `findWiderTypeForDecimal` ran before `findTightestCommonTypeToString`, so the results could be different. cc @cloud-fan
You changed the order. I am not sure whether this should be documented in the release notes.
Yes, it is true that the type dispatch order was changed, but `findTightestCommonType` does not handle `DecimalType`, so the results would be the same.
@cloud-fan refactored this logic recently and I believe he didn't miss this part.
Is the result the same?
See the cases in findTightestCommonType: https://github.com/HyukjinKwon/spark/blob/510a0eee43030abbf37ef922684e6165d6f1e1c8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L87-L90
The original `findTightestCommonTypeToString` does not handle `DecimalType`. However, this PR calls `findTightestCommonType` first.
Aha, thank you for correcting me. I overlooked that, but the result should be the same, shouldn't it?
If they are different, we were already applying different type coercion rules in `findWiderTypeWithoutStringPromotion` and `findWiderTypeForTwo`; I guess we should make them match, given #14439?
We should make them consistent. That is why I think it is right to make the change, even if it causes behavior changes.
I see. Thank you for catching it.
I think we need to separate the changes from the support of
Do you mean two PRs (one of which is this one): one for cleaning up the logic here and one for the support of array type coercion?
Yeah, the first PR is for refactoring and cleaning up. The second one is for
I see what you mean. The code paths are now different. Let me try to investigate it and split them.
@gatorsmile, can we merge this and then add test cases for them separately? It seems the results are the same and there is no behaviour change, if I haven't missed some cases. I ran two tests as below:

```scala
val integralTypes = Seq(ByteType, ShortType, IntegerType, LongType)
val decimalTypes = (-38 to 38).flatMap { p =>
  (-38 to 38).flatMap(s => allCatch opt DecimalType(p, s))
}
assert(decimalTypes.nonEmpty)

integralTypes.foreach { it =>
  test(s"$it test") {
    decimalTypes.foreach { d =>
      val maybeType1 = TypeCoercion.findWiderTypeForDecimal(d, it)
      val maybeType2 = TypeCoercion.findTightestCommonType(d, it)
      if (maybeType2.isDefined) {
        val t1 = maybeType1.get
        val t2 = maybeType2.get
        assert(t1 === t2)
      }
    }
  }
}

integralTypes.foreach { it =>
  test(s"$it test subset") {
    val widenDecimals =
      decimalTypes.flatMap(TypeCoercion.findWiderTypeForDecimal(_, it)).toSet
    val tightDecimals =
      decimalTypes.flatMap(TypeCoercion.findTightestCommonType(_, it)).toSet
    assert(widenDecimals.nonEmpty)
    assert(tightDecimals.nonEmpty)
    assert(tightDecimals.subsetOf(widenDecimals))
  }
}
```

EDITED: I made the test simpler for readability in this comment.
Force-pushed: 510a0ee to 98b46af
(I just rebased for a conflict and added
The diff under discussion replaces a call to `findTightestCommonTypeToString(t1, t2)` with:

```scala
findTightestCommonType(t1, t2)
  .orElse(findWiderTypeForDecimal(t1, t2))
```
Yeah, we changed the order, but it looks like it won't change the result: `findWiderTypeForDecimal` will always return a result for a decimal type and a numeric type, and if `findTightestCommonType` can return a result, `findWiderTypeForDecimal` will return the same result. So it doesn't matter whether we run `findTightestCommonType` before or after it.
Yes. Integer will be promoted to a wider Decimal anyway.
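The ordering argument can be demonstrated with a small self-contained sketch (plain `Option`-returning stand-ins, not the real type-coercion functions): when the narrower function agrees with the wider one wherever both are defined, chaining them with `orElse` gives the same result in either order.

```scala
// Stand-ins for the real functions: `tight` is defined on a subset of
// inputs and agrees with `wide` there; `wide` is defined everywhere.
def tight(a: Int, b: Int): Option[Int] =
  if (a == b) Some(a) else None        // partial: only equal inputs

def wide(a: Int, b: Int): Option[Int] =
  Some(math.max(a, b))                 // total: agrees with tight when a == b

def tightFirst(a: Int, b: Int): Option[Int] = tight(a, b).orElse(wide(a, b))
def wideFirst(a: Int, b: Int): Option[Int] = wide(a, b).orElse(tight(a, b))
```

Checking all small inputs confirms `tightFirst` and `wideFirst` agree everywhere, mirroring why the reordering in this PR is behavior-preserving.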
Test build #72819 has finished for PR 16777 at commit
LGTM

Thanks! Merging to master.
Thank you @gatorsmile |
## What changes were proposed in this pull request?
This PR proposes to support type coercion between `ArrayType`s where the element types are compatible.
**Before**
```
Seq(Array(1)).toDF("a").selectExpr("greatest(a, array(1D))")
org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(`a`, array(1.0D))' due to data type mismatch: The expressions should all have the same type, got GREATEST(array<int>, array<double>).; line 1 pos 0;
Seq(Array(1)).toDF("a").selectExpr("least(a, array(1D))")
org.apache.spark.sql.AnalysisException: cannot resolve 'least(`a`, array(1.0D))' due to data type mismatch: The expressions should all have the same type, got LEAST(array<int>, array<double>).; line 1 pos 0;
sql("SELECT * FROM values (array(0)), (array(1D)) as data(a)")
org.apache.spark.sql.AnalysisException: incompatible types found in column a for inline table; line 1 pos 14
Seq(Array(1)).toDF("a").union(Seq(Array(1D)).toDF("b"))
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. ArrayType(DoubleType,false) <> ArrayType(IntegerType,false) at the first column of the second table;;
sql("SELECT IF(1=1, array(1), array(1D))")
org.apache.spark.sql.AnalysisException: cannot resolve '(IF((1 = 1), array(1), array(1.0D)))' due to data type mismatch: differing types in '(IF((1 = 1), array(1), array(1.0D)))' (array<int> and array<double>).; line 1 pos 7;
```
**After**
```scala
Seq(Array(1)).toDF("a").selectExpr("greatest(a, array(1D))")
res5: org.apache.spark.sql.DataFrame = [greatest(a, array(1.0)): array<double>]
Seq(Array(1)).toDF("a").selectExpr("least(a, array(1D))")
res6: org.apache.spark.sql.DataFrame = [least(a, array(1.0)): array<double>]
sql("SELECT * FROM values (array(0)), (array(1D)) as data(a)")
res8: org.apache.spark.sql.DataFrame = [a: array<double>]
Seq(Array(1)).toDF("a").union(Seq(Array(1D)).toDF("b"))
res10: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: array<double>]
sql("SELECT IF(1=1, array(1), array(1D))")
res15: org.apache.spark.sql.DataFrame = [(IF((1 = 1), array(1), array(1.0))): array<double>]
```
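The rule these examples exercise can be sketched in isolation (a simplified model with hypothetical names, not Spark's actual `TypeCoercion` code): two `ArrayType`s coerce when a wider type exists for their element types, found recursively, with the result's `containsNull` taken as the union of the inputs'.

```scala
// Simplified, self-contained sketch of array type coercion (hypothetical
// types, not Spark's): arrays are compatible when their element types are.
sealed trait DataType
case object IntType extends DataType
case object DoubleType extends DataType
case object StringType extends DataType
case class ArrayType(element: DataType, containsNull: Boolean) extends DataType

object Coercion {
  // Widening between primitive types (tiny subset for illustration).
  private def widerPrimitive(t1: DataType, t2: DataType): Option[DataType] =
    (t1, t2) match {
      case (a, b) if a == b => Some(a)
      case (IntType, DoubleType) | (DoubleType, IntType) => Some(DoubleType)
      case _ => None
    }

  // Recurse into array element types; union the containsNull flags.
  def findWiderType(t1: DataType, t2: DataType): Option[DataType] =
    (t1, t2) match {
      case (ArrayType(e1, n1), ArrayType(e2, n2)) =>
        findWiderType(e1, e2).map(ArrayType(_, n1 || n2))
      case _ => widerPrimitive(t1, t2)
    }
}
```

Under this sketch, `array<int>` and `array<double>` coerce to `array<double>`, matching the `greatest`/`least`/`union` results shown above; incompatible element types still fail to coerce.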
## How was this patch tested?
Unit tests in `TypeCoercion`, Jenkins tests, and building with Scala 2.10:
```scala
./dev/change-scala-version.sh 2.10
./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
```
Author: hyukjinkwon <[email protected]>
Closes apache#16777 from HyukjinKwon/SPARK-19435.