-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-21811][SQL] Fix the inconsistency behavior when finding the widest common type #21074
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #89376 has finished for PR 21074 at commit
|
| types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match { | ||
| case Some(d) => findWiderTypeForTwo(d, c) | ||
| case None => None | ||
| // Currently we find the wider common type by comparing the two types from left to right, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The real problem is, findWiderTypeForTwo doesn't satisfy the associative law, i.e. (a op b) op c may not equal to a op (b op c). I think StringType is the only exception here, it's more clear to do
val (stringType, nonStringType) = types.partition(_ == StringType)
(stringType.distinct ++ nonStringType).foldLeft...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a behavior change. We need to make it configurable. Add a conf and update the migration guide.
| case _ => None | ||
| }) | ||
| // `findWiderTypeForTwo` doesn't satisfy the associative law, i.e. (a op b) op c may not equal | ||
| // to a op (b op c). This is only a problem when each of the types is StringType or can be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only a problem for StringType. Excluding StringType, findWiderTypeForTwo satisfies the associative law
|
Test build #89463 has finished for PR 21074 at commit
|
|
Test build #89459 has finished for PR 21074 at commit
|
| // to a op (b op c). This is only a problem for StringType. Excluding StringType, | ||
| // findWiderTypeForTwo satisfies the associative law. For instance, (TimestampType, | ||
| // IntegerType, StringType) should have StringType as the wider common type. | ||
| val (stringTypes, nonStringTypes) = types.partition(_ == StringType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, does this work with array types too (array of string vs array of non string types)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's expected to , let me also fix it for array types. Thanks!
| // findWiderTypeForTwo satisfies the associative law. For instance, (TimestampType, | ||
| // IntegerType, StringType) should have StringType as the wider common type. | ||
| val (stringTypes, nonStringTypes) = types.partition { t => | ||
| t == StringType || t == ArrayType(StringType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need something like
def hasStringType(dt: DataType): Boolean = dt match {
case StringType => true
case ArrayType(et, _) => hasStringType(et)
// Add StructType if we support string promotion for struct fields in the future.
case _ => false
}
|
Test build #89478 has finished for PR 21074 at commit
|
| case None => None | ||
| }) | ||
| // findWiderTypeForTwo doesn't satisfy the associative law, i.e. (a op b) op c may not equal | ||
| // to a op (b op c). This is only a problem for StringType. Excluding StringType, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only a problem for StringType or nested StringType in ArrayType. Excluding these types, ...
docs/sql-programming-guide.md
Outdated
| - Since Spark 2.4, writing a dataframe with an empty or nested empty schema using any file formats (parquet, orc, json, text, csv etc.) is not allowed. An exception is thrown when attempting to write dataframes with empty schema. | ||
| - Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after promotes both sides to TIMESTAMP. To set `false` to `spark.sql.hive.compareDateTimestampInTimestamp` restores the previous behavior. This option will be removed in Spark 3.0. | ||
| - Since Spark 2.4, creating a managed table with nonempty location is not allowed. An exception is thrown when attempting to create a managed table with nonempty location. To set `true` to `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the previous behavior. This option will be removed in Spark 3.0. | ||
| - Since Spark 2.4, finding the widest common type for the arguments of a variadic function(e.g. IN/COALESCE) should always success when each of the types of arguments is either StringType or can be promoted to StringType. Previously this may throw an exception for some specific arguments ordering. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, no matter how the input arguments order. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
|
Test build #89508 has finished for PR 21074 at commit
|
|
Test build #89519 has finished for PR 21074 at commit
|
|
Test build #89516 has finished for PR 21074 at commit
|
| ruleTest(rule, | ||
| Coalesce(Seq(tsArrayLit, intArrayLit, strArrayLit)), | ||
| Coalesce(Seq(Cast(tsArrayLit, ArrayType(StringType)), | ||
| Cast(intArrayLit, ArrayType(StringType)), Cast(strArrayLit, ArrayType(StringType))))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add an end to end test case that can trigger this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually don't add end-to-end tests for type coercion changes, as the type coercion logic is pretty isolated, it's very unlikely that we can pass the unit test but not end-to-end test.
|
thanks, merging to master! |
|
Seems missed out. I merged this to master. |
|
@HyukjinKwon thanks! I encountered some network issue and forgot to come back and merge this... |
What changes were proposed in this pull request?
Currently we find the wider common type by comparing the two types from left to right, this can be a problem when you have two data types which don't have a common type but each can be promoted to StringType.
For instance, if you have a table with the schema:
[c1: date, c2: string, c3: int]
The following succeeds:
SELECT coalesce(c1, c2, c3) FROM table
While the following produces an exception:
SELECT coalesce(c1, c3, c2) FROM table
This is only a issue when the seq of dataTypes contains
StringTypeand all the types can do string promotion.close #19033
How was this patch tested?
Add test in
TypeCoercionSuite