Skip to content

Conversation

@jiangxb1987
Copy link
Contributor

What changes were proposed in this pull request?

Currently we find the wider common type by comparing the two types from left to right, this can be a problem when you have two data types which don't have a common type but each can be promoted to StringType.

For instance, if you have a table with the schema:
[c1: date, c2: string, c3: int]

The following succeeds:
SELECT coalesce(c1, c2, c3) FROM table

While the following produces an exception:
SELECT coalesce(c1, c3, c2) FROM table

This is only a issue when the seq of dataTypes contains StringType and all the types can do string promotion.

close #19033

How was this patch tested?

Add test in TypeCoercionSuite

@SparkQA
Copy link

SparkQA commented Apr 15, 2018

Test build #89376 has finished for PR 21074 at commit 803a6a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Copy link
Contributor Author

types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match {
case Some(d) => findWiderTypeForTwo(d, c)
case None => None
// Currently we find the wider common type by comparing the two types from left to right,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The real problem is, findWiderTypeForTwo doesn't satisfy the associative law, i.e. (a op b) op c may not equal to a op (b op c). I think StringType is the only exception here, it's more clear to do

val (stringType, nonStringType) = types.partition(_ == StringType)
(stringType.distinct ++ nonStringType).foldLeft...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a behavior change. We need to make it configurable. Add a conf and update the migration guide.

case _ => None
})
// `findWiderTypeForTwo` doesn't satisfy the associative law, i.e. (a op b) op c may not equal
// to a op (b op c). This is only a problem when each of the types is StringType or can be
Copy link
Contributor

@cloud-fan cloud-fan Apr 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only a problem for StringType. Excluding StringType, findWiderTypeForTwo satisfies the associative law

@SparkQA
Copy link

SparkQA commented Apr 17, 2018

Test build #89463 has finished for PR 21074 at commit d77f213.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 17, 2018

Test build #89459 has finished for PR 21074 at commit 571912f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// to a op (b op c). This is only a problem for StringType. Excluding StringType,
// findWiderTypeForTwo satisfies the associative law. For instance, (TimestampType,
// IntegerType, StringType) should have StringType as the wider common type.
val (stringTypes, nonStringTypes) = types.partition(_ == StringType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, does this work with array types too (array of string vs array of non string types)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's expected to , let me also fix it for array types. Thanks!

// findWiderTypeForTwo satisfies the associative law. For instance, (TimestampType,
// IntegerType, StringType) should have StringType as the wider common type.
val (stringTypes, nonStringTypes) = types.partition { t =>
t == StringType || t == ArrayType(StringType)
Copy link
Contributor

@cloud-fan cloud-fan Apr 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need something like

def hasStringType(dt: DataType): Boolean = dt match {
  case StringType => true
  case ArrayType(et, _) => hasStringType(et)
  // Add StructType if we support string promotion for struct fields in the future.
  case _ => false
}

@SparkQA
Copy link

SparkQA commented Apr 18, 2018

Test build #89478 has finished for PR 21074 at commit 4ce5081.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case None => None
})
// findWiderTypeForTwo doesn't satisfy the associative law, i.e. (a op b) op c may not equal
// to a op (b op c). This is only a problem for StringType. Excluding StringType,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only a problem for StringType or nested StringType in ArrayType. Excluding these types, ...

- Since Spark 2.4, writing a dataframe with an empty or nested empty schema using any file formats (parquet, orc, json, text, csv etc.) is not allowed. An exception is thrown when attempting to write dataframes with empty schema.
- Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after promotes both sides to TIMESTAMP. To set `false` to `spark.sql.hive.compareDateTimestampInTimestamp` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, creating a managed table with nonempty location is not allowed. An exception is thrown when attempting to create a managed table with nonempty location. To set `true` to `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, finding the widest common type for the arguments of a variadic function(e.g. IN/COALESCE) should always success when each of the types of arguments is either StringType or can be promoted to StringType. Previously this may throw an exception for some specific arguments ordering.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, no matter how the input arguments order. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.

@SparkQA
Copy link

SparkQA commented Apr 18, 2018

Test build #89508 has finished for PR 21074 at commit 55eefe0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 18, 2018

Test build #89519 has finished for PR 21074 at commit ff514af.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 18, 2018

Test build #89516 has finished for PR 21074 at commit c2abce2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ruleTest(rule,
Coalesce(Seq(tsArrayLit, intArrayLit, strArrayLit)),
Coalesce(Seq(Cast(tsArrayLit, ArrayType(StringType)),
Cast(intArrayLit, ArrayType(StringType)), Cast(strArrayLit, ArrayType(StringType)))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add an end to end test case that can trigger this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually don't add end-to-end tests for type coercion changes, as the type coercion logic is pretty isolated, it's very unlikely that we can pass the unit test but not end-to-end test.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@HyukjinKwon
Copy link
Member

Seems missed out. I merged this to master.

@asfgit asfgit closed this in d96c3e3 Apr 19, 2018
@cloud-fan
Copy link
Contributor

@HyukjinKwon thanks! I encountered some network issue and forgot to come back and merge this...

@jiangxb1987 jiangxb1987 deleted the typeCoercion branch April 23, 2018 07:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants