docs/sql-programming-guide.md (1 addition, 1 deletion)

@@ -1810,7 +1810,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
- Since Spark 2.4, writing a dataframe with an empty or nested empty schema using any file format (parquet, orc, json, text, csv, etc.) is not allowed. An exception is thrown when attempting to write dataframes with an empty schema.
- Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after promoting both sides to TIMESTAMP. Setting `spark.sql.hive.compareDateTimestampInTimestamp` to `false` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, creating a managed table with a nonempty location is not allowed. An exception is thrown when attempting to do so. Setting `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` to `true` restores the previous behavior. This option will be removed in Spark 3.0.

- Since Spark 2.4, the type coercion rules can automatically promote the argument types of variadic SQL functions (e.g., IN/COALESCE) to the widest common type, no matter in which order the input arguments appear (see the sketch below). In prior Spark versions, the promotion could fail for some specific orderings (e.g., TimestampType, IntegerType and StringType) and throw an exception.
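For illustration only (this example is mine, not part of the PR), a minimal sketch of the behavior change, assuming a Spark 2.4 session bound to `spark`:

```scala
// COALESCE over (TimestampType, IntegerType, StringType). In earlier versions this
// argument order could throw, because TIMESTAMP and INT have no wider common type
// on their own; since Spark 2.4, the StringType argument causes every argument to
// be promoted to StringType, so the query succeeds regardless of argument order.
spark.sql("SELECT COALESCE(CAST('2017-04-12' AS TIMESTAMP), 1, 'fallback')").show()
```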
## Upgrading From Spark SQL 2.2 to 2.3

- Since Spark 2.3, queries on raw JSON/CSV files are disallowed when the referenced columns include only the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()` are both rejected. Instead, you can cache or save the parsed results and then run the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.
@@ -175,11 +175,27 @@ object TypeCoercion {
})
}

/**
* Whether the data type contains StringType.
*/
def hasStringType(dt: DataType): Boolean = dt match {
case StringType => true
case ArrayType(et, _) => hasStringType(et)
// Add StructType if we support string promotion for struct fields in the future.
case _ => false
}
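As an aside (my illustration, not part of the diff), the recursion only descends into array element types, so, assuming `org.apache.spark.sql.types._` is in scope:

```scala
hasStringType(StringType)                        // true
hasStringType(ArrayType(ArrayType(StringType)))  // true: recurses into element types
hasStringType(StructType(Seq(StructField("s", StringType))))  // false until struct support is added
```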

  private def findWiderCommonType(types: Seq[DataType]): Option[DataType] = {
-   types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match {
-     case Some(d) => findWiderTypeForTwo(d, c)
-     case None => None
-   })
+   // findWiderTypeForTwo doesn't satisfy the associative law, i.e. (a op b) op c may not be
+   // equal to a op (b op c). This is only a problem for StringType or nested StringType in
+   // ArrayType. Excluding these types, findWiderTypeForTwo satisfies the associative law.
+   // For instance, (TimestampType, IntegerType, StringType) should have StringType as the
+   // wider common type.
+   val (stringTypes, nonStringTypes) = types.partition(hasStringType(_))
+   (stringTypes.distinct ++ nonStringTypes).foldLeft[Option[DataType]](Some(NullType))((r, c) =>
+     r match {
+       case Some(d) => findWiderTypeForTwo(d, c)
+       case _ => None
+     })
  }
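To see why partitioning the string types to the front matters, here is a self-contained toy model (my own sketch, not Spark's actual type lattice or API) of a widening function with the same non-associative shape as `findWiderTypeForTwo`:

```scala
sealed trait ToyType
case object IntT extends ToyType
case object TimestampT extends ToyType
case object StringT extends ToyType

// StringT absorbs everything; IntT and TimestampT have no wider common type,
// so a plain left fold over the types is sensitive to argument order.
def widerForTwo(a: ToyType, b: ToyType): Option[ToyType] = (a, b) match {
  case (x, y) if x == y            => Some(x)
  case (StringT, _) | (_, StringT) => Some(StringT)
  case _                           => None
}

// Assumes a non-empty input (a sketch simplification; Spark seeds with NullType).
def widest(types: Seq[ToyType]): Option[ToyType] =
  types.foldLeft[Option[ToyType]](Some(types.head))((r, c) => r.flatMap(widerForTwo(_, c)))

widest(Seq(TimestampT, IntT, StringT))  // None: the fold fails at (TimestampT, IntT)
widest(Seq(StringT, TimestampT, IntT))  // Some(StringT): strings-first ordering succeeds
```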

/**
@@ -539,6 +539,9 @@ class TypeCoercionSuite extends AnalysisTest {
val floatLit = Literal.create(1.0f, FloatType)
val timestampLit = Literal.create("2017-04-12", TimestampType)
val decimalLit = Literal(new java.math.BigDecimal("1000000000000000000000"))
val tsArrayLit = Literal(Array(new Timestamp(System.currentTimeMillis())))
val strArrayLit = Literal(Array("c"))
val intArrayLit = Literal(Array(1))

ruleTest(rule,
Coalesce(Seq(doubleLit, intLit, floatLit)),
@@ -572,6 +575,16 @@
Coalesce(Seq(nullLit, floatNullLit, doubleLit, stringLit)),
Coalesce(Seq(Cast(nullLit, StringType), Cast(floatNullLit, StringType),
Cast(doubleLit, StringType), Cast(stringLit, StringType))))

ruleTest(rule,
Coalesce(Seq(timestampLit, intLit, stringLit)),
Coalesce(Seq(Cast(timestampLit, StringType), Cast(intLit, StringType),
Cast(stringLit, StringType))))

ruleTest(rule,
Coalesce(Seq(tsArrayLit, intArrayLit, strArrayLit)),
Coalesce(Seq(Cast(tsArrayLit, ArrayType(StringType)),
Cast(intArrayLit, ArrayType(StringType)), Cast(strArrayLit, ArrayType(StringType)))))
Member:

Could you add an end-to-end test case that can trigger this?

Contributor:

We usually don't add end-to-end tests for type coercion changes. The type coercion logic is quite isolated, so it is very unlikely for a change to pass the unit tests but fail an end-to-end test.
}

test("CreateArray casts") {