docs/sql-migration-guide-upgrade.md (6 changes: 4 additions & 2 deletions)
@@ -17,6 +17,8 @@ displayTitle: Spark SQL Upgrading Guide

- The `ADD JAR` command previously returned a result set with the single value 0. It now returns an empty result set.

- In Spark version 2.4 and earlier, users could create map values with map-type keys via built-in functions such as `CreateMap` and `MapFromArrays`. Since Spark 3.0, creating map values with map-type keys with these built-in functions is no longer allowed. Users can still read map values with map-type keys from data sources or Java/Scala collections, though such values are not very useful.

Member: I think we should also add this note:

> Note that maps with map-type keys still exist, e.g. when read from Parquet files or converted from Scala/Java maps. Spark is not trying to completely forbid maps as map keys, but to avoid creating them by itself.
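A hedged illustration of the resulting behavior in a Spark 3.0 `spark-shell` (the Parquet path is hypothetical):

```scala
// Creating a map-typed key through a built-in function now fails analysis:
spark.sql("SELECT map(map(1, 2), 3)")
// => AnalysisException: ... The key of map cannot be/contain map.

// Reading data that already contains map-typed keys still works:
val df = spark.read.parquet("/tmp/data-with-map-keys")
```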

## Upgrading From Spark SQL 2.3 to 2.4

- In Spark version 2.3 and earlier, the second parameter to the `array_contains` function is implicitly promoted to the element type of the first, array-type parameter. This type promotion can be lossy and may cause `array_contains` to return a wrong result. This problem has been addressed in 2.4 by employing a safer type-promotion mechanism. This can cause some changes in behavior, which are illustrated in the table below.
@@ -117,7 +119,7 @@ displayTitle: Spark SQL Upgrading Guide

- Since Spark 2.4, metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during statistics computation.

- Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and are not reflected by any characters in saved CSV files. For example, the row of `"a", null, "", 1` was written as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to the empty (not quoted) string.
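  A minimal sketch of restoring the pre-2.4 behavior when writing CSV (`df` and the output path are hypothetical):

  ```scala
  df.write
    .option("emptyValue", "")  // write empty strings unquoted, as in 2.3 and earlier
    .csv("/tmp/output")
  ```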

- Since Spark 2.4, the `LOAD DATA` command supports the wildcards `?` and `*`, which match any single character and zero or more characters, respectively. Example: `LOAD DATA INPATH '/tmp/folder*/'` or `LOAD DATA INPATH '/tmp/part-?'`. Special characters like spaces now also work in paths. Example: `LOAD DATA INPATH '/tmp/folder name/'`.

@@ -303,7 +305,7 @@ displayTitle: Spark SQL Upgrading Guide
## Upgrading From Spark SQL 2.1 to 2.2

- Spark 2.1.1 introduced a new configuration key: `spark.sql.hive.caseSensitiveInferenceMode`. It had a default setting of `NEVER_INFER`, which kept behavior identical to 2.1.0. However, Spark 2.2.0 changes this setting's default value to `INFER_AND_SAVE` to restore compatibility with reading Hive metastore tables whose underlying file schemas have mixed-case column names. With the `INFER_AND_SAVE` configuration value, on first access Spark will perform schema inference on any Hive metastore table for which it has not already saved an inferred schema. Note that schema inference can be a very time-consuming operation for tables with thousands of partitions. If compatibility with mixed-case column names is not a concern, you can safely set `spark.sql.hive.caseSensitiveInferenceMode` to `NEVER_INFER` to avoid the initial overhead of schema inference. Note that with the new default `INFER_AND_SAVE` setting, the results of the schema inference are saved as a metastore key for future use. Therefore, the initial schema inference occurs only at a table's first access.
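  A one-line sketch of opting out of the inference overhead (set before the affected tables are first accessed):

  ```scala
  spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
  ```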

- Since Spark 2.2.1 and 2.3.0, the schema is always inferred at runtime when the data source tables have columns that exist in both the partition schema and the data schema. The inferred schema does not have the partitioned columns. When reading the table, Spark respects the partition values of these overlapping columns instead of the values stored in the data source files. In the 2.2.0 and 2.1.x releases, the inferred schema is partitioned but the data of the table is invisible to users (i.e., the result set is empty).

- Since Spark 2.2, view definitions are stored in a different way from prior versions. This may cause Spark to be unable to read views created by prior versions. In such cases, you need to recreate the views using `ALTER VIEW AS` or `CREATE OR REPLACE VIEW AS` with newer Spark versions, as sketched below.
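  For example, a hedged sketch of recreating such a view (the view name and query are hypothetical):

  ```scala
  spark.sql("CREATE OR REPLACE VIEW my_view AS SELECT id, name FROM my_table")
  ```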
@@ -521,13 +521,18 @@ case class MapEntries(child: Expression) extends UnaryExpression with ExpectsInputTypes
case class MapConcat(children: Seq[Expression]) extends ComplexTypeMergingExpression {

   override def checkInputDataTypes(): TypeCheckResult = {
-    var funcName = s"function $prettyName"
+    val funcName = s"function $prettyName"
     if (children.exists(!_.dataType.isInstanceOf[MapType])) {
       TypeCheckResult.TypeCheckFailure(
         s"input to $funcName should all be of type map, but it's " +
           children.map(_.dataType.catalogString).mkString("[", ", ", "]"))
     } else {
-      TypeUtils.checkForSameTypeInputExpr(children.map(_.dataType), funcName)
+      val sameTypeCheck = TypeUtils.checkForSameTypeInputExpr(children.map(_.dataType), funcName)
+      if (sameTypeCheck.isFailure) {
+        sameTypeCheck
+      } else {
+        TypeUtils.checkForMapKeyType(dataType.keyType)
Member: I don't think we need this. The children already should not have map type keys?

Contributor Author: See https://github.com/apache/spark/pull/23045/files#diff-3f19ec3d15dcd8cd42bb25dde1c5c1a9R20. The child may be read from Parquet files, so a map of maps is still possible.

Member: Oh, I see. Thanks!

+      }
     }
   }

@@ -740,7 +745,8 @@ case class MapFromEntries(child: Expression) extends UnaryExpression {
@transient override lazy val dataType: MapType = dataTypeDetails.get._1

   override def checkInputDataTypes(): TypeCheckResult = dataTypeDetails match {
-    case Some(_) => TypeCheckResult.TypeCheckSuccess
+    case Some((mapType, _, _)) =>
+      TypeUtils.checkForMapKeyType(mapType.keyType)
     case None => TypeCheckResult.TypeCheckFailure(s"'${child.sql}' is of " +
       s"${child.dataType.catalogString} type. $prettyName accepts only arrays of pair structs.")
   }
@@ -161,11 +161,11 @@ case class CreateMap(children: Seq[Expression]) extends Expression {
"The given values of function map should all be the same type, but they are " +
values.map(_.dataType.catalogString).mkString("[", ", ", "]"))
} else {
TypeCheckResult.TypeCheckSuccess
TypeUtils.checkForMapKeyType(dataType.keyType)
}
}

-  override def dataType: DataType = {
+  override def dataType: MapType = {
     MapType(
       keyType = TypeCoercion.findCommonTypeDifferentOnlyInNullFlags(keys.map(_.dataType))
         .getOrElse(StringType),
@@ -224,6 +224,16 @@ case class MapFromArrays(left: Expression, right: Expression)

override def inputTypes: Seq[AbstractDataType] = Seq(ArrayType, ArrayType)

+  override def checkInputDataTypes(): TypeCheckResult = {
+    val defaultCheck = super.checkInputDataTypes()
+    if (defaultCheck.isFailure) {
+      defaultCheck
+    } else {
+      val keyType = left.dataType.asInstanceOf[ArrayType].elementType
+      TypeUtils.checkForMapKeyType(keyType)
+    }
+  }

override def dataType: DataType = {
MapType(
keyType = left.dataType.asInstanceOf[ArrayType].elementType,
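With this check in place, a hedged sketch of the user-visible effect of the `MapFromArrays` change (SQL function `map_from_arrays`):

```scala
// The key array may no longer contain map-typed elements:
spark.sql("SELECT map_from_arrays(array(map(1, 2)), array(3))")
// => AnalysisException: ... The key of map cannot be/contain map.
```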
@@ -514,6 +514,10 @@ case class TransformKeys(

override def dataType: DataType = MapType(function.dataType, valueType, valueContainsNull)

+  override def checkInputDataTypes(): TypeCheckResult = {
+    TypeUtils.checkForMapKeyType(function.dataType)
+  }

override def bind(f: (Expression, Seq[(DataType, Boolean)]) => LambdaFunction): TransformKeys = {
copy(function = f(function, (keyType, false) :: (valueType, valueContainsNull) :: Nil))
}
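Likewise, a hedged sketch of the effect of the `TransformKeys` check (SQL function `transform_keys`):

```scala
// The key-transforming lambda may no longer return a map:
spark.sql("SELECT transform_keys(map(1, 2), (k, v) -> map(k, v))")
// => AnalysisException: ... The key of map cannot be/contain map.
```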
@@ -46,12 +46,20 @@ object TypeUtils {
     if (TypeCoercion.haveSameType(types)) {
       TypeCheckResult.TypeCheckSuccess
     } else {
-      return TypeCheckResult.TypeCheckFailure(
+      TypeCheckResult.TypeCheckFailure(
         s"input to $caller should all be the same type, but it's " +
           types.map(_.catalogString).mkString("[", ", ", "]"))
     }
   }

+  def checkForMapKeyType(keyType: DataType): TypeCheckResult = {
+    if (keyType.existsRecursively(_.isInstanceOf[MapType])) {
+      TypeCheckResult.TypeCheckFailure("The key of map cannot be/contain map.")
+    } else {
+      TypeCheckResult.TypeCheckSuccess
+    }
+  }

def getNumeric(t: DataType): Numeric[Any] =
t.asInstanceOf[NumericType].numeric.asInstanceOf[Numeric[Any]]
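A hedged usage sketch of the new `checkForMapKeyType` helper (assuming access to `spark-catalyst` internals; `existsRecursively` is what makes the nested case fail):

```scala
import org.apache.spark.sql.catalyst.util.TypeUtils
import org.apache.spark.sql.types._

TypeUtils.checkForMapKeyType(IntegerType)                        // TypeCheckSuccess
TypeUtils.checkForMapKeyType(MapType(IntegerType, IntegerType))  // TypeCheckFailure: key is a map
TypeUtils.checkForMapKeyType(
  ArrayType(MapType(IntegerType, IntegerType)))                  // TypeCheckFailure: key contains a map
```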

@@ -25,6 +25,7 @@ import scala.util.Random
import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.catalyst.util.DateTimeTestUtils
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types._
@@ -108,32 +109,28 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
}

test("Map Concat") {
-    val m0 = Literal.create(Map("a" -> "1", "b" -> "2"), MapType(StringType, StringType,
+    val m0 = Literal.create(create_map("a" -> "1", "b" -> "2"), MapType(StringType, StringType,
       valueContainsNull = false))
-    val m1 = Literal.create(Map("c" -> "3", "a" -> "4"), MapType(StringType, StringType,
+    val m1 = Literal.create(create_map("c" -> "3", "a" -> "4"), MapType(StringType, StringType,
       valueContainsNull = false))
-    val m2 = Literal.create(Map("d" -> "4", "e" -> "5"), MapType(StringType, StringType))
-    val m3 = Literal.create(Map("a" -> "1", "b" -> "2"), MapType(StringType, StringType))
-    val m4 = Literal.create(Map("a" -> null, "c" -> "3"), MapType(StringType, StringType))
-    val m5 = Literal.create(Map("a" -> 1, "b" -> 2), MapType(StringType, IntegerType))
-    val m6 = Literal.create(Map("a" -> null, "c" -> 3), MapType(StringType, IntegerType))
-    val m7 = Literal.create(Map(List(1, 2) -> 1, List(3, 4) -> 2),
+    val m2 = Literal.create(create_map("d" -> "4", "e" -> "5"), MapType(StringType, StringType))
+    val m3 = Literal.create(create_map("a" -> "1", "b" -> "2"), MapType(StringType, StringType))
+    val m4 = Literal.create(create_map("a" -> null, "c" -> "3"), MapType(StringType, StringType))
+    val m5 = Literal.create(create_map("a" -> 1, "b" -> 2), MapType(StringType, IntegerType))
+    val m6 = Literal.create(create_map("a" -> null, "c" -> 3), MapType(StringType, IntegerType))
+    val m7 = Literal.create(create_map(List(1, 2) -> 1, List(3, 4) -> 2),
       MapType(ArrayType(IntegerType), IntegerType))
-    val m8 = Literal.create(Map(List(5, 6) -> 3, List(1, 2) -> 4),
+    val m8 = Literal.create(create_map(List(5, 6) -> 3, List(1, 2) -> 4),
       MapType(ArrayType(IntegerType), IntegerType))
-    val m9 = Literal.create(Map(Map(1 -> 2, 3 -> 4) -> 1, Map(5 -> 6, 7 -> 8) -> 2),
-      MapType(MapType(IntegerType, IntegerType), IntegerType))
-    val m10 = Literal.create(Map(Map(9 -> 10, 11 -> 12) -> 3, Map(1 -> 2, 3 -> 4) -> 4),
-      MapType(MapType(IntegerType, IntegerType), IntegerType))
-    val m11 = Literal.create(Map(1 -> "1", 2 -> "2"), MapType(IntegerType, StringType,
+    val m9 = Literal.create(create_map(1 -> "1", 2 -> "2"), MapType(IntegerType, StringType,
       valueContainsNull = false))
-    val m12 = Literal.create(Map(3 -> "3", 4 -> "4"), MapType(IntegerType, StringType,
+    val m10 = Literal.create(create_map(3 -> "3", 4 -> "4"), MapType(IntegerType, StringType,
       valueContainsNull = false))
-    val m13 = Literal.create(Map(1 -> 2, 3 -> 4),
+    val m11 = Literal.create(create_map(1 -> 2, 3 -> 4),
       MapType(IntegerType, IntegerType, valueContainsNull = false))
-    val m14 = Literal.create(Map(5 -> 6),
+    val m12 = Literal.create(create_map(5 -> 6),
       MapType(IntegerType, IntegerType, valueContainsNull = false))
-    val m15 = Literal.create(Map(7 -> null),
+    val m13 = Literal.create(create_map(7 -> null),
       MapType(IntegerType, IntegerType, valueContainsNull = true))
     val mNull = Literal.create(null, MapType(StringType, StringType))

@@ -147,7 +144,7 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper

// maps with no overlap
checkEvaluation(MapConcat(Seq(m0, m2)),
Map("a" -> "1", "b" -> "2", "d" -> "4", "e" -> "5"))
create_map("a" -> "1", "b" -> "2", "d" -> "4", "e" -> "5"))

// 3 maps
checkEvaluation(MapConcat(Seq(m0, m1, m2)),
@@ -174,7 +171,7 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
)

// keys that are primitive
-    checkEvaluation(MapConcat(Seq(m11, m12)),
+    checkEvaluation(MapConcat(Seq(m9, m10)),
(
Array(1, 2, 3, 4), // keys
Array("1", "2", "3", "4") // values
@@ -189,20 +186,11 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
)
)

-    // keys that are maps, with overlap
-    checkEvaluation(MapConcat(Seq(m9, m10)),
-      (
-        Array(Map(1 -> 2, 3 -> 4), Map(5 -> 6, 7 -> 8), Map(9 -> 10, 11 -> 12),
-          Map(1 -> 2, 3 -> 4)), // keys
-        Array(1, 2, 3, 4) // values
-      )
-    )

// both keys and value are primitive and valueContainsNull = false
-    checkEvaluation(MapConcat(Seq(m13, m14)), Map(1 -> 2, 3 -> 4, 5 -> 6))
+    checkEvaluation(MapConcat(Seq(m11, m12)), create_map(1 -> 2, 3 -> 4, 5 -> 6))

// both keys and value are primitive and valueContainsNull = true
-    checkEvaluation(MapConcat(Seq(m13, m15)), Map(1 -> 2, 3 -> 4, 7 -> null))
+    checkEvaluation(MapConcat(Seq(m11, m13)), create_map(1 -> 2, 3 -> 4, 7 -> null))

// null map
checkEvaluation(MapConcat(Seq(m0, mNull)), null)
@@ -211,7 +199,7 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
checkEvaluation(MapConcat(Seq(mNull)), null)

// single map
-    checkEvaluation(MapConcat(Seq(m0)), Map("a" -> "1", "b" -> "2"))
+    checkEvaluation(MapConcat(Seq(m0)), create_map("a" -> "1", "b" -> "2"))

// no map
checkEvaluation(MapConcat(Seq.empty), Map.empty)
@@ -245,12 +233,12 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
assert(MapConcat(Seq(m1, mNull)).nullable)

val mapConcat = MapConcat(Seq(
-      Literal.create(Map(Seq(1, 2) -> Seq("a", "b")),
+      Literal.create(create_map(Seq(1, 2) -> Seq("a", "b")),
MapType(
ArrayType(IntegerType, containsNull = false),
ArrayType(StringType, containsNull = false),
valueContainsNull = false)),
-      Literal.create(Map(Seq(3, 4, null) -> Seq("c", "d", null), Seq(6) -> null),
+      Literal.create(create_map(Seq(3, 4, null) -> Seq("c", "d", null), Seq(6) -> null),
MapType(
ArrayType(IntegerType, containsNull = true),
ArrayType(StringType, containsNull = true),
@@ -264,6 +252,18 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
Seq(1, 2) -> Seq("a", "b"),
Seq(3, 4, null) -> Seq("c", "d", null),
Seq(6) -> null))

+    // map key can't be map
+    val mapOfMap = Literal.create(Map(Map(1 -> 2, 3 -> 4) -> 1, Map(5 -> 6, 7 -> 8) -> 2),
+      MapType(MapType(IntegerType, IntegerType), IntegerType))
+    val mapOfMap2 = Literal.create(Map(Map(9 -> 10, 11 -> 12) -> 3, Map(1 -> 2, 3 -> 4) -> 4),
+      MapType(MapType(IntegerType, IntegerType), IntegerType))
+    val map = MapConcat(Seq(mapOfMap, mapOfMap2))
+    map.checkInputDataTypes() match {
+      case TypeCheckResult.TypeCheckSuccess => fail("should not allow map as map key")
+      case TypeCheckResult.TypeCheckFailure(msg) =>
+        assert(msg.contains("The key of map cannot be/contain map"))
+    }
}

test("MapFromEntries") {
@@ -274,20 +274,20 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
StructField("b", valueType))),
true)
}
-    def r(values: Any*): InternalRow = create_row(values: _*)
+    def row(values: Any*): InternalRow = create_row(values: _*)

// Primitive-type keys and values
val aiType = arrayType(IntegerType, IntegerType)
-    val ai0 = Literal.create(Seq(r(1, 10), r(2, 20), r(3, 20)), aiType)
-    val ai1 = Literal.create(Seq(r(1, null), r(2, 20), r(3, null)), aiType)
+    val ai0 = Literal.create(Seq(row(1, 10), row(2, 20), row(3, 20)), aiType)
+    val ai1 = Literal.create(Seq(row(1, null), row(2, 20), row(3, null)), aiType)
val ai2 = Literal.create(Seq.empty, aiType)
val ai3 = Literal.create(null, aiType)
-    val ai4 = Literal.create(Seq(r(1, 10), r(1, 20)), aiType)
-    val ai5 = Literal.create(Seq(r(1, 10), r(null, 20)), aiType)
-    val ai6 = Literal.create(Seq(null, r(2, 20), null), aiType)
+    val ai4 = Literal.create(Seq(row(1, 10), row(1, 20)), aiType)
+    val ai5 = Literal.create(Seq(row(1, 10), row(null, 20)), aiType)
+    val ai6 = Literal.create(Seq(null, row(2, 20), null), aiType)

-    checkEvaluation(MapFromEntries(ai0), Map(1 -> 10, 2 -> 20, 3 -> 20))
-    checkEvaluation(MapFromEntries(ai1), Map(1 -> null, 2 -> 20, 3 -> null))
+    checkEvaluation(MapFromEntries(ai0), create_map(1 -> 10, 2 -> 20, 3 -> 20))
+    checkEvaluation(MapFromEntries(ai1), create_map(1 -> null, 2 -> 20, 3 -> null))
checkEvaluation(MapFromEntries(ai2), Map.empty)
checkEvaluation(MapFromEntries(ai3), null)
checkEvaluation(MapKeys(MapFromEntries(ai4)), Seq(1, 1))
@@ -298,23 +298,36 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper

// Non-primitive-type keys and values
val asType = arrayType(StringType, StringType)
val as0 = Literal.create(Seq(r("a", "aa"), r("b", "bb"), r("c", "bb")), asType)
val as1 = Literal.create(Seq(r("a", null), r("b", "bb"), r("c", null)), asType)
val as0 = Literal.create(Seq(row("a", "aa"), row("b", "bb"), row("c", "bb")), asType)
val as1 = Literal.create(Seq(row("a", null), row("b", "bb"), row("c", null)), asType)
val as2 = Literal.create(Seq.empty, asType)
val as3 = Literal.create(null, asType)
val as4 = Literal.create(Seq(r("a", "aa"), r("a", "bb")), asType)
val as5 = Literal.create(Seq(r("a", "aa"), r(null, "bb")), asType)
val as6 = Literal.create(Seq(null, r("b", "bb"), null), asType)
val as4 = Literal.create(Seq(row("a", "aa"), row("a", "bb")), asType)
val as5 = Literal.create(Seq(row("a", "aa"), row(null, "bb")), asType)
val as6 = Literal.create(Seq(null, row("b", "bb"), null), asType)

-    checkEvaluation(MapFromEntries(as0), Map("a" -> "aa", "b" -> "bb", "c" -> "bb"))
-    checkEvaluation(MapFromEntries(as1), Map("a" -> null, "b" -> "bb", "c" -> null))
+    checkEvaluation(MapFromEntries(as0), create_map("a" -> "aa", "b" -> "bb", "c" -> "bb"))
+    checkEvaluation(MapFromEntries(as1), create_map("a" -> null, "b" -> "bb", "c" -> null))
checkEvaluation(MapFromEntries(as2), Map.empty)
checkEvaluation(MapFromEntries(as3), null)
checkEvaluation(MapKeys(MapFromEntries(as4)), Seq("a", "a"))
-    checkEvaluation(MapFromEntries(as6), null)

// Map key can't be null
checkExceptionInExpression[RuntimeException](
MapFromEntries(as5),
"The first field from a struct (key) can't be null.")
+    checkEvaluation(MapFromEntries(as6), null)

+    // map key can't be map
+    val structOfMap = row(create_map(1 -> 1), 1)
+    val map = MapFromEntries(Literal.create(
+      Seq(structOfMap),
+      arrayType(keyType = MapType(IntegerType, IntegerType), valueType = IntegerType)))
+    map.checkInputDataTypes() match {
+      case TypeCheckResult.TypeCheckSuccess => fail("should not allow map as map key")
+      case TypeCheckResult.TypeCheckFailure(msg) =>
+        assert(msg.contains("The key of map cannot be/contain map"))
+    }
}

test("Sort Array") {