Skip to content
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
80 commits
Select commit Hold shift + click to select a range
b34544a
Implicit casting on collated expressions
mihailomilosevic2001 Mar 5, 2024
fdbfa44
Fix doc files
mihailomilosevic2001 Mar 5, 2024
ce9b027
Fix contains, startWith, endWith tests
mihailomilosevic2001 Mar 5, 2024
e537190
Fix imports
mihailomilosevic2001 Mar 5, 2024
b5a79c1
Fix docs and incorporate changes
mihailomilosevic2001 Mar 6, 2024
8321d0c
Fix tests in CollationSuite
mihailomilosevic2001 Mar 6, 2024
d178233
Add test and incorporate changes
mihailomilosevic2001 Mar 7, 2024
a4b9be7
Fix godlen files
mihailomilosevic2001 Mar 7, 2024
a6e7662
Incorporate StringType in findWiderCommonType
mihailomilosevic2001 Mar 8, 2024
e1d7ad5
Merge branch 'master' into SPARK-47210
mihailomilosevic2001 Mar 8, 2024
b3b1356
Fix ArrayType(StringType, _) casting in findWiderCommonType
mihailomilosevic2001 Mar 11, 2024
7773d13
Fix type mismatch error
mihailomilosevic2001 Mar 11, 2024
198a728
Merge branch 'apache:master' into SPARK-47210
mihailomilosevic2001 Mar 11, 2024
255b1ab
Incorporate changes and fix errors
mihailomilosevic2001 Mar 11, 2024
9ce417f
Merge branch 'master' into SPARK-47210
mihailomilosevic2001 Mar 12, 2024
50f3aa2
Fix errors
mihailomilosevic2001 Mar 12, 2024
ca0c84d
Rework casting
mihailomilosevic2001 Mar 13, 2024
880a1b1
Merge branch 'master' into SPARK-47210
mihailomilosevic2001 Mar 13, 2024
56d6c7c
Fix failing tests
mihailomilosevic2001 Mar 14, 2024
94e5259
Fix array cast errors
mihailomilosevic2001 Mar 14, 2024
ccb52ba
Fix additional errors
mihailomilosevic2001 Mar 14, 2024
9b1387b
Fix explicit collation search
mihailomilosevic2001 Mar 17, 2024
c9974e1
Fix scala style errors
mihailomilosevic2001 Mar 18, 2024
fca9a65
Add support for ImplicitCastInputTypes
mihailomilosevic2001 Mar 18, 2024
660d664
Fix accidental change in license header
mihailomilosevic2001 Mar 18, 2024
c8edd93
Fix null casting
mihailomilosevic2001 Mar 19, 2024
a91490b
Fix failing tests
mihailomilosevic2001 Mar 19, 2024
49a8d61
Move implicit casting when strings present
mihailomilosevic2001 Mar 19, 2024
4c4cd84
Fix unintentional changes
mihailomilosevic2001 Mar 19, 2024
66122a6
improve types.py
mihailomilosevic2001 Mar 20, 2024
50f46e4
Refactor code
mihailomilosevic2001 Mar 21, 2024
cc86a87
Merge branch 'master' into SPARK-47210
mihailomilosevic2001 Mar 21, 2024
c01e80c
Fix imports and failing tests
mihailomilosevic2001 Mar 21, 2024
cc797a2
Disable casting of StructTypes
mihailomilosevic2001 Mar 21, 2024
5d001ee
Fix imports
mihailomilosevic2001 Mar 21, 2024
c68fc7d
Fix concat tests
mihailomilosevic2001 Mar 21, 2024
1c926ab
Fix unnecessary repetition
mihailomilosevic2001 Mar 21, 2024
dec39bf
Remove Elt test
mihailomilosevic2001 Mar 21, 2024
e808446
Remove tests for Repeat
mihailomilosevic2001 Mar 21, 2024
ca1a23a
Merge branch 'master' into SPARK-47210
mihailomilosevic2001 Mar 21, 2024
116931c
Merge branch 'apache:master' into SPARK-47210
mihailomilosevic2001 Mar 22, 2024
af487a2
Fix failing tests
mihailomilosevic2001 Mar 22, 2024
4ba7055
Fix nullability for StringType->StringType
mihailomilosevic2001 Mar 22, 2024
e490e42
Improve comments and switch tests from E2E to unit tests
mihailomilosevic2001 Mar 24, 2024
00e88e7
Add new tests and remove compatibility test
mihailomilosevic2001 Mar 25, 2024
85b4d16
Fix conflict resolution mistake
mihailomilosevic2001 Mar 25, 2024
30f7225
Merge branch 'apache:master' into SPARK-47210
mihailomilosevic2001 Mar 25, 2024
e89a354
Add indeterminate collation tests
mihailomilosevic2001 Mar 26, 2024
788dc06
Fix test
mihailomilosevic2001 Mar 26, 2024
75c0140
Block Alias on Indeterminate
mihailomilosevic2001 Mar 27, 2024
2918413
Merge remote-tracking branch 'upstream/master' into SPARK-47210
mihailomilosevic2001 Mar 28, 2024
f6ed55a
Remove introduction of indeterminate collation
mihailomilosevic2001 Mar 28, 2024
98960c0
Fix import problem
mihailomilosevic2001 Mar 28, 2024
de623c8
Fix failing tests
mihailomilosevic2001 Mar 28, 2024
a92b4e1
Fix pyspark error
mihailomilosevic2001 Mar 28, 2024
f7f3011
Merge branch 'apache:master' into SPARK-47210
mihailomilosevic2001 Mar 28, 2024
f67808e
Fix errors
mihailomilosevic2001 Mar 29, 2024
815ce42
Fix schema error
mihailomilosevic2001 Mar 29, 2024
7fca38a
Merge remote-tracking branch 'upstream/master' into SPARK-47210
mihailomilosevic2001 Mar 29, 2024
b19b0eb
Fix collated tests
mihailomilosevic2001 Mar 29, 2024
a111f03
Add isExplicit flag
mihailomilosevic2001 Mar 29, 2024
55bdd9b
Fix import error
mihailomilosevic2001 Mar 29, 2024
a7228be
Fix imports in TypeCoercion
mihailomilosevic2001 Mar 31, 2024
27a72c6
Merge remote-tracking branch 'upstream/master' into SPARK-47210
mihailomilosevic2001 Apr 1, 2024
18ada04
Add support for explicit propagation in arrays
mihailomilosevic2001 Apr 1, 2024
38670af
Fix tests to follow recent changes
mihailomilosevic2001 Apr 1, 2024
01d891e
Incorporate changes
mihailomilosevic2001 Apr 1, 2024
c5daf86
Fix error
mihailomilosevic2001 Apr 1, 2024
9ac5678
Change var to val in StringType
mihailomilosevic2001 Apr 1, 2024
0f1757d
Fix import style
mihailomilosevic2001 Apr 1, 2024
506c8c0
Revert explicit flag addition
mihailomilosevic2001 Apr 1, 2024
f743cf8
Narrow down expressions casting
mihailomilosevic2001 Apr 2, 2024
4f8fe1d
Incorporate minor changes
mihailomilosevic2001 Apr 2, 2024
52bf4dc
Incorporate changes
mihailomilosevic2001 Apr 2, 2024
7cbeafe
Special case expressions
mihailomilosevic2001 Apr 3, 2024
3e92e92
Return new line
mihailomilosevic2001 Apr 3, 2024
b23e106
Remove indentation cosmetic
mihailomilosevic2001 Apr 3, 2024
880ebed
Add more cosmetic changes
mihailomilosevic2001 Apr 3, 2024
f96ecd9
Incorporate changes
mihailomilosevic2001 Apr 3, 2024
e1e0cf4
Merge branch 'apache:master' into SPARK-47210
mihailomilosevic2001 Apr 3, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Narrow down expressions casting
  • Loading branch information
mihailomilosevic2001 committed Apr 2, 2024
commit f743cf831d799d6be4278f95cfab73025291784a
Original file line number Diff line number Diff line change
Expand Up @@ -77,7 +77,7 @@ object AnsiTypeCoercion extends TypeCoercionBase {
UnpivotCoercion ::
WidenSetOperationTypes ::
new AnsiCombinedTypeCoercionRule(
CollationTypeCasts ::
PreCollationTypeCasts ::
InConversion ::
PromoteStrings ::
DecimalPrecision ::
Expand All @@ -93,7 +93,8 @@ object AnsiTypeCoercion extends TypeCoercionBase {
ImplicitTypeCasts ::
DateTimeOperations ::
WindowFrameCoercion ::
GetDateFieldOperations:: Nil) :: Nil
GetDateFieldOperations ::
PostCollationTypeCasts :: Nil) :: Nil

val findTightestCommonType: (DataType, DataType) => Option[DataType] = {
case (t1, t2) if t1 == t2 => Some(t1)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -22,25 +22,12 @@ import javax.annotation.Nullable
import scala.annotation.tailrec

import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, Collate, ComplexTypeMergingExpression, CreateArray, Elt, ExpectsInputTypes, Expression, Predicate, SortOrder}
import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, Cast, Collate, ComplexTypeMergingExpression, ConcatWs, CreateArray, Expression, In, InSubquery, Substring}
import org.apache.spark.sql.errors.QueryCompilationErrors
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, StringType}

object CollationTypeCasts extends TypeCoercionRule {
override val transform: PartialFunction[Expression, Expression] = {
case e if !e.childrenResolved => e
case sc @ (_: BinaryExpression
| _: ComplexTypeMergingExpression
| _: CreateArray
| _: Elt
| _: ExpectsInputTypes
| _: Predicate
| _: SortOrder) =>
val newChildren = collateToSingleType(sc.children)
sc.withNewChildren(newChildren)
}

abstract class CollationTypeCasts extends TypeCoercionRule {
/**
* Extracts StringTypes from filtered hasStringType
*/
Expand Down Expand Up @@ -104,6 +91,7 @@ object CollationTypeCasts extends TypeCoercionRule {
val implicitTypes = expr.map(_.dataType)
.filter(hasStringType)
.map(extractStringType)
.filter(dt => dt.collationId != SQLConf.get.defaultStringType.collationId)

if (hasMultipleImplicits(implicitTypes)) {
throw QueryCompilationErrors.implicitCollationMismatchError()
Expand All @@ -127,3 +115,38 @@ object CollationTypeCasts extends TypeCoercionRule {
.filter(dt => !(dt == SQLConf.get.defaultStringType.collationId)).distinct.size > 1

}

/**
* This rule is used to collate all existing expressions related to StringType into a single
* collation. Arrays are handled using their elementType and should be cast for these expressions.
*/
object PreCollationTypeCasts extends CollationTypeCasts {
override val transform: PartialFunction[Expression, Expression] = {
case e if !e.childrenResolved => e
case sc@(_: In
| _: InSubquery
| _: CreateArray
| _: ComplexTypeMergingExpression
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's narrow down the scope. Not all ComplexTypeMergingExpression expressions require all its inputs to be the same type. Please follow existing implicit cast rules and match If, CaseWhen, etc. explicitly.

| _: ArrayJoin
| _: BinaryExpression
| _: ConcatWs
| _: Substring) =>
val newChildren = collateToSingleType(sc.children)
sc.withNewChildren(newChildren)
}
}

/**
* This rule is used for managing expressions that have possible implicit casts from different
* types in ImplicitTypeCasts rule.
*/
object PostCollationTypeCasts extends CollationTypeCasts {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this rule need two phases?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catalyst rules are executed with a loop, it's OK to wait for the next iteration and adjust the implicit casts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some expressions might be cast from different types, e.g. (NullType or IntegerType) to StringType, and we want to make sure these are picked up and accordingly cast to the proper collation for expressions that lie into that category (i.e. expressions that extend ImpicitInputTypeCast and ExpectsInputType, as well as BinaryExpressions). It is either this or fine-picking these expressions in ImplicitTypeCasts, but this approach seemed more appropriate. What do you think? (We only do this for ArrayJoin, Substring and BinaryExpressions, as these are the only supported ones for collations so far)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't PreCollationTypeCasts alone do the work?

override val transform: PartialFunction[Expression, Expression] = {
case e if !e.childrenResolved => e
case sc@(_: ArrayJoin
| _: BinaryExpression
| _: Substring) =>
val newChildren = collateToSingleType(sc.children)
sc.withNewChildren(newChildren)
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -23,8 +23,7 @@ import scala.annotation.tailrec
import scala.collection.mutable

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.analysis.CollationTypeCasts.{castStringType, getOutputCollation}
import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
import org.apache.spark.sql.catalyst.analysis.PreCollationTypeCasts.getOutputCollation
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.expressions.aggregate._
import org.apache.spark.sql.catalyst.plans.logical._
Expand Down Expand Up @@ -660,9 +659,8 @@ abstract class TypeCoercionBase {
val newIndex = implicitCast(index, IntegerType).getOrElse(index)
val newInputs = if (conf.eltOutputAsString ||
!children.tail.map(_.dataType).forall(_ == BinaryType)) {
val st = getOutputCollation(children)
children.tail.map { e =>
implicitCast(e, st).getOrElse(e)
implicitCast(e, SQLConf.get.defaultStringType).getOrElse(e)
}
} else {
children.tail
Expand Down Expand Up @@ -707,39 +705,27 @@ abstract class TypeCoercionBase {
}.getOrElse(b) // If there is no applicable conversion, leave expression unchanged.

case e: ImplicitCastInputTypes if e.inputTypes.nonEmpty =>
val childrenBeforeCollations: Seq[Expression] = e.children.zip(e.inputTypes).map {
val children: Seq[Expression] = e.children.zip(e.inputTypes).map {
// If we cannot do the implicit cast, just use the original input.
case (in, expected) => implicitCast(in, expected).getOrElse(in)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please revert these cosmetic changes.

}
val st = getOutputCollation(e.children)
val children: Seq[Expression] = childrenBeforeCollations.map {
case in if hasStringType(in.dataType) =>
castStringType(in, st).getOrElse(in)
case in => in
}
e.withNewChildren(children)

case e: ExpectsInputTypes if e.inputTypes.nonEmpty =>
// Convert NullType into some specific target type for ExpectsInputTypes that don't do
// general implicit casting.
val childrenBeforeCollations: Seq[Expression] =
val children: Seq[Expression] =
e.children.zip(e.inputTypes).map { case (in, expected) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

if (in.dataType == NullType && !expected.acceptsType(NullType)) {
Literal.create(null, expected.defaultConcreteType)
} else {
in
}
}
val st = getOutputCollation(e.children)
val children: Seq[Expression] = childrenBeforeCollations.map {
case in if hasStringType(in.dataType) =>
castStringType(in, st).getOrElse(in)
case in => in
}
e.withNewChildren(children)

case udf: ScalaUDF if udf.inputTypes.nonEmpty =>
val childrenBeforeCollations = udf.children.zip(udf.inputTypes).map { case (in, expected) =>
val children = udf.children.zip(udf.inputTypes).map { case (in, expected) =>
// Currently Scala UDF will only expect `AnyDataType` at top level, so this trick works.
// In the future we should create types like `AbstractArrayType`, so that Scala UDF can
// accept inputs of array type of arbitrary element type.
Expand All @@ -752,12 +738,6 @@ abstract class TypeCoercionBase {
).getOrElse(in)
}
}
val st = getOutputCollation(udf.children)
val children: Seq[Expression] = childrenBeforeCollations.map {
case in if hasStringType(in.dataType) =>
castStringType(in, st).getOrElse(in)
case in => in
}
udf.copy(children = children)
}

Expand Down Expand Up @@ -860,7 +840,7 @@ object TypeCoercion extends TypeCoercionBase {
UnpivotCoercion ::
WidenSetOperationTypes ::
new CombinedTypeCoercionRule(
CollationTypeCasts ::
PreCollationTypeCasts ::
InConversion ::
PromoteStrings ::
DecimalPrecision ::
Expand All @@ -877,7 +857,8 @@ object TypeCoercion extends TypeCoercionBase {
ImplicitTypeCasts ::
DateTimeOperations ::
WindowFrameCoercion ::
StringLiteralCoercion :: Nil) :: Nil
StringLiteralCoercion ::
PostCollationTypeCasts :: Nil) :: Nil

override def canCast(from: DataType, to: DataType): Boolean = Cast.canCast(from, to)

Expand Down