Skip to content
Closed
Show file tree
Hide file tree
Changes from 7 commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
f59a213
BinaryComparison shouldn't auto cast string to int/long
wangyum Aug 5, 2017
bc83848
follow hive
wangyum Sep 10, 2017
cedb239
Remove useless code
wangyum Sep 10, 2017
8d37c72
Merge remote-tracking branch 'origin/master' into SPARK-21646
wangyum Sep 10, 2017
522c4cd
Fix test error
wangyum Sep 11, 2017
3bec6a2
Fix SQLQueryTestSuite test error
wangyum Sep 11, 2017
844aec7
Add spark.sql.binary.comparison.compatible.with.hive conf.
wangyum Sep 18, 2017
7812018
spark.sql.binary.comparison.compatible.with.hive -> spark.sql.autoTyp…
wangyum Sep 19, 2017
27d5b13
spark.sql.autoTypeCastingCompatibility -> spark.sql.typeCoercion.mode
wangyum Sep 20, 2017
53d673f
default -> legacy
wangyum Oct 7, 2017
8da0cf0
Fix test error
wangyum Oct 7, 2017
2ada11a
Refactor TypeCoercionModeSuite
wangyum Oct 9, 2017
d34f294
Add IN test suite
wangyum Oct 12, 2017
b99fb60
Merge branch 'master' into SPARK-21646
wangyum Nov 10, 2017
0d9cf69
legacy -> default
wangyum Nov 14, 2017
22d0355
Merge remote-tracking branch 'upstream/master' into SPARK-21646
wangyum Nov 14, 2017
558ff90
Merge remote-tracking branch 'upstream/master' into SPARK-21646
wangyum Dec 5, 2017
663eb35
Update doc
wangyum Dec 6, 2017
7802483
Remove duplicate InConversion
wangyum Dec 6, 2017
dffe5d2
Merge remote-tracking branch 'upstream/master' into SPARK-21646
wangyum Jan 9, 2018
97a071d
Merge SPARK-22894 to Hive mode.
wangyum Jan 9, 2018
408e889
InConversion -> NativeInConversion; PromoteStrings -> NativePromoteSt…
wangyum Jan 9, 2018
e763330
Lost WindowFrameCoercion
wangyum Jan 9, 2018
81067b9
Merge remote-tracking branch 'upstream/master' into SPARK-21646
wangyum Mar 31, 2018
d0a2089
Since Spark 2.3 -> Since Spark 2.4
wangyum Jun 10, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions docs/sql-programming-guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -968,6 +968,13 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
</p>
</td>
</tr>
<tr>
<td><code>spark.sql.binary.comparison.compatible.with.hive</code></td>
<td>true</td>
<td>
Whether compatible with Hive when binary comparison.
</td>
</tr>
</table>

## JSON Datasets
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -132,6 +132,28 @@ object TypeCoercion {
case (l, r) => None
}

/**
* Follow hive's binary comparison action:
* https://github.com/apache/hive/blob/rel/storage-release-2.4.0/ql/src/java/
* org/apache/hadoop/hive/ql/exec/FunctionRegistry.java#L781
*/
val findCommonTypeCompatibleWithHive: (DataType, DataType) =>
Option[DataType] = {
case (StringType, DateType) => Some(DateType)
case (DateType, StringType) => Some(DateType)
case (StringType, TimestampType) => Some(TimestampType)
case (TimestampType, StringType) => Some(TimestampType)
case (TimestampType, DateType) => Some(TimestampType)
case (DateType, TimestampType) => Some(TimestampType)
case (StringType, NullType) => Some(StringType)
case (NullType, StringType) => Some(StringType)
case (StringType | TimestampType, r: NumericType) => Some(DoubleType)
case (l: NumericType, StringType | TimestampType) => Some(DoubleType)
case (l: StringType, r: AtomicType) if r != StringType => Some(r)
case (l: AtomicType, r: StringType) if l != StringType => Some(l)
case (l, r) => None
}

/**
* Case 2 type widening (see the classdoc comment above for TypeCoercion).
*
Expand Down Expand Up @@ -352,11 +374,16 @@ object TypeCoercion {
p.makeCopy(Array(Cast(left, TimestampType), right))
case p @ Equality(left @ TimestampType(), right @ StringType()) =>
p.makeCopy(Array(left, Cast(right, TimestampType)))

case p @ BinaryComparison(left, right)
if findCommonTypeForBinaryComparison(left.dataType, right.dataType).isDefined =>
if !plan.conf.binaryComparisonCompatibleWithHive &&
findCommonTypeForBinaryComparison(left.dataType, right.dataType).isDefined =>
val commonType = findCommonTypeForBinaryComparison(left.dataType, right.dataType).get
p.makeCopy(Array(castExpr(left, commonType), castExpr(right, commonType)))
case p @ BinaryComparison(left, right)
if plan.conf.binaryComparisonCompatibleWithHive &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hard to maintain and debug. Instead of mixing them together, could you separate it from the current rule Spark uses?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs 2 PromoteStrings and 2 InConversion if separate from current rule, So I merge the logic to findCommonTypeForBinaryComparison.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine to have two separate rules, but we can call the shared functions.

findCommonTypeCompatibleWithHive(left.dataType, right.dataType).isDefined =>
val commonType = findCommonTypeCompatibleWithHive(left.dataType, right.dataType).get
p.makeCopy(Array(castExpr(left, commonType), castExpr(right, commonType)))

case Abs(e @ StringType()) => Abs(Cast(e, DoubleType))
case Sum(e @ StringType()) => Sum(Cast(e, DoubleType))
Expand Down Expand Up @@ -411,8 +438,13 @@ object TypeCoercion {
val rhs = sub.output

val commonTypes = lhs.zip(rhs).flatMap { case (l, r) =>
findCommonTypeForBinaryComparison(l.dataType, r.dataType)
.orElse(findTightestCommonType(l.dataType, r.dataType))
if (plan.conf.binaryComparisonCompatibleWithHive) {
findCommonTypeCompatibleWithHive(l.dataType, r.dataType)
.orElse(findTightestCommonType(l.dataType, r.dataType))
} else {
findCommonTypeForBinaryComparison(l.dataType, r.dataType)
.orElse(findTightestCommonType(l.dataType, r.dataType))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you still try to avoid duplicating the codes?

}
}

// The number of columns/expressions must match between LHS and RHS of an
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -925,6 +925,12 @@ object SQLConf {
.intConf
.createWithDefault(10000)

val BINARY_COMPARISON_COMPATIBLE_WITH_HIVE =
buildConf("spark.sql.binary.comparison.compatible.with.hive")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> spark.sql.autoTypeCastingCompatibility

.doc("Whether compatible with Hive when binary comparison.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Binary comparison is just one of the implicit type casting cases.

.booleanConf
.createWithDefault(true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has to be false.


object Deprecated {
val MAPRED_REDUCE_TASKS = "mapred.reduce.tasks"
}
Expand Down Expand Up @@ -1203,6 +1209,9 @@ class SQLConf extends Serializable with Logging {

def arrowMaxRecordsPerBatch: Int = getConf(ARROW_EXECUTION_MAX_RECORDS_PER_BATCH)

def binaryComparisonCompatibleWithHive: Boolean =
getConf(SQLConf.BINARY_COMPARISON_COMPATIBLE_WITH_HIVE)

/** ********************** SQLConf functionality methods ************ */

/** Set Spark SQL configuration properties. */
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@

package org.apache.spark.sql.catalyst.analysis

import java.sql.Timestamp
import java.sql.{Date, Timestamp}

import org.apache.spark.sql.catalyst.analysis.TypeCoercion._
import org.apache.spark.sql.catalyst.dsl.expressions._
Expand Down Expand Up @@ -1101,13 +1101,34 @@ class TypeCoercionSuite extends AnalysisTest {
test("binary comparison with string promotion") {
ruleTest(PromoteStrings,
GreaterThan(Literal("123"), Literal(1)),
GreaterThan(Cast(Literal("123"), IntegerType), Literal(1)))
GreaterThan(Cast(Literal("123"), DoubleType), Cast(Literal(1), DoubleType)))
ruleTest(PromoteStrings,
LessThan(Literal(true), Literal("123")),
LessThan(Literal(true), Cast(Literal("123"), BooleanType)))
ruleTest(PromoteStrings,
EqualTo(Literal(Array(1, 2)), Literal("123")),
EqualTo(Literal(Array(1, 2)), Literal("123")))
ruleTest(PromoteStrings,
GreaterThan(Literal("123"), Literal(1L)),
GreaterThan(Cast(Literal("123"), DoubleType), Cast(Literal(1L), DoubleType)))
ruleTest(PromoteStrings,
GreaterThan(Literal("123"), Literal(0.1)),
GreaterThan(Cast(Literal("123"), DoubleType), Literal(0.1)))

val date1 = Date.valueOf("2017-07-21")
val timestamp1 = Timestamp.valueOf("2017-07-21 23:42:12.123")
ruleTest(PromoteStrings,
GreaterThan(Literal(date1), Literal("2017-07-01")),
GreaterThan(Literal(date1), Cast(Literal("2017-07-01"), DateType)))
ruleTest(PromoteStrings,
GreaterThan(Literal(timestamp1), Literal("2017-07-01")),
GreaterThan(Literal(timestamp1), Cast(Literal("2017-07-01"), TimestampType)))
ruleTest(PromoteStrings,
GreaterThan(Literal(timestamp1), Cast(Literal("2017-07-01"), DateType)),
GreaterThan(Literal(timestamp1), Cast(Cast(Literal("2017-07-01"), DateType), TimestampType)))
ruleTest(PromoteStrings,
GreaterThan(Literal(timestamp1), Literal(1L)),
GreaterThan(Cast(Literal(timestamp1), DoubleType), Cast(Literal(1L), DoubleType)))
}

test("cast WindowFrame boundaries to the type they operate upon") {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -13,47 +13,47 @@ true
-- !query 1
select 1 = '1'
-- !query 1 schema
struct<(1 = CAST(1 AS INT)):boolean>
struct<(CAST(1 AS DOUBLE) = CAST(1 AS DOUBLE)):boolean>
-- !query 1 output
true


-- !query 2
select 1.0 = '1'
-- !query 2 schema
struct<(1.0 = CAST(1 AS DECIMAL(2,1))):boolean>
struct<(CAST(1.0 AS DOUBLE) = CAST(1 AS DOUBLE)):boolean>
-- !query 2 output
true


-- !query 3
select 1 > '1'
-- !query 3 schema
struct<(1 > CAST(1 AS INT)):boolean>
struct<(CAST(1 AS DOUBLE) > CAST(1 AS DOUBLE)):boolean>
-- !query 3 output
false


-- !query 4
select 2 > '1.0'
-- !query 4 schema
struct<(2 > CAST(1.0 AS INT)):boolean>
struct<(CAST(2 AS DOUBLE) > CAST(1.0 AS DOUBLE)):boolean>
-- !query 4 output
true


-- !query 5
select 2 > '2.0'
-- !query 5 schema
struct<(2 > CAST(2.0 AS INT)):boolean>
struct<(CAST(2 AS DOUBLE) > CAST(2.0 AS DOUBLE)):boolean>
-- !query 5 output
false


-- !query 6
select 2 > '2.2'
-- !query 6 schema
struct<(2 > CAST(2.2 AS INT)):boolean>
struct<(CAST(2 AS DOUBLE) > CAST(2.2 AS DOUBLE)):boolean>
-- !query 6 output
false

Expand All @@ -69,39 +69,39 @@ false
-- !query 8
select to_date('2009-07-30 04:17:52') > '2009-07-30 04:17:52'
-- !query 8 schema
struct<(CAST(to_date('2009-07-30 04:17:52') AS STRING) > 2009-07-30 04:17:52):boolean>
struct<(to_date('2009-07-30 04:17:52') > CAST(2009-07-30 04:17:52 AS DATE)):boolean>
-- !query 8 output
false


-- !query 9
select 1 >= '1'
-- !query 9 schema
struct<(1 >= CAST(1 AS INT)):boolean>
struct<(CAST(1 AS DOUBLE) >= CAST(1 AS DOUBLE)):boolean>
-- !query 9 output
true


-- !query 10
select 2 >= '1.0'
-- !query 10 schema
struct<(2 >= CAST(1.0 AS INT)):boolean>
struct<(CAST(2 AS DOUBLE) >= CAST(1.0 AS DOUBLE)):boolean>
-- !query 10 output
true


-- !query 11
select 2 >= '2.0'
-- !query 11 schema
struct<(2 >= CAST(2.0 AS INT)):boolean>
struct<(CAST(2 AS DOUBLE) >= CAST(2.0 AS DOUBLE)):boolean>
-- !query 11 output
true


-- !query 12
select 2.0 >= '2.2'
-- !query 12 schema
struct<(2.0 >= CAST(2.2 AS DECIMAL(2,1))):boolean>
struct<(CAST(2.0 AS DOUBLE) >= CAST(2.2 AS DOUBLE)):boolean>
-- !query 12 output
false

Expand All @@ -117,39 +117,39 @@ true
-- !query 14
select to_date('2009-07-30 04:17:52') >= '2009-07-30 04:17:52'
-- !query 14 schema
struct<(CAST(to_date('2009-07-30 04:17:52') AS STRING) >= 2009-07-30 04:17:52):boolean>
struct<(to_date('2009-07-30 04:17:52') >= CAST(2009-07-30 04:17:52 AS DATE)):boolean>
-- !query 14 output
false
true


-- !query 15
select 1 < '1'
-- !query 15 schema
struct<(1 < CAST(1 AS INT)):boolean>
struct<(CAST(1 AS DOUBLE) < CAST(1 AS DOUBLE)):boolean>
-- !query 15 output
false


-- !query 16
select 2 < '1.0'
-- !query 16 schema
struct<(2 < CAST(1.0 AS INT)):boolean>
struct<(CAST(2 AS DOUBLE) < CAST(1.0 AS DOUBLE)):boolean>
-- !query 16 output
false


-- !query 17
select 2 < '2.0'
-- !query 17 schema
struct<(2 < CAST(2.0 AS INT)):boolean>
struct<(CAST(2 AS DOUBLE) < CAST(2.0 AS DOUBLE)):boolean>
-- !query 17 output
false


-- !query 18
select 2.0 < '2.2'
-- !query 18 schema
struct<(2.0 < CAST(2.2 AS DECIMAL(2,1))):boolean>
struct<(CAST(2.0 AS DOUBLE) < CAST(2.2 AS DOUBLE)):boolean>
-- !query 18 output
true

Expand All @@ -165,39 +165,39 @@ false
-- !query 20
select to_date('2009-07-30 04:17:52') < '2009-07-30 04:17:52'
-- !query 20 schema
struct<(CAST(to_date('2009-07-30 04:17:52') AS STRING) < 2009-07-30 04:17:52):boolean>
struct<(to_date('2009-07-30 04:17:52') < CAST(2009-07-30 04:17:52 AS DATE)):boolean>
-- !query 20 output
true
false


-- !query 21
select 1 <= '1'
-- !query 21 schema
struct<(1 <= CAST(1 AS INT)):boolean>
struct<(CAST(1 AS DOUBLE) <= CAST(1 AS DOUBLE)):boolean>
-- !query 21 output
true


-- !query 22
select 2 <= '1.0'
-- !query 22 schema
struct<(2 <= CAST(1.0 AS INT)):boolean>
struct<(CAST(2 AS DOUBLE) <= CAST(1.0 AS DOUBLE)):boolean>
-- !query 22 output
false


-- !query 23
select 2 <= '2.0'
-- !query 23 schema
struct<(2 <= CAST(2.0 AS INT)):boolean>
struct<(CAST(2 AS DOUBLE) <= CAST(2.0 AS DOUBLE)):boolean>
-- !query 23 output
true


-- !query 24
select 2.0 <= '2.2'
-- !query 24 schema
struct<(2.0 <= CAST(2.2 AS DECIMAL(2,1))):boolean>
struct<(CAST(2.0 AS DOUBLE) <= CAST(2.2 AS DOUBLE)):boolean>
-- !query 24 output
true

Expand All @@ -213,6 +213,6 @@ true
-- !query 26
select to_date('2009-07-30 04:17:52') <= '2009-07-30 04:17:52'
-- !query 26 schema
struct<(CAST(to_date('2009-07-30 04:17:52') AS STRING) <= 2009-07-30 04:17:52):boolean>
struct<(to_date('2009-07-30 04:17:52') <= CAST(2009-07-30 04:17:52 AS DATE)):boolean>
-- !query 26 output
true
Original file line number Diff line number Diff line change
Expand Up @@ -1966,7 +1966,7 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {

test("SPARK-17913: compare long and string type column may return confusing result") {
val df = Seq(123L -> "123", 19157170390056973L -> "19157170390056971").toDF("i", "j")
checkAnswer(df.select($"i" === $"j"), Row(true) :: Row(false) :: Nil)
checkAnswer(df.select($"i" === $"j"), Row(true) :: Row(true) :: Nil)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To compatible with Hive, MySQL and Oracle:
oracle

}

test("SPARK-19691 Calculating percentile of decimal column fails with ClassCastException") {
Expand Down
Loading