Skip to content
Closed
Prev Previous commit
Next Next commit
Address comment. Revert change of InferSchema.
  • Loading branch information
viirya committed May 18, 2016
commit 4fca52c68ae15eb792dab08cea0dc725c3437397
Original file line number Diff line number Diff line change
Expand Up @@ -114,16 +114,7 @@ private[sql] object InferSchema {
// record fields' types have been combined.
NullType

case VALUE_STRING =>
// If there is only one row, the following non-numeric numbers will be incorrectly
// recognized as StringType.
val value = parser.getText
if (value.equals("NaN") ||
value.equals("Infinity") ||
value.equals("-Infinity")) {
return DoubleType
}
return StringType
case VALUE_STRING => StringType
case START_OBJECT =>
val builder = Array.newBuilder[StructField]
while (nextUntil(parser, END_OBJECT)) {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.datasources.json

import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSQLContext
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

/**
* Test cases for various [[JSONOptions]].
Expand Down Expand Up @@ -107,10 +108,11 @@ class JsonParsingOptionsSuite extends QueryTest with SharedSQLContext {
// quoted non-numeric numbers should still work even allowNonNumericNumbers is off.
testCases = Seq("""{"age": "NaN"}""", """{"age": "Infinity"}""", """{"age": "-Infinity"}""")
val tests: Seq[Double => Boolean] = Seq(_.isNaN, _.isPosInfinity, _.isNegInfinity)
val schema = StructType(StructField("age", DoubleType, true) :: Nil)

testCases.zipWithIndex.foreach { case (str, idx) =>
val rdd = spark.sparkContext.parallelize(Seq(str))
val df = spark.read.option("allowNonNumericNumbers", "false").json(rdd)
val df = spark.read.option("allowNonNumericNumbers", "false").schema(schema).json(rdd)

assert(df.schema.head.name == "age")
assert(tests(idx)(df.first().getDouble(0)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why it's double type? Shouldn't it be string if allowNonNumericNumbers is off?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from @rxin comment that we want to support quoted non-numeric numbers when allowNonNumericNumbers is off.

Copy link
Contributor

@cloud-fan cloud-fan May 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't make sense to me. What if users really want to use "NaN" as string?
cc @rxin

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I shouldn't change InferSchem. Tests here are also need to add few doubles. I will update it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other words, it should be number when the field is inferred as double/float type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan I updated it.

Non-quoted non-numeric numbers are parsed as double when the corresponding field is double/float type. This behavior is as same as before this patch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did I say that anywhere?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I don't think that's what I meant there.

Expand All @@ -124,10 +126,10 @@ class JsonParsingOptionsSuite extends QueryTest with SharedSQLContext {
val tests: Seq[Double => Boolean] = Seq(_.isNaN, _.isPosInfinity, _.isNegInfinity,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides, I found that "Inf", "-Inf" seems not working even JsonGenerator.Feature.QUOTE_NON_NUMERIC_NUMBERS is enabled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to upgrade jackson library version in order to support "INF" and "-INF" (case-sensitive).

_.isPosInfinity, _.isNegInfinity, _.isNaN, _.isPosInfinity, _.isNegInfinity,
_.isPosInfinity, _.isNegInfinity)

val schema = StructType(StructField("age", DoubleType, true) :: Nil)
testCases.zipWithIndex.foreach { case (str, idx) =>
val rdd = spark.sparkContext.parallelize(Seq(str))
val df = spark.read.option("allowNonNumericNumbers", "true").json(rdd)
val df = spark.read.option("allowNonNumericNumbers", "true").schema(schema).json(rdd)

assert(df.schema.head.name == "age")
assert(tests(idx)(df.first().getDouble(0)))
Expand Down