@@ -18,7 +18,6 @@
package org.apache.spark.sql.catalyst.json

import java.io.ByteArrayOutputStream
-import java.util.Locale

import scala.collection.mutable.ArrayBuffer
import scala.util.Try
@@ -126,16 +125,13 @@ class JacksonParser(

case VALUE_STRING =>
// Special case handling for NaN and Infinity.
-val value = parser.getText
-val lowerCaseValue = value.toLowerCase(Locale.ROOT)
-if (lowerCaseValue.equals("nan") ||
-lowerCaseValue.equals("infinity") ||
-lowerCaseValue.equals("-infinity") ||
-lowerCaseValue.equals("inf") ||
-lowerCaseValue.equals("-inf")) {
-value.toFloat
Member

This sounds like a bug fix, because toFloat is case sensitive?

Member Author

For input/output, it is not a bug. Please see #17956 (comment) and the PR description.

Member (@gatorsmile, May 12, 2017)

Not sure whether we should follow it. How about our CSV parsers? How about Pandas? Are they case sensitive?

Member Author

Is the JSON format itself related to CSV and Pandas? Please see the discussion in #9759 (comment).

Also, this does not change the existing behaviour.

Member

If possible, we want to be consistent with the others. Could you check Pandas?

Member Author (@HyukjinKwon, May 12, 2017)

Let me check and open a JIRA/PR when I have some time, if there are cases we should handle. Let's leave that out here. It does not seem to block this PR, and I don't want extra changes to hold up this PR like #17901.

Member

This PR changes the behavior, right? Does your test case pass without the code changes?

Member Author

Okay, this adds some cases. I am still not convinced that we should take care of every other case. Let's add the options nanValue, positiveInf and negativeInf in a separate PR, as in our CSV datasource.

Member

Without this PR, "+INF", "INF" and "-INF" do not work.

Member Author

Yea, I added some cases here. I will open another PR/JIRA to add those options to the JSON-related functionality.

-} else {
-throw new RuntimeException(s"Cannot parse $value as FloatType.")
+parser.getText match {
Contributor

One last question: should we be case sensitive here? Can we check other systems like pandas?

Member Author (@HyukjinKwon, May 12, 2017)

To my knowledge, Python at least supports some case-insensitive variants. I would rather handle this through options (discussed in #17956 (comment)) and treat it as a variant. At least, I think we can say here that it follows the Scala + Jackson standard.

Member

It seems to me that we did not really support case-insensitive matching before. Although we check the lower-cased input string, we actually call toFloat on the original string, and "nan".toFloat throws java.lang.NumberFormatException: For input string: "nan".
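
The point above can be reproduced outside Spark. The following is a minimal sketch, assuming a stand-alone copy of the removed branch; `OldFloatCheck` and `oldParseFloat` are hypothetical names used only for illustration and are not part of `JacksonParser`.

```scala
import java.util.Locale
import scala.util.Try

object OldFloatCheck {
  // Hypothetical stand-alone copy of the pre-PR logic: the guard inspects the
  // lower-cased text, but the conversion runs on the original text.
  def oldParseFloat(value: String): Float = {
    val lower = value.toLowerCase(Locale.ROOT)
    if (lower == "nan" || lower == "infinity" || lower == "-infinity" ||
        lower == "inf" || lower == "-inf") {
      // String.toFloat delegates to java.lang.Float.parseFloat, which is
      // case-sensitive and has no "INF" spelling.
      value.toFloat
    } else {
      throw new RuntimeException(s"Cannot parse $value as FloatType.")
    }
  }
}
```

With this sketch, `OldFloatCheck.oldParseFloat("NaN")` returns NaN, while `Try(OldFloatCheck.oldParseFloat("nan"))` and `Try(OldFloatCheck.oldParseFloat("INF"))` both fail with a `NumberFormatException`: the guard lets them through, but the conversion rejects them.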

Member

It seems to me that NaN, Infinity and INF are special values, and matching them case-insensitively does not really make sense.

Member

Actually, here, we are not trying to educate users/customers about what is right or wrong. We are trying to improve the usability. If the other popular platforms accept some formats, we need to follow them even if they are strange to us.

Member Author

I think we should not add all the variants in the default behaviour. We could add the options for those cases. We all already have different ideas. Is it worth adding those cases right now in this PR? This is not the end of fixing this code path.

+case "NaN" | "+NaN" | "-NaN" => Float.NaN
Member

I remember that Jackson's JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS doesn't support "+NaN" and "-NaN".

Member Author

Yea, but it is supported in Scala. These were added to address #17956 (comment), so that we can say we support the Scala + Jackson standard.

scala> "+NaN".toDouble
res2: Double = NaN

scala> "-NaN".toDouble
res3: Double = NaN

+case "+INF" | "INF" | "+Infinity" | "Infinity" => Float.PositiveInfinity
+case "-INF" | "-Infinity" => Float.NegativeInfinity
+case other => Try(other.toFloat).getOrElse {
+throw new RuntimeException(s"Cannot parse $other as FloatType.")
+}
}
}

@@ -146,16 +142,13 @@ class JacksonParser(

case VALUE_STRING =>
// Special case handling for NaN and Infinity.
-val value = parser.getText
-val lowerCaseValue = value.toLowerCase(Locale.ROOT)
-if (lowerCaseValue.equals("nan") ||
-lowerCaseValue.equals("infinity") ||
-lowerCaseValue.equals("-infinity") ||
-lowerCaseValue.equals("inf") ||
-lowerCaseValue.equals("-inf")) {
-value.toDouble
-} else {
-throw new RuntimeException(s"Cannot parse $value as DoubleType.")
+parser.getText match {
+case "NaN" | "+NaN" | "-NaN" => Double.NaN
+case "+INF" | "INF" | "+Infinity" | "Infinity" => Double.PositiveInfinity
+case "-INF" | "-Infinity" => Double.NegativeInfinity
+case other => Try(other.toDouble).getOrElse {
+throw new RuntimeException(s"Cannot parse $other as DoubleType.")
+}
}
}
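
For readers who want to try the new matching logic in isolation, here is a minimal sketch that lifts the DoubleType branch out of the parser. `SpecialFloats` and `parseDouble` are hypothetical names, and the input is assumed to be the already-extracted string token rather than a Jackson parser.

```scala
import scala.util.Try

object SpecialFloats {
  // Stand-alone version of the new VALUE_STRING branch for DoubleType.
  // The spellings are matched exactly, so the check is case-sensitive.
  def parseDouble(text: String): Double = text match {
    case "NaN" | "+NaN" | "-NaN" => Double.NaN
    case "+INF" | "INF" | "+Infinity" | "Infinity" => Double.PositiveInfinity
    case "-INF" | "-Infinity" => Double.NegativeInfinity
    // Anything else falls back to String.toDouble; a failure is rethrown
    // with the same message format as in the diff.
    case other => Try(other.toDouble).getOrElse {
      throw new RuntimeException(s"Cannot parse $other as DoubleType.")
    }
  }
}
```

In this sketch `SpecialFloats.parseDouble("+INF")` is positive infinity, whereas `SpecialFloats.parseDouble("inf")` throws a `RuntimeException` whose message starts with "Cannot parse", which is the behaviour the negative test cases in this PR rely on.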

@@ -20,6 +20,7 @@ package org.apache.spark.sql.execution.datasources.json
import java.io.{File, StringWriter}
import java.nio.charset.StandardCharsets
import java.sql.{Date, Timestamp}
import java.util.Locale

import com.fasterxml.jackson.core.JsonFactory
import org.apache.hadoop.fs.{Path, PathFilter}
@@ -1988,4 +1989,55 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
assert(errMsg.startsWith("The field for corrupt records must be string type and nullable"))
}
}

test("SPARK-18772: Parse special floats correctly") {
val jsons = Seq(
"""{"a": "+INF"}""",
"""{"a": "INF"}""",
"""{"a": "-INF"}""",
Member

What is the difference between line 1995 and line 1997?

Member

Do we want to test +INF?

Member Author

Yes ...

"""{"a": "NaN"}""",
"""{"a": "+NaN"}""",
"""{"a": "-NaN"}""",
"""{"a": "Infinity"}""",
"""{"a": "+Infinity"}""",
"""{"a": "-Infinity"}""")
Member

I do not understand which standards we follow. The above seven values are accepted by toFloat and toDouble?

Member Author

This follows Jackson + Scala. #9759 (comment)

Member

Below is from JsonParser.Feature ALLOW_NON_NUMERIC_NUMBERS

"INF" (for positive infinity), as well as alias of "Infinity"
"-INF" (for negative infinity), alias "-Infinity"
"NaN" (for other not-a-numbers, like result of division by zero)

Then, this PR also adds +INF and +Infinity.

What is the Scala standard?
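
One way to see what "Scala" accepts here: `String.toDouble` delegates to `java.lang.Double.parseDouble`, so the accepted spellings are those of the Java grammar. The sketch below simply probes the nine candidate strings from the test; `ScalaAccepts` is a hypothetical helper name, not something in Spark.

```scala
import scala.util.Try

object ScalaAccepts {
  // The nine spellings exercised by the test case in this PR.
  val candidates: Seq[String] = Seq(
    "NaN", "+NaN", "-NaN",
    "Infinity", "+Infinity", "-Infinity",
    "INF", "+INF", "-INF")

  // Keep only the spellings that String.toDouble parses without throwing.
  def accepted: Seq[String] = candidates.filter(s => Try(s.toDouble).isSuccess)
}
```

`ScalaAccepts.accepted` keeps the signed and unsigned forms of NaN and Infinity and drops the three INF spellings, so the INF variants in this PR come from the Jackson list quoted above plus the added "+INF", not from Scala.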

Member

Assuming we want to follow JsonParser.Feature ALLOW_NON_NUMERIC_NUMBERS, does our JSON option flag allowNonNumericNumbers work?

If we turn it off, these test cases should fail, right?

Member

I think we also can address https://issues.apache.org/jira/browse/SPARK-11753 in this PR, right? Simply accepting all these non-numeric numbers without respecting our flag allowNonNumericNumbers does not sound right to me.

Member

I did not check it, but does """{"a": -Infinity}""" fail?

Member Author

It looks like -Infinity succeeds.

scala> "-Infinity".toDouble
res0: Double = -Infinity

scala> "-Infinity".toFloat
res1: Float = -Infinity

Member

It sounds like Jackson only supports the following four cases.

    val jsons = Seq(
      """{"a": NaN}""",
      """{"a": Infinity}""",
      """{"a": +Infinity}""",
      """{"a": -Infinity}""")

Could you add a separate test case for them? Also add the negative cases too.


// positive cases
val checks: Seq[Double => Boolean] = Seq(
_.isPosInfinity,
_.isPosInfinity,
_.isNegInfinity,
_.isNaN,
_.isNaN,
_.isNaN,
_.isPosInfinity,
_.isPosInfinity,
_.isNegInfinity)

Seq(FloatType, DoubleType).foreach { dt =>
jsons.zip(checks).foreach { case (json, check) =>
val ds = spark.read
.schema(StructType(Seq(StructField("a", dt))))
.json(Seq(json).toDS())
.select($"a".cast(DoubleType)).as[Double]
assert(check(ds.first()))
}
}

// negative cases
Seq(FloatType, DoubleType).foreach { dt =>
val lowerCasedJsons = jsons.map(_.toLowerCase(Locale.ROOT))
// The special floats are case-sensitive so these cases below throw exceptions.
lowerCasedJsons.foreach { lowerCasedJson =>
val e = intercept[SparkException] {
spark.read
.option("mode", "FAILFAST")
.schema(StructType(Seq(StructField("a", dt))))
.json(Seq(lowerCasedJson).toDS())
.collect()
}
assert(e.getMessage.contains("Cannot parse"))
}
}
}
}