[SPARK-18772][SQL] Avoid unnecessary conversion try for special floats in JSON #17956
Changes from 2 commits
JacksonParser.scala

```diff
@@ -18,7 +18,6 @@
 package org.apache.spark.sql.catalyst.json

 import java.io.ByteArrayOutputStream
-import java.util.Locale

 import scala.collection.mutable.ArrayBuffer
 import scala.util.Try

@@ -126,16 +125,13 @@ class JacksonParser(
       case VALUE_STRING =>
         // Special case handling for NaN and Infinity.
-        val value = parser.getText
-        val lowerCaseValue = value.toLowerCase(Locale.ROOT)
-        if (lowerCaseValue.equals("nan") ||
-          lowerCaseValue.equals("infinity") ||
-          lowerCaseValue.equals("-infinity") ||
-          lowerCaseValue.equals("inf") ||
-          lowerCaseValue.equals("-inf")) {
-          value.toFloat
-        } else {
-          throw new RuntimeException(s"Cannot parse $value as FloatType.")
-        }
+        parser.getText match {
```
|
Contributor
One last question: should we be case-sensitive here? Can we check other systems, like pandas?
Member (Author)
To my knowledge, at least Python supports some case-insensitive cases. I would rather leave this working through options (discussed in #17956 (comment)), treating those as variants. At least, I think we can say it follows the Scala + Jackson standard here.
Member
It seems to me that we did not really support case insensitivity before: although we check the lower-cased input string, we actually call the case-sensitive conversion on the original value.
Member
It seems to me that NaN, Infinity, and INF are special literals, so case insensitivity does not really make sense for them.
Member
Actually, we are not trying to educate users/customers here about what is right or wrong; we are trying to improve usability. If other popular platforms accept certain formats, we need to follow them even if they look strange to us.
Member (Author)
I think we should not add all the variants to the default behaviour; we could add options for those cases. We already have different ideas, so is it worth adding those cases in this PR right now? This is not the end of fixing this code path.
```diff
+          case "NaN" | "+NaN" | "-NaN" => Float.NaN
+          case "+INF" | "INF" | "+Infinity" | "Infinity" => Float.PositiveInfinity
+          case "-INF" | "-Infinity" => Float.NegativeInfinity
+          case other => Try(other.toFloat).getOrElse {
+            throw new RuntimeException(s"Cannot parse $other as FloatType.")
+          }
+        }
```
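Taken on its own, the new match can be exercised outside Spark. Below is a minimal standalone sketch of the patched FloatType branch; the `parseSpecialFloat` name is illustrative, not part of Spark's API:

```scala
import scala.util.Try

// Standalone sketch of the patched FloatType branch.
// The special literals are matched case-sensitively.
def parseSpecialFloat(value: String): Float = value match {
  case "NaN" | "+NaN" | "-NaN" => Float.NaN
  case "+INF" | "INF" | "+Infinity" | "Infinity" => Float.PositiveInfinity
  case "-INF" | "-Infinity" => Float.NegativeInfinity
  case other => Try(other.toFloat).getOrElse {
    // Anything else must be a plain numeric literal.
    throw new RuntimeException(s"Cannot parse $other as FloatType.")
  }
}
```

Because the match is case-sensitive, `"INF"` maps to `Float.PositiveInfinity`, while `"inf"` falls through to `toFloat` and is rejected.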
|
|
```diff
@@ -146,16 +142,13 @@ class JacksonParser(
       case VALUE_STRING =>
         // Special case handling for NaN and Infinity.
-        val value = parser.getText
-        val lowerCaseValue = value.toLowerCase(Locale.ROOT)
-        if (lowerCaseValue.equals("nan") ||
-          lowerCaseValue.equals("infinity") ||
-          lowerCaseValue.equals("-infinity") ||
-          lowerCaseValue.equals("inf") ||
-          lowerCaseValue.equals("-inf")) {
-          value.toDouble
-        } else {
-          throw new RuntimeException(s"Cannot parse $value as DoubleType.")
-        }
+        parser.getText match {
+          case "NaN" | "+NaN" | "-NaN" => Double.NaN
+          case "+INF" | "INF" | "+Infinity" | "Infinity" => Double.PositiveInfinity
+          case "-INF" | "-Infinity" => Double.NegativeInfinity
+          case other => Try(other.toDouble).getOrElse {
+            throw new RuntimeException(s"Cannot parse $other as DoubleType.")
+          }
+        }
```
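For context on the case-sensitivity discussion: the replaced code lower-cased the input only for the guard, while the conversion itself went through `java.lang.Double.parseDouble`, which accepts the literals `NaN` and `Infinity` only in exactly that spelling. A sketch of the pre-patch shape (the helper name and `Option` return are ours, for illustration; the real code threw on the non-special path):

```scala
import java.util.Locale
import scala.util.Try

// Sketch of the pre-patch logic: the guard is case-insensitive, but the
// actual conversion (String.toDouble, i.e. Double.parseDouble) is not.
def oldParse(value: String): Option[Double] = {
  val lower = value.toLowerCase(Locale.ROOT)
  val special = Set("nan", "infinity", "-infinity", "inf", "-inf")
  if (special.contains(lower)) Try(value.toDouble).toOption else None
}
```

Here `oldParse("Infinity")` succeeds, but `oldParse("INF")` passes the guard and then fails inside `toDouble`, which is why `+INF`, `INF`, and `-INF` did not work before this patch.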
|
|
JsonSuite.scala

```diff
@@ -20,6 +20,7 @@ package org.apache.spark.sql.execution.datasources.json
 import java.io.{File, StringWriter}
 import java.nio.charset.StandardCharsets
 import java.sql.{Date, Timestamp}
+import java.util.Locale

 import com.fasterxml.jackson.core.JsonFactory
 import org.apache.hadoop.fs.{Path, PathFilter}

@@ -1988,4 +1989,55 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
     assert(errMsg.startsWith("The field for corrupt records must be string type and nullable"))
   }
 }
+
+  test("SPARK-18772: Parse special floats correctly") {
+    val jsons = Seq(
+      """{"a": "+INF"}""",
+      """{"a": "INF"}""",
+      """{"a": "-INF"}""",
+      """{"a": "NaN"}""",
+      """{"a": "+NaN"}""",
+      """{"a": "-NaN"}""",
+      """{"a": "Infinity"}""",
+      """{"a": "+Infinity"}""",
+      """{"a": "-Infinity"}""")
+
+    // positive cases
+    val checks: Seq[Double => Boolean] = Seq(
+      _.isPosInfinity,
+      _.isPosInfinity,
+      _.isNegInfinity,
+      _.isNaN,
+      _.isNaN,
+      _.isNaN,
+      _.isPosInfinity,
+      _.isPosInfinity,
+      _.isNegInfinity)
+
+    Seq(FloatType, DoubleType).foreach { dt =>
+      jsons.zip(checks).foreach { case (json, check) =>
+        val ds = spark.read
+          .schema(StructType(Seq(StructField("a", dt))))
+          .json(Seq(json).toDS())
+          .select($"a".cast(DoubleType)).as[Double]
+        assert(check(ds.first()))
+      }
+    }
+
+    // negative cases
+    Seq(FloatType, DoubleType).foreach { dt =>
+      val lowerCasedJsons = jsons.map(_.toLowerCase(Locale.ROOT))
+      // The special floats are case-sensitive so the cases below throw exceptions.
+      lowerCasedJsons.foreach { lowerCasedJson =>
+        val e = intercept[SparkException] {
+          spark.read
+            .option("mode", "FAILFAST")
+            .schema(StructType(Seq(StructField("a", dt))))
+            .json(Seq(lowerCasedJson).toDS())
+            .collect()
+        }
+        assert(e.getMessage.contains("Cannot parse"))
+      }
+    }
+  }
 }
```
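The suite's positive/negative pairing can be mirrored without a Spark session. Assuming a standalone `parseSpecial` helper equivalent to the patched DoubleType branch (a hypothetical stand-in for the parser, not Spark's API), the same nine literals and checks look like:

```scala
import scala.util.Try

// Hypothetical standalone stand-in for the patched DoubleType branch.
def parseSpecial(value: String): Double = value match {
  case "NaN" | "+NaN" | "-NaN" => Double.NaN
  case "+INF" | "INF" | "+Infinity" | "Infinity" => Double.PositiveInfinity
  case "-INF" | "-Infinity" => Double.NegativeInfinity
  case other => Try(other.toDouble).getOrElse {
    throw new RuntimeException(s"Cannot parse $other as DoubleType.")
  }
}

val values = Seq("+INF", "INF", "-INF", "NaN", "+NaN", "-NaN",
  "Infinity", "+Infinity", "-Infinity")
val checks: Seq[Double => Boolean] = Seq(
  _.isPosInfinity, _.isPosInfinity, _.isNegInfinity,
  _.isNaN, _.isNaN, _.isNaN,
  _.isPosInfinity, _.isPosInfinity, _.isNegInfinity)

// Positive cases: every literal maps to the expected special value.
values.zip(checks).foreach { case (v, check) => assert(check(parseSpecial(v))) }
// Negative cases: the lower-cased literals are all rejected.
values.map(_.toLowerCase).foreach { v =>
  assert(Try(parseSpecial(v)).isFailure)
}
```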
This sounds like a bug fix, because `toFloat` is case-sensitive?

For input/output, it is not a bug. Please see #17956 (comment) and the PR description.
Not sure whether we should follow it. How about our CSV parsers? How about pandas? Are they case-sensitive?
Is the JSON format itself related to CSV and pandas? Please see the discussion in #9759 (comment).

Also, this does not change the existing behaviour.
If possible, we want to be consistent with the others. Could you check pandas?
Let me check, and open a JIRA/PR if there are cases we should handle, when I have some time. Let's leave that out here; it does not block this PR, and I don't want extra changes to hold off this PR like #17901.
This PR changes the behavior, right? Does your test case pass without the code changes?
Okay, this adds some cases. I am still not convinced that we should handle every other case. Let's add the options `nanValue`, `positiveInf`, and `negativeInf` in a separate PR, like our CSV datasource.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this PR,
+INF,INFand-INF"do not work.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I added some cases here. I will open another PR/JIRA to add those options into JSON related functionalities.