-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-21677][SQL] json_tuple throws NullPointException when column is null as string type #18930
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @viirya |
|
cc @HyukjinKwon |
| @transient private lazy val fieldExpressions: Seq[Expression] = children.tail | ||
|
|
||
| // toString on null will throw NullPointerException so that return a very unlikely column name | ||
| private val nullFieldName = "__NullFieldName" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A field name given with constant null will be replaced with this pseudo field name.
|
ok to test |
|
Test build #80579 has finished for PR 18930 at commit
|
|
retest this please. |
|
Test build #80587 has finished for PR 18930 at commit
|
| @transient private lazy val fieldExpressions: Seq[Expression] = children.tail | ||
|
|
||
| // a field name given with constant null will be replaced with this pseudo field name | ||
| private val nullFieldName = "__NullFieldName" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmchung, could we maybe compute this foldable related optimization ahead -
https://github.com/jmchung/spark/blob/ffa575a6731fef3e0731b73e0f7311cb024e831b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L425-L439 and remove this fake field name?
I think we can make a function for the above codes first and then use it for computation for each row. Did I understand correctly?
I tried a rough version I thought - https://github.com/jmchung/spark/compare/SPARK-21677...HyukjinKwon:tmp-18930?expand=1, @viirya what do you think about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I've also considered using Option here. But don't want to come out Option version from me first, so we can experience review process. It looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HyukjinKwon @viirya Yep, we've discarded the fake field name and use Option here. We made a slight revision to deal with the None in foldableFieldNames instead of creating a new function.
|
Test build #80634 has finished for PR 18930 at commit
|
| |SELECT json_tuple('{"a" : 1, "b" : 2}' | ||
| |, cast(NULL AS STRING), 'b' | ||
| |, cast(NULL AS STRING), 'a') | ||
| """.stripMargin), Row(null, "2", null, "1")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmchung Can we also add the test we discussed in slack which mixes constant field name and non constant one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya Done, the added test case contains column name, constant field name, and null field name.
|
|
||
| // eagerly evaluate any foldable the field names | ||
| @transient private lazy val foldableFieldNames: IndexedSeq[String] = { | ||
| @transient private lazy val foldableFieldNames: Array[Option[String]] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should continue to use IndexedSeq which is more efficient as foldableFieldNames will be used for many times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya ok, thanks
|
Test build #80656 has finished for PR 18930 at commit
|
|
Test build #80664 has finished for PR 18930 at commit
|
| } | ||
| } | ||
|
|
||
| // Array[String] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove this line?
| }.toIndexedSeq | ||
| } | ||
| } | ||
| }.toIndexedSeq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move toIndexedSeq to inner block, i.e. after the map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya Done: (1) remove redundant comment (2) move toIndexedSeq after the map
|
LGTM except for minor comments. |
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM too except for the comment above.
| } | ||
| } | ||
|
|
||
| test("SPARK-21677: json_tuple throws NullPointException when column is null as string type") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we move this to spark/sql/core/src/test/resources/sql-tests/inputs/json-functions.sql and/or spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just an end-to-end test case. We also need to add unit test cases in JsonExpressionsSuite
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The end-to-end test at L2047 may not be able to move to JsonExpressionsSuite. We can have some unit test cases similar to L2039 in JsonExpressionsSuite as @gatorsmile suggested.
It is also good to have this end-to-end tests in json-functions.sql.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile has added unit test case in JsonExpressionsSuite
@viirya also add end-to-end test in json-functions.sql
|
Test build #80688 has finished for PR 18930 at commit
|
| .createOrReplaceTempView("jsonTable") | ||
|
|
||
| checkAnswer( | ||
| sql("""SELECT json_tuple(jsonField, b, cast(NULL AS STRING), 'a') FROM jsonTable"""), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: """ -> "
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will move L2053 to json-functions.sql
| |SELECT json_tuple('{"a" : 1, "b" : 2}' | ||
| |, cast(NULL AS STRING), 'b' | ||
| |, cast(NULL AS STRING), 'a') | ||
| """.stripMargin), Row(null, "2", null, "1")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: move Row(null, "2", null, "1")) to the next line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, thanks
…t case in json-functions.sql
|
Test build #80740 has finished for PR 18930 at commit
|
| describe function extended json_tuple; | ||
| select json_tuple('{"a" : 1, "b" : 2}', cast(NULL AS STRING), 'b', cast(NULL AS STRING), 'a') | ||
| create temporary view jsonTable(jsonField, a, b) as select * from values '{"a": 1, "b": 2}', 'a', 'b'; | ||
| SELECT json_tuple(jsonField, b, cast(NULL AS STRING), 'a') FROM jsonTable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To generate the result file, you need to run the command
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite"
| describe function extended json_tuple; | ||
| select json_tuple('{"a" : 1, "b" : 2}', cast(NULL AS STRING), 'b', cast(NULL AS STRING), 'a') | ||
| create temporary view jsonTable(jsonField, a, b) as select * from values '{"a": 1, "b": 2}', 'a', 'b'; | ||
| SELECT json_tuple(jsonField, b, cast(NULL AS STRING), 'a') FROM jsonTable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also add an extra space at the end of this file
| describe function json_tuple; | ||
| describe function extended json_tuple; | ||
| select json_tuple('{"a" : 1, "b" : 2}', cast(NULL AS STRING), 'b', cast(NULL AS STRING), 'a') | ||
| create temporary view jsonTable(jsonField, a, b) as select * from values '{"a": 1, "b": 2}', 'a', 'b'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be consistent with the other SQL commands, use upper cases for SQL keywords
| select from_json(); | ||
| -- json_tuple | ||
| describe function json_tuple; | ||
| describe function extended json_tuple; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to add these two desc commands.
| describe function extended json_tuple; | ||
| select json_tuple('{"a" : 1, "b" : 2}', cast(NULL AS STRING), 'b', cast(NULL AS STRING), 'a') | ||
| create temporary view jsonTable(jsonField, a, b) as select * from values '{"a": 1, "b": 2}', 'a', 'b'; | ||
| SELECT json_tuple(jsonField, b, cast(NULL AS STRING), 'a') FROM jsonTable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remember to drop the created view DROP VIEW IF EXISTS jsonTable;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile @viirya Thank you for your time to review the code. SQL statements are consistent in style and the golden file of json-functions.sql also committed.
…son-functions.sql
|
Test build #80758 has finished for PR 18930 at commit
|
|
retest this please. |
|
Test build #80768 has finished for PR 18930 at commit
|
|
retest this please |
|
Test build #80774 has finished for PR 18930 at commit
|
|
retest this please |
|
Test build #80783 has finished for PR 18930 at commit
|
|
LGTM |
| select from_json(); | ||
| -- json_tuple | ||
| SELECT json_tuple('{"a" : 1, "b" : 2}', CAST(NULL AS STRING), 'b', CAST(NULL AS STRING), 'a'); | ||
| CREATE TEMPORARY VIEW jsonTable(jsonField, a, b) AS SELECT * FROM VALUES ('{"a": 1, "b": 2}', 'a', 'b'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks 'a' field is not used here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we can rewrite to:
CREATE TEMPORARY VIEW jsonTable(jsonField, a) AS SELECT * FROM VALUES ('{"a": 1, "b": 2}', 'a');
SELECT json_tuple(jsonField, 'b', CAST(NULL AS STRING), a) FROM jsonTable;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, we have altered the 'a' to table field a.
| CREATE TEMPORARY VIEW jsonTable(jsonField, a, b) AS SELECT * FROM VALUES ('{"a": 1, "b": 2}', 'a', 'b'); | ||
| SELECT json_tuple(jsonField, b, CAST(NULL AS STRING), 'a') FROM jsonTable; | ||
| -- Clean up | ||
| DROP VIEW IF EXISTS jsonTable; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks the project style does not require a newline at the end but I would personally add this ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. I remember @gatorsmile has also suggested to add it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HyukjinKwon @viirya Fixed.
|
LGTM too. |
|
Test build #80788 has finished for PR 18930 at commit
|
|
retest this please |
|
Test build #80789 has finished for PR 18930 at commit
|
| CREATE TEMPORARY VIEW jsonTable(jsonField, a) AS SELECT * FROM VALUES ('{"a": 1, "b": 2}', 'a'); | ||
| SELECT json_tuple(jsonField, 'b', CAST(NULL AS STRING), a) FROM jsonTable; | ||
| -- Clean up | ||
| DROP VIEW IF EXISTS jsonTable; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: just FYI, we do not need to drop the temp view in SQLQueryTestSuite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a big deal. Thus, I will merge it when the test can pass.
|
retest this please |
|
Test build #80805 has finished for PR 18930 at commit
|
|
Thanks! Merging to master. |
|
Thanks @HyukjinKwon @gatorsmile |
|
Thanks @viirya @HyukjinKwon @gatorsmile , I learned a lot from this journey. |
What changes were proposed in this pull request?
Currently the
nullfield name will throw NullPointException. As a given field name null can't be matched with any field names in json, we just output null as its column value. This PR achieves it by returning a very unlikely column name__NullFieldNamein evaluation of the field names.How was this patch tested?
Added unit test.