@jmchung
Contributor

@jmchung jmchung commented Aug 13, 2017

What changes were proposed in this pull request?

scala> Seq(("""{"Hyukjin": 224, "John": 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show()
...
java.lang.NullPointerException
	at ...

Currently, a null field name throws a NullPointerException. Since a null field name cannot match any field name in the JSON, we simply output null as its column value. This PR achieves this by returning a very unlikely column name, __NullFieldName, when evaluating the field names.
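For readers following along, here is a minimal plain-Scala sketch of the sentinel idea described above. It is not the Spark implementation: the `Map` stands in for a parsed JSON object, and `NullFieldNameSketch` and `lookup` are hypothetical names used only for illustration.

```scala
object NullFieldNameSketch {
  // Sentinel name taken from the PR description; assumed never to occur in real JSON.
  private val nullFieldName = "__NullFieldName"

  // `json` is a stand-in for a parsed JSON object (an assumption of this sketch).
  // A null requested name is swapped for the sentinel, so the lookup simply misses
  // and yields null instead of failing on a null dereference.
  def lookup(json: Map[String, String], requested: Seq[String]): Seq[String] = {
    val names = requested.map(n => if (n == null) nullFieldName else n)
    names.map(n => json.getOrElse(n, null))
  }
}
```

With `Map("a" -> "1", "b" -> "2")` and the requested names `Seq(null, "b", null, "a")`, `lookup` returns `Seq(null, "2", null, "1")`. The weakness of this approach is that a JSON document that really contains a field called `__NullFieldName` would match the sentinel.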

How was this patch tested?

Added unit test.

@jmchung
Contributor Author

jmchung commented Aug 13, 2017

cc @viirya

@jmchung jmchung changed the title from "Spark 21677" to "[SPARK-21677][SQL] json_tuple throws NullPointException when column is null as string type" Aug 13, 2017
@viirya
Member

viirya commented Aug 13, 2017

cc @HyukjinKwon

@transient private lazy val fieldExpressions: Seq[Expression] = children.tail

// toString on null will throw NullPointerException so that return a very unlikely column name
private val nullFieldName = "__NullFieldName"
Member

A field name given with constant null will be replaced with this pseudo field name.

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Aug 13, 2017

Test build #80579 has finished for PR 18930 at commit ffa575a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Aug 13, 2017

retest this please.

@SparkQA

SparkQA commented Aug 13, 2017

Test build #80587 has finished for PR 18930 at commit ffa575a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@transient private lazy val fieldExpressions: Seq[Expression] = children.tail

// a field name given with constant null will be replaced with this pseudo field name
private val nullFieldName = "__NullFieldName"
Member

@jmchung, could we compute this foldable-related optimization ahead of time (https://github.com/jmchung/spark/blob/ffa575a6731fef3e0731b73e0f7311cb024e831b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L425-L439) and remove this fake field name?

I think we can first extract the code above into a function and then use it in the per-row computation. Did I understand correctly?

I tried a rough version of what I had in mind: https://github.com/jmchung/spark/compare/SPARK-21677...HyukjinKwon:tmp-18930?expand=1. @viirya, what do you think about this?

Member

@viirya viirya Aug 14, 2017

Yeah, I also considered using Option here, but I didn't want to propose the Option version myself first, so that we could go through the review process. It looks good to me.

Contributor Author

@HyukjinKwon @viirya Yep, we've discarded the fake field name and now use Option here. We made a slight revision to handle the None in foldableFieldNames instead of creating a new function.
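As a rough illustration of the Option-based direction, the following plain-Scala sketch models a constant null field name as `None` rather than as a fake name. It is only a sketch under stated assumptions, not the code in this PR: the `Map` again stands in for a parsed JSON object, it covers constant field names only, and `OptionFieldNamesSketch`, `foldNames`, and `lookup` are hypothetical names.

```scala
object OptionFieldNamesSketch {
  // Option(null) is None, so a constant null name needs no sentinel string.
  // The conversion to IndexedSeq happens once, right after the map.
  def foldNames(requested: Seq[String]): IndexedSeq[Option[String]] =
    requested.map(Option(_)).toIndexedSeq

  // `json` is a stand-in for a parsed JSON object (an assumption of this sketch).
  // None never matches a field, and a missing field also yields null.
  def lookup(json: Map[String, String], names: IndexedSeq[Option[String]]): IndexedSeq[String] =
    names.map(_.flatMap(json.get).orNull)
}
```

Because `None` is not a string, no JSON field name can collide with it, which is the property the sentinel approach could not guarantee.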

@SparkQA

SparkQA commented Aug 14, 2017

Test build #80634 has finished for PR 18930 at commit 5d71263.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

|SELECT json_tuple('{"a" : 1, "b" : 2}'
|, cast(NULL AS STRING), 'b'
|, cast(NULL AS STRING), 'a')
""".stripMargin), Row(null, "2", null, "1"))
Member

@jmchung Can we also add the test we discussed in Slack, which mixes constant and non-constant field names?

Contributor Author

@viirya Done. The added test case contains a column name, a constant field name, and a null field name.


// eagerly evaluate any foldable the field names
@transient private lazy val foldableFieldNames: IndexedSeq[String] = {
@transient private lazy val foldableFieldNames: Array[Option[String]] = {
Member

I think we should continue to use IndexedSeq, which is more efficient, as foldableFieldNames will be used many times.

Contributor Author

@viirya ok, thanks

@SparkQA

SparkQA commented Aug 15, 2017

Test build #80656 has finished for PR 18930 at commit 0078445.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 15, 2017

Test build #80664 has finished for PR 18930 at commit 5c69df5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

// Array[String]
Member

Can we remove this line?

}.toIndexedSeq
}
}
}.toIndexedSeq
Member

Can we move toIndexedSeq to the inner block, i.e. right after the map?

Contributor Author

@viirya Done: (1) removed the redundant comment; (2) moved toIndexedSeq after the map.

@viirya
Member

viirya commented Aug 15, 2017

LGTM except for minor comments.

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM too except for the comment above.

}
}

test("SPARK-21677: json_tuple throws NullPointException when column is null as string type") {
Member

Could we move this to spark/sql/core/src/test/resources/sql-tests/inputs/json-functions.sql and/or spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala?

Member

This is just an end-to-end test case. We also need to add unit test cases in JsonExpressionsSuite.

Member

@viirya viirya Aug 16, 2017

The end-to-end test at L2047 may not be movable to JsonExpressionsSuite. We can add some unit test cases similar to L2039 in JsonExpressionsSuite, as @gatorsmile suggested.

It would also be good to have these end-to-end tests in json-functions.sql.

Contributor Author

@gatorsmile Added a unit test case in JsonExpressionsSuite.
@viirya Also added an end-to-end test in json-functions.sql.

@SparkQA

SparkQA commented Aug 15, 2017

Test build #80688 has finished for PR 18930 at commit ab16929.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.createOrReplaceTempView("jsonTable")

checkAnswer(
sql("""SELECT json_tuple(jsonField, b, cast(NULL AS STRING), 'a') FROM jsonTable"""),
Member

Nit: """ -> "

Contributor Author

will move L2053 to json-functions.sql

|SELECT json_tuple('{"a" : 1, "b" : 2}'
|, cast(NULL AS STRING), 'b'
|, cast(NULL AS STRING), 'a')
""".stripMargin), Row(null, "2", null, "1"))
Member

Nit: move Row(null, "2", null, "1")) to the next line.

Contributor Author

ok, thanks

@SparkQA

SparkQA commented Aug 16, 2017

Test build #80740 has finished for PR 18930 at commit e0e0c74.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

describe function extended json_tuple;
select json_tuple('{"a" : 1, "b" : 2}', cast(NULL AS STRING), 'b', cast(NULL AS STRING), 'a')
create temporary view jsonTable(jsonField, a, b) as select * from values '{"a": 1, "b": 2}', 'a', 'b';
SELECT json_tuple(jsonField, b, cast(NULL AS STRING), 'a') FROM jsonTable
Member

To generate the result file, you need to run the following command:

SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite"

describe function extended json_tuple;
select json_tuple('{"a" : 1, "b" : 2}', cast(NULL AS STRING), 'b', cast(NULL AS STRING), 'a')
create temporary view jsonTable(jsonField, a, b) as select * from values '{"a": 1, "b": 2}', 'a', 'b';
SELECT json_tuple(jsonField, b, cast(NULL AS STRING), 'a') FROM jsonTable
Member

Also add an extra space at the end of this file

describe function json_tuple;
describe function extended json_tuple;
select json_tuple('{"a" : 1, "b" : 2}', cast(NULL AS STRING), 'b', cast(NULL AS STRING), 'a')
create temporary view jsonTable(jsonField, a, b) as select * from values '{"a": 1, "b": 2}', 'a', 'b';
Member

To be consistent with the other SQL commands, use upper case for SQL keywords.

select from_json();
-- json_tuple
describe function json_tuple;
describe function extended json_tuple;
Member

No need to add these two DESCRIBE commands.

describe function extended json_tuple;
select json_tuple('{"a" : 1, "b" : 2}', cast(NULL AS STRING), 'b', cast(NULL AS STRING), 'a')
create temporary view jsonTable(jsonField, a, b) as select * from values '{"a": 1, "b": 2}', 'a', 'b';
SELECT json_tuple(jsonField, b, cast(NULL AS STRING), 'a') FROM jsonTable
Member

Remember to drop the created view: DROP VIEW IF EXISTS jsonTable;

Contributor Author

@gatorsmile @viirya Thank you for taking the time to review the code. The SQL statements are now consistent in style, and the golden file for json-functions.sql has also been committed.

@SparkQA

SparkQA commented Aug 17, 2017

Test build #80758 has finished for PR 18930 at commit 5191ed4.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Aug 17, 2017

retest this please.

@SparkQA

SparkQA commented Aug 17, 2017

Test build #80768 has finished for PR 18930 at commit 5191ed4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Aug 17, 2017

Test build #80774 has finished for PR 18930 at commit 5191ed4.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Aug 17, 2017

Test build #80783 has finished for PR 18930 at commit 5191ed4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Aug 17, 2017

LGTM

select from_json();
-- json_tuple
SELECT json_tuple('{"a" : 1, "b" : 2}', CAST(NULL AS STRING), 'b', CAST(NULL AS STRING), 'a');
CREATE TEMPORARY VIEW jsonTable(jsonField, a, b) AS SELECT * FROM VALUES ('{"a": 1, "b": 2}', 'a', 'b');
Member

It looks like the 'a' field is not used here.

Member

I suggest we rewrite it to:

CREATE TEMPORARY VIEW jsonTable(jsonField, a) AS SELECT * FROM VALUES ('{"a": 1, "b": 2}', 'a'); 
SELECT json_tuple(jsonField, 'b', CAST(NULL AS STRING), a) FROM jsonTable; 

Contributor Author

OK, we have changed the literal 'a' to the table column a.

CREATE TEMPORARY VIEW jsonTable(jsonField, a, b) AS SELECT * FROM VALUES ('{"a": 1, "b": 2}', 'a', 'b');
SELECT json_tuple(jsonField, b, CAST(NULL AS STRING), 'a') FROM jsonTable;
-- Clean up
DROP VIEW IF EXISTS jsonTable;
Member

It looks like the project style does not require a newline at the end of the file, but I would personally add one.

Member

+1. I remember @gatorsmile also suggested adding it.

@HyukjinKwon
Member

LGTM too.

@SparkQA

SparkQA commented Aug 17, 2017

Test build #80788 has finished for PR 18930 at commit ff3b9da.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Aug 17, 2017

Test build #80789 has finished for PR 18930 at commit ff3b9da.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

CREATE TEMPORARY VIEW jsonTable(jsonField, a) AS SELECT * FROM VALUES ('{"a": 1, "b": 2}', 'a');
SELECT json_tuple(jsonField, 'b', CAST(NULL AS STRING), a) FROM jsonTable;
-- Clean up
DROP VIEW IF EXISTS jsonTable;
Member

Nit: just FYI, we do not need to drop the temp view in SQLQueryTestSuite.

Member

This is not a big deal, so I will merge it once the tests pass.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Aug 17, 2017

Test build #80805 has finished for PR 18930 at commit ff3b9da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merging to master.

@asfgit asfgit closed this in 7ab9518 Aug 17, 2017
@viirya
Member

viirya commented Aug 17, 2017

Thanks @HyukjinKwon @gatorsmile

@jmchung
Contributor Author

jmchung commented Aug 18, 2017

Thanks @viirya @HyukjinKwon @gatorsmile, I learned a lot from this journey.
