[SPARK-31416][SQL] Check more strictly that a field name can be used as a valid Java identifier for codegen #28184

sarutak · 2020-04-10T18:30:10Z

What changes were proposed in this pull request?

Check more strictly that a field name can be used as a valid Java identifier in ScalaReflection.serializerFor.
To do that, SourceVersion is used so that reserved keywords introduced in future Java versions (e.g., underscore, var, yield) need not be added to the check manually.
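For reference, here is a minimal standalone sketch of how the two `javax.lang.model.SourceVersion` checks behave. The class and helper names (`IdentifierCheckDemo`, `isUsableFieldName`) are made up for illustration, and the sketch skips the Scala name encoding that Spark applies before the check, so it is not the code in this PR.

```java
import javax.lang.model.SourceVersion;

public class IdentifierCheckDemo {
    // Hypothetical helper: a name is usable only if it is syntactically an
    // identifier and is not a keyword or literal for the running JDK.
    static boolean isUsableFieldName(String name) {
        return SourceVersion.isIdentifier(name) && !SourceVersion.isKeyword(name);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"value", "abstract", "enum", "0"}) {
            System.out.println(name
                + " isKeyword=" + SourceVersion.isKeyword(name)
                + " isIdentifier=" + SourceVersion.isIdentifier(name)
                + " usable=" + isUsableFieldName(name));
        }
    }
}
```

Note that `isIdentifier` alone accepts keywords such as `abstract`, since it only checks the characters, which is why both checks are combined.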

Why are the changes needed?

In the current implementation, enum is not checked even though it is a reserved keyword.
Also, many characters and character sequences, including numeric literals, are not valid identifiers, but they are not checked either.
So we do not get a helpful error message with the following code.

case class Data(`0`: Int)
Seq(Data(1)).toDF.show

20/04/11 03:24:24 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type

...

Does this PR introduce any user-facing change?

Yes. With this change and the code example above, we get the following error message.

java.lang.UnsupportedOperationException: `0` is not a valid identifier of Java and cannot be used as field name
- root class: "Data"

...
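As a rough illustration of the fail-fast behavior, the sketch below rejects an invalid name before any code generation would happen and reproduces the message format shown above. `FieldNameValidator` and `validate` are hypothetical names, and the root class is passed in as a plain string here; Spark builds that part from its walked type path.

```java
import javax.lang.model.SourceVersion;

public class FieldNameValidator {
    // Hypothetical helper that mimics the message format above; not Spark's actual code.
    static void validate(String fieldName, String rootClass) {
        if (SourceVersion.isKeyword(fieldName) || !SourceVersion.isIdentifier(fieldName)) {
            throw new UnsupportedOperationException(
                "`" + fieldName + "` is not a valid identifier of Java and cannot be used as field name\n"
                    + "- root class: \"" + rootClass + "\"");
        }
    }

    public static void main(String[] args) {
        validate("id", "Data"); // a valid name passes silently
        try {
            validate("0", "Data");
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```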

How was this patch tested?

Added another assertion to an existing test case.

SparkQA · 2020-04-10T18:46:39Z

Test build #121099 has finished for PR 28184 at commit 602c8c9.

This patch fails to generate documentation.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-04-11T00:23:54Z

sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala

-      "`abstract` is a reserved keyword and cannot be used as field name"))
+      "`abstract` is not a valid identifier of Java and cannot be used as field name"))
+
+    val e2 = intercept[UnsupportedOperationException] {


Could you add this test in a new test block instead of adding it to the existing one? By the way, it seems better to place these tests for ScalaReflection in ScalaReflectionRelationSuite.

maropu · 2020-04-11T00:24:23Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

-            throw new UnsupportedOperationException(s"`$fieldName` is a reserved keyword and " +
-              "cannot be used as field name\n" + walkedTypePath)
+          if (SourceVersion.isKeyword(fieldName) ||
+            !SourceVersion.isIdentifier(encodeFieldNameToIdentifier(fieldName))) {


nit: one more indent.

Is it needed?
The same style is used in several other places:

JDBCRelation.scala
Utils.scala
CodeGenerator.scala

Yeah, I know that. I don't have a strong preference, but I found the style below in this file:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L283

if (XXXXX ||
    YYYY) { // one more indent to tell a difference from <code block>
  <code block>
}

So I left the comment above for per-file style consistency. It is trivial, though.

Hmm, exactly. OK, I'll follow the rule. Thanks.

maropu · 2020-04-11T00:26:34Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

-          if (javaKeywords.contains(fieldName)) {
-            throw new UnsupportedOperationException(s"`$fieldName` is a reserved keyword and " +
-              "cannot be used as field name\n" + walkedTypePath)
+          if (SourceVersion.isKeyword(fieldName) ||


I'm not familiar with this method. Does this check depend on the user's Java runtime version? cc: @rednaxelafx @kiszk

The SourceVersion class knows how to detect the spec version of the current running Java runtime, and can perform checks like identifier validation accordingly.
c.f. http://hg.openjdk.java.net/jdk8u/jdk8u/langtools/file/01036da3155c/src/share/classes/javax/lang/model/SourceVersion.java#l183
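To make the version dependence concrete, here is a small sketch using the two-argument `SourceVersion.isKeyword(CharSequence, SourceVersion)` overload. That overload only exists on JDK 9 and later, so this is an assumption about the runtime and the snippet will not compile on JDK 8. The class and helper names are hypothetical.

```java
import javax.lang.model.SourceVersion;

public class KeywordByVersion {
    // Thin wrapper around the JDK 9+ overload that takes an explicit source version.
    static boolean isKeywordAt(String name, SourceVersion version) {
        return SourceVersion.isKeyword(name, version);
    }

    public static void main(String[] args) {
        // The one-argument isKeyword(name) checks against this version.
        System.out.println("latest = " + SourceVersion.latest());
        System.out.println("_ at RELEASE_8: " + isKeywordAt("_", SourceVersion.RELEASE_8));
        System.out.println("_ at RELEASE_9: " + isKeywordAt("_", SourceVersion.RELEASE_9));
        System.out.println("enum at RELEASE_4: " + isKeywordAt("enum", SourceVersion.RELEASE_4));
        System.out.println("enum at RELEASE_5: " + isKeywordAt("enum", SourceVersion.RELEASE_5));
    }
}
```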

Ah, I see. Thanks for the info, Kris. By the way, the reserved keywords depend on Janino, so is it okay for the check to depend on the running Java runtime?

In the original implementation, javaKeywords contains default, so how about checking keywords for Java 8?

How about this, @cloud-fan? I found that he defined the initial set of keywords in #13485.

Although the reserved keywords depend on Janino, SourceVersion.isKeyword(fieldName) should cover a superset of the keywords that Janino cannot accept. I think that this change is reasonable. We can avoid maintaining javaKeywords.

SourceVersion.isKeyword(fieldName) would have the super set of keywords that Janino cannot accept

Looks nice, thanks for the check, @kiszk
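The maintenance argument in this thread can be sketched as follows: a hand-written keyword list can silently miss entries, while the JDK check does not. The set below is a deliberately incomplete, hypothetical list used only for illustration; it is not the contents of Spark's javaKeywords.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.lang.model.SourceVersion;

public class KeywordListDrift {
    // Hypothetical, deliberately incomplete hand-written list (not Spark's javaKeywords).
    static final Set<String> HAND_WRITTEN =
        new HashSet<>(Arrays.asList("abstract", "class", "default", "int"));

    // Returns the candidates that the JDK reports as keywords but the hand-written list misses.
    static List<String> missedByHandWrittenList(List<String> candidates) {
        List<String> missed = new ArrayList<>();
        for (String name : candidates) {
            if (SourceVersion.isKeyword(name) && !HAND_WRITTEN.contains(name)) {
                missed.add(name);
            }
        }
        return missed;
    }

    public static void main(String[] args) {
        System.out.println(
            missedByHandWrittenList(Arrays.asList("abstract", "enum", "default", "value")));
    }
}
```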

SparkQA · 2020-04-11T03:35:08Z

Test build #121107 has finished for PR 28184 at commit 14bb7ba.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

Thank you, @sarutak .

Since we have started to support multiple JDKs, I'm +1 for using SourceVersion.isKeyword. Specifically, this PR is a good change for supporting JDK11+. In addition to enum, which is mentioned in the PR description, _ is not a keyword in JDK8 while it is in JDK11. So we had better depend on the JDK's result instead of keeping our own blacklist.

jshell> javax.lang.model.SourceVersion.isKeyword("_")
$1 ==> true

SourceVersion.isIdentifier also looks fine.

For indentation, +1 for @maropu 's comment.

dongjoon-hyun · 2020-04-12T02:19:12Z

cc @srowen and @gatorsmile

maropu

No more comment except for the existing ones.

SparkQA · 2020-04-12T12:35:18Z

Test build #121143 has finished for PR 28184 at commit a42b169.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class InvalidInJava(`abstract`: Int)

srowen

Looks good in principle.

dongjoon-hyun

+1, LGTM. Thank you, @sarutak , @maropu , @rednaxelafx , @kiszk , @srowen .
Merged to master/3.0.

…as a valid Java identifier for codegen

### What changes were proposed in this pull request?

Check more strictly that a field name can be used as a valid Java identifier in `ScalaReflection.serializerFor`.
To do that, `SourceVersion` is used so that reserved keywords introduced in future Java versions (e.g., underscore, var, yield) need not be added to the check manually.

### Why are the changes needed?

In the current implementation, `enum` is not checked even though it is a reserved keyword.
Also, many characters and character sequences, including numeric literals, are not valid identifiers, but they are not checked either.
So we do not get a helpful error message with the following code.

```
case class Data(`0`: Int)
Seq(Data(1)).toDF.show

20/04/11 03:24:24 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type

...
```

### Does this PR introduce any user-facing change?

Yes. With this change and the code example above, we get the following error message.

```
java.lang.UnsupportedOperationException: `0` is not a valid identifier of Java and cannot be used as field name
- root class: "Data"

...
```

### How was this patch tested?

Added another assertion to an existing test case.

Closes #28184 from sarutak/improve-identifier-check.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 6cd0bef)
Signed-off-by: Dongjoon Hyun <[email protected]>


Strictly check that a field name can be used as a valid identifier.

602c8c9

Fixed genjavadoc error.

14bb7ba

maropu reviewed Apr 11, 2020


dongjoon-hyun added the SQL label Apr 12, 2020

dongjoon-hyun reviewed Apr 12, 2020


maropu approved these changes Apr 12, 2020


sarutak added 2 commits April 12, 2020 16:56

Added another indentation.

eef0532

Move test cases for Java keywords check to ScalaReflectionRelationSuite.

a42b169

srowen reviewed Apr 12, 2020


dongjoon-hyun approved these changes Apr 12, 2020


dongjoon-hyun closed this in 6cd0bef Apr 12, 2020
