Skip to content

Conversation

@sarutak
Copy link
Member

@sarutak sarutak commented Apr 10, 2020

What changes were proposed in this pull request?

Check more strictly that a field name can be used as a valid Java identifier in ScalaReflection.serializerFor
To check that, SourceVersion is used so that we need not add reserved keywords to be checked manually for the future Java versions (e.g, underscore, var, yield), .

Why are the changes needed?

In the current implementation, enum is not checked even though it's a reserved keyword.
Also, there are lots of characters and sequences of character including numeric literals but they are not checked.
So we can't get better error message with following code.

case class  Data(`0`: Int)
Seq(Data(1)).toDF.show

20/04/11 03:24:24 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type

...

Does this PR introduce any user-facing change?

Yes. With this change and the code example above, we can get following error message.

java.lang.UnsupportedOperationException: `0` is not a valid identifier of Java and cannot be used as field name
- root class: "Data"

...

How was this patch tested?

Add another assertion to existing test case.

@SparkQA
Copy link

SparkQA commented Apr 10, 2020

Test build #121099 has finished for PR 28184 at commit 602c8c9.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

"`abstract` is a reserved keyword and cannot be used as field name"))
"`abstract` is not a valid identifier of Java and cannot be used as field name"))

val e2 = intercept[UnsupportedOperationException] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add this test in a new test block instead of adding the existing one? btw, it seems better to place these tests for ScalaReflection in ScalaReflectionRelationSuite.

throw new UnsupportedOperationException(s"`$fieldName` is a reserved keyword and " +
"cannot be used as field name\n" + walkedTypePath)
if (SourceVersion.isKeyword(fieldName) ||
!SourceVersion.isIdentifier(encodeFieldNameToIdentifier(fieldName))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: one more indent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it needed?
There are some similar style found.

JDBCRelation.scala
Utils.scala
CodeGenerator.scala

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I know that. Not strong preference though, I found the style below in this file;
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L283

if (XXXXX ||
    YYYY) { // one more indent to tell a difference from <code block>
  <code block>
}

So, I left the comment above for per-file style consistency. Anyway, trivial though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, exactly. O.K, I'll follow the rule. Thanks.

if (javaKeywords.contains(fieldName)) {
throw new UnsupportedOperationException(s"`$fieldName` is a reserved keyword and " +
"cannot be used as field name\n" + walkedTypePath)
if (SourceVersion.isKeyword(fieldName) ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with this method though, this check depends on users' Java runtime version? cc: @rednaxelafx @kiszk

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SourceVersion class knows how to detect the spec version of the current running Java runtime, and can perform checks like identifier validation accordingly.
c.f. http://hg.openjdk.java.net/jdk8u/jdk8u/langtools/file/01036da3155c/src/share/classes/javax/lang/model/SourceVersion.java#l183

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Thanks for the info, Kris. btw, the reserved keywords depend on Janino though, is it okay for the check to depend on running Java runtime?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the original implementation, javaKeywords contains default so how about checking keywords for Java 8 ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this, @cloud-fan ? I found he defined the initial set for the keywords in #13485

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the reserved keywords depend on Janino though, SourceVersion.isKeyword(fieldName) would have the super set of keywords that Janino cannot accept. I think that this change is reasonable. We can avoid maintain javaKeywords.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SourceVersion.isKeyword(fieldName) would have the super set of keywords that Janino cannot accept

Looks nice, thanks for the check, @kiszk

@SparkQA
Copy link

SparkQA commented Apr 11, 2020

Test build #121107 has finished for PR 28184 at commit 14bb7ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @sarutak .

Since we start to support multiple JDKs, I'm +1 to use SourceVersion.isKeyword. Specifically, this PR is a good change to support JDK11+. Not only for enum which mentioned in the PR description, _ is not a keyword in JDK8 while it is in JDK11. So, we had better depend on JDK's result instead of keeping our blacklist.

jshell> javax.lang.model.SourceVersion.isKeyword("_")
$1 ==> true

SourceVersion.isIdentifier also looks fine.

For indentation, +1 for @maropu 's comment.

@dongjoon-hyun
Copy link
Member

cc @srowen and @gatorsmile

Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No more comment except for the existing ones.

@SparkQA
Copy link

SparkQA commented Apr 12, 2020

Test build #121143 has finished for PR 28184 at commit a42b169.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class InvalidInJava(abstract: Int)

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in principle.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @sarutak , @maropu , @rednaxelafx , @kiszk , @srowen .
Merged to master/3.0.

dongjoon-hyun pushed a commit that referenced this pull request Apr 12, 2020
…as a valid Java identifier for codegen

### What changes were proposed in this pull request?

Check more strictly that a field name can be used as a valid Java identifier in `ScalaReflection.serializerFor`
To check that, `SourceVersion` is used so that we need not add reserved keywords to be checked manually for the future Java versions (e.g, underscore, var, yield), .

### Why are the changes needed?

In the current implementation, `enum` is not checked even though it's a reserved keyword.
Also, there are lots of characters and sequences of character including numeric literals but they are not checked.
So we can't get better error message with following code.
```
case class  Data(`0`: Int)
Seq(Data(1)).toDF.show

20/04/11 03:24:24 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type

...

```

### Does this PR introduce any user-facing change?

Yes. With this change and the code example above, we can get following error message.
```
java.lang.UnsupportedOperationException: `0` is not a valid identifier of Java and cannot be used as field name
- root class: "Data"

...
```

### How was this patch tested?

Add another assertion to existing test case.

Closes #28184 from sarutak/improve-identifier-check.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 6cd0bef)
Signed-off-by: Dongjoon Hyun <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…as a valid Java identifier for codegen

### What changes were proposed in this pull request?

Check more strictly that a field name can be used as a valid Java identifier in `ScalaReflection.serializerFor`
To check that, `SourceVersion` is used so that we need not add reserved keywords to be checked manually for the future Java versions (e.g, underscore, var, yield), .

### Why are the changes needed?

In the current implementation, `enum` is not checked even though it's a reserved keyword.
Also, there are lots of characters and sequences of character including numeric literals but they are not checked.
So we can't get better error message with following code.
```
case class  Data(`0`: Int)
Seq(Data(1)).toDF.show

20/04/11 03:24:24 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type

...

```

### Does this PR introduce any user-facing change?

Yes. With this change and the code example above, we can get following error message.
```
java.lang.UnsupportedOperationException: `0` is not a valid identifier of Java and cannot be used as field name
- root class: "Data"

...
```

### How was this patch tested?

Add another assertion to existing test case.

Closes apache#28184 from sarutak/improve-identifier-check.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants