-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-31416][SQL] Check more strictly that a field name can be used as a valid Java identifier for codegen #28184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #121099 has finished for PR 28184 at commit
|
| "`abstract` is a reserved keyword and cannot be used as field name")) | ||
| "`abstract` is not a valid identifier of Java and cannot be used as field name")) | ||
|
|
||
| val e2 = intercept[UnsupportedOperationException] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add this test in a new test block instead of adding the existing one? btw, it seems better to place these tests for ScalaReflection in ScalaReflectionRelationSuite.
| throw new UnsupportedOperationException(s"`$fieldName` is a reserved keyword and " + | ||
| "cannot be used as field name\n" + walkedTypePath) | ||
| if (SourceVersion.isKeyword(fieldName) || | ||
| !SourceVersion.isIdentifier(encodeFieldNameToIdentifier(fieldName))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: one more indent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it needed?
There are some similar style found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I know that. Not strong preference though, I found the style below in this file;
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L283
if (XXXXX ||
YYYY) { // one more indent to tell a difference from <code block>
<code block>
}
So, I left the comment above for per-file style consistency. Anyway, trivial though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, exactly. O.K, I'll follow the rule. Thanks.
| if (javaKeywords.contains(fieldName)) { | ||
| throw new UnsupportedOperationException(s"`$fieldName` is a reserved keyword and " + | ||
| "cannot be used as field name\n" + walkedTypePath) | ||
| if (SourceVersion.isKeyword(fieldName) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with this method though, this check depends on users' Java runtime version? cc: @rednaxelafx @kiszk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SourceVersion class knows how to detect the spec version of the current running Java runtime, and can perform checks like identifier validation accordingly.
c.f. http://hg.openjdk.java.net/jdk8u/jdk8u/langtools/file/01036da3155c/src/share/classes/javax/lang/model/SourceVersion.java#l183
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Thanks for the info, Kris. btw, the reserved keywords depend on Janino though, is it okay for the check to depend on running Java runtime?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the original implementation, javaKeywords contains default so how about checking keywords for Java 8 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this, @cloud-fan ? I found he defined the initial set for the keywords in #13485
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While the reserved keywords depend on Janino though, SourceVersion.isKeyword(fieldName) would have the super set of keywords that Janino cannot accept. I think that this change is reasonable. We can avoid maintain javaKeywords.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SourceVersion.isKeyword(fieldName) would have the super set of keywords that Janino cannot accept
Looks nice, thanks for the check, @kiszk
|
Test build #121107 has finished for PR 28184 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @sarutak .
Since we start to support multiple JDKs, I'm +1 to use SourceVersion.isKeyword. Specifically, this PR is a good change to support JDK11+. Not only for enum which mentioned in the PR description, _ is not a keyword in JDK8 while it is in JDK11. So, we had better depend on JDK's result instead of keeping our blacklist.
jshell> javax.lang.model.SourceVersion.isKeyword("_")
$1 ==> true
SourceVersion.isIdentifier also looks fine.
For indentation, +1 for @maropu 's comment.
|
cc @srowen and @gatorsmile |
maropu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No more comment except for the existing ones.
|
Test build #121143 has finished for PR 28184 at commit
|
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good in principle.
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @sarutak , @maropu , @rednaxelafx , @kiszk , @srowen .
Merged to master/3.0.
…as a valid Java identifier for codegen ### What changes were proposed in this pull request? Check more strictly that a field name can be used as a valid Java identifier in `ScalaReflection.serializerFor` To check that, `SourceVersion` is used so that we need not add reserved keywords to be checked manually for the future Java versions (e.g, underscore, var, yield), . ### Why are the changes needed? In the current implementation, `enum` is not checked even though it's a reserved keyword. Also, there are lots of characters and sequences of character including numeric literals but they are not checked. So we can't get better error message with following code. ``` case class Data(`0`: Int) Seq(Data(1)).toDF.show 20/04/11 03:24:24 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type ... ``` ### Does this PR introduce any user-facing change? Yes. With this change and the code example above, we can get following error message. ``` java.lang.UnsupportedOperationException: `0` is not a valid identifier of Java and cannot be used as field name - root class: "Data" ... ``` ### How was this patch tested? Add another assertion to existing test case. Closes #28184 from sarutak/improve-identifier-check. Authored-by: Kousuke Saruta <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 6cd0bef) Signed-off-by: Dongjoon Hyun <[email protected]>
…as a valid Java identifier for codegen ### What changes were proposed in this pull request? Check more strictly that a field name can be used as a valid Java identifier in `ScalaReflection.serializerFor` To check that, `SourceVersion` is used so that we need not add reserved keywords to be checked manually for the future Java versions (e.g, underscore, var, yield), . ### Why are the changes needed? In the current implementation, `enum` is not checked even though it's a reserved keyword. Also, there are lots of characters and sequences of character including numeric literals but they are not checked. So we can't get better error message with following code. ``` case class Data(`0`: Int) Seq(Data(1)).toDF.show 20/04/11 03:24:24 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type ... ``` ### Does this PR introduce any user-facing change? Yes. With this change and the code example above, we can get following error message. ``` java.lang.UnsupportedOperationException: `0` is not a valid identifier of Java and cannot be used as field name - root class: "Data" ... ``` ### How was this patch tested? Add another assertion to existing test case. Closes apache#28184 from sarutak/improve-identifier-check. Authored-by: Kousuke Saruta <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
Check more strictly that a field name can be used as a valid Java identifier in
ScalaReflection.serializerForTo check that,
SourceVersionis used so that we need not add reserved keywords to be checked manually for the future Java versions (e.g, underscore, var, yield), .Why are the changes needed?
In the current implementation,
enumis not checked even though it's a reserved keyword.Also, there are lots of characters and sequences of character including numeric literals but they are not checked.
So we can't get better error message with following code.
Does this PR introduce any user-facing change?
Yes. With this change and the code example above, we can get following error message.
How was this patch tested?
Add another assertion to existing test case.