[WIP][SPARK-29108][SQL] Add new module sql/thriftserver and add v11 thrift protocol #26221
Conversation
Can one of the admins verify this patch?
juliuszsompolski left a comment
Could you add a README.txt like the one in Hive:
Thrift commands to generate files from TCLIService.thrift:
--------------------
thrift --gen java:beans,hashcode -o src/gen/thrift if/TCLIService.thrift
and keep the directory structure as created by it (i.e. code in src/gen/thrift/gen-javabean)
Now after running thrift, there needs to be a manual step of moving the code again to src/gen, because it always generates the gen-javabean directory.
// new word (with no underscores), and end with the word "Service".

namespace java org.apache.spark.service.rpc.thrift
namespace cpp apache.spark.service.rpc.thrift
The org.apache.hive.service package name came from the code living in the hive-service module in Hive.
Since it's now all in one spark-thriftserver module, maybe change the package to org.apache.spark.sql.thriftserver.cli.thrift (as I suppose the other code will be in org.apache.spark.sql.thriftserver).
In hive-2.3.x, the thrift protocol code was separated out from the package org.apache.hive.service.cli.thrift; here it is now org.apache.spark.service.rpc.thrift, because under org.apache.spark.service.rpc.thrift there are some other classes based on this generated thrift code.
Does hive-jdbc still support it if we rename the package to org.apache.spark.service.rpc.thrift?
@wangyum hive-jdbc will then not pick up this code, but instead use the code from the hive-service dependency.
One may say that is bad, because it makes the client and server run different code, but I'd say it is actually good - most people would use a standalone Hive JDBC client jar that uses hive-service code from Hive. When we use it here, we test that the code in Spark does not break compatibility with the Hive client, even if in the future we decide to make some changes to it.
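For illustration only, a minimal sketch of how a standalone Hive JDBC client could be pointed at the Spark thriftserver for such a compatibility check; the URL, port and empty credentials below are assumptions for a local test setup, not part of this PR:

```scala
import java.sql.DriverManager

object HiveJdbcCompatCheck {
  def main(args: Array[String]): Unit = {
    // Standard Hive JDBC driver class from the hive-jdbc dependency;
    // assumes a thriftserver running locally on the default port 10000.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    try {
      val rs = conn.createStatement().executeQuery("SELECT 1")
      while (rs.next()) {
        println(rs.getInt(1)) // expect 1 if the server stays protocol-compatible
      }
    } finally {
      conn.close()
    }
  }
}
```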
Yea, it is feasible if it meets the protocol standard and is backwards compatible. We use a server implementing the thriftserver based on v9, and it works well.
Could we use org.apache.spark.thriftserver everywhere instead of org.apache.spark.service? The org.apache.hive.service name came from hive-service being a separate module, but we don't have that here.
It is easy after #26221 (comment).
It is ok to do it like this since our package is thriftserver, not service.
@juliuszsompolski Done.
cc @wangyum
pom.xml (Outdated)
<!-- Thrift properties -->
<thrift.home>you-must-set-this-to-run-thrift</thrift.home>
<thrift.gen.dir>${basedir}/src/gen/thrift</thrift.gen.dir>
<thrift.args>-I ${thrift.home} --gen java:beans,hashcode,generated_annotations=undated</thrift.args>
(just asking, I don't know maven): should this be set in the global pom, or in the thriftserver pom? How can this be used?
If it requires setting it manually in your pom, and the generated files are committed anyway, then I would not do it.
Currently it is the same as Hive. Making it work like antlr4 would be best, but I may need some help... since I am not particularly good at Maven's various plugins, I am learning Hive's approach.
@juliuszsompolski Please see this PR: AngersZhuuuu#2
pom.xml (Outdated)
<CodeCacheSize>1g</CodeCacheSize>

<!-- Thrift properties -->
<thrift.home>you-must-set-this-to-run-thrift</thrift.home>
Does it need to be set, or can Maven find thrift from PATH? Or even better, do the generation via libthrift from maven central itself?
pom.xml (Outdated)
</profile>

<profile>
  <id>thriftif</id>
Could this be in the thriftserver pom file?
It is ok to move it to the thriftserver pom.
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_${scala.binary.version}</artifactId>
What is the dependency on spark-hive still needed for? Is it just for the metastore?
In the current thriftserver some places use HiveUtils, Hive and HiveDelegationTokenProvider, which come from spark-hive.
If the final implementation does not need these, we can remove them later.
<dependency>
  <groupId>${hive.group}</groupId>
  <artifactId>hive-beeline</artifactId>
</dependency>
The hive- dependencies are only for beeline now, and for testing with the Hive JDBC client?
Currently beeline still needs these dependencies. In the first version we can't do everything, so this is copied from sql/hive-thriftserver. The first step is to replace the old one.
Maybe we could consider separating the server and the client into separate modules? A sql/jdbc module would have hive-beeline, hive-jdbc and hive-service and be used to provide bin/beeline, but it would only be a testing dependency of thriftserver, to provide the client for testing - that would further separate the server code from Hive. Could the existing sql/hive-thriftserver then also depend on it?
@wangyum @gatorsmile what do you think? (definitely in a separate PR from this one)
+1
@juliuszsompolski @AngersZhuuuu Could we complete the entire module first and then split it into smaller PRs to commit?
@wangyum I agree to not commit anything until the overall change is ready, but to be able to review, I would like partial PRs to be stacked on top of each other - so that the tens of thousands of lines of moved code are in separate PRs from the smaller changesets that actually change something.
namespace cpp apache.spark.sql.thriftserver.cli.thrift

// List of protocol versions. A new token should be
// added to the end of this list every time a change is made.
@juliuszsompolski Ok with this package name?
👍
New code in https://github.com/AngersZhuuuu/spark/tree/SPARK-29018-V11-STEP2
Current change list:
- Implement Type in scala since spark can't support all types of hive
- Implement Service/AbstractService to prepare for removing hive conf in the future
- Construct RowSet with StructType and Row
- Implement HiveAuthFactory, since the delegation token management changed between 1.2.1/2.3.5; implement our own delegation token management in scala
- Move tableTypeString from SparkMetadataOperationUtils to SparkMetadataOperation
- Since tableTypeString is in SparkMetadataOperation, remove ClassicTypeMapping, HiveTableTypeMapping, TableTypeMapping and TableTypeMappingFactory
- Implement all operations for spark since they execute in a different way
- Add new methods GetQueryId and SetClientInfo for thrift version v11 in ThriftCLIService
- Add statementId to Operation to implement GetQueryId (see the sketch below)
- Remove GlobalHivercFileProcessor, setFetchSize, processGlobalInitFile etc.
Still working on this.
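A minimal, hypothetical sketch of the statementId / GetQueryId idea from the list above; the names SparkOperation, ExampleStatementOperation and GetQueryIdSketch are illustrative only and do not reflect the actual classes in the linked branch:

```scala
import java.util.UUID

// Illustrative only: an operation carries a stable statementId assigned at
// creation time, which a v11 GetQueryId request could simply return.
abstract class SparkOperation {
  val statementId: String = UUID.randomUUID().toString
}

class ExampleStatementOperation(statement: String) extends SparkOperation {
  def run(): Unit = {
    // A real operation would run the statement through Spark SQL;
    // here we only show where statementId would be attached to logs/results.
    println(s"Running [$statementId]: $statement")
  }
}

object GetQueryIdSketch {
  def main(args: Array[String]): Unit = {
    val op = new ExampleStatementOperation("SELECT 1")
    op.run()
    // A GetQueryId handler would look up the operation by its handle
    // and return op.statementId to the client.
    println(op.statementId)
  }
}
```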
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
First step as we discussed in #25721 (comment)
Now we just add a new module and implement thrift protocol v11, to prepare for the next steps.
Why are the changes needed?
Implement a new thriftserver in Spark.
Does this PR introduce any user-facing change?
No for now
How was this patch tested?
No tests needed for now.