
Conversation

@AngersZhuuuu
Contributor

What changes were proposed in this pull request?

First step as we discussed in #25721 (comment)

For now we just add a new module and implement Thrift protocol v11, in preparation for the follow-up work.

Why are the changes needed?

Implement a new Thrift server in Spark.

Does this PR introduce any user-facing change?

No, not for now.

How was this patch tested?

No tests are needed at this stage.

@AmplabJenkins

Can one of the admins verify this patch?

@AngersZhuuuu
Contributor Author

(bonus: could we make these thrift files be generated on the fly? seems possible: https://stackoverflow.com/questions/18767986/how-can-i-compile-all-thrift-files-thrift-as-a-maven-phase)

@juliuszsompolski
I'd like to do this later; it doesn't block anything. I've tried it just now and couldn't get it to work, so for quicker development of the follow-up work I'd rather come back to it later.

@AngersZhuuuu AngersZhuuuu changed the title [WIP][SPARK-29108][SQL]Add new module sql/thriftserver and add v11 thrift protocol [WIP][SPARK-29108][SQL] Add new module sql/thriftserver and add v11 thrift protocol Oct 23, 2019
Contributor

@juliuszsompolski left a comment

Could you add a README.txt like the one in Hive:

Thrift commands to generate files from TCLIService.thrift:
--------------------
thrift --gen java:beans,hashcode -o src/gen/thrift if/TCLIService.thrift

and keep the directory structure as created by it (i.e. code in src/gen/thrift/gen-javabean)

Now after running thrift, there needs to be a manual step of moving the code again to src/gen, because it always generates the gen-javabean directory.

// new word (with no underscores), and end with the word "Service".

namespace java org.apache.spark.service.rpc.thrift
namespace cpp apache.spark.service.rpc.thrift
Contributor

The org.apache.hive.service package name came from the code living in the hive-service module in Hive.
Since now it's all in one spark-thriftserver module, maybe change the package to org.apache.spark.sql.thriftserver.cli.thrift (as I suppose the other code will be in org.apache.spark.sql.thriftserver).

Contributor Author

The org.apache.hive.service package name came from the code living in the hive-service module in Hive.
Since now it's all in one spark-thriftserver module, maybe change the package to org.apache.spark.sql.thriftserver.cli.thrift (as I suppose the other code will be in org.apache.spark.sql.thriftserver).

In hive-2.3.x the generated Thrift protocol code was separated out of the org.apache.hive.service.cli.thrift package; here it becomes org.apache.spark.service.rpc.thrift,
because under that package there are some other classes built on top of the generated Thrift code.

Member

Does hive-jdbc still support it if we rename the package to org.apache.spark.service.rpc.thrift?

Contributor

@wangyum hive-jdbc will then not pick up this code, but will instead use the code from the hive-service dependency.
One may say that this is bad, because it makes the client and server run different code, but I'd say it is actually good - most people would use a standalone Hive JDBC client jar that uses the hive-service code from Hive. When we use it here, we test that the code in Spark does not break compatibility with the Hive client, even if in the future we decide to make some changes to it.

Contributor Author

@wangyum hive-jdbc will then not pick up this code, but will instead use the code from the hive-service dependency.
One may say that this is bad, because it makes the client and server run different code, but I'd say it is actually good - most people would use a standalone Hive JDBC client jar that uses the hive-service code from Hive. When we use it here, we test that the code in Spark does not break compatibility with the Hive client, even if in the future we decide to make some changes to it.

Yeah, it is feasible as long as it follows the protocol and stays backwards compatible. We already run a server that implements a Thrift server based on protocol v9, and it works well.

Contributor

In hive-2.3.x the generated Thrift protocol code was separated out of the org.apache.hive.service.cli.thrift package; here it becomes org.apache.spark.service.rpc.thrift,
because under that package there are some other classes built on top of the generated Thrift code.

Could we use org.apache.spark.thriftserver everywhere instead of org.apache.spark.service? The org.apache.hive.service package came from hive-service being a separate module, but we don't have that here.

Contributor Author

Could we use org.apache.spark.thriftserver everywhere instead of org.apache.spark.service? The org.apache.hive.service package came from hive-service being a separate module, but we don't have that here.

That is easy to do after #26221 (comment).
It is fine to do it this way since our package is thriftserver, not service.

@AngersZhuuuu
Contributor Author

Could you add a README.txt like the one in Hive:

Thrift commands to generate files from TCLIService.thrift:
--------------------
thrift --gen java:beans,hashcode -o src/gen/thrift if/TCLIService.thrift

and keep the directory structure as created by it (i.e. code in src/gen/thrift/gen-javabean)

Now after running thrift, there needs to be a manual step of moving the code again to src/gen, because it always generates the gen-javabean directory.

@juliuszsompolski I've done this.

@HyukjinKwon
Member

cc @wangyum

pom.xml Outdated
<!-- Thrift properties -->
<thrift.home>you-must-set-this-to-run-thrift</thrift.home>
<thrift.gen.dir>${basedir}/src/gen/thrift</thrift.gen.dir>
<thrift.args>-I ${thrift.home} --gen java:beans,hashcode,generated_annotations=undated</thrift.args>
Contributor

(Just asking, I don't know Maven): should this be set in the global pom, or in the thriftserver pom? How can this be used?

If it requires setting it manually in your pom, and the generated files are committed anyway, then I would not do it.

Contributor Author

(Just asking, I don't know Maven): should this be set in the global pom, or in the thriftserver pom? How can this be used?

If it requires setting it manually in your pom, and the generated files are committed anyway, then I would not do it.

Currently this is the same as Hive. Making it work like antlr4 would be best, but I may need some help... since I am not particularly familiar with Maven's various plugins, I am following Hive's approach for now.

Member

@juliuszsompolski Please see this PR: AngersZhuuuu#2

pom.xml Outdated
<CodeCacheSize>1g</CodeCacheSize>

<!-- Thrift properties -->
<thrift.home>you-must-set-this-to-run-thrift</thrift.home>
Contributor

Does it need to be set, or can Maven find thrift on the PATH? Or, even better, could we do the generation via libthrift from Maven Central itself?

pom.xml Outdated
</profile>

<profile>
<id>thriftif</id>
Contributor

Could this be in the thriftserver pom file?

Contributor Author

Could this be in the thriftserver pom file?

It is fine to move it to the thriftserver pom.
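
For reference, a rough sketch (not the actual contents of this PR) of what such a thriftif profile could look like inside the thriftserver pom if it follows Hive's maven-antrun-plugin approach; the plugin version, phase, and target paths below are assumptions, and the final move step only illustrates automating the manual "move out of gen-javabean" step mentioned above:

<profile>
  <id>thriftif</id>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-antrun-plugin</artifactId>
        <version>1.8</version>
        <executions>
          <execution>
            <id>generate-thrift-sources</id>
            <phase>generate-sources</phase>
            <goals>
              <goal>run</goal>
            </goals>
            <configuration>
              <target>
                <!-- wipe previously generated code, then regenerate from the IDL
                     using the external thrift compiler configured via thrift.home -->
                <delete dir="${thrift.gen.dir}"/>
                <mkdir dir="${thrift.gen.dir}"/>
                <exec executable="${thrift.home}/bin/thrift" failonerror="true">
                  <arg line="${thrift.args} -o ${thrift.gen.dir} if/TCLIService.thrift"/>
                </exec>
                <!-- illustrative: move the code out of the always-created
                     gen-javabean directory (target path is an assumption) -->
                <move todir="${basedir}/src/gen">
                  <fileset dir="${thrift.gen.dir}/gen-javabean"/>
                </move>
              </target>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</profile>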

<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_${scala.binary.version}</artifactId>
Contributor

What is the dependency on spark-hive still for? Is it just for the metastore?

Contributor Author

What is the dependency on spark-hive still for? Is it just for the metastore?

In the current thriftserver, some places use HiveUtils, Hive, and HiveDelegationTokenProvider,
which come from spark-hive.

If the final implementation does not need these, we can remove them later.

<dependency>
<groupId>${hive.group}</groupId>
<artifactId>hive-beeline</artifactId>
</dependency>
Contributor

The hive-* dependencies are now only for beeline, and for testing with the Hive JDBC client?

Contributor Author

The hive-* dependencies are now only for beeline, and for testing with the Hive JDBC client?

Currently beeline still needs these dependencies. In the first version we can't do everything, so this is copied from sql/hive-thriftserver. The first step is to replace the old one.

Contributor

Maybe we could consider separating the server and the client into separate modules? A sql/jdbc module would have hive-beeline, hive-jdbc and hive-service and be used to provide bin/beeline, but it would only be a testing dependency of the thriftserver, providing the client for testing - that would further separate the server code from Hive. Could the existing sql/hive-thriftserver then also depend on it?
@wangyum @gatorsmile what do you think? (definitely in a separate PR from this one)
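
For illustration only, a rough sketch of how the Hive client artifacts could then become test-scoped dependencies of the thriftserver module under such a split; the exact wiring below is hypothetical and not part of this PR:

<!-- hypothetical: Hive client jars kept only to test the server over JDBC -->
<dependency>
  <groupId>${hive.group}</groupId>
  <artifactId>hive-jdbc</artifactId>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>${hive.group}</groupId>
  <artifactId>hive-beeline</artifactId>
  <scope>test</scope>
</dependency>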

Member

+1

@wangyum
Member

wangyum commented Oct 24, 2019

@juliuszsompolski @AngersZhuuuu Could we complete the entire module first and then split it into smaller PRs to commit?

@juliuszsompolski
Contributor

@wangyum I agree we should not commit anything until the overall change is ready, but to be able to review it, I would like the partial PRs to be stacked on top of each other - so that the tens of thousands of lines of moved code are in separate PRs from the smaller changesets that actually change something.
My browser almost crashes when trying to read the huge PR with everything together :-).

namespace cpp apache.spark.sql.thriftserver.cli.thrift

// List of protocol versions. A new token should be
// added to the end of this list every time a change is made.
Contributor Author

@juliuszsompolski Ok with this package name?

Contributor

👍

Contributor Author

👍

New code in https://github.com/AngersZhuuuu/spark/tree/SPARK-29018-V11-STEP2

Current change list:

  1. Implement Type in Scala, since Spark doesn't support all of Hive's types
  2. Implement Service/AbstractService, preparing to remove the Hive conf in the future
  3. Construct RowSet with StructType and Row
  4. Implement HiveAuthFactory, since delegation token management changed between Hive 1.2.1 and 2.3.5; implement the delegation token management in Scala
  5. Move tableTypeString from SparkMetadataOperationUtils to SparkMetadataOperation
  6. Since tableTypeString is now in SparkMetadataOperation, remove ClassicTypeMapping, HiveTableTypeMapping, TableTypeMapping and TableTypeMappingFactory
  7. Implement all operations for Spark, since they execute in a different way
  8. Add the new methods GetQueryId and SetClientInfo for Thrift protocol v11 in ThriftCLIService
  9. Add statementId to Operation to implement GetQueryId
  10. Remove GlobalHivercFileProcessor, setFetchSize, processGlobalInitFile, etc.

Still working on this.

@github-actions

github-actions bot commented Feb 6, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 6, 2020
@github-actions github-actions bot closed this Feb 7, 2020