
Conversation

@AngersZhuuuu
Contributor

What changes were proposed in this pull request?

First step as we discussed in #25721 (comment)

For now we just add a new module and implement Thrift protocol v11, in preparation for the follow-up work.

Why are the changes needed?

Implement a new Thrift server in Spark.

Does this PR introduce any user-facing change?

No, not for now.

How was this patch tested?

No tests are needed at this stage.

@AmplabJenkins

Can one of the admins verify this patch?

@AngersZhuuuu
Contributor Author

(bonus: could we make these thrift files be generated on the fly? seems possible: https://stackoverflow.com/questions/18767986/how-can-i-compile-all-thrift-files-thrift-as-a-maven-phase)

@juliuszsompolski
I'd like to do this later; it doesn't block anything. I've tried it just now and couldn't get it to work, so for quicker development of the follow-up work I'd rather come back to it later.

@AngersZhuuuu AngersZhuuuu changed the title [WIP][SPARK-29108][SQL]Add new module sql/thriftserver and add v11 thrift protocol [WIP][SPARK-29108][SQL] Add new module sql/thriftserver and add v11 thrift protocol Oct 23, 2019
Contributor

@juliuszsompolski left a comment

Could you add a README.txt like the one in Hive:

Thrift commands to generate files from TCLIService.thrift:
--------------------
thrift --gen java:beans,hashcode -o src/gen/thrift if/TCLIService.thrift

and keep the directory structure as created by it (i.e. code in src/gen/thrift/gen-javabean)

Now after running thrift, there needs to be a manual step of moving the code again to src/gen, because it always generates the gen-javabean directory.

// new word (with no underscores), and end with the word "Service".

namespace java org.apache.spark.service.rpc.thrift
namespace cpp apache.spark.service.rpc.thrift
Contributor

The org.apache.hive.service package name came from the code living in the hive-service module in Hive.
Since now it's all in one spark-thriftserver module, maybe change the package to org.apache.spark.sql.thriftserver.cli.thrift (as I suppose the other code will be in org.apache.spark.sql.thriftserver).

Contributor Author

The org.apache.hive.service package name came from the code living in the hive-service module in Hive.
Since now it's all in one spark-thriftserver module, maybe change the package to org.apache.spark.sql.thriftserver.cli.thrift (as I suppose the other code will be in org.apache.spark.sql.thriftserver).

In hive-2.3.x the generated Thrift protocol code was separated out of the org.apache.hive.service.cli.thrift package; here it becomes org.apache.spark.service.rpc.thrift,
because under that package there are some other classes built on top of the generated Thrift code.

Member

Does hive-jdbc still support it if we rename the package to org.apache.spark.service.rpc.thrift?

Contributor

@wangyum hive-jdbc will then not pick up this code, but will instead use the code from the hive-service dependency.
One may say that this is bad, because it makes the client and server run different code, but I'd say it is actually good - most people would use a standalone Hive JDBC client jar that uses the hive-service code from Hive. When we use it here, we test that the code in Spark does not break compatibility with the Hive client, even if in the future we decide to make some changes to it.

Contributor Author

@wangyum hive-jdbc will then not pick up this code, but will instead use the code from the hive-service dependency.
One may say that this is bad, because it makes the client and server run different code, but I'd say it is actually good - most people would use a standalone Hive JDBC client jar that uses the hive-service code from Hive. When we use it here, we test that the code in Spark does not break compatibility with the Hive client, even if in the future we decide to make some changes to it.

Yeah, it is feasible as long as it follows the protocol and stays backwards compatible. We already run a server that implements a Thrift server based on protocol v9, and it works well.

Contributor

In hive-2.3.x the generated Thrift protocol code was separated out of the org.apache.hive.service.cli.thrift package; here it becomes org.apache.spark.service.rpc.thrift,
because under that package there are some other classes built on top of the generated Thrift code.

Could we use org.apache.spark.thriftserver everywhere instead of org.apache.spark.service? The org.apache.hive.service package came from hive-service being a separate module, but we don't have that here.

Contributor Author

Could we use org.apache.spark.thriftserver everywhere instead of org.apache.spark.service? The org.apache.hive.service package came from hive-service being a separate module, but we don't have that here.

That is easy to do after #26221 (comment).
It is fine to do it this way since our package is thriftserver, not service.

@AngersZhuuuu
Contributor Author

Could you add a README.txt like the one in Hive:

Thrift commands to generate files from TCLIService.thrift:
--------------------
thrift --gen java:beans,hashcode -o src/gen/thrift if/TCLIService.thrift

and keep the directory structure as created by it (i.e. code in src/gen/thrift/gen-javabean)

Now after running thrift, there needs to be a manual step of moving the code again to src/gen, because it always generates the gen-javabean directory.

@juliuszsompolski I've done this.

@HyukjinKwon
Member

cc @wangyum

pom.xml Outdated
<!-- Thrift properties -->
<thrift.home>you-must-set-this-to-run-thrift</thrift.home>
<thrift.gen.dir>${basedir}/src/gen/thrift</thrift.gen.dir>
<thrift.args>-I ${thrift.home} --gen java:beans,hashcode,generated_annotations=undated</thrift.args>
Contributor

(Just asking, I don't know Maven): should this be set in the global pom, or in the thriftserver pom? How can this be used?

If it requires setting it manually in your pom, and the generated files are committed anyway, then I would not do it.

Contributor Author

(Just asking, I don't know Maven): should this be set in the global pom, or in the thriftserver pom? How can this be used?

If it requires setting it manually in your pom, and the generated files are committed anyway, then I would not do it.

Currently this is the same as Hive. Making it work like antlr4 would be best, but I may need some help... since I am not particularly familiar with Maven's various plugins, I am following Hive's approach for now.

Member

@juliuszsompolski Please see this PR: AngersZhuuuu#2

pom.xml Outdated
<CodeCacheSize>1g</CodeCacheSize>

<!-- Thrift properties -->
<thrift.home>you-must-set-this-to-run-thrift</thrift.home>
Contributor

Does it need to be set, or can Maven find thrift on the PATH? Or, even better, could we do the generation via libthrift from Maven Central itself?

pom.xml Outdated
</profile>

<profile>
<id>thriftif</id>
Contributor

Could this be in the thriftserver pom file?

Contributor Author

Could this be in the thriftserver pom file?

It is fine to move it to the thriftserver pom.
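
For reference, a rough sketch (not the actual contents of this PR) of what such a thriftif profile could look like inside the thriftserver pom if it follows Hive's maven-antrun-plugin approach; the plugin version, phase, and target paths below are assumptions, and the final move step only illustrates automating the manual "move out of gen-javabean" step mentioned above:

<profile>
  <id>thriftif</id>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-antrun-plugin</artifactId>
        <version>1.8</version>
        <executions>
          <execution>
            <id>generate-thrift-sources</id>
            <phase>generate-sources</phase>
            <goals>
              <goal>run</goal>
            </goals>
            <configuration>
              <target>
                <!-- wipe previously generated code, then regenerate from the IDL
                     using the external thrift compiler configured via thrift.home -->
                <delete dir="${thrift.gen.dir}"/>
                <mkdir dir="${thrift.gen.dir}"/>
                <exec executable="${thrift.home}/bin/thrift" failonerror="true">
                  <arg line="${thrift.args} -o ${thrift.gen.dir} if/TCLIService.thrift"/>
                </exec>
                <!-- illustrative: move the code out of the always-created
                     gen-javabean directory (target path is an assumption) -->
                <move todir="${basedir}/src/gen">
                  <fileset dir="${thrift.gen.dir}/gen-javabean"/>
                </move>
              </target>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</profile>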

<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_${scala.binary.version}</artifactId>
Contributor

What is the dependency on spark-hive still for? Is it just for the metastore?

Contributor Author

What is the dependency on spark-hive still for? Is it just for the metastore?

In the current thriftserver, some places use HiveUtils, Hive, and HiveDelegationTokenProvider,
which come from spark-hive.

If the final implementation does not need these, we can remove them later.

<dependency>
<groupId>${hive.group}</groupId>
<artifactId>hive-beeline</artifactId>
</dependency>
Contributor

The hive-* dependencies are now only for beeline, and for testing with the Hive JDBC client?

Contributor Author

The hive-* dependencies are now only for beeline, and for testing with the Hive JDBC client?

Currently beeline still needs these dependencies. In the first version we can't do everything, so this is copied from sql/hive-thriftserver. The first step is to replace the old one.

Contributor

Maybe we could consider separating the server and the client into separate modules? A sql/jdbc module would have hive-beeline, hive-jdbc and hive-service and be used to provide bin/beeline, but it would only be a testing dependency of the thriftserver, providing the client for testing - that would further separate the server code from Hive. Could the existing sql/hive-thriftserver then also depend on it?
@wangyum @gatorsmile what do you think? (definitely in a separate PR from this one)
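
For illustration only, a rough sketch of how the Hive client artifacts could then become test-scoped dependencies of the thriftserver module under such a split; the exact wiring below is hypothetical and not part of this PR:

<!-- hypothetical: Hive client jars kept only to test the server over JDBC -->
<dependency>
  <groupId>${hive.group}</groupId>
  <artifactId>hive-jdbc</artifactId>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>${hive.group}</groupId>
  <artifactId>hive-beeline</artifactId>
  <scope>test</scope>
</dependency>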

Member

+1

@wangyum
Member

wangyum commented Oct 24, 2019

@juliuszsompolski @AngersZhuuuu Could we complete the entire module first and then split it into smaller PRs to commit?

@juliuszsompolski
Contributor

@wangyum I agree we should not commit anything until the overall change is ready, but to be able to review it, I would like the partial PRs to be stacked on top of each other - so that the tens of thousands of lines of moved code are in separate PRs from the smaller changesets that actually change something.
My browser almost crashes when trying to read the huge PR with everything together :-).

namespace cpp apache.spark.sql.thriftserver.cli.thrift

// List of protocol versions. A new token should be
// added to the end of this list every time a change is made.
Contributor Author

@juliuszsompolski Ok with this package name?

Contributor

👍

Contributor Author

👍

New code in https://github.com/AngersZhuuuu/spark/tree/SPARK-29018-V11-STEP2

Current change list:

  1. Implement Type in Scala, since Spark doesn't support all of Hive's types
  2. Implement Service/AbstractService, preparing to remove the Hive conf in the future
  3. Construct RowSet with StructType and Row
  4. Implement HiveAuthFactory, since delegation token management changed between Hive 1.2.1 and 2.3.5; implement the delegation token management in Scala
  5. Move tableTypeString from SparkMetadataOperationUtils to SparkMetadataOperation
  6. Since tableTypeString is now in SparkMetadataOperation, remove ClassicTypeMapping, HiveTableTypeMapping, TableTypeMapping and TableTypeMappingFactory
  7. Implement all operations for Spark, since they execute in a different way
  8. Add the new methods GetQueryId and SetClientInfo for Thrift protocol v11 in ThriftCLIService
  9. Add statementId to Operation to implement GetQueryId
  10. Remove GlobalHivercFileProcessor, setFetchSize, processGlobalInitFile, etc.

Still working on this.

@github-actions

github-actions bot commented Feb 6, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 6, 2020
@github-actions github-actions bot closed this Feb 7, 2020