@dongjoon-hyun
Member

dongjoon-hyun commented Apr 1, 2024

What changes were proposed in this pull request?

This PR aims to support the Apache Hive 4.0 Metastore, where partition filters can be pushed down even for CHAR and VARCHAR types.

Why are the changes needed?

This is blocked by SPARK-47679 due to the incompatible change introduced by HIVE-27925.

Supporting more Hive versions (with extra performance improvement) is good for our users.

Does this PR introduce any user-facing change?

Yes. The documentation is updated to reflect support for the Hive 4.0 metastore.

How was this patch tested?

Manually

I used the apache/hive:4.0.0-beta-1 Docker image to start a metastore and a hiveserver2 (along with a hadoop3 Docker image).

Created a table:

CREATE EXTERNAL TABLE testTable1 ( 
  column1 String 
) PARTITIONED BY (partColumn1 CHAR(30), partColumn2 VARCHAR(30)) LOCATION 'hdfs://hadoop3:8020/tmp/hive_external/';

Inserted some values in beeline:

insert into table testtable1 values ("column1_v1", "partcolumn1_v1", "partcolumn2_v1"), ("column1_v2", "partcolumn1_v2", "partcolumn2_v2");

Started Spark in the hiveserver2 container with:

./bin/spark-shell --conf spark.sql.hive.metastore.version=4.0.0 --conf spark.sql.hive.metastore.jars="/opt/hive/lib/*"

Ran the query:

scala> sql("select * from testtable1 where partcolumn1 = 'partcolumn1_v1' and partcolumn2 = 'partcolumn2_v1'").show
Hive Session ID = 6846fe0e-968a-474d-afec-4f67b3a2a274
+----------+--------------------+--------------+
|   column1|         partcolumn1|   partcolumn2|
+----------+--------------------+--------------+
|column1_v1|partcolumn1_v1   ...|partcolumn2_v1|
+----------+--------------------+--------------+

Then checked the HMS calls in the metastore container, in the file /tmp/hive/hive.log:

...
2023-09-22T21:06:34,293  INFO [Metastore-Handler-Pool: Thread-1356] HiveMetaStore.audit: ugi=hive       ip=172.30.0.5   cmd=source:172.30.0.5 get_partitions_by_filter : tbl=hive.default.testtable1
...

The log contains the expected get_partitions_by_filter call.
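The same log check can be scripted instead of read by eye. The following is a minimal sketch, not part of the patch: `count_hms_calls` is a hypothetical helper name, and it assumes the audit lines have the format shown in the log excerpt above.

```shell
# Hypothetical helper (not part of the patch): count how many lines of a
# metastore log mention a given HMS call, e.g. get_partitions_by_filter.
count_hms_calls() {
  call="$1"   # HMS API name to look for
  log="$2"    # path to the metastore log file
  # grep -c prints the number of matching lines; it exits non-zero when the
  # count is 0, so '|| true' keeps the helper usable under 'set -e'.
  grep -c -F -- "$call" "$log" || true
}
```

For the run above, `count_hms_calls get_partitions_by_filter /tmp/hive/hive.log` should print a non-zero count if the filter reached the metastore. A count of 0 would suggest the filter was not pushed down.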

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Apr 1, 2024

Hi, @attilapiros.

I revised your closed PR, #43064, with your authorship. SPARK-45265 has been assigned to you, too. If you want, you can take over from here also.

$ git log -n1
commit cdb31b0984f4902e41c2e837d78964a4f278253c (HEAD -> SPARK-45265, dongjoon/SPARK-45265)
Author: attilapiros <[email protected]>
Date:   Mon Apr 1 15:41:52 2024 -0700

    [SPARK-45265][SQL] Supporting Hive 4.0 Metastore

@attilapiros
Contributor

Hi @dongjoon-hyun,

Thanks! I am fine either way.

By the way, shouldn't we extend the condition with || version == "4.0" here:
https://github.com/apache/spark/blob/f3f1ee3b481ef2419922eb296116ae583ad9e8c8/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala#L166

The rest looks good to me but we should wait for the tests to finish.

@dongjoon-hyun
Member Author

Thank you, @attilapiros. I addressed your comment and fixed the patch according to HIVE-21078 and HIVE-21164.
It seems there are more API changes on the Hive side. I'm still looking at the Hive API.

@dongjoon-hyun
Member Author

Sorry guys, @HyukjinKwon and @attilapiros.
It seems to require more effort than I thought.
I'll revisit this later this week.

@dongjoon-hyun
Member Author

Hi, @attilapiros, just a question. The description of your last attempt mentions partitioned tables, so I'm wondering how you tested partitioned tables with your previous PR. HIVE-21703 seems to be in 4.0.0-alpha-1 (30 March 2022). Did you have a special workaround at that time?

@attilapiros
Contributor

For testing, I ran Hive in a Docker image, as described here:
https://hive.apache.org/developement/quickstart/

Checking my command history, it was probably 4.0.0-beta-1.

@dongjoon-hyun
Member Author

Got it. Thank you for the info.

dongjoon-hyun pushed a commit that referenced this pull request Nov 14, 2024
### What changes were proposed in this pull request?

This PR continues the work from #43064 and #45801 to support Hive Metastore Server 4.0. CHAR/VARCHAR type partition filter pushdown is not included in this PR, as it requires further investment.

### Why are the changes needed?

Enhance support for multiple Hive Metastore Server versions.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

Passing HiveClient*Suites w/ 4.0

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #48823 from yaooqinn/SPARK-45265.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun dongjoon-hyun deleted the SPARK-45265 branch November 16, 2025 16:46