[SPARK-19359][SQL] renaming partition should not leave useless directories #16837

cloud-fan · 2017-02-07T17:37:32Z

What changes were proposed in this pull request?

The Hive metastore is not case-preserving and keeps partition columns with lower-case names. If Spark SQL creates a table with upper-case partition column names using HiveExternalCatalog, renaming a partition first calls renamePartition on the HiveClient, which creates a new lower-case partition path; Spark SQL then renames the lower-case path to the upper-case one.

However, file systems behave differently when renaming a nested path. For example, on Jenkins, renaming a=1/b=2 to A=2/B=2 succeeds but leaves an empty directory a=1. On macOS, the rename doesn't work as expected and results in a=1/B=2.

This PR renames the partition directory in HiveExternalCatalog recursively, one level at a time starting from the first partition column, to be as compatible as possible with different file systems.
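A minimal sketch of the level-by-level idea, not the code in this PR. It only plans the renames as path strings and assumes the wrong path at each level is simply the lower-cased column name; `RenamePlanSketch`, `Step`, and `plan` are hypothetical names, and no escaping of partition values is done here.

```scala
import java.util.Locale

// Hypothetical helper for illustration only: plans one rename per partition level.
object RenamePlanSketch {
  /** A single directory rename, expressed as plain path strings. */
  final case class Step(from: String, to: String)

  /**
   * @param tablePath table root directory
   * @param partCols  partition column names in the case Spark expects, in order
   * @param spec      partition values keyed by those column names
   */
  def plan(tablePath: String, partCols: Seq[String], spec: Map[String, String]): Seq[Step] = {
    // Carry the already-corrected parent path forward, so each rename
    // touches exactly one directory level.
    val (_, steps) = partCols.foldLeft((tablePath, Vector.empty[Step])) {
      case ((parent, acc), col) =>
        val value = spec(col)
        val wrong = s"$parent/${col.toLowerCase(Locale.ROOT)}=$value"
        val right = s"$parent/$col=$value"
        // Skip levels whose name is already correct.
        (right, if (wrong != right) acc :+ Step(wrong, right) else acc)
    }
    steps
  }
}
```

For a hypothetical table at `/t` with partition columns `A` and `B` and values `1` and `2`, the plan is to rename `/t/a=1` to `/t/A=1`, then `/t/A=1/b=2` to `/t/A=1/B=2`. No step ever renames across two levels at once.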

How was this patch tested?

new regression test

cloud-fan · 2017-02-07T17:43:25Z

cc @gatorsmile @windpiger @viirya

SparkQA · 2017-02-07T18:51:05Z

Test build #72523 has finished for PR 16837 at commit 4725444.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-02-07T18:58:47Z

Test build #72524 has finished for PR 16837 at commit 953cb9e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2017-02-07T21:26:50Z

Does this change not require changing the other external catalog?

cloud-fan · 2017-02-08T01:09:38Z

No, it's only related to a hack in HiveExternalCatalog. I have updated the description.

windpiger · 2017-02-08T01:45:20Z

LGTM, it is a better and safer way to handle the useless dirs.

SparkQA · 2017-02-08T02:29:26Z

Test build #72547 has finished for PR 16837 at commit c1a7f1d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-02-08T02:34:17Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

The base path of wrongPartPath needs to be updated with the column from the previous iteration.

good catch!

SparkQA · 2017-02-08T05:29:00Z

Test build #72555 has finished for PR 16837 at commit d8f2ba5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-02-08T06:18:39Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

Since we renamed the previous actualPartitionPath to expectedPartitionPath in the last iteration, actualPartitionPath doesn't exist anymore.

Oh yeah, we should always rename a directory within one nesting level.
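The point of this exchange can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: `ToyFs` is a hypothetical stand-in for a file system, holding a set of directory paths, and it is not the Hadoop FileSystem API.

```scala
// Hypothetical toy model for illustration; not the Hadoop FileSystem API.
object StalePathSketch {
  /** A file system reduced to a set of directory paths. */
  final class ToyFs(initial: Set[String]) {
    private var dirs: Set[String] = initial

    def exists(path: String): Boolean = dirs.contains(path)

    /** Moves `from` and everything below it to `to`; returns false if `from` is missing. */
    def rename(from: String, to: String): Boolean =
      if (!dirs.contains(from)) false
      else {
        dirs = dirs.map { d =>
          if (d == from || d.startsWith(from + "/")) to + d.substring(from.length) else d
        }
        true
      }
  }
}
```

Starting from `/t/a=1` and `/t/a=1/b=2`, renaming `/t/a=1` to `/t/A=1` also moves the child, so a second rename that still starts from `/t/a=1/b=2` fails in this model. Starting the second rename from `/t/A=1/b=2` succeeds.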

SparkQA · 2017-02-08T06:48:38Z

Test build #72566 has started for PR 16837 at commit 329886e.

viirya · 2017-02-08T07:58:37Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionProviderCompatibilitySuite.scala

+      tablePath = new Path(table.location)
+      // the `A=2` directory is still there, we follow this behavior from hive.
+      assert(fs.listStatus(tablePath)
+        .filterNot(_.getPath.toString.contains("A=2")).count(_.isDirectory) == 1)


If the goal is to check that the A=2 directory exists, shouldn't this be filter?

I wanna check the number of partition directories except A=2.
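The difference between the two calls can be shown on plain strings. This is a sketch with made-up directory names, not the listing the test actually sees; `FilterSketch` and its members are hypothetical.

```scala
// Hypothetical directory names for illustration; the real test lists FileStatus objects.
object FilterSketch {
  val dirNames: Seq[String] = Seq("A=2", "A=5")

  /** Directories other than A=2, which is what the assertion counts. */
  def others: Seq[String] = dirNames.filterNot(_.contains("A=2"))

  /** Directories matching A=2, which is what `filter` would select instead. */
  def matching: Seq[String] = dirNames.filter(_.contains("A=2"))
}
```

With these names, `others` contains only `A=5` and `matching` contains only `A=2`, so `filterNot` is the right call when the leftover `A=2` directory should be excluded from the count.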

cloud-fan · 2017-02-08T10:52:52Z

retest this please

viirya · 2017-02-08T13:23:21Z

LGTM

SparkQA · 2017-02-08T13:36:53Z

Test build #72578 has finished for PR 16837 at commit 329886e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-02-08T21:29:25Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

+      newSpec: TablePartitionSpec): Path = {
+    import ExternalCatalogUtils.getPartitionPathString
+
+    var totalPath = tablePath


How about currFullPath or fullPath?

gatorsmile · 2017-02-08T21:52:53Z

LGTM except one comment.

SparkQA · 2017-02-09T04:01:05Z

Test build #72617 has finished for PR 16837 at commit 2c4d3c7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-02-09T05:39:36Z

Thanks! Merging to master.

…ories

## What changes were proposed in this pull request?

The Hive metastore is not case-preserving and keeps partition columns with lower-case names. If Spark SQL creates a table with upper-case partition column names using `HiveExternalCatalog`, renaming a partition first calls renamePartition on the HiveClient, which creates a new lower-case partition path; Spark SQL then renames the lower-case path to the upper-case one.

However, file systems behave differently when renaming a nested path. For example, on Jenkins, renaming `a=1/b=2` to `A=2/B=2` succeeds but leaves an empty directory `a=1`. On macOS, the rename doesn't work as expected and results in `a=1/B=2`.

This PR renames the partition directory in `HiveExternalCatalog` recursively, one level at a time starting from the first partition column, to be as compatible as possible with different file systems.

## How was this patch tested?

new regression test

Author: Wenchen Fan <[email protected]>

Closes apache#16837 from cloud-fan/partition.

cloud-fan force-pushed the partition branch from 4725444 to 953cb9e Compare February 7, 2017 17:42

windpiger mentioned this pull request Feb 8, 2017

[SPARK-19359][SQL]clear useless path after rename a partition with upper-case by HiveExternalCatalog #16700

Closed

cloud-fan force-pushed the partition branch from 953cb9e to c1a7f1d Compare February 8, 2017 01:18

viirya reviewed Feb 8, 2017


cloud-fan force-pushed the partition branch from c1a7f1d to d8f2ba5 Compare February 8, 2017 03:56

viirya reviewed Feb 8, 2017


renaming partition should not leave useless directories

329886e

cloud-fan force-pushed the partition branch from d8f2ba5 to 329886e Compare February 8, 2017 06:43

viirya reviewed Feb 8, 2017


gatorsmile reviewed Feb 8, 2017


address comments

2c4d3c7

asfgit closed this in 50a9912 Feb 9, 2017


6 participants
