Skip to content

Conversation

@cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Feb 7, 2017

What changes were proposed in this pull request?

Hive metastore is not case-preserving and keep partition columns with lower case names. If Spark SQL creates a table with upper-case partition column names using HiveExternalCatalog, when we rename partition, it first calls the HiveClient to renamePartition, which will create a new lower case partition path, then Spark SQL renames the lower case path to upper-case.

However, when we rename a nested path, different file systems have different behaviors. e.g. in jenkins, renaming a=1/b=2 to A=2/B=2 will success, but leave an empty directory a=1. in mac os, the renaming doesn't work as expected and result to a=1/B=2.

This PR renames the partition directory recursively from the first partition column in HiveExternalCatalog, to be most compatible with different file systems.

How was this patch tested?

new regression test

@cloud-fan
Copy link
Contributor Author

cc @gatorsmile @windpiger @viirya

@SparkQA
Copy link

SparkQA commented Feb 7, 2017

Test build #72523 has finished for PR 16837 at commit 4725444.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 7, 2017

Test build #72524 has finished for PR 16837 at commit 953cb9e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Feb 7, 2017

Does this change not require changing the other external catalog?

@cloud-fan
Copy link
Contributor Author

no it's only related to a hack in HiveExternalCatalog, I have updated the description.

@windpiger
Copy link
Contributor

LGTM,it is a better and safer way to process the useless dirs.

@SparkQA
Copy link

SparkQA commented Feb 8, 2017

Test build #72547 has finished for PR 16837 at commit c1a7f1d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The basepath of wrongPartPath needs to update with the column in previous iteration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

@SparkQA
Copy link

SparkQA commented Feb 8, 2017

Test build #72555 has finished for PR 16837 at commit d8f2ba5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we rename previous actualPartitionPath to expectedPartitionPath in last run, actualPartitionPath doesn't exist anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yea, we should always rename directory within one nested level.

@SparkQA
Copy link

SparkQA commented Feb 8, 2017

Test build #72566 has started for PR 16837 at commit 329886e.

tablePath = new Path(table.location)
// the `A=2` directory is still there, we follow this behavior from hive.
assert(fs.listStatus(tablePath)
.filterNot(_.getPath.toString.contains("A=2")).count(_.isDirectory) == 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If going to check A=2 directory exist, I think here is filter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanna check the number of partition directories except A=2.

@cloud-fan
Copy link
Contributor Author

retest this please

@viirya
Copy link
Member

viirya commented Feb 8, 2017

LGTM

@SparkQA
Copy link

SparkQA commented Feb 8, 2017

Test build #72578 has finished for PR 16837 at commit 329886e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

newSpec: TablePartitionSpec): Path = {
import ExternalCatalogUtils.getPartitionPathString

var totalPath = tablePath
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about currFullPath or fullPath?

@gatorsmile
Copy link
Member

LGTM except one comment.

@SparkQA
Copy link

SparkQA commented Feb 9, 2017

Test build #72617 has finished for PR 16837 at commit 2c4d3c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

Thanks! Merging to master.

@asfgit asfgit closed this in 50a9912 Feb 9, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
…ories

## What changes were proposed in this pull request?

Hive metastore is not case-preserving and keep partition columns with lower case names. If Spark SQL creates a table with upper-case partition column names using `HiveExternalCatalog`, when we rename partition, it first calls the HiveClient to renamePartition, which will create a new lower case partition path, then Spark SQL renames the lower case path to upper-case.

However, when we rename a nested path, different file systems have different behaviors. e.g. in jenkins, renaming `a=1/b=2` to `A=2/B=2` will success, but leave an empty directory `a=1`. in mac os, the renaming doesn't work as expected and result to `a=1/B=2`.

This PR renames the partition directory recursively from the first partition column in `HiveExternalCatalog`, to be most compatible with different file systems.

## How was this patch tested?

new regression test

Author: Wenchen Fan <[email protected]>

Closes apache#16837 from cloud-fan/partition.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants