[SPARK-19359][SQL] renaming partition should not leave useless directories #16837
Test build #72523 has finished for PR 16837 at commit
Test build #72524 has finished for PR 16837 at commit
Does this change not require changing the other external catalog?
no it's only related to a hack in
LGTM, it is a better and safer way to process the useless dirs.
Test build #72547 has finished for PR 16837 at commit
The base path of wrongPartPath needs to be updated with the column from the previous iteration.
good catch!
Test build #72555 has finished for PR 16837 at commit
Since we renamed the previous actualPartitionPath to expectedPartitionPath in the last iteration, actualPartitionPath doesn't exist anymore.
oh yea, we should always rename directory within one nested level.
Test build #72566 has started for PR 16837 at commit
    tablePath = new Path(table.location)
    // the `A=2` directory is still there, we follow this behavior from hive.
    assert(fs.listStatus(tablePath)
      .filterNot(_.getPath.toString.contains("A=2")).count(_.isDirectory) == 1)
If the intent is to check that the A=2 directory exists, shouldn't this be filter?
I wanna check the number of partition directories except A=2.
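
To make the filter vs. filterNot distinction in this exchange concrete, here is a minimal sketch. It is an illustration only: it uses `java.io.File` instead of the Hadoop `FileSystem` API from the test, it matches on the directory name rather than the full path string, and `PartitionDirCount`, `countExcept`, and `countMatching` are hypothetical names that are not part of the patch.

```scala
import java.io.File

object PartitionDirCount {
  // Direct child directories of `tableDir`, or empty if it cannot be listed.
  private def childDirs(tableDir: File): Seq[File] =
    Option(tableDir.listFiles()).map(_.toSeq).getOrElse(Seq.empty).filter(_.isDirectory)

  // filterNot: count every partition directory except the excluded one.
  def countExcept(tableDir: File, excluded: String): Int =
    childDirs(tableDir).filterNot(_.getName.contains(excluded)).size

  // filter: count only the directories whose name matches.
  def countMatching(tableDir: File, wanted: String): Int =
    childDirs(tableDir).filter(_.getName.contains(wanted)).size
}
```

With `filterNot`, a leftover directory such as `a=1` would raise the count above the expected value, which is what the regression test is meant to catch.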
retest this please
LGTM
Test build #72578 has finished for PR 16837 at commit
        newSpec: TablePartitionSpec): Path = {
      import ExternalCatalogUtils.getPartitionPathString

      var totalPath = tablePath
How about currFullPath or fullPath?
LGTM except one comment.
Test build #72617 has finished for PR 16837 at commit
Thanks! Merging to master.
## What changes were proposed in this pull request?

The Hive metastore is not case-preserving and keeps partition column names in lower case. If Spark SQL creates a table with upper-case partition column names using `HiveExternalCatalog`, then when we rename a partition, it first calls the HiveClient to renamePartition, which creates a new lower-case partition path; Spark SQL then renames the lower-case path to the upper-case one.

However, renaming a nested path behaves differently on different file systems. For example, on Jenkins, renaming `a=1/b=2` to `A=2/B=2` succeeds but leaves an empty directory `a=1` behind. On macOS, the rename doesn't work as expected and results in `a=1/B=2`.

This PR renames the partition directory recursively, starting from the first partition column, in `HiveExternalCatalog`, to be as compatible as possible with different file systems.

## How was this patch tested?

New regression test.

Author: Wenchen Fan

Closes apache#16837 from cloud-fan/partition.
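
The level-by-level rename described above can be sketched roughly as follows. This is not the code from the patch: it uses `java.nio.file` in place of the Hadoop `FileSystem` API so that it is self-contained, it skips partition path escaping, and `RenamePartitionSketch`, `segment`, and `renameLevelByLevel` are hypothetical names chosen for illustration. It assumes `partCols` holds the case-preserving column names and `spec` is keyed by those same names.

```scala
import java.nio.file.{Files, Path}

object RenamePartitionSketch {
  // Hypothetical helper: one path segment per partition column, e.g. "A=1".
  // (No escaping of special characters is done in this sketch.)
  def segment(col: String, value: String): String = s"$col=$value"

  /**
   * Renames the lower-case partition directories created by the metastore
   * to their case-preserving names, one nested level at a time.
   * Returns the final case-preserving partition path.
   */
  def renameLevelByLevel(
      tablePath: Path,
      partCols: Seq[String],
      spec: Map[String, String]): Path = {
    var fullPath = tablePath
    for (col <- partCols) {
      val value = spec(col)
      val expected = fullPath.resolve(segment(col, value))
      // The wrong (lower-case) directory lives under the already-renamed
      // parent, so both paths are resolved against the same base.
      val actual = fullPath.resolve(segment(col.toLowerCase, value))
      if (actual != expected && Files.exists(actual)) {
        Files.move(actual, expected) // a rename within a single nested level
      }
      fullPath = expected
    }
    fullPath
  }
}
```

Because each step renames a directory inside one parent, no rename ever has to create or clean up intermediate directories, which is where the file systems mentioned above diverge. On a case-insensitive file system, `Files.move` may treat `a=1` and `A=1` as the same file and do nothing, so this sketch only demonstrates the traversal order.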