[SPARK-29786][SQL] Fix MetaException when dropping a partition not exists on HDFS #26422
Conversation
|
Maybe you should add some server backend error message. |
Hi @AngersZhuuuu , I've checked for logs of the driver, it's almost the same. |
|
Gentle ping, @cloud-fan |
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
// (b='1', c='1') and (b='1', c='2'), a partial spec of (b='1') will match both.
val parts = client.getPartitions(hiveTable, s.asJava).asScala
// Check whether the partition we are going to drop is empty.
// We make a dummy one for the empty partition. See [SPARK-29786] for more details.
Is this how Hive resolves the problem?
Yes, it's the same method that Hive uses.
Isn't it bad for performance? i.e. you call fs.exists and fs.listStatus for each partition.
Yes, but it only affects dropping partitions. I think the check is necessary and won't take much time during the drop.
Can you point to the Hive source code that does the same thing? i.e. create a dummy directory before dropping the partition.
In Hive 1.x, it's like this.
is it for DROP PARTITION?
No, it will check every query before executing. Maybe it's better to do the check before all queries?
Does Spark have a problem doing a table scan when the partition directory does not exist?
It's related to #24668, and controlled by spark.sql.files.ignoreMissingFiles.
Spark will check it when listing leaf files.
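For reference, the setting named in this reply can be supplied when a session is built. This is only a minimal sketch, assuming a local `SparkSession`; the application name and master URL are illustrative placeholders, and how the flag interacts with missing partition directories is as described in the comment above, not verified here.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative session; appName and master are placeholders, not from the PR.
val spark = SparkSession.builder()
  .appName("ignore-missing-files-sketch")
  .master("local[*]")
  // The configuration key mentioned in the comment above.
  .config("spark.sql.files.ignoreMissingFiles", "true")
  .getOrCreate()
```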
|
ok to test |
|
@Deegue can we add a test case for it? |
|
retest this please |
|
Test build #119927 has finished for PR 26422 at commit
|
|
retest this please |
|
ok to test |
|
retest this please |
|
Test build #120076 has finished for PR 26422 at commit
|
|
retest this please |
|
Test build #120082 has finished for PR 26422 at commit
|
|
retest this please |
|
retest this please |
|
Test build #120087 has finished for PR 26422 at commit
|
…()`, using Partition.getPath() instead.
|
Test build #120093 has finished for PR 26422 at commit
|
|
Test build #120110 has finished for PR 26422 at commit
|
Added one and all tests passed. |
// We make a dummy one for the empty partition. See [SPARK-29786] for more details.
parts.foreach { partition =>
  val partPath = partition.getPath.head
  if (isEmptyPath(partPath)) {
I think we need to check for a non-existing path, not an empty path?
Yes, you're right. We only need to check whether the path itself exists, not the contents under it.
|
Test build #120460 has finished for PR 26422 at commit
|
|
Test build #120469 has finished for PR 26422 at commit
|
viirya
left a comment
Based on the description: currently, when dropping a partition that exists only in the metastore and not on HDFS, the partition is dropped from the metastore, but an exception is then thrown. Is that correct?
if (isExistPath(partPath)) {
  val fs = partPath.getFileSystem(conf)
  fs.mkdirs(partPath)
  fs.deleteOnExit(partPath)
}
I'm confused. When the partition exists (isExistPath returns true), why do you need to mkdir it again?
Sorry, I mistakenly deleted the ! when adjusting the code.
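The guard discussed in this thread (create the placeholder directory only when the partition path is missing) can be sketched roughly as follows. This is not the PR's final code: the object name `DummyPartitionDir` and the method `ensurePartitionDir` are hypothetical, and the sketch assumes `hadoop-common` is on the classpath. It uses only the Hadoop `FileSystem` calls that appear in the diff, plus `exists`.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper, named for illustration only.
object DummyPartitionDir {
  /**
   * Creates a placeholder directory only when `partPath` does not exist.
   * Returns true if a placeholder was created, false if the path was already there.
   */
  def ensurePartitionDir(partPath: Path, conf: Configuration): Boolean = {
    val fs = partPath.getFileSystem(conf)
    if (!fs.exists(partPath)) {
      // Note the negation: an existing partition directory is left untouched.
      fs.mkdirs(partPath)
      // Mark the placeholder for cleanup in case the drop does not remove it.
      fs.deleteOnExit(partPath)
      true
    } else {
      false
    }
  }
}
```

With this shape, each partition costs a single `exists` call, which matches the later agreement in the thread to check only for a non-existing path rather than listing its contents.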
Yes, but a package built from the latest master branch doesn't throw the exception, so I'm closing this PR. |
|
Test FAILed. |
What changes were proposed in this pull request?
When we drop a partition that exists in the Hive metastore but does not exist on HDFS, it should be dropped successfully instead of throwing a MetaException.
Hive handles this case in the same way.
Example:
Before this patch:
After this patch:
Why are the changes needed?
When we drop a partition that exists in the Hive metastore but does not exist on HDFS, we receive a MetaException, even though the partition has actually been dropped. This is confusing; no exception should be thrown in this case.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit tests.