[SPARK-29786][SQL] Fix MetaException when dropping a partition not exists on HDFS #26422

Deegue · 2019-11-07T07:43:52Z

What changes were proposed in this pull request?

When we drop a partition that exists in the Hive metastore but not on HDFS, the drop should succeed instead of throwing a MetaException.

Hive handles this case the same way.

Example:
Before this patch:

spark-sql > alter table test.tmp drop partition(stat_day=20190516);
Error: Error running query: MetaException(message:File does not exist: /user/hive/warehouse/test.db/tmp/stat_day=20190516
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getContentSummary(FSDirectory.java:2414)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getContentSummary(FSNamesystem.java:4719)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getContentSummary(NameNodeRpcServer.java:1237)
	at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getContentSummary(AuthorizationProviderProxyClientProtocol.java:568)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getContentSummary(ClientNamenodeProtocolServerSideTranslatorPB.java:896)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2278)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2274)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2274)
) (state=,code=0)

After this patch:

spark-sql > alter table test.tmp drop partition(stat_day=20190516);
+---------+--+
| Result  |
+---------+--+
+---------+--+
No rows selected (0.521 seconds)

Why are the changes needed?

When we drop a partition that exists in the Hive metastore but not on HDFS, we receive a MetaException even though the partition has actually been dropped. This is confusing; no exception should be thrown in this case.
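To illustrate the idea discussed in review (create a dummy directory for a partition whose location is missing, then drop it), here is a minimal sketch. It is not the code from this PR: it uses `java.nio.file` on the local filesystem as a stand-in for Hadoop's `FileSystem`, and the names `DropPartitionSketch`, `ensureLocationExists`, and `dropLocation` are hypothetical.

```scala
import java.nio.file.{Files, Path}

object DropPartitionSketch {
  /** Creates a placeholder directory when the partition location is missing.
    * Returns true if a placeholder had to be created. (Hypothetical helper.) */
  def ensureLocationExists(location: Path): Boolean = {
    if (Files.exists(location)) {
      false
    } else {
      Files.createDirectories(location)
      true
    }
  }

  /** Hypothetical stand-in for a drop that fails when the partition
    * directory is gone, mirroring the "File does not exist" error above. */
  def dropLocation(location: Path): Unit = {
    if (!Files.exists(location)) {
      throw new IllegalStateException(s"File does not exist: $location")
    }
    Files.delete(location) // succeeds because the directory is empty
  }

  def main(args: Array[String]): Unit = {
    val warehouse = Files.createTempDirectory("warehouse")
    val missing = warehouse.resolve("stat_day=20190516")
    // Without this call, dropLocation would throw for the missing directory.
    val created = ensureLocationExists(missing)
    dropLocation(missing)
    println(s"placeholder created: $created, still exists: ${Files.exists(missing)}")
  }
}
```

The placeholder exists only so the drop has something to remove. In the Hadoop API, the corresponding calls would be `FileSystem.exists` and `FileSystem.mkdirs`.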

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests.

AngersZhuuuu · 2019-11-13T07:24:57Z

Maybe you should add some server backend error message.

Deegue · 2019-11-14T02:42:14Z

Hi @AngersZhuuuu, I've checked the driver logs and the error there is almost the same.
So I'm adding some notes to make the code easier to understand.

Deegue · 2019-11-19T03:40:19Z

Gentle ping, @cloud-fan

github-actions · 2020-02-28T00:12:20Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

cloud-fan · 2020-03-02T06:33:27Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala

        // (b='1', c='1') and (b='1', c='2'), a partial spec of (b='1') will match both.
        val parts = client.getPartitions(hiveTable, s.asJava).asScala
+        // Check whether the partition we are going to drop is empty.
+        // We make a dummy one for the empty partition. See [SPARK-29786] for more details.


is this how hive resolve the problem?

Yes, it's the same method Hive uses.

Isn't it bad for performance? i.e. you call fs.exists and fs.listStatus for each partition.

Yes, but it only affects DROP PARTITION. I think the check is necessary, and it won't take much time while dropping.

Can you point to the Hive source code that does the same thing? i.e. create a dummy directory before dropping the partition.

In Hive 1.x, it's like this.

is it for DROP PARTITION?

No, it will check every query before executing. Maybe it's better to do the check before all queries?

Does Spark have a problem doing a table scan when the partition directory does not exist?

It's related to #24668, and controlled by spark.sql.files.ignoreMissingFiles.
Spark will check it when listing leaf files.
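For reference, a minimal sketch of how the setting named above can be enabled on a session. The configuration key comes from the reply. The object name, application name, and local master are assumptions for illustration, and the sketch needs Spark on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object IgnoreMissingFilesSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would use its own settings.
    val spark = SparkSession.builder()
      .appName("ignore-missing-files-sketch")
      .master("local[*]")
      .getOrCreate()

    // Ask Spark to skip files that are missing instead of failing.
    spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")
    println(spark.conf.get("spark.sql.files.ignoreMissingFiles"))

    spark.stop()
  }
}
```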

cloud-fan · 2020-03-17T08:26:02Z

ok to test

cloud-fan · 2020-03-17T08:26:24Z

@Deegue can we add a test case for it?

cloud-fan · 2020-03-17T08:52:58Z

retest this please

SparkQA · 2020-03-17T10:47:03Z

Test build #119927 has finished for PR 26422 at commit 74d5984.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

Deegue · 2020-03-20T02:16:59Z

retest this please

cloud-fan · 2020-03-20T03:39:21Z

ok to test

cloud-fan · 2020-03-20T03:39:29Z

retest this please

SparkQA · 2020-03-20T05:37:23Z

Test build #120076 has finished for PR 26422 at commit 248b6cf.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

Deegue · 2020-03-20T06:54:18Z

retest this please

SparkQA · 2020-03-20T07:05:02Z

Test build #120082 has finished for PR 26422 at commit c20c390.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

Deegue · 2020-03-20T07:20:12Z

retest this please

Deegue · 2020-03-20T07:37:54Z

retest this please

SparkQA · 2020-03-20T10:31:26Z

Test build #120087 has finished for PR 26422 at commit 9d75854.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-20T13:26:27Z

Test build #120093 has finished for PR 26422 at commit 5a5ff74.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-20T19:59:06Z

Test build #120110 has finished for PR 26422 at commit f4b3793.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Deegue · 2020-03-21T05:36:47Z

@Deegue can we add a test case for it?

Added one and all tests passed.

cloud-fan · 2020-03-26T12:02:38Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala

+        // We make a dummy one for the empty partition. See [SPARK-29786] for more details.
+        parts.foreach { partition =>
+          val partPath = partition.getPath.head
+          if (isEmptyPath(partPath)) {


I think we need to check non-existing path, not empty path?

Yes, you're right. We only need to check the existence of path instead of those under the path.
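The distinction raised here can be sketched as follows. This is an illustration, not the PR's implementation: it uses `java.nio.file` on the local filesystem instead of Hadoop's `FileSystem`, and `PathChecksSketch`, `isMissing`, and `isEmptyDir` are hypothetical names.

```scala
import java.nio.file.{Files, Path}

object PathChecksSketch {
  /** True when nothing exists at the given location. */
  def isMissing(location: Path): Boolean = !Files.exists(location)

  /** True when the location is an existing directory with no entries. */
  def isEmptyDir(location: Path): Boolean =
    Files.isDirectory(location) && {
      val entries = Files.list(location)
      try !entries.iterator().hasNext() finally entries.close()
    }

  def main(args: Array[String]): Unit = {
    val root = Files.createTempDirectory("warehouse")
    val emptyPartition = Files.createDirectory(root.resolve("stat_day=20190515"))
    val missingPartition = root.resolve("stat_day=20190516")

    // An empty directory still exists, so it needs no placeholder.
    println(s"empty:   isMissing=${isMissing(emptyPartition)}, isEmptyDir=${isEmptyDir(emptyPartition)}")
    // Only the missing location is the case this PR targets.
    println(s"missing: isMissing=${isMissing(missingPartition)}, isEmptyDir=${isEmptyDir(missingPartition)}")
  }
}
```

Checking existence alone also avoids listing the directory contents, which relates to the performance concern raised earlier in the review.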

SparkQA · 2020-03-27T09:44:22Z

Test build #120460 has finished for PR 26422 at commit 4d1c35e.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-27T10:14:38Z

Test build #120469 has finished for PR 26422 at commit 7cf7c6b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya

Based on the description, so, currently when dropping a partition which only exists in metastore but not on HDFS, the partition will be dropped from metastore, but then an exception will be thrown. Is it correct?

viirya · 2020-05-23T05:58:28Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala

+          if (isExistPath(partPath)) {
+            val fs = partPath.getFileSystem(conf)
+            fs.mkdirs(partPath)
+            fs.deleteOnExit(partPath)
+          }


I'm confused. When the partition exists (isExistPath returns true), why you need to mkdir it again?

Sorry, I mistakenly deleted the `!` when adjusting the code.

Deegue · 2020-05-29T08:39:49Z

Based on the description, so, currently when dropping a partition which only exists in metastore but not on HDFS, the partition will be dropped from metastore, but then an exception will be thrown. Is it correct?

Yes, but a package built from the latest master branch no longer throws the exception, so I'm closing this PR.

AmplabJenkins · 2020-05-29T08:40:20Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/27900/
Test FAILed.

Deegue added 4 commits October 22, 2019 09:53

Merge pull request #1 from apache/master

90233e4

Merge

Merge pull request #2 from apache/master

7a246d5

merge

Merge pull request #3 from apache/master

783211f

Merge

Fix drop partition MetaException:File does not exist.

355496e

dongjoon-hyun added the SQL label Nov 9, 2019

Add notes.

74d5984

github-actions bot added the Stale label Feb 28, 2020

github-actions bot closed this Feb 29, 2020

cloud-fan reviewed Mar 2, 2020

cloud-fan reopened this Mar 17, 2020

cloud-fan removed the Stale label Mar 17, 2020

Deegue added 2 commits March 18, 2020 19:36

1584531350

84b1da8

UT failed #119927, getDataLocation->getPartitionPath

248b6cf

UT failed #120076, add HIDDEN_FILES_PATH_FILTER

c20c390

optimize

9d75854

UT failed #120087. Hive-0.13 doesn't have `Partition.getPartitionPath()`, using Partition.getPath() instead.

5a5ff74

UT failed #120093, purge and ignoreIfNotExists should be false.

f4b3793

cloud-fan reviewed Mar 26, 2020

Deegue added 2 commits March 27, 2020 15:07

isEmptyPath->isExistPath

4d1c35e

remove redundant code

7cf7c6b

viirya reviewed May 23, 2020

restore '!'

f8b96de

Deegue closed this May 29, 2020
