Skip to content

Conversation

@Deegue
Copy link
Contributor

@Deegue Deegue commented Nov 7, 2019

What changes were proposed in this pull request?

When we drop a partition which exist on Hive meta and doesn't exist on HDFS, it should be dropped successfully instead of throwing MetaException.

Hive also deals with this case by this method.

Example:
Before this patch:

spark-sql > alter table test.tmp drop partition(stat_day=20190516);
Error: Error running query: MetaException(message:File does not exist: /user/hive/warehouse/test.db/tmp/stat_day=20190516
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getContentSummary(FSDirectory.java:2414)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getContentSummary(FSNamesystem.java:4719)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getContentSummary(NameNodeRpcServer.java:1237)
	at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getContentSummary(AuthorizationProviderProxyClientProtocol.java:568)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getContentSummary(ClientNamenodeProtocolServerSideTranslatorPB.java:896)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2278)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2274)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2274)
) (state=,code=0)

After this patch:

spark-sql > alter table test.tmp drop partition(stat_day=20190516);
+---------+--+
| Result  |
+---------+--+
+---------+--+
No rows selected (0.521 seconds)

Why are the changes needed?

When we drop a partition which exist on Hive meta and doesn't exist on HDFS, we will receive MetaException. But actually, this partition has been dropped. It's quite confusing and in this case, no Exception should be thrown.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests.

@AngersZhuuuu
Copy link
Contributor

Maybe you should add some server backend error message.

@Deegue
Copy link
Contributor Author

Deegue commented Nov 14, 2019

Maybe you should add some server backend error message.

Hi @AngersZhuuuu , I've checked for logs of the driver, it's almost the same.
So I'm adding some notes for better understanding the code.

@Deegue
Copy link
Contributor Author

Deegue commented Nov 19, 2019

Gentle ping, @cloud-fan

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 28, 2020
@github-actions github-actions bot closed this Feb 29, 2020
// (b='1', c='1') and (b='1', c='2'), a partial spec of (b='1') will match both.
val parts = client.getPartitions(hiveTable, s.asJava).asScala
// Check whether the partition we are going to drop is empty.
// We make a dummy one for the empty partition. See [SPARK-29786] for more details.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this how hive resolve the problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this how hive resolve the problem?

Yes, It's the same method as Hive uses.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it bad for performance? i.e. you call fs.exists and fs.listStatus for each partition.

Copy link
Contributor Author

@Deegue Deegue Mar 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it bad for performance? i.e. you call fs.exists and fs.listStatus for each partition.

Yes, but only affect drop partitions. I think it's necessary and won't take much time to do the check while dropping.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you point to the Hive source code that does the same thing? i.e. create a dummy directory before dropping the partition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you point to the Hive source code that does the same thing? i.e. create a dummy directory before dropping the partition.

In Hive 1.x, it's like this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it for DROP PARTITION?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it for DROP PARTITION?

No, it will check every query before executing. Maybe it's better to do the check before all queries?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Spark have a problem to do table scan when partition directory not exist?

Copy link
Contributor Author

@Deegue Deegue Mar 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Spark have a problem to do table scan when partition directory not exist?

It's related to #24668, and controlled by spark.sql.files.ignoreMissingFiles.
Spark will check it when listing leaf files.

@cloud-fan cloud-fan reopened this Mar 17, 2020
@cloud-fan cloud-fan removed the Stale label Mar 17, 2020
@cloud-fan
Copy link
Contributor

ok to test

@cloud-fan
Copy link
Contributor

@Deegue can we add a test case for it?

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Mar 17, 2020

Test build #119927 has finished for PR 26422 at commit 74d5984.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Deegue
Copy link
Contributor Author

Deegue commented Mar 20, 2020

retest this please

@cloud-fan
Copy link
Contributor

ok to test

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Mar 20, 2020

Test build #120076 has finished for PR 26422 at commit 248b6cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Deegue
Copy link
Contributor Author

Deegue commented Mar 20, 2020

retest this please

@SparkQA
Copy link

SparkQA commented Mar 20, 2020

Test build #120082 has finished for PR 26422 at commit c20c390.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Deegue
Copy link
Contributor Author

Deegue commented Mar 20, 2020

retest this please

1 similar comment
@Deegue
Copy link
Contributor Author

Deegue commented Mar 20, 2020

retest this please

@SparkQA
Copy link

SparkQA commented Mar 20, 2020

Test build #120087 has finished for PR 26422 at commit 9d75854.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 20, 2020

Test build #120093 has finished for PR 26422 at commit 5a5ff74.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 20, 2020

Test build #120110 has finished for PR 26422 at commit f4b3793.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Deegue
Copy link
Contributor Author

Deegue commented Mar 21, 2020

@Deegue can we add a test case for it?

Added one and all tests passed.

// We make a dummy one for the empty partition. See [SPARK-29786] for more details.
parts.foreach { partition =>
val partPath = partition.getPath.head
if (isEmptyPath(partPath)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to check non-existing path, not empty path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to check non-existing path, not empty path?

Yes, you're right. We only need to check the existence of path instead of those under the path.

@SparkQA
Copy link

SparkQA commented Mar 27, 2020

Test build #120460 has finished for PR 26422 at commit 4d1c35e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 27, 2020

Test build #120469 has finished for PR 26422 at commit 7cf7c6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the description, so, currently when dropping a partition which only exists in metastore but not on HDFS, the partition will be dropped from metastore, but then an exception will be thrown. Is it correct?

Comment on lines 651 to 655
if (isExistPath(partPath)) {
val fs = partPath.getFileSystem(conf)
fs.mkdirs(partPath)
fs.deleteOnExit(partPath)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused. When the partition exists (isExistPath returns true), why you need to mkdir it again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused. When the partition exists (isExistPath returns true), why you need to mkdir it again?

Sorry, I mistakenly delete ! when adjusting the code ..

@Deegue
Copy link
Contributor Author

Deegue commented May 29, 2020

Based on the description, so, currently when dropping a partition which only exists in metastore but not on HDFS, the partition will be dropped from metastore, but then an exception will be thrown. Is it correct?

Yes, but it doesn't throw exception under the package based on latest master branch. So close this PR.

@Deegue Deegue closed this May 29, 2020
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/27900/
Test FAILed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants