Skip to content

Conversation

@zhengchenyu
Copy link
Contributor

@zhengchenyu zhengchenyu commented Oct 24, 2025

What changes were proposed in this pull request?

The key modifications are as follows:

  • Force to use staging directory in SQLHadoopMapReduceCommitProtocol except PathOutputCommitProtocol and perform a rename from the source directory to the destination directory during the commit job phase. Dynamic partition overwrite and custom partition path have also been integrated into this process. And Handle paths according to SaveMode.

  • Avoid deleting partitions before task runs and implement dynamic overwrite in SQLHadoopMapReduceCommitProtocol. To maintain compatibility with static mode, the corresponding partition files need to be deleted after the job is done.

Note: For ease of review, some code in HadoopMapReduceCommitProtocol has been retained. In fact, I think the parameter dynamicPartitionOverwrite and the code for renaming partition directories during the commit job phase are no longer meaningful and should be removed.

Why are the changes needed?

SparkSQL uses the partition location or table location as the commit path (except in dynamic partition overwrite mode and custom partition path mode). This has at least the following issues:

  • As described in SPARK-37210, conflicts can occur when multiple partitions job of the same table are run concurrently. Using a staging directory can avoid this issue.
  • As described in SPARK-53937, using a staging directory allows for near-atomic operations.

Dynamic partition overwrite mode and custom partition path mode already use the staging directory. And dynamic partition overwrite mode and custom partition path are implemented differently, which can be simplified into a unified process. And in #29000, reset the staging directory as the output directory of FileOutputCommitter. This way is more safer. It should be modified to this way.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests and newly added unit tests

@zhengchenyu zhengchenyu marked this pull request as draft October 24, 2025 07:18
@zhengchenyu zhengchenyu deleted the SPARK-54003 branch November 12, 2025 11:01
@zhengchenyu zhengchenyu reopened this Nov 12, 2025
@zhengchenyu zhengchenyu marked this pull request as ready for review November 13, 2025 01:33
@zhengchenyu
Copy link
Contributor Author

@cloud-fan Can you please review this PR? I think that using the staging directory directly is a safer approach. Can you give me some advice?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant