Skip to content

Conversation

@bozhang2820
Copy link
Contributor

@bozhang2820 bozhang2820 commented Apr 30, 2024

What changes were proposed in this pull request?

This is a follow-up for #45930, where we introduced ShuffleCleanupMode and implemented cleaning up of shuffle dependencies.

There was a bug where ShuffleManager.unregisterShuffle was used on Driver, and in non-local mode it is not effective at all. This change fixed the bug by changing to use ShuffleDriverComponents.removeShuffle instead.

Why are the changes needed?

This is to address the comments in #45930 (comment)

Does this PR introduce any user-facing change?

No

How was this patch tested?

Updated unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

@dongjoon-hyun
Copy link
Member

Could you re-trigger CI, @bozhang2820 ?

@dongjoon-hyun dongjoon-hyun marked this pull request as draft May 7, 2024 16:11
@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented May 7, 2024

To prevent accidental merging, I converted this to Draft PR because the CI is broken and the PR title has [WIP].

@bozhang2820 bozhang2820 marked this pull request as ready for review May 27, 2024 05:03
@bozhang2820
Copy link
Contributor Author

Could you re-trigger CI, @bozhang2820 ?

Sorry for the late reply @dongjoon-hyun.. Updated the test and removed the WIP tag.

It seems that the current test failure is unrelated to this change?

@bozhang2820 bozhang2820 changed the title [SPARK-47764][FOLLOW-UP][WIP] Change to use ShuffleDriverComponents.removeShuffle to remove shuffle properly [SPARK-47764][FOLLOW-UP] Change to use ShuffleDriverComponents.removeShuffle to remove shuffle properly May 27, 2024
@abellina
Copy link
Contributor

@bozhang2820 this PR is needed for the feature to work in my tests, I am just wondering what the status of this is.

Copy link
Contributor

@mridulm mridulm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable - can you add a test for this ?
We can use local vs local-cluster to test the expected behavior

if (sc.isLocal) {
SparkEnv.get.shuffleManager.unregisterShuffle(shuffleId)
} else {
sc.shuffleDriverComponents.removeShuffle(shuffleId, false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this work for local mode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes ShuffleDriverComponents.removeShuffle should work for both local an non-local mode. Will update the code here.

@bozhang2820
Copy link
Contributor Author

bozhang2820 commented Jul 23, 2024

Looks reasonable - can you add a test for this ? We can use local vs local-cluster to test the expected behavior

I tried adding a unit test for local-cluster mode but found it a bit difficult:

  1. The cleanup is done in a best-effort basis, and in local-cluster mode the shuffle blocks are less likely to be cleaned up in time.
  2. There seems to be no API to get the MigratableResolver on executors from the driver side.

What do you think?

SparkEnv.get.shuffleManager.unregisterShuffle(shuffleId)
// Do not unregister the shuffle on MapOutputTracker here to trigger stage retry.
// Otherwise, downstream tasks will fail with MetadataFetchFailedException.
sc.shuffleDriverComponents.removeShuffle(shuffleId, Utils.isTesting)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cleanup is not critical, I think the second parameter blocking can always be false

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unit tests would be flaky if we use false here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I see, can we add some comments to explain it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Done.

@dongjoon-hyun
Copy link
Member

Thank you, @bozhang2820, @cloud-fan , @mridulm .
Merged to master for Apache Spark 4.0.0-preview2.

@dongjoon-hyun
Copy link
Member

For the test cases, please file a JIRA issue not to forget because it seems to need more time, @bozhang2820 .

ilicmarkodb pushed a commit to ilicmarkodb/spark that referenced this pull request Jul 29, 2024
…Shuffle to remove shuffle properly

### What changes were proposed in this pull request?
This is a follow-up for apache#45930, where we introduced ShuffleCleanupMode and implemented cleaning up of shuffle dependencies.

There was a bug where `ShuffleManager.unregisterShuffle` was used on Driver, and in non-local mode it is not effective at all. This change fixed the bug by changing to use `ShuffleDriverComponents.removeShuffle` instead.

### Why are the changes needed?
This is to address the comments in apache#45930 (comment)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Updated unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#46302 from bozhang2820/spark-47764-1.

Authored-by: Bo Zhang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@bozhang2820 bozhang2820 deleted the spark-47764-1 branch January 8, 2025 01:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants