-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-47764][FOLLOW-UP] Change to use ShuffleDriverComponents.removeShuffle to remove shuffle properly #46302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Could you re-trigger CI, @bozhang2820 ? |
|
To prevent accidental merging, I converted this to |
Sorry for the late reply @dongjoon-hyun.. Updated the test and removed the WIP tag. It seems that the current test failure is unrelated to this change? |
|
@bozhang2820 this PR is needed for the feature to work in my tests, I am just wondering what the status of this is. |
mridulm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable - can you add a test for this ?
We can use local vs local-cluster to test the expected behavior
| if (sc.isLocal) { | ||
| SparkEnv.get.shuffleManager.unregisterShuffle(shuffleId) | ||
| } else { | ||
| sc.shuffleDriverComponents.removeShuffle(shuffleId, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this work for local mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes ShuffleDriverComponents.removeShuffle should work for both local an non-local mode. Will update the code here.
I tried adding a unit test for local-cluster mode but found it a bit difficult:
What do you think? |
| SparkEnv.get.shuffleManager.unregisterShuffle(shuffleId) | ||
| // Do not unregister the shuffle on MapOutputTracker here to trigger stage retry. | ||
| // Otherwise, downstream tasks will fail with MetadataFetchFailedException. | ||
| sc.shuffleDriverComponents.removeShuffle(shuffleId, Utils.isTesting) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cleanup is not critical, I think the second parameter blocking can always be false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unit tests would be flaky if we use false here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I see, can we add some comments to explain it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Done.
|
Thank you, @bozhang2820, @cloud-fan , @mridulm . |
|
For the test cases, please file a JIRA issue not to forget because it seems to need more time, @bozhang2820 . |
…Shuffle to remove shuffle properly ### What changes were proposed in this pull request? This is a follow-up for apache#45930, where we introduced ShuffleCleanupMode and implemented cleaning up of shuffle dependencies. There was a bug where `ShuffleManager.unregisterShuffle` was used on Driver, and in non-local mode it is not effective at all. This change fixed the bug by changing to use `ShuffleDriverComponents.removeShuffle` instead. ### Why are the changes needed? This is to address the comments in apache#45930 (comment) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Updated unit tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#46302 from bozhang2820/spark-47764-1. Authored-by: Bo Zhang <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
This is a follow-up for #45930, where we introduced ShuffleCleanupMode and implemented cleaning up of shuffle dependencies.
There was a bug where
ShuffleManager.unregisterShufflewas used on Driver, and in non-local mode it is not effective at all. This change fixed the bug by changing to useShuffleDriverComponents.removeShuffleinstead.Why are the changes needed?
This is to address the comments in #45930 (comment)
Does this PR introduce any user-facing change?
No
How was this patch tested?
Updated unit tests.
Was this patch authored or co-authored using generative AI tooling?
No