Skip to content

Conversation

Ngone51
Copy link
Member

@Ngone51 Ngone51 commented Jun 27, 2024

What changes were proposed in this pull request?

This PR uses SortShuffleManager#taskIdMapsForShuffle to track the migrated shuffle files on the destination executor.

Why are the changes needed?

This is a long-standing bug in decommission where the migrated shuffle files can't be cleaned up from the executor. Normally, the shuffle files are tracked by taskIdMapsForShuffle during the map task execution. Upon receiving the RemoveShuffle(shuffleId) request from driver, executor can clean up those shuffle files by searching taskIdMapsForShuffle. However, for the migrated shuffle files by decommission, they lose the track in the destination executor's taskIdMapsForShuffle and can't be deleted as a result.

Note this bug only affects shuffle removal on the executor. For shuffle removal on the external shuffle service (when spark.shuffle.service.removeShuffle enabled and the executor stores the shuffle files has gone), we don't rely on taskIdMapsForShuffle but using the specific shuffle block id to locate the shuffle file directly. So it won't be an issue there.

Does this PR introduce any user-facing change?

No. (Common users won't see the difference underlying.)

How was this patch tested?

Add unit test.

Was this patch authored or co-authored using generative AI tooling?

No.

Ngone51 added 2 commits June 27, 2024 19:31
…e to cleanup from executor

### What changes were proposed in this pull request?

This PR uses `SortShuffleManager#taskIdMapsForShuffle` to track the migrated shuffle files on the destination executor.

### Why are the changes needed?

This is a long-standing bug in decommission where the migrated shuffle files can't be cleaned up from the executor. Normally, the shuffle files are tracked by `taskIdMapsForShuffle` during the map task execution. Upon receiving the `RemoveShuffle(shuffleId)` request from driver, executor can clean up those shuffle files by searching `taskIdMapsForShuffle`. However, for the migrated shuffle files by decommission, they lose the track in the destination executor's  `taskIdMapsForShuffle` and can't be deleted as a result.

Note this bug only affects shuffle removal on the executor. For shuffle removal on the external shuffle service (when `spark.shuffle.service.removeShuffle` enabled and the executor stores the shuffle files has gone), we don't rely on `taskIdMapsForShuffle` but using the specific shuffle block id to locate the shuffle file directly. So it won't be an issue there.

### Does this PR introduce _any_ user-facing change?

No. (Common users won't see the difference underlying.)

### How was this patch tested?

Add unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47037 from Ngone51/SPARK-46957.

Authored-by: Yi Wu <[email protected]>
Signed-off-by: Yi Wu <[email protected]>
@github-actions github-actions bot added the CORE label Jun 27, 2024
@Ngone51
Copy link
Member Author

Ngone51 commented Jun 27, 2024

This PR backports #47037 to branch-3.5/3.4. It was reverted due to the compliation error against Java 8. It's re-cherry-picked now with the fix #47121.

@Ngone51
Copy link
Member Author

Ngone51 commented Jun 27, 2024

@yaooqinn @LuciferYang Could you take a look? Thanks!

Copy link
Contributor

@LuciferYang LuciferYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

yaooqinn pushed a commit that referenced this pull request Jun 27, 2024
…uld be able to cleanup from executor

### What changes were proposed in this pull request?

This PR uses `SortShuffleManager#taskIdMapsForShuffle` to track the migrated shuffle files on the destination executor.

### Why are the changes needed?

This is a long-standing bug in decommission where the migrated shuffle files can't be cleaned up from the executor. Normally, the shuffle files are tracked by `taskIdMapsForShuffle` during the map task execution. Upon receiving the `RemoveShuffle(shuffleId)` request from driver, executor can clean up those shuffle files by searching `taskIdMapsForShuffle`. However, for the migrated shuffle files by decommission, they lose the track in the destination executor's  `taskIdMapsForShuffle` and can't be deleted as a result.

Note this bug only affects shuffle removal on the executor. For shuffle removal on the external shuffle service (when `spark.shuffle.service.removeShuffle` enabled and the executor stores the shuffle files has gone), we don't rely on `taskIdMapsForShuffle` but using the specific shuffle block id to locate the shuffle file directly. So it won't be an issue there.

### Does this PR introduce _any_ user-facing change?

No. (Common users won't see the difference underlying.)

### How was this patch tested?

Add unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47122 from Ngone51/SPARK-46957-3.5.

Authored-by: Yi Wu <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
yaooqinn pushed a commit that referenced this pull request Jun 27, 2024
…uld be able to cleanup from executor

### What changes were proposed in this pull request?

This PR uses `SortShuffleManager#taskIdMapsForShuffle` to track the migrated shuffle files on the destination executor.

### Why are the changes needed?

This is a long-standing bug in decommission where the migrated shuffle files can't be cleaned up from the executor. Normally, the shuffle files are tracked by `taskIdMapsForShuffle` during the map task execution. Upon receiving the `RemoveShuffle(shuffleId)` request from driver, executor can clean up those shuffle files by searching `taskIdMapsForShuffle`. However, for the migrated shuffle files by decommission, they lose the track in the destination executor's  `taskIdMapsForShuffle` and can't be deleted as a result.

Note this bug only affects shuffle removal on the executor. For shuffle removal on the external shuffle service (when `spark.shuffle.service.removeShuffle` enabled and the executor stores the shuffle files has gone), we don't rely on `taskIdMapsForShuffle` but using the specific shuffle block id to locate the shuffle file directly. So it won't be an issue there.

### Does this PR introduce _any_ user-facing change?

No. (Common users won't see the difference underlying.)

### How was this patch tested?

Add unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47122 from Ngone51/SPARK-46957-3.5.

Authored-by: Yi Wu <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
(cherry picked from commit b28ddb1)
Signed-off-by: Kent Yao <[email protected]>
@yaooqinn
Copy link
Member

Merged to 3.5/3.4, thank you @Ngone51 @LuciferYang

@yaooqinn yaooqinn closed this Jun 27, 2024
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Aug 7, 2024
…uld be able to cleanup from executor

### What changes were proposed in this pull request?

This PR uses `SortShuffleManager#taskIdMapsForShuffle` to track the migrated shuffle files on the destination executor.

### Why are the changes needed?

This is a long-standing bug in decommission where the migrated shuffle files can't be cleaned up from the executor. Normally, the shuffle files are tracked by `taskIdMapsForShuffle` during the map task execution. Upon receiving the `RemoveShuffle(shuffleId)` request from driver, executor can clean up those shuffle files by searching `taskIdMapsForShuffle`. However, for the migrated shuffle files by decommission, they lose the track in the destination executor's  `taskIdMapsForShuffle` and can't be deleted as a result.

Note this bug only affects shuffle removal on the executor. For shuffle removal on the external shuffle service (when `spark.shuffle.service.removeShuffle` enabled and the executor stores the shuffle files has gone), we don't rely on `taskIdMapsForShuffle` but using the specific shuffle block id to locate the shuffle file directly. So it won't be an issue there.

### Does this PR introduce _any_ user-facing change?

No. (Common users won't see the difference underlying.)

### How was this patch tested?

Add unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47122 from Ngone51/SPARK-46957-3.5.

Authored-by: Yi Wu <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
(cherry picked from commit b28ddb1)
Signed-off-by: Kent Yao <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants