[SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv #35069

Ngone51 · 2021-12-30T08:37:46Z

What changes were proposed in this pull request?

This PR proposes to let WorkerWatcher run System.exit in a separate thread instead of some thread of RpcEnv.

Why are the changes needed?

System.exit triggers the shutdown hook, which runs executor.stop, and this results in the same deadlock issue as SPARK-14180. Note, however, that since Spark recently upgraded to Hadoop 3, each hook now has a timeout threshold that forcibly interrupts the hook execution once the timeout is reached. So the deadlock issue doesn't really exist in the master branch. It is still critical for previous releases, though, and it is incorrect behavior that should be fixed.
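The deadlock described above can be illustrated with a minimal Java sketch. It is not the code from this PR: the class name `ExitThreadSketch`, the thread name `exit-thread`, and the single-thread executor standing in for an RpcEnv thread are all assumptions made for illustration, and `stopAndAwait` only imitates what a shutdown hook does when it waits for that thread to finish. The simulation never calls `System.exit`, so it can be run safely.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these names come from Spark.
final class ExitThreadSketch {

    // Stand-in for shutdown-hook work: stop the dispatcher and wait for its thread.
    static boolean stopAndAwait(ExecutorService dispatcher, long timeoutMs) {
        dispatcher.shutdown();
        try {
            return dispatcher.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Returns true if the simulated exit let the dispatcher terminate in time.
    static boolean simulate(boolean separateThread) {
        // Single thread standing in for a thread owned by the RPC layer.
        ExecutorService dispatcher = Executors.newSingleThreadExecutor();
        CompletableFuture<Boolean> result = new CompletableFuture<>();
        Runnable exit = () -> result.complete(stopAndAwait(dispatcher, 500));

        dispatcher.execute(() -> {
            if (separateThread) {
                // The handler returns at once, so the dispatcher thread can finish.
                new Thread(exit, "exit-thread").start();
            } else {
                // The dispatcher thread waits for its own termination and times out.
                exit.run();
            }
        });
        return result.join();
    }

    // Shape of the real call under the same assumption: exit from a dedicated thread.
    static void exitInSeparateThread(int code) {
        new Thread(() -> System.exit(code), "exit-thread").start();
    }
}
```

In this sketch `simulate(false)` only times out because the wait is bounded at 500 ms. With an unbounded wait, which is roughly the situation the description attributes to releases without the Hadoop 3 hook timeout, the thread would block forever.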

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Tested manually.

Ngone51 · 2022-01-04T02:01:29Z

cc @mridulm @cloud-fan @jiangxb1987

cloud-fan

good catch!

andrewli81 · 2022-01-04T19:08:35Z

Thanks for proactively fixing this!

jiangxb1987

LGTM!

…n a thread out of RpcEnv

### What changes were proposed in this pull request?

This PR proposes to let `WorkerWatcher` run `System.exit` in a separate thread instead of some thread of `RpcEnv`.

### Why are the changes needed?

`System.exit` will trigger the shutdown hook to run `executor.stop`, which will result in the same deadlock issue with SPARK-14180. But note that since Spark upgrades to Hadoop 3 recently, each hook now will have a [timeout threshold](https://github.com/apache/hadoop/blob/d4794dd3b2ba365a9d95ad6aafcf43a1ea40f777/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ShutdownHookManager.java#L205-L209) which forcibly interrupt the hook execution once reaches timeout. So, the deadlock issue doesn't really exist in the master branch. However, it's still critical for previous releases and is a wrong behavior that should be fixed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested manually.

Closes #35069 from Ngone51/fix-workerwatcher-exit.

Authored-by: yi.wu <[email protected]>
Signed-off-by: yi.wu <[email protected]>
(cherry picked from commit 639d6f4)
Signed-off-by: yi.wu <[email protected]>

Ngone51 · 2022-01-05T02:50:52Z

Thanks all! Merged to Master/branch-3.2/branch-3.1/branch-3.0.

mridulm · 2022-01-05T03:04:35Z

Late +1

…n a thread out of RpcEnv

### What changes were proposed in this pull request?

This PR proposes to let `WorkerWatcher` run `System.exit` in a separate thread instead of some thread of `RpcEnv`.

### Why are the changes needed?

`System.exit` will trigger the shutdown hook to run `executor.stop`, which will result in the same deadlock issue with SPARK-14180. But note that since Spark upgrades to Hadoop 3 recently, each hook now will have a [timeout threshold](https://github.com/apache/hadoop/blob/d4794dd3b2ba365a9d95ad6aafcf43a1ea40f777/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ShutdownHookManager.java#L205-L209) which forcibly interrupt the hook execution once reaches timeout. So, the deadlock issue doesn't really exist in the master branch. However, it's still critical for previous releases and is a wrong behavior that should be fixed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested manually.

Closes apache#35069 from Ngone51/fix-workerwatcher-exit.

Authored-by: yi.wu <[email protected]>
Signed-off-by: yi.wu <[email protected]>
(cherry picked from commit 639d6f4)
Signed-off-by: yi.wu <[email protected]>
(cherry picked from commit 537de84)
Signed-off-by: Dongjoon Hyun <[email protected]>

fix

9bcfcd6

github-actions bot added the CORE label Dec 30, 2021

cloud-fan approved these changes Jan 4, 2022

jiangxb1987 approved these changes Jan 4, 2022

Ngone51 closed this in 639d6f4 Jan 5, 2022
