Skip to content

Conversation

@Ngone51
Copy link
Member

@Ngone51 Ngone51 commented Dec 30, 2021

What changes were proposed in this pull request?

This PR proposes to let WorkerWatcher run System.exit in a separate thread instead of some thread of RpcEnv.

Why are the changes needed?

System.exit will trigger the shutdown hook to run executor.stop, which will result in the same deadlock issue with SPARK-14180. But note that since Spark upgrades to Hadoop 3 recently, each hook now will have a timeout threshold which forcibly interrupt the hook execution once reaches timeout. So, the deadlock issue doesn't really exist in the master branch. However, it's still critical for previous releases and is a wrong behavior that should be fixed.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Tested manually.

@github-actions github-actions bot added the CORE label Dec 30, 2021
@Ngone51
Copy link
Member Author

Ngone51 commented Jan 4, 2022

cc @mridulm @cloud-fan @jiangxb1987

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

@andrewli81
Copy link

Thanks for proactively fixing this!

Copy link
Contributor

@jiangxb1987 jiangxb1987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@Ngone51 Ngone51 closed this in 639d6f4 Jan 5, 2022
Ngone51 added a commit that referenced this pull request Jan 5, 2022
…n a thread out of RpcEnv

### What changes were proposed in this pull request?

This PR proposes to let `WorkerWatcher` run `System.exit` in a separate thread instead of some thread of `RpcEnv`.

### Why are the changes needed?

`System.exit` will trigger the shutdown hook to run `executor.stop`, which will result in the same deadlock issue with SPARK-14180. But note that since Spark upgrades to Hadoop 3  recently, each hook now will have a [timeout threshold](https://github.com/apache/hadoop/blob/d4794dd3b2ba365a9d95ad6aafcf43a1ea40f777/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ShutdownHookManager.java#L205-L209) which forcibly interrupt the hook execution once reaches timeout. So, the deadlock issue doesn't really exist in the master branch. However, it's still critical for previous releases and is a wrong behavior that should be fixed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested manually.

Closes #35069 from Ngone51/fix-workerwatcher-exit.

Authored-by: yi.wu <[email protected]>
Signed-off-by: yi.wu <[email protected]>
(cherry picked from commit 639d6f4)
Signed-off-by: yi.wu <[email protected]>
Ngone51 added a commit that referenced this pull request Jan 5, 2022
…n a thread out of RpcEnv

### What changes were proposed in this pull request?

This PR proposes to let `WorkerWatcher` run `System.exit` in a separate thread instead of some thread of `RpcEnv`.

### Why are the changes needed?

`System.exit` will trigger the shutdown hook to run `executor.stop`, which will result in the same deadlock issue with SPARK-14180. But note that since Spark upgrades to Hadoop 3  recently, each hook now will have a [timeout threshold](https://github.com/apache/hadoop/blob/d4794dd3b2ba365a9d95ad6aafcf43a1ea40f777/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ShutdownHookManager.java#L205-L209) which forcibly interrupt the hook execution once reaches timeout. So, the deadlock issue doesn't really exist in the master branch. However, it's still critical for previous releases and is a wrong behavior that should be fixed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested manually.

Closes #35069 from Ngone51/fix-workerwatcher-exit.

Authored-by: yi.wu <[email protected]>
Signed-off-by: yi.wu <[email protected]>
(cherry picked from commit 639d6f4)
Signed-off-by: yi.wu <[email protected]>
Ngone51 added a commit that referenced this pull request Jan 5, 2022
…n a thread out of RpcEnv

### What changes were proposed in this pull request?

This PR proposes to let `WorkerWatcher` run `System.exit` in a separate thread instead of some thread of `RpcEnv`.

### Why are the changes needed?

`System.exit` will trigger the shutdown hook to run `executor.stop`, which will result in the same deadlock issue with SPARK-14180. But note that since Spark upgrades to Hadoop 3  recently, each hook now will have a [timeout threshold](https://github.com/apache/hadoop/blob/d4794dd3b2ba365a9d95ad6aafcf43a1ea40f777/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ShutdownHookManager.java#L205-L209) which forcibly interrupt the hook execution once reaches timeout. So, the deadlock issue doesn't really exist in the master branch. However, it's still critical for previous releases and is a wrong behavior that should be fixed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested manually.

Closes #35069 from Ngone51/fix-workerwatcher-exit.

Authored-by: yi.wu <[email protected]>
Signed-off-by: yi.wu <[email protected]>
(cherry picked from commit 639d6f4)
Signed-off-by: yi.wu <[email protected]>
@Ngone51
Copy link
Member Author

Ngone51 commented Jan 5, 2022

Thanks all! Merged to Master/branch-3.2/branch-3.1/branch-3.0.

@mridulm
Copy link
Contributor

mridulm commented Jan 5, 2022

Late +1

fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
…n a thread out of RpcEnv

### What changes were proposed in this pull request?

This PR proposes to let `WorkerWatcher` run `System.exit` in a separate thread instead of some thread of `RpcEnv`.

### Why are the changes needed?

`System.exit` will trigger the shutdown hook to run `executor.stop`, which will result in the same deadlock issue with SPARK-14180. But note that since Spark upgrades to Hadoop 3  recently, each hook now will have a [timeout threshold](https://github.com/apache/hadoop/blob/d4794dd3b2ba365a9d95ad6aafcf43a1ea40f777/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ShutdownHookManager.java#L205-L209) which forcibly interrupt the hook execution once reaches timeout. So, the deadlock issue doesn't really exist in the master branch. However, it's still critical for previous releases and is a wrong behavior that should be fixed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested manually.

Closes apache#35069 from Ngone51/fix-workerwatcher-exit.

Authored-by: yi.wu <[email protected]>
Signed-off-by: yi.wu <[email protected]>
(cherry picked from commit 639d6f4)
Signed-off-by: yi.wu <[email protected]>
catalinii pushed a commit to lyft/spark that referenced this pull request Feb 22, 2022
…n a thread out of RpcEnv

### What changes were proposed in this pull request?

This PR proposes to let `WorkerWatcher` run `System.exit` in a separate thread instead of some thread of `RpcEnv`.

### Why are the changes needed?

`System.exit` will trigger the shutdown hook to run `executor.stop`, which will result in the same deadlock issue with SPARK-14180. But note that since Spark upgrades to Hadoop 3  recently, each hook now will have a [timeout threshold](https://github.com/apache/hadoop/blob/d4794dd3b2ba365a9d95ad6aafcf43a1ea40f777/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ShutdownHookManager.java#L205-L209) which forcibly interrupt the hook execution once reaches timeout. So, the deadlock issue doesn't really exist in the master branch. However, it's still critical for previous releases and is a wrong behavior that should be fixed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested manually.

Closes apache#35069 from Ngone51/fix-workerwatcher-exit.

Authored-by: yi.wu <[email protected]>
Signed-off-by: yi.wu <[email protected]>
(cherry picked from commit 639d6f4)
Signed-off-by: yi.wu <[email protected]>
catalinii pushed a commit to lyft/spark that referenced this pull request Mar 4, 2022
…n a thread out of RpcEnv

### What changes were proposed in this pull request?

This PR proposes to let `WorkerWatcher` run `System.exit` in a separate thread instead of some thread of `RpcEnv`.

### Why are the changes needed?

`System.exit` will trigger the shutdown hook to run `executor.stop`, which will result in the same deadlock issue with SPARK-14180. But note that since Spark upgrades to Hadoop 3  recently, each hook now will have a [timeout threshold](https://github.com/apache/hadoop/blob/d4794dd3b2ba365a9d95ad6aafcf43a1ea40f777/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ShutdownHookManager.java#L205-L209) which forcibly interrupt the hook execution once reaches timeout. So, the deadlock issue doesn't really exist in the master branch. However, it's still critical for previous releases and is a wrong behavior that should be fixed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested manually.

Closes apache#35069 from Ngone51/fix-workerwatcher-exit.

Authored-by: yi.wu <[email protected]>
Signed-off-by: yi.wu <[email protected]>
(cherry picked from commit 639d6f4)
Signed-off-by: yi.wu <[email protected]>
kazuyukitanimura pushed a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022
…n a thread out of RpcEnv

### What changes were proposed in this pull request?

This PR proposes to let `WorkerWatcher` run `System.exit` in a separate thread instead of some thread of `RpcEnv`.

### Why are the changes needed?

`System.exit` will trigger the shutdown hook to run `executor.stop`, which will result in the same deadlock issue with SPARK-14180. But note that since Spark upgrades to Hadoop 3  recently, each hook now will have a [timeout threshold](https://github.com/apache/hadoop/blob/d4794dd3b2ba365a9d95ad6aafcf43a1ea40f777/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ShutdownHookManager.java#L205-L209) which forcibly interrupt the hook execution once reaches timeout. So, the deadlock issue doesn't really exist in the master branch. However, it's still critical for previous releases and is a wrong behavior that should be fixed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested manually.

Closes apache#35069 from Ngone51/fix-workerwatcher-exit.

Authored-by: yi.wu <[email protected]>
Signed-off-by: yi.wu <[email protected]>
(cherry picked from commit 639d6f4)
Signed-off-by: yi.wu <[email protected]>
(cherry picked from commit 537de84)
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants