[SPARK-47565][PYTHON] PySpark worker pool crash resilience #45635
Conversation
utkarsh39 left a comment
Flushing an initial round of comments for my understanding
core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala
If the interestOps call succeeds, will both of these checks automatically be true?
It seems that this isn't always the case: the workerHandle may already see the process as dead while the selectionKey update happily passes. I also check isValid for the off chance that we got cancelled after interestOps was set.
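For illustration, a minimal, self-contained Scala sketch of the kind of liveness check being discussed; `IdleWorker` and its fields are stand-ins rather than Spark's real types, and only the shape of the check is meant to mirror the diff:

```scala
import java.nio.channels.{SelectionKey, SocketChannel}

object IdleWorkerSketch {
  // Stand-in for the worker entry kept in the idle pool (not Spark's real type).
  final case class IdleWorker(
      channel: SocketChannel,
      selectionKey: SelectionKey,
      processHandle: ProcessHandle)

  // interestOps() can succeed even though the process already died, so process
  // liveness and key validity are checked explicitly; isValid also covers the
  // off chance that the key was cancelled after interestOps was set.
  def isWorkerAlive(worker: IdleWorker): Boolean =
    worker.processHandle.isAlive && worker.selectionKey.isValid
}
```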
core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala
LGTM. Let's get a review from others?

@ueshin @HyukjinKwon can you take a look here?

Let's file a JIRA, see https://spark.apache.org/contributing.html

Apache Spark uses the GitHub Actions in your forked repository, so the builds are found at https://github.com/sebastianhillig-db/spark/actions. GitHub Actions would have to be enabled at https://github.com/sebastianhillig-db/spark/settings/actions, and this PR rebased.
HyukjinKwon left a comment
The fix itself seems pretty good.
It seems that there is a chance of introducing an infinite loop into Apache Spark. Maybe limit the number of retries? WDYT, @sebastianhillig-db?
On each iteration, a worker is pulled from idleWorkers, which will end up "emptying" the pool. The synchronization around this ensures that no other workers are added while this happens. (see https://github.com/apache/spark/pull/45635/files/ba3c6f6ee19762278004594735f25ab4f6fafb3e#diff-1bd846874b06327e6abd0803aa74eed890352dfa974d5c1da1a12dc7477e20d0L411-L413)
> On each iteration, a worker is pulled from idleWorkers, which will end up "emptying" the pool. The synchronization around this ensures that no other workers are added while this happens. (see https://github.com/apache/spark/pull/45635/files/ba3c6f6ee19762278004594735f25ab4f6fafb3e#diff-1bd846874b06327e6abd0803aa74eed890352dfa974d5c1da1a12dc7477e20d0L411-L413)
The link seems to be broken.
Ugh, sorry - the force push broke that link. I'm referring to "releaseWorker" using the same synchronization, so we should not be adding new workers while this code runs.
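To make the termination argument concrete, here is a hedged sketch of that structure. It reuses the `IdleWorker`/`isWorkerAlive` stand-ins from the earlier snippet and a hypothetical `idleWorkers` queue; only the locking and looping shape mirrors the discussion, since each iteration removes one entry, dead entries are never re-queued, and releaseWorker takes the same lock.

```scala
import scala.collection.mutable
import IdleWorkerSketch.{IdleWorker, isWorkerAlive}

object IdlePoolSketch {
  private val idleWorkers = mutable.Queue.empty[IdleWorker]

  // Runs at most idleWorkers.size iterations: every pass removes one entry,
  // dead workers are discarded, and releaseWorker (below) takes the same lock,
  // so nothing is enqueued while the loop is draining the queue.
  def borrowAliveWorker(): Option[IdleWorker] = idleWorkers.synchronized {
    var alive: Option[IdleWorker] = None
    while (alive.isEmpty && idleWorkers.nonEmpty) {
      val worker = idleWorkers.dequeue()
      if (isWorkerAlive(worker)) {
        alive = Some(worker)
      } else {
        // The process died while idling: clean up and keep looking.
        worker.selectionKey.cancel()
        worker.channel.close()
      }
    }
    alive // None => pool exhausted; the caller falls back to spawning a fresh worker.
  }

  def releaseWorker(worker: IdleWorker): Unit = idleWorkers.synchronized {
    idleWorkers.enqueue(worker)
  }
}
```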
Force-pushed from ba3c6f6 to 0f59a6a
core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala
Merged to master.
What changes were proposed in this pull request?
PySpark worker processes may die while they are idling. Here we aim to provide some resilience by validating process and SelectionKey liveness before returning a worker from the idle pool.
Why are the changes needed?
To avoid failing queries when a Python worker process crashed while idling.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added an appropriate test case.
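For context, a hedged sketch of the kind of regression test this describes; it is not the suite or test name from the PR, it reuses the stand-in types from the sketches above, and it spawns a throwaway Unix `sleep` process in place of a real Python worker:

```scala
import java.net.InetSocketAddress
import java.nio.channels.{SelectionKey, Selector, ServerSocketChannel, SocketChannel}
import org.scalatest.funsuite.AnyFunSuite
import IdleWorkerSketch.{IdleWorker, isWorkerAlive}

class IdleWorkerLivenessSketchSuite extends AnyFunSuite {
  test("a worker whose process died while idle is not considered alive") {
    // Throwaway process standing in for a Python worker.
    val process = new ProcessBuilder("sleep", "60").start()

    // Loopback channel so the worker has a real, registered SelectionKey.
    val server = ServerSocketChannel.open().bind(new InetSocketAddress("127.0.0.1", 0))
    val channel = SocketChannel.open(server.getLocalAddress)
    channel.configureBlocking(false)
    val selector = Selector.open()
    val key = channel.register(selector, SelectionKey.OP_WRITE)

    val worker = IdleWorker(channel, key, process.toHandle)
    assert(isWorkerAlive(worker))

    process.destroyForcibly().waitFor() // simulate the crash while idling
    assert(!isWorkerAlive(worker))      // the pool must not hand this worker out

    channel.close(); server.close(); selector.close()
  }
}
```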
Was this patch authored or co-authored using generative AI tooling?
No