
Conversation

@sebastianhillig-db sebastianhillig-db commented Mar 21, 2024

What changes were proposed in this pull request?

PySpark worker processes may die while they are idling. Here we aim to provide some resilience by validating that the worker process is alive and its SelectionKey is valid before returning the worker from the idle pool.
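
A rough sketch of the idea in Scala. The names below (`WorkerPool`, `getWorker`, `WorkerHandle`, etc.) are illustrative stand-ins, not the exact `PythonWorkerFactory` internals:

```scala
import java.nio.channels.{SelectionKey, SocketChannel}
import scala.collection.mutable

// Hypothetical stand-ins for the worker bookkeeping; the real classes and
// method names in PythonWorkerFactory differ in detail.
case class PythonWorker(channel: SocketChannel, selectionKey: SelectionKey)
case class WorkerHandle(process: Process)

class WorkerPool(createFreshWorker: () => (PythonWorker, WorkerHandle)) {
  private val idleWorkers = new mutable.Queue[(PythonWorker, WorkerHandle)]()

  /** Reuse an idle worker only if its process and SelectionKey are still alive. */
  def getWorker(): (PythonWorker, WorkerHandle) = synchronized {
    while (idleWorkers.nonEmpty) {
      val (worker, handle) = idleWorkers.dequeue()
      if (handle.process.isAlive && worker.selectionKey.isValid) {
        return (worker, handle)
      }
      // The worker crashed while idling: discard it and try the next one.
    }
    createFreshWorker() // no healthy idle worker left; spawn a new one
  }
}
```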

Why are the changes needed?

To avoid failing queries when a Python worker process has crashed while idling.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added an appropriate test case.

Was this patch authored or co-authored using generative AI tooling?

No

@sebastianhillig-db sebastianhillig-db changed the title [WIP] First stab at dealing with worker crashes [WIP] First stab at dealing with worker crashes while idling Mar 21, 2024

@utkarsh39 utkarsh39 left a comment

Flushing an initial round of comments for my understanding

Contributor

If the interestOps call succeeds, will both of these checks automatically be true?

Contributor Author

It seems that this isn't always the case: the workerHandle may already see the process as dead while the selectionKey update happily passes. I also check isValid for the off chance that the key got cancelled after interestOps was set.
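
For illustration only, here is a hypothetical helper capturing that ordering: interestOps can succeed on a key whose worker process is already gone, and the key can still be cancelled right afterwards, so both conditions are checked explicitly:

```scala
import java.nio.channels.SelectionKey

object WorkerLiveness {
  // interestOps() may "happily pass" even though the worker process has
  // already exited, and the key can be cancelled right after interestOps()
  // returns, so aliveness and validity are verified afterwards rather than
  // inferred from interestOps() not throwing.
  def reactivateAndCheck(key: SelectionKey, process: Process): Boolean = {
    key.interestOps(SelectionKey.OP_READ)
    process.isAlive && key.isValid
  }
}
```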

@utkarsh39

LGTM. Let's get a review from others?

@sebastianhillig-db sebastianhillig-db changed the title [WIP] First stab at dealing with worker crashes while idling PySpark worker pool crash resilience Mar 25, 2024
@sebastianhillig-db

@ueshin @HyukjinKwon can you take a look here?

@HyukjinKwon

Let's file a JIRA, see https://spark.apache.org/contributing.html

@HyukjinKwon

Apache Spark uses the GitHub Actions in your forked repository, so the builds have to be found at https://github.com/sebastianhillig-db/spark/actions . GitHub Actions would have to be enabled at https://github.com/sebastianhillig-db/spark/settings/actions , and then please rebase this PR.


@HyukjinKwon HyukjinKwon left a comment

The fix itself seems pretty good.

@sebastianhillig-db sebastianhillig-db changed the title PySpark worker pool crash resilience [SPARK-47565] PySpark worker pool crash resilience Mar 26, 2024
Member

It seems that there is a chance of introducing an infinite loop into Apache Spark. Maybe limit the number of retries? WDYT, @sebastianhillig-db?

Contributor Author

On each iteration, a worker is pulled from idleWorkers, which will end up "emptying" the pool. The synchronization around this ensures that no other workers are added while this happens. (see https://github.com/apache/spark/pull/45635/files/ba3c6f6ee19762278004594735f25ab4f6fafb3e#diff-1bd846874b06327e6abd0803aa74eed890352dfa974d5c1da1a12dc7477e20d0L411-L413)

Member

> On each iteration, a worker is pulled from idleWorkers, which will end up "emptying" the pool. The synchronization around this ensures that no other workers are added while this happens. (see https://github.com/apache/spark/pull/45635/files/ba3c6f6ee19762278004594735f25ab4f6fafb3e#diff-1bd846874b06327e6abd0803aa74eed890352dfa974d5c1da1a12dc7477e20d0L411-L413)

The link seems to be broken.

Contributor Author

Ugh, sorry - the force push broke that link. I'm referring to "releaseWorker" using the same synchronization, so we should not be adding new workers while this code runs.
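
A minimal, generic sketch of that locking argument (illustrative names, not the actual Spark code): because releaseWorker and the draining loop synchronize on the same monitor, the idle queue cannot grow while it is being drained, and each iteration removes one element, so the loop terminates:

```scala
import scala.collection.mutable

class IdlePool[W](isUsable: W => Boolean, createFresh: () => W) {
  private val idle = new mutable.Queue[W]()

  def releaseWorker(worker: W): Unit = synchronized {
    idle.enqueue(worker) // blocked while getWorker holds the monitor
  }

  def getWorker(): W = synchronized {
    while (idle.nonEmpty) {
      val worker = idle.dequeue()
      if (isUsable(worker)) return worker // healthy idle worker: reuse it
      // otherwise it crashed while idling; drop it and keep draining
    }
    createFresh() // pool exhausted without finding a healthy worker
  }
}
```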

@HyukjinKwon HyukjinKwon changed the title [SPARK-47565] PySpark worker pool crash resilience [SPARK-47565][PYTHON] PySpark worker pool crash resilience Mar 27, 2024
@sebastianhillig-db sebastianhillig-db force-pushed the python-worker-factory-crash branch from ba3c6f6 to 0f59a6a on March 27, 2024 at 09:42
@HyukjinKwon

Merged to master.

@sebastianhillig-db sebastianhillig-db deleted the python-worker-factory-crash branch April 4, 2024 11:25