
Conversation

ghost commented Jun 13, 2024

What changes were proposed in this pull request?

This change lets a Scala Spark Connect client reattempt execution of a plan when it receives a SESSION_NOT_FOUND error from the Spark Connect service if it has not received any partial responses.

This is the Scala-client counterpart of the earlier fix for the same issue, #46297.
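
Below is a minimal sketch of the retry idea, under the assumption of a hypothetical client API: `Response`, `SessionNotFoundException`, `executePlan`, and the object name are illustrative stand-ins, not the actual Spark Connect classes touched by this PR. It illustrates the key condition: the plan is only re-executed when the error arrives before any partial response has been consumed.

```scala
import scala.collection.mutable

object RetryOnSessionNotFound {
  final case class Response(data: String)
  final class SessionNotFoundException(msg: String) extends RuntimeException(msg)

  /** Runs `executePlan`, re-executing it once if SESSION_NOT_FOUND is raised
    * before any partial response has been received. */
  def executeWithReattempt(executePlan: () => Iterator[Response]): Seq[Response] = {
    def run(allowRetry: Boolean): Seq[Response] = {
      val received = mutable.Buffer.empty[Response]
      try {
        executePlan().foreach(received += _)
        received.toSeq
      } catch {
        // Re-executing is only safe while nothing has been consumed yet; once a
        // partial response exists, the client must reattach to the existing
        // result stream rather than run the plan again.
        case _: SessionNotFoundException if allowRetry && received.isEmpty =>
          run(allowRetry = false)
      }
    }
    run(allowRetry = true)
  }
}
```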

Why are the changes needed?

Spark Connect clients often receive a spurious SESSION_NOT_FOUND error from the Spark Connect service when the service is busy or the network is congested. The client then immediately attempts to reattach to a session the service is not aware of, and the query fails.

Does this PR introduce any user-facing change?

Previously, a Scala Spark Connect client would fail with the error code "INVALID_HANDLE.SESSION_NOT_FOUND" on the very first request it made to the service; with this change, the client automatically retries instead.

How was this patch tested?

Added a unit test.
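
For illustration only (this is not the test attached to this PR), a hedged sketch of what such a unit test could look like, written against the hypothetical `RetryOnSessionNotFound` helper sketched above and using ScalaTest: the fake plan fails with SESSION_NOT_FOUND on the first attempt and succeeds on the second.

```scala
import org.scalatest.funsuite.AnyFunSuite

import RetryOnSessionNotFound._

class SessionNotFoundRetrySuite extends AnyFunSuite {
  test("plan is re-executed when SESSION_NOT_FOUND arrives before any response") {
    var attempts = 0
    val plan = () => {
      attempts += 1
      // First attempt: the service claims the session does not exist.
      if (attempts == 1) throw new SessionNotFoundException("INVALID_HANDLE.SESSION_NOT_FOUND")
      // Second attempt: the plan executes normally.
      Iterator(Response("row-1"), Response("row-2"))
    }
    assert(executeWithReattempt(plan) === Seq(Response("row-1"), Response("row-2")))
    assert(attempts === 2)
  }
}
```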

Was this patch authored or co-authored using generative AI tooling?

No.

Changgyoo Park added 2 commits June 13, 2024 13:12
ghost (Author) commented Jun 13, 2024

@HyukjinKwon Hi! Could you review and possibly merge this PR? It is a follow-up to #46297, which was merged in May.

@ghost changed the title [SPARK-48056][CONNECT][SCALA] Re-execute plan if a SESSION_NOT_FOUND … [SPARK-48056][CONNECT][SCALA] Re-execute plan on SESSION_NOT_FOUND errors Jun 14, 2024
@zhengruifeng changed the title [SPARK-48056][CONNECT][SCALA] Re-execute plan on SESSION_NOT_FOUND errors [SPARK-48056][CONNECT][FOLLOW-UP] Scala Client re-execute plan if a SESSION_NOT_FOUND error is raised and no partial response was received Jun 14, 2024
@ghost deleted the SPARK-48056 branch June 14, 2024 08:32
vkorukanti pushed a commit to delta-io/delta that referenced this pull request Aug 21, 2024
…e available in Delta Connect testing (#3576)

## Description
For local E2E Delta Connect testing, we also designed a [util
class](https://github.com/delta-io/delta/blob/01bf60743b77c47147843e9083129320490f1629/spark-connect/client/src/test/scala-spark-master/io/delta/connect/tables/RemoteSparkSession.scala#L62)
that starts a local server in a separate process, similar to
[SparkConnect](https://github.com/apache/spark/blob/ba208b9ca99990fa329c36b28d0aa2a5f4d0a77e/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/test/RemoteSparkSession.scala#L37).

We noticed that the server takes a variable number of seconds to start up,
and at the time we received the error `[INVALID_HANDLE.SESSION_NOT_FOUND]
The handle 746e6c86-9fa9-4b08-9572-388c20eaed47 is invalid. Session not
found. SQLSTATE: HY000`, so we added a 10s `Thread.sleep`
before starting the client.

This is not robust, so we are removing the `Thread.sleep`. This should
work because (see the sketch after this list):
1. The SparkSession builder here already uses the default
[Configuration](https://github.com/apache/spark/blob/3edc9c23a723a92c5a951cea0436529de65c640a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala#L891)
of the `SparkConnectClient`, which includes a default retry policy.
2. Spark fixed the handling of the `INVALID_HANDLE.SESSION_NOT_FOUND` error in this
[PR](apache/spark#46971), so the client should be able to retry even when it
encounters this error.
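
For concreteness, a sketch of the before/after in the test setup (not the exact Delta code; the connection string and builder usage are assumptions based on the standard Spark Connect Scala client API):

```scala
import org.apache.spark.sql.SparkSession

// Before: wait for the out-of-process server to come up.
// Thread.sleep(10000)

// After: connect immediately; the SparkConnectClient default retry policy,
// which now also covers INVALID_HANDLE.SESSION_NOT_FOUND, absorbs the startup race.
val spark = SparkSession
  .builder()
  .remote("sc://localhost:15002") // illustrative connection string
  .getOrCreate()
```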

## How was this patch tested?
Existing UTs.