
Conversation

@cdkrot
Contributor

@cdkrot cdkrot commented Dec 4, 2023

What changes were proposed in this pull request?

Remove _display_server_stack_trace and always display the error stack trace if we have one.

Why are the changes needed?

There is a codepath that can make the existing error handling fall into infinite recursion. Consider the following codepath:

[Some error happens] -> _handle_error -> _handle_rpc_error -> _display_server_stack_trace -> RuntimeConf.get -> SparkConnectClient.config -> [An error happens] -> _handle_error.

There can be other similar codepaths.
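For illustration, here is a minimal sketch of how this kind of recursion arises (simplified, hypothetical names; not the actual pyspark.sql.connect client code):

```python
# Simplified, hypothetical sketch of the recursion pattern; the real code
# goes through _handle_rpc_error -> _display_server_stack_trace -> RuntimeConf.get.
class Client:
    def config(self, key: str) -> str:
        try:
            return self._rpc_get_config(key)
        except Exception as e:
            self._handle_error(e)           # re-enters error handling

    def _rpc_get_config(self, key: str) -> str:
        raise RuntimeError("connection refused")   # e.g. the channel is down

    def _handle_error(self, error: Exception) -> None:
        # Error handling itself needs a config value to decide whether to
        # show the server-side stack trace...
        if self.config("spark.sql.connect.serverStacktrace.enabled") == "true":
            ...
        # ...so if the config RPC keeps failing, config() and _handle_error()
        # call each other until Python's recursion limit is hit.
        raise error
```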

Does this PR introduce any user-facing change?

Gets rid of occasional infinite recursion in error handling (which can degrade the user experience).

How was this patch tested?

N/A

Was this patch authored or co-authored using generative AI tooling?

No

@cdkrot
Contributor Author

cdkrot commented Dec 4, 2023

cc @HyukjinKwon, also @nija-at @grundprinzip

@HyukjinKwon HyukjinKwon changed the title [SPARK-TBD] Forbid Recursive Error handling [SPARK-TBD][PYTHON][CONNECT Forbid Recursive Error handling Dec 4, 2023
@HyukjinKwon HyukjinKwon changed the title [SPARK-TBD][PYTHON][CONNECT Forbid Recursive Error handling [SPARK-TBD][PYTHON][CONNECT] Forbid Recursive Error handling Dec 4, 2023
Review thread on the diff line `class ForbidRecursion:`
Member

I faced the same problem before too. In my case, I worked around it with https://github.com/apache/spark/pull/43965/files#diff-831a8c82df3f07cbdaba03aaf7a0e9abaaf5dd6c63f9dd121e4a263e3094844eR1528.

Basically it tries to get the config once and does not retry if it fails. But I wasn't sure if that's the best approach. cc @heyihong
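A rough sketch of that idea (hypothetical helper name; the linked diff shows the actual workaround):

```python
# Fetch the config once inside a try/except and fall back instead of retrying.
def _display_server_stack_trace(client) -> bool:
    try:
        return client.config("spark.sql.connect.serverStacktrace.enabled") == "true"
    except Exception:
        # If the config fetch itself fails, don't retry; just show the trace.
        return True
```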

Contributor Author

I looked into https://github.com/apache/spark/pull/43965/files#diff-831a8c82df3f07cbdaba03aaf7a0e9abaaf5dd6c63f9dd121e4a263e3094844eR1528; I think it is not enough.

The try/except guard you have is a good idea, but it won't be triggered immediately by the recursion: the code will walk into RuntimeConf.get again and fall into infinite recursion anyway.

Contributor

@heyihong heyihong Dec 4, 2023

There may be a simpler approach to dealing with recursive error handling (e.g. using the gRPC stub to get the config value). Using ForbidRecursion seems like a big hammer. Also, we should have some tests for this scenario.

Contributor Author

@cdkrot cdkrot Dec 4, 2023

I like this hammer since it's very specific and lets us keep all the fancy error handling we already have :). Happy to discuss other ideas too, though.
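For reference, a ForbidRecursion-style guard could look roughly like this (a sketch under assumptions: the class name comes from the original proposal, but the details here are illustrative rather than the actual change):

```python
import threading

class ForbidRecursion:
    """Context manager that detects re-entry into the guarded block."""

    def __init__(self) -> None:
        self._local = threading.local()

    def __enter__(self) -> "ForbidRecursion":
        if getattr(self._local, "active", False):
            raise RuntimeError("Recursive error handling is forbidden")
        self._local.active = True
        return self

    def __exit__(self, *exc_info) -> None:
        self._local.active = False

# Hypothetical usage inside the client:
#
#   self._error_guard = ForbidRecursion()
#
#   def _handle_error(self, error):
#       with self._error_guard:
#           ...  # anything that fails here cannot re-enter _handle_error
```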

Regarding testing, I tried to write a test with a mock stub that would fail, but I found that I need a somewhat sophisticated GrpcError instance to pass this conversion:

status = rpc_status.from_call(cast(grpc.Call, rpc_error))

Contributor

@heyihong heyihong Dec 4, 2023

IMO, the logic that determines whether to display the stack trace based on SQL confs should be implemented on the server side.

Contributor Author

It seems it's already controlled by the server; we only need some configuration to do this.

Another approach I see is to fetch spark.sql.connect.serverStacktrace.enabled lazily in some form, and put something forbidding recursion into that loader instead.
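A rough sketch of that lazy-fetch idea (all names here are assumptions, not the actual client code):

```python
class LazyStackTraceConf:
    """Caches spark.sql.connect.serverStacktrace.enabled and refuses to re-enter."""

    def __init__(self, client) -> None:
        self._client = client
        self._value = None      # cached result, once known
        self._loading = False   # guard against re-entrant fetches

    def get(self) -> bool:
        if self._value is not None:
            return self._value
        if self._loading:
            # Called again while already fetching (e.g. from error handling):
            # fall back instead of recursing.
            return True
        self._loading = True
        try:
            self._value = (
                self._client.config("spark.sql.connect.serverStacktrace.enabled")
                == "true"
            )
            return self._value
        except Exception:
            return True  # couldn't fetch; default to showing the trace
        finally:
            self._loading = False
```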

Contributor

@heyihong heyihong Dec 4, 2023

We probably don't need extra config-request round trips to know whether to display the stack trace. We can just decide based on whether the stack trace field in the response is empty.
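A sketch of that check (the field and message names here are assumptions about the error-details payload, and it assumes the server only populates the stack trace field when spark.sql.connect.serverStacktrace.enabled is on, as discussed above):

```python
# Decide from the response itself, without an extra config round trip.
def _format_exception_message(error_details) -> str:
    message = error_details.message
    stack_trace = getattr(error_details, "stack_trace", "")
    if stack_trace:
        # A non-empty field already tells us the trace should be shown.
        message += f"\n\nJVM stacktrace:\n{stack_trace}"
    return message
```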

Contributor Author

That's a good idea, let's try this

@cdkrot cdkrot changed the title [SPARK-TBD][PYTHON][CONNECT] Forbid Recursive Error handling [SPARK-46241][PYTHON][CONNECT] Forbid Recursive Error handling Dec 4, 2023
@cdkrot
Contributor Author

cdkrot commented Dec 4, 2023

Changed to @heyihong's suggestion to always print a stack trace if we got one (that makes sense). I checked, and there seem to be no other recursion problems currently. (The original proposal was 285b85c.)

@cdkrot cdkrot changed the title [SPARK-46241][PYTHON][CONNECT] Forbid Recursive Error handling [SPARK-46241][PYTHON][CONNECT] Fix error handling routine so it wouldn't fall into infinite recursion Dec 4, 2023
@cdkrot cdkrot requested a review from heyihong December 4, 2023 12:57
@HyukjinKwon
Member

Hm, test failure seems related?

======================================================================
ERROR [612.514s]: test_other_than_dataframe_iter (pyspark.sql.tests.connect.test_parity_pandas_map.MapInPandasParityTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/reattach.py", line 156, in _has_next
    self._current = self._call_iter(
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/reattach.py", line 271, in _call_iter
    raise e
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/reattach.py", line 253, in _call_iter
    return iter_fun()
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/reattach.py", line 157, in <lambda>
    lambda: next(self._iterator)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.9/dist-packages/grpc/_channel.py", line 541, in __next__
    return self._next()
  File "/usr/local/lib/python3.9/dist-packages/grpc/_channel.py", line 967, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:33899: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:33899: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2023-12-04T11:06:56.275862745+00:00"}"
>
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/tests/pandas/test_pandas_map.py", line 378, in test_self_join
    df2 = df1.mapInPandas(lambda iter: iter, "id long")
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/dataframe.py", line 2043, in mapInPandas
    return self._map_partitions(func, schema, PythonEvalType.SQL_MAP_PANDAS_ITER_UDF, barrier)
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/dataframe.py", line 2032, in _map_partitions
    child=self._plan, function=udf_obj, cols=self.columns, is_barrier=barrier
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/dataframe.py", line 244, in columns
    return self.schema.names
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/dataframe.py", line 1776, in schema
    return self._session.client.schema(query)
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/core.py", line 912, in schema
    schema = self._analyze(method="schema", plan=plan).schema
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/core.py", line 1098, in _analyze
    self._handle_error(error)
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/core.py", line 1504, in _handle_error
    raise error
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/core.py", line 1091, in _analyze
    for attempt in self._retrying():
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/retries.py", line 247, in __iter__
    self._wait()
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/retries.py", line 232, in _wait
    raise RetriesExceeded from exception
pyspark.sql.connect.client.retries.RetriesExceeded

======================================================================
ERROR [0.000s]: tearDownClass (pyspark.sql.tests.connect.test_parity_pandas_map.MapInPandasParityTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/core.py", line 1433, in release_session
    resp = self._stub.ReleaseSession(req, metadata=self._builder.metadata())
  File "/usr/local/lib/python3.9/dist-packages/grpc/_channel.py", line 1161, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.9/dist-packages/grpc/_channel.py", line 1004, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:33899: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:33899: Failed to connect to remote host: Connection refused {created_time:"2023-12-04T11:27:19.924646308+00:00", grpc_status:14}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/__w/apache_spark/apache_spark/python/pyspark/testing/connectutils.py", line 194, in tearDownClass
    cls.spark.stop()
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/session.py", line 655, in stop
    self.client.release_session()
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/core.py", line 1438, in release_session
    self._handle_error(error)
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/core.py", line 1504, in _handle_error
    raise error
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/core.py", line 1431, in release_session
    for attempt in self._retrying():
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/retries.py", line 247, in __iter__
    self._wait()
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/retries.py", line 232, in _wait
    raise RetriesExceeded from exception
pyspark.sql.connect.client.retries.RetriesExceeded

----------------------------------------------------------------------

@cdkrot
Contributor Author

cdkrot commented Dec 5, 2023

@HyukjinKwon it seems to be a flake; it doesn't seem like the change I'm making could have affected this, and it passed after a retrigger.

@HyukjinKwon
Member

Merged to master.

dbatomic pushed a commit to dbatomic/spark that referenced this pull request Dec 11, 2023
[SPARK-46241][PYTHON][CONNECT] Fix error handling routine so it wouldn't fall into infinite recursion

### What changes were proposed in this pull request?

Remove _display_server_stack_trace and always display the error stack trace if we have one.

### Why are the changes needed?

There is a codepath that can make the existing error handling fall into infinite recursion. Consider the following codepath:

`[Some error happens] -> _handle_error -> _handle_rpc_error -> _display_server_stack_trace -> RuntimeConf.get -> SparkConnectClient.config -> [An error happens] -> _handle_error`.

There can be other similar codepaths.

### Does this PR introduce _any_ user-facing change?

Gets rid of occasional infinite recursion in error handling (which can degrade the user experience).

### How was this patch tested?
N/A

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#44144 from cdkrot/forbid_recursive_error_handling.

Authored-by: Alice Sayutina <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Dec 19, 2023
### What changes were proposed in this pull request?

Revert #44144 and introduce a forbid-recursion guard as previously proposed. This way, the infinite error-handling recursion is still prevented, but the client-side knob remains.

### Why are the changes needed?

Previously proposed as part of #44144, but the discussion there favoured a different approach. However, it seems (per a proposal by grundprinzip) that the original proposal was more correct, since the driver stack trace is decided on the client, not the server (see #43667).

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual testing

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44210 from cdkrot/forbid_recursive_error_handling_2.

Authored-by: Alice Sayutina <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
