
Conversation

@cdkrot
Contributor

@cdkrot cdkrot commented Dec 4, 2023

What changes were proposed in this pull request?

Remove _display_server_stack_trace and always display the error stack trace if we have one.

Why are the changes needed?

There is a codepath that can make the existing error handling fall into infinite recursion. Consider the following codepath:

[Some error happens] -> _handle_error -> _handle_rpc_error -> _display_server_stack_trace -> RuntimeConf.get -> SparkConnectClient.config -> [An error happens] -> _handle_error.

There can be other similar codepaths.
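For illustration, here is a minimal sketch of how this kind of recursion arises (simplified, hypothetical names; not the actual pyspark.sql.connect client code):

```python
# Simplified, hypothetical sketch of the recursion pattern; the real code
# goes through _handle_rpc_error -> _display_server_stack_trace -> RuntimeConf.get.
class Client:
    def config(self, key: str) -> str:
        try:
            return self._rpc_get_config(key)
        except Exception as e:
            self._handle_error(e)           # re-enters error handling

    def _rpc_get_config(self, key: str) -> str:
        raise RuntimeError("connection refused")   # e.g. the channel is down

    def _handle_error(self, error: Exception) -> None:
        # Error handling itself needs a config value to decide whether to
        # show the server-side stack trace...
        if self.config("spark.sql.connect.serverStacktrace.enabled") == "true":
            ...
        # ...so if the config RPC keeps failing, config() and _handle_error()
        # call each other until Python's recursion limit is hit.
        raise error
```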

Does this PR introduce any user-facing change?

Gets rid of occasional infinite recursion in error handling (which can degrade the user experience).

How was this patch tested?

N/A

Was this patch authored or co-authored using generative AI tooling?

No

@cdkrot
Contributor Author

cdkrot commented Dec 4, 2023

cc @HyukjinKwon, also @nija-at @grundprinzip

@HyukjinKwon HyukjinKwon changed the title [SPARK-TBD] Forbid Recursive Error handling [SPARK-TBD][PYTHON][CONNECT Forbid Recursive Error handling Dec 4, 2023
@HyukjinKwon HyukjinKwon changed the title [SPARK-TBD][PYTHON][CONNECT Forbid Recursive Error handling [SPARK-TBD][PYTHON][CONNECT] Forbid Recursive Error handling Dec 4, 2023
Review thread on the diff line `class ForbidRecursion:`
Member

I faced the same problem before too. In my case, I worked around it with https://github.com/apache/spark/pull/43965/files#diff-831a8c82df3f07cbdaba03aaf7a0e9abaaf5dd6c63f9dd121e4a263e3094844eR1528.

Basically it tries to get the config once and does not retry if it fails. But I wasn't sure if that's the best approach. cc @heyihong
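A rough sketch of that idea (hypothetical helper name; the linked diff shows the actual workaround):

```python
# Fetch the config once inside a try/except and fall back instead of retrying.
def _display_server_stack_trace(client) -> bool:
    try:
        return client.config("spark.sql.connect.serverStacktrace.enabled") == "true"
    except Exception:
        # If the config fetch itself fails, don't retry; just show the trace.
        return True
```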

Contributor Author

I looked into https://github.com/apache/spark/pull/43965/files#diff-831a8c82df3f07cbdaba03aaf7a0e9abaaf5dd6c63f9dd121e4a263e3094844eR1528; I think it is not enough.

The try/except guard you have is a good idea, but it won't be triggered immediately by the recursion: the code will walk into RuntimeConf.get again and fall into infinite recursion anyway.

Contributor

@heyihong heyihong Dec 4, 2023

There may be a simpler approach to dealing with recursive error handling (e.g. using the gRPC stub to get the config value). Using ForbidRecursion seems like a big hammer. Also, we should have some tests for this scenario.

Contributor Author

@cdkrot cdkrot Dec 4, 2023

I like this hammer since it's very specific and lets us keep all the fancy error handling we already have :). Happy to discuss other ideas too, though.
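For reference, a ForbidRecursion-style guard could look roughly like this (a sketch under assumptions: the class name comes from the original proposal, but the details here are illustrative rather than the actual change):

```python
import threading

class ForbidRecursion:
    """Context manager that detects re-entry into the guarded block."""

    def __init__(self) -> None:
        self._local = threading.local()

    def __enter__(self) -> "ForbidRecursion":
        if getattr(self._local, "active", False):
            raise RuntimeError("Recursive error handling is forbidden")
        self._local.active = True
        return self

    def __exit__(self, *exc_info) -> None:
        self._local.active = False

# Hypothetical usage inside the client:
#
#   self._error_guard = ForbidRecursion()
#
#   def _handle_error(self, error):
#       with self._error_guard:
#           ...  # anything that fails here cannot re-enter _handle_error
```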

Regarding testing, I tried to write a test with a mock stub that would fail, but I found that I need a somewhat sophisticated GrpcError instance to pass this conversion:

status = rpc_status.from_call(cast(grpc.Call, rpc_error))

Contributor

@heyihong heyihong Dec 4, 2023

IMO, the logic that determines whether to display the stack trace based on SQL confs should be implemented on the server side.

Contributor Author

It seems it's already controlled by the server; we only need some configuration to do this.

Another approach I see is to fetch spark.sql.connect.serverStacktrace.enabled lazily in some form, and put something forbidding recursion into that loader instead.
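A rough sketch of that lazy-fetch idea (all names here are assumptions, not the actual client code):

```python
class LazyStackTraceConf:
    """Caches spark.sql.connect.serverStacktrace.enabled and refuses to re-enter."""

    def __init__(self, client) -> None:
        self._client = client
        self._value = None      # cached result, once known
        self._loading = False   # guard against re-entrant fetches

    def get(self) -> bool:
        if self._value is not None:
            return self._value
        if self._loading:
            # Called again while already fetching (e.g. from error handling):
            # fall back instead of recursing.
            return True
        self._loading = True
        try:
            self._value = (
                self._client.config("spark.sql.connect.serverStacktrace.enabled")
                == "true"
            )
            return self._value
        except Exception:
            return True  # couldn't fetch; default to showing the trace
        finally:
            self._loading = False
```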

Contributor

@heyihong heyihong Dec 4, 2023

We probably don't need extra config-request round trips to know whether to display the stack trace. We can just decide based on whether the stack trace field in the response is empty.
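A sketch of that check (the field and message names here are assumptions about the error-details payload, and it assumes the server only populates the stack trace field when spark.sql.connect.serverStacktrace.enabled is on, as discussed above):

```python
# Decide from the response itself, without an extra config round trip.
def _format_exception_message(error_details) -> str:
    message = error_details.message
    stack_trace = getattr(error_details, "stack_trace", "")
    if stack_trace:
        # A non-empty field already tells us the trace should be shown.
        message += f"\n\nJVM stacktrace:\n{stack_trace}"
    return message
```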

Contributor Author

That's a good idea, let's try this

@cdkrot cdkrot changed the title [SPARK-TBD][PYTHON][CONNECT] Forbid Recursive Error handling [SPARK-46241][PYTHON][CONNECT] Forbid Recursive Error handling Dec 4, 2023
@cdkrot
Contributor Author

cdkrot commented Dec 4, 2023

Changed to @heyihong's suggestion to always print a stack trace if we got one (that makes sense). I checked, and there seem to be no other recursion problems currently. (The original proposal was 285b85c.)

@cdkrot cdkrot changed the title [SPARK-46241][PYTHON][CONNECT] Forbid Recursive Error handling [SPARK-46241][PYTHON][CONNECT] Fix error handling routine so it wouldn't fall into infinite recursion Dec 4, 2023
@cdkrot cdkrot requested a review from heyihong December 4, 2023 12:57
@HyukjinKwon
Member

Hm, test failure seems related?

======================================================================
ERROR [612.514s]: test_other_than_dataframe_iter (pyspark.sql.tests.connect.test_parity_pandas_map.MapInPandasParityTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/reattach.py", line 156, in _has_next
    self._current = self._call_iter(
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/reattach.py", line 271, in _call_iter
    raise e
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/reattach.py", line 253, in _call_iter
    return iter_fun()
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/reattach.py", line 157, in <lambda>
    lambda: next(self._iterator)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.9/dist-packages/grpc/_channel.py", line 541, in __next__
    return self._next()
  File "/usr/local/lib/python3.9/dist-packages/grpc/_channel.py", line 967, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:33899: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:33899: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2023-12-04T11:06:56.275862745+00:00"}"
>
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/tests/pandas/test_pandas_map.py", line 378, in test_self_join
    df2 = df1.mapInPandas(lambda iter: iter, "id long")
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/dataframe.py", line 2043, in mapInPandas
    return self._map_partitions(func, schema, PythonEvalType.SQL_MAP_PANDAS_ITER_UDF, barrier)
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/dataframe.py", line 2032, in _map_partitions
    child=self._plan, function=udf_obj, cols=self.columns, is_barrier=barrier
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/dataframe.py", line 244, in columns
    return self.schema.names
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/dataframe.py", line 1776, in schema
    return self._session.client.schema(query)
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/core.py", line 912, in schema
    schema = self._analyze(method="schema", plan=plan).schema
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/core.py", line 1098, in _analyze
    self._handle_error(error)
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/core.py", line 1504, in _handle_error
    raise error
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/core.py", line 1091, in _analyze
    for attempt in self._retrying():
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/retries.py", line 247, in __iter__
    self._wait()
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/retries.py", line 232, in _wait
    raise RetriesExceeded from exception
pyspark.sql.connect.client.retries.RetriesExceeded

======================================================================
ERROR [0.000s]: tearDownClass (pyspark.sql.tests.connect.test_parity_pandas_map.MapInPandasParityTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/core.py", line 1433, in release_session
    resp = self._stub.ReleaseSession(req, metadata=self._builder.metadata())
  File "/usr/local/lib/python3.9/dist-packages/grpc/_channel.py", line 1161, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.9/dist-packages/grpc/_channel.py", line 1004, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:33899: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:33899: Failed to connect to remote host: Connection refused {created_time:"2023-12-04T11:27:19.924646308+00:00", grpc_status:14}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/__w/apache_spark/apache_spark/python/pyspark/testing/connectutils.py", line 194, in tearDownClass
    cls.spark.stop()
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/session.py", line 655, in stop
    self.client.release_session()
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/core.py", line 1438, in release_session
    self._handle_error(error)
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/core.py", line 1504, in _handle_error
    raise error
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/core.py", line 1431, in release_session
    for attempt in self._retrying():
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/retries.py", line 247, in __iter__
    self._wait()
  File "/__w/apache_spark/apache_spark/python/pyspark/sql/connect/client/retries.py", line 232, in _wait
    raise RetriesExceeded from exception
pyspark.sql.connect.client.retries.RetriesExceeded

----------------------------------------------------------------------

@cdkrot
Contributor Author

cdkrot commented Dec 5, 2023

@HyukjinKwon it seems to be a flake; it doesn't seem like the change I'm making could have affected this, and it passed after a retrigger.

@HyukjinKwon
Member

Merged to master.

dbatomic pushed a commit to dbatomic/spark that referenced this pull request Dec 11, 2023
[SPARK-46241][PYTHON][CONNECT] Fix error handling routine so it wouldn't fall into infinite recursion

### What changes were proposed in this pull request?

Remove _display_server_stack_trace and always display the error stack trace if we have one.

### Why are the changes needed?

There is a codepath that can make the existing error handling fall into infinite recursion. Consider the following codepath:

`[Some error happens] -> _handle_error -> _handle_rpc_error -> _display_server_stack_trace -> RuntimeConf.get -> SparkConnectClient.config -> [An error happens] -> _handle_error`.

There can be other similar codepaths.

### Does this PR introduce _any_ user-facing change?

Gets rid of occasional infinite recursion in error handling (which can degrade the user experience).

### How was this patch tested?
N/A

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#44144 from cdkrot/forbid_recursive_error_handling.

Authored-by: Alice Sayutina <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Dec 19, 2023
### What changes were proposed in this pull request?

Revert #44144 and introduce a forbid-recursion guard as previously proposed. This way, the infinite error-handling recursion is still prevented, but the client-side knob remains.

### Why are the changes needed?

Previously proposed as part of #44144, but the discussion there favoured a different approach. However, it seems (per a proposal by grundprinzip) that the original proposal was more correct, since the driver stack trace is decided on the client, not the server (see #43667).

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual testing

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44210 from cdkrot/forbid_recursive_error_handling_2.

Authored-by: Alice Sayutina <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
