[SPARK-46241][PYTHON][CONNECT] Fix error handling routine so it wouldn't fall into infinite recursion #44144
Conversation
cc @HyukjinKwon, also @nija-at @grundprinzip
Review comment on the line `class ForbidRecursion:`
I faced the same problem before too. In my case, I worked around it via https://github.com/apache/spark/pull/43965/files#diff-831a8c82df3f07cbdaba03aaf7a0e9abaaf5dd6c63f9dd121e4a263e3094844eR1528.
Basically it tries to get the config once, and does not retry if it fails. But I wasn't sure if that's the best approach. cc @heyihong
I looked into https://github.com/apache/spark/pull/43965/files#diff-831a8c82df3f07cbdaba03aaf7a0e9abaaf5dd6c63f9dd121e4a263e3094844eR1528, and I think it is not enough.
The try/except guard you have is a good idea; however, it won't be triggered immediately by the recursion, and the code will walk into RuntimeConf.get again and fall into infinite recursion anyway.
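To illustrate the point, here is a toy model of the failure mode (my own sketch, with the depth capped so the demo terminates; the real codepath had no cap and recursed until Python's recursion limit):

```python
# Toy model of the recursive codepath: the error handler consults a config,
# the config fetch fails, and handling *that* failure re-enters the handler.
def get_conf():
    raise RuntimeError("rpc failed")  # stands in for RuntimeConf.get

def handle_error(err, depth=0):
    if depth > 3:
        # Cap for the demo only; the real client had no such cap.
        return "gave up after repeated failures"
    try:
        get_conf()  # the handler itself needs a config value ...
    except RuntimeError as e:
        return handle_error(e, depth + 1)  # ... and re-enters itself
```

A try/except around the original call site doesn't help, because each handling attempt triggers a fresh failure inside the handler.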
There may be some simpler approach to dealing with recursive error handling (e.g. using the grpc stub directly to get the config value). Using ForbidRecursion seems like a big hammer. Also, we should have some tests for this scenario.
I like this hammer since it's very specific and allows us to keep all the existing fanciness in the error handling we already have :). Happy to discuss other ideas too, though.
Regarding testing, I tried to write a test with a mock stub which would fail, but I found that I need a fairly sophisticated GrpcError instance to pass this conversion:
spark/python/pyspark/sql/connect/client/core.py (line 1559 in 4398bb5): `status = rpc_status.from_call(cast(grpc.Call, rpc_error))`
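For reference, a re-entrancy guard in the spirit of `ForbidRecursion` could look roughly like this. This is a sketch under my own assumptions, not the actual Spark implementation; the usage helper names are hypothetical.

```python
import threading

class ForbidRecursion:
    """Context manager that reports whether the current thread has already
    entered the guarded section, so a recursive entry can short-circuit."""

    def __init__(self):
        self._local = threading.local()

    def __enter__(self):
        depth = getattr(self._local, "depth", 0)
        self._local.depth = depth + 1
        return depth > 0  # True => we recursed; take the simple path

    def __exit__(self, *exc_info):
        self._local.depth -= 1
        return False  # never swallow exceptions

# Hypothetical usage in an error handler:
_error_guard = ForbidRecursion()

def handle_error(error):
    with _error_guard as recursing:
        if recursing:
            raise error  # bail out with the raw error, no conf lookups
        raise error      # (placeholder for the full error enrichment,
                         #  which may itself hit the client)
```

Using thread-local depth means concurrent threads don't interfere with each other's guards.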
IMO, the logic that determines whether to display the stack trace based on SQL confs should be implemented on the server side.
It seems it's controlled by the server; we only need some configuration to do this.
Another approach I see is to make spark.sql.connect.serverStacktrace.enabled some form of lazy fetch, and put something forbidding recursion into that loader instead.
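A lazy, recursion-proof conf loader along those lines might look roughly like this (an illustrative sketch, not actual pyspark code; `fetch` stands in for the RPC):

```python
import threading

class LazyConf:
    """Lazily fetch a conf value once; if the fetch re-enters itself on the
    same thread (i.e. error handling recursed into the loader), return the
    default instead of fetching again."""

    _UNSET = object()

    def __init__(self, fetch, default):
        self._fetch = fetch       # callable performing the RPC; may raise
        self._default = default
        self._value = self._UNSET
        self._local = threading.local()

    def get(self):
        if self._value is not self._UNSET:
            return self._value    # cached from a previous successful fetch
        if getattr(self._local, "loading", False):
            return self._default  # recursion detected: do not re-fetch
        self._local.loading = True
        try:
            self._value = self._fetch()
        except Exception:
            return self._default  # fetch failed; fall back (not cached)
        finally:
            self._local.loading = False
        return self._value
```

Unlike the fetch-once-permanently approach, a transient failure here doesn't poison the cache: the next `get()` retries, while recursion during a fetch still short-circuits.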
We probably don't need extra config request round trips to know whether to display the stack trace. We can just determine it based on whether the stack trace field in the response is empty.
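That suggestion could be as simple as the following sketch (the function and field names are illustrative, not the real proto or client API):

```python
# Sketch: display the server stack trace iff the error response actually
# carries one; no extra config round trip needed.
def format_error(message: str, server_stacktrace: str) -> str:
    if server_stacktrace:  # non-empty field => server opted to send it
        return f"{message}\n\nServer stack trace:\n{server_stacktrace}"
    return message
```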
That's a good idea, let's try this
Hm, test failure seems related?
@HyukjinKwon it seems to be a flake; it doesn't seem like this change could've affected it, and it passed after a retrigger.
Merged to master.
…n't fall into infinite recursion

### What changes were proposed in this pull request?

Remove _display_server_stack_trace and always display the error stack trace if we have one.

### Why are the changes needed?

There is a certain codepath which can make the existing error handling fall into infinite recursion. I.e. consider the following codepath: `[Some error happens] -> _handle_error -> _handle_rpc_error -> _display_server_stack_trace -> RuntimeConf.get -> SparkConnectClient.config -> [An error happens] -> _handle_error`. There can be other similar codepaths.

### Does this PR introduce _any_ user-facing change?

Gets rid of occasionally infinite recursive error handling (which can cause a downgraded user experience).

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#44144 from cdkrot/forbid_recursive_error_handling.
Authored-by: Alice Sayutina <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?

Revert #44144, and introduce a forbid-recursion guard as previously proposed. This way the infinite error-handling recursion is still prevented, but the client-side knob is still present.

### Why are the changes needed?

This was previously proposed as part of #44144, but the discussion there favoured something else. However, it seems (per a proposal by grundprinzip) that the original proposal was more correct, since the driver stacktrace is decided on the client, not the server (see #43667).

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Hand testing

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44210 from cdkrot/forbid_recursive_error_handling_2.
Authored-by: Alice Sayutina <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>