-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-50540][PYTHON][SS] Fix string schema for StatefulProcessorHandle #49138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@LuciferYang Do you know why the python codegen test failed here? I was using the |
|
After executing Is there any local file changes that hasn't been committed? |
| """ | ||
| self.stateful_processor_api_client.get_value_state(state_name, schema, ttl_duration_ms) | ||
| return ValueState(ValueStateClient(self.stateful_processor_api_client), state_name, schema) | ||
| schema_struct = self.stateful_processor_api_client.get_value_state( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: It's not very straightforward to understand we request initialization for State and we get the schema as a result. (Arguably it's already confusing we get nothing from get_XXX, though it's following the current name convention.)
Why not just have a separate method? It's not very heavyweight even we have a separate request call(s) for schema, and it's only used for string schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I'm open for voices, I know this reduces one round trip and there would be people who prefers performance over cleaner code.
cc. @anishshri-db It'd be good to hear about your input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea - probably better to have a separate API for this. As @HeartSaVioR mentioned, could we only do the conversion if string is passed ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, will update, thanks!
jingz-db
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM! Will take another look after moving the schema into a separate call.
| schema: Union[StructType, str], | ||
| ) -> None: | ||
| self._stateful_processor_api_client = stateful_processor_api_client | ||
| if isinstance(schema, str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make schema part of the class constructor to avoid multiple API calls for parsing.
| message.getMethodCase match { | ||
| case UtilsRequest.MethodCase.PARSESTRINGSCHEMA => | ||
| val stringSchema = message.getParseStringSchema.getSchema | ||
| val schema = CatalystSqlParser.parseTableSchema(stringSchema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we have an exception handling here for parse exception thrown from parseTableSchema when users pass in invalid schema string? It seems currently we will always send back return code 0 even if the parsing failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have exception handling logic on the server side, if parseTableSchema throws an error, it will be handled here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's think of the outcome for users. If they make a mistake on the schema string, which error message they will get? This should be a part of error class.
I know we have TODOs for error class and I'm OK to file another one - we will need to address it before Spark 4.0, but it's not very close as of now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the error we got when passing a schema like count inttttt
pyspark.errors.exceptions.base.PySparkRuntimeError: Error parsing string schema:
[UNSUPPORTED_DATATYPE] Unsupported data type "INTTTTT". SQLSTATE: 0A000 (line 1, pos 6)
== SQL ==
count inttttt
------^^^
This error is already handled in the bigger try-catch block here and we throw the error in stateful_processor_api_client._parse_string_schema(), we could just use the same TODO(SPARK-49233) there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK let's deal with this in existing TODO.
jingz-db
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small comment and otherwise LGTM! Thanks for making the change.
| verify(arrowStreamWriter).finalizeCurrentArrowBatch() | ||
| } | ||
|
|
||
| test("utils request - parse string schema") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we also have the test for negative case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we use a bigger try-catch block under run() method, it's hard to test with. We could either have a specific try-catch block just for this call, or have a negative test on python side, which one do you prefer? cc @jingz-db
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK with having negative test on python side. I'm also OK with having follow up PR for this.
HeartSaVioR
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good in overall - left a couple comments.
HeartSaVioR
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, let's address minor comments as follow-up.
|
Thanks! Merging to master. |
What changes were proposed in this pull request?
Fix string schema for StatefulProcessorHandle, it was throwing an error before when passing the schema as String type because the utility method we used
_parse_datatype_stringrequires a SparkContext which is not available on executors.The way we support it is to create a new API
ParseStringSchemafrom the client side (Python worker) to server side (JVM). Client passes a string schema to the server, we do the parsing on server side and then send the result back to the client.Why are the changes needed?
This is to fix an issue/bug with the existing code.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Updated existing test cases to include string schemas.
Was this patch authored or co-authored using generative AI tooling?
No.