-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-40645][CONNECT] Throw exception for Collect() and recommend to use toPandas() #38089
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
R: @grundprinzip @HyukjinKwon |
python/pyspark/sql/connect/client.py
Outdated
| return DataFrame.withPlan(SQL(sql_string), self) | ||
|
|
||
| def collect(self, plan: pb2.Plan) -> pandas.DataFrame: | ||
| def toPandas(self, plan: pb2.Plan) -> pandas.DataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should remove this ... ? since this doesn;t exists in SparkSession
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @grundprinzip FYI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make these methods private instead and see how they're used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tricky part is that while the original session does not have this method we need a way to pass the plan to the client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I make it private but still have data_frame to call it. We need a way to pass the plan to the session/client.
We can see in the future if this can be replaced.
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM one comment
python/pyspark/sql/connect/client.py
Outdated
| return DataFrame.withPlan(SQL(sql_string), self) | ||
|
|
||
| def collect(self, plan: pb2.Plan) -> pandas.DataFrame: | ||
| def _toPandas(self, plan: pb2.Plan) -> pandas.DataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we should use snake naming rule for private methods. Only API follows the camel naming conversion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I see done!
I am still learning how to write python code for spark....
itholic
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good
|
Merged to master. |
What changes were proposed in this pull request?
Current connect
Collect()return Pandas DataFrame, which does not match with PySpark DataFrame API which returns aList[Row]:spark/python/pyspark/sql/connect/data_frame.py
Line 227 in ceb8527
spark/python/pyspark/sql/dataframe.py
Line 1119 in ceb8527
The underlying implementation has been generating Pandas DataFrame though. In this case, we can choose to use to
toPandas()and throw exception forCollect()to recommend to usetoPandas().Why are the changes needed?
The goal of the connect project is still to align with existing data frame API as much as possible. In this case, given that
Collect()is not compatible in existing python client, we can choose to disable it for now.Does this PR introduce any user-facing change?
No
How was this patch tested?
UT