[SPARK-40645][CONNECT] Throw exception for Collect() and recommend to use toPandas() #38089

amaliujia · 2022-10-04T03:32:00Z

What changes were proposed in this pull request?

Currently, the Connect client's collect() returns a pandas DataFrame, which does not match the PySpark DataFrame API, where collect() returns a List[Row]:

spark/python/pyspark/sql/connect/data_frame.py

Line 227 in ceb8527

def collect(self):

spark/python/pyspark/sql/dataframe.py

Line 1119 in ceb8527

def collect(self) -> List[Row]:

The underlying implementation has been generating a pandas DataFrame all along, though. Given that, we can expose this behavior as toPandas() and have collect() throw an exception that recommends using toPandas().
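A minimal sketch of the proposed behavior, not the actual patch. `SketchDataFrame` is a hypothetical stand-in for the Connect DataFrame, a list of dicts stands in for the pandas DataFrame so the example has no dependencies, and the use of `NotImplementedError` and the message text are assumptions:

```python
from typing import Any, Dict, List

# Stand-in for pandas.DataFrame so the sketch runs without dependencies.
FakePandasFrame = List[Dict[str, Any]]


class SketchDataFrame:
    """Hypothetical Connect-style DataFrame (not the real class)."""

    def __init__(self, rows: FakePandasFrame) -> None:
        self._rows = rows

    def collect(self) -> None:
        # collect() cannot return List[Row] yet, so fail loudly and
        # point the caller at toPandas() instead of returning the wrong type.
        raise NotImplementedError(
            "collect() is not supported; please use toPandas()."
        )

    def toPandas(self) -> FakePandasFrame:
        # Returns the tabular result that collect() used to return.
        return list(self._rows)
```

With this shape, existing callers of collect() get an explicit error rather than a pandas DataFrame where a List[Row] was expected.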

Why are the changes needed?

The goal of the Connect project is still to align with the existing DataFrame API as much as possible. Since collect() in the existing Python client is not compatible with that API, we can choose to disable it for now.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests.

amaliujia · 2022-10-04T03:32:35Z

R: @grundprinzip @HyukjinKwon
cc: @cloud-fan

HyukjinKwon · 2022-10-04T05:06:45Z

python/pyspark/sql/connect/client.py

        return DataFrame.withPlan(SQL(sql_string), self)

-    def collect(self, plan: pb2.Plan) -> pandas.DataFrame:
+    def toPandas(self, plan: pb2.Plan) -> pandas.DataFrame:


I think we should remove this ... ? since this doesn't exist in SparkSession

cc @grundprinzip FYI

We should make these methods private instead and see how they're used.

The tricky part is that, while the original session does not have this method, we need a way to pass the plan to the client.

I made it private, but data_frame still calls it. We need a way to pass the plan to the session/client.

We can see in the future if this can be replaced.

HyukjinKwon

LGTM one comment

HyukjinKwon · 2022-10-05T00:38:05Z

python/pyspark/sql/connect/client.py

        return DataFrame.withPlan(SQL(sql_string), self)

-    def collect(self, plan: pb2.Plan) -> pandas.DataFrame:
+    def _toPandas(self, plan: pb2.Plan) -> pandas.DataFrame:


Actually, we should use the snake-case naming rule for private methods. Only the API follows the camel-case naming convention.

oh I see done!

I am still learning how to write Python code for Spark...
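The outcome of the two review threads above can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: `SketchPlan`, `SketchClient`, and `SketchDataFrame` are hypothetical stand-ins, the private method name `_to_pandas` is a guess at the snake-case rename, and a list of dicts stands in for the pandas DataFrame result.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class SketchPlan:
    """Hypothetical stand-in for the protobuf plan (pb2.Plan in the diff)."""

    sql: str


class SketchClient:
    """Hypothetical session/client; only sql() is meant to be public."""

    def sql(self, sql_string: str) -> "SketchDataFrame":
        return SketchDataFrame(SketchPlan(sql_string), self)

    def _to_pandas(self, plan: SketchPlan) -> List[Dict[str, Any]]:
        # Private and snake_case: not part of the SparkSession-like surface.
        # A real client would send the plan to the server here.
        return [{"executed": plan.sql}]


class SketchDataFrame:
    def __init__(self, plan: SketchPlan, session: SketchClient) -> None:
        self._plan = plan
        self._session = session

    def toPandas(self) -> List[Dict[str, Any]]:
        # Public, camelCase API; hands the plan back to the client.
        return self._session._to_pandas(self._plan)
```

The DataFrame keeps a reference to the client so it can pass its plan along, while the client exposes no public toPandas() or collect() of its own.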

itholic

Looks pretty good

HyukjinKwon · 2022-10-05T02:23:31Z

Merged to master.

[SPARK-40645] Throw exception for Collect() and recommend to use toPa…

e362147

…ndas().

github-actions bot added CONNECT CORE PYTHON SQL labels Oct 4, 2022

HyukjinKwon reviewed Oct 4, 2022

HyukjinKwon approved these changes Oct 4, 2022

update

f129f80

HyukjinKwon reviewed Oct 5, 2022

itholic approved these changes Oct 5, 2022

amaliujia and others added 2 commits October 4, 2022 18:43

update

5d73d45

Apply suggestions from code review

5ce90f7

HyukjinKwon closed this in 99a76a6 Oct 5, 2022

amaliujia deleted the SPARK-40645 branch October 5, 2022 04:44

4 participants
