@amaliujia
Contributor

@amaliujia amaliujia commented Oct 4, 2022

What changes were proposed in this pull request?

Currently, Collect() in connect returns a Pandas DataFrame, which does not match the PySpark DataFrame API, where it returns a List[Row]:


def collect(self) -> List[Row]:

The underlying implementation has been generating a Pandas DataFrame, though. In this case, we can choose to use toPandas() instead and have Collect() throw an exception that recommends using toPandas().
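As a rough illustration of the proposal above, here is a minimal sketch. `SketchDataFrame`, the `execute` callable, the exception type, and the error message are all assumptions made for this example, not the actual connect client code. A list of dicts stands in for the Pandas DataFrame so the sketch has no dependencies.

```python
from typing import Any, Callable, List


class SketchDataFrame:
    """Hypothetical stand-in for the connect DataFrame; not the real class."""

    def __init__(self, plan: Any, execute: Callable[[Any], List[dict]]) -> None:
        self._plan = plan
        # `execute` stands in for whatever runs the plan remotely (assumption).
        self._execute = execute

    def collect(self) -> None:
        # Disabled for now: the PySpark API promises List[Row], which the
        # underlying implementation does not produce. The exception type and
        # message are illustrative choices.
        raise NotImplementedError("Please use toPandas() to retrieve results.")

    def toPandas(self) -> List[dict]:
        # The real method returns a pandas.DataFrame; a list of dicts is used
        # here only to keep the sketch dependency-free.
        return self._execute(self._plan)
```

With this shape, calling collect() fails right away and points the user at toPandas(), while toPandas() keeps returning what the implementation already generates.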

Why are the changes needed?

The goal of the connect project is still to align with the existing DataFrame API as much as possible. Given that Collect() is not compatible in the existing Python client, we can choose to disable it for now.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests.

@amaliujia
Contributor Author

R: @grundprinzip @HyukjinKwon
cc: @cloud-fan

return DataFrame.withPlan(SQL(sql_string), self)

- def collect(self, plan: pb2.Plan) -> pandas.DataFrame:
+ def toPandas(self, plan: pb2.Plan) -> pandas.DataFrame:
Member

I think we should remove this, since it doesn't exist in SparkSession?

Member

cc @grundprinzip FYI

Contributor

We should make these methods private instead and see how they're used.

Contributor

The tricky part is that, while the original session does not have this method, we need a way to pass the plan to the client.

Contributor Author

I made it private but still have data_frame call it. We need a way to pass the plan to the session/client.

We can see in the future if this can be replaced.

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM, one comment.

return DataFrame.withPlan(SQL(sql_string), self)

- def collect(self, plan: pb2.Plan) -> pandas.DataFrame:
+ def _toPandas(self, plan: pb2.Plan) -> pandas.DataFrame:
Member

Actually, we should use snake_case naming for private methods. Only the public API follows the camelCase naming convention.
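A small sketch of that naming split, under assumed names: `NamingSketchSession` and `NamingSketchDataFrame` are hypothetical classes, and the return value is a placeholder rather than a real Pandas DataFrame. The private helper on the session takes the plan and uses snake_case with a leading underscore, while the public DataFrame method keeps the camelCase name.

```python
from typing import Any, List


class NamingSketchSession:
    """Hypothetical session/client; names are illustrative only."""

    def _to_pandas(self, plan: Any) -> List[dict]:
        # Private helper: leading underscore plus snake_case.
        # A real client would execute the plan remotely; this placeholder
        # just echoes the plan back.
        return [{"executed": plan}]


class NamingSketchDataFrame:
    """Hypothetical DataFrame that hands its plan to the session."""

    def __init__(self, plan: Any, session: NamingSketchSession) -> None:
        self._plan = plan
        self._session = session

    def toPandas(self) -> List[dict]:
        # Public API: camelCase, matching the existing PySpark DataFrame API.
        return self._session._to_pandas(self._plan)
```

This keeps the session's public surface unchanged while still giving the DataFrame a way to pass its plan to the session/client.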

Contributor Author

Oh I see, done!

I am still learning how to write Python code for Spark...

Contributor

@itholic itholic left a comment

Looks pretty good

@HyukjinKwon
Member

Merged to master.

@amaliujia amaliujia deleted the SPARK-40645 branch October 5, 2022 04:44