Conversation

@zhengruifeng (Contributor) commented on Sep 19, 2022

What changes were proposed in this pull request?

  1. extract the computation of DataFrame.corr into correlation.py so that it can be reused by DataFrame.corrwith, DataFrameGroupBy.corrwith, DataFrameGroupBy.corr, etc.;
  2. implement spearman and kendall in DataFrame.corrwith (a rough sketch of the idea follows this list);
  3. add parameter axis to DataFrame.corrwith.
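
For item 2, a minimal sketch of the Spearman idea expressed with existing Spark SQL primitives (Spearman correlation is the Pearson correlation of the ranks). This is only an illustration of the approach, not the code added in correlation.py, and it skips the average-rank tie handling a full implementation needs:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, 5), (5, 8), (7, 4), (8, 3)], ["A", "X"])

# Rank each column independently (the toy data has no ties, so plain rank()
# is enough here).
ranked = sdf.select(
    F.rank().over(Window.orderBy("A")).alias("rank_A"),
    F.rank().over(Window.orderBy("X")).alias("rank_X"),
)

# Pearson correlation of the ranks == Spearman correlation of A and X (-0.8).
print(ranked.select(F.corr("rank_A", "rank_X")).head()[0])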

Why are the changes needed?

For API coverage

In [1]: import pyspark.pandas as ps

In [2]: df1 = ps.DataFrame({ "A":[1, 5, 7, 8], "X":[5, 8, 4, 3], "C":[10, 4, 9, 3]})

In [3]: df2 = ps.DataFrame({ "A":[5, 3, 6, 4], "B":[11, 2, 4, 3],  "C":[4, 3, 8, 5]})

In [4]: ps.set_option("compute.ops_on_diff_frames", True)

In [5]: df1.corrwith(df2, method="kendall").sort_index()
A    0.0
B    NaN
C    0.0
X    NaN
dtype: float64

In [6]: df1.to_pandas().corrwith(df2.to_pandas(), method="kendall").sort_index()
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas DataFrame is expected to be small.
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas DataFrame is expected to be small.
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)
Out[6]: 
A    0.0
B    NaN
C    0.0
X    NaN
dtype: float64

In [7]: df1.corrwith(df2.B, method="spearman").sort_index()
Out[7]: 
A   -0.4
C    0.8
X   -0.2
dtype: float64

In [8]: df1.to_pandas().corrwith(df2.B.to_pandas(), method="spearman").sort_index()
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas DataFrame is expected to be small.
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas Series is expected to be small.
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)
Out[8]: 
A   -0.4
C    0.8
X   -0.2
dtype: float64
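
Item 3 of the proposed changes adds the axis parameter for signature compatibility with pandas. A hedged usage sketch with the same frames as above; axis=0 (the pandas default) pairs columns by label, so this should match the call in In [5]:

# axis=0 is the pandas default: correlate matching columns of df1 and df2.
df1.corrwith(df2, axis=0, method="kendall").sort_index()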

Does this PR introduce any user-facing change?

Yes, new correlation methods ('spearman' and 'kendall') are now supported in DataFrame.corrwith.

How was this patch tested?

Added unit tests.

@zhengruifeng (Contributor, Author)

cc @itholic @HyukjinKwon

@zhengruifeng force-pushed the ps_corrwith_spearman_kendall branch from 7f242bf to 2ea62a8 on September 20, 2022 at 02:27
@zhengruifeng (Contributor, Author)

Merged into master, thanks for the review! @HyukjinKwon

@zhengruifeng deleted the ps_corrwith_spearman_kendall branch on September 20, 2022 at 06:51
Review thread on the corrwith docstring change in the diff:

    method : str, default 'pearson'
    Method of correlation, one of:
    method : {'pearson', 'spearman', 'kendall'}
A reviewer (Contributor) commented:

qq: do we also need to implement callable as pandas does?
[screenshot attached]

@zhengruifeng (Contributor, Author) replied:

Good question. I think it's a bit hard to support this kind of callable: it takes two arrays, so we would have to collect all the values in the columns, which does not scale.

Maybe we could support a different callable, Callable[[Column, Column], float], i.e. an aggregation function; that may make sense. I think we need more discussion/thoughts on it.
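
A hedged sketch, in plain PySpark, of what such an aggregation-style callable could look like: the user-supplied function receives two Columns and returns a single aggregate expression, so no column data has to be collected. The helper name corr_agg and the Column return type (rather than float) are illustrative assumptions, not an agreed API:

from typing import Callable

from pyspark.sql import Column, DataFrame, SparkSession
from pyspark.sql import functions as F


def corr_agg(
    sdf: DataFrame,
    left: str,
    right: str,
    method: Callable[[Column, Column], Column],
) -> float:
    # The callable builds one aggregate expression over the two columns,
    # so only a single scalar ever reaches the driver.
    return sdf.select(method(F.col(left), F.col(right))).head()[0]


spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, 5), (5, 8), (7, 4), (8, 3)], ["A", "X"])

# The built-in Pearson correlation works as such an aggregation callable.
print(corr_agg(sdf, "A", "X", F.corr))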
