Conversation

@zhengruifeng (Contributor) commented on Sep 19, 2022

What changes were proposed in this pull request?

  1. extract the computation of DataFrame.corr into correlation.py so that it can be reused by DataFrame.corrwith, DataFrameGroupBy.corrwith, DataFrameGroupBy.corr, etc.;
  2. implement spearman and kendall in DataFrame.corrwith (a rough sketch of the idea follows this list);
  3. add parameter axis to DataFrame.corrwith.
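
For item 2, a minimal sketch of the Spearman idea expressed with existing Spark SQL primitives (Spearman correlation is the Pearson correlation of the ranks). This is only an illustration of the approach, not the code added in correlation.py, and it skips the average-rank tie handling a full implementation needs:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, 5), (5, 8), (7, 4), (8, 3)], ["A", "X"])

# Rank each column independently (the toy data has no ties, so plain rank()
# is enough here).
ranked = sdf.select(
    F.rank().over(Window.orderBy("A")).alias("rank_A"),
    F.rank().over(Window.orderBy("X")).alias("rank_X"),
)

# Pearson correlation of the ranks == Spearman correlation of A and X (-0.8).
print(ranked.select(F.corr("rank_A", "rank_X")).head()[0])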

Why are the changes needed?

For API coverage

In [1]: import pyspark.pandas as ps

In [2]: df1 = ps.DataFrame({ "A":[1, 5, 7, 8], "X":[5, 8, 4, 3], "C":[10, 4, 9, 3]})

In [3]: df2 = ps.DataFrame({ "A":[5, 3, 6, 4], "B":[11, 2, 4, 3],  "C":[4, 3, 8, 5]})

In [4]: ps.set_option("compute.ops_on_diff_frames", True)

In [5]: df1.corrwith(df2, method="kendall").sort_index()
A    0.0
B    NaN
C    0.0
X    NaN
dtype: float64

In [6]: df1.to_pandas().corrwith(df2.to_pandas(), method="kendall").sort_index()
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas DataFrame is expected to be small.
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas DataFrame is expected to be small.
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)
Out[6]: 
A    0.0
B    NaN
C    0.0
X    NaN
dtype: float64

In [7]: df1.corrwith(df2.B, method="spearman").sort_index()
Out[7]: 
A   -0.4
C    0.8
X   -0.2
dtype: float64

In [8]: df1.to_pandas().corrwith(df2.B.to_pandas(), method="spearman").sort_index()
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas DataFrame is expected to be small.
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas Series is expected to be small.
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)
Out[8]: 
A   -0.4
C    0.8
X   -0.2
dtype: float64
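
Item 3 of the proposed changes adds the axis parameter for signature compatibility with pandas. A hedged usage sketch with the same frames as above; axis=0 (the pandas default) pairs columns by label, so this should match the call in In [5]:

# axis=0 is the pandas default: correlate matching columns of df1 and df2.
df1.corrwith(df2, axis=0, method="kendall").sort_index()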

Does this PR introduce any user-facing change?

Yes, new correlation methods ('spearman' and 'kendall') are now supported in DataFrame.corrwith.

How was this patch tested?

Added unit tests.

@zhengruifeng (Contributor, Author)

cc @itholic @HyukjinKwon

@zhengruifeng force-pushed the ps_corrwith_spearman_kendall branch from 7f242bf to 2ea62a8 on September 20, 2022 at 02:27
@zhengruifeng (Contributor, Author)

Merged into master, thanks for the review! @HyukjinKwon

@zhengruifeng deleted the ps_corrwith_spearman_kendall branch on September 20, 2022 at 06:51
Review thread on the corrwith docstring change in the diff:

    method : str, default 'pearson'
    Method of correlation, one of:
    method : {'pearson', 'spearman', 'kendall'}
A reviewer (Contributor) commented:

qq: do we also need to implement callable as pandas does?
[screenshot attached]

@zhengruifeng (Contributor, Author) replied:

Good question. I think it's a bit hard to support this kind of callable: it takes two arrays, so we would have to collect all the values in the columns, which does not scale.

Maybe we could support a different callable, Callable[[Column, Column], float], i.e. an aggregation function; that may make sense. I think we need more discussion/thoughts on it.
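
A hedged sketch, in plain PySpark, of what such an aggregation-style callable could look like: the user-supplied function receives two Columns and returns a single aggregate expression, so no column data has to be collected. The helper name corr_agg and the Column return type (rather than float) are illustrative assumptions, not an agreed API:

from typing import Callable

from pyspark.sql import Column, DataFrame, SparkSession
from pyspark.sql import functions as F


def corr_agg(
    sdf: DataFrame,
    left: str,
    right: str,
    method: Callable[[Column, Column], Column],
) -> float:
    # The callable builds one aggregate expression over the two columns,
    # so only a single scalar ever reaches the driver.
    return sdf.select(method(F.col(left), F.col(right))).head()[0]


spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, 5), (5, 8), (7, 4), (8, 3)], ["A", "X"])

# The built-in Pearson correlation works as such an aggregation callable.
print(corr_agg(sdf, "A", "X", F.corr))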
