Skip to content

Conversation

@zhengruifeng
Copy link
Contributor

@zhengruifeng zhengruifeng commented Apr 2, 2022

What changes were proposed in this pull request?

Implement Series.autocorr

Why are the changes needed?

for API coverage

Does this PR introduce any user-facing change?

yes, Series now support function autocorr

In [86]: s = pd.Series([.2, .0, .6, .2, np.nan, .5, .6])

In [87]: s.autocorr()
Out[87]: -0.14121975762272054

How was this patch tested?

added doctest

@zhengruifeng
Copy link
Contributor Author

zhengruifeng commented Apr 2, 2022

check against the Pandas side:

In [86]: s = pd.Series([.2, .0, .6, .2, np.nan, .5, .6])

In [87]: s.autocorr()
Out[87]: -0.14121975762272054

In [88]: s.autocorr(0)
Out[88]: 1.0

In [89]: s.autocorr(2)
Out[89]: 0.9707253433941511

In [90]: s.autocorr(-3)
Out[90]: 0.2773500981126146

In [91]: s.autocorr(5)
Out[91]: -0.9999999999999998

In [92]: s.autocorr(6)
/home/zrf/.zrf/anaconda3/lib/python3.9/site-packages/numpy/lib/function_base.py:2683: RuntimeWarning: Degrees of freedom <= 0 for slice
  c = cov(x, y, rowvar, dtype=dtype)
/home/zrf/.zrf/anaconda3/lib/python3.9/site-packages/numpy/lib/function_base.py:2542: RuntimeWarning: divide by zero encountered in true_divide
  c *= np.true_divide(1, fact)
Out[92]: nan



@HyukjinKwon
Copy link
Member

Thanks for working on this @zhengruifeng ! cc @ueshin @xinrong-databricks @itholic FYI!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: should we define the column names in variables which are reused throughout the method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, will update soon

Copy link
Contributor

@awdavidson awdavidson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to add some basic tests around this?

@xinrong-meng
Copy link
Member

Thanks @zhengruifeng!

https://github.com/apache/spark/blob/master/python/pyspark/pandas/tests/test_series.py is a good place to add tests.

It would be great to specify what changes in Does this PR introduce any user-facing change? section of the PR description. An example is good enough.

@zhengruifeng
Copy link
Contributor Author

@xinrong-databricks Will add the tests and update the PR description, thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add .. versionadded:: 3.4.0?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's document the .. versionadded:: 3.4.0 here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Copy link
Contributor

@awdavidson awdavidson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM

@zhengruifeng
Copy link
Contributor Author

cc @HyukjinKwon , I think this PR is ready too

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should also add a note about the global window operation and its performance impact.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise

@zhengruifeng zhengruifeng force-pushed the pandas_series_autocorr branch from 90ca30c to 78daa69 Compare April 13, 2022 07:28
@HyukjinKwon
Copy link
Member

Build link: https://github.com/zhengruifeng/spark/runs/6003246221

@zhengruifeng
Copy link
Contributor Author

all tests passed

@HyukjinKwon
Copy link
Member

Merged to master.

@zhengruifeng
Copy link
Contributor Author

Thanks all for reviewing!

@zhengruifeng zhengruifeng deleted the pandas_series_autocorr branch April 13, 2022 09:18
zhengruifeng added a commit that referenced this pull request Oct 13, 2022
… consistent with Pandas

### What changes were proposed in this pull request?
in `Series.autocorr`, rename `periods` as `lag`

### Why are the changes needed?
when implementing the `Series.autocorr` in my first PS PR #36048 , I wrongly follow the parameter name `min_periods` in `Series.corr`, it should be `lag` to be the same with [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.Series.autocorr.html)

### Does this PR introduce _any_ user-facing change?
no, since 3.4 is not released

### How was this patch tested?
existing UTs

Closes #38216 from zhengruifeng/ps_ser_autocorr_rename_parameter.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
… consistent with Pandas

### What changes were proposed in this pull request?
in `Series.autocorr`, rename `periods` as `lag`

### Why are the changes needed?
when implementing the `Series.autocorr` in my first PS PR apache#36048 , I wrongly follow the parameter name `min_periods` in `Series.corr`, it should be `lag` to be the same with [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.Series.autocorr.html)

### Does this PR introduce _any_ user-facing change?
no, since 3.4 is not released

### How was this patch tested?
existing UTs

Closes apache#38216 from zhengruifeng/ps_ser_autocorr_rename_parameter.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants