-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-11057] [SQL] Add correlation and covariance matrices #9366
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @mengxr |
|
Test build #44651 has finished for PR 9366 at commit
|
|
Hi guys, would you share your thoughts about this ? |
|
In general I think that currently there are some issues in the StatFunctions.scala: It seems that all computations both for covariance and correlation are being accomplished in one place which makes it a little confusing and harder to extend for the future. collectStatisticalData method is called for both correlation and covariance and even if I call something like this: Here is an example: I think we can actually separate the computations. Is there a reason why these computations are being accomplished in one place ? @rxin, @mengxr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't assume all columns are of numeric type. Catch exception here and use null as value if exception happens?
|
Hi @sun-rui, |
|
what do you think ? |
|
Yes, since R throws error message in this case, we can leave exception un-handled. No need to verify all column types. User will get exception message at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala#L81 |
|
yes, there is even a test case which covers that case. |
|
can someone from Spark SQL committers or experts also look at this ? |
|
Test build #53308 has finished for PR 9366 at commit
|
|
Test build #56142 has finished for PR 9366 at commit
|
|
cc @mengxr |
|
I have been trying to use correlation on a matrix with many columns. @NarineK menthioned R like correlation. I wish we had something like what pandas offers. It handles missing data automatically. Take a look here. Even the corr() function from MLlib can not handle missing data. These features are really missing from SparkSQL:
|
|
@NarineK Are you still working on this? cc @yanboliang |
|
We are closing it due to inactivity. please do reopen if you want to push it forward. Thanks! |
## What changes were proposed in this pull request? This PR proposes to close stale PRs, mostly the same instances with apache#18017 I believe the author in apache#14807 removed his account. Closes apache#7075 Closes apache#8927 Closes apache#9202 Closes apache#9366 Closes apache#10861 Closes apache#11420 Closes apache#12356 Closes apache#13028 Closes apache#13506 Closes apache#14191 Closes apache#14198 Closes apache#14330 Closes apache#14807 Closes apache#15839 Closes apache#16225 Closes apache#16685 Closes apache#16692 Closes apache#16995 Closes apache#17181 Closes apache#17211 Closes apache#17235 Closes apache#17237 Closes apache#17248 Closes apache#17341 Closes apache#17708 Closes apache#17716 Closes apache#17721 Closes apache#17937 Added: Closes apache#14739 Closes apache#17139 Closes apache#17445 Closes apache#18042 Closes apache#18359 Added: Closes apache#16450 Closes apache#16525 Closes apache#17738 Added: Closes apache#16458 Closes apache#16508 Closes apache#17714 Added: Closes apache#17830 Closes apache#14742 ## How was this patch tested? N/A Author: hyukjinkwon <[email protected]> Closes apache#18417 from HyukjinKwon/close-stale-pr.
Hi there,
As we know R has the option to calculate the correlation and covariance for all columns of a dataframe or between columns of two dataframes.
If we look at apache math package we can see that, they have that too.
http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html#computeCorrelationMatrix%28org.apache.commons.math3.linear.RealMatrix%29
In case we have as input only one DataFrame:
for correlation:
cor[i,j] = cor[j,i]
and for the main diagonal we can have 1s.
for covariance:
cov[i,j] = cov[j,i]
and for main diagonal: we can compute the variance for that specific column:
See:
http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/Covariance.html#computeCovarianceMatrix%28org.apache.commons.math3.linear.RealMatrix%29
Thanks,
Narine