[SPARK-11057] [SQL] Add correlation and covariance matrices #9366

NarineK · 2015-10-30T00:28:56Z

Hi there,

As we know, R can calculate the correlation and covariance for all columns of a data frame, or between the columns of two data frames.

The Apache Commons Math package offers this as well:
http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html#computeCorrelationMatrix%28org.apache.commons.math3.linear.RealMatrix%29

When the input is a single DataFrame:

For correlation:
cor[i,j] = cor[j,i]
and the main diagonal is all 1s.

For covariance:
cov[i,j] = cov[j,i]
and each main-diagonal entry is the variance of the corresponding column.
See:
http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/Covariance.html#computeCovarianceMatrix%28org.apache.commons.math3.linear.RealMatrix%29
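The symmetry described above means only the upper triangle has to be computed and can then be mirrored. Below is a minimal plain-Scala sketch of that idea. It is not the code from this patch: the object name `SymmetricStatMatrix` and its methods are hypothetical, columns are assumed to be in-memory arrays of equal length, and the sample covariance (dividing by n - 1) is assumed.

```scala
object SymmetricStatMatrix {
  // Sample covariance of two equally sized columns (divides by n - 1).
  def cov(x: Array[Double], y: Array[Double]): Double = {
    require(x.length == y.length && x.length > 1, "need two columns of equal length > 1")
    val meanX = x.sum / x.length
    val meanY = y.sum / y.length
    var acc = 0.0
    for (k <- x.indices) acc += (x(k) - meanX) * (y(k) - meanY)
    acc / (x.length - 1)
  }

  // Covariance matrix: compute the upper triangle (including the diagonal,
  // which holds each column's variance) and mirror it.
  def covMatrix(cols: Array[Array[Double]]): Array[Array[Double]] = {
    val n = cols.length
    val m = Array.ofDim[Double](n, n)
    for (i <- 0 until n; j <- i until n) {
      val c = cov(cols(i), cols(j))
      m(i)(j) = c
      m(j)(i) = c // cov[i,j] = cov[j,i]
    }
    m
  }

  // Correlation matrix: the diagonal is set to 1.0 directly, and the
  // off-diagonal entries are derived from the covariance matrix.
  // A constant column has zero variance and yields NaN off the diagonal.
  def corMatrix(cols: Array[Array[Double]]): Array[Array[Double]] = {
    val n = cols.length
    val c = covMatrix(cols)
    val m = Array.ofDim[Double](n, n)
    for (i <- 0 until n) {
      m(i)(i) = 1.0
      for (j <- i + 1 until n) {
        val r = c(i)(j) / math.sqrt(c(i)(i) * c(j)(j))
        m(i)(j) = r
        m(j)(i) = r // cor[i,j] = cor[j,i]
      }
    }
    m
  }
}
```

For n columns this evaluates n(n+1)/2 pairs instead of n^2.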

Thanks,
Narine

NarineK · 2015-10-30T00:32:49Z

@shivaram, @rxin, could you please take a look at this?
Thanks!

shivaram · 2015-10-30T02:09:01Z

cc @mengxr

SparkQA · 2015-10-30T02:42:08Z

Test build #44651 has finished for PR 9366 at commit 74bdf54.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

NarineK · 2015-11-05T06:48:29Z

Hi guys, would you share your thoughts about this?
Thanks!

NarineK · 2015-11-09T19:52:54Z

In general, I think there are currently some issues in StatFunctions.scala:

All computations for both covariance and correlation happen in one place, which makes the code a little confusing and harder to extend in the future.

The collectStatisticalData method is called for both correlation and covariance, so even if I call something like this:
df.stat.corr("numeric_colname", "string_colname")
I get an error like this:
java.lang.IllegalArgumentException: requirement failed: Covariance calculation for columns with dataType StringType not supported.

Here is another example.
These two variables are updated every time we compute covariance, but they are only used for correlation:
var MkX = 0.0 // sum of squares of differences from the (current) mean for col1
var MkY = 0.0 // sum of squares of differences from the (current) mean for col2

I think we can actually separate the computations. Is there a reason why these computations are done in one place? @rxin, @mengxr
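One way the separation suggested above could look is sketched below in plain Scala, using a standard one-pass (Welford-style) update. This is an illustration only, not the existing StatFunctions.scala code: the names `OnlineStats`, `CovAccumulator`, `CorrAccumulator`, and `updateExtra` are hypothetical. The covariance accumulator tracks only the count, the running means, and the co-moment; the correlation accumulator adds `mkX` and `mkY` on top.

```scala
object OnlineStats {
  // Tracks only what covariance needs: count, running means, co-moment.
  class CovAccumulator {
    protected var n = 0L
    protected var meanX = 0.0
    protected var meanY = 0.0
    protected var ck = 0.0 // running sum of (x - meanX) * (y - meanY)

    def add(x: Double, y: Double): Unit = {
      n += 1
      val dx = x - meanX // deviation from the previous mean of col1
      val dy = y - meanY // deviation from the previous mean of col2
      meanX += dx / n
      meanY += dy / n
      ck += dx * (y - meanY) // uses the updated mean of col2
      updateExtra(x, y, dx, dy)
    }

    // Hook for subclasses that need more state; a no-op for covariance.
    protected def updateExtra(x: Double, y: Double, dx: Double, dy: Double): Unit = ()

    // Sample covariance; NaN when fewer than two rows were added.
    def cov: Double = ck / (n - 1)
  }

  // Adds the two sums of squared differences that only correlation uses.
  class CorrAccumulator extends CovAccumulator {
    private var mkX = 0.0 // sum of squared differences from the mean, col1
    private var mkY = 0.0 // sum of squared differences from the mean, col2

    override protected def updateExtra(x: Double, y: Double, dx: Double, dy: Double): Unit = {
      mkX += dx * (x - meanX)
      mkY += dy * (y - meanY)
    }

    // Pearson correlation; NaN if either column is constant.
    def corr: Double = ck / math.sqrt(mkX * mkY)
  }
}
```

With this split, a covariance-only call never touches `mkX` or `mkY`, and each accumulator could report errors in its own terms.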

sun-rui · 2015-11-16T08:32:51Z

sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala

You can't assume all columns are of numeric type. Catch the exception here and use null as the value if an exception happens?

NarineK · 2015-11-16T13:55:50Z

Hi @sun-rui,
Thank you for your comment. In general, I think it might be better to verify all column types and make sure that we are dealing with numeric fields. If any of the fields isn't numeric, we can show an error message, similar to R:
cor(iris)
Error in cor(iris) : 'x' must be numeric
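An up-front check like the one R performs could be sketched as follows. This is a plain-Scala illustration under stated assumptions, not Spark code: the schema is modelled with a hypothetical `Field` case class and `ColType` hierarchy instead of Spark's own types, and `requireAllNumeric` is an invented helper name.

```scala
object NumericCheck {
  // Hypothetical stand-ins for column data types.
  sealed trait ColType
  case object DoubleCol extends ColType
  case object IntCol extends ColType
  case object StringCol extends ColType

  final case class Field(name: String, tpe: ColType)

  // Fails fast, before any computation, listing every non-numeric column.
  def requireAllNumeric(schema: Seq[Field]): Unit = {
    val nonNumeric = schema.filter(_.tpe == StringCol).map(_.name)
    require(
      nonNumeric.isEmpty,
      s"all columns must be numeric, but found non-numeric: ${nonNumeric.mkString(", ")}"
    )
  }
}
```

Because `require` throws `java.lang.IllegalArgumentException`, the failure has the same exception type as the existing "requirement failed" message quoted earlier in this thread.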

NarineK · 2015-11-16T13:56:19Z

What do you think?

sun-rui · 2015-11-17T07:22:01Z

Yes, since R throws an error in this case, we can leave the exception unhandled. There is no need to verify all column types. The user will get the exception message raised at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala#L81

NarineK · 2015-11-17T16:21:36Z

Yes, there is even a test case that covers that case.

NarineK · 2015-11-17T16:26:20Z

Can one of the Spark SQL committers or experts also look at this?

SparkQA · 2016-03-16T09:16:44Z

Test build #53308 has finished for PR 9366 at commit 74bdf54.

This patch fails R style tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2016-04-18T21:46:53Z

Test build #56142 has finished for PR 9366 at commit 74bdf54.

This patch fails R style tests.
This patch does not merge cleanly.
This patch adds no public classes.

shivaram · 2016-04-19T17:28:18Z

cc @mengxr

sjjpo2002 · 2016-04-22T21:55:23Z

I have been trying to use correlation on a matrix with many columns. @NarineK mentioned R-like correlation. I wish we had something like what pandas offers. It handles missing data automatically. Take a look here. Even the corr() function from MLlib cannot handle missing data. These features are really missing from Spark SQL:

Apply correlation on all columns and return a matrix
Handle missing data automatically like how pandas does
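For reference, pandas computes each pairwise correlation while excluding NA/null values. A minimal plain-Scala sketch of that pairwise-complete behaviour follows. It is an illustration, not an existing Spark or MLlib API: the object `PairwiseCorr` is hypothetical, and missing values are assumed to be modelled as `None`.

```scala
object PairwiseCorr {
  // Pearson correlation over the rows where both values are present.
  // Returns NaN when fewer than two complete pairs remain.
  def corr(x: Seq[Option[Double]], y: Seq[Option[Double]]): Double = {
    require(x.length == y.length, "columns must have equal length")
    // Keep only rows where neither value is missing.
    val pairs = x.zip(y).collect { case (Some(a), Some(b)) => (a, b) }
    if (pairs.length < 2) Double.NaN
    else {
      val n = pairs.length
      val meanX = pairs.map(_._1).sum / n
      val meanY = pairs.map(_._2).sum / n
      val sxy = pairs.map { case (a, b) => (a - meanX) * (b - meanY) }.sum
      val sxx = pairs.map { case (a, _) => (a - meanX) * (a - meanX) }.sum
      val syy = pairs.map { case (_, b) => (b - meanY) * (b - meanY) }.sum
      sxy / math.sqrt(sxx * syy)
    }
  }
}
```

Applied to every pair of columns, this gives a full matrix in which each entry may be based on a different subset of rows.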

gatorsmile · 2017-06-13T16:01:13Z

@NarineK Are you still working on this? cc @yanboliang

gatorsmile · 2017-06-27T06:39:46Z

We are closing it due to inactivity. Please do reopen if you want to push it forward. Thanks!

## What changes were proposed in this pull request?

This PR proposes to close stale PRs, mostly the same instances with apache#18017

I believe the author in apache#14807 removed his account.

Closes apache#7075
Closes apache#8927
Closes apache#9202
Closes apache#9366
Closes apache#10861
Closes apache#11420
Closes apache#12356
Closes apache#13028
Closes apache#13506
Closes apache#14191
Closes apache#14198
Closes apache#14330
Closes apache#14807
Closes apache#15839
Closes apache#16225
Closes apache#16685
Closes apache#16692
Closes apache#16995
Closes apache#17181
Closes apache#17211
Closes apache#17235
Closes apache#17237
Closes apache#17248
Closes apache#17341
Closes apache#17708
Closes apache#17716
Closes apache#17721
Closes apache#17937

Added:
Closes apache#14739
Closes apache#17139
Closes apache#17445
Closes apache#18042
Closes apache#18359

Added:
Closes apache#16450
Closes apache#16525
Closes apache#17738

Added:
Closes apache#16458
Closes apache#16508
Closes apache#17714

Added:
Closes apache#17830
Closes apache#14742

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes apache#18417 from HyukjinKwon/close-stale-pr.

Initial commit for correlation and covariance matrices

74bdf54

shivaram mentioned this pull request Nov 13, 2015

[SPARK-11715][SPARKR] Add R support corr for Column Aggregration #9680

Closed

sun-rui reviewed Nov 16, 2015

HyukjinKwon mentioned this pull request Jun 25, 2017

[INFRA] Close stale PRs #18417

Closed

asfgit closed this in b32bd00 Jun 27, 2017

6 participants