Skip to content

Conversation

@NarineK
Copy link
Contributor

@NarineK NarineK commented Oct 30, 2015

Hi there,

As we know R has the option to calculate the correlation and covariance for all columns of a dataframe or between columns of two dataframes.

If we look at apache math package we can see that, they have that too.
http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html#computeCorrelationMatrix%28org.apache.commons.math3.linear.RealMatrix%29

In case we have as input only one DataFrame:


for correlation:
cor[i,j] = cor[j,i]
and for the main diagonal we can have 1s.


for covariance:
cov[i,j] = cov[j,i]
and for main diagonal: we can compute the variance for that specific column:
See:
http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/Covariance.html#computeCovarianceMatrix%28org.apache.commons.math3.linear.RealMatrix%29

Thanks,
Narine

@NarineK
Copy link
Contributor Author

NarineK commented Oct 30, 2015

@shivaram , @rxin , would you guys, please, take a look at this ?
Thanks!

@shivaram
Copy link
Contributor

cc @mengxr

@SparkQA
Copy link

SparkQA commented Oct 30, 2015

Test build #44651 has finished for PR 9366 at commit 74bdf54.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@NarineK
Copy link
Contributor Author

NarineK commented Nov 5, 2015

Hi guys, would you share your thoughts about this ?
Thanks!

@NarineK
Copy link
Contributor Author

NarineK commented Nov 9, 2015

In general I think that currently there are some issues in the StatFunctions.scala:

It seems that all computations both for covariance and correlation are being accomplished in one place which makes it a little confusing and harder to extend for the future.

collectStatisticalData method is called for both correlation and covariance and even if I call something like this:
df.stats.corr("numeric_colame", "string_colname")
I get an error like this:
java.lang.IllegalArgumentException: requirement failed: Covariance calculation for columns with dataType StringType not supported.

Here is an example:
These 2 variables are being computed each time when we compute covariance, however, are being used only for correlation:
var MkX = 0.0 // sum of squares of differences from the (current) mean for col1
var MkY = 0.0 // sum of squares of differences from the (current) mean for col2

I think we can actually separate the computations. Is there a reason why these computations are being accomplished in one place ? @rxin, @mengxr

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't assume all columns are of numeric type. Catch exception here and use null as value if exception happens?

@NarineK
Copy link
Contributor Author

NarineK commented Nov 16, 2015

Hi @sun-rui,
thank you for your comment. In general, I think that, it might be better to verify all columns types and make sure that we are dealing with numeric fields. if any of the fields isn't numeric we can show an error message, similar to R.
cor(iris)
Error in cor(iris) : 'x' must be numeric

@NarineK
Copy link
Contributor Author

NarineK commented Nov 16, 2015

what do you think ?

@sun-rui
Copy link
Contributor

sun-rui commented Nov 17, 2015

Yes, since R throws error message in this case, we can leave exception un-handled. No need to verify all column types. User will get exception message at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala#L81

@NarineK
Copy link
Contributor Author

NarineK commented Nov 17, 2015

yes, there is even a test case which covers that case.

@NarineK
Copy link
Contributor Author

NarineK commented Nov 17, 2015

can someone from Spark SQL committers or experts also look at this ?

@SparkQA
Copy link

SparkQA commented Mar 16, 2016

Test build #53308 has finished for PR 9366 at commit 74bdf54.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 18, 2016

Test build #56142 has finished for PR 9366 at commit 74bdf54.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@shivaram
Copy link
Contributor

cc @mengxr

@sjjpo2002
Copy link

sjjpo2002 commented Apr 22, 2016

I have been trying to use correlation on a matrix with many columns. @NarineK menthioned R like correlation. I wish we had something like what pandas offers. It handles missing data automatically. Take a look here. Even the corr() function from MLlib can not handle missing data. These features are really missing from SparkSQL:

  • Apply correlation on all columns and return a matrix
  • Handle missing data automatically like how pandas does

@gatorsmile
Copy link
Member

@NarineK Are you still working on this? cc @yanboliang

@gatorsmile
Copy link
Member

We are closing it due to inactivity. please do reopen if you want to push it forward. Thanks!

@asfgit asfgit closed this in b32bd00 Jun 27, 2017
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
## What changes were proposed in this pull request?

This PR proposes to close stale PRs, mostly the same instances with apache#18017

I believe the author in apache#14807 removed his account.

Closes apache#7075
Closes apache#8927
Closes apache#9202
Closes apache#9366
Closes apache#10861
Closes apache#11420
Closes apache#12356
Closes apache#13028
Closes apache#13506
Closes apache#14191
Closes apache#14198
Closes apache#14330
Closes apache#14807
Closes apache#15839
Closes apache#16225
Closes apache#16685
Closes apache#16692
Closes apache#16995
Closes apache#17181
Closes apache#17211
Closes apache#17235
Closes apache#17237
Closes apache#17248
Closes apache#17341
Closes apache#17708
Closes apache#17716
Closes apache#17721
Closes apache#17937

Added:
Closes apache#14739
Closes apache#17139
Closes apache#17445
Closes apache#18042
Closes apache#18359

Added:
Closes apache#16450
Closes apache#16525
Closes apache#17738

Added:
Closes apache#16458
Closes apache#16508
Closes apache#17714

Added:
Closes apache#17830
Closes apache#14742

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes apache#18417 from HyukjinKwon/close-stale-pr.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants