Skip to content

Conversation

@dorx
Copy link
Contributor

@dorx dorx commented Aug 2, 2014

No description provided.

@SparkQA
Copy link

SparkQA commented Aug 2, 2014

QA tests have started for PR 1733. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17744/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 2, 2014

QA results for PR 1733:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17744/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 6, 2014

QA tests have started for PR 1733. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17974/consoleFull

@dorx
Copy link
Contributor Author

dorx commented Aug 6, 2014

@mengxr @jkbradley @falaki
PR ready for review now.

@SparkQA
Copy link

SparkQA commented Aug 6, 2014

QA tests have started for PR 1733. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17975/consoleFull

@mengxr
Copy link
Contributor

mengxr commented Aug 6, 2014

remove space between @ and jkbradley ~ :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we call it chiSqTest (following R's)? We need test in the method name because X_2 is also a distribution. I feel chiSqTest may be better than chiSquaredTest because it is also called chi-square test without d.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chiSqTest sounds good.

@SparkQA
Copy link

SparkQA commented Aug 6, 2014

QA results for PR 1733:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17974/consoleFull

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

documentation

@SparkQA
Copy link

SparkQA commented Aug 6, 2014

QA results for PR 1733:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17975/consoleFull

@mengxr
Copy link
Contributor

mengxr commented Aug 6, 2014

I think we should either allow user to input the raw observations or use Map[_, Long] for input frequencies. I'm going to take a look at R's implementation ...

@mengxr
Copy link
Contributor

mengxr commented Aug 6, 2014

@dorx I checked R's implementation and finally figured out what is going on.

  1. When only a vector x is given, it is treated as a vector containing frequency counts for categories and tested against multinomial distribution.
  2. When a matrix x is given, it is treated as a contingency table and the test is for independence.
  3. When both x and y are given, both vectors are treated as factors (categorical values) and the test is for independence.

I want to suggest the following APIs:

// test observed frequencies against multinomial distribution with
// `p = (1/n, 1/n, ..., 1/n)`
def chiSqTest(counts: Vector)

// test observed frequencies against the given multinomial distribution
def chiSqTest(counts: Vector, p: Vector)

// test independence using the given contingency table 
def chiSqTest(counts: Matrix)

// test independence using the given observed pairs (assuming categorical values)
def chiSqTest[V1, V2](observations: RDD[(V1, V2)])

@mengxr
Copy link
Contributor

mengxr commented Aug 6, 2014

The previous proposal may be hard to implement in Python. Another solution would be separate goodness-of-fit test from independence test, e.g., chiSqGofTest and chiSqIndTest.

def chiSqGofTest(counts: Vector)

def chiSqGofTest(counts: Vector, p: Vector)

def chiSqIndTest(counts: Matrix)

def chiSqIndTest[V1, V2](observations: RDD[(V1, V2)])

We can also add direct RDD support, which may be unnecessary:

def chiSqGofTest[V](observations: RDD[V], p: Map[V, Double])

Since we only support pearson, we can hide method in the public API for now.

@SparkQA
Copy link

SparkQA commented Aug 8, 2014

QA tests have started for PR 1733. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18217/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 8, 2014

QA results for PR 1733:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18217/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 8, 2014

QA tests have started for PR 1733. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18226/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 8, 2014

QA results for PR 1733:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18226/consoleFull

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be initialized as a null

@SparkQA
Copy link

SparkQA commented Aug 11, 2014

QA tests have started for PR 1733. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18333/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 11, 2014

QA results for PR 1733:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18333/consoleFull

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p-value should be 0 here (strongly against the null)

@SparkQA
Copy link

SparkQA commented Aug 11, 2014

QA tests have started for PR 1733. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18340/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 12, 2014

QA results for PR 1733:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18340/consoleFull

asfgit pushed a commit that referenced this pull request Aug 12, 2014
Author: Doris Xin <[email protected]>

Closes #1733 from dorx/chisquare and squashes the following commits:

cafb3a7 [Doris Xin] fixed p-value for extreme case.
d286783 [Doris Xin] Merge branch 'master' into chisquare
e95e485 [Doris Xin] reviewer comments.
7dde711 [Doris Xin] ChiSqTestResult renaming and changed to Class
80d03e2 [Doris Xin] Reviewer comments.
c39eeb5 [Doris Xin] units passed with updated API
e90d90a [Doris Xin] Merge branch 'master' into chisquare
7eea80b [Doris Xin] WIP
d64c2fb [Doris Xin] Merge branch 'master' into chisquare
5686082 [Doris Xin] facelift
bc7eb2e [Doris Xin] unit passed; still need docs and some refactoring
50703a5 [Doris Xin] merge master
4e4e361 [Doris Xin] WIP
e6b83f3 [Doris Xin] reviewer comments
3d61582 [Doris Xin] input names
706d436 [Doris Xin] Added API for RDD[Vector]
6598379 [Doris Xin] API and code structure.
ff17423 [Doris Xin] WIP

(cherry picked from commit 32638b5)
Signed-off-by: Xiangrui Meng <[email protected]>
@mengxr
Copy link
Contributor

mengxr commented Aug 12, 2014

LGTM. Merged into both master and branch-1.1. Thanks!

@asfgit asfgit closed this in 32638b5 Aug 12, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
Author: Doris Xin <[email protected]>

Closes apache#1733 from dorx/chisquare and squashes the following commits:

cafb3a7 [Doris Xin] fixed p-value for extreme case.
d286783 [Doris Xin] Merge branch 'master' into chisquare
e95e485 [Doris Xin] reviewer comments.
7dde711 [Doris Xin] ChiSqTestResult renaming and changed to Class
80d03e2 [Doris Xin] Reviewer comments.
c39eeb5 [Doris Xin] units passed with updated API
e90d90a [Doris Xin] Merge branch 'master' into chisquare
7eea80b [Doris Xin] WIP
d64c2fb [Doris Xin] Merge branch 'master' into chisquare
5686082 [Doris Xin] facelift
bc7eb2e [Doris Xin] unit passed; still need docs and some refactoring
50703a5 [Doris Xin] merge master
4e4e361 [Doris Xin] WIP
e6b83f3 [Doris Xin] reviewer comments
3d61582 [Doris Xin] input names
706d436 [Doris Xin] Added API for RDD[Vector]
6598379 [Doris Xin] API and code structure.
ff17423 [Doris Xin] WIP
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants