[SPARK-8598] [MLlib] Implementation of 1-sample, two-sided, Kolmogorov Smirnov Test for RDDs #6994

josepablocam · 2015-06-24T19:57:11Z

This contribution is my original work and I license it to the project under its open source license.

sryza · 2015-06-24T20:00:38Z

mllib/src/main/scala/org/apache/spark/mllib/stat/test/KSTest.scala

Alphabetize the imports in here

sryza · 2015-06-24T20:25:12Z

mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala

I think I'd leave this API out on the first pass, as, while there are definitely situations where it's useful, it's likely to be confusing to users. We can always add it in later if there's demand.

sryza · 2015-06-25T05:41:29Z

jenkins, test this please

sryza · 2015-06-25T05:43:54Z

mllib/src/main/scala/org/apache/spark/mllib/stat/test/KSTest.scala

nit: "data" instead of "dat"

sryza · 2015-06-26T00:55:37Z

mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala

Mention what distributions are supported

sryza · 2015-06-26T01:09:53Z

mllib/src/main/scala/org/apache/spark/mllib/stat/test/KSTest.scala

Can you include a note on how this is implemented?

Should also copy this paragraph to the public API doc. Otherwise, users won't see it. Use @see for the wiki link.

mengxr · 2015-07-09T06:30:19Z

mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala

.However -> . However

ECDF is not defined. This is not a standard term in statistics. empirical CDF is fine.
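For reference, the empirical CDF mentioned in this comment and the statistic a one-sample, two-sided Kolmogorov-Smirnov test computes have standard textbook definitions, written out below. The symbols $F_n$, $F$, $x_{(i)}$ and $D_n$ are notation chosen for this note, not names taken from the patch, and the second form of $D_n$ assumes a continuous reference CDF $F$.

```latex
% Empirical CDF of a sample x_1, ..., x_n
F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i \le x\}

% Two-sided one-sample KS statistic against a reference CDF F
D_n = \sup_x \left| F_n(x) - F(x) \right|

% Equivalent form over the sorted sample x_{(1)} \le \dots \le x_{(n)}
D_n = \max_{1 \le i \le n} \max\left( \frac{i}{n} - F(x_{(i)}),\; F(x_{(i)}) - \frac{i-1}{n} \right)
```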

SparkQA · 2015-07-09T20:23:30Z

Test build #36961 has finished for PR 6994 at commit 1f56371.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-09T20:40:35Z

Test build #36962 has finished for PR 6994 at commit 0d0c201.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2015-07-09T20:46:02Z

LGTM, except that the KSTestResult vs. KolmogorovSmirnovTestResult question is still pending. I don't have a strong preference between KS and KolmogorovSmirnov, but I think we should keep the naming consistent between the method name and the result type. @srowen?

srowen · 2015-07-10T18:41:03Z

@mengxr @josepablocam Yes, let's be consistent for sure. I prefer spelling out KolmogorovSmirnov myself, unless there is a strong convention elsewhere for "KS", and I don't think there is. It might make it that much more recognizable to anyone skimming the API javadoc. So rename KSTestResult?

josepablocam · 2015-07-10T19:10:50Z

@srowen Sounds good. I'll make the changes to KolmogorovSmirnovTestResult as appropriate. Thanks

SparkQA · 2015-07-11T01:35:03Z

Test build #37070 has finished for PR 6994 at commit bbb30b1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2015-07-11T03:59:21Z

Merged into master. Thanks!

…v Smirnov Test for RDDs

This contribution is my original work and I license it to the project under its open source license.

Author: jose.cambronero <[email protected]>

Closes #6994 from josepablocam/master and squashes the following commits:

bbb30b1 [jose.cambronero] renamed KSTestResult to KolmogorovSmirnovTestResult, to stay consistent with method name
0d0c201 [jose.cambronero] kstTest -> kolmogorovSmirnovTest in statistics.md
1f56371 [jose.cambronero] changed ksTest in public API to kolmogorovSmirnovTest for clarity
a48ae7b [jose.cambronero] refactor code to account for serializable RealDistribution. Reuse testOneSample( _, cdf)
1bb44bd [jose.cambronero] style and doc changes. Factored out ks test into 2 separate tests
2ec2aa6 [jose.cambronero] initialize to stdnormal when no params passed (and log). Change unit tests to approximate equivalence rather than strict
a4bc0c7 [jose.cambronero] changed ksTest(data, distName) to ksTest(data, distName, params*) after api discussions. Changed tests and docs accordingly
7e66f57 [jose.cambronero] copied implementation note to public api docs, and added @see for links to wiki info
e760ebd [jose.cambronero] line length changes to fit style check
3288e42 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
9026895 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
1226b30 [jose.cambronero] reindent multi-line lambdas, prior interpretation of style guide was wrong on my part
9c0f1af [jose.cambronero] additional style changes incorporated and added documentation to mllib statistics docs
3f81ad2 [jose.cambronero] renamed ks1 sample test for clarity
992293b [jose.cambronero] Style changes as per comments and added implementation note explaining the distributed approach.
6a4784f [jose.cambronero] specified what distributions are available for the convenience method ksTest(data, name) (solely standard normal)
4b8ba61 [jose.cambronero] fixed off by 1/N in cases when post-constant adjustment ecdf is above cdf, but prior to adj it was below
0b5e8ec [jose.cambronero] changed KS one sample test to perform just 1 distributed pass (in addition to the sorting pass), operates on each partition separately. Implementation of Sandy Ryza's algorithm
16b5c4c [jose.cambronero] renamed dat to data and eliminated recalc of RDD size by sharing as argument between empirical and evalOneSampleP
c18dc66 [jose.cambronero] removed ksTestOpt from API and changed comments in HypothesisTestSuite accordingly
f6951b6 [jose.cambronero] changed style and some comments based on feedback from pull request
b9cff3a [jose.cambronero] made small changes to pass style check
ce8e9a1 [jose.cambronero] added kstest testing in HypothesisTestSuite
4da189b [jose.cambronero] added user facing ks test functions
c659ea1 [jose.cambronero] created KS test class
13dfe4d [jose.cambronero] created test result class for ks test

josepablocam · 2015-07-13T17:44:13Z

Sorry, closing the PR. I didn't mean to start another build; I was just syncing the branch with upstream.

SparkQA · 2015-07-13T17:47:31Z

Test build #37144 has finished for PR 6994 at commit 08834f4.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Least(children: Expression*) extends Expression
- case class Greatest(children: Expression*) extends Expression

jose.cambronero added 5 commits June 24, 2015 10:28

created test result class for ks test

13dfe4d

created KS test class

c659ea1

added user facing ks test functions

4da189b

added kstest testing in HypothesisTestSuite

ce8e9a1

made small changes to pass style check

b9cff3a

changed style and some comments based on feedback from pull request

f6951b6

removed ksTestOpt from API and changed comments in HypothesisTestSuite accordingly

c18dc66

jose.cambronero added 3 commits June 25, 2015 10:13

renamed dat to data and eliminated recalc of RDD size by sharing as argument between empirical and evalOneSampleP

16b5c4c

changed KS one sample test to perform just 1 distributed pass (in addition to the sorting pass), operates on each partition separately. Implementation of Sandy Ryza's algorithm

0b5e8ec
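The commit message above describes a single pass that works on each sorted partition separately. The plain-Python sketch below shows one way such a scheme can work, under stated assumptions: partitions are modeled as lists that are already globally sorted, the total count `n` is known up front, and every name here (`partition_extremes`, `ks_statistic_partitioned`, `ks_statistic_reference`, `uniform_cdf`) is hypothetical. It is not the code from this patch. Each partition reports only the smallest and largest signed gap between a locally indexed ECDF and the CDF; the driver then shifts both by the fraction of elements in earlier partitions. Because the shift is a constant per partition, the largest absolute gap after shifting must come from one of those two extremes.

```python
def uniform_cdf(x):
    """CDF of Uniform(0, 1), used here only as an example reference distribution."""
    return min(max(x, 0.0), 1.0)


def partition_extremes(sorted_part, cdf, n):
    """Summarize one sorted partition as (min signed gap, max signed gap, count).

    Gaps are ECDF minus CDF, with the ECDF built from *local* indices only,
    so no knowledge of the other partitions is needed at this stage.
    """
    lo, hi = float("inf"), float("-inf")
    for i, x in enumerate(sorted_part):
        f = cdf(x)
        below = i / n - f        # ECDF just before the jump at x
        above = (i + 1) / n - f  # ECDF just after the jump at x
        lo = min(lo, below, above)
        hi = max(hi, below, above)
    return lo, hi, len(sorted_part)


def ks_statistic_partitioned(partitions, cdf):
    """Combine per-partition summaries into the global KS statistic."""
    n = sum(len(p) for p in partitions)
    # Empty partitions carry no information and are skipped.
    summaries = [partition_extremes(p, cdf, n) for p in partitions if p]
    seen, d = 0, 0.0
    for lo, hi, count in summaries:
        shift = seen / n  # ECDF mass contributed by earlier partitions
        # Shift first, take absolute values afterwards.
        d = max(d, abs(lo + shift), abs(hi + shift))
        seen += count
    return d


def ks_statistic_reference(sorted_data, cdf):
    """Single-machine KS statistic, used to cross-check the partitioned version."""
    n = len(sorted_data)
    return max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n)
        for i, x in enumerate(sorted_data)
    )


if __name__ == "__main__":
    parts = [[0.05, 0.1], [], [0.4, 0.45, 0.7], [0.95]]
    print(ks_statistic_partitioned(parts, uniform_cdf))
```

Keeping the gaps signed until after the shift is the important design choice in this sketch: taking absolute values inside each partition would discard the sign needed to apply the shift correctly.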

fixed off by 1/N in cases when post-constant adjustment ecdf is above cdf, but prior to adj it was below

4b8ba61
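The commit message does not spell out the exact defect, so the following is only a general illustration of how a 1/N-sized error can appear in a KS computation, not a reconstruction of the bug that was fixed. The ECDF jumps by 1/N at every sample point, so the gap to the CDF has to be checked on both sides of each jump. The function names and the Uniform(0, 1) reference distribution are assumptions made for this example.

```python
def uniform_cdf(x):
    """CDF of Uniform(0, 1), used as an example reference distribution."""
    return min(max(x, 0.0), 1.0)


def ks_upper_only(sorted_data, cdf):
    """Incomplete variant: only looks at the ECDF value just after each jump."""
    n = len(sorted_data)
    return max(abs((i + 1) / n - cdf(x)) for i, x in enumerate(sorted_data))


def ks_both_sides(sorted_data, cdf):
    """Checks the ECDF just before (i / n) and just after ((i + 1) / n) each jump."""
    n = len(sorted_data)
    return max(
        max(abs(i / n - cdf(x)), abs((i + 1) / n - cdf(x)))
        for i, x in enumerate(sorted_data)
    )


if __name__ == "__main__":
    sample = [0.7, 0.8, 0.9]  # already sorted
    print(ks_upper_only(sample, uniform_cdf), ks_both_sides(sample, uniform_cdf))
```

For this sample the largest gap sits just before the first point, where the ECDF is still 0 and the CDF is already 0.7, so the upper-only variant underestimates the statistic by exactly one ECDF step of 1/3.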

specified what distributions are available for the convenience method ksTest(data, name) (solely standard normal)

6a4784f

Style changes as per comments and added implementation note explaining the distributed approach.

992293b

jose.cambronero added 4 commits July 9, 2015 11:59

style and doc changes. Factored out ks test into 2 separate tests

1bb44bd

refactor code to account for serializable RealDistribution. Reuse testOneSample( _, cdf)

a48ae7b

changed ksTest in public API to kolmogorovSmirnovTest for clarity

1f56371

kstTest -> kolmogorovSmirnovTest in statistics.md

0d0c201

renamed KSTestResult to KolmogorovSmirnovTestResult, to stay consistent with method name

bbb30b1

Merge remote-tracking branch 'upstream/master'

08834f4

josepablocam closed this Jul 13, 2015
