@josepablocam

This contribution is my original work and I license it to the project under its open source license.

Contributor

Alphabetize the imports in here

Contributor

I think I'd leave this API out on the first pass, as, while there are definitely situations where it's useful, it's likely to be confusing to users. We can always add it in later if there's demand.

@sryza
Contributor

sryza commented Jun 25, 2015

jenkins, test this please

Contributor

nit: "data" instead of "dat"

jose.cambronero added 3 commits June 25, 2015 10:13
…rgument between empirical and evalOneSampleP
…ition to the sorting pass), operates on each partition separately. Implementation of Sandy Ryza's algorithm
Contributor

Mention what distributions are supported

… ksTest(data, name) (solely standard normal)
Contributor

Can you include a note on how this is implemented?

Contributor

Should also copy this paragraph to the public API doc. Otherwise, user won't see it. Use @see for the wiki link.

Contributor

.However -> . However

"ECDF" is not defined, and it is not a standard term in statistics. "Empirical CDF" is fine.
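
For readers following this terminology point: the empirical CDF of a sample $x_1, \dots, x_n$ and the one-sample Kolmogorov-Smirnov statistic against a reference CDF $F$ are conventionally defined as below. These are the standard textbook definitions, stated here for context rather than quoted from the patch.

$$
F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i \le x\}, \qquad D_n = \sup_{x} \lvert F_n(x) - F(x) \rvert
$$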

@SparkQA

SparkQA commented Jul 9, 2015

Test build #36961 has finished for PR 6994 at commit 1f56371.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 9, 2015

Test build #36962 has finished for PR 6994 at commit 0d0c201.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Jul 9, 2015

LGTM except the KSTestResult vs. KolmogorovSmirnovTestResult question is pending. I don't have strong preference between KS and KolmogorovSmirnov, but I think we should keep the naming consistent between the method name and the result type. @srowen?

@srowen
Member

srowen commented Jul 10, 2015

@mengxr @josepablocam Yes, let's be consistent for sure. I prefer spelling out KolmogorovSmirnov myself, unless there is a strong convention elsewhere for "KS", and I don't think there is. It might make it that much more recognizable to anyone skimming the API javadoc. So rename KSTestResult?

@josepablocam
Author

@srowen Sounds good. I'll make the changes to KolmogorovSmirnovTestResult as appropriate. Thanks

@SparkQA

SparkQA commented Jul 11, 2015

Test build #37070 has finished for PR 6994 at commit bbb30b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Jul 11, 2015

Merged into master. Thanks!

mengxr pushed a commit that referenced this pull request Jul 11, 2015
…v Smirnov Test for RDDs

This contribution is my original work and I license it to the project under its open source license.

Author: jose.cambronero <[email protected]>

Closes #6994 from josepablocam/master and squashes the following commits:

bbb30b1 [jose.cambronero] renamed KSTestResult to KolmogorovSmirnovTestResult, to stay consistent with method name
0d0c201 [jose.cambronero] kstTest -> kolmogorovSmirnovTest in statistics.md
1f56371 [jose.cambronero] changed ksTest in public API to kolmogorovSmirnovTest for clarity
a48ae7b [jose.cambronero] refactor code to account for serializable RealDistribution. Reuse testOneSample( _, cdf)
1bb44bd [jose.cambronero]  style and doc changes. Factored out ks test into 2 separate tests
2ec2aa6 [jose.cambronero] initialize to stdnormal when no params passed (and log). Change unit tests to approximate equivalence rather than strict
a4bc0c7 [jose.cambronero] changed ksTest(data, distName) to ksTest(data, distName, params*) after api discussions. Changed tests and docs accordingly
7e66f57 [jose.cambronero] copied implementation note to public api docs, and added @see for links to wiki info
e760ebd [jose.cambronero] line length changes to fit style check
3288e42 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
9026895 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
1226b30 [jose.cambronero] reindent multi-line lambdas, prior interpretation of style guide was wrong on my part
9c0f1af [jose.cambronero] additional style changes incorporated and added documentation to mllib statistics docs
3f81ad2 [jose.cambronero] renamed ks1 sample test for clarity
992293b [jose.cambronero] Style changes as per comments and added implementation note explaining the distributed approach.
6a4784f [jose.cambronero] specified what distributions are available for the convenience method ksTest(data, name) (solely standard normal)
4b8ba61 [jose.cambronero] fixed off by 1/N in cases when post-constant adjustment ecdf is above cdf, but prior to adj it was below
0b5e8ec [jose.cambronero] changed KS one sample test to perform just 1 distributed pass (in addition to the sorting pass), operates on each partition separately. Implementation of Sandy Ryza's algorithm
16b5c4c [jose.cambronero] renamed dat to data and eliminated recalc of RDD size by sharing as argument between empirical and evalOneSampleP
c18dc66 [jose.cambronero] removed ksTestOpt from API and changed comments in HypothesisTestSuite accordingly
f6951b6 [jose.cambronero] changed style and some comments based on feedback from pull request
b9cff3a [jose.cambronero] made small changes to pass style check
ce8e9a1 [jose.cambronero] added kstest testing in HypothesisTestSuite
4da189b [jose.cambronero] added user facing ks test functions
c659ea1 [jose.cambronero] created KS test class
13dfe4d [jose.cambronero] created test result class for ks test
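
The commits above describe a one-sample test that sorts the data and then makes a single pass in which each partition is handled separately, followed by a constant adjustment. A minimal sketch of that idea in plain Python follows. It is not the code from this pull request: partitions are modelled as ordinary lists that are assumed to be globally sorted already, and the names `std_normal_cdf`, `partition_extremes`, and `ks_statistic` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def std_normal_cdf(x: float) -> float:
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def partition_extremes(
    part: Sequence[float], n: int, cdf: Callable[[float], float]
) -> Tuple[float, float, int]:
    """Return (min, max, count) of the unadjusted differences for one partition.

    Indices are local to the partition, so the differences ignore how many
    elements sit in earlier partitions. That offset is added later.
    """
    lo, hi = math.inf, -math.inf
    for j, x in enumerate(part, start=1):
        f = cdf(x)
        lo = min(lo, (j - 1) / n - f)  # empirical CDF just before x, minus F(x)
        hi = max(hi, j / n - f)        # empirical CDF at x, minus F(x)
    return lo, hi, len(part)


def ks_statistic(
    sorted_partitions: List[Sequence[float]], cdf: Callable[[float], float]
) -> float:
    """One-sample KS statistic from per-partition extremes.

    Assumes the concatenation of sorted_partitions is in ascending order.
    """
    n = sum(len(p) for p in sorted_partitions)
    if n == 0:
        raise ValueError("cannot compute a KS statistic on empty data")

    # In a distributed setting this is the single pass over the partitions.
    local = [partition_extremes(p, n, cdf) for p in sorted_partitions]

    stat, seen = 0.0, 0
    for lo, hi, count in local:
        if count > 0:  # empty partitions contribute nothing
            offset = seen / n  # constant shift from elements in earlier partitions
            stat = max(stat, abs(lo + offset), abs(hi + offset))
        seen += count
    return stat


# Usage: four partitions (one empty) of an already sorted sample.
parts = [[-1.2, -0.4, 0.1], [], [0.3, 0.9], [1.7, 2.5]]
print(ks_statistic(parts, std_normal_cdf))
```

Adding a constant to every difference in a partition moves the minimum and maximum by the same amount, so only those two values and the element count need to leave each partition.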
@josepablocam
Author

sorry, closing PR. Didn't mean to start another build. Was just synching the branch with upstream

@SparkQA

SparkQA commented Jul 13, 2015

Test build #37144 has finished for PR 6994 at commit 08834f4.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Least(children: Expression*) extends Expression
    • case class Greatest(children: Expression*) extends Expression
