Closed
27 commits
13dfe4d  created test result class for ks test (Jun 24, 2015)
c659ea1  created KS test class (Jun 24, 2015)
4da189b  added user facing ks test functions (Jun 24, 2015)
ce8e9a1  added kstest testing in HypothesisTestSuite (Jun 24, 2015)
b9cff3a  made small changes to pass style check (Jun 24, 2015)
f6951b6  changed style and some comments based on feedback from pull request (Jun 24, 2015)
c18dc66  removed ksTestOpt from API and changed comments in HypothesisTestSuit… (Jun 24, 2015)
16b5c4c  renamed dat to data and eliminated recalc of RDD size by sharing as a… (Jun 25, 2015)
0b5e8ec  changed KS one sample test to perform just 1 distributed pass (in add… (Jun 25, 2015)
4b8ba61  fixed off by 1/N in cases when post-constant adjustment ecdf is above… (Jun 25, 2015)
6a4784f  specified what distributions are available for the convenience method… (Jun 26, 2015)
992293b  Style changes as per comments and added implementation note explainin… (Jun 26, 2015)
3f81ad2  renamed ks1 sample test for clarity (Jun 26, 2015)
9c0f1af  additional style changes incorporated and added documentation to mlli… (Jun 26, 2015)
1226b30  reindent multi-line lambdas, prior intepretation of style guide was w… (Jun 29, 2015)
9026895  addressed style changes, correctness change to simpler approach, and … (Jul 7, 2015)
3288e42  addressed style changes, correctness change to simpler approach, and … (Jul 7, 2015)
e760ebd  line length changes to fit style check (Jul 7, 2015)
7e66f57  copied implementation note to public api docs, and added @see for lin… (Jul 7, 2015)
a4bc0c7  changed ksTest(data, distName) to ksTest(data, distName, params*) aft… (Jul 8, 2015)
2ec2aa6  initialize to stdnormal when no params passed (and log). Change unit … (Jul 9, 2015)
1bb44bd  style and doc changes. Factored out ks test into 2 separate tests (Jul 9, 2015)
a48ae7b  refactor code to account for serializable RealDistribution. Reuse tes… (Jul 9, 2015)
1f56371  changed ksTest in public API to kolmogorovSmirnovTest for clarity (Jul 9, 2015)
0d0c201  kstTest -> kolmogorovSmirnovTest in statistics.md (Jul 9, 2015)
bbb30b1  renamed KSTestResult to KolmogorovSmirnovTestResult, to stay consiste… (Jul 11, 2015)
08834f4  Merge remote-tracking branch 'upstream/master' (Jul 13, 2015)
changed style and some comments based on feedback from pull request
jose.cambronero committed Jun 24, 2015
commit f6951b60457dc749da9214833841690491f434df
22 changes: 11 additions & 11 deletions mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala
@@ -169,6 +169,17 @@ object Statistics {
KSTest.testOneSample(data, cdf)
}

/**
* A convenience method to conduct a one-sample, two sided Kolmogorov Smirnov test for probability
* distribution equality
* @param data an `RDD[Double]` containing the sample of data to test
* @param name a `String` name for a theoretical distribution
Contributor: Mention what distributions are supported

Contributor: name -> dist or distName? It is not clear what name means. You mentioned only the standard normal distribution is supported, but forgot to provide its corresponding distribution name in the doc. It is hard to guess "stdnorm" without looking into the code.

Member: Does it make more sense to provide a Double => Double implementation that is the standard normal CDF, rather than this system of names? That seems more straightforward, but is this with an eye to PySpark interoperability?

Author: The system of names was basically a way to get around the fact that we're using math3 distributions to calculate the CDFs, and these classes aren't serializable. So what we do is create one instance of the distribution per partition, by passing a function () => RealDistribution. The system of names hides this from the user.
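The per-partition reuse described in this comment can be sketched without Spark or math3. The `Dist` trait, the `newUniform01` factory, and the `Seq`-of-partitions stand-in below are all hypothetical illustrations: the real code passes `() => RealDistribution` to `RDD.mapPartitions`, but the shape of the pattern is the same.

```scala
// Illustrates the pattern: the factory is invoked once per partition, so each
// partition builds its own (potentially non-serializable) distribution object
// and reuses it for every observation in that partition.

var instancesCreated = 0

trait Dist { def cdf(x: Double): Double }

// hypothetical factory: called once per partition, not once per element
def newUniform01(): Dist = {
  instancesCreated += 1
  new Dist { def cdf(x: Double): Double = math.max(0.0, math.min(1.0, x)) }
}

// local stand-in for RDD.mapPartitions: a Seq of partitions
def mapPartitions(parts: Seq[Seq[Double]], factory: () => Dist): Seq[Seq[Double]] =
  parts.map { part =>
    val dist = factory()   // one instance per partition
    part.map(dist.cdf)     // reused for each observation in the partition
  }

val partitions = Seq(Seq(0.1, 0.5), Seq(0.9, 1.5))
val cdfVals = mapPartitions(partitions, () => newUniform01())
```

With two partitions, the factory runs exactly twice, regardless of how many observations each partition holds.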

Member: The API doesn't take a Math3 object though... it just has to be a function, and that can be serializable. It can be a reusable wrapper around any Math3 distribution, one that uses un-serializable implementations inside and manages serializing / recreating the distribution from parameters.

... but actually the distribution objects appear to be serializable? AbstractRealDistribution is, so NormalDistribution is, and others.

Author: Yes, on point 1, if I'm understanding correctly: my only concern at the time was that recreating the distribution object for each observation whenever the cdf function was called (versus once per partition, reusing that object for each observation in that partition) seemed inefficient (but I can of course be wrong on this point).

You're right about distributions in math3 being serializable (it seems that is another difference between 3.4.1 and 3.1.1). So perhaps a better approach is to simply have the API take a RealDistribution directly (and eliminate the name system, but keep the Double => Double option as well, in case users want something that is not implemented in math3).

Contributor: We shouldn't expose third-party APIs in our public APIs. Having both distName and Double => Double looks okay to me.
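The two public entry points the thread converges on can be sketched as a name-based lookup layered over a generic CDF. Everything here is a hypothetical local illustration: `resolveCdf` and the `"uniform01"` name are stand-ins (the PR itself dispatches on `"stdnorm"` to a math3 `NormalDistribution`), chosen so the example needs no external library.

```scala
// Name-based convenience layer: a distName resolves to a CDF, which then feeds
// the same code path as the generic Double => Double overload. Unknown names
// fail loudly, as in the PR's convenience method.
def resolveCdf(distName: String): Double => Double =
  distName match {
    case "uniform01" => (x: Double) => math.max(0.0, math.min(1.0, x))
    case _ => throw new UnsupportedOperationException(
      s"$distName not supported; current options are: [uniform01]")
  }
```

A caller can then use `resolveCdf("uniform01")` wherever a `Double => Double` CDF is expected, which is why keeping both overloads costs little.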

* @return KSTestResult object containing test statistic, p-value, and null hypothesis.
*/
def ksTest(data: RDD[Double], name: String): KSTestResult = {
KSTest.testOneSample(data, name)
}

/**
* Conduct a one-sample, two sided Kolmogorov Smirnov test for probability distribution equality,
* which creates only 1 distribution object per partition (useful in conjunction with Apache
@@ -187,15 +198,4 @@
: KSTestResult = {
KSTest.testOneSampleOpt(data, distCalc)
}

/**
* A convenience method to conduct a one-sample, two sided Kolmogorov Smirnov test for probability
* distribution equality
* @param data an `RDD[Double]` containing the sample of data to test
* @param name a `String` name for a theoretical distribution
* @return KSTestResult object containing test statistic, p-value, and null hypothesis.
*/
def ksTest(data: RDD[Double], name: String): KSTestResult = {
KSTest.testOneSample(data, name)
}
}
28 changes: 14 additions & 14 deletions mllib/src/main/scala/org/apache/spark/mllib/stat/test/KSTest.scala
@@ -20,7 +20,6 @@ package org.apache.spark.mllib.stat.test
import org.apache.commons.math3.distribution.NormalDistribution
import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest

import org.apache.spark.{SparkException, Logging}
import org.apache.spark.rdd.RDD

Contributor: extra newline

(automated bounce reply) NAVER - http://www.naver.com/

The mail sent to [email protected] <Re: [spark] [SPARK-8598] [MLlib] Implementation of 1-sample, two-sided, Kolmogorov Smirnov Test for RDDs (#6994)> could not be delivered for the following reason: the recipient has blocked your mail.



@@ -31,7 +30,7 @@ import org.apache.spark.rdd.RDD
* the null hypothesis that the sample data comes from that theoretical distribution.
* For more information on KS Test: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
Contributor: Can you include a note on how this is implemented?

Contributor: Should also copy this paragraph to the public API doc. Otherwise, users won't see it. Use @see for the wiki link.

*/
private[stat] object KSTest {

// Null hypothesis for the type of KS test to be included in the result.
object NullHypothesis extends Enumeration {
@@ -41,22 +40,21 @@ import org.apache.spark.rdd.RDD

/**
* Calculate empirical cumulative distribution values needed for KS statistic
- * @param dat `RDD[Double]` on which to calculate empirical cumulative distribution values
+ * @param data `RDD[Double]` on which to calculate empirical cumulative distribution values
* @return an RDD of (Double, Double, Double), where the first element in each tuple is the
* value, the second element is the ECDFV - 1 /n, and the third element is the ECDFV,
* where ECDF stands for empirical cumulative distribution function value
*
*/
- def empirical(dat: RDD[Double]): RDD[(Double, Double, Double)] = {
-   val n = dat.count().toDouble
-   dat.sortBy(x => x).zipWithIndex().map { case (v, i) => (v, i / n, (i + 1) / n) }
+ def empirical(data: RDD[Double]): RDD[(Double, Double, Double)] = {
+   val n = data.count().toDouble
+   data.sortBy(x => x).zipWithIndex().map { case (v, i) => (v, i / n, (i + 1) / n) }
}
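The triples that `empirical` produces, and how the KS statistic falls out of them, can be shown locally with a plain sorted `Seq` in place of a sorted RDD. For each sorted value v at index i, the triple is (v, i/n, (i+1)/n): the ECDF just below and just above v. The Uniform(0,1) CDF is used here purely so the sketch needs no external library; the sample values are made up for illustration.

```scala
// Local sketch of empirical() plus the KS statistic: the largest distance from
// the theoretical CDF to either side of the ECDF step at every sample point.
val sample = Seq(0.5, 0.1, 0.9, 0.4)
val n = sample.size.toDouble

// mirrors: data.sortBy(x => x).zipWithIndex().map { case (v, i) => (v, i / n, (i + 1) / n) }
val ecdf: Seq[(Double, Double, Double)] =
  sample.sorted.zipWithIndex.map { case (v, i) => (v, i / n, (i + 1) / n) }

val cdf: Double => Double = x => math.max(0.0, math.min(1.0, x))  // Uniform(0,1) CDF

// KS statistic: check both one-sided gaps at each point, then take the overall max
val ksStat = ecdf.map { case (v, below, above) =>
  math.max(math.abs(cdf(v) - below), math.abs(above - cdf(v)))
}.max
```

Checking both i/n and (i+1)/n at every point is exactly the off-by-1/N concern addressed in commit 4b8ba61: the ECDF can be farthest from the CDF just before a jump as well as just after it.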

/**
* Runs a KS test for 1 set of sample data, comparing it to a theoretical distribution
* @param dat `RDD[Double]` to evaluate
* @param cdf `Double => Double` function to calculate the theoretical CDF
- * @return a KSTestResult summarizing the test results (pval, statistic, and null hypothesis)
+ * @return KSTestResult summarizing the test results (pval, statistic, and null hypothesis)
*/
def testOneSample(dat: RDD[Double], cdf: Double => Double): KSTestResult = {
Contributor: nit: "data" instead of "dat"

val empiriRDD = empirical(dat) // empirical distribution
@@ -77,11 +75,11 @@ import org.apache.spark.rdd.RDD
* @param dat `RDD[Double]` to evaluate
* @param distCalc a function to calculate the distance between the empirical values and the
* theoretical value
- * @return a KSTestResult summarizing the test results (pval, statistic, and null hypothesis)
+ * @return KSTestResult summarizing the test results (pval, statistic, and null hypothesis)
*/
def testOneSampleOpt(dat: RDD[Double],
distCalc: Iterator[(Double, Double, Double)] => Iterator[Double])
: KSTestResult = {
val empiriRDD = empirical(dat) // empirical distribution information
val distances = empiriRDD.mapPartitions(distCalc, false)
val ksStat = distances.max
@@ -91,7 +89,8 @@ import org.apache.spark.rdd.RDD
/**
* Returns a function to calculate the KSTest with a standard normal distribution
* to be used with testOneSampleOpt
- * @return Return a function that we can map over partitions to calculate the KS distance in each
+ * @return Return a function that we can map over partitions to calculate the KS distance for each
+ *         observation on a per-partition basis
*/
def stdNormDistances(): (Iterator[(Double, Double, Double)]) => Iterator[Double] = {
val dist = new NormalDistribution(0, 1)
@@ -107,13 +106,14 @@ import org.apache.spark.rdd.RDD
* a named distribution
* @param dat the sample data that we wish to evaluate
* @param distName the name of the theoretical distribution
- * @return The KS statistic and p-value associated with a two sided test
+ * @return KSTestResult summarizing the test results (pval, statistic, and null hypothesis)
*/
def testOneSample(dat: RDD[Double], distName: String): KSTestResult = {
val distanceCalc =
distName match {
case "stdnorm" => stdNormDistances()
- case _ => throw new UnsupportedOperationException()
+ case _ => throw new UnsupportedOperationException(s"$distName not yet supported through" +
+   s"convenience method. Current options are:[stdnorm].")
}

testOneSampleOpt(dat, distanceCalc)
@@ -20,7 +20,7 @@ package org.apache.spark.mllib.stat
import java.util.Random

import org.apache.commons.math3.distribution.{ExponentialDistribution,
NormalDistribution, UniformRealDistribution}
import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest

import org.apache.spark.{SparkException, SparkFunSuite}
@@ -159,7 +159,6 @@ class HypothesisTestSuite extends SparkFunSuite with MLlibTestSparkContext {
}

test("kolmogorov smirnov test empirical distributions") {

// Create theoretical distributions
val stdNormalDist = new NormalDistribution(0, 1)
val expDist = new ExponentialDistribution(0.6)