Commits (changes shown from 21 of 27 commits)
13dfe4d
created test result class for ks test
Jun 24, 2015
c659ea1
created KS test class
Jun 24, 2015
4da189b
added user facing ks test functions
Jun 24, 2015
ce8e9a1
added kstest testing in HypothesisTestSuite
Jun 24, 2015
b9cff3a
made small changes to pass style check
Jun 24, 2015
f6951b6
changed style and some comments based on feedback from pull request
Jun 24, 2015
c18dc66
removed ksTestOpt from API and changed comments in HypothesisTestSuit…
Jun 24, 2015
16b5c4c
renamed dat to data and eliminated recalc of RDD size by sharing as a…
Jun 25, 2015
0b5e8ec
changed KS one sample test to perform just 1 distributed pass (in add…
Jun 25, 2015
4b8ba61
fixed off by 1/N in cases when post-constant adjustment ecdf is above…
Jun 25, 2015
6a4784f
specified what distributions are available for the convenience method…
Jun 26, 2015
992293b
Style changes as per comments and added implementation note explainin…
Jun 26, 2015
3f81ad2
renamed ks1 sample test for clarity
Jun 26, 2015
9c0f1af
additional style changes incorporated and added documentation to mlli…
Jun 26, 2015
1226b30
reindent multi-line lambdas, prior intepretation of style guide was w…
Jun 29, 2015
9026895
addressed style changes, correctness change to simpler approach, and …
Jul 7, 2015
3288e42
addressed style changes, correctness change to simpler approach, and …
Jul 7, 2015
e760ebd
line length changes to fit style check
Jul 7, 2015
7e66f57
copied implementation note to public api docs, and added @see for lin…
Jul 7, 2015
a4bc0c7
changed ksTest(data, distName) to ksTest(data, distName, params*) aft…
Jul 8, 2015
2ec2aa6
initialize to stdnormal when no params passed (and log). Change unit …
Jul 9, 2015
1bb44bd
style and doc changes. Factored out ks test into 2 separate tests
Jul 9, 2015
a48ae7b
refactor code to account for serializable RealDistribution. Reuse tes…
Jul 9, 2015
1f56371
changed ksTest in public API to kolmogorovSmirnovTest for clarity
Jul 9, 2015
0d0c201
kstTest -> kolmogorovSmirnovTest in statistics.md
Jul 9, 2015
bbb30b1
renamed KSTestResult to KolmogorovSmirnovTestResult, to stay consiste…
Jul 11, 2015
08834f4
Merge remote-tracking branch 'upstream/master'
Jul 13, 2015
37 changes: 36 additions & 1 deletion docs/mllib-statistics.md
@@ -283,7 +283,7 @@ approxSample = data.sampleByKey(False, fractions);

Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically
significant, whether this result occurred by chance or not. MLlib currently supports Pearson's
chi-squared ( $\chi^2$) tests for goodness of fit and independence. The input data types determine
whether the goodness of fit or the independence test is conducted. The goodness of fit test requires
an input type of `Vector`, whereas the independence test requires a `Matrix` as input.

@@ -422,6 +422,41 @@ for i, result in enumerate(featureTestResults):

</div>

Additionally, MLlib provides a 1-sample, 2-sided implementation of the Kolmogorov-Smirnov test
[Contributor comment] Kolmogorov-Smirnov -> Kolmogorov-Smirnov (KS)
Otherwise, we use KS without definition.

for equality of probability distributions. By providing the name of a theoretical distribution
(currently only the normal distribution is supported) and its parameters, or a function to
calculate the cumulative distribution according to a given theoretical distribution, the user can
test the null hypothesis that their sample is drawn from that distribution. In the case that the
user tests against the normal distribution (`distName="norm"`) but does not provide distribution
parameters, the test initializes to the standard normal distribution and logs an appropriate
message.
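
The test statistic is the maximum absolute difference between the empirical cumulative distribution function (CDF) of the sample and the theoretical CDF, $D_n = \sup_x |F_n(x) - F(x)|$; large values of $D_n$ are evidence against the null hypothesis.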

<div class="codetabs">
<div data-lang="scala" markdown="1">
[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to
run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
and interpret the hypothesis tests.

{% highlight scala %}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val data: RDD[Double] = ... // an RDD of sample data

// run a KS test for the sample versus a standard normal distribution
val ksTestResult = Statistics.ksTest(data, "norm", 0, 1)
println(ksTestResult) // summary of the test including the p-value, test statistic,
                      // and null hypothesis
// if our p-value indicates significance, we can reject the null hypothesis

// perform a KS test using a cumulative distribution function of our making
val myCDF: Double => Double = ...
val ksTestResult2 = Statistics.ksTest(data, myCDF)
{% endhighlight %}
</div>
</div>
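
The returned result object also exposes its fields programmatically, so callers can branch on the outcome. A minimal sketch, assuming the `ksTestResult` from the example above and an arbitrary significance level of 0.05:

{% highlight scala %}
// the 0.05 significance level is an illustrative choice, not mandated by the API
val alpha = 0.05
if (ksTestResult.pValue < alpha) {
  println(s"Reject the null hypothesis (D = ${ksTestResult.statistic})")
} else {
  println(s"Fail to reject: ${ksTestResult.nullHypothesis}")
}
{% endhighlight %}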


## Random data generation

Random data generation is useful for randomized algorithms, prototyping, and performance testing.
mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala
@@ -23,7 +23,7 @@ import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.{Matrix, Vector}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.correlation.Correlations
-import org.apache.spark.mllib.stat.test.{ChiSqTest, ChiSqTestResult}
+import org.apache.spark.mllib.stat.test.{ChiSqTest, ChiSqTestResult, KSTest, KSTestResult}
import org.apache.spark.rdd.RDD

/**
@@ -158,4 +158,47 @@ object Statistics {
def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult] = {
ChiSqTest.chiSquaredFeatures(data)
}

/**
* Conduct the two-sided Kolmogorov Smirnov test for data sampled from a
[Contributor comment] ditto: Kolmogorov-Smirnov (KS)

* continuous distribution. By comparing the largest difference between the empirical cumulative
* distribution of the sample data and the theoretical distribution we can provide a test for
* the null hypothesis that the sample data comes from that theoretical distribution.
* For more information on KS Test:
* @see [[https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test]]
*
* Implementation note: We seek to implement the KS test with a minimal number of distributed
* passes. We sort the RDD, and then perform the following operations on a per-partition basis:
* calculate an empirical cumulative distribution value for each observation, and a theoretical
* cumulative distribution value. We know the latter to be correct, while the former will be off
* by a constant (how large the constant is depends on how many values precede it in other
* partitions).However, given that this constant simply shifts the ECDF upwards, but doesn't
[Contributor comment] .However -> . However
ECDF is not defined. This is not a standard term in statistics. empirical CDF is fine.

* change its shape, and furthermore, that constant is the same within a given partition, we can
* pick 2 values in each partition that can potentially resolve to the largest global distance.
* Namely, we pick the minimum distance and the maximum distance. Additionally, we keep track of
* how many elements are in each partition. Once these three values have been returned for every
* partition, we can collect and operate locally. Locally, we can now adjust each distance by the
* appropriate constant (the cumulative sum of # of elements in the prior partitions divided by
[Contributor comment] # -> number

* the data set size). Finally, we take the maximum absolute value, and this is the statistic.
[Contributor comment] I would move this paragraph inside the method as implementation details, or only keep a copy in KSTest. End users do not need to know it.

* @param data an `RDD[Double]` containing the sample of data to test
* @param cdf a `Double => Double` function to calculate the theoretical CDF at a given value
* @return KSTestResult object containing test statistic, p-value, and null hypothesis.
[Contributor comment] link KSTestResult

*/
def ksTest(data: RDD[Double], cdf: Double => Double): KSTestResult = {
KSTest.testOneSample(data, cdf)
}

/**
* Convenience function to conduct a one-sample, two sided Kolmogorov Smirnov test for probability
[Contributor comment] two sided Kolmogorov Smirnov -> two-sided Kolmogorov-Smirnov

* distribution equality. Currently supports the normal distribution, taking as parameters
* the mean and standard deviation.
* (distName = "norm")
* @param data an `RDD[Double]` containing the sample of data to test
* @param distName a `String` name for a theoretical distribution
* @param params `Double*` specifying the parameters to be used for the theoretical distribution
* @return KSTestResult object containing test statistic, p-value, and null hypothesis.
*/
def ksTest(data: RDD[Double], distName: String, params: Double*): KSTestResult = {
[Contributor comment] Add @varargs for Java compatibility. Another issue is the API for the two-sample test. Are we going to use the same method name? What is your proposal?
[Author comment] I was thinking about overloading the name (similar to how R's ks.test does a one-sample test when passed one vector of data, and a two-sample test when passed two). SciPy's implementation breaks out the two-sample test as ks_2samp. I think I prefer the R approach.

[Member comment] What about renaming ksTest -> kolmogorovSmirnovTest? Obviously I prefer less terse names in modern languages but am aware that at times these are meant to mirror old R packages and such.

[Contributor comment] The issue with overloading the name would show up in the Python API, because you cannot declare two methods with the same name. Then under this method, you cannot call the second argument distName or data2; it has to be something more general, like y. This is R's doc for the second arg:

y   either a numeric vector of data values, or a character string naming a
    cumulative distribution function or an actual cumulative distribution
    function such as pnorm. Only continuous CDFs are valid.

MATLAB uses kstest2. We can discuss more in the 2-sample test PR.

@srowen This is mostly mirroring R's API. No strong preference, but I would never type kolmogorovSmirnovTest without auto-completion. (Well, I just typed it ...)

[Author comment] @srowen I've changed to kolmogorovSmirnovTest in the public API. Do you think the result class name should also change? I am leaning towards no, but wanted to see what others thought before.

[Contributor comment] If we change one, I think we should change the other.

KSTest.testOneSample(data, distName, params: _*)
}
}
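
For reference, a short self-contained usage sketch of the two public overloads (the data values and the Uniform(0, 1) CDF are illustrative choices, and an active SparkContext `sc` is assumed):

{% highlight scala %}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// hypothetical sample; any RDD[Double] works
val data: RDD[Double] = sc.parallelize(Seq(0.1, 0.15, 0.2, 0.3, 0.25, 0.25, 0.23, 0.1))

// named-distribution form: test against N(0, 1)
val result1 = Statistics.ksTest(data, "norm", 0.0, 1.0)

// custom-CDF form: test against Uniform(0, 1), whose CDF is
// F(x) = 0 for x < 0, x for 0 <= x <= 1, and 1 for x > 1
val uniformCdf: Double => Double = x => math.min(math.max(x, 0.0), 1.0)
val result2 = Statistics.ksTest(data, uniformCdf)
{% endhighlight %}
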
203 changes: 203 additions & 0 deletions mllib/src/main/scala/org/apache/spark/mllib/stat/test/KSTest.scala
@@ -0,0 +1,203 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.mllib.stat.test

import scala.annotation.varargs

import org.apache.commons.math3.distribution.{NormalDistribution, RealDistribution}
import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest

import org.apache.spark.Logging
import org.apache.spark.rdd.RDD

[Contributor comment] extra newline


/**
* Conduct the two-sided Kolmogorov Smirnov test for data sampled from a
* continuous distribution. By comparing the largest difference between the empirical cumulative
* distribution of the sample data and the theoretical distribution we can provide a test for
* the null hypothesis that the sample data comes from that theoretical distribution.
* For more information on KS Test:
* @see [[https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test]]
*
* Implementation note: We seek to implement the KS test with a minimal number of distributed
* passes. We sort the RDD, and then perform the following operations on a per-partition basis:
* calculate an empirical cumulative distribution value for each observation, and a theoretical
* cumulative distribution value. We know the latter to be correct, while the former will be off by
* a constant (how large the constant is depends on how many values precede it in other partitions).
* However, given that this constant simply shifts the ECDF upwards, but doesn't change its shape,
* and furthermore, that constant is the same within a given partition, we can pick 2 values
* in each partition that can potentially resolve to the largest global distance. Namely, we
[Contributor comment] I like the algorithm :) Is there a reference, or is it original?
[Author comment] Credit goes to @syrza for the algorithm.


* pick the minimum distance and the maximum distance. Additionally, we keep track of how many
* elements are in each partition. Once these three values have been returned for every partition,
* we can collect and operate locally. Locally, we can now adjust each distance by the appropriate
* constant (the cumulative sum of # of elements in the prior partitions divided by the data set
* size). Finally, we take the maximum absolute value, and this is the statistic.
[Contributor comment] See my previous comments about the text.

*/
private[stat] object KSTest extends Logging {

// Null hypothesis for the type of KS test to be included in the result.
object NullHypothesis extends Enumeration {
type NullHypothesis = Value
val oneSampleTwoSided = Value("Sample follows theoretical distribution")
[Contributor comment] minor: oneSampleTwoSided -> OneSampleTwoSided

}

/**
* Runs a KS test for 1 set of sample data, comparing it to a theoretical distribution
* @param data `RDD[Double]` data on which to run test
* @param cdf `Double => Double` function to calculate the theoretical CDF
* @return KSTestResult summarizing the test results (pval, statistic, and null hypothesis)
*/
def testOneSample(data: RDD[Double], cdf: Double => Double): KSTestResult = {
val n = data.count().toDouble
val localData = data.sortBy(x => x).mapPartitions { part =>
val partDiffs = oneSampleDifferences(part, n, cdf) // local distances
searchOneSampleCandidates(partDiffs) // candidates: local extrema
}.collect()
val ksStat = searchOneSampleStatistic(localData, n) // result: global extreme
evalOneSampleP(ksStat, n.toLong)
}

/**
* Runs a KS test for 1 set of sample data, comparing it to a theoretical distribution
* @param data `RDD[Double]` data on which to run test
* @param createDist `Unit => RealDistribution` function to create a theoretical distribution
* @return KSTestResult summarizing the test results (pval, statistic, and null hypothesis)
*/
def testOneSample(data: RDD[Double], createDist: () => RealDistribution): KSTestResult = {
val n = data.count().toDouble
val localData = data.sortBy(x => x).mapPartitions { part =>
val partDiffs = oneSampleDifferences(part, n, createDist) // local distances
searchOneSampleCandidates(partDiffs) // candidates: local extrema
}.collect()
val ksStat = searchOneSampleStatistic(localData, n) // result: global extreme
evalOneSampleP(ksStat, n.toLong)
[Contributor comment] Is it almost the same as testOneSample(..., cdf)? Should we reuse the code?
[Author comment] @mengxr I've refactored the code, accounting for the fact that RealDistribution in math3 3.4.1 is serializable (which Sean had already pointed out to me), so we now indeed reuse testOneSample(..., cdf) and avoid duplication.

}

/**
* Calculate unadjusted distances between the empirical CDF and the theoretical CDF in a
* partition
* @param partData `Iterator[Double]` 1 partition of a sorted RDD
* @param n `Double` the total size of the RDD
* @param cdf `Double => Double` a function the calculates the theoretical CDF of a value
* @return `Iterator[(Double, Double)]` Unadjusted (i.e. off by a constant) potential extrema
* in a partition. The first element corresponds to (ECDF - 1/N) - CDF, the second
* element corresponds to ECDF - CDF. We can then search the resulting iterator
* for the minimum of the first and the maximum of the second element, and provide this
* as a partition's candidate extrema
*/
private def oneSampleDifferences(partData: Iterator[Double], n: Double, cdf: Double => Double)
: Iterator[(Double, Double)] = {
// zip data with index (within that partition)
// calculate local (unadjusted) ECDF and subtract CDF
partData.zipWithIndex.map { case (v, ix) =>
// dp and dl are later adjusted by constant, when global info is available
val dp = (ix + 1) / n
val dl = ix / n
val cdfVal = cdf(v)
(dl - cdfVal, dp - cdfVal)
}
}

private def oneSampleDifferences(
partData: Iterator[Double],
n: Double,
createDist: () => RealDistribution)
: Iterator[(Double, Double)] = {
val dist = createDist()
oneSampleDifferences(partData, n, x => dist.cumulativeProbability(x))
}

/**
* Search the unadjusted differences in a partition and return the
* two extrema (furthest below and furthest above CDF), along with a count of elements in that
* partition
* @param partDiffs `Iterator[(Double, Double)]` the unadjusted differences between ECDF and CDF
* in a partition, which come as a tuple of (ECDF - 1/N - CDF, ECDF - CDF)
* @return `Iterator[(Double, Double, Double)]` the local extrema and a count of elements
*/
private def searchOneSampleCandidates(partDiffs: Iterator[(Double, Double)])
: Iterator[(Double, Double, Double)] = {
val initAcc = (Double.MaxValue, Double.MinValue, 0.0)
val pResults = partDiffs.foldLeft(initAcc) { case ((pMin, pMax, pCt), (dl, dp)) =>
(math.min(pMin, dl), math.max(pMax, dp), pCt + 1)
}
val results = if (pResults == initAcc) Array[(Double, Double, Double)]() else Array(pResults)
results.iterator
}

/**
* Find the global maximum distance between ECDF and CDF (i.e. the KS Statistic) after adjusting
* local extrema estimates from individual partitions with the amount of elements in preceding
* partitions
* @param localData `Array[(Double, Double, Double)]` A local array containing the collected
* results of `searchOneSampleCandidates` across all partitions
* @param n `Double` The size of the RDD
* @return The one-sample Kolmogorov-Smirnov statistic
*/
private def searchOneSampleStatistic(localData: Array[(Double, Double, Double)], n: Double)
: Double = {
val initAcc = (Double.MinValue, 0.0)
// adjust differences based on the # of elements preceding it, which should provide
// the correct distance between ECDF and CDF
val results = localData.foldLeft(initAcc) { case ((prevMax, prevCt), (minCand, maxCand, ct)) =>
val adjConst = prevCt / n
val dist1 = math.abs(minCand + adjConst)
val dist2 = math.abs(maxCand + adjConst)
val maxVal = Array(prevMax, dist1, dist2).max
(maxVal, prevCt + ct)
}
results._1
}

/**
* A convenience function that allows running the KS test for 1 set of sample data against
* a named distribution
* @param data the sample data that we wish to evaluate
* @param distName the name of the theoretical distribution
* @param params Variable length parameter for distribution's parameters
* @return KSTestResult summarizing the test results (pval, statistic, and null hypothesis)
*/
@varargs
def testOneSample(data: RDD[Double], distName: String, params: Double*): KSTestResult = {
val distanceCalc =
distName match {
case "norm" => () => {
if (params.nonEmpty) {
// parameters are passed, then can only be 2
require(params.length == 2, "Normal distribution requires mean and standard " +
"deviation as parameters")
new NormalDistribution(params(0), params(1))
} else {
// if no parameters passed in initializes to standard normal
logInfo("No parameters specified for Normal distribution," +
[Contributor comment] Normal -> normal

"initialized to standard normal (i.e. N(0, 1))")
new NormalDistribution(0, 1)
}
}
case _ => throw new UnsupportedOperationException(s"$distName not yet supported through" +
s" convenience method. Current options are:[stdnorm].")
[Contributor comment] are:[stdnorm] -> are: [norm]

}

testOneSample(data, distanceCalc)
}

private def evalOneSampleP(ksStat: Double, n: Long): KSTestResult = {
val pval = 1 - new KolmogorovSmirnovTest().cdf(ksStat, n.toInt)
new KSTestResult(pval, ksStat, NullHypothesis.oneSampleTwoSided.toString)
}
}
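
To make the constant-adjustment idea concrete, here is a small self-contained simulation of the same computation (plain Scala, no Spark; the four data points, the two-partition split, and the Uniform(0, 1) CDF are all illustrative choices). It mirrors oneSampleDifferences, searchOneSampleCandidates, and searchOneSampleStatistic on locally sorted data:

{% highlight scala %}
// A local sketch of the distributed KS statistic computation. Assumes the
// data is globally sorted and split into non-empty "partitions".
def ksStatistic(partitions: Seq[Seq[Double]], cdf: Double => Double): Double = {
  val n = partitions.map(_.size).sum.toDouble

  // Per partition: unadjusted extrema of (ECDF - 1/n - CDF, ECDF - CDF),
  // using only within-partition indices, plus the partition's element count.
  val candidates = partitions.map { part =>
    val diffs = part.zipWithIndex.map { case (v, ix) =>
      val cdfVal = cdf(v)
      (ix / n - cdfVal, (ix + 1) / n - cdfVal)
    }
    (diffs.map(_._1).min, diffs.map(_._2).max, part.size.toDouble)
  }

  // "Locally": shift each partition's candidates by the count of elements in
  // preceding partitions divided by n, then take the maximum absolute value.
  val (stat, _) = candidates.foldLeft((Double.MinValue, 0.0)) {
    case ((best, prevCt), (minCand, maxCand, ct)) =>
      val adj = prevCt / n
      (Seq(best, math.abs(minCand + adj), math.abs(maxCand + adj)).max, prevCt + ct)
  }
  stat
}

// Four sorted values in two partitions, tested against Uniform(0, 1):
// the statistic is 0.25, attained at x = 0.5, where the ECDF jumps to 3/4
// while the theoretical CDF is 0.5.
val d = ksStatistic(Seq(Seq(0.1, 0.4), Seq(0.5, 0.9)), x => math.min(math.max(x, 0.0), 1.0))
{% endhighlight %}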

mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala
@@ -90,3 +90,19 @@ class ChiSqTestResult private[stat] (override val pValue: Double,
super.toString
}
}

/**
* :: Experimental ::
* Object containing the test results for the Kolmogorov-Smirnov test.
*/
@Experimental
class KSTestResult private[stat] (override val pValue: Double,
[Contributor comment] move override val pValue: Double to next line

override val statistic: Double,
override val nullHypothesis: String) extends TestResult[Int] {

override val degreesOfFreedom = 0
[Contributor comment] put an empty line between methods


override def toString: String = {
"Kolmogorov Smirnov test summary:\n" + super.toString
[Contributor comment] Kolmogorov-Smirnov

}
}