Closed
Changes from 1 commit
Commits
27 commits
13dfe4d
created test result class for ks test
Jun 24, 2015
c659ea1
created KS test class
Jun 24, 2015
4da189b
added user facing ks test functions
Jun 24, 2015
ce8e9a1
added kstest testing in HypothesisTestSuite
Jun 24, 2015
b9cff3a
made small changes to pass style check
Jun 24, 2015
f6951b6
changed style and some comments based on feedback from pull request
Jun 24, 2015
c18dc66
removed ksTestOpt from API and changed comments in HypothesisTestSuit…
Jun 24, 2015
16b5c4c
renamed dat to data and eliminated recalc of RDD size by sharing as a…
Jun 25, 2015
0b5e8ec
changed KS one sample test to perform just 1 distributed pass (in add…
Jun 25, 2015
4b8ba61
fixed off by 1/N in cases when post-constant adjustment ecdf is above…
Jun 25, 2015
6a4784f
specified what distributions are available for the convenience method…
Jun 26, 2015
992293b
Style changes as per comments and added implementation note explainin…
Jun 26, 2015
3f81ad2
renamed ks1 sample test for clarity
Jun 26, 2015
9c0f1af
additional style changes incorporated and added documentation to mlli…
Jun 26, 2015
1226b30
reindent multi-line lambdas, prior intepretation of style guide was w…
Jun 29, 2015
9026895
addressed style changes, correctness change to simpler approach, and …
Jul 7, 2015
3288e42
addressed style changes, correctness change to simpler approach, and …
Jul 7, 2015
e760ebd
line length changes to fit style check
Jul 7, 2015
7e66f57
copied implementation note to public api docs, and added @see for lin…
Jul 7, 2015
a4bc0c7
changed ksTest(data, distName) to ksTest(data, distName, params*) aft…
Jul 8, 2015
2ec2aa6
initialize to stdnormal when no params passed (and log). Change unit …
Jul 9, 2015
1bb44bd
style and doc changes. Factored out ks test into 2 separate tests
Jul 9, 2015
a48ae7b
refactor code to account for serializable RealDistribution. Reuse tes…
Jul 9, 2015
1f56371
changed ksTest in public API to kolmogorovSmirnovTest for clarity
Jul 9, 2015
0d0c201
kstTest -> kolmogorovSmirnovTest in statistics.md
Jul 9, 2015
bbb30b1
renamed KSTestResult to KolmogorovSmirnovTestResult, to stay consiste…
Jul 11, 2015
08834f4
Merge remote-tracking branch 'upstream/master'
Jul 13, 2015
renamed dat to data and eliminated recalc of RDD size by sharing as argument between empirical and evalOneSampleP
jose.cambronero committed Jun 25, 2015
commit 16b5c4cfc4cc81164e3a5e2e8adaca017b7e4514
32 changes: 17 additions & 15 deletions mllib/src/main/scala/org/apache/spark/mllib/stat/test/KSTest.scala
@@ -41,49 +41,51 @@ private[stat] object KSTest {
/**
* Calculate empirical cumulative distribution values needed for KS statistic
* @param data `RDD[Double]` on which to calculate empirical cumulative distribution values
- * @return and RDD of (Double, Double, Double), where the first element in each tuple is the
+ * @param size Size of data
+ * @return RDD of (Double, Double, Double), where the first element in each tuple is the
* value, the second element is the ECDFV - 1 /n, and the third element is the ECDFV,
* where ECDF stands for empirical cumulative distribution function value
*/
- def empirical(data: RDD[Double]): RDD[(Double, Double, Double)] = {
- val n = data.count().toDouble
- data.sortBy(x => x).zipWithIndex().map { case (v, i) => (v, i / n, (i + 1) / n) }
+ def empirical(data: RDD[Double], size: Double): RDD[(Double, Double, Double)] = {
+ data.sortBy(x => x).zipWithIndex().map { case (v, i) => (v, i / size, (i + 1) / size) }
}
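
Review note: to make the tuple layout concrete, here is a minimal local sketch of the same computation on a plain `Seq` instead of an `RDD`. The object name `EcdfSketch` is hypothetical and not part of this PR; only the sort, index, and scale steps mirror the diff.

```scala
object EcdfSketch {
  // Local stand-in for the RDD pipeline in the diff: sort the values, attach a
  // zero-based index, and scale by the precomputed size to get
  // (value, ECDF value - 1/n, ECDF value).
  def empirical(data: Seq[Double], size: Double): Seq[(Double, Double, Double)] =
    data.sorted.zipWithIndex.map { case (v, i) => (v, i / size, (i + 1) / size) }
}
```

For example, `EcdfSketch.empirical(Seq(0.3, 0.1, 0.2, 0.4), 4.0)` pairs the smallest value `0.1` with the bounds `0.0` and `0.25`. Passing `size` in, as this commit does, lets the caller reuse a count it already has.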

/**
* Runs a KS test for 1 set of sample data, comparing it to a theoretical distribution
- * @param dat `RDD[Double]` to evaluate
+ * @param data `RDD[Double]` to evaluate
* @param cdf `Double => Double` function to calculate the theoretical CDF
* @return KSTestResult summarizing the test results (pval, statistic, and null hypothesis)
*/
- def testOneSample(dat: RDD[Double], cdf: Double => Double): KSTestResult = {
- val empiriRDD = empirical(dat) // empirical distribution
+ def testOneSample(data: RDD[Double], cdf: Double => Double): KSTestResult = {
+ val n = data.count()
+ val empiriRDD = empirical(data, n.toDouble) // empirical distribution
val distances = empiriRDD.map {
case (v, dl, dp) =>
val cdfVal = cdf(v)
Math.max(cdfVal - dl, dp - cdfVal)
}
val ksStat = distances.max()
- evalOneSampleP(ksStat, distances.count())
+ evalOneSampleP(ksStat, n)
}
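
Review note: the statistic computed here is the largest gap between the theoretical CDF and the two empirical bounds at each sorted point. The sketch below reproduces that step locally, assuming a plain `Seq` and a uniform CDF on [0, 1] as a stand-in distribution. `KsStatisticSketch` and `uniformCdf` are hypothetical names, and the p-value step (`evalOneSampleP`) is left out because its body is not shown in this diff.

```scala
object KsStatisticSketch {
  // Stand-in theoretical CDF: uniform on [0, 1]. Not part of the PR.
  val uniformCdf: Double => Double = x => math.min(1.0, math.max(0.0, x))

  // Largest distance between the CDF and the empirical bounds just below
  // (i / n) and at ((i + 1) / n) each sorted observation.
  // Throws on empty input, because `max` is undefined there.
  def ksStatistic(data: Seq[Double], cdf: Double => Double): Double = {
    val n = data.size.toDouble // counted once and reused, as in the commit
    data.sorted.zipWithIndex.map { case (v, i) =>
      val cdfVal = cdf(v)
      math.max(cdfVal - i / n, (i + 1) / n - cdfVal)
    }.max
  }
}
```

With `Seq(0.25, 0.5, 0.75)` against the uniform CDF, the per-point distances are 0.25, about 0.167, and 0.25, so the statistic is 0.25. In the RDD version, reusing `n` instead of calling `distances.count()` avoids a second count over the data.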

/**
* Runs a KS test for 1 set of sample data, comparing it to a theoretical distribution. Optimized
* such that each partition runs a separate mapping operation. This can help in cases where the
* CDF calculation involves creating an object. By using this implementation we can make sure
* only 1 object is created per partition, versus 1 per observation.
- * @param dat `RDD[Double]` to evaluate
+ * @param data `RDD[Double]` to evaluate
* @param distCalc a function to calculate the distance between the empirical values and the
* theoretical value
* @return KSTestResult summarizing the test results (pval, statistic, and null hypothesis)
*/
- def testOneSampleOpt(dat: RDD[Double],
+ def testOneSampleOpt(data: RDD[Double],
distCalc: Iterator[(Double, Double, Double)] => Iterator[Double])
: KSTestResult = {
- val empiriRDD = empirical(dat) // empirical distribution information
+ val n = data.count()
+ val empiriRDD = empirical(data, n.toDouble) // empirical distribution information
val distances = empiriRDD.mapPartitions(distCalc, false)
val ksStat = distances.max
- evalOneSampleP(ksStat, distances.count())
+ evalOneSampleP(ksStat, n)
}
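
Review note: the "one object per partition" point in the doc comment can be illustrated without Spark. The sketch below simulates partitions by chunking a local `Seq` and counts how often a stand-in distribution object is constructed. `PartitionSketch`, `UniformDist`, and `objectsBuilt` are hypothetical names for illustration only, and the chunking is just an approximation of how an RDD would be partitioned.

```scala
object PartitionSketch {
  private var built = 0
  def objectsBuilt: Int = built

  // Hypothetical stand-in for a distribution object that is costly to create.
  final class UniformDist {
    built += 1 // record each construction
    def cdf(x: Double): Double = math.min(1.0, math.max(0.0, x))
  }

  // Same shape as the distCalc parameter in the diff: one call per partition.
  val distCalc: Iterator[(Double, Double, Double)] => Iterator[Double] = part => {
    val dist = new UniformDist // created once per partition, not per element
    part.map { case (v, dl, dp) =>
      val cdfVal = dist.cdf(v)
      math.max(cdfVal - dl, dp - cdfVal)
    }
  }

  // Simulates mapPartitions by splitting the sorted data into chunks.
  def ksStatistic(data: Seq[Double], numPartitions: Int): Double = {
    val n = data.size.toDouble
    val ecdf = data.sorted.zipWithIndex.map { case (v, i) => (v, i / n, (i + 1) / n) }
    val partSize = math.max(1, math.ceil(ecdf.size.toDouble / numPartitions).toInt)
    ecdf.grouped(partSize).flatMap(chunk => distCalc(chunk.iterator)).max
  }
}
```

Running `PartitionSketch.ksStatistic(Seq(0.25, 0.5, 0.75, 0.1), 2)` builds two `UniformDist` instances for four observations, which is the saving the doc comment describes.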

/**
@@ -104,19 +106,19 @@ private[stat] object KSTest {
/**
* A convenience function that allows running the KS test for 1 set of sample data against
* a named distribution
- * @param dat the sample data that we wish to evaluate
+ * @param data the sample data that we wish to evaluate
* @param distName the name of the theoretical distribution
* @return KSTestResult summarizing the test results (pval, statistic, and null hypothesis)
*/
- def testOneSample(dat: RDD[Double], distName: String): KSTestResult = {
+ def testOneSample(data: RDD[Double], distName: String): KSTestResult = {
val distanceCalc =
distName match {
case "stdnorm" => stdNormDistances()
case _ => throw new UnsupportedOperationException(s"$distName not yet supported through" +
s"convenience method. Current options are:[stdnorm].")
}

- testOneSampleOpt(dat, distanceCalc)
+ testOneSampleOpt(data, distanceCalc)
}
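
Review note: a small standalone sketch of the name-based dispatch, under several assumptions. The body of `stdNormDistances()` is not shown in this diff, so the sketch returns a plain CDF function and uses a polynomial approximation of `erf` (Abramowitz and Stegun 7.1.26) in place of whatever the PR uses. `NamedDistSketch`, `stdNormCdf`, and `cdfFor` are hypothetical names. The sketch also writes the error message as a single string: as written in the diff, the two concatenated pieces have no space between "through" and "convenience".

```scala
object NamedDistSketch {
  // Polynomial approximation of erf (Abramowitz and Stegun 7.1.26),
  // absolute error around 1.5e-7. A stand-in, not the PR's implementation.
  private def erf(x: Double): Double = {
    val sign = if (x < 0) -1.0 else 1.0
    val ax = math.abs(x)
    val t = 1.0 / (1.0 + 0.3275911 * ax)
    val inner = ((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
    val poly = ((inner - 0.284496736) * t + 0.254829592) * t
    sign * (1.0 - poly * math.exp(-ax * ax))
  }

  // Standard normal CDF expressed through erf.
  def stdNormCdf(x: Double): Double = 0.5 * (1.0 + erf(x / math.sqrt(2.0)))

  // Maps a distribution name to its CDF; only "stdnorm" is known, as in the diff.
  def cdfFor(distName: String): Double => Double = distName match {
    case "stdnorm" => x => stdNormCdf(x)
    case _ => throw new UnsupportedOperationException(
      s"$distName not yet supported through convenience method. " +
        "Current options are: [stdnorm].")
  }
}
```

`NamedDistSketch.cdfFor("stdnorm")(0.0)` is approximately 0.5, and any other name fails with `UnsupportedOperationException`, matching the behavior of the match expression in the diff.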

private def evalOneSampleP(ksStat: Double, n: Long): KSTestResult = {