Changes from 1 commit
Commits (27)
13dfe4d
created test result class for ks test
Jun 24, 2015
c659ea1
created KS test class
Jun 24, 2015
4da189b
added user facing ks test functions
Jun 24, 2015
ce8e9a1
added kstest testing in HypothesisTestSuite
Jun 24, 2015
b9cff3a
made small changes to pass style check
Jun 24, 2015
f6951b6
changed style and some comments based on feedback from pull request
Jun 24, 2015
c18dc66
removed ksTestOpt from API and changed comments in HypothesisTestSuit…
Jun 24, 2015
16b5c4c
renamed dat to data and eliminated recalc of RDD size by sharing as a…
Jun 25, 2015
0b5e8ec
changed KS one sample test to perform just 1 distributed pass (in add…
Jun 25, 2015
4b8ba61
fixed off by 1/N in cases when post-constant adjustment ecdf is above…
Jun 25, 2015
6a4784f
specified what distributions are available for the convenience method…
Jun 26, 2015
992293b
Style changes as per comments and added implementation note explainin…
Jun 26, 2015
3f81ad2
renamed ks1 sample test for clarity
Jun 26, 2015
9c0f1af
additional style changes incorporated and added documentation to mlli…
Jun 26, 2015
1226b30
reindent multi-line lambdas, prior intepretation of style guide was w…
Jun 29, 2015
9026895
addressed style changes, correctness change to simpler approach, and …
Jul 7, 2015
3288e42
addressed style changes, correctness change to simpler approach, and …
Jul 7, 2015
e760ebd
line length changes to fit style check
Jul 7, 2015
7e66f57
copied implementation note to public api docs, and added @see for lin…
Jul 7, 2015
a4bc0c7
changed ksTest(data, distName) to ksTest(data, distName, params*) aft…
Jul 8, 2015
2ec2aa6
initialize to stdnormal when no params passed (and log). Change unit …
Jul 9, 2015
1bb44bd
style and doc changes. Factored out ks test into 2 separate tests
Jul 9, 2015
a48ae7b
refactor code to account for serializable RealDistribution. Reuse tes…
Jul 9, 2015
1f56371
changed ksTest in public API to kolmogorovSmirnovTest for clarity
Jul 9, 2015
0d0c201
kstTest -> kolmogorovSmirnovTest in statistics.md
Jul 9, 2015
bbb30b1
renamed KSTestResult to KolmogorovSmirnovTestResult, to stay consiste…
Jul 11, 2015
08834f4
Merge remote-tracking branch 'upstream/master'
Jul 13, 2015
additional style changes incorporated and added documentation to mllib statistics docs
jose.cambronero committed Jun 26, 2015
commit 9c0f1af882c930cafe55fe828c0c2d0fbe2d23f1
34 changes: 33 additions & 1 deletion docs/mllib-statistics.md
@@ -283,7 +283,7 @@ approxSample = data.sampleByKey(False, fractions);

Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically
significant, i.e. whether this result occurred by chance or not. MLlib currently supports Pearson's
chi-squared ( $\chi^2$) tests for goodness of fit and independence. The input data types determine
whether the goodness of fit or the independence test is conducted. The goodness of fit test requires
an input type of `Vector`, whereas the independence test requires a `Matrix` as input.

@@ -422,6 +422,38 @@ for i, result in enumerate(featureTestResults):

</div>

Additionally, MLlib provides a 1-sample, 2-sided implementation of the Kolmogorov-Smirnov test
for equality of probability distributions. By providing the name of a theoretical distribution
(currently only the standard normal distribution is supported), or a function to calculate
the cumulative distribution according to a given theoretical distribution, the user can
test the null hypothesis that their sample is drawn from that distribution.

Contributor: Kolmogorov-Smirnov -> Kolmogorov-Smirnov (KS)
Otherwise, we use KS without definition.

<div class="codetabs">
<div data-lang="scala" markdown="1">
[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to
run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
and interpret the hypothesis tests.

{% highlight scala %}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val data: RDD[Double] = ... // an RDD of sample data

// run a KS test for the sample versus a standard normal distribution
val ksTestResult = Statistics.ksTest(data, "stdnorm")
println(ksTestResult) // summary of the test including the p-value, test statistic,
// and null hypothesis
// if our p-value indicates significance, we can reject the null hypothesis

// perform a KS test using a cumulative distribution function of our own making
val myCDF: Double => Double = ...
val ksTestResult2 = Statistics.ksTest(data, myCDF)
{% endhighlight %}
</div>
</div>
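
The printed result is a human-readable summary. As a hedged aside, assuming `KSTestResult` follows the convention of MLlib's other `TestResult` classes and exposes `pValue`, `statistic`, and `nullHypothesis` fields (an assumption, not confirmed by this patch), the values can also be read off individually:

{% highlight scala %}
// Field names here assume KSTestResult follows MLlib's TestResult convention.
if (ksTestResult.pValue < 0.05) {
  println(s"Rejecting null hypothesis: ${ksTestResult.nullHypothesis}")
  println(s"KS statistic: ${ksTestResult.statistic}")
}
{% endhighlight %}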


## Random data generation

Random data generation is useful for randomized algorithms, prototyping, and performance testing.
54 changes: 27 additions & 27 deletions mllib/src/main/scala/org/apache/spark/mllib/stat/test/KSTest.scala
@@ -60,9 +60,9 @@ private[stat] object KSTest {
  def testOneSample(data: RDD[Double], cdf: Double => Double): KSTestResult = {
    val n = data.count().toDouble
    val localData = data.sortBy(x => x).mapPartitions { part =>
      val partDiffs = oneSampleDifferences(part, n, cdf) // local distances
Contributor: Indent these two spaces. The general rule is:
  • when you have an open curly brace, all following lines are indented two spaces until the ending curly brace
  • the ending curly brace is at the same level of indentation as the first non-space character on the line that has the starting curly brace

Author: Done. Sorry, I misunderstood the comment before and thought you meant to un-indent the entire thing, not just the last line. Fixed now.

      searchOneSampleCandidates(partDiffs) // candidates: local extrema
    }.collect()
    val ksStat = searchOneSampleStatistic(localData, n) // result: global extreme
    evalOneSampleP(ksStat, n.toLong)
  }
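
As a minimal local sketch of what the distributed code computes (plain Scala, no Spark; ksStatLocal is a hypothetical helper, not part of this patch): for a sorted sample, the KS statistic is the largest gap between the empirical CDF and the theoretical CDF, checked just below and just above each sample point.

  // Single-machine reference version of the 1-sample, 2-sided KS statistic.
  def ksStatLocal(sorted: Array[Double], cdf: Double => Double): Double = {
    val n = sorted.length.toDouble
    sorted.zipWithIndex.map { case (x, i) =>
      val cdfVal = cdf(x)
      // the ECDF jumps from i/n to (i + 1)/n at x, so both sides are checked
      math.max((i + 1) / n - cdfVal, cdfVal - i / n)
    }.max
  }

The distributed version above reproduces this while avoiding extra passes: each partition ships only its extreme (signed, unadjusted) differences and its element count to the driver.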
@@ -76,9 +76,9 @@ private[stat] object KSTest {
  def testOneSample(data: RDD[Double], createDist: () => RealDistribution): KSTestResult = {
    val n = data.count().toDouble
    val localData = data.sortBy(x => x).mapPartitions { part =>
      val partDiffs = oneSampleDifferences(part, n, createDist) // local distances
      searchOneSampleCandidates(partDiffs) // candidates: local extrema
    }.collect()
    val ksStat = searchOneSampleStatistic(localData, n) // result: global extreme
    evalOneSampleP(ksStat, n.toLong)
Contributor: Is it almost the same as testOneSample(..., cdf)? Should we reuse the code?

Author: @mengxr I've refactored the code, accounting for the fact that RealDistribution in math3 3.4.1 is serializable (which Sean had already pointed out to me), so that we now indeed use testOneSample(..., cdf) and avoid duplication.

  }
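
A brief sketch of the factory pattern this overload relies on (the setup lines are illustrative only; of these names, only createDist and NormalDistribution appear in the patch): passing a () => RealDistribution rather than a distribution instance means each executor builds its own object, so only the function itself needs to be serialized into the closure.

  import org.apache.commons.math3.distribution.{NormalDistribution, RealDistribution}

  // The factory is trivially serializable; the distribution object itself
  // is constructed fresh wherever the closure runs.
  val createDist: () => RealDistribution = () => new NormalDistribution(0, 1)
  val dist = createDist()                  // done once per partition in the real code
  val p = dist.cumulativeProbability(0.0)  // 0.5 for the standard normal

As the author notes above, RealDistribution turned out to be serializable in commons-math3 3.4.1, which later allowed this overload to delegate to the CDF-based testOneSample.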
@@ -101,14 +101,14 @@ private[stat] object KSTest {
    // zip data with index (within that partition)
    // calculate local (unadjusted) ECDF and subtract CDF
    partData.zipWithIndex.map { case (v, ix) =>
      // dp and dl are later adjusted by a constant, once global info is available
      val dp = (ix + 1) / n
      val dl = ix / n
      val cdfVal = cdf(v)
      // if dp > cdfVal, the adjusted dp is still above cdfVal; if dp < cdfVal,
      // we want a negative distance so that the constant adjustment gives the correct distance
      if (dp > cdfVal) dp - cdfVal else dl - cdfVal
    }
  }
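
To make the sign convention concrete, here is a standalone version of the per-element rule (unadjustedDiff is a hypothetical helper, not in the patch), followed by a worked value:

  // Returns a signed, partition-local difference; adding the cross-partition
  // constant (count of preceding elements / n) later recovers the true
  // ECDF - CDF distance.
  def unadjustedDiff(localIx: Int, n: Double, cdfVal: Double): Double = {
    val dp = (localIx + 1) / n  // local ECDF just after the point
    val dl = localIx / n        // local ECDF just before the point
    if (dp > cdfVal) dp - cdfVal else dl - cdfVal
  }

For example, with localIx = 0, n = 4 and cdfVal = 0.3: dp = 0.25, which is below 0.3, so the value kept is dl - cdfVal = -0.3, a negative distance that the later constant adjustment can shift to its correct global value.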

private def oneSampleDifferences(
@@ -132,8 +132,8 @@
      : Iterator[(Double, Double, Double)] = {
    val initAcc = (Double.MaxValue, Double.MinValue, 0.0)
    val partResults = partDiffs.foldLeft(initAcc) { case ((pMin, pMax, pCt), currDiff) =>
      (Math.min(pMin, currDiff), Math.max(pMax, currDiff), pCt + 1)
    }
    Array(partResults).iterator
  }
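
A standalone sketch of this per-partition reduction with illustrative values: only the (min, max, count) triple survives, which is all the driver-side pass needs.

  val diffs = Iterator(-0.05, 0.12, 0.03)
  val (mn, mx, ct) = diffs.foldLeft((Double.MaxValue, Double.MinValue, 0.0)) {
    case ((pMin, pMax, pCt), d) => (Math.min(pMin, d), Math.max(pMax, d), pCt + 1)
  }
  // mn = -0.05, mx = 0.12, ct = 3.0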

@@ -152,16 +152,16 @@
    // adjust differences based on the # of elements preceding them, which should provide
    // the correct distance between the ECDF and the CDF
    val results = localData.foldLeft(initAcc) { case ((prevMax, prevCt), (minCand, maxCand, ct)) =>
      val adjConst = prevCt / n
      val pdist1 = minCand + adjConst
      val pdist2 = maxCand + adjConst
      // adjust by 1 / N if the pre-constant value is less than the CDF and the
      // post-constant value is greater than or equal to the CDF
      val dist1 = if (pdist1 >= 0 && minCand < 0) pdist1 + 1 / n else Math.abs(pdist1)
      val dist2 = if (pdist2 >= 0 && maxCand < 0) pdist2 + 1 / n else Math.abs(pdist2)
      val maxVal = Array(prevMax, dist1, dist2).max
      (maxVal, prevCt + ct)
    }
    results._1
  }
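
In effect, this driver-side fold evaluates the standard form of the 1-sample statistic (a reconstruction from the comments above, in conventional KS notation):

$$
D_n = \sup_x |F_n(x) - F(x)| = \max_{1 \le i \le n} \max\left( \frac{i}{n} - F(x_{(i)}),\; F(x_{(i)}) - \frac{i - 1}{n} \right)
$$

where $x_{(i)}$ is the $i$-th order statistic. The constant prevCt / n converts each partition's local index into the global $i$, and the extra 1 / n term covers the case where adding that constant flips a difference's sign, i.e. where the maximizing branch switches from $(i-1)/n$ to $i/n$.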

@@ -177,7 +177,7 @@ private[stat] object KSTest {
    distName match {
      case "stdnorm" => () => new NormalDistribution(0, 1)
      case _ => throw new UnsupportedOperationException(s"$distName not yet supported through" +
-       s"convenience method. Current options are:[stdnorm].")
+       s" convenience method. Current options are:[stdnorm].")
Contributor: are:[stdnorm] -> are: [norm]

    }

    testOneSample(data, distanceCalc)