[SPARK-11207][ML] Add test cases for solver selection of LinearRegres… #9180
Changes from 1 commit

@@ -124,6 +124,59 @@ object LinearDataGenerator {
    y.zip(x).map(p => LabeledPoint(p._1, Vectors.dense(p._2)))
  }

+  /**
+   * Generate sample data for linear regression as a sequence of LabeledPoints with sparse
+   * feature vectors.
+   *
+   * @param intercept Data intercept.
+   * @param weights Weights to be applied.
+   * @param xMean Mean of the generated features. If the features are not properly standardized,
+   *              a poorly implemented optimizer will often have difficulty converging.
+   * @param xVariance Variance of the generated features.
+   * @param nPoints Number of points in the sample.
+   * @param seed Random seed.
+   * @param eps Epsilon scaling factor.
+   * @return Seq of LabeledPoint containing sparse vectors.
+   */

Member:
How about consolidating this with

Contributor (Author):
Yes, I also thought it is a good idea. But

Member:
Let's modify the JIRA and do it here. Basically, you can create a
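
A rough sketch of the consolidation being discussed: `generateLinearInput` could take a `sparsity` fraction and zero out feature values accordingly, so a separate sparse generator is no longer needed. The parameter name, its default value, and the plain dot product below are illustrative assumptions, not the signature that was eventually merged.

import scala.util.Random
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

def generateLinearInput(
    intercept: Double,
    weights: Array[Double],
    xMean: Array[Double],
    xVariance: Array[Double],
    nPoints: Int,
    seed: Int,
    eps: Double,
    sparsity: Double = 0.0): Seq[LabeledPoint] = {
  require(sparsity >= 0.0 && sparsity < 1.0, "sparsity must be in [0, 1)")
  val rnd = new Random(seed)
  // Features start uniform in [0, 1); each entry is either zeroed out (with probability
  // `sparsity`) or rescaled to the requested mean and variance.
  val x = Array.fill[Array[Double]](nPoints)(
    Array.fill[Double](weights.length)(rnd.nextDouble()))
  x.foreach { v =>
    var i = 0
    while (i < v.length) {
      if (rnd.nextDouble() < sparsity) {
        v(i) = 0.0
      } else {
        v(i) = (v(i) - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i)
      }
      i += 1
    }
  }
  // Label = w . x + intercept + Gaussian noise; a plain dot product keeps the sketch
  // self-contained instead of relying on the netlib BLAS import used in the real file.
  val y = x.map { xi =>
    xi.zip(weights).map { case (xij, wj) => xij * wj }.sum + intercept + eps * rnd.nextGaussian()
  }
  y.zip(x).map { p =>
    if (sparsity == 0.0) LabeledPoint(p._1, Vectors.dense(p._2))
    else LabeledPoint(p._1, Vectors.dense(p._2).toSparse)
  }
}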

+  @Since("1.6.0")
+  def generateLinearSparseInput(
+      intercept: Double,
+      weights: Array[Double],
+      xMean: Array[Double],
+      xVariance: Array[Double],
+      nPoints: Int,
+      seed: Int,
+      eps: Double): Seq[LabeledPoint] = {
+    val rnd = new Random(seed)
+    val x = Array.fill[Array[Double]](nPoints)(
+      Array.fill[Double](weights.length)(rnd.nextDouble()))
+
+    x.foreach { v =>

Member:
Once you have

Member:
You can also add variance to the sparsity so that the number of non-zeros will not be constant.
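
One way to implement the second suggestion, sketched with illustrative names (`baseSparsity` and `sparsityJitter` are not part of this PR): draw the sparsity level per row, so the number of non-zero entries varies from point to point instead of being roughly constant.

import scala.util.Random

// Zero out entries with a per-row probability so the non-zero count is not constant.
def sparsifyWithJitter(
    x: Array[Array[Double]],
    baseSparsity: Double,
    sparsityJitter: Double,
    rnd: Random): Unit = {
  x.foreach { v =>
    // Per-row sparsity in [baseSparsity - jitter, baseSparsity + jitter], clamped to [0.0, 0.99].
    val rowSparsity = math.min(0.99, math.max(0.0,
      baseSparsity + (2.0 * rnd.nextDouble() - 1.0) * sparsityJitter))
    var i = 0
    while (i < v.length) {
      if (rnd.nextDouble() < rowSparsity) v(i) = 0.0
      i += 1
    }
  }
}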

+      var i = 0
+      val len = v.length
+      while (i < len) {
+        if (rnd.nextDouble() < 0.7) {
+          v(i) = 0.0
+        } else {
+          v(i) = (v(i) - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i)
+        }
+        i += 1
+      }
+    }
+
+    val y = x.map { xi =>
+      blas.ddot(weights.length, xi, 1, weights, 1) + intercept + eps * rnd.nextGaussian()
+    }
+
Member:
To simplify the following code, do:

y.zip(x).map { p =>
  if (sparsity == 0.0) {
    LabeledPoint(p._1, Vectors.dense(p._2))
  } else {
    LabeledPoint(p._1, Vectors.dense(p._2).toSparse)
  }
}
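
This suggestion relies on `DenseVector.toSparse`, which keeps only the non-zero entries, so the manual `zipWithIndex`/`filter` bookkeeping in the chunk below would no longer be needed; it also assumes the consolidated method carries the `sparsity` parameter discussed above.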

+    val sparseX = x.map { (v: Array[Double]) =>
+      v.zipWithIndex.filter {
+        case (d: Double, i: Int) => d != 0.0
+      }.map {
+        case (d: Double, i: Int) => (i, d)
+      }
+    }
+    y.zip(sparseX).map(p => LabeledPoint(p._1, Vectors.sparse(weights.length, p._2)))
+  }
+
  /**
   * Generate an RDD containing sample data for Linear Regression models - including Ridge, Lasso,
   * and unregularized variants.

@@ -34,7 +34,7 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
  private val seed: Int = 42
  @transient var dataset: DataFrame = _
  @transient var datasetWithoutIntercept: DataFrame = _
-  @transient var datasetWithBigFeature: DataFrame = _
+  @transient var datasetWithManyFeature: DataFrame = _

Member:
Let's call it

Member:
Also, changed

  /*
     In `LinearRegressionSuite`, we will make sure that the model trained by SparkML

@@ -52,22 +52,27 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
    super.beforeAll()
    dataset = sqlContext.createDataFrame(
      sc.parallelize(LinearDataGenerator.generateLinearInput(
-        6.3, Array(4.7, 7.2), Array(0.9, -1.3), Array(0.7, 1.2), 10000, seed, 0.1), 2))
+        intercept = 6.3, weights = Array(4.7, 7.2), xMean = Array(0.9, -1.3),
+        xVariance = Array(0.7, 1.2), nPoints = 10000, seed = seed, eps = 0.1), 2))

Member:
make

    /*
       datasetWithoutIntercept is not needed for correctness testing but is useful for illustrating
       training model without intercept
     */
    datasetWithoutIntercept = sqlContext.createDataFrame(
      sc.parallelize(LinearDataGenerator.generateLinearInput(
-        0.0, Array(4.7, 7.2), Array(0.9, -1.3), Array(0.7, 1.2), 10000, seed, 0.1), 2))
+        intercept = 0.0, weights = Array(4.7, 7.2), xMean = Array(0.9, -1.3),
+        xVariance = Array(0.7, 1.2), nPoints = 10000, seed = seed, eps = 0.1), 2))

Member:
ditto

    val r = new Random(seed)
+    // In "auto" mode, the normal equation solver is used only when the feature size is
+    // at most 4096; with more features, linear regression falls back to L-BFGS.
    val featureSize = 4100

Contributor:
leave a comment about this value
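
For readers of the test, a minimal sketch of the threshold rule the 4096 value refers to; the constant and function names here are illustrative, not Spark's internal API.

// Illustrative only: names are hypothetical.
val MaxFeaturesForNormalSolver = 4096

def resolveSolver(solver: String, numFeatures: Int): String = solver match {
  // "auto" picks the closed-form normal equation solver only for small feature counts.
  case "auto" if numFeatures <= MaxFeaturesForNormalSolver => "normal"
  // With more features, linear regression falls back to iterative L-BFGS.
  case "auto" => "l-bfgs"
  // The user forced a specific solver.
  case explicit => explicit
}

// With featureSize = 4100 (> 4096), "auto" resolves to "l-bfgs", which is what the test
// near the end of this diff exercises.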

-    datasetWithBigFeature = sqlContext.createDataFrame(
-      sc.parallelize(LinearDataGenerator.generateLinearInput(
-        0.0, Seq.fill(featureSize)(r.nextDouble).toArray,
-        Seq.fill(featureSize)(r.nextDouble).toArray,
-        Seq.fill(featureSize)(r.nextDouble).toArray, 200, seed, 0.1
+    datasetWithManyFeature = sqlContext.createDataFrame(
+      sc.parallelize(LinearDataGenerator.generateLinearSparseInput(
+        intercept = 0.0, weights = Seq.fill(featureSize)(r.nextDouble).toArray,
+        xMean = Seq.fill(featureSize)(r.nextDouble).toArray,
+        xVariance = Seq.fill(featureSize)(r.nextDouble).toArray, nPoints = 200,
+        seed = seed, eps = 0.1
      ), 2))
  }


@@ -696,7 +701,7 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {

  test("linear regression model with l-bfgs with big feature datasets") {
    val trainer = new LinearRegression().setSolver("auto")
-    val model = trainer.fit(datasetWithBigFeature)
+    val model = trainer.fit(datasetWithManyFeature)

    // Training results for the model should be available
    assert(model.hasSummary)

extra line.