[SPARK-11207] Improve test case with many feature datasets
Lewuathe committed Oct 21, 2015
commit f85bca6667dcebbfccbd50cde46b11f6855d1974
@@ -124,6 +124,59 @@ object LinearDataGenerator {
    y.zip(x).map(p => LabeledPoint(p._1, Vectors.dense(p._2)))
  }

  /**
   *

Member: extra line.

   * @param intercept Data intercept.
   * @param weights Weights to be applied.
   * @param xMean The mean of the generated features. Often, if the features are not
   *              properly standardized, a poorly implemented algorithm will have
   *              difficulty converging.
   * @param xVariance The variance of the generated features.
   * @param nPoints Number of points in the sample.
   * @param seed Random seed.
   * @param eps Epsilon scaling factor.
   * @return Seq of LabeledPoint containing sparse vectors.
   */
Member: How about consolidating this with LinearDataGenerator and adding sparsity = 1.0
as a parameter to control whether the features are sparse?

Contributor Author: Yes, I also thought that would be a good idea. But LinearDataGenerator
is used as a static object, so we would have to pass sparsity as a parameter to
generateLinearInput. That method seems to be used in a lot of suites, so a lot of call
sites would have to change. Therefore it might be better to do this in a separate JIRA.
What do you think?

Member: Let's modify the JIRA and do it here. Basically, you can keep a
LinearDataGenerator method with the old signature calling the new API for compatibility.
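For concreteness, a minimal sketch of that consolidation, assuming a new sparsity
parameter (expected fraction of zeroed features) and the same netlib BLAS import that
LinearDataGenerator already uses; this illustrates the reviewer's proposal and is not
the merged code:

import scala.util.Random

import com.github.fommil.netlib.BLAS.{getInstance => blas}

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object LinearDataGeneratorSketch {

  // New API: sparsity is the expected fraction of zeroed-out features.
  def generateLinearInput(
      intercept: Double,
      weights: Array[Double],
      xMean: Array[Double],
      xVariance: Array[Double],
      nPoints: Int,
      seed: Int,
      eps: Double,
      sparsity: Double): Seq[LabeledPoint] = {
    require(0.0 <= sparsity && sparsity <= 1.0)
    val rnd = new Random(seed)
    val x = Array.fill(nPoints)(Array.fill(weights.length)(rnd.nextDouble()))
    x.foreach { v =>
      var i = 0
      while (i < v.length) {
        if (rnd.nextDouble() < sparsity) {
          v(i) = 0.0  // zeroed with probability sparsity
        } else {
          v(i) = (v(i) - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i)
        }
        i += 1
      }
    }
    val y = x.map { xi =>
      blas.ddot(weights.length, xi, 1, weights, 1) + intercept + eps * rnd.nextGaussian()
    }
    y.zip(x).map { p =>
      // Dense output when no sparsity was requested, sparse otherwise.
      if (sparsity == 0.0) LabeledPoint(p._1, Vectors.dense(p._2))
      else LabeledPoint(p._1, Vectors.dense(p._2).toSparse)
    }
  }

  // Old signature preserved so existing suites keep compiling unchanged.
  def generateLinearInput(
      intercept: Double,
      weights: Array[Double],
      xMean: Array[Double],
      xVariance: Array[Double],
      nPoints: Int,
      seed: Int,
      eps: Double): Seq[LabeledPoint] =
    generateLinearInput(intercept, weights, xMean, xVariance, nPoints, seed, eps,
      sparsity = 0.0)
}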

  @Since("1.6.0")
  def generateLinearSparseInput(
      intercept: Double,
      weights: Array[Double],
      xMean: Array[Double],
      xVariance: Array[Double],
      nPoints: Int,
      seed: Int,
      eps: Double): Seq[LabeledPoint] = {
    val rnd = new Random(seed)
    val x = Array.fill[Array[Double]](nPoints)(
      Array.fill[Double](weights.length)(rnd.nextDouble()))

    x.foreach { v =>
Member: Once you have sparsity, randomly choose n = numFeatures * (1 - sparsity)
features as non-zero, and zero the rest out.

Member: You can also add variance to the sparsity so that the number of non-zeros
is not constant.
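A small sketch of that exact-count variant; the helper name and the shuffle-based
index selection are assumptions, not code from this PR:

import scala.util.Random

// Keep exactly n = numFeatures * (1 - sparsity) features non-zero and zero out
// the rest. To make the non-zero count non-constant, n itself could be jittered.
def zeroOutRandomFeatures(v: Array[Double], sparsity: Double, rnd: Random): Unit = {
  val numNonZero = math.round(v.length * (1.0 - sparsity)).toInt
  // Randomly pick which indices survive.
  val keep = rnd.shuffle(v.indices.toList).take(numNonZero).toSet
  var i = 0
  while (i < v.length) {
    if (!keep.contains(i)) v(i) = 0.0
    i += 1
  }
}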

      var i = 0
      val len = v.length
      while (i < len) {
        if (rnd.nextDouble() < 0.7) {
          // Zero out each feature with probability 0.7 to make the vector sparse.
          v(i) = 0.0
        } else {
          // A uniform [0, 1) draw has mean 0.5 and variance 1/12, so this shift
          // and scale gives the feature mean xMean(i) and variance xVariance(i).
          v(i) = (v(i) - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i)
        }
        i += 1
      }
    }

    val y = x.map { xi =>
      // Label = weights . x + intercept + Gaussian noise scaled by eps.
      blas.ddot(weights.length, xi, 1, weights, 1) + intercept + eps * rnd.nextGaussian()
    }

Member: To simplify the following code, do

y.zip(x).map { p =>
  if (sparsity == 0.0) {
    LabeledPoint(p._1, Vectors.dense(p._2))
  } else {
    LabeledPoint(p._1, Vectors.dense(p._2).toSparse)
  }
}
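A note on this suggestion: Vectors.dense(p._2).toSparse keeps only the non-zero
entries, so it should produce the same vectors as the manual zipWithIndex/filter
construction below while letting one code path serve both the dense and sparse cases.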

    val sparseX = x.map { (v: Array[Double]) =>
      v.zipWithIndex.filter {
        case (d: Double, i: Int) => d != 0.0
      }.map {
        case (d: Double, i: Int) => (i, d)
      }
    }
    y.zip(sparseX).map(p => LabeledPoint(p._1, Vectors.sparse(weights.length, p._2)))
  }
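For illustration, a hedged usage sketch of the generator as committed here; the
parameter values are made up:

import org.apache.spark.mllib.util.LinearDataGenerator

// Illustrative only: 100 points over 5 features with unit variance; roughly 70%
// of the feature entries will be zero, per the hard-coded threshold above.
val points = LinearDataGenerator.generateLinearSparseInput(
  intercept = 1.0,
  weights = Array(0.5, -0.3, 0.0, 1.2, 2.0),
  xMean = Array.fill(5)(0.0),
  xVariance = Array.fill(5)(1.0),
  nPoints = 100,
  seed = 42,
  eps = 0.1)
// Each element is a LabeledPoint whose features are stored as a sparse vector.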

  /**
   * Generate an RDD containing sample data for Linear Regression models - including
   * Ridge, Lasso, and unregularized variants.
@@ -34,7 +34,7 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
  private val seed: Int = 42
  @transient var dataset: DataFrame = _
  @transient var datasetWithoutIntercept: DataFrame = _
- @transient var datasetWithBigFeature: DataFrame = _
+ @transient var datasetWithManyFeature: DataFrame = _

Member: Let's call it datasetWithSparseFeature.

Member: Also, change dataset to datasetWithDenseFeature, and datasetWithoutIntercept
to datasetWithDenseFeatureWithoutIntercept.

  /*
     In `LinearRegressionSuite`, we will make sure that the model trained by SparkML
@@ -52,22 +52,27 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
    super.beforeAll()
    dataset = sqlContext.createDataFrame(
      sc.parallelize(LinearDataGenerator.generateLinearInput(
-       6.3, Array(4.7, 7.2), Array(0.9, -1.3), Array(0.7, 1.2), 10000, seed, 0.1), 2))
+       intercept = 6.3, weights = Array(4.7, 7.2), xMean = Array(0.9, -1.3),
+       xVariance = Array(0.7, 1.2), nPoints = 10000, seed = seed, eps = 0.1), 2))
Member: seed = seed is not necessary; it's self-explanatory.

Member: Make seed = seed into just seed.

    /*
       datasetWithoutIntercept is not needed for correctness testing, but is useful
       for illustrating training a model without an intercept.
     */
    datasetWithoutIntercept = sqlContext.createDataFrame(
      sc.parallelize(LinearDataGenerator.generateLinearInput(
-       0.0, Array(4.7, 7.2), Array(0.9, -1.3), Array(0.7, 1.2), 10000, seed, 0.1), 2))
+       intercept = 0.0, weights = Array(4.7, 7.2), xMean = Array(0.9, -1.3),
+       xVariance = Array(0.7, 1.2), nPoints = 10000, seed = seed, eps = 0.1), 2))
Member: Ditto.


    val r = new Random(seed)
    // When the feature size is larger than 4096, L-BFGS is chosen as the solver of
    // linear regression in "auto" mode, rather than the normal-equation solver.
    val featureSize = 4100
Contributor: Leave a comment explaining this value of 4100.
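One way to address this, assuming the 4096 cutoff corresponds to
WeightedLeastSquares.MAX_NUM_FEATURES in the "auto" solver selection:

// 4100 is deliberately just above the 4096-feature limit of the normal-equation
// solver (WeightedLeastSquares.MAX_NUM_FEATURES), so "auto" falls back to L-BFGS.
val featureSize = 4100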

-   datasetWithBigFeature = sqlContext.createDataFrame(
-     sc.parallelize(LinearDataGenerator.generateLinearInput(
-       0.0, Seq.fill(featureSize)(r.nextDouble).toArray,
-       Seq.fill(featureSize)(r.nextDouble).toArray,
-       Seq.fill(featureSize)(r.nextDouble).toArray, 200, seed, 0.1
+   datasetWithManyFeature = sqlContext.createDataFrame(
+     sc.parallelize(LinearDataGenerator.generateLinearSparseInput(
+       intercept = 0.0, weights = Seq.fill(featureSize)(r.nextDouble).toArray,
+       xMean = Seq.fill(featureSize)(r.nextDouble).toArray,
+       xVariance = Seq.fill(featureSize)(r.nextDouble).toArray, nPoints = 200,
+       seed = seed, eps = 0.1
      ), 2))
}

@@ -696,7 +701,7 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {

test("linear regression model with l-bfgs with big feature datasets") {
val trainer = new LinearRegression().setSolver("auto")
val model = trainer.fit(datasetWithBigFeature)
val model = trainer.fit(datasetWithManyFeature)

    // Training results for the model should be available
    assert(model.hasSummary)