@@ -83,7 +83,6 @@ object LinearDataGenerator {
nPoints, seed, eps)}
Member:

The formatting in the previous method

  def generateLinearInput(
      intercept: Double,
      weights: Array[Double],
      nPoints: Int,
      seed: Int,
      eps: Double = 0.1): Seq[LabeledPoint] = {
    generateLinearInput(intercept, weights,
      Array.fill[Double](weights.length)(0.0),
      Array.fill[Double](weights.length)(1.0 / 3.0),
      nPoints, seed, eps)}

looks weird to me. Can you fix it in this PR? Thanks.


/**
*
* @param intercept Data intercept
* @param weights Weights to be applied.
* @param xMean the mean of the generated features. Lots of time, if the features are not properly
@@ -104,24 +103,71 @@ object LinearDataGenerator {
      nPoints: Int,
      seed: Int,
      eps: Double): Seq[LabeledPoint] = {
    generateLinearInputInternal(intercept, weights, xMean, xVariance, nPoints, seed, eps, 0.0)
  }


/**
* @param intercept Data intercept
* @param weights Weights to be applied.
* @param xMean the mean of the generated features. Lots of time, if the features are not properly
* standardized, the algorithm with poor implementation will have difficulty
* to converge.
* @param xVariance the variance of the generated features.
* @param nPoints Number of points in sample.
* @param seed Random seed
* @param eps Epsilon scaling factor.
* @param sparcity The ratio of zero elements. If it is 0.0, LabeledPoints with
Member:

Typo: sparsity

* DenseVector is returned.
* @return Seq of input.
*/
Member:

How about consolidating with LinearDataGenerator, and adding sparsity = 1.0 as a param to control whether it's a sparse feature?

Contributor Author:

Yes, I also thought it was a good idea. But LinearDataGenerator is used as a static object, so we would have to pass sparsity as a parameter to generateLinearInput. That method is used in a lot of suites, so many call sites would need to change.
Therefore it might be better to do this in a separate JIRA. What do you think?

Member:

Let's modify the JIRA and do it here. Basically, you can keep a generateLinearInput with the old signature that calls the new API, for compatibility.
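A minimal sketch of that compatibility shim (it mirrors the delegation already visible in the diff above; nothing here is new API):

  // The old signature delegates to the extended method; sparsity 0.0 keeps the dense behavior.
  def generateLinearInput(
      intercept: Double,
      weights: Array[Double],
      xMean: Array[Double],
      xVariance: Array[Double],
      nPoints: Int,
      seed: Int,
      eps: Double): Seq[LabeledPoint] = {
    generateLinearInputInternal(intercept, weights, xMean, xVariance, nPoints, seed, eps, 0.0)
  }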

@Since("1.6.0")
def generateLinearInputInternal(
Member:

Just call it generateLinearInput, without Internal.

      intercept: Double,
      weights: Array[Double],
      xMean: Array[Double],
      xVariance: Array[Double],
      nPoints: Int,
      seed: Int,
      eps: Double,
      sparcity: Double): Seq[LabeledPoint] = {
Member:

Ditto, typo.

    require(sparcity <= 1.0)
Member:

Okay, I think it's okay to have sparsity == 1.0. Just have everything zeros:

    require(0.0 <= sparsity && sparsity <= 1.0)

    val rnd = new Random(seed)
    val x = Array.fill[Array[Double]](nPoints)(
      Array.fill[Double](weights.length)(rnd.nextDouble()))

    x.foreach { v =>
Member:

Once you have sparsity, randomly choose n = numFeatures * (1 - sparsity) non-zero features, and zero the rest out.
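A minimal sketch of that suggestion (the helper name and shuffle-based selection are illustrative, not from the PR): keep exactly numFeatures * (1 - sparsity) entries per row and zero out the rest, so the non-zero count is fixed rather than binomial.

  import scala.util.Random

  // Keep exactly (1 - sparsity) of the entries, chosen uniformly at random; zero the rest.
  def sparsifyExact(v: Array[Double], sparsity: Double, rnd: Random): Array[Double] = {
    val numNonZero = (v.length * (1.0 - sparsity)).toInt
    val keep = rnd.shuffle(v.indices.toList).take(numNonZero).toSet
    v.indices.map(i => if (keep.contains(i)) v(i) else 0.0).toArray
  }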

Member:

You can also add variance to the sparsity, so that the number of non-zeros is not constant.
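A sketch of that refinement (the jitter parameter and clamping are assumptions, not from the PR): draw a per-row sparsity around the target before zeroing.

  import scala.util.Random

  // Per-row sparsity jittered uniformly around the target and clamped to [0, 1],
  // so the number of zeroed entries varies from row to row.
  def rowSparsity(target: Double, jitter: Double, rnd: Random): Double =
    math.min(1.0, math.max(0.0, target + (rnd.nextDouble() * 2.0 - 1.0) * jitter))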

      var i = 0
      val len = v.length
      val sparceRnd = new Random(seed)
Member:

Since you seed rnd and sparceRnd with the same seed, both of them will generate the same sequence of random numbers, which is not what you want. You should be able to use the same random number generator, which will give you uncorrelated random numbers, for both creating the features and choosing which columns to zero out.
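A sketch of the single-generator approach (the literal values are illustrative): interleave the feature draw and the zero-out draw on one stream, so the two are uncorrelated without a second Random.

  import scala.util.Random

  val rnd = new Random(42)
  val sparsity = 0.7
  val xMean = 0.9
  val xVariance = 0.7

  val row = Array.fill(10) {
    val u = rnd.nextDouble()            // feature draw
    if (rnd.nextDouble() < sparsity) {  // zero-out draw from the same stream
      0.0
    } else {
      (u - 0.5) * math.sqrt(12.0 * xVariance) + xMean
    }
  }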

Contributor Author:

If we use the same random generator for both creating the features and choosing which columns to zero out, x differs from the current values. This causes unit test failures. Can we change the assertion tolerances or the target values written in LinearRegressionSuite?

      while (i < len) {
-       v(i) = (v(i) - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i)
+       if (sparceRnd.nextDouble() < sparcity) {
+         v(i) = 0.0
+       } else {
+         v(i) = (v(i) - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i)
+       }
        i += 1
      }
    }

    val y = x.map { xi =>
      blas.ddot(weights.length, xi, 1, weights, 1) + intercept + eps * rnd.nextGaussian()
    }
-   y.zip(x).map(p => LabeledPoint(p._1, Vectors.dense(p._2)))

Member:

To simplify the following code, do:

y.zip(x).map { p => 
  if (sparsity == 0.0) {
    LabeledPoint(p._1, Vectors.dense(p._2))
  } else {
    LabeledPoint(p._1, Vectors.dense(p._2).toSparse)
  }
}
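For context: Vectors.dense(p._2).toSparse keeps only the non-zero entries, so with this suggestion the separate sparseX index-pair construction below would no longer be needed.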

    val sparseX = x.map { (v: Array[Double]) =>
      v.zipWithIndex.filter {
        case (d: Double, i: Int) => d != 0.0
      }.map {
        case (d: Double, i: Int) => (i, d)
      }
    }
    if (sparcity == 0.0) {
      // Return LabeledPoints with DenseVector
      y.zip(x).map(p => LabeledPoint(p._1, Vectors.dense(p._2)))
    } else {
      // Return LabeledPoints with SparseVector
      y.zip(sparseX).map(p => LabeledPoint(p._1, Vectors.sparse(weights.length, p._2)))
    }
  }
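For reference, the scaling in the loop above is the standard uniform-to-moments transform: a draw $u_i \sim \mathrm{Uniform}(0, 1)$ has variance $1/12$, so

    $x_i = (u_i - 0.5)\sqrt{12\,\sigma_i^2} + \mu_i$

has mean $\mu_i$ (xMean(i)) and variance $\sigma_i^2$ (xVariance(i)).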

/**
@@ -34,6 +34,7 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
  private val seed: Int = 42
  @transient var dataset: DataFrame = _
  @transient var datasetWithoutIntercept: DataFrame = _
  @transient var datasetWithManyFeature: DataFrame = _

Member:

Let's call it datasetWithSparseFeature.

Member:

Also, change dataset into datasetWithDenseFeature, and datasetWithoutIntercept into datasetWithDenseFeatureWithoutIntercept.

/*
In `LinearRegressionSuite`, we will make sure that the model trained by SparkML
@@ -51,14 +52,27 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
    super.beforeAll()
    dataset = sqlContext.createDataFrame(
      sc.parallelize(LinearDataGenerator.generateLinearInput(
-       6.3, Array(4.7, 7.2), Array(0.9, -1.3), Array(0.7, 1.2), 10000, seed, 0.1), 2))
+       intercept = 6.3, weights = Array(4.7, 7.2), xMean = Array(0.9, -1.3),
+       xVariance = Array(0.7, 1.2), nPoints = 10000, seed = seed, eps = 0.1), 2))
Member:

seed = seed is not necessary; it's self-explanatory.

Member:

Make seed = seed into just seed.

    /*
       datasetWithoutIntercept is not needed for correctness testing, but is useful for illustrating
       training a model without an intercept.
     */
    datasetWithoutIntercept = sqlContext.createDataFrame(
      sc.parallelize(LinearDataGenerator.generateLinearInput(
-       0.0, Array(4.7, 7.2), Array(0.9, -1.3), Array(0.7, 1.2), 10000, seed, 0.1), 2))
+       intercept = 0.0, weights = Array(4.7, 7.2), xMean = Array(0.9, -1.3),
+       xVariance = Array(0.7, 1.2), nPoints = 10000, seed = seed, eps = 0.1), 2))
Member:

Ditto.


    val r = new Random(seed)
    // When the feature size is larger than 4096, l-bfgs is chosen as the solver
    // of linear regression in the case of "auto" mode.
    val featureSize = 4100
Contributor:

Leave a comment about this value, 4100.

    datasetWithManyFeature = sqlContext.createDataFrame(
      sc.parallelize(LinearDataGenerator.generateLinearInputInternal(
        intercept = 0.0, weights = Seq.fill(featureSize)(r.nextDouble).toArray,
        xMean = Seq.fill(featureSize)(r.nextDouble).toArray,
        xVariance = Seq.fill(featureSize)(r.nextDouble).toArray, nPoints = 200,
        seed = seed, eps = 0.1, sparcity = 0.7), 2))
  }

test("params") {
Expand Down Expand Up @@ -186,19 +200,15 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
      val trainer2 = (new LinearRegression).setElasticNetParam(1.0).setRegParam(0.57)
        .setSolver(solver).setStandardization(false)

-     var model1: LinearRegressionModel = null
-     var model2: LinearRegressionModel = null
-
      // Normal optimizer is not supported with only L1 regularization case.
      if (solver == "normal") {
        intercept[IllegalArgumentException] {
          trainer1.fit(dataset)
          trainer2.fit(dataset)
        }
      } else {
-       model1 = trainer1.fit(dataset)
-       model2 = trainer2.fit(dataset)
-
+       val model1 = trainer1.fit(dataset)
+       val model2 = trainer2.fit(dataset)

/*
weights <- coef(glmnet(features, label, family="gaussian", alpha = 1.0, lambda = 0.57))
@@ -247,18 +257,15 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
      val trainer2 = (new LinearRegression).setElasticNetParam(1.0).setRegParam(0.57)
        .setFitIntercept(false).setStandardization(false).setSolver(solver)

-     var model1: LinearRegressionModel = null
-     var model2: LinearRegressionModel = null
-
      // Normal optimizer is not supported with only L1 regularization case.
      if (solver == "normal") {
        intercept[IllegalArgumentException] {
          trainer1.fit(dataset)
          trainer2.fit(dataset)
        }
      } else {
-       model1 = trainer1.fit(dataset)
-       model2 = trainer2.fit(dataset)
+       val model1 = trainer1.fit(dataset)
+       val model2 = trainer2.fit(dataset)

/*
weights <- coef(glmnet(features, label, family="gaussian", alpha = 1.0, lambda = 0.57,
@@ -408,18 +415,15 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
      val trainer2 = (new LinearRegression).setElasticNetParam(0.3).setRegParam(1.6)
        .setStandardization(false).setSolver(solver)

-     var model1: LinearRegressionModel = null
-     var model2: LinearRegressionModel = null
-
      // Normal optimizer is not supported with non-zero elasticnet parameter.
      if (solver == "normal") {
        intercept[IllegalArgumentException] {
          trainer1.fit(dataset)
          trainer2.fit(dataset)
        }
      } else {
-       model1 = trainer1.fit(dataset)
-       model2 = trainer2.fit(dataset)
+       val model1 = trainer1.fit(dataset)
+       val model2 = trainer2.fit(dataset)

/*
weights <- coef(glmnet(features, label, family="gaussian", alpha = 0.3, lambda = 1.6))
@@ -469,18 +473,15 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
      val trainer2 = (new LinearRegression).setElasticNetParam(0.3).setRegParam(1.6)
        .setFitIntercept(false).setStandardization(false).setSolver(solver)

-     var model1: LinearRegressionModel = null
-     var model2: LinearRegressionModel = null
-
      // Normal optimizer is not supported with non-zero elasticnet parameter.
      if (solver == "normal") {
        intercept[IllegalArgumentException] {
          trainer1.fit(dataset)
          trainer2.fit(dataset)
        }
      } else {
-       model1 = trainer1.fit(dataset)
-       model2 = trainer2.fit(dataset)
+       val model1 = trainer1.fit(dataset)
+       val model2 = trainer2.fit(dataset)

/*
weights <- coef(glmnet(features, label, family="gaussian", alpha = 0.3, lambda = 1.6,
@@ -531,7 +532,6 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
    val trainerNoPredictionCol = trainer.setPredictionCol("")
    val modelNoPredictionCol = trainerNoPredictionCol.fit(dataset)

    // Training results for the model should be available
    assert(model.hasSummary)
    assert(modelNoPredictionCol.hasSummary)
@@ -585,6 +585,10 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
          .objectiveHistory
          .sliding(2)
          .forall(x => x(0) >= x(1)))
+     } else {
+       // To verify that the normal solver is used here.
+       assert(model.summary.objectiveHistory.length == 1)
+       assert(model.summary.objectiveHistory(0) == 0.0)
      }
    }
  }
@@ -693,4 +697,18 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
      assert(model4a0.weights ~== model4b.weights absTol 1E-3)
    }
  }

test("linear regression model with l-bfgs with big feature datasets") {
val trainer = new LinearRegression().setSolver("auto")
val model = trainer.fit(datasetWithManyFeature)

// Training results for the model should be available
assert(model.hasSummary)
// When LBFGS is used as optimizer, objective history can be restored.
assert(
model.summary
.objectiveHistory
.sliding(2)
.forall(x => x(0) >= x(1)))
}
}