[SPARK-11207] Add new API for generateLinearInput
Lewuathe committed Oct 24, 2015
commit 003d3bd87f3936c4fd6ee0dc77ca81f3811bcbd7
@@ -103,26 +103,10 @@ object LinearDataGenerator {
nPoints: Int,
seed: Int,
eps: Double): Seq[LabeledPoint] = {

val rnd = new Random(seed)
val x = Array.fill[Array[Double]](nPoints)(
Array.fill[Double](weights.length)(rnd.nextDouble()))

x.foreach { v =>
var i = 0
val len = v.length
while (i < len) {
v(i) = (v(i) - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i)
i += 1
}
}

val y = x.map { xi =>
blas.ddot(weights.length, xi, 1, weights, 1) + intercept + eps * rnd.nextGaussian()
}
y.zip(x).map(p => LabeledPoint(p._1, Vectors.dense(p._2)))
generateLinearInputInternal(intercept, weights, xMean, xVariance, nPoints, seed, eps, 0.0)
}


/**
* @param intercept Data intercept
* @param weights Weights to be applied.
@@ -133,10 +117,12 @@ object LinearDataGenerator {
* @param nPoints Number of points in sample.
* @param seed Random seed
* @param eps Epsilon scaling factor.
* @return Seq of LabeledPoint includes sparse vectors..
* @param sparcity The ratio of zero elements. If it is 0.0, LabeledPoints with
Member

Typo: sparsity

* DenseVector is returned.
* @return Seq of input.
*/
Member

How about consolidating this with LinearDataGenerator and adding sparsity = 1.0 as a param to control whether the features are sparse?

Contributor Author

Yes, I also thought that would be a good idea. But LinearDataGenerator is used as a static object, so we would have to pass sparsity as a parameter to generateLinearInput. This method seems to be used in a lot of suites, so many call sites would need to change.
Therefore it might be better to do this in a separate JIRA. What do you think?

Member

Let's modify the JIRA and do it here. For compatibility, you can keep a method in LinearDataGenerator with the old signature that calls the new API.
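
The compatibility approach suggested above can be sketched in plain Scala as follows. This is only an illustration, not the code from this PR: `GeneratorSketch`, the tuple return type, the reduced parameter list, and the way `sparsity` zeroes out features are all assumptions made to keep the example free of Spark dependencies.

```scala
import scala.util.Random

// Hypothetical stand-in for LinearDataGenerator; names and types are illustrative only.
object GeneratorSketch {

  /** New API: `sparsity` is (by assumption) the probability that a feature is set to zero. */
  def generateLinearInput(
      intercept: Double,
      weights: Array[Double],
      nPoints: Int,
      seed: Int,
      eps: Double,
      sparsity: Double): Seq[(Double, Seq[Double])] = {
    require(sparsity >= 0.0 && sparsity <= 1.0, "sparsity must be in [0, 1]")
    val rnd = new Random(seed)
    Seq.fill(nPoints) {
      // Draw each feature, then zero it out with probability `sparsity`.
      val x = Seq.fill(weights.length) {
        val v = rnd.nextDouble()
        if (rnd.nextDouble() < sparsity) 0.0 else v
      }
      // Label = dot(x, weights) + intercept + scaled Gaussian noise.
      val y = x.zip(weights).map { case (xi, wi) => xi * wi }.sum +
        intercept + eps * rnd.nextGaussian()
      (y, x)
    }
  }

  /** Old signature: existing callers compile unchanged and get dense output. */
  def generateLinearInput(
      intercept: Double,
      weights: Array[Double],
      nPoints: Int,
      seed: Int,
      eps: Double): Seq[(Double, Seq[Double])] =
    generateLinearInput(intercept, weights, nPoints, seed, eps, 0.0)
}
```

With this shape, test suites that call the five-argument overload need no changes, while new callers can opt in to sparse data through the extra argument.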

@Since("1.6.0")
def generateLinearSparseInput(
def generateLinearInputInternal(
Member

Just call it generateLinearInput without Internal.

intercept: Double,
weights: Array[Double],
xMean: Array[Double],
@@ -168,13 +154,19 @@ object LinearDataGenerator {
}

Member

To simplify the following code, do:

y.zip(x).map { p => 
  if (sparsity == 0.0) {
    LabeledPoint(p._1, Vectors.dense(p._2))
  } else {
    LabeledPoint(p._1, Vectors.dense(p._2).toSparse)
  }
}

val sparseX = x.map { (v: Array[Double]) =>
v.zipWithIndex.filter{
v.zipWithIndex.filter {
case (d: Double, i: Int) => d != 0.0
}.map {
case (d: Double, i: Int) => (i, d)
}
}
y.zip(sparseX).map(p => LabeledPoint(p._1, Vectors.sparse(weights.length, p._2)))
if (sparcity == 0.0) {
// Return LabeledPoints with DenseVector
y.zip(x).map(p => LabeledPoint(p._1, Vectors.dense(p._2)))
} else {
// Return LabeledPoints with SparseVector
y.zip(sparseX).map(p => LabeledPoint(p._1, Vectors.sparse(weights.length, p._2)))
}
}
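
For reference, the hand-written dense-to-sparse step in this hunk (zipWithIndex, filter out zeros, swap to (index, value)) can be expressed as a single `collect`. This is a minimal plain-Scala sketch; `SparseSketch` and `toSparsePairs` are hypothetical names, not part of Spark or of this PR.

```scala
// Hypothetical helper; illustrates the filter + map pair from the diff as one collect.
object SparseSketch {

  /** Returns (index, value) pairs for the non-zero entries of a dense row. */
  def toSparsePairs(row: Array[Double]): Seq[(Int, Double)] =
    row.zipWithIndex.collect { case (d, i) if d != 0.0 => (i, d) }.toSeq
}
```

The resulting pairs have the same (index, value) shape that the diff passes to `Vectors.sparse(weights.length, ...)`.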

/**