Skip to content
Closed
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Address the feedback
  • Loading branch information
DB Tsai committed Nov 25, 2014
commit daf2b0670a1e2311f50f3312aa688bcd9b26b099
Original file line number Diff line number Diff line change
Expand Up @@ -85,7 +85,7 @@ class StandardScalerModel private[mllib] (
f
}

private lazy val shift: Array[Double] = mean.toArray
private val shift: Array[Double] = mean.toArray

/**
* Applies standardization transformation on a vector.
Expand All @@ -97,19 +97,24 @@ class StandardScalerModel private[mllib] (
override def transform(vector: Vector): Vector = {
require(mean.size == vector.size)
if (withMean) {
// By default, Scala generates Java methods for member variables. So every time when
// the member variables are accessed, `invokespecial` will be called which is expensive.
// This can be avoid by having a local reference of `shift`.
val localShift = shift
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is worth to leave a comment here and explain why we need local reference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shift is only used in this branch. Shall we just put val shift = mean.toArray here instead of having a member variable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I'll change it back to lazy since it will not be evaluated in those branches which don't use shift. I don't want to create shift array/object for each sample since shift will always be the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shift only holds a reference to mean.values. We don't really need to define it as a member and make it lazy. It should give the same performance if we only define it inside the if branch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For different implementation of vector, toArray can be very expensive. For example, toArray for sparse vector requires to create a new array object and loop through all the non zero values. As a result, we can have a global lazy shift which can prevent this happens.

vector match {
case dv: DenseVector =>
val values = dv.values.clone()
val size = values.size
var i = 0
if (withStd) {
// Having a local reference of `factor` to avoid overhead as the comment before.
val localFactor = factor
var i = 0
while (i < size) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we move var i = 0 inside each closure? It feels safer.

values(i) = (values(i) - localShift(i)) * localFactor(i)
i += 1
}
} else {
var i = 0
while (i < size) {
values(i) -= localShift(i)
i += 1
Expand All @@ -119,6 +124,7 @@ class StandardScalerModel private[mllib] (
case v => throw new IllegalArgumentException("Do not support vector type " + v.getClass)
}
} else if (withStd) {
// Having a local reference of `factor` to avoid overhead as the comment before.
val localFactor = factor
vector match {
case dv: DenseVector =>
Expand All @@ -135,9 +141,9 @@ class StandardScalerModel private[mllib] (
// so we can re-use it to save memory.
val indices = sv.indices
val values = sv.values.clone()
val size = values.size
val nnz = values.size
var i = 0
while (i < size) {
while (i < nnz) {
values(i) *= localFactor(indices(i))
i += 1
}
Expand Down