-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-4431][MLlib] Implement efficient foreachActive for dense and sparse vector #3288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
- Loading branch information
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -76,6 +76,22 @@ sealed trait Vector extends Serializable { | |
| def copy: Vector = { | ||
| throw new NotImplementedError(s"copy is not implemented for ${this.getClass}.") | ||
| } | ||
|
|
||
| /** | ||
| * It will return the iterator for the active elements of dense and sparse vector as | ||
| * (index, value) pair. Note that foreach method can be overridden for better performance | ||
| * in different vector implementation. | ||
| * | ||
| * @param skippingZeros Skipping zero elements explicitly if true. It will be useful when we | ||
| * iterator through dense vector having lots of zero elements which | ||
| * we want to skip. Default is false. | ||
| * @return Iterator[(Int, Double)] where the first element in the tuple is the index, | ||
| * and the second element is the corresponding value. | ||
| */ | ||
| private[spark] def activeIterator(skippingZeros: Boolean): Iterator[(Int, Double)] | ||
|
|
||
| private[spark] def activeIterator: Iterator[(Int, Double)] = activeIterator(false) | ||
|
|
||
| } | ||
|
|
||
| /** | ||
|
|
@@ -273,6 +289,47 @@ class DenseVector(val values: Array[Double]) extends Vector { | |
| override def copy: DenseVector = { | ||
| new DenseVector(values.clone()) | ||
| } | ||
|
|
||
| private[spark] override def activeIterator(skippingZeros: Boolean) = new Iterator[(Int, Double)] { | ||
| private var i = 0 | ||
| private val valuesSize = values.size | ||
|
|
||
| // If zeros are asked to be explicitly skipped, the parent `size` method is called to count | ||
| // the number of nonzero elements using `hasNext` and `next` methods. | ||
| override lazy val size: Int = if (skippingZeros) super.size else valuesSize | ||
|
|
||
| override def hasNext = { | ||
| if (skippingZeros) { | ||
| var found = false | ||
| while (!found && i < valuesSize) if (values(i) != 0.0) found = true else i += 1 | ||
| } | ||
| i < valuesSize | ||
| } | ||
|
|
||
| override def next = { | ||
| val result = (i, values(i)) | ||
| i += 1 | ||
| result | ||
| } | ||
|
|
||
| override def foreach[@specialized(Unit) U](f: ((Int, Double)) => U) { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Another option is force the return type to be
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Interesting. In scala's range code, they have I'll do a bytecode analysis, and see if it will generate the same bytecode.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay, the generated bytecode of both approaches are the same. |
||
| var i = 0 | ||
| if (skippingZeros) { | ||
| while (i < valuesSize) { | ||
| if (values(i) != 0.0) { | ||
| f(i, values(i)) | ||
| } | ||
| i += 1 | ||
| } | ||
| } else { | ||
| while (i < valuesSize) { | ||
| f(i, values(i)) | ||
| i += 1 | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| } | ||
|
|
||
| /** | ||
|
|
@@ -309,4 +366,45 @@ class SparseVector( | |
| } | ||
|
|
||
| private[mllib] override def toBreeze: BV[Double] = new BSV[Double](indices, values, size) | ||
|
|
||
| private[spark] override def activeIterator(skippingZeros: Boolean) = new Iterator[(Int, Double)] { | ||
| private var i = 0 | ||
| private val valuesSize = values.size | ||
|
|
||
| // If zeros are asked to be explicitly skipped, the parent `size` method is called to count | ||
| // the number of nonzero elements using `hasNext` and `next` methods. | ||
| override lazy val size: Int = if (skippingZeros) super.size else valuesSize | ||
|
|
||
| def hasNext = { | ||
| if (skippingZeros) { | ||
| var found = false | ||
| while (!found && i < valuesSize) if (values(i) != 0.0) found = true else i += 1 | ||
| } | ||
| i < valuesSize | ||
| } | ||
|
|
||
| def next = { | ||
| val result = (indices(i), values(i)) | ||
| i += 1 | ||
| result | ||
| } | ||
|
|
||
| override def foreach[@specialized(Unit) U](f: ((Int, Double)) => U) { | ||
| var i = 0 | ||
| if (skippingZeros) { | ||
| while (i < valuesSize) { | ||
| if (values(i) != 0.0) { | ||
| f(indices(i), values(i)) | ||
| } | ||
| i += 1 | ||
| } | ||
| } else { | ||
| while (i < valuesSize) { | ||
| f(indices(i), values(i)) | ||
| i += 1 | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need
skippingZeroshere because it is very easy to chain the iterator with a filter to achieve it.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
skippingZeroswill be very useful inforeachoperation, and if you use iterator -> filter -> foreach, it will not use the optimizedforeachwhich is implemented by native while loop.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the following code should have the same performance:
vec.foreach { (i, v) => if (v != 0.0) { ... } }There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the following code,
It takes 61.809 for dense vector, and 54.626 for sparse vector.
The most expensive part is calling the anonymous function even when the values are zero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, the issue is in the anonymous function. Basically, scala will convert primitive index: Int and value: Double into boxed object in order to have them in tuple. In my testing dataset, there are so many zeros explicitly, and even those values with zero have to be converted to tuple before we do the
if statement. That's why it's dramatically faster if we do theif statementbefore calling the anonymous function.Changing the signature of
foreachintoto take two primitive variables will solve this problem, but it will not comply the interface of
foreach.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tuple2[Int, Double]is specialized in Scala: https://github.com/scala/scala/blob/2.10.x/src/library/scala/Tuple2.scala, but the iterator interface still won't be high-performance for numerical computation. The iterator interface is not used in this PR, as we only needforeach. Then the best for us is definingforeachdirectly:( We could implement
ZippedTraversable2but it doesn't seem to be necessary.)or
This is a private function. We can change it when we see more use cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right; the
Tuple2[Int, Double]is specialized, and I mistakenly interpreted the bytecode.For the flowing scala code,
the generated bytecode will be
However,
is expensive, so that's why checking zero in the anonymous function will slow down the whole thing.
I agree with you, the iterator is slow by nature, and we are only interested in foreach implementation. I'll remove the iterator, and just have foreach method in vector.