-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-4431][MLlib] Implement efficient foreachActive for dense and sparse vector #3288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #23430 has started for PR 3288 at commit
|
|
Test build #23430 has finished for PR 3288 at commit
|
|
Test PASSed. |
|
Test build #23462 has started for PR 3288 at commit
|
|
Test build #23462 has finished for PR 3288 at commit
|
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need skippingZeros here because it is very easy to chain the iterator with a filter to achieve it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
skippingZeros will be very useful in foreach operation, and if you use iterator -> filter -> foreach, it will not use the optimized foreach which is implemented by native while loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the following code should have the same performance:
vec.foreach { (i, v) =>
if (v != 0.0) {
...
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the following code,
sample.activeIterator(false).foreach {
case (index, value) => if(value != 0.0) add(index, value)
}
It takes 61.809 for dense vector, and 54.626 for sparse vector.
The most expensive part is calling the anonymous function even when the values are zero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, the issue is in the anonymous function. Basically, scala will convert primitive index: Int and value: Double into boxed object in order to have them in tuple. In my testing dataset, there are so many zeros explicitly, and even those values with zero have to be converted to tuple before we do the if statement. That's why it's dramatically faster if we do the if statement before calling the anonymous function.
Changing the signature of foreach into
def foreach[@specialized(Unit) U](f: (Int, Double) => U)
to take two primitive variables will solve this problem, but it will not comply the interface of foreach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tuple2[Int, Double] is specialized in Scala: https://github.com/scala/scala/blob/2.10.x/src/library/scala/Tuple2.scala, but the iterator interface still won't be high-performance for numerical computation. The iterator interface is not used in this PR, as we only need foreach. Then the best for us is defining foreach directly:
def foreach(f: (Int, Double) => Unit)( We could implement ZippedTraversable2 but it doesn't seem to be necessary.)
or
def foreach(skipZeros = True)(f: (Int, Double) => Unit)This is a private function. We can change it when we see more use cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right; the Tuple2[Int, Double] is specialized, and I mistakenly interpreted the bytecode.
For the flowing scala code,
def foreach[@specialized(Unit) U](f: ((Int, Double)) => U) {
var i = 0
val localValuesSize = values.size
val localValues = values
while (i < localValuesSize) {
f(i, localValues(i))
i += 1
}
}the generated bytecode will be
public foreach(Lscala/Function1;)V
L0
LINENUMBER 296 L0
ICONST_0
ISTORE 2
L1
LINENUMBER 297 L1
GETSTATIC scala/Predef$.MODULE$ : Lscala/Predef$;
ALOAD 0
INVOKEVIRTUAL org/apache/spark/mllib/linalg/DenseVector.values ()[D
INVOKEVIRTUAL scala/Predef$.doubleArrayOps ([D)Lscala/collection/mutable/ArrayOps;
INVOKEINTERFACE scala/collection/mutable/ArrayOps.size ()I
ISTORE 3
L2
LINENUMBER 298 L2
ALOAD 0
INVOKEVIRTUAL org/apache/spark/mllib/linalg/DenseVector.values ()[D
ASTORE 4
L3
LINENUMBER 299 L3
FRAME APPEND [I I [D]
ILOAD 2
ILOAD 3
IF_ICMPGE L4
L5
LINENUMBER 300 L5
ALOAD 1
NEW scala/Tuple2$mcID$sp
DUP
ILOAD 2
ALOAD 4
ILOAD 2
DALOAD
INVOKESPECIAL scala/Tuple2$mcID$sp.<init> (ID)V
INVOKEINTERFACE scala/Function1.apply (Ljava/lang/Object;)Ljava/lang/Object;
POP
L6
LINENUMBER 301 L6
ILOAD 2
ICONST_1
IADD
ISTORE 2
GOTO L3
However,
INVOKESPECIAL scala/Tuple2$mcID$sp.<init> (ID)V
INVOKEINTERFACE scala/Function1.apply (Ljava/lang/Object;)Ljava/lang/Object;
is expensive, so that's why checking zero in the anonymous function will slow down the whole thing.
I agree with you, the iterator is slow by nature, and we are only interested in foreach implementation. I'll remove the iterator, and just have foreach method in vector.
|
(PS, when I did the bytecode analysis, I found that accessing the |
|
Test build #23568 has started for PR 3288 at commit
|
|
Test build #23568 has finished for PR 3288 at commit
|
|
Test PASSed. |
… the accessing a single step operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be faster than
sample.foreach(true){
case (index, value) => add(index, value)
}for 5%.
See the generated bytecode.
With pattern matching.
L20
LINENUMBER 103 L20
ALOAD 4
INVOKEVIRTUAL scala/Tuple2._1$mcI$sp ()I
ISTORE 5
L21
ALOAD 4
INVOKEVIRTUAL scala/Tuple2._2$mcD$sp ()D
DSTORE 6
L22
ALOAD 0
ILOAD 5
DLOAD 6
INVOKEVIRTUAL org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.org$apache$spark$mllib$stat$MultivariateOnlineSummarizer$$add (ID)V
GETSTATIC scala/runtime/BoxedUnit.UNIT : Lscala/runtime/BoxedUnit;
ASTORE 8
Without pattern matching.
L17
LINENUMBER 100 L17
ALOAD 0
ALOAD 3
INVOKEVIRTUAL scala/Tuple2._1$mcI$sp ()I
ALOAD 3
INVOKEVIRTUAL scala/Tuple2._2$mcD$sp ()D
INVOKEVIRTUAL org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.org$apache$spark$mllib$stat$MultivariateOnlineSummarizer$$add (ID)V
|
Test build #23694 has started for PR 3288 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling foreach without parenthesis
dv.foreach {
case (index: Int, value: Double) => dvMap0.put(index, value)
}will cause
Error:(182, 16) missing parameter type for expanded function
The argument types of an anonymous function must be fully known. (SLS 8.5)
Expected type was: Boolean
dv.foreach {
^
This is scala curry function overloading issue. It seems that unless we change the signature to
private[spark] def foreach(skippingZeros: Boolean = false, f: ((Int, Double)) => Unit)we need to explicitly call it with parenthesis when we want to call it with default value of skippingZeros.
|
Test build #23694 has finished for PR 3288 at commit
|
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following style is more common in the codebase:
sample.foreachActive { (index, value) =>
...
}
|
Test build #23731 has started for PR 3288 at commit
|
|
LGTM except minor inline comments. Thank for improving the performance! |
|
Test build #23733 has started for PR 3288 at commit
|
|
Test build #23731 has finished for PR 3288 at commit
|
|
Test PASSed. |
|
Test build #23733 has finished for PR 3288 at commit
|
|
Test PASSed. |
|
Merged into master and branch-1.2. Thanks! |
…parse vector Previously, we were using Breeze's activeIterator to access the non-zero elements in dense/sparse vector. Due to the overhead, we switched back to native `while loop` in #SPARK-4129. However, #SPARK-4129 requires de-reference the dv.values/sv.values in each access to the value, which is very expensive. Also, in MultivariateOnlineSummarizer, we're using Breeze's dense vector to store the partial stats, and this is very expensive compared with using primitive scala array. In this PR, efficient foreachActive is implemented to unify the code path for dense and sparse vector operation which makes codebase easier to maintain. Breeze dense vector is replaced by primitive array to reduce the overhead further. Benchmarking with mnist8m dataset on single JVM with first 200 samples loaded in memory, and repeating 5000 times. Before change: Sparse Vector - 30.02 Dense Vector - 38.27 With this PR: Sparse Vector - 6.29 Dense Vector - 11.72 Author: DB Tsai <[email protected]> Closes apache#3288 from dbtsai/activeIterator and squashes the following commits: 844b0e6 [DB Tsai] formating 03dd693 [DB Tsai] futher performance tunning. 1907ae1 [DB Tsai] address feedback 98448bb [DB Tsai] Made the override final, and had a local copy of variables which made the accessing a single step operation. c0cbd5a [DB Tsai] fix a bug 6441f92 [DB Tsai] Finished SPARK-4431
…parse vector Previously, we were using Breeze's activeIterator to access the non-zero elements in dense/sparse vector. Due to the overhead, we switched back to native `while loop` in #SPARK-4129. However, #SPARK-4129 requires de-reference the dv.values/sv.values in each access to the value, which is very expensive. Also, in MultivariateOnlineSummarizer, we're using Breeze's dense vector to store the partial stats, and this is very expensive compared with using primitive scala array. In this PR, efficient foreachActive is implemented to unify the code path for dense and sparse vector operation which makes codebase easier to maintain. Breeze dense vector is replaced by primitive array to reduce the overhead further. Benchmarking with mnist8m dataset on single JVM with first 200 samples loaded in memory, and repeating 5000 times. Before change: Sparse Vector - 30.02 Dense Vector - 38.27 With this PR: Sparse Vector - 6.29 Dense Vector - 11.72 Author: DB Tsai <[email protected]> Closes #3288 from dbtsai/activeIterator and squashes the following commits: 844b0e6 [DB Tsai] formating 03dd693 [DB Tsai] futher performance tunning. 1907ae1 [DB Tsai] address feedback 98448bb [DB Tsai] Made the override final, and had a local copy of variables which made the accessing a single step operation. c0cbd5a [DB Tsai] fix a bug 6441f92 [DB Tsai] Finished SPARK-4431 (cherry picked from commit b5d17ef) Signed-off-by: Xiangrui Meng <[email protected]>
Previously, we were using Breeze's activeIterator to access the non-zero elements
in dense/sparse vector. Due to the overhead, we switched back to native
while loopin #SPARK-4129.
However, #SPARK-4129 requires de-reference the dv.values/sv.values in
each access to the value, which is very expensive. Also, in MultivariateOnlineSummarizer,
we're using Breeze's dense vector to store the partial stats, and this is very expensive compared
with using primitive scala array.
In this PR, efficient foreachActive is implemented to unify the code path for dense and sparse
vector operation which makes codebase easier to maintain. Breeze dense vector is replaced
by primitive array to reduce the overhead further.
Benchmarking with mnist8m dataset on single JVM
with first 200 samples loaded in memory, and repeating 5000 times.
Before change:
Sparse Vector - 30.02
Dense Vector - 38.27
With this PR:
Sparse Vector - 6.29
Dense Vector - 11.72