-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-8140] [MLlib] Remove empty model check in StreamingLinearAlgorithm #6684
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
MechCoder
commented
Jun 6, 2015
- Prevent creating a map of data to find numFeatures
- If model is empty, then initialize with a zero vector of numFeature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your intent here, but if the vectors are very large, it might actually be faster to get its size remotely and return only the size to the driver. But yes it entails a small distributed operation. I'm not sure which is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also a good possibility that input is very large. Is accessing the first element of a RDD slower than doing a distributed operation across the entire RDD?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would only evaluate one element in this case. first() accesses the first partition and first element of its iterator, so map should only be applied once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, I had thought, input.map(_.features.size) is evaluated first and then from that the first element is extracted. Should I revert this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could double-check with a very quick/small benchmark to see if it is in fact evaluated that way and if there is much difference at all. I suspect both are OK and don't know which is better.
d5d6178 to
4f214fc
Compare
|
Test build #34361 has finished for PR 6684 at commit
|
|
Test build #34362 has finished for PR 6684 at commit
|
|
Test build #34363 has finished for PR 6684 at commit
|
|
@srowen I removed the None check and restored the model.isEmpty check. |
|
Test build #34364 has finished for PR 6684 at commit
|
And other minor cosmits
|
Test build #34369 has finished for PR 6684 at commit
|
|
Test build #34370 has finished for PR 6684 at commit
|
|
I'm OK with this. CC @mengxr and/or @jkbradley for a second look, so I'll leave it open a little while. |
|
Also if |
|
Sorry for the slow response! I think it looks OK, though I wonder if the incomplete match causes a compilation warning. |
|
Hm, good point. That results in a warning actually, but no runtime problem since the model must exist here. Really this whole construct can be removed now; I should have seen that. Well we can follow this up with that additional change to further simplify and remove a warning. |
…ithm 1. Prevent creating a map of data to find numFeatures 2. If model is empty, then initialize with a zero vector of numFeature Author: MechCoder <[email protected]> Closes apache#6684 from MechCoder/spark-8140 and squashes the following commits: 7fbf5f9 [MechCoder] [SPARK-8140] Remove empty model check in StreamingLinearAlgorithm And other minor cosmits