[WIP] SPARK-1430: Support sparse data in Python MLlib #341
Closed
18 commits
881fef7 (mateiz): Added a sparse vector in Python and made Java-Python format more compact
154f45d (mateiz): Update docs, name some magic values
2abbb44 (mateiz): Further work to get linear models working with sparse data
eaee759 (mateiz): Update regression, classification and clustering models for sparse data
0e7a3d8 (mateiz): Keep vectors sparse in Java when reading LabeledPoints
a5d6426 (mateiz): Add linalg.py to run-tests script
ab244d1 (mateiz): Allow SparseVectors to be initialized using a dict
889dde8 (mateiz): Support scipy.sparse matrices in all our algorithms and models
74eefe7 (mateiz): Added LabeledPoint class in Python
a07ba10 (mateiz): Fix some typos and calculation of initial weights
c48e85a (mateiz): Added some tests for passing lists as input, and added mllib/tests.py to
da0f27e (mateiz): Added a MLlib K-means example and updated docs to discuss sparse data
37ab747 (mateiz): Fix some examples and docs due to changes in MLlib API
88bc01f (mateiz): Clean up inheritance of LinearModel in Python, and expose its parametrs
1e1bd0f (mateiz): Add MLlib logistic regression example in Python
b9f97a3 (mateiz): Fix test
ea5a25a (mateiz): Fix remaining uses of copyto() after merge
d52e763 (mateiz): Remove no-longer-needed slice code and handle review comments
Keep vectors sparse in Java when reading LabeledPoints
commit 0e7a3d8599d6eb677e734cd3fadc27d6942a40f9
    @@ -60,6 +60,8 @@ trait Vector extends Serializable {
        * @param i index
        */
       private[mllib] def apply(i: Int): Double = toBreeze(i)
    +
    +  private[mllib] def slice(start: Int, end: Int): Vector
     }
     
     /**
    @@ -130,9 +132,11 @@ object Vectors {
       private[mllib] def fromBreeze(breezeVector: BV[Double]): Vector = {
         breezeVector match {
           case v: BDV[Double] =>
    -        require(v.offset == 0, s"Do not support non-zero offset ${v.offset}.")
    -        require(v.stride == 1, s"Do not support stride other than 1, but got ${v.stride}.")
    -        new DenseVector(v.data)
    +        if (v.offset == 0 && v.stride == 1) {
    +          new DenseVector(v.data)
    +        } else {
    +          new DenseVector(v.toArray)  // Can't use underlying array directly, so make a new one
    +        }
           case v: BSV[Double] =>
             new SparseVector(v.length, v.index, v.data)
           case v: BV[_] =>
    @@ -155,6 +159,10 @@ class DenseVector(val values: Array[Double]) extends Vector {
       private[mllib] override def toBreeze: BV[Double] = new BDV[Double](values)
     
       override def apply(i: Int) = values(i)
    +
    +  private[mllib] override def slice(start: Int, end: Int): Vector = {
    +    new DenseVector(values.slice(start, end))
    +  }
     }
     
     /**
    @@ -185,4 +193,39 @@ class SparseVector(
       }
     
       private[mllib] override def toBreeze: BV[Double] = new BSV[Double](indices, values, size)
    +
    +  override def apply(pos: Int): Double = {
    +    // A more efficient apply() than creating a new Breeze vector
    +    var i = 0
    +    while (i < indices.length) {
    +      if (indices(i) == pos) {
    +        return values(i)
    +      } else if (indices(i) > pos) {
    +        return 0.0
    +      }
    +      i += 1
    +    }
    +    0.0
    +  }
    +
    +  private[mllib] override def slice(start: Int, end: Int): Vector = {
    +    require(start <= end, s"invalid range: ${start} to ${end}")
    +    require(start >= 0, s"invalid range: ${start} to ${end}")
    +    require(end <= size, s"invalid range: ${start} to ${end}")
    +    // Figure out the range of indices that fall within the given bounds
    +    var i = 0
    +    var indexRangeStart = 0
    +    var indexRangeEnd = 0
    +    while (i < indices.length && indices(i) < start) {
    +      i += 1
    +    }
    +    indexRangeStart = i
    +    while (i < indices.length && indices(i) < end) {
    +      i += 1
    +    }
    +    indexRangeEnd = i
    +    val newIndices = indices.slice(indexRangeStart, indexRangeEnd).map(_ - start)
    +    val newValues = values.slice(indexRangeStart, indexRangeEnd)
    +    new SparseVector(end - start, newIndices, newValues)
    +  }
     }
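To make the index arithmetic in the sparse `slice` above concrete, here is a minimal standalone sketch that applies the same range logic to plain arrays. The names `SparseSliceDemo` and `sliceSparse` are hypothetical and not part of MLlib; the sketch assumes `indices` is sorted in ascending order, as the diff's loop also relies on.

```scala
object SparseSliceDemo {
  // Hypothetical helper (not MLlib API): slices a sparse vector stored as
  // parallel arrays. `indices` must be sorted ascending.
  // Returns (newSize, newIndices, newValues) for the half-open range [start, end).
  def sliceSparse(
      size: Int,
      indices: Array[Int],
      values: Array[Double],
      start: Int,
      end: Int): (Int, Array[Int], Array[Double]) = {
    require(0 <= start && start <= end && end <= size, s"invalid range: $start to $end")
    // Position of the first stored entry whose index is >= start
    val from = indices.indexWhere(_ >= start) match {
      case -1 => indices.length
      case i => i
    }
    // Position of the first stored entry whose index is >= end
    val until = indices.indexWhere(_ >= end, from) match {
      case -1 => indices.length
      case i => i
    }
    // Shift the surviving indices so the slice starts at 0
    (end - start, indices.slice(from, until).map(_ - start), values.slice(from, until))
  }

  def main(args: Array[String]): Unit = {
    // Logical vector [0, 10, 0, 30, 0, 50]; take positions 2 until 6
    val (n, idx, vals) = sliceSparse(6, Array(1, 3, 5), Array(10.0, 30.0, 50.0), 2, 6)
    println(s"size=$n indices=${idx.mkString(",")} values=${vals.mkString(",")}")
  }
}
```

In this example the slice keeps the entries stored at indices 3 and 5, which become indices 1 and 3 of a length-4 vector.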
Breeze's sparse vector uses binary search for random access. I think in the current code base, only decision tree needs random access to a vector. However, we haven't claimed it supports sparse input yet.
Good point, I'll remove this and slice() because they're no longer needed. They were needed when we passed vectors with the label included from Python instead of passing LabeledPoint.
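For context on the binary-search remark in the review comment, random access into a sorted index array could be sketched roughly as below. This is not Breeze's code; `SparseLookupDemo` and `sparseApply` are hypothetical names, and the sketch assumes `indices` is sorted ascending. It uses `java.util.Arrays.binarySearch`, which returns a negative value when the key is absent.

```scala
import java.util.Arrays

object SparseLookupDemo {
  // Hypothetical helper (not Breeze or MLlib API): value at logical position
  // `pos` of a sparse vector stored as parallel arrays.
  def sparseApply(indices: Array[Int], values: Array[Double], pos: Int): Double = {
    // Non-negative result: position of `pos` in `indices`.
    // Negative result: `pos` is not stored, so the element is an implicit zero.
    val k = Arrays.binarySearch(indices, pos)
    if (k >= 0) values(k) else 0.0
  }

  def main(args: Array[String]): Unit = {
    val indices = Array(1, 3, 5)
    val values = Array(10.0, 30.0, 50.0)
    println(sparseApply(indices, values, 3)) // stored entry
    println(sparseApply(indices, values, 4)) // implicit zero
  }
}
```

Compared with the linear scan in the diff, this takes logarithmic rather than linear time in the number of stored entries, which only matters for callers that do many random lookups.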