-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-10117][MLLIB] Implement SQL data source API for reading LIBSVM data #8537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #41827 has finished for PR 8537 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
featuresType->vectorType?- Do we need to make this class public? If yes, I would recommend a default public constructor with path and creating setter/getter for each option to make it easy to expand in the future.
- missing doc
|
@Lewuathe Thanks for working on this! Beside inline comments, could you add a Java test suite? |
|
@Lewuathe Could you add |
|
@mengxr Thank you for reviewing and pointing! I'll update the patch and test codes. |
|
Test build #41934 has finished for PR 8537 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we need to read the entire file anyway, it doesn't save much with PrunedScan. Maybe TableScan is simpler but sufficient.
|
@Lewuathe I made another pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't verify the result is a sparse vector because runtime type erasure. We need
val v = row1.getAs[SparseVector](1)
assert(v == Vectors.sparse(...))to force check.
|
Test build #42074 has finished for PR 8537 at commit
|
|
Test build #42100 has finished for PR 8537 at commit
|
|
Can't it find |
|
You need to add a file to the mllib module. |
|
Test build #42129 has finished for PR 8537 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add @Since("1.6.0") to DefaultSource? See https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala#L128.
|
Made another pass, only some minor issues left. |
|
LGTM pending Jenkins |
|
Test build #42208 has finished for PR 8537 at commit
|
|
Merged into master. Thanks! |
It is convenient to implement data source API for LIBSVM format to have a better integration with DataFrames and ML pipeline API.
Two option is implemented.
numFeatures: Specify the dimension of features vectorfeaturesType: Specify the type of output vector.sparseis default.