[SPARK-10117][MLLIB] Implement SQL data source API for reading LIBSVM data #8537

Lewuathe · 2015-08-31T13:19:41Z

Implementing the data source API for the LIBSVM format gives better integration with DataFrames and the ML pipeline API.

Two options are implemented:

numFeatures: Specifies the dimension of the features vector.
featuresType: Specifies the type of the output vector. The default is sparse.
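
To illustrate what the two options control, here is a minimal, self-contained sketch of parsing one LIBSVM line. It is not the code from this patch: `Vec`, `DenseVec`, `SparseVec`, and `LibSvmSketch.parseLine` are hypothetical stand-ins (the real data source would use MLlib's vector classes), and the option values "sparse" and "dense" are assumed from the description above.

```scala
// Hypothetical stand-in vector types, used only for this illustration.
sealed trait Vec
final case class DenseVec(values: Vector[Double]) extends Vec
final case class SparseVec(size: Int, indices: Vector[Int], values: Vector[Double]) extends Vec

object LibSvmSketch {
  /** Parses one LIBSVM line such as "1.0 1:0.5 3:2.0" (indices in the text are one-based). */
  def parseLine(line: String, numFeatures: Int, featuresType: String = "sparse"): (Double, Vec) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    // Each remaining token is "index:value"; convert the index to zero-based.
    val pairs = tokens.tail.toVector.map { tok =>
      val sep = tok.indexOf(':')
      (tok.substring(0, sep).toInt - 1, tok.substring(sep + 1).toDouble)
    }
    require(pairs.forall { case (i, _) => i >= 0 && i < numFeatures },
      s"feature index out of range for numFeatures=$numFeatures")
    val features: Vec = featuresType match {
      case "sparse" =>
        SparseVec(numFeatures, pairs.map(_._1), pairs.map(_._2))
      case "dense" =>
        // Fill unspecified positions with zeros.
        val arr = Array.fill(numFeatures)(0.0)
        pairs.foreach { case (i, v) => arr(i) = v }
        DenseVec(arr.toVector)
      case other =>
        throw new IllegalArgumentException(s"unknown featuresType: $other")
    }
    (label, features)
  }
}
```

With numFeatures = 4, the line "1.0 1:0.5 3:2.0" yields a sparse vector of size 4 with non-zero entries at zero-based positions 0 and 2; passing "dense" instead materializes all four positions.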

SparkQA · 2015-08-31T14:10:46Z

Test build #41827 has finished for PR 8537 at commit 99accaa.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class LibSVMRelation(val path: String, val numFeatures: Int, val featuresType: String)
- class DefaultSource extends RelationProvider with DataSourceRegister
- implicit class LibSVMReader(read: DataFrameReader)

mengxr · 2015-08-31T21:32:39Z

mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala

featuresType -> vectorType?

Do we need to make this class public? If yes, I would recommend a default public constructor with path and creating setter/getter for each option to make it easy to expand in the future.

missing doc

mengxr · 2015-08-31T21:33:53Z

@Lewuathe Thanks for working on this! Besides the inline comments, could you add a Java test suite?

mengxr · 2015-08-31T23:09:45Z

@Lewuathe Could you add [MLLIB] to the PR title?

Lewuathe · 2015-08-31T23:12:10Z

@mengxr Thank you for the review and for pointing these out! I'll update the patch and the test code.

SparkQA · 2015-09-02T14:43:50Z

Test build #41934 has finished for PR 8537 at commit 40d3027.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class DefaultSource extends RelationProvider with DataSourceRegister
- implicit class LibSVMReader(read: DataFrameReader)

mengxr · 2015-09-02T20:41:48Z

mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala

Since we need to read the entire file anyway, it doesn't save much with PrunedScan. Maybe TableScan is simpler but sufficient.
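
A rough sketch of the trade-off, under simplifying assumptions: `SimpleTableScan`, `SimplePrunedScan`, and `LibSvmScanSketch` below are hypothetical stand-ins that work on in-memory lines, not Spark's actual traits (which return an RDD of rows). The point is only that a pruned scan over a text format still has to read every line before it can drop columns.

```scala
// Hypothetical, simplified stand-ins for the two scan styles; not Spark's traits.
trait SimpleTableScan {
  def buildScan(): Seq[Vector[Any]]
}
trait SimplePrunedScan {
  def buildScan(requiredColumns: Array[String]): Seq[Vector[Any]]
}

final class LibSvmScanSketch(lines: Seq[String]) extends SimpleTableScan with SimplePrunedScan {
  val columns: Vector[String] = Vector("label", "features")

  // Every line is read and split regardless of which columns are requested.
  private def parse(line: String): Vector[Any] = {
    val tokens = line.trim.split("\\s+")
    val features = tokens.tail.toVector.map { tok =>
      val sep = tok.indexOf(':')
      (tok.substring(0, sep).toInt, tok.substring(sep + 1).toDouble)
    }
    Vector(tokens.head.toDouble, features)
  }

  // Full scan: return both columns for every line.
  def buildScan(): Seq[Vector[Any]] = lines.map(parse)

  // "Pruned" scan: the same full read, followed by a column selection.
  def buildScan(requiredColumns: Array[String]): Seq[Vector[Any]] = {
    val idx = requiredColumns.toVector.map(c => columns.indexOf(c))
    buildScan().map(row => idx.map(i => row(i)))
  }
}
```

In this sketch the pruned variant does the same I/O and tokenizing as the full scan, which is why the simpler full-scan interface may be sufficient here.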

mengxr · 2015-09-02T20:57:11Z

@Lewuathe I made another pass.

mengxr · 2015-09-04T15:56:31Z

mllib/src/test/scala/org/apache/spark/ml/source/LibSVMRelationSuite.scala

This doesn't verify that the result is a sparse vector, because of runtime type erasure. We need

val v = row1.getAs[SparseVector](1)
assert(v == Vectors.sparse(...))

to force the check.
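
A self-contained sketch of why binding to a typed value matters. `FakeRow`, `DenseVec`, and `SparseVec` are hypothetical stand-ins for the row and vector classes; the assumption is that a generic getter like `getAs[T]` is implemented as an unchecked cast, so on the JVM the type argument is erased and the class is only checked where the result is actually used as that type.

```scala
import scala.util.Try

// Hypothetical stand-ins for the vector classes and the row type.
sealed trait Vec
final case class DenseVec(values: Vector[Double]) extends Vec
final case class SparseVec(size: Int, indices: Vector[Int], values: Vector[Double]) extends Vec

final class FakeRow(cells: Vector[Any]) {
  // T is erased at runtime, so this cast alone does not check the class.
  def getAs[T](i: Int): T = cells(i).asInstanceOf[T]
}

object ErasureSketch {
  // Column 1 deliberately holds a dense vector.
  val row = new FakeRow(Vector(1.0, DenseVec(Vector(0.5, 0.0))))

  // Binding the result to a SparseVec value and using it as one forces the
  // runtime class check at this call site.
  def sparseSizeAt(i: Int): Int = {
    val v = row.getAs[SparseVec](i)
    v.size
  }

  def main(args: Array[String]): Unit = {
    // Fails with a ClassCastException because the stored value is a DenseVec.
    println(Try(sparseSizeAt(1)))
  }
}
```

A test that only compares the untyped result against an expected value may never trigger that check, which is the gap the typed `val` closes.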

SparkQA · 2015-09-06T19:20:32Z

Test build #42074 has finished for PR 8537 at commit 4f40891.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class DefaultSource extends RelationProvider with DataSourceRegister
- implicit class LibSVMReader(read: DataFrameReader)

SparkQA · 2015-09-07T15:54:02Z

Test build #42100 has finished for PR 8537 at commit 0ea1c1c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class DefaultSource extends RelationProvider with DataSourceRegister

Lewuathe · 2015-09-08T00:29:35Z

Is DataSourceRegister unable to find the ml package?

rxin · 2015-09-08T01:04:27Z

You need to add a file to the mllib module.
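
For context, Spark discovers DataSourceRegister implementations through Java's ServiceLoader, which reads provider-configuration files under META-INF/services. A sketch of what such a file might look like for this patch, assuming the package name implied by the source path above (org.apache.spark.ml.source.libsvm) and the conventional src/main/resources location, neither of which is spelled out in this thread:

```
# Assumed location: mllib/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
org.apache.spark.ml.source.libsvm.DefaultSource
```

The file name is the fully qualified name of the service interface, and each non-comment line names one implementing class.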

SparkQA · 2015-09-08T11:29:14Z

Test build #42129 has finished for PR 8537 at commit 9ce63c7.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class DefaultSource extends RelationProvider with DataSourceRegister

mengxr · 2015-09-08T15:34:18Z

mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala

Could you add @Since("1.6.0") to DefaultSource? See https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala#L128.

mengxr · 2015-09-08T15:35:48Z

Made another pass, only some minor issues left.

mengxr · 2015-09-09T15:47:44Z

LGTM pending Jenkins

SparkQA · 2015-09-09T16:08:52Z

Test build #42208 has finished for PR 8537 at commit 986999d.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class DefaultSource extends RelationProvider with DataSourceRegister

mengxr · 2015-09-09T16:29:39Z

Merged into master. Thanks!

[SPARK-10117] Implement SQL data source API for reading LIBSVM data

99accaa

mengxr reviewed Aug 31, 2015

Lewuathe changed the title ~~[SPARK-10117] Implement SQL data source API for reading LIBSVM data~~ [SPARK-10117][MLLIB] Implement SQL data source API for reading LIBSVM data Aug 31, 2015

Lewuathe added 2 commits September 2, 2015 21:10

Merge branch 'master' into SPARK-10117

7056d4a

Add Java test

40d3027

mengxr reviewed Sep 2, 2015

Lewuathe added 4 commits September 3, 2015 21:36

[SPARK-10117] Implement SQL data source API for reading LIBSVM data

3fd8dce

Add Java test

70ee4dd

Fix

aef9564

Fix some points

a97ee97

mengxr reviewed Sep 4, 2015

Lewuathe added 2 commits September 6, 2015 20:23

Merge branch 'master' into SPARK-10117

5ab62ab

Improve test suites

4f40891

LibSVMRelation is registered into META-INF

0ea1c1c

Lewuathe added 3 commits September 8, 2015 18:27

Merge branch 'master' into SPARK-10117

ba3657c

Merge branch 'SPARK-10117' of github.com:Lewuathe/spark into SPARK-10117

1fdd2df

Rewrite service loader file

9ce63c7

mengxr reviewed Sep 8, 2015

Lewuathe added 3 commits September 9, 2015 22:59

Merge branch 'master' into SPARK-10117

21600a4

Fix some reviews

11d513f

Change unit test phrase

986999d

asfgit closed this in 2ddeb63 Sep 9, 2015

4 participants