@Lewuathe
Contributor

It is convenient to implement the data source API for the LIBSVM format so that it integrates better with DataFrames and the ML pipeline API.

Two options are implemented.

  • numFeatures: Specify the dimension of features vector
  • featuresType: Specify the type of output vector. sparse is default.
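
To make the two options concrete, here is a minimal, Spark-free sketch of how one LIBSVM line could be read and then expanded to a fixed dimension. It is not the code from this patch: `LibSVMSketch`, `Record`, `parseLine`, and `toDense` are hypothetical names, and the only fact relied on is that a LIBSVM line has the form `label index:value ...` with one-based indices.

```scala
// Hypothetical helper for illustration only; not the implementation in this PR.
object LibSVMSketch {
  /** A parsed record: the label plus zero-based indices and their values (a sparse view). */
  final case class Record(label: Double, indices: Array[Int], values: Array[Double])

  /** Parses one line such as "1.0 1:0.5 3:2.0". Indices are one-based in the file. */
  def parseLine(line: String): Record = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val pairs = tokens.tail.map { item =>
      val parts = item.split(":")
      (parts(0).toInt - 1, parts(1).toDouble) // shift to zero-based
    }
    Record(label, pairs.map(_._1), pairs.map(_._2))
  }

  /** Expands a record into a dense array whose length plays the role of numFeatures. */
  def toDense(record: Record, numFeatures: Int): Array[Double] = {
    require(record.indices.forall(_ < numFeatures), "feature index exceeds numFeatures")
    val out = new Array[Double](numFeatures)
    record.indices.zip(record.values).foreach { case (i, v) => out(i) = v }
    out
  }
}
```

In this sketch the sparse form is the `Record` itself and the dense form is what `toDense` returns, which is roughly the choice the `featuresType` option exposes.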

@SparkQA

SparkQA commented Aug 31, 2015

Test build #41827 has finished for PR 8537 at commit 99accaa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LibSVMRelation(val path: String, val numFeatures: Int, val featuresType: String)
    • class DefaultSource extends RelationProvider with DataSourceRegister
    • implicit class LibSVMReader(read: DataFrameReader)

Contributor

  • featuresType -> vectorType?
  • Do we need to make this class public? If yes, I would recommend a default public constructor with path and creating setter/getter for each option to make it easy to expand in the future.
  • missing doc
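
The constructor-plus-setters shape suggested in the second point could look roughly like the sketch below. Everything here is an assumption made for illustration: the class name `LibSVMOptionsSketch`, the use of `-1` to mean "not set", and the accepted values `"sparse"` and `"dense"` are not taken from the patch.

```scala
// Hypothetical shape only; names and defaults are illustrative assumptions.
class LibSVMOptionsSketch(val path: String) {
  private var numFeatures: Int = -1          // -1 as "not set" is an assumption
  private var vectorType: String = "sparse"  // sparse is the default per the PR description

  def getNumFeatures: Int = numFeatures

  /** Sets the feature dimension and returns this instance so calls can be chained. */
  def setNumFeatures(value: Int): this.type = {
    numFeatures = value
    this
  }

  def getVectorType: String = vectorType

  /** Sets the output vector type; only "sparse" and "dense" are accepted in this sketch. */
  def setVectorType(value: String): this.type = {
    require(value == "sparse" || value == "dense", s"unknown vector type: $value")
    vectorType = value
    this
  }
}
```

With this shape, a new option only needs a new field and a getter/setter pair, and existing callers of the one-argument constructor keep compiling.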

@mengxr
Contributor

mengxr commented Aug 31, 2015

@Lewuathe Thanks for working on this! Besides the inline comments, could you add a Java test suite?

@mengxr
Contributor

mengxr commented Aug 31, 2015

@Lewuathe Could you add [MLLIB] to the PR title?

@Lewuathe changed the title from "[SPARK-10117] Implement SQL data source API for reading LIBSVM data" to "[SPARK-10117][MLLIB] Implement SQL data source API for reading LIBSVM data" on Aug 31, 2015
@Lewuathe
Contributor Author

@mengxr Thank you for reviewing and pointing these out! I'll update the patch and the test code.

@SparkQA

SparkQA commented Sep 2, 2015

Test build #41934 has finished for PR 8537 at commit 40d3027.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DefaultSource extends RelationProvider with DataSourceRegister
    • implicit class LibSVMReader(read: DataFrameReader)

Contributor

Since we need to read the entire file anyway, PrunedScan doesn't save much. TableScan is simpler and is probably sufficient.

@mengxr
Contributor

mengxr commented Sep 2, 2015

@Lewuathe I made another pass.

Contributor

This doesn't verify that the result is a sparse vector because of runtime type erasure. We need

val v = row1.getAs[SparseVector](1)
assert(v == Vectors.sparse(...))

to force the check.
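
The erasure point can be reproduced without Spark. The sketch below uses a hypothetical generic `getAs[T]` as a stand-in for `Row.getAs[T]`, and made-up `SparseVec` and `DenseVec` classes in place of the MLlib vector types; it is meant only to show why binding the result to a typed `val` matters, not to mirror the actual test in this patch.

```scala
import scala.util.Try

// Stand-in types and accessor; none of these names come from Spark.
object ErasureSketch {
  sealed trait Vec
  final case class SparseVec(size: Int, indices: Seq[Int], values: Seq[Double]) extends Vec
  final case class DenseVec(values: Seq[Double]) extends Vec

  // The cast to T is erased inside the generic method, so nothing is checked here.
  def getAs[T](value: Any): T = value.asInstanceOf[T]

  // The result is only used as Any, so the compiler has no reason to insert
  // a cast to SparseVec at the call site and a wrong type goes unnoticed.
  def uncheckedRead(stored: Any): Any = getAs[SparseVec](stored)

  // Binding to a SparseVec-typed val makes the compiler insert the cast,
  // so a DenseVec fails with a ClassCastException (captured in the Try).
  def checkedRead(stored: Any): Try[SparseVec] = Try {
    val v: SparseVec = getAs[SparseVec](stored)
    v
  }
}
```

Calling `ErasureSketch.checkedRead` with a `DenseVec` should yield a failure, while `uncheckedRead` hands the same value back untouched, which is the gap the reviewer is pointing at.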

@SparkQA

SparkQA commented Sep 6, 2015

Test build #42074 has finished for PR 8537 at commit 4f40891.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DefaultSource extends RelationProvider with DataSourceRegister
    • implicit class LibSVMReader(read: DataFrameReader)

@SparkQA

SparkQA commented Sep 7, 2015

Test build #42100 has finished for PR 8537 at commit 0ea1c1c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DefaultSource extends RelationProvider with DataSourceRegister

@Lewuathe
Contributor Author

Lewuathe commented Sep 8, 2015

Can't DataSourceRegister find the ml package?

@rxin
Contributor

rxin commented Sep 8, 2015

You need to add a file to the mllib module.

@SparkQA

SparkQA commented Sep 8, 2015

Test build #42129 has finished for PR 8537 at commit 9ce63c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DefaultSource extends RelationProvider with DataSourceRegister

@mengxr
Contributor

mengxr commented Sep 8, 2015

Made another pass, only some minor issues left.

@mengxr
Contributor

mengxr commented Sep 9, 2015

LGTM pending Jenkins

@SparkQA

SparkQA commented Sep 9, 2015

Test build #42208 has finished for PR 8537 at commit 986999d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DefaultSource extends RelationProvider with DataSourceRegister

@mengxr
Contributor

mengxr commented Sep 9, 2015

Merged into master. Thanks!

@asfgit asfgit closed this in 2ddeb63 Sep 9, 2015