Address comments
HyukjinKwon committed Dec 6, 2017
commit 265dd48ce16fd62058f4515a9e91c67942b45ed7
@@ -204,25 +204,33 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext {

assert(df.columns(0) == "label")
assert(df.columns(1) == "features")
-val row1 = df.first()
-assert(row1.getDouble(0) == 1.0)
-val v = row1.getAs[SparseVector](1)
-assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0))))
+val results = df.collect()
+
+assert(results.map(_.getDouble(0)).toSet == Seq(1.0, 0.0, 0.0, 0.0).toSet)
+
+val actual = results.map(_.getAs[SparseVector](1))
+val expected = Seq(
+  Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0))),
+  Vectors.sparse(6, Nil),
+  Vectors.sparse(6, Nil),
+  Vectors.sparse(6, Seq((1, 4.0), (3, 5.0), (5, 6.0))))
+assert(actual.toSet == expected.toSet)
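As an aside, the `toSet` comparison used above makes the assertion independent of row order, which Spark does not guarantee across partitions. A minimal sketch of that idea, using only the Scala standard library (the sample values here are made up for illustration, not taken from the test data):

```scala
// Sketch: Set equality ignores element order (and duplicate counts),
// mirroring `assert(actual.toSet == expected.toSet)` in the suite above.
object ToSetComparisonSketch {
  def main(args: Array[String]): Unit = {
    val actual = Array("1.0 -> (0,1.0)", "0.0 -> ()", "0.0 -> ()")
    val expected = Seq("0.0 -> ()", "1.0 -> (0,1.0)", "0.0 -> ()")
    // Orders differ, but the element sets are equal.
    assert(actual.toSet == expected.toSet)
    println("set comparison ok")
  }
}
```

One caveat of this style: because `toSet` collapses duplicates, a result with the wrong multiplicity of a repeated row would still pass; a sorted-sequence comparison would catch that.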

Contributor:

So here you only test the first line? Why not use df.collect() to test every line?

Member Author:

Why not just test the first line?

Contributor:

The following test only includes checking df and readBackDF equality, but it seems we also need to test equality between the whole loaded df and the raw file content.

Member Author:

Here we changed how each line is handled in the iteration. I think comparing either a single line or multiple lines is fine. Many tests here already check only the first line, don't they?

Member Author:

OK, let me update it. It's easy to change anyway.

// Write
df.coalesce(1)
.write.option("lineSep", lineSep).format("libsvm").save(path1.getAbsolutePath)
val partFile = Utils.recursiveList(path1).filter(f => f.getName.startsWith("part-")).head
val readBack = new String(
java.nio.file.Files.readAllBytes(partFile.toPath), StandardCharsets.UTF_8)
-assert(readBack === dataWithTrailingLineSep)
+assert(readBack == dataWithTrailingLineSep)

// Roundtrip
val readBackDF = spark.read
.option("lineSep", lineSep)
.format("libsvm")
.load(path1.getAbsolutePath)
-assert(df.collect().toSet === readBackDF.collect().toSet)
+assert(df.collect().toSet == readBackDF.collect().toSet)
} finally {
Utils.deleteRecursively(path0)
Utils.deleteRecursively(path1)
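The write/read-back check in this test compares the raw bytes of the written part file against the expected string, including the trailing separator. The core of that check can be sketched without Spark, using only java.nio; the separator and records below are hypothetical stand-ins, not the test's actual data:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

// Sketch: write records joined by a custom line separator (with a trailing
// separator, as the test's dataWithTrailingLineSep expects), then read the
// raw bytes back and compare, mirroring the readAllBytes check above.
object LineSepRoundTripSketch {
  def main(args: Array[String]): Unit = {
    val lineSep = "!"                              // hypothetical separator
    val records = Seq("1 1:1.0", "0 2:2.0")       // hypothetical libsvm rows
    val data = records.mkString(lineSep) + lineSep
    val path = Files.createTempFile("libsvm-sketch", ".txt")
    try {
      Files.write(path, data.getBytes(StandardCharsets.UTF_8))
      val readBack =
        new String(Files.readAllBytes(path), StandardCharsets.UTF_8)
      assert(readBack == data)
      println("roundtrip ok")
    } finally {
      Files.delete(path)
    }
  }
}
```

The byte-level comparison is deliberately stricter than re-parsing: it catches a missing or extra trailing separator that a DataFrame-level roundtrip check would not notice.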