[SPARK-22346][ML] VectorSizeHint Transformer for using VectorAssembler in StructuredSteaming #19746

MrBago · 2017-11-14T15:09:32Z

What changes were proposed in this pull request?

A new VectorSizeHint transformer was added. This transformer is meant to be used as a pipeline stage ahead of VectorAssembler, on vector columns, so that VectorAssembler can join vectors in a streaming context where the size of the input vectors is otherwise not known.
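A minimal sketch of the usage described above, assuming Spark 2.3.0 or later on the classpath. The object name, column names, and sample row are illustrative assumptions and are not taken from this PR; the example runs in batch mode for simplicity, whereas the motivating case is a streaming DataFrame.

```scala
import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

object SizeHintBeforeAssembler {
  // Declares the size of `userFeatures`, then assembles it with a numeric column.
  def assemble(df: DataFrame): DataFrame = {
    val sizeHint = new VectorSizeHint()
      .setInputCol("userFeatures")
      .setSize(3)
    val assembler = new VectorAssembler()
      .setInputCols(Array("hour", "userFeatures"))
      .setOutputCol("features")
    assembler.transform(sizeHint.transform(df))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("size-hint-sketch")
      .getOrCreate()
    import spark.implicits._
    // Hypothetical input: one numeric column and one vector column of size 3.
    val df = Seq((18, Vectors.dense(0.0, 10.0, 0.5))).toDF("hour", "userFeatures")
    assemble(df).show(truncate = false)
    spark.stop()
  }
}
```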

How was this patch tested?

Unit tests.

SparkQA · 2017-11-14T15:13:42Z

Test build #83850 has finished for PR 19746 at commit df990ed.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class VectorSizeHint @Since("2.3.0") (@Since("2.3.0") override val uid: String)

SparkQA · 2017-11-14T22:08:18Z

Test build #83859 has finished for PR 19746 at commit df990ed.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class VectorSizeHint @Since("2.3.0") (@Since("2.3.0") override val uid: String)

SparkQA · 2017-11-14T22:17:46Z

Test build #83860 has finished for PR 19746 at commit 7f6ab98.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-15T02:46:14Z

Test build #83869 has finished for PR 19746 at commit 73fe1d8.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class InvalidEntryException(msg: String) extends Exception(msg)

SparkQA · 2017-11-15T22:26:22Z

Test build #83909 has finished for PR 19746 at commit 38e1c5c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class InvalidEntryException(msg: String) extends Exception(msg)

SparkQA · 2017-11-16T02:56:35Z

Test build #83918 has finished for PR 19746 at commit 03bd63c.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class VectorSizeHint @Since("2.3.0") (@Since("2.3.0") override val uid: String)
class InvalidEntryException(msg: String) extends Exception(msg)

WeichenXu123

I left some comments. Thanks!

WeichenXu123 · 2017-11-20T10:33:40Z

mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala

I think this can simply use:

val checkVectorSizeUDF = udf { vector: Vector => ...}
checkVectorSizeUDF(col(localInputCol))

so the code will be clearer.

WeichenXu123 · 2017-11-20T10:36:03Z

mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala

A UDF that can throw an exception should be marked as nondeterministic; see PR #19662 for more explanation.
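A minimal sketch of what the two review comments above suggest: a typed UDF applied directly to the input column and marked nondeterministic. The object name, helper name, and error messages are illustrative assumptions, not the code merged in this PR. `asNondeterministic()` is available on `UserDefinedFunction` from Spark 2.3.0.

```scala
import org.apache.spark.SparkException
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, udf}

object SizeCheckUdf {
  // Returns a column that passes vectors through unchanged and fails on
  // nulls or on vectors whose size differs from `expectedSize`.
  def checkedColumn(inputCol: String, expectedSize: Int): Column = {
    val checkVectorSizeUDF = udf { vector: Vector =>
      if (vector == null) {
        throw new SparkException(s"Got null vector in column $inputCol.")
      }
      if (vector.size != expectedSize) {
        throw new SparkException(
          s"Expected vectors of size $expectedSize in $inputCol but got ${vector.size}.")
      }
      vector
    }
    // Mark the UDF nondeterministic because it can throw, as the review asks.
    checkVectorSizeUDF.asNondeterministic()(col(inputCol))
  }
}
```

A caller would select or replace the column with `df.withColumn("vector", SizeCheckUdf.checkedColumn("vector", 3))`; the exception only surfaces when an action evaluates the column.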

WeichenXu123 · 2017-11-20T10:39:09Z

mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala

I think using res.na.drop(Array(localInputCol)) here would be better.
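For reference, a small sketch of the `na.drop` call suggested above. The object and method names are hypothetical; `DataFrameNaFunctions.drop(cols: Array[String])` drops rows that have a null in any of the listed columns and ignores nulls elsewhere.

```scala
import org.apache.spark.sql.DataFrame

object DropInvalidRows {
  // Keeps only rows whose value in `inputCol` is non-null;
  // nulls in other columns do not cause a row to be dropped.
  def dropNullsIn(res: DataFrame, inputCol: String): DataFrame =
    res.na.drop(Array(inputCol))
}
```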

WeichenXu123 · 2017-11-20T10:42:55Z

mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala

Do we need to define a new exception class? Or can we use SparkException directly?

WeichenXu123 · 2017-11-20T10:45:30Z

mllib/src/test/scala/org/apache/spark/ml/feature/VectorSizeHintSuite.scala

Using intercept[SparkException] {...} is better.

I've made the change. Just out of curiosity, why is intercept better than assertThrows?
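The thread does not record the reviewer's reason. One difference that may be relevant, sketched here under the assumption that ScalaTest is on the classpath: `assertThrows` only checks the exception type, while `intercept` also returns the caught exception so that a test can inspect it. The object name and the failing helper are illustrative only.

```scala
import org.scalatest.Assertions._

object InterceptVsAssertThrows {
  // Hypothetical stand-in for a transform that fails on invalid input.
  def failing(): Unit = throw new IllegalArgumentException("size mismatch")

  def main(args: Array[String]): Unit = {
    // assertThrows only verifies that the expected exception type was thrown.
    assertThrows[IllegalArgumentException] { failing() }

    // intercept also returns the exception, so its message can be checked.
    val e = intercept[IllegalArgumentException] { failing() }
    assert(e.getMessage.contains("size mismatch"))
  }
}
```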

WeichenXu123 · 2017-11-20T10:50:50Z

mllib/src/test/scala/org/apache/spark/ml/feature/VectorSizeHintSuite.scala

I can't find a test for the optimistic option. We should test:
If the input dataset's vector column does not include metadata, VectorSizeHint should add metadata with the proper size; if the input vector column includes metadata with a different size, VectorSizeHint should replace it.

I talked offline to @jkbradley and I think it's better to throw an exception if the column includes metadata and there is a mismatch between the new and original size.

I've added a new test for this exception and made sure the other tests are run with all handleInvalid cases. Does it look ok now?
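A sketch of how the size metadata discussed above can be read back in a test, assuming the column is a vector column. The object and method names are hypothetical; `AttributeGroup.fromStructField` and `AttributeGroup.size` are part of `org.apache.spark.ml.attribute`, and `size` is -1 when no size is recorded.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.DataFrame

object VectorSizeMetadata {
  // Reads the vector size recorded in the column's ML metadata.
  // AttributeGroup reports -1 when the size is unknown.
  def sizeOf(df: DataFrame, colName: String): Int =
    AttributeGroup.fromStructField(df.schema(colName)).size
}
```

After `new VectorSizeHint().setInputCol("vector").setSize(3).transform(df)`, `sizeOf(result, "vector")` should return 3, while a vector column built without ML metadata should report -1.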

SparkQA · 2017-11-20T22:35:34Z

Test build #84036 has finished for PR 19746 at commit fb51cbf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-21T00:16:06Z

Test build #84039 has finished for PR 19746 at commit 7b51563.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123 · 2017-11-21T01:38:23Z

mllib/src/test/scala/org/apache/spark/ml/feature/VectorSizeHintSuite.scala

Add a check for the added metadata here?

We should also test that an exception is thrown if metadata exists but the size does not match.

Can I just remove this test? I feel like all of that is tested in the first 3 tests.

OK. I agree. Other test cases already cover them.

SparkQA · 2017-11-21T08:05:02Z

Test build #84046 has finished for PR 19746 at commit 2b1ed0a.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

MrBago · 2017-11-22T00:37:05Z

jenkins retest this please

SparkQA · 2017-11-22T03:12:06Z

Test build #84089 has finished for PR 19746 at commit 2b1ed0a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

MrBago · 2017-11-22T22:31:32Z

jenkins retest this please

SparkQA · 2017-11-23T02:42:44Z

Test build #84122 has finished for PR 19746 at commit 2b1ed0a.

This patch fails from timeout after a configured wait of `250m`.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123 · 2017-11-28T08:22:43Z

mllib/src/test/scala/org/apache/spark/ml/feature/VectorSizeHintSuite.scala

case (data, transform) ==> case (data, transformer)

WeichenXu123 · 2017-11-28T08:23:15Z

mllib/src/test/scala/org/apache/spark/ml/feature/VectorSizeHintSuite.scala

Using CheckAnswer(expected, expected) would be simpler.

The reason I didn't use CheckAnswer is because there isn't an implicit encoder in testImplicits that handles Vector. I tried CheckAnswer[Vector](expected, expected) but that doesn't work either :(. Is there an encoder that works for Vectors?

ah, sorry, it should be CheckAnswer(Tuple1(expected), Tuple1(expected)). It should work I think.

WeichenXu123 · 2017-11-28T08:25:28Z

What about supporting multiple columns? VectorAssembler requires multiple input columns, and they all need VectorSizeHint to transform them first. There should be no need to use multiple VectorSizeHint transformers.

jkbradley · 2017-12-01T22:50:10Z

@WeichenXu123 From what I've seen, it's more common for people to use VectorAssembler to assemble a bunch of Numeric columns, rather than a bunch of Vector columns. I'd recommend we do things incrementally, adding single-column support before multi-column support (especially since we're still trying to achieve consensus about design for multi-column support, per my recent comment in the umbrella JIRA).

SparkQA · 2017-12-05T03:39:18Z

Test build #84449 has finished for PR 19746 at commit 0837b76.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-05T04:31:54Z

Test build #84451 has finished for PR 19746 at commit c3d1c5e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2017-12-08T19:01:07Z

reviewing now

jkbradley

Thanks for the PR! I had a number of comments, but they are mostly small ones.

jkbradley · 2017-12-08T01:45:56Z

mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala

Add :: Experimental :: note here so it shows up properly in docs. Look at other uses of Experimental for examples. (Same for the companion object)

Also, it'd be good to add more docs about why/when people should use this.

jkbradley · 2017-12-08T19:00:53Z

mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala

style: always specify type explicitly (There was some better reason for this which I forget...)

jkbradley · 2017-12-08T19:02:03Z

mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala

Add a docstring and mark with @group param

jkbradley · 2017-12-08T19:02:18Z

mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala

Mark with @group getParam

jkbradley · 2017-12-08T19:08:15Z

mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala

The writing here is formatted strangely. How about:
"How to handle invalid vectors in inputCol. Invalid vectors include nulls and vectors with the wrong size. The options are skip (filter out rows with invalid vectors), error (throw an error) and keep (do not check the vector size, and keep all rows)."

jkbradley · 2017-12-08T20:00:02Z

mllib/src/test/scala/org/apache/spark/ml/feature/VectorSizeHintSuite.scala

typo: mismatch

jkbradley · 2017-12-08T20:02:32Z

mllib/src/test/scala/org/apache/spark/ml/feature/VectorSizeHintSuite.scala

style nit: Call collect() with parentheses

jkbradley · 2017-12-08T20:03:19Z

mllib/src/test/scala/org/apache/spark/ml/feature/VectorSizeHintSuite.scala

Test keep/optimistic too

Did you have a thought on how to test keep/optimistic? I could verify that the invalid data is not removed but that's a little bit weird to test. It's ensuring that this option allows the column to get into a "bad state" where the metadata doesn't match the contents. Is that what you had in mind?

Yep, that's what I had in mind. That is the expected behavior, so we can test that behavior...even if it's not what most use cases would need.
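The behavior discussed above can be sketched as follows, assuming Spark 2.3.0 or later. The object name, helper name, and sample vectors are illustrative assumptions rather than the test added in this PR.

```scala
import org.apache.spark.ml.feature.VectorSizeHint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object HandleInvalidModes {
  // Counts the rows that remain after a size-3 hint is applied to a column
  // holding one valid vector and one vector that is too short.
  def rowsKept(spark: SparkSession, mode: String): Long = {
    import spark.implicits._
    val data = Seq(Vectors.dense(1.0, 2.0, 3.0), Vectors.dense(2.0))
      .map(Tuple1.apply)
      .toDF("vector")
    new VectorSizeHint()
      .setInputCol("vector")
      .setSize(3)
      .setHandleInvalid(mode)
      .transform(data)
      .count()
  }
}
```

With `"skip"` the short vector should be filtered out, leaving one row; with `"optimistic"` no check is performed and both rows should remain, even though the size metadata no longer matches the contents.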

jkbradley · 2017-12-08T20:03:52Z

mllib/src/test/scala/org/apache/spark/ml/feature/VectorSizeHintSuite.scala

steaming -> streaming

jkbradley · 2017-12-08T20:05:23Z

mllib/src/test/scala/org/apache/spark/ml/feature/VectorSizeHintSuite.scala

You can just put these in a PipelineModel to avoid using foldLeft.
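A sketch of the two ways of chaining stages mentioned above. The object and method names are hypothetical. Outside Spark's own packages a `PipelineModel` is normally obtained by fitting a `Pipeline`, which is what this sketch assumes.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel, Transformer}
import org.apache.spark.sql.DataFrame

object ChainStages {
  // Chains transformers by hand, threading the DataFrame through each stage.
  def viaFoldLeft(stages: Array[Transformer], df: DataFrame): DataFrame =
    stages.foldLeft(df) { (current, stage) => stage.transform(current) }

  // Lets a Pipeline do the chaining; fitting yields a PipelineModel
  // that applies the same stages in order.
  def viaPipeline(stages: Array[Transformer], df: DataFrame): DataFrame = {
    val model: PipelineModel = new Pipeline().setStages(stages).fit(df)
    model.transform(df)
  }
}
```

Both helpers should produce the same columns for stages that are plain transformers, such as several VectorSizeHint instances followed by a VectorAssembler.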

jkbradley

Just a few comments based on the updates

jkbradley · 2017-12-13T21:45:31Z

mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala

+  override def copy(extra: ParamMap): this.type = defaultCopy(extra)
+}
+
+@Experimental


Add Scala docstring here with :: Experimental :: note.

jkbradley · 2017-12-13T21:47:01Z

mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala

+  /**
+   * Param for how to handle invalid entries. Invalid vectors include nulls and vectors with the
+   * wrong size. The options are `skip` (filter out rows with invalid vectors), `error` (throw an
+   * error) and `keep` (do not check the vector size, and keep all rows). `error` by default.


keep -> optimistic

jkbradley · 2017-12-13T21:47:09Z

mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala

+    "handleInvalid",
+    "How to handle invalid vectors in inputCol. Invalid vectors include nulls and vectors with " +
+      "the wrong size. The options are skip (filter out rows with invalid vectors), error " +
+      "(throw an error) and keep (do not check the vector size, and keep all rows). `error` by " +


keep -> optimistic

jkbradley · 2017-12-13T21:55:42Z

mllib/src/test/scala/org/apache/spark/ml/feature/VectorSizeHintSuite.scala

+      .setInputCols(Array("a", "b"))
+      .setOutputCol("assembled")
+    val pipeline = new Pipeline().setStages(Array(sizeHintA, sizeHintB, vectorAssembler))
+    /**


remove unused code?

SparkQA · 2017-12-14T00:59:44Z

Test build #84880 has finished for PR 19746 at commit cafa875.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-18T21:28:27Z

Test build #85071 has finished for PR 19746 at commit d63f077.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123

Only one minor issue, otherwise LGTM.

WeichenXu123 · 2017-12-19T05:42:52Z

mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala

+  /**
+   * Param for how to handle invalid entries. Invalid vectors include nulls and vectors with the
+   * wrong size. The options are `skip` (filter out rows with invalid vectors), `error` (throw an
+   * error) and `optimistic` (do not check the vector size, and keep all row\). `error` by default.


"row\" ==> "rows"

SparkQA · 2017-12-19T22:23:24Z

Test build #85134 has finished for PR 19746 at commit 9c3dcec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2017-12-22T22:08:59Z

LGTM
Merged to master
Thanks @MrBago and @WeichenXu123 !

MrBago changed the title ~~[SPARK-22346][ML]~~ [SPARK-22346][ML] VectorSizeHint Transformer for using VectorAssembler in StructuredSteaming Nov 14, 2017

MrBago force-pushed the vector-size-hint branch from 38e1c5c to 03bd63c Compare November 15, 2017 23:56

WeichenXu123 reviewed Nov 20, 2017

WeichenXu123 reviewed Nov 21, 2017

WeichenXu123 reviewed Nov 28, 2017

WeichenXu123 mentioned this pull request Nov 29, 2017

[SPARK-22644][ML][TEST] Make ML testsuite support StructuredStreaming test #19843

Closed

jkbradley reviewed Dec 8, 2017

MrBago added 5 commits December 13, 2017 13:24

Added VectorSizeHint Transformer in ml.feature.

24cc417

PR feedback.

2e76297

Error when size-hint size does not match existing metadata.

136d8f8

Drop redundant test.

e117c15

PR feedback.

591dcd2

MrBago added 2 commits December 13, 2017 13:25

More PR feedback.

b30e3b1

PR feedback.

cafa875

MrBago force-pushed the vector-size-hint branch from c3d1c5e to cafa875 Compare December 13, 2017 21:35

jkbradley reviewed Dec 13, 2017

MrBago added 2 commits December 18, 2017 12:07

Update VectorSizeHintSuite to test optimistic param.

7021552

Updated documentation for

d63f077

WeichenXu123 reviewed Dec 19, 2017

Fix typo.

9c3dcec

asfgit closed this in d23dc5b Dec 22, 2017
