Conversation

Contributor

@MrBago commented Nov 14, 2017

What changes were proposed in this pull request?

A new VectorSizeHint transformer was added. This transformer is meant to be used as a pipeline stage ahead of VectorAssembler, on vector columns, so that VectorAssembler can join vectors in a streaming context where the size of the input vectors is otherwise not known.
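For illustration, a minimal sketch of how this is meant to be used ahead of VectorAssembler (column names, the size value, and the setter names are assumptions for the example, not taken verbatim from the patch):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}

// Declare the size of the "userFeatures" vector column up front so that
// VectorAssembler can resolve its output schema on a streaming DataFrame,
// where no rows may be available yet to infer the vector size from.
val sizeHint = new VectorSizeHint()
  .setInputCol("userFeatures")
  .setSize(3)
  .setHandleInvalid("error")

val assembler = new VectorAssembler()
  .setInputCols(Array("hour", "userFeatures"))
  .setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(sizeHint, assembler))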

How was this patch tested?

Unit tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@SparkQA commented Nov 14, 2017

Test build #83850 has finished for PR 19746 at commit df990ed.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class VectorSizeHint @Since("2.3.0") (@Since("2.3.0") override val uid: String)

@MrBago changed the title to [SPARK-22346][ML] VectorSizeHint Transformer for using VectorAssembler in StructuredSteaming (Nov 14, 2017)
@SparkQA commented Nov 14, 2017

Test build #83859 has finished for PR 19746 at commit df990ed.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class VectorSizeHint @Since("2.3.0") (@Since("2.3.0") override val uid: String)

@SparkQA commented Nov 14, 2017

Test build #83860 has finished for PR 19746 at commit 7f6ab98.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 15, 2017

Test build #83869 has finished for PR 19746 at commit 73fe1d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class InvalidEntryException(msg: String) extends Exception(msg)

@SparkQA commented Nov 15, 2017

Test build #83909 has finished for PR 19746 at commit 38e1c5c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class InvalidEntryException(msg: String) extends Exception(msg)

@SparkQA commented Nov 16, 2017

Test build #83918 has finished for PR 19746 at commit 03bd63c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class VectorSizeHint @Since("2.3.0") (@Since("2.3.0") override val uid: String)
  • class InvalidEntryException(msg: String) extends Exception(msg)

Contributor

@WeichenXu123 left a comment

I've left some comments. Thanks!

Contributor

I think this can be simplified to:

val checkVectorSizeUDF = udf { vector: Vector => ...}
checkVectorSizeUDF(col(localInputCol))

That would make the code clearer.
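For context, a self-contained sketch of that suggestion might look like this (the expected size, DataFrame, and column name are hypothetical):

import org.apache.spark.SparkException
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Validate each vector's size inside a UDF and return the vector unchanged,
// throwing for nulls or vectors of the wrong size.
val expectedSize = 3
val checkVectorSizeUDF = udf { vector: Vector =>
  if (vector == null || vector.size != expectedSize) {
    throw new SparkException(s"Vector does not have the expected size $expectedSize.")
  }
  vector
}

val localInputCol = "userFeatures"
val checked = df.withColumn(localInputCol, checkVectorSizeUDF(col(localInputCol)))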

Contributor

A UDF that can throw an exception should be marked as nondeterministic; see PR #19662 for more explanation.
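For reference, a minimal, generic sketch of that marking (the UDF itself is illustrative, not the PR's code):

import org.apache.spark.sql.functions.udf

// A UDF that can throw should be marked nondeterministic so the optimizer does
// not re-order it relative to other operators and change when it can fail.
val strictParse = udf { s: String =>
  require(s != null, "null input")
  s.toInt
}.asNondeterministic()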

Contributor

I think using res.na.drop(Array(localInputCol)) here would be better.

Contributor

Do we need to define a new exception class, or can we use SparkException directly?

Contributor

Using intercept[SparkException] {...} is better.
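For illustration, a hypothetical test fragment, assuming a ScalaTest suite where intercept is in scope (transformer and dataFrame are placeholder names):

import org.apache.spark.SparkException

// intercept both asserts that the exception is thrown and returns it,
// so the message can be checked afterwards.
val thrown = intercept[SparkException] {
  transformer.transform(dataFrame).collect()
}
assert(thrown.getMessage.contains("size"))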

Contributor Author

I've made the change. Just out of curiosity, why is intercept better than assertThrows?

Contributor

I don't see a test for the optimistic option. We should test that if the input vector column does not include metadata, VectorSizeHint adds metadata with the proper size, and that if the column already includes metadata with a different size, VectorSizeHint replaces it.
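As a rough sketch of the kind of check being asked for (column name, size, input DataFrame, and the handleInvalid value are assumptions drawn from this thread):

import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.feature.VectorSizeHint

// After transforming, the output column's ML attribute metadata should carry
// the declared size, whether or not the input column had metadata before.
val hinted = new VectorSizeHint()
  .setInputCol("vector")
  .setSize(3)
  .setHandleInvalid("optimistic")
  .transform(dataFrame)
val group = AttributeGroup.fromStructField(hinted.schema("vector"))
assert(group.size == 3)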

Contributor Author

I talked offline with @jkbradley, and I think it's better to throw an exception when the column includes metadata and there is a mismatch between the new and original size.

I've added a new test for this exception and made sure the other tests are run with all handleInvalid cases. Does it look ok now?

@SparkQA commented Nov 20, 2017

Test build #84036 has finished for PR 19746 at commit fb51cbf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 21, 2017

Test build #84039 has finished for PR 19746 at commit 7b51563.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Add a check for the added metadata here?

We should also test that if metadata exists but the size does not match, an exception is thrown.

Contributor Author

Can I just remove this test? I feel like all of that is tested in the first 3 tests.

Contributor

OK, I agree. The other test cases already cover them.

@SparkQA commented Nov 21, 2017

Test build #84046 has finished for PR 19746 at commit 2b1ed0a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

@MrBago commented Nov 22, 2017

jenkins retest this please

@SparkQA commented Nov 22, 2017

Test build #84089 has finished for PR 19746 at commit 2b1ed0a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

@MrBago commented Nov 22, 2017

jenkins retest this please

@SparkQA commented Nov 23, 2017

Test build #84122 has finished for PR 19746 at commit 2b1ed0a.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

case (data, transform) ==> case (data, transformer)

Contributor

Using CheckAnswer(expected, expected) would be simpler.

Contributor Author

The reason I didn't use CheckAnswer is that there isn't an implicit encoder in testImplicits that handles Vector. I tried CheckAnswer[Vector](expected, expected), but that doesn't work either :(. Is there an encoder that works for Vectors?

Contributor

Ah, sorry, it should be CheckAnswer(Tuple1(expected), Tuple1(expected)). I think that should work.
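For context, a rough sketch of the kind of streaming check being discussed, assuming a suite that mixes in Spark's StreamTest helpers (MemoryStream, testStream, AddData, CheckAnswer) and testImplicits; sizeHint and the vector value are placeholders:

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.execution.streaming.MemoryStream

// Feed Tuple1-wrapped vectors through a memory stream so the implicit product
// encoder handles the Vector column, and compare the sink output the same way.
val vector: Vector = Vectors.dense(1.0, 2.0, 3.0)
val stream = MemoryStream[Tuple1[Vector]]
val transformed = sizeHint.transform(stream.toDS().toDF("vector"))
testStream(transformed)(
  AddData(stream, Tuple1(vector), Tuple1(vector)),
  CheckAnswer(Tuple1(vector), Tuple1(vector))
)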

@WeichenXu123
Contributor

What about supporting multiple columns? VectorAssembler requires multiple input columns, and they all need to be transformed by VectorSizeHint first. Users shouldn't need a separate VectorSizeHint transformer for each one.

@jkbradley
Member

@WeichenXu123 From what I've seen, it's more common for people to use VectorAssembler to assemble a bunch of Numeric columns, rather than a bunch of Vector columns. I'd recommend we do things incrementally, adding single-column support before multi-column support (especially since we're still trying to achieve consensus about design for multi-column support, per my recent comment in the umbrella JIRA).

@SparkQA commented Dec 5, 2017

Test build #84449 has finished for PR 19746 at commit 0837b76.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 5, 2017

Test build #84451 has finished for PR 19746 at commit c3d1c5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

reviewing now

Member

@jkbradley left a comment

Thanks for the PR! I had a number of comments, but they are mostly small ones.

Member

Add :: Experimental :: note here so it shows up properly in docs. Look at other uses of Experimental for examples. (Same for the companion object)

Member

Also, it'd be good to add more docs about why/when people should use this.

Member

style: always specify type explicitly (There was some better reason for this which I forget...)

Member

Add a docstring and mark with @group param

Member

Mark with @group getParam

Member

The writing here is formatted strangely. How about:
"How to handle invalid vectors in inputCol. Invalid vectors include nulls and vectors with the wrong size. The options are skip (filter out rows with invalid vectors), error (throw an error) and keep (do not check the vector size, and keep all rows)."

Member

typo: mismatch

Member

style nit: Call collect() with parentheses

Member

Test keep/optimistic too

Contributor Author

Did you have a thought on how to test keep/optimistic? I could verify that the invalid data is not removed, but that's a little bit weird to test. It's ensuring that this option allows the column to get into a "bad state" where the metadata doesn't match the contents. Is that what you had in mind?

Member

Yep, that's what I had in mind. That is the expected behavior, so we can test that behavior...even if it's not what most use cases would need.

Member

typo: steaming -> streaming

Member

You can just put these in a PipelineModel to avoid using foldLeft.
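For illustration, one way to do that (stage and DataFrame names are hypothetical; fitting a Pipeline whose stages are all Transformers simply wraps them in a PipelineModel):

import org.apache.spark.ml.Pipeline

// Chain the transformers once instead of folding each one over the DataFrame.
val model = new Pipeline()
  .setStages(Array(sizeHintA, sizeHintB, vectorAssembler))
  .fit(dataFrame)
val assembled = model.transform(dataFrame)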

Member

@jkbradley left a comment

Just a few comments based on the updates

override def copy(extra: ParamMap): this.type = defaultCopy(extra)
}

@Experimental
Member

Add Scala docstring here with :: Experimental :: note.

/**
* Param for how to handle invalid entries. Invalid vectors include nulls and vectors with the
* wrong size. The options are `skip` (filter out rows with invalid vectors), `error` (throw an
* error) and `keep` (do not check the vector size, and keep all rows). `error` by default.
Member

keep -> optimistic

"handleInvalid",
"How to handle invalid vectors in inputCol. Invalid vectors include nulls and vectors with " +
"the wrong size. The options are skip (filter out rows with invalid vectors), error " +
"(throw an error) and keep (do not check the vector size, and keep all rows). `error` by " +
Member

keep -> optimistic

.setInputCols(Array("a", "b"))
.setOutputCol("assembled")
val pipeline = new Pipeline().setStages(Array(sizeHintA, sizeHintB, vectorAssembler))
/**
Member

remove unused code?

@SparkQA commented Dec 14, 2017

Test build #84880 has finished for PR 19746 at commit cafa875.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 18, 2017

Test build #85071 has finished for PR 19746 at commit d63f077.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@WeichenXu123 left a comment

Only one minor issue, otherwise LGTM.

/**
* Param for how to handle invalid entries. Invalid vectors include nulls and vectors with the
* wrong size. The options are `skip` (filter out rows with invalid vectors), `error` (throw an
* error) and `optimistic` (do not check the vector size, and keep all row\). `error` by default.
Contributor

@WeichenXu123 commented Dec 19, 2017

"row\" ==> "rows"

@SparkQA commented Dec 19, 2017

Test build #85134 has finished for PR 19746 at commit 9c3dcec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit closed this in d23dc5b on Dec 22, 2017
@jkbradley
Member

LGTM
Merged to master
Thanks @MrBago and @WeichenXu123 !
