[SPARK-23615][ML][PYSPARK]Add maxDF Parameter to Python CountVectorizer #20777

huaxingao · 2018-03-08T22:33:10Z

What changes were proposed in this pull request?

The maxDF parameter is for filtering out frequently occurring terms. This param was recently added to the Scala CountVectorizer and needs to be added to Python also.

How was this patch tested?

Added a test.

SparkQA · 2018-03-08T22:59:13Z

Test build #88106 has finished for PR 20777 at commit cbf70bb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler

Thanks for the PR @huaxingao! Looks good, I just think we should fix up the tests and docs.

BryanCutler · 2018-03-09T00:35:01Z

python/pyspark/ml/feature.py

I think this is too much to put as a doctest. Instead, can you just add a unit test in ml/tests.py? I think you just need 2 transforms, one with an integer value of maxDF > 1 and one as a fractional value. Also, I don't think your test data actually uses the maxDF filtering.

BryanCutler · 2018-03-09T00:38:59Z

python/pyspark/ml/feature.py

I think this documentation is exactly the same as minDF; please refer to the Scala docs. Actually, I think the Scala doc is a little confusing and could be clearer. Would you like to take a shot at rewording it?

BryanCutler · 2018-03-09T00:40:23Z

python/pyspark/ml/feature.py

I'm not crazy about hardcoding a value here since in Scala it is Long.MaxValue, but I'm not sure there is another way.

Thank you very much for the comments. Will make changes.

SparkQA · 2018-03-09T06:47:34Z

Test build #88115 has finished for PR 20777 at commit c1aeac1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler

Thanks @huaxingao, looks good! I just requested a minor tweak in the doc.

BryanCutler · 2018-03-14T22:03:53Z

mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala

This sounds much better, but it should probably use "ignore" instead of "remove", and it might be good to just change the order of the sentence like this:

Specifies the maximum number of different documents a term could appear in to be included in the vocabulary. A term that appears more than the threshold will be ignored. If this is an integer greater than or equal to 1, this specifies the maximum number of documents the term could appear in; if this is a double in [0,1), then this specifies the maximum fraction of documents the term could appear in.
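The threshold rule in this wording can be sketched in plain Python. This is only an illustration of the described semantics, not Spark's implementation: the helper name `filter_vocab_by_max_df` and the sample documents are hypothetical, and the result is sorted alphabetically purely to keep the output deterministic.

```python
from collections import Counter


def filter_vocab_by_max_df(docs, max_df):
    """Return the sorted terms whose document frequency is within max_df.

    Hypothetical helper for illustration; not part of Spark.
    """
    n_docs = len(docs)
    # A value >= 1 is an absolute document count; a value in [0, 1) is a
    # fraction of the total number of documents.
    threshold = max_df if max_df >= 1 else max_df * n_docs

    # Count each term once per document it appears in.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    # Terms that appear in more documents than the threshold are ignored.
    return sorted(term for term, count in doc_freq.items() if count <= threshold)


docs = [["a", "b", "c", "d"], ["a", "b", "c"], ["a", "b"], ["a"]]
print(filter_vocab_by_max_df(docs, 3))     # ['b', 'c', 'd']
print(filter_vocab_by_max_df(docs, 0.75))  # ['b', 'c', 'd']
print(filter_vocab_by_max_df(docs, 0.5))   # ['c', 'd']
```

With four documents, "a" appears in all four, so an integer threshold of 3 and a fractional threshold of 0.75 both drop it, while 0.5 also drops "b".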

BryanCutler · 2018-03-14T22:08:16Z

mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala

good catch!

BryanCutler · 2018-03-14T22:18:48Z

python/pyspark/ml/tests.py

Could you also add an assert that the vocabulary is equal to something? I think it would be ['b', 'c', 'd'].

Hi Bryan, Thanks for your comments. I will change these.

SparkQA · 2018-03-15T00:13:19Z

Test build #88244 has finished for PR 20777 at commit 91405f3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler · 2018-03-15T17:19:15Z

mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala

@srowen do these doc changes look ok to you? It was a little confusing before saying that the term "must appear" when it's a max value.

Agree, your wording is clearer.

Thanks @srowen !

BryanCutler · 2018-03-15T17:24:04Z

python/pyspark/ml/feature.py

I think it's best just to hardcode the value like you did before; sys.maxsize can be 32-bit on some systems: https://docs.python.org/3/library/sys.html#sys.maxsize
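The portability concern can be checked directly. This is a minimal sketch, assuming the hardcoded default is meant to mirror Scala's Long.MaxValue; the constant name `JAVA_LONG_MAX` is hypothetical and not taken from the patch.

```python
import sys

# Scala's Long.MaxValue, written out so that it does not depend on the
# word size of the Python interpreter. (Hypothetical name, for illustration.)
JAVA_LONG_MAX = 2 ** 63 - 1  # 9223372036854775807

# sys.maxsize depends on the platform: 2**31 - 1 on 32-bit builds and
# 2**63 - 1 on 64-bit builds, so it only matches Long.MaxValue on the latter.
matches = sys.maxsize == JAVA_LONG_MAX
print("sys.maxsize:", sys.maxsize)
print("matches Long.MaxValue:", matches)
```

A literal keeps the Python default identical on every platform, at the cost of repeating the JVM constant by hand.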

Will make the change now. Thanks!

SparkQA · 2018-03-15T22:23:28Z

Test build #88276 has finished for PR 20777 at commit d6cd73a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler · 2018-03-16T18:52:35Z

mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala

I think the format for scaladoc actually needs the extra '^' to display right, see the vocabSize default.

If you can, it's always a good idea to generate the docs to verify any changes.

huaxingao · 2018-03-20T21:40:39Z

@BryanCutler Do you mind if I close this PR and open a new one? I got problems when I tried to resolve the conflicts.

BryanCutler · 2018-03-20T23:47:14Z

@huaxingao, it's best to keep the same PR if possible to better preserve the discussion history. Could you give it another try to resolve conflicts?

SparkQA · 2018-03-21T18:30:13Z

Test build #88482 has finished for PR 20777 at commit d34165a.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-03-21T18:58:19Z

Test build #88480 has finished for PR 20777 at commit ca35029.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class _CountVectorizerParams(JavaParams, HasInputCol, HasOutputCol):
class CountVectorizerModel(JavaModel, _CountVectorizerParams, JavaMLReadable, JavaMLWritable):

SparkQA · 2018-03-21T22:04:33Z

Test build #88489 has finished for PR 20777 at commit 515bd5b.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class _CountVectorizerParams(JavaParams, HasInputCol, HasOutputCol):

SparkQA · 2018-03-21T22:09:27Z

Test build #88490 has finished for PR 20777 at commit 81fd23b.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-03-22T00:04:34Z

Test build #88491 has finished for PR 20777 at commit d06e64b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler

LGTM, I'll merge tomorrow if no more comments

BryanCutler · 2018-03-23T23:03:12Z

merged to master! thanks @huaxingao

huaxingao · 2018-03-24T00:08:35Z

Thank you very much for your help! @BryanCutler

BryanCutler requested changes Mar 9, 2018

BryanCutler reviewed Mar 14, 2018

BryanCutler reviewed Mar 15, 2018

BryanCutler reviewed Mar 16, 2018

huaxingao force-pushed the spark-23615 branch from ca35029 to d34165a on March 21, 2018 18:23

huaxingao added 3 commits March 21, 2018 14:23

2c1e8f0 [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to Python CountVectorizer
360d26b address comments
515bd5b resolve conflict 4

huaxingao force-pushed the spark-23615 branch from d34165a to 515bd5b on March 21, 2018 22:00

81fd23b fix a minor nit
d06e64b scala style problem

BryanCutler approved these changes Mar 23, 2018

asfgit closed this in a336553 Mar 23, 2018
