Skip to content

Conversation

@huaxingao
Copy link
Contributor

@huaxingao huaxingao commented Mar 8, 2018

What changes were proposed in this pull request?

The maxDF parameter is for filtering out frequently occurring terms. This param was recently added to the Scala CountVectorizer and needs to be added to Python also.

How was this patch tested?

add test

@SparkQA
Copy link

SparkQA commented Mar 8, 2018

Test build #88106 has finished for PR 20777 at commit cbf70bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pr @huaxingao! Looks good, I just think we should fix up the tests and docs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is too much to put as a doctest. Instead, can you just add a unit test in ml/tests.py? I think you just need 2 transforms, one with an integer value of maxDF > 1 and one as a fractional value. Also, I don't think your test data actually uses the maxDF filtering.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this documentation is exactly the same as minDF, please refer to the scala docs. Actually, I think the scala doc is a little confusing and could be clearer. Would you like to take a shot at rewording it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not crazy about hardcoding a value here since in Scala it is Long.MaxValue, but I'm not sure there is another way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for the comments. Will make changes.

@SparkQA
Copy link

SparkQA commented Mar 9, 2018

Test build #88115 has finished for PR 20777 at commit c1aeac1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @huaxingao , looks good! I just requested a minor tweak in the doc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds much better, but probably should use ignore instead of remove and might be good to just change the order of the sentence like this:

Specifies the maximum number of different documents a term could appear in to be included
in the vocabulary. A term that appears more than the threshold will be ignored. If this is an
integer greater than or equal to 1, this specifies the maximum number of documents the term
could appear in; if this is a double in [0,1), then this specifies the maximum fraction of
documents the term could appear in.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add an assert that the vocabulary is equal to something? I think it would be ['b', 'c' 'd']

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Bryan, Thanks for your comments. I will change these.

@SparkQA
Copy link

SparkQA commented Mar 15, 2018

Test build #88244 has finished for PR 20777 at commit 91405f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srowen do these doc changes look ok to you? It was a little confusing before saying that the term "must appear" when it's a max value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, your wording is clearer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @srowen !

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's best just to hardcode the value like you did before, sys.maxsize can be 32bit on some systems https://docs.python.org/3/library/sys.html#sys.maxsize

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will make the change now. Thanks!

@SparkQA
Copy link

SparkQA commented Mar 15, 2018

Test build #88276 has finished for PR 20777 at commit d6cd73a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the format for scaladoc actually needs the extra '^' to display right, see the vocabSize default.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can, it's always a good idea to generate the docs to make sure of any changes

@huaxingao
Copy link
Contributor Author

@BryanCutler Do you mind if I close this PR and open a new one? I got problems when I tried to resolve the conflicts.

@BryanCutler
Copy link
Member

@huaxingao , it's best to keep the same PR if possible to better preserve the discussion history. Could you give it another try to resolve conflicts?

@SparkQA
Copy link

SparkQA commented Mar 21, 2018

Test build #88482 has finished for PR 20777 at commit d34165a.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 21, 2018

Test build #88480 has finished for PR 20777 at commit ca35029.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class _CountVectorizerParams(JavaParams, HasInputCol, HasOutputCol):
  • class CountVectorizerModel(JavaModel, _CountVectorizerParams, JavaMLReadable, JavaMLWritable):

@SparkQA
Copy link

SparkQA commented Mar 21, 2018

Test build #88489 has finished for PR 20777 at commit 515bd5b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class _CountVectorizerParams(JavaParams, HasInputCol, HasOutputCol):

@SparkQA
Copy link

SparkQA commented Mar 21, 2018

Test build #88490 has finished for PR 20777 at commit 81fd23b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 22, 2018

Test build #88491 has finished for PR 20777 at commit d06e64b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I'll merge tomorrow if no more comments

@asfgit asfgit closed this in a336553 Mar 23, 2018
@BryanCutler
Copy link
Member

merged to master! thanks @huaxingao

@huaxingao
Copy link
Contributor Author

Thank you very much for your help! @BryanCutler

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants