[SPARK-23615][ML][PYSPARK]Add maxDF Parameter to Python CountVectorizer #20777
Test build #88106 has finished for PR 20777 at commit
BryanCutler
left a comment
Thanks for the pr @huaxingao! Looks good, I just think we should fix up the tests and docs.
python/pyspark/ml/feature.py
Outdated
I think this is too much to put as a doctest. Instead, can you just add a unit test in ml/tests.py? I think you just need 2 transforms, one with an integer value of maxDF > 1 and one as a fractional value. Also, I don't think your test data actually uses the maxDF filtering.
python/pyspark/ml/feature.py
Outdated
I think this documentation is exactly the same as minDF; please refer to the Scala docs. Actually, I think the Scala doc is a little confusing and could be clearer. Would you like to take a shot at rewording it?
python/pyspark/ml/feature.py
Outdated
I'm not crazy about hardcoding a value here since in Scala it is Long.MaxValue, but I'm not sure there is another way.
Thank you very much for the comments. Will make changes.
Test build #88115 has finished for PR 20777 at commit
BryanCutler
left a comment
Thanks @huaxingao , looks good! I just requested a minor tweak in the doc
This sounds much better, but it should probably use "ignore" instead of "remove", and it might be good to change the order of the sentence like this:
Specifies the maximum number of different documents a term could appear in to be included
in the vocabulary. A term that appears more than the threshold will be ignored. If this is an
integer greater than or equal to 1, this specifies the maximum number of documents the term
could appear in; if this is a double in [0,1), then this specifies the maximum fraction of
documents the term could appear in.
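To make the proposed wording concrete, here is a minimal pure-Python sketch of the behavior it describes. It is not Spark's implementation: the helper names `resolve_max_df` and `build_vocabulary`, the sample documents, and the ordering of the resulting vocabulary (corpus frequency descending, ties broken alphabetically) are all assumptions made for illustration.

```python
from collections import Counter


def resolve_max_df(max_df, num_docs):
    """Turn a maxDF setting into an absolute document-count threshold.

    Assumption taken from the doc wording: values >= 1 are read as a
    document count, values in [0, 1) as a fraction of the corpus.
    """
    if max_df < 0:
        raise ValueError("max_df must be non-negative")
    return float(max_df) if max_df >= 1 else max_df * num_docs


def build_vocabulary(docs, max_df):
    """Return the terms whose document frequency does not exceed the threshold."""
    threshold = resolve_max_df(max_df, len(docs))
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term in the corpus
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    kept = [term for term in term_freq if doc_freq[term] <= threshold]
    # Hypothetical ordering: most frequent first, ties broken alphabetically.
    return sorted(kept, key=lambda term: (-term_freq[term], term))


# Hypothetical corpus: "a" appears in all 4 documents, "b" in 3, "c" in 2, "d" in 1.
docs = [
    "a b c d".split(),
    "a b c".split(),
    "a b".split(),
    "a".split(),
]

print(build_vocabulary(docs, 3))     # integer threshold: at most 3 documents
print(build_vocabulary(docs, 0.75))  # fractional threshold: at most 75% of documents
```

With this data both calls ignore only "a", since it is the one term that appears in more than three of the four documents. Data in which no term exceeds the threshold would not exercise the filter at all.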
good catch!
python/pyspark/ml/tests.py
Outdated
Could you also add an assert that the vocabulary is equal to something? I think it would be ['b', 'c', 'd']
Hi Bryan, Thanks for your comments. I will change these.
Test build #88244 has finished for PR 20777 at commit
@srowen do these doc changes look ok to you? It was a little confusing before saying that the term "must appear" when it's a max value.
Agree, your wording is clearer.
Thanks @srowen !
python/pyspark/ml/feature.py
Outdated
I think it's best just to hardcode the value like you did before, since sys.maxsize can be 32-bit on some systems: https://docs.python.org/3/library/sys.html#sys.maxsize
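A short sketch of the concern: Scala's Long.MaxValue is a fixed signed 64-bit value, while sys.maxsize depends on the interpreter build. The constant name `JAVA_LONG_MAX` is hypothetical and is not necessarily what the patch uses.

```python
import sys

# Scala's Long.MaxValue (largest signed 64-bit integer), written out explicitly.
# The name JAVA_LONG_MAX is an illustrative assumption.
JAVA_LONG_MAX = 2 ** 63 - 1

# sys.maxsize is platform dependent: it equals 2**63 - 1 on typical 64-bit
# builds but only 2**31 - 1 on 32-bit builds, so it is not a safe stand-in.
matches_long_max = sys.maxsize == JAVA_LONG_MAX

print(JAVA_LONG_MAX)
print(matches_long_max)
```

Hardcoding the constant keeps the Python default the same on every platform, at the cost of duplicating a value that Scala derives from Long.MaxValue.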
Will make the change now. Thanks!
Test build #88276 has finished for PR 20777 at commit
I think the format for scaladoc actually needs the extra '^' to display right, see the vocabSize default.
If you can, it's always a good idea to generate the docs to make sure of any changes
@BryanCutler Do you mind if I close this PR and open a new one? I got problems when I tried to resolve the conflicts.
@huaxingao , it's best to keep the same PR if possible to better preserve the discussion history. Could you give it another try to resolve conflicts?
Test build #88482 has finished for PR 20777 at commit
Test build #88480 has finished for PR 20777 at commit
Test build #88489 has finished for PR 20777 at commit
Test build #88490 has finished for PR 20777 at commit
Test build #88491 has finished for PR 20777 at commit
BryanCutler
left a comment
LGTM, I'll merge tomorrow if no more comments
merged to master! thanks @huaxingao
Thank you very much for your help! @BryanCutler
What changes were proposed in this pull request?
The maxDF parameter is for filtering out frequently occurring terms. This param was recently added to the Scala CountVectorizer and needs to be added to Python also.
How was this patch tested?
Added a unit test in python/pyspark/ml/tests.py.