[SPARK-21741][ML][PySpark] Python API for DataFrame-based multivariate summarizer #20695
Test build #87778 has finished for PR 20695 at commit
Test build #87782 has finished for PR 20695 at commit
Test build #87784 has finished for PR 20695 at commit
2.4.0?
Test build #87816 has finished for PR 20695 at commit
MrBago
left a comment
One small comment, otherwise LGTM.
python/pyspark/ml/stat.py
Outdated
        return Summarizer(js)

    @since("2.4.0")
    def summary(self, featureCol, weightCol=None):
We might want to move the "summary" method into another class, and have Summarizer contain only static methods. That will help with autocomplete, so that it's clear you're not meant to do Summarizer.metrics("min").mean(features).
Sounds reasonable.
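A minimal sketch of the split suggested above, runnable without Spark. The class bodies are stand-ins, not the code from this PR: the builder name SummaryBuilder and the dict returned by summary are assumptions made for illustration (the real method would return a Spark Column).

```python
class SummaryBuilder(object):
    """Hypothetical stand-in for the object returned by Summarizer.metrics."""

    def __init__(self, metrics):
        self._metrics = tuple(metrics)

    def summary(self, featuresCol, weightCol=None):
        # Assumption: the real method would return a Spark Column. A plain
        # dict is returned here so the sketch runs without Spark.
        return {
            "metrics": self._metrics,
            "featuresCol": featuresCol,
            "weightCol": weightCol,
        }


class Summarizer(object):
    """Holds only static entry points; there is no instance-level summary."""

    @staticmethod
    def metrics(*metrics):
        # Returns a builder, so chaining .mean(...) after .metrics(...) fails.
        return SummaryBuilder(metrics)

    @staticmethod
    def mean(col, weightCol=None):
        # Convenience shortcut for a single metric.
        return Summarizer.metrics("mean").summary(col, weightCol)
```

With this layout, Summarizer.metrics("min") returns an object that only offers summary, so autocomplete no longer suggests mean or metrics on it.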
Test build #88463 has finished for PR 20695 at commit
Test build #89167 has finished for PR 20695 at commit
jkbradley
left a comment
Looks good, just a few comments. Thanks!
python/pyspark/ml/stat.py
Outdated
        return SummarizerBuilder(js)


class SummarizerBuilder(object):
This name needs to match its Scala equivalent: "SummaryBuilder"
Also, shouldn't we use JavaWrapper for this? That will clean up when this object is destroyed.
python/pyspark/ml/stat.py
Outdated
        self._js = js

    @since("2.4.0")
    def summary(self, featureCol, weightCol=None):
ditto: naming should match Scala: "featuresCol"
    def summary(self, featureCol, weightCol=None):
        """
        Returns an aggregate object that contains the summary of the column with the requested
        metrics.
Let's copy the docs for arguments & return value from Scala
Test build #89331 has finished for PR 20695 at commit
Test build #89333 has finished for PR 20695 at commit
jkbradley
left a comment
Thanks for updates, just 1 new comment
python/pyspark/ml/stat.py
Outdated
    """
    def __init__(self, js):
        self._js = js
This should call the super's init method, and it should store js in _java_obj (which is set in the JavaWrapper init).
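A rough sketch of what that could look like, runnable without Spark. The JavaWrapper below is a simplified stand-in written for this example, not the class from pyspark.ml.wrapper, and the parameter name jSummaryBuilder is an assumption.

```python
class JavaWrapper(object):
    """Simplified stand-in (assumption) for pyspark.ml.wrapper.JavaWrapper."""

    def __init__(self, java_obj=None):
        super(JavaWrapper, self).__init__()
        # The base class owns the handle to the JVM-side object.
        self._java_obj = java_obj


class SummaryBuilder(JavaWrapper):
    """Builder that delegates handle storage to the wrapper base class."""

    def __init__(self, jSummaryBuilder):
        # Call the super's init so the handle lands in _java_obj rather than
        # in a separate attribute such as _js.
        super(SummaryBuilder, self).__init__(jSummaryBuilder)
```

Keeping the handle in _java_obj means any cleanup the base class performs when the Python object is destroyed applies to the builder as well.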
Test build #89422 has finished for PR 20695 at commit
LGTM
What changes were proposed in this pull request?
Python API for DataFrame-based multivariate summarizer.
How was this patch tested?
doctest added.