-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-9906] [ML] User guide for LogisticRegressionSummary #8197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
docs/ml-guide.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@feynmanliang Is there are a shorthand to do this directly in Spark SQL? Once I understand that I can update the Java Example as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might be able to get it done by doing an aggregation (e.g. max) but I haven't tried myself. In any case, I think just keeping everything up to L868 is sufficient since L870-872 aren't really showing how to use the summary feature anyways
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had thought the user can choose the best threshold and re-run LogisticRegression on this best threshold.
In the first run, it will be a random sampling of the dataset and in the second run it will be the entire dataset. Do you still think it will not be useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The closest I could come up with is.
val maxFMeasure = fMeasure.select(max(df("F-Measure"))).collect()(0).getFloat(0)
val threshold = fMeasure.filter(df("F-Measure") >= maxFMeasure).collect()(0).getFloat(0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
I think what you described is useful, but is outside the scope of LogisticRegressionSummary. L869-L872 don't demonstrate any of the functionality these docs are intended to describe, which is why I propose we remove it. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I wouldn't complain because it makes my job easier.
|
Test build #40866 has finished for PR 8197 at commit
|
docs/ml-guide.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"toyData" -> "toy data"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Scala / Java effect :P
7cc3f58 to
56cb35b
Compare
|
Test build #40966 has finished for PR 8197 at commit
|
docs/ml-guide.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"objectiveHistory and metric" (surround code with backticks so docs apply correct styles to it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, metrics is not part of the public API
|
@feynmanliang ok, I've addressed your comments, anything else? |
|
retest this please |
|
Jenkins test this please |
|
Lgtm pending tests |
b79d780 to
5244459
Compare
5244459 to
1ab3d9c
Compare
|
Test build #40988 has finished for PR 8197 at commit
|
|
I think this should be part of the linear methods section, not part of the main ML guide. Once we have summary support in more model types, then we can mention it in the main ML guide. Could you please merge it with ml-linear-methods.md? Feel free to reorganize and stomp on existing examples as needed since that section is very basic right now. Thanks! |
docs/ml-guide.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this links the scala api doc, move this under the scala codetab and add another one for the java api doc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
👍 to @jkbradley 's suggestion; my apologies for not thinking of that and the misleading JIRA task description. I've updated the JIRA to reflect this. |
|
@MechCoder I am working on LinearRegressionSummary docs (SPARK-9905) and I think we should communicate about where these docs should go. What do you think about adding a |
|
@feynmanliang That organization sounds good, if the summary examples can make use of previous basic examples (and avoid copying code). Another option would be to intersperse text and code, as in [http://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means], but that will make sense only if clear headers can go within the codetabs text. |
Documentation cleanup in `ml-linear-methods`
|
I have merged your changes. Thanks! |
|
Cool LGTM, @jkbradley for final pass |
docs/ml-linear-methods.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFrame?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, yep that's right
|
Test build #41151 has finished for PR 8197 at commit
|
|
@feynmanliang Actually I'm not very convinced that coupling these with existing code is the way to go. We could have a But for now I think this should be okay. |
|
Having a What things do you think will be repeated? |
|
You are right. But I thought that the example code such as extracting the training loss, and common metrics would be repeated across some classification and regression models. We could maybe just have one complete example of a classification and regression model with the summaries (the training being one that is not already in the description page) and for the others which can instantiate the summary directly to show how else it can be used apart from the common stuff. [Just a suggestion and can be thought about later] |
|
Test build #41162 has finished for PR 8197 at commit
|
|
Okay. But he had to told something else in that PR discussion :p. I do
|
|
LGTM CC @jkbradley this is blocking SPARK-9905 so do you mind reviewing when you have a chance? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need to put lambda in, but if you do, then how about putting it outside of big parentheses or brackets to make the equation easier to read?
|
Done with review |
|
I'll go ahead and merge this and ask @feynmanliang to make any needed updates when he sends a PR for the linear regression summary docs. Thanks! |
|
merging with master and branch-1.5 |
User guide for LogisticRegression summaries Author: MechCoder <[email protected]> Author: Manoj Kumar <[email protected]> Author: Feynman Liang <[email protected]> Closes #8197 from MechCoder/log_summary_user_guide. (cherry picked from commit c94ecdf) Signed-off-by: Joseph K. Bradley <[email protected]>
* Adds user guide for `LinearRegressionSummary` * Fixes unresolved issues in #8197 CC jkbradley mengxr Author: Feynman Liang <[email protected]> Closes #8491 from feynmanliang/SPARK-9905. (cherry picked from commit af0e124) Signed-off-by: Xiangrui Meng <[email protected]>
* Adds user guide for `LinearRegressionSummary` * Fixes unresolved issues in #8197 CC jkbradley mengxr Author: Feynman Liang <[email protected]> Closes #8491 from feynmanliang/SPARK-9905.
User guide for LogisticRegression summaries