@MechCoder
Contributor

User guide for LogisticRegression summaries

docs/ml-guide.md Outdated
Contributor Author

@feynmanliang Is there a shorthand to do this directly in Spark SQL? Once I understand that, I can update the Java example as well.

Contributor

You might be able to get it done with an aggregation (e.g. max), but I haven't tried it myself. In any case, I think just keeping everything up to L868 is sufficient, since L870-872 don't really show how to use the summary feature anyway.

Contributor Author

I had thought the user could choose the best threshold and re-run LogisticRegression with it.

In the first run, the input would be a random sample of the dataset, and in the second run it would be the entire dataset. Do you still think it would not be useful?

Contributor Author

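The closest I could come up with is:

// fMeasure has columns (threshold, F-Measure), both Double, so getDouble(0) on the filtered row is the threshold
val maxFMeasure = fMeasure.select(max(fMeasure("F-Measure"))).collect()(0).getDouble(0)
val threshold = fMeasure.filter(fMeasure("F-Measure") >= maxFMeasure).collect()(0).getDouble(0)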
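
For anyone who wants to try the aggregation approach from this comment locally, here is a minimal self-contained sketch. It assumes a Spark release that provides `SparkSession` (this thread predates that API). The `BestThresholdSketch` object, the `bestThreshold` helper, and the toy rows are hypothetical names and made-up data. The `F-Measure` column name comes from the snippet above, and the `threshold` column name is assumed to match what the summary returns.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max}

object BestThresholdSketch {
  // Returns the threshold from the row that carries the largest F-measure.
  // Assumes columns named "threshold" and "F-Measure", both of type Double.
  def bestThreshold(fMeasure: DataFrame): Double = {
    val maxFMeasure = fMeasure.agg(max(col("F-Measure"))).head().getDouble(0)
    fMeasure
      .where(col("F-Measure") === maxFMeasure)
      .select("threshold")
      .head()
      .getDouble(0)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("best-threshold-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up (threshold, F-measure) pairs standing in for the summary output.
    val toy = Seq((0.2, 0.61), (0.4, 0.74), (0.6, 0.70), (0.8, 0.52))
      .toDF("threshold", "F-Measure")

    println(s"best threshold = ${bestThreshold(toy)}")
    spark.stop()
  }
}
```

Selecting the `threshold` column explicitly avoids relying on column order, which is the main difference from the positional `collect()(0)` form above. With a fitted model, the resulting value could then be passed to `setThreshold` before re-running, which is the workflow suggested earlier in this thread.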

Contributor

Nice!

I think what you described is useful, but is outside the scope of LogisticRegressionSummary. L869-L872 don't demonstrate any of the functionality these docs are intended to describe, which is why I propose we remove it. What do you think?

Contributor Author

OK. I wouldn't complain because it makes my job easier.

@SparkQA

SparkQA commented Aug 14, 2015

Test build #40866 has finished for PR 8197 at commit 487b361.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

docs/ml-guide.md Outdated
Contributor

"toyData" -> "toy data"

Contributor Author

The Scala / Java effect :P

@MechCoder MechCoder force-pushed the log_summary_user_guide branch from 7cc3f58 to 56cb35b Compare August 15, 2015 15:42
@SparkQA

SparkQA commented Aug 15, 2015

Test build #40966 has finished for PR 8197 at commit 56cb35b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

docs/ml-guide.md Outdated
Contributor

"objectiveHistory and metric" (surround code with backticks so docs apply correct styles to it)

Contributor

Actually, metrics is not part of the public API

@MechCoder
Contributor Author

@feynmanliang ok, I've addressed your comments, anything else?

@MechCoder
Contributor Author

retest this please

@feynmanliang
Contributor

Jenkins test this please

@feynmanliang
Contributor

Lgtm pending tests

@mengxr

@MechCoder MechCoder force-pushed the log_summary_user_guide branch from b79d780 to 5244459 Compare August 16, 2015 10:46
@MechCoder MechCoder force-pushed the log_summary_user_guide branch from 5244459 to 1ab3d9c Compare August 16, 2015 10:47
@SparkQA

SparkQA commented Aug 16, 2015

Test build #40988 has finished for PR 8197 at commit 5244459.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

I think this should be part of the linear methods section, not part of the main ML guide. Once we have summary support in more model types, then we can mention it in the main ML guide. Could you please merge it with ml-linear-methods.md? Feel free to reorganize and stomp on existing examples as needed since that section is very basic right now. Thanks!

docs/ml-guide.md Outdated
Contributor

Since this links to the Scala API doc, move it under the Scala codetab and add another link for the Java API doc.

@feynmanliang
Contributor

👍 to @jkbradley's suggestion; my apologies for not thinking of that and for the misleading JIRA task description. I've updated the JIRA to reflect this.

@feynmanliang
Contributor

@MechCoder I am working on LinearRegressionSummary docs (SPARK-9905) and I think we should communicate about where these docs should go. What do you think about adding a ### Model Summaries section with Linear Regression and Logistic Regression subsections right before the ### Optimization section in ml-linear-methods?

@jkbradley
Member

@feynmanliang That organization sounds good, if the summary examples can make use of previous basic examples (and avoid copying code). Another option would be to intersperse text and code, as in [http://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means], but that will make sense only if clear headers can go within the codetabs text.

Documentation cleanup in `ml-linear-methods`
@MechCoder
Contributor Author

I have merged your changes. Thanks!

@feynmanliang
Contributor

Cool LGTM, @jkbradley for final pass

Contributor Author

DataFrame?

Contributor

Whoops, yep that's right

@SparkQA

SparkQA commented Aug 18, 2015

Test build #41151 has finished for PR 8197 at commit 83d229f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MechCoder
Contributor Author

@feynmanliang Actually I'm not very convinced that coupling these with existing code is the way to go. We could have a docs/ml-summary.md in the future with subsections about each summary with an example. This would prevent repetition of certain common descriptions across all summaries.

But for now I think this should be okay.

@feynmanliang
Contributor

Having an ml-summary.md would lead to quite a bit of repeated documentation; we would have to supply model training example code both on the model's description page and in ml-summary, since the summary itself usually cannot be directly instantiated.

What things do you think will be repeated?

@MechCoder
Contributor Author

You are right.

But I thought that example code, such as extracting the training loss and common metrics, would be repeated across several classification and regression models.

We could maybe have just one complete example each for a classification model and a regression model with their summaries (using a training example that is not already on the description page). For the other models, where the summary can be instantiated directly, we would only show how else it can be used beyond the common functionality.

[Just a suggestion and can be thought about later]

@SparkQA

SparkQA commented Aug 18, 2015

Test build #41162 has finished for PR 8197 at commit 7bf922c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MechCoder
Contributor Author

Okay. But he had said something else in that PR discussion :p. I do agree that doing model.binarySummary is much neater than model.asInstanceOf[]…
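
To make the two access styles mentioned in this comment concrete, here is a rough sketch that fits a tiny model and reads its summary both ways. It assumes a Spark release where `LogisticRegressionModel` exposes a `binarySummary` accessor (2.3 or later, as far as I know; it was only being discussed at the time of this thread) and where `SparkSession` and `org.apache.spark.ml.linalg` are available. The `SummaryAccessSketch` object, the `trainToyModel` helper, and the four training rows are hypothetical and exist only for illustration.

```scala
import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary, LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object SummaryAccessSketch {
  // Fits a small model on made-up data so that a training summary exists.
  def trainToyModel(spark: SparkSession): LogisticRegressionModel = {
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")
    new LogisticRegression().setMaxIter(10).setRegParam(0.01).fit(training)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("summary-access-sketch")
      .getOrCreate()
    val model = trainToyModel(spark)

    // Objective value per iteration, recorded during training.
    println(model.summary.objectiveHistory.mkString(", "))

    // Style 1: cast the general summary to the binary-specific type.
    val viaCast = model.summary.asInstanceOf[BinaryLogisticRegressionSummary]
    // Style 2: typed accessor (assumed to exist in the Spark version in use).
    val viaAccessor = model.binarySummary

    println(s"areaUnderROC (cast) = ${viaCast.areaUnderROC}")
    println(s"areaUnderROC (accessor) = ${viaAccessor.areaUnderROC}")
    spark.stop()
  }
}
```

Both styles read the same underlying summary, so the two printed values should agree; the accessor just moves the cast out of user code.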

@feynmanliang
Contributor

LGTM

CC @jkbradley this is blocking SPARK-9905 so do you mind reviewing when you have a chance? Thanks!

Member

I don't think you need to put lambda in, but if you do, then how about putting it outside of big parentheses or brackets to make the equation easier to read?

@jkbradley
Member

Done with review

@jkbradley
Member

I'll go ahead and merge this and ask @feynmanliang to make any needed updates when he sends a PR for the linear regression summary docs. Thanks!

@jkbradley
Member

merging with master and branch-1.5

@asfgit asfgit closed this in c94ecdf Aug 27, 2015
asfgit pushed a commit that referenced this pull request Aug 27, 2015
User guide for LogisticRegression summaries

Author: MechCoder <[email protected]>
Author: Manoj Kumar <[email protected]>
Author: Feynman Liang <[email protected]>

Closes #8197 from MechCoder/log_summary_user_guide.

(cherry picked from commit c94ecdf)
Signed-off-by: Joseph K. Bradley <[email protected]>
@MechCoder MechCoder deleted the log_summary_user_guide branch August 27, 2015 23:35
asfgit pushed a commit that referenced this pull request Aug 28, 2015
* Adds user guide for `LinearRegressionSummary`
* Fixes unresolved issues in  #8197

CC jkbradley mengxr

Author: Feynman Liang <[email protected]>

Closes #8491 from feynmanliang/SPARK-9905.

(cherry picked from commit af0e124)
Signed-off-by: Xiangrui Meng <[email protected]>
asfgit pushed a commit that referenced this pull request Aug 28, 2015
* Adds user guide for `LinearRegressionSummary`
* Fixes unresolved issues in  #8197

CC jkbradley mengxr

Author: Feynman Liang <[email protected]>

Closes #8491 from feynmanliang/SPARK-9905.