[SPARK-17139][ML] Add model summary for MultinomialLogisticRegression #15435

WeichenXu123 · 2016-10-11T16:55:07Z

What changes were proposed in this pull request?

Add 4 traits, using the following hierarchy:
LogisticRegressionSummary
LogisticRegressionTrainingSummary: LogisticRegressionSummary
BinaryLogisticRegressionSummary: LogisticRegressionSummary
BinaryLogisticRegressionTrainingSummary: LogisticRegressionTrainingSummary, BinaryLogisticRegressionSummary

Public methods such as def summary return only the trait types listed above.

Then implement 4 concrete classes:
LogisticRegressionSummaryImpl (multiclass case)
LogisticRegressionTrainingSummaryImpl (multiclass case)
BinaryLogisticRegressionSummaryImpl (binary case)
BinaryLogisticRegressionTrainingSummaryImpl (binary case)
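
The hierarchy described above could be laid out roughly as follows. This is a minimal, Spark-free sketch rather than the code in the patch: the wrapper object SummaryHierarchySketch, the numClasses member, and the two factory methods are assumptions added for illustration, while objectiveHistory and areaUnderROC stand in for the training-only and binary-only metrics.

```scala
// Illustrative sketch only: no Spark dependency, and the members are placeholders.
object SummaryHierarchySketch {
  // Base summary: available for any number of classes.
  sealed trait LogisticRegressionSummary {
    def numClasses: Int // hypothetical member, used here only to tell the cases apart
  }

  // Training-only additions.
  sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary {
    def objectiveHistory: Array[Double]
  }

  // Binary-only additions.
  sealed trait BinaryLogisticRegressionSummary extends LogisticRegressionSummary {
    def areaUnderROC: Double
  }

  // Binary training summary mixes in both specializations.
  sealed trait BinaryLogisticRegressionTrainingSummary
    extends LogisticRegressionTrainingSummary with BinaryLogisticRegressionSummary

  // Concrete classes stay private; callers only ever see the traits.
  private class LogisticRegressionSummaryImpl(val numClasses: Int)
    extends LogisticRegressionSummary

  private class LogisticRegressionTrainingSummaryImpl(
      k: Int,
      val objectiveHistory: Array[Double])
    extends LogisticRegressionSummaryImpl(k) with LogisticRegressionTrainingSummary

  private class BinaryLogisticRegressionSummaryImpl(val areaUnderROC: Double)
    extends LogisticRegressionSummaryImpl(2) with BinaryLogisticRegressionSummary

  private class BinaryLogisticRegressionTrainingSummaryImpl(
      auc: Double,
      val objectiveHistory: Array[Double])
    extends BinaryLogisticRegressionSummaryImpl(auc)
    with BinaryLogisticRegressionTrainingSummary

  // Hypothetical factories: public signatures mention trait types only.
  def multiclassTrainingSummary(
      numClasses: Int,
      objectiveHistory: Array[Double]): LogisticRegressionTrainingSummary =
    new LogisticRegressionTrainingSummaryImpl(numClasses, objectiveHistory)

  def binaryTrainingSummary(
      areaUnderROC: Double,
      objectiveHistory: Array[Double]): BinaryLogisticRegressionTrainingSummary =
    new BinaryLogisticRegressionTrainingSummaryImpl(areaUnderROC, objectiveHistory)
}
```

Because the traits carry the whole public surface, the Impl classes can be reshuffled later without changing any signature that users compile against.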

How was this patch tested?

Existing tests & added tests.

WeichenXu123 · 2016-10-11T16:57:14Z

@sethah Several details of this PR seem to have been discussed before. I would be glad to hear your opinion, thanks!

SparkQA · 2016-10-11T17:11:51Z

Test build #66748 has finished for PR 15435 at commit e93740e.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-10-11T17:56:32Z

I'll try to take a look before too long. For now, I see there are no tests, could you please add tests, using the summary tests for binary classification as a guide? Thanks!

SparkQA · 2016-10-12T07:34:34Z

Test build #66804 has finished for PR 15435 at commit 805613c.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-12T09:24:24Z

Test build #66805 has finished for PR 15435 at commit fdac2dc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-10-13T03:44:07Z

So I've been reading through some of the history with logistic regression summaries. There was a lot of discussion on how to design the abstractions for this, here and here.

I'm reposting some of the relevant snippets (I will comment on them in a follow up):

"We'll need to use traits to fix the multiple inheritance issue:"

sealed trait LogisticRegressionSummary
sealed trait LogisticRegressionTrainingSummary
class BinaryLogisticRegressionSummary extends LogisticRegressionSummary
class BinaryLogisticRegressionTrainingSummary extends BinaryLogisticRegressionSummary with LogisticRegressionTrainingSummary

"Are we planning to have a MulticlassLogisticRegressionSummary inheriting from LogisticRegressionSummary in the future because without that I'm unable to understand how using a trait would help since there is no access to the predictions dataframe."

"Yes, MulticlassLogisticRegressionSummary should be analogous to the binary version, with both inheriting from LogisticRegressionSummary."

...

"Synced with @jkbradley offline. Summary:

We should not require end users to perform any sort of downcasting in the stabilized API. This is OK for now since the API is still experimental.

Eventually we could provide two methods, a summary : LogisticRegressionSummary and a binarySummary : BinaryLogisticRegressionSummary which errors when called on a multiclass LRModel. This will be easy to implement because summary is returning the base LogisticRegressionSummary class so will not require any public API change."

sethah · 2016-10-13T03:53:51Z

So, based on my interpretation of this and how this can actually work, we need to have:

sealed trait LogisticRegressionSummary
sealed trait LogisticRegressionTrainingSummary
class MulticlassLogisticRegressionSummary extends LogisticRegressionSummary
class MulticlassLogisticRegressionTrainingSummary extends MulticlassLogisticRegressionSummary with LogisticRegressionTrainingSummary
class BinaryLogisticRegressionSummary extends MulticlassLogisticRegressionSummary
class BinaryLogisticRegressionTrainingSummary extends BinaryLogisticRegressionSummary with LogisticRegressionTrainingSummary

Then, in LogisticRegressionModel we have:

def summary: LogisticRegressionTrainingSummary
def binarySummary: BinaryLogisticRegressionTrainingSummary = summary match {
  case b: BinaryLogisticRegressionTrainingSummary => b
  case _ => throw new Exception()
}

And we avoid downcasting in the summary case since MulticlassLogisticRegressionSummary only implements the methods defined in the trait. Otherwise, we would have to downcast to get access to those methods. Then if the summary is binary, you can just call binarySummary. Anyway, I got this to compile, and if there is some other way, I'm not seeing it. Would really like to get some clarification from @jkbradley. Not sure if @feynmanliang is still involved with Spark.

WeichenXu123 · 2016-10-15T14:56:03Z

@sethah Good suggestion. Code updated, thanks!

SparkQA · 2016-10-15T15:54:53Z

Test build #67017 has finished for PR 15435 at commit 1bf5aa4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2016-10-21T12:16:15Z

mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala

Non-double numeric data types other than LongType may also need to be tested. Any thoughts? @sethah

@zhengruifeng Thanks for the careful review. Because the Summary class code uses labelValue.toDouble for the cast, I think testing with a single type, LongType, is sufficient. Is there any case where the LongType cast succeeds but another legal numeric type would fail?

Well, from a completeness standpoint I agree that it's better to test all the types that it's intended to work for. However, since it's just calling cast under the hood, it does seem a bit redundant. I'm ok leaving it as is, but I don't feel strongly about it.

sethah · 2016-10-25T22:37:50Z

re-ping @jkbradley Would be great to get your thoughts on the above discussion since you were involved in the original design

WeichenXu123 · 2016-10-26T02:55:38Z

@sethah jkbradley does not seem to have been online recently. We could invite @yanboliang to give some advice.

jkbradley · 2016-10-26T20:52:14Z

@sethah Thanks for pinging, and for bringing up those old design discussions (which are still the best options I know of). @WeichenXu123 sorry I've been slow to respond; I'm trying to keep up!

I'll take a look at the PR.

UPDATE: Regarding the class hierarchy above, the only one I'm not sure about is class BinaryLogisticRegressionSummary extends MulticlassLogisticRegressionSummary.

Option 1: We go this route. In that case, we should eliminate MulticlassLogisticRegressionSummary and just merge it into LogisticRegressionSummary.
Option 2: We have class BinaryLogisticRegressionSummary extends LogisticRegressionSummary instead. This could make sense in terms of separating out how people should think about binary & multiclass problems differently.

I'm unsure here. What do you think?

sethah · 2016-10-26T21:10:09Z

@jkbradley Thanks for your input. I'm happy to review this, but wanted to get clarification before proceeding. I can take a look in the next couple of days.

jkbradley

@sethah Sorry, didn't want to stomp on your review. I had started reviewing, but I'll just go ahead and submit these partial review comments. Thanks!
And thanks @WeichenXu123 for adding this.

jkbradley · 2016-10-26T20:55:05Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

I'd actually just remove these 2 lines since they don't say anything useful.

jkbradley · 2016-10-26T20:55:16Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

same here: remove these 2 lines since they don't say anything useful.

jkbradley · 2016-10-26T20:56:07Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

Add Scala doc, especially saying that this will throw an exception when numClasses > 2.

jkbradley · 2016-10-26T20:58:33Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

Don't use SparkException here; that's primarily for failures within Spark jobs AFAIK. I guess I'd use RuntimeException.

Also, the current error doesn't really help the user. How about "Cannot create a binarySummary for a non-binary model (with numClasses = $numClasses). Use multinomialSummary instead."
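
A minimal sketch of what this suggestion might look like, assuming a simplified stand-in for LogisticRegressionModel. The Model class, the two private Impl classes, and the wrapper object are hypothetical, and the message text is adapted from the wording proposed above (it points to summary because this sketch has no multinomialSummary method).

```scala
// Illustrative sketch only: Model is a stand-in, not Spark's LogisticRegressionModel.
object BinarySummarySketch {
  sealed trait LogisticRegressionTrainingSummary
  sealed trait BinaryLogisticRegressionTrainingSummary
    extends LogisticRegressionTrainingSummary

  private class MulticlassImpl extends LogisticRegressionTrainingSummary
  private class BinaryImpl extends BinaryLogisticRegressionTrainingSummary

  class Model(val numClasses: Int) {
    // Always available, typed as the base trait.
    def summary: LogisticRegressionTrainingSummary =
      if (numClasses <= 2) new BinaryImpl else new MulticlassImpl

    /** Binary-specific view of the summary; throws RuntimeException when numClasses > 2. */
    def binarySummary: BinaryLogisticRegressionTrainingSummary = summary match {
      case b: BinaryLogisticRegressionTrainingSummary => b
      case _ =>
        // RuntimeException rather than SparkException: this is a usage error, not a job failure.
        throw new RuntimeException(
          "Cannot create a binarySummary for a non-binary model " +
            s"(with numClasses = $numClasses). Use summary instead.")
    }
  }
}
```

Including numClasses in the message tells the user why the call failed without a debugger, and naming the alternative method tells them what to do next.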

jkbradley · 2016-10-26T20:58:41Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

Add doc here too

jkbradley · 2016-10-26T21:05:54Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

This is the only place where you need the probability2predictionUDF. Rather than passing a UDF or the model, I'd like us to modify findSummaryModelAndProbabilityCol to also set predictionCol if needed.

jkbradley · 2016-10-26T21:06:14Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

These since tags should all be 2.1.0.

jkbradley · 2016-10-26T21:06:50Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

Do you need these 2 lines?

jkbradley · 2016-10-26T21:07:42Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

Add scala doc for all of these

jkbradley · 2016-10-26T21:08:19Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

fix indentation

sethah · 2016-10-26T21:23:17Z

@jkbradley No problems at all, it's always better to have more reviewers.

WeichenXu123 · 2016-10-27T02:35:41Z

@sethah @jkbradley

I am now considering eliminating the probabilityToPredictionUDF.
There are some tricky problems with predictionCol and the summary class hierarchy.

Currently, the base interface LogisticRegressionSummary does not have the member predictionCol, which the MLOR summary needs.
If I add predictionCol to the base interface LogisticRegressionSummary, then BinaryLogisticRegressionTrainingSummary will also need to be modified, which seems to break API compatibility. Should we avoid that?

Also, should we make BinaryLogisticRegressionTrainingSummary a subclass of the MLOR summary? I would like to let @jkbradley decide.

zhengruifeng · 2016-10-27T05:47:46Z

It seems that many metrics in MultinomialLogisticRegressionSummary are generic to other classification algorithms. So what about creating a new class MultiClassificationSummary and putting it in a new file? Just like what @yanboliang has done in #15555

jkbradley · 2016-10-28T00:10:55Z

@WeichenXu123 About breaking APIs: I'm OK adding predictionCol to LogisticRegressionSummary, even though I agree it technically breaks a public API:

The traits are OK to change since they are sealed.
The subclasses inheriting from the traits are public, but they all have private constructors, so users cannot have extended them (except by using hacks).

So I don't see a way this could break a non-hack use case, but let me know if I'm not thinking of an edge case.
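
The argument above can be illustrated with a small sketch. It assumes a simplified layout in which everything lives in one wrapper object (SealedApiSketch, hypothetical); the summarize factory is also hypothetical, and the real classes take more parameters than the two column names shown here.

```scala
// Illustrative sketch only: shows why a sealed trait plus a private constructor
// leaves no supported way for user code to implement or extend the summary types.
object SealedApiSketch {
  // Sealed: only code in this source file may extend the trait.
  sealed trait LogisticRegressionSummary {
    def probabilityCol: String
    def predictionCol: String // the member being added in this discussion
  }

  // The class is public, but its constructor is not, so outside code can
  // neither instantiate it nor subclass it (a subclass must call a constructor).
  class BinaryLogisticRegressionSummary private[SealedApiSketch] (
      val probabilityCol: String,
      val predictionCol: String) extends LogisticRegressionSummary

  // Hypothetical factory standing in for the model code that builds summaries.
  def summarize(probabilityCol: String, predictionCol: String): LogisticRegressionSummary =
    new BinaryLogisticRegressionSummary(probabilityCol, predictionCol)
}
```

Outside SealedApiSketch, new SealedApiSketch.BinaryLogisticRegressionSummary("a", "b") would not compile, which is the sense in which only hack-based usages could be affected. A binary compatibility checker may still flag the added member and need an explicit exclusion.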

@zhengruifeng I'm OK refactoring to create a MulticlassClassificationSummary (or MulticlassSummary?) now or later. Now does sound better. Let's keep it private[ml] for now though.

WeichenXu123 · 2016-10-28T01:27:08Z

All right. I will create a new class MulticlassClassificationSummary; it looks better.
I will also make BinaryLogisticRegressionTrainingSummary a subclass of the MLOR summary, which looks more reasonable. If there is any problem with this hierarchy, let me know. @jkbradley

And thanks @zhengruifeng for the good suggestion~

SparkQA · 2016-11-07T16:29:07Z

Test build #68283 has finished for PR 15435 at commit 9c0e3fe.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-07T17:07:20Z

Test build #68285 has finished for PR 15435 at commit 3b459f9.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-18T16:53:39Z

Test build #68864 has finished for PR 15435 at commit f0523f9.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-18T17:19:44Z

Test build #68865 has finished for PR 15435 at commit 79c5dda.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-08-21T11:19:40Z

Test build #80926 has finished for PR 15435 at commit d338a94.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123 · 2017-08-21T13:27:33Z

Jenkins, test this please.

SparkQA · 2017-08-21T13:40:02Z

Test build #80928 has finished for PR 15435 at commit d338a94.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-08-21T15:18:27Z

Test build #80932 has finished for PR 15435 at commit b6cde56.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-08-22T02:42:31Z

Test build #80947 has finished for PR 15435 at commit 67c57e5.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123 · 2017-08-22T03:44:07Z

Jenkins, test this please.

SparkQA · 2017-08-22T06:51:52Z

Test build #80954 has finished for PR 15435 at commit 67c57e5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2017-08-22T23:42:27Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

+  }
+
+  /**
+   * Returns the sequence of labels in ascending order


Clarify: "Returns the sequence of labels in ascending order. This order matches the order used in metrics which are specified as arrays over labels, e.g., truePositiveRateByLabel."
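
To make the ordering contract concrete, here is a small Spark-free sketch. It assumes the input is a local sequence of (prediction, label) pairs; the object name LabelOrderSketch and both function signatures are illustrative and not the summary's actual implementation.

```scala
// Illustrative sketch only: plain collections instead of a DataFrame.
object LabelOrderSketch {
  /** Distinct labels in ascending order. */
  def labels(predictionAndLabels: Seq[(Double, Double)]): Array[Double] =
    predictionAndLabels.map(_._2).distinct.sorted.toArray

  /** True positive rate (recall) per label; index i corresponds to labels(...)(i). */
  def truePositiveRateByLabel(predictionAndLabels: Seq[(Double, Double)]): Array[Double] =
    labels(predictionAndLabels).map { label =>
      // Rows whose true label is `label`; never empty, since `label` came from the data.
      val withLabel = predictionAndLabels.filter(_._2 == label)
      withLabel.count(_._1 == label).toDouble / withLabel.size
    }
}
```

With this contract a caller can zip the two arrays to pair every label with its metric, instead of assuming that labels are exactly 0, 1, 2, and so on.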

jkbradley · 2017-08-22T23:51:27Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

I see; I think this should work.

jkbradley

Has there been discussion of removing the Experimental tags from the summary types? I'd prefer to leave them since there is still talk of further refactoring (e.g., generalizing out a ClassificationSummary across different models).

Other than that + my 1 small comment, this looks ready. @sethah and @yanboliang any more comments?

Thanks everyone!

MLnick · 2017-08-23T07:13:30Z

I agree keeping the @Experimental tags for now is best.

SparkQA · 2017-08-23T10:09:42Z

Test build #81022 has finished for PR 15435 at commit 0ebc943.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2017-08-24T16:25:49Z

OK, then let's reinstate the Experimental tags. @WeichenXu123 could you please mark the 4 public summary traits Experimental?

SparkQA · 2017-08-25T04:22:02Z

Test build #81111 has finished for PR 15435 at commit 1395de2.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123 · 2017-08-25T06:09:18Z

Jenkins test this please

SparkQA · 2017-08-25T07:04:48Z

Test build #81119 has finished for PR 15435 at commit 1395de2.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123 · 2017-08-25T07:51:34Z

Jenkins test this please

SparkQA · 2017-08-25T10:58:21Z

Test build #81124 has finished for PR 15435 at commit 1395de2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2017-08-28T20:29:45Z

LGTM
Merging with master
Thanks a lot @WeichenXu123 and everyone who reviewed!

zhengruifeng reviewed Oct 21, 2016

jkbradley reviewed Oct 26, 2016

WeichenXu123 force-pushed the mlor_summary branch from 1bf5aa4 to 9c0e3fe Compare November 7, 2016 16:17

WeichenXu123 force-pushed the mlor_summary branch from 9c0e3fe to 3b459f9 Compare November 7, 2016 16:47

WeichenXu123 force-pushed the mlor_summary branch from 3b459f9 to f0523f9 Compare November 18, 2016 16:49

WeichenXu123 force-pushed the mlor_summary branch from f0523f9 to 79c5dda Compare November 18, 2016 16:57

WeichenXu123 changed the title ~~[SPARK-17139][ML] Add model summary for MultinomialLogisticRegression~~ [WIP][SPARK-17139][ML] Add model summary for MultinomialLogisticRegression Nov 19, 2016

WeichenXu123 changed the title ~~[WIP][SPARK-17139][ML] Add model summary for MultinomialLogisticRegression~~ [SPARK-17139][ML] Add model summary for MultinomialLogisticRegression Nov 19, 2016

WeichenXu123 added 2 commits August 21, 2017 18:21

update

2bce87b

update

ce95023

WeichenXu123 force-pushed the mlor_summary branch from 46d49c9 to d338a94 Compare August 21, 2017 10:22

update

b6cde56

WeichenXu123 force-pushed the mlor_summary branch from d338a94 to b6cde56 Compare August 21, 2017 15:05

update mima

67c57e5

jkbradley reviewed Aug 22, 2017

tiny update comment

0ebc943

add experimental tag

1395de2

asfgit closed this in c7270a4 Aug 28, 2017

WeichenXu123 deleted the mlor_summary branch January 26, 2018 18:58
