@WeichenXu123
Contributor

@WeichenXu123 WeichenXu123 commented Oct 11, 2016

What changes were proposed in this pull request?

Add 4 traits, using the following hierarchy:
LogisticRegressionSummary
LogisticRegressionTrainingSummary: LogisticRegressionSummary
BinaryLogisticRegressionSummary: LogisticRegressionSummary
BinaryLogisticRegressionTrainingSummary: LogisticRegressionTrainingSummary, BinaryLogisticRegressionSummary

Public methods such as def summary return only the trait types listed above.

Then implement 4 concrete classes:
LogisticRegressionSummaryImpl (multiclass case)
LogisticRegressionTrainingSummaryImpl (multiclass case)
BinaryLogisticRegressionSummaryImpl (binary case)
BinaryLogisticRegressionTrainingSummaryImpl (binary case)
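
For readers who want to see the shape of this hierarchy, here is a minimal, Spark-free sketch. The trait and class names come from the description above; the members (labelCol, objectiveHistory, areaUnderROC), the constructor parameters, and the way the Impl classes inherit from one another are assumptions made only for illustration, not the code in this PR.

```scala
// Simplified stand-in for the proposed hierarchy. The real summaries wrap
// DataFrames; here each one carries only a few illustrative fields.
object SummaryHierarchySketch {
  sealed trait LogisticRegressionSummary {
    def labelCol: String
  }

  sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary {
    def objectiveHistory: Array[Double]
  }

  sealed trait BinaryLogisticRegressionSummary extends LogisticRegressionSummary {
    def areaUnderROC: Double
  }

  // The "diamond": a binary training summary is both binary and a training summary.
  sealed trait BinaryLogisticRegressionTrainingSummary
    extends BinaryLogisticRegressionSummary with LogisticRegressionTrainingSummary

  // Concrete classes (inheritance between the Impl classes is an assumption).
  class LogisticRegressionSummaryImpl(val labelCol: String)
    extends LogisticRegressionSummary

  class LogisticRegressionTrainingSummaryImpl(
      label: String,
      val objectiveHistory: Array[Double])
    extends LogisticRegressionSummaryImpl(label)
    with LogisticRegressionTrainingSummary

  class BinaryLogisticRegressionSummaryImpl(label: String, val areaUnderROC: Double)
    extends LogisticRegressionSummaryImpl(label)
    with BinaryLogisticRegressionSummary

  class BinaryLogisticRegressionTrainingSummaryImpl(
      label: String,
      auc: Double,
      val objectiveHistory: Array[Double])
    extends BinaryLogisticRegressionSummaryImpl(label, auc)
    with BinaryLogisticRegressionTrainingSummary
}
```

Because public methods return only the trait types, callers never name an Impl class directly.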

How was this patch tested?

Existing tests & added tests.

@WeichenXu123
Contributor Author

@sethah Several details of this PR seem to need discussion; I would be pleased to hear your opinion. Thanks!

@SparkQA

SparkQA commented Oct 11, 2016

Test build #66748 has finished for PR 15435 at commit e93740e.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Contributor

sethah commented Oct 11, 2016

I'll try to take a look before too long. For now, I see there are no tests. Could you please add some, using the summary tests for binary classification as a guide? Thanks!

@SparkQA

SparkQA commented Oct 12, 2016

Test build #66804 has finished for PR 15435 at commit 805613c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 12, 2016

Test build #66805 has finished for PR 15435 at commit fdac2dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Contributor

sethah commented Oct 13, 2016

So I've been reading through some of the history with logistic regression summaries. There was a lot of discussion on how to design the abstractions for this, here and here.

I'm reposting some of the relevant snippets (I will comment on them in a follow up):

"We'll need to use traits to fix the multiple inheritance issue:"

sealed trait LogisticRegressionSummary
sealed trait LogisticRegressionTrainingSummary
class BinaryLogisticRegressionSummary extends LogisticRegressionSummary
class BinaryLogisticRegressionTrainingSummary extends BinaryLogisticRegressionSummary with LogisticRegressionTrainingSummary

"Are we planning to have a MulticlassLogisticRegressionSummary inheriting from LogisticRegressionSummary in the future because without that I'm unable to understand how using a trait would help since there is no access to the predictions dataframe."

"Yes, MulticlassLogisticRegressionSummary should be analogous to the binary version, with both inheriting from LogisticRegressionSummary."

...

"Synced with @jkbradley offline. Summary:

We should not require end users to perform any sort of downcasting in the stabilized API. This is OK for now since the API is still experimental.

Eventually we could provide two methods, a summary : LogisticRegressionSummary and a binarySummary : BinaryLogisticRegressionSummary which errors when called on a multiclass LRModel. This will be easy to implement because summary is returning the base LogisticRegressionSummary class so will not require any public API change."

@sethah
Contributor

sethah commented Oct 13, 2016

So, based on my interpretation of this and how this can actually work, we need to have:

sealed trait LogisticRegressionSummary
sealed trait LogisticRegressionTrainingSummary
class MulticlassLogisticRegressionSummary extends LogisticRegressionSummary
class MulticlassLogisticRegressionTrainingSummary extends MulticlassLogisticRegressionSummary with LogisticRegressionTrainingSummary
class BinaryLogisticRegressionSummary extends MulticlassLogisticRegressionSummary
class BinaryLogisticRegressionTrainingSummary extends BinaryLogisticRegressionSummary with LogisticRegressionTrainingSummary

Then, in LogisticRegressionModel we have:

def summary: LogisticRegressionTrainingSummary
def binarySummary: BinaryLogisticRegressionTrainingSummary = summary match {
  case b: BinaryLogisticRegressionTrainingSummary => b
  case _ => throw new Exception()
}

And we avoid downcasting in the summary case since MulticlassLogisticRegressionSummary only implements the methods defined in the trait. Otherwise, we would have to downcast to get access to those methods. Then, if the summary is binary, you can just call binarySummary. Anyway, I got this to compile, and if there is some other way, I'm not seeing it. I would really like to get some clarification from @jkbradley. Not sure if @feynmanliang is still involved with Spark.
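
The proposal above can be tried out without Spark. The following is a rough sketch under stated assumptions: the summary class names follow the snippet above, while ModelSketch (a hypothetical stand-in for LogisticRegressionModel), the members accuracy, totalIterations and areaUnderROC, and the exception message are invented for illustration.

```scala
// Spark-free sketch of "summary returns the base type, binarySummary narrows it".
object BinarySummarySketch {
  sealed trait LogisticRegressionSummary { def accuracy: Double }
  sealed trait LogisticRegressionTrainingSummary { def totalIterations: Int }

  class MulticlassLogisticRegressionSummary(val accuracy: Double)
    extends LogisticRegressionSummary

  class MulticlassLogisticRegressionTrainingSummary(acc: Double, val totalIterations: Int)
    extends MulticlassLogisticRegressionSummary(acc)
    with LogisticRegressionTrainingSummary

  class BinaryLogisticRegressionSummary(acc: Double, val areaUnderROC: Double)
    extends MulticlassLogisticRegressionSummary(acc)

  class BinaryLogisticRegressionTrainingSummary(
      acc: Double,
      auc: Double,
      val totalIterations: Int)
    extends BinaryLogisticRegressionSummary(acc, auc)
    with LogisticRegressionTrainingSummary

  // Hypothetical stand-in for LogisticRegressionModel.
  class ModelSketch(trainingSummary: LogisticRegressionTrainingSummary) {
    // Callers of summary never need to downcast for the shared metrics.
    def summary: LogisticRegressionTrainingSummary = trainingSummary

    // The single downcast lives inside the library, not in user code.
    def binarySummary: BinaryLogisticRegressionTrainingSummary = summary match {
      case b: BinaryLogisticRegressionTrainingSummary => b
      case _ => throw new RuntimeException("summary is not a binary summary")
    }
  }
}
```

A binary model can then call binarySummary.areaUnderROC directly, while the same call on a multiclass model fails fast instead of asking the user to cast.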

@WeichenXu123
Contributor Author

@sethah Good suggestion. Code updated, thanks!

@SparkQA

SparkQA commented Oct 15, 2016

Test build #67017 has finished for PR 15435 at commit 1bf5aa4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Non-double numeric data types other than LongType may also need to be tested. Any thoughts? @sethah

Contributor Author

@zhengruifeng Thanks for the careful review. Because the Summary class code uses labelValue.toDouble to do the cast, testing a single type (LongType) seems sufficient IMO. Is there a case where the LongType cast succeeds but another legal numeric type would fail?

Contributor

Well, from a completeness standpoint I agree that it's better to test all the types that it's intended to work for. However, since it's just calling cast under the hood, it does seem a bit redundant. I'm ok leaving it as is, but I don't feel strongly about it.

@sethah
Contributor

sethah commented Oct 25, 2016

re-ping @jkbradley Would be great to get your thoughts on the above discussion since you were involved in the original design

@WeichenXu123
Contributor Author

@sethah jkbradley does not seem to have been online recently; we could invite @yanboliang to give some advice.

@jkbradley
Member

jkbradley commented Oct 26, 2016

@sethah Thanks for pinging, and for bringing up those old design discussions (which are still the best options I know of). @WeichenXu123 sorry I've been slow to respond; I'm trying to keep up!

I'll take a look at the PR.

UPDATE: Regarding the class hierarchy above, the only one I'm not sure about is class BinaryLogisticRegressionSummary extends MulticlassLogisticRegressionSummary.

  • Option 1: We go this route. In that case, we should eliminate MulticlassLogisticRegressionSummary and just merge it into LogisticRegressionSummary.
  • Option 2: We have class BinaryLogisticRegressionSummary extends LogisticRegressionSummary instead. This could make sense in terms of separating out how people should think about binary & multiclass problems differently.

I'm unsure here. What do you think?

@sethah
Contributor

sethah commented Oct 26, 2016

@jkbradley Thanks for your input. I'm happy to review this, but wanted to get clarification before proceeding. I can take a look in the next couple of days.

Member

@jkbradley jkbradley left a comment

@sethah Sorry, didn't want to stomp on your review. I had started reviewing, but I'll just go ahead and submit these partial review comments. Thanks!
And thanks @WeichenXu123 for adding this.

Member

I'd actually just remove these 2 lines since they don't say anything useful.

Member

same here: remove these 2 lines since they don't say anything useful.

Member

Add Scala doc, especially saying that this will throw an exception when numClasses > 2.

Member

Don't use SparkException here; that's primarily for failures within Spark jobs AFAIK. I guess I'd use RuntimeException.

Member

Also, the current error doesn't really help the user. How about "Cannot create a binarySummary for a non-binary model (with numClasses = $numClasses). Use multinomialSummary instead."
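
To make the review suggestions concrete (Scaladoc noting the exception, RuntimeException rather than SparkException, and a message that tells the user what to do), here is a small sketch. The helper name requireBinary and its generic signature are hypothetical; only the message text and the choice of RuntimeException come from the comments above.

```scala
object BinarySummaryErrorSketch {
  /**
   * Returns `summary` unchanged when the model is binary.
   *
   * Throws a RuntimeException when numClasses > 2, since a binary summary
   * is not defined for a multiclass model.
   *
   * Hypothetical helper, used only to illustrate the suggested error message.
   */
  def requireBinary[S](numClasses: Int, summary: S): S = {
    if (numClasses > 2) {
      throw new RuntimeException(
        s"Cannot create a binarySummary for a non-binary model (with numClasses = $numClasses). " +
          "Use multinomialSummary instead.")
    }
    summary
  }
}
```

Including numClasses in the message lets the user see at a glance why the call failed.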

Member

Add doc here too

Member

This is the only place where you need the probability2predictionUDF. Rather than passing a UDF or the model, I'd like us to modify findSummaryModelAndProbabilityCol to also set predictionCol if needed.

Member

These since tags should all be 2.1.0.

Member

Do you need these 2 lines?

Member

Add scala doc for all of these

Member

fix indentation

@sethah
Contributor

sethah commented Oct 26, 2016

@jkbradley No problems at all, it's always better to have more reviewers.

@WeichenXu123
Contributor Author

@sethah @jkbradley

Now I am considering eliminating the probabilityToPredictionUDF.
There are some fiddly problems with the predictionCol and the summary class hierarchy.

Currently, the base interface LogisticRegressionSummary does not have the member predictionCol, which is needed in the MLOR summary.
If I add predictionCol to the base interface LogisticRegressionSummary, then BinaryLogisticRegressionTrainingSummary will also need to be modified, which seems to break API compatibility. Should we avoid that?

Also, should we make BinaryLogisticRegressionTrainingSummary a subclass of the MLOR summary? I would like to let @jkbradley decide.

@zhengruifeng
Contributor

zhengruifeng commented Oct 27, 2016

It seems that many metrics in MultinomialLogisticRegressionSummary are generic to other classification algos. So what about creating a new class MultiClassificationSummary and putting it in a new file? Just like what @yanboliang has done in #15555

@jkbradley
Member

jkbradley commented Oct 28, 2016

@WeichenXu123 About breaking APIs: I'm OK adding predictionCol to LogisticRegressionSummary, even though I agree it technically breaks a public API:

  • The traits are OK to change since they are sealed.
  • The subclasses inheriting from the traits are public, but they all have private constructors, so users cannot have extended them (except by using hacks).

So I don't see a way this could break a non-hack use case, but let me know if I'm not thinking of an edge case.
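
A small Spark-free sketch of the reasoning in the two bullets above: when the trait is sealed and the concrete class has a restricted constructor, code outside the library can neither implement the trait nor subclass the class, so adding a member such as predictionCol cannot leave a user-defined subtype unimplemented. All names here are hypothetical, and the private[...] qualifier is an assumption about how the constructors are restricted.

```scala
object SealedCompatSketch {
  // Sealed: only this file may extend the trait.
  sealed trait Summary {
    def labelCol: String
    // Newly added member. No user code can be missing an implementation,
    // because no user code can extend Summary in the first place.
    def predictionCol: String
  }

  // Restricted constructor: code outside SealedCompatSketch cannot
  // instantiate or subclass this class.
  class BinarySummary private[SealedCompatSketch] (
      val labelCol: String,
      val predictionCol: String)
    extends Summary

  // The library creates instances; users only consume them.
  def train(): Summary = new BinarySummary("label", "prediction")
}
```

Outside the object, both `new SealedCompatSketch.BinarySummary(...)` and `class Mine extends SealedCompatSketch.Summary` are rejected by the compiler, which is the "non-hack" boundary the comment refers to.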

@zhengruifeng I'm OK refactoring to create a MulticlassClassificationSummary (or MulticlassSummary?) now or later. Now does sound better. Let's keep it private[ml] for now though.

@WeichenXu123
Contributor Author

All right. I will create a new class MulticlassClassificationSummary; that looks better.
I will also make BinaryLogisticRegressionTrainingSummary a subclass of the MLOR summary, which looks more reasonable. If there is a problem with this hierarchy, let me know. @jkbradley

And thanks @zhengruifeng for the good suggestion~

@SparkQA

SparkQA commented Nov 7, 2016

Test build #68283 has finished for PR 15435 at commit 9c0e3fe.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 7, 2016

Test build #68285 has finished for PR 15435 at commit 3b459f9.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 18, 2016

Test build #68864 has finished for PR 15435 at commit f0523f9.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 18, 2016

Test build #68865 has finished for PR 15435 at commit 79c5dda.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 WeichenXu123 changed the title [SPARK-17139][ML] Add model summary for MultinomialLogisticRegression [WIP][SPARK-17139][ML] Add model summary for MultinomialLogisticRegression Nov 19, 2016
@WeichenXu123 WeichenXu123 changed the title [WIP][SPARK-17139][ML] Add model summary for MultinomialLogisticRegression [SPARK-17139][ML] Add model summary for MultinomialLogisticRegression Nov 19, 2016
@SparkQA

SparkQA commented Aug 21, 2017

Test build #80926 has finished for PR 15435 at commit d338a94.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented Aug 21, 2017

Test build #80928 has finished for PR 15435 at commit d338a94.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 21, 2017

Test build #80932 has finished for PR 15435 at commit b6cde56.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 22, 2017

Test build #80947 has finished for PR 15435 at commit 67c57e5.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented Aug 22, 2017

Test build #80954 has finished for PR 15435 at commit 67c57e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

/**
* Returns the sequence of labels in ascending order
Member

Clarify: "Returns the sequence of labels in ascending order. This order matches the order used in metrics which are specified as arrays over labels, e.g., truePositiveRateByLabel."
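
The ordering contract in the suggested doc text can be illustrated with a small, Spark-free sketch. The input representation (a sequence of (prediction, label) pairs) and the way labels are derived from the data are assumptions for illustration; only the names labels and truePositiveRateByLabel come from the comment above.

```scala
object LabelOrderSketch {
  // Distinct labels seen in the data, in ascending order.
  def labels(predictionAndLabels: Seq[(Double, Double)]): Array[Double] =
    predictionAndLabels.map(_._2).distinct.sorted.toArray

  // One true positive rate per label. Entry i corresponds to labels(...)(i),
  // so both arrays share the same ascending label order.
  def truePositiveRateByLabel(predictionAndLabels: Seq[(Double, Double)]): Array[Double] =
    labels(predictionAndLabels).map { label =>
      val rowsWithLabel = predictionAndLabels.filter(_._2 == label)
      rowsWithLabel.count(_._1 == label).toDouble / rowsWithLabel.size
    }
}
```

With this contract a caller can zip labels with truePositiveRateByLabel to get (label, rate) pairs without any extra lookup.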

Member

I see; I think this should work.

Member

@jkbradley jkbradley left a comment

Has there been discussion of removing the Experimental tags from the summary types? I'd prefer to leave them since there is still talk of further refactoring (e.g., generalizing out a ClassificationSummary across different models).

Other than that + my 1 small comment, this looks ready. @sethah and @yanboliang any more comments?

Thanks everyone!

@MLnick
Contributor

MLnick commented Aug 23, 2017

I agree keeping the @Experimental tags for now is best.

@SparkQA

SparkQA commented Aug 23, 2017

Test build #81022 has finished for PR 15435 at commit 0ebc943.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

jkbradley commented Aug 24, 2017

OK, then let's reinstate the Experimental tags. @WeichenXu123 could you please mark the 4 public summary traits Experimental?

@SparkQA

SparkQA commented Aug 25, 2017

Test build #81111 has finished for PR 15435 at commit 1395de2.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

Jenkins test this please

@SparkQA

SparkQA commented Aug 25, 2017

Test build #81119 has finished for PR 15435 at commit 1395de2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

Jenkins test this please

@SparkQA

SparkQA commented Aug 25, 2017

Test build #81124 has finished for PR 15435 at commit 1395de2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

LGTM
Merging with master
Thanks a lot @WeichenXu123 and everyone who reviewed!

@asfgit asfgit closed this in c7270a4 Aug 28, 2017
@WeichenXu123 WeichenXu123 deleted the mlor_summary branch January 26, 2018 18:58

8 participants