[SPARK-29489][ML][PySpark] ml.evaluation support log-loss #26135
- private val confusions = predictionAndLabels.map {
+ private lazy val confusions = predictionAndLabels.map {
If the metricName==logloss, then the confusion matrix is not needed, so I make this computation lazy.
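A minimal sketch of the effect, using plain Scala collections instead of Spark RDDs. The `Metrics` class, the `aggregations` counter and the `count` method are hypothetical names for illustration only, not the actual `MulticlassMetrics` code. It shows that a `lazy val` is not evaluated until a metric that needs it is requested, and that it is then evaluated only once.

```scala
object LazyConfusionSketch {
  // Hypothetical counter: records how often the expensive aggregation runs.
  var aggregations = 0

  class Metrics(predictionAndLabels: Seq[(Double, Double)]) {
    // Deferred: computed on first access only, then cached.
    private lazy val confusions: Map[(Double, Double), Long] = {
      aggregations += 1
      predictionAndLabels
        .groupBy { case (prediction, label) => (label, prediction) }
        .map { case (key, rows) => key -> rows.size.toLong }
    }

    // Needs the confusion counts, so it forces the lazy val.
    def accuracy: Double = {
      val correct = confusions.filter { case ((l, p), _) => l == p }.values.sum
      correct.toDouble / predictionAndLabels.size
    }

    // Does not touch `confusions`, so the aggregation is never triggered.
    def count: Int = predictionAndLabels.size
  }
}
```

A metric such as log-loss, which reads probabilities rather than confusion counts, would behave like `count` here and skip the aggregation entirely.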
(prediction, label, 1.0)
case other =>
throw new IllegalArgumentException(s"Expected Row of tuples, got $other")
this(predictionAndLabels.rdd.map { r =>
Pattern matching on the Row will not work in PySpark, so I have to use r.get instead.
MultilabelMetrics also handles DataFrames this way.
Test build #112149 has finished for PR 26135 at commit
Test build #112150 has finished for PR 26135 at commit
Test build #112155 has finished for PR 26135 at commit
Test build #112163 has finished for PR 26135 at commit
def logLoss(eps: Double = 1e-15): Double = {
  require(eps > 0 && eps < 0.5, s"eps must be in range (0, 0.5), but got $eps")
  val loss1 = - math.log(eps)
  val loss2 = - math.log(1 - eps)
- math.log1p(-eps)? because eps is going to be very small
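To make the suggestion concrete, here is a rough sketch of a clipped, weighted log-loss over plain Scala collections. The row layout `(label index, class probabilities, weight)` and the names `LogLossSketch`, `lossAtLowerClip` and `lossAtUpperClip` are assumptions for illustration; this is not the code merged in the PR. The point is that `1 - eps` loses precision for tiny `eps` (for `eps = 1e-20` it rounds to exactly `1.0`, so `-math.log(1 - eps)` evaluates to zero), while `-math.log1p(-eps)` keeps the small positive value.

```scala
object LogLossSketch {
  /** Weighted multiclass log-loss with probabilities clipped to [eps, 1 - eps]. */
  def logLoss(
      rows: Seq[(Int, Array[Double], Double)], // assumed layout: (label index, probabilities, weight)
      eps: Double = 1e-15): Double = {
    require(eps > 0 && eps < 0.5, s"eps must be in range (0, 0.5), but got $eps")
    // Loss assigned when the true-class probability falls below eps.
    val lossAtLowerClip = -math.log(eps)
    // Loss assigned when it exceeds 1 - eps; log1p avoids forming 1 - eps inside the log.
    val lossAtUpperClip = -math.log1p(-eps)

    var lossSum = 0.0
    var weightSum = 0.0
    rows.foreach { case (label, probabilities, weight) =>
      val p = probabilities(label)
      val loss =
        if (p < eps) lossAtLowerClip
        else if (p > 1 - eps) lossAtUpperClip
        else -math.log(p)
      lossSum += weight * loss
      weightSum += weight
    }
    lossSum / weightSum
  }
}
```

For example, `LogLossSketch.logLoss(Seq((0, Array(0.5, 0.5), 1.0)))` evaluates `-math.log(0.5)`, since the single row assigns probability 0.5 to its true class.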
lazy val labels: Array[Double] = tpByClass.keys.toArray.sorted

/**
 * Returns the logLoss, aka logistic loss or cross-entropy loss.
You could just use a @return tag
Also log-loss rather than logLoss
mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala
Test build #112217 has finished for PR 26135 at commit
srowen left a comment
Looking OK pending tests and one very minor comment
/**
 * An auxiliary constructor taking a DataFrame.
- * @param predictionAndLabels a DataFrame with two double columns: prediction and label
+ * @param predictionAndLabels a DataFrame with columns: prediction, label, weight(optional)
Nit: spaces before paren
Test build #112246 has finished for PR 26135 at commit
Merged to master, thanks @srowen for reviewing!
What changes were proposed in this pull request?
ml.MulticlassClassificationEvaluator & mllib.MulticlassMetrics support log-loss
Why are the changes needed?
log-loss is an important classification metric and is widely used in practice
Does this PR introduce any user-facing change?
Yes, it adds a new option ("logloss") and a related param eps
How was this patch tested?
Added test suites and local tests referring to sklearn