[SPARK-29489][ML][PySpark] ml.evaluation support log-loss #26135

zhengruifeng · 2019-10-16T07:54:30Z

What changes were proposed in this pull request?

Add log-loss support to ml.MulticlassClassificationEvaluator and mllib.MulticlassMetrics.

Why are the changes needed?

Log-loss is an important classification metric and is widely used in practice.

Does this PR introduce any user-facing change?

Yes. It adds a new metric option ("logloss") and a related param, eps.

How was this patch tested?

Added test suites, plus local tests checked against sklearn.
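
For readers unfamiliar with the metric, here is a minimal pure-Python sketch of multiclass log-loss with clipping by eps. It is not the code from this PR: the function name `log_loss`, its argument layout, and the sample data are assumptions made for illustration. Only the bounds check on eps is modelled on the `require` shown in the Scala diff further down.

```python
import math

def log_loss(labels, probabilities, eps=1e-15, weights=None):
    """Weighted mean of -log(p_true), with p_true clipped to [eps, 1 - eps]."""
    if not 0.0 < eps < 0.5:
        raise ValueError(f"eps must be in range (0, 0.5), but got {eps}")
    if weights is None:
        weights = [1.0] * len(labels)
    total_loss = 0.0
    total_weight = 0.0
    for label, probs, weight in zip(labels, probabilities, weights):
        # Probability the model assigned to the true class.
        p_true = probs[label]
        # Clip into [eps, 1 - eps] so that log(0) cannot occur.
        p_true = min(max(p_true, eps), 1.0 - eps)
        total_loss += -math.log(p_true) * weight
        total_weight += weight
    return total_loss / total_weight

# Hypothetical sample: three rows, three classes, labels are class indices.
labels = [0, 1, 2]
probabilities = [[0.8, 0.1, 0.1], [0.2, 0.5, 0.3], [0.0, 0.0, 1.0]]
print(log_loss(labels, probabilities))
```

Without the clipping step, a row whose true class received probability 0 would make the loss infinite, which is why eps is exposed as a parameter.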

zhengruifeng · 2019-10-16T07:58:02Z

mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala


-
-  private val confusions = predictionAndLabels.map {
+  private lazy val confusions = predictionAndLabels.map {


If metricName == logloss, the confusion matrix is not needed, so I made this computation lazy.
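
The effect of deferring the computation can be sketched in Python with `functools.cached_property`, which plays roughly the role that `lazy val` plays in Scala. The `Metrics` class, its counter, and `row_count` are hypothetical names for illustration and do not correspond to the Spark classes.

```python
from functools import cached_property

class Metrics:
    def __init__(self, prediction_and_labels):
        self._rows = prediction_and_labels
        self.confusion_computations = 0  # counts how often the table is built

    @cached_property
    def confusions(self):
        # Built on first access only, then cached on the instance.
        self.confusion_computations += 1
        counts = {}
        for prediction, label in self._rows:
            counts[(label, prediction)] = counts.get((label, prediction), 0) + 1
        return counts

    def row_count(self):
        # Stand-in for a metric that never touches the confusion table.
        return len(self._rows)

metrics = Metrics([(1.0, 1.0), (0.0, 1.0), (0.0, 0.0)])
metrics.row_count()
print(metrics.confusion_computations)  # table not built yet
metrics.confusions
metrics.confusions
print(metrics.confusion_computations)  # built once despite two accesses
```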

zhengruifeng · 2019-10-16T08:00:12Z

mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala

-        (prediction, label, 1.0)
-      case other =>
-        throw new IllegalArgumentException(s"Expected Row of tuples, got $other")
+    this(predictionAndLabels.rdd.map { r =>


Pattern matching will not work in PySpark, so I have to use r.get instead.
MultilabelMetrics also handles DataFrames this way.

SparkQA · 2019-10-16T08:04:25Z

Test build #112149 has finished for PR 26135 at commit dadf716.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-16T08:54:54Z

Test build #112150 has finished for PR 26135 at commit 90c8ef2.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-16T10:19:50Z

Test build #112155 has finished for PR 26135 at commit 2b94170.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-16T11:48:17Z

Test build #112163 has finished for PR 26135 at commit a981f7b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-10-16T16:39:32Z

mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala

+  def logLoss(eps: Double = 1e-15): Double = {
+    require(eps > 0 && eps < 0.5, s"eps must be in range (0, 0.5), but got $eps")
+    val loss1 = - math.log(eps)
+    val loss2 = - math.log(1 - eps)


- math.log1p(-eps)? because eps is going to be very small
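
The numerical point behind this suggestion can be checked with a short Python sketch. Python's `math.log1p` is used here as a stand-in for Scala's `math.log1p`, and comparing both results against eps itself is an approximation chosen for illustration (the exact value is eps + eps^2/2 + ...).

```python
import math

eps = 1e-15

naive = -math.log(1 - eps)   # 1 - eps is rounded to a nearby double before log sees it
stable = -math.log1p(-eps)   # evaluates log(1 + x) accurately for small x

# For tiny eps the true value of -log(1 - eps) is very close to eps.
print(f"naive  = {naive!r}")
print(f"stable = {stable!r}")
print(f"relative error of naive  vs eps: {abs(naive - eps) / eps:.3e}")
print(f"relative error of stable vs eps: {abs(stable - eps) / eps:.3e}")
```

The subtraction `1 - eps` loses most of the digits of eps, so the naive form carries a visible relative error, while `log1p` keeps nearly full precision.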

srowen · 2019-10-16T16:39:42Z

mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala

  lazy val labels: Array[Double] = tpByClass.keys.toArray.sorted
+
+  /**
+   * Returns the logLoss, aka logistic loss or cross-entropy loss.


You could just use a @return tag
Also log-loss rather than logLoss

python/pyspark/ml/evaluation.py

mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala

SparkQA · 2019-10-17T11:27:05Z

Test build #112217 has finished for PR 26135 at commit 38b901c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen

Looking OK pending tests and one very minor comment

srowen · 2019-10-17T14:37:13Z

mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala

  /**
   * An auxiliary constructor taking a DataFrame.
-   * @param predictionAndLabels a DataFrame with two double columns: prediction and label
+   * @param predictionAndLabels a DataFrame with columns: prediction, label, weight(optional)


Nit: spaces before paren

SparkQA · 2019-10-18T03:11:52Z

Test build #112246 has finished for PR 26135 at commit f46046c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2019-10-18T09:58:49Z

merged to master, thanks @srowen for reviewing!

zhengruifeng added 2 commits October 15, 2019 18:55

create pr

b97e8e8

update doc

dadf716

zhengruifeng added ML PYSPARK labels Oct 16, 2019

fix style

90c8ef2

zhengruifeng added 2 commits October 16, 2019 17:05

fix pytest

2b94170

logloss -> logLoss

a981f7b

srowen reviewed Oct 16, 2019

address some commments

38b901c

srowen reviewed Oct 17, 2019

address some commments

f46046c

zhengruifeng closed this in dba673f Oct 18, 2019

zhengruifeng deleted the logloss branch October 18, 2019 09:58


