---
displayTitle: <a href="ml-guide.html">ML</a> - Linear Methods
---

`\[
\newcommand{\wv}{\mathbf{w}}
\]`


In MLlib, we implement popular linear methods such as logistic
regression and linear least squares with $L_1$ or $L_2$ regularization.
Refer to [the linear methods in MLlib](mllib-linear-methods.html) for
details. In `spark.ml`, we also include the Pipelines API for [Elastic
net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
of $L_1$ and $L_2$ regularization proposed in [Zou et al., Regularization
and variable selection via the elastic
net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
Mathematically, it is defined as a convex combination of the $L_1$ and
the $L_2$ regularization terms:
`\[
\alpha \lambda \|\wv\|_1 + (1-\alpha) \frac{\lambda}{2}\|\wv\|_2^2, \quad \alpha \in [0, 1], \ \lambda \geq 0.
\]`

By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
regularization as special cases. For example, if a [linear
regression](https://en.wikipedia.org/wiki/Linear_regression) model is
trained with the elastic net parameter $\alpha$ set to $1$, it is
equivalent to a
[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
On the other hand, if $\alpha$ is set to $0$, the trained model reduces
to a [ridge
regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
We implement the Pipelines API for both linear regression and logistic
regression with elastic net regularization.
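
Concretely, substituting the endpoints of $\alpha$ into the penalty above
recovers the two special cases (a quick sanity check, using the same notation):
`\[
\alpha = 1: \quad \lambda \|\wv\|_1 \ \text{(Lasso)}; \qquad
\alpha = 0: \quad \frac{\lambda}{2} \|\wv\|_2^2 \ \text{(ridge)}.
\]`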

## Example: Logistic Regression

The following example shows how to train a logistic regression model
with elastic net regularization. `elasticNetParam` corresponds to
$\alpha$ and `regParam` corresponds to $\lambda$.

<div class="codetabs">

<div data-lang="scala" markdown="1">

{% highlight scala %}

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.util.MLUtils

// Load training data
val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

// Fit the model
val lrModel = lr.fit(training)

// Print the weights and intercept for logistic regression
println(s"Weights: ${lrModel.weights} Intercept: ${lrModel.intercept}")

{% endhighlight %}

</div>

<div data-lang="java" markdown="1">

{% highlight java %}

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class LogisticRegressionWithElasticNetExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
      .setAppName("Logistic Regression with Elastic Net Example");

    SparkContext sc = new SparkContext(conf);
    SQLContext sql = new SQLContext(sc);
    String path = "data/mllib/sample_libsvm_data.txt";

    // Load training data
    DataFrame training = sql.createDataFrame(MLUtils.loadLibSVMFile(sc, path).toJavaRDD(), LabeledPoint.class);

    LogisticRegression lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.3)
      .setElasticNetParam(0.8);

    // Fit the model
    LogisticRegressionModel lrModel = lr.fit(training);

    // Print the weights and intercept for logistic regression
    System.out.println("Weights: " + lrModel.weights() + " Intercept: " + lrModel.intercept());
  }
}
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">

{% highlight python %}

from pyspark.ml.classification import LogisticRegression
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils
# Load training data
training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(training)

# Print the weights and intercept for logistic regression
print("Weights: " + str(lrModel.weights))
print("Intercept: " + str(lrModel.intercept))
{% endhighlight %}

</div>

</div>
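
As with other `spark.ml` estimators and models, the fitted
`LogisticRegressionModel` can be used as a transformer to score a `DataFrame`.
A minimal Scala sketch, reusing `lrModel` and `training` from the example above
(the training data stands in for a test set here, purely for illustration):

{% highlight scala %}
// Transforming a DataFrame appends "rawPrediction", "probability"
// and "prediction" columns to it.
val predictions = lrModel.transform(training)
predictions.select("label", "probability", "prediction").show(5)
{% endhighlight %}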

### Model Summaries
The `spark.ml` implementation of logistic regression also supports
extracting a summary of the model over the training set. Note that the
predictions and metrics which are stored as `DataFrame`s in
`BinaryLogisticRegressionSummary` are annotated `@transient` and hence
only available on the driver.

<div class="codetabs">

<div data-lang="scala" markdown="1">

[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary)
provides a summary for a
[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel).
Currently, only binary classification is supported and the
summary must be explicitly cast to
[`BinaryLogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary).
This will likely change when multiclass classification is supported.

Continuing the earlier example:

{% highlight scala %}
import org.apache.spark.ml.classification.BinaryLogisticRegressionSummary
import org.apache.spark.sql.functions.max

// Extract the summary from the returned LogisticRegressionModel instance trained in the earlier example
val trainingSummary = lrModel.summary

// Obtain the loss per iteration.
val objectiveHistory = trainingSummary.objectiveHistory
objectiveHistory.foreach(loss => println(loss))

// Obtain the metrics useful to judge performance on test data.
// We cast the summary to a BinaryLogisticRegressionSummary since the problem is a
// binary classification problem.
val binarySummary = trainingSummary.asInstanceOf[BinaryLogisticRegressionSummary]

// Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.
val roc = binarySummary.roc
roc.show()
println(binarySummary.areaUnderROC)

// Set the model threshold to maximize F-Measure
val fMeasure = binarySummary.fMeasureByThreshold
val maxFMeasure = fMeasure.select(max("F-Measure")).head().getDouble(0)
val bestThreshold = fMeasure.where($"F-Measure" === maxFMeasure)
  .select("threshold").head().getDouble(0)
lr.setThreshold(bestThreshold)
lr.fit(training)
{% endhighlight %}
</div>

<div data-lang="java">
[`LogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary)
provides an interface to access information such as `objectiveHistory` and metrics
to evaluate the performance on the training data directly with very less code to be rewritten by
the user. [`LogisticRegression`](api/java/org/apache/spark/ml/classification/LogisticRegression)
currently supports only binary classification and hence in order to access the binary metrics
the summary must be explicitly cast to
[BinaryLogisticRegressionTrainingSummary](api/java/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary)
as done in the code below. This avoids raising errors for multiclass outputs while providing
extensiblity when multiclass classification is supported in the future
<div data-lang="java" markdown="1">
[`LogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary.html)
provides a summary for a
[`LogisticRegressionModel`](api/java/org/apache/spark/ml/classification/LogisticRegressionModel.html).
Currently, only binary classification is supported and the
summary must be explicitly cast to
[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html).
This will likely change when multiclass classification is supported.

Continuing the earlier example:

{% highlight java %}
import org.apache.spark.ml.classification.BinaryLogisticRegressionSummary;
import org.apache.spark.ml.classification.LogisticRegressionTrainingSummary;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.functions;

// Extract the summary from the returned LogisticRegressionModel instance trained in the earlier example
LogisticRegressionTrainingSummary trainingSummary = lrModel.summary();

// Obtain the loss per iteration.
double[] objectiveHistory = trainingSummary.objectiveHistory();
for (double lossPerIteration : objectiveHistory) {
System.out.println(lossPerIteration);
}

// Obtain the metrics useful to judge performance on test data.
// We cast the summary to a BinaryLogisticRegressionSummary since the problem is a
// binary classification problem.
BinaryLogisticRegressionSummary binarySummary = (BinaryLogisticRegressionSummary) trainingSummary;

// Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.
DataFrame roc = binarySummary.roc();
roc.show();
System.out.println(binarySummary.areaUnderROC());

// Get the threshold corresponding to the maximum F-Measure and rerun LogisticRegression
// with this selected threshold.
DataFrame fMeasure = binarySummary.fMeasureByThreshold();
double maxFMeasure = fMeasure.select(functions.max("F-Measure")).head().getDouble(0);
double bestThreshold = fMeasure.where(fMeasure.col("F-Measure").equalTo(maxFMeasure))
  .select("threshold").head().getDouble(0);
lr.setThreshold(bestThreshold);
lr.fit(training);
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
Logistic regression model summary is not yet supported in Python.
</div>

</div>
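
Note that the summary above is computed over the training set. To obtain
similar metrics on held-out data, one can score a test `DataFrame` with the
fitted model and use an evaluator. A minimal Scala sketch, assuming a `test`
DataFrame with the same schema as `training` (the `test` variable is
hypothetical and not defined in the examples above):

{% highlight scala %}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Evaluate areaUnderROC on held-out data, using the model's rawPrediction column.
val evaluator = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderROC")
println(evaluator.evaluate(lrModel.transform(test)))
{% endhighlight %}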

## Optimization

The optimization algorithm underlying the implementation is called
[Orthant-Wise Limited-memory
Quasi-Newton](http://research-srv.microsoft.com/en-us/um/people/jfgao/paper/icml07scalable.pdf)
(OWL-QN). It is an extension of L-BFGS that can effectively handle $L_1$
regularization and elastic net.
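
For reference, combining the loss being minimized with the elastic net penalty
defined above, the training objective has the composite form (stated here for a
generic loss $L$, following directly from the definitions above):
`\[
\min_{\wv} \ \frac{1}{n} \sum_{i=1}^n L(\wv; x_i, y_i)
+ \lambda \left( \alpha \|\wv\|_1 + (1-\alpha) \frac{1}{2} \|\wv\|_2^2 \right),
\]`
where the non-smooth $L_1$ term is what requires the orthant-wise treatment,
while the smooth $L_2$ term can be folded into the standard L-BFGS updates.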