Commit 5ffd5d3

[SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide
## What changes were proposed in this pull request?

Made DataFrame-based API primary
* Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html
* mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html
* ml-guide.html includes a "maintenance mode" announcement about the RDD-based API
  * **Reviewers: please check this carefully**
* (minor) Titles for DF API no longer include "- spark.ml" suffix. Titles for RDD API have "- RDD-based API" suffix
* Moved migration guide to ml-guide from mllib-guide
  * Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides
  * **Reviewers**: I did not change any of the content of the migration guides.

Reorganized DataFrame-based guide:
* ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc.
* Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html
  * **Reviewers**: I did not change the content of these guides, except some intro text.
* Sidebar remains the same, but with pipeline and tuning sections added

Other:
* ml-classification-regression.html: Moved text about linear methods to new section in page

## How was this patch tested?

Generated docs locally

Author: Joseph K. Bradley <[email protected]>

Closes apache#14213 from jkbradley/ml-guide-2.0.
1 parent 71ad945 commit 5ffd5d3


41 files changed: 814 additions & 746 deletions

docs/_data/menu-ml.yaml

Lines changed: 4 additions & 2 deletions
@@ -1,5 +1,5 @@
-- text: "Overview: estimators, transformers and pipelines"
-  url: ml-guide.html
+- text: Pipelines
+  url: ml-pipeline.html
 - text: Extracting, transforming and selecting features
   url: ml-features.html
 - text: Classification and Regression
@@ -8,5 +8,7 @@
   url: ml-clustering.html
 - text: Collaborative filtering
   url: ml-collaborative-filtering.html
+- text: Model selection and tuning
+  url: ml-tuning.html
 - text: Advanced topics
   url: ml-advanced.html
Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,8 @@
 <div class="left-menu-wrapper">
 <div class="left-menu">
-<h3><a href="ml-guide.html">spark.ml package</a></h3>
+<h3><a href="ml-guide.html">MLlib: Main Guide</a></h3>
 {% include nav-left.html nav=include.nav-ml %}
-<h3><a href="mllib-guide.html">spark.mllib package</a></h3>
+<h3><a href="mllib-guide.html">MLlib: RDD-based API Guide</a></h3>
 {% include nav-left.html nav=include.nav-mllib %}
 </div>
 </div>

docs/_layouts/global.html

Lines changed: 1 addition & 1 deletion
@@ -74,7 +74,7 @@
 <li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
 <li><a href="sql-programming-guide.html">DataFrames, Datasets and SQL</a></li>
 <li><a href="structured-streaming-programming-guide.html">Structured Streaming</a></li>
-<li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
+<li><a href="ml-guide.html">MLlib (Machine Learning)</a></li>
 <li><a href="graphx-programming-guide.html">GraphX (Graph Processing)</a></li>
 <li><a href="sparkr.html">SparkR (R on Spark)</a></li>
 </ul>

docs/index.md

Lines changed: 2 additions & 2 deletions
@@ -8,7 +8,7 @@ description: Apache Spark SPARK_VERSION_SHORT documentation homepage
 Apache Spark is a fast and general-purpose cluster computing system.
 It provides high-level APIs in Java, Scala, Python and R,
 and an optimized engine that supports general execution graphs.
-It also supports a rich set of higher-level tools including [Spark SQL](sql-programming-guide.html) for SQL and structured data processing, [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).
+It also supports a rich set of higher-level tools including [Spark SQL](sql-programming-guide.html) for SQL and structured data processing, [MLlib](ml-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).

 # Downloading

@@ -87,7 +87,7 @@ options for deployment:
 * Modules built on Spark:
   * [Spark Streaming](streaming-programming-guide.html): processing real-time data streams
   * [Spark SQL, Datasets, and DataFrames](sql-programming-guide.html): support for structured data and relational queries
-  * [MLlib](mllib-guide.html): built-in machine learning library
+  * [MLlib](ml-guide.html): built-in machine learning library
   * [GraphX](graphx-programming-guide.html): Spark's new API for graph processing

 **API Docs:**

docs/ml-advanced.md

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Advanced topics - spark.ml
-displayTitle: Advanced topics - spark.ml
+title: Advanced topics
+displayTitle: Advanced topics
 ---

 * Table of contents

docs/ml-ann.md

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Multilayer perceptron classifier - spark.ml
-displayTitle: Multilayer perceptron classifier - spark.ml
+title: Multilayer perceptron classifier
+displayTitle: Multilayer perceptron classifier
 ---

 > This section has been moved into the

docs/ml-classification-regression.md

Lines changed: 32 additions & 28 deletions
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Classification and regression - spark.ml
-displayTitle: Classification and regression - spark.ml
+title: Classification and regression
+displayTitle: Classification and regression
 ---


@@ -22,37 +22,14 @@ displayTitle: Classification and regression - spark.ml
 \newcommand{\zero}{\mathbf{0}}
 \]`

+This page covers algorithms for Classification and Regression. It also includes sections
+discussing specific classes of algorithms, such as linear methods, trees, and ensembles.
+
 **Table of Contents**

 * This will become a table of contents (this text will be scraped).
 {:toc}

-In `spark.ml`, we implement popular linear methods such as logistic
-regression and linear least squares with $L_1$ or $L_2$ regularization.
-Refer to [the linear methods in mllib](mllib-linear-methods.html) for
-details about implementation and tuning. We also include a DataFrame API for [Elastic
-net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
-of $L_1$ and $L_2$ regularization proposed in [Zou et al, Regularization
-and variable selection via the elastic
-net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
-Mathematically, it is defined as a convex combination of the $L_1$ and
-the $L_2$ regularization terms:
-`\[
-\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
-\]`
-By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
-regularization as special cases. For example, if a [linear
-regression](https://en.wikipedia.org/wiki/Linear_regression) model is
-trained with the elastic net parameter $\alpha$ set to $1$, it is
-equivalent to a
-[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
-On the other hand, if $\alpha$ is set to $0$, the trained model reduces
-to a [ridge
-regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
-We implement Pipelines API for both linear regression and logistic
-regression with elastic net regularization.
-
-
 # Classification

 ## Logistic regression
@@ -760,7 +737,34 @@ Refer to the [`IsotonicRegression` Python docs](api/python/pyspark.ml.html#pyspa
 </div>
 </div>

+# Linear methods
+
+We implement popular linear methods such as logistic
+regression and linear least squares with $L_1$ or $L_2$ regularization.
+Refer to [the linear methods guide for the RDD-based API](mllib-linear-methods.html) for
+details about implementation and tuning; this information is still relevant.

+We also include a DataFrame API for [Elastic
+net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
+of $L_1$ and $L_2$ regularization proposed in [Zou et al, Regularization
+and variable selection via the elastic
+net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
+Mathematically, it is defined as a convex combination of the $L_1$ and
+the $L_2$ regularization terms:
+`\[
+\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
+\]`
+By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
+regularization as special cases. For example, if a [linear
+regression](https://en.wikipedia.org/wiki/Linear_regression) model is
+trained with the elastic net parameter $\alpha$ set to $1$, it is
+equivalent to a
+[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
+On the other hand, if $\alpha$ is set to $0$, the trained model reduces
+to a [ridge
+regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
+We implement Pipelines API for both linear regression and logistic
+regression with elastic net regularization.

 # Decision trees
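Note on the relocated "Linear methods" text: the elastic net formula it carries can be checked numerically. The following is a minimal sketch in plain Python, not Spark code and not part of this commit. The function name `elastic_net_penalty` and the sample weights are hypothetical, chosen only for illustration. In the DataFrame-based API, $\alpha$ and $\lambda$ correspond to the `elasticNetParam` and `regParam` settings of the linear estimators.

```python
def elastic_net_penalty(weights, alpha, lam):
    """Return alpha * (lam * ||w||_1) + (1 - alpha) * (lam / 2 * ||w||_2^2).

    Hypothetical helper for illustration; it mirrors the formula in the
    moved documentation text, not any Spark internals.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if lam < 0.0:
        raise ValueError("lam must be non-negative")
    l1_norm = sum(abs(w) for w in weights)       # ||w||_1
    l2_norm_squared = sum(w * w for w in weights)  # ||w||_2^2
    return alpha * (lam * l1_norm) + (1.0 - alpha) * (lam / 2.0 * l2_norm_squared)


# Sample weights (an assumption for the example): ||w||_1 = 7, ||w||_2^2 = 25.
weights = [3.0, -4.0]

# alpha = 1 keeps only the L1 term (the Lasso case): 0.5 * 7
lasso_penalty = elastic_net_penalty(weights, alpha=1.0, lam=0.5)

# alpha = 0 keeps only the L2 term (the ridge case): (0.5 / 2) * 25
ridge_penalty = elastic_net_penalty(weights, alpha=0.0, lam=0.5)

# alpha = 0.5 mixes the two terms equally
mixed_penalty = elastic_net_penalty(weights, alpha=0.5, lam=0.5)

print(lasso_penalty, ridge_penalty, mixed_penalty)
```

The two endpoint cases reproduce the special cases named in the text: with $\alpha = 1$ the penalty is purely $L_1$, and with $\alpha = 0$ it is purely $L_2$.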

docs/ml-clustering.md

Lines changed: 5 additions & 3 deletions
@@ -1,10 +1,12 @@
 ---
 layout: global
-title: Clustering - spark.ml
-displayTitle: Clustering - spark.ml
+title: Clustering
+displayTitle: Clustering
 ---

-In this section, we introduce the pipeline API for [clustering in mllib](mllib-clustering.html).
+This page describes clustering algorithms in MLlib.
+The [guide for clustering in the RDD-based API](mllib-clustering.html) also has relevant information
+about these algorithms.

 **Table of Contents**

docs/ml-collaborative-filtering.md

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Collaborative Filtering - spark.ml
-displayTitle: Collaborative Filtering - spark.ml
+title: Collaborative Filtering
+displayTitle: Collaborative Filtering
 ---

 * Table of contents

docs/ml-decision-tree.md

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Decision trees - spark.ml
-displayTitle: Decision trees - spark.ml
+title: Decision trees
+displayTitle: Decision trees
 ---

 > This section has been moved into the
