re-organization of docs + feedback
bllchmbrs committed Dec 8, 2015
commit eb3f99c93d6a91d1d6da1765dbdc96d64ab3bf13
109 changes: 42 additions & 67 deletions docs/ml-guide.md
@@ -613,7 +613,49 @@ for row in selected.collect():

{% endhighlight %}
</div>
</div>

## Example: Saving and Loading a Pipeline

It is often worthwhile to save a model to disk for later use. In Spark 1.6, model import/export functionality was added to the Pipeline API. Most basic transformers are supported, as are some of the more basic ML models, such as the following (a single-model sketch appears after the list):

> Contributor review comment: a model import/export functionality...

* K-Means
* Naive Bayes
* ALS
* Linear Regression
* Logistic Regression
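
For an individual model rather than a whole pipeline, the same `save`/`load` pattern applies. The sketch below is illustrative only: it assumes that `LogisticRegressionModel` implements the `MLWritable`/`MLReadable` interfaces in your Spark version, and the save path is hypothetical.

{% highlight scala %}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}

// Assumes a `training` DataFrame like the one built in the earlier example.
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
val lrModel = lr.fit(training)

// Save the fitted model to a (hypothetical) path, then load it back.
lrModel.save("/tmp/spark-lr-model")
val restoredModel = LogisticRegressionModel.load("/tmp/spark-lr-model")
{% endhighlight %}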

Below is an example of how a pipeline can be persisted and loaded, using the model we trained above.
<div class="codetabs">

<div data-lang="scala" markdown="1">
{% highlight scala %}
import org.apache.spark.ml.PipelineModel

// Fit the model as we did in the previous example.
val model = pipeline.fit(training)
// Now save the fitted PipelineModel to disk.
model.save("/tmp/spark-logistic-regression-model")

// Load the fitted model back in. Note that we load a PipelineModel
// (the fitted pipeline), not the unfitted Pipeline estimator.
val loadedModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
// or equivalently:
// val loadedModel = PipelineModel.read.load("/tmp/spark-logistic-regression-model")

val test = sqlContext.createDataFrame(Seq(
(4L, "spark i j k"),
(5L, "l m n"),
(6L, "mapreduce spark"),
(7L, "apache hadoop")
)).toDF("id", "text")

// Make predictions on test documents
loadedModel.transform(test)
.select("id", "text", "probability", "prediction")
.collect()
.foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
println(s"($id, $text) --> prob=$prob, prediction=$prediction")
}
{% endhighlight %}
</div>
</div>
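
Note that `save` will throw an error if the target path already exists. A minimal follow-up sketch, assuming the `MLWriter.overwrite()` API from Spark 1.6:

{% highlight scala %}
// Overwrite an existing path instead of failing.
model.write.overwrite().save("/tmp/spark-logistic-regression-model")

// The loaded PipelineModel exposes its stages for inspection.
loadedModel.stages.foreach(stage => println(stage.getClass.getSimpleName))
{% endhighlight %}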

## Example: model selection via cross-validation
@@ -952,70 +994,3 @@ model.transform(test)
</div>

</div>

> Member review comment: What's the difference here? I'm confused why this diff is showing up (but wanted to make sure it was not a weird character or something in the line).

## Example: Saving and Loading a Previously Created Model Pipeline

It is often worthwhile to save a model to disk for later use. In Spark 1.6, similar model import/export functionality was added to the Pipeline API. Most basic transformers are supported, as are some of the more basic ML models, such as:

* K-Means
* Naive Bayes
* ALS
* Linear Regression
* Logistic Regression

Below is an example of how a pipeline can be persisted and loaded; this version includes the full training code from the earlier example.
<div data-lang="scala" markdown="1">
{% highlight scala %}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row

// Prepare training documents from a list of (id, text, label) tuples.
val training = sqlContext.createDataFrame(Seq(
(0L, "a b c d e spark", 1.0),
(1L, "b d", 0.0),
(2L, "spark f g h", 1.0),
(3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.01)

val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))

// Fit the pipeline to the training documents, then save the fitted model to disk.
val model = pipeline.fit(training)
model.save("/tmp/spark-logistic-regression-model")

// Load the fitted model back in.
val loadedModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
// or equivalently:
// val loadedModel = PipelineModel.read.load("/tmp/spark-logistic-regression-model")

val test = sqlContext.createDataFrame(Seq(
(4L, "spark i j k"),
(5L, "l m n"),
(6L, "mapreduce spark"),
(7L, "apache hadoop")
)).toDF("id", "text")

// Make predictions on test documents.
loadedModel.transform(test)
.select("id", "text", "probability", "prediction")
.collect()
.foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
println(s"($id, $text) --> prob=$prob, prediction=$prediction")
}
{% endhighlight %}
</div>