docs/ml-guide.md (68 additions & 1 deletion)
@@ -951,4 +951,71 @@ model.transform(test)
{% endhighlight %}
</div>

</div>
</div>
Member: What's the difference here? I'm confused why this diff is showing up (but I wanted to make sure it was not a weird character or something in the line).


## Example: Saving and Loading a Previously Created Model Pipeline

Oftentimes it is worth it to save a model to disk for usage later. In Spark 1.6, similar model import/export functionality was added to the Pipeline API. Most basic transformers are supported, as well as some of the more basic ML Models, such as:
Contributor:
"for later use" instead of "for usage later" sounds better to me.
"In Spark 1.6, a model import/export" instead of "similar".
"ML models".


* K-Means
* Naive Bayes
* ALS
* Linear Regression
* Logistic Regression
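
These models can also be saved and loaded individually, outside of a pipeline. A minimal sketch (the DataFrame `df` with `label` and `features` columns is a hypothetical stand-in):

{% highlight scala %}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}

// Fit a single model and persist it; `df` is an assumed DataFrame
// with "label" and "features" columns.
val lrOnly = new LogisticRegression().fit(df)
lrOnly.save("/tmp/lr-only-model")

// Load it back via the companion object's load method.
val restoredLr = LogisticRegressionModel.load("/tmp/lr-only-model")
{% endhighlight %}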

Below is an example of how a pipeline can be persisted and loaded. This example uses the same pipeline model that was trained above.
<div data-lang="scala" markdown="1">
{% highlight scala %}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row

// Prepare training documents from a list of (id, text, label) tuples.
val training = sqlContext.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// Fit the pipeline to the training documents, then save the fitted model to disk.
val model = pipeline.fit(training)
model.save("/tmp/spark-logistic-regression-model")

// Load the fitted pipeline back in. Note that a fitted pipeline is loaded with
// PipelineModel.load; Pipeline.load would restore an unfit Pipeline instead.
val loadedModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
// Equivalently: PipelineModel.read.load("/tmp/spark-logistic-regression-model")

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = sqlContext.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "mapreduce spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")

// Make predictions on test documents.
loadedModel.transform(test)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }
{% endhighlight %}
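
One practical note: `save` will fail if the target path already exists. A minimal sketch of working around this with the writer API, where `save(path)` is shorthand for `write.save(path)`:

{% highlight scala %}
// Overwrite an existing path instead of failing.
model.write.overwrite().save("/tmp/spark-logistic-regression-model")
{% endhighlight %}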
</div>