[SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib migration guide from 1.6 to 2.0 #13378
Changes from 2 commits
182414b
fb610d2
260f3a3
235930d
2339200
d2666ac
5472fb9
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -102,32 +102,54 @@ MLlib is under active development. | |
| The APIs marked `Experimental`/`DeveloperApi` may change in future releases, | ||
| and the migration guide below will explain all changes between releases. | ||
|
|
||
| ## From 1.5 to 1.6 | ||
| ## From 1.6 to 2.0 | ||
|
|
||
| There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are | ||
| deprecations and changes of behavior. | ||
|
|
||
| Deprecations: | ||
|
|
||
| * [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358): | ||
| In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated. | ||
| * [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592): | ||
| In `spark.ml.classification.LogisticRegressionModel` and | ||
| `spark.ml.regression.LinearRegressionModel`, the `weights` field has been deprecated in favor of | ||
| the new name `coefficients`. This helps disambiguate from instance (row) "weights" given to | ||
| algorithms. | ||
| * [SPARK-14984](https://issues.apache.org/jira/browse/SPARK-14984): | ||
|
Contributor
@yanboliang there are breaking changes for removing some deprecated methods in https://issues.apache.org/jira/browse/SPARK-14089 and https://issues.apache.org/jira/browse/SPARK-14952 that we should highlight.
Contributor
Though I'm happy to just do that in a follow-up PR once I've made a final pass through for MiMa changes.
Contributor (Author)
Good points. I forgot to record all removed deprecated methods. It's great that you can do that in a follow-up PR. Thanks! |
||
| In `spark.ml.regression.LinearRegressionSummary`, the `model` field has been deprecated. | ||
| * [SPARK-13784](https://issues.apache.org/jira/browse/SPARK-13784): | ||
| In `spark.ml.regression.RandomForestRegressionModel` and `spark.ml.classification.RandomForestClassificationModel`, | ||
| the `numTrees` parameter has been deprecated in favor of the `getNumTrees` method. | ||
| * [SPARK-13761](https://issues.apache.org/jira/browse/SPARK-13761): | ||
| In `spark.ml.param.Params`, the `validateParams` method has been deprecated. | ||
| All functionality in overridden methods has been moved to the corresponding `transformSchema`. | ||
| * [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829): | ||
| In the `spark.mllib` package, `LinearRegressionWithSGD`, `LassoWithSGD`, `RidgeRegressionWithSGD` and `LogisticRegressionWithSGD` have been deprecated. | ||
| We encourage users to use `spark.ml.regression.LinearRegression` and `spark.ml.classification.LogisticRegression`. | ||
| * [SPARK-14900](https://issues.apache.org/jira/browse/SPARK-14900): | ||
| In `spark.mllib.evaluation.MulticlassMetrics`, the parameters `precision`, `recall` and `fMeasure` have been deprecated in favor of `accuracy`. | ||
|
|
||
| Changes of behavior: | ||
|
|
||
| * [SPARK-7770](https://issues.apache.org/jira/browse/SPARK-7770): | ||
| `spark.mllib.tree.GradientBoostedTrees`: `validationTol` has changed semantics in 1.6. | ||
| Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of | ||
| `GradientDescent`'s `convergenceTol`: For large errors, it uses relative error (relative to the | ||
| previous error); for small errors (`< 0.01`), it uses absolute error. | ||
| * [SPARK-11069](https://issues.apache.org/jira/browse/SPARK-11069): | ||
| `spark.ml.feature.RegexTokenizer`: Previously, it did not convert strings to lowercase before | ||
| tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the | ||
| behavior of the simpler `Tokenizer` transformer. | ||
| * [SPARK-7780](https://issues.apache.org/jira/browse/SPARK-7780): | ||
| `spark.mllib.classification.LogisticRegressionWithLBFGS` now directly calls `spark.ml.classification.LogisticRegression` for binary classification. | ||
| This introduces the following behavior changes for `spark.mllib.classification.LogisticRegressionWithLBFGS`: | ||
| * The intercept will not be regularized when training a binary classification model with the L1/L2 Updater. | ||
| * If users train without regularization, training with or without feature scaling will return the same solution at the same convergence rate. | ||
| * [SPARK-13429](https://issues.apache.org/jira/browse/SPARK-13429): | ||
| In order to provide better and more consistent results with `spark.ml.classification.LogisticRegression`, | ||
| the default value of `convergenceTol` in `spark.mllib.classification.LogisticRegressionWithLBFGS` has been changed from 1E-4 to 1E-6. | ||
| * [SPARK-12363](https://issues.apache.org/jira/browse/SPARK-12363): | ||
| A bug in `PowerIterationClustering` has been fixed, which will likely change its results. | ||
| * [SPARK-13048](https://issues.apache.org/jira/browse/SPARK-13048): | ||
| `LDA` using the `EM` optimizer will keep the last checkpoint by default, if checkpointing is being used. | ||
| * [SPARK-12153](https://issues.apache.org/jira/browse/SPARK-12153): | ||
| `Word2Vec` now respects sentence boundaries. Previously, it did not handle them correctly. | ||
| * [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574): | ||
| `HashingTF` uses `MurmurHash3` as the default hash algorithm in both `spark.ml` and `spark.mllib`. | ||
| * [SPARK-14768](https://issues.apache.org/jira/browse/SPARK-14768): | ||
| The `expectedType` argument has been removed from PySpark `Param`. | ||
|
||
| * [SPARK-14931](https://issues.apache.org/jira/browse/SPARK-14931): | ||
| Some default `Param` values that were mismatched between pipelines in Scala and Python have been changed. | ||
|
||
| * [SPARK-13600](https://issues.apache.org/jira/browse/SPARK-13600): | ||
| `QuantileDiscretizer` now uses `spark.sql.DataFrameStatFunctions.approxQuantile` to find splits (previously it used custom sampling logic). | ||
| The output buckets will differ for the same input data and params. | ||
| * [SPARK-14814](https://issues.apache.org/jira/browse/SPARK-14814): | ||
|
||
| The Java compatibility issue for the output of the `spark.mllib.tree.model.DecisionTreeModel.predict` method has been fixed. | ||
|
|
||
| ## Previous Spark versions | ||
|
|
||
|
|
||
Not the case for this release
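
For readers of the `validationTol` entry in the diff above (SPARK-7770), the stopping rule it describes can be sketched in plain Python. This is only an illustration of the wording in the guide, not Spark's implementation: the function name `should_stop`, the `floor` parameter, and the choice to combine the relative and absolute regimes through `max` are all assumptions made for the example.

```python
def should_stop(previous_error, current_error, validation_tol, floor=0.01):
    """Return True when the improvement in validation error is below tolerance.

    Hypothetical sketch of the rule described in the guide:
    - for large errors the threshold is relative to the previous error;
    - for small errors (< floor) the threshold becomes absolute,
      i.e. validation_tol * floor.
    """
    improvement = previous_error - current_error
    # max() switches between the relative and the absolute regime.
    threshold = validation_tol * max(previous_error, floor)
    return improvement < threshold


# Large error: the threshold scales with the previous error (0.001 * 1.0).
large_small_gain = should_stop(1.0, 0.9995, validation_tol=0.001)
large_big_gain = should_stop(1.0, 0.99, validation_tol=0.001)

# Small error: the threshold is fixed at 0.001 * 0.01, regardless of the error.
small_tiny_gain = should_stop(0.005, 0.004992, validation_tol=0.001)
small_big_gain = should_stop(0.005, 0.004, validation_tol=0.001)
```

Under these assumptions, a gain of `0.0005` on an error of `1.0` stops training while a gain of `0.01` does not. For an error of `0.005`, the threshold no longer shrinks with the error, so a very small gain still triggers the stop.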