From 182414b89dca0056e5732cb3ddae654ae1379436 Mon Sep 17 00:00:00 2001
From: Yanbo Liang
Date: Sat, 28 May 2016 05:43:58 -0700
Subject: [PATCH 1/7] Document MLlib deprecations and behavior changes in Spark 2.0

---
 docs/mllib-guide.md            | 56 +++++++++++++++++++++++-----------
 docs/mllib-migration-guides.md | 27 ++++++++++++++++
 2 files changed, 66 insertions(+), 17 deletions(-)

diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index fa5e90603505..a090ff90189c 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -102,32 +102,54 @@ MLlib is under active development.
 The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
 and the migration guide below will explain all changes between releases.
 
-## From 1.5 to 1.6
+## From 1.6 to 2.0
 
 There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are
 deprecations and changes of behavior.
 
 Deprecations:
 
-* [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358):
-  In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated.
-* [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592):
-  In `spark.ml.classification.LogisticRegressionModel` and
-  `spark.ml.regression.LinearRegressionModel`, the `weights` field has been deprecated in favor of
-  the new name `coefficients`. This helps disambiguate from instance (row) "weights" given to
-  algorithms.
+* [SPARK-14984](https://issues.apache.org/jira/browse/SPARK-14984):
+  In `spark.ml.regression.LinearRegressionSummary`, the `model` field has been deprecated.
+* [SPARK-13784](https://issues.apache.org/jira/browse/SPARK-13784):
+  In `spark.ml.regression.RandomForestRegressionModel` and `spark.ml.classification.RandomForestClassificationModel`,
+  the `numTrees` parameter has been deprecated in favor of new method `getNumTrees`.
+* [SPARK-13761](https://issues.apache.org/jira/browse/SPARK-13761):
+  In `spark.ml.param.Params`, the `validateParams` method has been deprecated.
+  We move all functionality in overridden methods to `PipelineStage.transformSchema`.
+* [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829):
+  In `spark.mllib` package, `LinearRegressionWithSGD`, `LassoWithSGD`, `RidgeRegressionWithSGD` and `LogisticRegressionWithSGD` have been deprecated.
+  We encourage users to use `spark.ml.regression.LinearRegresson` and `spark.ml.classification.LogisticRegresson`.
+* [SPARK-14900](https://issues.apache.org/jira/browse/SPARK-14900):
+  In `spark.mllib.evaluation.MulticlassMetrics`, the parameters `precision`, `recall` and `fMeasure` have been deprecated in favor of `accuracy`.
 
 Changes of behavior:
 
-* [SPARK-7770](https://issues.apache.org/jira/browse/SPARK-7770):
-  `spark.mllib.tree.GradientBoostedTrees`: `validationTol` has changed semantics in 1.6.
-  Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of
-  `GradientDescent`'s `convergenceTol`: For large errors, it uses relative error (relative to the
-  previous error); for small errors (`< 0.01`), it uses absolute error.
-* [SPARK-11069](https://issues.apache.org/jira/browse/SPARK-11069):
-  `spark.ml.feature.RegexTokenizer`: Previously, it did not convert strings to lowercase before
-  tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the
-  behavior of the simpler `Tokenizer` transformer.
+* [SPARK-7780](https://issues.apache.org/jira/browse/SPARK-7780):
+  `spark.mllib.classification.LogisticRegressionWithLBFGS` directly calls `spark.ml.classification.LogisticRegresson` for binary classification now.
+  This will introduce the following behavior changes for `spark.mllib.classification.LogisticRegressionWithLBFGS`:
+    * The intercept will not be regularized when training binary classification model with L1/L2 Updater.
+    * If users set without regularization, training with or without feature scaling will return the same solution by the same convergence rate.
+* [SPARK-13429](https://issues.apache.org/jira/browse/SPARK-13429):
+  In order to provide better and consistent result with `spark.ml.classification.LogisticRegresson`,
+  the default value of `spark.mllib.classification.LogisticRegressionWithLBFGS`: `convergenceTol` has been changed from 1E-4 to 1E-6.
+* [SPARK-12363](https://issues.apache.org/jira/browse/SPARK-12363):
+  Fix a bug of `PowerIterationClustering` which will likely change its result.
+* [SPARK-13048](https://issues.apache.org/jira/browse/SPARK-13048):
+  `LDA` using the `EM` optimizer will keep the last checkpoint by default, if checkpointing is being used.
+* [SPARK-13048](https://issues.apache.org/jira/browse/SPARK-13048):
+  `Word2Vec` now respects sentence boundaries. Previously, it did not handle them correctly.
+* [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574):
+  `HashingTF` uses `MurmurHash3` as default hash algorithm in both spark.ml and spark.mllib.
+* [SPARK-14768](https://issues.apache.org/jira/browse/SPARK-14768):
+  We remove `expectedType` argument for PySpark `Param`.
+* [SPARK-14931](https://issues.apache.org/jira/browse/SPARK-14931):
+  We change some default `Param` values which were mismatched between pipelines in Scala and Python.
+* [SPARK-13600](https://issues.apache.org/jira/browse/SPARK-13600):
+  `QuantileDiscretizer` now uses `spark.sql.DataFrameStatFunctions.approxQuantile` to find splits (previously used custom sampling logic).
+  The output buckets will differ for same input data and params.
+* [SPARK-14814](https://issues.apache.org/jira/browse/SPARK-14814):
+  Fix the java compatibility issue for the output of `spark.mllib.tree.model.DecisionTreeModel.predict` method.
 
 ## Previous Spark versions
 
diff --git a/docs/mllib-migration-guides.md b/docs/mllib-migration-guides.md
index f3daef2dbadb..970c6697f433 100644
--- a/docs/mllib-migration-guides.md
+++ b/docs/mllib-migration-guides.md
@@ -7,6 +7,33 @@ description: MLlib migration guides from before Spark SPARK_VERSION_SHORT
 
 The migration guide for the current Spark version is kept on the [MLlib Programming Guide main page](mllib-guide.html#migration-guide).
 
+## From 1.5 to 1.6
+
+There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are
+deprecations and changes of behavior.
+
+Deprecations:
+
+* [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358):
+  In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated.
+* [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592):
+  In `spark.ml.classification.LogisticRegressionModel` and
+  `spark.ml.regression.LinearRegressionModel`, the `weights` field has been deprecated in favor of
+  the new name `coefficients`. This helps disambiguate from instance (row) "weights" given to
+  algorithms.
+
+Changes of behavior:
+
+* [SPARK-7770](https://issues.apache.org/jira/browse/SPARK-7770):
+  `spark.mllib.tree.GradientBoostedTrees`: `validationTol` has changed semantics in 1.6.
+  Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of
+  `GradientDescent`'s `convergenceTol`: For large errors, it uses relative error (relative to the
+  previous error); for small errors (`< 0.01`), it uses absolute error.
+* [SPARK-11069](https://issues.apache.org/jira/browse/SPARK-11069):
+  `spark.ml.feature.RegexTokenizer`: Previously, it did not convert strings to lowercase before
+  tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the
+  behavior of the simpler `Tokenizer` transformer.
+
 ## From 1.4 to 1.5
 
 In the `spark.mllib` package, there are no breaking API changes but several behavior changes:

From fb610d25e58cff765d481f1f15728d42806aa8de Mon Sep 17 00:00:00 2001
From: Yanbo Liang
Date: Sat, 28 May 2016 06:36:00 -0700
Subject: [PATCH 2/7] fix typos

---
 docs/mllib-guide.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index a090ff90189c..1fe9ebd523b5 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -113,10 +113,10 @@ Deprecations:
   In `spark.ml.regression.LinearRegressionSummary`, the `model` field has been deprecated.
 * [SPARK-13784](https://issues.apache.org/jira/browse/SPARK-13784):
   In `spark.ml.regression.RandomForestRegressionModel` and `spark.ml.classification.RandomForestClassificationModel`,
-  the `numTrees` parameter has been deprecated in favor of new method `getNumTrees`.
+  the `numTrees` parameter has been deprecated in favor of `getNumTrees` method.
 * [SPARK-13761](https://issues.apache.org/jira/browse/SPARK-13761):
   In `spark.ml.param.Params`, the `validateParams` method has been deprecated.
-  We move all functionality in overridden methods to `PipelineStage.transformSchema`.
+  We move all functionality in overridden methods to the corresponding `transformSchema`.
 * [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829):
   In `spark.mllib` package, `LinearRegressionWithSGD`, `LassoWithSGD`, `RidgeRegressionWithSGD` and `LogisticRegressionWithSGD` have been deprecated.
   We encourage users to use `spark.ml.regression.LinearRegresson` and `spark.ml.classification.LogisticRegresson`.
@@ -137,10 +137,10 @@ Changes of behavior:
   Fix a bug of `PowerIterationClustering` which will likely change its result.
 * [SPARK-13048](https://issues.apache.org/jira/browse/SPARK-13048):
   `LDA` using the `EM` optimizer will keep the last checkpoint by default, if checkpointing is being used.
-* [SPARK-13048](https://issues.apache.org/jira/browse/SPARK-13048):
+* [SPARK-12153](https://issues.apache.org/jira/browse/SPARK-12153):
   `Word2Vec` now respects sentence boundaries. Previously, it did not handle them correctly.
 * [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574):
-  `HashingTF` uses `MurmurHash3` as default hash algorithm in both spark.ml and spark.mllib.
+  `HashingTF` uses `MurmurHash3` as default hash algorithm in both `spark.ml` and `spark.mllib`.
 * [SPARK-14768](https://issues.apache.org/jira/browse/SPARK-14768):
   We remove `expectedType` argument for PySpark `Param`.
 * [SPARK-14931](https://issues.apache.org/jira/browse/SPARK-14931):

From 260f3a35063e4dbf5775aef1cd2878731c0e1147 Mon Sep 17 00:00:00 2001
From: Yanbo Liang
Date: Mon, 30 May 2016 01:45:10 -0700
Subject: [PATCH 3/7] update docs

---
 docs/mllib-guide.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 1fe9ebd523b5..dc1b6fffc16a 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -142,9 +142,9 @@ Changes of behavior:
 * [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574):
   `HashingTF` uses `MurmurHash3` as default hash algorithm in both `spark.ml` and `spark.mllib`.
 * [SPARK-14768](https://issues.apache.org/jira/browse/SPARK-14768):
-  We remove `expectedType` argument for PySpark `Param`.
+  The `expectedType` argument for PySpark `Param` was removed.
 * [SPARK-14931](https://issues.apache.org/jira/browse/SPARK-14931):
-  We change some default `Param` values which were mismatched between pipelines in Scala and Python.
+  Some default `Param` values, which were mismatched between pipelines in Scala and Python, have been changed.
 * [SPARK-13600](https://issues.apache.org/jira/browse/SPARK-13600):
   `QuantileDiscretizer` now uses `spark.sql.DataFrameStatFunctions.approxQuantile` to find splits (previously used custom sampling logic).
   The output buckets will differ for same input data and params.

From 235930d229021ef21921477478b39fa955aa5294 Mon Sep 17 00:00:00 2001
From: Yanbo Liang
Date: Wed, 1 Jun 2016 03:10:50 -0700
Subject: [PATCH 4/7] fix typos

---
 docs/mllib-guide.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index dc1b6fffc16a..89803ba0ccb8 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -104,8 +104,7 @@ and the migration guide below will explain all changes between releases.
 
 ## From 1.6 to 2.0
 
-There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are
-deprecations and changes of behavior.
+The deprecations and changes of behavior in the `spark.mllib` or `spark.ml` packages include:
 
 Deprecations:
 

From 23392006c3023e3c95f8bd434e5fe3090b5724b0 Mon Sep 17 00:00:00 2001
From: Yanbo Liang
Date: Wed, 1 Jun 2016 20:29:24 -0700
Subject: [PATCH 5/7] Remove SPARK-14814 which is a breaking change

---
 docs/mllib-guide.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 89803ba0ccb8..202ef1aa068b 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -147,8 +147,6 @@ Changes of behavior:
 * [SPARK-13600](https://issues.apache.org/jira/browse/SPARK-13600):
   `QuantileDiscretizer` now uses `spark.sql.DataFrameStatFunctions.approxQuantile` to find splits (previously used custom sampling logic).
   The output buckets will differ for same input data and params.
-* [SPARK-14814](https://issues.apache.org/jira/browse/SPARK-14814):
-  Fix the java compatibility issue for the output of `spark.mllib.tree.model.DecisionTreeModel.predict` method.
 
 ## Previous Spark versions
 

From d2666acc4fcd3813400589fc10f74743c8e0b38f Mon Sep 17 00:00:00 2001
From: Yanbo Liang
Date: Mon, 27 Jun 2016 07:18:59 -0700
Subject: [PATCH 6/7] Add two deprecations

---
 docs/mllib-guide.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 202ef1aa068b..1b88bcf978af 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -121,6 +121,9 @@ Deprecations:
   We encourage users to use `spark.ml.regression.LinearRegresson` and `spark.ml.classification.LogisticRegresson`.
 * [SPARK-14900](https://issues.apache.org/jira/browse/SPARK-14900):
   In `spark.mllib.evaluation.MulticlassMetrics`, the parameters `precision`, `recall` and `fMeasure` have been deprecated in favor of `accuracy`.
+* [SPARK-15644](https://issues.apache.org/jira/browse/SPARK-15644):
+  In `spark.ml.util.BaseReadWrite`, the `context` method has been deprecated in favor of `session`.
+* In `spark.ml.feature.ChiSqSelectorModel`, the `setLabelCol` method has been deprecated since it was not used by `ChiSqSelectorModel`.
 
 Changes of behavior:
 

From 5472fb9e4d1158644c0c4fc22cc02083acc4576f Mon Sep 17 00:00:00 2001
From: Yanbo Liang
Date: Mon, 27 Jun 2016 23:47:10 -0700
Subject: [PATCH 7/7] BaseReadWrite to MLReader/MLWriter

---
 docs/mllib-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 1b88bcf978af..c28d13732eed 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -122,7 +122,7 @@ Deprecations:
 * [SPARK-14900](https://issues.apache.org/jira/browse/SPARK-14900):
   In `spark.mllib.evaluation.MulticlassMetrics`, the parameters `precision`, `recall` and `fMeasure` have been deprecated in favor of `accuracy`.
 * [SPARK-15644](https://issues.apache.org/jira/browse/SPARK-15644):
-  In `spark.ml.util.BaseReadWrite`, the `context` method has been deprecated in favor of `session`.
+  In `spark.ml.util.MLReader` and `spark.ml.util.MLWriter`, the `context` method has been deprecated in favor of `session`.
 * In `spark.ml.feature.ChiSqSelectorModel`, the `setLabelCol` method has been deprecated since it was not used by `ChiSqSelectorModel`.
 
 Changes of behavior:
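
Note on the SPARK-14900 entry in this series: the patch only states that `precision`, `recall` and `fMeasure` on `MulticlassMetrics` are deprecated in favor of `accuracy`. A plausible reading is that, for single-label multiclass predictions, the overall (micro-averaged) precision, recall and F-measure all reduce to accuracy, so the three names are redundant. The sketch below checks that identity in plain Python. It does not call Spark, and `micro_averaged_metrics` is a hypothetical helper written for this note, not part of any Spark API.

```python
from collections import Counter


def micro_averaged_metrics(labels, predictions):
    """Return (accuracy, micro precision, micro recall, micro F1).

    Hypothetical helper for illustration only; assumes exactly one
    true label and one predicted label per example.
    """
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and of equal length")

    true_pos = Counter()   # per class c: predicted c and the label is c
    false_pos = Counter()  # per class c: predicted c but the label differs
    false_neg = Counter()  # per class c: label is c but the prediction differs
    for label, pred in zip(labels, predictions):
        if label == pred:
            true_pos[label] += 1
        else:
            false_pos[pred] += 1
            false_neg[label] += 1

    tp = sum(true_pos.values())
    fp = sum(false_pos.values())
    fn = sum(false_neg.values())

    accuracy = tp / len(labels)
    # Every example is either one true positive or one false positive
    # (and likewise one false negative), so tp + fp == tp + fn == len(labels).
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


labels = [0.0, 1.0, 2.0, 2.0, 1.0, 0.0]
predictions = [0.0, 2.0, 2.0, 1.0, 1.0, 0.0]
print(micro_averaged_metrics(labels, predictions))
```

All four values come out equal (up to floating-point rounding) because each misclassified example adds exactly one false positive and one false negative. The per-label variants that take a label argument measure something different and are not covered by this sketch.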