[SPARK-2979][MLlib] Improve the convergence rate by minimizing the condition number #1897
Conversation
QA tests have started for PR 1897. This patch merges cleanly.
QA results for PR 1897:
QA tests have started for PR 1897. This patch merges cleanly.
QA tests have started for PR 1897. This patch merges cleanly.
QA results for PR 1897:
QA results for PR 1897:
should use input itself
It's not an identity map; it converts each LabeledPoint into a (response, feature vector) tuple for the optimizer.
Sorry, I didn't realize that.
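For context, the map under discussion is roughly the following (an illustrative sketch; the helper name is made up): it unpacks each LabeledPoint rather than passing it through unchanged.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Not an identity map: each LabeledPoint is converted into the
// (response, feature vector) pair that the optimizer expects.
def toOptimizerInput(input: RDD[LabeledPoint]): RDD[(Double, Vector)] =
  input.map(lp => (lp.label, lp.features))
```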
Jenkins, test this please.
QA tests have started for PR 1897. This patch merges cleanly.
QA results for PR 1897:
Jenkins, test this please.
Seems that Jenkins is not stable. It is failing on issues related to Akka.
QA tests have started for PR 1897. This patch merges cleanly.
QA results for PR 1897:
[SPARK-2979][MLlib] Improve the convergence rate by minimizing the condition number

Author: DB Tsai <[email protected]>

Closes #1897 from dbtsai/dbtsai-feature-scaling and squashes the following commits:

f19fc02 [DB Tsai] Added more comments
1d85289 [DB Tsai] Improve the convergence rate by minimize the condition number in LOR with LBFGS

(cherry picked from commit 9622106)
Signed-off-by: Xiangrui Meng <[email protected]>
LGTM. Merged into both master and branch-1.1. Thanks!!
[SPARK-2979][MLlib] Improve the convergence rate by minimizing the condition number

Author: DB Tsai <[email protected]>

Closes apache#1897 from dbtsai/dbtsai-feature-scaling and squashes the following commits:

f19fc02 [DB Tsai] Added more comments
1d85289 [DB Tsai] Improve the convergence rate by minimize the condition number in LOR with LBFGS
In theory, the scale of your inputs is irrelevant to logistic regression.
You can "theoretically" multiply X1 by 1E6 and the estimate for β1 will
adjust accordingly: it will be 1E-6 times the original β1, due to the
invariance property of MLEs.
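A quick way to see this (a sketch added for illustration, not part of the patch): rescaling a column only rescales its coefficient, so the fitted probabilities are unchanged.

```latex
% If column x_1 is replaced by c * x_1, the linear predictor is unchanged when
% beta_1 is replaced by beta_1 / c, so the maximum-likelihood fit is identical.
\[
\beta_1 x_1 = \left(\tfrac{\beta_1}{c}\right)(c\,x_1)
\quad\Longrightarrow\quad
\hat{\beta}_1^{\,\text{scaled}} = \frac{\hat{\beta}_1}{c},
\qquad \text{e.g. } c = 10^{6} \Rightarrow \hat{\beta}_1^{\,\text{scaled}} = 10^{-6}\,\hat{\beta}_1 .
\]
```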
However, during the optimization process, the convergence (rate)
depends on the condition number of the training dataset. Scaling
the variables often reduces this condition number, thus improving
the convergence rate.
Without reducing the condition number, optimization on training datasets
that mix columns with very different scales may fail to converge.
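As a rough illustration (an added sketch assuming approximately uncorrelated columns, not a claim about any particular dataset): the condition number of the Gram matrix grows with the squared ratio of the column scales, and per-column scaling pushes it back toward 1.

```latex
% With roughly uncorrelated columns, X^T X is close to diagonal with entries
% proportional to the squared column scales s_j^2, so
\[
\kappa\!\left(X^{\top}X\right) \approx \left(\frac{s_{\max}}{s_{\min}}\right)^{2},
\qquad
\kappa\!\left((X S^{-1})^{\top}(X S^{-1})\right) \approx 1,
\quad S = \operatorname{diag}(s_1,\dots,s_p).
\]
```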
The GLMNET and LIBSVM packages perform this scaling to reduce
the condition number and return the weights in the original scale.
See page 9 of http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
Here, if useFeatureScaling is enabled, we standardize the training
features by dividing each column by its standard deviation (the mean
is not subtracted, so sparse vectors are not densified) and train the
model in the scaled space. We then transform the coefficients from the
scaled space back to the original scale, as GLMNET and LIBSVM do.
Currently, it's only enabled in LogisticRegressionWithLBFGS.
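The shape of the idea is roughly the following (a minimal sketch against the RDD-based mllib API; trainWithFeatureScaling and trainInScaledSpace are illustrative names, not the actual implementation):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// Sketch of the idea: divide each feature column by its standard deviation
// (no centering), train in the scaled space, then map the learned weights
// back to the original scale. For brevity this sketch materializes dense
// vectors; the point of skipping centering is that real sparse input can
// stay sparse.
def trainWithFeatureScaling(
    input: RDD[LabeledPoint],
    trainInScaledSpace: RDD[LabeledPoint] => Array[Double]): Array[Double] = {
  // Per-column standard deviations of the features.
  val std = Statistics.colStats(input.map(_.features)).variance.toArray.map(math.sqrt)

  // Divide each feature by its standard deviation (skip constant columns).
  val scaled = input.map { lp =>
    val values = lp.features.toArray.clone()
    var i = 0
    while (i < values.length) {
      if (std(i) != 0.0) values(i) /= std(i)
      i += 1
    }
    LabeledPoint(lp.label, Vectors.dense(values))
  }

  val scaledWeights = trainInScaledSpace(scaled)

  // Since x_scaled = x / std, a weight w learned in the scaled space
  // corresponds to w / std in the original feature space.
  scaledWeights.zip(std).map { case (w, s) => if (s != 0.0) w / s else 0.0 }
}
```

If trainInScaledSpace wraps the L-BFGS optimizer, the weights returned by this helper can be applied directly to unscaled features, which is the behavior the PR describes for GLMNET and LIBSVM.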