Member

@dbtsai dbtsai commented Aug 12, 2014

In theory, the scale of your inputs is irrelevant to logistic regression.
You can "theoretically" multiply X1 by 1E6 and the estimate for β1 will
adjust accordingly: it will be 1E-6 times the original β1, due
to the invariance property of MLEs.
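
The invariance claim can be checked numerically. The following is a minimal sketch, not code from this patch: it assumes a one-feature model with no intercept, a made-up six-point dataset, and a hand-written Newton solver (`fit_beta` is a hypothetical helper).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_beta(xs, ys, iterations=25):
    """Newton's method for the MLE of a one-feature logistic model (no intercept)."""
    beta = 0.0
    for _ in range(iterations):
        ps = [sigmoid(beta * x) for x in xs]
        # First derivative of the log-likelihood with respect to beta.
        gradient = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Negative second derivative (always positive for this data).
        curvature = sum(p * (1.0 - p) * x * x for x, p in zip(xs, ps))
        beta += gradient / curvature
    return beta

# Toy data (assumed for illustration); the labels are not separable,
# so the MLE is finite.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [0, 0, 1, 0, 1, 1]

beta = fit_beta(xs, ys)
beta_rescaled = fit_beta([x * 1e6 for x in xs], ys)

# Multiplying the rescaled estimate by 1e6 should recover the original one.
print(beta, beta_rescaled * 1e6)
```

Newton's method is itself insensitive to this rescaling, which is why the sketch converges equally well in both cases; first-order and quasi-Newton methods are where the scale starts to matter.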

However, during optimization, the convergence rate
depends on the condition number of the training dataset. Scaling
the variables often reduces this condition number, thus improving
the convergence rate.

Without reducing the condition number, training on some datasets
that mix columns with very different scales may fail to converge.
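
To make the effect concrete, here is a small sketch that compares condition numbers before and after scaling. It makes several assumptions: the condition number of the 2x2 matrix X^T X / n is used as a proxy for the conditioning of the optimization problem, the two columns are made-up toy data, and both helper functions are hypothetical.

```python
import math
import statistics

def gram_condition_number(col_a, col_b):
    """Condition number of the 2x2 matrix X^T X / n for two feature columns."""
    n = len(col_a)
    a = sum(u * u for u in col_a) / n
    b = sum(u * v for u, v in zip(col_a, col_b)) / n
    c = sum(v * v for v in col_b) / n
    # Closed-form largest eigenvalue of the symmetric matrix [[a, b], [b, c]].
    largest = ((a + c) + math.sqrt((a - c) ** 2 + 4.0 * b * b)) / 2.0
    # The eigenvalues multiply to the determinant; this avoids cancellation.
    smallest = (a * c - b * b) / largest
    return largest / smallest

def scale_to_unit_variance(col):
    """Divide a column by its sample standard deviation (no centering)."""
    std = statistics.stdev(col)
    return [v / std for v in col]

# Two columns whose scales differ by four orders of magnitude.
small = [1.0, 2.0, 3.0, 4.0]
large = [20000.0, 10000.0, 40000.0, 30000.0]

cond_raw = gram_condition_number(small, large)
cond_scaled = gram_condition_number(
    scale_to_unit_variance(small), scale_to_unit_variance(large)
)
print(cond_raw, cond_scaled)
```

On this toy data the condition number drops from roughly 7.8e8 to about 29. Real datasets will behave differently, and scaling does not remove ill-conditioning caused by correlated columns.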

The GLMNET and LIBSVM packages perform this scaling to reduce
the condition number, and return the weights in the original scale.
See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf

Here, if useFeatureScaling is enabled, we standardize the training
features by scaling each column to unit variance (without subtracting
the mean, which would densify sparse vectors), and train the model in the
scaled space. Then we transform the coefficients from the scaled space
back to the original scale, as GLMNET and LIBSVM do.
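
The scale, train, and transform-back steps can be sketched as follows. This is an illustration under stated assumptions, not the code in this patch: "standardize" is taken to mean dividing each column by its sample standard deviation, sparse vectors are modelled as `{index: value}` dicts, plain gradient descent stands in for L-BFGS, and every function name is hypothetical.

```python
import math

def column_stds(rows, dim):
    """Sample standard deviation per column, counting implicit zeros."""
    n = len(rows)
    sums, sq_sums = [0.0] * dim, [0.0] * dim
    for row in rows:
        for j, v in row.items():
            sums[j] += v
            sq_sums[j] += v * v
    stds = []
    for j in range(dim):
        mean = sums[j] / n
        var = (sq_sums[j] - n * mean * mean) / (n - 1)
        stds.append(math.sqrt(max(var, 0.0)))
    return stds

def scale_row(row, stds):
    # Only stored entries are touched, so zeros stay implicit (sparse).
    # Subtracting a mean would turn every implicit zero into a non-zero.
    return {j: v / stds[j] for j, v in row.items() if stds[j] > 0.0}

def margin(weights, row):
    return sum(weights[j] * v for j, v in row.items())

def sigmoid(z):
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train(rows, labels, dim, steps=500, lr=0.5):
    """Gradient descent on the logistic loss (stand-in for L-BFGS)."""
    w = [0.0] * dim
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * dim
        for row, y in zip(rows, labels):
            err = sigmoid(margin(w, row)) - y
            for j, v in row.items():
                grad[j] += err * v / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def to_original_scale(w_scaled, stds):
    # w_scaled . (x / std) == (w_scaled / std) . x, so divide each weight.
    return [w / s if s > 0.0 else 0.0 for w, s in zip(w_scaled, stds)]

# Toy sparse data (assumed): column 1 is about 1e4 times larger than column 0.
rows = [{0: 1.0, 1: 20000.0}, {0: 2.0}, {1: 40000.0},
        {0: 4.0, 1: 30000.0}, {0: 3.0, 1: 10000.0}, {1: 5000.0}]
labels = [1, 0, 1, 1, 0, 0]

stds = column_stds(rows, dim=2)
scaled_rows = [scale_row(row, stds) for row in rows]
w_scaled = train(scaled_rows, labels, dim=2)
w_original = to_original_scale(w_scaled, stds)
print(w_scaled, w_original)
```

The weights returned in the original scale give the same margin on a raw row as the scaled weights give on the scaled row, so predictions are unchanged; only the optimizer sees the better-conditioned problem.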

Currently, it's only enabled in LogisticRegressionWithLBFGS.

SparkQA commented Aug 12, 2014

QA tests have started for PR 1897. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18347/consoleFull

@dbtsai dbtsai changed the title [SPARK-2979][MLlib ]Improve the convergence rate by minimize the condition number [SPARK-2979][MLlib] Improve the convergence rate by minimize the condition number Aug 12, 2014
@dbtsai dbtsai changed the title [SPARK-2979][MLlib] Improve the convergence rate by minimize the condition number [SPARK-2979][MLlib] Improve the convergence rate by minimizing the condition number Aug 12, 2014

SparkQA commented Aug 12, 2014

QA results for PR 1897:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18347/consoleFull

SparkQA commented Aug 12, 2014

QA tests have started for PR 1897. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18356/consoleFull

SparkQA commented Aug 12, 2014

QA tests have started for PR 1897. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18358/consoleFull

SparkQA commented Aug 12, 2014

QA results for PR 1897:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18356/consoleFull

SparkQA commented Aug 12, 2014

QA results for PR 1897:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18358/consoleFull

Contributor

should use input itself

Member Author

It's not an identity map. It converts each LabeledPoint into a tuple of response and feature vector for the optimizer.
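
For readers following the thread, the mapping being discussed can be sketched like this. It is a Python stand-in only: `LabeledPoint` here is a hypothetical dataclass, not MLlib's class, and `to_optimizer_input` is an illustrative name.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LabeledPoint:
    """Hypothetical stand-in for a labeled training record."""
    label: float
    features: List[float]

def to_optimizer_input(points: List[LabeledPoint]) -> List[Tuple[float, List[float]]]:
    # Not an identity map: the record type changes from LabeledPoint
    # to a plain (label, features) pair.
    return [(p.label, p.features) for p in points]

points = [LabeledPoint(1.0, [0.5, 2.0]), LabeledPoint(0.0, [1.5, 0.0])]
pairs = to_optimizer_input(points)
print(pairs)
```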

Contributor

Sorry, I didn't realize that.

Contributor

mengxr commented Aug 14, 2014

Jenkins, test this please.

SparkQA commented Aug 14, 2014

QA tests have started for PR 1897. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18521/consoleFull

SparkQA commented Aug 14, 2014

QA results for PR 1897:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18521/consoleFull

Member Author

dbtsai commented Aug 14, 2014

Jenkins, test this please.

Member Author

dbtsai commented Aug 14, 2014

It seems that Jenkins is not stable; the build is failing on issues related to Akka.

SparkQA commented Aug 14, 2014

QA tests have started for PR 1897. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18527/consoleFull

SparkQA commented Aug 14, 2014

QA results for PR 1897:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18527/consoleFull

@asfgit asfgit closed this in 9622106 Aug 14, 2014
asfgit pushed a commit that referenced this pull request Aug 14, 2014
…ndition number

Author: DB Tsai <[email protected]>

Closes #1897 from dbtsai/dbtsai-feature-scaling and squashes the following commits:

f19fc02 [DB Tsai] Added more comments
1d85289 [DB Tsai] Improve the convergence rate by minimize the condition number in LOR with LBFGS

(cherry picked from commit 9622106)
Signed-off-by: Xiangrui Meng <[email protected]>
Contributor

mengxr commented Aug 14, 2014

LGTM. Merged into both master and branch-1.1. Thanks!!

@dbtsai dbtsai deleted the dbtsai-feature-scaling branch August 14, 2014 21:55
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014