NonlinearMinimizer using Projection and Proximal Operators #364
Conversation
…ction code in BFGS using PowerMethod; Fixes in NonlinearMinimizer skeleton
… working properly
@dlwh minEigen works fine, but maxEigen is still a hack. From the BFGS history I am trying to build the CompactHessian to run power iterations, but it fails badly... any pointers would be really helpful...
I fixed the failure in maxEigen, but I still think it has issues. My hope was that with 10-20 iterations of BFGS we would build a good surface for the function f(x), and then minEigen and maxEigen would give us the bounds for rho as we start the proximal updates. I am using rho = 1.0 right now, but in QuadraticMinimizer, tuning rho with eigenvalues helped improve convergence. To test this, I solve ||Ax - b||_2^{2} using BFGS and, once converged, run LBFGS.minEigen and LBFGS.maxEigen and compare them with min(eigSym(A'A).eigenValues) and max(eigSym(A'A).eigenValues). It's not working as I expected (the reference computation is sketched below):
minEigen 9.606173268621757E-6 LBFGS.minEigen -1.0165119664756253
For maxEigen the power method stops after 1 iteration, so there are still bugs in the way I construct CompactHessian from LBFGS.ApproximateInverseHistory.
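A minimal sketch of the reference computation described above, on a random test matrix (the names are illustrative, not the PR's test code):

import breeze.linalg.{DenseMatrix, eigSym, min, max}

val n = 100
val a = DenseMatrix.rand(n, n)
val gram = a.t * a                   // A'A: symmetric positive semi-definite
val eigs = eigSym(gram).eigenvalues  // exact spectrum as the reference
val minEig = min(eigs)               // should match LBFGS.minEigen
val maxEig = max(eigs)               // should match LBFGS.maxEigen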
So, first, this is very cool! I think this could be a great addition to Breeze! A few high-level comments; I'll poke around in the source as I get a chance.
NOTICE (outdated review thread). The proposed NOTICE additions read:
Breeze is distributed under an Apache License V2.0 (See LICENSE)
Proximal algorithms outlined in Proximal.scala (package breeze.optimize.quadratic) are based on https://github.com/cvxgrp/proximal (see LICENSE for details) and distributed with Copyright (c) 2014 by Debasish Das (Verizon), all rights reserved.
QuadraticMinimizer class in package breeze.optimize.proximal is distributed with Copyright (c) 2014, Debasish Das (Verizon), all rights reserved.
FWIW, "all rights reserved" has no legal meaning in modern copyright law.
PQN can do simplex projection, but it does not separate f(x) and g(z)... what we really want is to solve f(x) through a blocked cyclic coordinate descent using BFGS and satisfy g(z) through a proximal operator... that's our version of a parameter server (distributed solver). I am still thinking about whether we can replace coordinate descent with distributed BFGS using some tricks... if I use OWLQN or PQN I have to change code in all of them...
Also, I looked at the paper... that's the same algorithm I am implementing... it is a proximal algorithm... maybe we can plug into PQN as well... I don't know PQN that well...
I read the PQN paper and it is possible to stick all the proximal operators inside PQN... I have tested bounds with PQN already, and it works well compared to default ADMM.
If it holds for ProximalL1() compared to OWLQN, then I am good to use PQN as the core of NonlinearMinimizer... PQN right now accepts a closure of the form DenseVector[Double] => DenseVector[Double]. Would it be fine if I change the signature to (x: DenseVector[Double], rho: Double) => DenseVector[Double]?
This is in tune with generic proximal algorithms:
minimize g(x) + rho||x - v||_{2}^{2}
g(x) can be a constraint here: x \in C
g(x) can be L1 as well, through soft-thresholding, I think (see the sketch below).
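A minimal sketch of a proximal operator under the proposed (x, rho) signature: L1 soft-thresholding, with lambda supplied by currying (proxL1 is a hypothetical helper, not existing Breeze API):

import breeze.linalg.DenseVector

// argmin_x lambda*||x||_1 + (rho/2)*||x - v||_2^2, solved coordinate-wise by
// soft-thresholding; use lambda/(2*rho) if the penalty is rho*||x - v||_2^2.
def proxL1(lambda: Double)(v: DenseVector[Double], rho: Double): DenseVector[Double] = {
  val t = lambda / rho
  v.map(x => math.signum(x) * math.max(math.abs(x) - t, 0.0))
}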
I'm totally open to that, but I'm not quite sure how it will be used. PQN won't manipulate it, right? Why not have your driver specify it in a new instance of PQN?
-- David
Why not an interface like:
trait SeparableDiffFunction[T] {
  def apply(x: T): IndexedSeq[(Double, T)]
}
maybe with a method to turn it into a normal DiffFunction given an OpAdd for T?
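A minimal sketch of that conversion, specialized to DenseVector[Double] for concreteness (the general version would take an implicit OpAdd for T; this is illustrative, not existing API):

import breeze.linalg.DenseVector
import breeze.optimize.DiffFunction

// Collapse the per-block (value, gradient) pairs into an ordinary
// DiffFunction that the existing BFGS-family solvers understand.
def toDiffFunction(sep: DenseVector[Double] => IndexedSeq[(Double, DenseVector[Double])]): DiffFunction[DenseVector[Double]] =
  new DiffFunction[DenseVector[Double]] {
    def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
      val parts = sep(x)
      val value = parts.map(_._1).sum
      val grad = parts.map(_._2).reduce(_ + _) // the OpAdd on T, here vector addition
      (value, grad)
    }
  }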
On Sat, Feb 21, 2015 at 7:53 AM, Debasish Das wrote:
@mengxr I added a driver which creates a new instance of PQN and maps proximal operators to projection operators... I can handle L1 and the probability simplex as well... will update the code soon. Note that a variant of PQN can do proximal quasi-Newton, but it's still research-y... several recent papers focus on it: http://arxiv.org/pdf/1206.1623v13
Basically, with PQN as the solver running on each worker, ADMM will do distributed/multi-core consensus on top of it. For separable functions (like logistic) it will look like this (standard updates sketched below):
minimize F1(x1) + F2(x2) + ... + Fn(xn)
s.t x1 = z, x2 = z, ..., xn = z
Here each Fi is a composite function which satisfies both f(x) and the constraint g(z). From Spark's perspective, the master only keeps one vector z and all the solver memory moves to the workers. With 16 GB of RAM we can still do logistic with 2 x 10^9 x 8 bytes = 2B features, with theoretical guarantees.
For a non-separable function (a neural net) it will be more fun... we will come to it after handling separable functions. The baseline will be block coordinate descent with PQN on each block until we come up with something better (Gauss-Seidel iterations): http://opt.kyb.tuebingen.mpg.de/papers/opt2013_submission_1.pdf
Of course, in Breeze we are not allowed to use RDD, but does it make sense to define an interface for ADMM-based consensus solvers? We can discuss in person as well.
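For reference, the standard scaled-form consensus ADMM updates for the formulation above (following Boyd et al.) are:
x_i^{k+1} = argmin_{x_i} F_i(x_i) + (rho/2)||x_i - z^k + u_i^k||_{2}^{2}   (local solve on worker i)
z^{k+1} = (1/n) sum_i (x_i^{k+1} + u_i^k)   (averaging on the master; a prox step is added if a global g(z) is present)
u_i^{k+1} = u_i^k + x_i^{k+1} - z^{k+1}   (scaled dual update)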
Let me add the experimental version with RDD in it, and you can help define the clean interface... I feel we will need different interfaces for separable and non-separable functions.
I added the PQN driver and the projection operators for L1(x) <= c and ProbabilitySimplex... Somehow, in my experiments, I am unable to reproduce the results from Figure 1 of the PQN paper. OWLQN is run first, and the L1 norm of its solution is used to generate the constraint L1(x) <= c for PQN.
runMain breeze.optimize.proximal.NonlinearMinimizer 500 0.4 0.99
Elastic net with L1 and L2 regularization.
Linear Regression:
Issues:
- Hit the maximum iteration count (500)
- Worse objective than OWLQN and naive ADMM
owlqn 678.072 ms iters 173
pqnSparseTime 30600.365 ms iters 500
owlqnObj -145.38669700980395 pqnSparseObj -135.15057488775597
Logistic Regression:
Issues:
- Higher runtime than OWLQN
- Worse objective than OWLQN and naive ADMM
owlqn 187.894 ms iters 74 pqn 758.073 ms iters 28
objective owlqnLogistic 52.32713379333781 pqnLogistic 81.37098398138012
I am debugging the code further, but any pointers would be great. I don't think this is the expected behavior from PQN on the L1/ProbabilitySimplex constraints as per the paper.
Random question: does it seem to be line searching a lot (relative to OWLQN, or to runs with a box constraint)?
I think my L1 projection has bugs in it... the L1 projection is built using ProbabilitySimplex, and PQN is beating naive ADMM and is at par with QuadraticMinimizer for ProbabilitySimplex.
I will take a closer look at the L1 projection for bugs, but for ADMM-based multicore/distributed consensus I will most likely choose OWLQN for elastic net and PQN for the other constraints.
Here are the results with ProbabilitySimplex (a projection sketch follows the numbers):
minimize f(x) s.t 1'x = c, x >= 0, c = 1
Linear Regression with ProbabilitySimplex, f(x) = ||Ax - b||^2
Objective pqn 85.34613832959954 nl 84.95320179604967 qp 85.33114863196366
Constraint pqn 0.9999999999999997 nl 1.001707509961072 qp 1.000000000000004
time pqn 150.552 ms nl 15847.125 ms qp 96.105 ms
Logistic Regression with ProbabilitySimplex, f(x) = logistic loss from mllib
Objective pqn 257.6058563777358 nl 257.4025971846134
Constraint pqn 0.9999999999999998 nl 1.0007230450802203
time pqn 94.19 ms nl 25160.317 ms
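A minimal sketch of the Euclidean projection onto the probability simplex {x : 1'x = c, x >= 0}, following the sort-based algorithm of Duchi et al. (2008); this is illustrative, not the PR's implementation:

import breeze.linalg.DenseVector

def projectSimplex(v: DenseVector[Double], c: Double = 1.0): DenseVector[Double] = {
  val u = v.toArray.sorted(Ordering[Double].reverse)  // sort descending
  val cumSum = u.scanLeft(0.0)(_ + _).tail            // running sums of u
  // largest k such that u(k-1) - (cumSum(k-1) - c) / k > 0
  val k = (1 to u.length).filter(j => u(j - 1) - (cumSum(j - 1) - c) / j > 0).max
  val tau = (cumSum(k - 1) - c) / k                   // shift that enforces 1'x = c
  v.map(x => math.max(x - tau, 0.0))                  // clip at zero
}

Projection onto the L1 ball {x : L1(x) <= c} reduces to this by projecting |x| onto the simplex and restoring signs, which is exactly where a sign-handling bug could creep in.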
Re: maxEigen, I really don't know enough about what's going on to say what's wrong, other than to speculate that the approximation may not be good enough for your purposes? (I didn't write the PQN code.) Have you tried using an EmpiricalHessian?
Is my CompactHessian construction from the LBFGS history correct? I thought there were bugs in it...
…tion operators for ProbabilitySimplex and L1; Compared PQN with OWLQN and ADMM for L1 constraint
…orts Projection operators
@mengxr I refactored the ADMM-based ConsensusMinimizer to mllib.optimization since it needs to do RDD operations. More results on that will be added in the Spark PR. It will use NonlinearMinimizer as the solver that runs on each worker. NonlinearMinimizer right now supports all the projection operators through PQN, and ProximalL1() through OWLQN. I will send a separate email to John so that he can review the ProjectL1() operators that I added. Later the plan is to merge PQN and OWLQN into a proximal quasi-Newton method. I will open up the PR for reviews with @yuekai (author of the Matlab code https://github.com/yuekai/pnopt that I adapted).
Here are some benchmark results comparing breeze.optimize.proximal.NonlinearMinimizer with the breeze.optimize.proximal.QuadraticMinimizer solver for the respective constraints:
runMain breeze.optimize.proximal.NonlinearMinimizer 400 0.4 0.99
Linear Regression with Elastic Net formulation:
owlqn 2699.234 ms iters 825 sparseQp 359.34 ms iters 68
Linear Regression with Bounds
Logistic Regression with Bounds
Linear Regression with ProbabilitySimplex
Logistic Regression with ProbabilitySimplex
You can see that since the ranks are low, QuadraticMinimizer convergence is much faster! I am adding the test cases, and then @dlwh please take a pass at it.
Very nice! I'll take a look later this evening.
I set it up that way for scale invariance: 2 * f(x) and f(x) should behave the same. No reference, I just made stuff up. As I've said before, here and elsewhere, I'm happy to take PRs to standardize this.
…acktracking line search made similar to OWLQN; SPG adjust cleaned based on FirstOrderMinimizer convergence criteria; ADMM-based generic proximal solver added in NonlinearMinimizer; SPG/PQN based projection solver drivers added in NonlinearMinimizer; ProximalL1 updated to take lambda in constructor; tests updated
OK... let's go with the current convergence criteria and we will rethink them later. I cleaned OWLQN out of NonlinearMinimizer since we feel ADMM/FISTA will converge at par with OWLQN after tuning. I added a generic proximal solver using ADMM. This is for cases where projection does not converge fast enough (L1, for example), or for cases where only proximal operators are defined (huber loss, for example). Right now ADMM is still not at par in runtime with OWLQN, but it's not that bad either. Some interesting benchmark numbers from the ADMM-based proximal solver:
ElasticNet formulation, Linear Regression:
owlqn 7707.722 ms iters 1510 sparseQp 372.883 ms iters 60
owlqn 2X faster than nlSparse (uses SPG by default, L1(x) <= s) and nlProx (uses L1 soft-thresholding)
Logistic Regression with ProximalL1 and ProjectL1:
owlqn 3X faster than proximalL1 (soft-thresholding) and projectL1 (L1(x) <= s)
Logistic Regression with ProbabilitySimplex:
Objective nl 258.1749325226205 admm 258.174932597232
For the probability simplex, projection works much better.
I added a TODO to implement FISTA and compare it against ADMM, since I believe FISTA is currently one of the leading algorithms for smooth f(x) and non-smooth g(z) like L1 (the standard iteration is sketched below). I will do this in a later PR. With this, we are covered on most of the mllib needs. We can either merge it now or wait a few weeks until I am done stress-testing this code through large-rank factorization with convex loss in Spark.
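For reference, the standard FISTA iteration (Beck and Teboulle) for minimize f(x) + g(x), assuming the gradient of f is L-Lipschitz, is:
x^{k} = prox_{g/L}(y^{k} - (1/L) df/dx(y^{k}))
t^{k+1} = (1 + sqrt(1 + 4 (t^{k})^{2})) / 2
y^{k+1} = x^{k} + ((t^{k} - 1) / t^{k+1}) (x^{k} - x^{k-1})
with y^{1} = x^{0} and t^{1} = 1.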
Great. I'm happy to merge if you could move minEigen out of LBFGS.
I will keep it on my private branch... I am not yet sure how useful they are... waiting for a reply from Professor Nocedal on validation and tests for ApproximateInverseHessian/CompactHessian.
In BFGS we are using strong Wolfe... I was wondering whether we should add an option for a line search that does not need new f & g calculations... something like what we did for SPG... thoughts? It would speed up both the mllib classifiers and the ADMM-based proximal code... I doubt we need the line search to be so accurate for either algorithm variant...
Sure, sounds good.
Cleaned up minEigen... the line search is tricky... I will open up a PR once I understand it more. Basically I am thinking about some sort of AdaGrad-style approach to come up with a single learning-rate number for the ADMM line search... it will need more experiments on large datasets.
ok, ok to merge?
Can you please take a closer look at the BFGS state manipulation inside NonlinearMinimizer? I want to make sure I am not doing any excessive object allocation there.
Sorry, where? Also, I'm not quite sure why you're so worried about performance here.
From BFGS I can only extract the solution, but then users can't see how the inner solve is doing...
I agree with the function-eval point... I am OK.
I see. Why not use lbfgs.minimizeAndReturnState? You can cap the iters for LBFGS in its constructor.
Construct a BFGS solver every ADMM iteration? That is cleaner, but then the user won't get any info on the inner solve, which I guess is consistent with PQN... let me do that.
Initially I wanted to use the BFGS history from ADMM iteration i in iteration i + 1, but that needs more thought.
No. You construct the LBFGS once, but you pass maxIter = maxInnerIter into its constructor (a sketch follows).
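A minimal sketch of that pattern (maxInnerIter and innerSolve are illustrative names, not existing API):

import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, LBFGS}

val maxInnerIter = 50  // assumed cap on inner BFGS iterations per ADMM step
val lbfgs = new LBFGS[DenseVector[Double]](maxIter = maxInnerIter, m = 7)

// One inner solve per ADMM iteration; the returned state exposes the
// iterate, objective value, and iteration count for progress reporting.
def innerSolve(f: DiffFunction[DenseVector[Double]],
               init: DenseVector[Double]): DenseVector[Double] =
  lbfgs.minimizeAndReturnState(f, init).x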
It would be nice to have that functionality in general, but yeah, we don't support that right now.
OK, will do that.
Cleaned the LBFGS inner iters... it would be great if you can push it to snapshots... I would like to use QuadraticMinimizer and NonlinearMinimizer for the Spark PRs... once the codebases are stabilized I will send an email about cutting a release. If you have an optimization page that I can modify, I would like to add proper documentation... a lot of optimization capabilities have been added now through these proximal algorithms (direct and iterative)... it would be great if they could be used in other Apache projects as well.
Right now, basically the only documentation for optimize is here: https://github.com/scalanlp/breeze/wiki/Quickstart#breezeoptimize
That needs to be fixed.
NonlinearMinimizer using Projection and Proximal Operators
Thanks again for contributing this!
(Publishing snapshots now; should be available in 10-15 minutes.)
How can I update the documentation? We will be working on fixing it over the weekend...
Feel free to edit the wiki, add new pages, whatever you think is best.
@dlwh @mengxr @dbtsai
This PR introduces an ADMM-based proximal algorithm to solve problems with the following structure:
minimize f(x) + g(z)
s.t x = z
FISTA is also a candidate to solve this problem, but we used ADMM since FISTA can't handle affine constraints natively.
Currently f(x) can be a convex or nonlinear function. So far my stress tests use convex functions (least squares regression and logistic regression), but down the line we will come to nonlinear losses (the DSGD formulation from matrix factorization, or the sigmoid function from a neural net). We call it centralized ADMM since the ADMM loop is still maintained on the master node; once centralized ADMM works, we will generalize it to a Proximal Server (a parameter server with proximal algorithms).
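For reference, the standard scaled-form ADMM updates for this splitting (Boyd et al.) are:
x^{k+1} = argmin_x f(x) + (rho/2)||x - z^k + u^k||_{2}^{2}   (the inner BFGS solve)
z^{k+1} = prox_{g/rho}(x^{k+1} + u^k)   (the proximal/projection step)
u^{k+1} = u^k + x^{k+1} - z^{k+1}   (scaled dual update)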
The parameter server with proximal algorithms is motivated by Mu Li's comparison of blocked proximal gradient with Google's distributed BFGS code:
http://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
We have two variants of the ADMM iteration right now (experimentation is ongoing). One hot-starts x from the previous ADMM iteration, while the other hot-starts both x and the BFGS history from the previous ADMM iteration. A new LBFGS API that returns the solver state was added to facilitate preserving the BFGS history.
runMain breeze.optimize.proximal.NonlinearMinimizer 200 0.4 0.99
We generate a random data matrix A of size 200x200 from a standard Gaussian (mean 0, stddev 1). The label vector is generated from a standard Gaussian as well; if label > 0.5 we assign it 1.0, else 0.0.
lambda = 0.4, elastic-net beta = 0.99
lambdaL1 = 0.4 * 0.99
lambdaL2 = 0.4 * 0.01
Linear Regression:
Comparison 1: NonlinearMinimizer (nl) vs QuadraticMinimizer (qp)
NonlinearMinimizer takes f(x) = ||Ax - b||_2^{2}, df/dx = 2A'(Ax - b)
QuadraticMinimizer takes gram = A'A, linear term = A'b
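A minimal sketch of that objective written as a Breeze DiffFunction (illustrative, not the PR's driver code):

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.optimize.DiffFunction

def leastSquares(a: DenseMatrix[Double], b: DenseVector[Double]) =
  new DiffFunction[DenseVector[Double]] {
    def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
      val r = a * x - b            // residual Ax - b
      (r dot r, (a.t * r) * 2.0)   // f(x) = ||Ax - b||_2^{2}, df/dx = 2A'(Ax - b)
    }
  }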
Solution quality:
||qp - nl|| norm 0.16799969531758605 max-norm 0.034283192270685436
||bfgs - nl|| norm 0.25715611616302037 max-norm 0.04888455483124021
Objective:
Objective qp -55.00000000000021 nl -54.999725074409454 bfgs -54.99983413525827
nl and bfgs are both using BFGS for the primal solve
Comparison 2: NonlinearMinimizer with g(z)=ProximalL1 (nlSparse/nlSparseHistory) vs OWLQN vs QuadraticMinimizer with g(z)=ProximalL1 (sparseQp)
Runtime:
owlqn 165.178 ms iters 143 sparseQp 32.87 ms iters 59
Solution Quality:
||sparseQp - owlqn|| norm 0.14861916579656498 inf-norm 0.03034052809134663
owlqnObj -47.42068811893944 sparseQpObj -47.45927392747443
nlSparse 1479.666 ms iters 195 nlSparseHistory 407.347 ms iters 429
||owlqn - nlSparse|| norm 0.1460259470192177 inf-norm 0.029597210620735503
||owlqn - nlSparseHistory|| norm 0.14649731508771383 inf-norm 0.030278923124194383
nlSparseObj -47.457604531534805 nlSparseHistoryObj -47.45480418795271
NonlinearMinimizer is ~2X slower than OWLQN, but if you look at the objective it converges to a better solution.
Comparison 3: NonlinearMinimizer with g(z) = ProximalBounds(0 <= z <= 1.0) (nlBox) vs PQN vs QuadraticMinimizer with g(z) = ProximalBounds (qpBox)
qpBox 6.769 ms iters 57
nlBox 10182.424 ms iters 2282 nlBoxHistory 1269.134 ms 2271
pqnBox 272.064 ms iters 28
||qpBox - nlBox|| norm 0.0022824265880237215 inf-norm 4.694687018757908E-4
||qpBox - nlBoxHistory|| norm 0.0023190292089652767 inf-norm 4.773077547529439E-4
||qpBox - pqnBox|| norm 0.0021390982826103347 inf-norm 5.634071605518931E-4
qpBoxObj -31.130008148965725 nlBoxObj -31.209892573229023 nlBoxHistoryObj -31.20915524772166 pqnBoxObj -31.127492181959255
Here also we are ~5X slower than PQN, but our solution is better than PQN's.
Logistic Regression:
Comparison 1: ElasticNet formulation with OWLQN and NonlinearMinimizer with g(z)=ProximalL1 (nlLogistic)
Runtime:
owlqn 59.57 ms iters 62 nlLogistic 185.436 ms iters 33
Solution difference:
||owlqn - nlLogistic|| norm 0.41682550361219334 inf-norm 0.10509790307392614
owlqnLogistic objective 30.83266520992968
nlLogistic objective 30.817377544996752
We are ~3X slower than OWLQN, but again the objective is better, so NonlinearMinimizer is producing a better solution. Also note that the solutions are completely different; I have looked at them, and both are sparse.
Comparison 2: Logistic Regression + L2 regularization with Bounds using PQN and NonlinearMinimizer with g(z) = ProximalBox (nlBox)
pqnBox 226.606 ms iters 29 nlBox 1283.92 ms iters 263
pqn loss 57.56457339692869
nl loss 57.50013311157119
We are ~5X slower than PQN but produce better solution quality.
Next steps:
I wanted to open this up for early reviews. QuadraticMinimizer, as mentioned here, is in the following PR:
#321