NonlinearMinimizer using Projection and Proximal Operators #364
Conversation
…ction code in BFGS using PowerMethod; Fixes in NonlinearMinimizer skeleton
… working properly
@dlwh minEigen works fine, but maxEigen is still a hack. From the BFGS history I am trying to build the CompactHessian to run power iterations, but it fails badly... any pointers would be really helpful...
I fixed the failure in maxEigen, but I still think it has issues. My hope was that with 10-20 iterations of BFGS we would build a good surface for the function f(x), and then minEigen and maxEigen would give us the bounds for rho as we start the proximal updates. I am using rho = 1.0 right now, but in QuadraticMinimizer, tuning rho with eigenvalues helped improve convergence. To test this, I solve ||Ax - b||_2^{2} using BFGS and, once converged, run LBFGS.minEigen and LBFGS.maxEigen and compare them with min(eigSym(A'A).eigenValues) and max(eigSym(A'A).eigenValues). It's not working as I expected (the reference computation is sketched below):
minEigen 9.606173268621757E-6 LBFGS.minEigen -1.0165119664756253
For maxEigen the power method stops after 1 iteration, so there are still bugs in the way I construct CompactHessian from LBFGS.ApproximateInverseHistory.
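A minimal sketch of the reference computation described above, on a random test matrix (the names are illustrative, not the PR's test code):

import breeze.linalg.{DenseMatrix, eigSym, min, max}

val n = 100
val a = DenseMatrix.rand(n, n)
val gram = a.t * a                   // A'A: symmetric positive semi-definite
val eigs = eigSym(gram).eigenvalues  // exact spectrum as the reference
val minEig = min(eigs)               // should match LBFGS.minEigen
val maxEig = max(eigs)               // should match LBFGS.maxEigen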
So, first, this is very cool! I think this could be a great addition to Breeze! A few high-level comments; I'll poke around in the source as I get a chance.
NOTICE (outdated review thread). The proposed NOTICE additions read:
Breeze is distributed under an Apache License V2.0 (See LICENSE)
Proximal algorithms outlined in Proximal.scala (package breeze.optimize.quadratic) are based on https://github.com/cvxgrp/proximal (see LICENSE for details) and distributed with Copyright (c) 2014 by Debasish Das (Verizon), all rights reserved.
QuadraticMinimizer class in package breeze.optimize.proximal is distributed with Copyright (c) 2014, Debasish Das (Verizon), all rights reserved.
FWIW, "all rights reserved" has no legal meaning in modern copyright law.
PQN can do simplex projection, but it does not separate f(x) and g(z)... what we really want is to solve f(x) through a blocked cyclic coordinate descent using BFGS and satisfy g(z) through a proximal operator... that's our version of a parameter server (distributed solver). I am still thinking about whether we can replace coordinate descent with distributed BFGS using some tricks... if I use OWLQN or PQN I have to change code in all of them...
Also, I looked at the paper... that's the same algorithm I am implementing... it is a proximal algorithm... maybe we can plug into PQN as well... I don't know PQN that well...
I read the PQN paper and it is possible to stick all the proximal operators inside PQN... I have tested bounds with PQN already, and it works well compared to default ADMM.
If it holds for ProximalL1() compared to OWLQN, then I am good to use PQN as the core of NonlinearMinimizer... PQN right now accepts a closure of the form DenseVector[Double] => DenseVector[Double]. Would it be fine if I change the signature to (x: DenseVector[Double], rho: Double) => DenseVector[Double]?
This is in tune with generic proximal algorithms:
minimize g(x) + rho||x - v||_{2}^{2}
g(x) can be a constraint here: x \in C
g(x) can be L1 as well, through soft-thresholding, I think (see the sketch below).
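A minimal sketch of a proximal operator under the proposed (x, rho) signature: L1 soft-thresholding, with lambda supplied by currying (proxL1 is a hypothetical helper, not existing Breeze API):

import breeze.linalg.DenseVector

// argmin_x lambda*||x||_1 + (rho/2)*||x - v||_2^2, solved coordinate-wise by
// soft-thresholding; use lambda/(2*rho) if the penalty is rho*||x - v||_2^2.
def proxL1(lambda: Double)(v: DenseVector[Double], rho: Double): DenseVector[Double] = {
  val t = lambda / rho
  v.map(x => math.signum(x) * math.max(math.abs(x) - t, 0.0))
}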
I'm totally open to that, but I'm not quite sure how it will be used. PQN won't manipulate it, right? Why not have your driver specify it in a new instance of PQN?
-- David
Why not an interface like:
trait SeparableDiffFunction[T] {
  def apply(x: T): IndexedSeq[(Double, T)]
}
maybe with a method to turn it into a normal DiffFunction given an OpAdd for T?
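A minimal sketch of that conversion, specialized to DenseVector[Double] for concreteness (the general version would take an implicit OpAdd for T; this is illustrative, not existing API):

import breeze.linalg.DenseVector
import breeze.optimize.DiffFunction

// Collapse the per-block (value, gradient) pairs into an ordinary
// DiffFunction that the existing BFGS-family solvers understand.
def toDiffFunction(sep: DenseVector[Double] => IndexedSeq[(Double, DenseVector[Double])]): DiffFunction[DenseVector[Double]] =
  new DiffFunction[DenseVector[Double]] {
    def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
      val parts = sep(x)
      val value = parts.map(_._1).sum
      val grad = parts.map(_._2).reduce(_ + _) // the OpAdd on T, here vector addition
      (value, grad)
    }
  }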
On Sat, Feb 21, 2015 at 7:53 AM, Debasish Das wrote:
@mengxr I added a driver which creates a new instance of PQN and maps proximal operators to projection operators... I can handle L1 and the probability simplex as well... will update the code soon. Note that a variant of PQN can do proximal quasi-Newton, but it's still research-y... several recent papers focus on it: http://arxiv.org/pdf/1206.1623v13
Basically, with PQN as the solver running on each worker, ADMM will do distributed/multi-core consensus on top of it. For separable functions (like logistic) it will look like this (standard updates sketched below):
minimize F1(x1) + F2(x2) + ... + Fn(xn)
s.t x1 = z, x2 = z, ..., xn = z
Here each Fi is a composite function which satisfies both f(x) and the constraint g(z). From Spark's perspective, the master only keeps one vector z and all the solver memory moves to the workers. With 16 GB of RAM we can still do logistic with 2 x 10^9 x 8 bytes = 2B features, with theoretical guarantees.
For a non-separable function (a neural net) it will be more fun... we will come to it after handling separable functions. The baseline will be block coordinate descent with PQN on each block until we come up with something better (Gauss-Seidel iterations): http://opt.kyb.tuebingen.mpg.de/papers/opt2013_submission_1.pdf
Of course, in Breeze we are not allowed to use RDD, but does it make sense to define an interface for ADMM-based consensus solvers? We can discuss in person as well.
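For reference, the standard scaled-form consensus ADMM updates for the formulation above (following Boyd et al.) are:
x_i^{k+1} = argmin_{x_i} F_i(x_i) + (rho/2)||x_i - z^k + u_i^k||_{2}^{2}   (local solve on worker i)
z^{k+1} = (1/n) sum_i (x_i^{k+1} + u_i^k)   (averaging on the master; a prox step is added if a global g(z) is present)
u_i^{k+1} = u_i^k + x_i^{k+1} - z^{k+1}   (scaled dual update)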
Let me add the experimental version with RDD in it, and you can help define the clean interface... I feel we will need different interfaces for separable and non-separable functions.
I added the PQN driver and the projection operators for L1(x) <= c and ProbabilitySimplex... Somehow, in my experiments, I am unable to reproduce the results from Figure 1 of the PQN paper. OWLQN is run first, and the L1 norm of its solution is used to generate the constraint L1(x) <= c for PQN.
runMain breeze.optimize.proximal.NonlinearMinimizer 500 0.4 0.99
Elastic net with L1 and L2 regularization.
Linear Regression:
Issues:
- Hit the maximum iteration count (500)
- Worse objective than OWLQN and naive ADMM
owlqn 678.072 ms iters 173
pqnSparseTime 30600.365 ms iters 500
owlqnObj -145.38669700980395 pqnSparseObj -135.15057488775597
Logistic Regression:
Issues:
- Higher runtime than OWLQN
- Worse objective than OWLQN and naive ADMM
owlqn 187.894 ms iters 74 pqn 758.073 ms iters 28
objective owlqnLogistic 52.32713379333781 pqnLogistic 81.37098398138012
I am debugging the code further, but any pointers would be great. I don't think this is the expected behavior from PQN on the L1/ProbabilitySimplex constraints as per the paper.
Random question: does it seem to be line searching a lot (relative to OWLQN, or to runs with a box constraint)?
I think my L1 projection has bugs in it... the L1 projection is built using ProbabilitySimplex, and PQN is beating naive ADMM and is at par with QuadraticMinimizer for ProbabilitySimplex.
I will take a closer look at the L1 projection for bugs, but for ADMM-based multicore/distributed consensus I will most likely choose OWLQN for elastic net and PQN for the other constraints.
Here are the results with ProbabilitySimplex (a projection sketch follows the numbers):
minimize f(x) s.t 1'x = c, x >= 0, c = 1
Linear Regression with ProbabilitySimplex, f(x) = ||Ax - b||^2
Objective pqn 85.34613832959954 nl 84.95320179604967 qp 85.33114863196366
Constraint pqn 0.9999999999999997 nl 1.001707509961072 qp 1.000000000000004
time pqn 150.552 ms nl 15847.125 ms qp 96.105 ms
Logistic Regression with ProbabilitySimplex, f(x) = logistic loss from mllib
Objective pqn 257.6058563777358 nl 257.4025971846134
Constraint pqn 0.9999999999999998 nl 1.0007230450802203
time pqn 94.19 ms nl 25160.317 ms
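A minimal sketch of the Euclidean projection onto the probability simplex {x : 1'x = c, x >= 0}, following the sort-based algorithm of Duchi et al. (2008); this is illustrative, not the PR's implementation:

import breeze.linalg.DenseVector

def projectSimplex(v: DenseVector[Double], c: Double = 1.0): DenseVector[Double] = {
  val u = v.toArray.sorted(Ordering[Double].reverse)  // sort descending
  val cumSum = u.scanLeft(0.0)(_ + _).tail            // running sums of u
  // largest k such that u(k-1) - (cumSum(k-1) - c) / k > 0
  val k = (1 to u.length).filter(j => u(j - 1) - (cumSum(j - 1) - c) / j > 0).max
  val tau = (cumSum(k - 1) - c) / k                   // shift that enforces 1'x = c
  v.map(x => math.max(x - tau, 0.0))                  // clip at zero
}

Projection onto the L1 ball {x : L1(x) <= c} reduces to this by projecting |x| onto the simplex and restoring signs, which is exactly where a sign-handling bug could creep in.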
Re: maxEigen, I really don't know enough about what's going on to say what's wrong, other than to speculate that the approximation may not be good enough for your purposes? (I didn't write the PQN code.) Have you tried using an EmpiricalHessian?
Is my CompactHessian construction from the LBFGS history correct? I thought there were bugs in it...
…tion operators for ProbabilitySimplex and L1; Compared PQN with OWLQN and ADMM for L1 constraint
…orts Projection operators
@mengxr I refactored the ADMM-based ConsensusMinimizer to mllib.optimization since it needs to do RDD operations. More results on that will be added in the Spark PR. It will use NonlinearMinimizer as the solver that runs on each worker. NonlinearMinimizer right now supports all the projection operators through PQN, and ProximalL1() through OWLQN. I will send a separate email to John so that he can review the ProjectL1() operators that I added. Later the plan is to merge PQN and OWLQN into a proximal quasi-Newton method. I will open up the PR for reviews with @yuekai (author of the Matlab code https://github.com/yuekai/pnopt that I adapted).
Here are some benchmark results comparing breeze.optimize.proximal.NonlinearMinimizer with the breeze.optimize.proximal.QuadraticMinimizer solver for the respective constraints:
runMain breeze.optimize.proximal.NonlinearMinimizer 400 0.4 0.99
Linear Regression with Elastic Net formulation:
owlqn 2699.234 ms iters 825 sparseQp 359.34 ms iters 68
Linear Regression with Bounds
Logistic Regression with Bounds
Linear Regression with ProbabilitySimplex
Logistic Regression with ProbabilitySimplex
You can see that since the ranks are low, QuadraticMinimizer convergence is much faster! I am adding the test cases, and then @dlwh please take a pass at it.
Very nice! I'll take a look later this evening.
I set it up that way for scale invariance: 2 * f(x) and f(x) should behave the same. No reference, I just made stuff up. As I've said before, here and elsewhere, I'm happy to take PRs to standardize this.
…acktracking line search made similar to OWLQN; SPG adjust cleaned based on FirstOrderMinimizer convergence criteria; ADMM-based generic proximal solver added in NonlinearMinimizer; SPG/PQN based projection solver drivers added in NonlinearMinimizer; ProximalL1 updated to take lambda in constructor; tests updated
OK... let's go with the current convergence criteria and we will rethink them later. I cleaned OWLQN out of NonlinearMinimizer since we feel ADMM/FISTA will converge at par with OWLQN after tuning. I added a generic proximal solver using ADMM. This is for cases where projection does not converge fast enough (L1, for example), or for cases where only proximal operators are defined (huber loss, for example). Right now ADMM is still not at par in runtime with OWLQN, but it's not that bad either. Some interesting benchmark numbers from the ADMM-based proximal solver:
ElasticNet formulation, Linear Regression:
owlqn 7707.722 ms iters 1510 sparseQp 372.883 ms iters 60
owlqn 2X faster than nlSparse (uses SPG by default, L1(x) <= s) and nlProx (uses L1 soft-thresholding)
Logistic Regression with ProximalL1 and ProjectL1:
owlqn 3X faster than proximalL1 (soft-thresholding) and projectL1 (L1(x) <= s)
Logistic Regression with ProbabilitySimplex:
Objective nl 258.1749325226205 admm 258.174932597232
For the probability simplex, projection works much better.
I added a TODO to implement FISTA and compare it against ADMM, since I believe FISTA is currently one of the leading algorithms for smooth f(x) and non-smooth g(z) like L1 (the standard iteration is sketched below). I will do this in a later PR. With this, we are covered on most of the mllib needs. We can either merge it now or wait a few weeks until I am done stress-testing this code through large-rank factorization with convex loss in Spark.
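For reference, the standard FISTA iteration (Beck and Teboulle) for minimize f(x) + g(x), assuming the gradient of f is L-Lipschitz, is:
x^{k} = prox_{g/L}(y^{k} - (1/L) df/dx(y^{k}))
t^{k+1} = (1 + sqrt(1 + 4 (t^{k})^{2})) / 2
y^{k+1} = x^{k} + ((t^{k} - 1) / t^{k+1}) (x^{k} - x^{k-1})
with y^{1} = x^{0} and t^{1} = 1.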
Great. I'm happy to merge if you could move minEigen out of LBFGS.
I will keep it on my private branch... I am not yet sure how useful they are... waiting for a reply from Professor Nocedal on validation and tests for ApproximateInverseHessian/CompactHessian.
In BFGS we are using strong Wolfe... I was wondering whether we should add an option for a line search that does not need new f & g calculations... something like what we did for SPG... thoughts? It would speed up both the mllib classifiers and the ADMM-based proximal code... I doubt we need the line search to be so accurate for either algorithm variant...
Sure, sounds good.
Cleaned up minEigen... the line search is tricky... I will open up a PR once I understand it more. Basically I am thinking about some sort of AdaGrad-style approach to come up with a single learning-rate number for the ADMM line search... it will need more experiments on large datasets.
ok, ok to merge?
Can you please take a closer look at the BFGS state manipulation inside NonlinearMinimizer? I want to make sure I am not doing any excessive object allocation there.
Sorry, where? Also, I'm not quite sure why you're so worried about performance here.
From BFGS I can only extract the solution, but then users can't see how the inner solve is doing...
I agree with the function-eval point... I am OK.
I see. Why not use lbfgs.minimizeAndReturnState? You can cap the iters for LBFGS in its constructor.
Construct a BFGS solver every ADMM iteration? That is cleaner, but then the user won't get any info on the inner solve, which I guess is consistent with PQN... let me do that.
Initially I wanted to use the BFGS history from ADMM iteration i in iteration i + 1, but that needs more thought.
No. You construct the LBFGS once, but you pass maxIter = maxInnerIter into its constructor (a sketch follows).
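A minimal sketch of that pattern (maxInnerIter and innerSolve are illustrative names, not existing API):

import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, LBFGS}

val maxInnerIter = 50  // assumed cap on inner BFGS iterations per ADMM step
val lbfgs = new LBFGS[DenseVector[Double]](maxIter = maxInnerIter, m = 7)

// One inner solve per ADMM iteration; the returned state exposes the
// iterate, objective value, and iteration count for progress reporting.
def innerSolve(f: DiffFunction[DenseVector[Double]],
               init: DenseVector[Double]): DenseVector[Double] =
  lbfgs.minimizeAndReturnState(f, init).x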
It would be nice to have that functionality in general, but yeah, we don't support that right now.
OK, will do that.
Cleaned the LBFGS inner iters... it would be great if you can push it to snapshots... I would like to use QuadraticMinimizer and NonlinearMinimizer for the Spark PRs... once the codebases are stabilized I will send an email about cutting a release. If you have an optimization page that I can modify, I would like to add proper documentation... a lot of optimization capabilities have been added now through these proximal algorithms (direct and iterative)... it would be great if they could be used in other Apache projects as well.
Right now, basically the only documentation for optimize is here: https://github.com/scalanlp/breeze/wiki/Quickstart#breezeoptimize
That needs to be fixed.
NonlinearMinimizer using Projection and Proximal Operators
Thanks again for contributing this!
(Publishing snapshots now; should be available in 10-15 minutes.)
How can I update the documentation? We will be working on fixing it over the weekend...
Feel free to edit the wiki, add new pages, whatever you think is best.
@dlwh @mengxr @dbtsai
This PR introduces an ADMM-based proximal algorithm to solve problems with the following structure:
minimize f(x) + g(z)
s.t x = z
FISTA is also a candidate to solve this problem, but we used ADMM since FISTA can't handle affine constraints natively.
Currently f(x) can be a convex or nonlinear function. So far my stress tests use convex functions (least squares regression and logistic regression), but down the line we will come to nonlinear losses (the DSGD formulation from matrix factorization, or the sigmoid function from a neural net). We call it centralized ADMM since the ADMM loop is still maintained on the master node; once centralized ADMM works, we will generalize it to a Proximal Server (a parameter server with proximal algorithms).
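For reference, the standard scaled-form ADMM updates for this splitting (Boyd et al.) are:
x^{k+1} = argmin_x f(x) + (rho/2)||x - z^k + u^k||_{2}^{2}   (the inner BFGS solve)
z^{k+1} = prox_{g/rho}(x^{k+1} + u^k)   (the proximal/projection step)
u^{k+1} = u^k + x^{k+1} - z^{k+1}   (scaled dual update)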
The parameter server with proximal algorithms is motivated by Mu Li's comparison of blocked proximal gradient with Google's distributed BFGS code:
http://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
We have two variants of the ADMM iteration right now (experimentation is ongoing). One hot-starts x from the previous ADMM iteration, while the other hot-starts both x and the BFGS history from the previous ADMM iteration. A new LBFGS API that returns the solver state was added to facilitate preserving the BFGS history.
runMain breeze.optimize.proximal.NonlinearMinimizer 200 0.4 0.99
We generate a random data matrix A of size 200x200 from a standard Gaussian (mean 0, stddev 1). The label vector is generated from a standard Gaussian as well; if label > 0.5 we assign it 1.0, else 0.0.
lambda = 0.4, elastic-net beta = 0.99
lambdaL1 = 0.4 * 0.99
lambdaL2 = 0.4 * 0.01
Linear Regression:
Comparison 1: NonlinearMinimizer (nl) vs QuadraticMinimizer (qp)
NonlinearMinimizer takes f(x) = ||Ax - b||_2^{2}, df/dx = 2A'(Ax - b)
QuadraticMinimizer takes gram = A'A, linear term = A'b
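A minimal sketch of that objective written as a Breeze DiffFunction (illustrative, not the PR's driver code):

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.optimize.DiffFunction

def leastSquares(a: DenseMatrix[Double], b: DenseVector[Double]) =
  new DiffFunction[DenseVector[Double]] {
    def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
      val r = a * x - b            // residual Ax - b
      (r dot r, (a.t * r) * 2.0)   // f(x) = ||Ax - b||_2^{2}, df/dx = 2A'(Ax - b)
    }
  }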
Solution quality:
||qp - nl|| norm 0.16799969531758605 max-norm 0.034283192270685436
||bfgs - nl|| norm 0.25715611616302037 max-norm 0.04888455483124021
Objective:
Objective qp -55.00000000000021 nl -54.999725074409454 bfgs -54.99983413525827
nl and bfgs are both using BFGS for the primal solve
Comparison 2: NonlinearMinimizer with g(z)=ProximalL1 (nlSparse/nlSparseHistory) vs OWLQN vs QuadraticMinimizer with g(z)=ProximalL1 (sparseQp)
Runtime:
owlqn 165.178 ms iters 143 sparseQp 32.87 ms iters 59
Solution Quality:
||sparseQp - owlqn|| norm 0.14861916579656498 inf-norm 0.03034052809134663
owlqnObj -47.42068811893944 sparseQpObj -47.45927392747443
nlSparse 1479.666 ms iters 195 nlSparseHistory 407.347 ms iters 429
||owlqn - nlSparse|| norm 0.1460259470192177 inf-norm 0.029597210620735503
||owlqn - nlSparseHistory|| norm 0.14649731508771383 inf-norm 0.030278923124194383
nlSparseObj -47.457604531534805 nlSparseHistoryObj -47.45480418795271
NonlinearMinimizer is ~2X slower than OWLQN, but if you look at the objective it converges to a better solution.
Comparison 3: NonlinearMinimizer with g(z) = ProximalBounds(0 <= z <= 1.0) (nlBox) vs PQN vs QuadraticMinimizer with g(z) = ProximalBounds (qpBox)
qpBox 6.769 ms iters 57
nlBox 10182.424 ms iters 2282 nlBoxHistory 1269.134 ms 2271
pqnBox 272.064 ms iters 28
||qpBox - nlBox|| norm 0.0022824265880237215 inf-norm 4.694687018757908E-4
||qpBox - nlBoxHistory|| norm 0.0023190292089652767 inf-norm 4.773077547529439E-4
||qpBox - pqnBox|| norm 0.0021390982826103347 inf-norm 5.634071605518931E-4
qpBoxObj -31.130008148965725 nlBoxObj -31.209892573229023 nlBoxHistoryObj -31.20915524772166 pqnBoxObj -31.127492181959255
Here also we are ~5X slower than PQN, but our solution is better than PQN's.
Logistic Regression:
Comparison 1: ElasticNet formulation with OWLQN and NonlinearMinimizer with g(z)=ProximalL1 (nlLogistic)
Runtime:
owlqn 59.57 ms iters 62 nlLogistic 185.436 ms iters 33
Solution difference:
||owlqn - nlLogistic|| norm 0.41682550361219334 inf-norm 0.10509790307392614
owlqnLogistic objective 30.83266520992968
nlLogistic objective 30.817377544996752
We are ~3X slower than OWLQN, but again the objective is better, so NonlinearMinimizer is producing a better solution. Also note that the solutions are completely different; I have looked at them, and both are sparse.
Comparison 2: Logistic Regression + L2 regularization with Bounds using PQN and NonlinearMinimizer with g(z) = ProximalBox (nlBox)
pqnBox 226.606 ms iters 29 nlBox 1283.92 ms iters 263
pqn loss 57.56457339692869
nl loss 57.50013311157119
We are ~5X slower than PQN but produce better solution quality.
Next steps:
I wanted to open this up for early reviews. QuadraticMinimizer, as mentioned here, is in the following PR:
#321