Conversation

@debasish83 (Contributor)

@dlwh @mengxr @dbtsai

This PR introduces an ADMM-based proximal algorithm to solve problems of the following structure:
minimize f(x) + g(z)
s.t. x = z
FISTA is also a candidate for solving this problem, but we used ADMM since FISTA can't handle affine constraints natively.

Currently f(x) can be a convex or nonlinear function. So far my stress tests use convex functions (least-squares regression and logistic regression), but down the line we will move to nonlinear losses (the DSGD formulation from matrix factorization, or the sigmoid function from neural nets). We call it centralized ADMM since the ADMM loop is still maintained on the master node. Once centralized ADMM works, we will generalize it to a Proximal Server (a parameter server with proximal algorithms).

The parameter server with proximal algorithms is motivated by Mu's comparison of blocked proximal gradient with Google's distributed BFGS code:
http://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf

We have two variants of the ADMM iteration right now (experiments are ongoing). One warm-starts x from the previous ADMM iteration, while the other warm-starts both x and the BFGS history. A new LBFGS API is added that returns the state, to facilitate preserving the BFGS history.
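For orientation, here is a minimal sketch of the centralized ADMM loop described above; it is not the PR's implementation. `prox` stands in for the proximal operator of g, and the solver settings, the fixed rho = 1.0, and all names are illustrative assumptions. The x-update is warm-started at the previous x, as in the first variant.

import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, LBFGS}

def admm(f: DiffFunction[DenseVector[Double]],
         prox: (DenseVector[Double], Double) => DenseVector[Double],
         n: Int, rho: Double = 1.0, maxIters: Int = 100): DenseVector[Double] = {
  val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 50, m = 7)
  var x = DenseVector.zeros[Double](n)
  var z = DenseVector.zeros[Double](n)
  var u = DenseVector.zeros[Double](n) // scaled dual variable
  for (_ <- 0 until maxIters) {
    // x-update: minimize f(x) + (rho/2)||x - z + u||^2 with LBFGS, warm-started at x
    val augmented = new DiffFunction[DenseVector[Double]] {
      def calculate(w: DenseVector[Double]) = {
        val (fw, g) = f.calculate(w)
        val r = w - z + u
        (fw + 0.5 * rho * (r dot r), g + r * rho)
      }
    }
    x = lbfgs.minimize(augmented, x)
    z = prox(x + u, rho) // z-update through the proximal operator of g
    u = u + x - z        // dual update
  }
  z
}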

runMain breeze.optimize.proximal.NonlinearMinimizer 200 0.4 0.99

We generate a random 200x200 data matrix A from a standard Gaussian (mean 0, stddev 1). The label vector is generated from a standard Gaussian as well; if a label > 0.5 we set it to 1.0, else 0.0.

lambda = 0.4, elastic-net beta = 0.99:
lambdaL1 = 0.4 * 0.99
lambdaL2 = 0.4 * 0.01
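A script-style sketch of this data generation in breeze (names assumed; depending on the breeze version, Gaussian may need an implicit RandBasis in scope):

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.stats.distributions.Gaussian

val n = 200
val gauss = Gaussian(0.0, 1.0)
val A = DenseMatrix.rand(n, n, gauss)  // 200x200 standard-normal data matrix
val b = DenseVector.rand(n, gauss).map(l => if (l > 0.5) 1.0 else 0.0) // 0/1 labels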

Linear Regression:

Comparison 1: NonlinearMinimizer (nl) vs QuadraticMinimizer (qp)

NonlinearMinimizer takes f(x) = ||Ax - b||_2^{2} with gradient df/dx = 2A'(Ax - b).
QuadraticMinimizer takes gram = A'A and linear term = A'b.
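A sketch (an assumption, not the PR's code) of how that least-squares cost would be expressed as a breeze DiffFunction for the primal BFGS solve:

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.optimize.DiffFunction

// f(x) = ||Ax - b||_2^2 with gradient 2A'(Ax - b)
def leastSquares(A: DenseMatrix[Double], b: DenseVector[Double]) =
  new DiffFunction[DenseVector[Double]] {
    def calculate(x: DenseVector[Double]) = {
      val r = A * x - b
      (r dot r, (A.t * r) * 2.0)
    }
  }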

Solution quality:
||qp - nl|| norm 0.16799969531758605 max-norm 0.034283192270685436
||bfgs - nl|| norm 0.25715611616302037 max-norm 0.04888455483124021

Objective:

Objective qp -55.00000000000021 nl -54.999725074409454 bfgs -54.99983413525827

nl and bfgs are both using BFGS for the primal solve

Comparison 2: NonlinearMinimizer with g(z)=ProximalL1 (nlSparse/nlSparseHistory) vs OWLQN vs QuadraticMinimizer with g(z)=ProximalL1 (sparseQp)

Runtime:
owlqn 165.178 ms iters 143 sparseQp 32.87 ms iters 59

Solution Quality:
||sparseQp - owlqn|| norm 0.14861916579656498 inf-norm 0.03034052809134663
owlqnObj -47.42068811893944 sparseQpObj -47.45927392747443

nlSparse 1479.666 ms iters 195 nlSparseHistory 407.347 ms iters 429
||owlqn - nlSparse|| norm 0.1460259470192177 inf-norm 0.029597210620735503
||owlqn - nlSparseHistory|| norm 0.14649731508771383 inf-norm 0.030278923124194383
nlSparseObj -47.457604531534805 nlSparseHistoryObj -47.45480418795271

NonlinearMinimizer is ~2X slower than OWLQN, but if you look at the objective it converges to a better solution.

Comparison 3: NonlinearMinimizer with g(z) = ProximalBounds(0 <= z <= 1.0) (nlBox) vs PQN vs QuadraticMinimizer with g(z) = ProximalBounds (qpBox)

qpBox 6.769 ms iters 57
nlBox 10182.424 ms iters 2282 nlBoxHistory 1269.134 ms iters 2271
pqnBox 272.064 ms iters 28

||qpBox - nlBox|| norm 0.0022824265880237215 inf-norm 4.694687018757908E-4
||qpBox - nlBoxHistory|| norm 0.0023190292089652767 inf-norm 4.773077547529439E-4
||qpBox - pqnBox|| norm 0.0021390982826103347 inf-norm 5.634071605518931E-4

qpBoxObj -31.130008148965725 nlBoxObj -31.209892573229023 nlBoxHistoryObj -31.20915524772166 pqnBoxObj -31.127492181959255

Here too we are ~5X slower than PQN, but our solution is better than PQN's.

Logistic Regression:

Comparison 1: elastic-net formulation, OWLQN vs NonlinearMinimizer with g(z)=ProximalL1 (nlLogistic)

Runtime:

owlqn 59.57 ms iters 62 nlLogistic 185.436 ms iters 33

Solution difference:

||owlqn - nlLogistic|| norm 0.41682550361219334 inf-norm 0.10509790307392614

owlqnLogistic objective 30.83266520992968
nlLogistic objective 30.817377544996752

We are ~3X slower than OWLQN, but again the objective is better, so NonlinearMinimizer produces a better solution. Note also that the solutions are quite different; I have looked at both, and both are sparse.

Comparison 2: Logistic Regression + L2 regularization with Bounds using PQN and NonlinearMinimizer with g(z) = ProximalBox (nlBox)

pqnBox 226.606 ms iters 29 nlBox 1283.92 ms iters 263
pqn loss 57.56457339692869
nl loss 57.50013311157119

We are ~5X slower than PQN but produce better solution quality.

Next steps:

  1. Runtime optimization to bring centralized ADMM closer to the projected-gradient algorithms (OWLQN, PQN)
  2. Add a proximal algorithm for the probability simplex 1'x = 1, x >= 0 (centralized ADMM can solve it, but OWLQN/PQN cannot)
  3. Formulate the MLE loss in mllib ALS and solve it using NonlinearMinimizer for the PLSA formulation
  4. Code refactoring; use the Iterator pattern
  5. Test cases showing the objective improvement compared to OWLQN/PQN

I wanted to open it up for early reviews. The QuadraticMinimizer mentioned here is from the following PR:
#321

Contributor Author

@dlwh minEigen works fine, but maxEigen I am hacking. From the BFGS history I am trying to build the CompactHessian to run power iterations, but it fails big time. Any pointers would be really helpful.

@debasish83 (Contributor Author)

I fixed the failure in maxEigen, but I still think it has issues. My hope was that with 10-20 iterations of BFGS we would build a good surface for the function f(x), and then minEigen and maxEigen would give us the bounds for rho as we start the proximal updates. I am using rho = 1.0 right now, but in QuadraticMinimizer, rho tuning with eigenvalues helped improve convergence.

To test this, right now I solve ||Ax - b||_2^{2} using BFGS, and when it converges I run LBFGS.minEigen and LBFGS.maxEigen and compare them against min(eigSym(A'A).eigenValues) and max(eigSym(A'A).eigenValues).

It's not working as I expected.

minEigen 9.606173268621757E-6 LBFGS.minEigen -1.0165119664756253
maxEigen 765.2710923978012 LBFGS.maxEigen 1.0000000000000002

For maxEigen, the power method stops after one iteration, so there are still bugs in the way I construct the CompactHessian from LBFGS.ApproximateInverseHistory.
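For reference, a generic power-iteration sketch for the largest eigenvalue, shown here with an explicit symmetric matrix, whereas the PR applies it to the implicit CompactHessian built from the LBFGS history; the iteration count and names are assumptions:

import breeze.linalg.{DenseMatrix, DenseVector, norm}

def powerIteration(H: DenseMatrix[Double], iters: Int = 100): Double = {
  var v = DenseVector.rand(H.cols)
  v /= norm(v)
  var eig = 0.0
  for (_ <- 0 until iters) {
    val hv = H * v
    eig = v dot hv       // Rayleigh quotient estimate of maxEigen
    v = hv / norm(hv)    // renormalize for the next iteration
  }
  eig
}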

@dlwh (Member) commented Feb 13, 2015

So, first, this is very cool! I think this could be a great addition to Breeze!

A few high level comments. I'll poke around in the source as I get a chance.

  1. Have you tried turning down the tolerance in LBFGS/OWLQN/PQN? I wonder if the differences in quality are from early stopping. If not, would it maybe make sense to warm start from one of these (probably PQN, as it's more general)?
  2. Why can't PQN do simplex projection? E.g. http://arxiv.org/abs/1309.1541 should give the necessary projection operator (a sketch of that projection follows this list).
  3. Low priority, but in general I'm trying to move benchmark methods to the benchmark subproject.
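For reference, a sketch of the sort-based simplex projection from that paper, written from the paper's algorithm rather than taken from this PR's code (names are illustrative):

import breeze.linalg.DenseVector

// Euclidean projection of v onto { x : sum(x) = s, x >= 0 }, O(n log n).
def projectSimplex(v: DenseVector[Double], s: Double = 1.0): DenseVector[Double] = {
  val u = v.toArray.sorted(Ordering[Double].reverse)
  var cumsum = 0.0
  var lambda = 0.0
  for (j <- u.indices) {
    cumsum += u(j)
    val t = (cumsum - s) / (j + 1)
    if (u(j) - t > 0) lambda = -t   // last j where the condition holds wins
  }
  v.map(vi => math.max(vi + lambda, 0.0))
}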

NOTICE Outdated
Member

FWIW, "all rights reserved" has no legal meaning in modern copyright law.

Contributor Author

PQN can do simplex projection, but it does not separate f(x) and g(z). What we really want is to solve f(x) through blocked cyclic coordinate descent using BFGS and satisfy g(z) through a proximal operator; that's our version of a parameter server (a distributed solver). I am still thinking about whether we can replace coordinate descent with distributed BFGS using some tricks. If I use OWLQN or PQN, I have to change code in all of them.

Contributor Author

Also, I looked at the paper; that's the same algorithm I am implementing. It is a proximal algorithm, so maybe we can plug it into PQN as well. I don't know PQN that well.

Contributor Author

I read the PQN paper, and it is possible to put all the proximal operators inside PQN. I have tested bounds with PQN already, and it works well compared to the default ADMM.

If the same holds for ProximalL1() compared to OWLQN, then I am good to use PQN as the core of NonlinearMinimizer. PQN right now accepts a closure of the form DenseVector[Double] => DenseVector[Double]. Would it be fine if I change the signature to (x: DenseVector[Double], rho: Double) => DenseVector[Double]?

This is in tune with generic proximal algorithms:
minimize g(x) + rho||x - v||_{2}^{2}

g(x) can be a constraint here: x \in C
g(x) can be L1 as well, through soft-thresholding I think
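For concreteness, a minimal sketch (an assumption, not this PR's code) of that proposed (x, rho) closure shape for g = lambda * L1, using the common scaling argmin_x g(x) + (rho/2)||x - v||_2^2, which soft-thresholds each coordinate at lambda/rho:

import breeze.linalg.DenseVector

def softThresholdProx(lambda: Double)(v: DenseVector[Double], rho: Double): DenseVector[Double] = {
  val t = lambda / rho  // per-coordinate threshold
  v.map(vi => math.signum(vi) * math.max(math.abs(vi) - t, 0.0))
}

Partially applying softThresholdProx(lambda) _ then yields exactly the proposed (DenseVector[Double], Double) => DenseVector[Double] signature.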

Member

I'm totally open to that, but I'm not quite sure how it will be used. PQN won't manipulate it, right? Why not have your driver specify it in a new instance of PQN?


Member

Why not an interface like:

trait SeparableDiffFunction[T] {
  def apply(x: T): IndexedSeq[(Double, T)]
}

?

Maybe with a method to turn it into a normal DiffFunction, given an OpAdd for T?
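A rough sketch of that adapter, specialized to DenseVector[Double] rather than a generic OpAdd (the trait is the proposal above; the adapter body is an assumption):

import breeze.linalg.DenseVector
import breeze.optimize.DiffFunction

def toDiffFunction(sep: SeparableDiffFunction[DenseVector[Double]]): DiffFunction[DenseVector[Double]] =
  new DiffFunction[DenseVector[Double]] {
    def calculate(x: DenseVector[Double]) = {
      val parts = sep(x)                        // per-block (value, gradient) pairs
      (parts.map(_._1).sum, parts.map(_._2).reduce(_ + _))
    }
  }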

On Sat, Feb 21, 2015, Debasish Das wrote:

@mengxr I added a driver which creates a new instance of PQN and maps proximal operators to projection operators. I can handle L1 and the probability simplex as well; will update the code soon. Note that a variant of PQN can do proximal quasi-Newton, but it's still research-y; several recent papers focus on it: http://arxiv.org/pdf/1206.1623v13

Basically, with PQN as the solver running on each worker, ADMM will do distributed/multi-core consensus on top of it. For separable functions (like logistic) it will look like this:
minimize F1(x1) + F2(x2) + ... + Fn(xn)
s.t. x1 = z, x2 = z, ..., xn = z
Here each Fi is a composite function which satisfies both f(x) and the constraint g(z).

From Spark's perspective, the master only keeps one vector z and all the solver memory is moved to the workers. With 16 GB of RAM we can still do logistic regression with 2 x 10^9 features (2 x 10^9 x 8 bytes = 16 GB), with theoretical guarantees.

For non-separable functions (neural nets) it will be more fun; we will come to those after handling separable functions. The baseline will be block coordinate descent with PQN on each block until we come up with something better (Gauss-Seidel iterations):
http://opt.kyb.tuebingen.mpg.de/papers/opt2013_submission_1.pdf

Of course in breeze we are not allowed to use RDDs, but does it make sense to define an interface for ADMM-based consensus solvers? We can discuss in person as well.



Contributor Author

Let me add the experimental version with RDDs in it, and you can help define the clean interface. I feel we will need different interfaces for separable and non-separable functions.
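As a strawman for that interface discussion, a single-process sketch (no RDDs) of one consensus round for the separable form quoted above; the update schedule and all names are illustrative assumptions, not this PR's API:

import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, LBFGS}

def consensusStep(blocks: IndexedSeq[DiffFunction[DenseVector[Double]]],
                  xs: Array[DenseVector[Double]],
                  us: Array[DenseVector[Double]],
                  z: DenseVector[Double],
                  rho: Double,
                  lbfgs: LBFGS[DenseVector[Double]]): DenseVector[Double] = {
  // x-updates: these are the per-worker solves of Fi's augmented subproblem
  for (i <- blocks.indices) {
    val fi = blocks(i)
    val ui = us(i)
    val aug = new DiffFunction[DenseVector[Double]] {
      def calculate(w: DenseVector[Double]) = {
        val (v, g) = fi.calculate(w)
        val r = w - z + ui
        (v + 0.5 * rho * (r dot r), g + r * rho)
      }
    }
    xs(i) = lbfgs.minimize(aug, xs(i))
  }
  // z-update on the master: average of (xi + ui); a proximal step on g would go here
  val zNew = blocks.indices.map(i => xs(i) + us(i)).reduce(_ + _) / blocks.size.toDouble
  // dual updates, again local to each worker
  for (i <- blocks.indices) us(i) = us(i) + xs(i) - zNew
  zNew
}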

Contributor Author

I added the PQN driver and the projection operators for L1(x) <= c and ProbabilitySimplex. Somehow, in my experiments I am unable to reproduce the results from Figure 1 of the PQN paper. OWLQN is run first, and the L1 norm of its solution generates the constraint L1(x) <= c for PQN.

runMain breeze.optimize.proximal.NonlinearMinimizer 500 0.4 0.99

Elastic Net with L1 and L2 regularization.

Linear Regression:
Issues:

  1. Hit the maximum iteration cap at 500
  2. Worse objective than OWLQN and naive ADMM

owlqn 678.072 ms iters 173
pqnSparseTime 30600.365 ms iters 500
owlqnObj -145.38669700980395 pqnSparseObj -135.15057488775597

Logistic Regression:
Cons:

  1. Runtime higher than OWLQN
  2. Worse objective than OWLQN and naive ADMM

owlqn 187.894 ms iters 74 pqn 758.073 ms iters 28
objective owlqnLogistic 52.32713379333781 pqnLogistic 81.37098398138012

I am debugging the code further, but any pointers would be great. I don't think this is the expected behavior from PQN on L1/ProbabilitySimplex constraints, as per the paper.

Member

Random question: does it seem to be line searching a lot (relative to OWLQN, or to runs with a box constraint)?


Contributor Author

I think my L1 projection has bugs in it. L1(x) is built using ProbabilitySimplex, and PQN is beating naive ADMM and is at par with QuadraticMinimizer for ProbabilitySimplex.

I will take a closer look at the L1 projection for bugs, but for ADMM-based multicore/distributed consensus I will most likely choose OWLQN for elastic net and PQN for the other constraints.

Here are the results with ProbabilitySimplex:

minimize f(x) s.t. 1'x = c, x >= 0, with c = 1

Linear Regression with ProbabilitySimplex, f(x) = ||Ax - b||^2

Objective pqn 85.34613832959954 nl 84.95320179604967 qp 85.33114863196366
Constraint pqn 0.9999999999999997 nl 1.001707509961072 qp 1.000000000000004
time pqn 150.552 ms nl 15847.125 ms qp 96.105 ms

Logistic Regression with ProbabilitySimplex, f(x) = logistic loss from mllib

Objective pqn 257.6058563777358 nl 257.4025971846134
Constraint pqn 0.9999999999999998 nl 1.0007230450802203
time pqn 94.19 ms nl 25160.317 ms

@dlwh (Member) commented Feb 13, 2015

re: maxEigen, I really don't know enough about what's going on to say what's wrong, other than to speculate that the approximation may not be good enough for your purposes? (I didn't write the PQN code.) Have you tried using an EmpiricalHessian?

@debasish83 (Contributor Author)

Is my CompactHessian construction from the LBFGS history correct? I think there are bugs in it.

@debasish83 debasish83 changed the title [WIP] NonlinearMinimizer using ADMM based Proximal Algorithm [WIP] NonlinearMinimizer using ADMM and PQN based Proximal Algorithm Feb 19, 2015
@debasish83 debasish83 changed the title [WIP] NonlinearMinimizer using ADMM and PQN based Proximal Algorithm NonlinearMinimizer using PQN + OWLQN and Projection Operators Feb 28, 2015
@debasish83 (Contributor Author)

@mengxr I refactored the ADMM-based ConsensusMinimizer into mllib.optimization since it needs to do RDD operations. More results on that will be added in the Spark PR. It will use NonlinearMinimizer as the solver that runs on each worker.

NonlinearMinimizer right now supports all the projection operators through PQN, and ProximalL1() through OWLQN. I will send a separate email to John so that he can review the ProjectL1() operators that I added.

Later, the plan is to merge PQN and OWLQN into a proximal quasi-Newton method. I will open up the PR for reviews with @yuekai (author of the Matlab code https://github.com/yuekai/pnopt that I adapted).

Here are some benchmark results comparing breeze.optimize.proximal.NonlinearMinimizer with the breeze.optimize.proximal.QuadraticMinimizer solver for the respective constraints:

runMain breeze.optimize.proximal.NonlinearMinimizer 400 0.4 0.99

Linear Regression with Elastic Net formulation:

owlqn 2699.234 ms iters 825 sparseQp 359.34 ms iters 68
nlSparseTime 3989.946 ms iters 997
owlqnObj -111.47997989802914 sparseQpObj -111.47977512157428 nlSparseObj -111.47997989810649

Linear Regression with Bounds
qpBox 36.74 ms iters 78
pqnBox 2111.375 ms iters 76
qpBoxObj -59.97519548154472 nlBoxObj -59.96772680436779

Logistic Regression with Bounds
Objective pqn 116.83573241482054 time 1325.132 ms

Linear Regression with ProbabilitySimplex
Objective pqn 95.04618308050102 qp 95.03032612382239
Constraint pqn 0.9999999999999994 qp 1.0000000000000084
time pqn 221.27 ms qp 123.771 ms

Logistic Regression with ProbabilitySimplex
Objective pqn 257.6652169115999
Constraint pqn 0.9999999999999998
time pqn 106.671 ms

You can see that since the ranks are low, QuadraticMinimizer's convergence is much faster!

I am adding the test cases; after that, @dlwh, please take a pass at it.

@dlwh (Member) commented Feb 28, 2015

very nice! I'll take a look later this evening


@dlwh (Member) commented Mar 10, 2015

I set it up that way for scale invariance: 2 * f(x) and f(x) should converge to the same solution.

No reference. I just made stuff up.

As I've said before here and elsewhere, happy to take PRs to standardize this.

On Tue, Mar 10, 2015, Debasish Das wrote:

I am still confused by this:

if (norm(adjustedGradient) <= math.max(tolerance * adjustedValue.abs, 1E-8))
  Some(FirstOrderMinimizer.GradientConverged)

I have used convergence criteria like norm(adjustedGradient) <= tolerance * sqrt(problemSize), but I am not clear why we have adjustedValue here. Can you please point me to a reference for this convergence condition?

Both the SPG and PQN papers need norm(adjustedGradient) <= tolerance as a convergence condition, where adjustedGradient is Projection(x - g) - x.

For cases where the projection is the identity, this formula becomes norm(g) <= tolerance, which should be consistent with BFGS as well.

I am not sure about OWLQN, because when I use norm(g) <= tolerance in my testing, OWLQN gets a lot slower. Is it an OWLQN-specific optimization? If so, it should be defined inside OWLQN and not in FirstOrderMinimizer.


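For reference, a sketch of the projected-gradient convergence test the SPG/PQN papers use, as described in the quoted question (names assumed):

import breeze.linalg.{DenseVector, norm}

// Converged when ||P(x - g) - x|| <= tolerance; with P the identity this
// reduces to the plain gradient test ||g|| <= tolerance.
def projectedGradientConverged(x: DenseVector[Double],
                               g: DenseVector[Double],
                               project: DenseVector[Double] => DenseVector[Double],
                               tolerance: Double): Boolean =
  norm(project(x - g) - x) <= tolerance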

…acktracking line search made similar to OWLQN; SPG adjust cleaned based on FirstOrderMinimizer convergence criteria; ADMM based generic proximal solver added in NonlinearMinimizer; SPG/PQN based projection solver drivers added in NonlinearMinimizer; ProximalL1 updated to take lambda in constructor; tests updated
@debasish83 (Contributor Author)

OK, let's go with the current convergence criterion; we can rethink it later. I removed OWLQN from NonlinearMinimizer since we feel ADMM/FISTA will converge at par with OWLQN after tuning.

I added a generic proximal solver using ADMM. It is for cases where projection does not converge fast enough (L1, for example), or where only a proximal operator is defined (huber loss, for example). Right now ADMM is still not at par with OWLQN in runtime, but it's not that bad either.

Some interesting benchmark numbers from the ADMM-based proximal solver:

runMain breeze.optimize.proximal.NonlinearMinimizer 400 0.4 0.99

ElasticNet Formulation

Linear regression

owlqn 7707.722 ms iters 1510 sparseQp 372.883 ms iters 60
nlSparseTime 12894.557 ms iters 1286
nlProxTime 15820.851 ms iters 851
owlqnObj -108.01874244271892 sparseQpObj -108.01852458810231 nlSparseObj -108.01869767843407 nlProxObj -108.01873166329395

owlqn is ~2X faster than nlSparse (which uses SPG by default, L1(x) <= s) and nlProx (which uses L1 soft-thresholding)

Logistic Regression with ProximalL1 and ProjectL1
Objective proximalL1 13.320283231557987 projectL1 13.319635853660154 owlqn 13.319731554801983
time proximalL1 789.501 ms projectL1 1264.812 ms owlqn 200.971 ms

owlqn is ~3X faster than proximalL1 (soft-thresholding) and projectL1 (L1(x) <= s)

Logistic Regression with ProbabilitySimplex

Objective nl 258.1749325226205 admm 258.174932597232
Constraint nl 0.9999999999999996 admm 0.9999999999999805
time nlProjection 416.15 ms nlProx 9275.012 ms

For the probability simplex, projection works much better.

I added a TODO to implement FISTA and compare it against ADMM, since I believe FISTA is currently one of the leading algorithms for smooth f(x) and non-smooth g(z) like L1. I will do that in a later PR. With this, we are covered on most of the mllib needs.
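For the record, a minimal FISTA sketch for that future comparison: smooth f plus nonsmooth g handled through its proximal operator. The fixed step 1/L (L a Lipschitz constant of the gradient of f), the iteration budget, and the names are assumptions:

import breeze.linalg.DenseVector
import breeze.optimize.DiffFunction

def fista(f: DiffFunction[DenseVector[Double]],
          prox: (DenseVector[Double], Double) => DenseVector[Double],
          x0: DenseVector[Double], L: Double, maxIters: Int = 500): DenseVector[Double] = {
  var x = x0
  var y = x0
  var t = 1.0
  for (_ <- 0 until maxIters) {
    val g = f.gradientAt(y)
    val xNext = prox(y - g / L, L)  // proximal gradient step at the extrapolated point
    val tNext = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
    y = xNext + (xNext - x) * ((t - 1.0) / tNext)  // momentum extrapolation
    x = xNext
    t = tNext
  }
  x
}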

We can either merge it now, or wait a few weeks until I am done stress-testing this code through large-rank factorization with a convex loss in Spark.

@dlwh (Member) commented Mar 11, 2015

Great. I'm happy to merge if you could move minEigen out of LBFGS.

@debasish83 (Contributor Author)

I will keep it on my private branch; I am not yet sure how useful they are. I am waiting for a reply from Professor Nocedal on validation and tests for ApproximateInverseHessian/CompactHessian.

In BFGS we are using strong Wolfe line search. I was wondering whether we should add an option for a line search that does not need new f & g calculations, something like what we did for SPG. Thoughts? It would speed up both the mllib classifiers and the ADMM-based proximal code; I doubt we need to be so accurate in the line search for either algorithm variant.

@dlwh (Member) commented Mar 11, 2015

sure, sounds good


@debasish83 (Contributor Author)

Cleaned up minEigen. The line search is tricky; I will open a PR once I understand it more. Basically, I am thinking about some sort of AdaGrad-style approach to come up with a single learning-rate number for the ADMM line search; it will need more experiments on large datasets.

@dlwh (Member) commented Mar 11, 2015

ok, ok to merge?

@debasish83 (Contributor Author)

Can you please take a closer look at the BFGS state manipulation inside NonlinearMinimizer? I am not doing any excessive object allocation there, am I?

@dlwh (Member) commented Mar 12, 2015

Sorry, where?

Also, I'm not quite sure why you're so worried about performance here; presumably function evaluations are going to be the bottleneck by a wide margin?


@debasish83 (Contributor Author)

From BFGS I can only extract the solution, but then users can't see how the inner solve is doing.

@debasish83 (Contributor Author)

I agree with the function-eval point; I am OK.

@dlwh (Member) commented Mar 12, 2015

I see. Why not use lbfgs.minimizeAndReturnState? You can cap the iters for LBFGS in its constructor.

@debasish83 (Contributor Author)

Construct a BFGS solver every ADMM iteration? That is cleaner, but the user won't get any info on the inner solve, which I guess is consistent with PQN. Let me do that.

@debasish83 (Contributor Author)

Initially I wanted to use the BFGS history from ADMM iteration i in iteration i + 1, but that needs more thought.

@dlwh (Member) commented Mar 12, 2015

No. You construct the LBFGS once, but you pass in maxIter = maxInnerIter or whatever, and then you use minimizeAndReturnState.

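Sketched out, the suggestion looks something like this (maxInnerIter and innerSolve are placeholder names): build the LBFGS once with capped inner iterations, then call minimizeAndReturnState each ADMM step so the caller can still inspect the inner solve.

import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, LBFGS}

val maxInnerIter = 50
val lbfgs = new LBFGS[DenseVector[Double]](maxIter = maxInnerIter, m = 7)

def innerSolve(subproblem: DiffFunction[DenseVector[Double]],
               init: DenseVector[Double]): DenseVector[Double] = {
  val state = lbfgs.minimizeAndReturnState(subproblem, init)
  println(s"inner iters ${state.iter} value ${state.value}")  // visibility into the inner solve
  state.x
}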

@dlwh (Member) commented Mar 12, 2015

It would be nice to have that functionality in general, but yeah, we don't really support it yet.


@debasish83 (Contributor Author)

OK, will do that.

@debasish83 (Contributor Author)

Cleaned up the LBFGS inner iters. It would be great if you can push it to snapshots; I would like to use QuadraticMinimizer and NonlinearMinimizer in Spark PRs. Once the codebases are stabilized I will send an email about cutting a release. If you have an optimization page that I can modify, I would like to add proper documentation; a lot of optimization capabilities are added now through these proximal algorithms (direct and iterative), and it would be great if they can be used in other Apache projects as well.

@dlwh (Member) commented Mar 12, 2015

Right now, basically the only documentation for optimize is here: https://github.com/scalanlp/breeze/wiki/Quickstart#breezeoptimize

That needs to be fixed.

dlwh added a commit that referenced this pull request Mar 12, 2015
NonlinearMinimizer using Projection and Proximal Operators
@dlwh dlwh merged commit e9d4b31 into scalanlp:master Mar 12, 2015
@dlwh (Member) commented Mar 12, 2015

Thanks again for contributing this!

@dlwh (Member) commented Mar 12, 2015

(publishing snapshots now. should be available in 10-15 minutes)

@debasish83 (Contributor Author)

How can I update the documentation? We will be working on fixing it over the weekend.

@dlwh (Member) commented Mar 12, 2015

feel free to edit the wiki, add new pages, whatever you think is best.

