[SPARK-9835] [ML] Implement IterativelyReweightedLeastSquares solver #10639
Conversation
Test build #48930 has finished for PR 10639 at commit
I believe the link function here should default to Log not Logit
Yep, it's a typo.
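For reference, the canonical link for each of the common families is summarized below; this is a note on standard GLM conventions rather than code from this PR:

```scala
// Canonical links for the common GLM families. The Poisson family's
// canonical (and usual default) link is Log, which is why Logit above
// was a typo.
val canonicalLink: Map[String, String] = Map(
  "gaussian" -> "identity",
  "binomial" -> "logit",
  "poisson"  -> "log",
  "gamma"    -> "inverse")
```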
@yanboliang Could you post a link to a reference paper? I find documentation on IRLS scattered, so it would be nice to have something concrete to point to.
@sethah Thanks for your comments. The main reference document is http://data.princeton.edu/wws509/notes/
Test build #49000 has finished for PR 10639 at commit
Test build #49016 has finished for PR 10639 at commit
This is the generic form of the z update, but in the Poisson family you have hard coded the update specific to that family. I think we should stick to a single convention, and it makes the most sense to me to use the generic update. That way, we can implement it once in the parent class and don't need to reimplement it in the subclasses.
Additionally, I think it would be better to call this something other than z because it isn't very descriptive. I don't have a strong opinion on what it should be called, but I have seen it called adjusted response in other places (among other names).
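For context, the generic working-response update being suggested here can be written as the sketch below. This is an illustrative snippet, not code from the PR; `linkDeriv` is a hypothetical helper for the link derivative g'(mu).

```scala
// Generic IRLS "adjusted response" (working response):
//   z_i = eta_i + (y_i - mu_i) * g'(mu_i)
// where eta = g(mu) is the linear predictor and g' is the derivative of
// the link function. For the Poisson family with the canonical log link,
// g'(mu) = 1 / mu, so z_i = eta_i + (y_i - mu_i) / mu_i, which is the
// family-specific form that was hard coded.
def adjustedResponse(y: Double, mu: Double, eta: Double,
    linkDeriv: Double => Double): Double = {
  eta + (y - mu) * linkDeriv(mu)
}
```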
Force-pushed from 5b26d2a to d7653d6.
Jenkins, retest this please.
Test build #49125 has finished for PR 10639 at commit
Test build #49126 has finished for PR 10639 at commit
Hard coding the weights like this here won't be correct if anything other than the canonical link function is used, I believe. Since we aren't doing anything to restrict link functions to only the canonical ones, this should probably be defined in terms of the link function's derivative. Statsmodels does it here.
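For context, the generic IRLS weight the comment is asking for can be expressed in terms of the variance function and the link derivative; a minimal sketch with hypothetical `variance` and `linkDeriv` helpers, not code from the PR:

```scala
// Generic IRLS working weight, valid for non-canonical links as well:
//   w_i = 1 / (Var(mu_i) * g'(mu_i)^2)
// With the canonical link the factors partially cancel; e.g. for
// binomial/logit, Var(mu) = mu * (1 - mu) and g'(mu) = 1 / (mu * (1 - mu)),
// giving w = mu * (1 - mu). A hard-coded weight only holds in that case.
def irlsWeight(mu: Double, variance: Double => Double,
    linkDeriv: Double => Double): Double = {
  1.0 / (variance(mu) * math.pow(linkDeriv(mu), 2))
}
```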
I'm still not exactly clear what the scope of this GLM implementation will be (i.e. whether it will be a public API) and I'd be interested to hear more on that topic. But along those lines, some GLM implementations restrict the link functions that can be used with a given family, so I'm wondering if that should be done in Spark as well. I suppose this is only a problem if we expose it as an API. Curious to hear others' thoughts...
Is there a reason to prefer y.reduce(_ + _) / y.count() over y.mean()?
@sethah Thanks for your comments! Generally speaking, a public algorithm should extend from
Test build #49240 has finished for PR 10639 at commit
Is there a reference you can point to for this exact form of the deviance equation (other than the sparkGLM package)? I haven't seen the max function in other places.
This may or may not be related, but how will we guarantee that the endogenous variable does not contain invalid values (negative for Poisson, outside [0, 1] for binomial, etc.)?
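For reference, the Poisson unit deviance usually appears as in the sketch below; the max (or an equivalent guard) typically comes from the convention that y * log(y / mu) is 0 when y = 0. This is an illustration, not the PR's implementation:

```scala
// Poisson unit deviance:
//   d(y, mu) = 2 * (y * log(y / mu) - (y - mu))
// with y * log(y / mu) taken to be 0 when y = 0; implementations often
// express this guard with a max/clamp on y, which is likely the max
// referred to in the question above.
def poissonDeviance(y: Double, mu: Double): Double = {
  val yLogY = if (y > 0.0) y * math.log(y / mu) else 0.0
  2.0 * (yLogY - (y - mu))
}
```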
@yanboliang The approach looks good, nice PR! It will be good to get others' feedback, especially those with an idea of how the GLM package will fit into the larger picture. I think the exact form of the link functions and families will depend on what the plans are for future additions (e.g. what summary stats will we return?), but those can be done in a separate PR.
@yanboliang Thanks for working on this! I'll make a pass soon.
@yanboliang I made a pass on your implementation and read Green's paper "Iteratively Reweighted Least Squares for Maximum Likelihood Estimation, and some Robust and Resistant Alternatives". My main question is whether we should couple IRLS and GLM together. IRLS could be applied to models outside the GLM family. Coupling them together makes it harder to extend and makes some concepts harder to understand, e.g., Family.startingMu. I thought about decoupling the concepts but didn't find a satisfactory solution. One possible approach is to make IRLS accept the initial guess
@mengxr Thanks for your comments! IRLS is not bound to GLM in essence, so it makes sense to decouple them. Based on your suggestion, I propose the following API:

```scala
private[ml] class IterativelyReweightedLeastSquares(
    val initialModel: WeightedLeastSquaresModel,
    val reweightedFunction: (RDD[Instance], WeightedLeastSquaresModel) => RDD[(Double, Double)],
    val fitIntercept: Boolean,
    val regParam: Double,
    val standardizeFeatures: Boolean,
    val standardizeLabel: Boolean,
    val maxIter: Int,
    val tol: Double) extends Logging with Serializable {
  ......
}
```

where
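To illustrate how the proposed decoupling could be used, a family-specific reweighted function might look like the sketch below for a Poisson GLM with the log link. This is hypothetical glue code, not the PR's implementation; in particular it assumes the WLS model exposes a predict(features) helper.

```scala
import org.apache.spark.ml.feature.Instance
import org.apache.spark.ml.optim.WeightedLeastSquaresModel
import org.apache.spark.rdd.RDD

// Hypothetical reweighted function for a Poisson GLM with the log link:
// given the current WLS model, compute (adjusted response, working weight)
// for every instance.
val poissonReweight: (RDD[Instance], WeightedLeastSquaresModel) => RDD[(Double, Double)] =
  (instances, model) => instances.map { instance =>
    val eta = model.predict(instance.features)  // linear predictor (assumes a predict helper)
    val mu = math.exp(eta)                      // inverse of the log link
    val z = eta + (instance.label - mu) / mu    // adjusted response
    val w = mu                                  // IRLS weight for Poisson/log
    (z, w)
  }
```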
@yanboliang The proposal looks okay to me. Some minor comments:
- QR is more stable than the normal equation solver for WLS. We should switch to it in the future. For general Lp regression, there is an interior-point method implemented in R's
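As an aside on the QR point: the stability gain comes from avoiding the normal equations (which square the condition number of the design matrix). A rough sketch of a QR-based WLS solve using Breeze, which Spark ML already depends on; the exact Breeze API may differ between versions:

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

// Sketch: solve weighted least squares min_x || W^(1/2) (A x - b) ||_2 by
// row-scaling and handing the overdetermined system to a least-squares
// solve, instead of forming the normal equations A^T W A x = A^T W b.
def wlsViaQR(a: DenseMatrix[Double], b: DenseVector[Double],
    w: DenseVector[Double]): DenseVector[Double] = {
  val sqrtW = w.map(math.sqrt)
  val aScaled = DenseMatrix.tabulate(a.rows, a.cols)((i, j) => a(i, j) * sqrtW(i))
  val bScaled = DenseVector.tabulate(b.length)(i => b(i) * sqrtW(i))
  // Breeze's \ on a non-square matrix performs a QR-based least-squares
  // solve rather than forming A^T A.
  aScaled \ bScaled
}
```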
R's glm has an argument named offset, but offsetsAndWeights here is private. I hope it won't confuse users; or should we rename it to something better?
It is fine for internal use.
@mengxr Thanks for your comments. For the issue that
Test build #49875 has finished for PR 10639 at commit
Test build #49981 has finished for PR 10639 at commit
= null
@mengxr Thanks for your comments, I will update the other issues soon. Regarding the convergence criterion issue, please see my inline comments; I want to hear your thoughts.
@yanboliang This PR should be ready to go after the comments are addressed. FYI, I assigned SPARK-12811 to you, which is along the line of your work. Thanks for working on GLM features!
Test build #50261 has finished for PR 10639 at commit
LGTM. Merged into master. Thanks!
Implement `IterativelyReweightedLeastSquares` solver for GLM. I consider it a solver rather than an estimator; it is only used internally, so I keep it `private[ml]`.

There are two limitations in the current implementation compared with R:
- It does not support a `Tuple` response for the `Binomial` family.
- It does not support `offset`.

Because `RFormula` does not support `Tuple` as the label or an `offset` keyword, I simplified the implementation. Adding support for these two features is not very hard; I can do it in a follow-up PR if necessary. Meanwhile, we can also add an R-like statistics summary for IRLS. The implementation refers to R, statsmodels and sparkGLM.

Please focus on the main structure and overlook minor issues/docs that I will update later. Any comments and opinions will be appreciated.
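For readers new to IRLS, the core iteration is roughly the sketch below, written against the API proposed earlier in this thread. It is a simplified illustration, not the merged implementation; `instances: RDD[Instance]` and the constructor fields (`initialModel`, `reweightedFunction`, `maxIter`, `tol`, ...) are assumed to be in scope, and `reweightedFunction` is assumed to preserve partitioning (e.g. a map over the instances).

```scala
// Simplified IRLS loop: reweight the data with the current model, refit
// with weighted least squares, and stop when the coefficients stabilize.
var model = initialModel
var converged = false
var iter = 0
while (iter < maxIter && !converged) {
  val oldModel = model
  // Pair each instance with its (adjusted response z, working weight w).
  val newInstances = instances.zip(reweightedFunction(instances, oldModel)).map {
    case (instance, (z, w)) => Instance(z, w, instance.features)
  }
  // Refit a weighted least squares model on the working data.
  model = new WeightedLeastSquares(fitIntercept, regParam,
    standardizeFeatures, standardizeLabel).fit(newInstances)
  // Declare convergence when the largest coefficient change is below tol.
  val deltas = (model.intercept - oldModel.intercept) +:
    (0 until model.coefficients.size).map { i =>
      model.coefficients(i) - oldModel.coefficients(i)
    }
  converged = deltas.map(math.abs).max < tol
  iter += 1
}
```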
cc @mengxr @jkbradley