Skip to content

Conversation

@yanboliang
Copy link
Contributor

Implement IterativelyReweightedLeastSquares solver for GLM. I consider it as a solver rather than estimator, it only used internal so I keep it private[ml].
There are two limitations in the current implementation compared with R:

  • It can not support Tuple as response for Binomial family, such as the following code:
glm( cbind(using, notUsing) ~  age + education + wantsMore , family = binomial) 
  • It does not support offset.

Because I considered that RFormula did not support Tuple as label and offset keyword, so I simplified the implementation. But to add support for these two functions is not very hard, I can do it in follow-up PR if it is necessary. Meanwhile, we can also add R-like statistic summary for IRLS.
The implementation refers R, statsmodels and sparkGLM.
Please focus on the main structure and overpass minor issues/docs that I will update later. Any comments and opinions will be appreciated.

cc @mengxr @jkbradley

@yanboliang yanboliang changed the title [SPARK-9835] [ML] Initial draft of IterativelyReweightedLeastSquares [SPARK-9835] [ML] IterativelyReweightedLeastSquares for GLM Jan 7, 2016
@SparkQA
Copy link

SparkQA commented Jan 7, 2016

Test build #48930 has finished for PR 10639 at commit 86492bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the link function here should default to Log not Logit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, it's typo.

@sethah
Copy link
Contributor

sethah commented Jan 8, 2016

@yanboliang Could you post a link to a reference paper? I find documentation on IRLS scattered, so it would be nice to have something concrete to point to.

@yanboliang
Copy link
Contributor Author

@sethah Thanks for your comments. The mainly reference document is http://data.princeton.edu/wws509/notes/

@SparkQA
Copy link

SparkQA commented Jan 8, 2016

Test build #49000 has finished for PR 10639 at commit 6d809d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 8, 2016

Test build #49016 has finished for PR 10639 at commit 5b26d2a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the generic form for the z update, but in the Poisson family you have hard coded the update specific to the family. I think we should stick to a single convention, and it makes most sense to me to use the generic update. That way, we can implement in the parent class and we don't need to implement in the base classes.

Additionally, I think it would be better to call this something other than z because it isn't very descriptive. I don't have a strong opinion on what it should be called, but I have seen it called adjusted response in other places (among other names).

@JoshRosen
Copy link
Contributor

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Jan 11, 2016

Test build #49125 has finished for PR 10639 at commit 65133ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 11, 2016

Test build #49126 has finished for PR 10639 at commit 65133ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard coding the weights like this here won't be correct if anything other than the canonical link function is used, I believe. Since we aren't doing anything to restrict link functions to only the canonical ones, this should probably defined in terms of the link function's derivative. Statsmodels does it here.

@sethah
Copy link
Contributor

sethah commented Jan 11, 2016

I'm still not exactly clear what the scope of this GLM implementation will be (i.e. if it will be a public API) and I'd be interested to hear more on that topic. But along those lines, some GLM implementations restrict the link functions that can be used with a given family, so I'm wondering if that should be done in Spark as well. I suppose this is only a problem if we expose it as an API. Curious to hear others' thoughts...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to prefer y.reduce(_ + _) / y.count() over y.mean() ?

@yanboliang
Copy link
Contributor Author

@sethah Thanks for your comments! Generally speaking, a public algorithm should extend from Estimator in Spark ML. This is not a public GLM API but a resolver like WeightedLeastSquares. It will be used to resolve GLM and other algorithms such as LogisticRegression, so I marked all the classes as private[ml].
Just like what you said, some GLM implementations restrict the link functions that can be used with a given family, I think we should keep the restriction in Spark as well and this should further refinement when we implement public API for GLM.

@SparkQA
Copy link

SparkQA commented Jan 12, 2016

Test build #49240 has finished for PR 10639 at commit 446b1cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reference you can point to for this exact form of the deviance equation (other than SparkGLM package)? I haven't seen the max function in other places.

This may or may not be related, but how will we guarantee that the endogenous variable does not contain invalid values (negative for poisson, outside [0, 1] for binomial, etc...?

@sethah
Copy link
Contributor

sethah commented Jan 13, 2016

@yanboliang The approach looks good, nice PR! It will be good to get others feedback, especially those with an idea of how the GLM package will fit into the larger picture. I think the exact form of the link functions and families will depend on what the plans are for future additions (e.g. what summary stats will we return?) but those can be done in a separate PR.

@mengxr
Copy link
Contributor

mengxr commented Jan 15, 2016

@yanboliang Thanks for working on this! I'll make a pass soon.

@mengxr
Copy link
Contributor

mengxr commented Jan 18, 2016

@yanboliang I made a pass on your implementation and read Green's paper "Iteratively Reweighted Least Squares for Maximum Likelihood Estimation, and some Robust and Resistant Alternatives". My main question is whether we should couple IRLS and GLM together. IRLS could be applied to models outside GLM family. Coupling them together makes it harder to extend and some concepts harder to understand, e.g., "Family.startingMu". I thought about decoupling the concepts but didn't find a satisfactory solution. One possible approach is to make IRLS accept the initial guess x0 and a reweighting function that updates weights and offsets at each iteration. Then we can wrap GLM family and link functions to use that interface. It would be great if the IRLS implementation could be extended to handle: GLMs, Lp regression, LASSO, and maybe some non-convex loss functions. If this sounds good to you, we can think more about the refactoring.

@yanboliang
Copy link
Contributor Author

@mengxr Thanks for your comments! IRLS is not bound with GLM in essence, so it's make sense to decouple them. Based on your prompt, I propose the following API for IterativelyReweightedLeastSquares:

private[ml] class IterativelyReweightedLeastSquares(
    val initialModel: WeightedLeastSquaresModel,
    val reweightedFunction: (RDD[Instance], WeightedLeastSquaresModel) => RDD[(Double, Double)],
    val fitIntercept: Boolean,
    val regParam: Double,
    val standardizeFeatures: Boolean,
    val standardizeLabel: Boolean,
    val maxIter: Int,
    val tol: Double) extends Logging with Serializable {
......
}

where initialModel is the initial guess, reweightedFunction is used to update y/z(adjusted response variable) and weights. The terminate condition of iteration is delta model not great than tolerance.
This framework can fit GLMs, Lp regression and Lasso. Further more, we can also add variables for statistic summary. If I have some misunderstand, please correct me. Thanks! I will updated this PR soon.
BTW, Do you know which package is the most authoritative one for Lp regression? I found pracma:::l1linreg and L1pack:::l1fit, but they use qr.solve() to solve WLS equation and produce slightly different result compared with ML WeightedLeastSquares.

@mengxr
Copy link
Contributor

mengxr commented Jan 20, 2016

@yanboliang The proposal looks okay to me. Some minor comments:

  1. WeightedLeastSquares model contains extra content that we don't need in IRLS iterations, e.g., diagInvAWA. This could be a follow-up work because the API is currently private.
  2. reweightedFunction -> reweightFunc. Does it need to operate on RDDs? We can simplify the API to (Instance, Model) => (Double, Double).
  3. standardizeFeatures and standardizeLabel only affect the regularization term. We don't need it to match glm in R. So let's remove them from the API. Turning them on would change the convergence of IRLS algorithm, which needs further discussion.

QR is more stable than the normal equation solver for WLS. We should switch to it in the future. For general Lp regression, there is an interior-point method implemented in R's quantreg package.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R glm has argument named offset, but offsetsAndWeights is private. I hope it won't confuse users, or should we rename to other better one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is fine for internal use.

@yanboliang
Copy link
Contributor Author

@mengxr Thanks for your comments. For the issue that WeightedLeastSquares contains extra content such as diagInvAWA, it will be used to generate statistic summary of IRLS or GLM. We can discuss them in the follow-up work.

@SparkQA
Copy link

SparkQA commented Jan 21, 2016

Test build #49875 has finished for PR 10639 at commit 2191d2a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang yanboliang changed the title [SPARK-9835] [ML] IterativelyReweightedLeastSquares for GLM [SPARK-9835] [ML] IterativelyReweightedLeastSquares optimization Jan 25, 2016
@yanboliang yanboliang changed the title [SPARK-9835] [ML] IterativelyReweightedLeastSquares optimization [SPARK-9835] [ML] Implement IterativelyReweightedLeastSquares solver Jan 25, 2016
@SparkQA
Copy link

SparkQA commented Jan 25, 2016

Test build #49981 has finished for PR 10639 at commit 6074fd6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

= null

@yanboliang
Copy link
Contributor Author

@mengxr Thanks for your comments, I will update other issues soon. To the convergence criterion issue, please see my inline comments and I want to here your thoughts.

@mengxr
Copy link
Contributor

mengxr commented Jan 27, 2016

@yanboliang This PR should be ready to go after the comments are addressed. FYI, I assigned SPARK-12811 to you, which is along the line of your work. Thanks for working on GLM features!

@SparkQA
Copy link

SparkQA commented Jan 28, 2016

Test build #50261 has finished for PR 10639 at commit cb2057e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Jan 28, 2016

LGTM. Merged into master. Thanks!

@asfgit asfgit closed this in df78a93 Jan 28, 2016
@yanboliang yanboliang deleted the spark-9835 branch January 29, 2016 03:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants