[SPARK-9835] [ML] Implement IterativelyReweightedLeastSquares solver #10639
Conversation
Test build #48930 has finished for PR 10639 at commit
I believe the link function here should default to Log not Logit
Yep, it's a typo.
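For reference, the canonical link for each of the common families is summarized below; this is a note on standard GLM conventions rather than code from this PR:

```scala
// Canonical links for the common GLM families. The Poisson family's
// canonical (and usual default) link is Log, which is why Logit above
// was a typo.
val canonicalLink: Map[String, String] = Map(
  "gaussian" -> "identity",
  "binomial" -> "logit",
  "poisson"  -> "log",
  "gamma"    -> "inverse")
```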
@yanboliang Could you post a link to a reference paper? I find documentation on IRLS scattered, so it would be nice to have something concrete to point to.
@sethah Thanks for your comments. The main reference document is http://data.princeton.edu/wws509/notes/
Test build #49000 has finished for PR 10639 at commit
Test build #49016 has finished for PR 10639 at commit
This is the generic form of the z update, but in the Poisson family you have hard coded the update specific to that family. I think we should stick to a single convention, and it makes the most sense to me to use the generic update. That way, we can implement it once in the parent class and don't need to reimplement it in the subclasses.
Additionally, I think it would be better to call this something other than z because it isn't very descriptive. I don't have a strong opinion on what it should be called, but I have seen it called adjusted response in other places (among other names).
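For context, the generic working-response update being suggested here can be written as the sketch below. This is an illustrative snippet, not code from the PR; `linkDeriv` is a hypothetical helper for the link derivative g'(mu).

```scala
// Generic IRLS "adjusted response" (working response):
//   z_i = eta_i + (y_i - mu_i) * g'(mu_i)
// where eta = g(mu) is the linear predictor and g' is the derivative of
// the link function. For the Poisson family with the canonical log link,
// g'(mu) = 1 / mu, so z_i = eta_i + (y_i - mu_i) / mu_i, which is the
// family-specific form that was hard coded.
def adjustedResponse(y: Double, mu: Double, eta: Double,
    linkDeriv: Double => Double): Double = {
  eta + (y - mu) * linkDeriv(mu)
}
```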
Force-pushed from 5b26d2a to d7653d6.
Jenkins, retest this please.
Test build #49125 has finished for PR 10639 at commit
Test build #49126 has finished for PR 10639 at commit
Hard coding the weights like this here won't be correct if anything other than the canonical link function is used, I believe. Since we aren't doing anything to restrict link functions to only the canonical ones, this should probably be defined in terms of the link function's derivative. Statsmodels does it here.
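For context, the generic IRLS weight the comment is asking for can be expressed in terms of the variance function and the link derivative; a minimal sketch with hypothetical `variance` and `linkDeriv` helpers, not code from the PR:

```scala
// Generic IRLS working weight, valid for non-canonical links as well:
//   w_i = 1 / (Var(mu_i) * g'(mu_i)^2)
// With the canonical link the factors partially cancel; e.g. for
// binomial/logit, Var(mu) = mu * (1 - mu) and g'(mu) = 1 / (mu * (1 - mu)),
// giving w = mu * (1 - mu). A hard-coded weight only holds in that case.
def irlsWeight(mu: Double, variance: Double => Double,
    linkDeriv: Double => Double): Double = {
  1.0 / (variance(mu) * math.pow(linkDeriv(mu), 2))
}
```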
I'm still not exactly clear what the scope of this GLM implementation will be (i.e. whether it will be a public API) and I'd be interested to hear more on that topic. But along those lines, some GLM implementations restrict the link functions that can be used with a given family, so I'm wondering if that should be done in Spark as well. I suppose this is only a problem if we expose it as an API. Curious to hear others' thoughts...
Is there a reason to prefer y.reduce(_ + _) / y.count() over y.mean()?
@sethah Thanks for your comments! Generally speaking, a public algorithm should extend from
Test build #49240 has finished for PR 10639 at commit
Is there a reference you can point to for this exact form of the deviance equation (other than the sparkGLM package)? I haven't seen the max function in other places.
This may or may not be related, but how will we guarantee that the endogenous variable does not contain invalid values (negative for Poisson, outside [0, 1] for binomial, etc.)?
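For reference, the Poisson unit deviance usually appears as in the sketch below; the max (or an equivalent guard) typically comes from the convention that y * log(y / mu) is 0 when y = 0. This is an illustration, not the PR's implementation:

```scala
// Poisson unit deviance:
//   d(y, mu) = 2 * (y * log(y / mu) - (y - mu))
// with y * log(y / mu) taken to be 0 when y = 0; implementations often
// express this guard with a max/clamp on y, which is likely the max
// referred to in the question above.
def poissonDeviance(y: Double, mu: Double): Double = {
  val yLogY = if (y > 0.0) y * math.log(y / mu) else 0.0
  2.0 * (yLogY - (y - mu))
}
```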
@yanboliang The approach looks good, nice PR! It will be good to get others' feedback, especially those with an idea of how the GLM package will fit into the larger picture. I think the exact form of the link functions and families will depend on what the plans are for future additions (e.g. what summary stats will we return?), but those can be done in a separate PR.
@yanboliang Thanks for working on this! I'll make a pass soon.
@yanboliang I made a pass on your implementation and read Green's paper "Iteratively Reweighted Least Squares for Maximum Likelihood Estimation, and some Robust and Resistant Alternatives". My main question is whether we should couple IRLS and GLM together. IRLS could be applied to models outside the GLM family. Coupling them together makes it harder to extend and makes some concepts harder to understand, e.g., Family.startingMu. I thought about decoupling the concepts but didn't find a satisfactory solution. One possible approach is to make IRLS accept the initial guess
@mengxr Thanks for your comments! IRLS is not bound to GLM in essence, so it makes sense to decouple them. Based on your suggestion, I propose the following API:

```scala
private[ml] class IterativelyReweightedLeastSquares(
    val initialModel: WeightedLeastSquaresModel,
    val reweightedFunction: (RDD[Instance], WeightedLeastSquaresModel) => RDD[(Double, Double)],
    val fitIntercept: Boolean,
    val regParam: Double,
    val standardizeFeatures: Boolean,
    val standardizeLabel: Boolean,
    val maxIter: Int,
    val tol: Double) extends Logging with Serializable {
  ......
}
```

where
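To illustrate how the proposed decoupling could be used, a family-specific reweighted function might look like the sketch below for a Poisson GLM with the log link. This is hypothetical glue code, not the PR's implementation; in particular it assumes the WLS model exposes a predict(features) helper.

```scala
import org.apache.spark.ml.feature.Instance
import org.apache.spark.ml.optim.WeightedLeastSquaresModel
import org.apache.spark.rdd.RDD

// Hypothetical reweighted function for a Poisson GLM with the log link:
// given the current WLS model, compute (adjusted response, working weight)
// for every instance.
val poissonReweight: (RDD[Instance], WeightedLeastSquaresModel) => RDD[(Double, Double)] =
  (instances, model) => instances.map { instance =>
    val eta = model.predict(instance.features)  // linear predictor (assumes a predict helper)
    val mu = math.exp(eta)                      // inverse of the log link
    val z = eta + (instance.label - mu) / mu    // adjusted response
    val w = mu                                  // IRLS weight for Poisson/log
    (z, w)
  }
```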
@yanboliang The proposal looks okay to me. Some minor comments:
- QR is more stable than the normal equation solver for WLS. We should switch to it in the future. For general Lp regression, there is an interior-point method implemented in R's
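As an aside on the QR point: the stability gain comes from avoiding the normal equations (which square the condition number of the design matrix). A rough sketch of a QR-based WLS solve using Breeze, which Spark ML already depends on; the exact Breeze API may differ between versions:

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

// Sketch: solve weighted least squares min_x || W^(1/2) (A x - b) ||_2 by
// row-scaling and handing the overdetermined system to a least-squares
// solve, instead of forming the normal equations A^T W A x = A^T W b.
def wlsViaQR(a: DenseMatrix[Double], b: DenseVector[Double],
    w: DenseVector[Double]): DenseVector[Double] = {
  val sqrtW = w.map(math.sqrt)
  val aScaled = DenseMatrix.tabulate(a.rows, a.cols)((i, j) => a(i, j) * sqrtW(i))
  val bScaled = DenseVector.tabulate(b.length)(i => b(i) * sqrtW(i))
  // Breeze's \ on a non-square matrix performs a QR-based least-squares
  // solve rather than forming A^T A.
  aScaled \ bScaled
}
```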
R's glm has an argument named offset, but offsetsAndWeights here is private. I hope it won't confuse users; or should we rename it to something better?
It is fine for internal use.
@mengxr Thanks for your comments. For the issue that
Test build #49875 has finished for PR 10639 at commit
Test build #49981 has finished for PR 10639 at commit
= null
@mengxr Thanks for your comments, I will update the other issues soon. Regarding the convergence criterion issue, please see my inline comments; I want to hear your thoughts.
@yanboliang This PR should be ready to go after the comments are addressed. FYI, I assigned SPARK-12811 to you, which is along the line of your work. Thanks for working on GLM features!
Test build #50261 has finished for PR 10639 at commit
LGTM. Merged into master. Thanks!
Implement `IterativelyReweightedLeastSquares` solver for GLM. I consider it a solver rather than an estimator; it is only used internally, so I keep it `private[ml]`.

There are two limitations in the current implementation compared with R:
- It does not support a `Tuple` response for the `Binomial` family.
- It does not support `offset`.

Because `RFormula` does not support `Tuple` as the label or an `offset` keyword, I simplified the implementation. Adding support for these two features is not very hard; I can do it in a follow-up PR if necessary. Meanwhile, we can also add an R-like statistics summary for IRLS. The implementation refers to R, statsmodels and sparkGLM.

Please focus on the main structure and overlook minor issues/docs that I will update later. Any comments and opinions will be appreciated.
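For readers new to IRLS, the core iteration is roughly the sketch below, written against the API proposed earlier in this thread. It is a simplified illustration, not the merged implementation; `instances: RDD[Instance]` and the constructor fields (`initialModel`, `reweightedFunction`, `maxIter`, `tol`, ...) are assumed to be in scope, and `reweightedFunction` is assumed to preserve partitioning (e.g. a map over the instances).

```scala
// Simplified IRLS loop: reweight the data with the current model, refit
// with weighted least squares, and stop when the coefficients stabilize.
var model = initialModel
var converged = false
var iter = 0
while (iter < maxIter && !converged) {
  val oldModel = model
  // Pair each instance with its (adjusted response z, working weight w).
  val newInstances = instances.zip(reweightedFunction(instances, oldModel)).map {
    case (instance, (z, w)) => Instance(z, w, instance.features)
  }
  // Refit a weighted least squares model on the working data.
  model = new WeightedLeastSquares(fitIntercept, regParam,
    standardizeFeatures, standardizeLabel).fit(newInstances)
  // Declare convergence when the largest coefficient change is below tol.
  val deltas = (model.intercept - oldModel.intercept) +:
    (0 until model.coefficients.size).map { i =>
      model.coefficients(i) - oldModel.coefficients(i)
    }
  converged = deltas.map(math.abs).max < tol
  iter += 1
}
```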
cc @mengxr @jkbradley