18 changes: 15 additions & 3 deletions R/pkg/R/mllib.R
@@ -32,6 +32,12 @@ setClass("PipelineModel", representation(model = "jobj"))
#' @param family Error distribution. "gaussian" -> linear regression, "binomial" -> logistic reg.
#' @param lambda Regularization parameter
#' @param alpha Elastic-net mixing parameter (see glmnet's documentation for details)
#' @param standardize Whether to standardize features before training
#' @param solver The solver algorithm used for optimization; this can be "l-bfgs", "normal" or
#' "auto". "l-bfgs" denotes Limited-memory BFGS which is a limited-memory
#' quasi-Newton optimization method. "normal" denotes using Normal Equation as an
#' analytical solution to the linear regression problem. The default value is "auto"
#' which means that the solver algorithm is selected automatically.
#' @return a fitted MLlib model
#' @rdname glm
#' @export
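The difference between the "normal" and "l-bfgs" solvers described above can be sketched in base R, without Spark. This is only an illustration under stated assumptions: `optim(method = "L-BFGS-B")` stands in for MLlib's optimizer, the objective is plain unregularized mean squared error, and the variable names are hypothetical. It is not the code path SparkR takes.

```r
# Design matrix and response from the built-in iris data (base R, no Spark needed).
X <- model.matrix(~ Sepal.Width + Species, data = iris)
y <- iris$Sepal.Length

# "normal": solve the normal equations (X'X) beta = X'y in closed form.
beta_normal <- drop(solve(crossprod(X), crossprod(X, y)))

# "l-bfgs": minimise the mean squared error iteratively with a quasi-Newton method.
# optim()'s L-BFGS-B is used here as a stand-in for MLlib's L-BFGS (an assumption).
mse <- function(beta) mean((y - X %*% beta)^2)
mse_grad <- function(beta) drop(-2 * crossprod(X, y - X %*% beta) / nrow(X))
fit_lbfgs <- optim(par = rep(0, ncol(X)), fn = mse, gr = mse_grad,
                   method = "L-BFGS-B", control = list(maxit = 1000))
beta_lbfgs <- setNames(fit_lbfgs$par, colnames(X))

# Both routes should agree up to the optimiser's tolerance.
print(cbind(normal = beta_normal, lbfgs = beta_lbfgs))
```

The closed form needs a single pass to build `X'X` plus one small linear solve, while the iterative route only needs gradients, which is the usual reason to offer both.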
@@ -79,9 +85,15 @@ setMethod("predict", signature(object = "PipelineModel"),
#'
#' Returns the summary of a model produced by glm(), similarly to R's summary().
#'
#' @param x A fitted MLlib model
#' @return a list with a 'coefficient' component, which is the matrix of coefficients. See
#' summary.glm for more information.
#' @param object A fitted MLlib model
#' @return a list with 'devianceResiduals' and 'coefficients' components for the gaussian family,
#' or a list with a 'coefficients' component for the binomial family. \cr
#' For the gaussian family: 'devianceResiduals' gives the min/max deviance residuals
#' of the estimation, and 'coefficients' gives the estimated coefficients and their
#' estimated standard errors, t values and p-values. (This is only available when the
#' model is fitted by the normal solver.) \cr
#' For the binomial family: 'coefficients' gives the estimated coefficients.
#' See summary.glm for more information. \cr
#' @rdname summary
#' @export
#' @examples
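As a rough illustration of the gaussian summary components documented above, the sketch below rebuilds a list with the same two names using textbook ordinary least squares formulas in base R. It assumes an unregularized fit and uses hypothetical variable names; it is not SparkR's implementation. It also hints at why the standard errors might be tied to the normal solver: they are derived from the inverse of `X'X`, which a closed-form solve has at hand.

```r
# Base R only; iris column names use dots here (SparkR shows them with underscores).
X <- model.matrix(~ Sepal.Width + Species, data = iris)
y <- iris$Sepal.Length

# Closed-form least squares estimate and its residuals.
XtX_inv <- solve(crossprod(X))
beta <- drop(XtX_inv %*% crossprod(X, y))
res <- drop(y - X %*% beta)

# Residual variance, then standard errors, t values and two-sided p-values.
df_resid <- nrow(X) - ncol(X)
sigma2 <- sum(res^2) / df_resid
se <- sqrt(diag(XtX_inv) * sigma2)
tval <- beta / se
pval <- 2 * pt(abs(tval), df = df_resid, lower.tail = FALSE)

# Same component names as the documented return value. For a gaussian family
# with identity link, deviance residuals equal the ordinary residuals.
manual_summary <- list(
  devianceResiduals = c(Min = min(res), Max = max(res)),
  coefficients = cbind(Estimate = beta, "Std. Error" = se,
                       "t value" = tval, "Pr(>|t|)" = pval)
)
print(manual_summary)
```

The result can be checked against `summary(lm(Sepal.Length ~ Sepal.Width + Species, data = iris))` in a plain R session.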
50 changes: 42 additions & 8 deletions docs/sparkr.md
@@ -286,24 +286,37 @@ head(teenagers)

# Machine Learning

SparkR allows the fitting of generalized linear models over DataFrames using the [glm()](api/R/glm.html) function. Under the hood, SparkR uses MLlib to train a model of the specified family. Currently the gaussian and binomial families are supported. We support a subset of the available R formula operators for model fitting, including '~', '.', '+', and '-'. The example below shows the use of building a gaussian GLM model using SparkR.
SparkR allows the fitting of generalized linear models over DataFrames using the [glm()](api/R/glm.html) function. Under the hood, SparkR uses MLlib to train a model of the specified family. Currently the gaussian and binomial families are supported. We support a subset of the available R formula operators for model fitting, including '~', '.', ':', '+', and '-'.

The [summary()](api/R/summary.html) function gives the summary of a model produced by [glm()](api/R/glm.html).

* For a gaussian GLM model, it returns a list with 'devianceResiduals' and 'coefficients' components. The 'devianceResiduals' component gives the min/max deviance residuals of the estimation; the 'coefficients' component gives the estimated coefficients and their estimated standard errors, t values and p-values. (This is only available when the model is fitted by the normal solver.)
* For a binomial GLM model, it returns a list with a 'coefficients' component, which gives the estimated coefficients.

The examples below show how to build a gaussian GLM model and a binomial GLM model using SparkR.

## Gaussian GLM model

<div data-lang="r" markdown="1">
{% highlight r %}
# Create the DataFrame
df <- createDataFrame(sqlContext, iris)

# Fit a linear model over the dataset.
# Fit a gaussian GLM model over the dataset.
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")

# Model coefficients are returned in a similar format to R's native glm().
# The model summary is returned in a similar format to R's native glm().
summary(model)
##$devianceResiduals
## Min Max
## -1.307112 1.412532
##
##$coefficients
## Estimate
##(Intercept) 2.2513930
##Sepal_Width 0.8035609
##Species_versicolor 1.4587432
##Species_virginica 1.9468169
## Estimate Std. Error t value Pr(>|t|)
##(Intercept) 2.251393 0.3697543 6.08889 9.568102e-09
##Sepal_Width 0.8035609 0.106339 7.556598 4.187317e-12
##Species_versicolor 1.458743 0.1121079 13.01195 0
##Species_virginica 1.946817 0.100015 19.46525 0

# Make predictions based on the model.
predictions <- predict(model, newData = df)
@@ -317,3 +330,24 @@ head(select(predictions, "Sepal_Length", "prediction"))
##6 5.4 5.385281
{% endhighlight %}
</div>

## Binomial GLM model

<div data-lang="r" markdown="1">
{% highlight r %}
# Create the DataFrame
df <- createDataFrame(sqlContext, iris)
training <- filter(df, df$Species != "setosa")

# Fit a binomial GLM model over the dataset.
model <- glm(Species ~ Sepal_Length + Sepal_Width, data = training, family = "binomial")

# Model coefficients are returned in a similar format to R's native glm().
summary(model)
##$coefficients
## Estimate
##(Intercept) -13.046005
##Sepal_Length 1.902373
##Sepal_Width 0.404655
{% endhighlight %}
</div>
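For comparison with the SparkR binomial example above, a native R fit on the same subset might look like the sketch below. It runs in plain R without Spark. The `droplevels()` call is an assumption needed on the native side so that the response has exactly two levels, and the columns keep their dotted names (`Sepal.Length`) rather than the underscored names SparkR uses.

```r
# Keep only the two non-setosa species and drop the now-empty factor level,
# so that glm() sees a genuine two-class response.
training <- droplevels(subset(iris, Species != "setosa"))

# Native logistic regression: the first factor level (versicolor) is treated
# as "failure" and the second (virginica) as "success".
native <- glm(Species ~ Sepal.Length + Sepal.Width, data = training, family = binomial)

# Print only the Estimate column, mirroring the shape of the SparkR output.
print(coef(summary(native))[, "Estimate", drop = FALSE])
```

The printed estimates can then be compared by eye with the `$coefficients` block returned by SparkR's `summary(model)`.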
@@ -144,6 +144,9 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
/**
* Set the solver algorithm used for optimization.
* In case of linear regression, this can be "l-bfgs", "normal" and "auto".
* "l-bfgs" denotes Limited-memory BFGS which is a limited-memory quasi-Newton
* optimization method. "normal" denotes using Normal Equation as an analytical
* solution to the linear regression problem.
* The default value is "auto" which means that the solver algorithm is
* selected automatically.
* @group setParam
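For reference, the "Normal Equation" mentioned in the doc comment above is usually written as follows. The symbols are introduced here for illustration and do not appear in the diff: $X$ is the $n \times d$ design matrix, $y$ the response vector, and the problem is assumed to be unregularized with $X^\top X$ invertible.

```latex
\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2
\quad\Longrightarrow\quad
X^\top X \, \hat{\beta} = X^\top y
\quad\Longrightarrow\quad
\hat{\beta} = (X^\top X)^{-1} X^\top y
```

Forming $X^\top X$ costs $O(n d^2)$ and solving the resulting system costs $O(d^3)$, so the closed form is attractive when the number of features $d$ is small. L-BFGS instead iterates using gradient evaluations and never forms the $d \times d$ system explicitly.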