Closed
149 commits
8666433
[SPARK-17528][SQL][FOLLOWUP] remove unnecessary data copy in object h…
cloud-fan Jul 24, 2017
b09ec92
[SPARK-21502][MESOS] fix --supervise for mesos in cluster mode
skonto Jul 24, 2017
7f29505
[SPARK-21516][SQL][TEST] Overriding afterEach() in DatasetCacheSuite …
kiszk Jul 25, 2017
4f77c06
[SPARK-20855][Docs][DStream] Update the Spark kinesis docs to use the…
yashs360 Jul 25, 2017
996a809
[SPARK-21498][EXAMPLES] quick start -> one py demo have some bug in code
lizhaoch Jul 25, 2017
799e131
[SPARK-21175] Reject OpenBlocks when memory shortage on shuffle service.
Jul 25, 2017
8de080d
[SPARK-21383][YARN] Fix the YarnAllocator allocates more Resource
Jul 25, 2017
06a9793
[SPARK-21447][WEB UI] Spark history server fails to render compressed
Jul 25, 2017
9b4da7b
[SPARK-21491][GRAPHX] Enhance GraphX performance: breakOut instead of…
SereneAnt Jul 25, 2017
ebc24a9
[SPARK-20586][SQL] Add deterministic to ScalaUDF
gatorsmile Jul 26, 2017
300807c
[SPARK-21494][NETWORK] Use correct app id when authenticating to exte…
Jul 26, 2017
1661263
[SPARK-21517][CORE] Avoid copying memory when transfer chunks remotely
caneMi Jul 26, 2017
ae4ea5f
[SPARK-21524][ML] unit test fix: ValidatorParamsSuiteHelpers generate…
YY-OnCall Jul 26, 2017
cf29828
[SPARK-20988][ML] Logistic regression uses aggregator hierarchy
sethah Jul 26, 2017
60472db
[SPARK-21485][SQL][DOCS] Spark SQL documentation generation for built…
HyukjinKwon Jul 26, 2017
cfb25b2
[SPARK-21530] Update description of spark.shuffle.maxChunksBeingTrans…
Jul 27, 2017
ebbe589
[SPARK-21271][SQL] Ensure Unsafe.sizeInBytes is a multiple of 8
kiszk Jul 27, 2017
2ff35a0
[SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and add ArrayTyp…
ueshin Jul 27, 2017
ddcd2e8
[SPARK-19270][ML] Add summary table to GLM summary
actuaryzhang Jul 27, 2017
9f5647d
[SPARK-21319][SQL] Fix memory leak in sorter
cloud-fan Jul 27, 2017
f44ead8
[SPARK-21538][SQL] Attribute resolution inconsistency in the Dataset API
Jul 27, 2017
a5a3189
[SPARK-21306][ML] OneVsRest should support setWeightCol
facaiy Jul 28, 2017
63d168c
[MINOR][BUILD] Fix current lint-java failures
srowen Jul 28, 2017
7846809
[SPARK-21553][SPARK SHELL] Add the description of the default value o…
Jul 28, 2017
69ab0e4
[SPARK-21541][YARN] Spark Logs show incorrect job status for a job th…
Jul 28, 2017
0ef9fe6
Typo in comment
nahoj Jul 28, 2017
b56f79c
[SPARK-20090][PYTHON] Add StructType.fieldNames in PySpark
HyukjinKwon Jul 29, 2017
c143820
[SPARK-21508][DOC] Fix example code provided in Spark Streaming Docum…
Jul 29, 2017
60e9b2b
[SPARK-21357][DSTREAMS] FileInputDStream not remove out of date RDD
shaofei007 Jul 29, 2017
9c8109e
[SPARK-21555][SQL] RuntimeReplaceable should be compared semantically…
viirya Jul 29, 2017
92d8563
[SPARK-19451][SQL] rangeBetween method should accept Long value as bo…
jiangxb1987 Jul 29, 2017
6550086
[SPARK-20962][SQL] Support subquery column aliases in FROM clause
maropu Jul 29, 2017
51f99fb
[SQL] Fix typo in DataframeWriter doc
Jul 30, 2017
d79816d
[SPARK-21297][WEB-UI] Add count in 'JDBC/ODBC Server' page.
Jul 30, 2017
6830e90
[MINOR][DOC] Replace numTasks with numPartitions in programming guide
polarker Jul 30, 2017
f1a798b
[MINOR] Minor comment fixes in merge_spark_pr.py script
HyukjinKwon Jul 31, 2017
44e501a
[SPARK-19839][CORE] release longArray in BytesToBytesMap
Jul 31, 2017
106eaa9
[SPARK-21575][SPARKR] Eliminate needless synchronization in java-R se…
SereneAnt Jul 31, 2017
6b186c9
[SPARK-18950][SQL] Report conflicting fields when merging two StructT…
jiayue-zhang Aug 1, 2017
9570e81
[SPARK-21381][SPARKR] SparkR: pass on setHandleInvalid for classifica…
wangmiao1981 Aug 1, 2017
110695d
[SPARK-21589][SQL][DOC] Add documents about Hive UDF/UDTF/UDAF
maropu Aug 1, 2017
5fd0294
[SPARK-21475][CORE] Use NIO's Files API to replace FileInputStream/Fi…
jerryshao Aug 1, 2017
253a07e
[SPARK-21388][ML][PYSPARK] GBTs inherit from HasStepSize & LInearSVC …
zhengruifeng Aug 1, 2017
97ccc63
[SPARK-21585] Application Master marking application status as Failed…
Aug 1, 2017
b133501
[SPARK-21522][CORE] Fix flakiness in LauncherServerSuite.
Aug 1, 2017
6735433
[SPARK-20079][YARN] Fix client AM not allocating executors after rest…
Aug 1, 2017
74cda94
[SPARK-21592][BUILD] Skip maven-compiler-plugin main and test compila…
gslowikowski Aug 1, 2017
b1d59e6
[SPARK-21593][DOCS] Fix 2 rendering errors on configuration page
srowen Aug 1, 2017
58da1a2
[SPARK-21339][CORE] spark-shell --packages option does not add jars t…
Aug 1, 2017
77cc0d6
[SPARK-12717][PYTHON] Adding thread-safe broadcast pickle registry
BryanCutler Aug 1, 2017
4cc704b
[CORE][MINOR] Improve the error message of checkpoint RDD verification
gatorsmile Aug 2, 2017
14e7575
[SPARK-21578][CORE] Add JavaSparkContextSuite
dongjoon-hyun Aug 2, 2017
845c039
[SPARK-20601][ML] Python API for Constrained Logistic Regression
zero323 Aug 2, 2017
7f63e85
[SPARK-21597][SS] Fix a potential overflow issue in EventTimeStats
zsxwing Aug 2, 2017
9456176
[SPARK-21490][CORE] Make sure SparkLauncher redirects needed streams.
Aug 2, 2017
0d26b3a
[SPARK-21546][SS] dropDuplicates should ignore watermark when it's no…
zsxwing Aug 2, 2017
7c206dd
[SPARK-21615][ML][MLLIB][DOCS] Fix broken redirect in collaborative f…
Aug 3, 2017
f13dbb3
[SPARK-21604][SQL] if the object extends Logging, i suggest to remove…
Aug 3, 2017
3221470
[SPARK-21611][SQL] Error class name for log in several classes.
Aug 3, 2017
e7c59b4
[SPARK-21605][BUILD] Let IntelliJ IDEA correctly detect Language leve…
baibaichen Aug 3, 2017
97ba491
[SPARK-21602][R] Add map_keys and map_values functions to R
HyukjinKwon Aug 3, 2017
13785da
[SPARK-21599][SQL] Collecting column statistics for datasource tables…
dilipbiswal Aug 3, 2017
bb7afb4
[SPARK-20713][SPARK CORE] Convert CommitDenied to TaskKilled.
Aug 3, 2017
dd72b10
Fix Java SimpleApp spark application
christiam Aug 3, 2017
e3967dc
[SPARK-21254][WEBUI] History UI performance fixes
2ooom Aug 4, 2017
25826c7
[SPARK-21330][SQL] Bad partitioning does not allow to read a JDBC tab…
aray Aug 4, 2017
1347b2a
[SPARK-21633][ML][PYTHON] UnaryTransformer in Python
ajaysaini725 Aug 4, 2017
231f672
[SPARK-21205][SQL] pmod(number, 0) should be null.
wangyum Aug 4, 2017
5ad1796
[SPARK-21634][SQL] Change OneRowRelation from a case object to case c…
rxin Aug 4, 2017
6cbd18c
[SPARK-21374][CORE] Fix reading globbed paths from S3 into DF with di…
zsxwing Aug 5, 2017
894d5a4
[SPARK-21580][SQL] Integers in aggregation expressions are wrongly ta…
10110346 Aug 5, 2017
3a45c7f
[INFRA] Close stale PRs
HyukjinKwon Aug 5, 2017
ba327ee
[SPARK-21485][FOLLOWUP][SQL][DOCS] Describes examples and arguments s…
HyukjinKwon Aug 5, 2017
dcac1d5
[SPARK-21640] Add errorifexists as a valid string for ErrorIfExists s…
Aug 5, 2017
41568e9
[SPARK-21637][SPARK-21451][SQL] get `spark.hadoop.*` properties from …
yaooqinn Aug 6, 2017
990efad
[SPARK-20963][SQL] Support column aliases for join relations in FROM …
maropu Aug 6, 2017
1ba967b
[SPARK-21588][SQL] SQLContext.getConf(key, null) should return null
vinodkc Aug 6, 2017
d4e7f20
[SPARKR][BUILD] AppVeyor change to latest R version
felixcheung Aug 6, 2017
10b3ca3
[SPARK-21574][SQL] Point out user to set hive config before SparkSess…
wangyum Aug 6, 2017
74b4784
[SPARK-20963][SQL][FOLLOW-UP] Use UnresolvedSubqueryColumnAliases for…
maropu Aug 6, 2017
55aa4da
[SPARK-21622][ML][SPARKR] Support offset in SparkR GLM
actuaryzhang Aug 6, 2017
438c381
Add "full_outer" name to join types
BartekH Aug 6, 2017
39e044e
[MINOR][BUILD] Remove duplicate test-jar:test spark-sql dependency fr…
srowen Aug 6, 2017
534a063
[SPARK-21621][CORE] Reset numRecordsWritten after DiskBlockObjectWrit…
ConeyLiu Aug 7, 2017
663f30d
[SPARK-13041][MESOS] Adds sandbox uri to spark dispatcher ui
skonto Aug 7, 2017
1426eea
[SPARK-21623][ML] fix RF doc
Aug 7, 2017
8b69b17
[SPARK-21544][DEPLOY][TEST-MAVEN] Tests jar of some module should not…
caneGuy Aug 7, 2017
bbfd6b5
[SPARK-21647][SQL] Fix SortMergeJoin when using CROSS
gatorsmile Aug 7, 2017
4f7ec3a
[SPARK][DOCS] Added note on meaning of position to substring function
maclockard Aug 7, 2017
cce25b3
[SPARK-21565][SS] Propagate metadata in attribute replacement.
Aug 7, 2017
baf5cac
[SPARK-21648][SQL] Fix confusing assert failure in JDBC source when p…
gatorsmile Aug 7, 2017
fdcee02
[SPARK-21542][ML][PYTHON] Python persistence helper functions
ajaysaini725 Aug 8, 2017
f763d84
[SPARK-19270][FOLLOW-UP][ML] PySpark GLR model.summary should return …
yanboliang Aug 8, 2017
312bebf
[SPARK-21640][FOLLOW-UP][SQL] added errorifexists on IllegalArgumentE…
Aug 8, 2017
ee13041
[SPARK-21567][SQL] Dataset should work with type alias
viirya Aug 8, 2017
08ef7d7
[MINOR][R][BUILD] More reliable detection of R version for Windows in…
HyukjinKwon Aug 8, 2017
979bf94
[SPARK-20655][CORE] In-memory KVStore implementation.
Aug 8, 2017
2c1bfb4
[SPARK-21671][CORE] Move kvstore to "util" sub-package, add private a…
Aug 8, 2017
fb54a56
[SPARK-20433][BUILD] Bump jackson from 2.6.5 to 2.6.7.1
srowen Aug 9, 2017
6edfff0
[SPARK-21596][SS] Ensure places calling HDFSMetadataLog.get check the…
zsxwing Aug 9, 2017
031910b
[SPARK-21608][SPARK-9221][SQL] Window rangeBetween() API should allow…
jiangxb1987 Aug 9, 2017
f016f5c
[SPARK-21503][UI] Spark UI shows incorrect task status for a killed E…
Aug 9, 2017
ae8a2b1
[SPARK-21176][WEB UI] Use a single ProxyServlet to proxy all workers …
aosagie Aug 9, 2017
b35660d
[SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in …
WeichenXu123 Aug 9, 2017
6426adf
[SPARK-21663][TESTS] test("remote fetch below max RPC message size") …
wangjiaochun Aug 9, 2017
83fe3b5
[SPARK-21665][CORE] Need to close resources after use
vinodkc Aug 9, 2017
b78cf13
[SPARK-21276][CORE] Update lz4-java to the latest (v1.4.0)
maropu Aug 9, 2017
2d799d0
[SPARK-21504][SQL] Add spark version info into table metadata
gatorsmile Aug 9, 2017
0fb7325
[SPARK-21587][SS] Added filter pushdown through watermarks.
Aug 9, 2017
c06f3f5
[SPARK-21551][PYTHON] Increase timeout for PythonRDD.serveIterator
peay Aug 9, 2017
84454d7
[SPARK-14932][SQL] Allow DataFrame.replace() to replace values with None
jiayue-zhang Aug 10, 2017
95ad960
[SPARK-21669] Internal API for collecting metrics/stats during FileFo…
adrian-ionescu Aug 10, 2017
ca69558
[SPARK-21638][ML] Fix RF/GBT Warning message error
Aug 10, 2017
584c7f1
[SPARK-21699][SQL] Remove unused getTableOption in ExternalCatalog
rxin Aug 11, 2017
2387f1e
[SPARK-21675][WEBUI] Add a navigation bar at the bottom of the Detail…
yaooqinn Aug 11, 2017
0377338
[SPARK-21519][SQL] Add an option to the JDBC data source to initializ…
LucaCanali Aug 11, 2017
9443999
[SPARK-21595] Separate thresholds for buffering and spilling in Exter…
tejasapatil Aug 11, 2017
7f16c69
[SPARK-19122][SQL] Unnecessary shuffle+sort added if join predicates …
tejasapatil Aug 11, 2017
da8c59b
[SPARK-12559][SPARK SUBMIT] fix --packages for stand-alone cluster mode
skonto Aug 11, 2017
b0bdfce
[MINOR][BUILD] Download RAT and R version info over HTTPS; use RAT 0.12
srowen Aug 12, 2017
35db3b9
[SPARK-17025][ML][PYTHON] Persistence for Pipelines with Python-only …
ajaysaini725 Aug 12, 2017
c0e333d
[SPARK-21709][BUILD] sbt 0.13.16 and some plugin updates
Aug 12, 2017
5596ce8
[MINOR][SQL] Additional test case for CheckCartesianProducts rule
Aug 14, 2017
34d2134
[SPARK-21176][WEB UI] Format worker page links to work with proxy
aosagie Aug 14, 2017
6847e93
[SPARK-21563][CORE] Fix race condition when serializing TaskDescripti…
ash211 Aug 14, 2017
0fcde87
[SPARK-21658][SQL][PYSPARK] Add default None for value in na.replace …
chihhanyu Aug 14, 2017
0326b69
[MINOR][SQL][TEST] no uncache table in joinsuite test
heary-cao Aug 14, 2017
fbc2692
[SPARK-19471][SQL] AggregationIterator does not initialize the genera…
DonnyZone Aug 14, 2017
282f00b
[SPARK-21696][SS] Fix a potential issue that may generate partial sna…
zsxwing Aug 14, 2017
4c3cf1c
[SPARK-21721][SQL] Clear FileSystem deleteOnExit cache when paths are…
viirya Aug 15, 2017
0422ce0
[SPARK-21724][SQL][DOC] Adds since information in the documentation o…
HyukjinKwon Aug 15, 2017
12411b5
[SPARK-21732][SQL] Lazily init hive metastore client
zsxwing Aug 15, 2017
bc99025
[SPARK-19471][SQL] AggregationIterator does not initialize the genera…
DonnyZone Aug 15, 2017
14bdb25
[SPARK-18464][SQL][FOLLOWUP] support old table which doesn't store sc…
cloud-fan Aug 15, 2017
cba826d
[SPARK-17742][CORE] Handle child process exit in SparkLauncher.
Aug 15, 2017
3f958a9
[SPARK-21731][BUILD] Upgrade scalastyle to 0.9.
Aug 15, 2017
42b9eda
[MINOR] Fix a typo in the method name `UserDefinedFunction.asNonNullabe`
jiangxb1987 Aug 15, 2017
9660831
[SPARK-21712][PYSPARK] Clarify type error for Column.substr()
nchammas Aug 16, 2017
07549b2
[SPARK-19634][ML] Multivariate summarizer - dataframes API
WeichenXu123 Aug 16, 2017
8c54f1e
[SPARK-21422][BUILD] Depend on Apache ORC 1.4.0
dongjoon-hyun Aug 16, 2017
8321c14
[SPARK-21723][ML] Fix writing LibSVM (key not found: numFeatures)
Aug 16, 2017
0bb8d1f
[SPARK-13969][ML] Add FeatureHasher transformer
Aug 16, 2017
adf005d
[SPARK-21656][CORE] spark dynamic allocation should not idle timeout …
Aug 16, 2017
1cce1a3
[SPARK-21603][SQL] The wholestage codegen will be much slower then th…
eatoncys Aug 16, 2017
7add4e9
[SPARK-21738] Thriftserver doesn't cancel jobs when session is closed
mgaido91 Aug 16, 2017
a0345cb
[SPARK-21680][ML][MLLIB] optimize Vector compress
Aug 16, 2017
b8ffb51
[SPARK-3151][BLOCK MANAGER] DiskStore.getBytes fails for files larger…
Aug 17, 2017
a45133b
[SPARK-21743][SQL] top-most limit should not cause memory leak
cloud-fan Aug 17, 2017
6d474a1
working prototype
cloud-fan Jul 11, 2017
2 changes: 2 additions & 0 deletions .gitignore
@@ -47,6 +47,8 @@ dev/pr-deps/
dist/
docs/_site
docs/api
sql/docs
sql/site
lib_managed/
lint-r-report.log
log/
2 changes: 2 additions & 0 deletions R/pkg/NAMESPACE
@@ -286,6 +286,8 @@ exportMethods("%<=>%",
"lower",
"lpad",
"ltrim",
"map_keys",
"map_values",
"max",
"md5",
"mean",
33 changes: 32 additions & 1 deletion R/pkg/R/functions.R
@@ -195,7 +195,10 @@ NULL
#' head(tmp2)
#' head(select(tmp, posexplode(tmp$v1)))
#' head(select(tmp, sort_array(tmp$v1)))
#' head(select(tmp, sort_array(tmp$v1, asc = FALSE)))}
#' head(select(tmp, sort_array(tmp$v1, asc = FALSE)))
#' tmp3 <- mutate(df, v3 = create_map(df$model, df$cyl))
#' head(select(tmp3, map_keys(tmp3$v3)))
#' head(select(tmp3, map_values(tmp3$v3)))}
NULL

#' Window functions for Column operations
@@ -3055,6 +3058,34 @@ setMethod("array_contains",
column(jc)
})

#' @details
#' \code{map_keys}: Returns an unordered array containing the keys of the map.
#'
#' @rdname column_collection_functions
#' @aliases map_keys map_keys,Column-method
#' @export
#' @note map_keys since 2.3.0
setMethod("map_keys",
signature(x = "Column"),
function(x) {
jc <- callJStatic("org.apache.spark.sql.functions", "map_keys", x@jc)
column(jc)
})

#' @details
#' \code{map_values}: Returns an unordered array containing the values of the map.
#'
#' @rdname column_collection_functions
#' @aliases map_values map_values,Column-method
#' @export
#' @note map_values since 2.3.0
setMethod("map_values",
signature(x = "Column"),
function(x) {
jc <- callJStatic("org.apache.spark.sql.functions", "map_values", x@jc)
column(jc)
})

#' @details
#' \code{explode}: Creates a new row for each element in the given array or map column.
#'
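Example (not part of the diff): a minimal SparkR sketch of the two collection functions added above, using the mtcars-based frame the package docs use elsewhere; the column names are illustrative.

```r
library(SparkR)
sparkR.session()

# Build a map column from two existing columns, then inspect its keys and values.
df  <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
tmp <- mutate(df, v3 = create_map(df$model, df$cyl))

head(select(tmp, map_keys(tmp$v3)))    # unordered array of the map's keys
head(select(tmp, map_values(tmp$v3)))  # unordered array of the map's values
```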
10 changes: 10 additions & 0 deletions R/pkg/R/generics.R
@@ -1213,6 +1213,16 @@ setGeneric("lpad", function(x, len, pad) { standardGeneric("lpad") })
#' @name NULL
setGeneric("ltrim", function(x) { standardGeneric("ltrim") })

#' @rdname column_collection_functions
#' @export
#' @name NULL
setGeneric("map_keys", function(x) { standardGeneric("map_keys") })

#' @rdname column_collection_functions
#' @export
#' @name NULL
setGeneric("map_values", function(x) { standardGeneric("map_values") })

#' @rdname column_misc_functions
#' @export
#' @name NULL
49 changes: 40 additions & 9 deletions R/pkg/R/mllib_classification.R
@@ -69,6 +69,11 @@ setClass("NaiveBayesModel", representation(jobj = "jobj"))
#' @param aggregationDepth The depth for treeAggregate (greater than or equal to 2). If the dimensions of features
#' or the number of partitions are large, this param could be adjusted to a larger size.
#' This is an expert parameter. Default value should be good for most cases.
#' @param handleInvalid How to handle invalid data (unseen labels or NULL values) in features and label
#' column of string type.
#' Supported options: "skip" (filter out rows with invalid data),
#' "error" (throw an error), "keep" (put invalid data in a special additional
#' bucket, at index numLabels). Default is "error".
#' @param ... additional arguments passed to the method.
#' @return \code{spark.svmLinear} returns a fitted linear SVM model.
#' @rdname spark.svmLinear
@@ -98,7 +103,8 @@ setClass("NaiveBayesModel", representation(jobj = "jobj"))
#' @note spark.svmLinear since 2.2.0
setMethod("spark.svmLinear", signature(data = "SparkDataFrame", formula = "formula"),
function(data, formula, regParam = 0.0, maxIter = 100, tol = 1E-6, standardization = TRUE,
threshold = 0.0, weightCol = NULL, aggregationDepth = 2) {
threshold = 0.0, weightCol = NULL, aggregationDepth = 2,
handleInvalid = c("error", "keep", "skip")) {
formula <- paste(deparse(formula), collapse = "")

if (!is.null(weightCol) && weightCol == "") {
@@ -107,10 +113,12 @@ setMethod("spark.svmLinear", signature(data = "SparkDataFrame", formula = "formu
weightCol <- as.character(weightCol)
}

handleInvalid <- match.arg(handleInvalid)

jobj <- callJStatic("org.apache.spark.ml.r.LinearSVCWrapper", "fit",
data@sdf, formula, as.numeric(regParam), as.integer(maxIter),
as.numeric(tol), as.logical(standardization), as.numeric(threshold),
weightCol, as.integer(aggregationDepth))
weightCol, as.integer(aggregationDepth), handleInvalid)
new("LinearSVCModel", jobj = jobj)
})

@@ -218,6 +226,11 @@ function(object, path, overwrite = FALSE) {
#' @param upperBoundsOnIntercepts The upper bounds on intercepts if fitting under bound constrained optimization.
#' The bound vector size must be equal to 1 for binomial regression, or the number
#' of classes for multinomial regression.
#' @param handleInvalid How to handle invalid data (unseen labels or NULL values) in features and label
#' column of string type.
#' Supported options: "skip" (filter out rows with invalid data),
#' "error" (throw an error), "keep" (put invalid data in a special additional
#' bucket, at index numLabels). Default is "error".
#' @param ... additional arguments passed to the method.
#' @return \code{spark.logit} returns a fitted logistic regression model.
#' @rdname spark.logit
@@ -257,7 +270,8 @@ setMethod("spark.logit", signature(data = "SparkDataFrame", formula = "formula")
tol = 1E-6, family = "auto", standardization = TRUE,
thresholds = 0.5, weightCol = NULL, aggregationDepth = 2,
lowerBoundsOnCoefficients = NULL, upperBoundsOnCoefficients = NULL,
lowerBoundsOnIntercepts = NULL, upperBoundsOnIntercepts = NULL) {
lowerBoundsOnIntercepts = NULL, upperBoundsOnIntercepts = NULL,
handleInvalid = c("error", "keep", "skip")) {
formula <- paste(deparse(formula), collapse = "")
row <- 0
col <- 0
@@ -304,6 +318,8 @@ setMethod("spark.logit", signature(data = "SparkDataFrame", formula = "formula")
upperBoundsOnCoefficients <- as.array(as.vector(upperBoundsOnCoefficients))
}

handleInvalid <- match.arg(handleInvalid)

jobj <- callJStatic("org.apache.spark.ml.r.LogisticRegressionWrapper", "fit",
data@sdf, formula, as.numeric(regParam),
as.numeric(elasticNetParam), as.integer(maxIter),
@@ -312,7 +328,8 @@ setMethod("spark.logit", signature(data = "SparkDataFrame", formula = "formula")
weightCol, as.integer(aggregationDepth),
as.integer(row), as.integer(col),
lowerBoundsOnCoefficients, upperBoundsOnCoefficients,
lowerBoundsOnIntercepts, upperBoundsOnIntercepts)
lowerBoundsOnIntercepts, upperBoundsOnIntercepts,
handleInvalid)
new("LogisticRegressionModel", jobj = jobj)
})

@@ -394,7 +411,12 @@ setMethod("write.ml", signature(object = "LogisticRegressionModel", path = "char
#' @param stepSize stepSize parameter.
#' @param seed seed parameter for weights initialization.
#' @param initialWeights initialWeights parameter for weights initialization, it should be a
#' numeric vector.
#' numeric vector.
#' @param handleInvalid How to handle invalid data (unseen labels or NULL values) in features and label
#' column of string type.
#' Supported options: "skip" (filter out rows with invalid data),
#' "error" (throw an error), "keep" (put invalid data in a special additional
#' bucket, at index numLabels). Default is "error".
#' @param ... additional arguments passed to the method.
#' @return \code{spark.mlp} returns a fitted Multilayer Perceptron Classification Model.
#' @rdname spark.mlp
@@ -426,7 +448,8 @@ setMethod("write.ml", signature(object = "LogisticRegressionModel", path = "char
#' @note spark.mlp since 2.1.0
setMethod("spark.mlp", signature(data = "SparkDataFrame", formula = "formula"),
function(data, formula, layers, blockSize = 128, solver = "l-bfgs", maxIter = 100,
tol = 1E-6, stepSize = 0.03, seed = NULL, initialWeights = NULL) {
tol = 1E-6, stepSize = 0.03, seed = NULL, initialWeights = NULL,
handleInvalid = c("error", "keep", "skip")) {
formula <- paste(deparse(formula), collapse = "")
if (is.null(layers)) {
stop ("layers must be a integer vector with length > 1.")
@@ -441,10 +464,11 @@ setMethod("spark.mlp", signature(data = "SparkDataFrame", formula = "formula"),
if (!is.null(initialWeights)) {
initialWeights <- as.array(as.numeric(na.omit(initialWeights)))
}
handleInvalid <- match.arg(handleInvalid)
jobj <- callJStatic("org.apache.spark.ml.r.MultilayerPerceptronClassifierWrapper",
"fit", data@sdf, formula, as.integer(blockSize), as.array(layers),
as.character(solver), as.integer(maxIter), as.numeric(tol),
as.numeric(stepSize), seed, initialWeights)
as.numeric(stepSize), seed, initialWeights, handleInvalid)
new("MultilayerPerceptronClassificationModel", jobj = jobj)
})

@@ -514,6 +538,11 @@ setMethod("write.ml", signature(object = "MultilayerPerceptronClassificationMode
#' @param formula a symbolic description of the model to be fitted. Currently only a few formula
#' operators are supported, including '~', '.', ':', '+', and '-'.
#' @param smoothing smoothing parameter.
#' @param handleInvalid How to handle invalid data (unseen labels or NULL values) in features and label
#' column of string type.
#' Supported options: "skip" (filter out rows with invalid data),
#' "error" (throw an error), "keep" (put invalid data in a special additional
#' bucket, at index numLabels). Default is "error".
#' @param ... additional argument(s) passed to the method. Currently only \code{smoothing}.
#' @return \code{spark.naiveBayes} returns a fitted naive Bayes model.
#' @rdname spark.naiveBayes
@@ -543,10 +572,12 @@ setMethod("write.ml", signature(object = "MultilayerPerceptronClassificationMode
#' }
#' @note spark.naiveBayes since 2.0.0
setMethod("spark.naiveBayes", signature(data = "SparkDataFrame", formula = "formula"),
function(data, formula, smoothing = 1.0) {
function(data, formula, smoothing = 1.0,
handleInvalid = c("error", "keep", "skip")) {
formula <- paste(deparse(formula), collapse = "")
handleInvalid <- match.arg(handleInvalid)
jobj <- callJStatic("org.apache.spark.ml.r.NaiveBayesWrapper", "fit",
formula, data@sdf, smoothing)
formula, data@sdf, smoothing, handleInvalid)
new("NaiveBayesModel", jobj = jobj)
})

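Example (not part of the diff): a sketch of calling spark.logit with the new handleInvalid argument once this change is in place; the iris data and parameter values are illustrative.

```r
library(SparkR)
sparkR.session()

training <- createDataFrame(iris)

# "keep" routes unseen or invalid string labels into an extra index bucket
# instead of raising an error ("error" is the default; "skip" drops such rows).
model <- spark.logit(training, Species ~ ., regParam = 0.01, handleInvalid = "keep")
summary(model)
```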
22 changes: 18 additions & 4 deletions R/pkg/R/mllib_regression.R
@@ -76,6 +76,8 @@ setClass("IsotonicRegressionModel", representation(jobj = "jobj"))
#' "frequencyDesc", "frequencyAsc", "alphabetDesc", and "alphabetAsc".
#' The default value is "frequencyDesc". When the ordering is set to
#' "alphabetDesc", this drops the same category as R when encoding strings.
#' @param offsetCol the offset column name. If this is not set or empty, we treat all instance offsets
#' as 0.0. The feature specified as offset has a constant coefficient of 1.0.
#' @param ... additional arguments passed to the method.
#' @aliases spark.glm,SparkDataFrame,formula-method
#' @return \code{spark.glm} returns a fitted generalized linear model.
@@ -127,7 +129,8 @@ setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
function(data, formula, family = gaussian, tol = 1e-6, maxIter = 25, weightCol = NULL,
regParam = 0.0, var.power = 0.0, link.power = 1.0 - var.power,
stringIndexerOrderType = c("frequencyDesc", "frequencyAsc",
"alphabetDesc", "alphabetAsc")) {
"alphabetDesc", "alphabetAsc"),
offsetCol = NULL) {

stringIndexerOrderType <- match.arg(stringIndexerOrderType)
if (is.character(family)) {
@@ -159,12 +162,19 @@ setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
weightCol <- as.character(weightCol)
}

if (!is.null(offsetCol)) {
offsetCol <- as.character(offsetCol)
if (nchar(offsetCol) == 0) {
offsetCol <- NULL
}
}

# For known families, Gamma is upper-cased
jobj <- callJStatic("org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper",
"fit", formula, data@sdf, tolower(family$family), family$link,
tol, as.integer(maxIter), weightCol, regParam,
as.double(var.power), as.double(link.power),
stringIndexerOrderType)
stringIndexerOrderType, offsetCol)
new("GeneralizedLinearRegressionModel", jobj = jobj)
})

@@ -192,6 +202,8 @@ setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
#' "frequencyDesc", "frequencyAsc", "alphabetDesc", and "alphabetAsc".
#' The default value is "frequencyDesc". When the ordering is set to
#' "alphabetDesc", this drops the same category as R when encoding strings.
#' @param offsetCol the offset column name. If this is not set or empty, we treat all instance offsets
#' as 0.0. The feature specified as offset has a constant coefficient of 1.0.
#' @return \code{glm} returns a fitted generalized linear model.
#' @rdname glm
#' @export
@@ -209,10 +221,12 @@ setMethod("glm", signature(formula = "formula", family = "ANY", data = "SparkDat
function(formula, family = gaussian, data, epsilon = 1e-6, maxit = 25, weightCol = NULL,
var.power = 0.0, link.power = 1.0 - var.power,
stringIndexerOrderType = c("frequencyDesc", "frequencyAsc",
"alphabetDesc", "alphabetAsc")) {
"alphabetDesc", "alphabetAsc"),
offsetCol = NULL) {
spark.glm(data, formula, family, tol = epsilon, maxIter = maxit, weightCol = weightCol,
var.power = var.power, link.power = link.power,
stringIndexerOrderType = stringIndexerOrderType)
stringIndexerOrderType = stringIndexerOrderType,
offsetCol = offsetCol)
})

# Returns the summary of a model produced by glm() or spark.glm(), similarly to R's summary().
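Example (not part of the diff): a sketch of the new offsetCol option for spark.glm; a Poisson model with a log-exposure offset is the usual use case, and the data below is purely illustrative.

```r
library(SparkR)
sparkR.session()

# Claim counts with differing exposure; the log of exposure enters as an offset
# whose coefficient is fixed at 1.0.
claims <- createDataFrame(data.frame(
  counts      = c(2, 5, 1, 7),
  age         = c(25, 40, 33, 51),
  logExposure = log(c(1.0, 2.5, 0.8, 3.0))
))

model <- spark.glm(claims, counts ~ age, family = "poisson", offsetCol = "logExposure")
summary(model)
```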