
Commit 11d2b07

[SPARK-31918][R] Ignore S4 generic methods under SparkR namespace in closure cleaning to support R 4.0.0+
### What changes were proposed in this pull request?

This PR proposes to ignore S4 generic methods under the SparkR namespace in closure cleaning, in order to support R 4.0.0+. Currently, code that runs R native code fails as below with R 4.0.0:

```r
df <- createDataFrame(lapply(seq(100), function(e) list(value = e)))
count(dapply(df, function(x) as.data.frame(x[x$value < 50, ]), schema(df)))
```

```
org.apache.spark.SparkException: R unexpectedly exited.
R worker produced errors: Error in lapply(part, FUN) : attempt to bind a variable to R_UnboundValue
```

The root cause seems to be that an S4 generic method gets manually included in the closure's environment via `SparkR:::cleanClosure`. For example, when an RRDD is created via `createDataFrame` with a call to `lapply` for the conversion, `lapply` itself:

https://github.com/apache/spark/blob/f53d8c63e80172295e2fbc805c0c391bdececcaa/R/pkg/R/RDD.R#L484

is added into the environment of the cleaned closure because it is not exported from the namespace; however, this is broken in R 4.0.0+ for an unknown reason, with an error message such as "attempt to bind a variable to R_UnboundValue".

We actually don't need to add `lapply` to the closure's environment, because it is not supposed to be called on the worker side. In fact, to my understanding, no private generic method in SparkR is supposed to be called on the worker side at all. Therefore, this PR takes the simpler path of working around the problem by explicitly excluding S4 generic methods under the SparkR namespace, which is enough to support R 4.0.0.

### Why are the changes needed?

To support R 4.0.0+ with SparkR, and to unblock the releases on CRAN. CRAN requires the tests to pass with the latest R.

### Does this PR introduce _any_ user-facing change?

Yes, it will support R 4.0.0 for end users.

### How was this patch tested?

Manually tested. Both the CRAN checks and the tests pass with R 4.0.1:

```
══ testthat results ═══════════════════════════════════════════════════════════
[ OK: 13 | SKIPPED: 0 | WARNINGS: 0 | FAILED: 0 ]

✔ |   OK F W S | Context
✔ |   11       | binary functions [2.5 s]
✔ |    4       | functions on binary files [2.1 s]
✔ |    2       | broadcast variables [0.5 s]
✔ |    5       | functions in client.R
✔ |   46       | test functions in sparkR.R [6.3 s]
✔ |    2       | include R packages [0.3 s]
✔ |    2       | JVM API [0.2 s]
✔ |   75       | MLlib classification algorithms, except for tree-based algorithms [86.3 s]
✔ |   70       | MLlib clustering algorithms [44.5 s]
✔ |    6       | MLlib frequent pattern mining [3.0 s]
✔ |    8       | MLlib recommendation algorithms [9.6 s]
✔ |  136       | MLlib regression algorithms, except for tree-based algorithms [76.0 s]
✔ |    8       | MLlib statistics algorithms [0.6 s]
✔ |   94       | MLlib tree-based algorithms [85.2 s]
✔ |   29       | parallelize() and collect() [0.5 s]
✔ |  428       | basic RDD functions [25.3 s]
✔ |   39       | SerDe functionality [2.2 s]
✔ |   20       | partitionBy, groupByKey, reduceByKey etc. [3.9 s]
✔ |    4       | functions in sparkR.R
✔ |   16       | SparkSQL Arrow optimization [19.2 s]
✔ |    6       | test show SparkDataFrame when eager execution is enabled. [1.1 s]
✔ | 1175       | SparkSQL functions [134.8 s]
✔ |   42       | Structured Streaming [478.2 s]
✔ |   16       | tests RDD function take() [1.1 s]
✔ |   14       | the textFile() function [2.9 s]
✔ |   46       | functions in utils.R [0.7 s]
✔ |    0     1 | Windows-specific tests
────────────────────────────────────────────────────────────────────────────────
test_Windows.R:22: skip: sparkJars tag in SparkContext
Reason: This test is only for Windows, skipped
────────────────────────────────────────────────────────────────────────────────

══ Results ═════════════════════════════════════════════════════════════════════
Duration: 987.3 s

OK:       2304
Failed:   0
Warnings: 0
Skipped:  1
...
Status: OK
+ popd
Tests passed.
```

Note that I also tested building SparkR with R 4.0.0 and running the tests with R 3.6.3; everything passed. See also [the comment in the JIRA](https://issues.apache.org/jira/browse/SPARK-31918?focusedCommentId=17142837&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17142837).

Closes apache#28907 from HyukjinKwon/SPARK-31918.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
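
For context on the mechanism described above, `SparkR:::cleanClosure` walks a function's body and copies each free variable it references into a fresh environment, so the function can be serialized to R workers. A minimal sketch of invoking it, assuming a local SparkR installation (`f` is just an example closure, not taken from the patch):

```r
library(SparkR)

# An example closure like the one passed to dapply() in the reproduction.
f <- function(x) as.data.frame(x[x$value < 50, ])

# cleanClosure() returns f rebound to a new environment that holds only
# the captured symbols; after this patch, S4 generics under the SparkR
# namespace are skipped while that environment is populated.
cleaned <- SparkR:::cleanClosure(f)
ls(environment(cleaned))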
1 parent: e00f43c

File tree: 5 files changed (+18 / -13 lines)

R/pkg/R/utils.R

Lines changed: 4 additions & 1 deletion
```diff
@@ -529,7 +529,10 @@ processClosure <- function(node, oldEnv, defVars, checkedFuncs, newEnv) {
           # Namespaces other than "SparkR" will not be searched.
           if (!isNamespace(func.env) ||
               (getNamespaceName(func.env) == "SparkR" &&
-               !(nodeChar %in% getNamespaceExports("SparkR")))) {
+               !(nodeChar %in% getNamespaceExports("SparkR")) &&
+               # Note that generic S4 methods should not be set to the environment of
+               # cleaned closure. It does not work with R 4.0.0+. See also SPARK-31918.
+               nodeChar != "" && !methods::isGeneric(nodeChar, func.env))) {
             # Only include SparkR internals.
 
             # Set parameter 'inherits' to FALSE since we do not need to search in
```
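
To see what `methods::isGeneric` reports, here is a small standalone sketch in a plain R session (`firstElem` is a made-up generic for illustration, not part of SparkR):

```r
library(methods)

# Create an S4 generic in a scratch environment, then query it the same
# way the new condition in processClosure() does.
env <- new.env()
setGeneric("firstElem", function(x) standardGeneric("firstElem"), where = env)

isGeneric("firstElem", env)  # TRUE  -> such symbols are now excluded
isGeneric("paste", env)      # FALSE -> ordinary functions are still captured
```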

R/pkg/tests/fulltests/test_context.R

Lines changed: 3 additions & 1 deletion
```diff
@@ -26,7 +26,9 @@ test_that("Check masked functions", {
     "colnames", "colnames<-", "intersect", "rank", "rbind", "sample", "subset",
     "summary", "transform", "drop", "window", "as.data.frame", "union", "not")
   version <- packageVersion("base")
-  if (as.numeric(version$major) >= 3 && as.numeric(version$minor) >= 3) {
+  is33Above <- as.numeric(version$major) >= 3 && as.numeric(version$minor) >= 3
+  is40Above <- as.numeric(version$major) >= 4
+  if (is33Above || is40Above) {
     namesOfMasked <- c("endsWith", "startsWith", namesOfMasked)
   }
   masked <- conflicts(detail = TRUE)$`package:SparkR`
```
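
The rewrite above is needed because the old condition required both `major >= 3` and `minor >= 3`, which is false for R 4.0.x (major 4, minor 0), so the `endsWith`/`startsWith` masking entries were wrongly dropped on R 4.0.0. A quick check:

```r
# Simulate the version object that packageVersion("base") returns on R 4.0.0.
version <- package_version("4.0.0")

as.numeric(version$major) >= 3 && as.numeric(version$minor) >= 3  # FALSE: old check misses 4.0.x
as.numeric(version$major) >= 4                                    # TRUE: the new branch catches it
```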

R/pkg/tests/fulltests/test_mllib_classification.R

Lines changed: 9 additions & 9 deletions
```diff
@@ -34,7 +34,7 @@ test_that("spark.svmLinear", {
   summary <- summary(model)
 
   # test summary coefficients return matrix type
-  expect_true(class(summary$coefficients) == "matrix")
+  expect_true(any(class(summary$coefficients) == "matrix"))
   expect_true(class(summary$coefficients[, 1]) == "numeric")
 
   coefs <- summary$coefficients[, "Estimate"]
@@ -130,7 +130,7 @@ test_that("spark.logit", {
   summary <- summary(model)
 
   # test summary coefficients return matrix type
-  expect_true(class(summary$coefficients) == "matrix")
+  expect_true(any(class(summary$coefficients) == "matrix"))
   expect_true(class(summary$coefficients[, 1]) == "numeric")
 
   versicolorCoefsR <- c(1.52, 0.03, -0.53, 0.04, 0.00)
@@ -242,8 +242,8 @@ test_that("spark.logit", {
   # Test binomial logistic regression against two classes with upperBoundsOnCoefficients
   # and upperBoundsOnIntercepts
   u <- matrix(c(1.0, 0.0, 1.0, 0.0), nrow = 1, ncol = 4)
-  model <- spark.logit(training, Species ~ ., upperBoundsOnCoefficients = u,
-                       upperBoundsOnIntercepts = 1.0)
+  model <- suppressWarnings(spark.logit(training, Species ~ ., upperBoundsOnCoefficients = u,
+                                        upperBoundsOnIntercepts = 1.0))
   summary <- summary(model)
   coefsR <- c(-11.13331, 1.00000, 0.00000, 1.00000, 0.00000)
   coefs <- summary$coefficients[, "Estimate"]
@@ -255,8 +255,8 @@ test_that("spark.logit", {
   # Test binomial logistic regression against two classes with lowerBoundsOnCoefficients
   # and lowerBoundsOnIntercepts
   l <- matrix(c(0.0, -1.0, 0.0, -1.0), nrow = 1, ncol = 4)
-  model <- spark.logit(training, Species ~ ., lowerBoundsOnCoefficients = l,
-                       lowerBoundsOnIntercepts = 0.0)
+  model <- suppressWarnings(spark.logit(training, Species ~ ., lowerBoundsOnCoefficients = l,
+                                        lowerBoundsOnIntercepts = 0.0))
   summary <- summary(model)
   coefsR <- c(0, 0, -1, 0, 1.902192)
   coefs <- summary$coefficients[, "Estimate"]
@@ -268,9 +268,9 @@ test_that("spark.logit", {
   # Test multinomial logistic regression with lowerBoundsOnCoefficients
   # and lowerBoundsOnIntercepts
   l <- matrix(c(0.0, -1.0, 0.0, -1.0, 0.0, -1.0, 0.0, -1.0), nrow = 2, ncol = 4)
-  model <- spark.logit(training, Species ~ ., family = "multinomial",
-                       lowerBoundsOnCoefficients = l,
-                       lowerBoundsOnIntercepts = as.array(c(0.0, 0.0)))
+  model <- suppressWarnings(spark.logit(training, Species ~ ., family = "multinomial",
+                                        lowerBoundsOnCoefficients = l,
+                                        lowerBoundsOnIntercepts = as.array(c(0.0, 0.0))))
   summary <- summary(model)
   versicolorCoefsR <- c(42.639465, 7.258104, 14.330814, 16.298243, 11.716429)
   virginicaCoefsR <- c(0.0002970796, 4.79274, 7.65047, 25.72793, 30.0021)
```
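
The `any(...)` wrapping here (and in the clustering and regression tests below) accounts for a behavior change in R 4.0.0: matrix objects now also inherit from class `"array"`, so `class()` returns a length-2 vector and comparing it to a single string no longer yields a single `TRUE`:

```r
m <- matrix(1:4, nrow = 2)

class(m)                   # R 3.x: "matrix"; R 4.0.0+: c("matrix", "array")
class(m) == "matrix"       # on R 4.0.0+: c(TRUE, FALSE), which fails expect_true()
any(class(m) == "matrix")  # TRUE on both old and new R
```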

R/pkg/tests/fulltests/test_mllib_clustering.R

Lines changed: 1 addition & 1 deletion
```diff
@@ -171,7 +171,7 @@ test_that("spark.kmeans", {
   expect_equal(sort(collect(distinct(select(cluster, "prediction")))$prediction), c(0, 1))
 
   # test summary coefficients return matrix type
-  expect_true(class(summary.model$coefficients) == "matrix")
+  expect_true(any(class(summary.model$coefficients) == "matrix"))
   expect_true(class(summary.model$coefficients[1, ]) == "numeric")
 
   # Test model save/load
```

R/pkg/tests/fulltests/test_mllib_regression.R

Lines changed: 1 addition & 1 deletion
```diff
@@ -116,7 +116,7 @@ test_that("spark.glm summary", {
   rStats <- summary(glm(Sepal.Width ~ Sepal.Length + Species, data = dataset))
 
   # test summary coefficients return matrix type
-  expect_true(class(stats$coefficients) == "matrix")
+  expect_true(any(class(stats$coefficients) == "matrix"))
   expect_true(class(stats$coefficients[, 1]) == "numeric")
 
   coefs <- stats$coefficients
```
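
With the fix applied, the reproduction from the description should run cleanly on R 4.0.0+; a sketch, assuming a local SparkR session:

```r
library(SparkR)
sparkR.session(master = "local[1]")

df <- createDataFrame(lapply(seq(100), function(e) list(value = e)))

# This previously died on R 4.0.0 with "attempt to bind a variable to
# R_UnboundValue"; it now returns 49 (values 1..49 are below 50).
count(dapply(df, function(x) as.data.frame(x[x$value < 50, ]), schema(df)))

sparkR.session.stop()
```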
