
Commit b60f690

Felix Cheung authored and committed
[SPARK-18817][SPARKR][SQL] change derby log output to temp dir
## What changes were proposed in this pull request?

Passes R `tempdir()` (this is the R session temp dir, shared with other temp files/dirs) to the JVM, and sets the `derby.stream.error.file` system property so that derby.log is written under that directory.

## How was this patch tested?

Manually, and with unit tests.

With this change, the Derby output is relocated to under /tmp:

```
# ls /tmp/RtmpG2M0cB/
derby.log
```

And it is removed automatically when the R session is ended.

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16330 from felixcheung/rderby.

(cherry picked from commit 422aa67)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
1 parent 780f606 commit b60f690
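For a quick check of the new behaviour from an interactive SparkR session, a sketch like the following can be used (not part of the patch; it assumes `SPARK_HOME` points at a build that includes this change, and uses the default `enableHiveSupport = TRUE` so the Derby metastore is actually touched):

```r
library(SparkR)

sparkR.session()  # enableHiveSupport defaults to TRUE
sql("CREATE TABLE IF NOT EXISTS derby_check (x INT)")  # forces Derby/metastore initialization

# derby.log should now land in the R session temp dir rather than the working directory.
"derby.log" %in% list.files(getwd())     # expected: FALSE
"derby.log" %in% list.files(tempdir())   # expected: TRUE

sparkR.session.stop()
```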

File tree

4 files changed: +63 lines, -1 line

R/pkg/R/sparkR.R

Lines changed: 14 additions & 1 deletion
@@ -322,10 +322,19 @@ sparkRHive.init <- function(jsc = NULL) {
 #' SparkSession or initializes a new SparkSession.
 #' Additional Spark properties can be set in \code{...}, and these named parameters take priority
 #' over values in \code{master}, \code{appName}, named lists of \code{sparkConfig}.
-#' When called in an interactive session, this checks for the Spark installation, and, if not
+#'
+#' When called in an interactive session, this method checks for the Spark installation, and, if not
 #' found, it will be downloaded and cached automatically. Alternatively, \code{install.spark} can
 #' be called manually.
 #'
+#' A default warehouse is created automatically in the current directory when a managed table is
+#' created via \code{sql} statement \code{CREATE TABLE}, for example. To change the location of the
+#' warehouse, set the named parameter \code{spark.sql.warehouse.dir} to the SparkSession. Along with
+#' the warehouse, an accompanied metastore may also be automatically created in the current
+#' directory when a new SparkSession is initialized with \code{enableHiveSupport} set to
+#' \code{TRUE}, which is the default. For more details, refer to Hive configuration at
+#' \url{http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables}.
+#'
 #' For details on how to initialize and use SparkR, refer to SparkR programming guide at
 #' \url{http://spark.apache.org/docs/latest/sparkr.html#starting-up-sparksession}.
 #'
@@ -381,6 +390,10 @@ sparkR.session <- function(
     deployMode <- sparkConfigMap[["spark.submit.deployMode"]]
   }
 
+  if (!exists("spark.r.sql.derby.temp.dir", envir = sparkConfigMap)) {
+    sparkConfigMap[["spark.r.sql.derby.temp.dir"]] <- tempdir()
+  }
+
   if (!exists(".sparkRjsc", envir = .sparkREnv)) {
     retHome <- sparkCheckInstall(sparkHome, master, deployMode)
     if (!is.null(retHome)) sparkHome <- retHome
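Because the property is only filled in when it is absent from `sparkConfigMap`, a caller can still choose the Derby log location explicitly. A minimal sketch (the path below is illustrative, not from the commit):

```r
library(SparkR)

# Setting spark.r.sql.derby.temp.dir up front means the exists() check above
# keeps the supplied value instead of substituting tempdir().
sparkR.session(spark.r.sql.derby.temp.dir = "/var/log/my-sparkr")

# The same property can also be supplied via the sparkConfig list:
# sparkR.session(sparkConfig = list(spark.r.sql.derby.temp.dir = "/var/log/my-sparkr"))
```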

R/pkg/inst/tests/testthat/test_sparkSQL.R

Lines changed: 34 additions & 0 deletions
@@ -60,6 +60,7 @@ unsetHiveContext <- function() {
 
 # Tests for SparkSQL functions in SparkR
 
+filesBefore <- list.files(path = sparkRDir, all.files = TRUE)
 sparkSession <- sparkR.session()
 sc <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", "getJavaSparkContext", sparkSession)
 
@@ -2839,6 +2840,39 @@ test_that("Collect on DataFrame when NAs exists at the top of a timestamp column
   expect_equal(class(ldf3$col3), c("POSIXct", "POSIXt"))
 })
 
+compare_list <- function(list1, list2) {
+  # get testthat to show the diff by first making the 2 lists equal in length
+  expect_equal(length(list1), length(list2))
+  l <- max(length(list1), length(list2))
+  length(list1) <- l
+  length(list2) <- l
+  expect_equal(sort(list1, na.last = TRUE), sort(list2, na.last = TRUE))
+}
+
+# This should always be the **very last test** in this test file.
+test_that("No extra files are created in SPARK_HOME by starting session and making calls", {
+  # Check that it is not creating any extra file.
+  # Does not check the tempdir which would be cleaned up after.
+  filesAfter <- list.files(path = sparkRDir, all.files = TRUE)
+
+  expect_true(length(sparkRFilesBefore) > 0)
+  # first, ensure derby.log is not there
+  expect_false("derby.log" %in% filesAfter)
+  # second, ensure only spark-warehouse is created when calling SparkSession, enableHiveSupport = F
+  # note: currently all other test files have enableHiveSupport = F, so we capture the list of files
+  # before creating a SparkSession with enableHiveSupport = T at the top of this test file
+  # (filesBefore). The test here is to compare that (filesBefore) against the list of files before
+  # any test is run in run-all.R (sparkRFilesBefore).
+  # sparkRWhitelistSQLDirs is also defined in run-all.R, and should contain only 2 whitelisted dirs,
+  # here allow the first value, spark-warehouse, in the diff, everything else should be exactly the
+  # same as before any test is run.
+  compare_list(sparkRFilesBefore, setdiff(filesBefore, sparkRWhitelistSQLDirs[[1]]))
+  # third, ensure only spark-warehouse and metastore_db are created when enableHiveSupport = T
+  # note: as the note above, after running all tests in this file while enableHiveSupport = T, we
+  # check the list of files again. This time we allow both whitelisted dirs to be in the diff.
+  compare_list(sparkRFilesBefore, setdiff(filesAfter, sparkRWhitelistSQLDirs))
+})
+
 unlink(parquetPath)
 unlink(orcPath)
 unlink(jsonPath)
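As an aside, a small standalone illustration of the comparison logic used in that last test (hypothetical file names, run outside of testthat):

```r
# Hypothetical snapshots of list.files() output before and after the tests.
before    <- c(".gitignore", "pkg", "run-tests.sh")
after     <- c(".gitignore", "metastore_db", "pkg", "run-tests.sh", "spark-warehouse")
whitelist <- c("spark-warehouse", "metastore_db")

# Dropping the whitelisted dirs should leave the two snapshots identical;
# padding both vectors to equal length (as compare_list does) lets testthat
# print a readable element-by-element diff when they are not.
setdiff(after, whitelist)                                 # ".gitignore" "pkg" "run-tests.sh"
identical(sort(before), sort(setdiff(after, whitelist)))  # TRUE
```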

R/pkg/tests/run-all.R

Lines changed: 6 additions & 0 deletions
@@ -22,6 +22,12 @@ library(SparkR)
 options("warn" = 2)
 
 # Setup global test environment
+sparkRDir <- file.path(Sys.getenv("SPARK_HOME"), "R")
+sparkRFilesBefore <- list.files(path = sparkRDir, all.files = TRUE)
+sparkRWhitelistSQLDirs <- c("spark-warehouse", "metastore_db")
+invisible(lapply(sparkRWhitelistSQLDirs,
+                 function(x) { unlink(file.path(sparkRDir, x), recursive = TRUE, force = TRUE)}))
+
 install.spark()
 
 test_package("SparkR")

core/src/main/scala/org/apache/spark/api/r/RRDD.scala

Lines changed: 9 additions & 0 deletions
@@ -17,6 +17,7 @@
 
 package org.apache.spark.api.r
 
+import java.io.File
 import java.util.{Map => JMap}
 
 import scala.collection.JavaConverters._
@@ -127,6 +128,14 @@ private[r] object RRDD {
       sparkConf.setExecutorEnv(name.toString, value.toString)
     }
 
+    if (sparkEnvirMap.containsKey("spark.r.sql.derby.temp.dir") &&
+        System.getProperty("derby.stream.error.file") == null) {
+      // This must be set before SparkContext is instantiated.
+      System.setProperty("derby.stream.error.file",
+                         Seq(sparkEnvirMap.get("spark.r.sql.derby.temp.dir").toString, "derby.log")
+                           .mkString(File.separator))
+    }
+
     val jsc = new JavaSparkContext(sparkConf)
     jars.foreach { jar =>
      jsc.addJar(jar)
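From the R side, one way to confirm the property actually reached the driver JVM is to read it back with `sparkR.callJStatic` (a sketch, assuming a running SparkR session created by a build with this change):

```r
library(SparkR)
sparkR.session()

# With this change the property should point at <R tempdir>/derby.log,
# unless derby.stream.error.file had already been set some other way.
sparkR.callJStatic("java.lang.System", "getProperty", "derby.stream.error.file")
```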
