2 changes: 2 additions & 0 deletions R/pkg/NAMESPACE
@@ -213,6 +213,8 @@ exportMethods("%in%",
"count",
"countDistinct",
"crc32",
"create_array",
"create_map",
"hash",
"cume_dist",
"date_add",
53 changes: 53 additions & 0 deletions R/pkg/R/functions.R
@@ -3652,3 +3652,56 @@ setMethod("posexplode",
jc <- callJStatic("org.apache.spark.sql.functions", "posexplode", x@jc)
column(jc)
})

#' create_array
#'
#' Creates a new array column. The input columns must all have the same data type.
#'
#' @param x Column to compute on
#' @param ... additional Column(s).
#'
#' @family normal_funcs
#' @rdname create_array
#' @name create_array
#' @aliases create_array,Column-method
#' @export
#' @examples \dontrun{create_array(df$x, df$y, df$z)}
#' @note create_array since 2.3.0
setMethod("create_array",
signature(x = "Column"),
function(x, ...) {
jcols <- lapply(list(x, ...), function (x) {
stopifnot(class(x) == "Column")
x@jc
})
jc <- callJStatic("org.apache.spark.sql.functions", "array", jcols)
column(jc)
})
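
For reference, a quick usage sketch (a hypothetical session; the input mirrors the test added below in test_sparkSQL.R):

df <- createDataFrame(data.frame(x = c(1.0, 2.0), y = c(-1.0, 3.0)))
arrs <- collect(select(df, create_array(df$x, df$y)))
arrs[, 1]  # a list column, one array per row: list(list(1, -1), list(2, 3))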

#' create_map
#'
#' Creates a new map column. The input columns must be grouped as key-value pairs,
#' e.g. (key1, value1, key2, value2, ...).
#' The key columns must all have the same data type, and can't be null.
Member: null in JVM is mapped to NA in R. We haven't documented that consistently, but it would be good to start thinking about a better way to do that.
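
For reference, a minimal illustration of that mapping (hypothetical session, assuming an active SparkR session):

df <- createDataFrame(data.frame(x = c(1, NA)))
collect(select(df, isNull(df$x)))  # second row is TRUE: the R NA crossed to the JVM as SQL NULL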

Member Author: I think it is clear from the context that we mean SQL NULL, and both lit(NA) and lit(NULL) create a SQL NULL literal. But this reminds me of something else:

> lit(NaN)
Column NULL 

> select(createDataFrame(data.frame(x=c(1))), lit(NaN))
SparkDataFrame[NULL:null]

That doesn't look right. PySpark handles this correctly:

>>> lit(float("Nan"))
Column<b'NaN'>

with DoubleType.

Member: I wouldn't be surprised if we have some issues with NaN...
but does it work if you add it to an existing DataFrame instead of going via createDataFrame? There's some additional type inference going on in the second route.

Member Author: It doesn't work with createDataFrame either.

For lit it should be a quick fix because we can call Java lit with Float.NaN. createDataFrame won't be that simple.
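
A rough sketch of what that quick fix might look like (untested, and only a sketch: boxing the value on the JVM side through java.lang.Double via callJStatic is an assumption, not the actual patch):

# hypothetical change to lit(): special-case NaN, which the R-to-JVM
# serializer currently turns into NULL
setMethod("lit", signature("ANY"),
          function(x) {
            if (is.numeric(x) && length(x) == 1 && is.nan(x)) {
              # assumed workaround: materialize NaN on the JVM side first
              nan <- callJStatic("java.lang.Double", "valueOf", "NaN")
              jc <- callJStatic("org.apache.spark.sql.functions", "lit", nan)
            } else {
              jc <- callJStatic("org.apache.spark.sql.functions", "lit",
                                if (class(x) == "Column") x@jc else x)
            }
            column(jc)
          })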

Member: Actually, does it work if you add it to an existing DataFrame instead of going via createDataFrame? There's some additional type inference going on in the second route. I mean like:

a <- as.DataFrame(cars)
a$foo <- lit(NaN)

Member Author: No, it doesn't.

Member: OK, let's open a JIRA on that separately.

Member Author: My thoughts exactly.

#' The value columns must all have the same data type.
#'
#' @param x Column to compute on
#' @param ... additional Column(s).
#'
#' @family normal_funcs
#' @rdname create_map
#' @name create_map
#' @aliases create_map,Column-method
#' @export
#' @examples \dontrun{create_map(lit("x"), lit(1.0), lit("y"), lit(-1.0))}
#' @note create_map since 2.3.0
setMethod("create_map",
signature(x = "Column"),
function(x, ...) {
jcols <- lapply(list(x, ...), function (x) {
stopifnot(class(x) == "Column")
x@jc
})
jc <- callJStatic("org.apache.spark.sql.functions", "map", jcols)
column(jc)
})
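
A matching usage sketch for create_map (hypothetical session; keys and values alternate, as the docs above describe):

df <- createDataFrame(data.frame(x = c(1.0, 2.0), y = c(-1.0, 3.0)))
m <- collect(select(df, create_map(lit("x"), df$x, lit("y"), df$y)))
as.list(m[, 1][[1]])  # first row's map: x = 1, y = -1

Note that collected map columns come back as R environments rather than named lists, which is what the new test below checks via as.environment.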
8 changes: 8 additions & 0 deletions R/pkg/R/generics.R
@@ -942,6 +942,14 @@ setGeneric("countDistinct", function(x, ...) { standardGeneric("countDistinct")
#' @export
setGeneric("crc32", function(x) { standardGeneric("crc32") })

#' @rdname create_array
Member: This is also under ###################### Expression Function Methods ##########################, which might not be the right place.

Member Author: It covers all of o.a.s.sql.functions right now. I am not sure these two are different enough to be an exception (and what about struct, which belongs to the same category?).

Member: Actually, you are right. I saw ###################### Column Methods ########################## and thought that was the place, but we already have them in both places.

I'm fine with what you have.

#' @export
setGeneric("create_array", function(x, ...) { standardGeneric("create_array") })

#' @rdname create_map
#' @export
setGeneric("create_map", function(x, ...) { standardGeneric("create_map") })

#' @rdname hash
#' @export
setGeneric("hash", function(x, ...) { standardGeneric("hash") })
17 changes: 17 additions & 0 deletions R/pkg/inst/tests/testthat/test_sparkSQL.R
@@ -1461,6 +1461,23 @@ test_that("column functions", {
expect_equal(length(arr$arrcol[[1]]), 2)
expect_equal(arr$arrcol[[1]][[1]]$name, "Bob")
expect_equal(arr$arrcol[[1]][[2]]$name, "Alice")

# Test create_array() and create_map()
df <- as.DataFrame(data.frame(
x = c(1.0, 2.0), y = c(-1.0, 3.0), z = c(-2.0, 5.0)
))

arrs <- collect(select(df, create_array(df$x, df$y, df$z)))
expect_equal(arrs[, 1], list(list(1, -1, -2), list(2, 3, 5)))

maps <- collect(select(
df, create_map(lit("x"), df$x, lit("y"), df$y, lit("z"), df$z)))

expect_equal(
maps[, 1],
lapply(
list(list(x = 1, y = -1, z = -2), list(x = 2, y = 3, z = 5)),
as.environment))
})

test_that("column binary mathfunctions", {