Conversation

@NarineK
Contributor

@NarineK NarineK commented Jun 18, 2016

What changes were proposed in this pull request?

gapplyCollect() does gapply() on a SparkDataFrame and collects the result back to R. Compared to gapply() + collect(), gapplyCollect() offers a performance optimization as well as programming convenience, since no schema needs to be provided.

This is similar to dapplyCollect().
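As a sketch of the difference (assuming a running SparkR session and a SparkDataFrame `df` with an integer column "a" and a double column "b"; names are illustrative, not from the patch):

```r
# gapply() + collect(): an output schema must be supplied up front.
schema <- structType(structField("a", "integer"),
                     structField("avg_b", "double"))
df1 <- gapply(df, "a",
              function(key, x) {
                data.frame(key, mean(x$b), stringsAsFactors = FALSE)
              },
              schema)
result <- collect(df1)

# gapplyCollect(): same computation, no schema needed -- the result
# comes back directly as an R data.frame.
result <- gapplyCollect(df, "a",
                        function(key, x) {
                          y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
                          colnames(y) <- c("a", "avg_b")
                          y
                        })
```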

How was this patch tested?

Added test cases for gapplyCollect similar to those for dapplyCollect.

@NarineK NarineK changed the title [SPARK-16012][SparkR] GapplyCollect - applies a R function to each group similar to gapply and collects the result back to R data.frame [SPARK-16012][SparkR] GapplyCollect - applies a R function on each group similar to gapply and collects the result back to R data.frame Jun 18, 2016
@SparkQA

SparkQA commented Jun 18, 2016

Test build #60777 has finished for PR 13760 at commit de5dbb0.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 18, 2016

Test build #60778 has finished for PR 13760 at commit 4b9cd3e.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 18, 2016

Test build #60779 has finished for PR 13760 at commit f96fae9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 19, 2016

Test build #60815 has finished for PR 13760 at commit 7dd883f.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 19, 2016

Test build #60818 has finished for PR 13760 at commit 11c7cd6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

Thanks @NarineK -- cc @sun-rui for review

R/pkg/R/group.R Outdated
#' @return a SparkDataFrame
#' @rdname gapply
#' @name gapply
#' @seealso gapplyCollect \link{gapplyCollect}
Member

you can leave it as #' @seealso \link{gapplyCollect}

Member

please add @export

@SparkQA

SparkQA commented Jun 20, 2016

Test build #60871 has finished for PR 13760 at commit 21de08f.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

#'
#' result <- gapplyCollect(
#' df,
#' list("a", "c"),
Contributor

Would using c("a", "c") be more natural?
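For context, both spellings name the grouping columns; the reviewer's point is that the vector form reads more idiomatically in R (illustrative calls, not taken from the patch):

```r
# List form, as written in the example docs
result <- gapplyCollect(df, list("a", "c"), function(key, x) { x })

# Vector form, the more idiomatic R spelling
result <- gapplyCollect(df, c("a", "c"), function(key, x) { x })
```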

@NarineK NarineK changed the title [SPARK-16012][SparkR] GapplyCollect - applies a R function on each group similar to gapply and collects the result back to R data.frame [SPARK-16012][SparkR] gapplyCollect - applies a R function on each group similar to gapply and collects the result back to R data.frame Jun 21, 2016
@SparkQA

SparkQA commented Jun 21, 2016

Test build #60971 has finished for PR 13760 at commit 1d62c38.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 21, 2016

Test build #60972 has finished for PR 13760 at commit 77fb205.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

actual <- collect(df1)
expect_identical(actual, expected)

df1Collect <- gapplyCollect(df, list("a"), function(key, x) { x })
Contributor

Maybe better to change list("a") to "a" to test whether a scalar column parameter works.

Member

we should have both?
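A test covering both column-parameter forms might look like this (a sketch, assuming the test file's existing `df` and `expected` fixtures and testthat conventions):

```r
# Scalar column name
df1Collect <- gapplyCollect(df, "a", function(key, x) { x })
expect_identical(df1Collect, expected)

# List form
df2Collect <- gapplyCollect(df, list("a"), function(key, x) { x })
expect_identical(df2Collect, expected)
```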

@sun-rui
Contributor

sun-rui commented Jun 22, 2016

LGTM except on minor comment

setMethod("gapply",
signature(x = "GroupedData"),
function(x, func, schema) {
try(if (is.null(schema)) stop("schema cannot be NULL"))
Contributor

This check that the schema is not null still needs to be preserved for the gapply call?

Contributor Author

Yeah, we need it. I tried to do it like dapply, but dapply enforces a non-null schema through its method signature while gapply does not.
Will bring it back, thanks.

@SparkQA

SparkQA commented Jun 22, 2016

Test build #61064 has finished for PR 13760 at commit 022f87d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@NarineK NarineK changed the title [SPARK-16012][SparkR] gapplyCollect - applies a R function on each group similar to gapply and collects the result back to R data.frame [SPARK-16012][SparkR] implement gapplyCollect which will apply a R function on each group similar to gapply and collects the result back to R data.frame Jun 23, 2016
R/pkg/R/group.R Outdated
setMethod("gapply",
signature(x = "GroupedData"),
function(x, func, schema) {
try(if (is.null(schema)) stop("schema cannot be NULL"))
Member

Why do we have it inside try again? Don't we want this to fail?

Contributor Author

With or without try, both work fine.

Without try the error looks like:
Error in .local(x, ...) : schema cannot be NULL

with try:
Error in try(if (is.null(schema)) stop("schema cannot be NULL")) :
schema cannot be NULL

Is there a convention in SparkR for showing an error message?

Contributor

I think we use stop or warning directly in SparkR without wrapping in a try block. The try block is useful if we want to catch an error of one kind and then reformat it or show a different error, etc. In this case stop should be sufficient.
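Following that convention of calling stop() directly, the check in the method would simply be (a sketch of the relevant lines only):

```r
setMethod("gapply",
          signature(x = "GroupedData"),
          function(x, func, schema) {
            # stop() directly, no try() wrapper, per SparkR convention
            if (is.null(schema)) {
              stop("schema cannot be NULL")
            }
            # ... rest of the method body unchanged
          })
```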

@SparkQA

SparkQA commented Jun 23, 2016

Test build #61125 has finished for PR 13760 at commit 963f14f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

@felixcheung Any other comments on this ?

#' schema <- structType(structField("a", "integer"), structField("c", "string"),
#' structField("avg", "double"))
#' df1 <- gapply(
#' result <- gapply(
Member

If what is returned is a DataFrame, it might help to name it a variant of "df".
In fact, you might want to add @return to document the return value and type.
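Combining this with the earlier review suggestions, the roxygen block could look like the following (illustrative, not the final patch):

```r
#' @return A SparkDataFrame.
#' @rdname gapply
#' @name gapply
#' @seealso \link{gapplyCollect}
#' @export
```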

Contributor Author

Thanks, @felixcheung. I think I kept it consistent with dapply/dapplyCollect; those do not have @return. I can add it to gapply.

Contributor Author

done

@SparkQA

SparkQA commented Jun 27, 2016

Test build #61300 has finished for PR 13760 at commit 1e3f0ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@NarineK
Contributor Author

NarineK commented Jun 28, 2016

@felixcheung, I've addressed the comments, or left a reply for the ones not addressed.

@felixcheung
Member

looks good, thanks!

@NarineK
Contributor Author

NarineK commented Jun 30, 2016

Do you have any questions on this @shivaram , @sun-rui ?

@NarineK NarineK changed the title [SPARK-16012][SparkR] implement gapplyCollect which will apply a R function on each group similar to gapply and collects the result back to R data.frame [SPARK-16012][SparkR] implement gapplyCollect which will apply a R function on each group similar to gapply and collect the result back to R data.frame Jun 30, 2016
@NarineK NarineK changed the title [SPARK-16012][SparkR] implement gapplyCollect which will apply a R function on each group similar to gapply and collect the result back to R data.frame [SPARK-16012][SparkR] Implement gapplyCollect which will apply a R function on each group similar to gapply and collect the result back to R data.frame Jun 30, 2016
@sun-rui
Contributor

sun-rui commented Jul 1, 2016

no

@shivaram
Contributor

shivaram commented Jul 1, 2016

Thanks all. LGTM. Merging this to master and branch-2.0

asfgit pushed a commit that referenced this pull request Jul 1, 2016
…nction on each group similar to gapply and collect the result back to R data.frame

## What changes were proposed in this pull request?
gapplyCollect() does gapply() on a SparkDataFrame and collects the result back to R. Compared to gapply() + collect(), gapplyCollect() offers a performance optimization as well as programming convenience, since no schema needs to be provided.

This is similar to dapplyCollect().

## How was this patch tested?
Added test cases for gapplyCollect similar to those for dapplyCollect.

Author: Narine Kokhlikyan <[email protected]>

Closes #13760 from NarineK/gapplyCollect.

(cherry picked from commit 26afb4c)
Signed-off-by: Shivaram Venkataraman <[email protected]>
@asfgit asfgit closed this in 26afb4c Jul 1, 2016