[SPARK-16012][SparkR] Implement gapplyCollect which will apply a R function on each group similar to gapply and collect the result back to R data.frame #13760
Conversation
Test build #60777 has finished for PR 13760 at commit
Test build #60778 has finished for PR 13760 at commit
Test build #60779 has finished for PR 13760 at commit
Test build #60815 has finished for PR 13760 at commit
Test build #60818 has finished for PR 13760 at commit
…or gapply and gapplyCollect
R/pkg/R/group.R
Outdated
#' @return a SparkDataFrame
#' @rdname gapply
#' @name gapply
#' @seealso gapplyCollect \link{gapplyCollect}
you can leave it as #' @seealso \link{gapplyCollect}
please add @export
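For context, a sketch of how the roxygen block above might look with both suggestions applied; the tag ordering and wording are illustrative, not the exact merged text:

```r
#' @return a SparkDataFrame
#' @rdname gapply
#' @name gapply
#' @seealso \link{gapplyCollect}
#' @export
```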
Test build #60871 has finished for PR 13760 at commit
R/pkg/R/DataFrame.R
Outdated
#'
#' result <- gapplyCollect(
#'   df,
#'   list("a", "c"),
Wouldn't using c("a", "c") be more natural?
Test build #60971 has finished for PR 13760 at commit

Test build #60972 has finished for PR 13760 at commit
actual <- collect(df1)
expect_identical(actual, expected)

df1Collect <- gapplyCollect(df, list("a"), function(key, x) { x })
maybe better to change list("a") to "a", to test whether a scalar column parameter works
we should have both?
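A minimal sketch of a test exercising both forms discussed here, assuming a testthat context and a SparkDataFrame `df` with an integer column "a" as in the surrounding tests (variable names are illustrative):

```r
# Scalar column name and list form should produce the same grouped result.
out_scalar <- gapplyCollect(df, "a", function(key, x) { x })
out_list <- gapplyCollect(df, list("a"), function(key, x) { x })
# Row order is not guaranteed after a grouped operation, so compare sorted values.
expect_identical(sort(out_scalar$a), sort(out_list$a))
```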
LGTM except one minor comment
setMethod("gapply",
          signature(x = "GroupedData"),
          function(x, func, schema) {
            try(if (is.null(schema)) stop("schema cannot be NULL"))
this check that schema is not NULL still needs to be preserved for the gapply call?
Yeah, we need it. I tried to do it like dapply, but dapply enforces it through its signature and gapply does not.
I will bring it back, thanks.
Test build #61064 has finished for PR 13760 at commit
R/pkg/R/group.R
Outdated
setMethod("gapply",
          signature(x = "GroupedData"),
          function(x, func, schema) {
            try(if (is.null(schema)) stop("schema cannot be NULL"))
why do we have it inside try again? Don't we want this to fail?
With or without try, both work fine.
Without try the error looks like:
Error in .local(x, ...) : schema cannot be NULL
With try:
Error in try(if (is.null(schema)) stop("schema cannot be NULL")) :
schema cannot be NULL
Is there a convention in SparkR for showing an error message?
I think we use stop or warn directly in SparkR without wrapping in a try block. I think the try block is useful if we want to catch an error of one kind and then reformat it or show a different error etc. In this case stop should be sufficient.
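A minimal sketch of the check with stop() called directly, following that convention; the gapplyInternal helper name is an assumption about the shared implementation, and the rest of the method body is elided:

```r
setMethod("gapply",
          signature(x = "GroupedData"),
          function(x, func, schema) {
            # Fail fast with a plain error; without the try() wrapper the
            # message reads "schema cannot be NULL" instead of echoing the
            # try() expression in front of it.
            if (is.null(schema)) {
              stop("schema cannot be NULL")
            }
            gapplyInternal(x, func, schema)  # helper name assumed, not from the diff
          })
```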
Test build #61125 has finished for PR 13760 at commit

@felixcheung Any other comments on this?
#' schema <- structType(structField("a", "integer"), structField("c", "string"),
#'                      structField("avg", "double"))
#' df1 <- gapply(
#' result <- gapply(
If what is returned is a DataFrame, it might help to keep the name a variant of "df".
In fact, you might want to add @return to document the return value and type.
Thanks @felixcheung, I think I kept it consistent with dapply/dapplyCollect; those do not have @return. I can add it to gapply.
done
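A sketch of what the added @return tags could say, with gapply keeping the result distributed and gapplyCollect collecting it locally; the wording is illustrative, not the merged text:

```r
# For gapply(), which keeps the result distributed:
#' @return A SparkDataFrame.

# For gapplyCollect(), which collects the result locally:
#' @return A data.frame.
```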
Test build #61300 has finished for PR 13760 at commit
@felixcheung, I've addressed the comments, or replied inline for the ones not addressed.

looks good, thanks!

no

Thanks all. LGTM. Merging this to master and branch-2.0
…nction on each group similar to gapply and collect the result back to R data.frame

Author: Narine Kokhlikyan <[email protected]>
Closes #13760 from NarineK/gapplyCollect.
(cherry picked from commit 26afb4c)
Signed-off-by: Shivaram Venkataraman <[email protected]>
What changes were proposed in this pull request?
gapplyCollect() does gapply() on a SparkDataFrame and collects the result back to R. Compared to gapply() + collect(), gapplyCollect() offers performance optimization as well as programming convenience, as no schema needs to be provided.
This is similar to dapplyCollect().
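A minimal usage sketch contrasting the two approaches, assuming an active SparkR session; the column names and values are illustrative, following the style of the documentation examples above:

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(
  data.frame(a = c(1L, 1L, 3L), b = c(1, 10, 10), c = c("1", "1", "3"), d = c(0.1, 0, 10)))

# gapply() + collect(): a result schema must be supplied up front.
schema <- structType(structField("a", "integer"), structField("avg", "double"))
avg1 <- collect(gapply(df, "a",
                       function(key, x) { data.frame(key, mean(x$b)) },
                       schema))

# gapplyCollect(): the same computation with no schema, returned as an R data.frame.
avg2 <- gapplyCollect(df, "a",
                      function(key, x) {
                        y <- data.frame(key, mean(x$b))
                        colnames(y) <- c("a", "avg")
                        y
                      })
```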
How was this patch tested?
Added test cases for gapplyCollect, similar to those for dapplyCollect.