[SPARK-21584][SQL][SparkR] Update R method for summary to call new implementation #18786
|
Test build #80091 has finished for PR 18786 at commit
|
R/pkg/R/DataFrame.R
Outdated
| #' - mean | ||
| #' - stddev | ||
| #' - min | ||
| #' - max |
these bullets and whitespace get collapsed by roxygen2 - try itemize/item https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html
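A minimal sketch of what the itemize/item form could look like for the list under review. The statistic names are taken from the diff above; the introductory sentence and the exact layout are assumptions, not the final wording of the patch.

```r
#' summary
#'
#' Computes specified statistics for numeric and string columns.
#' Available statistics are:
#' \itemize{
#'   \item count
#'   \item mean
#'   \item stddev
#'   \item min
#'   \item max
#' }
```

Each `\item` is rendered as its own bullet in the generated Rd file, so the list survives the whitespace collapsing described in the comment.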
R/pkg/R/DataFrame.R
Outdated
|
|
||
| #' summary | ||
| #' | ||
| #' Computes specified statistics for numeric and string columns. |
does this apply to "string columns"?
This is unchanged from before. The stats are only computed for NumericType and StringType columns. Of course, the only ones that are non-null for strings are count, min, and max.
| expect_equal(collect(stats2)[4, "summary"], "min") | ||
| expect_equal(collect(stats2)[5, "age"], "30") | ||
| expect_equal(collect(stats2)[5, "summary"], "25%") | ||
| expect_equal(collect(stats2)[5, "age"], "30.0") |
does this mean this changes the output of the summary(df) call?
Yes
R/pkg/R/DataFrame.R
Outdated
| #' - stddev | ||
| #' - min | ||
| #' - max | ||
| #' - arbitrary approximate percentiles specified as a percentage (eg, 75%) |
I'd clarify that 75% should be a string, e.g. "75%"
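To make the point concrete, here is a hedged sketch of calling the updated method with statistics passed as strings. It assumes SparkR with this patch applied and a local Spark installation; the data frame and its column names are made up for illustration.

```r
library(SparkR)
sparkR.session()  # assumes a local Spark installation is available

# Illustrative data; the column names are invented for this sketch.
df <- createDataFrame(data.frame(name = c("a", "b", "c"), age = c(19, 30, 45)))

# Statistics are passed as character strings; percentiles include the "%" sign.
stats <- summary(df, "min", "25%", "75%", "max")
head(stats)

# With no statistics given, the defaults described in the docs are computed.
head(summary(df))

sparkR.session.stop()
```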
| #' - arbitrary approximate percentiles specified as a percentage (eg, 75%) | ||
| #' | ||
| #' If no statistics are given, this function computes count, mean, stddev, min, | ||
| #' approximate quartiles (percentiles at 25%, 50%, and 75%), and max. |
also, don't use an empty line (like #') - the 2nd paragraph after such an empty line becomes the "details" section in the doc as formatted by roxygen2
|
Test build #80120 has finished for PR 18786 at commit
|
|
@aray, it looks like the AppVeyor tests failed due to the time limit of 1.5 hours. Would you mind closing and reopening this one to retrigger the test? @felixcheung, it sounds like we are sometimes reaching the limit again... I previously asked AppVeyor to increase this from 1 to 1.5 hours, but it sounds like we should figure out another way ... It now looks like the build takes roughly 1.1 ~ 1.5 hours. |
|
@HyukjinKwon I'm not sure how - in AppVeyor we are building everything from scratch... it does take time |
felixcheung
left a comment
Let's track behavior/output changes in release note/migration guide.
opened https://issues.apache.org/jira/browse/SPARK-21616
@aray are the changes to the output of summary "additive", i.e. appended at the end?
|
No the changes to |
|
I see. I recall the method name discussion, though changing the API and/or output format is something we generally want to avoid. Something like this has been called out in past releases as something we shouldn't do in the future. |
|
Is it too late to change the Scala side output format? I suspect it doesn't matter too much on Scala/Python which order they are in and preserving the existing order in R could be helpful. |
|
@rxin Any thoughts on whether it's ok to change the output of |
|
I suspect it is ok for R ... |
|
I don't think it's a big deal either way, which is why I suggest to change the order in Scala since it is new in Scala in this release, whereas it has been in R for a few releases (or since the beginning)
|
|
I'm pushing for it to stay as is because it's the more logical layout of the data: min=0%, 25%, 50%, 75%, max=100%. It's also more consistent with summary of native R dataframes (and for Python the Pandas describe method). Anyone who is accessing these fields blindly by index should know they are taking a risk. Furthermore we already cast everything to strings for the output of summary so it should be obvious that it's not meant for reuse. @felixcheung If you still feel strongly a compromise might be to print a warning when |
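A small base R sketch of the two points in the comment above: native `summary` places the quartiles between min and max, and looking a statistic up by name is safer than indexing by row. No Spark is needed; the ages are made-up data, and the `stats` data frame is only a hypothetical stand-in for a collected SparkR summary.

```r
# Base R only. The ages below are made-up illustration data.
ages <- c(19, 30, 30, 45)

# summary() on a numeric vector reports the quartiles between min and max.
s <- summary(ages)
print(names(s))

# Hypothetical stand-in for collect(summary(df)): every value is a string,
# and the rows follow the default order discussed in this thread.
stats <- data.frame(
  summary = c("count", "mean", "stddev", "min", "25%", "50%", "75%", "max"),
  age = as.character(c(length(ages), mean(ages), sd(ages), min(ages),
                       quantile(ages, c(0.25, 0.5, 0.75)), max(ages))),
  stringsAsFactors = FALSE
)

# Look a statistic up by name rather than by row index, so a change in row
# order does not silently return a different statistic.
min_age <- stats[stats$summary == "min", "age"]
print(min_age)
```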
R/pkg/R/DataFrame.R
Outdated
| #' describe(df, "col1") | ||
| #' describe(df, "col1", "col2") | ||
| #' } | ||
| #' @seealso Ues \code{\link{summary}} for expanded statistics and control over which statistics to compute. |
Ues -> Use? Or should we say See here
also, I think \link{summary} is sufficient, no need for \code
| #' @param ... (optional) statistics to be computed for all columns. | ||
| #' @rdname summary | ||
| #' @name summary | ||
| #' @aliases summary,SparkDataFrame-method |
should have a @return - see describe
should have a @family
R/pkg/R/DataFrame.R
Outdated
| #' } | ||
| #' @note summary(SparkDataFrame) since 1.5.0 | ||
| #' @note The statistics provided by \code{summary} were change in 2.3.0 use \code{\link{describe}} for previous defaults. | ||
| #' @seealso \code{\link{describe}} |
ditto here and the previous line with \code
R/pkg/R/DataFrame.R
Outdated
| #' approximate quartiles (percentiles at 25%, 50%, and 75%), and max. | ||
| #' This function is meant for exploratory data analysis, as we make no guarantee about the | ||
| #' backward compatibility of the schema of the resulting Dataset. If you want to | ||
| #' programmatically compute summary statistics, use the `agg` function instead. |
oh and don't use backtick - it doesn't get processed by roxygen2
use \code{agg} instead
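For the programmatic case that the doc text points to, a hedged sketch of using `agg` could look like the following. It assumes SparkR and a local Spark installation; the data and the output column names (`min_age`, `max_age`, `mean_age`) are made up for illustration.

```r
library(SparkR)
sparkR.session()  # assumes a local Spark installation is available

# Illustrative data; the column name is invented for this sketch.
df <- createDataFrame(data.frame(age = c(19, 30, 45)))

# agg() returns typed numeric columns under names chosen by the caller,
# unlike summary(), whose output is cast to strings.
result <- collect(agg(df,
                      min_age = min(df$age),
                      max_age = max(df$age),
                      mean_age = mean(df$age)))
print(result)

sparkR.session.stop()
```

Because the column names are chosen by the caller, code written this way does not depend on the row order or schema of `summary`.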
|
Test build #80687 has finished for PR 18786 at commit
|
|
closing and reopening to trigger AppVeyor test that timed out |
felixcheung
left a comment
LGTM
|
merged to master |
What changes were proposed in this pull request?
SPARK-21100 introduced a new `summary` method to the Scala/Java Dataset API that included expanded statistics (vs `describe`) and control over which statistics to compute. Currently in the R API `summary` acts as an alias for `describe`. This patch updates the R API to call the new `summary` method in the JVM that includes additional statistics and the ability to select which to compute.
This does not break the current interface as the present `summary` method does not take additional arguments like `describe` and the output was never meant to be used programmatically.
How was this patch tested?
Modified and additional unit tests.