Conversation

@aray
Contributor

@aray aray commented Jul 31, 2017

What changes were proposed in this pull request?

SPARK-21100 introduced a new summary method to the Scala/Java Dataset API that includes expanded statistics (compared to describe) and control over which statistics to compute. Currently, the R API's summary acts as an alias for describe. This patch updates the R API to call the new summary method in the JVM, which provides the additional statistics and the ability to select which ones to compute.

This does not break the current interface: the existing summary method, unlike describe, does not take additional arguments, and its output was never meant to be consumed programmatically.
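For orientation, a minimal sketch of the intended R usage described above (assuming a SparkR session and the built-in faithful dataset; the statistic names follow the JVM-side summary):

    library(SparkR)
    sparkR.session()
    df <- createDataFrame(faithful)

    # Default: count, mean, stddev, min, approximate quartiles (25%, 50%, 75%), and max
    collect(summary(df))

    # Select specific statistics; percentiles are passed as strings such as "75%"
    collect(summary(df, "count", "min", "75%", "max"))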

How was this patch tested?

Modified and additional unit tests.

@SparkQA

SparkQA commented Jul 31, 2017

Test build #80091 has finished for PR 18786 at commit b8784b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' - mean
#' - stddev
#' - min
#' - max
Member

these bullets and the whitespace get collapsed by roxygen2 - try \itemize/\item https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html
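For illustration, the suggested \itemize markup might look roughly like this (the item wording is just a placeholder):

    #' \itemize{
    #'   \item count
    #'   \item mean
    #'   \item stddev
    #'   \item min
    #'   \item max
    #' }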


#' summary
#'
#' Computes specified statistics for numeric and string columns.
Member

does this apply to "string columns"?

Contributor Author

This is unchanged from before. The stats are only computed for NumericType and StringType columns. Of course, the only ones that are non-null for strings are count, min, and max.
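A quick hypothetical illustration of that (column names are made up):

    # One numeric and one string column
    df <- createDataFrame(data.frame(age = c(25, 30), name = c("Ann", "Bob"), stringsAsFactors = FALSE))

    # For the string column "name", only count, min, and max come back non-null;
    # mean, stddev, and the percentiles are null for string columns.
    collect(summary(df))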

expect_equal(collect(stats2)[4, "summary"], "min")
expect_equal(collect(stats2)[5, "age"], "30")
expect_equal(collect(stats2)[5, "summary"], "25%")
expect_equal(collect(stats2)[5, "age"], "30.0")
Member

does this mean this changes the output of the summary(df) call?

Contributor Author

Yes

#' - stddev
#' - min
#' - max
#' - arbitrary approximate percentiles specified as a percentage (eg, 75%)
Member

I'd clarify that 75% should be a string, e.g. "75%"

#' - arbitrary approximate percentiles specified as a percentage (eg, 75%)
#'
#' If no statistics are given, this function computes count, mean, stddev, min,
#' approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
@felixcheung felixcheung (Member) Aug 1, 2017

also, don't use an empty line (a bare #') - the second paragraph after such an empty line becomes the "details" section in the doc as formatted by roxygen2
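A made-up snippet showing the roxygen2 behavior being described:

    #' summary
    #'
    #' Computes specified statistics for numeric and string columns.
    #'
    #' This third paragraph (after the bare #' line above) lands in the "details"
    #' section of the generated Rd page, not in the description.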

@SparkQA

SparkQA commented Aug 1, 2017

Test build #80120 has finished for PR 18786 at commit 08f3cf8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

@aray, it looks like the AppVeyor tests failed due to the 1.5 hour time limit. Would you mind closing and reopening this PR to retrigger the test?

@felixcheung, it sounds like we are sometimes reaching the limit again... I previously asked AppVeyor to increase it from 1 to 1.5 hours, but it sounds like we should figure out another way... The builds now take roughly 1.1 ~ 1.5 hours.

@aray aray closed this Aug 2, 2017
@aray aray reopened this Aug 2, 2017
@felixcheung
Member

@HyukjinKwon I'm not sure how - in AppVeyor we are building everything from scratch... it does take time

@felixcheung felixcheung (Member) left a comment

Let's track behavior/output changes in release note/migration guide.

opened https://issues.apache.org/jira/browse/SPARK-21616

@aray when the output of summary changes, is it "additive"? i.e., are the new rows added at the end?

@aray
Contributor Author

aray commented Aug 2, 2017

No, the changes to summary are not additive: it inserts the 25%, 50%, and 75% percentiles before max (the last row). People who want the previous behavior can use describe; if they are trying to access these fields programmatically, they should really be specifying an explicit aggregation. If you recall, we discussed using the summary name in the original PR #18307 (comment)
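For reference, a rough sketch of the alternatives mentioned above (df and its age column are assumed to exist):

    # Previous default output (count, mean, stddev, min, max)
    collect(describe(df))

    # New summary output: 25%, 50%, and 75% are inserted before the max row
    collect(summary(df))

    # Programmatic access should use an explicit aggregation instead
    collect(agg(df, min_age = min(df$age), max_age = max(df$age)))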

@felixcheung
Member

I see. I recall the method name discussion, though changing the API and/or output format is something we generally want to avoid. Something like this has been called out in past releases as something we shouldn't do in the future.

@felixcheung
Member

Is it too late to change the Scala-side output format? I suspect it doesn't matter too much in Scala/Python which order the rows are in, and preserving the existing order in R could be helpful.

@aray
Contributor Author

aray commented Aug 8, 2017

@rxin Any thoughts on whether it's ok to change the output of summary in R in a non "additive" way?

@rxin
Contributor

rxin commented Aug 8, 2017

I suspect it is ok for R ...

@felixcheung
Member

felixcheung commented Aug 8, 2017 via email

@aray
Contributor Author

aray commented Aug 9, 2017

I'm pushing for it to stay as is because it's the more logical layout of the data: min = 0%, 25%, 50%, 75%, max = 100%. It's also more consistent with summary on native R data frames (and, for Python, the pandas describe method).
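For comparison (base R, not SparkR), summary on a native data frame also runs from min through the quartiles to max, with the mean interleaved:

    summary(data.frame(age = c(25, 30, 35)))
    # rows: Min., 1st Qu., Median, Mean, 3rd Qu., Max.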

Anyone who is accessing these fields blindly by index should know they are taking a risk. Furthermore, we already cast everything to strings in the output of summary, so it should be obvious that it's not meant for programmatic reuse.

@felixcheung If you still feel strongly, a compromise might be to print a warning when summary is called from R with no additional arguments.

#' describe(df, "col1")
#' describe(df, "col1", "col2")
#' }
#' @seealso Ues \code{\link{summary}} for expanded statistics and control over which statistics to compute.
Member

Ues -> Use? Or should we say See here

Member

also, I think \link{summary} is sufficient, no need for \code

#' @param ... (optional) statistics to be computed for all columns.
#' @rdname summary
#' @name summary
#' @aliases summary,SparkDataFrame-method
Member

should have a @return - see describe

Member

should have a @family
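That is, roughly (the wording is illustrative, mirroring what describe documents):

    #' @return A SparkDataFrame.
    #' @family SparkDataFrame functions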

#' }
#' @note summary(SparkDataFrame) since 1.5.0
#' @note The statistics provided by \code{summary} were changed in 2.3.0. Use \code{\link{describe}} for the previous defaults.
#' @seealso \code{\link{describe}}
Member

ditto here and the previous line with \code

#' approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
#' This function is meant for exploratory data analysis, as we make no guarantee about the
#' backward compatibility of the schema of the resulting Dataset. If you want to
#' programmatically compute summary statistics, use the `agg` function instead.
@felixcheung felixcheung (Member) Aug 10, 2017

oh, and don't use backticks - they don't get processed by roxygen2
use \code{agg} instead

@SparkQA

SparkQA commented Aug 15, 2017

Test build #80687 has finished for PR 18786 at commit 9c9f0f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@aray
Contributor Author

aray commented Aug 17, 2017

closing and reopening to retrigger the AppVeyor test that timed out

@aray aray closed this Aug 17, 2017
@aray aray reopened this Aug 17, 2017
@felixcheung felixcheung (Member) left a comment

LGTM

@felixcheung
Member

merged to master

@asfgit asfgit closed this in 5c9b301 Aug 22, 2017