Conversation

@aray
Contributor

@aray aray commented Jul 31, 2017

What changes were proposed in this pull request?

SPARK-21100 introduced a new summary method to the Scala/Java Dataset API that includes expanded statistics (compared to describe) and control over which statistics to compute. Currently, the R API's summary acts as an alias for describe. This patch updates the R API to call the new summary method in the JVM, which provides the additional statistics and the ability to select which ones to compute.

This does not break the current interface: the existing summary method, unlike describe, does not take additional arguments, and its output was never meant to be consumed programmatically.
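For orientation, a minimal sketch of the intended R usage described above (assuming a SparkR session and the built-in faithful dataset; the statistic names follow the JVM-side summary):

    library(SparkR)
    sparkR.session()
    df <- createDataFrame(faithful)

    # Default: count, mean, stddev, min, approximate quartiles (25%, 50%, 75%), and max
    collect(summary(df))

    # Select specific statistics; percentiles are passed as strings such as "75%"
    collect(summary(df, "count", "min", "75%", "max"))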

How was this patch tested?

Modified and additional unit tests.

@SparkQA

SparkQA commented Jul 31, 2017

Test build #80091 has finished for PR 18786 at commit b8784b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' - mean
#' - stddev
#' - min
#' - max
Member

these bullets and the whitespace get collapsed by roxygen2 - try \itemize/\item https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html
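For illustration, the suggested \itemize markup might look roughly like this (the item wording is just a placeholder):

    #' \itemize{
    #'   \item count
    #'   \item mean
    #'   \item stddev
    #'   \item min
    #'   \item max
    #' }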


#' summary
#'
#' Computes specified statistics for numeric and string columns.
Member

does this apply to "string columns"?

Contributor Author

This is unchanged from before. The stats are only computed for NumericType and StringType columns. Of course, the only ones that are non-null for strings are count, min, and max.
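A quick hypothetical illustration of that (column names are made up):

    # One numeric and one string column
    df <- createDataFrame(data.frame(age = c(25, 30), name = c("Ann", "Bob"), stringsAsFactors = FALSE))

    # For the string column "name", only count, min, and max come back non-null;
    # mean, stddev, and the percentiles are null for string columns.
    collect(summary(df))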

expect_equal(collect(stats2)[4, "summary"], "min")
expect_equal(collect(stats2)[5, "age"], "30")
expect_equal(collect(stats2)[5, "summary"], "25%")
expect_equal(collect(stats2)[5, "age"], "30.0")
Member

does this mean this changes the output of the summary(df) call?

Contributor Author

Yes

#' - stddev
#' - min
#' - max
#' - arbitrary approximate percentiles specified as a percentage (eg, 75%)
Member

I'd clarify that 75% should be a string, e.g. "75%"

#' - arbitrary approximate percentiles specified as a percentage (eg, 75%)
#'
#' If no statistics are given, this function computes count, mean, stddev, min,
#' approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
@felixcheung felixcheung (Member) Aug 1, 2017

also, don't use an empty line (a bare #') - the second paragraph after such an empty line becomes the "details" section in the doc as formatted by roxygen2
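A made-up snippet showing the roxygen2 behavior being described:

    #' summary
    #'
    #' Computes specified statistics for numeric and string columns.
    #'
    #' This third paragraph (after the bare #' line above) lands in the "details"
    #' section of the generated Rd page, not in the description.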

@SparkQA

SparkQA commented Aug 1, 2017

Test build #80120 has finished for PR 18786 at commit 08f3cf8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

@aray, it looks like the AppVeyor tests failed due to the 1.5 hour time limit. Would you mind closing and reopening this PR to retrigger the test?

@felixcheung, it sounds like we are sometimes reaching the limit again... I previously asked AppVeyor to increase it from 1 to 1.5 hours, but it sounds like we should figure out another way... The builds now take roughly 1.1 ~ 1.5 hours.

@aray aray closed this Aug 2, 2017
@aray aray reopened this Aug 2, 2017
@felixcheung
Member

@HyukjinKwon I'm not sure how - in AppVeyor we are building everything from scratch... it does take time

@felixcheung felixcheung (Member) left a comment

Let's track behavior/output changes in release note/migration guide.

opened https://issues.apache.org/jira/browse/SPARK-21616

@aray when the output of summary changes, is it "additive"? i.e., are the new rows added at the end?

@aray
Contributor Author

aray commented Aug 2, 2017

No, the changes to summary are not additive: it inserts the 25%, 50%, and 75% percentiles before max (the last row). People who want the previous behavior can use describe; if they are trying to access these fields programmatically, they should really be specifying an explicit aggregation. If you recall, we discussed using the summary name in the original PR #18307 (comment)
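For reference, a rough sketch of the alternatives mentioned above (df and its age column are assumed to exist):

    # Previous default output (count, mean, stddev, min, max)
    collect(describe(df))

    # New summary output: 25%, 50%, and 75% are inserted before the max row
    collect(summary(df))

    # Programmatic access should use an explicit aggregation instead
    collect(agg(df, min_age = min(df$age), max_age = max(df$age)))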

@felixcheung
Member

I see. I recall the method name discussion, though changing the API and/or output format is something we generally want to avoid. Something like this has been called out in past releases as something we shouldn't do in the future.

@felixcheung
Member

Is it too late to change the Scala-side output format? I suspect it doesn't matter too much in Scala/Python which order the rows are in, and preserving the existing order in R could be helpful.

@aray
Contributor Author

aray commented Aug 8, 2017

@rxin Any thoughts on whether it's ok to change the output of summary in R in a non "additive" way?

@rxin
Contributor

rxin commented Aug 8, 2017

I suspect it is ok for R ...

@felixcheung
Member

felixcheung commented Aug 8, 2017 via email

@aray
Contributor Author

aray commented Aug 9, 2017

I'm pushing for it to stay as is because it's the more logical layout of the data: min = 0%, 25%, 50%, 75%, max = 100%. It's also more consistent with summary on native R data frames (and, for Python, the pandas describe method).
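For comparison (base R, not SparkR), summary on a native data frame also runs from min through the quartiles to max, with the mean interleaved:

    summary(data.frame(age = c(25, 30, 35)))
    # rows: Min., 1st Qu., Median, Mean, 3rd Qu., Max.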

Anyone who is accessing these fields blindly by index should know they are taking a risk. Furthermore, we already cast everything to strings in the output of summary, so it should be obvious that it's not meant for programmatic reuse.

@felixcheung If you still feel strongly, a compromise might be to print a warning when summary is called from R with no additional arguments.

#' describe(df, "col1")
#' describe(df, "col1", "col2")
#' }
#' @seealso Ues \code{\link{summary}} for expanded statistics and control over which statistics to compute.
Member

Ues -> Use? Or should we say See here

Member

also, I think \link{summary} is sufficient, no need for \code

#' @param ... (optional) statistics to be computed for all columns.
#' @rdname summary
#' @name summary
#' @aliases summary,SparkDataFrame-method
Member

should have a @return - see describe

Member

should have a @family
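That is, roughly (the wording is illustrative, mirroring what describe documents):

    #' @return A SparkDataFrame.
    #' @family SparkDataFrame functions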

#' }
#' @note summary(SparkDataFrame) since 1.5.0
#' @note The statistics provided by \code{summary} were changed in 2.3.0. Use \code{\link{describe}} for the previous defaults.
#' @seealso \code{\link{describe}}
Member

ditto here and the previous line with \code

#' approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
#' This function is meant for exploratory data analysis, as we make no guarantee about the
#' backward compatibility of the schema of the resulting Dataset. If you want to
#' programmatically compute summary statistics, use the `agg` function instead.
@felixcheung felixcheung (Member) Aug 10, 2017

oh, and don't use backticks - they don't get processed by roxygen2
use \code{agg} instead

@SparkQA

SparkQA commented Aug 15, 2017

Test build #80687 has finished for PR 18786 at commit 9c9f0f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@aray
Contributor Author

aray commented Aug 17, 2017

closing and reopening to retrigger the AppVeyor test that timed out

@aray aray closed this Aug 17, 2017
@aray aray reopened this Aug 17, 2017
@felixcheung felixcheung (Member) left a comment

LGTM

@felixcheung
Member

merged to master

@asfgit asfgit closed this in 5c9b301 Aug 22, 2017