Skip to content

Conversation

@felixcheung
Copy link
Member

Added tests for function that are reported as masked, to make sure the base:: or stats:: function can be called.

For those we can't call, added them to SparkR programming guide.

It would seem to me table, sample, subset, filter, cov not working are not actually expected - I investigated/experimented with them but couldn't get them to work. It looks like as they are defined in base or stats they are missing the S3 generic, eg.

> methods("transform")
[1] transform,ANY-method       transform.data.frame
[3] transform,DataFrame-method transform.default
see '?methods' for accessing help and source code
> methods("subset")
[1] subset.data.frame       subset,DataFrame-method subset.default
[4] subset.matrix
see '?methods' for accessing help and source code
Warning message:
In .S3methods(generic.function, class, parent.frame()) :
  function 'subset' appears not to be S3 generic; found functions that look like S3 methods

Any idea?

More information on masking:
http://www.ats.ucla.edu/stat/r/faq/referencing_objects.htm
http://www.sfu.ca/~sweldon/howTo/guide4.pdf

This is what the output doc looks like (minus css):
image

@felixcheung
Copy link
Member Author

@shivaram @sun-rui

@SparkQA
Copy link

SparkQA commented Nov 18, 2015

Test build #46135 has finished for PR 9785 at commit a7b02fa.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 18, 2015

Test build #46137 has finished for PR 9785 at commit a34311f.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 18, 2015

Test build #46140 has finished for PR 9785 at commit ff4ff26.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Copy link
Contributor

For things which are missing generics we can add a generic in our generics.R. We just need to make sure the generic exactly matches the S3 function arguments (summary is an example of that)

@felixcheung
Copy link
Member Author

It passes locally, it looks like it failing something to do with unicode - any idea?

1. Failure (at test_sparkSQL.R#889): column functions --------------------------
lag(ldeaths, 12) does not match 'unable to find an inherited method for function ���lag��� for signature ���"ts", "numeric"���'. Actual value: "Error in (function (classes, fdef, mtable)  : \n  unable to find an inherited method for function 'lag' for signature '"ts", "numeric"'\n"
Error: Test failures

@felixcheung
Copy link
Member Author

@shivaram I checked that - it didn't work with subset even when signature matched. I think that's only affect S4 but when something, like subset is S3 only it doesn't work. I think we need a subset.ANY S3 signature (like the transform.ANY example above), but I thought that might be too much coming from the SparkR package.

@shivaram
Copy link
Contributor

For the unicode thing I think there is a problem with the ' character. Can't we just match the text with some regex like unable to find.*lag ?

@shivaram
Copy link
Contributor

BTW I just made this local change

-setGeneric("subset", function(x, subset, select, ...) { standardGeneric("subset") })
+setGeneric("subset", function(x, ...) { standardGeneric("subset") })

and after this both

subset(airquality, select = Ozone:Wind) # base
head(subset(df, df$Petal_Width %in% c(1, 20))) # SparkR subset

work correctly for me.

@felixcheung
Copy link
Member Author

ok, I see. I guess I picked the closer form

## S3 method for class 'data.frame'
subset(x, subset, select, drop = FALSE, ...)

I'll try/fix the other ones too.

@shivaram
Copy link
Contributor

Yeah I think the trick is to pick the least specific one and set a generic for that

@felixcheung
Copy link
Member Author

I fixed subset, lag, and filter. Not sure about the rest:

cov(x, y = NULL, use = "everything",
           method = c("pearson", "kendall", "spearman"))
sample(x, size, replace = FALSE, prob = NULL)
table(...,
            exclude = if (useNA == "no") c(NA, NaN),
            useNA = c("no", "ifany", "always"),
            dnn = list.names(...), deparse.level = 1)

For cov and sample, they are hard match for SparkR ones, and we can't add param not in the generics as there is no ...:

Error in rematchDefinition(definition, fdef, mnames, fnames, signature) :
  methods can add arguments to the generic ‘sample’ only if '...' is an argument to the generic

For table, when added a setGenerics I can call base::table but then calling table(sqlContext, "table") it fails - I think it's because table is S3

> find("table")
[1] "package:SparkR" "package:base"
> get("table")
standardGeneric for "table" defined from package "base"
function (..., exclude = if (useNA == "no") c(NA, NaN), useNA = c("no",
    "ifany", "always"), dnn = list.names(...), deparse.level = 1)
standardGeneric("table")
<environment: 0x2526e28>
Methods may be defined for arguments: exclude, useNA, dnn, deparse.level
Use  showMethods("table")  for currently available ones.
> showMethods("table")
Function: table (package base)
...="ANY"
> table(sqlContext, "table")
Error in unique.default(x, nmax = nmax) :
  unique() applies only to vectors

@SparkQA
Copy link

SparkQA commented Nov 18, 2015

Test build #46188 has finished for PR 9785 at commit 46b2a6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sun-rui
Copy link
Contributor

sun-rui commented Nov 18, 2015

I am not sure if it makes sense to match SparkR functions to R signatures of same name to avoid name conflict in the case that SparkR functions have different semantic from R. Is it a convenience to R users or a confusion?
for example,

 setMethod("lag",
          signature(x = "characterOrColumn"),

This will interfere with a call to base lag(), where x is of type character.

@shivaram
Copy link
Contributor

So my take on this is as follows:

  1. If we have have the same name but different parameter types when compared to base R then we should make sure the base R command works correctly. This is the case for things like subset, summary etc. FWIW I think this should be the most common scenario.
  2. If we have same name and same parameter types (for example in lag): I think we can override the behavior but we need to be more careful in these cases. My guess is this mostly happens only when column names are passed as strings and we could stop supporting that if we resolve SPARK-7499. For the specific case of lag I think we are fine as the base lag isn't meant to be used with characters as far as I can see.
  3. If we have same name but the param types completely don't match (and no room for ...) then we override those functions but (This is true for sample, table, cov right now I guess) we should try to limit the number of functions where we do this. Also we should revisit some of these to see if we can avoid it (for example table can be renamed ?)

@SparkQA
Copy link

SparkQA commented Nov 18, 2015

Test build #46235 has finished for PR 9785 at commit 9ac4b12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * public class JavaGradientBoostedTreeClassifierExample\n * public class JavaGradientBoostedTreeRegressorExample\n * public class JavaRandomForestClassifierExample\n * public class JavaRandomForestRegressorExample\n * case class SerializeWithKryo(child: Expression) extends UnaryExpression\n * case class DeserializeWithKryo[T](child: Expression, tag: ClassTag[T]) extends UnaryExpression\n

@felixcheung
Copy link
Member Author

@shivaram agreed completely.
Should we merge this then? I could open JIRA for sample, table, cov

@felixcheung
Copy link
Member Author

rebased again for merge conflict

@SparkQA
Copy link

SparkQA commented Nov 19, 2015

Test build #46281 has finished for PR 9785 at commit 2d81394.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * public final class UnsafeSorterSpillReader extends UnsafeSorterIterator implements Closeable\n * class CountVectorizerModelWriter(instance: CountVectorizerModel) extends Writer\n * final class IDF(override val uid: String) extends Estimator[IDFModel] with IDFBase with Writable\n * class MinMaxScalerModelWriter(instance: MinMaxScalerModel) extends Writer\n * class StandardScalerModelWriter(instance: StandardScalerModel) extends Writer\n * class StringIndexModelWriter(instance: StringIndexerModel) extends Writer\n * trait ScalaReflection\n * case class Schema(dataType: DataType, nullable: Boolean)\n * s\"Unable to generate an encoder for inner class$\n *case class EncodeUsingSerializer(child: Expression, kryo: Boolean) extends UnaryExpression \n *case class DecodeUsingSerializer[T](child: Expression, tag: ClassTag[T], kryo: Boolean)`\n

@shivaram
Copy link
Contributor

Actually can we revert the change to filter here ? I just checked dplyr and they also override filter. If you load dplyr and run our filter unit test you get an error which looks like

> is.ts(filter(1:100, rep(1, 3)))
Error in UseMethod("filter_") :
  no applicable method for 'filter_' applied to an object of class "c('integer', 'numeric')"

@felixcheung
Copy link
Member Author

Ok, am I understanding it correctly? They don't make it compatible doesn't mean we should follow right?
The change to filter is minimial:

before:

setMethod("filter",
          signature(x = "DataFrame", condition = "characterOrColumn"),
          function(x, condition) {

after:

setMethod("filter",
          signature(x = "DataFrame", filter = "characterOrColumn"),
          function(x, filter) {

@shivaram
Copy link
Contributor

Its less about the change to the SparkR code - I worry that the generic we are exporting is not really that "generic" and its very specific instead. In the case of things like subset and lag we are removing parameters from our generic and hence seems to be a good idea.

@SparkQA
Copy link

SparkQA commented Nov 19, 2015

Test build #46299 has finished for PR 9785 at commit 70c806a.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 19, 2015

Test build #46309 has finished for PR 9785 at commit dd3f83c.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 19, 2015

Test build #46312 has finished for PR 9785 at commit 71f3be7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Copy link
Contributor

LGTM. Thanks @felixcheung ! Merging this to master and 1.6

asfgit pushed a commit that referenced this pull request Nov 19, 2015
…e that are masked by functions with same name in SparkR

Added tests for function that are reported as masked, to make sure the base:: or stats:: function can be called.

For those we can't call, added them to SparkR programming guide.

It would seem to me `table, sample, subset, filter, cov` not working are not actually expected - I investigated/experimented with them but couldn't get them to work. It looks like as they are defined in base or stats they are missing the S3 generic, eg.
```
> methods("transform")
[1] transform,ANY-method       transform.data.frame
[3] transform,DataFrame-method transform.default
see '?methods' for accessing help and source code
> methods("subset")
[1] subset.data.frame       subset,DataFrame-method subset.default
[4] subset.matrix
see '?methods' for accessing help and source code
Warning message:
In .S3methods(generic.function, class, parent.frame()) :
  function 'subset' appears not to be S3 generic; found functions that look like S3 methods
```
Any idea?

More information on masking:
http://www.ats.ucla.edu/stat/r/faq/referencing_objects.htm
http://www.sfu.ca/~sweldon/howTo/guide4.pdf

This is what the output doc looks like (minus css):
![image](https://cloud.githubusercontent.com/assets/8969467/11229714/2946e5de-8d4d-11e5-94b0-dda9696b6fdd.png)

Author: felixcheung <[email protected]>

Closes #9785 from felixcheung/rmasked.

(cherry picked from commit 1a93323)
Signed-off-by: Shivaram Venkataraman <[email protected]>
@asfgit asfgit closed this in 1a93323 Nov 19, 2015
@felixcheung
Copy link
Member Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants