-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-10863][SPARKR] Method coltypes() to get R's data types of a DataFrame #8984
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from 1 commit
Commits
Show all changes
42 commits
Select commit
Hold shift + click to select a range
461714d
SPARK-10807. Added as.data.frame as a synonym for collect().
e9e34b5
Removed operator %++%, which is a synonym for paste()
c65b682
Removed extra blank space.
cee871c
Removed extra spaces to comply with R style
0851163
Moved setGeneric declaration to generics.R.
7a8e62a
Added test cases for as.data.frame
de6d164
Merge remote-tracking branch 'origin/SPARK-10807' into SPARK-10807
a346cc6
Changed setMethod declaration to comply with standard
6c4dcbc
Removed changes to .gitignore
99e6304
Merge remote-tracking branch 'upstream/master'
30c5d26
coltypes
4a92d99
Merged
a68f97a
coltypes
360156c
coltypes
0c2da6c
Added more types. Scala types that can't be mapped to R will remain a…
909e4e3
Removed white space
3cd2079
Added more tests
a7723d9
Fixed typo
0a0b278
Fixed typo
7e89935
Moved coltypes to new file types.R and refactored schema.R
21c0799
Updated DESCRIPTION file to add types.R
fee5a2e
Updated DESCRIPTION file
e1056ab
Coding style for setGeneric definition
75f5ced
Coding style
908abf4
Coding style
37bdc46
Fixed data type mapping. Put data types in an environment for more ef…
3b5c2d5
Removed unnecessary cat
001884a
Removed white space
eaaf178
Removed blank space
9a9618e
Update DataFrame.R
25faa4e
Update types.R
57a47a4
Update types.R
e5ab466
Update DataFrame.R
772de99
Added tests for complex types
67b12a4
Update types.R
0bb39dc
Update types.R
8aa13ef
Update test_sparkSQL.R
9b36955
Removed for loop
95a8ece
Update DataFrame.R
462b1f1
Update DataFrame.R
cd033c0
Removed blank space
ba091fb
Merge tests and description files
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Update DataFrame.R
- Loading branch information
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2179,18 +2179,30 @@ setMethod("coltypes", | |
| ) | ||
|
|
||
| # Get the data types of the DataFrame by invoking dtypes() function | ||
| types <- lapply(dtypes(x), function(x) {x[[2]]}) | ||
| types <- sapply(dtypes(x), function(x) {x[[2]]}) | ||
|
|
||
| # Map Spark data types into R's data types using DATA_TYPES environment | ||
| rTypes <- lapply(types, function(x) { | ||
| if (exists(x, envir=DATA_TYPES)) { | ||
| get(x, envir=DATA_TYPES) | ||
| } else { | ||
| stop(paste("Unsupported data type: ", x)) | ||
| rTypes <- sapply(types, USE.NAMES=F, FUN=function(x) { | ||
|
|
||
| # Check for primitive types | ||
| type <- PRIMITIVE_TYPES[[x]] | ||
| if (is.null(type)) { | ||
| # Check for complex types | ||
| for (t in names(COMPLEX_TYPES)) { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Try to eliminate for loop in R, could you try something like:
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Or ? |
||
| if (substring(x, 1, nchar(t)) == t) { | ||
| type <- COMPLEX_TYPES[[t]] | ||
| break | ||
| } | ||
| } | ||
|
|
||
| if (is.null(type)) { | ||
| stop(paste("Unsupported data type: ", x)) | ||
| } | ||
| } | ||
| type | ||
| }) | ||
|
|
||
| # Find which types could not be mapped | ||
| # Find which types don't have mapping to R | ||
| naIndices <- which(is.na(rTypes)) | ||
|
|
||
| # Assign the original scala data types to the unmatched ones | ||
|
|
||
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You only handle primitive types here, but no complex types, like Array, Struct and Map.
It would be better you can refactor the type mapping related code here and that in SerDe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sun-rui For complex types (Array/Struct/Map), I can't think of any mapping to R types. Therefore, as agreed with @felixcheung and @shivaram, these will remain the same. For example:
Original column types: ["string", "boolean", "map..."]
Result of coltypes(): ["character", "logical", "map..."]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@olarayej I think the fall back mechanism here is good. But @sun-rui makes another good point that it will be good to have one unified place where we do a mapping from R types to java types. Right now part of that is in serialize.R / deserialize.R
Could you see if there is some refactoring we could do for this to not be duplicated ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sun-rui @shivaram
The notion of coltypes is actually spread in three files: schema.R, serialize.R, deserialize.R.
In file serialize.R, method writeType (see below) turns the full data type into a one-character string. Then, method readTypedObject (see below), uses this one-character type to read accordingly. I suspect this is because complex types could be like map (String,String)?
In my opinion, it would be better to use the full data type, as opposed to the first letter (which could be especially confusing since we support data types starting with the same letter Date/Double, String/Struct). Also, having the full data type would allow for centralizing the data types in one place, though this would require some major changes
We could have mapping arrays:
PRIMITIVE_TYPES <- c("string"="character",
"long"="integer",
"tinyint"="integer",
"short"="integer",
"integer"="integer",
"byte"="integer",
"double"="numeric",
"float"="numeric",
"decimal"="numeric",
"boolean"="logical")
COMPLEX_TYPES <- c("map", "array", "struct", ...)
DATA_TYPES <- c(PRIMITIVE_TYPES, COMPLEX_TYPES)
And then we'd need to modify deserialize.R, serialize.R, and schema.R to acknowledge these accordingly.
Thoughts?
writeType <- function(con, class) {
type <- switch(class,
NULL = "n",
integer = "i",
character = "c",
logical = "b",
double = "d",
numeric = "d",
raw = "r",
array = "a",
list = "l",
struct = "s",
jobj = "j",
environment = "e",
Date = "D",
POSIXlt = "t",
POSIXct = "t",
stop(paste("Unsupported type for serialization", class)))
writeBin(charToRaw(type), con)
}
readTypedObject <- function(con, type) {
switch (type,
"i" = readInt(con),
"c" = readString(con),
"b" = readBoolean(con),
"d" = readDouble(con),
"r" = readRaw(con),
"D" = readDate(con),
"t" = readTime(con),
"a" = readArray(con),
"l" = readList(con),
"e" = readEnv(con),
"s" = readStruct(con),
"n" = NULL,
"j" = getJobj(readString(con)),
stop(paste("Unsupported type for deserialization", type)))
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The single character names are to reduce the amount of data serialized when we transfer these data types to the JVM. Its not meant to be remembered by anybody so I don't see it being a source of confusion. @sun-rui also added tests which ensure these mappings don't break.
However I think having a list of primitive types, complex types and mapping in a common file (types.R ?) sounds good to me.