-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-26227][R] from_[csv|json] should accept schema_of_[csv|json] in R API #23184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
193d688 to
8877837
Compare
|
cc @felixcheung, @viirya and @MaxGekk |
|
Test build #99494 has finished for PR 23184 at commit
|
sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala
Outdated
Show resolved
Hide resolved
R/pkg/R/functions.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if as.json.array is TRUE but schema is also set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If so, the provided schema is wrapped by Array. The test cases are ...
sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala
Outdated
Show resolved
Hide resolved
R/pkg/R/functions.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should probably try to pull all the setClassUnion in one place. (to avoid conflict or duplication)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I agree.. Would you mind if I do this separately? I roughly checked by grep and looks:
./pkg/R/DataFrame.R:setClassUnion("characterOrstructType", c("character", "structType"))
./pkg/R/DataFrame.R:setClassUnion("numericOrcharacter", c("numeric", "character"))
./pkg/R/DataFrame.R:setClassUnion("characterOrColumn", c("character", "Column"))
./pkg/R/DataFrame.R:setClassUnion("numericOrColumn", c("numeric", "Column"))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
8877837 to
c731ad1
Compare
|
Test build #99547 has finished for PR 23184 at commit
|
|
retest this please |
|
Hey, @felixcheung, mind if I ask to take a look when you're available please? |
|
Test build #100335 has finished for PR 23184 at commit
|
R/pkg/R/functions.R
Outdated
| jschema <- schema@jc | ||
| } else if (is.character(schema)) { | ||
| if (class(schema) == "structType") { | ||
| schema <- callJMethod(schema$job, "toDDL") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
schema$jobj? is there test for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite get this - is it because from_csv doesn't take StructType?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. Weird, this is being tested below from_csv(df$col, structType("a INT")).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I did this because from_csv with Java map doesn't take StructType ..
| def from_csv(e: Column, schema: Column, options: java.util.Map[String, String]): Column = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, I have no idea why it works:
> structType("a int")$job
Java ref type org.apache.spark.sql.types.StructType id 6
> structType("a int")$jobj
Java ref type org.apache.spark.sql.types.StructType id 8
> structType("a int")$j
Java ref type org.apache.spark.sql.types.StructType id 10looks that's why it passed. Let me fix it anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
R has some funky partial/prefix matching
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha. I haven't been aware of that so far!
| } | ||
|
|
||
| if (is.character(schema)) { | ||
| jschema <- callJStatic("org.apache.spark.sql.functions", "lit", schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, why in the case for from_json, if schema is character is structType(schema)$jobj
where for from_csv, is callJStatic("org.apache.spark.sql.functions", "lit", schema)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yea, that looks a bit confusing. It's similar reason. Fortunately, from_json has StructType with Java Map. So we can directly call it from R side.
| def from_json(e: Column, schema: StructType, options: java.util.Map[String, String]): Column = |
|
Test build #100496 has finished for PR 23184 at commit
|
|
I see - I'm good with this. LGTM. sounds like Scala API can be made more consistent though? |
|
Yea, Scala side change is already made about However, about #23184 (comment) and #23184 (comment), yea, the signatures are not consistent .. One possibility that was considered before is to match it to only use I think we should handle this problem for whole APIs later .. |
|
Merged to master. Thanks, @felixcheung and @viirya. |
…n R API
## What changes were proposed in this pull request?
**1. Document `from_csv(..., schema_of_csv(...))` support:**
```R
csv <- "Amsterdam,2018"
df <- sql(paste0("SELECT '", csv, "' as csv"))
head(select(df, from_csv(df$csv, schema_of_csv(csv))))
```
```
from_csv(csv)
1 Amsterdam, 2018
```
**2. Allow `from_json(..., schema_of_json(...))`**
Before:
```R
df2 <- sql("SELECT named_struct('name', 'Bob') as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
head(select(df2, from_json(df2$people_json, schema_of_json(head(df2)$people_json))))
```
```
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘from_json’ for signature ‘"Column", "Column"’
```
After:
```R
df2 <- sql("SELECT named_struct('name', 'Bob') as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
head(select(df2, from_json(df2$people_json, schema_of_json(head(df2)$people_json))))
```
```
from_json(people_json)
1 Bob
```
**3. (While I'm here) Allow `structType` as schema for `from_csv` support to match with `from_json`.**
Before:
```R
csv <- "Amsterdam,2018"
df <- sql(paste0("SELECT '", csv, "' as csv"))
head(select(df, from_csv(df$csv, structType("city STRING, year INT"))))
```
```
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘from_csv’ for signature ‘"Column", "structType"’
```
After:
```R
csv <- "Amsterdam,2018"
df <- sql(paste0("SELECT '", csv, "' as csv"))
head(select(df, from_csv(df$csv, structType("city STRING, year INT"))))
```
```
from_csv(csv)
1 Amsterdam, 2018
```
## How was this patch tested?
Manually tested and unittests were added.
Closes apache#23184 from HyukjinKwon/SPARK-26227-1.
Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?
1. Document
from_csv(..., schema_of_csv(...))support:2. Allow
from_json(..., schema_of_json(...))Before:
After:
3. (While I'm here) Allow
structTypeas schema forfrom_csvsupport to match withfrom_json.Before:
After:
How was this patch tested?
Manually tested and unittests were added.