@HyukjinKwon
Member

What changes were proposed in this pull request?

1. Document from_csv(..., schema_of_csv(...)) support:

```R
csv <- "Amsterdam,2018"
df <- sql(paste0("SELECT '", csv, "' as csv"))
head(select(df, from_csv(df$csv, schema_of_csv(csv))))
```

```
    from_csv(csv)
1 Amsterdam, 2018
```

2. Allow from_json(..., schema_of_json(...))

Before:

```R
df2 <- sql("SELECT named_struct('name', 'Bob') as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
head(select(df2, from_json(df2$people_json, schema_of_json(head(df2)$people_json))))
```

```
Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘from_json’ for signature ‘"Column", "Column"’
```

After:

```R
df2 <- sql("SELECT named_struct('name', 'Bob') as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
head(select(df2, from_json(df2$people_json, schema_of_json(head(df2)$people_json))))
```

```
  from_json(people_json)
1                    Bob
```

3. (While I'm here) Allow structType as the schema for from_csv, to match from_json.

Before:

```R
csv <- "Amsterdam,2018"
df <- sql(paste0("SELECT '", csv, "' as csv"))
head(select(df, from_csv(df$csv, structType("city STRING, year INT"))))
```

```
Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘from_csv’ for signature ‘"Column", "structType"’
```

After:

```R
csv <- "Amsterdam,2018"
df <- sql(paste0("SELECT '", csv, "' as csv"))
head(select(df, from_csv(df$csv, structType("city STRING, year INT"))))
```

```
    from_csv(csv)
1 Amsterdam, 2018
```

How was this patch tested?

Tested manually; unit tests were added.

@HyukjinKwon
Member Author

cc @felixcheung, @viirya and @MaxGekk

@SparkQA

SparkQA commented Nov 30, 2018

Test build #99494 has finished for PR 23184 at commit 8877837.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

what if as.json.array is TRUE but schema is also set?

Member Author

If so, the provided schema is wrapped by Array. The test cases are here: https://github.com/apache/spark/pull/23184/files#diff-d4011863c8b176830365b2f224a84bf2R1707

Member

We should probably pull all the setClassUnion calls into one place (to avoid conflicts or duplication).

Member Author

Yup, I agree. Would you mind if I do this separately? A rough grep shows:

./pkg/R/DataFrame.R:setClassUnion("characterOrstructType", c("character", "structType"))
./pkg/R/DataFrame.R:setClassUnion("numericOrcharacter", c("numeric", "character"))
./pkg/R/DataFrame.R:setClassUnion("characterOrColumn", c("character", "Column"))
./pkg/R/DataFrame.R:setClassUnion("numericOrColumn", c("numeric", "Column"))

Member

yes
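
For context on the class unions in the grep output above: a union lets a single S4 method accept several argument types, and a missing entry is what produces the "unable to find an inherited method" errors shown in the PR description. The following is a minimal sketch in base R only. The `Column` class here is a hypothetical stand-in (SparkR's real class wraps a Java object), and `schema_kind` is an illustrative generic that does not exist in SparkR.

```r
library(methods)

# Hypothetical stand-in for SparkR's Column class (assumption: a single
# character slot is enough for this illustration).
setClass("Column", representation(jc = "character"))

# A shared union, defined once, that any generic can dispatch on.
setClassUnion("characterOrColumn", c("character", "Column"))

# Illustrative generic; not part of SparkR.
setGeneric("schema_kind", function(x, schema) standardGeneric("schema_kind"))

# One method covers both a DDL string and a Column expression as the schema.
setMethod("schema_kind", signature(x = "Column", schema = "characterOrColumn"),
          function(x, schema) {
            if (is.character(schema)) "DDL string" else "Column expression"
          })

col <- new("Column", jc = "people_json")
print(schema_kind(col, "name STRING"))
print(schema_kind(col, new("Column", jc = "schema_of_json(...)")))

# A type outside the union has no applicable method, so dispatch fails.
err <- tryCatch(schema_kind(col, 1L), error = function(e) e)
print(conditionMessage(err))
```

Defining each union once, in a single file, keeps two source files from declaring the same union name independently, which is the conflict the comment above is worried about.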

@SparkQA

SparkQA commented Dec 1, 2018

Test build #99547 has finished for PR 23184 at commit c731ad1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@HyukjinKwon
Member Author

Hey, @felixcheung, would you mind taking a look when you're available, please?

@SparkQA

SparkQA commented Dec 20, 2018

Test build #100335 has finished for PR 23184 at commit c731ad1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jschema <- schema@jc
} else if (is.character(schema)) {
if (class(schema) == "structType") {
schema <- callJMethod(schema$job, "toDDL")

Member

schema$jobj? Is there a test for this?

Member

I don't quite get this - is it because from_csv doesn't take StructType?

Member Author

Hm. Weird, this is being tested below with from_csv(df$col, structType("a INT")).

Member Author

Yup, I did this because the from_csv overload that takes a Java Map doesn't accept a StructType:

def from_csv(e: Column, schema: Column, options: java.util.Map[String, String]): Column = {

Member Author

BTW, I have no idea why it works:

> structType("a int")$job
Java ref type org.apache.spark.sql.types.StructType id 6
> structType("a int")$jobj
Java ref type org.apache.spark.sql.types.StructType id 8
> structType("a int")$j
Java ref type org.apache.spark.sql.types.StructType id 10

Looks like that's why it passed. Let me fix it anyway.

Member

R has some funky partial/prefix matching

Member Author

Haha. I haven't been aware of that so far!
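
The partial matching mentioned above can be reproduced in base R without Spark. This is a minimal sketch that assumes the object behaves like a plain list with a field named `jobj`; the string value and the second field are placeholders, not real Java references or SparkR internals.

```r
# Stand-in for a structType-like object: a plain list whose fields are
# placeholders for illustration only.
st <- list(jobj = "StructType-ref", fields = "placeholder")

# `$` on a list partially matches names, so any unique prefix of "jobj"
# resolves to the same element.
full   <- st$jobj
prefix <- st$job
short  <- st$j

# `[[` matches exactly by default, so the same typo returns NULL.
exact <- st[["job"]]

# Partial matching with `[[` must be requested explicitly.
loose <- st[["job", exact = FALSE]]

print(identical(prefix, full))
print(identical(short, full))
print(is.null(exact))
print(identical(loose, full))
```

Setting `options(warnPartialMatchDollar = TRUE)` makes R emit a warning whenever `$` resolves a name by partial match, which is one way to surface typos like `$job` in tests.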

}

if (is.character(schema)) {
jschema <- callJStatic("org.apache.spark.sql.functions", "lit", schema)

Member

Hmm, why does from_json use structType(schema)$jobj when the schema is a character, whereas from_csv uses callJStatic("org.apache.spark.sql.functions", "lit", schema)?

Member Author

Ah, yeah, that looks a bit confusing. It's for a similar reason. Fortunately, from_json has an overload that takes a StructType with a Java Map, so we can call it directly from the R side:

def from_json(e: Column, schema: StructType, options: java.util.Map[String, String]): Column =

@SparkQA

SparkQA commented Dec 28, 2018

Test build #100496 has finished for PR 23184 at commit 66e6290.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

I see - I'm good with this. LGTM. Sounds like the Scala API could be made more consistent, though?

@HyukjinKwon
Member Author

Yeah, the Scala-side change for from_[csv|json](schema_of_[csv|json]()) has already been made.

However, regarding #23184 (comment) and #23184 (comment), yes, the signatures are not consistent.

One possibility considered before was to make the API use only Column throughout (I vaguely discussed this with Wenchen somewhere in an earlier PR), but that would be too breaking a change.
There is also a concern that the functions.scala file is getting too long, so adding overloaded versions of such APIs has been avoided.

I think we should handle this problem across the whole API later.

@HyukjinKwon
Member Author

Merged to master.

Thanks, @felixcheung and @viirya.

@asfgit asfgit closed this in 39a0493 Jan 2, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…n R API

Closes apache#23184 from HyukjinKwon/SPARK-26227-1.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
@HyukjinKwon HyukjinKwon deleted the SPARK-26227-1 branch March 3, 2020 01:20