Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

Each data source implementation can define its own options and teach its users how to set them. Spark doesn't place any restrictions on what options a data source should or should not have. It's possible that some options are very common and many data sources use them. However, different data sources may define these common options (key and meaning) differently, which is quite confusing to end users.

This PR defines some standard options that data sources can optionally adopt: `path`, `table` and `database`.
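For illustration, the idea behind such standard options can be sketched as a small case-insensitive options map. The key strings `path`, `table` and `database` come from this PR; the class and method names below are hypothetical, not Spark's actual `DataSourceOptions` API:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch only: a case-insensitive options map exposing the
// standard keys this PR proposes. Names other than the key strings themselves
// are hypothetical, not Spark's real API.
class StandardOptions {
    static final String PATH_KEY = "path";
    static final String TABLE_KEY = "table";
    static final String DATABASE_KEY = "database";

    private final Map<String, String> map = new HashMap<>();

    StandardOptions(Map<String, String> options) {
        // Option keys are treated case-insensitively, matching the semantics
        // of DataSourceOptions.
        options.forEach((k, v) -> map.put(k.toLowerCase(Locale.ROOT), v));
    }

    Optional<String> get(String key) {
        return Optional.ofNullable(map.get(key.toLowerCase(Locale.ROOT)));
    }

    Optional<String> path() { return get(PATH_KEY); }
    Optional<String> table() { return get(TABLE_KEY); }
    Optional<String> database() { return get(DATABASE_KEY); }
}
```

A data source that adopts these keys can then read them uniformly instead of inventing its own spelling for the same concept.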

How was this patch tested?

A new test case.

@cloud-fan
Contributor Author

cc @rxin @rdblue @gatorsmile

@SparkQA

SparkQA commented Feb 7, 2018

Test build #87168 has finished for PR 20535 at commit c9009d8.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Feb 7, 2018

This should move the standard options to DataSourceV2Relation to avoid needing to instantiate DataSourceOptions wherever the relation is created.

Contributor

what do you mean by "without any interpretation"?

Contributor Author

It means it's a pure string; there is no parsing rule for it like there is for a SQL identifier. I put some examples below and hopefully they explain it well.

Contributor

I think this is clear with the examples.

Contributor

It seems odd to me to change this string. While there's no behavior change, the constant is for a key in v2's DataSourceOptions, not for the DataFrameReader API. We could change it to "PATH" and it would be perfectly fine for v2, but would change the behavior here. Such a change is incredibly unlikely, which is why I say it is just "odd".

Contributor Author

makes sense, let me change it back.

@cloud-fan
Contributor Author

> This should move the standard options to DataSourceV2Relation to avoid needing to instantiate DataSourceOptions wherever the relation is created.

@rdblue We don't have this problem now, so I'd rather not touch DataSourceV2Relation here and revisit it when the problem actually arises.

@SparkQA

SparkQA commented Feb 8, 2018

Test build #87186 has finished for PR 20535 at commit 86bcda9.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2018

Test build #87187 has finished for PR 20535 at commit 3e8f71b.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2018

Test build #87191 has finished for PR 20535 at commit 6644e49.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2018

Test build #87194 has finished for PR 20535 at commit e92b6b2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 8, 2018

Test build #87207 has finished for PR 20535 at commit e92b6b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

I think it is friendlier when using this from Scala to drop the `get` prefix and use just `path` or `database`.

Contributor

rdblue commented Feb 21, 2018

`KEY_PATH` sounds like the path for a key. It would be clearer if the name was `PATH_KEY`.

Also, `path` is singular, and I don't think it is correct to pass multiple paths as a comma-separated string for it. `paths` would be a better key. I'd rather see both `path` and `paths` supported. `paths` should also be escaped or encoded to avoid collisions with `,`.

```java
public String[] paths() {
    String[] singularPath = path().map(s -> new String[]{s}).orElseGet(() -> new String[0]);
    Optional<String> pathsStr = get(PATHS_KEY);
    System.out.println(pathsStr);
```

Member

remove println :)
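One way to implement the escaping rdblue suggests above is a simple backslash scheme. This is only a sketch of the idea; `PathsCodec` and its methods are hypothetical, not the encoding the PR actually adopts:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: pack multiple paths into a single "paths" option value while
// escaping "," so path contents can't collide with the separator.
// Hypothetical helper, not Spark's actual implementation.
class PathsCodec {
    static String encode(String[] paths) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < paths.length; i++) {
            if (i > 0) sb.append(',');
            // Escape backslashes first, then commas.
            sb.append(paths[i].replace("\\", "\\\\").replace(",", "\\,"));
        }
        return sb.toString();
    }

    static List<String> decode(String encoded) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '\\' && i + 1 < encoded.length()) {
                current.append(encoded.charAt(++i)); // unescape the next char
            } else if (c == ',') {
                result.add(current.toString());     // unescaped comma: split here
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        result.add(current.toString());
        return result;
    }
}
```

With this scheme a literal comma inside a path survives the round trip, which a naive `String.split(",")` would not guarantee.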

@SparkQA

SparkQA commented Apr 9, 2018

Test build #89069 has finished for PR 20535 at commit c811d72.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 10, 2018

Test build #89098 has finished for PR 20535 at commit c5e403c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Apr 10, 2018

Test build #89116 has finished for PR 20535 at commit c5e403c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

gengliangwang left a comment

LGTM

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Apr 16, 2018

Test build #89382 has finished for PR 20535 at commit c5e403c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

thanks, merging to master!

@asfgit asfgit closed this in 310a8cd Apr 18, 2018
```diff
     }
     Dataset.ofRows(sparkSession, DataSourceV2Relation.create(
-      ds, extraOptions.toMap ++ sessionOptions,
+      ds, extraOptions.toMap ++ sessionOptions + pathsOption,
```
Member

Should we throw an exception when extraOptions("path") is not empty?

Contributor Author

Basically we may have duplicated entries in session configs and DataFrameReader/Writer options, not only `path`. The rule is that DataFrameReader/Writer options overwrite session configs.

cc @jiangxb1987 can you submit a PR to explicitly document it in SessionConfigSupport?
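The precedence rule described above can be sketched as a plain map overlay; the class and method names below are illustrative only, not Spark's internals:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the stated rule: start from the session configs, then overlay the
// per-operation DataFrameReader/Writer options so the latter win on duplicate
// keys. Hypothetical helper, for illustration only.
class OptionMerge {
    static Map<String, String> effectiveOptions(
            Map<String, String> sessionConfigs,
            Map<String, String> readerOptions) {
        Map<String, String> merged = new HashMap<>(sessionConfigs);
        merged.putAll(readerOptions); // reader/writer options overwrite session configs
        return merged;
    }
}
```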

jzhuge pushed a commit to jzhuge/spark that referenced this pull request Mar 7, 2019

Author: Wenchen Fan <[email protected]>

Closes apache#20535 from cloud-fan/options.

(cherry picked from commit 1ada3435b148001323b5eeb25881bbbc83d8fda2)

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
rdblue pushed a commit to rdblue/spark that referenced this pull request Apr 3, 2019

Author: Wenchen Fan <[email protected]>

Closes apache#20535 from cloud-fan/options.
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Oct 15, 2019

Author: Wenchen Fan <[email protected]>

Closes apache#20535 from cloud-fan/options.

(cherry picked from commit 1ada3435b148001323b5eeb25881bbbc83d8fda2)

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala