Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

Each data source implementation can define its own options and teach its users how to set them. Spark doesn't place any restrictions on what options a data source should or should not have. It's possible that some options are very common and many data sources use them. However, different data sources may define these common options (key and meaning) differently, which is quite confusing to end users.

This PR defines some standard options that data sources can optionally adopt: `path`, `table` and `database`.
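For illustration, the idea behind such standard options can be sketched as a small case-insensitive options map. The key strings `path`, `table` and `database` come from this PR; the class and method names below are hypothetical, not Spark's actual `DataSourceOptions` API:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch only: a case-insensitive options map exposing the
// standard keys this PR proposes. Names other than the key strings themselves
// are hypothetical, not Spark's real API.
class StandardOptions {
    static final String PATH_KEY = "path";
    static final String TABLE_KEY = "table";
    static final String DATABASE_KEY = "database";

    private final Map<String, String> map = new HashMap<>();

    StandardOptions(Map<String, String> options) {
        // Option keys are treated case-insensitively, matching the semantics
        // of DataSourceOptions.
        options.forEach((k, v) -> map.put(k.toLowerCase(Locale.ROOT), v));
    }

    Optional<String> get(String key) {
        return Optional.ofNullable(map.get(key.toLowerCase(Locale.ROOT)));
    }

    Optional<String> path() { return get(PATH_KEY); }
    Optional<String> table() { return get(TABLE_KEY); }
    Optional<String> database() { return get(DATABASE_KEY); }
}
```

A data source that adopts these keys can then read them uniformly instead of inventing its own spelling for the same concept.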

How was this patch tested?

A new test case.

@cloud-fan
Contributor Author

cc @rxin @rdblue @gatorsmile

@SparkQA

SparkQA commented Feb 7, 2018

Test build #87168 has finished for PR 20535 at commit c9009d8.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Feb 7, 2018

This should move the standard options to DataSourceV2Relation to avoid needing to instantiate DataSourceOptions wherever the relation is created.

Contributor

what do you mean by "without any interpretation"?

Contributor Author

It means it's a pure string; there is no parsing rule for it like there is for a SQL identifier. I put some examples below and hopefully they explain it well.

Contributor

I think this is clear with the examples.

Contributor

It seems odd to me to change this string. While there's no behavior change, the constant is for a key in v2's DataSourceOptions, not for the DataFrameReader API. We could change it to "PATH" and it would be perfectly fine for v2, but would change the behavior here. Such a change is incredibly unlikely, which is why I say it is just "odd".

Contributor Author

makes sense, let me change it back.

@cloud-fan
Contributor Author

> This should move the standard options to DataSourceV2Relation to avoid needing to instantiate DataSourceOptions wherever the relation is created.

@rdblue We don't have this problem now, so I'd rather not touch DataSourceV2Relation here and revisit it when the problem actually arises.

@SparkQA

SparkQA commented Feb 8, 2018

Test build #87186 has finished for PR 20535 at commit 86bcda9.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2018

Test build #87187 has finished for PR 20535 at commit 3e8f71b.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2018

Test build #87191 has finished for PR 20535 at commit 6644e49.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2018

Test build #87194 has finished for PR 20535 at commit e92b6b2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 8, 2018

Test build #87207 has finished for PR 20535 at commit e92b6b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

I think it is friendlier when using this from Scala to drop the `get` prefix and use just `path` or `database`.

Contributor

rdblue commented Feb 21, 2018

`KEY_PATH` sounds like the path for a key. It would be clearer if the name was `PATH_KEY`.

Also, `path` is singular, and I don't think it is correct to pass multiple paths as a comma-separated string for it. `paths` would be a better key. I'd rather see both `path` and `paths` supported. `paths` should also be escaped or encoded to avoid collisions with `,`.

```java
public String[] paths() {
    String[] singularPath = path().map(s -> new String[]{s}).orElseGet(() -> new String[0]);
    Optional<String> pathsStr = get(PATHS_KEY);
    System.out.println(pathsStr);
```

Member

remove println :)
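One way to implement the escaping rdblue suggests above is a simple backslash scheme. This is only a sketch of the idea; `PathsCodec` and its methods are hypothetical, not the encoding the PR actually adopts:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: pack multiple paths into a single "paths" option value while
// escaping "," so path contents can't collide with the separator.
// Hypothetical helper, not Spark's actual implementation.
class PathsCodec {
    static String encode(String[] paths) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < paths.length; i++) {
            if (i > 0) sb.append(',');
            // Escape backslashes first, then commas.
            sb.append(paths[i].replace("\\", "\\\\").replace(",", "\\,"));
        }
        return sb.toString();
    }

    static List<String> decode(String encoded) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '\\' && i + 1 < encoded.length()) {
                current.append(encoded.charAt(++i)); // unescape the next char
            } else if (c == ',') {
                result.add(current.toString());     // unescaped comma: split here
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        result.add(current.toString());
        return result;
    }
}
```

With this scheme a literal comma inside a path survives the round trip, which a naive `String.split(",")` would not guarantee.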

@SparkQA

SparkQA commented Apr 9, 2018

Test build #89069 has finished for PR 20535 at commit c811d72.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 10, 2018

Test build #89098 has finished for PR 20535 at commit c5e403c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Apr 10, 2018

Test build #89116 has finished for PR 20535 at commit c5e403c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

gengliangwang left a comment

LGTM

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Apr 16, 2018

Test build #89382 has finished for PR 20535 at commit c5e403c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

thanks, merging to master!

@asfgit asfgit closed this in 310a8cd Apr 18, 2018
```diff
     }
     Dataset.ofRows(sparkSession, DataSourceV2Relation.create(
-      ds, extraOptions.toMap ++ sessionOptions,
+      ds, extraOptions.toMap ++ sessionOptions + pathsOption,
```
Member

Should we throw an exception when extraOptions("path") is not empty?

Contributor Author

Basically we may have duplicated entries in session configs and DataFrameReader/Writer options, not only `path`. The rule is that DataFrameReader/Writer options overwrite session configs.

cc @jiangxb1987 can you submit a PR to explicitly document it in SessionConfigSupport?
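The precedence rule described above can be sketched as a plain map overlay; the class and method names below are illustrative only, not Spark's internals:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the stated rule: start from the session configs, then overlay the
// per-operation DataFrameReader/Writer options so the latter win on duplicate
// keys. Hypothetical helper, for illustration only.
class OptionMerge {
    static Map<String, String> effectiveOptions(
            Map<String, String> sessionConfigs,
            Map<String, String> readerOptions) {
        Map<String, String> merged = new HashMap<>(sessionConfigs);
        merged.putAll(readerOptions); // reader/writer options overwrite session configs
        return merged;
    }
}
```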

jzhuge pushed a commit to jzhuge/spark that referenced this pull request Mar 7, 2019

Author: Wenchen Fan <[email protected]>

Closes apache#20535 from cloud-fan/options.

(cherry picked from commit 1ada3435b148001323b5eeb25881bbbc83d8fda2)

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
rdblue pushed a commit to rdblue/spark that referenced this pull request Apr 3, 2019

Author: Wenchen Fan <[email protected]>

Closes apache#20535 from cloud-fan/options.
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Oct 15, 2019

Author: Wenchen Fan <[email protected]>

Closes apache#20535 from cloud-fan/options.

(cherry picked from commit 1ada3435b148001323b5eeb25881bbbc83d8fda2)

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala