[SPARK-23341][SQL] define some standard options for data source v2 #20535
Conversation
Test build #87168 has finished for PR 20535 at commit
This should move the standard options to
what do you mean by "without any interpretation"?
It means it's a pure string; there is no parsing rule for it the way there is for a SQL identifier. I put some examples below and hopefully they explain it well.
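For example (a hypothetical illustration with a made-up data source name, not taken verbatim from the PR thread), the value of the table option reaches the source exactly as the user typed it; nothing splits "db1.tbl1" on the dot the way a SQL identifier would be resolved:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("table-option-demo").getOrCreate()

// The "table" option is a pure string: the source receives the literal
// "db1.tbl1", not database = "db1" and table = "tbl1".
spark.read
  .format("com.example.mysource")   // hypothetical data source name
  .option("table", "db1.tbl1")
  .load()
```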
I think this is clear with the examples.
It seems odd to me to change this string. While there's no behavior change, the constant is for a key in v2's DataSourceOptions, not for the DataFrameReader API. We could change it to "PATH" and it would be perfectly fine for v2, but would change the behavior here. Such a change is incredibly unlikely, which is why I say it is just "odd".
makes sense, let me change it back.
@rdblue We don't have this problem now, so I'd like to not touch
Test build #87186 has finished for PR 20535 at commit
Test build #87187 has finished for PR 20535 at commit
Test build #87191 has finished for PR 20535 at commit
Test build #87194 has finished for PR 20535 at commit
retest this please
Test build #87207 has finished for PR 20535 at commit
I think it is friendlier when using this from Scala to drop the get prefix and use just path or database.
KEY_PATH sounds like the path for a key. It would be clearer if the name were PATH_KEY.
Also, path is singular, and I don't think it is correct to pass multiple paths as a comma-separated string for it; paths would be a better key. I'd rather see both path and paths supported. paths should also be escaped or encoded to avoid collisions with the , separator.
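One way to sidestep the comma problem (a minimal sketch assuming a JSON-array encoding for the paths value; the encoding choice is an assumption for illustration, not something this comment prescribes, and PATHS_KEY simply mirrors the constant in the snippet below) is to serialize the list of paths into a single string and have the source decode it:

```scala
import com.fasterxml.jackson.databind.ObjectMapper

val mapper = new ObjectMapper()

// Encode: pack the user-supplied paths into one option value without relying
// on "," as a separator, so a path that itself contains a comma survives.
val paths = Seq("/data/a,b/part1", "/data/part2")
val encoded: String = mapper.writeValueAsString(paths.toArray)

// Decode: the data source turns the option value back into individual paths.
val decoded: Array[String] = mapper.readValue(encoded, classOf[Array[String]])
assert(decoded.toSeq == paths)
```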
public String[] paths() {
  String[] singularPath = path().map(s -> new String[]{s}).orElseGet(() -> new String[0]);
  Optional<String> pathsStr = get(PATHS_KEY);
  System.out.println(pathsStr);
remove println :)
Test build #89069 has finished for PR 20535 at commit
Test build #89098 has finished for PR 20535 at commit
retest this please
Test build #89116 has finished for PR 20535 at commit
gengliangwang left a comment
LGTM
retest this please
Test build #89382 has finished for PR 20535 at commit
thanks, merging to master!
  }
  Dataset.ofRows(sparkSession, DataSourceV2Relation.create(
-   ds, extraOptions.toMap ++ sessionOptions,
+   ds, extraOptions.toMap ++ sessionOptions + pathsOption,
Should we throw an exception when extraOptions("path") is not empty?
Basically we may have duplicated entries in session configs and DataFrameReader/Writer options, not just path. The rule is that DataFrameReader/Writer options should overwrite session configs.
cc @jiangxb1987 can you submit a PR to explicitly document it in SessionConfigSupport?
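A minimal sketch of the precedence rule described above (the option keys and values here are made up for illustration; only the "reader/writer options win over session configs" behavior comes from the comment):

```scala
// Options extracted from session configs (e.g. via SessionConfigSupport) and
// options set directly on the DataFrameReader may contain the same keys.
val sessionOptions = Map("compression" -> "gzip", "mode" -> "strict")
val readerOptions  = Map("compression" -> "snappy")

// Putting readerOptions on the right-hand side of ++ lets it overwrite any
// duplicated keys coming from the session configs.
val effectiveOptions = sessionOptions ++ readerOptions
assert(effectiveOptions("compression") == "snappy") // reader option wins
assert(effectiveOptions("mode") == "strict")        // session config kept when not overridden
```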
What changes were proposed in this pull request?
Each data source implementation can define its own options and teach its users how to set them. Spark doesn't have any restrictions on what options a data source should or should not have. It's possible that some options are very common and many data sources use them. However, different data sources may define the common options (key and meaning) differently, which is quite confusing to end users.
This PR defines some standard options that data sources can optionally adopt: path, table and database.
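As a rough end-user sketch (the format name is a placeholder, and which of the standard keys a given source actually honors is up to that source), adopting the standard keys looks like this:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("standard-options-demo").getOrCreate()

// A source that adopts the standard keys receives them under well-known names
// instead of inventing its own variants like "dbtable", "dbName" or "filepath".
val df = spark.read
  .format("com.example.mysource")   // hypothetical data source
  .option("database", "db1")        // standard key: database
  .option("table", "tbl1")          // standard key: table
  .load("/warehouse/db1/tbl1")      // the load path is forwarded to the source as the standard path option
df.show()
```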
How was this patch tested?
A new test case.