[SPARK-15689][SQL] data source v2 read path #19136
Changes from 1 commit
DataSourceV2Options.java:

@@ -23,8 +23,8 @@
 import java.util.Optional;

 /**
- * An immutable case-insensitive string-to-string map, which is used to represent data source
- * options.
+ * An immutable string-to-string map in which keys are case-insensitive. This is used to represent
+ * data source options.
  */
 public class DataSourceV2Options {
Contributor: add a simple test suite for this
   private final Map<String, String> keyLowerCasedMap;
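To make the case-insensitive contract concrete, here is a minimal usage sketch. Only the class name and its case-insensitive behavior come from the diff above; the constructor taking a java.util.Map and the get(String) accessor returning Optional are assumptions suggested by the keyLowerCasedMap field and the java.util.Optional import.

import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

import org.apache.spark.sql.sources.v2.DataSourceV2Options;

public class DataSourceV2OptionsExample {
  public static void main(String[] args) {
    Map<String, String> userOptions = new HashMap<>();
    userOptions.put("PATH", "/tmp/example-data");  // user supplies a mixed-case key

    // Assumed constructor: wraps the map, lower-casing keys internally.
    DataSourceV2Options options = new DataSourceV2Options(userOptions);

    // Assumed accessor: lookups ignore case, so "path" matches "PATH".
    Optional<String> path = options.get("path");
    System.out.println(path.orElse("<unset>"));  // expected output: /tmp/example-data
  }
}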
ReadSupport.java (new file):

@@ -0,0 +1,36 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.sources.v2;

import org.apache.spark.sql.sources.v2.reader.DataSourceV2Reader;

/**
 * A mix-in interface for `DataSourceV2`. Users can implement this interface to provide data reading
 * ability and scan the data from the data source.
 */
public interface ReadSupport {

  /**
   * Creates a `DataSourceV2Reader` to scan the data for this data source.
   *
   * @param options the options for this data source reader, which is an immutable case-insensitive
   *                string-to-string map.
   * @return a reader that implements the actual read logic.
   */
  DataSourceV2Reader createReader(DataSourceV2Options options);
}
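The following is a minimal sketch of a data source wired through this interface. MyTextSource and MyTextReader are hypothetical names, the import path for DataSourceV2 is assumed to be the package shown above, and the reader body is elided because DataSourceV2Reader's abstract methods are not part of this excerpt. Only DataSourceV2, ReadSupport, DataSourceV2Options, and createReader come from the PR.

import org.apache.spark.sql.sources.v2.DataSourceV2;
import org.apache.spark.sql.sources.v2.DataSourceV2Options;
import org.apache.spark.sql.sources.v2.ReadSupport;
import org.apache.spark.sql.sources.v2.reader.DataSourceV2Reader;

// Hypothetical data source that supports reading with schema inference.
public class MyTextSource implements DataSourceV2, ReadSupport {

  @Override
  public DataSourceV2Reader createReader(DataSourceV2Options options) {
    // The options map is case-insensitive, so "path" and "PATH" are the same key.
    // MyTextReader is a placeholder reader that would implement DataSourceV2Reader
    // plus whatever SupportsXXX mix-ins the source needs (see below in this diff).
    return new MyTextReader(options);
  }
}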
ReadSupportWithSchema.java:

@@ -21,11 +21,14 @@
 import org.apache.spark.sql.types.StructType;

 /**
- * A variant of `DataSourceV2` which requires users to provide a schema when reading data. A data
- * source can inherit both `DataSourceV2` and `SchemaRequiredDataSourceV2` if it supports both schema
- * inference and user-specified schemas.
+ * A mix-in interface for `DataSourceV2`. Users can implement this interface to provide data reading
+ * ability and scan the data from the data source.
+ *
+ * This is a variant of `ReadSupport` that accepts user-specified schema when reading data. A data
+ * source can implement both `ReadSupport` and `ReadSupportWithSchema` if it supports both schema
+ * inference and user-specified schema.
  */
-public interface SchemaRequiredDataSourceV2 {
+public interface ReadSupportWithSchema {

Contributor: I still find ReadSupport vs ReadSupportWithSchema pretty confusing. But let's address that separately.

   /**
    * Create a `DataSourceV2Reader` to scan the data for this data source.
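Here is a hedged sketch of the "implement both" case described in the new javadoc. The exact createReader signature of ReadSupportWithSchema is not visible in this excerpt; passing the user-specified schema as an extra StructType argument is an assumption, and MyFlexibleSource, MyTextReader, and inferSchema are hypothetical.

import org.apache.spark.sql.sources.v2.DataSourceV2;
import org.apache.spark.sql.sources.v2.DataSourceV2Options;
import org.apache.spark.sql.sources.v2.ReadSupport;
import org.apache.spark.sql.sources.v2.ReadSupportWithSchema;
import org.apache.spark.sql.sources.v2.reader.DataSourceV2Reader;
import org.apache.spark.sql.types.StructType;

// Hypothetical source supporting both schema inference (ReadSupport) and a
// user-specified schema (ReadSupportWithSchema).
public class MyFlexibleSource implements DataSourceV2, ReadSupport, ReadSupportWithSchema {

  @Override
  public DataSourceV2Reader createReader(DataSourceV2Options options) {
    // Schema inference path: the source works out the schema on its own.
    return new MyTextReader(inferSchema(options), options);
  }

  // Assumed signature: the user-specified schema arrives as a StructType parameter.
  @Override
  public DataSourceV2Reader createReader(StructType schema, DataSourceV2Options options) {
    return new MyTextReader(schema, options);
  }

  private StructType inferSchema(DataSourceV2Options options) {
    return new StructType();  // placeholder for real schema discovery
  }
}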
DataSourceV2Reader.java:

@@ -28,13 +28,15 @@
  * this data source reader.
  *
  * There are mainly 3 kinds of query optimizations:
- * 1. push operators downward to the data source, e.g., column pruning, filter push down, etc.
- * 2. propagate information upward to Spark, e.g., report statistics, report ordering, etc.
- * 3. special scans like columnar scan, unsafe row scan, etc. Note that a data source reader can
- *    implement at most one special scan.
+ * 1. Operators push-down. E.g., filter push-down, required columns push-down(aka column
+ *    pruning), etc. These push-down interfaces are named like `SupportsPushDownXXX`.
+ * 2. Information Reporting. E.g., statistics reporting, ordering reporting, etc. These
+ *    reporting interfaces are named like `SupportsReportingXXX`.
+ * 3. Special scan. E.g, columnar scan, unsafe row scan, etc. Note that a data source reader can
+ *    implement at most one special scan. These scan interfaces are named like `SupportsScanXXX`.
  *
- * Spark first applies all operator push-down optimizations which this data source supports. Then
- * Spark collects information this data source provides for further optimizations. Finally Spark
+ * Spark first applies all operator push-down optimizations that this data source supports. Then
+ * Spark collects information this data source reported for further optimizations. Finally Spark
  * issues the scan request and does the actual data reading.
  */
 public interface DataSourceV2Reader {
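As a rough illustration of the naming convention above, a reader can mix in the concrete interfaces that appear further down in this diff. This is a declaration-level sketch only: DataSourceV2Reader's own scan and schema methods are not shown in this excerpt, so the class is left abstract and its body elided.

import org.apache.spark.sql.sources.v2.reader.DataSourceV2Reader;
import org.apache.spark.sql.sources.v2.reader.SupportsPushDownRequiredColumns;
import org.apache.spark.sql.sources.v2.reader.SupportsReportStatistics;

// Hypothetical reader combining one push-down mix-in (`SupportsPushDownXXX`)
// and one reporting mix-in; per the javadoc above, at most one special-scan
// mix-in (`SupportsScanXXX`) could additionally be implemented.
public abstract class MyTextReader implements DataSourceV2Reader,
    SupportsPushDownRequiredColumns, SupportsReportStatistics {
  // DataSourceV2Reader's required methods and the mix-in methods would be
  // implemented here; they are omitted from this sketch.
}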
SupportsPushDownRequiredColumns.java:

@@ -15,22 +15,25 @@
  * limitations under the License.
  */

-package org.apache.spark.sql.sources.v2.reader.downward;
+package org.apache.spark.sql.sources.v2.reader;

 import org.apache.spark.sql.types.StructType;

 /**
- * A mix-in interface for `DataSourceV2Reader`. Users can implement this interface to only read the
- * required columns/nested fields during scan.
+ * A mix-in interface for `DataSourceV2Reader`. Users can implement this interface to push down
+ * required columns and only read these columns during scan.
  */
-public interface ColumnPruningSupport {
+public interface SupportsPushDownRequiredColumns {

   /**
-   * Apply column pruning w.r.t. the given requiredSchema.
+   * Applies column pruning w.r.t. the given requiredSchema.
    *
-   * Implementation should try its best to prune the unnecessary columns/nested fields, but it's
+   * Implementation should try its best to prune the unnecessary columns or nested fields, but it's
    * also OK to do the pruning partially, e.g., a data source may not be able to prune nested
    * fields, and only prune top-level columns.
+   *
+   * Note that, data source implementations should update `DataSourceReader.readSchema` after
+   * applying column pruning.
    */
   void pruneColumns(StructType requiredSchema);

Contributor: link this to readSchema function

 }
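A hedged sketch of how pruneColumns might interact with the readSchema method referenced in the note above. readSchema itself is only mentioned, not shown, in this excerpt, so its no-argument StructType signature, the currentSchema field, and the fullSchema helper are all assumptions.

import org.apache.spark.sql.sources.v2.reader.DataSourceV2Reader;
import org.apache.spark.sql.sources.v2.reader.SupportsPushDownRequiredColumns;
import org.apache.spark.sql.types.StructType;

// Hypothetical reader that records the pruned schema and reports it back to Spark.
public abstract class MyPruningReader implements DataSourceV2Reader,
    SupportsPushDownRequiredColumns {

  private StructType currentSchema;  // null until Spark pushes down required columns

  @Override
  public void pruneColumns(StructType requiredSchema) {
    // This sketch accepts the pruning wholesale; the contract above also allows
    // partial pruning, e.g. dropping only top-level columns.
    this.currentSchema = requiredSchema;
  }

  // Assumed signature for the readSchema method mentioned in the javadoc: after
  // pruning it must describe exactly the columns the scan will actually return.
  public StructType readSchema() {
    return currentSchema != null ? currentSchema : fullSchema();
  }

  // Placeholder for however the source discovers its complete schema.
  protected abstract StructType fullSchema();
}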
SupportsReportStatistics.java:

@@ -15,12 +15,12 @@
  * limitations under the License.
  */

-package org.apache.spark.sql.sources.v2.reader.upward;
+package org.apache.spark.sql.sources.v2.reader;

 /**
  * A mix in interface for `DataSourceV2Reader`. Users can implement this interface to report
  * statistics to Spark.
  */
-public interface StatisticsSupport {
+public interface SupportsReportStatistics {
   Statistics getStatistics();
 }
Review comment: use an actual link ...
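Finally, a hedged sketch of the statistics mix-in. Only getStatistics and the Statistics type come from the diff; the members of Statistics are not shown in this excerpt, so the returned value is produced by a placeholder estimateStatistics helper, and the import path for Statistics is inferred from it being referenced without an import.

import org.apache.spark.sql.sources.v2.reader.DataSourceV2Reader;
import org.apache.spark.sql.sources.v2.reader.Statistics;
import org.apache.spark.sql.sources.v2.reader.SupportsReportStatistics;

// Hypothetical reader that reports statistics so Spark can use them during
// planning, before it issues the actual scan request.
public abstract class MyStatsReader implements DataSourceV2Reader, SupportsReportStatistics {

  @Override
  public Statistics getStatistics() {
    // estimateStatistics() stands in for whatever cheap metadata lookup the
    // source can do (file sizes, catalog row counts, ...).
    return estimateStatistics();
  }

  protected abstract Statistics estimateStatistics();
}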