Prototype: Data Source V2 #10
Changes from 1 commit
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -29,12 +29,7 @@ | |
| @InterfaceStability.Unstable | ||
| public interface CatalystFilterPushDownSupport { | ||
Comment: On `boolean pushDownCatalystFilter(Expression filter);`: This interface is very nice. Just wondering: for a data source to implement this method, is it OK for the implementation to access subtypes of `Expression` in Spark, for example functions like `Abs`?

Owner (Author): Implementations can pattern-match the given expression and access any subtype you like.
||
| /** | ||
| * Push down one filter, returns true if this filter can be pushed down to this data source, | ||
| * false otherwise. This method might be called many times if more than one filter need to be | ||
| * pushed down. | ||
| * | ||
| * TODO: we can also make it `Expression[] pushDownCatalystFilters(Expression[] filters)` which | ||
| * returns unsupported filters. | ||
| * Push down filters, returning the unsupported filters. | ||
| */ | ||
| boolean pushDownCatalystFilter(Expression filter); | ||
| Expression[] pushDownCatalystFilters(Expression[] filters); | ||
| } | ||
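To make the reply above concrete, here is a minimal sketch of a reader mixing in the new interface and pattern-matching on a Catalyst subtype; `ExampleFilterReader`, the `pushedFilters` field, and the choice of `EqualTo` are illustrative assumptions, not part of this PR.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.catalyst.expressions.EqualTo;
import org.apache.spark.sql.catalyst.expressions.Expression;

// Hypothetical reader implementing the proposed interface; only the
// pushdown method is shown.
public class ExampleFilterReader implements CatalystFilterPushDownSupport {

  private final List<Expression> pushedFilters = new ArrayList<>();

  @Override
  public Expression[] pushDownCatalystFilters(Expression[] filters) {
    List<Expression> unsupported = new ArrayList<>();
    for (Expression filter : filters) {
      if (filter instanceof EqualTo) {
        // Concrete Catalyst subtypes (EqualTo, Abs, ...) can be inspected freely.
        pushedFilters.add(filter);
      } else {
        // Everything else is handed back for Spark to evaluate.
        unsupported.add(filter);
      }
    }
    return unsupported.toArray(new Expression[0]);
  }
}
```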
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -35,6 +35,9 @@ public interface ColumnarReadSupport { | |
| * A safety door for the columnar reader. It's possible that the implementation can only support | ||
| * columnar reads for certain columns; users can override this method to fall back to the | ||
| * normal read path under some conditions. | ||
| * | ||
| * Note that if the implementation always returns true here, it can throw an exception in | ||
| * the row-based `DataSourceV2Reader.createReadTasks`, as that method will never be called. | ||
| */ | ||
| default boolean supportsColumnarReads() { | ||
|
Comment: Any chance we could just have a utility method that converts a […]? Do we have a use case for this?

Owner (Author): Yeah, for this one we have a use case: the current vectorized Parquet reader only supports flat schemas, i.e., struct/array/map types are not supported.

Comment: This is a bit strange: shouldn't this return value sometimes depend on the schema?

Owner (Author): The schema is part of the reader's state, so when a reader mixes in this interface, it should know what the current schema is, after column pruning and so on.

Comment: OK, so this is only called after all pushdowns are done? We should specify that.
||
| return true; | ||
|
|
||
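As a sketch of the schema-dependent fallback discussed in the thread, a reader might override the method like this; the class, the `prunedSchema` field, and the assumption that pruning has already happened are all illustrative.

```java
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.MapType;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical columnar reader; the other reader methods are omitted.
public abstract class ExampleColumnarReader implements ColumnarReadSupport {

  // Assumed state: the schema after all pushdowns (e.g. column pruning) are done.
  protected StructType prunedSchema;

  @Override
  public boolean supportsColumnarReads() {
    // Mirror the vectorized Parquet reader's limitation: fall back to the
    // row-based path if any column has a nested type.
    for (StructField field : prunedSchema.fields()) {
      DataType type = field.dataType();
      if (type instanceof StructType || type instanceof ArrayType || type instanceof MapType) {
        return false;
      }
    }
    return true;
  }
}
```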
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,27 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one or more | ||
| * contributor license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright ownership. | ||
| * The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| * (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.spark.sql.sources.v2.reader; | ||
|
|
||
| import java.io.Closeable; | ||
| import java.util.Iterator; | ||
|
|
||
| /** | ||
| * A data reader returned by a read task, responsible for outputting data for an RDD | ||
| * partition. | ||
| */ | ||
| public interface DataReader<T> extends Iterator<T>, Closeable {} | ||
|
Comment: Why is this an `Iterator`? Don't do this... use explicit `next()`, with `close()`.

Comment: I think it makes sense to use something like […].
||
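For comparison, the explicit-iteration shape the reviewers suggest might look like the following; the name `ExplicitDataReader` is made up for illustration.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical alternative to extending java.util.Iterator: explicit
// next()/get(), plus close() for resource cleanup.
public interface ExplicitDataReader<T> extends Closeable {

  /** Proceed to the next record; returns false if there are no more records. */
  boolean next() throws IOException;

  /** Return the current record; returns the same value until next() is called. */
  T get();
}
```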
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -20,8 +20,8 @@ | |
| import java.io.Serializable; | ||
|
|
||
| /** | ||
| * A read task returned by a data source reader and is responsible for outputting data for an RDD | ||
| * partition. | ||
| * A read task returned by a data source reader, responsible for creating the data reader. | ||
| * The relationship between `ReadTask` and `DataReader` is similar to `Iterable` and `Iterator`. | ||
| */ | ||
| public interface ReadTask<T> extends Serializable { | ||
|
Comment: Why not use standard interfaces like […]?

Comment: With these kinds of Serializable interfaces, it can be a pain to implement, because you end up needing to make all of your data access objects serializable (or construct them all in the open method, which is also quite sad, for reasons @rdblue notes). In data sources V1, we've used a pattern where you include a serializable data structure that contains enough information to construct your objects properly (so, for example, the params map is serializable). Ideally we could have something similar here; what if […]?

Owner (Author): I agree that we should make […].

Comment: Can we just return an object with an interface like `interface Reader<T> extends Spliterator<T>, Closeable {}`? (`Spliterator` is a much nicer interface than `Iterator` to implement.) Or is the issue here that we can't use Java 8?

Comment: I would use […].
||
| /** | ||
|
|
@@ -33,25 +33,5 @@ default String[] preferredLocations() { | |
| return new String[0]; | ||
| } | ||
|
|
||
| /** | ||
| * This method will be called before running this read task, users can overwrite this method | ||
| * and put initialization logic here. | ||
| */ | ||
| default void open() {} | ||
|
|
||
| /** | ||
| * Proceed to next record, returns false if there is no more records. | ||
| */ | ||
| boolean next(); | ||
|
|
||
| /** | ||
| * Return the current record. This method should return same value until `next` is called. | ||
| */ | ||
| T get(); | ||
|
|
||
| /** | ||
| * This method will be called after finishing this read task, users can overwrite this method | ||
| * and put clean up logic here. | ||
| */ | ||
| default void close() {} | ||
| DataReader<T> getReader(); | ||
| } | ||
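A minimal sketch of the serializable-configuration pattern described in the thread: the task ships only a small map of parameters, and anything heavyweight is created inside `getReader()` on the executor. The class name, the `params` map, and the empty stand-in iterator are assumptions.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;

import org.apache.spark.sql.Row;

// Hypothetical read task: only lightweight, serializable state crosses the wire.
public class ExampleReadTask implements ReadTask<Row> {

  // Enough serializable information to reconstruct the reader on an executor.
  private final HashMap<String, String> params;

  public ExampleReadTask(HashMap<String, String> params) {
    this.params = params;
  }

  @Override
  public DataReader<Row> getReader() {
    // Connections, file handles, etc. would be opened here on the executor,
    // never serialized from the driver.
    return new DataReader<Row>() {
      private final Iterator<Row> rows = Collections.emptyIterator(); // stand-in for real data

      @Override
      public boolean hasNext() { return rows.hasNext(); }

      @Override
      public Row next() { return rows.next(); }

      @Override
      public void close() throws IOException { /* release resources here */ }
    };
  }
}
```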
This file was deleted.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -26,5 +26,5 @@ public interface SortPushDown { | |
| * Returns true if the implementation can handle this sorting requirement and save a sort | ||
| * operation on the Spark side. | ||
|
Comment: Like the discussion on hash support, I would rather see this the other way around, where the data source reports its underlying sort order. Maybe there are some sources that can perform a sort before passing the data to Spark, but it's hard to know when they should. I think the more useful case is when the data is already sorted.

Comment: +1

Owner (Author): OK, I agree. Shall we distinguish between global sort and per-partition sort?

Comment: I'd leave this out of the first iteration and add it later, but some thoughts: specify the preferred sort orders, and the data source can return the sortedness.

Comment: A per-partition sort would be great. We usually can't handle a global sort, but there are some per-partition sorts (by "partition" here I mean SQL-style partitioning) we can handle. Like in this case: if pkey is the C* partition key, then there is a natural ordering we can take advantage of when returning values.

Comment: Note that @brianmhess wrote that SQL; I don't know SQL like he does, and he is great.
||
| */ | ||
| boolean pushDownSort(String[] sortingColumns); | ||
| boolean pushDownSort(String[] sortingColumns, boolean asc, boolean nullFirst); | ||
| } | ||
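For contrast with the pushdown style, here is a sketch of the inverted design the reviewers prefer, where the source declares how its output is already sorted; the interface and all of its names are hypothetical.

```java
// Hypothetical alternative: instead of Spark pushing a sort down, the data
// source reports the ordering that each ReadTask's output already has.
public interface ReportsSortOrder {

  /** Columns by which the output of each ReadTask is already sorted, in order. */
  String[] partitionSortColumns();

  /** True if the per-partition order is ascending. */
  boolean ascending();

  /** True if nulls come before non-null values within each partition. */
  boolean nullsFirst();
}
```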
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,34 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one or more | ||
| * contributor license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright ownership. | ||
| * The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| * (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.spark.sql.sources.v2.reader.distribution; | ||
|
|
||
| /** | ||
| * Represents a distribution where records that share the same values for the `clusteringColumns` | ||
| * will be co-located, meaning they will be produced by the same `ReadTask`. | ||
| */ | ||
| public class ClusteredDistribution { | ||
| private String[] clusteringColumns; | ||
|
|
||
| public ClusteredDistribution(String[] clusteringColumns) { | ||
| this.clusteringColumns = clusteringColumns; | ||
| } | ||
|
|
||
| public String[] getClusteringColumns() { | ||
| return clusteringColumns; | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,25 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one or more | ||
| * contributor license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright ownership. | ||
| * The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| * (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.spark.sql.sources.v2.reader.distribution; | ||
|
|
||
| /** | ||
| * Specifies how data should be distributed when a query is executed in parallel on many machines. | ||
| * | ||
| * Current implementations: `ClusteredDistribution`. | ||
| */ | ||
| public interface Distribution {} |
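To show how the two classes fit together, here is a hypothetical reader advertising a clustered distribution; the `outputDistribution` method and the surrounding class are assumptions, not part of this PR.

```java
// Hypothetical reader advertising co-location: all records with the same
// "userId" value are produced by a single ReadTask.
public class ExampleClusteredReader {

  public Distribution outputDistribution() {
    return new ClusteredDistribution(new String[] {"userId"});
  }
}
```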
Comment: In V2, can we introduce `PlanPushDownSupport`, too?

Comment: Oops, sorry. I found that it's documented as out of scope.