[SPARK-28612][SQL] Add DataFrameWriterV2 API #25354
New file (77 lines added) in `org.apache.spark.sql.catalyst.expressions`:

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.catalyst.expressions

import org.apache.spark.sql.types.{DataType, IntegerType}

/**
 * Base class for expressions that are converted to v2 partition transforms.
 *
 * Subclasses represent abstract transform functions with concrete implementations that are
 * determined by data source implementations. Because the concrete implementation is not known,
 * these expressions are [[Unevaluable]].
 *
 * These expressions are used to pass transformations from the DataFrame API:
 *
 * {{{
 *   df.writeTo("catalog.db.table").partitionedBy($"category", days($"timestamp")).create()
 * }}}
 */
abstract class PartitionTransformExpression extends Expression with Unevaluable {
  override def nullable: Boolean = true
}

/**
 * Expression for the v2 partition transform years.
 */
case class Years(child: Expression) extends PartitionTransformExpression {
  override def dataType: DataType = IntegerType
  override def children: Seq[Expression] = Seq(child)
}

/**
 * Expression for the v2 partition transform months.
 */
case class Months(child: Expression) extends PartitionTransformExpression {
  override def dataType: DataType = IntegerType
  override def children: Seq[Expression] = Seq(child)
}

/**
 * Expression for the v2 partition transform days.
 */
case class Days(child: Expression) extends PartitionTransformExpression {
  override def dataType: DataType = IntegerType
  override def children: Seq[Expression] = Seq(child)
}

/**
 * Expression for the v2 partition transform hours.
 */
case class Hours(child: Expression) extends PartitionTransformExpression {
  override def dataType: DataType = IntegerType
  override def children: Seq[Expression] = Seq(child)
}

/**
 * Expression for the v2 partition transform bucket.
 */
case class Bucket(numBuckets: Literal, child: Expression) extends PartitionTransformExpression {
  override def dataType: DataType = IntegerType
  override def children: Seq[Expression] = Seq(numBuckets, child)
}
```
A review thread on `Bucket`:

**Contributor:** Question about bucketing. I assume Spark will be leveraging this bucketing information in the future to potentially optimize joins and similar operations. If we're leaving the behavior of bucketing to the underlying data source, couldn't that cause potential correctness issues?

**Author:** Unfortunately, Spark can't force sources to use a specific bucket function; a good example is Hive tables, which use a Hive-specific function. So the partitioning transforms are general, and it's up to the source which function to use when bucketing. When we add support for bucketed joins, for example, we will need a way to look up a UDF with the source's bucketing function implementation. The FunctionCatalog interface I proposed does that: Spark would look up the "bucket" function in the table's catalog to get a concrete implementation that it can use to partition the other side of the bucketed join. If that side is also bucketed, we will want a way to compare the two bucket functions to see whether they are the same.

**Author:** Here's the FunctionCatalog PR: #24559
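For context, a bucket transform is requested through the same writer API as the time-based transforms. A minimal sketch, assuming the `bucket` helper function added alongside these expressions (the table name and column are illustrative):

```scala
// assumes: import org.apache.spark.sql.functions._ and import spark.implicits._
// Ask the source to bucket rows by id into 16 buckets; which hash function
// actually backs the transform is left to the data source implementation.
df.writeTo("catalog.db.table")
  .partitionedBy(bucket(16, $"id"))
  .create()
```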
Changes to the existing v2 write logical plans:
```diff
@@ -489,7 +489,7 @@ case class ReplaceTableAsSelect(
   override def tableSchema: StructType = query.schema
   override def children: Seq[LogicalPlan] = Seq(query)
 
-  override lazy val resolved: Boolean = {
+  override lazy val resolved: Boolean = childrenResolved && {
     // the table schema is created from the query schema, so the only resolution needed is to check
     // that the columns referenced by the table's partitioning exist in the query schema
     val references = partitioning.flatMap(_.references).toSet
```
```diff
@@ -507,15 +507,22 @@ case class ReplaceTableAsSelect(
 case class AppendData(
     table: NamedRelation,
     query: LogicalPlan,
+    writeOptions: Map[String, String],
     isByName: Boolean) extends V2WriteCommand
 
 object AppendData {
-  def byName(table: NamedRelation, df: LogicalPlan): AppendData = {
-    new AppendData(table, df, isByName = true)
+  def byName(
+      table: NamedRelation,
+      df: LogicalPlan,
+      writeOptions: Map[String, String] = Map.empty): AppendData = {
+    new AppendData(table, df, writeOptions, isByName = true)
   }
 
-  def byPosition(table: NamedRelation, query: LogicalPlan): AppendData = {
-    new AppendData(table, query, isByName = false)
+  def byPosition(
+      table: NamedRelation,
+      query: LogicalPlan,
+      writeOptions: Map[String, String] = Map.empty): AppendData = {
+    new AppendData(table, query, writeOptions, isByName = false)
   }
 }
```
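To illustrate the new signature: existing call sites compile unchanged because of the `Map.empty` default, while new call sites can pass options explicitly. A hypothetical caller (the option key is made up; `table` and `df` stand in for values a caller would already have):

```scala
// By-name append with per-write options threaded through to the source.
val append = AppendData.byName(
  table,
  df.queryExecution.logical,
  writeOptions = Map("compression" -> "zstd"))
```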
```diff
@@ -526,19 +533,26 @@ case class OverwriteByExpression(
     table: NamedRelation,
     deleteExpr: Expression,
     query: LogicalPlan,
+    writeOptions: Map[String, String],
     isByName: Boolean) extends V2WriteCommand {
   override lazy val resolved: Boolean = outputResolved && deleteExpr.resolved
 }
 
 object OverwriteByExpression {
   def byName(
-      table: NamedRelation, df: LogicalPlan, deleteExpr: Expression): OverwriteByExpression = {
-    OverwriteByExpression(table, deleteExpr, df, isByName = true)
+      table: NamedRelation,
+      df: LogicalPlan,
+      deleteExpr: Expression,
+      writeOptions: Map[String, String] = Map.empty): OverwriteByExpression = {
+    OverwriteByExpression(table, deleteExpr, df, writeOptions, isByName = true)
   }
 
   def byPosition(
-      table: NamedRelation, query: LogicalPlan, deleteExpr: Expression): OverwriteByExpression = {
-    OverwriteByExpression(table, deleteExpr, query, isByName = false)
+      table: NamedRelation,
+      query: LogicalPlan,
+      deleteExpr: Expression,
+      writeOptions: Map[String, String] = Map.empty): OverwriteByExpression = {
+    OverwriteByExpression(table, deleteExpr, query, writeOptions, isByName = false)
   }
 }
```
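For context, this plan backs the conditional overwrite in the new writer API. A sketch of the user-facing call that would produce it, assuming the `overwrite(condition)` and `option` methods on `DataFrameWriterV2` (the option key and condition are illustrative):

```scala
// Overwrite only the rows matching the condition; options set on the
// writer end up in the plan's writeOptions map.
df.writeTo("catalog.db.table")
  .option("target-file-size", "134217728") // illustrative option key
  .overwrite($"date" === "2019-08-01")
```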
```diff
@@ -548,15 +562,22 @@ object OverwriteByExpression {
 case class OverwritePartitionsDynamic(
     table: NamedRelation,
     query: LogicalPlan,
+    writeOptions: Map[String, String],
     isByName: Boolean) extends V2WriteCommand
 
 object OverwritePartitionsDynamic {
-  def byName(table: NamedRelation, df: LogicalPlan): OverwritePartitionsDynamic = {
-    OverwritePartitionsDynamic(table, df, isByName = true)
+  def byName(
+      table: NamedRelation,
+      df: LogicalPlan,
+      writeOptions: Map[String, String] = Map.empty): OverwritePartitionsDynamic = {
+    OverwritePartitionsDynamic(table, df, writeOptions, isByName = true)
   }
 
-  def byPosition(table: NamedRelation, query: LogicalPlan): OverwritePartitionsDynamic = {
-    OverwritePartitionsDynamic(table, query, isByName = false)
+  def byPosition(
+      table: NamedRelation,
+      query: LogicalPlan,
+      writeOptions: Map[String, String] = Map.empty): OverwritePartitionsDynamic = {
+    OverwritePartitionsDynamic(table, query, writeOptions, isByName = false)
   }
 }
```

A review thread on the parameter order:

**Contributor:** Just curious: is it coding style to have the boolean parameter as the last one, like `isByName` here?

**Author:** Yes, it is for style. Boolean parameters should be passed by name, like `isByName = true` in the factory methods above.

**Contributor:** Thanks Ryan 👍
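A minimal illustration of the by-name convention, using the case class from this diff (the `table` and `query` values are placeholders):

```scala
// A bare boolean is opaque at the call site:
OverwritePartitionsDynamic(table, query, Map.empty, true)

// Passed by name, the flag documents itself:
OverwritePartitionsDynamic(table, query, Map.empty, isByName = true)
```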
A review thread on the time-based transforms:

**Contributor:** Hi Ryan, maybe a naive question: why not support granularity down to minutes or seconds?

**Author:** I've not seen an example of a table that requires partitioning down to minutes or seconds. I'm not opposed to adding them, but it seems to me that they would not be very useful and would probably get people who use them into trouble by over-partitioning.

**Contributor:** Got it, thanks. Agreed that if there aren't many use cases for minutes or seconds, we can leave them out.
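So hours remains the finest supported granularity. A short sketch of the available time transforms (helper names mirror the expressions above; the column is illustrative):

```scala
// Coarse-to-fine: years, months, days, hours. Anything finer would tend
// to produce a very large number of very small partitions.
df.writeTo("catalog.db.table")
  .partitionedBy(days($"ts"))   // or years($"ts"), months($"ts"), hours($"ts")
  .create()
```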