Conversation

@jzhuge (Member) commented Jun 27, 2019

What changes were proposed in this pull request?

Support multiple catalogs in the following InsertInto use cases:

  • DataFrameWriter.insertInto("catalog.db.tbl")

Support matrix:

| SaveMode  | Partitioned Table | Partition Overwrite Mode | Action                       |
|-----------|-------------------|--------------------------|------------------------------|
| Append    | *                 | *                        | AppendData                   |
| Overwrite | no                | *                        | OverwriteByExpression(true)  |
| Overwrite | yes               | STATIC                   | OverwriteByExpression(true)  |
| Overwrite | yes               | DYNAMIC                  | OverwritePartitionsDynamic   |
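A minimal sketch of the use cases in the matrix. The catalog name `testcat` and table `testcat.db.tbl` are hypothetical; this assumes a v2 catalog registered under `spark.sql.catalog.testcat` and an existing partitioned table:

```scala
import spark.implicits._

// Hypothetical setup: a v2 catalog "testcat" and a partitioned table testcat.db.tbl.
val df = spark.range(10).withColumn("part", $"id" % 2)

// SaveMode.Append -> planned as AppendData.
df.write.mode("append").insertInto("testcat.db.tbl")

// SaveMode.Overwrite on a partitioned table with dynamic overwrite mode
// -> planned as OverwritePartitionsDynamic.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write.mode("overwrite").insertInto("testcat.db.tbl")
```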

How was this patch tested?

New tests.
All existing catalyst and sql/core tests.

@SparkQA commented Jun 27, 2019

Test build #106958 has finished for PR 24980 at commit 9de941e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 27, 2019

Test build #106971 has finished for PR 24980 at commit 34cd710.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 27, 2019

Test build #106972 has finished for PR 24980 at commit eaf2336.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 3, 2019

Test build #107140 has finished for PR 24980 at commit fb45cb1.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class KafkaTable extends Table with SupportsRead with SupportsWrite
  • public final class ColumnarBatch implements AutoCloseable
  • case class DummyExpressionHolder(exprs: Seq[Expression]) extends LeafNode
  • abstract class QuaternaryExpression extends Expression
  • case class CheckOverflow(
  • case class Overlay(input: Expression, replace: Expression, pos: Expression, len: Expression)
  • case class MapPartitionsInPandas(
  • class ColumnarRule
  • case class ColumnarToRowExec(child: SparkPlan)
  • case class RowToColumnarExec(child: SparkPlan) extends UnaryExecNode
  • case class ApplyColumnarRulesAndInsertTransitions(conf: SQLConf, columnarRules: Seq[ColumnarRule])
  • case class InputAdapter(child: SparkPlan, isChildColumnar: Boolean)
  • case class MapPartitionsInPandasExec(

@SparkQA commented Jul 3, 2019

Test build #107143 has finished for PR 24980 at commit 3c3aa04.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member):

Retest this please.

@SparkQA commented Jul 8, 2019

Test build #107313 has finished for PR 24980 at commit 3c3aa04.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 26, 2019

Test build #108235 has finished for PR 24980 at commit 65a0d43.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jzhuge (Member, Author) commented Jul 27, 2019

@dongjoon-hyun @brkyvz @cloud-fan @rdblue This PR is ready for review. It is a follow-up to DSv2 INSERT INTO.


```scala
assertNotBucketed("insertInto")

if (partitioningColumns.isDefined) {
```
Contributor:

Shall we move these two checks to the public insertInto method, instead of duplicating them in the two private methods?

```scala
}
}
```
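A sketch of the refactoring suggested above — hoisting the shared preconditions into the public entry point so the two private code paths need not repeat them. The error message and private dispatch here are illustrative, not the PR's exact code:

```scala
def insertInto(tableName: String): Unit = {
  // Checked once here instead of in each private overload.
  assertNotBucketed("insertInto")
  if (partitioningColumns.isDefined) {
    throw new AnalysisException(
      "insertInto() can't be used together with partitionBy(); " +
        "the table's partition columns are already defined.")
  }
  // ... dispatch to the v1 or v2 private insertInto implementation ...
}
```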

```scala
test("insertInto: append partitioned table - dynamic clause") {
```
Contributor:

what do you mean by dynamic clause?

Member (Author):

This is a copy-paste issue. I will remove " - dynamic clause" from the title. insertInto does not have anything similar to INSERT INTO's PARTITION clause.

@SparkQA commented Jul 29, 2019

Test build #108338 has finished for PR 24980 at commit c4eeee5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan cloud-fan closed this in 749b1d3 Jul 30, 2019
@jzhuge (Member, Author) commented Jul 30, 2019

Thanks @cloud-fan !


```scala
val command = modeForDSV2 match {
  case SaveMode.Append =>
    AppendData.byName(table, df.logicalPlan)
```
Contributor:

I missed it. If you look at the doc of insertInto, it says

   * Inserts the content of the `DataFrame` to the specified table. It requires that
   * the schema of the `DataFrame` is the same as the schema of the table.
   *
   * @note Unlike `saveAsTable`, `insertInto` ignores the column names and just uses position-based
   * resolution. For example:

We should use byPosition here.
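A sketch of the fix under discussion — switching the Append case to position-based resolution, with the other SaveMode cases elided:

```scala
val command = modeForDSV2 match {
  case SaveMode.Append =>
    // byPosition instead of byName: insertInto ignores the DataFrame's column
    // names and matches columns to the table schema by ordinal.
    AppendData.byPosition(table, df.logicalPlan)
  // other modes unchanged
}
```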

Contributor:

Agreed. This is an oversight and should be by position.

Member (Author):

Thanks @cloud-fan @rdblue. I will submit a hotfix.

Member (Author):

I can create a follow-up PR to introduce an option matchByName, defaulting to false. If true, insertInto uses byName; otherwise, byPosition.

Maybe even included in this PR?

Contributor:

If we are going to create a new dataframe writer API in the future, I'd like to keep it as it is, and always do by-position in this insertInto.

Contributor:

> If we are going to create a new dataframe writer API in the future, I'd like to keep it as it is, and always do by-position in this insertInto.

Sounds good to me. I'll submit a PR for the new API.

Member (Author):

@rdblue @cloud-fan @dongjoon-hyun #25353 is ready for review.


5 participants