Conversation

@jzhuge (Member) commented Jun 27, 2019

What changes were proposed in this pull request?

Support multiple catalogs in the following InsertInto use cases:

  • DataFrameWriter.insertInto("catalog.db.tbl")

Support matrix:

| SaveMode  | Partitioned Table | Partition Overwrite Mode | Action                       |
|-----------|-------------------|--------------------------|------------------------------|
| Append    | *                 | *                        | AppendData                   |
| Overwrite | no                | *                        | OverwriteByExpression(true)  |
| Overwrite | yes               | STATIC                   | OverwriteByExpression(true)  |
| Overwrite | yes               | DYNAMIC                  | OverwritePartitionsDynamic   |
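A minimal sketch of the use cases in the matrix. The catalog name `testcat` and table `testcat.db.tbl` are hypothetical; this assumes a v2 catalog registered under `spark.sql.catalog.testcat` and an existing partitioned table:

```scala
import spark.implicits._

// Hypothetical setup: a v2 catalog "testcat" and a partitioned table testcat.db.tbl.
val df = spark.range(10).withColumn("part", $"id" % 2)

// SaveMode.Append -> planned as AppendData.
df.write.mode("append").insertInto("testcat.db.tbl")

// SaveMode.Overwrite on a partitioned table with dynamic overwrite mode
// -> planned as OverwritePartitionsDynamic.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write.mode("overwrite").insertInto("testcat.db.tbl")
```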

How was this patch tested?

New tests.
All existing catalyst and sql/core tests.

@SparkQA commented Jun 27, 2019

Test build #106958 has finished for PR 24980 at commit 9de941e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 27, 2019

Test build #106971 has finished for PR 24980 at commit 34cd710.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 27, 2019

Test build #106972 has finished for PR 24980 at commit eaf2336.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 3, 2019

Test build #107140 has finished for PR 24980 at commit fb45cb1.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class KafkaTable extends Table with SupportsRead with SupportsWrite
  • public final class ColumnarBatch implements AutoCloseable
  • case class DummyExpressionHolder(exprs: Seq[Expression]) extends LeafNode
  • abstract class QuaternaryExpression extends Expression
  • case class CheckOverflow(
  • case class Overlay(input: Expression, replace: Expression, pos: Expression, len: Expression)
  • case class MapPartitionsInPandas(
  • class ColumnarRule
  • case class ColumnarToRowExec(child: SparkPlan)
  • case class RowToColumnarExec(child: SparkPlan) extends UnaryExecNode
  • case class ApplyColumnarRulesAndInsertTransitions(conf: SQLConf, columnarRules: Seq[ColumnarRule])
  • case class InputAdapter(child: SparkPlan, isChildColumnar: Boolean)
  • case class MapPartitionsInPandasExec(

@SparkQA commented Jul 3, 2019

Test build #107143 has finished for PR 24980 at commit 3c3aa04.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member):

Retest this please.

@SparkQA commented Jul 8, 2019

Test build #107313 has finished for PR 24980 at commit 3c3aa04.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 26, 2019

Test build #108235 has finished for PR 24980 at commit 65a0d43.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jzhuge (Member, Author) commented Jul 27, 2019

@dongjoon-hyun @brkyvz @cloud-fan @rdblue This PR is ready for review. It is a follow-up to DSv2 INSERT INTO.


```scala
assertNotBucketed("insertInto")

if (partitioningColumns.isDefined) {
```
Contributor:

Shall we move these two checks to the public insertInto method, instead of duplicating them in the two private methods?

```scala
}
}
```
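A sketch of the refactoring suggested above — hoisting the shared preconditions into the public entry point so the two private code paths need not repeat them. The error message and private dispatch here are illustrative, not the PR's exact code:

```scala
def insertInto(tableName: String): Unit = {
  // Checked once here instead of in each private overload.
  assertNotBucketed("insertInto")
  if (partitioningColumns.isDefined) {
    throw new AnalysisException(
      "insertInto() can't be used together with partitionBy(); " +
        "the table's partition columns are already defined.")
  }
  // ... dispatch to the v1 or v2 private insertInto implementation ...
}
```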

```scala
test("insertInto: append partitioned table - dynamic clause") {
```
Contributor:

what do you mean by dynamic clause?

Member (Author):

This is a copy-paste issue. I will remove " - dynamic clause" from the title. insertInto does not have anything similar to INSERT INTO's PARTITION clause.

@SparkQA commented Jul 29, 2019

Test build #108338 has finished for PR 24980 at commit c4eeee5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan cloud-fan closed this in 749b1d3 Jul 30, 2019
@jzhuge (Member, Author) commented Jul 30, 2019

Thanks @cloud-fan !


```scala
val command = modeForDSV2 match {
  case SaveMode.Append =>
    AppendData.byName(table, df.logicalPlan)
```
Contributor:

I missed it. If you look at the doc of insertInto, it says

   * Inserts the content of the `DataFrame` to the specified table. It requires that
   * the schema of the `DataFrame` is the same as the schema of the table.
   *
   * @note Unlike `saveAsTable`, `insertInto` ignores the column names and just uses position-based
   * resolution. For example:

We should use byPosition here.
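A sketch of the fix under discussion — switching the Append case to position-based resolution, with the other SaveMode cases elided:

```scala
val command = modeForDSV2 match {
  case SaveMode.Append =>
    // byPosition instead of byName: insertInto ignores the DataFrame's column
    // names and matches columns to the table schema by ordinal.
    AppendData.byPosition(table, df.logicalPlan)
  // other modes unchanged
}
```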

Contributor:

Agreed. This is an oversight and should be by position.

Member (Author):

Thanks @cloud-fan @rdblue. I will submit a hotfix.

Member (Author):

I can create a follow-up PR to introduce an option matchByName, defaulting to false. If true, insertInto uses byName; otherwise, byPosition.

Maybe even included in this PR?

Contributor:

If we are going to create a new dataframe writer API in the future, I'd like to keep it as it is, and always do by-position in this insertInto.

Contributor:

> If we are going to create a new dataframe writer API in the future, I'd like to keep it as it is, and always do by-position in this insertInto.

Sounds good to me. I'll submit a PR for the new API.

Member (Author):

@rdblue @cloud-fan @dongjoon-hyun #25353 is ready for review.


5 participants