
Conversation

@viirya
Member

@viirya viirya commented Dec 24, 2018

What changes were proposed in this pull request?

Spark SQL doesn't support creating a partitioned table with Hive CTAS in SQL syntax. However, it is supported through the DataFrameWriter API:

```scala
val df = Seq(("a", 1)).toDF("part", "id")
df.write.format("hive").partitionBy("part").saveAsTable("t")
```

Hive supports this syntax in newer versions (https://issues.apache.org/jira/browse/HIVE-20241):

```
CREATE TABLE t PARTITIONED BY (part) AS SELECT 1 as id, "a" as part
```

This patch adds support for this syntax to Spark SQL.
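
For illustration, a minimal sketch of exercising the new SQL path from Spark (this is an assumed usage example, not part of the patch; `spark` is taken to be a Hive-enabled SparkSession, and the table/column names follow the example above):

```scala
// Sketch only; assumes `spark` is a Hive-enabled SparkSession.
spark.sql(
  """CREATE TABLE t PARTITIONED BY (part)
    |AS SELECT 1 AS id, "a" AS part""".stripMargin)

// The partition column should show up in the table description.
spark.sql("DESC EXTENDED t").show(truncate = false)
```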

How was this patch tested?

Added tests.

@SparkQA

SparkQA commented Dec 24, 2018

Test build #100423 has finished for PR 23376 at commit 2ea2a4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Dec 24, 2018

cc @cloud-fan

```diff
 if (tableDesc.partitionColumnNames.nonEmpty) {
   val errorMessage = "A Create Table As Select (CTAS) statement is not allowed to " +
-    "create a partitioned table using Hive's file formats. " +
+    "create a partitioned table using Hive's file formats by specifying table schema. " +
```
Contributor

What does Hive report for this case?

Member Author

Hive 3.2.0:

```
hive> CREATE TABLE t PARTITIONED BY (part string) AS SELECT id, part FROM src;
FAILED: SemanticException [Error 10068]: CREATE-TABLE-AS-SELECT does not support partitioning in the target table
```

Contributor

how about

Create Partitioned Table As Select cannot specify data type for the partition columns of the target table.

@SparkQA

SparkQA commented Dec 25, 2018

Test build #100435 has finished for PR 23376 at commit 934d6f1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Dec 25, 2018

retest this please.

@SparkQA

SparkQA commented Dec 25, 2018

Test build #100438 has finished for PR 23376 at commit 934d6f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
if (schema.nonEmpty) {
  operationNotAllowed(
    "Schema may not be specified in a Create Table As Select (CTAS) statement",
    ctx)
```
Contributor

I think this check should go first.

Member Author

Oh, because `val schema = StructType(dataCols ++ partitionCols)` is defined, if this check goes first it will shadow the next check `if (tableDesc.partitionColumnNames.nonEmpty)`.

Contributor

then can we check `dataCols` directly?

Member Author

Ok. That's good.
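
To make the effect of these checks concrete, here is a hedged sketch of statements that should be rejected on the Hive CTAS path once the checks above are in place (`expectRejected` is a hypothetical helper written only for this illustration; `spark` is assumed to be a Hive-enabled SparkSession, and the exact error texts are the PR's):

```scala
import org.apache.spark.sql.catalyst.parser.ParseException

// Hypothetical helper: run a statement and report whether the parser rejects it.
def expectRejected(sqlText: String): Unit =
  try {
    spark.sql(sqlText)
    println(s"Unexpectedly succeeded: $sqlText")
  } catch {
    case e: ParseException => println(s"Rejected as expected: ${e.getMessage}")
  }

// Schema may not be specified in a CTAS statement (checked via dataCols).
expectRejected("CREATE TABLE t1 (id INT) AS SELECT 1 AS id")

// Partition columns in a CTAS statement may only be named, not typed (checked via partitionCols).
expectRejected("""CREATE TABLE t2 PARTITIONED BY (part STRING) AS SELECT 1 AS id, "a" AS part""")
```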

""".stripMargin)
checkAnswer(spark.table("t"), Row(1, "a"))

assert(sql("DESC t").collect().containsSlice(
Contributor

A better way to test it: `spark.sessionState.getTable` and check if the partition columns exist in the table metadata.

Member Author

Ok. Changed.
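
A hedged sketch of what that metadata-based assertion could look like inside a Spark test suite (where `spark.sessionState` is accessible); it assumes a table `t` created with `PARTITIONED BY (part)` as in the earlier examples:

```scala
import org.apache.spark.sql.catalyst.TableIdentifier

// Look up the table in the session catalog and assert on its partition columns,
// rather than parsing DESC output.
val tableMeta = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
assert(tableMeta.partitionColumnNames == Seq("part"))
```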


```scala
// When creating partitioned table with CTAS statement, we can't specify data type for the
// partition columns.
if (partitionCols.nonEmpty) {
```
Member Author
@viirya viirya Dec 26, 2018

I changed the check from `tableDesc.partitionColumnNames` to `partitionCols`. They have the same effect here, but `partitionCols` is more accurate and less confusing.

@SparkQA

SparkQA commented Dec 26, 2018

Test build #100449 has finished for PR 23376 at commit 6cd9c2f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 26, 2018

Test build #100447 has finished for PR 23376 at commit 1a3c63c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Dec 26, 2018

retest this please.

@SparkQA

SparkQA commented Dec 26, 2018

Test build #100451 has finished for PR 23376 at commit 6cd9c2f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Dec 26, 2018

retest this please.

@SparkQA

SparkQA commented Dec 26, 2018

Test build #100453 has finished for PR 23376 at commit 6cd9c2f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Dec 26, 2018

Test build #100455 has finished for PR 23376 at commit 6cd9c2f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 26, 2018

Test build #100456 has finished for PR 23376 at commit d56a82a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Dec 27, 2018

retest this please...

@SparkQA

SparkQA commented Dec 27, 2018

Test build #100462 has finished for PR 23376 at commit d56a82a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Dec 27, 2018

Test build #100464 has finished for PR 23376 at commit d56a82a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Dec 27, 2018

It finally passes. :)

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in f89cdec Dec 27, 2018
holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
… by specifying partition column names

## What changes were proposed in this pull request?

Spark SQL doesn't support creating a partitioned table with Hive CTAS in SQL syntax. However, it is supported through the DataFrameWriter API:

```scala
val df = Seq(("a", 1)).toDF("part", "id")
df.write.format("hive").partitionBy("part").saveAsTable("t")
```
Hive supports this syntax in newer versions (https://issues.apache.org/jira/browse/HIVE-20241):

```
CREATE TABLE t PARTITIONED BY (part) AS SELECT 1 as id, "a" as part
```

This patch adds support for this syntax to Spark SQL.

## How was this patch tested?

Added tests.

Closes apache#23376 from viirya/hive-ctas-partitioned-table.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
… by specifying partition column names

IceMimosa pushed a commit to growingio/spark that referenced this pull request Jun 9, 2019
… by specifying partition column names

@viirya viirya deleted the hive-ctas-partitioned-table branch December 27, 2023 18:36