
Conversation

@ericl (Contributor) commented Nov 23, 2016

What changes were proposed in this pull request?

We failed to properly propagate table metadata for existing tables for the saveAsTable command. This caused a downstream component to think the table was MANAGED, writing data to the wrong location.

How was this patch tested?

Unit test that fails before the patch.
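
For reference, here is a minimal sketch of the failure scenario described above (this is not the actual regression test added by the patch; the table name and path are made up). Appending to an existing table that has an explicit location via saveAsTable should keep writing to that location rather than to the default warehouse directory:

  // Hypothetical reproduction of SPARK-18544, runnable in spark-shell.
  val path = "/tmp/spark-18544-demo"   // made-up external location

  spark.range(3).write.mode("overwrite").parquet(path)
  spark.sql(s"CREATE TABLE demo_tbl (id BIGINT) USING parquet OPTIONS (path '$path')")

  // Before this patch, the append below could resolve the table as MANAGED and
  // write the new files under the warehouse directory instead of the path above.
  spark.range(3).write.mode("append").saveAsTable("demo_tbl")

  // The appended files should land in the table's original location.
  assert(new java.io.File(path).listFiles().exists(_.getName.endsWith(".parquet")))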

  )
  df.sparkSession.sessionState.executePlan(
    CreateTable(tableDesc, mode, Some(df.logicalPlan))).toRdd
  if (tableDesc.partitionColumnNames.nonEmpty &&
@ericl (Contributor Author)

I think this is unnecessary since the create / insert commands already recover partitions themselves. We'll see if any tests fail.

@gatorsmile (Member) commented Dec 18, 2016

This is needed when we create the data source table.

@ericl (Contributor Author) commented Nov 23, 2016

cc @yhuai @cloud-fan

@SparkQA commented Nov 23, 2016

Test build #69039 has finished for PR 15983 at commit 39b6249.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case _ =>
  val storage = DataSource.buildStorageFormatFromOptions(extraOptions.toMap)
  val tableType = if (storage.locationUri.isDefined) {
  val existingTable = if (tableExists) {
@cloud-fan (Contributor) commented Nov 23, 2016

shall we move this logic to CreateDataSourceTableAsSelectCommand?

Contributor
nvm, I think it's fine here

@SparkQA commented Nov 23, 2016

Test build #69045 has finished for PR 15983 at commit ca75331.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

retest this please

@SparkQA commented Nov 28, 2016

Test build #69223 has started for PR 15983 at commit ca75331.

@cloud-fan (Contributor)

retest this please

@SparkQA commented Nov 29, 2016

Test build #69281 has finished for PR 15983 at commit ca75331.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Nov 29, 2016

Merging in master/branch-2.1.

@asfgit closed this in e2318ed Nov 29, 2016
asfgit pushed a commit that referenced this pull request Nov 29, 2016
…cation

## What changes were proposed in this pull request?

We failed to properly propagate table metadata for existing tables for the saveAsTable command. This caused a downstream component to think the table was MANAGED, writing data to the wrong location.

## How was this patch tested?

Unit test that fails before the patch.

Author: Eric Liang <[email protected]>

Closes #15983 from ericl/spark-18544.

(cherry picked from commit e2318ed)
Signed-off-by: Reynold Xin <[email protected]>
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 2, 2016
…cation


Author: Eric Liang <[email protected]>

Closes apache#15983 from ericl/spark-18544.
withTempDir { dir =>
  setupPartitionedDatasourceTable("test", dir)
  if (enabled) {
    spark.sql("msck repair table test")
@gatorsmile (Member)

If HIVE_MANAGE_FILESOURCE_PARTITIONS is on, we have to repair the table; otherwise the table appears empty to external users. This looks weird to me. It is also inconsistent with the behavior when HIVE_MANAGE_FILESOURCE_PARTITIONS is off. I think we should repair the table after we create it. Let me submit a PR and cc you then.
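
For illustration, a rough sketch of the behavior being discussed (the table name and path are made up; this snippet is not part of the PR). With partition management on, a partitioned data source table created over existing files has no partitions registered in the metastore until it is repaired:

  // Assumes /tmp/part_demo_data already contains partCol=... subdirectories of Parquet files.
  spark.sql(
    """CREATE TABLE part_demo (fieldOne BIGINT, partCol INT)
      |USING parquet
      |OPTIONS (path '/tmp/part_demo_data')
      |PARTITIONED BY (partCol)""".stripMargin)

  spark.sql("SELECT * FROM part_demo").show()   // appears empty: no partitions registered yet
  spark.sql("MSCK REPAIR TABLE part_demo")      // discover partitions from the file system
  spark.sql("SELECT * FROM part_demo").show()   // the existing data is now visible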

@ericl (Contributor Author) commented Dec 18, 2016 via email

@gatorsmile (Member)

Yeah, table repair is expensive, but this causes an external behavior change. I tried it in 2.0: it can show the whole data source table without repairing the table. In 2.1, it returns an empty result unless we repair the table.

scala> spark.range(5).selectExpr("id as fieldOne", "id as partCol").write.partitionBy("partCol").mode("overwrite").saveAsTable("test")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
16/12/17 17:41:20 WARN CreateDataSourceTableUtils: Persisting partitioned data source relation `test` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. Input path(s): 
file:/Users/xiaoli/sparkBin/spark-2.0.2-bin-hadoop2.7/bin/spark-warehouse/test

scala> spark.sql("select * from test").show()
+--------+-------+
|fieldOne|partCol|
+--------+-------+
|       2|      2|
|       1|      1|
|       3|      3|
|       0|      0|
|       4|      4|
+--------+-------+

scala> spark.sql("desc formatted test").show(50, false)
+----------------------------+------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                     |comment|
+----------------------------+------------------------------------------------------------------------------+-------+
...
|  path                      |file:/Users/xiaoli/sparkBin/spark-2.0.2-bin-hadoop2.7/bin/spark-warehouse/test|       |
+----------------------------+------------------------------------------------------------------------------+-------+


scala> spark.sql(s"create table newTab (fieldOne long, partCol int) using parquet options (path '/Users/xiaoli/sparkBin/spark-2.0.2-bin-hadoop2.7/bin/spark-warehouse/test') partitioned by (partCol)")
16/12/17 17:43:24 WARN CreateDataSourceTableUtils: Persisting partitioned data source relation `newTab` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. Input path(s): 
file:/Users/xiaoli/sparkBin/spark-2.0.2-bin-hadoop2.7/bin/spark-warehouse/test
res3: org.apache.spark.sql.DataFrame = []

scala> spark.table("newTab").show()
+--------+-------+
|fieldOne|partCol|
+--------+-------+
|       2|      2|
|       1|      1|
|       3|      3|
|       0|      0|
|       4|      4|
+--------+-------+

@gatorsmile (Member) commented Dec 18, 2016

Regarding the concern about repair cost, I think we still face the same issue. Each time we append an extra row, we also repair the table, right? That is still expensive.

Update: when we insert into the table, we do not repair it. We repair the table only in CTAS of partitioned data source tables.

@cloud-fan (Contributor) commented Dec 18, 2016

@gatorsmile I think this behavior change is necessary for the new "Scalable Partition Handling" feature. BTW, the table created by Spark 2.0 can still be read by Spark 2.1 without REPAIR TABLE, and users can fall back to the old behavior by turning off spark.sql.hive.manageFilesourcePartitions. So backward compatibility is still good, provided we put this in the migration guide.
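
As a rough sketch of the fallback just mentioned (assuming the flag is set when the session is created, before the affected tables are used), a user could do something like:

  import org.apache.spark.sql.SparkSession

  // Fall back to the pre-2.1 behavior: partition metadata for data source tables is not
  // tracked in the Hive metastore, so no REPAIR TABLE step is needed after creating them.
  val spark = SparkSession.builder()
    .enableHiveSupport()
    .config("spark.sql.hive.manageFilesourcePartitions", "false")
    .getOrCreate()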

@gatorsmile (Member) commented Dec 18, 2016

@cloud-fan

First, if this behavior change is required, we need to document it. It is not clear to external users; they may not realize the underlying change, since spark.sql.hive.manageFilesourcePartitions is set to true by default.

Second, let us discuss the existing user interface when creating data source tables.

  • CREATE EXTERNAL data source TABLE: not allowed.
  • CREATE MANAGED data source table without specifying the path in options:
    • when the data source table is non-partitioned: an empty table
    • when the data source table is partitioned: an empty table
  • CREATE MANAGED data source table specifying the path in options:
    • when the data source table is non-partitioned: the data in the path is visible. No need to repair the table.
    • when the data source table is partitioned and spark.sql.hive.manageFilesourcePartitions is set to false: the data in the path is visible. No need to repair the table.
    • when the data source table is partitioned and spark.sql.hive.manageFilesourcePartitions is set to true (the default):
      • the data in the path is not visible if users do nothing to the table.
      • all the data/partitions in the path are visible if users repair the table.
      • without the repair, if users append new rows to a partition, the existing/old rows in the relevant partitions become visible along with the newly inserted rows.
  • CREATE MANAGED partitioned Hive serde table specifying the LOCATION: automatically converted to an EXTERNAL table in Spark 2.0+, but allowed by Hive (Hive views it as an empty directory and overwrites the existing data when new data is inserted/loaded).

It is pretty complex to document/remember the above behaviors.

  • Should we support EXTERNAL for data source tables? Like what we did for Hive serde tables, convert it to EXTERNAL table when users specify the path in option?
  • If we decide to change the behaviors, should we follow Hive's MANAGED table when users specify a location? Ignore the existing data in the specified path/location?

BTW, I will try the behaviors for CTAS and update the above bullets.

@gatorsmile
Copy link
Member

gatorsmile commented Dec 18, 2016

If we do not repair the table when creating such a table, we face another behavioral inconsistency between CTAS + Append and Insert Into, because CTAS always repairs the table but InsertInto does not. Below is an example.

  test("append with saveAsTable + partition management") {
    withSQLConf(SQLConf.HIVE_MANAGE_FILESOURCE_PARTITIONS.key -> "true") {
      withTable("test") {
        withTempDir { dir =>
          setupPartitionedDatasourceTable("test", dir)
          spark.range(2).selectExpr("id as fieldOne", "id as partCol")
            .write.partitionBy("partCol").mode("append").saveAsTable("test")
          // all the partitions are visible
          assert(spark.sql("select * from test").count() == 7)
        }
      }
    }
  }


  test("insertInto + partition management") {
    withSQLConf(SQLConf.HIVE_MANAGE_FILESOURCE_PARTITIONS.key -> "true") {
      withTable("test") {
        withTempDir { dir =>
          setupPartitionedDatasourceTable("test", dir)
          spark.range(2).selectExpr("id as fieldOne", "id as partCol")
            .write.insertInto("test")
          // only the involved partitions are visible
          assert(spark.sql("select * from test").count() == 4)
        }
      }
    }
  }

@cloud-fan (Contributor) commented Dec 19, 2016

Should we support EXTERNAL for data source tables?

I'd like to hide the EXTERNAL concept from users, and only expose the path option or LOCATION syntax, for both data source and Hive serde tables. It's pretty weird to have a managed table with a custom location, or an external table without a custom location, and I don't think Hive's behavior makes sense when we rename, drop, or alter the location of these tables.
After we unify the CREATE TABLE syntax, we will only document the new syntax (which doesn't have the EXTERNAL concept), and leave the legacy Hive syntax undocumented (but still supported).
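
For concreteness, a rough sketch of the two spellings being contrasted (paths are made up; the LOCATION form for data source tables reflects the direction of the unified-syntax work and may not be available in 2.1):

  // Today: point a data source table at an existing directory via the path option.
  spark.sql(
    """CREATE TABLE events (id BIGINT, day STRING)
      |USING parquet
      |OPTIONS (path '/data/events')""".stripMargin)

  // Proposed unified syntax: the same intent expressed with LOCATION, without ever
  // surfacing the EXTERNAL keyword to the user.
  spark.sql(
    """CREATE TABLE events_v2 (id BIGINT, day STRING)
      |USING parquet
      |LOCATION '/data/events'""".stripMargin)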

CTAS always repairs the table but InsertInto does not do it

As Hive can't append to a table with CTAS, we can define our own semantics here. InsertInto only updates the involved partitions; I think it makes more sense to follow that for CTAS, which will be done in #15996.

@gatorsmile (Member)

I see the plan, but the behavior difference will still be affected by the value of spark.sql.hive.manageFilesourcePartitions, right?

I might need more time to chew over it to find out the potential impacts.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…cation


Author: Eric Liang <[email protected]>

Closes apache#15983 from ericl/spark-18544.