
Conversation

@ericl (Contributor) commented Nov 23, 2016

What changes were proposed in this pull request?

We failed to properly propagate table metadata for existing tables for the saveAsTable command. This caused a downstream component to think the table was MANAGED, writing data to the wrong location.

How was this patch tested?

Unit test that fails before the patch.
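
For reference, here is a minimal sketch of the failure scenario described above (this is not the actual regression test added by the patch; the table name and path are made up). Appending to an existing table that has an explicit location via saveAsTable should keep writing to that location rather than to the default warehouse directory:

  // Hypothetical reproduction of SPARK-18544, runnable in spark-shell.
  val path = "/tmp/spark-18544-demo"   // made-up external location

  spark.range(3).write.mode("overwrite").parquet(path)
  spark.sql(s"CREATE TABLE demo_tbl (id BIGINT) USING parquet OPTIONS (path '$path')")

  // Before this patch, the append below could resolve the table as MANAGED and
  // write the new files under the warehouse directory instead of the path above.
  spark.range(3).write.mode("append").saveAsTable("demo_tbl")

  // The appended files should land in the table's original location.
  assert(new java.io.File(path).listFiles().exists(_.getName.endsWith(".parquet")))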

  )
  df.sparkSession.sessionState.executePlan(
    CreateTable(tableDesc, mode, Some(df.logicalPlan))).toRdd
  if (tableDesc.partitionColumnNames.nonEmpty &&
@ericl (Contributor Author)

I think this is unnecessary since the create / insert commands already recover partitions themselves. We'll see if any tests fail.

@gatorsmile (Member) commented Dec 18, 2016

This is needed when we create the data source table.

@ericl (Contributor Author) commented Nov 23, 2016

cc @yhuai @cloud-fan

@SparkQA commented Nov 23, 2016

Test build #69039 has finished for PR 15983 at commit 39b6249.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case _ =>
  val storage = DataSource.buildStorageFormatFromOptions(extraOptions.toMap)
  val tableType = if (storage.locationUri.isDefined) {
  val existingTable = if (tableExists) {
@cloud-fan (Contributor) commented Nov 23, 2016

shall we move this logic to CreateDataSourceTableAsSelectCommand?

Contributor
nvm, I think it's fine here

@SparkQA commented Nov 23, 2016

Test build #69045 has finished for PR 15983 at commit ca75331.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

retest this please

@SparkQA commented Nov 28, 2016

Test build #69223 has started for PR 15983 at commit ca75331.

@cloud-fan (Contributor)

retest this please

@SparkQA commented Nov 29, 2016

Test build #69281 has finished for PR 15983 at commit ca75331.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Nov 29, 2016

Merging in master/branch-2.1.

@asfgit closed this in e2318ed Nov 29, 2016
asfgit pushed a commit that referenced this pull request Nov 29, 2016
…cation

## What changes were proposed in this pull request?

We failed to properly propagate table metadata for existing tables for the saveAsTable command. This caused a downstream component to think the table was MANAGED, writing data to the wrong location.

## How was this patch tested?

Unit test that fails before the patch.

Author: Eric Liang <[email protected]>

Closes #15983 from ericl/spark-18544.

(cherry picked from commit e2318ed)
Signed-off-by: Reynold Xin <[email protected]>
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 2, 2016
…cation


Author: Eric Liang <[email protected]>

Closes apache#15983 from ericl/spark-18544.
withTempDir { dir =>
  setupPartitionedDatasourceTable("test", dir)
  if (enabled) {
    spark.sql("msck repair table test")
@gatorsmile (Member)

If HIVE_MANAGE_FILESOURCE_PARTITIONS is on, we have to repair the table; otherwise the table appears empty to external users. This looks weird to me. It is also inconsistent with the behavior when HIVE_MANAGE_FILESOURCE_PARTITIONS is off. I think we should repair the table after we create it. Let me submit a PR and cc you then.
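
For illustration, a rough sketch of the behavior being discussed (the table name and path are made up; this snippet is not part of the PR). With partition management on, a partitioned data source table created over existing files has no partitions registered in the metastore until it is repaired:

  // Assumes /tmp/part_demo_data already contains partCol=... subdirectories of Parquet files.
  spark.sql(
    """CREATE TABLE part_demo (fieldOne BIGINT, partCol INT)
      |USING parquet
      |OPTIONS (path '/tmp/part_demo_data')
      |PARTITIONED BY (partCol)""".stripMargin)

  spark.sql("SELECT * FROM part_demo").show()   // appears empty: no partitions registered yet
  spark.sql("MSCK REPAIR TABLE part_demo")      // discover partitions from the file system
  spark.sql("SELECT * FROM part_demo").show()   // the existing data is now visible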

@ericl (Contributor Author) commented Dec 18, 2016 via email

@gatorsmile (Member)

Yeah, table repair is expensive, but this causes an external behavior change. I tried it in 2.0: it can show the whole data source table without repairing the table. In 2.1, it returns an empty result unless we repair the table.

scala> spark.range(5).selectExpr("id as fieldOne", "id as partCol").write.partitionBy("partCol").mode("overwrite").saveAsTable("test")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
16/12/17 17:41:20 WARN CreateDataSourceTableUtils: Persisting partitioned data source relation `test` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. Input path(s): 
file:/Users/xiaoli/sparkBin/spark-2.0.2-bin-hadoop2.7/bin/spark-warehouse/test

scala> spark.sql("select * from test").show()
+--------+-------+
|fieldOne|partCol|
+--------+-------+
|       2|      2|
|       1|      1|
|       3|      3|
|       0|      0|
|       4|      4|
+--------+-------+

scala> spark.sql("desc formatted test").show(50, false)
+----------------------------+------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                     |comment|
+----------------------------+------------------------------------------------------------------------------+-------+
...
|  path                      |file:/Users/xiaoli/sparkBin/spark-2.0.2-bin-hadoop2.7/bin/spark-warehouse/test|       |
+----------------------------+------------------------------------------------------------------------------+-------+


scala> spark.sql(s"create table newTab (fieldOne long, partCol int) using parquet options (path '/Users/xiaoli/sparkBin/spark-2.0.2-bin-hadoop2.7/bin/spark-warehouse/test') partitioned by (partCol)")
16/12/17 17:43:24 WARN CreateDataSourceTableUtils: Persisting partitioned data source relation `newTab` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. Input path(s): 
file:/Users/xiaoli/sparkBin/spark-2.0.2-bin-hadoop2.7/bin/spark-warehouse/test
res3: org.apache.spark.sql.DataFrame = []

scala> spark.table("newTab").show()
+--------+-------+
|fieldOne|partCol|
+--------+-------+
|       2|      2|
|       1|      1|
|       3|      3|
|       0|      0|
|       4|      4|
+--------+-------+

@gatorsmile (Member) commented Dec 18, 2016

Regarding the concern about repair cost, I think we still face the same issue. Each time we append an extra row, we also repair the table, right? That is still expensive.

Update: when we insert into the table, we do not repair it. We repair the table only in CTAS of partitioned data source tables.

@cloud-fan (Contributor) commented Dec 18, 2016

@gatorsmile I think this behavior change is necessary for the new "Scalable Partition Handling" feature. BTW, the table created by Spark 2.0 can still be read by Spark 2.1 without REPAIR TABLE, and users can fall back to the old behavior by turning off spark.sql.hive.manageFilesourcePartitions. So backward compatibility is still good, provided we put this in the migration guide.
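
As a rough sketch of the fallback just mentioned (assuming the flag is set when the session is created, before the affected tables are used), a user could do something like:

  import org.apache.spark.sql.SparkSession

  // Fall back to the pre-2.1 behavior: partition metadata for data source tables is not
  // tracked in the Hive metastore, so no REPAIR TABLE step is needed after creating them.
  val spark = SparkSession.builder()
    .enableHiveSupport()
    .config("spark.sql.hive.manageFilesourcePartitions", "false")
    .getOrCreate()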

@gatorsmile (Member) commented Dec 18, 2016

@cloud-fan

First, if this behavior change is required, we need to document it. It is not clear to external users; they may not realize the underlying change, since spark.sql.hive.manageFilesourcePartitions is set to true by default.

Second, let us discuss the existing user interface when creating data source tables.

  • CREATE EXTERNAL data source TABLE: not allowed.
  • CREATE MANAGED data source table without specifying the path in options:
    • when the data source table is non-partitioned: an empty table
    • when the data source table is partitioned: an empty table
  • CREATE MANAGED data source table specifying the path in options:
    • when the data source table is non-partitioned: the data in the path is visible. No need to repair the table.
    • when the data source table is partitioned and spark.sql.hive.manageFilesourcePartitions is set to false: the data in the path is visible. No need to repair the table.
    • when the data source table is partitioned and spark.sql.hive.manageFilesourcePartitions is set to true (the default):
      • the data in the path is not visible if users do nothing to the table.
      • all the data/partitions in the path are visible if users repair the table.
      • without the repair, if users append new rows to a partition, the existing/old rows in the relevant partitions become visible along with the newly inserted rows.
  • CREATE MANAGED partitioned Hive serde table specifying the LOCATION: automatically converted to an EXTERNAL table in Spark 2.0+, but allowed by Hive (Hive views it as an empty directory and overwrites the existing data when new data is inserted/loaded).

It is pretty complex to document/remember the above behaviors.

  • Should we support EXTERNAL for data source tables? Like what we did for Hive serde tables, convert it to EXTERNAL table when users specify the path in option?
  • If we decide to change the behaviors, should we follow Hive's MANAGED table when users specify a location? Ignore the existing data in the specified path/location?

BTW, I will try the behaviors for CTAS and update the above bullets.

@gatorsmile
Copy link
Member

gatorsmile commented Dec 18, 2016

If we do not repair the table when creating such a table, we face another behavioral inconsistency between CTAS + Append and Insert Into, because CTAS always repairs the table but InsertInto does not. Below is an example.

  test("append with saveAsTable + partition management") {
    withSQLConf(SQLConf.HIVE_MANAGE_FILESOURCE_PARTITIONS.key -> "true") {
      withTable("test") {
        withTempDir { dir =>
          setupPartitionedDatasourceTable("test", dir)
          spark.range(2).selectExpr("id as fieldOne", "id as partCol")
            .write.partitionBy("partCol").mode("append").saveAsTable("test")
          // all the partitions are visible
          assert(spark.sql("select * from test").count() == 7)
        }
      }
    }
  }


  test("insertInto + partition management") {
    withSQLConf(SQLConf.HIVE_MANAGE_FILESOURCE_PARTITIONS.key -> "true") {
      withTable("test") {
        withTempDir { dir =>
          setupPartitionedDatasourceTable("test", dir)
          spark.range(2).selectExpr("id as fieldOne", "id as partCol")
            .write.insertInto("test")
          // only the involved partitions are visible
          assert(spark.sql("select * from test").count() == 4)
        }
      }
    }
  }

@cloud-fan (Contributor) commented Dec 19, 2016

Should we support EXTERNAL for data source tables?

I'd like to hide the EXTERNAL concept from users, and only expose the path option or LOCATION syntax, for both data source and Hive serde tables. It's pretty weird to have a managed table with a custom location, or an external table without a custom location, and I don't think Hive's behavior makes sense when we rename, drop, or alter the location of these tables.
After we unify the CREATE TABLE syntax, we will only document the new syntax (which doesn't have the EXTERNAL concept), and leave the legacy Hive syntax undocumented (but still supported).
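
For concreteness, a rough sketch of the two spellings being contrasted (paths are made up; the LOCATION form for data source tables reflects the direction of the unified-syntax work and may not be available in 2.1):

  // Today: point a data source table at an existing directory via the path option.
  spark.sql(
    """CREATE TABLE events (id BIGINT, day STRING)
      |USING parquet
      |OPTIONS (path '/data/events')""".stripMargin)

  // Proposed unified syntax: the same intent expressed with LOCATION, without ever
  // surfacing the EXTERNAL keyword to the user.
  spark.sql(
    """CREATE TABLE events_v2 (id BIGINT, day STRING)
      |USING parquet
      |LOCATION '/data/events'""".stripMargin)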

CTAS always repairs the table but InsertInto does not do it

As Hive can't append to a table with CTAS, we can define our own semantics here. InsertInto only updates the involved partitions; I think it makes more sense to follow that for CTAS, which will be done in #15996.

@gatorsmile (Member)

I see the plan, but the behavior difference will still be affected by the value of spark.sql.hive.manageFilesourcePartitions, right?

I might need more time to chew over it to find out the potential impacts.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…cation


Author: Eric Liang <[email protected]>

Closes apache#15983 from ericl/spark-18544.