Conversation

@maropu (Member) commented Jun 21, 2017

What changes were proposed in this pull request?

The current master outputs unexpected results when the data schema and partition schema have duplicate columns:

withTempPath { dir =>
  val basePath = dir.getCanonicalPath
  spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=1").toString)
  spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=a").toString)
  spark.read.parquet(basePath).show()
}

+---+
|foo|
+---+
|  1|
|  1|
|  a|
|  a|
|  1|
|  a|
+---+

This patch adds code to print a warning when such duplication is found.
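The check described here can be sketched in a few lines; the object and method names below are illustrative, not Spark's actual API:

```scala
// Hypothetical sketch (not Spark's actual API): find columns that appear in both
// the data schema and the partition schema, optionally ignoring case.
object DuplicateColumnCheck {
  def findDuplicates(
      dataColumns: Seq[String],
      partitionColumns: Seq[String],
      caseSensitive: Boolean): Seq[String] = {
    // Normalize names before comparing when the analysis is case-insensitive.
    val normalize: String => String =
      if (caseSensitive) (s: String) => s else (s: String) => s.toLowerCase
    val partitionSet = partitionColumns.map(normalize).toSet
    dataColumns.filter(c => partitionSet.contains(normalize(c)))
  }
}
```

With the example above, `findDuplicates(Seq("foo"), Seq("foo"), caseSensitive = true)` would flag `foo`, which is exactly the situation the warning targets.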

How was this patch tested?

Manually checked.

SparkQA commented Jun 21, 2017

Test build #78378 has finished for PR 18375 at commit d5c7d08.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class OrcParDataWithKey(intField: Int, stringField: String)
  • case class ParquetDataWithKey(intField: Int, stringField: String)

SparkQA commented Jun 21, 2017

Test build #78382 has finished for PR 18375 at commit 0371562.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class OrcParDataWithKey(intField: Int, stringField: String)
  • case class ParquetDataWithKey(intField: Int, stringField: String)

SparkQA commented Jun 21, 2017

Test build #78388 has finished for PR 18375 at commit c4c48df.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class OrcParDataWithKey(intField: Int, stringField: String)
  • case class ParquetDataWithKey(intField: Int, stringField: String)

@maropu (Member Author) commented Jun 21, 2017

This fix breaks the existing test suites, so I'm looking for another approach that fixes only this issue...

@maropu (Member Author) commented Jun 21, 2017

@gatorsmile I remember @liancheng said we want to allow users to create partitioned tables whose data schema contains (part of) the partition columns, and there are existing test cases for this use case (#16030 (comment)). But the query in the description seems error-prone, so how about just printing a warning message when the duplication is detected (like here)?

SparkQA commented Jun 21, 2017

Test build #78391 has finished for PR 18375 at commit a03c907.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class OrcParDataWithKey(intField: Int, pi: Int, stringField: String, ps: String)
  • case class ParquetDataWithKey(p: Int, intField: Int, stringField: String)

@gatorsmile (Member) commented:

uh, I see. Let us log it as a warning.

@maropu (Member Author) commented Jun 23, 2017

ok, I'll update.

@maropu maropu changed the title [SPARK-21144][SQL] Check if the data schema and partition schema have the duplicate columns [SPARK-21144][SQL] Print a warning if the data schema and partition schema have the duplicate columns Jun 23, 2017
@maropu (Member Author) commented Jun 23, 2017

@gatorsmile Manually checked:

scala> Seq((1, 2, 3)).toDF("a", "b", "c").write.save(s"$path/a=0")
scala> spark.read.load(path).show
17/06/23 21:04:44 WARN SchemaUtils: Found duplicate column(s) in the data schema and the partition schema: `a`. You might need to assign different column names.
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  0|  2|  3|
+---+---+---+
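The shadowing in the output above (the data file's `a=1` is replaced by the partition value `0` from the `a=0` directory) can be illustrated as follows; `resolveRow` is a hypothetical helper for illustration, not Spark's reader code:

```scala
// Hypothetical illustration of why the data column is shadowed: when a column
// name occurs in both schemas, the value from the partition directory wins.
object PartitionShadowing {
  def resolveRow(
      dataRow: Map[String, Any],
      partitionValues: Map[String, Any]): Map[String, Any] =
    dataRow ++ partitionValues // right-hand side (partition values) overrides on key collision
}
```

This is why a warning is useful here: the user's written value for `a` is silently replaced, with no error at read time.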

SparkQA commented Jun 23, 2017

Test build #78524 has finished for PR 18375 at commit eba52f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member Author) commented Jun 23, 2017

Jenkins, retest this please.

Review comment on the new `checkColumnNameDuplication` helper:

    def checkColumnNameDuplication(
        columnNames: Seq[String], colType: String, caseSensitiveAnalysis: Boolean): Unit = {
      val names = if (caseSensitiveAnalysis) {

This `if` might be easier to read on one line:

    val names = if (caseSensitiveAnalysis) columnNames else columnNames.map(_.toLowerCase)
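For context, here is a self-contained sketch of how the helper could be completed, with the reviewer's one-line `if` applied. The exception type and message wording are assumptions for illustration, not necessarily what Spark throws:

```scala
object SchemaChecks {
  // Sketch of the checkColumnNameDuplication helper discussed above, using the
  // one-line `if` suggested in review. IllegalArgumentException is an assumption.
  def checkColumnNameDuplication(
      columnNames: Seq[String], colType: String, caseSensitiveAnalysis: Boolean): Unit = {
    val names = if (caseSensitiveAnalysis) columnNames else columnNames.map(_.toLowerCase)
    if (names.distinct.length != names.length) {
      // Collect the names that occur more than once, backtick-quoted for the message.
      val duplicateColumns = names.groupBy(identity).collect {
        case (x, ys) if ys.length > 1 => s"`$x`"
      }
      throw new IllegalArgumentException(
        s"Found duplicate column(s) $colType: ${duplicateColumns.mkString(", ")}")
    }
  }
}
```

Note that lowercasing before `distinct` is what makes the check respect case-insensitive analysis: `Seq("a", "A")` passes when `caseSensitiveAnalysis` is true and fails when it is false.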

Member commented:
Yeah. Let me first merge this.

@maropu, could you resolve this in your other related PR?

Member Author commented:

ok

SparkQA commented Jun 23, 2017

Test build #78525 has finished for PR 18375 at commit 0f449df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 23, 2017

Test build #78528 has finished for PR 18375 at commit 0f449df.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 23, 2017

Test build #78530 has finished for PR 18375 at commit 0f449df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jun 23, 2017
…chema have the duplicate columns

## What changes were proposed in this pull request?
The current master outputs unexpected results when the data schema and partition schema have duplicate columns:
```
withTempPath { dir =>
  val basePath = dir.getCanonicalPath
  spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=1").toString)
  spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=a").toString)
  spark.read.parquet(basePath).show()
}

+---+
|foo|
+---+
|  1|
|  1|
|  a|
|  a|
|  1|
|  a|
+---+
```
This patch adds code to print a warning when such duplication is found.

## How was this patch tested?
Manually checked.

Author: Takeshi Yamamuro <[email protected]>

Closes #18375 from maropu/SPARK-21144-3.

(cherry picked from commit f3dea60)
Signed-off-by: gatorsmile <[email protected]>
@asfgit asfgit closed this in f3dea60 Jun 23, 2017
robert3005 pushed a commit to palantir/spark that referenced this pull request Jun 29, 2017
…chema have the duplicate columns


Author: Takeshi Yamamuro <[email protected]>

Closes apache#18375 from maropu/SPARK-21144-3.