Conversation

@maropu (Member) commented Jun 21, 2017

What changes were proposed in this pull request?

The current master outputs unexpected results when the data schema and partition schema have duplicate columns:

withTempPath { dir =>
  val basePath = dir.getCanonicalPath
  spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=1").toString)
  spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=a").toString)
  spark.read.parquet(basePath).show()
}

+---+
|foo|
+---+
|  1|
|  1|
|  a|
|  a|
|  1|
|  a|
+---+

This patch adds code to print a warning when such duplication is found.
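The check described here can be sketched in a few lines; the object and method names below are illustrative, not Spark's actual API:

```scala
// Hypothetical sketch (not Spark's actual API): find columns that appear in both
// the data schema and the partition schema, optionally ignoring case.
object DuplicateColumnCheck {
  def findDuplicates(
      dataColumns: Seq[String],
      partitionColumns: Seq[String],
      caseSensitive: Boolean): Seq[String] = {
    // Normalize names before comparing when the analysis is case-insensitive.
    val normalize: String => String =
      if (caseSensitive) (s: String) => s else (s: String) => s.toLowerCase
    val partitionSet = partitionColumns.map(normalize).toSet
    dataColumns.filter(c => partitionSet.contains(normalize(c)))
  }
}
```

With the example above, `findDuplicates(Seq("foo"), Seq("foo"), caseSensitive = true)` would flag `foo`, which is exactly the situation the warning targets.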

How was this patch tested?

Manually checked.

SparkQA commented Jun 21, 2017

Test build #78378 has finished for PR 18375 at commit d5c7d08.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class OrcParDataWithKey(intField: Int, stringField: String)
  • case class ParquetDataWithKey(intField: Int, stringField: String)

SparkQA commented Jun 21, 2017

Test build #78382 has finished for PR 18375 at commit 0371562.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class OrcParDataWithKey(intField: Int, stringField: String)
  • case class ParquetDataWithKey(intField: Int, stringField: String)

SparkQA commented Jun 21, 2017

Test build #78388 has finished for PR 18375 at commit c4c48df.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class OrcParDataWithKey(intField: Int, stringField: String)
  • case class ParquetDataWithKey(intField: Int, stringField: String)

@maropu (Member Author) commented Jun 21, 2017

This fix breaks the existing test suites, so I'm looking for another approach that fixes only this issue...

@maropu (Member Author) commented Jun 21, 2017

@gatorsmile I remember @liancheng said we want to allow users to create partitioned tables whose data schema contains (part of) the partition columns, and there are existing test cases for this use case (#16030 (comment)). But the query in the description seems error-prone, so how about just printing a warning message when the duplication is detected (like here)?

SparkQA commented Jun 21, 2017

Test build #78391 has finished for PR 18375 at commit a03c907.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class OrcParDataWithKey(intField: Int, pi: Int, stringField: String, ps: String)
  • case class ParquetDataWithKey(p: Int, intField: Int, stringField: String)

@gatorsmile (Member) commented:

uh, I see. Let us log it as a warning.

@maropu (Member Author) commented Jun 23, 2017

ok, I'll update.

@maropu maropu changed the title [SPARK-21144][SQL] Check if the data schema and partition schema have the duplicate columns [SPARK-21144][SQL] Print a warning if the data schema and partition schema have the duplicate columns Jun 23, 2017
@maropu (Member Author) commented Jun 23, 2017

@gatorsmile Manually checked:

scala> Seq((1, 2, 3)).toDF("a", "b", "c").write.save(s"$path/a=0")
scala> spark.read.load(path).show
17/06/23 21:04:44 WARN SchemaUtils: Found duplicate column(s) in the data schema and the partition schema: `a`. You might need to assign different column names.
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  0|  2|  3|
+---+---+---+
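The shadowing in the output above (the data file's `a=1` is replaced by the partition value `0` from the `a=0` directory) can be illustrated as follows; `resolveRow` is a hypothetical helper for illustration, not Spark's reader code:

```scala
// Hypothetical illustration of why the data column is shadowed: when a column
// name occurs in both schemas, the value from the partition directory wins.
object PartitionShadowing {
  def resolveRow(
      dataRow: Map[String, Any],
      partitionValues: Map[String, Any]): Map[String, Any] =
    dataRow ++ partitionValues // right-hand side (partition values) overrides on key collision
}
```

This is why a warning is useful here: the user's written value for `a` is silently replaced, with no error at read time.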

SparkQA commented Jun 23, 2017

Test build #78524 has finished for PR 18375 at commit eba52f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member Author) commented Jun 23, 2017

Jenkins, retest this please.

Review comment on the new `checkColumnNameDuplication` helper:

    def checkColumnNameDuplication(
        columnNames: Seq[String], colType: String, caseSensitiveAnalysis: Boolean): Unit = {
      val names = if (caseSensitiveAnalysis) {

This `if` might be easier to read on one line:

    val names = if (caseSensitiveAnalysis) columnNames else columnNames.map(_.toLowerCase)
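For context, here is a self-contained sketch of how the helper could be completed, with the reviewer's one-line `if` applied. The exception type and message wording are assumptions for illustration, not necessarily what Spark throws:

```scala
object SchemaChecks {
  // Sketch of the checkColumnNameDuplication helper discussed above, using the
  // one-line `if` suggested in review. IllegalArgumentException is an assumption.
  def checkColumnNameDuplication(
      columnNames: Seq[String], colType: String, caseSensitiveAnalysis: Boolean): Unit = {
    val names = if (caseSensitiveAnalysis) columnNames else columnNames.map(_.toLowerCase)
    if (names.distinct.length != names.length) {
      // Collect the names that occur more than once, backtick-quoted for the message.
      val duplicateColumns = names.groupBy(identity).collect {
        case (x, ys) if ys.length > 1 => s"`$x`"
      }
      throw new IllegalArgumentException(
        s"Found duplicate column(s) $colType: ${duplicateColumns.mkString(", ")}")
    }
  }
}
```

Note that lowercasing before `distinct` is what makes the check respect case-insensitive analysis: `Seq("a", "A")` passes when `caseSensitiveAnalysis` is true and fails when it is false.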

Member commented:
Yeah. Let me first merge this.

@maropu, could you resolve this in your other related PR?

Member Author commented:

ok

SparkQA commented Jun 23, 2017

Test build #78525 has finished for PR 18375 at commit 0f449df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 23, 2017

Test build #78528 has finished for PR 18375 at commit 0f449df.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 23, 2017

Test build #78530 has finished for PR 18375 at commit 0f449df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jun 23, 2017
…chema have the duplicate columns

## What changes were proposed in this pull request?
The current master outputs unexpected results when the data schema and partition schema have duplicate columns:
```
withTempPath { dir =>
  val basePath = dir.getCanonicalPath
  spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=1").toString)
  spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=a").toString)
  spark.read.parquet(basePath).show()
}

+---+
|foo|
+---+
|  1|
|  1|
|  a|
|  a|
|  1|
|  a|
+---+
```
This patch adds code to print a warning when such duplication is found.

## How was this patch tested?
Manually checked.

Author: Takeshi Yamamuro <[email protected]>

Closes #18375 from maropu/SPARK-21144-3.

(cherry picked from commit f3dea60)
Signed-off-by: gatorsmile <[email protected]>
@asfgit asfgit closed this in f3dea60 Jun 23, 2017
robert3005 pushed a commit to palantir/spark that referenced this pull request Jun 29, 2017
…chema have the duplicate columns


Author: Takeshi Yamamuro <[email protected]>

Closes apache#18375 from maropu/SPARK-21144-3.