[SPARK-21144][SQL] Print a warning if the data schema and partition schema have the duplicate columns #18375
Test build #78378 has finished for PR 18375 at commit
Test build #78382 has finished for PR 18375 at commit
Test build #78388 has finished for PR 18375 at commit
This fix breaks the existing test suites, so I'm looking for other approaches to fix this issue only...
@gatorsmile I remembered @liancheng said
Test build #78391 has finished for PR 18375 at commit
uh, I see. Let us log it as a warning.
ok, I'll update.
This reverts commit c4c48dfdd299c61dcdff54c7678e954e5f88bd48.
@gatorsmile Manually checked:
Test build #78524 has finished for PR 18375 at commit
Jenkins, retest this please.
   */
  def checkColumnNameDuplication(
      columnNames: Seq[String], colType: String, caseSensitiveAnalysis: Boolean): Unit = {
    val names = if (caseSensitiveAnalysis) {
This if might be easier to read on one line.
val names = if (caseSensitiveAnalysis) columnNames else columnNames.map(_.toLowerCase)
Yeah. Let me first merge this.
@maropu, could you resolve this in your other related PR?
ok
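The one-line form suggested in this thread can be illustrated with a minimal sketch. This is not the code merged in the PR: `ColumnNameCheck` and `findDuplicates` are hypothetical names, and the helper returns the duplicated names instead of throwing or logging, purely so the normalization step is easy to see.

```scala
object ColumnNameCheck {
  // Hypothetical helper (an assumption, not Spark's API): returns the column
  // names that occur more than once after optional case normalization.
  def findDuplicates(columnNames: Seq[String], caseSensitiveAnalysis: Boolean): Seq[String] = {
    // One-line form of the conditional, as suggested in the review comment.
    val names = if (caseSensitiveAnalysis) columnNames else columnNames.map(_.toLowerCase)
    names
      .groupBy(identity)
      .collect { case (name, occurrences) if occurrences.size > 1 => name }
      .toSeq
      .sorted
  }
}
```

With case-insensitive analysis, `foo` and `Foo` are treated as the same column; with case-sensitive analysis they are distinct.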
Test build #78525 has finished for PR 18375 at commit
Test build #78528 has finished for PR 18375 at commit
Test build #78530 has finished for PR 18375 at commit
…chema have the duplicate columns
## What changes were proposed in this pull request?
The current master outputs unexpected results when the data schema and the partition schema have duplicate columns:
```
withTempPath { dir =>
  val basePath = dir.getCanonicalPath
  spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=1").toString)
  spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=a").toString)
  spark.read.parquet(basePath).show()
}

+---+
|foo|
+---+
|  1|
|  1|
|  a|
|  a|
|  1|
|  a|
+---+
```
This patch adds code to print a warning when such duplication is found.
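As a rough illustration of the kind of check described here, the sketch below compares column names from the data schema with those from the partition schema and builds a warning message when they overlap. It is a standalone approximation, not the patch itself: `SchemaOverlapCheck`, `overlapWarning`, and the message text are assumptions, and plain `Seq[String]` inputs stand in for Spark's schema types.

```scala
object SchemaOverlapCheck {
  // Hypothetical helper (not from the PR): returns a warning message when a
  // column name appears in both the data schema and the partition schema.
  def overlapWarning(
      dataColumns: Seq[String],
      partitionColumns: Seq[String],
      caseSensitive: Boolean): Option[String] = {
    // Normalize names only when the analysis is case-insensitive.
    def normalize(name: String): String = if (caseSensitive) name else name.toLowerCase
    val partitionSet = partitionColumns.map(normalize).toSet
    val overlapped = dataColumns.map(normalize).filter(partitionSet.contains).distinct
    if (overlapped.isEmpty) None
    else Some(
      "Found duplicate column(s) in the data schema and the partition schema: " +
        overlapped.mkString(", "))
  }
}
```

For the example above, where the data column and the partition directory are both named `foo`, `SchemaOverlapCheck.overlapWarning(Seq("foo"), Seq("foo"), caseSensitive = true)` returns a message; a caller could pass it to whatever logger is in use.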
## How was this patch tested?
Manually checked.
Author: Takeshi Yamamuro <[email protected]>
Closes #18375 from maropu/SPARK-21144-3.
(cherry picked from commit f3dea60)
Signed-off-by: gatorsmile <[email protected]>