[SPARK-39731][SQL] Fix issue in CSV and JSON data sources when parsing dates in "yyyyMMdd" format with CORRECTED time parser policy #37147
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

@@ -36,7 +36,7 @@ import org.apache.hadoop.io.SequenceFile.CompressionType
 import org.apache.hadoop.io.compress.GzipCodec
 import org.apache.logging.log4j.Level

-import org.apache.spark.{SparkConf, SparkException, TestUtils}
+import org.apache.spark.{SparkConf, SparkException, SparkUpgradeException, TestUtils}
 import org.apache.spark.sql.{AnalysisException, Column, DataFrame, Encoders, QueryTest, Row}
 import org.apache.spark.sql.catalyst.util.{DateTimeTestUtils, DateTimeUtils}
 import org.apache.spark.sql.execution.datasources.CommonFileDataSourceSuite
@@ -2788,6 +2788,52 @@ abstract class CSVSuite
       }
     }
   }
+
+  test("SPARK-39731: Correctly parse dates and timestamps with yyyyMMdd pattern") {
+    withTempPath { path =>
+      Seq(
+        "1,2020011,2020011",
+        "2,20201203,20201203").toDF()
+        .repartition(1)
+        .write.text(path.getAbsolutePath)
+      val schema = new StructType()
+        .add("id", IntegerType)
+        .add("date", DateType)
+        .add("ts", TimestampType)
+      val output = spark.read
+        .schema(schema)
+        .option("dateFormat", "yyyyMMdd")
+        .option("timestampFormat", "yyyyMMdd")
+        .csv(path.getAbsolutePath)
+
+      def check(mode: String, res: Seq[Row]): Unit = {
+        withSQLConf(SQLConf.LEGACY_TIME_PARSER_POLICY.key -> mode) {
+          checkAnswer(output, res)
+        }
+      }
+
+      check(
+        "legacy",
+        Seq(
+          Row(1, Date.valueOf("2020-01-01"), Timestamp.valueOf("2020-01-01 00:00:00")),
+          Row(2, Date.valueOf("2020-12-03"), Timestamp.valueOf("2020-12-03 00:00:00"))
+        )
+      )
+
+      check(
+        "corrected",
+        Seq(
+          Row(1, null, null),
+          Row(2, Date.valueOf("2020-12-03"), Timestamp.valueOf("2020-12-03 00:00:00"))
+        )
+      )
Comment on lines +2866 to +2879

Contributor: For completeness, would you consider adding a check for spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala, lines 2598 to 2601 in 1193ce7?

Contributor (Author): Done!
+
+      val err = intercept[SparkException] {
+        check("exception", Nil)
+      }.getCause
+      assert(err.isInstanceOf[SparkUpgradeException])
+    }
+  }
 }

 class CSVv1Suite extends CSVSuite {
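The PR title covers the JSON data source as well, though that part of the diff is not shown in this excerpt. For reference, here is a minimal sketch of the equivalent JSON-side read; `schema` and `path` are assumed to be defined as in the CSV test above:

```scala
// Sketch only: the JSON source accepts the same dateFormat/timestampFormat
// options, so the scenario from the CSV test above maps over directly.
val jsonOutput = spark.read
  .schema(schema)
  .option("dateFormat", "yyyyMMdd")
  .option("timestampFormat", "yyyyMMdd")
  .json(path.getAbsolutePath)
```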
Reviewer: Is this technically a breaking change for users who could previously specify an invalid pattern without LEGACY mode?

Before: ignore the invalid pattern and fall back to parsing with `DateTimeUtils.stringToTimestamp`.
Now: it throws an error.

We don't support invalid patterns, but as a user I would be unhappy to see my code break. I'm unsure whether this actually counts as a breaking change, since it is such an edge case and the user is already doing something invalid. I'm curious to hear your thoughts.
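To make the before/after concrete, here is a minimal user-level sketch of the scenario (assuming a local SparkSession and an illustrative /tmp path; the data and pattern mirror the test added in this PR):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DateType, IntegerType, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// "2020011" has only seven digits, so it can never match "yyyyMMdd".
Seq("1,2020011", "2,20201203").toDF()
  .repartition(1)
  .write.mode("overwrite").text("/tmp/spark-39731-demo")

val schema = new StructType().add("id", IntegerType).add("date", DateType)
val df = spark.read
  .schema(schema)
  .option("dateFormat", "yyyyMMdd")
  .csv("/tmp/spark-39731-demo")

// LEGACY: lenient SimpleDateFormat-style parsing accepts the short value.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df.show() // row 1 parses to 2020-01-01

// CORRECTED (after this PR): the mismatching value becomes null under the
// default PERMISSIVE parse mode instead of silently falling back.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")
df.show() // row 1's date is null
```

Since the read is lazy, the policy in effect at each `show()` determines the parsing behaviour, which is the same technique the new test uses with `withSQLConf`.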
Author: This is a good point. It would be a breaking change for users if they were relying on the compatibility fallback.

There could be an alternative fix: maybe we can look into updating `DateTimeUtils.stringToDate`, but I am not sure. I could also add a feature flag to control this behaviour in the JSON and CSV connectors so users can always opt in to the legacy behaviour, for example a data source option called "useLegacyParsing" or something similar. The option would be disabled by default, and the exception would carry a message saying that you can enable the option to keep the previous behaviour. Maybe this could be a good solution.

Let me know if something like that could work, thanks.
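As a sketch of what that proposal could look like from the user side (note that "useLegacyParsing" is only the name floated in this comment, not an option that exists in Spark):

```scala
// Hypothetical opt-in, per the proposal above; this option does not exist
// in Spark and would be disabled by default.
val legacyDf = spark.read
  .schema(schema)
  .option("dateFormat", "yyyyMMdd")
  .option("useLegacyParsing", "true") // opt back in to the pre-fix fallback
  .csv("/tmp/spark-39731-demo")
```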
Reviewer: I think this should work. It feels weird that users have to opt in to the correct behavior, but hopefully this affects only a small percentage of users. Maybe @kamcheungting-db or @cloud-fan can weigh in.

I personally wouldn't be confident updating `DateTimeUtils.stringToDate` because there are so many usages elsewhere. But if you are familiar with the other use cases of `DateTimeUtils.stringToDate`, then this could work. I'll loop back if I think of an alternative.
Reviewer: I think the safest option is to copy-paste the old code of `stringToDate` from before #32959 and use it here, but that's really ugly and hard to maintain.

I'd like to understand more about the invalid-pattern behavior. Will we trigger the fallback for every input row? That sounds like a big perf problem...
Author: With the invalid pattern and before this PR, yes, the fallback code would be triggered on every pattern mismatch. With the change, we will just throw an exception or parse those values as nulls. Yes, it does sound like a performance issue, but it has been there for some time.

I agree that copy-pasting `stringToDate` would be ugly; instead, I proposed adding a data source config to keep the old behaviour. What do you think?
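To illustrate the shape of the per-row cost under discussion, here is a simplified sketch of a parse-with-fallback loop. This is not Spark's actual code; `lenient` stands in for the `DateTimeUtils.stringToDate`-style fallback:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

// Simplified sketch, not Spark's implementation. If the pattern never matches
// the data, every row throws inside Try and then runs the lenient fallback,
// which is the per-row performance concern raised above.
def parseWithFallback(
    s: String,
    strict: DateTimeFormatter,
    lenient: String => Option[LocalDate]): Option[LocalDate] =
  Try(LocalDate.parse(s, strict)).toOption // fast path: pattern matches
    .orElse(lenient(s))                    // slow path: hit on every mismatch
```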