[SPARK-31755][SQL] allow missing year/hour when parsing date/timestamp string #28576
Conversation
```scala
  case _: Throwable => None
}
if (res.nonEmpty) {
case e: DateTimeException if SQLConf.get.legacyTimeParserPolicy == EXCEPTION =>
```
When the field value exceeds the valid range, the JDK throws DateTimeException, and we should catch it too (DateTimeParseException extends DateTimeException).
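For illustration, here is a minimal sketch (my example, not code from the PR) of why catching only DateTimeParseException is not enough: the failure can happen while resolving the parsed fields into a date, after text parsing has already succeeded.

```scala
import java.time.{DateTimeException, LocalDate}

// 1970 is not a leap year, so combining a parsed "Feb 29" with the default year
// fails during field resolution, not during text parsing. The JDK throws a plain
// DateTimeException here, which a `case _: DateTimeParseException` would miss.
try {
  LocalDate.of(1970, 2, 29)
} catch {
  case e: DateTimeException => println(s"caught: ${e.getMessage}")
}
```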
```scala
  .parseDefaulting(ChronoField.MONTH_OF_YEAR, 1)
  .parseDefaulting(ChronoField.DAY_OF_MONTH, 1)
  .parseDefaulting(ChronoField.MINUTE_OF_HOUR, 0)
  .parseDefaulting(ChronoField.SECOND_OF_MINUTE, 0)
```
It's too fragile to use these default values, as parsing may fail due to conflicts. This PR decides the default values manually, after parsing. See https://github.com/apache/spark/pull/28576/files#diff-3e4b85f54d2e75f1b5b845279749024eR42
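As a rough sketch of the post-parse approach the linked diff uses (paraphrased from the PR, not its exact code): skip parseDefaulting on the builder entirely and fall back to a constant only when the parsed object lacks the field.

```scala
import java.time.temporal.{ChronoField, TemporalAccessor}

// Post-parse defaulting: the formatter never has to reconcile a builder-level
// default with what the text actually contained, so conflicts cannot arise.
def getOrDefault(accessor: TemporalAccessor, field: ChronoField, default: Int): Int =
  if (accessor.isSupported(field)) accessor.get(field) else default

// e.g. val minute = getOrDefault(parsed, ChronoField.MINUTE_OF_HOUR, 0)
```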
Test build #122832 has finished for PR 28576 at commit
```scala
// Converts the parsed temporal object to ZonedDateTime. It sets time components to zeros
// if they do not exist in the parsed object.
protected def toZonedDateTime(
```
I guess the additional function calls will slow down overall performance. I think it makes sense to re-run the datetime benchmarks.
I agree. Can you help re-generate the benchmark numbers and send me a PR? They were generated by you last time, and I'm afraid re-running them locally on my side won't show the difference.
```scala
}

protected def toLocalDate(temporalAccessor: TemporalAccessor): LocalDate = {
  val year = getOrDefault(temporalAccessor, ChronoField.YEAR, 1970)
```
How often do users consider 1970 the default year, in your estimation? I would propose the current year, as we do in stringToTimestamp:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala (line 396 in ebc8fa5):
```scala
LocalDate.now(zoneId)
```
It's from 2.4. I don't care too much about it as long as we have a default value.
It is somewhat inconsistent behaviour between CAST and parsing via patterns (in to_timestamp, for example). Should we document the default year somewhere?
```scala
val hour = if (temporalAccessor.isSupported(ChronoField.HOUR_OF_DAY)) {
  temporalAccessor.get(ChronoField.HOUR_OF_DAY)
} else if (temporalAccessor.isSupported(ChronoField.HOUR_OF_AMPM)) {
  // When we reach here, is mean am/pm is not specified. Here we assume it's am.
```
is mean am/pm is not specified. -> it means that am/pm is not specified.
```scala
}

protected def toLocalDate(temporalAccessor: TemporalAccessor): LocalDate = {
  val year = getOrDefault(temporalAccessor, ChronoField.YEAR, 1970)
```
What about the case when ERA is set? In that case, the year will be in the YEAR_OF_ERA field, right?
TemporalAccessor.get does not read a stored field; it's actually a query. So for yyyy G, we can still query the YEAR field because both YEAR_OF_ERA and ERA are available.
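A minimal illustration with plain java.time (my example; the locale is pinned to US so the era text parses):

```scala
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoField
import java.util.Locale

// The pattern only carries YEAR_OF_ERA ("yyyy") and ERA ("G"), but after resolution
// the proleptic YEAR can still be queried from the parsed TemporalAccessor.
val parsed = DateTimeFormatter.ofPattern("yyyy G", Locale.US).parse("2020 AD")
val year = parsed.get(ChronoField.YEAR) // 2020
```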
```scala
val micros1 = formatter.parse("2009-12-12 00")
assert(micros1 === TimeUnit.SECONDS.toMicros(
  LocalDateTime.of(2009, 12, 12, 0, 0, 0).toEpochSecond(ZoneOffset.UTC)))
```
Can you use DateTimeTestUtils.days
spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeTestUtils.scala (lines 72 to 85 in 9f0c010):
```scala
// Returns microseconds since epoch for the given date
def date(
    year: Int,
    month: Byte = 1,
    day: Byte = 1,
    hour: Byte = 0,
    minute: Byte = 0,
    sec: Byte = 0,
    micros: Int = 0,
    zid: ZoneId = ZoneOffset.UTC): Long = {
  val nanos = TimeUnit.MICROSECONDS.toNanos(micros).toInt
  val localDateTime = LocalDateTime.of(year, month, day, hour, minute, sec, nanos)
  localDateTimeToMicros(localDateTime, zid)
}
```
| test("parsing hour with various patterns") { | ||
| def createFormatter(pattern: String): TimestampFormatter = { | ||
| // Use `SIMPLE_DATE_FORMAT`, so that the legacy parser also fails with invalid value range. | ||
| TimestampFormatter(pattern, ZoneOffset.UTC, LegacyDateFormats.SIMPLE_DATE_FORMAT, false) |
You can import UTC from DateTimeTestUtils instead of spreading ZoneOffset.UTC everywhere in tests.
```scala
val micros2 = formatter.parse("2009-12-12 15")
assert(micros2 === TimeUnit.SECONDS.toMicros(
  LocalDateTime.of(2009, 12, 12, 15, 0, 0).toEpochSecond(ZoneOffset.UTC)))
```
I would propose improving test coverage and testing different years (-100, 0001, 1582, 1900, 1970, 2037, 2100) instead of checking only one year in all tests.
It's a unit test, which is white-box testing. It's clear from the code that the time parsing is unrelated to the value of the year.
```scala
}

protected def toLocalDate(temporalAccessor: TemporalAccessor): LocalDate = {
  val year = getOrDefault(temporalAccessor, ChronoField.YEAR, 1970)
```
Maybe leave a comment here to explain that the default year/month/day is for compatibility with Spark 2.4; it's a little strange to hard-code the value here.
Note that if we were designing this fresh, the year 1970 would not be a great choice when it is used to parse arbitrary month/day combinations. The reason is that it is not a leap year, which means it would never parse Feb 29. This behavior seems pretty broken. We should consider forbidding missing fields in the middle / from the top, i.e., you have to provide a prefix of (year, month, day, hour, minute, second, millis/micros/nanos). If anybody wants the defaults, they can simulate them by prefixing with their own values, e.g. if you want to parse hh:mm:ss then you have to parse …
This is a good point. Now I agree 1970 is not a good default. I'll fail it by default, with a legacy config to use 1970 as the default.
```scala
  1970
} else {
  throw new SparkUpgradeException("3.0",
    "Year must be given in the date/timestamp string to be parsed. You can set " +
```
Maybe also suggest the alternative workaround, e.g. prepending '1970 ' to the string-to-be-parsed and prepending 'yyyy ' to the format string. That one works without the legacy setting, so I'd say it's preferred.
A problem is that the pattern is flexible and the year may not be at the beginning, e.g. MM/dd yyyy …. I don't think there is a workaround, unless we provide an API to allow users to set the default values.
That's actually great -- if any part is missing from the format and the input, users can provide a default by prepending a constant to both the input and the format string. I.e., if the real format string is "MM dd" or "dd MM" or anything arbitrary but you're missing the year, you can always prepend "yyyy " to the format string and "2009 " to the input value, and that fills out that part of the value.
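A minimal sketch of that workaround with plain java.time (the pattern "MM dd" and the constant year 2009 are arbitrary examples):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// The real pattern is "MM dd" and the real input is "12 31"; the year is missing,
// so prepend a constant year to both the pattern and the input.
val fmt = DateTimeFormatter.ofPattern("yyyy MM dd")
val date = LocalDate.parse("2009 " + "12 31", fmt) // 2009-12-31
```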
```scala
      SQLConf.LEGACY_ALLOW_MISSING_YEAR_DURING_PARSING.key + " to true, to pick 1970 as " +
      "the default value of year.", null)
}
val month = getOrDefault(accessor, ChronoField.MONTH_OF_YEAR, 1)
```
Are we also going to error out if they specify the day but not the month? Really, the only formats that make sense are the ones where a full prefix is given in the y-m-d h-m-s sequence, and all others are likely to be a case where they made a mistake (e.g. asked for "mm" twice where they meant MM).
January should be fine? It accepts any DAY_OF_MONTH.
asked for "mm" twice where they meant MM
This will error out unless the two mm occurrences parse to the same value, e.g. mm mm yyyy with the value 10 10 1990.
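A quick java.time illustration of that behaviour (my example; the pattern and values are arbitrary):

```scala
import java.time.format.DateTimeFormatter

val f = DateTimeFormatter.ofPattern("mm mm yyyy")
f.parse("10 10 1990") // parses: both occurrences of MINUTE_OF_HOUR agree
f.parse("10 11 1990") // throws DateTimeParseException: the two minute values conflict
```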
I'm saying it works (i.e., it accepts all valid values) but it's nonsensical; it'll never return the complete timestamp that people intended.
That said, I can see cases where people want to parse contiguous ranges (e.g. just a time, or just a month, or just a month + day) and then extract the parsed components using EXTRACT / date_part, or just to format them differently.
So I guess the most permissive but still safe thing here would be to forbid a missing "year" only if someone also requests "day" and "month", because that's the situation where "year" affects the valid range of "day". But if they don't request year and they don't request month, then month would default to 1 and there would be no impact on the valid range of days; it would always go up to 31. I.e. (a rough sketch of this rule follows the list below):
"yyyy MM dd" -> OK
"MM dd" -> fail, need year to know valid range of dd
"MM" -> OK
"dd" -> OK
"hh:mm:ss" -> OK
"MM dd hh:mm:ss" -> fail, need year to know valid range of dd
ah this looks better!
I'm wondering if we still need the legacy config though. It's very hard to hit the missing year exception, and if it happens it's very likely to be a data error. Users should fix the data or change the pattern and provide the default year.
ok so we should go back to the first version, which just uses 1970 as the default value and fails for invalid dates at runtime 😂
BTW the current framework just returns null if the parser fails. Shall we treat this case specially and throw an error that fails the query?
Error out in ANSI mode only, I'd say, just like with all other parse failures (which should fail in ANSI mode -- I hope they already do).
FWIW, going back to the first version is not a shame now that we know it's the right thing to do. :) Maybe we can leave some comments in the code to explain the reasoning why we don't error out if things are missing.
unfortunately datetime functions don't respect the ANSI flag at all. Maybe we should keep everything the same as Spark 2.4 (return null when parsing Feb 29 because 1970-02-29 is invalid), and implement ANSI behavior in Spark 3.1?
Yikes! It seems terrible that we have to do that later, because then it can break existing workloads. Launching the ANSI behavior was our chance to do the right thing here. If we can still fix this in 3.0 then that would of course be preferable, but that seems like a stretch...
```scala
val LEGACY_ALLOW_MISSING_YEAR_DURING_PARSING =
  buildConf("spark.sql.legacy.allowMissingYearDuringParsing")
    .internal()
    .doc("When true, DateFormatter/TimestampFormatter allows parsing date/timestamp string " +
```
Here too, you could suggest the alternative workaround that doesn't require setting the legacy flag.
Test build #122851 has finished for PR 28576 at commit
Test build #122880 has finished for PR 28576 at commit
Test build #122877 has finished for PR 28576 at commit
I have re-generated the results of DateTimeBenchmark, CSVBenchmark and JsonBenchmark on JDK 8 and 11. @cloud-fan Please review and merge this PR: cloud-fan#17
Force-pushed from 8c2ce05 to 9f03a47 (Compare)
Test build #122923 has finished for PR 28576 at commit
Test build #122932 has finished for PR 28576 at commit
HyukjinKwon left a comment
LGTM - I am just going to merge and ignore one nit.
```scala
import org.apache.spark.sql.internal.SQLConf.LegacyBehaviorPolicy._

trait DateTimeFormatterHelper {
  private def getOrDefault(accessor: TemporalAccessor, field: ChronoField, default: Int): Int = {
```
nit: I would follow the naming of getOrElse, though.
Merged to master.
@cloud-fan, there's a conflict in the SQL file output against branch-3.0. It's trivial, but can you open a PR to backport it since we're running RCs?
[SPARK-31755][SQL] allow missing year/hour when parsing date/timestamp string

### What changes were proposed in this pull request?

This PR allows missing hour fields when parsing a date/timestamp string, with 0 as the default value.

If the year field is missing, this PR still fails the query by default, but provides a new legacy config to allow it and use 1970 as the default value. 1970 is not a great default, as it is not a leap year, which means Feb 29 would never parse; we pick it only for backward compatibility.

### Why are the changes needed?

To keep backward compatibility with Spark 2.4.

### Does this PR introduce any user-facing change?

Yes.
Spark 2.4:
```
scala> sql("select to_timestamp('16', 'dd')").show
+------------------------+
|to_timestamp('16', 'dd')|
+------------------------+
| 1970-01-16 00:00:00|
+------------------------+
scala> sql("select to_date('16', 'dd')").show
+-------------------+
|to_date('16', 'dd')|
+-------------------+
| 1970-01-16|
+-------------------+
scala> sql("select to_timestamp('2019 40', 'yyyy mm')").show
+----------------------------------+
|to_timestamp('2019 40', 'yyyy mm')|
+----------------------------------+
| 2019-01-01 00:40:00|
+----------------------------------+
scala> sql("select to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')").show
+----------------------------------------------+
|to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')|
+----------------------------------------------+
| 2019-01-01 10:10:10|
+----------------------------------------------+
```
In branch-3.0:
```
scala> sql("select to_timestamp('16', 'dd')").show
+--------------------+
|to_timestamp(16, dd)|
+--------------------+
| null|
+--------------------+
scala> sql("select to_date('16', 'dd')").show
+---------------+
|to_date(16, dd)|
+---------------+
| null|
+---------------+
scala> sql("select to_timestamp('2019 40', 'yyyy mm')").show
+------------------------------+
|to_timestamp(2019 40, yyyy mm)|
+------------------------------+
| 2019-01-01 00:00:00|
+------------------------------+
scala> sql("select to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')").show
+------------------------------------------+
|to_timestamp(2019 10:10:10, yyyy hh:mm:ss)|
+------------------------------------------+
| 2019-01-01 00:00:00|
+------------------------------------------+
```
After this PR, the behavior becomes the same as 2.4, if the legacy config is enabled.

### How was this patch tested?

New tests.
Closes apache#28576 from cloud-fan/bug.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
… results

### What changes were proposed in this pull request?

Re-generate results of:
- DateTimeBenchmark
- CSVBenchmark
- JsonBenchmark

in the environment:

| Item | Description |
| ---- | ---- |
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

### Why are the changes needed?

1. The PR #28576 changed the date-time parser. The `DateTimeBenchmark` should confirm that the PR didn't slow down date/timestamp parsing.
2. CSV/JSON datasources are affected by the above PR too. This PR updates the benchmark results in the same environment as other benchmarks to have a baseline for future optimizations.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

By running benchmarks via the script:

```python
#!/usr/bin/env python3
import os
from sparktestsupport.shellutils import run_cmd

benchmarks = [
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]

print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'
for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```

Closes #28613 from MaxGekk/missing-hour-year-benchmarks.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
This PR allows missing year and hour fields when parsing a date/timestamp string, with 1970 as the default year and 0 as the default hour.
Why are the changes needed?
To keep backward compatibility with Spark 2.4.
Does this PR introduce any user-facing change?
Yes.
Spark 2.4 vs. branch-3.0: see the example outputs above.
After this PR, the behavior becomes the same as 2.4.
How was this patch tested?
New tests.