# [SPARK-31755][SQL] allow missing year/hour when parsing date/timestamp string #28576
## Changes from all commits
```diff
@@ -19,7 +19,7 @@ package org.apache.spark.sql.catalyst.util
 import java.time._
 import java.time.chrono.IsoChronology
-import java.time.format.{DateTimeFormatter, DateTimeFormatterBuilder, DateTimeParseException, ResolverStyle}
+import java.time.format.{DateTimeFormatter, DateTimeFormatterBuilder, ResolverStyle}
 import java.time.temporal.{ChronoField, TemporalAccessor, TemporalQueries}
 import java.util.Locale
```
```diff
@@ -31,17 +31,52 @@ import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.internal.SQLConf.LegacyBehaviorPolicy._

 trait DateTimeFormatterHelper {
+  private def getOrDefault(accessor: TemporalAccessor, field: ChronoField, default: Int): Int = {
+    if (accessor.isSupported(field)) {
+      accessor.get(field)
+    } else {
+      default
+    }
+  }
+
+  protected def toLocalDate(accessor: TemporalAccessor): LocalDate = {
+    val localDate = accessor.query(TemporalQueries.localDate())
+    // If all the date fields are specified, return the local date directly.
+    if (localDate != null) return localDate
+
+    // Users may want to parse only a few datetime fields from a string and extract these fields
+    // later, and we should provide default values for missing fields.
+    // To be compatible with Spark 2.4, we pick 1970 as the default value of year.
+    val year = getOrDefault(accessor, ChronoField.YEAR, 1970)
+    val month = getOrDefault(accessor, ChronoField.MONTH_OF_YEAR, 1)
```
|
**Reviewer:** Are we also going to error out if they specify the day but not the month? Really, the only formats that make sense are the ones where a full prefix of the y-m-d h-m-s sequence is given, and all others are likely to be a case where the user made a mistake (e.g. asked for `mm` twice where they meant `MM`).

**Author:** January should be fine? It accepts any DAY_OF_MONTH.

**Author:** This will error out unless the two `mm` parse to the same value.

**Reviewer:** I'm saying it works (i.e., it accepts all valid values), but it's nonsensical; it'll never return the complete timestamp that people intended. That said, I can see cases where people want to parse contiguous ranges (e.g. just a time, or just a month, or just a month + day) and then extract the parsed components. So I guess the most permissive but still safe thing here would be to forbid a missing "year" only if someone also requests "day" and "month", because that's the situation where "year" affects the valid range of "day". But if they don't request year and they don't request month, then month would default to 1 and there would be no impact on the valid range of days; it would always go up to 31.

**Author:** ah, this looks better! I'm wondering if we still need the legacy config, though. It's very hard to hit the missing-year exception, and if it happens it's very likely to be a data error. Users should fix the data, or change the pattern and provide the default year.

**Author:** ok, so we should go back to the first version, which just uses 1970 as the default value and fails for invalid dates at runtime 😂

**Author:** BTW, the current framework just returns null if the parser fails. Shall we treat this case specially and throw an error that fails the query?

**Reviewer:** Error out in ANSI mode only, I'd say, just like with all other parse failures (which should fail in ANSI mode -- I hope they already do). FWIW, going back to the first version is not a shame now that we know it's the right thing to do. :) Maybe we can leave some comments in the code to explain the reasoning why we don't error out if things are missing.

**Author:** unfortunately, datetime functions don't respect the ANSI flag at all. Maybe we should keep everything the same as Spark 2.4 (return null when parsing Feb 29, because 1970-02-29 is invalid), and implement the ANSI behavior in Spark 3.1?

**Reviewer:** Yikes! It seems terrible that we have to do that later, because then it can break existing workloads. Launching the ANSI behavior was our chance to do the right thing here. If we can still fix this in 3.0 then that would of course be preferable, but that seems like a stretch...
```diff
+    val day = getOrDefault(accessor, ChronoField.DAY_OF_MONTH, 1)
+    LocalDate.of(year, month, day)
+  }
```
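To make the default-year behavior concrete, here is a hedged sketch of what `toLocalDate` does for a pattern without a year; the pattern and input are invented for illustration.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoField

// With pattern "MM-dd" the accessor carries month and day but no year, so
// toLocalDate falls back to the default year 1970.
val parsed = DateTimeFormatter.ofPattern("MM-dd").parse("02-28")
val month = parsed.get(ChronoField.MONTH_OF_YEAR) // 2
val day = parsed.get(ChronoField.DAY_OF_MONTH)    // 28
LocalDate.of(1970, month, day)                    // 1970-02-28

// The Feb 29 case from the thread: "02-29" parses field-by-field, but 1970 is
// not a leap year, so constructing the date throws a DateTimeException.
// LocalDate.of(1970, 2, 29)
```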
```diff
+  private def toLocalTime(accessor: TemporalAccessor): LocalTime = {
+    val localTime = accessor.query(TemporalQueries.localTime())
+    // If all the time fields are specified, return the local time directly.
+    if (localTime != null) return localTime
+
+    val hour = if (accessor.isSupported(ChronoField.HOUR_OF_DAY)) {
+      accessor.get(ChronoField.HOUR_OF_DAY)
+    } else if (accessor.isSupported(ChronoField.HOUR_OF_AMPM)) {
+      // When we reach here, it means am/pm is not specified. Here we assume it's am.
+      accessor.get(ChronoField.HOUR_OF_AMPM)
+    } else {
+      0
+    }
+    val minute = getOrDefault(accessor, ChronoField.MINUTE_OF_HOUR, 0)
+    val second = getOrDefault(accessor, ChronoField.SECOND_OF_MINUTE, 0)
+    val nanoSecond = getOrDefault(accessor, ChronoField.NANO_OF_SECOND, 0)
+    LocalTime.of(hour, minute, second, nanoSecond)
+  }
```
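The `HOUR_OF_AMPM` branch above is easiest to see with a 12-hour pattern that carries no am/pm marker (an assumed example, not from the PR):

```scala
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoField

// Pattern "h:mm" uses a 12-hour clock with no am/pm field, so the parsed
// accessor supports HOUR_OF_AMPM but not HOUR_OF_DAY; toLocalTime then
// assumes "am" and reads the am/pm hour directly.
val parsed = DateTimeFormatter.ofPattern("h:mm").parse("3:15")
parsed.isSupported(ChronoField.HOUR_OF_DAY)  // false
parsed.isSupported(ChronoField.HOUR_OF_AMPM) // true
parsed.get(ChronoField.HOUR_OF_AMPM)         // 3
```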
```diff
-  // Converts the parsed temporal object to ZonedDateTime. It sets time components to zeros
-  // if they does not exist in the parsed object.
-  protected def toZonedDateTime(
-      temporalAccessor: TemporalAccessor,
-      zoneId: ZoneId): ZonedDateTime = {
-    // Parsed input might not have time related part. In that case, time component is set to zeros.
-    val parsedLocalTime = temporalAccessor.query(TemporalQueries.localTime)
-    val localTime = if (parsedLocalTime == null) LocalTime.MIDNIGHT else parsedLocalTime
-    // Parsed input must have date component. At least, year must present in temporalAccessor.
-    val localDate = temporalAccessor.query(TemporalQueries.localDate)
-
+  protected def toZonedDateTime(accessor: TemporalAccessor, zoneId: ZoneId): ZonedDateTime = {
+    val localDate = toLocalDate(accessor)
+    val localTime = toLocalTime(accessor)
     ZonedDateTime.of(localDate, localTime, zoneId)
   }
```
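As a usage sketch (assumed to run inside the trait, since `toZonedDateTime` is protected; the pattern and input are invented), a time-only pattern now yields a complete `ZonedDateTime` with the date half defaulted to 1970-01-01:

```scala
import java.time.ZoneId
import java.time.format.DateTimeFormatter

// The accessor carries only time fields; toLocalDate supplies 1970-01-01
// and toLocalTime supplies the parsed time, so the result is complete.
val accessor = DateTimeFormatter.ofPattern("HH:mm:ss").parse("12:30:00")
// toZonedDateTime(accessor, ZoneId.of("UTC")) == 1970-01-01T12:30 in UTC
```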
|
|
```diff
@@ -72,19 +107,15 @@ trait DateTimeFormatterHelper {
   // DateTimeParseException will address by the caller side.
   protected def checkDiffResult[T](
       s: String, legacyParseFunc: String => T): PartialFunction[Throwable, T] = {
-    case e: DateTimeParseException if SQLConf.get.legacyTimeParserPolicy == EXCEPTION =>
-      val res = try {
-        Some(legacyParseFunc(s))
+    case e: DateTimeException if SQLConf.get.legacyTimeParserPolicy == EXCEPTION =>
```

**Author:** when the field value exceeds the valid range, the JDK throws `DateTimeException`, not `DateTimeParseException`.

```diff
+      try {
+        legacyParseFunc(s)
       } catch {
-        case _: Throwable => None
-      }
-      if (res.nonEmpty) {
-        throw new SparkUpgradeException("3.0", s"Fail to parse '$s' in the new parser. You can " +
-          s"set ${SQLConf.LEGACY_TIME_PARSER_POLICY.key} to LEGACY to restore the behavior " +
-          s"before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.", e)
-      } else {
-        throw e
+        case _: Throwable => throw e
       }
+      throw new SparkUpgradeException("3.0", s"Fail to parse '$s' in the new parser. You can " +
+        s"set ${SQLConf.LEGACY_TIME_PARSER_POLICY.key} to LEGACY to restore the behavior " +
+        s"before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.", e)
   }
 }
```
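For context on the exception change (a hedged sketch, not code from the PR): `DateTimeParseException` extends `DateTimeException`, so the widened `case e: DateTimeException` still catches parse failures, and additionally catches the range failures raised when the defaulted fields form an invalid date.

```scala
import java.time.{DateTimeException, LocalDate}

// LocalDate.of throws the base DateTimeException (not DateTimeParseException)
// when a field value is out of range for the resolved date.
try {
  LocalDate.of(1970, 2, 29) // 1970 is not a leap year
} catch {
  case e: DateTimeException => println(e.getMessage) // "Invalid date ..."
}
```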
|
|
```diff
@@ -101,10 +132,6 @@ private object DateTimeFormatterHelper {

   def toFormatter(builder: DateTimeFormatterBuilder, locale: Locale): DateTimeFormatter = {
     builder
-      .parseDefaulting(ChronoField.MONTH_OF_YEAR, 1)
-      .parseDefaulting(ChronoField.DAY_OF_MONTH, 1)
-      .parseDefaulting(ChronoField.MINUTE_OF_HOUR, 0)
-      .parseDefaulting(ChronoField.SECOND_OF_MINUTE, 0)
```

**Author:** It's too fragile to use these default values, as it may fail to parse due to conflicts. This PR decides default values manually, after parsing. See https://github.com/apache/spark/pull/28576/files#diff-3e4b85f54d2e75f1b5b845279749024eR42

```diff
       .toFormatter(locale)
       .withChronology(IsoChronology.INSTANCE)
       .withResolverStyle(ResolverStyle.STRICT)
```
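The kind of conflict the author refers to can be reproduced with an assumed day-of-year pattern (an illustration, not the PR's own test case): the injected month/day defaults contradict the date derived from `DAY_OF_YEAR`, so strict resolution rejects otherwise valid input.

```scala
import java.time.chrono.IsoChronology
import java.time.format.{DateTimeFormatterBuilder, ResolverStyle}
import java.time.temporal.ChronoField

// Assumed example: "2020-050" is day 50 of 2020 (Feb 19), but the defaults
// claim month = 1 and day = 1, and the strict resolver cross-checks them.
val formatter = new DateTimeFormatterBuilder()
  .appendPattern("uuuu-DDD")
  .parseDefaulting(ChronoField.MONTH_OF_YEAR, 1)
  .parseDefaulting(ChronoField.DAY_OF_MONTH, 1)
  .toFormatter()
  .withChronology(IsoChronology.INSTANCE)
  .withResolverStyle(ResolverStyle.STRICT)

// formatter.parse("2020-050") // throws: conflict between the derived date
//                             // and the defaulted month/day fields
```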
**Reviewer:** nit: I would follow the naming of `getOrElse` though ..