-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-28420][SQL] Support the INTERVAL type in date_part()
#25981
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
b75a80f
b233c90
6f017a6
ac1c3a8
7265508
554df71
d0f89f4
8174fd5
6378b95
8e4ca7d
8ed01c7
f1eea12
561f789
0a671a9
894a6c7
f86f4f5
283fd99
5e189ca
62c21b9
a4fbb5e
e1c9415
7f4100f
f3cf7f0
9262f9d
b9890ec
bea3faf
77e0fb3
58017a7
f5620b3
dcaf5b2
f202b15
dca29e5
f8a2385
7b1663e
8a494a2
f08531b
e8a61c8
f8a45b3
a496d73
47a0290
2099a91
d4375b5
5620472
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -18,7 +18,7 @@ | |
| package org.apache.spark.sql.catalyst.expressions | ||
|
|
||
| import java.sql.Timestamp | ||
| import java.time.{DateTimeException, Instant, LocalDate, LocalDateTime, ZoneId} | ||
| import java.time.{DateTimeException, LocalDate, LocalDateTime, ZoneId} | ||
| import java.time.temporal.IsoFields | ||
| import java.util.{Locale, TimeZone} | ||
|
|
||
|
|
@@ -2032,10 +2032,11 @@ object DatePart { | |
| } | ||
|
|
||
| @ExpressionDescription( | ||
| usage = "_FUNC_(field, source) - Extracts a part of the date/timestamp.", | ||
| usage = "_FUNC_(field, source) - Extracts a part of the date/timestamp or interval source.", | ||
| arguments = """ | ||
| Arguments: | ||
| * field - selects which part of the source should be extracted. Supported string values are: | ||
| * field - selects which part of the source should be extracted. | ||
| Supported string values of `field` for dates and timestamps are: | ||
| ["MILLENNIUM", ("MILLENNIA", "MIL", "MILS"), | ||
| "CENTURY", ("CENTURIES", "C", "CENT"), | ||
| "DECADE", ("DECADES", "DEC", "DECS"), | ||
|
|
@@ -2055,7 +2056,21 @@ object DatePart { | |
| "MILLISECONDS", ("MSEC", "MSECS", "MILLISECON", "MSECONDS", "MS"), | ||
| "MICROSECONDS", ("USEC", "USECS", "USECONDS", "MICROSECON", "US"), | ||
| "EPOCH"] | ||
| * source - a date (or timestamp) column from where `field` should be extracted | ||
| Supported string values of `field` for intervals are: | ||
| ["MILLENNIUM", ("MILLENNIA", "MIL", "MILS"), | ||
| "CENTURY", ("CENTURIES", "C", "CENT"), | ||
| "DECADE", ("DECADES", "DEC", "DECS"), | ||
| "YEAR", ("Y", "YEARS", "YR", "YRS"), | ||
| "QUARTER", ("QTR"), | ||
| "MONTH", ("MON", "MONS", "MONTHS"), | ||
| "DAY", ("D", "DAYS"), | ||
| "HOUR", ("H", "HOURS", "HR", "HRS"), | ||
| "MINUTE", ("M", "MIN", "MINS", "MINUTES"), | ||
| "SECOND", ("S", "SEC", "SECONDS", "SECS"), | ||
| "MILLISECONDS", ("MSEC", "MSECS", "MILLISECON", "MSECONDS", "MS"), | ||
| "MICROSECONDS", ("USEC", "USECS", "USECONDS", "MICROSECON", "US"), | ||
| "EPOCH"] | ||
| * source - a date/timestamp or interval column from where `field` should be extracted | ||
| """, | ||
| examples = """ | ||
| Examples: | ||
|
|
@@ -2067,6 +2082,10 @@ object DatePart { | |
| 224 | ||
| > SELECT _FUNC_('SECONDS', timestamp'2019-10-01 00:00:01.000001'); | ||
srowen marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| 1.000001 | ||
| > SELECT _FUNC_('days', interval 1 year 10 months 5 days); | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So, feel free to tell me to go look up a reference, but I am sort of confused about the semantics. The "days" part of "interval 1 year 10 months 5 days" is 5 days, not the interval days (roughly 365 + 10 *31 + 5). But below, the "seconds" part of "30 seconds 1 ms 1 us" is 30.001001, not 30. So what's the number of days in "interval 1 year 10 months 5 days 12 hours"? 5 or 5.5?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sort of answering my own question. From PostgreSQL, at least: It seems like the answer to the second example here should be 30? I'm getting off on a tangent, but, can you specify "interval 5 minutes 90 seconds"? if so, what's the minute part -- 5 or 6? if you can't specify that, can you specify "interval 90 seconds"? if not why not? Just getting confused about the intended semantics of the date part of an interval!
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Semantic is defined by internal representation of Catalyst's When an user request components of an interval, we need to select one of its representation. We do that by constructing "normalized" representation in where number of
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah I get why we split the months and ms part, and can see what the convenient implementation is w.r.t. the current implementation. Maybe the details of the semantics aren't that important for practical purposes. But it does seem like this proposal doesn't match PostgreSQL in the
This feels inconsistent. What if we construed the semantics to always mean 'the given interval in the given units'? that's consistent, but doesn't quite sound like what Am I right about that so far?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
It does. In PostgreSQL: maxim=# SELECT date_part('hour', INTERVAL '4 hours 3 minutes');
date_part
-----------
4In SparkSQL: spark-sql> SELECT date_part('hour', interval 4 hours 3 minutes);
4
No, it is 0 because the range of hours is maxim=# SELECT date_part('hour', INTERVAL '1 month 1 day');
date_part
-----------
0In SparkSQL: spark-sql> SELECT date_part('hour', INTERVAL 1 month 1 day);
0
No, it gives 1. maxim=# SELECT date_part('month', INTERVAL '1 year 1 month');
date_part
-----------
1In SparkSQL: spark-sql> SELECT date_part('month', INTERVAL 1 year 1 month);
1
I cannot agree with you so far.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see, that's good. So as far as you know it's consistent with PostgreSQL? and correctly returns the time-unit part of an interval, not the interval in the time-unit? in that case I'm not sure why this is correct: Shouldn't it be 30?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I agree with you that For maxim=# select date_part('milliseconds', interval '30 seconds 1 milliseconds 1 microseconds');
date_part
-----------
30001.001maxim=# select date_part('microseconds', interval '30 seconds 1 milliseconds 1 microseconds');
date_part
-----------
30001001I interpret this as the Precisely follow PostgreSQL implementation is still open question for me. I believe PostgreSQL has badly designed date-time API. As you can see similar
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, it seems inconsistent. Sometimes the entire interval's length is return in the given unit, sometimes just a portion of the interval matching that unit is returned. I wonder if there is any other standard to look at. Doesn't this current PR have the same consistency question, given the current test output? I'm mostly concerned with having consistent semantics, whatever they are. If it can't be done reasonably, hm, should we implement this? I'm trying to figure out the use case for "a date part of an interval". I can only think of cases where the interval should be converted entirely into some unit.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The entire interval's length is returned for
I like java 8 approach where there are 2 types
I don't know. From my point of semantic of each extraction field is well defined. There is a difference in implementation of this PR and PostgreSQL. As I wrote above Spark store interval in 2 variables - maxim=# SELECT interval '-100 years 100 days -100 hours';
interval
---------------------------------
-100 years +100 days -100:00:00
(1 row)
I can image at least 2 use cases:
SELECT name, place
FROM t1, t2
WHERE t1.sellTimestamp - t2.deliveredTimestamp < interval 1 month 15 days;Here, the
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Hm, but then back to the original question; why is the answer not 30 below? I'm not concerned necessarily about matching PostgreSQL, but am concerned about internal consistency. Unless I'm really missing something, these two examples here seem to have different semantics. I understand if you mean it's the natural way to implement this given the representation, but, I'm not sure that makes it right? I think the discussion of That's why I'm questioning supporting this at all. |
||
| 5 | ||
| > SELECT _FUNC_('seconds', interval 5 hours 30 seconds 1 milliseconds 1 microseconds); | ||
| 30.001001 | ||
| """, | ||
| since = "3.0.0") | ||
| case class DatePart(field: Expression, source: Expression, child: Expression) | ||
|
|
@@ -2082,9 +2101,16 @@ case class DatePart(field: Expression, source: Expression, child: Expression) | |
| Literal(null, DoubleType) | ||
| } else { | ||
| val fieldStr = fieldEval.asInstanceOf[UTF8String].toString | ||
| DatePart.parseExtractField(fieldStr, source, { | ||
| throw new AnalysisException(s"Literals of type '$fieldStr' are currently not supported.") | ||
| }) | ||
| val errMsg = s"Literals of type '$fieldStr' are currently not supported " + | ||
| s"for the ${source.dataType.catalogString} type." | ||
| if (source.dataType == CalendarIntervalType) { | ||
| ExtractIntervalPart.parseExtractField( | ||
| fieldStr, | ||
| source, | ||
| throw new AnalysisException(errMsg)) | ||
| } else { | ||
| DatePart.parseExtractField(fieldStr, source, throw new AnalysisException(errMsg)) | ||
| } | ||
| } | ||
| }) | ||
| } | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,111 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one or more | ||
| * contributor license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright ownership. | ||
| * The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| * (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.spark.sql.catalyst.expressions | ||
|
|
||
| import java.util.Locale | ||
|
|
||
| import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode} | ||
| import org.apache.spark.sql.catalyst.util.IntervalUtils | ||
| import org.apache.spark.sql.catalyst.util.IntervalUtils._ | ||
| import org.apache.spark.sql.types._ | ||
| import org.apache.spark.unsafe.types.CalendarInterval | ||
|
|
||
| abstract class ExtractIntervalPart( | ||
| child: Expression, | ||
| val dataType: DataType, | ||
| func: CalendarInterval => Any, | ||
| funcName: String) | ||
| extends UnaryExpression with ExpectsInputTypes with Serializable { | ||
|
|
||
| override def inputTypes: Seq[AbstractDataType] = Seq(CalendarIntervalType) | ||
|
|
||
| override protected def nullSafeEval(interval: Any): Any = { | ||
| func(interval.asInstanceOf[CalendarInterval]) | ||
| } | ||
|
|
||
| override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { | ||
| val iu = IntervalUtils.getClass.getName.stripSuffix("$") | ||
| defineCodeGen(ctx, ev, c => s"$iu.$funcName($c)") | ||
| } | ||
| } | ||
|
|
||
| case class ExtractIntervalMillenniums(child: Expression) | ||
| extends ExtractIntervalPart(child, IntegerType, getMillenniums, "getMillenniums") | ||
|
|
||
| case class ExtractIntervalCenturies(child: Expression) | ||
| extends ExtractIntervalPart(child, IntegerType, getCenturies, "getCenturies") | ||
|
|
||
| case class ExtractIntervalDecades(child: Expression) | ||
| extends ExtractIntervalPart(child, IntegerType, getDecades, "getDecades") | ||
|
|
||
| case class ExtractIntervalYears(child: Expression) | ||
| extends ExtractIntervalPart(child, IntegerType, getYears, "getYears") | ||
|
|
||
| case class ExtractIntervalQuarters(child: Expression) | ||
| extends ExtractIntervalPart(child, ByteType, getQuarters, "getQuarters") | ||
|
|
||
| case class ExtractIntervalMonths(child: Expression) | ||
| extends ExtractIntervalPart(child, ByteType, getMonths, "getMonths") | ||
|
|
||
| case class ExtractIntervalDays(child: Expression) | ||
| extends ExtractIntervalPart(child, LongType, getDays, "getDays") | ||
|
|
||
| case class ExtractIntervalHours(child: Expression) | ||
| extends ExtractIntervalPart(child, ByteType, getHours, "getHours") | ||
|
|
||
| case class ExtractIntervalMinutes(child: Expression) | ||
| extends ExtractIntervalPart(child, ByteType, getMinutes, "getMinutes") | ||
|
|
||
| case class ExtractIntervalSeconds(child: Expression) | ||
| extends ExtractIntervalPart(child, DecimalType(8, 6), getSeconds, "getSeconds") | ||
|
|
||
| case class ExtractIntervalMilliseconds(child: Expression) | ||
| extends ExtractIntervalPart(child, DecimalType(8, 3), getMilliseconds, "getMilliseconds") | ||
|
|
||
| case class ExtractIntervalMicroseconds(child: Expression) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. not related to this PR, but we can apply the same naming policy to the related date/timestamp functions. |
||
| extends ExtractIntervalPart(child, LongType, getMicroseconds, "getMicroseconds") | ||
|
|
||
| // Number of seconds in 10000 years is 315576000001 (30 days per one month) | ||
| // which is 12 digits + 6 digits for the fractional part of seconds. | ||
| case class ExtractIntervalEpoch(child: Expression) | ||
| extends ExtractIntervalPart(child, DecimalType(18, 6), getEpoch, "getEpoch") | ||
|
|
||
| object ExtractIntervalPart { | ||
|
|
||
| def parseExtractField( | ||
| extractField: String, | ||
| source: Expression, | ||
| errorHandleFunc: => Nothing): Expression = extractField.toUpperCase(Locale.ROOT) match { | ||
| case "MILLENNIUM" | "MILLENNIA" | "MIL" | "MILS" => ExtractIntervalMillenniums(source) | ||
| case "CENTURY" | "CENTURIES" | "C" | "CENT" => ExtractIntervalCenturies(source) | ||
| case "DECADE" | "DECADES" | "DEC" | "DECS" => ExtractIntervalDecades(source) | ||
| case "YEAR" | "Y" | "YEARS" | "YR" | "YRS" => ExtractIntervalYears(source) | ||
| case "QUARTER" | "QTR" => ExtractIntervalQuarters(source) | ||
| case "MONTH" | "MON" | "MONS" | "MONTHS" => ExtractIntervalMonths(source) | ||
| case "DAY" | "D" | "DAYS" => ExtractIntervalDays(source) | ||
| case "HOUR" | "H" | "HOURS" | "HR" | "HRS" => ExtractIntervalHours(source) | ||
| case "MINUTE" | "M" | "MIN" | "MINS" | "MINUTES" => ExtractIntervalMinutes(source) | ||
| case "SECOND" | "S" | "SEC" | "SECONDS" | "SECS" => ExtractIntervalSeconds(source) | ||
| case "MILLISECONDS" | "MSEC" | "MSECS" | "MILLISECON" | "MSECONDS" | "MS" => | ||
| ExtractIntervalMilliseconds(source) | ||
| case "MICROSECONDS" | "USEC" | "USECS" | "USECONDS" | "MICROSECON" | "US" => | ||
| ExtractIntervalMicroseconds(source) | ||
| case "EPOCH" => ExtractIntervalEpoch(source) | ||
| case _ => errorHandleFunc | ||
| } | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,91 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one or more | ||
| * contributor license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright ownership. | ||
| * The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| * (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.spark.sql.catalyst.util | ||
|
|
||
| import org.apache.spark.sql.types.Decimal | ||
| import org.apache.spark.unsafe.types.CalendarInterval | ||
|
|
||
| object IntervalUtils { | ||
| val MONTHS_PER_YEAR: Int = 12 | ||
| val MONTHS_PER_QUARTER: Byte = 3 | ||
| val YEARS_PER_MILLENNIUM: Int = 1000 | ||
| val YEARS_PER_CENTURY: Int = 100 | ||
| val YEARS_PER_DECADE: Int = 10 | ||
| val MICROS_PER_HOUR: Long = DateTimeUtils.MILLIS_PER_HOUR * DateTimeUtils.MICROS_PER_MILLIS | ||
| val MICROS_PER_MINUTE: Long = DateTimeUtils.MILLIS_PER_MINUTE * DateTimeUtils.MICROS_PER_MILLIS | ||
| val DAYS_PER_MONTH: Byte = 30 | ||
| val MICROS_PER_MONTH: Long = DAYS_PER_MONTH * DateTimeUtils.SECONDS_PER_DAY | ||
| /* 365.25 days per year assumes leap year every four years */ | ||
dongjoon-hyun marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| val MICROS_PER_YEAR: Long = (36525L * DateTimeUtils.MICROS_PER_DAY) / 100 | ||
|
|
||
| def getYears(interval: CalendarInterval): Int = { | ||
| interval.months / MONTHS_PER_YEAR | ||
| } | ||
|
|
||
| def getMillenniums(interval: CalendarInterval): Int = { | ||
| getYears(interval) / YEARS_PER_MILLENNIUM | ||
| } | ||
|
|
||
| def getCenturies(interval: CalendarInterval): Int = { | ||
| getYears(interval) / YEARS_PER_CENTURY | ||
| } | ||
|
|
||
| def getDecades(interval: CalendarInterval): Int = { | ||
| getYears(interval) / YEARS_PER_DECADE | ||
| } | ||
|
|
||
| def getMonths(interval: CalendarInterval): Byte = { | ||
| (interval.months % MONTHS_PER_YEAR).toByte | ||
| } | ||
|
|
||
| def getQuarters(interval: CalendarInterval): Byte = { | ||
| (getMonths(interval) / MONTHS_PER_QUARTER + 1).toByte | ||
| } | ||
|
|
||
| def getDays(interval: CalendarInterval): Long = { | ||
| interval.microseconds / DateTimeUtils.MICROS_PER_DAY | ||
| } | ||
|
|
||
| def getHours(interval: CalendarInterval): Byte = { | ||
| ((interval.microseconds % DateTimeUtils.MICROS_PER_DAY) / MICROS_PER_HOUR).toByte | ||
| } | ||
|
|
||
| def getMinutes(interval: CalendarInterval): Byte = { | ||
| ((interval.microseconds % MICROS_PER_HOUR) / MICROS_PER_MINUTE).toByte | ||
| } | ||
|
|
||
| def getMicroseconds(interval: CalendarInterval): Long = { | ||
| interval.microseconds % MICROS_PER_MINUTE | ||
| } | ||
|
|
||
| def getSeconds(interval: CalendarInterval): Decimal = { | ||
| Decimal(getMicroseconds(interval), 8, 6) | ||
| } | ||
|
|
||
| def getMilliseconds(interval: CalendarInterval): Decimal = { | ||
| Decimal(getMicroseconds(interval), 8, 3) | ||
| } | ||
|
|
||
| // Returns total number of seconds with microseconds fractional part in the given interval. | ||
| def getEpoch(interval: CalendarInterval): Decimal = { | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
| var result = interval.microseconds | ||
| result += MICROS_PER_YEAR * (interval.months / MONTHS_PER_YEAR) | ||
| result += MICROS_PER_MONTH * (interval.months % MONTHS_PER_YEAR) | ||
| Decimal(result, 18, 6) | ||
| } | ||
| } | ||
Uh oh!
There was an error while loading. Please reload this page.