[SPARK-35139][SQL] Support ANSI intervals as Arrow Column vectors #32340
Changes from all commits: 44d56b1, daaf6ad, 01367f2, d8d2c12, b6f1dc0
```diff
@@ -24,6 +24,7 @@ import org.apache.arrow.vector.complex._

 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions.SpecializedGetters
+import org.apache.spark.sql.catalyst.util.DateTimeConstants.{MICROS_PER_DAY, MICROS_PER_MILLIS}
 import org.apache.spark.sql.errors.QueryExecutionErrors
 import org.apache.spark.sql.types._
 import org.apache.spark.sql.util.ArrowUtils

@@ -74,6 +75,8 @@ object ArrowWriter {
       }
       new StructWriter(vector, children.toArray)
     case (NullType, vector: NullVector) => new NullWriter(vector)
+    case (YearMonthIntervalType, vector: IntervalYearVector) => new IntervalYearWriter(vector)
+    case (DayTimeIntervalType, vector: IntervalDayVector) => new IntervalDayWriter(vector)
     case (dt, _) =>
       throw QueryExecutionErrors.unsupportedDataTypeError(dt)
   }

@@ -394,3 +397,28 @@ private[arrow] class NullWriter(val valueVector: NullVector) extends ArrowFieldWriter {
   override def setValue(input: SpecializedGetters, ordinal: Int): Unit = {
   }
 }
```
```diff
+
+private[arrow] class IntervalYearWriter(val valueVector: IntervalYearVector)
+  extends ArrowFieldWriter {
+  override def setNull(): Unit = {
+    valueVector.setNull(count)
+  }
+
+  override def setValue(input: SpecializedGetters, ordinal: Int): Unit = {
+    valueVector.setSafe(count, input.getInt(ordinal));
+  }
+}
+
+private[arrow] class IntervalDayWriter(val valueVector: IntervalDayVector)
+  extends ArrowFieldWriter {
+  override def setNull(): Unit = {
+    valueVector.setNull(count)
+  }
+
+  override def setValue(input: SpecializedGetters, ordinal: Int): Unit = {
+    val totalMicroseconds = input.getLong(ordinal)
+    val days = totalMicroseconds / MICROS_PER_DAY
+    val millis = (totalMicroseconds % MICROS_PER_DAY) / MICROS_PER_MILLIS
```
Member:
Hm, do we lose the microseconds part? I think this is another reason to use duration.

Contributor (Author):
Yeah, we lose the microseconds part and end up with milliseconds. It's inconsistent with that conversion …
```diff
+    valueVector.set(count, days.toInt, millis.toInt)
+  }
+}
```
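The microsecond loss discussed in the inline comment can be demonstrated with a small standalone sketch (not PR code; the object and method names are mine). It mirrors `IntervalDayWriter`'s arithmetic: splitting total microseconds into `(days, millis)` truncates everything below a millisecond, so the round trip is lossy.

```scala
// Standalone sketch of IntervalDayWriter's arithmetic (not Spark code):
// splitting total microseconds into (days, millis) drops sub-millisecond
// precision, as the reviewer points out.
object IntervalPrecisionDemo {
  val MICROS_PER_DAY: Long = 24L * 60 * 60 * 1000 * 1000
  val MICROS_PER_MILLIS: Long = 1000L

  def toDayMillis(totalMicroseconds: Long): (Int, Int) = {
    val days = totalMicroseconds / MICROS_PER_DAY
    val millis = (totalMicroseconds % MICROS_PER_DAY) / MICROS_PER_MILLIS
    (days.toInt, millis.toInt)
  }

  def fromDayMillis(days: Int, millis: Int): Long =
    days * MICROS_PER_DAY + millis * MICROS_PER_MILLIS

  def main(args: Array[String]): Unit = {
    val micros = MICROS_PER_DAY + 5 * MICROS_PER_MILLIS + 7 // 1 day, 5 ms, 7 µs
    val (d, m) = toDayMillis(micros)
    // The trailing 7 µs cannot be recovered from (days, millis).
    println(fromDayMillis(d, m) == micros) // false
  }
}
```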
Hm, there's something wrong here. We mapped Spark's `DayTimeIntervalType` to Java (Scala)'s `java.time.Duration`, but here we map it to Arrow's `IntervalType`, which represents a calendar instance (see also https://github.com/apache/arrow/blob/master/format/Schema.fbs).

I think we should map it to Arrow's `DurationType` (Python's `datetime.timedelta`). I am working on SPARK-37277 to support this in Arrow conversion in PySpark, but this became a blocker for me. I am preparing a PR to change this, but please let me know if you have different thoughts.
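As a sketch of the alternative being proposed (the helper names are hypothetical, not Spark or Arrow API), a `Duration`-style representation carries the full microsecond payload, so the round trip that fails with the `(days, millis)` split succeeds here:

```scala
import java.time.Duration

// Hypothetical sketch (not the PR's code): representing a day-time interval
// as a java.time.Duration preserves microsecond precision.
object DurationMappingDemo {
  def microsToDuration(micros: Long): Duration =
    Duration.ofSeconds(micros / 1000000L, (micros % 1000000L) * 1000L)

  def durationToMicros(d: Duration): Long =
    d.getSeconds * 1000000L + d.getNano / 1000L

  def main(args: Array[String]): Unit = {
    val micros = 86400000000L + 5007L // 1 day, 5 ms, 7 µs
    // Lossless round trip, unlike the (days, millis) split.
    println(durationToMicros(microsToDuration(micros)) == micros) // true
  }
}
```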
good catch!
I'm not quite sure why `DayTimeIntervalType` maps to Arrow's `IntervalType` here; I just followed ArrowUtils.scala#L60. I tried to learn about the Arrow types; it's SQL style. And in Hive, `INTERVAL_DAY_TIME` maps to Arrow's `IntervalType` with the `IntervalUnit.DAY_TIME` unit. If we map `DayTimeIntervalType` to Arrow's `DurationType`, then which type should `YearMonthIntervalType` match?
At the very least, `Duration` cannot be mapped to `YearMonthIntervalType`. For `DayTimeIntervalType`, mapping to `IntervalType` makes sense Arrow-wise, but it makes less sense in Spark SQL because we're already mapping it to `Duration`.

I am not saying either way is 100% correct, but if I have to pick one of the two, I would pick the one that is coherent from Spark's perspective.
And `YearMonthIntervalType` is mapped to `java.time.Period`, which is a calendar instance. So `YearMonthIntervalType` seems fine.
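The calendar-vs-exact distinction behind this thread can be illustrated with a standalone `java.time` sketch (not PR code): a `Period` such as "1 month" has no fixed length in microseconds, while a `Duration` is always an exact amount of time.

```scala
import java.time.{Duration, LocalDate, Period}
import java.time.temporal.ChronoUnit

// Standalone illustration: Period (calendar amount) vs Duration (exact time).
object CalendarVsExactDemo {
  def main(args: Array[String]): Unit = {
    val oneMonth = Period.ofMonths(1)
    // The same "1 month" covers a different number of days depending on
    // where it is applied, so it cannot be a fixed microsecond count.
    val febStart = LocalDate.of(2021, 2, 1)
    val marStart = LocalDate.of(2021, 3, 1)
    println(ChronoUnit.DAYS.between(febStart, febStart.plus(oneMonth))) // 28
    println(ChronoUnit.DAYS.between(marStart, marStart.plus(oneMonth))) // 31

    // A Duration is always the same exact amount of time.
    println(Duration.ofDays(1).toMillis) // 86400000
  }
}
```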