-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-37277][PYTHON][SQL] Support DayTimeIntervalType in pandas UDF and Arrow optimization #34631
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -19,19 +19,14 @@ | |
|
|
||
| import org.apache.arrow.vector.*; | ||
| import org.apache.arrow.vector.complex.*; | ||
| import org.apache.arrow.vector.holders.NullableIntervalDayHolder; | ||
| import org.apache.arrow.vector.holders.NullableVarCharHolder; | ||
|
|
||
| import org.apache.spark.sql.util.ArrowUtils; | ||
| import org.apache.spark.sql.types.*; | ||
| import org.apache.spark.unsafe.types.UTF8String; | ||
|
|
||
| import static org.apache.spark.sql.catalyst.util.DateTimeConstants.MICROS_PER_DAY; | ||
| import static org.apache.spark.sql.catalyst.util.DateTimeConstants.MICROS_PER_MILLIS; | ||
|
|
||
| /** | ||
| * A column vector backed by Apache Arrow. Currently calendar interval type and map type are not | ||
| * supported. | ||
|
Comment on lines
-33
to
-34
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Don't want to keep
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Map type was added from #30393 but the doc fix was missing 😂 |
||
| * A column vector backed by Apache Arrow. | ||
| */ | ||
| public final class ArrowColumnVector extends ColumnVector { | ||
|
|
||
|
|
@@ -180,8 +175,8 @@ public ArrowColumnVector(ValueVector vector) { | |
| accessor = new NullAccessor((NullVector) vector); | ||
| } else if (vector instanceof IntervalYearVector) { | ||
| accessor = new IntervalYearAccessor((IntervalYearVector) vector); | ||
| } else if (vector instanceof IntervalDayVector) { | ||
| accessor = new IntervalDayAccessor((IntervalDayVector) vector); | ||
| } else if (vector instanceof DurationVector) { | ||
| accessor = new DurationAccessor((DurationVector) vector); | ||
| } else { | ||
| throw new UnsupportedOperationException(); | ||
| } | ||
|
|
@@ -549,21 +544,18 @@ int getInt(int rowId) { | |
| } | ||
| } | ||
|
|
||
| private static class IntervalDayAccessor extends ArrowVectorAccessor { | ||
| private static class DurationAccessor extends ArrowVectorAccessor { | ||
|
|
||
| private final IntervalDayVector accessor; | ||
| private final NullableIntervalDayHolder intervalDayHolder = new NullableIntervalDayHolder(); | ||
| private final DurationVector accessor; | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do we need to care about precision here? (microsecond)
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. oh, so it's just an int64 physically, and the type annotation (or logical type) decides its semantic |
||
|
|
||
| IntervalDayAccessor(IntervalDayVector vector) { | ||
| DurationAccessor(DurationVector vector) { | ||
| super(vector); | ||
| this.accessor = vector; | ||
| } | ||
|
|
||
| @Override | ||
| long getLong(int rowId) { | ||
| accessor.get(rowId, intervalDayHolder); | ||
| return Math.addExact(Math.multiplyExact(intervalDayHolder.days, MICROS_PER_DAY), | ||
| intervalDayHolder.milliseconds * MICROS_PER_MILLIS); | ||
| final long getLong(int rowId) { | ||
| return DurationVector.get(accessor.getDataBuffer(), rowId); | ||
| } | ||
| } | ||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you actually meaning
len(pdf) != 0? Or I miss-read the code/comment?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh comments are wrong. Let me rewrite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is, BTW, to work around a bug from Arrow <> pandas.
For some reasons,
pd.Series(pd.Timedelta(...), dtype="object")created from Arrow becomesfloat64when you cast withseries.astype("timedelta64[us]")when the data is non-empty - this cannot be reproduced with plain pandasSeries.So, here I avoided it by just skipping the casting because the type becomes correct when it is not empty. When data is empty, the type becomes
object, and it has to be casted.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating it. Looks good now.