[SPARK-37277][PYTHON][SQL] Support DayTimeIntervalType in pandas UDF and Arrow optimization #34631
- private final IntervalDayVector accessor;
- private final NullableIntervalDayHolder intervalDayHolder = new NullableIntervalDayHolder();
+ private final DurationVector accessor;
do we need to care about precision here? (microsecond)
Oh, so it's just an int64 physically, and the type annotation (or logical type) decides its semantics.
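To make the point above concrete, here is a minimal standard-library sketch of a duration stored as a plain signed 64-bit count of microseconds, where only the interpretation gives it meaning. The helper names `timedelta_to_micros` and `micros_to_timedelta` are hypothetical and are not part of Spark or Arrow.

```python
import datetime

MICROS_PER_SECOND = 1_000_000
SECONDS_PER_DAY = 86_400
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def timedelta_to_micros(td: datetime.timedelta) -> int:
    """Encode a duration as a whole number of microseconds (hypothetical helper)."""
    micros = (td.days * SECONDS_PER_DAY + td.seconds) * MICROS_PER_SECOND + td.microseconds
    # The physical storage is assumed to be a signed 64-bit integer.
    if not INT64_MIN <= micros <= INT64_MAX:
        raise OverflowError("duration does not fit in a signed 64-bit integer")
    return micros


def micros_to_timedelta(micros: int) -> datetime.timedelta:
    """Interpret a raw integer as a microsecond duration (hypothetical helper)."""
    return datetime.timedelta(microseconds=micros)


raw = timedelta_to_micros(datetime.timedelta(days=1, seconds=2, microseconds=3))
print(raw)  # 86402000003
print(micros_to_timedelta(raw) == datetime.timedelta(days=1, seconds=2, microseconds=3))  # True
```

Because the unit lives in the type rather than in the stored value, precision is fixed by the choice of microseconds and nothing finer than that survives the round trip.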
LGTM for *.scala and *.java
if t is not None:
    # No need to cast for empty series for timedelta.
    should_check_timedelta = is_timedelta64_dtype(t) and len(pdf) == 0
Do you actually mean len(pdf) != 0? Or did I misread the code/comment?
Ohh comments are wrong. Let me rewrite.
This is, BTW, to work around a bug between Arrow and pandas.
For some reason, a pd.Series(pd.Timedelta(...), dtype="object") created from Arrow becomes float64 when you cast it with series.astype("timedelta64[us]") and the data is non-empty. This cannot be reproduced with a plain pandas Series.
So here I avoided it by skipping the cast when the data is non-empty, because the type is already correct in that case. When the data is empty, the type becomes object, and it has to be cast.
Thanks for updating it. Looks good now.
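The rule discussed in this thread can be sketched roughly as follows: cast only when the target is not a timedelta type, or when it is a timedelta type and the data is empty. This is an illustration under assumptions, not the PR's actual code; `cast_if_needed` is a hypothetical helper, and it targets `timedelta64[ns]` rather than the `[us]` unit mentioned above.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_timedelta64_dtype


def cast_if_needed(series: pd.Series, target: np.dtype) -> pd.Series:
    """Hypothetical helper: skip the cast for non-empty timedelta data."""
    if is_timedelta64_dtype(target) and len(series) > 0:
        # Non-empty timedelta data is assumed to already carry the right dtype.
        return series
    # Empty data (dtype object) and all non-timedelta targets are cast as usual.
    return series.astype(target)


target = np.dtype("timedelta64[ns]")
empty = pd.Series([], dtype="object")
filled = pd.Series([pd.Timedelta(days=1)])

print(cast_if_needed(empty, target).dtype)
print(cast_if_needed(filled, target).dtype)
```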
 * A column vector backed by Apache Arrow. Currently calendar interval type and map type are not
 * supported.
Don't want to keep map type ....?
Map type was added from #30393 but the doc fix was missing 😂
Thanks guys. Merged to master.
What changes were proposed in this pull request?
This PR proposes to support DayTimeIntervalType in pandas UDF and Arrow optimization.
- Change the mapping of Arrow's IntervalType to DurationType for DayTimeIntervalType (migration guide updated for Arrow developer APIs).
- Map numpy.timedelta64 <> pyarrow.duration("us") <> DayTimeIntervalType.
Why are the changes needed?
For changing the mapping of Arrow's Interval type to Duration type for DayTimeIntervalType, please refer to #32340 (comment). DayTimeIntervalType is already mapped to the concept of a duration instead of a calendar instance: it is matched to pyarrow.duration("us"), datetime.timedelta, and java.time.Duration.
Spark SQL example
Python example:
pandas / PyArrow example:
Does this PR introduce any user-facing change?
Yes, after this change, users can use DayTimeIntervalType in SparkSession.createDataFrame(pandas_df), DataFrame.to_pandas, and pandas UDFs.
How was this patch tested?
Manually tested, and unit tests were added.