Conversation

Member

@HyukjinKwon HyukjinKwon commented Nov 17, 2021

What changes were proposed in this pull request?

This PR proposes to support DayTimeIntervalType in pandas UDF and Arrow optimization.

  • Change the mapping of Arrow's IntervalType to DurationType for DayTimeIntervalType (migration guide updated for Arrow developer APIs).
  • Add a type mapping for other code paths: numpy.timedelta64 <> pyarrow.duration("us") <> DayTimeIntervalType

Why are the changes needed?

For the rationale behind changing the mapping of Arrow's Interval type to the Duration type for DayTimeIntervalType, please refer to #32340 (comment).

DayTimeIntervalType is already mapped to the concept of a duration instead of a calendar instance: it is matched to pyarrow.duration("us"), datetime.timedelta, and java.time.Duration.

Spark SQL example:

scala> sql("SELECT timestamp '2029-01-01 00:00:00' - timestamp '2019-01-01 00:00:00'").show()
+-------------------------------------------------------------------+
|(TIMESTAMP '2029-01-01 00:00:00' - TIMESTAMP '2019-01-01 00:00:00')|
+-------------------------------------------------------------------+
|                                               INTERVAL '3653 00...|
+-------------------------------------------------------------------+
scala> sql("SELECT timestamp '2029-01-01 00:00:00' - timestamp '2019-01-01 00:00:00'").printSchema()
root
 |-- (TIMESTAMP '2029-01-01 00:00:00' - TIMESTAMP '2019-01-01 00:00:00'): interval day to second (nullable = false)

Python example:

>>> import datetime
>>> datetime.datetime.now() - datetime.datetime.now()
datetime.timedelta(days=-1, seconds=86399, microseconds=999996)

pandas / PyArrow example:

>>> import datetime
>>> import pyarrow as pa
>>> import pandas as pd
>>> pdf = pd.DataFrame({'a': [datetime.datetime.now() - datetime.datetime.now()]})
>>> tbl = pa.Table.from_pandas(pdf)
>>> tbl.schema
a: duration[ns]
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 368

Does this PR introduce any user-facing change?

Yes. After this change, users can use DayTimeIntervalType in SparkSession.createDataFrame(pandas_df), DataFrame.toPandas, and pandas UDFs:

>>> import datetime
>>> import pandas as pd
>>> from pyspark.sql.functions import pandas_udf
>>>
>>> @pandas_udf("interval day to second")
... def noop(s: pd.Series) -> pd.Series:
...     assert s.iloc[0] == datetime.timedelta(microseconds=123)
...     return s
...
>>> df = spark.createDataFrame(pd.DataFrame({"a": [pd.Timedelta(microseconds=123)]}))
>>> df.toPandas()
                       a
0 0 days 00:00:00.000123

How was this patch tested?

Manually tested, and unit tests were added.

@HyukjinKwon
Member Author

HyukjinKwon commented Nov 18, 2021


- private final IntervalDayVector accessor;
- private final NullableIntervalDayHolder intervalDayHolder = new NullableIntervalDayHolder();
+ private final DurationVector accessor;
Contributor

Do we need to care about precision here (microseconds)?

Contributor

Oh, so it's just an int64 physically, and the type annotation (or logical type) decides its semantics.

@Peng-Lei
Contributor

LGTM for *.scala and *.java

@SparkQA

SparkQA commented Nov 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49844/


if t is not None:
    # No need to cast for empty series for timedelta.
    should_check_timedelta = is_timedelta64_dtype(t) and len(pdf) == 0
Member

Do you actually mean len(pdf) != 0? Or did I misread the code/comment?

Member Author

Oh, the comment is wrong. Let me rewrite it.

Member Author

This is, BTW, a workaround for a bug in the Arrow <> pandas interaction.

For some reason, a pd.Series(pd.Timedelta(...), dtype="object") created from Arrow becomes float64 when you cast it with series.astype("timedelta64[us]") and the data is non-empty; this cannot be reproduced with a plain pandas Series.

So here I avoid it by skipping the cast, because the dtype already comes out correct when the series is non-empty. When the data is empty, the dtype becomes object, and it has to be cast explicitly.
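A condensed sketch of that workaround (the helper name and shape are hypothetical, not the actual Spark code): the explicit cast is applied only to an empty series, since a non-empty series created from Arrow already carries the right timedelta64 dtype.

```python
import pandas as pd
from pandas.api.types import is_timedelta64_dtype

def cast_timedelta_if_empty(series: pd.Series, dtype) -> pd.Series:
    # An empty series created from Arrow comes back as dtype object,
    # so it still needs the explicit cast; a non-empty one is already
    # timedelta64, and casting it again can corrupt it to float64.
    if is_timedelta64_dtype(dtype) and len(series) == 0:
        return series.astype(dtype)
    return series
```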

Member

Thanks for updating it. Looks good now.

Comment on lines -33 to -34
* A column vector backed by Apache Arrow. Currently calendar interval type and map type are not
* supported.
Member

Don't we want to keep map type here ...?

Member Author
@HyukjinKwon HyukjinKwon Nov 18, 2021

Map type support was added in #30393, but the doc fix was missed 😂

@SparkQA

SparkQA commented Nov 18, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49844/

@SparkQA

SparkQA commented Nov 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49850/

@SparkQA

SparkQA commented Nov 18, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49850/

@HyukjinKwon
Member Author

Thanks guys.

Merged to master.

@SparkQA

SparkQA commented Nov 18, 2021

Test build #145371 has finished for PR 34631 at commit d90d8bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 18, 2021

Test build #145377 has finished for PR 34631 at commit eb2a55e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
