[SPARK-35139][SQL] Support ANSI intervals as Arrow Column vectors #32340
Conversation
ok to test
cc @MaxGekk @cloud-fan @BryanCutler FYI
Kubernetes integration test starting
Kubernetes integration test status failure
| @Override |
| int getInt(int rowId) { |
| int months = accessor.get(rowId); |
nit: return accessor.get(rowId);
resolved
| long getLong(int rowId) { |
| accessor.get(rowId, intervalDayHolder); |
| final long microseconds = intervalDayHolder.days * MICROS_PER_DAY |
| + (long) intervalDayHolder.milliseconds * MICROS_PER_MILLIS; |
should we handle overflow?
return Math.addExact(
    intervalDayHolder.days * MICROS_PER_DAY,
    intervalDayHolder.milliseconds * MICROS_PER_MILLIS);
Resolved, thanks very much.
| override def setValue(input: SpecializedGetters, ordinal: Int): Unit = { |
| val totalMicroseconds = input.getLong(ordinal) |
| val days = totalMicroseconds / MICROS_PER_DAY |
| val millis = (totalMicroseconds - days * MICROS_PER_DAY) / MICROS_PER_MILLIS |
nit: `(totalMicroseconds % MICROS_PER_DAY) / MICROS_PER_MILLIS`
Resolved, thanks very much.
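The nit rests on a standard identity: `totalMicroseconds - days * MICROS_PER_DAY` equals `totalMicroseconds % MICROS_PER_DAY` when `days = totalMicroseconds / MICROS_PER_DAY`, because Java division truncates toward zero. A minimal sketch of the decomposition (the constant values are assumed to match Spark's `DateTimeConstants`; `split` is an illustrative helper, not the PR's actual method):

```java
public class MicrosSplit {
    static final long MICROS_PER_MILLIS = 1_000L;
    static final long MICROS_PER_DAY = 86_400_000_000L;

    // Decompose total microseconds into whole days and a millisecond remainder,
    // mirroring the setValue snippet quoted above.
    static long[] split(long totalMicroseconds) {
        long days = totalMicroseconds / MICROS_PER_DAY;
        // Same value as (totalMicroseconds - days * MICROS_PER_DAY) / MICROS_PER_MILLIS
        long millis = (totalMicroseconds % MICROS_PER_DAY) / MICROS_PER_MILLIS;
        return new long[] {days, millis};
    }

    public static void main(String[] args) {
        long total = 2 * MICROS_PER_DAY + 1_500L; // 2 days + 1.5 ms
        long[] parts = split(total);
        System.out.println(parts[0] + " days, " + parts[1] + " ms"); // 2 days, 1 ms
    }
}
```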
Test build #137928 has finished for PR 32340 at commit
Both multiplications can overflow too. Could you use Math.multiplyExact, please?
OK, thanks @MaxGekk
@MaxGekk done
Kubernetes integration test starting
Kubernetes integration test status failure
@MaxGekk intervalDayHolder.days and intervalDayHolder.milliseconds are int, can they really overflow?
I think so:
scala> Long.MaxValue / MICROS_PER_DAY
res0: Long = 106751991
scala> (106751991 + 1) * MICROS_PER_DAY
res1: Long = -9223371964909551616
but Int.MaxValue * MICROS_PER_MILLIS won't overflow, right?
Right
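To make the two halves of this exchange concrete: `days * MICROS_PER_DAY` can exceed `Long.MaxValue` (any day count above 106751991 wraps, as the REPL session above shows), while `milliseconds * MICROS_PER_MILLIS` is at most roughly 2^31 * 10^3, far below 2^63. A sketch of the guarded conversion (constant values assumed from Spark's `DateTimeConstants`; `toMicros` is an illustrative helper, not the PR's actual method):

```java
public class OverflowCheck {
    static final long MICROS_PER_MILLIS = 1_000L;
    static final long MICROS_PER_DAY = 86_400_000_000L;

    // Guard only the days term with multiplyExact; the millis term cannot
    // overflow a long because it is an int scaled by 1000.
    static long toMicros(int days, int milliseconds) {
        return Math.addExact(
            Math.multiplyExact((long) days, MICROS_PER_DAY),
            milliseconds * MICROS_PER_MILLIS);
    }

    public static void main(String[] args) {
        System.out.println(toMicros(106_751_991, 0)); // largest day count that fits
        try {
            toMicros(106_751_992, 0); // one more day would wrap around
        } catch (ArithmeticException e) {
            System.out.println("overflow detected: " + e.getMessage());
        }
    }
}
```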
I used "intervalDayHolder.days * MICROS_PER_DAY" instead of Math.multiplyExact
@Peng-Lei Wenchen asked about the milliseconds part; why did you change the days multiplication? I showed above that it can overflow Long.
Sorry, my mistake. I get it. I'll fix it right now.
Maybe use the following code, for consistency with the cases above:
| case YearMonthIntervalType => Types.MinorType.INTERVALYEAR.getType |
| case DayTimeIntervalType => Types.MinorType.INTERVALDAY.getType |
| case YearMonthIntervalType => new ArrowType.Interval(IntervalUnit.YEAR_MONTH) |
| case DayTimeIntervalType => new ArrowType.Interval(IntervalUnit.DAY_TIME) |
done
MaxGekk
left a comment
@Peng-Lei Could you add checks to ArrowUtilsSuite:
spark/sql/catalyst/src/test/scala/org/apache/spark/sql/util/ArrowUtilsSuite.scala
Line 50 in 0494dc9
| roundtrip(DateType) |
Kubernetes integration test starting
Kubernetes integration test status failure
Remove the blank line.
done
Use Int.MaxValue
| check(YearMonthIntervalType, |
| Seq(null, 0, 1, -1, scala.Int.MaxValue, scala.Int.MinValue)) |
| check(YearMonthIntervalType, Seq(null, 0, 1, -1, Int.MaxValue, Int.MinValue)) |
done
nit: scala.Long.MinValue -> Long.MinValue
done
Kubernetes integration test status failure

Test build #137940 has finished for PR 32340 at commit
days has the Int type and MICROS_PER_DAY is Long, but intervalDayHolder.days * MICROS_PER_DAY can overflow Long in the general case. @Peng-Lei If you believe the overflow never happens, could you prove that and/or add an assert?
Yeah, I tested it. It did overflow. Thank you very much.
Such a negative test could be useful; can you add it to the PR? Then we could catch the behavior change if someone modifies your code in the future.
OK, I'll try to add it
done
@Peng-Lei I wonder why you make changes in your fork's master and merge them to
Sorry, I didn't clarify well
| return Math.addExact(Math.multiplyExact(intervalDayHolder.days, MICROS_PER_DAY), |
| Math.multiplyExact(intervalDayHolder.milliseconds, MICROS_PER_MILLIS)); |
| return Math.addExact( |
| Math.multiplyExact(intervalDayHolder.days, MICROS_PER_DAY), |
| intervalDayHolder.milliseconds * MICROS_PER_MILLIS); |
Test build #137946 has finished for PR 32340 at commit

@MaxGekk That's what I did.
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
@Peng-Lei You can skip a few steps I think: You can merge/rebase on the master only if you see conflicts in the PR. |
|
Test build #137957 has finished for PR 32340 at commit
|
|
Test build #137960 has finished for PR 32340 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
thanks, merging to master! |
MaxGekk
left a comment
LGTM
Looks like we can remove the note description on the ArrowColumnVector [1]:
Test build #137982 has finished for PR 32340 at commit
| } else if (vector instanceof IntervalYearVector) { |
| accessor = new IntervalYearAccessor((IntervalYearVector) vector); |
| } else if (vector instanceof IntervalDayVector) { |
| accessor = new IntervalDayAccessor((IntervalDayVector) vector); |
Hm, there's something wrong here. We mapped Spark's DayTimeIntervalType to java.time.Duration on the Java/Scala side, but here we map it to Arrow's IntervalType, which represents a calendar instance (see also https://github.com/apache/arrow/blob/master/format/Schema.fbs).
I think we should map it to Arrow's DurationType (Python's datetime.timedelta). I am working on SPARK-37277 to support this in Arrow conversion in PySpark, but this became a blocker for me. I am preparing a PR to change this, but please let me know if you have different thoughts.
good catch!
I'm not quite sure why DayTimeIntervalType maps to Arrow's IntervalType here; I just followed ArrowUtils.scala#L60. I tried to learn about the Arrow types; IntervalType looks SQL style. And in Hive, INTERVAL_DAY_TIME maps to Arrow's IntervalType with the IntervalUnit.DAY_TIME unit. If we map DayTimeIntervalType to Arrow's DurationType, then which type should YearMonthIntervalType map to?
At the very least, Duration cannot be mapped to YearMonthIntervalType. For DayTimeIntervalType, Arrow-wise, mapping to IntervalType makes sense, but it makes less sense in Spark SQL because we're already mapping to Duration.
I am not saying either way is 100% correct, but I would pick the one that makes it coherent from Spark's perspective if I have to pick one of the two.
And YearMonthIntervalType is mapped to java.time.Period, which is a calendar instance:
A date-based amount of time in the ISO-8601 calendar system, such as '2 years, 3 months and 4 days'.
So YearMonthIntervalType seems fine.
| override def setValue(input: SpecializedGetters, ordinal: Int): Unit = { |
| val totalMicroseconds = input.getLong(ordinal) |
| val days = totalMicroseconds / MICROS_PER_DAY |
| val millis = (totalMicroseconds % MICROS_PER_DAY) / MICROS_PER_MILLIS |
Hm, do we lose the microseconds part? I think this is another reason to use Duration.
Yeah, we lose the microseconds part and end up with millisecond precision. It's inconsistent with the conversion from java.time.Duration to DayTimeIntervalType, which drops any excess precision finer than microseconds.
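The loss is easy to demonstrate by round-tripping through the (days, milliseconds) representation that Arrow's IntervalDayVector stores: any sub-millisecond part disappears. A small sketch (constant values assumed from Spark's `DateTimeConstants`; `roundTrip` is an illustrative helper, not code from the PR):

```java
public class PrecisionLoss {
    static final long MICROS_PER_MILLIS = 1_000L;
    static final long MICROS_PER_DAY = 86_400_000_000L;

    // Convert total microseconds to (days, millis) and back, as the writer
    // and accessor quoted in this thread effectively do together.
    static long roundTrip(long totalMicroseconds) {
        long days = totalMicroseconds / MICROS_PER_DAY;
        long millis = (totalMicroseconds % MICROS_PER_DAY) / MICROS_PER_MILLIS;
        return days * MICROS_PER_DAY + millis * MICROS_PER_MILLIS;
    }

    public static void main(String[] args) {
        long total = 123_456L; // 123 ms + 456 µs
        System.out.println(roundTrip(total)); // 123000: the 456 µs are gone
    }
}
```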
…and Arrow optimization
### What changes were proposed in this pull request?
This PR proposes to support `DayTimeIntervalType` in pandas UDF and Arrow optimization.
- Change the mapping of Arrow's `IntervalType` to `DurationType` for `DayTimeIntervalType` (migration guide updated for Arrow developer APIs).
- Add a type mapping for other code paths: `numpy.timedelta64` <> `pyarrow.duration("us")` <> `DayTimeIntervalType`
### Why are the changes needed?
For changing the mapping of Arrow's `Interval` type to `Duration` type for `DayTimeIntervalType`, please refer to #32340 (comment).
`DayTimeIntervalType` is already mapped to the concept of duration instead of a calendar instance: it is matched to `pyarrow.duration("us")`, `datetime.timedelta`, and `java.time.Duration`.
**Spark SQL example**
```scala
scala> sql("SELECT timestamp '2029-01-01 00:00:00' - timestamp '2019-01-01 00:00:00'").show()
```
```
+-------------------------------------------------------------------+
|(TIMESTAMP '2029-01-01 00:00:00' - TIMESTAMP '2019-01-01 00:00:00')|
+-------------------------------------------------------------------+
| INTERVAL '3653 00...|
+-------------------------------------------------------------------+
```
```scala
scala> sql("SELECT timestamp '2029-01-01 00:00:00' - timestamp '2019-01-01 00:00:00'").printSchema()
```
```
root
|-- (TIMESTAMP '2029-01-01 00:00:00' - TIMESTAMP '2019-01-01 00:00:00'): interval day to second (nullable = false)
```
**Python example:**
```python
>>> import datetime
>>> datetime.datetime.now() - datetime.datetime.now()
datetime.timedelta(days=-1, seconds=86399, microseconds=999996)
```
**pandas / PyArrow example:**
```python
>>> import pyarrow as pa
>>> import pandas as pd
>>> pdf = pd.DataFrame({'a': [datetime.datetime.now() - datetime.datetime.now()]})
>>> tbl = pa.Table.from_pandas(pdf)
>>> tbl.schema
a: duration[ns]
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 368
```
### Does this PR introduce _any_ user-facing change?
Yes, after this change, users can use `DayTimeIntervalType` in `SparkSession.createDataFrame(pandas_df)`, `DataFrame.to_pandas`, and pandas UDFs:
```python
>>> import datetime
>>> import pandas as pd
>>> from pyspark.sql.functions import pandas_udf
>>>
>>> @pandas_udf("interval day to second")
... def noop(s: pd.Series) -> pd.Series:
... assert s.iloc[0] == datetime.timedelta(microseconds=123)
... return s
...
>>> df = spark.createDataFrame(pd.DataFrame({"a": [pd.Timedelta(microseconds=123)]}))
>>> df.toPandas()
a
0 0 days 00:00:00.000123
```
### How was this patch tested?
Manually tested, and unittests were added.
Closes #34631 from HyukjinKwon/SPARK-37277-1.
Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?
Extend ArrowColumnVector to support YearMonthIntervalType and DayTimeIntervalType.
Why are the changes needed?
https://issues.apache.org/jira/browse/SPARK-35139
Does this PR introduce any user-facing change?
No
How was this patch tested?
$ ./dev/scalastyle
$ ./dev/lint-java