
Conversation

@Peng-Lei
Contributor

What changes were proposed in this pull request?

Extend ArrowColumnVector to support YearMonthIntervalType and DayTimeIntervalType.

Why are the changes needed?

https://issues.apache.org/jira/browse/SPARK-35139

Does this PR introduce any user-facing change?

No

How was this patch tested?

  1. Checked coding style via:
    $ ./dev/scalastyle
    $ ./dev/lint-java
  2. Ran the "ArrowWriterSuite" tests

@github-actions github-actions bot added the SQL label Apr 26, 2021
@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

cc @MaxGekk @cloud-fan @BryanCutler FYI

@SparkQA

SparkQA commented Apr 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42448/

@SparkQA

SparkQA commented Apr 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42448/


@Override
int getInt(int rowId) {
int months = accessor.get(rowId);
Contributor

nit: return accessor.get(rowId);

Contributor Author

resolved

long getLong(int rowId) {
accessor.get(rowId, intervalDayHolder);
final long microseconds = intervalDayHolder.days * MICROS_PER_DAY
+ (long)intervalDayHolder.milliseconds * MICROS_PER_MILLIS;
Contributor

should we handle overflow?

Contributor

return Math.addExact(
  intervalDayHolder.days * MICROS_PER_DAY,
  intervalDayHolder.milliseconds * MICROS_PER_MILLIS);

Contributor Author

@Peng-Lei Peng-Lei Apr 26, 2021

Resolved, thanks very much.

override def setValue(input: SpecializedGetters, ordinal: Int): Unit = {
val totalMicroseconds = input.getLong(ordinal)
val days = totalMicroseconds / MICROS_PER_DAY
val millis = (totalMicroseconds - days * MICROS_PER_DAY) / MICROS_PER_MILLIS
Contributor

nit: (totalMicroseconds % MICROS_PER_DAY) / MICROS_PER_MILLIS
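
For illustration (not part of the PR): the two forms agree because Java/Scala integer division truncates, so x - (x / d) * d == x % d. A quick Scala check, with the constants assumed to match Spark's DateTimeConstants:

```scala
// Illustrative check of the equivalence behind the nit above.
val MICROS_PER_DAY = 86400000000L
val MICROS_PER_MILLIS = 1000L
val totalMicroseconds = 123456789012345L
val days = totalMicroseconds / MICROS_PER_DAY
assert((totalMicroseconds - days * MICROS_PER_DAY) / MICROS_PER_MILLIS ==
  (totalMicroseconds % MICROS_PER_DAY) / MICROS_PER_MILLIS)
```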

Contributor Author

@Peng-Lei Peng-Lei Apr 26, 2021

Resolved, thanks very much.

@SparkQA

SparkQA commented Apr 26, 2021

Test build #137928 has finished for PR 32340 at commit 01367f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Peng-Lei Peng-Lei requested a review from cloud-fan April 26, 2021 06:04
@Peng-Lei Peng-Lei requested a review from cloud-fan April 26, 2021 07:34
Member

Both multiplications can overflow too. Could you use Math.multiplyExact, please?

Contributor Author

OK, thanks @MaxGekk

Contributor Author

@MaxGekk done

@MaxGekk MaxGekk changed the title [SPARK-35139][SQL]Support ANSI intervals as Arrow Column vectors [SPARK-35139][SQL] Support ANSI intervals as Arrow Column vectors Apr 26, 2021
@Peng-Lei Peng-Lei requested a review from MaxGekk April 26, 2021 08:15
@SparkQA

SparkQA commented Apr 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42462/

@SparkQA

SparkQA commented Apr 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42462/

Contributor

@MaxGekk intervalDayHolder.days and intervalDayHolder.milliseconds are ints; can they really overflow?

Member

I think so:

scala> Long.MaxValue / MICROS_PER_DAY
res0: Long = 106751991

scala> (106751991 + 1) * MICROS_PER_DAY
res1: Long = -9223371964909551616

Contributor

but Int.Max * MICROS_PER_MILLIS won't overflow, right?

Member

Right
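
To spell out why (illustrative only; MICROS_PER_MILLIS is assumed to be 1000L, as in Spark's DateTimeConstants): Int.MaxValue milliseconds is only about 2.1e12 microseconds, far below Long.MaxValue (about 9.2e18), so the milliseconds multiplication alone cannot overflow a Long:

```scala
// Worst case for the milliseconds side of the conversion.
val MICROS_PER_MILLIS = 1000L
assert(Int.MaxValue * MICROS_PER_MILLIS == 2147483647000L) // ~2.1e12 << Long.MaxValue
```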

Contributor Author

I use "intervalDayHolder.days * MICROS_PER_DAY" instead of Math.multiplyExact

Member

@Peng-Lei Wenchen asked about the milliseconds part; why did you change the days multiplication? I showed above that it can overflow Long.

Contributor Author

Sorry, my mistake. I get it. I'll fix it right now.

Comment on lines 57 to 58
Member

Maybe use the following code, for consistency with the cases above:

Suggested change:

```diff
- case YearMonthIntervalType => Types.MinorType.INTERVALYEAR.getType
- case DayTimeIntervalType => Types.MinorType.INTERVALDAY.getType
+ case YearMonthIntervalType => new ArrowType.Interval(IntervalUnit.YEAR_MONTH)
+ case DayTimeIntervalType => new ArrowType.Interval(IntervalUnit.DAY_TIME)
```
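
For context, the suggestion is stylistic: if Arrow's value equality holds as expected, MinorType.INTERVALYEAR.getType denotes the same logical type as constructing it directly. A minimal sketch (assuming the Arrow Java library on the classpath):

```scala
import org.apache.arrow.vector.types.{IntervalUnit, Types}
import org.apache.arrow.vector.types.pojo.ArrowType

// ArrowType subclasses have value equality, so both spellings should compare equal.
assert(Types.MinorType.INTERVALYEAR.getType ==
  new ArrowType.Interval(IntervalUnit.YEAR_MONTH))
assert(Types.MinorType.INTERVALDAY.getType ==
  new ArrowType.Interval(IntervalUnit.DAY_TIME))
```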

Contributor Author

done

Member

@MaxGekk MaxGekk left a comment

@Peng-Lei Could you add checks to ArrowUtilsSuite:
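
(The suggested snippet was not captured above. Purely as an illustration, such checks might look roughly like the following, assuming ArrowUtilsSuite's existing roundtrip helper that converts a Catalyst type to an Arrow field and back; the test name is hypothetical.)

```scala
// Hypothetical sketch, not the reviewer's exact suggestion.
test("ANSI interval types roundtrip") {
  roundtrip(YearMonthIntervalType)
  roundtrip(DayTimeIntervalType)
}
```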

@SparkQA

SparkQA commented Apr 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42468/

@SparkQA

SparkQA commented Apr 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42468/

Member

Remove the blank line.

Contributor Author

done

Comment on lines 78 to 79
Member

@MaxGekk MaxGekk Apr 26, 2021

Use Int.MaxValue

Suggested change:

```diff
- check(YearMonthIntervalType,
-   Seq(null, 0, 1, -1, scala.Int.MaxValue, scala.Int.MinValue))
+ check(YearMonthIntervalType, Seq(null, 0, 1, -1, Int.MaxValue, Int.MinValue))
```

Contributor Author

done

Member

nit: scala.Long.MinValue -> Long.MinValue

Contributor Author

done

@SparkQA

SparkQA commented Apr 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42477/

@SparkQA

SparkQA commented Apr 26, 2021

Test build #137940 has finished for PR 32340 at commit a70e730.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

days has the Int type and MICROS_PER_DAY is Long, but intervalDayHolder.days * MICROS_PER_DAY can overflow Long in the general case. @Peng-Lei If you believe the overflow never happens, could you prove that and/or add an assert?

Contributor Author

Yeah, I tested it. It did overflow. Thank you very much.

Member

@MaxGekk MaxGekk Apr 26, 2021

Such a negative test could be useful; can you add it to the PR? That way we could catch the behavior change if someone modifies your code in the future.

Contributor Author

OK, I'll try to add it

Contributor Author

done
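
As an aside, the overflow this thread converged on is easy to demonstrate standalone. A minimal sketch, with MICROS_PER_DAY assumed to match Spark's DateTimeConstants:

```scala
// Math.multiplyExact raises where plain Long multiplication silently wraps
// (compare the REPL example earlier in the thread).
val MICROS_PER_DAY = 86400000000L
val days = 106751992L // one more than Long.MaxValue / MICROS_PER_DAY
days * MICROS_PER_DAY // wraps to -9223371964909551616
try Math.multiplyExact(days, MICROS_PER_DAY)
catch { case e: ArithmeticException => println(s"expected: ${e.getMessage}") } // "long overflow"
```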

@MaxGekk
Member

MaxGekk commented Apr 26, 2021

@Peng-Lei I wonder why you make changes in your fork's master and merge them into SPARK-35139, instead of making the changes directly in Peng-Lei:SPARK-35139?

Comment on lines 548 to 549
Member

Sorry, I didn't clarify well

Suggested change:

```diff
- return Math.addExact(Math.multiplyExact(intervalDayHolder.days, MICROS_PER_DAY),
-   Math.multiplyExact(intervalDayHolder.milliseconds, MICROS_PER_MILLIS));
+ return Math.addExact(
+   Math.multiplyExact(intervalDayHolder.days, MICROS_PER_DAY),
+   intervalDayHolder.milliseconds * MICROS_PER_MILLIS);
```

@SparkQA

SparkQA commented Apr 26, 2021

Test build #137946 has finished for PR 32340 at commit 1de4cbe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Peng-Lei
Contributor Author

Peng-Lei commented Apr 26, 2021

@Peng-Lei I wonder why you make changes in your fork's master and merge them into SPARK-35139, instead of making the changes directly in Peng-Lei:SPARK-35139?

@MaxGekk That's what I did:
1. git branch SPARK-XXX
2. develop...
3. git add and commit
4. git checkout master
5. git pull upstream master
6. git checkout SPARK-XXX
7. git pull origin master
8. git push origin SPARK-XXX

@SparkQA

SparkQA commented Apr 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42480/

@SparkQA

SparkQA commented Apr 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42480/

@MaxGekk
Member

MaxGekk commented Apr 26, 2021

@Peng-Lei You can skip a few steps, I think:
1. git branch SPARK-XXX
2. develop...
3. git add and commit
...
8. git push origin SPARK-XXX

You can merge/rebase on the master only if you see conflicts in the PR.

@SparkQA

SparkQA commented Apr 26, 2021

Test build #137957 has finished for PR 32340 at commit a1212d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 26, 2021

Test build #137960 has finished for PR 32340 at commit ff16a56.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 27, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42502/

@SparkQA

SparkQA commented Apr 27, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42502/

@Peng-Lei Peng-Lei requested a review from MaxGekk April 27, 2021 06:05
@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in eb08b90 Apr 27, 2021
Member

@MaxGekk MaxGekk left a comment

LGTM

@Yikun
Member

Yikun commented Apr 27, 2021

Looks like we can remove the note on ArrowColumnVector [1]: "Currently calendar interval type and map type are not supported." We can do that in a follow-up patch.

[1] https://github.com/apache/spark/pull/32340/files#diff-94174f963b367bc222d41c4ef9ed34563dafbd6243f5e3c273b04bc8e4979e57L29

@SparkQA

SparkQA commented Apr 27, 2021

Test build #137982 has finished for PR 32340 at commit b6f1dc0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

} else if (vector instanceof IntervalYearVector) {
accessor = new IntervalYearAccessor((IntervalYearVector) vector);
} else if (vector instanceof IntervalDayVector) {
accessor = new IntervalDayAccessor((IntervalDayVector) vector);
Member

@HyukjinKwon HyukjinKwon Nov 17, 2021

Hm, there's something wrong here. We mapped Spark's DayTimeIntervalType to java.time.Duration on the JVM side, but here we map it to Arrow's IntervalType, which represents a calendar instance (see also https://github.com/apache/arrow/blob/master/format/Schema.fbs).

I think we should map it to Arrow's DurationType (Python's datetime.timedelta). I am working on SPARK-37277 to support this in the Arrow conversion in PySpark, and this became a blocker for me. I am preparing a PR to change this, but please let me know if you have different thoughts.

Contributor

good catch!

Contributor Author

I'm not quite sure why DayTimeIntervalType maps to Arrow's IntervalType here; I just followed ArrowUtils.scala#L60. I tried to learn about the Arrow types; they are SQL-style. And in Hive, INTERVAL_DAY_TIME maps to Arrow's IntervalType with the IntervalUnit.DAY_TIME unit. If we map DayTimeIntervalType to Arrow's DurationType, then which type should YearMonthIntervalType map to?

Member

@HyukjinKwon HyukjinKwon Nov 18, 2021

At the very least, Duration cannot be mapped to YearMonthIntervalType. For DayTimeIntervalType, mapping to IntervalType makes sense Arrow-wise, but it makes less sense in Spark SQL because we're already mapping it to Duration.

I am not saying either way is 100% correct, but if I have to pick one of the two, I would pick the one that is coherent from Spark's perspective.

Member

And YearMonthIntervalType is mapped to java.time.Period, which is a calendar instance:

A date-based amount of time in the ISO-8601 calendar system, such as '2 years, 3 months and 4 days'.

So YearMonthIntervalType seems fine.
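
For illustration, the distinction the thread is drawing, using only java.time and nothing Spark-specific:

```scala
import java.time.{Duration, Period}

// Period is calendar-based: its length in days depends on which months it spans,
// matching Arrow's Interval(YEAR_MONTH). Duration is an exact amount of time,
// matching Arrow's Duration type.
val p: Period = Period.of(2, 3, 4)   // 2 years, 3 months, 4 days
val d: Duration = Duration.ofDays(3653).plusMillis(123)
```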

override def setValue(input: SpecializedGetters, ordinal: Int): Unit = {
val totalMicroseconds = input.getLong(ordinal)
val days = totalMicroseconds / MICROS_PER_DAY
val millis = (totalMicroseconds % MICROS_PER_DAY) / MICROS_PER_MILLIS
Member

Hm, do we lose the microseconds part? I think this is another reason to use duration.

Contributor Author

Yeah, we lose the microseconds part and end up with millisecond precision. It's inconsistent with the conversion from java.time.Duration to DayTimeIntervalType, which only drops excess precision finer than microseconds.
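
A minimal sketch of the loss being discussed, with the constants assumed to match Spark's DateTimeConstants:

```scala
// Illustrative only: IntervalDayVector stores (days, milliseconds), so the
// write path drops anything finer than a millisecond.
val MICROS_PER_DAY = 86400000000L
val MICROS_PER_MILLIS = 1000L
val totalMicroseconds = 1234567L // 1.234567 seconds
val days = totalMicroseconds / MICROS_PER_DAY
val millis = (totalMicroseconds % MICROS_PER_DAY) / MICROS_PER_MILLIS
val readBack = days * MICROS_PER_DAY + millis * MICROS_PER_MILLIS
assert(readBack == 1234000L) // the trailing 567 microseconds are gone
```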

HyukjinKwon added a commit that referenced this pull request Nov 18, 2021
…and Arrow optimization

### What changes were proposed in this pull request?

This PR proposes to support `DayTimeIntervalType` in pandas UDF and Arrow optimization.

- Change the Arrow mapping for `DayTimeIntervalType` from `IntervalType` to `DurationType` (migration guide updated for Arrow developer APIs).
- Add a type mapping for other code paths: `numpy.timedelta64` <> `pyarrow.duration("us")` <> `DayTimeIntervalType`

### Why are the changes needed?

For changing the mapping of Arrow's  `Interval` type to `Duration` type for `DayTimeIntervalType`, please refer to #32340 (comment).

`DayTimeIntervalType` is already mapped to the concept of a duration instead of a calendar instance: it is matched to `pyarrow.duration("us")`, `datetime.timedelta`, and `java.time.Duration`.

**Spark SQL example**

```scala
scala> sql("SELECT timestamp '2029-01-01 00:00:00' - timestamp '2019-01-01 00:00:00'").show()
```
```
+-------------------------------------------------------------------+
|(TIMESTAMP '2029-01-01 00:00:00' - TIMESTAMP '2019-01-01 00:00:00')|
+-------------------------------------------------------------------+
|                                               INTERVAL '3653 00...|
+-------------------------------------------------------------------+
```

```scala
scala> sql("SELECT timestamp '2029-01-01 00:00:00' - timestamp '2019-01-01 00:00:00'").printSchema()
```
```
root
 |-- (TIMESTAMP '2029-01-01 00:00:00' - TIMESTAMP '2019-01-01 00:00:00'): interval day to second (nullable = false)
```

**Python example:**

```python
>>> import datetime
>>> datetime.datetime.now() - datetime.datetime.now()
datetime.timedelta(days=-1, seconds=86399, microseconds=999996)
```

**pandas / PyArrow example:**

```python
>>> import pyarrow as pa
>>> import pandas as pd
>>> pdf = pd.DataFrame({'a': [datetime.datetime.now() - datetime.datetime.now()]})
>>> tbl = pa.Table.from_pandas(pdf)
>>> tbl.schema
a: duration[ns]
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 368
```

### Does this PR introduce _any_ user-facing change?

Yes, after this change, users can use `DayTimeIntervalType` in `SparkSession.createDataFrame(pandas_df)`, `DataFrame.toPandas`, and pandas UDFs:

```python
>>> import datetime
>>> import pandas as pd
>>> from pyspark.sql.functions import pandas_udf
>>>
>>> pandas_udf("interval day to second")
... def noop(s: pd.Series) -> pd.Series:
...     assert s.iloc[0] == datetime.timedelta(microseconds=123)
...     return s
...
>>> df = spark.createDataFrame(pd.DataFrame({"a": [pd.Timedelta(microseconds=123)]}))
>>> df.toPandas()
                       a
0 0 days 00:00:00.000123
```

### How was this patch tested?

Manually tested, and unittests were added.

Closes #34631 from HyukjinKwon/SPARK-37277-1.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>