11 changes: 11 additions & 0 deletions docs/sql-pyspark-pandas-with-arrow.md
@@ -219,3 +219,14 @@ Note that a standard UDF (non-Pandas) will load timestamp data as Python datetime objects, which is
different than a Pandas timestamp. It is recommended to use Pandas time series functionality when
working with timestamps in `pandas_udf`s to get the best performance, see
[here](https://pandas.pydata.org/pandas-docs/stable/timeseries.html) for details.
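The recommendation above can be illustrated with a minimal sketch of a scalar `pandas_udf` (Spark 2.x API; the UDF name and the column it is applied to are hypothetical) that uses the pandas `.dt` accessor on a timestamp column:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Timestamp values arrive as a pandas.Series of pandas Timestamps, so pandas
# time series functionality such as the .dt accessor can be used directly.
@pandas_udf("long", PandasUDFType.SCALAR)
def day_of_week(ts: pd.Series) -> pd.Series:
    return ts.dt.dayofweek

# Hypothetical usage, assuming df has a timestamp column named "event_time":
# df.withColumn("dow", day_of_week("event_time"))
```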

### Compatibility Setting for PyArrow >= 0.15.0 and Spark 2.3.x, 2.4.x
Member
Oh, actually @BryanCutler, would you mind adding this (or link back) at https://github.com/apache/spark/blob/master/docs/sparkr.md#apache-arrow-in-sparkr ? No worry about testing it out, I will do it tonight.

Member Author

@HyukjinKwon so you don't need this note for R, Arrow was not used in 2.4.x?


Since Arrow 0.15.0, a change in the binary IPC format requires an environment variable to be set in
Member

How about adding a link http://arrow.apache.org/blog/2019/10/06/0.15.0-release/#columnar-streaming-protocol-change-since-0140 to the release blog of Apache Arrow?

Member Author

Sure, that would be good thanks!

Spark so that PySpark maintains compatibility with PyArrow versions 0.15.0 and above. The following can be added to `conf/spark-env.sh` to use the legacy IPC format:

```
ARROW_PRE_0_15_IPC_FORMAT=1
```
Member

Can we just set it by default?

Member Author

This is for already released Spark versions, where the user has upgraded pyarrow to 0.15.0 in their cluster

Member

I think I'd just clarify this in the new docs, that it's only needed if you manually update pyarrow this way.

Member Author

Yes I agree, I'll clarify this


This will instruct PyArrow >= 0.15.0 to use the legacy IPC format with the older Arrow Java that is in Spark 2.3.x and 2.4.x.
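For reference, the same flag could presumably also be set per application rather than by editing `conf/spark-env.sh`; a minimal sketch, assuming that `spark.executorEnv.*` values are propagated to the Python workers launched by the executors (the application name is hypothetical):

```python
import os
from pyspark.sql import SparkSession

# Set the legacy IPC format flag for the driver's Python process before any
# Arrow conversion happens (e.g. toPandas() or pandas_udf execution).
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

spark = (
    SparkSession.builder
    .appName("arrow-legacy-ipc")  # hypothetical application name
    # Spark 2.x switch for Arrow-based conversions.
    .config("spark.sql.execution.arrow.enabled", "true")
    # Forward the flag to executor processes; assumed to be inherited by the
    # Python workers they launch -- verify on your cluster manager.
    .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
    .getOrCreate()
)
```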
Member (@dongjoon-hyun) Oct 7, 2019

Do we need to mention this in the SQL migration guide too? This sounds like a requirement for migrating from 2.3 and 2.4.

Member

From the wording, does it mean that if using Spark 3.0, which has the newer Arrow Java, you do not need to set it?

Member (@HyukjinKwon) Oct 8, 2019

Hm, @BryanCutler, do you plan to upgrade and also increase the minimum version of PyArrow in SPARK-29376 (we upgrade the JVM one too; therefore, we don't need to set the environment variable in Spark 3.0)?

If so, we don't have to deal with https://github.com/apache/spark/pull/26045/files#r332285077 since Arrow with R is new in Spark 3.0.

If that's the case, increasing the minimum version of Arrow R to 0.15.0 is fine with me too.

Member

We aren't updating Arrow in 2.x, right? This would just be for users who go off-road and manually update it in their usage of PySpark?

Member

Yea, we don't update in 2.x I guess, and it seems this note targets 2.x.

I was wondering if we're going to upgrade the JVM Arrow in Spark 3.0, and if it provides compatibility with lower PyArrow and Arrow R versions. If not, we should increase the minimum versions of PyArrow and Arrow R.

Member Author

> Do we need to mention this in the SQL migration guide too? This sounds like a requirement for migrating from 2.3 and 2.4.

It's not a requirement for migrating, because it's only needed when upgrading pyarrow to 0.15.0 with Spark 2.x. Although, someone on Spark 2.3.x might have pyarrow=0.8.0 and then, when migrating to Spark 2.4.x, be forced to upgrade pyarrow to a minimum version of 0.12.1, and might end up with 0.15.0 and need the env var. It's a bit of a stretch, but if you think it could help to add it to the guide, I will.

Member Author

> Hm, @BryanCutler, do you plan to upgrade and also increase the minimum version of PyArrow in SPARK-29376 (we upgrade the JVM one too; therefore, we don't need to set the environment variable in Spark 3.0)?

So once we upgrade Arrow Java to 0.15.0, it is not necessary to set the env var and it will work with older versions of pyarrow also. Because of this, I don't think it's necessary to increase the minimum version right now. I do think we will have Arrow 1.0 before Spark 3.0, so it would make sense to set that as the minimum version.

Member Author

> We aren't updating Arrow in 2.x, right?

I wondered about this. I know we don't usually do this, but it would remove the need for the env var, I believe. Is it something we should consider?

Member

Thanks for the clarification! Then, I guess we don't need a note for the R one.