[SPARK-27659][PYTHON] Allow PySpark to prefetch during toLocalIterator #25515
Conversation
…looking at the next elem not the one we are about to block on, and fix the Python tests.
cc @BryanCutler who created 5e79ae3
Test build #109437 has finished for PR 25515 at commit
Test build #109448 has finished for PR 25515 at commit
Thanks for doing this @holdenk ! It can definitely improve performance when calculating partitions takes some time. I know this issue was just for Python, but Scala toLocalIterators could also benefit from prefetch, I believe. WDYT?
    if (prefetchPartitions) {
      prefetchIter.headOption
    }
    val partitionArray = ThreadUtils.awaitResult(partitionFuture, Duration.Inf)
It might be best to avoid awaitResult if possible. Could you make a buffered iterator yourself?
maybe something like:

    var next = collectPartitionIter.next()
    val prefetchIter = collectPartitionIter.map { part =>
      val tmp = next
      next = part
      tmp
    } ++ Iterator(next)
So the awaitResult (or something similar) is required for us to use futures. If we just used a buffered iterator without allowing the job to schedule separately, we'd block for both partitions right away instead of evaluating the other future in the background while we block on the first. (Implicitly, this awaitResult was already effectively done inside of the previous DAGScheduler's runJob.)
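(For illustration only: a conceptual Python sketch, assuming a hypothetical compute_partition(i) function, of why a future is needed here. The next partition's job is submitted in the background while we block only on the partition we need now.)

    from concurrent.futures import ThreadPoolExecutor

    def prefetching_iterator(compute_partition, num_partitions):
        # Keep one partition "in flight": when the caller asks for partition i,
        # partition i + 1 is submitted before we block on partition i's result.
        if num_partitions == 0:
            return
        with ThreadPoolExecutor(max_workers=1) as pool:
            future = pool.submit(compute_partition, 0)
            for i in range(1, num_partitions):
                next_future = pool.submit(compute_partition, i)
                yield future.result()  # block only on the partition needed now
                future = next_future
            yield future.result()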
Ah yes, you are totally right. That would block while getting the prefetched partition. This looks pretty good to me then.
One question though, when should the first job be triggered? I think the old behavior used to start the first job as soon as toLocalIterator() was called. From what I can tell, this will wait until the first iteration and then trigger the first 2 jobs. Either way is probably fine, but you might get slightly better performance by starting the first job immediately.
In either case it waits for a request for data from the Python side before starting a job, because the map over the partition indices is lazily evaluated.
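(A rough Python analogy of that laziness, using a hypothetical submit_job helper: nothing runs until the consumer asks for the next element.)

    def submit_job(partition_index):
        # Hypothetical stand-in for submitting a Spark job for one partition.
        print("submitting job for partition", partition_index)
        return partition_index

    # Like Scala's Iterator.map, a generator expression is lazy: no "job" is
    # submitted until next() is called on the iterator.
    jobs = (submit_job(i) for i in range(3))
    first = next(jobs)  # only now does the first job get submitted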
python/pyspark/tests/test_rdd.py
    timesPrefetchNext = next(timesIterPrefetch)
    print("With prefetch times are: " + str(timesPrefetchHead) + "," + str(timesPrefetchNext))
    self.assertTrue(timesNext - timesHead >= timedelta(seconds=2))
    self.assertTrue(timesPrefetchNext - timesPrefetchHead < timedelta(seconds=1))
This is a pretty clever test! Anything with timings makes me a bit worried about flakiness, but I don't have any other idea how to test this. Is it possible to see if the jobs were scheduled?
I think we could if we used a fresh SparkContext, but with the reused context I'm not sure how I'd know whether the job was run or not.
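(One possible way to observe this even with a reused SparkContext might be to tag the jobs with a job group and query the status tracker. This is only a hedged sketch, not something the PR does, and whether jobs triggered from the iterator-serving thread actually inherit the group would need to be verified.)

    self.sc.setJobGroup("prefetch-test", "check that prefetch schedules an extra job")
    it = times1.toLocalIterator(prefetchPartitions=True)
    next(it)  # request the first partition; prefetch should kick off the next job
    job_ids = self.sc.statusTracker().getJobIdsForGroup("prefetch-test")
    # If prefetching worked, more than one job id should eventually show up here.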
I think Scala support is worth exploring too, I'm happy to file a follow-up issue.
HyukjinKwon left a comment
Just a quick nit, and to double check: was the benchmark performed with #25515 (comment)? It seems like the feature was mistakenly disabled.
examples/src/main/python/prefetch.py
    @@ -0,0 +1,86 @@
    #
I think examples in this directory are meant to show how a feature or API is used rather than to show perf results - I think that can just be shown in the PR description.
Essentially the example boils down to just .toLocalIterator(prefetchPartitions=True), which I don't think is worth a separate example file.
Reasonable, I'll remove it from the examples; it was mostly a simple way to share the microbenchmark.
python/pyspark/tests/test_rdd.py
    rdd = self.sc.parallelize(range(2), 2)
    times1 = rdd.map(lambda x: datetime.now())
    times2 = rdd.map(lambda x: datetime.now())
    timesIterPrefetch = times1.toLocalIterator(prefetchPartitions=True)
Shall we stick to the underscore naming rule?
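(That is, snake_case for the local names in the test, something like the lines below. Shown only for illustration; the prefetchPartitions keyword itself presumably keeps camelCase to match the existing toLocalIterator API.)

    # snake_case for local variables, per the usual Python naming convention:
    times_iter_prefetch = times1.toLocalIterator(prefetchPartitions=True)
    times_prefetch_head = next(times_iter_prefetch)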
So the benchmark was done on RDDs, not on DataFrames (you can see the benchmark code in this PR).
    // Client requested more data, attempt to collect the next partition
    val partitionArray = collectPartitionIter.next()
    val partitionFuture = prefetchIter.next()
    // Cause the next job to be submitted if prefecthPartitions is enabled.
typo: prefecthPartitions -> prefetchPartitions
python/pyspark/tests/test_rdd.py
    time.sleep(2)
    timesNext = next(timesIter)
    timesPrefetchNext = next(timesIterPrefetch)
    print("With prefetch times are: " + str(timesPrefetchHead) + "," + str(timesPrefetchNext))
Shall we remove print?
Test build #110432 has finished for PR 25515 at commit
Filed the follow-up issue in https://issues.apache.org/jira/browse/SPARK-29083
If there are no more comments by Monday, I'll merge this :)
Merged to master
Late review, but LGTM. Thanks @holdenk!
What changes were proposed in this pull request?
This PR allows the Python toLocalIterator to prefetch the next partition while the first partition is being collected. The PR also adds a demo microbenchmark in the examples directory; we may wish to keep this or not.
Why are the changes needed?
In https://issues.apache.org/jira/browse/SPARK-23961 / 5e79ae3 we changed PySpark to only pull one partition at a time. This is memory efficient, but if partitions take time to compute this can mean we're spending more time blocking.
Does this PR introduce any user-facing change?
A new parameter, prefetchPartitions, is added to toLocalIterator.
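(A minimal usage sketch of the new parameter, based on the API shown in this PR:)

    # Defaults to False, preserving the previous one-partition-at-a-time behavior.
    for x in rdd.toLocalIterator(prefetchPartitions=True):
        print(x)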
How was this patch tested?
A new unit test inside of test_rdd.py checks the time at which the elements are evaluated. Another test checking that the results remain the same is added to test_dataframe.py. I also ran a microbenchmark in the examples directory, prefetch.py, which shows an improvement of ~40% in this specific use case.
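(For reference, a hedged sketch of what such a microbenchmark might look like; this is not the actual prefetch.py from the PR, just an illustration of the pattern where per-partition compute overlaps with driver-side consumption.)

    import time
    from pyspark import SparkContext

    sc = SparkContext(appName="prefetch-microbenchmark")
    # Each partition takes roughly a second to compute.
    rdd = sc.parallelize(range(8), 8).map(lambda x: (time.sleep(1), x)[1])

    for prefetch in (False, True):
        start = time.time()
        for _ in rdd.toLocalIterator(prefetchPartitions=prefetch):
            time.sleep(1)  # simulate per-element work on the driver
        print("prefetchPartitions=%s took %.1fs" % (prefetch, time.time() - start))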