[SPARK-4398][PySpark] specialize sc.parallelize(xrange) #3264
Conversation
Test build #23360 has started for PR 3264 at commit
python/pyspark/context.py
How about pre-calculating all the boundaries for all the partitions?
This only serializes an xrange object. If we pre-calculate the boundaries, the cost is O(p).
Yes, but the size + 1 is tricky. How about this one:

```python
start = c[0]

def getStart(split):
    return start + size * split / numSlices * step

def f(split, iterator):
    return xrange(getStart(split), getStart(split + 1), step)
```
Yes, this is better!
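The exchange above can be sketched as a runnable Python 3 function (the helper name `slice_range` and the demo values are assumptions for illustration; Python 3's `range` stands in for Python 2's `xrange`, and `//` makes explicit the floor division that `/` performs on Python 2 ints):

```python
# A minimal sketch of the boundary-based slicing suggested above.
# Each partition is described only by its start/stop/step, so no
# elements are materialized when slicing.

def slice_range(c, numSlices):
    """Split range `c` into `numSlices` contiguous lazy sub-ranges."""
    size = len(c)
    start = c[0]
    step = c[1] - c[0] if size > 1 else 1

    def getStart(split):
        # Each boundary is O(1) arithmetic; nothing is pre-computed.
        return start + (size * split // numSlices) * step

    return [range(getStart(split), getStart(split + 1), step)
            for split in range(numSlices)]

parts = slice_range(range(0, 20, 2), 3)
# The sub-ranges cover the original range exactly, with no overlap.
assert [x for p in parts for x in p] == list(range(0, 20, 2))
```

Because each partition is described only by its boundaries, serializing a task costs O(1) per partition instead of shipping the elements themselves.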
Test build #23360 has finished for PR 3264 at commit

Test PASSed.

Test build #23374 has started for PR 3264 at commit

Test build #23375 has started for PR 3264 at commit

LGTM, thanks!

Test build #23375 has finished for PR 3264 at commit

Test PASSed.

Test build #23374 has finished for PR 3264 at commit

Test PASSed.
@davies Thanks! I've merged this into master and branch-1.2.
`sc.parallelize(range(1 << 20), 1).count()` may take 15 seconds to finish, and the RDD object stores the entire list, making the task size very large. This PR adds a specialized version for xrange.

@JoshRosen @davies

Author: Xiangrui Meng <[email protected]>

Closes #3264 from mengxr/SPARK-4398 and squashes the following commits:

8953c41 [Xiangrui Meng] follow davies' suggestion
cbd58e3 [Xiangrui Meng] specialize sc.parallelize(xrange)

(cherry picked from commit abd5817)
Signed-off-by: Xiangrui Meng <[email protected]>
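The task-size argument can be checked outside Spark with plain `pickle` (a hedged illustration only — Spark uses its own serializers, but the asymmetry is the same): a lazy range pickles to a handful of bytes, while the materialized list pickles to megabytes.

```python
import pickle

# A range object serializes as just (start, stop, step);
# a list serializes every element.
n = 1 << 20
lazy = pickle.dumps(range(n))
materialized = pickle.dumps(list(range(n)))

assert len(lazy) < 100            # constant-size representation
assert len(materialized) > 1_000_000  # grows with element count
```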
… parallelize lazy iterable range

## What changes were proposed in this pull request?

During the follow-up work (#23435) for the PySpark worker reuse scenario, we found that worker reuse takes no effect for `sc.parallelize(xrange(...))`. This happened because the specialized rdd.parallelize logic for xrange (introduced in #3264) generated data from a lazy iterable range, which does not need to use the passed-in iterator. But this breaks the end-of-stream checking in the Python worker and finally causes worker reuse to take no effect. See more details in the [SPARK-26549](https://issues.apache.org/jira/browse/SPARK-26549) description. We fix this by forcing use of the passed-in iterator.

## How was this patch tested?

New UT in test_worker.py.

Closes #23470 from xuanyuanking/SPARK-26549.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
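The end-of-stream problem described above can be modeled with a toy example (illustrative assumptions only — `run_task`, `broken`, and `fixed` are not Spark internals): a partition function that ignores its input iterator leaves the underlying stream unconsumed, so a reader expecting to reach end-of-stream never does.

```python
# Toy model of the worker-reuse bug: the "broken" function generates
# data lazily and never touches `iterator`, leaving the stream
# unconsumed; the "fixed" function forces consumption of it.

def run_task(stream, func):
    it = iter(stream)
    result = list(func(0, it))
    leftover = list(it)  # anything left here breaks the stream protocol
    return result, leftover

# Broken: produces data lazily, never touching `iterator`.
broken = lambda split, iterator: range(3)
# Fixed: forces use of the passed-in iterator.
fixed = lambda split, iterator: list(iterator)

_, leftover_broken = run_task([0, 1, 2], broken)
_, leftover_fixed = run_task([0, 1, 2], fixed)
assert leftover_broken == [0, 1, 2]  # stream left unconsumed
assert leftover_fixed == []          # stream fully drained
```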