
Conversation

@mengxr
Contributor

@mengxr mengxr commented Nov 14, 2014

`sc.parallelize(range(1 << 20), 1).count()` may take 15 seconds to finish, and the RDD object stores the entire list, making the task size very large. This PR adds a specialized version for xrange.
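To see why the list-backed version bloats the task, compare the pickled sizes (plain Python 3 for illustration, where `range` is lazy like Python 2's `xrange`; this is not Spark code):

```python
import pickle

n = 1 << 20

# Materializing the sequence serializes every element into the payload.
list_payload = pickle.dumps(list(range(n)))

# A lazy range pickles as just (start, stop, step), a few dozen bytes.
range_payload = pickle.dumps(range(n))

print(len(list_payload), len(range_payload))
```

The list payload is several megabytes, while the range payload stays constant regardless of `n`, which is what the specialized code path exploits.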

@JoshRosen @davies

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23360 has started for PR 3264 at commit cbd58e3.

  • This patch merges cleanly.

Contributor

How about pre-calculating all the partition boundaries up front?

Contributor Author

This only serializes an xrange object per task. If we pre-calculate the boundaries, the cost is O(p) for p partitions.

Contributor

Yes, but the `size + 1` handling is tricky. How about this one:

```python
start = c[0]

def getStart(split):
    # Python 2: `/` is integer division here
    return start + size * split / numSlices * step

def f(split, iterator):
    return xrange(getStart(split), getStart(split + 1), step)
```
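A Python 3 sketch of this suggestion (with `range` standing in for `xrange`; `slice_range` and its argument names are illustrative, not the actual PySpark API) shows that computing each boundary independently still covers the original range exactly:

```python
def slice_range(c, numSlices):
    """Split a range `c` into numSlices contiguous sub-ranges,
    computing each boundary from the split index alone."""
    size = len(c)
    start = c.start if size > 0 else 0
    step = c.step if size > 0 else 1

    def get_start(split):
        # integer boundary for a given partition index
        return start + (size * split // numSlices) * step

    return [range(get_start(i), get_start(i + 1), step)
            for i in range(numSlices)]

parts = slice_range(range(0, 20, 3), 4)
flat = [x for part in parts for x in part]
assert flat == list(range(0, 20, 3))
```

Because each boundary depends only on the split index, a task can compute its own sub-range locally, without the driver shipping a precomputed list of boundaries.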

Contributor Author

Yes, this is better!

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23360 has finished for PR 3264 at commit cbd58e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23360/

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23374 has started for PR 3264 at commit c184fcc.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23375 has started for PR 3264 at commit 8953c41.

  • This patch merges cleanly.

@davies
Contributor

davies commented Nov 14, 2014

LGTM, thanks!

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23375 has finished for PR 3264 at commit 8953c41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23375/

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23374 has finished for PR 3264 at commit c184fcc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23374/

@mengxr
Contributor Author

mengxr commented Nov 14, 2014

@davies Thanks! I've merged this into master and branch-1.2.

@asfgit asfgit closed this in abd5817 Nov 14, 2014
asfgit pushed a commit that referenced this pull request Nov 14, 2014
`sc.parallelize(range(1 << 20), 1).count()` may take 15 seconds to finish and the rdd object stores the entire list, making task size very large. This PR adds a specialized version for xrange.

@JoshRosen @davies

Author: Xiangrui Meng <[email protected]>

Closes #3264 from mengxr/SPARK-4398 and squashes the following commits:

8953c41 [Xiangrui Meng] follow davies' suggestion
cbd58e3 [Xiangrui Meng] specialize sc.parallelize(xrange)

(cherry picked from commit abd5817)
Signed-off-by: Xiangrui Meng <[email protected]>
asfgit pushed a commit that referenced this pull request Jan 9, 2019
… parallelize lazy iterable range

## What changes were proposed in this pull request?

During the follow-up work (#23435) for the PySpark worker reuse scenario, we found that worker reuse takes no effect for `sc.parallelize(xrange(...))`. This happened because the specialized rdd.parallelize logic for xrange (introduced in #3264) generates data from a lazy iterable range, which does not need to consume the passed-in iterator. But that breaks the end-of-stream check in the Python worker and ultimately causes worker reuse to take no effect. See more details in the [SPARK-26549](https://issues.apache.org/jira/browse/SPARK-26549) description.

We fix this by forcing use of the passed-in iterator.
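A toy model of the failure mode (plain Python; `END`, `run_task`, and both task functions are made up for illustration and are not the PySpark worker protocol): a task function that produces its output without touching the input iterator leaves the end-of-stream marker unread, so a reused worker starts the next task mid-stream.

```python
END = object()  # stands in for the worker's end-of-stream marker

def run_task(stream, func):
    """Run `func` over the input stream and report whether the
    stream was fully drained (a precondition for worker reuse)."""
    it = iter(stream)
    out = list(func(it))
    leftover = next(it, None)
    return out, leftover is None

data = [1, 2, 3, END]

# Like the specialized xrange path: output is generated without
# consuming the passed-in iterator, so END is never read.
def lazy_func(it):
    return [10, 20, 30]

out1, drained1 = run_task(data, lazy_func)  # drained1 is False

# The fix: force use of the passed-in iterator so the stream drains.
def forced_func(it):
    return [x * 10 for x in it if x is not END]

out2, drained2 = run_task(data, forced_func)  # drained2 is True
```

With worker reuse, an undrained stream means the next task begins reading stale protocol data, which is why the follow-up forces consumption even though the range data itself is computable lazily.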

## How was this patch tested?
New UT in test_worker.py.

Closes #23470 from xuanyuanking/SPARK-26549.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
… parallelize lazy iterable range