
Conversation

@xuanyuanking
Member

@xuanyuanking xuanyuanking commented Jan 5, 2019

What changes were proposed in this pull request?

During the follow-up work (#23435) for the PySpark worker reuse scenario, we found that worker reuse takes no effect for sc.parallelize(xrange(...)). This happens because the specialized rdd.parallelize logic for xrange (introduced in #3264) generates the data from a lazy iterable range and never uses the passed-in iterator. Leaving that iterator unconsumed breaks the end-of-stream check in the Python worker, which in turn prevents worker reuse. See the SPARK-26549 description for more details.

We fix this by forcing consumption of the passed-in iterator.
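For reference, the change is roughly the following sketch of the per-partition function in the specialized xrange path (helper names such as getStart and step mirror the surrounding logic; treat the details as illustrative, not as the exact diff):

```python
def f(split, iterator):
    # The partition's data is regenerated from the range bounds, so the
    # passed-in iterator carries no data. Consuming the (empty) iterator
    # anyway lets FramedSerializer.load_stream run its end-of-stream
    # signal handling, so worker reuse can take effect.
    assert len(list(iterator)) == 0
    return xrange(getStart(split), getStart(split + 1), step)
```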

How was this patch tested?

New UT in test_worker.py.
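The test asserts that a second job reuses the worker processes from the first; a sketch along these lines (assuming a ReusedPySparkTestCase-style base class providing self.sc, with xrange aliased to range on Python 3):

```python
import os

def test_reuse_worker_of_parallelize_range(self):
    rdd = self.sc.parallelize(xrange(20), 8)
    previous_pids = rdd.map(lambda x: os.getpid()).collect()
    current_pids = rdd.map(lambda x: os.getpid()).collect()
    # With the fix, every worker serving the second job was already
    # used (and thus reused) by the first one.
    for pid in current_pids:
        self.assertTrue(pid in previous_pids)
```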

@SparkQA

SparkQA commented Jan 5, 2019

Test build #100802 has finished for PR 23470 at commit 2f371d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

It happened because, while the Python worker checks for the end of the stream in Python 3, we got an unexpected value -1 here, which refers to END_OF_DATA_SECTION.

I haven't taken a look yet, but what's the difference between Python 2 and 3 here? Can you also explain why?
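For context, the check in question at the end of the worker's main loop looks roughly like this (paraphrased from pyspark/worker.py, with helpers from pyspark.serializers; details illustrative):

```python
# After the user function completes, the worker expects the stream to be
# fully drained. If leftover data such as END_OF_DATA_SECTION (-1) is
# still pending, the worker tells the JVM not to reuse it and exits.
if read_int(infile) == SpecialLengths.END_OF_STREAM:
    write_int(SpecialLengths.END_OF_STREAM, outfile)
else:
    write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
    sys.exit(-1)
```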

@xuanyuanking
Member Author

xuanyuanking commented Jan 7, 2019

Thanks Wenchen, Liang-Chi, and Hyukjin for your comments. The JIRA description has more details and the code I added before: https://issues.apache.org/jira/browse/SPARK-26549.
I think the bug is that Python 2 handles the -1 value here while Python 3 does not.
The root cause is the different behavior and call stacks between Python 2 and Python 3. I'm still tracking this down and will post more detailed logs and stack traces soon; any help or advice is appreciated.

Sorry for the mess: the bug only affects sc.parallelize(xrange(x)) and has nothing to do with a specific Python version. I didn't realize there was a code path difference between xrange and range in 'parallelize'... I'll update the JIRA and PR descriptions.
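To illustrate the code path difference (a simplified sketch of SparkContext.parallelize, not the exact code):

```python
def parallelize(self, c, numSlices=None):
    if isinstance(c, xrange):
        # Specialized path from #3264: each partition is regenerated on
        # the executor from (start, stop, step) bounds. Nothing is
        # serialized, so the worker-side iterator stays empty and, before
        # this fix, was never consumed.
        ...
    else:
        # Ordinary path: the data is materialized, serialized, and read
        # back on the worker through the iterator, which drains the
        # stream normally.
        ...
```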

@xuanyuanking xuanyuanking changed the title [SPARK-26549][PySpark] Fix for python worker reuse take no effect for Python3 [SPARK-26549][PySpark] Fix for python worker reuse take no effect for parallelize xrange Jan 7, 2019
@cloud-fan
Contributor

looks reasonable to me, cc @ueshin @BryanCutler

@SparkQA

SparkQA commented Jan 7, 2019

Test build #100888 has finished for PR 23470 at commit ab451e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Jan 7, 2019

Looks fine to me too.

@ueshin
Member

ueshin commented Jan 8, 2019

LGTM, too.

@BryanCutler
Member

Does this mean that the user could also map a function that doesn't consume the iterator and inadvertently cause the worker to not be reused? If so, should the fix be in PythonRunner or worker.py?
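For instance, something like this hypothetical user code ignores the input iterator and would leave the stream undrained in the same way:

```python
# The lambda never touches the incoming iterator, so the data section of
# the worker's input stream is left unconsumed.
rdd.mapPartitions(lambda iterator: [1, 2, 3]).collect()
```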

@HyukjinKwon
Member

re: #23470 (comment)

Yeah, I think so. I took a look at fixing the root cause, but it's going to be quite invasive from what I saw. Maybe there's another way I missed. So the current fix is like a band-aid .. but I think it's good enough.

@HyukjinKwon
Member

LGTM too, considering it's a quick band-aid fix.

@HyukjinKwon
Member

HyukjinKwon commented Jan 8, 2019

Also, let's fix the PR description and title to say lazy iterable range instead of xrange? range in Python 3 is already a lazy iterable.
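For example:

```python
r = range(10**12)  # built instantly; elements are computed on demand
print(r[10])       # 10 -- no trillion-element list is ever materialized
```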

@BryanCutler
Member

This is fine as a band-aid fix for use in rdd.range. I went ahead and made https://issues.apache.org/jira/browse/SPARK-26573 to track the root cause.

@xuanyuanking xuanyuanking changed the title [SPARK-26549][PySpark] Fix for python worker reuse take no effect for parallelize xrange [SPARK-26549][PySpark] Fix for python worker reuse take no effect for parallelize lazy iterable range Jan 9, 2019
@xuanyuanking
Member Author

xuanyuanking commented Jan 9, 2019

@HyukjinKwon Thanks for your comments and advice; all addressed.
@BryanCutler Thanks for the tracking JIRA. I actually tried to fix this in PythonRunner or worker.py at the beginning but ran into some problems; I'll comment with some thoughts on SPARK-26573.

Member

@HyukjinKwon HyukjinKwon left a comment


LGTM

@SparkQA

SparkQA commented Jan 9, 2019

Test build #100950 has finished for PR 23470 at commit 4868e82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@asfgit asfgit closed this in dbbba80 Jan 9, 2019
@xuanyuanking
Member Author

Thanks to all the reviewers.

@xuanyuanking xuanyuanking deleted the SPARK-26549 branch January 9, 2019 06:14
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
[SPARK-26549][PySpark] Fix for python worker reuse take no effect for parallelize lazy iterable range

Closes apache#23470 from xuanyuanking/SPARK-26549.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>