
Conversation

@xuanyuanking
Member

@xuanyuanking xuanyuanking commented Jan 5, 2019

What changes were proposed in this pull request?

During the follow-up work (#23435) for the PySpark worker reuse scenario, we found that worker reuse takes no effect for sc.parallelize(xrange(...)). This happens because the specialized rdd.parallelize logic for xrange (introduced in #3264) generates the data from a lazy iterable range and never uses the passed-in iterator. Leaving that iterator unconsumed breaks the end-of-stream check in the Python worker, which in turn prevents worker reuse. See the SPARK-26549 description for more details.

We fix this by forcing consumption of the passed-in iterator.
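For reference, the change is roughly the following sketch of the per-partition function in the specialized xrange path (helper names such as getStart and step mirror the surrounding logic; treat the details as illustrative, not as the exact diff):

```python
def f(split, iterator):
    # The partition's data is regenerated from the range bounds, so the
    # passed-in iterator carries no data. Consuming the (empty) iterator
    # anyway lets FramedSerializer.load_stream run its end-of-stream
    # signal handling, so worker reuse can take effect.
    assert len(list(iterator)) == 0
    return xrange(getStart(split), getStart(split + 1), step)
```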

How was this patch tested?

New UT in test_worker.py.
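The test asserts that a second job reuses the worker processes from the first; a sketch along these lines (assuming a ReusedPySparkTestCase-style base class providing self.sc, with xrange aliased to range on Python 3):

```python
import os

def test_reuse_worker_of_parallelize_range(self):
    rdd = self.sc.parallelize(xrange(20), 8)
    previous_pids = rdd.map(lambda x: os.getpid()).collect()
    current_pids = rdd.map(lambda x: os.getpid()).collect()
    # With the fix, every worker serving the second job was already
    # used (and thus reused) by the first one.
    for pid in current_pids:
        self.assertTrue(pid in previous_pids)
```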

@SparkQA

SparkQA commented Jan 5, 2019

Test build #100802 has finished for PR 23470 at commit 2f371d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

It happened because, while the Python worker checks for the end of the stream in Python 3, we got an unexpected value -1 here, which refers to END_OF_DATA_SECTION.

I haven't taken a look yet, but what's the difference between Python 2 and 3 here? Can you also explain why?
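For context, the check in question at the end of the worker's main loop looks roughly like this (paraphrased from pyspark/worker.py, with helpers from pyspark.serializers; details illustrative):

```python
# After the user function completes, the worker expects the stream to be
# fully drained. If leftover data such as END_OF_DATA_SECTION (-1) is
# still pending, the worker tells the JVM not to reuse it and exits.
if read_int(infile) == SpecialLengths.END_OF_STREAM:
    write_int(SpecialLengths.END_OF_STREAM, outfile)
else:
    write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
    sys.exit(-1)
```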

@xuanyuanking
Member Author

xuanyuanking commented Jan 7, 2019

Thanks Wenchen, Liang-Chi, and Hyukjin for your comments. The JIRA description has more details and the code I added before: https://issues.apache.org/jira/browse/SPARK-26549.
I think the bug is that Python 2 handles the -1 value here while Python 3 does not.
The root cause is the different behavior and call stacks between Python 2 and Python 3. I'm still tracking this down and will post more detailed logs and stack traces soon; any help or advice is appreciated.

Sorry for the mess: the bug only affects sc.parallelize(xrange(x)) and has nothing to do with a specific Python version. I didn't realize there was a code path difference between xrange and range in 'parallelize'... I'll update the JIRA and PR descriptions.
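To illustrate the code path difference (a simplified sketch of SparkContext.parallelize, not the exact code):

```python
def parallelize(self, c, numSlices=None):
    if isinstance(c, xrange):
        # Specialized path from #3264: each partition is regenerated on
        # the executor from (start, stop, step) bounds. Nothing is
        # serialized, so the worker-side iterator stays empty and, before
        # this fix, was never consumed.
        ...
    else:
        # Ordinary path: the data is materialized, serialized, and read
        # back on the worker through the iterator, which drains the
        # stream normally.
        ...
```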

@xuanyuanking xuanyuanking changed the title [SPARK-26549][PySpark] Fix for python worker reuse take no effect for Python3 [SPARK-26549][PySpark] Fix for python worker reuse take no effect for parallelize xrange Jan 7, 2019
@cloud-fan
Contributor

looks reasonable to me, cc @ueshin @BryanCutler

@SparkQA

SparkQA commented Jan 7, 2019

Test build #100888 has finished for PR 23470 at commit ab451e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Jan 7, 2019

Looks fine to me too.

@ueshin
Member

ueshin commented Jan 8, 2019

LGTM, too.

@BryanCutler
Member

Does this mean that the user could also map a function that doesn't consume the iterator and inadvertently cause the worker to not be reused? If so, should the fix be in PythonRunner or worker.py?
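For instance, something like this hypothetical user code ignores the input iterator and would leave the stream undrained in the same way:

```python
# The lambda never touches the incoming iterator, so the data section of
# the worker's input stream is left unconsumed.
rdd.mapPartitions(lambda iterator: [1, 2, 3]).collect()
```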

@HyukjinKwon
Member

re: #23470 (comment)

Yeah, I think so. I took a look at fixing the root cause, but it's going to be quite invasive from what I saw. Maybe there's another way I missed. So the current fix is like a band-aid .. but I think it's good enough.

@HyukjinKwon
Member

LGTM too, considering it's a quick band-aid fix.

@HyukjinKwon
Member

HyukjinKwon commented Jan 8, 2019

Also, let's fix the PR description and title to say lazy iterable range instead of xrange? range in Python 3 is already a lazy iterable.
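For example:

```python
r = range(10**12)  # built instantly; elements are computed on demand
print(r[10])       # 10 -- no trillion-element list is ever materialized
```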

@BryanCutler
Member

This is fine as a band-aid fix for use in rdd.range. I went ahead and made https://issues.apache.org/jira/browse/SPARK-26573 to track the root cause.

@xuanyuanking xuanyuanking changed the title [SPARK-26549][PySpark] Fix for python worker reuse take no effect for parallelize xrange [SPARK-26549][PySpark] Fix for python worker reuse take no effect for parallelize lazy iterable range Jan 9, 2019
@xuanyuanking
Member Author

xuanyuanking commented Jan 9, 2019

@HyukjinKwon Thanks for your comments and advice; all addressed.
@BryanCutler Thanks for the tracking JIRA. I actually tried to fix this in PythonRunner or worker.py at the beginning but ran into some problems; I'll comment with some thoughts on SPARK-26573.

Member

@HyukjinKwon HyukjinKwon left a comment


LGTM

@SparkQA

SparkQA commented Jan 9, 2019

Test build #100950 has finished for PR 23470 at commit 4868e82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@asfgit asfgit closed this in dbbba80 Jan 9, 2019
@xuanyuanking
Member Author

Thanks to all the reviewers.

@xuanyuanking xuanyuanking deleted the SPARK-26549 branch January 9, 2019 06:14
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
[SPARK-26549][PySpark] Fix for python worker reuse take no effect for parallelize lazy iterable range

Closes apache#23470 from xuanyuanking/SPARK-26549.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>