[SPARK-1065] [PySpark] improve support for large broadcasts #1912
Conversation
Passing large objects through Py4J is very slow (and costs a lot of memory), so this patch passes broadcast objects via files instead (similar to parallelize()). It also adds an option to keep the object in the driver (False by default) to save driver memory.
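As a hedged sketch of the file-based handoff described above (not the actual broadcast.py code; `dump_to_file` and `load_from_file` are hypothetical helper names): the driver pickles the value into a temporary file and hands only the short path string across the Py4J bridge, so the large payload never travels through the Py4J socket.

```python
import os
import pickle
import tempfile

def dump_to_file(value, temp_dir=None):
    """Pickle `value` into a temp file and return only the path."""
    fd, path = tempfile.mkstemp(dir=temp_dir, suffix=".pkl")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(value, f, pickle.HIGHEST_PROTOCOL)
    # Hand this small string to the JVM instead of the pickled bytes.
    return path

def load_from_file(path):
    """Read the pickled value back, e.g. in a worker process."""
    with open(path, "rb") as f:
        return pickle.load(f)
```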
QA tests have started for PR 1912. This patch merges cleanly.

QA results for PR 1912:

The failed tests were not related to this PR.
does this fit in the line above?
test this please

QA tests have started for PR 1912. This patch merges cleanly.
I was talking to Jenkins when I said "test this please", but thanks @davies for adding tests too.

LoL, I realized this just after pushing the commit :)

QA results for PR 1912:

QA tests have started for PR 1912. This patch merges cleanly.

QA results for PR 1912:

@davies I am about to test it again with CompressedSerializer. Am I right that I don't need to change anything in my project, but just rebuild Spark?

@frol, yes, thanks again!

QA tests have started for PR 1912. This patch merges cleanly.
python/pyspark/broadcast.py
Outdated
Good call here; it was a bad idea to expose these internals in user-facing module doctests.
This looks good to me, and I'm really glad to read the JIRA comments saying how it sped things up. I left one minor usability-related comment, but otherwise this looks great.

QA results for PR 1912:
add a better message when trying to access Broadcast.value in the driver.
QA tests have started for PR 1912. This patch merges cleanly.

QA results for PR 1912:
@davies Compression improved things, but my tasks have heavy computations inside, so it saved only 10 seconds on a 4.5-minute task and about 10-20 seconds on an 18-minute task. In both cases I have only 340 partitions. I'm still investigating where the second copy of my fat object is, because I can clearly see it when comparing with my local tests. Also, if I cut my big object in half, the memory consumption decreases as if it had been cut to a quarter in a local run.
@frol The big win from compression may be saving memory in the JVM. It's also a win if it doesn't increase the runtime. In the future we could try LZ4; it may help a little with runtime, but it won't contribute much in your case. Which memory are you talking about: the Python driver, the JVM, or the Python workers?
@davies I'm talking about memory in the Python workers, and it turned out to be my own issue. (I found a mistake in my local test; after fixing it, the local test and the Spark Python workers consume the same amount of memory.) Sorry for the confusion.
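For readers following the compression thread above: a minimal standalone sketch of what a compressing serializer can look like, assuming zlib over pickle. The PR's actual CompressedSerializer wraps another serializer; the class below is a made-up illustration, not Spark's implementation.

```python
import pickle
import zlib

class SimpleCompressedSerializer(object):
    """Compress pickled bytes with zlib; decompress before unpickling."""

    def dumps(self, obj):
        return zlib.compress(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL))

    def loads(self, data):
        return pickle.loads(zlib.decompress(data))
```

The trade-off discussed above: compression shrinks what the JVM has to hold and ship, at the cost of some CPU, so it helps most when tasks are not already compute-bound.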
@frol After fixing your local test, are you still noticing any broadcast performance issues? If you still see any odd behavior, could you post a small script or set of pyspark shell commands so we can test it out?
@JoshRosen No, I'm not noticing any broadcast performance issues now. PySpark works like a charm again. Thank you! |
python/pyspark/broadcast.py
Outdated
Typo: should be spelled "accessible." Also, maybe the error message could be a little clearer about how broadcast variables are created and why the call failed. I'm thinking of something like "please call sc.broadcast() with keep=True to make values accessible in the driver".
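As a rough sketch of the suggested behavior (purely illustrative; `keep` is the flag under discussion in this PR, which the later comments below supersede with a disk-based approach, and the attribute names are assumptions):

```python
class Broadcast(object):
    """Illustrative sketch only, not the actual pyspark.broadcast code."""

    def __init__(self, value=None, keep=False):
        # With keep=False, the driver drops its copy to save memory.
        self._value = value if keep else None

    @property
    def value(self):
        if self._value is None:
            raise Exception("broadcast value is not accessible in the driver; "
                            "please call sc.broadcast() with keep=True to make "
                            "values accessible in the driver")
        return self._value
```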
It occurs to me: what if we had .value retrieve and depickle the value from the JVM? Also, won't we still experience memory leaks in the JVM if we iteratively create broadcast variables, since we will never clean up those pickled values? One approach is to have .value depickle the JVM value (so we're not changing the user-facing API) and add a Python equivalent of Broadcast.destroy() for performing permanent cleanup of a broadcast's resources. What do you think of this approach?
add Broadcast.unpersist()
QA tests have started for PR 1912 at commit
I have added Broadcast.unpersist(blocking=False). Because we keep a copy on disk, we can read the value from there when the user wants to access it in the driver, so SparkContext.broadcast() can stay unchanged.
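A minimal sketch of that approach under stated assumptions (`_path` and `_jbroadcast` are made-up names, not necessarily the real broadcast.py attributes): the driver keeps only the path to the pickled file and reloads the value lazily on first access, while unpersist() forwards to the JVM-side broadcast.

```python
import pickle

class Broadcast(object):
    """Sketch of the disk-backed design described above; illustrative only."""

    def __init__(self, bid, path, java_broadcast=None):
        self.bid = bid
        self._path = path                  # driver-side pickled copy on disk
        self._jbroadcast = java_broadcast  # py4j handle to the JVM broadcast
        self._value = None

    @property
    def value(self):
        # Lazily reload from the on-disk copy on first driver access, so
        # SparkContext.broadcast() keeps its old user-facing behavior.
        if self._value is None:
            with open(self._path, "rb") as f:
                self._value = pickle.load(f)
        return self._value

    def unpersist(self, blocking=False):
        # Ask the JVM to drop the cached copies on the executors.
        if self._jbroadcast is not None:
            self._jbroadcast.unpersist(blocking)
```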
Hmm, looks like this was affected by the Jenkins timeouts last night. Jenkins, retest this please.

Jenkins, retest this please.

QA tests have started for PR 1912 at commit

QA tests have finished for PR 1912 at commit

Jenkins, retest this please.

QA tests have started for PR 1912 at commit

QA tests have finished for PR 1912 at commit
Can you add a docstring? It's fine to just copy it over from the Scala equivalent. In this case:

/**
 * Delete cached copies of this broadcast on the executors. If the broadcast is used after
 * this is called, it will need to be re-sent to each executor.
 * @param blocking Whether to block until unpersisting has completed
 */
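For reference, a direct Python rendering of that Scala doc might look like the following (a sketch; the `_jbroadcast` attribute name is an assumption):

```python
def unpersist(self, blocking=False):
    """
    Delete cached copies of this broadcast on the executors. If the
    broadcast is used after this is called, it will need to be re-sent
    to each executor.

    :param blocking: Whether to block until unpersisting has completed
    """
    self._jbroadcast.unpersist(blocking)
```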
I guess we don't necessarily want to expose
Actually, I'm just going to merge this now, and I'll add the docstring as part of a subsequent documentation-improvement PR (I also want to edit some Scala / Java docs, too).
I've merged this into
Passing large object by py4j is very slow (cost much memory), so pass broadcast objects via files (similar to parallelize()). Add an option to keep object in driver (it's False by default) to save memory in driver.

Author: Davies Liu <[email protected]>

Closes #1912 from davies/broadcast and squashes the following commits:

e06df4a [Davies Liu] load broadcast from disk in driver automatically
db3f232 [Davies Liu] fix serialization of accumulator
631a827 [Davies Liu] Merge branch 'master' into broadcast
c7baa8c [Davies Liu] compress serialized broadcast and command
9a7161f [Davies Liu] fix doc tests
e93cf4b [Davies Liu] address comments: add test
6226189 [Davies Liu] improve large broadcast

(cherry picked from commit 2fc8aca)
Signed-off-by: Josh Rosen <[email protected]>