Skip to content

Conversation

@davies
Copy link
Contributor

@davies davies commented Aug 13, 2014

Passing large object by py4j is very slow (cost much memory), so pass broadcast objects via files (similar to parallelize()).

Add an option to keep object in driver (it's False by default) to save memory in driver.

Passing large object by py4j is very slow (cost much memory),
so pass broadcast objects via files (similar to parallelize()).

Add an option to keep object in driver (it's False by default)
to save memory in driver.
@SparkQA
Copy link

SparkQA commented Aug 13, 2014

QA tests have started for PR 1912. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18398/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 13, 2014

QA results for PR 1912:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18398/consoleFull

@davies
Copy link
Contributor Author

davies commented Aug 13, 2014

failed tests were not related to this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this fit in the line above?

@andrewor14
Copy link
Contributor

test this please

@SparkQA
Copy link

SparkQA commented Aug 13, 2014

QA tests have started for PR 1912. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18430/consoleFull

@andrewor14
Copy link
Contributor

I was talking to Jenkins when I said "test this please", but thanks @davies for adding tests too.

@davies
Copy link
Contributor Author

davies commented Aug 13, 2014

LoL, I realized this just after pushing the commit :)

@SparkQA
Copy link

SparkQA commented Aug 13, 2014

QA results for PR 1912:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18430/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 13, 2014

QA tests have started for PR 1912. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18447/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 13, 2014

QA results for PR 1912:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class CompressedSerializer(FramedSerializer):

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18447/consoleFull

@frol
Copy link
Contributor

frol commented Aug 13, 2014

@davies I am about to test it again with CompressedSerializer. Am I right that I don't need to change anything in my project, but just rebuild Spark?

@davies
Copy link
Contributor Author

davies commented Aug 13, 2014

@frol , Yes, thanks again!

@SparkQA
Copy link

SparkQA commented Aug 13, 2014

QA tests have started for PR 1912. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18456/consoleFull

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call here; it was a bad idea to expose these internals in user-facing module doctests.

@JoshRosen
Copy link
Contributor

This looks good to me and I'm really glad to read the JIRA comments saying how it sped things up.

I left one minor usability-related comment, but otherwise this looks great.

@SparkQA
Copy link

SparkQA commented Aug 13, 2014

QA results for PR 1912:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class CompressedSerializer(FramedSerializer):

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18456/consoleFull

add better message when try to access Broadcast.value in driver.
@SparkQA
Copy link

SparkQA commented Aug 13, 2014

QA tests have started for PR 1912. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18460/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 13, 2014

QA results for PR 1912:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class CompressedSerializer(FramedSerializer):

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18460/consoleFull

@frol
Copy link
Contributor

frol commented Aug 13, 2014

@davies Compression improved things, but my tasks have heavy computations inside, so it saved only 10 seconds on a 4.5-minute task and also about 10-20 seconds on a 18-minute task. In both cases I have only 340 partitions.

I'm still investigating where the second copy of my fat object is, because I can easily notice that in comparison with my local tests. And also if I cut my big object twice, the memory consumption decreases as it would be cut 4 times on local run.

@davies
Copy link
Contributor Author

davies commented Aug 13, 2014

@frol , The big win of compression maybe save the memory in JVM. It's also a win if it does not increase the runtime. If the future, we could try LZ4, it may help a little bit about runtime, but will not contribute much in your case.

What is the memory you are talking about? in Python driver, JVM, or Python worker?

@frol
Copy link
Contributor

frol commented Aug 13, 2014

@davies I'm talking about memory in Python workers and it is my issue. (I figured out that my local test had a mistake and after I fix it local test and Spark Python workers consume the same amount of memory). I'm sorry to confuse you.

@JoshRosen
Copy link
Contributor

@frol After fixing your local test, are you still noticing any broadcast performance issues? If you still see any odd behavior, could you post a small script or set of pyspark shell commands so we can test it out?

@frol
Copy link
Contributor

frol commented Aug 14, 2014

@JoshRosen No, I'm not noticing any broadcast performance issues now. PySpark works like a charm again. Thank you!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: should be spelled "accessible." Also, maybe the error message could be a little clearer about how broadcast variables are created and why call failed. I'm thinking of something like "please call sc.broadcast() with keep=True to make values accessible in the driver".

@JoshRosen
Copy link
Contributor

It occurs to me: what if we had .value retrieve and depickle the value from the JVM? Also, won't we still experience memory leaks in the JVM if we iteratively create broadcast variables, since we will never clean up those pickled values?

One approach is to have .value() depickle the JVM value (so we're not changing the user-facing API) and add a Python equivalent of Broadcast.destroy() for performing permanent cleanup of a broadcast's resources. What do you think of this approach?

@SparkQA
Copy link

SparkQA commented Aug 15, 2014

QA tests have started for PR 1912 at commit e06df4a.

  • This patch merges cleanly.

@davies
Copy link
Contributor Author

davies commented Aug 15, 2014

I had add Broadcast.unpersist(blocking=False).

Because we have an copy in disks, so read it from there when user want to access it driver, then we can keep the SparkContext.broadcast() unchanged.

@JoshRosen
Copy link
Contributor

Hmm, looks like this was affected by the Jenkins timeouts last night.

Jenkins, retest this please.

@JoshRosen
Copy link
Contributor

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Aug 16, 2014

QA tests have started for PR 1912 at commit e06df4a.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 16, 2014

QA tests have finished for PR 1912 at commit e06df4a.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait TaskCompletionListener extends EventListener
    • class AvroWrapperToJavaConverter extends Converter[Any, Any]
    • class CompressedSerializer(FramedSerializer):

@JoshRosen
Copy link
Contributor

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Aug 16, 2014

QA tests have started for PR 1912 at commit e06df4a.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 16, 2014

QA tests have finished for PR 1912 at commit e06df4a.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class Serializer
    • abstract class SerializerInstance
    • abstract class SerializationStream
    • abstract class DeserializationStream
    • class CompressedSerializer(FramedSerializer):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a docstring? It's fine to just copy it over from the Scala equivalent. In this case:

  /**
   * Delete cached copies of this broadcast on the executors. If the broadcast is used after
   * this is called, it will need to be re-sent to each executor.
   * @param blocking Whether to block until unpersisting has completed
   */

@JoshRosen
Copy link
Contributor

I guess we don't necessarily want to expose destroy() to the end-user, since it's private in the Scala APIs. I suppose we might still be leaking broadcast variables in the driver's JVM, but I think that's a problem that affects Scala/Java jobs as well, so maybe we can address it more generally in a separate PR.

@JoshRosen
Copy link
Contributor

Actually, I'm just going to merge this now and I'll add the docstring as part of a subsequent documentation-improvement PR (I also want to edit some Scala / Java docs, too).

@JoshRosen
Copy link
Contributor

I've merged this into master and branch-1.1. Thanks!

asfgit pushed a commit that referenced this pull request Aug 17, 2014
Passing large object by py4j is very slow (cost much memory), so pass broadcast objects via files (similar to parallelize()).

Add an option to keep object in driver (it's False by default) to save memory in driver.

Author: Davies Liu <[email protected]>

Closes #1912 from davies/broadcast and squashes the following commits:

e06df4a [Davies Liu] load broadcast from disk in driver automatically
db3f232 [Davies Liu] fix serialization of accumulator
631a827 [Davies Liu] Merge branch 'master' into broadcast
c7baa8c [Davies Liu] compress serrialized broadcast and command
9a7161f [Davies Liu] fix doc tests
e93cf4b [Davies Liu] address comments: add test
6226189 [Davies Liu] improve large broadcast

(cherry picked from commit 2fc8aca)
Signed-off-by: Josh Rosen <[email protected]>
@asfgit asfgit closed this in 2fc8aca Aug 17, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
Passing large object by py4j is very slow (cost much memory), so pass broadcast objects via files (similar to parallelize()).

Add an option to keep object in driver (it's False by default) to save memory in driver.

Author: Davies Liu <[email protected]>

Closes apache#1912 from davies/broadcast and squashes the following commits:

e06df4a [Davies Liu] load broadcast from disk in driver automatically
db3f232 [Davies Liu] fix serialization of accumulator
631a827 [Davies Liu] Merge branch 'master' into broadcast
c7baa8c [Davies Liu] compress serrialized broadcast and command
9a7161f [Davies Liu] fix doc tests
e93cf4b [Davies Liu] address comments: add test
6226189 [Davies Liu] improve large broadcast
@davies davies deleted the broadcast branch September 15, 2014 22:16
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Mar 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants