
Conversation

@davies (Contributor) commented Jan 13, 2015

After the default batchSize changed to 0 (batching based on object size), parallelize() still used BatchedSerializer with batchSize=1. This PR makes parallelize() use batchSize=1024 by default.

Also, BatchedSerializer did not work well with list and numpy.ndarray; this PR improves BatchedSerializer by using `__len__` and `__getslice__`.
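The idea behind the `__len__`/`__getslice__` change can be sketched as follows. This is a minimal illustration, not the actual `BatchedSerializer` code from `pyspark/serializers.py`; `batch_by_slicing` and `batch_by_iterating` are hypothetical helper names. The point is that for sequences that support `len()` and slicing (list, numpy.ndarray), each batch is a single slice operation instead of batchSize per-item appends:

```python
from itertools import islice

def batch_by_slicing(seq, batch_size=1024):
    """Batch an indexable sequence (list, numpy.ndarray, ...) using
    len() and slicing, so each batch is one slice operation."""
    n = len(seq)
    for start in range(0, n, batch_size):
        yield seq[start:start + batch_size]

def batch_by_iterating(iterator, batch_size=1024):
    """Fallback for plain iterators that support neither len() nor
    slicing: pull items one by one into a list per batch."""
    iterator = iter(iterator)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Slicing an ndarray also returns a view rather than copying elements into Python objects one at a time, which is where most of the ndarray speedup in the benchmark below would come from.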

Here is the benchmark for parallelizing 1 million ints as a list or an ndarray:

|               | before | after | improvement |
|---------------|--------|-------|-------------|
| list          | 11.7 s | 0.8 s | 14x         |
| numpy.ndarray | 32 s   | 0.7 s | 40x         |

@SparkQA commented Jan 13, 2015

Test build #25479 has started for PR 4024 at commit 7618c7c.

  • This patch merges cleanly.

@davies davies changed the title [SPARK-5224] improve performance of parallelize list/ndarray [SPARK-5224] [PySpark] improve performance of parallelize list/ndarray Jan 13, 2015
@SparkQA commented Jan 13, 2015

Test build #25479 has finished for PR 4024 at commit 7618c7c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25479/

@SparkQA commented Jan 13, 2015

Test build #564 has started for PR 4024 at commit 7618c7c.

  • This patch merges cleanly.

@SparkQA commented Jan 13, 2015

Test build #564 has finished for PR 4024 at commit 7618c7c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • SparkSubmit.printErrorAndExit(s"Cannot load main class from JAR $primaryResource")
    • class BinaryClassificationMetrics(

@davies (Contributor, Author) commented Jan 15, 2015

@JoshRosen ping!

@JoshRosen (Contributor)

LGTM, so I'm going to merge this into master (1.3.0) and branch-1.2 (1.2.1). Thanks!

asfgit pushed a commit that referenced this pull request Jan 15, 2015
After the default batchSize changed to 0 (batching based on object size), parallelize() still used BatchedSerializer with batchSize=1. This PR makes parallelize() use batchSize=1024 by default.

Also, BatchedSerializer did not work well with list and numpy.ndarray; this PR improves BatchedSerializer by using `__len__` and `__getslice__`.

Here is the benchmark for parallelizing 1 million ints as a list or an ndarray:

|               | before | after | improvement |
|---------------|--------|-------|-------------|
| list          | 11.7 s | 0.8 s | 14x         |
| numpy.ndarray | 32 s   | 0.7 s | 40x         |

Author: Davies Liu <[email protected]>

Closes #4024 from davies/opt_numpy and squashes the following commits:

7618c7c [Davies Liu] improve performance of parallelize list/ndarray

(cherry picked from commit 3c8650c)
Signed-off-by: Josh Rosen <[email protected]>
@asfgit asfgit closed this in 3c8650c Jan 15, 2015