@WeichenXu123
Contributor

What changes were proposed in this pull request?

Fix KMeans performance regression caused by double-caching input dataset.

How was this patch tested?

N/A

@WeichenXu123
Contributor Author

cc @jkbradley @smurching
This should be merged and backported to 2.2 ASAP!
Other improvements (e.g., adding a handlePersistence param) can be left to PR #17014.

@SparkQA

SparkQA commented Sep 2, 2017

Test build #81337 has finished for PR 19107 at commit ea06225.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

cc @smurching Thanks!

@smurching
Contributor

Sorry for the delay, this looks good to me -- thanks @WeichenXu123!

@smurching
Contributor

@jkbradley would you be able to give this a look? Thanks!

@jkbradley
Member

@WeichenXu123 I just commented on https://issues.apache.org/jira/browse/SPARK-18608 to clarify our efforts here. Can you please either retarget this for SPARK-18608 and update it, or ask @zhengruifeng to submit his original PR as the fix? Please coordinate, thanks!

@zhengruifeng
Contributor

I am OK with resubmitting the original PR if needed.

@WeichenXu123
Contributor Author

OK. Thanks @zhengruifeng. I will close this PR.

@WeichenXu123 deleted the fix_kmeans_perf_regression branch on September 12, 2017 04:29
asfgit pushed a commit that referenced this pull request Sep 12, 2017
## What changes were proposed in this pull request?
`df.rdd.getStorageLevel` => `df.storageLevel`

Used the command `find . -name '*.scala' | xargs -i bash -c 'egrep -in "\.rdd\.getStorageLevel" {} && echo {}'` to make sure all algorithms involved in this issue are fixed.

Previous discussion in other PRs: #19107, #17014

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <[email protected]>

Closes #19197 from zhengruifeng/double_caching.

(cherry picked from commit c5f9b89)
Signed-off-by: Joseph K. Bradley <[email protected]>
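
For readers new to the issue, here is a minimal sketch of the two checks discussed in this thread. It assumes Spark 2.1 or later on the classpath (`Dataset.storageLevel` was added in 2.1) and a local master. The object name `DoubleCachingSketch` and the helper `shouldPersist` are hypothetical and are not part of Spark or of this patch. The premise of the fix is that the RDD obtained via `df.rdd` does not report the DataFrame's own cache status, so an algorithm relying on it may cache already-cached input a second time.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.storage.StorageLevel

object DoubleCachingSketch {
  // Hypothetical helper: an algorithm should persist its own intermediate
  // data only when the caller has not already cached the input DataFrame.
  def shouldPersist(df: DataFrame): Boolean =
    df.storageLevel == StorageLevel.NONE

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("double-caching-sketch")
      .getOrCreate()

    // The caller caches the input before handing it to an algorithm.
    val df = spark.range(0, 1000).toDF("id")
    df.persist(StorageLevel.MEMORY_AND_DISK)

    // Old check: inspects the RDD derived from the DataFrame rather than
    // the DataFrame's own cache registration.
    println(s"df.rdd.getStorageLevel = ${df.rdd.getStorageLevel}")

    // New check: asks the Dataset for its own storage level.
    println(s"df.storageLevel        = ${df.storageLevel}")

    println(s"shouldPersist(df)      = ${shouldPersist(df)}")

    df.unpersist()
    spark.stop()
  }
}
```

Comparing the two printed storage levels on a cached DataFrame shows why the check matters: a persistence decision based on the first value can disagree with one based on the second.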
ghost pushed a commit to dbtsai/spark that referenced this pull request Sep 12, 2017
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018