[SPARK-21799][ML] Fix KMeans performance regression caused by double-caching
#19107
cc @jkbradley @smurching

Test build #81337 has finished for PR 19107 at commit

cc @smurching Thanks!

Sorry for the delay, this looks good to me -- thanks @WeichenXu123!

@jkbradley would you be able to give this a look? Thanks!

@WeichenXu123 I just commented on https://issues.apache.org/jira/browse/SPARK-18608 to clarify our efforts here. Can you please either retarget this for SPARK-18608 and update it, or ask @zhengruifeng to submit his original PR as the fix? Please coordinate, thanks!

I am OK to resubmit the original PR if needed.

OK. Thanks @zhengruifeng. I will close this PR.

## What changes were proposed in this pull request?
`df.rdd.getStorageLevel` => `df.storageLevel`
Used the command `find . -name '*.scala' | xargs -i bash -c 'egrep -in "\.rdd\.getStorageLevel" {} && echo {}'` to make sure all algorithms affected by this issue are fixed.
Previous discussion in other PRs: #19107, #17014
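
To illustrate why the replacement matters, here is a minimal sketch, not the code from the patch. It assumes Spark 2.1 or later (where `Dataset.storageLevel` is available) and a local session; the object name `StorageLevelCheck` and the helper `needsInternalCaching` are hypothetical names used only for this example. `df.rdd` builds a new RDD on each call, so its storage level does not reflect whether the DataFrame itself was cached, which is how an algorithm can end up caching the input a second time.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.storage.StorageLevel

object StorageLevelCheck {

  // Hypothetical helper: decide whether an algorithm should persist its
  // input itself. It asks the Dataset for its storage level instead of
  // going through dataset.rdd.
  def needsInternalCaching(dataset: Dataset[_]): Boolean =
    dataset.storageLevel == StorageLevel.NONE

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("storage-level-check")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("x", "y")
    df.cache() // the caller caches the input up front

    // Old check: df.rdd is a freshly derived RDD, so this is expected to
    // report no persistence even though df is cached.
    println(s"df.rdd.getStorageLevel: ${df.rdd.getStorageLevel}")

    // New check: reflects the cache registered for the DataFrame itself.
    println(s"df.storageLevel:        ${df.storageLevel}")
    println(s"needsInternalCaching:   ${needsInternalCaching(df)}")

    spark.stop()
  }
}
```

With the old check, an algorithm that persists its input whenever the level looks like `NONE` would cache data the caller had already cached; the new check avoids that.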
## How was this patch tested?
existing tests
Author: Zheng RuiFeng <[email protected]>
Closes #19197 from zhengruifeng/double_caching.
(cherry picked from commit c5f9b89)
Signed-off-by: Joseph K. Bradley <[email protected]>
## What changes were proposed in this pull request?
Fix `KMeans` performance regression caused by double-caching input dataset.
## How was this patch tested?
N/A