Skip to content

Conversation

@zhengruifeng
Copy link
Contributor

@zhengruifeng zhengruifeng commented Oct 11, 2022

What changes were proposed in this pull request?

Reduce the shuffle size of ALS by using Array[V] instead of BoundedPriorityQueue[V] in ser/deser
this is a corresponding change of #37918 on the .mllib side

Why are the changes needed?

Reduce the shuffle size of ALS

Does this PR introduce any user-facing change?

No

How was this patch tested?

existing UT

@zhengruifeng zhengruifeng changed the title [SPARK-40745][MLLIB] Reduce the shuffle size of ALS in mllib [SPARK-40745][MLLIB] Reduce the shuffle size of ALS in .mllib Oct 11, 2022
@github-actions github-actions bot added the MLLIB label Oct 11, 2022
@zhengruifeng
Copy link
Contributor Author

take RankingMetricsExample for example:

import org.apache.spark.mllib.evaluation.{RankingMetrics, RegressionMetrics}
import org.apache.spark.mllib.recommendation.{ALS, Rating}


val ratings = spark.read.textFile("data/mllib/sample_movielens_data.txt").rdd.map { line =>
    val fields = line.split("::")
    Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble - 2.5)
}.cache()

// Map ratings to 1 or 0, 1 indicating a movie that should be recommended
val binarizedRatings = ratings.map(r => Rating(r.user, r.product, if (r.rating > 0) 1.0 else 0.0)).cache()


val numIterations = 1
val rank = 10
val lambda = 0.01
val model = ALS.train(ratings, rank, numIterations, lambda)

val userRecommended = model.recommendProductsForUsers(10)

userRecommended.count()

before:
image

after:
image

the shuffle size was reduced from 310.1 KiB to 110.9 KiB

@zhengruifeng
Copy link
Contributor Author

cc @srowen @WeichenXu123

@srowen
Copy link
Member

srowen commented Oct 11, 2022

Seems fine. Do you have one sentence about what the optimization is here - why is it smaller?

@zhengruifeng
Copy link
Contributor Author

@srowen the new implementation uses Array[V] instead of BoundedPriorityQueue[V] in ser/deser

@srowen srowen closed this in 6bbf4f5 Oct 11, 2022
@srowen
Copy link
Member

srowen commented Oct 11, 2022

Merged to master

@zhengruifeng zhengruifeng deleted the ml_topbykey branch October 12, 2022 02:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants