
Conversation

@zhengruifeng (Contributor) commented Sep 17, 2022

What changes were proposed in this pull request?

Implement a new expression, CollectTopK, which uses an Array instead of a BoundedPriorityQueue in ser/deser.
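
To illustrate the idea, here is a minimal, hypothetical sketch only (the real CollectTopK is a Catalyst aggregate expression, not an Aggregator, and the class and field names below are invented): an aggregation whose shuffled partial state is just an Array of the current top-k pairs, rather than a serialized BoundedPriorityQueue object.

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical sketch only: a typed top-k aggregator whose buffer is a plain Array,
// so the partial state exchanged during the shuffle is just the k (id, score) pairs.
class TopKByScore(k: Int) extends Aggregator[(Int, Float), Array[(Int, Float)], Array[(Int, Float)]] {
  def zero: Array[(Int, Float)] = Array.empty[(Int, Float)]

  def reduce(buf: Array[(Int, Float)], v: (Int, Float)): Array[(Int, Float)] =
    (buf :+ v).sortBy(_._2)(Ordering[Float].reverse).take(k)  // keep only the k highest scores (not tuned for speed)

  def merge(a: Array[(Int, Float)], b: Array[(Int, Float)]): Array[(Int, Float)] =
    (a ++ b).sortBy(_._2)(Ordering[Float].reverse).take(k)

  def finish(buf: Array[(Int, Float)]): Array[(Int, Float)] = buf

  def bufferEncoder: Encoder[Array[(Int, Float)]] = Encoders.kryo[Array[(Int, Float)]]
  def outputEncoder: Encoder[Array[(Int, Float)]] = Encoders.kryo[Array[(Int, Float)]]
}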

Why are the changes needed?

Reduce the shuffle size of ALS in prediction

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing test suites.

@zhengruifeng (Contributor, Author)

Take ALSExample as an example:

import org.apache.spark.ml.recommendation._

case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)

def parseRating(str: String): Rating = {
    val fields = str.split("::")
    assert(fields.size == 4)
    Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
}

val ratings = spark.read.textFile("data/mllib/als/sample_movielens_ratings.txt").map(parseRating).toDF()

val als = new ALS().setMaxIter(1).setRegParam(0.01).setUserCol("userId").setItemCol("movieId").setRatingCol("rating")

val model = als.fit(ratings)

// top-10 user recommendations for every item; the per-item top-k aggregation here is what drives the shuffle
model.recommendForAllItems(10).collect()

before: [screenshot of the job's shuffle metrics]

after: [screenshot of the job's shuffle metrics]

The shuffle size in this case was reduced from 298.4 KiB to 130.3 KiB.

@zhengruifeng zhengruifeng changed the title [SPARK-40476][ML][SQL] Reduce the shuffle size of ALS [SPARK-40476][ML][SQL] Reduce the shuffle size of ALS.recommend Sep 17, 2022
@zhengruifeng zhengruifeng changed the title [SPARK-40476][ML][SQL] Reduce the shuffle size of ALS.recommend [SPARK-40476][ML][SQL] Reduce the shuffle size of ALS Sep 17, 2022
@dongjoon-hyun (Member) left a comment

Hi, @zhengruifeng.

  • If you don't mind, could you make an independent PR moving TopByKeyAggregator to CollectTopK, since that is orthogonal to "Reduce the shuffle size of ALS"?
  • In addition, we need test coverage for CollectTopK because we are removing TopByKeyAggregatorSuite.

@zhengruifeng (Contributor, Author)

@dongjoon-hyun

could you make an independent PR moving TopByKeyAggregator to CollectTopK, since that is orthogonal to "Reduce the shuffle size of ALS"?

It is precisely the move from TopByKeyAggregator to CollectTopK that reduces the shuffle size, since the ser/deser is optimized in CollectTopK. Let me update the PR description.

In addition, we need test coverage for CollectTopK because we are removing TopByKeyAggregatorSuite.

Sure, will update soon.

@dongjoon-hyun (Member)

Thanks. If the PR title is clear, +1 for that.

@zhengruifeng (Contributor, Author)

cc @srowen @WeichenXu123

Contributor

The naming is not clear. Why not call it collect_top_k?

And we can add it to spark.sql.functions as collect_top_k.

Comment on lines 504 to 505

@WeichenXu123 (Contributor) Sep 19, 2022

I think we can define a Spark SQL function and wrap this part within it, like:

def collect_top_k(ratingColumn: Column, outputColumn: Column, num: Int) =
  CollectOrdered(struct(ratingColumn, outputColumn).expr, num, true).toAggregateExpression(false)

Contributor Author

Sure, I think we don't want to expose it, so let's mark it private[spark].

*/
def collect_set(columnName: String): Column = collect_set(Column(columnName))

private[spark] def collect_top_k(e: Column, num: Int, reverse: Boolean): Column =
Contributor

nit: shall we make it public? It might be a useful function.

We don't need to do it in this PR.

Contributor Author

I'm not sure; I also think it's useful and may further use it in the pandas API on Spark.
But I don't know whether it is suitable to make public. @cloud-fan @HyukjinKwon

Member

Let's keep it private for now.
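
For context, a self-contained sketch of the semantics involved, built only from public functions (the data and column names are illustrative, not the ALS code path); the private collect_top_k expression is intended to produce the same kind of per-group top-k array while keeping at most k elements of partial state, which is what shrinks the shuffle:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, slice, sort_array, struct}

// Hypothetical, public-API illustration of per-group top-k (not the ALS internals):
// for each srcId, keep the k largest (rating, dstId) structs.
object TopKSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("topk-sketch").getOrCreate()
    import spark.implicits._

    val k = 2
    val ratings = Seq((1, 10, 0.9f), (1, 11, 0.7f), (1, 12, 0.8f), (2, 10, 0.5f), (2, 13, 0.6f))
      .toDF("srcId", "dstId", "rating")

    val topK = ratings
      .groupBy($"srcId")
      .agg(slice(sort_array(collect_list(struct($"rating", $"dstId")), asc = false), 1, k)
        .as("recommendations"))

    topK.show(truncate = false)
    spark.stop()
  }
}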

@srowen (Member) commented Sep 22, 2022

Merged to master

@srowen srowen closed this in 0867845 Sep 22, 2022
@zhengruifeng zhengruifeng deleted the sql_collect_topk branch September 23, 2022 00:04
@zhengruifeng (Contributor, Author)

Thanks for the reviews!

@WeichenXu123 (Contributor)

Thanks! :)

srowen pushed a commit that referenced this pull request Oct 11, 2022
### What changes were proposed in this pull request?
Reduce the shuffle size of ALS by using `Array[V]` instead of `BoundedPriorityQueue[V]` in ser/deser
This is the corresponding change to #37918 on the `.mllib` side.

### Why are the changes needed?
Reduce the shuffle size of ALS

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests.

Closes #38203 from zhengruifeng/ml_topbykey.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
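
One hypothetical way to realize the "Array instead of BoundedPriorityQueue in ser/deser" idea on the Java-serialization path (a sketch with invented names, not the actual `.mllib` patch): write only the queue's elements during serialization and rebuild the heap on read, so the shuffled bytes are essentially just an array of the kept values.

import java.io.{ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable

// Hypothetical illustration (names invented for this sketch, not the actual patch):
// keep the maxSize largest elements; on Java serialization, write only the elements,
// not the queue's internal structure, and rebuild the heap on deserialization.
class BoundedTopK[T](maxSize: Int)(implicit ord: Ordering[T]) extends Serializable {
  // the head of this queue is the smallest kept element under `ord`
  @transient private var heap = mutable.PriorityQueue.empty[T](ord.reverse)

  def add(elem: T): this.type = {
    if (heap.size < maxSize) heap += elem
    else if (ord.gt(elem, heap.head)) { heap.dequeue(); heap += elem }
    this
  }

  def iterator: Iterator[T] = heap.iterator

  // Write only the payload; assumes T and `ord` are themselves serializable.
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    out.writeInt(heap.size)
    heap.foreach(e => out.writeObject(e.asInstanceOf[AnyRef]))
  }

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    heap = mutable.PriorityQueue.empty[T](ord.reverse)
    var n = in.readInt()
    while (n > 0) { heap += in.readObject().asInstanceOf[T]; n -= 1 }
  }
}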