[SPARK-15784][ML]:Add Power Iteration Clustering to spark.ml #15770
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -170,24 +170,13 @@ class PowerIterationClustering private[clustering] ( | |
| @Since("2.3.0") | ||
| override def transform(dataset: Dataset[_]): DataFrame = { | ||
| val sparkSession = dataset.sparkSession | ||
| /* | ||
| val rdd: RDD[(Long, Long, Double)] = | ||
| dataset.select(col($(idCol)), col($(neighborCol)), col($(weightCol))).rdd.map { | ||
| case Row(id: Long, nbr: Vector, weight: Vector) => (id, nbr, weight) | ||
| }.flatMap{ case (id, nbr, weight) => | ||
| require(nbr.size == weight.size, | ||
| "The length of neighbor list must be equal to the the length of the weight list.") | ||
| val ids = Array.fill(nbr.size)(id) | ||
| ids.zip(nbr.toArray).zip(weight.toArray)}.map {case ((i, j), k) => (i, j.toLong, k)} | ||
| */ | ||
| val rdd: RDD[(Long, Long, Double)] = | ||
| dataset.select(col($(idCol)), col($(neighborCol)), col($(weightCol))).rdd.flatMap { | ||
| case Row(id: Long, nbr: Vector, weight: Vector) => | ||
| require(nbr.size == weight.size, | ||
| "The length of neighbor list must be equal to the the length of the weight list.") | ||
| val ids = Array.fill(nbr.size)(id) | ||
| for (i <- 0 until ids.size) yield (ids(i), nbr(i).toLong, weight(i))} | ||
|
||
|
|
||
| val algorithm = new MLlibPowerIterationClustering() | ||
| .setK($(k)) | ||
| .setInitializationMode($(initMode)) | ||
|
|
||
PIC requires the input graph matrix to be symmetric and the weights to be non-negative. It would be better to check both here, but checking symmetry seems too costly, and I have no good idea for now. cc @jkbradley Do you have any thoughts?
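To illustrate the trade-off raised in this comment, here is a minimal sketch in plain Scala, without Spark. The object `EdgeChecks` and both of its methods are hypothetical names that are not part of this PR. Non-negativity can be verified row by row while the edges are being built, whereas symmetry needs a lookup of the reverse edge for every edge, which on a distributed dataset would mean a shuffle or a join.

```scala
// Hypothetical helpers for illustration only; not code from this PR.
object EdgeChecks {

  // Cheap, per-row check: every weight attached to vertex `id` must be >= 0.
  // NaN also fails this test because (NaN >= 0.0) is false.
  def requireNonNegative(id: Long, weights: Array[Double]): Unit =
    require(weights.forall(_ >= 0.0),
      s"Weights must be non-negative, but vertex $id has an invalid weight.")

  // Global check: for each edge (i, j, w) the reverse edge (j, i, w) must exist.
  // Here it is an in-memory map lookup; at cluster scale this is the costly part.
  def isSymmetric(edges: Seq[(Long, Long, Double)]): Boolean = {
    val weightOf = edges.map { case (i, j, w) => ((i, j), w) }.toMap
    weightOf.forall { case ((i, j), w) => weightOf.get((j, i)).contains(w) }
  }
}
```

The per-row check fits naturally next to the existing `require` on list lengths, while the symmetry check is the kind of work the thread is reluctant to add.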
I think checking symmetry is too expensive for PIC in this data format. Maybe we can omit the check, add a comment, and log an INFO message to the console so that users take care of it. @WeichenXu123
OK I agree.
I agree about not checking for symmetry as long as we document it.
But I do have one suggestion: Let's take neighbors and weights as Arrays, not Vectors. That may help prevent users from mistakenly passing in feature Vectors.
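As a rough sketch of what the Array-based input could look like, the plain-Scala example below flattens adjacency rows into `(src, dst, weight)` triples. The names `ArrayInput`, `AdjacencyRow`, `flatten`, and `viaDouble` are hypothetical and chosen for illustration; the final column types are whatever the PR settles on. Besides making it harder to pass a feature Vector by mistake, typed `Long` neighbor IDs avoid the round trip through `Double` that `nbr(i).toLong` performs in the diff above, which loses precision for IDs larger than 2^53.

```scala
// Hypothetical types and helpers for illustration only; not code from this PR.
object ArrayInput {

  // One input row: a vertex, its neighbor IDs, and the matching edge weights.
  final case class AdjacencyRow(id: Long, neighbors: Array[Long], weights: Array[Double])

  // Expand each row into (src, dst, weight) triples, checking list lengths.
  def flatten(rows: Seq[AdjacencyRow]): Seq[(Long, Long, Double)] =
    rows.flatMap { r =>
      require(r.neighbors.length == r.weights.length,
        "The neighbor list and the weight list must have the same length.")
      r.neighbors.zip(r.weights).map { case (j, w) => (r.id, j, w) }.toSeq
    }

  // What happens to an ID that is stored in a Vector (as a Double) and read back.
  def viaDouble(id: Long): Long = id.toDouble.toLong
}
```

For example, `ArrayInput.viaDouble(9007199254740993L)` returns `9007199254740992L`, so two distinct vertex IDs would collapse into one if they passed through a Vector.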