@BertrandDechoux
Contributor

First step done by @tex0l ;)

Contributor

does it compile?

Contributor Author

Indeed not. The diff I received must have been a work in progress. This is now fixed.

Indeed it was... Sorry about that.

2015-09-21 14:01 GMT-07:00 Bertrand Dechoux [email protected]:

In
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala
#8849 (comment):

@@ -77,6 +77,22 @@ class KMeansModel @Since("1.1.0") (@Since("1.0.0") val clusterCenters: Array[Vec
   def predict(points: JavaRDD[Vector]): JavaRDD[java.lang.Integer] =
     predict(points.rdd).toJavaRDD().asInstanceOf[JavaRDD[java.lang.Integer]]

+  /** */
+  def distanceToCenters(point: Vector): (Int, Double) = {

Timothée Rebours

@mengxr
Contributor

mengxr commented Sep 21, 2015

ok to test

@BertrandDechoux
Contributor Author

ERROR: Timeout after 15 minutes
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from https://github.com/apache/spark.git

Somehow I am glad that it does not only happen to my projects.

The patch is still basic and needs a few changes:

  • cluster index instead of cluster location?
  • Since annotation (1.5.0?)

but the tests shouldn't fail.

@SparkQA

SparkQA commented Sep 21, 2015

Test build #42773 has finished for PR 8849 at commit 7a746d1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 21, 2015

Test build #42775 has finished for PR 8849 at commit beed882.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BertrandDechoux
Contributor Author

I think I still need to change the method signature.

predict on an RDD has the following signature:

def predict(points: RDD[Vector]): RDD[Int]

distanceToCenters on an RDD has the following signature:

def distanceToCenters(points: RDD[Vector]): RDD[(Vector, Iterable[(VectorWithNorm, Double)])]

I am more comfortable with having the input point in the output, but predict does not work like that.
If you say the predict template should be followed, I will change the method signature.
But having to rely on ordering to match the points with their results does not feel natural to me.

The second aspect is that I may want to output cluster indices instead of their locations.
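
The two output shapes discussed above can be compared in a minimal sketch. It deliberately avoids Spark: `Point`, `SignatureSketch`, and both functions are hypothetical stand-ins that operate on plain Scala sequences rather than `RDD[Vector]`, and the pair-style version returns cluster indices rather than center locations, as considered above. It is not the code from the patch.

```scala
object SignatureSketch {
  // Assumption: a point is a plain sequence of coordinates, not an MLlib Vector.
  type Point = Seq[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // predict-style: one cluster index per input point, matched only by position.
  def predict(centers: Seq[Point], points: Seq[Point]): Seq[Int] =
    points.map(p => centers.indices.minBy(i => squaredDistance(centers(i), p)))

  // Pair-style: each input point travels with its (cluster index, distance) list,
  // so the caller never has to rely on ordering to match inputs and outputs.
  def distanceToCenters(
      centers: Seq[Point],
      points: Seq[Point]): Seq[(Point, Seq[(Int, Double)])] =
    points.map { p =>
      val distances = centers.zipWithIndex.map { case (c, i) =>
        (i, math.sqrt(squaredDistance(c, p)))
      }
      (p, distances)
    }
}
```

With the predict-style shape, the caller has to zip the result back onto the input to know which index belongs to which point; the pair-style shape makes that association explicit at the cost of a larger result.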

I would be glad to have your point of view @mengxr.

@SparkQA

SparkQA commented Sep 22, 2015

Test build #42856 has finished for PR 8849 at commit a193011.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 22, 2015

Test build #42854 has finished for PR 8849 at commit d97d8f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BertrandDechoux
Contributor Author

  • Both methods now return the cluster indices.
  • I have added a toList call to make sure there is no laziness involved, but that may not be necessary.
  • Still need feedback on the return type of
def distanceToCenters(points: RDD[Vector]): RDD[(Vector, Iterable[(VectorWithNorm, Double)])]
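
The thread does not say where laziness could come from, so the following is only an illustration of the general concern behind the toList call. It assumes the distances are produced through a lazy collection view; `LazinessSketch` and its members are hypothetical names, and points are reduced to single `Double` values for brevity.

```scala
object LazinessSketch {
  // Counts how many distances have actually been computed.
  var evaluations = 0

  def distance(a: Double, b: Double): Double = {
    evaluations += 1
    math.abs(a - b)
  }

  val centers: Seq[Double] = Seq(0.0, 5.0, 10.0)

  // Lazy: nothing is computed until the view is traversed,
  // and every traversal recomputes all distances.
  def lazyDistances(point: Double): Iterable[(Int, Double)] =
    centers.zipWithIndex.view.map { case (c, i) => (i, distance(c, point)) }

  // Strict: toList forces the computation exactly once.
  def strictDistances(point: Double): Iterable[(Int, Double)] =
    lazyDistances(point).toList
}
```

If the underlying collection is already strict, the extra toList only adds a copy, which is consistent with the doubt expressed above about whether it is needed.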

@yu-iskw
Contributor

yu-iskw commented Nov 2, 2015

Jenkins, test this please

@SparkQA

SparkQA commented Nov 2, 2015

Test build #44797 has finished for PR 8849 at commit 159608a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yu-iskw
Contributor

yu-iskw commented Nov 2, 2015

@BertrandDechoux thank you for the update. Sounds interesting, but I don't think we should calculate distances between a point and all centers. Personally, I think we should calculate the distance between a point and the closest center.

And I'm wondering if the method should return the pair of cluster index and distance or only distance.
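
The alternative suggested here, computing only the distance to the closest center, could look roughly like the sketch below. It is written against plain Scala sequences rather than MLlib types; `ClosestCenterSketch`, `Point`, `closest`, and `distanceToClosest` are hypothetical names. The two functions correspond to the two return shapes in question: the (cluster index, distance) pair and the distance alone.

```scala
object ClosestCenterSketch {
  // Assumption: a point is a plain sequence of coordinates, not an MLlib Vector.
  type Point = Seq[Double]

  def distance(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Returns (index of the closest center, Euclidean distance to it).
  def closest(centers: Seq[Point], point: Point): (Int, Double) =
    centers.zipWithIndex
      .map { case (c, i) => (i, distance(c, point)) }
      .minBy(_._2)

  // Distance-only variant: drops the cluster index.
  def distanceToClosest(centers: Seq[Point], point: Point): Double =
    closest(centers, point)._2
}
```

Returning the pair costs nothing extra, since the index is known once the minimum is found, whereas the distance-only variant forces a separate predict call to recover the cluster.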

@mengxr @jkbradley what do you think?

@BertrandDechoux
Contributor Author

In a perfect world, each point belongs to a specific cluster and the number of clusters is easy to find. In reality, it is less clear-cut. Knowing the distance is a way to gauge how close a point is to a cluster.

K-means can be thought of as a special case of a mixture model. When using a mixture model, the influence of each 'cluster' on a specific point is important information. I think the same holds true for K-means.

But, in the end, it does depend on the context in which, and how, you are using K-means.
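
One way to read the mixture-model remark is that distances to all centers can be turned into soft assignments. The sketch below assumes equal-weight spherical Gaussians with a shared variance `sigma2`, which is one common way of relating K-means to mixture models; that assumption, and the names `SoftAssignmentSketch` and `responsibilities`, are not from the thread.

```scala
object SoftAssignmentSketch {
  // Assumption: a point is a plain sequence of coordinates, not an MLlib Vector.
  type Point = Seq[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Responsibility of each center for the point, assuming equal-weight
  // spherical Gaussians with shared variance sigma2 around each center.
  def responsibilities(centers: Seq[Point], point: Point, sigma2: Double): Seq[Double] = {
    val logits = centers.map(c => -squaredDistance(c, point) / (2.0 * sigma2))
    val maxLogit = logits.max // subtracted before exp for numerical stability
    val weights = logits.map(l => math.exp(l - maxLogit))
    val total = weights.sum
    weights.map(_ / total)
  }
}
```

A point midway between two centers gets an equal share from each, and as `sigma2` shrinks the result approaches the hard assignment that predict returns. Computing this requires the distances to every center, not only the closest one.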

@AmplabJenkins

Can one of the admins verify this patch?

@srowen
Member

srowen commented Jul 3, 2018

I think this should just be closed. I don't think there's enough value in adding this API now.

@srowen srowen mentioned this pull request Jul 3, 2018
@asfgit asfgit closed this in 5bf95f2 Jul 4, 2018
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
Closes apache#20932
Closes apache#17843
Closes apache#13477
Closes apache#14291
Closes apache#20919
Closes apache#17907
Closes apache#18766
Closes apache#20809
Closes apache#8849
Closes apache#21076
Closes apache#21507
Closes apache#21336
Closes apache#21681
Closes apache#21691

Author: Sean Owen <[email protected]>

Closes apache#21708 from srowen/CloseStalePRs.