Conversation

@lu-wang-dl
Contributor

What changes were proposed in this pull request?

Add support for array input (features given as Array[Float] or Array[Double], in addition to Vector) for all of the clustering methods.

How was this patch tested?

Unit tests added.


@lu-wang-dl lu-wang-dl changed the title [Spark 23975][ML] Add support of array input for all clustering methods [Spark-23975][ML] Add support of array input for all clustering methods Apr 30, 2018
@MrBago
Contributor

MrBago commented Apr 30, 2018

Looking now.

@jkbradley
Member

add to whitelist

@SparkQA

SparkQA commented May 1, 2018

Test build #89981 has finished for PR 21195 at commit 45e6e96.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

Rerunning tests in case the R CRAN failure was from flakiness

@SparkQA

SparkQA commented May 1, 2018

Test build #4166 has finished for PR 21195 at commit 45e6e96.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MrBago
Contributor

MrBago commented May 2, 2018

Thanks Lu!

I took a pass over this PR and it looks pretty straightforward. One thing I noticed is that there are two patterns we keep repeating. I think we should add private APIs for these patterns and delegate to those.

The first pattern is the schema-validation method defined in terms of typeCandidates. I suggest we add something like validateVectorCompatibleColumn to DatasetUtils. In addition to helping with code reuse, this API would make it easier if we ever decide, for example, to support arrays of Ints.

The second pattern is going from a DataFrame and a column name to an RDD[OldVector]. Let's add a method that does this, with a signature like (DataFrame, String) => RDD[OldVector].
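
A rough sketch of what these two helpers could look like (the object name VectorColumnUtils, the method names, and the exact signatures are illustrative, not the final API):

import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Column, DataFrame, Row}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{ArrayType, DoubleType, FloatType, StructType}

private[spark] object VectorColumnUtils {

  // Pattern 1: schema validation against the supported type candidates.
  def validateVectorCompatibleColumn(schema: StructType, colName: String): Unit = {
    val dataType = schema(colName).dataType
    val isSupported = dataType match {
      case _: VectorUDT => true
      case ArrayType(FloatType, _) | ArrayType(DoubleType, _) => true
      case _ => false
    }
    require(isSupported, s"Column $colName must be of type Vector, Array[Float] " +
      s"or Array[Double], but was actually $dataType.")
  }

  // Coerce any supported column type to an ML Vector column.
  // Callers are expected to validate the schema first.
  private def toVectorColumn(df: DataFrame, colName: String): Column =
    df.schema(colName).dataType match {
      case _: VectorUDT => col(colName)
      case ArrayType(FloatType, _) =>
        udf { (a: Seq[Float]) => Vectors.dense(a.map(_.toDouble).toArray) }(col(colName))
      case ArrayType(DoubleType, _) =>
        udf { (a: Seq[Double]) => Vectors.dense(a.toArray) }(col(colName))
    }

  // Pattern 2: (DataFrame, String) => RDD[OldVector].
  def columnToOldVector(df: DataFrame, colName: String): RDD[OldVector] = {
    df.select(toVectorColumn(df, colName)).rdd.map {
      case Row(v: Vector) => OldVectors.fromML(v)
    }
  }
}

Centralizing the type match in one place means each estimator only needs the validate call plus the conversion, and supporting a new input type such as Array[Int] later would touch a single file.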

@SparkQA

SparkQA commented May 3, 2018

Test build #90160 has finished for PR 21195 at commit 877c126.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@mengxr mengxr left a comment


In general, we should keep each PR minimal. First implement Array support for one estimator, get it reviewed and merged, and then implement support for other estimators. If other estimators share exactly the same pattern, we may put the rest in a single PR. But it is totally fine if we split them into multiple PRs. This helps avoid unnecessary code refactoring during code review.


/**
* Check whether the given column in the schema is one of the supporting vector type: Vector,
* Array[Dloat]. Array[Double]
Contributor

nit: Float

val featuresColNameF = "array_float_features"
val doubleUDF = udf { (features: Vector) =>
val featureArray = Array.fill[Double](features.size)(0.0)
features.foreachActive((idx, value) => featureArray(idx) = value.toFloat)
Contributor

  • If .toFloat is there to keep the same (single) precision, we should leave an inline comment.
  • features.toArray.map(_.toFloat.toDouble) should do the job.
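
For illustration, the simplified UDF could read (a sketch, assuming the test's existing Vector and udf imports):

// Truncating through Float and widening back to Double keeps the values at
// single precision, matching the float-array dataset.
val doubleUDF = udf { (features: Vector) =>
  features.toArray.map(_.toFloat.toDouble)
}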

featureArray
}
val newdatasetD = dataset.withColumn(featuresColNameD, doubleUDF(col("features")))
.drop("features")
Contributor

  • Unnecessary to drop features. Or you can simply replace the features column:
val newdatasetD = dataset.withColumn(FEATURES, doubleUDF(col(FEATURES)))

val transformedF = modelF.transform(newdatasetF)
val predictDifference = transformedD.select("prediction")
.except(transformedF.select("prediction"))
assert(predictDifference.count() == 0)
Contributor

This only verifies that Array[Double] and Array[Float] are handled the same way; it doesn't guarantee that the result is correct. We can define a method that takes a dataset, applies one iteration, and returns the cost.

def trainAndComputeCost(dataset: DataFrame): Double = {
  val model = new BisectingKMeans()
    .setK(k).setMaxIter(1).setSeed(1)
    .fit(dataset)
  model.computeCost(dataset)
}

val trueCost = trainAndComputeCost(dataset)
val floatArrayCost = trainAndComputeCost(newDatasetF)
assert(floatArrayCost === trueCost)
val doubleArrayCost = trainAndComputeCost(newDatasetD)
assert(doubleArrayCost === trueCost)

We can map the original dataset to single precision to get an exact match, or we can test equality with a threshold. See https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/util/TestingUtils.scala
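
For the threshold option, a minimal sketch using the approximate-equality operators from TestingUtils (the tolerance value is illustrative):

import org.apache.spark.mllib.util.TestingUtils._

// Compare costs up to an absolute tolerance instead of exact equality.
assert(floatArrayCost ~== trueCost absTol 1e-6)
assert(doubleArrayCost ~== trueCost absTol 1e-6)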

assert(predictDifference.count() == 0)
val probabilityDifference = transformedD.select("probability")
.except(transformedF.select("probability"))
assert(probabilityDifference.count() == 0)
Contributor

ditto

val lpF = modelF.logPerplexity(newdatasetF)
// assert(lpD == lpF)
assert(lpD >= 0.0 && lpD != Double.NegativeInfinity)
assert(lpF >= 0.0 && lpF != Double.NegativeInfinity)
Contributor

ditto

@SparkQA

SparkQA commented May 5, 2018

Test build #90233 has finished for PR 21195 at commit c7a14bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 7, 2018

Test build #90335 has finished for PR 21195 at commit d065634.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val floatLikelihood = trainAndComputlogLikelihood(newDatasetF)

// checking the cost is fine enough as a sanity check
assert(trueLikelihood == doubleLikelihood)
Contributor

minor: we should use === instead of == for assertions; the former gives a better error message. (Not necessary to update in this PR.)
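
For example (=== here is ScalaTest's assertion operator, available through SparkFunSuite):

// On failure, === reports both operands ("x did not equal y");
// plain == only yields a bare "assertion failed".
assert(trueLikelihood === doubleLikelihood)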

}

val (newDataset, newDatasetD, newDatasetF) = MLTestingUtils.generateArrayFeatureDataset(dataset)
val (ll, lp) = trainAndLogLikelihoodAndPerplexity(newDataset)
Contributor

minor: the outputs are not used. I expect they will be used once we fix SPARK-22210.

Contributor Author

Yes, I want to use this as the baseline for the comparison after we fix SPARK-22210.


/**
* Helper function for testing different input types for features. Given a DataFrame, generate
* three output DataFrames: one having vector feature column with float precision, one having
Contributor

minor: should say features column to make the contract clear.

val toFloatVectorUDF = udf { (features: Vector) => features.toArray.map(_.toFloat).toVector}
val toDoubleArrayUDF = udf { (features: Vector) => features.toArray}
val toFloatArrayUDF = udf { (features: Vector) => features.toArray.map(_.toFloat)}
val newDataset = dataset.withColumn("features", toFloatVectorUDF(col("features")))
Contributor

minor: it might be useful to define "features" as a constant at the beginning of the function.

val toDoubleArrayUDF = udf { (features: Vector) => features.toArray}
val toFloatArrayUDF = udf { (features: Vector) => features.toArray.map(_.toFloat)}
val newDataset = dataset.withColumn("features", toFloatVectorUDF(col("features")))
val newDatasetD = dataset.withColumn("features", toDoubleArrayUDF(col("features")))
Contributor

This doesn't truncate the precision to single. Did you want to use newDataset instead of dataset?
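
Putting the review suggestions together, the helper could be sketched as follows (illustrative only; the actual contract of generateArrayFeatureDataset is defined in MLTestingUtils):

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

def generateArrayFeatureDataset(dataset: DataFrame): (DataFrame, DataFrame, DataFrame) = {
  val features = "features"  // column-name constant, per the comment above
  // Truncate to single precision so all three outputs carry identical values.
  val toFloatVectorUDF = udf { (v: Vector) =>
    Vectors.dense(v.toArray.map(_.toFloat.toDouble))
  }
  val toDoubleArrayUDF = udf { (v: Vector) => v.toArray }
  val toFloatArrayUDF = udf { (v: Vector) => v.toArray.map(_.toFloat) }
  val newDataset = dataset.withColumn(features, toFloatVectorUDF(col(features)))
  // Derive the array variants from the truncated dataset, not the original one.
  val newDatasetD = newDataset.withColumn(features, toDoubleArrayUDF(col(features)))
  val newDatasetF = newDataset.withColumn(features, toFloatArrayUDF(col(features)))
  (newDataset, newDatasetD, newDatasetF)
}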

@SparkQA

SparkQA commented May 7, 2018

Test build #90344 has finished for PR 21195 at commit c9f478e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented May 8, 2018

LGTM. Merged into master. Thanks!

@asfgit asfgit closed this in 0d63eb8 May 8, 2018
@lu-wang-dl lu-wang-dl deleted the SPARK-23975-1 branch May 16, 2018 20:23