Skip to content
Prev Previous commit
Next Next commit
[SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type E…
…rasure Issue

As noted in PR apache#9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor.  As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark.  Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`.  Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`.  As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type.  `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.

This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`.  This PR blocks apache#9441, so once this is merged, the other can be rebased.

cc holdenk

Author: Mike Dusenberry <[email protected]>

Closes apache#9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue.

(cherry picked from commit 1b82203)
Signed-off-by: Joseph K. Bradley <[email protected]>
  • Loading branch information
dusenberrymw authored and jkbradley committed Dec 11, 2015
commit e4cf1211803097eb3cdd93deccb7eb996e12da3d
Original file line number Diff line number Diff line change
Expand Up @@ -1101,7 +1101,7 @@ private[python] class PythonMLLibAPI extends Serializable {
* Wrapper around RowMatrix constructor.
*/
def createRowMatrix(rows: JavaRDD[Vector], numRows: Long, numCols: Int): RowMatrix = {
new RowMatrix(rows.rdd, numRows, numCols)
new RowMatrix(rows.rdd.retag(classOf[Vector]), numRows, numCols)
}

/**
Expand Down