[SPARK-17471][ML] Add compressed method to ML matrices #15628

sethah · 2016-10-25T20:14:46Z

What changes were proposed in this pull request?

This patch adds a compressed method to ML Matrix class, which returns the minimal storage representation of the matrix - either sparse or dense. Because the space occupied by a sparse matrix is dependent upon its layout (i.e. column major or row major), this method must consider both cases. It may also be useful to force the layout to be column or row major beforehand, so an overload is added which takes in a columnMajor: Boolean parameter.

The compressed implementation relies upon two new abstract methods, toDenseMatrix(columnMajor: Boolean) and toSparseMatrix(columnMajor: Boolean), and mirrors the compressed method implemented in the Vector class. These methods also allow the layout of the resulting matrix to be specified via the columnMajor parameter. More detail on the new methods is given below.

How was this patch tested?

Added many new unit tests

New methods (summary, not exhaustive list)

Matrix trait

private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix (abstract) - converts the matrix (either sparse or dense) to dense format
private[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix (abstract) - converts the matrix (either sparse or dense) to sparse format
def toDense: DenseMatrix = toDenseMatrix(true) - converts the matrix (either sparse or dense) to dense format in column major layout
def toSparse: SparseMatrix = toSparseMatrix(true) - converts the matrix (either sparse or dense) to sparse format in column major layout
def compressed: Matrix - finds the minimum space representation of this matrix, considering both column and row major layouts, and converts it
def compressed(columnMajor: Boolean): Matrix - finds the minimum space representation of this matrix considering only column OR row major, and converts it

DenseMatrix class

private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix - converts the dense matrix to a dense matrix, optionally changing the layout (data is NOT duplicated if the layouts are the same)
private[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix - converts the dense matrix to sparse matrix, using the specified layout

SparseMatrix class

private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix - converts the sparse matrix to a dense matrix, using the specified layout
private[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix - converts the sparse matrix to a sparse matrix. If the sparse matrix contains any explicit zeros, they are removed. If the requested layout does not match the current layout, data is copied to a new representation. If the layouts match and no explicit zeros exist, the current matrix is returned.

sethah · 2016-10-25T20:23:26Z

cc @dbtsai @yanboliang

Note for reviewers: The driving need behind this change is that we now have multiclass logistic regression which represents coefficients as a matrix. When we use L1 regularization it can be more efficient to store the coefficient matrix as sparse since there can be a lot of zero elements. We will incorporate this compression in a follow-up PR

SparkQA · 2016-10-25T20:35:28Z

Test build #67527 has finished for PR 15628 at commit a51e217.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-25T20:40:30Z

Test build #67528 has finished for PR 15628 at commit 5600d14.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-10-25T21:30:08Z

I'm not sure I understand why there are MiMa failures. It should not be a problem that the new methods only exist in the current version AFAIK.

sethah · 2016-10-31T18:12:48Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+   * @param columnMajor Whether the values of the resulting dense matrix should be in column major
+   *                    or row major order. If `false`, resulting matrix will be row major.
+   */
+  private [ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix


minor: space

sethah · 2016-10-31T18:16:23Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+   * @param columnMajor Whether the values of the resulting sparse matrix should be in column major
+   *                    or row major order. If `false`, resulting matrix will be row major.
+   */
+  private[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix


A bit of explanation: I made this a private method with a different name because toDense: DenseMatrix and toSparse: SparseMatrix should be implemented in the trait, not in the subclasses. But we can't just put them here and use overloading, because we will get ambiguous reference compile errors. So, we implement them here and make this private with a different name to avoid this. I appreciate feedback on this approach - it feels a bit awkward.

Yeah, this is very hacky in my opinion too!

The problem is that when one overloads a method that is declared without parentheses, invoking it without parentheses is ambiguous: the call can either invoke the parameterless method or refer to the method that takes a parameter. The example below demonstrates the issue.

In my opinion, I would like to call it toSparse(columnMajor: Boolean) and toSparse() = toSparse(true), but the vector API already uses the form without parentheses, so this would make the API design inconsistent.

I think exposing the ability to convert to column major or row major is very useful. We could expose it as toCSRMatrix, toCSCMatrix, and toSparse, where toSparse converts the matrix to the one with the smallest storage.

scala> trait A {
     |   def foo(b: Boolean): String
     |   def foo: String = foo(true)
     | }
defined trait A

scala> class B extends A {
     |   def foo(b: Boolean): String = b.toString
     | }
defined class B

scala> val b = new B
b: B = B@67b6d4ae

scala> b.foo
<console>:18: error: ambiguous reference to overloaded definition,
both method foo in class B of type (b: Boolean)String
and  method foo in trait A of type => String
match expected type ?
       b.foo
         ^

scala> val x: String = b.foo
x: String = true

scala> val y: Boolean => String = b.foo
y: Boolean => String = <function1>

Should we just leave toDenseMatrix private and have toDense always use colMajor = true? I think that's ok to do for now.

EDIT: I don't think we can make toSparse choose the smaller representation since before it always chose column major layout. This would change the behavior.
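The workaround discussed above (an abstract helper with a distinct name, plus a parameterless public method defined only in the trait) can be sketched in isolation. This is a minimal illustration, not the Spark code: `Mat`, `Dense`, and `toDenseImpl` are hypothetical names, and strings stand in for the matrix types.

```scala
// Hypothetical stand-ins for the Matrix trait and one subclass; not the Spark API.
object OverloadWorkaround {
  trait Mat {
    // Abstract helper with its own name, implemented by each subclass.
    protected def toDenseImpl(colMajor: Boolean): String

    // The parameterless method exists only here, so there is no overloaded
    // alternative named `toDense` and `m.toDense` resolves unambiguously.
    def toDense: String = toDenseImpl(colMajor = true)
  }

  class Dense extends Mat {
    protected def toDenseImpl(colMajor: Boolean): String =
      if (colMajor) "dense, column major" else "dense, row major"
  }
}
```

Because the two methods no longer share a name, the compiler never has to choose between a call and a method reference, which is what triggered the error in the REPL session above.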

SparkQA · 2016-10-31T18:18:07Z

Test build #67819 has finished for PR 15628 at commit b5277c9.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-31T18:28:27Z

Test build #67820 has finished for PR 15628 at commit a931fb9.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-11-03T05:31:50Z

What are you going to add? compressed already exists on vectors...

WeichenXu123 · 2016-11-03T07:19:03Z

@sethah Oh, I missed it.

MLnick · 2016-11-03T09:19:04Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+
+  private[ml] def getSparseSize(numActives: Long, numPtrs: Long): Long = {
+    // 8 * values.length + 4 * rowIndices.length + 4 * colPtrs.length + 8 + 8 + 1
+    12L * numActives + 4L * numPtrs + 17L


The comment says 8 * values while this is 12? Seems like a mistype?

I wondered how confusing this comment might be. Since values.length == rowIndices.length == numActives:

8 * values.length + 4 * rowIndices.length = 8 * numActives + 4 * numActives = 12 * numActives

The comment is meant to show where each number comes from and the implementation is meant to just be a condensed computation. But please let me know if you think it's too confusing.

ah right - no that's fine.
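For readers who want to check the arithmetic, here is a small sketch that evaluates the expanded form from the code comment and the condensed form from the implementation side by side. The object and method names are hypothetical; the constants are the ones quoted in the hunk above.

```scala
object SparseSizeCheck {
  // Expanded form: one term per array, plus the fixed overhead from the comment.
  // values.length and rowIndices.length both equal numActives.
  def expanded(numActives: Long, numPtrs: Long): Long =
    8L * numActives + 4L * numActives + 4L * numPtrs + 8L + 8L + 1L

  // Condensed form used in the implementation under review.
  def condensed(numActives: Long, numPtrs: Long): Long =
    12L * numActives + 4L * numPtrs + 17L
}
```

Both functions agree for any input, for example 3 active values and 4 pointers give 69 bytes either way.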

sethah · 2016-12-07T05:14:03Z

ping @dbtsai :)

SparkQA · 2016-12-14T16:37:09Z

Test build #70135 has finished for PR 15628 at commit 10ec41b.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2017-01-05T19:48:34Z

pinging other potential reviewers @jkbradley @srowen @MLnick I think this is an important patch for multiclass logistic regression.

sethah · 2017-02-01T00:50:21Z

ping @imatiach-msft @dbtsai

sethah · 2017-02-27T22:27:59Z

re-ping @dbtsai @MLnick @yanboliang I still think this is an important patch :D

dbtsai · 2016-11-11T22:21:54Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+   * Converts this matrix to a dense matrix in column major order.
+   */
+  @Since("2.1.0")
+  def toDense: DenseMatrix = toDenseMatrix(columnMajor = true)


Nit: since we're already using numCols, should we call it colMajor? I saw a couple of packages using colMajor as the variable name.

dbtsai · 2016-11-11T22:31:47Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+    val csrSize = getSparseSizeInBytes(columnMajor = false)
+    val minSparseSize = cscSize.min(csrSize)
+    if (getDenseSizeInBytes < minSparseSize) {
+      // size is the same either way, so maintain current layout


if (getDenseSizeInBytes < math.min(cscSize, csrSize))
...
...
if (cscSize < csrSize)

could be easier to read.

Also, can you elaborate on the comment, for example:

// sizes for dense matrix in row major or column major are the same, so maintain current layout

dbtsai · 2017-03-08T21:50:46Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

@@ -153,6 +153,86 @@ sealed trait Matrix extends Serializable {
   */


Can you make foreachActive(f: (Int, Int, Double) => Unit) public? This is public for vector. I believe it will be very useful, and I think it's stable enough to make it public.

I made the change. Not sure if we should do this in a separate PR though.

Should be fine. Small enough change :)

BTW, it's nice to have an explicit return type on public methods. Can you add Unit as the return type?

SparkQA · 2017-03-09T21:10:06Z

Test build #74280 has finished for PR 15628 at commit 810a077.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

dbtsai · 2017-03-09T21:16:14Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+   * the current layout order.
+   */
+  @Since("2.2.0")
+  def compressed: Matrix = {


Won't compressed(colMajor: Boolean) and compressed cause the overloading ambiguity issue?

It's only a problem if we override/implement it in a subclass. Since it's contained wholly in the trait, it will be fine. I think this is ok to leave, though we could make it final? Also we could make three methods: compressed, compressedCSC, compressedCSR. I think the latter is a good solution, thoughts?

+1 on the latter one.

dbtsai · 2017-03-09T21:19:42Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+   * Converts this matrix to a sparse matrix in column major order.
+   */
+  @Since("2.2.0")
+  def toCSCMatrix: SparseMatrix = toSparseMatrix(colMajor = true)


How about we follow scipy and call it toCSC?

dbtsai · 2017-03-09T21:20:33Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+   * Converts this matrix to a sparse matrix in row major order.
+   */
+  @Since("2.2.0")
+  def toCSRMatrix: SparseMatrix = toSparseMatrix(colMajor = false)


dbtsai · 2017-03-09T21:24:46Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+   */
+  @Since("2.2.0")
+  def toDense: DenseMatrix = toDenseMatrix(colMajor = true)
+


Could we add toColumnMajorDense and toRowMajorDense?

Ditto. Should we consider maintaining the same layout?

sethah · 2017-03-14T00:56:34Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+   * @param colMajor Whether the values of the resulting sparse matrix should be in column major
+   *                    or row major order. If `false`, resulting matrix will be row major.
+   */
+  private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix


You mean @inline private[ml] def ... ? Do we expect this to be called often enough for that to make a difference?

Fair enough.

sethah · 2017-03-14T01:29:43Z

@dbtsai Let me know your thoughts on the comments I left. Thanks for the review!

SparkQA · 2017-03-14T01:37:16Z

Test build #74477 has finished for PR 15628 at commit cf8945a.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-14T04:32:30Z

Test build #74486 has started for PR 15628 at commit baa8c9d.

SparkQA · 2017-03-14T05:09:43Z

Test build #74479 has finished for PR 15628 at commit 254b9fb.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

sethah · 2017-03-14T20:13:09Z

Jenkins test this please.

SparkQA · 2017-03-14T22:54:23Z

Test build #74552 has finished for PR 15628 at commit baa8c9d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2017-03-17T20:12:52Z

ping @dbtsai! Code freeze is upcoming :)

dbtsai · 2017-03-22T00:09:56Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+   * Converts this matrix to a sparse matrix in column major order.
+   */
+  @Since("2.2.0")
+  def toCSC: SparseMatrix = toSparseMatrix(colMajor = true)


I'm not good at naming, but since we use toDenseRowMajor for the dense matrix, should we use toSparseColumnMajor? Many packages use toCSC, but I think we can make the names consistent. Just my 2 cents.

I'm not sure I have a preference. I don't mind leaving them as CSC and CSR.

After thinking about it again, let's have it as toSparseColumnMajor to make the apis consistent with the dense ones if you don't mind?

dbtsai · 2017-03-22T00:10:59Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+   * Converts this matrix to a sparse matrix in row major order.
+   */
+  @Since("2.2.0")
+  def toCSR: SparseMatrix = toSparseMatrix(colMajor = false)


Same question.

dbtsai · 2017-03-22T00:14:05Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+   *                    or row major order. If `false`, resulting matrix will be row major.
+   */
+  @Since("2.2.0")
+  def compressed(colMajor: Boolean): Matrix = {


Let's make it private, and follow the previous style. We should add compressedRowMajor and compressedColumnMajor, since the result can be a dense matrix in certain situations.

dbtsai · 2017-03-22T00:35:26Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+   */
+  private[ml] override def toDenseMatrix(colMajor: Boolean): DenseMatrix = {
+    if (!(isTransposed ^ colMajor)) {
+      val newValues = new Array[Double](numCols * numRows)


Simpler to use foreachActive? With it, both toDenseMatrix can have the same implementation for sparse and dense in trait.

I don't think we can put this in the trait, since when it is called on a dense matrix we would like to return this in some cases (when no layout change is needed). But, yes, I think it is simpler to use toArray, which calls foreachActive. Thanks!

The following could work, and we only need one implementation in trait. Thanks.

trait Matrix {
  var isTransposed: Boolean = true
  var numCols: Int = 0
  var numRows: Int = 0

  def foreachActive(f: (Int, Int, Double) => Unit): Unit

  def toDenseMatrix(colMajor: Boolean): Matrix = {
    this match {
      case _: DenseMatrix if this.isTransposed != colMajor =>
        this
      case _: SparseMatrix | _: DenseMatrix if this.isTransposed == colMajor =>
        val newValues = new Array[Double](numCols * numRows)
        this.foreachActive { case (row, col, value) =>
          // filling the newValues
        }
        new DenseMatrix(numRows, numCols, newValues, isTransposed = !colMajor)
      case _ =>
        throw new IllegalArgumentException("")
    }
  }
}

class DenseMatrix extends Matrix {
  def foreachActive(f: (Int, Int, Double) => Unit): Unit = { }
}

class SparseMatrix extends Matrix {
  def foreachActive(f: (Int, Int, Double) => Unit): Unit = { }
}

Hm, I don't think this solution is better. The entire point of abstract methods is to allow subclasses to implement a method differently. Since we need different implementations depending on the subclass, we should just implement them in the subclasses. We can do this with the following:

trait Matrix {
  def toDenseMatrix(colMajor: Boolean): Matrix
}

class DenseMatrix extends Matrix {
  private[ml] override def toDenseMatrix(colMajor: Boolean): DenseMatrix = {
    if (isTransposed && colMajor) {
      new DenseMatrix(numRows, numCols, toArray, isTransposed = false)
    } else if (!isTransposed && !colMajor) {
      new DenseMatrix(numRows, numCols, transpose.toArray, isTransposed = true)
    } else {
      this
    }
  }
}

class SparseMatrix extends Matrix {
  private[ml] override def toDenseMatrix(colMajor: Boolean): DenseMatrix = {
    if (colMajor) new DenseMatrix(numRows, numCols, toArray)
    else new DenseMatrix(numRows, numCols, this.transpose.toArray, isTransposed = true)
  }
}

Which is less verbose than the previous code. I'm going to put that in the next commit. Let me know what you think.

This looks great to me!
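The dense-to-dense case agreed on above can be tried out with a toy model. This is only a sketch under stated assumptions: `Dense` and `toLayout` are hypothetical names rather than the Spark classes, and `isTransposed = true` is taken to mean that the values are stored row major, as in the thread.

```scala
object DenseLayout {
  // Hypothetical minimal model: `values` is column major unless isTransposed is set.
  final case class Dense(
      numRows: Int,
      numCols: Int,
      values: Array[Double],
      isTransposed: Boolean = false) {

    // Element lookup that respects the current layout.
    def apply(i: Int, j: Int): Double =
      if (isTransposed) values(i * numCols + j) else values(j * numRows + i)

    // Copy the values only when the requested layout differs from the current one.
    def toLayout(colMajor: Boolean): Dense =
      if (colMajor != isTransposed) this // layout already matches, no copy
      else {
        val out = new Array[Double](numRows * numCols)
        for (i <- 0 until numRows; j <- 0 until numCols) {
          val idx = if (colMajor) j * numRows + i else i * numCols + j
          out(idx) = apply(i, j)
        }
        Dense(numRows, numCols, out, isTransposed = !colMajor)
      }
  }
}
```

Returning `this` when the layouts already match is the reason the method cannot be shared with the sparse case: only the dense subclass can hand back itself as the result.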

dbtsai · 2017-03-22T00:52:56Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

-    new DenseMatrix(numRows, numCols, toArray)
+  private[ml] override def toSparseMatrix(colMajor: Boolean): SparseMatrix = {
+    if (!(colMajor ^ isTransposed)) {
+      // breeze transpose rearranges values in column major and removes explicit zeros


I think it's hacky to rely on Breeze's transpose behavior to remove zeros in sparse matrices. Can we have our own implementation, given that we may remove the Breeze dependency?

I pretty much agree with you, but this is non-trivial code if we want to do it efficiently. Breeze has a pretty well-optimized implementation to do this. I would leave it as a follow up JIRA, or do it when/if we ever remove the Breeze dependency. Or do you think this is a blocker for this PR?

This is not a blocker.
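For context on what a Breeze-free version would have to do, here is a deliberately simple sketch of dropping explicit zeros from column-major (CSC) arrays. It is not the Breeze implementation and makes no attempt at the optimizations mentioned above; the object and method names are hypothetical, and it does not change the layout.

```scala
import scala.collection.mutable.ArrayBuffer

object DropExplicitZeros {
  // Returns (colPtrs, rowIndices, values) with explicitly stored zeros removed.
  // Assumes valid CSC input: colPtrs has numCols + 1 entries.
  def compact(
      numCols: Int,
      colPtrs: Array[Int],
      rowIndices: Array[Int],
      values: Array[Double]): (Array[Int], Array[Int], Array[Double]) = {
    val newPtrs = new Array[Int](numCols + 1)
    val newRows = ArrayBuffer.empty[Int]
    val newVals = ArrayBuffer.empty[Double]
    for (j <- 0 until numCols) {
      // Keep only the nonzero entries of column j.
      for (k <- colPtrs(j) until colPtrs(j + 1)) {
        if (values(k) != 0.0) {
          newRows += rowIndices(k)
          newVals += values(k)
        }
      }
      // The next column starts after everything kept so far.
      newPtrs(j + 1) = newVals.length
    }
    (newPtrs, newRows.toArray, newVals.toArray)
  }
}
```

Converting between column major and row major at the same time is the harder part, which is why leaving it to a follow-up seems reasonable.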

dbtsai · 2017-03-22T00:56:22Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+        val breezeTransposed = asBreeze.asInstanceOf[BSM[Double]]
+        Matrices.fromBreeze(breezeTransposed).asInstanceOf[SparseMatrix]
+      }
+    } else {


Can we document here that this branch is taken when the layout of this matrix and colMajor differ? That would be easier to read than the (colMajor ^ isTransposed) condition. Pattern matching on the exact boolean values of both variables would be even more readable.
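As an illustration of the pattern-matching suggestion, the four combinations of the two flags can be spelled out explicitly. This is a hedged sketch: `describe` is a hypothetical helper that returns a label instead of doing a conversion, and it assumes `isTransposed = true` means row major storage.

```scala
object LayoutMatch {
  // Spells out each (colMajor, isTransposed) combination instead of using XOR.
  def describe(colMajor: Boolean, isTransposed: Boolean): String =
    (colMajor, isTransposed) match {
      case (true, false) | (false, true) => "layout already matches"
      case (true, true)                  => "row major to column major"
      case (false, false)                => "column major to row major"
    }

  // The XOR form from the hunk: true exactly when a layout change is needed.
  def needsChange(colMajor: Boolean, isTransposed: Boolean): Boolean =
    !(colMajor ^ isTransposed)
}
```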

dbtsai · 2017-03-22T01:00:00Z

mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala

+    assert(!dm3.isTransposed)
+    assert(dm3.values.equals(dm1.values))
+
+    val dm4 = dm1.toDenseMatrix(false)


I would like to make toDenseMatrix private and test against toDenseRowMajor, which is more explicit.

dbtsai · 2017-03-22T01:02:41Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+   * Converts this matrix to a sparse matrix in column major order.
+   */
+  @Since("2.2.0")
+  def toSparse: SparseMatrix = toSparseMatrix(colMajor = true)


I'm debating whether we should keep the same layout ordering when we call toSparse or toDense.

Well, this would change the behavior of external facing code. Before if I called toSparse on a row major matrix, I'd get a column major matrix. If we maintain the layout, then I'd now get something different (row major). Otherwise, I'd agree it is best to maintain the layout.

But I thought this is a new API being added, so we can make it maintain the same layout.

dbtsai · 2017-03-22T01:06:29Z

mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala

+            1.0 -3.0 -8.0
+     */
+    val dm1 = new DenseMatrix(2, 3, Array(4.0, -1.0, 2.0, 7.0, -8.0, 4.0))
+    val dm2 = new DenseMatrix(2, 3, Array(5.0, -9.0, 4.0, 1.0, -3.0, -8.0), isTransposed = true)


Why not just make dm2 the transpose of dm1 rather than explicitly assigning the values? Then you don't need to type the values in the array for the comparison.

I'm not sure I understand your meaning here. These are made to be two entirely different matrices anyway.

dbtsai · 2017-03-23T21:10:18Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+      toDenseMatrix(!isTransposed)
+    } else {
+      if (cscSize <= csrSize) {
+        toSparseMatrix(colMajor = true)


toSparseColumnMajor

dbtsai · 2017-03-23T21:10:33Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+      if (cscSize <= csrSize) {
+        toSparseMatrix(colMajor = true)
+      } else {
+        toSparseMatrix(colMajor = false)


toSparseRowMajor

dbtsai · 2017-03-23T21:11:08Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+    val csrSize = getSparseSizeInBytes(colMajor = false)
+    if (getDenseSizeInBytes < math.min(cscSize, csrSize)) {
+      // dense matrix size is the same for column major and row major, so maintain current layout
+      toDenseMatrix(!isTransposed)


Call toDense if we decide to make toDense and toSparse output the same layout.

I don't see the change here.

this.toDense

dbtsai · 2017-03-23T21:45:06Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+   * Generate a `DenseMatrix` from this `DenseMatrix`.
+   *
+   * @param colMajor Whether the resulting `DenseMatrix` values will be in column major order.
+   */


Minor, can we have

protected def isColMajor = !isTransposed
protected def isRowMajor = isTransposed

so the code is easier to understand?

Ok, I added these methods. I updated the test suites to use them instead of isTransposed.

dbtsai · 2017-03-23T21:56:53Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

    }
-    new SparseMatrix(numRows, numCols, colPtrs, rowIndices.result(), spVals.result())
  }



Could we override the toArray in DenseMatrix so when this is column major, we just return this.values? Otherwise, it's very expensive to create a new array.

I have to say I'm a bit surprised - I thought that's what toArray already did! Yeah, that seems like a good change, but I'd prefer to do it in another pr because we need to make sure that this doesn't adversely affect other places that use toArray as well as adding unit tests. If that sounds ok, I'll make a JIRA for it.

Sounds good. Let's do it in another PR. Thanks.

SparkQA · 2017-03-23T23:26:17Z

Test build #75116 has finished for PR 15628 at commit 4746ec0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-24T00:35:15Z

Test build #75123 has finished for PR 15628 at commit 4026e89.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-24T00:38:14Z

Test build #75122 has finished for PR 15628 at commit 354935b.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2017-03-24T04:43:39Z

Test build #75144 has finished for PR 15628 at commit c880a45.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-24T04:52:30Z

Test build #75147 has started for PR 15628 at commit 93ec250.

dbtsai · 2017-03-24T04:22:06Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+   * @param colMajor Whether the resulting `DenseMatrix` values are in column major order.
+   */
+  private[ml] override def toDenseMatrix(colMajor: Boolean): DenseMatrix = {
+    if (colMajor) new DenseMatrix(numRows, numCols, toArray)


new DenseMatrix(numRows, numCols, this.toArray, isTransposed = false) to make the style consistent.

dbtsai · 2017-03-24T05:24:35Z

mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala

+    assert(cm1 === sm1)
+    assert(!cm1.isTransposed)
+    assert(cm1.values.equals(sm1.values))
+    assert(cm1.getSizeInBytes <= sm1.getSizeInBytes)


ditto. remove =

dbtsai · 2017-03-24T05:26:50Z

mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala

+    assert(cm2.isTransposed)
+    // forced to be row major, so we have increased the size
+    assert(cm2.getSizeInBytes > sm1.getSizeInBytes)
+    assert(cm2.getSizeInBytes <= sm1.toDense.getSizeInBytes)


= is needed?

I think it's ok to leave it here since they could potentially be equal.

If =, dense should be preferred since it will be faster. As a result, we should only check <.

Ok, before we broke ties with sparse, so I changed it to choose dense, and added unit tests. Also removed the = here.
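The tie-breaking rule settled on here can be sketched as a standalone function. This is an illustration only, not the merged code: the names are hypothetical, the size formulas use the constants quoted elsewhere in this review (13 bytes of overhead for dense, 37 for sparse), and the pointer array is assumed to have one more entry than the number of columns (CSC) or rows (CSR).

```scala
object CompressedChoice {
  sealed trait Layout
  case object Dense extends Layout
  case object SparseColMajor extends Layout
  case object SparseRowMajor extends Layout

  def denseSize(numRows: Long, numCols: Long): Long =
    8L * numRows * numCols + 13L

  def sparseSize(numActives: Long, numPtrs: Long): Long =
    12L * numActives + 4L * numPtrs + 37L

  // Picks the smallest representation; on a tie, dense wins because it is faster.
  def choose(numRows: Int, numCols: Int, numNonzeros: Int): Layout = {
    val cscSize = sparseSize(numNonzeros, numCols + 1L)
    val csrSize = sparseSize(numNonzeros, numRows + 1L)
    if (denseSize(numRows, numCols) <= math.min(cscSize, csrSize)) Dense
    else if (cscSize <= csrSize) SparseColMajor
    else SparseRowMajor
  }
}
```

A 4 x 4 matrix with 7 nonzeros is an exact tie under these formulas (141 bytes either way), so it is a convenient case for checking that dense is preferred.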

dbtsai · 2017-03-24T05:28:46Z

mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala

+    assert(cm8 === sm2)
+    assert(!cm8.isTransposed)
+    assert(cm8.getSizeInBytes > sm2.getSizeInBytes)
+    assert(cm8.getSizeInBytes <= sm2.toDense.getSizeInBytes)


For completeness, could you test sm1.compressedColumnMajor and sm2.compressedRowMajor? Thanks.

dbtsai · 2017-03-24T05:29:21Z

mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala

+    val cm4 = sm3.compressed.asInstanceOf[DenseMatrix]
+    assert(cm4 === sm3)
+    assert(!cm4.isTransposed)
+    assert(cm4.getSizeInBytes <= sm3.getSizeInBytes)


remove =?

dbtsai · 2017-03-24T05:30:00Z

mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala

+    val cm5 = sm3.compressedRowMajor.asInstanceOf[DenseMatrix]
+    assert(cm5 === sm3)
+    assert(cm5.isTransposed)
+    assert(cm5.getSizeInBytes <= sm3.getSizeInBytes)


remove =? and sm3.compressedColumnMajor?

SparkQA · 2017-03-24T06:42:35Z

Test build #75155 has started for PR 15628 at commit 5411d46.

sethah · 2017-03-24T06:42:39Z

@dbtsai Thanks for the good suggestions. I rearranged the test suites, removed redundancies, and filled in some gaps. Things got a bit jumbled when changing some of the methods around. I also added the this prefix to some of the method calls as you suggest.

dbtsai

Minor

dbtsai

LGTM. Couple small comments.

dbtsai · 2017-03-24T17:43:39Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+   * Converts this matrix to a sparse matrix while maintaining the layout of the current matrix.
+   */
+  @Since("2.2.0")
+  def toSparse: SparseMatrix = toSparseMatrix(colMajor = !isTransposed)


def toSparse: SparseMatrix = toSparseMatrix(colMajor = isColMajor)

dbtsai · 2017-03-24T17:44:05Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+   * Converts this matrix to a dense matrix while maintaining the layout of the current matrix.
+   */
+  @Since("2.2.0")
+  def toDense: DenseMatrix = toDenseMatrix(colMajor = !isTransposed)


def toDense: DenseMatrix = toDenseMatrix(colMajor = isColMajor)

dbtsai · 2017-03-24T17:55:14Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+  }
+
+  private[ml] def getDenseSize(numCols: Long, numRows: Long): Long = {
+    // 8 * values.length + 12 + 1


Can you document where the magic numbers 12 + 1 come from? Also, can we make it

java.lang.Double.BYTES * numCols * numRows + 13L

since the size of primitive type can depend on the implementation of JVM.

dbtsai · 2017-03-24T18:07:19Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

+
+  private[ml] def getSparseSize(numActives: Long, numPtrs: Long): Long = {
+    // 8 * values.length + 4 * rowIndices.length + 4 * colPtrs.length + 12 + 12 + 12 + 1
+    12L * numActives + 4L * numPtrs + 37L


(java.lang.Double.BYTES + java.lang.Integer.BYTES) * numActives + java.lang.Integer.BYTES * numPtrs + 37L

Nice that we can get 37L using java apis to ensure the portability.
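A sketch of the two size helpers written with the JDK constants suggested above. `java.lang.Double.BYTES` and `java.lang.Integer.BYTES` are available from Java 8 onward; the object name and the private constant names are hypothetical, and the fixed overheads (13 and 37) are copied from the hunks rather than derived.

```scala
object MatrixSizes {
  // Element sizes taken from the JDK instead of hard-coded literals.
  private val DoubleBytes: Long = java.lang.Double.BYTES.toLong
  private val IntBytes: Long = java.lang.Integer.BYTES.toLong

  // values array plus the fixed overhead quoted in the review.
  def denseSize(numRows: Long, numCols: Long): Long =
    DoubleBytes * numRows * numCols + 13L

  // One double and one int index per active value, plus the pointer array.
  def sparseSize(numActives: Long, numPtrs: Long): Long =
    (DoubleBytes + IntBytes) * numActives + IntBytes * numPtrs + 37L
}
```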

SparkQA · 2017-03-24T19:58:37Z

Test build #75167 has finished for PR 15628 at commit 95ac0e0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2017-03-24T20:03:28Z

Thanks for finally cooperating, Jenkins!

dbtsai · 2017-03-24T20:23:43Z

Thanks @sethah and Jenkins! Merged into master.

sethah · 2017-03-24T20:29:46Z

Thanks for all the time reviewing!

SparkQA · 2017-03-24T21:33:54Z

Test build #75171 has finished for PR 15628 at commit 87dfaa0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2017-05-15T20:45:09Z

mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala

@@ -148,7 +154,8 @@ sealed trait Matrix extends Serializable {
   *          and column indices respectively with the type `Int`, and the final parameter is the
   *          corresponding value in the matrix with type `Double`.
   */
-  private[spark] def foreachActive(f: (Int, Int, Double) => Unit)
+  @Since("2.2.0")
+  def foreachActive(f: (Int, Int, Double) => Unit): Unit


@sethah @dbtsai Hi all, just saw this during QA. This method is not very Java-friendly. I'm OK with adding it as long as we document the fact that it's not Java-friendly. We could also consider adding a Java-friendly version, perhaps using https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/function/Function2.html

I think we should do the same for foreachActive in Vector, which became a public API a long time ago.

That's a good point. I guess we can leave it until someone complains & add a Java-friendly one as needed.
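To make the Java-friendliness concern concrete, here is a rough sketch of what a Java-friendly overload could look like. Everything in it is hypothetical: `ActiveEntryVisitor`, `ForeachActiveSketch`, and `foreachActiveJava` do not exist in Spark, and the small column-major loop is only a stand-in for the real `foreachActive`. The idea is that a trait with a single abstract method and primitive parameters compiles to a plain Java interface, so a Java caller would not have to implement a Scala `Function3` with boxed arguments.

```scala
// Hypothetical single-method interface; not part of Spark.
trait ActiveEntryVisitor {
  def visit(row: Int, col: Int, value: Double): Unit
}

// Hypothetical stand-in for a matrix, kept tiny so the sketch is self-contained.
object ForeachActiveSketch {
  // Scala-style API: takes a Scala function, as in the diff above.
  // Assumes `values` holds a numRows x numCols matrix in column-major order.
  def foreachActive(numRows: Int, numCols: Int, values: Array[Double])(
      f: (Int, Int, Double) => Unit): Unit = {
    var j = 0
    while (j < numCols) {
      var i = 0
      while (i < numRows) {
        f(i, j, values(i + j * numRows))
        i += 1
      }
      j += 1
    }
  }

  // Java-friendly variant: delegates to the Scala-style method through the visitor.
  def foreachActiveJava(
      numRows: Int,
      numCols: Int,
      values: Array[Double],
      visitor: ActiveEntryVisitor): Unit =
    foreachActive(numRows, numCols, values)((i, j, v) => visitor.visit(i, j, v))
}
```

Whether such an overload is worth adding is exactly the open question in the comments above; the sketch only shows that the wrapper itself would be thin.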


sethah commented Oct 31, 2016

sethah mentioned this pull request Nov 1, 2016

[SPARK-18201][ML] Add toDense and toSparse into Matrix trait, like Vector design #15716 (Closed)

MLnick reviewed Nov 3, 2016

dbtsai reviewed Mar 9, 2017

dbtsai reviewed Mar 10, 2017

sethah commented Mar 14, 2017

sethah force-pushed the matrix_compress branch from 254b9fb to baa8c9d, March 14, 2017 04:28

dbtsai reviewed Mar 22, 2017

sethah added 4 commits March 23, 2017 14:48

mima (f356828)

add compressedColRowMajor (5dbdc64)

toSparseColMajor (8bfbd4e)

toSparse, toDense maintain current layout (4026e89)

sethah force-pushed the matrix_compress branch from 354935b to 4026e89, March 23, 2017 21:49

dbtsai reviewed Mar 23, 2017

add isRowMajor, isColMajor (93ec250)

sethah force-pushed the matrix_compress branch from c880a45 to 93ec250, March 24, 2017 04:48

dbtsai reviewed Mar 24, 2017

organize test suites (5411d46)

dbtsai reviewed Mar 24, 2017

break ties with dense (95ac0e0)

dbtsai approved these changes Mar 24, 2017

clarify get size functions (87dfaa0)

asfgit closed this in e8810b7 Mar 24, 2017

jkbradley reviewed May 15, 2017

