-
Notifications
You must be signed in to change notification settings - Fork 28.8k
[SPARK-17471][ML] Add compressed method to ML matrices #15628
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Note for reviewers: The driving need behind this change is that we now have multiclass logistic regression which represents coefficients as a matrix. When we use L1 regularization it can be more efficient to store the coefficient matrix as sparse since there can be a lot of zero elements. We will incorporate this compression in a follow-up PR |
Test build #67527 has finished for PR 15628 at commit
|
Test build #67528 has finished for PR 15628 at commit
|
I'm not sure I understand why there are MiMa failures. It should not be a problem that the new methods only exist in the current version AFAIK. |
* @param columnMajor Whether the values of the resulting dense matrix should be in column major | ||
* or row major order. If `false`, resulting matrix will be row major. | ||
*/ | ||
private [ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: space
* @param columnMajor Whether the values of the resulting sparse matrix should be in column major | ||
* or row major order. If `false`, resulting matrix will be row major. | ||
*/ | ||
private[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit of explanation: I made this a private method with a different name because toDense: DenseMatrix
and toSparse: SparseMatrix
should be implemented in the trait, not in the subclasses. But we can't just put them here and use overloading, because we will get ambiguous reference compile errors. So, we implement them here and make this private with a different name to avoid this. I appreciate feedback on this approach - it feels a bit awkward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is very hacky in my opinion too!
The problem is that when one overloads a function without parenthesis, the ambiguity happens because when that function is invoked without parenthesis, this can be calling the actual function without parenthesis, or getting the function with parenthesis. The following is an example demonstrating the issue.
In my opinion, I would like to call it toSparse(columnMajor: Boolean)
and toSparse() = toSparse(true)
, but in the vector api, we already use the one without parenthesis, so it will result inconsistency in the api design.
I think exposing the ability of converting it to columnMajor
or rowMajor
is very useful, as a result, we can expose it as toCSRMatrix
, toCSCMatrix
, and toSparse
which converts the matrix to the one with smallest storage.
scala> trait A {
| def foo(b: Boolean): String
| def foo: String = foo(true)
| }
defined trait A
scala> class B extends A {
| def foo(b: Boolean): String = b.toString
| }
defined class B
scala> val b = new B
b: B = B@67b6d4ae
scala> b.foo
<console>:18: error: ambiguous reference to overloaded definition,
both method foo in class B of type (b: Boolean)String
and method foo in trait A of type => String
match expected type ?
b.foo
^
scala> val x: String = b.foo
x: String = true
scala> val y: Boolean=> String = b.foo
y: Boolean => String = <function1>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just leave toDenseMatrix
private and have toDense
always use colMajor = true? I think that's ok to do for now.
EDIT: I don't think we can make toSparse
choose the smaller representation since before it always chose column major layout. This would change the behavior.
Test build #67819 has finished for PR 15628 at commit
|
Test build #67820 has finished for PR 15628 at commit
|
What are you going to add? |
@sethah Oh..I miss it.. |
|
||
private[ml] def getSparseSize(numActives: Long, numPtrs: Long): Long = { | ||
// 8 * values.length + 4 * rowIndices.length + 4 * colPtrs.length + 8 + 8 + 1 | ||
12L * numActives + 4L * numPtrs + 17L |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment says 8 * values while this is 12? Seems like a mistype?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wondered how confusing this comment might be. Since values.length == rowIndices.length == numActives
:
8 * values.length + 4 * rowIndices.length = 8 * numActives + 4 * numActives = 12 * numActives
The comment is meant to show where each number comes from and the implementation is meant to just be a condensed computation. But please let me know if you think it's too confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah right - no that's fine.
ping @dbtsai :) |
Test build #70135 has finished for PR 15628 at commit
|
pinging other potential reviewers @jkbradley @srowen @MLnick I think this is an important patch for multiclass logistic regression. |
ping @imatiach-msft @dbtsai |
re-ping @dbtsai @MLnick @yanboliang I still think this is an important patch :D |
* Converts this matrix to a dense matrix in column major order. | ||
*/ | ||
@Since("2.1.0") | ||
def toDense: DenseMatrix = toDenseMatrix(columnMajor = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, since we're using numCols
already, should we call it colMajor
? I saw couple packages using colMajor
as the variable name.
val csrSize = getSparseSizeInBytes(columnMajor = false) | ||
val minSparseSize = cscSize.min(csrSize) | ||
if (getDenseSizeInBytes < minSparseSize) { | ||
// size is the same either way, so maintain current layout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (getDenseSizeInBytes < math.min(cscSize, csrSize))
...
...
if (cscSize < csrSize)
could be easier to read.
Also, can you elaborate the comment like
// sizes for dense matrix in row major or column major are the same, so maintain current layout
@@ -153,6 +153,86 @@ sealed trait Matrix extends Serializable { | |||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make foreachActive(f: (Int, Int, Double) => Unit)
public? This is public for vector. I believe it will be very useful, and I think it's stable enough to make it public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made the change. Not sure if we should do this in a separate PR though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be fine. Small enough change :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, it's nice to have return type in public method. Can you add Unit
as return type?
* @param columnMajor Whether the values of the resulting sparse matrix should be in column major | ||
* or row major order. If `false`, resulting matrix will be row major. | ||
*/ | ||
private[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is very hacky in my opinion too!
The problem is that when one overloads a function without parenthesis, the ambiguity happens because when that function is invoked without parenthesis, this can be calling the actual function without parenthesis, or getting the function with parenthesis. The following is an example demonstrating the issue.
In my opinion, I would like to call it toSparse(columnMajor: Boolean)
and toSparse() = toSparse(true)
, but in the vector api, we already use the one without parenthesis, so it will result inconsistency in the api design.
I think exposing the ability of converting it to columnMajor
or rowMajor
is very useful, as a result, we can expose it as toCSRMatrix
, toCSCMatrix
, and toSparse
which converts the matrix to the one with smallest storage.
scala> trait A {
| def foo(b: Boolean): String
| def foo: String = foo(true)
| }
defined trait A
scala> class B extends A {
| def foo(b: Boolean): String = b.toString
| }
defined class B
scala> val b = new B
b: B = B@67b6d4ae
scala> b.foo
<console>:18: error: ambiguous reference to overloaded definition,
both method foo in class B of type (b: Boolean)String
and method foo in trait A of type => String
match expected type ?
b.foo
^
scala> val x: String = b.foo
x: String = true
scala> val y: Boolean=> String = b.foo
y: Boolean => String = <function1>
Test build #74280 has finished for PR 15628 at commit
|
* the current layout order. | ||
*/ | ||
@Since("2.2.0") | ||
def compressed: Matrix = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't compressed(colMajor: Boolean)
and compressed
cause the overloading ambiguous issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's only a problem if we override/implement it in a subclass. Since it's contained wholly in the trait, it will be fine. I think this is ok to leave, though we could make it final? Also we could make three methods: compressed
, compressedCSC
, compressedCSR
. I think the latter is a good solution, thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on the later one.
@@ -153,6 +153,86 @@ sealed trait Matrix extends Serializable { | |||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be fine. Small enough change :)
* Converts this matrix to a sparse matrix in column major order. | ||
*/ | ||
@Since("2.2.0") | ||
def toCSCMatrix: SparseMatrix = toSparseMatrix(colMajor = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we follow scipy, and call it as toCSC
?
* Converts this matrix to a sparse matrix in row major order. | ||
*/ | ||
@Since("2.2.0") | ||
def toCSRMatrix: SparseMatrix = toSparseMatrix(colMajor = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toCSR
?
*/ | ||
@Since("2.2.0") | ||
def toDense: DenseMatrix = toDenseMatrix(colMajor = true) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add toColumnMajorDense
and toRowMajorDense
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto. should we consider to maintain the same layout?
* @param colMajor Whether the values of the resulting sparse matrix should be in column major | ||
* or row major order. If `false`, resulting matrix will be row major. | ||
*/ | ||
private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean @inline private[ml] def ...
? Do we expect this to be called often enough for that to make a difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough.
@dbtsai Let me know your thoughts on the comments I left. Thanks for the review! |
Test build #74477 has finished for PR 15628 at commit
|
Test build #74486 has started for PR 15628 at commit |
Test build #74479 has finished for PR 15628 at commit
|
Jenkins test this please. |
Test build #74552 has finished for PR 15628 at commit
|
ping @dbtsai! Code freeze is upcoming :) |
* Converts this matrix to a sparse matrix in column major order. | ||
*/ | ||
@Since("2.2.0") | ||
def toCSC: SparseMatrix = toSparseMatrix(colMajor = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not good at naming, but since we use toDenseRowMajor
for dense vector, should we use toSparseColumnMajor
? Almost many packages are using toCSC
, but I think we can make them consistent. Just my 2 cents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I have a preference. I don't mind leaving them as CSC and CSR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking about it again, let's have it as toSparseColumnMajor
to make the apis consistent with the dense ones if you don't mind?
* Converts this matrix to a sparse matrix in row major order. | ||
*/ | ||
@Since("2.2.0") | ||
def toCSR: SparseMatrix = toSparseMatrix(colMajor = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question.
* or row major order. If `false`, resulting matrix will be row major. | ||
*/ | ||
@Since("2.2.0") | ||
def compressed(colMajor: Boolean): Matrix = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make it private, and follow the previous style. Should add compressedRowMajor
, compressedColumnMajor
since it can be dense matrix in certain situations.
*/ | ||
private[ml] override def toDenseMatrix(colMajor: Boolean): DenseMatrix = { | ||
if (!(isTransposed ^ colMajor)) { | ||
val newValues = new Array[Double](numCols * numRows) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simpler to use foreachActive
? With it, both toDenseMatrix
can have the same implementation for sparse and dense in trait.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can put this in the trait, since when it is called on dense matrix we would like to return this
in some cases (when no layout change is needed). But, yes, I think it simpler to use toArray
which calls foreachActive
. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following could work, and we only need one implementation in trait. Thanks.
trait Matrix {
var isTransposed: Boolean = true
var numCols: Int = 0
var numRows: Int = 0
def foreachActive(f: (Int, Int, Double) => Unit): Unit
def toDenseMatrix(colMajor: Boolean): Matrix = {
this match {
case _: DenseMatrix if this.isTransposed != colMajor =>
this
case _: SparseMatrix | _: DenseMatrix if this.isTransposed == colMajor =>
val newValues = new Array[Double](numCols * numRows)
this.foreachActive { case (row, col, value) =>
// filling the newValues
}
new DenseMatrix(numRows, numCols, newValues, isTransposed = !colMajor)
case _ =>
throw new IllegalArgumentException("")
}
}
}
class DenseMatrix extends Matrix {
def foreachActive(f: (Int, Int, Double) => Unit): Unit = {
}
}
class SparseMatrix extends Matrix {
def foreachActive(f: (Int, Int, Double) => Unit): Unit = {
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I don't think this solution is better. The entire point of abstract methods is to allow subclasses to implement a method differently. Since we need different implementations depending on the subclass, we should just implement them in the subclasses. We can do this with the following:
trait Matrix {
def toDenseMatrix(colMajor: Boolean): Matrix
}
class DenseMatrix extends Matrix {
private[ml] override def toDenseMatrix(colMajor: Boolean): DenseMatrix = {
if (isTransposed && colMajor) {
new DenseMatrix(numRows, numCols, toArray, isTransposed = false)
} else if (!isTransposed && !colMajor) {
new DenseMatrix(numRows, numCols, transpose.toArray, isTransposed = true)
} else {
this
}
}
}
class SparseMatrix extends Matrix {
private[ml] override def toDenseMatrix(colMajor: Boolean): DenseMatrix = {
if (colMajor) new DenseMatrix(numRows, numCols, toArray)
else new DenseMatrix(numRows, numCols, this.transpose.toArray, isTransposed = true)
}
}
Which is less verbose than the previous code. I'm going to put that in the next commit. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great to me!
new DenseMatrix(numRows, numCols, toArray) | ||
private[ml] override def toSparseMatrix(colMajor: Boolean): SparseMatrix = { | ||
if (!(colMajor ^ isTransposed)) { | ||
// breeze transpose rearranges values in column major and removes explicit zeros |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's hacky to use breeze's transpose behavior to remove zeros in sparse matrices. Can we have our own implementation given we're potentially remove breeze?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pretty much agree with you, but this is non-trivial code if we want to do it efficiently. Breeze has a pretty well-optimized implementation to do this. I would leave it as a follow up JIRA, or do it when/if we ever remove the Breeze dependency. Or do you think this is a blocker for this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a blocker.
val breezeTransposed = asBreeze.asInstanceOf[BSM[Double]] | ||
Matrices.fromBreeze(breezeTransposed).asInstanceOf[SparseMatrix] | ||
} | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we document here that it's when the layout of this and colMajor is different? Easier read than (colMajor ^ isTranspose)
condition here. Even more readable to use pattern matching with exact boolean on both variables.
assert(!dm3.isTransposed) | ||
assert(dm3.values.equals(dm1.values)) | ||
|
||
val dm4 = dm1.toDenseMatrix(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to make toDenseMatrix
as private, and we test against toDenseRowMajor
which is more explicit.
* Converts this matrix to a sparse matrix in column major order. | ||
*/ | ||
@Since("2.2.0") | ||
def toSparse: SparseMatrix = toSparseMatrix(colMajor = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm debating that should we keep the same ordering of layout when we call toSparse
or toDense
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, this would change the behavior of external facing code. Before if I called toSparse
on a row major matrix, I'd get a column major matrix. If we maintain the layout, then I'd now get something different (row major). Otherwise, I'd agree it is best to maintain the layout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I thought this is a new api being added, so we can make it to maintain the same layout.
1.0 -3.0 -8.0 | ||
*/ | ||
val dm1 = new DenseMatrix(2, 3, Array(4.0, -1.0, 2.0, 7.0, -8.0, 4.0)) | ||
val dm2 = new DenseMatrix(2, 3, Array(5.0, -9.0, 4.0, 1.0, -3.0, -8.0), isTransposed = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just make dm2
dm1.transposed, but explicitly assign the value? Thus, you don't need to type the value in the array for the comparison.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand your meaning here. These are made to be two entirely different matrices anyway.
*/ | ||
@Since("2.2.0") | ||
def toDense: DenseMatrix = toDenseMatrix(colMajor = true) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto. should we consider to maintain the same layout?
toDenseMatrix(!isTransposed) | ||
} else { | ||
if (cscSize <= csrSize) { | ||
toSparseMatrix(colMajor = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toSparseColumnMajor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
if (cscSize <= csrSize) { | ||
toSparseMatrix(colMajor = true) | ||
} else { | ||
toSparseMatrix(colMajor = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toSparseRowMajor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
val csrSize = getSparseSizeInBytes(colMajor = false) | ||
if (getDenseSizeInBytes < math.min(cscSize, csrSize)) { | ||
// dense matrix size is the same for column major and row major, so maintain current layout | ||
toDenseMatrix(!isTransposed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call toDense
if we decide to make toDense
and toSparse
outputting the same layout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the change here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this.toDense
* Generate a `DenseMatrix` from this `DenseMatrix`. | ||
* | ||
* @param colMajor Whether the resulting `DenseMatrix` values will be in column major order. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor, can we have
protected def isColMajor = !isTransposed
protected def isRowMajor = isTransposed
So the code can be understood easier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I added these methods. I updated the test suites to use them instead of isTransposed
.
} | ||
new SparseMatrix(numRows, numCols, colPtrs, rowIndices.result(), spVals.result()) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we override the toArray
in DenseMatrix so when this
is column major, we just return this.values
? Otherwise, it's very expensive to create a new array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to say I'm a bit surprised - I thought that's what toArray
already did! Yeah, that seems like a good change, but I'd prefer to do it in another pr because we need to make sure that this doesn't adversely affect other places that use toArray
as well as adding unit tests. If that sounds ok, I'll make a JIRA for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Let's do it in another PR. Thanks.
Test build #75116 has finished for PR 15628 at commit
|
Test build #75123 has finished for PR 15628 at commit
|
Test build #75122 has finished for PR 15628 at commit
|
Test build #75144 has finished for PR 15628 at commit
|
Test build #75147 has started for PR 15628 at commit |
} | ||
new SparseMatrix(numRows, numCols, colPtrs, rowIndices.result(), spVals.result()) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Let's do it in another PR. Thanks.
val csrSize = getSparseSizeInBytes(colMajor = false) | ||
if (getDenseSizeInBytes < math.min(cscSize, csrSize)) { | ||
// dense matrix size is the same for column major and row major, so maintain current layout | ||
toDenseMatrix(!isTransposed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the change here.
toDenseMatrix(!isTransposed) | ||
} else { | ||
if (cscSize <= csrSize) { | ||
toSparseMatrix(colMajor = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
if (cscSize <= csrSize) { | ||
toSparseMatrix(colMajor = true) | ||
} else { | ||
toSparseMatrix(colMajor = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
* @param colMajor Whether the resulting `DenseMatrix` values are in column major order. | ||
*/ | ||
private[ml] override def toDenseMatrix(colMajor: Boolean): DenseMatrix = { | ||
if (colMajor) new DenseMatrix(numRows, numCols, toArray) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new DenseMatrix(numRows, numCols, this.toArray, isTransposed = false)
to make the style consistent.
assert(cm1 === sm1) | ||
assert(!cm1.isTransposed) | ||
assert(cm1.values.equals(sm1.values)) | ||
assert(cm1.getSizeInBytes <= sm1.getSizeInBytes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto. remove =
assert(cm2.isTransposed) | ||
// forced to be row major, so we have increased the size | ||
assert(cm2.getSizeInBytes > sm1.getSizeInBytes) | ||
assert(cm2.getSizeInBytes <= sm1.toDense.getSizeInBytes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
=
is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's ok to leave it here since they could potentially be equal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If =, dense should be preferred since it will be faster. As a result, we should only check <.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, before we broke ties with sparse, so I changed it to choose dense, and added unit tests. Also removed the = here.
assert(cm8 === sm2) | ||
assert(!cm8.isTransposed) | ||
assert(cm8.getSizeInBytes > sm2.getSizeInBytes) | ||
assert(cm8.getSizeInBytes <= sm2.toDense.getSizeInBytes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For completion, could you test sm1. compressedColumnMajor
and sm2. compressedRowMajor
? Thanks.
val cm4 = sm3.compressed.asInstanceOf[DenseMatrix] | ||
assert(cm4 === sm3) | ||
assert(!cm4.isTransposed) | ||
assert(cm4.getSizeInBytes <= sm3.getSizeInBytes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove =
?
val cm5 = sm3.compressedRowMajor.asInstanceOf[DenseMatrix] | ||
assert(cm5 === sm3) | ||
assert(cm5.isTransposed) | ||
assert(cm5.getSizeInBytes <= sm3.getSizeInBytes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove =
? and sm3.compressedColumnMajor
?
Test build #75155 has started for PR 15628 at commit |
@dbtsai Thanks for the good suggestions. I rearranged the test suites, removed redundancies, and filled in some gaps. Things got a bit jumbled when changing some of the methods around. I also added the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Couple small comments.
* Converts this matrix to a sparse matrix while maintaining the layout of the current matrix. | ||
*/ | ||
@Since("2.2.0") | ||
def toSparse: SparseMatrix = toSparseMatrix(colMajor = !isTransposed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def toSparse: SparseMatrix = toSparseMatrix(colMajor = isColMajor)
* Converts this matrix to a dense matrix while maintaining the layout of the current matrix. | ||
*/ | ||
@Since("2.2.0") | ||
def toDense: DenseMatrix = toDenseMatrix(colMajor = !isTransposed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def toDense: DenseMatrix = toDenseMatrix(colMajor = isColMajor)
} | ||
|
||
private[ml] def getDenseSize(numCols: Long, numRows: Long): Long = { | ||
// 8 * values.length + 12 + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you document what is the magical number 12 + 1
? Also, can we make it
java.lang.Double.BYTES * numCols * numRows + 13L
since the size of primitive type can depend on the implementation of JVM.
|
||
private[ml] def getSparseSize(numActives: Long, numPtrs: Long): Long = { | ||
// 8 * values.length + 4 * rowIndices.length + 4 * colPtrs.length + 12 + 12 + 12 + 1 | ||
12L * numActives + 4L * numPtrs + 37L |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(java.lang.Double.BYTES + java.lang.Integer.BYTES) * numActives + java.lang.Integer.BYTES * numPtrs + 37L
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice that we can get 37L
using java apis to ensure the portability.
Test build #75167 has finished for PR 15628 at commit
|
Thanks for finally cooperating, Jenkins! |
Thanks @sethah and Jenkins! Merged into master. |
Thanks for all the time reviewing! |
Test build #75171 has finished for PR 15628 at commit
|
@@ -148,7 +154,8 @@ sealed trait Matrix extends Serializable { | |||
* and column indices respectively with the type `Int`, and the final parameter is the | |||
* corresponding value in the matrix with type `Double`. | |||
*/ | |||
private[spark] def foreachActive(f: (Int, Int, Double) => Unit) | |||
@Since("2.2.0") | |||
def foreachActive(f: (Int, Int, Double) => Unit): Unit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sethah @dbtsai Hi all, just saw this during QA. This method is not very Java-friendly. I'm OK with adding it as long as we document the fact that it's not Java-friendly. We could also consider adding a Java-friendly version, perhaps using https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/function/Function2.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should do the same for foreachActive
in Vector which was already a public api long time ago.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. I guess we can leave it until someone complains & add a Java-friendly one as needed.
### What changes were proposed in this pull request? This PR upgrades Netty from 4.1.126.Final to 4.1.127.Final and netty-tcnative from 2.0.72.Final to 2.0.73.Final. ### Why are the changes needed? This version fixes a BouncyCastle-related regression introduced in Netty 4.1.126.Final: - BouncyCastleAlpnSslUtils needs to use the correct SSLEngine class as otherwise it will fail to init static fields. ([#15628](netty/netty#15628)) The full release notes as follows: - https://netty.io/news/2025/09/08/4-1-127-Final.html ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass Github Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #52356 from LuciferYang/SPARK-53599. Authored-by: yangjie01 <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
This patch adds a
compressed
method to MLMatrix
class, which returns the minimal storage representation of the matrix - either sparse or dense. Because the space occupied by a sparse matrix is dependent upon its layout (i.e. column major or row major), this method must consider both cases. It may also be useful to force the layout to be column or row major beforehand, so an overload is added which takes in acolumnMajor: Boolean
parameter.The compressed implementation relies upon two new abstract methods
toDense(columnMajor: Boolean)
andtoSparse(columnMajor: Boolean)
, similar to the compressed method implemented in theVector
class. These methods also allow the layout of the resulting matrix to be specified via thecolumnMajor
parameter. More detail on the new methods is given below.How was this patch tested?
Added many new unit tests
New methods (summary, not exhaustive list)
Matrix trait
private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix
(abstract) - converts the matrix (either sparse or dense) to dense formatprivate[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix
(abstract) - converts the matrix (either sparse or dense) to sparse formatdef toDense: DenseMatrix = toDense(true)
- converts the matrix (either sparse or dense) to dense format in column major layoutdef toSparse: SparseMatrix = toSparse(true)
- converts the matrix (either sparse or dense) to sparse format in column major layoutdef compressed: Matrix
- finds the minimum space representation of this matrix, considering both column and row major layouts, and converts itdef compressed(columnMajor: Boolean): Matrix
- finds the minimum space representation of this matrix considering only column OR row major, and converts itDenseMatrix class
private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix
- converts the dense matrix to a dense matrix, optionally changing the layout (data is NOT duplicated if the layouts are the same)private[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix
- converts the dense matrix to sparse matrix, using the specified layoutSparseMatrix class
private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix
- converts the sparse matrix to a dense matrix, using the specified layoutprivate[ml] def toSparseMatrix(columnMajors: Boolean): SparseMatrix
- converts the sparse matrix to sparse matrix. If the sparse matrix contains any explicit zeros, they are removed. If the layout requested does not match the current layout, data is copied to a new representation. If the layouts match and no explicit zeros exist, the current matrix is returned.