SPARK-7856 Principal components and variance using computeSVD() #17907
Conversation
The previous computePrincipalComponentsAndExplainedVariance() evaluates the covariance matrix in a local Breeze matrix, causing OutOfMemory exceptions for tall and fat matrices. The decomposition of the matrix X - mean(X) provides the eigenvectors and eigenvalues of the covariance matrix:

X = U S V'                  (1)
X' = V S' U'
X'X = V S'U'U S V'
X'X = V S'S V'              // U'U = I
(X'X) V = V (S'S) (V'V)
(X'X) V = V (S'S)           // V'V = I
A V = V lambda              // if A = X'X, V is the eigenvector matrix of X'X and can be obtained from (1)
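As a rough illustration of that derivation (a minimal sketch, not the actual patch code; `pcaViaSvd` is a hypothetical helper name): center the rows, run the distributed SVD, and read the principal components off V and the eigenvalues off the squared singular values.

```scala
import org.apache.spark.mllib.linalg.{Matrix, Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Sketch only: PCA of a RowMatrix via the distributed SVD of the centered matrix.
def pcaViaSvd(mat: RowMatrix, k: Int): (Matrix, Vector) = {
  val mean = mat.computeColumnSummaryStatistics().mean.toArray
  // Centering necessarily produces dense rows, even for sparse input.
  val centered = new RowMatrix(mat.rows.map { row =>
    Vectors.dense(row.toArray.zip(mean).map { case (x, m) => x - m })
  })
  val svd = centered.computeSVD(k, computeU = false)
  // Columns of V are the principal components; s_i^2 are the eigenvalues of (X - µ)'(X - µ).
  (svd.V, Vectors.dense(svd.s.toArray.map(s => s * s)))
}
```

Something like pcaViaSvd(new RowMatrix(rows), 10) would then return the top 10 components and their eigenvalues.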
srowen
left a comment
I'm concerned that this could be much slower for a moderately large number of columns, like 1000 or so, especially in the sparse case. This makes the representation dense and then does a distributed SVD when currently it's handled fairly efficiently locally.
Vectors.dense(Arrays.copyOfRange(explainedVariance, 0, k)))

// X' = X - µ
def subPairs = (vPair: (Double, Double)) => vPair._1 - vPair._2
def subMean = (v: Vector) => Vectors.dense(v.toArray.zip(mean.toArray).map(subPairs))
This is a pretty inefficient way to subtract the mean, and it's going to make sparse data dense.
It's not possible to keep sparse vectors sparse if we center them to the origin. However, given your concern about efficiency, I could try centering the data with mllib.feature.StandardScaler (withMean = true, withStd = false) instead.
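For reference, a minimal sketch of the StandardScaler alternative mentioned above (the `centerRows` helper and the surrounding wiring are assumptions; the StandardScaler constructor and fit/transform calls are the real mllib API). Note that centering to zero mean still yields dense vectors for sparse input.

```scala
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical helper: subtract the column means without scaling the variance.
def centerRows(rows: RDD[Vector]): RDD[Vector] = {
  val model = new StandardScaler(withMean = true, withStd = false).fit(rows)
  rows.map(model.transform)
}
```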
if (k == n) {
  (Matrices.dense(n, k, u.data), Vectors.dense(explainedVariance))

// Check matrix is standarized with mean 0
The scaladoc comments are no longer consistent with the impl.
My understanding is that RowMatrix computes the SVD locally when the data allows it to be done efficiently, and otherwise does it in a distributed fashion. So the suggested implementation does NOT always rely on a distributed SVD; sometimes the SVD is computed locally, depending on the data. To my knowledge, the current implementation can't handle PCA on tall-and-fat matrices, as it quickly runs into OutOfMemory while computing the principal components from the eigenvector decomposition of X'X using Breeze SVD. Moreover, SVD on X is preferred over the eigendecomposition of X'X for numerical reasons (https://math.stackexchange.com/questions/359397/why-svd-on-x-is-preferred-to-eigendecomposition-of-xx-top-in-pca), so the SVD-based PCA wouldn't crash as easily as computing the eigenvectors of X'X locally. This implementation might be a bit slower than the current one (not verified), but it adds much more stability. See also https://spark.apache.org/docs/2.1.0/mllib-dimensionality-reduction.html
Yes, it is likely more accurate not to base the PCA on the Gramian. However, the current approach is probably going to be more efficient than what the SVD method does, even when the SVD operates locally. If this change makes other cases very slow, that could be just as bad. Still, getting it to work with more than ~65535 columns is of course a good thing.

How much memory does your driver have? At the size you're computing, the matrix should only take < 1GB of memory even with some overhead. That isn't large enough to run out of memory, assuming you've given your driver more than the default amount. The real question is scaling past 65535 columns, which isn't possible no matter what the heap size is. But then the question is what happens in the regime of thousands of columns -- your change may make it a lot slower when it's pretty reasonable to compute locally. It could be that a threshold is needed here: above some size it's probably better to distribute, versus computing the Gramian locally, but we don't know what that scale is.
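A minimal sketch of that threshold idea, assuming the SVD-based path is passed in as a function and using an arbitrary placeholder cutoff (localColumnLimit = 10000) that would need benchmarking to pin down:

```scala
import org.apache.spark.mllib.linalg.{Matrix, Vector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Hypothetical dispatcher: choose between the local Gramian path and a
// distributed SVD path based on the number of columns.
def principalComponentsWithThreshold(
    mat: RowMatrix,
    k: Int,
    pcaViaDistributedSvd: (RowMatrix, Int) => (Matrix, Vector),
    localColumnLimit: Int = 10000): (Matrix, Vector) = {
  if (mat.numCols() <= localColumnLimit) {
    // Small/medium n: keep the existing local Gramian / eigendecomposition path.
    mat.computePrincipalComponentsAndExplainedVariance(k)
  } else {
    // Large n: fall back to the distributed, SVD-based path proposed in this PR.
    pcaViaDistributedSvd(mat, k)
  }
}
```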
With classic Spark PCA, an approx. 55K x 15K matrix, and 10GB in the driver, I run out of memory. When I chopped the matrix down to 55K x 3K I could get the PCA. With the distributed SVD approach I could compute the PCA on the original matrix, and it took about 7 minutes to complete. I also think that if the local approach is faster for small to medium matrices, we should be able to set a threshold to choose between the two PCA computations. Separately, I'm working in my spare time on a Probabilistic PCA with EM. That should scale pretty well and converge quickly, and some flavors allow missing values in the matrix. But that's a separate business.
Hm, that sounds like a whole lot more memory being used than I'd imagine. How are you running the driver, and are you certain it's the driver that runs out of memory? Do you have timings for local vs distributed at a scale that works in both cases?
Originally the driver was set to 3GB, but since I was hitting this OutOfMemory in the driver I decided to try increasing it. For other benchmarks, I think I need more time.
Can one of the admins verify this patch?
Closes apache#20932
Closes apache#17843
Closes apache#13477
Closes apache#14291
Closes apache#20919
Closes apache#17907
Closes apache#18766
Closes apache#20809
Closes apache#8849
Closes apache#21076
Closes apache#21507
Closes apache#21336
Closes apache#21681
Closes apache#21691

Author: Sean Owen <[email protected]>

Closes apache#21708 from srowen/CloseStalePRs.
What changes were proposed in this pull request?
The current implementation of computePrincipalComponentsAndExplainedVariance in RowMatrix usually crashes for tall and fat matrices, as it computes the covariance matrix locally.
Instead, this patch uses the existing RowMatrix.computeSVD, which is already optimized for big matrices, to compute the principal components and explained variance.
It is known that for a matrix X with mean µ and covariance (X - µ)'(X - µ), the centered matrix (X - µ) can be decomposed with SVD such that
(X - µ) = USV'
and
(X - µ)' = VS'U'
U and V are orthonormal, therefore U'U = I and V'V = I. Then
cov = (X - µ)'(X - µ) = VS'U'USV' = VS'SV'
and
cov V = V(S'S), so V contains the eigenvectors of the covariance matrix and the diagonal of S'S (the squared singular values) contains the eigenvalues.
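As a small sketch of how those eigenvalues would translate into the explained-variance vector (the helper name is hypothetical; note that if only the top k singular values are computed, the fractions are relative to those k components rather than the full variance):

```scala
// Hypothetical helper: turn singular values into explained-variance fractions,
// e.g. explainedVarianceFractions(svd.s.toArray).
def explainedVarianceFractions(singularValues: Array[Double]): Array[Double] = {
  val eigenvalues = singularValues.map(s => s * s)
  val total = eigenvalues.sum
  eigenvalues.map(_ / total)
}
```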
How was this patch tested?
This patch has been tested by running the current RowMatrixSuite, mllib.PCASuite and ml.PCASuite, passing all the tests.
Also, this patch made it possible to run PCA over a 56k x 12k matrix without OutOfMemory errors (not included in the tests, as it takes a long time to execute and the matrix is generated from a private dataset).
Please review http://spark.apache.org/contributing.html before opening a pull request.