[SPARK-2309][MLlib] Generalize the binary logistic regression into multinomial logistic regression #1379
Conversation
|
QA tests have started for PR 1379. This patch merges cleanly. |
|
QA results for PR 1379: |
|
Jenkins, retest this please. |
|
I think it fails because the Apache license header is not in the test data file. As you suggested, I'll change it so the data is generated at runtime. I'd like to hear general feedback first; I'll make the tests pass tomorrow. Thanks. |
|
QA tests have started for PR 1379. This patch DID NOT merge cleanly! |
|
QA results for PR 1379: |
|
It is easier to review if it passes the tests. @SparkQA shows new public classes and interface changes. Could you remove the data file and generate some synthetic data for unit tests? Thanks! |
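As an illustration only (not code from this PR), a tiny runtime generator for such a unit test might look like the sketch below; the object name and the use of `LabeledPoint` are assumptions.

```scala
import scala.util.Random
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical helper: generates synthetic multinomial data at runtime,
// so no license-headed data file needs to be checked into the repository.
object MultinomialDataGenerator {
  def generate(numPoints: Int, numFeatures: Int, numClasses: Int, seed: Long): Seq[LabeledPoint] = {
    val rnd = new Random(seed)
    // One random weight vector per class.
    val weights = Array.fill(numClasses, numFeatures)(rnd.nextGaussian())
    (0 until numPoints).map { _ =>
      val x = Array.fill(numFeatures)(rnd.nextGaussian())
      // Label = class whose linear score is largest (noise-free for simplicity).
      val scores = weights.map(w => w.zip(x).map { case (wi, xi) => wi * xi }.sum)
      val label = scores.indexOf(scores.max).toDouble
      LabeledPoint(label, Vectors.dense(x))
    }
  }
}
```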
|
@mengxr Is there any problem with asfgit? This PR is not finished yet, so why does asfgit say it's merged into apache:master? |
|
... I have no idea. Let me check. |
|
@pwendell I didn't see |
|
What is the current state of the PR? Can't see any changes in the code... |
|
@BigCrunsh I'm working on this. Let's see if we can merge in Spark 1.2 |
|
@dbtsai Hi! What is the current state of the PR? I would like to download and test it. Could you point me to the sources? |
|
Apparently, I've found this implementation: https://github.com/dbtsai/spark/tree/dbtsai-mlor. It did work on my examples, producing reasonable results. Could you comment on the following: why is the number of parameters (weights) equal to (num_features + 1) * (num_classes - 1)? I would expect (num_features + 1) * (num_classes), as it is here, for example: http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression |
|
@avulanov I will merge this in Spark 1.3; sorry for the delay, I've been very busy recently. Yes, the branch you found should work, but it cannot be merged cleanly into upstream, and I'm working on that. You can try that branch for now. Also, that branch doesn't use LBFGS as the optimizer, so the convergence rate will be slow. Basically, you can model the whole problem with (num_features + 1) * (num_classes) parameters, but then the solution is not unique. You can choose one of the classes as the base class to make the solution unique; I chose the first class as the base class. See |
|
@dbtsai Thanks for the explanation! Do I understand correctly that if I want to get (num_features + 1) * (num_classes) parameters from your model, I need to concatenate a vector of length (num_features + 1) with zeros at the beginning of the vector that your model returns with |
|
No, in the algorithm I already model the problem as in http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/24 , so there will always be only (num_features + 1) * (num_classes - 1) parameters. Of course, you can choose any transformation to over-parameterize it, see |
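For readers following the thread, a sketch of the pivoted parameterization being discussed (class 0 as the base class, $x$ augmented with a constant 1 for the intercept, $K$ = num_classes) is, up to notation:

$$
P(y = 0 \mid x) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{x^{\top} w_j}}, \qquad
P(y = k \mid x) = \frac{e^{x^{\top} w_k}}{1 + \sum_{j=1}^{K-1} e^{x^{\top} w_j}}, \quad k = 1, \dots, K-1.
$$

Only the $K-1$ vectors $w_1, \dots, w_{K-1}$ (each of length num_features + 1) are free parameters; prepending a zero vector $w_0 = 0$ recovers the over-parameterized softmax form with $K$ weight vectors.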
|
@dbtsai I've tried your implementation with |
|
@avulanov Sure, it would be interesting to see the comparison. Let me know the result once you have it. I'm going to get it merged in 1.3, so it will be easier to use in the future. |
|
@dbtsai Here are the results of my tests:
It seems that ANN is almost 2x faster (with the mentioned settings), though its accuracy is 1.6% lower. The difference in accuracy can be explained by the fact that ANN uses the (half) squared error cost function instead of cross entropy, and no softmax; the latter two are supposed to be better for classification. |
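For context, the two per-example objectives being contrasted, written for predicted class outputs $\hat{p}_k$ and a one-hot target $y$, are roughly:

$$
L_{\text{cross-entropy}} = -\sum_{k=1}^{K} y_k \log \hat{p}_k
\qquad \text{vs.} \qquad
L_{\text{squared}} = \frac{1}{2} \sum_{k=1}^{K} \left(\hat{p}_k - y_k\right)^2 .
$$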
|
@avulanov I did a couple of performance tunings in the MLOR gradient calculation in my company's proprietary implementation, which is about 4x faster than the open-source one on GitHub that you tested. I'm trying to open-source it and merge it into Spark soon. (PS: simple polynomial expansion with MLOR increased the mnist8m accuracy from 86% to 94% in my experiment. See Prof. C.J. Lin's talk: https://www.youtube.com/watch?v=GCIJP0cLSmU ) |
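As a rough illustration of the kind of feature expansion mentioned (not code from the PR; the helper below is hypothetical and limited to degree 2 on dense features):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical degree-2 polynomial expansion: keep the original terms and add
// all pairwise products. The expanded points are then trained with MLOR as
// usual, e.g. trainingData.map(expandDegree2).
def expandDegree2(p: LabeledPoint): LabeledPoint = {
  val x = p.features.toArray
  val pairwise = for {
    i <- x.indices
    j <- i until x.length
  } yield x(i) * x(j)
  LabeledPoint(p.label, Vectors.dense(x ++ pairwise))
}
```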
|
@avulanov Nice tests! A few comments:
|
|
@dbtsai 1) Could you elaborate on what kind of optimizations you did? Perhaps they could be applied to the broader MLlib, which would be beneficial. 2) Do you know why our ANN implementation was faster than the MLOR you shared? This could also be interesting in terms of MLlib optimization. 3) Did you mean fitting an n-th degree polynomial instead of a linear function? Thanks for the link, it looks very interesting! |
|
@jkbradley Thank you! They took some time.
|
|
|
@dbtsai Thank you, I look forward to your code so I can run benchmarks. Thanks again for the video! I enjoyed it, especially the Q&A after the talk. At 51:23 Prof. C.J. Lin mentions that "we released a dataset of about 600 gigabytes". Do you know where I can download it? It should be quite a challenging workload for classification in Spark! Update: is it this one? http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#splice-site |
|
@avulanov I remember C.J. Lin said he posted the 600GB dataset on his website. |
|
@dbtsai Hi! Did you have a chance to check our implementation and send me the optimized one? |
|
@avulanov I haven't checked your implementation yet, but the optimized MLOR is ready for you to test. Can you try the following gradient?

// Imports, assuming the class is compiled outside org.apache.spark.mllib.optimization.
import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.mllib.linalg.{DenseVector, Vector, Vectors}
import org.apache.spark.mllib.optimization.Gradient

@DeveloperApi
class LogisticGradient extends Gradient {

  override def compute(data: Vector, label: Double, weights: Vector): (Vector, Double) = {
    val gradient = Vectors.zeros(weights.size)
    val loss = compute(data, label, weights, gradient)
    (gradient, loss)
  }

  override def compute(
      data: Vector,
      label: Double,
      weights: Vector,
      cumGradient: Vector): Double = {
    assert((weights.size % data.size) == 0)
    val dataSize = data.size

    // n is the number of weight blocks; (n + 1) is the number of classes,
    // since class 0 is the base (pivot) class and carries no weights.
    val n = weights.size / dataSize
    val numerators = Array.ofDim[Double](n)
    var denominator = 0.0
    var margin = 0.0

    val weightsArray = weights match {
      case dv: DenseVector => dv.values
      case _ =>
        throw new IllegalArgumentException(
          s"weights only supports dense vector but got type ${weights.getClass}.")
    }
    val cumGradientArray = cumGradient match {
      case dv: DenseVector => dv.values
      case _ =>
        throw new IllegalArgumentException(
          s"cumGradient only supports dense vector but got type ${cumGradient.getClass}.")
    }

    // First pass: compute the margin x^T w_i for each non-base class,
    // remembering the margin of the labeled class and accumulating the
    // shared softmax denominator.
    var i = 0
    while (i < n) {
      var sum = 0.0
      data.foreachActive { (index, value) =>
        if (value != 0.0) sum += value * weightsArray((i * dataSize) + index)
      }
      if (i == label.toInt - 1) margin = sum
      numerators(i) = math.exp(sum)
      denominator += numerators(i)
      i += 1
    }

    // Second pass: add this example's gradient contribution into cumGradient,
    // one class block at a time.
    i = 0
    while (i < n) {
      val multiplier = numerators(i) / (denominator + 1.0) - {
        if (label != 0.0 && label == i + 1) 1.0 else 0.0
      }
      data.foreachActive { (index, value) =>
        if (value != 0.0) cumGradientArray(i * dataSize + index) += multiplier * value
      }
      i += 1
    }

    // Loss: log(1 + sum_i exp(margin_i)) minus the margin of the true class
    // (the base class 0 has margin 0, so nothing is subtracted for it).
    if (label > 0.0) {
      math.log1p(denominator) - margin
    } else {
      math.log1p(denominator)
    }
  }
} |
|
@avulanov PS, you can just replace the gradient function without making any other changes. Let me know how much performance gain you see; I'm very interested in this. Thanks. |
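A minimal sketch of how such a gradient plugs into MLlib's LBFGS optimizer, assuming `training: RDD[(Double, Vector)]` of (label, features) pairs already exists, with `numFeatures` and `numClasses` known and arbitrary hyperparameters (append a constant 1.0 to the features first if an intercept is wanted):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, SquaredL2Updater}

// The flat weight vector holds one block of numFeatures weights per non-base class.
val initialWeights = Vectors.dense(new Array[Double](numFeatures * (numClasses - 1)))

val (weights, lossHistory) = LBFGS.runLBFGS(
  training,
  new LogisticGradient(),   // the multinomial gradient shown above
  new SquaredL2Updater(),   // L2 regularization
  10,                       // number of corrections
  1e-4,                     // convergence tolerance
  100,                      // max number of iterations
  0.01,                     // regularization parameter
  initialWeights)
```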
|
@dbtsai Thank you! Should I use the latest Spark with this Gradient? |
|
Yes, |
|
@dbtsai |
|
@avulanov The new branch is not finished yet. You need to rebase https://github.com/dbtsai/spark/tree/dbtsai-mlor to master, and just replace the gradient function. |
|
@dbtsai I did a local experiment on mnist, and your new implementation seems to be more than 2x faster than the previous one! I am going to run bigger experiments. In the meantime, could you tell me whether the optimizations you did are applicable to the ANN Gradient? That would be extremely helpful for us. https://github.com/bgreeven/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/ann/ArtificialNeuralNetwork.scala#L467 |
|
New results of the experiments with the optimized ANN and MLOR are below. I used the same cluster of 6 machines with 12 workers in total, the mnist8m dataset for training, and the standard mnist test set converted to 784 attributes.
The ANN became ~3x and the MLOR ~10x faster (!) than before. The current MLOR is ~60% faster than the current ANN. I assume the ANN has the following overheads: 1) it uses back-propagation, so there are two matrix-vector multiplications, on the forward and backward passes; 2) it rolls the parameters stored in matrices into vector form. I would be happy to know how these overheads can be reduced. We can't compare with the previously obtained accuracy because I used a different test set. |
|
@avulanov That is a very encouraging benchmark result on a real-world cluster setup. Since I've been on vacation recently, I haven't actually deployed the new code and benchmarked it on our cluster. Great to see such a huge 10x performance gain (actually bigger than I expected; in my local single-machine testing I only saw a 2~4x difference). What optimizations did you do in your ANN implementation? The same things as in MLOR? @mengxr Is it possible to reopen this closed PR on GitHub? There is a lot of useful discussion here, and I don't want to open another PR. I think I'm mostly done except for the unit tests, and I can push the code for review now, before our meeting. (PS, the new code is more general than the binary one, and has the same performance in the binary special case in my local testing.) |
|
@dbtsai I used my old implementation of the matrix form of back-propagation and made sure that it properly uses the stride of the matrices in Breeze. Also, I optimized rolling the parameters into a vector, combined with an in-place update of the cumulative sum. |
|
@dbtsai BTW, have you thought about batch processing of input vectors, i.e. stacking N vectors into a matrix and performing the computation with this matrix instead of vector by vector? With native BLAS enabled this might improve performance. |
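A rough Breeze sketch of the batching idea (names and layout are illustrative, not from either implementation): stacking a block of feature vectors as columns lets one BLAS GEMM call replace N matrix-vector products.

```scala
import breeze.linalg.DenseMatrix

// Build a numFeatures x batchSize matrix whose columns are the feature vectors
// of one batch (Breeze DenseMatrix is column-major, so the flattened arrays
// land column by column).
def stack(batch: Array[Array[Double]], numFeatures: Int): DenseMatrix[Double] = {
  new DenseMatrix(numFeatures, batch.length, batch.flatten)
}

// With W a (numClasses - 1) x numFeatures weight matrix, all margins of the
// batch come from a single matrix-matrix multiply:
// margins(k, j) = w_k . x_j for class k and example j.
def batchMargins(W: DenseMatrix[Double], X: DenseMatrix[Double]): DenseMatrix[Double] = {
  W * X
}
```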
|
@avulanov I've thought about that. However, @mengxr told me that they had an intern run this kind of experiment last year and didn't see a significant performance gain. I'm thinking of implementing the whole gradient function in native code/SIMD, batching the input vectors into a matrix, since for MLOR the computation of the objective function is very expensive. |
|
@dbtsai I did batching for artificial neural networks and the performance improved ~5x #1290 (comment) |
#1379 was automatically closed by asfgit, and GitHub cannot reopen it once it's closed, so this will be the new PR. Binary Logistic Regression can be extended to Multinomial Logistic Regression by running K-1 independent Binary Logistic Regression models. The following formula is implemented: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/25

Author: DB Tsai <[email protected]>

Closes #3833 from dbtsai/mlor and squashes the following commits:

4e2f354 [DB Tsai] triger jenkins
697b7c9 [DB Tsai] address some feedback
4ce4d33 [DB Tsai] refactoring
ff843b3 [DB Tsai] rebase
f114135 [DB Tsai] refactoring
4348426 [DB Tsai] Addressed feedback from Sean Owen
a252197 [DB Tsai] first commit
Currently, there is no multi-class classifier in MLlib. Logistic regression can be extended to a multinomial classifier straightforwardly.
The following formula will be implemented.
http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/25
Note: In multi-class mode there are multiple intercepts, so we don't use the single intercept in GeneralizedLinearModel; instead, all the intercepts are folded into the weights. This introduces some inconsistency: in binary mode, the intercept cannot be specified by users, but in multinomial mode, since the intercepts are combined into the weights, users can specify them. @mengxr Should we just deprecate the intercept and keep everything in weights? It makes sense from an optimization point of view, and it also makes the interface cleaner. Thanks.
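As an illustration of what folding the intercepts into the weights amounts to (a sketch, not code from the patch): each feature vector gets a constant 1.0 appended, and the flat weight vector then has (numFeatures + 1) entries per non-base class, the last of which plays the role of that class's intercept.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical helper: append a bias term so the intercept becomes just
// another weight. With K classes and class 0 as the base class, the flat
// weight vector then has (numFeatures + 1) * (K - 1) entries.
def appendBias(features: Vector): Vector = {
  Vectors.dense(features.toArray :+ 1.0)
}
```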