[WIP] Stochastic Gradient Descent with Restarts - Callback #3525
After playing around with this a bit more, I could somewhat reproduce the results of the original paper. It seems to me, though, that the main advantage of this approach is the aggressive learning rate scheduling; I'm not convinced that the restarts are beneficial. In fact, in my experiments the network started to overfit quite heavily after the first restart. I therefore think that this method should not be integrated into Keras until there is some kind of follow-up on this research showing that the restarts are actually useful.
I would be happy to know more about your test scenario, hyperparameters, and results.
If you experience faster convergence, then overfitting might naturally occur earlier. However, this would be very surprising to see (i.e., it has never been the case in my experience) right after 1 or 10 epochs on CIFAR-10/100 for ResNets or Wide ResNets.
To be clear, I in no way meant to imply that your results are not reproducible when I wrote that I 'somewhat' reproduced the results of the paper. I tested the scheduler with smaller models (WRN-40-2 and WRN-40-4), for which you don't have error/loss plots in your paper. I initially tested my implementation on CIFAR-10 and then switched to a non-public dataset of mine when the results looked promising. I can certainly confirm that the networks converge (both train and test loss) much faster when using the learning rate schedule proposed in the paper, but so far I haven't seen any clear benefits from the restarts. After having another look at the error plots in the paper, it seems that the restarts mostly helped on your biggest net and on CIFAR-100 (e.g. for your WRN-28-10 net on CIFAR-10, you obtained the best result with hyperparameters T_0 = 100 and T_mult = 1 just before the first restart, which is similar to the results in my initial tests). By the way, a big gotcha for me when applying your hyperparameters was that you flip the train images and concatenate them to the non-flipped images, effectively doubling the size of each epoch. I didn't do this initially, so applying your hyperparameters led to vastly different results than expected.
Edit: The overfitting on my dataset occurred after the first restart, and it started while the train loss was still much higher than it was before the restart. I.e., at the same train loss, the test loss was much higher after the first restart than before it.
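For readers following the discussion, here is a minimal sketch of the cosine-annealing schedule with warm restarts being discussed (written from the formula in the paper); the function names and default values are illustrative, not taken from this PR.

```python
import math

def sgdr_lr(epochs_since_restart, restart_period, lr_min=0.0, lr_max=0.05):
    """Cosine-annealed learning rate within one cycle of length `restart_period`."""
    frac = epochs_since_restart / float(restart_period)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * frac))

def sgdr_schedule(epoch, t0=100, t_mult=1, lr_min=0.0, lr_max=0.05):
    """Map a global epoch index to a learning rate, restarting every T_0 * T_mult**i epochs."""
    period = t0
    while epoch >= period:
        epoch -= period
        period *= t_mult
    return sgdr_lr(epoch, period, lr_min, lr_max)
```

With T_0 = 100 and T_mult = 1 (the setting mentioned above), this decays the learning rate from lr_max to lr_min over 100 epochs and then restarts.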
Thank you for the detailed reply.
That's a fair point. I'd certainly be willing to finish the implementation (i.e. add unit tests and so on) and reopen the pull request if @fchollet is interested in merging the callback.
I also came across this paper, wanted to experiment with it, and was in the process of translating the code into Keras. But trying to run the original Lasagne code, I ran into memory issues pretty quickly, even for scenario 1 with K=10. So I reduced K to 2, and the accuracy degradation becomes obvious and the learning slows. @loshchil: did you try this on just the simple, thin original ResNet? It would be easy to prove whether the gain is due to the extremely wide networks or to the restarts. At this point it is not that obvious to me.
The accuracy degradation is obviously due to the use of a smaller network. This would be the case no matter what learning approach you are using. Note that scenario 1, which you are using, is the default setting without restarts; as the README file says, "scenario #1 and #2 correspond to the original multi-step learning rate decay on CIFAR-10".
Which gain do you mean? A thin ResNet with K=1 would give you something around 6-7% error on CIFAR-10. Wider networks give better errors on the CIFAR datasets, around 4%; this is why they were originally proposed. Thus, the gain is due to the usage of a wider network. The paper says "We also achieved new state-of-the-art results with SGDR, mainly by using even wider WRNs."
Okay, that makes sense. But the quick accuracy degradation I observed for K=2 was worse than the original ResNet. So it would be nice to first show that you can reproduce the original ResNet case with K=1 and no restarts. I am saying this because it appears that the way you preprocess the images is a bit different from the way the original ResNet did, and you also mention in your paper that you are not using Nesterov momentum, so it might not be an equal comparison to begin with. In short, the early loss reduction can only be attributed to the restart scheduling if you start with the original ResNet with all preprocessing and parameters the same, show that you can reproduce it, and then change only the learning rate scheduling to show the difference. Also, I think there is already a learning rate scheduling callback in Keras, so this could be implemented as a scheduler callback function by a user (see the sketch below). You could modify the LearningRateScheduler callback that is already in Keras to add what you want, for instance to give you the current learning rate, which I actually did a pull request for. I didn't change the test, so currently it fails to build, but that is easy to change if @fchollet agrees to these kinds of changes to LearningRateScheduler.
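For what it's worth, a rough sketch of how such a schedule could be driven through the existing LearningRateScheduler callback; the schedule function here is my own simplified cosine annealing with T_mult = 1, not the code from this PR.

```python
import math
from keras.callbacks import LearningRateScheduler

def cosine_schedule(epoch, t0=100, lr_min=0.0, lr_max=0.05):
    t_cur = epoch % t0  # restart every t0 epochs (i.e. T_mult = 1)
    return float(lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t_cur / t0)))

lr_callback = LearningRateScheduler(cosine_schedule)
# model.fit(X_train, Y_train, nb_epoch=..., callbacks=[lr_callback])
```

This covers the epoch-level scheduling; batch-level updates and growing restart periods would still need a dedicated callback.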
Please show your data.
Please have a look at the paper. The "default" approach described in the paper (see Table 1 and the figures) already reproduces the results of Zagoruyko and Komodakis for WRN-28-10; less preprocessing and a simpler learning algorithm are only a plus.
I am not the author of this pull request, and I am not familiar enough with Keras to say which implementation suits better. However, the code of the current pull request looked neat to me.
I am not implying that the restarts are not helpful. I am saying it would be easier to show that the restarts help if I could reproduce the ResNet result for n=4 and K=1 and then, without changing anything else, change only the LR scheduling and see the effect of the restarts. I am using CIFAR-10, not my own data. Reducing the network to scenario #1 and K=1, which is the improved 28-layer ResNet, it struggles to learn beyond 70% test accuracy in the first 10 epochs. The whole advantage of ResNet is the relatively low number of parameters even for ultra-deep networks (>1000 layers). By making the networks 10x to 20x wider, WRNs make the number of parameters extremely large very quickly, which defeats the whole purpose of lowering the parameter count compared to plain networks such as VGG.
This is normal behaviour for this network and these parameters. Below, see the median from 5 runs. If you ran this WRN-28-1 until the end, you would get 7.3%, a slightly better result (it is an improved structure) than for the slightly bigger original ResNet-32.
It is not depth but performance versus time, memory, and ease of use that makes ResNets popular. Chances are that the typical number of parameters will grow, both for CIFAR and ImageNet, to the limits of GPUs. Sure, networks with low memory/time cost are always of interest, but it is the use of big networks that pushes the search for better network architectures. The latter, however, has little to do with the learning approach we discuss.
As you requested, see below the median results of 5 runs with WRN-28-1 and 50k images per epoch. Blue depicts the standard approach; red depicts SGDR. For both cases, the shown results are for the best initial learning rates: out of [0.1, 0.05, 0.025, 0.01] for the standard approach and out of [0.1, 0.05, 0.025] for SGDR. As you can see, the pattern is the same as for WRN-28-10 and WRN-28-20 given in the paper.
Hey @loshchil, thanks for producing the plots and for your replies. I am trying to reproduce your result in Keras, but it settled at ~89.2% (10.8% error). So I decided to just save your processed images and use the same minibatch generator, as I suspected the discrepancy could be in Keras's on-the-fly image data augmentation, which applies all transforms as a single homography matrix. In your case, you flip the images and concatenate them, essentially doubling the training data size, and then do the mean subtraction and on-the-fly random padding and cropping (see the sketch below). I am now running a network exactly like yours (I plotted your Lasagne network to PDF for comparison with my Keras network, also plotted). I will let you know the result and will put the plots for the losses and errors here. I am running your network in Lasagne and the Keras network side by side. From some of the early runs, it looks like the Keras training loss drops significantly faster than the Lasagne one, which stays around 0.9 for a while and drops significantly once the learning rate is reduced. Note: I am using the data you prepared and the same minibatch iterator. Also, all other parameters of the optimizer are the same: SGD with momentum, without Nesterov. I suspect that Keras's SGD is a bit different from Lasagne's momentum and weight decay. We will see when they finish. But my questions to you at this point are: Thanks for your time.
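As a side note for anyone reproducing this, the flip-and-concatenate preprocessing described above could look roughly like the following. This is my own sketch assuming channels-first CIFAR arrays, not the authors' exact code, and the per-batch random padding/cropping is omitted.

```python
import numpy as np

def prepare_train_set(X_train):
    """X_train: float array of shape (N, 3, 32, 32), channels first."""
    X_flipped = X_train[:, :, :, ::-1]            # horizontally flipped copies
    X_all = np.concatenate([X_train, X_flipped])  # 2N images -> one 'epoch' covers both
    mean = X_all.mean(axis=0, keepdims=True)      # per-pixel mean over the doubled set
    return X_all - mean, mean
```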
Thank you for doing experiments on both platforms. I would also suggest checking the original step decay of the learning rate: it might be that there is already a difference there, i.e., with the default approach, and then a difference is to be expected for the restarts as well.
Regarding 100k examples per epoch instead of 50k: I took the code
The Wide ResNet code follows the implementation of
I'm actually doing experiments using the step scheduling, not restarts yet. Also, when you have time, could you comment on your choice of the first ResNet unit? Thanks!
Whether it achieves 90% earlier matters only if it can later achieve better results, e.g. the 92-93% that is normal for that network. If not, then one could just use a smaller network.
Do you refer to the first ResNet unit? I am not familiar with Lua, but the code seems to match. If you think that the Lasagne code should be different, it would be best to know what exactly it should be, according to you.
@loshchil Yeah, I am talking about the first ResNet unit of the first block. You have conv-bn-relu-conv in the first unit, whereas all other implementations I saw have bn-conv-relu-bn-conv-relu, with just a single simple conv at the very beginning after the input. Even the Wide ResNet basic block you shared above follows this strictly. So, in that sense, your network and the Lua network you shared above don't match at the beginning. It might not make a whole lot of difference; I actually tested the Lua model in your Lasagne code and the difference is very minor.
As for the discrepancy between Keras and the Lasagne code, I didn't notice at first that you were applying L2 regularization to all the layers' parameters, and that was why the Keras model was overfitting quickly. To test that, I just removed the l2 penalty loss from your Lasagne loss and it exhibited a similar quick overfitting. Then I added L2 weight regularizers to the Keras layers (roughly as sketched below) and they quickly ran into a 'nan' loss after a few iterations. Moreover, I was never able to get the common behaviour of the loss dropping significantly after you lower the learning rate in Keras; the Lasagne optimizer shows that behaviour if you add the regularizer. I closely inspected the SGD of Lasagne and the SGD of Keras and there seems to be quite a difference. The Lasagne version is very straightforward: it directly invokes Theano functions and doesn't add any form of post-processing. The Keras version seems to do some clipping of the gradients after they are computed by Theano, there are constraints applied after that, and even the regular momentum computation is not as straightforward as in the Lasagne version. So it is really hard to tell what is going wrong. I will share the modular ResNet units and blocks I wrote, both in Keras and Lasagne, soon.
As for the early error drop, you were right: although I could get it to 90% test accuracy within 15 epochs, it saturated around 91.2% in the long run. Mind you, this is just one single run, not 5 runs. Thanks for your time.
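For reference, attaching per-layer L2 weight decay in the Keras (1.x) API under discussion looks roughly like this; 5e-4 is the value commonly used for WRNs, and the remaining layer arguments are placeholders.

```python
from keras.layers import Convolution2D
from keras.regularizers import l2

# weight decay attached directly to the layer's kernel, analogous to
# adding an l2 penalty term to the Lasagne loss
conv = Convolution2D(16, 3, 3, border_mode='same', W_regularizer=l2(5e-4))
```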
Thanks for your investigations. I guess the clipping should be optional; otherwise, depending on the parameters, it might be harmful, as you just found. If you have a chance to share your Lasagne/Keras code, please let me know.
@loshchil Finally, I was able to figure out the problem with the Keras SGD: setting the learning rate decay to zero occasionally produces an undesirable 'nan', so I had to comment out the exponential learning rate decay line in the SGD class in Keras's optimizers.py. With that, I was finally able to reproduce at least the WRN-28-4 network, both in Keras and Lasagne; here are the results and the model graphs for your comparison. The models are essentially the same visually. However, it looks like the Keras version has a slightly larger number of parameters; I don't know why yet.
@esube: That looks great! How did you determine that the difference in accuracy was caused by nans due to the weight decay? Any idea what's going on there? Did you test whether the same problem occurs without nvcc.fastmath? Also, would you mind sharing the code you used to reproduce the WRN-28-4 results with Keras?
@nebw I didn't say the difference in accuracy was due to 'nans'. As I said previously, I was unable to train the networks in Keras because of the conflict between the two learning rate decay schedules. If there is a 'nan', all you have to do is stop training and change things until you avoid the 'nan' :-) There is no accuracy after a 'nan'. I set the learning rate decay in the SGD instantiation to 0, 1e-6, etc. to remove its effect, as I only want the step-decay scheduling to take effect. Going that low produced 'nan' sporadically, as the exponential learning rate decay implemented in Keras divides by the decay. So I commented out that line and was able to train just fine (roughly the setup sketched below); I have tested this on several versions, even as deep as 74 layers, and it works just fine. Since the exponential decay is meant for when you don't have an additional learning rate scheduler, I think that part of the code needs to be guarded when you have your own callback or learning rate scheduler. I have tried it with and without the fastmath flag, so I don't think that is the problem. As for the code, I am sharing it as an example in a PR I am about to make for the step learning rate scheduler; keep an eye out for it in the Keras PRs.
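A sketch of the setup described above: plain SGD with momentum and no built-in decay, with the epoch-wise step decay handled entirely by a scheduler callback. The drop epochs and factors below are placeholders rather than the exact values used, and in the setup above the built-in decay update in optimizers.py was commented out rather than relied upon.

```python
from keras.optimizers import SGD
from keras.callbacks import LearningRateScheduler

sgd = SGD(lr=0.1, momentum=0.9, nesterov=False)

def step_decay(epoch, drops=(60, 120, 160), gamma=0.2, lr0=0.1):
    """Multiply the initial learning rate by `gamma` at each epoch listed in `drops`."""
    lr = lr0
    for d in drops:
        if epoch >= d:
            lr *= gamma
    return float(lr)

# model.compile(optimizer=sgd, loss='categorical_crossentropy')
# model.fit(..., callbacks=[LearningRateScheduler(step_decay)])
```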





I've implemented Stochastic Gradient Descent with Restarts [1] based on the authors' reference implementation [2]. I'm currently evaluating the performance on CIFAR.
Unit tests are still missing, but the paper has recently gained some attention on the machine learning subreddit, so I thought it might be a good idea to open the pull request right away in case someone wants to experiment with it (a rough sketch of the idea follows the references below).
[1] http://arxiv.org/abs/1608.03983
[2] https://github.com/loshchil/SGDR
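For anyone who wants to experiment before the final version lands, a minimal sketch of what such a callback could look like (epoch-level restarts only; the class name and defaults are illustrative and not necessarily what this PR implements):

```python
import math
from keras import backend as K
from keras.callbacks import Callback

class SGDRScheduler(Callback):
    """Cosine-annealed learning rate with warm restarts, applied per epoch."""

    def __init__(self, lr_max=0.05, lr_min=0.0, t0=10, t_mult=2):
        super(SGDRScheduler, self).__init__()
        self.lr_max, self.lr_min = lr_max, lr_min
        self.t_mult = t_mult
        self.period = t0                  # current restart period T_i
        self.epochs_since_restart = 0     # T_cur in the paper

    def on_epoch_begin(self, epoch, logs=None):
        frac = self.epochs_since_restart / float(self.period)
        lr = self.lr_min + 0.5 * (self.lr_max - self.lr_min) * (1.0 + math.cos(math.pi * frac))
        K.set_value(self.model.optimizer.lr, lr)

    def on_epoch_end(self, epoch, logs=None):
        self.epochs_since_restart += 1
        if self.epochs_since_restart >= self.period:  # warm restart
            self.epochs_since_restart = 0
            self.period *= self.t_mult
```

Usage would be along the lines of `model.fit(..., callbacks=[SGDRScheduler(lr_max=0.05, t0=10, t_mult=2)])`.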