Conversation

@nebw (Contributor) commented Aug 19, 2016

I've implemented Stochastic Gradient Descent with Restarts [1] based on the reference implementation of the authors [2]. I'm currently evaluating the performance on CIFAR.

Unit tests are still missing, but the paper has recently gained some attention on the machine learning subreddit, so I thought it might be a good idea to open the pull request right away in case someone wants to experiment with it.

[1] http://arxiv.org/abs/1608.03983
[2] https://github.com/loshchil/SGDR
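For anyone who wants to experiment before the PR lands, here is a minimal, hypothetical sketch of the paper's cosine annealing with warm restarts written as a Keras callback. The class name, argument names, and default values are illustrative only and are not the API of this pull request:

```python
import numpy as np
from keras import backend as K
from keras.callbacks import Callback


class SGDRSchedule(Callback):
    """Cosine-annealed learning rate with warm restarts (illustrative sketch).

    eta_max/eta_min bound the learning rate, t0 is the length of the first
    run in epochs, and t_mult stretches every subsequent run.
    """

    def __init__(self, eta_max=0.05, eta_min=1e-4, t0=10, t_mult=2):
        super(SGDRSchedule, self).__init__()
        self.eta_max = eta_max
        self.eta_min = eta_min
        self.t_i = t0        # length of the current run, in epochs
        self.t_mult = t_mult
        self.t_cur = 0       # epochs elapsed since the last restart

    def on_epoch_begin(self, epoch, logs=None):
        # eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t_cur / t_i))
        lr = self.eta_min + 0.5 * (self.eta_max - self.eta_min) * (
            1.0 + np.cos(np.pi * self.t_cur / self.t_i))
        K.set_value(self.model.optimizer.lr, lr)

    def on_epoch_end(self, epoch, logs=None):
        self.t_cur += 1
        if self.t_cur >= self.t_i:   # warm restart: jump back to eta_max
            self.t_cur = 0
            self.t_i *= self.t_mult
```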

@fchollet (Collaborator)

  • There is no documentation; no examples.
  • The name is not descriptive of what it does (nor does it make it easy to search Google for an explanation).
  • Most of the variables are inscrutable 2-letter abbreviations.

@nebw (Contributor, Author) commented Aug 22, 2016

After playing around with this a bit more, I could somewhat reproduce the results of the original paper. It seems to me, though, that the main advantage of this approach is the aggressive learning rate scheduling. I'm not convinced that the restarts are beneficial. In fact, in my experiments the network started to overfit quite heavily after the first restart. I therefore think that this method should not be integrated into Keras until there's some kind of follow-up on this research that shows that the restarts are actually useful...

@nebw nebw closed this Aug 22, 2016
@loshchil

After playing around with this a bit more, I could somewhat reproduce the results of the original paper.

I would be happy to know more about your test scenario, hyperparameters, and results.
The Lasagne code should be reproducible one-to-one, plus or minus noise.

In fact, in my experiments the network started to overfit quite heavily after the first restart.

If you experience faster convergence, then overfitting might naturally occur earlier. However, it would be very surprising (i.e., never the case in my experience) to see it right after 1 or 10 epochs on CIFAR-10/100 for ResNets or Wide ResNets.

@nebw (Contributor, Author) commented Aug 24, 2016

To be clear, I in no way meant to imply that your results are not reproducible when I wrote that I 'somewhat' reproduced the results of the paper. I tested the scheduler with smaller models (WRN40-2 and WRN40-4) for which you don't have error/loss plots in your paper. I initially tested my implementation on CIFAR-10 and then switched to a non-public dataset of mine when the results looked promising.

I can certainly confirm that the networks converge (both train and test loss) much faster when using the learning rate schedule proposed in the paper, but so far I haven't seen any clear benefits from the restarts. After having a look at the error plots in the paper again, it seems that the restarts mostly helped on your biggest net and on CIFAR-100 (e.g. for your WRN-28-10 net on CIFAR-10, you obtained the best result with hyperparameters T_0 = 100 and T_mult = 1 just before the first restart, which is similar to the results in my initial tests).

Btw., a big gotcha for me when applying your hyperparameters was that you flip the train images and concatenate them to the non-flipped images in your dataset, effectively doubling the size of each epoch. I didn't do this initially, and therefore applying your hyperparameters led to vastly different results than expected.
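(For anyone else porting the hyperparameters: a rough sketch of what that preprocessing amounts to. The function name and the assumed NCHW array layout are mine, not the reference code.)

```python
import numpy as np


def flip_and_concatenate(x_train, y_train):
    # Append a horizontally flipped copy of every training image, so one
    # "epoch" iterates over 100k CIFAR images instead of the usual 50k.
    x_flipped = x_train[:, :, :, ::-1]  # flip along the width axis (NCHW assumed)
    x_aug = np.concatenate([x_train, x_flipped], axis=0)
    y_aug = np.concatenate([y_train, y_train], axis=0)
    return x_aug, y_aug
```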

Edit: The overfitting on my dataset occurred after the first restart, and the network started to overfit while the train loss was still much higher than it had been before the restart. I.e., at the same train loss, the test loss was much higher after the first restart than before it.

@loshchil

Thank you for the detailed reply.
I thought you meant the CIFAR datasets, as your original message does not mention your own dataset.
The transfer to your own dataset might require some validation-based adjustment of the initial learning rate and the weight decay factor. However, the paper does not promise better validation errors; instead, "The main purpose of the proposed restart scheme for SGD is to improve its any-time performance."

@nebw (Contributor, Author) commented Aug 24, 2016

However, the paper does not promise better validation errors, instead "The main purpose of the proposed restart scheme for SGD is to improve its any-time performance."

That's a fair point. I'd certainly be willing to finish the implementation (i.e. add unit tests and so on) and reopen the pull request if @fchollet is interested in merging the callback.

@esube commented Aug 25, 2016

I also came across this paper, wanted to experiment with it, and was in the process of translating the code into Keras. But when trying to run the original Lasagne code, I ran into memory issues pretty quickly, even for scenario 1 with K=10. So I reduced K to 2, and the accuracy degradation became obvious and learning slowed. @loshchil: did you try this on just the simple, thin original ResNet? It would be easy to prove whether the gain is due to the extremely wide networks or to the restarts. At this point it is not that obvious to me.

@loshchil

I ran into memory issues pretty quickly, even for scenario 1 with K=10. So I reduced K to 2, and the accuracy degradation became obvious and learning slowed.

The accuracy degradation is obviously due to the use of a smaller network. This would be the case no matter what learning approach you are using. Note that scenario 1, which you are using, is the default setting without restarts, as the README file says: "scenario #1 and #2 correspond to the original multi-step learning rate decay on CIFAR-10".

It would be easy to prove whether the gain is due to the extremely wide networks or to the restarts.

Which gain do you mean? A thin ResNet with K=1 would give you something around 6-7% error on CIFAR-10. Wider networks give better errors on the CIFAR datasets, around 4%; this is why they were originally proposed. Thus, that gain is due to the use of a wider network. The paper says "We also achieved new state-of-the-art results with SGDR, mainly by using even wider WRNs."
The gain discussed in the paper is better any-time performance, which was assessed for the relevant settings of K=10 and K=20 when compared with the "standard" learning rate schedule given in the paper.

@esube commented Aug 25, 2016

Okay, that makes sense. But the quick accuracy degradation I observed for k=2 was worse than the original ResNet. So, it would be nice to first show that you can reproduce the original ResNet with k=1 and no restarts. I am saying this because it appears that the way you preprocess the images is a bit different from the original ResNet. Also, you mention in your paper that you are not using Nesterov momentum, so it might not be an equal comparison to begin with. In short, the early loss reduction can only be attributed to the restart scheduling if you start with the original ResNet with all preprocessing and parameters the same, show that you can reproduce it, and then change only the learning rate scheduling to show the difference.
I know that the aim of the paper is that with restarts you can converge faster and reach state-of-the-art results quicker.

Also, I think there's already a learning rate scheduling callback in Keras, so this could be implemented as a scheduler callback by the user. You can modify the LearningRateScheduler callback that is already in Keras to add what you want, for instance giving you the current learning rate, which I actually opened a pull request for. I didn't change the test, so it is currently failing to build, but that is easy to change if @fchollet agrees to these kinds of changes to LearningRateScheduler.
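For illustration, a step-decay schedule expressed through the existing LearningRateScheduler callback could look roughly like this; the initial rate, the milestones, and the `model`/`X_train`/`Y_train` names are placeholders:

```python
from keras.callbacks import LearningRateScheduler


def step_decay(epoch):
    # Start at 0.1 and divide the learning rate by 10 at each milestone epoch.
    lr = 0.1
    for milestone in (60, 120, 160):
        if epoch >= milestone:
            lr *= 0.1
    return lr

model.fit(X_train, Y_train, nb_epoch=200,
          callbacks=[LearningRateScheduler(step_decay)])
```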

@loshchil commented Aug 25, 2016

Okay, that makes sense. But the quick accuracy degradation I observed for k=2 was worse than the original ResNet.

Please show your data.
It is unclear which original ResNet you refer to. The best original ResNet by He et al. has 6.43% error with 1.7M parameters; that is for 110 layers, not 28 twice-wider layers. "Quick accuracy degradation": how much is that exactly?

So, it would be nice to first show that you can reproduce the original ResNet with k=1 and no restarts. I am saying this because it appears that the way you preprocess the images is a bit different from the original ResNet. Also, you mention in your paper that you are not using Nesterov momentum, so it might not be an equal comparison to begin with. In short, the early loss reduction can only be attributed to the restart scheduling if you start with the original ResNet with all preprocessing and parameters the same, show that you can reproduce it, and then change only the learning rate scheduling to show the difference.

Please have a look at the paper. The "default" approach described in the paper (see Table 1 and the figures) already reproduces the results by Zagoruyko and Komodakis for WRN-28-10; less preprocessing and a simpler learning algorithm is only a plus.

Also, I think there's already a learning rate scheduling callback in Keras, so this could be implemented as a scheduler callback by the user. You can modify the LearningRateScheduler callback that is already in Keras to add what you want, for instance giving you the current learning rate, which I actually opened a pull request for. I didn't change the test, so it is currently failing to build, but that is easy to change if @fchollet agrees to these kinds of changes to LearningRateScheduler.

I am not the author of this pull request, and I am not familiar enough with Keras to say which implementation suits it better. However, the code in the current pull request looked neat to me.

@esube commented Aug 26, 2016

I am not implying the restarts are not helpful. I am saying it would be easier to show that the restarts help if I could reproduce the ResNet result for n=4 and k=1 and then, without changing anything else, change only the learning rate scheduling and see the effect of the restarts.

I am using CIFAR-10, not my own data. Reducing the network to scenario #1 and k=1, which is the improved 28-layer ResNet, it struggles to get beyond 70% test accuracy in the first 10 epochs.

The whole advantage of ResNet is the relatively low number of parameters even for ultra-deep networks (>1000 layers). By making the networks 10x to 20x wider, WRNs quickly make the number of parameters extremely large, which defeats the whole purpose of lowering the parameter count compared to plain networks such as VGG.

@loshchil commented Aug 29, 2016

I am using CIFAR-10, not my own data. Reducing the network to scenario #1 and k=1, which is the improved 28-layer ResNet, it struggles to get beyond 70% test accuracy in the first 10 epochs.

This is normal behaviour for this network and these parameters. Below, see the median of 5 runs. If you ran this WRN-28-1 until the end, you would get 7.3% error, a slightly better result (it is an improved structure) than for the slightly bigger original ResNet-32.

[Plot: median test error over 5 runs for WRN-28-1 on CIFAR-10]

The whole advantage of ResNet is the relatively low number of parameters even for ultra-deep networks (>1000 layers). By making the networks 10x to 20x wider, WRNs quickly make the number of parameters extremely large, which defeats the whole purpose of lowering the parameter count compared to plain networks such as VGG.

It is not depth but performance vs. time/space/ease of use that makes ResNets popular. Chances are that the typical number of parameters will grow, both for CIFAR and ImageNet, to use GPUs to their limits. Sure, networks with low memory and time cost are always of interest, but it is the use of big networks that pushes the search for better network architectures. The latter, however, has little to do with the learning approach we are discussing.

@loshchil

I am not implying the restarts are not helpful. I am saying it would be easier to show that the restarts help if I could reproduce the ResNet result for n=4 and k=1 and then, without changing anything else, change only the learning rate scheduling and see the effect of the restarts.

As you requested, see below the median results of 5 runs with WRN-28-1 and 50k images per epoch. Blue color depicts the standard approach. Red color depicts SGDR. For both cases, the shown results are for the best initial learning rates out of [0.1, 0.05, 0.025, 0.01] for the standard approach and out of [0.1, 0.05, 0.025] for SGDR. As you can see, the pattern is the same as for WRN-28-10 and WRN-28-20 given in the paper.

[Plot: median test error over 5 runs for WRN-28-1 with 50k images per epoch; blue = standard schedule, red = SGDR]

@esube commented Aug 30, 2016

Hey @loshchil thanks for producing the plots and your replies.
Clearly, the restarts seem to help the network converge more quickly than the regular step scheduling.

I am trying to reproduce your result in Keras, but it settled at ~89.2% (10.8% error). So I decided to just save your processed images and use the same minibatch generator, as I suspected the discrepancy could be in Keras's on-the-fly image data augmentation, which applies all transforms as a single homography matrix. In your case, you flipped the images and concatenated them, essentially doubling the training data size, and then did the mean subtraction and on-the-fly random padding and cropping.

I am now running a network exactly like yours (I plotted your Lasagne network to PDF for comparison with my Keras network, also plotted). I will let you know the result and will put the plots of the losses and errors here. I am running your network in Lasagne and the Keras network side by side. From some of the early runs, it looks like the Keras training loss is reduced significantly faster compared to the Lasagne one, which stays around 0.9 for a while and drops significantly once the learning rate is reduced. Note: I am using the data you prepared and the same minibatch iterator. Also, all other optimizer parameters are the same: SGD with momentum, without Nesterov. I suspect that Keras's SGD is a bit different from Lasagne's momentum and weight decay. We will see when they finish.

But my questions to you at this point are:
I see that for the first unit of the first block of the ResNet, you only used one batch normalization and activation but two convolutions. Why do you do that?
Also, there's a 1x1 convolution with stride 1x1 in the first unit; what is the significance of that, and why not a direct shortcut?
Also, from the original residual network paper, the data augmentation they used is: for each image OR its flipped version, they applied the padding and random cropping. You instead flipped every image and added the flipped copies to the originals, essentially doubling the training size. Do you think this won't raise questions about a fair comparison with the original ResNet?

Thanks for your time.

@loshchil

Thank you for running experiments on both platforms. I would also suggest checking the original step decay of the learning rate first. It might be that there is already a difference there, i.e., with the default approach, and then a difference is to be expected for the restarts as well.

Regarding 100k examples per epoch instead of 50k: I took the code from
https://github.com/Lasagne/Recipes/tree/master/papers/deep_residual_learning
where the 100k case is implemented. At the time I believed that this is how people do it. However, it seems that most of them (not the Lasagne guys) use 50k. Practically, it corresponds to runs that are twice as long. Importantly, in all my experiments (with and without restarts) I use the same setting, i.e., 100k. The figure from my previous post is made specifically for the 50k case, i.e., one 100k epoch is counted as two 50k epochs. I don't know the motivation behind the Lasagne Recipes choice, but in expectation 1 epoch with [original + beforehand-flipped images] is the same thing as 2 epochs with [online-flipped images].

The Wide ResNet code follows the implementation of
https://gist.github.com/FlorianMuellerklein/3d9ba175038a3f2e7de3794fa303f1ee
When I checked the description and the code, it seemed like a match. The number of parameters per network type and the error rates also matched those given in Zagoruyko and Komodakis.

@esube commented Aug 31, 2016

I'm actually doing experiments using the step scheduling, not the restarts yet, and it seems there's already a difference in the optimizers. The Keras one overfits extremely quickly, whereas the Lasagne one stays very high in train loss until a learning rate reduction is performed. So I took that cue and modified the original epochs where the learning rate drop happens from [60, 120, 160] to [30, 60, 120]. Interestingly, I can get to 90% test accuracy much earlier, which looks like the restart plot you posted earlier.

Also, when you have time, could you comment on your choice for the first unit of the first ResNet block? It's a bit different from the rest. You commented in the code that it's kind of a hack to get the order of convolution, BN, and activation right. But can't you just start with a simple convolution before the blocks start, so that in the first unit the next layer will be batch normalization?

Thanks!


@loshchil

I can get to 90% test accuracy much earlier, which looks like the restart plot you have posted earlier.

Whether it achieves 90% earlier matters only if it can later achieve better results, e.g. the 92-93% that is normal for that network. If not, then one could just use a smaller network.

@loshchil

I see that for the first unit of the first block of the ResNet, you only used one batch normalization and activation but two convolutions. Why do you do that?
But can't you just start with a simple convolution before the blocks start, so that in the first unit the next layer will be batch normalization?

Do you refer to
conv_1 = batch_norm(ConvLayer(bn_pre_relu, num_filters=filters, filter_size=(3,3), stride=first_stride, nonlinearity=rectify, pad='same', W=HeNormal(gain='relu')))
conv_2 = ConvLayer(conv_1, num_filters=filters, filter_size=(3,3), stride=(1,1), nonlinearity=None, pad='same', W=HeNormal(gain='relu'))

I am not familiar with Lua but the code seems to match
convs:add(SBatchNorm(nBottleneckPlane)):add(ReLU(true))
convs:add(Convolution(nBottleneckPlane,nBottleneckPlane,table.unpack(v)))
from https://github.com/szagoruyko/wide-residual-networks/blob/master/models/wide-resnet.lua

If you think that the Lasagne code should be different, it would be best to know what exactly it should be, in your view.

@esube commented Aug 31, 2016

@loshchil Yeah, I am talking about the first ResNet unit of the first block. You have conv-bn-relu-conv in the first unit, whereas all the other implementations I saw have bn-conv-relu-bn-conv-relu and just one simple conv at the very beginning, right after the input. Even the Wide ResNet basic block you shared above follows this strictly. So, in that sense, your network and the Lua network you shared above don't match at the beginning. It might not make a whole lot of difference; I actually tested the Lua model's structure in your Lasagne code and the difference is minor.

As to the discrepancy between the Keras and the Lasagne code, I didn't notice at first that you were applying l2 regularization to all the layers' parameters, and that was why the Keras model was overfitting quickly. To test that, I removed the l2 penalty term from your Lasagne loss and it exhibited similarly quick overfitting. Then I added l2 weight regularizers to the Keras layers, and training quickly ran into a 'nan' loss after a few iterations. Moreover, I was never able to get the common behavior of the loss dropping significantly after the learning rate is reduced in Keras. The Lasagne optimizer shows that behavior once the regularizer is added.
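(For reference, attaching an l2 weight regularizer to a Keras 1.x convolution layer looks roughly like this; the filter count and the 0.0005 decay factor are placeholders, not the values used above.)

```python
from keras.layers import Convolution2D
from keras.regularizers import l2

# l2 penalty on the kernel weights of a single 3x3 convolution layer
conv = Convolution2D(16, 3, 3, border_mode='same',
                     W_regularizer=l2(0.0005))
```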

I closely inspected Lasagne's SGD and Keras's SGD, and there seems to be quite a difference. The Lasagne version is very straightforward: it directly invokes Theano functions and doesn't add any form of post-processing. The Keras version seems to clip the gradients after Theano computes them, applies constraints after that, and even the regular momentum computation is not as straightforward as in the Lasagne version. So it is really hard to tell what is going wrong. I will share the modular ResNet units and blocks I wrote in both Keras and Lasagne soon as well.
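(As a point of comparison, the textbook momentum update, with no gradient clipping or extra constraints, which is essentially what the straightforward Lasagne version amounts to, is sketched below; the function name is illustrative.)

```python
def momentum_step(w, v, grad, lr=0.1, momentum=0.9):
    # v <- momentum * v - lr * dL/dw ;  w <- w + v
    v_new = momentum * v - lr * grad
    return w + v_new, v_new
```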

As for the early error drop, you were right: although I could get to 90% test accuracy within 15 epochs, it saturated around 91.2% in the long run. Mind you, this is just one single run, not 5 runs.
In conclusion, the restarts could be an interesting idea, and I will investigate them further in Lasagne.

Thanks for your time.

@loshchil commented Sep 1, 2016

Thanks for your investigations. I guess the clipping should be optional; otherwise, depending on the parameters, it might be harmful, as you just found. If you get a chance to share your Lasagne/Keras code, please let me know.

@esube commented Sep 5, 2016

@loshchil Finally, I was able to figure out the problem with Keras's SGD. Setting the learning rate decay to zero occasionally produces an undesirable 'nan'. So I had to comment out the exponential learning rate decay line in Keras's optimizers.py under the SGD class. With that, I was finally able to reproduce at least the WRN-28-4 network in both Keras and Lasagne; here are the results and the model graphs for comparison. The models are essentially the same visually. However, it looks like the Keras version has a slightly larger number of parameters; I don't know why yet.
[Plot: train/test error curves for the Keras and Lasagne WRN-28-4 runs]
[Diagram: the Lasagne network graph]
[Diagram: the Keras ResNet-28x4 model graph]

@nebw (Contributor, Author) commented Sep 10, 2016

@esube: That looks great! How did you determine that the difference in accuracy was caused by nans due to the weight decay? Any idea what's going on there? Did you test whether the same problem occurs without nvcc.fastmath?

Also, would you mind sharing your code that you used to reproduce the WRN 28x4 results with keras?

@esube commented Sep 10, 2016

@nebw I didn't say the difference in accuracy is due to 'nan's. I said previously that I was unable to train the networks in Keras because of the conflict between the two learning rate decay schedules. If there is a 'nan', all you have to do is stop training and change things until you avoid it :-) There is no accuracy after a 'nan'.

I set the learning rate decay in the SGD instantiation to 0, 1e-6, etc. to remove its effect, since I only want the step decay scheduling to have an effect. Going that low produced 'nan' sporadically, as the exponential learning rate decay implemented in Keras divides by the decay. So I commented out that line and was able to train just fine. I have tested that on several versions, even as deep as 74 layers, and it works just fine. Since the exponential decay is meant for when you don't have an additional learning rate scheduler, I think that part of the code needs to be guarded when you have your own callback or learning rate scheduler. I have tried it with and without the fastmath flag, so I don't think that is the problem.

As for the code, I am sharing it as an example in a PR I am about to make for the step learning rate scheduler. Keep an eye out for it in the Keras PRs.
