DeepLearningTutorials/doc/optimization.txt at master · doomie/DeepLearningTutorials · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
A Short Primer on [Supervised] Optimization for Deep Learning
================================================


.. _stoch-grad-label:

What's exciting about Deep Learning is largely the use of unsupervised learning
of deep networks.  But supervised learning also plays an important role.  The
utility of unsupervised *pre-training* is often evaluated on the basis of what
performance can be achieved after supervised *fine-tuning*.  This chapter
reviews the basics of supervised learning for classification models, and covers
the minibatch stochastic gradient descent algorithm that is used to fine-tune
many of the models in the Deep Learning Tutorials.


.. _opt_learn_classifier:

Learning a Classifier
+++++++++++++++++++++

Zero-One Loss
-------------

The models presented in these deep learning tutorials are mostly used as
for classification. The objective in training a classifier is to minimize the number
of errors (zero-one loss) on unseen examples. If :math:`f: R^D \rightarrow
\{0,...,L\}` is the prediction function, then this loss can be written as:

.. math::

    \ell_{0,1} = \sum_{i=0}^{|\mathcal{D}|} I_{f(x^{(i)}) \neq y^{(i)}}

where either :math:`\mathcal{D}` is the training
set (during training)
or :math:`\mathcal{D} \cap \mathcal{D}_{train} = \emptyset`
(to avoid biasing the evaluation of validation or test error). :math:`I` is the
indicator function defined as:

.. math::

    I_x = \left\{\begin{array}{ccc}
          1&\mbox{ if $x$ is True} \\
          0&\mbox{ otherwise}\end{array}\right.

In this tutorial, :math:`f` is defined as:

.. math::

    f(x) = argmax_k P(Y=k | x, \theta)


Negative Log-Likelihood Loss
----------------------------

Since the zero-one loss is not differentiable, optimizing it for large models
(thousands or millions of parameters) is prohibitively expensive
(computationally). We thus maximize the log-likelihood of our classifier given
all the labels in a training set.

.. math::

    \mathcal{L}(\theta, \mathcal{D}) =
        \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)} | x^{(i)}, \theta)

The likelihood of the correct class is not the same as the
number of right predictions, but from the point of view of a randomly
initialized classifier they are pretty similar.
TODO: CLARIFY THIS SENTENCE BELOW: Later in training you can see
that the number of right predictions in a validation set can decrease a
little even after the probability of the right answers starts to drop
(indicating overfitting), but not much.

Since we usually speak in terms of minimizing a loss function, learning will
thus attempt to **minimize** the **negative** log-likelihood (NLL), defined
as:

.. math::

    NLL(\theta, \mathcal{D}) = - \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)} | x^{(i)}, \theta)

The NLL of our classifier is a differentiable surrogate for the zero-one loss,
and we use the gradient of this function over our training data as a
supervised learning signal for deep learning.

.. _opt_SGD:

Stochastic Gradient Descent
+++++++++++++++++++++++++++

What is ordinary gradient descent?  it is a simple
algorithm in which we repeatedly make small steps downward on an error surface
defined by a loss function of some parameters.
For the purpose of ordinary
gradient descent we consider that the training data is rolled into the loss
function.

.. code-block:: python

    # GRADIENT DESCENT
    while True:
        loss = f(params)
        d_loss_wrt_params = ... # compute gradient
        params -= learning_rate * d_loss_wrt_params
        if <stopping condition is met>:
            return params

Stochastic gradient descent (SGD) works according to the same principles as
ordinary gradient descent, but proceeds more quickly by estimating the gradient from just
a few examples at a time instead of the entire training set.  In its purest
form, we estimate the gradient from just a single example at a time.

.. code-block:: python

    # STOCHASTIC GRADIENT DESCENT
    for (x_i,y_i) in training_set:
                                # imagine an infinite generator
                                # that may repeat examples (if there is only a finite training set)
        loss = f(params, x_i, y_i)
        d_loss_wrt_params = ... # compute gradient
        params -= learning_rate * d_loss_wrt_params
        if <stopping condition is met>:
            return params

The variant that we recommend for deep learning is a further twist on
stochastic gradient descent using so-called "minibatches" ***other terms for
this?***.  Minibatch SGD works identically to SGD, except that we use more than
one training example to make each estimate of the gradient.  This technique reduces
variance in the estimate of the gradient, and often makes better use of the
hierarchical memory organization in modern computers.

.. code-block:: python

    for (x_batch,y_batch) in train_batches:
                                # imagine an infinite generator
                                # that may repeat examples
        loss = f(params, x_batch, y_batch)
        d_loss_wrt_params = ... # compute gradient
        params -= learning_rate * d_loss_wrt_params
        if <stopping condition is met>:
            return params


There is a tradeoff in the choice of the minibatch size :math:`B`.  The
reduction of variance and use of SIMD instructions helps most when increasing
:math:`B` from 1 to 2, but the marginal improvement fades rapidly to nothing.
With large :math:`B`, time is wasted in reducing the variance of the gradient
estimator, that time would be better spent on additional gradient steps.
An optimal :math:`B` is model-, dataset-, and hardware-dependent, and can be
anywhere from 1 to maybe several hundreds.  In the tutorial we set it to 10, but this choice
is almost arbitrary (though harmless).


.. todo::
    At this point, we can show how to allocate and apply a classifier with
    theano, load data, calculate: error rates, etc.

TODO: CHOOSE WHICH LOSS FORMULATION TO KEEP

.. code-block:: python

  zero_one_loss = T.sum(T.neq(argmax(p_y_given_x), y)) ???

  loss = T.sum(T.log(p_y_given_x)[y])  #option 1 (TODO: advanced indexing, optimization pattern)

  loss = T.log(p_y_given_x[0,y[0]]) + theano.log(p_y_given_x[1, y[1]]) # option 2: simple indexing on each minibatch element

  loss = T.sum(theano.log(p_y_given_x) * one_of_n(y))   # option 3 (TODO: one_of_n:: integer array, optimization pattern)

  loss = T.sum(theano.nnet.categorical_crossentropy(p_y_given_x, y)) # option 4:

  gw, gb = T.grad(L, [w,b])


.. _opt_early_stopping:


Regularization
++++++++++++++

L1 and L2 regularization
------------------------

TODO

Early-Stopping
--------------

There is more to machine learning than optimization.  When we
train our model from data we are trying to prepare it to do well on *new*
examples, not the ones it has already seen.  The training loop above for MSGD
does not take this into account, and may overfit the training examples.
A way to combat overfitting is through regularization.
There are several techniques for regularization, but the one we will explain here is
early-stopping.

Early-stopping combats overfitting by monitoring the model's performance on a
*validation set*.  A validation set is a set of examples that we never use for
gradient descent, but which is also not a part of the *test set*.  The
validation examples are considered to be representative of future test examples.
We can use them during training because they are not part of the test set.
If the model's performance ceases to improve sufficiently on the
validation set, or even degrades with further optimization, then the
heuristic implemented here gives up on much further optimization.


The choice of when to stop is a
judgement call and a few heuristics exist***, but these tutorials will make use
of a strategy based on a geometrically increasing amount of patience.

.. code-block:: python

    # PRE-CONDITION
    # params refers to [initialized] parameters of our model

    # early-stopping parameters
    n_iter                = 100   # the maximal number of iterations of the
                                  # entire dataset considered
    patience              = 5000  # look at this many training examples regardless
    patience_increase     = 2     # wait this much longer when a new best
                                  # validation error is found
    improvement_threshold = 0.995 # a relative improvement of this much is
                                  # considered significant
    validation_frequency  = 1000  # make this many SGD updates between validations

    # initialize cross-validation variables
    best_params = None
    best_validation_loss = float('inf')

    for iter in xrange( n_iter * len(train_batches) ) :

        # get epoch and minibatch index
        epoch           = iter / len(train_batches)
        minibatch_index =  iter % len(train_batches)

        # get the minibatches corresponding to `iter` modulo
        # `len(train_batches)`
        x,y = train_batches[ minibatch_index ]


        d_loss_wrt_params = ... # compute gradient
        params -= learning_rate * d_loss_wrt_params # gradient descent

        # note that if we do `iter % validation_frequency` it will be
        # true for iter = 0 which we do not want
        if (iter+1) % validation_frequency == 0:

            this_validation_loss = ... # compute zero-one loss on validation set

            # improve patience
            if this_validation_loss < best_validation_loss*improvement_threshold:
                patience = iter * patience_increase

            if this_validation_loss < best_validation_loss:
                best_params = copy.deepcopy(params)
                best_validation_loss = this_validation_loss

        if patience <= iter:
            break

    # POSTCONDITION:
    # best_params refers to the best out-of-sample parameters observed during the optimization

If we run out of batches of training data before running out of patience, then
we just go back to the beginning of the training set and repeat.

.. note::

    This algorithm could possibly be improved by using a test of statistical significance
    rather than the simple comparison, when deciding whether to increase the
    patience.


Testing
+++++++

After the loop exits, the best_params variable refers to the best-performing
model on the validation set.  If we repeat this procedure for another model
class, or even another random initialization, we should use the same
train/valid/test split of the data, and get other best-performing
models.  If we have to choose what the best model class or the best
initialization was, we compare the best_validation_loss for each model.  When
we have finally chosen the model we think is the best (on validation data), we
report that model's test set performance.  That is the performance we expect on
unseen examples.

Recap
+++++

That's it for the optimization section.
The technique of early-stopping requires us to partition the set of examples into three sets
(training :math:`\mathcal{D}_{train}`,
validation :math:`\mathcal{D}_{valid}`,
test :math:`\mathcal{D}_{test}`).
The training set is used for minibatch stochastic gradient descent on the
differentiable approximation of the objective function.
As we perform this gradient descent, we periodically consult the validation set
to see how our model is doing on the real objective function (or at least our
empirical estimate of it).
When we see a good model on the validation set, we save it.
When it has been a long time since seeing a good model, we abandon our search
and return the best parameters found, for evaluation on the test set.