DeepLearningTutorials/doc/logreg.txt at master · doomie/DeepLearningTutorials · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
Classifying MNIST digits using Logistic Regression
==================================================

.. note::
    This sections assumes the reader is familiar with the following Theano
    concepts: shared variables , basic arithmetic ops, T.grad.
    ((TODO: html links into theano documentation))


In this section, we show how Theano can be used to implement the most basic
classifier: the logistic regression. We start off with a quick primer of the
model, which serves both as a refresher but also to anchor the notation and
show how mathematical expressions are mapped onto theano graphs.

In the deepest of machine learning traditions, this tutorial will tackle the exciting
problem of MNIST digit classification.

The Model
+++++++++

Logistic regression is a probabilistic, linear classifier. It is parametrized
by a weight matrix :math:`W` and a bias vector :math:`b`. Classification is
done by projecting data points onto a set of hyperplanes, the distance to
which reflects a class membership probability.

Mathematically, this can be written as:

.. math::
  P(Y=i|x, W,b) &= softmax_i(W x + b) \\
                &= \frac {e^{W_i x + b_i}} {\sum_j e^{W_j x + b_j}}

The output of the model or prediction is then done by taking the argmax of the vector whose i'th element is P(Y=i|x).

.. math::
  y_{pred} = argmax_i P(Y=i|x,W,b)


The code to do this in theano is the following:

.. code-block:: python

  # generate symbolic variables for input (x and y represent a
  # minibatch)
  x = T.fmatrix()
  y = T.lvector()

  # allocate shared variables model params
  b = theano.shared(numpy.zeros((10,)))
  W = theano.shared(numpy.zeros((784,10)))

  # symbolic expression for computing the vector of
  # class-membership probabilities
  p_y_given_x = T.softmax(T.dot(x,w)+b)

  # compiled theano function that returns the vector of class-membership
  # probabilities
  get_p_y_given_x = theano.function( x, p_y_given_x)

  # print the probability of some example represented by x_value
  # x_value is not a symbolic variable but a numpy array describing the
  # datapoint
  print 'Probability that x is of class %i is %f' % i, get_p_y_given_x(x_value)[i]

  # symbolic description of how to compute prediction as class whose probability
  # is maximal
  y_pred = T.argmax(p_y_given_x)

  # compiled theano function that returns this value
  classify = theano.function(x, y_pred)


We first start by allocating symbolic variables for the inputs :math:`x,y`.
Since the parameters of the model must maintain a persistent state throughout
training, we allocate shared variables for :math:`W,b`.
This declares them both as being symbolic theano variables, but also
initializes their contents. The dot and softmax operators are then used to compute the vector
:math:`P(Y|x, W,b)`. The resulting variable p_y_given_x is a symbolic variable
of vector-type.

Up to this point, we have only defined the graph of computations which theano
should perform. To get the actual numerical value of :math:`P(Y|x, W,b)`, we
must create a function ``get_p_y_given_x``, which takes as input ``x`` and
returns ``p_y_given_x``. We can then index its return value with the
index :math:`i` to get the membership probability of the :math:`i` th class.

Now let's finish building the theano graph. To get the actual model
prediction, we can use the ``T.argmax`` operator, which will return the index at
which ``p_y_given_x`` is maximal (i.e. the class with maximum probability).

Again, to calculate the actual prediction for a given input, we construct a
function ``classify``. This function takes as argument a batch of inputs x (as a matrix),
and outputs a vector containing the predicted class for each example (row) in x.

Now of course, the model we have defined so far does not do anything useful yet,
since its parameters are still in their initial random state. The following
section will thus cover how to learn the optimal parameters.


.. note::
    For a complete list of Theano ops, see: TODO


Defining a Loss Function
++++++++++++++++++++++++

Learning optimal model parameters involves minimizing a loss function. In the
case of multi-class logistic regression, it is very common to use the negative
log-likelihood as the loss. This is equivalent to maximizing the likelihood of the
data set :math:`\cal{D}` under the model parameterized by :math:`\theta`. Let
us first start by defining the likelihood :math:`\cal{L}` and loss
:math:`\ell`:

.. math::

   \mathcal{L} (\theta=\{W,b\}, \mathcal{D}) =
     \sum_{i=0}^{|\mathcal{D}|} \log(P(Y=y^{(i)}|x^{(i)}, W,b)) \\
   \ell (\theta=\{W,b\}, \mathcal{D}) = - \mathcal{L} (\theta=\{W,b\}, \mathcal{D})

While entire books are dedicated to the topic of minimization, gradient
descent is by far the simplest method for minimizing arbitrary non-linear
functions. This tutorial will use the method of stochastic gradient method with
mini-batches (MSGD). See :ref:`opt_SGD` for more details.

The following Theano code defines the (symbolic) loss for a given minibatch:

.. code-block:: python

  loss = T.sum(T.log(p_y_given_x)[y])

.. note::
    In practice, we will use the mean (T.mean) instead of the sum. This
    allows for the learning rate choice to be less dependent of the minibatch size.


Creating a LogisticRegression class
+++++++++++++++++++++++++++++++++++

We now have all the tools we need to define a ``LogisticRegression`` class, which
encapsulates the basic behaviour of logistic regression. The code is very
similar to what we have covered so far, and should be self explanatory.

.. code-block:: python

    class LogisticRegression(object):


        def __init__(self, input, n_in, n_out):
            """ Initialize the parameters of the logistic regression
            :param input: symbolic variable that describes the input of the
            architecture (e.g., one minibatch of input images)
            :param n_in: number of input units, the dimension of the space in
            which the datapoint lies
            :param n_out: number of output units, the dimension of the space in
            which the target lies
            """

            # initialize with 0 the weights W as a matrix of shape (n_in, n_out)
            self.W = theano.shared( value=numpy.zeros((n_in,n_out),
                                                dtype = theano.config.floatX) )
            # initialize the baises b as a vector of n_out 0s
            self.b = theano.shared( value=numpy.zeros((n_out,),
                                                dtype = theano.config.floatX) )

            # compute vector of class-membership probabilities in symbolic form
            self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W)+self.b)

            # compute prediction as class whose probability is maximal in
            # symbolic form
            self.y_pred=T.argmax(self.p_y_given_x, axis=1)


        def negative_log_likelihood(self, y):
            """Return the negative log-likelihood of the prediction of this model
            under a given target distribution.
            :param y: corresponds to a vector that gives for each example the
            :correct label
            """
            # TODO: inline NLL formula, refer to theano function
            # TODO: is it really theano.log? why not T.log? make sure we use the same as in the code, which is not the case currently.
            return -theano.log(self.p_y_given_x)[y]


We instantiate this class as follows:

.. code-block:: python

    # allocate symbolic variables for the data
    x = T.fmatrix()  # the data is presented as rasterized images (each being a 1-D row vector in x)
    y = T.lvector()  # the labels are presented as 1D vector of [long int] labels

    # construct the logistic regression class
    classifier = LogisticRegression( \
                   input=x.reshape((batch_size,28*28)), n_in=28*28, n_out=10)

Note that the inputs x and y are defined outside the scope of the
``LogisticRegression`` object. Since the class requires the input x to build its
graph however, it is passed as a parameter of the ``__init__`` function.

The last step involves defining a (symbolic) cost variable to minimize, using
the instance method ``classifier.negative_log_likelihood``.

.. code-block:: python

    cost = classifier.negative_log_likelihood(y).mean()

Note that the return value of ``classifier.negative_log_likelihood`` is a vector
containing the cost for each training example within the minibatch. Since we are
using MSGD, the cost to minimize is the mean cost across the minibatch.
Note how x is an implicit symbolic input to the symbolic definition of cost,
here, because classifier.__init__ has defined its symbolic variables in terms of x.

Learning the Model
++++++++++++++++++

To implement MSGD in most programming languages (C/C++, Matlab, Python), one
would start by manually deriving the expressions for the gradient of the loss
with respect to the parameters: in this case :math:`\partial{\ell}/\partial{W}`,
and :math:`\partial{\ell}/\partial{b}`, This can get pretty tricky for complex
models, as expressions for :math:`\partial{\ell}/\partial{\theta}` can get
fairly complex, especially when taking into account problems of numerical
stability.

With Theano, this work is greatly simplified as it performs
automatic differentiation and applies certain math transforms to improve
numerical stability.

To get the gradients :math:`\partial{\ell}/\partial{W}` and
:math:`\partial{\ell}/\partial{b}` in Theano, simply do the following:

.. code-block:: python

    # compute the gradient of cost with respect to theta = (W,b)
    g_W = T.grad(cost, classifier.W)
    g_b = T.grad(cost, classifier.b)

``g_W`` and ``g_b`` are again symbolic variables, which can be used as part of a
computation graph. Performing one-step of gradient descent can then be done as
follows:

.. code-block:: python

    # set a learning rate
    learning_rate=0.01

    # specify how to update the parameters of the model as a dictionary
    updates ={classifier.W: classifier.W - numpy.asarray(learning_rate)*g_W,\
              classifier.b: classifier.b - numpy.asarray(learning_rate)*g_b}

    # compiling a theano function `train_model` that returns the cost, but in
    # the same time updates the parameter of the model based on the rules
    # defined in `updates`
    train_model = theano.function([x, y], cost, updates = updates )

The ``updates`` dictionary contains, for each parameter, the
stochastic gradient update operation. The function ``train_model`` is then
defined such that:

* the inputs are the mini-batch :math:`x` with corresponding labels :math:`y`
* the return value is the cost/loss associated with inputs x, y
* on every function call, it will apply the operations defined by the
  ``updates`` dictionary.

Each time ``train_model(x,y)`` function is called, it will thus compute and
return the appropriate cost, while also performing a step of MSGD. The entire
learning algorithm thus consists in looping over all examples in the dataset,
and repeatedly calling the ``train_model`` function.


Testing the model
+++++++++++++++++

As explained in :ref:`opt_learn_classifier`, when testing the model we are
interested in the number of misclassified examples (and not only in the likelihood).
The ``LogisticRegression`` class therefore has an extra instance method, which
builds the symbolic graph for retrieving the number of misclassified examples in
each minibatch.

The code is as follows:

.. code-block:: python

    class LogisticRegression(object):

        ...

        def errors(self, y):
            """Return a float representing the number of errors in the minibatch
            over the total number of examples of the minibatch
            """
            return T.mean(T.neq(self.y_pred, y))


We then create a function ``test_model``, which we can call to retrieve this
value. As you will see shortly, ``test_model`` is key to our early-stopping
implementation (see :ref:`opt_early_stopping`).

.. code-block:: python

    test_model = theano.function([x,y], classifier.errors(y))


Putting it All Together
+++++++++++++++++++++++

The finished product is as follows.

.. literalinclude:: ../code/logistic_sgd.py

The user can learn to classify MNIST digits with SGD logistic regression, by typing, from
within the DeepLearningTutorials folder:

.. code-block:: bash

    python code/logistic_sgd.py

The output one should expect is of the form :

.. code-block:: bash

  epoch 0, minibatch 2500/2500, validation error 10.720000 %
       epoch 0, minibatch 2500/2500, test error of best model 11.050000 %
  ...
  epoch 96, minibatch 2500/2500, validation error 7.010000 %
  Optimization complete with best validation score of 7.01%, with test performance 7.61%
  The code ran for 2.595667 minutes

On an Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00 Ghz  the code runs with
approximately 1.62229185 sec/epoch and it took 99 epochs to reach a test
error of 7.61%.

.. rubric:: Footnotes

.. [#f1] For smaller datasets and simpler models, more sophisticated descent
         algorithms can be more effective. The sample code logistic_cg.py
         demonstrates how to use SciPy's conjugate gradient solver with theano
         on the logistic regression task.