.. index:: Logistic Regression

.. _logreg :


Classifying MNIST digits using Logistic Regression
==================================================

.. note::
    This section assumes familiarity with the following Theano
    concepts: `shared variables`_ , `basic arithmetic ops`_ , `T.grad`_ ,
    `floatX`_. If you intend to run the code on GPU also read `GPU`_.

.. note::
    The code for this section is available for download `here`_.

.. _here: http://deeplearning.net/tutorial/code/logistic_sgd.py

.. _shared variables: http://deeplearning.net/software/theano/tutorial/examples.html#using-shared-variables

.. _basic arithmetic ops: http://deeplearning.net/software/theano/tutorial/adding.html#adding-two-scalars

.. _T.grad: http://deeplearning.net/software/theano/tutorial/examples.html#computing-gradients

.. _floatX: http://deeplearning.net/software/theano/library/config.html#config.floatX

.. _GPU: http://deeplearning.net/software/theano/tutorial/using_gpu.html

In this section, we show how Theano can be used to implement the most basic
classifier: logistic regression. We start with a quick primer on the
model, which serves both as a refresher and as a way to anchor the notation and
show how mathematical expressions are mapped onto Theano graphs.

In the deepest of machine learning traditions, this tutorial will tackle the exciting
problem of MNIST digit classification.

The Model
+++++++++

Logistic regression is a probabilistic, linear classifier. It is parametrized
by a weight matrix :math:`W` and a bias vector :math:`b`. Classification is
done by projecting data points onto a set of hyperplanes, the distance to
which reflects a class membership probability.

Mathematically, this can be written as:

.. math::
  P(Y=i|x, W,b) &= softmax_i(W x + b) \\
                &= \frac {e^{W_i x + b_i}} {\sum_j e^{W_j x + b_j}}

The model's prediction :math:`y_{pred}` is the class whose probability is maximal, that is, the argmax of the vector whose i'th element is :math:`P(Y=i|x)`:

.. math::
  y_{pred} = {\rm argmax}_i P(Y=i|x,W,b)
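
Before looking at the Theano code, it can help to see the same two equations
in plain Python. The following is only a minimal sketch and does not use
Theano; the helper names ``softmax``, ``p_y_given_x`` and ``predict`` and the
toy values are assumptions made for illustration. It follows the notation
above, with one row of :math:`W` per class.

.. code-block:: python

    import math

    def softmax(scores):
        """Return softmax probabilities for a list of real-valued scores."""
        # Subtracting the maximum avoids overflow in exp and leaves the result unchanged.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def p_y_given_x(W, b, x):
        """P(Y=i|x, W, b) for every class i; W holds one row of weights per class."""
        scores = [sum(w_ij * x_j for w_ij, x_j in zip(W_i, x)) + b_i
                  for W_i, b_i in zip(W, b)]
        return softmax(scores)

    def predict(W, b, x):
        """Index of the most probable class (the argmax)."""
        probs = p_y_given_x(W, b, x)
        return max(range(len(probs)), key=probs.__getitem__)

    # Hypothetical toy model: 3 classes, 2 input features.
    W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    b = [0.0, 0.0, 0.0]
    x = [2.0, 1.0]
    probs = p_y_given_x(W, b, x)
    y_pred = predict(W, b, x)

For these toy values the scores :math:`W x + b` are :math:`(2, 1, -3)`, so the
first class (index 0) receives the highest probability and is the prediction.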

The code to do this in Theano is the following:

.. literalinclude:: ../code/logistic_sgd.py
  :start-after: index = T.lscalar()
  :end-before: # construct the logistic regression class


.. literalinclude:: ../code/logistic_sgd.py
  :start-after: start-snippet-1
  :end-before: end-snippet-1

We start by allocating symbolic variables for the inputs :math:`x,y`.
Since the parameters of the model must maintain a persistent state throughout
training, we allocate shared variables for :math:`W,b`.
This both declares them as symbolic Theano variables and
initializes their contents. The dot and softmax operators are then used to compute the vector
:math:`P(Y|x, W,b)`. The resulting variable ``p_y_given_x`` is a symbolic variable
of vector-type.

Up to this point, we have only defined the graph of computations which Theano
should perform. To get the actual numerical value of :math:`P(Y|x, W,b)`, we
must create a function ``get_p_y_given_x``, which takes as input ``x`` and
returns ``p_y_given_x``. We can then index its return value with the
index :math:`i` to get the membership probability of the :math:`i` th class.

Now let's finish building the Theano graph. To get the actual model
prediction, we can use the ``T.argmax`` operator, which will return the index at
which ``p_y_given_x`` is maximal (i.e. the class with maximum probability).

Again, to calculate the actual prediction for a given input, we construct a
function ``classify``. This function takes as argument a batch of inputs ``x`` (as a matrix),
and outputs a vector containing the predicted class for each example (row) in ``x``.

Of course, the model we have defined so far does not do anything useful yet,
since its parameters are still in their initial, untrained state. The following
section will thus cover how to learn the optimal parameters.


.. note::
    For a complete list of Theano ops, see: `list of ops <http://deeplearning.net/software/theano/library/tensor/basic.html#basic-tensor-functionality>`_


Defining a Loss Function
++++++++++++++++++++++++

Learning optimal model parameters involves minimizing a loss function. In the
case of multi-class logistic regression, it is very common to use the negative
log-likelihood as the loss. This is equivalent to maximizing the likelihood of the
data set :math:`\cal{D}` under the model parameterized by :math:`\theta`. Let
us first start by defining the likelihood :math:`\cal{L}` and loss
:math:`\ell`:

.. math::

   \mathcal{L} (\theta=\{W,b\}, \mathcal{D}) =
     \sum_{i=0}^{|\mathcal{D}|-1} \log(P(Y=y^{(i)}|x^{(i)}, W,b)) \\
   \ell (\theta=\{W,b\}, \mathcal{D}) = - \mathcal{L} (\theta=\{W,b\}, \mathcal{D})

While entire books are dedicated to the topic of minimization, gradient
descent is by far the simplest method for minimizing arbitrary non-linear
functions. This tutorial will use stochastic gradient descent with
mini-batches (MSGD). See :ref:`opt_SGD` for more details.

The following Theano code defines the (symbolic) loss for a given minibatch:

.. literalinclude:: ../code/logistic_sgd.py
  :start-after: start-snippet-2
  :end-before: end-snippet-2

.. note::

    Even though the loss is formally defined as the *sum*, over the data set,
    of individual error terms, in practice, we use the *mean* (``T.mean``)
    in the code. This makes the choice of learning rate less dependent
    on the minibatch size.
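
To make the difference between the sum and the mean concrete, here is a
minimal plain-Python sketch (it is not the tutorial's Theano code). The
``average`` flag and the toy probabilities are assumptions introduced for
illustration only.

.. code-block:: python

    import math

    def negative_log_likelihood(probs, labels, average=True):
        """Negative log-likelihood of the correct labels.

        probs[n][k] is the predicted probability of class k for example n,
        labels[n] is the index of the correct class of example n.
        """
        log_probs = [math.log(p_n[y_n]) for p_n, y_n in zip(probs, labels)]
        total = -sum(log_probs)
        return total / len(labels) if average else total

    # Hypothetical minibatch: two examples, three classes each.
    probs = [[0.7, 0.2, 0.1],
             [0.1, 0.8, 0.1]]
    labels = [0, 1]
    sum_loss = negative_log_likelihood(probs, labels, average=False)
    mean_loss = negative_log_likelihood(probs, labels)

    # Doubling the minibatch doubles the summed loss but not the mean loss.
    doubled_sum = negative_log_likelihood(probs * 2, labels * 2, average=False)
    doubled_mean = negative_log_likelihood(probs * 2, labels * 2)

Because the mean does not grow with the number of examples, the gradient it
produces has a comparable scale for different minibatch sizes.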


Creating a LogisticRegression class
+++++++++++++++++++++++++++++++++++

We now have all the tools we need to define a ``LogisticRegression`` class, which
encapsulates the basic behaviour of logistic regression. The code is very
similar to what we have covered so far, and should be self-explanatory.

.. literalinclude:: ../code/logistic_sgd.py
  :pyobject: LogisticRegression

We instantiate this class as follows:

.. literalinclude:: ../code/logistic_sgd.py
  :start-after: index = T.lscalar()
  :end-before: # the cost we minimize during

Note that the inputs ``x`` and ``y`` are defined outside the scope of the
``LogisticRegression`` object. Since the class requires the input ``x`` to build its
graph, however, it is passed as a parameter of the ``__init__`` function.
This is useful when you want to chain such
classes to form a deep network, in which case the input is not a new
variable but the output of the layer below. While we will not do that in
this example, the tutorials are designed so that the code is as
similar as possible among them, making it easy to go from one tutorial
to the other.

The last step involves defining a (symbolic) cost variable to minimize, using
the instance method ``classifier.negative_log_likelihood``.

.. literalinclude:: ../code/logistic_sgd.py
  :start-after: classifier = LogisticRegression(input=x, n_in=28 * 28, n_out=10)
  :end-before: # compiling a Theano function that computes the mistakes

Note how ``x`` is an implicit symbolic input to the symbolic definition of ``cost``,
because ``classifier.__init__`` has defined its symbolic variables in terms of ``x``.

Learning the Model
++++++++++++++++++

To implement MSGD in most programming languages (C/C++, Matlab, Python), one
would start by manually deriving the expressions for the gradient of the loss
with respect to the parameters: in this case :math:`\partial{\ell}/\partial{W}`,
and :math:`\partial{\ell}/\partial{b}`. This can get pretty tricky for complex
models, as expressions for :math:`\partial{\ell}/\partial{\theta}` can get
fairly complex, especially when taking into account problems of numerical
stability.
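
As an illustration of what deriving these gradients by hand involves, the
sketch below uses the standard result for softmax regression with a mean
negative log-likelihood loss: the gradient with respect to row :math:`k` of
:math:`W` is the average over examples of :math:`(P(Y=k|x) - 1_{y=k})\, x`,
and the gradient with respect to :math:`b_k` is the same average without the
factor :math:`x`. The code is plain Python rather than Theano, all names and
toy values are hypothetical, and the finite-difference comparison is included
only as a sanity check.

.. code-block:: python

    import math

    def softmax(scores):
        m = max(scores)  # shift for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def class_probs(W, b, x):
        return softmax([sum(w * v for w, v in zip(W_k, x)) + b_k
                        for W_k, b_k in zip(W, b)])

    def mean_nll(W, b, xs, ys):
        return -sum(math.log(class_probs(W, b, x)[y])
                    for x, y in zip(xs, ys)) / len(xs)

    def manual_gradients(W, b, xs, ys):
        """Hand-derived gradients of the mean NLL with respect to W and b."""
        n = len(xs)
        g_W = [[0.0] * len(W[0]) for _ in W]
        g_b = [0.0] * len(b)
        for x, y in zip(xs, ys):
            p = class_probs(W, b, x)
            for k in range(len(b)):
                delta = (p[k] - (1.0 if k == y else 0.0)) / n
                g_b[k] += delta
                for j, x_j in enumerate(x):
                    g_W[k][j] += delta * x_j
        return g_W, g_b

    # Hypothetical toy data: 2 features, 3 classes.
    W = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.1]]
    b = [0.0, 0.1, -0.1]
    xs = [[1.0, 2.0], [-1.0, 0.5], [0.3, -0.7]]
    ys = [0, 2, 1]
    g_W, g_b = manual_gradients(W, b, xs, ys)

    # Central finite difference on a single weight as a sanity check.
    eps = 1e-6
    W_plus = [row[:] for row in W]
    W_minus = [row[:] for row in W]
    W_plus[1][0] += eps
    W_minus[1][0] -= eps
    numeric = (mean_nll(W_plus, b, xs, ys) - mean_nll(W_minus, b, xs, ys)) / (2 * eps)

Even for this simple model the derivation has to be written, tested and kept
in sync with the loss by hand.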

With Theano, this work is greatly simplified as it performs
automatic differentiation and applies certain math transforms to improve
numerical stability.

To get the gradients :math:`\partial{\ell}/\partial{W}` and
:math:`\partial{\ell}/\partial{b}` in Theano, simply do the following:

.. literalinclude:: ../code/logistic_sgd.py
  :start-after: # compute the gradient of cost
  :end-before: # start-snippet-3

``g_W`` and ``g_b`` are again symbolic variables, which can be used as part of a
computation graph. Performing one step of gradient descent can then be done as
follows:

.. literalinclude:: ../code/logistic_sgd.py
  :start-after: start-snippet-3
  :end-before: end-snippet-3

The ``updates`` list contains, for each parameter, the
stochastic gradient update operation. The ``givens`` dictionary indicates with
what to replace certain variables of the graph. The function ``train_model`` is then
defined such that:

* the input is the mini-batch index ``index`` that, together with the batch
  size (which is not an input since it is fixed), defines :math:`x` with
  corresponding labels :math:`y`
* the return value is the cost/loss associated with the x, y defined by
  the ``index``
* on every function call, it will first replace ``x`` and ``y`` with the
  corresponding slices from the training set as defined by the
  ``index`` and afterwards it will evaluate the cost
  associated with that minibatch and apply the operations defined by the
  ``updates`` list.

Each time the function ``train_model(index)`` is called, it will thus compute and
return the appropriate cost, while also performing a step of MSGD. The entire
learning algorithm thus consists of looping over all minibatches in the dataset,
repeatedly calling the ``train_model`` function.
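
The behaviour described above can be imitated without Theano. The sketch
below is a plain-Python analogue, not the tutorial's implementation: the
factory ``make_train_model``, the toy dataset and the learning rate are all
assumptions for illustration. Slicing by ``index`` plays the role of
``givens``, and the in-place parameter changes play the role of ``updates``.

.. code-block:: python

    import math

    def make_train_model(train_x, train_y, W, b, batch_size, learning_rate):
        """Return a function that runs one MSGD step on minibatch number `index`."""
        def train_model(index):
            # Analogue of `givens`: select the slice of the training set.
            xs = train_x[index * batch_size:(index + 1) * batch_size]
            ys = train_y[index * batch_size:(index + 1) * batch_size]
            cost = 0.0
            g_W = [[0.0] * len(W[0]) for _ in W]
            g_b = [0.0] * len(b)
            for x, y in zip(xs, ys):
                scores = [sum(w * v for w, v in zip(W_k, x)) + b_k
                          for W_k, b_k in zip(W, b)]
                m = max(scores)
                exps = [math.exp(s - m) for s in scores]
                total = sum(exps)
                p = [e / total for e in exps]
                cost -= math.log(p[y]) / len(xs)  # mean negative log-likelihood
                for k in range(len(b)):
                    delta = (p[k] - (1.0 if k == y else 0.0)) / len(xs)
                    g_b[k] += delta
                    for j, x_j in enumerate(x):
                        g_W[k][j] += delta * x_j
            # Analogue of `updates`: change the persistent parameters in place.
            for k in range(len(b)):
                b[k] -= learning_rate * g_b[k]
                for j in range(len(W[k])):
                    W[k][j] -= learning_rate * g_W[k][j]
            return cost  # cost evaluated before the update was applied
        return train_model

    # Hypothetical toy problem: the class is 1 when the single feature is positive.
    train_x = [[-2.0], [-1.0], [1.0], [2.0], [-1.5], [1.5]]
    train_y = [0, 0, 1, 1, 0, 1]
    W = [[0.0], [0.0]]
    b = [0.0, 0.0]
    batch_size = 2
    train_model = make_train_model(train_x, train_y, W, b, batch_size,
                                   learning_rate=0.5)

    n_train_batches = len(train_x) // batch_size
    costs = []
    for epoch in range(20):
        for index in range(n_train_batches):
            costs.append(train_model(index))

Each call returns the cost of one minibatch and leaves ``W`` and ``b``
modified, which is the same contract as the compiled Theano function.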


Testing the model
+++++++++++++++++

As explained in :ref:`opt_learn_classifier`, when testing the model we are
interested in the number of misclassified examples (and not only in the likelihood).
The ``LogisticRegression`` class therefore has an extra instance method, which
builds the symbolic graph for retrieving the number of misclassified examples in
each minibatch.

The code is as follows:

.. literalinclude:: ../code/logistic_sgd.py
  :pyobject: LogisticRegression.errors

We then create a function ``test_model`` and a function ``validate_model``, which we can call to retrieve this
value. As you will see shortly, ``validate_model`` is key to our early-stopping
implementation (see :ref:`opt_early_stopping`). Both of these functions
take a batch offset as input and compute the number of
misclassified examples for that mini-batch. The only difference between them
is that ``test_model`` draws its batches from the testing set, while
``validate_model`` draws them from the validation set.

.. literalinclude:: ../code/logistic_sgd.py
  :start-after: cost = classifier.negative_log_likelihood(y)
  :end-before: # compute the gradient of cost
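
As a rough plain-Python illustration of this zero-one error, the sketch below
scores each minibatch and then averages over minibatches. It assumes the
error is expressed as a fraction of the minibatch, which is consistent with
the percentages shown in the program output; the helper ``dataset_error`` and
the toy predictions are hypothetical.

.. code-block:: python

    def errors(y_pred, y):
        """Fraction of examples in a minibatch whose prediction differs from the label."""
        if len(y_pred) != len(y):
            raise TypeError("y should have the same shape as y_pred")
        return sum(1 for p, t in zip(y_pred, y) if p != t) / len(y)

    def dataset_error(predictions, labels, batch_size):
        """Average the per-minibatch error over all complete minibatches."""
        n_batches = len(labels) // batch_size
        batch_errors = [
            errors(predictions[i * batch_size:(i + 1) * batch_size],
                   labels[i * batch_size:(i + 1) * batch_size])
            for i in range(n_batches)
        ]
        return sum(batch_errors) / n_batches

    # Hypothetical predictions and labels for a validation set of six examples.
    predictions = [3, 1, 4, 1, 5, 9]
    labels = [3, 1, 4, 2, 5, 8]
    validation_error = dataset_error(predictions, labels, batch_size=2)

Multiplying such a value by 100 gives an error in percent, the unit used in
the training log.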

Putting it All Together
+++++++++++++++++++++++

The finished product is as follows.

.. literalinclude:: ../code/logistic_sgd.py

The user can learn to classify MNIST digits with SGD logistic regression, by typing, from
within the DeepLearningTutorials folder:

.. code-block:: bash

    python code/logistic_sgd.py

The output one should expect is of the form:

.. code-block:: bash

    ...
    epoch 72, minibatch 83/83, validation error 7.510417 %
         epoch 72, minibatch 83/83, test error of best model 7.510417 %
    epoch 73, minibatch 83/83, validation error 7.500000 %
         epoch 73, minibatch 83/83, test error of best model 7.489583 %
    Optimization complete with best validation score of 7.500000 %,with test performance 7.489583 %
    The code run for 74 epochs, with 1.936983 epochs/sec


On an Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00 GHz, the code runs at
approximately 1.936 epochs/sec and it took 75 epochs to reach a test
error of 7.489%. On the GPU the code does almost 10.0 epochs/sec. For this
instance we used a batch size of 600.

.. rubric:: Footnotes

.. [#f1] For smaller datasets and simpler models, more sophisticated descent
         algorithms can be more effective. The sample code
         `logistic_cg.py <http://deeplearning.net/tutorial/code/logistic_cg.py>`_
         demonstrates how to use SciPy's conjugate gradient solver with Theano
         on the logistic regression task.