DeepLearningTutorials/doc/SdA.txt at master · jwmneu/DeepLearningTutorials · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
.. _SdA:

Stacked Denoising Autoencoders (SdA)
====================================

.. note::
  This section assumes the reader has already read through :doc:`logreg`
  and :doc:`mlp`. Additionally it uses the following Theano functions
  and concepts : `T.tanh`_, `shared variables`_, `basic arithmetic ops`_, `T.grad`_, `Random numbers`_, `floatX`_. If you intend to run the code on GPU also read `GPU`_.

.. _T.tanh: http://deeplearning.net/software/theano/tutorial/examples.html?highlight=tanh

.. _shared variables: http://deeplearning.net/software/theano/tutorial/examples.html#using-shared-variables

.. _basic arithmetic ops: http://deeplearning.net/software/theano/tutorial/adding.html#adding-two-scalars

.. _T.grad: http://deeplearning.net/software/theano/tutorial/examples.html#computing-gradients

.. _floatX: http://deeplearning.net/software/theano/library/config.html#config.floatX

.. _GPU: http://deeplearning.net/software/theano/tutorial/using_gpu.html

.. _Random numbers: http://deeplearning.net/software/theano/tutorial/examples.html#using-random-numbers


.. note::
    The code for this section is available for download `here`_.

.. _here: http://deeplearning.net/tutorial/code/SdA.py


The Stacked Denoising Autoencoder (SdA) is an extension of the stacked
autoencoder [Bengio07]_ and it was introduced in [Vincent08]_.

This tutorial builds on the previous tutorial :ref:`dA` and we recommend,
especially if you do not have experience with autoencoders, to read it
before going any further.

.. _stacked_autoencoders:

Stacked Autoencoders
++++++++++++++++++++

The denoising autoencoders can be stacked to form a deep network by
feeding the latent representation (output code)
of the denoising auto-encoder found on the layer
below as input to the current layer. The **unsupervised pre-training** of such an
architecture is done one layer at a time. Each layer is trained as
a denoising auto-encoder by minimizing the reconstruction of its input
(which is the output code of the previous layer).
Once the first :math:`k` layers
are trained, we can train the :math:`k+1`-th layer because we can now
compute the code or latent representation from the layer below.
Once all layers are pre-trained, the network goes through a second stage
of training called **fine-tuning**. Here we consider **supervised fine-tuning**
where we want to minimize prediction error on a supervised task.
For this we first add a logistic regression
layer on top of the network (more precisely on the output code of the
output layer). We then
train the entire network as we would train a multilayer
perceptron. At this point, we only consider the encoding parts of
each auto-encoder.
This stage is supervised, since now we use the target class during
training (see the :ref:`mlp` for details on the multilayer perceptron).

This can be easily implemented in Theano, using the class defined
before for a denoising autoencoder. We can see the stacked denoising
autoencoder as having two facades, one is a list of
autoencoders, the other is an MLP. During pre-training we use the first facade, i.e we treat our model
as a list of autoencoders, and train each autoencoder seperately. In the
second stage of training, we use the second facade. These two
facedes are linked by the fact that the autoencoders and the sigmoid layers of
the MLP share parameters, and the fact that autoencoders get as input latent
representations of intermediate layers of the MLP.

.. literalinclude:: ../code/SdA.py
  :start-after: start-snippet-1
  :end-before: end-snippet-1

``self.sigmoid_layers`` will store the sigmoid layers of the MLP facade, while
``self.dA_layers`` will store  the denoising autoencoder associated with the layers of the MLP.

Next step, we construct ``n_layers`` sigmoid layers (we use the
``HiddenLayer`` class introduced in :ref:`mlp`, with the only
modification that we replaced the non-linearity from ``tanh`` to the
logistic function :math:`s(x) = \frac{1}{1+e^{-x}}`) and ``n_layers``
denoising autoencoders, where ``n_layers`` is the depth of our model.
We link the sigmoid layers such that they form an MLP, and construct
each denoising autoencoder such that they share the weight matrix and the
bias of the encoding part with its corresponding sigmoid layer.

.. literalinclude:: ../code/SdA.py
  :start-after: start-snippet-2
  :end-before: end-snippet-2

All we need now is to add the logistic layer on top of the sigmoid
layers such that we have an MLP. We will
use the ``LogisticRegression`` class introduced in :ref:`logreg`.

.. literalinclude:: ../code/SdA.py
  :start-after: end-snippet-2
  :end-before: def pretraining_functions

The class also provides a method that generates training functions for
each of the denoising autoencoder associated with the different layers.
They are returned as a list, where element :math:`i` is a function that
implements one step of training the ``dA`` correspoinding to layer
:math:`i`.

.. literalinclude:: ../code/SdA.py
  :start-after: self.errors = self.logLayer.errors(self.y)
  :end-before: corruption_level = T.scalar('corruption')

In order to be able to change the corruption level or the learning rate
during training we associate a Theano variable to them.

.. literalinclude:: ../code/SdA.py
  :start-after: index = T.lscalar('index')
  :end-before: def build_finetune_functions

Now any function ``pretrain_fns[i]`` takes as arguments ``index`` and
optionally ``corruption`` -- the corruption level or ``lr`` -- the
learning rate. Note that the name of the parameters are the name given
to the Theano variables when they are constructed, not the name of the
python variables (``learning_rate`` or ``corruption_level``). Keep this
in mind when working with Theano.

In the same fashion we build a method for constructing function required
during finetuning ( a ``train_model``, a ``validate_model`` and a
``test_model`` function).

.. literalinclude:: ../code/SdA.py
  :pyobject: SdA.build_finetune_functions

Note that the returned ``valid_score`` and ``test_score`` are not Theano
functions, but rather python functions that also loop over the entire
validation set and the entire test set producing a list of the losses
over these sets.

Putting it all together
+++++++++++++++++++++++

The few lines of code below constructs the stacked denoising
autoencoder :

.. literalinclude:: ../code/SdA.py
  :start-after: start-snippet-3
  :end-before: end-snippet-3

There are two stages in training this network, a layer-wise pre-training and
fine-tuning afterwards.

For the pre-training stage, we will loop over all the layers of the
network. For each layer we will use the compiled theano function that
implements a SGD step towards optimizing the weights for reducing
the reconstruction cost of that layer. This function will be applied
to the training set for a fixed number of epochs given by
``pretraining_epochs``.

.. literalinclude:: ../code/SdA.py
  :start-after: start-snippet-4
  :end-before: end-snippet-4

The fine-tuning loop is very similar with the one in the :ref:`mlp`, the
only difference is that we will use now the functions given by
``build_finetune_functions``  .

Running the Code
++++++++++++++++

The user can run the code by calling:

.. code-block:: bash

  python code/SdA.py

By default the code runs 15 pre-training epochs for each layer, with a batch
size of 1. The  corruption level for the first layer is 0.1, for the second
0.2 and 0.3 for the third. The pretraining learning rate is was 0.001 and
the finetuning learning rate is 0.1. Pre-training takes 585.01 minutes, with
an average of 13 minutes per epoch. Fine-tuning is completed after 36 epochs
in 444.2 minutes, with an average of 12.34 minutes per epoch. The final
validation score is 1.39% with a testing score of 1.3%.
These results were obtained on a machine with an Intel
Xeon E5430 @ 2.66GHz CPU, with a single-threaded GotoBLAS.


Tips and Tricks
+++++++++++++++

One way to improve the running time of your code (given that you have
sufficient memory available), is to compute how the network, up to layer
:math:`k-1`, transforms your data. Namely, you start by training your first
layer dA. Once it is trained, you can compute the hidden units values for
every datapoint in your dataset and store this as a new dataset that you will
use to train the dA corresponding to layer 2. Once you trained the dA for
layer 2, you compute, in a similar fashion, the dataset for layer 3 and so on.
You can see now, that at this point, the dAs are trained individually, and
they just provide (one to the other) a non-linear transformation of the input.
Once all dAs are trained, you can start fine-tunning the model.