@@ -45,26 +45,26 @@ the :math:`\ell` hidden layers :math:`h^k` as follows:
4545
4646 P(x, h^1, \ldots, h^{\ell}) = \left(\prod_{k=0}^{\ell-2} P(h^k|h^{k+1})\right) P(h^{\ell-1},h^{\ell})
4747
48- where :math:`x=h^0`, :math:`P(h^{k-1} | h^k)` is a conditional distribution for
49- visible hidden units in an RBM associated with level :math:`k` of the DBN,
50- and :math:`P(h^{\ell-1}, h^{\ell})` is the visible-hidden joint distribution
51- in the top-level RBM. This is illustrated in the figure below.
48+ where :math:`x=h^0`, :math:`P(h^{k-1} | h^k)` is a conditional distribution
49+ for the visible units conditioned on the hidden units of the RBM at level
50+ :math:`k`, and :math:`P(h^{\ell-1}, h^{\ell})` is the visible-hidden joint
51+ distribution in the top-level RBM. This is illustrated in the figure below.
5252
5353
5454.. figure:: images/DBN3.png
5555 :align: center
5656
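For concreteness, with :math:`\ell = 3` hidden layers the factorization above
expands to

.. math::

    P(x, h^{1}, h^{2}, h^{3}) = P(x | h^{1}) \, P(h^{1} | h^{2}) \, P(h^{2}, h^{3}),

i.e. every layer below the top two contributes a directed conditional, while
the top two layers retain the undirected joint distribution of an RBM.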
57- The principle of greedy layer-wise unsupervised training can be applied with
58- RBMs as the building blocks for each layer [Hinton06]_, [Bengio07]_. The process
59- is as follows:
57+ The principle of greedy layer-wise unsupervised training can be applied to
58+ DBNs with RBMs as the building blocks for each layer [Hinton06]_, [Bengio07]_.
59+ The process is as follows:
6060
61611. Train the first layer as an RBM that models the raw input :math:`x =
6262h^{(0)}` as its visible layer.
6363
64- 2. Use that first layer to obtain a representation of the input data that will
65- be used as data for the second layer. Two common solutions exist. The
66- reprensetation can be chosen as being the mean activations
67- :math:`p(h^{(1)}=1|h^{(0)}` or samples of :math:`p(h^{(1)}|h^{(0)}`.
64+ 2. Use that first layer to obtain a representation of the input that will
65+ be used as data for the second layer. Two common solutions exist. This
66+ representation can be chosen as either the mean activations
67+ :math:`p(h^{(1)}=1|h^{(0)})` or samples drawn from :math:`p(h^{(1)}|h^{(0)})`.
6868
69693. Train the second layer as an RBM, taking the transformed data (samples or
7070mean activations) as training examples (for the visible layer of that RBM).
@@ -100,7 +100,7 @@ p(x)` can be rewritten as,
100100 :label: dbn_bound
101101
102102 \log p(x) = &KL(Q(h^{(1)}|x)||p(h^{(1)}|x)) + H_{Q(h^{(1)}|x)} + \\
103- &\sum_h Q(h^{(1)}|x)(\log p(h^{(1)}) + \log p(x|h^{(1)}))
103+ &\sum_h Q(h^{(1)}|x)(\log p(h^{(1)}) + \log p(x|h^{(1)})).
104104
105105:math:`KL(Q(h^{(1)}|x) || p(h^{(1)}|x))` represents the KL divergence between
106106the posterior :math:`Q(h^{(1)}|x)` of the first RBM if it were standalone, and the
@@ -133,10 +133,10 @@ for SdA. The main difference is that we use the RBM class instead of the dA
133133class.
134134
135135We start off by defining the DBN class which will store the layers of the
136- MLP, along with their associated RBMs. Since in this tutorial we take the
137- viewpoint of using the RBMs to initialize an MLP, the code will reflect this
138- by seperating as much as possible the RBMs used to initialize the network
139- and the MLP used for classification.
136+ MLP, along with their associated RBMs. Since we take the viewpoint of using
137+ the RBMs to initialize an MLP, the code will reflect this by separating as
138+ much as possible the RBMs used to initialize the network and the MLP used for
139+ classification.
140140
141141.. code-block:: python
142142
@@ -180,16 +180,16 @@ and the MLP used for classification.
180180 self.y = T.ivector('y') # the labels are presented as 1D vector of
181181 # [int] labels
182182
183- ``self.sigmoid_layers`` will store the sigmoid layers of the MLP facade, while
184- ``self.rbm_layers`` will store the RBMs associated with the layers of the MLP.
183+ ``self.sigmoid_layers`` will store the feed-forward graphs which together form
184+ the MLP, while ``self.rbm_layers`` will store the RBMs used to pretrain each
185+ layer of the MLP.
185186
186187Next step, we construct ``n_layers`` sigmoid layers (we use the
187- ``SigmoidalLayer`` class introduced in :ref:`mlp`, with the only
188- modification that we replaced the non-linearity from ``tanh`` to the
189- logistic function :math:`s(x) = \frac{1}{1+e^{-x}}`) and ``n_layers``
190- RBMs, where ``n_layers`` is the depth of our model.
191- We link the sigmoid layers such that they form an MLP, and construct
192- each RBM such that they share the weight matrix and the
188+ ``SigmoidalLayer`` class introduced in :ref:`mlp`, with the only modification
189+ that we replaced the non-linearity from ``tanh`` to the logistic function
190+ :math:`s(x) = \frac{1}{1+e^{-x}}`) and ``n_layers`` RBMs, where ``n_layers``
191+ is the depth of our model. We link the sigmoid layers such that they form an
192+ MLP, and construct each RBM such that they share the weight matrix and the
193193bias with its corresponding sigmoid layer.
194194
195195
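To make the weight sharing concrete, here is a minimal standalone sketch. It
is not the tutorial's actual construction code; the sizes and the
``RBM(..., W=W, hbias=hbias)`` call are illustrative assumptions. A single
shared weight matrix and hidden bias are created once and handed both to the
sigmoid layer (for the feed-forward pass) and to the RBM that pretrains it,
so updates made through either one are seen by the other.

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    # one shared weight matrix and hidden bias, created once
    rng = numpy.random.RandomState(123)
    n_in, n_out = 784, 500
    W = theano.shared(
        numpy.asarray(rng.uniform(low=-0.01, high=0.01, size=(n_in, n_out)),
                      dtype=theano.config.floatX),
        name='W')
    hbias = theano.shared(numpy.zeros(n_out, dtype=theano.config.floatX),
                          name='hbias')

    x = T.matrix('x')
    # the sigmoid layer of the MLP uses them for the feed-forward pass ...
    layer_output = T.nnet.sigmoid(T.dot(x, W) + hbias)
    # ... and the RBM at this level is built around the *same* shared
    # variables (hypothetical call, mirroring the tutorial's classes):
    # rbm = RBM(input=x, n_visible=n_in, n_hidden=n_out, W=W, hbias=hbias)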
@@ -236,9 +236,9 @@ bias with its corresponding sigmoid layer.
236236 self.rbm_layers.append(rbm_layer)
237237
238238
239- All we need now is to add the logistic layer on top of the sigmoid
240- layers such that we have an MLP. We will
241- use the ``LogisticRegression`` class introduced in :ref:`logreg`.
239+ All that is left is to stack one last logistic regression layer in order to
240+ form an MLP. We will use the ``LogisticRegression`` class introduced in
241+ :ref:`logreg`.
242242
243243.. code-block:: python
244244
@@ -257,10 +257,9 @@ use the ``LogisticRegression`` class introduced in :ref:`logreg`.
257257 # minibatch given by self.x and self.y
258258 self.errors = self.logLayer.errors(self.y)
259259
260- The class also provides a method that generates training functions for
261- each of the RBM associated with the different layers.
262- They are returned as a list, where element :math:`i` is a function that
263- implements one step of training the ``RBM`` correspoinding to layer
260+ The class also provides a method which generates training functions for each
261+ of the RBMs. They are returned as a list, where element :math:`i` is a
262+ function which implements one step of training for the ``RBM`` at layer
264263:math:`i`.
265264
266265
@@ -282,9 +281,8 @@ implements one step of training the ``RBM`` correspoinding to layer
282281 # index to a [mini]batch
283282 index = T.lscalar('index') # index to a minibatch
284283
285- In order to be able to change the learning rate
286- during training we associate a Theano variable to it that has a
287- default value.
284+ In order to be able to change the learning rate during training, we associate
285+ it with a Theano variable that has a default value.
288286
289287.. code-block:: python
290288
@@ -319,18 +317,17 @@ default value.
319317
320318 return pretrain_fns
321319
322- Now any function ``pretrain_fns[i]`` takes as arguments ``index`` and
323- optionally ``lr`` -- the
324- learning rate. Note that the name of the parameters are the name given
325- to the Theano variables when they are constructed, not the name of the
326- python variables (``learning_rate``). Keep this
327- in mind when working with Theano. Optionally, if you provide ``k`` (the
328- number of Gibbs steps to do in CD or PCD) this will also become an argument
329- of your function.
320+ Now any function ``pretrain_fns[i]`` takes as arguments ``index`` and
321+ optionally ``lr`` -- the learning rate. Note that the names of the parameters
322+ are the names given to the Theano variables (e.g. ``lr``) when they are
323+ constructed, and not the names of the Python variables (e.g. ``learning_rate``). Keep
324+ this in mind when working with Theano. Optionally, if you provide ``k`` (the
325+ number of Gibbs steps to perform in CD or PCD) this will also become an
326+ argument of your function.
330327
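As a hedged usage sketch (the names follow this tutorial's conventions, and
the method is assumed to be called ``pretraining_functions`` as in the
accompanying code; ``dbn``, ``train_set_x``, ``batch_size``, ``batch_index``
and ``pretrain_lr`` are assumed to be defined elsewhere), pretraining the
first-layer RBM on one minibatch with a non-default learning rate would look
like:

.. code-block:: python

    # build the list of pretraining functions, one per RBM, using CD-1
    pretrain_fns = dbn.pretraining_functions(train_set_x=train_set_x,
                                             batch_size=batch_size, k=1)

    # one training step on minibatch `batch_index` for the first-layer RBM,
    # overriding the default learning rate through the `lr` argument
    cost = pretrain_fns[0](index=batch_index, lr=pretrain_lr)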
331- In the same fashion we build a method for constructing function required
332- during finetuning ( a ``train_model``, a ``validate_model`` and a
333- ``test_model`` function).
328+ In the same fashion, the DBN class includes a method for building the
329+ functions required for finetuning (a ``train_model``, a ``validate_model``
330+ and a ``test_model`` function).
334331
335332.. code-block:: python
336333
@@ -397,9 +394,9 @@ during finetuning ( a ``train_model``, a ``validate_model`` and a
397394
398395
399396Note that the returned ``valid_score`` and ``test_score`` are not Theano
400- functions, but rather python functions that also loop over the entire
401- validation set and the entire test set producing a list of the losses
402- over these sets.
397+ functions, but rather Python functions. These loop over the entire
398+ validation set and the entire test set to produce a list of the losses
399+ obtained over these sets.
403400
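For illustration, a hedged sketch of how these returned functions are
typically used (``dbn``, ``datasets``, ``batch_size``, ``finetune_lr`` and
``minibatch_index`` are assumed to be defined as elsewhere in the tutorial):

.. code-block:: python

    import numpy

    train_fn, validate_model, test_model = dbn.build_finetune_functions(
        datasets=datasets, batch_size=batch_size, learning_rate=finetune_lr)

    minibatch_avg_cost = train_fn(minibatch_index)  # one finetuning SGD step
    validation_losses = validate_model()            # plain Python function that
                                                    # loops over all validation
                                                    # minibatches
    this_validation_loss = numpy.mean(validation_losses)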
404401
405402Putting it all together
@@ -418,15 +415,14 @@ The few lines of code below constructs the deep belief network :
418415
419416
420417
421- There are two stages in training this network, a layer-wise pre-training and
422- fine-tuning afterwards.
418+ There are two stages in training this network: (1) a layer-wise pre-training
419+ stage and (2) a fine-tuning stage.
423420
424- For the pre-training stage, we will loop over all the layers of the
425- network. For each layer we will use the compiled theano function that
426- implements a SGD step towards optimizing the weights for reducing
427- the reconstruction cost of that layer. This function will be applied
428- to the training set for a fixed number of epochs given by
429- ``pretraining_epochs``.
421+ For the pre-training stage, we loop over all the layers of the network. For
422+ each layer, we use the compiled Theano function which determines the
423+ input to the ``i``-th level RBM and performs one step of CD-k within this RBM.
424+ This function is applied to the training set for a fixed number of epochs
425+ given by ``pretraining_epochs``.
430426
431427
432428.. code-block:: python
@@ -457,8 +453,8 @@ to the training set for a fixed number of epochs given by
457453
458454 end_time = time.clock()
459455
460- The fine-tuning loop is very similar with the one in the :ref:`mlp`, the
461- only difference is that we will use now the functions given by
456+ The fine-tuning loop is very similar to the one in the :ref:`mlp` tutorial,
457+ the only difference being that we now use the functions given by
462458`build_finetune_functions`.
463459
464460Running the Code
@@ -479,12 +475,12 @@ Tips and Tricks
479475+++++++++++++++
480476
481477One way to improve the running time of your code (given that you have
482- sufficient memory available), is to compute how the network, up to layer
483- :math:`k-1`, transforms your data. Namely, you start by training your first
478+ sufficient memory available) is to compute the representation of the entire
479+ dataset at layer ``i`` in a single pass, once the weights of the first
480+ :math:`i-1` layers have been fixed. Namely, start by training your first
484481layer RBM. Once it is trained, you can compute the hidden units values for
485- every datapoint in your dataset and store this as a new dataset that you will
486- use to train the RBM corresponding to layer 2. Once you trained the RBM for
487- layer 2, you compute, in a similar fashion, the dataset for layer 3 and so on.
488- You can see now, that at this point, the RBMs are trained individually, and
489- they just provide (one to the other) a non-linear transformation of the input.
490- Once all RBMs are trained, you can start fine-tunning the model.
482+ every example in the dataset and store this as a new dataset which is used to
483+ train the second-layer RBM. Once you have trained the RBM for layer 2, you
484+ compute, in a similar fashion, the dataset for layer 3 and so on. This avoids
485+ recomputing the intermediate (hidden layer) representations ``pretraining_epochs``
486+ times, at the expense of increased memory usage.
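The sketch below illustrates the idea with plain NumPy. It is not part of the
tutorial's code: ``train_set_x_values`` (the raw training data as an array)
and ``trained_rbm_params`` (the weights and hidden biases of the RBMs trained
so far) are assumed placeholders.

.. code-block:: python

    import numpy

    def sigmoid(a):
        return 1.0 / (1.0 + numpy.exp(-a))

    # assumed inputs (illustrative only):
    #   train_set_x_values : array of shape (n_examples, n_visible)
    #   trained_rbm_params : list of (W, hbias) pairs, one per trained RBM
    data = train_set_x_values
    layer_datasets = [data]
    for W, hbias in trained_rbm_params:
        # mean activations of the hidden units given the layer below;
        # stored once and reused as the training set for the next RBM
        data = sigmoid(numpy.dot(data, W) + hbias)
        layer_datasets.append(data)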