@@ -34,9 +34,11 @@ Deep Belief Networks
3434Deep Belief Networks
3535++++++++++++++++++++
3636
37- A Deep Belief Network [Hinton06]_ with :math:`\ell` layers models the joint
38- distribution between observed vector :math:`x` and :math:`\ell` hidden layers :math:`h^k` as
39- follows:
37+ [Hinton06]_ showed that RBMs can be stacked and trained in a greedy manner
38+ to form so-called Deep Belief Networks (DBN). DBNs are graphical models which
39+ learn to extract a deep hierarchical representation of the training data.
40+ They model the joint distribution between observed vector :math:`x` and
41+ the :math:`\ell` hidden layers :math:`h^k` as follows:
4042
4143.. math::
4244 :label: dbn
@@ -52,41 +54,89 @@ in the top-level RBM. This is illustrated in the figure below.
5254.. figure:: images/DBN3.png
5355 :align: center
5456
57+ The principle of greedy layer-wise unsupervised training can be applied with
58+ RBMs as the building blocks for each layer [Hinton06]_, [Bengio07]_. The process
59+ is as follows:
5560
56- In practice, such a model is trained in two stages, a pretraining stage and
57- a fine-tunning one. During pretraining, you go through the layers starting
58- from the bottom to top and train each layer seperately. At this point you
59- can see your model as a set of disconnected RBMs that have to be trained.
60- To train the RBM corresponding to layer :math:`k` though, you need to have
61- the input of this layer (which depends on the RBM corresponding to the first
62- :math:`k-1` layers) and this is why you have to go through the RBMs in a
63- specific order. A trick that you can do (and would actually improve the
64- time run of your code, given that you have sufficient memory available),
65- is to compute how the network, up to layer :math:`k-1`, transforms your
66- data. Namely, you start by training your first layer RBM. Once it
67- is trained, you can compute the hidden units values for every datapoint in
68- your dataset and store this as a new dataset that you will use to train the
69- RBM corresponding to layer 2. Once you trained the RBM for layer 2, you
70- compute, in a similar fashion, the dataset for layer 3 and so on. You
71- can see now, that at this point, the RBMs are trained individually, and
72- they just provide (one to the other) a non-linear transformation of the
73- input. Once all RBMs are trained, you can start fine-tunning the model.
74-
75- During fine-tunning, you drop the RBMs and just use the learned weights
76- and biases to create a MLP. Layer :math:`k` of the MLP will have the
77- weights and biases of the RBM corresponding to layer :math:`k`. On top
78- of these layers you add a logistic regression layer and train the model
79- using (stochastic) gradient descent. Note, that for classification you
80- use the RBMs just to initialize your MLP, which you will use in the end
81- as your model.
82-
83- To implement this in Theano we will use the class defined before for the
84- RBM tutorial. As an observation, the code for the DBN is very similar
85- with the one for SdA, mostly the difference being that we use the RBM class
86- instead of the dA class.
87-
88- We start off, by defining the DBN class which will store the layers of the
89- MLP together with the RBMs that are linked to them.
61+ 1. Train the first layer as an RBM that models the raw input :math:`x =
62+ h^{(0)}` as its visible layer.
63+
64+ 2. Use that first layer to obtain a representation of the input data that will
65+ be used as data for the second layer. Two common solutions exist. The
66+ representation can be chosen as the mean activations
67+ :math:`p(h^{(1)}=1|h^{(0)})` or samples of :math:`p(h^{(1)}|h^{(0)})`.
68+
69+ 3. Train the second layer as an RBM, taking the transformed data (samples or
70+ mean activations) as training examples (for the visible layer of that RBM).
71+
72+ 4. Iterate (2 and 3) for the desired number of layers, each time propagating
73+ upward either samples or mean values (see the schematic sketch after this list).
74+
75+ 5. Fine-tune all the parameters of this deep architecture with respect to a
76+ proxy for the DBN log-likelihood, or with respect to a supervised training
77+ criterion (after adding extra learning machinery to convert the learned
78+ representation into supervised predictions, e.g. a linear classifier).
79+
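To make steps 1-4 concrete, here is a small, self-contained NumPy sketch of the
greedy procedure (CD-1 training of each RBM, then upward propagation of the
mean activations). All names, sizes and hyper-parameters are purely
illustrative; the actual Theano implementation, built from the RBM class of the
previous tutorial, is developed below.

.. code-block:: python

    import numpy

    rng = numpy.random.RandomState(123)

    def sigmoid(a):
        return 1. / (1. + numpy.exp(-a))

    def train_rbm_cd1(data, n_hidden, lr=0.1, epochs=15):
        """Train a binary RBM on `data` with one step of contrastive divergence."""
        n_visible = data.shape[1]
        W = 0.01 * rng.randn(n_visible, n_hidden)
        hbias = numpy.zeros(n_hidden)
        vbias = numpy.zeros(n_visible)
        for epoch in xrange(epochs):
            # positive phase: mean activations and samples of p(h|v)
            ph_pos = sigmoid(numpy.dot(data, W) + hbias)
            h_sample = (rng.uniform(size=ph_pos.shape) < ph_pos) * 1.
            # negative phase: one Gibbs step (CD-1)
            pv_neg = sigmoid(numpy.dot(h_sample, W.T) + vbias)
            ph_neg = sigmoid(numpy.dot(pv_neg, W) + hbias)
            # approximate gradient step on W, hbias, vbias
            W += lr * (numpy.dot(data.T, ph_pos) -
                       numpy.dot(pv_neg.T, ph_neg)) / data.shape[0]
            hbias += lr * (ph_pos - ph_neg).mean(axis=0)
            vbias += lr * (data - pv_neg).mean(axis=0)
        return W, hbias, vbias

    # toy binary dataset standing in for h^(0) = x
    layer_input = rng.binomial(1, 0.5, size=(500, 30)).astype(numpy.float64)

    stack = []
    for n_hidden in [20, 10]:                    # two stacked RBMs
        W, hbias, vbias = train_rbm_cd1(layer_input, n_hidden)  # steps 1 and 3
        stack.append((W, hbias))
        # steps 2 and 4: mean activations become the next layer's dataset
        layer_input = sigmoid(numpy.dot(layer_input, W) + hbias)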
80+
81+ In this tutorial, we focus on fine-tuning via supervised gradient descent.
82+ Specifically, we use a logistic regression classifier to classify the input
83+ :math:`x` based on the output of the last hidden layer :math:`h^{(l)}` of the
84+ DBN. Fine-tuning is then performed via supervised gradient descent of the
85+ negative log-likelihood cost function. Since the supervised gradient is only
86+ non-null for the weights and hidden layer biases of each layer (i.e. null for
87+ the visible biases of each RBM), this procedure is equivalent to initializing
88+ the parameters of a deep MLP with the weights and hidden layer biases obtained
89+ with the unsupervised training strategy.
90+
91+ Justifying Greedy Layer-Wise Pre-Training
92+ +++++++++++++++++++++++++++++++++++++++++
93+
94+ Why does such an algorithm work? Taking as an example a 2-layer DBN with hidden
95+ layers :math:`h^{(1)}` and :math:`h^{(2)}` (with respective weight parameters
96+ :math:`W^{(1)}` and :math:`W^{(2)}`), [Bengio09]_ established that :math:`\log
97+ p(x)` can be rewritten as,
98+
99+ .. math::
100+ :label: dbn_bound
101+
102+ \log p(x) = &KL(Q(h^{(1)}|x)||p(h^{(1)}|x)) + H_{Q(h^{(1)}|x)} + \\
103+ &\sum_h Q(h^{(1)}|x)(\log p(h^{(1)}) + \log p(x|h^{(1)}))
104+
105+ :math:`KL(Q(h^{(1)}|x) || p(h^{(1)}|x))` represents the KL divergence between
106+ the posterior :math:`Q(h^{(1)}|x)` of the first RBM if it were standalone, and the
107+ probability :math:`p(h^{(1)}|x)` for the same layer but defined by the entire DBN
108+ (i.e. taking into account the prior :math:`p(h^{(1)},h^{(2)})` defined by the
109+ top-level RBM). :math:`H_{Q(h^{(1)}|x)}` is the entropy of the distribution
110+ :math:`Q(h^{(1)}|x)`.
111+
112+ It can be shown that if we initialize both hidden layers such that
113+ :math:`W^{(2)}={W^{(1)}}^T`, then :math:`Q(h^{(1)}|x)=p(h^{(1)}|x)` and the KL
114+ divergence term is null. If we learn the first level RBM and then keep its
115+ parameters :math:`W^{(1)}` fixed, optimizing Eq. :eq:`dbn_bound` with respect
116+ to :math:`W^{(2)}` can thus only increase the likelihood :math:`p(x)`.
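
This is a variational argument: since the KL divergence is non-negative,
Eq. :eq:`dbn_bound` implies the lower bound

.. math::

    \log p(x) \geq H_{Q(h^{(1)}|x)} + \sum_h Q(h^{(1)}|x)\left(\log p(h^{(1)}) + \log p(x|h^{(1)})\right),

with equality exactly when the KL term is null, i.e. at the initialization
:math:`W^{(2)}={W^{(1)}}^T`. Improving the right-hand side with respect to
:math:`W^{(2)}` therefore improves a bound whose starting value is
:math:`\log p(x)` itself, which is why the likelihood cannot decrease.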
117+
118+ Also, notice that if we isolate the terms which depend only on :math:`W^{(2)}`, we
119+ get:
120+
121+ .. math::
122+     \sum_h Q(h^{(1)}|x)\log p(h^{(1)})
123+
124+ Optimizing this with respect to :math:`W^{(2)}` amounts to training a second-stage
125+ RBM, using the output of :math:`Q(h^{(1)}|x)` as the training distribution.
126+
127+ Implementation
128+ ++++++++++++++
129+
130+ To implement DBNs in Theano, we will use the class defined in the :doc:`rbm`
131+ tutorial. Note that the code for the DBN is very similar to that of the
132+ SdA. The main difference is that we use the RBM class instead of the dA
133+ class.
134+
135+ We start off by defining the DBN class which will store the layers of the
136+ MLP, along with their associated RBMs. Since in this tutorial we take the
137+ viewpoint of using the RBMs to initialize an MLP, the code will reflect this
138+ by separating as much as possible the RBMs used to initialize the network
139+ and the MLP used for classification.
90140
91141.. code-block:: python
92142
@@ -131,16 +181,16 @@ MLP together with the RBMs that are linked to them.
131181 # [int] labels
132182
133183``self.sigmoid_layers`` will store the sigmoid layers of the MLP facade, while
134- ``self.rbm_layers`` will store the RBMs associated with the layers of the MLP.
184+ ``self.rbm_layers`` will store the RBMs associated with the layers of the MLP.
135185
136186Next, we construct ``n_layers`` sigmoid layers (we use the
137187``SigmoidalLayer`` class introduced in :ref:`mlp`, with the only
138188modification that we replace the ``tanh`` non-linearity with the
139189logistic function :math:`s(x) = \frac{1}{1+e^{-x}}`) and ``n_layers``
140- denoising autoencoders , where ``n_layers`` is the depth of our model.
190+ RBMs, where ``n_layers`` is the depth of our model.
141191We link the sigmoid layers such that they form an MLP, and construct
142192each RBM such that it shares the weight matrix and the
143- bias of the encoding part with its corresponding sigmoid layer.
193+ bias with its corresponding sigmoid layer.
144194
145195
146196.. code-block:: python
@@ -216,7 +266,7 @@ implements one step of training the ``RBM`` corresponding to layer
216266
217267.. code-block:: python
218268
219- def pretraining_functions(self, train_set_x, batch_size):
269+ def pretraining_functions(self, train_set_x, batch_size, k ):
220270 ''' Generates a list of functions, for performing one step of gradient descent at a
221271 given layer. The function will require as input the minibatch index, and to train an
222272 RBM you just need to iterate, calling the corresponding function on all minibatch
@@ -226,6 +276,7 @@ implements one step of training the ``RBM`` corresponding to layer
226276 :param train_set_x: Shared var. that contains all datapoints used for training the RBM
227277 :type batch_size: int
228278 :param batch_size: size of a [mini]batch
279+ :param k: number of Gibbs steps to do in CD-k / PCD-k
229280 '''
230281
231282 # index to a [mini]batch
@@ -251,11 +302,15 @@ default value.
251302
252303 # get the cost and the updates list
253304 # TODO: change cost function to reconstruction error
254- cost,updates = rbm.cd(learning_rate, persistent=None)
305+             cost, updates = rbm.cd(learning_rate, persistent=None, k=k)
255306
256- # compile the theano function
257- fn = theano.function(inputs = [index,
258- theano.Param(learning_rate, default = 0.1)],
307+             # compile the Theano function; if k is also a Theano variable,
308+             # it has to be added to the inputs of the compiled function
309+             if isinstance(k, theano.Variable):
310+                 inputs = [index, theano.Param(learning_rate, default=0.1), k]
311+             else:
312+                 inputs = [index, theano.Param(learning_rate, default=0.1)]
313+ fn = theano.function(inputs = inputs,
259314 outputs = cost,
260315 updates = updates,
261316 givens = {self.x :train_set_x[batch_begin:batch_end]})
@@ -269,11 +324,13 @@ optionally ``lr`` -- the
269324learning rate. Note that the names of the parameters are the names given
270325to the Theano variables when they are constructed, not the names of the
271326Python variables (``learning_rate``). Keep this
272- in mind when working with Theano.
327+ in mind when working with Theano. Optionally, if you provide ``k`` (the
328+ number of Gibbs steps to do in CD or PCD), this will also become an argument
329+ of your function.
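
The short, self-contained snippet below illustrates this point outside of the
DBN class (all names in it are only for the example): the keyword accepted by
the compiled function is the name of the Theano variable, ``'lr'``, not the
name of the Python variable that holds it.

.. code-block:: python

    import theano
    import theano.tensor as T

    learning_rate = T.scalar('lr')  # the Theano name is 'lr'

    # double the learning rate; 0.1 is used when no value is given
    double = theano.function([theano.Param(learning_rate, default=0.1)],
                             learning_rate * 2)

    print double()        # uses the default value -> 0.2
    print double(lr=0.3)  # keyword must be 'lr', not 'learning_rate'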
273330
274331In the same fashion, we build a method for constructing the functions required
275332during fine-tuning (a ``train_model``, a ``validate_model`` and a
276- ``test_model`` funcion ).
333+ ``test_model`` function).
277334
278335.. code-block:: python
279336
@@ -379,9 +436,11 @@ to the training set for a fixed number of epochs given by
379436 # PRETRAINING THE MODEL #
380437 #########################
381438 print '... getting the pretraining functions'
439+ # We are using CD-1 here
382440 pretraining_fns = dbn.pretraining_functions(
383441 train_set_x = train_set_x,
384- batch_size = batch_size )
442+ batch_size = batch_size,
443+ k = 1)
385444
386445 print '... pre-training the model'
387446 start_time = time.clock()
@@ -415,3 +474,17 @@ The user can run the code by calling:
415474Sampling a DBN
416475++++++++++++++
417476
477+
478+ Tips and Tricks
479+ +++++++++++++++
480+
481+ One way to improve the running time of your code (given that you have
482+ sufficient memory available) is to compute, once and for all, how the network,
483+ up to layer :math:`k-1`, transforms your data. Namely, you start by training
484+ your first-layer RBM. Once it is trained, you compute the hidden unit values
485+ for every datapoint in your dataset and store them as a new dataset, which you
486+ then use to train the RBM corresponding to layer 2. Once the RBM for layer 2
487+ is trained, you compute, in a similar fashion, the dataset for layer 3, and so
488+ on. Note that at this point the RBMs are only trained individually; each one
489+ simply provides a non-linear transformation of the input to the next. Once all
490+ RBMs are trained, you can start fine-tuning the model, as sketched below.
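
Below is a hedged sketch of this trick, assuming a ``dbn`` instance built with
the DBN class of this tutorial (so that ``dbn.x`` and ``dbn.sigmoid_layers``
exist, and the sigmoid layers share their parameters with the RBMs). It
compiles a Theano function that transforms the whole dataset with the first
``n`` layers and stores the result as a new shared dataset.

.. code-block:: python

    import theano

    def transform_dataset(dbn, n, data_x):
        """Return `data_x` propagated through the first `n` sigmoid layers
        of `dbn`, stored as a new shared variable."""
        transform = theano.function(
            inputs=[],
            outputs=dbn.sigmoid_layers[n - 1].output,
            givens={dbn.x: data_x})
        return theano.shared(transform(), borrow=True)

    # e.g. after pretraining layer 1, build the dataset for the layer-2 RBM:
    # train_set_layer2 = transform_dataset(dbn, 1, train_set_x)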