
Commit 160416e

Author: Razvan Pascanu (committed)
Commit message: Merget .. I hope I didn't break anything
1 parent a59a2bf commit 160416e

6 files changed

Lines changed: 294 additions & 104 deletions


code/DBN.py

Lines changed: 3 additions & 2 deletions
@@ -373,8 +373,9 @@ def test_DBN( finetune_lr = 0.1, pretraining_epochs = 10, \
                  'with test performance %f %%') %
                  (best_validation_loss * 100., test_score*100.))
     print >> sys.stderr, ('The fine tuning code for file '+os.path.split(__file__)[1]+' ran for %.2fm expected Xm our buildbot' % ((end_time-start_time)/60.))
-
-
+    ##################
+    ## SAMPLING DBN ##
+    ##################
 
 
 

code/rbm.py

Lines changed: 14 additions & 4 deletions
@@ -104,10 +104,15 @@ def free_energy(self, v_sample):
         hidden_term = T.sum(T.log(1+T.exp(wx_b)),axis = 1)
         return -hidden_term - vbias_term
 
+    def propup(self, vis):
+        ''' This function propagates the visible units activation upwards to
+        the hidden units '''
+        return T.nnet.sigmoid(T.dot(vis, self.W) + self.hbias)
+
     def sample_h_given_v(self, v0_sample):
         ''' This function infers state of hidden units given visible units '''
         # compute the activation of the hidden units given a sample of the visibles
-        h1_mean = T.nnet.sigmoid(T.dot(v0_sample, self.W) + self.hbias)
+        h1_mean = self.propup(v0_sample)
         # get a sample of the hiddens given their activation
         # Note that theano_rng.binomial returns a symbolic sample of dtype
         # int64 by default. If we want to keep our computations in floatX
@@ -116,10 +121,15 @@ def sample_h_given_v(self, v0_sample):
                                             dtype = theano.config.floatX)
         return [h1_mean, h1_sample]
 
+    def propdown(self, hid):
+        '''This function propagates the hidden units activation downwards to
+        the visible units'''
+        return T.nnet.sigmoid(T.dot(hid, self.W.T) + self.vbias)
+
     def sample_v_given_h(self, h0_sample):
         ''' This function infers state of visible units given hidden units '''
         # compute the activation of the visible given the hidden sample
-        v1_mean = T.nnet.sigmoid(T.dot(h0_sample, self.W.T) + self.vbias)
+        v1_mean = self.propdown(h0_sample)
         # get a sample of the visible given their activation
         # Note that theano_rng.binomial returns a symbolic sample of dtype
         # int64 by default. If we want to keep our computations in floatX
@@ -352,13 +362,13 @@ def test_rbm(learning_rate=0.1, training_epochs = 15,
     #################################
 
 
-    # find out the number of test
+    # find out the number of test samples
     number_of_test_samples = test_set_x.value.shape[0]
 
     # pick random test examples, with which to initialize the persistent chain
     test_idx = rng.randint(number_of_test_samples-n_chains)
     persistent_vis_chain = theano.shared(
-            numpy.array(test_set_x.value[test_idx:test_idx+100], dtype=theano.config.floatX))
+            numpy.array(test_set_x.value[test_idx:test_idx+n_chains], dtype=theano.config.floatX))
 
     plot_every = 1000
     # define one step of Gibbs sampling (mf = mean-field)
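
The new ``propup``/``propdown`` helpers are the two halves of one Gibbs step:
propagate the visibles up to the hiddens, sample, then propagate back down.
A standalone numpy sketch of that round trip (illustrative only, not the
Theano code above; ``W``, ``hbias`` and ``vbias`` stand for the plain arrays
of an already-trained RBM):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def propup(vis, W, hbias):
        # mean activation of the hidden units given the visibles
        return sigmoid(np.dot(vis, W) + hbias)

    def propdown(hid, W, vbias):
        # mean activation of the visible units given the hiddens
        return sigmoid(np.dot(hid, W.T) + vbias)

    def gibbs_vhv(v0, W, hbias, vbias, rng):
        # one full Gibbs step v -> h -> v, mirroring sample_h_given_v
        # followed by sample_v_given_h in the Theano class
        h1_mean = propup(v0, W, hbias)
        h1_sample = rng.binomial(n=1, p=h1_mean).astype(v0.dtype)
        v1_mean = propdown(h1_sample, W, vbias)
        v1_sample = rng.binomial(n=1, p=v1_mean).astype(v0.dtype)
        return h1_sample, v1_mean, v1_sample

    # example: rng = np.random.RandomState(123); repeated calls to
    # gibbs_vhv starting from a data vector run the sampling chain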

doc/DBN.txt

Lines changed: 121 additions & 48 deletions
@@ -34,9 +34,11 @@ Deep Belief Networks
 Deep Belief Networks
 ++++++++++++++++++++
 
-A Deep Belief Network [Hinton06]_ with :math:`\ell` layers models the joint
-distribution between observed vector :math:`x` and :math:`\ell` hidden layers :math:`h^k` as
-follows:
+[Hinton06]_ showed that RBMs can be stacked and trained in a greedy manner
+to form so-called Deep Belief Networks (DBN). DBNs are graphical models which
+learn to extract a deep hierarchical representation of the training data.
+They model the joint distribution between observed vector :math:`x` and
+the :math:`\ell` hidden layers :math:`h^k` as follows:
 
 .. math::
     :label: dbn
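
The ``.. math::`` block labelled ``dbn`` is only context in this hunk (its body
is not part of the diff); for reference, the joint it refers to has the standard
DBN form, with :math:`h^0 = x`:

.. math::

    P(x, h^1, \ldots, h^{\ell}) = \left(\prod_{k=0}^{\ell-2} P(h^k|h^{k+1})\right) P(h^{\ell-1},h^{\ell})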
@@ -52,41 +54,89 @@ in the top-level RBM. This is illustrated in the figure below.
 .. figure:: images/DBN3.png
     :align: center
 
+The principle of greedy layer-wise unsupervised training can be applied with
+RBMs as the building blocks for each layer [Hinton06]_, [Bengio07]_. The process
+is as follows:
 
-In practice, such a model is trained in two stages, a pretraining stage and
-a fine-tunning one. During pretraining, you go through the layers starting
-from the bottom to top and train each layer seperately. At this point you
-can see your model as a set of disconnected RBMs that have to be trained.
-To train the RBM corresponding to layer :math:`k` though, you need to have
-the input of this layer (which depends on the RBM corresponding to the first
-:math:`k-1` layers) and this is why you have to go through the RBMs in a
-specific order. A trick that you can do (and would actually improve the
-time run of your code, given that you have sufficient memory available),
-is to compute how the network, up to layer :math:`k-1`, transforms your
-data. Namely, you start by training your first layer RBM. Once it
-is trained, you can compute the hidden units values for every datapoint in
-your dataset and store this as a new dataset that you will use to train the
-RBM corresponding to layer 2. Once you trained the RBM for layer 2, you
-compute, in a similar fashion, the dataset for layer 3 and so on. You
-can see now, that at this point, the RBMs are trained individually, and
-they just provide (one to the other) a non-linear transformation of the
-input. Once all RBMs are trained, you can start fine-tunning the model.
-
-During fine-tunning, you drop the RBMs and just use the learned weights
-and biases to create a MLP. Layer :math:`k` of the MLP will have the
-weights and biases of the RBM corresponding to layer :math:`k`. On top
-of these layers you add a logistic regression layer and train the model
-using (stochastic) gradient descent. Note, that for classification you
-use the RBMs just to initialize your MLP, which you will use in the end
-as your model.
-
-To implement this in Theano we will use the class defined before for the
-RBM tutorial. As an observation, the code for the DBN is very similar
-with the one for SdA, mostly the difference being that we use the RBM class
-instead of the dA class.
-
-We start off, by defining the DBN class which will store the layers of the
-MLP together with the RBMs that are linked to them.
+1. Train the first layer as an RBM that models the raw input :math:`x =
+   h^{(0)}` as its visible layer.
+
+2. Use that first layer to obtain a representation of the input data that will
+   be used as data for the second layer. Two common solutions exist. The
+   representation can be chosen as being the mean activations
+   :math:`p(h^{(1)}=1|h^{(0)})` or samples of :math:`p(h^{(1)}|h^{(0)})`.
+
+3. Train the second layer as an RBM, taking the transformed data (samples or
+   mean activations) as training examples (for the visible layer of that RBM).
+
+4. Iterate (2 and 3) for the desired number of layers, each time propagating
+   upward either samples or mean values.
+
+5. Fine-tune all the parameters of this deep architecture with respect to a
+   proxy for the DBN log-likelihood, or with respect to a supervised training
+   criterion (after adding extra learning machinery to convert the learned
+   representation into supervised predictions, e.g. a linear classifier).
+
+
+In this tutorial, we focus on fine-tuning via supervised gradient descent.
+Specifically, we use a logistic regression classifier to classify the input
+:math:`x` based on the output of the last hidden layer :math:`h^{(l)}` of the
+DBN. Fine-tuning is then performed via supervised gradient descent of the
+negative log-likelihood cost function. Since the supervised gradient is only
+non-null for the weights and hidden layer biases of each layer (i.e. null for
+the visible biases of each RBM), this procedure is equivalent to initializing
+the parameters of a deep MLP with the weights and hidden layer biases obtained
+with the unsupervised training strategy.
+
+Justifying Greedy Layer-Wise Pre-Training
++++++++++++++++++++++++++++++++++++++++++
+
+Why does such an algorithm work? Taking as example a 2-layer DBN with hidden
+layers :math:`h^{(1)}` and :math:`h^{(2)}` (with respective weight parameters
+:math:`W^{(1)}` and :math:`W^{(2)}`), [Bengio09]_ established that :math:`\log
+p(x)` can be rewritten as,
+
+.. math::
+    :label: dbn_bound
+
+    \log p(x) = &KL(Q(h^{(1)}|x)||p(h^{(1)}|x)) + H_{Q(h^{(1)}|x)} + \\
+                &\sum_h Q(h^{(1)}|x)(\log p(h^{(1)}) + \log p(x|h^{(1)}))
+
+:math:`KL(Q(h^{(1)}|x) || p(h^{(1)}|x))` represents the KL divergence between
+the posterior :math:`Q(h^{(1)}|x)` of the first RBM if it were standalone, and the
+probability :math:`p(h^{(1)}|x)` for the same layer but defined by the entire DBN
+(i.e. taking into account the prior :math:`p(h^{(1)},h^{(2)})` defined by the
+top-level RBM). :math:`H_{Q(h^{(1)}|x)}` is the entropy of the distribution
+:math:`Q(h^{(1)}|x)`.
+
+It can be shown that if we initialize both hidden layers such that
+:math:`W^{(2)}={W^{(1)}}^T`, then :math:`Q(h^{(1)}|x)=p(h^{(1)}|x)` and the KL
+divergence term is null. If we learn the first level RBM and then keep its
+parameters :math:`W^{(1)}` fixed, optimizing Eq. :eq:`dbn_bound` with respect
+to :math:`W^{(2)}` can thus only increase the likelihood :math:`p(x)`.
+
+Also, notice that if we isolate the terms which depend only on :math:`W^{(2)}`, we
+get:
+
+.. math::
+    \sum_h Q(h^{(1)}|x)p(h^{(1)})
+
+Optimizing this with respect to :math:`W^{(2)}` amounts to training a second-stage
+RBM, using the output of :math:`Q(h^{(1)}|x)` as the training distribution.
+
+Implementation
+++++++++++++++
+
+To implement DBNs in Theano, we will use the class defined in the :doc:`rbm`
+tutorial. As an observation, the code for the DBN is very similar to the one
+for SdA. The main difference is that we use the RBM class instead of the dA
+class.
+
+We start off by defining the DBN class which will store the layers of the
+MLP, along with their associated RBMs. Since in this tutorial we take the
+viewpoint of using the RBMs to initialize an MLP, the code will reflect this
+by separating as much as possible the RBMs used to initialize the network
+and the MLP used for classification.
 
 .. code-block:: python
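
The five-step recipe added above is, in code, a single loop over layers that
re-feeds each layer's output as the next layer's training data. A minimal
sketch in plain numpy (not part of this commit; ``train_rbm`` is a
hypothetical helper that returns an object exposing a ``propup`` method):

    import numpy as np

    def greedy_pretrain(data, layer_sizes, use_samples=False, seed=123):
        # data: (n_examples, n_visible) array, the raw input x = h^(0)
        rng = np.random.RandomState(seed)
        rbms = []
        layer_input = data
        for n_hidden in layer_sizes:
            # steps 1 and 3: train an RBM on the current representation
            rbm = train_rbm(layer_input, n_hidden)   # hypothetical helper
            rbms.append(rbm)
            # step 2: propagate upward, either mean activations or samples
            mean_act = rbm.propup(layer_input)
            if use_samples:
                layer_input = rng.binomial(n=1, p=mean_act).astype(data.dtype)
            else:
                layer_input = mean_act
        # step 5 (fine-tuning) would then start from the weights stored in rbms
        return rbms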

@@ -131,16 +181,16 @@ MLP together with the RBMs that are linked to them.
                                  # [int] labels
 
 ``self.sigmoid_layers`` will store the sigmoid layers of the MLP facade, while
-``self.rbm_layers`` will store the RBMs associated with the layers of the MLP.
+``self.rbm_layers`` will store the RBMs associated with the layers of the MLP.
 
 Next step, we construct ``n_layers`` sigmoid layers (we use the
 ``SigmoidalLayer`` class introduced in :ref:`mlp`, with the only
 modification that we replaced the non-linearity from ``tanh`` to the
 logistic function :math:`s(x) = \frac{1}{1+e^{-x}}`) and ``n_layers``
-denoising autoencoders, where ``n_layers`` is the depth of our model.
+RBMs, where ``n_layers`` is the depth of our model.
 We link the sigmoid layers such that they form an MLP, and construct
 each RBM such that they share the weight matrix and the
-bias of the encoding part with its corresponding sigmoid layer.
+bias with its corresponding sigmoid layer.
 
 
 .. code-block:: python
@@ -216,7 +266,7 @@ implements one step of training the ``RBM`` corresponding to layer
 
 .. code-block:: python
 
-    def pretraining_functions(self, train_set_x, batch_size):
+    def pretraining_functions(self, train_set_x, batch_size, k):
         ''' Generates a list of functions, for performing one step of gradient descent at a
         given layer. The function will require as input the minibatch index, and to train an
         RBM you just need to iterate, calling the corresponding function on all minibatch
@@ -226,6 +276,7 @@ implements one step of training the ``RBM`` corresponding to layer
         :param train_set_x: Shared var. that contains all datapoints used for training the RBM
         :type batch_size: int
         :param batch_size: size of a [mini]batch
+        :param k: number of Gibbs steps to do in CD-k / PCD-k
         '''
 
         # index to a [mini]batch
@@ -251,11 +302,15 @@ default value.
 
         # get the cost and the updates list
         # TODO: change cost function to reconstruction error
-        cost,updates = rbm.cd(learning_rate, persistent=None)
+        cost,updates = rbm.cd(learning_rate, persistent=None, k=k)
 
-        # compile the theano function
-        fn = theano.function(inputs = [index,
-                                       theano.Param(learning_rate, default = 0.1)],
+        # compile the Theano function; check if k is also a Theano
+        # variable, and if so add it to the inputs of the function
+        if isinstance(k, theano.Variable):
+            inputs = [index, theano.Param(learning_rate, default=0.1), k]
+        else:
+            inputs = [index, theano.Param(learning_rate, default=0.1)]
+        fn = theano.function(inputs = inputs,
                              outputs = cost,
                              updates = updates,
                              givens = {self.x :train_set_x[batch_begin:batch_end]})
@@ -269,11 +324,13 @@ optionally ``lr`` -- the
 learning rate. Note that the names of the parameters are the names given
 to the Theano variables when they are constructed, not the names of the
 python variables (``learning_rate``). Keep this
-in mind when working with Theano.
+in mind when working with Theano. Optionally, if you provide ``k`` (the
+number of Gibbs steps to do in CD or PCD) this will also become an argument
+of your function.
 
 In the same fashion we build a method for constructing the functions required
 during finetuning (a ``train_model``, a ``validate_model`` and a
-``test_model`` funcion).
+``test_model`` function).
 
 .. code-block:: python

@@ -379,9 +436,11 @@ to the training set for a fixed number of epochs given by
     # PRETRAINING THE MODEL #
     #########################
     print '... getting the pretraining functions'
+    # We are using CD-1 here
     pretraining_fns = dbn.pretraining_functions(
                                 train_set_x = train_set_x,
-                                batch_size = batch_size )
+                                batch_size = batch_size,
+                                k = 1)
 
     print '... pre-training the model'
     start_time = time.clock()
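
Once compiled, these per-layer functions are simply iterated over minibatches.
A hedged usage sketch (``epochs``, ``pretrain_lr``, ``n_train_batches`` and
``dbn.n_layers`` are illustrative names; the extra ``k`` argument exists only
when ``k`` was passed to ``pretraining_functions`` as a symbolic Theano
variable):

    # illustrative only: drive the compiled pretraining functions
    for i in xrange(dbn.n_layers):
        for epoch in xrange(epochs):
            for batch_index in xrange(n_train_batches):
                c = pretraining_fns[i](batch_index, lr=pretrain_lr)
            print 'Pre-training layer %i, epoch %d, cost %f' % (i, epoch, c)
    # with a symbolic k (e.g. created as T.iscalar('k')), the call would take
    # one more argument: pretraining_fns[i](batch_index, lr=pretrain_lr, k=1)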
@@ -415,3 +474,17 @@ The user can run the code by calling:
 Sampling a DBN
 ++++++++++++++
 
+
+Tips and Tricks
++++++++++++++++
+
+One way to improve the running time of your code (given that you have
+sufficient memory available) is to compute how the network, up to layer
+:math:`k-1`, transforms your data. Namely, you start by training your first
+layer RBM. Once it is trained, you can compute the hidden unit values for
+every datapoint in your dataset and store this as a new dataset that you will
+use to train the RBM corresponding to layer 2. Once you have trained the RBM
+for layer 2, you compute, in a similar fashion, the dataset for layer 3 and so on.
+You can see now that, at this point, the RBMs are trained individually, and
+they just provide (one to the other) a non-linear transformation of the input.
+Once all RBMs are trained, you can start fine-tuning the model.
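
The trick in the new "Tips and Tricks" section amounts to transforming the
dataset once per trained layer and caching the result, instead of
re-propagating every minibatch through all the lower layers at each step. A
standalone sketch, assuming each trained layer object exposes a
``propup``-style transform over a whole array:

    def cache_layer_datasets(raw_data, trained_layers):
        # datasets[k] is the (cached) input used to train layer k;
        # each transform runs exactly once over the full dataset
        datasets = [raw_data]
        for layer in trained_layers:
            datasets.append(layer.propup(datasets[-1]))
        return datasets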

doc/SdA.txt

Lines changed: 15 additions & 0 deletions
@@ -434,3 +434,18 @@ For comparison, on a multi-core Intel Xeon X5560 @ 2.80GHz, using multi-threaded
 meaning a ~4x speed-up at an 8x CPU cost.
 
 Timings accurate as of March 16, 2010.
+
+
+Tips and Tricks
++++++++++++++++
+
+One way to improve the running time of your code (given that you have
+sufficient memory available) is to compute how the network, up to layer
+:math:`k-1`, transforms your data. Namely, you start by training your first
+layer dA. Once it is trained, you can compute the hidden unit values for
+every datapoint in your dataset and store this as a new dataset that you will
+use to train the dA corresponding to layer 2. Once you have trained the dA
+for layer 2, you compute, in a similar fashion, the dataset for layer 3 and so on.
+You can see now that, at this point, the dAs are trained individually, and
+they just provide (one to the other) a non-linear transformation of the input.
+Once all dAs are trained, you can start fine-tuning the model.
