
Commit 1e482f8

* added explicit derivation of gradient updates to theory section
* cosmetic changes to writing
* moved plotting function to "utilities" page (found it too cluttered)
1 parent 05da1e5 commit 1e482f8

5 files changed

Lines changed: 276 additions & 195 deletions


doc/contents.txt

Lines changed: 1 addition & 2 deletions
@@ -16,6 +16,5 @@ Contents
     lenet
     SdA
     rbm
-    dbn
-    dae
+    utilities
     references


doc/intro.txt

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ The purely supervised learning algorithms are meant to be read in order:
 The unsupervised and semi-supervised learning algorithms are here (the auto-encoders can be read independently of the RBM/DBN thread):

 * :ref:`Auto Encoders, Stacked Auto Encoders, and Stacked Denoising Auto-Encoders <SdA>` - easy steps into unsupervised pre-training for deep nets
-* :ref:`Restricted Boltzmann Machines` - single layer generative RBM model
+* :ref:`Restricted Boltzmann Machines <rbm>` - single layer generative RBM model
 * :ref:`Deep Belief Networks` - unsupervised generative pre-training of stacked RBMs followed by supervised fine-tuning

 .. _Theano: http://deeplearning.net/software/theano

doc/rbm.txt

Lines changed: 92 additions & 192 deletions
@@ -97,6 +97,8 @@ negative phase gradient are referred to as **negative particles**, which are
 denoted as :math:`\mathcal{N}`. The gradient can then be written as:

 .. math::
+    :label: bm_grad
+
     \frac{\partial log p(x)}{\partial \theta}
         &\approx
         - \frac{\partial \mathcal{F}(x)}{\partial \theta} +

@@ -136,6 +138,7 @@ respectively.
 This translates directly to the following free energy formula:

 .. math::
+
     \mathcal{F}(x)= - b'x - \sum_i log \sum_{h_i} e^{h_i (c_i + W_i x)}.

 Because of the specific structure of RBMs, visible and hidden units are

@@ -153,14 +156,42 @@ In the commonly studied case of using binary units (where :math:`h_i \in
 version of the usual neuron activation function:

 .. math::
+    :label: rbm_propup
+
     P(h_i=1|x) = sigm(c_i + W_i x) \\
+
+.. math::
+    :label: rbm_propdown
+
     P(x_j=1|h) = sigm(b_j + W'_j h)

 The free energy of an RBM with binary units further simplifies to:

 .. math::
+    :label: rbm_free_energy
+
     \mathcal{F}(x)= - b'x - \sum_i log(1 + e^{(c_i + W_i x)}).

+**Update Equations with Binary Units**
+
+Combining Eqs. :eq:`bm_grad` with :eq:`rbm_free_energy`, we obtain the
+following log-likelihood gradients for an RBM with binary units:
+
+.. math::
+    :label: rbm_grad
+
+    - \frac{\partial log p(v)}{\partial W_{ij}} &=
+        - x^{(i)}_j \cdot sigm(W_i \cdot x^{(i)} + c_i)
+        + E_v[p(h_i|v) \cdot v_j] \\
+    - \frac{\partial log p(v)}{\partial c_i} &=
+        - sigm(W_i \cdot x^{(i)} + c_i) + E_v[p(h_i|v)] \\
+    - \frac{\partial log p(v)}{\partial b_j} &=
+        - x^{(i)}_j + E_v[p(v_j|h)]
+
+.. note::
+    We will be updating the tutorial shortly, so that the gradients are
+    computed directly (using ``T.grad``) from the free energy formula.

 Sampling in an RBM
 ++++++++++++++++++

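To make the binary-unit formulas added above concrete, here is a minimal NumPy sketch of the free energy of Eq. :eq:`rbm_free_energy` and the conditionals of Eqs. :eq:`rbm_propup` and :eq:`rbm_propdown`. It is an illustration only (the tutorial itself builds these as Theano expressions), and the names ``W``, ``vbias`` and ``hbias`` are assumed arrays of shape ``(n_visible, n_hidden)``, ``(n_visible,)`` and ``(n_hidden,)``.

.. code-block:: python

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def free_energy(v, W, vbias, hbias):
        """F(v) = -b'v - sum_i log(1 + exp(c_i + W_i v)), cf. Eq. :eq:`rbm_free_energy`."""
        wx_b = np.dot(v, W) + hbias                    # one row of c + Wv per example
        return -np.dot(v, vbias) - np.log1p(np.exp(wx_b)).sum(axis=1)

    def propup(v, W, hbias):
        """P(h_i = 1 | v), cf. Eq. :eq:`rbm_propup`."""
        return sigmoid(np.dot(v, W) + hbias)

    def propdown(h, W, vbias):
        """P(v_j = 1 | h), cf. Eq. :eq:`rbm_propdown`."""
        return sigmoid(np.dot(h, W.T) + vbias)

Having the free energy available in closed form like this is also what makes the ``T.grad`` approach mentioned in the note possible: the quantities in Eq. :eq:`rbm_grad` are differences of derivatives of this expression evaluated on the data and on the negative particles.
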
@@ -232,11 +263,11 @@ to changes in the model.
 Implementation
 ++++++++++++++

-We construct a ``RBM`` class. The constructor has to initialize all
-parameters of the network, or to get them as arguments. The second
-option is usefull when a RBM is used in a deep network,
-case in which the weight matrix and the hidden layer bias has to be
-shared with the corresponding sigmoidal layer of a mlp network.
+We construct an ``RBM`` class. The parameters of the network can either be
+initialized by the constructor or passed in as arguments. The latter option is
+useful when an RBM is used as the building block of a deep network, in which
+case the weight matrix and the hidden layer bias are shared with the
+corresponding sigmoidal layer of an MLP network.

 .. code-block:: python

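The constructor is not reproduced in this diff. As a rough illustration of the parameter-sharing idea only (a sketch with hypothetical names, not the tutorial's Theano-based constructor), it amounts to allocating ``W`` and the biases only when they are not supplied:

.. code-block:: python

    import numpy as np

    class RBMSkeleton(object):
        """Illustrative only; the tutorial's RBM stores Theano shared variables."""

        def __init__(self, n_visible, n_hidden, W=None, hbias=None, vbias=None,
                     rng=None):
            rng = rng if rng is not None else np.random.RandomState(1234)
            if W is None:
                # allocate fresh weights when the RBM stands alone
                W = np.asarray(rng.uniform(-0.1, 0.1, size=(n_visible, n_hidden)))
            if hbias is None:
                hbias = np.zeros(n_hidden)
            if vbias is None:
                vbias = np.zeros(n_visible)
            # when used inside a deep network, W and hbias can simply be the
            # corresponding sigmoid layer's parameters, so both models share them
            self.W, self.hbias, self.vbias = W, hbias, vbias
            self.params = [self.W, self.hbias, self.vbias]
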
@@ -316,10 +347,20 @@ shared with the corresponding sigmoidal layer of a mlp network.
         self.params = [self.W, self.hbias, self.vbias]
         self.batch_size = self.input.shape[0]

-Next step is to add a Gibbs sampling function to our class. Since we are
-going to use only CD-1 in this tutorial we need to implement just one
-step of Gibbs sampling. The following function does this, by first
-sample the hidden, and afterwards sampling the visible.
+
+The next step is to define a function that builds the symbolic graph associated
+with a single step of Gibbs sampling. Since we are going to use only CD-1 in
+this tutorial, we need to implement just one step of Gibbs sampling. The
+following function does this by:
+
+* inferring the activation probabilities of the hidden units, given the
+  state `v0_sample` of the visibles (Eq. :eq:`rbm_propup`)
+* sampling the hidden units
+* inferring the activation probabilities of the visible units, given the
+  state `h0_sample` of the hidden units (Eq. :eq:`rbm_propdown`)
+* sampling the visible units
+
+The code is as follows:

 .. code-block:: python

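In plain NumPy, the four steps listed above would look roughly as follows (a sketch under the same naming assumptions as the earlier sketch, not the symbolic graph the tutorial actually builds):

.. code-block:: python

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def gibbs_step(v0_sample, W, vbias, hbias, rng):
        """One v -> h -> v' step of block Gibbs sampling for a binary RBM."""
        h0_mean = sigmoid(np.dot(v0_sample, W) + hbias)       # Eq. :eq:`rbm_propup`
        h0_sample = (rng.uniform(size=h0_mean.shape) < h0_mean) * 1.0
        v1_mean = sigmoid(np.dot(h0_sample, W.T) + vbias)     # Eq. :eq:`rbm_propdown`
        v1_sample = (rng.uniform(size=v1_mean.shape) < v1_mean) * 1.0
        return h0_sample, v1_mean, v1_sample
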
@@ -339,8 +380,8 @@ sample the hidden, and afterwards sampling the visible.
         return [v1_mean, v1_sample]


-We also need to add a ``cd`` method to the class that gives as the CD-1
-updates :
+We then add a ``cd`` method, whose purpose is to generate the symbolic
+gradients for the CD-1 and PCD-1 updates.

 .. code-block:: python

@@ -372,13 +413,17 @@ updates :
            chain_start = persistent


-Note that ``cd`` takes as argument a variable called persistent. This
-should point to where we left off the Gibbs chain in the previous call if
-we are to use PCD instead of CD, otherwise we initialize the input of
-the chain with our last observarion. Given that we know the starting
-point of the chain, we can go ahead and compute the values of the
-visibles and hiddens in the negative and positive phase, values needed
-to compute the gradient of the parameters
+Note that ``cd`` takes as argument a variable called ``persistent``. While this
+may be confusing to the reader (since CD is by definition **not** persistent),
+this little trick allows us to use the same code to implement both CD and PCD.
+To use PCD, ``persistent`` should refer to a shared variable which contains the
+state of the Gibbs chain from the previous iteration.
+
+If ``persistent`` is ``None``, we initialize the input with a standard
+``dmatrix``, which will eventually map to a training example. Given that we
+know the starting point of the chain, we can then compute the values
+of the visible and hidden units in both the positive and negative phases.
+These are required to compute the gradient of Eq. :eq:`rbm_grad`.

 .. code-block:: python

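For readers who prefer a concrete, non-symbolic picture, the following NumPy sketch performs one CD-1 (or PCD-1, when a stored chain state is passed in) parameter update using the statistics that appear in Eq. :eq:`rbm_grad`. It reuses ``propup`` and ``gibbs_step`` from the sketches above and is not the code the tutorial compiles.

.. code-block:: python

    import numpy as np

    def cd1_update(v0, W, vbias, hbias, rng, lr=0.1, persistent=None):
        """One CD-1 / PCD-1 step; `persistent` holds the chain state for PCD."""
        chain_start = v0 if persistent is None else persistent
        _, v1_mean, v1_sample = gibbs_step(chain_start, W, vbias, hbias, rng)

        ph0 = propup(v0, W, hbias)           # positive-phase hidden probabilities
        ph1 = propup(v1_sample, W, hbias)    # negative-phase hidden probabilities

        n = v0.shape[0]
        # gradient ascent on the log-likelihood: data statistics minus chain statistics
        W += lr * (np.dot(v0.T, ph0) - np.dot(v1_sample.T, ph1)) / n
        vbias += lr * (v0 - v1_sample).mean(axis=0)
        hbias += lr * (ph0 - ph1).mean(axis=0)
        return v1_sample                     # feed back in as `persistent` for PCD
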
@@ -407,11 +452,12 @@ to compute the gradient of the parameters

         gparams = [g_W.T, g_hbias, g_vbias]

-Finally, we compute the reconstruction cross-entropy cost, which can be
-seen as a proxy to the function minimized by CD. While this value is
-not used anywhere, it is a good indicative that our network is learning.
-We also construct the update dictionary, that in case of PCD,
-should update the shared variable pointing the end of the chain.
+Finally, we compute the reconstruction cross-entropy cost. This can be
+seen as a proxy to the function minimized by CD (see [BengioDelalleau09]_).
+While this value is not used anywhere, it is a good indication that the RBM is
+learning. We also construct the updates dictionary containing the parameter
+updates. In the case of PCD, these should also update the shared variable
+containing the state of the Gibbs chain.

 .. code-block:: python

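As a monitoring aid, the reconstruction cross-entropy itself is simple to write down; a minimal sketch (again reusing the ``propup``/``propdown`` helpers assumed above) is:

.. code-block:: python

    import numpy as np

    def reconstruction_cross_entropy(v0, W, vbias, hbias, eps=1e-8):
        """Mean cross-entropy between the input and its mean-field reconstruction."""
        h_mean = propup(v0, W, hbias)
        v_recon = propdown(h_mean, W, vbias)
        return -np.mean(np.sum(v0 * np.log(v_recon + eps) +
                               (1.0 - v0) * np.log(1.0 - v_recon + eps), axis=1))
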
@@ -430,159 +476,20 @@ should update the shared variable pointing the end of the chain.

         return (cross_entropy, updates)

+We now have all the necessary ingredients to start training our network.

-Given that we have the update rule of the network, training is easy. But
-before going over the training loop we need some more utility
-functions. RBMs are generative models, and once trained we want to be
-able to sample them, plot the samples and look at them. We also want to
-be able to plot the weights to see what it learned. This technique of
-visualizing the weights can be quite insightful but bare in mind that it
-does not provide the entire story, since we neglect the biases, and we
-scale the weights such that we convert them to values between 0 and 1.
-
-To plot a sample, what we need to do is to take the visible units, which
-are a flattened image (there is no 2D structure to the visible units,
-just a 1D string of nodes) and reshape it into a 2D image. The order in
-which the points from the 1D array go into the 2D image is given by the
-order in which the inital MNIST images where converted into a 1D array.
-Lucky for us this is just a call of the ``numpy.reshape`` function.
-
-Plotting the weights is a bit more tricky. We have ``n_hidden`` hidden
-units, each of them corresponding to a column of the weight matrix. A
-column has the same shape as the visible, where the weight corresponding
-to the connection with visible unit `j` is at position `j`. Therefore,
-if we reshape every such column, using ``numpy.reshape``, we get a
-filter image that tells us how this hidden unit is influenced by
-the input image.
-
-We need a utility function that takes a minibatch, or the weight matrix,
-and converts each row ( for the weight matrix we do a transpose ) into a
-2D image and then tile this images together. Once we converted the
-minibatch or the weights in this image of tiles, we can use PIL to plot
-and save. `PIL <http://www.pythonware.com/products/pil/>`_ is a standard
-python libarary to deal with images.
-
-Tiling minibatches together is done for us by
-``tile_raster_image`` function which we provide here.
-
-.. code-block:: python
-
-
-def scale_to_unit_interval(ndar,eps=1e-8):
-    """ Scales all values in the ndarray ndar to be between 0 and 1 """
-    ndar = ndar.copy()
-    ndar -= ndar.min()
-    ndar *= 1.0 / (ndar.max()+eps)
-    return ndar
-
-
-def tile_raster_images(X, img_shape, tile_shape,tile_spacing = (0,0),
-                       scale_rows_to_unit_interval = True, output_pixel_vals = True):
-    """
-    Transform an array with one flattened image per row, into an array in
-    which images are reshaped and layed out like tiles on a floor.
-
-    This function is useful for visualizing datasets whose rows are images,
-    and also columns of matrices for transforming those rows
-    (such as the first layer of a neural net).
-
-    :type X: a 2-D ndarray or a tuple of 4 channels, elements of which can
-             be 2-D ndarrays or None;
-    :param X: a 2-D array in which every row is a flattened image.
-
-    :type img_shape: tuple; (height, width)
-    :param img_shape: the original shape of each image
-
-    :type tile_shape: tuple; (rows, cols)
-    :param tile_shape: the number of images to tile (rows, cols)
-
-    :param output_pixel_vals: if output should be pixel values (i.e. int8
-                              values) or floats
-
-    :param scale_rows_to_unit_interval: if the values need to be scaled before
-                                        being plotted to [0,1] or not
-
-
-    :returns: array suitable for viewing as an image.
-              (See:`PIL.Image.fromarray`.)
-    :rtype: a 2-d array with same dtype as X.
-
-    """
-
-    assert len(img_shape) == 2
-    assert len(tile_shape) == 2
-    assert len(tile_spacing) == 2
-
-    # The expression below can be re-written in a more C style as
-    # follows :
-    #
-    # out_shape = [0,0]
-    # out_shape[0] = (img_shape[0]+tile_spacing[0])*tile_shape[0] -
-    #                tile_spacing[0]
-    # out_shape[1] = (img_shape[1]+tile_spacing[1])*tile_shape[1] -
-    #                tile_spacing[1]
-    out_shape = [(ishp + tsp) * tshp - tsp for ishp, tshp, tsp
-                 in zip(img_shape, tile_shape, tile_spacing)]
-
-    if isinstance(X, tuple):
-        assert len(X) == 4
-        # Create an output numpy ndarray to store the image
-        if output_pixel_vals:
-            out_array = numpy.zeros((out_shape[0], out_shape[1], 4), dtype='uint8')
-        else:
-            out_array = numpy.zeros((out_shape[0], out_shape[1], 4), dtype=X.dtype)
-
-        #colors default to 0, alpha defaults to 1 (opaque)
-        if output_pixel_vals:
-            channel_defaults = [0,0,0,255]
-        else:
-            channel_defaults = [0.,0.,0.,1.]
-
-        for i in xrange(4):
-            if X[i] is None:
-                # if channel is None, fill it with zeros of the correct
-                # dtype
-                out_array[:,:,i] = numpy.zeros(out_shape,
-                      dtype='uint8' if output_pixel_vals else out_array.dtype
-                      )+channel_defaults[i]
-            else:
-                # use a recurrent call to compute the channel and store it
-                # in the output
-                out_array[:,:,i] = tile_raster_images(X[i], img_shape, tile_shape, tile_spacing, scale_rows_to_unit_interval, output_pixel_vals)
-        return out_array
-
-    else:
-        # if we are dealing with only one channel
-        H, W = img_shape
-        Hs, Ws = tile_spacing
-
-        # generate a matrix to store the output
-        out_array = numpy.zeros(out_shape, dtype='uint8' if output_pixel_vals else X.dtype)
-
-
-        for tile_row in xrange(tile_shape[0]):
-            for tile_col in xrange(tile_shape[1]):
-                if tile_row * tile_shape[1] + tile_col < X.shape[0]:
-                    if scale_rows_to_unit_interval:
-                        # if we should scale values to be between 0 and 1
-                        # do this by calling the `scale_to_unit_interval`
-                        # function
-                        this_img = scale_to_unit_interval(X[tile_row * tile_shape[1] + tile_col].reshape(img_shape))
-                    else:
-                        this_img = X[tile_row * tile_shape[1] + tile_col].reshape(img_shape)
-                    # add the slice to the corresponding position in the
-                    # output array
-                    out_array[
-                        tile_row * (H+Hs):tile_row*(H+Hs)+H,
-                        tile_col * (W+Ws):tile_col*(W+Ws)+W
-                        ] \
-                        = this_img * (255 if output_pixel_vals else 1)
-        return out_array
-
-
-Having this utility function, we can start training, saving the filters
-(weight plots) after each training epoch.
+Before going over the training loop, however, the reader should become familiar
+with the function ``tile_raster_images`` (see :ref:`how-to-plot`). Since
+RBMs are generative models, we are interested in sampling from them and
+plotting/visualizing these samples. We also want to visualize the filters
+(weights) learnt by the RBM, to gain insight into what the RBM is actually
+doing. Bear in mind, however, that this does not provide the entire story,
+since we neglect the biases and plot the weights up to a multiplicative
+constant (weights are converted to values between 0 and 1).

+Having these utility functions, we can start training the RBM and plot/save
+the filters after each training epoch. We train the RBM using PCD, as it has
+been shown to lead to a better generative model ([Tieleman]_).

 .. code-block:: python

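The essence of the filter plots is just reshaping each weight column into an image. A small sketch of the idea (assuming 28x28 MNIST inputs and the PIL library; this is not the ``tile_raster_images`` helper that the utilities page now provides) is:

.. code-block:: python

    import numpy as np
    from PIL import Image

    def save_filters(W, img_shape=(28, 28), n_show=100, path='filters.png'):
        """Tile the first `n_show` weight columns as grayscale filter images."""
        side = int(np.ceil(np.sqrt(n_show)))
        tiles = np.zeros((side * img_shape[0], side * img_shape[1]))
        for k in range(min(n_show, W.shape[1])):
            filt = W[:, k].reshape(img_shape)
            # rescale each filter to [0, 1] individually; biases are ignored
            filt = (filt - filt.min()) / (filt.max() - filt.min() + 1e-8)
            r, c = divmod(k, side)
            tiles[r * img_shape[0]:(r + 1) * img_shape[0],
                  c * img_shape[1]:(c + 1) * img_shape[1]] = filt
        Image.fromarray((tiles * 255).astype('uint8')).save(path)
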
@@ -609,11 +516,11 @@ Having this utility function, we can start training, saving the filters



-Now for sampling we need to use the ``gibbs_1`` and PCD to have better
-results. For this we first pick several samples (from the testing sequence,
-though we could as well pick it from the training set) to initialize
-several chains that we would sample.
-
+Once the RBM is trained, we can then use the ``gibbs_1`` function to implement
+the Gibbs chain required for sampling. We initialize the Gibbs chain starting
+from test examples (although we could just as well pick them from the training
+set) in order to speed up convergence and avoid problems with random
+initialization.

 .. code-block:: python

@@ -630,12 +537,11 @@ several chains that we would sample.
 # initialize 20 persistent chains in parallel
 persistent_chain = theano.shared( test_set_x.value[sample:sample+20])

-
-
 Next we create the 20 persistent chains in paralel to get our
-samples. To do so, we compile a theano function that takes as one
-step and aplly this function iteratively for a large number of steps,
-plotting the samples drawn at every 1000 step.
+samples. To do so, we compile a theano function which performs one Gibbs step
+and updates the state of the persistent chain with the new visible sample. We
+apply this function iteratively for a large number of steps, plotting the
+samples every 1000 steps.

 .. code-block:: python

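As a rough non-Theano illustration of that sampling loop (reusing ``gibbs_step`` from the earlier sketch; the sizes and the random chain initialization below are assumptions made only to keep the snippet self-contained, whereas the tutorial initializes the chains from test examples):

.. code-block:: python

    import numpy as np

    rng = np.random.RandomState(123)
    n_visible, n_hidden = 784, 500                    # MNIST-sized RBM (assumed)
    W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    vbias, hbias = np.zeros(n_visible), np.zeros(n_hidden)

    # 20 chains run in parallel, here started from random binary states
    chain = (rng.uniform(size=(20, n_visible)) < 0.5) * 1.0

    for step in range(10000):
        _, vis_mean, chain = gibbs_step(chain, W, vbias, hbias, rng)
        if (step + 1) % 1000 == 0:
            # plot/save `vis_mean` here, e.g. with tile_raster_images
            pass
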
@@ -668,17 +574,11 @@ plotting the samples drawn at every 1000 step.
 image.save('sample_%i_step_%i.png'%(idx,idx*jdx))


-
-
-
 Results
 +++++++

-
-Training took 20.862 minutes for 15 epochs with learning rate 0.1 .
-
-
-Picture below shows the filters after 15 epochs :
+Training took 20.862 minutes for 15 epochs with a learning rate of 0.1.
+The picture below shows the filters after 15 epochs:

 .. image:: images/filters_at_epoch_14.png
     :align: center
