@@ -97,6 +97,8 @@ negative phase gradient are referred to as **negative particles**, which are
9797denoted as :math:`\mathcal{N}`. The gradient can then be written as:
9898
9999.. math::
100+ :label: bm_grad
101+
100102 \frac{\partial log p(x)}{\partial \theta}
101103 &\approx
102104 - \frac{\partial \mathcal{F}(x)}{\partial \theta} +
@@ -136,6 +138,7 @@ respectively.
136138This translates directly to the following free energy formula:
137139
138140.. math::
141+
139142 \mathcal{F}(x)= - b'x - \sum_i log \sum_{h_i} e^{h_i (c_i + W_i x)}.
140143
141144Because of the specific structure of RBMs, visible and hidden units are
@@ -153,14 +156,42 @@ In the commonly studied case of using binary units (where :math:`h_i \in
153156version of the usual neuron activation function:
154157
155158.. math::
159+ :label: rbm_propup
160+
156161   P(h_i=1|x) = sigm(c_i + W_i x)
162+
163+ .. math::
164+ :label: rbm_propdown
165+
157166 P(x_j=1|h) = sigm(b_j + W'_j h)
158167
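To make the notation concrete, here is a minimal NumPy sketch of these two
conditionals. It is purely illustrative (the tutorial itself implements them
symbolically in Theano), and it assumes ``W`` has shape
``(n_visible, n_hidden)``, with ``b`` the visible bias and ``c`` the hidden
bias.

.. code-block:: python

    import numpy

    def sigm(x):
        """Elementwise logistic sigmoid."""
        return 1.0 / (1.0 + numpy.exp(-x))

    def p_h_given_v(v, W, c):
        # P(h_i = 1 | v) = sigm(c_i + W_i v), for every hidden unit i
        return sigm(c + numpy.dot(v, W))

    def p_v_given_h(h, W, b):
        # P(v_j = 1 | h) = sigm(b_j + W'_j h), for every visible unit j
        return sigm(b + numpy.dot(h, W.T))
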
159168The free energy of an RBM with binary units further simplifies to:
160169
161170.. math::
171+ :label: rbm_free_energy
172+
162173   \mathcal{F}(x)= - b'x - \sum_i log(1 + e^{c_i + W_i x}).
163174
175+ **Update Equations with Binary Units**
176+
177+ Combining Eqs. :eq:`bm_grad` and :eq:`rbm_free_energy`, we obtain the
178+ following log-likelihood gradients for an RBM with binary units:
179+
180+ .. math::
181+ :label: rbm_grad
182+
183+   - \frac{\partial log p(v)}{\partial W_{ij}} &=
184+       - x^{(i)}_j \cdot sigm(W_i \cdot x^{(i)} + c_i)
185+       + E_v[p(h_i|v) \cdot v_j] \\
186+   - \frac{\partial log p(v)}{\partial c_i} &=
187+       - sigm(W_i \cdot x^{(i)} + c_i) + E_v[p(h_i|v)] \\
188+   - \frac{\partial log p(v)}{\partial b_j} &=
189+       - x^{(i)}_j + E_v[p(v_j|h)]
190+
191+ .. note::
192+     We will shortly update the tutorial so that the gradients are computed
193+     directly (using ``T.grad``) from the free energy formula.
194+
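In the meantime, the following hypothetical, standalone sketch (not the
``RBM`` class of this tutorial; all names are illustrative) shows how the free
energy of Eq. :eq:`rbm_free_energy` can be written with Theano, and how
``T.grad`` would then provide the derivatives that enter Eq. :eq:`rbm_grad`.

.. code-block:: python

    import theano
    import theano.tensor as T

    v = T.matrix('v')           # a minibatch of visible configurations
    W = T.matrix('W')           # weights, assumed shape (n_visible, n_hidden)
    hbias = T.vector('hbias')   # hidden biases c
    vbias = T.vector('vbias')   # visible biases b

    # F(v) = -b'v - sum_i log(1 + exp(c_i + W_i v)), one value per example
    wx_b = T.dot(v, W) + hbias
    free_energy = -T.dot(v, vbias) - T.sum(T.log(1 + T.exp(wx_b)), axis=1)

    # derivatives of the mean free energy; the positive and negative phase
    # terms of Eq. rbm_grad are both of this form
    cost = T.mean(free_energy)
    g_W, g_hbias, g_vbias = T.grad(cost, [W, hbias, vbias])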
164195
165196Sampling in an RBM
166197++++++++++++++++++
@@ -232,11 +263,11 @@ to changes in the model.
232263Implementation
233264++++++++++++++
234265
235- We construct a ``RBM`` class. The constructor has to initialize all
236- parameters of the network, or to get them as arguments. The second
237- option is usefull when a RBM is used in a deep network,
238- case in which the weight matrix and the hidden layer bias has to be
239- shared with the corresponding sigmoidal layer of a mlp network.
266+ We construct an ``RBM`` class. The parameters of the network can either be
267+ initialized by the constructor or passed as arguments. The latter option is
268+ useful when an RBM is used as the building block of a deep network, in which
269+ case the weight matrix and the hidden layer bias are shared with the
270+ corresponding sigmoidal layer of an MLP.
240271
241272.. code-block:: python
242273
@@ -316,10 +347,20 @@ shared with the corresponding sigmoidal layer of a mlp network.
316347 self.params = [self.W, self.hbias, self.vbias]
317348 self.batch_size = self.input.shape[0]
318349
319- Next step is to add a Gibbs sampling function to our class. Since we are
320- going to use only CD-1 in this tutorial we need to implement just one
321- step of Gibbs sampling. The following function does this, by first
322- sample the hidden, and afterwards sampling the visible.
350+
351+ The next step is to write a function that builds the symbolic graph for a
352+ single step of Gibbs sampling. Since we are going to use only CD-1 in this
353+ tutorial, we need to implement just one step of Gibbs sampling. The
354+ following function does this by:
355+
356+ * inferring the activation probabilities of the hidden units, given the
357+   state ``v0_sample`` of the visible units (Eq. :eq:`rbm_propup`)
358+ * sampling the hidden units
359+ * inferring the activation probabilities of the visible units, given the
360+   state ``h0_sample`` of the hidden units (Eq. :eq:`rbm_propdown`)
361+ * sampling the visible units
362+
363+ The code is as follows:
323364
324365.. code-block:: python
325366
@@ -339,8 +380,8 @@ sample the hidden, and afterwards sampling the visible.
339380 return [v1_mean, v1_sample]
340381
341382
342- We also need to add a ``cd`` method to the class that gives as the CD-1
343- updates :
383+ We then add a ``cd`` method, whose purpose is to generate the symbolic
384+ gradients for CD-1 and PCD-1 updates.
344385
345386.. code-block:: python
346387
@@ -372,13 +413,17 @@ updates :
372413 chain_start = persistent
373414
374415
375- Note that ``cd`` takes as argument a variable called persistent. This
376- should point to where we left off the Gibbs chain in the previous call if
377- we are to use PCD instead of CD, otherwise we initialize the input of
378- the chain with our last observarion. Given that we know the starting
379- point of the chain, we can go ahead and compute the values of the
380- visibles and hiddens in the negative and positive phase, values needed
381- to compute the gradient of the parameters
416+ Note that ``cd`` takes as argument a variable called ``persistent``. While this
417+ may be confusing to the reader (since CD is by definition **not** persistent),
418+ this little trick allows us to use the same code to implement both CD and PCD.
419+ To use PCD, ``persistent`` should refer to a shared variable which contains the
420+ state of the Gibbs chain from the previous iteration.
421+
422+ If ``persistent`` is ``None``, we initialize the chain with the RBM's symbolic
423+ input (a ``dmatrix``), which will eventually map to a training example. Given
424+ that we know the starting point of the chain, we can then compute the values
425+ of the visible and hidden units in both the positive and negative phases.
426+ These are required to compute the gradients of Eq. :eq:`rbm_grad`.
382427
383428.. code-block:: python
384429
@@ -407,11 +452,12 @@ to compute the gradient of the parameters
407452
408453 gparams = [g_W.T, g_hbias, g_vbias]
409454
410- Finally, we compute the reconstruction cross-entropy cost, which can be
411- seen as a proxy to the function minimized by CD. While this value is
412- not used anywhere, it is a good indicative that our network is learning.
413- We also construct the update dictionary, that in case of PCD,
414- should update the shared variable pointing the end of the chain.
455+ Finally, we compute the reconstruction cross-entropy cost. This can be
456+ seen as a proxy for the function minimized by CD (see [BengioDelalleau09]_).
457+ While this value is not used for training, it is a good indicator that the
458+ RBM is learning. We also construct the updates dictionary containing the
459+ parameter updates. In the case of PCD, these should also update the shared
460+ variable containing the state of the Gibbs chain.
415461
416462.. code-block:: python
417463
@@ -430,159 +476,20 @@ should update the shared variable pointing the end of the chain.
430476
431477 return (cross_entropy, updates)
432478
479+ We now have all the necessary ingredients to start training our network.
433480
434- Given that we have the update rule of the network, training is easy. But
435- before going over the training loop we need some more utility
436- functions. RBMs are generative models, and once trained we want to be
437- able to sample them, plot the samples and look at them. We also want to
438- be able to plot the weights to see what it learned. This technique of
439- visualizing the weights can be quite insightful but bare in mind that it
440- does not provide the entire story, since we neglect the biases, and we
441- scale the weights such that we convert them to values between 0 and 1.
442-
443- To plot a sample, what we need to do is to take the visible units, which
444- are a flattened image (there is no 2D structure to the visible units,
445- just a 1D string of nodes) and reshape it into a 2D image. The order in
446- which the points from the 1D array go into the 2D image is given by the
447- order in which the inital MNIST images where converted into a 1D array.
448- Lucky for us this is just a call of the ``numpy.reshape`` function.
449-
450- Plotting the weights is a bit more tricky. We have ``n_hidden`` hidden
451- units, each of them corresponding to a column of the weight matrix. A
452- column has the same shape as the visible, where the weight corresponding
453- to the connection with visible unit `j` is at position `j`. Therefore,
454- if we reshape every such column, using ``numpy.reshape``, we get a
455- filter image that tells us how this hidden unit is influenced by
456- the input image.
457-
458- We need a utility function that takes a minibatch, or the weight matrix,
459- and converts each row ( for the weight matrix we do a transpose ) into a
460- 2D image and then tile this images together. Once we converted the
461- minibatch or the weights in this image of tiles, we can use PIL to plot
462- and save. `PIL <http://www.pythonware.com/products/pil/>`_ is a standard
463- python libarary to deal with images.
464-
465- Tiling minibatches together is done for us by
466- ``tile_raster_image`` function which we provide here.
467-
468- .. code-block:: python
469-
470-
471- def scale_to_unit_interval(ndar,eps=1e-8):
472- """ Scales all values in the ndarray ndar to be between 0 and 1 """
473- ndar = ndar.copy()
474- ndar -= ndar.min()
475- ndar *= 1.0 / (ndar.max()+eps)
476- return ndar
477-
478-
479- def tile_raster_images(X, img_shape, tile_shape,tile_spacing = (0,0),
480- scale_rows_to_unit_interval = True, output_pixel_vals = True):
481- """
482- Transform an array with one flattened image per row, into an array in
483- which images are reshaped and layed out like tiles on a floor.
484-
485- This function is useful for visualizing datasets whose rows are images,
486- and also columns of matrices for transforming those rows
487- (such as the first layer of a neural net).
488-
489- :type X: a 2-D ndarray or a tuple of 4 channels, elements of which can
490- be 2-D ndarrays or None;
491- :param X: a 2-D array in which every row is a flattened image.
492-
493- :type img_shape: tuple; (height, width)
494- :param img_shape: the original shape of each image
495-
496- :type tile_shape: tuple; (rows, cols)
497- :param tile_shape: the number of images to tile (rows, cols)
498-
499- :param output_pixel_vals: if output should be pixel values (i.e. int8
500- values) or floats
501-
502- :param scale_rows_to_unit_interval: if the values need to be scaled before
503- being plotted to [0,1] or not
504-
505-
506- :returns: array suitable for viewing as an image.
507- (See:`PIL.Image.fromarray`.)
508- :rtype: a 2-d array with same dtype as X.
509-
510- """
511-
512- assert len(img_shape) == 2
513- assert len(tile_shape) == 2
514- assert len(tile_spacing) == 2
515-
516- # The expression below can be re-written in a more C style as
517- # follows :
518- #
519- # out_shape = [0,0]
520- # out_shape[0] = (img_shape[0]+tile_spacing[0])*tile_shape[0] -
521- # tile_spacing[0]
522- # out_shape[1] = (img_shape[1]+tile_spacing[1])*tile_shape[1] -
523- # tile_spacing[1]
524- out_shape = [(ishp + tsp) * tshp - tsp for ishp, tshp, tsp
525- in zip(img_shape, tile_shape, tile_spacing)]
526-
527- if isinstance(X, tuple):
528- assert len(X) == 4
529- # Create an output numpy ndarray to store the image
530- if output_pixel_vals:
531- out_array = numpy.zeros((out_shape[0], out_shape[1], 4), dtype='uint8')
532- else:
533- out_array = numpy.zeros((out_shape[0], out_shape[1], 4), dtype=X.dtype)
534-
535- #colors default to 0, alpha defaults to 1 (opaque)
536- if output_pixel_vals:
537- channel_defaults = [0,0,0,255]
538- else:
539- channel_defaults = [0.,0.,0.,1.]
540-
541- for i in xrange(4):
542- if X[i] is None:
543- # if channel is None, fill it with zeros of the correct
544- # dtype
545- out_array[:,:,i] = numpy.zeros(out_shape,
546- dtype='uint8' if output_pixel_vals else out_array.dtype
547- )+channel_defaults[i]
548- else:
549- # use a recurrent call to compute the channel and store it
550- # in the output
551- out_array[:,:,i] = tile_raster_images(X[i], img_shape, tile_shape, tile_spacing, scale_rows_to_unit_interval, output_pixel_vals)
552- return out_array
553-
554- else:
555- # if we are dealing with only one channel
556- H, W = img_shape
557- Hs, Ws = tile_spacing
558-
559- # generate a matrix to store the output
560- out_array = numpy.zeros(out_shape, dtype='uint8' if output_pixel_vals else X.dtype)
561-
562-
563- for tile_row in xrange(tile_shape[0]):
564- for tile_col in xrange(tile_shape[1]):
565- if tile_row * tile_shape[1] + tile_col < X.shape[0]:
566- if scale_rows_to_unit_interval:
567- # if we should scale values to be between 0 and 1
568- # do this by calling the `scale_to_unit_interval`
569- # function
570- this_img = scale_to_unit_interval(X[tile_row * tile_shape[1] + tile_col].reshape(img_shape))
571- else:
572- this_img = X[tile_row * tile_shape[1] + tile_col].reshape(img_shape)
573- # add the slice to the corresponding position in the
574- # output array
575- out_array[
576- tile_row * (H+Hs):tile_row*(H+Hs)+H,
577- tile_col * (W+Ws):tile_col*(W+Ws)+W
578- ] \
579- = this_img * (255 if output_pixel_vals else 1)
580- return out_array
581-
582-
583- Having this utility function, we can start training, saving the filters
584- (weight plots) after each training epoch.
481+ Before going over the training loop, however, the reader should become
482+ familiar with the function ``tile_raster_images`` (see :ref:`how-to-plot`). Since
483+ RBMs are generative models, we are interested in sampling from them and
484+ plotting/visualizing these samples. We also want to visualize the filters
485+ (weights) learnt by the RBM, to gain insight into what the RBM is actually
486+ doing. Bear in mind, however, that this does not provide the entire story,
487+ since we neglect the biases and plot the weights up to a multiplicative
488+ constant (weights are converted to values between 0 and 1).
585489
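As a rough illustration (a sketch only; the training loop below saves the
filters for real), the weight matrix can be rendered with
``tile_raster_images`` and PIL as follows. Here ``rbm.W`` is assumed to be the
trained RBM's shared weight matrix of shape ``(n_visible, n_hidden)``, the
28x28 image shape assumes MNIST digits, and the 10x10 tile grid assumes at
least 100 hidden units.

.. code-block:: python

    import PIL.Image

    # one row of rbm.W.value.T per hidden unit, i.e. one filter per row
    image = PIL.Image.fromarray(tile_raster_images(
        X=rbm.W.value.T,
        img_shape=(28, 28),      # each MNIST digit is 28x28 pixels
        tile_shape=(10, 10),     # lay 100 filters out on a 10x10 grid
        tile_spacing=(1, 1)))
    image.save('filters.png')
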
490+ Having these utility functions, we can start training the RBM and plot/save
491+ the filters after each training epoch. We train the RBM using PCD, as it has
492+ been shown to lead to a better generative model ([Tieleman]_).
586493
587494.. code-block:: python
588495
@@ -609,11 +516,11 @@ Having this utility function, we can start training, saving the filters
609516
610517
611518
612- Now for sampling we need to use the ``gibbs_1`` and PCD to have better
613- results. For this we first pick several samples (from the testing sequence,
614- though we could as well pick it from the training set) to initialize
615- several chains that we would sample.
616-
519+ Once the RBM is trained, we can use the ``gibbs_1`` function to implement
520+ the Gibbs chain required for sampling. We initialize the Gibbs chain starting
521+ from test examples (although we could just as well pick them from the training
522+ set) in order to speed up convergence and avoid problems with random
523+ initialization.
617524
618525.. code-block:: python
619526
@@ -630,12 +537,11 @@ several chains that we would sample.
630537 # initialize 20 persistent chains in parallel
631538 persistent_chain = theano.shared( test_set_x.value[sample:sample+20])
632539
633-
634-
635540Next we create the 20 persistent chains in parallel to get our
636- samples. To do so, we compile a theano function that takes as one
637- step and aplly this function iteratively for a large number of steps,
638- plotting the samples drawn at every 1000 step.
541+ samples. To do so, we compile a theano function which performs one Gibbs step
542+ and updates the state of the persistent chain with the new visible sample. We
543+ apply this function iteratively for a large number of steps, plotting the
544+ samples every 1000 steps.
639545
640546.. code-block:: python
641547
@@ -668,17 +574,11 @@ plotting the samples drawn at every 1000 step.
668574 image.save('sample_%i_step_%i.png'%(idx,idx*jdx))
669575
670576
671-
672-
673-
674577Results
675578+++++++
676579
677-
678- Training took 20.862 minutes for 15 epochs with learning rate 0.1 .
679-
680-
681- Picture below shows the filters after 15 epochs :
580+ Training took 20.862 minutes for 15 epochs with a learning rate of 0.1.
581+ The picture below shows the filters after 15 epochs:
682582
683583.. image:: images/filters_at_epoch_14.png
684584 :align: center