
Commit 1e482f8

* added explicit derivation of gradient updates to theory section
* cosmetic changes to writing
* moved plotting function to "utilities" page (found it too cluttered)
1 parent 05da1e5 commit 1e482f8

5 files changed

Lines changed: 276 additions & 195 deletions


doc/contents.txt

Lines changed: 1 addition & 2 deletions
@@ -16,6 +16,5 @@ Contents
     lenet
     SdA
     rbm
-    dbn
-    dae
+    utilities
     references


doc/intro.txt

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ The purely supervised learning algorithms are meant to be read in order:
 The unsupervised and semi-supervised learning algorithms are here (the auto-encoders can be read independently of the RBM/DBN thread):

 * :ref:`Auto Encoders, Stacked Auto Encoders, and Stacked Denoising Auto-Encoders <SdA>` - easy steps into unsupervised pre-training for deep nets
-* :ref:`Restricted Boltzmann Machines` - single layer generative RBM model
+* :ref:`Restricted Boltzmann Machines <rbm>` - single layer generative RBM model
 * :ref:`Deep Belief Networks` - unsupervised generative pre-training of stacked RBMs followed by supervised fine-tuning

 .. _Theano: http://deeplearning.net/software/theano

doc/rbm.txt

Lines changed: 92 additions & 192 deletions
@@ -97,6 +97,8 @@ negative phase gradient are referred to as **negative particles**, which are
 denoted as :math:`\mathcal{N}`. The gradient can then be written as:

 .. math::
+    :label: bm_grad
+
     \frac{\partial log p(x)}{\partial \theta}
         &\approx
         - \frac{\partial \mathcal{F}(x)}{\partial \theta} +

@@ -136,6 +138,7 @@ respectively.
 This translates directly to the following free energy formula:

 .. math::
+
     \mathcal{F}(x)= - b'x - \sum_i log \sum_{h_i} e^{h_i (c_i + W_i x)}.

 Because of the specific structure of RBMs, visible and hidden units are

@@ -153,14 +156,42 @@ In the commonly studied case of using binary units (where :math:`h_i \in
 version of the usual neuron activation function:

 .. math::
+    :label: rbm_propup
+
     P(h_i=1|x) = sigm(c_i + W_i x) \\
+
+.. math::
+    :label: rbm_propdown
+
     P(x_j=1|h) = sigm(b_j + W'_j h)

 The free energy of an RBM with binary units further simplifies to:

 .. math::
+    :label: rbm_free_energy
+
     \mathcal{F}(x)= - b'x - \sum_i log(1 + e^{(c_i + W_i x)}).

+**Update Equations with Binary Units**
+
+Combining Eqs. :eq:`bm_grad` with :eq:`rbm_free_energy`, we obtain the
+following log-likelihood gradients for an RBM with binary units:
+
+.. math::
+    :label: rbm_grad
+
+    - \frac{\partial log p(v)}{\partial W_{ij}} &=
+        - x^{(i)}_j \cdot sigm(W_i \cdot x^{(i)} + c_i)
+        + E_v[p(h_i|v) \cdot v_j] \\
+    - \frac{\partial log p(v)}{\partial c_i} &=
+        - sigm(W_i \cdot x^{(i)} + c_i) + E_v[p(h_i|v)] \\
+    - \frac{\partial log p(v)}{\partial b_j} &=
+        - x^{(i)}_j + E_v[p(v_j|h)]
+
+.. note::
+    We will be updating the tutorial shortly, so that the gradients are
+    computed directly (using ``T.grad``) from the free energy formula.

 Sampling in an RBM
 ++++++++++++++++++

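To make the binary-unit formulas added above concrete, here is a minimal NumPy sketch of the free energy of Eq. :eq:`rbm_free_energy` and the conditionals of Eqs. :eq:`rbm_propup` and :eq:`rbm_propdown`. It is an illustration only (the tutorial itself builds these as Theano expressions), and the names ``W``, ``vbias`` and ``hbias`` are assumed arrays of shape ``(n_visible, n_hidden)``, ``(n_visible,)`` and ``(n_hidden,)``.

.. code-block:: python

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def free_energy(v, W, vbias, hbias):
        """F(v) = -b'v - sum_i log(1 + exp(c_i + W_i v)), cf. Eq. :eq:`rbm_free_energy`."""
        wx_b = np.dot(v, W) + hbias                    # one row of c + Wv per example
        return -np.dot(v, vbias) - np.log1p(np.exp(wx_b)).sum(axis=1)

    def propup(v, W, hbias):
        """P(h_i = 1 | v), cf. Eq. :eq:`rbm_propup`."""
        return sigmoid(np.dot(v, W) + hbias)

    def propdown(h, W, vbias):
        """P(v_j = 1 | h), cf. Eq. :eq:`rbm_propdown`."""
        return sigmoid(np.dot(h, W.T) + vbias)

Having the free energy available in closed form like this is also what makes the ``T.grad`` approach mentioned in the note possible: the quantities in Eq. :eq:`rbm_grad` are differences of derivatives of this expression evaluated on the data and on the negative particles.
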
@@ -232,11 +263,11 @@ to changes in the model.
 Implementation
 ++++++++++++++

-We construct a ``RBM`` class. The constructor has to initialize all
-parameters of the network, or to get them as arguments. The second
-option is usefull when a RBM is used in a deep network,
-case in which the weight matrix and the hidden layer bias has to be
-shared with the corresponding sigmoidal layer of a mlp network.
+We construct an ``RBM`` class. The parameters of the network can either be
+initialized by the constructor or passed in as arguments. The latter option is
+useful when an RBM is used as the building block of a deep network, in which
+case the weight matrix and the hidden layer bias are shared with the
+corresponding sigmoidal layer of an MLP network.

 .. code-block:: python

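The constructor is not reproduced in this diff. As a rough illustration of the parameter-sharing idea only (a sketch with hypothetical names, not the tutorial's Theano-based constructor), it amounts to allocating ``W`` and the biases only when they are not supplied:

.. code-block:: python

    import numpy as np

    class RBMSkeleton(object):
        """Illustrative only; the tutorial's RBM stores Theano shared variables."""

        def __init__(self, n_visible, n_hidden, W=None, hbias=None, vbias=None,
                     rng=None):
            rng = rng if rng is not None else np.random.RandomState(1234)
            if W is None:
                # allocate fresh weights when the RBM stands alone
                W = np.asarray(rng.uniform(-0.1, 0.1, size=(n_visible, n_hidden)))
            if hbias is None:
                hbias = np.zeros(n_hidden)
            if vbias is None:
                vbias = np.zeros(n_visible)
            # when used inside a deep network, W and hbias can simply be the
            # corresponding sigmoid layer's parameters, so both models share them
            self.W, self.hbias, self.vbias = W, hbias, vbias
            self.params = [self.W, self.hbias, self.vbias]
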
@@ -316,10 +347,20 @@ shared with the corresponding sigmoidal layer of a mlp network.
         self.params = [self.W, self.hbias, self.vbias]
         self.batch_size = self.input.shape[0]

-Next step is to add a Gibbs sampling function to our class. Since we are
-going to use only CD-1 in this tutorial we need to implement just one
-step of Gibbs sampling. The following function does this, by first
-sample the hidden, and afterwards sampling the visible.
+
+The next step is to define a function that builds the symbolic graph associated
+with a single step of Gibbs sampling. Since we are going to use only CD-1 in
+this tutorial, we need to implement just one step of Gibbs sampling. The
+following function does this by:
+
+* inferring the activation probabilities of the hidden units, given the
+  state `v0_sample` of the visibles (Eq. :eq:`rbm_propup`)
+* sampling the hidden units
+* inferring the activation probabilities of the visible units, given the
+  state `h0_sample` of the hidden units (Eq. :eq:`rbm_propdown`)
+* sampling the visible units
+
+The code is as follows:

 .. code-block:: python

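In plain NumPy, the four steps listed above would look roughly as follows (a sketch under the same naming assumptions as the earlier sketch, not the symbolic graph the tutorial actually builds):

.. code-block:: python

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def gibbs_step(v0_sample, W, vbias, hbias, rng):
        """One v -> h -> v' step of block Gibbs sampling for a binary RBM."""
        h0_mean = sigmoid(np.dot(v0_sample, W) + hbias)       # Eq. :eq:`rbm_propup`
        h0_sample = (rng.uniform(size=h0_mean.shape) < h0_mean) * 1.0
        v1_mean = sigmoid(np.dot(h0_sample, W.T) + vbias)     # Eq. :eq:`rbm_propdown`
        v1_sample = (rng.uniform(size=v1_mean.shape) < v1_mean) * 1.0
        return h0_sample, v1_mean, v1_sample
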
@@ -339,8 +380,8 @@ sample the hidden, and afterwards sampling the visible.
         return [v1_mean, v1_sample]


-We also need to add a ``cd`` method to the class that gives as the CD-1
-updates :
+We then add a ``cd`` method, whose purpose is to generate the symbolic
+gradients for the CD-1 and PCD-1 updates.

 .. code-block:: python

@@ -372,13 +413,17 @@ updates :
            chain_start = persistent


-Note that ``cd`` takes as argument a variable called persistent. This
-should point to where we left off the Gibbs chain in the previous call if
-we are to use PCD instead of CD, otherwise we initialize the input of
-the chain with our last observarion. Given that we know the starting
-point of the chain, we can go ahead and compute the values of the
-visibles and hiddens in the negative and positive phase, values needed
-to compute the gradient of the parameters
+Note that ``cd`` takes as argument a variable called ``persistent``. While this
+may be confusing to the reader (since CD is by definition **not** persistent),
+this little trick allows us to use the same code to implement both CD and PCD.
+To use PCD, ``persistent`` should refer to a shared variable which contains the
+state of the Gibbs chain from the previous iteration.
+
+If ``persistent`` is ``None``, we initialize the input with a standard
+``dmatrix``, which will eventually map to a training example. Given that we
+know the starting point of the chain, we can then compute the values
+of the visible and hidden units in both the positive and negative phases.
+These are required to compute the gradient of Eq. :eq:`rbm_grad`.

 .. code-block:: python

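For readers who prefer a concrete, non-symbolic picture, the following NumPy sketch performs one CD-1 (or PCD-1, when a stored chain state is passed in) parameter update using the statistics that appear in Eq. :eq:`rbm_grad`. It reuses ``propup`` and ``gibbs_step`` from the sketches above and is not the code the tutorial compiles.

.. code-block:: python

    import numpy as np

    def cd1_update(v0, W, vbias, hbias, rng, lr=0.1, persistent=None):
        """One CD-1 / PCD-1 step; `persistent` holds the chain state for PCD."""
        chain_start = v0 if persistent is None else persistent
        _, v1_mean, v1_sample = gibbs_step(chain_start, W, vbias, hbias, rng)

        ph0 = propup(v0, W, hbias)           # positive-phase hidden probabilities
        ph1 = propup(v1_sample, W, hbias)    # negative-phase hidden probabilities

        n = v0.shape[0]
        # gradient ascent on the log-likelihood: data statistics minus chain statistics
        W += lr * (np.dot(v0.T, ph0) - np.dot(v1_sample.T, ph1)) / n
        vbias += lr * (v0 - v1_sample).mean(axis=0)
        hbias += lr * (ph0 - ph1).mean(axis=0)
        return v1_sample                     # feed back in as `persistent` for PCD
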
@@ -407,11 +452,12 @@ to compute the gradient of the parameters

         gparams = [g_W.T, g_hbias, g_vbias]

-Finally, we compute the reconstruction cross-entropy cost, which can be
-seen as a proxy to the function minimized by CD. While this value is
-not used anywhere, it is a good indicative that our network is learning.
-We also construct the update dictionary, that in case of PCD,
-should update the shared variable pointing the end of the chain.
+Finally, we compute the reconstruction cross-entropy cost. This can be
+seen as a proxy to the function minimized by CD (see [BengioDelalleau09]_).
+While this value is not used anywhere, it is a good indication that the RBM is
+learning. We also construct the updates dictionary containing the parameter
+updates. In the case of PCD, these should also update the shared variable
+containing the state of the Gibbs chain.

 .. code-block:: python

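As a monitoring aid, the reconstruction cross-entropy itself is simple to write down; a minimal sketch (again reusing the ``propup``/``propdown`` helpers assumed above) is:

.. code-block:: python

    import numpy as np

    def reconstruction_cross_entropy(v0, W, vbias, hbias, eps=1e-8):
        """Mean cross-entropy between the input and its mean-field reconstruction."""
        h_mean = propup(v0, W, hbias)
        v_recon = propdown(h_mean, W, vbias)
        return -np.mean(np.sum(v0 * np.log(v_recon + eps) +
                               (1.0 - v0) * np.log(1.0 - v_recon + eps), axis=1))
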
@@ -430,159 +476,20 @@ should update the shared variable pointing the end of the chain.

         return (cross_entropy, updates)

+We now have all the necessary ingredients to start training our network.

-Given that we have the update rule of the network, training is easy. But
-before going over the training loop we need some more utility
-functions. RBMs are generative models, and once trained we want to be
-able to sample them, plot the samples and look at them. We also want to
-be able to plot the weights to see what it learned. This technique of
-visualizing the weights can be quite insightful but bare in mind that it
-does not provide the entire story, since we neglect the biases, and we
-scale the weights such that we convert them to values between 0 and 1.
-
-To plot a sample, what we need to do is to take the visible units, which
-are a flattened image (there is no 2D structure to the visible units,
-just a 1D string of nodes) and reshape it into a 2D image. The order in
-which the points from the 1D array go into the 2D image is given by the
-order in which the inital MNIST images where converted into a 1D array.
-Lucky for us this is just a call of the ``numpy.reshape`` function.
-
-Plotting the weights is a bit more tricky. We have ``n_hidden`` hidden
-units, each of them corresponding to a column of the weight matrix. A
-column has the same shape as the visible, where the weight corresponding
-to the connection with visible unit `j` is at position `j`. Therefore,
-if we reshape every such column, using ``numpy.reshape``, we get a
-filter image that tells us how this hidden unit is influenced by
-the input image.
-
-We need a utility function that takes a minibatch, or the weight matrix,
-and converts each row ( for the weight matrix we do a transpose ) into a
-2D image and then tile this images together. Once we converted the
-minibatch or the weights in this image of tiles, we can use PIL to plot
-and save. `PIL <http://www.pythonware.com/products/pil/>`_ is a standard
-python libarary to deal with images.
-
-Tiling minibatches together is done for us by
-``tile_raster_image`` function which we provide here.
-
-.. code-block:: python
-
-
-def scale_to_unit_interval(ndar,eps=1e-8):
-    """ Scales all values in the ndarray ndar to be between 0 and 1 """
-    ndar = ndar.copy()
-    ndar -= ndar.min()
-    ndar *= 1.0 / (ndar.max()+eps)
-    return ndar
-
-
-def tile_raster_images(X, img_shape, tile_shape,tile_spacing = (0,0),
-                       scale_rows_to_unit_interval = True, output_pixel_vals = True):
-    """
-    Transform an array with one flattened image per row, into an array in
-    which images are reshaped and layed out like tiles on a floor.
-
-    This function is useful for visualizing datasets whose rows are images,
-    and also columns of matrices for transforming those rows
-    (such as the first layer of a neural net).
-
-    :type X: a 2-D ndarray or a tuple of 4 channels, elements of which can
-             be 2-D ndarrays or None;
-    :param X: a 2-D array in which every row is a flattened image.
-
-    :type img_shape: tuple; (height, width)
-    :param img_shape: the original shape of each image
-
-    :type tile_shape: tuple; (rows, cols)
-    :param tile_shape: the number of images to tile (rows, cols)
-
-    :param output_pixel_vals: if output should be pixel values (i.e. int8
-                              values) or floats
-
-    :param scale_rows_to_unit_interval: if the values need to be scaled before
-                                        being plotted to [0,1] or not
-
-
-    :returns: array suitable for viewing as an image.
-              (See:`PIL.Image.fromarray`.)
-    :rtype: a 2-d array with same dtype as X.
-
-    """
-
-    assert len(img_shape) == 2
-    assert len(tile_shape) == 2
-    assert len(tile_spacing) == 2
-
-    # The expression below can be re-written in a more C style as
-    # follows :
-    #
-    # out_shape = [0,0]
-    # out_shape[0] = (img_shape[0]+tile_spacing[0])*tile_shape[0] -
-    #                tile_spacing[0]
-    # out_shape[1] = (img_shape[1]+tile_spacing[1])*tile_shape[1] -
-    #                tile_spacing[1]
-    out_shape = [(ishp + tsp) * tshp - tsp for ishp, tshp, tsp
-                 in zip(img_shape, tile_shape, tile_spacing)]
-
-    if isinstance(X, tuple):
-        assert len(X) == 4
-        # Create an output numpy ndarray to store the image
-        if output_pixel_vals:
-            out_array = numpy.zeros((out_shape[0], out_shape[1], 4), dtype='uint8')
-        else:
-            out_array = numpy.zeros((out_shape[0], out_shape[1], 4), dtype=X.dtype)
-
-        #colors default to 0, alpha defaults to 1 (opaque)
-        if output_pixel_vals:
-            channel_defaults = [0,0,0,255]
-        else:
-            channel_defaults = [0.,0.,0.,1.]
-
-        for i in xrange(4):
-            if X[i] is None:
-                # if channel is None, fill it with zeros of the correct
-                # dtype
-                out_array[:,:,i] = numpy.zeros(out_shape,
-                      dtype='uint8' if output_pixel_vals else out_array.dtype
-                      )+channel_defaults[i]
-            else:
-                # use a recurrent call to compute the channel and store it
-                # in the output
-                out_array[:,:,i] = tile_raster_images(X[i], img_shape, tile_shape, tile_spacing, scale_rows_to_unit_interval, output_pixel_vals)
-        return out_array
-
-    else:
-        # if we are dealing with only one channel
-        H, W = img_shape
-        Hs, Ws = tile_spacing
-
-        # generate a matrix to store the output
-        out_array = numpy.zeros(out_shape, dtype='uint8' if output_pixel_vals else X.dtype)
-
-
-        for tile_row in xrange(tile_shape[0]):
-            for tile_col in xrange(tile_shape[1]):
-                if tile_row * tile_shape[1] + tile_col < X.shape[0]:
-                    if scale_rows_to_unit_interval:
-                        # if we should scale values to be between 0 and 1
-                        # do this by calling the `scale_to_unit_interval`
-                        # function
-                        this_img = scale_to_unit_interval(X[tile_row * tile_shape[1] + tile_col].reshape(img_shape))
-                    else:
-                        this_img = X[tile_row * tile_shape[1] + tile_col].reshape(img_shape)
-                    # add the slice to the corresponding position in the
-                    # output array
-                    out_array[
-                        tile_row * (H+Hs):tile_row*(H+Hs)+H,
-                        tile_col * (W+Ws):tile_col*(W+Ws)+W
-                        ] \
-                        = this_img * (255 if output_pixel_vals else 1)
-        return out_array
-
-
-Having this utility function, we can start training, saving the filters
-(weight plots) after each training epoch.
+Before going over the training loop, however, the reader should become familiar
+with the function ``tile_raster_images`` (see :ref:`how-to-plot`). Since
+RBMs are generative models, we are interested in sampling from them and
+plotting/visualizing these samples. We also want to visualize the filters
+(weights) learnt by the RBM, to gain insight into what the RBM is actually
+doing. Bear in mind, however, that this does not provide the entire story,
+since we neglect the biases and plot the weights up to a multiplicative
+constant (weights are converted to values between 0 and 1).

+Having these utility functions, we can start training the RBM and plot/save
+the filters after each training epoch. We train the RBM using PCD, as it has
+been shown to lead to a better generative model ([Tieleman]_).

 .. code-block:: python

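The essence of the filter plots is just reshaping each weight column into an image. A small sketch of the idea (assuming 28x28 MNIST inputs and the PIL library; this is not the ``tile_raster_images`` helper that the utilities page now provides) is:

.. code-block:: python

    import numpy as np
    from PIL import Image

    def save_filters(W, img_shape=(28, 28), n_show=100, path='filters.png'):
        """Tile the first `n_show` weight columns as grayscale filter images."""
        side = int(np.ceil(np.sqrt(n_show)))
        tiles = np.zeros((side * img_shape[0], side * img_shape[1]))
        for k in range(min(n_show, W.shape[1])):
            filt = W[:, k].reshape(img_shape)
            # rescale each filter to [0, 1] individually; biases are ignored
            filt = (filt - filt.min()) / (filt.max() - filt.min() + 1e-8)
            r, c = divmod(k, side)
            tiles[r * img_shape[0]:(r + 1) * img_shape[0],
                  c * img_shape[1]:(c + 1) * img_shape[1]] = filt
        Image.fromarray((tiles * 255).astype('uint8')).save(path)
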
@@ -609,11 +516,11 @@ Having this utility function, we can start training, saving the filters



-Now for sampling we need to use the ``gibbs_1`` and PCD to have better
-results. For this we first pick several samples (from the testing sequence,
-though we could as well pick it from the training set) to initialize
-several chains that we would sample.
-
+Once the RBM is trained, we can then use the ``gibbs_1`` function to implement
+the Gibbs chain required for sampling. We initialize the Gibbs chain starting
+from test examples (although we could just as well pick them from the training
+set) in order to speed up convergence and avoid problems with random
+initialization.

 .. code-block:: python

@@ -630,12 +537,11 @@ several chains that we would sample.
 # initialize 20 persistent chains in parallel
 persistent_chain = theano.shared( test_set_x.value[sample:sample+20])

-
-
 Next we create the 20 persistent chains in paralel to get our
-samples. To do so, we compile a theano function that takes as one
-step and aplly this function iteratively for a large number of steps,
-plotting the samples drawn at every 1000 step.
+samples. To do so, we compile a theano function which performs one Gibbs step
+and updates the state of the persistent chain with the new visible sample. We
+apply this function iteratively for a large number of steps, plotting the
+samples every 1000 steps.

 .. code-block:: python

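As a rough non-Theano illustration of that sampling loop (reusing ``gibbs_step`` from the earlier sketch; the sizes and the random chain initialization below are assumptions made only to keep the snippet self-contained, whereas the tutorial initializes the chains from test examples):

.. code-block:: python

    import numpy as np

    rng = np.random.RandomState(123)
    n_visible, n_hidden = 784, 500                    # MNIST-sized RBM (assumed)
    W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    vbias, hbias = np.zeros(n_visible), np.zeros(n_hidden)

    # 20 chains run in parallel, here started from random binary states
    chain = (rng.uniform(size=(20, n_visible)) < 0.5) * 1.0

    for step in range(10000):
        _, vis_mean, chain = gibbs_step(chain, W, vbias, hbias, rng)
        if (step + 1) % 1000 == 0:
            # plot/save `vis_mean` here, e.g. with tile_raster_images
            pass
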
@@ -668,17 +574,11 @@ plotting the samples drawn at every 1000 step.
 image.save('sample_%i_step_%i.png'%(idx,idx*jdx))


-
-
-
 Results
 +++++++

-
-Training took 20.862 minutes for 15 epochs with learning rate 0.1 .
-
-
-Picture below shows the filters after 15 epochs :
+Training took 20.862 minutes for 15 epochs with a learning rate of 0.1.
+The picture below shows the filters after 15 epochs:

 .. image:: images/filters_at_epoch_14.png
     :align: center
