@@ -39,7 +39,7 @@ descent on the empirical log-likelihood of the training data:
     \mathcal{L}(\theta, \mathcal{D}) = \frac{1}{N} \sum_{x^{(i)} \in
     \mathcal{D}} \log\ p(x^{(i)}).

-using the stochastic gradient :math:`\frac{\partial p(x^{(i)})}{\partial
+using the stochastic gradient :math:`\frac{\partial \log p(x^{(i)})}{\partial
 \theta}`, where :math:`\theta` are the parameters of the model.


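+In practice, this amounts to parameter updates of the standard SGD form
+
+.. math::
+    \theta \leftarrow \theta + \epsilon \,
+    \frac{\partial \log p(x^{(i)})}{\partial \theta},
+
+where :math:`\epsilon` denotes a learning rate (equivalently, gradient descent
+on the negative log-likelihood).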
@@ -102,9 +102,11 @@ denoted as :math:`\mathcal{N}`. The gradient can then be written as:
     \frac{\partial \log p(x)}{\partial \theta}
     &\approx
     - \frac{\partial \mathcal{F}(x)}{\partial \theta} +
-    \sum_{\tilde{x} \in \mathcal{N}} p(\tilde{x}) \
+    \frac{1}{|\mathcal{N}|} \sum_{\tilde{x} \in \mathcal{N}} \
     \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta}.

+where we would ideally like the elements :math:`\tilde{x}` of :math:`\mathcal{N}` to be
+sampled according to :math:`P` (i.e., we are doing Monte-Carlo sampling).
+
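+As a rough illustration of this estimator (a NumPy sketch rather than the
+tutorial's Theano code; ``free_energy_grad`` is a hypothetical helper returning
+:math:`\partial \mathcal{F}(x)/\partial \theta` for one configuration):
+
+.. code-block:: python
+
+    import numpy as np
+
+    def ebm_grad_estimate(x, negative_particles, free_energy_grad):
+        # contribution of the observed (positive) example
+        positive = -free_energy_grad(x)
+        # average contribution of the sampled negative particles
+        negative = np.mean([free_energy_grad(x_tilde)
+                            for x_tilde in negative_particles], axis=0)
+        return positive + negative
+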
 With the above formula, we almost have a practical, stochastic algorithm for
 learning an EBM. The only missing ingredient is how to extract these negative
 particles :math:`\mathcal{N}`. While the statistical literature abounds with
@@ -116,8 +118,14 @@ EBM.
 Restricted Boltzmann Machines (RBM)
 +++++++++++++++++++++++++++++++++++

-Boltzmann Machines (BMs) are a particular form of energy-based model which
-contain hidden variables. Restricted Boltzmann Machines further restrict BMs to
+Boltzmann Machines (BMs) are a particular form of log-linear Markov Random Field (MRF),
+i.e., one for which the energy function is linear in its free parameters. To make
+them powerful enough to represent complicated distributions (i.e., to go from the
+limited parametric setting to a non-parametric one), we consider that some of the
+variables are never observed (they are called hidden). By having more hidden
+variables (also called hidden units), we can increase the modeling capacity
+of the Boltzmann Machine (BM).
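+
+For instance, with visible units :math:`v` and hidden units :math:`h`, a general BM
+can be given an energy of the form (the notation here is ours, chosen to match the
+RBM energy used later in this tutorial)
+
+.. math::
+    E(v,h) = -b'v - c'h - v'Uv - h'Vh - h'Wv,
+
+which is indeed linear in the free parameters :math:`(b, c, U, V, W)`.
+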
+Restricted Boltzmann Machines further restrict BMs to
 those without visible-visible and hidden-hidden connections. A graphical
 depiction of an RBM is shown below.

@@ -151,8 +159,8 @@ write:

 **RBMs with binary units**

-In the commonly studied case of using binary units (where :math:`h_i \in
-\{0,1\}`, we obtain from Eq. :eq:`rbm_energy` and :eq:`energy2`, a stochastic
+In the commonly studied case of using binary units (where :math:`x_j` and :math:`h_i \in
+\{0,1\}`), we obtain from Eq. :eq:`rbm_energy` and :eq:`energy2` a probabilistic
 version of the usual neuron activation function:

 .. math::
@@ -181,15 +189,16 @@ following log-likelihood gradients for an RBM with binary units:
     :label: rbm_grad

     \frac {\partial{\log p(v)}} {\partial W_{ij}} &=
-        - x^{(i)}_j \cdot sigm(W_i \cdot x^{(i)} + c_i)
-        + E_v[p(h_i|v) \cdot v_j] \\
+        x^{(i)}_j \cdot sigm(W_i \cdot x^{(i)} + c_i)
+        - E_v[p(h_i|v) \cdot v_j] \\
     \frac {\partial{\log p(v)}} {\partial c_i} &=
-        - sigm(W_i \cdot x^{(i)}) + E_v[p(h_i|v)] \\
+        sigm(W_i \cdot x^{(i)} + c_i) - E_v[p(h_i|v)] \\
     \frac {\partial{\log p(v)}} {\partial b_j} &=
-        - x^{(i)}_j + E_v[p(v_j|h)]
+        x^{(i)}_j - E_v[p(v_j|h)]

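+As a quick illustration of these formulas (a NumPy sketch with our own helper
+names, not the tutorial's Theano implementation; ``v_model`` stands in for a
+negative particle used to approximate the model expectations :math:`E_v[\cdot]`):
+
+.. code-block:: python
+
+    import numpy as np
+
+    def sigm(a):
+        return 1.0 / (1.0 + np.exp(-a))
+
+    def rbm_grad_estimate(x, v_model, W, c):
+        # W has shape (n_hidden, n_visible); c is the hidden offset vector
+        ph_data = sigm(np.dot(W, x) + c)         # p(h_i = 1 | x), the data term
+        ph_model = sigm(np.dot(W, v_model) + c)  # p(h_i = 1 | v_model), the model term
+        g_W = np.outer(ph_data, x) - np.outer(ph_model, v_model)
+        g_c = ph_data - ph_model
+        g_b = x - v_model   # gradient w.r.t. the visible offsets b
+        return g_W, g_c, g_b
+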
 For a more detailed derivation of these equations, we refer the reader to the
-following `page <http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DBNEquations>`_.
+following `page <http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DBNEquations>`_,
+or to section 5 of `Learning Deep Architectures for AI <http://www.iro.umontreal.ca/%7Elisa/publications2/index.php/publications/show/239>`_.

 .. note::
     We will be updating the tutorial shortly, such that the gradients are
@@ -219,7 +228,11 @@ follows:
     x^{(n+1)} &\sim sigm(W h^{(n+1)} + b),

 where :math:`h^{(n)}` refers to the set of all hidden units at the n-th step of
-the Markov chain.
+the Markov chain. This means that, for example, :math:`h^{(n+1)}_i` is randomly
+chosen to be 1 (versus 0) with probability :math:`sigm(W_i'x^{(n)} + c_i)`, and
+similarly, :math:`x^{(n+1)}_j` is randomly chosen to be 1 (versus 0) with
+probability :math:`sigm(W_{.j} h^{(n+1)} + b_j)`.
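+
+In code, one such Gibbs step might look like the following (a NumPy sketch with
+our own helper names, not the Theano ``gibbs_vhv`` function used later in this
+tutorial):
+
+.. code-block:: python
+
+    import numpy as np
+
+    def sigm(a):
+        return 1.0 / (1.0 + np.exp(-a))
+
+    def gibbs_step(x, W, b, c, rng=np.random):
+        # h^{(n+1)}: each hidden unit is set to 1 with probability sigm(W x + c)
+        h = rng.binomial(1, sigm(np.dot(W, x) + c)).astype(x.dtype)
+        # x^{(n+1)}: each visible unit is set to 1 with probability sigm(W' h + b)
+        x_new = rng.binomial(1, sigm(np.dot(W.T, h) + b)).astype(x.dtype)
+        return h, x_new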

 This can be illustrated graphically:

@@ -241,9 +254,10 @@ Contrastive Divergence (CD-k)

 Contrastive Divergence uses two tricks to speed up the sampling process:

-* since we eventually want :math:`p(x) \approx p_T (x)` (the true, underlying
+* since we eventually want :math:`p(x) \approx p_{train} (x)` (the true, underlying
   distribution of the data), we initialize the Markov chain with a training
-  example.
+  example (i.e., from a distribution that is expected to be close to :math:`p`,
+  so that the chain will already be close to having converged to its final
+  distribution :math:`p`).

 * CD does not wait for the chain to converge. Samples are obtained after only
   k-steps of Gibbs sampling. In practice, :math:`k=1` has been shown to work
@@ -255,8 +269,9 @@ Persistent CD

 Persistent CD [Tieleman08]_ uses another approximation for sampling from
 :math:`p(x,h)`. It relies on a single Markov chain, which has a persistent
-state. For each parameter update, we extract new samples by simply running the
-chain for k-steps. The state of the chain is then preserved for subsequent updates.
+state (i.e., not restarting a chain for each observed example). For each
+parameter update, we extract new samples by simply running the chain for
+k-steps. The state of the chain is then preserved for subsequent updates.

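+In pseudocode, the persistent state simply lives outside the loop over training
+examples. The sketch below reuses the hypothetical ``gibbs_step`` and
+``rbm_grad_estimate`` helpers sketched earlier (``x_train``, ``W``, ``b``, ``c``,
+the step size ``lr`` and the number of Gibbs steps ``k`` are assumed to be
+defined); it is a NumPy-style sketch, not the Theano code used in this tutorial:
+
+.. code-block:: python
+
+    # initialise the persistent chain once, e.g. with a training example
+    v_chain = x_train[0].copy()
+
+    for x in x_train:
+        # advance the *same* chain by k Gibbs steps to obtain a negative particle
+        for _ in range(k):
+            h, v_chain = gibbs_step(v_chain, W, b, c)
+        g_W, g_c, g_b = rbm_grad_estimate(x, v_chain, W, c)
+        W += lr * g_W
+        c += lr * g_c
+        b += lr * g_b
+        # v_chain is kept as-is and reused for the next parameter update
+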
 The general intuition is that if parameter updates are small enough compared
 to the mixing rate of the chain, the Markov chain should be able to "catch up"
@@ -447,7 +462,7 @@ compute the gradients of Eq. :eq:`rbm_grad`.

         gparams = [g_W.T, g_hbias, g_vbias]

 Finally, we construct the updates dictionary containing the parameter
-updates. In case of PCD, these should also update the shared variable
+updates. In the case of PCD, these should also update the shared variable
 containing the state of the Gibbs chain.

 .. code-block:: python
@@ -536,8 +551,8 @@ samples at every 1000 steps.
     # define one step of Gibbs sampling (mf = mean-field)
     [hid_mf, hid_sample, vis_mf, vis_sample] = rbm.gibbs_vhv(persistent_vis_chain)

-    # the sample at the end of the channel is returned by ``gibbs_1 `` as
-    # its second output; note that this is computed as a binomial draw,
+    # the sample at the end of the chain is returned by ``gibbs_vhv`` as
+    # its last output; note that this is computed as a binomial draw,
     # therefore it is formed of ints (0 and 1) and therefore needs to
     # be converted to the same dtype as ``persistent_vis_chain``
     vis_sample = T.cast(vis_sample, dtype=theano.config.floatX)
@@ -554,13 +569,15 @@ samples at every 1000 steps.
     plot_every = 1000

     for idx in xrange(n_samples):
-        # do `plot_every` intermediate samplings of which we do not care
+        # generate `plot_every` intermediate samples that we discard,
+        # because successive samples in the chain are too correlated
         for jdx in xrange(plot_every):
            vis_mf, vis_sample = sample_fn()

         # construct image
         image = PIL.Image.fromarray(tile_raster_images(
-            X = vis_mf, img_shape = (28,28), tile_shape = (10,10),
+            X = vis_mf,
+            img_shape = (28,28),
+            tile_shape = (10,10),
             tile_spacing = (1,1) ) )
         print ' ... plotting sample ', idx
         image.save('sample_%i_step_%i.png'%(idx,idx*jdx))
@@ -580,7 +597,7 @@ Several options are available to the user.

 Negative samples obtained during training can be visualized. As training
 progresses, we know that the model defined by the RBM becomes closer to the
-true underlying distribution, :math:`p_T (x)`. Negative samples should thus
+true underlying distribution, :math:`p_{train} (x)`. Negative samples should thus
 look like samples from the training set. Obviously bad hyperparameters can be
 discarded in this fashion.

@@ -605,17 +622,18 @@ all bits are independent. Therefore,
     PL(x) = \prod_i P(x_i | x_{-i}) \text{ and }\\
     \log PL(x) = \sum_i \log P(x_i | x_{-i})

-Here :math:`x_{-i}` denotes the set of all bits of :math:`x` minus bit
+Here :math:`x_{-i}` denotes the set of all bits of :math:`x` except bit
 :math:`i`. The log-PL is therefore the sum of the log-probabilities of each
 bit :math:`x_i`, conditioned on the state of all other bits. For MNIST, this
 would involve summing over the 784 input dimensions, which remains rather
 expensive. For this reason, we use the following stochastic approximation to
 log-PL:

 .. math::
-    \log PL(x) &\approx N \cdot \log P(x_i | x_{-i}) \text{, where }
-    i \sim U(0,N),
-
+    g = N \cdot \log P(x_i | x_{-i}) \text{, where } i \sim U(0,N) \text{, and}\\
+    E[g] = \log PL(x)
+
+where the expectation is taken over the uniform random choice of index :math:`i`,
 and :math:`N` is the number of visible units. In order to work with binary
 units, we further introduce the notation :math:`\tilde{x}_i` to refer to
 :math:`x` with bit-i being flipped (1->0, 0->1). The log-PL for an RBM with binary units is
@@ -649,7 +667,7 @@ values :math:`\{0,1,...,N\}`, from one update to another.
         # calculate free energy for the given bit configuration
         fe_xi = self.free_energy(xi)

-        # flip bit x_i of matrix xi and preserve all other bits x_{\ i}
+        # flip bit x_i of matrix xi and preserve all other bits x_{-i}
         # Equivalent to xi[:,bit_i_idx] = 1-xi[:, bit_i_idx]
         # NB: slice(start,stop,step) is the python object used for
         # slicing, e.g. to index matrix x as follows: x[start:stop:step]