@@ -6,9 +6,172 @@ LSTM Network for Sentiment Analysis
66Summary
77+++++++
88
9- This tutorial aims to provide an example of how a Recurrent Neural Network (RNN) using the Long Short Term Memory (LSTM) architecture can be implemented using Theano. In this tutorial, this model is used to perform sentiment analysis on movie reviews from the `Large Movie Review Dataset <http://ai.stanford.edu/~amaas/data/sentiment/>`_, sometimes known as the IMDB dataset.
9+ This tutorial aims to provide an example of how a Recurrent Neural Network
10+ (RNN) using the Long Short Term Memory (LSTM) architecture can be implemented
11+ using Theano. In this tutorial, this model is used to perform sentiment
12+ analysis on movie reviews from the `Large Movie Review Dataset
13+ <http://ai.stanford.edu/~amaas/data/sentiment/>`_, sometimes known as the
14+ IMDB dataset.
15+
16+ In this task, given a movie review, the model attempts to predict whether it
17+ is positive or negative. This is a binary classification task.
18+
19+ Data
20+ ++++
21+
22+ As previously mentioned, the provided scripts are used to train an LSTM
23+ recurrent neural network on the Large Movie Review Dataset.
24+
25+ While the dataset is public, in this tutorial we provide a copy of the dataset
26+ that has previously been preprocessed according to the needs of this LSTM
27+ implementation. You can download this preprocessed version of the dataset
28+ using the script `download.sh
29+ <https://raw.githubusercontent.com/lisa-lab/DeepLearningTutorials/master/data/download.sh>`_
30+ and uncompress it.
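
Roughly speaking, the preprocessed data stores each review as a sequence of
integer word indices together with a binary label (1 for positive, 0 for
negative). The short NumPy sketch below illustrates the kind of preparation
needed before reviews can be fed to the LSTM in mini-batches: variable-length
sequences are padded into a single matrix and a mask records which positions
hold real words. All names and values here are made up for illustration and
are not taken from the tutorial's scripts.

.. code-block:: python

    import numpy

    # Toy reviews already converted to word indices (the real indices come
    # from the preprocessed IMDB data); labels: 1 = positive, 0 = negative.
    reviews = [[23, 5, 781, 9, 64], [23, 5, 781, 112]]
    labels = [1, 0]

    # Pad into a (n_timesteps, n_samples) matrix and build a mask so that
    # the padded tail of shorter reviews can be ignored later on.
    maxlen = max(len(r) for r in reviews)
    x = numpy.zeros((maxlen, len(reviews)), dtype='int64')
    mask = numpy.zeros((maxlen, len(reviews)), dtype='float32')
    for j, r in enumerate(reviews):
        x[:len(r), j] = r
        mask[:len(r), j] = 1.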
31+
32+ Model
33+ +++++
34+
35+ LSTM
36+ ====
37+
38+ In a *traditional* recurrent neural network, during the gradient
39+ back-propagation phase, the gradient signal can end up being multiplied a
40+ large number of times (as many as the number of timesteps) by the weight
41+ matrix associated with the connections between the neurons of the recurrent
42+ hidden layer. This means that the magnitude of the weights in the transition
43+ matrix can have a strong impact on the learning process.
44+
45+ If the weights in this matrix are small, it can lead to a situation called
46+ *vanishing gradients* where the gradient signal gets so small that learning
47+ either becomes very slow or stops working altogether. It can also make the
48+ task of learning long-term dependencies in the data more difficult.
49+ Conversely, if the weights in this matrix are large, it can lead to a
50+ situation where the gradient signal is so large that it can cause learning to
51+ diverge. This is often referred to as *exploding gradients*.
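
The following small NumPy experiment gives a feel for this behaviour. It
repeatedly multiplies a gradient vector by the transpose of a recurrent
weight matrix, as back-propagation through time would (the derivative of the
nonlinearity is ignored for simplicity, and the sizes and scale factors are
arbitrary):

.. code-block:: python

    import numpy

    rng = numpy.random.RandomState(0)
    n_steps, n_hidden = 50, 100
    grad_at_last_step = rng.randn(n_hidden)

    for scale in (0.5, 1.5):  # "small" vs. "large" recurrent weights
        W = scale * rng.randn(n_hidden, n_hidden) / numpy.sqrt(n_hidden)
        grad = grad_at_last_step.copy()
        for _ in range(n_steps):
            grad = W.T.dot(grad)  # one backward step through the recurrence
        # The norm shrinks towards zero for 0.5 and blows up for 1.5.
        print(scale, numpy.linalg.norm(grad))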
52+
53+ These issues are the main motivation behind the LSTM model which introduces a
54+ new structure called a *memory cell* (see Figure 1 below). A memory cell is
55+ composed of four main elements: an input gate, a neuron with a self-recurrent
56+ connection (a connection to itself), a forget gate and an output gate. The
57+ self-recurrent connection has a weight of 1.0 and ensures that, barring any
58+ outside interference, the state of a memory cell can remain constant from one
59+ timestep to another. The gates serve to modulate the interactions between the
60+ memory cell itself and its environment. The input gate can allow an incoming
61+ signal to alter the state of the memory cell or block it. On the other hand,
62+ the output gate can allow the state of the memory cell to have an effect on
63+ other neurons or prevent it. Finally, the forget gate can modulate the memory
64+ cell’s self-recurrent connection, allowing the cell to remember or forget its
65+ previous state, as needed.
66+
67+ .. figure:: images/lstm_memorycell.png
68+ :align: center
69+
70+ **Figure 1** : Illustration of an LSTM memory cell.
71+
72+ The equations below describe how a layer of memory cells is updated at every
73+ timestep :math:`t`. In these equations :
74+
75+ * :math:`x_t` is the input to the memory cell layer at time :math:`t`
76+ * :math:`W_i`, :math:`W_f`, :math:`W_c`, :math:`W_o`, :math:`U_i`,
77+ :math:`U_f`, :math:`U_c`, :math:`U_o` and :math:`V_o` are weight
78+ matrices
79+ * :math:`b_i`, :math:`b_f`, :math:`b_c` and :math:`b_o` are bias vectors
80+
81+
82+ First, we compute the values for :math:`i_t`, the input gate, and
83+ :math:`\widetilde{C_t}`, the candidate value for the states of the memory
84+ cells at time :math:`t` :
85+
86+ .. math::
87+ :label: 1
88+
89+ i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
90+
91+ .. math::
92+ :label: 2
93+
94+ \widetilde{C_t} = \tanh(W_c x_t + U_c h_{t-1} + b_c)
95+
96+ Second, we compute the value for :math:`f_t`, the activation of the memory
97+ cells' forget gates at time :math:`t` :
98+
99+ .. math::
100+ :label: 3
101+
102+ f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
103+
104+ Given the value of the input gate activation :math:`i_t`, the forget gate
105+ activation :math:`f_t` and the candidate state value :math:`\widetilde{C_t}`,
106+ we can compute :math:`C_t`, the memory cells' new state at time :math:`t` :
107+
108+ .. math::
109+ :label: 4
110+
111+ C_t = i_t * \widetilde{C_t} + f_t * C_{t-1}
112+
113+ With the new state of the memory cells, we can compute the value of their
114+ output gates and, subsequently, their outputs :
115+
116+ .. math::
117+ :label: 5
118+
119+ o_t = \sigma(W_o x_t + U_o h_{t-1} + V_o C_t + b_o)
120+
121+ .. math::
122+ :label: 6
123+
124+ h_t = o_t * \tanh(C_t)
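
To make these updates concrete, here is a minimal NumPy sketch of a single
timestep implementing equations :eq:`1` to :eq:`6` for one sample. The
parameter names mirror the symbols above, but the sizes, the random
initialization and the ``lstm_step`` helper are purely illustrative; this is
not the Theano code used in the tutorial.

.. code-block:: python

    import numpy

    def sigmoid(z):
        return 1. / (1. + numpy.exp(-z))

    def lstm_step(x_t, h_prev, C_prev, p):
        """One update of a layer of memory cells, following (1)-(6)."""
        i_t = sigmoid(p['W_i'].dot(x_t) + p['U_i'].dot(h_prev) + p['b_i'])         # (1)
        C_tilde = numpy.tanh(p['W_c'].dot(x_t) + p['U_c'].dot(h_prev) + p['b_c'])  # (2)
        f_t = sigmoid(p['W_f'].dot(x_t) + p['U_f'].dot(h_prev) + p['b_f'])         # (3)
        C_t = i_t * C_tilde + f_t * C_prev                                         # (4)
        o_t = sigmoid(p['W_o'].dot(x_t) + p['U_o'].dot(h_prev)
                      + p['V_o'].dot(C_t) + p['b_o'])                              # (5)
        h_t = o_t * numpy.tanh(C_t)                                                # (6)
        return h_t, C_t

    # Illustrative sizes and randomly initialized parameters.
    n_in, dim = 4, 8
    rng = numpy.random.RandomState(0)
    p = {'V_o': 0.1 * rng.randn(dim, dim)}
    for name in 'ifco':
        p['W_' + name] = 0.1 * rng.randn(dim, n_in)
        p['U_' + name] = 0.1 * rng.randn(dim, dim)
        p['b_' + name] = numpy.zeros(dim)

    h_t, C_t = lstm_step(rng.randn(n_in), numpy.zeros(dim), numpy.zeros(dim), p)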
125+
126+ Our model
127+ ---------
128+
129+ The model we use in this tutorial is a variation of the standard LSTM model.
130+ In this variant, the activation of a cell’s output gate does not depend on the
131+ memory cell’s state :math:`C_t`. This allows us to perform part of the
132+ computation more efficiently (see the implementation note, below, for
133+ details). This means that, in the variant we have implemented, there is no
134+ matrix :math:`V_o` and equation :eq:`5` is replaced by equation :eq:`5-alt` :
135+
136+ .. math::
137+ :label: 5-alt
138+
139+ o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
140+
141+ Our model is composed of a single LSTM layer followed by an average pooling
142+ and a logistic regression layer as illustrated in Figure 2 below. Thus, from
143+ an input sequence :math:`x_0, x_1, x_2, ..., x_n`, the memory cells in the
144+ LSTM layer will produce a representation sequence :math:`h_0, h_1, h_2, ...,
145+ h_n`. This representation sequence is then averaged over all timesteps,
146+ resulting in the representation :math:`h`. Finally, this representation is fed to a
147+ logistic regression layer whose target is the class label associated with the
148+ input sequence.
149+
150+ .. figure:: images/lstm.png
151+ :align: center
152+
153+ **Figure 2** : Illustration of the model used in this tutorial. It is
154+ composed of a single LSTM layer followed by mean pooling over time and
155+ logistic regression.
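
The NumPy sketch below illustrates the pooling and classification stages on
top of precomputed LSTM outputs. It reuses the masking convention introduced
earlier so that padded timesteps do not contribute to the average; the names,
shapes and random values are illustrative only, since in the tutorial the
whole model is built and trained end-to-end in Theano.

.. code-block:: python

    import numpy

    def mean_pool(h, mask):
        """Average LSTM outputs over the valid timesteps of each sample.

        h has shape (n_timesteps, n_samples, dim); mask has shape
        (n_timesteps, n_samples) with 1s at the real (non-padded) positions.
        """
        return (h * mask[:, :, None]).sum(axis=0) / mask.sum(axis=0)[:, None]

    def predict_proba(h_bar, W_out, b_out):
        """Logistic regression layer: softmax over the two classes."""
        scores = h_bar.dot(W_out) + b_out
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        e = numpy.exp(scores)
        return e / e.sum(axis=1, keepdims=True)

    # Illustrative shapes: 7 timesteps, 2 samples, hidden size 8, 2 classes.
    rng = numpy.random.RandomState(0)
    h = rng.randn(7, 2, 8)
    mask = numpy.ones((7, 2))
    mask[4:, 1] = 0.  # the second review is shorter than the first
    probs = predict_proba(mean_pool(h, mask),
                          0.1 * rng.randn(8, 2), numpy.zeros(2))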
156+
157+ **Implementation note** : In the code included in this tutorial, the equations
158+ :eq:`1`, :eq:`2`, :eq:`3` and :eq:`5-alt` are performed in parallel to make
159+ the computation more efficient. This is possible because none of these
160+ equations rely on a result produced by the other ones. It is achieved by
161+ concatenating the four matrices :math:`W_*` into a single weight matrix
162+ :math:`W` and performing the same concatenation on the weight matrices
163+ :math:`U_*` to produce the matrix :math:`U` and the bias vectors :math:`b_*`
164+ to produce the vector :math:`b`. Then, the pre-nonlinearity activations can
165+ be computed with :
166+
167+ .. math::
168+
169+ z = W x_t + U h_{t-1} + b
170+
171+ The result is then sliced to obtain the pre-nonlinearity activations for
172+ :math:`i_t`, :math:`f_t`, :math:`\widetilde{C_t}` and :math:`o_t`, and the
173+ non-linearities are then applied independently for each.
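
The NumPy sketch below illustrates this trick: the per-gate matrices are
stacked so that a single matrix-vector product produces all of the
pre-nonlinearity activations, which are then sliced apart. The sizes, the
gate ordering in the slices and the ``_slice`` helper are illustrative and
not taken verbatim from the tutorial's Theano code; note also that, matching
the variant above, there is no :math:`V_o` term.

.. code-block:: python

    import numpy

    def sigmoid(z):
        return 1. / (1. + numpy.exp(-z))

    def _slice(z, k, dim):
        """Return the k-th block of width dim from the stacked activations."""
        return z[k * dim:(k + 1) * dim]

    rng = numpy.random.RandomState(0)
    n_in, dim = 4, 8  # illustrative sizes

    # Stack the four per-gate matrices row-wise: W is (4 * dim, n_in),
    # U is (4 * dim, dim) and b is (4 * dim,).
    W = numpy.concatenate([0.1 * rng.randn(dim, n_in) for _ in range(4)], axis=0)
    U = numpy.concatenate([0.1 * rng.randn(dim, dim) for _ in range(4)], axis=0)
    b = numpy.zeros(4 * dim)

    x_t, h_prev = rng.randn(n_in), numpy.zeros(dim)
    z = W.dot(x_t) + U.dot(h_prev) + b       # all pre-nonlinearity activations

    i_t = sigmoid(_slice(z, 0, dim))         # input gate
    f_t = sigmoid(_slice(z, 1, dim))         # forget gate
    C_tilde = numpy.tanh(_slice(z, 2, dim))  # candidate cell state
    o_t = sigmoid(_slice(z, 3, dim))         # output gate (no V_o term)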
10174
11- In this task, given a movie review, the model attempts to predict whether it is positive or negative. This is a binary classification task.
12175
13176Code - Citations - Contact
14177++++++++++++++++++++++++++
@@ -22,23 +185,23 @@ The LSTM implementation can be found in the two following files :
22185
23186* `imdb.py <http://deeplearning.net/tutorial/code/imdb.py>`_ : Secondary script. Handles the loading and preprocessing of the IMDB dataset.
24187
25- Data
26- ====
188+ After downloading both scripts, downloading and uncompressing the data, and
189+ putting all of those files in the same folder, the user can run the code by
190+ calling:
27191
28- As previously mentionned, the provided scripts are used to train a LSTM
29- recurrent neural on the Large Movie Review Dataset dataset.
192+ .. code-block:: bash
193+
194+ THEANO_FLAGS="floatX=float32" python train_lstm.py
30195
31- While the dataset is public, in this tutorial we provide a copy of the dataset
32- that has previously been preprocessed according to the needs of this LSTM
33- implementation. You can download this preprocessed version of the dataset
34- using the script `download.sh <https://raw.githubusercontent.com/lisa-lab/DeepLearningTutorials/master/data/download.sh>`_ and uncompress it.
35196
36197Papers
37198======
38199
39200If you use this tutorial, please cite the following papers:
40201
41- * `[pdf] <http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf>`_ HOCHREITER, Sepp et SCHMIDHUBER, Jürgen. Long short-term memory. Neural computation, 1997, vol. 9, no 8, p. 1735-1780. 1997.
202+ * `[pdf] <http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf>`_ Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
203+
204+ * `[pdf] <http://www.cs.toronto.edu/~graves/preprint.pdf>`_ Graves, A. (2012). Supervised sequence labelling with recurrent neural networks (Vol. 385). Springer.
42205
43206* `[pdf] <http://www.iro.umontreal.ca/~lisa/pointeurs/nips2012_deep_workshop_theano_final.pdf>`_ Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian, Bergeron, Arnaud, Bouchard, Nicolas, and Bengio, Yoshua. Theano: new features and speed improvements. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2012.
44207
@@ -52,14 +215,15 @@ Contact
52215Please email `Kyunghyun Cho <http://www.kyunghyuncho.me/>`_ for any
53216problem report or feedback. We will be glad to hear from you.
54217
55- Running the Code
56- ++++++++++++++++
218+ References
219+ ==========
57220
58- After downloading both the scripts, downloading and uncompressing the data and
59- putting all those files in the same folder, the user can run the code by
60- calling:
221+ * Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
61222
62- .. code-block:: bash
223+ * Graves, A. (2012). Supervised sequence labelling with recurrent neural networks (Vol. 385). Springer.
63224
64- THEANO_FLAGS="floatX=float32" python train_lstm.py
225+ * Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. (2001). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.
226+
227+ * Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157-166.
65228
229+ * Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011, June). Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (pp. 142-150). Association for Computational Linguistics.