@@ -6,9 +6,172 @@ LSTM Network for Sentiment Analysis
66Summary
77+++++++
88
9- This tutorial aims to provide an example of how a Recurrent Neural Network (RNN) using the Long Short Term Memory (LSTM) architecture can be implemented using Theano. In this tutorial, this model is used to perform sentiment analysis on movie reviews from the `Large Movie Review Dataset <http://ai.stanford.edu/~amaas/data/sentiment/>`_, sometimes known as the IMDB dataset.
9+ This tutorial aims to provide an example of how a Recurrent Neural Network
10+ (RNN) using the Long Short Term Memory (LSTM) architecture can be implemented
11+ using Theano. In this tutorial, this model is used to perform sentiment
12+ analysis on movie reviews from the `Large Movie Review Dataset
13+ <http://ai.stanford.edu/~amaas/data/sentiment/>`_, sometimes known as the
14+ IMDB dataset.
15+
16+ In this task, given a movie review, the model attempts to predict whether it
17+ is positive or negative. This is a binary classification task.
18+
19+ Data
20+ ++++
21+
22+ As previously mentioned, the provided scripts are used to train an LSTM
23+ recurrent neural network on the Large Movie Review Dataset.
24+
25+ While the dataset is public, in this tutorial we provide a copy of the dataset
26+ that has previously been preprocessed according to the needs of this LSTM
27+ implementation. You can download this preprocessed version of the dataset
28+ using the script `download.sh
29+ <https://raw.githubusercontent.com/lisa-lab/DeepLearningTutorials/master/data/download.sh>`_
30+ and uncompress it.
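
Roughly speaking, the preprocessed data stores each review as a sequence of
integer word indices together with a binary label (1 for positive, 0 for
negative). The short NumPy sketch below illustrates the kind of preparation
needed before reviews can be fed to the LSTM in mini-batches: variable-length
sequences are padded into a single matrix and a mask records which positions
hold real words. All names and values here are made up for illustration and
are not taken from the tutorial's scripts.

.. code-block:: python

    import numpy

    # Toy reviews already converted to word indices (the real indices come
    # from the preprocessed IMDB data); labels: 1 = positive, 0 = negative.
    reviews = [[23, 5, 781, 9, 64], [23, 5, 781, 112]]
    labels = [1, 0]

    # Pad into a (n_timesteps, n_samples) matrix and build a mask so that
    # the padded tail of shorter reviews can be ignored later on.
    maxlen = max(len(r) for r in reviews)
    x = numpy.zeros((maxlen, len(reviews)), dtype='int64')
    mask = numpy.zeros((maxlen, len(reviews)), dtype='float32')
    for j, r in enumerate(reviews):
        x[:len(r), j] = r
        mask[:len(r), j] = 1.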
31+
32+ Model
33+ +++++
34+
35+ LSTM
36+ ====
37+
38+ In a *traditional* recurrent neural network, during the gradient
39+ back-propagation phase, the gradient signal can end up being multiplied a
40+ large number of times (as many as the number of timesteps) by the weight
41+ matrix associated with the connections between the neurons of the recurrent
42+ hidden layer. This means that the magnitude of the weights in the transition
43+ matrix can have a strong impact on the learning process.
44+
45+ If the weights in this matrix are small, it can lead to a situation called
46+ *vanishing gradients* where the gradient signal gets so small that learning
47+ either becomes very slow or stops working altogether. It can also make the
48+ task of learning long-term dependencies in the data more difficult.
49+ Conversely, if the weights in this matrix are large, it can lead to a
50+ situation where the gradient signal is so large that it can cause learning to
51+ diverge. This is often referred to as *exploding gradients*.
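
The following small NumPy experiment gives a feel for this behaviour. It
repeatedly multiplies a gradient vector by the transpose of a recurrent
weight matrix, as back-propagation through time would (the derivative of the
nonlinearity is ignored for simplicity, and the sizes and scale factors are
arbitrary):

.. code-block:: python

    import numpy

    rng = numpy.random.RandomState(0)
    n_steps, n_hidden = 50, 100
    grad_at_last_step = rng.randn(n_hidden)

    for scale in (0.5, 1.5):  # "small" vs. "large" recurrent weights
        W = scale * rng.randn(n_hidden, n_hidden) / numpy.sqrt(n_hidden)
        grad = grad_at_last_step.copy()
        for _ in range(n_steps):
            grad = W.T.dot(grad)  # one backward step through the recurrence
        # The norm shrinks towards zero for 0.5 and blows up for 1.5.
        print(scale, numpy.linalg.norm(grad))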
52+
53+ These issues are the main motivation behind the LSTM model which introduces a
54+ new structure called a *memory cell* (see Figure 1 below). A memory cell is
55+ composed of four main elements: an input gate, a neuron with a self-recurrent
56+ connection (a connection to itself), a forget gate and an output gate. The
57+ self-recurrent connection has a weight of 1.0 and ensures that, barring any
58+ outside interference, the state of a memory cell can remain constant from one
59+ timestep to another. The gates serve to modulate the interactions between the
60+ memory cell itself and its environment. The input gate can allow an incoming
61+ signal to alter the state of the memory cell or block it. On the other hand,
62+ the output gate can allow the state of the memory cell to have an effect on
63+ other neurons or prevent it. Finally, the forget gate can modulate the memory
64+ cell’s self-recurrent connection, allowing the cell to remember or forget its
65+ previous state, as needed.
66+
67+ .. figure:: images/lstm_memorycell.png
68+ :align: center
69+
70+ **Figure 1** : Illustration of an LSTM memory cell.
71+
72+ The equations below describe how a layer of memory cells is updated at every
73+ timestep :math:`t`. In these equations :
74+
75+ * :math:`x_t` is the input to the memory cell layer at time :math:`t`
76+ * :math:`W_i`, :math:`W_f`, :math:`W_c`, :math:`W_o`, :math:`U_i`,
77+ :math:`U_f`, :math:`U_c`, :math:`U_o` and :math:`V_o` are weight
78+ matrices
79+ * :math:`b_i`, :math:`b_f`, :math:`b_c` and :math:`b_o` are bias vectors
80+
81+
82+ First, we compute the values for :math:`i_t`, the input gate, and
83+ :math:`\widetilde{C_t}`, the candidate value for the states of the memory
84+ cells at time :math:`t` :
85+
86+ .. math::
87+ :label: 1
88+
89+ i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
90+
91+ .. math::
92+ :label: 2
93+
94+ \widetilde{C_t} = \tanh(W_c x_t + U_c h_{t-1} + b_c)
95+
96+ Second, we compute the value for :math:`f_t`, the activation of the memory
97+ cells' forget gates at time :math:`t` :
98+
99+ .. math::
100+ :label: 3
101+
102+ f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
103+
104+ Given the value of the input gate activation :math:`i_t`, the forget gate
105+ activation :math:`f_t` and the candidate state value :math:`\widetilde{C_t}`,
106+ we can compute :math:`C_t`, the memory cells' new state at time :math:`t` :
107+
108+ .. math::
109+ :label: 4
110+
111+ C_t = i_t * \widetilde{C_t} + f_t * C_{t-1}
112+
113+ With the new state of the memory cells, we can compute the value of their
114+ output gates and, subsequently, their outputs :
115+
116+ .. math::
117+ :label: 5
118+
119+ o_t = \sigma(W_o x_t + U_o h_{t-1} + V_o C_t + b_o)
120+
121+ .. math::
122+ :label: 6
123+
124+ h_t = o_t * \tanh(C_t)
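
To make these updates concrete, here is a minimal NumPy sketch of a single
timestep implementing equations :eq:`1` to :eq:`6` for one sample. The
parameter names mirror the symbols above, but the sizes, the random
initialization and the ``lstm_step`` helper are purely illustrative; this is
not the Theano code used in the tutorial.

.. code-block:: python

    import numpy

    def sigmoid(z):
        return 1. / (1. + numpy.exp(-z))

    def lstm_step(x_t, h_prev, C_prev, p):
        """One update of a layer of memory cells, following (1)-(6)."""
        i_t = sigmoid(p['W_i'].dot(x_t) + p['U_i'].dot(h_prev) + p['b_i'])         # (1)
        C_tilde = numpy.tanh(p['W_c'].dot(x_t) + p['U_c'].dot(h_prev) + p['b_c'])  # (2)
        f_t = sigmoid(p['W_f'].dot(x_t) + p['U_f'].dot(h_prev) + p['b_f'])         # (3)
        C_t = i_t * C_tilde + f_t * C_prev                                         # (4)
        o_t = sigmoid(p['W_o'].dot(x_t) + p['U_o'].dot(h_prev)
                      + p['V_o'].dot(C_t) + p['b_o'])                              # (5)
        h_t = o_t * numpy.tanh(C_t)                                                # (6)
        return h_t, C_t

    # Illustrative sizes and randomly initialized parameters.
    n_in, dim = 4, 8
    rng = numpy.random.RandomState(0)
    p = {'V_o': 0.1 * rng.randn(dim, dim)}
    for name in 'ifco':
        p['W_' + name] = 0.1 * rng.randn(dim, n_in)
        p['U_' + name] = 0.1 * rng.randn(dim, dim)
        p['b_' + name] = numpy.zeros(dim)

    h_t, C_t = lstm_step(rng.randn(n_in), numpy.zeros(dim), numpy.zeros(dim), p)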
125+
126+ Our model
127+ ---------
128+
129+ The model we use in this tutorial is a variation of the standard LSTM model.
130+ In this variant, the activation of a cell’s output gate does not depend on the
131+ memory cell’s state :math:`C_t`. This allows us to perform part of the
132+ computation more efficiently (see the implementation note, below, for
133+ details). This means that, in the variant we have implemented, there is no
134+ matrix :math:`V_o` and equation :eq:`5` is replaced by equation :eq:`5-alt` :
135+
136+ .. math::
137+ :label: 5-alt
138+
139+ o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
140+
141+ Our model is composed of a single LSTM layer followed by an average pooling
142+ and a logistic regression layer as illustrated in Figure 2 below. Thus, from
143+ an input sequence :math:`x_0, x_1, x_2, ..., x_n`, the memory cells in the
144+ LSTM layer will produce a representation sequence :math:`h_0, h_1, h_2, ...,
145+ h_n`. This representation sequence is then averaged over all timesteps,
146+ resulting in the representation :math:`h`. Finally, this representation is fed to a
147+ logistic regression layer whose target is the class label associated with the
148+ input sequence.
149+
150+ .. figure:: images/lstm.png
151+ :align: center
152+
153+ **Figure 2** : Illustration of the model used in this tutorial. It is
154+ composed of a single LSTM layer followed by mean pooling over time and
155+ logistic regression.
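
The NumPy sketch below illustrates the pooling and classification stages on
top of precomputed LSTM outputs. It reuses the masking convention introduced
earlier so that padded timesteps do not contribute to the average; the names,
shapes and random values are illustrative only, since in the tutorial the
whole model is built and trained end-to-end in Theano.

.. code-block:: python

    import numpy

    def mean_pool(h, mask):
        """Average LSTM outputs over the valid timesteps of each sample.

        h has shape (n_timesteps, n_samples, dim); mask has shape
        (n_timesteps, n_samples) with 1s at the real (non-padded) positions.
        """
        return (h * mask[:, :, None]).sum(axis=0) / mask.sum(axis=0)[:, None]

    def predict_proba(h_bar, W_out, b_out):
        """Logistic regression layer: softmax over the two classes."""
        scores = h_bar.dot(W_out) + b_out
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        e = numpy.exp(scores)
        return e / e.sum(axis=1, keepdims=True)

    # Illustrative shapes: 7 timesteps, 2 samples, hidden size 8, 2 classes.
    rng = numpy.random.RandomState(0)
    h = rng.randn(7, 2, 8)
    mask = numpy.ones((7, 2))
    mask[4:, 1] = 0.  # the second review is shorter than the first
    probs = predict_proba(mean_pool(h, mask),
                          0.1 * rng.randn(8, 2), numpy.zeros(2))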
156+
157+ **Implementation note** : In the code included in this tutorial, the equations
158+ :eq:`1`, :eq:`2`, :eq:`3` and :eq:`5-alt` are performed in parallel to make
159+ the computation more efficient. This is possible because none of these
160+ equations rely on a result produced by the other ones. It is achieved by
161+ concatenating the four matrices :math:`W_*` into a single weight matrix
162+ :math:`W` and performing the same concatenation on the weight matrices
163+ :math:`U_*` to produce the matrix :math:`U` and the bias vectors :math:`b_*`
164+ to produce the vector :math:`b`. Then, the pre-nonlinearity activations can
165+ be computed with :
166+
167+ .. math::
168+
169+ z = W x_t + U h_{t-1} + b
170+
171+ The result is then sliced to obtain the pre-nonlinearity activations for
172+ :math:`i_t`, :math:`f_t`, :math:`\widetilde{C_t}` and :math:`o_t`, and the
173+ non-linearities are then applied independently for each.
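
The NumPy sketch below illustrates this trick: the per-gate matrices are
stacked so that a single matrix-vector product produces all of the
pre-nonlinearity activations, which are then sliced apart. The sizes, the
gate ordering in the slices and the ``_slice`` helper are illustrative and
not taken verbatim from the tutorial's Theano code; note also that, matching
the variant above, there is no :math:`V_o` term.

.. code-block:: python

    import numpy

    def sigmoid(z):
        return 1. / (1. + numpy.exp(-z))

    def _slice(z, k, dim):
        """Return the k-th block of width dim from the stacked activations."""
        return z[k * dim:(k + 1) * dim]

    rng = numpy.random.RandomState(0)
    n_in, dim = 4, 8  # illustrative sizes

    # Stack the four per-gate matrices row-wise: W is (4 * dim, n_in),
    # U is (4 * dim, dim) and b is (4 * dim,).
    W = numpy.concatenate([0.1 * rng.randn(dim, n_in) for _ in range(4)], axis=0)
    U = numpy.concatenate([0.1 * rng.randn(dim, dim) for _ in range(4)], axis=0)
    b = numpy.zeros(4 * dim)

    x_t, h_prev = rng.randn(n_in), numpy.zeros(dim)
    z = W.dot(x_t) + U.dot(h_prev) + b       # all pre-nonlinearity activations

    i_t = sigmoid(_slice(z, 0, dim))         # input gate
    f_t = sigmoid(_slice(z, 1, dim))         # forget gate
    C_tilde = numpy.tanh(_slice(z, 2, dim))  # candidate cell state
    o_t = sigmoid(_slice(z, 3, dim))         # output gate (no V_o term)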
10174
11- In this task, given a movie review, the model attempts to predict whether it is positive or negative. This is a binary classification task.
12175
13176Code - Citations - Contact
14177++++++++++++++++++++++++++
@@ -22,23 +185,23 @@ The LSTM implementation can be found in the two following files :
22185
23186* `imdb.py <http://deeplearning.net/tutorial/code/imdb.py>`_ : Secondary script. Handles the loading and preprocessing of the IMDB dataset.
24187
25- Data
26- ====
188+ After downloading both scripts, downloading and uncompressing the data, and
189+ putting all of those files in the same folder, the user can run the code by
190+ calling:
27191
28- As previously mentionned, the provided scripts are used to train a LSTM
29- recurrent neural on the Large Movie Review Dataset dataset.
192+ .. code-block:: bash
193+
194+ THEANO_FLAGS="floatX=float32" python train_lstm.py
30195
31- While the dataset is public, in this tutorial we provide a copy of the dataset
32- that has previously been preprocessed according to the needs of this LSTM
33- implementation. You can download this preprocessed version of the dataset
34- using the script `download.sh <https://raw.githubusercontent.com/lisa-lab/DeepLearningTutorials/master/data/download.sh>`_ and uncompress it.
35196
36197Papers
37198======
38199
39200If you use this tutorial, please cite the following papers:
40201
41- * `[pdf] <http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf>`_ HOCHREITER, Sepp et SCHMIDHUBER, Jürgen. Long short-term memory. Neural computation, 1997, vol. 9, no 8, p. 1735-1780. 1997.
202+ * `[pdf] <http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf>`_ Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
203+
204+ * `[pdf] <http://www.cs.toronto.edu/~graves/preprint.pdf>`_ Graves, A. (2012). Supervised sequence labelling with recurrent neural networks (Vol. 385). Springer.
42205
43206* `[pdf] <http://www.iro.umontreal.ca/~lisa/pointeurs/nips2012_deep_workshop_theano_final.pdf>`_ Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian, Bergeron, Arnaud, Bouchard, Nicolas, and Bengio, Yoshua. Theano: new features and speed improvements. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2012.
44207
@@ -52,14 +215,15 @@ Contact
52215Please email `Kyunghyun Cho <http://www.kyunghyuncho.me/>`_ for any
53216problem report or feedback. We will be glad to hear from you.
54217
55- Running the Code
56- ++++++++++++++++
218+ References
219+ ==========
57220
58- After downloading both the scripts, downloading and uncompressing the data and
59- putting all those files in the same folder, the user can run the code by
60- calling:
221+ * Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
61222
62- .. code-block:: bash
223+ * Graves, A. (2012). Supervised sequence labelling with recurrent neural networks (Vol. 385). Springer.
63224
64- THEANO_FLAGS="floatX=float32" python train_lstm.py
225+ * Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. (2001). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.
226+
227+ * Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157-166.
65228
229+ * Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011, June). Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (pp. 142-150). Association for Computational Linguistics.