@@ -20,9 +20,9 @@ and make a link to the directory in which you installed lamtram:
20
20
21
21
Machine translation is a method for translating from a source sequence `F` with words `f_1, ..., f_J` to a target sequence `E` with words `e_1, ..., e_I`. This usually means that we translate between a sentence in a source language (e.g. Japanese) to a sentence in a target language (e.g. English). Machine translation can be used for other applications as well.
22
22
23
-
In recent years, the most prominent method is Statistical Machine Translation (SMT; Brown et al. (1992)), which builds a probabilistic model of the target sequence given the source sequence `P(E|F)`. This probabilistic model is trained using a large set of training data containing pairs of source and target sequences.
23
+
In recent years, the most prominent method is Statistical Machine Translation (SMT; Brown et al. (1993)), which builds a probabilistic model of the target sequence given the source sequence `P(E|F)`. This probabilistic model is trained using a large set of training data containing pairs of source and target sequences.
24
24
25
-
A good resource on machine translation in general, including a number of more traditional (non-Neural) methods is Koehn (2010)'s book "Statistical Machine Translation".
25
+
A good resource on machine translation in general, including a number of more traditional (non-neural) methods, is Koehn (2009)'s book "Statistical Machine Translation".
26
26
27
27
## Neural Machine Translation (NMT) and Encoder-decoder Models
28
28
@@ -144,7 +144,7 @@ Looking at the `w/s` (words per second) on the right side of the log, we can see
144
144
145
145
### Other Update Rules
146
146
147
-
In addition to the standard `SGD_UPDATE` rule listed above, there are a myriad of additional ways to update the parameters, including "SGD With Momentum", "Adagrad", "Adadelta", "RMSProp", "Adam", and many others. Explaining these in detail is beyond the scope of this tutorial, but it suffices to say that these will more quickly find a good place in parameter space than the standard method above. My current favorite optimization method is "Adam" (Kingma et al. 2012), which can be run by setting `--trainer adam`. We'll also have to change the initial learning rate to `--learning_rate 0.001`, as a learning rate of 0.1 is too big when using Adam.
147
+
In addition to the standard `SGD_UPDATE` rule listed above, there are a myriad of additional ways to update the parameters, including "SGD With Momentum", "Adagrad", "Adadelta", "RMSProp", "Adam", and many others. Explaining these in detail is beyond the scope of this tutorial, but suffice it to say that these will usually find a good place in parameter space more quickly than the standard method above. My current favorite optimization method is "Adam" (Kingma and Ba 2014), which can be run by setting `--trainer adam`. We'll also have to change the initial learning rate to `--learning_rate 0.001`, as a learning rate of 0.1 is too big when using Adam.
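To make the difference between the update rules a bit more concrete, here is a minimal sketch of one plain SGD step and one Adam step (Kingma and Ba 2014) on a list of parameters. The function names are made up for this example, and the defaults for `beta1`, `beta2`, and `eps` are the ones suggested in the Adam paper, not values read from lamtram, so this is an illustration and not a description of what `--trainer adam` does internally.

```python
import math

def sgd_step(params, grads, lr=0.1):
    """Plain SGD: move each parameter against its gradient."""
    return [p - lr * g for p, g in zip(params, grads)]

def adam_step(params, grads, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step. m and v are running averages of the gradient and
    the squared gradient; t is the 1-based step count."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first-moment estimate
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1 - beta1 ** t)             # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# On the first step, Adam moves each parameter by roughly `lr`,
# whatever the size of the gradient.
small, _, _ = adam_step([1.0], [2.0], [0.0], [0.0], t=1)
large, _, _ = adam_step([1.0], [-100.0], [0.0], [0.0], t=1)
print(small, large)
```

Because the step is normalized by the running gradient magnitude, the size of each update is governed mostly by the learning rate itself, which is one intuition for why Adam is usually run with a much smaller learning rate than plain SGD.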
148
148
149
149
Try re-running the following command:
150
150
@@ -410,9 +410,12 @@ We can see that as we increase the word penalty, this gives us more reasonably-l
410
410
411
411
## Changing Network Structure
412
412
413
-
One thing that we have not considered so far is the size of the network that we're training. Currently the default for lamtram is that all recurrent networks have 100 hidden nodes (or when using forward/backward encoders, the encoders will be 50 and decoder will be 100). In addition, we're using only a single hidden layer, while many recent systems use deeper networks with 2-4 hidden layers. These can be changed using the `--layers` option of lamtram, which defaults to "lstm:100:1", where the first option is using LSTM networks (which tend to work pretty well), the second option is the width, and third option is the depth. Let's try to train a wider network by setting `--layers lstm:200:1`:
413
+
One thing that we have not considered so far is the size of the network that we're training. Currently the default for lamtram is that all recurrent networks have 100 hidden nodes (or when using forward/backward encoders, the encoders will be 50 and decoder will be 100). In addition, we're using only a single hidden layer, while many recent systems use deeper networks with 2-4 hidden layers. These can be changed using the `--layers` option of lamtram, which defaults to "lstm:100:1", where the first option is using LSTM networks (which tend to work pretty well), the second option is the width, and third option is the depth. Let's try to train a wider network by setting `--layers lstm:200:1`.
414
+
415
+
One thing to note is that the cnn toolkit has a default limit of using 512MB of memory, but once we start using larger networks this might not be sufficient. So we'll also increase the amount of memory to 1024MB by adding the `--cnn_mem 1024` parameter.
414
416
415
417
lamtram/src/lamtram/lamtram-train \
418
+
--cnn_mem 1024 \
416
419
--model_type encatt \
417
420
--train_src data/train.unk.ja \
418
421
--train_trg data/train.unk.en \
@@ -427,40 +430,116 @@ One thing that we have not considered so far is the size of the network that we'
427
430
--epochs 10 \
428
431
--model_out models/encatt-unk-stop-lex-w200.mod
429
432
430
-
Note that this makes training significantly slower, because we need to do twice as many calculations in many of our matrix multiplications.
433
+
Note that this makes training significantly slower, because we need to do twice as many calculations in many of our matrix multiplications. When tested, the model with 200 nodes reduces perplexity from 37 to 33, and improves BLEU from 10.00 to 10.21. When using larger training data we'll get even bigger improvements by making the network bigger.
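As a rough illustration of where the extra cost comes from, the sketch below counts multiply-adds for the matrix products in one step of a standard LSTM layer. The helper name and the fixed input size of 100 are assumptions made for this example; lamtram's actual layer sizes and implementation details may differ.

```python
def lstm_multiply_adds(input_size, hidden_size):
    """Multiply-adds per time step for a standard LSTM layer, ignoring biases.
    There are four gate-like blocks (input, forget, output, candidate), each
    with an input-to-hidden and a hidden-to-hidden matrix."""
    input_to_hidden = 4 * hidden_size * input_size
    hidden_to_hidden = 4 * hidden_size * hidden_size
    return input_to_hidden, hidden_to_hidden

for width in (100, 200):
    ih, hh = lstm_multiply_adds(input_size=100, hidden_size=width)
    print(width, ih, hh, ih + hh)
```

Under these assumptions, doubling the width doubles the input-to-hidden products and quadruples the hidden-to-hidden ones.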
431
434
432
435
## Ensembling
433
436
437
+
One final technique that is useful for improving final results is "ensembling," or combining multiple models together. The way this works is that if we have two probability distributions `pe_i^{(1)}` and `pe_i^{(2)}` from multiple models, we can calculate the next probability by linearly interpolating them together:
Performing ensembling at test time in lamtram is simple: in `--models_in`, we simply add two different model options separated by a pipe, as follows. The default is linear interpolation, but you can also try log-linear interpolation by setting `--ensemble_op logsum`. Let's try ensembling our 100-node and 200-node models to measure perplexity:
This reduced the perplexity from 36/33 to 30 for the ensembled model, and resulted in a BLEU score of 10.99. Of course, we can probably improve this by ensembling even more models together. It's actually OK to just train several models of the same structure with different random seeds (if you set the `--seed` parameter of lamtram you can set a different seed, or by default a different one will be chosen randomly every time).
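As a minimal sketch of the two interpolation schemes mentioned above, the functions below combine next-word distributions from several models, either linearly or log-linearly. The function names and the equal weighting of the models are assumptions for this example, and the exact computation behind `--ensemble_op logsum` in lamtram may differ.

```python
import math

def ensemble_linear(dists):
    """Average the probabilities that each model assigns to every word."""
    n = len(dists)
    return [sum(col) / n for col in zip(*dists)]

def ensemble_loglinear(dists):
    """Average the log probabilities, then renormalize so they sum to one.
    Assumes every probability is strictly positive."""
    n = len(dists)
    scores = [math.exp(sum(math.log(p) for p in col) / n)
              for col in zip(*dists)]
    z = sum(scores)
    return [s / z for s in scores]

# Two toy distributions over a three-word vocabulary.
p1 = [0.7, 0.2, 0.1]
p2 = [0.5, 0.4, 0.1]
print(ensemble_linear([p1, p2]))
print(ensemble_loglinear([p1, p2]))
```

Linear interpolation keeps a word likely if any single model likes it, while log-linear interpolation favors words that all of the models agree on.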
454
+
455
+
## Final Output
456
+
457
+
Because we're basically done, I'll also list a few examples from the start of the test corpus, where the first line is the input, the second line is the correct translation, and the third line is the generated translation.
458
+
459
+
君 は 1 日 で それ が でき ま す か 。
460
+
can you do it in one day ?
461
+
you can do it on a day ?
462
+
463
+
皮肉 な 笑い を 浮かべ て 彼 は 私 を 見つめ た 。
464
+
he stared at me with a satirical smile .
465
+
he stared at the irony of irony .
466
+
467
+
私 たち の 出発 の 時間 が 差し迫 っ て い る 。
468
+
it 's time to leave .
469
+
our start of our start is we .
470
+
471
+
あなた は 午後 何 を し た い で す か 。
472
+
what do you want to do in the afternoon ?
473
+
what did you do for you this afternoon ?
474
+
475
+
Far from perfect, but actually pretty good considering that we only have 10,000 sentences of training data, and that Japanese-English is a pretty difficult language pair to translate!
476
+
434
477
## More Advanced (but very useful!) Methods
435
478
436
-
### Dropout
479
+
The following are a few extra methods that can be pretty useful in some cases, but that I won't be testing here:
480
+
481
+
### Regularization
437
482
438
-
TODO: Dropout
483
+
As mentioned before, when dealing with small data we need to worry about overfitting, and some ways to fix this are early stopping and learning rate decay. In addition, we can also reduce the damage of overfitting by adding some variety of regularization.
484
+
485
+
One common way of regularizing neural networks is "dropout" (Srivastava et al. 2014) which consists of randomly disabling a set fraction of the units in the input network. This dropout rate can be set with the `--dropout RATE` option. Usually we use a rate of 0.5, which has nice theoretical properties. I tried this on this data set, and it reduced perplexity from 33 to 30 for the 200 node model, but didn't have a large effect on BLEU scores.
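A minimal sketch of the idea is below. It uses the common "inverted dropout" variant, where the surviving units are scaled up during training so that nothing needs to change at test time. The function name and this scaling choice are assumptions for illustration; what `--dropout RATE` does inside lamtram may differ in detail.

```python
import random

def dropout(values, rate, rng, train=True):
    """Zero each unit with probability `rate` during training, and scale
    the kept units by 1 / (1 - rate) so the expected value is unchanged."""
    if not train or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
hidden = [1.0] * 10
print(dropout(hidden, 0.5, rng))               # a mix of 0.0 and 2.0
print(dropout(hidden, 0.5, rng, train=False))  # unchanged at test time
```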
486
+
487
+
Another way to do this is using L2 regularization, which puts a penalty on the L2 norm of the parameter vectors in the model. This can be applied by adding `--cnn_l2 RATE` to the beginning of the option list. I've personally had little luck with getting this to work for neural networks, but it might be worth trying.
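As a small sketch of what the penalty does, the functions below compute an L2 penalty and fold its gradient into an SGD step. The names, the factor of one half in the penalty, and the way `rate` enters the update are conventions chosen for this example, not necessarily how `--cnn_l2 RATE` is scaled in practice.

```python
def l2_penalty(params, rate):
    """Penalty added to the loss: (rate / 2) * squared L2 norm."""
    return 0.5 * rate * sum(p * p for p in params)

def sgd_step_with_l2(params, grads, lr, rate):
    """SGD step on loss + penalty. The gradient of the penalty is
    rate * p, which shrinks every parameter toward zero."""
    return [p - lr * (g + rate * p) for p, g in zip(params, grads)]

print(l2_penalty([3.0, 4.0], rate=0.1))
print(sgd_step_with_l2([1.0], [0.0], lr=0.1, rate=0.5))
```

Even when the gradient of the likelihood is zero, the parameters still decay slightly at every step, which is why this is sometimes called weight decay.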
439
488
440
489
### Using Subword Units
441
490
442
-
TODO: BPE, other methods
491
+
One problem with neural network models is that as the vocabulary gets larger, training time increases, so it's often necessary to replace many of the words in the vocabulary with `<unk>` to ensure that training times remain reasonable. There are a number of ways that have been proposed to handle the problem of large vocabularies. One simple way to do so without sacrificing accuracy on low-frequency words (too much) is by splitting rare words into subword units. A method to do so by Sennrich et al. (2016) discovers good subword units using a method called "byte pair encoding", and is implemented in the [subword-nmt](http://github.com/rsennrich/subword-nmt) package. You can use this as an additional pre-processing/post-processing step before learning and using a model with lamtram.
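To give a feel for how byte pair encoding finds subword units, here is a minimal sketch of the merge-learning loop in the spirit of Sennrich et al. (2016). It is a simplified illustration written for this tutorial, not the subword-nmt implementation; the function name, the `</w>` end-of-word marker handling, and the tie-breaking rule are assumptions of this sketch.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dictionary."""
    # Start with every word split into characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties go to the pair that was seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the pair with one merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
print(merges)
print(vocab)
```

On this toy vocabulary the frequent suffix "est" is merged into a single unit, so a rare word ending in "est" can still be represented with known symbols instead of `<unk>`.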
443
492
444
493
### Training for Other Evaluation Measures
445
494
446
-
TODO: minrisk
495
+
Finally, you may have noticed throughout this tutorial that we are training models to maximize the likelihood, but evaluating our models using BLEU score. There are a number of methods to resolve this mismatch between the training and testing criteria by directly optimizing NMT systems to improve translation accuracy. In lamtram, a method by Shen et al. (2016) can be used to optimize NMT systems for expected BLEU score (or in other words, minimize the risk). In particular, I've found that this does a good job of at least ensuring that the NMT system generates output that is of the appropriate length.
496
+
497
+
There are a number of settings that should be changed when using the method:
498
+
499
+
* `--learning_criterion minrisk`: This will enable minimum-risk based training.
500
+
* `--model_in FILE`: Because this method is slow to train, it's better to first initialize the model using standard maximum likelihood training, then fine-tune the model with BLEU-based training. This option can be used to read in an already-trained model.
501
+
* `--minrisk_num_samples NUM`: This method works by generating samples from the model, then evaluating these generated samples. Increasing NUM improves the stability of the training, but also reduces the training efficiency. A value of 20-100 should be reasonable.
502
+
* `--minrisk_scaling`, `--minrisk_dedup`: Parameters of the algorithm including the scaling factors for probabilities, and whether to include the correct answer in the samples or not.
503
+
* `--trainer sgd --learning_rate 0.05`: I've found that using more advanced optimizers like Adam actually reduces stability in training, so using vanilla SGD might be a safer choice. Slightly lowering the learning rate is also sometimes necessary.
504
+
* `--eval_every 1000`: Training is a bit slower than standard NMT training, so we can evaluate more frequently than when we finish the whole corpus.
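To make the training criterion itself more concrete, here is a minimal sketch of the expected risk over a set of sampled translations, roughly following the formulation of Shen et al. (2016). The function name, the `alpha` scaling parameter, and the use of "1 - sentence BLEU" as the cost are assumptions for this example; how they correspond to lamtram's `--minrisk_scaling` and related options is not something this sketch claims to reproduce.

```python
import math

def expected_risk(samples, alpha=1.0):
    """samples: list of (log_prob, cost) pairs, one per sampled translation,
    where cost could be e.g. 1 - sentence-level BLEU against the reference.
    The model probabilities are sharpened or flattened by alpha and
    renormalized over the samples before taking the expectation."""
    scaled = [alpha * logp for logp, _ in samples]
    top = max(scaled)                            # for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    z = sum(weights)
    return sum(w / z * cost for w, (_, cost) in zip(weights, samples))

# Two samples: a likely one with no errors and an unlikely one that is all wrong.
samples = [(math.log(0.75), 0.0), (math.log(0.25), 1.0)]
print(expected_risk(samples, alpha=1.0))
print(expected_risk(samples, alpha=0.0))  # alpha = 0 weights all samples equally
```

Training then adjusts the parameters to lower this quantity, which shifts probability mass toward the samples with lower cost.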
Up until now, you have just been working with the small data set of 10,000 sentences that I've provided. Having about 10,000 sentences makes training relatively fast, but having more data will make accuracy significantly higher. Fortunately, there is a larger data set of about 140,000 sentences called `train-big.ja` and `train-big.en`, which you can download by running the following commands.
Try re-running experiments with this larger data set, and you will see that the accuracy gets significantly higher. In real NMT systems, it's common to use several million sentences (or more!) to achieve usable accuracies. Sometimes in these cases, you'll want to evaluate the accuracy of your system more frequently than when you reach the end of the corpus, so try specifying the `--eval_every NUM_SENTENCES` option, where `NUM_SENTENCES` is the number of sentences after which you'd like to evaluate on the data set. Also, it's highly recommended that you use a GPU for training when scaling to larger data and networks.
458
537
459
538
### Preprocessing
460
539
461
540
Also note that up until now, we've taken it for granted that our data is split into words and lower-cased. When you build an actual system, this will not be the case, so you'll have to perform these processes yourself. Here, for tokenization we're using:
462
541
463
-
* English: [Moses](http://) (Koehn et al. 2008)
542
+
* English: [Moses](http://statmt.org/moses) (Koehn et al. 2007)
464
543
* Japanese: [KyTea](http://phontron.com/kytea/) (Neubig et al. 2011)
465
544
466
545
And for lowercasing we're using:
@@ -471,18 +550,23 @@ Make sure that you do tokenization, and potentially lowercasing, before feeding
471
550
472
551
## Final Word
473
552
474
-
Now, you know a few practical things about making an accurate neural MT system. This is a very fast-moving field, so this guide might be obsolete in a few months from the writing (or even already!) but hopefully this helped you learn the basics to get started, start reading papers, and come up with your own methods/applications.
553
+
Now, you know a few practical things about making an accurate neural MT system. Using the methods described here, we were able to improve a system trained on only 10,000 sentences from 1.83 BLEU to 10.99 BLEU. Switching over to larger data should result in much larger increases, and may even result in readable translations.
554
+
555
+
This is a very fast-moving field, so this guide might be obsolete within a few months of being written (or even already!), but hopefully it has helped you learn the basics to get started, start reading papers, and come up with your own methods/applications.
475
556
476
557
## References
477
558
478
-
* Brown et al. 1992
479
-
* Koehn et al. 2008
480
-
* Koehn 2010
481
-
* Kingma et al. 2012
482
-
* Neubig et al. 2011
483
-
* Kalchbrenner & Blunsom 2013
484
-
* Sutskever et al. 2014
485
-
* Bahdanau et al. 2015
486
-
* Luong et al. 2015
487
-
* Goldberg 2015
488
-
* Arthur et al. 2016
559
+
* Philip Arthur, Graham Neubig, Satoshi Nakamura. Incorporating Discrete Translation Lexicons into Neural Machine Translation. ArXiv, 2016.
560
+
* Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR, 2015.
561
+
* Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, Robert L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 1993.
562
+
* Yoav Goldberg. A primer on neural network models for natural language processing. ArXiv, 2015.
* Diederik Kingma, Jimmy Ba. Adam: A method for stochastic optimization. ArXiv, 2014.
565
+
* Philipp Koehn et al. Moses: Open source toolkit for statistical machine translation. ACL, 2007.
566
+
* Philipp Koehn. Statistical machine translation. Cambridge University Press, 2009.
567
+
* Minh-Thang Luong, Hieu Pham, Christopher D. Manning. Effective approaches to attention-based neural machine translation. EMNLP, 2015.
568
+
* Graham Neubig, Yosuke Nakata, Shinsuke Mori. Pointwise prediction for robust, adaptable Japanese morphological analysis. ACL, 2011.
569
+
* Rico Sennrich, Barry Haddow, Alexandra Birch. Neural machine translation of rare words with subword units. ACL, 2016.
570
+
* Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.
571
+
* Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, Yang Liu. Minimum risk training for neural machine translation. ACL, 2016.
572
+
* Ilya Sutskever, Oriol Vinyals, Quoc V. Le. Sequence to sequence learning with neural networks. NIPS, 2014.