Commit d7ce691: Finished writing, added references, etc.
Parent: 199b59b

README.md: 108 additions, 24 deletions
@@ -20,9 +20,9 @@ and make a link to the directory in which you installed lamtram:

Machine translation is a method for translating from a source sequence `F` with words `f_1, ..., f_J` to a target sequence `E` with words `e_1, ..., e_I`. This usually means that we translate from a sentence in a source language (e.g. Japanese) to a sentence in a target language (e.g. English), although machine translation can be used for other applications as well.

In recent years, the most prominent method has been Statistical Machine Translation (SMT; Brown et al. (1993)), which builds a probabilistic model of the target sequence given the source sequence, `P(E|F)`. This probabilistic model is trained using a large set of training data containing pairs of source and target sequences.

A good resource on machine translation in general, including a number of more traditional (non-neural) methods, is Koehn (2009)'s book "Statistical Machine Translation".

## Neural Machine Translation (NMT) and Encoder-decoder Models

@@ -144,7 +144,7 @@ Looking at the `w/s` (words per second) on the right side of the log, we can see

### Other Update Rules

In addition to the standard `SGD_UPDATE` rule listed above, there are a myriad of additional ways to update the parameters, including "SGD with Momentum", "Adagrad", "Adadelta", "RMSProp", "Adam", and many others. Explaining these in detail is beyond the scope of this tutorial, but suffice it to say that they will usually find a good place in parameter space more quickly than the standard method above. My current favorite optimization method is "Adam" (Kingma et al. 2014), which can be run by setting `--trainer adam`. We'll also have to change the initial learning rate to `--learning_rate 0.001`, as a learning rate of 0.1 is too big when using Adam.

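
For some intuition about what an update rule like Adam is doing, here is a minimal NumPy sketch of the Adam update as described in the Kingma et al. paper. This is purely illustrative (it is not lamtram's or cnn's actual implementation), and the hyperparameter defaults shown are the ones suggested in the paper:

    import numpy as np

    def adam_update(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
        """One Adam step: momentum-style first moment plus per-parameter scaling."""
        m = b1 * m + (1 - b1) * grad          # running mean of the gradients
        v = b2 * v + (1 - b2) * grad ** 2     # running mean of the squared gradients
        m_hat = m / (1 - b1 ** t)             # bias correction for the first steps
        v_hat = v / (1 - b2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v
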
Try re-running the following command:

@@ -410,9 +410,12 @@ We can see that as we increase the word penalty, this gives us more reasonably-l

## Changing Network Structure

One thing that we have not considered so far is the size of the network that we're training. Currently the default for lamtram is that all recurrent networks have 100 hidden nodes (or, when using forward/backward encoders, the encoders will be 50 and the decoder will be 100). In addition, we're using only a single hidden layer, while many recent systems use deeper networks with 2-4 hidden layers. These can be changed using the `--layers` option of lamtram, which defaults to "lstm:100:1", where the first field says to use LSTM networks (which tend to work pretty well), the second is the width, and the third is the depth. Let's try to train a wider network by setting `--layers lstm:200:1`.

One thing to note is that the cnn toolkit has a default limit of 512MB of memory, but once we start using larger networks this might not be sufficient, so we'll also increase the amount of memory to 1024MB by adding the `--cnn_mem 1024` parameter.

    lamtram/src/lamtram/lamtram-train \
      --cnn_mem 1024 \
      --model_type encatt \
      --train_src data/train.unk.ja \
      --train_trg data/train.unk.en \
@@ -427,40 +430,116 @@ One thing that we have not considered so far is the size of the network that we'
      --epochs 10 \
      --model_out models/encatt-unk-stop-lex-w200.mod

Note that this makes training significantly slower, because we need to do twice as many calculations in many of our matrix multiplications. Testing this model, the 200-node version reduces perplexity from 37 to 33 and improves BLEU from 10.00 to 10.21. When using larger training data we'll get even bigger improvements from making the network bigger.

## Ensembling

One final technique that is useful for improving final results is "ensembling," or combining multiple models together. The way this works is that if we have two probability distributions `pe_i^{(1)}` and `pe_i^{(2)}` from two different models, we can calculate the probability of the next word by linearly interpolating them:

    pe_i = (pe_i^{(1)} + pe_i^{(2)}) / 2

or log-linearly interpolating them:

    pe_i = exp( (log(pe_i^{(1)}) + log(pe_i^{(2)})) / 2 )
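
To make the difference between the two operations concrete, here is a small illustrative NumPy sketch (not lamtram code) that combines two next-word distributions. Note that the log-linear combination is a geometric mean and doesn't sum to one, so it is renormalized here:

    import numpy as np

    def ensemble(p1, p2, op="sum"):
        """Combine two next-word probability distributions."""
        if op == "sum":                                # linear interpolation
            return (p1 + p2) / 2
        p = np.exp((np.log(p1) + np.log(p2)) / 2)      # log-linear interpolation
        return p / p.sum()                             # renormalize the geometric mean

    p1 = np.array([0.7, 0.2, 0.1])
    p2 = np.array([0.3, 0.6, 0.1])
    print(ensemble(p1, p2, "sum"))     # [0.5 0.4 0.1]
    print(ensemble(p1, p2, "logsum"))  # approximately [0.51 0.38 0.11]
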
Performing ensembling at test time in lamtram is simple: in `--models_in`, we simply add two different model options separated by a pipe, as follows. The default is linear interpolation, but you can also try log-linear interpolation by setting `--ensemble_op logsum`. Let's try ensembling our 100-node and 200-node models to measure perplexity:

    lamtram/src/lamtram/lamtram \
      --operation ppl \
      --models_in "encatt=models/encatt-unk-stop-lex.mod|encatt=models/encatt-unk-stop-lex-w200.mod" \
      --src_in data/test.ja \
      < data/test.en

This reduced the perplexity from 36/33 to 30 for the ensembled model, and resulted in a BLEU score of 10.99. Of course, we can probably improve this further by ensembling even more models together. It's actually fine to just train several models of the same structure with different random seeds (you can set the seed with lamtram's `--seed` parameter; by default a different one is chosen randomly every run).

## Final Output

Because we're basically done, I'll also list a few examples from the start of the test corpus, where the first line is the input, the second line is the correct translation, and the third line is the generated translation.

    君 は 1 日 で それ が でき ま す か 。
    can you do it in one day ?
    you can do it on a day ?

    皮肉 な 笑い を 浮かべ て 彼 は 私 を 見つめ た 。
    he stared at me with a satirical smile .
    he stared at the irony of irony .

    私 たち の 出発 の 時間 が 差し迫 っ て い る 。
    it &apos;s time to leave .
    our start of our start is we .

    あなた は 午後 何 を し た い で す か 。
    what do you want to do in the afternoon ?
    what did you do for you this afternoon ?

Not bad; in fact, pretty good considering that we only have 10,000 sentences of training data, and that Japanese-English is a pretty difficult language pair to translate!

## More Advanced (but very useful!) Methods

The following are a few extra methods that can be pretty useful in some cases, but that I won't be testing here:

### Regularization

As mentioned before, when dealing with small data we need to worry about overfitting, and some ways to combat this are early stopping and learning rate decay. In addition, we can also reduce the damage of overfitting by adding some variety of regularization.

One common way of regularizing neural networks is "dropout" (Srivastava et al. 2014), which consists of randomly disabling a set fraction of the units in the network during training. This dropout rate can be set with the `--dropout RATE` option. Usually we use a rate of 0.5, which has nice theoretical properties. I tried this on this data set, and it reduced perplexity from 33 to 30 for the 200-node model, but didn't have a large effect on BLEU scores.
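
For a rough idea of what dropout does conceptually, here is an illustrative NumPy sketch of the "inverted" dropout formulation; this is not lamtram's implementation, and the training-time rescaling shown here is an assumption about one common way of doing it:

    import numpy as np

    def dropout(h, rate=0.5, train=True):
        """Zero each unit with probability `rate` at training time."""
        if not train or rate == 0.0:
            return h                           # dropout is disabled at test time
        mask = (np.random.rand(*h.shape) >= rate).astype(h.dtype)
        return h * mask / (1.0 - rate)         # rescale so the expected value is unchanged

    h = np.ones(10)
    print(dropout(h))                          # about half the units zeroed, the rest scaled to 2.0
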

Another option is L2 regularization, which puts a penalty on the L2 norm of the parameter vectors in the model. This can be applied by adding `--cnn_l2 RATE` to the beginning of the option list. I've personally had little luck getting this to work for neural networks, but it might be worth trying.

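In sketch form, L2 regularization just adds a penalty proportional to the squared norm of the parameters to the training objective (this is the standard textbook formulation, not necessarily exactly how lamtram applies it):

    loss = negative_log_likelihood + RATE * ||theta||^2
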
### Using Subword Units

One problem with neural network models is that as the vocabulary gets larger, training time increases, so it's often necessary to replace many of the words in the vocabulary with `<unk>` to keep training times reasonable. A number of ways have been proposed to handle the problem of large vocabularies. One simple way to do so without sacrificing (too much) accuracy on low-frequency words is to split rare words into subword units. A method by Sennrich et al. (2016) discovers good subword units using a technique called "byte pair encoding", and is implemented in the [subword-nmt](http://github.com/rsennrich/subword-nmt) package. You can use this as an additional pre-processing/post-processing step before learning and using a model with lamtram.

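To illustrate the core idea of byte pair encoding, here is a simplified Python sketch of the learning step, which repeatedly merges the most frequent pair of adjacent symbols. This follows the description in the Sennrich et al. paper rather than the actual subword-nmt code, and the toy vocabulary is made up:

    from collections import Counter

    def learn_bpe(vocab, num_merges):
        """Greedily merge the most frequent adjacent symbol pair, `num_merges` times."""
        # `vocab` maps each word (a tuple of symbols) to its corpus frequency
        for _ in range(num_merges):
            pairs = Counter()
            for word, freq in vocab.items():
                for a, b in zip(word, word[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)            # most frequent adjacent pair
            merged_vocab = {}
            for word, freq in vocab.items():
                out, i = [], 0
                while i < len(word):
                    if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                        out.append(word[i] + word[i + 1])   # fuse the pair into one symbol
                        i += 2
                    else:
                        out.append(word[i])
                        i += 1
                merged_vocab[tuple(out)] = freq
            vocab = merged_vocab
        return vocab

    # toy word-frequency vocabulary; "</w>" marks the end of a word
    vocab = {("l", "o", "w", "</w>"): 5,
             ("l", "o", "w", "e", "r", "</w>"): 2,
             ("n", "e", "w", "e", "r", "</w>"): 6,
             ("w", "i", "d", "e", "r", "</w>"): 3}
    print(learn_bpe(vocab, num_merges=10))
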
### Training for Other Evaluation Measures

Finally, you may have noticed throughout this tutorial that we are training models to maximize likelihood, but evaluating them using BLEU score. There are a number of methods that resolve this mismatch between the training and testing criteria by directly optimizing NMT systems to improve translation accuracy. In lamtram, a method by Shen et al. (2016) can be used to optimize NMT systems for expected BLEU score (or, in other words, to minimize risk). In particular, I've found that this does a good job of at least ensuring that the NMT system generates output of an appropriate length.

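As a rough sketch of the objective (my simplified notation, so the details may not exactly match what lamtram computes): for each source sentence `F` with reference `E`, we draw a set of sample translations `E'` from the model and minimize the expected loss

    risk(F) = sum_{E' in samples} Q(E'|F) * (1 - BLEU(E', E))

where `Q` is the model probability renormalized over the sample set (and sharpened by the scaling factor mentioned below).
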
There are a number of settings that should be changed when using this method:

* `--learning_criterion minrisk`: This will enable minimum-risk based training.
* `--model_in FILE`: Because this method is slow to train, it's better to first initialize the model using standard maximum likelihood training, then fine-tune the model with BLEU-based training. This option can be used to read in an already-trained model.
* `--minrisk_num_samples NUM`: This method works by generating samples from the model, then evaluating these generated samples. Increasing NUM improves the stability of training, but also reduces training efficiency. A value of 20-100 should be reasonable.
* `--minrisk_scaling`, `--minrisk_dedup`: Parameters of the algorithm, including the scaling factor for the probabilities and whether to include the correct answer in the samples or not.
* `--trainer sgd --learning_rate 0.05`: I've found that using more advanced optimizers like Adam actually reduces stability in training, so using vanilla SGD might be a safer choice. Slightly lowering the learning rate is also sometimes necessary.
* `--eval_every 1000`: Training is a bit slower than standard NMT training, so we may want to evaluate on the development set more frequently than once per pass over the whole corpus.

The final command will look like this:

    lamtram/src/lamtram/lamtram-train \
      --cnn_mem 1024 \
      --model_type encatt \
      --train_src data/train.unk.ja \
      --train_trg data/train.unk.en \
      --dev_src data/dev.ja \
      --dev_trg data/dev.en \
      --trainer sgd \
      --learning_criterion minrisk \
      --learning_rate 0.05 \
      --minrisk_num_samples 20 \
      --minrisk_scaling 0.005 \
      --minrisk_include_ref true \
      --rate_decay 1.0 \
      --epochs 10 \
      --eval_every 1000 \
      --model_in models/encatt-unk-stop-lex-w200.mod \
      --model_out models/encatt-unk-stop-lex-w200-minrisk.mod

## Preparing Data

### Data Size

Up until now, you have just been working with the small data set of 10,000 sentences that I've provided. Having about 10,000 sentences makes training relatively fast, but having more data will make accuracy significantly higher. Fortunately, there is a larger data set of about 140,000 sentences called `train-big.ja` and `train-big.en`, which you can download by running the following commands.

    wget http://phontron.com/lamtram/download/data-big.tar.gz
    tar -xzf data-big.tar.gz


Try re-running experiments with this larger data set, and you will see that the accuracy gets significantly higher. In real NMT systems, it's common to use several million sentences (or more!) to achieve usable accuracies. In these cases, you'll sometimes want to evaluate the accuracy of your system more frequently than once per pass through the corpus, so try specifying the `--eval_every NUM_SENTENCES` option, where `NUM_SENTENCES` is the number of training sentences after which you'd like to evaluate on the development set. Also, it's highly recommended that you use a GPU for training when scaling to larger data and networks.

### Preprocessing

Also note that up until now, we've taken it for granted that our data is split into words and lowercased. When you build an actual system, this will not be the case, so you'll have to perform these processes yourself. Here, for tokenization we're using:

* English: [Moses](http://statmt.org/moses) (Koehn et al. 2007)
* Japanese: [KyTea](http://phontron.com/kytea/) (Neubig et al. 2011)

And for lowercasing we're using:
@@ -471,18 +550,23 @@ Make sure that you do tokenization, and potentially lowercasing, before feeding

## Final Word

Now you know a few practical things about making an accurate neural MT system. Using the methods described here, we were able to improve a system trained on only 10,000 sentences from 1.83 BLEU to 10.99 BLEU. Switching over to larger data should yield much larger improvements, and may even result in readable translations.

This is a very fast-moving field, so this guide might be obsolete a few months from the time of writing (or even already!), but hopefully it has helped you learn the basics you need to get started, start reading papers, and come up with your own methods and applications.

## References

* Philip Arthur, Graham Neubig, Satoshi Nakamura. Incorporating Discrete Translation Lexicons into Neural Machine Translation. ArXiv, 2016.
* Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR, 2015.
* Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, Robert L. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 1993.
* Yoav Goldberg. A Primer on Neural Network Models for Natural Language Processing. ArXiv, 2015.
* Nal Kalchbrenner, Phil Blunsom. Recurrent Continuous Translation Models. EMNLP, 2013.
* Diederik Kingma, Jimmy Ba. Adam: A Method for Stochastic Optimization. ArXiv, 2014.
* Philipp Koehn et al. Moses: Open Source Toolkit for Statistical Machine Translation. ACL, 2007.
* Philipp Koehn. Statistical Machine Translation. Cambridge University Press, 2009.
* Minh-Thang Luong, Hieu Pham, Christopher D. Manning. Effective Approaches to Attention-based Neural Machine Translation. EMNLP, 2015.
* Graham Neubig, Yosuke Nakata, Shinsuke Mori. Pointwise Prediction for Robust, Adaptable Japanese Morphological Analysis. ACL, 2011.
* Rico Sennrich, Barry Haddow, Alexandra Birch. Neural Machine Translation of Rare Words with Subword Units. ACL, 2016.
* Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, Yang Liu. Minimum Risk Training for Neural Machine Translation. ACL, 2016.
* Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 2014.
* Ilya Sutskever, Oriol Vinyals, Quoc V. Le. Sequence to Sequence Learning with Neural Networks. NIPS, 2014.
