
Conversation

@Smerity (Contributor) commented Aug 7, 2015

This PR contains two bugfixes (missing vocabulary item in QA19 + issues with running on Python 3) and additional documentation (results and comparison to the Facebook LSTM baseline when run over all 20 tasks).

Smerity added 3 commits August 6, 2015 12:58

  • For all other questions, the full vocab is in the stories and the queries
  • + reduce function disappeared (requires import from functools)
  • + tarfiles and encodings - decoding bytes to ASCII at line level
@fchollet (Collaborator) commented Aug 7, 2015

Why do you think this setup significantly outperforms the FB baseline? Have you tried using LSTM instead of GRU? (I guess you have?). It would be a more straightforward comparison to the FB baseline. Have you tried IRNN?

fchollet added a commit that referenced this pull request Aug 7, 2015
Fixes and full results for bAbi RNN example
@fchollet merged commit 1a572b1 into keras-team:master Aug 7, 2015
@Smerity (Contributor, Author) commented Aug 7, 2015

Unfortunately the Facebook paper is light on details for their LSTM baseline. They don't specify the size of the internal state or the exact setup of the network itself. From the minimal facts included, my best guess is that they fed the entire input, including the query, into a single RNN and then ran a dense layer on the RNN output.
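If that reading is right, their baseline would be something like the following rough sketch (the embedding, layer sizes, and optimizer here are my assumptions, not anything stated in the paper):

```python
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.core import Dense
from keras.layers import recurrent

vocab_size, EMBED, HIDDEN = 22, 50, 50  # placeholder sizes

# Story and query concatenated into a single token sequence, read by one RNN
model = Sequential()
model.add(Embedding(vocab_size, EMBED))
model.add(recurrent.LSTM(EMBED, HIDDEN, return_sequences=False))
model.add(Dense(HIDDEN, vocab_size, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')
```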

The only other pure RNN implementation that I found for this task was from Answering Reading Comprehension Using Memory Networks, done as part of the CS224d course. They only report results for QA1-3, however, and have some odd numbers: QA1-3 are the same style of task, each more challenging than the last, yet their accuracy goes QA1, QA2, QA3 = [31.2, 35.6, 27.1] - it's counter-intuitive for accuracy to improve on a harder task.

In our example, we have two RNNs - one that encodes the story, the other that encodes the query - and then use a dense layer to compute the answer from those outputs. I did this partially because it felt natural to me, partially because I wanted to use the Merge API, and finally because I feel it's a fairer baseline. Memory networks and other more advanced architectures have both an input component and a query component, which could substantially reduce the complexity of the task, especially if the number of hidden units is kept equivalently sized.
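For concreteness, the rough shape of our setup (the sizes here are placeholders rather than the exact values in babi_rnn.py):

```python
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.core import Dense, Merge
from keras.layers import recurrent

RNN = recurrent.GRU  # hot-swappable with recurrent.LSTM, recurrent.SimpleRNN, ...
vocab_size, EMBED, HIDDEN = 22, 50, 100  # placeholder sizes

# One RNN encodes the story...
sentrnn = Sequential()
sentrnn.add(Embedding(vocab_size, EMBED))
sentrnn.add(RNN(EMBED, HIDDEN, return_sequences=False))

# ...a second RNN encodes the query...
qrnn = Sequential()
qrnn.add(Embedding(vocab_size, EMBED))
qrnn.add(RNN(EMBED, HIDDEN, return_sequences=False))

# ...and a dense softmax over the concatenated encodings picks the answer word
model = Sequential()
model.add(Merge([sentrnn, qrnn], mode='concat'))
model.add(Dense(HIDDEN + HIDDEN, vocab_size, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')
```

Training then feeds both inputs at once, e.g. model.fit([X, Xq], Y, ...).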

There are also interesting issues with the dataset (duplication of some of the (story, query, answer) tuples) that I've discovered, which might have had an impact on all the systems that have used this data. I've contacted the maintainers of the dataset with my findings, so we'll see what comes of that.

As far as GRU vs LSTM goes, I wrote the code to make it easy and generic to swap RNNs out, then didn't think about it much more. GRU runs faster than LSTM for me (sadly I only have CPUs) and I didn't see any performance difference on the tasks where I tested both LSTM and GRU. I agree with you completely re: making it an LSTM baseline to be more comparable. I've replaced the GRU with RNN and am running the experiments for all tasks now and can report back the findings. I've not tested the IRNN and can get those results too.

@fchollet (Collaborator) commented Aug 7, 2015

> I've not tested the IRNN and can get those results too.

Should be much faster still than GRU. IRNN's advantage is that it is very lightweight. But it is yet to be seen if it performs any better than a simple RNN (with orthogonal initialization and hard_sigmoid activation, the Keras defaults). Open question...

> The only other pure RNN implementation that I found for this task was from Answering Reading Comprehension Using Memory Networks done as part of the CS224d course.

I'd be curious to see if they used Keras for their experiments, or if there is any code out there associated with the paper. Doesn't seem to be mentioned in the paper...

> I agree with you completely re: making it an LSTM baseline to be more comparable. I've replaced the GRU with RNN and am running the experiments for all tasks now and can report back the findings.

Great. I think even with identical architecture it would make sense to see Keras outperform a past LSTM-based experiment, because the Keras defaults incorporate recent advances that may not have been available at the time (orthogonal initialization of weights and identity initialization of the forget gate bias).

@Smerity (Contributor, Author) commented Aug 8, 2015

With the major proviso that each of these was run with zero hyperparameter optimization (each task/learner was run for 20 epochs), here are some initial results.

I'm trying to work out the best way to do hyperparameter optimization. The issue is that, to be comparable with the Facebook LSTM baseline, the training set size is only 1,000. Splitting that out to have a reasonable validation set whilst not compromising the raw number of samples to train on is problematic. With a validation set of only 0.05 (50 samples), log loss and accuracy numbers jump around wildly. Hyperparameter optimization would then need to be done for each task and each learner.

The plan would be to add a callback, similar to EarlyStopping, where we'd train on 0.8 * |samples| and validate with 0.2 * |samples|, then use the same epoch number to train over the full dataset and test. It appears that EarlyStopping doesn't keep track of which epoch it stopped at - do you mind if I add in that counter? I'd assume this would be a common task to use the EarlyStopping callback for.
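Roughly what I have in mind, with a small hypothetical callback standing in for that counter (the names below are placeholders; X, Xq, Y are the vectorised stories, queries, and answers):

```python
from keras.callbacks import Callback, EarlyStopping

class BestEpoch(Callback):
    """Remember the epoch with the lowest validation loss."""
    def on_train_begin(self, logs={}):
        self.best_epoch, self.best_loss = 0, float('inf')

    def on_epoch_end(self, epoch, logs={}):
        if logs.get('val_loss', float('inf')) < self.best_loss:
            self.best_epoch, self.best_loss = epoch, logs['val_loss']

# Pass 1: train on 0.8 * |samples|, validate on the remaining 0.2
best = BestEpoch()
model.fit([X, Xq], Y, batch_size=32, nb_epoch=200, validation_split=0.2,
          callbacks=[best, EarlyStopping(patience=5)])

# Pass 2: after re-building and re-compiling the model, train on the full
# training set for the epoch count found above, leaving the test set untouched
# until final evaluation
model.fit([X, Xq], Y, batch_size=32, nb_epoch=best.best_epoch + 1)
```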

Note - I think I set up the IRNN correctly, though I'm not 100% certain. I based it on the code from mnist_irnn.py, enabling it to be hot-swapped in like LSTM, GRU, and SimpleRNN.
(now that I think about it, I'd just refactor to *args and **kwargs, but that's more code cleanup)

```python
from keras.layers import recurrent
from keras.initializations import normal, identity

# IRNN setup: ReLU SimpleRNN with small-normal input weights and identity recurrent weights
RNN = lambda ind, outd, return_sequences: \
    recurrent.SimpleRNN(input_dim=ind, output_dim=outd,
                        init=lambda shape: normal(shape, scale=0.001),
                        inner_init=lambda shape: identity(shape, scale=1.0),
                        activation='relu',
                        return_sequences=return_sequences)
```
| Task Number | FB-LSTM | LSTM | GRU | IRNN | RNN |
| --- | --- | --- | --- | --- | --- |
| QA1 - Single Supporting Fact | 50 | 51.2 | 52.1 | 47.7 | 52.7 |
| QA2 - Two Supporting Facts | 20 | 21.8 | 37.0 | 19.7 | 27.8 |
| QA3 - Three Supporting Facts | 20 | 20.1 | 20.5 | 21.3 | 22.4 |
| QA4 - Two Arg. Relations | 61 | 56.2 | 62.9 | 69.0 | 20.0 |
| QA5 - Three Arg. Relations | 70 | 46.8 | 61.9 | 32.7 | 38.8 |
| QA6 - Yes/No Questions | 48 | 49.1 | 50.7 | 49.3 | 44.8 |
| QA7 - Counting | 49 | 76.1 | 78.9 | 75.4 | 63.2 |
| QA8 - Lists/Sets | 45 | 72.1 | 77.2 | 73.7 | 41.0 |
| QA9 - Simple Negation | 64 | 63.5 | 64.0 | 58.6 | 63.8 |
| QA10 - Indefinite Knowledge | 44 | 47.6 | 47.7 | 47.7 | 42.8 |
| QA11 - Basic Coreference | 72 | 71.9 | 74.9 | 74.0 | 75.1 |
| QA12 - Conjunction | 74 | 73.2 | 76.4 | 71.0 | 77.2 |
| QA13 - Compound Coreference | 94 | 94.0 | 94.4 | 94.0 | 94.4 |
| QA14 - Time Reasoning | 27 | 23.7 | 34.8 | 30.5 | 19.9 |
| QA15 - Basic Deduction | 21 | 21.7 | 32.4 | 54.0 | 23.9 |
| QA16 - Basic Induction | 23 | 44.4 | 50.6 | 49.4 | 41.8 |
| QA17 - Positional Reasoning | 51 | 52.1 | 49.1 | 48.9 | 52.4 |
| QA18 - Size Reasoning | 52 | 91.0 | 90.8 | 58.4 | 54.8 |
| QA19 - Path Finding | 8 | 9.5 | 9.0 | 11.5 | 7.1 |
| QA20 - Agent's Motivations | 91 | 93.5 | 90.7 | 97.6 | 92.2 |

The results are a little all over the place. The plain RNN actually does fairly well but occasionally just falls apart on tasks that the IRNN has less of an issue with (QA4, QA8, QA14). Occasionally the plain RNN or IRNN does better than the LSTM or GRU, which suggests the latter haven't been trained for enough epochs.

I'll add in code for hyperparameter choice, re-run the experiments, and hopefully get a clearer and more consistent picture of performance differences.

@fchollet (Collaborator) commented Aug 8, 2015

It is very difficult to make sense of these results... I suspect there might be issues with the dataset that would cause results to be semi-random. Maybe just the very small size of the dataset (especially the test dataset), coupled with the quantized metric (accuracy). An easy way to check this would be to look at the variance between different runs with different random initializations. If the same model/code (minus the random seed) produces very different results each time, then the setup is not reliable. If the results are consistent, then something different is going on.
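An example of what that check could look like (a hedged sketch: build_model() is a stand-in for whatever constructs and compiles the bAbI model, and X, Xq, Y / tX, tXq, tY are the vectorised train / test data):

```python
import numpy as np

accs = []
for seed in range(5):
    np.random.seed(seed)      # only the random seed changes between runs
    model = build_model()     # hypothetical helper: rebuilds and compiles the model
    model.fit([X, Xq], Y, batch_size=32, nb_epoch=20, verbose=0)
    loss, acc = model.evaluate([tX, tXq], tY, show_accuracy=True, verbose=0)
    accs.append(acc)

print('accuracy mean %.3f / std %.3f over %d runs' % (np.mean(accs), np.std(accs), len(accs)))
```

If the standard deviation is of the same order as the differences in the table above, the comparison isn't telling us much.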

  • One thought about the training / validation process: it would be better not to use a validation split (which is too small to be statistically significant anyway), but instead to use the test data as validation_data, with an EarlyStopping callback (with a reasonable patience value) to stop when loss stops improving on the test data, and to record the best accuracy on the test data. This should lead to more reliable scores (closer to optimal) and the ability to really train until convergence.
  • My initial tests with the above setup indicate that the model was underfitting in most cases. For instance, I am now getting 0.3060 with a SimpleRNN on QA2.
  • There is no use of Dropout. Given the very small size of the dataset, I think Dropout will be indispensable to prevent overfitting. This is especially true for "power" models like LSTM, which normally require huge datasets to be fully utilized.
  • The combination of Dropout to prevent overfitting, and training until convergence, should yield more stable and much better results; a rough sketch follows this list.
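Concretely, something like this (a rough sketch reusing the example's two branches sentrnn / qrnn and its vectorised data; the dropout rate and patience value are placeholders):

```python
from keras.callbacks import EarlyStopping
from keras.layers.core import Dropout

# While building each branch (i.e. before constructing the Merge), drop out the RNN encodings
sentrnn.add(Dropout(0.3))   # story branch
qrnn.add(Dropout(0.3))      # query branch

# Then train until convergence, with the test data as validation_data;
# EarlyStopping halts once val_loss stops improving for `patience` epochs
model.fit([X, Xq], Y, batch_size=32, nb_epoch=500,
          validation_data=([tX, tXq], tY), show_accuracy=True,
          callbacks=[EarlyStopping(patience=10)])
```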

> It appears that EarlyStopping doesn't keep track of which epoch it stopped at - do you mind if I add in that counter? I'd assume this would be a common task to use the EarlyStopping callback for.

You can add it, but I've never had any issue with it: the History callback tracks all epochs, and the best epoch according to the EarlyStopping callback is the epoch at which it stopped minus the patience. Alternatively, just look at the loss or accuracy history (History callback) to pick the best epoch.

@Smerity (Contributor, Author) commented Aug 8, 2015

I agree that the training data, and the resulting validation data, are far smaller than we'd like (part of the dataset's aims), but I'd rather not use the test data as validation_data with EarlyStopping. It was bashed into my head repeatedly over the years that test data should only ever be used to evaluate a final model, never for model fitting of any kind.

> My initial tests with the above setup indicate that the model was underfitting in most cases. For instance I am now getting 0.3060 with a SimpleRNN on QA2.

As mentioned, the results above used zero hyperparameter optimization (20 epochs each time), so they were preliminary. I've implemented the EarlyStopping technique with 20% of the data set aside for validation, then trained over the full training dataset accordingly, and got similar results to you for SimpleRNN on QA2.

Adding Dropout is a good idea. For some reason I thought dropout had to be modified to work well for RNNs, but Recurrent Neural Network Regularization shows that using the Keras Dropout layer as normal works well. My bad!

> ...the best epoch according to the EarlyStopping callback is the epoch at which it stopped minus the patience.

EarlyStopping doesn't store the epoch it stopped at, only reporting it to the user via verbose. Instead I'll use History together with EarlyStopping and then select the best epoch from the history, as you've suggested.

@Smerity (Contributor, Author) commented Aug 17, 2015

I implemented EarlyStopping using 20% of the data for validation (then training on 100% of the data) with various dropout levels. The results can still fluctuate 5% to 10% from what they should be, especially when run over all tasks. Some of the tasks are resilient to tweaking, while others need good hyperparameter tuning just to get working, especially when certain architectures meet certain tasks. As such, I agree strongly with you that this doesn't make a good dataset for testing various RNN architectures.

If people are interested in the early stopping code, feel free to check out:
https://gist.github.com/Smerity/418a4e7f9e719ff02bf3
(note: I did add a "best_epoch" to EarlyStopping)

Off-topic potential bug: Dropout doesn't work after a Merge layer, but that wasn't an issue for me as you can apply dropout before the two layers are merged.
