
Conversation

@Smerity (Contributor) commented Aug 7, 2015

This PR contains two bugfixes (missing vocabulary item in QA19 + issues with running on Python 3) and additional documentation (results and comparison to the Facebook LSTM baseline when run over all 20 tasks).

Smerity added 3 commits August 6, 2015 12:58

  • For all other questions, the full vocab is in the stories and the queries
  • + reduce function disappeared (requires import from functools)
  • + tarfiles and encodings - decoding bytes to ASCII at line level
@fchollet (Collaborator) commented Aug 7, 2015

Why do you think this setup significantly outperforms the FB baseline? Have you tried using LSTM instead of GRU? (I guess you have?). It would be a more straightforward comparison to the FB baseline. Have you tried IRNN?

fchollet added a commit that referenced this pull request Aug 7, 2015
Fixes and full results for bAbi RNN example
@fchollet merged commit 1a572b1 into keras-team:master Aug 7, 2015
@Smerity (Contributor, Author) commented Aug 7, 2015

Unfortunately the Facebook paper is light on details for their LSTM baseline. They don't specify the size of the internal state or the exact setup of the network itself. From the minimal facts included, my best guess is that they fed the entire input, including the query, into a single RNN and then ran a dense layer on the RNN output.
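If that reading is right, their baseline would be something like the following rough sketch (the embedding, layer sizes, and optimizer here are my assumptions, not anything stated in the paper):

```python
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.core import Dense
from keras.layers import recurrent

vocab_size, EMBED, HIDDEN = 22, 50, 50  # placeholder sizes

# Story and query concatenated into a single token sequence, read by one RNN
model = Sequential()
model.add(Embedding(vocab_size, EMBED))
model.add(recurrent.LSTM(EMBED, HIDDEN, return_sequences=False))
model.add(Dense(HIDDEN, vocab_size, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')
```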

The only other pure RNN implementation that I found for this task was from Answering Reading Comprehension Using Memory Networks, done as part of the CS224d course. They only report results for QA1-3, however, and have some odd numbers: QA1-3 are the same style of task, each more challenging than the last, yet their accuracy goes QA1, QA2, QA3 = [31.2, 35.6, 27.1] - it's counter-intuitive for accuracy to improve on a harder task.

In our example, we have two RNNs - one that encodes the story, the other that encodes the query - and then use a dense layer to compute the answer from those outputs. I did this partially because it felt natural to me, partially because I wanted to use the Merge API, and finally because I feel it's a fairer baseline. Memory networks and other more advanced architectures have both an input component and a query component, which could substantially reduce the complexity of the task, especially if the number of hidden units is kept equivalently sized.
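For concreteness, the rough shape of our setup (the sizes here are placeholders rather than the exact values in babi_rnn.py):

```python
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.core import Dense, Merge
from keras.layers import recurrent

RNN = recurrent.GRU  # hot-swappable with recurrent.LSTM, recurrent.SimpleRNN, ...
vocab_size, EMBED, HIDDEN = 22, 50, 100  # placeholder sizes

# One RNN encodes the story...
sentrnn = Sequential()
sentrnn.add(Embedding(vocab_size, EMBED))
sentrnn.add(RNN(EMBED, HIDDEN, return_sequences=False))

# ...a second RNN encodes the query...
qrnn = Sequential()
qrnn.add(Embedding(vocab_size, EMBED))
qrnn.add(RNN(EMBED, HIDDEN, return_sequences=False))

# ...and a dense softmax over the concatenated encodings picks the answer word
model = Sequential()
model.add(Merge([sentrnn, qrnn], mode='concat'))
model.add(Dense(HIDDEN + HIDDEN, vocab_size, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')
```

Training then feeds both inputs at once, e.g. model.fit([X, Xq], Y, ...).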

There are also interesting issues with the dataset (duplication of some of the (story, query, answer) tuples) that I've discovered, which might have had an impact on all the systems that have used this data. I've contacted the maintainers of the dataset with my findings, so we'll see what comes of that.

As far as GRU vs LSTM goes, I wrote the code to make it easy and generic to swap RNNs out, then didn't think about it much more. GRU runs faster than LSTM for me (sadly I only have CPUs) and I didn't see any performance difference on the tasks where I tested both LSTM and GRU. I agree with you completely re: making it an LSTM baseline to be more comparable. I've replaced the GRU with RNN and am running the experiments for all tasks now and can report back the findings. I've not tested the IRNN and can get those results too.

@fchollet (Collaborator) commented Aug 7, 2015

> I've not tested the IRNN and can get those results too.

Should be much faster still than GRU. IRNN's advantage is that it is very lightweight. But it is yet to be seen if it performs any better than a simple RNN (with orthogonal initialization and hard_sigmoid activation, the Keras defaults). Open question...

> The only other pure RNN implementation that I found for this task was from Answering Reading Comprehension Using Memory Networks done as part of the CS224d course.

I'd be curious to see if they used Keras for their experiments, or if there is any code out there associated with the paper. Doesn't seem to be mentioned in the paper...

> I agree with you completely re: making it an LSTM baseline to be more comparable. I've replaced the GRU with RNN and am running the experiments for all tasks now and can report back the findings.

Great. I think even with identical architecture it would make sense to see Keras outperform a past LSTM-based experiment, because the Keras defaults incorporate recent advances that may not have been available at the time (orthogonal initialization of weights and identity initialization of the forget gate bias).

@Smerity (Contributor, Author) commented Aug 8, 2015

With the major proviso that each of these was run with zero hyperparameter optimization (each task/learner was run for 20 epochs), here are some initial results.

I'm trying to work out the best way to do hyperparameter optimization. The issue is that, to be comparable with the Facebook LSTM baseline, the training set size is only 1,000. Splitting that out to have a reasonable validation set whilst not compromising the raw number of samples to train on is problematic. With a validation set of only 0.05 (50 samples), log loss and accuracy numbers jump around wildly. Hyperparameter optimization would then need to be done for each task and each learner.

The plan would be to add a callback, similar to EarlyStopping, where we'd train on 0.8 * |samples| and validate with 0.2 * |samples|, then use the same epoch number to train over the full dataset and test. It appears that EarlyStopping doesn't keep track of which epoch it stopped at - do you mind if I add in that counter? I'd assume this would be a common task to use the EarlyStopping callback for.
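Roughly what I have in mind, with a small hypothetical callback standing in for that counter (the names below are placeholders; X, Xq, Y are the vectorised stories, queries, and answers):

```python
from keras.callbacks import Callback, EarlyStopping

class BestEpoch(Callback):
    """Remember the epoch with the lowest validation loss."""
    def on_train_begin(self, logs={}):
        self.best_epoch, self.best_loss = 0, float('inf')

    def on_epoch_end(self, epoch, logs={}):
        if logs.get('val_loss', float('inf')) < self.best_loss:
            self.best_epoch, self.best_loss = epoch, logs['val_loss']

# Pass 1: train on 0.8 * |samples|, validate on the remaining 0.2
best = BestEpoch()
model.fit([X, Xq], Y, batch_size=32, nb_epoch=200, validation_split=0.2,
          callbacks=[best, EarlyStopping(patience=5)])

# Pass 2: after re-building and re-compiling the model, train on the full
# training set for the epoch count found above, leaving the test set untouched
# until final evaluation
model.fit([X, Xq], Y, batch_size=32, nb_epoch=best.best_epoch + 1)
```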

Note - I think I set up the IRNN correctly, though I'm not 100% certain. I based it on the code from mnist_irnn.py, enabling it to be hot-swapped in like LSTM, GRU, and SimpleRNN.
(now that I think about it, I'd just refactor to *args and **kwargs, but that's more code cleanup)

```python
from keras.layers import recurrent
from keras.initializations import normal, identity

# IRNN setup: ReLU SimpleRNN with small-normal input weights and identity recurrent weights
RNN = lambda ind, outd, return_sequences: \
    recurrent.SimpleRNN(input_dim=ind, output_dim=outd,
                        init=lambda shape: normal(shape, scale=0.001),
                        inner_init=lambda shape: identity(shape, scale=1.0),
                        activation='relu',
                        return_sequences=return_sequences)
```
| Task Number | FB-LSTM | LSTM | GRU | IRNN | RNN |
| --- | --- | --- | --- | --- | --- |
| QA1 - Single Supporting Fact | 50 | 51.2 | 52.1 | 47.7 | 52.7 |
| QA2 - Two Supporting Facts | 20 | 21.8 | 37.0 | 19.7 | 27.8 |
| QA3 - Three Supporting Facts | 20 | 20.1 | 20.5 | 21.3 | 22.4 |
| QA4 - Two Arg. Relations | 61 | 56.2 | 62.9 | 69.0 | 20.0 |
| QA5 - Three Arg. Relations | 70 | 46.8 | 61.9 | 32.7 | 38.8 |
| QA6 - Yes/No Questions | 48 | 49.1 | 50.7 | 49.3 | 44.8 |
| QA7 - Counting | 49 | 76.1 | 78.9 | 75.4 | 63.2 |
| QA8 - Lists/Sets | 45 | 72.1 | 77.2 | 73.7 | 41.0 |
| QA9 - Simple Negation | 64 | 63.5 | 64.0 | 58.6 | 63.8 |
| QA10 - Indefinite Knowledge | 44 | 47.6 | 47.7 | 47.7 | 42.8 |
| QA11 - Basic Coreference | 72 | 71.9 | 74.9 | 74.0 | 75.1 |
| QA12 - Conjunction | 74 | 73.2 | 76.4 | 71.0 | 77.2 |
| QA13 - Compound Coreference | 94 | 94.0 | 94.4 | 94.0 | 94.4 |
| QA14 - Time Reasoning | 27 | 23.7 | 34.8 | 30.5 | 19.9 |
| QA15 - Basic Deduction | 21 | 21.7 | 32.4 | 54.0 | 23.9 |
| QA16 - Basic Induction | 23 | 44.4 | 50.6 | 49.4 | 41.8 |
| QA17 - Positional Reasoning | 51 | 52.1 | 49.1 | 48.9 | 52.4 |
| QA18 - Size Reasoning | 52 | 91.0 | 90.8 | 58.4 | 54.8 |
| QA19 - Path Finding | 8 | 9.5 | 9.0 | 11.5 | 7.1 |
| QA20 - Agent's Motivations | 91 | 93.5 | 90.7 | 97.6 | 92.2 |

The results are a little all over the place. The plain RNN actually does fairly well but occasionally just falls apart on tasks that the IRNN has less of an issue with (QA4, QA8, QA14). Occasionally the plain RNN or IRNN does better than the LSTM or GRU, which suggests the latter haven't been trained for enough epochs.

I'll add in code for hyperparameter choice, re-run the experiments, and hopefully get a clearer and more consistent picture of performance differences.

@fchollet (Collaborator) commented Aug 8, 2015

It is very difficult to make sense of these results... I suspect there might be issues with the dataset that would cause results to be semi-random. Maybe just the very small size of the dataset (especially the test dataset), coupled with the quantized metric (accuracy). An easy way to check this would be to look at the variance between different runs with different random initializations. If the same model/code (minus the random seed) produces very different results each time, then the setup is not reliable. If the results are consistent, then something different is going on.
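An example of what that check could look like (a hedged sketch: build_model() is a stand-in for whatever constructs and compiles the bAbI model, and X, Xq, Y / tX, tXq, tY are the vectorised train / test data):

```python
import numpy as np

accs = []
for seed in range(5):
    np.random.seed(seed)      # only the random seed changes between runs
    model = build_model()     # hypothetical helper: rebuilds and compiles the model
    model.fit([X, Xq], Y, batch_size=32, nb_epoch=20, verbose=0)
    loss, acc = model.evaluate([tX, tXq], tY, show_accuracy=True, verbose=0)
    accs.append(acc)

print('accuracy mean %.3f / std %.3f over %d runs' % (np.mean(accs), np.std(accs), len(accs)))
```

If the standard deviation is of the same order as the differences in the table above, the comparison isn't telling us much.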

  • One thought about the training / validation process: it would be better not to use a validation split (which is too small to be statistically significant anyway), but instead to use the test data as validation_data, with an EarlyStopping callback (with a reasonable patience value) to stop when loss stops improving on the test data, and to record the best accuracy on the test data. This should lead to more reliable scores (closer to optimal) and the ability to really train until convergence.
  • My initial tests with the above setup indicate that the model was underfitting in most cases. For instance, I am now getting 0.3060 with a SimpleRNN on QA2.
  • There is no use of Dropout. Given the very small size of the dataset, I think Dropout will be indispensable to prevent overfitting. This is especially true for "power" models like LSTM, which normally require huge datasets to be fully utilized.
  • The combination of Dropout to prevent overfitting, and training until convergence, should yield more stable and much better results; a rough sketch follows this list.
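Concretely, something like this (a rough sketch reusing the example's two branches sentrnn / qrnn and its vectorised data; the dropout rate and patience value are placeholders):

```python
from keras.callbacks import EarlyStopping
from keras.layers.core import Dropout

# While building each branch (i.e. before constructing the Merge), drop out the RNN encodings
sentrnn.add(Dropout(0.3))   # story branch
qrnn.add(Dropout(0.3))      # query branch

# Then train until convergence, with the test data as validation_data;
# EarlyStopping halts once val_loss stops improving for `patience` epochs
model.fit([X, Xq], Y, batch_size=32, nb_epoch=500,
          validation_data=([tX, tXq], tY), show_accuracy=True,
          callbacks=[EarlyStopping(patience=10)])
```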

> It appears that EarlyStopping doesn't keep track of which epoch it stopped at - do you mind if I add in that counter? I'd assume this would be a common task to use the EarlyStopping callback for.

You can add it, but I've never had any issue with it: the History callback tracks all epochs, and the best epoch according to the EarlyStopping callback is the epoch at which it stopped minus the patience. Alternatively, just look at the loss or accuracy history (History callback) to pick the best epoch.

@Smerity (Contributor, Author) commented Aug 8, 2015

I agree that the training data, and the resulting validation data, are far smaller than we'd like (part of the dataset's aims), but I'd rather not use the test data as validation_data with EarlyStopping. It was bashed into my head repeatedly over the years that test data should only ever be used to evaluate a final model, never for model fitting of any kind.

> My initial tests with the above setup indicate that the model was underfitting in most cases. For instance I am now getting 0.3060 with a SimpleRNN on QA2.

As mentioned, the results above used zero hyperparameter optimization (20 epochs each time), so they were preliminary. I've implemented the EarlyStopping technique with 20% of the data set aside for validation, then trained over the full training dataset accordingly, and got similar results to you for SimpleRNN on QA2.

Adding Dropout is a good idea. For some reason I thought dropout had to be modified to work well for RNNs, but Recurrent Neural Network Regularization shows that using the Keras Dropout layer as normal works well. My bad!

> ...the best epoch according to the EarlyStopping callback is the epoch at which it stopped minus the patience.

EarlyStopping doesn't store the epoch it stopped at, only reporting it to the user via verbose. Instead I'll use History together with EarlyStopping and then select the best epoch from the history, as you've suggested.

@Smerity (Contributor, Author) commented Aug 17, 2015

I implemented EarlyStopping using 20% of the data for validation (then training on 100% of the data) with various dropout levels. The results can still fluctuate 5% to 10% from what they should be, especially when run over all tasks. Some of the tasks are resilient to tweaking, while others need good hyperparameter tuning just to get working, especially when certain architectures meet certain tasks. As such, I agree strongly with you that this doesn't make a good dataset for testing various RNN architectures.

If people are interested in the early stopping code, feel free to check out:
https://gist.github.com/Smerity/418a4e7f9e719ff02bf3
(note: I did add a "best_epoch" to EarlyStopping)

Off-topic potential bug: Dropout doesn't work after a Merge layer, but that wasn't an issue for me as you can apply dropout before the two layers are merged.
