Fixes and full results for bAbi RNN example #501
Conversation
For all other questions, the full vocab is in the stories and the queries; QA19 is the exception, which is what required the vocabulary fix.

Python 3 fixes:
- the reduce function disappeared (requires an import from functools)
- tarfiles and encodings: decoding bytes to ASCII at line level
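A minimal sketch of the kind of fix involved; the archive name, member path, and the toy `train` tuple are illustrative assumptions rather than the example's actual code:

```python
from functools import reduce  # Python 3: reduce is no longer a builtin
import tarfile

# tarfile members are read as bytes under Python 3, so decode at line level
with tarfile.open('babi_tasks_1-20_v1-2.tar.gz') as tar:
    member = 'tasks_1-20_v1-2/en/qa1_single-supporting-fact_train.txt'
    lines = [line.decode('ascii') for line in tar.extractfile(member)]

# ... parse `lines` into (story, query, answer) token-list tuples ...
train = [(['Mary', 'moved', 'to', 'the', 'bathroom', '.'],
          ['Where', 'is', 'Mary', '?'],
          'bathroom')]  # toy stand-in for the parsed data

# Build the vocab from stories, queries *and* answers: in QA19 the answer
# tokens never occur in the stories or queries, hence the missing-item fix.
vocab = sorted(reduce(lambda x, y: x | y,
                      (set(story + q + [answer]) for story, q, answer in train)))
```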
Why do you think this setup significantly outperforms the FB baseline? Have you tried using LSTM instead of GRU? (I guess you have?) It would be a more straightforward comparison to the FB baseline. Have you tried IRNN?
The Facebook paper is unfortunately light on details for their LSTM baseline. They don't specify the size of the internal state or the exact setup of the network itself. From the minimal facts included, my guess is that they fed the entire input into a single RNN, including the query, presumably then running a dense layer on the RNN output. The only other pure RNN implementation that I found for this task was from Answering Reading Comprehension Using Memory Networks, done as part of the CS224d course. They only report results for QA1-3, however, and have some odd numbers: QA1-3 are the same style of task each time, just more challenging, yet their accuracy goes QA1, QA2, QA3 = [31.2, 35.6, 27.1] - counter-intuitive for accuracy to improve on a harder task.

In our example, we have two RNNs - one that encodes the story, the other that encodes the query - and then use a dense layer to compute the answer from those outputs. I did this partially because it felt natural to me, partially because I wanted to use the Merge API, and finally because I feel it's a fairer baseline. Memory networks and other more advanced neural networks have an input component and a query component, which could substantially reduce the complexity of the task, especially if the number of hidden units is kept equivalently sized.

There are also interesting issues with the dataset (duplication of some of the (story, query, answer) tuples) that I've discovered, which might have had an impact on all the systems that have used this data. I've contacted the maintainers of the dataset with my findings, so we'll see what comes of that.

As far as GRU vs LSTM, I wrote the code to be easy and generic to swap RNNs out, then didn't think about it too much more. GRU runs faster than LSTM for me (sadly I've only CPUs) and I didn't see any performance difference on the tasks I tested both LSTM and GRU on. I agree with you completely re: making it an LSTM baseline to be more comparable. I've replaced the GRU with RNN and am running the experiments for all tasks now, and can report back with the findings. I've not tested the IRNN but can get those results too.
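A minimal sketch of the two-encoder layout described above, written against the current Keras functional API with `concatenate` standing in for the Merge layer mentioned in the thread; the sizes, sequence lengths, and optimizer are illustrative rather than the example's actual settings:

```python
from keras.layers import Input, Embedding, GRU, Dense, concatenate
from keras.models import Model

vocab_size, embed_size, hidden_size = 22, 50, 100   # illustrative
story_maxlen, query_maxlen = 68, 4                  # illustrative

story_in = Input(shape=(story_maxlen,), dtype='int32')
story_enc = Embedding(vocab_size, embed_size)(story_in)
story_enc = GRU(hidden_size)(story_enc)              # RNN that encodes the story

query_in = Input(shape=(query_maxlen,), dtype='int32')
query_enc = Embedding(vocab_size, embed_size)(query_in)
query_enc = GRU(hidden_size)(query_enc)              # RNN that encodes the query

merged = concatenate([story_enc, query_enc])          # the merge step
answer = Dense(vocab_size, activation='softmax')(merged)

model = Model([story_in, query_in], answer)
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
```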
Should be much faster still than GRU. IRNN's advantage is that it is very lightweight. But it is yet to be seen if it performs any better than a simple RNN (with orthogonal initialization and hard_sigmoid activation, the Keras defaults). Open question...
I'd be curious to see if they used Keras for their experiments, or if there is any code out there associated with the paper. Doesn't seem to be mentioned in the paper...
Great. I think even with an identical architecture it would make sense to see Keras outperform a past LSTM-based experiment, because the Keras defaults incorporate recent advances that may not have been available at the time (orthogonal initialization of the weights and initializing the forget gate bias to one).
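For concreteness, current Keras exposes those two defaults as explicit LSTM arguments (the argument names below are from today's API, not the release this thread was using):

```python
from keras.layers import LSTM

# Both arguments are already the Keras defaults; spelled out here only to
# make the two initialization tricks explicit.
lstm = LSTM(100,
            recurrent_initializer='orthogonal',  # orthogonal recurrent weights
            unit_forget_bias=True)               # forget gate bias starts at one
```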
With the major proviso that each of these was run with zero hyperparameter optimization (each task / learner combination was run for 20 epochs), here are some initial results.

I'm trying to work out the best way to do hyperparameter optimization. The issue is that, to be comparable with the Facebook LSTM baseline, the training set size is only 1,000. Splitting that out to have a reasonable validation set whilst not compromising the raw number of samples to train on is problematic. With a validation split of only 0.05 (50 samples), the log loss and accuracy numbers jump around wildly. Hyperparameter optimization would then need to be done for each task and each learner. The plan would be to add a callback, similar to the existing EarlyStopping callback.

Note - I think I set up IRNN correctly, though I'm not 100% certain. I just took the code from ...
The results are a little all over the place. The plain RNN actually does fairly well but occasionally just falls apart on tasks that the IRNN has less of an issue with (QA4, QA8, QA14). Occasionally the plain RNN or IRNN do better than LSTM or GRU, which suggests they've not been trained for enough epochs. I'll add in code for hyperparameter choice, re-run the experiments, and hopefully get a clearer and more consistent picture of the performance differences.
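For reference, the IRNN setup mentioned above is usually expressed in today's Keras as a SimpleRNN with identity recurrent initialization, small-Gaussian input weights, and ReLU activation, mirroring the Keras IRNN example; a hedged sketch with an illustrative hidden size:

```python
from keras import initializers
from keras.layers import SimpleRNN

hidden_size = 100  # illustrative

# IRNN (Le et al., 2015): a plain RNN whose recurrent weights start as the
# identity matrix and whose activation is ReLU rather than tanh/hard_sigmoid.
irnn = SimpleRNN(hidden_size,
                 kernel_initializer=initializers.RandomNormal(stddev=0.001),
                 recurrent_initializer=initializers.Identity(gain=1.0),
                 activation='relu')
```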
It is very difficult to make sense of these results... I suspect there might be issues with the dataset that would cause results to be semi-random. Maybe just the very small size of the dataset (especially the test dataset), coupled with the quantized metric (accuracy). An easy way to check this would be to look at the variance between different runs with different random initializations. If the same model/code (minus the random seed) produces very different results each time, then the setup is not reliable. If the results are consistent, then something different is going on.
You can add it, but I've never had any issue with it: the ...
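One way to run the variance check suggested above; `build_model` and the data arrays are placeholders for whatever setup is being tested:

```python
import numpy as np

accuracies = []
for seed in range(5):
    np.random.seed(seed)              # the only thing that changes between runs
    model = build_model()             # placeholder: identical architecture each run
    model.fit([x_story, x_query], y_answer,
              batch_size=32, epochs=20, verbose=0)
    _, acc = model.evaluate([tx_story, tx_query], ty_answer, verbose=0)
    accuracies.append(acc)

print('test accuracy: mean %.3f, std %.3f'
      % (np.mean(accuracies), np.std(accuracies)))
```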
I agree that the training data, and the validation data carved out of it, are far smaller than we'd like (part of the dataset's aims), but I'd not use the test data for the ...
As mentioned, the results above used zero hyperparameter optimization (20 epochs each time), so they were preliminary. I've implemented the EarlyStopping callback. Adding ...
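The callback in question is Keras's EarlyStopping (named explicitly in the next comment); a standard way to wire it in, with an illustrative patience and validation fraction, assuming `model` and the arrays from the earlier sketches:

```python
from keras.callbacks import EarlyStopping

# Stop once validation loss stops improving; a bit of patience helps because
# a validation slice carved out of only 1,000 stories is very noisy.
stopper = EarlyStopping(monitor='val_loss', patience=5)
model.fit([x_story, x_query], y_answer,
          batch_size=32, epochs=100,
          validation_split=0.2,
          callbacks=[stopper])
```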
I implemented EarlyStopping using 20% of the data (training on 100% of the data) with various dropout levels. The results can still fluctuate 5% to 10% from what they should be, especially when run over all tasks. Some of the tasks are resilient to tweaking and others need good hyperparameter tuning just to get working, especially when certain architectures meet certain tasks. As such, I agree strongly with you that this doesn't make a good dataset for testing various RNN architectures.

If people are interested in the early stopping code, feel free to check out:

Off-topic potential bug: dropout doesn't work after a Merge layer, but that wasn't an issue for me as you can apply dropout before the two layers are merged.
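A sketch of the workaround described above, applying dropout to each encoder output before the two branches are combined (functional API, with `concatenate` in place of the old Merge layer; sizes are illustrative):

```python
from keras.layers import Input, Embedding, GRU, Dropout, Dense, concatenate
from keras.models import Model

vocab_size, embed_size, hidden_size = 22, 50, 100   # illustrative

story_in = Input(shape=(68,), dtype='int32')
query_in = Input(shape=(4,), dtype='int32')
story_enc = GRU(hidden_size)(Embedding(vocab_size, embed_size)(story_in))
query_enc = GRU(hidden_size)(Embedding(vocab_size, embed_size)(query_in))

# Workaround: apply dropout to each branch *before* the merge, not after it.
story_enc = Dropout(0.3)(story_enc)
query_enc = Dropout(0.3)(query_enc)

answer = Dense(vocab_size, activation='softmax')(concatenate([story_enc, query_enc]))
model = Model([story_in, query_in], answer)
```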
This PR contains two bugfixes (missing vocabulary item in QA19 + issues with running on Python 3) and additional documentation (results and comparison to the Facebook LSTM baseline when run over all 20 tasks).