Ask Me Anything: Dynamic Memory Networks for Natural Language Processing
Figure 3. Real example of an input list of sentences and the attention gates that are triggered by a specific question from the bAbI tasks (Weston et al., 2015a). Gate values $g^i_t$ are shown above the corresponding vectors. The gates change with each search over inputs. We do not draw connections for gates that are close to zero. Note that the second iteration has wrongly placed some weight in sentence 2, which makes some intuitive sense, as sentence 2 is another place John had been.
of $T_Q$ words, hidden states for the question encoder at time $t$ are given by $q_t = \mathrm{GRU}(L[w^Q_t], q_{t-1})$, where $L$ represents the word embedding matrix as in the previous section and $w^Q_t$ represents the word index of the $t$th word in the question. We share the word embedding matrix across the input module and the question module. Unlike the input module, the question module produces as output the final hidden state of the recurrent network encoder: $q = q_{T_Q}$.
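As a concrete illustration, the following is a minimal NumPy sketch of such a question encoder. The GRU gating convention, the toy dimensions, and the randomly initialized weights and embedding matrix are illustrative stand-ins, not the trained parameters or exact formulation from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W, U, b):
    """One GRU step; W, U, b hold the update (z), reset (r) and candidate (h) blocks."""
    z = sigmoid(W["z"] @ x + U["z"] @ h_prev + b["z"])               # update gate
    r = sigmoid(W["r"] @ x + U["r"] @ h_prev + b["r"])               # reset gate
    h_tilde = np.tanh(W["h"] @ x + U["h"] @ (r * h_prev) + b["h"])   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde

def encode_question(word_ids, L, W, U, b, hidden_dim):
    """Run the GRU over the embedded question words and return the final state q = q_{T_Q}."""
    q = np.zeros(hidden_dim)
    for w in word_ids:
        q = gru_step(L[w], q, W, U, b)
    return q

# Toy dimensions and randomly initialized parameters (illustrative only).
rng = np.random.default_rng(0)
vocab, embed_dim, hidden_dim = 50, 8, 16
L_embed = rng.normal(scale=0.1, size=(vocab, embed_dim))
W = {k: rng.normal(scale=0.1, size=(hidden_dim, embed_dim)) for k in "zrh"}
U = {k: rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)) for k in "zrh"}
b = {k: np.zeros(hidden_dim) for k in "zrh"}

q = encode_question([3, 17, 42], L_embed, W, U, b, hidden_dim)  # toy word ids for a question
```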
2.3. Episodic Memory Module

In its general form, the episodic memory module is comprised of an internal memory, an attention mechanism and a recurrent network to update its memory. During each iteration, the attention mechanism attends over the fact representations $c$ by using a gating function (described below) while taking into consideration the question representation $q$ and the previous memory $m^{i-1}$ to produce an episode $e^i$. The episode is then used, alongside the previous memory $m^{i-1}$, to update the episodic memory $m^i = \mathrm{GRU}(e^i, m^{i-1})$. The initial state of this GRU is initialized to the question vector itself: $m^0 = q$. For some tasks, it is beneficial for the episodic memory module to take multiple passes over the input. After $T_M$ passes, the final memory $m^{T_M}$ is given to the answer module.
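The control flow of this multi-pass update can be sketched as below. This is a structural sketch only: the `attention_episode` and `memory_update` callables, and the toy placeholders used for them, stand in for the attention mechanism described next and the memory GRU; they are not the paper's implementation.

```python
import numpy as np

def episodic_memory(facts, q, attention_episode, memory_update, num_passes):
    """Multi-pass memory refinement: m^0 = q, m^i = GRU(e^i, m^{i-1})."""
    m = q.copy()                            # m^0 = q
    for _ in range(num_passes):             # T_M passes over the input
        e = attention_episode(facts, m, q)  # episode e^i from the gated facts
        m = memory_update(e, m)             # m^i = GRU(e^i, m^{i-1})
    return m                                # m^{T_M}, handed to the answer module

# Toy usage with placeholder callables (mean of the facts as the "episode",
# a convex blend in place of the memory GRU) just to show the control flow.
facts = [np.ones(4), 2 * np.ones(4)]
q = np.zeros(4)
m_final = episodic_memory(
    facts, q,
    attention_episode=lambda c, m, q: np.mean(c, axis=0),
    memory_update=lambda e, m: 0.5 * e + 0.5 * m,
    num_passes=3,
)
```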
Need for Multiple Episodes: The iterative nature of this module allows it to attend to different inputs during each pass. It also allows for a type of transitive inference, since the first pass may uncover the need to retrieve additional facts. For instance, in the example in Fig. 3, we are asked Where is the football? In the first iteration, the model ought to attend to sentence 7 (John put down the football.), as the question asks about the football. Only once the model sees that John is relevant can it reason that the second iteration should retrieve where John was. Similarly, a second pass may help for sentiment analysis, as we show in the experiments section below.

Attention Mechanism: In our work, we use a gating function as our attention mechanism. For each pass $i$, the mechanism takes as input a candidate fact $c_t$, a previous memory $m^{i-1}$, and the question $q$ to compute a gate: $g^i_t = G(c_t, m^{i-1}, q)$.

The scoring function $G$ takes as input the feature set $z(c, m, q)$ and produces a scalar score. We first define a large feature vector that captures a variety of similarities between input, memory and question vectors:
$$z(c, m, q) = \left[ c, m, q, c \circ q, c \circ m, |c - q|, |c - m|, c^T W^{(b)} q, c^T W^{(b)} m \right], \qquad (5)$$
where $\circ$ is the element-wise product. The function $G$ is a simple two-layer feed-forward neural network:
$$G(c, m, q) = \sigma\!\left( W^{(2)} \tanh\!\left( W^{(1)} z(c, m, q) + b^{(1)} \right) + b^{(2)} \right). \qquad (6)$$
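A minimal NumPy sketch of the feature vector in Eq. 5 and the two-layer scoring network in Eq. 6 follows. The dimensions and the randomly initialized weights $W^{(b)}$, $W^{(1)}$, $W^{(2)}$ are illustrative stand-ins.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def z_features(c, m, q, W_b):
    """Similarity feature vector of Eq. 5; the bilinear terms use W^(b)."""
    return np.concatenate([
        c, m, q,
        c * q, c * m,                     # element-wise products
        np.abs(c - q), np.abs(c - m),     # absolute differences
        [c @ W_b @ q], [c @ W_b @ m],     # bilinear similarities c^T W^(b) q, c^T W^(b) m
    ])

def gate_score(c, m, q, W_b, W1, b1, W2, b2):
    """Two-layer feed-forward scoring function G of Eq. 6."""
    z = z_features(c, m, q, W_b)
    return sigmoid(W2 @ np.tanh(W1 @ z + b1) + b2)

# Toy dimensions and parameters (illustrative only).
rng = np.random.default_rng(0)
d, hidden = 6, 10
feat_dim = 7 * d + 2                      # size of z(c, m, q) for d-dimensional vectors
c, m, q = rng.normal(size=(3, d))
W_b = rng.normal(scale=0.1, size=(d, d))
W1 = rng.normal(scale=0.1, size=(hidden, feat_dim))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(1, hidden))
b2 = np.zeros(1)

g = gate_score(c, m, q, W_b, W1, b1, W2, b2)  # gate value g^i_t (array of shape (1,))
```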
map (O) components have some functional overlap with our episodic memory. However, the Memory Network cannot be applied to the same variety of NLP tasks since it processes sentences independently and not via a sequence model. It requires bag-of-n-gram vector features as well as a separate feature that captures whether a sentence came before another one.

Various other neural memory or attention architectures have recently been proposed for algorithmic problems (Joulin & Mikolov, 2015; Kaiser & Sutskever, 2015), caption generation for images (Malinowski & Fritz, 2014; Chen & Zitnick, 2014), visual question answering (Yang et al., 2015) or other NLP problems and datasets (Hermann et al., 2015).

In contrast, the DMN employs neural sequence models for input representation, attention, and response mechanisms, thereby naturally capturing position and temporality. As a result, the DMN is directly applicable to a broader range of applications without feature engineering. We compare directly to Memory Networks on the bAbI dataset (Weston et al., 2015a).

NLP Applications: The DMN is a general model which we apply to several NLP problems. We compare to what, to the best of our knowledge, is the current state-of-the-art method for each task.

There are many different approaches to question answering: some build large knowledge bases (KBs) with open information extraction systems (Yates et al., 2007), some use neural networks, dependency trees and KBs (Bordes et al., 2012), others only sentences (Iyyer et al., 2014), and many other approaches exist. When QA systems do not produce the right answer, it is often unclear whether it is because they do not have access to the facts, cannot reason over them, or have never seen this type of question or phenomenon. Most QA datasets have only a few hundred questions and answers but require complex reasoning. They can hence not be solved by models that have to learn purely from examples. While synthetic datasets (Weston et al., 2015a) have problems and can often be solved easily with manual feature engineering, they let us disentangle failure modes of models and understand necessary QA capabilities. They are useful for analyzing models that attempt to learn everything and do not rely on external features like coreference, POS, parsing, logical rules, etc. The DMN is such a model. Another related model by Andreas et al. (2016) combines neural and logical reasoning for question answering over knowledge bases and visual question answering.

Sentiment analysis is a very useful classification task and recently the Stanford Sentiment Treebank (Socher et al., 2013) has become a standard benchmark dataset. Kim (2014) reports the previous state-of-the-art result based on a convolutional neural network that uses multiple word vector representations. The previous best model for part-of-speech tagging on the Wall Street Journal section of the Penn Treebank (Marcus et al., 1993) was Søgaard (2011), who used a semi-supervised nearest neighbor approach. We also compare directly to paragraph vectors (Le & Mikolov, 2014).

Neuroscience: The episodic memory in humans stores specific experiences in their spatial and temporal context. For instance, it might contain the first memory somebody has of flying a hang glider. Eichenbaum and Cohen have argued that episodic memories represent a form of relationship (i.e., relations between spatial, sensory and temporal information) and that the hippocampus is responsible for general relational learning (Eichenbaum & Cohen, 2004). Interestingly, it also appears that the hippocampus is active during transitive inference (Heckers et al., 2004), and disruption of the hippocampus impairs this ability (Dusek & Eichenbaum, 1997).

The episodic memory module in the DMN is inspired by these findings. It retrieves specific temporal states that are related to or triggered by a question. Furthermore, we found that the GRU in this module was able to do some transitive inference over the simple facts in the bAbI dataset. This module also has similarities to the Temporal Context Model (Howard & Kahana, 2002) and its Bayesian extensions (Socher et al., 2009), which were developed to analyze human behavior in word recall experiments.

4. Experiments

We include experiments on question answering, part-of-speech tagging, and sentiment analysis. The model is trained independently for each problem, while the architecture remains the same except for the answer module and input fact subsampling (words vs. sentences). The answer module, as described in Section 2.4, is triggered either once at the end or for each token.

For all datasets we used either the official train, development, and test splits or, if no development set was defined, 10% of the training set for development. Hyperparameter tuning and model selection (with early stopping) is done on the development set. The DMN is trained via backpropagation and Adam (Kingma & Ba, 2014). We employ L2 regularization and dropout on the word embeddings. Word vectors are pre-trained using GloVe (Pennington et al., 2014).
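To make this training recipe concrete, here is a minimal NumPy sketch of one Adam update with an L2 penalty and inverted dropout on the word embeddings. The learning rate, regularization strength, and dropout rate are illustrative placeholders, not the hyperparameters tuned in the paper.

```python
import numpy as np

def adam_step(theta, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, l2=1e-5):
    """One Adam update (Kingma & Ba, 2014) with an L2 penalty folded into the gradient."""
    grad = grad + l2 * theta                          # L2 regularization
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

def dropout_embeddings(E, rate, rng):
    """Inverted dropout applied to the word-embedding matrix during training."""
    mask = (rng.random(E.shape) > rate) / (1.0 - rate)
    return E * mask

# Toy usage: embeddings initialized from stand-in pre-trained vectors.
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(100, 50))             # would be GloVe vectors in practice
E_dropped = dropout_embeddings(E, rate=0.1, rng=rng)  # used in the forward pass while training
state = {"m": np.zeros_like(E), "v": np.zeros_like(E), "t": 0}
grad = rng.normal(scale=0.01, size=E.shape)           # gradient from backpropagation (placeholder)
E = adam_step(E, grad, state)                         # one optimization step
```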
4.1. Question Answering

The Facebook bAbI dataset is a synthetic dataset for testing a model's ability to retrieve facts and reason over them. Each task tests a different skill that a question answering
model ought to have, such as coreference resolution, deduction, and induction. Showing that an ability exists here is not sufficient to conclude a model would also exhibit it on real-world text data. It is, however, a necessary condition.

Training on the bAbI dataset uses the following objective function: $J = \alpha E_{CE}(\mathrm{Gates}) + \beta E_{CE}(\mathrm{Answers})$, where $E_{CE}$ is the standard cross-entropy cost and $\alpha$ and $\beta$ are hyperparameters. In practice, we begin training with $\alpha$ set to 1 and $\beta$ set to 0, and then later switch $\beta$ to 1 while keeping $\alpha$ at 1. As described in Section 2.1, the input module outputs fact representations by taking the encoder hidden states at time steps corresponding to the end-of-sentence tokens. The gate supervision aims to select one sentence per pass; thus, we also experimented with modifying Eq. 8 to use a simple softmax instead of a GRU. Here, we compute the final episode vector via $e^i = \sum_{t=1}^{T} \mathrm{softmax}(g^i_t)\, c_t$, where $\mathrm{softmax}(g^i_t) = \frac{\exp(g^i_t)}{\sum_{j=1}^{T} \exp(g^i_j)}$, and $g^i_t$ here is the value of the gate before the sigmoid. This setting achieves better results, likely because the softmax encourages sparsity and is better suited to picking one sentence at a time.
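A minimal NumPy sketch of this softmax variant of the episode and of the composite objective follows. The gate values, fact vectors, and the α, β values shown are illustrative placeholders, not supervised or learned quantities from the experiments.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)                       # numerical stability
    e = np.exp(x)
    return e / e.sum()

def soft_episode(gates, facts):
    """e^i = sum_t softmax(g^i_t) c_t, with g^i_t the pre-sigmoid gate values."""
    weights = softmax(np.asarray(gates))    # attention distribution over sentences
    return weights @ np.stack(facts)

def babi_objective(gate_loss, answer_loss, alpha, beta):
    """J = alpha * E_CE(Gates) + beta * E_CE(Answers)."""
    return alpha * gate_loss + beta * answer_loss

# Toy usage: three candidate facts, one attention pass.
facts = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
gates = [0.2, 2.5, -1.0]                    # pre-sigmoid gate scores for this pass
e = soft_episode(gates, facts)
J = babi_objective(gate_loss=0.7, answer_loss=1.3, alpha=1.0, beta=0.0)  # early-training schedule
```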
corporate tree structure in the retrieval process.
We list results in Table 1. The DMN does worse than
the Memory Network, which we refer to from here on as
4.3. Sequence Tagging: Part-of-Speech Tagging
MemNN, on tasks 2 and 3, both tasks with long input se-
quences. We suspect that this is due to the recurrent input Part-of-speech tagging is traditionally modeled as a se-
sequence model having trouble modeling very long inputs. quence tagging problem: every word in a sentence is to
Table 5. An example of what the DMN focuses on during each episode on a real query in the bAbI task. Darker colors mean that the
attention weight is higher.
Figure 4. Attention weights for sentiment examples that were only labeled correctly by a DMN with two episodes. The y-axis shows the episode number. This sentence demonstrates a case where the ability to iterate allows the DMN to sharply focus on relevant words.

Figure 5. These sentences demonstrate cases where initially positive words lost their importance after the entire sentence context became clear, either through a contrastive conjunction ("but") or a modified action ("best described").
Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259, 2014a.

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014b.

Chung, J., Gülçehre, Ç., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.

Dusek, J. A. and Eichenbaum, H. The hippocampus and memory for orderly stimulus relations. Proceedings of the National Academy of Sciences, 94(13):7109–7114, 1997.

Eichenbaum, H. and Cohen, N. J. From Conditioning to Conscious Recollection: Memory Systems of the Brain (Oxford Psychology). Oxford University Press, 1 edition, 2004. ISBN 0195178041.

Elman, J. L. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7(2-3):195–225, 1991.

Graves, A., Wayne, G., and Danihelka, I. Neural Turing machines. CoRR, abs/1410.5401, 2014.

Heckers, S., Zalesak, M., Weiss, A. P., Ditman, T., and Titone, D. Hippocampal activation during transitive inference in humans. Hippocampus, 14:153–162, 2004.

Hermann, K. M., Kočiský, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., and Blunsom, P. Teaching machines to read and comprehend. In NIPS, 2015.

Kalchbrenner, N., Grefenstette, E., and Blunsom, P. A convolutional neural network for modelling sentences. In ACL, 2014.

Karpathy, A. and Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.

Kim, Y. Convolutional neural networks for sentence classification. In EMNLP, 2014.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

Le, Q. V. and Mikolov, T. Distributed representations of sentences and documents. In ICML, 2014.

Malinowski, M. and Fritz, M. A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS, 2014.

Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), June 1993.

Mikolov, T. and Zweig, G. Context dependent recurrent neural network language model. In SLT, pp. 234–239. IEEE, 2012. ISBN 978-1-4673-5125-6.

Passos, A., Kumar, V., and McCallum, A. Lexicon infused phrase embeddings for named entity resolution. In Conference on Computational Natural Language Learning. Association for Computational Linguistics, June 2014.

Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In EMNLP, 2014.
Yang, Z., He, X., Gao, J., Deng, L., and Smola, A. Stacked
attention networks for image question answering. arXiv
preprint arXiv:1511.02274, 2015.
Yates, A., Banko, M., Broadhead, M., Cafarella, M. J., Et-
zioni, O., and Soderland, S. Textrunner: Open informa-
tion extraction on the web. In HLT-NAACL (Demonstra-
tions), 2007.