
Ask Me Anything:

Dynamic Memory Networks for Natural Language Processing

Ankit Kumar*, Ozan Irsoy*, Peter Ondruska*, Mohit Iyyer*, James Bradbury,
Ishaan Gulrajani*, Victor Zhong*, Romain Paulus, Richard Socher
{JAMES.BRADBURY, RPAULUS, RSOCHER}@SALESFORCE.COM
Salesforce Inc., CA, USA. *Authors were interns at MetaMind.

Abstract

Most tasks in natural language processing can be cast into question answering (QA) problems over language input. We introduce the dynamic memory network (DMN), a neural network architecture which processes input sequences and questions, forms episodic memories, and generates relevant answers. Questions trigger an iterative attention process which allows the model to condition its attention on the inputs and the result of previous iterations. These results are then reasoned over in a hierarchical recurrent sequence model to generate answers. The DMN can be trained end-to-end and obtains state-of-the-art results on several types of tasks and datasets: question answering (Facebook's bAbI dataset), text classification for sentiment analysis (Stanford Sentiment Treebank) and sequence modeling for part-of-speech tagging (WSJ-PTB). The training for these different tasks relies exclusively on trained word vector representations and input-question-answer triplets.

I: Jane went to the hallway.
I: Mary walked to the bathroom.
I: Sandra went to the garden.
I: Daniel went back to the garden.
I: Sandra took the milk there.
Q: Where is the milk?
A: garden

I: It started boring, but then it got interesting.
Q: What's the sentiment?
A: positive
Q: POS tags?
A: PRP VBD JJ , CC RB PRP VBD JJ .

Figure 1. Example inputs and questions, together with answers generated by a dynamic memory network trained on the corresponding task. In sequence modeling tasks, an answer mechanism is triggered at each input word instead of only at the end.

1. Introduction

Question answering (QA) is a complex natural language processing task which requires an understanding of the meaning of a text and the ability to reason over relevant facts. Most, if not all, tasks in natural language processing can be cast as a question answering problem: high-level tasks like machine translation (What is the translation into French?); sequence modeling tasks like named entity recognition (NER) (Passos et al., 2014) (What are the named entity tags in this sentence?) or part-of-speech tagging (POS) (What are the part-of-speech tags?); classification problems like sentiment analysis (Socher et al., 2013) (What is the sentiment?); even multi-sentence joint classification problems like coreference resolution (Who does "their" refer to?).

We propose the Dynamic Memory Network (DMN), a neural network based framework for general question answering tasks that is trained using raw input-question-answer triplets. Generally, it can solve sequence tagging tasks, classification problems, sequence-to-sequence tasks and question answering tasks that require transitive reasoning. The DMN first computes a representation for all inputs and the question. The question representation then triggers an iterative attention process that searches the inputs and retrieves relevant facts. The DMN memory module then reasons over the retrieved facts and provides a vector representation of all relevant information to an answer module which generates the answer.

Fig. 1 provides examples of inputs, questions and answers for tasks that are evaluated in this paper and for which a DMN achieves a new level of state-of-the-art performance.

2. Dynamic Memory Networks

We now give an overview of the modules that make up the DMN. We then examine each module in detail and give intuitions about its formulation. A high-level illustration of the DMN is shown in Fig. 2.

Input Module: The input module encodes raw text inputs from the task into distributed vector representations. In this paper, we focus on natural language related problems. In these cases, the input may be a sentence, a long story, a movie review, a news article, or several Wikipedia articles.

Question Module: Like the input module, the question module encodes the question of the task into a distributed vector representation. For example, in the case of question answering, the question may be a sentence such as Where did the author first fly?. The representation is fed into the episodic memory module, and forms the basis, or initial state, upon which the episodic memory module iterates.

Episodic Memory Module: Given a collection of input representations, the episodic memory module chooses which parts of the inputs to focus on through the attention mechanism. It then produces a "memory" vector representation taking into account the question as well as the previous memory. Each iteration provides the module with newly relevant information about the input. In other words, the module has the ability to retrieve new information, in the form of input representations, which were thought to be irrelevant in previous iterations.

Answer Module: The answer module generates an answer from the final memory vector of the memory module.

Figure 2. Overview of DMN modules. Communication between them is indicated by arrows and uses vector representations. Questions trigger gates which allow vectors for certain inputs to be given to the episodic memory module. The final state of the episodic memory is the input to the answer module.

A detailed visualization of these modules is shown in Fig. 3.

2.1. Input Module

In natural language processing problems, the input is a sequence of T_I words w_1, ..., w_{T_I}. One way to encode the input sequence is via a recurrent neural network (Elman, 1991). Word embeddings are given as inputs to the recurrent network. At each time step t, the network updates its hidden state h_t = RNN(L[w_t], h_{t-1}), where L is the embedding matrix and w_t is the word index of the t-th word of the input sequence.

In cases where the input sequence is a single sentence, the input module outputs the hidden states of the recurrent network. In cases where the input sequence is a list of sentences, we concatenate the sentences into a long list of word tokens, inserting after each sentence an end-of-sentence token. The hidden states at each of the end-of-sentence tokens are then the final representations of the input module. In subsequent sections, we denote the output of the input module as the sequence of T_C fact representations c, whereby c_t denotes the t-th element in the output sequence of the input module. Note that in the case where the input is a single sentence, T_C = T_I. That is, the number of output representations is equal to the number of words in the sentence. In the case where the input is a list of sentences, T_C is equal to the number of sentences.

Choice of recurrent network: In our experiments, we use a gated recurrent network (GRU) (Cho et al., 2014a; Chung et al., 2014). We also explored the more complex LSTM (Hochreiter & Schmidhuber, 1997) but it performed similarly and is more computationally expensive. Both work much better than the standard tanh RNN and we postulate that the main strength comes from having gates that allow the model to suffer less from the vanishing gradient problem (Hochreiter & Schmidhuber, 1997). Assume each time step t has an input x_t and a hidden state h_t. The internal mechanics of the GRU are defined as:

z_t = \sigma( W^{(z)} x_t + U^{(z)} h_{t-1} + b^{(z)} )            (1)
r_t = \sigma( W^{(r)} x_t + U^{(r)} h_{t-1} + b^{(r)} )            (2)
\tilde{h}_t = \tanh( W x_t + r_t \circ U h_{t-1} + b^{(h)} )       (3)
h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t              (4)

where \circ is an element-wise product, W^{(z)}, W^{(r)}, W \in R^{n_H \times n_I} and U^{(z)}, U^{(r)}, U \in R^{n_H \times n_H}. The dimensions n are hyperparameters. We abbreviate the above computation with h_t = GRU(x_t, h_{t-1}).
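To make the update equations concrete, the following is a minimal NumPy sketch of a single GRU step following Eqs. (1)-(4); the helper names, weight shapes and toy initialization are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step, h_t = GRU(x_t, h_{t-1}), following Eqs. (1)-(4)."""
    Wz, Uz, bz, Wr, Ur, br, W, U, bh = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)             # update gate, Eq. (1)
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)             # reset gate, Eq. (2)
    h_tilde = np.tanh(W @ x_t + r * (U @ h_prev) + bh)   # candidate state, Eq. (3)
    return z * h_prev + (1.0 - z) * h_tilde              # Eq. (4)

# Toy usage: n_I = 4 (input size), n_H = 3 (hidden size).
rng = np.random.default_rng(0)
n_I, n_H = 4, 3
params = [rng.standard_normal(s) * 0.1 for s in
          [(n_H, n_I), (n_H, n_H), (n_H,),
           (n_H, n_I), (n_H, n_H), (n_H,),
           (n_H, n_I), (n_H, n_H), (n_H,)]]
h = np.zeros(n_H)
for x in rng.standard_normal((5, n_I)):   # a sequence of 5 input vectors
    h = gru_step(x, h, params)
print(h.shape)  # (3,)
```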
2.2. Question Module

Similar to the input sequence, the question is also most commonly given as a sequence of words in natural language processing problems. As before, we encode the question via a recurrent neural network. Given a question of T_Q words, the hidden state of the question encoder at time t is given by q_t = GRU(L[w^Q_t], q_{t-1}), where L represents the word embedding matrix as in the previous section and w^Q_t represents the word index of the t-th word in the question. We share the word embedding matrix across the input module and the question module. Unlike the input module, the question module produces as output the final hidden state of the recurrent network encoder: q = q_{T_Q}.
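Putting the two encoders together, the sketch below illustrates how fact vectors c_t could be read off at end-of-sentence positions while the question vector q is the encoder's final state, with a shared embedding matrix L. The tokenization, the <eos> marker and the plain tanh recurrence standing in for the GRU are assumptions made to keep the example self-contained.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_in, W_hh):
    # Stand-in for GRU(x_t, h_{t-1}) from Eqs. (1)-(4); a plain tanh RNN keeps
    # the sketch short, the interface is the same.
    return np.tanh(W_in @ x_t + W_hh @ h_prev)

rng = np.random.default_rng(1)
dim_emb, dim_hid = 8, 6
vocab = {w: i for i, w in enumerate(
    "<eos> jane went to the hallway mary walked bathroom where is".split())}
L = rng.standard_normal((len(vocab), dim_emb)) * 0.1   # shared embedding matrix
W_in = rng.standard_normal((dim_hid, dim_emb)) * 0.1
W_hh = rng.standard_normal((dim_hid, dim_hid)) * 0.1

def encode_facts(token_ids):
    """Run the encoder over concatenated sentences; keep hidden states at <eos>."""
    h, facts = np.zeros(dim_hid), []
    for idx in token_ids:
        h = rnn_step(L[idx], h, W_in, W_hh)
        if idx == vocab["<eos>"]:
            facts.append(h)          # c_t: one fact vector per sentence
    return facts

def encode_question(token_ids):
    """q is simply the final hidden state of the question encoder."""
    h = np.zeros(dim_hid)
    for idx in token_ids:
        h = rnn_step(L[idx], h, W_in, W_hh)
    return h

story = "jane went to the hallway <eos> mary walked to the bathroom <eos>".split()
question = "where is mary".split()
c = encode_facts([vocab[w] for w in story])
q = encode_question([vocab[w] for w in question])
print(len(c), q.shape)   # 2 fact vectors, question vector of size 6
```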

Figure 3. Real example of an input list of sentences and the attention gates that are triggered by a specific question from the bAbI tasks (Weston et al., 2015a). Gate values g^i_t are shown above the corresponding vectors. The gates change with each search over inputs. We do not draw connections for gates that are close to zero. Note that the second iteration has wrongly placed some weight in sentence 2, which makes some intuitive sense, as sentence 2 is another place John had been.

2.3. Episodic Memory Module

In its general form, the episodic memory module is comprised of an internal memory, an attention mechanism and a recurrent network to update its memory. During each iteration, the attention mechanism attends over the fact representations c by using a gating function (described below) while taking into consideration the question representation q and the previous memory m^{i-1} to produce an episode e^i.

The episode is then used, alongside the previous memory m^{i-1}, to update the episodic memory m^i = GRU(e^i, m^{i-1}). The initial state of this GRU is initialized to the question vector itself: m^0 = q. For some tasks, it is beneficial for the episodic memory module to take multiple passes over the input. After T_M passes, the final memory m^{T_M} is given to the answer module.

Need for Multiple Episodes: The iterative nature of this module allows it to attend to different inputs during each pass. It also allows for a type of transitive inference, since the first pass may uncover the need to retrieve additional facts. For instance, in the example in Fig. 3, we are asked Where is the football? In the first iteration, the model ought to attend to sentence 7 (John put down the football.), as the question asks about the football. Only once the model sees that John is relevant can it reason that the second iteration should retrieve where John was. Similarly, a second pass may help for sentiment analysis, as we show in the experiments section below.

Attention Mechanism: In our work, we use a gating function as our attention mechanism. For each pass i, the mechanism takes as input a candidate fact c_t, a previous memory m^{i-1}, and the question q to compute a gate: g^i_t = G(c_t, m^{i-1}, q).

The scoring function G takes as input the feature set z(c, m, q) and produces a scalar score. We first define a large feature vector that captures a variety of similarities between input, memory and question vectors:

z(c, m, q) = [ c, m, q, c \circ q, c \circ m, |c - q|, |c - m|, c^T W^{(b)} q, c^T W^{(b)} m ],   (5)

where \circ is the element-wise product. The function G is a simple two-layer feed-forward neural network:

G(c, m, q) = \sigma( W^{(2)} \tanh( W^{(1)} z(c, m, q) + b^{(1)} ) + b^{(2)} ).   (6)
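The following NumPy sketch shows one plausible implementation of this gating function: it builds the similarity feature vector z(c, m, q) of Eq. (5) and scores it with the two-layer network of Eq. (6). The weight initialization and the chosen dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def z_features(c, m, q, Wb):
    """Feature set z(c, m, q) of Eq. (5): raw vectors, element-wise products,
    absolute differences, and two bilinear similarity terms."""
    return np.concatenate([
        c, m, q,
        c * q, c * m,
        np.abs(c - q), np.abs(c - m),
        [c @ Wb @ q], [c @ Wb @ m],
    ])

def attention_gate(c, m, q, Wb, W1, b1, W2, b2):
    """Two-layer scoring network G of Eq. (6); returns a scalar gate in (0, 1)."""
    z = z_features(c, m, q, Wb)
    return sigmoid(W2 @ np.tanh(W1 @ z + b1) + b2).item()

# Toy usage with hidden size d = 4 and an attention layer of size 8.
rng = np.random.default_rng(2)
d, d_att = 4, 8
z_dim = 7 * d + 2                     # 7 vector blocks of size d plus 2 scalars
Wb = rng.standard_normal((d, d)) * 0.1
W1, b1 = rng.standard_normal((d_att, z_dim)) * 0.1, np.zeros(d_att)
W2, b2 = rng.standard_normal((1, d_att)) * 0.1, np.zeros(1)

c, m, q = (rng.standard_normal(d) for _ in range(3))
g = attention_gate(c, m, q, Wb, W1, b1, W2, b2)
print(g)   # the gate value for this fact on this pass
```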

Some datasets, such as Facebook's bAbI dataset, specify which facts are important for a given question. In those cases, the attention mechanism of the G function can be trained in a supervised fashion with a standard cross-entropy cost function.

Memory Update Mechanism: To compute the episode for pass i, we employ a modified GRU over the sequence of the inputs c_1, ..., c_{T_C}, weighted by the gates g^i. The episode vector that is given to the answer module is the final state of the GRU. The equation to update the hidden states of the GRU at time t and the equation to compute the episode are, respectively:

h^i_t = g^i_t GRU(c_t, h^i_{t-1}) + (1 - g^i_t) h^i_{t-1}   (7)
e^i = h^i_{T_C}                                             (8)

Criteria for Stopping: The episodic memory module also has a signal to stop iterating over inputs. To achieve this, we append a special end-of-passes representation to the input, and stop the iterative attention process if this representation is chosen by the gate function. For datasets without explicit supervision, we set a maximum number of iterations. The whole module is end-to-end differentiable.
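Below is a minimal sketch of one way to implement the episode computation of Eqs. (7)-(8) together with the multi-pass memory update m^i = GRU(e^i, m^{i-1}). The gate_fn and gru_fn arguments are placeholders for the attention gate of Eq. (6) and the GRU of Eqs. (1)-(4); here they are stubbed with simple functions so the example runs on its own and is not the trained model.

```python
import numpy as np

def episode(facts, memory, question, gate_fn, gru_fn, hidden_size):
    """Eqs. (7)-(8): a gate-weighted GRU pass over the facts c_1..c_{T_C}."""
    h = np.zeros(hidden_size)
    for c_t in facts:
        g = gate_fn(c_t, memory, question)            # scalar gate g_t^i
        h = g * gru_fn(c_t, h) + (1.0 - g) * h        # Eq. (7)
    return h                                          # e^i = h_{T_C}^i, Eq. (8)

def episodic_memory(facts, question, gate_fn, gru_fn, hidden_size, num_passes):
    """Multi-pass loop: m^0 = q and m^i = GRU(e^i, m^{i-1})."""
    m = question.copy()
    for _ in range(num_passes):                       # T_M passes
        e = episode(facts, m, question, gate_fn, gru_fn, hidden_size)
        m = gru_fn(e, m)                              # memory update GRU
    return m

# Stub components so the sketch is self-contained.
rng = np.random.default_rng(3)
d = 4
W_in, W_hh = rng.standard_normal((d, d)) * 0.1, rng.standard_normal((d, d)) * 0.1
gru_stub = lambda x, h: np.tanh(W_in @ x + W_hh @ h)                 # stands in for GRU(x, h)
gate_stub = lambda c, m, q: 1.0 / (1.0 + np.exp(-(c @ q + c @ m)))   # stands in for Eq. (6)

facts = [rng.standard_normal(d) for _ in range(5)]
q = rng.standard_normal(d)
m_final = episodic_memory(facts, q, gate_stub, gru_stub, d, num_passes=3)
print(m_final.shape)   # (4,) -> handed to the answer module
```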
2.4. Answer Module

The answer module generates an answer given a vector. Depending on the type of task, the answer module is either triggered once at the end of the episodic memory or at each time step.

We employ another GRU whose initial state is initialized to the last memory a_0 = m^{T_M}. At each timestep, it takes as input the question q, the last hidden state a_{t-1}, as well as the previously predicted output y_{t-1}:

y_t = softmax( W^{(a)} a_t )          (9)
a_t = GRU( [y_{t-1}, q], a_{t-1} ),   (10)

where we concatenate the last generated word and the question vector as the input at each time step. The output is trained with the cross-entropy error classification of the correct sequence appended with a special end-of-sequence token.

In the sequence modeling task, we wish to label each word in the original sequence. To this end, the DMN is run in the same way as above over the input words. For word t, we replace Eq. 8 with e^i = h^i_t. Note that the gates for the first pass will be the same for each word, as the question is the same. This allows for a speed-up in implementation by computing these gates only once. However, gates for subsequent passes will be different, as the episodes are different.
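A minimal sketch of the decoding loop of Eqs. (9)-(10) follows: starting from a_0 = m^{T_M}, each step feeds the previous prediction concatenated with the question back into a GRU and stops at an assumed end-of-sequence index. As before, the GRU is stubbed with a simple recurrent step and the sizes are illustrative, so the snippet is self-contained rather than the authors' implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def answer_module(memory, question, Wa, gru_fn, eos_index, max_len=10):
    """Greedy decoding with y_t = softmax(W^(a) a_t), a_t = GRU([y_{t-1}, q], a_{t-1})."""
    vocab_size = Wa.shape[0]
    a = memory.copy()                       # a_0 = m^{T_M}
    y = np.zeros(vocab_size)                # no previous prediction at t = 0
    answer = []
    for _ in range(max_len):
        a = gru_fn(np.concatenate([y, question]), a)   # Eq. (10)
        y = softmax(Wa @ a)                            # Eq. (9)
        token = int(np.argmax(y))                      # greedy choice
        if token == eos_index:
            break
        answer.append(token)
    return answer

# Self-contained toy setup (illustrative sizes, stubbed GRU).
rng = np.random.default_rng(4)
d, vocab_size = 4, 6
W_in = rng.standard_normal((d, vocab_size + d)) * 0.1
W_hh = rng.standard_normal((d, d)) * 0.1
gru_stub = lambda x, h: np.tanh(W_in @ x + W_hh @ h)
Wa = rng.standard_normal((vocab_size, d)) * 0.1

m, q = rng.standard_normal(d), rng.standard_normal(d)
print(answer_module(m, q, Wa, gru_stub, eos_index=0))   # list of predicted token ids
```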
2.5. Training

Training is cast as a supervised classification problem to minimize the cross-entropy error of the answer sequence. For datasets with gate supervision, such as bAbI, we add the cross-entropy error of the gates into the overall cost. Because all modules communicate via vector representations and are built from differentiable, gated neural networks, the entire DMN model can be trained via backpropagation and gradient descent.
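The overall cost described here can be written as a weighted sum of the answer cross-entropy and, when gate supervision is available, the gate cross-entropy. The sketch below assumes softmax-normalized model outputs and only shows how the two terms combine; the weighting names alpha and beta anticipate the objective used for bAbI in Section 4.1.

```python
import numpy as np

def cross_entropy(probs, target_index, eps=1e-12):
    """Standard cross-entropy for one softmax distribution and its gold index."""
    return -np.log(probs[target_index] + eps)

def dmn_loss(answer_probs, answer_targets, gate_probs=None, gate_targets=None,
             alpha=1.0, beta=1.0):
    """J = alpha * E_CE(Gates) + beta * E_CE(Answers); the gate term is optional."""
    loss = beta * sum(cross_entropy(p, t) for p, t in zip(answer_probs, answer_targets))
    if gate_probs is not None:
        loss += alpha * sum(cross_entropy(p, t) for p, t in zip(gate_probs, gate_targets))
    return loss

# Toy example: a 2-token answer over a 5-word vocabulary, plus supervision for
# which of 4 facts each of 2 passes should select.
answer_probs = [np.array([0.1, 0.6, 0.1, 0.1, 0.1]), np.array([0.2, 0.1, 0.5, 0.1, 0.1])]
gate_probs = [np.array([0.7, 0.1, 0.1, 0.1]), np.array([0.1, 0.1, 0.7, 0.1])]
print(dmn_loss(answer_probs, [1, 2], gate_probs, [0, 2]))
```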
3. Related Work

Given the many shoulders on which this paper is standing and the many applications to which our model is applied, it is impossible to do related fields justice.

Deep Learning: There are several deep learning models that have been applied to many different tasks in NLP. For instance, recursive neural networks have been used for parsing (Socher et al., 2011), sentiment analysis (Socher et al., 2013), paraphrase detection (Socher et al., 2011), question answering (Iyyer et al., 2014) and logical inference (Bowman et al., 2014), among other tasks. However, because they lack the memory and question modules, a single model cannot solve as many varied tasks, nor tasks that require transitive reasoning over multiple sentences. Another commonly used model is the chain-structured recurrent neural network of the kind we employ above. Recurrent neural networks have been successfully used in language modeling (Mikolov & Zweig, 2012), speech recognition, and sentence generation from images (Karpathy & Fei-Fei, 2015). Also relevant is the sequence-to-sequence model used for machine translation by Sutskever et al. (2014). This model uses two extremely large and deep LSTMs to encode a sentence in one language and then decode the sentence in another language. This sequence-to-sequence model is a special case of the DMN without a question and without episodic memory. Instead it maps an input sequence directly to an answer sequence.

Attention and Memory: The second line of work that is very relevant to DMNs is that of attention and memory in deep learning. Attention mechanisms are generally useful and can improve image classification (Stollenga et al., 2014), automatic image captioning (Xu et al., 2015) and machine translation (Cho et al., 2014b; Bahdanau et al., 2014). Neural Turing machines use memory to solve algorithmic problems such as list sorting (Graves et al., 2014). The recent work by Weston et al. on memory networks (Weston et al., 2015b) focuses on adding a memory component for natural language question answering. They have an input (I) and response (R) component, and their generalization (G) and output feature map (O) components have some functional overlap with our episodic memory. However, the Memory Network cannot be applied to the same variety of NLP tasks since it processes sentences independently and not via a sequence model. It requires bag-of-n-gram vector features as well as a separate feature that captures whether a sentence came before another one.

Various other neural memory or attention architectures have recently been proposed for algorithmic problems (Joulin & Mikolov, 2015; Kaiser & Sutskever, 2015), caption generation for images (Malinowski & Fritz, 2014; Chen & Zitnick, 2014), visual question answering (Yang et al., 2015) or other NLP problems and datasets (Hermann et al., 2015).

In contrast, the DMN employs neural sequence models for input representation, attention, and response mechanisms, thereby naturally capturing position and temporality. As a result, the DMN is directly applicable to a broader range of applications without feature engineering. We compare directly to Memory Networks on the bAbI dataset (Weston et al., 2015a).

NLP Applications: The DMN is a general model which we apply to several NLP problems. We compare to what, to the best of our knowledge, is the current state-of-the-art method for each task.

There are many different approaches to question answering: some build large knowledge bases (KBs) with open information extraction systems (Yates et al., 2007), some use neural networks, dependency trees and KBs (Bordes et al., 2012), others only sentences (Iyyer et al., 2014). A lot of other approaches exist. When QA systems do not produce the right answer, it is often unclear if it is because they do not have access to the facts, cannot reason over them, or have never seen this type of question or phenomenon. Most QA datasets only have a few hundred questions and answers but require complex reasoning. They can hence not be solved by models that have to learn purely from examples. While synthetic datasets (Weston et al., 2015a) have problems and can often be solved easily with manual feature engineering, they let us disentangle failure modes of models and understand necessary QA capabilities. They are useful for analyzing models that attempt to learn everything and do not rely on external features like coreference, POS, parsing, logical rules, etc. The DMN is such a model. Another related model by Andreas et al. (2016) combines neural and logical reasoning for question answering over knowledge bases and visual question answering.

Sentiment analysis is a very useful classification task and recently the Stanford Sentiment Treebank (Socher et al., 2013) has become a standard benchmark dataset. Kim (2014) reports the previous state-of-the-art result based on a convolutional neural network that uses multiple word vector representations. The previous best model for part-of-speech tagging on the Wall Street Journal section of the Penn Treebank (Marcus et al., 1993) was that of Søgaard (2011), who used a semisupervised nearest neighbor approach. We also directly compare to paragraph vectors (Le & Mikolov, 2014).

Neuroscience: The episodic memory in humans stores specific experiences in their spatial and temporal context. For instance, it might contain the first memory somebody has of flying a hang glider. Eichenbaum and Cohen have argued that episodic memories represent a form of relationship (i.e., relations between spatial, sensory and temporal information) and that the hippocampus is responsible for general relational learning (Eichenbaum & Cohen, 2004). Interestingly, it also appears that the hippocampus is active during transitive inference (Heckers et al., 2004), and disruption of the hippocampus impairs this ability (Dusek & Eichenbaum, 1997).

The episodic memory module in the DMN is inspired by these findings. It retrieves specific temporal states that are related to or triggered by a question. Furthermore, we found that the GRU in this module was able to do some transitive inference over the simple facts in the bAbI dataset. This module also has similarities to the Temporal Context Model (Howard & Kahana, 2002) and its Bayesian extensions (Socher et al., 2009), which were developed to analyze human behavior in word recall experiments.

4. Experiments

We include experiments on question answering, part-of-speech tagging, and sentiment analysis. The model is trained independently for each problem, while the architecture remains the same except for the answer module and input fact subsampling (words vs. sentences). The answer module, as described in Section 2.4, is triggered either once at the end or for each token.

For all datasets we used either the official train, development and test splits or, if no development set was defined, we used 10% of the training set for development. Hyperparameter tuning and model selection (with early stopping) is done on the development set. The DMN is trained via backpropagation and Adam (Kingma & Ba, 2014). We employ L2 regularization, and dropout on the word embeddings. Word vectors are pre-trained using GloVe (Pennington et al., 2014).
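As a concrete illustration of this training setup, the sketch below shows one way to wire up Adam, L2 regularization (via weight decay) and dropout on the word embeddings in PyTorch. The hyperparameter values and the tiny stand-in model are assumptions for illustration, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class TinyDMNStub(nn.Module):
    """Stand-in for the full DMN: embeddings with dropout feeding a GRU encoder."""
    def __init__(self, vocab_size=1000, emb_dim=50, hidden_dim=40, num_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # would be GloVe-initialized
        self.embed_dropout = nn.Dropout(p=0.3)           # dropout on word embeddings
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        x = self.embed_dropout(self.embed(token_ids))
        _, h = self.encoder(x)
        return self.out(h[-1])

model = TinyDMNStub()
# Adam with weight decay acting as L2 regularization.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
criterion = nn.CrossEntropyLoss()

tokens = torch.randint(0, 1000, (8, 12))     # batch of 8 sequences, length 12
labels = torch.randint(0, 5, (8,))
optimizer.zero_grad()
loss = criterion(model(tokens), labels)
loss.backward()
optimizer.step()
```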
4.1. Question Answering

The Facebook bAbI dataset is a synthetic dataset for testing a model's ability to retrieve facts and reason over them. Each task tests a different skill that a question answering model ought to have, such as coreference resolution, deduction, and induction. Showing an ability exists here is not sufficient to conclude a model would also exhibit it on real world text data. It is, however, a necessary condition.

Task                           MemNN     DMN
1: Single Supporting Fact        100     100
2: Two Supporting Facts          100     98.2
3: Three Supporting Facts        100     95.2
4: Two Argument Relations        100     100
5: Three Argument Relations       98     99.3
6: Yes/No Questions              100     100
7: Counting                       85     96.9
8: Lists/Sets                     91     96.5
9: Simple Negation               100     100
10: Indefinite Knowledge          98     97.5
11: Basic Coreference            100     99.9
12: Conjunction                  100     100
13: Compound Coreference         100     99.8
14: Time Reasoning                99     100
15: Basic Deduction              100     100
16: Basic Induction              100     99.4
17: Positional Reasoning          65     59.6
18: Size Reasoning                95     95.3
19: Path Finding                  36     34.5
20: Agent's Motivations          100     100
Mean Accuracy (%)               93.3     93.6

Table 1. Test accuracies on the bAbI dataset. MemNN numbers taken from Weston et al. (2015a). The DMN passes (accuracy > 95%) 18 tasks, whereas the MemNN passes 16.

Training on the bAbI dataset uses the following objective function: J = \alpha E_CE(Gates) + \beta E_CE(Answers), where E_CE is the standard cross-entropy cost and \alpha and \beta are hyperparameters. In practice, we begin training with \alpha set to 1 and \beta set to 0, and then later switch \beta to 1 while keeping \alpha at 1. As described in Section 2.1, the input module outputs fact representations by taking the encoder hidden states at time steps corresponding to the end-of-sentence tokens. The gate supervision aims to select one sentence per pass; thus, we also experimented with modifying Eq. 8 to a simple softmax instead of a GRU. Here, we compute the final episode vector via

e^i = \sum_{t=1}^{T} softmax(g^i_t) c_t,   where   softmax(g^i_t) = exp(g^i_t) / \sum_{j=1}^{T} exp(g^i_j),

and g^i_t here is the value of the gate before the sigmoid. This setting achieves better results, likely because the softmax encourages sparsity and is better suited to picking one sentence at a time.
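A minimal sketch of this softmax variant of the episode computation is shown below: raw (pre-sigmoid) gate scores are normalized over the facts and the episode is their weighted sum. The array shapes and values are illustrative.

```python
import numpy as np

def softmax_episode(facts, raw_gates):
    """e^i = sum_t softmax(g_t^i) * c_t over pre-sigmoid gate scores g_t^i."""
    weights = np.exp(raw_gates - raw_gates.max())
    weights = weights / weights.sum()
    # facts: (T, d) matrix of fact vectors; weights: (T,) attention distribution.
    return weights @ facts

rng = np.random.default_rng(5)
facts = rng.standard_normal((6, 4))        # T = 6 facts of dimension 4
raw_gates = np.array([0.2, 3.1, -0.5, 0.1, 0.0, -1.2])
e = softmax_episode(facts, raw_gates)
print(e.shape)   # (4,): an episode dominated by the highly scored second fact
```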
We list results in Table 1. The DMN does worse than the Memory Network, which we refer to from here on as MemNN, on tasks 2 and 3, both tasks with long input sequences. We suspect that this is due to the recurrent input sequence model having trouble modeling very long inputs. The MemNN does not suffer from this problem as it views each sentence separately. The power of the episodic memory module is evident in tasks 7 and 8, where the DMN outperforms the MemNN. Both tasks require the model to iteratively retrieve facts and store them in a representation that slowly incorporates more of the relevant information of the input sequence. Both models do poorly on tasks 17 and 19, though the MemNN does better. We suspect this is due to the MemNN using n-gram vectors and sequence position features.

4.2. Text Classification: Sentiment Analysis

The Stanford Sentiment Treebank (SST) (Socher et al., 2013) is a popular dataset for sentiment classification. It provides phrase-level fine-grained labels, and comes with a train/development/test split. We present results on two formats: fine-grained root prediction, where all full sentences (root nodes) of the test set are to be classified as either very negative, negative, neutral, positive, or very positive, and binary root prediction, where all non-neutral full sentences of the test set are to be classified as either positive or negative. To train the model, we use all full sentences as well as subsample 50% of phrase-level labels every epoch. During evaluation, the model is only evaluated on the full sentences (root setup). In binary classification, neutral phrases are removed from the dataset. The DMN achieves state-of-the-art accuracy on the binary classification task, as well as on the fine-grained classification task (Table 2).

Model        Binary    Fine-grained
MV-RNN         82.9    44.4
RNTN           85.4    45.7
DCNN           86.8    48.5
PVec           87.8    48.7
CNN-MC         88.1    47.4
DRNN           86.6    49.8
CT-LSTM        88.0    51.0
DMN            88.6    52.1

Table 2. Test accuracies for sentiment analysis on the Stanford Sentiment Treebank. MV-RNN and RNTN: Socher et al. (2013). DCNN: Kalchbrenner et al. (2014). PVec: Le & Mikolov (2014). CNN-MC: Kim (2014). DRNN: Irsoy & Cardie (2015). CT-LSTM: Tai et al. (2015).

In all experiments, the DMN was trained with GRU sequence models. It is easy to replace the GRU sequence model with any of the models listed above, as well as to incorporate tree structure in the retrieval process.

4.3. Sequence Tagging: Part-of-Speech Tagging

Part-of-speech tagging is traditionally modeled as a sequence tagging problem: every word in a sentence is to

be classified into its part-of-speech class (see Fig. 1). We evaluate on the standard Wall Street Journal dataset (Marcus et al., 1993). We use the standard splits of sections 0-18 for training, 19-21 for development and 22-24 for testing (Søgaard, 2011). Since this is a word-level tagging task, DMN memories are classified at each time step corresponding to each word. This is described in detail in Section 2.4's discussion of sequence modeling.

We compare the DMN with the results in (Søgaard, 2011). The DMN achieves state-of-the-art accuracy with a single model, reaching a development set accuracy of 97.5. Ensembling the top 4 development models, the DMN gets to 97.58 dev and 97.56 test accuracies, achieving a slightly higher new state-of-the-art (Table 3).

Model              Acc (%)
SVMTool              97.15
Sogaard              97.27
Suzuki et al.        97.40
Spoustova et al.     97.44
SCNN                 97.50
DMN                  97.56

Table 3. Test accuracies on WSJ-PTB.

4.4. Quantitative Analysis of Episodic Memory Module

The main novelty of the DMN architecture is in its episodic memory module. Hence, we analyze how important the episodic memory module is for NLP tasks and in particular how the number of passes over the input affects accuracy.

Table 4 shows the accuracies on a subset of bAbI tasks as well as on the Stanford Sentiment Treebank. We note that for several of the hard reasoning tasks, multiple passes over the inputs are crucial to achieving high performance. For sentiment the differences are smaller. However, two passes outperform a single pass or zero passes. In the latter case, there is no episodic memory at all and outputs are passed directly from the input module to the answer module. We note that especially complicated examples are more often correctly classified with 2 passes, but many examples in sentiment contain only simple sentiment words and no negation or misleading expressions; hence the need for a complicated architecture for them is small. The same is true for POS tagging, where differences in accuracy are less than 0.1 between different numbers of passes. Next, we show that the additional correct classifications are hard examples with mixed positive/negative vocabulary.

Max passes    task 3 (three-facts)    task 7 (count)    task 8 (lists/sets)    sentiment (fine grain)
0 pass                 0                    48.8               33.6                    50.0
1 pass                 0                    48.8               54.0                    51.5
2 pass              16.7                    49.1               55.6                    52.1
3 pass              64.7                    83.4               83.4                    50.1
5 pass              95.2                    96.9               96.5                    N/A

Table 4. Effectiveness of the episodic memory module across tasks. Each row shows the final accuracy (in percent) with a different maximum limit for the number of passes the episodic memory module can take. Note that for the 0-pass DMN, the network essentially reduces to the output of the attention module.

4.5. Qualitative Analysis of Episodic Memory Module

Apart from a quantitative analysis, we also show qualitatively what happens to the attention during multiple passes. We present specific examples from the experiments to illustrate that the iterative nature of the episodic memory module enables the model to focus on relevant parts of the input. For instance, Table 5 shows an example of what the DMN focuses on during each pass of a three-iteration scan on a question from the bAbI dataset.

We also evaluate the episodic memory module for sentiment analysis. Given that the DMN performs well with both one iteration and two iterations, we study test examples where the one-iteration DMN is incorrect and the two-episode DMN is correct. Looking at the sentences in Fig. 4 and 5, we make the following observations:

1. The attention of the two-iteration DMN is generally much more focused compared to that of the one-iteration DMN. We believe this is due to the fact that with fewer iterations over the input, the hidden states of the input module encoder have to capture more of the content of adjacent time steps. Hence, the attention mechanism cannot focus only on a few key time steps. Instead, it needs to pass all necessary information to the answer module from a single pass.

2. During the second iteration of the two-iteration DMN, the attention becomes significantly more focused on relevant key words, and less attention is paid to strong sentiment words that lose their sentiment in context. This is exemplified by the sentence in Fig. 5 that includes the very positive word "best." In the first iteration, the word "best" dominates the attention scores (darker color means larger score). However, once its context, "is best described", is clear, its relevance is diminished and "lukewarm" becomes more important.

We conclude that the ability of the episodic memory module to perform multiple passes over the data is beneficial. It provides significant benefits on harder bAbI tasks, which require reasoning over several pieces of information or transitive reasoning. Increasing the number of passes also slightly improves the performance on sentiment analysis,

though the difference is not as significant. We did not attempt more iterations for sentiment analysis as the model struggles with overfitting with three passes.

Question: Where was Mary before the Bedroom?
Answer: Cinema.

Facts:
Yesterday Julie traveled to the school.
Yesterday Marie went to the cinema.
This morning Julie traveled to the kitchen.
Bill went back to the cinema yesterday.
Mary went to the bedroom this morning.
Julie went back to the bedroom this afternoon.
[done reading]

Table 5. An example of what the DMN focuses on during each episode on a real query in the bAbI task. Darker colors mean that the attention weight is higher. (In the original figure, the attention placed on each fact during Episodes 1-3 is shown as shading.)

Figure 4. Attention weights for sentiment examples that were only labeled correctly by a DMN with two episodes. The y-axis shows the episode number. This sentence demonstrates a case where the ability to iterate allows the DMN to sharply focus on relevant words.

Figure 5. These sentences demonstrate cases where initially positive words lost their importance after the entire sentence context became clear, either through a contrastive conjunction ("but") or a modified action ("best described").

5. Conclusion

The DMN model is a potentially general architecture for a variety of NLP applications, including classification, question answering and sequence modeling. A single architecture is a first step towards a single joint model for multiple NLP problems. The DMN is trained end-to-end with one, albeit complex, objective function. Future work can explore ways to scale the model with larger inputs, which could be done by running an information retrieval system to filter the most relevant inputs before running the DMN, or by using a hierarchical attention module. Future work will also explore additional tasks, larger multi-task models and multimodal inputs and questions.

References

Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. Learning to Compose Neural Networks for Question Answering. arXiv preprint arXiv:1601.01705, 2016.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.

Bordes, A., Glorot, X., Weston, J., and Bengio, Y. Joint Learning of Words and Meaning Representations for Open-Text Semantic Parsing. In AISTATS, 2012.

Bowman, S. R., Potts, C., and Manning, C. D. Recursive neural networks for learning logical semantics. CoRR, abs/1406.1827, 2014.

Chen, X. and Zitnick, C. L. Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654, 2014.

Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259, 2014a.

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP, 2014b.

Chung, J., Gülçehre, Ç., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.

Dusek, J. A. and Eichenbaum, H. The hippocampus and memory for orderly stimulus relations. Proceedings of the National Academy of Sciences, 94(13):7109–7114, 1997.

Eichenbaum, H. and Cohen, N. J. From Conditioning to Conscious Recollection: Memory Systems of the Brain (Oxford Psychology). Oxford University Press, 1 edition, 2004. ISBN 0195178041.

Elman, J. L. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7(2-3):195–225, 1991.

Graves, A., Wayne, G., and Danihelka, I. Neural Turing machines. CoRR, abs/1410.5401, 2014.

Heckers, S., Zalesak, M., Weiss, A. P., Ditman, T., and Titone, D. Hippocampal activation during transitive inference in humans. Hippocampus, 14:153–62, 2004.

Hermann, K. M., Kočiský, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., and Blunsom, P. Teaching machines to read and comprehend. In NIPS, 2015.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, Nov 1997. ISSN 0899-7667.

Howard, M. W. and Kahana, M. J. A distributed representation of temporal context. Journal of Mathematical Psychology, 46(3):269–299, 2002.

Irsoy, O. and Cardie, C. Modeling compositionality with multiplicative recurrent neural networks. In ICLR, 2015.

Iyyer, M., Boyd-Graber, J., Claudino, L., Socher, R., and Daumé III, H. A neural network for factoid question answering over paragraphs. In EMNLP, 2014.

Joulin, A. and Mikolov, T. Inferring algorithmic patterns with stack-augmented recurrent nets. In NIPS, 2015.

Kaiser, L. and Sutskever, I. Neural GPUs Learn Algorithms. arXiv preprint arXiv:1511.08228, 2015.

Kalchbrenner, N., Grefenstette, E., and Blunsom, P. A convolutional neural network for modelling sentences. In ACL, 2014.

Karpathy, A. and Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.

Kim, Y. Convolutional neural networks for sentence classification. In EMNLP, 2014.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

Le, Q. V. and Mikolov, T. Distributed representations of sentences and documents. In ICML, 2014.

Malinowski, M. and Fritz, M. A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input. In NIPS, 2014.

Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), June 1993.

Mikolov, T. and Zweig, G. Context dependent recurrent neural network language model. In SLT, pp. 234–239. IEEE, 2012. ISBN 978-1-4673-5125-6.

Passos, A., Kumar, V., and McCallum, A. Lexicon infused phrase embeddings for named entity resolution. In Conference on Computational Natural Language Learning. Association for Computational Linguistics, June 2014.

Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In EMNLP, 2014.

Socher, R., Gershman, S., Perotte, A., Sederberg, P., Blei, D., and Norman, K. A Bayesian analysis of dynamics in free recall. In NIPS, 2009.
Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and
Manning, C. D. Dynamic Pooling and Unfolding Recur-
sive Autoencoders for Paraphrase Detection. In NIPS,
2011.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning,
C., Ng, A., and Potts, C. Recursive deep models for
semantic compositionality over a sentiment treebank. In
EMNLP, 2013.

Søgaard, A. Semisupervised condensed nearest neighbor for part-of-speech tagging. In ACL-HLT, 2011.

Stollenga, M. F., Masci, J., Gomez, F., and Schmidhuber, J. Deep Networks with Internal Selective Attention through Feedback Connections. In NIPS, 2014.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In NIPS, 2014.
Tai, K. S., Socher, R., and Manning, C. D. Improved se-
mantic representations from tree-structured long short-
term memory networks. In ACL, 2015.

Weston, J., Bordes, A., Chopra, S., and Mikolov, T. Towards AI-complete question answering: A set of prerequisite toy tasks. CoRR, abs/1502.05698, 2015a.
Weston, J., Chopra, S., and Bordes, A. Memory networks.
In ICLR, 2015b.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A. C.,
Salakhutdinov, R., Zemel, R. S., and Bengio, Y. Show,
attend and tell: Neural image caption generation with vi-
sual attention. CoRR, abs/1502.03044, 2015.

Yang, Z., He, X., Gao, J., Deng, L., and Smola, A. Stacked
attention networks for image question answering. arXiv
preprint arXiv:1511.02274, 2015.
Yates, A., Banko, M., Broadhead, M., Cafarella, M. J., Et-
zioni, O., and Soderland, S. Textrunner: Open informa-
tion extraction on the web. In HLT-NAACL (Demonstra-
tions), 2007.
