Learning to Rehearse in Long Sequence Memorization
Zhu Zhang*1,2, Chang Zhou*2, Jianxin Ma2, Zhijie Lin1, Jingren Zhou2, Hongxia Yang2, Zhou Zhao1
*Equal contribution. 1Zhejiang University, China. 2DAMO Academy, Alibaba Group, China. Correspondence to: Zhou Zhao <[email protected]>. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Existing reasoning tasks often make an important assumption that the input contents can always be accessed while reasoning, which requires unlimited storage resources and suffers from severe time delay on long sequences. To achieve efficient reasoning on long sequences with limited storage resources, memory augmented neural networks introduce a human-like write-read memory to compress and memorize the long input sequence in one pass, trying to answer subsequent queries based only on the memory. But they have two serious drawbacks: 1) they continually update the memory from current information and inevitably forget early contents; 2) they do not distinguish what information is important and treat all contents equally. In this paper, we propose Rehearsal Memory (RM) to enhance long-sequence memorization by self-supervised rehearsal with a history sampler. To alleviate the gradual forgetting of early information, we design self-supervised rehearsal training with recollection and familiarity tasks. Further, we design a history sampler to select informative fragments for rehearsal training, making the memory focus on crucial information. We evaluate the performance of our rehearsal memory on the synthetic bAbI task and several downstream tasks, including text/video question answering and recommendation on long sequences.

1. Introduction

In recent years, the tremendous progress of neural networks has enabled machines to perform reasoning given input contents X and a query Q, e.g., infer the answer to given questions from the text/video stream in text/video question answering (Seo et al., 2016; Jin et al., 2019a; Le et al., 2020b), or predict whether a user will click a given item based on the user behavior sequence in recommender systems (Ren et al., 2019; Pi et al., 2019; Zhang et al., 2021). Studies that achieve top performance at such reasoning tasks usually share an important assumption: the raw input contents X can always be accessed while answering the query Q. In this setting, a complex interaction between X and Q can be designed to extract query-relevant information from X with little loss, such as co-attention interaction (Xiong et al., 2016; Jin et al., 2019b). Though these methods (Seo et al., 2016; Le et al., 2020b) can effectively handle such reasoning tasks, they require unlimited storage resources to hold the original input X. Further, they have to encode the whole input contents and develop the elaborate interaction from scratch, which is time-consuming. This is not acceptable for online services that require instant response, such as recommender systems, where the input sequence becomes extremely long (Ren et al., 2019).

To achieve efficient reasoning on long sequences with limited storage resources, memory augmented neural networks (MANNs) (Graves et al., 2014; 2016) introduce a write-read memory M with fixed-size capacity (much smaller than |X|) to compress and remember the input contents X. In the inference phase, they can capture query-relevant clues directly from the memory M, i.e., the raw input X is not needed at the time of answering Q. This procedure is very similar to the daily experience of human beings: we may not know the tasks Q that we will answer in the future while we are experiencing current events, but we have the instinct to continually memorize our experiences within a limited memory capacity, from which we can rapidly recall and draw upon past events to guide our behaviors given the present tasks (Moscovitch et al., 2016; Baddeley, 1992). Such human-like memory-based methods bring three benefits for long-sequence reasoning: 1) storage efficiency: we only need to maintain the limited memory M rather than X; 2) reasoning efficiency: inference over M and Q is more lightweight than inference over X and Q from scratch; 3) high reusability: the maintained memory M can be reused for any query Q.

However, existing MANNs have two serious drawbacks for memory-based long-sequence reasoning. First, these approaches ignore the long-term memorization ability of the memory. They learn how to maintain the memory M only
by back-propagated losses to the final answer and do not design any specific training target for long-term memorization, which inevitably leads to the gradual forgetting of early contents (Le et al., 2019a). That is, when dealing with long input sequences, these approaches may fail to answer queries relevant to early contents due to the lack of long-term memorization. Second, determining what to remember in the memory with limited capacity is crucial to retain sufficient clues for subsequent queries Q. This is especially challenging since the information compression procedure in M is totally unaware of Q. But existing MANNs do not distinguish what information is important and treat all contents equally. Thus, due to the lack of information discrimination, these approaches may store too much meaningless information but lose vital evidence for subsequent reasoning.

In this paper, we propose Rehearsal Memory (RM) to enhance long-sequence memorization by self-supervised rehearsal with a history sampler. To overcome the gradual forgetting of early information and increase the generalization ability of the memorization technique, we develop two extra self-supervised rehearsal tasks to recall the recorded history contents from the memory. The two tasks are inspired by the observation that human beings can recall details near some specific events and distinguish whether a series of events happened in the past, which respectively correspond to two different memory processes revealed in cognitive, neuropsychological, and neuroimaging studies, i.e., recollection and familiarity (Yonelinas, 2002; Moscovitch et al., 2016). Concretely, the recollection task aims to predict the masked items in history fragments H, which are sampled from the original input stream with parts of their items masked as the prediction target. This task tries to endow the memory with the recollection ability that enables one to relive past episodes. And the familiarity task tries to distinguish whether a historical fragment H ever appears in the input stream, where we directly sample positive fragments from the input stream and replace parts of the items in positive ones to obtain negative fragments. This task resembles the familiarity process that recognizes experienced events or stimuli as familiar.

To make the rehearsal memory able to remember crucial information, we further train an independent history sampler to select informative fragments H for self-supervised rehearsal training. Similar to the teacher-student architecture in knowledge distillation (Hinton et al., 2015), we expect the history sampler (i.e., the teacher) to capture the characteristics of important fragments in the current environment and guide the rehearsal memory (i.e., the student) to remember task-relevant clues. Concretely, we independently train a conventional reasoning model that can access the raw contents X while answering the query Q as the history sampler. The model contains an attention interaction between history fragments H and the query Q, where the attention weight can be regarded as the importance of each fragment. After training, the history sampler can select the vital fragments based on the attention weights for self-supervised rehearsal training. This is similar to the procedure by which human beings learn to memorize meaningful experiences, i.e., we have gone through a lot of tasks to slowly understand which information is likely to be used in future tasks and pay more attention to it during memorization (Moscovitch et al., 2016).

In conclusion, we propose self-supervised memory rehearsal to enhance long-sequence memorization for subsequent reasoning. We design the self-supervised recollection and familiarity tasks to solve how to rehearse, which can alleviate the gradual forgetting of early information. Further, we adopt a history sampler to decide what to rehearse, which guides the memory to remember critical information. We illustrate the ability of our rehearsal memory via the synthetic bAbI task and several downstream tasks, including text/video question answering and recommendation on long sequences.

2. Related Works

Memory augmented neural networks (MANNs) introduce an external memory to store and access past contents by differentiable write-read operators. Neural Turing Machine (NTM) (Graves et al., 2014) and Differentiable Neural Computer (DNC) (Graves et al., 2016) are the typical MANNs for human-like memorization and reasoning, whose inference relies only on the memory with limited capacity rather than starting from the original input. In this line of research, Rae et al. (2016) adopt sparse memory accessing to reduce computational cost. Csordás & Schmidhuber (2019) introduce the key/value separation problem of content-based addressing and adopt a mask for memory operations as a solution. Le et al. (2019b) manipulate both data and programs stored in memory to perform universal computations. And Santoro et al. (2018); Le et al. (2020a) consider complex relational reasoning with the information they remember.

However, these works exploit MANNs mainly to help capture complex dependencies in dealing with input sequences, but do not explore the potential of MANNs in the field of memory-based long-sequence reasoning. They learn how to maintain the memory only by back-propagated losses to the final answer but do not design a specific training target for long-term memorization, inevitably incurring gradual forgetting of early contents during memorizing long sequences (Le et al., 2019a). Recently, there are a few works trying to alleviate this problem. Le et al. (2019a) propose to measure the "remember" ability by the final gradient on the early input, and adopt a uniform writing operation on the memory to balance between maximizing memorization and forgetting. Munkhdalai et al. (2019) design the meta-learned neural memory.
[Figure 1. The Rehearsal Memory Machine. The t-th segment X_t is encoded by a Transformer encoder and written into the rehearsal memory M_t at the t-th step by a Memory Update Module (slot-to-item attention and gate-based update); the maintained memory feeds both self-supervised rehearsal training and task-specific reasoning training.]
we develop a rehearsal model Hξ(M, H) to reconstruct the masked history fragments (recollection task) and distinguish positive history fragments from negative ones (familiarity task), where H denotes the critical history fragments selected by the history sampler SΨ(Q, X). For task-specific reasoning training, we develop the task-specific reason model RΩ(M, Q) to answer the given query Q. During the testing stage, we maintain the rehearsal memory M = GΘ(X) from the stream X and then infer the answer A for any relevant query Q by A = RΩ(M, Q), where the rehearsal model Hξ(M, H) and the history sampler SΨ(Q, X) are no longer needed.

3.2. Rehearsal Memory Machine

We deal with the input stream X at the segment level rather than the item level, i.e., we cut the input sequence into fixed-length segments and memorize them into the rehearsal memory segment by segment. Compared to existing MANNs (Graves et al., 2014; 2016), which store the input stream item by item in order with an RNN-based controller, our segment-level memorization can further capture the bi-directional context of each item and improve the modeling efficiency. We denote the t-th segment as $X_t = \{x_n^t\}_{n=1}^{N}$ with N items and the current memory as $M_t = \{m_k^t\}_{k=1}^{K}$, where we have recorded t−1 segments in $M_t$. The $x_n^t$ and $m_k^t$ have the same dimension $d_x$.

We first model the t-th segment by a Transformer encoder (Vaswani et al., 2017) and obtain the sequence features $F_t = \{f_n^t\}_{n=1}^{N}$ with dimension $d_x$. After that, we apply a memory update module to write $F_t$ into $M_t$. We apply a slot-to-item attention to align the sequence features to slot features in the current memory $M_t$, and then develop the gate-based update. Concretely, we first calculate the slot-to-item attention matrix, where each element means the relevance of a slot-item pair, and then learn aligned features $L_t = \{l_k^t\}_{k=1}^{K}$ for each slot, given by

$$\alpha_{kn}^t = w_a^\top \tanh(W_1^a m_k^t + W_2^a f_n^t + b_a), \quad \hat{\alpha}_{kn}^t = \frac{\exp(\alpha_{kn}^t)}{\sum_{j=1}^{K} \exp(\alpha_{jn}^t)}, \quad l_k^t = \sum_{n=1}^{N} \hat{\alpha}_{kn}^t f_n^t, \quad (1)$$

where $W_1^a \in \mathbb{R}^{d_{model} \times d_x}$, $W_2^a \in \mathbb{R}^{d_{model} \times d_x}$ and $b_a \in \mathbb{R}^{d_{model}}$ are the projection matrices and bias, and $w_a^\top$ is a row vector. Next, the k-th slot feature $m_k^t$ is updated with its aligned feature $l_k^t$ based on a GRU unit with $d_x$-dimensional hidden states, given by

$$m_k^{t+1} = \mathrm{GRU}(m_k^t, l_k^t), \quad (2)$$

where $l_k^t$ is the current input of the GRU unit and $m_k^t$ is the hidden state at the t-th step. And $m_k^{t+1}$ is the new slot feature after the gate-based update. After memorizing T segments, we obtain the rehearsal memory $M_{T+1}$, which we denote by M for convenience.
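To make Eqs. (1)-(2) concrete, the following is a minimal PyTorch sketch of the memory update module. The class, tensor layout, and unbatched single-segment interface are our own illustration for readability, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryUpdateModule(nn.Module):
    """Writes one encoded segment F_t (N items) into the K memory slots M_t."""

    def __init__(self, d_x: int, d_model: int):
        super().__init__()
        self.W1 = nn.Linear(d_x, d_model, bias=False)  # W_1^a, projects slot m_k^t
        self.W2 = nn.Linear(d_x, d_model, bias=False)  # W_2^a, projects item f_n^t
        self.b = nn.Parameter(torch.zeros(d_model))    # bias b_a
        self.w = nn.Linear(d_model, 1, bias=False)     # row vector w_a^T
        self.gru = nn.GRUCell(d_x, d_x)                # gate-based update, Eq. (2)

    def forward(self, memory: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
        # memory: (K, d_x) slots of M_t; features: (N, d_x) segment features F_t.
        # Eq. (1): additive slot-to-item attention scores alpha_{kn}^t.
        scores = self.w(torch.tanh(
            self.W1(memory).unsqueeze(1)      # (K, 1, d_model)
            + self.W2(features).unsqueeze(0)  # (1, N, d_model)
            + self.b
        )).squeeze(-1)                        # (K, N)
        # Normalize over the K slots for each item, as written in Eq. (1).
        attn = F.softmax(scores, dim=0)       # alpha_hat_{kn}^t
        aligned = attn @ features             # (K, d_x) aligned features l_k^t
        # Eq. (2): GRU with l_k^t as input and m_k^t as the hidden state.
        return self.gru(aligned, memory)      # new slots of M_{t+1}
```

In a full model, this module would be applied once per segment after the Transformer encoder, threading the memory forward through the stream.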
3.3. Self-Supervised Rehearsal Training

As shown in Figure 2, based on the maintained memory M, we apply the memory rehearsal technique to enhance long-sequence memorization. We first design the self-supervised recollection and familiarity tasks to solve how to rehearse, which can alleviate the gradual forgetting of early information. We next adopt a history sampler to decide what to rehearse, which guides the memory to remember critical task-relevant clues.

3.3.1. How to Rehearse: Self-Supervised Recollection and Familiarity Tasks

We design the rehearsal model Hξ(M, H) with recollection and familiarity tasks. The recollection task reconstructs the masked positive history fragments to enable the memory to relive past episodes. And the familiarity task tries to distinguish positive history fragments from negative ones so as to make the memory recognize experienced events.

First, we apply an independent history sampler to select B segments from the input stream as the history fragment set $H = \{H^b\}_{b=1}^{B}$, which is illustrated in the next section. Each fragment $H^b = \{h_1^b, h_2^b, \cdots, h_N^b\}$ contains N items and each item $h_*$ corresponds to a feature $x_*$. For the b-th fragment, we randomly mask 50% of the items in the fragment and add a special item [cls] at the beginning to obtain the masked positive history fragment $H^{b+} = \{h_{[cls]}^{b+}, h_1^{b+}, h_{[m_1]}^{b+}, \cdots, h_N^{b+}\}$, where $h_{[m_1]}^{b+}$ denotes the first masked item. In order to guarantee that the model Hξ(M, H) reconstructs the masked fragment by utilizing the maintained memory M rather than relying only on the fragment context, we set the mask ratio to 50% instead of the 15% used in BERT (Devlin et al., 2019). Moreover, we construct the masked negative history fragment $H^{b-} = \{h_{[cls]}^{b-}, h_1^{b-}, h_{[m_1]}^{b-}, \cdots, h_N^{b-}\}$ by replacing 50% of the unmasked items in the positive fragment, where the replacement items are sampled from other input streams to make the negative fragment distinguishable. Here we construct the positive fragment set $H^+ = \{H^{b+}\}_{b=1}^{B}$ and the corresponding negative fragment set $H^- = \{H^{b-}\}_{b=1}^{B}$ from the original fragment set H. Next, we adopt a bidirectional Transformer decoder (Vaswani et al., 2017) without future masking to model each history fragment $H^{b+}/H^{b-}$. In the decoder, each history item can interact with all other items in the fragment. The rehearsal memory M is input to the encoder-decoder multi-head attention sub-layer in each decoder layer, where the queries come from the previous decoder layer and the memory slots are regarded as the keys and values. This allows each item in the decoder to attend over all slot features in the memory M. Finally, we obtain the features $\{r_{[cls]}^{b+/b-}, r_1^{b+/b-}, r_{[m_1]}^{b+/b-}, \cdots, r_N^{b+/b-}\}$, where each $r_*$ has dimension $d_x$.
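The fragment construction above can be summarized with a short sketch. The special ids CLS_ID/MASK_ID and the `other_stream` source of replacement items are illustrative assumptions, not details taken from the paper.

```python
import random
from typing import List, Tuple

CLS_ID, MASK_ID = 0, 1  # hypothetical special item ids

def build_rehearsal_pair(fragment: List[int],
                         other_stream: List[int]) -> Tuple[List[int], List[int], List[int]]:
    """Returns (positive, negative, masked_positions) for one fragment H^b."""
    n = len(fragment)
    masked = set(random.sample(range(n), n // 2))  # mask 50% of the items
    positive = [CLS_ID] + [MASK_ID if i in masked else x
                           for i, x in enumerate(fragment)]
    # Negative fragment: keep the same mask pattern, but replace 50% of the
    # *unmasked* items with items drawn from another input stream.
    unmasked = [i for i in range(n) if i not in masked]
    replaced = set(random.sample(unmasked, len(unmasked) // 2))
    negative = [CLS_ID] + [MASK_ID if i in masked
                           else (random.choice(other_stream) if i in replaced else x)
                           for i, x in enumerate(fragment)]
    return positive, negative, sorted(masked)
```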
[Figure 2. Self-supervised rehearsal training. The history sampler selects fragments from the input stream, which are converted into positive/negative fragments and masked (items h_[cls], h_1, h_[m1], ..., h_N); the rehearsal model reconstructs and classifies them against the rehearsal memory M.]
Recollection Task. We first predict the masked items of positive history fragments to build the item-level reconstruction for the recollection task. Considering that there are too many item types, we apply contrastive training (He et al., 2020; Chen et al., 2020; Zhang et al., 2020) based on the ground truth and other sampled items. For the N/2 masked items, we compute the recollection loss for the b-th fragment by

$$L_i^b = \log \frac{\exp(r_{[m_i]}^{b+} \cdot y_i)}{\exp(r_{[m_i]}^{b+} \cdot y_i) + \sum_{j=1}^{J} \exp(r_{[m_i]}^{b+} \cdot y_j)}, \quad L_{rec}^b = -\frac{2}{N} \sum_{i=1}^{N/2} L_i^b, \quad (3)$$

where $y_i \in \mathbb{R}^{d_x}$ is the feature of the ground truth of the i-th masked item, $y_j \in \mathbb{R}^{d_x}$ is the feature of a sampled item, and $r_{[m_i]}^{b+} \cdot y_*$ is the inner product of the two features.

Familiarity Task. Next, we predict whether the masked history fragment ever appears in the current input stream, i.e., distinguish positive history fragments from negative ones. This training objective makes the memory learn the ability to recognize experienced events. Concretely, we project each feature $r_{[cls]}^{b+/b-}$ into a confidence score $s^{b+/b-} \in (0, 1)$ by a linear layer with the sigmoid activation, and calculate the familiarity loss by

$$L_{fam}^b = -\log(s^{b+}) - \log(1 - s^{b-}), \quad (4)$$

where $L_{fam}^b$ is the familiarity loss for the b-th pair of positive and negative fragments in $H^+/H^-$.
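A minimal sketch of the two losses, assuming the decoder outputs and item features have already been computed; the function signatures are illustrative rather than the authors' implementation. Eq. (3) becomes a standard cross-entropy over one positive and J sampled items, and Eq. (4) a binary cross-entropy on the two confidence scores.

```python
import torch
import torch.nn.functional as F

def recollection_loss(r_masked: torch.Tensor,  # (N/2, d_x) decoder outputs r^{b+}_{[mi]}
                      y_true: torch.Tensor,    # (N/2, d_x) ground-truth features y_i
                      y_neg: torch.Tensor) -> torch.Tensor:  # (J, d_x) sampled items y_j
    pos = (r_masked * y_true).sum(-1, keepdim=True)  # inner products r . y_i, (N/2, 1)
    neg = r_masked @ y_neg.t()                       # inner products r . y_j, (N/2, J)
    logits = torch.cat([pos, neg], dim=-1)           # ground truth sits at index 0
    # Mean of -log softmax at index 0 == Eq. (3) averaged over the N/2 items.
    return F.cross_entropy(logits, torch.zeros(len(pos), dtype=torch.long))

def familiarity_loss(s_pos: torch.Tensor, s_neg: torch.Tensor) -> torch.Tensor:
    # Eq. (4): binary cross-entropy on confidence scores s^{b+}, s^{b-} in (0, 1).
    return -(torch.log(s_pos) + torch.log(1.0 - s_neg)).mean()
```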
3.3.2. What to Rehearse: History Sampler

Humans tend to judge which information is likely to be used in subsequent inference and pay more attention to it during memorization. Thus, we further train a history sampler SΨ(Q, X) to select informative history fragments H for self-supervised rehearsal training, which is independent of the memory machine M = GΘ(X). In knowledge distillation (Hinton et al., 2015), the teacher model can access privileged information and transfer the knowledge to the student model. Similarly, our history sampler, which is the teacher and can access the raw contents X while answering the query Q, learns the ability to distinguish task-relevant important fragments and guides the rehearsal memory (i.e., the student) to remember critical clues.

Concretely, we independently train a directly reasoning model as the history sampler SΨ(Q, X), which can access the raw contents X while answering the query Q. We first cut the entire input X into C history fragments $\{H^c\}_{c=1}^{C}$ just like the rehearsal memory machine, where each fragment contains N items. Next, we obtain the fragment features $\{h^c\}_{c=1}^{C}$ by averaging the item features in each fragment. After that, we develop attention-based reasoning for the query Q on these fragment features. The query feature $q \in \mathbb{R}^{d_{model}}$ is modeled by the task-specific encoder in different downstream tasks, which is introduced in Section A of the supplementary material. Given the query feature q and fragment features $\{h^c\}_{c=1}^{C}$, we conduct the attention method to aggregate query-relevant clues from the fragments, given by

$$\beta^c = w_h^\top \tanh(W_1^h q + W_2^h h^c + b_h), \quad \hat{\beta}^c = \frac{\exp(\beta^c)}{\sum_{j=1}^{C} \exp(\beta^j)}, \quad e = \sum_{c=1}^{C} \hat{\beta}^c h^c, \quad (5)$$
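As an illustration, here is a sketch of the sampler's attention in Eq. (5). Selecting the B highest-weight fragments via top-k is our reading of "select the vital fragments based on the attention weights"; the excerpt does not spell out the exact selection rule, so treat it as an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HistorySampler(nn.Module):
    def __init__(self, d_x: int, d_model: int, num_fragments: int = 6):
        super().__init__()
        self.W1 = nn.Linear(d_model, d_model, bias=False)  # W_1^h, projects query q
        self.W2 = nn.Linear(d_x, d_model, bias=False)      # W_2^h, projects fragment h^c
        self.b = nn.Parameter(torch.zeros(d_model))        # bias b_h
        self.w = nn.Linear(d_model, 1, bias=False)         # row vector w_h^T
        self.B = num_fragments                             # B = 6 in Section 4.1

    def forward(self, q: torch.Tensor, fragments: torch.Tensor):
        # q: (d_model,) query feature; fragments: (C, d_x) mean-pooled features h^c.
        beta = self.w(torch.tanh(self.W1(q) + self.W2(fragments) + self.b)).squeeze(-1)
        weights = F.softmax(beta, dim=0)    # beta_hat^c over the C fragments
        clues = weights @ fragments         # e = sum_c beta_hat^c h^c, Eq. (5)
        top_b = torch.topk(weights, k=self.B).indices  # indices of vital fragments
        return clues, top_b
```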
3.4. Task-Specific Reasoning Training

Besides self-supervised rehearsal training, we simultaneously develop task-specific reasoning training. For different downstream tasks, we propose different task-specific reason models RΩ(M, Q) based on the memory M. Here we adopt simple and mature components in the reason model for a fair comparison. The details are introduced in Section A of the supplementary material. Briefly, we first learn the query representation q by a task-specific encoder and then perform multi-hop attention-based reasoning. Finally, we obtain the reason loss $L_r$ from RΩ(M, Q).

Eventually, we combine the rehearsal and reason losses to train our model, given by

$$L_{rm} = \lambda_1 L_{rec} + \lambda_2 L_{fam} + \lambda_3 L_r, \quad (7)$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are applied to adjust the balance of the three losses.
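Putting the pieces together, one training step might look as follows. The module interfaces (`memory_machine`, `rehearsal_model`, `reason_model`, `batch`) are hypothetical wrappers around the components sketched above; only the loss weights and Eq. (7) itself come from the paper.

```python
# Loss weights from Section 4.1: lambda_1 = 1.0, lambda_2 = 0.5, lambda_3 = 1.0.
LAMBDAS = (1.0, 0.5, 1.0)

def training_step(memory_machine, rehearsal_model, reason_model, optimizer, batch):
    memory = memory_machine(batch.segments)                   # M = G_Theta(X)
    l_rec, l_fam = rehearsal_model(memory, batch.fragments)   # Eqs. (3)-(4)
    l_r = reason_model(memory, batch.query, batch.answer)     # task-specific loss
    loss = LAMBDAS[0] * l_rec + LAMBDAS[1] * l_fam + LAMBDAS[2] * l_r  # Eq. (7)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```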
4. Experiments

In this section, we first verify our rehearsal memory on the widely-used short-sequence reasoning task bAbI. Next, we mainly compare our approach with diverse baselines on several long-sequence reasoning tasks. We then perform ablation studies on the memory rehearsal techniques and analyze the impact of crucial hyper-parameters.

4.1. Experiment Setting

Model Setting. We first introduce the common model settings for all downstream tasks. We set the layer number of the Transformer encoder and the bi-directional Transformer decoder to 3. The head number in Multi-Head Attention is set to 4. We set λ1, λ2 and λ3 to 1.0, 0.5 and 1.0, respectively. The number B of history fragments is set to 6. During training, we apply an Adam optimizer (Duchi et al., 2011) to minimize the multi-task loss L_rm, where the initial learning rate is set to 0.001.

Baseline. We compare our rehearsal memory with directly reasoning methods and memory-based reasoning approaches. The directly reasoning baselines differ across downstream tasks, and the memory-based baselines are mainly DNC (Graves et al., 2016), NUTM (Le et al., 2019b), DMSDNC (Park et al., 2020), STM (Le et al., 2020a) and Compressive Transformer (CT) (Rae et al., 2019). For a fair comparison, we modify the reasoning module of the memory-based baselines to be consistent with our rehearsal memory, i.e., we conduct multi-hop attention-based reasoning based on the built memory. The number of memory slots in these baselines is also set to K. Besides, we set the core number of NUTM to 4, the query number of STM to 8 and the memory block number of DMSDNC to 2. As for CT, the layer number of the Transformer is set to 3, as in our rehearsal memory, and the compression rate is set to 5.

4.2. Rehearsal Memory on Short-Sequence Reasoning

The bAbI dataset (Weston et al., 2015) is a synthetic text question answering benchmark widely applied to evaluate the memorization and reasoning performance of MANNs. This dataset contains 20 reasoning tasks that are required to be solved with one common model. Although most of these tasks only give short-sequence text input (less than 100 words) and existing methods (Park et al., 2020; Le et al., 2020a) have solved these tasks well, we still compare our rehearsal memory with other memory-based baselines to verify its short-sequence reasoning performance. We set d_x and d_model to 128. The number K of memory slots is set to 20. We naturally take each sentence in the input texts as a segment, and the maximum length N of segments is set to 15. Due to the limited word types in this dataset, we sample all other words as negative items in L_rec.

The results are summarized in Table 1. The RM model solves these bAbI tasks with near-zero error and outperforms existing baselines in the mean and best error rate over 10 runs, verifying that our RM method can conduct effective memorization and reasoning on short-sequence tasks. Among these methods, DNC, NUTM and DMSDNC model the input contents word by word, STM processes input texts as a sentence-level sequence, and the CT and RM methods cut the input texts into sentences for modeling. From the results,
[Table 2. Performance Comparisons for Long-Sequence Text Question Answering on NarrativeQA.]
[Table 3. Performance Comparisons for Long-Term Video Question Answering on ActivityNet-QA.]
4.3. Rehearsal Memory on Long-Sequence Reasoning

We then compare our approach with diverse baselines on several long-sequence reasoning tasks.

4.3.1. Long-Sequence Text Question Answering

We apply the NarrativeQA dataset (Kočiskỳ et al., 2018), with its long input contents, for long-sequence text question answering. This dataset contains 1,572 stories and corresponding summaries generated by humans, where each summary contains more than 600 tokens on average, and there are 46,765 questions in total. We adopt the multi-choice form to answer the given question based on a summary, where the answers to other questions associated with the same summary are regarded as answer candidates. We compute the mean reciprocal rank (MRR) of the correct answer among the candidates as the metric. Besides the memory-based methods, we adopt the directly reasoning models AS Reader (Kadlec et al., 2016) and E2E-MN (Sukhbaatar et al., 2015) as baselines. The AS Reader applies a pointer network to generate the answer, and E2E-MN employs an end-to-end memory network to conduct multi-hop reasoning. For our rehearsal memory, we set d_x and d_model to 256. The number K of memory slots is set to 20. We naturally take each sentence in the summaries as a segment, and the maximum length N of segments is set to 20. And we sample all other words as negative items in L_rec.

We report the results in Table 2. Our RM method obtains the best performance among memory-based approaches, which demonstrates that our self-supervised rehearsal training with the history sampler can effectively enhance long-sequence memorization and reasoning. Further, the RM model slightly

4.3.2. Long-Sequence Video Question Answering

The ActivityNet-QA dataset (Yu et al., 2019) contains 5,800 videos from ActivityNet (Caba Heilbron et al., 2015). The average video duration in this dataset is about 180s, the longest among VQA datasets. We compare our method with four directly reasoning baselines, including the three basic models E-VQA, E-MN and E-SA from (Yu et al., 2019) and the SOTA model HCRN (Le et al., 2020b). For our rehearsal memory, we set d_x and d_model to 256. The number K of memory slots and the length N of segments are both set to 20. And in L_rec, we select 30 other frame features from the video as the sampled items.

As shown in Table 3, the RM method obtains better performance than the other memory-based baselines. Compared to the best baseline CT, our RM model further achieves a 0.9% absolute improvement, showing the effectiveness of our model designs and self-supervised rehearsal training. Moreover, the RM method outperforms the basic directly reasoning baselines E-VQA, E-MN and E-SA, but is slightly worse than the SOTA method HCRN. This suggests our rehearsal memory can reduce the gap between the memory-based and directly reasoning paradigms.

4.3.3. Lifelong Sequence Recommendation

Lifelong sequence recommendation (Ren et al., 2019) aims to predict whether the user will click a given item based on long behavior sequences, so it can be regarded as a long-sequence reasoning task. The XLong dataset (Ren et al., 2019) is sampled from the click logs on Alibaba; the length of the historical behavior sequences in this dataset is 1,000. We compare our method with four directly reasoning methods
[Table 4. Performance Comparisons for Lifelong Sequence Recommendation on XLong.]
[Table 6. Ablation Results about the Transformer Encoder and Mask Ratio.]

[Figure 3. Effect of the Memory Slot Number K, comparing STM, CT and RM: (a) MRR on NarrativeQA; (b) Acc on ActivityNet-QA.]

and the model cannot effectively capture the evidence and infer the answer.

5. Conclusions

In this paper, we propose self-supervised rehearsal to enhance long-sequence memorization. We design the self-supervised recollection and familiarity tasks to alleviate the gradual forgetting of early information. Further, we adopt a history sampler to guide the memory to remember critical information. Extensive experiments on a series of downstream tasks verify the performance of our method. For future work, we will further explore the properties of rehearsal memory.

Acknowledgments
References

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Graves, A., Wayne, G., and Danihelka, I. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.

Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., Colmenarejo, S. G., Grefenstette, E., Ramalho, T., Agapiou, J., et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.

Hidasi, B., Karatzoglou, A., Baltrunas, L., and Tikk, D. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939, 2015.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Jin, W., Zhao, Z., Gu, M., Yu, J., Xiao, J., and Zhuang, Y. Multi-interaction network with object relation for video question answering. In Proceedings of the ACM International Conference on Multimedia, pp. 1193–1201, 2019a.

Jin, W., Zhao, Z., Gu, M., Yu, J., Xiao, J., and Zhuang, Y. Video dialog via multi-grained convolutional self-attention context networks. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 465–474, 2019b.

Kadlec, R., Schmid, M., Bajgar, O., and Kleindienst, J. Text understanding with the attention sum reader network. arXiv preprint arXiv:1603.01547, 2016.

Kočiskỳ, T., Schwarz, J., Blunsom, P., Dyer, C., Hermann, K. M., Melis, G., and Grefenstette, E. The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018.

Le, H., Tran, T., and Venkatesh, S. Learning to remember more with less memorization. arXiv preprint arXiv:1901.01347, 2019a.

Le, H., Tran, T., and Venkatesh, S. Neural stored-program memory. arXiv preprint arXiv:1906.08862, 2019b.

Le, H., Tran, T., and Venkatesh, S. Self-attentive associative memory. arXiv preprint arXiv:2002.03519, 2020a.

Le, T. M., Le, V., Venkatesh, S., and Tran, T. Hierarchical conditional relation networks for video question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9972–9981, 2020b.

Moscovitch, M., Cabeza, R., Winocur, G., and Nadel, L. Episodic memory and beyond: the hippocampus and neocortex in transformation. Annual Review of Psychology, 67:105–134, 2016.

Munkhdalai, T., Sordoni, A., Wang, T., and Trischler, A. Metalearned neural memory. In Advances in Neural Information Processing Systems, pp. 13331–13342, 2019.

Park, T., Choi, I., and Lee, M. Distributed memory based self-supervised differentiable neural computer. arXiv preprint arXiv:2007.10637, 2020.

Pi, Q., Bian, W., Zhou, G., Zhu, X., and Gai, K. Practice on long sequential user behavior modeling for click-through rate prediction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2671–2679, 2019.

Rae, J., Hunt, J. J., Danihelka, I., Harley, T., Senior, A. W., Wayne, G., Graves, A., and Lillicrap, T. Scaling memory-augmented neural networks with sparse reads and writes. In Advances in Neural Information Processing Systems, pp. 3621–3629, 2016.

Rae, J. W., Potapenko, A., Jayakumar, S. M., and Lillicrap, T. P. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.

Ren, K., Qin, J., Fang, Y., Zhang, W., Zheng, L., Bian, W., Zhou, G., Xu, J., Yu, Y., Zhu, X., et al. Lifelong sequential modeling with personalized memorization for user response prediction. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 565–574, 2019.

Santoro, A., Faulkner, R., Raposo, D., Rae, J., Chrzanowski, M., Weber, T., Wierstra, D., Vinyals, O., Pascanu, R., and Lillicrap, T. Relational recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 7299–7310, 2018.

Seo, M., Kembhavi, A., Farhadi, A., and Hajishirzi, H. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603, 2016.

Sukhbaatar, S., Weston, J., Fergus, R., et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pp. 2440–2448, 2015.

Tang, J. and Wang, K. Personalized top-n sequential recommendation via convolutional sequence embedding. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 565–573, 2018.