Learning to Rehearse in Long Sequence Memorization
Zhu Zhang*1,2, Chang Zhou*2, Jianxin Ma2, Zhijie Lin1, Jingren Zhou2, Hongxia Yang2, Zhou Zhao1
*Equal contribution. 1Zhejiang University, China. 2DAMO Academy, Alibaba Group, China. Correspondence to: Zhou Zhao <[email protected]>. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Existing reasoning tasks often make an important assumption that the input contents can always be accessed while reasoning, which requires unlimited storage resources and suffers from severe time delay on long sequences. To achieve efficient reasoning on long sequences with limited storage resources, memory augmented neural networks introduce a human-like write-read memory to compress and memorize the long input sequence in one pass, trying to answer subsequent queries based only on the memory. But they have two serious drawbacks: 1) they continually update the memory from current information and inevitably forget early contents; 2) they do not distinguish what information is important and treat all contents equally. In this paper, we propose Rehearsal Memory (RM) to enhance long-sequence memorization by self-supervised rehearsal with a history sampler. To alleviate the gradual forgetting of early information, we design self-supervised rehearsal training with recollection and familiarity tasks. Further, we design a history sampler to select informative fragments for rehearsal training, making the memory focus on crucial information. We evaluate the performance of our rehearsal memory on the synthetic bAbI task and several downstream tasks, including text/video question answering and recommendation on long sequences.

1. Introduction

In recent years, the tremendous progress of neural networks has enabled machines to perform reasoning given input contents X and a query Q, e.g., infer the answer to given questions from the text/video stream in text/video question answering (Seo et al., 2016; Jin et al., 2019a; Le et al., 2020b), or predict whether a user will click a given item based on the user behavior sequence in recommender systems (Ren et al., 2019; Pi et al., 2019; Zhang et al., 2021). Studies that achieve top performance at such reasoning tasks usually share an important assumption: the raw input contents X can always be accessed while answering the query Q. In this setting, a complex interaction between X and Q can be designed to extract query-relevant information from X with little loss, such as co-attention interaction (Xiong et al., 2016; Jin et al., 2019b). Though these methods (Seo et al., 2016; Le et al., 2020b) can effectively handle such reasoning tasks, they require unlimited storage resources to hold the original input X. Further, they have to encode the whole input contents and develop the elaborate interaction from scratch, which is time-consuming. This is not acceptable for online services that require instant response, such as recommender systems, where the input sequence becomes extremely long (Ren et al., 2019).

To achieve efficient reasoning on long sequences with limited storage resources, memory augmented neural networks (MANNs) (Graves et al., 2014; 2016) introduce a write-read memory M with fixed-size capacity (much smaller than |X|) to compress and remember the input contents X. In the inference phase, they can capture query-relevant clues directly from the memory M, i.e., the raw input X is not needed at the time of answering Q. This procedure is very similar to the daily experience of human beings: we may not know the tasks Q that we will answer in the future while we are experiencing current events, but we have the instinct to continually memorize our experiences within a limited memory capacity, from which we can rapidly recall and draw upon past events to guide our behaviors given the present tasks (Moscovitch et al., 2016; Baddeley, 1992). Such human-like memory-based methods bring three benefits for long-sequence reasoning: 1) storage efficiency: we only need to maintain the limited memory M rather than X; 2) reasoning efficiency: inference over M and Q is more lightweight than inference over X and Q from scratch; 3) high reusability: the maintained memory M can be reused for any query Q.

However, existing MANNs have two serious drawbacks for memory-based long-sequence reasoning. First, these approaches ignore the long-term memorization ability of the memory. They learn how to maintain the memory M only
by back-propagated losses to the final answer and do not design any specific training target for long-term memorization, which inevitably leads to the gradual forgetting of early contents (Le et al., 2019a). That is, when dealing with long input sequences, these approaches may fail to answer queries relevant to early contents due to the lack of long-term memorization. Second, determining what to remember in the memory with limited capacity is crucial to retain sufficient clues for subsequent queries Q. This is especially challenging since the information compression procedure in M is totally unaware of Q. But existing MANNs do not distinguish what information is important and treat all contents equally. Thus, due to the lack of information discrimination, these approaches may store too much meaningless information but lose vital evidence for subsequent reasoning.

In this paper, we propose Rehearsal Memory (RM) to enhance long-sequence memorization by self-supervised rehearsal with a history sampler. To overcome the gradual forgetting of early information and increase the generalization ability of the memorization technique, we develop two extra self-supervised rehearsal tasks to recall the recorded history contents from the memory. The two tasks are inspired by the observation that human beings can recall details near some specific events and distinguish whether a series of events happened in the past, which respectively correspond to two different memory processes revealed in cognitive, neuropsychological, and neuroimaging studies, i.e., recollection and familiarity (Yonelinas, 2002; Moscovitch et al., 2016). Concretely, the recollection task aims to predict the masked items in history fragments H, which are sampled from the original input stream with parts of their items masked as the prediction target. This task tries to endow the memory with the recollection ability that enables one to relive past episodes. And the familiarity task tries to distinguish whether a historical fragment H ever appears in the input stream, where we directly sample positive fragments from the input stream and replace parts of the items in positive ones to obtain negative fragments. This task resembles the familiarity process that recognizes experienced events or stimuli as familiar.

To make the rehearsal memory able to remember crucial information, we further train an independent history sampler to select informative fragments H for self-supervised rehearsal training. Similar to the teacher-student architecture in knowledge distillation (Hinton et al., 2015), we expect the history sampler (i.e., the teacher) to capture the characteristics of important fragments in the current environment and guide the rehearsal memory (i.e., the student) to remember task-relevant clues. Concretely, we independently train a conventional reasoning model that can access the raw contents X while answering the query Q as the history sampler. The model contains an attention interaction between history fragments H and the query Q, where the attention weight can be regarded as the importance of each fragment. After training, the history sampler can select the vital fragments based on the attention weights for self-supervised rehearsal training. This is similar to the procedure by which human beings learn to memorize meaningful experiences, i.e., we have gone through a lot of tasks to slowly understand which information is likely to be used in future tasks and pay more attention to it during memorization (Moscovitch et al., 2016).

In conclusion, we propose self-supervised memory rehearsal to enhance long-sequence memorization for subsequent reasoning. We design the self-supervised recollection and familiarity tasks to solve how to rehearse, which can alleviate the gradual forgetting of early information. Further, we adopt a history sampler to decide what to rehearse, which guides the memory to remember critical information. We illustrate the ability of our rehearsal memory via the synthetic bAbI task and several downstream tasks, including text/video question answering and recommendation on long sequences.

2. Related Works

Memory augmented neural networks (MANNs) introduce an external memory to store and access past contents by differentiable write-read operators. Neural Turing Machine (NTM) (Graves et al., 2014) and Differentiable Neural Computer (DNC) (Graves et al., 2016) are the typical MANNs for human-like memorization and reasoning, whose inference relies only on the memory with limited capacity rather than starting from the original input. In this line of research, Rae et al. (2016) adopt sparse memory accessing to reduce computational cost. Csordás & Schmidhuber (2019) introduce the key/value separation problem of content-based addressing and adopt a mask for memory operations as a solution. Le et al. (2019b) manipulate both data and programs stored in memory to perform universal computations. And Santoro et al. (2018); Le et al. (2020a) consider complex relational reasoning with the information they remember.

However, these works exploit MANNs mainly to help capture complex dependencies in dealing with input sequences, but do not explore the potential of MANNs in the field of memory-based long-sequence reasoning. They learn how to maintain the memory only by back-propagated losses to the final answer but do not design a specific training target for long-term memorization, inevitably incurring gradual forgetting of early contents during memorizing long sequences (Le et al., 2019a). Recently, there are a few works trying to alleviate this problem. Le et al. (2019a) propose to measure the "remember" ability by the final gradient on the early input, and adopt a uniform writing operation on the memory to balance between maximizing memorization and forgetting. Munkhdalai et al. (2019) design the meta-learned neural memory.
[Figure 1. The Rehearsal Memory Machine. The t-th segment X_t is encoded by a Transformer encoder and written into the rehearsal memory M_t at the t-th step by a Memory Update Module (slot-to-item attention and gate-based update); the maintained memory feeds both self-supervised rehearsal training and task-specific reasoning training.]
we develop a rehearsal model Hξ(M, H) to reconstruct the masked history fragments (recollection task) and distinguish positive history fragments from negative ones (familiarity task), where H denotes the critical history fragments selected by the history sampler SΨ(Q, X). For task-specific reasoning training, we develop the task-specific reason model RΩ(M, Q) to answer the given query Q. During the testing stage, we maintain the rehearsal memory M = GΘ(X) from the stream X and then infer the answer A for any relevant query Q by A = RΩ(M, Q), where the rehearsal model Hξ(M, H) and the history sampler SΨ(Q, X) are no longer needed.

3.2. Rehearsal Memory Machine

We deal with the input stream X at the segment level rather than the item level, i.e., we cut the input sequence into fixed-length segments and memorize them into the rehearsal memory segment by segment. Compared to existing MANNs (Graves et al., 2014; 2016), which store the input stream item by item in order with an RNN-based controller, our segment-level memorization can further capture the bi-directional context of each item and improve the modeling efficiency. We denote the t-th segment as $X_t = \{x_n^t\}_{n=1}^{N}$ with N items and the current memory as $M_t = \{m_k^t\}_{k=1}^{K}$, where we have recorded t−1 segments in $M_t$. The $x_n^t$ and $m_k^t$ have the same dimension $d_x$.

We first model the t-th segment by a Transformer encoder (Vaswani et al., 2017) and obtain the sequence features $F_t = \{f_n^t\}_{n=1}^{N}$ with dimension $d_x$. After that, we apply a memory update module to write $F_t$ into $M_t$. We apply a slot-to-item attention to align the sequence features to slot features in the current memory $M_t$, and then develop the gate-based update. Concretely, we first calculate the slot-to-item attention matrix, where each element means the relevance of a slot-item pair, and then learn aligned features $L_t = \{l_k^t\}_{k=1}^{K}$ for each slot, given by

$$\alpha_{kn}^t = w_a^\top \tanh(W_1^a m_k^t + W_2^a f_n^t + b_a), \quad \hat{\alpha}_{kn}^t = \frac{\exp(\alpha_{kn}^t)}{\sum_{j=1}^{K} \exp(\alpha_{jn}^t)}, \quad l_k^t = \sum_{n=1}^{N} \hat{\alpha}_{kn}^t f_n^t, \quad (1)$$

where $W_1^a \in \mathbb{R}^{d_{model} \times d_x}$, $W_2^a \in \mathbb{R}^{d_{model} \times d_x}$ and $b_a \in \mathbb{R}^{d_{model}}$ are the projection matrices and bias, and $w_a^\top$ is a row vector. Next, the k-th slot feature $m_k^t$ is updated with its aligned feature $l_k^t$ based on a GRU unit with $d_x$-dimensional hidden states, given by

$$m_k^{t+1} = \mathrm{GRU}(m_k^t, l_k^t), \quad (2)$$

where $l_k^t$ is the current input of the GRU unit and $m_k^t$ is the hidden state at the t-th step. And $m_k^{t+1}$ is the new slot feature after the gate-based update. After memorizing T segments, we obtain the rehearsal memory $M_{T+1}$, which we denote by M for convenience.
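To make Eqs. (1)-(2) concrete, the following is a minimal PyTorch sketch of the memory update module. The class, tensor layout, and unbatched single-segment interface are our own illustration for readability, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryUpdateModule(nn.Module):
    """Writes one encoded segment F_t (N items) into the K memory slots M_t."""

    def __init__(self, d_x: int, d_model: int):
        super().__init__()
        self.W1 = nn.Linear(d_x, d_model, bias=False)  # W_1^a, projects slot m_k^t
        self.W2 = nn.Linear(d_x, d_model, bias=False)  # W_2^a, projects item f_n^t
        self.b = nn.Parameter(torch.zeros(d_model))    # bias b_a
        self.w = nn.Linear(d_model, 1, bias=False)     # row vector w_a^T
        self.gru = nn.GRUCell(d_x, d_x)                # gate-based update, Eq. (2)

    def forward(self, memory: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
        # memory: (K, d_x) slots of M_t; features: (N, d_x) segment features F_t.
        # Eq. (1): additive slot-to-item attention scores alpha_{kn}^t.
        scores = self.w(torch.tanh(
            self.W1(memory).unsqueeze(1)      # (K, 1, d_model)
            + self.W2(features).unsqueeze(0)  # (1, N, d_model)
            + self.b
        )).squeeze(-1)                        # (K, N)
        # Normalize over the K slots for each item, as written in Eq. (1).
        attn = F.softmax(scores, dim=0)       # alpha_hat_{kn}^t
        aligned = attn @ features             # (K, d_x) aligned features l_k^t
        # Eq. (2): GRU with l_k^t as input and m_k^t as the hidden state.
        return self.gru(aligned, memory)      # new slots of M_{t+1}
```

In a full model, this module would be applied once per segment after the Transformer encoder, threading the memory forward through the stream.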
3.3. Self-Supervised Rehearsal Training

As shown in Figure 2, based on the maintained memory M, we apply the memory rehearsal technique to enhance long-sequence memorization. We first design the self-supervised recollection and familiarity tasks to solve how to rehearse, which can alleviate the gradual forgetting of early information. We next adopt a history sampler to decide what to rehearse, which guides the memory to remember critical task-relevant clues.

3.3.1. How to Rehearse: Self-Supervised Recollection and Familiarity Tasks

We design the rehearsal model Hξ(M, H) with recollection and familiarity tasks. The recollection task reconstructs the masked positive history fragments to enable the memory to relive past episodes. And the familiarity task tries to distinguish positive history fragments from negative ones so as to make the memory recognize experienced events.

First, we apply an independent history sampler to select B segments from the input stream as the history fragment set $H = \{H^b\}_{b=1}^{B}$, which is illustrated in the next section. Each fragment $H^b = \{h_1^b, h_2^b, \cdots, h_N^b\}$ contains N items and each item $h_*$ corresponds to a feature $x_*$. For the b-th fragment, we randomly mask 50% of the items in the fragment and add a special item [cls] at the beginning to obtain the masked positive history fragment $H^{b+} = \{h_{[cls]}^{b+}, h_1^{b+}, h_{[m_1]}^{b+}, \cdots, h_N^{b+}\}$, where $h_{[m_1]}^{b+}$ denotes the first masked item. In order to guarantee that the model Hξ(M, H) reconstructs the masked fragment by utilizing the maintained memory M rather than relying only on the fragment context, we set the mask ratio to 50% instead of the 15% used in BERT (Devlin et al., 2019). Moreover, we construct the masked negative history fragment $H^{b-} = \{h_{[cls]}^{b-}, h_1^{b-}, h_{[m_1]}^{b-}, \cdots, h_N^{b-}\}$ by replacing 50% of the unmasked items in the positive fragment, where the replacement items are sampled from other input streams to make the negative fragment distinguishable. Here we construct the positive fragment set $H^+ = \{H^{b+}\}_{b=1}^{B}$ and the corresponding negative fragment set $H^- = \{H^{b-}\}_{b=1}^{B}$ from the original fragment set H. Next, we adopt a bidirectional Transformer decoder (Vaswani et al., 2017) without future masking to model each history fragment $H^{b+}/H^{b-}$. In the decoder, each history item can interact with all other items in the fragment. The rehearsal memory M is input to the encoder-decoder multi-head attention sub-layer in each decoder layer, where the queries come from the previous decoder layer and the memory slots are regarded as the keys and values. This allows each item in the decoder to attend over all slot features in the memory M. Finally, we obtain the features $\{r_{[cls]}^{b+/b-}, r_1^{b+/b-}, r_{[m_1]}^{b+/b-}, \cdots, r_N^{b+/b-}\}$, where each $r_*$ has dimension $d_x$.
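The fragment construction above can be summarized with a short sketch. The special ids CLS_ID/MASK_ID and the `other_stream` source of replacement items are illustrative assumptions, not details taken from the paper.

```python
import random
from typing import List, Tuple

CLS_ID, MASK_ID = 0, 1  # hypothetical special item ids

def build_rehearsal_pair(fragment: List[int],
                         other_stream: List[int]) -> Tuple[List[int], List[int], List[int]]:
    """Returns (positive, negative, masked_positions) for one fragment H^b."""
    n = len(fragment)
    masked = set(random.sample(range(n), n // 2))  # mask 50% of the items
    positive = [CLS_ID] + [MASK_ID if i in masked else x
                           for i, x in enumerate(fragment)]
    # Negative fragment: keep the same mask pattern, but replace 50% of the
    # *unmasked* items with items drawn from another input stream.
    unmasked = [i for i in range(n) if i not in masked]
    replaced = set(random.sample(unmasked, len(unmasked) // 2))
    negative = [CLS_ID] + [MASK_ID if i in masked
                           else (random.choice(other_stream) if i in replaced else x)
                           for i, x in enumerate(fragment)]
    return positive, negative, sorted(masked)
```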
[Figure 2. Self-supervised rehearsal training. The history sampler selects fragments from the input stream, which are converted into positive/negative fragments and masked (items h_[cls], h_1, h_[m1], ..., h_N); the rehearsal model reconstructs and classifies them against the rehearsal memory M.]
Recollection Task. We first predict the masked items of positive history fragments to build the item-level reconstruction for the recollection task. Considering that there are too many item types, we apply contrastive training (He et al., 2020; Chen et al., 2020; Zhang et al., 2020) based on the ground truth and other sampled items. For the N/2 masked items, we compute the recollection loss for the b-th fragment by

$$L_i^b = \log \frac{\exp(r_{[m_i]}^{b+} \cdot y_i)}{\exp(r_{[m_i]}^{b+} \cdot y_i) + \sum_{j=1}^{J} \exp(r_{[m_i]}^{b+} \cdot y_j)}, \quad L_{rec}^b = -\frac{2}{N} \sum_{i=1}^{N/2} L_i^b, \quad (3)$$

where $y_i \in \mathbb{R}^{d_x}$ is the feature of the ground truth of the i-th masked item, $y_j \in \mathbb{R}^{d_x}$ is the feature of a sampled item, and $r_{[m_i]}^{b+} \cdot y_*$ is the inner product of the two features.

Familiarity Task. Next, we predict whether the masked history fragment ever appears in the current input stream, i.e., distinguish positive history fragments from negative ones. This training objective makes the memory learn the ability to recognize experienced events. Concretely, we project each feature $r_{[cls]}^{b+/b-}$ into a confidence score $s^{b+/b-} \in (0, 1)$ by a linear layer with the sigmoid activation, and calculate the familiarity loss by

$$L_{fam}^b = -\log(s^{b+}) - \log(1 - s^{b-}), \quad (4)$$

where $L_{fam}^b$ is the familiarity loss for the b-th pair of positive and negative fragments in $H^+/H^-$.
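A minimal sketch of the two losses, assuming the decoder outputs and item features have already been computed; the function signatures are illustrative rather than the authors' implementation. Eq. (3) becomes a standard cross-entropy over one positive and J sampled items, and Eq. (4) a binary cross-entropy on the two confidence scores.

```python
import torch
import torch.nn.functional as F

def recollection_loss(r_masked: torch.Tensor,  # (N/2, d_x) decoder outputs r^{b+}_{[mi]}
                      y_true: torch.Tensor,    # (N/2, d_x) ground-truth features y_i
                      y_neg: torch.Tensor) -> torch.Tensor:  # (J, d_x) sampled items y_j
    pos = (r_masked * y_true).sum(-1, keepdim=True)  # inner products r . y_i, (N/2, 1)
    neg = r_masked @ y_neg.t()                       # inner products r . y_j, (N/2, J)
    logits = torch.cat([pos, neg], dim=-1)           # ground truth sits at index 0
    # Mean of -log softmax at index 0 == Eq. (3) averaged over the N/2 items.
    return F.cross_entropy(logits, torch.zeros(len(pos), dtype=torch.long))

def familiarity_loss(s_pos: torch.Tensor, s_neg: torch.Tensor) -> torch.Tensor:
    # Eq. (4): binary cross-entropy on confidence scores s^{b+}, s^{b-} in (0, 1).
    return -(torch.log(s_pos) + torch.log(1.0 - s_neg)).mean()
```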
3.3.2. What to Rehearse: History Sampler

Humans tend to judge which information is likely to be used in subsequent inference and pay more attention to it during memorization. Thus, we further train a history sampler SΨ(Q, X) to select informative history fragments H for self-supervised rehearsal training, which is independent of the memory machine M = GΘ(X). In knowledge distillation (Hinton et al., 2015), the teacher model can access privileged information and transfer the knowledge to the student model. Similarly, our history sampler, which is the teacher and can access the raw contents X while answering the query Q, learns the ability to distinguish task-relevant important fragments and guides the rehearsal memory (i.e., the student) to remember critical clues.

Concretely, we independently train a directly reasoning model as the history sampler SΨ(Q, X), which can access the raw contents X while answering the query Q. We first cut the entire input X into C history fragments $\{H^c\}_{c=1}^{C}$ just like the rehearsal memory machine, where each fragment contains N items. Next, we obtain the fragment features $\{h^c\}_{c=1}^{C}$ by averaging the item features in each fragment. After that, we develop attention-based reasoning for the query Q on these fragment features. The query feature $q \in \mathbb{R}^{d_{model}}$ is modeled by the task-specific encoder in different downstream tasks, which is introduced in Section A of the supplementary material. Given the query feature q and fragment features $\{h^c\}_{c=1}^{C}$, we conduct the attention method to aggregate query-relevant clues from the fragments, given by

$$\beta^c = w_h^\top \tanh(W_1^h q + W_2^h h^c + b_h), \quad \hat{\beta}^c = \frac{\exp(\beta^c)}{\sum_{j=1}^{C} \exp(\beta^j)}, \quad e = \sum_{c=1}^{C} \hat{\beta}^c h^c, \quad (5)$$
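As an illustration, here is a sketch of the sampler's attention in Eq. (5). Selecting the B highest-weight fragments via top-k is our reading of "select the vital fragments based on the attention weights"; the excerpt does not spell out the exact selection rule, so treat it as an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HistorySampler(nn.Module):
    def __init__(self, d_x: int, d_model: int, num_fragments: int = 6):
        super().__init__()
        self.W1 = nn.Linear(d_model, d_model, bias=False)  # W_1^h, projects query q
        self.W2 = nn.Linear(d_x, d_model, bias=False)      # W_2^h, projects fragment h^c
        self.b = nn.Parameter(torch.zeros(d_model))        # bias b_h
        self.w = nn.Linear(d_model, 1, bias=False)         # row vector w_h^T
        self.B = num_fragments                             # B = 6 in Section 4.1

    def forward(self, q: torch.Tensor, fragments: torch.Tensor):
        # q: (d_model,) query feature; fragments: (C, d_x) mean-pooled features h^c.
        beta = self.w(torch.tanh(self.W1(q) + self.W2(fragments) + self.b)).squeeze(-1)
        weights = F.softmax(beta, dim=0)    # beta_hat^c over the C fragments
        clues = weights @ fragments         # e = sum_c beta_hat^c h^c, Eq. (5)
        top_b = torch.topk(weights, k=self.B).indices  # indices of vital fragments
        return clues, top_b
```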
3.4. Task-Specific Reasoning Training

Besides self-supervised rehearsal training, we simultaneously develop task-specific reasoning training. For different downstream tasks, we propose different task-specific reason models RΩ(M, Q) based on the memory M. Here we adopt simple and mature components in the reason model for a fair comparison. The details are introduced in Section A of the supplementary material. Briefly, we first learn the query representation q by a task-specific encoder and then perform multi-hop attention-based reasoning. Finally, we obtain the reason loss $L_r$ from RΩ(M, Q).

Eventually, we combine the rehearsal and reason losses to train our model, given by

$$L_{rm} = \lambda_1 L_{rec} + \lambda_2 L_{fam} + \lambda_3 L_r, \quad (7)$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are applied to adjust the balance of the three losses.
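Putting the pieces together, one training step might look as follows. The module interfaces (`memory_machine`, `rehearsal_model`, `reason_model`, `batch`) are hypothetical wrappers around the components sketched above; only the loss weights and Eq. (7) itself come from the paper.

```python
# Loss weights from Section 4.1: lambda_1 = 1.0, lambda_2 = 0.5, lambda_3 = 1.0.
LAMBDAS = (1.0, 0.5, 1.0)

def training_step(memory_machine, rehearsal_model, reason_model, optimizer, batch):
    memory = memory_machine(batch.segments)                   # M = G_Theta(X)
    l_rec, l_fam = rehearsal_model(memory, batch.fragments)   # Eqs. (3)-(4)
    l_r = reason_model(memory, batch.query, batch.answer)     # task-specific loss
    loss = LAMBDAS[0] * l_rec + LAMBDAS[1] * l_fam + LAMBDAS[2] * l_r  # Eq. (7)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```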
4. Experiments

In this section, we first verify our rehearsal memory on the widely-used short-sequence reasoning task bAbI. Next, we mainly compare our approach with diverse baselines on several long-sequence reasoning tasks. We then perform ablation studies on the memory rehearsal techniques and analyze the impact of crucial hyper-parameters.

4.1. Experiment Setting

Model Setting. We first introduce the common model settings for all downstream tasks. We set the layer number of the Transformer encoder and the bi-directional Transformer decoder to 3. The head number in Multi-Head Attention is set to 4. We set λ1, λ2 and λ3 to 1.0, 0.5 and 1.0, respectively. The number B of history fragments is set to 6. During training, we apply an Adam optimizer (Duchi et al., 2011) to minimize the multi-task loss L_rm, where the initial learning rate is set to 0.001.

Baseline. We compare our rehearsal memory with directly reasoning methods and memory-based reasoning approaches. The directly reasoning baselines differ across downstream tasks, and the memory-based baselines are mainly DNC (Graves et al., 2016), NUTM (Le et al., 2019b), DMSDNC (Park et al., 2020), STM (Le et al., 2020a) and Compressive Transformer (CT) (Rae et al., 2019). For a fair comparison, we modify the reasoning module of the memory-based baselines to be consistent with our rehearsal memory, i.e., we conduct multi-hop attention-based reasoning based on the built memory. The number of memory slots in these baselines is also set to K. Besides, we set the core number of NUTM to 4, the query number of STM to 8 and the memory block number of DMSDNC to 2. As for CT, the layer number of the Transformer is set to 3, as in our rehearsal memory, and the compression rate is set to 5.

4.2. Rehearsal Memory on Short-Sequence Reasoning

The bAbI dataset (Weston et al., 2015) is a synthetic text question answering benchmark widely applied to evaluate the memorization and reasoning performance of MANNs. This dataset contains 20 reasoning tasks that are required to be solved with one common model. Although most of these tasks only give short-sequence text input (less than 100 words) and existing methods (Park et al., 2020; Le et al., 2020a) have solved these tasks well, we still compare our rehearsal memory with other memory-based baselines to verify its short-sequence reasoning performance. We set d_x and d_model to 128. The number K of memory slots is set to 20. We naturally take each sentence in the input texts as a segment, and the maximum length N of segments is set to 15. Due to the limited word types in this dataset, we sample all other words as negative items in L_rec.

The results are summarized in Table 1. The RM model solves these bAbI tasks with near-zero error and outperforms existing baselines in the mean and best error rate over 10 runs, verifying that our RM method can conduct effective memorization and reasoning on short-sequence tasks. Among these methods, DNC, NUTM and DMSDNC model the input contents word by word, STM processes input texts as a sentence-level sequence, and the CT and RM methods cut the input texts into sentences for modeling. From the results,
[Table 2. Performance Comparisons for Long-Sequence Text Question Answering on NarrativeQA.]
[Table 3. Performance Comparisons for Long-Term Video Question Answering on ActivityNet-QA.]
4.3. Rehearsal Memory on Long-Sequence Reasoning

We then compare our approach with diverse baselines on several long-sequence reasoning tasks.

4.3.1. Long-Sequence Text Question Answering

We apply the NarrativeQA dataset (Kočiskỳ et al., 2018), with its long input contents, for long-sequence text question answering. This dataset contains 1,572 stories and corresponding summaries generated by humans, where each summary contains more than 600 tokens on average, and there are 46,765 questions in total. We adopt the multi-choice form to answer the given question based on a summary, where the answers to other questions associated with the same summary are regarded as answer candidates. We compute the mean reciprocal rank (MRR) of the correct answer among the candidates as the metric. Besides the memory-based methods, we adopt the directly reasoning models AS Reader (Kadlec et al., 2016) and E2E-MN (Sukhbaatar et al., 2015) as baselines. The AS Reader applies a pointer network to generate the answer, and E2E-MN employs an end-to-end memory network to conduct multi-hop reasoning. For our rehearsal memory, we set d_x and d_model to 256. The number K of memory slots is set to 20. We naturally take each sentence in the summaries as a segment, and the maximum length N of segments is set to 20. And we sample all other words as negative items in L_rec.

We report the results in Table 2. Our RM method obtains the best performance among memory-based approaches, which demonstrates that our self-supervised rehearsal training with the history sampler can effectively enhance long-sequence memorization and reasoning. Further, the RM model slightly

4.3.2. Long-Sequence Video Question Answering

The ActivityNet-QA dataset (Yu et al., 2019) contains 5,800 videos from ActivityNet (Caba Heilbron et al., 2015). The average video duration in this dataset is about 180s, the longest among VQA datasets. We compare our method with four directly reasoning baselines, including the three basic models E-VQA, E-MN and E-SA from (Yu et al., 2019) and the SOTA model HCRN (Le et al., 2020b). For our rehearsal memory, we set d_x and d_model to 256. The number K of memory slots and the length N of segments are both set to 20. And in L_rec, we select 30 other frame features from the video as the sampled items.

As shown in Table 3, the RM method obtains better performance than the other memory-based baselines. Compared to the best baseline CT, our RM model further achieves a 0.9% absolute improvement, showing the effectiveness of our model designs and self-supervised rehearsal training. Moreover, the RM method outperforms the basic directly reasoning baselines E-VQA, E-MN and E-SA, but is slightly worse than the SOTA method HCRN. This suggests our rehearsal memory can reduce the gap between the memory-based and directly reasoning paradigms.

4.3.3. Lifelong Sequence Recommendation

Lifelong sequence recommendation (Ren et al., 2019) aims to predict whether the user will click a given item based on long behavior sequences, so it can be regarded as a long-sequence reasoning task. The XLong dataset (Ren et al., 2019) is sampled from the click logs on Alibaba; the length of the historical behavior sequences in this dataset is 1,000. We compare our method with four directly reasoning methods
[Table 4. Performance Comparisons for Lifelong Sequence Recommendation on XLong.]
[Table 6. Ablation Results about the Transformer Encoder and Mask Ratio.]

[Figure 3. Effect of the Memory Slot Number K, comparing STM, CT and RM: (a) MRR on NarrativeQA; (b) Acc on ActivityNet-QA.]

and the model cannot effectively capture the evidence and infer the answer.

5. Conclusions

In this paper, we propose self-supervised rehearsal to enhance long-sequence memorization. We design the self-supervised recollection and familiarity tasks to alleviate the gradual forgetting of early information. Further, we adopt a history sampler to guide the memory to remember critical information. Extensive experiments on a series of downstream tasks verify the performance of our method. For future work, we will further explore the properties of rehearsal memory.

Acknowledgments
References

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Graves, A., Wayne, G., and Danihelka, I. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.

Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., Colmenarejo, S. G., Grefenstette, E., Ramalho, T., Agapiou, J., et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.

Hidasi, B., Karatzoglou, A., Baltrunas, L., and Tikk, D. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939, 2015.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Jin, W., Zhao, Z., Gu, M., Yu, J., Xiao, J., and Zhuang, Y. Multi-interaction network with object relation for video question answering. In Proceedings of the ACM International Conference on Multimedia, pp. 1193–1201, 2019a.

Jin, W., Zhao, Z., Gu, M., Yu, J., Xiao, J., and Zhuang, Y. Video dialog via multi-grained convolutional self-attention context networks. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 465–474, 2019b.

Kadlec, R., Schmid, M., Bajgar, O., and Kleindienst, J. Text understanding with the attention sum reader network. arXiv preprint arXiv:1603.01547, 2016.

Kočiskỳ, T., Schwarz, J., Blunsom, P., Dyer, C., Hermann, K. M., Melis, G., and Grefenstette, E. The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018.

Le, H., Tran, T., and Venkatesh, S. Learning to remember more with less memorization. arXiv preprint arXiv:1901.01347, 2019a.

Le, H., Tran, T., and Venkatesh, S. Neural stored-program memory. arXiv preprint arXiv:1906.08862, 2019b.

Le, H., Tran, T., and Venkatesh, S. Self-attentive associative memory. arXiv preprint arXiv:2002.03519, 2020a.

Le, T. M., Le, V., Venkatesh, S., and Tran, T. Hierarchical conditional relation networks for video question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9972–9981, 2020b.

Moscovitch, M., Cabeza, R., Winocur, G., and Nadel, L. Episodic memory and beyond: the hippocampus and neocortex in transformation. Annual Review of Psychology, 67:105–134, 2016.

Munkhdalai, T., Sordoni, A., Wang, T., and Trischler, A. Metalearned neural memory. In Advances in Neural Information Processing Systems, pp. 13331–13342, 2019.

Park, T., Choi, I., and Lee, M. Distributed memory based self-supervised differentiable neural computer. arXiv preprint arXiv:2007.10637, 2020.

Pi, Q., Bian, W., Zhou, G., Zhu, X., and Gai, K. Practice on long sequential user behavior modeling for click-through rate prediction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2671–2679, 2019.

Rae, J., Hunt, J. J., Danihelka, I., Harley, T., Senior, A. W., Wayne, G., Graves, A., and Lillicrap, T. Scaling memory-augmented neural networks with sparse reads and writes. In Advances in Neural Information Processing Systems, pp. 3621–3629, 2016.

Rae, J. W., Potapenko, A., Jayakumar, S. M., and Lillicrap, T. P. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.

Ren, K., Qin, J., Fang, Y., Zhang, W., Zheng, L., Bian, W., Zhou, G., Xu, J., Yu, Y., Zhu, X., et al. Lifelong sequential modeling with personalized memorization for user response prediction. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 565–574, 2019.

Santoro, A., Faulkner, R., Raposo, D., Rae, J., Chrzanowski, M., Weber, T., Wierstra, D., Vinyals, O., Pascanu, R., and Lillicrap, T. Relational recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 7299–7310, 2018.

Seo, M., Kembhavi, A., Farhadi, A., and Hajishirzi, H. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603, 2016.

Sukhbaatar, S., Weston, J., Fergus, R., et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pp. 2440–2448, 2015.

Tang, J. and Wang, K. Personalized top-n sequential recommendation via convolutional sequence embedding. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 565–573, 2018.