
REVIEW Communicated by Terrence Sejnowski

A Review of Recurrent Neural Networks:
LSTM Cells and Network Architectures
Yong Yu
[email protected]
Department of Automation, Xi’an Institute of High-Technology, Xi’an 710025,
China, and Institute No. 25, Second Academy of China, Aerospace Science
and Industry Corporation, Beijing 100854, China

Xiaosheng Si
[email protected]
Changhua Hu
[email protected]
Jianxun Zhang
[email protected]
Department of Automation, Xi’an Institute of High-Technology,
Xi’an 710025, China

Recurrent neural networks (RNNs) have been widely adopted in research
areas concerned with sequential data, such as text, audio, and video. How-
ever, RNNs consisting of sigma cells or tanh cells are unable to learn the
relevant information of input data when the input gap is large. By intro-
ducing gate functions into the cell structure, the long short-term memory
(LSTM) could handle the problem of long-term dependencies well. Since
its introduction, almost all the exciting results based on RNNs have been
achieved by the LSTM. The LSTM has become the focus of deep learning.
We review the LSTM cell and its variants to explore the learning capac-
ity of the LSTM cell. Furthermore, the LSTM networks are divided into
two broad categories: LSTM-dominated networks and integrated LSTM
networks. In addition, their various applications are discussed. Finally,
future research directions are presented for LSTM networks.

1 Introduction

Over the past few years, deep learning techniques have been well devel-
oped and widely adopted to extract information from various kinds of data
(Ivakhnenko & Lapa, 1965; Ivakhnenko, 1971; Bengio, 2009; Carrio, Sampe-
dro, Rodriguez-Ramos, & Campoy, 2017; Khan & Yairi, 2018). Considering
different characteristics of input data, there are several types of architec-
tures for deep learning, such as the recurrent neural network (RNN; Robin-
son & Fallside, 1987; Werbos, 1988; Williams, 1989; Ranzato et al., 2014),

convolutional neural network (CNN; Fukushima, 1980; LeCun et al., 1989;
Weng, Ahuja, & Huang, 1993; Rawat & Wang, 2017), and deep neural net-
work (DNN; Guo, Liu, Georgiou, & Lew, 2017; Sharma & Singh, 2017). Usu-
ally the CNN and DNN cannot deal with the temporal information of input
data. Therefore, in research areas that contain sequential data, such as text,
audio, and video, RNNs are dominant. Specifically, there are two types of
RNNs: discrete-time RNNs and continuous-time RNNs (Pearlmutter, 1989;
Brown, Yu, & Garverick, 2004; Gallagher, Boddhu, & Vigraham, 2005). The
focus of this review is on discrete-time RNNs.
The typical feature of the RNN architecture is a cyclic connection, which
enables the RNN to possess the capacity to update the current state based on
past states and current input data. These networks, such as fully RNNs (El-
man, 1990; Jordan, 1986; Chen & Soo, 1996) and selective RNNs (Šter, 2013),

consisting of standard recurrent cells (e.g., sigma cells), have had incredible
success on some problems. Unfortunately, when the gap between the rele-
vant input data is large, the above RNNs are unable to connect the relevant
information. In order to handle the “long-term dependencies,” Hochreiter
and Schmidhuber (1997) proposed long short-term memory (LSTM).
Almost all exciting results based on RNNs have been achieved by LSTM,
and thus it has become the focus of deep learning. Because of their pow-
erful learning capacity, LSTMs work tremendously well and have been
widely used in various kinds of tasks, including speech recognition (Fer-
nández, Graves, & Schmidhuber, 2007; He & Droppo, 2016; Hsu, Zhang,
Lee, & Glass, 2016), acoustic modeling (Sak, Senior, & Beaufays, 2014;
Qu, Haghani, Weinstein, & Moreno, 2017), trajectory prediction (Altché &
Fortelle, 2017), sentence embedding (Palangi et al., 2015), and correlation
analysis (Mallinar & Rosset, 2018). In this review, we explore these LSTM
networks. This review is different from other review work on RNNs (Deng,
2013; Lipton, Berkowitz, & Elkan, 2015); it focuses only on advances in the
LSTM cell and structures of LSTM networks. Here, the LSTM cell denotes
the recurrent unit in LSTM networks.
The rest of this review is organized as follows. Section 2 introduces the
LSTM cell and its variants. Section 3 reviews the popular LSTM networks
and discusses their applications in various tasks. A summary and discus-
sion of future directions are in section 4.

2 LSTM Cells and Their Variants

In RNNs, the recurrent layers or hidden layers consist of recurrent cells


whose states are affected by both past states and current input with feed-
back connections. The recurrent layers can be organized in various archi-
tectures to form different RNNs. Therefore, RNNs are mainly distinguished
by the recurrent cell and network architecture. Different cells and inner con-
nections enable RNNs to possess different capacities. In order to explore the
development of LSTM networks, this section first gives a brief review of the
LSTM cell and its variants.

Figure 1: Schematic of the standard recurrent sigma cell.

2.1 Standard Recurrent Cell. Usually RNNs are networks that consist
of standard recurrent cells such as sigma cells and tanh cells. Figure 1 shows
a schematic of the standard recurrent sigma cell. The mathematical expres-
sions of the standard recurrent sigma cell are written as follows:

ht = σ(Wh ht−1 + Wx xt + b),
yt = ht,                                                      (2.1)

where xt , ht , and yt denote the input, the recurrent information, and the out-
put of the cell at time t, respectively; Wh and Wx are the weights; and b is the
bias. Standard recurrent cells have achieved some success in some problems
(Karpathy, Johnson, & Li, 2015; Li, Li, Cook, Zhu, & Gao, 2018). However,
the recurrent networks that consist of standard recurrent cells are not ca-
pable of handling long-term dependencies: as the gap between the related
inputs grows, it is difficult to learn the connection information. Hochreiter
(1991) and Bengio, Simard, and Frasconi (1994) analyzed fundamental rea-
sons for the long-term dependencies problem: error signals flowing back-
ward in time tend to either blow up or vanish.
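As a concrete illustration of equation 2.1, the following minimal NumPy sketch unrolls a standard recurrent sigma cell over a short sequence. The array names, dimensions, and random initialization are illustrative assumptions and are not taken from the original:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_cell_step(x_t, h_prev, Wh, Wx, b):
    # Equation 2.1: h_t = sigma(Wh h_{t-1} + Wx x_t + b), with y_t = h_t.
    return sigma(Wh @ h_prev + Wx @ x_t + b)

# Illustrative sizes: 3-dimensional input, 4-dimensional hidden state.
rng = np.random.default_rng(0)
input_size, hidden_size, T = 3, 4, 5
Wh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
Wx = rng.standard_normal((hidden_size, input_size)) * 0.1
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)
for t in range(T):
    x_t = rng.standard_normal(input_size)
    h = sigma_cell_step(x_t, h, Wh, Wx, b)   # y_t = h_t is the cell output
print(h)
```

Because the same matrix Wh multiplies the hidden state at every step, repeated backpropagation through this recurrence is exactly where the error signals tend to blow up or vanish.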

2.2 LSTM. In order to deal with the problem of “long-term depen-


dencies,” Hochreiter and Schmidhuber (1997) proposed the LSTM cell.
They improved the remembering capacity of the standard recurrent cell by
introducing a “gate” into the cell.

Figure 2: Original LSTM architecture.

Since this pioneering work, LSTMs have
been modified and popularized by many researchers (Gers, 2001; Gers &
Schmidhuber, 2000). Variations include LSTM without a forget gate, LSTM


with a forget gate, and LSTM with a peephole connection. Usually the term
LSTM cell denotes LSTM with a forget gate. We first introduce the original
LSTM model that possesses only input and output gates.
2.2.1 LSTM without a Forget Gate. The architecture of LSTM with only
input and output gates is shown in Figure 2. The mathematical expressions
of the LSTM in Figure 2 can be written as follows,

it = σ(Wih ht−1 + Wix xt + bi),
c̃t = tanh(Wc̃h ht−1 + Wc̃x xt + bc̃),
ct = ct−1 + it · c̃t,
ot = σ(Woh ht−1 + Wox xt + bo),
ht = ot · tanh(ct),                                           (2.2)

where ct denotes the cell state of LSTM. Wi , Wc̃ , and Wo are the weights, and
the operator ‘·’ denotes the pointwise multiplication of two vectors. When
updating the cell state, the input gate can decide what new information can
be stored in the cell state, and the output gate decides what information can
be output based on the cell state.
Figure 3: Architecture of LSTM with a forget gate.

2.2.2 LSTM with a Forget Gate. Gers, Schmidhuber, and Cummins (2000)
modified the original LSTM in 2000 by introducing a forget gate into the
cell. In order to obtain the mathematical expressions of this modified LSTM
cell, Figure 3 presents its inner connections.
Based on the connections shown in Figure 3, the LSTM cell can be math-
ematically expressed as follows:

ft = σ(Wfh ht−1 + Wfx xt + bf),
it = σ(Wih ht−1 + Wix xt + bi),
c̃t = tanh(Wc̃h ht−1 + Wc̃x xt + bc̃),
ct = ft · ct−1 + it · c̃t,
ot = σ(Woh ht−1 + Wox xt + bo),
ht = ot · tanh(ct).                                           (2.3)

The forget gate can decide what information will be thrown away from
the cell state. When the value of the forget gate, ft , is 1, it keeps this infor-
mation; meanwhile, a value of 0 means it gets rid of all the information.
Jozefowicz, Zaremba, and Sutskever (2015) found that when increasing the
bias of the forget gate, b f , the performance of the LSTM network usually
became better. Furthermore, Schmidhuber, Wierstra, Gagliolo, and Gomez
(2007) proposed that LSTM was sometimes better trained by evolutionary
algorithms combined with other techniques rather than by pure gradient
descent.
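To make equation 2.3 concrete, here is a minimal NumPy sketch of one forward step of an LSTM cell with a forget gate. The weight shapes, random initialization, and the positive forget-gate bias (following the observation of Jozefowicz et al., 2015) are illustrative assumptions rather than anything prescribed by the original:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W and b hold one (W_h, W_x) pair and one bias per gate, as in equation 2.3.
    f_t = sigma(W["fh"] @ h_prev + W["fx"] @ x_t + b["f"])   # forget gate
    i_t = sigma(W["ih"] @ h_prev + W["ix"] @ x_t + b["i"])   # input gate
    c_tilde = np.tanh(W["ch"] @ h_prev + W["cx"] @ x_t + b["c"])
    c_t = f_t * c_prev + i_t * c_tilde                       # cell state update
    o_t = sigma(W["oh"] @ h_prev + W["ox"] @ x_t + b["o"])   # output gate
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Illustrative initialization; the forget bias is set to a positive value.
rng = np.random.default_rng(1)
nx, nh = 3, 4
W = {k: rng.standard_normal((nh, nh if k.endswith("h") else nx)) * 0.1
     for k in ["fh", "fx", "ih", "ix", "ch", "cx", "oh", "ox"]}
b = {"f": np.ones(nh), "i": np.zeros(nh), "c": np.zeros(nh), "o": np.zeros(nh)}

h, c = np.zeros(nh), np.zeros(nh)
for x_t in rng.standard_normal((6, nx)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h)
```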

2.2.3 LSTM with a Peephole Connection. As the gates of the above LSTM
cells do not have direct connections from the cell state, there is a lack of
essential information that harms the network’s performance. In order to
solve this problem, Gers and Schmidhuber (2000) extended the LSTM cell
by introducing a peephole connection, as shown in Figure 4.

Figure 4: LSTM architecture with a peephole connection.

Based on the connections shown in Figure 4, the mathematical expres-
sions can be expressed as follows:

ft = σ(Wfh ht−1 + Wfx xt + Pf · ct−1 + bf),
it = σ(Wih ht−1 + Wix xt + Pi · ct−1 + bi),
c̃t = tanh(Wc̃h ht−1 + Wc̃x xt + bc̃),
ct = ft · ct−1 + it · c̃t,
ot = σ(Woh ht−1 + Wox xt + Po · ct + bo),
ht = ot · tanh(ct),                                           (2.4)

where P f , Pi , and Po are the peephole weights for the forget gate, input
gate, and output gate, respectively. The peephole connections allow the
LSTM cell to inspect its current internal states (Gers & Schmidhuber, 2001),
and thus the LSTM with a peephole connection can learn stable and pre-
cise timing algorithms without teacher forcing (Gers & Schraudolph, 2002).
To explore the LSTM with a peephole connection in depth, Greff, Srivas-
tava, Koutník, Steunebrink, and Schmidhuber (2016) compared the perfor-
mance of eight variants: no input gate, no forget gate, no output gate, no
input activation function, no output activation function, coupled input and
forget gate (CIFG), no peepholes, and full gate recurrence. Each variant dif-
fered from the original LSTM with a peephole connection by only a sin-
gle change. The results showed that the forget and output gates are the
most critical components, and removing any of them obviously decreases
network performance. Moreover, the modified coupled input and forget
gate could reduce the number of parameters and lower computational cost
without a significant decrease in network performance. Because of this
powerful capacity, LSTM has become the focus of deep learning and has
been applied to multiple tasks.

Figure 5: GRU cell architecture and connections.
2.3 Gated Recurrent Unit. The learning capacity of the LSTM cell is su-
perior to that of the standard recurrent cell. However, the additional param-
eters increase computational burden. Therefore, the gated recurrent unit
(GRU) was introduced by Cho et al. (2014). Figure 5 shows the details of
the architecture and connections of the GRU cell.
Based on the information shown in Figure 5, the mathematical expressions
of the GRU cell are as follows:

rt = σ(Wrh ht−1 + Wrx xt + br),
zt = σ(Wzh ht−1 + Wzx xt + bz),
h̃t = tanh(Wh̃h (rt · ht−1) + Wh̃x xt + bh̃),
ht = (1 − zt) · ht−1 + zt · h̃t.                               (2.5)

In order to reduce the number of parameters, the GRU cell integrates the
forget gate and input gate of the LSTM cell as an update gate. The GRU cell
has only two gates: an update gate and a reset gate. Therefore, it could save
one gating signal and the associated parameters. The GRU is essentially a
variant of vanilla LSTM with a forget gate. Since one gate is missing, the
single GRU cell is less powerful than the original LSTM. The GRU cannot
be taught to count or to solve context-free language (Weiss, Goldberg, &
Yahav, 2018) and also does not work for translation (Britz, Goldie, Luong,
& Le, 2017). Chung, Gulcehre, Cho, and Bengio (2014) empirically evalu-
ated the performance of the LSTM network, GRU network, and traditional
tanh-RNN and found that both the LSTM cell and GRU cell were supe-
rior to the traditional tanh unit, under the condition that each network had
approximately the same number of parameters. Dey and Salem (2017)
modified the original GRU and evaluated the properties of three variants of
the GRU—GRU-1, GRU-2, and GRU-3—on the MNIST and IMDB data sets.
The results showed that these three variants could reduce the computational
expense while performing as well as the original GRU cell.

Figure 6: MGU cell schematic.
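As a concrete illustration of equation 2.5, the following minimal NumPy sketch performs one GRU update step. The array names, sizes, and random initialization are illustrative assumptions, not taken from the original:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, b):
    # Equation 2.5: reset gate r_t, update gate z_t, candidate state h_tilde.
    r_t = sigma(W["rh"] @ h_prev + W["rx"] @ x_t + b["r"])
    z_t = sigma(W["zh"] @ h_prev + W["zx"] @ x_t + b["z"])
    h_tilde = np.tanh(W["hh"] @ (r_t * h_prev) + W["hx"] @ x_t + b["h"])
    # The single update gate plays the roles of both the forget and input gates.
    return (1.0 - z_t) * h_prev + z_t * h_tilde

rng = np.random.default_rng(2)
nx, nh = 3, 4
W = {k: rng.standard_normal((nh, nh if k.endswith("h") else nx)) * 0.1
     for k in ["rh", "rx", "zh", "zx", "hh", "hx"]}
b = {g: np.zeros(nh) for g in "rzh"}

h = np.zeros(nh)
for x_t in rng.standard_normal((6, nx)):
    h = gru_step(x_t, h, W, b)
print(h)
```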

2.4 Minimal Gated Unit. To further reduce the number of cell param-
eters, Zhou, Sun, Liu, and Lau (2016) proposed the minimal gated unit,
which only has one gate. The schematic of the MGU is shown in Figure
6. Based on the connections in the figure, the mathematical expressions of
the MGU can be written as follows:

ft = σ(Wfh ht−1 + Wfx xt + bf),
h̃t = tanh(Wh̃h (ft · ht−1) + Wh̃x xt + bh̃),
ht = (1 − ft) · ht−1 + ft · h̃t.                               (2.6)

The MGU cell involves only one forget gate. Hence, this kind of cell
has a simpler structure and fewer parameters compared with LSTM and
the GRU. Evaluation results (Zhou et al., 2016), based on various sequence
data, showed that the performance of the MGU is comparable to that of
the GRU. In order to further simplify the MGU, Heck and Salem (2017) in-
troduced three model variants—MGU-1, MGU-2, and MGU-3—which re-
duced the number of parameters in the forget gate. They found that the
performances of these variants were comparable to those of the MGU in
tasks on the MNIST data set and the Reuters Newswire Topics data set.
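A minimal sketch of the MGU update of equation 2.6, under the same illustrative assumptions as the earlier sketches (names and sizes are not from the original):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def mgu_step(x_t, h_prev, Wfh, Wfx, bf, Whh, Whx, bh):
    # Equation 2.6: a single forget gate plays both gating roles.
    f_t = sigma(Wfh @ h_prev + Wfx @ x_t + bf)
    h_tilde = np.tanh(Whh @ (f_t * h_prev) + Whx @ x_t + bh)
    return (1.0 - f_t) * h_prev + f_t * h_tilde

rng = np.random.default_rng(3)
nx, nh = 3, 4
Wfh, Whh = (rng.standard_normal((nh, nh)) * 0.1 for _ in range(2))
Wfx, Whx = (rng.standard_normal((nh, nx)) * 0.1 for _ in range(2))

h = np.zeros(nh)
for x_t in rng.standard_normal((6, nx)):
    h = mgu_step(x_t, h, Wfh, Wfx, np.zeros(nh), Whh, Whx, np.zeros(nh))
print(h)
```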

2.5 Other LSTM Variants. In order to determine whether the architec-


ture of the LSTM cell is optimal, Jozefowicz et al. (2015) evaluated over
10,000 different architectures and found three that outperformed both the
LSTM and GRU on the considered tasks: MUT-1, MUT-2, and MUT-3. The
architectures of these cells are similar to that of the GRU with only two
gates: an update gate and a reset gate. Neil, Pfeiffer, and Liu (2016) extended
the LSTM cell by adding a new time gate, and proposed the phased LSTM
cell. This new time gate leads to a sparse updating for the cell, which makes
the phased LSTM achieve faster convergence than regular LSTMs on tasks
of learning long sequences. Nina and Rodriguez (2016) simplified the LSTM
cell by removing the input gate and coupling the forget gate and input gate.
Removing the input gate produced better results in tasks that did not need
to recall very long sequences.
Unlike the variants that modified the LSTM cell through decreasing or
increasing gate functions, there are other kinds of LSTM variants. For ex-
ample, Rahman, Mohammed, and Azad (2017) incorporated a biologically
inspired variation into the LSTM cell and proposed the biologically variant
LSTM. They changed only the updating of cell state c(t) to improve cell ca-
pacity. Moreover, LSTM with working memory was introduced by Pulver
and Lyu (2017). This modified version replaced the forget gate with a func-
tional layer, whose input is decided by the previous memory cell value.
The gated orthogonal recurrent unit (GORU) was proposed by Jing et al.
(2017). They modified the GRU by an orthogonal matrix, which replaced
the hidden state loop matrix of the GRU. This made the GORU possess
the advantages of both an orthogonal matrix and the GRU structure. Irie,
Tüske, Alkhouli, Schlüter, and Ney (2016) added highway connections to
the LSTM cell and GRU cell to ensure an unobstructed information flow
between adjacent layers. Furthermore, Veeriah, Zhuang, and Qi (2015) in-
troduced a differential gating scheme for the LSTM neural network to solve
the impact of spatiotemporal dynamics and then proposed the differential
LSTM cell.
Although variants and analogous neural cells have been introduced,
they tend to be effective only on one or a few specific data sets. Additionally,
no cell variant outperforms the LSTM cell across the board. Thus, the
LSTM cell is still the focus of deep learning for the processing of sequential
data, and it remains the most popular recurrent cell in the literature. There-
fore, in section 3, we give a comprehensive review of networks that consist
of LSTM cells.

3 Different LSTM Networks

Due to the limited capacity of the single LSTM cell in handling engineering
problems, the LSTM cells have to be organized into a specific network archi-
tecture when processing practical data. Despite the fact that all RNNs can be
modified as LSTM networks by replacing the standard recurrent cell with
the LSTM cell, this review discusses only the verified LSTM networks. We
divide these LSTM networks into two broad categories: LSTM-dominated
networks and integrated LSTM networks. LSTM-dominated networks are
neural networks that are mainly constructed by LSTM cells. These networks
focus on optimizing the connections of the inner LSTM cells so as to en-
hance the network properties (Nie, An, Huang, Yan, & Han, 2016; Zhao,
rec
Wang, Yan, & Mao, 2016). The integrated LSTM networks consist of LSTM
layers and other components, such as a CNN and an external memory unit.
The integrated LSTM networks mainly pay attention to integrating the ad-
vantageous features of different components when dealing with the target
task.
Figure 7: Simplified schematic diagram of the LSTM cell.

Before introducing the LSTM networks, we simplify the schematic of the
LSTM cell in section 2.2 as in Figure 7. In the figure, the dashed line indicates
identity transformations. Therefore, the LSTM cell only outputs ht along the
depth dimension with no time delay and transmits both the ct and ht along
the time dimension with one-unit-interval delay.
3.1 LSTM-Dominated Neural Networks

3.1.1 Stacked LSTM Network. In the usual application, the simplest


method to add capacity and depth to the LSTM network is to stack the
LSTM layers. Therefore, the stacked LSTM network is the most basic and
simplest structure of the LSTM network (Fernández, Graves, & Schmidhu-
ber, 2007). It can also be treated as a multilayer fully connected structure. A
block of three recurrent layers is illustrated in Figure 8.
Then we unroll this schematic along the time dimension in Figure 9,
where we assume that the sequence length is 4.
Figure 8: The stacked LSTM network.

Figure 9: An unrolled stacked LSTM network.

In the depth dimension of the stacked LSTM network, the output of the
(L − 1)th LSTM layer at time t is h_t^{L−1}. This output is treated as the input,
x_t^L, of the Lth layer. These output-input connections are the only relation
between two adjacent layers. In the time dimension, the recurrent connections
are only within one layer. Based on the previous analysis, the mathematical
expressions of the Lth LSTM layer could be expressed as follows:

f_t^L = σ(W_{fh}^L h_{t−1}^L + W_{fx}^L h_t^{L−1} + b_f^L),
i_t^L = σ(W_{ih}^L h_{t−1}^L + W_{ix}^L h_t^{L−1} + b_i^L),
c̃_t^L = tanh(W_{c̃h}^L h_{t−1}^L + W_{c̃x}^L h_t^{L−1} + b_{c̃}^L),
c_t^L = f_t^L · c_{t−1}^L + i_t^L · c̃_t^L,
o_t^L = σ(W_{oh}^L h_{t−1}^L + W_{ox}^L h_t^{L−1} + b_o^L),
h_t^L = o_t^L · tanh(c_t^L),                                  (3.1)

where h_t^{L−1} is the output of the (L − 1)th LSTM layer at time t.
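A minimal sketch of the stacking scheme of equation 3.1: each layer runs the LSTM step of equation 2.3, and the output of layer L − 1 at time t serves as the input of layer L. All names, sizes, and initializations below are illustrative assumptions:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    # One LSTM cell step (equation 2.3); p packs this layer's weights and biases.
    f = sigma(p["Wfh"] @ h_prev + p["Wfx"] @ x_t + p["bf"])
    i = sigma(p["Wih"] @ h_prev + p["Wix"] @ x_t + p["bi"])
    c_tilde = np.tanh(p["Wch"] @ h_prev + p["Wcx"] @ x_t + p["bc"])
    c = f * c_prev + i * c_tilde
    o = sigma(p["Woh"] @ h_prev + p["Wox"] @ x_t + p["bo"])
    return o * np.tanh(c), c

def init_layer(rng, nx, nh):
    p = {"W" + g + "h": rng.standard_normal((nh, nh)) * 0.1 for g in "fico"}
    p.update({"W" + g + "x": rng.standard_normal((nh, nx)) * 0.1 for g in "fico"})
    p.update({"b" + g: np.zeros(nh) for g in "fico"})
    return p

# Equation 3.1: the input of layer L at time t is the output h of layer L - 1.
rng = np.random.default_rng(4)
nx, nh, n_layers, T = 3, 4, 3, 5
layers = [init_layer(rng, nx if L == 0 else nh, nh) for L in range(n_layers)]
h = [np.zeros(nh) for _ in range(n_layers)]
c = [np.zeros(nh) for _ in range(n_layers)]

for x_t in rng.standard_normal((T, nx)):
    inp = x_t
    for L, p in enumerate(layers):
        h[L], c[L] = lstm_step(inp, h[L], c[L], p)
        inp = h[L]           # feeds the layer above, as in equation 3.1
print(h[-1])                 # output of the top layer at the last time step
```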
Because of the simple and efficient architecture, the stacked LSTM network
has been widely adopted by researchers. Du, Zhang, Nguyen, and
Han (2017) adopted the stacked LSTM network to solve the problem of
vehicle-to-vehicle communication and discovered that the efficiency of the
stacked LSTM-based regression model was much higher than that of lo-
gistic regression. Sutskever, Vinyals, and Le (2014) used a stacked LSTM
network with four layers, and 1000 cells at each layer, to accomplish an
English-to-French translation task. They found that reversing the order of
source words could introduce short-term dependencies between the source
and the target sentence; thus, the performance of this LSTM network was
remarkably improved. Saleh, Hossny, and Nahavandi (2018) constructed a
deep stacked LSTM network to predict the intent of vulnerable road users.
When evaluated on the Daimler pedestrian path prediction benchmark data
set (Schneider & Gavrila, 2013) for intent and path prediction of pedestrians
in four unique traffic scenarios, the stacked LSTM was more powerful than
the methodology relying on a set of specific hand-crafted features.

3.1.2 Bidirectional LSTM Network. Conventional RNNs are only able to
make use of a previous context. In order to overcome this shortcoming, the


bidirectional RNN (BRNN) was introduced by Schuster and Paliwal (1997).
This kind of architecture could be trained in both time directions simultane-
ously, with separate hidden layers (i.e., forward layers and backward lay-
ers). Graves and Schmidhuber (2005) combined the BRNN with the LSTM
cell and proposed the bidirectional LSTM. Figure 10 shows the internal con-
nections of bidirectional LSTM recurrent layers.
Figure 10: Internal connections of bidirectional LSTM.

In the architecture in Figure 10, the connections of the forward layers are the
same as in the stacked LSTM network, which computes the sequence (h_t^L, c_t^L)
from t = 1 to T. However, for the backward layers, the hidden sequence
(←h_t^L, ←c_t^L) and the output are iterated from t = T to 1. Therefore, the
mathematical expressions of the LSTM cell in backward layer L, at time t,
could be written as follows:
←f_t^L = σ(←W_{fh}^L ←h_{t+1}^L + ←W_{fx}^L h_t^{L−1} + ←b_f^L),
←i_t^L = σ(←W_{ih}^L ←h_{t+1}^L + ←W_{ix}^L h_t^{L−1} + ←b_i^L),
←c̃_t^L = tanh(←W_{c̃h}^L ←h_{t+1}^L + ←W_{c̃x}^L h_t^{L−1} + ←b_{c̃}^L),
←c_t^L = ←f_t^L · ←c_{t+1}^L + ←i_t^L · ←c̃_t^L,
←o_t^L = σ(←W_{oh}^L ←h_{t+1}^L + ←W_{ox}^L h_t^{L−1} + ←b_o^L),
←h_t^L = ←o_t^L · tanh(←c_t^L).                               (3.2)
The output of this architecture can be expressed as

y_t = W_{hy} h_t + W_{←hy} ←h_t + b_y,                        (3.3)

where h_t and ←h_t are the outputs of the forward and backward layers,
respectively.
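The wrapper below sketches the bidirectional scheme: one pass over the sequence from t = 1 to T, one from t = T to 1, and a per-step combination as in equation 3.3. To keep the sketch short, a toy tanh cell stands in for the recurrent step; in the bidirectional LSTM both step functions would be LSTM steps as in equation 2.3. All names and sizes are illustrative assumptions:

```python
import numpy as np

def bidirectional_layer(xs, step_fw, step_bw, combine):
    """Run one forward and one backward pass over xs and combine the
    per-step outputs (equation 3.3)."""
    T = len(xs)
    h_fw, h_bw = [None] * T, [None] * T
    state = None
    for t in range(T):                      # forward layer: t = 1 .. T
        state, h_fw[t] = step_fw(xs[t], state)
    state = None
    for t in reversed(range(T)):            # backward layer: t = T .. 1
        state, h_bw[t] = step_bw(xs[t], state)
    return [combine(hf, hb) for hf, hb in zip(h_fw, h_bw)]

def make_toy_cell(rng, nx, nh):
    # Stand-in recurrent cell; a real bidirectional LSTM would use lstm_step.
    Wh = rng.standard_normal((nh, nh)) * 0.1
    Wx = rng.standard_normal((nh, nx)) * 0.1
    def step(x_t, h_prev):
        h_prev = np.zeros(nh) if h_prev is None else h_prev
        h_t = np.tanh(Wh @ h_prev + Wx @ x_t)
        return h_t, h_t
    return step

rng = np.random.default_rng(5)
nx, nh, T = 3, 4, 6
Why_f = rng.standard_normal((2, nh)) * 0.1   # weights for the forward outputs
Why_b = rng.standard_normal((2, nh)) * 0.1   # weights for the backward outputs
ys = bidirectional_layer(
    list(rng.standard_normal((T, nx))),
    make_toy_cell(rng, nx, nh),
    make_toy_cell(rng, nx, nh),
    combine=lambda hf, hb: Why_f @ hf + Why_b @ hb,   # equation 3.3 (b_y = 0)
)
print(ys[0])
```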
Bidirectional LSTM networks have been widely adopted by researchers
because of their excellent properties (Han, Wu, Jiang, & Davis, 2017; Yu,
Xu, & Zhang, 2018). Thireou and Reczko (2007) applied this architecture to
the sequence-based prediction of protein localization. The results showed
that the bidirectional LSTM network outperforms the feedforward network
and standard recurrent network. Wu, Zhang, and Zong (2016) investigated
the different skip connections in a stacked bidirectional LSTM network and
found that adding skip connections to the cell outputs with a gated identity
function could improve network performance on the part-of-speech tag-
ging task. Brahma (2018) extended the bidirectional LSTM to suffix bidirec-
tional LSTM (SuBiLSTM), which improved the bidirectional LSTM network
by encoding each suffix of the sequence. The SuBiLSTM outperformed the
bidirectional LSTM in learning general sentence representations.
Figure 11: Unrolled 2D LSTM network.
3.1.3 Multidimensional LSTM Network. The standard RNNs can only be
used to deal with one-dimensional data. However, in fields with multidi-
mensional data like video processing, the properties of RNNs—the ability
to access contextual information and robustness to input warping—are also
desired. Graves, Fernández, and Schmidhuber (2007) introduced multidi-
mensional LSTM (MDLSTM) to extend the application fields of RNNs. The
core idea of MDLSTM was to build as many recurrent connections as the di-
mensions of the data. At each point in the data sequence, the recurrent layer
L receives both the output of layer L − 1 and its own activations from one
step back along all dimensions. This means that the LSTM cells in layer L
have n-forget gates in the n-dimensional LSTM network (one for each of the
cell’s previous states along every dimension). Figure 11 shows the unrolled
two-dimensional (2D) LSTM network case.
In the 2D LSTM network, the blue lines and green lines indicate recurrent
connections along dimensions k and j, respectively. The expression ∗(k, j)
denotes the recurrent information (h_{k,j}^L, c_{k,j}^L). Then the mathematical
expression of the Lth LSTM layer is expressed as follows:

f_k^L = σ(W_{f1k}^L h_{k−1,j}^L + U_{f1}^L h_{k,j}^{L−1} + b_{fk}^L),
f_j^L = σ(W_{f2j}^L h_{k,j−1}^L + U_{f2}^L h_{k,j}^{L−1} + b_{fj}^L),
i_{k,j}^L = σ(W_{ik}^L h_{k−1,j}^L + W_{ij}^L h_{k,j−1}^L + U_i^L h_{k,j}^{L−1} + b_i^L),
c̃_{k,j}^L = tanh(W_{c̃k}^L h_{k−1,j}^L + W_{c̃j}^L h_{k,j−1}^L + U_{c̃}^L h_{k,j}^{L−1} + b_{c̃}^L),
c_{k,j}^L = f_k^L · c_{k−1,j}^L + f_j^L · c_{k,j−1}^L + i_{k,j}^L · c̃_{k,j}^L,
o_{k,j}^L = σ(W_{ok}^L h_{k−1,j}^L + W_{oj}^L h_{k,j−1}^L + U_o^L h_{k,j}^{L−1} + b_o^L),
h_{k,j}^L = o_{k,j}^L · tanh(c_{k,j}^L),                      (3.4)

where (h_{k−1,j}^L, c_{k−1,j}^L) and (h_{k,j−1}^L, c_{k,j−1}^L) denote the one-step-back recurrent
information of point (k, j) along dimensions k and j at LSTM layer L, and
h_{k,j}^{L−1} is the output of point (k, j) at LSTM layer L − 1.
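The sketch below updates a single point (k, j) of a 2D MDLSTM layer following equation 3.4, with one forget gate per dimension. The parameter names and sizes are illustrative assumptions:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def mdlstm2d_point(x_kj, h_k, c_k, h_j, c_j, p):
    """Update point (k, j) of a 2D multidimensional LSTM layer (equation 3.4).
    (h_k, c_k) come from the neighbor along dimension k, (h_j, c_j) from the
    neighbor along dimension j, and x_kj is the input from the layer below."""
    f_k = sigma(p["Wf1"] @ h_k + p["Uf1"] @ x_kj + p["bf1"])   # forget gate, dim k
    f_j = sigma(p["Wf2"] @ h_j + p["Uf2"] @ x_kj + p["bf2"])   # forget gate, dim j
    i = sigma(p["Wik"] @ h_k + p["Wij"] @ h_j + p["Ui"] @ x_kj + p["bi"])
    c_tilde = np.tanh(p["Wck"] @ h_k + p["Wcj"] @ h_j + p["Uc"] @ x_kj + p["bc"])
    c = f_k * c_k + f_j * c_j + i * c_tilde
    o = sigma(p["Wok"] @ h_k + p["Woj"] @ h_j + p["Uo"] @ x_kj + p["bo"])
    return o * np.tanh(c), c

rng = np.random.default_rng(6)
nx, nh = 3, 4
p = {}
p.update({k: rng.standard_normal((nh, nh)) * 0.1
          for k in ["Wf1", "Wf2", "Wik", "Wij", "Wck", "Wcj", "Wok", "Woj"]})
p.update({k: rng.standard_normal((nh, nx)) * 0.1
          for k in ["Uf1", "Uf2", "Ui", "Uc", "Uo"]})
p.update({k: np.zeros(nh) for k in ["bf1", "bf2", "bi", "bc", "bo"]})

h, c = mdlstm2d_point(rng.standard_normal(nx), np.zeros(nh), np.zeros(nh),
                      np.zeros(nh), np.zeros(nh), p)
print(h)
```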
Graves et al. (2007) extended the above unidirectional MDLSTM to the
multidirectional MDLSTM and adopted this architecture to deal with the
Air Freight database (McCarter & Storkey, 2007) and the MNIST database
(LeCun, Bottou, Bengio, & Haffner, 1998). The results showed that this ar-
chitecture is more robust to input warping than the state-of-the-art digit
recognition algorithm. Li, Mohamed, Zweig, and Gong (2016a) constructed
a MDLSTM network by time-frequency LSTM cells (Li, Mohamed, Zweig,
& Gong, 2016b), which performed recurrence along the time and frequency
axes. On the Microsoft Windows phone short message dictation task, the
recurrence over both time and frequency axes promoted learning accuracy
compared with a network that consisted of only time LSTM cells.

3.1.4 Graph LSTM Network. When processing graph-structured data, the


existing pixel-wise LSTM structures usually assume that each pixel is in-
fluenced by fixed neighboring pixels. These structures include row LSTM
(Oord, Kalchbrenner, & Kavukcuoglu, 2016), diagonal BiLSTM (Shabanian,
Arpit, Trischler, & Bengio, 2017), and local-global LSTM (Liang et al., 2016).
These predefined recurrent sequences usually cause redundant computa-
tional costs. Liang et al. (2016) extended the fixed topologies and proposed
the graph LSTM network, based on the graph RNN network (Goller &
Kuchler, 1996). The graph LSTM model assumes that each superpixel node
is defined by its previous states and adaptive neighboring nodes. Figures
12a and 12b show a superpixel graph topology and the corresponding
graph LSTM architecture, respectively. Instead of using the fixed starting
node and predefined updating route for all images, the starting node and
node updating scheme of graph LSTM are dynamically specified.
Figure 12: The proposed graph LSTM structure.

The superpixel graph topology is acquired by image oversegmentation
using simple linear iterative clustering (Achanta et al., 2010). For conve-
nience, we first define the average hidden states for the neighboring nodes
of node k:

Pro

h̄_{k,t} = ( Σ_{j∈N_G(k)} (1(q_j = 1) h_{j,t+1} + 1(q_j = 0) h_{j,t}) ) / |N_G(k)|,      (3.5)

where N_G(k) denotes the set of neighboring nodes of the visited node k, and q_j
indicates the state of node j, which is set to 1 if the node has been updated and
to 0 otherwise. Furthermore, neighboring nodes have different influences on
the visited node k. Graph LSTM utilizes the adaptive forget gates f̄_{kj}, j ∈ N_G(k),
to describe this. The graph LSTM cell is updated as follows:

i_{k,t} = σ(W_{ih} h_{k,t−1} + W_{ih̄} h̄_{k,t−1} + W_{ix} x_{k,t} + b_i),
f_{k,t} = σ(W_{fh} h_{k,t−1} + W_{fx} x_{k,t} + b_f),
f̄_{kj,t} = σ(W_{f̄h} h_{j,t−1} + W_{f̄x} x_{k,t} + b_{f̄}),   j ∈ N_G(k),
c̃_{k,t} = tanh(W_{c̃h} h_{k,t−1} + W_{c̃h̄} h̄_{k,t−1} + W_{c̃x} x_{k,t} + b_{c̃}),
c_{k,t} = f_{k,t} · c_{k,t−1} + i_{k,t} · c̃_{k,t}
          + ( Σ_{j∈N_G(k)} (1(q_j = 1) f̄_{kj,t} · c_{j,t} + 1(q_j = 0) f̄_{kj,t} · c_{j,t−1}) ) / |N_G(k)|,
o_{k,t} = σ(W_{oh} h_{k,t−1} + W_{oh̄} h̄_{k,t−1} + W_{ox} x_{k,t} + b_o),
h_{k,t} = tanh(o_{k,t} · c_{k,t}),                             (3.6)

where the input xk,t is the feature of graph node k.


Liang et al. (2016) demonstrated the superiority of the graph LSTM
network on four data sets: the Fashionista data set (Yamaguchi, 2012),
PASCAL-Person-Part data set (Chen et al., 2014), ATR data set (Liang et al.,
2015), and Horse-Cow Parsing data set (Wang & Yuille, 2015), and the re-
sults showed that this network could perform well in semantic object pars-
ing. Liang et al. (2017) further extended the graph LSTM model to the
structure-evolving LSTM. The structure-evolving LSTM merges the graph
nodes stochastically and thus learns the multilevel graph structures in a
progressive and stochastic way. The effectiveness evaluation shows that the
structure-evolving LSTM outperforms the state-of-the-art LSTM models.

3.1.5 Grid LSTM Network. The Grid LSTM network, proposed by Kalch-
brenner, Danihelka, and Graves (2015), could also be used to process multi-
dimensional data. This architecture arranges the LSTM cells in a grid of one
or more dimensions. Different from the existing networks, the grid LSTM
network has recurrent connections along the depth dimension. In order to
explain the architecture of the grid LSTM, the blocks from the standard
LSTM and Grid LSTM are shown in Figure 13.

Figure 13: Blocks of the stacked LSTM and the grid LSTM.
In the blocks, the dashed lines denote identity transformations. The red
lines, green lines, and purple lines indicate the transformations along dif-
ferent dimensions. Compared with the standard LSTM block, the 2D Grid
LSTM block has the cell memory vector along the vertical dimension.
Kalchbrenner et al. (2015) put forward that the one-dimensional (1D)
grid LSTM network is the architecture that replaces the transfer functions
(e.g., tanh and ReLU; Nair & Hinton, 2010) of the feedforward network by
the 1D grid LSTM block. Thus, the 2D Grid LSTM network is similar to the
stacked LSTM network, but there are recurrent connections along the depth
dimension. The grid LSTM with three or more dimensions corresponds to
MDLSTM (Graves, 2012), but the N-way recurrent interactions exist along
all dimensions. Figure 14 shows the stacked LSTM network and 2D Grid
LSTM network.

Figure 14: The stacked LSTM and 2D grid LSTM.
Based on the mathematical expressions of the LSTM cell block in equa-
tion 2.3, in the 2D Grid LSTM network along the time dimension, the ex-
pressions of the LSTM cell at layer L and time t are expressed as follows:

f_t^L = σ(W_{fh}^L h_{t−1}^L + W_{fx}^L h_t^{L−1} + b_f^L),
i_t^L = σ(W_{ih}^L h_{t−1}^L + W_{ix}^L h_t^{L−1} + b_i^L),
c̃_t^L = tanh(W_{c̃h}^L h_{t−1}^L + W_{c̃x}^L h_t^{L−1} + b_{c̃}^L),
c_t^L = f_t^L · c_{t−1}^L + i_t^L · c̃_t^L,
o_t^L = σ(W_{oh}^L h_{t−1}^L + W_{ox}^L h_t^{L−1} + b_o^L),
h_t^L = o_t^L · tanh(c_t^L).                                  (3.7)

Furthermore, along the vertical dimension, we replace the input x_t of
equation 3.1 with the h_t^L computed in the time dimension. Then the
expressions are expressed as follows:

f_L^t = σ(W_{fh}^t h_{L−1}^t + W_{fx}^t h_t^L + b_f^t),
i_L^t = σ(W_{ih}^t h_{L−1}^t + W_{ix}^t h_t^L + b_i^t),
c̃_L^t = tanh(W_{c̃h}^t h_{L−1}^t + W_{c̃x}^t h_t^L + b_{c̃}^t),
c_L^t = f_L^t · c_{L−1}^t + i_L^t · c̃_L^t,
o_L^t = σ(W_{oh}^t h_{L−1}^t + W_{ox}^t h_t^L + b_o^t),
h_L^t = o_L^t · tanh(c_L^t).                                  (3.8)
t
Kalchbrenner et al. (2015) found that the 2D grid LSTM outperformed the
stacked LSTM on the task of memorizing sequences of numbers (Zaremba
& Sutskever, 2014). The recurrent connections along the depth dimension
could improve the learning properties of grid LSTM. In order to lower the
computational complexity of the grid LSTM network, Li and Sainath (2017)
compared four grid LSTM variations and found that the frequency-block
grid LSTM reduced computation costs without loss of accuracy on a 12,500-
hour voice search task.

3.1.6 Convolutional LSTM Network. The fully connected LSTM layer con-
tains too much redundancy for spatial data (Sainath, Vinyals, Senior, & Sak,
2015). Therefore, to accomplish a spatiotemporal sequence forecasting prob-
lem, Shi et al. (2015) proposed the convolutional LSTM (ConvLSTM), which
had convolutional structures in the recurrent connections. Figure 15 shows
the recurrent structure of the ConvLSTM layer.

Figure 15: Recurrent structure of the ConvLSTM layer.
The ConvLSTM network uses the convolution operator to calculate the
future state of a certain cell, and then the future state is determined by the
inputs and past states of its local neighbors. The ConvLSTM cell can be ex-
pressed as follows:

ft = σ(Wfh ∗ ht−1 + Wfx ∗ xt + bf),
it = σ(Wih ∗ ht−1 + Wix ∗ xt + bi),
c̃t = tanh(Wc̃h ∗ ht−1 + Wc̃x ∗ xt + bc̃),
ct = ft · ct−1 + it · c̃t,
ot = σ(Woh ∗ ht−1 + Wox ∗ xt + bo),
ht = ot · tanh(ct),                                           (3.9)
where ‘∗’ is the convolution operator and ‘·’, as before, denotes the
Hadamard product.
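The following sketch applies equation 3.9 to single-channel feature maps, using SciPy's convolve2d for the convolutions; real ConvLSTM layers use multichannel convolutions and biases, which are omitted here for brevity. Kernel sizes and names are illustrative assumptions:

```python
import numpy as np
from scipy.signal import convolve2d

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv(a, k):
    # 'same' padding keeps the spatial size of the feature map unchanged.
    return convolve2d(a, k, mode="same")

def convlstm_step(x_t, h_prev, c_prev, K):
    # Equation 3.9 with single-channel maps: '*' is convolution, '.' is the
    # Hadamard (elementwise) product.
    f = sigma(conv(h_prev, K["fh"]) + conv(x_t, K["fx"]))
    i = sigma(conv(h_prev, K["ih"]) + conv(x_t, K["ix"]))
    c_tilde = np.tanh(conv(h_prev, K["ch"]) + conv(x_t, K["cx"]))
    c = f * c_prev + i * c_tilde
    o = sigma(conv(h_prev, K["oh"]) + conv(x_t, K["ox"]))
    return o * np.tanh(c), c

rng = np.random.default_rng(7)
H, W = 8, 8                                   # spatial size of the input maps
K = {k: rng.standard_normal((3, 3)) * 0.1     # one 3x3 kernel per connection
     for k in ["fh", "fx", "ih", "ix", "ch", "cx", "oh", "ox"]}

h = c = np.zeros((H, W))
for x_t in rng.standard_normal((5, H, W)):    # a short sequence of 2D frames
    h, c = convlstm_step(x_t, h, c, K)
print(h.shape)
```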
Wei, Zhou, Sankaranarayanan, Sengupta, and Samet (2018) adopted
the ConvLSTM network to solve tweet count prediction, a spatiotemporal
sequence forecasting problem. The results of experiments on the city of
Seattle showed that the proposed network consistently outperforms the

competitive baseline approaches: the autoregressive integrated moving
average model, ST-ResNet (Zhang, Zheng, & Qi, 2016), and Eyewitness
(Krumm & Horvitz, 2015). Zhu, Zhang, Shen, and Song (2017) combined
the 3D CNN and ConvLSTM to construct a multimodal gesture recogni-
tion model. The 3D CNN and ConvLSTM learned the short-term and long-
term spatiotemporal features of gestures, respectively. When verified on the
Sheffield Kinect Gesture data set and the ChaLearn LAP large-scale isolated

gesture data set, the results showed that the proposed method performs
better than other models. Liu, Zhou, Hang, and Yuan (2017) extended both
the ConvLSTM and bidirectional LSTM and presented the bidirectional–
convolutional LSTM architecture to learn spectral–spatial features from hy-
perspectral images.
3.1.7 Depth-gated LSTM Network. The stacked LSTM network is the sim-
plest way to construct DNN. However, Yao, Cohn, Vylomova, Duh, and
Dyer (2015) pointed out that the error signals in the stacked LSTM network
might be either diminished or exploded because the error signals from the
top have to be backpropagated through many layers of nonlinear trans-


formations. In order to solve this problem, they proposed the depth-gated
LSTM (DGLSTM), in which a depth gate was used to connect memory cells
of neighboring LSTM layers. An illustration of the DGLSTM architecture is
shown in Figure 16.
In the depth dimension, there is one additional connection between


memory cells in the DGLSTM compared with the stacked LSTM. This depth
gate controls the information flow between the memory cells of different
layers. The function of the depth gate at layer L at time t is defined as

d_t^L = σ(W_{xd}^L x_t^L + ω_{cd}^L · c_{t−1}^L + ω_{ld}^L · c_t^{L−1} + b_d^L),          (3.10)

where ω_{cd}^L and ω_{ld}^L are the weight vectors that relate the past memory cell
and the lower-layer memory cell, respectively. Moreover, ω_{ld}^L should be a
matrix instead of a vector if the memory cells in the lower and upper layers
have different dimensions. Therefore, in the DGLSTM, the memory cell at
layer L is computed as

c_t^L = i_t^L · tanh(W_{cx}^L x_t^L + W_{ch}^L h_{t−1}^L) + f_t^L · c_{t−1}^L + d_t^L · c_t^{L−1}.          (3.11)
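The sketch below computes the depth gate of equation 3.10 and the modified cell-state update of equation 3.11 for one layer. The input and forget gates are computed as in equation 2.3, and all names, sizes, and initializations are illustrative assumptions:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def dglstm_cell_update(x_t, h_prev, c_prev_time, c_below, p):
    """Depth-gated cell-state update (equations 3.10 and 3.11) for one layer.
    c_prev_time is c_{t-1}^L and c_below is c_t^{L-1} from the layer below."""
    i = sigma(p["Wih"] @ h_prev + p["Wix"] @ x_t + p["bi"])
    f = sigma(p["Wfh"] @ h_prev + p["Wfx"] @ x_t + p["bf"])
    # Equation 3.10: the depth gate looks at the input, the past cell of this
    # layer, and the cell of the layer below (elementwise weight vectors).
    d = sigma(p["Wxd"] @ x_t + p["wcd"] * c_prev_time + p["wld"] * c_below + p["bd"])
    # Equation 3.11: the lower-layer cell flows in through the depth gate.
    c = i * np.tanh(p["Wcx"] @ x_t + p["Wch"] @ h_prev) + f * c_prev_time + d * c_below
    return c

rng = np.random.default_rng(8)
nx, nh = 3, 4
p = {k: rng.standard_normal((nh, nh)) * 0.1 for k in ["Wih", "Wfh", "Wch"]}
p.update({k: rng.standard_normal((nh, nx)) * 0.1 for k in ["Wix", "Wfx", "Wcx", "Wxd"]})
p.update({k: rng.standard_normal(nh) * 0.1 for k in ["wcd", "wld"]})
p.update({k: np.zeros(nh) for k in ["bi", "bf", "bd"]})

c = dglstm_cell_update(rng.standard_normal(nx), np.zeros(nh),
                       np.zeros(nh), np.ones(nh), p)
print(c)
```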
Figure 16: An illustration of DGLSTM.
The DGLSTM architecture is inspired by the highway network (Kim,
El-Khamy, & Lee, 2017; Srivastava, Greff, & Schmidhuber, 2015) and grid
LSTM (Kalchbrenner et al., 2015). Yao et al. (2015) used the varied depth of
DGLSTMs to accomplish the Chinese-to-English machine translation task.
The results showed that the performance of the DGLSTM network is always
better than that of the stacked LSTM network. Zhang et al. (2015) proposed
efficient algorithms to train the DGLSTM network using both frame and
sequence discriminative criteria and achieved even better improvements.
3.1.8 Gated-Feedback LSTM Network. To deal with the problem of learning


multiple adaptive timescales, Chung, Gulcehre, Cho, & Bengio (2015) pro-
posed the gated-feedback RNN (GF-RNN) and the gated-feedback LSTM
(GF-LSTM) network. For each pair of layers, the GF-RNN uses a global gat-
ing unit to allow and control signals flowing from upper recurrent layers to
lower layers, which are called gated-feedback connections. Figure 17 shows
the architecture of the GF-RNN.
Figure 17: Architecture of the GF-RNN.

In the GF-RNN, the recurrent connections between two units are gated
by a logistic unit, which is called a global reset gate. The signals from the
ith layer h_{t−1}^i to the jth layer h_t^j are determined by the global reset gate and
can be computed as follows:

g^{i→j} = σ(W_g^{j−1→j} h_t^{j−1} + U_g^{i→j} h_{t−1}^i),          (3.12)
where W_g^{j−1→j} and U_g^{i→j} are the weight vectors for the input and for the
hidden activations of the ith layer at time step t − 1, respectively, and L is the
number of hidden layers. For j = 1, h_t^{j−1} is the input x_t. In the GF-LSTM
network, the unit-wise gates, such as the input gate, forget gate, and output
gate, are not modified by the global reset gate. However, the cell state c_t in
equation 2.3 is controlled by it. At layer j, the memory content is computed by

c̃_t^j = tanh(W_c^{j−1→j} h_t^{j−1} + Σ_{i=1}^{L} g^{i→j} U_c^{i→j} h_{t−1}^i + b_{c̃}).          (3.13)
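The sketch below computes the memory content of one layer j following equations 3.12 and 3.13, treating each global reset gate as a scalar produced by dot products with the weight vectors described above. The container names, sizes, and initializations are illustrative assumptions:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_feedback_candidate(h_below, h_prev_layers, j, Wc, Uc, wg, ug, bc):
    """Memory content of layer j in a GF-LSTM (equations 3.12 and 3.13).
    h_below is h_t^{j-1}; h_prev_layers[i] is h_{t-1} of hidden layer i."""
    total = Wc @ h_below
    for i, h_prev in enumerate(h_prev_layers):
        # Equation 3.12: scalar global reset gate from layer i to layer j.
        g = sigma(wg[j] @ h_below + ug[(i, j)] @ h_prev)
        # Equation 3.13: the gated feedback from every layer enters c_tilde.
        total = total + g * (Uc[(i, j)] @ h_prev)
    return np.tanh(total + bc)

rng = np.random.default_rng(9)
nh, L, j = 4, 3, 1                       # 3 hidden layers, update layer j = 1
h_below = rng.standard_normal(nh)        # h_t^{j-1}
h_prev_layers = [rng.standard_normal(nh) for _ in range(L)]
Wc = rng.standard_normal((nh, nh)) * 0.1
Uc = {(i, j): rng.standard_normal((nh, nh)) * 0.1 for i in range(L)}
wg = {j: rng.standard_normal(nh) * 0.1}
ug = {(i, j): rng.standard_normal(nh) * 0.1 for i in range(L)}

c_tilde = gated_feedback_candidate(h_below, h_prev_layers, j, Wc, Uc, wg, ug,
                                   np.zeros(nh))
print(c_tilde)
```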
The proposed GF-RNN was evaluated against the conventional stacked


RNN on the task of Python program evaluation (Koutnik, Greff, Gomez, &
Schmidhuber, 2014), and the results showed that the GF-RNN outperforms
the stacked RNN.

3.1.9 Tree-Structured LSTM Network. Most of the existing LSTM net-


works are chain-structured. These networks perform well in machine
translation and speech recognition. However, when these chain-structured
networks need to combine words and phrases in natural language process-
ing, they exhibit poor properties (Li, Luong, Dan, & Hovy, 2015; Zhang, Lu,
& Lapata, 2015). Therefore, Zhu, Sobhani, and Guo (2015) and Tai, Socher,
and Manning (2015) extended the chain-structured LSTM networks to tree-
structured networks, based on the original tree-structured recursive models
(Goller & Kuchler, 1996; Sperduti & Starita, 1997; Francesconi et al., 1997;
Frasconi, Gori, & Sperduti, 1998). Figure 18 shows a tree-structured LSTM
network.
Figure 18: Tree-structured LSTM network. The short line and small circle at
each arrowhead indicate a block or pass of information, respectively. In the real
model, the gating is a soft version of gating.
The cell state of the tree-structured LSTM depends on the states of its
child units. The updating of the LSTM cell can be expressed as follows:

i_t = σ(W_{hi}^L h_{t−1}^L + W_{hi}^R h_{t−1}^R + W_{ci}^L c_{t−1}^L + W_{ci}^R c_{t−1}^R + b_i),
f_t^L = σ(W_{hf_l}^L h_{t−1}^L + W_{hf_l}^R h_{t−1}^R + W_{cf_l}^L c_{t−1}^L + W_{cf_l}^R c_{t−1}^R + b_{f_l}),
f_t^R = σ(W_{hf_r}^L h_{t−1}^L + W_{hf_r}^R h_{t−1}^R + W_{cf_r}^L c_{t−1}^L + W_{cf_r}^R c_{t−1}^R + b_{f_r}),
c̃_t = tanh(W_{hc̃}^L h_{t−1}^L + W_{hc̃}^R h_{t−1}^R + b_{c̃}),
c_t = f_t^L · c_{t−1}^L + f_t^R · c_{t−1}^R + i_t · c̃_t,
o_t = σ(W_{ho}^L h_{t−1}^L + W_{ho}^R h_{t−1}^R + W_{co} c_t + b_o),
h_t = o_t · tanh(c_t),                                        (3.14)

where f^L and f^R are the left and right forget gates, respectively. At time
t, the inputs of the node LSTM cell are the hidden vectors of its two children,
which are denoted as h_{t−1}^L and h_{t−1}^R.
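A minimal sketch of the binary tree-structured node update of equation 3.14, composing a parent from its left and right children; parameter names, sizes, and initialization are illustrative assumptions:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def tree_lstm_node(hL, cL, hR, cR, p):
    """Compose a parent node from its left and right children (equation 3.14)."""
    i = sigma(p["WhiL"] @ hL + p["WhiR"] @ hR + p["WciL"] @ cL + p["WciR"] @ cR + p["bi"])
    fL = sigma(p["WhflL"] @ hL + p["WhflR"] @ hR + p["WcflL"] @ cL + p["WcflR"] @ cR + p["bfl"])
    fR = sigma(p["WhfrL"] @ hL + p["WhfrR"] @ hR + p["WcfrL"] @ cL + p["WcfrR"] @ cR + p["bfr"])
    c_tilde = np.tanh(p["WhcL"] @ hL + p["WhcR"] @ hR + p["bc"])
    c = fL * cL + fR * cR + i * c_tilde          # each child has its own forget gate
    o = sigma(p["WhoL"] @ hL + p["WhoR"] @ hR + p["Wco"] @ c + p["bo"])
    return o * np.tanh(c), c

rng = np.random.default_rng(10)
nh = 4
mat_names = ["WhiL", "WhiR", "WciL", "WciR", "WhflL", "WhflR", "WcflL", "WcflR",
             "WhfrL", "WhfrR", "WcfrL", "WcfrR", "WhcL", "WhcR", "WhoL", "WhoR", "Wco"]
p = {k: rng.standard_normal((nh, nh)) * 0.1 for k in mat_names}
p.update({k: np.zeros(nh) for k in ["bi", "bfl", "bfr", "bc", "bo"]})

# Two leaf children combined into one parent node.
hL, cL = rng.standard_normal(nh), rng.standard_normal(nh)
hR, cR = rng.standard_normal(nh), rng.standard_normal(nh)
h_parent, c_parent = tree_lstm_node(hL, cL, hR, cR, p)
print(h_parent)
```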
Zhu et al. (2015) proved that the tree-structured LSTM network out-
performed other recursive models in learning distributed sentiment rep-
resentations for texts. Tai et al. (2015) proposed two tree-structured LSTM
architectures: the Child-Sum Tree-LSTM, and the N-ary Tree-LSTM. The


tree nodes in both models are dependent on multiple child units. The Child-
Sum Tree-LSTM adopts the sum of child hidden states hk to update the

cell. However, the N-ary Tree-LSTM introduces separate parameter matri-

ces for each child and is able to learn more finely grained information from
its children. Teng and Zhang (2016) further extended the tree-structured

LSTM to the bidirectional tree-structured LSTM with head lexicaliza-
tion. The general tree-structured LSTM network constitutes trees in the
bottom-up direction. Therefore, only the leaf nodes can use the input word
information. In order to propagate head words from leaf nodes to every con-
stituent node, Teng and Zhang (2016) proposed the automatic head lexical-
ization for a general tree-structured LSTM network, and built a tree LSTM
in the top-down direction. Miwa and Bansal (2016) further stacked a bidi-

rectional tree-structured LSTM network on a bidirectional sequential LSTM
network to accomplish end-to-end relation extraction. Additionally, Niu,
Zhou, Wang, Gao, and Hua (2017) extended the tree-structured LSTM to
the hierarchical multimodal LSTM (HM-LSTM) for the problem of dense
visual-semantic embedding. Since the HM-LSTM could exploit the hierar-
chical relations between whole images and image regions and between sen-
tences and phrases, this model outperformed other methods on Flickr8K
(Graves, 2014), Flickr30K (Plummer et al., 2017), and MS-COCO (Chen et al.,
2015; Lin et al., 2014) data sets.
3.1.10 Other Networks


Coupled LSTM Network. Inspired by the grid LSTM and the multidimen-
sional RNN, Liu, Qiu, and Huang (2016) proposed two coupled LSTMs ar-
chitectures, the loosely coupled LSTM (LC-LSTM) and the tightly coupled
LSTM (TC-LSTM), to model the interactions of sentence pairs. These two
coupled LSTM networks model the strong interactions of two sentences


through the connections between hidden states.
Deep Fusion LSTM Networks. Liu, Qiu, Chen, and Huang (2016) adopted
a deep fusion LSTM (DF-LSTM) network to extract the strong interaction
of text pairs. This network consisted of two interdependent LSTMs. In or-
der to capture more complicated matching patterns, the authors then used
external memory to enhance the memory capacity of LSTMs. They gave
a qualitative analysis for the model capacity and demonstrated the model
efficacy on the Stanford Natural Language Inference Corpus.
Multiplicative LSTM Network. The weight matrices of the usual LSTM
networks are fixed for different inputs. These networks are inexpressive
when dealing with sequences of discrete mutually exclusive elements.
Therefore, Krause, Lu, Murray, and Renals (2016) combined the LSTM
with the multiplicative RNN architecture and obtained the multiplicative
LSTM, which possessed flexible input-dependent weight matrices. In a se-
ries of character-level language modeling tasks, this model outperformed
the stacked LSTM.
Nested LSTMs Network. The nested LSTM (NLSTM) network was pro-
posed by Moniz and Krueger (2018). They adopted nesting to add the depth
of the network. In the NLSTM network, the memory cell of the outer LSTM

cell is computed by the inner LSTM cell, and c_t^outer = h_t^inner. In the task
of character-level language modeling, the NLSTM network is a promising
alternative to the stacked architecture, and it outperforms both stacked and
single-layer LSTM networks with similar numbers of parameters.

3.2 Integrated LSTM Networks. Sometimes the capacity of a single


LSTM network cannot meet the practical engineering requirements, and
thus there are some integrated LSTM networks that consist of an LSTM
layer/network and other components, such as a convolution neural net-
work and an external memory unit. These integrated LSTM networks take

advantage of different components.

3.2.1 Neural Turing Machine. Because the RNN needs to update the cur-
rent cell states by inputs and previous states, the memory capacity is the
essential feature of the RNN. In order to enhance the memory capacity of
networks, neural networks have been adopted to learn to control end-to-
end differentiable stack machines (Sun, 1990; Mozer, & Das, 1993). Schmid-
huber (1993) and Schlag and Schmidhuber (2017) extended the above
structure, and used the RNN to learn and control end-to-end-differentiable
fast weight memory. Furthermore, Graves, Wayne, and Danihelka (2014)
proposed the neural Turing machine (NTM) with an LSTM controller,


which took inspiration from both the design of digital computers and the
biological working memory models.
In Figure 19, the dashed line displays the division between the NTM
and the external environment. The RNN receives inputs and emits outputs
during each update cycle. At the same time, the RNN also exchanges infor-
mation with the memory matrix via the read and write heads. The math-
ematical expression of the neural network is decided by the type of the
selected RNN. Graves et al. (2014) selected the LSTM cell to construct the
neural network and compared the performance of this NTM and a standard
LSTM network. The results showed that NTM possessed advantages over
the LSTM network. Xie and Shi (2018) improved the training speed by de-
signing a new read–write mechanism for the NTM, which used convolution
operations to extract global memory features.

Figure 19: Schematic of the NTM with an LSTM controller.

3.2.2 DBN-LSTM Network. This network is the combination of a deep be-


lief network and LSTM (DBN-LSTM), and was introduced by Vohra, Goel,
and Sahoo (2015). It is an extension of the RNN-DBN (Goel, Vohra, & Sa-
hoo, 2014). In the amalgamation of the DBN and LSTM, the multilayer DBN
helps in high-level representation of data, while the LSTM network pro-
vides temporal information. Compared with the RNN-DBN, the most no-
table improvement of DBN-LSTM is that it replaces the standard recurrent
cell with the LSTM cell, which ensures that the model can deal with a longer
time duration.

3.2.3 Multiscale LSTM Network. Cheng et al. (2016) combined a prepro-



cessing block and LSTM network to form the multiscale LSTM (MS-LSTM)
network. The preprocessing block helps select a proper timescale for the in-
put data, and the LSTM layer is adopted to model the processed sequential
data. In processing dynamic Internet traffic, Cheng et al. (2016) used this
model to learn the Internet traffic pattern in a flexible time window. Peng,
Zhang, Liang, Liu, and Lin (2016) combined the MS-LSTM with a CNN and
constructed a recurrent architecture for geometric scene parsing.

3.2.4 CFCC-LSTM Network. To predict sea surface temperature, Yang


et al. (2017) proposed a CFCC-LSTM network, which was composed of a
fully connected LSTM (FC-LSTM) and a CNN. They first introduced a 3D
grid to preprocess the information and then adopted an FC-LSTM layer to
predict the temporal information in the 3D grid and a convolution oper-
ation to handle the spatial information. The results on two data sets (the
China Ocean data set and the Bohai Sea data set) showed the effectiveness
of this model.

3.2.5 C-LSTM Network. The C-LSTM network is also a combination of a


CNN and LSTM network. It was proposed by Zhou (2016). This network
can extract a sequence of higher-level phrase representations by the CNN,
and then this information is fed into the LSTM network to obtain the sen-
tence representation in the task of sentence and document modeling.
3.2.6 LSTM-in-LSTM Network. Song, Tang, Xiao, Wu, and Zhang (2016)
combined a Deep-CNN network and an LSTM-in-LSTM architecture to
generate rich, finely grained textual descriptions of images. The LSTM-in-

LSTM architecture consists of an inner LSTM cell and an outer LSTM cell.
This architecture can learn the contextual interactions between visual cues
and can thus predict long sentence descriptions.

4 Conclusion

We have systematically reviewed various LSTM cell variants and LSTM net-
works. The LSTM cell is the basic node of LSTM networks. Different vari-
ants outperform the standard LSTM cell on some characteristics and tasks.
For example, the GRU has a small number of cell parameters. However,

there is no variant that can surpass the standard LSTM cell in all aspects.
As for LSTM networks, there are two major categories: LSTM-dominated
neural networks and integrated LSTM networks. In order to enhance the
network properties of some specific tasks, LSTM-dominated networks fo-
cus on optimizing the connections between inner LSTM cells. Integrated
LSTM networks mainly pay attention to integrating the advantageous fea-
tures of different components, such as the convolution neural network and
external memory unit, when dealing with the target task. Current LSTM
models have acquired incredible success on numerous tasks. Neverthe-
less, there are still some directions to augment RNNs with more powerful
properties.
First, more efficient recurrent cells should be explored through explicit
knowledge. The recurrent cells are the basic nodes, and the properties of the
networks depend on recurrent cells to some extent. However, the current
studies on recurrent cells are mainly empirical explorations. The handling
capacity of a specific cell structure is ambiguous. Therefore, the relationship


between the cell’s handling capacity and data structure needs to be clarified.
Second, RNNs should be combined with external memory to strengthen
their memory capacity, such as in NTM and differentiable neural computer
(Graves et al., 2016). Differing from the CNN and DNN, the RNNs require
memory capacity because they need to update the cell states by inputs and
previous states. However, neural networks are not good at storing informa-
tion. The external memory could greatly increase the memory capacity of
RNNs and help them handle the long-term dependencies.
Finally, adopting adaptive computation time is an interesting direction.
At present, for each time step, the amount of computation in current RNNs
is the same. This is no problem when tasks have the same degree of diffi-
culty. However, more time should be spent on individual tasks of higher
difficulty, and researchers should allow the RNNs to execute multiple steps
of computation during each time step based on the degree of difficulty of
the task. Self-delimiting (SLIM) RNN (Schmidhuber, 2012) is a typical net-
work, and SLIM RNN could decide during the computation which parts of
itself are used at all, and for how long, learning to control time through a
“halt unit.” For each time step, the amount of computation in adaptive com-
putation time RNNs may be different, and they could deal with the tasks

with different difficulties well.

Acknowledgments

We thank LetPub (www.letpub.com) for its linguistic assistance during the
preparation of this review. This work was supported by the National Natural
Science Foundation of China (grants 61773386, 61573365, 61573366,
61573076), the Young Elite Scientists Sponsorship Program of China Asso-
ciation for Science and Technology (grant 2016QNRC001), and the National
Key R&D Program of China (grant 2018YFB1306100).

References
Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., & Süsstrunk, S. (2010). SLIC super-
pixels. https://www.researchgate.net/publication/44234783_SLIC_superpixels
Altché, F., & Fortelle, A. D. L. (2017). An LSTM network for highway trajectory pre-
diction. In Proceedings of the IEEE 20th International Conference on Intelligent Trans-
portation Systems. Piscataway, NJ: IEEE.
Bengio, Y. (2009). Learning deep Architectures for AI. Foundations and Trends in Ma-
chine Learning, 2(1), 1–127.
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with
gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.
Brahma, S. (2018). Suffix bidirectional long short-term memory. http://www
.researchgate.net/publication/325262895_Suffix_Bidirectional_Long_Short-Term
_Memory
Britz, D., Goldie, A., Luong, M. T., & Le, Q. (2017). Massive exploration of neural machine
translation architectures. arXiv:1703.03906.
Brown, B., Yu, X., & Garverick, S. (2004). Mixed-mode analog VLSI continuous-time
recurrent neural network. In Circuits, Signals, and Systems: IASTED International
Conference Proceedings.
Carrio, A., Sampedro, C., Rodriguez-Ramos, A., & Campoy, P. (2017). A review of
deep learning methods and applications for unmanned aerial Vehicles. Journal of
Sensors, 2, 1–13.
Chen, T. B., & Soo, V. W. (1996). A comparative study of recurrent neural network
architectures on learning temporal sequences. In Proceedings of the IEEE Interna-
tional Conference on Neural Networks. Piscataway, NJ: IEEE.
Chen, X., Fang, H., Lin, T. Y., Vedantam, R., Gupta, S., Dollar, P., & Zitnick,
C. L. (2015). Microsoft COCO captions: Data collection and evaluation server.
arXiv:1504.00325v2.
Chen, X., Mottaghi, R., Liu, X., Fidler, S., Urtasun, R., & Yuille, A. (2014). Detect what
you can: Detecting and representing objects using holistic models and body parts.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Piscataway, NJ: IEEE.
Cheng, M., Xu, Q., Lv, J., Liu, W., Li, Q., & Wang, J. (2016). MS-LSTM: A multi-scale
LSTM model for BGP anomaly detection. In Proceedings of the IEEE 24th Interna-
tional Conference on Network Protocols. Piscataway, NJ: IEEE.

Cho, K., Van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H.,

& Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder
for statistical machine translation. arXiv:1406.1078v3.

Chung, J., Gulcehre, C., Cho, K. H., & Bengio, Y. (2014). Empirical evaluation of gated
recurrent neural networks on sequence modeling. arXiv:1412.3555v1.
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2015). Gated feedback recurrent neural
networks. arXiv:1502.02367v1.
Deng, L. (2013). Three classes of deep learning architectures and their applications:
A tutorial survey. In APSIPA transactions on signal and information processing. Cam-
bridge: Cambridge University Press.

Dey, R., & Salem, F. M. (2017). Gate-variants of gated recurrent unit (GRU) neural net-
works. In Proceedings of the IEEE International Midwest Symposium on Circuits and
Systems. Piscataway, NJ: IEEE.
Du, X., Zhang, H., Nguyen, H. V., & Han, Z. (2017). Stacked LSTM deep learning
model for traffic prediction in vehicle-to-vehicle communication. In Proceedings
of the IEEE Vehicular Technology Conference. Piscataway, NJ: IEEE.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
Fernández, S., Graves, A., & Schmidhuber, J. (2007a). Sequence labelling in struc-
tured domains with hierarchical recurrent neural networks. In Proceedings of the
20th International Joint Conference on Artificial Intelligence. San Mateo, CA: Morgan
Kaufmann.
Fernández, S., Graves, A., & Schmidhuber, J. (2007b). An application of recurrent
neural networks to discriminative keyword spotting. In Proceedings of the Interna-
tional Conference on Artificial Neural Networks (pp. 220–229). Berlin: Springer.
Francesconi, E., Frasconi, P., Gori, M., Marinai, S., Sheng, J. Q., Soda, G., & Sperduti,
A. (1997). Logo recognition by recursive neural networks. In Proceedings of the
International Workshop on Graphics Recognition (pp. 104–117). Berlin: Springer.
Frasconi, P., Gori, M., & Sperduti, A. (1998). A general framework for adaptive pro-
cessing of data structures. IEEE Transactions on Neural Networks, 9(5), 768–786.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a
mechanism of pattern recognition unaffected by shift in position. Biological Cy-
bernetics, 36(4), 193–202.
Gallagher, J. C., Boddhu, S. K., & Vigraham, S. (2005). A reconfigurable continuous
time recurrent neural network for evolvable hardware applications. In Proceed-
ings of the 2005 IEEE Congress on Evolutionary Computation. Piscataway, NJ: IEEE.
Gers, F. (2001). Long short-term memory in recurrent neural networks. PhD diss., Beuth
Hochschule für Technik Berlin.
Gers, F. A., & Schmidhuber, J. (2000). Recurrent nets that time and count. In Pro-
ceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks.
Piscataway, NJ: IEEE.
Gers, F. A., & Schmidhuber, J. (2001). LSTM recurrent networks learn simple context-
free and context-sensitive languages. IEEE Transactions on Neural Networks, 12(6), 1333–1340.
Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual
prediction with LSTM. Neural Computation, 12(10), 2451–2471.
Gers, F. A., Schraudolph, N. N., & Schmidhuber, J. (2002). Learning precise timing with LSTM recur-
rent networks. Journal of Machine Learning Research, 3, 115–143.
Goel, K., Vohra, R., & Sahoo, J. K. (2014). Polyphonic music generation by model-
ing temporal dependencies using a RNN-DBN. In Proceedings of the International
Conference on Artificial Neural Networks. Berlin: Springer.
Goller, C., & Kuchler, A. (1996). Learning task-dependent distributed representa-
tions by backpropagation through structure. Neural Networks, 1, 347–352.
Graves, A. (2012). Supervised sequence labelling with recurrent neural networks. Berlin:
Springer.
Graves, A. (2014). Generating sequences with recurrent neural networks. arXiv:1308.
0850v5.
Graves, A., Fernández, S., & Schmidhuber, J. (2007). Multi-dimensional recurrent neural
networks. In Proceedings of the International Conference on Artificial Neural Networks.
Berlin: Springer.
Graves, A., Wayne, G., & Danihelka, I. (2014). Neural Turing machines. arXiv:1410.
5401v2.
Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., & Agapiou, J. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), 471–476.
Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidi-
rectional LSTM and other neural network architectures. Neural Networks, 18(5),
602–610.
Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., & Schmidhuber, J. (2016).
LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning
Systems, 28(10), 2222–2232.
Guo, Y., Liu, Y., Georgiou, T., & Lew, M. S. (2017). A review of semantic segmenta-
tion using deep neural networks. International Journal of Multimedia Information
Retrieval, 2, 1–7.
Han, X., Wu, Z., Jiang, Y. G., & Davis, L. S. (2017). Learning fashion compatibility
with bidirectional LSTMs. In Proceedings of the 2017 ACM on Multimedia Confer-
ence. New York: ACM.
He, T., & Droppo, J. (2016). Exploiting LSTM structure in deep neural networks for
speech recognition. In Proceedings of the IEEE International Conference on Acoustics,
Speech and Signal Processing (pp. 5445–5449). Piscataway, NJ: IEEE.
Heck, J. C., & Salem, F. M. (2017). Simplified minimal gated unit variations for recur-
rent neural networks. In Proceedings of the IEEE International Midwest Symposium
on Circuits and Systems (pp. 1593–1596). Piscataway, NJ: IEEE.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. PhD diss.,
Technische Universität München.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computa-
tion, 9(8), 1735–1780.
Hsu, W. N., Zhang, Y., Lee, A., & Glass, J. (2016). Exploiting depth and highway con-
nections in convolutional recurrent deep neural networks for speech recognition.
Cell, 50(1), 395–399.
Irie, K., Tüske, Z., Alkhouli, T., Schlüter, R., & Ney, H. (2016). LSTM, GRU, highway
and a bit of attention: An empirical overview for language modeling in speech
recognition. In Proceedings of the INTERSPEECH (pp. 3519–3523). Red Hook, NY:
Curran.
Ivakhnenko, A. G. (1971). Polynomial theory of complex systems. IEEE Transactions
on Systems, Man and Cybernetics, 4, 364–378.
Ivakhnenko, A. G., & Lapa, V. G. (1965). Cybernetic predicting devices. Sacramento,
CA: CCM Information Corporation.
Jing, L., Gulcehre, C., Peurifoy, J., Shen, Y., Tegmark, M., Soljačić, M., & Bengio, Y.
(2017). Gated orthogonal recurrent units: On learning to forget. arXiv:1706.02761.
Jordan, M. (1986). Attractor dynamics and parallelism in a connectionist sequential
machine. In Proceedings of the Annual Conference of the Cognitive Science Society (pp.
531–546). Piscataway, NJ: IEEE.
Jozefowicz, R., Zaremba, W., & Sutskever, I. (2015). An empirical exploration of re-
current network architectures. In Proceedings of the International Conference on In-
ternational Conference on Machine Learning (pp. 2342–2350). New York: ACM.
Kalchbrenner, N., Danihelka, I., & Graves, A. (2015). Grid long short-term memory.
arXiv:1507.01526.
Karpathy, A., Johnson, J., & Li, F. F. (2015). Visualizing and understanding recurrent
networks. arXiv:1506.02078.
Khan, S., & Yairi, T. (2018). A review on the application of deep learning in system
health management. Mechanical Systems and Signal Processing, 107, 241–265.
Kim, J., El-Khamy, M., & Lee, J. (2017). Residual LSTM: Design of a deep recurrent ar-
chitecture for distant speech recognition. arXiv:1701.03360.
Koutnik, J., Greff, K., Gomez, F., & Schmidhuber, J. (2014). A clockwork RNN.
arXiv:1402.3511v1.
Krause, B., Lu, L., Murray, I., & Renals, S. (2016). Multiplicative LSTM for sequence
modelling. arXiv:1609.07959.
Krumm, J., & Horvitz, E. (2015). Eyewitness: Identifying local events via space-time
signals in Twitter feeds. In Proceedings of the 23rd SIGSPATIAL International Con-
ference on Advances in Geographic Information Systems. New York: ACM.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., &
Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recogni-
tion. Neural Computation, 1(4), 541–551.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning ap-
plied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Li, B., & Sainath, T. N. (2017). Reducing the computational complexity of two-
dimensional LSTMs. In Proceedings of the INTERSPEECH (pp. 964–968). Red
Hook, NY: Curran.
Li, J., Luong, M. T., Jurafsky, D., & Hovy, E. (2015). When are tree structures necessary for
deep learning of representations? arXiv:1503.00185v5.
Li, J., Mohamed, A., Zweig, G., & Gong, Y. (2016a). Exploring multidimensional
LSTMs for large vocabulary ASR. In Proceedings of the IEEE International Con-
ference on Acoustics, Speech and Signal Processing (pp. 4940–4944). Piscataway, NJ:
IEEE.
Li, J., Mohamed, A., Zweig, G., & Gong, Y. (2016b). LSTM time and frequency re-
currence for automatic speech recognition. In Proceedings of the Conference on
Automatic Speech Recognition and Understanding (pp. 187–191). Piscataway, NJ:
IEEE.
Li, S., Li, W., Cook, C., Zhu, C., & Gao, Y. (2018). Independently recurrent neural
network: Building a longer and deeper RNN. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (pp. 5457–5466). Piscataway, NJ: IEEE.
Liang, X., Lin, L., Shen, X., Feng, J., Yan, S., & Xing, E. P. (2017). Interpretable
structure-evolving LSTM. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (pp. 2171–2184). Piscataway, NJ: IEEE.
Liang, X., Liu, S., Shen, X., Yang, J., Liu, L., Dong, J., & Yan, S. (2015). Deep human
parsing with active template regression. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 37(12), 2402.
Liang, X., Shen, X., Feng, J., Lin, L., & Yan, S. (2016). Semantic object parsing with
graph LSTM. In Proceedings of the European Conference on Computer Vision (pp. 125–
143). Berlin: Springer.
Liang, X., Shen, X., Xiang, D., Feng, J., Lin, L., & Yan, S. (2016). Semantic object pars-
ing with local-global long short-term memory. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (pp. 3185–3193). Piscataway, NJ: IEEE.
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., & Zitnick, C. L.
(2014). Microsoft COCO: Common objects in context. In Proceedings of the European
Conference on Computer Vision (pp. 740–755). Berlin: Springer.
Lipton, Z. C., Berkowitz, J., & Elkan, C. (2015). A critical review of recurrent neural
networks for sequence learning. arXiv:1506.00019.
Liu, P., Qiu, X., Chen, J., & Huang, X. (2016). Deep fusion LSTMs for text semantic
matching. In Proceedings of the 54th Annual Meeting of the Association for Computa-
tional Linguistics (pp. 1034–1043). Stroudsburg, PA: ACL.
Liu, P., Qiu, X., & Huang, X. (2016). Modelling interaction of sentence pair with coupled-
LSTMs. arXiv:1605.05573.
Liu, Q., Zhou, F., Hang, R., & Yuan, X. (2017). Bidirectional-convolutional LSTM
based spectral-spatial feature learning for hyperspectral image classification. Re-
mote Sensing, 9(12), 1330–1339.
Mallinar, N., & Rosset, C. (2018). Deep canonically correlated LSTMs. arXiv:1801.05407.
McCarter, G., & Storkey, A. (2007). Air freight image segmentation database. http://
homepages.inf.ed.ac.uk/amos/afreightdata.html
Miwa, M., & Bansal, M. (2016). End-to-end relation extraction using LSTMs on sequences
and tree structures. arXiv:1601.00770.
Moniz, J. R. A., & Krueger, D. (2018). Nested LSTMs. arXiv:1801.10308.
Mozer, M. C., & Das, S. (1993). A connectionist symbol manipulator that discov-
ers the structure of context-free languages. In S. J. Hanson, J. D. Cowan, & C.
L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 863–870).
Cambridge, MA: MIT Press.
Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann
machines. In Proceedings of the International Conference on International Conference
on Machine Learning (pp. 807–814). Madison, WI: Omnipress.
Neil, D., Pfeiffer, M., & Liu, S. C. (2016). Phased LSTM: Accelerating recurrent net-
work training for long or event-based sequences. In D. D. Lee, U. von Luxburg, R.
Garnett, M. Sugiyama, & I. Guyon (Eds.), Advances in neural information processing
systems, 29 (pp. 3882–3890). Red Hook, NY: Curran.
Nie, Y., An, C., Huang, J., Yan, Z., & Han, Y. (2016). A bidirectional LSTM model for
question title and body analysis in question answering. In IEEE First International
Conference on Data Science in Cyberspace (pp. 307–311). Piscataway, NJ: IEEE.
Nina, O., & Rodriguez, A. (2016). Simplified LSTM unit and search space prob-
ability exploration for image description. In Proceedings of the International
Conference on Information, Communications and Signal Processing (pp. 1–5). Piscat-
away, NJ: IEEE.
Niu, Z., Zhou, M., Wang, L., Gao, X., & Hua, G. (2017). Hierarchical multimodal
LSTM for dense visual-semantic embedding. In Proceedings of the IEEE Interna-
tional Conference on Computer Vision (pp. 1899–1907). Piscataway, NJ: IEEE.
Oord, A. V. D., Kalchbrenner, N., & Kavukcuoglu, K. (2016). Pixel recurrent neural
networks. arXiv:1601.06759.
Palangi, H., Deng, L., Shen, Y., Gao, J., He, X., Chen, J., & Ward, R. (2015). Deep
sentence embedding using the long short-term memory network: Analysis and
application to information retrieval. IEEE/ACM Transactions on Audio Speech and
Language Processing, 24(4), 694–707.
Pearlmutter, B. A. (1989). Learning state space trajectories in recurrent neural net-
works. Neural Computation, 1(2), 263–269.
Peng, Z., Zhang, R., Liang, X., Liu, X., & Lin, L. (2016). Geometric scene parsing with
hierarchical LSTM. In Proceedings of the International Joint Conference on Artificial
Intelligence (pp. 3439–3445). Palo Alto, CA: AAAI Press.
Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., & Lazeb-
nik, S. (2017). Flickr30k entities: Collecting region-to-phrase correspondences for
richer image-to-sentence models. International Journal of Computer Vision, 123(1),
74–93.
Pulver, A., & Lyu, S. (2017). LSTM with working memory. In Proceedings of the Inter-
national Joint Conference on Neural Networks (pp. 845–851). Piscataway, NJ: IEEE.
Qu, Z., Haghani, P., Weinstein, E., & Moreno, P. (2017). Syllable-based acoustic mod-
eling with CTC-SMBR-LSTM. In Proceedings of the IEEE Automatic Speech Recogni-
tion and Understanding Workshop (pp. 173–177). Piscataway, NJ: IEEE.
Rahman, L., Mohammed, N., & Azad, A. K. A. (2017). A new LSTM model by in-
troducing biological cell state. In Proceedings of the International Conference on Elec-
trical Engineering and Information Communication Technology (pp. 1–6). Piscataway,
NJ: IEEE.
Ranzato, M. A., Szlam, A., Bruna, J., Mathieu, M., Collobert, R., & Chopra, S.
(2014). Video (language) modeling: A baseline for generative models of natural videos.
arXiv:1412.6604.
Rawat, W., & Wang, Z. (2017). Deep convolutional neural networks for image clas-
sification: A comprehensive review. Neural Computation, 29(9), 1–10.
Robinson, A. J., & Fallside, F. (1987). The utility driven dynamic error propagation net-
work. Cambridge: University of Cambridge Department of Engineering.
Sainath, T. N., Vinyals, O., Senior, A., & Sak, H. (2015). Convolutional, long short-
term memory, fully connected deep neural networks. In IEEE International Con-
ference on Acoustics, Speech and Signal Processing (pp. 4580–4584). Piscataway, NJ:
IEEE.
Sak, H., Senior, A., & Beaufays, F. (2014). Long short-term memory based recur-
rent neural network architectures for large vocabulary speech recognition. arXiv:1402.
1128v1.
Saleh, K., Hossny, M., & Nahavandi, S. (2018). Intent prediction of vulnerable road
users from motion trajectories using stacked LSTM network. In Proceedings of the
IEEE International Conference on Intelligent Transportation Systems (pp. 327–332).
Piscataway, NJ: IEEE.
Schlag, I., & Schmidhuber, J. (2017). Gated fast weights for on-the-fly neural program
generation. In NIPS 2017 Workshop on Meta-Learning.
Schmidhuber, J. (1993). Reducing the ratio between learning complexity and number
of time varying variables in fully recurrent nets. In Proceedings of the International
Conference on Neural Networks (pp. 460–463). London: Springer.
Schmidhuber, J. (2012). Self-delimiting neural networks. arXiv:1210.0118.
Schmidhuber, J., Wierstra, D., Gagliolo, M., & Gomez, F. (2007). Training recurrent
networks by evolino. Neural Computation, 19(3), 757–779.
Schneider, N., & Gavrila, D. M. (2013). Pedestrian path prediction with recursive
Bayesian filters: A comparative study. In Proceedings of the German Conference on
Pattern Recognition (pp. 174–183). Berlin: Springer.
Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE
Transactions on Signal Processing, 45(11), 2673–2681.
Shabanian, S., Arpit, D., Trischler, A., & Bengio, Y. (2017). Variational Bi-LSTMs.
arXiv:1711.05717.
Sharma, P., & Singh, A. (2017). Era of deep neural networks: A review. In Proceed-
ings of the 8th International Conference on Computing, Communication and Networking
Technologies (pp. 1–5). Piscataway, NJ: IEEE.
Shi, X., Chen, Z., Wang, H., Yeung, D. Y., Wong, W. K., & Woo, W. C. (2015). Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In C. Cortes, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems (pp. 802–810). Red Hook, NY: Curran.
Song, J., Tang, S., Xiao, J., Wu, F., & Zhang, Z. (2016). LSTM-in-LSTM for generating
long descriptions of images. Computational Visual Media, 2(4), 1–10.
Sperduti, A., & Starita, A. (1997). Supervised neural networks for the classification
of structures. IEEE Transactions on Neural Networks, 8(3), 714–735.
Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway networks. arXiv:1505.
00387.
Šter, B. (2013). Selective recurrent neural network. Neural Processing Letters, 38(1),
1–15.
Sun, G. (1990). Connectionist pushdown automata that learn context-free grammars.
In Proceedings of the International Joint Conference on Neural Networks (pp. 577–580).
Piscataway, NJ: IEEE.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with
neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K.
Q. Weinberger (Eds.), Advances in neural information processing systems (pp. 3104–
3112). Red Hook, NY: Curran.
Tai, K. S., Socher, R., & Manning, C. D. (2015). Improved semantic representations
from tree-structured long short-term memory networks. Computer Science, 5(1),
36.
Teng, Z., & Zhang, Y. (2016). Bidirectional tree-structured LSTM with head lexicalization.
arXiv:1611.06788.
Thireou, T., & Reczko, M. (2007). Bidirectional long short-term memory networks for
predicting the subcellular localization of eukaryotic proteins. IEEE/ACM Transac-
tions on Computational Biology and Bioinformatics, 4(3), 441–446.
Veeriah, V., Zhuang, N., & Qi, G. J. (2015). Differential recurrent neural networks for
action recognition. In Proceedings of the IEEE International Conference on Computer
Vision (pp. 4041–4049). Piscataway, NJ: IEEE.
Vohra, R., Goel, K., & Sahoo, J. K. (2015). Modeling temporal dependencies in data
using a DBN-LSTM. In Proceedings of the IEEE International Conference on Data Sci-
ence and Advanced Analytics (pp. 1–4). Piscataway, NJ: IEEE.
Wang, J., & Yuille, A. (2015). Semantic part segmentation using compositional model
combining shape and appearance. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition (pp. 1788–1797). Piscataway, NJ: IEEE.
Wei, H., Zhou, H., Sankaranarayanan, J., Sengupta, S., & Samet, H. (2018). Residual
convolutional LSTM for tweet count prediction. In Companion of the Web Confer-
ence 2018 (pp. 1309–1316). Geneva: International World Wide Web Conferences
Steering Committee.
Weiss, G., Goldberg, Y., & Yahav, E. (2018). On the practical computational power of finite
precision RNNs for language recognition. arXiv:1805.04908.
Weng, J. J., Ahuja, N., & Huang, T. S. (1993, May). Learning recognition and seg-
mentation of 3D objects from 2D images. In Proceedings of the Fourth International
Conference on Computer Vision (pp. 121–128). Piscataway, NJ: IEEE.
Werbos, P. J. (1988). Generalization of backpropagation with application to a recur-
rent gas market model. Neural Networks, 1(4), 339–356.
Williams, R. J. (1989). Complexity of exact gradient computation algorithms for recurrent
neural networks (Technical Report NU-CCS-89-27). Boston: Northeastern Univer-
sity, College of Computer Science.
Wu, H., Zhang, J., & Zong, C. (2016). An empirical exploration of skip connections for
sequential tagging. arXiv:1610.03167.
Xie, X., & Shi, Y. (2018). Long-term memory neural Turing machines. Computer Sci-
ence and Application, 8(1), 49–58.
Yamaguchi, K. (2012). Parsing clothing in fashion photographs. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (pp. 3570–3577). Pis-
cataway, NJ: IEEE.
Yang, Y., Dong, J., Sun, X., Lima, E., Mu, Q., & Wang, X. (2017). A CFCC-LSTM model
for sea surface temperature prediction. IEEE Geoscience and Remote Sensing Letters,
99, 1–5.
Yao, K., Cohn, T., Vylomova, K., Duh, K., & Dyer, C. (2015). Depth-gated LSTM.
arXiv:1508.03790v4.
Yu, B., Xu, Q., & Zhang, P. (2018). Question classification based on MAC-LSTM. In
Proceedings of the IEEE Third International Conference on Data Science in Cyberspace.
Piscataway, NJ: IEEE.
Zaremba, W., & Sutskever, I. (2014). Learning to execute. arXiv:1410.4615.
Zhang, J., Zheng, Y., & Qi, D. (2016). Deep spatio-temporal residual networks for
citywide crowd flows prediction. In Proceedings of the IEEE Conference on Acoustics,
Speech, and Signal Processing (pp. 1655–1661). Piscataway, NJ: IEEE.
Zhang, X., Lu, L., & Lapata, M. (2015). Tree recurrent neural networks with application
to language modeling. arXiv:1511.00060v1.
Zhang, Y., Chen, G., Yu, D., Yao, K., Khudanpur, S., & Glass, J. (2015). Highway long
short-term memory RNNs for distant speech recognition. In Proceedings of the
IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 5755–
5759). Piscataway, NJ: IEEE.
Zhao, R., Wang, J., Yan, R., & Mao, K. (2016). Machine health monitoring with LSTM
networks. In Proceedings of the 10th International Conference on Sensing Technology
(pp. 1–6). Piscataway, NJ: IEEE.
Zhou, C., Sun, C., Liu, Z., & Lau, F. C. M. (2016). A C-LSTM neural network for text
classification. Computer Science, 1(4), 39–44.
Zhou, G., Wu, J., Zhang, C., & Zhou, Z. (2016). Minimal gated unit for recurrent
neural networks. International Journal of Automation and Computing, 13(3), 226–234.
Zhu, G., Zhang, L., Shen, P., & Song, J. (2017). Multimodal gesture recognition using
3D convolution and convolutional LSTM. IEEE Access, 5, 4517–4524.
Zhu, X., Sobhani, P., & Guo, H. (2015). Long short-term memory over tree structures.
arXiv:1503.04881.
Received November 24, 2018; accepted February 26, 2019.