A Review of Recurrent Neural Networks:
LSTM Cells and Network Architectures
Yong Yu
[email protected]
Department of Automation, Xi’an Institute of High-Technology, Xi’an 710025,
China, and Institute No. 25, Second Academy of China, Aerospace Science
and Industry Corporation, Beijing 100854, China
Xiaosheng Si
[email protected]
Changhua Hu
[email protected]
Jianxun Zhang
[email protected]
Department of Automation, Xi’an Institute of High-Technology,
Xi’an 710025, China
Recurrent neural networks (RNNs) have been widely adopted in research areas concerned with sequential data, such as text, audio, and video. How-
ever, RNNs consisting of sigma cells or tanh cells are unable to learn the
relevant information of input data when the input gap is large. By intro-
ducing gate functions into the cell structure, the long short-term memory
(LSTM) could handle the problem of long-term dependencies well. Since
its introduction, almost all the exciting results based on RNNs have been
achieved by the LSTM. The LSTM has become the focus of deep learning.
We review the LSTM cell and its variants to explore the learning capac-
ity of the LSTM cell. Furthermore, the LSTM networks are divided into
two broad categories: LSTM-dominated networks and integrated LSTM
networks. In addition, their various applications are discussed. Finally,
future research directions are presented for LSTM networks.
1 Introduction
Over the past few years, deep learning techniques have been well devel-
oped and widely adopted to extract information from various kinds of data
(Ivakhnenko & Lapa, 1965; Ivakhnenko, 1971; Bengio, 2009; Carrio, Sampe-
dro, Rodriguez-Ramos, & Campoy, 2017; Khan & Yairi, 2018). Considering
different characteristics of input data, there are several types of architec-
tures for deep learning, such as the recurrent neural network (RNN; Robin-
son & Fallside, 1987; Werbos, 1988; Williams, 1989; Ranzato et al., 2014),
the convolutional neural network (CNN), and the deep neural network (DNN). Generally, the CNN and DNN cannot deal with the temporal information of input
data. Therefore, in research areas that contain sequential data, such as text,
audio, and video, RNNs are dominant. Specifically, there are two types of
RNNs: discrete-time RNNs and continuous-time RNNs (Pearlmutter, 1989;
Brown, Yu, & Garverick, 2004; Gallagher, Boddhu, & Vigraham, 2005). The
focus of this review is on discrete-time RNNs.
The typical feature of the RNN architecture is a cyclic connection, which
enables the RNN to possess the capacity to update the current state based on
past states and current input data. These networks, such as fully connected RNNs (El-
man, 1990; Jordan, 1986; Chen & Soo, 1996) and selective RNNs (Šter, 2013),
consisting of standard recurrent cells (e.g., sigma cells), have had incredible
success on some problems. Unfortunately, when the gap between the rele-
vant input data is large, the above RNNs are unable to connect the relevant
information. In order to handle the “long-term dependencies,” Hochreiter
and Schmidhuber (1997) proposed long short-term memory (LSTM).
Almost all exciting results based on RNNs have been achieved by LSTM,
and thus it has become the focus of deep learning. Because of their pow-
erful learning capacity, LSTMs work tremendously well and have been
widely used in various kinds of tasks, including speech recognition (Fer-
nández, Graves, & Schmidhuber, 2007; He & Droppo, 2016; Hsu, Zhang,
Lee, & Glass, 2016), acoustic modeling (Sak, Senior, & Beaufays, 2014;
Qu, Haghani, Weinstein, & Moreno, 2017), trajectory prediction (Altché &
Fortelle, 2017), sentence embedding (Palangi et al., 2015), and correlation
analysis (Mallinar & Rosset, 2018). In this review, we explore these LSTM
networks. This review is different from other review work on RNNs (Deng,
2013; Lipton, Berkowitz, & Elkan, 2015); it focuses only on advances in the
LSTM cell and structures of LSTM networks. Here, the LSTM cell denotes
the recurrent unit in LSTM networks.
The rest of this review is organized as follows. Section 2 introduces the
LSTM cell and its variants. Section 3 reviews the popular LSTM networks
and discusses their applications in various tasks. A summary and discus-
sion of future directions are in section 4.
Figure 1: Schematic of the standard recurrent sigma cell.
2 LSTM Cells

Because the LSTM cell is the basis for the development of LSTM networks, this section first gives a brief review of the LSTM cell and its variants.
2.1 Standard Recurrent Cell. Usually RNNs are networks that consist
of standard recurrent cells such as sigma cells and tanh cells. Figure 1 shows
a schematic of the standard recurrent sigma cell. The mathematical expres-
sions of the standard recurrent sigma cell are written as follows:
h_t = σ(W_h h_{t−1} + W_x x_t + b),
y_t = h_t, (2.1)
where xt , ht , and yt denote the input, the recurrent information, and the out-
put of the cell at time t, respectively; Wh and Wx are the weights; and b is the
bias. Standard recurrent cells have achieved some success in some problems
(Karpathy, Johnson, & Li, 2015; Li, Li, Cook, Zhu, & Gao, 2018). However,
the recurrent networks that consist of standard recurrent cells are not ca-
pable of handling long-term dependencies: as the gap between the related
inputs grows, it is difficult to learn the connection information. Hochreiter
(1991) and Bengio, Simard, and Frasconi (1994) analyzed fundamental rea-
sons for the long-term dependencies problem: error signals flowing back-
ward in time tend to either blow up or vanish.
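To make equation 2.1 concrete, a minimal NumPy sketch of one update step is given below; the dimensions, random weights, and helper names are assumptions of this illustration rather than anything prescribed above.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigma_cell_step(x_t, h_prev, W_h, W_x, b):
        # Equation 2.1: the new state depends on the previous state and the input;
        # the cell output y_t is simply the state h_t.
        h_t = sigmoid(W_h @ h_prev + W_x @ x_t + b)
        return h_t

    # Toy usage: 4-dimensional inputs, an 8-dimensional hidden state.
    rng = np.random.default_rng(0)
    W_h, W_x, b = rng.normal(size=(8, 8)), rng.normal(size=(8, 4)), np.zeros(8)
    h = np.zeros(8)
    for x in rng.normal(size=(5, 4)):  # a length-5 input sequence
        h = sigma_cell_step(x, h, W_h, W_x, b)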
Figure 2: Original LSTM architecture.
Hochreiter and Schmidhuber (1997) handled the long-term dependencies problem by introducing a "gate" into the cell. Since this pioneering work, LSTMs have been modified and popularized by many researchers (Gers, 2001; Gers & Schmidhuber, 2000).
2.2.1 LSTM without a Forget Gate. The architecture of LSTM with only
input and output gates is shown in Figure 2. The mathematical expressions
of the LSTM in Figure 2 can be written as follows:

i_t = σ(W_i [h_{t−1}, x_t] + b_i),
c̃_t = tanh(W_c̃ [h_{t−1}, x_t] + b_c̃),
c_t = c_{t−1} + i_t · c̃_t,
o_t = σ(W_o [h_{t−1}, x_t] + b_o),
h_t = o_t · tanh(c_t), (2.2)

where c_t denotes the cell state of the LSTM. W_i, W_c̃, and W_o are the weights, and
the operator ‘·’ denotes the pointwise multiplication of two vectors. When
updating the cell state, the input gate can decide what new information can
be stored in the cell state, and the output gate decides what information can
be output based on the cell state.
Figure 3: Architecture of LSTM with a forget gate.
2.2.2 LSTM with a Forget Gate. Gers, Schmidhuber, and Cummins (2000)
modified the original LSTM in 2000 by introducing a forget gate into the
cell. In order to obtain the mathematical expressions of this modified LSTM
cell, Figure 3 presents its inner connections.
Based on the connections shown in Figure 3, the LSTM cell can be math-
ematically expressed as follows:
f_t = σ(W_fh h_{t−1} + W_fx x_t + b_f),
i_t = σ(W_ih h_{t−1} + W_ix x_t + b_i),
c̃_t = tanh(W_c̃h h_{t−1} + W_c̃x x_t + b_c̃),
c_t = f_t · c_{t−1} + i_t · c̃_t,
o_t = σ(W_oh h_{t−1} + W_ox x_t + b_o),
h_t = o_t · tanh(c_t). (2.3)
The forget gate can decide what information will be thrown away from
the cell state. When the value of the forget gate, ft , is 1, it keeps this infor-
mation; meanwhile, a value of 0 means it gets rid of all the information.
Jozefowicz, Zaremba, and Sutskever (2015) found that when increasing the
bias of the forget gate, b f , the performance of the LSTM network usually
became better. Furthermore, Schmidhuber, Wierstra, Gagliolo, and Gomez
(2007) proposed that LSTM was sometimes better trained by evolutionary
algorithms combined with other techniques rather than by pure gradient
descent.
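As an illustration of equation 2.3, the following minimal NumPy sketch implements one step of an LSTM cell with a forget gate; the dictionary-based parameter layout and names are assumptions made for readability, not something prescribed by the works cited above.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        # One step of the LSTM cell with a forget gate (cf. equation 2.3).
        f_t = sigmoid(W["fh"] @ h_prev + W["fx"] @ x_t + b["f"])    # forget gate
        i_t = sigmoid(W["ih"] @ h_prev + W["ix"] @ x_t + b["i"])    # input gate
        c_hat = np.tanh(W["ch"] @ h_prev + W["cx"] @ x_t + b["c"])  # candidate state
        c_t = f_t * c_prev + i_t * c_hat                            # cell state update
        o_t = sigmoid(W["oh"] @ h_prev + W["ox"] @ x_t + b["o"])    # output gate
        h_t = o_t * np.tanh(c_t)
        return h_t, c_t

Initializing b["f"] to a positive value is one simple way to realize the larger forget-gate bias recommended by Jozefowicz et al. (2015).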
2.2.3 LSTM with a Peephole Connection. As the gates of the above LSTM
cells do not have direct connections from the cell state, there is a lack of
essential information that harms the network’s performance. In order to
Figure 4: LSTM architecture with a peephole connection.
solve this problem, Gers and Schmidhuber (2000) extended the LSTM cell
by introducing a peephole connection, as shown in Figure 4.
Based on the connections shown in Figure 4, the mathematical expres-
sions can be expressed as follows:
f_t = σ(W_fh h_{t−1} + W_fx x_t + P_f · c_{t−1} + b_f),
i_t = σ(W_ih h_{t−1} + W_ix x_t + P_i · c_{t−1} + b_i),
c̃_t = tanh(W_c̃h h_{t−1} + W_c̃x x_t + b_c̃),
c_t = f_t · c_{t−1} + i_t · c̃_t,
o_t = σ(W_oh h_{t−1} + W_ox x_t + P_o · c_t + b_o),
h_t = o_t · tanh(c_t), (2.4)
where P f , Pi , and Po are the peephole weights for the forget gate, input
gate, and output gate, respectively. The peephole connections allow the
LSTM cell to inspect its current internal states (Gers & Schmidhuber, 2001),
and thus the LSTM with a peephole connection can learn stable and pre-
cise timing algorithms without teacher forcing (Gers & Schraudolph, 2002).
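A corresponding sketch of the peephole cell in equation 2.4 differs only in the gate pre-activations, which additionally see the cell state through the peephole weight vectors P_f, P_i, and P_o; the parameter layout is again an assumption of this illustration.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def peephole_lstm_step(x_t, h_prev, c_prev, W, b, P_f, P_i, P_o):
        # Gates inspect the cell state via elementwise peephole terms (cf. equation 2.4).
        f_t = sigmoid(W["fh"] @ h_prev + W["fx"] @ x_t + P_f * c_prev + b["f"])
        i_t = sigmoid(W["ih"] @ h_prev + W["ix"] @ x_t + P_i * c_prev + b["i"])
        c_hat = np.tanh(W["ch"] @ h_prev + W["cx"] @ x_t + b["c"])
        c_t = f_t * c_prev + i_t * c_hat
        o_t = sigmoid(W["oh"] @ h_prev + W["ox"] @ x_t + P_o * c_t + b["o"])  # peeks at the current c_t
        h_t = o_t * np.tanh(c_t)
        return h_t, c_t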
To explore the LSTM with a peephole connection in depth, Greff, Srivas-
tava, Koutník, Steunebrink, and Schmidhuber (2016) compared the perfor-
mance of eight variants: no input gate, no forget gate, no output gate, no
input activation function, no output activation function, coupled input and
forget gate (CIFG), no peepholes, and full gate recurrence. Each variant dif-
fered from the original LSTM with a peephole connection by only a sin-
gle change. The results showed that the forget and output gates are the
most critical components, and removing any of them obviously decreases
network performance. Moreover, the modified coupled input and forget
gate could reduce the number of parameters and lower computational cost
Figure 5: GRU cell architecture and connections.
without a significant decrease in network performance. Because of this
powerful capacity, LSTM has become the focus of deep learning and has
been applied to multiple tasks.
2.3 Gated Recurrent Unit. The learning capacity of the LSTM cell is su-
perior to that of the standard recurrent cell. However, the additional param-
eters increase computational burden. Therefore, the gated recurrent unit
(GRU) was introduced by Cho et al. (2014). Figure 5 shows the details of
the architecture and connections of the GRU cell.
In order to reduce the number of parameters, the GRU cell integrates the
forget gate and input gate of the LSTM cell as an update gate. The GRU cell
has only two gates: an update gate and a reset gate. Therefore, it could save
one gating signal and the associated parameters. The GRU is essentially a
variant of vanilla LSTM with a forget gate. Since one gate is missing, the
single GRU cell is less powerful than the original LSTM. The GRU cannot be taught to count or to recognize context-free languages (Weiss, Goldberg, & Yahav, 2018), and it performs worse than the LSTM in neural machine translation (Britz, Goldie, Luong, & Le, 2017). Chung, Gulcehre, Cho, and Bengio (2014) empirically evalu-
ated the performance of the LSTM network, GRU network, and traditional
tanh-RNN and found that both the LSTM cell and GRU cell were supe-
rior to the traditional tanh unit, under the condition that each network had
Figure 6: MGU cell schematic.
approximately the same number of parameters. Dey and Salem (2017)
modified the original GRU and evaluated properties of three variants of
GRU—GRU-1, GRU-2, and GRU-3—on MNIST and IMDB data sets. The
results showed that these three variants could reduce the computational
expense while performing as well as the original GRU cell.
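Because the GRU equations are not reproduced here, the sketch below uses the standard formulation of Cho et al. (2014) — an update gate, a reset gate, and no separate cell state; the parameter names and layout are assumptions of the example.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gru_step(x_t, h_prev, W, b):
        z_t = sigmoid(W["zh"] @ h_prev + W["zx"] @ x_t + b["z"])           # update gate
        r_t = sigmoid(W["rh"] @ h_prev + W["rx"] @ x_t + b["r"])           # reset gate
        h_hat = np.tanh(W["hh"] @ (r_t * h_prev) + W["hx"] @ x_t + b["h"])  # candidate state
        h_t = z_t * h_prev + (1.0 - z_t) * h_hat                           # interpolate old and new states
        return h_t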
2.4 Minimal Gated Unit. To further reduce the number of cell param-
eters, Zhou, Sun, Liu, and Lau (2016) proposed the minimal gated unit,
which only has one gate. The schematic of the MGU is shown in Figure
6. Based on the connections in the figure, the mathematical expressions of
the MGU can be written as follows:
f_t = σ(W_fh h_{t−1} + W_fx x_t + b_f),
c̃_t = tanh(W_c̃h (f_t · h_{t−1}) + W_c̃x x_t + b_c̃),
h_t = (1 − f_t) · h_{t−1} + f_t · c̃_t.
The MGU cell involves only one forget gate. Hence, this kind of cell
has a simpler structure and fewer parameters compared with LSTM and
the GRU. Evaluation results (Zhou et al., 2016), based on various sequence
data, showed that the performance of the MGU is comparable to that of
the GRU. In order to further simplify the MGU, Heck and Salem (2017) in-
troduced three model variants—MGU-1, MGU-2, and MGU-3—which re-
duced the number of parameters in the forget gate. They found that the
performances of these variants were comparable to those of the MGU in
tasks on the MNIST data set and the Reuters Newswire Topics data set.
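A sketch of one MGU step, following my reading of the single-gate formulation of Zhou et al. (2016), shows how the forget gate also plays the role of the update gate; names and shapes are assumptions of this illustration.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def mgu_step(x_t, h_prev, W, b):
        f_t = sigmoid(W["fh"] @ h_prev + W["fx"] @ x_t + b["f"])              # the only gate
        h_hat = np.tanh(W["hh"] @ (f_t * h_prev) + W["hx"] @ x_t + b["h"])    # candidate state
        h_t = (1.0 - f_t) * h_prev + f_t * h_hat                              # forget gate doubles as update gate
        return h_t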
Jozefowicz, Zaremba, and Sutskever (2015) searched over thousands of candidate recurrent architectures and identified three cells that outperformed the LSTM and GRU on the considered tasks: MUT-1, MUT-2, and MUT-3. The
architectures of these cells are similar to that of the GRU with only two
gates: an update gate and a reset gate. Neil, Pfeiffer, and Liu (2016) extended
the LSTM cell by adding a new time gate, and proposed the phased LSTM
cell. This new time gate leads to a sparse updating for the cell, which makes
the phased LSTM achieve faster convergence than regular LSTMs on tasks
of learning long sequences. Nina and Rodriguez (2016) simplified the LSTM
cell by removing the input gate and coupling the forget gate and input gate.
Removing the input gate produced better results in tasks that did not need
to recall very long sequences.
Unlike the variants that modified the LSTM cell through decreasing or
increasing gate functions, there are other kinds of LSTM variants. For ex-
ample, Rahman, Mohammed, and Azad (2017) incorporated a biologically
inspired variation into the LSTM cell and proposed the biologically variant
LSTM. They changed only the updating of cell state c(t) to improve cell ca-
pacity. Moreover, LSTM with working memory was introduced by Pulver
and Lyu (2017). This modified version replaced the forget gate with a func-
tional layer, whose input is decided by the previous memory cell value.
The gated orthogonal recurrent unit (GORU) was proposed by Jing et al.
(2017). They modified the GRU by an orthogonal matrix, which replaced
the hidden state loop matrix of the GRU. This made the GORU possess
the advantages of both an orthogonal matrix and the GRU structure. Irie,
Tüske, Alkhouli, Schlüter, and Ney (2016) added highway connections to
the LSTM cell and GRU cell to ensure an unobstructed information flow
between adjacent layers. Furthermore, Veeriah, Zhuang, and Qi (2015) in-
troduced a differential gating scheme for the LSTM neural network to solve
the impact of spatiotemporal dynamics and then proposed the differential
LSTM cell.
3 LSTM Networks

Due to the limited capacity of the single LSTM cell in handling engineering
problems, the LSTM cells have to be organized into a specific network archi-
tecture when processing practical data. Despite the fact that all RNNs can be
modified as LSTM networks by replacing the standard recurrent cell with
the LSTM cell, this review discusses only the verified LSTM networks. We
divide these LSTM networks into two broad categories: LSTM-dominated
networks and integrated LSTM networks. LSTM-dominated networks are
Figure 7: Simplified schematic diagram of the LSTM cell.
neural networks that are mainly constructed by LSTM cells. These networks
focus on optimizing the connections of the inner LSTM cells so as to en-
hance the network properties (Nie, An, Huang, Yan, & Han, 2016; Zhao,
Wang, Yan, & Mao, 2016). The integrated LSTM networks consist of LSTM
layers and other components, such as a CNN and an external memory unit.
The integrated LSTM networks mainly pay attention to integrating the ad-
vantageous features of different components when dealing with the target
task.
Figure 8: The stacked LSTM network.
3.1 LSTM-Dominated Networks

3.1.1 Stacked LSTM Network. The simplest way to deepen an LSTM network is to stack multiple LSTM layers, as shown in Figure 8. At time t, the Lth layer updates its states according to equation 2.3, with the layer input x_t replaced by h_t^{L−1}, where h_t^{L−1} is the output of the (L − 1)th LSTM layer at time t.
Because of the simple and efficient architecture, the stacked LSTM net-
work has been widely adopted by researchers. Du, Zhang, Nguyen, and
Han (2017) adopted the stacked LSTM network to solve the problem of
vehicle-to-vehicle communication and discovered that the efficiency of the
stacked LSTM-based regression model was much higher than that of lo-
gistic regression. Sutskever, Vinyals, and Le (2014) used a stacked LSTM
network with four layers, and 1000 cells at each layer, to accomplish an
English-to-French translation task. They found that reversing the order of
source words could introduce short-term dependencies between the source
and the target sentence; thus, the performance of this LSTM network was
remarkably improved. Saleh, Hossny, and Nahavandi (2018) constructed a
deep stacked LSTM network to predict the intent of vulnerable road users.
When evaluated on the Daimler pedestrian path prediction benchmark data
set (Schneider & Gavrila, 2013) for intent and path prediction of pedestrians
in four unique traffic scenarios, the stacked LSTM was more powerful than
the methodology relying on a set of specific hand-crafted features.
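As a practical illustration (not tied to any of the specific systems above), a stacked LSTM can be assembled by feeding each layer's hidden sequence to the next layer, for example with PyTorch's nn.LSTM and its num_layers argument; the sizes below are placeholders.

    import torch
    import torch.nn as nn

    # A 4-layer stacked LSTM: each layer consumes the hidden sequence of the layer below.
    stacked = nn.LSTM(input_size=32, hidden_size=128, num_layers=4, batch_first=True)

    x = torch.randn(8, 50, 32)           # (batch, time, features)
    outputs, (h_n, c_n) = stacked(x)     # outputs: (8, 50, 128), taken from the top layer
    # h_n and c_n have shape (num_layers, batch, hidden): the final state of every layer.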
3.1.2 Bidirectional LSTM Network. Graves and Schmidhuber (2005) combined the bidirectional RNN (Schuster & Paliwal, 1997) with the LSTM cell and proposed the bidirectional LSTM. Figure 10 shows the internal con-
nections of bidirectional LSTM recurrent layers.
In the architecture in Figure 10, the connections of the forward layers are the same as in the stacked LSTM network, which computes the sequence (h_t^L, c_t^L) from t = 1 to T. For the backward layers, however, the hidden sequence (←h_t^L, ←c_t^L) and the output are iterated from t = T to 1. Therefore, the mathematical expressions of the LSTM cell in backward layer L, at time t, can be written as follows:
←f_t^L = σ(W_←fh^L ←h_{t+1}^L + W_←fx^L h_t^{L−1} + b_←f^L),
←i_t^L = σ(W_←ih^L ←h_{t+1}^L + W_←ix^L h_t^{L−1} + b_←i^L),
←c̃_t^L = tanh(W_←c̃h^L ←h_{t+1}^L + W_←c̃x^L h_t^{L−1} + b_←c̃^L),
←c_t^L = ←f_t^L · ←c_{t+1}^L + ←i_t^L · ←c̃_t^L,
←o_t^L = σ(W_←oh^L ←h_{t+1}^L + W_←ox^L h_t^{L−1} + b_←o^L),
←h_t^L = ←o_t^L · tanh(←c_t^L). (3.2)

y_t = W_hy h_t + W_←hy ←h_t + b_y, (3.3)

where h_t and ←h_t are the outputs of the forward and backward layers, respectively.

Figure 10: Internal connections of bidirectional LSTM.
Bidirectional LSTM networks have been widely adopted by researchers
because of their excellent properties (Han, Wu, Jiang, & Davis, 2017; Yu,
Xu, & Zhang, 2018). Thireou and Reczko (2007) applied this architecture to
the sequence-based prediction of protein localization. The results showed
that the bidirectional LSTM network outperforms the feedforward network
and standard recurrent network. Wu, Zhang, and Zong (2016) investigated
the different skip connections in a stacked bidirectional LSTM network and
found that adding skip connections to the cell outputs with a gated identity
function could improve network performance on the part-of-speech tag-
ging task. Brahma (2018) extended the bidirectional LSTM to suffix bidirec-
tional LSTM (SuBiLSTM), which improved the bidirectional LSTM network
by encoding each suffix of the sequence. The SuBiLSTM outperformed the
bidirectional LSTM in learning general sentence representations.
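For illustration, a bidirectional LSTM layer runs one recurrence from t = 1 to T and another from t = T to 1 and concatenates their hidden states at every step; a minimal PyTorch sketch (hyperparameters are placeholders) looks like this:

    import torch
    import torch.nn as nn

    bilstm = nn.LSTM(input_size=32, hidden_size=128, batch_first=True, bidirectional=True)

    x = torch.randn(8, 50, 32)        # (batch, time, features)
    outputs, _ = bilstm(x)            # (8, 50, 256): forward and backward states concatenated
    forward_out, backward_out = outputs[..., :128], outputs[..., 128:]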
Figure 11: Unrolled 2D LSTM network.
3.1.3 Multidimensional LSTM Network. The standard RNNs can only be
used to deal with one-dimensional data. However, in fields with multidi-
mensional data like video processing, the properties of RNNs—the ability
to access contextual information and robustness to input warping—are also desirable. The multidimensional LSTM (MDLSTM) network therefore extends the recurrent connections to every dimension of the data; accordingly, there have to be n forget gates in the n-dimensional LSTM network (one for each of the
cell’s previous states along every dimension). Figure 11 shows the unrolled
two-dimensional (2D) LSTM network case.
In the 2D LSTM network, the blue lines and green lines indicate recurrent
connections along dimensions k and j, respectively. The expression ∗(k, j)
denotes the recurrent information (h^L_{k,j}, c^L_{k,j}). Then the mathematical expres-
sion of the Lth LSTM layer is expressed as follows:
Figure 12: The proposed graph LSTM structure.
(3.4)

where (h^L_{k−1,j}, c^L_{k−1,j}) and (h^L_{k,j−1}, c^L_{k,j−1}) denote the one-step-back recurrent information of point (k, j) along dimensions k and j at LSTM layer L, and h^{L−1}_{k,j} is the output of point (k, j) at LSTM layer L − 1.
Graves et al. (2007) extended the above unidirectional MDLSTM to the
multidirectional MDLSTM and adopted this architecture to deal with the
Air Freight database (McCarter & Storkey, 2007) and the MNIST database
(LeCun, Bottou, Bengio, & Haffner, 1998). The results showed that this ar-
chitecture is more robust to input warping than the state-of-the-art digit
recognition algorithm. Li, Mohamed, Zweig, and Gong (2016a) constructed
a MDLSTM network by time-frequency LSTM cells (Li, Mohamed, Zweig,
& Gong, 2016b), which performed recurrence along the time and frequency
axes. On the Microsoft Windows phone short message dictation task, the
recurrence over both time and frequency axes promoted learning accuracy
compared with a network that consisted of only time LSTM cells.
3.1.4 Graph LSTM Network. Unlike networks that assume a fixed starting node and a predefined updating route for all images, the starting node and node updating scheme of graph LSTM are dynamically specified.
The superpixel graph topology is acquired by image oversegmentation using simple linear iterative clustering (Achanta et al., 2010). For convenience, we first define the average hidden state for the neighboring nodes of node k:
h̄_{k,t} = ( Σ_{j∈N_G(k)} ( 1(q_j = 1) h_{j,t+1} + 1(q_j = 0) h_{j,t} ) ) / |N_G(k)|, (3.5)
where N_G(k) denotes the set of neighboring nodes of the visited node k, and q_j indicates the state of node j, which is set to 1 if node j has been updated and 0 otherwise. Furthermore, different neighboring nodes have different influences on the visited node k; graph LSTM captures this with the adaptive forget gates f̄_{kj}, j ∈ N_G(k). The graph LSTM cell is updated as follows:
c_{k,t+1} = ( Σ_{j∈N_G(k)} ( 1(q_j = 1) f̄_{kj,t} · c_{j,t} + 1(q_j = 0) f̄_{kj,t} · c_{j,t−1} ) ) / |N_G(k)| + · · · ,
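A small sketch of the neighbor averaging in equation 3.5 may help; the graph representation (plain dictionaries keyed by node index) is an assumption of this example.

    import numpy as np

    def averaged_neighbor_state(neighbors, h_new, h_old, updated):
        # Equation 3.5: use the already-updated state of a neighbor if it has been
        # visited in this sweep, and its previous state otherwise, then average.
        states = [h_new[j] if updated[j] else h_old[j] for j in neighbors]
        return np.mean(states, axis=0)

    # Toy usage: node k has neighbors 0, 2, and 5; nodes 0 and 5 are already updated.
    h_old = {j: np.random.randn(16) for j in range(6)}
    h_new = {j: np.random.randn(16) for j in range(6)}
    updated = {j: j in (0, 5) for j in range(6)}
    h_bar_k = averaged_neighbor_state([0, 2, 5], h_new, h_old, updated)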
3.1.5 Grid LSTM Network. The Grid LSTM network, proposed by Kalch-
brenner, Danihelka, and Graves (2015), could also be used to process multi-
dimensional data. This architecture arranges the LSTM cells in a grid of one
Figure 13: Blocks of the stacked LSTM and the grid LSTM.
or more dimensions. Different from the existing networks, the grid LSTM
network has recurrent connections along the depth dimension. In order to
explain the architecture of the grid LSTM, the blocks from the standard
LSTM and Grid LSTM are shown in Figure 13.
In the blocks, the dashed lines denote identity transformations. The red
lines, green lines, and purple lines indicate the transformations along dif-
ferent dimensions. Compared with the standard LSTM block, the 2D Grid
LSTM block has the cell memory vector along the vertical dimension.
Kalchbrenner et al. (2015) put forward that the one-dimensional (1D)
grid LSTM network is the architecture that replaces the transfer functions
(e.g., tanh and ReLU; Nair & Hinton, 2010) of the feedforward network by
the 1D grid LSTM block. Thus, the 2D Grid LSTM network is similar to the
Figure 14: The stacked LSTM and 2D grid LSTM.
stacked LSTM network, but there are recurrent connections along the depth
dimension. The grid LSTM with three or more dimensions corresponds to
MDLSTM (Graves, 2012), but the N-way recurrent interactions exist along
all dimensions. Figure 14 shows the stacked LSTM network and 2D Grid
LSTM network.
Based on the mathematical expressions of the LSTM cell block in equa-
tion 2.3, in the 2D Grid LSTM network along the time dimension, the ex-
pressions of the LSTM cell at layer L and time t are expressed as follows:
Figure 15: Recurrent structure of the ConvLSTM layer.
o_t^L = σ(W_oh^t h_t^{L−1} + W_ox^t h_t^L + b_o^t),
h_t^L = o_t^L · tanh(c_t^L). (3.8)
Kalchbrenner et al. (2015) found that the 2D grid LSTM outperformed the
stacked LSTM on the task of memorizing sequences of numbers (Zaremba
& Sutskever, 2014). The recurrent connections along the depth dimension
could improve the learning properties of grid LSTM. In order to lower the
computational complexity of the grid LSTM network, Li and Sainath (2017)
compared four grid LSTM variations and found that the frequency-block
grid LSTM reduced computation costs without loss of accuracy on a 12,500-hour voice search task.
3.1.6 Convolutional LSTM Network. The fully connected LSTM layer con-
tains too much redundancy for spatial data (Sainath, Vinyals, Senior, & Sak,
2015). Therefore, to accomplish a spatiotemporal sequence forecasting prob-
lem, Shi et al. (2015) proposed the convolutional LSTM (ConvLSTM), which
had convolutional structures in the recurrent connections. Figure 15 shows
the recurrent structure of the ConvLSTM layer.
The ConvLSTM network uses the convolution operator to calculate the
future state of a certain cell, and then the future state is determined by the
inputs and past states of its local neighbors. The ConvLSTM cell can be ex-
pressed as follows:
f_t = σ(W_fh ∗ h_{t−1} + W_fx ∗ x_t + b_f),
i_t = σ(W_ih ∗ h_{t−1} + W_ix ∗ x_t + b_i),
c̃_t = tanh(W_c̃h ∗ h_{t−1} + W_c̃x ∗ x_t + b_c̃),
c_t = f_t · c_{t−1} + i_t · c̃_t,
o_t = σ(W_oh ∗ h_{t−1} + W_ox ∗ x_t + b_o),
h_t = o_t · tanh(c_t), (3.9)

where ‘∗’ is the convolution operator and ‘·’, as before, denotes the Hadamard product.
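To show how the convolutional recurrence of equation 3.9 replaces the matrix products of equation 2.3, here is a hedged PyTorch sketch of one ConvLSTM step; stacking the four gates into a single convolution and assuming an odd kernel size are implementation choices of this example.

    import torch
    import torch.nn.functional as F

    def convlstm_step(x_t, h_prev, c_prev, W_xh, W_hh, b):
        # x_t: (N, C_in, H, W); h_prev, c_prev: (N, C_hid, H, W)
        # W_xh: (4*C_hid, C_in, k, k); W_hh: (4*C_hid, C_hid, k, k); b: (4*C_hid,)
        k = W_xh.shape[-1]
        gates = (F.conv2d(x_t, W_xh, padding=k // 2)
                 + F.conv2d(h_prev, W_hh, padding=k // 2)
                 + b.view(1, -1, 1, 1))
        f, i, g, o = torch.chunk(gates, 4, dim=1)  # split channel-wise into the four gates
        c_t = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)  # Hadamard products
        h_t = torch.sigmoid(o) * torch.tanh(c_t)
        return h_t, c_t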
Wei, Zhou, Sankaranarayanan, Sengupta, and Samet (2018) adopted
the ConvLSTM network to solve tweet count prediction, a spatiotempo-
ral sequence forecasting problem. The results of experiments on the city of
Seattle showed that the proposed network consistently outperforms the
competitive baseline approaches: the autoregressive integrated moving
average model, ST-ResNet (Zhang, Zheng, & Qi, 2016), and Eyewitness
(Krumm & Horvitz, 2015). Zhu, Zhang, Shen, and Song (2017) combined
the 3D CNN and ConvLSTM to construct a multimodal gesture recogni-
tion model. The 3D CNN and ConvLSTM learned the short-term and long-
term spatiotemporal features of gestures, respectively. When verified on the
Sheffield Kinect Gesture data set and the ChaLearn LAP large-scale isolated
gesture data set, the results showed that the proposed method performs
better than other models. Liu, Zhou, Hang, and Yuan (2017) extended both
the ConvLSTM and bidirectional LSTM and presented the bidirectional–
convolutional LSTM architecture to learn spectral–spatial features from hy-
perspectral images.
3.1.7 Depth-gated LSTM Network. The stacked LSTM network is the sim-
plest way to construct DNN. However, Yao, Cohn, Vylomova, Duh, and
Dyer (2015) pointed out that the error signals in the stacked LSTM network
might be either diminished or exploded because they have to pass through many stacked recurrent layers. Therefore, they proposed the depth-gated LSTM (DGLSTM), which introduces a depth gate that connects the memory cells of adjacent layers:

d_t^L = σ(W_xd^L x_t + ω_cd^L · c_{t−1}^L + ω_ld^L · c_t^{L−1} + b_d^L), (3.10)

where ω_cd^L and ω_ld^L are the weight vectors that relate the past memory cell and the lower-layer memory cell, respectively. Moreover, ω_ld^L should be a matrix instead of a vector if the memory cells in the lower and upper layers have different dimensions. Therefore, in the DGLSTM, the memory cell at layer L is computed by adding a depth-gated contribution from the lower layer's memory cell, d_t^L · c_t^{L−1}, to the standard LSTM cell update.
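The following sketch illustrates the depth gate of equation 3.10 and one plausible way it can enter the memory update (adding a gated copy of the lower layer's memory cell to the usual LSTM update, following my reading of Yao et al., 2015); it is an assumption-laden illustration rather than the authors' exact formulation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def depth_gate(x_t, c_prev_same_layer, c_lower_layer, W_xd, w_cd, w_ld, b_d):
        # Equation 3.10: the depth gate looks at the layer input, the layer's own
        # previous memory cell, and the lower layer's current memory cell.
        return sigmoid(W_xd @ x_t + w_cd * c_prev_same_layer + w_ld * c_lower_layer + b_d)

    # Assumed memory update for layer L (illustrative only):
    # c_t_L = f_t * c_prev_L + i_t * c_hat_t + d_t_L * c_t_lower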
Figure 16: An illustration of DGLSTM.
The DGLSTM architecture is inspired by the highway network (Kim, El-Khamy, & Lee, 2017; Srivastava, Greff, & Schmidhuber, 2015) and grid LSTM (Kalchbrenner et al., 2015). Yao et al. (2015) evaluated DGLSTM networks of varied depth.
Figure 17: Architecture of the GF-RNN.
In the gated feedback RNN (GF-RNN; Chung, Gulcehre, Cho, & Bengio, 2015), a global reset gate g^{i→j} controls the connection from the hidden state of layer i at time step t − 1 to layer j at time step t. The gate is computed from the layer input and the hidden activations, where W_g^{j−1→j} and U_g^{i→j} are the weight vectors for the input and the hidden activations of the ith layer at time step t − 1, respectively, and L is the number of hidden layers. For j = 1, h_t^{j−1} is the input x_t. In the GF-LSTM network, the unit-wise gates, such as the input gate, forget gate, and output gate, are not modified by the global reset gates. However, the candidate cell state is computed with the gated feedback connections:

c̃_t^j = tanh(W_c^{j−1→j} h_t^{j−1} + Σ_{i=1}^{L} g^{i→j} U_c^{i→j} h_{t−1}^i + b_c̃). (3.13)
Figure 18: Tree-structured LSTM network. The short line and small circle at
each arrowhead indicate a block or pass of information, respectively. In the real
model, the gating is a soft version of gating.
The cell state of the tree-structured LSTM depends on the states of its
child units. The updating of the LSTM cell can be expressed as follows:
i_t = σ(W_hi^L h_{t−1}^L + W_hi^R h_{t−1}^R + W_ci^L c_{t−1}^L + W_ci^R c_{t−1}^R + b_i),

where f^L and f^R are the left and right forget gates, respectively. At time t, the inputs of the node LSTM cell are the hidden vectors of its two children, which are denoted as h_{t−1}^L and h_{t−1}^R.
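For a binary tree, a node update in this spirit can be sketched as follows; the separate left/right weight matrices and forget gates are the point of the construction, while omitting the cell-state (peephole-like) terms shown in the i_t expression above is a simplification of this example.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def binary_tree_lstm_node(h_L, c_L, h_R, c_R, W, b):
        # Combine a left child (h_L, c_L) and a right child (h_R, c_R) into the parent state.
        i = sigmoid(W["iL"] @ h_L + W["iR"] @ h_R + b["i"])
        f_left = sigmoid(W["fLL"] @ h_L + W["fLR"] @ h_R + b["fL"])   # left forget gate
        f_right = sigmoid(W["fRL"] @ h_L + W["fRR"] @ h_R + b["fR"])  # right forget gate
        g = np.tanh(W["gL"] @ h_L + W["gR"] @ h_R + b["g"])           # candidate state
        o = sigmoid(W["oL"] @ h_L + W["oR"] @ h_R + b["o"])
        c = i * g + f_left * c_L + f_right * c_R                      # each child is forgotten separately
        h = o * np.tanh(c)
        return h, c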
Zhu et al. (2015) proved that the tree-structured LSTM network out-
performed other recursive models in learning distributed sentiment rep-
resentations for texts. Tai et al. (2015) proposed two tree-structured LSTM variants, the Child-Sum Tree-LSTM and the N-ary Tree-LSTM. The Child-Sum Tree-LSTM sums over the hidden states of a node's children and shares parameters among them, so it handles an arbitrary number of children per cell. The N-ary Tree-LSTM, however, introduces separate parameter matrices for each child and is able to learn more finely grained information from
its children. Teng and Zhang (2016) further extended the tree-structured
LSTM to the bidirectional tree-structured LSTM with head lexicaliza-
tion. The general tree-structured LSTM network constitutes trees in the
bottom-up direction. Therefore, only the leaf nodes can use the input word
information. In order to propagate head words from leaf nodes to every con-
stituent node, Teng and Zhang (2016) proposed the automatic head lexical-
ization for a general tree-structured LSTM network, and built a tree LSTM
in the top-down direction. Miwa and Bansal (2016) further stacked a bidi-
rectional tree-structured LSTM network on a bidirectional sequential LSTM
network to accomplish end-to-end relation extraction. Additionally, Niu,
Zhou, Wang, Gao, and Hua (2017) extended the tree-structured LSTM to
the hierarchical multimodal LSTM (HM-LSTM) for the problem of dense
visual-semantic embedding. Since the HM-LSTM could exploit the hierar-
chical relations between whole images and image regions and between sen-
tences and phrases, this model outperformed other methods on Flickr8K
(Graves, 2014), Flickr30K (Plummer et al., 2017), and MS-COCO (Chen et al.,
2015; Lin et al., 2014) data sets.
Nested LSTM Network. The nested LSTM (NLSTM) network was proposed by Moniz and Krueger (2018). They adopted nesting to add depth to the network. In the NLSTM network, the memory cell of the outer LSTM cell is computed by the inner LSTM cell, with c_t^{outer} = h_t^{inner}. In the task of character-level language modeling, the NLSTM network is a promising alternative to the stacked architecture, and it outperforms both stacked and single-layer LSTM networks with similar numbers of parameters.
3.2 Integrated LSTM Networks

These networks integrate LSTM layers with other components, such as the CNN and the external memory unit, to take advantage of different components.
3.2.1 Neural Turing Machine. Because the RNN needs to update the cur-
rent cell states by inputs and previous states, the memory capacity is the
essential feature of the RNN. In order to enhance the memory capacity of
networks, neural networks have been adopted to learn to control end-to-
end differentiable stack machines (Sun, 1990; Mozer & Das, 1993). Schmid-
huber (1993) and Schlag and Schmidhuber (2017) extended the above
structure, and used the RNN to learn and control end-to-end-differentiable
fast weight memory. Furthermore, Graves, Wayne, and Danihelka (2014) proposed the neural Turing machine (NTM), which couples a neural network controller with an external memory matrix during each update cycle. At the same time, the RNN also exchanges infor-
mation with the memory matrix via the read and write heads. The math-
ematical expression of the neural network is decided by the type of the
selected RNN. Graves et al. (2014) selected the LSTM cell to construct the
neural network and compared the performance of this NTM and a standard
LSTM network. The results showed that NTM possessed advantages over
the LSTM network. Xie and Shi (2018) improved the training speed by de-
signing a new read–write mechanism for the NTM, which used convolution
operations to extract global memory features.
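As a rough sketch of the content-based part of NTM-style addressing (the full model also uses location-based addressing and write heads), a read head can compare its key against every memory row and return an attention-weighted read vector; the sizes below are placeholders.

    import numpy as np

    def content_read(memory, key, beta):
        # Cosine similarity between the key and each of the N memory rows,
        # sharpened by beta and normalized with a softmax into addressing weights.
        sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
        w = np.exp(beta * sims)
        w /= w.sum()
        return w @ memory              # the read vector: a weighted sum of memory rows

    memory = np.random.randn(128, 20)  # N = 128 memory slots, each of width M = 20
    r = content_read(memory, key=np.random.randn(20), beta=5.0)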
Figure 19: Schematic of the NTM with an LSTM controller.
cell with the LSTM cell, which ensures that the model can deal with a longer
time duration.
Multiscale LSTM Network. Cheng et al. (2016) combined a preprocessing block and an LSTM network to form the multiscale LSTM (MS-LSTM)
network. The preprocessing block helps select a proper timescale for the in-
put data, and the LSTM layer is adopted to model the processed sequential
data. In processing dynamic Internet traffic, Cheng et al. (2016) used this
model to learn the Internet traffic pattern in a flexible time window. Peng,
Zhang, Liang, Liu, and Lin (2016) combined the MS-LSTM with a CNN and
constructed a recurrent architecture for geometric scene parsing.
3.2.6 LSTM-in-LSTM Network. Song, Tang, Xiao, Wu, and Zhang (2016)
combined a Deep-CNN network and an LSTM-in-LSTM architecture to
generate rich, finely grained textual descriptions of images. The LSTM-in-
LSTM architecture consists of an inner LSTM cell and an outer LSTM cell.
This architecture can learn the contextual interactions between visual cues
and can thus predict long sentence descriptions.
4 Conclusion
We have systematically reviewed various LSTM cell variants and LSTM net-
works. The LSTM cell is the basic node of LSTM networks. Different vari-
ants outperform the standard LSTM cell on some characteristics and tasks.
For example, the GRU has a small number of cell parameters. However,
there is no variant that can surpass the standard LSTM cell in all aspects.
As for LSTM networks, there are two major categories: LSTM-dominated
neural networks and integrated LSTM networks. In order to enhance the
network properties of some specific tasks, LSTM-dominated networks fo-
cus on optimizing the connections between inner LSTM cells. Integrated
LSTM networks mainly pay attention to integrating the advantageous fea-
tures of different components, such as the convolution neural network and
external memory unit, when dealing with the target task. Current LSTM
models have acquired incredible success on numerous tasks. Neverthe-
less, there are still some directions to augment RNNs with more powerful
properties.
First, more efficient recurrent cells should be explored through explicit
knowledge. The recurrent cells are the basic nodes, and the properties of the
networks depend on recurrent cells to some extent. However, the current
studies on recurrent cells are mainly empirical explorations. Second, adaptive computation is a promising direction: a recurrent network can decide whether the recurrent units themselves are used at all, and for how long, learning to control time through a "halt unit." For each time step, the amount of computation in adaptive computation time RNNs may be different, and thus they could deal well with tasks of different difficulties.
Acknowledgments
We thank LetPub (www.letpub.com) for its linguistic assistance during the
preparation of this review. This work was supported by the National Natural Science Foundation of China (grants 61773386, 61573365, 61573366,
61573076), the Young Elite Scientists Sponsorship Program of China Asso-
ciation for Science and Technology (grant 2016QNRC001), and the National
Key R&D Program of China (grant 2018YFB1306100).
References
Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., & Süsstrunk, S. (2010). SLIC super-
pixels. https://www.researchgate.net/publication/44234783_SLIC_superpixels
Altché, F., & Fortelle, A. D. L. (2017). An LSTM network for highway trajectory pre-
diction. In Proceedings of the IEEE 20th International Conference on Intelligent Trans-
portation Systems. Piscataway, NJ: IEEE.
Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Ma-
chine Learning, 2(1), 1–127.
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with
gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.
Brahma, S. (2018). Suffix bidirectional long short-term memory. http://www
.researchgate.net/publication/325262895_Suffix_Bidirectional_Long_Short-Term
_Memory
Britz, D., Goldie, A., Luong, M. T., & Le, Q. (2017). Massive exploration of neural machine
translation architectures. arXiv:1703.03906.
Brown, B., Yu, X., & Garverick, S. (2004). Mixed-mode analog VLSI continuous-time
recurrent neural network. In Circuits, Signals, and Systems: IASTED International
Conference Proceedings.
Carrio, A., Sampedro, C., Rodriguez-Ramos, A., & Campoy, P. (2017). A review of
deep learning methods and applications for unmanned aerial vehicles. Journal of
Sensors, 2, 1–13.
Chen, T. B., & Soo, V. W. (1996). A comparative study of recurrent neural network
architectures on learning temporal sequences. In Proceedings of the IEEE Interna-
tional Conference on Neural Networks. Piscataway, NJ: IEEE.
Chen, X., Fang, H., Lin, T. Y., Vedantam, R., Gupta, S., Dollar, P., & Zitnick,
C. L. (2015). Microsoft COCO captions: Data collection and evaluation server.
arXiv:1504.00325v2.
Chen, X., Mottaghi, R., Liu, X., Fidler, S., Urtasun, R., & Yuille, A. (2014). Detect what
you can: Detecting and representing objects using holistic models and body parts.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Piscataway, NJ: IEEE.
Cheng, M., Xu, Q., Lv, J., Liu, W., Li, Q., & Wang, J. (2016). MS-LSTM: A multi-scale
LSTM model for BGP anomaly detection. In Proceedings of the IEEE 24th Interna-
tional Conference on Network Protocols. Piscataway, NJ: IEEE.
Cho, K., Van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H.,
& Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder
for statistical machine translation. arXiv:1406.1078v3.
Chung, J., Gulcehre, C., Cho, K. H., & Bengio, Y. (2014). Empirical evaluation of gated
recurrent neural networks on sequence modeling. arXiv:1412.3555v1.
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2015). Gated feedback recurrent neural
networks. arXiv:1502.02367v1.
Deng, L. (2013). Three classes of deep learning architectures and their applications:
A tutorial survey. In APSIPA transactions on signal and information processing. Cam-
bridge: Cambridge University Press.
Dey, R., & Salem, F. M. (2017). Gate-variants of gated recurrent unit (GRU) neural net-
works. In Proceedings of the IEEE International Midwest Symposium on Circuits and
Systems. Piscataway, NJ: IEEE.
Du, X., Zhang, H., Nguyen, H. V., & Han, Z. (2017). Stacked LSTM deep learning
model for traffic prediction in vehicle-to-vehicle communication. In Proceedings
of the IEEE Vehicular Technology Conference. Piscataway, NJ: IEEE.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
Fernández, S., Graves, A., & Schmidhuber, J. (2007a). Sequence labelling in struc-
tured domains with hierarchical recurrent neural networks. In Proceedings of the
20th International Joint Conference on Artificial Intelligence. San Mateo, CA: Morgan
Kaufmann.
Gers, F. A., & Schraudolph, N. N. (2002). Learning precise timing with LSTM recur-
rent networks. Journal of Machine Learning Research, 3, 115–143.
Goel, K., Vohra, R., & Sahoo, J. K. (2014). Polyphonic music generation by model-
ing temporal dependencies using a RNN-DBN. In Proceedings of the International
Conference on Artificial Neural Networks. Berlin: Springer.
Goller, C., & Kuchler, A. (1996). Learning task-dependent distributed representa-
tions by backpropagation through structure. Neural Networks, 1, 347–352.
Graves, A. (2012). Supervised sequence labelling with recurrent neural networks. Berlin:
Springer.
Graves, A. (2014). Generating sequences with recurrent neural networks. arXiv:1308.
0850v5.
Graves, A., Fernández, S., & Schmidhuber, J. (2007). Multi-dimensional recurrent neural
networks. In Proceedings of the International Conference on Artificial Neural Networks.
Berlin: Springer.
Graves, A., Wayne, G., & Danihelka, I. (2014). Neural Turing machines. arXiv:1410.
5401v2.
Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabskabarwińska,
A., & Agapiou, J. (2016). Hybrid computing using a neural network with dynamic
external memory. Nature, 538(7626), 471–476.
Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidi-
rectional LSTM and other neural network architectures. Neural Networks, 18(5),
602–610.
Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., & Schmidhuber, J. (2016).
LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10), 2222–2232.
Ivakhnenko, A. G., & Lapa, V. G. (1965). Cybernetic predicting devices. CCM Information Corporation.
Jing, L., Gulcehre, C., Peurifoy, J., Shen, Y., Tegmark, M., Soljačić, M., & Bengio, Y.
(2017). Gated orthogonal recurrent units: On learning to forget. arXiv:1706.02761.
Jordan, M. (1986). Attractor dynamics and parallelism in a connectionist sequential
machine. In Proceedings of the Annual Conference of the Cognitive Science Society (pp.
531–546). Piscataway, NJ: IEEE.
Jozefowicz, R., Zaremba, W., & Sutskever, I. (2015). An empirical exploration of re-
current network architectures. In Proceedings of the International Conference on In-
ternational Conference on Machine Learning (pp. 2342–2350). New York: ACM.
Kalchbrenner, N., Danihelka, I., & Graves, A. (2015). Grid long short-term memory.
arXiv:1507.01526.
Karpathy, A., Johnson, J., & Li, F. F. (2015). Visualizing and understanding recurrent
networks. arXiv:1506.02078.
Khan, S., & Yairi, T. (2018). A review on the application of deep learning in system
health management. Mechanical Systems and Signal Processing, 107, 241–265.
Kim, J., El-Khamy, M., & Lee, J. (2017). Residual LSTM: Design of a deep recurrent ar-
chitecture for distant speech recognition. arXiv:1701.03360.
Koutnik, J., Greff, K., Gomez, F., & Schmidhuber, J. (2014). A clockwork RNN.
arXiv:1402.3511v1.
Krause, B., Lu, L., Murray, I., & Renals, S. (2016). Multiplicative LSTM for sequence
modelling. arXiv:1609.07959.
Krumm, J., & Horvitz, E. (2015). Eyewitness: Identifying local events via space-time
signals in Twitter feeds. In Proceedings of the 23rd SIGSPATIAL International Con-
ference on Advances in Geographic Information Systems. New York: ACM.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., &
Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recogni-
tion. Neural Computation, 1(4), 541–551.
Liang, X., Lin, L., Shen, X., Feng, J., Yan, S., & Xing, E. P. (2017). Interpretable
structure-evolving LSTM. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (pp. 2171–2184). Piscataway, NJ: IEEE.
Liang, X., Liu, S., Shen, X., Yang, J., Liu, L., Dong, J., & Yan, S. (2015). Deep human
parsing with active template regression. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 37(12), 2402.
Liang, X., Shen, X., Feng, J., Lin, L., & Yan, S. (2016). Semantic object parsing with
graph LSTM. In Proceedings of the European Conference on Computer Vision (pp. 125–
143). Berlin: Springer.
Liang, X., Shen, X., Xiang, D., Feng, J., Lin, L., & Yan, S. (2016). Semantic object pars-
ing with local-global long short-term memory. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (pp. 3185–3193). Piscataway, NJ: IEEE.
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., & Zitnick, C. L.
(2014). Microsoft COCO: Common objects in context. In Proceedings of the European
Conference on Computer Vision (pp. 740–755). Berlin: Springer.
Lipton, Z. C., Berkowitz, J., & Elkan, C. (2015). A critical review of recurrent neural
networks for sequence learning. arXiv:1506.00019.
Liu, P., Qiu, X., Chen, J., & Huang, X. (2016). Deep fusion LSTMs for text semantic
matching. In Proceedings of the 54th Annual Meeting of the Association for Computa-
tional Linguistics (pp. 1034–1043). Stroudsburg, PA: ACL.
Liu, P., Qiu, X., & Huang, X. (2016). Modelling interaction of sentence pair with coupled-
LSTMs. arXiv:1605.05573.
Liu, Q., Zhou, F., Hang, R., & Yuan, X. (2017). Bidirectional-convolutional LSTM
based spectral-spatial feature learning for hyperspectral image classification. Re-
mote Sensing.
Niu, Z., Zhou, M., Wang, L., Gao, X., & Hua, G. (2017). Hierarchical multimodal
LSTM for dense visual-semantic embedding. In Proceedings of the IEEE Interna-
tional Conference on Computer Vision (pp. 1899–1907). Piscataway, NJ: IEEE.
Oord, A. V. D., Kalchbrenner, N., & Kavukcuoglu, K. (2016). Pixel recurrent neural
networks. arXiv:1601.06759.
Palangi, H., Deng, L., Shen, Y., Gao, J., He, X., Chen, J., & Ward, R. (2015). Deep
sentence embedding using the long short-term memory network: Analysis and
application to information retrieval. IEEE/ACM Transactions on Audio Speech and
Language Processing, 24(4), 694–707.
Pearlmutter, B. A. (1989). Learning state space trajectories in recurrent neural net-
works. Neural Computation, 1(2), 263–269.
Peng, Z., Zhang, R., Liang, X., Liu, X., & Lin, L. (2016). Geometric scene parsing with
hierarchical LSTM. In Proceedings of the International Joint Conference on Artificial
Intelligence (pp. 3439–3445). Palo Alto, CA: AAAI Press.
Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., & Lazeb-
nik, S. (2017). Flickr30k entities: Collecting region-to-phrase correspondences for
richer image-to-sentence models. International Journal of Computer Vision, 123(1),
74–93.
Pulver, A., & Lyu, S. (2017). LSTM with working memory. In Proceedings of the Inter-
national Joint Conference on Neural Networks (pp. 845–851). Piscataway, NJ: IEEE.
Qu, Z., Haghani, P., Weinstein, E., & Moreno, P. (2017). Syllable-based acoustic mod-
eling with CTC-SMBR-LSTM. In Proceedings of the IEEE Automatic Speech Recogni-
tion and Understanding Workshop. Piscataway, NJ: IEEE.
Ranzato, M. A., Szlam, A., Bruna, J., Mathieu, M., Collobert, R., & Chopra, S.
(2014). Video (language) modeling: A baseline for generative models of natural videos.
arXiv:1412.6604.
Rawat, W., & Wang, Z. (2017). Deep convolutional neural networks for image clas-
sification: A comprehensive review. Neural Computation, 29(9), 1–10.
Robinson, A. J., & Fallside, F. (1987). The utility driven dynamic error propagation net-
work. Cambridge: University of Cambridge Department of Engineering.
Sainath, T. N., Vinyals, O., Senior, A., & Sak, H. (2015). Convolutional, long short-
term memory, fully connected deep neural networks. In IEEE International Con-
ference on Acoustics, Speech and Signal Processing (pp. 4580–4584). Piscataway, NJ:
IEEE.
Sak, H. I., Senior, A., & Beaufays, F. O. (2014). Long short-term memory based recur-
rent neural network architectures for large vocabulary speech recognition. arXiv:1402.
1128v1.
Saleh, K., Hossny, M., & Nahavandi, S. (2018). Intent prediction of vulnerable road
users from motion trajectories using stacked LSTM network. In Proceedings of the
IEEE International Conference on Intelligent Transportation Systems (pp. 327–332).
Piscataway, NJ: IEEE.
Schlag, I., & Schmidhuber, J. (2017). Gated fast weights for on-the-fly neural program
generation. In NIPS Workshop on Meta-Learning.
Schmidhuber, J. (1993). Reducing the ratio between learning complexity and number
of time varying variables in fully recurrent nets. In Proceedings of the International
Conference on Neural Networks (pp. 460–463). London: Springer.
Schmidhuber, J. (2012). Self-delimiting neural networks. arXiv:1210.0118.
Schmidhuber, J., Wierstra, D., Gagliolo, M., & Gomez, F. (2007). Training recurrent
networks by evolino. Neural Computation, 19(3), 757–779.
Schneider, N., & Gavrila, D. M. (2013). Pedestrian path prediction with recursive
Bayesian filters: A comparative study. In Proceedings of the German Conference on
Pattern Recognition (pp. 174–183). Berlin: Springer.
Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE
Transactions on Signal Processing, 45(11), 2673–2681.
Shabanian, S., Arpit, D., Trischler, A., & Bengio, Y. (2017). Variational Bi-LSTMs.
arXiv:1711.05717.
Sharma, P., & Singh, A. (2017). Era of deep neural networks: A review. In Proceed-
ings of the 8th International Conference on Computing, Communication and Networking
Technologies (pp. 1–5). Piscataway, NJ: IEEE.
Shi, X., Chen, Z., Wang, H., Woo, W. C., Woo, W. C., & Woo, W. C. (2015). Convolu-
tional LSTM Network: A machine learning approach for precipitation nowcast-
ing. In C. Cortes, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural
information processing systems (pp. 802–810). Red Hook, NY: Curran.
Song, J., Tang, S., Xiao, J., Wu, F., & Zhang, Z. (2016). LSTM-in-LSTM for generating
long descriptions of images. Computational Visual Media, 2(4), 1–10.
Sperduti, A., & Starita, A. (1997). Supervised neural networks for the classification
of structures. IEEE Transactions on Neural Networks, 8(3), 714–735.
Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway networks. arXiv:1505.
00387.
Šter, B. (2013). Selective recurrent neural network. Neural Processing Letters, 38(1),
1–15.
Sun, G. (1990). Connectionist pushdown automata that learn context-free grammars.
In Proceedings of the International Joint Conference on Neural Networks (pp. 577–580).
Piscataway, NJ: IEEE.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with
neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K.
Q. Weinberger (Eds.), Advances in neural information processing systems (pp. 3104–
3112). Red Hook, NY: Curran.
Tai, K. S., Socher, R., & Manning, C. D. (2015). Improved semantic representations
from tree-structured long short-term memory networks. Computer Science, 5(1),
36.
Teng, Z., & Zhang, Y. (2016). Bidirectional tree-structured LSTM with head lexicalization.
arXiv:1611.06788.
Thireou, T., & Reczko, M. (2007). Bidirectional long short-term memory networks for
predicting the subcellular localization of eukaryotic proteins. IEEE/ACM Transac-
tions on Computational Biology and Bioinformatics, 4(3), 441–446.
Veeriah, V., Zhuang, N., & Qi, G. J. (2015). Differential recurrent neural networks for
action recognition. In Proceedings of the IEEE International Conference on Computer
Vision (pp. 4041–4049). Piscataway, NJ: IEEE.
Vohra, R., Goel, K., & Sahoo, J. K. (2015). Modeling temporal dependencies in data
using a DBN-LSTM. In Proceedings of the IEEE International Conference on Data Sci-
ence and Advanced Analytics (pp. 1–4). Piscataway, NJ: IEEE.
Wang, J., & Yuille, A. (2015). Semantic part segmentation using compositional model
combining shape and appearance. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition (pp. 1788–1797). Piscataway, NJ: IEEE.
Wei, H., Zhou, H., Sankaranarayanan, J., Sengupta, S., & Samet, H. (2018). Residual
convolutional LSTM for tweet count prediction. In Companion of the Web Confer-
ence 2018 (pp. 1309–1316). Geneva: International World Wide Web Conferences
Steering Committee.
Weiss, G., Goldberg, Y., & Yahav, E. (2018). On the practical computational power of finite
precision RNNs for language recognition. arXiv:1805.04908.
Weng, J. J., Ahuja, N., & Huang, T. S. (1993, May). Learning recognition and seg-
mentation of 3D objects from 2D images. In Proceedings of the Fourth International
Conference on Computer Vision (pp. 121–128). Piscataway, NJ: IEEE.
Werbos, P. J. (1988). Generalization of backpropagation with application to a recur-
rent gas market model. Neural Networks, 1(4), 339–356.
Williams, R. J. (1989). Complexity of exact gradient computation algorithms for recurrent
neural networks (Technical Report NU-CCS-89-27). Boston: Northeastern Univer-
sity, College of Computer Science.
Wu, H., Zhang, J., & Zong, C. (2016). An empirical exploration of skip connections for
sequential tagging. arXiv:1610.03167.
Xie, X., & Shi, Y. (2018). Long-term memory neural Turing machines. Computer Sci-
ence and Application, 8(1), 49–58.
Yao, K., Cohn, T., Vylomova, K., Duh, K., & Dyer, C. (2015). Depth-gated LSTM.
arXiv:1508.03790v4.
Yu, B., Xu, Q., & Zhang, P. (2018). Question classification based on MAC-LSTM. In
Proceedings of the IEEE Third International Conference on Data Science in Cyberspace.
Piscataway, NJ: IEEE.
Zaremba, W., & Sutskever, I. (2014). Learning to execute. arXiv:1410.4615.
Zhang, J., Zheng, Y., & Qi, D. (2016). Deep spatio-temporal residual networks for
citywide crowd flows prediction. In Proceedings of the IEEE Conference on Acoustics,
Speech, and Signal Processing (pp. 1655–1661). Piscataway, NJ: IEEE.
Zhang, X., Lu, L., & Lapata, M. (2015). Tree recurrent neural networks with application
to language modeling. arXiv:1511.00060v1.
Zhang, Y., Chen, G., Yu, D., Yao, K., Khudanpur, S., & Glass, J. (2015). Highway long
short-term memory RNNs for distant speech recognition. In Proceedings of the
IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 5755–
5759). Piscataway, NJ: IEEE.
Zhao, R., Wang, J., Yan, R., & Mao, K. (2016). Machine health monitoring with LSTM
networks. In Proceedings of the 10th International Conference on Sensing Technology
(pp. 1–6). Piscataway, NJ: IEEE.
Zhou, C., Sun, C., Liu, Z., & Lau, F. C. M. (2016). A C-LSTM neural network for text
classification. Computer Science, 1(4), 39–44.
Zhou, G., Wu, J., Zhang, C., & Zhou, Z. (2016). Minimal gated unit for recurrent
neural networks. International Journal of Automation and Computing, 13(3), 226–234.
Zhu, G., Zhang, L., Shen, P., & Song, J. (2017). Multimodal gesture recognition using
3D convolution and convolutional LSTM. IEEE Access, 5, 4517–4524.
Zhu, X., Sobhani, P., & Guo, H. (2015). Long short-term memory over tree structures.
arXiv:1503.04881.