
Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN

Shuai Li*, Wanqing Li*, Chris Cook*, Ce Zhu†, Yanbo Gao†
* School of Computing and Information Technology, University of Wollongong
† School of Electronic Engineering, University of Electronic Science and Technology of China
{sl669,wanqing,ccook}@uow.edu.au, [email protected], [email protected]

arXiv:1803.04831v2 [cs.CV] 15 May 2018

Abstract

Recurrent neural networks (RNNs) have been widely used for processing sequential data. However, RNNs are commonly difficult to train due to the well-known gradient vanishing and exploding problems and hard to learn long-term patterns. Long short-term memory (LSTM) and gated recurrent unit (GRU) were developed to address these problems, but the use of hyperbolic tangent and sigmoid activation functions results in gradient decay over layers. Consequently, construction of an efficiently trainable deep network is challenging. In addition, all the neurons in an RNN layer are entangled together and their behaviour is hard to interpret. To address these problems, a new type of RNN, referred to as independently recurrent neural network (IndRNN), is proposed in this paper, where neurons in the same layer are independent of each other and they are connected across layers. We have shown that an IndRNN can be easily regulated to prevent the gradient exploding and vanishing problems while allowing the network to learn long-term dependencies. Moreover, an IndRNN can work with non-saturated activation functions such as relu (rectified linear unit) and still be trained robustly. Multiple IndRNNs can be stacked to construct a network that is deeper than the existing RNNs. Experimental results have shown that the proposed IndRNN is able to process very long sequences (over 5000 time steps), can be used to construct very deep networks (21 layers used in the experiment) and still be trained robustly. Better performance has been achieved on various tasks by using IndRNNs compared with the traditional RNN and LSTM. The code is available at https://github.com/Sunnydreamrain/IndRNN_Theano_Lasagne.

1. Introduction

Recurrent neural networks (RNNs) [17] have been widely used in sequence learning problems such as action recognition [8], scene labelling [4] and language processing [5], and have achieved impressive results. Compared with feed-forward networks such as the convolutional neural networks (CNNs), an RNN has a recurrent connection where the last hidden state is an input to the next state. The update of states can be described as follows:

ht = σ(Wxt + Uht−1 + b) (1)

where xt ∈ R^M and ht ∈ R^N are the input and hidden state at time step t, respectively. W ∈ R^{N×M}, U ∈ R^{N×N} and b ∈ R^N are the weights for the current input and the recurrent input, and the bias of the neurons. σ is an element-wise activation function of the neurons, and N is the number of neurons in this RNN layer.
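As a concrete reference for (1), the following is a minimal NumPy sketch of one update step. It is only an illustration of the equation, not the authors' implementation; the shapes follow the definitions above and the random parameters are purely for demonstration.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b, sigma=np.tanh):
    """One step of the vanilla RNN update in (1): h_t = sigma(W x_t + U h_{t-1} + b).

    x_t:    (M,)   input at time step t
    h_prev: (N,)   hidden state at time step t-1
    W: (N, M) input weight, U: (N, N) recurrent weight, b: (N,) bias
    """
    return sigma(W @ x_t + U @ h_prev + b)

# Toy example with M = 4 inputs and N = 3 neurons over a 10-step sequence.
rng = np.random.default_rng(0)
M, N = 4, 3
W, U, b = rng.normal(size=(N, M)), rng.normal(size=(N, N)), np.zeros(N)
h = np.zeros(N)
for x in rng.normal(size=(10, M)):
    h = rnn_step(x, h, W, U, b)
```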

Training of the RNNs suffers from the gradient vanishing and exploding problem due to the repeated multiplication of the recurrent weight matrix. Several RNN variants such as the long short-term memory (LSTM) [10, 18] and the gated recurrent unit (GRU) [5] have been proposed to address the gradient problems. However, the use of the hyperbolic tangent and the sigmoid functions as the activation function in these variants results in gradient decay over layers. Consequently, construction and training of a deep LSTM- or GRU-based RNN network is practically difficult. By contrast, existing CNNs using a non-saturated activation function such as relu can be stacked into a very deep network (e.g. over 20 layers using the basic convolutional layers and over 100 layers with residual connections [12]) and still be trained efficiently. Although residual connections have been attempted for LSTM models in several works [50, 42], there has been no significant improvement (mostly because the gradient decays in LSTM with the use of the hyperbolic tangent and the sigmoid functions as mentioned above).

Moreover, the existing RNN models share the same component σ(Wxt + Uht−1 + b) in (1), where the recurrent connection entangles all the neurons. This makes it hard to interpret and understand the roles of the trained neurons (e.g., what patterns each neuron responds to), since the simple visualization of the outputs of individual neurons [19] is hard to ascertain the function of one neuron without considering the others.

In this paper, a new type of RNN, referred to as independently recurrent neural network (IndRNN), is proposed. In the proposed IndRNN, the recurrent inputs are processed with the Hadamard product as ht = σ(Wxt + u ⊙ ht−1 + b). This provides a number of advantages over the traditional RNN, including:

• The gradient backpropagation through time can be regulated to effectively address the gradient vanishing and exploding problems.

• Long-term memory can be kept with IndRNNs to process long sequences. Experiments have demonstrated that an IndRNN can well process sequences over 5000 steps while LSTM could only process less than 1000 steps.

• An IndRNN can work well with non-saturated functions such as relu as the activation function and be trained robustly.

• Multiple layers of IndRNNs can be efficiently stacked, especially with residual connections over layers, to increase the depth of the network. An example of a 21-layer IndRNN is demonstrated in the experiments for language modelling.

• The behaviour of IndRNN neurons in each layer is easy to interpret due to the independence of neurons in each layer.

Experiments have demonstrated that IndRNN performs much better than the traditional RNN and LSTM models on the tasks of the adding problem, sequential MNIST classification, language modelling and action recognition.

2. Related Work

To address the gradient exploding and vanishing problems in RNNs, variants of RNNs have been proposed and typical ones are the long short-term memory (LSTM) [14] and the gated recurrent unit (GRU) [5]. Both LSTM and GRU enforce a constant error flow over time steps and use gates on the input and the recurrent input to regulate the information flow through the network. However, the use of gates makes the computation not parallelizable and thus increases the computational complexity of the whole network. To process the states of the network over time in parallel, the recurrent connections are fixed in [3, 30]. While this strategy greatly simplifies the computational complexity, it reduces the capability of their RNNs since the recurrent connections are no longer trainable. In [1, 49], a unitary evolution RNN was proposed where the unitary recurrent weights are defined empirically. In this case, the norm of the backpropagated gradient can be bounded without exploding. By contrast, the proposed IndRNN solves the gradient exploding and vanishing problems without losing the power of trainable recurrent connections and without involving gate parameters.

In addition to changing the form of the recurrent neurons, works on initialization and training techniques, such as initializing the recurrent weights to a proper range or regulating the norm of the gradients over time, were also reported in addressing the gradient problems. In [28], an initialization technique was proposed for an RNN with relu activation, termed IRNN, which initializes the recurrent weight matrix to be the identity matrix and the bias to be zero. In [47], the recurrent weight matrix was further suggested to be a positive definite matrix with the highest eigenvalue of unity and all the remaining eigenvalues less than 1. In [38], the geometry of RNNs was investigated and a path-normalized optimization method for training was proposed for RNNs with relu activation. In [26], a penalty term on the squared distance between successive hidden states' norms was proposed to prevent the exponential growth of IRNN's activation. Although these methods help ease the gradient exploding, they are not able to completely avoid the problem (the eigenvalues of the recurrent weight matrix may still be larger than 1 in the process of training). Moreover, the training of an IRNN is very sensitive to the learning rate. When the learning rate is large, the gradient is likely to explode. The proposed IndRNN solves the gradient problems by making the neurons independent and constraining the recurrent weights. It can work with relu and be trained robustly. As a result, an IndRNN is able to process very long sequences (e.g. over 5000 steps as demonstrated in the experiments).
On the other hand, compared with the deep CNN architectures which can be over 100 layers, such as the residual CNN [12] and the pseudo-3D residual CNN (P3D) [43], most of the existing RNN architectures only consist of several layers (2 or 3 for example [25, 45, 28]). This is mostly due to the gradient vanishing and exploding problems which result in the difficulty of training a deep RNN. Since all the gate functions, input and output modulations in LSTM employ sigmoid or hyperbolic tangent functions as the activation function, it suffers from the gradient vanishing problem over layers when multiple LSTM layers are stacked into a deep model. Currently, a few models were reported that employ residual connections [12] between LSTM layers to make the network deeper [50]. However, as shown in [42], the deep LSTM model with the residual connections does not efficiently improve the performance. This may be partly due to the gradient decay over LSTM layers. On the contrary, for each time step, the proposed IndRNN with relu works in a similar way as a CNN. Multiple layers of IndRNNs can be stacked and be efficiently combined with residual connections, leading to a deep RNN.

3. Independently Recurrent Neural Network

In this paper, we propose an independently recurrent neural network (IndRNN). It can be described as:

ht = σ(Wxt + u ⊙ ht−1 + b) (2)

where the recurrent weight u is a vector and ⊙ represents the Hadamard product. Each neuron in one layer is independent from the others, and connection between neurons can be achieved by stacking two or more layers of IndRNNs as presented later. For the n-th neuron, the hidden state hn,t can be obtained as

hn,t = σ(wn xt + un hn,t−1 + bn) (3)

where wn and un are the n-th row of the input weight and recurrent weight, respectively. Each neuron only receives information from the input and its own hidden state at the previous time step. That is, each neuron in an IndRNN deals with one type of spatial-temporal pattern independently. Conventionally, an RNN is treated as multiple layer perceptrons over time where the parameters are shared. Different from the conventional RNNs, the proposed IndRNN provides a new perspective of recurrent neural networks as independently aggregating spatial patterns (i.e. through w) over time (i.e. through u). The correlation among different neurons can be exploited by stacking two or multiple layers. In this case, each neuron in the next layer processes the outputs of all the neurons in the previous layer.
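For concreteness, a minimal NumPy sketch of one IndRNN layer following (2) and (3) is given below. It is an illustration rather than the released implementation: the recurrent term is an element-wise product, so neuron n only ever sees its own previous state, and stacking a second layer is what mixes the neurons.

```python
import numpy as np

def indrnn_forward(x_seq, W, u, b, sigma=lambda z: np.maximum(z, 0.0)):
    """Run one IndRNN layer over a sequence, following (2): h_t = sigma(W x_t + u * h_{t-1} + b).

    x_seq: (T, M) input sequence; W: (N, M); u, b: (N,) per-neuron recurrent weight and bias.
    Returns the (T, N) sequence of hidden states.
    """
    T = x_seq.shape[0]
    N = W.shape[0]
    h = np.zeros(N)
    out = np.empty((T, N))
    for t in range(T):
        # u * h is the Hadamard product: neuron n only uses its own h[n], as in equation (3).
        h = sigma(W @ x_seq[t] + u * h + b)
        out[t] = h
    return out

# Two stacked layers: cross-neuron correlation is captured by the second layer's input weight W2.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 8))                      # T = 50 steps, M = 8 inputs
W1, u1, b1 = rng.normal(size=(16, 8)), rng.uniform(-1, 1, 16), np.zeros(16)
W2, u2, b2 = rng.normal(size=(16, 16)), rng.uniform(-1, 1, 16), np.zeros(16)
h2 = indrnn_forward(indrnn_forward(x, W1, u1, b1), W2, u2, b2)
```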

The gradient backpropagation through time for an IndRNN and how it addresses the gradient vanishing and exploding problems are described in Subsection 3.1. Details on the exploration of cross-channel information are explained in Section 4. Different deeper and longer IndRNN network architectures are discussed in Subsection 4.1.

3.1. Backpropagation Through Time for An IndRNN

For the gradient backpropagation through time in each layer, the gradients of an IndRNN can be calculated independently for each neuron since there are no interactions among them in one layer. For the n-th neuron hn,t = σ(wn xt + un hn,t−1), where the bias is ignored, suppose the objective to minimize at time step T is Jn. Then the gradient backpropagated to the time step t is

∂Jn/∂hn,t = (∂Jn/∂hn,T)(∂hn,T/∂hn,t) = (∂Jn/∂hn,T) ∏_{k=t}^{T−1} (∂hn,k+1/∂hn,k) = (∂Jn/∂hn,T) ∏_{k=t}^{T−1} σ′n,k+1 un = (∂Jn/∂hn,T) un^{T−t} ∏_{k=t}^{T−1} σ′n,k+1 (4)

where σ′n,k+1 is the derivative of the element-wise activation function. It can be seen that the gradient only involves the exponential term of a scalar value un, which can be easily regulated, and the gradient of the activation function, which is often bounded in a certain range. Compared with the gradients of an RNN ((∂J/∂hT) ∏_{k=t}^{T−1} diag(σ′(hk+1)) U^T, where diag(σ′(hk+1)) is the Jacobian matrix of the element-wise activation function), the gradient of an IndRNN directly depends on the value of the recurrent weight (which is changed by a small magnitude according to the learning rate) instead of a matrix product (which is mainly determined by its eigenvalues and can be changed significantly even though the change to each matrix entry is small [39]). Thus the training of an IndRNN is more robust than that of a traditional RNN. To solve the gradient exploding and vanishing problem over time, we only need to regulate the exponential term un^{T−t} ∏_{k=t}^{T−1} σ′n,k+1 to an appropriate range. This is further explained in the following, together with keeping long and short memory in an IndRNN.

To keep long-term memory in a network, the current state (at time step t) should still be able to effectively influence the future state (at time step T) after a large time interval. Consequently, the gradient at time step T can be effectively propagated to the time step t. By assuming that the minimum effective gradient is ε, a range for the recurrent weight of an IndRNN neuron in order to keep long-term memory can be obtained. Specifically, to keep a memory of T − t time steps, |un| ∈ [ (ε / ∏_{k=t}^{T−1} σ′n,k+1)^{1/(T−t)}, +∞) according to (4) (ignoring the gradient backpropagated from the objective at time step T). That is, to avoid gradient vanishing for a neuron, the above constraint should be met. In order to avoid the gradient exploding problem, the range needs to be further constrained to |un| ∈ [ (ε / ∏_{k=t}^{T−1} σ′n,k+1)^{1/(T−t)}, (γ / ∏_{k=t}^{T−1} σ′n,k+1)^{1/(T−t)} ], where γ is the largest gradient value without exploding. For the commonly used activation functions such as relu and tanh, their derivatives are no larger than 1, i.e., |σ′n,k+1| ≤ 1. Especially for relu, its gradient is either 0 or 1. Considering that short-term memories can be important for the performance of the network as well, especially for a multiple-layer RNN, the constraint on the range of the recurrent weight with relu activation can be relaxed to |un| ∈ [0, γ^{1/(T−t)}]. When the recurrent weight is 0, the neuron only uses the information from the current input without keeping any memory from the past. In this way, different neurons can learn to keep memory of different lengths. Note that the regulation on the recurrent weight u is different from the gradient clipping technique. For gradient clipping or gradient norm clipping [41], the calculated gradient has already exploded and is forced back to a predefined range. The gradients for the following steps may keep exploding. In this case, the gradient of the other layers relying on this neuron may not be accurate. On the contrary, the regulation proposed here essentially maintains the gradient in an appropriate range without affecting the gradient backpropagated through this neuron.
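In practice, the regulation above amounts to clipping each recurrent weight back into the allowed range after every update. The sketch below assumes the relu case (|σ′| ≤ 1) and the relaxed range |un| ∈ [0, γ^{1/(T−t)}]; the values of γ and the memory length are hyperparameters and purely illustrative.

```python
import numpy as np

def indrnn_recurrent_bound(gamma, time_steps):
    """Largest allowed |u_n| so that |u_n| ** time_steps <= gamma (relu case, |sigma'| <= 1)."""
    return gamma ** (1.0 / time_steps)

def constrain_recurrent_weights(u, gamma=2.0, time_steps=100):
    """Clip per-neuron recurrent weights into [-bound, bound] after a gradient update,
    which enforces |u_n| <= gamma ** (1 / time_steps). gamma=2.0 and time_steps=100 are
    example values only; they should be chosen for the longest dependency of the task."""
    bound = indrnn_recurrent_bound(gamma, time_steps)
    return np.clip(u, -bound, bound)

u = np.random.default_rng(0).uniform(-1.5, 1.5, size=128)
u = constrain_recurrent_weights(u, gamma=2.0, time_steps=100)
```

The same idea appears later in the adding-problem experiments, where the recurrent weights are kept within (0, 2^{1/T}).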
Figure 1. Illustration of (a) the basic IndRNN architecture and (b) the residual IndRNN architecture.

4. Multiple-layer IndRNN

As mentioned above, neurons in the same IndRNN layer are independent of each other, and cross-channel information over time is explored through multiple layers of IndRNNs. To illustrate this, we compare a two-layer IndRNN with a traditional single-layer RNN. For simplicity, the bias term is ignored for both the IndRNN and the traditional RNN. Assume a simple N-neuron two-layer network where the recurrent weights for the second layer are zero, which means the second layer is just a fully connected layer shared over time. The Hadamard product (u ⊙ ht−1) can be represented in the form of a matrix product by diag(u1, u2, . . . , uN)ht−1. In the following, diag(u1, u2, . . . , uN) is shortened as diag(ui). Assume that the activation function is a linear function σ(x) = x. The first and second layers of a two-layer IndRNN can be represented by (5) and (6), respectively.

hf,t = Wf xf,t + diag(ufi) hf,t−1 (5)

hs,t = Ws hf,t (6)

Assuming Ws is invertible, then

Ws^{−1} hs,t = Wf xf,t + diag(ufi) Ws^{−1} hs,t−1 (7)

Thus

hs,t = Ws Wf xf,t + Ws diag(ufi) Ws^{−1} hs,t−1 (8)

By assigning U = Ws diag(ufi) Ws^{−1} and W = Ws Wf, it becomes

ht = Wxt + Uht−1 (9)

which is a traditional RNN. Note that this only imposes the constraint that the recurrent weight (U) is diagonalizable. Therefore, the simple two-layer IndRNN network can represent a traditional RNN network with a diagonalizable recurrent weight (U). In other words, under linear activation, a traditional RNN with a diagonalizable recurrent weight (U) is a special case of a two-layer IndRNN where the recurrent weight of the second layer is zero and the input weight of the second layer is invertible.

It is known that a non-diagonalizable matrix can be made diagonalizable with a perturbation matrix composed of small entries. A stable RNN network needs to be robust to small perturbations (in order to deal with precision errors, for example). It is possible to find an RNN network with a diagonalizable recurrent weight matrix to approximate a stable RNN network with a non-diagonalizable recurrent weight matrix. Therefore, a traditional RNN with a linear activation is a special case of a two-layer IndRNN. For a traditional RNN with a nonlinear activation function, its relationship with the proposed IndRNN is yet to be established theoretically. However, we have shown empirically that the proposed IndRNN can achieve better performance than a traditional RNN with a nonlinear activation function.
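The equivalence in (5)-(9) is easy to verify numerically. The sketch below builds a two-layer linear IndRNN with an invertible second-layer weight and checks that it matches the traditional RNN with U = Ws diag(ufi) Ws^{−1}; it is an illustrative construction, not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, T = 4, 3, 6
Wf = rng.normal(size=(N, M))          # first-layer input weight
uf = rng.uniform(-0.9, 0.9, size=N)   # first-layer per-neuron recurrent weights
Ws = rng.normal(size=(N, N))          # second layer: fully connected, shared over time (assumed invertible)

U = Ws @ np.diag(uf) @ np.linalg.inv(Ws)   # equations (8)/(9)
W = Ws @ Wf

x = rng.normal(size=(T, M))
hf = np.zeros(N)                      # first IndRNN layer state
h_rnn = np.zeros(N)                   # state of the equivalent traditional RNN
for t in range(T):
    hf = Wf @ x[t] + uf * hf          # equation (5): element-wise recurrence
    hs = Ws @ hf                      # equation (6): second layer output
    h_rnn = W @ x[t] + U @ h_rnn      # equation (9)
print(np.allclose(hs, h_rnn))         # True up to floating-point error
```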
Regarding the number of parameters, for an N-neuron RNN network with input of dimension M, the number of parameters in a traditional RNN is M × N + N × N, while the number of parameters using a one-layer IndRNN is M × N + N. For a two-layer IndRNN where both layers consist of N neurons, the number of parameters is M × N + N × N + 2 × N, which is of a similar order to the traditional RNN.

In all, the cross-channel information can be well explored with a multiple-layer IndRNN although IndRNN neurons are independent of each other in each layer.

4.1. Deeper and Longer IndRNN Architectures

In the proposed IndRNN, the processing of the input (Wxt + b) is independent at different timesteps and can
be implemented in parallel as in [3, 30]. The proposed IndRNN can be extended to a convolutional IndRNN where, instead of processing the input of each time step using a fully connected weight (Wxt), it is processed with a convolutional operation (W ∗ xt, where ∗ denotes the convolution operator).

The basic IndRNN architecture is shown in Fig. 1(a), where "Weight" and "Recurrent+ReLU" denote the processing of the input and the recurrent process at each step with relu as the activation function. By stacking this basic architecture, a deep IndRNN network can be constructed. Compared with an LSTM-based architecture using the sigmoid and hyperbolic tangent functions, which decay the gradient over layers, a non-saturated activation function such as relu reduces the gradient vanishing problem over layers. In addition, batch normalization, denoted as "BN", can also be employed in the IndRNN network before or after the activation function as shown in Fig. 1(a).
Since the weight layer (Wxt + b) is used to process the input, it is natural to extend it to multiple layers to deepen the processing. Also, the layers used to process the input can be of the residual structures in the same way as in CNN [12]. With the simple structure of IndRNN, it is very easy to extend it to different network architectures. For example, in addition to simply stacking IndRNNs or stacking the layers for processing the input, IndRNNs can also be stacked in the form of residual connections. Fig. 1(b) shows an example of a residual IndRNN based on the "pre-activation" type of residual layers in [13]. At each time step, the gradient can be directly propagated to the other layers from the identity mapping. Since IndRNN addresses the gradient exploding and vanishing problems over time, the gradient can be efficiently propagated over different time steps. Therefore, the network can be substantially deeper and longer. The deeper and longer IndRNN network can be trained end-to-end similarly to other networks.
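A schematic sketch of the residual block of Fig. 1(b) is given below, composing two (BN, Recurrent+ReLU, Weight) stages around an identity skip. It is a reading of the figure rather than the exact training code: batch normalization is replaced by an identity function for brevity, and the layer sizes are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def ind_recurrence(x_seq, u):
    """The 'Recurrent+ReLU' step applied over a (T, N) sequence: h_t = relu(x_t + u * h_{t-1})."""
    h = np.zeros(x_seq.shape[1])
    out = np.empty_like(x_seq)
    for t, x in enumerate(x_seq):
        h = relu(x + u * h)
        out[t] = h
    return out

def residual_indrnn_block(x_seq, params, norm=lambda z: z):
    """Pre-activation style residual block: two (norm, Recurrent+ReLU, Weight) stages plus an
    identity skip. `norm` stands in for frame-wise batch normalization."""
    (u1, W1), (u2, W2) = params
    y = ind_recurrence(norm(x_seq), u1) @ W1.T
    y = ind_recurrence(norm(y), u2) @ W2.T
    return x_seq + y            # the identity mapping carries the gradient across layers

rng = np.random.default_rng(0)
N, T = 32, 20
params = [(rng.uniform(-1, 1, N), rng.normal(scale=0.1, size=(N, N))) for _ in range(2)]
out = residual_indrnn_block(rng.normal(size=(T, N)), params)
```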
5. Experiments

In this section, evaluation of the proposed IndRNN on various tasks is presented.

5.1. Adding Problem

The adding problem [14, 1] is commonly used to evaluate the performance of RNN models. Two sequences of length T are taken as input. The first sequence is uniformly sampled in the range (0, 1) while the second sequence consists of two entries being 1 and the rest being 0. The output is the sum of the two entries in the first sequence indicated by the two entries of 1 in the second sequence. Three different lengths of sequences, T = 100, 500 and 1000, were used for the experiments to show whether the tested models have the ability to model long-term memory.
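The inputs described above can be generated as in the sketch below. Placing one marker in each half of the sequence is a common convention for this task and an assumption here, since the text only states that two entries are set to 1.

```python
import numpy as np

def adding_problem_batch(batch_size, T, rng):
    """Generate one batch of the adding problem: each step carries a uniform(0, 1) value and a
    0/1 marker; the target is the sum of the two marked values."""
    values = rng.uniform(0.0, 1.0, size=(batch_size, T))
    markers = np.zeros((batch_size, T))
    for i in range(batch_size):
        a = rng.integers(0, T // 2)        # one marker in the first half (assumed convention)
        b = rng.integers(T // 2, T)        # one marker in the second half
        markers[i, a] = markers[i, b] = 1.0
    inputs = np.stack([values, markers], axis=-1)   # (batch, T, 2)
    targets = np.sum(values * markers, axis=1)      # (batch,)
    return inputs, targets

x, y = adding_problem_batch(32, 100, np.random.default_rng(0))
```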
The RNN models included in the experiments for comparison are the traditional RNN with tanh, LSTM, and IRNN (RNN with relu). The proposed IndRNN was evaluated with the relu activation function. Since GRU achieved similar performance to LSTM [18], it is not included in the report. RNN, LSTM, and IRNN are all one layer while the IndRNN model is two layers. 128 hidden units were used for all the models, and the numbers of parameters for RNN, LSTM and the two-layer IndRNN are 16K, 67K and 17K, respectively. It can be seen that the two-layer IndRNN has a comparable number of parameters to that of the one-layer RNN, while many more parameters are needed for LSTM. As discussed in Subsection 3.1, the recurrent weight is constrained in the range of |un| ∈ (0, 2^{1/T}) for the IndRNN.

Mean squared error (MSE) was used as the objective function and the Adam optimization method [24] was used for training. The baseline performance (predicting 1 as the output regardless of the input sequence) is a mean squared error of 0.167 (the variance of the sum of two independent uniform distributions). The initial learning rate was set to 2×10−3 for models with tanh activation and to 2×10−4 for models with relu activation. However, as the length of the sequence increases, the IRNN model does not converge and thus a smaller initial learning rate (10−5) was used. The learning rate was reduced by a factor of 10 every 20K training steps. The training data and testing data were all generated randomly throughout the experiments, different from [1] which only used a set of randomly pre-generated data.

The results are shown in Fig. 2(a), 2(b) and 2(c). First, for short sequences (T = 100), most of the models (except RNN with tanh) performed well as they converged to a very small error (much smaller than the baseline). When the length of the sequences increases, the IRNN and LSTM models have difficulties in converging, and when the sequence length reaches 1000, IRNN and LSTM cannot minimize the error any more. However, the proposed IndRNN can still converge to a small error very quickly. This indicates that the proposed IndRNN can model a longer-term memory than the traditional RNN and LSTM.

From the figures, it can also be seen that the traditional RNN and LSTM can only keep a mid-range memory (about 500 - 1000 time steps). To evaluate the proposed IndRNN model for very long-term memory, experiments on sequences with length 5000 were conducted, where the result is shown in Fig. 2(d). It can be seen that IndRNN can still model it very well. Note that the noise in the result of IndRNN is because the initial learning rate (2 × 10−4) was relatively large, and once the learning rate dropped, the performance became robust. This demonstrates that IndRNN can effectively address the gradient exploding and vanishing problem over time and keep a long-term memory.

Figure 2. Results of the adding problem for different sequence lengths. The legends for all figures are the same and thus only shown in (a).

5.1.1 Analysis of Neurons' Behaviour

In the proposed IndRNN, neurons in each layer are independent of each other, which allows analysis of each neuron's behaviour without considering the effect coming from other neurons. Fig. 3(a) and 3(b) show the activation of the neurons in the first and second layers, respectively, for one random input with sequence length 5000. It can be seen that neurons in the first layer mainly pick up the information of the numbers to be added, where the strong responses correspond to the locations to be summed indicated by the sequence. It can be regarded as reducing noise, i.e., reducing the effect of other non-useful inputs in the sequence. For the second layer, one neuron aggregates inputs to long-term memory while others generally preserve their own state or process short-term memory which may not be useful in the testing case (since only the hidden state of the last time step is used as output). From this result, we conjecture that only one neuron is needed in the second layer to model the adding problem. Moreover, since neurons in the second layer are independent from each other, one neuron can still work with the others removed (which is not possible for the traditional RNN models).

To verify the above conjecture, an experiment was conducted where the first IndRNN layer is initialized with the trained weights and the second IndRNN layer only consists of one neuron initialized with the weight of the neuron that keeps the long-term memory. Accordingly, the final fully connected layer used for output is a neuron with only one input and one output, i.e., two scalar values including one weight parameter and one bias parameter. Only the final output layer was trained/fine-tuned in this experiment and the result is shown in Fig. 4. It can be seen that with only one IndRNN neuron in the second layer, the model is still able to model the adding problem very well for sequences of length 5000 as expected.
Figure 3. Neurons' behaviour in different layers of the proposed IndRNN for long sequences (5000 time steps) in the adding problem.

Figure 4. Result of the adding problem with just one neuron in the second layer for sequences of length 5000.

5.2. Sequential MNIST Classification

Sequential MNIST classification is another problem that is widely used to evaluate RNN models. The pixels of MNIST digits [29] are presented sequentially to the networks and classification is performed after reading all pixels. To make the task even harder, the permuted MNIST classification was also used, where the pixels are processed with a fixed random permutation. Since an RNN with tanh does not converge to a high accuracy (as reported in the literature [28]), only IndRNN with relu was evaluated. As explained in Section 4.1, IndRNN can be stacked into a deep network. Here we used a six-layer IndRNN, and each layer has 128 neurons. To accelerate the training, batch normalization is inserted after each layer. The Adam optimization was used with the initial learning rate 2 × 10−4, reduced by a factor of 10 every 600K training steps. The results are shown in Table 1 in comparison with the existing methods. It can be seen that IndRNN achieved better performance than the existing RNN models.

Table 1. Results (in terms of error rate (%)) for the sequential MNIST and permuted MNIST.
                                     MNIST    pMNIST
IRNN [28]                            5.0      18
uRNN [1]                             4.9      8.6
RNN-path [38]                        3.1      -
LSTM [1]                             1.8      12
LSTM+Recurrent dropout [44]          -        7.5
LSTM+Recurrent batchnorm [7]         -        4.6
LSTM+Zoneout [25]                    -        6.9
LSTM+Recurrent batchnorm+Zoneout     -        4.1
IndRNN (6 layers)                    1.0      4.0
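The pixel-by-pixel (and permuted) sequence construction used in this subsection can be sketched as below. The 28 × 28 image size is that of MNIST, and the random images only stand in for the real digits.

```python
import numpy as np

def to_pixel_sequences(images, permutation=None):
    """Turn a batch of 28x28 images into length-784 pixel sequences; applying a fixed random
    permutation to the pixel order gives the permuted-MNIST (pMNIST) variant."""
    seq = images.reshape(images.shape[0], 28 * 28, 1).astype(np.float32) / 255.0
    if permutation is not None:
        seq = seq[:, permutation, :]
    return seq

rng = np.random.default_rng(0)
fixed_perm = rng.permutation(28 * 28)                    # chosen once and reused for every image
fake_images = rng.integers(0, 256, size=(4, 28, 28))     # stand-in for real MNIST digits
seq_mnist = to_pixel_sequences(fake_images)
perm_mnist = to_pixel_sequences(fake_images, fixed_perm)
```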
5.3. Language Modeling

5.3.1 Char-level Penn Treebank

In this subsection, we evaluate the performance of the proposed IndRNN on the language modelling task using the character-level Penn Treebank (PTB-c) dataset. The test setting is similar to [7]. A six-layer IndRNN with 2000 hidden neurons is used for the test. To demonstrate that the IndRNN network can be very deep with the residual connections, a 21-layer residual IndRNN as shown in Fig. 1(b) in Subsection 4.1 was also adopted. The frame-wise batch normalization [27] is applied, and the batch size is set to 128. Adam was used for training with the initial learning rate set to 2 × 10−4 and dropped by a factor of 5 when performance on the validation set was no longer improving (with patience 20). Dropout [9] with dropping probabilities of 0.25 and 0.3 was used for the 6-layer IndRNN and the residual IndRNN, respectively. The sequences are non-overlapping and lengths T = 50 and T = 150 were both tested in training and testing.

The results are shown in Table 2 in comparison with the existing methods. Performance was evaluated using the bits per character (BPC) metric. It can be seen that the proposed IndRNN model achieved better performance than the traditional RNN and LSTM models. It can also be seen that with a deeper residual IndRNN, the performance can be further improved. Also, an improvement can be achieved with longer temporal dependencies (from time step 50 to 150) as shown in Table 2.

Table 2. Results of char-level PTB for our proposed IndRNN model in comparison with results reported in the literature, in terms of BPC.
                                         Test
RNN-tanh [26]                            1.55
RNN-relu [38]                            1.55
RNN-TRec [26]                            1.48
RNN-path [38]                            1.47
HF-MRNN [35]                             1.42
LSTM [25]                                1.36
LSTM+Recurrent dropout [44]              1.32
LSTM+Recurrent batchnorm [7]             1.32
LSTM+Zoneout [25]                        1.27
HyperLSTM + LN [11]                      1.25
Hierarchical Multiscale LSTM + LN [6]    1.24
Fast-slow LSTM [37]                      1.19
Neural Architecture Search [53]          1.21
IndRNN (6 layers, 50 steps)              1.26
IndRNN (6 layers, 150 steps)             1.23
res-IndRNN (21 layers, 50 steps)         1.21
res-IndRNN (11 layers*, 150 steps)       1.19
* Note that due to the limitation of GPU memory, an 11-layer residual IndRNN was used for time step 150 instead of 21 layers.
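For reference, BPC is simply the per-character cross-entropy expressed in base 2. Assuming the training objective reports the mean cross-entropy in nats (the usual convention, and an assumption here), the conversion is a one-liner:

```python
import numpy as np

def bits_per_character(mean_cross_entropy_nats):
    """Convert a mean per-character cross-entropy (in nats) into bits per character (BPC)."""
    return mean_cross_entropy_nats / np.log(2.0)

print(bits_per_character(0.87))   # e.g. a loss of 0.87 nats/char corresponds to about 1.26 BPC
```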
5.3.2 Word-level Penn Treebank

In this subsection, the performance of the proposed IndRNN on the word-level Penn Treebank dataset is evaluated. The test setting is similar to [25]. An 11-layer residual IndRNN was used for the test, and the weight tying [16] of the input embedding and the final output weight is also adopted. The frame-wise batch normalization [27] is applied, and the batch size is set to 128. Adam was used for training with the initial learning rate set to 5 × 10−4 and dropped by a factor of 5 when performance on the validation set was no longer improving (with patience 20). The sequences are non-overlapping and length T = 50 was used in training and testing. Dropout [9] with a dropping probability of 0.35 was used among IndRNN layers (including embedding) while 0.8 was used after the last IndRNN layer. The recurrent weights are initialized with Normal(0.4, 0.2), which makes the network start with learning more mid-range memory.

The results are shown in Table 3 in comparison with the existing methods. It can be seen that the proposed IndRNN model achieved better performance than most of the traditional RNN and LSTM models except the neural architecture search [53], which constructs new models while learning.

Table 3. Results of word-level PTB for our proposed IndRNN model in comparison with results reported in the literature, in terms of perplexity.
                                 Test
RNN-LDA + KN-5 + cache [36]      92.0
Deep RNN [40]                    107.5
CharCNN [23]                     78.9
LSTM [25]                        114.5
LSTM+Recurrent dropout [44]      87.0
LSTM+Zoneout [25]                77.4
LSTM+Variational Dropout [9]     73.4
Pointer Sentinel LSTM [34]       70.9
RHN [52]                         65.4
Neural Architecture Search [53]  62.4
res-IndRNN (11 layers)           65.3

5.4. Skeleton based Action Recognition

The NTU RGB+D dataset [45] was used for the skeleton based action recognition. This dataset is currently the largest action recognition dataset with skeleton modality. It contains 56880 sequences of 60 action classes, including Cross-Subject (CS) (40320 and 16560 samples for training and testing, respectively) and Cross-View (CV) (37920 and 18960 samples for training and testing, respectively) evaluation protocols [45]. In each evaluation protocol, 5% of the training data was used for evaluation as suggested in [45], and 20 frames were sampled from each instance as one input in the same way as in [32]. The joint coordinates of two subject skeletons were used as input. If only one is present, the second was set as zero. For this dataset, when multiple skeletons are present in the scene, the skeleton identity captured by the Kinect sensor may change over time. Therefore, an alignment process was first applied to keep the same skeleton saved in the same data array over time. A four-layer IndRNN and a six-layer IndRNN, each with 512 hidden neurons, were both tested. The batch size was 128 and the Adam optimization was used with the initial learning rate 2 × 10−4, decayed by 10 once the evaluation accuracy does not increase. Dropout [9] was applied after each IndRNN layer with dropping probabilities of 0.25 and 0.1 for the CS and CV settings, respectively.
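The input preparation described above (sampling 20 frames and concatenating the joint coordinates of two skeletons, zero-padding a missing second person) can be sketched as follows. The 25-joint Kinect skeleton and the evenly spaced frame sampling are assumptions made for illustration; the text does not specify either detail.

```python
import numpy as np

def skeleton_inputs(sequence, num_frames=20, num_joints=25):
    """Build the per-frame input vector: 2 skeletons x num_joints x 3 coordinates.

    sequence: (num_persons, total_frames, num_joints, 3) array of joint coordinates.
    Returns an array of shape (num_frames, 2 * num_joints * 3).
    """
    total_frames = sequence.shape[1]
    idx = np.linspace(0, total_frames - 1, num_frames).astype(int)   # evenly sampled frames (assumed)
    two = np.zeros((2, num_frames, num_joints, 3))                   # second skeleton stays zero if absent
    p = min(2, sequence.shape[0])
    two[:p] = sequence[:p, idx]
    return two.transpose(1, 0, 2, 3).reshape(num_frames, -1)

clip = np.random.default_rng(0).normal(size=(1, 103, 25, 3))   # one subject, 103 frames (toy data)
x = skeleton_inputs(clip)                                      # shape (20, 150)
```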
The final result is shown in Table 4, including comparisons with the existing methods. It can be seen that the proposed IndRNN greatly improves the performance over other RNN or LSTM models on the same task. For CS, RNN and LSTM of 2 layers can only achieve accuracies of 56.29% and 60.09% while a 4-layer IndRNN achieved 78.58%. For CV, RNN and LSTM of 2 layers only achieved accuracies of 64.09% and 67.29% while a 4-layer IndRNN achieved 83.75%. As demonstrated in [32, 45], the performance of LSTM cannot be further improved by simply increasing the number of parameters or increasing the number of layers. However, by increasing the 4-layer IndRNN to a 6-layer IndRNN, the performance is further improved to 81.80% and 87.97% for CS and CV, respectively. This performance is better than the state-of-the-art methods including those with attention models [46, 2] and other techniques [51, 32].

Table 4. Results of all skeleton based methods on the NTU RGB+D dataset.
Method                                  CS        CV
Deep learning on Lie Group [15]         61.37%    66.95%
JTM+CNN [48]                            73.40%    75.20%
Res-TCN [22]                            74.30%    83.10%
SkeletonNet (CNN) [20]                  75.94%    81.16%
JDM+CNN [31]                            76.20%    82.30%
Clips+CNN+MTLN [21]                     79.57%    84.83%
Enhanced Visualization+CNN [33]         80.03%    87.21%
1 Layer RNN [45]                        56.02%    60.24%
2 Layer RNN [45]                        56.29%    64.09%
1 Layer LSTM [45]                       59.14%    66.81%
2 Layer LSTM [45]                       60.09%    67.29%
1 Layer PLSTM [45]                      62.05%    69.40%
2 Layer PLSTM [45]                      62.93%    70.27%
JL d+RNN [51]                           70.26%    82.39%
STA-LSTM [46]                           73.40%    81.20%
ST-LSTM + Trust Gate [32]               69.20%    77.70%
Pose conditioned STA-LSTM [2]           77.10%    84.50%
IndRNN (4 layers)                       78.58%    83.75%
IndRNN (6 layers)                       81.80%    87.97%

6. Conclusion

In this paper, we presented an independently recurrent neural network (IndRNN), where neurons in one layer are independent of each other. The gradient backpropagation through time process for the IndRNN has been explained and a regulation technique has been developed to effectively address the gradient vanishing and exploding problems. Compared with the existing RNN models including LSTM and GRU, IndRNN can process much longer sequences. The basic IndRNN can be stacked to construct a deep network, especially combined with residual connections over layers, and the deep network can be trained robustly. In addition, independence among neurons in each layer allows better interpretation of the neurons. Experiments on multiple fundamental tasks have verified the advantages of the proposed IndRNN over existing RNN models.

References

[1] M. Arjovsky, A. Shah, and Y. Bengio. Unitary evolution recurrent neural networks. arXiv preprint arXiv:1511.06464, 2015. 2, 5, 7
[2] F. Baradel, C. Wolf, and J. Mille. Pose-conditioned spatio-temporal attention for human action recognition. arXiv preprint arXiv:1703.10106, 2017. 9
[3] J. Bradbury, S. Merity, C. Xiong, and R. Socher. Quasi-recurrent neural networks. arXiv preprint arXiv:1611.01576, 2016. 2, 5
[4] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki. Scene labeling with lstm recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3547–3555, 2015. 1
[5] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014. 1, 2
[6] J. Chung, S. Ahn, and Y. Bengio. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016. 8
[7] T. Cooijmans, N. Ballas, C. Laurent, Ç. Gülçehre, and A. Courville. Recurrent batch normalization. arXiv preprint arXiv:1603.09025, 2016. 7, 8
[8] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015. 1
[9] Y. Gal and Z. Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027, 2016. 7, 8
[10] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 2017. 1
[11] D. Ha, A. Dai, and Q. V. Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016. 8
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. 1, 2, 5
[13] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016. 5
[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. 2, 5
[15] Z. Huang, C. Wan, T. Probst, and L. Van Gool. Deep learning on lie groups for skeleton-based action recognition. arXiv preprint arXiv:1612.05877, 2016. 9
[16] H. Inan, K. Khosravi, and R. Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016. 8
[17] M. I. Jordan. Serial order: A parallel distributed processing approach. Advances in Psychology, 121:471–495, 1997. 1
[18] R. Jozefowicz, W. Zaremba, and I. Sutskever. An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2342–2350, 2015. 1, 5
[19] A. Karpathy, J. Johnson, and L. Fei-Fei. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015. 1
[20] Q. Ke, S. An, M. Bennamoun, F. Sohel, and F. Boussaid. Skeletonnet: Mining deep part features for 3-d action recognition. IEEE Signal Processing Letters, 24(6):731–735, 2017. 9
[21] Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid. A new representation of skeleton sequences for 3d action recognition. arXiv preprint arXiv:1703.03492, 2017. 9
[22] T. S. Kim and A. Reiter. Interpretable 3d human action analysis with temporal convolutional networks. arXiv preprint arXiv:1704.04516, 2017. 9
[23] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush. Character-aware neural language models. In AAAI, pages 2741–2749, 2016. 8
[24] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 5
[25] D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, H. Larochelle, A. Courville, et al. Zoneout: Regularizing rnns by randomly preserving hidden activations. arXiv preprint arXiv:1606.01305, 2016. 2, 7, 8
[26] D. Krueger and R. Memisevic. Regularizing rnns by stabilizing activations. In Proceedings of the International Conference on Learning Representations, 2016. 2, 8
[27] C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio. Batch normalized recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 2657–2661. IEEE, 2016. 7, 8
[28] Q. V. Le, N. Jaitly, and G. E. Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015. 2, 7
[29] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. 7
[30] T. Lei and Y. Zhang. Training rnns as fast as cnns. arXiv preprint arXiv:1709.02755, 2017. 2, 5
[31] C. Li, Y. Hou, P. Wang, and W. Li. Joint distance maps based action recognition with convolutional neural networks. IEEE Signal Processing Letters, 24(5):624–628, 2017. 9
[32] J. Liu, A. Shahroudy, D. Xu, and G. Wang. Spatio-temporal lstm with trust gates for 3d human action recognition. In European Conference on Computer Vision, pages 816–833. Springer, 2016. 8, 9
[33] M. Liu, H. Liu, and C. Chen. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 68:346–362, 2017. 9
[34] S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016. 8
[35] T. Mikolov, I. Sutskever, A. Deoras, H.-S. Le, S. Kombrink, and J. Cernocky. Subword language modeling with neural networks. Preprint (http://www.fit.vutbr.cz/imikolov/rnnlm/char.pdf), 2012. 8
[36] T. Mikolov and G. Zweig. Context dependent recurrent neural network language model. SLT, 12:234–239, 2012. 8
[37] A. Mujika, F. Meier, and A. Steger. Fast-slow recurrent neural networks. In Advances in Neural Information Processing Systems, pages 5917–5926, 2017. 8
[38] B. Neyshabur, Y. Wu, R. R. Salakhutdinov, and N. Srebro. Path-normalized optimization of recurrent neural networks with relu activations. In Advances in Neural Information Processing Systems, pages 3477–3485, 2016. 2, 7, 8
[39] B. Parlett. Laguerre's method applied to the matrix eigenvalue problem. Mathematics of Computation, 18(87):464–485, 1964. 3
[40] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026, 2013. 8
[41] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013. 4
[42] S. Pradhan and S. Longpre. Exploring the depths of recurrent neural networks with stochastic residual learning. Report. 1, 2
[43] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5533–5541, 2017. 2
[44] S. Semeniuta, A. Severyn, and E. Barth. Recurrent dropout without memory loss. arXiv preprint arXiv:1603.05118, 2016. 7, 8
[45] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang. NTU RGB+D: A large scale dataset for 3d human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1010–1019, 2016. 2, 8, 9
[46] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In AAAI, pages 4263–4270, 2017. 9
[47] S. S. Talathi and A. Vartak. Improving performance of recurrent neural network with relu nonlinearity. arXiv preprint arXiv:1511.03771, 2015. 2
[48] P. Wang, Z. Li, Y. Hou, and W. Li. Action recognition based on joint trajectory maps using convolutional neural networks. In Proceedings of the 2016 ACM on Multimedia Conference, pages 102–106. ACM, 2016. 9
[49] S. Wisdom, T. Powers, J. Hershey, J. Le Roux, and L. Atlas. Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems, pages 4880–4888, 2016. 2
[50] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016. 1, 2
[51] S. Zhang, X. Liu, and J. Xiao. On geometric features for skeleton-based action recognition using multilayer lstm networks. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pages 148–157. IEEE, 2017. 9
[52] J. G. Zilly, R. K. Srivastava, J. Koutník, and J. Schmidhuber. Recurrent highway networks. arXiv preprint arXiv:1607.03474, 2016. 8
[53] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016. 8
