
Depth-Gated Recurrent Neural Networks

Kaisheng Yao, Microsoft Research
Trevor Cohn, University of Melbourne
Katerina Vylomova, University of Melbourne
Kevin Duh, Nara Institute of Science and Technology
Chris Dyer, Carnegie Mellon University

arXiv:1508.03790v2 [cs.NE] 19 Aug 2015

Abstract

In this short note, we present an extension of the LSTM that uses a depth gate to connect memory cells of adjacent layers. Doing so introduces a linear dependence between lower and upper layer recurrent units. Importantly, the linear dependence is gated through a gating function, which we call the depth gate. This gate is a function of the lower layer memory cell, the input to the layer, and the layer's past memory. We conducted experiments and verified that this new LSTM architecture improves machine translation and language modeling performance.

1 Introduction

Deep neural networks (DNNs) have been successfully applied to many areas, including speech [1] and vision [2]. On natural language processing tasks, recurrent neural networks (RNNs) [3] are more widely used because of their ability to memorize long-term dependencies.
However, the simple RNN suffers from gradient vanishing and explosion. Since an RNN can be considered a deep neural network unrolled across many time instances, the gradients at the end of a sentence may not be able to back-propagate to the beginning of the sentence, because they must pass through many layers of nonlinear transformation.
The long short-term memory (LSTM) [4] network is an extension of the simple RNN [3]. Instead of using a nonlinear connection between the past hidden activity and the current hidden activity, it uses a linear dependence to relate its past memory to the current memory. Importantly, a forget gate is introduced in the LSTM to modulate how much each element of the past memory contributes to the current memory cell.
LSTMs and their extensions, for instance gated recurrent units (GRUs) [5], have been successfully used in many natural language processing tasks, including machine translation [5, 6] and language understanding [7, 8].
The standard way to construct a deep neural network is to stack many layers. This, however, has the same problem as in simple recurrent networks: the error signals from the top, instead of from the last time instance, have to be back-propagated through many layers of nonlinear transformation, and therefore they are either diminished or exploded.

This paper proposes an extension of LSTMs. The key concept is a depth gate that modulates the
linear dependence of memory cells in the upper and lower layers.

2 Review of recurrent neural networks


2.1 Simple RNN

A simple recurrent neural network (simple RNN) computes its output yt as follows
yt = f (Wy ht ) (1)
ht = σ(Wh ht−1 + Wx xt ) (2)
where Wy , Wh , and Wx are the matrices for hidden layer output ht , past hidden layer activity ht−1
and the input xt .
The time recurrence is introduced in Eq. (2) which relates the current hidden layer activity ht with
its past hidden layer activity ht−1. This dependence is nonlinear because of the logistic function σ(·).
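
As a concrete illustration of Eqs. (1)-(2), a single time step can be sketched as follows. This is a minimal NumPy sketch, not code from the paper; the choice of a softmax for the output nonlinearity f and the bias-free parameterization are assumptions made here for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def simple_rnn_step(x_t, h_prev, W_x, W_h, W_y):
    # Eq. (2): nonlinear recurrence relating h_t to h_{t-1}
    h_t = sigmoid(W_h @ h_prev + W_x @ x_t)
    # Eq. (1): output layer; f is taken here to be a softmax (an assumption)
    z = W_y @ h_t
    y_t = np.exp(z - z.max()) / np.sum(np.exp(z - z.max()))
    return y_t, h_t
```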

2.2 Long short-term memory (LSTM)

The simple RNN presented above is hard to train because of the gradient vanishing and exploding problems, which arise from the nonlinear relation between ht and ht−1. LSTM was initially proposed in [4] and later modified in [9]; we follow the implementation in [9]. LSTM introduces a linear dependence between its memory cell ct and its past value ct−1. Additionally, LSTM has input and output gates, which are applied to a nonlinear function of the input and to a nonlinear function of the output, respectively. Specifically, the LSTM is written as
it = σ(Wxi xt + Whi ht−1 + Wci ct−1)    (3)
ft = σ(Wxf xt + Whf ht−1 + Wcf ct−1)    (4)
ct = ft ⊙ ct−1 + it ⊙ tanh(Wxc xt + Whc ht−1)    (5)
ot = σ(Wxo xt + Who ht−1 + Wco ct)    (6)
ht = ot ⊙ tanh(ct)    (7)
where it, ft, and ot are the input gate, forget gate, and output gate, and ⊙ denotes element-wise multiplication.
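
One LSTM step following Eqs. (3)-(7) can be sketched in NumPy as below. The dictionary of weight matrices and the omission of bias terms mirror the equations above; this is an illustrative sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p maps names such as "W_xi" to weight matrices."""
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["W_ci"] @ c_prev)  # Eq. (3)
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["W_cf"] @ c_prev)  # Eq. (4)
    c_t = f_t * c_prev + i_t * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev)  # Eq. (5)
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["W_co"] @ c_t)     # Eq. (6)
    h_t = o_t * np.tanh(c_t)                                                  # Eq. (7)
    return h_t, c_t
```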

2.3 Gated Recurrent Unit

The gated recurrent unit (GRU) was proposed in [10]. It is similar to the LSTM in using gating functions, but differs from the LSTM in that it does not have a memory cell. Its operations can be summarized as
ht = (1 − zt) ⊙ ht−1 + zt ⊙ h̃t    (8)
zt = σ(Wz xt + Uz ht−1)    (9)
h̃t = tanh(Wh xt + U (rt ⊙ ht−1))    (10)
rt = σ(Wr xt + Ur ht−1)    (11)
where ht is the output from the GRU, zt and rt are the update gate and reset gate, h̃t is the candidate output, and Wz, Wh, Wr, Uz, U, and Ur are the GRU weight matrices.
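
A GRU step following Eqs. (8)-(11) can be sketched the same way; as before, the parameter dictionary and bias-free form are illustrative assumptions.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step; p maps names such as "W_z" to weight matrices."""
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev)             # Eq. (9): update gate
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev)             # Eq. (11): reset gate
    h_cand = np.tanh(p["W_h"] @ x_t + p["U"] @ (r_t * h_prev))    # Eq. (10): candidate output
    h_t = (1.0 - z_t) * h_prev + z_t * h_cand                     # Eq. (8): gated interpolation
    return h_t
```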

3 The depth-gated recurrent neural networks


Sec. 3.1 presents the extension of LSTM.

3.1 Depth-gated LSTM

The depth-gated LSTM (DGLSTM) is illustrated in Fig. 4. It has a depth gate that connects the memory cell c^{L+1}_t in the upper layer L+1 and the memory cell c^L_t in the lower layer L. The depth gate controls how much flows from the lower memory cell directly to the upper layer memory cell. The gate function at layer L+1 at time t is a logistic function,

d^{L+1}_t = σ(b^{L+1}_d + W^{L+1}_{xd} x_t + W^{L+1}_{cd} ⊙ c^{L+1}_{t−1} + W^{L+1}_{ld} ⊙ c^L_t)    (12)

where b^{L+1}_d is a bias term and W^{L+1}_{xd} is the weight matrix relating the depth gate to the input of this layer. The gate also relates to the past memory cell via a weight vector W^{L+1}_{cd}, and to the lower layer memory cell via a weight vector W^{L+1}_{ld}.

Figure 3: Illustration of LSTM (Figure 1) and GRU (Figure 2).

Figure 4: The depth-gated LSTM.

Using the depth gate, a DGLSTM unit can be written as

i^{L+1}_t = σ(W^{L+1}_{xi} x_t + W^{L+1}_{hi} h^{L+1}_{t−1} + W^{L+1}_{ci} c^{L+1}_{t−1})    (13)
f^{L+1}_t = σ(W^{L+1}_{xf} x_t + W^{L+1}_{hf} h^{L+1}_{t−1} + W^{L+1}_{cf} c^{L+1}_{t−1})    (14)
d^{L+1}_t = σ(b^{L+1}_d + W^{L+1}_{xd} x_t + W^{L+1}_{cd} ⊙ c^{L+1}_{t−1} + W^{L+1}_{ld} ⊙ c^L_t)    (15)
c^{L+1}_t = d^{L+1}_t ⊙ c^L_t + f^{L+1}_t ⊙ c^{L+1}_{t−1} + i^{L+1}_t ⊙ tanh(W^{L+1}_{xc} x_t + W^{L+1}_{hc} h^{L+1}_{t−1})    (16)
o^{L+1}_t = σ(W^{L+1}_{xo} x_t + W^{L+1}_{ho} h^{L+1}_{t−1} + W^{L+1}_{co} c^{L+1}_t)    (17)
h^{L+1}_t = o^{L+1}_t ⊙ tanh(c^{L+1}_t)    (18)

where i^{L+1}_t, f^{L+1}_t, o^{L+1}_t, and d^{L+1}_t are the input gate, forget gate, output gate, and depth gate, respectively.
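
To make the role of the depth gate concrete, one DGLSTM step at layer L+1 can be sketched as below. This is a NumPy sketch under the same assumptions as the earlier ones (biases other than b_d omitted, as in the equations); treating W_cd and W_ld as vectors applied element-wise follows the description of Eq. (12). The input x_t to layer L+1 is the output h^L_t of the layer below.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def dglstm_step(x_t, h_prev, c_prev, c_lower, p):
    """One depth-gated LSTM step at layer L+1.

    x_t:     input to layer L+1 (the output h_t of layer L)
    h_prev:  h^{L+1}_{t-1};  c_prev: c^{L+1}_{t-1};  c_lower: c^L_t
    p:       dict of parameters named after Eqs. (13)-(18)
    """
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["W_ci"] @ c_prev)   # Eq. (13)
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["W_cf"] @ c_prev)   # Eq. (14)
    d_t = sigmoid(p["b_d"] + p["W_xd"] @ x_t
                  + p["W_cd"] * c_prev + p["W_ld"] * c_lower)                  # Eq. (15): depth gate
    c_t = (d_t * c_lower + f_t * c_prev
           + i_t * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev))              # Eq. (16)
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["W_co"] @ c_t)      # Eq. (17)
    h_t = o_t * np.tanh(c_t)                                                   # Eq. (18)
    return h_t, c_t
```

Stacking such units simply passes h_t upward as the next layer's x_t and c_t upward as its c_lower.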

4 Experiments
We applied DGLSTMs to two datasets. The first is the BTEC Chinese-to-English machine translation task. The second is the Penn Treebank dataset.

Table 1: BTEC Chinese to English results (BLEU)

Depth   GRU     LSTM    DGLSTM
3       33.95   32.43   34.48
5       32.73   33.52   33.81
10      30.72   31.99   32.19

Table 2: BTEC Chinese to English reranking BLEU scores

Dataset   Baseline   DGLSTM
Dev       26.61      30.05
Test      40.63      43.08

4.1 Machine translation results

We first compared DGLSTM with GRU and LSTM. Results are shown in Table 1, which shows that DGLSTM outperforms LSTM and GRU at all depths.
In a second machine translation experiment, we used DGLSTMs to produce scores for reranking. We trained two DGLSTMs, one with 3 layers and the other with 5 layers; both used 50-dimensional hidden layers. They were used as the basic recurrent unit in an attention model [5], and their scores were used to train a reranker. We ran the reranker 10 times; the mean BLEU scores are listed in Table 2. Compared to the baseline, DGLSTM obtained an improvement of approximately 3 BLEU points.

4.2 Language modeling

We conducted experiments on the Penn Treebank (PTB) dataset. We trained a two-layer DGLSTM in which each layer has a 200-dimensional hidden vector. Test set perplexity results are shown in Table 3. Compared to previously published results on the PTB dataset, DGLSTM obtained the lowest perplexity on the PTB test set.

5 Related works

Recent work on building deep networks includes the highway network [12]. In [12], the output of a layer is a linear function of its input, in addition to the usual nonlinear path, and both paths are gated, as follows

y = H(x, W_H) ⊙ T(x, W_T) + x ⊙ C(x, W_C)    (19)

where T and C are called the transform gate and the carry gate, respectively. Therefore, the highway network output has a direct connection, albeit gated, to the input.
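
For comparison, one highway layer implementing Eq. (19) can be sketched as follows; the tanh choice for H, the sigmoid gates, and the bias-free parameterization are assumptions made here for illustration (in [12] the carry gate is often tied to the transform gate as C = 1 − T).

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, p):
    """One highway layer following Eq. (19); p maps "W_H", "W_T", "W_C" to matrices."""
    h = np.tanh(p["W_H"] @ x)     # H(x, W_H): nonlinear transform of the input
    t = sigmoid(p["W_T"] @ x)     # T(x, W_T): transform gate
    c = sigmoid(p["W_C"] @ x)     # C(x, W_C): carry gate
    return h * t + x * c          # gated transform plus gated carry of the input
```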
The depth-gated LSTM, in contrast, connects the lower and upper layers linearly through the memory cell. The key difference of the depth-gated LSTM from the highway network is therefore that its memory cell receives error signals both from future time steps and from the top layer, linearly albeit gated, whereas the highway network only introduces a gated linear dependence between adjacent layers.

6 Conclusions

We have presented a depth-gated RNN architecture. In particular, we have extended LSTM to use
the depth gate that modulates a linear dependence of the memory cells in the upper and lower layer
recurrent units. We observed better performances using this new model on a machine translation
experiment and a language modeling task.

Table 3: Penn Treebank test set results

Model             Test PPL
RNN [3]           123
LSTM [9]          117
sRNN [11]         110
DOT(s)-RNN [11]   108
DGLSTM            96

References
[1] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal Processing Magazine, 2012.
[2] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012, vol. 25, pp. 1090–1098.
[3] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur, “Recurrent neural network
based language model,” in INTERSPEECH, 2010, pp. 1045–1048.
[4] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9,
pp. 1735–1780, 1997.
[5] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align
and translate,” in arXiv:1409.0473 [cs.CL], 2014.
[6] I. Sutskever, O. Vinyals, and Q.V. Le, “Sequence to sequence learning with neural networks,”
in NIPS, 2014.
[7] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural
language processing (almost) from scratch,” Journal of Machine Learning Research, vol. 12,
pp. 2493–2537, 2011.
[8] K. Yao, B. Peng, G. Zweig, D. Yu, X. Li, and F. Gao, “Recurrent conditional random field for
language understanding,” in ICASSP, 2014.
[9] A. Graves, “Generating sequences with recurrent neural networks,” in arXiv:1308.0850
[cs.NE], 2013.
[10] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” in arXiv:1409.1259 [cs.CL], 2014.
[11] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, “How to construct deep recurrent neural
networks,” in arXiv:1312.6026 [cs.NE], 2013.
[12] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” in arXiv:1505.00387, May 2015.
