Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks
2. MODEL ARCHITECTURE
This section describes the CLDNN architecture shown in Figure 1.

[Fig. 1. CLDNN Architecture: input [xt−l, ..., xt, ..., xt+r] -> convolutional layers (C) -> linear dimensionality-reduction layer -> LSTM layers (L) -> fully connected layers (D); dashed streams (1) and (2) denote the multi-scale additions described in Section 2.2.]

2.1. CLDNN

Frame xt, surrounded by l contextual vectors to the left and r contextual vectors to the right, is passed as input to the network. This input is denoted as [xt−l, ..., xt+r]. In our work, each frame xt is a 40-dimensional log-mel feature.

First, we reduce frequency variance in the input signal by passing the input through a few convolutional layers. The architecture used for each CNN layer is similar to that proposed in [2]. Specifically, we use 2 convolutional layers, each with 256 feature maps. We use a 9x9 frequency-time filter for the first convolutional layer, followed by a 4x3 filter for the second convolutional layer, and these filters are shared across the entire time-frequency space. Our pooling strategy is to use non-overlapping max pooling, and pooling in frequency only is performed [11]. A pooling size of 3 was used for the first layer, and no pooling was done in the second layer.

The dimension of the last layer of the CNN is large, due to the number of feature-maps×time×frequency context. Thus, we add a linear layer to reduce feature dimension, before passing this to the LSTM layer, as indicated in Figure 1. In [12] we found that adding this linear layer after the CNN layers allows for a reduction in parameters with no loss in accuracy. In our experiments, we found that reducing the dimensionality, such that we have 256 outputs from the linear layer, was appropriate.

After frequency modeling is performed, we next pass the CNN output to LSTM layers, which are appropriate for modeling the signal in time. Following the strategy proposed in [3], we use 2 LSTM layers, where each LSTM layer has 832 cells and a 512-unit projection layer for dimensionality reduction. Unless otherwise indicated, the LSTM is unrolled for 20 time steps for training with truncated backpropagation through time (BPTT). In addition, the output state label is delayed by 5 frames, as we have observed with DNNs that information about future frames helps to better predict the current frame. The input feature into the CNN has l contextual frames to the left and r to the right, and the CNN output is then passed to the LSTM. In order to ensure that the LSTM does not see more than 5 frames of future context, which would increase the decoding latency, we set r = 0 for CLDNNs.

Finally, after performing frequency and temporal modeling, we pass the output of the LSTM to a few fully connected DNN layers. As shown in [5], these higher layers are appropriate for producing a higher-order feature representation that is more easily separable into the different classes we want to discriminate. Each fully connected layer has 1,024 hidden units.
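To make the layer stack above concrete, the following is a minimal PyTorch-style sketch of a CLDNN with the sizes quoted in this section (2 convolutional layers of 256 feature maps with 9x9 and 4x3 filters, frequency-only max pooling of 3, a 256-output linear reduction layer, 2 LSTM layers of 832 cells with 512-unit projections, and fully connected layers of 1,024 units). It is an illustration rather than the system used in this paper: the tensor layout, the ReLU activations, the num_targets output layer, and the hard-coded shapes for an 11-frame input patch (l = 10, r = 0) are assumptions of the sketch.

import torch
import torch.nn as nn

class CLDNN(nn.Module):
    """Sketch of the CLDNN stack: CNN -> linear dim. reduction -> LSTM -> DNN."""

    def __init__(self, num_targets: int, num_mel: int = 40):
        super().__init__()
        # Frequency modeling: two convolutional layers with 256 feature maps,
        # 9x9 then 4x3 (frequency x time) filters, non-overlapping max pooling
        # of 3 in frequency only after the first layer, no pooling after the second.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 256, kernel_size=(9, 9)), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(3, 1)),
            nn.Conv2d(256, 256, kernel_size=(4, 3)), nn.ReLU(),
        )
        # Linear layer that shrinks the feature-maps x time x frequency output
        # of the CNN to 256 dimensions before the LSTM.
        # 256 maps x 7 freq x 1 time for an 11-frame x 40-mel patch (l = 10, r = 0).
        self.dim_red = nn.Linear(256 * 7 * 1, 256)
        # Temporal modeling: 2 LSTM layers, 832 cells each, 512-unit projection.
        self.lstm = nn.LSTM(input_size=256, hidden_size=832, num_layers=2,
                            proj_size=512, batch_first=True)
        # Fully connected layers (1,024 hidden units) and the output layer.
        self.dnn = nn.Sequential(
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, num_targets),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, steps, context, num_mel), one stacked patch
        # [x_{t-l}, ..., x_t] per unrolled time step (r = 0, context = 11 here).
        b, t, c, f = x.shape
        patches = x.reshape(b * t, c, f).transpose(1, 2).unsqueeze(1)  # (N, 1, freq, time)
        feats = self.cnn(patches).flatten(start_dim=1)
        feats = self.dim_red(feats).reshape(b, t, -1)
        out, _ = self.lstm(feats)          # (batch, steps, 512)
        return self.dnn(out)               # per-frame CD-state scores

With the unpadded convolutions of this sketch, an 11x40 patch yields 256 feature maps of size 32x3 after the first layer, 10x3 after frequency pooling, and 7x1 after the second layer, i.e. a 256 x 7 x 1 = 1,792-dimensional vector that the linear layer reduces to 256.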
2.2. Multi-scale Additions

The CNN takes a long-term feature, seeing a context of t−l to t (i.e., r = 0 in the CLDNN), and produces a higher-order representation of this to pass into the LSTM. The LSTM is then unrolled for 20 timesteps, and thus consumes a larger context of 20 + l. However, we feel there is complementary information in also passing the short-term xt feature to the LSTM. In fact, the original LSTM work in [3] looked at modeling a sequence of 20 consecutive short-term xt features, with no context. In order to model short and long-term features, we take the original xt and pass this as input, along with the long-term feature from the CNN, into the LSTM. This is shown by dashed stream (1) in Figure 1.

The use of short and long-term features in a neural network has been explored previously (e.g., [13, 14]). The main difference between previous work and ours is that we are able to do this jointly in one network, namely because of the power of the LSTM's sequential modeling. In addition, our combination of short and long-term features results in a negligible increase in the number of network parameters.

In addition, we explore whether there is complementarity between modeling the output of the CNN temporally with an LSTM, as well as discriminatively with a DNN. Specifically, motivated by work in computer vision [10], we explore passing the output of the CNN into both the LSTM and DNN. This is indicated by dashed stream (2) in Figure 1. This idea of combining information from CNN and DNN layers has been explored before in speech [11, 15], though previous work added extra DNN layers to do the combination. Our work differs in that we pass the output of the CNN directly into the DNN, without extra layers and thus with a minimal parameter increase.
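As a sketch of how the two dashed streams could be wired, the model above can be extended as follows. Combining the streams by concatenation, and the specific layer sizes, are assumptions of this illustration, not a description of the authors' implementation.

import torch
import torch.nn as nn

class MultiScaleCLDNN(nn.Module):
    """CLDNN with the two multi-scale streams of Figure 1 (sketch).

    Stream (1): the short-term frame x_t is appended to the long-term CNN
    feature before the LSTM. Stream (2): the CNN feature is also appended to
    the LSTM output before the fully connected layers.
    """

    def __init__(self, num_targets: int, num_mel: int = 40):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 256, kernel_size=(9, 9)), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(3, 1)),
            nn.Conv2d(256, 256, kernel_size=(4, 3)), nn.ReLU(),
        )
        self.dim_red = nn.Linear(256 * 7 * 1, 256)   # sized for an 11x40 patch
        self.lstm = nn.LSTM(256 + num_mel, 832, num_layers=2, proj_size=512,
                            batch_first=True)        # stream (1): wider LSTM input
        self.dnn = nn.Sequential(
            nn.Linear(512 + 256, 1024), nn.ReLU(),   # stream (2): wider DNN input
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, num_targets),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, steps, context, num_mel); the last frame of each patch is x_t.
        b, t, c, f = x.shape
        patches = x.reshape(b * t, c, f).transpose(1, 2).unsqueeze(1)
        long_term = self.dim_red(self.cnn(patches).flatten(start_dim=1)).reshape(b, t, -1)
        short_term = x[:, :, -1, :]                                     # x_t
        out, _ = self.lstm(torch.cat([long_term, short_term], dim=-1))  # stream (1)
        return self.dnn(torch.cat([out, long_term], dim=-1))            # stream (2)

Because both streams only widen existing layer inputs, the parameter increase is small, consistent with the negligible increase noted above.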
3. EXPERIMENTS

Our initial experiments to understand the CNN, DNN and LSTM architectures are conducted on a medium-sized training set consisting of 300k English-spoken utterances (about 200 hours). Further experiments are then performed on a larger training set of 3m utterances (2,000 hrs). In addition, to explore the robustness of our model to noise, we also perform experiments using a noisy training set of 3m utterances (2,000 hrs). This data set is created by artificially corrupting clean utterances using a room simulator, adding varying degrees of noise and reverberation, such that the overall SNR is between 5dB and 30dB. The noise sources are from YouTube and daily-life noisy environmental recordings. All training sets are anonymized and hand-transcribed, and are representative of Google's speech traffic. Models trained on clean speech are evaluated on a clean test set containing 30,000 utterances (20 hrs). In addition, models trained on noisy speech are evaluated in matched conditions on a 30,000-utterance noisy test set, to which noise at various SNRs has been added to the clean test set. It is important to note that the training and test sets used in this paper are different from those in [3], and therefore numbers cannot be compared directly.

The input features for all models are 40-dimensional log-mel filterbank features, computed every 10ms. Unless otherwise indicated, all neural networks are trained with the cross-entropy criterion, using the asynchronous stochastic gradient descent (ASGD) optimization strategy described in [16]. The sequence-training experiments in this paper also use distributed ASGD, which is outlined in more detail in [17]. All networks have 13,522 CD output targets. The weights for all CNN and DNN layers are initialized using the Glorot-Bengio strategy described in [18]. Unless otherwise indicated, all LSTM layers are randomly initialized to be Gaussian, with a variance of 1/(# inputs). In addition, the learning rate is chosen specific to each network, and is chosen to be the largest value such that training remains stable. Learning rates are exponentially decayed.
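As an illustration of the initialization scheme just described, the sketch below applies the Glorot-Bengio strategy [18] to convolutional and fully connected layers, and zero-mean Gaussians with variance 1/(# inputs) to LSTM weights. Treating biases as zero-initialized and taking fan-in per weight matrix are assumptions of this sketch.

import math
import torch.nn as nn

def init_cldnn_weights(model: nn.Module) -> None:
    """Glorot init for CNN/DNN layers, Gaussian (variance 1/fan_in) for LSTMs."""
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            nn.init.xavier_uniform_(module.weight)   # Glorot-Bengio strategy [18]
            if module.bias is not None:
                nn.init.zeros_(module.bias)          # assumption: zero biases
        elif isinstance(module, nn.LSTM):
            for name, param in module.named_parameters():
                if "weight" in name:
                    fan_in = param.shape[1]          # inputs feeding this weight matrix
                    nn.init.normal_(param, mean=0.0, std=math.sqrt(1.0 / fan_in))
                else:
                    nn.init.zeros_(param)

The exponentially decayed learning rate could be expressed, for example, with torch.optim.lr_scheduler.ExponentialLR, with the starting rate tuned per network as described above.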
4. RESULTS

Initial results to understand the combined CLDNN model, and its variants, are presented in this section. All models are trained on the medium-sized 200-hour clean training set, and results are reported on the clean test set.

4.1. Baselines

First, we establish baseline numbers for CNNs, DNNs and LSTMs, as shown in Table 1. Consistent with results reported in the literature [2], the CNN is trained with 2 convolutional layers with 256 feature maps, and 4 fully connected layers of 1,024 hidden units. The DNN is trained with 6 layers of 1,024 hidden units [1]. The input into both the CNN and DNN is a 40-dimensional log-mel filterbank feature, surrounded by a context of 20 past frames and 5 future frames. The LSTM is trained with 2 layers of 832 cells, and a 512-dimensional projection layer. Adding extra LSTM layers to this configuration was not found to help [3]. The input into the LSTM is a single 40-dimensional log-mel feature. The LSTM is unrolled 20 steps in time, and the output is delayed by 5 frames.

Method    WER
DNN       18.4
CNN       18.0
LSTM      18.0

Table 1. DNN, CNN, LSTM Baselines
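For clarity, the baseline inputs just described can be built from a matrix of 10ms log-mel frames by stacking a window of past and future frames around each frame (for the CNN/DNN) and by delaying the per-frame targets (for the LSTM). The function names, the edge padding, and the tensor shapes below are assumptions of this sketch.

import torch

def stack_context(frames: torch.Tensor, left: int, right: int) -> torch.Tensor:
    """frames: (num_frames, 40) -> (num_frames, left + 1 + right, 40).

    Each output row is [x_{t-left}, ..., x_t, ..., x_{t+right}]; utterance edges
    are padded by repeating the first/last frame (a choice of this sketch).
    """
    padded = torch.cat([frames[:1].expand(left, -1), frames,
                        frames[-1:].expand(right, -1)], dim=0)
    return padded.unfold(0, left + 1 + right, 1).permute(0, 2, 1)

def delay_targets(targets: torch.Tensor, delay: int = 5) -> torch.Tensor:
    """Shift per-frame CD-state targets so the label emitted at time t is the
    one for frame t - delay (the 5-frame output delay used for the LSTMs)."""
    return torch.cat([targets[:1].expand(delay), targets[:-delay]], dim=0)

With left=20, right=5 this produces the 26-frame input used for the DNN and CNN baselines; with left=10, right=0 it produces the CNN input used in the CLDNN experiments below.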
4.2. CNN+LSTM

In this section, we analyze the effect of adding a CNN before the LSTM. To show the benefit of CNNs over DNNs, we also report results for adding a DNN before the LSTM layer. Table 2 compares the results for both CNNs and DNNs, with different amounts of left input context (i.e., l) to the CNN and DNN. Notice that for both CNNs and DNNs, the best results are obtained by having a left context of 10 frames. A larger context of 20 hurts performance, likely since the LSTM is then unrolled for 20 time steps, so the total context processed by the LSTM is 40. Also notice that the benefits of CNNs over DNNs [2] continue to hold even when combined with LSTMs.

Input Context    # Steps Unroll    WER CNN    WER DNN
l=0, r=0         20                17.8       18.2
l=10, r=0        20                17.6       18.2
l=20, r=0        20                17.9       18.5

Table 2. WER, CNN+LSTM vs. DNN+LSTM

To ensure that improvements with CLDNNs are not due to extra contextual features given to the CNN (and thus the LSTM), we explore the behavior of LSTMs with different temporal contexts. First, we provide the LSTM with an input spanning ten frames to the left of the current frame (i.e., l = 10), the same input feature provided to the CNN. The LSTM is still unrolled 20 timesteps. Table 3 shows that this does not improve WER over providing no feature context to the LSTM. In addition, we compare passing a single frame to the LSTM and unrolling it for 30 time steps, but this degrades WER. This helps to justify the gains from the CNN+LSTM architecture, showing the importance of extracting more robust features (with CNNs) before performing temporal modeling (with LSTMs).

Method                     WER
LSTM, l=0, unroll=20       18.0
LSTM, l=10, unroll=20      18.0
LSTM, l=0, unroll=30       18.2

Table 3. WER, Alternative Temporal Modeling for LSTM

4.3. LSTM+DNN

In this section, we explore the effect of adding fully connected layers after the output of the LSTM. For this experiment, the input provided to each network is a single frame xt, and the LSTM is again unrolled for 20 time steps. Table 4 shows that improvements are obtained, but performance seems to saturate after two additional layers. This indicates that after temporal modeling is completed, it is beneficial to use DNN layers to transform the output of the LSTM layer to a space that is more discriminative and in which the output targets are easier to predict.

# DNN Layers    WER
0               18.0 (LSTM)
1               17.8
2               17.6
3               17.6

Table 4. WER, LSTM+DNN
4.4. CNN+LSTM+DNN

In this section, we put together the models from Sections 4.2 and 4.3, feeding features into a CNN, then performing temporal modeling with an LSTM, and finally feeding this output into 2 fully connected layers. Table 5 shows the WER for the LSTM, CNN+LSTM, LSTM+DNN and finally the combined CLDNN model. The table indicates that the gains from combining the CNN and DNN layers with the LSTM are complementary. Overall, we are able to achieve a 4% relative improvement in WER over the LSTM model alone.

Method       WER
LSTM         18.0
CNN+LSTM     17.6
LSTM+DNN     17.6
CLDNN        17.3

Table 5. WER, CLDNN

4.5. Better Weight Initialization

We have shown that we can achieve gains by using CNNs to provide better features before performing temporal modeling with LSTMs. One may argue that if the LSTMs were better initialized, such that better temporal modeling could be performed, CNNs might not really be necessary. Our initial experiments with LSTMs use Gaussian random weight initialization, which produces eigenvalues of the initial recurrent network that are close to zero, thus increasing the chances of vanishing gradients [19]. To address this issue, we look at uniform random weight initialization between −0.02 and 0.02 for the LSTM layers.

Table 6 shows that the gains with the CLDNN model still hold even after better weight initialization, and the CLDNN model still has a 4% relative improvement over the LSTM. This justifies the benefit of having CNN layers provide better features for temporal modeling. Notice now that with proper weight initialization, the LSTM is better than the CNN or DNN in Table 1.

Method    WER - Gaussian Init    WER - Uniform Init
LSTM      18.0                   17.7
CLDNN     17.3                   17.0

Table 6. WER, Weight Initialization
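The uniform alternative examined here can be sketched with PyTorch's built-in initializer; whether biases are also re-drawn is not specified in the text, so this sketch leaves them untouched.

import torch.nn as nn

def init_lstm_uniform(lstm: nn.LSTM, scale: float = 0.02) -> None:
    """Uniform random weight initialization in [-scale, scale] for LSTM layers."""
    for name, param in lstm.named_parameters():
        if "weight" in name:
            nn.init.uniform_(param, -scale, scale)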
4.6. Multi-scale Additions

In this section, we investigate adding multi-scale information into the CLDNN architecture, as described in Section 2.2. First, we explore passing a long-term feature [xt−10, ..., xt] to the CNN, and a short-term feature xt to the LSTM. Table 7 shows this gives a WER of 16.8%, an additional 1% relative improvement over passing just the long-term feature from the CNN into the LSTM.

Second, we explore passing the output of the CNN into both the LSTM and the DNN. Table 7 indicates that this does not yield gains over the CLDNN alone. This indicates that temporal processing of CNN features using the LSTM is sufficient, and more information is not gained by additionally passing CNN features into the DNN.

Method                                WER
LSTM                                  17.7
CLDNN, long-term feature to LSTM      17.0
  + short-term feature to LSTM        16.8
  + CNN to LSTM and DNN layers        17.0

Table 7. WER with Multi-scale Additions

5. RESULTS ON LARGER DATA SETS

In this section, we compare CLDNNs and LSTMs, as well as multi-scale additions, on larger data sets. Note that when we say multi-scale CLDNN, we just include results passing the long-term feature into the CNN and the short-term feature into the LSTM, and omit passing the CNN into both the LSTM and DNN, as only the first technique showed gains in Section 4.6. In addition, in this section we report numbers after both cross-entropy (CE) and sequence training [17], a strategy which has been shown to give consistent gains over CE training [20].

Table 8 shows the WER for the 3 models when trained on a 2,000-hour clean data set, and then evaluated on a clean test set. With both the CLDNN and multi-scale additions, we can achieve a 6% relative reduction in WER over the LSTM after CE training, and a 5% relative improvement after sequence training.

Method                WER-CE    WER-Seq
LSTM                  14.6      13.7
CLDNN                 14.0      13.1
multi-scale CLDNN     13.8      13.1

Table 8. WER, Models Trained on 2,000 hours, Clean

Finally, Table 9 illustrates the WER for the 3 models when trained on a 2,000-hour noisy training set, and then evaluated on a noisy test set. At the CE level, the CLDNN provides a 4% relative reduction in WER compared to the LSTM, and including the multi-scale information again provides a small additional improvement. After sequence training, the CLDNN provides a 7% relative improvement over the LSTM. The improvements with CLDNNs on larger data sets and after sequence training demonstrate the robustness and value of the proposed method.

Method                WER-CE    WER-Seq
LSTM                  20.3      18.8
CLDNN                 19.4      17.4
multi-scale CLDNN     19.2      17.4

Table 9. WER, Models Trained on 2,000 hours, Noisy

6. CONCLUSIONS

In this paper, we present a combined CNN, LSTM and DNN architecture, which we call CLDNN. The architecture uses CNNs to reduce the spectral variation of the input feature, then passes this to LSTM layers to perform temporal modeling, and finally outputs this to DNN layers, which produce a feature representation that is more easily separable. We also incorporate multi-scale additions to this architecture, to capture information at different resolutions. Results on a variety of LVCSR Voice Search tasks indicate that the proposed CLDNN architecture provides a 4-6% relative reduction in WER compared to an LSTM.

7. ACKNOWLEDGEMENTS

Thank you to Izhak Shafran for help with LSTM training scripts, Hank Liao for help with decoding setups, and Arun Narayanan for suggestions on how to train and decode in noisy conditions.
8. REFERENCES

[1] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep Neural Networks for Acoustic Modeling in Speech Recognition," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.

[2] T. N. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran, "Deep Convolutional Neural Networks for LVCSR," in Proc. ICASSP, 2013.

[3] H. Sak, A. Senior, and F. Beaufays, "Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling," in Proc. Interspeech, 2014.

[4] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, "How to Construct Deep Recurrent Neural Networks," in Proc. ICLR, 2014.

[5] A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," in Proc. ICASSP, 2012.

[6] H. Soltau, G. Saon, and B. Kingsbury, "The IBM Attila Speech Recognition Toolkit," in Proc. IEEE Workshop on Spoken Language Technology, 2010.

[7] T. N. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky, "Exemplar-Based Sparse Representation Features: From TIMIT to LVCSR," IEEE TSAP, 2011.

[8] G. Saon, H. Soltau, A. Emami, and M. Picheny, "Unfolded Recurrent Neural Networks for Speech Recognition," in Proc. Interspeech, 2014.

[9] L. Deng and J. Platt, "Ensemble Deep Learning for Speech Recognition," in Proc. Interspeech, 2014.

[10] P. Sermanet and Y. LeCun, "Traffic Sign Recognition with Multi-Scale Convolutional Networks," in Proc. IJCNN, 2011.

[11] T. N. Sainath, B. Kingsbury, A. Mohamed, G. Dahl, G. Saon, H. Soltau, T. Beran, A. Aravkin, and B. Ramabhadran, "Improvements to Deep Convolutional Neural Networks for LVCSR," in Proc. ASRU, 2013.

[12] T. N. Sainath, V. Peddinti, B. Kingsbury, P. Fousek, D. Nahamoo, and B. Ramabhadran, "Deep Scattering Spectra with Deep Neural Networks for LVCSR Tasks," in Proc. Interspeech, 2014.

[13] P. Schwarz, P. Matejka, and J. Cernocky, "Hierarchical Structures of Neural Networks for Phoneme Recognition," in Proc. ICASSP, 2006.

[14] F. Grezl and M. Karafiat, "Semi-Supervised Bootstrapping Approach for Neural Network Feature Extractor Training," in Proc. ASRU, 2013.

[15] H. Soltau, G. Saon, and T. N. Sainath, "Joint Training of Convolutional and Non-Convolutional Neural Networks," in Proc. ICASSP, 2014.

[16] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, "Large Scale Distributed Deep Networks," in Proc. NIPS, 2012.

[17] G. Heigold, E. McDermott, V. Vanhoucke, A. Senior, and M. Bacchiani, "Asynchronous Stochastic Optimization for Sequence Training of Deep Neural Networks," in Proc. ICASSP, 2014.

[18] X. Glorot and Y. Bengio, "Understanding the Difficulty of Training Deep Feedforward Neural Networks," in Proc. AISTATS, 2010.

[19] R. Pascanu, T. Mikolov, and Y. Bengio, "On the Difficulty of Training Recurrent Neural Networks," in Proc. ICML, 2013.

[20] B. Kingsbury, "Lattice-Based Optimization of Sequence Classification Criteria for Neural-Network Acoustic Modeling," in Proc. ICASSP, 2009.