10.2478/jaiscr-2019-0006
Abstract

Deep Neural Networks (DNNs) are neural networks with many hidden layers. DNNs are becoming popular in automatic speech recognition tasks, which combine a good acoustic model with a language model. Standard feedforward neural networks cannot handle speech data well since they have no way to feed information from a later layer back to an earlier layer. Thus, Recurrent Neural Networks (RNNs) have been introduced to take temporal dependencies into account. However, the shortcoming of RNNs is that they cannot handle long-term dependencies due to the vanishing/exploding gradient problem. Therefore, Long Short-Term Memory (LSTM) networks were introduced; they are a special case of RNNs that takes the long-term dependencies in speech into account in addition to the short-term dependencies. Similarly, Gated Recurrent Unit (GRU) networks are an improvement over LSTM networks that also takes long-term dependencies into consideration. Thus, in this paper, we evaluate RNN, LSTM, and GRU to compare their performances on a reduced TED-LIUM speech data set. The results show that LSTM achieves the best word error rates; however, the GRU optimization is faster while achieving word error rates close to those of LSTM.
Keywords: Spectrogram, Connectionist Temporal Classification, TED-LIUM data set
input data by decreasing the dimensionality of the feature space. In stacked denoising autoencoders, for example, a partially corrupted output is cleaned, i.e., de-noised.

Another area where deep learning is successfully applied is automatic speech recognition. In automatic speech recognition, good acoustic and language models are combined [2, 3]. The speech recognition problem involves time-series data. Feedforward neural networks are unidirectional, whereby the outputs of one layer are forwarded to the following layer. These feedforward networks cannot persist past information. Furthermore, when DNNs are used for speech recognition, certain issues are encountered, including different speaking rates and temporal dependencies [4, 5, 6]. DNNs can only model a fixed-size sliding window of acoustic frames and therefore cannot handle different speaking rates [6]. The Recurrent Neural Network (RNN) is another class of network that contains loops in the hidden layer to retain the information from the previous time step in order to predict the value at the current time step. This mechanism helps RNNs handle different speaking rates [6].

Temporal dependencies can be an issue when analyzing speech recognition tasks. Temporal dependencies may be present in the short term or the long term depending on the speech recognition problem. RNNs take into account only the short-term dependencies due to the vanishing/exploding gradient problem. In the last couple of years, RNNs have been applied to a variety of problems such as machine translation, image captioning, and speech recognition. Speech involves a dynamic process, and thus an RNN seems a better choice than a traditional feedforward network [7].

However, the applicability of RNNs is limited for two reasons. The first is that RNNs require pre-segmented training data and post-processing of the output to convert it into labeled sequences. Connectionist Temporal Classification (CTC) for labeling sequence data in training with RNNs solves this shortcoming. The CTC method has been proven to be helpful where the alignment between input and output labels is unknown [8]. Second, for long-term dependencies in the data, where the gap between the relevant information and the place where it is needed is large, RNNs have only limited use. Thus, a special type of RNN, the Long Short-Term Memory (LSTM) network, has been introduced. LSTMs are designed to work on long-term dependencies in data. LSTMs prove to be effective in speech recognition tasks [7], where the special memory cells of LSTMs are used to identify long dependencies. Slightly different from the LSTM is the Gated Recurrent Unit (GRU), introduced in 2014 [9]. GRUs are also designed for long-term dependencies and work well with sequential data, as do LSTMs.

In this paper, we have evaluated the performance of RNN, LSTM, and GRU for speech recognition applied to the reduced TED-LIUM data set [10] using an appropriate regularization method. Different parameterized models are trained end-to-end using CTC for labeling sequence data.

The paper is structured as follows. Section 2 describes related work in the area of speech recognition and deep learning. In Section 3, the three approaches applied are described. Section 4 describes the experiments conducted and discusses the results. The conclusion and future work are provided in Section 5.

2 Related Work

Traditionally, generative models were used for speech recognition. Generative models are typically composed of Maximum-A-Posteriori (MAP) estimation, Gaussian Mixture Models (GMMs), and Hidden Markov Models (HMMs) [11, 12]. These traditional models require expert knowledge (i.e., knowledge of a specific language) as well as preprocessing of the text for Automatic Speech Recognition (ASR) [12]. Fortunately, an end-to-end ASR system does not require expert knowledge because it depends on paired acoustic and language data [12].

Recent advances in deep learning have given rise to the use of sequence-to-sequence models (discriminative models) for speech recognition [11, 13]. In simple terms, a sequence-to-sequence model for speech recognition takes an acoustic sequence as input and returns a transcript sequence as output [11].

Our work is guided by many previous works done in this area. In particular, RNNs have achieved excellent results in language modeling tasks as outlined in [14, 15]. Language modeling is based on a probabilistic model that is fitted to assign probabilities to sentences.
This is accomplished by predicting the next word in a sentence given previous word data. The experiments were performed on the popular Penn Tree Bank (PTB) data set [15].

Language modeling is key to many problems such as speech recognition, machine translation, or image captioning. RNNs and LSTMs have both been used to map sequences to sequences as well. Sequence-to-sequence models are made up of two RNNs: an encoder to process the input and a decoder to generate the output. In [16], multi-layered RNN cells have been used for translation tasks and were evaluated on a popular English-to-French translation task from the WMT'14 data set [16].

LSTM network architectures have proven to be better than standard RNNs at learning Context Free Languages (CFL) and Context Sensitive Languages (CSL), as described in [17]. Particularly in speech recognition tasks, RNNs and LSTMs have achieved excellent results. Sequence labeling is an important task in training RNNs during the speech recognition process. HMM-RNN frameworks were used in the past [18, 19]; however, they do not perform well with DNNs.

Graves et al. [8] came up with a very efficient Connectionist Temporal Classification (CTC) method of sequence labeling to train RNNs end-to-end. This method works very well for problems where the input-output label alignment is unknown, and the method does not require pre-segmented training data or post-processing of the outputs. In addition, Graves et al. also introduced deep LSTM RNNs and evaluated the framework on speech recognition. In particular, an RNN with CTC was used to train the model end-to-end. The authors achieved the best recorded score on the TIMIT phoneme recognition benchmark [7].

Sak et al. [20] evaluated and compared the performance of LSTM, RNN, and DNN architectures on a large vocabulary speech recognition problem, the Google English Voice Search Task. The authors used an updated architecture compared to the standard LSTM to make better use of the model parameters.

Several experiments have been performed on the TIMIT speech data set using bidirectional LSTMs, deep bidirectional LSTMs, RNNs, and hybrid architectures. Bidirectional LSTM networks have been used for the phoneme classification task [21, 22]. Results have shown that bidirectional LSTM performs better than unidirectional LSTM and standard RNNs on the frame-wise phoneme classification task. These results suggest that bidirectional LSTM is an effective architecture for speech processing where context information is very important.

A hybrid bidirectional LSTM-HMM system applied to phoneme recognition has proved to be an improvement over unidirectional LSTM-HMM as well as traditional HMM systems. Bidirectional LSTMs have also been applied to the handwriting recognition problem and were evaluated on both online and offline data, proving to outperform the state-of-the-art HMM-based system [23].

Deep bidirectional LSTMs have been applied to the speech recognition task whereby each hidden layer was replaced by a combination of a forward layer and a backward layer. In this architecture, every hidden layer receives input from both the forward and backward layer at one level below. This type of deep bidirectional LSTM network combined with the CTC objective function has been used for end-to-end speech recognition. The network was evaluated on the Wall Street Journal corpus [24]. The authors' novel approach with the objective function has allowed direct optimization of the word error rate, even in the absence of a language model. A hybrid of deep bidirectional LSTM and HMM system has been used for speech recognition on the TIMIT data [25], outperforming GMM deep network benchmarks on part of the Wall Street Journal corpus.

Different DNN architectures have been evaluated in recent years on hundreds of hours of speech data, such as the Wall Street Journal, Librispeech, Switchboard, TED-LIUM, and Fisher corpora [26, 27, 28]. Specifically, the TED-LIUM data set has been used in a variety of tasks such as audio augmentation [28] and modeling probabilities of pronunciation and silence [29]. Furthermore, the TED-LIUM data set has also been used in automatic speech recognition combined with human correction at the word level and lattice level [30].
Gated Recurrent Unit (GRU) recurrent neural networks were introduced by Cho et al. in 2014 [31]. GRUs are similar to LSTMs; both were designed to handle long-term dependencies. However, GRUs have a simpler structure than LSTMs. Both architectures have been used for polyphonic music modeling as well as for speech recognition tasks [9, 32]. The results show that GRUs are as efficient as LSTMs.

Our work is similar to the Baidu research [26], where a bidirectional RNN is used to speed up the speech recognition performance using GPU (Graphics Processing Unit) parallelization. We have used a single-GPU setup for our experiments to evaluate and compare the results of three recurrent networks, namely bidirectional RNN, bidirectional LSTM, and bidirectional GRU.

3 Recurrent Neural Network Architecture

This Section first outlines the three models used for the experimentation, followed by a description of the chosen architecture.

3.1 Recurrent Neural Network (RNN)

As mentioned earlier, in an RNN the decision made at time t − 1 affects the decision at time t. Thus, the decision of how the network responds to new data depends on two things: (1) the current input, and (2) the output from the recent past. An RNN calculates its output by iteratively calculating the following two equations.
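In standard notation — given here as a sketch, with symbols following the common convention of, e.g., [7] rather than a definition fixed by the text above — the two equations take the form

\[ h_t = \mathcal{H}(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \]
\[ y_t = W_{hy} h_t + b_y, \]

where \(\mathcal{H}\) is the hidden-layer activation function, the W terms are the input-to-hidden, hidden-to-hidden, and hidden-to-output weight matrices, and b_h, b_y are bias vectors.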
3.2 Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) networks, introduced by Hochreiter and Schmidhuber [33], are a special kind of RNN designed to capture long-term dependencies. The standard formulation of a single LSTM cell can be given by the following equations

\[ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f), \tag{3} \]
\[ i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i), \tag{4} \]
\[ \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C), \tag{5} \]
\[ C_t = f_t * C_{t-1} + i_t * \tilde{C}_t, \tag{6} \]
\[ o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \tag{7} \]
\[ h_t = o_t * \tanh(C_t), \tag{8} \]

where σ is the sigmoid function, tanh is the hyperbolic tangent function, and i, f, o, C, C̃ are the input gate, forget gate, output gate, memory cell content, and new memory cell content, respectively. The sigmoid function is used to form the three gates in the memory cell, whereas the tanh function is used to scale the output of a particular memory cell.
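For illustration only — not the implementation used in the experiments — a single LSTM cell step following Equations (3)-(8) can be sketched in NumPy as follows; the dictionary-of-weights layout is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step following Equations (3)-(8).

    W is a dict of weight matrices of shape (hidden, hidden + input) and
    b a dict of bias vectors of shape (hidden,), keyed by 'f', 'i', 'C', 'o'.
    """
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W['f'] @ z + b['f'])      # forget gate, Eq. (3)
    i_t = sigmoid(W['i'] @ z + b['i'])      # input gate, Eq. (4)
    C_tilde = np.tanh(W['C'] @ z + b['C'])  # new memory cell content, Eq. (5)
    C_t = f_t * C_prev + i_t * C_tilde      # memory cell update, Eq. (6)
    o_t = sigmoid(W['o'] @ z + b['o'])      # output gate, Eq. (7)
    h_t = o_t * np.tanh(C_t)                # hidden state, Eq. (8)
    return h_t, C_t
```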
3.3 Gated Recurrent Units (GRU)

Introduced in 2014 [9], GRUs are similar to LSTMs but have fewer parameters. They also have gated units, like LSTMs, which control the flow of information inside the unit, but without separate memory cells. Unlike the LSTM, the GRU does not have an output gate, thus exposing its full content. The GRU formulation can be given by the following equations.
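As a reference sketch in the style of [9, 31] — biases are omitted for brevity, and some formulations swap the roles of z_t and 1 − z_t — the update gate, reset gate, candidate activation, and new hidden state are

\[ z_t = \sigma(W_z \cdot [h_{t-1}, x_t]), \]
\[ r_t = \sigma(W_r \cdot [h_{t-1}, x_t]), \]
\[ \tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t]), \]
\[ h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t. \]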
3.4 Architecture for Experiments

The RNN architecture that we are using is based on [26]. The architecture takes speech spectrograms as an input and creates English text as an output. During pre-processing, a typically small window of a raw audio waveform is taken to compute the FFT (Fast Fourier Transform) and calculate the magnitude (power), which describes the frequency content in the selected local window. Then, the spectrogram is generated by concatenating frames from adjacent windows of the input audio. These spectrograms serve as the input features for the RNN.
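A rough sketch of this pre-processing step follows; the window length, hop size, and log compression shown here are illustrative assumptions rather than the exact settings used in the experiments.

```python
import numpy as np

def spectrogram(audio, sample_rate=16000, win_ms=20, hop_ms=10):
    """Compute a magnitude (power) spectrogram by sliding an FFT window
    over the raw waveform and stacking the frames; illustrative values only."""
    win = int(sample_rate * win_ms / 1000)   # samples per window
    hop = int(sample_rate * hop_ms / 1000)   # samples between adjacent windows
    window = np.hanning(win)
    frames = []
    for start in range(0, len(audio) - win + 1, hop):
        segment = audio[start:start + win] * window
        power = np.abs(np.fft.rfft(segment)) ** 2   # frequency content of the window
        frames.append(np.log(power + 1e-10))        # log-compress for numerical stability
    return np.stack(frames)                         # shape: (time frames, frequency bins)
```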
Consider a single input utterance x and label y sampled from the training set X = {(x^(1), y^(1)), (x^(2), y^(2)), ...}, where every utterance x^(i) is a time series of length T^(i); each time-slice is represented as a vector of speech features x_t^(i) with t = 1, ..., T^(i). Eventually, the sequence of inputs x is converted into a sequence of character probabilities for the transcription y as ŷ_t = P(c_t | x), where c_t ∈ {a, b, c, ..., z, space, apostrophe, blank}.

The RNN model consists of one input layer, one output layer, and five hidden layers. Figure 1 shows the architecture with the different layers and the notation used below. The hidden layers are denoted by h^(l). Thus, for input x, h^(0) is the input, and the output at the input layer depends on the spectrogram frame x_t and a context C of frames, where t is the time. The first three of the five hidden layers are normal feedforward layers. For each time t, these are calculated by

\[ h_t^{(l)} = g(W^{(l)} h_t^{(l-1)} + b^{(l)}). \tag{13} \]

In Equation 13, g(z) is a clipped rectified linear unit (ReLU) activation function that is used to calculate the output at the hidden layers, W^(l) represents the weight matrix, and b^(l) is the bias vector at layer l. In order to avoid the vanishing gradient problem, ReLU functions are chosen over sigmoid functions.
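For concreteness, the clipped ReLU of [26] has the form

\[ g(z) = \min\{\max\{z, 0\}, 20\}, \]

where the clipping threshold of 20 follows [26] and is assumed here rather than stated above.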
The following layer (layer 4) is a bidirectional recurrent layer (BRNN) consisting of one forward hidden sequence and one backward hidden sequence [7]. Standard RNNs make use of only the previous context information, but bidirectional RNNs explore the future context as well. In particular, for speech, where complete utterances are recorded at once, a BRNN is a better choice than a simple RNN. The set of two hidden sequence layers in the BRNN, one forward recurrent sequence h^(f) and one backward hidden sequence h^(b), can be formulated as

\[ h_t^{(f)} = g(W^{(4)} h_t^{(3)} + W_r^{(f)} h_{t-1}^{(f)} + b^{(4)}), \tag{14} \]
\[ h_t^{(b)} = g(W^{(4)} h_t^{(3)} + W_r^{(b)} h_{t+1}^{(b)} + b^{(4)}). \tag{15} \]

Please note that the forward sequence is calculated from t = 1, ..., t = T^(i) for the ith utterance, whereas the backward sequence is calculated from t = T^(i), ..., t = 1. This way, the BRNN processes the data in both directions using two separate hidden sequences, and then 'feeds forward' the output to the next layer, which is layer 5. Thus,

\[ h_t^{(5)} = g(W^{(5)} h_t^{(4)} + b^{(5)}), \tag{16} \]

where h_t^(4) = h_t^(f) + h_t^(b).
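A minimal sketch of this bidirectional layer (Equations (14)-(16)), assuming NumPy arrays for the weights and the clipped ReLU sketched above as g; this is illustrative only, not the training code used in the experiments.

```python
import numpy as np

def clipped_relu(z, clip=20.0):
    # assumed clipping threshold, see the note on g(z) above
    return np.minimum(np.maximum(z, 0.0), clip)

def brnn_layer(h3, W4, Wr_f, Wr_b, b4, g=clipped_relu):
    """Bidirectional recurrent layer as in Equations (14)-(16):
    a forward and a backward recurrence over the layer-3 activations,
    whose sum h^(4) = h^(f) + h^(b) is passed on to layer 5."""
    T = h3.shape[0]
    d = b4.shape[0]
    h_f = np.zeros((T, d))
    h_b = np.zeros((T, d))
    for t in range(T):                        # forward sequence, t = 1..T
        prev = h_f[t - 1] if t > 0 else np.zeros(d)
        h_f[t] = g(W4 @ h3[t] + Wr_f @ prev + b4)
    for t in reversed(range(T)):              # backward sequence, t = T..1
        nxt = h_b[t + 1] if t < T - 1 else np.zeros(d)
        h_b[t] = g(W4 @ h3[t] + Wr_b @ nxt + b4)
    return h_f + h_b                          # h^(4) = h^(f) + h^(b)
```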
Our experiments are performed with three models: one where we use a bidirectional RNN, the second with a bidirectional LSTM, and the last model with a bidirectional GRU layer. The formulation of the bidirectional LSTM or GRU is the same as for the bidirectional RNN; however, instead of an RNN we use either an LSTM cell or a GRU cell in layer 4.

For the output layer, a standard softmax function is used in order to calculate the predicted probabilities of the characters for each time slice t and character k in the alphabet. This output is given by

\[ h_{t,k}^{(6)} = \hat{y}_{t,k} \equiv P(c_t = k \mid x) = \frac{\exp(W_k^{(6)} h_t^{(5)} + b_k^{(6)})}{\sum_j \exp(W_j^{(6)} h_t^{(5)} + b_j^{(6)})}, \tag{17} \]

where W_k^(6) and b_k^(6) denote the kth column of the weight matrix and the kth bias, respectively.

After the character probabilities are calculated, the CTC loss is calculated next [26]. The CTC loss function is used to integrate over all possible alignments of characters. Thus, given the output of the network, the CTC loss function calculates the error of the predicted output as the negative log likelihood of the probability of the target. For this, the network output from Equation 17, the probabilities of the alphabet over each time frame, is the input to the CTC loss function. Given the definition of the CTC loss, the gradient of the loss with respect to its inputs has to be calculated. This loss can then be 'backpropagated' to the weights of the network.
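As an illustrative sketch of this CTC training objective, together with the ADAM update described in the next paragraph: PyTorch is used here purely for illustration, and the model shape, alphabet indexing, and hyperparameters are assumptions rather than those of the original experiments (note also that PyTorch concatenates the two recurrent directions instead of summing them as in Equation (16)).

```python
import torch
import torch.nn as nn

# 29 output symbols assumed: the CTC blank plus a-z, space, and apostrophe.
N_CLASSES = 29
N_FEATURES = 161   # e.g., spectrogram frequency bins; an assumption

class TinyBiGRU(nn.Module):
    """Toy stand-in for the bidirectional layer plus the softmax output layer."""
    def __init__(self, n_features=N_FEATURES, hidden=128, n_classes=N_CLASSES):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                   # x: (T, batch, n_features)
        h, _ = self.rnn(x)
        return self.out(h).log_softmax(dim=-1)   # log character probabilities per frame

model = TinyBiGRU()
ctc_loss = nn.CTCLoss(blank=0)              # negative log likelihood over all alignments
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(features, targets, input_lengths, target_lengths):
    optimizer.zero_grad()
    log_probs = model(features)             # (T, batch, N_CLASSES), cf. Equation (17)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    loss.backward()                         # gradient of the CTC loss w.r.t. the weights
    optimizer.step()                        # ADAM update
    return loss.item()
```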
The ADAM optimization algorithm [35] has been used for the backpropagation training since this training algorithm is very tolerant to the learning rate as well as to other training parameters, and thus requires less fine-tuning.

4 Experiments and Results

In this Section, first the data set is described and the evaluation measures used for the experiments are listed, followed by the results and discussion.

4.1 Data Set

We have used a subset of the improved TED-LIUM release 2 corpus [10]. The latest version of the TED-LIUM data set has an improved language model, which is an important factor in achieving reduced WER (Word Error Rate) values [2]. The data set is publicly available and contains filtered data comprising audio talks and their transcriptions obtained from the TED website. This corpus is particularly designed to train acoustic models. For our experiments, we have reduced the data set from its original size of 34.3 GB to 11.7 GB. The reduced data set is available at [36]. The data set has separate data folders for training, validation, and testing. The following is contained in the data set:

– 378 audio talks in NIST sphere format (SPH)
– 378 transcripts in STM format
– Dictionary having pronunciation (152k entries)
– Improved language model having selected monolingual data from the WMT12 corpus [2]

4.2 Evaluation Measures

In speech recognition, there are two different types of performance or evaluation measures, which are based on (1) accuracy and (2) speed [37]. Evaluation measures based on accuracy include WER, loss, and mean edit distance.

WER is the most commonly used error measurement in ASR. It is derived from the Levenshtein distance [38] and calculated as [39, 40]

\[ WER = \frac{S + I + D}{N} \times 100, \tag{18} \]

where S is the number of substitutions, I is the number of insertions, D is the number of deletions, and N is the number of words in the reference.
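For reference, a small illustrative implementation of this measure (word-level Levenshtein distance followed by Equation (18)):

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + I + D) / N * 100, computed via word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between the first i reference and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                            # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                            # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + sub)   # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref) * 100

# Example: one substitution and one deletion in a four-word reference -> WER = 50.0
print(word_error_rate("the cat sat down", "the cat sit"))
```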
Figure 3. WER values in % per epoch for all models and the 1,000-node architecture
5 Conclusion

Standard feedforward neural networks cannot handle speech data well because they lack a way to feed information from a later layer back to an earlier layer; thus, RNNs have been introduced to take the temporal dependencies of speech data into account. However, RNNs cannot handle long-term dependencies very well due to the vanishing/exploding gradient problem. Therefore, LSTMs and, a few years later, GRUs were introduced to overcome the shortcomings of RNNs.

This paper evaluated RNN, LSTM, and GRU and compared their performances on a reduced TED-LIUM speech data set. Two different architectures were evaluated: a network with 500 nodes and a network with 1,000 nodes in each layer. The evaluation measures used were WER, loss, mean edit distance, and the running time. The results show that the WER values of LSTM and GRU are close (with LSTM scoring slightly better than GRU); however, the running time of LSTM is larger than that of GRU. Thus, the recommendation for the reduced TED-LIUM speech data set is to use GRU since it returned good WER values within an acceptable running time.

Future work will include parameter optimization in order to investigate the influence of different parameter settings. Furthermore, the learning rate, the dropout rate, as well as higher numbers of neurons in the hidden layers will be experimented with.

Acknowledgment

This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. We also gratefully acknowledge the support of NVIDIA Corporation.

References

[1] G. E. Hinton, S. Osindero, Y. Teh, A fast learning algorithm for deep belief nets, Neural Computation 18, 1527-1554, 2006.

[2] A. Rousseau, P. Deléglise, Y. Estève, Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks. Proceedings of the Seventh Language Resources and Evaluation Conference, 3935-3939, May 2014.

[3] Y. Gaur, F. Metze, J. P. Bigham, Manipulating Word Lattices to Incorporate Human Corrections, Interspeech 2016, 17th Annual Conference of the International Speech Communication Association, San Francisco, CA, USA, September 2016.

[4] E. Busseti, I. Osband, S. Wong, Deep Learning for Time Series Modeling, Seminar on Collaborative Intelligence in the TU Kaiserslautern, Germany, June 2012.

[5] Deep Learning for Sequential Data - Part V: Handling Long Term Temporal Dependencies, https://prateekvjoshi.com/2016/05/31/deep-learning-for-sequential-data-part-v-handling-long-term-temporal-dependencies/, last retrieved July 2017.
[30] Y. Gaur, F. Metze, J. P. Bigham, Manipulating Word Lattices to Incorporate Human Corrections. Seventeenth Annual Conference of the International Speech Communication Association, 3062-3065, 2016.

[31] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014.

[32] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, Y. Bengio, End-to-end attention-based large vocabulary speech recognition. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4945-4949, March 2016.

[33] S. Hochreiter, J. Schmidhuber, Long Short-Term Memory. Neural Computation 9, 8, 1735-1780, November 1997.

[34] D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, Technical report, arXiv preprint arXiv:1409.0473, 2014.

[35] D. Kingma, J. Ba, Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[36] Reduced TED-LIUM release 2 corpus (11.7 GB), http://www.cs.ndsu.nodak.edu/~siludwig/data/TEDLIUM_release2.zip, last retrieved July 2017.

[37] Speech recognition performance, https://en.wikipedia.org/wiki/Speech_recognition#Performance, last retrieved July 2017.

[38] Levenshtein distance, https://en.wikipedia.org/wiki/Levenshtein_distance, last retrieved July 2017.

[39] A. C. Morris, V. Maier, P. Green, From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition. Eighth International Conference on Spoken Language Processing, 2004.

[40] Word error rate, https://en.wikipedia.org/wiki/Word_error_rate, last retrieved July 2017.

[41] A. Marzal, E. Vidal, Computation of normalized edit distance and applications, IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9), 926-932, 1993.