Using Gaussian Mixtures For Hindi Speech Recognition System
1. Introduction
The speech recognition problem is the task of taking an utterance of a speech signal as input, captured by a microphone (or microphone array), a telephone or another transducer, and converting it into a text sequence as close as possible to what was represented by the acoustic data. To make such a system ubiquitous, it should be independent of speaker and language characteristics such as accents, speaking styles, disfluencies (particularly important in spontaneous speech), syntax and grammar, and it should be capable of handling a large vocabulary [1].
Although ASR technology has made remarkable progress over the last 50 years, a large number of problems remain to be solved. Gaussian mixture evaluation of acoustic signals is one such problem, being a computationally expensive task. In such systems, the calculation of the state likelihoods accounts for a significant proportion (between 30% and 70%) of the total computational load [2]. A range of 8 to 64 mixture components per state has been found useful, depending on the amount of training data. Gradually increasing the number of mixtures in a Gaussian mixture model and then optimizing it is a tedious, time-consuming and expensive process. In this paper we present a novel approach to speed up statistical pattern classification by reducing the time consumed in likelihood evaluations of feature vectors, using an optimal number of Gaussian mixture components selected on the basis of empirical observations.
Various experiments were conducted using hidden Markov models (HMMs), varying the number of mixtures at the back end and using MFCC and its extensions for feature extraction at the front end. An analysis was carried out to select the parameters giving the best results at both ends.
All the investigations are based on experiments conducted in typical field conditions and in the context of the databases available for Indian languages. The rest of the paper is organized as follows: Section 2 describes the architecture and working of ASR, along with the issues related to data preparation for Indian languages. Feature extraction techniques are given in Section 3. Section 4 presents the use of HMMs with mixtures of multivariate Gaussians. In Section 5, an experimental comparison of ASR performance with various mixtures and training methods is presented. Finally, the paper concludes with a brief discussion of the experimental results.
2. Architecture and Working of ASR
ASR is a pattern classification approach divided into training and decoding (i.e. testing) parts. The training task consists of taking a collection of utterances with associated word labels and learning an association between the specified word models and the observed acoustics. This requires various information sources (i.e. databases and corpora) that include waveforms of isolated words or of phonetically labelled phrases. During recognition, the sequence of symbols generated by the acoustic components is compared with the set of words present in the lexicon to produce the optimal sequence of words that composes the system's final output. In order to cover words that are not seen in the acoustic training data, it is necessary to have a grapheme-to-phoneme (G2P) system that uses the word orthography to guess the pronunciation of the word [4].
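To make this decoding step concrete, the following is a minimal Python sketch of lexicon lookup with a G2P fallback for out-of-vocabulary words. The lexicon entries, phone symbols and the naive letter-per-phone fallback are illustrative assumptions, not the actual resources or G2P rules used in this work.

# Toy pronunciation lexicon; a real system uses a large hand-checked one.
LEXICON = {
    "namaste": ["n", "a", "m", "a", "s", "t", "e"],
    "bharat": ["bh", "aa", "r", "a", "t"],
}

def g2p_fallback(word):
    # Naive one-letter-per-phone guess; a trained G2P model as in [4]
    # would be used in practice.
    return list(word.lower())

def pronounce(word):
    # Use the lexicon when the word is known, otherwise guess the
    # pronunciation from the orthography.
    return LEXICON.get(word.lower(), g2p_fallback(word))

print(pronounce("Namaste"))   # lexicon hit
print(pronounce("Swagat"))    # OOV word -> G2P guess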
2.2 Data Preparation for Indian Languages
There are 22 officially recognized languages among the 200 or so different written languages used in the Indian subcontinent. Apart from a few Perso-Arabic scripts (i.e. Kashmiri, Sindhi and Urdu), all the other scripts used for Indian languages (i.e. Assamese, Bengali, Devanagari, Gujarati, Kannada, Oriya, Punjabi, Telugu, etc.) have evolved from the ancient Brahmi script and share a common phonetic structure. Brahmi-derived scripts are further subdivided into northern and southern groups. The northern group (of which Devanagari is a derivative) extends from northwestern India to Nepal and Tibet in the north, across the subcontinent to Bengal and Bangladesh, and further east to southeast Asia (including Thailand and Indonesia). The other group, also known as the Dravidian scripts (i.e. Tamil, Telugu, Kannada and Malayalam), is used predominantly in south India. Devanagari, as the script of Sanskrit literature, became the most widely used script in India by the 11th century. Languages written in Devanagari include Hindi, Marathi, Nepali and Sanskrit.
Table 1: Hindi Character Set
The sounds of human languages can be broadly classified into two categories, viz. vowels (V) and consonants (C). The vowels in Indian languages include short and long versions of the same sound. There are 12 basic vowels in the Hindi language, which are called the Barakhadi. The basic set of consonants is categorized according to the place and manner of articulation, as given in Table 1. Besides this, the Hindi language has some graphemes that do not have an atomic sound. They correspond to two or more concatenated phonemes, for example, AUM [] and RI []. These can be mapped to a string of unit phonemes.
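As a toy illustration of this mapping, the dictionary below expands such composite graphemes into strings of unit phonemes. The phone symbols are assumptions for illustration only, not the phone set used in the experiments.

# Composite Hindi graphemes expanded to strings of unit phonemes.
COMPOSITE_GRAPHEMES = {
    "ॐ": ["a", "u", "m"],  # AUM: one grapheme, three concatenated phonemes
    "ऋ": ["r", "i"],       # RI: one grapheme, two concatenated phonemes
}

def expand_grapheme(g):
    # Composite graphemes map to phone strings; atomic ones pass through.
    return COMPOSITE_GRAPHEMES.get(g, [g])

print(expand_grapheme("ॐ"))  # ['a', 'u', 'm']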
to enhance the distorted speech. Spectral subtraction is widely used as a simple technique to reduce additive noise in the spectral domain [9]. In order to eliminate the convolutive channel effect, Cepstral Mean Normalization (CMN) is applied, which removes the mean vector from the acoustic features of the utterance. An extension of CMN, Cepstral Variance Normalization (CVN), also adjusts the feature variance to improve ASR robustness [10]. Relative spectra (RASTA) processing and its variants, such as J-RASTA and phase-corrected RASTA, have also been used to reduce both communication channel effects and noise distortion [11].
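A brief sketch of CMN and CVN as described above is given below, assuming a matrix of cepstral features with one row per frame; the random data is a stand-in for real MFCCs.

import numpy as np

def cmn(feats):
    # Cepstral Mean Normalization: subtract the per-coefficient mean
    # over the utterance to remove the convolutive channel effect.
    return feats - feats.mean(axis=0)

def cmvn(feats, eps=1e-8):
    # CMN plus Cepstral Variance Normalization: also scale each
    # coefficient to unit variance for extra robustness.
    return cmn(feats) / (feats.std(axis=0) + eps)

feats = np.random.randn(300, 13) * 2.0 + 5.0  # fake 300-frame utterance
print(cmvn(feats).mean(axis=0).round(6))      # ~0 after normalization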
models like triphones can be constructed in two ways: either word internal or cross word. When constructing word internal models, context beyond the word borders is not considered. On the other hand, for cross word triphones, the phonemes at the end or beginning of neighbouring words are considered to affect the phonology used for modeling [13]. Context dependent modelling significantly increases the number of model parameters to be estimated. The most common solutions use some form of parameter or distribution tying, in which equivalence classes are defined between model constructs (e.g. HMM states) and constructs in the same class share the same parameters for the associated distributions [14].
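The difference between the two constructions can be sketched as follows, using the common L-C+R triphone notation; the toy phone sequences and the use of 'sil' at utterance edges are illustrative assumptions.

def word_internal_triphones(words):
    # Context is cut at word boundaries: each word is expanded alone.
    out = []
    for phones in words:
        for i, p in enumerate(phones):
            left = phones[i - 1] if i > 0 else "sil"
            right = phones[i + 1] if i < len(phones) - 1 else "sil"
            out.append(f"{left}-{p}+{right}")
    return out

def cross_word_triphones(words):
    # Context crosses word boundaries: flatten first, then expand.
    phones = [p for w in words for p in w]
    return [
        f"{phones[i - 1] if i > 0 else 'sil'}-{p}+{phones[i + 1] if i < len(phones) - 1 else 'sil'}"
        for i, p in enumerate(phones)
    ]

utt = [["n", "a", "m"], ["t", "e"]]   # two toy words
print(word_internal_triphones(utt))   # boundary context is 'sil'
print(cross_word_triphones(utt))      # boundary context is the neighbour

Note that the two expansions differ exactly at the word boundary, which is where cross-word modelling captures the extra context.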
4.2 Hidden Markov Model
Each subword unit is realized by a hidden Markov model in most state-of-the-art LVCSR systems. An HMM is a statistical model [3] for an ordered sequence of symbols, acting as a stochastic finite state machine that is assumed to be built up from a finite set of possible states, each of which is associated with a specific probability distribution or probability density function (pdf). The three fundamental problems of HMMs are probability evaluation, determination of the best state sequence, and parameter estimation. Probability evaluation can be realized easily with the forward algorithm [15]. The determination of the best state sequence is often referred to as a decoding or search process; Viterbi search [16] and A* search [17] are the two major search algorithms. Parameter estimation in ASR is solved with the well-known maximum likelihood estimation (MLE) using a forward-backward procedure [18]. Several discriminative training methods have been proposed in recent years to boost ASR system accuracy, such as maximum mutual information estimation (MMIE), minimum classification error (MCE), and minimum word error/minimum phone error (MWE/MPE) [19].
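As a sketch of the decoding problem, the following is a minimal Viterbi search over a toy discrete HMM; real LVCSR decoders work with Gaussian-mixture emission densities and far larger state spaces, but the log-domain recurrence is the same in spirit. All parameter values here are assumptions for illustration.

import numpy as np

def viterbi(log_A, log_pi, log_B, obs):
    # log_A: (S x S) log transition matrix; log_pi: (S,) log initial
    # probabilities; log_B: (S x V) log emission matrix; obs: symbol ids.
    S = log_A.shape[0]
    T = len(obs)
    delta = np.empty((T, S))            # best log score ending in state j
    psi = np.zeros((T, S), dtype=int)   # best predecessor backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    # Backtrack from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

log_A = np.log(np.array([[0.7, 0.3], [0.4, 0.6]]))
log_pi = np.log(np.array([0.6, 0.4]))
log_B = np.log(np.array([[0.5, 0.5], [0.1, 0.9]]))
print(viterbi(log_A, log_pi, log_B, [0, 1, 1]))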
In MLE-based HMM training we find those model parameters $\lambda$ which maximize the likelihood of the HMMs having generated the training data. Thus, given training data $Y^{(1)}, \ldots, Y^{(R)}$, the maximum likelihood (ML) training criterion may be expressed as:

$$\mathcal{F}_{\mathrm{MLE}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R} \log p\big(Y^{(r)} \mid w_{\mathrm{ref}}^{(r)}; \lambda\big) \qquad (1)$$

where $Y^{(r)}$ is the $r$-th training utterance with transcription $w_{\mathrm{ref}}^{(r)}$. This optimization is normally
performed using EM [20]. However, for ML to be the best training criterion, the data and
models would need to satisfy a number of requirements, in particular, training data
sufficiency and model-correctness [21].
The MPE criterion is a smoothed approximation to the phone transcription accuracy measured on the output of a word recognition system given the training data. The objective function in MPE, which is to be maximized, is:

$$\mathcal{F}_{\mathrm{MPE}}(\lambda) = \sum_{r=1}^{R} \sum_{S} P_k(S \mid O_r)\, A(S, S_r) \qquad (2)$$

where $\lambda$ represents the HMM parameters and $P_k(S \mid O_r)$ is the scaled posterior probability of the sentence $S$ being the correct one (given the model), formulated as:

$$P_k(S \mid O_r) = \frac{P(O_r \mid S)^k\, P(S)^k}{\sum_{u} P(O_r \mid u)^k\, P(u)^k} \qquad (3)$$

where $k$ is a scaling factor, typically less than one; $O_r$ is the speech data for the $r$-th training sentence; and $A(S, S_r)$ is the raw phone transcription accuracy of the sentence $S$ given the reference $S_r$, which equals the number of reference phones minus the number of errors [22, 23].
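The raw accuracy $A(S, S_r)$ can be sketched with a standard edit-distance computation, since it is defined as the number of reference phones minus the number of errors. The phone labels below are toy values, and MPE implementations typically use a smoothed, lattice-based approximation of this quantity rather than the exact alignment shown here.

def phone_accuracy(hyp, ref):
    # A(S, S_r) = reference phones minus errors (substitutions,
    # deletions, insertions), via the Levenshtein recurrence.
    n, m = len(hyp), len(ref)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return m - d[n][m]

print(phone_accuracy(["a", "b", "c"], ["a", "c", "c"]))  # 3 refs - 1 error = 2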
4.3 Database for Speech Recognition
For the estimation of the acoustic model parameters $\lambda$ and the evaluation of ASR performance, a corpus of training utterances is required, also known as a speech database. Ideally, speech databases are labelled with textual transcriptions, and each speech signal is aligned with its words and phones so that word-based and phone-based models can be trained automatically.
For the design and development of ASR systems for European languages, large standard databases prepared by various agencies are available. For example, TIMIT and ATIS are two of the most important databases used to build acoustic models of American English in ASR systems [24]. However, not much effort has been made so far to prepare standard databases of this kind for Indian languages. The few databases available for Indian languages are relatively small and phonetically not very rich, as they were prepared by various research groups primarily for their own use.
4.4 Mixtures of Multivariate Gaussians
To model the complex speech signal, mixtures of Gaussians have been used as emission pdfs in the hidden Markov models. In such systems, the output likelihood of an HMM state $S$ for a given observation vector $X_n$ can be represented as a weighted sum of probabilities:
$$p(X_n \mid S) = \sum_{k=1}^{K} w_k\, p_k(X_n) \qquad (4)$$

where the parameters of the state pdf are the number of mixture components $K$; the weighting factors $w_k$, which satisfy $w_k \geq 0$ and $\sum_{k=1}^{K} w_k = 1$; and the mean vector $\mu_k$ and variance-covariance matrix $\Sigma_k$ of the $k$-th mixture component. Each mixture component is a $D$-dimensional multivariate Gaussian density function defined as:

$$p_k(X_n) = \frac{1}{(2\pi)^{D/2}\, |\Sigma_k|^{1/2}} \exp\left( -\frac{(x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k)}{2} \right) \qquad (5)$$
In practice the full covariance matrices are reduced to diagonal covariances for computational and data sparseness reasons. By substituting the probabilities defined in Equation (5), the state model in Equation (4) becomes the Gaussian mixture model (GMM) defined as:

$$p(X_n \mid S) = \sum_{k=1}^{K} Z_k \exp\left( -\frac{1}{2} \sum_{q=1}^{D} \frac{(x_{nq} - \mu_{kq})^2}{\sigma_{kq}^2} \right) \qquad (6)$$

where $Z_k = w_k \big/ \big( (2\pi)^{D/2} \prod_{q=1}^{D} \sigma_{kq} \big)$ collects the mixture weight and the normalization term of the $k$-th component. In order to compute efficiently and to avoid underflow, probabilities are computed in the log domain. Therefore the log likelihood can be expressed as:

$$\log p(X_n \mid S) = \log \sum_{k=1}^{K} \exp\left( \log Z_k - \frac{1}{2} \sum_{q=1}^{D} \frac{(x_{nq} - \mu_{kq})^2}{\sigma_{kq}^2} \right) \qquad (7)$$

$$\approx \max_{k} \left[ \log Z_k - \frac{1}{2} \sum_{q=1}^{D} \frac{(x_{nq} - \mu_{kq})^2}{\sigma_{kq}^2} \right] \qquad (8)$$
In a typical HMM-based LVCSR system, the number of model states ranges from 2000 to 6000, each of which is a weighted sum of typically 8 to 64 multidimensional Gaussian distributions as in Equation (6). For each input frame, the output likelihoods must be evaluated against each active state. The state likelihood estimation is therefore computationally intensive and takes about 30-70% of the total recognition time [25]. This likelihood-based statistical acoustic decoding is so time consuming that it is one of the main reasons why recognition is slow. Some LVCSR systems may even decode speech several times slower than real time; that is, such systems are not practical for most spontaneous applications, such as man-machine dialogue. It is therefore necessary to develop efficient techniques that reduce the time consumed by likelihood computation without significantly degrading recognition accuracy [2].
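A minimal sketch of the state likelihood of Equations (6)-(8) for a diagonal-covariance GMM is given below, with both the exact log-sum-exp evaluation and the cheaper best-component approximation; all parameter values are toy assumptions.

import numpy as np

def gmm_state_loglik(x, weights, means, variances, use_max=False):
    # x: (D,) observation; weights: (K,); means, variances: (K x D).
    D = x.shape[0]
    # log Z_k = log w_k - (D/2) log(2*pi) - (1/2) sum_q log sigma_kq^2
    log_Z = (np.log(weights)
             - 0.5 * D * np.log(2 * np.pi)
             - 0.5 * np.log(variances).sum(axis=1))
    # Exponent of Equation (6): half the variance-scaled squared distance.
    dist = 0.5 * (((x - means) ** 2) / variances).sum(axis=1)
    log_terms = log_Z - dist               # one log term per component
    if use_max:
        return log_terms.max()             # Equation (8)-style shortcut
    m = log_terms.max()                    # stable log-sum-exp, Eq. (7)
    return m + np.log(np.exp(log_terms - m).sum())

K, D = 4, 39
rng = np.random.default_rng(0)
w = np.full(K, 1.0 / K)
mu = rng.normal(size=(K, D))
var = np.ones((K, D))
x = rng.normal(size=D)
print(gmm_state_loglik(x, w, mu, var))
print(gmm_state_loglik(x, w, mu, var, use_max=True))

The max-based shortcut avoids the exponentials entirely, which is one reason it is attractive when thousands of states must be evaluated per frame.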
4.5 Large Margin Training of GMM
5. Experimental Results
The input speech was sampled at 12 kHz and then processed at a 10 ms frame rate with a Hamming window of 25 ms to obtain the feature vectors. A CDHMM with linear left-right topology was used to compute the score of a sequence of features against its phonetic transcription. To compute the likelihood of a given word, the word is broken into subwords or its constituent phones, and the likelihood of the phones is computed from the HMMs.
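The front-end framing just described can be sketched as follows; the signal here is synthetic, and the MFCC computation that would follow is omitted.

import numpy as np

fs = 12000                      # sampling rate (Hz)
frame_len = int(0.025 * fs)     # 25 ms window -> 300 samples
frame_shift = int(0.010 * fs)   # 10 ms frame rate -> 120 samples

signal = np.random.randn(fs)    # one second of fake speech
window = np.hamming(frame_len)

# Cut the signal into overlapping, Hamming-windowed frames.
frames = np.stack([
    signal[start:start + frame_len] * window
    for start in range(0, len(signal) - frame_len + 1, frame_shift)
])
print(frames.shape)             # (num_frames, 300)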
Three different HMMs, based on word, context independent (CI) phone, and triphone modeling units, were implemented. In the phoneme-based HMM, a total of 48 CI phone models was used. We used word internal triphone models for our experiments.
At the front end, two feature extraction methods, standard MFCC and an extended form of MFCC, were investigated. At the back end, two training methods, MLE and MPE, were used. The experiment was performed on a set of speech data consisting of four hundred Hindi words recorded by 10 male and 10 female speakers. Since databases from non-Indian languages cannot be used for Hindi (owing to language-specific effects), we developed our own corpus, which includes documents from the EMILLE text corpus [29] and popular Hindi newspapers. Testing was performed on fifty randomly chosen sentences spoken by different speakers, and the recognition rate (i.e. accuracy) was calculated as:

$$\text{Recognition rate} = \frac{\text{number of correctly recognized words}}{\text{total number of words}} \times 100\%$$
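As a trivial worked example of this measure, assuming it is the percentage of correctly recognized words (the counts below are illustrative, not experimental results):

def recognition_rate(correct, total):
    # Percentage of correctly recognized words.
    return 100.0 * correct / total

print(recognition_rate(correct=44, total=50))  # -> 88.0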
[Table: recognition accuracy (%) versus the number of Gaussian mixture components (1 to 16) for the different modeling units; the tabular layout is not recoverable from the source.]
6. Conclusion
Recognition of human speech is a problem with many solutions, but it remains open because none of the current methods is fast and precise enough to be comparable with the recognition capability of human beings. Although various methods exist, very few are used in real automatic speech recognition systems; most are still at the stage of experimental research and do not provide results convincing enough to be integrated. In this paper we have proposed a novel approach to developing a speaker independent speech recognition system for the Hindi language, using MFCC and its extensions for feature extraction and HMMs with Gaussian mixtures to generate the acoustic models.
In our approach the number of mixture components is kept fixed, while the means and variances may vary from state to state. Experimental results have shown that only 4 Gaussian mixtures, applied with discriminative and margin based techniques, yield optimal performance in the context of the small databases available for Indian languages, which were used to train the hidden Markov models. The results also illustrate that for small vocabularies of up to 200 words the whole word model gives maximum accuracy, and beyond that the triphone model must be used for better accuracy. If 3rd order MFCC combined with HLDA (i.e. extended MFCC) is used for feature extraction at the front end, and discriminative minimum phone error (MPE) or margin based techniques are applied at the back end, ASR accuracy can be improved by 5-6%.
References
[1] D. O'Shaughnessy, Interacting with Computers by Voice: Automatic Speech Recognition and Synthesis (Invited Paper), Proceedings of the IEEE, Vol. 91, No. 9, 2003, pp. 1272-1305.
[2] J. Cai, G. Bouselmi, Y. Laprie and J.-P. Haton, Efficient Likelihood Evaluation and Dynamic Gaussian Selection for HMM-Based Speech Recognition, Computer Speech and Language, Vol. 23, 2009, pp. 147-164.
[3] F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1997.
[4] N. Goel, S. Thomas, M. Agarwal et al., Approaches to Automatic Lexicon Learning with Limited Training Examples, Proc. of IEEE Conference on Acoustics, Speech and Signal Processing, 2010.
[5] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[6] R. Haeb-Umbach and H. Ney, Linear Discriminant Analysis for Improved Large Vocabulary Continuous Speech Recognition, in Proceedings of ICASSP, 1992, pp. 13-16.
[7] N. Kumar and A. G. Andreou, Heteroscedastic Discriminant Analysis and Reduced Rank HMMs for Improved Speech Recognition, Speech Communication, Vol. 26, 1998, pp. 283-297.
[8] S. Davis and P. Mermelstein, Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 28, 1980, pp. 357-366.
[9] S. F. Boll, Suppression of Acoustic Noise in Speech using Spectral Subtraction, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 27, No. 2, 1979, pp. 113-120.
[10] S. Molau, F. Hilger and H. Ney, Feature Space Normalization in Adverse Acoustic Conditions, Proc. of ICASSP, 2003, pp. 656-659.
[11] H. Hermansky and N. Morgan, RASTA Processing of Speech, IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 4, 1994, pp. 578-589.
[12] S. Young, A Review of Large Vocabulary Continuous Speech Recognition, IEEE Signal Processing Magazine, Vol. 13, 1996, pp. 45-57.
[13] C. H. Lee, J. L. Gauvain, R. Pieraccini and L. R. Rabiner, Large Vocabulary Speech Recognition using Subword Units, Speech Communication, Vol. 13, 1993, pp. 263-279.
[14] W. Reichl and W. Chou, Robust Decision Tree State Tying for Continuous Speech Recognition, IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 5, 2000, pp. 555-566.
[15] L. R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,
Proceedings of the IEEE, Vol. 77, No. 2, 1989, pp. 257-286.
[16] H. Ney, The Use of a One-Stage Dynamic Programming Algorithm for Connected Word Recognition,
IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 32, No. 2, 1984, pp. 263-271.
[17] D. B. Paul, Algorithms for an Optimal A* Search and Linearizing the Search in the Stack Decoder, Proc.
ICASSP, Vol. 1, 1991, pp. 693-696.
[18] X. D. Huang, Y. Ariki and M. A. Jack, Hidden Markov Models for Speech Recognition, Edinburgh University Press, 1990.
[19] H. Jiang, Discriminative Training of HMMs for Automatic Speech Recognition: A Survey, Computer Speech and Language, Vol. 24, 2010, pp. 589-608.
[20] A. P. Dempster, N. M. Laird and D. B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Series B, Vol. 39, 1977, pp. 1-38.
[21] M. Gales and S. Young, The Application of Hidden Markov Models in Speech Recognition, Foundations and Trends in Signal Processing, Vol. 1, No. 3, 2007, pp. 195-304.
[22] K. C. Sim and M. J. F. Gales, Minimum Phone Error Training of Precision Matrix Models, IEEE Transactions on Audio, Speech and Language Processing, 2006.
[23] H. Xu, D. Povey, J. Zhu and G. Wu, Minimum Hypothesis Phone Error as a Decoding Method for Speech Recognition, Interspeech (ISCA), 2009, pp. 76-79.
[24] C. Becchetti and L. P. Ricotti, Speech Recognition: Theory and C++ Implementation, John Wiley & Sons, 2004.
[25] M. J. F. Gales, K. M. Knill and S. J. Young, State-Based Gaussian Selection in Large Vocabulary Continuous Speech Recognition using HMMs, IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 2, 1999, pp. 152-161.
[26] F. Sha and L. K. Saul, Large Margin Hidden Markov Models for Automatic Speech Recognition, in B. Scholkopf, J. Platt and T. Hoffman (Eds.), Advances in Neural Information Processing Systems, Vol. 19, MIT Press, 2007, pp. 1249-1256.
[27] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[28] H. Jiang and X. Li, Incorporating Training Errors for Large Margin HMMs under Semi-Definite Programming Framework, Proc. ICASSP, 2007.
[29] ELRA Catalogue, The EMILLE/CIIL Corpus, Catalogue Reference: ELRA-W0037, http://catalog.elra.info
[30] SPHINX: An Open Source Speech Recognition System at CMU: http://cmusphinx.sourceforge.net/html/cmusphinx.php.
[31] Hidden Markov Model Toolkit (HTK-3.4.1): http://htk.eng.cam.ac.uk.
[32] Julius: An Open Source LVCSR Engine: http://julius.sourceforge.jp.
[33] M. Kumar, A. Verma and N. Rajput, A Large-Vocabulary Speech Recognition System for Hindi, IBM Journal of Research and Development, Vol. 48, 2004, pp. 703-715.
[34] R.K. Aggarwal and M. Dave, Discriminative Techniques for Hindi Speech Recognition System,
Communication in Computer and Information Science (Information Systems for Indian Languages),
Springer-Verlag Berlin Heidelberg, Vol. 139, 2011, pp. 261-266.
Authors
R. K. Aggarwal received his M.Tech. degree in 2006 and is pursuing a PhD at the National Institute of Technology, Kurukshetra, India. He is currently also working as an Associate Professor in the Department of Computer Engineering of the same institute. He has published more than 24 research papers in various international and national journals and conferences, and has served as an active reviewer for many of them. He has delivered several invited talks and keynote addresses and has chaired sessions at reputed conferences. His research interests include speech processing, soft computing, statistical modeling, and science and spirituality. He is a life member of the Computer Society of India (CSI) and the Indian Society for Technical Education (ISTE). He has been involved in various academic, administrative and social affairs of many organizations, with more than 20 years of experience in this field.
Mayank Dave obtained his M.Tech. degree in Computer Science and Technology from IIT Roorkee, India in 1991 and his PhD from the same institute in 2002. He is presently working as an Associate Professor in the Department of Computer Engineering at NIT Kurukshetra, India, with more than 19 years of experience in academic and administrative affairs at the institute. He presently heads the Department of Computer Engineering and the Department of Computer Applications. He has published approximately 60 research papers in various international and national journals and conferences. He has coordinated several projects and training programs for students and faculty, delivered a number of expert lectures and keynote addresses on different topics, and guided four PhDs and several M.Tech. dissertations. His research interests include peer-to-peer computing, pervasive computing, wireless sensor networks and database systems.