Urdu Speech Recognition System for District Names of Pakistan: Development, Challenges and Solutions
Abstract— Speech interfaces provide people an easy and comfortable means to interact with computer systems. Speech recognition is a core component of speech interfaces which recognizes human speech in a particular language. Some small vocabulary speech recognition systems for Urdu with reasonable accuracy have been developed, but these systems are either speaker dependent or unable to handle the large accent variations that exist among the people of Pakistan. This paper presents a speaker independent Urdu speech recognition system for the district names of Pakistan that performs well for the major accents of Urdu. Two methods have been studied and analyzed to handle accent variation. The systems have been tested in the laboratory and in the field, and their results are reported in this paper. We conclude that an accent independent system performs better for isolated words and that the addition of field data to the training of the system improves the overall recognition accuracy.

Keywords—speech recognition; isolated word; speaker independent; accent variation; field testing

I. INTRODUCTION

Access to online information has become essential for development in today's age, and the literate population of the world is benefiting from it. On the other hand, barriers such as a low literacy rate and limited internet connectivity hinder the illiterate and semi-literate population from accessing invaluable online resources. To overcome this challenge, speech interfaces are used to provide information to users in their local languages. Speech interfaces are also helpful for visually challenged persons. The speech recognition system is a fundamental component of these speech interfaces.

Speech recognition systems are categorized into three types: isolated word, connected word and continuous speech recognition systems. Isolated word systems recognize only isolated words, connected word systems recognize connected words, while continuous speech recognition systems recognize words spoken as in natural conversation. Speech recognition systems are used in spoken dialog, dictation, language learning and speech translation systems, to name a few.

Speech recognition systems require the design and development of a speech corpus, language models and grammar specifications for the language for which the system is to be developed. Corpus development includes the collection, careful annotation, cleaning and verification of speech data. These resources are mostly unavailable for Urdu, due to which speech recognition for Urdu is still at a very basic level. A few speech corpora for Urdu consisting of isolated words have recently been developed that may be used to build isolated word speech recognition systems.

This paper presents an isolated word Urdu speech recognition system to recognize 139 district names of Pakistan. The system is developed for use in a spoken dialog system that provides weather information to the citizens of Pakistan in Urdu. In order to develop a speaker independent system, we need to handle the accent variation of speakers calling from all over Pakistan. Major languages spoken in Pakistan include Punjabi, Sindhi, Pashto, Balochi, Seraiki and Urdu, and the accent of each language's speakers is significantly different from the others. Therefore, we have implemented and evaluated different techniques to handle this accent variation.

The system also needs to be robust enough to work in low to mild noise environments. This is achieved through pre-processing of the speech input signal. Rigorous field testing of the speech recognition system in a real-world environment is conducted to evaluate its performance.

The rest of the paper is structured as follows: Section 2 presents the literature review. Section 3 describes the methodology used to build the ASR for the weather forecast system and its laboratory results. Section 4 discusses the results of the system's field testing and the improvements made to the system based on those results. Finally, Section 5 concludes our discussion and results.

II. LITERATURE REVIEW

Speech recognition systems have been in use for decades now. The earliest speech recognition system, Audrey [1], was developed in 1952 at Bell Laboratories and recognized digits from a single speaker. Speech recognition has advanced considerably in the last few decades. Speech recognition systems for English include Nuance's Dragon, BBN's BYBLOS [2] and MIT's SUMMIT [3]. The BBN continuous speech recognition system, BYBLOS [2], is a large vocabulary speech recognition system. BYBLOS was tested on two domains: Electronic Mail (EMAIL), consisting of 334 words, and Naval Database Retrieval, consisting of 354 words. The system achieves an accuracy of 98.5% in speaker-dependent mode and 97% in speaker-adapted mode.
The SUMMIT [3] speech recognition system developed by MIT used 1500 sentences recorded from 300 speakers for training. This system achieved an accuracy of 87% on the DARPA Resource Management task. The study [4] discusses a continuous speech recognition system for the French language with a vocabulary of 65,000 words and reports an accuracy of 88.8%.

Speech recognition for resource-rich languages has excelled greatly, but it is still at a rudimentary level for under-resourced languages such as Urdu. In order to develop speech recognition systems for under-resourced languages, Speech-based Automated Learning of Accent and Articulation Mapping (SALAAM) [5] uses the existing resources of "developed languages" such as English to develop systems for "developing languages".

The SALAAM method was tested for 10 diverse languages and yielded word recognition rates around 90% for 3 to 10 words. An improved method of speech recognition for low resource languages based on SALAAM is presented in [6]. The authors used US English as the source language, with Yoruba, Hindi and Hebrew as target languages. They reported word recognition rates of up to 90% for vocabulary sizes of 20-30 words, but the rate drops sharply for larger vocabularies (the word recognition rate for a 50-word Hindi vocabulary is around 77%). Due to the relatively large vocabulary of 139 district names of Pakistan, the SALAAM method was not used in our study.

Recently, speech recognition systems for several under-resourced languages have been developed. These include an isolated word system for the Arabic language [7]. The system is used to assist callers through a speech interface, with a recognition rate of 88.8%. The study [8] presents a Hindi isolated word speech recognition system with a vocabulary of 113 words. The training data was collected from five male and four female speakers. The word recognition rate for speakers present in the training data is 96.61%; for speakers not present in the training data, it is 95.49%. An isolated word speech recognizer for the Bangla language with a vocabulary of 100 words has also been developed [9]. The authors reported an accuracy of 90% for the speaker dependent system and 70% for the speaker independent system.

The development of an isolated word speech recognition system for Urdu is described in [10]. The vocabulary size was 52 words and the training data was collected from ten speakers. The word recognition rate for speakers present in the training data is 94.67%; for speakers not present in the training data, it is 89.34%. The system is not feasible for a real-world application as the data used is statistically insignificant.

A first attempt to develop a continuous speech recognition system for Urdu is discussed in [11]. The corpus for the system consisted of data recorded from 42 male and 40 female speakers with a total duration of 45 hours. The reported word recognition rate was around 40% due to insufficient noise modeling, diacritic issues and the lack of accent variation handling. In this paper, we present an isolated word speech recognizer supporting six major accents of Urdu with reasonable accuracy both in laboratory testing and in field testing.

III. METHODOLOGY

This section presents the development of a speaker independent Automatic Speech Recognition (ASR) system for the district names of Pakistan. In order to handle accent variation, we developed two types of speech recognition systems: an accent dependent system and an accent independent system. We have used the CMU Sphinx [12] toolkit for the development of the speech recognition systems. Additionally, we developed an accent classifier to identify the accent of any unknown utterance. The following subsections discuss the development of the speech corpus and the speech recognition systems.
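Because the recognizer only has to distinguish a closed list of 139 district names, the isolated word task can be expressed as a small finite grammar for the Sphinx decoder, one of the resources noted in the introduction. The snippet below is a minimal sketch of how such a JSGF grammar file could be generated from the name list; the example district names, romanized transcriptions and the output file name are illustrative assumptions, not the project's actual grammar.

```python
# Sketch: generate a JSGF grammar for the closed set of district names.
# The names below are only a few illustrative examples; the deployed
# system would enumerate all 139 districts in the same way.
district_names = ["lahore", "karachi", "peshawar", "quetta", "multan"]

grammar = "#JSGF V1.0;\n\ngrammar districts;\n\npublic <district> = "
grammar += "\n    | ".join(district_names) + " ;\n"

with open("districts.gram", "w", encoding="utf-8") as f:
    f.write(grammar)
```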
A. Speech corpus

The speech corpus consisted of more than nine hours of speech data recorded from 300 speakers (both male and female) from all over Pakistan, covering the aforementioned six major accents. The details of the corpus are given in Table I. The data was collected over a mobile channel using an Asterisk Private Branch Exchange (PBX) at an 8 kHz sampling rate and a digitization rate of 16 bits. The data was cleaned and verified by expert linguists using a set of developed guidelines [13].

TABLE I. CORPUS DETAILS

First language of speaker | Number of utterances | Duration (minutes)
Urdu | 3424 | 51
Punjabi | 4493 | 71
Pashto | 7934 | 125
Sindhi | 5231 | 70
Balochi | 13262 | 113
Seraiki | 11494 | 141
Total | 41293 | 574

B. Pre-processing

A Voice Activity Detector (VAD) is a pre-processing technique used to separate the voice and non-voice portions of a user input. We used the procedure outlined in [14] for VAD, and only the speech portion of the file is fed to the ASR for recognition. This pre-processing improves the accuracy of the speech recognition systems. The VAD works by computing the statistics of the background noise from the first 200 samples of the recorded file; it can therefore handle static noise, such as fans running in the background, but fails to handle dynamic noise, such as people talking in the background. VAD and background noise are closely linked, and it is not incorrect to treat voice activity detection errors as consequences of dynamic noise. Most of the errors in voice activity detection are also due to background noise: if the noise is high and comparable in magnitude to the spoken word, it will inevitably cut off some portion of the speech as well, and the final speech given as input to the ASR will be an incomplete word.
On the other hand, if the noise is very low at the start but changes midway through the recording, for example when someone starts speaking loudly in the background or a car horn honks, the noise will be treated as speech and the ASR will most likely give a wrong output.
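The endpoint detection idea behind [14] can be illustrated with a short energy-based sketch. This is a minimal illustration rather than the exact algorithm of [14]: it assumes the first 200 samples are background noise, derives an energy threshold from them, and keeps only the frames above that threshold. The frame length and margin factor are assumptions.

```python
import numpy as np

def simple_vad(signal, frame_len=80, noise_samples=200, margin=3.0):
    """Energy-based voice activity detection sketch (8 kHz signal assumed).

    Estimates the background-noise energy from the first `noise_samples`
    samples and keeps only the frames whose energy exceeds that estimate
    by a `margin` factor.
    """
    # Noise statistics from the leading, assumed-silent portion.
    noise = signal[:noise_samples].astype(float)
    noise_energy = np.mean(noise ** 2) + 1e-10

    # Frame the signal and compute per-frame energy (80 samples = 10 ms at 8 kHz).
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len).astype(float)
    energy = np.mean(frames ** 2, axis=1)

    # A frame is treated as speech if its energy is well above the noise floor.
    speech_mask = energy > margin * noise_energy
    if not speech_mask.any():
        return signal  # fall back to the full signal if nothing is detected

    first = np.argmax(speech_mask)
    last = len(speech_mask) - np.argmax(speech_mask[::-1]) - 1
    return frames[first : last + 1].reshape(-1)
```

Only the segment returned by such a detector would then be passed on to the recognizer.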
C. Accent Identification

Accent is linked to the articulation pattern followed by a speaker when producing a particular sound. It is associated with the first language of the speaker, which affects the production of speech. In this work, we focused on the identification of six main accents of Urdu, i.e., Urdu, Punjabi, Sindhi, Balochi, Seraiki and Pashto. In our previous work [15], we conducted a study to classify five of these accents based on formant frequencies and Mel Frequency Cepstral Coefficients (MFCCs). Three approaches were used to classify the different accents: 1) Support Vector Machine (SVM) and Random Forest, 2) Gaussian Mixture Models (GMMs) and 3) confidence scores of phone ASRs.

1) Accent classification using SVM and Random Forest: Using machine learning classification algorithms, an experiment was carried out to classify six different accents of Urdu. The feature vector used for accent classification was Mel Frequency Cepstral Coefficients (MFCCs). The MFCCs were computed from complete words of a particular accent and then averaged over a window of a few milliseconds. The values of the frame length, frame shift and averaging window used in the experiment were 10 ms, 7 ms and 50 ms respectively. The SVM kernel was polynomial with a cost value of 10 and gamma and coefficient values of 0.5. The number of trees and the tree depth for the Random Forest were 400 and 0 respectively. The number of utterances used for each accent was 424. This method resulted in an accuracy of 44%.
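A rough sketch of this classification pipeline is shown below. It is an illustration under stated assumptions rather than the exact experimental setup: MFCC extraction is delegated to librosa, the windowed averaging is approximated by a mean over each utterance, and the corpus layout behind load_accent_wavs is hypothetical.

```python
import glob
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

ACCENTS = ["urdu", "punjabi", "sindhi", "pashto", "balochi", "seraiki"]

def load_accent_wavs(accent):
    # Hypothetical corpus layout: one directory of wav files per accent.
    return glob.glob(f"corpus/{accent}/*.wav")

def utterance_features(path, sr=8000, n_mfcc=13):
    """One averaged MFCC vector per word utterance (a simplification of
    the windowed averaging described above)."""
    signal, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return mfcc.mean(axis=1)

X, y = [], []
for label, accent in enumerate(ACCENTS):
    for path in load_accent_wavs(accent):
        X.append(utterance_features(path))
        y.append(label)
X, y = np.array(X), np.array(y)

# Polynomial-kernel SVM with the cost, gamma and coefficient values reported above.
svm = SVC(kernel="poly", C=10, gamma=0.5, coef0=0.5).fit(X, y)

# Random Forest with 400 trees (tree depth left unrestricted in this sketch).
forest = RandomForestClassifier(n_estimators=400).fit(X, y)
```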
2) Accent classification using Gaussian Mixture Models: The system was built using Gaussian Mixture Models, where each accent was represented by a mixture of Gaussian densities. In the training phase, the MFCC features of all the wave files of a single accent were computed and stacked together. The Expectation Maximization algorithm was used to find the optimal parameters of the Gaussian mixtures and the weight of each mixture. In the testing phase, after computing the MFCC feature vectors of a single file, the posterior probabilities of each feature vector were computed for all the accent models, and each test utterance was assigned to the accent for which its posterior probability was maximum. The same number of files (3861) was used in training for each accent, but the number of files in testing differed according to the data available for each accent. The overall accuracy of the system was about 61%, which is the weighted average of the accuracies of the individual accents.
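A compact sketch of this per-accent GMM approach is given below, reusing the same hypothetical corpus-listing helper and MFCC extraction as in the previous sketch. The number of mixture components is an assumption, since the paper does not report it; with equal priors, scoring by log-likelihood matches the maximum-posterior rule described above.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

# ACCENTS and load_accent_wavs() are the same hypothetical helpers as above.

def accent_frame_matrix(accent, sr=8000, n_mfcc=13):
    """Stack the MFCC frames of every training file of one accent."""
    frames = []
    for path in load_accent_wavs(accent):
        signal, _ = librosa.load(path, sr=sr)
        frames.append(librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T)
    return np.vstack(frames)

# One GMM per accent, fitted with EM (the component count is an assumption).
gmms = {accent: GaussianMixture(n_components=16, covariance_type="diag")
                .fit(accent_frame_matrix(accent))
        for accent in ACCENTS}

def classify_accent(path, sr=8000, n_mfcc=13):
    """Assign an utterance to the accent whose model scores its MFCC frames highest."""
    signal, _ = librosa.load(path, sr=sr)
    frames = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T
    return max(gmms, key=lambda accent: gmms[accent].score(frames))
```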
3) Accent classification using confidence scores of phone ASRs: This method estimates the accent of a test utterance by comparing its confidence scores after it has been decoded by the phone ASRs of all accents. The results of this method are better than those of the SVM and GMM based methods. The overall accuracy of this system is 65.71%, with Punjabi having the lowest accuracy and Urdu the highest. The reason for the poor accent identification may be that the vocabulary consists of proper nouns and isolated words, which may not adequately capture accent variations.

D. Accent Dependent ASRs

A separate system was developed for each accent using the data available for that accent. The Sphinx toolkit was used to develop the systems. Table II shows the results of the speech recognition system for each accent.

TABLE II. ACCENT DEPENDENT ASR SYSTEMS DATA AND ACCURACY

Accent | Training utterances | Testing utterances | Accuracy (%)
Punjabi | 3476 | 793 | 91.29
Pashto | 6202 | 1566 | 93.99
Urdu | 2661 | 616 | 92.37
Balochi | 10757 | 2505 | 92.71
Saraiki | 8998 | 2243 | 95.05
Sindhi | 4301 | 1075 | 91.81
Overall AD ASRs accuracy | | | 92.87

The developed accent dependent systems are used together with the accent identifier for recognition. Figure 1 shows the combination of accent identification with the accent dependent ASRs.

Figure 1: Accent dependent ASRs with accent classifier. (The input audio file is passed to the accent identifier, which routes it to one of the six accent-specific speech recognition systems, Urdu, Punjabi, Sindhi, Pashto, Balochi or Seraiki; the selected system outputs the recognized district name.)

E. Accent Independent ASR

The accent independent speech recognition system for the weather domain is built using the speech data of Punjabi, Urdu, Sindhi, Pashto, Balochi and Seraiki speakers. The training data consisted of 80% of the corpus for each accent, while the remaining 20% of each accent was used for testing the system. The number of training utterances was 28805, while 7759 utterances were used for testing. The recognition accuracy of the system was 91.98%.
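The routing of Figure 1, combined with the confidence-score accent identifier of method 3 above, can be summarized in a few lines. The decoder objects are passed in as plain callables and stand in for the underlying Sphinx models; only the control flow is intended to reflect the setup described above.

```python
def recognize_district(audio, phone_asrs, word_asrs):
    """Accent-dependent recognition flow of Figure 1 (sketch).

    phone_asrs: dict mapping accent -> callable(audio) -> (phones, confidence),
                the per-accent phone decoders used for accent identification.
    word_asrs:  dict mapping accent -> callable(audio) -> district_name,
                the accent-dependent district-name recognizers.
    The callables are placeholders for the underlying Sphinx decoders.
    """
    # 1) Accent identification by confidence scores: decode with every
    #    accent's phone ASR and keep the accent with the highest confidence.
    confidences = {accent: decoder(audio)[1] for accent, decoder in phone_asrs.items()}
    accent = max(confidences, key=confidences.get)

    # 2) Route the utterance to that accent's district-name recognizer.
    return word_asrs[accent](audio)
```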
IV. FIELD TESTING

The aim of the field testing is to compute the accuracy of the ASR system in real-world conditions. Based on the amount of noise present in the surroundings, from very quiet to very loud, different places were selected for the field testing of the ASR systems. These include labs, offices, classrooms, the campus parking space, open fields, the cafeteria, the bus stand and roads within the campus. The demographics include both technical and non-technical people working in the university and illiterate people such as car drivers, rickshaw drivers, shopkeepers and cafeteria waiters. Best efforts were made to have an equal representation of males and females, but in the case of drivers, shopkeepers and waiters it was not possible to get the system tested by females. A total of 80 speakers took part in the testing and each speaker spoke 7 to 8 different district names. The total number of test files is 586, excluding files in which the user did not speak anything. Table III presents the results on the performance of the standalone speech recognition systems.

TABLE III. PERFORMANCE OF ASR SYSTEMS IN FIELD

ASR type | Testing utterances | Correctly decoded | Accuracy (%)
Accent Independent | 586 | 441 | 75.25
Accent Dependent | 586 | 352 | 60.06

It is quite clear that the accent independent system outperforms the accent dependent system. The reason for the poor performance lies in the fact that the accent of an unknown utterance is not identified with good accuracy. Owing to its better performance, the isolated word accent-independent speech recognition system was integrated with the weather information spoken dialog system and deployed at the Pakistan Meteorological Department (PMD), Islamabad, on a landline number. The initial version of the system was deployed on 13th August 2015 and its performance was monitored for two months. The system gave an accuracy of 71% for 3560 utterances.

A. Improvements

The accuracy dropped from 92% in the laboratory to 71% in the field. In order to improve the accuracy of the system in the field, the speech corpus was cleaned and verified again and newly recorded data was added. This modified system had an accuracy of 93.01% in laboratory testing. For 3095 utterances in the field, this system gave an accuracy of 79.46%. This improved accuracy is still considerably lower than the laboratory accuracy of 93%. In order to further improve the system, we decided to adapt the system to its users by including field data in the training of the system. Field data means the user responses recorded during the operation of the system in the field. The difference between field data and the collected corpus is that the collected corpus is recorded under supervision, which may not reflect exactly how users say their responses in the real world. Moreover, as the corpus is usually collected from some sections of the population (mostly students in our case), it may not cover the target users effectively. Field data captures both of these aspects, which can lead to improved accuracy of the system.

More than 7000 utterances recorded in the field were added to the training data. This resulted in a major turnaround and helped bring the accuracy of the system closer to the laboratory results. The adapted ASR system had a field accuracy of 92.56% for 2609 utterances.
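The adaptation step amounts to pooling verified field utterances with the original corpus before retraining. A minimal sketch of that data preparation is given below; the manifest file names and CSV layout are assumptions for illustration, and the pooled list would then be fed to the Sphinx training recipe.

```python
import csv
import random

def merge_training_data(corpus_manifest, field_manifest, out_manifest, seed=0):
    """Pool the supervised corpus with transcribed field utterances.

    Each manifest is assumed to be a CSV with two columns: wav_path, transcript.
    """
    rows = []
    for path in (corpus_manifest, field_manifest):
        with open(path, newline="", encoding="utf-8") as f:
            rows.extend(csv.reader(f))

    random.Random(seed).shuffle(rows)  # mix field and corpus utterances
    with open(out_manifest, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

merge_training_data("corpus_train.csv", "field_utterances.csv", "train_pooled.csv")
```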
V. CONCLUSION

In this paper, we have presented an Urdu speech recognition system for the district names of Pakistan. To handle accent variation, accent dependent and accent independent ASRs have been developed and tested. Accent dependent ASRs perform better when the accent of the input is known and the ASR for that accent is used. However, the accent of the input is unknown in the field, and an accent identifier is used to determine the accent from the input. The performance of the accent identifier is poor due to the inadequate capturing of accent variation in isolated words that are proper nouns. This leads to degradation in the performance of the accent dependent speech recognition system. The accent independent ASR system performs better than the accent dependent system. However, further study is required to develop a better accent identification system and then compare the accent independent and accent dependent systems. Furthermore, the addition of field data to the training of the speech recognition system helps to improve its accuracy in the field.

ACKNOWLEDGMENT

This work has been conducted through the project "Enabling Information Access for Mobile based Urdu Dialogue Systems and Screen Readers", supported through a research grant from the National ICT RnD Fund, Pakistan.

REFERENCES

[1] K. H. Davis, R. Biddulph and S. Balashek, "Automatic Recognition of Spoken Digits," Journal of the Acoustical Society of America, vol. 24, no. 6, pp. 627-642, 1952.
[2] Y. Chow, M. Dunham, O. Kimball and M. Krasner, "BYBLOS: The BBN continuous speech recognition system," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Dallas, Texas, USA, 1987.
[3] V. Zue, J. Glass, M. Phillips and S. Seneff, "The MIT SUMMIT Speech Recognition system: a progress report," in HLT Workshop on Speech and Natural Language, Philadelphia, PA, USA, 1989.
[4] M. Adda-Decker, G. Adda, J. L. Gauvain and L. Lamel, "Large vocabulary speech recognition in French," in IEEE International Conference on Acoustics, Speech, and Signal Processing, Phoenix, AZ, USA, 1999.
[5] J. Sherwani, "Speech Interface for Information Access by Low-Literate Users in the Developing World," Pittsburgh, PA, USA, 2009.
[6] F. Qiao, J. Sherwani and R. Rosenfeld, "Small-vocabulary speech recognition for resource-scarce languages," in First ACM Symposium on Computing for Development, London, United Kingdom, 2010.
[7] M. A. M. A. Shariah, R. N. Ainon, R. Zainuddin and O. O. Khalifa, "Human computer interaction using isolated-words speech recognition technology," in International Conference on Intelligent and Advanced Systems (ICIAS), Kuala Lumpur, Malaysia, 2007.
[8] P. Saini, P. Kaur and M. Dua, "Hindi Automatic Speech Recognition Using HTK," International Journal of Engineering Trends and Technology (IJETT), vol. 4, no. 6, pp. 2223-2229, 2013.
[9] M. A. Hasnat, J. Mowla and M. Khan, "Isolated and Continuous Bangla Speech Recognition: Implementation, performance and application perspective," BRAC University, Dhaka, Bangladesh, 2007.
[10] J. Ashraf, N. Iqbal, N. S. Khattak and A. M. Zaidi, "Speaker Independent Urdu Speech Recognition," in International Conference on Informatics and Systems (INFOS), Cairo, Egypt, 2010.
[11] H. Sarfraz, S. Hussain, R. Bokhari, A. A. Raza, I. Ullah, Z. Sarfraz, S. Pervez, A. Mustafa, I. Javed and R. Parveen, "Large vocabulary continuous speech recognition for Urdu," in International Conference on Frontiers of Information Technology, Islamabad, Pakistan, 2010.
[12] K. F. Lee, H. L. Hon and R. Reddy, "An overview of the SPHINX speech recognition system," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 38, no. 1, pp. 35-45, 1990.
[13] S. Rauf, A. Hameed, T. Habib and S. Hussain, "District Names Speech Corpus for Pakistani Languages," in Oriental COCOSDA/CASLRE Conference, Shanghai, China, 2015.
[14] L. R. Rabiner and M. R. Sambur, "An Algorithm for Determining the Endpoints of Isolated Utterances," Bell System Technical Journal, vol. 54, no. 2, pp. 297-315, 1975.
[15] Afsheen, S. Irtza, M. Farooq and S. Hussain, "Accent Classification among Punjabi, Urdu, Pashto, Saraiki and Sindhi," in Conference on Language and Technology (CLT), Karachi, Pakistan, 2014.