Decision Rules For Selection of Allophones of Marathi Affricates For Speech Synthesis
Samudravijaya K
Tata Institute of Fundamental Research, Homi Bhabha Road, Mumbai 400005
Abstract

Marathi affricates have allophones that differ in the place of articulation. The Devanagari script, employed by the Marathi language, does not provide any clue to a reader (or a computer) about the allophone to be used. Thus, there is a need for discovering rules for such a choice in the context of developing Marathi text to speech systems. Decision Tree Learning methodology was used to identify factors that influence the choice of the appropriate allophone. The work led to a simple rule that is able to predict the place of articulation with high accuracy. The rule relies on the ‘frontness’ of the vowel following the affricate, and is consistent with articulatory principles.

1. Introduction

A prime goal of Speech and Language Technology is to develop human-centric machine interfaces. Human Computer Interaction in the native language is an important step in this direction. Enabling machines to accept oral instructions from human beings and to respond in speech mode would empower a large fraction of the Indian population to benefit from the advances made in computer and communications technology. Thus, it is important to develop Automatic Speech Recognition and Text to Speech systems for Indian languages. Two main stages in the operation of an unlimited vocabulary text to speech system are conversion of (i) text to phoneme and (ii) phoneme to speech [1]. This paper deals with an aspect of the text to phoneme module of a Marathi text to speech system.

Devanagari script is used by some modern Indian languages such as Hindi and Marathi. There is a near one-to-one correspondence between the graphemes and the phonemes. The exceptions to this correspondence are primarily due to schwa deletion. The need for creating pronunciation rulesets for speech synthesis for Indian languages has been well recognized [3, 4, 5]. In order to synthesize speech that is acceptable to native speakers, a text to phoneme module has to take into account allophones of the language in addition to such phonological rules. For example, the pronunciation of the phoneme /a/ in the Hindi word is different from its canonical pronunciation.

There are 4 affricates and the corresponding graphemes in most Indian languages. The place of articulation of the affricate phonemes is palatal. The affricates are categorized by the binary values of two distinctive features: voicing and aspiration. The Devanagari graphemes corresponding to the 4 affricates are च, छ, ज and झ. A notable feature of the Marathi language is the presence of allophones of affricates. The allophones differ by the place of articulation: palatal and dental. Both voiced and unvoiced Marathi affricates have allophones. The orthography does not indicate the allophone to be used in a given word. Thus, the script-to-phone ambiguity in the case of Marathi affricates needs to be resolved so that the appropriate allophone can be incorporated in machine generated Marathi speech. In the absence of documented rules of pronunciation, there appears to be some confusion even among native speakers. This paper describes an attempt to discover the rules of pronunciation of the Marathi unvoiced affricate using a data driven approach so that the rule set, even if it is imperfect but nearly complete, can be used for identifying the appropriate allophone of a Marathi affricate in the context of speech synthesis.

2. Problem definition

The goal of the present work is to arrive at a set of rules that will aid in identifying the place of articulation of affricate(s) in a Marathi word when its orthography is given. In this preliminary work, the domain is restricted to the unvoiced and unaspirated affricate: च. The basic assumption is that there is regularity in this grapheme to allophone transformation although it is not apparent in orthography. It is natural to expect that the rules of pronunciation have an articulatory basis. We hypothesize that the allophone of a Marathi affricate is chosen such that the place of articulation of the allophone is close to that of the adjacent phoneme(s), especially that of the following phoneme. According to one article [6], palatal affricates occur before the vowels i, ii, e, ai and au whereas dental affricates occur before the vowels u, uu and o; there is no such rule when affricates precede the vowels a and aa. An example of the latter case is the contrast between the palatal affricate in the word चार (four) and the dental affricate in the word
चारा (fodder); affricates in both cases are followed by /a:/ and both occur at word initial position. Since a linguistic solution to this problem is not available, the need for an engineering solution was felt. The aim of the current work is to discover patterns in the process of deciding allophones of the Marathi affricate and to evaluate the performance of a rule set derived using a data driven approach.

3. Experimental details

In this section, we describe the method and software used for generating the rule set, the lexical attributes of the phoneme used for decision and the text corpus.

3.1. Knowledge discovery method

The goal is to arrive at a collection of rules for determining the place of articulation of the Marathi affricate from a set of specific examples. The inputs to the system are the values of a set of attributes (articulatory and lexical properties) of a word and the correct place of articulation (the truth value). A decision tree is well suited for this purpose as it takes as input a set of properties of an object, and outputs a yes/no decision. In addition, it tolerates errors in training data or missing attribute values of an input. This property of robustness to errors in data is particularly relevant here because sometimes there is confusion among native speakers as to whether to use the dental or the palatal affricate. Moreover, the decision tree can be re-cast as sets of if-then rules. This property of the decision module is very useful for incorporation into the text to phone module of a text to speech system.

3.2. Decision tree toolkit

A public domain decision tree construction tool “C4.5” [7] was used in this work. It uses Quinlan’s C4.5 algorithm, a successor to his ID3 algorithm, for constructing a decision tree. The construction needs two input files. The first file contains a list of attributes and the set of values each attribute can take. Each line of the second (data) file contains the values of these attributes for an example and the corresponding truth value (dental or palatal in this case).

3.3. Attributes

In addition to the place of articulation of the following vowel, other factors that may play a role in the choice of allophones are the articulatory characteristics of the preceding phoneme and the position of the affricate in a word. If the neighbouring phoneme is a vowel, the attributes of duration (short vs long vowel) and nasalization were also examined. Table 1 shows the list of attributes examined along with the values these attributes could take.

Table 1: The list of attributes and their permitted values examined for their possible role in the choice of place of articulation of the Marathi unvoiced unaspirated affricate. Here ‘c’ and ‘C’ denote palatal and dental affricates respectively. Also, ‘A’ denotes the longer version of the vowel ‘a’ and so on.

Attribute               Values
position in word        initial, medial, final
prev nasalised vowel    true, false
prev is vowel           true, false
prev vowel              a, A, i, I, u, U, e, E, o, O, other
prev long vowel         true, false
prev front vowel        true, false
prev back vowel         true, false
prev consonant          c, C, h, k, l, r, other
next nasalised vowel    true, false
next is vowel           true, false
next vowel              a, A, i, I, u, U, e, E, o, O, other
next long vowel         true, false
next front vowel        true, false
next back vowel         true, false
next consonant          c, C, h, y, other

3.4. Database

We used the TDIL [8] Marathi text corpus. This database has 465 files containing text drawn from diverse sources. This corpus has about 60,000 unique Marathi words containing the unvoiced unaspirated affricate /c/ represented by the grapheme च. The task of annotating the place of articulation of all words is labour-intensive, tedious and time-consuming. So, we selected subsets of words for this experiment.

The TDIL Marathi text corpus (in ISCII format) was romanized for processing by computer in a Linux environment. This facilitated the process of (a) selection of words containing the phoneme /c/ and (b) creation of word sets with desired characteristics. Two sets of words were generated from the set of words containing च. The characteristics of these two sets differ and will help us to evaluate the efficacy of the rule system by training the system on one data set and testing on the other, and vice versa.

wordSet 1. This set contains all Marathi words containing the phoneme /c/ in the first 15 (out of 465) files. It has 3085 words with 3246 tokens of the phoneme /c/. Quite a few words share the same root word.

wordSet 2. The words in this set were selected so as to increase the lexical diversity of the set. The words were selected from 450 files representing information from diverse fields. There were 55,453 distinct words containing the phoneme /c/. However, due to the inflectional nature of the language, many words are inflected forms of the same root word. A lot of such words have one of the following
suffixes: ; these correspond to the English word “of”. In order to increase the lexical diversity, the following steps were followed to select words of this set.

1. Ignore a word if the affricate is present only in the last 3 phonemes of the word.

2. Words with distinct root words should be retained. If there are multiple words with the same root, select only one (the first) word.

Implementation of the second step requires a list of root words of Marathi. In the absence of such a list, we followed an ad hoc approach here. We define a “rootPattern” as the sequence of the first 5 characters. The root pattern should also contain the phoneme /c/. Most words with the same root word will have the same rootPattern. When multiple words with the same rootPattern exist, we retain only the first word. While this method does not guarantee a list of words with distinct root words, it is sufficient for the current experiment. In wordSet 2, there are 803 words with 1253 tokens of the phoneme /c/.

The place of articulation of each word was determined by inspection by native Marathi speakers. If a token of the phoneme /c/ in the ASCII representation of a word is to be pronounced as a dental affricate, the character ‘c’ associated with the phoneme was manually changed to ‘C’. Thus, ‘c’ and ‘C’ represent palatal and dental allophones of the phoneme /c/ respectively. For example, the romanized representation of the word चार will remain as ‘cAra’ while that of the word चारा will be modified as ‘CArA’. The annotation of the first set was carried out by two persons, and that of the second set was performed by a third person.

Based on the manual annotation, the input file for the program “C4.5” [7] was generated using a perl script. Each line of this file contains the values of the 15 attributes listed in Table 1 as well as the place of articulation of the phoneme /c/ (dental or palatal) for a word. If a word contains multiple tokens of the phoneme /c/, the file contains a line corresponding to each token of /c/. The attributes of multiple tokens of a word generally differ due to variation in phonemic context and position in word.

4. Results and analysis

The output of the decision tree generation program is a decision tree. The program “C4.5” [7] generates an ‘unpruned’ decision tree that makes the least prediction error on training data. However, this tree contains a large number of rules, some of which may not be significant. The program also generates a ‘pruned’ decision tree that retains significant rules. For example, for dataSet 1, the sizes (number of nodes in the decision tree) of the unpruned and pruned trees were 83 and 41 respectively. The errors in decision were 7.1% and 7.6% respectively. Since the size of the pruned tree is about half of that of the unpruned tree, and the increase in error due to pruning is small (0.5%), we use the pruned decision tree for our analysis.

Optionally, one can evaluate the performance of the decision tree on test data. The pruned decision tree generated from dataSet 1 (3246 items) was fed with dataSet 2 (1253 items). The error rates for the test data were 15.9% and 14.8% for unpruned and pruned trees respectively. It should be noted that dataSet 2 was specifically designed to have diverse lexical context. So, it is not surprising to see a higher error rate with dataSet 2.

The program can also generate a confusion matrix. Table 2 shows the confusion matrices obtained when the decision tree trained with dataSet 1 was evaluated with dataSet 1 (training data) as well as dataSet 2 (test data).

Table 2: Confusion matrices showing the performance of the pruned decision tree trained with dataSet 1 and evaluated with dataSet 1 (training data: 1745 palatal and 1501 dental tokens) as well as dataSet 2 (test data: 617 palatal and 636 dental tokens). It should be noted that dataSet 2 was specifically designed to have diverse lexical context; hence the higher error rate.

                 Training data         Test data
Classified as    Palatal   Dental      Palatal   Dental
Palatal          87%       13%         76%       24%
Dental           1%        99%         6%        94%

The decision tree can also be constructed using dataSet 2. In this case, the decision accuracies for the training data turned out to be 76% and 96% for palatal and dental tokens respectively. The corresponding figures for test data (dataSet 1) were 86% and 98% respectively. When these accuracy figures are compared with those in Table 2, we see that performance on training data has decreased when dataSet 2 is used for training. On the other hand, the decision accuracies for the test data have increased. This discrepancy may be due to the higher diversity of dataSet 2 as well as its small size. It can also be due to annotation errors; the two datasets were annotated by different persons.

We also trained the decision tree with the combined data (4499 tokens). The accuracies with which the decision tree could predict the correct place of articulation of training examples were 83% and 98% for palatal and dental allophones respectively.

The errors in decision for test data were 15% for one data set and 8% for another dataset. It should also be noted that decision error is very small when the true place of articulation of the affricate is dental.
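
As a rough consistency check on Table 2, the overall error rates of the pruned tree can be approximately recovered from the per-class percentages and the token counts given in the caption. The sketch below is only illustrative: the names TABLE2 and overall_error are ours and are not part of the “C4.5” tool, and the per-class percentages in the table are rounded, so the result can differ slightly from the reported figures.

```python
# Per-class results of the pruned tree trained on dataSet 1, read off Table 2.
# Each entry is (number of tokens of the true class, fraction misclassified).
TABLE2 = {
    "training (dataSet 1)": {"palatal": (1745, 0.13), "dental": (1501, 0.01)},
    "test (dataSet 2)": {"palatal": (617, 0.24), "dental": (636, 0.06)},
}


def overall_error(per_class):
    """Return the token-weighted error rate over all true classes."""
    total = sum(count for count, _ in per_class.values())
    wrong = sum(count * error for count, error in per_class.values())
    return wrong / total


for name, per_class in TABLE2.items():
    print(f"{name}: {100 * overall_error(per_class):.1f}% overall error")
```

With the rounded table entries this gives about 7.5% for the training data and about 14.9% for the test data, close to the 7.6% and 14.8% reported above for the pruned tree.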

4.1. Decision rules

As mentioned earlier, the “C4.5” decision tree construction tool can generate both unpruned and pruned decision trees for a given training data. The size (number of nodes) of the pruned tree (41) is much smaller than that of the unpruned tree (83). The pruned tree contains 18 rules; these are listed in Table 3.

Among the many rules in Table 3, some may be applied very frequently and some others rarely. The software tool also provides statistics about the usage and accuracy of each rule. We wanted to see whether there is an even smaller set of rules that captures the essence of the larger rule set. Table 3 shows the rules when dataSet 1 was used as the training data. Similar sets of rules were derived corresponding to dataSet 2 as well as the combined data (dataSet 1 + dataSet 2). On inspection of these 3 sets of rules, it was observed that there indeed exists a compact set of simple rules. In fact, one can even contemplate a single rule based on the place of articulation of the following vowel. For example, when dataSet 1 is used for training, one can determine the place of articulation of the Marathi phoneme /c/ by the application of a single rule with only about 10% error in case of dental affricates (and 0% error in case of palatal affricates). This rule is shown below:

    The place of articulation of the allophone is
        dental if (the phone /c/ is followed by either /e/, /a/ or /A/)
        palatal otherwise

5. Conclusions

The selection of the appropriate allophone of the unvoiced, unaspirated Marathi affricate, च, has been studied in the context of speech synthesis. The Decision Tree Learning method was used to discover articulatory factors that influence the choice of the appropriate allophone. The main outcome of the study is that the place of articulation of the allophone is dental if the following phoneme is not a front vowel. This observation is in accordance with articulatory principles.

6. Acknowledgements

We thank the Technology Development in Indian Languages unit of the Department of Information Technology for making the Marathi text corpus available to us. We thank Poonam, Swankita and Priyanka for their effort and patience in annotating the corpus. Special thanks to Dr. J.R.Quinlan and Dr. H.J.Hamilton for the excellent tutorial as well as for sharing “C4.5”, the Decision Tree Learning tool.

7. References

[1] “Indian accent text to speech system for web browsing”, A. Sen and K. Samudravijaya, Sadhana, Vol. 27, February 2002, pp. 113-126.

[2] “Text-to-speech synthesis in Marathi”, A. Sen and Atul Warjurkar, Proc. National Symposium on Acoustics, Oct. 2003, Pune, paper no. NSA2003-55.

[3] “Schwa-deletion in Hindi Text-to-Speech Synthesis”, Narasimhan B., Sproat R., and Kiraz G., Workshop on Computational Linguistics in South Asian Languages, 21st SALA, October 2001, Konstanz.