An Omnifont Open-Vocabulary OCR System For English and Arabic
1 INTRODUCTION
The introduction of Hidden Markov Models (HMM) to
the area of automatic speech recognition has brought
several useful aspects to this technology, some of which are:
language-independent training and recognition methodol-
ogy; no separate segmentation is required at the phoneme
and word levels; and automatic training on non-segmented
data. In previous papers [13], [18], we presented a method
for using existing continuous speech recognition technol-
ogy for OCR. After a line-finding stage, followed by a sim-
ple feature-extraction stage, the system utilizes the BBN
BYBLOS continuous speech recognition system [15] to per-
form the training and recognition.
In this paper, we present techniques for handling multi-
ple print styles using a single HMM model for each char-
acter. It had been assumed that a natural mix of data from
different fonts and styles is best for training a recognition
system, as it leads to a matched condition between training
and test. We show how the HMM conditional independ-
ence assumption leads to less than optimal recognition ac-
curacy when a natural mix of data from different styles is
used for training, and we present a method for improving
system performance by allocating training data properly
among the different styles.
In our previous papers, we focused mainly on closed-
vocabulary experiments, in which the lexicon contained all
the words in the training and test sets. In this paper, we
show how the same word-based system can be used to deal
with unlimited vocabularies with the use of a lexicon of
characters and a statistical language model at the character
level. We report on results for English and Arabic. We also
discuss the effects of the language model on the perform-
ance of the character-based recognition system and how
language model perplexity relates to the recognition rate.
This paper is organized as follows. In Section 2, we give
a short literature review on related work in the area. In Sec-
tions 3 and 4, we present an overview of the system de-
scribing our approach to using HMMs for character recog-
nition. In Section 5, we present initial results on English and
Arabic. In Section 6, we present an approach for improving
accuracy with multiple print styles. In Section 7, we show
how the word-based system can be used to perform char-
acter-level recognition with unlimited vocabulary.
2 LITERATURE REVIEW
A number of research efforts have used HMMs in off-line
printed-text and handwriting recognition [2], [3], [5], [23].
In all of these efforts, recognition of only a single
language is attempted. The approach we take is most
similar to those of references [1], [6], [8], [9], [10], [14] in that
they also extract features from thin slices of the image
which, in principle, could make these systems language-
independent. The approach of Elms and Illingworth [8]
is
similar in that they use vertical thin slices to extract one set
of features, but they also use horizontal slices to extract an-
other set of features which makes some presegmentation at
the character level necessary, hence making the system not
appropriate for language-independent recognition, espe-
cially for languages with connected script. They used their
system to perform recognition of printed Roman characters.
Aas and Eikvil [1] draw a bounding box around each word
Solving for x yields 47.8 percent. Therefore, if we allo-
cate our training data as 47.8 percent italic and 52.2 per-
cent plain, the net effect will be weighting the final models
with the natural mix ratio of 15 percent and 85 percent,
respectively.
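The allocation above can be illustrated numerically. This excerpt states the result (47.8 percent) but not the equation itself; the sketch below assumes the relation described in this section, namely that the per-frame style weight x is compounded over the frames of a character, so the effective style weight is x^d / (x^d + (1 - x)^d). The duration d = 20 frames per character is a hypothetical value, chosen only because it reproduces the 47.8 percent figure:

```python
import math

def balanced_allocation(natural_weight, duration):
    """Solve (x / (1 - x))**duration = w / (1 - w) for x, where w is
    the desired natural style weight and duration is the number of
    frames over which the per-frame mixture weight is compounded."""
    ratio = (natural_weight / (1.0 - natural_weight)) ** (1.0 / duration)
    return ratio / (1.0 + ratio)

# Natural mix: 15% italic. With an assumed ~20 frames per character,
# the required italic share of the training data comes out near 47.8%.
x = balanced_allocation(0.15, 20)
print(round(100 * x, 1))  # -> 47.8
```

Note that for a 50/50 natural mix the formula returns 0.5 for any duration, which matches the intuition that equal training amounts give equal effective weights.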
Finding the right allocation of training data cannot be
easily generalized to more than two styles (fonts). Even in the
two-style case, we do not know the durations of each
character a priori. Therefore, we have tried to ameliorate
the exponentiation problem by following the approxi-
mate solution of using equal amounts of training from
the different styles. Having equal amounts of training
results in similar weighting for all styles. This approach
results in a somewhat unmatched condition between
training and test since the final model will have similar
weighting for all styles as opposed to the natural
weighting (15 percent to 85 percent in our example
above). However, this mismatch is far less severe than
that due to exponentiation and, therefore, should result
in better performance.
502 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 21, NO. 6, JUNE 1999
6.3 Balanced Training Models
To test the above solution, we performed a balanced train-
ing experiment where we used an equal amount of italic
and plain text for training (~50k characters each) but tested
the resulting model on the same data as before. The result is
shown in the last row of Table 1. Using a single model for
all data, the CER was the same 0.8 percent for the plain and
italic portions of the test. This is to be contrasted with the
second row of the table which shows the results of the pre-
vious multifont experiment, which used a natural mix for
training the model.
The use of equal amounts of data for training from each
font reduced the CER on italic from 6 percent to 0.8 percent
while the CER for the plain data increased from 0.5 percent
to 0.8 percent. So, on average, the use of a single balanced
training model reduced the overall error rate from 1.2 per-
cent to 0.8 percent, a significant 33 percent overall reduc-
tion in character error rate. For comparison purposes, we
also show in the first row of Table 1 the results of the more
computationally expensive method of using multiple style-
dependent models.
7 UNLIMITED VOCABULARY OCR
The BYBLOS system is a word-based system that allows for
recognition of only a closed set of words that constitute the
system's lexicon. In order to overcome this limitation, we
perform character-level recognition by allowing the char-
acter to play the role of the word in the system. Thus, the
lexicon is a list of the possible characters and the language
model is an n-gram (bigram or trigram) on sequences of
characters.
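The character-as-word setup can be sketched as follows. This is a minimal illustration with invented toy data and simple add-one smoothing, not the BYBLOS language model:

```python
from collections import defaultdict
import math

def train_char_bigram(corpus_lines):
    """Count character bigrams, treating each character as a 'word'.
    '<s>' marks the start of a line."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in corpus_lines:
        prev = "<s>"
        for ch in line:
            counts[prev][ch] += 1
            prev = ch
    return counts

def log_prob(counts, line, vocab_size):
    """Add-one smoothed log probability of a character sequence
    under the bigram model."""
    lp, prev = 0.0, "<s>"
    for ch in line:
        num = counts[prev][ch] + 1
        den = sum(counts[prev].values()) + vocab_size
        lp += math.log(num / den)
        prev = ch
    return lp

counts = train_char_bigram(["the cat", "the hat"])
# A string that follows the training patterns scores higher than a
# scrambled one, even though neither is constrained by a word lexicon.
assert log_prob(counts, "the", 27) > log_prob(counts, "eht", 27)
```

Because the "lexicon" is just the character set, any character sequence receives a nonzero probability, which is exactly what makes the vocabulary unlimited.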
In Sections 7.1 and 7.2, we present results on character-
based recognition for English and Arabic, respectively. As
expected, the CER increases when compared to the word-
based closed-vocabulary results. Then, in Section 7.3, we
show how good performance can be achieved for unlimited
vocabularies through a hybrid approach that combines
character and word level recognition.
7.1 Character-Based Recognition for English
The first unlimited vocabulary experiment we present here
is on English. Under the same experimental conditions as
our word-based balanced-training English experiment of
0.8 percent CER, we instead built character models and per-
formed recognition using both a bigram and a trigram
grammar on characters.
Table 2 summarizes the results of the experiments using
different language models. For each model, the table gives
the character perplexity and the corresponding CER un-
der balanced training. The first two results are for charac-
ter-based recognition (using no lexicon of words) and the
third row is the word-based, closed-vocabulary recognition
result, given in Table 1, using the 30k-word lexicon. We can
see two effects of changing the language model (as de-
scribed in Section 3.1). First, as we go from a bigram on
characters to a trigram on characters, perplexity goes down
from 13.0 to 8.6. When we then use a lexicon and a bigram
on words, the perplexity decreases to 2.8, indicating an
easier recognition task. Second, the CER decreases when we
use a trigram instead of a bigram on characters, as can be
seen from the last column of Table 2, and it decreases fur-
ther when we use a lexicon and a bigram on words. Doing
character-based recognition for English without the use of a
lexicon allowed for unlimited vocabulary but degraded the
performance by roughly a factor of three (from 0.8 percent
to 2.1 percent).
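Perplexity here is the geometric-mean inverse probability that the language model assigns per character: the exponential of the negative average log probability. A minimal sketch of the computation (the uniform-probability example below is illustrative only; it is not how the Table 2 figures were obtained):

```python
import math

def perplexity(log_probs):
    """Perplexity of a sequence given per-character log
    probabilities (natural log)."""
    return math.exp(-sum(log_probs) / len(log_probs))

# If every character were predicted with probability 1/13, the
# perplexity would be 13, comparable to the character bigram
# perplexity reported in Table 2.
print(perplexity([math.log(1 / 13.0)] * 10))  # -> 13.0
```

Lower perplexity means the model concentrates probability on fewer continuations at each step, which is why the drop from 13.0 to 8.6 to 2.8 tracks the easier recognition task.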
7.2 Character-Based Recognition for Arabic
As mentioned in Section 5.1, our Arabic character set con-
sists of all the forms and ligatures shown in Fig. 7. For
training and test sets chosen from 40 pages of the DARPA
corpus, as presented in Section 5.1, we obtained a CER of
2.6 percent using our closed-vocabulary, word-based sys-
tem. Using the same training and test but performing un-
limited-vocabulary recognition with the form-based models
and a trigram language model on forms, we obtained a
CER of 4.5 percent. Similar to English, the performance de-
grades in going from word to character recognition, but in
this case with a degradation of only a factor of two (from
2.6 percent to 4.5 percent) as opposed to a factor of three for
English.
TABLE 1
CHARACTER ERROR RATES FOR ENGLISH UNDER DIFFERENT TRAINING AND TEST CONDITIONS
TABLE 2
ENGLISH CER VERSUS MODEL PERPLEXITY
BAZZI ET AL.: AN OMNIFONT OPEN-VOCABULARY OCR SYSTEM FOR ENGLISH AND ARABIC, p. 503
7.3 A Hybrid Recognition System
Turning the system into a character-based recognition sys-
tem as we did in Sections 7.1 and 7.2 allows for any se-
quence of characters and hence for unlimited vocabulary.
However, in doing this, we lose a significant amount of
prior language information that the lexicon provides during
recognition. As we saw in the two previous subsections, not
using a lexicon increases the error rate by a factor of two to
three.
To solve this problem, we used a hybrid approach where
we perform character-based recognition together with some
higher level constraints set by a word lexicon and a uni-
gram language model at the word level. Using this hybrid
approach for English and Arabic, Table 3 summarizes the
results. The first two columns summarize our previous re-
sults for word-based and character-based recognition, while
the third column shows the results for the hybrid system.
Using the hybrid recognition system, we obtained an error
rate of 1.1 percent on the balanced-training English experi-
ment and 3.3 percent on the 40-page Arabic experiment.
These results show surprisingly significant improve-
ments over the error rates we obtained by doing only char-
acter-based recognition. The hybrid system, without having
the closed-vocabulary constraint, performed close to the
closed-vocabulary system (compare the first and third col-
umns in Table 3), which makes us believe that this can be a
good approach for unlimited vocabulary while making use
of a lexicon and some word-level language model. The de-
tails of the hybrid system will be presented in a subsequent
paper.
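The hybrid idea, constraining character-level output with a word lexicon and a word-level unigram model, can be sketched as a segmentation score over character strings. This illustrates only the general principle; the paper defers the actual system's details to a subsequent publication, and the lexicon, probabilities, and out-of-vocabulary penalty below are invented:

```python
import math

def word_score(text, lexicon_logp, oov_logp=-20.0):
    """Best log score of segmenting a character string into lexicon
    words under a word-unigram model. Characters that cannot be
    covered by a lexicon word fall back to a heavy per-character
    out-of-vocabulary penalty, so any sequence remains scorable."""
    n = len(text)
    best = [-math.inf] * (n + 1)
    best[0] = 0.0
    for i in range(1, n + 1):
        # Fallback: treat text[i-1] as an out-of-vocabulary character.
        best[i] = best[i - 1] + oov_logp
        for j in range(i):
            word = text[j:i]
            if word in lexicon_logp:
                best[i] = max(best[i], best[j] + lexicon_logp[word])
    return best[n]

lex = {"the": math.log(0.6), "cat": math.log(0.4)}
# A character sequence that parses into lexicon words outranks one
# with a transposition error, yet the misrecognized sequence still
# receives a finite score, preserving open-vocabulary behavior.
assert word_score("thecat", lex) > word_score("thecta", lex)
```

In a full system such a score would be combined with the character n-gram and HMM scores during decoding, biasing the recognizer toward in-lexicon words without forbidding novel ones.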
Comparing Arabic to English, the average error rate for
Arabic was about three times that of English (3.3 percent
versus 1.1 percent). We speculate that the higher error rate
for Arabic is due to several causes:
1) the greater similarity, and hence confusability, of Ara-
bic characters;
2) the connectedness of Arabic characters and the exis-
tence of ligatures;
3) the wider diversity of fonts in the Arabic corpus; and
4) the lower quality of some of the Arabic data.
Finally, even though the English and Arabic corpora we
used for our training and testing are also used by other re-
searchers in the field, no standard training and testing sets
have been defined that could allow for comparing our re-
sults to those of other methods used on the same corpora.
8 CONCLUSION
In this paper, we presented an omnifont open-vocabulary
OCR system for English and Arabic that is based on Hid-
den Markov Models. We showed that our HMM-based
OCR system has several benefits, two of which are: no
segmentation is required at the word or character levels,
and the system is language-independent, that is, the same
system can be used for different languages with little or
no modification.
We addressed the issue of the simultaneous recognition
of two English styles: plain and italic. We showed, mathe-
matically and empirically, that our initial high error rate on
italics, when using a single overall model, was due to the
conditional independence assumption inherent in the
HMM framework. We presented a method for ameliorating
the problem by balancing the training between the two
styles.
We also presented a technique for using our word-based
system to handle unlimited vocabularies. Using a lexicon of
only characters, the same size as the character set (e.g., 90
for English), and letting these characters play the role of
words enables our system to recognize any sequence of
characters, thus allowing for unlimited vocabulary recog-
nition. We presented results that show the effect of using
bigram and trigram language models on sequences of char-
acters to improve recognition accuracy. To recover the lexi-
con information, we combined character and word recog-
nition, and we were able to achieve open-vocabulary per-
formance of 1.1 percent CER for English and 3.3 percent for
Arabic, results which are close to the closed-vocabulary
system performance.
When performing character-based decoding, the recog-
nition speed is around 35 characters per second, while with
word-based decoding where the vocabulary is much larger,
the recognition speed is around an order of magnitude less
than the character-based recognition speed. We are cur-
rently working on a fast hybrid procedure that has a recog-
nition speed comparable to that of only character-based
recognition.
For our future work, we plan to move in two directions.
First, we plan to port the system to Chinese OCR, which
will further demonstrate the language-independence aspect
of the overall methodology. Implementing Chinese OCR
will also bring a new challenge because of the large cardi-
nality of the Chinese character set. Second, we plan to test
our system on degraded and noisy data, such as fax and
nth-generation photocopies.
ACKNOWLEDGMENT
An earlier version of this paper was presented at ICDAR in
1997 [22].
REFERENCES
[1] K. Aas and L. Eikvil, "Text Page Recognition Using Grey-Level Features and Hidden Markov Models," Pattern Recognition, vol. 29, pp. 977-985, 1996.
TABLE 3
ENGLISH AND ARABIC HYBRID RECOGNITION RESULTS
[2] B. Al-Badr and S. Mahmoud, "Survey and Bibliography of Arabic Optical Text Recognition," Signal Processing, vol. 41, no. 1, pp. 49-77, 1995.
[3] M. Allam, "Segmentation Versus Segmentation-Free for Recognizing Arabic Text," Proc. SPIE, vol. 2,422, pp. 228-235, 1995.
[4] J. Bellegarda and D. Nahamoo, "Tied Mixture Continuous Parameter Models for Large Vocabulary Isolated Speech Recognition," IEEE Int'l Conf. Acoustics, Speech, Signal Processing, vol. 1, pp. 13-16, Glasgow, Scotland, May 1989.
[5] N. Ben Amara and A. Belaid, "Printed PAW Recognition Based on Planar Hidden Markov Models," 13th Int'l Conf. Pattern Recognition, vol. 2, pp. 220-224, Vienna, 1996.
[6] W. Cho, S.-W. Lee, and J.H. Kim, "Modeling and Recognition of Cursive Words With Hidden Markov Models," Pattern Recognition, vol. 28, pp. 1,941-1,953, 1995.
[7] R.B. Davidson and R.L. Hopley, "Arabic and Persian OCR Training and Test Data Sets," Proc. Symp. Document Image Understanding Technology (SDIUT '97), pp. 303-307, Annapolis, Md., 1997.
[8] A.J. Elms and J. Illingworth, "Modelling Polyfont Printed Characters With HMMs and a Shift Invariant Hamming Distance," Proc. Int'l Conf. Document Analysis and Recognition, pp. 504-507, Montreal, Canada, 1995.
[9] A. Kaltenmeier, T. Caesar, J.M. Gloger, and E. Mandler, "Sophisticated Topology of Hidden Markov Models for Cursive Script Recognition," Proc. Int'l Conf. Document Analysis and Recognition, pp. 139-142, Tsukuba City, Japan, 1993.
[10] A. Kornai, "Experimental HMM-Based Postal OCR System," Proc. Int'l Conf. Acoustics, Speech, Signal Processing, vol. 4, pp. 3,177-3,180, Munich, Germany, 1997.
[11] J. Makhoul, S. Roucos, and H. Gish, "Vector Quantization in Speech Coding," Proc. IEEE, vol. 73, pp. 1,551-1,588, 1985.
[12] J. Makhoul and R. Schwartz, "State of the Art in Continuous Speech Recognition," Proc. Nat'l Acad. Sci. USA, vol. 92, pp. 9,956-9,963, Oct. 1995.
[13] J. Makhoul, R. Schwartz, C. LaPre, C. Raphael, and I. Bazzi, "Language-Independent and Segmentation-Free Techniques for Optical Character Recognition," Document Analysis Systems Workshop, pp. 99-114, Malvern, Pa., Oct. 1996.
[14] M. Mohamed and P. Gader, "Handwritten Word Recognition Using Segmentation-Free Hidden Markov Modeling and Segmentation-Based Dynamic Programming Techniques," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 5, pp. 548-554, May 1996.
[15] L. Nguyen, T. Anastasakos, F. Kubala, C. LaPre, J. Makhoul, R. Schwartz, N. Yuan, G. Zavaliagkos, and Y. Zhao, "The 1994 BBN/BYBLOS Speech Recognition System," Proc. ARPA Spoken Language Systems Technology Workshop, pp. 77-81, Austin, Texas, Jan. 1995. San Mateo, Calif.: Morgan Kaufmann Publishers, 1995.
[16] I.T. Phillips, S. Chen, and R.M. Haralick, "CD-ROM Document Database Standard," Proc. Int'l Conf. Document Analysis and Recognition, pp. 478-483, Tsukuba City, Japan, Oct. 1993.
[17] L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. IEEE, vol. 77, no. 2, pp. 257-286, Feb. 1989.
[18] R. Schwartz, C. LaPre, J. Makhoul, C. Raphael, and Y. Zhao, "Language-Independent OCR Using a Continuous Speech Recognition System," Proc. Int'l Conf. Pattern Recognition, pp. 99-103, Vienna, Aug. 1996.
[19] R. Schwartz, L. Nguyen, and J. Makhoul, "Multiple-Pass Search Strategies," C.-H. Lee, F.K. Soong, and K.K. Paliwal, eds., Automatic Speech and Speaker Recognition: Advanced Topics, pp. 429-456. Kluwer Academic Publishers, 1996.
[20] T. Starner, J. Makhoul, R. Schwartz, and G. Chou, "On-Line Cursive Handwriting Recognition Using Speech Recognition Methods," IEEE Int'l Conf. Acoustics, Speech, Signal Processing, pp. V-125-128, Adelaide, Australia, 1994.
[21] F.T. Yarman-Vural and A. Atici, "A Heuristic Algorithm for Optical Character Recognition of Arabic Script," Proc. SPIE, vol. 2,727, part 2, pp. 725-736, 1996.
[22] I. Bazzi, C. LaPre, J. Makhoul, and R. Schwartz, "Omnifont and Unlimited Vocabulary OCR for English and Arabic," Proc. Int'l Conf. Document Analysis and Recognition, vol. 2, pp. 842-846, Ulm, Germany, 1997.
[23] C.B. Bose and S.-S. Kuo, "Connected and Degraded Text Recognition Using Hidden Markov Model," Pattern Recognition, vol. 27, pp. 1,345-1,363, 1994.
Issam Bazzi is a staff scientist at BBN Technologies, GTE Internet-
working, Cambridge, Massachusetts, and is pursuing a PhD degree at
the Massachusetts Institute of Technology (MIT). He received his BE in
computer and communication engineering from the American Univer-
sity of Beirut in 1993 and his SM in information technology from MIT in
1997. He has been working at MIT since 1993 on networked multime-
dia systems. He joined BBN in 1996, working mainly on optical char-
acter recognition, which is the topic of his PhD study at MIT.
Richard Schwartz is a principal scientist at BBN Technologies, GTE
Internetworking, Cambridge, Massachusetts. He joined BBN in 1972,
after receiving an SB in electrical engineering from MIT. Since then, he
has worked on phonetic recognition and synthesis, speech coding,
speech enhancement in noise, speaker identification and verification,
speech recognition and understanding, fast search algorithms, neural
networks, online handwriting recognition, optical character recognition,
and statistical text processing.
John Makhoul is a chief scientist at BBN Technologies, GTE Internet-
working, Cambridge, Massachusetts. He is also an adjunct professor at
Northeastern University and at the University of Massachusetts and a
research affiliate at MIT. An alumnus of the American University of
Beirut and the Ohio State University, he received a PhD from MIT in
1970 in electrical engineering. Since then, he has been with BBN,
directing various projects in speech recognition and understanding,
speech coding, speech synthesis, speech enhancement, signal proc-
essing, artificial neural networks, and character recognition. Dr. Mak-
houl is a Fellow of the IEEE and of the Acoustical Society of America.