Paper
Operating System
Authors: J Patricia, K Ratheesh, V S Shenoi, G Sreepriya,
Timothy A Gonsalves1 and Hema A Murthy
TeNeT Group, Department of Computer Science and Engineering,
Indian Institute of Technology, Madras, Chennai - 600 036
E-mail: [email protected]
1 Introduction
Almost all the widely available software today is written and documented
in English, and uses English as the medium to interact with users. This
has the advantage of a common language of communication between
developers, maintainers and users from different countries. In a country
like India, an overwhelming majority of the population does not know
English. Given this fact, availability of affordable native language
software will play a crucial role in the process of taking the benefits of
the "information revolution" to the marginalized sections of society and to
achieve appropriate social use of information technology [1].
Each country or language has its own set of native attributes. These
attributes could include the country’s cultural conventions, language-
specific scripts (fonts), format of date and time, representation of
numbers, currency symbols, etc. The formal description of these
attributes, together with the associated translations targeted to a native
language, constitutes the Locale for the particular language or country.
The Linux operating system has two interfaces, namely the console and
the X Window System. The RAM requirement for the console is about
4 MB, whereas a minimal X Window System requires 6-8 MB. The X Window
System, however, has a user-friendly graphical interface.
This paper deals with the various tasks involved in the development of
Native Language Support for the Linux operating system, both for the
console as well as for the X Window System, with mutual compatibility.
Developing a native language interface at an operating system level is a
better proposition compared to developing it at an application level as the
former enables all the applications running on top of the operating
system to inherit the interface with no or minimal modification. An
application developed in the console-based environment must work
without requiring any modification in the X-environment. Further, once
support has been developed for a particular language, the effort to enable
any other Indian Language support should require changes to the
configuration only. To meet this requirement, ISCII [2] in consonance
with the Inscript keyboard layout has been used, so that the keyboard
and sound mappings are uniform across all the Indian languages. ISCII
includes ASCII as a subset.
2 Multilingual support on Linux
Linux offers high flexibility for customization of the keyboard-input-
display pipeline. This flexibility is offered both in the console and in the X
Window System. At the input level, Linux offers the flexibility to
manipulate the mapping tables that specify the keycodes/actions
generated by the keys of the keyboard. The character consequently
displayed depends on the font that is loaded. Linux provides easy
mechanisms to load fonts for the console as well as for the X Window
System. The high level of flexibility offered by Linux was an encouraging
factor for the choice of Linux as the platform for our effort. The detailed
mechanisms for the keyboard handling and display handling for the
console and the X Window System are given in the subsequent sections.
At the console level, Linux allows loading of a font into the EGA/VGA
character generator, with the options of specifying the screen-font map
and/or application-character set mapping. Linux allows loading of
keyboard translation tables as well. Linux provides a utility,
consolechars, for doing the former task, and loadkeys for the latter [3].
Thus, at the console level, Linux allows a high level of flexibility of
customizing the keyboard input as well as the display.
3.1 Character Width
Linux uses the PC Screen Font (PSF) format for display purposes for the
console [3]. Neither the PSF format nor the kernel modules implementing
the display mechanism, viz., the console and video drivers, support
variable-width fonts. Moreover, the width of a font glyph is fixed at 8
pixels. This does not pose a problem for English characters, where even
the widest glyph, “m”, can be represented legibly in 8 pixels. Also, the
widths of English characters vary very little, and hence the aesthetic
appeal of the characters is not affected by the fixed width of the glyphs.
This is not the case with most Indian language scripts. A character
such as Ë in Tamil or B in Malayalam cannot be fitted legibly into a
width of 8 pixels. If we reduce the font size to accommodate the widest
glyph of a script in 8 pixels, then we lose the legibility of the narrower
glyphs of the script. One option would be to enable the kernel to support
wider fixed-width fonts (say, all 16 pixels wide). But there are characters
like S in Malayalam or ó in Tamil which are very narrow, and it looks odd
if characters with such a wide variation in width are displayed together
on a screen with the same width allocated to all of them. An appropriate
solution, therefore, is to provide support for variable-width fonts.
Similar issues arise in the context of the X Window System also. The
virtual terminal (e.g. xterm) or the applications running in it support only
English and other foreign languages, which do not demand variable-
width glyphs.
3.2 Vowel and Consonant Clusters
If we want to really use the Native Language Support for the console or
the X Window System, applications running on it, like Mail-utility,
Editor, Web browser and command interpreter also need to be modified
to give the user interface in a native language. This may include
modification of the applications and also generation of the application-
specific string tables.
4 Proposed Solution
The focus of the effort in this work has been to provide a unified
approach to address these problems across all languages that require
variable width font and have the concept of modifiers. The interface
allows co-existence of English and multiple native languages as well. In
the effort undertaken, support has been extended for Indian languages.
The 8-bit ISCII has been used as the encoding standard [2]. This
facilitates compatibility between the console-based support and the X
Window System support, with support for transliteration.
The display mechanism at the console is handled by the console, the
tty and the video device drivers in the kernel. Providing variable-width
font support in all respects would require the modification of all these
drivers. The solution adopted by us is to display multiple glyphs for a
single character code in order to display wider characters. In this case,
the glyphs are still of fixed size and a one-to-many mapping mechanism is
introduced in the display pipeline. In this design, only the console and
TTY drivers need to be changed.
Thus, the kernel should be able to interpret and process the appropriate
language-specific parse rules.
Display of characters
For displaying multiglyph characters, data structures are added in the
kernel to store a one-to-many mapping of character codes to glyph
indices. Device driver functions are provided to load user-defined
multiglyph mappings. In the kernel display pipeline, code is inserted to
look up each character code in these data structures and display the
glyphs corresponding to that character.
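The one-to-many mapping described above can be pictured with a small
user-space model. This is only a sketch, not the kernel code: the function
name, the example codes, and the fallback of drawing an unmapped code as the
glyph with the same index are all assumptions made for illustration.

```python
# Hypothetical user-space model of the one-to-many display mapping.
# The real mechanism lives in the kernel console driver.

def expand_to_glyphs(codes, multimap):
    """Return the glyph indices to draw for a sequence of character codes.

    A code that has an entry in `multimap` expands to several fixed-width
    glyph cells. Any other code is assumed to be drawn as the single glyph
    with the same index.
    """
    glyphs = []
    for code in codes:
        glyphs.extend(multimap.get(code, [code]))
    return glyphs


# Assumed example: character 0xB3 is too wide for one cell and is split
# across two fixed-width glyph cells.
example_multimap = {0xB3: [0x120, 0x121]}
cells = expand_to_glyphs([0x41, 0xB3, 0x42], example_multimap)
```

Because each character may occupy more than one cell, operations such as
insertion and deletion have to shift by the number of glyphs of the
character rather than by one position.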
Inserting a character
While a character is being inserted, all the glyphs that are displayed to
the right of it need to be shifted. But the number of positions to shift
depends upon the number of glyphs in the character to be inserted.
Again, the bit-mappings for backspacing and deletion are used for
determining the number of glyphs in the character and the shifting is
done accordingly.
4.1.2 Parserule Support
Parserules:
αβA→abc
βA→dc
γA→ec
αβB→adc
[Figure 1: A set of parserules and (a) a forward DFA, (b) a reverse DFA
produced from these parserules using algorithm ConstructDFA.]
Forward Parserule Matching
Forward parse rule matching is used while normal characters are being
displayed. Suppose that there is a parse rule α β A = a b c, and that the
glyphs α β are already displayed. When the character code corresponding
to A is entered, forward DFA matching is started. If the rules have
already been loaded, the traversal proceeds in the order A → β → α,
matching successive characters. When it encounters a non-matching
character, the data element in the last DFA node reached (abc) is
returned.
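The traversal order A → β → α can be illustrated with a trie keyed by the
new character followed by the displayed context read right to left. This is
a stand-in for the forward DFA, not the ConstructDFA algorithm itself; the
function names and the dictionary layout are assumptions, and the rules are
the four parserules of Figure 1.

```python
def build_forward_trie(rules):
    """rules: iterable of (context_glyphs, char, replacement_glyphs)."""
    root = {"next": {}, "data": None}
    for context, char, replacement in rules:
        node = root
        # Key order: the new character first, then the context right to left.
        for symbol in [char] + list(reversed(context)):
            node = node["next"].setdefault(symbol, {"next": {}, "data": None})
        node["data"] = list(replacement)
    return root


def forward_match(trie, displayed, char):
    """Walk from `char` back through the already displayed glyphs.

    Returns (number of displayed glyphs to replace, replacement glyphs),
    or None when the deepest node reached carries no rule.
    """
    node = trie["next"].get(char)
    if node is None:
        return None
    depth = 0
    for glyph in reversed(displayed):
        following = node["next"].get(glyph)
        if following is None:
            break  # first non-matching glyph ends the traversal
        node, depth = following, depth + 1
    if node["data"] is None:
        return None
    return depth, node["data"]


figure1_rules = [
    (["α", "β"], "A", ["a", "b", "c"]),
    (["β"], "A", ["d", "c"]),
    (["γ"], "A", ["e", "c"]),
    (["α", "β"], "B", ["a", "d", "c"]),
]
trie = build_forward_trie(figure1_rules)
```

With the glyphs α β on screen, entering A yields a replacement of two
displayed glyphs by a b c; with only β on screen, the shorter rule βA→dc
applies instead.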
[Figure: flowchart of character processing. Recoverable box labels: check
whether the character is a normal character, a control character or a
backspace; apply forward matching of parserules; apply reverse matching of
parserules; check the multiglyph mapping tables for multiglyph matching;
replace the matched glyphs with the RHS of the matched parserule. The
edges are labelled "match found", "match not found" and "no multiglyph
mapping found".]
4.1.3 Utilities
Two utilities have been developed to enable the user to load the
multimap and parserules into the kernel space. This section describes
the usage of these utilities.
Loadmultimap
The loadmultimap utility can be used to load a multimap file into the
kernel space. The usage of this utility is as follows:
• loadmultimap <multimap filename>
A multimap file should be a text file where each line is of the form:
• C G1 G2 G3 ... Gn
where C is the character code in the range [0,256) and the Gi's are glyph
indices in the range [0,512). Both can be given in decimal or in
hexadecimal (prefixed with 0x).
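A line of a multimap file in the format just described could be read as
follows. This is a sketch of the file format only, not the loadmultimap
source; the function names are hypothetical, and whitespace-separated
fields are assumed.

```python
def parse_number(token):
    """Parse a decimal or 0x-prefixed hexadecimal token."""
    if token.lower().startswith("0x"):
        return int(token, 16)
    return int(token, 10)


def parse_multimap_line(line):
    """Parse 'C G1 G2 ... Gn' into (char_code, [glyph indices])."""
    values = [parse_number(token) for token in line.split()]
    if len(values) < 2:
        raise ValueError("expected a character code and at least one glyph")
    code, glyphs = values[0], values[1:]
    if not 0 <= code < 256:
        raise ValueError("character code out of range [0,256)")
    if any(not 0 <= g < 512 for g in glyphs):
        raise ValueError("glyph index out of range [0,512)")
    return code, glyphs


# Assumed example line: character 0xB3 is drawn as glyphs 288 and 0x121.
entry = parse_multimap_line("0xB3 288 0x121")
```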
Loadparserules
The loadparserules utility can be used to load a parserule file into the
kernel space.
The usage of this utility is as follows:
• loadparserules <parserule filename>
A parserule file should be a text file where each line is of the form
• [G1,1 G1,2 G1,3 ... G1,m] C = G2,1 G2,2 G2,3 ... G2,n
where the Gi,j's are glyph indices in the range [0,512) and C is a
character code in the range [0,256), as in the multimap file. Both can be
given in decimal or hexadecimal.
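A parserule line could be split into its parts as sketched below. Several
things here are assumptions rather than facts from the utility: the square
brackets are read as "optional context" notation and not as literal
characters, fields are taken to be whitespace-separated, and the ranges are
taken to be the same as in the multimap file (glyph indices in [0,512),
character code in [0,256)). The function names are hypothetical.

```python
def parse_number(token):
    """Parse a decimal or 0x-prefixed hexadecimal token."""
    if token.lower().startswith("0x"):
        return int(token, 16)
    return int(token, 10)


def parse_parserule_line(line):
    """Parse '[context glyphs] C = replacement glyphs'.

    Returns (context_glyphs, char_code, replacement_glyphs). The last
    number before '=' is taken to be the character code C.
    """
    left, separator, right = line.partition("=")
    if not separator:
        raise ValueError("missing '=' in parserule")
    lhs = [parse_number(token) for token in left.split()]
    rhs = [parse_number(token) for token in right.split()]
    if not lhs or not rhs:
        raise ValueError("both sides of a parserule must be non-empty")
    context, char = lhs[:-1], lhs[-1]
    if not 0 <= char < 256:
        raise ValueError("character code out of range [0,256)")
    if any(not 0 <= g < 512 for g in context + rhs):
        raise ValueError("glyph index out of range [0,512)")
    return context, char, rhs


# Assumed example: glyphs 0x120 0x121 followed by character 0xB3
# are rewritten as glyphs 0x130 0x131.
rule = parse_parserule_line("0x120 0x121 0xB3 = 0x130 0x131")
```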
4.2.1 Indian language support library: “Libind”
Files:
• Fontmap files: Libind uses the Fontmap files to map ISCII
characters to their equivalent font codes. This mapping forms the
parse rules for the language. It defines each ISCII character sequence
along with the resultant non-trivial glyph(s). The Fontmaps are specific
to the font and the language. Users can create their own Fontmap,
place it in the maps directory (Libind searches in this directory)
under the name “<language>.map”, where <language> is the language
for which the Fontmap is meant, and access it from an application by
calling the “indian_init” function with the language parameter set to
<language>.
• Keymap files: Libind uses the keymap files to define the keyboard
layout. These map the ASCII codes of the keys to the desired ISCII
characters. The keymaps for the standard Inscript and the phonetic
keyboard layouts are provided in the keymap directory (Libind
searches in this directory).
Functions:
• indian_init: This function reads a font map table, which maps ISCII
syllables to their equivalent font codes, and returns the size of the font
map table and the font for the language.
• Iscii2font: This function converts the input ISCII string into its
equivalent font codes.
The text holds the actual ISCII/ASCII code of the character to be
displayed in each column of each row, including the scrollback
buffer. The rendering information includes the basic attributes
(foreground color, background color, bold, underline, etc.) and special
attributes (whether the cursor is in the column, whether the character is
ASCII or ISCII, etc.) as shown in Figure 3 below.
The length of a row is the actual number of characters in the line, or
-1 in the case of wrapped lines. Each line of both text and rendering
is allocated only on demand. The text and rendering lines are pairs, which
are allocated or deallocated together.
[Figure 3: bit layout of the rendering information, bits 31 to 0.]
The “screen” structure will hold the information for rows even in the
scroll back region.
Unlike the normal English fonts, which are mono-spaced, the Indian
script fonts are proportional with each character glyph having different
width. Therefore, when accommodating Indian scripts, the column-pixel
relation with respect to display has to be modified.
Let P_i denote the pixel position for the ith column, denoted by C_i. Let
FW_i denote the width of the character glyph in the ith column. In the
case of English characters, where every glyph has the same width
FW_fixed, the relation is:
P_i = C_i * FW_fixed
When the character glyphs from both Indian and English script are
involved, the computation is accordingly modified.
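The modified computation is not spelled out here. One natural reading,
offered only as an assumption, is that the pixel position of a column is
the sum of the widths of the glyphs in all columns to its left, with
columns numbered from 0. The function name and the example widths are
hypothetical.

```python
def column_pixel_positions(widths):
    """Left-edge pixel position of each column.

    `widths` holds FW_i, the glyph width in column i. Column i starts at
    the sum of the widths of columns 0 .. i-1.
    """
    positions = []
    x = 0
    for width in widths:
        positions.append(x)
        x += width
    return positions


# Assumed example: two 8-pixel English glyphs followed by a 14-pixel and
# a 6-pixel Indian-script glyph.
mixed = column_pixel_positions([8, 8, 14, 6])
```

When every width equals FW_fixed, this reduces to the fixed-width relation
P_i = C_i * FW_fixed.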
For displaying Indian languages, scalable TrueType Fonts (TTF) are used.
During the screen refresh, the rendering bits of each character are
checked to see whether it is an English character or an Indian language
character, and the font is set for display accordingly.
In Indian languages, there is a wide range of vowel modifiers which, when
applied to base consonants, result in modified glyphs. To handle all these
consonant-vowel clusters, a map file containing all possible conjuncts
and their equivalent glyphs is developed for each Indian language.
A set of lex rules is defined for valid conjunct formation. Once a
word matches a rule, a binary search is made in the parse rules and
the matching TTF glyphs are retrieved. These are subsequently
displayed using an ``IDrawString'' function.
[Figure 5: Consonant-Vowel cluster formation. Recoverable labels: ISCII
stream, Lex parser, words, ISCII to TTF, TTF codes, draw string; F1: toggle
and load Inscript / Phonetic keymap; F2: load Indian language parse rules.]
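The binary search over the parse rules can be sketched as a lookup in a
table sorted by ISCII byte sequence. This is an illustration under
assumptions, not the iitm-term code: the table layout, the function name
and the byte values (which are placeholders, not a real fontmap) are all
hypothetical.

```python
from bisect import bisect_left


def lookup_cluster(sorted_keys, glyph_table, cluster):
    """Binary-search `cluster` (ISCII bytes) in the sorted rule keys.

    Returns the matching TTF glyph codes, or None if no rule exists.
    `glyph_table[i]` is the glyph string for `sorted_keys[i]`.
    """
    index = bisect_left(sorted_keys, cluster)
    if index < len(sorted_keys) and sorted_keys[index] == cluster:
        return glyph_table[index]
    return None


# Placeholder rules, kept sorted by key so that bisect can be used.
rules = sorted({
    b"\xb3": b"\x41",
    b"\xb3\xda": b"\x41\x61",
    b"\xb3\xe8\xb3": b"\x7b",
}.items())
keys = [key for key, _ in rules]
glyphs = [value for _, value in rules]

found = lookup_cluster(keys, glyphs, b"\xb3\xda")
missing = lookup_cluster(keys, glyphs, b"\xb4")
```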
During startup, the virtual terminal has a width to accommodate 80
English characters, which may not be sufficient to accommodate that
many Indian characters. However, the window can be easily resized
using the mouse.
This problem was solved in a two step approach. In the first step, the
events were allowed to post their data into a shared memory location
protected by a Ternary Semaphore as explained below. In the second
step, the data from the shared memory was processed in an optimized
way.
The semaphore value "Old" indicates that the data in the shared buffer
has been used up and can now be overwritten by the asynchronous X
event. The value "New" means that the data in the shared buffer has not
yet been processed. A situation wherein the X event sees the semaphore
value "New" indicates a speed mismatch between the generation of event
data and its consumption. In this situation, the buffer is overwritten
with the new data, thereby losing the previous unprocessed data. This is
acceptable in situations such as the movement and resizing of windows.
The semaphore value "InUse" indicates that the data in the shared buffer
is currently being used and should not be overwritten. In this case, the
new data of the X event is ignored. This situation also indicates a
speed mismatch.
So the data posting is done on the X event, based on the semaphore, and
the processing is done along with a less computation-intensive event. By
following this approach, the crashing of the terminal has been eliminated.
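The three semaphore values and the posting rules can be modelled as a small
state machine. This is a single-threaded sketch of the protocol only; the
class and method names are hypothetical, and a real implementation would
need the state changes to be atomic with respect to the X event handler.

```python
from enum import Enum


class State(Enum):
    OLD = "Old"        # buffer consumed, free to overwrite
    NEW = "New"        # buffer holds unprocessed data
    IN_USE = "InUse"   # buffer is being processed, must not be touched


class SharedEventBuffer:
    def __init__(self):
        self.state = State.OLD
        self.data = None

    def post(self, data):
        """Called on an X event. Returns True if the data was stored."""
        if self.state is State.IN_USE:
            return False           # consumer is busy: ignore the new data
        self.data = data           # "Old": store; "New": overwrite stale data
        self.state = State.NEW
        return True

    def begin_processing(self):
        """Called by the consumer. Returns the pending data, or None."""
        if self.state is not State.NEW:
            return None
        self.state = State.IN_USE
        return self.data

    def end_processing(self):
        """Mark the buffer as consumed so that it may be overwritten."""
        self.state = State.OLD
```

Overwriting in the "New" state keeps only the most recent event, which
suits resize and move events where intermediate values carry no lasting
information.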
To enable Emacs with support for Indian languages, the user interface
and system responses need to be changed. So, the frequently used string
literals in the Emacs LISP code have been made into variables in the
format:
These plug-in functions and files are kept in the site-dependent startup
directory. To enable 8-bit support, so that Emacs does not discard the
8th bit from input, “(set-input-mode nil nil 1)” is inserted in the startup
file. To load an Indian language, the following code is inserted in the
startup file:
A user can easily switch between the various supported Indian languages
by using the “select-indian-language” function from the “.emacs” startup
file in the home directory.
5.3 Ttf2Psf:
The issue that has been discussed in Section 3.4 prompted the
development of the utility ‘Ttf2Psf’. The FreeType Library has been used
to extract the bitmaps of the specified glyphs from a TTF file and
generate a PSF file with the bitmaps of the specified glyphs embedded in
it.
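The output side of such a conversion can be sketched as follows, assuming
the version 1 PSF layout: a 4-byte header (magic bytes 0x36 0x04, a mode
byte whose lowest bit selects 512 glyphs, and the glyph height in rows)
followed by the glyph bitmaps, one byte per 8-pixel row. The extraction of
bitmaps with FreeType is not shown, the function name is hypothetical, and
whether Ttf2Psf writes exactly this variant is an assumption.

```python
PSF1_MAGIC = b"\x36\x04"
PSF1_MODE512 = 0x01  # set when the font holds 512 glyphs instead of 256


def build_psf1(glyphs, height):
    """Pack glyph bitmaps into PSF version 1 file contents.

    `glyphs` is a list of 256 or 512 glyphs. Each glyph is a list of
    `height` integers in 0..255, one per 8-pixel-wide row.
    """
    if len(glyphs) not in (256, 512):
        raise ValueError("PSF1 holds exactly 256 or 512 glyphs")
    mode = PSF1_MODE512 if len(glyphs) == 512 else 0
    out = bytearray(PSF1_MAGIC)
    out += bytes([mode, height])
    for rows in glyphs:
        if len(rows) != height:
            raise ValueError("every glyph needs exactly `height` rows")
        out += bytes(rows)
    return bytes(out)


# Assumed example: a blank 8x16 font with 512 glyph slots.
blank_font = build_psf1([[0] * 16 for _ in range(512)], 16)
```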
native language support can be easily provided for different languages by
just creating the machine object files.
The GCC ‘C’ Compiler supports gettext, and this feature has been used in
developing local language support for gcc. The machine object file has
been developed for Malayalam, and this enables display of commonly
encountered error messages and warnings of the compilation process in
Malayalam. Also, the programmer can give comments and strings in
Malayalam inside a ‘C’ program. The effort for providing native language
support for a ‘C’ compiler has many advantages from a social
perspective. With this support, lack of proficiency in English will no
longer be a handicap to someone who wants to program in ‘C’. This
would encourage people who are not proficient in English to learn
programming, and they can become effective contributors to the development
of small-scale software systems, which in turn could make a significant
difference in the rural Indian context.
Support for Malayalam has been provided for Pine, a mail utility, and
Pico, an editor. As these applications have not been designed with
gettext support, our approach has been to modify the source code to
change the string variables and string constants in it to Malayalam.
The new applications display a user interface in Malayalam.
The content generated with our Indian-language-enabled editors (emacs,
pico) can be printed using any printer with ISCII support (for example,
the TVSE- MSP 430).
To complete the full cycle of Indian script handling for printers that do
not have built-in ISCII support, a utility, ``iscii2ps'', has been developed.
This utility helps in printing an ISCII text file. It reads an ISCII
text file and applies the language-dependent parse rules to form words
composed of font codes of the TTF font file for the language. Then a
PostScript file is generated with language tags for these words. PostScript
functions are written to handle the margins, line justification, new lines
and page breaks. The utility also generates a file that can be sent directly
to the printer. With this minimal page formatting, the utility helps in
getting a printed version of an Indian language text in TrueType fonts.
This support has been extended for use on a low-end dot matrix printer
as well.
6 Conclusion
The iitm-term, the modifications made in the Linux kernel and the
utilities that have been developed can be used to effectively provide
console-based support and X Window System support for any Indian
language, or any language that demands variable-width fonts and/or uses
modifiers and clusters. At the console level, the support has been
developed for Tamil and Malayalam. The task of providing console-based
support for any Indian language reduces to that of:
• Creation of a PC Screen Font file: the Ttf2Psf utility can be used for
this purpose.
• Creation of the multi-maps and parse-rules as appropriate for the
language.
The support provided can be used for the process of localization of other
internationalized packages as well. Using the support that has been
provided, native language support can be extended to new applications
as well.
BIBLIOGRAPHY
[1]. Kenneth Keniston, “Politics, Culture and Software”, Economic and Political Weekly,
Vol. XXXIII, No. 3, pp. 105-110, January 17, 1998.
[2]. Indian Standard, Indian Script Code for Information Interchange - ISCII from Bureau
of Indian Standards, [email protected]
[6]. Michael Beck, Harald Bohme, Mirko Dziadzka, Ulrich Kunitz, Robert Magnus, Dirk
Verworner. Linux Kernel Internals Second Edition, 1998, Addison-Wesley. Chapters
1-4, 7, Appendices A,B
[7]. Alessandro Rubini, Linux Device Drivers, 1998, O'Reilly & Associates Inc. Chapters
1-5.
[9]. Naoshad A Mehta and Rudrava Roy, “RXVT - Indian Devanagari'', www.rxvt-
idev.freeservers.com.
[10]. Nabajyoti Barkakati, X Window System Programming, Prentice Hall India Pvt. Ltd.