L2/01-307

From: Kenneth Whistler [kenw@sybase.com]
Sent: Monday, August 06, 2001 5:56 PM
Subject: Serious bug in Khmer, Myanmar combining classes

WARNING. The following presentation contains explicit linguistic material that may not be suitable for young audiences.

Peter Constable noted the problem that Myanmar dependent vowels i (U+102D) and u (U+102F) both have been given combining class zero, but that they must be used in sequence to represent the common Myanmar vowel ui. However, since i is a combining mark above and u is a combining mark below, this leads to a visual ambiguity -- either order could, in principle, lead to the same visual rendering, but the two orders would not be canonically equivalent, since combining marks of class zero do not rearrange under normalization.

The short summary of the discussion which followed is:

Peter suggested fixing the combining classes to non-zeroes, so the two sequences would be canonically equivalent (as would be the case for a sequence of an accent above and an accent below a Latin letter, for instance).

Mark said, we can't fix it -- it's against Unicode policy.

Rick said, it's an error, and the UTC should discuss it.

Mark said, the IETF and W3C would kill us if we tried to change the combining classes. We can only document it.

Peter said implementations will end up having to do an ad hoc kind of normalization, and that's a problem.

Here are some more gory details about Myanmar to scare the kiddies with.

The basic pattern for Myanmar is as follows. The syllabic order is

ka kaa ki kii ku kuu ke kai ko koo kã kui.

Spelled out in Unicode, this is:

ka    1000
kaa   1000 102C
ki    1000 102D
kii   1000 102E
ku    1000 102F
kuu   1000 1030
ke    1000 1031
kai   1000 1032
ko    1000 1031 102C
koo   1000 1031 102C 1039
kã    1000 1036
kui   1000 102F 102D

The pattern is nice and clean except for ko, koo, and kui.

-o is a two-part vowel, with the -e piece to the left of the consonant, and the -aa piece to the right of the consonant.
-oo adds a visual killer (1039, combining class 9) on top of the -aa piece.
-ui is a two-part vowel, with the -u piece underneath the consonant, and the -i piece on top of the consonant.

Alternatives were proposed and discussed ad nauseam before Myanmar was finally agreed to. In particular, there were proposals that had a single encoded character for each of the two-part vowels (or some subset of them). The situation was complicated by the mismatched pattern for the independent vowels, some of which are written by cliticizing the dependent vowel to U+1021 'a' (which behaves like an open syllable initial consonant placeholder -- or could be analyzed as a glottal stop, I suppose), and some of which have distinct composite forms of their own. In the end, things were horsetraded down to what we've got, and like it or not, we are stuck with it now. (Note, for the record, that the Myanmar participants agreed to the idea of encoding -ui as a sequence of two characters, so this wasn't just something foisted on them by glyph-oriented Westerners ignorant of the vocalic pattern.)

Now consider the problem of "spelling" of the two-part vowels. Peter points out the visual ambiguity. In principle, kui could also be spelled 1000 102D 102F, instead of 1000 102F 102D, and under an ordinary implementation, you wouldn't be able to tell them apart.

However, the problem is not so simple as above and below pieces. If you look at the other two-part vowel, the one with the left and right pieces, -o, the same ambiguity exists, despite the fact that we are not talking about above and below clitics. If one spelled ko 1000 102C 1031, a renderer would still be faced with the problem that 1031 is defined as rendering to the *left* of its consonant base, and 102C as rendering to the right. One could argue that a dumb renderer would end up positioning these visually as: [1000 1031 102C], i.e.
moving the 1031 around the 102C, but not around the consonant ka, 1000, resulting in a visually incorrect display, so that there would not be any visual ambiguity. But that just means that a dumb renderer would get it wrong, whereas a renderer that checked appropriately for the preceding consonant might, in fact, get it right, resulting in visual ambiguity again.

In my opinion, in *both* of these instances, the right way to proceed is to specify the correct order, and to characterize the other order as a *spelling* error -- not as a canonicalization error.

The way to eliminate the visual ambiguity in the -ui case is to write a Myanmar renderer such that if it encounters the two pieces of the -ui vowel in the wrong order, it displays them visually wrong (intentionally), rather than quietly stacking them as if they were spelled in the correct order. That will give correct feedback for all of the potentially ambiguous cases.

Furthermore, one would expect that Myanmar input methods would provide single key access to all of the two-part vowels, in any case, as for most Indic keyboarding systems. This will work to help keep the -ui's and -o's correct in the underlying store.

Actually, rather than the -ui vowel issue, where having everything assigned a combining class of zero still allows a consistent way to implement the behavior desired, there is another issue where I think the combining classes *are* clearly wrong, but still cannot be fixed.

The issue is for U+1037, the aukmyit dot below. In Myanmar, this is a tone mark, *not* a nukta. But it was given the combining class of a nukta, i.e. 7. By itself, that would cause no harm, but the problem is that 1037 comes in a pattern pair with U+1038 MYANMAR SIGN VISARGA, which also behaves, in Myanmar, as a tone mark.
Thus we get tonal triples of the sort:

ang1   1021 1004 1039        ( a -nga -killer )
ang2   1021 1004 1039 1037
ang3   1021 1004 1039 1038

This is the order that I think makes the most linguistic sense, where the killer is applied to the nga to create a final -ng consonant, and then the tone marks, if any, are in logical order following the killer. Visually, the dot below appears below the -ng, and the visarga, a colon-shaped double dot, appears to the right of the -ng (with the killer above the -ng).

The problem is that the combining class of the killer is 9, as for all other halants (viramas), whereas the combining class of the 1037 dot below is 7, and the combining class of the 1038 visarga is zero. That means that the representation of ang2 is not in canonical order, which would instead be:

ang2   1021 1004 1037 1039

whereas the representation of ang3 *is* in canonical order. This asymmetry of two otherwise parallel and very commonly occurring forms under normalization is likely to create problems for processing of Myanmar data.

The alternative would be to specify that the correct spelling of tone marks applied to consonant-final syllables is to place the tone marks *before* the syllable-final killer:

ang2   1021 1004 1037 1039
       0    0    7    9
ang3   1021 1004 1038 1039
       0    0    0    9

In this way, despite the mismatch in combining classes for 1037 and 1038, both of these expressions would be in canonical order, which would bode better for systematic processing, despite the somewhat counterintuitive notion of putting the tone mark in between the final consonant and its killer. (In particular, for ang3, the 1039 killer would have to rearrange around the 1038 visarga, so that it correctly appeared on top of the 1004 nga.)
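
As a rough illustration of the combining-class argument above, the following Python sketch checks the sequences discussed in this message against the standard library's unicodedata module. It is only a sketch: the helper and variable names (hexes, classes, nfd, ang2_logical, and so on) are made up for the example, and the combining classes it reports are those in whatever version of the Unicode Character Database the Python build ships with.

```python
import unicodedata

def hexes(s):
    """Render a string as space-separated code points, e.g. '1021 1004'."""
    return " ".join(f"{ord(c):04X}" for c in s)

def classes(s):
    """Canonical combining class of each character in s."""
    return [unicodedata.combining(c) for c in s]

def nfd(s):
    """Canonical decomposition, which includes canonical reordering."""
    return unicodedata.normalize("NFD", s)

# Tone marks after the killer (the "logical" spelling in the message).
ang2_logical = "\u1021\u1004\u1039\u1037"
ang3_logical = "\u1021\u1004\u1039\u1038"

# Tone marks before the killer (the alternative spelling).
ang2_alt = "\u1021\u1004\u1037\u1039"
ang3_alt = "\u1021\u1004\u1038\u1039"

# The two possible orders of the pieces of the -ui vowel.
kui = "\u1000\u102F\u102D"
kui_swapped = "\u1000\u102D\u102F"

for label, text in [
    ("ang2 logical", ang2_logical),
    ("ang3 logical", ang3_logical),
    ("ang2 alt", ang2_alt),
    ("ang3 alt", ang3_alt),
    ("kui", kui),
    ("kui swapped", kui_swapped),
]:
    # Print the input, its combining classes, and its normalized form.
    print(f"{label:13} {hexes(text):20} {classes(text)} -> {hexes(nfd(text))}")
```

In this sketch the logical spelling of ang2 is reordered by normalization (1039 and 1037 swap), the logical spelling of ang3 is left alone, and the two orders of kui stay distinct because both vowel signs have class zero.
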

What this is all pointing to, in my opinion, is that we are desperately in need of implementation guidelines for Myanmar (and for Khmer) in the same kind of detail as for Devanagari, so that these ordering issues and ambiguities can be nailed down in sufficient detail to enable a text model of properly spelled Myanmar (and Khmer). Otherwise, we will not be able to interchange text successfully. Or at least, while the text itself could be interchanged, it would be spelled by drastically different conventions -- and since for Indic scripts, the "spellings" involve complicated interactions with the rendering rules, a spelling that works for Renderer A might result in illegible gibberish in Renderer B, which was assuming different spelling conventions. That would fail the Unicode plain text criteria for interoperability.

All my ruminations on this topic are gladly contributed to the cause, but I think it is imperative that someone who actually has implementation experience with Myanmar in a real system take the lead on this.

--Ken