L2/01-305

Title: Draft UTC Response to L2/01-304, "Feedback on Unicode Standard 3.0", an article published in Vishwabharat@tdil (Newsletter of the TDIL Programme of Ministry of Information Technology, Government of India).
URLs: http://vishwabharat.tdil.gov.in/newsletter1.htm
      http://www.unicode.org/L2/L2001/01304-feedback.pdf
Source: Rick McGowan
Date: August 8, 2001

=====
ABSTRACT

The document L2/01-304 asks for some quite reasonable additional characters, provides some annotations and information for block introductions, and also requests a number of codepoint changes. This document is a detailed preliminary analysis of all the requests and suggestions made in L2/01-304, with some suggested actions for UTC and/or the authors of L2/01-304.

=====
INTRODUCTION

UTC would like to thank the authors of L2/01-304 for writing this detailed analysis of Indic script encoding within the Unicode standard, and looks forward to discussion of the various points raised by the document. Herein, pages of the document L2/01-304 will be referred to by number, beginning with "Page 15".

=====
PAGE 15.

Several points on this page are numbered with small Roman numerals.

In point (iii) of page 15 the document apparently requests what amounts to a name change with regard to the terminology "virama" versus "halant", in various scripts. This cannot be accommodated due to UTC and WG2 policy about name changes, but some explanatory text and/or annotations in the name list could probably be written to clarify the issue, and to discuss the two terms.

Point (iv) on page 15 appears to ask for a complete change in the rendering model so that "halant" would render conjuncts horizontally while "ZWJ" would be used for vertical rendering of conjuncts.
If that is indeed what is being suggested, it cannot be accommodated, because it would invalidate the entire existing model as well as all existing data and implementations -- both font implementations and software.

Suggested UTC action: Seek clarification from the authors as to the intent of point (iv).

=====
PAGE 16.

Point (v) of page 16 seems merely to point out that the last column of each 128-character block is for language-specific letters. This is already set aside for script-specific entries, both in ISCII and Unicode.

Point (vi) suggests that the authors of L2/01-304 will write up some detailed block descriptions for the remainder of the Indic scripts that are not already detailed. This is a very good development, since UTC has not to date been able to write these block introductions.

Point (viii) expresses the desire that transliteration between scripts be simple and one-to-one. However, this is not possible without (of course) completely invalidating the existing codes. (The differences between North and South Indian scripts would seem to make this a practical impossibility anyway, and the goal is furthermore probably contradicted by the apparent plans for Tamil; see below.) Clearly it would be a desirable state of affairs, but the document offers no further explanation or plan.

Point (ix) of page 16 is headed "Updating constraints of Unicode consortium regarding character encoding stability". It seems to be merely a quotation of the Unicode policies as expressed in:

http://www.unicode.org/unicode/standard/policies.html

In this case, UTC should inquire as to why, given rule (a) -- that characters once encoded will not be moved or removed -- the document goes on to propose removal or movement of a substantial number of characters. It would appear that this point may be asking that fundamental policies be changed to accommodate proposed incompatible changes in various scripts. This issue must be clarified.
Suggested action for UTC: obtain clarification from the authors on the intent of including point (ix), page 16; and re-iterate policies of both Unicode and WG2.

=====
PAGE 17.

The remainder of L2/01-304 is divided into sections for each of several scripts, or languages, beginning with Devanagari. Likewise, this document follows that structure with headers for clarity in matching responses to L2/01-304.

Codepoints given below, and throughout the rest of this document, are in hexadecimal and refer to codepoints as used or suggested in L2/01-304. Please note that many of the codepoints under discussion herein are not encoded in the Unicode standard, but are encoding suggestions made in L2/01-304. Here, the suggested codepoint numbers are retained only for clarity in matching responses to L2/01-304, and do not imply any endorsement by UTC or final encoding in any way.

DEVANAGARI.

A chart is presented on page 16, which coincides precisely with the Unicode chart, with some additions, and removal of several consonants with nukta.

0904, 093A, 0955, 0956. The document proposes a number of additions which appear to be fine candidates for encoding. As with all suggested additions, UTC needs detailed explanations of their usage, form, etc., and detailed WG2 forms will be needed. The document itself does not provide sufficient detail for these characters to be formally proposed.

Suggested UTC action: for these, and all other characters mentioned below, obtain further details, and work with MIT experts to draft proposed additions; then submit proposals with WG2 forms.

0958 - 095F. The document proposes to discourage the use of these precomposed characters with nuktas. By putting them into the composition exclusions list, UTC has already excluded them from the Form C normalization. Annotations and cautionary statements could also be added to that effect, with whatever degree of strength is appropriate.

094D. A change in the character name is suggested.
UTC might want to make an annotation, since a name change is not possible.

0970. The document suggests an annotation or explanation which is suitable. Suggested UTC action: add explanation.

The document points out several representative experts who apparently were consulted during the preparation of this document. It also suggests a number of explanations and details that could be added to the block introductions, e.g., for the Konkani language, written with Devanagari.

Suggested UTC action: add Konkani-specific remarks to the Devanagari block introduction, as illustrated on page 17.

0974. DEVANAGARI LETTER SHORT YA is proposed for use in Sindhi, which seems to be a fine addition. Suggested UTC action: request further details and add to the list of proposed items.

Some other Sindhi-related comments and explanations follow, e.g., for 0952. Suggested UTC action: add these comments to the Devanagari block introduction.

=====
PAGE 18.

BENGALI.

For Bengali, as well as a number of other scripts, the document proposes to add DANDA and DOUBLE DANDA clones. It was long ago decided in UTC not to clone these punctuation characters. Therefore all of the suggestions for adding DANDA and DOUBLE DANDA characters for all of the scripts must be declined. However, the block introductions should specify where users are to look for the DANDA and DOUBLE DANDA characters (in the Devanagari block).

The document also proposes the addition of INVISIBLE LETTER in a number of scripts. INVISIBLE LETTER will be dealt with here, and not below under each place it is proposed. It was long ago decided that Unicode would not use this INVISIBLE LETTER, and the mechanisms involving ZWJ are well explained in relation to Devanagari, etc. This has been discussed elsewhere, but comments aren't available. Here is one note from Anupam Saurabh, May 27 1998:

"The INV is used to simulate a joining and display of resultant glyph with an invisible consonant.
ZWJ as described on page 6-71 of Unicode 2.0 is used to alter the behavior of rendering process as if it had been joined with either preceding or following character, or both. It also mentions the function of ZWJ for Indian languages, and the explanation is identical to that of INV in ISCII. Apart from difference in the language of explanation, I do not find any other difference."

09BD. "Avagrah" for Bengali is proposed. Suggested UTC action: add this and several other Avagraha characters that are proposed for other scripts to the list of suggested additions. These are well-understood already, so no further information is required about each of them; they merely need to be added to the list of proposed additions.

09CE, 09CF, 09DE. The document suggests adding signs for Bengali YA, RA, and LA. This would apparently be a change to the model for Bengali (which would not be possible), and it needs to be considered in detail, with reasoning for the proposal outlined. In any case, some detailed explanation is needed.

Proposed UTC action: request further clarification and detailed explanation of these proposed additions for Bengali.

09F4, 09F5, 09F6. Changes in name list annotations are suggested. These seem fine and should be added to the list of proposed additions.

GURMUKHI.

0A01. Gurmukhi sign "adak bindi" and a visarga are suggested. These seem fine, and should be added to the list of proposed additions. Further clarification or documentation would be useful.

0A50. This suggestion amounts to moving the existing U+0A74 to 0A50. It is not possible to move it. The suggested annotation is already made in the name list for U+0A74. No action is needed.

0A4E, 0A4F. These two proposed additions for RA and HA subjoined signs for Gurmukhi would possibly change the model, and UTC should request detailed explanation to decide whether they are reasonable additions.

0A64, 0A65. More Danda, Double Danda. See above.

0A78.
The document suggests addition of "Gurmukhi Sign Khanda", which is identical to the character already encoded as U+262C. There is no need to add a clone of this character here, but the block introduction could point out that it is encoded at U+262C.

0A33. A shape change in the chart is suggested. This needs detailed documentation as to whether this is a simple mistake, or a font detail. Suggested UTC action: seek clarification and reasoning for the suggested change.

The document suggests moving 0A74 to 0A50, which will not be possible. But it is also said to be "not used", so maybe an annotation is in order regarding its obsolescence? Suggested UTC action: request clarification, and propose an annotation, if needed.

GUJARATI.

0A8C. Gujarati vocalic L is suggested. It is probably fine for UTC to add this to the list of proposed additions without further explanation, as it is a well-understood letter.

0AD1, 0AD2, 0AD3, 0AD4. These are some new accent marks that appear similar to, or even identical to, accents in the U+0300 area. They need to be looked at in detail and explanations provided. But they should probably not be added; the existing non-spacing marks should be used instead.

Suggested UTC action: request further explanation of these marks, and suggest using the existing U+0300, etc., where applicable.

0AE1, 0AE2, 0AE3. These are suggested additions of Vocalic L, LL. It is probably fine for UTC to add these to the list of proposed additions without further explanation, as they are well-understood letters.

0AF1. A rupee sign for Gujarati. It is probably fine to add this, since it looks like a symbol that is not made from pieces that are already encoded Gujarati characters. Since the form of this character is very Gujarati-like, it should probably be proposed for encoding at this location, rather than in the Currency Symbols block.

ORIYA.

0B0A, 0B0B, 0B48, 0B4C. The document proposes changing the shapes of these codepoints, but there is no explanation.
Details are needed before a determination can be made. Are these simply font differences? They may be so. Suggested UTC action: request clarifications as mentioned.

0B66. The shape/size of the representative glyph for the "zero" character is probably fine to change; the document gives some detail as to why it should be smaller than 0B20, to avoid confusion. Suggested UTC action: amend the charts for this character.

=====
PAGE 19.

(ORIYA, cont'd.)

The document suggests removing the annotation under character 0B2C, probably because it also suggests the addition of an Oriya "va" character at 0B35. Suggested UTC actions: remove the annotation and put "va" on the list of proposed additions for Oriya.

0B64, 0B65, 0B3A. The dandas and invisible character are again proposed to be cloned here, which cannot be accommodated. See the explanatory information above under Bengali.

TAMIL.

0B83. The document indicates that this is not a combining character at all, but an independent character. The dotted circle may need to be removed from the representative glyph. In any case, it needs investigation, since it has the "Mc" category in the Unicode character data. This may be one mistake that UTC will have to work around by deprecating this character and adding an appropriately SPACING character at another location.

0B82. The document suggests that this is not used in Tamil. Presumably, this means that the Tamil language itself does not use it. Suggested UTC action: clarify this, and annotate the character "for use in Sanskrit" if appropriate.

Here for TAMIL, and in other places below for other scripts, the document strongly indicates that the Unicode encoding with respect to the two-part vowel signs is considered simply incorrect. It is apparently desired that Unicode remove the explanation of using sequences like "consonant, 0BC6, 0BBE" instead of the two-part vowel symbol 0BCA after the consonant. This probably needs some detailed discussion in UTC. The document suggests using the two-part vowel signs rather than the split pieces.
Normalization form NFC should then be preferred here, and UTC may want to annotate this, and/or deprecate the split-up pieces in some cases.

Under the heading "TAMIL" the item numbered (3) is a serious issue: "Tamil letter sequencing as in the Unicode Standard 3.0 is also not acceptable. New code-set is being worked out." This looks like groundwork to ask for a complete overhaul and replacement of Tamil encoding.

Suggested UTC action: work with the experts, MIT and INFITT to show the workability of the current Unicode Tamil encoding. If MIT goes ahead with endorsing an entirely new Tamil encoding within India, UTC should propose working together to specify a precise mapping between the existing Unicode Tamil encoding and whatever local Indian standard is proposed to replace the ISCII Tamil encoding.

TELUGU.

Again the document asks for an invisible letter and a change in halant/virama naming; see above under Bengali and Devanagari, respectively.

0C3D, 0C3C, 0C0D, 0C11, 0C34. These seem like five reasonable additions for Telugu, and UTC should probably add them to the list of proposed additions for Telugu.

KANNADA.

0CBC, 0CBD. The document requests Nukta and Avagraha to be added for Kannada. These are well-understood additions and should be put onto the list of proposed additions without requiring any further information.

0CD2, 0CD1, 0CF9, 0CD3, 0CD4. Several additional diacritics are suggested, but there is not enough explanation for UTC to come to a determination. Some are probably just clones of non-spacing marks in the U+0300 block, and need to be explained before UTC can determine what to do. As with similar diacritical marks suggested in previous sections, UTC should request clarification and point out the existing marks.

Again, the document requests that Unicode "delete" the equivalences for split vowel signs for Kannada. Suggested UTC action: discuss in conjunction with the similar requests above for Tamil.
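
The relationship between the split pieces and the two-part vowel signs discussed for Tamil and Kannada can be illustrated with a small sketch. It is not part of L2/01-304 or of any UTC proposal; it assumes Python's standard unicodedata module, and the choice of the consonant KA (0B95, 0C95) and of the Kannada sign 0CC7 as sample characters is an assumption made only for illustration. The helper name codepoints is likewise hypothetical.

```python
import unicodedata


def codepoints(text):
    """Return the code points of text as 4-digit uppercase hex strings."""
    return [f"{ord(ch):04X}" for ch in text]


# Tamil: KA (0B95) followed by the split pieces 0BC6 + 0BBE,
# versus KA followed by the two-part vowel sign 0BCA.
tamil_split = "\u0B95\u0BC6\u0BBE"
tamil_whole = "\u0B95\u0BCA"

# Kannada (sample chosen for illustration): KA (0C95) followed by
# 0CC6 + length mark 0CD5, versus KA followed by the two-part sign 0CC7.
kannada_split = "\u0C95\u0CC6\u0CD5"
kannada_whole = "\u0C95\u0CC7"

for split, whole in [(tamil_split, tamil_whole), (kannada_split, kannada_whole)]:
    # NFC composes the split pieces; NFD decomposes the two-part sign.
    composed = unicodedata.normalize("NFC", split)
    decomposed = unicodedata.normalize("NFD", whole)
    print(codepoints(split), "NFC ->", codepoints(composed))
    print(codepoints(whole), "NFD ->", codepoints(decomposed))
```

Under these assumptions, NFC turns each split sequence into the corresponding two-part vowel sign and NFD does the reverse, which is why preferring NFC yields the representation the document asks for without removing the equivalences.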
(Note: The previous L2 document (L2/01-037) submitted by the Directorate of Information Technology, Government of Karnataka, also points out that the various "length marks" have no independent existence; see L2/01-037 page 4 of 17. Furthermore, 0CD5 and 0CD6 as well as 0CE1 are therein suggested for deletion; see page 11 of 17.)

=====
PAGE 20.

MALAYALAM.

A number of additions are requested, which seem fine, but they need to be explained and documented before they can be added to the list of proposed additions for Malayalam.

The document also suggests changing the representative shape of 0D4C, but UTC should request confirmation and an explanation of the motivation for the proposed changes before taking any action.

Then the document requests that the names of seventeen consonants be changed, which cannot be accommodated.

The document also suggests removing the character U+0D57 as a duplicate of U+0D4C. Suggested UTC action: the claim must be investigated, and possibly one of them deprecated; but 0D57 cannot be removed.

ARABIC.

The document suggests adding three characters at 0656, 0657, 0658, since they are used in Urdu. Suggested UTC action: request further information and documentation before adding them to the list of proposed additions for Arabic. As suggested in correspondence to UTC from IBM, it will probably be found expedient to make any such additions to the Extended Arabic block rather than at the proposed locations.

The document suggests removing one annotation for 0690, which could be reasonable, and several other annotations as well, which should be investigated. Suggested UTC action: investigate the proposed annotations.

The document asks for a new diacritic for "Hamza" at U+0659, but this seems to be already encoded at U+0654. Suggested UTC action: request clarification.

Annotations are suggested for pointing out Sindhi shape differences for some numerals 0664 - 0667.
These are probably reasonable additions for the Arabic block introduction, and UTC should probably just add this information.