Unicode 5.1.0

Released: 2008 April 4

Version 5.1.0 has been superseded by the latest version of the Unicode Standard.


Version 5.1.0 of the Unicode Standard consists of the core specification (The Unicode Standard, Version 5.0), as amended by this specification, together with the delta and archival code charts for this version, the 5.1.0 Unicode Standard Annexes, and the 5.1.0 Unicode Character Database (UCD).

The core specification gives the general principles, requirements for conformance, and guidelines for implementers. The code charts show representative glyphs for all the Unicode characters. The Unicode Standard Annexes supply detailed normative information about particular aspects of the standard. The Unicode Character Database supplies normative and informative data for implementers to allow them to implement the Unicode Standard.

Version 5.1.0 of the Unicode Standard should be referenced as:

The Unicode Consortium. The Unicode Standard, Version 5.1.0, defined by: The Unicode Standard, Version 5.0 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0), as amended by Unicode 5.1.0 (http://www.unicode.org/versions/Unicode5.1.0/).

A complete specification of the contributory files for Unicode 5.1.0 is found on the page Components for 5.1.0. That page also provides the recommended reference format for Unicode Standard Annexes.


Contents of This Document

A. Online Edition
B. Overview
C. Errata
D. Notable Changes From Unicode 5.0.0 to Unicode 5.1.0
E. Conformance Changes to the Standard
F. Changes to Unicode Standard Annexes
G. Other Changes to the Standard
H. Unicode Character Database
I. Character Assignment Overview
J. Script Additions
K. Significant Character Additions
Tamil Named Character Sequences
Malayalam Chillu Characters
Myanmar

A. Online Edition

The text of The Unicode Standard, Version 5.0, as well as the delta and archival code charts, is available online via the navigation links on this page. The charts and the Unicode Standard Annexes may be printed, while the other files may be viewed but not printed. The Unicode 5.0 Web Bookmarks page has links to all sections of the online text.

The changes addressed in this document consist of additional characters, new normative text, additional clarifications, and corrections.

This specification is a delta document consisting of changes to the text, typically with an indication of how the principal affected text would be changed. The indications of affected text are not exhaustive; other relevant text in the core specification is also superseded by the text of this specification.

The Unicode Standard Annexes themselves are not delta documents; they incorporate all of the textual changes for their updates for Version 5.1.0.

B. Overview

Unicode 5.1.0 contains over 100,000 characters, and provides significant additions and improvements that extend text processing for software worldwide. Some of the key features are: increased security in data exchange, significant character additions for Indic and South East Asian scripts, expanded identifier specifications for Indic and Arabic scripts, improvements in the processing of Tamil and other Indic scripts, linebreaking conformance relaxation for HTML and other protocols, strengthened normalization stability, new case pair stability, plus others given below.

In addition to updated existing files, implementers will find new test data files (for example, for linebreaking) and new XML data files that encapsulate all of the Unicode character properties.

A major feature of Unicode 5.1.0 is the enabling of ideographic variation sequences. These sequences allow standardized representation of glyphic variants needed for Japanese, Chinese, and Korean text. The first registered collection, from Adobe Systems, is now available at http://www.unicode.org/ivd/.

Unicode 5.1.0 contains significant changes to properties and behavioral specifications. Several important property definitions were extended, improving linebreaking for Polish and Portuguese hyphenation. The Unicode Text Segmentation Algorithms, covering sentences, words, and characters, were greatly enhanced to improve the processing of Tamil and other Indic languages. The Unicode Normalization Algorithm now defines stabilized strings and provides guidelines for buffering. Standardized named sequences are added for Lithuanian, and provisional named sequences for Tamil.

Unicode 5.1.0 adds 1,624 newly encoded characters. These additions include characters required for Malayalam and Myanmar and important individual characters such as Latin capital sharp s for German. Version 5.1 extends support for languages in Africa, India, Indonesia, Myanmar, and Vietnam, with the addition of the Cham, Lepcha, Ol Chiki, Rejang, Saurashtra, Sundanese, and Vai scripts. Scholarly support includes important editorial punctuation marks, as well as the Carian, Lycian, and Lydian scripts, and the Phaistos disc symbols. Other new symbol sets include dominoes, Mahjong, dictionary punctuation marks, and math additions. This latest version of the Unicode Standard has exactly the same character assignments as ISO/IEC 10646:2003 plus Amendments 1 through 4.

C. Errata

Errata incorporated into Unicode 5.1.0 are listed by date in a separate table. For corrigenda and errata after the release of Unicode 5.1.0, see the list of current Updates and Errata. The corrected formulation of the regular expression for More_Above in Table 3-14, Context Specification for Casing, may be of particular interest. For a statement of that erratum for Table 3-14, see Errata Fixed in Unicode 5.2.0.

D. Notable Changes From Unicode 5.0.0 to Unicode 5.1.0

Stability Policy Update

The Unicode Character Encoding Stability Policy has been updated. This update strengthens normalization stability, adds stability policy for case pairs, and extends constraints on property values. For the current statement of these policies, see Unicode Character Encoding Stability Policy.

Characters

  • Additions to Malayalam and Myanmar; characters to complete support of Indic scripts
  • New symbols: Mahjong, editorial punctuation marks, significant additions for math
  • Capital Sharp S for German
  • Some new minority scripts for communities in Vietnam, Indonesia, India, Africa
  • Some historic scripts and punctuation marks

General Specification

  • Important clarification of UTF-8 conformance
  • Improved guidance on use of Myanmar and Malayalam scripts
  • Definitions of extended base and extended combining character sequences

Unicode Standard Annexes

Properties

  • Deprecation of tag characters
  • Incorporation of Corrigendum #6: Bidi Mirroring, so that directional quotation marks are no longer mirrored
  • Revision of the definition of the Default_Ignorable_Code_Point property
  • New standardized named sequences for Lithuanian
  • Detailed documentation of provisional named sequences for Tamil
  • Documentation of U-source ideographs
  • New property values for text segmentation:
  • Sentence_Break property values CR, LF, Extend, and SContinue
  • Word_Break property values CR, LF, Newline, Extend, and MidNumLet
  • Grapheme_Cluster_Break property values Prepend and SpacingMark

E. Conformance Changes to the Standard

Additional Constraints on Conversion of Ill-formed UTF-8

In Version 5.1, the text regarding ill-formed code unit sequences is extended, with new definitions that make it clear how to identify well-formed and ill-formed code unit subsequences in strings. The implications of this for UTF-8 conversion in particular are made much clearer. This change is motivated by a potential security exploit based on over-consumption of ill-formed UTF-8 code unit sequences, as discussed in UTR #36: Unicode Security Considerations.

On p. 100 in Chapter 3 of The Unicode Standard, Version 5.0, replace the existing text from D85 through D86 with the following text. (Note that this does not actually change definitions D85 or D86, but adds two new definitions and extends the explanatory text considerably, particularly with exemplifications for UTF-8 code unit sequences.)

Replacement Text

D84a Ill-formed code unit subsequence: A non-empty subsequence of a Unicode code unit sequence X which does not contain any code units which also belong to any minimal well-formed subsequence of X.
  • In other words, an ill-formed code unit subsequence cannot overlap with a minimal well-formed subsequence.

D85 Well-formed: A Unicode code unit sequence that purports to be in a Unicode encoding form is called well-formed if and only if it does follow the specification of that Unicode encoding form.

D85a Minimal well-formed code unit subsequence: A well-formed Unicode code unit sequence that maps to a single Unicode scalar value.

  • For UTF-8, see the specification in D92 and Table 3-7.
  • For UTF-16, see the specification in D91.
  • For UTF-32, see the specification in D90.

A well-formed Unicode code unit sequence can be partitioned into one or more minimal well-formed code unit sequences for the given Unicode encoding form. Any Unicode code unit sequence can be partitioned into subsequences that are either well-formed or ill-formed. The sequence as a whole is well-formed if and only if it contains no ill-formed subsequence. The sequence as a whole is ill-formed if and only if it contains at least one ill-formed subsequence.

D86 Well-formed UTF-8 code unit sequence: A well-formed Unicode code unit sequence of UTF-8 code units.

  • The UTF-8 code unit sequence <41 C3 B1 42> is well-formed, because it can be partitioned into subsequences, all of which match the specification for UTF-8 in Table 3-7. It consists of the following minimal well-formed code unit subsequences: <41>, <C3 B1>, and <42>.
  • The UTF-8 code unit sequence <41 C2 C3 B1 42> is ill-formed, because it contains one ill-formed subsequence. There is no subsequence for the C2 byte which matches the specification for UTF-8 in Table 3-7. The code unit sequence is partitioned into one minimal well-formed code unit subsequence, <41>, followed by one ill-formed code unit subsequence, <C2>, followed by two minimal well-formed code unit subsequences, <C3 B1> and <42>.
  • In isolation, the UTF-8 code unit sequence <C2 C3> would be ill-formed, but in the context of the UTF-8 code unit sequence <41 C2 C3 B1 42>, <C2 C3> does not constitute an ill-formed code unit subsequence, because the C3 byte is actually the first byte of the minimal well-formed UTF-8 code unit subsequence <C3 B1>. Ill-formed code unit subsequences do not overlap with minimal well-formed code unit subsequences.
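The following sketch is not part of the replacement text above; it is a minimal, unofficial Python illustration of the partitioning just described, using the byte ranges of Table 3-7. The function names and output format are illustrative only.

    # Sketch: partition a UTF-8 byte string into minimal well-formed code unit
    # subsequences and ill-formed subsequences, following the byte ranges of
    # Table 3-7. Illustrative only; not an official or optimized implementation.

    def _trail_ranges(first: int):
        """Return the (lo, hi) ranges for the trailing bytes of a minimal
        well-formed sequence starting with 'first', or None if 'first' cannot
        begin such a sequence."""
        if 0x00 <= first <= 0x7F:
            return []
        if 0xC2 <= first <= 0xDF:
            return [(0x80, 0xBF)]
        if first == 0xE0:
            return [(0xA0, 0xBF), (0x80, 0xBF)]
        if 0xE1 <= first <= 0xEC or first in (0xEE, 0xEF):
            return [(0x80, 0xBF), (0x80, 0xBF)]
        if first == 0xED:
            return [(0x80, 0x9F), (0x80, 0xBF)]
        if first == 0xF0:
            return [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]
        if 0xF1 <= first <= 0xF3:
            return [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]
        if first == 0xF4:
            return [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]
        return None  # C0, C1, F5..FF, or a stray trailing byte

    def partition_utf8(data: bytes):
        """Yield (subsequence, is_well_formed) pairs."""
        i, bad = 0, bytearray()
        while i < len(data):
            ranges = _trail_ranges(data[i])
            n = None if ranges is None else len(ranges) + 1
            ok = (n is not None and i + n <= len(data) and
                  all(lo <= data[i + 1 + k] <= hi
                      for k, (lo, hi) in enumerate(ranges)))
            if ok:
                if bad:
                    yield bytes(bad), False
                    bad = bytearray()
                yield data[i:i + n], True
                i += n
            else:
                bad.append(data[i])   # ill-formed byte; successors are not consumed
                i += 1
        if bad:
            yield bytes(bad), False

    # The examples from the text:
    print(list(partition_utf8(bytes([0x41, 0xC3, 0xB1, 0x42]))))
    # [(b'A', True), (b'\xc3\xb1', True), (b'B', True)]
    print(list(partition_utf8(bytes([0x41, 0xC2, 0xC3, 0xB1, 0x42]))))
    # [(b'A', True), (b'\xc2', False), (b'\xc3\xb1', True), (b'B', True)]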

On p. 101 in Chapter 3 of The Unicode Standard, Version 5.0, replace the existing paragraph just above Table 3-4 with the following text.

Replacement Text

If a Unicode string purports to be in a Unicode encoding form, then it must not contain any ill-formed code unit subsequence.

If a process which verifies that a Unicode string is in a Unicode encoding form encounters an ill-formed code unit subsequence in that string, then it must not identify that string as being in that Unicode encoding form.

A process which interprets a Unicode string must not interpret any ill-formed code unit subsequences in the string as characters. (See conformance clause C10.) Furthermore, such a process must not treat any adjacent well-formed code unit sequences as being part of those ill-formed code unit sequences.

The most important consequence of this requirement on processes is illustrated by UTF-8 conversion processes, which interpret UTF-8 code unit sequences as Unicode character sequences. Suppose that a UTF-8 converter is iterating through an input UTF-8 code unit sequence. If the converter encounters an ill-formed UTF-8 code unit sequence which starts with a valid first byte, but which does not continue with valid successor bytes (see Table 3-7), it must not consume the successor bytes as part of the ill-formed subsequence whenever those successor bytes themselves constitute part of a well-formed UTF-8 code unit subsequence.

If an implementation of a UTF-8 conversion process stops at the first error encountered, without reporting the end of any ill-formed UTF-8 code unit subsequence, then the requirement makes little practical difference. However, the requirement does introduce a significant constraint if the UTF-8 converter continues past the point of a detected error, perhaps by substituting one or more U+FFFD replacement characters for the uninterpretable, ill-formed UTF-8 code unit subsequence. For example, with the input UTF-8 code unit sequence <C2 41 42>, such a UTF-8 conversion process must not return <U+FFFD> or <U+FFFD, U+0042>, because either of those outputs would be the result of misinterpreting a well-formed subsequence as being part of the ill-formed subsequence. The expected return value for such a process would instead be <U+FFFD, U+0041, U+0042>.

For a UTF-8 conversion process to consume valid successor bytes is not only non-conformant, but also leaves the converter open to security exploits. See UTR #36, Unicode Security Considerations.

Although a UTF-8 conversion process is required to never consume well-formed subsequences as part of its error handling for ill-formed subsequences, such a process is not otherwise constrained in how it deals with any ill-formed subsequence itself. An ill-formed subsequence consisting of more than one code unit could be treated as a single error or as multiple errors. For example, in processing the UTF-8 code unit sequence <F0 80 80 41>, the only requirement on a converter is that the <41> be processed and correctly interpreted as <U+0041>. The converter could return <U+FFFD, U+0041>, handling <F0 80 80> as a single error, or <U+FFFD, U+FFFD, U+FFFD, U+0041>, handling each byte of <F0 80 80> as a separate error, or could take other approaches to signalling <F0 80 80> as an ill-formed code unit subsequence.
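As a quick, informal check that is not part of the replacement text above, an existing converter can be tested against these examples. Python's built-in UTF-8 decoder with errors='replace' substitutes U+FFFD and, in current CPython versions, does not consume valid successor bytes:

    # Informal check of the error-handling constraint using Python's built-in
    # UTF-8 decoder with U+FFFD substitution (errors='replace').

    # <C2 41 42>: the valid bytes 41 42 must not be consumed with the error.
    assert bytes([0xC2, 0x41, 0x42]).decode('utf-8', errors='replace') == '\uFFFDAB'

    # <F0 80 80 41>: the only requirement is that 41 be interpreted as U+0041.
    # This particular decoder happens to report each of F0 80 80 as a separate error.
    assert (bytes([0xF0, 0x80, 0x80, 0x41]).decode('utf-8', errors='replace')
            == '\uFFFD\uFFFD\uFFFDA')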

Extended Combining Character Sequences

In order to take into account the normalization behavior of Hangul syllables and conjoining jamo sequences, additional definitions for extended base and extended combining character sequence have been added to the standard.

The following text is added on p. 91 of The Unicode Standard, Version 5.0, just before D52:

Additional Text

D51a Extended base: Any base character, or any standard Korean syllable block.
  • This term is defined to take into account the fact that sequences of Korean conjoining jamo characters behave as if they were a single Hangul syllable character, so that the entire sequence of jamos constitutes a base.
  • For the definition of standard Korean syllable block, see D117 in Section 3.12, Conjoining Jamo Behavior.

The following text is added on p. 93 of The Unicode Standard, Version 5.0, just before D57:

Additional Text

D56a Extended combining character sequence: A maximal character sequence consisting of either an extended base followed by a sequence of one or more characters where each is a combining character, ZERO WIDTH JOINER, or ZERO WIDTH NON-JOINER; or a sequence of one or more characters where each is a combining character, ZERO WIDTH JOINER, or ZERO WIDTH NON-JOINER.
  • Combining character sequence is commonly abbreviated as CCS, and extended combining character sequence is commonly abbreviated as ECCS.

In addition, the existing definitions of grapheme cluster and extended grapheme cluster are slightly modified, to bring them into line with UAX #29, "Unicode Text Segmentation," where they are defined algorithmically.

The existing text for D60 and D61 on p. 94 of The Unicode Standard, Version 5.0, is replaced with the following text:

Replacement Text

D60 Grapheme cluster: The text between grapheme cluster boundaries as specified by Unicode Standard Annex #29, "Unicode Text Segmentation."
  • The grapheme cluster represents a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it.
  • A grapheme cluster is similar, but not identical to a combining character sequence. A combining character sequence starts with a base character and extends across any subsequent sequence of combining marks,  nonspacing  or spacing. A combining character sequence is most directly relevant to processing issues related to normalization, comparison, and searching.
  • A grapheme cluster typically starts with a grapheme base and then extends across any subsequent sequence of nonspacing marks. For completeness in text segmentation, a grapheme cluster may also consist of segments not containing a grapheme base, such as newlines or some default ignorable code points. A grapheme cluster is most directly relevant to text rendering and processes such as cursor placement and text selection in editing.
  • For many processes, a grapheme cluster behaves as if it were a single character with the same properties as its grapheme base. Effectively, nonspacing marks apply graphically to the base, but do not change its properties. For example, <x, macron> behaves in line breaking or bidirectional layout as if it were the character x.

D61 Extended grapheme cluster: The text between extended grapheme cluster boundaries as specified by Unicode Standard Annex #29, "Unicode Text Segmentation."

  • Extended grapheme clusters are defined in a parallel manner to grapheme clusters, but also include sequences of spacing marks and certain prepending characters.
  • Grapheme clusters and extended grapheme clusters do not have linguistic significance, but are used to break up a string of text into units for processing.
  • Grapheme clusters and extended grapheme clusters may be adjusted for particular processing requirements, by tailoring the rules for grapheme cluster segmentation specified in Unicode Standard Annex #29.
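To make the distinction between combining character sequences and grapheme clusters concrete, the following sketch (not part of the replacement text above) splits a string into simple combining character sequences using only the canonical combining class available from Python's unicodedata module. It is an approximation for illustration: it does not implement the UAX #29 grapheme cluster rules, which also require Grapheme_Cluster_Break property data, and the helper name is made up.

    import unicodedata

    def simple_ccs_split(text: str):
        """Approximate split into combining character sequences: a new unit
        starts at every character whose canonical combining class is 0.
        This is NOT full UAX #29 grapheme cluster segmentation; it ignores
        Hangul jamo, ZWJ/ZWNJ, spacing marks, and prepended characters."""
        units, current = [], ''
        for ch in text:
            if unicodedata.combining(ch) == 0 and current:
                units.append(current)
                current = ''
            current += ch
        if current:
            units.append(current)
        return units

    # 'x' + COMBINING MACRON, then 'e' + COMBINING ACUTE ACCENT
    print(simple_ccs_split('x\u0304e\u0301'))   # ['x̄', 'é']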

Updates to Table of Named Unicode Algorithms

Table 3-1, Named Unicode Algorithms, and the associated explanatory text on p. 81 of The Unicode Standard, Version 5.0 should be updated to account for some slight changes in naming conventions for Unicode algorithms in Version 5.1.

The following text is added at the end of the paragraph above Table 3-1:

Additional Text

When externally referenced, a named Unicode algorithm may be prefixed with the qualifier "Unicode", so as to make the connection of the algorithm to the Unicode Standard and other Unicode specifications clear. Thus, for example, the Bidirectional Algorithm is generally referred to by the full name, "Unicode Bidirectional Algorithm". As much as is practical, the titles of Unicode Standard Annexes which define Unicode algorithms consist of the name of the Unicode algorithm they specify. In a few cases, named Unicode algorithms are also widely known by their acronyms, and those acronyms are also listed in Table 3-1.

The following changes are made to select entries in the table:

Current Text

Grapheme Cluster Boundary Determination UAX #29
Word Boundary Determination UAX #29
Sentence Boundary Determination UAX #29
Collation Algorithm (UCA) UTS #10

 

Replacement Text

Character Segmentation UAX #29
Word Segmentation UAX #29
Sentence Segmentation UAX #29
Unicode Collation Algorithm (UCA) UTS #10

Update of Definition of case-ignorable

On p. 124 in Section 3.13, Default Case Algorithms of The Unicode Standard, Version 5.0, update D121 to include the value MidNumLet in the definition of case-ignorable. This change was occasioned by the split of the Word_Break property value MidLetter into MidLetter and MidNumLet.

Replacement Text

D121 A character C is defined to be case-ignorable if C has the value MidLetter or the value MidNumLet for the Word_Break property or its General_Category is one of Nonspacing_Mark (Mn), Enclosing_Mark (Me), Format (Cf), Modifier_Letter (Lm), or Modifier_Symbol (Sk).

Clarifications of Default Case Conversion

On p. 125 in Section 3.13, Default Case Algorithms of The Unicode Standard, Version 5.0, replace the first paragraph under "Default Case Conversion" with the following text and add a new paragraph following rules R1-R4.

Replacement Text

The following rules specify the default case conversion operations for Unicode strings. These rules use the full case conversion operations, Uppercase_Mapping(C), Lowercase_Mapping(C), and Titlecase_Mapping(C), as well as the context-dependent mappings based on the casing context, as specified in Table 3-14.

[Rules R1-R4 are unchanged from 5.0.]

The default case conversion operations may be tailored for specific requirements. A common variant, for example, is to make use of simple case conversion, rather than full case conversion. Language- or locale-specific tailorings of these rules may also be used.

Canonical Combining Class Is Immutable

As a result of the strengthening of the Normalization Stability Policy, the Canonical_Combining_Class property has become an immutable property, rather than merely a stable property.

To adjust for this change, the first bullet after D40 on p. 88 of The Unicode Standard, Version 5.0, is removed from the text.

Clarification of Conformance Clause C7

Because of the potential security issues involved in the deletion of characters, the following explanatory bullet is added to the text of The Unicode Standard, Version 5.0, on p. 72, after the first two bullet items following Conformance Clause C7.

Additional Text

  • Note that security problems can result if noncharacter code points are removed from text received from external sources. For more information, see Section 16.7, Noncharacters, and Unicode Technical Report #36, "Unicode Security Considerations."

F. Changes to Unicode Standard Annexes

In Unicode 5.1.0, some of the Unicode Standard Annexes have minor changes to their titles for consistency with new technical report naming practices. The following summarizes the more significant changes in the Unicode Standard Annexes. More detailed notes can be found by looking at the modifications section of each document.

 

Minor changes were made to the following Unicode Standard Annexes—primarily clerical changes to update the version and revision numbers and to update references to Unicode 5.1.0.

G. Other Changes to the Standard

The following explanatory material augments the text in Unicode 5.0.

Addition of Dual-Joining Group to Table 8-7

In Table 8-7, on p. 279 of The Unicode Standard, Version 5.0, add a row for a new group, right below the YEH group:

BURUSHASKI YEH BARREE

This joining group is for the newly encoded characters for Burushaski:

U+077A ARABIC LETTER YEH BARREE WITH EXTENDED ARABIC-INDIC DIGIT TWO ABOVE

U+077B ARABIC LETTER YEH BARREE WITH EXTENDED ARABIC-INDIC DIGIT THREE ABOVE

These have the isolate and final forms of a Yeh Barree, but unlike U+06D2 ARABIC LETTER YEH BARREE (a letter used in Urdu and a number of other languages), which is right-joining, they are dual-joining.

On p. 281 of The Unicode Standard, Version 5.0, below Table 8-8, add the following explanatory paragraph:

Additional Text

The yeh barree is a form of yeh used in languages such as Urdu. It is a right-joining letter and has no initial or medial forms. However, some letterforms based on yeh barree and used in other languages, such as Burushaski, do take initial and medial forms. Such characters are given the dual-joining type and a separate joining group, BURUSHASKI YEH BARREE, based on this difference in shaping behavior.

Clarification Regarding Non-decomposition of Overlaid Diacritics

Most characters that people think of as being a character "plus accents" have formal decompositions in Unicode. For example:

00C0 LATIN CAPITAL LETTER A WITH GRAVE → 0041 LATIN CAPITAL LETTER A + 0300 COMBINING GRAVE ACCENT

00C7 LATIN CAPITAL LETTER C WITH CEDILLA → 0043 LATIN CAPITAL LETTER C + 0327 COMBINING CEDILLA

Based on that pattern, people often also expect to see formal Unicode decompositions for characters with slashes, bars, hooks, and the like used as diacritics for forming new Latin letters:

U+00D8 LATIN CAPITAL LETTER O WITH STROKE → U+004F LATIN CAPITAL LETTER O + U+0338 COMBINING LONG SOLIDUS OVERLAY

However, such decompositions are not formally defined in Unicode. For historical and implementation reasons, there are no decompositions for characters with overlaying diacritics such as bars or slashes, or for most hooks, swashes, tails, and other similar modifications to the form of a base character. These include characters such as:

U+00D8 LATIN CAPITAL LETTER O WITH STROKE
U+049A CYRILLIC CAPITAL LETTER KA WITH DESCENDER

and also characters that would seem analogous in appearance to a Latin letter with a cedilla, such as:

U+0498 CYRILLIC CAPITAL LETTER ZE WITH DESCENDER

Because these characters with overlaid diacritics or modifications to their base form shape have no formal decompositions, some kinds of processing that would normally use Normalization Form D (NFD) for internal processing may end up simulating decompositions instead, so that they can treat the diacritic as if it were a separately encoded combining mark. For example, a common operation in searching or matching is to sort as if accents were removed. This is easy to do with characters that decompose, but for characters with overlaid diacritics, the effect of ignoring the diacritic has to be simulated instead with data tables that go beyond simple use of Unicode decomposition mappings.

The lack of formal decompositions for characters with overlaid diacritics also means there are increased opportunities for spoofing with such characters. The display of a base letter plus a combining overlaid mark such as U+0335 COMBINING SHORT STROKE OVERLAY may look the same as the encoded base letter with bar diacritic, but the two sequences are not canonically equivalent and would not be folded together by Unicode normalization.

For more information and data for handling these confusable sequences involving overlaid diacritics, see UTR #36: Unicode Security Considerations.
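The absence of these decompositions can be observed directly with normalization. A small illustration using Python's unicodedata module, which reflects the decomposition mappings of the Unicode Character Database:

    import unicodedata

    # A-grave has a canonical decomposition ...
    assert unicodedata.normalize('NFD', '\u00C0') == 'A\u0300'

    # ... but O with stroke has none: NFD leaves it unchanged.
    assert unicodedata.normalize('NFD', '\u00D8') == '\u00D8'

    # The look-alike sequence <O, COMBINING LONG SOLIDUS OVERLAY> is therefore
    # not canonically equivalent to U+00D8 and is not folded by normalization.
    assert unicodedata.normalize('NFC', 'O\u0338') != '\u00D8'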

Modifier Letters

Modifier letters, in the sense used in the Unicode Standard, are letters or symbols that are typically written adjacent to other letters and which modify their usage in some way.

They are not formally combining marks (gc=Mn or gc=Mc) and do not graphically combine with the base letter that they modify. In fact, they are base characters in their own right. The sense in which they modify other letters is more a matter of their semantics in usage; they often tend to function as if they were diacritics, serving to indicate a change in pronunciation of a letter, or to otherwise indicate something to be distinguished in a letter's use.

Modifier letters are very commonly used in technical phonetic transcriptional systems, where they augment the use of combining marks to make phonetic distinctions. Some of them have been adapted into regular language orthographies as well.

Many modifier letters take the form of superscript or subscript letters. Thus the IPA modifier letter that indicates labialization (U+02B7) is a superscript form of the letter w. As for all such superscript or subscript form characters in the Unicode Standard, such modifier letters have compatibility decompositions.

Many modifier letters are derived from letters in the Latin script, although some modifier letters occur in other scripts, as well. Latin-derived modifier letters may be based on either minuscule (lowercase) or majuscule (uppercase) forms of the letters, but never have case mappings. Modifier letters which have the shape of capital or small capital Latin letters, in particular, are used exclusively in technical phonetic transcriptional systems. Strings of phonetic transcription are notionally lowercase—all letters in them are considered to be lowercase, whatever their shapes. In terms of formal properties in the Unicode Standard, modifier letters based on letter shapes are Lowercase=True; modifier letters not based on letter shapes are simply caseless. All modifier letters, regardless of their shapes, are operationally caseless; they need to be unaffected by casing operations, because changing them by a casing operation would destroy their meaning for the phonetic transcription.

Modifier letters in the Unicode Standard are indicated by either one of two General_Category values: gc=Lm or gc=Sk.

The General_Category Lm is given to modifier letters derived from regular letters. It is also given to some other characters with more punctuation-like shapes, such as raised commas, which nevertheless have letterlike behavior and which occur on occasion as part of the orthography for regular words in one language or another.

The General_Category Sk is given to modifier letters that typically have more symbol-like origins and which seldom, if ever, are adapted to regular orthographies outside the context of technical phonetic transcriptional systems. This subset of modifier letters is also known as "modifier symbols".

This general distinction between gc=Lm and gc=Sk is reflected in other Unicode specifications relevant to identifiers and word boundary determination. Modifier letters with gc=Lm are included in the set definitions that result in the derived properties ID_Start and ID_Continue (and XID_Start and XID_Continue). As such, they are considered part of the default definition of Unicode identifiers. Modifier symbols (gc=Sk), on the other hand, are not included in those set definitions, and so are excluded by default from Unicode identifiers.

Modifier letters (gc=Lm) have the derived property Alphabetic, while modifier symbols (gc=Sk) do not.

Modifier letters (gc=Lm) also have the word break property value (wb=ALetter), while modifier symbols (gc=Sk) do not. This means that for default determination of word break boundaries, modifier symbols will cause a word break, while modifier letters proper will not.

Most general use modifier letters (and modifier symbols) were collected together in the Spacing Modifier Letters block (U+02B0..U+02FF), the UPA-related Phonetic Extensions block (U+1D00..U+1D7F), the Phonetic Extensions Supplement block (U+1D80..U+1DBF), and the Modifier Tone Letters block (U+A700..U+A71F). However, some script-specific modifier letters are encoded in the blocks appropriate to those scripts. They can be identified by checking for their General_Category values.

There is no requirement that the Unicode names for modifier letters contain the label "MODIFIER LETTER", although most of them do.
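Because modifier letters and modifier symbols are indicated purely by their General_Category values, a simple property check is enough to classify them. A minimal sketch using Python's unicodedata module (the function name is illustrative):

    import unicodedata

    def modifier_kind(ch: str) -> str:
        """Classify a character as a modifier letter (gc=Lm), a modifier
        symbol (gc=Sk), or neither, using only General_Category."""
        gc = unicodedata.category(ch)
        return {'Lm': 'modifier letter', 'Sk': 'modifier symbol'}.get(gc, 'other')

    print(modifier_kind('\u02B0'))  # U+02B0 MODIFIER LETTER SMALL H -> modifier letter
    print(modifier_kind('\u02C7'))  # U+02C7 CARON -> modifier symbol
    print(modifier_kind('a'))       # -> other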

Clarifications Related to General_Category Assignments

There are several other conventions about how General_Category values are assigned to Unicode characters.

The General_Category of an assigned character serves as a basic classification of the character, based on its primary usage. The General_Category extends the widely used subdivision of ASCII characters into letters, digits, punctuation, and symbols—but needed to be elaborated and subdivided to be appropriate for the much larger and more comprehensive scope of Unicode.

Many characters have multiple uses, however, and not all such uses can be captured by a single, simple partition property such as General_Category. For example, many letters serve dual functions as numerals in traditional numeral systems. Examples can be found in the Roman numeral system, in Greek usage of letters as numbers, in Hebrew, and so on for many scripts. In such cases the General_Category is assigned based on the primary letter usage of the character, despite the fact that it may also have numeric values, occur in numeric expressions, or be used symbolically in mathematical expressions, and so on.

The General_Category gc=Nl is reserved primarily for letterlike number forms which are not technically digits. For example, the compatibility Roman numeral characters, U+2160..U+217F, all have gc=Nl. Because of the compatibility status of these characters, the recommended way to represent Roman numerals is with regular Latin letters (gc=Ll or gc=Lu). These letters derive their numeric status from conventional usage to express Roman numerals, rather than from their General_Category value.

Currency symbols (gc=Sc), by contrast, are given their General_Category value based entirely on their function as symbols for currency, even though they are often derived from letters and may appear similar to other diacritic-marked letters that get assigned one of the letter-related General_Category values.

Pairs of opening and closing punctuation are given their General_Category values (gc=Ps for opening and gc=Pe for closing) based on the most typical usage and orientation of such pairs. Occasional usage of such punctuation marks unpaired or in opposite orientation certainly occurs, however, and is in no way prevented by their General_Category values.

Similarly, characters whose General_Category identifies them primarily as a symbol or as a math symbol may function in other contexts as punctuation or even paired punctuation. The most obvious such case is for U+003C "<" LESS-THAN SIGN and U+003E ">" GREATER-THAN SIGN. These are given the General_Category gc=Sm because their primary identity is as mathematical relational signs. However, as is obvious from HTML and XML, they also serve ubiquitously as paired bracket punctuation characters in many formal syntaxes.

Clarification of Hangul Jamo Handling

The normalization of Hangul conjoining jamos and of Hangul syllables depends on algorithmic mapping, as specified in Section 3.12, Conjoining Jamo Behavior, of the Unicode Standard. That algorithm specifies the full decomposition of all precomposed Hangul syllables, but effectively it is equivalent to the recursive application of pairwise decomposition mappings, as for all other Unicode characters. Formally, the Decomposition_Mapping (dm) property value for a Hangul syllable is the pairwise decomposition and not the full decomposition. Each character with the Hangul_Syllable_Type value LVT has a decomposition mapping consisting of a character with an LV value and a character with a T value. Thus for U+CE31 the decomposition mapping is <U+CE20, U+11B8>, and not <U+110E, U+1173, U+11B8>.
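The pairwise decomposition can be computed arithmetically from the constants of Section 3.12. The following sketch reproduces the example for U+CE31; the function name is illustrative.

    # Pairwise (one-step) decomposition of a precomposed Hangul syllable,
    # using the constants from Section 3.12, Conjoining Jamo Behavior.
    S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
    V_COUNT, T_COUNT = 21, 28
    N_COUNT = V_COUNT * T_COUNT  # 588

    def hangul_pairwise_decomposition(cp: int):
        s_index = cp - S_BASE
        t_index = s_index % T_COUNT
        if t_index == 0:
            # LV syllable: decomposes into <L, V>
            l = L_BASE + s_index // N_COUNT
            v = V_BASE + (s_index % N_COUNT) // T_COUNT
            return (l, v)
        # LVT syllable: decomposes into <LV, T>
        return (S_BASE + (s_index - t_index), T_BASE + t_index)

    assert hangul_pairwise_decomposition(0xCE31) == (0xCE20, 0x11B8)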

Tailored Casing Operations

The Unicode Standard provides default casing operations. There are circumstances in which the default operations need to be tailored for specific locales or environments. Data for some of these tailorings is included in the standard, in the SpecialCasing.txt file, notably for the Turkish dotted capital I and dotless small i. In other cases, more specialized tailored casing operations may be appropriate. These include:

  • Titlecasing of IJ at the start of words in Dutch
  • Removal of accents when uppercasing letters in Greek
  • Uppercasing U+00DF ( ß ) LATIN SMALL LETTER SHARP S to the new U+1E9E LATIN CAPITAL LETTER SHARP S

However, these tailorings may or may not be desired, depending on the implementation in question.

In particular, capital sharp s is intended for typographical representations of signage and uppercase titles, and other environments where users require the sharp s to be preserved in uppercase. Overall, such usage is rare. In contrast, standard German orthography uses the string "SS" as uppercase mapping for small sharp s. Thus, with the default Unicode casing operations, capital sharp s will lowercase to small sharp s, but not the reverse: small sharp s uppercases to "SS". In those instances where the reverse casing operation is needed, a tailored operation would be required.
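The contrast between the default mapping and such a tailoring can be sketched as follows. Python's str.upper() applies the default full case mapping; the tailored function shown here is a hypothetical illustration, not part of the standard.

    # Default full uppercasing maps U+00DF to "SS"; a tailored operation might
    # instead preserve sharp s by mapping it to U+1E9E LATIN CAPITAL LETTER SHARP S.

    assert 'straße'.upper() == 'STRASSE'          # default Unicode casing

    def uppercase_preserving_sharp_s(text: str) -> str:
        """Hypothetical tailored uppercasing for signage-style German text."""
        return text.replace('\u00DF', '\u1E9E').upper()

    assert uppercase_preserving_sharp_s('straße') == 'STRA' + '\u1E9E' + 'E'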

Clarification of Lowercase and Uppercase

Implementers occasionally find the several ways in which the Unicode Standard uses the concepts of lowercase and uppercase to be somewhat confusing. To address this, the following clarifying text is added to The Unicode Standard, Version 5.0, in Section 4.2, Case-Normative, at the bottom of p. 132:

Additional Text

For various reasons, the Unicode Standard has more than one formal definition of lowercase and uppercase. (The additional complications of titlecase are not discussed here.)

The first set of definitions is based on the General_Category property in UnicodeData.txt. The relevant values are General_Category=Ll (Lowercase_Letter) and General_Category=Lu (Uppercase_Letter). For most ordinary letters of bicameral scripts such as Latin, Greek, and Cyrillic, these values are obvious and non-problematical. However, the General_Category property is, by design, a partition of Unicode codespace. This means that each Unicode character can only have one General_Category value, and that situation results in some odd edge cases for modifier letters, letterlike symbols and letterlike numbers. So not every Unicode character that looks like a lowercase character necessarily ends up with General_Category=Ll, and the same is true for uppercase characters.

The second set of definitions consists of the derived binary properties, Lowercase and Uppercase. Those derived properties augment the General_Category values by adding the additional characters that ordinary users think of as being lowercase or uppercase, based primarily on their letterforms. The additional characters are included in the derivations by means of the Other_Lowercase and Other_Uppercase properties defined in PropList.txt. For example, Other_Lowercase adds the various modifier letters that are letterlike in shape, the circled lowercase letter symbols, and the compatibility lowercase Roman numerals. Other_Uppercase adds the circled uppercase letter symbols and the compatibility uppercase Roman numerals.

The third set of definitions is fundamentally different in kind; these are not character properties at all. The functions isLowercase and isUppercase are string functions returning a binary True/False value. These functions are defined in Section 3.13, Default Case Algorithms, and depend on case mapping relations, rather than being based on letterforms per se. Basically, isLowercase is True for a string if the result of applying the toLowercase mapping operation to the string is the same as the string itself.

The following table illustrates the various possibilities for how these definitions interact, as applied to exemplary single characters or single character strings.

 

Code  Character  gc  Lowercase  Uppercase  isLowerCase(S)  isUpperCase(S)
0068  h          Ll  True       False      True            False
0048  H          Lu  False      True       False           True
24D7  ⓗ          So  True       False      True            False
24BD  Ⓗ          So  False      True       False           True
02B0  ʰ          Lm  True       False      True            True
1D34  ᴴ          Lm  True       False      True            True
02BD  ʽ          Lm  False      False      True            True

Note that for "caseless" characters, such as U+02B0, U+1D34, and U+02BD, isLowerCase and isUpperCase are both True, because the inclusion of a caseless letter in a string is not criterial for determining the casing of the string—a caseless letter always case maps to itself.

On the other hand, all modifier letters derived from letter shapes are also notionally lowercase, whether the letterform itself is a minuscule or a majuscule in shape. Thus U+1D34 MODIFIER LETTER CAPITAL H is actually Lowercase=True. Other modifier letters not derived from letter shapes are neither Lowercase nor Uppercase.

The string functions isLowerCase and isUpperCase also apply to strings longer than one character, of course, in which case the character properties General_Category, Lowercase, and Uppercase are not relevant. In the following table, the string function isTitleCase is also illustrated, to show its applicability to the same strings.

 

Codes String isLowerCase(S) isUpperCase(S) isTitleCase(S)
0068 0068 hh True False False
0048 0048 HH False True False
0048 0068 Hh False False True
0068 0048 hH False False False

Programmers concerned with manipulating Unicode strings should generally be dealing with the string functions such as isLowerCase (and its functional cousin, toLowerCase), unless they are working directly with single character properties. Care is always advised, however, when dealing with case in the Unicode Standard, as expectations based simply on the behavior of A..Z, a..z do not generalize easily across the entire repertoire of Unicode characters, and because case for modifier letters, in particular, can result in unexpected behavior.
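As a rough illustration, these string functions can be approximated from the full case mappings built into Python; note that the built-in title() method uses its own word-boundary heuristics, so it is only an approximation of the standard's toTitlecase.

    # Approximations of the string functions of Section 3.13, using Python's
    # built-in full case mappings.

    def is_lowercase(s: str) -> bool:
        return s.lower() == s      # toLowercase(S) == S

    def is_uppercase(s: str) -> bool:
        return s.upper() == s      # toUppercase(S) == S

    def is_titlecase(s: str) -> bool:
        return s.title() == s      # toTitlecase(S) == S (approximate)

    for s in ('hh', 'HH', 'Hh', 'hH', '\u02B0'):
        print(s, is_lowercase(s), is_uppercase(s), is_titlecase(s))
    # hh  True  False False
    # HH  False True  False
    # Hh  False False True
    # hH  False False False
    # ʰ   True  True  True   (caseless: maps to itself under every operation)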

Canonical Equivalence Issues for Greek Punctuation

Replace the last two sentences of the paragraph on "Other Basic Latin Punctuation Marks," on p. 214 of The Unicode Standard, Version 5.0, with the following expanded text as new paragraphs:

Replacement Text

Canonical Equivalence Issues for Greek Punctuation. Some commonly used Greek punctuation marks are encoded in the Greek and Coptic block but are canonically equivalent to generic punctuation marks encoded in the C0 Controls and Basic Latin block, because they are indistinguishable in shape. Thus, U+037E ";" GREEK QUESTION MARK is canonically equivalent to U+003B ";" SEMICOLON, and U+0387 "·" GREEK ANO TELEIA is canonically equivalent to U+00B7 "·" MIDDLE DOT. In these cases, as for other canonical singletons, the preferred form is the character that the singletons are mapped to, namely U+003B and U+00B7 respectively. Those are the characters that will appear in any normalized form of Unicode text, even when used in Greek text as Greek punctuation. Text segmentation algorithms need to be aware of this issue, as the kinds of text units delimited by a semicolon or a middle dot in Greek text will typically differ from those in Latin text.

The character properties for U+00B7 MIDDLE DOT are particularly problematical, in part because of identifier issues for that character. There is no guarantee that all of its properties will align exactly with those of U+0387 GREEK ANO TELEIA, because the properties of the latter were established based on the more limited function of the middle dot in Greek as a delimiting punctuation mark.
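The singleton mappings can be verified with any normalization implementation; this short check, which is not part of the replacement text above, uses Python's unicodedata module:

    import unicodedata

    # GREEK QUESTION MARK and GREEK ANO TELEIA are canonical singletons:
    # every normalization form replaces them with U+003B and U+00B7.
    assert unicodedata.normalize('NFC', '\u037E') == '\u003B'
    assert unicodedata.normalize('NFC', '\u0387') == '\u00B7'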

Coptic Font Style

Update the second paragraph on p. 244 of The Unicode Standard, Version 5.0, to read as follows:

Replacement Text

Font Styles. Bohairic Coptic uses only a subset of the letters in the Coptic repertoire. It also uses a font style distinct from that for Sahidic. Prior to Version 5.0, the Coptic letters derived from Demotic, encoded in the range U+03E2..U+03EF in the Greek and Coptic block, were shown in the code charts in a Bohairic font style. Starting from Version 5.0, all Coptic letters in the standard, including those in the range U+03E2..U+03EF, are shown in the code charts in a Sahidic font style, instead.

Rendering Default Ignorable Code Points

Update the last paragraph on p. 192 of The Unicode Standard, Version 5.0, in Section 5.20, Default Ignorable Code Points, to read as follows:

Replacement Text

An implementation should ignore all default ignorable code points in rendering whenever it does not support those code points, whether they are assigned or not.

In previous versions of the Unicode Standard, surrogate code points, private use code points, and some control characters were also default ignorable code points. However, to avoid security problems, such characters always should be displayed with a missing glyph, so that there is a visible indication of their presence in the text. In Unicode 5.1 these code points are no longer default ignorable code points. For more information, see UTR #36, "Unicode Security Considerations."

Stateful Format Controls

Update the subsection on "Stateful Format Controls" in Section 5.20, Default Ignorable Code Points, on p. 194 of The Unicode Standard, Version 5.0, to read as follows:

Replacement Text

Stateful Format Controls. There are a small number of paired stateful controls. These characters are used in pairs, with an initiating character (or sequence) and a terminating character. Even when these characters are ignored, complications can arise due to their paired nature. Whenever text is cut, copied, pasted, or deleted, these characters can become unpaired. To avoid this problem, ideally both the copied text and its context (the site of a deletion, or the target of an insertion) would be modified so as to maintain all pairings that were in effect for each piece of text. This process can be quite complicated, however, and is not often done—or is done incorrectly if attempted.

The paired stateful controls recommended for use are listed in Table 5-6.

Table 5-6. Paired Stateful Controls

Characters Documentation
Bidi Overrides and Embeddings Section 16.2, Layout Controls; UAX #9
Annotation Characters Section 16.8, Specials
Musical Beams and Slurs Section 15.11, Western Musical Symbols

The bidirectional overrides and embeddings and the annotation characters are reasonably robust, because their behavior terminates at paragraph boundaries. Paired format controls for representation of beams and slurs in music are recommended only for specialized musical layout software, and also have limited scope.

Other paired stateful controls in the standard are deprecated, and their use should be avoided. They are listed in Table 5-7.

Table 5-7. Paired Stateful Controls (Deprecated)

Characters Documentation
Deprecated Format Characters Section 16.3, Deprecated Format Characters
Tag Characters Section 16.9, Tag Characters

The tag characters, originally intended for the representation of language tags, are particularly fragile under editorial operations that move spans of text around. See Section 5.10, Language Information in Plain Text, for more information about language tagging.

Clarification About Handling Noncharacters

The third paragraph of Section 16.7, Noncharacters, on p. 549 of The Unicode Standard, Version 5.0, is updated to read:

Replacement Text

Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them. If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as replacing it with U+FFFD REPLACEMENT CHARACTER, to indicate the problem in the text. It is not recommended to simply delete noncharacter code points from such text, because of the potential security issues caused by deleting uninterpreted characters. (See conformance clause C7 in Section 3.2, Conformance Requirements, and Unicode Technical Report #36, "Unicode Security Considerations.")
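The recommended handling can be sketched as a filter that replaces, rather than deletes, noncharacter code points in text received from an external source. This sketch is not part of the replacement text above, and the function names are illustrative.

    def is_noncharacter(cp: int) -> bool:
        """True for the 66 noncharacter code points: U+FDD0..U+FDEF and the
        last two code points of each of the 17 planes."""
        return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

    def sanitize_external_text(text: str) -> str:
        """Replace noncharacters with U+FFFD instead of deleting them."""
        return ''.join('\uFFFD' if is_noncharacter(ord(ch)) else ch for ch in text)

    assert sanitize_external_text('a\uFDD0b\uFFFEc') == 'a\uFFFDb\uFFFDc'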

Tag Characters

Update the first paragraph of Section 16.9, Tag Characters, on pp. 554-555 of The Unicode Standard, Version 5.0, to read as follows:

Replacement Text

The characters in this block provide a mechanism for language tagging in Unicode plain text. These characters are deprecated, and should not be used—particularly with any protocols that provide alternate means of language tagging. The Unicode Standard recommends the use of higher-level protocols, such as HTML or XML, which provide for language tagging via markup. See Unicode Technical Report #20, "Unicode in XML and Other Markup Languages." The requirement for language information embedded in plain text data is often overstated, and markup or other rich text mechanisms constitute best current practice. See Section 5.10, Language Information in Plain Text for further discussion.

Ideographic Variation Database

In Section 12.1, Han, on p. 418 of The Unicode Standard, Version 5.0, replace the last sentence of the last paragraph of the subsection "Principles of Han Unification" as follows:

Replacement Text

Z-axis typeface and stylistic differences are generally ignored for the purpose of encoding Han ideographs, but can be represented in text by the use of variation sequences; see Section 16.4, Variation Selectors.

In Section 16.4, Variation Selectors, on p. 545 of The Unicode Standard, Version 5.0, replace the first two paragraphs of the "Variation Sequence" subsection by the following:

Replacement Text

Variation Sequence. A variation sequence always consists of a base character followed by a variation selector character. That sequence is referred to as a variant of the base character. The variation selector affects only the appearance of the base character. The variation selector is not used as a general code extension mechanism; only certain sequences are defined, as follows:

Standardized variation sequences are defined in the file StandardizedVariants.txt in the Unicode Character Database. Ideographic variation sequences are defined by the registration process defined in Unicode Technical Standard #37, "Ideographic Variation Database," and are listed in the Ideographic Variation Database. Only those two types of variation sequences are sanctioned for use by conformant implementations. In all other cases, use of a variation selector character does not change the visual appearance of the preceding base character from what it would have had in the absence of the variation selector.
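In plain text, a variation sequence is simply a base character immediately followed by a variation selector. The sketch below, which is not part of the replacement text above, checks only that structure for the generic variation selectors VS1..VS16 (U+FE00..U+FE0F) and VS17..VS256 (U+E0100..U+E01EF); it ignores the Mongolian free variation selectors, and it does not verify that a given sequence is actually defined in StandardizedVariants.txt or registered in the Ideographic Variation Database.

    def is_variation_selector(cp: int) -> bool:
        """VS1..VS16 (U+FE00..U+FE0F) and VS17..VS256 (U+E0100..U+E01EF)."""
        return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

    def looks_like_variation_sequence(s: str) -> bool:
        """Structural check only: <base character, variation selector>.
        Does not verify that the sequence is defined or registered."""
        return (len(s) == 2 and not is_variation_selector(ord(s[0]))
                and is_variation_selector(ord(s[1])))

    print(looks_like_variation_sequence('\u4E8C\U000E0100'))  # True (structure only)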

H. Unicode Character Database

For more detailed information about the changes in the Unicode Character Database, see the file UCD.html in the Unicode Character Database.

Note that as of Version 5.1.0 an XML version of the complete Unicode Character Database is available. For details see UAX #42, An XML Representation of the UCD.

The Unihan.txt data file is now available only as a zip file; there is no longer a link to the uncompressed text file.

I. Character Assignment Overview

1,624 new character assignments were made in the Unicode Standard, Version 5.1.0 (over and above what was in Unicode 5.0.0). These additions include new characters for mathematics, punctuation, and symbols. There are also eleven newly encoded scripts: eight minority scripts (Cham, Kayah Li, Lepcha, Ol Chiki, Rejang, Saurashtra, Sundanese, and Vai) and three historical scripts (Carian, Lycian, and Lydian).

The new character additions were to both the BMP and the SMP (Plane 1). The following table shows the allocation of code points in Unicode 5.1.0. For more information on the specific characters, see the file DerivedAge.txt in the Unicode Character Database.

Type of Code Point   Count
Graphic              100,507
Format                   141
Control                   65
Private Use          137,468
Surrogate              2,048
Noncharacter              66
Reserved             873,817
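As a quick consistency check, which is not part of the original text, these seven counts partition the entire Unicode codespace of 17 × 65,536 = 1,114,112 code points:

    counts = {
        'Graphic': 100_507, 'Format': 141, 'Control': 65,
        'Private Use': 137_468, 'Surrogate': 2_048,
        'Noncharacter': 66, 'Reserved': 873_817,
    }
    assert sum(counts.values()) == 0x110000 == 1_114_112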

The character repertoire corresponds to ISO/IEC 10646:2003 plus Amendments 1 through 4. For more details of character counts, see Appendix D, Changes from Previous Versions in Unicode 5.0.

J. Script Additions

Kayah Li: U+A900—U+A92F

The Kayah Li script was devised in 1962 to write the Eastern and Western Kayah Li languages, spoken in Northern Myanmar and Northern Thailand. An orthography for these languages using the Myanmar script also exists.

Kayah Li letterforms are historically related to some other Brahmi-derived scripts, but the Kayah Li script itself is a simple, true alphabet. Some of the vowels are written with spacing letters, while others are written with combining marks applied above the letter a, which serves as a vowel carrier.

The Kayah Li script has its own set of digits. It makes use of common punctuation from Latin typography, but has a few distinct signs of its own. Spaces are used to separate words.

Lepcha: U+1C00—U+1C4F

Lepcha is a Sino-Tibetan language. Its script, a Brahmic script derived directly from Tibetan, was likely devised around 1720 CE by the king of Sikkim. The script is used by many people in Sikkim and West Bengal, especially in the Darjeeling district. It is a complex script that uses various combining marks.

Lepcha digits have distinctive forms. Lepcha has traditional punctuation signs, but everyday writing now uses punctuation such as comma, full stop, and question mark, though sometimes Tibetan tshegs are found.

Opportunities for hyphenation occur after any full orthographic syllable. Lepcha punctuation marks can be expected to have behavior similar to that of Devanagari danda and double danda.

Rejang: U+A930—U+A95F

The Rejang script dates from at least the mid-18th century. Rejang is spoken by about 200,000 people living in Indonesia on the island of Sumatra. There are five major dialects of Rejang.

Rejang is a complex, Brahmic script that uses combining marks. It uses European digits and common punctuation, as well as one script-specific section mark. Traditional texts tend not to use spacing and there are no known examples using hyphenation. Modern use of the script may use spaces between words.

Sundanese: U+1B80—U+1BBF

The Sundanese script is used to write the Sundanese language, one of the languages of the island of Java in Indonesia. It is a complex Brahmic script and uses combining marks. Spaces are used between words. Sundanese has script-specific digits, but uses common punctuation. Hyphenation may occur after any full orthographic syllable.

Saurashtra: U+A880—U+A8DF

Saurashtra is an Indo-European language, related to Gujarati and spoken in southern India, mainly in the area around the cities of Madurai, Salem, and Thanjavur. Saurashtra is most often written in the Tamil script, augmented with the use of superscript digits and a colon to indicate sounds not available in the Tamil script. Saurashtra is a complex, Brahmic script that uses combining marks and has script-specific digits. It mainly uses common punctuation, but several script-specific punctuation marks may be used.

Cham: U+AA00—U+AA5F

Cham is an Austronesian language used in Vietnam and Cambodia. There are two main groups, the Eastern Cham and the Western Cham; the script is used more by the Eastern Cham. It is a complex, Brahmic script and uses combining marks.

Cham has script-specific digits, although European digits are also used. It also has some script-specific punctuation, although, again, Western punctuation is also used. Opportunities for linebreak occur after any full orthographic syllable. Spaces are used between words.

Ol Chiki: U+1C50—U+1C7F

The Ol Chiki script was invented in the first half of the 20th century to write Santali, a Munda language spoken mainly in India. There are a few speakers in Nepal and Bangladesh. Ol Chiki is a simple, alphabetic script, consisting of letters representing consonants and vowels. It is written from left to right.

Ol Chiki has script-specific digits. It mainly uses common punctuation, but has some script-specific punctuation marks. It does not use full stop. Spaces are used between words.

Vai: U+A500—U+A61F (Vai block: A500—A63F)

The Vai script was probably invented in the 1830s and was standardized for modern usage in 1962 at a conference at the University of Liberia. Used in Liberia, Vai is a simple syllabic script (a syllabary), written from left to right.

Vai has some script-specific punctuation and also uses common punctuation. It does not use script-specific digits. Linebreaking within words can occur after any character. Special consideration is necessary for U+A608 VAI SYLLABLE LENGTHENER, which should not begin a line.

Carian: U+102A0—U+102D0

The Carian script is an alphabet used to write the Carian language, an ancient Indo-European language of southwestern Anatolia. The script dates from the first millennium BCE and scriptio continua is common. It does not have script-specific digits and has no script-specific punctuation. The script is written both left to right and right to left; it is encoded as a left-to-right script.

Lycian: U+10280—U+1029C

Lycian is used to write an ancient Indo-European language of Western Anatolia. The script is related to Greek and was used from about 500 BCE until 200 BCE. It is a simple alphabetic script and is written from left to right. It uses word dividers. Spaces may be used in modern editions of the text. The script does not include any script-specific digits or punctuation.

Lydian: U+10920—U+1093F

Like Lycian, Lydian is an ancient Indo-European language that was used in Western Anatolia. The script is related to Greek and is a simple alphabetic script, whose use is documented from the late-eighth century BCE to the third century BCE. Most Lydian texts have right-to-left directionality and use spaces. There is one script-specific punctuation mark, the Lydian quotation mark. The script does not have distinct digits.

K. Significant Character Additions

In addition to new scripts, many characters were added to existing scripts or to the repertoire of symbols in the standard. The following sections briefly describe these additions.

For Myanmar and Malayalam, in particular, the additional characters have significant impact on the representation of text, so those sections are more detailed.

For convenience in reference, the discussion of significant character additions is organized roughly by chapters in the standard.

Chapter 6: Writing Systems and Punctuation

Editorial Punctuation Marks

Editorial punctuation marks added in Unicode 5.1 include a set of medievalist editorial punctuation marks, as well as corner brackets used in critical editions of ancient and medieval texts. U-brackets and double parentheses employed by Latinists have also been added, as well as a few punctuation marks that appear in dictionaries.

The medievalist editorial marks are used by a variety of European traditions and will be very useful for those working on medieval manuscripts. Corner brackets are likewise used widely, appearing in transliterated Cuneiform and ancient Egyptian texts, for example.

Chapter 7: European Alphabetic Scripts

Latin, Greek, and Cyrillic

A number of characters were added to these scripts, including characters for German, Mayan, Old Church Slavonic, Mordvin, Kurdish, Aleut, Chuvash, medievalist Latin, and Finnish dictionary use.

The Latin additions include U+1E9E LATIN CAPITAL LETTER SHARP S for use in German. The recommended uppercase form for most casing operations on U+00DF LATIN SMALL LETTER SHARP S continues to be "SS", as a capital sharp s is only used in restricted circumstances. See Tailored Casing Operations.
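The following is a small illustration, not part of the standard's text, of the default (untailored) casing behavior, shown here with Python's built-in case mappings:

    # Default full case mapping: small sharp s uppercases to "SS".
    assert "stra\u00DFe".upper() == "STRASSE"

    # The new capital sharp s is available for the restricted contexts that need it;
    # its lowercase mapping is the ordinary small sharp s.
    CAPITAL_SHARP_S = "\u1E9E"   # LATIN CAPITAL LETTER SHARP S
    assert CAPITAL_SHARP_S.lower() == "\u00DF"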

Chapter 8: Middle Eastern Scripts

Arabic

A number of Arabic characters were added in Version 5.1 in support of minority languages, four Qur'anic Arabic characters were added, and the Arabic math repertoire was greatly extended. Sixteen characters were added in support of the Khowar, Torwali, and Burushaski languages spoken primarily in Pakistan, and a set of eight Arabic characters was added in support of Persian and Azerbaijani. The 27 newly added Arabic math characters include arrows, mathematical operators, and letterlike symbols.

Chapter 9: South Asian Scripts-I

Indic

A number of useful characters were added to Indic scripts. The Devanagari candra-a, Gurmukhi udaat and yakash, and additional Oriya, Tamil, and Telugu characters were added. These new characters expand the support of Sanskrit in those scripts, further the support of minority languages, and encode old fraction and number systems.

Tamil Named Character Sequences

Tamil is less complex than some of the other Indic scripts, and both conceptually and in processing can be treated as an atomic set of elements: consonants, stand-alone vowels, and syllables. The following chart shows these atomic elements, with the corresponding Unicode code points. These elements have also been accepted for a future version of the Unicode Standard as Tamil named character sequences: see the NamedSequencesProv file in the Unicode Character Database.

In implementations such as natural language processing, where it may be useful to treat these units as single code points for ease of processing, they can be mapped to a segment of the Private Use Area.
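As a rough illustration of such a mapping, the sketch below folds a few multi-code-point elements into arbitrary Private Use Area code points for internal processing and unfolds them again before interchange. The PUA base, the element list, and the function names are illustrative choices, not part of the standard:

    # Illustrative fragment of a mapping table; a full table would cover all
    # of the vowels, consonants, and syllables in the chart that follows.
    PUA_BASE = 0xE000
    ATOMIC_ELEMENTS = [
        "\u0B95",        # KA
        "\u0B95\u0BBE",  # KA + VOWEL SIGN AA (syllable kaa)
        "\u0B95\u0BBF",  # KA + VOWEL SIGN I  (syllable ki)
    ]
    TO_PUA = {elem: chr(PUA_BASE + i) for i, elem in enumerate(ATOMIC_ELEMENTS)}
    FROM_PUA = {pua: elem for elem, pua in TO_PUA.items()}

    def to_internal(text: str) -> str:
        """Fold multi-code-point elements into single PUA code points."""
        for elem in sorted(TO_PUA, key=len, reverse=True):  # longest match first
            text = text.replace(elem, TO_PUA[elem])
        return text

    def from_internal(text: str) -> str:
        """Restore the standard Unicode representation before interchange."""
        for pua, elem in FROM_PUA.items():
            text = text.replace(pua, elem)
        return text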

In the following "Tamil Vowels, Consonants, and Syllables" table, row 1 shows the ASCII representation of the vowel names. Column 1 shows the ASCII representation of the consonant names.

Tamil Vowels, Consonants, and Syllables

[Chart showing the Tamil vowels, consonants, and syllables with their corresponding Unicode code point sequences]

Malayalam Chillu Characters

The most important new characters for Malayalam are the six new chillu characters, U+0D7A..U+0D7F, encoding dead consonants (those without an implicit vowel). To simplify the discussion here, the formal names of the characters are shortened to use the terms that are typically used in spoken discussion of the chillu characters: chillu-n for MALAYALAM LETTER CHILLU N, and so forth.

In Malayalam-language text, chillu characters never start a word. The chillu letters -nn, -n, -rr, -l, and -ll are quite common; chillu-k is not very common.

Prior to Unicode 5.1, the representation of text with chillus was problematic and not clearly described in the text of the standard. Because older data will use a different representation for chillus, implementations must be prepared to handle both kinds of data.

Table 1 shows the relation between the representation in Unicode Version 5.0 and earlier and the new representation in Version 5.1, for the chillu letters considered in isolation.

Table 1. Atomic Encoding of Chillus


 
     Representation in 5.0 and Prior          Preferred 5.1 Representation
 1   NNA, VIRAMA, ZWJ  (0D23, 0D4D, 200D)     0D7A MALAYALAM LETTER CHILLU NN
 2   NA, VIRAMA, ZWJ   (0D28, 0D4D, 200D)     0D7B MALAYALAM LETTER CHILLU N
 3   RA, VIRAMA, ZWJ   (0D30, 0D4D, 200D)     0D7C MALAYALAM LETTER CHILLU RR
 4   LA, VIRAMA, ZWJ   (0D32, 0D4D, 200D)     0D7D MALAYALAM LETTER CHILLU L
 5   LLA, VIRAMA, ZWJ  (0D33, 0D4D, 200D)     0D7E MALAYALAM LETTER CHILLU LL
 6   (undefined)                              0D7F MALAYALAM LETTER CHILLU K
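Implementations that choose to normalize older data to the new representation can apply the mapping in Table 1 directly. The sketch below is a minimal illustration of that approach; the function name and the decision to normalize in this direction are assumptions, not requirements of the standard:

    # Mapping from the pre-5.1 <consonant, VIRAMA, ZWJ> sequences to the
    # atomic chillu characters, following Table 1 above.
    OLD_TO_ATOMIC = {
        "\u0D23\u0D4D\u200D": "\u0D7A",  # NNA, VIRAMA, ZWJ -> CHILLU NN
        "\u0D28\u0D4D\u200D": "\u0D7B",  # NA,  VIRAMA, ZWJ -> CHILLU N
        "\u0D30\u0D4D\u200D": "\u0D7C",  # RA,  VIRAMA, ZWJ -> CHILLU RR
        "\u0D32\u0D4D\u200D": "\u0D7D",  # LA,  VIRAMA, ZWJ -> CHILLU L
        "\u0D33\u0D4D\u200D": "\u0D7E",  # LLA, VIRAMA, ZWJ -> CHILLU LL
    }

    def normalize_chillus(text: str) -> str:
        """Replace older ZWJ-based chillu sequences with the atomic chillus."""
        for old, new in OLD_TO_ATOMIC.items():
            text = text.replace(old, new)
        return text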

The letter rra is normally read /r/. Repetition of that sound is written by two occurrences of the letter, side by side. Each occurrence can bear a vowel sign.

Repetition of the letter, written either stacked or side by side, is also used for the sound /tt/. In this case, the two rra fundamentally behave as a digraph. The digraph can bear a vowel sign, in which case the digraph as a whole acts graphically as an atom: a left vowel part goes to the left of the digraph and a right vowel part goes to the right of the digraph. Historically, the side-by-side form was used until around 1960, when the stacked form began appearing and supplanted the side-by-side form.

The use of the side-by-side form in text is ambiguous. The reader must in general use the context to determine whether it is read /rr/ or /tt/. Only when a vowel part appears between the two rra is the reading unambiguously /rr/.

Note: The same situation is common in many other orthographies. For example, th in English can be a digraph (cathode) or two separate letters (cathouse); gn in French can be a digraph (oignon) or two separate letters (gnome).

The sequence <0D31, 0D31> represents the side-by-side form, regardless of the reading of that text. The sequence <0D31, 0D4D, 0D31> represents the stacked form. In both cases, vowel signs can be used as appropriate:

Table 2. /rr/ and /tt/

 1   0D2A 0D3E 0D31 0D31                                /paatta/       cockroach
 2   0D2A 0D3E 0D31 0D4D 0D31
 3   0D2E 0D3E 0D31 0D46 0D31 0D3E 0D32 0D3F            /maattoli/     echo
 4   0D2E 0D3E 0D31 0D4D 0D31 0D46 0D3E 0D32 0D3F
 5   0D2C 0D3E 0D31 0D31 0D31 0D3F                      /baattari/     battery
 6   0D2C 0D3E 0D31 0D4D 0D31 0D31 0D3F
 7   0D38 0D42 0D31 0D31 0D31 0D4D                      /suuratt/      (name of a place)
 8   0D38 0D42 0D31 0D31 0D4D 0D31 0D4D
 9   0D1F 0D46 0D02 0D2A 0D31 0D31 0D3F                 /temparari/    temporary (English loan word)
 10  0D32 0D46 0D15 0D4D 0D1A 0D31 0D31 0D4B 0D1F 0D4D  /lekcararoot/  to the lecturer

A very similar situation exists for the combination of chillu-n and the letter rra. When written side by side, the combination can be read either /nr/ or /nt/, while the stacked form is always read /nt/.

The sequence <0D7B, 0D31> represents the side-by-side form, regardless of the reading of that text. The sequence <0D7B, 0D4D, 0D31> represents the stacked form. In both cases, vowel signs can be used as appropriate:

Table 3. /nr/ and /nt/

 1   0D06 0D7B 0D47 0D31 0D3E        /aantoo/   (proper name)
 2   0D06 0D7B 0D4D 0D31 0D47 0D3E
 3   0D0E 0D7B 0D31 0D47 0D3E 0D7A   /enrool/   enroll (English word)

The Unicode Technical Committee is aware of the existence of a repha form of the letter ra, which looks like a dot. The representation of that form is currently under investigation.

Other New Malayalam Characters

The four new characters, avagraha, vocalic rr sign, vocalic l sign, and vocalic ll sign, are used only to write Sanskrit words in the Malayalam script. The avagraha is the most common of the four, followed by the vocalic l sign. There are six new characters for the archaic number system, including characters for the numbers 10, 100, and 1000 and for fractions. There is also a new character, the date mark, used only for the day of the month in dates; it is roughly the equivalent of "th" in "Jan 5th." Although it was used in modern times, it is not seen much in contemporary use.

Chapter 11: Southeast Asian Scripts

The following updated text replaces the Myanmar block introduction on pp. 379-381 of The Unicode Standard, Version 5.0.

Myanmar: U+1000–U+109F

The Myanmar script is used to write Burmese, the majority language of Myanmar (formerly called Burma). Variations and extensions of the script are used to write other languages of the region, such as Mon, Karen, Kayah, Shan, and Palaung, as well as Pali and Sanskrit. The Myanmar script was formerly known as the Burmese script, but the term “Myanmar” is now preferred.

The Myanmar writing system derives from a Brahmi-related script borrowed from South India in about the eighth century to write the Mon language. The first inscription in the Myanmar script dates from the eleventh century and uses an alphabet almost identical to that of the Mon inscriptions. Aside from rounding of the originally square characters, this script has remained largely unchanged to the present. It is said that the rounder forms were developed to permit writing on palm leaves without tearing the writing surface of the leaf.

The Myanmar script shares structural features with other Brahmi-based scripts such as Khmer: consonant symbols include an inherent “a” vowel; various signs are attached to a consonant to indicate a different vowel; medial consonants are attached to the consonant; and the overall writing direction is from left to right.

Standards. There is not yet an official national standard for the encoding of Myanmar/Burmese. The current encoding was prepared with the consultation of experts from the Myanmar Information Technology Standardization Committee (MITSC) in Yangon (Rangoon). The MITSC, formed by the government in 1997, consists of experts from the Myanmar Computer Scientists’ Association, Myanmar Language Commission, and Myanmar Historical Commission.

Encoding Principles. As with Indic scripts, the Myanmar encoding represents only the basic underlying characters; multiple glyphs and rendering transformations are required to assemble the final visual form for each syllable. Characters and combinations that may appear visually identical in some fonts, such as U+101D MYANMAR LETTER WA and U+1040 MYANMAR DIGIT ZERO, are distinguished by their underlying encoding.

Composite Characters. As is the case in many other scripts, some Myanmar letters or signs may be analyzed as composites of two or more other characters and are not encoded separately. The following are examples of Myanmar letters represented by combining character sequences:

myanmar vowel sign àw:
U+1000 ka + U+1031 vowel sign e + U+102C vowel sign aa → kàw

myanmar vowel sign aw:
U+1000 ka + U+1031 vowel sign e + U+102C vowel sign aa + U+103A asat → kaw

myanmar vowel sign o:
U+1000 ka + U+102D vowel sign i + U+102F vowel sign u → ko

Encoding Subranges. The basic consonants, medials, independent vowels, and dependent vowel signs required for writing the Myanmar language are encoded at the beginning of the Myanmar range. Extensions of each of these categories for use in writing other languages are appended at the end of the range. In between these two sets lie the script-specific signs, punctuation, and digits.

Conjuncts. As in other Indic-derived scripts, conjunction of two consonant letters is indicated by the insertion of a virama, U+1039 MYANMAR SIGN VIRAMA, between them. It causes the second consonant to be displayed in a smaller form below the first; the virama is not visibly rendered.

Kinzi. The conjunct form of U+1004 MYANMAR LETTER NGA is rendered as a superscript sign called kinzi. That superscript sign is not encoded as a separate mark, but instead is simply the rendering form of the nga in a conjunct context. The nga is represented in logical order first in the sequence, before the consonant which actually bears the visible kinzi superscript sign in final rendered form. For example, kinzi applied to U+1000 MYANMAR LETTER KA would be written via the following sequence:

U+1004 nga + U+103A asat + U+1039 virama + U+1000 ka → ka with kinzi

Note that this sequence includes both U+103A asat and U+1039 virama between the nga and the ka. Use of the virama alone would ordinarily indicate stacking of the consonants, with a small ka appearing under the nga. Use of the asat killer in addition to the virama gives a sequence that can be distinguished from normal stacking: the sequence <U+1004, U+103A, U+1039> always maps unambiguously to a visible kinzi superscript sign on the following consonant.
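The following sketch illustrates one way an implementation might detect the kinzi sequence when analyzing text; the function name and approach are illustrative only, not mandated by the standard:

    # The three-character kinzi prefix <U+1004, U+103A, U+1039>.
    KINZI = "\u1004\u103A\u1039"

    def bears_kinzi(text: str, index: int) -> bool:
        """Return True if the consonant at `index` is preceded by the kinzi sequence."""
        return index >= len(KINZI) and text.startswith(KINZI, index - len(KINZI))

    # Kinzi applied to KA, as in the sequence shown above: nga + asat + virama + ka.
    sample = "\u1004\u103A\u1039\u1000"
    assert bears_kinzi(sample, 3)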

Medial Consonants. The Myanmar script traditionally distinguishes a set of subscript “medial” consonants: forms of ya, ra, wa, and ha that are considered to be modifiers of the syllable’s vowel. Graphically, these medial consonants are sometimes written as subscripts, but sometimes, as in the case of ra, they surround the base consonant instead. In the Myanmar encoding, the medial consonants are encoded separately. For example, the word krwe [kjwei] (“to drop off”) would be written via the following sequence:

U+1000 ka + U+103C medial ra + U+103D medial wa + U+1031 vowel sign e → krwe

In Pali and Sanskrit texts written in the Myanmar script, as well as in older orthographies of Burmese, the consonants ya, ra, wa, and ha are sometimes rendered in subjoined form. In those cases, U+1039 MYANMAR SIGN VIRAMA and the regular form of the consonant are used.

Asat. The asat, or killer, is a visibly displayed sign. In some cases it indicates that the inherent vowel sound of a consonant letter is suppressed. In other cases it combines with other characters to form a vowel letter. Regardless of its function, this visible sign is always represented by the character U+103A MYANMAR SIGN ASAT.

Contractions. In a few Myanmar words, the repetition of a consonant sound is written with a single occurrence of the letter for the consonant sound together with an asat sign. This asat sign occurs immediately after the double-acting consonant in the coded representation:

U+101A ya + U+1031 vowel sign e + U+102C vowel sign aa + U+1000 ka + U+103A asat + U+103B medial ya + U+102C vowel sign aa + U+1038 visarga → man, husband

U+1000 ka + U+103B medial ya + U+103D medial wa + U+1014 na + U+103A asat + U+102F vowel sign u + U+1015 pa + U+103A asat → I (first person singular)

Great sa. The great sa is encoded as U+103F MYANMAR LETTER GREAT SA. This letter should be represented with <U+103F>, while the sequence <U+101E, U+1039, U+101E> should be used for the regular conjunct form of two sa.

Tall aa. The two letters tall aa and aa are both used to write the sound /aa/. In Burmese orthography, both letters are used, depending on the context. In S'gaw Karen orthography, only the tall form is used. For this reason, two characters are encoded: U+102B MYANMAR VOWEL SIGN TALL AA and U+102C MYANMAR VOWEL SIGN AA. In Burmese texts, the coded character appropriate to the context should be used.

Ordering of Syllable Components. Dependent vowels and other signs are encoded after the consonant to which they apply, except for kinzi, which precedes the consonant. Characters occur in the relative order shown in Table 11-3.

Table 11-3. Myanmar Syllabic Structure

Name                             Encoding
kinzi                            <U+1004, U+103A, U+1039>
consonant and vowel letters      [U+1000..U+102A, U+103F, U+104E]
asat sign (for contractions)     U+103A
subscript consonant              <U+1039, [U+1000..U+1019, U+101C, U+101E, U+1020, U+1021]>
medial ya                        U+103B
medial ra                        U+103C
medial wa                        U+103D
medial ha                        U+103E
vowel sign e                     U+1031
vowel sign i, ii, ai             [U+102D, U+102E, U+1032]
vowel sign u, uu                 [U+102F, U+1030]
vowel sign tall aa, aa           [U+102B, U+102C]
anusvara                         U+1036
asat sign                        U+103A
dot below                        U+1037
visarga                          U+1038

U+1031 MYANMAR VOWEL SIGN E is encoded after its consonant (as in the earlier example), although in visual presentation its glyph appears before (to the left of) the consonant form.

Table 11-3 nominally refers to the character sequences used in representing the syllabic structure of the Burmese language proper. It would require further extensions and modifications to cover the various other languages, such as Karen, Mon, and Shan, which also use the Myanmar script.
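As an illustration of how the ordering in Table 11-3 can be used, the sketch below builds a rough syllable matcher for Burmese-language text. It is a simplification under stated assumptions: it covers only the categories listed in the table and would need further work for independent vowels used alone, digits, punctuation, and the other languages that use the script:

    import re

    # Rough Burmese syllable matcher following the ordering in Table 11-3.
    # Each component corresponds to one row of the table; most components
    # are optional within a syllable.
    KINZI        = "(?:\u1004\u103A\u1039)?"
    LETTER       = "[\u1000-\u102A\u103F\u104E]"           # consonant and vowel letters
    ASAT_CONTR   = "\u103A?"                               # asat sign (for contractions)
    SUBSCRIPT    = "(?:\u1039[\u1000-\u1019\u101C\u101E\u1020\u1021])?"
    MEDIALS      = "\u103B?\u103C?\u103D?\u103E?"          # medial ya, ra, wa, ha
    VOWEL_E      = "\u1031?"
    VOWELS_UPPER = "[\u102D\u102E\u1032]?"                 # vowel sign i, ii, ai
    VOWELS_U     = "[\u102F\u1030]?"                       # vowel sign u, uu
    VOWELS_AA    = "[\u102B\u102C]?"                       # vowel sign tall aa, aa
    FINALS       = "\u1036?\u103A?\u1037?\u1038?"          # anusvara, asat, dot below, visarga

    SYLLABLE = re.compile(
        KINZI + LETTER + ASAT_CONTR + SUBSCRIPT + MEDIALS
        + VOWEL_E + VOWELS_UPPER + VOWELS_U + VOWELS_AA + FINALS
    )

    # Example: the sequence for krwe from the medial-consonant discussion above.
    assert SYLLABLE.fullmatch("\u1000\u103C\u103D\u1031")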

Spacing. Myanmar does not use any whitespace between words. If word boundary indications are desired—for example, for the use of automatic line layout algorithms—the character U+200B ZERO WIDTH SPACE should be used to place invisible marks for such breaks. The zero width space can grow to have a visible width when justified. Spaces are used to mark phrases. Some phrases are relatively short (two or three syllables).
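A minimal sketch of this use of U+200B follows; it assumes the text has already been segmented into words by some external means (the segmentation step itself is not shown):

    ZWSP = "\u200B"  # ZERO WIDTH SPACE

    def mark_word_breaks(words: list[str]) -> str:
        """Join pre-segmented Myanmar words with invisible break opportunities."""
        return ZWSP.join(words)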

Chapter 15: Symbols

Mathematical Symbols

As the mathematical community completes its migration to Unicode, the need for additional mathematical symbols has become apparent. Twenty-nine new symbols were added to support the publication of mathematical and technical material and to address the needs of mathematical markup languages such as MathML.

The new math characters include a non-combining diacritic, U+27CB MATHEMATICAL SPACING LONG SOLIDUS OVERLAY, and a new operator, U+2064 INVISIBLE PLUS. The non-combining diacritic can be used to decorate a mathematical variable or even an entire expression. By convention, such a decoration is indicated with a Unicode character, even in markup languages, such as MathML. The invisible plus operator is used to unambiguously represent expressions like 3½.
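For example, the mixed number 3½ can be represented in plain text with the invisible plus between the integer and fraction parts (shown here as a Python string literal for concreteness):

    # 3, INVISIBLE PLUS, VULGAR FRACTION ONE HALF: the intended addition
    # (3 + 1/2) is recorded explicitly, so the text is unambiguous for
    # mathematical processing even though the operator is not displayed.
    mixed_number = "3\u2064\u00BD"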

New delimiters, arrows, squares and other math symbols were also added.

Phaistos Disc Symbols: U+101D0—U+101FF

The Phaistos disc was found during an archeological dig at Phaistos, Crete, about a century ago. The disc probably dates from the mid-18th to the mid-14th century BCE. Compared with other ancient scripts, relatively little is known about the Phaistos Disc Symbols. The symbols have not been deciphered, and the disc remains the only known example of the writing. Nonetheless, the disc has engendered great interest, and numerous scholars and amateurs spend time discussing the symbols.

Mahjong Tile Symbols: U+1F000—U+1F02F

Mahjong tile symbols encode a set of tiles used to play the popular Chinese game of Mahjong. The exact origin of the game is unknown, but it has been around since at least the mid-nineteenth century, and its popularity spread to Japan, Britain and the US in the early twentieth century. There is some variation in the set of tiles used, so the Unicode Standard encodes a superset of the tiles used in various traditions of the game. The main set of tiles is made up of three suits with nine tiles each: the Bamboos, the Circles and the Characters. Additional tiles include the Dragons, the Winds, the Flowers and the Seasons.

Domino Tile Symbols: U+1F030—U+1F09F

Domino tile symbols encode the "double-six" set of tiles used to play the game of dominoes, which derives from Chinese tile games dating back to the twelfth century. The tiles are encoded in horizontal and vertical orientations, thus, for example, both U+1F081 DOMINO TILE VERTICAL-04-02 and U+1F04F DOMINO TILE HORIZONTAL-04-02 are encoded.