Unicode® 6.1.0
Released: 2012 January 31 (Announcement)
Version 6.1.0 has been superseded by the latest version of the Unicode Standard.
Unicode 6.1.0 is a minor version of the Unicode Standard. This page summarizes the important changes for the Unicode Standard, Version 6.1.0. In the discussion below, Version 6.1.0 may be abbreviated as "Unicode 6.1" or "Version 6.1."
A. Summary
B. Version Information
C. Stability Policy Update
D. Textual Changes and Character Additions
E. Conformance Changes
F. Unicode Character Database Changes
G. Unicode Standard Annex Changes
Version 6.1 of the Unicode Standard continues the Unicode Consortium's long-term commitment to support the full diversity of languages around the world. This latest version adds characters to support additional languages of China, other Asian countries, and Africa. It also addresses educational needs in the Arabic-speaking world. A total of 732 new characters have been added.
This version of the Standard also brings technical improvements to support implementers. Changes to property values and their aliases mean that properties now have labels that are easier to use systematically in programs. The new labels, combined with a new script extensions property, mean that regular expressions can be more straightforward and easier to validate. The Hangul algorithms were consolidated and restructured: previously one had to examine four separate documents, whereas the information is now consolidated in the core specification in Chapter 3, Conformance.
Over 200 new Standardized Variants have been added for emoji characters, allowing implementations to distinguish between two preferred display styles: text style and emoji style.
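As a rough illustration of how such a variant is expressed in text, the sketch below appends a variation selector to a base character. U+FE0E (VARIATION SELECTOR-15) requests text style and U+FE0F (VARIATION SELECTOR-16) requests emoji style. The helper names `with_style` and `code_points` are hypothetical, and the choice of U+2600 as the base is an assumption made for illustration; StandardizedVariants.txt in the UCD is the authoritative list of bases.

```python
# Variation selectors used to request a presentation style.
VS15_TEXT = "\uFE0E"   # VARIATION SELECTOR-15: text style
VS16_EMOJI = "\uFE0F"  # VARIATION SELECTOR-16: emoji style


def with_style(base: str, emoji: bool) -> str:
    """Return the base character followed by the selector for the requested style.

    Hypothetical helper: it does not check that the base actually has a
    standardized variation sequence defined for it.
    """
    if len(base) != 1:
        raise ValueError("expected a single base code point")
    return base + (VS16_EMOJI if emoji else VS15_TEXT)


def code_points(s: str) -> str:
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


sun = "\u2600"  # BLACK SUN WITH RAYS, assumed here to be a valid base
print(code_points(with_style(sun, emoji=False)))
print(code_points(with_style(sun, emoji=True)))
```

Whether the two sequences actually render differently depends on the fonts and the rendering system in use.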
Among the notable property changes and additions in Unicode 6.1 are two new line break property values, which improve the line-breaking behavior of Hebrew and Japanese text. Segmentation behavior was also improved for Thai, Lao, and similar languages. The processing of Chinese data has been augmented by more fully specified information on mapping between Simplified and Traditional Chinese characters, along with other improvements to the Unihan data.
For detailed property changes see Section F. Unicode Character Database Changes.
Version 6.1 has minor conformance updates, including the determination of grapheme cluster boundaries and the processing of canonical combining class and decomposition mapping. There are documentation improvements throughout.
Two other important Unicode specifications are maintained in synchrony with the Unicode Standard, and have updates for Version 6.1:
This version of the Unicode Standard is synchronized in repertoire with the forthcoming third edition of 10646: ISO/IEC 10646:2012.
Version 6.1 of the Unicode Standard consists of the core specification, the delta and archival code charts for this version, the Unicode Standard Annexes, and the Unicode Character Database (UCD).
The core specification gives the general principles, requirements for conformance, and guidelines for implementers. The code charts show representative glyphs for all the Unicode characters. The Unicode Standard Annexes supply detailed normative information about particular aspects of the standard. The Unicode Character Database supplies normative and informative data for implementers to allow them to implement the Unicode Standard.
Version 6.1.0 of the Unicode Standard should be referenced as:
The Unicode Consortium. The Unicode Standard, Version 6.1.0, (Mountain View, CA: The Unicode Consortium, 2012. ISBN 978-1-936213-02-3)
http://www.unicode.org/versions/Unicode6.1.0/
A complete specification of the contributory files for Unicode 6.1 is found on the page Components for 6.1.0. That page also provides the recommended reference format for Unicode Standard Annexes.
The navigation bar on the left of this page provides links to the core specification as a single file, to its individual chapters, and to the appendices. Also provided are links to the code charts, the radical-stroke indices to CJK ideographs, the Unicode Standard Annexes, and the data files for Version 6.1 of the Unicode Character Database.
Several sets of code charts are available. They serve different purposes:
- The latest set of code charts for the Unicode Standard is available online. Those charts are always the most current code charts available, and may be updated at any time. The charts are organized by scripts and blocks for easy reference. An online index by character name is also provided.
For Unicode 6.1.0 in particular, two additional sets of code chart pages are provided:
- A set of delta code charts showing only the new blocks for Unicode 6.1.0 and any existing blocks to which new characters were added in Unicode 6.1.0. All new characters are visually highlighted in those charts.
- A set of archival code charts that represent the entire set of characters, names, and representative glyphs at the time of publication of Unicode 6.1.0.
The delta and archival code charts are a stable part of this release of the Unicode Standard. They will never be updated.
Errata incorporated into Unicode 6.1 are listed by date in a separate table. For corrigenda and errata after the release of Unicode 6.1, see the list of current Updates and Errata.
The stability policy which limits the range of possible Canonical_Combining_Class property values was narrowed to 0..254, from its former range of 0..255. This has the effect of permanently reserving the value 255, which can then be used by implementations for possible optimizations of table building.
Note: The Unicode Character Encoding Stability Policy restricts possible future changes to the Unicode Standard, but is not formally a part of the standard itself.
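One way an implementation might exploit the reserved value is as an "empty slot" marker in a one-byte-per-code-point table. The sketch below is only an illustration of that idea, not a recommended design: the class `CccCache` and the constant `CCC_NOT_CACHED` are hypothetical names, and the actual values come from Python's `unicodedata.combining`, which reflects whatever Unicode version the Python build ships with.

```python
import unicodedata

# Hypothetical sentinel. It cannot collide with a real value because the
# stability policy limits Canonical_Combining_Class to 0..254.
CCC_NOT_CACHED = 255


class CccCache:
    """Lazily filled byte table of Canonical_Combining_Class values (illustrative)."""

    def __init__(self, size: int = 0x10000) -> None:
        # Every slot starts out as "not yet looked up".
        self._table = bytearray([CCC_NOT_CACHED] * size)

    def lookup(self, cp: int) -> int:
        if cp >= len(self._table):
            # Outside the cached range: fall back to a direct lookup.
            return unicodedata.combining(chr(cp))
        value = self._table[cp]
        if value == CCC_NOT_CACHED:
            value = unicodedata.combining(chr(cp))
            assert 0 <= value <= 254  # guaranteed by the stability policy
            self._table[cp] = value
        return value


cache = CccCache()
print(cache.lookup(0x0301))  # COMBINING ACUTE ACCENT
print(cache.lookup(0x0041))  # LATIN CAPITAL LETTER A
```

Because 255 can never be assigned, no separate "is this slot filled" bitmap is needed.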
732 new character assignments were made to the Unicode Standard, Version 6.1. These additions bring the total number of characters assigned in the standard to 110,116. (That is the traditional count, which totals up graphic and format characters, but omits surrogate code points, ISO control codes, noncharacters, and private-use allocations.)
Character Assignment Overview
128 characters have been added to the BMP, while 604 characters have been added in the supplementary planes. Most character additions are in new blocks, but there are also character additions to a number of existing blocks.
New Blocks
The newly-defined blocks in Version 6.1 are:
08A0..08FF | Arabic Extended-A
1CC0..1CCF | Sundanese Supplement
AAE0..AAFF | Meetei Mayek Extensions
10980..1099F | Meroitic Hieroglyphs
109A0..109FF | Meroitic Cursive
110D0..110FF | Sora Sompeng
11100..1114F | Chakma
11180..111DF | Sharada
11680..116CF | Takri
16F00..16F9F | Miao
1EE00..1EEFF | Arabic Mathematical Alphabetic Symbols
Text Changes and Additions
Numbers indicate the chapter or section in the Unicode 6.1 core specification where there are some significant changes or additions. This list is not exhaustive. Select changes to conformance requirements in Chapter 3, Conformance, that impact implementations are listed separately under E. Conformance Changes.
- 3.5: Updated the discussion of property values to clarify that some properties associate multiple values with each code point.
- 3.12: The discussion of Hangul syllable boundary determination was removed from this section. It now appears, instead, as a new section in UAX #29, "Unicode Text Segmentation".
- 3.12: The Java sample code exemplifying the Hangul-related algorithms was moved from UAX #15, "Unicode Normalization Forms" into this section, where it immediately follows the specifications of those algorithms.
- 3.12: The statements of the Hangul Decomposition, Hangul Composition, and Hangul Name Generation algorithms were cleaned up to give them a consistent presentation and better examples.
- 4.5: Additional text was added on the General_Category.
- 5.21: Rewrote text on Ignoring Characters in Processing.
- 6.2: Added text to the description of spaces.
- 8.2: Added new text on the Arabic Extended-A block, documented decomposition decisions involving Arabic letters with hamza-shaped diacritics, and updated the description of Arabic diacritic marks.
- 8.3: Clarified the text on Syriac shaping behavior.
- 9.1: Made various updates to Devanagari to document Vedic extensions, Vedic use of U+20F0 COMBINING ASTERISK ABOVE, and Devanagari short vowels.
- 9.2: Added information on the history of Assamese and Bengali.
- 10.9: Added new text for Sharada.
- 10.10: Added new text for Takri.
- 10.11: Added new text for Chakma.
- 10.12: Added new text for Meetei Mayek Extensions.
- 10.14: Added new text for Sora Sompeng.
- 12.2: Updated text to modify the syntax for ideograph variation sequences, removing length constraints and allowing the use of private use characters.
- 13.4: Clarified the function of ZWJ in the context of display of bi-consonants in Tifinagh.
- 13.13: Added new text for Miao.
- 14.19: Added new text for Meroitic Hieroglyphs and Meroitic Cursive.
- 15.2: Added new text on Arabic Mathematical Alphabetic Symbols.
- 15.3: Rewrote discussion of number forms to expand it to cover all aspects of the encoding of numerals in the standard.
- 15.4: Created a new section devoted to superscript and subscript symbols.
- 15.8: Made various updates to the text in Miscellaneous Symbols regarding playing cards and Phaistos Disc symbols.
- 16.4: Updated the text on Variation Selectors.
- 16.5: Extended the discussion of properties for private-use characters.
- 17.1: Updated information on the use and display of formal name aliases in the code charts.
- Appendix E was updated with more details of the early history of Chinese character standardization.
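The Hangul Decomposition and Hangul Composition algorithms mentioned under 3.12 are purely arithmetic. The following is a minimal Python sketch of that arithmetic, using the commonly published constants; it is not the Java sample code from the core specification, and the function names `decompose_hangul` and `compose_hangul` are illustrative.

```python
import unicodedata

# Constants for the precomposed Hangul syllable block and the conjoining jamo.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT  # 11172 precomposed syllables


def decompose_hangul(syllable: str) -> str:
    """Fully decompose one precomposed syllable into its L, V and optional T jamo."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return syllable  # not a precomposed Hangul syllable
    l_part = chr(L_BASE + s_index // N_COUNT)
    v_part = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    t_part = chr(T_BASE + t_index) if t_index else ""
    return l_part + v_part + t_part


def compose_hangul(jamo: str) -> str:
    """Compose an L V or L V T jamo sequence back into a single syllable."""
    l_index = ord(jamo[0]) - L_BASE
    v_index = ord(jamo[1]) - V_BASE
    t_index = ord(jamo[2]) - T_BASE if len(jamo) == 3 else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


parts = decompose_hangul("\uD4DB")
print(" ".join(f"U+{ord(ch):04X}" for ch in parts))
print(compose_hangul(parts) == "\uD4DB")
```

A quick sanity check for a sketch like this is to compare its output against `unicodedata.normalize("NFD", ...)` for every syllable in U+AC00..U+D7A3.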
There are several changes to conformance requirements in Unicode 6.1 that impact implementations. The most important of these are:
- Several bullets were updated under definitions D58, D59, and D60, to further clarify the relationship between "grapheme base", the Grapheme_Base property, and the specification of grapheme cluster boundaries in UAX #29, "Unicode Text Segmentation".
- Clarifications were added under D107 and D110 in Section 3.11 to make it clear that private agreements cannot override the Canonical_Combining_Class or Decomposition_Mapping of private use characters.
- The text regarding tailored casing at the beginning of Section 3.13 was corrected, to properly indicate which kinds of tailorings are covered in SpecialCasing.txt and which by CLDR.
- In Section 4.8 the description of formal name aliases has been updated to account for new types of aliases which are now formally defined in NameAliases.txt in the UCD.
The detailed listing of all changes to the contributory data files of the Unicode Character Database for Version 6.1 can be found in UAX #44, Unicode Character Database.
The changes listed there include a number of important property revisions to existing characters that will affect implementations:
- Five characters (U+00A7, U+00B6, U+0F14, U+1360, and U+10102) had their General_Category changed from So to Po, to assist in cleaner tailoring of the relative order of symbols and punctuation for the Unicode Collation Algorithm.
- Eight relatively recently added numeric symbols (U+3248..U+324F) had their General_Category changed from So to No, to make them more consistent with similar symbols consisting of numbers surrounded by a circle or a square. Neither this change nor the change from gc=So to gc=Po affects the derivation of identifier-related properties, but either may impact assumptions about these characters in some implementations.
- The default Bidi_Class for two ranges, U+08A0..U+08FF and U+1EE00..U+1EEFF, has been changed from bc=R to bc=AL, because the new blocks for those ranges now contain Arabic characters. Check that default Bidi_Class settings for those ranges are updated accordingly in property tables and in implementations of the Unicode Bidirectional Algorithm.
- Two new Line_Break property values have been added. The first is for Hebrew letters: lb=HL. It is used in the definition of a new rule, LB21a, in UAX #14, for handling line breaking for Hebrew characters next to hyphens. The second, lb=CJ, allows for better customization of Japanese line breaking. Implementations of Unicode line breaking may need to be updated to correctly handle these additional line break property values.
- The kTraditionalVariant and kSimplifiedVariant tags and their usage in the Unihan Database have been more fully specified. Implementations which use that data to do simplified/traditional mapping of CJK characters may need to be updated.
- The meanings of the kMandarin and kTotalStrokes tags in the Unihan Database have been more fully specified to focus on the use in collation and (for the former) transliteration, and the values of each property have changed very significantly, with values for many more characters. Implementations which use that data may need to be updated.
- Every value for enumerated and catalog properties now has both a short and a long alias. There are no more "n/a" placeholders indicating the absence of a short property value alias. In addition, the long aliases are all suitable for programmatic identifiers. This change affected the Age, Block, Canonical_Combining_Class, Indic_Matra_Category, Indic_Syllabic_Category, and Joining_Group properties.
- The Name_Alias property has added over 300 new values, most of which are common aliases for control characters. For the first time, a character may have more than one character name alias. The existence of multiple character name aliases for a single character may affect implementations.
- Script_Extensions has been added as a new, provisional property, providing finer-grained information for determining the script of runs of text. Implementations may need to be upgraded to take advantage of this information.
- 214 standardized variation sequences have been added for emoji characters, allowing implementations to distinguish between two preferred display styles: text style versus emoji style.
- The Grapheme_Cluster_Break property values have been modified to produce better segmentation results for Thai, Lao, and similar scripts.
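A quick way to see whether a runtime's character data reflects the General_Category changes listed above is to query the affected code points directly. The sketch below uses Python's standard `unicodedata` module and assumes the interpreter ships Unicode 6.1 or later data, and that no later version has changed these values again; the helper name `gc_mismatches` and the `EXPECTED_GC` table are illustrative.

```python
import unicodedata

# Expected General_Category values after the Unicode 6.1 changes described above.
EXPECTED_GC = {
    **{cp: "Po" for cp in (0x00A7, 0x00B6, 0x0F14, 0x1360, 0x10102)},
    **{cp: "No" for cp in range(0x3248, 0x3250)},  # U+3248..U+324F
}


def gc_mismatches(expected=EXPECTED_GC):
    """Return {code point: actual category} for entries that differ from expected."""
    mismatches = {}
    for cp, wanted in expected.items():
        actual = unicodedata.category(chr(cp))
        if actual != wanted:
            mismatches[cp] = actual
    return mismatches


print("Unicode data version:", unicodedata.unidata_version)
print("Mismatches:", gc_mismatches())
```

An empty result suggests the property tables are at least as new as Version 6.1 for these thirteen characters; it says nothing about the other changes in the list.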
Other significant changes resulting from the addition of new characters include:
- An additional unified ideograph has been added to the main BMP block of CJK unified ideographs: U+9FCC. This extends the range of those CJK unified ideographs by one value. Check implementations for any hard-coded assumptions about the ranges of CJK unified ideographs.
- Two new Chakma characters, U+1112E and U+1112F, have canonical decompositions. This is unusual for characters off the BMP, and may break certain assumptions used in optimization of implementations of Unicode Normalization. Check that any hard-coded assumptions about normalization take these characters into account, and that the characters correctly recompose for NFC.
- The addition of the two Chakma characters with canonical decompositions may also impact implementations of the Unicode Collation Algorithm. These two characters introduce new weight contractions, and for the first time the second element of those contractions is a supplementary character. These are also the first instances where the representation of the contraction in UTF-16 is longer than three code units. These changes may impact optimization assumptions in UCA implementations.
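The checks suggested for the two Chakma characters can be sketched in a few lines. This example assumes a Python build whose `unicodedata` tables include Unicode 6.1 or later; the helper names `utf16_units` and `round_trip` are hypothetical.

```python
import unicodedata


def utf16_units(s: str) -> int:
    """Number of UTF-16 code units needed to encode the string."""
    return len(s.encode("utf-16-le")) // 2


def round_trip(ch: str):
    """Return (NFD form, NFC of that NFD form) for a single character."""
    nfd = unicodedata.normalize("NFD", ch)
    nfc = unicodedata.normalize("NFC", nfd)
    return nfd, nfc


for composed in ("\U0001112E", "\U0001112F"):
    nfd, nfc = round_trip(composed)
    decomposed = " ".join(f"U+{ord(c):04X}" for c in nfd)
    print(f"U+{ord(composed):04X} -> {decomposed}")
    print("  recomposes under NFC:", nfc == composed)
    print("  UTF-16 code units in decomposed form:", utf16_units(nfd))
```

Each decomposed form consists of two supplementary characters, so it occupies four UTF-16 code units, which is the case the collation note above warns about.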
Other significant changes to the text of the core specification or annexes which may impact implementation include:
- The Syriac shaping rules specified in Section 8.3, Syriac, of the core specification have been clarified, so that it is clear that the term "dalath or rish" refers to characters with Joining_Group=Dalath_Rish. Also "word breaking character" in the alaph joining rules has been corrected to "non-joining character". Implementers with Syriac shaping engines should check to ensure that their implementations are consistent with those clarifications.
- The list of scripts recommended for inclusion in or exclusion from identifiers has been updated in UAX #31. That list is not available in machine-readable form in the UCD, so implementations which tailor their identifier usage according to the UAX #31 recommendations will need to refer specifically to that annex for updates.
In Version 6.1, many of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UAX, linked directly from the following list of UAXes.
Unicode Standard Annex | Changes
UAX #9 Unicode Bidirectional Algorithm | No significant changes in this version.
UAX #11 East Asian Width | No significant changes in this version.
UAX #14 Unicode Line Breaking Algorithm | Rule 21a was added, to prevent a break between a Hebrew letter and a following hyphen, and the character class HL (Hebrew Letter) was added for that rule. Small kana were moved from class NS to class ID, to align Japanese "kinsoku" more closely with CSS "normal" behavior.
UAX #15 Unicode Normalization Forms | An implementation note on the use of ccc=255 was added. The example code and description of Hangul decomposition and composition were moved into Section 3.12, Conjoining Jamo Behavior in the core specification. Section 14.1, Optimization Strategies was rewritten for clarity.
UAX #24 Unicode Script Property | The former Section 4.1 on Script Anomalies for East Asian Symbols was moved to become Section 3.6, and the examples were extended to cover additional unexpected script values for symbols. A description was added for the new property Script_Extensions.
UAX #29 Unicode Text Segmentation | The discussion of Hangul Syllable segmentation was moved from the Core Specification to this annex and its wording updated slightly. The handling of the Prepend and SpacingMark classes was adjusted so that for the Thai and Lao scripts extended grapheme clusters behave like legacy grapheme clusters, as preferred. Characters with gc=Cs and gc=Cn were added to Control in Table 2, so that they do not join with following Extend characters for defining grapheme cluster boundaries.
UAX #31 Unicode Identifier and Pattern Syntax | New scripts were added to the tables categorizing script usage. Material was added to draw the distinction between the format of identifiers for internal use and the format of identifiers for display. Better guidance was provided on the use of variation sequences.
UAX #34 Unicode Named Character Sequences | No significant changes in this version.
UAX #38 Unicode Han Database (Unihan) | The kTotalStrokes and kMandarin fields were redefined. The use of the kTraditionalVariant and kSimplifiedVariant fields was clarified. A new section 4.4 was added, detailing the ranges of CJK ideographs covered by the Unihan database, with their associated Unicode age values. Each Unihan property that can have multiple values had a specification added to indicate whether the order of values matters, and if so, what the significance of that order is. The regex validity expressions were slightly simplified.
UAX #41 Common References for Unicode Standard Annexes | The references were updated as needed.
UAX #42 Unicode Character Database in XML | New values were added for the age, script, and jg attributes. The values for the ccc attribute were restricted to the 0..254 range, instead of 0..255. The patterns for kIRG_USource and kMandarin were updated to reflect changes in the Unihan database. A new element was added for the Name_Alias property, and new attributes were added for the Block and Script_Extensions properties. A clarification was added to distinguish attributes with empty string values from missing attributes. In particular, the absence of a numeric value is now represented by NaN. The value of the fc_nfkc attribute must now be either # or one-or-more-code-points.
UAX #44 Unicode Character Database | Text was added regarding the reserved value 255 for Canonical_Combining_Class. Grouped values for General_Category were added to the table of values for that property. The status and description of Grapheme_Base and Grapheme_Extend were updated. The tables of regular expressions for validation of property values were updated. An entry was added to the Property Table for the new Script_Extensions provisional property. The description of the Name_Alias property was updated. A new section describing multivalued properties was added. There are various other small editorial fixes to the text.