Unicode 6.1.0

Unicode 6.1.0 is a minor version of the Unicode Standard. This page summarizes the important changes for the Unicode Standard, Version 6.1.0. In the discussion below, Version 6.1.0 may be abbreviated as "Unicode 6.1" or "Version 6.1."

A. Summary

Version 6.1 of the Unicode Standard continues the Unicode Consortium's long-term commitment to support the full diversity of languages around the world. This latest version adds characters to support additional languages of China, other Asian countries, and Africa. It also addresses educational needs in the Arabic-speaking world. A total of 732 new characters have been added.

This version of the Standard also brings technical improvements to support implementers. Changes to property values and their aliases mean that properties now have labels that are easier to use systematically in programs. The new labels, combined with a new script extensions property, mean that regular expressions can be more straightforward and easier to validate. The Hangul algorithms were consolidated and restructured: where one previously had to examine four separate documents, the information is now consolidated in the core specification in Chapter 3, Conformance.

Over 200 new Standardized Variants have been added for emoji characters, allowing implementations to distinguish between text and emoji display styles.
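As a minimal sketch of how such a variation sequence is formed: the base character is followed by U+FE0E (VARIATION SELECTOR-15) to request text style or U+FE0F (VARIATION SELECTOR-16) to request emoji style. The helper name `with_style` is hypothetical, and U+2600 is used only as an illustrative base. Whether a particular base character has a standardized variant should be checked against the StandardizedVariants.txt data file.

```python
import unicodedata

VS15 = "\uFE0E"  # VARIATION SELECTOR-15: requests text presentation
VS16 = "\uFE0F"  # VARIATION SELECTOR-16: requests emoji presentation


def with_style(base: str, emoji: bool) -> str:
    """Return the base character followed by the selector for the requested style.

    Hypothetical helper: it does not check that the resulting sequence is
    actually listed in StandardizedVariants.txt.
    """
    if len(base) != 1:
        raise ValueError("expected a single base character")
    return base + (VS16 if emoji else VS15)


sun = "\u2600"  # BLACK SUN WITH RAYS, an illustrative base character
text_style = with_style(sun, emoji=False)
emoji_style = with_style(sun, emoji=True)

# Show the code points of each sequence and the name of its selector.
for seq in (text_style, emoji_style):
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in seq)
    print(code_points, unicodedata.name(seq[1]))
```

How the two sequences are rendered depends on the fonts and the rendering system. An implementation that does not support them may display both in the same way.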

Among the notable property changes and additions in Unicode 6.1 are two new line break property values, which improve the line-breaking behavior of Hebrew and Japanese text. Segmentation behavior was also improved for Thai, Lao, and similar languages. Processing of Chinese data benefits from more fully specified information on mapping between Simplified and Traditional Chinese characters, along with other improvements to the Unihan data.

Version 6.1 has minor conformance updates, affecting the determination of grapheme cluster boundaries and the processing of canonical combining class and decomposition mappings. There are documentation improvements throughout.

Two other important Unicode specifications are maintained in synchrony with the Unicode Standard, and have updates for Version 6.1:

This version of the Unicode Standard is synchronized in repertoire with the forthcoming third edition of 10646: ISO/IEC 10646:2012.

B. Version Information

Version 6.1 of the Unicode Standard consists of the core specification, the delta and archival code charts for this version, the Unicode Standard Annexes, and the Unicode Character Database (UCD).

The core specification gives the general principles, requirements for conformance, and guidelines for implementers. The code charts show representative glyphs for all the Unicode characters. The Unicode Standard Annexes supply detailed normative information about particular aspects of the standard. The Unicode Character Database supplies normative and informative data for implementers to allow them to implement the Unicode Standard.

A complete specification of the contributory files for Unicode 6.1 is found on the page Components for 6.1.0. That page also provides the recommended reference format for Unicode Standard Annexes.

Code Charts

For Unicode 6.1.0 in particular, two additional sets of code chart pages are provided: the delta code charts and the archival code charts.

The delta and archival code charts are a stable part of this release of the Unicode Standard. They will never be updated.

Errata

Errata incorporated into Unicode 6.1 are listed by date in a separate table. For corrigenda and errata after the release of Unicode 6.1, see the list of current Updates and Errata.

C. Stability Policy Update

The stability policy that limits the range of possible Canonical_Combining_Class property values was narrowed from its former range of 0..255 to 0..254. This has the effect of permanently reserving the value 255, which implementations can then use for possible optimizations of table building.

D. Textual Changes and Character Additions

732 new character assignments were made to the Unicode Standard, Version 6.1. These additions bring the total number of characters assigned in the standard to 110,116. (That is the traditional count, which totals up graphic and format characters, but omits surrogate code points, ISO control codes, noncharacters, and private-use allocations.)
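The traditional count can be approximated with a short script that walks every code point and skips the excluded categories. This is a sketch that relies on Python's `unicodedata` module, which reflects whatever Unicode version the interpreter ships with (reported by `unicodedata.unidata_version`). The result equals 110,116 only when that version is 6.1.0, and is larger for later versions. The function name `traditional_count` is hypothetical.

```python
import sys
import unicodedata

# General_Category values left out of the traditional count:
#   Cs = surrogate code points
#   Cc = control codes
#   Co = private-use code points
#   Cn = unassigned code points, which include the noncharacters
EXCLUDED = {"Cs", "Cc", "Co", "Cn"}


def traditional_count() -> int:
    """Count graphic and format characters in the interpreter's Unicode data."""
    return sum(
        1
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) not in EXCLUDED
    )


print(unicodedata.unidata_version, traditional_count())
```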

Character Assignment Overview

128 characters have been added to the BMP, while 604 characters have been added in the supplementary planes. Most character additions are in new blocks, but there are also character additions to a number of existing blocks.

New Blocks

Text Changes and Additions

Numbers indicate the chapter or section in the Unicode 6.1 core specification where there are some significant changes or additions. This list is not exhaustive. Select changes to conformance requirements in Chapter 3, Conformance, that impact implementations are listed separately under E. Conformance Changes.

E. Conformance Changes

There are several changes to conformance requirements in Unicode 6.1 that impact implementations. The most important of these are:

F. Unicode Character Database Changes

The detailed listing of all changes to the contributory data files of the Unicode Character Database for Version 6.1 can be found in UAX #44, Unicode Character Database. The changes listed there include a number of important property revisions to existing characters that will affect implementations:

Other significant changes resulting from the addition of new characters include:

Other significant changes to the text of the core specification or annexes which may impact implementation include:

G. Unicode Standard Annex Changes

In Version 6.1, many of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UAX, linked directly from the following list of UAXes.

Unicode Standard Annex	Changes
UAX #9 Unicode Bidirectional Algorithm	No significant changes in this version.
UAX #11 East Asian Width	No significant changes in this version.
UAX #14 Unicode Line Breaking Algorithm	Rule 21a was added to prevent a break between a Hebrew letter and a following hyphen, and the character class HL (Hebrew Letter) was added for that rule. Small kana were moved from class NS to class ID, to align Japanese "kinsoku" more closely with CSS "normal" behavior.
UAX #15 Unicode Normalization Forms	An implementation note on the use of ccc=255 was added. The example code and description of Hangul decomposition and composition were moved into Section 3.12, Conjoining Jamo Behavior, in the core specification. Section 14.1, Optimization Strategies, was rewritten for clarity.
UAX #24 Unicode Script Property	The former Section 4.1 on Script Anomalies for East Asian Symbols was moved to become Section 3.6, and the examples were extended to cover additional unexpected script values for symbols. A description was added for the new property Script_Extensions.
UAX #29 Unicode Text Segmentation	The discussion of Hangul Syllable segmentation was moved from the core specification to this annex, and its wording was updated slightly. The handling of the Prepend and SpacingMark classes was adjusted so that, for the Thai and Lao scripts, extended grapheme clusters behave like legacy grapheme clusters, as preferred. Characters with gc=Cs and gc=Cn were added to Control in Table 2, so that they do not join with following Extend characters when grapheme cluster boundaries are determined.
UAX #31 Unicode Identifier and Pattern Syntax	New scripts were added to the tables categorizing script usage. Material was added to draw the distinction between the format of identifiers for internal use and the format of identifiers for display. Better guidance was provided on the use of variation sequences.
UAX #34 Unicode Named Character Sequences	No significant changes in this version.
UAX #38 Unicode Han Database (Unihan)	The kTotalStrokes and kMandarin fields were redefined. The use of the kTraditionalVariant and kSimplifiedVariant fields was clarified. A new Section 4.4 was added, detailing the ranges of CJK ideographs covered by the Unihan database, with their associated Unicode age values. Each Unihan property that can have multiple values had a specification added to indicate whether the order of values matters, and if so, what the significance of that order is. The regex validity expressions were slightly simplified.
UAX #41 Common References for Unicode Standard Annexes	The references were updated as needed.
UAX #42 Unicode Character Database in XML	New values were added for the age, script, and jg attributes. The values for the ccc attribute were restricted to the 0..254 range, instead of 0..255. The patterns for kIRG_USource and kMandarin were updated to reflect changes in the Unihan database. A new element was added for the Name_Alias property, and new attributes were added for the Block and Script_Extensions properties. A clarification was added to distinguish attributes with empty string values from missing attributes. In particular, the absence of a numeric value is now represented by NaN. The value of the fc_nfkc attribute must now be either # or one-or-more-code-points.
UAX #44 Unicode Character Database	Text was added regarding the reserved value 255 for Canonical_Combining_Class. Grouped values for General_Category were added to the table of values for that property. The status and description of Grapheme_Base and Grapheme_Extend were updated. The tables of regular expressions for validation of property values were updated. An entry was added to the Property Table for the new Script_Extensions provisional property. The description of the Name_Alias property was updated. A new section describing multivalued properties was added. There are various other small editorial fixes to the text.
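The Hangul decomposition and composition that the UAX #15 entry above refers to is purely arithmetic. The following is a minimal sketch, not the sample code from the core specification. The constants are the commonly published Hangul syllable constants, and the function names `decompose_hangul` and `compose_hangul` are hypothetical. The composition helper assumes well-formed input and does no validation.

```python
import unicodedata

# Commonly published constants for precomposed Hangul syllables.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT  # total precomposed syllables


def decompose_hangul(syllable: str) -> str:
    """Fully decompose one precomposed syllable into L V or L V T jamo."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return syllable  # not a precomposed Hangul syllable
    parts = [
        chr(L_BASE + s_index // N_COUNT),
        chr(V_BASE + (s_index % N_COUNT) // T_COUNT),
    ]
    t_index = s_index % T_COUNT
    if t_index:  # index 0 means there is no trailing consonant
        parts.append(chr(T_BASE + t_index))
    return "".join(parts)


def compose_hangul(jamo: str) -> str:
    """Compose an L V or L V T jamo sequence (assumed well-formed) into one syllable."""
    l_index = ord(jamo[0]) - L_BASE
    v_index = ord(jamo[1]) - V_BASE
    t_index = ord(jamo[2]) - T_BASE if len(jamo) == 3 else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


syllable = "\uD55C"
jamo = decompose_hangul(syllable)
print([f"U+{ord(ch):04X}" for ch in jamo])
print(compose_hangul(jamo) == syllable)
```

The result can be cross-checked against `unicodedata.normalize("NFD", syllable)`, which applies the same decomposition.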

Released: 2012 January 31 (Announcement)