Unicode 5.2.0

5.2.0 Front Matter
	Title and Copyright
	Foreword
	Contents
	Unicode 5.2 Web Bookmarks
	List of Figures
	List of Tables
	Preface
5.2.0 Chapters
1	Introduction
2	General Structure
3	Conformance
4	Character Properties
5	Implementation Guidelines
6	Writing Systems and Punctuation
7	European Alphabetic Scripts
8	Middle Eastern Scripts
9	South Asian Scripts - I
10	South Asian Scripts - II
11	Southeast Asian Scripts
12	East Asian Scripts
13	Additional Modern Scripts
14	Ancient and Historic Scripts
15	Symbols
16	Special Areas and Format Characters
17	Code Charts Introductory Text
Code Charts
•	Code Charts (Latest)
•	Delta Code Charts (additions to 5.2.0 highlighted)
•	Archival Code Charts (5.2.0, 16 MB)
•	Archival Multi-column Han Code Charts (5.2.0, 72 MB)
Han Radical-Stroke Index
18	Han Radical-Stroke Index (Introductory Text)
•	Interactive Han Radical-Stroke Index (Latest)
•	IICore Radical-Stroke Index (3.2 MB)
•	Full Han Radical-Stroke Index (24 MB)
5.2.0 Appendices and Back Matter
A	Notational Conventions
B	Unicode Publications and Resources
C	Relationship to ISO/IEC 10646
D	Changes from Previous Versions
E	Han Unification History
R	References
I	General Index
	Full Text Zipped for Download (10 MB)
5.2.0 Unicode Standard Annexes
UAX #9: The Unicode Bidirectional Algorithm
UAX #11: East Asian Width
UAX #14: Unicode Line Breaking Algorithm
UAX #15: Unicode Normalization Forms
UAX #24: Unicode Script Property
UAX #29: Unicode Text Segmentation
UAX #31: Unicode Identifier and Pattern Syntax
UAX #34: Unicode Named Character Sequences
UAX #38: Unicode Han Database (Unihan)
UAX #41: Common References for Unicode Standard Annexes
UAX #42: Unicode Character Database in XML
UAX #44: Unicode Character Database
5.2.0 UCD
5.2.0 (files) (about)
5.2.0 Zipped files (for bulk download)
Related Links
About Versions
Latest Version
Archive of Unicode Versions
The Unicode Standard
Unicode Character Database
Glossary of Unicode Terms
Unicode Character Names Index
Technical Reports
Updates and Errata

Unicode® 5.2.0

Released: 2009 October 1 (Announcement)

Version 5.2.0 has been superseded by the latest version of the Unicode Standard.

Version 5.2.0 of the Unicode Standard consists of the core specification (The Unicode Standard, Version 5.2), together with the delta and archival code charts for this version, the 5.2.0 Unicode Standard Annexes, and the 5.2.0 Unicode Character Database (UCD).

The core specification gives the general principles, requirements for conformance, and guidelines for implementers. The code charts show representative glyphs for all the Unicode characters. The Unicode Standard Annexes supply detailed normative information about particular aspects of the standard. The Unicode Character Database supplies normative and informative data for implementers to allow them to implement the Unicode Standard.

Version 5.2.0 of the Unicode Standard should be referenced as:

The Unicode Consortium. The Unicode Standard, Version 5.2.0, defined by: The Unicode Standard, Version 5.2 (Mountain View, CA: The Unicode Consortium, 2009. ISBN 978-1-936213-00-9). (http://www.unicode.org/versions/Unicode5.2.0/)

A complete specification of the contributory files for Unicode 5.2.0 is found on the page Components for 5.2.0. That page also provides the recommended reference format for Unicode Standard Annexes.

Contents of This Document

A. Online Edition
B. Overview
C. Stability Policy Update
D. Character Additions
E. Conformance Changes
F. Unicode Character Database Changes
G. Unicode Standard Annex Changes

A. Online Edition

The text of The Unicode Standard, Version 5.2, as well as the delta and archival code charts, is available via the navigation links on this page. The charts and the Unicode Standard Annexes may be printed, while the other files may be viewed but not printed. The Unicode 5.2 Web Bookmarks page has links to all sections of the online text. A zipped version of the core specification (10 MB) is also available for download.

This page summarizes important changes to the standard from Unicode 5.1.0. The core specification and the Unicode Standard Annexes are not delta documents; they incorporate all of the textual changes for their updates for Version 5.2.0.

B. Overview

The Unicode Standard, Version 5.2, adds 6,648 characters and significantly improves the documentation of conformance requirements for the specification of normalization forms, canonical ordering, and the status of types of properties. Version 5.2 brings improved clarity of presentation in many Unicode Standard Annexes.

Seven new contemporary scripts have been added in Version 5.2: Bamum, Javanese, Lisu, Meetei Mayek, Samaritan, Tai Tham, and Tai Viet. New character additions to existing scripts now provide greater support for Abkhaz, Canadian Aboriginal Syllabics, Coptic, Devanagari, Khamti Shan, Malayalam, and Myanmar. Of particular note are Devanagari additions in support of Vedic Sanskrit. Encoding Vedic is significant because Sanskrit is one of the principal languages for the religious heritage of India, and because Vedic represents the earliest attested phase of the language.

The seven contemporary scripts and newly encoded individual characters expand support of language and orthographic communities in Africa, India, China, Central Asia, Southeast Asia, and the Middle East.

Other character additions include important modern use symbols and historic characters. With Unicode Version 5.2, scholars will now have access to the Gardiner set of Egyptian Hieroglyphs as well as other important historic scripts: Imperial Aramaic, Avestan, Kaithi, Old South Arabian, and Old Turkic. Several key symbol sets were added or expanded: the ARIB set of Japanese broadcasting symbols, additional number forms used in India, and currency symbols.

This latest version of the Unicode Standard has exactly the same character assignments as ISO/IEC 10646:2003 plus Amendments 1 through 6.

Unicode Version 5.2:

Updates stability policies to add property value stability guarantees for identifier-related properties, a guarantee of property, property alias and property value alias stability, and a policy on alias uniqueness.

Incorporates into Chapter 3, Conformance the formal definitions of normalization formerly presented in Unicode Standard Annex #15, "Unicode Normalization Forms." Sections that were modified include sections 3.6 and 3.11.

Revises Section 3.5, Properties to better explain the status of Normative, Informative, Provisional, and Contributory properties.

Clarifies the definition of Deprecated and its relationship to "strongly discouraged," and updates the set of Deprecated characters in view of this clearer definition.

Updates best practices for the use of replacement characters.

Improves the description of compatibility characters in Chapter 2, General Structure.

Adds standardized named sequences for Tamil.

Contains significant changes to properties and behavioral specifications.

Errata

Errata incorporated into Unicode 5.2.0 are listed by date in a separate table. For corrigenda and errata after the release of Unicode 5.2.0, see the list of current Updates and Errata.

C. Stability Policy Update

The Unicode Character Encoding Stability Policy has been updated. This update strengthens normalization stability, adds stability policy for case pairs, and extends constraints on property values. For the current statement of these policies, see Unicode Character Encoding Stability Policy.

D. Character Additions

6,648 new character assignments were made to the Unicode Standard, Version 5.2.0 (over and above what was in Unicode 5.1.0). The character repertoire corresponds to ISO/IEC 10646:2003 plus Amendments 1 through 6.

The exact list of characters added for Version 5.2.0 is documented in the file DerivedAge.txt in the Unicode Character Database. A few of the added characters may affect existing implementations. These cases are highlighted here so that implementers can check their code for problematic assumptions.
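As a rough sketch of how the list of additions could be pulled out of DerivedAge.txt, the Python below parses lines of the form "first..last ; age # comment" and collects the code points tagged with a given age. The sample entries are illustrative stand-ins for the real file, and the function name code_points_with_age is hypothetical.

```python
import re

# Illustrative entries in the DerivedAge.txt layout (an assumed sample,
# not a copy of the real file).
SAMPLE = """\
# comment lines start with '#'
0800..082D    ; 5.2 #  [46] illustrative range entry
20B6..20B8    ; 5.2 #   [3] illustrative range entry
A64C          ; 5.1 #       illustrative single-code-point entry
"""

# first code point, optional "..last", then the age field after ';'
LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*([0-9.]+)")

def code_points_with_age(text, age):
    """Return the sorted code points whose age field equals `age`."""
    found = []
    for line in text.splitlines():
        m = LINE.match(line)
        if not m or m.group(3) != age:
            continue  # comment, blank line, or a different version
        first = int(m.group(1), 16)
        last = int(m.group(2) or m.group(1), 16)
        found.extend(range(first, last + 1))
    return sorted(found)

added = code_points_with_age(SAMPLE, "5.2")
print(len(added), "code points tagged 5.2 in the sample")
```

Run against the full file, the same function would be expected to return the complete set of 5.2.0 additions.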

Three new characters in the newly encoded Kaithi script will require changes in implementations that make hard-coded assumptions about composition during normalization. Most new characters added to the standard with decompositions cannot be generated by the operations toNFC() or toNFKC(), but these three can. Implementers should check their code carefully to ensure that it handles these three characters correctly.

U+1109A KAITHI LETTER DDDHA

U+1109C KAITHI LETTER RHA

U+110AB KAITHI LETTER VA
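One way to check an implementation against these three characters is a round trip through NFD and NFC. The sketch below uses Python's standard unicodedata module, which ships a later version of the Unicode Character Database than 5.2; the helper name round_trips_through_nfc is hypothetical.

```python
import unicodedata

# The three Kaithi characters listed above.
KAITHI_COMPOSITES = [0x1109A, 0x1109C, 0x110AB]

def round_trips_through_nfc(cp):
    """True if the character decomposes into two code points under NFD
    and NFC recomposes that sequence into the original character."""
    ch = chr(cp)
    decomposed = unicodedata.normalize("NFD", ch)
    recomposed = unicodedata.normalize("NFC", decomposed)
    return len(decomposed) == 2 and recomposed == ch

for cp in KAITHI_COMPOSITES:
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)), round_trips_through_nfc(cp))
```

A normalizer built on a fixed pre-5.2 composition table would leave the decomposed sequence untouched under NFC, which is the failure this check is meant to expose.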

One of the compatibility CJK ideographs added in this version has a decomposition mapping to a unified CJK ideograph in Extension B. The effect of this is that for the first time a character in the BMP normalizes to a character not in the BMP:
toNFC(U+FA6C) = U+242EE
Implementers should check their implementations of normalization to ensure they are not assuming that no BMP character can normalize to a non-BMP character.
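The effect on buffer sizes can be illustrated with Python's unicodedata module. The helper utf16_units is a hypothetical name; it counts UTF-16 code units so the growth from one unit to a surrogate pair is visible.

```python
import unicodedata

def utf16_units(s):
    """Number of UTF-16 code units needed to store the string."""
    return len(s.encode("utf-16-le")) // 2

src = "\uFA6C"                             # BMP compatibility ideograph
dst = unicodedata.normalize("NFC", src)    # normalizes to an Extension B ideograph

print(f"U+{ord(src):04X} -> U+{ord(dst):04X}")
print("UTF-16 code units:", utf16_units(src), "->", utf16_units(dst))
```

Code that normalizes UTF-16 text in place, or sizes an output buffer on the assumption that BMP input yields BMP output, would need to allow for this growth.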

Any hard-coded range assumptions about Unified CJK Ideographs in implementations may need fixing, because the end range for those has changed from U+9FC3 to U+9FCB in this version. There is also an entirely new block of CJK Unified Ideographs: CJK Unified Ideographs Extension C (U+2A700..U+2B73F), with characters encoded in the range U+2A700 to U+2B734.
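A minimal sketch of keeping such ranges in a version-keyed table rather than scattering literal bounds through the code. It covers only the two ranges discussed above; the start of the main block (U+4E00) is supplied from general knowledge, other unified ideograph ranges are deliberately omitted, and all names are hypothetical.

```python
# Only the ranges discussed above; this is not a complete list of
# unified ideograph ranges.
URO_START = 0x4E00                          # start of the main CJK Unified Ideographs block
URO_LAST = {"5.1": 0x9FC3, "5.2": 0x9FCB}   # last assigned code point per version
EXT_C_ASSIGNED = (0x2A700, 0x2B734)         # Extension C, new in 5.2

def in_listed_ranges(cp, version="5.2"):
    """True if `cp` falls in one of the ranges tabulated above."""
    if URO_START <= cp <= URO_LAST[version]:
        return True
    return version == "5.2" and EXT_C_ASSIGNED[0] <= cp <= EXT_C_ASSIGNED[1]

print(in_listed_ranges(0x9FC4, "5.1"), in_listed_ranges(0x9FC4, "5.2"))
```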

There is now an assigned Hangul jamo character at U+11A7. This may interfere with some implementations' boundary testing for Hangul decomposition.
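The boundary in question can be sketched as follows, assuming the usual constants of the arithmetic Hangul syllable composition (SBase, TBase, and so on). In that scheme U+11A7 serves as TBase, so a trailing index of 0 means "no trailing consonant", and the lower bound of the test has to be strict. The function names are hypothetical.

```python
# Constants of the arithmetic Hangul syllable algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_COUNT = L_COUNT * V_COUNT * T_COUNT

def is_lv_syllable(cp):
    """True for precomposed syllables that have no trailing consonant."""
    s_index = cp - S_BASE
    return 0 <= s_index < S_COUNT and s_index % T_COUNT == 0

def compose_lv_t(lv, t):
    """Compose an LV syllable with a trailing consonant, or return None."""
    if not is_lv_syllable(lv):
        return None
    t_index = t - T_BASE
    # Strict lower bound: index 0 is U+11A7 itself, which must not compose.
    if 0 < t_index < T_COUNT:
        return lv + t_index
    return None

print(hex(compose_lv_t(0xAC00, 0x11A8)))   # LV followed by the first trailing consonant
print(compose_lv_t(0xAC00, 0x11A7))        # U+11A7 is left alone
```

A test written as `0 <= t_index` would have been harmless while U+11A7 was unassigned, but would now fold an assigned character into the preceding syllable.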

There are a number of new Hangul jamo characters added for support of Old Korean. Some of these are encoded in new blocks. An implementation may run into trouble if it assumes that the repertoire of conjoining jamos is fixed, or that all conjoining jamos occur only in the Hangul Jamo block, U+1100..U+11FF.

New uppercase parenthesized symbols have been added. Unlike the circled letter symbols, there are no uppercase/lowercase relationships for these new characters.
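The difference can be observed with Python's built-in case mappings, which follow the Unicode Character Database shipped with the interpreter (a later version than 5.2). The code point U+1F110 is used here on the assumption that it is one of the new parenthesized capitals; the name lookup in the code prints what it actually is.

```python
import unicodedata

circled_capital_a = "\u24B6"       # circled letter: has a lowercase partner
paren_capital_a = "\U0001F110"     # assumed example of a new parenthesized capital
paren_small_a = "\u249C"           # older parenthesized small letter

for ch in (circled_capital_a, paren_capital_a, paren_small_a):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch),
          "lower:", f"U+{ord(ch.lower()):04X}",
          "upper:", f"U+{ord(ch.upper()):04X}")
```

Code that pairs symbols by name ("CAPITAL" versus "SMALL") rather than by the case mapping data would draw the wrong conclusion for the parenthesized set.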

Character Assignment Overview

The new character additions were to the BMP, the SMP (Plane 1), and the SIP (Plane 2). The following table shows the allocation of code points in Unicode 5.2.0. For more information on the specific characters, see the file DerivedAge.txt in the Unicode Character Database. For more details of character counts, see Appendix D, Changes from Previous Versions in Unicode 5.2.

Type	Count
Graphic	107,154
Format	142
Control	65
Private Use	137,468
Surrogate	2,048
Noncharacter	66
Reserved	867,169
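As a quick consistency check on the table, the seven counts should add up to the size of the whole codespace, 17 planes of 65,536 code points each. The dictionary name ALLOCATION is hypothetical.

```python
# Counts copied from the allocation table above.
ALLOCATION = {
    "Graphic": 107_154,
    "Format": 142,
    "Control": 65,
    "Private Use": 137_468,
    "Surrogate": 2_048,
    "Noncharacter": 66,
    "Reserved": 867_169,
}

total = sum(ALLOCATION.values())
codespace = 17 * 0x10000   # U+0000..U+10FFFF

# Graphic, format, and control code points together are the assigned characters.
assigned_characters = sum(ALLOCATION[k] for k in ("Graphic", "Format", "Control"))

print(total == codespace, assigned_characters)
```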

E. Conformance Changes

There are several changes to conformance requirements in Unicode 5.2 that impact implementations. The most important of these are noted specifically here.

The formal definitions of normalization formerly presented in Unicode Standard Annex #15, "Unicode Normalization Forms," have been moved to Chapter 3, Conformance.

A key conformance clause on the modification of character sequences, C7, has been tightened to eliminate security risks resulting from deletion of noncharacters from uninterpreted text strings. In Unicode 5.2, the conformance requirements now disallow their removal, except where strings are explicitly being modified.
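A sketch of what this implies for code that handles text it does not interpret: noncharacters pass through untouched, and removal happens only in a function whose stated purpose is to modify the string. The predicate relies on the standard definition of noncharacters (U+FDD0..U+FDEF plus the last two code points of every plane); the function names are hypothetical.

```python
def is_noncharacter(cp):
    """True for the 66 noncharacter code points."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def pass_through(text):
    """Forward uninterpreted text unchanged, noncharacters included."""
    return text

def remove_noncharacters(text):
    """Explicit modification: the caller has asked for the string to be changed."""
    return "".join(ch for ch in text if not is_noncharacter(ord(ch)))

sample = "ab\uFFFEcd"
print(len(pass_through(sample)), len(remove_noncharacters(sample)))
```

Keeping the two operations separate makes it harder for a filter or transport layer to drop noncharacters as a side effect.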

The status of Normative, Informative, Provisional, and Contributory properties is clarified in Section 3.5 Properties.

The types of code points are clarified in Chapters 2, 3, and 4, with coordinated updates in Unicode Standard Annex #44, "Unicode Character Database."

The PropertyAliases.txt file in the Unicode Character Database is now designated as the normative listing of Unicode character properties and their names.

The BidiTest.txt file in the Unicode Character Database is a new feature in Unicode 5.2. This file contains test cases for assessing conformance to the Unicode Bidirectional Algorithm.

There are additional changes in Unicode conformance requirements due to changes in the UCD data files and the Unicode Standard Annexes listed below.

F. Unicode Character Database Changes

The detailed listing of all changes to the contributory data files of the Unicode Character Database for Version 5.2.0 can be found in UAX #44, Unicode Character Database. The most significant changes include:

There are new case-related properties in DerivedCoreProperties.txt and DerivedNormalizationProps.txt. The new case-related derived properties are NFKC_Casefold, Case_Ignorable, Cased, Changes_When_Lowercased, Changes_When_Uppercased, Changes_When_Titlecased, Changes_When_Casemapped, Changes_When_Casefolded, and Changes_When_NFKC_Casefolded.
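Several of these properties are defined in terms of case mapping applied to the NFD form of a character; for example, Changes_When_Lowercased holds when toLowercase(toNFD(X)) differs from toNFD(X). The sketch below approximates three of them with Python's built-in string methods. It is an approximation under stated assumptions: Python uses the UCD version bundled with the interpreter rather than 5.2, and the data files remain the authoritative source. The function names are hypothetical.

```python
import unicodedata

def changes_when_lowercased(ch):
    nfd = unicodedata.normalize("NFD", ch)
    return nfd.lower() != nfd

def changes_when_uppercased(ch):
    nfd = unicodedata.normalize("NFD", ch)
    return nfd.upper() != nfd

def changes_when_casemapped(ch):
    """True if lowercasing, uppercasing, or titlecasing changes the NFD form."""
    nfd = unicodedata.normalize("NFD", ch)
    return nfd.lower() != nfd or nfd.upper() != nfd or nfd.title() != nfd

for ch in ("A", "a", "\u00DF", "1"):
    print(repr(ch), changes_when_lowercased(ch),
          changes_when_uppercased(ch), changes_when_casemapped(ch))
```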

Contributory is considered to be a distinct status for a Unicode character property. Contributory properties are neither normative nor informative. The status of all character properties is listed in the property table in UAX #44, Unicode Character Database.

Two new joining groups, FARSI YEH and NYA, were added. These new joining groups may require an update to implementations of Arabic shaping rules.

There is a new data file in the Unicode Character Database, CJKRadicals.txt, which maps the radical numbers used in the Unicode Radical-Stroke Index to the actual Unicode code points for the corresponding radicals. Unlike other files, the first field is not a code point number.
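Because the first field is a radical number rather than a code point, a generic UCD line parser may need a special case for this file. The sketch below assumes three semicolon-separated fields per line (radical number, radical code point, unified ideograph code point) and that a radical number may carry a trailing apostrophe for a simplified form. The sample lines are illustrative and the function name is hypothetical.

```python
# Illustrative lines in the assumed CJKRadicals.txt layout.
SAMPLE = """\
# radical number; CJK radical; CJK unified ideograph
1; 2F00; 4E00
90'; 2EA6; 4E2C
"""

def parse_cjk_radicals(text):
    """Map radical number (kept as a string) to (radical, ideograph) code points."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        number, radical, ideograph = (field.strip() for field in line.split(";"))
        table[number] = (
            int(radical, 16) if radical else None,
            int(ideograph, 16) if ideograph else None,
        )
    return table

radicals = parse_cjk_radicals(SAMPLE)
print(radicals)
```

Keeping the radical number as a string preserves the apostrophe, which would be lost if the field were converted to an integer.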

The Unihan.txt file in Unihan.zip is split into 8 separate files within the zip file, organized by category. See UAX #38, Unicode Han Database (Unihan) for details.

G. Unicode Standard Annex Changes

In Version 5.2, many of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UAX, linked directly from the following list of UAXes.

UAX #9: Unicode Bidirectional Algorithm

Added Section 4.4 Bidi Conformance Testing

Added BN to Rule X6 (removing certain characters)

Clarified examples in Rule N1 (affecting characters next to EN or AN characters)

Added to HL6 the clause: Those with a resolved directionality of L and whose bidi class is R or AL.

UAX #11: East Asian Width

Updated the description of the property value for unassigned code points

UAX #14: Unicode Line Breaking Algorithm

Added class CP, reintroduced rule LB30, adjusted other rules for class CP.

In section 5.1, clarified that the lists of characters for each property contain representative characters, and are not necessarily complete.

Unassigned code points in CJK regions default to class ID.

UAX #15: Unicode Normalization Forms

Moved formal specification of NFC and NFKC into Chapter 3.

Added general introduction to the document itself.

UAX #24: Unicode Script Property

Updated short alias for Inherited from Qaai to Zinh.

Rewrote Section 3. Added a new subsection 3.4, to clarify the distinction between script designators and script property value aliases, their respective matching rules, and the use of underscores. Added a new subsection 3.5 to clarify ambiguity in the term script name.

UAX #29: Unicode Text Segmentation

Added characters that may be tailored to be in MidLetter.

Added section 4.2 Name Validation

Revised 6.3 Regular Expressions

Changed property of ZWSP to XX (Any) in 4.1 Default Word Boundary Specification

UAX #31: Unicode Identifier and Pattern Syntax

Updated the table, Candidate Characters for Inclusion in Identifiers. Updated the placement of various scripts in the two tables, Candidate Characters for Exclusion from Identifiers and Recommended Scripts [for Identifiers], and marked some of the recommended scripts as limited use. Added pointer to CLDR for information about scripts in limited use.

Added the following to Candidate Characters for Inclusion in Identifiers: U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG and U+30FB ( ・ ) KATAKANA MIDDLE DOT.

Added the following to Candidate Characters for Exclusion from Identifiers or Recommended Scripts: Default Ignorable Code Points, Tatweel (-like) characters, and scripts Old Turkic, Old South Arabian, Imperial Aramaic, Inscriptional Parthian, Inscriptional Pahlavi, Avestan, Egyptian Hieroglyphs, Samaritan, Kaithi, Lisu, Meetei Mayek, Tai Tham, Tai Viet, Javanese, Bamum.

UAX #34: Unicode Named Character Sequences

UAX #38: Unicode Han Database (Unihan)

Reclassified kDefinition, kHanyuPinlu, and kXHC1983 fields as Readings.

Documented revised structure of Unihan.zip.

Updated regular expressions of tags.

UAX #41: Common References for Unicode Standard Annexes

UAX #42: Unicode Character Database in XML

Added attributes for new properties and values.

Changed types of certain elements.

Updated the patterns for Unihan properties.

UAX #44: Unicode Character Database

Completely reorganized and rewritten, to include all the content from the obsoleted UCD.html.

Extensive new content added to account for all property changes and additions for Unicode 5.2.

Further clarifications were added regarding character properties, including the definition and contents of the Deprecated property, the nature of Contributory properties, the description of numeric properties, format issues for the Unihan database, constraints on property changes between releases, the description and exact values of defaults for property values, and many others.