Unicode 4.0.0

Version 4.0.0 has been superseded by the latest version of the Unicode Standard.

Version 4.0.0 of the Unicode Standard consists of the core specification, The Unicode Standard, Version 4.0, the delta and archival code charts for this version, the Unicode Standard Annexes, and the Unicode Character Database (UCD).

The core specification gives the general principles, requirements for conformance, and guidelines for implementers. The code charts show representative glyphs for all the Unicode characters. The Unicode Standard Annexes supply detailed normative information about particular aspects of the standard. The Unicode Character Database supplies normative and informative data for implementers to allow them to implement the Unicode Standard.

Version 4.0.0 of the Unicode Standard should be referenced as:

The Unicode Consortium. The Unicode Standard, Version 4.0.0, defined by: The Unicode Standard, Version 4.0 (Boston, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1)

A complete specification of the contributory files for Unicode 4.0.0 is found on the page Components for Version 4.0.0.

Online Edition

The text of The Unicode Standard, Version 4.0, as well as the delta and archival code charts, is available online via the navigation links on this page. These files may not be printed. The Unicode 4.0 Web Bookmarks page has links to all sections of the online text.

Overview

Unicode 4.0.0 is a major version of the Unicode Standard. The text of the standard has been extensively rewritten to improve its structure and clarity.

Major additions to Version 4.0 since Version 3.0 include:

major changes to the introductory and conformance chapters, and extensive revisions to the discussion of punctuation, symbols, and format characters

extensive additions of CJK characters to cover dictionaries and historic usage

many new symbols for mathematical and technical publication

many individual characters, such as currency symbols, added to other scripts, including Indic, Khmer, Latin, Greek, Arabic, and Syriac

substantially improved specification of conformance requirements, incorporating the character encoding model

encoding of supplementary characters

formalized policies for stability of the standard

clarification of semantics of special characters, including the byte order mark

major expansion of Unicode Character Database properties and of specifications for text boundaries and casing

more minority scripts, including Limbu, Tai Le, Osmanya, and Philippine scripts

more historic scripts, including Linear B, Cypriot, and Ugaritic

tightened definition of encoding terms, including UTF-32

substantial improvements to the script descriptions, particularly for Indic scripts and Khmer.

New Characters

1,226 new character assignments were made to the Unicode Standard, Version 4.0 (over and above what was in Unicode 3.2). These additions include currency symbols, additional Latin and Cyrillic characters, the Limbu and Tai Le scripts, Yijing Hexagram symbols, Khmer symbols, Linear B syllables and ideograms, Cypriot, Ugaritic, and a new block of variation selectors (especially for future CJK variants). Double diacritic characters were added for dictionary use.

These new characters extend the set of modern currency symbols, and represent a greater coverage of minority and historical scripts. The following table shows the allocation of code points in Unicode 4.0.0. For more information on the specific characters, see the file DerivedAge.txt in the Unicode Character Database.

Code point type    Count
Graphic            96,248
Format             134
Control            65
Private Use        137,468
Surrogate          2,048
Noncharacter       66
Reserved           878,083

The character repertoire corresponds to ISO/IEC 10646:2003. For more details of character counts, see Appendix D, Changes from Unicode Version 3.0.
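As a quick consistency check, the seven counts in the table above should together account for every code point in the codespace U+0000..U+10FFFF. The following minimal sketch simply sums the figures quoted on this page; the names `ALLOCATION_4_0_0` and `CODESPACE_SIZE` are illustrative and do not come from the standard.

```python
# Code point allocation in Unicode 4.0.0, copied from the table above.
ALLOCATION_4_0_0 = {
    "Graphic": 96_248,
    "Format": 134,
    "Control": 65,
    "Private Use": 137_468,
    "Surrogate": 2_048,
    "Noncharacter": 66,
    "Reserved": 878_083,
}

# The Unicode codespace runs from U+0000 to U+10FFFF inclusive.
CODESPACE_SIZE = 0x10FFFF + 1

total = sum(ALLOCATION_4_0_0.values())
print(f"sum of categories: {total:,}")
print(f"codespace size:    {CODESPACE_SIZE:,}")
print("consistent:", total == CODESPACE_SIZE)
```

The seven categories partition the codespace, so the sum equals 17 planes of 65,536 code points each.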

Unicode Character Database

Unicode Version 4.0.0 introduced the concept of provisional properties, clarified the relationships between properties, and provided precisely defined fallback properties for characters not explicitly defined in the data files. The documentation was coalesced into UCD.html, with a combined list of Properties.

Other property changes include:

Prefix Format Control. U+06DD ARABIC END OF AYAH and U+070F SYRIAC ABBREVIATION MARK were reclassified and have significantly different behavior as prefix format control characters. The new characters U+0600..U+0603 were given this behavior as well.

New Properties. The Hangul Syllable Type and identifier Other_ID_Start properties were added. The Unicode Radical Stroke property was classified as informative; all other Unihan properties were classified as provisional. PropertyValueAliases also adds block names.

Numeric Properties. CJK numeric values were added; the properties Decimal Number (Nd) and the Numeric Type decimal digit were aligned in value.

Default Ignorables. Hangul Filler characters, U+00AD SOFT HYPHEN, CGJ, and ZWS were added.

Soft Hyphen. U+00AD SOFT HYPHEN was also changed to General Category Cf. Its semantics were clarified: it marks a position for hyphenation, rather than being itself a hyphen character. (The Hyphen property itself was stabilized, and thus not changed to reflect this.)

Modifier Letters. The General Category of U+02B9..U+02BA and U+02C6..U+02CF was changed to Lm.

Grapheme_Extend. The halfwidth katakana marks, and most combining marks (except as needed for canonical equivalence) were removed.

Mongolian Vowel Separator. U+180E MONGOLIAN VOWEL SEPARATOR was changed to General Category Zs.

Deprecated Characters. Two Khmer characters, U+17A3 KHMER INDEPENDENT VOWEL QAQ and U+17D3 KHMER SIGN BATHAMASAT, were deprecated. Four others are strongly discouraged.

Enclosing combining marks. The scope has been defined more clearly.

ZWJ. The semantics with cursive scripts have been revised.

Normalization Corrections. There were corrections for the characters U+2F868, U+2F874, U+2F91F, U+2F95F, and U+2F9BF.

Note: these corrections are in accord with the Unicode Character Encoding Stability Policy.

For more information, see the file UCD.html in the Unicode Character Database.
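Some of the General Category changes listed above can be observed from Python, whose standard `unicodedata` module exposes both its current database and a frozen Unicode 3.2.0 database as `unicodedata.ucd_3_2_0`. The sketch below is only an illustration: the helper `category_then_and_now` is a made-up name, and the "now" column reflects whatever Unicode version the running interpreter ships, which is later than 4.0 and may include further changes made since (U+180E, for example, was reclassified again after 4.0).

```python
import unicodedata


def category_then_and_now(ch):
    """Return (General Category in Unicode 3.2.0, General Category in this Python's database)."""
    return unicodedata.ucd_3_2_0.category(ch), unicodedata.category(ch)


# Code points mentioned in the property changes above.
SAMPLES = (0x00AD, 0x02B9, 0x02C6, 0x06DD, 0x070F, 0x0600)

for cp in SAMPLES:
    old, new = category_then_and_now(chr(cp))
    print(f"U+{cp:04X}: 3.2.0={old}  current={new}")
```

Because the comparison baseline is 3.2.0 rather than 4.0.0 itself, a difference in the output shows only that a change happened at or after Version 4.0.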

Conformance

Chapter 3 was substantially improved by incorporating the Unicode Character Encoding Model, resulting in fully specified definitions and conformance requirements for UTF-8, UTF-16, and UTF-32. As part of this, the related concept of a Unicode string is defined: a sequence of code units for internal processing that is not necessarily well-formed in a Unicode encoding form.

Clearer terminology was introduced for code point assignments, including the seven main categories given in the table above. The conformance status of UAXes, UTSes and UTRs was also clarified. In addition:
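To make the distinction concrete, the following sketch encodes one supplementary character in each of the three encoding forms and then checks a sequence of 16-bit code units that contains an unpaired surrogate. It relies on Python's built-in codecs rather than on anything defined in Chapter 3, and `is_well_formed_utf16` is a hypothetical helper name.

```python
# U+10000 LINEAR B SYLLABLE B008 A, a supplementary character.
ch = "\U00010000"

utf8 = ch.encode("utf-8")       # four 8-bit code units: F0 90 80 80
utf16 = ch.encode("utf-16-be")  # one surrogate pair: D800 DC00
utf32 = ch.encode("utf-32-be")  # one 32-bit code unit: 00010000

print(utf8.hex(" "), utf16.hex(" "), utf32.hex(" "), sep="\n")


def is_well_formed_utf16(units: bytes) -> bool:
    """Report whether big-endian 16-bit code units form well-formed UTF-16."""
    try:
        units.decode("utf-16-be")
        return True
    except UnicodeDecodeError:
        return False


# A high surrogate followed by 'A': a legitimate sequence of code units
# for internal processing, but not well-formed UTF-16.
unpaired = b"\xd8\x00\x00\x41"
print(is_well_formed_utf16(utf16), is_well_formed_utf16(unpaired))
```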

Identifiers. A structure for ensuring backwards-compatible programming language identifiers was introduced using the new property Other_ID_Start. There is also an alternate definition for complete stability of identifiers.

Bidi. The bidi algorithm was updated and moved to UAX #9 (see below).

Line Breaking and Boundaries. U+00AD SOFT HYPHEN was reclassified. Text boundaries were clarified.

Case Folding. The text from UAX #21, “Case Mappings,” was incorporated and updated for case folding and other new properties. The definition of titlecase uses word boundaries, and there is a clearer definition of string functions:

isUpper(), isLower(), isTitle(), isFold()

toUpper(), toLower(), toTitle(), toFold()
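A rough way to experiment with these functions is to map them onto Python's string methods. This is a sketch under two assumptions that are not stated on this page: that each `isX()` predicate means "the string is unchanged by the corresponding `toX()`", and that Python's `upper()`, `lower()`, and `casefold()` are acceptable stand-ins for the standard's mappings. The titlecase pair is omitted because Python's `str.title()` does not use the word boundaries the standard refers to.

```python
def to_upper(s: str) -> str:
    return s.upper()


def to_lower(s: str) -> str:
    return s.lower()


def to_fold(s: str) -> str:
    return s.casefold()


# Assumed definition: a string "is X" when applying toX leaves it unchanged.
def is_upper(s: str) -> bool:
    return to_upper(s) == s


def is_lower(s: str) -> bool:
    return to_lower(s) == s


def is_fold(s: str) -> bool:
    return to_fold(s) == s


for sample in ("ABC 123", "straße", "Straße"):
    print(sample, is_upper(sample), is_lower(sample), is_fold(sample))
```

Under this definition a string with no cased characters counts as both upper and lower, which differs from Python's own `str.isupper()` and `str.islower()`.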

Unicode Standard Annexes

The following Unicode Standard Annex was added:

UAX #29: Text Boundaries

Now contains information on text boundary conditions formerly published in Chapter 5 of The Unicode Standard, Version 3.0.

Provides default definitions for grapheme cluster ('user character'), word, and sentence boundaries
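The idea of a grapheme cluster can be illustrated with a deliberately simplified sketch. It is not the UAX #29 rule set: it only attaches nonspacing and enclosing marks (General Category Mn and Me) to the preceding character and ignores Hangul syllables, CR LF, and every other rule in the annex. The function name `approximate_grapheme_clusters` is hypothetical.

```python
import unicodedata


def approximate_grapheme_clusters(text: str) -> list[str]:
    """Group each character with the combining marks (Mn, Me) that follow it.

    A crude approximation for illustration only, not the UAX #29 algorithm.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch  # extend the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


# 'e' + U+0301 COMBINING ACUTE ACCENT forms one user-perceived character.
print(approximate_grapheme_clusters("e\u0301a"))
```

A conforming implementation would follow the boundary rules and property data of the annex itself, for example through a library that implements UAX #29.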

The following Unicode Standard Annexes were updated:

UAX #9: The Bidirectional Algorithm

Now contains information on the bidirectional algorithm formerly published in Chapter 3 of The Unicode Standard, Version 3.0.

Canonical equivalence is now preserved (a data change, not an algorithm change)

Shaping is done after reordering, but not across directional boundaries

ZWJ, ZWNJ, and intermediate level processing were clarified

UAX #14: Line Breaking Properties

Negative numbers and dates with hyphens will not break across lines

Word-Joiner will link any characters (except hard line breaks)

The behavior of soft hyphen is clarified (it marks an opportunity for breaking, not a specific graphic appearance)

The rules for GL are relaxed: SP and ZW override GL

There are new property values: NL, WJ

UAX #15: Unicode Normalization Forms

There is a description of Stable Code Points, and the notation NFC(x) and isNFC(x)

Annex 12: Corrigenda was rewritten for clarity, and to describe the use of Normalization Corrections.

Annex 13: Canonical Equivalence was added
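The NFC(x) and isNFC(x) notation maps naturally onto Python's `unicodedata.normalize`. In the sketch below, `nfc` and `is_nfc` are illustrative wrapper names, and the results reflect the Unicode version bundled with the running interpreter rather than Version 4.0.0 specifically.

```python
import unicodedata


def nfc(x: str) -> str:
    """NFC(x): the Normalization Form C of x."""
    return unicodedata.normalize("NFC", x)


def is_nfc(x: str) -> bool:
    """isNFC(x): true when x is already in Normalization Form C."""
    return nfc(x) == x


decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = nfc(decomposed)

print([f"U+{ord(c):04X}" for c in composed])
print(is_nfc(decomposed), is_nfc(composed))
```

Python 3.8 and later also provide `unicodedata.is_normalized("NFC", x)`, which can answer the same question without always building the normalized string.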

UAX #11: East Asian Width

Extended the range for the default property value to 30000–3FFFD.

The following Unicode Technical Report was upgraded in status to a Unicode Standard Annex:

UAX #24: Script Names

Added notes on the stability of Q names, the usage of Mn, Me characters, and scripts with regard to spoofing.

Added Braille.

The following Standard Annexes were superseded as a result of their incorporation into the text of the Version 4.0.0 core specification:

UAX #13: Unicode Newline Guidelines

UAX #19: UTF-32

UAX #21: Case Mappings

UAX #27: Unicode 3.1

UAX #28: Unicode 3.2

Errata

Errata incorporated into Unicode 4.0 are listed by date in a separate table. For corrigenda and errata after the release of Unicode 4.0, see the list of current Updates and Errata.