Unicode 4.0.0
Version 4.0.0 has been superseded by the
latest version
of the Unicode Standard.
Version 4.0.0 of the Unicode Standard consists of the core
specification, The Unicode Standard,
Version 4.0, the delta and archival code charts for this version, the Unicode Standard Annexes,
and the Unicode Character Database (UCD).
The core specification gives the general principles,
requirements for conformance, and guidelines for implementers. The
code charts show representative glyphs for all the Unicode
characters. The Unicode Standard Annexes supply detailed normative
information about particular aspects of the standard. The Unicode
Character Database supplies normative and informative data for
implementers to allow them to implement the Unicode Standard.
Version 4.0.0 of the Unicode Standard should be referenced as:
The Unicode Consortium. The Unicode Standard, Version 4.0.0,
defined by: The Unicode Standard, Version 4.0 (Boston, MA,
Addison-Wesley, 2003. ISBN 0-321-18578-1)
A complete specification of the contributory files for Unicode
4.0.0 is found on the page
Components for Version 4.0.0.
Online Edition
The text of The Unicode Standard, Version 4.0, as well as the
delta and archival code charts, is available online via the navigation links
on this page. These files may not be printed. The
Unicode 4.0
Web Bookmarks page has links to all sections of the online text.
Overview
Unicode 4.0.0 is a
major version
of the Unicode Standard. The text of the standard has been
extensively rewritten to improve its structure and clarity.
Major additions to Version 4.0 since Version 3.0 include:
- major changes to the introductory and conformance chapters,
and extensive revisions to the discussion of punctuation, symbols,
and format characters
- extensive additions of CJK characters to cover dictionaries
and historic usage
- many new symbols for mathematical and technical publication
- many individual characters, such as currency symbols, added
to other scripts, including Indic, Khmer, Latin, Greek, Arabic,
and Syriac
- substantially improved specification of conformance
requirements, incorporating the character encoding model
- encoding of supplementary characters
- formalized policies for stability of the standard
- clarification of semantics of special characters, including
the byte order mark
- major expansion of Unicode Character Database properties and
of specifications for text boundaries and casing
- more minority scripts, including Limbu, Tai Le, Osmanya, and
Philippine scripts
- more historic scripts, including Linear B, Cypriot, and
Ugaritic
- tightened definition of encoding terms, including UTF-32
- substantial improvements to the script descriptions,
particularly for Indic scripts and Khmer.
New Characters
1,226 new character assignments were made to the Unicode
Standard, Version 4.0 (over and above what was in Unicode 3.2).
These additions include currency symbols, additional Latin and
Cyrillic characters, the Limbu and Tai Le scripts, Yijing Hexagram
symbols, Khmer symbols, Linear B syllables and ideograms, Cypriot,
Ugaritic, and a new block of variation selectors (especially for
future CJK variants). Double diacritic characters were added for
dictionary use.
These new characters extend the set of modern currency symbols,
and represent a greater coverage of minority and historical scripts.
The following table shows the allocation of code points in Unicode
4.0.0. For more information on the specific characters, see the file
DerivedAge.txt in the Unicode Character Database.
Graphic | 96,248
Format | 134
Control | 65
Private Use | 137,468
Surrogate | 2,048
Noncharacter | 66
Reserved | 878,083
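As a quick consistency check on the table above, the seven counts should add up to the full codespace, and four of the categories can be recomputed from their fixed code point ranges. The following is a minimal sketch in Python; the variable names are illustrative and not part of the standard.

```python
# Consistency check for the code point allocation table.
# Variable names here are illustrative assumptions, not Unicode terminology.

CODESPACE = 0x110000  # U+0000..U+10FFFF: 17 planes of 65,536 code points

# Counts copied from the allocation table for Unicode 4.0.0.
table = {
    "Graphic": 96_248,
    "Format": 134,
    "Control": 65,
    "Private Use": 137_468,
    "Surrogate": 2_048,
    "Noncharacter": 66,
    "Reserved": 878_083,
}

# Categories whose size follows directly from fixed ranges.
control = len(range(0x00, 0x20)) + len(range(0x7F, 0xA0))  # C0, DEL, and C1
surrogate = len(range(0xD800, 0xE000))                     # U+D800..U+DFFF
# BMP private use area plus planes 15 and 16 (U+nFFFE and U+nFFFF excluded).
private_use = len(range(0xE000, 0xF900)) + 2 * len(range(0x0000, 0xFFFE))
# U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.
noncharacter = len(range(0xFDD0, 0xFDF0)) + 2 * 17

total = sum(table.values())
print(f"total = {total:,} (codespace = {CODESPACE:,})")
```

The graphic, format, and reserved counts depend on the character data for this version, so they are taken from the table rather than derived.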
The character repertoire corresponds to ISO/IEC 10646:2003. For
more details of character counts, see
Appendix
D, Changes from Unicode Version 3.0.
Unicode Character Database
Unicode Version 4.0.0 introduced the concept of provisional
properties, clarified the relationships between properties, and
provided precisely defined fallback properties for characters not
explicitly defined in the data files. The documentation was
coalesced into
UCD.html, with a combined list of
Properties.
Other property changes include:
- Prefix Format Control. U+06DD ARABIC END OF AYAH and U+070F
SYRIAC ABBREVIATION MARK were reclassified and have significantly
different behavior as prefix format control characters. The new
characters U+0600..U+0603 were given this behavior as well.
- New Properties. The Hangul Syllable Type and identifier
Other_ID_Start properties were added. The Unicode Radical Stroke
property was classified as informative; all other Unihan
properties were classified as provisional. PropertyValueAliases
also adds block names.
- Numeric Properties. CJK numeric values were added; the
General Category value Decimal Number (Nd) and the Numeric Type
value decimal digit were aligned.
- Default Ignorables. Added the Hangul Filler characters,
U+00AD SOFT HYPHEN, CGJ, and ZWS.
- Soft Hyphen. U+00AD SOFT HYPHEN was also changed to General
Category Cf. Its semantics were clarified: it marks a position for
hyphenation, rather than being itself a hyphen character. (The
Hyphen property itself was stabilized, and thus not changed to
reflect this.)
- Modifier Letters. The General Category of U+02B9..U+02BA
and U+02C6..U+02CF was changed to Lm.
- Grapheme_Extend. The halfwidth katakana marks,
and most combining marks (except as needed for canonical
equivalence) were removed.
- Mongolian Vowel Separator. U+180E MONGOLIAN VOWEL SEPARATOR
was changed to General Category Zs.
- Deprecated Characters. Two Khmer characters, U+17A3 KHMER
INDEPENDENT VOWEL QAQ and U+17D3 KHMER SIGN BATHAMASAT, were
deprecated. Four others are strongly discouraged.
- Enclosing combining marks. The scope has been defined
more clearly.
- ZWJ. The semantics with cursive scripts has been
revised.
- Normalization Corrections. There were corrections for
characters U+2F868, U+2F874, U+2F91F, U+2F95F, and U+2F9BF.
For more information, see the file
UCD.html in the Unicode Character Database.
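General Category values such as those mentioned above can be inspected with any library that exposes the Unicode Character Database. The sketch below uses Python's standard unicodedata module. Note the assumption: the module reports data for the Unicode version bundled with the interpreter (see unicodedata.unidata_version), not for 4.0.0, so some values may differ from the 4.0.0 data files. The describe helper is a hypothetical name used only for illustration.

```python
import unicodedata

# Some of the code points named in the property changes above.
samples = [0x00AD, 0x02B9, 0x06DD, 0x070F, 0x180E, 0x17A3]

def describe(cp: int) -> str:
    """Format a code point with its character name and General Category."""
    ch = chr(cp)
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{cp:04X} {name}: {unicodedata.category(ch)}"

# The data shown is for this Unicode version, which is newer than 4.0.0.
print("UCD version in this interpreter:", unicodedata.unidata_version)
for cp in samples:
    print(describe(cp))
```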
Conformance
Chapter 3 was substantially improved by incorporating the Unicode
Character Encoding Model, resulting in fully specified definitions
and conformance requirements for UTF-8, UTF-16, and UTF-32. As part
of this, the related concept of a Unicode string is defined: a
sequence of code units for internal processing that is not
necessarily well-formed in any Unicode encoding form.
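To make the distinction concrete, the following sketch encodes one supplementary character in the three encoding forms and then shows a 16-bit code unit sequence that could occur in memory but is not well-formed UTF-16. It relies on Python's built-in codecs as a stand-in for the definitions in Chapter 3; the helper name is_well_formed_utf16 is an assumption made for this example.

```python
# U+10000 LINEAR B SYLLABLE B008 A, a supplementary character.
ch = "\U00010000"

utf8 = ch.encode("utf-8")       # four 8-bit code units
utf16 = ch.encode("utf-16-be")  # a surrogate pair: two 16-bit code units
utf32 = ch.encode("utf-32-be")  # one 32-bit code unit

print(utf8.hex(" "), "|", utf16.hex(" ", 2), "|", utf32.hex(" ", 4))

def is_well_formed_utf16(data: bytes) -> bool:
    """Return True if big-endian 16-bit code units decode as UTF-16."""
    try:
        data.decode("utf-16-be")
        return True
    except UnicodeDecodeError:
        return False

# A high surrogate followed by 'A': a possible code unit sequence in a
# string buffer, but not well-formed UTF-16.
lone = bytes.fromhex("d800 0041")
print(is_well_formed_utf16(utf16), is_well_formed_utf16(lone))
```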
Clearer terminology was introduced for code point assignments,
including the seven main categories given in the above table. The
conformance status of UAXes, UTSes and UTRs was also clarified. In
addition:
- Identifiers. A structure for ensuring
backwards-compatible programming language identifiers was
introduced using the new property Other_ID_Start. There is also an
alternate definition for complete stability of identifiers.
- Bidi. The bidi algorithm was updated and moved to UAX
#9 (see below).
- Line Breaking and Boundaries. U+00AD SOFT HYPHEN was
reclassified. Text boundaries were clarified.
- Case Folding. The text from UAX #21, “Case Mappings,”
was incorporated and updated for case folding and other new
properties. The definition of titlecase uses word boundaries, and
there is a clearer definition of string functions:
  - isUpper(), isLower(), isTitle(), isFold()
  - toUpper(), toLower(), toTitle(), toFold()
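The relationship between the two families of string functions can be sketched as follows. This is an illustration under stated assumptions, not the standard's definitions: each predicate is modeled as "the string is unchanged by the corresponding mapping", the mappings are delegated to Python's str methods (which follow the interpreter's bundled Unicode version), and str.title() uses a simple letter-run heuristic rather than the word boundaries referred to above.

```python
# Mappings, delegated to Python's built-in full case conversions.
def to_upper(s: str) -> str:
    return s.upper()

def to_lower(s: str) -> str:
    return s.lower()

def to_title(s: str) -> str:
    # Assumption: str.title() is only an approximation of titlecasing
    # based on word boundaries.
    return s.title()

def to_fold(s: str) -> str:
    return s.casefold()

# Predicates, modeled here as "already in that form".
def is_upper(s: str) -> bool:
    return to_upper(s) == s

def is_lower(s: str) -> bool:
    return to_lower(s) == s

def is_title(s: str) -> bool:
    return to_title(s) == s

def is_fold(s: str) -> bool:
    return to_fold(s) == s

for word in ("Straße", "STRASSE", "strasse"):
    print(word, to_upper(word), to_fold(word), is_upper(word), is_fold(word))
```

Case folding is what makes caseless comparison possible: two strings match caselessly when their folded forms are equal.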
Unicode Standard Annexes
The following Unicode Standard Annex was added:
- UAX #29: Text Boundaries
  - Now contains information on text boundary conditions
    formerly published in Chapter 5 of The Unicode Standard, Version
    3.0.
  - Provides default definitions for grapheme cluster ('user
    character'), word, and sentence boundaries
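The idea of a grapheme cluster can be illustrated with a deliberately simplified sketch. It is not the UAX #29 algorithm: it only keeps CR LF together and attaches characters of General Category Mn or Me to the preceding character, which is an assumption standing in for the Grapheme_Extend property, and it ignores Hangul syllable sequences entirely. The function name simple_clusters is hypothetical.

```python
import unicodedata

def simple_clusters(text: str) -> list[str]:
    """Split text into approximate grapheme clusters (simplified sketch)."""
    clusters: list[str] = []
    for ch in text:
        prev = clusters[-1] if clusters else ""
        if prev == "\r" and ch == "\n":
            # Keep CR LF together as one cluster.
            clusters[-1] += ch
        elif prev and prev[-1] not in "\r\n" and unicodedata.category(ch) in ("Mn", "Me"):
            # Attach nonspacing and enclosing marks to the preceding character.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# 'e' + U+0301 COMBINING ACUTE ACCENT forms one cluster; 'a' is another.
print(simple_clusters("e\u0301a"))
```

A production implementation would instead use the property data and rules published with the annex.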
The following Unicode Standard Annexes were updated:
- UAX #9: The Bidirectional Algorithm
  - Now contains information on the bidirectional algorithm
    formerly published in Chapter 3 of The Unicode Standard, Version
    3.0.
  - Canonical equivalence is now preserved (a data change, not
    an algorithm change)
  - Shaping is done after reordering, but not across directional
    boundaries
  - ZWJ, ZWNJ, and intermediate-level processing were clarified
- UAX #14: Line Breaking Properties
  - Negative numbers and dates with hyphens will not break
    across lines
  - Word-Joiner will link any characters (except hard line
    breaks)
  - The behavior of soft hyphen is clarified (it marks an
    opportunity for breaking, not a specific graphic appearance)
  - The rules for GL are relaxed: SP and ZW override GL
  - There are new property values: NL, WJ
- UAX #15: Unicode Normalization Forms
  - There is a description of Stable Code Points, and the
    notation NFC(x) and isNFC(x)
  - Annex 12: Corrigenda was rewritten for clarity, and to
    describe the use of Normalization Corrections.
  - Annex 13: Canonical Equivalence was added
- UAX #11: East Asian Width
  - Extended the range for the default property value to
    30000–3FFFD.
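The NFC(x) and isNFC(x) notation mentioned under UAX #15 can be illustrated with Python's standard unicodedata module. This is a minimal sketch: the function names nfc and is_nfc are chosen here to mirror the notation, and the normalization data comes from the interpreter's bundled Unicode version rather than 4.0.0.

```python
import unicodedata

def nfc(x: str) -> str:
    """NFC(x): the canonical composed form of x."""
    return unicodedata.normalize("NFC", x)

def is_nfc(x: str) -> bool:
    """isNFC(x): true when x is already in NFC, that is, NFC(x) == x."""
    return nfc(x) == x

decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = "\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE

print(is_nfc(decomposed), is_nfc(composed), nfc(decomposed) == composed)
```

Normalization is idempotent, so NFC(NFC(x)) equals NFC(x) for any string x.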
The following Unicode Technical Report was upgraded in status to
a Unicode Standard Annex:
- UAX #24: Script Names
  - Added notes on the stability of Q names, the usage of Mn, Me
    characters, and scripts with regard to spoofing.
  - Added Braille.
The following Standard Annexes were superseded as a result of
their incorporation into the text of the Version 4.0.0 core specification:
- UAX #13: Unicode Newline Guidelines
- UAX #19: UTF-32
- UAX #21: Case Mappings
- UAX #27: Unicode 3.1
- UAX #28: Unicode 3.2
Errata
Errata incorporated into Unicode 4.0 are listed by date in
a separate table. For corrigenda and errata after the release of Unicode 4.0, see the list of current
Updates and Errata.