Specifications
Q: How can I find out whether a particular issue is covered by a specification published by the Consortium, and where do I look it up?
The Unicode Standard and related standards contain a number of specifications or guidelines for dealing with different programming tasks. Sometimes it's hard to find these as they are not all provided as specific, dedicated documents.
The following table lists subject areas for which the Unicode Consortium provides specifications, with a location and a brief description of what each specification covers. Citations of chapters or section numbers refer to the core specification of the Unicode Standard.
General | |
---|---|
Character Properties: common properties such as Name, Alphabetic, Letter, White-Space, General Category, Default-Ignorable, plus those used in other specifications | Chapter 4 |
Character Properties for CJK Ideographs: property information specific to CJK ideographs and character properties (Unihan) | UAX #38 |
Character Properties for Egyptian Hieroglyphs: property information specific to Egyptian hieroglyphs (Unikemet) | UAX #57 |
Additional information for Cuneiform: references to additional data specific to the Sumero-Akkadian Cuneiform script | UTR #56 |
Unicode Character Database: general documentation about the UCD | UAX #44 |
UCD in XML: description of the XML representation of the UCD | UAX #42 |
Case Operations: conversion/detection of Upper/Lower/Titlecase, case folding, case matching. See also § 4.2 Case. | § 3.13 |
Characters with Unusual Properties: characters that implementers need to pay special attention to | § 4.12 |
Script Property: usage model for determining text runs in a given script | UAX #24 |
Unicode Support of Mathematics: guidelines for mathematical usage | UTR #25 |
Unicode Emoji: guidelines for the use and display of Unicode emoji characters | UTR #51 |
Unicode Named Character Sequences: specifies the syntax for named character sequences | UAX #34 |
Encodings | |
Unicode Encoding Forms: UTF-8, UTF-16, UTF-32 conversion and validation | § 3.9 |
Unicode Encoding Schemes: UTF-8, UTF-16 (BE/LE), UTF-32 (BE/LE) conversion and validation | § 3.10 |
Binary Order: UTF-8 order vs. UTF-16 order | § 5.17 |
Character Mapping Markup Language: mapping Unicode to and from legacy code pages | UTS #22 |
A Standard Compression Scheme for Unicode: how to compress Unicode to about the same size as legacy | UTS #6 |
UTF-EBCDIC: encapsulating Unicode on EBCDIC systems | UTR #16 |
Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8): a compatibility 8-bit encoding scheme | UTR #26 |
Variation Sequences: standardized, emoji and ideographic variation sequences | § 23.4 |
Ideographic Variation Database: repository of variation sequences for specified collections of Han glyphs | UTS #37 |
Comparison (Normalization, Collation) | |
Canonical Equivalence: when character sequences are equivalent; canonical ordering | § 3.11 |
Unicode Normalization Forms: how to normalize text for comparison, also § 3.11 definitions | UAX #15 |
Unicode Collation Algorithm: the default mechanism for comparing, searching, matching, and ordering Unicode text | UTS #10 |
Parsing | |
Hangul Syllables: boundaries, parsing, (de/)composition, names | § 3.12 |
Decimal Numbers: conversion and validation | § 5.5 |
Unicode Regular Expression Guidelines: the features required in supporting regular expressions with Unicode | UTS #18 |
Unicode Identifiers and Syntax: how to parse identifiers | UAX #31 |
Unicode Source Code Handling: guidance for programming language designers on handling security issues in Unicode program text | UTS #55 |
Language Information in Plain Text, also § 23.9 Tag Characters | § 5.10 |
Variation Selectors: use, validation | § 23.4 |
Ideographic Description Sequences: use, validation | § 18.2 |
Segmentation | |
Newline Guidelines: how to handle newline characters | § 5.8 |
Line Breaking Algorithm: the default way to determine where to linewrap | UAX #14 |
Text Segmentation: the default way to break text into grapheme clusters, words, and sentences | UAX #29 |
The Bidirectional Algorithm: required for display of Arabic and Hebrew text | UAX #9 |
Arabic Mark Rendering: sequence details for stable rendering of multiple marks | UAX #53 |
East Asian Width: the default determination of character width in East Asian contexts | UAX #11 |
Minimal shaping requirements for Arabic, Devanagari, Tamil, and other complex scripts | Chapters 9-15 |
Vertical orientation adjustments for characters | UTR #50 |
Locale Data | |
Locale Data Mark-up Language (LDML): used for interchange of locale data used for internationalization | UTS #35 |
Common Locale Data Repository (CLDR): a repository of LDML data for hundreds of locales | |
Identifiers and Security | |
Identifiers and Syntax: security issues for identifiers | UAX #31 |
Unicode Security Considerations: guidelines for recognizing Unicode security problems and dealing with them | UTR #36 |
Unicode Security Mechanisms: useful tools for detecting spoofs | UTS #39 |
Unicode IDNA Compatibility Processing: mapping for IDNA2008, and compatibility processing for IDNA2003 | UTS #46 |
Unicode Source Code Handling: guidance for programming language designers and programming environment developers to avoid security issues from improper handling of Unicode program text | UTS #55 |
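Several of the areas listed in the table are implemented by common standard libraries, which can help in getting a feel for what a given specification covers before reading it. The following is a minimal sketch, assuming Python's standard `unicodedata` module and `str.casefold`; the helper names `describe`, `canonically_equal`, and `caseless_equal` are hypothetical and not taken from any Unicode specification. The results depend on the version of the Unicode Character Database bundled with the Python interpreter.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few character properties (Chapter 4, UAX #44, UAX #11)."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "east_asian_width": unicodedata.east_asian_width(ch),
    }


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings under canonical equivalence (section 3.11, UAX #15).

    Both strings are normalized to NFD before a code point comparison.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def caseless_equal(a: str, b: str) -> bool:
    """Simplified caseless comparison via case folding (section 3.13).

    This sketch folds case only; it does not normalize the strings first.
    """
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    print(describe("\u20ac"))                          # properties of the euro sign
    print(canonically_equal("\u00e9", "e\u0301"))      # precomposed vs. decomposed e-acute
    print(caseless_equal("Stra\u00dfe", "STRASSE"))    # sharp s folds to "ss"
```

A sketch like this only touches the default behavior; the cited sections and annexes remain the place to look for conformance requirements and tailoring options.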
Q: Which Unicode specifications are normative?
Some Unicode specifications are normative and others are informative. In the core specification of the Unicode Standard, the material in Chapter 3, Conformance, and most of Chapter 4, Character Properties, is normative, while material in other sections is generally informative. The Unicode Standard Annexes (UAX) are formally a part of the Unicode Standard, and most of the material in them is normative, unless otherwise indicated in the annex itself. For Unicode Technical Standards (UTS), the specifications are normative parts of those independent standards. Unicode Technical Reports (UTR) contain informative material. For more information about UAXs, UTSes, and UTRs, see About Unicode Technical Reports.
Q: Where can I find the rationale behind a given specification?
Specifications published by the Unicode Consortium are created and amended by decision of the owning technical committee. These decisions are captured in the TC's minutes and are usually based on detailed proposal documents. For some specifications, the Consortium or outside websites have made an effort to organize this data so that it can be related back to specific text sections or encoded characters.
The following table lists sources of information on specific technical decisions or the rationale behind them.
Unicode Technical Committee | |
---|---|
Minutes and supporting documents | Register |
Minutes and supporting documents | Search |
Character Encoding | |
ScriptSource, information on scripts | Scripts Overview |
Unicode Status for each script (Example: Arabic) | Arabic |
Wikipedia, information on Unicode blocks | Category: Unicode Blocks |
History section for each Unicode block (Example: Arabic) | Arabic_(Unicode_block) |
Emoji Proposals | By proposal |
Emoji Proposals | By code point |
Algorithms | |
Linebreaking Algorithm | Annotated |
Q: Where can I find out when a character was encoded or a feature was added to a given specification?
For the core specification of the Unicode Standard and its Annexes, as well as for Technical Standards and Reports, a "Modifications" section highlights changes from the preceding version. Tracking these backwards shows when a particular change was introduced, but the granularity is not particularly fine, and there is no cross-reference to particular decisions and supporting documents. For encoded characters, the Unicode Character Database file DerivedAge.txt indicates the version in which a character was added to the standard. For some specifications, an annotated version provides more fine-grained documentation of the version and rationale for each change.
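As an illustration of the DerivedAge.txt lookup, here is a minimal sketch that parses lines in that file's `code point or range ; version # comment` layout and reports the version in which a character was added. The function names `parse_derived_age` and `age_of` are hypothetical, and the `SAMPLE` text is a small illustrative excerpt written in the file's format rather than the full file; for real use, the text would be read from a copy of DerivedAge.txt from the Unicode Character Database.

```python
import re
from typing import List, Optional, Tuple

# Illustrative excerpt in the DerivedAge.txt layout (not the complete file).
SAMPLE = """\
# Sample lines only; use the real DerivedAge.txt for actual lookups.
0000..01F5    ; 1.1 # [502] <control-0000>..LATIN SMALL LETTER G WITH ACUTE
20AC          ; 2.1 #       EURO SIGN
1F600         ; 6.1 #       GRINNING FACE
"""

# A single code point or a "first..last" range, then ";" and a version number.
LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\d+\.\d+)")


def parse_derived_age(text: str) -> List[Tuple[int, int, str]]:
    """Return (first, last, version) tuples for every data line in the text."""
    ranges = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        match = LINE.match(line)
        if match is None:
            continue  # skip anything that is not a data line
        first = int(match.group(1), 16)
        last = int(match.group(2) or match.group(1), 16)
        ranges.append((first, last, match.group(3)))
    return ranges


def age_of(ch: str, ranges: List[Tuple[int, int, str]]) -> Optional[str]:
    """Return the version in which ch was added, or None if it is not listed."""
    cp = ord(ch)
    for first, last, version in ranges:
        if first <= cp <= last:
            return version
    return None


if __name__ == "__main__":
    table = parse_derived_age(SAMPLE)
    for ch in ("A", "\u20ac", "\U0001F600"):
        print(f"U+{ord(ch):04X} added in {age_of(ch, table)}")
```

A linear scan keeps the sketch short; with the full file, sorting the ranges and using binary search would be a reasonable refinement.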