Version | 3.1 Proposed Draft |
Authors | Mark Davis |
Date | 2001-01-23 |
This Version | http://www.unicode.org/unicode/reports/tr19/tr19-7.html |
Previous Version | http://www.unicode.org/unicode/reports/tr19/tr19-6.1.html |
Latest Version | http://www.unicode.org/unicode/reports/tr19 |
Tracking Number | 7 |
This document specifies a Unicode transformation format that serializes a Unicode code point as a sequence of four bytes. It also provides a name that can be used to refer to the subset of ISO/IEC 10646 UCS-4 values that are valid Unicode code points, from U+0000 to U+10FFFF.
Proposed Draft: change the status to the following:
This document has been reviewed by Unicode members and other interested parties, and has been approved by the Unicode Technical Committee as a Unicode Standard Annex. It is a stable document and may be used as reference material or cited as a normative reference from another document.
A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, carrying the same version number, but is published as a separate document. Note that conformance to a version of the Unicode Standard includes conformance to its Unicode Standard Annexes.
A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/. For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.
Please mail corrigenda and other comments to the author(s).
UTF-32 defines an encoding form that represents each Unicode code point as a single 32-bit code unit. With the addition of UTF-32, the Unicode Standard now has three sanctioned encoding forms: UTF-8, UTF-16, and UTF-32. These use 8-bit, 16-bit, and 32-bit code units, respectively.
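As an illustration of the three code unit sizes, the following minimal Python sketch counts how many code units one code point needs in each encoding form. The helper name `code_unit_counts` and the sample characters are assumptions made for this example only; they are not part of this document.

```python
def code_unit_counts(ch: str) -> dict:
    """Count the code units needed for one code point in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit code units
    }


# Hypothetical samples: an ASCII letter, a BMP character, a supplementary character.
for ch in ("A", "\u20ac", "\U00010000"):
    print(f"U+{ord(ch):04X}", code_unit_counts(ch))
```

Only the UTF-32 count stays at one code unit for every code point in the sample.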
Different encoding forms of Unicode are useful in different system environments. For example, UTF-32 is somewhat simpler to use than UTF-16, but in almost all cases occupies twice the storage. A common strategy is to use UTF-16 or UTF-8 for internal string storage, but UTF-32 for individual character datatypes. Note that UTF-32 code units do not necessarily match user expectations for "characters", which are better matched by grapheme boundaries, as explained in Chapter 5 of the Unicode Standard.
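The storage trade-off and the gap between code points and user-perceived characters can be checked with a short Python sketch. The function `storage_bytes` and the sample strings are hypothetical names chosen for this illustration.

```python
def storage_bytes(text: str) -> tuple:
    """Return (UTF-16 bytes, UTF-32 bytes) for text, without a byte order mark."""
    return len(text.encode("utf-16-le")), len(text.encode("utf-32-le"))


bmp_text = "Unicode"          # every character lies in the Basic Multilingual Plane
supplementary = "\U00010300"  # a single supplementary code point

# For BMP-only text, UTF-32 takes twice the storage of UTF-16.
print(storage_bytes(bmp_text))
# For a supplementary code point, a surrogate pair and one UTF-32 unit are the same size.
print(storage_bytes(supplementary))

# "e" followed by a combining acute accent: one user-perceived character,
# but two code points, and therefore two UTF-32 code units.
e_acute = "e\u0301"
print(len(e_acute))
```

The last case is why a 32-bit character datatype still does not correspond to what users count as one character.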
As with UTF-16, considerations of byte-order serialization lead to a further subdivision of this encoding form into three encoding schemes: UTF-32 (possibly using a BOM), UTF-32BE, and UTF-32LE.
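The difference between the three encoding schemes is only in the byte order and in the optional byte order mark. A minimal sketch in Python follows; the function `serialize` and its parameters are assumptions for illustration, not names defined by this document.

```python
import struct

BOM = 0xFEFF  # U+FEFF, used as the byte order mark in the UTF-32 scheme


def serialize(code_points, big_endian=True, bom=False):
    """Serialize code points as four-byte units in the requested byte order."""
    fmt = ">I" if big_endian else "<I"
    units = ([BOM] if bom else []) + list(code_points)
    return b"".join(struct.pack(fmt, cp) for cp in units)


cps = [0x41, 0x10000]
utf32be = serialize(cps, big_endian=True)              # UTF-32BE scheme
utf32le = serialize(cps, big_endian=False)             # UTF-32LE scheme
utf32_bom = serialize(cps, big_endian=True, bom=True)  # UTF-32 scheme with a BOM

print(utf32be.hex())
print(utf32le.hex())
print(utf32_bom.hex())
```

The first two results match what Python's built-in `utf-32-be` and `utf-32-le` codecs produce for the same text.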
The following lists the important features of this encoding form:
See also UTR #17, Character Encoding Model [CharMod], for a discussion of the general framework for understanding the Unicode character encoding and its relationship to the Unicode Transformation Formats. See the Unicode Glossary [Glossary] for explanations of terminology.
The following definitions specify the UTF-32 Transformation Formats. Note that they rely on the conformance modifications introduced in Unicode 3.1 [U3.1].
D36a | (a) UTF-32BE is the Unicode Transformation Format that serializes a Unicode code point as a sequence of four bytes, in big-endian format. An initial sequence corresponding to U+FEFF is interpreted as a zero width no-break space.
(b) An illegal UTF-32BE code unit sequence is any byte sequence that would correspond to a numeric value outside of the range 0 to 10FFFF₁₆.
(c) An irregular UTF-32BE code unit sequence is an eight-byte sequence where the first four bytes correspond to a high surrogate, and the next four bytes correspond to a low surrogate. As a consequence of C12, these irregular UTF-32BE sequences shall not be generated by a conformant process.
|
D36b | (a) UTF-32LE is the Unicode Transformation Format that serializes a Unicode code point as a sequence of four bytes, in little-endian format. An initial sequence corresponding to U+FEFF is interpreted as a zero width no-break space.
(b) An illegal UTF-32LE code unit sequence is any byte sequence that would correspond to a numeric value outside of the range 0 to 10FFFF₁₆.
(c) An irregular UTF-32LE code unit sequence is an eight-byte sequence where the first four bytes correspond to a high surrogate, and the next four bytes correspond to a low surrogate. As a consequence of C12, these irregular UTF-32LE sequences shall not be generated by a conformant process.
|
D36c | (a) UTF-32 is the Unicode Transformation Format that serializes a Unicode code point as a sequence of four bytes, in either big-endian or little-endian format. An initial sequence corresponding to U+FEFF is interpreted as a byte order mark: it is used to distinguish between the two byte orders. The byte order mark is not considered part of the content of the text. A serialization of Unicode code points into UTF-32 may or may not begin with a byte order mark.
(b) An illegal UTF-32 code unit sequence is any byte sequence that would correspond to a numeric value outside of the range 0 to 10FFFF₁₆.
(c) An irregular UTF-32 code unit sequence is an eight-byte sequence where the first four bytes correspond to a high surrogate, and the next four bytes correspond to a low surrogate. As a consequence of C12, these irregular UTF-32 sequences shall not be generated by a conformant process.
|
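The definitions D36a through D36c can be sketched as a small decoder in Python. This is an illustrative sketch, not a conformance test: the names `decode_utf32` and `has_irregular_sequence` are hypothetical, and the choice to read BOM-less data in the UTF-32 scheme as big-endian is an assumption of this example that the definitions above do not state.

```python
import struct

MAX_CODE_POINT = 0x10FFFF


def decode_utf32(data: bytes, scheme: str = "UTF-32") -> list:
    """Decode bytes in the UTF-32, UTF-32BE, or UTF-32LE scheme into code points."""
    if len(data) % 4:
        raise ValueError("length is not a multiple of four bytes")
    if scheme == "UTF-32BE":
        fmt, start = ">I", 0  # an initial U+FEFF stays in the content
    elif scheme == "UTF-32LE":
        fmt, start = "<I", 0  # an initial U+FEFF stays in the content
    elif scheme == "UTF-32":
        # A leading U+FEFF is a byte order mark and is not part of the content.
        if data[:4] == b"\x00\x00\xfe\xff":
            fmt, start = ">I", 4
        elif data[:4] == b"\xff\xfe\x00\x00":
            fmt, start = "<I", 4
        else:
            fmt, start = ">I", 0  # assumption: big-endian when no BOM is present
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    code_points = [struct.unpack_from(fmt, data, i)[0]
                   for i in range(start, len(data), 4)]
    for cp in code_points:
        if cp > MAX_CODE_POINT:  # clause (b): value outside 0..10FFFF
            raise ValueError(f"illegal code unit: {cp:#x}")
    return code_points


def has_irregular_sequence(code_points) -> bool:
    """Clause (c): a high surrogate immediately followed by a low surrogate."""
    return any(0xD800 <= a <= 0xDBFF and 0xDC00 <= b <= 0xDFFF
               for a, b in zip(code_points, code_points[1:]))
```

With this sketch, the same eight bytes `00 00 FE FF 00 00 00 41` decode to one code point under the UTF-32 scheme, where the first four bytes are consumed as a byte order mark, and to two code points under UTF-32BE, where U+FEFF is kept as a zero width no-break space.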
ISO/IEC 10646 defines a 4-byte encoding form called UCS-4. Since UTF-32 is simply a subset of UCS-4 characters, it is conformant to ISO/IEC 10646 as well as to the Unicode Standard.
As of the recent publication of the second edition of ISO/IEC 10646-1, UCS-4 still assigns private use codepoints (E00000₁₆..FFFFFF₁₆ and 60000000₁₆..7FFFFFFF₁₆) that are not in the range of valid Unicode codepoints. To promote interoperability among the Unicode encoding forms JTC1/SC2/WG2 has approved a motion removing those private use assignments:
Resolution M38.6 (Restriction of encoding space) [adopted unanimously]
"WG2 accepts the proposal in document N2175 towards removing the provision for Private Use Groups and Planes beyond Plane 16 in ISO/IEC 10646, to ensure internal consistency in the standard between UCS-4, UTF-8 and UTF-16 encoding formats, and instructs its project editor [to] prepare suitable text for processing as a future Technical Corrigendum or an Amendment to 10646-1:2000."
While this resolution must still be turned into a Technical Corrigendum or an Amendment to 10646-1:2000, the Unicode Technical Committee has every expectation that once the text for that Technical Corrigendum or Amendment starts its formal balloting it will proceed smoothly to formal approval and publication as part of that standard.
Until the formal balloting is concluded, the term UTF-32 can be used to refer to the subset of UCS-4 characters that are in the range of valid Unicode code points. After it passes, UTF-32 will then simply be an alias for UCS-4 (with the extra requirement that Unicode semantics are observed).
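The relationship between UTF-32 and UCS-4 described in this section amounts to a range check. A minimal Python sketch follows; both function names are hypothetical, and the private use ranges are the ones quoted above from ISO/IEC 10646-1.

```python
def is_utf32_value(ucs4: int) -> bool:
    """True if a UCS-4 value lies in the range of Unicode code points."""
    return 0 <= ucs4 <= 0x10FFFF


def is_ucs4_private_use_beyond_plane_16(ucs4: int) -> bool:
    """True for the UCS-4 private use ranges that lie outside the Unicode range."""
    return 0xE00000 <= ucs4 <= 0xFFFFFF or 0x60000000 <= ucs4 <= 0x7FFFFFFF


# Hypothetical sample values on both sides of each boundary.
for value in (0x10FFFF, 0x110000, 0xE00000, 0x60000000):
    print(f"{value:#x}", is_utf32_value(value),
          is_ucs4_private_use_beyond_plane_16(value))
```

No value in either private use range passes `is_utf32_value`, which is the interoperability gap that Resolution M38.6 is intended to close.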
[FAQ] | Unicode Frequently Asked Questions http://www.unicode.org/unicode/faq/ |
[CharMod] | Unicode Technical Report #17: Character Encoding Model http://www.unicode.org/unicode/reports/tr17/ |
[U3.1] | Unicode Standard Annex #27: Unicode 3.1 http://www.unicode.org/unicode/reports/tr27/ |
[Glossary] | Unicode Glossary http://www.unicode.org/glossary/ |
The following summarizes modifications from the previous version of this document.
7 |
|
Copyright © 1999-2001 Unicode, Inc. All Rights Reserved.
The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.