L2/01-328R
Version | Unicode 3.1.1
Authors | Toby Phipps ([email protected])
Date | 2001-08-16
This Version | URL to be provided
Previous Version | URL to be provided
Latest Version | URL to be provided
Tracking Number | 1
This document specifies an 8-bit UTF-16 Compatibility Encoding
Scheme (UCES) that is intended as an alternate encoding for internal use within
systems processing Unicode in order to provide backward compatibility with
early UTF-8 and UTF-FSS implementations.
It is neither intended nor recommended as an encoding form for open
information exchange. As a UTF-16
encoding scheme, UCES-8 provides a UTF-8-like, ASCII-compatible 8-bit encoding
that preserves UTF-16 binary collation.
This document has been approved by the Unicode Technical Committee
for public review as a Proposed Draft Unicode Technical Report.
Publication does not imply endorsement by the Unicode Consortium. This is a
draft document which may be updated, replaced, or superseded by other documents
at any time. This is not a stable document; it is inappropriate to cite this
document as other than a work in progress.
A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/.
For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.
Please mail corrigenda and other comments to the author(s).
UCES-8 defines an encoding scheme for Unicode identical to UTF-8
except for its representation of supplementary characters. In UCES-8, supplementary characters are
represented as six-byte sequences resulting from the transformation of each
UTF-16 surrogate code unit into an eight-bit form similar to the UTF-8
transformation, but without first converting the input character to a scalar
value.
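The transformation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a normative algorithm: the function name `uces8_encode` is hypothetical, and the input is assumed to be well-formed text with no unpaired surrogates.

```python
def uces8_encode(text: str) -> bytes:
    """Sketch of a UCES-8 encoder (hypothetical helper, not from this report).

    Each UTF-16 code unit is transformed on its own, so a surrogate pair
    becomes two three-byte sequences (six bytes in total).
    """
    out = bytearray()
    # Serialize to UTF-16 code units first (big-endian, no byte order mark).
    utf16 = text.encode("utf-16-be")
    for i in range(0, len(utf16), 2):
        unit = (utf16[i] << 8) | utf16[i + 1]
        if unit < 0x80:
            # One byte: identical to ASCII.
            out.append(unit)
        elif unit < 0x800:
            # Two bytes: same bit layout as UTF-8.
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:
            # Three bytes. Surrogate code units (D800..DFFF) are handled
            # here as well, without being combined into a scalar value.
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)
```

For example, U+10400 is the surrogate pair D801 DC00 in UTF-16, so this sketch yields the six bytes ED A0 81 ED B0 80, whereas UTF-8 yields the four bytes F0 90 90 80. Text containing only BMP characters encodes to the same bytes in both schemes.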
Different Unicode transformations and encoding schemes are useful
in different system environments. For example, UCES-8 is useful in 8-bit
processing environments where binary collation with UTF-16 is required. It is designed and recommended for use only
within products requiring this UTF-16 binary collation equivalence and for
backward compatibility with early implementations of UTF-8 based on its
definition in Unicode 1.1. It is neither intended nor recommended for open
communication, such as the encoding of HTML or XML documents, or for open
transmission over the Internet.
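The binary collation property can be illustrated with a small Python comparison. The sketch below is an assumption-laden example: the helper `uces8_encode` is hypothetical and relies on a Python-specific shortcut (the `surrogatepass` error handler of the built-in `utf-8` codec), and the two sample characters, U+FF5E and U+10400, were chosen only because they sort differently in UTF-8 and UTF-16.

```python
def uces8_encode(text: str) -> bytes:
    """Hypothetical UCES-8 encoder built on Python's codecs (a shortcut)."""
    utf16 = text.encode("utf-16-be")
    # Split the text into individual 16-bit code units.
    units = [int.from_bytes(utf16[i:i + 2], "big")
             for i in range(0, len(utf16), 2)]
    # Encode every code unit separately; "surrogatepass" lets the codec
    # emit the three-byte form for surrogate code units.
    return "".join(map(chr, units)).encode("utf-8", "surrogatepass")


samples = ["\uff5e", "\U00010400"]

# Byte-wise order of big-endian UTF-16 equals code unit order.
by_utf16 = sorted(samples, key=lambda s: s.encode("utf-16-be"))
by_utf8 = sorted(samples, key=lambda s: s.encode("utf-8"))
by_uces8 = sorted(samples, key=uces8_encode)

print(by_uces8 == by_utf16)  # UCES-8 order matches UTF-16 order
print(by_utf8 == by_utf16)   # UTF-8 order does not, for this pair
```

In UTF-16 the supplementary character begins with the code unit D801, which is less than FF5E, so it sorts first. In UTF-8 its lead byte F0 is greater than the lead byte EF of U+FF5E, so the order is reversed. In UCES-8 its lead byte is ED, which restores the UTF-16 order.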
The following lists the important features of this encoding form:
Because supplementary characters are expected to make up only a very small
percentage of the characters in a typical data stream, there is a strong
possibility that UCES-8 data may be misinterpreted as UTF-8.
Therefore, all use of UCES-8 outside closed implementations, such as the
emission of UCES-8 in output files, markup languages or other open
transmission forms, is strongly discouraged.
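The risk of misinterpretation can be made concrete with a short Python sketch. It is a heuristic under stated assumptions, not a reliable detector: the function name `contains_uces8_pair` is hypothetical, and the check only finds data that actually contains an encoded surrogate pair. Data without supplementary characters is byte-for-byte identical in UCES-8 and UTF-8, so no inspection can tell the two apart.

```python
import re

# A high surrogate (D800..DBFF) in three-byte form is ED A0..AF 80..BF;
# a low surrogate (DC00..DFFF) is ED B0..BF 80..BF.
_ENCODED_SURROGATE_PAIR = re.compile(
    rb"\xed[\xa0-\xaf][\x80-\xbf]\xed[\xb0-\xbf][\x80-\xbf]"
)


def contains_uces8_pair(data: bytes) -> bool:
    """Return True if data holds a six-byte encoded surrogate pair."""
    return _ENCODED_SURROGATE_PAIR.search(data) is not None


uces8_sample = bytes.fromhex("41eda081edb080")   # "A" + U+10400 in UCES-8
utf8_sample = "A\U00010400".encode("utf-8")      # the same text in UTF-8
bmp_only = "A\u00e9\u20ac".encode("utf-8")       # identical in both schemes

print(contains_uces8_pair(uces8_sample))
print(contains_uces8_pair(utf8_sample))
print(contains_uces8_pair(bmp_only))
```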
The following definitions specify the UCES-8 encoding scheme. Note that they
rely on the conformance modifications introduced in Unicode 3.1 [U3.1]. UCES-8 is not a
normative part of The Unicode Standard, and therefore the definitions below do
not form part of the standard. Instead,
they are encapsulated in this Unicode Technical Report as an
implementation-specific transformation form for use by implementors of The
Unicode Standard.
2.1 | (a) UCES-8 is a Unicode Compatibility Encoding Scheme (UCES) that serializes a Unicode code point as a sequence of one, two, three or six bytes.
2.2 | (a) UCES-8 Bit Distribution
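As an informal companion to the definitions above, the following Python sketch shows the sequence length for a code point and the bit pattern produced for a single 16-bit code unit. The names `uces8_length` and `unit_bits` are hypothetical, and the use of Python's `surrogatepass` error handler is a convenience assumed for this illustration only.

```python
def uces8_length(code_point: int) -> int:
    """Number of bytes UCES-8 uses for one code point: 1, 2, 3 or 6."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    # Supplementary code points become two surrogates of three bytes each.
    return 6


def unit_bits(unit: int) -> str:
    """Show the bytes for one 16-bit code unit as groups of eight bits."""
    # "surrogatepass" lets Python's codec emit the three-byte form for
    # surrogate code units instead of raising an error.
    encoded = chr(unit).encode("utf-8", "surrogatepass")
    return " ".join(f"{byte:08b}" for byte in encoded)


print(unit_bits(0x0041))  # one byte:    0xxxxxxx
print(unit_bits(0x00E9))  # two bytes:   110xxxxx 10xxxxxx
print(unit_bits(0xD801))  # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
```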
ISO/IEC 10646 and The Unicode Standard define the UTF-8 encoding
form, which is very similar in definition to UCES-8 other than its treatment of
supplementary characters. UCES-8 is an
additional encoding scheme that supplements these definitions, but does not
form part of either ISO/IEC 10646 or The Unicode Standard. It is intended only for use in compatibility
situations where binary collation with UTF-16 is required, or for
implementations of UTF-FSS that predate the
introduction of surrogate code units.
Prior to August 2001, UTF-8 defined six-byte SMP character
representations as “irregular” sequences, and prohibited their emission by
conformant UTF-8 generators, but did not prohibit their interpretation as SMP
characters by a conformant UTF-8 reader.
At the Unicode Technical Committee meeting #88, these “irregular” sequences
were determined to be “illegal” sequences, and the definition of UTF-8 in The
Unicode Standard was updated to reflect this new status. The definition of UTF-8 in ISO/IEC 10646 already defines these
six-byte sequences as illegal. This
provides a clear distinction between the UTF-8 and UCES-8 encodings, and allows for
no overlap between their definitions.
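The distinction can be observed with Python's built-in codecs, whose strict `utf-8` decoder rejects encoded surrogates. The sketch below is illustrative only: `uces8_decode` is a hypothetical helper, and it is deliberately lenient, since it would also accept four-byte UTF-8 sequences that a strict UCES-8 reader would presumably reject.

```python
def uces8_decode(data: bytes) -> str:
    """Lenient sketch of a UCES-8 decoder (hypothetical helper)."""
    # Step 1: decode each three-byte surrogate form into a lone surrogate.
    units = data.decode("utf-8", "surrogatepass")
    # Step 2: round-trip through UTF-16 so that adjacent high and low
    # surrogates are paired back into one supplementary character.
    return units.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


sample = bytes.fromhex("eda081edb080")  # U+10400 as six bytes

# A strict UTF-8 decoder treats the six-byte form as an error.
try:
    sample.decode("utf-8")
    accepted_by_strict_utf8 = True
except UnicodeDecodeError:
    accepted_by_strict_utf8 = False

print(accepted_by_strict_utf8)
print(uces8_decode(sample) == "\U00010400")
```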
UTC Resolution Mxx.xx
(Title of Motion)
"[Text of motion to be included here upon its completion]"
The changes to The Unicode Standard resulting from this resolution
are expected to be incorporated in The Unicode Standard version 3.2.
Note: UCES-8 was originally proposed with the name UTF-8S, but was
renamed UCES-8 by recommendation from the Unicode Technical Committee to avoid
possible confusion with UTF-8.
[CharMod] | Unicode Technical Report #17: Character Encoding Model
[Reports] | Unicode Technical Reports
[U3.1] | Unicode Standard Annex #27: Unicode 3.1
[Versions] | Versions of the Unicode Standard
The following summarizes modifications from the previous version
of this document.
1 |
|
Copyright © 1999-2001 Unicode,
Inc. All Rights Reserved.
The Unicode Consortium makes no expressed
or implied warranty of any kind, and assumes no liability for errors or
omissions. No liability is assumed for incidental and consequential damages in
connection with or arising out of the use of the information or programs
contained or accompanying this technical report.
Unicode and the Unicode logo are
trademarks of Unicode, Inc., and are registered in some jurisdictions.