L2/01-328R

Proposed Draft Unicode Technical Report #XX

UTF-16 COMPATIBILITY ENCODING SCHEME – 8-BIT
(UCES-8)

Version	Unicode 3.1.1
Authors	Toby Phipps ([email protected])
Date	2001-08-16
This Version	URL to be provided
Previous Version	URL to be provided
Latest Version	URL to be provided
Tracking Number	1

Summary

This document specifies an 8-bit UTF-16 Compatibility Encoding Scheme (UCES) that is intended as an alternate encoding for internal use within systems processing Unicode, in order to provide backward compatibility with early UTF-8 and UTF-FSS implementations. It is neither intended nor recommended as an encoding form for open information exchange. As a UTF-16 encoding scheme, UCES-8 provides a UTF-8-like, ASCII-compatible 8-bit encoding that preserves UTF-16 binary collation.

Status

This document has been approved by the Unicode Technical Committee for public review as a Proposed Draft Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/. For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.

Please mail corrigenda and other comments to the author(s).

1 Introduction
2 Definitions
3 Relation to ISO/IEC 10646 and UTF-8
References
Modifications

1 Introduction

UCES-8 defines an encoding scheme for Unicode that is identical to UTF-8 except for its representation of supplementary characters. In UCES-8, supplementary characters are represented as six-byte sequences resulting from the transformation of each UTF-16 surrogate code unit into an eight-bit form similar to the UTF-8 transformation, but without first converting the input surrogate pair to a scalar value.

Different Unicode transformations and encoding schemes are useful in different system environments. For example, UCES-8 is useful in 8-bit processing environments where binary collation with UTF-16 is required. It is designed and recommended for use only within products requiring this UTF-16 binary collation equivalence and for backward compatibility with early implementations of UTF-8 based on its definition in Unicode 1.1. It is neither intended nor recommended for open communication, such as the encoding of HTML or XML documents, or for open transmission over the Internet.

The following lists the important features of this encoding form:

The UCES-8 representation of characters on the Basic Multilingual Plane (BMP) is identical to the representation of these characters in UTF-8. Only the representation of supplementary characters differs.
Only the six-byte form of supplementary characters is legal in UCES-8; the four-byte UTF-8 style supplementary character sequence is illegal.
When supplementary characters are present, a data stream can be unequivocally determined as being encoded in UTF-8 or UCES-8 based on the representation of these supplementary characters. When encoding information is not present and encoding autodetection is attempted, if the data stream consists of well-formed UTF-8 and does not contain supplementary characters, it should always be detected as UTF-8, not UCES-8 (even though the two encodings are identical when supplementary characters are not present).
A binary collation of data encoded in UCES-8 is identical to the collation of the same data encoded in UTF-16.
The definition of UCES-8 matches that of UTF-FSS as defined by The Unicode Standard version 1.1, except that the source character repertoire for UCES-8 is limited to 21-bit scalar Unicode values, as opposed to the theoretical 31-bit repertoire defined by UTF-FSS.
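
The collation property listed above can be illustrated with a minimal Python sketch. The helper name `to_uces8` and the sample strings are illustrative assumptions and do not appear in this report; the helper leans on Python's `surrogatepass` error handler as a shortcut for applying the UTF-8 bit layout to individual surrogate code units.

```python
def to_uces8(text: str) -> bytes:
    """Hypothetical helper: encode text by giving each UTF-16 code unit
    the UTF-8 bit layout (the six-byte form for supplementary characters)."""
    utf16 = text.encode("utf-16-be")
    units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
    # "surrogatepass" lets Python emit the three-byte form for a surrogate code unit.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


# One ASCII, one high BMP, and one supplementary character.
samples = ["\uFF5E", "\U00010000", "A"]

# Big-endian UTF-16 bytes compare in the same order as UTF-16 code units.
by_utf16 = sorted(samples, key=lambda s: s.encode("utf-16-be"))
by_utf8 = sorted(samples, key=lambda s: s.encode("utf-8"))
by_uces8 = sorted(samples, key=to_uces8)

print(by_uces8 == by_utf16)  # True
print(by_utf8 == by_utf16)   # False
```

In UTF-8 the supplementary character U+10000 sorts after U+FF5E because its lead byte is F0, whereas in UTF-16 and UCES-8 it sorts before U+FF5E because the surrogate range D800..DFFF lies below FF5E.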

Because supplementary characters are expected to make up only a very small percentage of the characters in a typical data stream, there is a strong possibility that UCES-8 data may be misinterpreted as UTF-8. Therefore, all use of UCES-8 outside closed implementations, such as the emission of UCES-8 in output files, markup languages or other open transmission forms, is strongly discouraged.

2 Definitions

The following define the UCES-8 encoding scheme. Note that these rely on the conformance modifications introduced in Unicode 3.1 [U3.1]. UCES-8 is not a normative part of The Unicode Standard, and therefore the definitions below do not form part of the standard. Instead, they are encapsulated in this Unicode Technical Report as an implementation-specific transformation form for use by implementors of The Unicode Standard.

2.1

(a) UCES-8 is a Unicode Compatibility Encoding Scheme (UCES) that serializes a Unicode code point as a sequence of one, two, three or six bytes.
(b) Prior to transforming data into UCES-8, supplementary characters must first be converted to their surrogate pair UTF-16 representation. For example, U+F0000 must first be converted to U+DB80 U+DC00.
(c) The resulting data stream is encoded into an eight-bit form using the bit distribution table in definition 2.2. Note that this bit distribution table is identical to that of UTF-8, except that the input value is a sequence of UTF-16 code units, not a scalar value, and that the four-byte transformation is disallowed.
(d) The bit pattern 11110xxx is illegal in any UCES-8 byte, effectively prohibiting the occurrence of UTF-8 four-byte supplementary character sequences in UCES-8. Thus, a data stream may not contain both UCES-8 six-byte and UTF-8 four-byte supplementary character sequences.
(e) The shortest form rules applied to UTF-8 in The Unicode Standard Definition D36 will also apply to UCES-8.
(f) If encoding autodetection is attempted on a data stream that carries no encoding information, and the data stream conforms to both the UTF-8 and UCES-8 definitions (i.e. it consists purely of BMP characters), the data stream will be determined to be encoded in UTF-8, and not UCES-8.
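
Rules (d) and (f) suggest a simple detection order, sketched below in Python under stated assumptions. The function names and the "unknown" result are illustrative, not part of this report, and the requirement that every surrogate code unit belongs to a high/low pair is an assumption of this sketch rather than an explicit rule above.

```python
def surrogates_paired(text: str) -> bool:
    """Assumption of this sketch: every high surrogate is followed by a low one."""
    i = 0
    while i < len(text):
        cp = ord(text[i])
        if 0xD800 <= cp <= 0xDBFF:
            if i + 1 >= len(text) or not (0xDC00 <= ord(text[i + 1]) <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= cp <= 0xDFFF:
            return False
        else:
            i += 1
    return True


def detect_encoding(data: bytes) -> str:
    """Hypothetical detector returning 'UTF-8', 'UCES-8', or 'unknown'."""
    try:
        # Strict UTF-8 comes first, so BMP-only data is reported as UTF-8 (rule f).
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        pass
    # Rule (d): a 11110xxx byte is illegal anywhere in UCES-8.
    if any(b >= 0xF0 for b in data):
        return "unknown"
    try:
        # Accept the three-byte form for surrogate code units only at this stage.
        text = data.decode("utf-8", "surrogatepass")
    except UnicodeDecodeError:
        return "unknown"
    return "UCES-8" if surrogates_paired(text) else "unknown"
```

With this ordering, a stream is reported as UCES-8 only when it actually contains a six-byte supplementary character sequence, and a stream mixing four-byte and six-byte forms is rejected.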

UCES-8 encoding example:
In UCES-8, <U+004D, U+0061, U+10000> is serialized as <4D 61 ED A0 80 ED B0 80>
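
Steps (b) and (c) can be sketched as a small Python encoder. This is a minimal illustration, not a reference implementation; the function names `to_utf16_units`, `encode_unit` and `uces8_encode` are assumptions introduced here.

```python
def to_utf16_units(code_point: int) -> list[int]:
    """Step (b): split a supplementary code point into a surrogate pair."""
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000
    return [0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)]


def encode_unit(unit: int) -> bytes:
    """Step (c): apply the one-, two- or three-byte bit distribution to one code unit."""
    if unit < 0x80:
        return bytes([unit])
    if unit < 0x800:
        return bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
    return bytes([0xE0 | (unit >> 12),
                  0x80 | ((unit >> 6) & 0x3F),
                  0x80 | (unit & 0x3F)])


def uces8_encode(code_points) -> bytes:
    """Encode an iterable of Unicode scalar values as UCES-8 bytes."""
    out = bytearray()
    for cp in code_points:
        if not (0 <= cp <= 0x10FFFF) or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"not a Unicode scalar value: {cp:#x}")
        for unit in to_utf16_units(cp):
            out += encode_unit(unit)
    return bytes(out)


# U+F0000 becomes the pair DB80 DC00 (the example in (b)), then six bytes.
print(uces8_encode([0xF0000]).hex(" ").upper())  # ED AE 80 ED B0 80
```

Because the shortest applicable branch is always chosen per code unit, the output satisfies the shortest form rule in (e) and never contains a 11110xxx byte.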

2.2

(a) UCES-8 Bit Distribution

UTF-16 Code Unit	1st Byte	2nd Byte	3rd Byte
000000000xxxxxxx	0xxxxxxx
00000yyyyyxxxxxx	110yyyyy	10xxxxxx
zzzzyyyyyyxxxxxx	1110zzzz	10yyyyyy	10xxxxxx
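
Read in reverse, the table also drives decoding. The following Python sketch recovers UTF-16 code units from UCES-8 bytes and rejects the 11110xxx pattern and non-shortest forms; the function name `uces8_decode_units` and its error messages are illustrative assumptions, not part of this report.

```python
def uces8_decode_units(data: bytes) -> list[int]:
    """Hypothetical decoder: UCES-8 bytes to a list of UTF-16 code units."""
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                  # 0xxxxxxx
            length, value, minimum = 1, lead, 0x00
        elif (lead & 0xE0) == 0xC0:      # 110yyyyy
            length, value, minimum = 2, lead & 0x1F, 0x80
        elif (lead & 0xF0) == 0xE0:      # 1110zzzz
            length, value, minimum = 3, lead & 0x0F, 0x800
        else:
            # Stray continuation byte, 11110xxx, or any other illegal lead byte.
            raise ValueError(f"illegal lead byte {lead:#04x} at offset {i}")
        tail = data[i + 1:i + length]
        if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError(f"bad continuation bytes at offset {i}")
        for b in tail:
            value = (value << 6) | (b & 0x3F)
        if value < minimum:
            raise ValueError(f"non-shortest form at offset {i}")
        units.append(value)
        i += length
    return units


units = uces8_decode_units(bytes.fromhex("ED AE 80 ED B0 80"))
# Reassemble the code units as big-endian UTF-16 to obtain the character U+F0000.
text = b"".join(u.to_bytes(2, "big") for u in units).decode("utf-16-be")
```

Decoding to code units first, and only then pairing surrogates through a UTF-16 decoder, mirrors the encoding direction in which supplementary characters are never handled as scalar values by the eight-bit transformation itself.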

3 Relation to ISO/IEC 10646 and UTF-8

ISO/IEC 10646 and The Unicode Standard define the UTF-8 encoding form, which is very similar in definition to UCES-8 other than in its treatment of supplementary characters. UCES-8 is an additional encoding scheme that supplements these definitions, but does not form part of either ISO/IEC 10646 or The Unicode Standard. It is intended only for use in compatibility situations where binary collation with UTF-16 is required, or for implementations of UTF-FSS prior to the introduction of surrogate code units.

Prior to August 2001, UTF-8 defined six-byte supplementary character representations as “irregular” sequences, and prohibited their emission by conformant UTF-8 generators, but did not prohibit their interpretation as supplementary characters by a conformant UTF-8 reader. At the Unicode Technical Committee meeting #88, these “irregular” sequences were determined to be “illegal” sequences, and the definition of UTF-8 in The Unicode Standard was updated to reflect this new status. The definition of UTF-8 in ISO/IEC 10646 already defines these six-byte sequences as illegal. This provides a clear distinction between the UTF-8 and UCES-8 encodings, and allows for no overlap between their definitions.

UTC Resolution Mxx.xx (Title of Motion)

"[Text of motion to be included here upon its completion]"

The changes to The Unicode Standard resulting from this resolution are expected to be incorporated in The Unicode Standard version 3.2.

Note: UCES-8 was originally proposed with the name UTF-8S, but was renamed UCES-8 by recommendation from the Unicode Technical Committee to avoid possible confusion with UTF-8.

References

[CharMod]	Unicode Technical Report #17: Character Encoding Model http://www.unicode.org/unicode/reports/tr17/ For a detailed discussion of the relations between characters, glyphs, and encoding forms.
[Reports]	Unicode Technical Reports http://www.unicode.org/unicode/reports/ For information on the status and development process for technical reports, and for a list of technical reports.
[U3.1]	Unicode Standard Annex #27: Unicode 3.1 http://www.unicode.org/unicode/reports/tr27/
[Versions]	Versions of the Unicode Standard http://www.unicode.org/unicode/standard/versions/ For details on the precise contents of each version of the Unicode Standard, and how to cite them.

Modifications

The following summarizes modifications from the previous version of this document.

Created

The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.