To the extent possible under law, the editors have waived all copyright and related or neighboring rights to this work.
WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don’t enforce the well-formedness invariant that surrogates must be paired.
WTF-8 is a hack intended to be used internally in self-contained systems with components that need to support potentially ill-formed UTF-16 for legacy reasons.
Any WTF-8 data must be converted to a Unicode encoding at the system’s boundary before being emitted. UTF-8 is recommended. WTF-8 must not be used to represent text in a file format or for transmission over the Internet.
In particular, the Encoding Standard [ENCODING] defines UTF-8 and other encodings for the Web. There is not, and will not be, any encoding label [ENCODING] or IANA charset alias [CHARSETS] for WTF-8.
This section is non-normative.
When Unicode 1.0 was published in 1991, it defined 65536 code points from U+0000 to U+FFFF and assigned characters to around half of them. Many software implementations chose the obvious memory representation for Unicode text of 16 bits per code point / character.
At the time, “Unicode” was synonymous with that particular encoding. To disambiguate, that encoding is now called UCS-2.
As subsequent versions of Unicode assigned more characters, it became apparent that 65536 code points would not be sufficient. Unicode was extended to 1114112 code points from U+0000 to U+10FFFF, and the UTF-16 encoding was introduced. This encoding preserves compatibility with existing 16-bit based systems and represents new (supplementary) code points as a pair of “surrogates”.
UTF-16 is designed to represent any Unicode text, but it cannot represent a surrogate code point pair since the corresponding surrogate 16-bit code unit pairs would instead represent a supplementary code point. Therefore, the concept of Unicode scalar value was introduced and Unicode text was restricted to not contain any surrogate code point. (This was presumably deemed simpler than only restricting pairs.)
UTF-16 was redefined to be ill-formed if it contains unpaired surrogate 16-bit code units. UTF-8 was similarly redefined to be ill-formed if it contains surrogate byte sequences.
Meanwhile, 16-bit based systems had little to no incentive to do anything about surrogates: For several years, Unicode did not assign any character to supplementary code points, and then (until emoji) only comparatively rare characters. Additionally, the Unicode Standard does not require conforming implementations to maintain well-formedness of UTF-16 strings.
As a result, surrogates do occur in practice and need to be preserved. For example:
In JavaScript, a String value is defined as a sequence of 16-bit integers that usually represents UTF-16 text but may or may not be well-formed.
Windows represents text as sequences of WCHARs (16-bit code units) without enforcing well-formedness.
We say that strings in these systems are encoded in potentially ill-formed UTF-16 or WTF-16.
Unpaired surrogate 16-bit code units are the only case where an arbitrary sequence of 16-bit code units is ill-formed in UTF-16. UTF-8, however, is more complex and maintaining its well-formedness is arguably more valuable.
This specification defines WTF-8, a superset of UTF-8 that can losslessly represent arbitrary sequences of 16-bit code units (even if ill-formed in UTF-16) but preserves the other well-formedness constraints of UTF-8.
Unicode defines a Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8). WTF-8 is different from CESU-8.
CESU-8 encodes supplementary code points as surrogate pair byte sequences of six bytes, whereas WTF-8, like UTF-8, encodes them as sequences of four bytes. Therefore, CESU-8 is not a superset of UTF-8.
CESU-8 is also a mapping on UTF-16 code units. Therefore unpaired surrogate byte sequences are ill-formed in CESU-8, whereas supporting them is the entire point of WTF-8.
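As a non-normative illustration, the following Python sketch contrasts the two byte forms for one supplementary code point. The choice of U+1F4A9 is arbitrary, and the CESU-8 bytes are produced here by encoding each surrogate separately through Python's `surrogatepass` error handler, which is only a convenient stand-in for a real CESU-8 encoder.

```python
# U+1F4A9 is a supplementary code point chosen only for illustration.
code_point = 0x1F4A9

# UTF-8 (and therefore WTF-8): a single four-byte sequence.
utf8 = chr(code_point).encode("utf-8")

# CESU-8: split into a surrogate pair, then encode each surrogate on three bytes.
lead = ((code_point - 0x10000) >> 10) + 0xD800
trail = ((code_point - 0x10000) & 0x3FF) + 0xDC00
cesu8 = (chr(lead) + chr(trail)).encode("utf-8", "surrogatepass")

print(utf8.hex(" "))   # f0 9f 92 a9
print(cesu8.hex(" "))  # ed a0 bd ed b2 a9
```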
These definitions correspond to those of the Glossary of Unicode Terms. [UNICODE]
A Unicode code point is any value in the Unicode codespace; that is, the range of integers from 0 to 1114111. It is noted with a “U+” prefix and four to six hexadecimal digits: the first and last code points are U+0000 and U+10FFFF.
The Basic Multilingual Plane is the range of code points from U+0000 to U+FFFF.
A BMP code point is a code point in the Basic Multilingual Plane.
A supplementary code point is a code point not in the Basic Multilingual Plane. That is, a code point in the range from U+10000 to U+10FFFF.
A Unicode scalar value is a code point that is not a surrogate code point. That is, a code point in the range from U+0000 to U+D7FF, or in the range U+E000 to U+10FFFF.
A BMP scalar value is a Unicode scalar value in the Basic Multilingual Plane. That is, a code point in the range from U+0000 to U+D7FF, or in the range U+E000 to U+FFFF.
Unicode text is a sequence of Unicode scalar values.
UTF-8 is an encoding of Unicode text using 8-bit bytes. Each Unicode scalar value is represented as a sequence of one to four bytes.
UTF-16 is an encoding of Unicode text using 16-bit code units. BMP scalar values are represented as a single 16-bit code unit with the same value. Supplementary code points are represented as a surrogate 16-bit code unit pair.
Note: this specification is only concerned with the UTF-16 encoding form (based on 16-bit code units), and not with the encoding scheme (based on bytes, with UTF-16BE and UTF-16LE variants).
A string is well-formed (not ill-formed) in a given encoding if it follows the specification of that encoding. [UNICODE] defines Well-Formed Code Unit Sequence for UTF-8 and UTF-16.
The replacement character is the code point U+FFFD REPLACEMENT CHARACTER (�). It is used as a substitute to replace ill-formed sub-sequences during a conversion.
A 16-bit code unit is a 16-bit integer used in UTF-16. It is noted with a “0x” prefix and four hexadecimal digits: the first and last 16-bit code units are 0x0000 and 0xFFFF.
Note: The byte serialization or memory representation of a 16-bit code unit (little-endian or big-endian) is out of scope for this specification.
When an algorithm iterates over a sequence (“For every i in …”), consuming the next item means advancing in the sequence such that that item will be skipped during the following iteration of the loop: the item after the next becomes the next item.
The following algorithm prints “1”, “2”, and “4”.
For every digit i in “1234”, run these substeps:
1. Print i.
2. If i is “2”, consume the next digit.
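As a non-normative illustration, "consuming the next item" maps naturally onto advancing a shared iterator inside the loop. The sketch below assumes the substeps are "print i" and "if i is 2, consume the next digit", which is the reading that yields the stated output; the function name `run` is hypothetical.

```python
def run(digits):
    """Collect the items that would be printed by the example algorithm."""
    printed = []
    it = iter(digits)
    for i in it:
        printed.append(i)   # "Print i"
        if i == "2":
            next(it, None)  # consume the next item: the loop will skip it
    return printed

print(run("1234"))  # ['1', '2', '4']
```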
A lead surrogate code point or high surrogate code point is a code point in the range from U+D800 to U+DBFF.
A trail surrogate code point or low surrogate code point is a code point in the range from U+DC00 to U+DFFF.
A surrogate code point is either a lead surrogate code point or a trail surrogate code point. That is, a code point in the range from U+D800 to U+DFFF.
A surrogate code point pair is a sequence of a lead surrogate code point followed by a trail surrogate code point.
An unpaired surrogate code point is a surrogate code point that is not part of a surrogate code point pair.
A lead surrogate 16-bit code unit or high surrogate 16-bit code unit is a 16-bit code unit in the range from 0xD800 to 0xDBFF.
A trail surrogate 16-bit code unit or low surrogate 16-bit code unit is a 16-bit code unit in the range from 0xDC00 to 0xDFFF.
A surrogate 16-bit code unit is either a lead surrogate 16-bit code unit or a trail surrogate 16-bit code unit. That is, a 16-bit code unit in the range from 0xD800 to 0xDFFF.
A surrogate 16-bit code unit pair is a sequence of a lead surrogate 16-bit code unit followed by a trail surrogate 16-bit code unit. In UTF-16, it represents a supplementary code point.
An unpaired surrogate 16-bit code unit is a surrogate 16-bit code unit that is not part of a surrogate 16-bit code unit pair.
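As a non-normative illustration, the definitions above can be sketched as range checks over a list of integers standing in for 16-bit code units. The helper names (`is_lead`, `is_trail`, `unpaired_surrogate_indices`) are hypothetical.

```python
def is_lead(u):
    """Lead (high) surrogate 16-bit code unit."""
    return 0xD800 <= u <= 0xDBFF

def is_trail(u):
    """Trail (low) surrogate 16-bit code unit."""
    return 0xDC00 <= u <= 0xDFFF

def unpaired_surrogate_indices(units):
    """Return the positions of surrogates that are not part of a pair."""
    found = []
    i = 0
    while i < len(units):
        if is_lead(units[i]) and i + 1 < len(units) and is_trail(units[i + 1]):
            i += 2  # a surrogate 16-bit code unit pair: skip both units
            continue
        if is_lead(units[i]) or is_trail(units[i]):
            found.append(i)
        i += 1
    return found

# 'A', a surrogate pair, then a lone trail and a lone lead surrogate.
print(unpaired_surrogate_indices([0x0041, 0xD83D, 0xDCA9, 0xDC00, 0xD800]))  # [3, 4]
```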
Note: A surrogate byte sequence (and therefore any byte sequence described in this section) is ill-formed in UTF-8. Decoders are required to treat it as an error.
A lead surrogate byte sequence or high surrogate byte sequence is a sequence of three bytes that represents a lead surrogate code point in generalized UTF-8.
A trail surrogate byte sequence or low surrogate byte sequence is a sequence of three bytes that represents a trail surrogate code point in generalized UTF-8.
A surrogate byte sequence is either a lead surrogate byte sequence or a trail surrogate byte sequence. That is, a sequence of three bytes that represents a surrogate code point in generalized UTF-8.
Byte sequence | First byte | Second byte | Third byte |
---|---|---|---|
Lead surrogate byte sequence | ED | A0 to AF | 80 to BF |
Trail surrogate byte sequence | ED | B0 to BF | 80 to BF |
Surrogate byte sequence | ED | A0 to BF | 80 to BF |
A surrogate pair byte sequence is a sequence of six bytes composed of a lead surrogate byte sequence followed by a trail surrogate byte sequence.
An unpaired surrogate byte sequence is a surrogate byte sequence that is not part of a surrogate pair byte sequence.
A sequence of 16-bit code units is potentially ill-formed UTF-16 if it is intended to be interpreted as UTF-16, but is not necessarily well-formed in UTF-16. It effectively encodes a sequence of code points that do not contain any surrogate code point pair.
Note: Like UTF-16, potentially ill-formed UTF-16 cannot represent a surrogate code point pair since the corresponding surrogate 16-bit code unit pair would instead represent a supplementary code point. Unlike well-formed UTF-16, it might contain isolated surrogate code points.
Any sequence of 16-bit code units has an interpretation as potentially ill-formed UTF-16.
WTF-16 is sometimes used as a shorter name for potentially ill-formed UTF-16, especially in the context of systems that were originally designed for UCS-2 and later upgraded to UTF-16 but never enforced well-formedness, either by neglect or because of backward-compatibility constraints.
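To encode from code points to potentially ill-formed UTF-16, run these steps:
1. Let result be an empty sequence of 16-bit code units.
2. For every code point P of the input, run these substeps:
   1. If P is a BMP code point, append to result a 16-bit code unit of the same value.
   2. Otherwise, P is a supplementary code point. Append to result, in order, a lead surrogate 16-bit code unit of value ((P - 0x10000) >> 10) + 0xD800 and a trail surrogate 16-bit code unit of value ((P - 0x10000) & 0x3FF) + 0xDC00.
3. Return result.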
Note: If the input is restricted to Unicode text, this is identical to encoding to UTF-16 and the resulting sequence is well-formed in UTF-16.
If, on the other hand, the input contains a surrogate code point pair, the conversion will be incorrect and the resulting sequence will not represent the original code points.
This situation should be considered an error, but this specification does not define how to handle it. Possibilities include aborting the conversion, or replacing one of the surrogate code points of the pair with a replacement character.
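As a non-normative illustration, the encoding can be sketched in Python with lists of integers standing in for code points and 16-bit code units. The function name `encode_wtf16` is hypothetical, and the sketch does not detect the surrogate code point pair error case described above.

```python
def encode_wtf16(code_points):
    """Encode code points to potentially ill-formed UTF-16 code units."""
    result = []
    for p in code_points:
        if p <= 0xFFFF:
            # BMP code point (including any surrogate): same value.
            result.append(p)
        else:
            # Supplementary code point: lead then trail surrogate.
            result.append(((p - 0x10000) >> 10) + 0xD800)
            result.append(((p - 0x10000) & 0x3FF) + 0xDC00)
    return result

units = encode_wtf16([0x0041, 0x1F4A9, 0xD800])
print([hex(u) for u in units])  # ['0x41', '0xd83d', '0xdca9', '0xd800']
```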
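To decode from potentially ill-formed UTF-16 to code points, run these steps:
1. Let result be an empty sequence of code points.
2. For every 16-bit code unit U of the input, run these substeps:
   1. If U is a lead surrogate 16-bit code unit, U is not the last 16-bit code unit of the input, and the next 16-bit code unit next is a trail surrogate 16-bit code unit, consume the next 16-bit code unit and append to result a supplementary code point of value 0x10000 + ((U - 0xD800) << 10) + (next - 0xDC00).
   2. Otherwise, append to result a code point of the same value as U.
3. Return result.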
Note: By construction, the resulting sequence does not contain a surrogate code point pair.
Note: If the input is well-formed in UTF-16, this is identical to decoding UTF-16 and the resulting sequence is Unicode text.
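As a non-normative illustration, the decoding can be sketched in Python. An explicit index plays the role of "consuming" the next code unit; the function name `decode_wtf16` is hypothetical.

```python
def decode_wtf16(units):
    """Decode potentially ill-formed UTF-16 code units to code points."""
    result = []
    i = 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            nxt = units[i + 1]
            result.append(0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00))
            i += 2  # the trail surrogate has been consumed
        else:
            result.append(u)  # BMP scalar value or unpaired surrogate
            i += 1
    return result

print([hex(p) for p in decode_wtf16([0xD83D, 0xDCA9])])  # ['0x1f4a9']
print([hex(p) for p in decode_wtf16([0xDCA9, 0xD83D])])  # ['0xdca9', '0xd83d']
```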
For the purpose of this specification, generalized UTF-8 is an encoding of sequences of code points (not restricted to Unicode scalar values) using 8-bit bytes, based on the same underlying algorithm as UTF-8. It is a strict superset of UTF-8 (just as UTF-8 is a strict superset of ASCII).
Each code point is encoded as a sequence of one to four bytes:
Code point | First byte | Second byte | Third byte | Fourth byte |
---|---|---|---|---|
U+0000 to U+007F | 0xxxxxxx | |||
U+0080 to U+07FF | 110xxxxx | 10xxxxxx | ||
U+0800 to U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
U+10000 to U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
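A byte sequence is well-formed in generalized UTF-8 if and only if it is a concatenation of zero or more byte sequences, each of which matches one row of the following table: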
Code point | First byte | Second byte | Third byte | Fourth byte |
---|---|---|---|---|
U+0000 to U+007F | 00 to 7F | |||
U+0080 to U+07FF | C2 to DF | 80 to BF | ||
U+0800 to U+0FFF | E0 | A0 to BF | 80 to BF | |
U+1000 to U+FFFF | E1 to EF | 80 to BF | 80 to BF | |
U+10000 to U+3FFFF | F0 | 90 to BF | 80 to BF | 80 to BF |
U+40000 to U+FFFFF | F1 to F3 | 80 to BF | 80 to BF | 80 to BF |
U+100000 to U+10FFFF | F4 | 80 to 8F | 80 to BF | 80 to BF |
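As a non-normative illustration, the table above can be transcribed into a small checker: each row becomes a tuple of inclusive byte ranges, and the input is well-formed when it splits into sub-sequences that each match a row. The names `ROWS` and `is_generalized_utf8` are hypothetical.

```python
# One tuple of (low, high) byte ranges per table row.
ROWS = [
    ((0x00, 0x7F),),
    ((0xC2, 0xDF), (0x80, 0xBF)),
    ((0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)),
    ((0xE1, 0xEF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)),
]

def is_generalized_utf8(data):
    """Return True if data is a concatenation of sequences matching ROWS."""
    i = 0
    while i < len(data):
        for row in ROWS:
            chunk = data[i:i + len(row)]
            if len(chunk) == len(row) and all(
                    lo <= b <= hi for b, (lo, hi) in zip(chunk, row)):
                i += len(row)
                break
        else:
            return False  # no row matches at position i
    return True

print(is_generalized_utf8(b"\xed\xa0\x80"))  # True: surrogates are allowed here
print(is_generalized_utf8(b"\xc0\x80"))      # False: overlong encoding
```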
WTF-8 (Wobbly Transformation Format − 8-bit) is an encoding of code point sequences that do not contain any surrogate code point pair using 8-bit bytes.
Note: Just as UTF-8 is artificially restricted to Unicode text in order to match UTF-16, WTF-8 is artificially restricted to exclude surrogate code point pairs in order to match potentially ill-formed UTF-16.
It is identical to generalized UTF-8, with the additional well-formedness constraint that a surrogate pair byte sequence is ill-formed. It is a strict subset of generalized UTF-8 and a strict superset of UTF-8.
Note: Similarly, UTF-8 is a strict superset of ASCII.
WTF-8 must not be used for interchange. See Intended audience.
To encode from code points to well-formed WTF-8, run these steps:
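1. Let result be an empty sequence of bytes.
2. For every code point P of the input, run these substeps:
   1. If P is a lead surrogate code point, P is not the last code point of the input, and the next code point is a trail surrogate code point, consume the next code point and set P’s value to: 0x10000 + ((P - 0xD800) << 10) + (next - 0xDC00).
   2. Depending on the value of P, append to result, in order:
      U+0000 to U+007F: one byte of value P.
      U+0080 to U+07FF: two bytes of values 0xC0 | (P >> 6) and 0x80 | (P & 0x3F).
      U+0800 to U+FFFF: three bytes of values 0xE0 | (P >> 12), 0x80 | ((P >> 6) & 0x3F), and 0x80 | (P & 0x3F).
      U+10000 to U+10FFFF: four bytes of values 0xF0 | (P >> 18), 0x80 | ((P >> 12) & 0x3F), 0x80 | ((P >> 6) & 0x3F), and 0x80 | (P & 0x3F).
3. Return result.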
Note: If the input contains a surrogate code point pair, the resulting byte sequence will not represent the original sequence of code points. Instead, it will represent the same code points as if it had been encoded in potentially ill-formed UTF-16. This is also consistent with encoding each code point to WTF-8 individually, and concatenating the resulting WTF-8 byte sequences.
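As a non-normative illustration, the encoder can be sketched in Python with a list of integers as input. The byte ranges follow the generalized UTF-8 table; the function name `encode_wtf8` is hypothetical.

```python
def encode_wtf8(code_points):
    """Encode code points to well-formed WTF-8."""
    result = bytearray()
    i = 0
    while i < len(code_points):
        p = code_points[i]
        i += 1
        # Merge a surrogate code point pair into one supplementary code point.
        if (0xD800 <= p <= 0xDBFF and i < len(code_points)
                and 0xDC00 <= code_points[i] <= 0xDFFF):
            p = 0x10000 + ((p - 0xD800) << 10) + (code_points[i] - 0xDC00)
            i += 1
        if p <= 0x7F:
            result.append(p)
        elif p <= 0x7FF:
            result += bytes([0xC0 | (p >> 6), 0x80 | (p & 0x3F)])
        elif p <= 0xFFFF:
            result += bytes([0xE0 | (p >> 12),
                             0x80 | ((p >> 6) & 0x3F),
                             0x80 | (p & 0x3F)])
        else:
            result += bytes([0xF0 | (p >> 18),
                             0x80 | ((p >> 12) & 0x3F),
                             0x80 | ((p >> 6) & 0x3F),
                             0x80 | (p & 0x3F)])
    return bytes(result)

print(encode_wtf8([0xD800]).hex(" "))          # ed a0 80
print(encode_wtf8([0xD83D, 0xDCA9]).hex(" "))  # f0 9f 92 a9
```

The second call shows the behaviour described in the note: a surrogate code point pair in the input comes out as a single four-byte sequence.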
To decode from well-formed WTF-8 to code points, run these steps:
Note: Since WTF-8 must not be used for interchange (see Intended audience), this algorithm is deliberately not defined for arbitrary byte sequences. It is only defined for byte sequences known to be well-formed in WTF-8, such as sequences encoded from code points, converted from UTF-16, or concatenated from sequences themselves well-formed in WTF-8.
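1. Let result be an empty sequence of code points.
2. For every byte B of the input, run these substeps, where B2, B3, and B4 are the bytes that follow B and are consumed as they are used:
   If B is in the range 0x00 to 0x7F: Append to result a code point of value B.
   If B is in the range 0xC2 to 0xDF: Append to result a code point of value ((B & 0x1F) << 6) + (B2 & 0x3F)
   If B is in the range 0xE0 to 0xEF: Append to result a code point of value ((B & 0x0F) << 12) + ((B2 & 0x3F) << 6) + (B3 & 0x3F)
   If B is in the range 0xF0 to 0xF4: Append to result a code point of value ((B & 0x07) << 18) + ((B2 & 0x3F) << 12) + ((B3 & 0x3F) << 6) + (B4 & 0x3F)
3. Return result.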
Note: If the input is also well-formed in UTF-8, this is identical to decoding UTF-8 and the resulting sequence is Unicode text.
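As a non-normative illustration, the decoder can be sketched in Python. Like the algorithm, the sketch assumes its input is already known to be well-formed in WTF-8 and performs no validation; the first-byte ranges used to pick the sequence length are taken from the generalized UTF-8 table, and the function name `decode_wtf8` is hypothetical.

```python
def decode_wtf8(data):
    """Decode well-formed WTF-8 bytes to code points (no validation)."""
    result = []
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:                      # one byte
            result.append(b)
            i += 1
        elif b <= 0xDF:                    # two bytes
            b2 = data[i + 1]
            result.append(((b & 0x1F) << 6) + (b2 & 0x3F))
            i += 2
        elif b <= 0xEF:                    # three bytes, surrogates included
            b2, b3 = data[i + 1], data[i + 2]
            result.append(((b & 0x0F) << 12) + ((b2 & 0x3F) << 6) + (b3 & 0x3F))
            i += 3
        else:                              # four bytes
            b2, b3, b4 = data[i + 1], data[i + 2], data[i + 3]
            result.append(((b & 0x07) << 18) + ((b2 & 0x3F) << 12)
                          + ((b3 & 0x3F) << 6) + (b4 & 0x3F))
            i += 4
    return result

print([hex(p) for p in decode_wtf8(b"\xed\xa0\x80")])      # ['0xd800']
print([hex(p) for p in decode_wtf8(b"\xf0\x9f\x92\xa9")])  # ['0x1f4a9']
```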
To convert from potentially ill-formed UTF-16 to WTF-8, decode from potentially ill-formed UTF-16 to code points, then encode the resulting code points to WTF-8.
Note: This conversion never fails and is lossless.
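As a non-normative illustration, the conversion from potentially ill-formed UTF-16 to WTF-8 can be sketched by pairing up surrogates and then encoding each resulting code point. The sketch assumes the conversion is the composition "decode to code points, then encode to WTF-8", and it delegates the per-code-point byte encoding to Python's `surrogatepass` error handler, which emits the three-byte form for a lone surrogate. The function name `utf16_to_wtf8` is hypothetical.

```python
def utf16_to_wtf8(units):
    """Convert potentially ill-formed UTF-16 code units to WTF-8 bytes."""
    out = bytearray()
    i = 0
    while i < len(units):
        p = units[i]
        i += 1
        if (0xD800 <= p <= 0xDBFF and i < len(units)
                and 0xDC00 <= units[i] <= 0xDFFF):
            # Surrogate pair: becomes one supplementary code point.
            p = 0x10000 + ((p - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        # Assumption: surrogatepass encodes a lone surrogate on three bytes.
        out += chr(p).encode("utf-8", "surrogatepass")
    return bytes(out)

print(utf16_to_wtf8([0x0041, 0xD83D, 0xDCA9, 0xD800]).hex(" "))
# 41 f0 9f 92 a9 ed a0 80
```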
To convert from WTF-8 to potentially ill-formed UTF-16, decode from WTF-8 to code points, then encode the resulting code points to potentially ill-formed UTF-16.
Note: This conversion never fails and, if the input is well-formed in WTF-8, is lossless.
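As a non-normative illustration, the reverse conversion can be sketched by decoding the bytes to code points and splitting supplementary code points into surrogate pairs. The sketch assumes the conversion is the composition "decode from WTF-8, then encode to potentially ill-formed UTF-16", and it uses Python's `surrogatepass` error handler as a stand-in WTF-8 decoder. The function name `wtf8_to_utf16` is hypothetical.

```python
def wtf8_to_utf16(data):
    """Convert well-formed WTF-8 bytes to potentially ill-formed UTF-16."""
    units = []
    # Assumption: surrogatepass lets surrogate byte sequences decode
    # to surrogate code points instead of raising an error.
    for ch in data.decode("utf-8", "surrogatepass"):
        p = ord(ch)
        if p <= 0xFFFF:
            units.append(p)
        else:
            units.append(((p - 0x10000) >> 10) + 0xD800)
            units.append(((p - 0x10000) & 0x3FF) + 0xDC00)
    return units

print([hex(u) for u in wtf8_to_utf16(b"\xed\xa0\x80A")])
# ['0xd800', '0x41']
```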
Since WTF-8 is a superset of UTF-8, any sequence of bytes that is well-formed in UTF-8 is also well-formed in WTF-8 and represents the same text. To convert from UTF-8 to WTF-8, return the input unchanged.
Note: This conversion never fails and is lossless.
To convert lossily from WTF-8 to UTF-8, replace any surrogate byte sequence with the sequence of three bytes <0xEF, 0xBF, 0xBD>, the UTF-8 encoding of the replacement character.
Note: Since surrogate byte sequences are also three bytes long, this conversion can be done in place.
Note: This conversion never fails but is lossy.
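As a non-normative illustration, the in-place lossy conversion can be sketched on a Python `bytearray`. The sketch relies on the input being well-formed WTF-8: in that case the byte 0xED only ever appears as the first byte of a three-byte sequence, so a simple scan is sufficient. The function name `wtf8_to_utf8_lossy` is hypothetical.

```python
def wtf8_to_utf8_lossy(buf):
    """Replace every surrogate byte sequence in buf with U+FFFD, in place."""
    i = 0
    while i + 2 < len(buf):
        if buf[i] == 0xED and 0xA0 <= buf[i + 1] <= 0xBF:
            buf[i:i + 3] = b"\xEF\xBF\xBD"  # same length, so no shifting
            i += 3
        else:
            i += 1

buf = bytearray(b"a\xed\xa0\x80b")
wtf8_to_utf8_lossy(buf)
print(bytes(buf).decode("utf-8"))  # a�b
```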
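To convert strictly from WTF-8 to UTF-8, run these steps:
1. If the input contains a surrogate byte sequence, return failure.
2. Otherwise, return the input unchanged.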
Note: This conversion is lossless when it succeeds, but it can fail.
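As a non-normative illustration, the strict conversion can be sketched as a scan for surrogate byte sequences. Signalling failure by returning `None` is an assumption of this sketch, as is the function name `wtf8_to_utf8_strict`; the input is assumed to be well-formed WTF-8.

```python
def wtf8_to_utf8_strict(data):
    """Return data unchanged if it is also UTF-8, or None on failure."""
    for i in range(len(data) - 1):
        # In well-formed WTF-8, 0xED followed by 0xA0..0xBF starts
        # a surrogate byte sequence.
        if data[i] == 0xED and 0xA0 <= data[i + 1] <= 0xBF:
            return None
    return bytes(data)

print(wtf8_to_utf8_strict(b"abc"))           # b'abc'
print(wtf8_to_utf8_strict(b"a\xed\xa0\x80"))  # None
```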
Concatenating WTF-8 strings requires extra care to preserve well-formedness.
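To concatenate two WTF-8 strings, run these steps:
1. If the left input ends with a lead surrogate byte sequence and the right input starts with a trail surrogate byte sequence, run these substeps:
   1. Let lead and trail be the two code points obtained by decoding these two surrogate byte sequences from WTF-8.
   2. Let supplementary be the WTF-8 encoding of the single code point of value 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00).
   3. Return the concatenation of the left input without its last three bytes, supplementary, and the right input without its first three bytes.
2. Otherwise, return the concatenation of the two inputs.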
Note: This is equivalent to converting both strings to potentially ill-formed UTF-16, concatenating the resulting 16-bit code unit sequences, then converting the concatenation back to WTF-8.
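As a non-normative illustration, concatenation can be sketched in Python. When the left string ends with a lead surrogate byte sequence and the right string starts with a trail surrogate byte sequence, the six bytes are replaced with the four-byte encoding of the supplementary code point they would form. Both inputs are assumed to be well-formed WTF-8, and the function name `concat_wtf8` is hypothetical.

```python
def concat_wtf8(left, right):
    """Concatenate two well-formed WTF-8 byte strings, preserving well-formedness."""
    ends_with_lead = (len(left) >= 3 and left[-3] == 0xED
                      and 0xA0 <= left[-2] <= 0xAF)
    starts_with_trail = (len(right) >= 3 and right[0] == 0xED
                         and 0xB0 <= right[1] <= 0xBF)
    if ends_with_lead and starts_with_trail:
        # Decode both three-byte surrogate byte sequences.
        lead = ((left[-3] & 0x0F) << 12) + ((left[-2] & 0x3F) << 6) + (left[-1] & 0x3F)
        trail = ((right[0] & 0x0F) << 12) + ((right[1] & 0x3F) << 6) + (right[2] & 0x3F)
        p = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)
        supplementary = chr(p).encode("utf-8")  # four bytes
        return left[:-3] + supplementary + right[3:]
    return left + right

print(concat_wtf8(b"a\xed\xa0\xbd", b"\xed\xb2\xa9b").hex(" "))
# 61 f0 9f 92 a9 62
```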
This section is non-normative.
Thanks to Coralie Mercier for coining the name WTF-8.
Thanks for feedback and contributions from Anne van Kesteren, David Baron, Dylan Petonke, Guillaume Knispel, Henri Sivonen, Jacob Lifshay, James Graham, Lily Ballard, Mathias Bynens, Ms2ger, Sam Tobin-Hochstadt, Tab Atkins.
Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.
All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]
Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example".
Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:
Note, this is an informative note.