Unicode Frequently Asked Questions

Internationalized Domain Names (IDN) FAQ

Internationalized Domain Names

Q: What is an Internationalized Domain Name (IDN)?

Domain names, such as "macchiati.blogspot.com", were originally designed only to support ASCII characters. In 2003, the first specification was released that allows most Unicode characters to be used in domain names. This specification was replaced a few years later by IDNA2008, which differs in some points. IDNs are supported by all modern browsers and email programs, so people can use links in their native languages, such as http://Bücher.de.

Q: Do IDNs change the Domain Name System (DNS)?

No. Internally, the non-ASCII Unicode characters are transformed into a special sequence of ASCII characters. So as far as the DNS system is concerned, all domain names are just ASCII.
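As a minimal sketch of that transformation, the following uses Python's built-in "idna" and "punycode" codecs. Note the assumption: the built-in "idna" codec follows the older IDNA2003 rules, so it only illustrates the ASCII encoding step, not IDNA2008 validity checking.

```python
# Assumption: Python's built-in "idna" codec (IDNA2003 rules) is used here
# purely to illustrate the ASCII form that the DNS actually sees.
name = "bücher.de"

# Each non-ASCII label is converted to an ASCII-compatible "xn--" form.
ace = name.encode("idna")
print(ace)  # b'xn--bcher-kva.de'

# The raw Punycode step for a single label, without the "xn--" prefix.
print("bücher".encode("punycode"))  # b'bcher-kva'

# Decoding restores the Unicode form for display.
round_trip = ace.decode("idna")
print(round_trip)  # bücher.de
```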

Q: When will IDNs be available?

IDNs have been defined and in use since 2003, initially under a system called "IDNA2003". In 2010 a revised protocol was released as "IDNA2008". At that time, ICANN also began to make Internationalized Domain Names available for top level domains, like the "org" in "unicode.org". Since then these can also use non-ASCII characters, for example, ".世界" (.world). Across all domains, over 8 million IDNs had been registered worldwide by 2020, most of them in Chinese, Latin or Cyrillic.

Q: What is IDNA2008?

It is a revision of IDNA2003, approved in 2010. For most Unicode characters it produces the same results as IDNA2003, but there are important classes of characters for which it is not backwards compatible with IDNA2003. ICANN requires the use of IDNA2008 in the Root Zone and any top-level domain which is under contract with ICANN. These guidelines discuss some of the issues for a transition from IDNA2003 to IDNA2008 from the perspective of a domain name registry.

Q: How does IDNA2008 differ from IDNA2003?

It disallows about eight thousand characters that used to be valid, including all uppercase characters, full/half-width variants, symbols, and punctuation. It also interprets four characters differently.

Q: Which four characters are interpreted differently in IDNA2008?

Four characters can cause an IDNA2008 implementation to go to a different web page than an IDNA2003 implementation, given the same source, such as href="http://faß.de". These four characters include some that are quite common in languages such as German, Greek, Farsi, and Sinhala:

U+00DF ( ß ) LATIN SMALL LETTER SHARP S
U+03C2 ( ς ) GREEK SMALL LETTER FINAL SIGMA
U+200C ( ) ZERO WIDTH NON-JOINER
U+200D ( ) ZERO WIDTH JOINER

For the purposes of discussion of differences between IDNA versions, these characters are called "deviations".
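The IDNA2003 side of this difference can be observed with Python's built-in "idna" codec, which follows IDNA2003. This is only a sketch: the sample domain names other than faß.de are made up for illustration, and an IDNA2008 implementation (not part of the standard library) would instead keep these four characters in the encoded label.

```python
# Assumption: Python's built-in "idna" codec implements IDNA2003 (nameprep).
# Under IDNA2003 the deviation characters are mapped away before encoding.
samples = {
    "sharp s": "faß.de",
    "final sigma": "βόλος.com",
    "ZWNJ": "a\u200cb.com",   # hypothetical label containing U+200C
    "ZWJ": "a\u200db.com",    # hypothetical label containing U+200D
}

for description, domain in samples.items():
    print(description, domain.encode("idna"))

# ß is mapped to "ss", so the lookup goes to a plain ASCII name.
sharp_s_2003 = "faß.de".encode("idna")

# ς is mapped to σ, so both spellings yield the same encoded name.
sigma_same = "βόλος.com".encode("idna") == "βόλοσ.com".encode("idna")

# The joiners are deleted entirely.
zwnj_2003 = "a\u200cb.com".encode("idna")
```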

Q: What characters are valid in IDNA2008?

The validity and status of characters under IDNA2008 is determined algorithmically from Unicode character properties (with certain exceptions applied). Unicode provides the result of this computation for every applicable version of the standard including the currently defined exceptions. If IETF publishes additional exceptions in the future, these will be reflected going forward.

Migration from IDNA2003

Q: What is UTS #46?

UTS #46, also sometimes referred to as "TR46", is a Unicode specification that allows implementations to handle domain names compatibly during the transition from IDNA2003 to IDNA2008. The title is "Unicode IDNA Compatibility Processing".

UTS #46 also provides a preprocessing specification for mapping that can be used with a standard IDNA2008 implementation.

Q: Is UTS #46 an IETF publication?

IDNA2008 is an IETF specification, while UTS #46 is a specification of the Unicode Consortium.

Q: Why is UTS #46 necessary?

Browsers and other client software need to support existing pages, which were constructed under the IDNA2003 interpretation of international domain names. They also need to continue to meet their users' expectations, such as being able to type in IDNs with capital letters, or to use the ideographic period in Japanese or Chinese domain names. In particular, the 4 "deviation" characters can cause significant security and usability problems; they and symbols can be phased out over time, but need some transitional support.

UTS #46 provides a compatibility bridge that allows implementations to handle both IDNA2003 and IDNA2008 domain names. For the specification and more background information, see UTS #46.

Q: What are examples of IRIs where characters behave differently under IDNA2008?

Here is a table showing internationalized domain names in the context of IRIs, illustrating the differences in characters:

| URL | IDNA2003 | UTS #46 | IDNA2008 | Comments |
| --- | --- | --- | --- | --- |
| href="http://öbb.at" | Valid | Valid | Valid | Simple characters |
| href="http://ÖBB.at" | Valid† | Valid† | Disallowed | Case mapping is not part of IDNA2008 |
| href="http://√.com" | Valid | Valid | Disallowed | Symbols are disallowed in IDNA2008 |
| href="http://faß.de" | Valid† | Valid† | Valid | Deviation (different resulting IP address in IDNA2008) |
| href="http://ԛәлп.com" | Valid‡ | Valid | Valid | IDNA2003 only allows Unicode 3.2 characters, excluding U+051B ( ԛ ) cyrillic qa |
| href="http://Ⱥbby.com" | Valid‡ | Valid† | Disallowed | IDNA2003 only allows Unicode 3.2 characters, excluding U+023A ( Ⱥ ) latin A with stroke; Case mapping is not part of IDNA2008 |

† Mapped to different characters, e.g. lowercase.

‡ Note that the Unicode characters after 3.2 were valid on lookup, but not for registration.

For a more detailed account of the similarities and differences, with character counts, see Section 7, IDNA Comparison in UTS #46. For a demonstration of differences between IDNA2003, IDNA2008, and the Unicode IDNA Compatibility Processing, see the IDNA demo.

Q: What are the main advantages of IDNA2008?

The main advantages are:

The classification of characters under IDNA2008 is based on a combination of Unicode properties, so implementations can compute them for all Unicode characters. The original IDNA2008 published tables based on Unicode 5.2, but also included some exceptional classification. A series of additional RFCs publishes reviews of and computed tables for later Unicode versions. See RFC8573 as well as the FAQ entry "What characters are valid in IDNA2008?".

Q: What are issues of migrating to IDNA2008?

If IDNA2003 had not existed, there would be no migration issues for IDNA2008. Given that IDNA2003 does exist, and is still widely deployed, the following issues should be noted:

Mitigating Security Concerns for IDNs

Q: What are typical security concerns introduced by IDN?

IDNs allow a much wider range of character shapes than ASCII, as well as scripts that have complex and nonlinear rendering. Here are some of the concerns that are specific to, or more prominent in IDNs:

Q: What are typical ways to mitigate security concerns introduced by IDN?

RFC7940 provides the framework for a machine-readable implementation of these mitigation steps. For an example of policies intended to reduce the risk in a multi-script zone, see Root Zone Label Generation Rules.

Q: What is a confusable IDN label?

Two labels are confusable if users accept one for the other. Confusable or look-alike appearance may exist between characters, combinations of characters, or both. Confusable characters may exist inside the same script, or across scripts. Disallowing mixed-script labels cuts down on possible combinations, but it is possible to create single-script labels that are confusable across scripts.

Examples of confusable characters

| Character | Code point | Script |
| --- | --- | --- |
| è | U+00E8 | Latin |
| ѐ | U+0450 | Cyrillic |
| ə | U+0259 | Latin |
| ǝ | U+01DD | Latin |

For more information, see UTS #39, "Unicode Security Mechanisms".
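To make the cross-script versus in-script distinction concrete, here is a small sketch using Python's unicodedata module. The rough_scripts helper is a hypothetical heuristic that reads the script from the first word of the character name; the real Script property comes from the Unicode Character Database and the confusable data from UTS #39, neither of which the standard library exposes.

```python
import unicodedata

def describe(text: str) -> list:
    """Return 'U+XXXX NAME' for each character, to expose look-alikes."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

def rough_scripts(text: str) -> set:
    """Heuristic only (assumption): use the first word of the character
    name as a stand-in for the Script property."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}

# Two look-alike pairs: one cross-script, one within a single script.
cross_script = ("\u00e8", "\u0450")   # Latin e with grave / Cyrillic ie with grave
in_script = ("\u0259", "\u01dd")      # Latin schwa / Latin turned e

for a, b in (cross_script, in_script):
    same_script = rough_scripts(a) == rough_scripts(b)
    print(describe(a), describe(b), "same script:", same_script)
```

A script-based restriction would separate the first pair but not the second, which is why single-script labels can still be confusable.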

Q: How did the treatment of symbols like √ change from IDNA2003 to IDNA2008?

While http://√.com is valid in an IDNA2003 implementation, it would fail on an IDNA2008 implementation. At the time IDNA2008 was introduced this affected 3,254 characters, most of which are rarely used. A small percentage of those are security risks because of confusability. The vast majority are unproblematic: for example, having http://I♥NY.com doesn't cause security problems. IDNA2008 has additional tests that are based on the context in which characters are found, but they apply to few characters, and don't provide any appreciable increase in security.

The issue may be different for newer symbols like emoji, particularly because the appearance for an emoji is not specified and users cannot know whether some image that looks a bit different is a different emoji or just a difference in rendition.

Q: What makes emoji particularly unsuited for IDNs?

In 2017, the ICANN Security and Stability Advisory Committee (SSAC) released an Advisory on the Use of Emoji in Domain Names. In it, SSAC members conclude that emoji are fundamentally at odds with the way the DNS is designed to be an "exact match lookup" system. As key findings, they cite the fact that different emoji may look highly similar to users because they lack a conventionally agreed-upon appearance, such as letters in an alphabet have, and that their appearance is not prescribed or regulated by the emoji specification. Additionally, they note that the modifiers, such as for skin tone or color, as well as the combinations make the set somewhat open ended and subject to inconsistent implementation across devices. Taken together, these findings lead them to reject emoji for identifiers outright for TLDs and to strongly warn against their use anywhere.

Q: What are typical security exploits?

The vast majority of security exploits are of the form "security-wellsfargo.com", where no special characters are involved. For more information, see Stéphane Bortzmeyer's blog entry, idn-et-phishing (in French). The most interesting studies cited there (originally from Mike Beltzner of Mozilla) are:

Even among the fraction that are confusable characters, merely limiting the allowed characters to letters and digits doesn't by itself do anything about the most frequent sources of character-based spoofing: look-alike characters that are both letters, like "http://paypal.com" with a Cyrillic "a".

According to data from Google, the removal of symbols and punctuation in IDNA2008 reduces opportunities for spoofing by only about 0.000016%, weighted by frequency. In another study at Google of a billion web pages, the top 277 confusable URLs used confusable letters or numbers, not symbols or punctuation. The 278th page had a confusable URL with × (U+00D7 MULTIPLICATION SIGN - by far the most common confusable); but that page could be even better spoofed with х (U+0445 CYRILLIC SMALL LETTER HA), which normally has precisely the same displayed shape as "x".

For a demo of confusable characters, and the effects of various restrictions, see the confusables demo.

This points to the need to carefully consider spoofing issues within the repertoire of letters allowed for registration for a given zone. Removing rarely used, and therefore unfamiliar, letter forms is one strategy. Another is to use the mechanisms of RFC7940 to create a machine-readable specification for limiting sequences of letters that are structurally invalid for a script and liable to be rendered like a different, valid sequence. A third strategy is to define characters that are variants of each other. This information can be used to enforce mutual exclusion of labels containing look-alike characters, whether in-script or cross-script.

Programmers also need to be aware of the issues detailed in UTR #36: Unicode Security Considerations. The mechanisms for detecting potentially visually confusable characters are found in the associated UTS #39: Unicode Security Mechanisms.

Q: Are the local mappings in IDNA2008 just a UI issue?

No, not if what is meant is that they are only involved in interactions with the address bar.

Example:

There are parallel examples with web pages, IM chats, Word documents, etc.

Q: Do the local-mapping exploits require unscrupulous registries?

No. The exploits do not require unscrupulous registries; they only require that registries fail to police every URL that they register for possible spoofing behavior.

The local mappings matter to security, because entering the same URL on two different browsers may go to two different IP addresses when the two browsers have different local mappings. The same thing could happen within an emailer that is parsing for URLs, and then opening a browser. Moreover, they are even more problematic if they affect the interpretation of web pages, in cases such as href="http://TÜRKIYE.com".
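The TÜRKIYE case can be sketched as follows. The "Turkish" mapping below is a hand-written stand-in (an assumption for illustration, since Python's str.lower() is not locale-sensitive), and the built-in "idna" codec is used only to show that the two results become different ASCII names.

```python
domain = "TÜRKIYE.com"

# Language-independent lowercasing: capital I becomes dotted i.
default_lower = domain.lower()

# Hypothetical Turkish-style local mapping: capital I becomes dotless ı
# (U+0131) and capital İ (U+0130) becomes dotted i.
turkish_lower = domain.replace("I", "\u0131").replace("\u0130", "i").lower()

# The two mappings produce different names, hence different DNS lookups.
default_ace = default_lower.encode("idna")
turkish_ace = turkish_lower.encode("idna")

print(default_lower, default_ace)
print(turkish_lower, turkish_ace)
print("same lookup:", default_ace == turkish_ace)
```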

Script, language and IDNs

Q: What is the issue with German sharp s (ß) versus "ss"?

In German, the standard uppercase of ß is "SS", the same as the uppercase of "ss". Note, for example, that on the German language page for http://www.uni-giessen.de, "Gießen" is spelled with ß, but the logo for the university (see the top left corner of the page) is spelled with GIESSEN. The situation is even more complicated:

Q: What is the issue with Greek final sigma (ς)?

The Greek sigma (σ) takes a final form (ς) at the end of a word. In IDNs, where there are no spaces, a final form may show up in the middle of a label. Because both have the same uppercase form (Σ), labels with and without final sigma cannot be distinguished when presented to the user as uppercase (and IDNA2003 maps them together). The Root Zone Label Generation Rules for the Greek Script allow either (σ) or (ς) at a given position, but with a provision for bundling a label having all (σ) with any label containing a final sigma (ς). These are also considered variants in the Greek ccTLD.

Q: Aren't the problems with eszett and final sigma just the same as with l, I, and 1?

The eszett and sigma are fundamentally different than I (capital i), l (lowercase L), and 1 (digit one). With the following (using a digit 1), all browsers will go to the same location, whether they are old or new:

http://goog1e.com

In the following hypothetical example using a top-level domain "xx", browsers that use IDNA2003 will go to a different location than browsers that use a strict version of IDNA2008, unless the registry for xx puts into place a bundling strategy.

http://gießen.xx

The same goes for Greek sigma, which is a more common character in Greek than the eszett is in German.

Q: Why does IDNA2003 map final sigma (ς) to sigma (σ) and (ß) to "ss" and (i) to (ı)?

This decision about the mapping of these characters followed recommendations for case-insensitive matching in the Unicode Standard. These characters are anomalous: the uppercase of ς is Σ, the same as the uppercase of σ. Note that the text "ΒόλοΣ.com", which appears on http://Βόλος.com, illustrates this: the normal case mapping of Σ is to σ. If σ and ς were not treated as case variants in Unicode, there wouldn't be a match between ΒόλοΣ and Βόλος.

For full case insensitivity (with transitivity), {σ, ς, Σ}, {i, ı, I and İ} and {ss, ß, SS} need to be treated as equivalent, with one of each set chosen as the representative in the mapping. That is what is done in the Unicode Standard, which was followed by IDNA2003. While IDNA2003 did not have to have full case transitivity, that is water under the bridge.
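The Unicode case folding described here is available in Python as str.casefold(), which applies full case folding. The following sketch (the caseless_equal helper name is made up for illustration) shows the sharp s and sigma sets collapsing to one representative each.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings using Unicode full case folding."""
    return a.casefold() == b.casefold()

# {ss, ß, SS} all fold to "ss".
print("ß".casefold())                 # ss
print(caseless_equal("SS", "ß"))      # True

# {σ, ς, Σ} all fold to σ.
print("ς".casefold() == "σ")          # True
print(caseless_equal("Σ", "ς"))       # True

# Because of this, the uppercase-sigma spelling matches the final-sigma one.
volos_match = caseless_equal("ΒόλοΣ", "Βόλος")
print(volos_match)                    # True
```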

Q: How does IDNA2008 improve handling of Arabic and Hebrew (BIDI)?

Arabic and Hebrew writing systems are known as bidi (bidirectional) because text runs from right-to-left and numbers (or embedded Latin characters) from left-to-right. IDNA2008 does a better job of restricting labels that lead to "bidi label hopping". This is where bidi reordering causes characters from one label to appear to be part of another label. For example, "B1.2d" in a right-to-left paragraph (where B stands for an Arabic or Hebrew letter) would display as "1.2dB". For more information, see the Unicode bidi demo.

While these new bidi rules go a long way towards reducing this problem, they do not completely eliminate it because they do not check for inter-label problems.
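As a rough illustration of why "B1.2d" is problematic, the sketch below looks up bidi classes with Python's unicodedata module and checks a single condition from the IDNA2008 Bidi rule (RFC 5893): in a bidi domain name, each label must begin with a character of class L, R, or AL. It is deliberately partial; the full rule has further conditions, and the Hebrew alef is an arbitrary stand-in for "B".

```python
import unicodedata

def bidi_classes(label: str) -> list:
    """Pair each character with its Unicode bidi class."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in label]

def first_char_ok(label: str) -> bool:
    """Only the first condition of the RFC 5893 Bidi rule (a sketch,
    not a full validator): the label must start with L, R, or AL."""
    return unicodedata.bidirectional(label[0]) in {"L", "R", "AL"}

# "B1.2d" with U+05D0 HEBREW LETTER ALEF standing in for B.
domain = "\u05d01.2d"

for label in domain.split("."):
    print(bidi_classes(label), "first character ok:", first_char_ok(label))
```

The second label starts with a digit (class EN), so it fails this condition when it appears in a domain name that also contains a right-to-left label.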

Q: Why allow ZWJ/ZWNJ at all?

ZWJ and ZWNJ are normally invisible, which allows them to be used for a variety of spoofs. Invisible characters (like these and soft-hyphen) are allowed on input in IDNA2003, but are deleted so that they do not allow spoofs. During the development of Unicode, the ZWJ and ZWNJ were intended only for presentation — that is, they would make no difference in the semantics of a word.

However, in some cases, what used to be presentational alternatives became semantically distinct. For example, there are words such as the Sinhala name of the country of Sri Lanka (ශ්‍රී ලංකාව), which require preservation of these joiners (in this case, ZWJ) to achieve the correct spelling. The Root Zone excludes the ZWJ because of the heightened security sensitivity for the Root Zone. However, the Reference Label Generation Rules for the Sinhala Script published by ICANN for use on the second level allow the use of ZWJ, but only in the context of a few dozen explicitly enumerated combinations.

Q: But aren't the deviation characters needed for the orthographies of some languages?

While these are full parts of the orthographies of the languages in question, neither IDNA2003 nor IDNA2008 ever claimed that all parts of every language's orthographies are representable in domain names. There are trivial examples even in English, like the word can't (vs cant) or Wendy's/Arby's Group, which use standard English orthography but cannot be represented faithfully in a domain name.

Q: Are there registries that restrict domain names on the basis of language?

While it may be difficult to find a clear cutoff for restricting IDNs on the basis of language, there are many registries that have language-specific registration policies, and ICANN publishes a set of language-specific Reference LGRs for the Second Level.

The main concern is that the set of letters used in a particular language is not well defined. The "core" letters typically are, but many additional ones may be accepted in loan words, and have perfectly legitimate commercial and social use. Sometimes the same language used in different regions may use different letters; other times, the interest may be more in supporting a particular country than a specific language. The latter applies to many ccTLDs. In all of these cases, the allowed repertoire may not be strictly language-based but will be a subset of a full script's repertoire.

Q: Are there registries that restrict domain names on the basis of script?

It is a bit easier to maintain a clear distinction based on script differences between characters: every Unicode character has a defined script (or is Common/Inherited). However, some languages, such as Japanese, require multiple scripts, and in such cases mixtures of scripts may be appropriate. One can have http://SONY日本.com with no problems at all, while there are many cases of "homographs" (visually confusable characters) within the same script that a restriction based on script doesn't deal with.

As one prominent example, the DNS Root Zone supports domain names on the basis of script: with few exceptions for inherently multi-script writing systems each label must be in a single script. However, labels from different scripts share the Root Zone. The issue of true homographs within and across scripts is addressed not by repertoire restriction but by mutual exclusion via definition of variants.

Q: What is a recommended script?

UAX #31 defines a number of scripts as "Recommended" for use in identifiers. All of these scripts are in "widespread common everyday use" by large communities for writing modern languages, and are actively used by the respective user communities to conduct their ordinary and daily online business. Where there is some, but not a sufficient, level of use, a script may be designated "Limited_Use". This classification neither absolutely prevents use of these scripts for IDNs in any zone other than the DNS Root Zone, nor does it affect the use of the script in creating online content. Scripts limited to specialized use only, like archaic scripts, are classified as "Excluded".

Q: Can additional scripts become recommended?

Which scripts are used to write a given language may change over time. Whether a script is recommended or not is not frozen in time, so Unicode is able to track such changes in usage. A number of scripts that are classified as Limited_Use have the potential to become recommended, if at some point in the future their observed and documented level of usage rises to the level of "widespread common everyday use". Any suggestion to make such a change would need to be accompanied by thorough documentation of pervasive online use of the script in daily life. Unlike the use of a script to publish dictionaries or otherwise digitally preserve a written culture, the use of IDNs is to facilitate day-to-day online interactions by users of the script. Therefore, the degree to which such a language community engages in online transactions using that script is the most important data point.

Q: Should the IDNA protocol restrict allowed domain names on the basis of language or script?

The rough consensus among the IETF IDNA working group is that script/language mixing restrictions are not appropriate for the lowest-level protocol. So in this respect, IDNA2008 is no different than IDNA2003. IDNA doesn't try to attack the homograph problem, because it is too difficult to maintain a clear distinction. Effective solutions depend on information or capabilities outside of the protocol's control, such as language restrictions appropriate for a particular registry, the language of the user looking at this URL, the ability of a UI to display suspicious URLs with special highlighting, and so on.

Responsible registries can apply such restrictions. For example, a country-level registry can decide on a restricted set of characters appropriate for that country's languages. Application software can also take certain precautions: Microsoft Edge, Safari, Firefox, and Chrome all display domain names in Unicode only if the user's language(s) typically use the scripts in those domain names. For more information on the kinds of techniques that implementations can use, see UTR #36: Unicode Security Considerations on the Unicode web site.

Implementation Issues and Strategies for IDN

Q: Are there differences in mapping between UTS #46 and IDNA2003?

No. There are, however, 56 characters that are valid or mapped under IDNA2003, but are disallowed by UTS #46. For a detailed table of differences between UTS #46 and IDNA2008, see Section 7, IDNA Comparison in UTS #46.

In particular, there are collections of characters that would have changed mapping according to NFKC_Casefold after Unicode 3.2, unless they were specifically excluded. All of these characters are extremely rare, and do not require any special handling.

Case Pairs. These are characters that did not have corresponding lowercase characters in Unicode 3.2, but had lowercase characters added later.

U+04C0 ( Ӏ ) CYRILLIC LETTER PALOCHKA
U+10A0 ( Ⴀ ) GEORGIAN CAPITAL LETTER AN…U+10C5 ( Ⴥ ) GEORGIAN CAPITAL LETTER HOE
U+2132 ( Ⅎ ) TURNED CAPITAL F
U+2183 ( Ↄ ) ROMAN NUMERAL REVERSED ONE HUNDRED

After Unicode 3.2, the Unicode Consortium has stabilized case folding, so that further examples will not occur in the future. That is, case pairs will be assigned in the same version of Unicode—so any newly assigned character will either have a case folding in that version of Unicode, or it will never have a case folding in the future.
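Python happens to ship both a current Unicode database and a frozen Unicode 3.2 one (unicodedata.ucd_3_2_0, kept for IDNA2003 support), which makes these case pairs easy to inspect. This is a sketch under the assumption that the running Python's Unicode version is recent enough to include the later-added lowercase letters.

```python
import unicodedata
from unicodedata import ucd_3_2_0

# One representative from each entry in the list of case pairs.
uppercase_samples = ["\u04c0", "\u10a0", "\u2132", "\u2183"]

report = []
for upper in uppercase_samples:
    lower = upper.lower()  # uses the current Unicode database
    report.append(
        (
            f"U+{ord(upper):04X}",
            f"U+{ord(lower):04X}",
            ucd_3_2_0.category(lower),    # "Cn" means unassigned in 3.2
            unicodedata.category(lower),  # category in the current version
        )
    )

for row in report:
    print(row)
```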

Normalization Mappings. These are five characters whose normalizations changed after Unicode 3.2 (all of them were in Unicode 4.0.0: see Corrigendum #4: Five Unihan Canonical Mapping Errors). As of Unicode 5.1, normalization is completely stabilized, so these are the only such characters.

Q: What are possible strategies for preparing IDNs in a display form preferred by target sites?

Labels presented to a browser may or may not be in the display form preferred by a target site. For example, a site may have a preferred display form of “HumanEvents.com”, but an href tag in another site may display “HumaneVents.com”. Similarly, a user may type “Floß.com” in the browser’s address bar, and that would resolve to the site “floss.com”, though it is unclear whether the display form preferred by owners of that site is “Floss.com”, “floss.com”, “Floß.com”, or “floß.com”. There is no way currently for the browser to know whether the labels are in a preferred form or not.

It may be useful to develop mechanisms to allow browsers to determine the display form preferred by a target site, and then for browsers to display that form. One could foresee something being developed along the lines of the favicon approach. The mechanisms would need to have restrictions put into place to address misrepresentations. For example, the browser should verify that the site's preferred display form has the same lookup form: if the href is "http://βόλοσ.com", and the site's preferred display form is "http://Βόλος.com", then the preferred display form could be used; if the site's preferred display form is "http://Βόλλος.com", then it would not be used, because it doesn't have the same lookup form as the href. Other security checks should be made, such as to prevent display forms like "appIe.com" (with a capital I) for "appie.com" (with a lowercase i).

Q: How are label delimiters handled in implementations of IDNA?

The processing of UTS #46 matches what is commonly done with label delimiters by browsers, whereby characters containing periods are transformed into the NFKC format before labels are separated. This allows the domain name to be mapped in a single pass, rather than label by label. However, except for the four label separators provided by IDNA2003, all input characters that would map to a period are disallowed. For example, U+2488 ( ⒈ ) DIGIT ONE FULL STOP has a decomposition that maps to a period, and is thus disallowed. The exact list of characters can be seen with the Unicode utilities using a regular expression:

https://www.unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{toNFKC=/\./}
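A comparable list can be approximated offline with Python's unicodedata module. This is a sketch, not the UTS #46 data: the result depends on the Unicode version bundled with the running Python, and the four label separators are hard-coded here from IDNA2003.

```python
import sys
import unicodedata

# The four label separators recognized by IDNA2003.
LABEL_SEPARATORS = {"\u002e", "\u3002", "\uff0e", "\uff61"}

def maps_to_period(ch: str) -> bool:
    """True if the NFKC form of the character contains a FULL STOP."""
    return "." in unicodedata.normalize("NFKC", ch)

# Scan all code points, skipping surrogates and the label separators.
period_like = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if not 0xD800 <= cp <= 0xDFFF
    and chr(cp) not in LABEL_SEPARATORS
    and maps_to_period(chr(cp))
]

for ch in period_like:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
```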

The question also arises as to how to handle escaped periods (such as %2E). While escaping of periods is outside of the scope of this document, it is useful to see how both of these cases are handled in current browsers:

| Input | http://à%2Ecom | %2E | http://à⒈com | ⒈ |
| --- | --- | --- | --- | --- |
| Microsoft Edge | http://xn--0ca.com/ | = "." | http://xn--1-rfa.com/ | = "1." |
| Firefox | http://www.xn--.com-hta.com/ | ≠ "." | http://xn--1-rfa.com/ | = "1." |
| Safari / Chrome | http://xn--0ca.com/ | = "." | http://xn--1.com-qqa/ | ≠ "1." |

There are three possible behaviors for characters such as U+2488 ( ⒈ ) DIGIT ONE FULL STOP:

  1. The dot behaves like a label separator.
  2. The character is rejected.
  3. The dot is included in the label, as shown in the garbled punycode seen above in the ≠ cases.

The conclusion of the Unicode Technical Committee was that the best behavior for UTS #46 was #2, to forbid all characters (other than the 4 label separators) that contained a FULL STOP in their compatibility decompositions. This is the same behavior as IDNA2003. Although this policy is not the current policy of the majority of browser implementations, the browser vendors agreed that the change is desirable.

Q: For IDNA2008, what is the derivation of valid characters in terms of Unicode properties?

Using formal set notation, the following describes the set of allowed characters defined by IDNA2008. This set corresponds to the union of the PVALID, CONTEXTJ, and CONTEXTO characters defined by the Tables document of IDNA2008. Unicode provides the result of this derivation for every applicable version of the standard including the currently defined exceptions.

| Formal Sets | Descriptions |
| --- | --- |
| [ \P{Changes_When_NFKC_Casefolded} | Start with characters that are NFKC Case folded (as in IDNA2003) |
| - \p{c} - \p{z} | Remove Control Characters and Whitespace (as in IDNA2003) |
| - \p{s} - \p{p} - \p{nl} - \p{no} - \p{me} | Remove Symbols, Punctuation, non-decimal Numbers, and Enclosing Marks |
| - \p{HST=L} - \p{HST=V} - \p{HST=T} | Remove characters used for archaic Hangul (Korean) |
| - \p{block=Combining_Diacritical_Marks_For_Symbols} - \p{block=Musical_Symbols} - \p{block=Ancient_Greek_Musical_Notation} | Remove three blocks of technical or archaic symbols. |
| - [\u0640 \u07FA \u302E \u302F \u3031-\u3035 \u303B] | Remove certain exceptions: U+0640 ( ـ ) ARABIC TATWEEL; U+07FA ( ߺ ) NKO LAJANYALAN; U+302E ( 〮 ) HANGUL SINGLE DOT TONE MARK; U+302F ( 〯 ) HANGUL DOUBLE DOT TONE MARK; U+3031 ( 〱 ) VERTICAL KANA REPEAT MARK; U+3032 ( 〲 ) VERTICAL KANA REPEAT WITH VOICED SOUND MARK; ..; U+3035 ( 〵 ) VERTICAL KANA REPEAT MARK LOWER HALF; U+303B ( 〻 ) VERTICAL IDEOGRAPHIC ITERATION MARK |
| + [\u00B7 \u0375 \u05F3 \u05F4 \u30FB] + [\u002D \u06FD \u06FE \u0F0B \u3007] | Add certain exceptions: U+00B7 ( · ) MIDDLE DOT; U+0375 ( ͵ ) GREEK LOWER NUMERAL SIGN; U+05F3 ( ׳ ) HEBREW PUNCTUATION GERESH; U+05F4 ( ״ ) HEBREW PUNCTUATION GERSHAYIM; U+30FB ( ・ ) KATAKANA MIDDLE DOT; plus U+002D ( - ) HYPHEN-MINUS; U+06FD ( ۽ ) ARABIC SIGN SINDHI AMPERSAND; U+06FE ( ۾ ) ARABIC SIGN SINDHI POSTPOSITION MEN; U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG; U+3007 ( 〇 ) IDEOGRAPHIC NUMBER ZERO |
| + [\u00DF \u03C2] + \p{JoinControl} ] | Add special exceptions (Deviations): U+00DF ( ß ) LATIN SMALL LETTER SHARP S; U+03C2 ( ς ) GREEK SMALL LETTER FINAL SIGMA; U+200C ZERO WIDTH NON-JOINER; U+200D ZERO WIDTH JOINER |
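The derivation can be partly mirrored with Python's unicodedata module. The sketch below is an approximation under stated assumptions, not the normative computation: Changes_When_NFKC_Casefolded is estimated from NFKC plus casefold() (so default-ignorable characters such as variation selectors are not handled), and the archaic Hangul and three-block removals are omitted because the standard library exposes neither Hangul_Syllable_Type nor block data. The function names are made up for illustration.

```python
import unicodedata

# Exception sets copied from the formal description.
REMOVED = set("\u0640\u07fa\u302e\u302f\u303b") | {chr(c) for c in range(0x3031, 0x3036)}
ADDED = set("\u00b7\u0375\u05f3\u05f4\u30fb\u002d\u06fd\u06fe\u0f0b\u3007")
DEVIATIONS = set("\u00df\u03c2\u200c\u200d")  # sharp s, final sigma, ZWNJ, ZWJ

def changes_when_nfkc_casefolded(ch: str) -> bool:
    """Approximation (assumption): NFKC + case folding changes the character."""
    folded = unicodedata.normalize("NFKC", unicodedata.normalize("NFKC", ch).casefold())
    return folded != ch

def roughly_idna2008_allowed(ch: str) -> bool:
    """Partial sketch of the PVALID/CONTEXTJ/CONTEXTO union for one character."""
    if ch in ADDED or ch in DEVIATIONS:
        return True
    if ch in REMOVED:
        return False
    category = unicodedata.category(ch)
    # Remove controls/unassigned (C*), separators (Z*), symbols (S*),
    # punctuation (P*), letter and other numbers, and enclosing marks.
    if category[0] in "CZSP" or category in {"Nl", "No", "Me"}:
        return False
    return not changes_when_nfkc_casefolded(ch)

for sample in ["a", "A", "1", "\u221a", "\u00df", "-", "\u0640", "\u200d", " "]:
    print(f"U+{ord(sample):04X}", roughly_idna2008_allowed(sample))
```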