Internationalized Domain Names (IDN) FAQ
Internationalized Domain Names
Q: What is an Internationalized Domain Name (IDN)?
Domain names, such as "macchiati.blogspot.com", were originally designed only to support ASCII characters. In 2003, the first specification was released that allows most Unicode characters to be used in domain names. This specification was replaced a few years later by IDNA2008, which differs in some points. IDNs are supported by all modern browsers and email programs, so people can use links in their native languages, such as http://Bücher.de.
Q: Do IDNs change the Domain Name System (DNS)?
No. Internally, the non-ASCII Unicode characters are transformed into a special sequence of ASCII characters. So as far as the DNS system is concerned, all domain names are just ASCII.
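As a minimal sketch of that transformation, Python's built-in `idna` codec can be used to convert a name to its ASCII form and back. Note that this codec implements the original IDNA2003 rules, not IDNA2008, and that the helper names `to_ascii` and `to_unicode` are illustrative only.

```python
def to_ascii(domain: str) -> str:
    """Return the all-ASCII form of a domain name, as seen by the DNS."""
    # The stdlib "idna" codec applies the IDNA2003 mapping to each non-ASCII
    # label, then encodes it with Punycode and prefixes it with "xn--".
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII-compatible domain name back to its Unicode form."""
    return domain.encode("ascii").decode("idna")


print(to_ascii("Bücher.de"))           # xn--bcher-kva.de
print(to_unicode("xn--bcher-kva.de"))  # bücher.de
```

Labels that are already ASCII, such as "de", pass through unchanged.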
Q: When will IDNs be available?
IDNs have been defined and in use since 2003, initially under a system called "IDNA2003". In 2010 a revised protocol was released as "IDNA2008". At that time, ICANN also began to make Internationalized Domain Names available for top level domains, like the "org" in "unicode.org". Since then these can also use non-ASCII characters, for example, ".世界" (.world). Across all domains, over 8 million IDNs had been registered worldwide by 2020, most of them in Chinese, Latin or Cyrillic.
Q: What is IDNA2008?
It is a revision of IDNA2003, approved in 2010. For most Unicode characters it produces the same results as IDNA2003, but there are important classes of characters for which it is not backwards compatible with IDNA2003. ICANN requires the use of IDNA2008 in the Root Zone and any top-level domain which is under contract with ICANN. These guidelines discuss some of the issues for a transition from IDNA2003 to IDNA2008 from the perspective of a domain name registry.
Q: How does IDNA2008 differ from IDNA2003?
It disallows about eight thousand characters that used to be valid, including all uppercase characters, full/half-width variants, symbols, and punctuation. It also interprets four characters differently.
Q: Which four characters are interpreted differently in IDNA2008?
Four characters can cause an IDNA2008 implementation to go to a different web page than an IDNA2003 implementation, given the same source, such as href="http://faß.de". These four characters include some that are quite common in languages such as German, Greek, Farsi, and Sinhala:
U+00DF ( ß ) LATIN SMALL LETTER SHARP S
U+03C2 ( ς ) GREEK SMALL LETTER FINAL SIGMA
U+200C ( ) ZERO WIDTH NON-JOINER
U+200D ( ) ZERO WIDTH JOINER
For the purposes of discussion of differences between IDNA versions, these characters are called "deviations".
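The effect of a deviation can be illustrated with a short sketch. It assumes Python's built-in `idna` codec as the IDNA2003 side (it maps ß to "ss"), and approximates the IDNA2008 side by Punycode-encoding the label without any mapping. The function names are hypothetical, and no IDNA2008 validity checking is performed.

```python
def idna2003_label(label: str) -> str:
    """ASCII form of a label under the stdlib IDNA2003 codec (maps ß to ss)."""
    return label.encode("idna").decode("ascii")


def unmapped_ace_label(label: str) -> str:
    """ASCII form of a label that is encoded exactly as written, without mapping."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


print(idna2003_label("faß"))      # fass
print(unmapped_ace_label("faß"))  # xn--fa-hia
```

The two results are different DNS names, which is why the same link can lead to different sites.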
Q: What characters are valid in IDNA2008?
The validity and status of characters under IDNA2008 is determined algorithmically from Unicode character properties (with certain exceptions applied). Unicode provides the result of this computation for every applicable version of the standard including the currently defined exceptions. If IETF publishes additional exceptions in the future, these will be reflected going forward.
Migration from IDNA2003
Q: What is UTS #46?
UTS #46: Unicode IDNA Compatibility Processing, also sometimes referred to as "TR46", is a Unicode specification that allows implementations to handle domain names compatibly during the transition from IDNA2003 to IDNA2008.
UTS #46 also provides a preprocessing specification for mapping that can be used with a standard IDNA2008 implementation.
Q: Is UTS #46 an IETF publication?
No. IDNA2008 is an IETF specification, while UTS #46 is a specification of the Unicode Consortium.
Q: Why is UTS #46 necessary?
Browsers and other client software need to support existing pages, which were constructed under the IDNA2003 interpretation of international domain names. They also need to continue to meet their users' expectations, such as being able to type in IDNs with capital letters, or to use the ideographic period in Japanese or Chinese domain names. In particular, the 4 "deviation" characters have the potential to cause significant security and usability problems; they and symbols can be phased out over time, but need some transitional support.
UTS #46 provides a compatibility bridge that allows implementations to handle both IDNA2003 and IDNA2008 domain names. For the specification and more background information, see UTS #46.
Q: What are examples of IRIs where characters behave differently under IDNA2008?
Here is a table showing internationalized domain names in the context of IRIs, illustrating the differences in characters:
URL | IDNA2003 | UTS #46 | IDNA2008 | Comments |
---|---|---|---|---|
href="http://öbb.at" | Valid | Valid | Valid | Simple characters |
href="http://ÖBB.at" | Valid† | Valid† | Disallowed | Case mapping is not part of IDNA2008 |
href="http://√.com" | Valid | Valid | Disallowed | Symbols are disallowed in IDNA2008 |
href="http://faß.de" | Valid† | Valid† | Valid | Deviation (different resulting IP address in IDNA2008) |
href="http://ԛәлп.com" | Valid‡ | Valid | Valid | IDNA2003 only allows Unicode 3.2 characters, excluding U+051B (ԛ) cyrillic qa |
href="http://Ⱥbby.com" | Valid‡ | Valid† | Disallowed | IDNA2003 only allows Unicode 3.2 characters, excluding U+023A ( Ⱥ ) latin A with stroke; Case mapping is not part of IDNA2008 |
† Mapped to different characters, e.g., lowercase.
‡ Note that the Unicode characters after 3.2 were valid on lookup, but not for registration.
For a more detailed account of the similarities and differences, with character counts, see Section 7, IDNA Comparison in UTS #46. For a demonstration of differences between IDNA2003, IDNA2008, and the Unicode IDNA Compatibility Processing, see the IDNA demo.
Q: What are the main advantages of IDNA2008?
The main advantages are:
- Updates the repertoire of allowed characters from Unicode 3.2 to later versions.
- Makes the process of updating to future Unicode versions (mostly) automatic
- Allows needed sequences (combining marks at the end of a bidi label)
- Improves BIDI restrictions (Arabic/Hebrew)
- Clarifies that it is the unmapped form of a domain name that is registered
- Makes it clear exactly what strings can be registered
The classification of characters under IDNA2008 is based on a combination of Unicode properties, so implementations can compute them for all Unicode characters. The original IDNA2008 published tables based on Unicode 5.2, but also included some exceptional classification. A series of additional RFCs publishes reviews of and computed tables for later Unicode versions. See RFC8753 as well as the FAQ entry "What characters are valid in IDNA2008?".
Q: What are issues of migrating to IDNA2008?
If IDNA2003 had not existed, there would be no migration issues for IDNA2008. Given that IDNA2003 does exist, and is still widely deployed, the following issues should be noted:
- Changes the interpretation of the 4 characters known as Deviations
- Discontinues IDNA2003 case mappings and mappings for other variants
- Excludes symbols and punctuation
- Allows arbitrary 'local' mappings, which may result in the same IRI resolving to different IP addresses, depending on the mapping used
Mitigating Security Concerns for IDNs
Q: What are typical security concerns introduced by IDN?
IDNs allow a much wider range of character shapes than ASCII, as well as scripts that have complex and nonlinear rendering. Here are some of the concerns that are specific to, or more prominent in IDNs:
- IDNs include a wide range of character shapes, including characters that may be:
  - identical in appearance, even inside the same script
  - from an unfamiliar language or script, yet look (exactly) like familiar characters
  - historic, obsolete, rarely used or limited to a special domain of usage
- IDNs include complex scripts, which add character sequences that:
- IDNs include scripts written right to left which:
  - may lead to re-ordering of the domain name in display
- IDNs include languages that have issues not found in English:
  - Different countries use different characters for the same letter in the same language
  - Different countries use different characters with equivalent meaning in the same language
  - Some languages have alternate representations for the same letter or syllable, where both are equally acceptable
  - Users expect access to the same resource under either choice of character
Q: What are typical ways to mitigate security concerns introduced by IDN?
- Limit the repertoire to characters in wide-spread modern use
- Only support recommended scripts (See UAX#31)
- Allow only those combining character sequences in actual use
- Add context constraints to ensure character sequences follow the structure of the script
- Prevent mixing of scripts in the same IDN label
- Prevent mixing of regional or language variants of some characters in the same IDN label
- Prevent registration of in-script or cross-script confusable labels
- Strictly limit the use of invisible joiners (ZWJ or ZWNJ) to where they are absolutely needed
RFC7940 provides the framework for a machine-readable implementation of these mitigation steps. For an example of policies intended to reduce the risk in a multi-script zone, see Root Zone Label Generation Rules.
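One of these steps, preventing script mixing within a label, can be sketched as follows. This is only a heuristic: Python's standard library does not expose the Unicode Script property, so the sketch assumes that the first word of a letter's character name identifies its script. A production check would use the Script and Script_Extensions data instead. The function names are illustrative.

```python
import unicodedata


def letter_scripts(label: str) -> set:
    """Approximate the set of scripts used by the letters in a label."""
    scripts = set()
    for ch in label:
        # Only letters are considered; digits and hyphens are shared by all scripts.
        if unicodedata.category(ch).startswith("L"):
            # Heuristic: "LATIN SMALL LETTER P" -> "LATIN"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(label: str) -> bool:
    """True if the letters in the label appear to come from more than one script."""
    return len(letter_scripts(label)) > 1


print(sorted(letter_scripts("paypal")))       # ['LATIN']
print(sorted(letter_scripts("p\u0430ypal")))  # ['CYRILLIC', 'LATIN']
print(is_mixed_script("p\u0430ypal"))         # True
```

A check of this kind does not catch labels that are entirely in one script but still look like a label in another script.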
Q: What is a confusable IDN label?
Two labels are confusable if users accept one for the other. Confusable or look-alike appearance may exist between characters, combinations of characters, or both. Confusable characters may exist inside the same script, or across scripts. Disallowing mixed-script labels cuts down on possible combinations, but it is possible to create single-script labels that are confusable across scripts.
Examples of confusable characters | |
---|---|
è U+00E8 Latin | ѐ U+0450 Cyrillic |
ə U+0259 Latin | ǝ U+01DD Latin |
For more information, see UTS #39, "Unicode Security Mechanisms".
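To see that look-alike characters really are distinct to software, the following sketch prints the code point and name of each member of two look-alike pairs (Latin and Cyrillic e with grave, and Latin schwa and turned e), and checks whether Unicode normalization unifies them. The pair list and the helper `describe` are illustrative.

```python
import unicodedata

# (character, look-alike) pairs: one cross-script pair and one in-script pair
PAIRS = [("\u00E8", "\u0450"), ("\u0259", "\u01DD")]


def describe(ch: str) -> str:
    """Format a character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def unified_by_nfkc(a: str, b: str) -> bool:
    """True if compatibility normalization maps both characters to the same string."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


for a, b in PAIRS:
    print(describe(a), "|", describe(b), "| unified by NFKC:", unified_by_nfkc(a, b))
```

Neither pair is unified by normalization, so labels that differ only in these characters are different domain names. The second pair is entirely Latin, so a script-mixing restriction would not catch it either.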
Q: How did the treatment of symbols like √ change from IDNA2003 to IDNA2008?
While http://√.com is valid in an IDNA2003 implementation, it would fail on an IDNA2008 implementation. At the time IDNA2008 was introduced, this affected 3,254 characters, most of which are rarely used. A small percentage of those are security risks because of confusability. The vast majority are unproblematic: for example, having http://I♥NY.com doesn't cause security problems. IDNA2008 has additional tests that are based on the context in which characters are found, but they apply to few characters, and don't provide any appreciable increase in security.
The issue may be different for newer symbols like emoji, particularly because the appearance for an emoji is not specified and users cannot know whether some image that looks a bit different is a different emoji or just a difference in rendition.
Q: What makes emoji particularly unsuited for IDNs?
In 2017, the ICANN Security and Stability Advisory Committee (SSAC) released an Advisory on the Use of Emoji in Domain Names. In it, SSAC members conclude that emoji are fundamentally at odds with the way the DNS is designed to be an "exact match lookup" system. As key findings, they cite the fact that different emoji may look highly similar to users because they lack a conventional, agreed-upon appearance, such as that exhibited by letters in an alphabet, and that their appearance is not prescribed or regulated by the emoji specification. Additionally, they note that modifiers, such as for skin tone or color, as well as the combinations, make the set somewhat open-ended and subject to inconsistent implementation across devices. Taken together, these findings lead them to reject emoji outright for TLDs and to strongly warn against their use anywhere.
Q: What are typical security exploits?
The vast majority of security exploits are of the form "security-wellsfargo.com", where no special characters are involved. For more information, see Stéphane Bortzmeyer's blog entry, idn-et-phishing (in French). The most interesting studies cited there (originally from Mike Beltzner of Mozilla) are:
- Decision Strategies and Susceptibility to Phishing by Downs, Holbrook & Cranor
- Why Phishing Works by Dhamija, Tygar & Hearst
- Do Security Toolbars Actually Prevent Phishing Attacks by Wu, Miller & Garfinkel
- Phishing Tips and Techniques by Gutmann.
Even among the fraction that are confusable characters, merely limiting the allowed characters to letters and digits doesn't by itself do anything about the most frequent sources of character-based spoofing: look-alike characters that are both letters, like "http://paypal.com" with a Cyrillic "a".
According to data from Google, the removal of symbols and punctuation in IDNA2008 reduces opportunities for spoofing by only about 0.000016%, weighted by frequency. In another study at Google of a billion web pages, the top 277 confusable URLs used confusable letters or numbers, not symbols or punctuation. The 278th page had a confusable URL with × (U+00D7 MULTIPLICATION SIGN - by far the most common confusable); but that page could be even better spoofed with х (U+0445 CYRILLIC SMALL LETTER HA), which normally has precisely the same displayed shape as "x".
For a demo of confusable characters, and the effects of various restrictions, see the confusables demo.
This points to the need to carefully consider spoofing issues within the repertoire of letters allowed for registration for a given zone. Removing rarely used, and therefore unfamiliar, letter forms is one strategy. Another is to use the mechanisms of RFC7940 to create a machine-readable specification for limiting sequences of letters that are structurally invalid for a script and liable to be rendered like a different, valid sequence. A third strategy is to define characters that are variants of each other. This information can be used to enforce mutual exclusion of labels containing look-alike characters, whether in-script or cross-script.
Programmers also need to be aware of the issues detailed in UTR #36: Unicode Security Considerations; the mechanisms for detecting potentially visually confusable characters are found in the associated UTS #39: Unicode Security Mechanisms.
Q: Are the local mappings in IDNA2008 just a UI issue?
No, not if what is meant is that they are only involved in interactions with the address bar.
Example:
- Alice sees that a URL works in her browser (say http://faß.de or http://TÜRKIYE.com). She sends it to Bob in an email. Bob clicks on the link in his email, and doesn't find a site or goes to a wrong (and potentially malicious) site, because his browser maps to http://fass.de or http://türkiye.com while Alice's maps to http://faß.de or http://türkıye.com.
There are parallel examples with web pages, IM chats, Word documents, etc.
- Alice creates a web page, using <a href="http://faß.de"> (or http://TÜRKIYE.com). Bob clicks on the link in the web page, and doesn't find a site or goes to a wrong (and potentially malicious) site.
- Alice is in an IM chat with Bob. She copies in http://faß.de (or http://TÜRKIYE.com) and hits return. Bob clicks on the link he sees in his chat window, and doesn't find a site or goes to a wrong (and potentially malicious) site.
- Alice sends a Word document to Bob with a link in it...
- Alice creates a PDF document...
Q: Do the local-mapping exploits require unscrupulous registries?
No. The exploits do not require unscrupulous registries; they only require that registries fail to police every URL that they register for possible spoofing behavior.
The local mappings matter to security, because entering the same URL on two different browsers may go to two different IP addresses when the two browsers have different local mappings. The same thing could happen within an emailer that is parsing for URLs, and then opening a browser. Moreover, they are even more problematic if they affect the interpretation of web pages, in cases such as href="http://TÜRKIYE.com".
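The TÜRKIYE case can be sketched in a few lines. The two functions below stand in for two hypothetical clients: one that lowercases with the default, locale-independent rules, and one that applies Turkish rules, where uppercase I lowercases to dotless ı. The helper `ace` simply Punycode-encodes the result without any validity checks. All names are illustrative.

```python
def default_lowercase(label: str) -> str:
    """Locale-independent lowercasing: I -> i."""
    return label.lower()


def turkish_lowercase(label: str) -> str:
    """Turkish lowercasing: I -> dotless ı, İ -> i."""
    return label.replace("I", "\u0131").replace("\u0130", "i").lower()


def ace(label: str) -> str:
    """ASCII-compatible form of an already-mapped label (no validity checks)."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


name = "T\u00DCRKIYE"
for mapping in (default_lowercase, turkish_lowercase):
    mapped = mapping(name)
    print(mapping.__name__, mapped, ace(mapped))
```

The two clients produce different ASCII names from the same visible link, so they would query the DNS for different domains.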
Script, language and IDNs
Q: What is the issue with German sharp s (ß) versus "ss"?
In German, the standard uppercase of ß is "SS", the same as the uppercase of "ss". Note, for example, that on the German language page for http://www.uni-giessen.de, "Gießen" is spelled with ß, but the logo for the university (see the top left corner of the page) is spelled with GIESSEN. The situation is even more complicated:
- In Switzerland, "ss" is uniformly used instead of ß.
- The recent spelling reform in Germany and Austria changed whether ß or "ss" is used in many words. For example, http://Schloß.de was the spelling before 1996, and http://Schloss.de is the spelling after.
- There are a number of word pairs in German where the distinction between ß or "ss" is the only difference; there are also words that have both in the same word. Examples: Masse, Maße, Massenzusammenstoß
- In Unicode 5.1, an uppercase version of ß was added (ẞ). While it has since been officially recognized as an alternate uppercase, it is not currently the preferred uppercase of ß in German standards, nor is it known whether it will ever become the preferred uppercase. Unicode now treats all of these as a single equivalence class for case-insensitive matching: {ss, ß, SS, ẞ}. See also the Unicode FAQ.
- The German NIC (responsible for .de) has supported separate registration of domains with ß and with "ss" since 2010.
- The Austrian NIC (responsible for .at) favored keeping the mapping from ß to "ss" and does not allow ß.
- Some of the new gTLD registries are treating ß and "ss" as variants to ensure proper resolution of names that would otherwise be mapped to "ss".
- The Root Zone Label Generation Rules for the Latin Script, published by ICANN in 2022, allow either ß or "ss" at a given position, but with a provision for bundling a label having all "ss" with any label containing ß.
Q: What is the issue with Greek final sigma (ς)?
The Greek sigma (σ) takes a final form (ς) at the end of a word. In IDNs, where there are no spaces, a final form may show up in the middle of a label. Because both have the same uppercase form (Σ), labels with and without final sigma cannot be distinguished when presented to the user as uppercase (and IDNA2003 maps them together). The Root Zone Label Generation Rules for the Greek Script allow either (σ) or (ς) at a given position, but with a provision for bundling a label having all (σ) with any label containing a final sigma (ς). These are also considered variants in the Greek ccTLD.
Q: Aren't the problems with eszett and final sigma just the same as with l, I, and 1?
The eszett and sigma are fundamentally different than I (capital i), l (lowercase L), and 1 (digit one). With the following (using a digit 1), all browsers will go to the same location, whether they are old or new:
http://goog1e.com
In the following hypothetical example using a top-level domain "xx", browsers that use IDNA2003 will go to a different location than browsers that use a strict version of IDNA2008, unless the registry for xx puts into place a bundling strategy.
http://gießen.xx
The same goes for Greek sigma, which is a more common character in Greek than the eszett is in German.
Q: Why does IDNA2003 map final sigma (ς) to sigma (σ) and (ß) to "ss" and (i) to (ı)?
This decision about the mapping of these characters followed recommendations for case-insensitive matching in the Unicode Standard. These characters are anomalous: the uppercase of ς is Σ, the same as the uppercase of σ. Note that the text "ΒόλοΣ.com", which appears on http://Βόλος.com, illustrates this: the normal case mapping of Σ is to σ. If σ and ς were not treated as case variants in Unicode, there wouldn't be a match between ΒόλοΣ and Βόλος.
For full case insensitivity (with transitivity), {σ, ς, Σ}, {i, ı, I and İ} and {ss, ß, SS} need to be treated as equivalent, with one of each set chosen as the representative in the mapping. That is what is done in the Unicode Standard, which was followed by IDNA2003. While IDNA2003 did not have to have full case transitivity, that is water under the bridge.
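The equivalence classes for sigma and sharp s can be observed with Python's `str.casefold()`, which applies the default Unicode case folding. This is a small sketch; the function name is illustrative, and Python's default folding is locale-independent, so it does not merge dotted and dotless i.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings using default Unicode case folding."""
    return a.casefold() == b.casefold()


print(same_ignoring_case("ΒόλοΣ", "Βόλος"))     # True: Σ, σ and ς fold together
print(same_ignoring_case("Gießen", "GIESSEN"))  # True: ß folds to ss
print(same_ignoring_case("Masse", "Maße"))      # True, although these are different words
```

The last line shows the cost of full case insensitivity: two distinct German words collapse into one form.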
Q: How does IDNA2008 improve handling of Arabic and Hebrew (BIDI)?
Arabic and Hebrew writing systems are known as bidi (bidirectional) because text runs from right-to-left and numbers (or embedded Latin characters) from left-to-right. IDNA2008 does a better job of restricting labels that lead to "bidi label hopping". This is where bidi reordering causes characters from one label to appear to be part of another label. For example, "B1.2d" in a right-to-left paragraph (where B stands for an Arabic or Hebrew letter) would display as "1.2dB". For more information, see the Unicode bidi demo.
While these new bidi rules go a long way towards reducing this problem, they do not completely eliminate it because they do not check for inter-label problems.
Q: Why allow ZWJ/ZWNJ at all?
ZWJ and ZWNJ are normally invisible, which allows them to be used for a variety of spoofs. Invisible characters (like these and soft-hyphen) are allowed on input in IDNA2003, but are deleted so that they do not allow spoofs. During the development of Unicode, the ZWJ and ZWNJ were intended only for presentation — that is, they would make no difference in the semantics of a word.
However, in some cases, what used to be presentational alternatives became semantically distinct. For example, there are words such as the Sinhala name of the country of Sri Lanka (ශ්රී ලංකාව), which require preservation of these joiners (in this case, ZWJ) to achieve the correct spelling. The Root Zone excludes the ZWJ because of the heightened security sensitivity for the Root Zone. However, the Reference Label Generation Rules for the Sinhala Script published by ICANN for use on the second level allow the use of ZWJ, but only in the context of a few dozen explicitly enumerated combinations.
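A context restriction on ZWJ can be sketched as follows. The check shown is the contextual rule that IDNA2008 itself attaches to ZWJ (RFC 5892, Appendix A.2): the joiner is acceptable only immediately after a character whose canonical combining class is Virama. It is much looser than an enumerated list of combinations such as the one in the Sinhala reference rules, and the function name is illustrative.

```python
import unicodedata

ZWJ = "\u200D"
VIRAMA_CCC = 9  # Canonical_Combining_Class value assigned to viramas


def zwj_context_ok(label: str) -> bool:
    """True if every ZWJ in the label directly follows a virama."""
    for i, ch in enumerate(label):
        if ch == ZWJ:
            if i == 0 or unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
                return False
    return True


# SHA + AL-LAKUNA (virama) + ZWJ + RA + vowel sign II: the first word of "Sri Lanka"
sri = "\u0DC1\u0DCA\u200D\u0DBB\u0DD3"
print(zwj_context_ok(sri))         # True
print(zwj_context_ok("a\u200Db"))  # False: ZWJ between two Latin letters
```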
Q: But aren't the deviation characters needed for the orthographies of some languages?
While these are full parts of the orthographies of the languages in question, neither IDNA2003 nor IDNA2008 ever claimed that all parts of every language's orthographies are representable in domain names. There are trivial examples even in English, like the word can't (vs cant) or Wendy's/Arby's Group, which use standard English orthography but cannot be represented faithfully in a domain name.
Q: Are there registries that restrict domain names on the basis of language?
While it may be difficult to find a clear cutoff for restricting IDNs on the basis of language, there are many registries that have language-specific registration policies, and ICANN publishes a set of language-specific Reference LGRs for the Second Level.
The main concern is that the set of letters used in a particular language is not well defined. The "core" letters typically are, but many additional ones may be accepted in loan words, and have perfectly legitimate commercial and social use. Sometimes the same language used in different regions may use different letters; other times, the interest may be more in supporting a particular country, than a specific language. The latter applies to many ccTLDs. In all of these cases, the allowed repertoire may not be strictly language-based but will be a subset of a full script's repertoire.
Q: Are there registries that restrict domain names on the basis of script?
It is a bit easier to maintain a clear distinction based on script differences between characters: every Unicode character has a defined script (or is Common/Inherited). However, some languages, such as Japanese, require multiple scripts, and in such cases mixtures of scripts may be appropriate. One can have http://SONY日本.com with no problems at all, while there are many cases of "homographs" (visually confusable characters) within the same script that a restriction based on script doesn't deal with.
As one prominent example, the DNS Root Zone supports domain names on the basis of script: with few exceptions for inherently multi-script writing systems each label must be in a single script. However, labels from different scripts share the Root Zone. The issue of true homographs within and across scripts is addressed not by repertoire restriction but by mutual exclusion via definition of variants.
Q: What is a recommended script?
UAX #31 defines a number of scripts as "Recommended" for use in identifiers. All of these scripts are in "widespread common everyday use" by large communities for writing modern languages, and are actively being used by the respective user communities to conduct their ordinary and daily online business. Where there is some, but not a sufficient, level of use, a script may be designated "Limited_Use". This classification neither prevents use of these scripts for IDNs absolutely in any zone other than the DNS Root Zone, nor does it affect the use of the script in creating online content. Scripts limited to specialized use only, like archaic scripts, are classified as "Excluded".
Q: Can additional scripts become recommended?
Which scripts are used to write a given language may change over time. Whether a script is recommended or not is not frozen in time, so Unicode is able to track such changes in usage. A number of scripts that are classified as Limited_Use have the potential to become recommended, if at some point in the future their observed and documented level of usage rises to the level of "widespread common everyday use". Any suggestion to make such a change would need to be accompanied by thorough documentation of pervasive online use of the script in daily life. Unlike the use of a script to publish dictionaries or otherwise digitally preserve a written culture, the use of IDNs is to facilitate day-to-day online interactions by users of the script. Therefore, the degree to which such a language community engages in online transactions using that script is the most important data point.
Q: Should the IDNA protocol restrict allowed domain names on the basis of language or script?
The rough consensus among the IETF IDNA working group is that script/language mixing restrictions are not appropriate for the lowest-level protocol. So in this respect, IDNA2008 is no different than IDNA2003. IDNA doesn't try to attack the homograph problem, because it is too difficult to maintain a clear distinction. Effective solutions depend on information or capabilities outside of the protocol's control, such as language restrictions appropriate for a particular registry, the language of the user looking at this URL, the ability of a UI to display suspicious URLs with special highlighting, and so on.
Responsible registries can apply such restrictions. For example, a country-level registry can decide on a restricted set of characters appropriate for that country's languages. Application software can also take certain precautions: Microsoft Edge, Safari, Firefox, and Chrome all display domain names in Unicode only if the user's language(s) typically use the scripts in those domain names. For more information on the kinds of techniques that implementations can use, see UTR #36: Unicode Security Considerations on the Unicode web site.
Implementation Issues and Strategies for IDN
Q: Are there differences in mapping between UTS #46 and IDNA2003?
No. There are, however, 56 characters that are valid or mapped under IDNA2003, but are disallowed by UTS #46. For a detailed table of differences between UTS #46 and IDNA2008, see Section 7, IDNA Comparison in UTS #46.
In particular, there are collections of characters that would have changed mapping according to NFKC_Casefold after Unicode 3.2, unless they were specifically excluded. All of these characters are extremely rare, and do not require any special handling.
Case Pairs. These are characters that did not have corresponding lowercase characters in Unicode 3.2, but had lowercase characters added later.
U+04C0 ( Ӏ ) CYRILLIC LETTER PALOCHKA
U+10A0 ( Ⴀ ) GEORGIAN CAPITAL LETTER AN…U+10C5 ( Ⴥ ) GEORGIAN CAPITAL LETTER HOE
U+2132 ( Ⅎ ) TURNED CAPITAL F
U+2183 ( Ↄ ) ROMAN NUMERAL REVERSED ONE HUNDRED
After Unicode 3.2, the Unicode Consortium has stabilized case folding, so that further examples will not occur in the future. That is, case pairs will be assigned in the same version of Unicode—so any newly assigned character will either have a case folding in that version of Unicode, or it will never have a case folding in the future.
Normalization Mappings. These are five characters whose normalizations changed after Unicode 3.2 (all of them were in Unicode 4.0.0: see Corrigendum #4: Five Unihan Canonical Mapping Errors). As of Unicode 5.1, normalization is completely stabilized, so these are the only such characters.
Q: What are possible strategies for preparing IDNs in a display form preferred by target sites?
Labels presented to a browser may or may not be in the display form preferred by a target site. For example, a site may have a preferred display form of “HumanEvents.com”, but an href tag in another site may display “HumaneVents.com”. Similarly, a user may type “Floß.com” in the browser’s address bar, and that would resolve to the site “floss.com”, though it is unclear whether the display form preferred by owners of that site is “Floss.com”, “floss.com”, “Floß.com”, or “floß.com”. There is no way currently for the browser to know whether the labels are in a preferred form or not.
It may be useful to develop mechanisms to allow browsers to determine the display form preferred by a target site, and then for browsers to display that form. One could foresee something being developed along the lines of the favicon approach. The mechanisms would need to have restrictions put into place to address misrepresentations. For example, the browser should verify that the site's preferred display form has the same lookup form: if the href is "http://βόλοσ.com", and the site's preferred display form is "http://Βόλος.com", then the preferred display form could be used; if the site's preferred display form is "http://Βόλλος.com", then it would not be used, because it doesn't have the same lookup form as the href. Other security checks should be made, such as to prevent display forms like "appIe.com" (with a capital I) for "appie.com" (with a lowercase i).
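The lookup-form check described above might look like the following sketch. It assumes Python's built-in `idna` codec (IDNA2003 mapping) as the definition of the lookup form; the function names are hypothetical and no such browser mechanism is implied to exist.

```python
def lookup_form(host: str) -> str:
    """ASCII lookup form of a host name under the stdlib IDNA2003 codec."""
    # The codec leaves all-ASCII labels untouched, so lowercase the result:
    # the DNS treats ASCII letters case-insensitively.
    return host.encode("idna").decode("ascii").lower()


def may_use_display_form(href_host: str, preferred_display_host: str) -> bool:
    """Accept a site's preferred display form only if it resolves like the href."""
    try:
        return lookup_form(href_host) == lookup_form(preferred_display_host)
    except UnicodeError:
        return False  # not encodable as a domain name: reject


print(may_use_display_form("βόλοσ.com", "Βόλος.com"))   # True
print(may_use_display_form("βόλοσ.com", "Βόλλος.com"))  # False
print(may_use_display_form("appie.com", "appIe.com"))   # True
```

The last result shows why the lookup-form comparison is not sufficient on its own: "appIe.com" passes it, so a separate confusability check is still needed.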
Q: How are label delimiters handled in implementations of IDNA?
The processing of UTS #46 matches what is commonly done with label delimiters by browsers, whereby characters containing periods are transformed into the NFKC format before labels are separated. This allows the domain name to be mapped in a single pass, rather than label by label. However, except for the four label separators provided by IDNA2003, all input characters that would map to a period are disallowed. For example, U+2488 ( ⒈ ) DIGIT ONE FULL STOP has a decomposition that maps to a period, and is thus disallowed. The exact list of characters can be seen with the Unicode utilities using a regular expression.
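As a rough local approximation of that list (not the Unicode utilities query itself), the following sketch scans all code points for characters whose NFKC form contains a full stop, skipping the four IDNA2003 label separators (U+002E, U+3002, U+FF0E, U+FF61). The result depends on the Unicode version bundled with the Python interpreter, and the function name is illustrative.

```python
import sys
import unicodedata

# The four label separators recognized by IDNA2003
LABEL_SEPARATORS = {"\u002E", "\u3002", "\uFF0E", "\uFF61"}


def chars_containing_full_stop():
    """Return characters, other than the label separators, whose NFKC form contains '.'."""
    found = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        if ch in LABEL_SEPARATORS:
            continue
        if "." in unicodedata.normalize("NFKC", ch):
            found.append(ch)
    return found


hits = chars_containing_full_stop()
print(len(hits), [f"U+{ord(c):04X}" for c in hits[:8]])
```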
The question also arises as to how to handle escaped periods (such as %2E). While escaping of periods is outside of the scope of this document, it is useful to see how both of these cases are handled in current browsers:
Input: | http://à%2Ecom | %2E | http://à⒈com | ⒈ |
---|---|---|---|---|
Microsoft Edge | http://xn--0ca.com/ | = "." | http://xn--1-rfa.com/ | = "1." |
Firefox | http://www.xn--.com-hta.com/ | ≠ "." | http://xn--1-rfa.com/ | = "1." |
Safari / Chrome | http://xn--0ca.com/ | = "." | http://xn--1.com-qqa/ | ≠ "1." |
There are three possible behaviors for characters such as U+2488 ( ⒈ ) DIGIT ONE FULL STOP:
1. The dot behaves like a label separator.
2. The character is rejected.
3. The dot is included in the label, as shown in the garbled punycode seen above in the ≠ cases.
The conclusion of the Unicode Technical Committee was that the best behavior for UTS #46 was #2, to forbid all characters (other than the 4 label separators) that contained a FULL STOP in their compatibility decompositions. This is the same behavior as IDNA2003. Although this policy is not the current policy of the majority of browser implementations, the browser vendors agreed that the change is desirable.
Q: For IDNA2008, what is the derivation of valid characters in terms of Unicode properties?
Using formal set notation, the following describes the set of allowed characters defined by IDNA2008. This set corresponds to the union of the PVALID, CONTEXTJ, and CONTEXTO characters defined by the Tables document of IDNA2008. Unicode provides the result of this derivation for every applicable version of the standard including the currently defined exceptions.
Formal Sets | Descriptions |
---|---|
[ \P{Changes_When_NFKC_Casefolded} | Start with characters that are NFKC Case folded (as in IDNA2003) |
\- \p{c} - \p{z} | Remove Control Characters and Whitespace (as in IDNA2003) |
\- \p{s} - \p{p} - \p{nl} - \p{no} - \p{me} | Remove Symbols, Punctuation, non-decimal Numbers, and Enclosing Marks |
\- \p{HST=L} - \p{HST=V} - \p{HST=T} | Remove characters used for archaic Hangul (Korean) |
\- \p{block=Combining_Diacritical_Marks_For_Symbols} - \p{block=Musical_Symbols} - \p{block=Ancient_Greek_Musical_Notation} | Remove three blocks of technical or archaic symbols. |
\- [\u0640 \u07FA \u302E \u302F \u3031-\u3035 \u303B] | Remove certain exceptions: U+0640 ( ـ ) ARABIC TATWEEL; U+07FA ( ߺ ) NKO LAJANYALAN; U+302E ( 〮 ) HANGUL SINGLE DOT TONE MARK; U+302F ( 〯 ) HANGUL DOUBLE DOT TONE MARK; U+3031 ( 〱 ) VERTICAL KANA REPEAT MARK; U+3032 ( 〲 ) VERTICAL KANA REPEAT WITH VOICED SOUND MARK .. U+3035 ( 〵 ) VERTICAL KANA REPEAT MARK LOWER HALF; U+303B ( 〻 ) VERTICAL IDEOGRAPHIC ITERATION MARK |
\+ [\u00B7 \u0375 \u05F3 \u05F4 \u30FB] \+ [\u002D \u06FD \u06FE \u0F0B \u3007] | Add certain exceptions: U+00B7 ( · ) MIDDLE DOT; U+0375 ( ͵ ) GREEK LOWER NUMERAL SIGN; U+05F3 ( ׳ ) HEBREW PUNCTUATION GERESH; U+05F4 ( ״ ) HEBREW PUNCTUATION GERSHAYIM; U+30FB ( ・ ) KATAKANA MIDDLE DOT; plus U+002D ( - ) HYPHEN-MINUS; U+06FD ( ۽ ) ARABIC SIGN SINDHI AMPERSAND; U+06FE ( ۾ ) ARABIC SIGN SINDHI POSTPOSITION MEN; U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG; U+3007 ( 〇 ) IDEOGRAPHIC NUMBER ZERO |
\+ [\u00DF \u03C2] \+ \p{JoinControl}] | Add special exceptions (Deviations): U+00DF ( ß ) LATIN SMALL LETTER SHARP S; U+03C2 ( ς ) GREEK SMALL LETTER FINAL SIGMA; U+200C ( ) ZERO WIDTH NON-JOINER; U+200D ( ) ZERO WIDTH JOINER |
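The derivation above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not a conformant implementation: the standard `unicodedata` module exposes neither the Hangul syllable type nor block names, so the archaic Hangul and block-removal rows are skipped; Changes_When_NFKC_Casefolded is approximated by normalizing and case folding a single character; and results depend on the Unicode version bundled with the interpreter. The name `roughly_allowed` is illustrative.

```python
import unicodedata

REMOVED = {0x0640, 0x07FA, 0x302E, 0x302F, 0x3031, 0x3032, 0x3033, 0x3034, 0x3035, 0x303B}
ADDED = {0x00B7, 0x0375, 0x05F3, 0x05F4, 0x30FB, 0x002D, 0x06FD, 0x06FE, 0x0F0B, 0x3007}
DEVIATIONS = {0x00DF, 0x03C2, 0x200C, 0x200D}


def nfkc_casefold_changes(ch: str) -> bool:
    """Approximate the Changes_When_NFKC_Casefolded property for one character."""
    folded = unicodedata.normalize("NFKC", unicodedata.normalize("NFKC", ch).casefold())
    return folded != ch


def roughly_allowed(ch: str) -> bool:
    """Approximate membership in the set described by the table (two rows skipped)."""
    cp = ord(ch)
    if cp in DEVIATIONS or cp in ADDED:
        return True   # explicit additions
    if cp in REMOVED:
        return False  # explicit removals
    if nfkc_casefold_changes(ch):
        return False  # uppercase, width variants, and similar
    cat = unicodedata.category(ch)
    # Controls/unassigned (C*), separators (Z*), symbols (S*), punctuation (P*),
    # letter and other numbers (Nl, No), enclosing marks (Me)
    if cat[0] in "CZSP" or cat in {"Nl", "No", "Me"}:
        return False
    return True


for sample in ["a", "A", "\u00F6", "\u00DF", "\u221A", "-", "."]:
    print(f"U+{ord(sample):04X}", roughly_allowed(sample))
```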