October 31st, 2024

What has case distinction but is neither uppercase nor lowercase?

Raymond Chen

If you go exploring the Unicode Standard, you may be surprised to find that there are some characters that have case distinction yet are themselves neither uppercase nor lowercase.

Oooooh, spooky.

In other words, it is a character c with the properties that

toUpper(c) ≠ toLower(c), yet
c ≠ toUpper(c) and c ≠ toLower(c).

Congratulations, you found the mysterious third case: Title case.
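If you want to go hunting for these characters yourself, here is a minimal Python sketch of the two conditions above. It assumes that Python's `str.upper()` and `str.lower()` are acceptable stand-ins for toUpper and toLower; they apply the full case mappings of whatever Unicode version your interpreter ships with, so the exact list you get may vary. The function name `has_case_but_is_neither` is made up for this illustration.

```python
import sys
import unicodedata

def has_case_but_is_neither(c):
    """True if the upper and lower forms differ, yet c equals neither of them."""
    upper, lower = c.upper(), c.lower()
    return upper != lower and c != upper and c != lower

# Scan the whole code space, skipping surrogates (they are not characters on their own).
matches = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if not 0xD800 <= cp <= 0xDFFF and has_case_but_is_neither(chr(cp))
]

for c in matches:
    # General category "Lt" is Unicode's label for titlecase letters.
    print(f"U+{ord(c):04X} {c} {unicodedata.category(c)} {unicodedata.name(c, '?')}")
```

The category column is printed so you can see for yourself how the hits line up with Unicode's "Lt" (Letter, titlecase) category.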

There are some Unicode characters that occupy a single code point but represent two graphical symbols packed together. For example, the Unicode character ǳ (U+01F3 LATIN SMALL LETTER DZ) looks like two Unicode characters placed next to each other: dz (U+0064 LATIN SMALL LETTER D followed by U+007A LATIN SMALL LETTER Z).
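The difference is invisible on screen but easy to see in code. This short Python sketch uses only the standard `unicodedata` module; the variable names are just for illustration.

```python
import unicodedata

single = "\u01F3"  # the digraph as one code point
pair = "dz"        # U+0064 followed by U+007A

print(len(single), len(pair))    # 1 2
print(unicodedata.name(single))  # LATIN SMALL LETTER DZ
print(single == pair)            # False

# Compatibility normalization (NFKC) replaces the digraph with the two plain letters.
print(unicodedata.normalize("NFKC", single) == pair)  # True
```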

These digraphs are characters in the alphabets of some languages, most notably Hungarian. In those languages, the digraph is considered a separate letter of the alphabet. For example, the first ten letters of the Hungarian alphabet are¹

a, á, b, c, cs, d, dz, dzs, e, é

These digraphs (and one trigraph) have three forms.

Form	Result
Uppercase	Ǳ
Title case	ǲ
Lowercase	ǳ

Unicode includes four digraphs in its encoding.

Uppercase	Title case	Lowercase
Ǆ	ǅ	ǆ
Ǉ	ǈ	ǉ
Ǌ	ǋ	ǌ
Ǳ	ǲ	ǳ
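You can regenerate that table from the lowercase column. This is a small Python sketch, assuming `str.upper()` and `str.title()` as the uppercase and title case conversions; the word "džungla" is only sample input.

```python
import unicodedata

lowercase_digraphs = ["\u01C6", "\u01C9", "\u01CC", "\u01F3"]  # the Lowercase column

for lower in lowercase_digraphs:
    upper = lower.upper()  # both halves capitalized
    title = lower.title()  # only the first half capitalized
    # The title case form reports general category "Lt".
    print(upper, title, lower, unicodedata.category(title))

# Title-casing a word capitalizes only the first half of a leading digraph,
# whereas uppercasing capitalizes both halves.
word = "\u01C6ungla"  # "džungla" spelled with the single-code-point digraph
print(word.title(), word.upper())
```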

But wait, we have a Unicode code point for the dz digraph, but we don’t have one for the cs digraph or the dzs trigraph. What’s so special about dz?

These digraphs owe their existence in Unicode not to Hungarian but to Serbo-Croatian. Serbo-Croatian is written in both Latin script (Croatian) and Cyrillic script (Serbian), and these digraphs permit one-to-one transliteration between them.¹
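To see what "one-to-one" buys you, here is a rough Python sketch of a round trip. The table is deliberately partial and hypothetical: it covers only the three digraph letters plus the handful of letters needed for the sample word, and it is not a complete transliteration scheme.

```python
# Partial, illustrative table: Cyrillic code point -> Latin character.
CYRILLIC_TO_LATIN = {
    "\u0459": "\u01C9",  # CYRILLIC SMALL LETTER LJE  -> digraph lj
    "\u045A": "\u01CC",  # CYRILLIC SMALL LETTER NJE  -> digraph nj
    "\u045F": "\u01C6",  # CYRILLIC SMALL LETTER DZHE -> digraph dž
    "\u0430": "a",
    "\u0431": "b",
    "\u0432": "v",
    "\u0443": "u",
}

to_latin = str.maketrans(CYRILLIC_TO_LATIN)
to_cyrillic = str.maketrans({lat: cyr for cyr, lat in CYRILLIC_TO_LATIN.items()})

cyrillic_word = "\u0459\u0443\u0431\u0430\u0432"  # "ljubav" in Cyrillic
latin_word = cyrillic_word.translate(to_latin)

print(latin_word)
print(len(cyrillic_word) == len(latin_word))             # True: one character in, one out
print(latin_word.translate(to_cyrillic) == cyrillic_word)  # True: the round trip is lossless
```

Because every letter is a single code point on both sides, a per-character lookup is enough in both directions. With plain two-letter spellings, the reverse direction would have to decide whether a pair is one letter or two.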

Just another situation where the world is more complicated than you think. You thought you understood uppercase and lowercase, but there’s another case in between that you didn’t know about.

Bonus chatter: The fact that dz is treated as a single letter in Hungarian means that if you search for “mad”, it should not match “madzag” (which means “string”) because the “dz” in “madzag” is a single letter and not a “d” followed by a “z”, any more than “lav” should match “law” just because the first part of the letter “w” looks like a “v”. This is another surprising result you get if you mistakenly use a literal substring search rather than a locale-sensitive one. We’ll look at locale-sensitive substring searches next time.
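To make the difference concrete, here is a toy Python sketch that compares a literal substring search with a letter-by-letter one. This is not a real locale-sensitive search API. It is a hand-rolled approximation that greedily splits text into Hungarian letters, and the names `MULTIGRAPHS`, `split_letters`, and `contains_letters` are made up for the illustration. A greedy split is known to be naive: it will misfire on compound words where two adjacent letters merely happen to look like a digraph.

```python
# Hungarian letters written with more than one character, with "dzs" listed
# first so that it is tried before "dz".
MULTIGRAPHS = ("dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs")

def split_letters(text):
    """Greedily split text into Hungarian letters (naive: ignores compounds)."""
    text = text.lower()
    letters = []
    i = 0
    while i < len(text):
        for mg in MULTIGRAPHS:
            if text.startswith(mg, i):
                letters.append(mg)
                i += len(mg)
                break
        else:
            # No multigraph starts here, so this character is a letter by itself.
            letters.append(text[i])
            i += 1
    return letters

def contains_letters(haystack, needle):
    """True if needle's letter sequence occurs inside haystack's letter sequence."""
    h, n = split_letters(haystack), split_letters(needle)
    return any(h[i:i + len(n)] == n for i in range(len(h) - len(n) + 1))

print(split_letters("madzag"))             # ['m', 'a', 'dz', 'a', 'g']
print("mad" in "madzag")                   # True: literal substring search
print(contains_letters("madzag", "mad"))   # False: "d" is not a letter of "madzag"
print(contains_letters("madzag", "madz"))  # True
```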

¹ I got this information from the Unicode Standard, Version 15.0, Chapter 7: “Europe I”, Section 7.1: “Latin”, subsection “Latin Extended-B: U+0180-U+024F”, sub-subsection “Croatian Digraphs Matching Serbian Cyrillic Letters.”


23 comments


Miloš Milutinović 6 days ago 1

“Serbo-Croatian is written in both Latin script (Croatian) and Cyrillic script (Serbian), and these digraphs permit one-to-one transliteration between them.¹”

This is slightly wrong, Serbian also uses Latin – it uses both Latin and Cyrillic, per personal preference. To quote Wikipedia, “Serbian is practically the only European standard language whose speakers are fully functionally digraphic,[18] using both Cyrillic and Latin alphabets.”

Tudor Iordăchescu 1 week ago 1

The EU law imposes the user’s informed consent for the use of cookies, that’s it.

Some corporations/people that first complied with that law (I admit I’m a little bit fuzzy on the historical timeline) chose to implement the most annoying version possible as a form of malicious compliance, probably hoping that public outcry would trigger a revision of the law. I bet that 99% of people implementing such popups nowadays are just lazy and follow the herd instead of researching what the law actually requires.

- Bela Zsir 1 week ago · Edited 0
  
  Some first compliers were hoping for a public outcry, since then we have all the herd following, ie. the overall impact is a million times greater, and still no public outcry.
  What does that say about the public? …am I really the only one so annoyed with this?
  
David Faulks 1 week ago · Edited 0

The reason these letters exist is because Unicode has a policy of 1-to-1 round trip encoding compatibility with older character sets, and Yugoslavia (keep in mind Unicode came out in 1991) used to have an 8-bit character set (YUSCII) that included these digraph letters. I’m not sure why so many commentators are focused on Hungarian.

Michael Chermside November 4, 2024 1

Your point about how Hungarians actually use the characters is excellent, and remains so even if it turns out that SOME Hungarian speakers disagree with you on this. In my opinion, the authors of the Unicode Standard should generally attempt to support oddities that are unique within a language, but when native speakers disagree about an oddity, Unicode should err on the side of simplicity. (Of course ǳ may have been added to support Serbo-Croatian, not Hungarian.)

However, your ire about cookie popups is misplaced. Computer technologists did not invent and impose them; an EU law mandated the cookie popups (and still does). I don’t even live in the EU and I still have to wade through thickets of cookie agreement popups. Perhaps you could persuade your politicians to change that.

- Bela Zsir 1 week ago · Edited 0
  
  Thank you for the reply, sorry for the long rant. I was trying to make the point that after decades of NOT having all our characters, we don’t want now to choke on too many.
  
  PS. You are right about the cookie thing being misplaced here, when I can, I speak out in the right place, but nobody cares, as if it were not a problem for anyone else in the world, in desperation I tried here to give just an example, of what is a million times more disturbing than the made-up problem of the lack of some unneeded digraphs. (reminds me of the story made in the news how the ‘Calculator Team’ (sic! / sick?) in Redmond solved a 20 year problem in the Windows calculator, this and all similar waste of human resources ie. just to have a ‘Calculator Team’ make me sad)
  
  I am aware that it is a law, but nowhere in the law it is stated that half the page must be taken up by a cookie prompt graying out the rest of the page making it unusable till you answer a silly question. (I click always on OK, Yes, Agree, or whatever the ‘dont care just go’ button says to avoid the pointless further prompts)
  
  In my spare time, I volunteer at a centre for people with disabilities, doing what I can: electronics and programming, refurbishing the computers they receive as donations. ‘inventing’ alternative pointing devices. These people are blessed with a computer and the internet. Scrolling is easy for them without the help of their hands, but clicking a mouse on a randomly popping up window with a mess of buttons is each time a challenge, and it is growing.
  I am desperately trying to help these people with my browser extension scripts that auto-click away these bs, but they keep on coming, there are not two identically programmed out there. My only wish for you, Computer technologists, please help us, just make the implementation a standard. (for lawyer users you can leave all as is)
  If talking about the law, why is there no option for a legally binding statement built-in a browser that I will allow/deny all cookies for the next 10 years or whatever, just don’t ask me a million times of the same thing. I will go to a notary to sign this life-long statement with a wax seal if needed. What law it is if I can give away a billion-dollar asset at the click of a button, then I am an adult? But this cookie thing is so important that it will be reasked 547 times just over the next week.
  
  I began with ‘sorry for my rant’, now I went on, sorry again. Annoying, isn’t it? those cookie prompts are much more annoying for my friends in that center.
  
Kristof Roomp November 3, 2024 · Edited 0

Ll and Ch were considered single letters in Spanish until they changed the rules in 1994. Spanish (traditional sort) treats them as single characters vs Modern sort.

Bela Zsir 1 week ago · Edited 1

As a Hungarian, I’d like to add my two cents to the discussion.

TLDR: I know that lately it’s become a “woke” habit to look for ‘oppressed victims’ who have not the slightest idea that they’re supposed to be victims. But thank you, we Hungarians—and our language—have no need for digraphs; in fact, having them and anybody using them would be actually harmful.

I began programming in the 80s, we Hungarians spent about two decades with the problem that the full (8-bit) extended ASCII character table just almost had all the Hungarian characters. We had á é, í, ó, ö, ú, ü, but we were missing ő ű and their uppercase versions. (I guess these diacritics do not even have a name: could be double-acute?), With these four missing we were limited in doing anything computer-related with correct grammar. This was particularly frustrating because, as far as I know, all other European languages, including Eastern European languages, had all their letters.
I remember how I complained that, despite the fact that there were Hungarians among the greats of computer science (János Neumann – Neumann architecture, János Kemény – inventor of multitasking and the BASIC language, Gábor Dénes – holography, András Gróf, who used the name Andrew Grove – co-founder of Intel, Károly Simonyi, who used the name Charles Simonyi – chief architect for Microsoft Word and Excel, and many others), they couldn’t manage to lobby us into having those four more letters in the 256 ASCII characters. (please note I wrote their name using the original Hungarian letters)
The other annoying thing for us is the order of given name / surname. We use it the other way, it is used in very few languages that way on the Planet, and today’s mainstream apps still have issues with this (actually they just don’t care)

So, Dear international computing community, you owe us Hungarians. 🙂 Please don’t ruin our text searches with the unpredictable results of not finding ‘mad’ in ‘madzag’ I assure you, not a single Hungarian expects it this way, nor does anyone need it ever.

In my opinion, it’s already nonsense that these double and triple letters made to be part of the official Hungarian alphabet. I deliberately not call them digraphs, by that logic, plenty of other letter pairs in our language — and also in all other languages — could also become digraphs. There’s nothing special about “dz.” in Hungarian. We use it the same way as the “ch” in the word *technika*—one sound, two letters, but not in the alphabet as a digraph. The same goes for “ts,” “tz,” and a dozen other letter pairs. And what concerns the hyphenation rules, they are independent of these anyway, we do not hyphenate as tec-hnika, and for this to work, a digraph of ‘ch’ is not needed.

I bet you’d go crazy if a search function couldn’t find “is” in the word “island” just because someone decided that “sl” should be a digraph in English. Please spare us too from this, we want to find our mad-ness in ‘madzag’, it’s OK so.

PS. if you’re looking for a real problem to solve, do something about the plethora of cookie-consent popups that clutter everything on the internet. They have no sense in the fight for privacy, I guess the lawmakers just knew and could pronounce the word ‘cookie’ (Want a yummy cookie, Charlie?) and luckily had no idea what a localStorage, sessionStorage, indexedDB, or cache storage is.
These cookie-things make any browsing a pain, I click them away dozens times a day, very annoying, internet was not supposed to look this way.
At least standardize them, they are all annoyingly (ie. unautomatable) different, what about a special HTML tag for this bs?

Jonas Barklund 2 weeks ago 0

Raymond, did you try to make a distinction between digraph and diagraph, or was the latter a typo for digraph?

Álvaro González 2 weeks ago 0

Funny. That same letter also used to exist in Spanish, together with ll (double L). Both were demoted in the mid 1990s so I guess they never made it into Unicode.
I also think it was for the best. To look up things in a list or dictionary you had to know the language it was written in.

Chris Warrick October 31, 2024 0

> The fact that dz is treated as a single letter in Hungarian means that if you search for “mad”, it should not match “madzag” (which means “string”) because the “dz” in “madzag” is a single letter and not a “d” followed by a “z”

This sounds mad to me. Polish has a fair share of digraphs and trigraphs, but I expect partially-typed digraphs not to change the search result. It is disorienting if the result appears when `ma` is typed, disappears after typing `d`, and then comes back after typing `z`. And that applies to experiences which don’t search automatically as well.

- Aram Dulyan November 6, 2024 1
  
  Polish doesn’t treat those digraphs as letters of the alphabet though.
  
  In Czech, ch is a letter of the alphabet that comes after h. If a building has a vertical sign for bread, it would look like:
  Ch
  L
  É
  B
  
- Daniel Chýlek November 2, 2024 1
  
  I guess it is weird if you combine multiple languages on your system, but to me it’s entirely reasonable to expect that Czech Windows will not find a file containing ‘ch’ when you search for ‘c’ or ‘h’. That is how it works right now.
  
Jan Ringoš October 31, 2024 1

In Czech, we have similar letter ‘ch’ but it never got assigned a single Unicode codepoint.

It’s probably for the best.
