November 1st, 2024

On locale-aware substring matching, either case-sensitive or case-insensitive

Say you want to do a case-insensitive substring search in a locale-aware manner. For example, maybe you have a list of names, and you want the user to be able to search for them by typing any name fragment. A search for “ber” could find “Bert” as well as “Roberta”.

I’ve seen people solve this problem by converting the string to lowercase, and then doing a code unit-based substring search. This technique doesn’t work for multiple reasons.

One reason is that some languages (like English) do not consider diacritics significant in collation. The word naive and naïve are considered equivalent for searching purposes. But a code unit substring search considers them different.

For languages in which diacritics are significant, you have the problem of composed and decomposed characters. For example, the lowercase a with ring in the Swedish word någon could be represented either as

  • Two code points: U+0061 (LATIN SMALL LETTER A) followed by U+030A (COMBINING RING ABOVE), or
  • A single code point: U+03E5 (LATIN SMALL LETTER A WITH RING ABOVE)

The number of possibilities increases if you have characters with multiple diacritics. And then you also have ligatures, where the “fi” ligature is equivalent to two separate characters f and i.

So what’s the right thing to do?

In Windows, you can use the Find­NLS­String­Ex function to do a locale-aware substring search. Use the LINGUISTIC_IGNORE­DIACRITIC flag to say that you want to honor diacritics only when they are significant to the locale.¹ (A better name would have been LINGUISTIC_IGNORE­INSIGNIFICANT­DIACRITICS.)

On other platforms, and even on Windows,² you can use the ICU library’s string search service and search with primary weight. (Primary weight honors diacritics which are significant to the locale.)

Bonus reading: A popular but wrong way to convert a string to uppercase or lowercase. What has case distinction but is neither uppercase nor lowercase?

¹ Throw in one of the IGNORE­CASE flags if you want a case-sensitive substring search.

² The Windows globalization team now recommends that people use ICU, which has been part of Windows since Windows 10 version 1703 (build 15063). More details and gotchas here.

Topics
Code

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

9 comments

  • IS4 1 week ago

    Or just use .NET which has standardized methods for any sort of substring matching. No need to burden the application with an external library when the environment can handle it fine.

  • David Lee · Edited

    The Windows filesystem seems to treat multiple representations of the same character (composed vs. decomposed, etc.) as different, which can cause a false impression of “having multiple files under the same name” since they tend to render the same in File Explorer.

    It can be argued either way, but macOS seems to normalize strings before hitting the disk. On the other end we also have *nix, which basically treats filesystem paths as opaque byte arrays.

    • Kevin Norris

      All three behaviors are defensible, and they all have issues as well:

      macOS reduces confusion when users enter the same text in multiple different ways (for example, by copying and pasting text from the internet), but there are four different kinds of Unicode normalizations and you would ideally pick one of them for the whole system (Google is giving me conflicting answers as to what macOS actually does, so I'm speaking hypothetically). NFC and NFD are...

      Read more
      • Kevin Norris

        @Bwmat: It depends.

        Do your users expect to be able to find their files and fiddle with them? Then stop trying to turn strings into paths altogether. Instead present a standard Save As... dialog (for each platform). That will give you a buffer that is pre-encoded in the FS-appropriate encoding, so just pass it directly to fopen.

        If you don't expect the user to find the files and fiddle with them, then stop trying to put the...

        Read more
      • Bwmat 1 week ago

        So, given a ‘platform-independent’ system which internally passes strings as (possibly-invalid) UTF-16 & accepts user input that potentially in other encodings, including non-unicode encodings, how would you deal w/ filenames that needed to be passed in those strings?

        Right now I just choose not to think about it… too much. We just use the data directly w/ CreateFileW on windows & convert to the ICU-detected ‘platform encoding’ on other platforms & use fopen

      • Chris Warrick

        macOS likes NFD, and at the same time, a lot of macOS software (not necessarily first-party, but prominent applications like Microsoft Word’s spellchecker) get confused by that.

      • Jan Ringoš

        One small correction: Windows (NTFS) paths do not need to be valid UTF-16. Any sequence of 16-bit words (except reserved characters) will work. Unassigned code units, mismatched surrogate pair code units, anything. You can put 510-character long UTF-8 string in filename, and the OS will be totally fine with it.