Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] windows: ucrt build on win10+: support unicode file names #438

Merged
merged 2 commits into from
Oct 4, 2023

Conversation

avih
Copy link
Contributor

@avih avih commented Sep 27, 2023

There are three methods to support unicode file names on Windows:

  1. Use the W Windows APIs (_wfopen, CreateFileW, etc). This works on Windows XP and later, but requires prevasive-ish code changes.

  2. Attach/use a UTF-8 manifest without any code changes (i.e. keep using the ANSI APIs - fopen, CreateFile[A], etc). This works on Windows 10 1903 and later compiled with msvcrt or ucrt, can be done at build time or post-build, has no effect on win7/8[.1], but the binary doesn't run on Windows XP.

  3. (try to) change the locale at runtime to UTF-8. Works on Windows 10 1803 and later compiled with ucrt only (msvc 2015+ or mingw ucrt). With msvcrt and/or earlier windows this has no effect, but it runs on XP and later. Requires small-ish code changes, because argv/env are already ANSI on entry, so they need to become UTF-8.

This commit takes approach 3. If approach 1 or 2 is taken later, then this commit can be reverted.

A UTF-8 manifest can still be used (e.g. if built with msvcrt), at which case the new code becomes no-op (when GetACP() is CP_UTF8).

@avih
Copy link
Contributor Author

avih commented Sep 27, 2023

I'm a bit torn about this one. On one hand, this will enable unicode file names and env vars in modern builds of less (msvc 2015+ or mingw with ucrt) on Windows 10 and later, and it's not too much code, and fairly local. And it should not have any negative effect elsewhere.

On the other hand, exactly the same effect can be achieved without any changes to the code at all - by placing a small XML UTF8 manifest file less.exe.manifest at the same dir as less.exe, or embedding this manifest file at the binary either during the build process, or post-build (e.g. using perc or rcedit, or mt.exe which is shipped with the windows SDK).

As a bonus, this will also work with msvcrt builds of less.exe (while this PR only works for ucrt builds).

less.exe.manifest
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <assemblyIdentity type="win32" name="less.exe" version="6.0.0.0"/>
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>

However, using the manifest requires shipping this manifest, or additional work to embed it as a resource at the binary, which likely won't happen by itself, and also it won't run on XP.

I don't know how important XP is. less still supports it, and also supports DOS and possibly win9X too, though the semi-official windows build of less already doesn't run on XP anyway - even the x86 build.

I still think this PR is a useful addition, mainly because it is useful, not invasive, and can be ripped out quite easily if needed, but it feels wrong when the same effect can happen without any code changes at all using the manifest...

@avih
Copy link
Contributor Author

avih commented Oct 3, 2023

Hmm.. somewhat tangential, but I just noticed that there are two msvc makefiles - Makefile.wnm and Makefile.win-vc64, and I added shell32.lib only to the latter.

I don't think we need two makefiles? the differences between them are:

  • no-op quoting differences.
  • additional link with kernel32.lib in one - which gets linked by default anyway even if it's not specified
  • /D WIN32 at Makefile.wnm, replaced with /D WIN64 at Makefile.win-vc64. This makes exactly zero difference. Apparently by default WIN32 is not defined, but defining either WIN32 or WIN64 results in only WIN32 defined anyway (tested with vs2015 cl /D WIN64 test.c). Either way, less doesn't use WIN64 at all, and does use WIN32 (only few times).
  • in Makefile.wnm the object targets depend on funcs.h, while in Makefile.win-vc64 they don't. This doesn't make any difference because funcs.h is required anyway, and there's no rule for it at either makefile.
  • The comment in one says # Windows 32 Visual C++ version and in the other it's # Windows 64 Visual C++ version, and both are incorrect, because both files can be used equaly to build either 32 or 64 bits version since 0713eaf (v553) (when using the respective 32/64 bit env, e.g. with VCVARS32.BAT or VCVARS64.BAT).

Shall we delete Makefile.win-vc64, and update the comment in Makefile.wnm to # Windows 32/64 Visual C++ version ?

(and I'll update this PR to add shell32.lib to Makefile.wnm)

The build tests were with vs2015 command line env.

Also, it wouldn't hurt to add msvc and mingw build instructions. I can do that later.

@gwsw
Copy link
Owner

gwsw commented Oct 3, 2023

Makefile.win-vc64 was created by @Konstantin-Glukhov in 2bf6792. I don't know if there was a specific reason to create a separate Makefile, but I support merging the Makefiles if there are no strong reasons not to do so.

@Konstantin-Glukhov
Copy link

I vote for having one Makefile

There are three methods to support unicode file names on Windows:

1. Use the W Windows APIs (_wfopen, CreateFileW, etc). This works on
   Windows XP and later, but requires prevasive-ish code changes.

2. Attach/use a UTF-8 manifest without any code changes (i.e. keep
   using the ANSI APIs - fopen, CreateFile[A], etc). This works
   on Windows 10 1803 and later compiled with msvcrt or ucrt, can be
   done at build time or post-build, has no effect on win7/8[.1],
   but the binary doesn't run on Windows XP.

3. (try to) change the locale at runtime to UTF-8. Works on Windows 10
   1803 and later compiled with ucrt only (msvc 2015+ or mingw ucrt).
   With msvcrt and/or earlier windows this has no effect, but it runs
   on XP and later. Requires small-ish code changes, because argv/env
   are already ANSI on entry, so they need to become UTF-8.

This commit takes approach 3. If approach 1 or 2 is taken later, then
this commit can be reverted.

A UTF-8 manifest can still be used (e.g. if built with msvcrt),
at which case the new code becomes no-op (when GetACP() is CP_UTF8).
It's effectively identical to Makefile.wnm, and since commit 0713eaf
Makefile.wng can build 64 bits too.

While at it, update the comment at Makefile.wnm to mention 64 too.
@avih
Copy link
Contributor Author

avih commented Oct 4, 2023

I support merging the Makefiles if there are no strong reasons not to do so.

I vote for having one Makefile

Thanks.

Rebased, updated the PR to modify only Makefile.wng, and added a commit to remove Makefile.win-vc64.

@gwsw gwsw merged commit 5f21a66 into gwsw:master Oct 4, 2023
@avih
Copy link
Contributor Author

avih commented Oct 4, 2023

Thanks. Probably worth mentioning at the changelog/news too ("UCRT builds on win10 and later now support Unicode file names").

@avih avih deleted the windows-unicode branch October 4, 2023 19:51
@avih
Copy link
Contributor Author

avih commented Feb 5, 2024

This is a heads-up notice that apparently setting the UTF8 locale doesn't behave exactly as I thought it would.

Docs for setting UTF-8 locale with UCRT: (search .UTF8): https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=msvc-170

Note this from the docs:

After calling setlocale(LC_ALL, ".UTF8") ... calling _mkdir("😊") while using a UTF-8 code page will correctly produce a directory with that emoji as the folder name, instead of requiring the ACP to be changed to UTF-8 before running your program. Likewise, calling _getcwd() in that folder returns a UTF-8 encoded string.

and:

The following aspects of the C Runtime can't use UTF-8 because they're set during program startup and must use the default Windows ANSI code page (ACP): __argv, _acmdln, and _pgmptr.

So I was expecting it to behave excatly like the manifest, but without argv, environ, and other crt pre-main values.

So while the initial claim to open unicode file name does seem correct (and tested in "less"), because fopen (and friends?) does become UTF-8, I noticed that some APIs don't become UTF-8, like FindFirstFileA and subsequently FindNextFileA - seem to return question marks for unicode codepoints names instead of UTF-8.

FindFirstFileA etc was tested in a test program, not in "less", and I didn't check whether "less" uses those. That's the first and only exception I found so far, but I didn't look hard either. I only tested fopen and Find{First,Next}FileA.

So in "less", we know that unicode file names do open correctly, but I don't know what, if any, doesn't work correctly.

For comparison, when using the UTF8 manifest (docs), then FindFirstFileA/FindNextFileA do return the correct UTF8 strings for unicode file names.

My recomendation is to leave it currently, but be aware that the unicode file handling might be incomplete, and watch out for relevant bug reports.

For what it's worth, the "official" releases for windows here are built with UCRT, so, once a new release is made, we can expect it to support unicode file name using this feature (on Windows 10+).

The current-latest windows build (643) was before this feature was added.

FYI.

@avih
Copy link
Contributor Author

avih commented Feb 5, 2024

As of now, my estimation is that the setlocale method affects the libc (ANSI) API (fopen, but not _wfopen), but unlike the manifest method, not the other win32 ANSI APIs.

So as a rule of thumb, if the API begins in a capital letter, like FindFirstFileA, then it probably doesn't become UTF-8.

@avih
Copy link
Contributor Author

avih commented Feb 5, 2024

I think these are the affected, non-libc APIs in "less" - which don't become UTF-8 (observed at the binary, could be used implicitly by the compiler). Note that the name at the source code is typically witout the final A, e.g. CreateFIle and not CreateFileA:

SystemParametersInfoA: can't see it at the source code.

CreateFileA: only used to open CONIN$, so always ASCII name.

FillConsoleOutputCharacterA: only used with spaces and other ASCII AFAICT.

GetConsoleTitleA, SetConsoleTitleA: these two indeed seem broken. When opening a non-ASCII file name, the window title is gibberish, both when the console CP is UTF8 (chcp 65001) or otherwise. For reference and comparison, when attaching a UTF8 manifest unicode names do show correctly at the window title.

ScrollConsoleScreenBufferA: only used with (ASCII) space AFAICT.

WriteConsoleA: only used when utf_mode != 2. Not sure when that is exactly, as all other places only test that utf_mode != 0, and I can't find where it's set other than to 0 where it's defined, but it seems to me that it's only used with non-UTF-8 output, while the default output is currently UTF-8 - which does use WriteConsoleW, and has no issue.

EDIT: found it. utf_mode is set to 1 + (GetConsoleOutputCP() != CP_UTF8), and I tested that unicode content shows correctly both when the console CP is UTF-8 and also when it's not, and indeed no issue with WriteConsoleA.

So overall, I think only the console title get/set is broken with unicode file names.

@avih
Copy link
Contributor Author

avih commented Feb 6, 2024

GetConsoleTitleA, SetConsoleTitleA: these two indeed seem broken. When opening a non-ASCII file name, the window title is gibberish

While both are true, they're actually unrelated issues.

Get/Set ConsoleTitleA are only used to save/restore the title on startup/exit, and indeed the restoration may fail if the original title was unicode.

The fact that the console title is gibberish with unicode file names relates to the fact that it's converted to wide char using CP_ACP instead of less_acp (the former is fixed, the latter takes into account successful locale change to UTF-8).

I'll make a PR to fix these issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants