Emulating Windows system calls, take 2
As a reminder, the intent of this work is to enable the running of Windows binaries that call directly into the Windows kernel without going through the Windows API. Those system calls must somehow be trapped and emulated for the program to run correctly; this must be done without modifying the Windows program itself, lest Wine run afoul of the cheat-detection mechanisms built into many of those programs. The previous attempt added a new mmap() flag that would mark regions of the program's address space as unable to make direct system calls. That was coupled with a new seccomp() mode that would trap system calls made from the marked range(s). There were a number of concerns raised about this approach, starting with the fact that using seccomp() might cause some developers to think that it could be used as a security mechanism, which is not the case.
The new patch set has thus moved away from seccomp() and uses prctl() instead, following one of the suggestions that was made in response to the first attempt. Specifically, a program wanting to enable system-call trapping and emulation would make a call to:
prctl(PR_SET_SYSCALL_USER_DISPATCH, operation, start, end, selector);
where operation is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF. In the former case, system-call trapping is enabled; in the latter, it is turned off. The start and end addresses delimit a range of memory from which system calls will not be trapped, even when trapping is enabled; that is the place to put the code that performs the system-call emulation. The selector argument is a pointer to a byte in memory that provides another mechanism to toggle system-call trapping.
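As a rough illustration, the call might be wrapped as in the C sketch below. It is written against the interface as later documented for mainline kernels, not against the patch set described here: in mainline the third and fourth arguments are the offset and length of the exempt region rather than start and end addresses, and the selector byte holds SYSCALL_DISPATCH_FILTER_ALLOW or SYSCALL_DISPATCH_FILTER_BLOCK. The helper names dispatch_enable() and dispatch_disable() are hypothetical.

```c
#include <stddef.h>
#include <sys/prctl.h>

/* Values from the mainline uapi headers; defined here only in case the
 * installed headers predate the feature. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#define PR_SYS_DISPATCH_OFF 0
#define PR_SYS_DISPATCH_ON 1
#endif
#ifndef SYSCALL_DISPATCH_FILTER_ALLOW
#define SYSCALL_DISPATCH_FILTER_ALLOW 0
#define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif

/* Hypothetical helper: exempt the region [region, region + len) from
 * trapping and enable dispatch, with `selector` as the toggle byte.
 * Returns 0 on success, -1 (with errno set) if the kernel refuses. */
int dispatch_enable(void *region, size_t len, volatile char *selector)
{
    return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
                 (unsigned long)region, (unsigned long)len,
                 (unsigned long)selector);
}

/* Hypothetical helper: turn dispatch off again. The remaining
 * arguments are passed as zero. */
int dispatch_disable(void)
{
    return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF,
                 0UL, 0UL, 0UL);
}
```

A caller would normally point the selector at a byte that starts out in the "allow" state, so that nothing traps until the emulation layer is ready. On kernels or architectures without the feature, both helpers simply return -1.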
When this feature is enabled with PR_SYS_DISPATCH_ON, the kernel sets a flag (_TIF_SYSCALL_USER_DISPATCH) in the process's task structure. This flag is tested whenever the process makes a system call. If it is set, the kernel then checks the byte pointed to by the provided selector; if the value stored there is PR_SYS_DISPATCH_OFF, the system call is executed normally. If, instead, that location holds PR_SYS_DISPATCH_ON, the kernel delivers a SIGSYS signal to the process.
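The decision described above can be modeled in a few lines of user-space C. This is only a sketch of the logic, not kernel code: the names on_syscall(), struct dispatch_cfg, SEL_OFF, and SEL_ON are hypothetical, the exempt range is assumed to be half-open, and a missing selector or any selector value other than "off" is assumed to trap.

```c
#include <stdbool.h>
#include <stdint.h>

enum { SEL_OFF = 0, SEL_ON = 1 };            /* hypothetical selector values */
enum outcome { RUN_SYSCALL, SEND_SIGSYS };

struct dispatch_cfg {
    bool enabled;                            /* stands in for the task flag */
    uintptr_t start, end;                    /* exempt range, assumed [start, end) */
    const volatile unsigned char *selector;  /* user-space toggle byte */
};

/* Model of the check made on every system call: caller_ip is the
 * address from which the call was issued. */
enum outcome on_syscall(const struct dispatch_cfg *cfg, uintptr_t caller_ip)
{
    if (!cfg->enabled)
        return RUN_SYSCALL;                  /* feature not turned on */
    if (caller_ip >= cfg->start && caller_ip < cfg->end)
        return RUN_SYSCALL;                  /* call came from the emulator */
    if (cfg->selector && *cfg->selector == SEL_OFF)
        return RUN_SYSCALL;                  /* trapping toggled off */
    return SEND_SIGSYS;                      /* hand the call to user space */
}
```

Note that the selector is read at system-call time, so a change made by the program takes effect on the very next call without any further kernel involvement.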
The signal handler can then examine the register set at the time of the trap; that will indicate which system call was being made and its arguments. That call can be emulated in the handler; once the handler returns, the process will resume after the trapped system call. This handler must be placed in the special system-calls-allowed region of memory or things will not work well, even in the unlikely case that the handler makes no system calls of its own. In Linux, returning from a signal handler involves the special sigreturn() system call, which must be able to execute without trapping (recursively) into the handler.
The special selector variable allows trapping to be quickly enabled and disabled without the need to call into the kernel every time. For a system like Wine, which moves frequently between Windows and Linux-native code, that should result in a measurable performance improvement.
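The point of the selector is that switching modes is a single store to memory. A minimal C sketch of that pattern follows; the names dispatch_selector, enter_foreign_code(), enter_native_code(), call_foreign(), and report_selector() are hypothetical, as are the SEL_OFF and SEL_ON values, and a real emulator would presumably keep one selector per thread.

```c
enum { SEL_OFF = 0, SEL_ON = 1 };   /* hypothetical selector values */

/* The byte whose address would be handed to prctl() as the selector. */
volatile unsigned char dispatch_selector = SEL_OFF;

/* Switching modes is a plain memory write; no system call is made. */
static inline void enter_foreign_code(void) { dispatch_selector = SEL_ON; }
static inline void enter_native_code(void)  { dispatch_selector = SEL_OFF; }

/* Run a foreign entry point with trapping on, then switch back. */
long call_foreign(long (*entry)(long), long arg)
{
    enter_foreign_code();
    long ret = entry(arg);
    enter_native_code();
    return ret;
}

/* Stand-in for foreign code: reports the selector state it runs under. */
long report_selector(long unused)
{
    (void)unused;
    return dispatch_selector;
}
```

Here call_foreign(report_selector, 0) returns SEL_ON, and the selector is back at SEL_OFF once the call returns.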
The initial implementation of this mechanism received a lot of comments on the mailing list. This time, comments are limited to one from Kees Cook, who said that it "looks great". So the way would seem to be clear for this feature to get into the mainline in the relatively near future.
Index entries for this article | |
---|---|
Kernel | System calls |
Posted Jul 17, 2020 16:46 UTC (Fri)
by tnemeth (subscriber, #37648)
[Link] (18 responses)
Posted Jul 17, 2020 17:55 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link] (6 responses)
Posted Jul 17, 2020 18:34 UTC (Fri)
by tnemeth (subscriber, #37648)
[Link] (5 responses)
Even if I'm not a particular fan of being able to run what I think is a bunch of ugly Windows applications, I do recognize that some (like my children) would like to play Windows games and I would much prefer running them under Linux than under Windows itself. I'm clearly afraid of security implications of such a mechanism.
But if we go there, then why not add syscall support for other OSes like MacOS, Haiku, whateverOS...
So it is a, not-full-fledged, hiding its name, part of a skin mechanism. Sorry :) I like clear views, clear APIs... and grumbling a lot.
Posted Jul 17, 2020 19:38 UTC (Fri)
by krisman (subscriber, #102057)
[Link] (4 responses)
Can you forward any specific security concerns that are not already mitigated to the original thread? We definitely want to have a good look at any security implications of this feature. I expect that most issues would be mitigated by making it unable to cross fork/exec boundaries.
>
The goal is to provide an infrastructure for emulation in userspace. This means exactly that we don't need to go adding support for whateverOS in the kernel. :)
> So it is a, not-full-fledged, hiding its name, part of a skin
We try to solve most emulation issues in userspace, unless it really needs to be in the kernel (i.e. the stuff in personality(2)). I'd say to not expect a generic skin interface beyond specific features to solve pain points for userspace emulation.
But, saying we are hiding it is not fair. In fact, I called it a personality mechanism in my first submission, but we dropped that name to avoid confusion.
Posted Jul 17, 2020 21:49 UTC (Fri)
by tnemeth (subscriber, #37648)
[Link] (2 responses)
Of course not. This is just a /personal fear/. I have a profound distrust in anything that runs
> The goal is to provide an infrastructure for emulation in userspace. This means exactly that we
It's, indeed, better out of the kernel. So a fuse-like API for personalities :)
> But, saying we are hiding it is not fair. In fact, I called it a personality mechanism in my first
I'm sorry, I didn't mean to be unfair. I missed that point (I do not follow LKML anymore).
Thank you for clearing my mind :)
Posted Jul 18, 2020 17:45 UTC (Sat)
by smcv (subscriber, #53363)
[Link]
Wine is already not a security mechanism. If you want to run Windows software with lower privilege than your normal login account, you'll need to run Wine in a less-privileged environment using container namespaces, LSMs and/or seccomp (for example a Flatpak, Snap or Docker container), as a separate uid, or in a virtual machine.
Posted Jul 20, 2020 20:03 UTC (Mon)
by plugwash (subscriber, #29694)
[Link]
This leads to the question of whether a "personality" implemented in userland should live in the same process as the foreign code or in a separate process.
There are pros to both approaches.
Pros of same process:
* The performance cost of switching context between processes is avoided.
Pros of separate process:
* The foreign code cannot deliberately or accidentally mess with the emulation code.
Posted Jul 23, 2020 13:51 UTC (Thu)
by nix (subscriber, #2304)
[Link]
Quite. This has no more security implications than a signal handler (to which it is very similar), so as long as it takes the same approach (reset signal handlers on address space reset at exec() time) we should be fine. Sure, if you use this mechanism *wrong* all sorts of things can happen, but the same is true of signal handlers and indeed of general bugs in code. The implications are no worse, except that making a mistake here is more likely to be obvious because it will probably break the program's use of lots of syscalls at once :)
Posted Jul 21, 2020 13:28 UTC (Tue)
by Funcan (subscriber, #44209)
[Link] (10 responses)
Posted Jul 23, 2020 13:53 UTC (Thu)
by nix (subscriber, #2304)
[Link] (2 responses)
Posted Jul 23, 2020 21:02 UTC (Thu)
by BenHutchings (subscriber, #37955)
[Link] (1 responses)
Posted Jul 24, 2020 22:08 UTC (Fri)
by nix (subscriber, #2304)
[Link]
(And a good few of those abstractions were, I vaguely recall, there *already*, and used by LinuxThreads: tgids, for instance. I'm sure futexes were new for NPTL though.)
Posted Aug 11, 2020 15:50 UTC (Tue)
by Ericson2314 (guest, #139248)
[Link] (6 responses)
In general, it's not good to force the use of dynamic solutions to static needs, even though the dynamic one is strictly more expressive. It's good to be static where possible because it better conveys intent, is simpler to reason about, especially w.r.t. security, and can be better optimized.
I think Linux should have this and personalities. For example, there could be a native personality, a Windows personality (via kernel module, let's say), and a way to disjoint-union (sum) personalities via name-spacing (e.g. a tag bit) somewhere in memory or registers. Then Wine can use this in combination with the native+windows union personality, and the trampoline just needs to set the Windows-or-Linux bit, letting the kernel do the rest.
I separately want personalities to revive Capsicum/CloudABI on Linux. And the dynamic mechanism alone is a dubious way to do something as security-critical as that.
Posted Aug 11, 2020 18:15 UTC (Tue)
by farnz (subscriber, #17727)
[Link] (1 responses)
Note that personalities already exist in Linux, as a kernel-side concept.
It looks like the only non-Linux personality that hasn't bitrotted to the point of removal is https://elixir.bootlin.com/linux/latest/source/arch/alpha/kernel/osf_sys.c
Posted Aug 11, 2020 19:50 UTC (Tue)
by Ericson2314 (guest, #139248)
[Link]
Posted Aug 11, 2020 19:06 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
Fully emulating Windows requires much more infrastructure. Windows has completely different synchronization primitives, different IO stack (IOCP!), and a different thread and process model.
You need a LOT of code to emulate all of this. You can put it into the kernel (and there were projects that basically put Wine into kernel mode), but the chances of it getting merged are nil.
Posted Aug 11, 2020 19:50 UTC (Tue)
by Ericson2314 (guest, #139248)
[Link] (1 responses)
Posted Aug 11, 2020 19:51 UTC (Tue)
by Ericson2314 (guest, #139248)
[Link]
Posted Jan 2, 2021 10:50 UTC (Sat)
by ras (subscriber, #33059)
[Link]
> The problem with classic personalities is that they are very shallow. They can be used to adapt syscalls, but little else.
I presume that means the design of personality API forces you to place the code in the kernel. "The code" is something using Linux to emulate the foreign syscall. I can tell you with reasonable authority[0] that's true.
Surprisingly, the code doesn't care where it lives. In fact it mostly doesn't change, even if the code was a kernel module that is moving to user space. I know that because I've done it [1].
I can also tell you from the maintainer of the said code's perspective, user space is the nicer place to be. Far, far, far nicer. To paint the picture, the kernel code I maintained has to do syscalls in some way. Why syscalls? Because they are the only interface in the kernel that's stable, and life's far too short to spend it migrating out-of-tree code to each kernel version. But getting to the syscall table was mission bloody impossible. The mechanism I inherited was unwinding the kernel stack, disassembling return addresses until you found the code that dispatched to you, then extracting the pointer to the syscall table from the previous instruction. (Sorry if I wrecked your meal.) This is of course not easy to port between kernel versions either, but at least you are porting just one obnoxious thing.
Turns out there is only one real challenge to moving it to user space. That is getting the kernel to deliver the syscall to some user space trampoline address. That ends up being one of two challenges. The first kind is when the syscall mechanism is what the Linux kernel uses. In that case it becomes 'redirect any syscall outside of the trampoline range to the trampoline'. This case is covered in the article.
The 2nd kind is when the syscall mechanism is not what the kernel uses. A software interrupt is the obvious way, but the x86 / amd64 designers in particular have displayed commendable imaginative flair in this area. However, in every case I've seen the mechanism is illegal to do in user space, so either the kernel handles it, or it's SIGSEGV. So all you have to do is arrange for that mechanism (whatever it is) to boomerang to an address in the trampoline, with minimal overhead.
Handling the second case doesn't seem hard: a standardised API like syscall_trap(SYSCALL_TRAP_INT80, trampoline_trap, trampoline_begin, trampoline_end, flags), a kernel module for each SYSCALL_TRAP_??? mechanism. Job done.
How the proposed mechanism does this is a bit of a mystery. I guess I should look at the code.
[0] http://ibcs64.sourceforge.net/
Posted Jul 17, 2020 19:42 UTC (Fri)
by TheGopher (subscriber, #59256)
[Link] (17 responses)
Posted Jul 18, 2020 23:00 UTC (Sat)
by clump (subscriber, #27801)
[Link] (15 responses)
I have an Android phone because it uses Linux. I can't say I'm happy with it. Android is a privacy nightmare, has abysmal security, and keeps internals away from the user. Not very Linux-like.
Microsoft could do more with Linux but if the experience turns out like Android, I wouldn't be interested.
Posted Jul 19, 2020 1:24 UTC (Sun)
by ncm (guest, #165)
[Link] (7 responses)
You might start with reducing dependence on Signal, moving to the Matrix/Element end-to-end encrypted message infrastructure, which has proven to be more portable to different platforms.
Maybe pre-order a PinePhone for $200. (Hint: if you do, make sure the phone is the only thing in the shopping cart. Order other stuff separately.)
I have no affiliation with Pine or Purism, besides outstanding orders.
Posted Jul 19, 2020 2:45 UTC (Sun)
by gus3 (guest, #61103)
[Link] (6 responses)
Posted Jul 19, 2020 9:38 UTC (Sun)
by ldearquer (guest, #137451)
[Link] (2 responses)
Posted Jul 19, 2020 23:17 UTC (Sun)
by ncm (guest, #165)
[Link] (1 responses)
Topics do drift. That is not a bug.
Posted Jul 20, 2020 13:12 UTC (Mon)
by clump (subscriber, #27801)
[Link]
Posted Jul 20, 2020 19:52 UTC (Mon)
by Jandar (subscriber, #85683)
[Link] (2 responses)
The parent-post said:
So talking about non-Android Linux phones is very on topic.
Posted Jul 20, 2020 19:53 UTC (Mon)
by zdzichu (subscriber, #17118)
[Link] (1 responses)
Posted Jul 21, 2020 11:33 UTC (Tue)
by Jandar (subscriber, #85683)
[Link]
So both Android and non-Android Linux phones are off topic (regarding the article). But it was not the criticized post that started writing about them; it was the parent post, and reacting to existing topics in a discussion can't be off topic relative to that discussion. It can be off topic regarding the parent article, but calling a direct response to something mentioned in another comment "spamming" is dishonest.
Posted Jul 19, 2020 12:06 UTC (Sun)
by nix (subscriber, #2304)
[Link] (6 responses)
Posted Jul 19, 2020 19:29 UTC (Sun)
by clump (subscriber, #27801)
[Link] (5 responses)
"disable your screen lock"
I can't remove these permissions. I can't audit their use. I can't examine the source code. Not my idea of secure, private, or user-respecting.
If pointing out issues with Google Play Services is too easy, look at "all permissions" for other apps. There is also the matter of Android's exploit track record, and all of the well-documented issues with applications served from the official Google Play Store.
My point is that Android uses Linux but offers none of its benefits.
Posted Jul 20, 2020 12:35 UTC (Mon)
by nim-nim (subscriber, #34454)
[Link]
You cannot understand Android or Chrome security choices if you do not acknowledge that they are extensions of Google IT, and that the user is outside the target security perimeter.
Posted Jul 23, 2020 10:22 UTC (Thu)
by domenpk (guest, #12382)
[Link] (3 responses)
What's the permission system on a common Linux desktop like? Most "apps" run under the same UID and have access to all the data users actually care about, many peripherals, etc.
Posted Jul 23, 2020 13:59 UTC (Thu)
by clump (subscriber, #27801)
[Link] (2 responses)
Android's additional security features are meaningless in practice.
Posted Jul 23, 2020 14:47 UTC (Thu)
by pizza (subscriber, #46)
[Link] (1 responses)
Incorrect.
Android completely isolates applications from each other. One application cannot see/access the data of another.
> However the security features of Android are useless if the user has no choice but to accept all or none of an application's permissions.
That sounds like a problem brought on by using proprietary software, not the underlying permission/security model.
Android's model requires those permissions to be explicitly stated and granted, which is a huge step forward from the free-range model of a traditional desktop environment (Linux or Windows or whatever) -- where applications have carte blanche to do pretty much whatever they want -- including audio, video, networking, and access to every file the user has.
Posted Jul 23, 2020 16:20 UTC (Thu)
by clump (subscriber, #27801)
[Link]
Unless the storage permission is required which makes the external storage a free-for-all. I trust you'd agree that there's plenty of valuable application and user data on the external storage.
Great points about a traditional desktop environment. An exploit or hostile application shouldn't allow the compromise of a user's entire home directory by default. We can and should do better.
Posted Jul 21, 2020 9:36 UTC (Tue)
by rvolgers (guest, #63218)
[Link]
Posted Jul 17, 2020 22:42 UTC (Fri)
by roc (subscriber, #30627)
[Link]
Posted Jul 17, 2020 23:21 UTC (Fri)
by willy (subscriber, #9762)
[Link]
Posted Jul 18, 2020 20:40 UTC (Sat)
by meyert (subscriber, #32097)
[Link]
Posted Jul 19, 2020 10:22 UTC (Sun)
by pebolle (subscriber, #35204)
[Link] (1 responses)
This tripped me up the first time I read it. It's only a few paragraphs later that one learns that it's not "another" mechanism but actually the last step in this implementation of syscall trapping. So something like:
"The selector argument is a pointer to a variable that is used by the kernel to check whether system-calls should actually be trapped."
might be less confusing.
Posted Jul 19, 2020 15:28 UTC (Sun)
by pebolle (subscriber, #35204)
[Link]
Is that optional selector enough to call the interface complicated?
Posted Jul 19, 2020 15:25 UTC (Sun)
by ballombe (subscriber, #9523)
[Link] (1 responses)
Posted Jul 19, 2020 18:17 UTC (Sun)
by quotemstr (subscriber, #45331)
[Link]
Posted Jul 27, 2020 1:18 UTC (Mon)
by jeremy (guest, #95247)
[Link] (3 responses)
I think you meant to say "Linux kernel" here?
Posted Jul 27, 2020 1:29 UTC (Mon)
by mjg59 (subscriber, #23239)
[Link] (2 responses)
Posted Jul 27, 2020 4:18 UTC (Mon)
by jeremy (guest, #95247)
[Link] (1 responses)
> Please do NOT post typos in the article as comments, send them to lwn@lwn.net instead.
That’s a double whammy on me for my first ever LWN comment. Please excuse me while I crawl inside my shame cube...
Posted Jul 31, 2020 8:03 UTC (Fri)
by jezuch (subscriber, #52988)
[Link]
> just one single use case. Xenomai has this kind of feature (native,
> posix, ...). It could be a nice place to start looking for a similar
> architecture and do things "right" (IMHO: I'm not sure if I'm right
> here myself, not knowing the details :) ).
>
> Even if I'm not a particular fan of being able to run what I think is a
> bunch of ugly Windows applications, I do recognize that some (like my
> children) would like to play Windows games and I would much prefer
> running them under Linux than under Windows itself. I'm clearly afraid
> of security implications of such a mechanism.
> But if we go there, then why not add syscall support for other OSes like
> MacOS, Haiku, whateverOS...
add specific support for any platform in the kernel.
> mechanism. Sorry :) I like clear views, clear APIs... and grumbling a
> lot.
>
> Can you forward any specific security concerns that are not already mitigated to the original
> thread? We definitely want to have a good look at any security implications of this feature. I
> expect that most issues would be mitigated by making it unable to cross fork/exec boundaries.
under Windows. Not that I blindly trust software that runs on Linux, but I've seen so much
malware hidden in documents and images, and vulnerabilities even in Windows software made by
"security" teams (the last time I was affected was with Cisco Webex), that I can imagine that
some ways will be explored to gain access to a Linux system through faulty Windows
programs.
> don't need to go adding support for whateverOS in the kernel. :)
> submission, but we dropped that name to avoid confusion.
* The emulation code can easily access data belonging to the foreign code through pointers
* For a foreign platform (like Windows) where the "normal" interface is defined as a library ABI, not a syscall ABI, most calls don't have to go through the emulation process at all.
* The foreign code can use the address space however it needs (Wine has to use some fairly dirty tricks to allow non-relocatable Windows binaries to be loaded in the required location)
* There is no need for a special mechanism to switch back and forth between regular syscall mode and foreign syscall mode.
* The system could potentially be used for security sandboxing as well as foreign code support.
[1] http://ibcs-us.sourceforge.net/
> Android is a privacy nightmare, has abysmal security, and keeps internals away from the user. Not very Linux-like.
"have full network access"
"record audio"
"access location in the background"
"take pictures and videos"
"reroute outgoing calls"