Blocking userfaultfd() kernel-fault handling
A call to userfaultfd() returns a file descriptor that can be used for control over memory management. By making a set of ioctl() calls, a user-space process can take responsibility for handling page faults in specific ranges of its address space. Thereafter, a page fault within that range will generate an event that can be read from the file descriptor; the process can read the event and take whatever action is necessary to resolve the fault. It should then describe that resolution with another ioctl() call (UFFDIO_COPY, for example), after which the faulting code will resume execution.
This facility is normally intended to be used within a multi-threaded process, where one thread takes on the fault-handling task. There are a number of use cases for userfaultfd(); one of the original cases was handling live migration of a process from one machine to another. The process can be moved and restarted on the new system while leaving most of its memory behind; the pages it needs immediately can then be demand-faulted across the net, driven by userfaultfd() events. The result is less downtime while the process is being moved.
Since the kernel waits for a response from the user-space handler to resolve a fault, page faults can cause an indefinite delay in the execution of the affected process. That is always the case, of course; for example, a process generating a fault on memory backed by a file somewhere else on the network will come to an immediate halt for an unknown period of time. There is a difference with userfaultfd(), though: the time it takes to resolve the fault is under the process's direct control.
Normally, there are no problems that can result from that control; the process is simply slowing itself down, after all. But occasionally page faults will be generated in the kernel. Imagine, for example, just about any system call that results in the kernel accessing user-space memory. That can happen as the result of I/O, from a copy_from_user() call, or any of a number of other ways. Whenever the kernel accesses user-space memory, it has to be prepared for the relevant page(s) to not be present; the kernel has to incur and handle a page fault, in other words.
An attacker can take advantage of this behavior to cause execution in the kernel to block at a known point for a period of time that is under said attacker's control. In particular, the attacker can use userfaultfd() to take control of a specific range of memory; they then ensure that none of the pages in that range are resident in RAM. When the attacker makes a system call that tries to access memory in that range, they will get a userfaultfd() event helpfully telling them that the kernel has blocked and is waiting for that page.
Stopping the kernel in this way is useful if one is trying to take advantage of some sort of race condition or other issue. Assume, for example, that an attacker has identified a potential time-of-check-to-time-of-use vulnerability, where the ability to change a value in memory somewhere at the right time could cause the kernel to carry out some ill-advised action. Exploiting such a vulnerability requires hitting the window of time between when the kernel checks a value and when it acts on it; that window can be quite narrow. If the kernel can be made to block while that window is open, though, the attacker suddenly has all the time in the world. That can make a difficult exploit much easier.
Attackers can be deprived of this useful tool by disallowing the handling in user space of faults incurred in kernel space. Simply changing the rules that way would almost certainly break existing code, though, so something else needs to be done. Colascione's patch addresses this problem in two steps, the first of which is to add a new flag (UFFD_USER_MODE_ONLY) for userfaultfd() which states that the resulting file descriptor can only be used for handling faults incurred in user space. Any descriptor created with this flag thus cannot be used for the sorts of attacks described above.
One could try politely asking attackers to add UFFD_USER_MODE_ONLY to their userfaultfd() calls, but we are dealing with people who are not known for their observance of polite requests. So the patch set adds a new sysctl knob, concisely called vm/unprivileged_userfaultfd_user_mode_only, to make the request somewhat less polite; if it is set to one, userfaultfd() calls from unprivileged users will fail if that flag is not provided. At that point, kernel-space fault handling will no longer be available to attackers attempting to gain root access. The default value is zero, though, to avoid breaking existing applications that depend on handling kernel-space faults.
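Assuming the patch lands as described, enabling the restriction would look something like this; note that the knob name comes from the proposed patch and may not exist on any given kernel:

```shell
# Disallow kernel-fault handling for unprivileged userfaultfd() users
# (knob from the proposed patch; requires root).
sysctl -w vm.unprivileged_userfaultfd_user_mode_only=1

# To persist the setting across reboots:
echo 'vm.unprivileged_userfaultfd_user_mode_only = 1' >> /etc/sysctl.conf
```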
The only response to this patch set so far came from Peter Xu, who pointed out that the existing vm/unprivileged_userfaultfd knob could be extended instead. That knob can be used to disallow userfaultfd() entirely for unprivileged processes by setting it to zero, though its default value (one) allows such access. Xu suggested that setting it to two would allow unprivileged use, but for user-space faults only. This approach saves adding a new knob.
Beyond that, the suggested change seems uncontroversial. It's a small patch that has no risk of breaking things for existing users, so there does not appear to be any real reason to keep it out.
| Index entries for this article | |
| --- | --- |
| Kernel | Security/Kernel hardening |
| Kernel | userfaultfd() |
Posted May 8, 2020 22:57 UTC (Fri)
by dvdeug (subscriber, #10998)
[Link] (18 responses)
Posted May 8, 2020 23:25 UTC (Fri)
by Paf (subscriber, #91811)
[Link] (17 responses)
It’s not perfect, but this option is low cost.
Posted May 9, 2020 1:30 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link] (16 responses)
1. You're doing live migrations of VMs.
2. You can dynamically regenerate paged-out data faster than the OS can page it in.
(1) makes very little sense if you control all of the code in the VM, because it's far easier to just use a container instead of a VM, and start/stop instances as required (with all state living in some kind of database-like-thing, or perhaps a networked filesystem, depending on your needs). Sure, this is slightly more upfront design work, but live migration consumes an incredible amount of bandwidth once you try to scale it up, whereas container orchestration is a mature and well-understood technology. Unless you are making money per VM, it's difficult to justify the cost of live migration.
(Granted, if all of your VMs are very similar to one another, you might be able to develop a clever compression algorithm that shaves a lot of bytes off of that cost, but you're still not going to beat containers on size.)
That leaves (2). What's happening in case (2) is that you're using the page fault mechanism as a substitute for some kind of LRU cache for data that is expensive to compute, but cheaper than actually hitting the disk. But you can build an LRU cache in userspace, and it'll probably be a lot more efficient and easier to tune, since you can design it to exactly fit your specific use case. Trying to rope page faults into that problem makes no logical sense.
So, in conclusion, I'd tentatively suggest that distros consider turning the whole feature off and see if anything breaks. Perhaps they should teach their package managers to enable this setting if, and only if, one or more installed packages really need it.
Posted May 9, 2020 1:36 UTC (Sat)
by josh (subscriber, #17465)
[Link]
Posted May 9, 2020 2:00 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link] (6 responses)
Client workflows often can't be interrupted at will and even asking clients nicely to reboot their instances (so they can migrate to other hardware nodes) can take months. It's much easier to involuntarily migrate client VMs to different hardware.
Posted May 9, 2020 4:58 UTC (Sat)
by wahern (subscriber, #37304)
[Link] (3 responses)
Posted May 9, 2020 5:02 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
Live migration is very useful to move client software out of a failing node. So really this makes sense only for large cloud providers.
Posted May 9, 2020 7:52 UTC (Sat)
by wahern (subscriber, #37304)
[Link] (1 responses)
Posted May 9, 2020 15:59 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link]
It actually does behind the scenes with T2 and T3 instances.
Posted May 9, 2020 20:33 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link]
I don't understand how this contradicts anything that I said...
Posted May 13, 2020 8:48 UTC (Wed)
by nilsmeyer (guest, #122604)
[Link]
That is true in a lot of environments, especially when you are dealing with software that manages state. It's easy to say that one can design an application so this isn't necessary (though a lot of the container/cloud-native crowd completely ignores stateful systems), but the reality is very different.
Posted May 9, 2020 5:27 UTC (Sat)
by kccqzy (guest, #121854)
[Link] (1 responses)
Posted May 9, 2020 5:37 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted May 9, 2020 7:59 UTC (Sat)
by Sesse (subscriber, #53779)
[Link] (1 responses)
(de-)compression and view are different layers
Posted May 11, 2020 7:02 UTC (Mon)
by gus3 (guest, #61103)
[Link]
If the user space handles compression, the kernel doesn't care about it at all.
They aren't related.
Posted May 9, 2020 12:58 UTC (Sat)
by roc (subscriber, #30627)
[Link] (1 responses)
We have a giant omniscient database which lets us reconstruct the memory state of a process at any point in its recorded history. Sometimes we want to execute an application function "as if" the process was at some point in that history. So we create a new process, ptrace it, create mappings in it corresponding to the VMAs that existed at that point in history, and enable userfaultfd() for those mappings. Then we set the registers into the right state for the function call and PTRACE_CONT. Every time the process touches a new page, we reconstruct the contents of that page from our database. Works great.
Posted May 9, 2020 13:00 UTC (Sat)
by roc (subscriber, #30627)
[Link]
Posted May 17, 2020 8:54 UTC (Sun)
by smooth1x (guest, #25322)
[Link]
Posted Jun 17, 2020 0:48 UTC (Wed)
by tobin_baker (subscriber, #139557)
[Link]
Posted May 9, 2020 22:33 UTC (Sat)
by meyert (subscriber, #32097)
[Link]